Hadoop + Jackson parsing: ObjectMapper reads Object and then breaks

I am implementing a JSON RecordReader in Hadoop with Jackson.
For now I am testing locally with JUnit + MRUnit.
Each JSON file contains a single object that, after some header fields, has a field whose value is an array of entries, each of which I want to be understood as a Record (so I need to skip those headers).
I am able to do this by advancing the FSDataInputStream up to the point of reading.
In my local testing, I do the following:
fs = FileSystem.get(new Configuration());
in = fs.open(new Path(filename));
long offset = getOffset(in, "HEADER_START_HERE");
in.seek(offset);
where getOffset is a function that positions the InputStream at the byte where the field value starts - which works fine, judging by the value of in.getPos().
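(For illustration only, a function like getOffset might look roughly like this - a simplified, hypothetical sketch, not the actual code; it assumes an ASCII marker and does naive matching that ignores overlapping prefixes:)
static long getOffset(FSDataInputStream in, String marker) throws IOException {
    byte[] m = marker.getBytes("UTF-8");
    int matched = 0;
    int b;
    while ((b = in.read()) != -1) {
        // naive byte-by-byte match against the marker
        matched = (b == m[matched]) ? matched + 1 : ((b == m[0]) ? 1 : 0);
        if (matched == m.length) {
            return in.getPos(); // position just past the marker
        }
    }
    return -1; // marker not found
}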
I am reading the first record by:
ObjectMapper mapper = new ObjectMapper();
JsonNode actualObj = mapper.readValue(in, JsonNode.class);
The first record comes back fine: mapper.writeValueAsString(actualObj) shows it was read correctly and is valid JSON.
So far, so good.
So I try to iterate over the objects by doing:
ObjectMapper mapper = new ObjectMapper();
JsonNode actualObj = null;
do {
    actualObj = mapper.readValue(in, JsonNode.class);
    if (actualObj != null) {
        LOG.info("ELEMENT:\n" + mapper.writeValueAsString(actualObj));
    }
} while (actualObj != null);
And it reads the first one, but then it breaks:
java.lang.NullPointerException: null
at org.apache.hadoop.fs.BufferedFSInputStream.getPos(BufferedFSInputStream.java:54)
at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:57)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:243)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:273)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:225)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:193)
at java.io.DataInputStream.read(DataInputStream.java:132)
at org.codehaus.jackson.impl.ByteSourceBootstrapper.ensureLoaded(ByteSourceBootstrapper.java:340)
at org.codehaus.jackson.impl.ByteSourceBootstrapper.detectEncoding(ByteSourceBootstrapper.java:116)
at org.codehaus.jackson.impl.ByteSourceBootstrapper.constructParser(ByteSourceBootstrapper.java:197)
at org.codehaus.jackson.JsonFactory._createJsonParser(JsonFactory.java:503)
at org.codehaus.jackson.JsonFactory.createJsonParser(JsonFactory.java:365)
at org.codehaus.jackson.map.ObjectMapper.readValue(ObjectMapper.java:1158)
Why is this exception happening?
Does it have to do with reading locally?
Is some kind of reset needed when reusing an ObjectMapper or its underlying stream?

I managed to work around it. In case it helps:
First of all, I'm using the latest Jackson 1.x version.
It seems that once a JsonParser is instantiated with an InputStream, it takes control of it.
So when you call readValue(), it internally calls _readMapAndClose(), which automatically closes the stream once the value has been read.
There is a setting that tells the JsonParser not to close the underlying stream. You can pass it to your JsonFactory like this, before you create your JsonParser:
JsonFactory f = new MappingJsonFactory();
f.configure(JsonParser.Feature.AUTO_CLOSE_SOURCE, false);
Beware you are responsible for closing the stream (FSDataInputStream in my case).
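Putting it together, a minimal sketch of the wiring (Jackson 1.x, matching the question; the mapper is built on top of the configured factory so every parser it creates leaves the stream open):
JsonFactory f = new MappingJsonFactory();
f.configure(JsonParser.Feature.AUTO_CLOSE_SOURCE, false);
ObjectMapper mapper = new ObjectMapper(f);
try {
    JsonNode actualObj = mapper.readValue(in, JsonNode.class);
    // ... read further records with the same mapper ...
} finally {
    in.close(); // closing the FSDataInputStream is now our job
}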
So, answers:
Why is this exception happening?
Because the parser manages the stream, and closes it after readValue().
Does it have to do with reading locally?
No.
Is some kind of reset needed when reusing an ObjectMapper or its underlying stream?
No. What you need to be aware of when mixing the Streaming API with ObjectMapper-style methods is that the mapper/parser may take control of the underlying stream. Refer to the Javadoc of JsonParser and check the documentation of each reading method to find the one that meets your needs.
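For instance, one such mixed pattern in Jackson 1.x is to create a single JsonParser over the stream and pull complete trees from it, instead of bootstrapping a new parser for every readValue() call (a sketch, not tested against the question's exact file layout):
JsonParser jp = new MappingJsonFactory().createJsonParser(in);
while (jp.nextToken() != null) {
    JsonNode node = jp.readValueAsTree(); // consumes one complete root-level object
    LOG.info("ELEMENT:\n" + node.toString());
}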

Related

Checkmarx scan issue - deserialization of unsanitized XML data from the input

I am currently facing an issue with a Checkmarx scan. It is highlighting that we are deserializing untrusted data in the last line mentioned below. How can I rectify this issue?
Scan Issue : Deserialization of Untrusted Data
Note: We do not have any xsd
String message = request.getParameter("param_name"); // Input xml string
XStream parser = new XStream(new StaxDriver());
MyMessage messageObj = (MyMessage) parser.fromXML(message); // This line is flagged by CHECKMARX SCAN
I will assume that you intended to say that you're getting results for Deserialization of Untrusted Data.
The reason you're getting that message is that XStream will happily attempt to create an instance of just about any object specified in the XML by default. The technique is to allow only the types you intend to be deserialized. One would presume you've ensured those types are safe.
I ran this code derived from your example and verified that the two lines I added were detected as sanitization.
import com.thoughtworks.xstream.XStream;
import com.thoughtworks.xstream.io.xml.StaxDriver;
import com.thoughtworks.xstream.security.NoTypePermission;

String message = request.getParameter("param_name");
XStream parser = new XStream(new StaxDriver());
parser.addPermission(NoTypePermission.NONE); // forbid everything by default
parser.allowTypes(new Class[] {MyMessage.class, String.class}); // then allow only known-safe types
MyMessage messageObj = (MyMessage) parser.fromXML(message);
I added the String.class type since I'd presume some of your properties on MyMessage are String. String itself, like most primitives, is generally safe for deserialization. While the string itself is safe, you'll want to make sure how you use it is safe. (e.g. if you are deserializing a string and passing it to the OS as part of a shell exec, that could be a different vulnerability.)

List of HashMap into JSON string with new lines in Java

I have to convert a List into a JSON string with newlines.
Right now the code I am using converts the List of HashMaps into a single JSON string, like below:
List<HashMap<String, String>> mapList = new ArrayList<>();
HashMap<String, String> hashmap = new HashMap<>();
hashmap.put("name", "SO");
hashmap.put("rollNo", "1");
mapList.add(hashmap);
HashMap<String, String> hashmap1 = new HashMap<>();
hashmap1.put("name", "SO1");
hashmap1.put("rollNo", "2");
mapList.add(hashmap1);
Now I am converting it into a JSON string using ObjectMapper, and the output is shown below:
ObjectMapper mapper = new ObjectMapper();
String output = mapper.writeValueAsString(mapList);
Output:
[{"name":"SO","rollNo":1},{"name":"SO1","rollNo":2}]
It's working fine, but I need the output in the format shown below, i.e. for every HashMap there should be a new line in the JSON string.
[{"name":"SO","rollNo":1},
{"name":"SO1","rollNo":2}]
If I understand the question correctly, you can use:
output.replaceAll(",",",\n");
or you can go through each HashMap in the list and call the following on each one (a sketch of this follows the last option below):
mapper.writeValueAsString(map);
or use configuration
ObjectMapper objectMapper = new ObjectMapper();
objectMapper.configure(SerializationConfig.Feature.INDENT_OUTPUT, true);
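For instance, the per-map option could look like this (a sketch reusing mapList and mapper from the question; it produces exactly the requested line-per-map layout):
StringBuilder sb = new StringBuilder("[");
for (int i = 0; i < mapList.size(); i++) {
    sb.append(mapper.writeValueAsString(mapList.get(i)));
    if (i < mapList.size() - 1) {
        sb.append(",\n"); // comma plus newline between maps
    }
}
sb.append("]");
String output = sb.toString();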
I suggest a slightly different path, and that is to use a custom serializer, as outlined here for example.
It boils down to having your own
public static class MgetSerializer extends JsonSerializer<Mget> {
which then works for a List<Mget>, for example.
The point is: I would avoid "mixing" things, as in having a solution where your code writes part of the output and Jackson creates other parts of it. Rather, enable Jackson to do exactly what you want it to do.
Beyond that, I find the whole approach a bit dubious in the first place. JSON does not care about newlines. So if you care how things are formatted, look instead at the tools you use to view your JSON.
Meaning: why waste your time formatting a string that isn't meant for direct human consumption in the first place? Browser consoles show JSON strings in a "folded" way, and any decent editor has similar capabilities these days.
In other words: I think you are investing your energy in the wrong place. JSON is a transport format; you should only worry about the content you want to transmit, not about (essentially meaningless) formatting effects.
You can use String methods to change/replace the output String. However, this is not correct for JSON strings in general, as they may contain commas or other characters that you would have to escape in the String replace methods.
Instead, you should parse the JSON string and iterate over its JsonNodes, as below:
ObjectMapper mapper = new ObjectMapper();
String output = mapper.writeValueAsString(mapList);
JsonNode jsonNode = mapper.readTree(output);
Iterator<JsonNode> iter = jsonNode.iterator();
String result = "[";
while (iter.hasNext()) {
    result += iter.next().toString() + ",\n";
}
result = result.substring(0, result.length() - 2) + "]";
System.out.println(result);
Result:
[{"rollNo":"1","name":"SO"},
{"rollNo":"2","name":"SO1"}]
This approach also works for strings containing characters like commas; for example, consider the input hashmap.put("n,,,ame","SO");
The result is:
[{"n,,,ame":"SO","rollNo":"1"},
{"rollNo":"2","name":"SO1"}]

Parse multiple JSON objects in one file

I have multiple JSON objects stored in one file, separated by newline characters (but one object can span multiple lines) - it's output from the MongoDB shell.
What is the easiest way to parse them (getting them into an array or collection) using Gson and Java?
Another possibility is to use Jackson and its ObjectReader.readValues() methods:
public Iterator<MapObject> readStream(final InputStream _in) throws IOException {
    ObjectMapper mapper = new ObjectMapper();
    // configure object mappings
    ...
    // and then
    return mapper.reader(MapObject.class).readValues(_in);
}
This works pretty well on big (multi-gigabyte) JSON data files.
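Usage is then straightforward, since the iterator lazily reads one top-level value at a time (a sketch; MapObject stands in for your own mapped type):
Iterator<MapObject> it = readStream(new FileInputStream("dump.json"));
while (it.hasNext()) {
    MapObject obj = it.next();
    // process one JSON object at a time; nothing else is held in memory
}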

Alternatives to JSON-object binding in Android application

From my Android application I need to use a RESTful web service that returns a list of objects in JSON format.
This list can be very long (about 1000-2000 objects).
What I need to do is search for and retrieve just some of the objects inside the JSON file.
Due to the limited memory of a mobile device, I was thinking that object binding (using, for example, the GSON library) could be dangerous.
Which are the alternatives for solving this problem?
If you are using Gson, use Gson streaming.
I've added the sample from the link, with my comments inside it:
public List<Message> readJsonStream(InputStream in) throws IOException {
    Gson gson = new Gson();
    JsonReader reader = new JsonReader(new InputStreamReader(in, "UTF-8"));
    List<Message> messages = new ArrayList<Message>();
    reader.beginArray();
    while (reader.hasNext()) {
        Message message = gson.fromJson(reader, Message.class);
        // TODO : write an if statement (someCase is a placeholder for your own filter condition)
        if (someCase) {
            messages.add(message);
            // if you want to use less memory, don't add the objects into an array.
            // write them to disk (i.e. use SQLite, shared preferences or a file...)
            // and retrieve them when you need them.
        }
    }
    reader.endArray();
    reader.close();
    return messages;
}
For example:
1) Read the list as a stream and handle the individual JSON entities on the fly, saving only those that are of interest to you, or
2) Read the data into String objects, then find the JSON entities and handle them one by one instead of all at once. Ways to analyse the String for JSON structures include regular expressions or manual indexOf combined with substring-type analysis (see the sketch after this list).
1) is more efficient but requires a bit more work, since you have to handle the stream at the same time, whereas 2) is probably simpler but requires you to hold quite big Strings in memory temporarily.
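A rough sketch of approach 2), using a hypothetical helper that splits top-level objects by brace depth (it ignores braces inside string literals, so treat it as a starting point only):
static List<String> splitTopLevelObjects(String json) {
    List<String> out = new ArrayList<String>();
    int depth = 0;
    int start = -1;
    for (int i = 0; i < json.length(); i++) {
        char c = json.charAt(i);
        if (c == '{') {
            if (depth++ == 0) start = i; // entering a new top-level object
        } else if (c == '}') {
            if (--depth == 0) out.add(json.substring(start, i + 1));
        }
    }
    return out;
}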

Tika pass parser info during incremental read

I know Tika has a very nice wrapper that lets me get a Reader back from parsing a file, like so:
Reader parsedReader = tika.parse(in);
However, if I use this, I cannot specify the parser I want or the metadata I want to pass in. For example, I would want to pass in extra info like which handler, parser, and context to use, but I can't do that with this method. As far as I know, it's the only one that gives me a Reader instance back so I can read incrementally, instead of getting the entire parsed string back.
Example of things I want to include:
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, fileName); //This aids in the content detection
ParseContext context = new ParseContext();
context.set(Parser.class, parser);
parser.parse(is, handler, metadata, context);
However, calling parse on a parser directly does not return a Reader, and the only other option I have noticed in the docs is to get the fully parsed string back, which might not be great for memory usage. I know I can limit the string that is returned, but I want to stay away from that, as I want the fully parsed info, just incrementally. Best of both worlds - is this possible?
One of the many great things about Apache Tika is that it's open source, so you can see how it works. The class for the Tika facade you're using is here
The key bit of that class for your purposes is this:
public Reader parse(InputStream stream, Metadata metadata) throws IOException {
    ParseContext context = new ParseContext();
    context.set(Parser.class, parser);
    return new ParsingReader(parser, stream, metadata, context);
}
You see there how Tika takes a parser and a stream and turns them into a Reader. Do something similar and you're set. Alternatively, write your own ContentHandler and call the parser directly for full control!
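For example, the "do something similar" part might look like this (a sketch wiring in the parser, metadata, and context from the question; note that ParsingReader builds its own content handler internally, so the BodyContentHandler is not needed here):
Parser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, fileName); // aids content detection
ParseContext context = new ParseContext();
context.set(Parser.class, parser);
// parsing happens incrementally as the Reader is consumed
Reader parsedReader = new ParsingReader(parser, is, metadata, context);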
