Cannot read the first line of a JSON file in Java

I am trying to read some data from a JSON file that I generated from a MongoDB document, but when trying to read the first entry in the document I get an exception:
org.json.JSONException: JSONObject["Uhrzeit"] not found.
This only happens with the first entry; reading other entries does not cause an exception.
Calling jsonObject.getString(...) on any entry that is not the first returns the value as expected.
import com.mongodb.MongoClient;
import com.mongodb.MongoClientURI;
import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import org.json.JSONObject;

// Initialize MongoDB and declare the database and collection
MongoClient mongoClient = new MongoClient(new MongoClientURI("mongodb://localhost:27017"));
MongoDatabase feedbackDb = mongoClient.getDatabase("local");
MongoCollection<Document> feedback = feedbackDb.getCollection("RückmeldungenShort");

// Gets all documents in the collection; "new Document()" is an empty filter that matches every document
FindIterable<Document> documents = feedback.find(new Document());

// Iterates over all documents and converts them to JSONObjects for further use
for (Document doc : documents) {
    JSONObject jsonObject = new JSONObject(doc.toJson());
    System.out.print(jsonObject.toString());
    System.out.print(jsonObject.getString("Uhrzeit"));
}
Printing jsonObject.toString() produces the JSON string for testing purposes (it is printed on a single line; formatted here for readability):
{
    "Ort": "Elsterwerda",
    "Wetter-Keyword": "Anderes",
    "Feedback\r": "Test Gelb\r",
    "Betrag": "Gelb",
    "Datum": "18.05.2018",
    "Abweichung": "",
    "Typ": "Vorhersage",
    "_id": {
        "$oid": "5b33453b75ef3c23f80fc416"
    },
    "Uhrzeit": "05:00"
}
Note that the order in which the entries appear is mixed up; the first entry in the database document was "Uhrzeit".
The JSON file is valid according to https://jsonformatter.curiousconcept.com/ .
The "Uhrzeit" is even recognized within the JSONObject while in debug mode:
I assumed it might have something to do with the entries themselves, so I moved "Datum" and "Ort" to the first place in the document, but that produced the same results.
Plenty of others have posted about this error message, but it seems to me like they all had slightly different problems.

I imported a .csv with my data into MongoDB and read the documents from there. Somewhere in the process of importing the data, "\r" characters were automatically generated where the line breaks were in my .csv (i.e. at the end of each data set), in this case at the key/value pair "Feedback" (as seen in the JSON above).
When checking my output again with another JSON validator, I noticed that there was an "invisible" symbol in my JSON file that caused the key not to be found. This symbol ends up in front of the first key (after the MongoDB id) when a .csv document is imported into my DB. I imported a corrected version of the .csv into my MongoDB and exported it again, and the symbol reappeared.
The problem was that my .csv was in "Windows" format: converting it to "Unix" format gets rid of the generated "\r"s. The "invisible" symbol was the UTF-8 BOM that is added at the beginning of a document; re-saving the .csv as plain UTF-8 (without BOM) gets rid of it.
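The real fix is converting the .csv, but the stray characters can also be stripped defensively on the Java side before parsing. A minimal sketch (the \uFEFF literal is the UTF-8 BOM as it appears once decoded into a Java String):

String cleaned = doc.toJson()
        .replace("\uFEFF", "")   // UTF-8 BOM that slipped into the first field name
        .replace("\r", "");      // carriage returns from the Windows-format .csv
JSONObject jsonObject = new JSONObject(cleaned);
System.out.println(jsonObject.getString("Uhrzeit"));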

Related

How to append to an existing document without replacing it while uploading to Couchbase

I have an array of JSON documents like
[{json1},{json2},...]
In the first step I upload the first JSON document; in each subsequent step I have to fetch that document from Couchbase, append the next one (e.g. json2) to it without replacing it, and upload it again. I have tried using a JSON array, but that is not possible with upsert.
What is the correct way to do this?
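No accepted answer is recorded here, but one common approach is the sub-document API, which appends to an array inside the document without rewriting the whole thing. A sketch only, assuming the Couchbase Java SDK 2.x and a wrapper document of the form {"items": [ ... ]} stored under the id "docId" (all names here are made up for illustration):

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.client.java.document.json.JsonArray;
import com.couchbase.client.java.document.json.JsonObject;

Cluster cluster = CouchbaseCluster.create("localhost");
Bucket bucket = cluster.openBucket("myBucket");

// First step: create the wrapper document holding the first JSON element.
JsonObject first = JsonObject.create().put("name", "json1");
bucket.upsert(JsonDocument.create("docId",
        JsonObject.create().put("items", JsonArray.from(first))));

// Subsequent steps: append further elements without replacing the document.
JsonObject next = JsonObject.create().put("name", "json2");
bucket.mutateIn("docId")
      .arrayAppend("items", next)
      .execute();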

Search with "Hash sign" in Solr

I am new to Solr and facing problems while optimizing search in Solr.
When I search for "C4902AN#140", results containing only "140" are displayed first, and the result containing "C4902AN#140" appears later, i.e. after the results containing "140". But I want the result with "C4902AN#140" to come before the results that only have "140".
Thanks in advance!
You may have to check which tokenizer is used in the field type definition in your schema file.
If the field type uses solr.StandardTokenizerFactory, it will remove the # character.
Alternatively, you could boost the document that contains "C4902AN#140": use the elevate.xml file in the config folder and specify which document should appear first in the result set for a specific search term.
The analyzer you use for this field should use KeywordTokenizerFactory, so that the whole value is not tokenized and only a single token, i.e. the value itself, is generated.
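Combining the two answers, a field type along these lines in schema.xml keeps the whole value as a single token, so "C4902AN#140" is indexed and matched as-is (a sketch only; the names string_exact and part_number are made up for illustration):

<fieldType name="string_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="part_number" type="string_exact" indexed="true" stored="true"/>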

Reading JSON file with BigQuery to make table

I'm new to Google Dataflow and can't get this thing to work with JSON. I've been reading through the documentation but can't solve my problem.
So, following the WordCount example, I figured out how data is loaded from a .csv file with this line:
PCollection<String> input = p.apply(TextIO.Read.from(options.getInputFile()));
where inputFile is a .csv file from my gcloud bucket. I can transform the lines read from the .csv with:
PCollection<TableRow> table = input.apply(ParDo.of(new ExtractParametersFn()));
(ExtractParametersFn is defined by me.) So far so good!
But then I realized my .csv file is too big and had to convert it to JSON (https://cloud.google.com/bigquery/preparing-data-for-bigquery).
Since BigQueryIO is supposedly better for reading JSON, I tried with the following code:
PCollection<TableRow> table = p.apply(BigQueryIO.Read.from(options.getInputFile()));
(inputFile is then a JSON file, and the output when reading with BigQuery should be a PCollection of TableRows.) I tried TextIO too (which returns a PCollection of Strings), and neither of the two IO options works.
What am I missing? The documentation is really not detailed enough to find an answer there, but perhaps some of you have already dealt with this problem before?
Any suggestions would be very appreciated. :)
I believe there are two options to consider:
1. Use TextIO with TableRowJsonCoder to ingest the JSON files (e.g., as is done in the TopWikipediaSessions example); see the sketch below.
2. Import the JSON files into a BigQuery table (https://cloud.google.com/bigquery/loading-data-into-bigquery), and then use BigQueryIO.Read to read from the table.
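A minimal sketch of both options, assuming the pre-Beam Dataflow SDK that the question's code is written against; note that option 1 expects newline-delimited JSON (one object per line), and the table spec in option 2 is a placeholder. p and options are the pipeline and options objects from the question.

// Option 1: read newline-delimited JSON directly, decoding each line into a TableRow.
PCollection<TableRow> table = p.apply(
    TextIO.Read.from(options.getInputFile())
               .withCoder(TableRowJsonCoder.of()));

// Option 2: after loading the JSON into a BigQuery table, read it back with BigQueryIO.
PCollection<TableRow> fromBigQuery = p.apply(
    BigQueryIO.Read.from("my-project:my_dataset.my_table"));  // placeholder table spec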

Store base64 encoded string in HBase

I have a very specific requirement of storing PDF data in HBase columns. The source of the data is MongoDB, from which the base64-encoded data is read, and I need to bulk upload it to an HBase table.
I realized that the base64-encoded string contains a lot of "\n" characters, which split the entire string into parts. Not sure if it is because of this, but when I store the string as is, using a put:
put.add(Bytes.toBytes(ColFamilyName), Bytes.toBytes(columnName), Bytes.toBytes(data.replaceAll("\n","").toString()));
It stores only the first line of the entire encoded string. For example, if the actual content was something like this:
"JVBERi0xLjQKJaqrrK0KNCAwIG9iago8PAovQ3JlYXRvciAoQXBhY2hlIEZPUCBWZXJzaW9uIDEu
" +
"MSkKL1Byb2R1Y2VyIChBcGFjaGUgRk9QIFZlcnNpb24gMS4xKQovQ3JlYXRpb25EYXRlIChEOjIw\n" +
"MTUwODIyMTIxMjM1KzAzJzAwJykKPj4KZW5kb2JqCjUgMCBvYmoKPDwKICAvTiAzCiAgL0xlbmd0\n" +
It stores only the first line, which is:
JVBERi0xLjQKJaqrrK0KNCAwIG9iago8PAovQ3JlYXRvciAoQXBhY2hlIEZPUCBWZXJzaW9uIDEu
in the column. Even after trying to remove the "\n" characters manually, the output is the same.
Could someone please guide me in the right direction here?
Currently, I am also working on Base64 encoding. As per my understanding, you should try using
org.apache.hadoop.hbase.util.Base64.encodeBytes(byte[] source, int option)
method where DONT_BREAK_LINES can be used as an option.
Please let me know if this works fine.
Managed to solve it. The issue was in reading the Base64-encoded data from the MongoDB source. Read the data from the MongoDB document's DBObject as:
jsonObj.get("receiptContent").toString().replaceAll("\n","")
and store it as such in HBase. Even in the Hue HBase UI browser I can now see the PDF content.
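A condensed sketch of that working flow (jsonObj, ColFamilyName and columnName are taken from the question; rowKey and the already-opened HTable named table are assumptions):

// Read the base64 payload from the MongoDB document and strip the line breaks
// the encoder inserted, so the value is stored as one continuous string.
String receipt = jsonObj.get("receiptContent").toString().replaceAll("\n", "");

Put put = new Put(Bytes.toBytes(rowKey));
put.add(Bytes.toBytes(ColFamilyName), Bytes.toBytes(columnName), Bytes.toBytes(receipt));
table.put(put);

// Alternatively, when encoding on the Java side, the HBase-bundled helper mentioned
// above can produce a single-line string in the first place:
// String encoded = Base64.encodeBytes(pdfBytes, Base64.DONT_BREAK_LINES);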

Trailing null (\x00) characters when writing text to Accumulo

I am trying to write the name of a file into Accumulo. I am using accumulo-core-1.43.
For some reason, certain files seem to be written into Accumulo with trailing \x00 characters at the end of the name. The upload comes in through a Java servlet (using the jQuery file upload plugin). In the servlet, I check the name of the file with a System.out.println and it looks normal, and I even tried unescaping the string with
org.apache.commons.lang.StringEscapeUtils.unescapeJava(...);
The actual write to Accumulo looks like this:
Mutation mut = new Mutation(new Text(checkSum));
Value val = new Value(new Text(filename).getBytes());
long timestamp = System.currentTimeMillis();
mut.put(new Text(colFam), new Text(EMPTY_BYTES), timestamp, val);
but nothing unusual showed up there (perhaps \x00 isn't escaped?). But then if I do a scan on my table in Accumulo, there are one or more \x00 characters in the file name.
The problem this seems to cause: when I retrieve a list of files, I return that string within XML (where the characters show up) and pass it back to the browser, and the XSL that is supposed to render the information in the XML no longer works when these extra characters are present (not sure why that is the case either).
In Chrome, in the response for these calls, I see three red dots after the file name, and when I hover over them, \u0 pops up (which I think is a different representation of 0/null?).
Anyway, I'm just trying to figure out why this happens, or at the very least, how I can filter out the \x00 characters before returning the file name in Java. Any ideas?
You are likely using the Hadoop Text class incorrectly -- this is not an error in Accumulo. Specifically, the mistake is in this line of your example:
Value val = new Value(new Text(filename).getBytes());
You must adhere to the length provided by the Text class: getBytes() returns the backing buffer, whose valid contents only run up to getLength(). See the Text javadoc for more information. If you're using Hadoop 2.2.0, you can use the provided copyBytes method on Text. If you're on an older version of Hadoop where this method doesn't exist yet, you can use something like the ByteBuffer class or the System.arraycopy method to get a copy of the byte[] with the proper limits enforced.
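A hedged sketch of that fix: copy only the valid portion of the Text buffer before wrapping it in a Value (copyBytes on Hadoop 2.2.0+, a manual copy otherwise; filename, checkSum, colFam and EMPTY_BYTES are the variables from the question, and java.util.Arrays is assumed to be imported).

Text filenameText = new Text(filename);

// Hadoop 2.2.0 and later: copyBytes() returns a byte[] trimmed to getLength().
Value val = new Value(filenameText.copyBytes());

// Older Hadoop versions: trim the backing buffer manually to avoid trailing \x00 bytes.
// Value val = new Value(Arrays.copyOf(filenameText.getBytes(), filenameText.getLength()));

Mutation mut = new Mutation(new Text(checkSum));
mut.put(new Text(colFam), new Text(EMPTY_BYTES), System.currentTimeMillis(), val);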
