Hey fellow programmers,
I am working on a Java application where speed is key. I need to deal with a stream of JSON: requests to the server return a JSON object that I continuously parse to analyze later on. The JSON object is about 2000 characters long, so I was wondering whether it would be quicker to just treat it as a string (using indexOf, substring, etc.) instead of using a JSON parser (I used both Jackson and Json-lib without a noticeable difference). Will it save me a couple of milliseconds?
Thank you !
It depends on what you need to know from it, but in general I think it's better to use a JSON parser. The parser will be highly optimized, so it will beat your own attempts if you need to read many values. Also, the parser will ignore whitespace, while you would have to take care of it explicitly.
Checking something yourself is harder than you think. For instance, if you need to know whether a property 'x' exists, you cannot just check for the existence of the string x, because it can also be part of a value. You cannot look for x: either, because there may be a space between them. And if you do find x, do you know whether it is in the right place? Is it part of the right object, or of a sub-object you didn't expect to be there at all?
Before you know it, you are writing a parser yourself.
If you can't notice the difference, don't bother and use the parser, because it is the easiest, safest and most flexible choice. Only start optimizing if you need to.
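If single-value lookups are the common case, a streaming parser avoids building a full object tree while still handling whitespace, escaping and nesting correctly. A minimal sketch with Jackson's streaming API (the field name "status" is invented for the example):

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

public class StreamingRead {

    // Returns the value of the first field named "status" (hypothetical name).
    // Note: this matches the field at any depth; track nesting depth
    // if you only want the top-level object.
    static String readStatus(String json) throws Exception {
        JsonFactory factory = new JsonFactory();
        try (JsonParser parser = factory.createParser(json)) {
            while (parser.nextToken() != null) {
                if (parser.getCurrentToken() == JsonToken.FIELD_NAME
                        && "status".equals(parser.getCurrentName())) {
                    parser.nextToken(); // advance from the field name to its value
                    return parser.getText();
                }
            }
        }
        return null; // field not present
    }
}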
Related
I am sending a big array of data. Which is more optimized: concatenating the data with a symbol, or sending it as a JSONArray?
Data is being sent from an Android client to Apache/PHP.
Example of concatenated data:
data1_data2_data3_data4
Example of a JSONArray:
{ "Data": [data1, data2, data3, data4] }
It completely depends on your use case. From your example, here are some thoughts:
In terms of bytes sent, the concatenation is slightly better, as JSON adds some metadata and symbols.
In terms of ease of use, JSON clearly wins, as there are libraries and standards. If you just have plain data without any _, concatenated data is OK. But what happens if one of your values contains a _? You will need to escape it and keep track of your custom format throughout your code... (and that's just the tip of the iceberg).
In general, my advice is: use standard data serialization schemes, always. In case the size of the serialized data is a concern, have a look at binary standards (for example protobuf).
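To make the escaping problem concrete, here is a minimal sketch using org.json (which ships with Android); the values are invented:

import org.json.JSONArray;

public class PayloadDemo {
    public static void main(String[] args) {
        String[] data = { "data1", "data_2", "data3" }; // note the _ inside "data_2"

        // Custom format: ambiguous as soon as a value contains the delimiter.
        // The receiver sees "data1_data_2_data3" and splits it into four fields.
        String concatenated = String.join("_", data);

        // JSON: the library handles quoting and escaping for you,
        // and the receiver can parse it unambiguously (json_decode in PHP).
        JSONArray array = new JSONArray();
        for (String d : data) {
            array.put(d);
        }

        System.out.println(concatenated);     // data1_data_2_data3
        System.out.println(array.toString()); // ["data1","data_2","data3"]
    }
}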
It doesn't really matter: if you're asking about optimizing the transfer size in bytes, the difference is minimal.
However, the concatenated-data example that you gave will require more processing on the recipient's side, as your script will have to split the sent data on the symbol and parse it into a usable object.
So it is best to stick with the usual JSON object, as I don't think you will gain any optimization this way.
Depends on what you mean by optimization.
Realistically speaking, even if you were to parse it with a custom-made function/class vs. some built-in function (like json_decode in PHP), the time difference would be minimal or irrelevant.
If you can stick to the standard, then do it. Send it as proper JSON, not some weirdly concatenated string.
The advantages outweigh anything else.
Concatenating the data will be slightly more optimized, but you want to make sure your data does not contain "_", or else handle the delimiter properly.
I have a question which makes me think about how to improve the speed and memory usage of a system.
I will describe it by example. I have a file which contains some strings:
<e>Customer</e>
<a1>Customer Id</a1>
<a2>Customer Name</a2>
<e>Person</e>
It is similar to an XML file.
Now, my solution is: when I read <e>Customer</e>, I scan from there to the nearest tag, and then take the substring from <e>Customer</e> up to that nearest tag.
This makes the system do a lot of processing. I used only regular expressions to do it. I thought I might do the same as a real compiler, which has several phases (lexical analysis, parsing).
Any ideas?
Thanks in advance!
If you really don't want to use one of the free and reliable XML parsers, then a truly fast solution will almost certainly involve a state machine.
See this question, How to create a simple state machine in java, for a good start.
Please be sure to have a very good reason for taking this route.
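For a sense of the shape such a scanner takes, here is a minimal sketch of a hand-rolled state machine that splits input like yours into tags and text (the states and output format are invented):

public class TagScanner {

    private enum State { TEXT, IN_TAG }

    public static void scan(String input) {
        State state = State.TEXT;
        StringBuilder buffer = new StringBuilder();
        for (char c : input.toCharArray()) {
            switch (state) {
                case TEXT:
                    if (c == '<') {
                        if (buffer.length() > 0) {
                            System.out.println("text: " + buffer);
                            buffer.setLength(0);
                        }
                        state = State.IN_TAG;
                    } else {
                        buffer.append(c);
                    }
                    break;
                case IN_TAG:
                    if (c == '>') {
                        System.out.println("tag:  " + buffer);
                        buffer.setLength(0);
                        state = State.TEXT;
                    } else {
                        buffer.append(c);
                    }
                    break;
            }
        }
    }
}

This is a single pass over the input with no backtracking, which is where the speed comes from; but note how much of a real parser (attributes, nesting, error handling) is still missing.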
Regular expressions are not the right tool for parsing complex structures like this. Since your file looks a lot like XML, it may make sense to add what's missing to make it XML (i.e. the header), and feed the result to an XML parser.
XML parsers are optimized for processing large volumes of data quickly (especially the SAX kind). You should see a significant improvement in performance if you switch from processing large volumes of text with regular expressions to parsing XML.
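As an illustration, a minimal SAX handler for input like yours could look like this (the fragment is wrapped in an invented root element to make it well-formed):

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxDemo {
    public static void main(String[] args) throws Exception {
        // The fragment from the question, wrapped in a root element
        String xml = "<root><e>Customer</e><a1>Customer Id</a1>"
                + "<a2>Customer Name</a2><e>Person</e></root>";

        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                            String qName, Attributes attributes) {
                        System.out.println("element: " + qName);
                    }

                    @Override
                    public void characters(char[] ch, int start, int length) {
                        System.out.println("text:    " + new String(ch, start, length));
                    }
                });
    }
}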
Just don't invest the time in writing an XML lexer/parser yourself (it's not worth it); use what is already out there.
For example, http://www.mkyong.com/tutorials/java-xml-tutorials/ is a good tutorial; just use Google.
I am working with some legacy code where I am using a column value and its earlier history to make some comparisons and highlight the differences. However, the column is stored in an arbitrary delimited fashion, and the code around the comparison is, well, very hard to understand.
My initial thought was to refactor the code - but when I later thought about it, I wondered why not fix the original source of the issue, which in this case is the data. I have some limitations around the table structure, so converting this single column into multiple columns is not an option.
With that, I thought about whether it's worth converting this arbitrarily delimited data into a standardized format like JSON. My approach would be to export the data to a file, apply some regular expressions to convert it to JSON, and then re-import it.
I wanted to check with this group whether I am approaching this problem the right way, and whether there are other ideas I should consider. Is JSON the right format in the first place? I'd like to know how you approached a similar problem. Thanks!
However, the column is stored in an arbitrary delimited fashion and the
code around comparison is, well, very hard to understand.
So it seems the problem is the maintainability/evolvability of the code when adding or modifying functionality.
My initial thought was to refactor the code - but when I later thought
about it, I thought why not fix the original source of the issue which
is the data in this case.
My opinion is: why? I think your initial thought was right. You should refactor (i.e. "...restructuring an existing body of code, altering its internal structure without changing its external behavior") to gain what you want. If you change the data format, you could still end up with messy code on top of the new, cooler data format.
In other words, you can have all the combinations, because the two are not strictly related: you can have a cool data format with messy code or with fantastic code, and you can have an ugly data format with fantastic code or with messy code.
It seems you are in the last case, but moving from there to "cool data format + messy code" and then to "cool data format + fantastic code" is less straightforward than moving to "fantastic code + ugly data format" and then, optionally, to "cool data format + fantastic code".
So my opinion is: aim at your goal in a straightforward way:
1) Write tests (if you haven't) for the current functionality and verify them against the current code (the first step of any refactoring); a minimal example follows after this list.
2) Change the code, driven by the tests written in step 1.
3) If you still want to, change the data format, always guided by your tests.
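A characterization test for the existing comparison could be as small as this (JUnit 4; LegacyComparator, diff() and the delimiter are placeholders for whatever your legacy code actually exposes):

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class LegacyComparisonTest {

    // Pins down the current behavior of the legacy comparison before any
    // refactoring starts; the expected string is whatever the code produces today.
    @Test
    public void highlightsChangedField() {
        String before = "john|smith|london";
        String after = "john|smith|paris";
        assertEquals("london -> paris", new LegacyComparator().diff(before, after));
    }
}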
Regards
I have written an application which outputs data as XML. However, it would be nice to allow the user to completely customize the output format so they can more easily integrate it into their applications.
What would be the best way to approach this problem? My initial thoughts are to define a grammar and write a parser from the ground up.
Are there any free Java libraries that can assist in parsing custom scripting(formatting?) languages?
Since I already have the XML, would it be a better approach to just 'convert' this with a search & replace algorithm?
I should specify here that 'users' are other programmers so defining a simple language would be fine, and that the output is potentially recursive (imagine outputting the contents of a directory to XML).
Just looking for general advice in this area before I set off down the wrong track.
EDIT: To clarify... My situation is a bit unique. The application outputs coordinates and other data to be loaded into a game engine. Everybody seems to use a different, completely custom format in their own engine. Most people do not want to implement a JSON parser and would rather use what they already have working. In other words, it is in the interests of my users to have full control over the output, asking them to implement a different parser is not an option.
Have you considered just using a templating engine like Velocity or FreeMarker?
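That way each user controls the format by editing a template instead of your code. A sketch with FreeMarker (the template name, location and model fields are invented):

import java.io.StringWriter;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import freemarker.template.Configuration;
import freemarker.template.Template;

public class TemplateDemo {
    public static void main(String[] args) throws Exception {
        Configuration cfg = new Configuration(Configuration.VERSION_2_3_31);
        cfg.setClassForTemplateLoading(TemplateDemo.class, "/templates");

        // "coords.ftl" is a user-supplied template, e.g.:
        // <#list points as p>${p}<#sep>;</#list>
        Template template = cfg.getTemplate("coords.ftl");

        Map<String, Object> model = new HashMap<>();
        model.put("points", List.of("1,2", "3,4"));

        StringWriter out = new StringWriter();
        template.process(model, out);
        System.out.println(out); // 1,2;3,4
    }
}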
I would create a result bean as a POJO.
Then I would have different classes working on the result bean. That way you can easily extend it with new formats if needed.
E.g.:
Result result = logic.getResult();
XMLOutputter.output(result, "myXMLFile.xml");
Format1Outputter.output(result, "myFormat1File.fo1");
Format2Outputter.output(result, "myFormat2File.fo2");
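Making the family of outputters explicit with an interface keeps the extension point obvious; a sketch reusing the invented names from above:

import java.io.IOException;

// One implementation per target format; a new format is just a new class.
interface ResultOutputter {
    void output(Result result, String fileName) throws IOException;
}

class XMLOutputter implements ResultOutputter {
    @Override
    public void output(Result result, String fileName) throws IOException {
        // serialize the result bean as XML and write it to fileName
    }
}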
If you are planning to provide this as an API to multiple parties, I would advise against allowing over-customization; it will add unnecessary complexity to your product and provide just one more place for bugs to be introduced.
Second, it will increase the complexity of your documentation and, as a side effect, will likely cause your documentation to fall out of sync with the API in general.
The biggest thing I would suggest considering, in terms of making your stream easier to digest, is making the output available in JSON format, which just about every modern language has good support for (I use Gson for Java, myself).
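With Gson the JSON side is essentially a one-liner; the bean here is invented for the example:

import com.google.gson.Gson;

public class GsonDemo {

    // Hypothetical bean representing one record of the output
    static class Point {
        int x = 1;
        int y = 2;
    }

    public static void main(String[] args) {
        String json = new Gson().toJson(new Point());
        System.out.println(json); // {"x":1,"y":2}
    }
}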
I have a piece of XML that contains optional, non-enumerated elements, so schema validation does not catch invalid values. However, this XML is transformed into a different format after validation and is then handed off to a system that tries to store the information in a database. At this point, some of the values that were optional in the previous format are now coded values in the database, and they will cause a foreign key constraint exception if we try to store them. So, I need to build a process in a J2EE app that will check the values at a set of XPaths against the set of values that are allowable at those spots, and if they are not valid, either remove them, replace them, or remove them and their parents, depending on schema restrictions.
I have a couple options that will work, but neither of them seem like very elegant/intuitive solutions.
Option #1 would involve doing the work in an XSLT 1.0 stylesheet: before sending the XML through the XSLT, query up the acceptable values and send the lists in as parameters. Then place tests at the appropriate locations in the stylesheet that compare the incoming value against the acceptable ones and generate the XML accordingly.
This option doesn't seem very reusable, but it'd be very quick to implement.
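For what it's worth, wiring the parameters in from Java is standard JAXP; a sketch (the stylesheet name, parameter name and values are invented):

import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class FilterTransform {
    public static void main(String[] args) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("filter.xsl")));

        // Acceptable values queried from the database beforehand, passed as a
        // delimited string since XSLT 1.0 has no sequence parameters
        transformer.setParameter("allowedCodes", "A|B|C");

        transformer.transform(new StreamSource(new File("input.xml")),
                new StreamResult(new File("output.xml")));
    }
}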
Option #2 would involve Java code and an XML config file. The config file would lay out the XPaths of the needed tests, the acceptable values, the default values (if applicable), and what to take out of the document if the tests fail.
This option is much more reusable, but would probably double the time needed to build it.
So, which one of these would you pick? Or do you have another idea altogether? I'm open to all suggestions and would love to hear how you would handle this.
Sounds to me like option 2 is over-engineering. Do you have a clear idea of when you would want to reuse this functionality? If not: YAGNI, so go for the simpler and easier solution.
Both options are acceptable. Depending on your skills and the complexity of your XML, I would say that it will require about the same amount of time.
Option 1 would, in my opinion, be more flexible and easier to maintain in the long run.
Option 2 could be tricky in some cases: how do you define the config file itself for complex rules, and how do you parse it without having to write complex code? One could say "I'll use a dom4j visitor and I'll be done with it"; however, option 2 could become unnecessarily complicated, IMHO, if you deal with a complex XML structure.
I agree here. It felt borderline over-engineering, but I was afraid that someone hearing that this had been done would assume it was reusable and attempt to design something that used it in the future. However, I have since been reassured that this is a one-time deal, and thus I will be going with the XSLT approach.
Thanks all for your comments/answers!