Working with large text snippets in Java source


Are there any good ways to work with blocks of text (Strings) within Java source code? Many other languages have heredoc syntax available to them, but Java does not. This makes it pretty inconvenient to work with things like tag libraries which output a lot of static markup, and unit tests where you need to assert comparisons against blocks of XML.
How do other people work around this? Is it even possible? Or do I just have to put up with it?

If the text is static, or can be parameterized, a possible solution would be to store it in an external file and then load it at runtime. However, this introduces file I/O, which may be unnecessary or have a performance impact, so you would want to cache the file contents to reduce the number of file reads.
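A minimal sketch of that idea, assuming the snippet lives in a UTF-8 file on disk. The class name TextSnippets and its load method are made up for illustration; a classpath resource would work much the same way.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper: reads a text file once and serves later requests from memory. */
final class TextSnippets {
    // Cache keyed by normalized absolute path, so each file is read at most once.
    private static final Map<Path, String> CACHE = new ConcurrentHashMap<>();

    private TextSnippets() {}

    /** Returns the file's contents as UTF-8 text; only the first call touches the disk. */
    static String load(Path file) {
        return CACHE.computeIfAbsent(file.toAbsolutePath().normalize(), p -> {
            try {
                return new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }
}
```

Note that this cache never notices later changes to the file, which is fine for static text but would be wrong for anything edited while the program runs.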

The closest option in Java to heredoc is java.text.MessageFormat.
You cannot embed logic; it is a simple value substitution utility. There are no named variables, only zero-based positional placeholders. Just follow the javadoc.
http://download.oracle.com/javase/1,5.0/docs/api/java/text/MessageFormat.html
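For illustration, a small sketch of MessageFormat filling a block of markup. The class name, method name and template are invented for the example; the arguments are passed as strings so that no locale-specific number formatting kicks in.

```java
import java.text.MessageFormat;

/** Hypothetical example: a multi-line template with zero-based placeholders. */
final class MessageFormatDemo {
    private MessageFormatDemo() {}

    static String userXml(String id, String name) {
        // {0} and {1} are positional placeholders. A literal single quote in the
        // template would have to be written as '' because ' is the escape character.
        String template = "<user id=\"{0}\">\n  <name>{1}</name>\n</user>";
        return MessageFormat.format(template, id, name);
    }

    public static void main(String[] args) {
        System.out.println(userXml("42", "Ann"));
    }
}
```

This still leaves the \n and \" escapes in the source, so it helps with parameterization more than with readability.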

While you could use certain formatters to convert and embed any text file or long literal as a Java string (e.g., with newline breaks, the necessary escapes, etc.), I can't really think of frequent situations where you would need these capabilities.
The trend in software is generally to separate code from the data it operates on. Large text sections, even if meant just for display or comparison, are data, and are thus typically stored externally. The cost of reading a file (or even caching the result in memory) is fairly low. Internationalization is easier. Changing is easier. Version control is easier. Other tools (e.g., spell checkers) can easily be used.
I agree that in the case of unit tests where you want to compare things against a mock you would need large-scale text comparisons. However, when you deal with such large files you will typically have tests that work on several different large inputs to produce several large outputs, so why not just have your test load the appropriate files rather than inline them?
Same goes with XML. In fact, for XML I would argue that in many cases you would want to read the XML and build a DOM tree which you would then compare rather than do a text compare that can be affected by whitespaces. And manually creating an XML tree in your unit test is ugly.
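A rough sketch of the DOM comparison, using only the JDK's built-in parser. XmlCompare and sameXml are made-up names, and dropping whitespace-only text nodes is an assumption that suits data-style XML but not mixed content where whitespace matters.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical test helper: compares two XML strings as trees rather than as text. */
final class XmlCompare {
    private XmlCompare() {}

    static boolean sameXml(String expected, String actual) {
        try {
            return parse(expected).getDocumentElement()
                    .isEqualNode(parse(actual).getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setIgnoringComments(true);
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        doc.normalize();       // merge adjacent text nodes
        stripBlankText(doc);   // assumption: indentation is not significant
        return doc;
    }

    /** Recursively removes text nodes that contain only whitespace. */
    private static void stripBlankText(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else {
                stripBlankText(child);
            }
            child = next;
        }
    }
}
```

With this, differences in indentation or attribute order no longer fail the comparison, while a changed element value still does.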

Related

Fastest way to read a CSV?

I've profiled my application and it seems like one of my biggest bottlenecks at the moment is the String.split method. It's taking up 21% of my runtime, and the other big contributors aren't parts that I can streamline any more than they are. It also seems like all of the newly created String objects are causing issues with the garbage collector, although I'm less clear whether or not that's the case.
I'm reading in a gzipped file of comma-separated values that contains financial data. The number of fields in each row varies depending on what kind of record it is, and the size of each field varies too. What's the fastest way to read the data in, creating the fewest intermediate objects?
I saw this thread but none of the answers give any evidence that OpenCSV is any faster than String.split, and they all seem to focus on using an external library rather than writing new code. I'm also very concerned about memory overhead, because I spend another 20% or so of the total runtime doing garbage collection. I would like to just return views of the string in question, but it looks like that's not possible anymore.

A quicker way is to use just a simple StringTokenizer. It doesn't have the regex overhead of split() and it's in the JDK.
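A minimal sketch of what that looks like (TokenizerSplit and fields are illustrative names). One caveat worth knowing: StringTokenizer treats consecutive delimiters as one, so empty fields silently disappear.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Hypothetical example: splitting one line on commas without regex overhead. */
final class TokenizerSplit {
    private TokenizerSplit() {}

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line, ",");
        while (tokens.hasMoreTokens()) {
            out.add(tokens.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("AAPL,100,12.5"));
        // Note: "a,,b" yields only two tokens, because empty fields are skipped.
        System.out.println(fields("a,,b").size());
    }
}
```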

If you do not want to use a library, then an alternative to StringTokenizer would be to write a simple state machine to parse your CSV. Tokenizers can have problems with commas embedded in fields. CSV is a reasonably simple format, so it is not difficult to build a state machine to handle it. If you know exactly what the format of the input file is, then you can simplify it even further since you will not have to deal with any possibilities not present in your specific file.
Numeric data could potentially be converted directly to int on the fly, without having to hold a large number of strings simultaneously.
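As a sketch of such a state machine, here is a single-line parser with three states. CsvLineParser is a made-up name; the quoting rules assumed here (fields may be wrapped in double quotes, and "" inside a quoted field means a literal quote) are the common convention, not something known about the asker's file. Fields spanning several lines are not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical example: a small state machine that splits one CSV line into fields. */
final class CsvLineParser {
    private enum State { UNQUOTED, QUOTED, QUOTE_IN_QUOTED }

    private CsvLineParser() {}

    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        State state = State.UNQUOTED;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            switch (state) {
                case UNQUOTED:
                    if (c == ',') {
                        fields.add(field.toString());
                        field.setLength(0);
                    } else if (c == '"' && field.length() == 0) {
                        state = State.QUOTED;          // opening quote at field start
                    } else {
                        field.append(c);
                    }
                    break;
                case QUOTED:
                    if (c == '"') {
                        state = State.QUOTE_IN_QUOTED; // closing quote or first half of ""
                    } else {
                        field.append(c);               // commas are plain data here
                    }
                    break;
                case QUOTE_IN_QUOTED:
                    if (c == '"') {
                        field.append('"');             // "" is an escaped quote
                        state = State.QUOTED;
                    } else if (c == ',') {
                        fields.add(field.toString());
                        field.setLength(0);
                        state = State.UNQUOTED;
                    } else {
                        field.append(c);               // lenient: keep stray characters
                        state = State.UNQUOTED;
                    }
                    break;
            }
        }
        fields.add(field.toString());                  // last field has no trailing comma
        return fields;
    }
}
```

To avoid creating strings for numeric columns, the same loop could accumulate digits into an int instead of appending to the StringBuilder, at the cost of knowing the column types up front.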

Use uniVocity-parsers to parse your CSV file. It is a suite of parsers for tabular text formats and its CSV parser is the fastest among all other parsers for Java (as you can see here, and here). Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
We used the architecture provided by this framework to build a custom parser for MySQL dump files for this project. We managed to parse a 42GB dump file in 15 minutes (1+ billion rows).
It should solve your problem.

Should I split a large file before running multiple regexes?

I have an input that's about 35KB of text that I need to pull a bunch of small bits of data from. I use multiple regexes to find the data, and that part works fine.
My question: should I split the large text into multiple smaller strings and run the appropriate regexes on each string, or just keep it in one big string and reset the matcher for each regex? Which way is best for efficiency?

If it isn't running too slow then go with whatever you currently have that is working fast enough.
Otherwise, you shouldn't be using raw regexes for this task anyway. As soon as you mention "multiple regexes" extracting "small bits of data" from a larger text, you are talking about writing a parser and should use a decent parsing tool.
As you are using Java, I would recommend starting with JFlex, which is a mature Java implementation of an extremely mature and stable C tool.
For most tasks JFlex will be all you need, but it also integrates smoothly with a number of Java parser generators should the problem prove to be more complicated. My personal preference is the slightly obscure Beaver.
Of course, if you can implement it as a set of regexes, it isn't more complicated and JFlex will do the job for you.

The Efficiency of Hard-Coding vs. File Input

I'm working on a machine learning project in Java which will involve a very large model (the output of a Support Vector Machine, for those of you familiar with that) that will need to be retrieved fairly frequently for use by the end user. The bulk of the model consists of a large two-dimensional array of fairly small objects.
Unfortunately, I do not know exactly how large the model is going to be (I've been working with benchmark data so far, and the data I'm actually going to be using isn't ready yet), nor do I know the specifications of the machine it will run on, as that is also up in the air.
I already have a method to write the model to a file as a string, but the write process takes a great deal of time and the read process takes the better part of a minute. I'd like to cut down on that time, so I had the either bright or insanely convoluted idea of writing the model to a .java file in such a way that it could be compiled and then run to produce a fully formed model.
My questions to you are, will storing and compiling the model in Java be significantly faster than reading it from the file, under the assumption that the model is about 1 MB in size? And is there some reason I haven't seen yet that this could be a fantastically stupid idea that I should not pursue under any circumstances?
Thank you for any ideas you can give me.
EDIT: apparently trying to automatically write several thousand values into code makes a method that is roughly two orders of magnitude larger than the compiler can handle. Ah well, live and learn.

Instead of writing to a string or to a .java file, you might consider creating a compact binary format for your data.
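One possible shape for such a format, sketched with DataOutputStream and DataInputStream. The question does not say what the array elements are, so this assumes a double[][]; ModelIo and the length-prefixed layout are invented for the example.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical example: a length-prefixed binary layout for a 2D array of doubles. */
final class ModelIo {
    private ModelIo() {}

    /** Layout: row count, then for each row its length followed by its values. */
    static void write(double[][] model, OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(new BufferedOutputStream(out));
        data.writeInt(model.length);
        for (double[] row : model) {
            data.writeInt(row.length);
            for (double value : row) {
                data.writeDouble(value);
            }
        }
        data.flush(); // push buffered bytes through; the caller owns and closes the stream
    }

    static double[][] read(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(new BufferedInputStream(in));
        double[][] model = new double[data.readInt()][];
        for (int r = 0; r < model.length; r++) {
            model[r] = new double[data.readInt()];
            for (int c = 0; c < model[r].length; c++) {
                model[r][c] = data.readDouble();
            }
        }
        return model;
    }
}
```

Each value costs a fixed 8 bytes and needs no text parsing on the way back in, which is usually where a string-based format loses its time.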

"Will storing and compiling the model in Java be significantly faster than reading it from the file?"
That depends on the way you fashion your custom data structure to contain your model.

The question IMHO is whether reading the file takes long because of IO or because of computing time (=> CPU). If the latter is the case then tough luck. If your IO (e.g. hard disc) is the cause then you can compress the file and extract it after/while reading. There is (of course) ZIP support in Java (even for streams).
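A small sketch of the stream-based compression mentioned here, using GZIP from java.util.zip. GzipText and its two methods are illustrative names, and the in-memory byte arrays stand in for the file streams you would use in practice.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical example: compressing text while writing and decompressing while reading. */
final class GzipText {
    private GzipText() {}

    static byte[] compress(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Closing the writer also finishes the GZIP stream, which writes the trailer.
        try (Writer writer = new OutputStreamWriter(new GZIPOutputStream(bytes), StandardCharsets.UTF_8)) {
            writer.write(text);
        }
        return bytes.toByteArray();
    }

    static String decompress(byte[] gzipped) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(gzipped)), StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }
}
```

Whether this helps depends on the data: it trades disk reads for CPU time, so it only pays off when IO really is the bottleneck.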

I agree with the answer given above to use a binary input format. Let's try optimising that first. Can you provide some information? ...or have you googled working with binary data? ...buffering it? etc.?
Writing a .java file and compiling it will be quite interesting... but it is bound to give you issues at some point. However, I think you will find that it will be slightly slower than an optimised binary format, but faster than text-based input.
Also, be very careful about premature optimisation. Usually, "highly configurable" and "blindingly fast" are mutually exclusive. Rather, get everything to work first and then use a profiler to optimise the really slow sections of the application.

Removing Optional Elements from XML when invalid

I have a piece of XML that contains optional non-enumerated elements, so schema validation does not catch invalid values. However, this XML is transformed into a different format after validation and is then handed off to a system that tries to store the information in a database. At this point, some of the values that were optional in the previous format are now coded values in the database that will throw foreign key constraint exceptions if we try to store them. So, I need to build a process in a J2EE app that will check the values at a set of XPaths against the values allowable at those spots and, if they are not valid, remove them, replace them, or remove them and their parents, depending on schema restrictions.
I have a couple of options that will work, but neither of them seems like a very elegant/intuitive solution.
Option #1 would involve doing the work in XSLT 1.0: before sending the XML through the XSLT, query the acceptable values and pass the lists in as parameters, then place tests at the appropriate locations in the stylesheet that compare the incoming value against the acceptable ones and generate the XML accordingly.
This option doesn't seem very reusable, but it'd be very quick to implement.
Option #2 would involve Java code and an XML config file. The XML config file would lay out the XPaths of the needed tests, the acceptable values, the default values (if applicable) and what to take out of the doc if the tests fail.
This option is much more reusable, but would probably double the time needed to build it.
So, which one of these would you pick? Or do you have another idea altogether? I'm open to all suggestions and would love to hear how you would handle this.

Sounds to me like option 2 is over-engineering. Do you have a clear idea about when you will want to reuse this functionality? If not, YAGNI, so go for the simpler and easier solution.

Both options are acceptable. Depending on your skills and the complexity of your XML, I would say that it will require about the same amount of time.
Option 1 would be, in my opinion, more flexible and easier to maintain in the long run.
Option 2 could be tricky in some cases: how do you define the config file itself for complex rules, and how do you parse it without having to write complex code? One could say, "I'll use a dom4j visitor and I'll be done with it." However, option 2 could become unnecessarily complicated, imho, if you deal with a complex XML structure.

I agree here. It felt like it was borderline over-engineering, but I was afraid that someone hearing that this was done would assume that it would be reusable and attempt to design something that used it in the future. However, I have since been reassured that this is a one-time deal and thus, will be going with the xslt approach.
Thanks all for your comments/answers!

Searching for regex patterns on a 30GB XML dataset. Making use of 16gb of memory

I currently have a Java SAX parser that is extracting some info from a 30GB XML file.
Presently it is:
1. reading each XML node,
2. storing it into a String object,
3. running some regexes on the string,
4. storing the results to the database,
for several million elements. I'm running this on a computer with 16GB of memory, but the memory is not being fully utilized.
Is there a simple way to dynamically 'buffer' about 10gb worth of data from the input file?
I suspect I could manually write a 'producer'/'consumer' multithreaded version of this (loading the objects on one side, using them and discarding on the other), but damnit, XML is ancient now, are there no efficient libraries to crunch 'em?

Just to cover the bases, is Java able to use your 16GB? You (obviously) need to be on a 64-bit OS, and you need to run Java with -d64 -Xmx10g (or however much memory you want to allocate to it).
It is highly unlikely memory is a limiting factor for what you're doing, so you really shouldn't see it fully utilized. You should be either IO or CPU bound. Most likely, it'll be IO. If it is IO, make sure you're buffering your streams, and then you're pretty much done; the only other thing you can do is buy a faster hard drive.
If you really are CPU-bound, it's possible that you're bottlenecking at regex rather than XML parsing.
See this (which references this)
If your bottleneck is at SAX, you can try other implementations. Off the top of my head, I can think of the following alternatives:
StAX (there are multiple implementations; Woodstox is one of the fastest)
Javolution
Roll your own using JFlex
Roll your own ad hoc, e.g. using regex
For the last two, the more constrained your XML subset is, the more efficient you can make it.
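To give a feel for the StAX option from the list above, here is a sketch using the javax.xml.stream API that ships with the JDK (Woodstox plugs in behind the same interfaces). StaxExtract and textOf are made-up names. It collects results into a list only to keep the example short; with 30GB of input you would process each value as it arrives.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical example: pulling the text of selected elements with a StAX cursor. */
final class StaxExtract {
    private StaxExtract() {}

    /** Assumes the named elements contain only text, no child elements. */
    static List<String> textOf(Reader xml, String elementName) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        List<String> values = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    // getElementText() reads up to the matching end tag.
                    values.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return values;
    }
}
```

The difference from SAX is that your code pulls events in a loop instead of receiving callbacks, which tends to make the extraction logic easier to follow.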
It's very hard to say, but as others mentioned, an XML-native database might be a good alternative for you. I have limited experience with those, but I know that at least Berkeley DB XML supports XPath-based indices.

First, try to find out what's slowing you down.
How much faster is the parser when you parse from memory?
Does using a BufferedInputStream with a large size help?
Is it easy to split up the XML file? In general, shuffling through 30 GiB of any kind of data will take some time, since you have to load it from the hard drive first, so you are always limited by the speed of this. Can you distribute the load to several machines, maybe by using something like Hadoop?

No Java experience, sorry, but maybe you should change the parser? SAX should work sequentially and there should be no need to buffer most of the file ...

SAX is, essentially, "event driven", so the only state you should be holding on to from element to element is state that is relevant to that element, rather than the document as a whole. What other state are you maintaining, and why? As each "complete" node (or set of nodes) comes by, you should be discarding them.
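A sketch of a handler that keeps only per-element state, with the input wrapped in a large BufferedInputStream. ElementTextHandler, the sink callback and the 1 MB buffer size are all assumptions made for the example, and nested elements with the same name are not handled.

```java
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical example: a SAX handler whose only state is the text of the current element. */
final class ElementTextHandler extends DefaultHandler {
    private final String elementName;
    private final Consumer<String> sink;
    private final StringBuilder text = new StringBuilder();
    private boolean inside;

    ElementTextHandler(String elementName, Consumer<String> sink) {
        this.elementName = elementName;
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (elementName.equals(qName)) {
            inside = true;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inside) {
            text.append(ch, start, length); // may be called several times per element
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (elementName.equals(qName)) {
            inside = false;
            sink.accept(text.toString()); // hand the value off (regex, database, ...)
            text.setLength(0);            // and keep nothing around afterwards
        }
    }

    static void parse(InputStream in, String elementName, Consumer<String> sink) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new BufferedInputStream(in, 1 << 20), new ElementTextHandler(elementName, sink));
    }
}
```

Memory use stays flat no matter how large the file is, because each value is handed to the sink and then dropped.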

I don't really understand what you're trying to do with this huge amount of XML, but I get the impression that
using XML was wrong for the data stored
you are buffering way beyond what you should do (and you are giving up all advantages of SAX parsing by doing so)
Apart from that: XML is not ancient, and it is in massive and active use. What do you think all those interactive web sites are using for their interactive elements?

Are you being slowed down by multiple small commits to your db? It sounds like your program would be writing to the db almost all the time, so making sure you don't commit too often could improve performance. Preparing your statements and other standard bulk-processing tricks could possibly help as well.
Other than this early comment, we need more info: do you have a profiler handy that can show what makes things run slowly?

You can use the Jibx library, and bind your XML "nodes" to objects that represent them. You can even overload an ArrayList, then when x number of objects are added, perform the regexes all at once (presumably using the method on your object that performs this logic) and then save them to the database, before allowing the "add" method to finish once again.
Jibx is hosted on SourceForge: Jibx
To elaborate: you can bind your XML as a "collection" of these specialized String holders. Because you define this as a collection, you must choose what collection type to use. You can then specify your own ArrayList implementation.
Override the add method as follows (ArrayList.add returns boolean):
    @Override
    public boolean add(Object o) {
        boolean added = super.add(o);
        // Flush to the database once the buffer grows past the threshold.
        if (size() > YOUR_DEFINED_THRESHOLD) {
            flushObjects();
        }
        return added;
    }
YOUR_DEFINED_THRESHOLD is how many objects you want to store in the ArrayList until it has to be flushed out to the database. flushObjects() is simply the method that will perform this logic (and, presumably, clear the list afterwards). The method will block the addition of objects from the XML file until this process is complete. However, this is ok, since the overhead of the database will probably be much greater than file reading and parsing anyway.
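For completeness, a self-contained sketch of such a list, independent of JiBX. FlushingList and the sink callback are made-up names standing in for the regex and database logic, and this version flushes as soon as the threshold is reached rather than after it is exceeded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical example: an ArrayList that hands its contents off in batches. */
final class FlushingList<T> extends ArrayList<T> {
    private final int threshold;
    private final Consumer<List<T>> sink;

    FlushingList(int threshold, Consumer<List<T>> sink) {
        this.threshold = threshold;
        this.sink = sink;
    }

    @Override
    public boolean add(T item) {
        boolean added = super.add(item);
        if (size() >= threshold) {
            flush(); // blocks the caller until the batch has been handled
        }
        return added;
    }

    /** Passes a copy of the buffered items to the sink, then empties the buffer. */
    void flush() {
        if (isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(this));
        clear();
    }
}
```

Remember to call flush() once more after parsing ends, otherwise the last partial batch never reaches the database.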

I would suggest first importing your massive XML file into a native XML database (such as eXist if you are looking for open source stuff; I have never tested it myself), and then performing iterative paged queries to process your data in small chunks at a time.

You may want to try Stax instead of SAX, I hear it's better for that sort of thing (I haven't used it myself).

If the data in the XML is order independent, can you multi-thread the process to split the file up or run multiple processes starting in different locations in the file? If you're not I/O bound that should help speed it along.
