Fastest way to read a CSV? - java

I've profiled my application and it seems like one of my biggest bottlenecks at the moment is the String.split method. It's taking up 21% of my runtime, and the other big contributors aren't parts that I can streamline any more than they already are. It also seems like all of the newly-created String objects are causing issues with the garbage collector, although I'm less clear whether or not that's the case.
I'm reading in a gzipped file of comma-separated values containing financial data. The number of fields in each row varies depending on what kind of record it is, and the size of each field varies too. What's the fastest way to read the data in, creating the fewest intermediate objects?
I saw this thread but none of the answers give any evidence that OpenCSV is any faster than String.split, and they all seem to focus on using an external library rather than writing new code. I'm also very concerned about memory overhead, because I spend another 20% or so of the total runtime doing garbage collection. I would like to just return views of the string in question, but it looks like that's not possible anymore.

A quicker way is to use just a simple StringTokenizer. It doesn't have the regex overhead of split() and it's in the JDK.
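For example, a minimal sketch of tokenizing one CSV line this way (note that, like split(), this does not handle commas inside quoted fields):

import java.util.StringTokenizer;

public class TokenizerExample {
    public static void main(String[] args) {
        String line = "2011-01-05,IBM,147.50,1000000"; // made-up sample row
        StringTokenizer st = new StringTokenizer(line, ",");
        while (st.hasMoreTokens()) {
            System.out.println(st.nextToken());
        }
    }
}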

If you do not want to use a library, then an alternative to StringTokenizer would be to write a simple state machine to parse your CSV. Tokenizers can have problems with commas embedded in fields. CSV is a reasonably simple format, so it is not difficult to build a state machine to handle it. If you know exactly what the format of the input file is, then you can simplify it even further since you will not have to deal with any possibilities not present in your specific file.
Numeric data could potentially be converted directly to int on the fly, without having to hold a large number of strings simultaneously.
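A rough sketch of the kind of hand-rolled state machine described here, assuming a simplified dialect (fields separated by commas, optionally wrapped in double quotes, no escaped quotes inside quoted fields); handleField is a hypothetical callback where numeric fields could be converted to int without creating intermediate strings:

public class SimpleCsvStateMachine {

    public void parseLine(String line) {
        boolean inQuotes = false;
        int fieldStart = 0;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                handleField(line, fieldStart, i);
                fieldStart = i + 1;
            }
        }
        handleField(line, fieldStart, line.length());
    }

    // Hypothetical callback: consume the field without allocating a new String.
    protected void handleField(String line, int start, int end) {
        // e.g. parse digits in place:
        // int value = 0;
        // for (int i = start; i < end; i++) value = value * 10 + (line.charAt(i) - '0');
    }
}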

Use uniVocity-parsers to parse your CSV file. It is a suite of parsers for tabular text formats, and its CSV parser is the fastest among CSV parsers for Java (as you can see here, and here). Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
We used the architecture provided by this framework to build a custom parser for MySQL dump files for this project. We managed to parse a 42GB dump file in 15 minutes (1+ billion rows).
It should solve your problem.
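A sketch of typical usage, based on the library's CsvParser/CsvParserSettings API (double-check against the docs for the version you use); the gzipped input from the question is wrapped in a GZIPInputStream:

import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import java.io.*;
import java.util.zip.GZIPInputStream;

public class UnivocityExample {
    public static void main(String[] args) throws IOException {
        CsvParserSettings settings = new CsvParserSettings();
        CsvParser parser = new CsvParser(settings);
        try (Reader reader = new InputStreamReader(
                new GZIPInputStream(new FileInputStream("data.csv.gz")))) { // placeholder file name
            parser.beginParsing(reader);
            String[] row;
            while ((row = parser.parseNext()) != null) {
                // process the fields of the row here
            }
        }
    }
}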

Related

Should I split a large file before running multiple regexes?

I have an input that's about 35KB of text that I need to pull a bunch of small bits of data from. I use multiple regexes to find the data, and that part works fine.
My question: should I split the large text into multiple smaller strings and run the appropriate regexes on each string, or just keep it in one big string and reset the matcher for each regex? Which way is best for efficiency?
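Roughly, the second option (one big string, one matcher per regex) would look something like this simplified sketch, with the patterns as placeholders:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultiRegexExample {
    // Compile each pattern once and reuse it.
    private static final Pattern[] PATTERNS = {
            Pattern.compile("id=(\\d+)"),    // hypothetical pattern
            Pattern.compile("name=(\\w+)")   // hypothetical pattern
    };

    public static void extract(String input) {
        for (Pattern p : PATTERNS) {
            Matcher m = p.matcher(input);
            while (m.find()) {
                System.out.println(p.pattern() + " -> " + m.group(1));
            }
        }
    }
}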
If it isn't running too slow then go with whatever you currently have that is working fast enough.
Otherwise, you shouldn't be using raw regexes for this task anyway. As soon as you mention "multiple regexes" extracting "small bits of data" from an input, you are talking about writing a parser and should use a decent parsing tool.
As you are using Java, I would recommend starting with jFlex, which is a mature Java implementation of an extremely mature and stable C tool.
For most tasks jFlex will be all you need, but it also integrates smoothly with a number of java parser-generators should the problem prove to be more complicated. My personal preference is the slightly obscure Beaver.
Of course, if you can implement it as a set of regexes, it isn't more complicated than that, and jFlex will do the job for you.

The Efficiency of Hard-Coding vs. File Input

I'm working on a machine learning project in Java which will involve a very large model (the output of a Support Vector Machine, for those of you familiar with that) that will need to be retrieved fairly frequently for use by the end user. The bulk of the model consists of a large two-dimensional array of fairly small objects.
Unfortunately, I do not know exactly how large the model is going to be (I've been working with benchmark data so far, and the data I'm actually going to be using isn't ready yet), nor do I know the specifications of the machine it will run on, as that is also up in the air.
I already have a method to write the model to a file as a string, but the write process takes a great deal of time and the read process takes the better part of a minute. I'd like to cut down on that time, so I had the either bright or insanely convoluted idea of writing the model to a .java file in such a way that it could be compiled and then run to produce a fully formed model.
My questions to you are: will storing and compiling the model in Java be significantly faster than reading it from the file, under the assumption that the model is about 1 MB in size? And is there some reason I haven't seen yet that this could be a fantastically stupid idea that I should not pursue under any circumstances?
Thank you for any ideas you can give me.
EDIT: apparently trying to automatically write several thousand values into code makes a method that is roughly two orders of magnitude larger than the compiler can handle. Ah well, live and learn.
Instead of writing to a string or to a java file, you might consider creating a compact binary format for your data.
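For example, a minimal sketch of such a format using DataOutputStream/DataInputStream, assuming the model can be reduced to a rectangular 2D array of doubles (adapt the element type to your actual objects):

import java.io.*;

public class BinaryModelIO {
    public static void write(double[][] model, File file) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeInt(model.length);        // number of rows
            out.writeInt(model[0].length);     // number of columns (assumes non-empty, rectangular)
            for (double[] row : model) {
                for (double v : row) {
                    out.writeDouble(v);
                }
            }
        }
    }

    public static double[][] read(File file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            int rows = in.readInt();
            int cols = in.readInt();
            double[][] model = new double[rows][cols];
            for (int r = 0; r < rows; r++) {
                for (int c = 0; c < cols; c++) {
                    model[r][c] = in.readDouble();
                }
            }
            return model;
        }
    }
}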
Will storing and compiling the model in Java be significantly faster than reading it from the file?
That depends on the way you fashion your custom data structure to contain your model.
The question IMHO is whether the reading of the file takes long because of IO or because of computing time (=> CPU). If the latter is the case, then tough luck. If your IO (e.g. hard disk) is the cause, then you can compress the file and extract it after/while reading. There is (of course) ZIP support in Java (even for streams).
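For instance, the binary streams from the earlier answer could be wrapped in GZIP compression from java.util.zip; a minimal sketch:

import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedModelIO {
    // Same binary layout as before, just compressed on the way to/from disk.
    public static DataOutputStream openForWrite(File file) throws IOException {
        return new DataOutputStream(new GZIPOutputStream(
                new BufferedOutputStream(new FileOutputStream(file))));
    }

    public static DataInputStream openForRead(File file) throws IOException {
        return new DataInputStream(new GZIPInputStream(
                new BufferedInputStream(new FileInputStream(file))));
    }
}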
I agree with the answer given above to use a binary input format, so let's try optimising that first. Can you provide some more information? Have you looked into working with binary data? Buffering it? etc.
Writing a .java file and compiling it will be quite interesting... but it is bound to give you issues at some point. However, I think you will find that it will be slightly slower than an optimised binary format, but faster than text-based input.
Also, be very careful about premature optimisation. Usually, "highly configurable" and "blindingly fast" are mutually exclusive. Rather, get everything to work first and then use a profiler to optimise the really slow sections of the application.

Is there a fast Java library to search for a string and its position in file?

I need to search a large number of files (i.e. 600 files, 0.5 MB each) for a specific string.
I'm using Java, so I'd prefer the answer to be a Java library or in the worst case a library in a different language which I could call from Java.
I need the search to return the exact position of the found string in a file (so it seems Lucene for example is out of the question).
I need the search to be as fast as possible.
EDIT START:
The files might have different format (i.e. EDI, XML, CSV) and contain sometimes pretty random data (i.e. numerical IDs etc.). This is why I preliminarily ruled out an index-based searching engine.
The files will be searched multiple times for similar but different strings (i.e. for IDs which might have similar length and format, but they will usually be different).
EDIT END
Any ideas?
600 files of 0.5 MB each is about 300MB - that can hardly be considered big nowadays, let alone large. A simple string search on any modern computer should actually be more I/O-bound than CPU-bound - a single thread on my system can search 300MB for a relatively simple regular expression in under 1.5 seconds - which goes down to 0.2 if the files are already present in the OS cache.
With that in mind, if your purpose is to perform such a search infrequently, then using some sort of index may result in an overengineered solution. Start by iterating over all files, reading each block-by-block or line-by-line and searching - this is simple enough that it barely merits its own library.
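A minimal sketch of that straightforward approach, reading each (small) file fully and using String.indexOf to report exact character positions; the directory walk and charset are assumptions:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.stream.Stream;

public class NaiveSearch {
    public static void search(Path dir, String needle) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            files.filter(Files::isRegularFile).forEach(file -> {
                try {
                    String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                    int pos = content.indexOf(needle);
                    while (pos >= 0) {
                        System.out.println(file + " @ " + pos); // exact character position
                        pos = content.indexOf(needle, pos + 1);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}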
Set down your performance requirements, profile your code, verify that the actual string search is the bottleneck and then decide whether a more complex solution is warranted. If you do need something faster, you should first consider the following solutions, in order of complexity:
Use an existing indexing engine, such as Lucene, to filter out the bulk of the files for each query and then explicitly search in the (hopefully few) remaining files for your string.
If your files are not really text, so that word-based indexing would not work, preprocess the files to extract a term list for each file and use a DB to create your own indexing system - I doubt you will find an FTS engine that uses anything other than words for its indexing.
If you really want to reduce the search time to the minimum, extract term/position pairs from your files, and enter those in your DB. You may still have to verify by looking at the actual file, but it would be significantly faster.
PS: You do not mention at all what kind of strings we are discussing. Do they contain delimited terms, e.g. words, or do your files contain random characters? Can the search string be broken into substrings in a meaningful manner, or is it a bunch of letters? Is your search string fixed, or could it also be a regular expression? The answer to each of these questions could significantly limit what is and what is not actually feasible - for example indexing random strings may not be possible at all.
EDIT:
From the question update, it seems that the concept of a term/token is generally applicable, as opposed to e.g. searching for totally random sequences in a binary file. That means that you can index those terms. By searching the index for any tokens that exist in your search string, you can significantly reduce the cases where a look at the actual file is needed.
You could keep a term->file index. If most terms are unique to each file, this approach might offer a good complexity/performance trade-off. Essentially you would narrow down your search to one or two files and then perform a full search on those files only.
You could keep a term->file:position index. For example, if your search string is "Alan Turing", you would first search the index for the tokens "Alan" and "Turing". You would get two lists of files and positions that you could cross-reference. By e.g. requiring that the positions of the token "Alan" precede those of the token "Turing" by at most, say, 30 characters, you would get a list of candidate positions in your files that you could verify explicitly.
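A hedged sketch of such a hand-rolled term -> (file, position) index, assuming terms are simple alphanumeric tokens:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TermPositionIndex {
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z0-9]+");
    // term -> file -> character positions
    private final Map<String, Map<Path, List<Integer>>> index = new HashMap<>();

    public void addFile(Path file) throws IOException {
        String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        Matcher m = TOKEN.matcher(content);
        while (m.find()) {
            index.computeIfAbsent(m.group(), t -> new HashMap<>())
                 .computeIfAbsent(file, f -> new ArrayList<>())
                 .add(m.start());
        }
    }

    public Map<Path, List<Integer>> lookup(String term) {
        return index.getOrDefault(term, Collections.emptyMap());
    }
}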
I am not sure to what degree existing indexing libraries would help. Most are targeted towards text indexing and may mishandle other types of tokens, such as numbers or dates. On the other hand, your case is not fundamentally different either, so you might be able to use them - if necessary, by preprocessing the files you feed them to make them more palatable. Building an indexing system of your own, tailored to your needs, does not seem too difficult either.
You still haven't mentioned if there is any kind of flexibility in your search string. Do you expect being able to search for regular expressions? Is the search string expected to be found verbatim, or do you need to find just the terms in it? Does whitespace matter? Does the order of the terms matter?
And more importantly, you haven't mentioned if there is any kind of structure in your files that should be considered while searching. For example, do you want to be able to limit the search to specific elements of an XML file?
Unless you have an SSD, your main bottleneck will be all the file accesses. It's going to take about 10 seconds to read the files, regardless of what you do in Java.
If you have an SSD, reading the files won't be a problem, and the CPU speed in Java will matter more.
If you can create an index for the files this will help enormously.

Working with large text snippets in Java source

Are there any good ways to work with blocks of text (Strings) within Java source code? Many other languages have heredoc syntax available to them, but Java does not. This makes it pretty inconvenient to work with things like tag libraries which output a lot of static markup, and unit tests where you need to assert comparisons against blocks of XML.
How do other people work around this? Is it even possible? Or do I just have to put up with it?
If the text is static, or can be parameterized, a possible solution would be to store it in an external file and then import it. However, this creates file I/O which may be unnecessary or have a performance impact. Using this solution would need to involve caching the file contents to reduce the number of file reads.
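One possible shape for that, loading a classpath resource once and caching it (the resource naming and cache policy here are assumptions):

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TextBlocks {
    private static final Map<String, String> CACHE = new ConcurrentHashMap<>();

    // resourceName is a classpath path, e.g. "/expected-output.xml"
    public static String load(String resourceName) {
        return CACHE.computeIfAbsent(resourceName, name -> {
            try (InputStream in = TextBlocks.class.getResourceAsStream(name);
                 ByteArrayOutputStream out = new ByteArrayOutputStream()) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }
}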
The closest option in Java to a heredoc is java.text.MessageFormat.
You cannot embed logic. It is a simple value-substitution utility. There are no named variables; you have to use zero-based indexing. Just follow the javadoc.
http://download.oracle.com/javase/1.5.0/docs/api/java/text/MessageFormat.html
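A small example of the zero-based placeholders:

import java.text.MessageFormat;

public class MessageFormatExample {
    public static void main(String[] args) {
        // {0} and {1} are positional, zero-based placeholders.
        String template = "<user id=\"{0}\">\n  <name>{1}</name>\n</user>";
        System.out.println(MessageFormat.format(template, 42, "Alice"));
    }
}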
While you could use certain formatters to convert and embed any text file or long literal
as a Java string (e.g., with newline breaks, the necessary escapes, etc.), I can't really think of frequent situations where you would need these capabilities.
The trend in software is generally to separate code from the data it operates on. Large text sections, even if meant just for display or comparison, are data, and are thus typically stored externally. The cost of reading a file (or even caching the result in memory) is fairly low. Internationalization is easier. Changing is easier. Version control is easier. Other tools (e.g., spell checkers) can easily be used.
I agree that in the case of unit tests where you want to compare things against a mock you would need large-scale text comparisons. However, when you deal with such large files you will typically have tests that can work on several different large inputs to produce several large outputs, so why not just have your test load the appropriate files rather than inline them?
Same goes with XML. In fact, for XML I would argue that in many cases you would want to read the XML and build a DOM tree which you would then compare rather than do a text compare that can be affected by whitespaces. And manually creating an XML tree in your unit test is ugly.

Searching for regex patterns on a 30GB XML dataset. Making use of 16gb of memory

I currently have a Java SAX parser that is extracting some info from a 30GB XML file.
Presently it is:
reading each XML node
storing it into a string object,
running some regexes on the string
storing the results to the database
For several million elements. I'm running this on a computer with 16GB of memory, but the memory is not being fully utilized.
Is there a simple way to dynamically 'buffer' about 10gb worth of data from the input file?
I suspect I could manually take a 'producer' 'consumer' multithreaded version of this (loading the objects on one side, using them and discarding on the other), but damnit, XML is ancient now, are there no efficient libraries to crunch em?
Just to cover the bases, is Java able to use your 16GB? You (obviously) need to be on a 64-bit OS, and you need to run Java with -d64 -Xmx10g (or however much memory you want to allocate to it).
It is highly unlikely memory is a limiting factor for what you're doing, so you really shouldn't see it fully utilized. You should be either IO or CPU bound. Most likely, it'll be IO. If it is IO, make sure you're buffering your streams, and then you're pretty much done; the only thing you can do beyond that is buy a faster hard drive.
If you really are CPU-bound, it's possible that you're bottlenecking at regex rather than XML parsing.
See this (which references this)
If your bottleneck is at SAX, you can try other implementations. Off the top of my head, I can think of the following alternatives:
StAX (there are multiple implementations; Woodstox is one of the fastest; a minimal sketch follows this list)
Javolution
Roll your own using JFlex
Roll your own ad hoc, e.g. using regex
For the last two, the more constrained your XML subset is, the more efficient you can make it.
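A minimal StAX sketch for the first option above (javax.xml.stream), pulling events instead of having them pushed as SAX does; the element name "record" and the file name are placeholders for whatever your 30GB file actually contains:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxExample {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(
                new FileInputStream("huge.xml")); // placeholder file name
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && "record".equals(reader.getLocalName())) { // placeholder element name
                String text = reader.getElementText(); // text content of this element only
                // run your regexes on 'text' and write the results to the database here
            }
        }
        reader.close();
    }
}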
It's very hard to say, but as others mentioned, an XML-native database might be a good alternative for you. I have limited experience with those, but I know that at least Berkeley DB XML supports XPath-based indices.
First, try to find out what's slowing you down.
How much faster is the parser when you parse from memory?
Does using a BufferedInputStream with a large size help?
Is it easy to split up the XML file? In general, shuffling through 30 GiB of any kind of data will take some time, since you have to load it from the hard drive first, so you are always limited by the speed of this. Can you distribute the load to several machines, maybe by using something like Hadoop?
No Java experience, sorry, but maybe you should change the parser? SAX should work sequentially and there should be no need to buffer most of the file ...
SAX is, essentially, "event driven", so the only state you should be holding on to from element to element is state that is relevant to that element, rather than the document as a whole. What other state are you maintaining, and why? As each "complete" node (or set of nodes) comes by, you should be discarding them.
I don't really understand what you're trying to do with this huge amount of XML, but I get the impression that
using XML was wrong for the data stored
you are buffering way beyond what you should do (and you are giving up all advantages of SAX parsing by doing so)
Apart from that: XML is not ancient and in massive and active use. What do you think all those interactive web sites are using for their interactive elements?
Are you being slowed down by multiple small commits to your db? It sounds like you would be writing to the db almost all the time from your program, and making sure you don't commit too often could improve performance. Preparing your statements and other standard bulk-processing tricks could possibly also help.
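A sketch of what that batching might look like with a PreparedStatement (table and column names are placeholders):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchWriter {
    private static final int BATCH_SIZE = 1000; // commit every N rows instead of per row

    public static void write(Connection conn, Iterable<String[]> rows) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO extracted_data (field_a, field_b) VALUES (?, ?)")) { // placeholder SQL
            int count = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch();
                    conn.commit();
                }
            }
            ps.executeBatch(); // flush the remainder
            conn.commit();
        }
    }
}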
Other than this early comment, we need more info - do you have a profiler handy that can show you what makes things run slowly?
You can use the Jibx library, and bind your XML "nodes" to objects that represent them. You can even subclass an ArrayList; then, when X objects have been added, perform the regexes all at once (presumably using the method on your object that performs this logic) and save them to the database, before allowing the add method to return once again.
Jibx is hosted on SourceForge: Jibx
To elaborate: you can bind your XML as a "collection" of these specialized String holders. Because you define this as a collection, you must choose what collection type to use. You can then specify your own ArrayList implementation.
Override the add method along these lines (note that ArrayList.add returns boolean, so the override should too):
public boolean add(Object o) {
    boolean added = super.add(o);
    if (size() > YOUR_DEFINED_THRESHOLD) {
        flushObjects();
    }
    return added;
}
YOUR_DEFINED_THRESHOLD is how many objects you want to store in the ArrayList before it has to be flushed out to the database. flushObjects() is simply the method that will perform this logic. The method will block the addition of objects from the XML file until this process is complete. However, this is OK; the overhead of the database will probably be much greater than file reading and parsing anyway.
I would suggest to first import your massive XML file into a native XML database (such as eXist if you are looking for open source stuff, never tested it myself), and then perform iterative paged queries to process your data small chunks at a time.
You may want to try Stax instead of SAX, I hear it's better for that sort of thing (I haven't used it myself).
If the data in the XML is order independent, can you multi-thread the process to split the file up or run multiple processes starting in different locations in the file? If you're not I/O bound that should help speed it along.
