I'm working on a machine learning project in Java which will involve a very large model (the output of a Support Vector Machine, for those of you familiar with that) that will need to be retrieved fairly frequently for use by the end user. The bulk of the model consists of a large two-dimensional array of fairly small objects.
Unfortunately, I do not know exactly how large the model is going to be (I've been working with benchmark data so far, and the data I'm actually going to be using isn't ready yet), nor do I know the specifications of the machine it will run on, as that is also up in the air.
I already have a method to write the model to a file as a string, but the write process takes a great deal of time and the read process takes the better part of a minute. I'd like to cut down on that time, so I had the either bright or insanely convoluted idea of writing the model to a .java file in such a way that it could be compiled and then run to produce a fully formed model.
My questions to you are, will storing and compiling the model in Java be significantly faster than reading it from the file, under the assumption that the model is about 1 MB in size? And is there some reason I haven't seen yet that this could be a fantastically stupid idea that I should not pursue under any circumstances?
Thank you for any ideas you can give me.
EDIT: apparently trying to automatically write several thousand values into code makes a method that is roughly two orders of magnitude larger than the compiler can handle. Ah well, live and learn.
Instead of writing to a string or to a .java file, you might consider creating a compact binary format for your data.
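To make that concrete, here is a minimal sketch of such a format. It assumes the model can be flattened to a rectangular double[][] (the question only says "fairly small objects", so that is a guess), and the class name ModelCodec and the layout (row count, column count, then raw values) are made up for illustration.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Hypothetical codec: writes a rectangular double[][] as rows, cols, then raw values. */
final class ModelCodec {

    static void write(double[][] model, OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(new BufferedOutputStream(out));
        int rows = model.length;
        int cols = rows == 0 ? 0 : model[0].length;
        data.writeInt(rows);
        data.writeInt(cols);
        for (double[] row : model) {
            for (double value : row) {
                data.writeDouble(value); // 8 bytes per value, no text parsing on read
            }
        }
        data.flush(); // the caller owns and closes the underlying stream
    }

    static double[][] read(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(new BufferedInputStream(in));
        int rows = data.readInt();
        int cols = data.readInt();
        double[][] model = new double[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                model[r][c] = data.readDouble();
            }
        }
        return model;
    }
}
```

In real use you would pass a FileOutputStream / FileInputStream; the buffering wrappers matter, since unbuffered DataInputStream reads tend to be slow.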
Will storing and compiling the model in Java be significantly faster than reading it from the file?
That depends on the way you fashion your custom datastructure to contain your model.
The question, IMHO, is whether reading the file takes long because of IO or because of computing time (=> CPU). If the latter is the case, then tough luck. If your IO (e.g. hard disk) is the cause, then you can compress the file and decompress it after/while reading. There is (of course) ZIP support in Java (even for streams).
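As a rough illustration of stream compression, the sketch below uses the GZIP streams from java.util.zip rather than the ZIP archive classes, since a single model file does not need an archive format. GzipRoundTrip is a made-up name, and it works on byte arrays only to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipRoundTrip {

    static byte[] compress(byte[] raw) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream gzip = new GZIPOutputStream(sink)) {
            gzip.write(raw);
        } // closing the GZIP stream writes the trailer
        return sink.toByteArray();
    }

    static byte[] decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                sink.write(chunk, 0, n); // decompression happens while reading
            }
        }
        return sink.toByteArray();
    }
}
```

Whether this helps depends on the data: it trades CPU for IO, so it only pays off if the disk really is the bottleneck.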
I agree with the answer given above to use a binary input format. Let's try optimising that first. Can you provide some information? ...or have you googled working with binary data? ...buffering it? etc.?
Writing a .java file and compiling it will be quite interesting... but it is bound to give you issues at some point. However, I think you will find that it will be slightly slower than an optimised binary format, but faster than text-based input.
Also, be very careful about premature optimisation. Usually, "highly configurable" and "blindingly fast" are mutually exclusive. Rather, get everything to work first and then use a profiler to optimise the really slow sections of the application.
Are there any good ways to work with blocks of text (Strings) within Java source code? Many other languages have heredoc syntax available to them, but Java does not. This makes it pretty inconvenient to work with things like tag libraries which output a lot of static markup, and unit tests where you need to assert comparisons against blocks of XML.
How do other people work around this? Is it even possible? Or do I just have to put up with it?
If the text is static, or can be parameterized, a possible solution would be to store it in an external file and then import it. However, this creates file I/O which may be unnecessary or have a performance impact. Using this solution would need to involve caching the file contents to reduce the number of file reads.
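A minimal sketch of the cached external-file approach follows. TextBlockCache is a hypothetical helper, UTF-8 is an assumed encoding, and the cache never evicts, which is only reasonable for a handful of static files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class TextBlockCache {
    private static final Map<Path, String> CACHE = new ConcurrentHashMap<>();

    /** Reads the file on first use only; later calls return the cached text. */
    static String load(Path file) {
        return CACHE.computeIfAbsent(file, f -> {
            try {
                return new String(Files.readAllBytes(f), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }
}
```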
The closest option in Java to heredoc is java.text.MessageFormat.
You cannot embed logic. It is a simple value substitution utility. There are no named variables; placeholders use zero-based indexing. Just follow the javadoc.
http://download.oracle.com/javase/1,5.0/docs/api/java/text/MessageFormat.html
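For example, a small markup template might be filled in as below. The template text and the TemplateDemo class are invented for illustration; only MessageFormat.format itself comes from the JDK.

```java
import java.text.MessageFormat;

final class TemplateDemo {
    // Placeholders are zero-based positions, not named variables.
    static final String TEMPLATE = "<user id=\"{0}\">\n  <name>{1}</name>\n</user>";

    static String render(String id, String name) {
        return MessageFormat.format(TEMPLATE, id, name);
    }
}
```

Note that you still have to write the newlines and escape the double quotes yourself, so this is a substitution helper rather than a true heredoc. Also watch out for single quotes and literal braces in the template: MessageFormat treats the single quote as its own escape character.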
While you could use certain formatters to convert and embed any text file or long literal as a Java string (e.g., with newline breaks, the necessary escapes, etc.), I can't really think of frequent situations where you would need these capabilities.
The trend in software is generally to separate code from the data it operates on. Large text sections, even if meant just for display or comparison, are data, and are thus typically stored externally. The cost of reading a file (or even caching the result in memory) is fairly low. Internationalization is easier. Changing is easier. Version control is easier. Other tools (e.g., spell checkers) can easily be used.
I agree that in the case of unit tests where you want to compare things against a mock you would need large-scale text comparisons. However, when you deal with such large files you will typically have tests that can work on several different large inputs to produce several large outputs, so why not just have your test load the appropriate files rather than inline them?
The same goes for XML. In fact, for XML I would argue that in many cases you would want to read the XML and build a DOM tree which you would then compare, rather than do a text compare that can be affected by whitespace. And manually creating an XML tree in your unit test is ugly.
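A rough sketch of such a DOM comparison using only the JDK parser is shown below. XmlCompare is a made-up helper; it drops whitespace-only text nodes before comparing, which is an assumption about what "equal" should mean for your tests (it would be wrong for markup where blank text is significant).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class XmlCompare {

    /** True if both documents have the same elements, attributes and non-blank text. */
    static boolean sameStructure(String expectedXml, String actualXml) throws Exception {
        return parse(expectedXml).isEqualNode(parse(actualXml));
    }

    private static Element parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Element root = builder.parse(new InputSource(new StringReader(xml))).getDocumentElement();
        stripBlankText(root);
        return root;
    }

    // Removes whitespace-only text nodes so indentation differences do not matter.
    private static void stripBlankText(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else {
                stripBlankText(child);
            }
            child = next;
        }
    }
}
```

In a unit test, the two strings would typically be the contents of an expected-output file and the output under test.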
I'm writing a small agent in java that will play a game against other agents. I want to keep a small amount of state (probably approx. 1kb at most) around between runs of the program so that I can try to tweak the performance of the agent based upon past successes. Essentially, I will be reading a small amount of data at the beginning of each game and writing a small amount at the end. It seems like I have 2 options, file I/O or derby. Is there a speed advantage to either? Or does it not really matter for such a small amount of data?
With 1 KB of data, you are better off using standard file IO. Most likely, you could serialize the entire object tree to disk and simply deserialize it when you start up again. If you wanted to get fancy, you could use JAXB to serialize to XML instead of binary files.
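Here is a minimal sketch of the serialize-on-exit, deserialize-on-startup idea using plain Java serialization. AgentState and its two fields are placeholders; your agent's real state will look different.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical agent state; the fields are placeholders. */
final class AgentState implements Serializable {
    private static final long serialVersionUID = 1L;
    int gamesPlayed;
    double winRate;

    /** Call at the end of a game. */
    void save(Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(this);
        }
    }

    /** Call at startup, after checking that the file exists. */
    static AgentState load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (AgentState) in.readObject();
        }
    }
}
```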
As much as I love to fit every problem to the database solution, I don't think that's very practical here. Unless you have some special need of database-specific capabilities, you are introducing a lot of overhead, complexity, and maintenance problems by using a database.
The only case where you might really want to use a database is if you have a lot of small objects/rows and you frequently perform sorts and filters on the data. But even then, you could probably keep a dozen in-memory ordered lists and get better performance with fewer resources and without the headache of a database.
If you really think you need a database in this scenario, consider HSQL. I don't consider it a real database, but it's an in-memory database that can persist to a file. Low overhead, low complexity, and relatively few points of failure. Plus, if you need to edit the persisted data, you can do so with a text editor. Can't say that about Derby.
Because file sizes vary and your computer's specs (bus speed, hard drive speed) affect the result, the only way to be sure is to write your own benchmark. Just create a simple for loop, count from 1 to 1000, and read the file inside the loop over and over (but do not create and destroy the objects inside the loop; just focus on the reading part).
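Such a loop could look roughly like the sketch below. ReadBenchmark is a made-up name, and the numbers it produces are only indicative: after the first pass the operating system will most likely serve the file from its cache, and the JIT needs some warm-up, so treat the result as a lower bound on real-world read time.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class ReadBenchmark {

    /** Reads the file `iterations` times (must be positive) and returns the average nanoseconds per read. */
    static long averageReadNanos(Path file, int iterations) throws IOException {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            Files.readAllBytes(file); // only the read is timed, no object construction
        }
        return (System.nanoTime() - start) / iterations;
    }
}
```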
Of course this whole exercise reeks of premature optimization, which can lead to bad coding habits. Just write your code in the most readable, simple fashion, and if there is a speed problem, refactor as needed.
But since it's a small amount of data, I would say it won't matter.
I currently have a Java SAX parser that is extracting some info from a 30GB XML file.
Presently it is:
reading each XML node
storing it into a string object,
running some regexes on the string
storing the results to the database
For several million elements. I'm running this on a computer with 16GB of memory, but the memory is not being fully utilized.
Is there a simple way to dynamically 'buffer' about 10gb worth of data from the input file?
I suspect I could manually take a 'producer' 'consumer' multithreaded version of this (loading the objects on one side, using them and discarding on the other), but damnit, XML is ancient now, are there no efficient libraries to crunch em?
Just to cover the bases, is Java able to use your 16GB? You (obviously) need to be on a 64-bit OS, and you need to run Java with -d64 -Xmx10g (or however much memory you want to allocate to it).
It is highly unlikely memory is a limiting factor for what you're doing, so you really shouldn't see it fully utilized. You should be either IO or CPU bound. Most likely, it'll be IO. If it is IO, make sure you're buffering your streams, and then you're pretty much done; the only thing you can do is buy a faster hard drive.
If you really are CPU-bound, it's possible that you're bottlenecking at regex rather than XML parsing.
See this (which references this)
If your bottleneck is at SAX, you can try other implementations. Off the top of my head, I can think of the following alternatives:
StAX (there are multiple implementations; Woodstox is one of the fastest)
Javolution
Roll your own using JFlex
Roll your own ad hoc, e.g. using regex
For the last two, the more constrained your XML subset is, the more efficient you can make it.
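To give an idea of what the StAX option looks like, here is a small sketch using the javax.xml.stream API that ships with the JDK (Woodstox plugs in behind the same interfaces). StaxExtract and the element name you pass in are hypothetical, and the example reads from a String only to stay self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxExtract {

    /** Collects the text of every element named `elementName`. */
    static List<String> textOf(String xml, String elementName) throws XMLStreamException {
        List<String> values = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                // Pull the next event; nothing outside the current element is kept in memory.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    values.add(reader.getElementText()); // only valid for text-only elements
                }
            }
        } finally {
            reader.close();
        }
        return values;
    }
}
```

For a 30GB file you would hand the factory a BufferedInputStream over the file and process each value as it arrives instead of collecting them in a list.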
It's very hard to say, but as others mentioned, an XML-native database might be a good alternative for you. I have limited experience with those, but I know that at least Berkeley DB XML supports XPath-based indices.
First, try to find out what's slowing you down.
How much faster is the parser when you parse from memory?
Does using a BufferedInputStream with a large size help?
Is it easy to split up the XML file? In general, shuffling through 30 GiB of any kind of data will take some time, since you have to load it from the hard drive first, so you are always limited by the speed of this. Can you distribute the load to several machines, maybe by using something like Hadoop?
No Java experience, sorry, but maybe you should change the parser? SAX should work sequentially and there should be no need to buffer most of the file ...
SAX is, essentially, "event driven", so the only state you should be holding on to from element to element is state that is relevant to that element, rather than to the document as a whole. What other state are you maintaining, and why? As each "complete" node (or set of nodes) comes by, you should be discarding them.
I don't really understand what you're trying to do with this huge amount of XML, but I get the impression that
using XML was wrong for the data stored
you are buffering way beyond what you should do (and you are giving up all advantages of SAX parsing by doing so)
Apart from that: XML is not ancient and is in massive and active use. What do you think all those interactive web sites are using for their interactive elements?
Are you being slowed down by multiple small commits to your db? It sounds like you would be writing to the db almost all the time from your program, and making sure you don't commit too often could improve performance. Preparing your statements and other standard bulk processing tricks could possibly help as well.
Other than this early comment, we need more info. Do you have a profiler handy that can scrape out what makes things run slowly?
You can use the Jibx library, and bind your XML "nodes" to objects that represent them. You can even subclass ArrayList so that, when x number of objects have been added, it performs the regexes all at once (presumably using the method on your object that performs this logic) and then saves them to the database, before allowing the "add" method to finish once again.
Jibx is hosted on SourceForge: Jibx
To elaborate: you can bind your XML as a "collection" of these specialized String holders. Because you define this as a collection, you must choose what collection type to use. You can then specify your own ArrayList implementation.
Override the add method as follows (ArrayList.add returns boolean, so the override must as well):
@Override
public boolean add(Object o) {
    boolean added = super.add(o);
    // Once the buffer exceeds the threshold, process and persist the batch.
    if (size() > YOUR_DEFINED_THRESHOLD) {
        flushObjects();
    }
    return added;
}
YOUR_DEFINED_THRESHOLD is how many objects you want to store in the ArrayList before it has to be flushed out to the database. flushObjects() is simply the method that will perform this logic. The method will block the addition of objects from the XML file until this process is complete. However, this is OK; the overhead of the database will probably be much greater than that of file reading and parsing anyway.
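For completeness, a self-contained version of the flushing list might look like the sketch below. It is not Jibx-specific: FlushingList, the sink callback and the explicit flush() method are my own illustration, and it flushes as soon as the threshold is reached rather than exceeded. The sink is where the regexes and the database write would go.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical buffer: hands a full batch to `sink` and then empties itself. */
final class FlushingList<T> extends ArrayList<T> {
    private final int threshold;
    private final Consumer<List<T>> sink;

    FlushingList(int threshold, Consumer<List<T>> sink) {
        this.threshold = threshold;
        this.sink = sink;
    }

    @Override
    public boolean add(T item) {
        boolean added = super.add(item);
        if (size() >= threshold) {
            flush(); // blocks the caller (the parser) until the batch is handled
        }
        return added;
    }

    /** Call once more after parsing ends to push out the final partial batch. */
    void flush() {
        if (!isEmpty()) {
            sink.accept(new ArrayList<>(this)); // copy so the sink owns its batch
            clear();
        }
    }
}
```

Only add(T) is overridden here; if the binding framework fills the collection through addAll or add(int, T), those would need the same treatment.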
I would suggest first importing your massive XML file into a native XML database (such as eXist if you are looking for open source stuff; I've never tested it myself), and then performing iterative paged queries to process your data in small chunks at a time.
You may want to try StAX instead of SAX; I hear it's better for that sort of thing (I haven't used it myself).
If the data in the XML is order independent, can you multi-thread the process to split the file up or run multiple processes starting in different locations in the file? If you're not I/O bound that should help speed it along.
So I have a "large" number of "very large" ASCII files of numerical data (gigabytes altogether), and my program will need to process the entirety of it sequentially at least once.
Any advice on storing/loading the data? I've thought of converting the files to binary to make them smaller and for faster loading.
Should I load everything into memory all at once?
If not, what's a good way of loading the data partially?
What are some Java-relevant efficiency tips?
So then what if the processing requires jumping around in the data for multiple files and multiple buffers? Is constant opening and closing of binary files going to become expensive?
I'm a big fan of 'memory mapped i/o', aka 'direct byte buffers'. In Java they are called Mapped Byte Buffers and are part of java.nio. (Basically, this mechanism uses the OS's virtual memory paging system to 'map' your files and present them programmatically as byte buffers. The OS will manage moving the bytes to/from disk and memory auto-magically and very quickly.)
I suggest this approach because a) it works for me, and b) it will let you focus on your algorithm and let the JVM, OS and hardware deal with the performance optimization. All too frequently, they know what is best more so than us lowly programmers. ;)
How would you use MBBs in your context? Just create an MBB for each of your files and read them as you see fit. You will only need to store your results.
BTW: How much data are you dealing with, in GB? If it is more than 3-4GB, then this won't work for you on a 32-bit machine, as the MBB implementation is dependent on the memory space addressable by the platform architecture. A 64-bit machine & OS will take you to 1TB or 128TB of mappable data.
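As a minimal sketch of the idea, the code below maps a file read-only and sums its contents. It assumes the data has already been converted to raw big-endian 8-byte doubles (the original ASCII files would need parsing instead), and MappedSum is a made-up name.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedSum {

    /** Sums a file of big-endian 8-byte doubles (an assumed layout) through a read-only mapping. */
    static double sum(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // A single mapping is limited to Integer.MAX_VALUE bytes,
            // so larger files have to be mapped in several windows.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            double total = 0;
            while (buffer.remaining() >= Double.BYTES) {
                total += buffer.getDouble(); // pages are faulted in by the OS on demand
            }
            return total;
        }
    }
}
```

Because of that per-mapping limit, "an MBB for each of your files" in practice means one or more mapped windows per file once the files exceed 2GB.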
If you are thinking about performance, then get to know Kirk Pepperdine (a somewhat famous Java performance guru). He is involved with a website, www.JavaPerformanceTuning.com, that has some more MBB details: NIO Performance Tips and other Java performance related things.
You might want to have a look at the entries in the Wide Finder Project (do a google search for "wide finder" java).
The Wide Finder project involves reading over lots of lines in log files, so look at the Java implementations and see what worked and didn't work there.
You could convert to binary, but then you have 1+ something copies of the data, if you need to keep the original around.
It may be practical to build some kind of index on top of your original ASCII data, so that if you need to go through the data again you can do it faster on subsequent passes.
To answer your questions in order:
Should I load everything into memory all at once?
Not if you don't have to. For some files, you may be able to, but if you're just processing sequentially, just do some kind of buffered read through the things one by one, storing whatever you need along the way.
If not, what's a good way of loading the data partially?
BufferedReader and friends are simplest, although you could look deeper into FileChannel etc. to use memory-mapped I/O to go through windows of the data at a time.
What are some Java-relevant efficiency tips?
That really depends on what you're doing with the data itself!
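For the simple buffered, sequential read mentioned above, a sketch might look like this. The whitespace-separated number format and the "sum everything" processing are assumptions standing in for whatever your files and algorithm really are; SequentialSum is a made-up name.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

final class SequentialSum {

    /** Sums whitespace-separated numbers, holding only one line in memory at a time. */
    static double sum(Reader source) throws IOException {
        double total = 0;
        try (BufferedReader reader = new BufferedReader(source, 1 << 16)) { // 64K-char buffer
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    continue; // skip blank lines
                }
                for (String token : trimmed.split("\\s+")) {
                    total += Double.parseDouble(token);
                }
            }
        }
        return total;
    }
}
```

For a real file you would pass something like Files.newBufferedReader(path) or a FileReader as the source.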
Without any additional insight into what kind of processing is going on, here are some general thoughts from when I have done similar work.
Write a prototype of your application (maybe even "one to throw away") that performs some arbitrary operation on your data set. See how fast it goes. If the simplest, most naive thing you can think of is acceptably fast, no worries!
If the naive approach does not work, consider pre-processing the data so that subsequent runs will run in an acceptable length of time. You mention having to "jump around" in the data set quite a bit. Is there any way to pre-process that out? Or, one pre-processing step can be to generate even more data - index data - that provides byte-accurate location information about critical, necessary sections of your data set. Then, your main processing run can utilize this information to jump straight to the necessary data.
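As an illustration of the index idea, the sketch below records the byte offset of each line in one pre-processing pass and then uses the offsets to jump straight to a line. LineIndex is hypothetical, "one record per line" is an assumption about your data, and RandomAccessFile.readLine is unbuffered and only safe for ASCII, so a real pre-processing pass would want a buffered scanner.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

final class LineIndex {

    /** Pre-processing pass: records the byte offset at which each line starts. */
    static List<Long> build(RandomAccessFile file) throws IOException {
        List<Long> offsets = new ArrayList<>();
        file.seek(0);
        long pos = 0;
        while (file.readLine() != null) {
            offsets.add(pos);
            pos = file.getFilePointer(); // start of the next line
        }
        return offsets;
    }

    /** Main run: seeks directly to the requested (zero-based) line. */
    static String lineAt(RandomAccessFile file, List<Long> offsets, int lineNumber) throws IOException {
        file.seek(offsets.get(lineNumber));
        return file.readLine();
    }
}
```

The offsets list could itself be saved to a small binary file so the expensive scan only happens once per data file.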
So, to summarize, my approach would be to try something simple right now and see what the performance looks like. Maybe it will be fine. Otherwise, look into processing the data in multiple steps, saving the most expensive operations for infrequent pre-processing.
Don't "load everything into memory". Just perform file accesses and let the operating system's disk page cache decide when you get to actually pull things directly out of memory.
This depends a lot on the data in the file. Big mainframes have been doing sequential data processing for a long time but they don't normally use random access for the data. They just pull it in a line at a time and process that much before continuing.
For random access it is often best to build objects with caching wrappers which know where in the file the data they need to construct is. When needed they read that data in and construct themselves. This way when memory is tight you can just start killing stuff off without worrying too much about not being able to get it back later.
You really haven't given us enough info to help you. Do you need to load each file in its entirety in order to process it? Or can you process it line by line?
Loading an entire file at a time is likely to result in poor performance even for files that aren't terribly large. Your best bet is to define a buffer size that works for you and read/process the data a buffer at a time.
I've found Informatica to be an exceptionally useful data processing tool. The good news is that the more recent versions even allow Java transformations. If you're dealing with terabytes of data, it might be time to pony up for the best-of-breed ETL tools.
I'm assuming you want to do something with the results of the processing here, like store it somewhere.
If your numerical data is regularly sampled and you need to do random access, consider storing it in a quadtree.
I strongly recommend leveraging regular expressions and looking into the "new" I/O package (java.nio) for faster input. Then it should go as quickly as you can realistically expect gigabytes of data to go.
If at all possible, get the data into a database. Then you can leverage all the indexing, caching, memory pinning, and other functionality available to you there.
If you need to access the data more than once, load it into a database. Most databases have some sort of bulk loading utility. If the data can all fit in memory, and you don't need to keep it around or access it that often, you can probably write something simple in Perl or your favorite scripting language.