I have an input that's about 35KB of text that I need to pull a bunch of small bits of data from. I use multiple regexes to find the data, and that part works fine.
My question: should I split the large text into multiple smaller strings and run the appropriate regexes on each string, or just keep it in one big string and reset the matcher for each regex? Which way is best for efficiency?
If what you currently have runs fast enough, just stick with it.
Otherwise, you shouldn't be using raw regexes for this task anyway. As soon as you mention "multiple regexes" extracting "small bits of data" from a document, you are talking about writing a parser, and you should use a decent parsing tool.
As you are using Java, I would recommend starting with JFlex, which is a mature Java implementation of the extremely mature and stable C tool flex.
For most tasks JFlex will be all you need, but it also integrates smoothly with a number of Java parser generators should the problem prove to be more complicated. My personal preference is the slightly obscure Beaver.
Of course, if you can implement it as a set of regexes, it isn't more complicated, and JFlex will do the job for you.
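If you do stay with plain regexes, the cheap wins are to compile each Pattern once and reuse a single Matcher across the big string instead of copying substrings. A minimal sketch, with hypothetical patterns standing in for yours:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Extractor {
        // Compile once; Pattern objects are immutable and thread-safe.
        private static final Pattern ORDER_ID = Pattern.compile("order-id:(\\d+)");
        private static final Pattern TOTAL    = Pattern.compile("total:(\\d+\\.\\d{2})");

        public static void extract(String bigInput) {
            // One Matcher can serve several patterns via usePattern().
            Matcher m = ORDER_ID.matcher(bigInput);
            while (m.find()) {
                System.out.println("id = " + m.group(1));
            }
            m.usePattern(TOTAL).reset(); // switch pattern, rewind to the start
            while (m.find()) {
                System.out.println("total = " + m.group(1));
            }
        }
    }

Either way, at 35KB the difference between one big string and several smaller ones is unlikely to be measurable.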
I've profiled my application, and it seems like one of my biggest bottlenecks at the moment is the String.split method. It's taking up 21% of my runtime, and the other big contributors aren't parts that I can streamline any more than they already are. It also seems like all of the newly created String objects are causing issues with the garbage collector, although I'm less clear on whether that's the case.
I'm reading in a gzipped file of comma-separated values containing financial data. The number of fields in each row varies depending on what kind of record it is, and the size of each field varies too. What's the fastest way to read the data in while creating the fewest intermediate objects?
I saw this thread but none of the answers give any evidence that OpenCSV is any faster than String.split, and they all seem to focus on using an external library rather than writing new code. I'm also very concerned about memory overhead, because I spend another 20% or so of the total runtime doing garbage collection. I would like to just return views of the string in question, but it looks like that's not possible anymore.
A quicker way is to use a simple StringTokenizer. It doesn't have the regex overhead of split(), and it's in the JDK.
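A minimal sketch of the two side by side (the row data is made up; note the empty-field caveat in the comment):

    import java.util.StringTokenizer;

    public class TokenizerDemo {
        public static void main(String[] args) {
            String row = "2011-03-01,ACME,12.34,1000";

            // split() applies a regex and allocates a new String[] per call.
            String[] fields = row.split(",");

            // StringTokenizer walks the string once, with no regex machinery.
            // Caveat: unlike split(), it silently skips empty fields (",,"
            // yields no token), so it only fits data with no blank columns.
            StringTokenizer st = new StringTokenizer(row, ",");
            while (st.hasMoreTokens()) {
                System.out.println(st.nextToken());
            }
        }
    }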
If you do not want to use a library, then an alternative to StringTokenizer would be to write a simple state machine to parse your CSV. Tokenizers can have problems with commas embedded in fields. CSV is a reasonably simple format, so it is not difficult to build a state machine to handle it. If you know exactly what the format of the input file is, then you can simplify it even further since you will not have to deal with any possibilities not present in your specific file.
Numeric data could potentially be converted directly to int on the fly, without having to hold a large number of strings simultaneously.
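A rough sketch of such a state machine for a single line, handling double-quoted fields with embedded commas (it assumes RFC 4180-style quoting and no line breaks inside fields; drop the quote handling if your files never quote):

    import java.util.ArrayList;
    import java.util.List;

    public class CsvLineParser {
        /** Splits one CSV line, honouring double-quoted fields. */
        public static List<String> parseLine(String line) {
            List<String> fields = new ArrayList<String>();
            StringBuilder field = new StringBuilder();
            boolean inQuotes = false;
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                if (inQuotes) {
                    if (c == '"') {
                        if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                            field.append('"'); // escaped quote: ""
                            i++;
                        } else {
                            inQuotes = false;  // closing quote
                        }
                    } else {
                        field.append(c);
                    }
                } else if (c == '"') {
                    inQuotes = true;           // opening quote
                } else if (c == ',') {
                    fields.add(field.toString()); // field boundary
                    field.setLength(0);
                } else {
                    field.append(c);
                }
            }
            fields.add(field.toString());      // last field
            return fields;
        }
    }

Following the numeric point above, a known-numeric column could skip the StringBuilder entirely and accumulate digits into an int as they arrive.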
Use uniVocity-parsers to parse your CSV file. It is a suite of parsers for tabular text formats, and its CSV parser is the fastest among CSV parsers for Java (as you can see here and here). Disclosure: I am the author of this library. It's open-source and free (Apache 2.0 license).
We used the architecture provided by this framework to build a custom parser for MySQL dump files for this project. We managed to parse a 42GB dump file in 15 minutes (1+ billion rows).
It should solve your problem.
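Basic usage looks roughly like this (the file name is a placeholder; the streaming beginParsing/parseNext API avoids holding all rows in memory if that matters for your heap):

    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;

    import java.io.FileReader;
    import java.io.Reader;
    import java.util.List;

    public class UnivocityDemo {
        public static void main(String[] args) throws Exception {
            CsvParserSettings settings = new CsvParserSettings();
            CsvParser parser = new CsvParser(settings);
            Reader in = new FileReader("financial_records.csv");
            try {
                // parseAll loads every row; use beginParsing/parseNext to stream.
                List<String[]> rows = parser.parseAll(in);
                System.out.println("rows read: " + rows.size());
            } finally {
                in.close();
            }
        }
    }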
I have a question which makes me think about how to improve the speed and memory usage of a system.
I will describe it with an example. I have a file containing strings like this:
<e>Customer</e>
<a1>Customer Id</a1>
<a2>Customer Name</a2>
<e>Person</e>
It is similar to an XML file.
Currently, when I read <e>Customer</e>, I scan forward to the nearest tag and take the substring from <e>Customer</e> up to that tag.
This makes the system do a lot of work, and I used only regular expressions to do it. I am thinking of doing the same thing a real compiler does, in phases (lexical analysis, then parsing).
Any ideas?
Thanks in advance!
If you really don't want to use one of the free and reliable XML parsers, then a truly fast solution will almost certainly involve a state machine.
See the question "How to create a simple state machine in java" for a good start.
Please be sure to have a very good reason for taking this route.
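To give a feel for it, a hand-rolled state machine for the tag format above can be quite small. This sketch assumes the tags never nest, and it makes a single pass with no regex work and no throwaway substrings:

    public class TagScanner {
        private enum State { TEXT, OPEN_TAG, CONTENT, CLOSE_TAG }

        /** Prints each (tag, content) pair found in the input. */
        public static void scan(String input) {
            State state = State.TEXT;
            StringBuilder tag = new StringBuilder();
            StringBuilder content = new StringBuilder();
            for (int i = 0; i < input.length(); i++) {
                char c = input.charAt(i);
                switch (state) {
                    case TEXT:       // between elements
                        if (c == '<') { tag.setLength(0); state = State.OPEN_TAG; }
                        break;
                    case OPEN_TAG:   // reading the tag name
                        if (c == '>') { content.setLength(0); state = State.CONTENT; }
                        else tag.append(c);
                        break;
                    case CONTENT:    // reading the element text
                        if (c == '<') state = State.CLOSE_TAG;
                        else content.append(c);
                        break;
                    case CLOSE_TAG:  // skipping the </...> part
                        if (c == '>') {
                            System.out.println(tag + " -> " + content);
                            state = State.TEXT;
                        }
                        break;
                }
            }
        }
    }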
Regular expressions are not the right tool for parsing complex structures like this. Since your file looks a lot like XML, it may make sense to add what's missing to make it well-formed XML (i.e. an XML declaration and a single root element), and feed the result to an XML parser.
XML parsers are optimized for processing large volumes of data quickly (especially the SAX kind). You should see a significant improvement in performance if you switch to parsing XML from processing large volumes of text with regular expressions.
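For instance, once the sample above is wrapped in a root element so it is well-formed, a SAX handler can stream through it like this sketch (element names taken from the sample):

    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;
    import java.io.ByteArrayInputStream;

    public class SaxDemo {
        public static void main(String[] args) throws Exception {
            String xml = "<root>"
                    + "<e>Customer</e><a1>Customer Id</a1><a2>Customer Name</a2>"
                    + "<e>Person</e>"
                    + "</root>";

            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")),
                    new DefaultHandler() {
                        private String current;

                        @Override
                        public void startElement(String uri, String local,
                                                 String qName, Attributes atts) {
                            current = qName; // remember which element we are in
                        }

                        @Override
                        public void characters(char[] ch, int start, int length) {
                            if (current != null) {
                                System.out.println(current + ": "
                                        + new String(ch, start, length));
                            }
                        }

                        @Override
                        public void endElement(String uri, String local,
                                               String qName) {
                            current = null;
                        }
                    });
        }
    }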
Just don't invest the time in writing your own XML lexer/parser (it's not worth it); use what is already out there.
For example, http://www.mkyong.com/tutorials/java-xml-tutorials/ is a good tutorial; just use Google.
I'm working on a machine learning project in Java which will involve a very large model (the output of a Support Vector Machine, for those of you familiar with that) that will need to be retrieved fairly frequently for use by the end user. The bulk of the model consists of a large two-dimensional array of fairly small objects.
Unfortunately, I do not know exactly how large the model is going to be (I've been working with benchmark data so far, and the data I'm actually going to be using isn't ready yet), nor do I know the specifications of the machine it will run on, as that is also up in the air.
I already have a method to write the model to a file as a string, but the write process takes a great deal of time and the read process takes the better part of a minute. I'd like to cut down on that time, so I had the either bright or insanely convoluted idea of writing the model to a .java file in such a way that it could be compiled and then run to produce a fully formed model.
My questions to you are, will storing and compiling the model in Java be significantly faster than reading it from the file, under the assumption that the model is about 1 MB in size? And is there some reason I haven't seen yet that this could be a fantastically stupid idea that I should not pursue under any circumstances?
Thank you for any ideas you can give me.
EDIT: apparently trying to automatically write several thousand values into code makes a method that is roughly two orders of magnitude larger than the compiler can handle. Ah well, live and learn.
Instead of writing to a string or to a .java file, you might consider creating a compact binary format for your data.
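For a model that is essentially one big two-dimensional array, that can be as simple as writing the dimensions followed by the raw values. A sketch, assuming for illustration that each entry reduces to a double (an entry with several fields would just write each field in turn):

    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class ModelIo {
        public static void write(double[][] model, String path) throws IOException {
            DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream(path)));
            try {
                out.writeInt(model.length);    // rows
                out.writeInt(model[0].length); // columns (assumes a rectangular array)
                for (double[] row : model)
                    for (double v : row)
                        out.writeDouble(v);
            } finally {
                out.close();
            }
        }

        public static double[][] read(String path) throws IOException {
            DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(path)));
            try {
                int rows = in.readInt();
                int cols = in.readInt();
                double[][] model = new double[rows][cols];
                for (double[] row : model)
                    for (int j = 0; j < cols; j++)
                        row[j] = in.readDouble();
                return model;
            } finally {
                in.close();
            }
        }
    }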
Will storing and compiling the model in Java be significantly faster than reading it from the file?
That depends on the way you fashion your custom data structure to contain your model.
The question, IMHO, is whether the reading of the file takes long because of IO or because of computing time (i.e. CPU). If the latter is the case, then tough luck. If your IO (e.g. hard disc) is the cause, then you can compress the file and extract it after/while reading. There is (of course) ZIP support in Java (even for streams).
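Wrapping the file streams in GZIP streams is a one-line change on each side. A sketch, reusing the hypothetical binary layout from the previous answer:

    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class CompressedIo {
        // Compression trades CPU for IO: a win only if the disk is the bottleneck.
        public static DataOutputStream openWriter(String path) throws IOException {
            return new DataOutputStream(new BufferedOutputStream(
                    new GZIPOutputStream(new FileOutputStream(path))));
        }

        public static DataInputStream openReader(String path) throws IOException {
            return new DataInputStream(new BufferedInputStream(
                    new GZIPInputStream(new FileInputStream(path))));
        }
        // Callers must close the returned stream; closing writes the GZIP trailer.
    }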
I agree with the answer given above: use a binary input format. Let's try optimising that first. Can you provide some more information? Have you looked into working with binary data, buffering it, etc.?
Writing a .java file and compiling it will be quite interesting... but it is bound to give you issues at some point. However, I think you will find that it will be slightly slower than an optimised binary format, though faster than text-based input.
Also, be very careful about premature optimisation. Usually, "highly configurable" and "blindingly fast" are mutually exclusive. Rather, get everything to work first and then use a profiler to optimise the really slow sections of the application.
I'm trying to use Ragel to implement a simple yes/no fsm. Unfortunately the language specification consists of the union of about a thousand regular expressions, with * operators appearing once or more in the majority of them. So, the number of possible states explodes and it seems it will be impossible to use Ragel to generate an fsm for my language. Is there a tool out there that can do what I need, or should I swap approaches? I need something better than checking input strings against each regular expression in turn. I could chop up the thousand regular expressions into chunks of ~50 and generate an fsm for each, and run every input string against all the machines, but if there's a tool that can handle this kind of job without such a hack I'd be pleased to hear of it.
Thanks!
Well, I've ended up breaking the machine into multiple machines in order to prevent Ragel from eating all available memory - in fact, I had to break up the machine into a couple of separate Ragel files because the generated java class had too many constants in it from the huge state tables generated. I'm still interested in hearing of a better solution for this, if anybody has one!
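For what it's worth, the brute-force fallback in plain java.util.regex sidesteps the state explosion entirely, because that engine is a backtracking matcher rather than a precompiled DFA; the cost moves from generation time to match time. A sketch with placeholder patterns:

    import java.util.List;
    import java.util.regex.Pattern;

    public class UnionMatcher {
        private final Pattern union;

        public UnionMatcher(List<String> regexes) {
            // Build one big alternation: (?:re1)|(?:re2)|...
            StringBuilder sb = new StringBuilder();
            for (String re : regexes) {
                if (sb.length() > 0) sb.append('|');
                sb.append("(?:").append(re).append(')');
            }
            union = Pattern.compile(sb.toString());
        }

        public boolean accepts(String input) {
            // e.g. new UnionMatcher(Arrays.asList("ab*c", "de+f")).accepts("abbc")
            return union.matcher(input).matches();
        }
    }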
Are there any good ways to work with blocks of text (Strings) within Java source code? Many other languages have heredoc syntax available to them, but Java does not. This makes it pretty inconvenient to work with things like tag libraries which output a lot of static markup, and unit tests where you need to assert comparisons against blocks of XML.
How do other people work around this? Is it even possible? Or do I just have to put up with it?
If the text is static, or can be parameterized, a possible solution is to store it in an external file and read it in. However, this introduces file I/O, which may be unnecessary or have a performance impact, so a solution along these lines should cache the file contents to reduce the number of reads.
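A sketch of that approach, with a trivial cache so each file is only read once (the paths and error handling are placeholders):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class TextBlocks {
        private static final Map<String, String> CACHE =
                new ConcurrentHashMap<String, String>();

        /** Loads a text resource from disk, caching it after the first read. */
        public static String load(String path) {
            String cached = CACHE.get(path);
            if (cached != null) return cached;
            try {
                String text = new String(Files.readAllBytes(Paths.get(path)),
                        StandardCharsets.UTF_8);
                CACHE.put(path, text);
                return text;
            } catch (IOException e) {
                throw new RuntimeException("Cannot load text block: " + path, e);
            }
        }
    }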
The closest option in Java to a heredoc is java.text.MessageFormat.
You cannot embed logic; it is a simple value-substitution utility. There are no named variables; you have to use zero-based placeholder indices. Just follow the Javadoc.
http://download.oracle.com/javase/1.5.0/docs/api/java/text/MessageFormat.html
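A small example of the zero-based placeholder substitution:

    import java.text.MessageFormat;

    public class MessageFormatDemo {
        public static void main(String[] args) {
            String template = "Dear {0},\n\nYour order {1} shipped on {2}.";
            String letter = MessageFormat.format(template, "Alice", "A-1047", "March 3");
            System.out.println(letter);
        }
    }

(Remember that MessageFormat treats a single quote as an escape character, so literal apostrophes in the template must be doubled.)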
While you could use certain formatters to convert and embed any text file or long literal as a Java string (e.g., with newline breaks, the necessary escapes, etc.), I can't really think of frequent situations where you would need these capabilities.
The trend in software is generally to separate code from the data it operates on. Large text sections, even if meant just for display or comparison, are data, and are thus typically stored externally. The cost of reading a file (or even caching the result in memory) is fairly low. Internationalization is easier. Changing is easier. Version control is easier. Other tools (e.g., spell checkers) can easily be used.
I agree that in the case of unit tests where you want to compare things against a mock, you need large-scale text comparisons. However, when you deal with such large files, you will typically have tests that work on several different large inputs to produce several large outputs, so why not just have your test load the appropriate files rather than inlining them?
The same goes for XML. In fact, for XML I would argue that in many cases you would want to read the XML and build a DOM tree, which you would then compare, rather than do a text compare that can be affected by whitespace. And manually creating an XML tree in your unit test is ugly.
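A sketch of that kind of comparison using the DOM Level 3 isEqualNode method. Note that it is still sensitive to whitespace-only text nodes between elements, which is one reason a dedicated library such as XMLUnit is usually more robust:

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import java.io.ByteArrayInputStream;

    public class XmlCompare {
        /** Parses both strings and compares the resulting DOM trees. */
        public static boolean sameXml(String a, String b) throws Exception {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document da = builder.parse(new ByteArrayInputStream(a.getBytes("UTF-8")));
            Document db = builder.parse(new ByteArrayInputStream(b.getBytes("UTF-8")));
            da.normalizeDocument(); // merge adjacent text nodes, etc.
            db.normalizeDocument();
            return da.isEqualNode(db);
        }
    }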