Algorithms for string processing

Algorithms for string processing - java

I have a question which make me think about how to improve speed and memory of system.
I will describe it by example, I have this file which have some string:
<e>Customer</e>
<a1>Customer Id</a1>
<a2>Customer Name</a2>
<e>Person</e>
It similar to xml file.
Now, my solution is when I read <e>Customer</e>, I will read from that to a nearest tag and then, substring from <e>Customer</e> to a nearest tag.
It make the system need to process so much. I used only regular expression to do it. I thought I will do the same as real compiler which have some phases (lexical analysis, parser).
Any ideas?
Thanks in advance!

If you really don't want to use one of the free and reliable xml parsers then a truly fast solution will almost certainly involve a state machine.
See this How to create a simple state machine in java question for a good start.
Please be sure to have a very good reason for taking this route.

Regular expressions are not the right tool for parsing complex structures like this. Since your file looks a lot like XML, it may make sense to add what's missing to make it XML (i.e. the header), and feed the result to an XML parser.
XML parsers are optimized for processing large volumes of data quickly (especially the SAX kind). You should see a significant improvement in performance if you switch to parsing XML from processing large volumes of text with regular expressions.

Just don't invest the time into an XML lexer/parser (its not worth it) and use what is allready out there.
For example http://www.mkyong.com/tutorials/java-xml-tutorials/ is a good tutorial,just use google.

Related

Memory efficient way to modify XML in Java

I need to modify a single information in XML file . XML file is about 100 lines . For modifying a single element in whole XML file what would be the most memory efficient way in JAVA ?
JAXB is better ?
Simple SAX parser ?
or any other way .....Kindly suggest .....

SAX parser gives more control on parsing and is faster than DOM parser. JAXB will be easy from the sense of less code writing. XStream is also another option but that is similar to JAXB which is a high level API, so it has some overhead task so it will be bit slower then SAX.
I will not suggest for direct string manipulation (applying String.indexOf() and String.replace()) although would be fastest way for updating a unique tag in the XML but its risky as your XML might not be valid and if xml structure is not simple then there will be risk of updating wrong level tag :-)
Therefore, SAX parser looks the best bet to me.

Your files are not big. The memory used to hold a 100-line XML file costs about as much as 5 milliseconds of a programmer's time. I would question your requirement: why do you need to do it in "the most memory efficient way"? I would use XSLT or JDOM2, unless there is clear quantified information that this will not meet externally-imposed performance requirement, which cannot be solved by buying a bit more memory.

Fastest way to read a CSV?

I've profiled my application and it seems like one of my biggest bottlenecks at the moment is the String.split method. It's taking up 21% of my runtime, and the other big contributors aren't parts that I can streamline anymore than they are. It also seems like all of the newly-created String objects are causing issues with the garbage collector, although I'm less clear whether or not that's the case.
I'm reading in a gzipped file comma-separated values that contain financial data. The number of fields in each row varies depending on what kind of record it is, and the size of each field varies too. What's the fastest way to read the data in, creating the fewest intermediate objects?
I saw this thread but none of the answers give any evidence that OpenCSV is any faster than String.split, and they all seem to focus on using an external library rather than writing new code. I'm also very concerned about memory overhead, because I spend another 20% or so of the total runtime doing garbage collection. I would like to just return views of the string in question, but it looks like that's not possible anymore.

A quicker way is to use just a simple StringTokenizer. It doesn't have the regex overhead of split() and it's in the JDK.

If you do not want to use a library, then an alternative to StringTokenizer would be to write a simple state machine to parse your CSV. Tokenizers can have problems with commas embedded in fields. CSV is a reasonably simple format, so it is not difficult to build a state machine to handle it. If you know exactly what the format of the input file is, then you can simplify it even further since you will not have to deal with any possibilities not present in your specific file.
Numeric data could potentially be converted direct to int on the fly, without having to hold a large number of strings simultaneously.

Use uniVocity-parsers to parse your CSV file. It is suite of parsers for tabular text formats and its CSV parser is the fastest among all other parsers for Java (as you can see here, and here). Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
We used the architecture provided by this framework to build a custom parser for MySQL dump files for this project. We managed to parse a 42GB dump file in 15 minutes (1+ billion rows).
It should solve your problem.

Should I split a large file before running multiple regexes?

I have an input that's about 35KB of text that I need to pull a bunch of small bits of data from. I use multiple regexes to find the data, and that part works fine.
My question: should I split the large text into multiple smaller strings and run the appropriate regexes on each string, or just keep it in one big string and reset the matcher for each regex? Which way is best for efficiency?

If it isn't running too slow then go with whatever you currently have that is working fast enough.
Otherwise, you shouldn't be using raw regexes for this task anyway. As soon as you mention "multiple regexes" extracting "small bits of data" from, you are talking about writing a parser and should use a decent parsing tool.
As you are using java I would recommend starting with jFlex, which is a mature java implementation of an extremely mature and stable C tool.
For most tasks jFlex will be all you need, but it also integrates smoothly with a number of java parser-generators should the problem prove to be more complicated. My personal preference is the slightly obscure Beaver.
Of course, if you can implement it as a set of regexes it isn't more complicated and jFlex will do the job for you.

XML transformation in Java with best Performance

I want to do some manipulation on xml content in Java. See below xml
From Source XML:
<ns1:Order xmlns:ns1="com.test.ns" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<OrderHeader>
<Image>Image as BinaryData of size 250KB</Image>
</OrderHeader>
</ns1:Order>
Target XML:
<OrderData>
<OrderHeader>
<Image>Image as BinaryData of size 250KB</Image>
</OrderHeader>
</OrderData>
As shown, I have Source xml and I want target xml for that .. The only difference we can observe is root_element "ns1:Order" is replace with "OrderData" in target xml.
Fyi, OrderHeader has one sub-element Image which holds binary image of 250KB (so this xml going to be large one) .. also root element of target xml "OrderData" is well-known in advance.
Now, I want to achieve above result in java with best performance .. I have Source xml content already as byte[] and I want target xml content also as byte[] .. I am open to use Sax parser too.
Please provide the solution which has best performance for doing above stuff.
Thanks in advance,
Nurali

Do you mean machine performance or human performance? Spending an infinite amount of programmer time to achieve a microscopic gain in machine performance is a strange trade-off to make these days, when a powerful computer costs about the same as half a day of a contract programmer's time.
I would recommend using XSLT. It might not be fastest, but it will be fast enough. For a simple transformation like this, XSLT performance will be dominated by parsing and serialization costs, and those won't be any worse than for any other solution.

Not much will beat direct bytes/String manipulation, for instance, a regular expression.
But be warned, manipulating XML with Regex is always a hot debate

I used XLST to transform XML documents. That's another way to do it. There are several Java implementations of XLST processors.

The fastest way to manipulate strings in Java is using direct manipulation and the StringBuilder for the results. I wrote code to modify 20 mb strings that built a table of change locations and then copied and modified the string into a new StringBuilder. For Strings XSLT and RegEx are much slower than direct manipulation and SAX/DOM parsers are slower still.

Searching for regex patterns on a 30GB XML dataset. Making use of 16gb of memory

I currently have a Java SAX parser that is extracting some info from a 30GB XML file.
Presently it is:
reading each XML node
storing it into a string object,
running some regexex on the string
storing the results to the database
For several million elements. I'm running this on a computer with 16GB of memory, but the memory is not being fully utilized.
Is there a simple way to dynamically 'buffer' about 10gb worth of data from the input file?
I suspect I could manually take a 'producer' 'consumer' multithreaded version of this (loading the objects on one side, using them and discarding on the other), but damnit, XML is ancient now, are there no efficient libraries to crunch em?

Just to cover the bases, is Java able to use your 16GB? You (obviously) need to be on a 64-bit OS, and you need to run Java with -d64 -XMx10g (or however much memory you want to allocate to it).
It is highly unlikely memory is a limiting factor for what you're doing, so you really shouldn't see it fully utilized. You should be either IO or CPU bound. Most likely, it'll be IO. If it is, IO, make sure you're buffering your streams, and then you're pretty much done; the only thing you can do is buy a faster harddrive.
If you really are CPU-bound, it's possible that you're bottlenecking at regex rather than XML parsing.
See this (which references this)
If your bottleneck is at SAX, you can try other implementations. Off the top of my head, I can think of the following alternatives:
StAX (there are multiple implementations; Woodstox is one of the fastest)
Javolution
Roll your own using JFlex
Roll your own ad hoc, e.g. using regex
For the last two, the more constrained is your XML subset, the more efficient you can make it.
It's very hard to say, but as others mentioned, an XML-native database might be a good alternative for you. I have limited experience with those, but I know that at least Berkeley DB XML supports XPath-based indices.

First, try to find out what's slowing you down.
How much faster is the parser when you parse from memory?
Does using a BufferedInputStream with a large size help?
Is it easy to split up the XML file? In general, shuffling through 30 GiB of any kind of data will take some time, since you have to load it from the hard drive first, so you are always limited by the speed of this. Can you distribute the load to several machines, maybe by using something like Hadoop?

No Java experience, sorry, but maybe you should change the parser? SAX should work sequentially and there should be no need to buffer most of the file ...

SAX is, essentially, "event driven", so the only state you should be holding on to from element to element is state that relevant to that element, rather than the document as a whole. What other state are you maintaining, and why? As each "complete" node (or set of nodes) comes by, you should be discarding them.

I don't really understand what you're trying to do with this huge amount of XML, but I get the impression that
using XML was wrong for the data stored
you are buffering way beyond what you should do (and you are giving up all advantages of SAX parsing by doing so)
Apart from that: XML is not ancient and in massive and active use. What do you think all those interactive web sites are using for their interactive elements?

Are you being slowed down by multiple small commits to your db? Sounds like you would be writing to the db almost all the time from your program and making sure you don't commit too often could improve performance. Possibly also preparing your statements and other standard bulk processing tricks could help
Other than this early comment, we need more info - do you have a profiler handy that can scrape out what makes things run slowly

You can use the Jibx library, and bind your XML "nodes" to objects that represent them. You can even overload an ArrayList, then when x number of objects are added, perform the regexes all at once (presumably using the method on your object that performs this logic) and then save them to the database, before allowing the "add" method to finish once again.
Jibx is hosted on SourceForge: Jibx
To elaborate: you can bind your XML as a "collection" of these specialized String holders. Because you define this as a collection, you must choose what collection type to use. You can then specify your own ArrayList implementation.
Override the add method as follows (forgot the return type, assumed void for example):
public void add(Object o) {
super.add(o);
if(size() > YOUR_DEFINED_THRESHOLD) {
flushObjects();
}
}
YOUR_DEFINED_THRESHOLD
is how many objects you want to store in the arraylist until it has to be flushed out to the database. flushObjects(); is simply the method that will perform this logic. The method will block the addition of objects from the XML file until this process is complete. However, this is ok, the overhead of the database will probably be much greater than file reading and parsing anyways.

I would suggest to first import your massive XML file into a native XML database (such as eXist if you are looking for open source stuff, never tested it myself), and then perform iterative paged queries to process your data small chunks at a time.

You may want to try Stax instead of SAX, I hear it's better for that sort of thing (I haven't used it myself).

If the data in the XML is order independent, can you multi-thread the process to split the file up or run multiple processes starting in different locations in the file? If you're not I/O bound that should help speed it along.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Algorithms for string processing - java

If you really don't want to use one of the free and reliable xml parsers then a truly fast solution will almost certainly involve a state machine. See this How to create a simple state machine in java question for a good start. Please be sure to have a very good reason for taking this route.

Just don't invest the time into an XML lexer/parser (its not worth it) and use what is allready out there. For example http://www.mkyong.com/tutorials/java-xml-tutorials/ is a good tutorial,just use google.

Related

Memory efficient way to modify XML in Java

Fastest way to read a CSV?

Should I split a large file before running multiple regexes?

XML transformation in Java with best Performance

Searching for regex patterns on a 30GB XML dataset. Making use of 16gb of memory

Categories

Resources