VTD XML (Java): VTDNavHuge write XPath result to file

I am experimenting with VTD XML because I frequently need to modify huge XML files (2-10GB or more).
I am trying to write an XPath query result back to a file.
Writing huge files in VTD XML is not obvious to me though:
The method getBytes() is "not implemented" for XMLMemMappedBuffer (see https://jar-download.com/javaDoc/com.ximpleware/vtd-xml/2.13/com/ximpleware/extended/XMLMemMappedBuffer.html)
One of the authors (?) gives a code example in this thread (last post, 2010-04-21): https://sourceforge.net/p/vtd-xml/discussion/379067/thread/a2e03ede/
However, the example is outdated, as
long la = vnh.getElementFragment();
no longer works: getElementFragment() now returns a long[] array (see https://jar-download.com/java-documentation-javadoc.php?a=vtd-xml&g=com.ximpleware&v=2.13)
Adapting the relevant lines like this
long[] la = vnh.getElementFragment();
vnh.getXML().writeToFileOutputStream(new FileOutputStream("c:/text2.xml"), (int)la[0], (int)la[1]);
results in the following error:
Exception in thread "main" java.nio.channels.ClosedChannelException
at sun.nio.ch.FileChannelImpl.ensureOpen(Unknown Source)
at sun.nio.ch.FileChannelImpl.transferTo(Unknown Source)
at com.ximpleware.extended.XMLMemMappedBuffer.writeToFileOutputStream(XMLMemMappedBuffer.java:104)
at WriteXML.main(WriteXML.java:16)
Questions:
Is this error due to any obvious mistake in the code?
What tools would you use to handle huge XML files (~10GB) efficiently? (Does not have to be Java.)
My goal is to do simple transformations or split the XML and write back to file with great performance. Thanks!

Can't answer your first question, but as to the second: if you're looking for a different technology, then streaming XSLT 3.0 is one to explore. I can't tell whether it's actually suitable without seeing more detail on your requirements.

First of all, to process XML of the huge size you mention, I suggest that you load the XML into memory using mem-map mode. And since VTD-XML doesn't alter the underlying byte format of the XML, you can easily imagine how many back-and-forth encoding/decoding and byte-moving operations are saved, and the performance advantage thereof.
As you have pointed out, XMLMemMappedBuffer's getBytes() is not implemented... this is deliberate, to avoid excessive memory usage when the fragment is very large.
The workaround is to use XMLMemMappedBuffer's writeToFileOutputStream() method to dump the fragment directly to the output. In other words, if you know the offset and length of the fragment, getBytes() is usually bypassable.
Below is the documented signature of that method.
public void writeToFileOutputStream(java.io.FileOutputStream ost, long os, long len) throws java.io.IOException
Write the segment (denoted by its offset and length) into an output file stream.
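Putting the answer together, here is a minimal sketch of that approach. It assumes the memory-mapped parsing mode and the AutoPilotHuge XPath classes of the extended (com.ximpleware.extended) API behave as in the standard VTD-XML examples; the file names and the XPath expression are placeholders. Note that writeToFileOutputStream() takes long offsets, so the (int) casts from the question are not needed.

import java.io.FileOutputStream;
import com.ximpleware.extended.AutoPilotHuge;
import com.ximpleware.extended.VTDGenHuge;
import com.ximpleware.extended.VTDNavHuge;

public class WriteXML {
    public static void main(String[] args) throws Exception {
        VTDGenHuge vg = new VTDGenHuge();
        // Parse in memory-mapped mode so the whole document never sits on the heap.
        if (!vg.parseFile("c:/input.xml", true, VTDGenHuge.MEM_MAPPED)) {
            throw new RuntimeException("parse failed");
        }
        VTDNavHuge vnh = vg.getNav();

        AutoPilotHuge ap = new AutoPilotHuge(vnh);
        ap.selectXPath("/root/someElement");          // placeholder XPath

        try (FileOutputStream fos = new FileOutputStream("c:/text2.xml")) {
            while (ap.evalXPath() != -1) {
                // getElementFragment() returns {offset, length} as a long[]
                long[] la = vnh.getElementFragment();
                // Pass the longs straight through instead of truncating them to int.
                vnh.getXML().writeToFileOutputStream(fos, la[0], la[1]);
            }
        }
    }
}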

Related

BufferedReader taking too long [duplicate]

This is about reading a file faster, not writing it.
I have a 150MB file which has a JSON object inside it. I currently use the following code to read it:
String filename ="/tmp/fileToRead";
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(filename), Charset.forName("UTF-8")));
decompressedString = reader.readLine();
reader.close();
JSONObject obj = new JSONObject(decompressedString);
JSONArray profileData = obj.getJSONObject("profileData").getJSONArray("children");
....
It is a single-line file and since it is JSON I can't split it (or at least I think so). Reading the file gives me an OutOfMemoryError or a TLE. The file takes more than 7 seconds to be read, and that results in the TLE since the execution of the whole code cannot go beyond 7 seconds. I get the OOM on decompressedString = reader.readLine();.
Is there a way I can reduce the memory used or the time it takes to be read completely?
You have several problems at hand:
You're preemptively parsing too much.
The error already happens when you read the line, since you said "I get the OOM on decompressedString = reader.readLine();".
You should never try to read data line by line. BufferedReader.readLine() will block until you've read the character \r or \n or the sequence \r\n. When processing data of arbitrary length, you're never sure you'll get one of those characters. Also, you're never sure you'll get one of those characters outside of the data itself. So your string may be too long or malformed. So don't ever pretend to know the format: BufferedReader.readLine() must be used when parsing, not when acquiring data.
You're not using an appropriate library for your use-case
Reading your JSON is important, yes, but you're reading too much at once. When creating your JSON, you might want to build it from a stream (one of InputStream, Reader or any nio's Channel/Buffer).
Currently you're making your JSON from a String. A huge one. So I can safely assume you're going to require, at some point, twice the memory you need: once for the String and once for the finalized object.
To reduce that, use an appropriate library to which you can pass one of the stream mentioned above. I mentioned in my comments the following: Gson, JSON.simple and Jackson.
Your file may be too big anyway.
You get your data, but you want to acquire only a subset of it (here, you want everything under {"profileData":{"children": <DATA>}}). But you probably receive way too much. How many elements exist at the same level as profileData? How many elements exist at the same level as children? Do you know? Probably way too many. Everything that is not under profileData.children is useless. What percentage of your total data is that? 50%? 90%? 99%?
To solve this, you probably want one of two things: you want less data or you want to be able to focus your request.
If you want less data, ask your data provider to give you less: only what you need. Why get more than that? It makes no sense. Tell him so and say "I want less".
If you want focused data, use a library that allows you to both parse and reduce the amount of data. You might want a library that lets you say: "parse this JSON and return only the profileData.children element". Unfortunately I know of no library that does it. If others do, please add a comment or answer. Apparently, Gson is able to do so if you use the JsonReader yourself and selectively use skipValue().
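To illustrate that last point, here is a rough sketch using Gson's streaming JsonReader to pull out only profileData.children and skip everything else. The file name and field names mirror the question; per-child processing is left as a placeholder and error handling is omitted.

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import com.google.gson.stream.JsonReader;

public class ProfileDataReader {
    public static void main(String[] args) throws Exception {
        try (JsonReader reader = new JsonReader(new InputStreamReader(
                new FileInputStream("/tmp/fileToRead"), StandardCharsets.UTF_8))) {
            reader.beginObject();                         // the top-level { ... }
            while (reader.hasNext()) {
                if ("profileData".equals(reader.nextName())) {
                    reader.beginObject();                 // profileData { ... }
                    while (reader.hasNext()) {
                        if ("children".equals(reader.nextName())) {
                            reader.beginArray();          // children [ ... ]
                            while (reader.hasNext()) {
                                // handle one child at a time instead of materializing the whole array
                                reader.skipValue();       // placeholder for real per-child processing
                            }
                            reader.endArray();
                        } else {
                            reader.skipValue();           // ignore siblings of "children"
                        }
                    }
                    reader.endObject();
                } else {
                    reader.skipValue();                   // ignore siblings of "profileData"
                }
            }
            reader.endObject();
        }
    }
}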

I need to parse, modify and write back Java source files? [duplicate]

I need to parse, modify and write back Java source files. I investigated some options, but it seems that I am missing the point.
The output of the parsed AST, when written back to a file, always has the formatting screwed up: it uses a standard format instead of the original one.
Basically I want something that can do: content(write(parse(sourceFile))).equals(content(sourceFile)).
I tried JavaParser but failed. I might use Eclipse JDT's parser as a stand-alone parser, but this feels heavy. I also would like to avoid rolling my own. JavaParser, for instance, already has column and line information, but writing the source back seems to ignore this information.
I would like to know how I can achieve parsing and writing back while the output looks the same as the input (indentation, lines, everything). Basically a solution that preserves the original formatting.
[Update]
The modifications I want to make are basically everything that is possible with the AST, like adding or removing implemented interfaces, removing/adding final on local variables, but also generating source for methods and constructors.
The idea is to add/remove anything, but the rest needs to remain untouched, especially the formatting of methods and expressions where the resulting line is wider than the page margin.
You may try using ANTLR4 with its java8 grammar file.
The grammar skips all whitespace by default, but based on token positions you may be able to reconstruct source that is close to the original.
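A sketch of what that could look like, with two assumptions: Java8Lexer/Java8Parser are the classes generated from the java8 grammar, and whitespace/comments are routed to the hidden channel instead of being skipped, so the original text stays in the token stream. With that in place, ANTLR's TokenStreamRewriter reproduces every token you don't touch byte-for-byte; the edit shown is just a placeholder.

import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.TokenStreamRewriter;

public class RewriteSource {
    public static void main(String[] args) throws Exception {
        // Java8Lexer/Java8Parser are generated from the java8 grammar file.
        Java8Lexer lexer = new Java8Lexer(CharStreams.fromFileName(args[0]));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        Java8Parser parser = new Java8Parser(tokens);
        parser.compilationUnit();                         // entry rule of the grammar

        TokenStreamRewriter rewriter = new TokenStreamRewriter(tokens);
        // Placeholder edit: insert a marker comment before the very first token.
        rewriter.insertBefore(tokens.get(0), "/* edited */ ");

        // Unedited tokens (including hidden-channel whitespace and comments)
        // are emitted exactly as they appeared in the input.
        System.out.println(rewriter.getText());
    }
}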
The output of a parser generated by REx is a sequence of events written to this interface:
public interface EventHandler
{
public void reset(CharSequence input);
public void startNonterminal(String name, int begin);
public void endNonterminal(String name, int end);
public void terminal(String name, int begin, int end);
public void whitespace(int begin, int end);
}
where the integers are offsets into the input. The event stream can be used to construct a parse tree. As the event stream completely covers all of the input, the resulting data structure can represent it without loss.
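For instance, a minimal EventHandler that simply replays the reported spans reproduces the input verbatim, which is the property that makes lossless round-tripping possible (a sketch, not part of REx itself):

public class EchoHandler implements EventHandler
{
    private CharSequence input;
    private final StringBuilder out = new StringBuilder();

    public void reset(CharSequence input)
    {
        this.input = input;
        out.setLength(0);
    }

    public void startNonterminal(String name, int begin) { /* structure only, no text */ }
    public void endNonterminal(String name, int end)     { /* structure only, no text */ }

    public void terminal(String name, int begin, int end)
    {
        out.append(input, begin, end);   // copy the token's characters unchanged
    }

    public void whitespace(int begin, int end)
    {
        out.append(input, begin, end);   // whitespace is reported too, so nothing is lost
    }

    public String result()
    {
        return out.toString();           // equal to the original input
    }
}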
There is a sample driver implementing XmlSerializer on top of this interface. It streams out an XML parse tree, which is just markup added to the input. Thus the string value of the XML document is identical to the original input.
To see it in action, use the Java 7 sample grammar and generate a parser with the command line options
-ll 2 -backtrack -tree -main -java
Then run the main method of the resulting Java.java, passing in some Java source file name.
Our DMS Software Reengineering Toolkit with its Java Front End can do this.
DMS is a program transformation system (PTS), designed to parse source code to an internal representation (usually ASTs), let you make changes to those trees, and regenerate valid output text for the modified tree.
Good PTSes will preserve your formatting/layout at places where you didn't change the code or generate nicely formatted results, including comments in the original source. They will also let you write source-to-source transformations in the form of:
if you see *this* pattern, replace it by *that* pattern
where pattern is written in the surface syntax of your targeted language (in this case, Java). Writing such transformations is usually a lot easier than writing procedural code to climb up and down the tree, inspecting and hacking individual nodes.
DMS has all these properties, including OP's request for idempotency of the null transform.
[Reacting to another answer: yes, it has a Java 8 grammar]

How to parse freedict files (*.dict and *.index)

I was searching for free translation dictionaries. Freedict (freedict.org) provides the ones I need, but I don't know how to parse the *.index and *.dict files. I also don't really know what to google to find useful information about these formats.
The *.index files look like this:
00databasealphabet QdGI l
00databasedictfmt1121 B b
00databaseinfo c 5o
00databaseshort 6E u
00databaseurl 6y c
00databaseutf8 A B
a BHO M
a bad risc BHa u
a bag of nerves BII 2
[...]
and the *.dict files:
[Lot of info stuff]
German-English FreeDict Dictionary ver. 0.3.4
Pipi machen /piːpiːmaxən/
to pee; to piss
(Aktien) zusammenlegen /aktsiːəntsuːzamənleːgən/
to merge (with)
[...]
I would be glad to see some example projects (preferably in python, but java, c, c++ are also ok) to understand how to handle these files.
It is a bit late; however, I hope that this can be useful for others like me.
JGoerzen wrote a library, dictdlib. You can see in more detail how he parses the .index and .dict files:
https://github.com/jgoerzen/dictdlib/blob/master/dictdlib.py
dictd considers its format of .index and .dict[.dz] as private, to reserve itself the right to change it in the future.
If you want to process it directly anyway, the index contains the headwords and the .dict[.dz] contains definitions. It is optionally compressed with a special modified gzip algorithm providing almost random access, which gzip normally does not. The index contains 3 columns per line, tab separated:
The headword for looking up the definition.
The absolute byte position of the definition in the .dict[.dz] file, base64 encoded.
The length of the definition in bytes, base64 encoded.
For more details see the dict(8) man page (section Database Format), which you should have found in your research before asking your question. For processing the headwords correctly, you'd have to consider encoding and character collation.
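If you do want to decode the index directly, here is a rough Java sketch under those assumptions: tab-separated columns, offsets and lengths encoded as numbers in the base-64 alphabet A-Z a-z 0-9 + /, and an uncompressed .dict file (for .dict.dz you would first need the dictzip-aware decompression mentioned above). The file names are placeholders.

import java.io.BufferedReader;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FreedictIndex {
    private static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // dictd encodes offsets/lengths as numbers in this base-64 alphabet, most significant digit first.
    static long decodeNumber(String s) {
        long value = 0;
        for (int i = 0; i < s.length(); i++) {
            value = value * 64 + ALPHABET.indexOf(s.charAt(i));
        }
        return value;
    }

    public static void main(String[] args) throws Exception {
        try (BufferedReader index = Files.newBufferedReader(Paths.get("deu-eng.index"), StandardCharsets.UTF_8);
             RandomAccessFile dict = new RandomAccessFile("deu-eng.dict", "r")) {
            String line;
            while ((line = index.readLine()) != null) {
                String[] cols = line.split("\t");     // headword, offset, length
                if (cols.length != 3) continue;
                byte[] definition = new byte[(int) decodeNumber(cols[2])];
                dict.seek(decodeNumber(cols[1]));
                dict.readFully(definition);
                System.out.println(cols[0] + " -> " + new String(definition, StandardCharsets.UTF_8));
            }
        }
    }
}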
It might be better to use an existing library to read dictd databases, but that really depends on whether the library is good (no experience here).
Finally, as you noted yourself, XML is made exactly for easy processing. You could extract the headwords and translations using XPath, leaving out all the grammatical stuff and no need to bother parsing anything.
After getting this far, the next problem would be that there is no one-to-one mapping between words in different languages...

Java Serialization to transfer data between any language

Question:
Instead of writing my own serialization algorithm, would it be possible to just use the built-in Java serialization, like I have done below, while still having it work across multiple languages?
Explanation:
How I imagine it working would be as follows: I start up a process that will be a language-specific program, written in that language. So I'd have a CppExecutor.exe file, for example. I would write data to a stream to this program. The program would then do what it needs to do, then return a result.
To do this, I would need to serialize the data in some way. The first thing that came to mind was the basic Java Serialization with the use of an ObjectInputStream and ObjectOutputStream. Most of what I have read has only stated that the Java serialization is Java-to-Java applications.
None of the data will ever need to be stored in a file. The method of transferring these packets would be through a java.lang.Process, which I have set up already.
The data will be composed of the following:
String - Mostly containing information that is displayed to the user.
Integer - most likely 32-bit. Won't need to deal with times.
Float - just to handle all floating-point values.
Character - to ensure proper types are used.
Array - Composed of any of the elements in this list.
The best way I have worked out how to do this is as follows: I would start with a 4-byte magic number, just to ensure we are working with the correct data. Following that, I would have an integer specifying how many elements there are. After that, for each of the elements I would have: a single byte signifying the data type (of the above), followed by any crucial information, e.g. the length for Strings and Arrays. Then the data itself follows.
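A quick sketch of writing such a packet with a DataOutputStream (big-endian, which both sides must agree on); the magic number, type tags, and the decision to leave Arrays out are placeholders for whatever you settle on.

import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class PacketWriter {
    private static final int MAGIC = 0xCAFE0001;                    // placeholder magic number
    private static final byte T_STRING = 1, T_INT = 2, T_FLOAT = 3, T_CHAR = 4;

    // Writes one packet: magic, element count, then one type-tagged element per entry.
    static void writePacket(OutputStream processStdin, Object[] elements) throws IOException {
        DataOutputStream out = new DataOutputStream(processStdin);
        out.writeInt(MAGIC);
        out.writeInt(elements.length);
        for (Object e : elements) {
            if (e instanceof String) {
                byte[] bytes = ((String) e).getBytes(StandardCharsets.UTF_8);
                out.writeByte(T_STRING);
                out.writeInt(bytes.length);                          // length prefix, then raw bytes
                out.write(bytes);
            } else if (e instanceof Integer) {
                out.writeByte(T_INT);
                out.writeInt((Integer) e);
            } else if (e instanceof Float) {
                out.writeByte(T_FLOAT);
                out.writeFloat((Float) e);
            } else if (e instanceof Character) {
                out.writeByte(T_CHAR);
                out.writeChar((Character) e);
            }
        }
        out.flush();                                                 // push the packet to the child process
    }
}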
Side-notes:
I would also like to point out that a lot of these calculations will be taking place, where every millisecond could matter. Due to this, a text-based format (such as JSON) may produce far larger operation times. Considering that none of the packets would need to be interpreted by a human, using only bytes wouldn't be an issue.
I'd recommend Google protobuf: it is binary, stable, proven, and has bindings for all languages you've mentioned. Moreover, it also handles structured data nicely.
There is a binary JSON format called BSON.
I would also like to point out that a lot of these calculations will be taking place, so a text-based format (such as JSON) may produce far larger operation times.
Do not optimize before you have measured.
Premature optimization is the root of all evil.
Can you give it a try and benchmark the throughput to see if it fits your needs?
Thrift, Protobuf, JSON, MessagePack
complexity of installation: Thrift >> Protobuf > BSON > MessagePack > JSON
serialization data size: JSON > MessagePack > Binary Thrift > Compact Thrift > Protobuf
time cost: Compact Thrift > Binary Thrift > Protobuf > JSON > MessagePack

Algorithm to convert a double to a char array in Java (without using objects like Double.toString or StringBuilder)?

Does anyone know where I can find that algorithm? It takes a double and StringBuilder and appends the double to the StringBuilder without creating any objects or garbage. Of course I am not looking for:
sb.append(Double.toString(myDouble));
// or
sb.append(myDouble);
I tried poking around the Java source code (I am sure it does it somehow) but I could not see any block of code/logic clear enough to be re-used.
I have written this for ByteBuffer. You should be able to adapt it. Writing it to a direct ByteBuffer saves you having to convert it to bytes or copy it into "native" space.
See public ByteStringAppender append(double d)
If you are logging this to a file, you might use the whole library as it can write around 20 million doubles per second sustained. It can do this without system calls as it writes to a memory mapped file.
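If a fixed number of decimal places is acceptable, one simplified allocation-free approach is to split the double into an integer part and a scaled fractional part and emit the digits yourself. The sketch below appends into an existing StringBuilder; it deliberately ignores NaN, infinities, exponent notation and exact round-tripping, which the JDK's converter and the library above do handle.

public final class DoubleAppender {

    // Appends 'value' to 'sb' with the given number of decimal places,
    // creating no intermediate objects. Not a general replacement for
    // Double.toString(): NaN, infinities and very large values are not handled.
    public static void append(StringBuilder sb, double value, int decimalPlaces) {
        if (value < 0) {
            sb.append('-');
            value = -value;
        }
        long scale = 1;
        for (int i = 0; i < decimalPlaces; i++) scale *= 10;

        long scaled = Math.round(value * scale);       // fixed-point representation
        long intPart = scaled / scale;
        long fracPart = scaled % scale;

        sb.append(intPart);                            // long overload: no boxing, no temporary String
        if (decimalPlaces > 0) {
            sb.append('.');
            for (long d = scale / 10; d > 0; d /= 10) {
                sb.append((char) ('0' + (fracPart / d) % 10));   // digits, most significant first
            }
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        append(sb, 3.14159, 4);
        System.out.println(sb);                        // prints 3.1416
    }
}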
