I need to parse, modify, and write back Java source files. I investigated some options, but it seems I am missing something.
Whenever the parsed AST is written back to a file, the formatting is mangled: the output uses a standard format rather than the original one.
Basically I want something that can do: content(write(parse(sourceFile))).equals(content(sourceFile)).
I tried JavaParser but failed. I could use Eclipse JDT's parser as a standalone parser, but that feels heavyweight, and I would also like to avoid rolling my own. JavaParser, for instance, already has column and line information, but writing it back seems to ignore that information.
I would like to know how I can parse and write back so that the output looks the same as the input (indentation, lines, everything): basically, a solution that preserves the original formatting.
[Update]
The modifications I want to make cover basically everything that is possible with the AST: adding and removing implemented interfaces, removing/adding final on local variables, but also generating source for methods and constructors.
The idea is that anything may be added or removed, but the rest needs to remain untouched, especially the formatting of methods and expressions, even if the resulting line grows beyond the page margin.
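To make the requirement concrete, here is a sketch of the check I want to pass. It uses JavaParser's LexicalPreservingPrinter, which newer JavaParser releases ship for exactly this purpose (whether it covers my version and all of my edits is exactly what I am unsure about):

import java.nio.file.*;
import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.printer.lexicalpreservation.LexicalPreservingPrinter;

// Sketch of the round-trip invariant from above:
// content(write(parse(sourceFile))).equals(content(sourceFile))
// Assumes a JavaParser version that ships LexicalPreservingPrinter.
public class RoundTripCheck {
    public static void main(String[] args) throws Exception {
        String content = new String(Files.readAllBytes(Paths.get(args[0])));
        CompilationUnit cu = StaticJavaParser.parse(content);
        LexicalPreservingPrinter.setup(cu);           // record the original tokens
        String written = LexicalPreservingPrinter.print(cu);
        System.out.println(written.equals(content));  // should print true
    }
}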
You may try ANTLR4 with its Java 8 grammar file.
The grammar skips all whitespace by default, but based on token positions you may be able to reconstruct source that is close to the original.
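A minimal sketch of that approach (an assumption-laden sketch, not a drop-in solution): it uses the Java8Lexer/Java8Parser classes that ANTLR generates from the grammars-v4 Java8 grammar, and it assumes the grammar's whitespace and comment rules are changed from "-> skip" to "-> channel(HIDDEN)" so those tokens stay in the token stream:

import org.antlr.v4.runtime.*;

// Sketch only: round-trips a Java file through ANTLR's token stream.
// Java8Lexer/Java8Parser are generated from the grammars-v4 Java8 grammar,
// with whitespace and comments routed to the hidden channel, not skipped.
public class AntlrRoundTrip {
    public static void main(String[] args) throws Exception {
        CharStream input = CharStreams.fromFileName(args[0]);
        Java8Lexer lexer = new Java8Lexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        Java8Parser parser = new Java8Parser(tokens);
        parser.compilationUnit();   // parse; the tree is not needed for the round trip

        TokenStreamRewriter rewriter = new TokenStreamRewriter(tokens);
        // With no edits, getText() reproduces the input verbatim. Targeted
        // edits (insertAfter, replace, delete) touch only the named tokens
        // and leave all other formatting intact.
        System.out.print(rewriter.getText());
    }
}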
The output of a parser generated by REx is a sequence of events written to this interface:
public interface EventHandler
{
    public void reset(CharSequence input);
    public void startNonterminal(String name, int begin);
    public void endNonterminal(String name, int end);
    public void terminal(String name, int begin, int end);
    public void whitespace(int begin, int end);
}
where the integers are offsets into the input. The event stream can be used to construct a parse tree. As the event stream completely covers all of the input, the resulting data structure can represent it without loss.
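For instance, a minimal handler (a sketch, not part of REx) that echoes every terminal and whitespace span reconstructs the input exactly, which is the property that makes lossless rewriting possible:

// Sketch only: rebuilds the input from the event stream by echoing every
// terminal and whitespace span. Because the events cover all of the input,
// the result equals the original text.
public class EchoHandler implements EventHandler
{
    private CharSequence input;
    private final StringBuilder out = new StringBuilder();

    public void reset(CharSequence input) { this.input = input; out.setLength(0); }
    public void startNonterminal(String name, int begin) { }
    public void endNonterminal(String name, int end) { }
    public void terminal(String name, int begin, int end) { out.append(input, begin, end); }
    public void whitespace(int begin, int end) { out.append(input, begin, end); }

    public String text() { return out.toString(); }
}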
There is a sample driver implementing XmlSerializer on top of this interface. It streams out an XML parse tree, which is just markup added to the input; thus the string value of the XML document is identical to the original input.
To see it in action, use the Java 7 sample grammar and generate a parser with the command-line options
-ll 2 -backtrack -tree -main -java
Then run the main method of the resulting Java.java, passing in some Java source file name.
Our DMS Software Reengineering Toolkit with its Java Front End can do this.
DMS is a program transformation system (PTS), designed to parse source code to an internal representation (usually ASTs), let you make changes to those trees, and regenerate valid output text for the modified tree.
Good PTSes will preserve your formatting/layout in the places where you didn't change the code, or generate nicely formatted results, and they keep the comments from the original source. They will also let you write source-to-source transformations of the form:
if you see *this* pattern, replace it by *that* pattern
where each pattern is written in the surface syntax of your target language (in this case, Java). Writing such transformations is usually a lot easier than writing procedural code that climbs up and down the tree, inspecting and hacking individual nodes.
DMS has all these properties, including the OP's requested property that the null transform reproduces the input exactly.
[Reacting to another answer: yes, it has a Java 8 grammar]
Related
I am experimenting with VTD XML because I frequently need to modify huge XML files (2-10GB or more).
I am trying to write an XPath query result back to a file.
Writing huge files in VTD XML is not obvious to me though:
The method getBytes() is "not implemented" for XMLMemMappedBuffer (see https://jar-download.com/javaDoc/com.ximpleware/vtd-xml/2.13/com/ximpleware/extended/XMLMemMappedBuffer.html)
One of the authors (?) gives a code example in this thread (last post, 2010-04-21): https://sourceforge.net/p/vtd-xml/discussion/379067/thread/a2e03ede/
However, the example is outdated, as
long la = vnh.getElementFragment();
now returns an array (long[]; see https://jar-download.com/java-documentation-javadoc.php?a=vtd-xml&g=com.ximpleware&v=2.13).
Adapting the relevant lines like this
long[] la = vnh.getElementFragment();
vnh.getXML().writeToFileOutputStream(new FileOutputStream("c:/text2.xml"), (int)la[0], (int)la[1]);
results in the following error:
Exception in thread "main" java.nio.channels.ClosedChannelException
at sun.nio.ch.FileChannelImpl.ensureOpen(Unknown Source)
at sun.nio.ch.FileChannelImpl.transferTo(Unknown Source)
at com.ximpleware.extended.XMLMemMappedBuffer.writeToFileOutputStream(XMLMemMappedBuffer.java:104)
at WriteXML.main(WriteXML.java:16)
Questions:
Is this error due to any obvious mistake in the code?
What tools would you use to handle huge XML files (~10GB) efficiently? (It does not have to be Java.) My goal is to do simple transformations, or to split the XML and write it back to file, with good performance. Thanks!
I can't answer your first question, but as to the second: if you're looking for a different technology, streaming XSLT 3.0 is one to explore. I can't tell whether it's actually suitable without seeing more detail on your requirements.
First of all, to process XML of the huge size you mention, I suggest that you load the XML into memory using memory-mapped mode. Since vtd-xml doesn't alter the underlying byte format of the XML, you can easily imagine saving a lot of back-and-forth encoding/decoding and byte-moving operations, and the performance advantage thereof.
As you have pointed out, XMLMemMappedBuffer's getBytes() is not implemented; this is deliberate, to avoid excessive memory usage when the fragment is very large.
The workaround is to use XMLMemMappedBuffer's writeToFileOutputStream() method to dump the fragment directly to the output. In other words, if you know the offset and length of the fragment, getBytes() is often bypassable.
Below is the documented signature of that method.
public void writeToFileOutputStream(java.io.FileOutputStream ost,
                                    long os,
                                    long len)
                             throws java.io.IOException
Writes the segment (denoted by its offset and length) into an output file stream.
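Putting that together with the code from the question, here is a minimal sketch of the intended flow. It does not diagnose the ClosedChannelException; the file names and the XPath are placeholders, and it assumes the extended, memory-mapped classes from com.ximpleware.extended:

import java.io.FileOutputStream;
import com.ximpleware.extended.*;

// Sketch only: dumps the first element matched by an XPath into a file,
// using the extended (huge-file) API in memory-mapped mode.
public class WriteFragment {
    public static void main(String[] args) throws Exception {
        VTDGenHuge vg = new VTDGenHuge();
        if (!vg.parseFile("c:/text.xml", true, VTDGenHuge.MEM_MAPPED))
            throw new RuntimeException("parse failed");
        VTDNavHuge vn = vg.getNav();
        AutoPilotHuge ap = new AutoPilotHuge(vn);
        ap.selectXPath("/root/record");              // placeholder XPath
        if (ap.evalXPath() != -1) {
            long[] la = vn.getElementFragment();     // {offset, length}
            try (FileOutputStream fos = new FileOutputStream("c:/text2.xml")) {
                // pass the longs straight through; casting to int would
                // truncate offsets/lengths beyond 2GB
                vn.getXML().writeToFileOutputStream(fos, la[0], la[1]);
            }
        }
    }
}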
I have a big .pm file, which consists only of a very big Perl hash with lots of subhashes. I have to load this hash into a Java program, do some work and make changes to the underlying data, and save it back into a .pm file, which should look similar to the one I started with.
So far, I have tried to convert it line by line with regex and string matching, turning it into an XML document and later parsing it element-wise back into a Perl hash.
This somewhat works, but seems quite dodgy. Is there a more reliable way to parse the Perl hash without having a Perl runtime installed?
You're quite right, it's utterly filthy. Regex and string matching for XML is a horrible idea in the first place, and honestly, XML is probably not a good fit for this anyway.
I would suggest that you consider JSON. I would be stunned to find that Java can't handle JSON, and it is inherently a hash-and-array-oriented data format.
So you can quite literally:
use JSON;
print to_json ( $data_structure, { pretty => 1 } );
Note - it won't work for serialising objects, but for perl hash/array/scalar type structures it'll work just fine.
You can then import it back into perl using:
my $new_data = from_json $string;
print Dumper $new_data;
You could Dumper it to a file, but given that your requirement is multi-language going forward, just using native JSON as your at-rest data is probably the more sensible choice.
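On the Java side, a minimal sketch, assuming the Jackson library (any of the usual JSON libraries would do; the file name is a placeholder). The Perl hash arrives as nested Maps and Lists, which you can modify and write back:

import java.io.File;
import java.util.Map;
import com.fasterxml.jackson.databind.ObjectMapper;

// Sketch only, assuming Jackson: a Perl hash serialised as JSON becomes
// nested Maps/Lists, which can be modified in place and written back.
public class HashRoundTrip {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        @SuppressWarnings("unchecked")
        Map<String, Object> data = mapper.readValue(new File("data.json"), Map.class);

        data.put("processed", Boolean.TRUE);   // example modification

        mapper.writerWithDefaultPrettyPrinter().writeValue(new File("data.json"), data);
    }
}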
But if you're looking at parsing Perl code within Java, without a Perl interpreter? No, that's just insanity.
I tried using the javalang module, available in Python, to get the AST of Java source code, but it requires an entire class to generate the AST. Passing a block of code like an 'if' statement throws an error. Is there any other way of doing it?
PS: I am preferably looking for a Python module to do the task.
Thanks
Javalang can parse snippets of Java code:
>>> import javalang
>>> tokens = javalang.tokenizer.tokenize('System.out.println("Hello " + "world");')
>>> parser = javalang.parser.Parser(tokens)
>>> parser.parse_expression()
MethodInvocation
In case the OP is also interested in a non-Python answer:
Our DMS Software Reengineering Toolkit with its Java Front End can accomplish this.
DMS is a general-purpose tool for parsing/analyzing/transforming code, parameterized by language definitions (including grammars). Given a language definition, DMS can easily be invoked on a source file/stream representing the goal symbol of a grammar by calling the Parse method offered by the language parameter, and DMS will build a tree for the parsed string. Special support is provided for parsing source files/streams for arbitrary nonterminals as defined by the language grammar; DMS will build an AST whose root is that nonterminal, parsing the source according to the subgrammar defined by that nonterminal.
Once you have the AST, DMS provides lots of support for visiting the AST, inspecting/modifying nodes, and carrying out source-to-source transformations on the AST using surface-syntax rewrite rules. Finally, you can prettyprint the modified AST and get back valid source code. (If you have only parsed a code fragment for a nonterminal, what you get back is valid code for that nonterminal.)
If the OP is willing to compare complete files instead of snippets, our Smart Differencer might be useful out of the box. The SmartDifferencer builds ASTs of its two input files, finds the smallest set of conceptual edits (insert, delete, move, copy, rename) over structured code elements that explains the differences, and reports that difference.
I was searching for free translation dictionaries. FreeDict (freedict.org) provides the ones I need, but I don't know how to parse the *.index and *.dict files. I also don't really know what to google to find useful information about these formats.
The *.index files look like this:
00databasealphabet QdGI l
00databasedictfmt1121 B b
00databaseinfo c 5o
00databaseshort 6E u
00databaseurl 6y c
00databaseutf8 A B
a BHO M
a bad risc BHa u
a bag of nerves BII 2
[...]
and the *.dict files:
[Lot of info stuff]
German-English FreeDict Dictionary ver. 0.3.4
Pipi machen /piːpiːmaxən/
to pee; to piss
(Aktien) zusammenlegen /aktsiːəntsuːzamənleːgən/
to merge (with)
[...]
I would be glad to see some example projects (preferably in Python, but Java, C, or C++ are also OK) to understand how to handle these files.
This comes late; however, I hope it can be useful for others like me.
JGoerzen wrote the dictdlib library. You can see in more detail how he parses the .index and .dict files:
https://github.com/jgoerzen/dictdlib/blob/master/dictdlib.py
dictd considers the format of its .index and .dict[.dz] files private, reserving the right to change it in the future.
If you want to process it directly anyway: the index contains the headwords, and the .dict[.dz] file contains the definitions, optionally compressed with a specially modified gzip algorithm that provides nearly random access, which plain gzip does not. The index contains three tab-separated columns per line:
The headword for looking up the definition.
The absolute byte position of the definition in the .dict[.dz] file, base64 encoded.
The length of the definition in bytes, base64 encoded.
For more details, see the dict(8) man page (section Database Format), which you should have found in your research before asking your question. To process the headwords correctly, you'd also have to consider encoding and character collation.
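A minimal sketch of that decoding, assuming an uncompressed .dict file and the conventional base-64 digit alphabet (A-Z, a-z, 0-9, +, /) for the offset and length columns; the file names are placeholders:

import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

// Sketch only: looks up one headword in a dictd .index file and prints its
// definition from the matching, uncompressed .dict file.
public class DictLookup {
    private static final String B64 =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // The offset/length columns are base-64 numbers, not base64-encoded bytes.
    static long decodeB64Number(String s) {
        long value = 0;
        for (char c : s.toCharArray()) value = value * 64 + B64.indexOf(c);
        return value;
    }

    public static void main(String[] args) throws Exception {
        String headword = args[0];
        for (String line : Files.readAllLines(Paths.get("deu-eng.index"),
                                              StandardCharsets.UTF_8)) {
            String[] cols = line.split("\t");   // headword, offset, length
            if (!cols[0].equals(headword)) continue;
            byte[] buf = new byte[(int) decodeB64Number(cols[2])];
            try (RandomAccessFile dict = new RandomAccessFile("deu-eng.dict", "r")) {
                dict.seek(decodeB64Number(cols[1]));
                dict.readFully(buf);
            }
            System.out.print(new String(buf, StandardCharsets.UTF_8));
            return;
        }
        System.err.println("not found: " + headword);
    }
}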
It might well be better to use an existing library to read dictd databases, but that really depends on whether the library is good (no experience here).
Finally, as you noted yourself, XML is made exactly for easy processing. You could extract the headwords and translations using XPath, leaving out all the grammatical stuff, with no need to bother parsing anything yourself.
After getting this far, the next problem would be that there is no one-to-one mapping between words in different languages...
I'm building an ANTLR parser for a small query language. The query language is by definition ambiguous, and we need all possible interpretations (ASTs) to process the query.
Example:
query : CLASSIFIED_TOKEN UNCLASSIFIED_TOKEN
| ANY_TOKEN UNCLASSIFIED_TOKEN
;
In this case, if the input matches both rules, I need to get two ASTs with both interpretations. ANTLR will return only the first matched AST.
Do you know a simple way to get all possible ASTs for the same grammar? I'm thinking about running the parser multiple times, "turning off" already-matched rules between iterations; this seems dirty. Is there a better idea? Maybe another lexer/parser tool with Java support can do this?
Thanks
If I were you, I'd remove the ambiguities. You can often do that by using contextual information to determine which grammar rules actually trigger. For instance, in
C* X;
in C (not your language, but this is just to make a point), you can't tell if this is just a pointless multiplication (legal to write in C), or a declaration of a variable X of type "pointer to C". So, there are two valid (ambiguous) parses. But if you know that C is a type declaration (from some context, perhaps an earlier code declaration), you can hack the parser to kill off the inappropriate choices and end up with just the one "correct" parse, no ambiguities.
If you really don't have the context, then you likely need a GLR parser, which will happily generate both parses in your final tree. I don't know of any available for Java.
Our DMS Software Reengineering Toolkit [not a Java-based product] has GLR parsing support, and we use it all the time to parse difficult languages with ambiguities. The way we handle the C example above is to produce both parses, because the GLR parser is happy to do this, and then, if we have additional information (such as a symbol table), post-process the tree to remove the inappropriate parses.
DMS is designed to support the customized analysis and transformation of arbitrary languages, such as your query language, and makes it easy to define the grammar. Once you have a context-free grammar (ambiguities or not), DMS can parse code and you can decide what to do later.
I doubt you're going to get ANTLR to return multiple parse trees without wholesale rewriting of the code.
I believe you're going to have to partition the ambiguities, each into its own unambiguous grammar, and run the parse multiple times. If the total number of ambiguous productions is large, you could end up with an unmanageable set of distinct grammars. For example, for three binary ambiguities (two choices each) you'll end up with 2^3 = 8 distinct grammars, though there might be slightly fewer if one ambiguous branch eliminates one or more of the other ambiguities.
Good luck