Reporting progress of parser - java

I have a large file that I am trying to parse with Antlr in Java, and I would like to show the progress.
It looked like I could do the following:
CommonTokenStream tokens = new CommonTokenStream(lexer);
int maxTokenIndex = tokens.size();
and then use maxTokenIndex in a ParseTreeListener as such:
public void exitMyRule(MyRuleContext context) {
int tokenIndex = context.start.getTokenIndex();
myReportProgress(tokenIndex, maxTokenIndex);
}
The second half of that appears to work. I get ever increasing values for tokenIndex. However, tokens.size() is returning 0. This makes it impossible to gauge how much progress I have made.
Is there a good way to get an estimate of how far along I am?

The following appears to work.
File file = getFile();
ANTLRInputStream input = new ANTLRInputStream(new FileReader(file));
ProgressMonitor progress = new ProgressMonitor(null,
"Loading " + file.getName(),
null,
0,
input.size());
Then extend MyGrammarBaseListener with
@Override
public void exitMyRule(MyRuleContext context) {
progress.setProgress(context.stop.getStopIndex());
}
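On a large file the exit callback can fire a very large number of times, so it may be worth forwarding only whole-percent changes to the UI. The following is a minimal sketch of that arithmetic, not part of ANTLR or Swing: `ProgressThrottle` and its methods are hypothetical names, and it assumes the total is the character count (for example `input.size()`) and the index is a character offset (for example `context.stop.getStopIndex()`).

```java
import java.util.function.IntConsumer;

/** Hypothetical helper: forwards progress only when the whole-number percentage grows. */
final class ProgressThrottle {
    private final long total;        // total input size in characters
    private final IntConsumer sink;  // receives percentages in the range 0..100
    private int lastPercent = -1;

    ProgressThrottle(long total, IntConsumer sink) {
        this.total = Math.max(1, total); // avoid division by zero on empty input
        this.sink = sink;
    }

    /** Call with the character index reached so far. */
    void update(long index) {
        long clamped = Math.max(0, Math.min(index, total));
        int percent = (int) (clamped * 100 / total);
        if (percent > lastPercent) { // ignore repeats and backward steps
            lastPercent = percent;
            sink.accept(percent);
        }
    }
}
```

With a ProgressMonitor whose maximum is set to 100, the listener could call `throttle.update(context.stop.getStopIndex())` and pass `progress::setProgress` as the sink.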

Related

Stop Bullet number to be updated automatically when merging word docs using docx4j

I am trying to merge 2 docx files, each of which has its own bullet numbering. After merging the Word docs, the bullet numbers are automatically updated.
E.g.:
Doc A has 1 2 3
Doc B has 1 2 3
After merging, the bullet numbering is updated to 1 2 3 4 5 6.
How can I stop this?
I am using the following code:
if(counter==1)
{
FirstFileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
FirstFileIS = new java.io.ByteArrayInputStream(FirstFileByteStream);
FirstWordFile = org.docx4j.openpackaging.packages.WordprocessingMLPackage.load(FirstFileIS);
main = FirstWordFile.getMainDocumentPart();
//Add page break for Table of Content
main.addObject(objBr);
if (htmlCode != null) {
main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html,htmlCode.toString().getBytes());
}
//Table of contents - End
}
else
{
FileByteStream = org.apache.commons.codec.binary.Base64.decodeBase64(strFileData.getBytes());
FileIS = new java.io.ByteArrayInputStream(FileByteStream);
byte[] bytes = IOUtils.toByteArray(FileIS);
AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/part" + (chunkCount++) + ".docx"));
afiPart.setContentType(new ContentType(CONTENT_TYPE));
afiPart.setBinaryData(bytes);
Relationship altChunkRel = main.addTargetPart(afiPart);
CTAltChunk chunk = Context.getWmlObjectFactory().createCTAltChunk();
chunk.setId(altChunkRel.getId());
main.addObject(objBr);
htmlCode = new StringBuilder();
htmlCode.append("<html>");
htmlCode.append("<h2><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><p style=\"font-family:'Arial Black'; color: #f35b1c\">"+ReqName+"</p></h2>");
htmlCode.append("</html>");
if (htmlCode != null) {
main.addAltChunk(org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType.Html,htmlCode.toString().getBytes());
}
//Add Page Break before new content
main.addObject(objBr);
//Add new content
main.addObject(chunk);
}
Looking at your code, you are adding HTML altChunks to your document.
For these to display in Word, the HTML is converted to normal docx content.
An altChunk is usually converted by Word when you open the docx.
(Alternatively, docx4j-ImportXHTML can do it for an altChunk of type XHTML)
The upshot is that what happens with the bullets (when Word converts your HTML) is largely outside your control. You could experiment with CSS but I think Word will mostly ignore it.
An alternative may be to use XHTML altChunks and have docx4j-ImportXHTML convert them by calling main.convertAltChunks().
If the same problem occurs when you try that, well, at least we can address it.
I was able to fix my issue using the following code, which I found at (http://webapp.docx4java.org/OnlineDemo/forms/upload_MergeDocx.xhtml). You can also generate custom code there; they have a nice demo that generates code according to your requirements :).
public final static String DIR_IN = System.getProperty("user.dir")+ "/";
public final static String DIR_OUT = System.getProperty("user.dir")+ "/";
public static void main(String[] args) throws Exception
{
String[] files = {"part1docx_20200717t173750539gmt.docx", "part1docx_20200717t173750539gmt (1).docx", "part1docx_20200717t173750539gmt.docx"};
List<BlockRange> blockRanges = new ArrayList<BlockRange>();
for (int i=0 ; i< files.length; i++) {
BlockRange block = new BlockRange(WordprocessingMLPackage.load(new File(DIR_IN + files[i])));
blockRanges.add( block );
block.setStyleHandler(StyleHandler.RENAME_RETAIN);
block.setNumberingHandler(NumberingHandler.ADD_NEW_LIST);
block.setRestartPageNumbering(false);
block.setHeaderBehaviour(HfBehaviour.DEFAULT);
block.setFooterBehaviour(HfBehaviour.DEFAULT);
block.setSectionBreakBefore(SectionBreakBefore.NEXT_PAGE);
}
// Perform the actual merge
DocumentBuilder documentBuilder = new DocumentBuilder();
WordprocessingMLPackage output = documentBuilder.buildOpenDocument(blockRanges);
// Save the result
SaveToZipFile saver = new SaveToZipFile(output);
saver.save(DIR_OUT+"OUT_MergeWholeDocumentsUsingBlockRange.docx");
}

Java: Antlr4 MySql get individual statements

I'm using Java with JDBC to run MySql code. I want to execute a DDL script, but JDBC can only execute a single statement at a time, which makes it unsuitable to execute a whole .sql file out of the box.
What I'm trying to do is use Antlr4 to parse the .sql file so I can break up each individual statement and then iteratively execute them with JDBC.
I've gotten this far:
InputStream resourceAsStream = Main.class.getClassLoader()
.getResourceAsStream("an-arbitrary-ddl.sql");
CharStream codePointCharStream = CharStreams.fromStream(resourceAsStream);
MySqlLexer tokenSource = new MySqlLexer(new CaseChangingCharStream(codePointCharStream, true));
TokenStream tokenStream = new CommonTokenStream(tokenSource);
MySqlParser mySqlParser = new MySqlParser(tokenStream);
// Where do I go from here?
I'm sure I'm just not searching for the correct terms because I'm new to Antlr and manually parsing code. I can't find any reference from here as to what I need to do to get individual sql statements out of the MySqlParser. What do I need to do next?
A parser is not the right tool for this kind of problem. A statement splitter is pretty easy to write by hand and runs much faster than a full parse. I implemented such a splitter in C++ in MySQL Workbench, and it shouldn't be difficult to port to Java. The code is very fast (1 million lines of SQL code in under 1 second on an average machine). A parser would take much longer.
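To illustrate what a hand-written splitter can look like in Java, here is a minimal sketch. It is not the MySQL Workbench implementation; `SqlSplitter` is a hypothetical name. It assumes `;` is the only delimiter (no `DELIMITER` command support), recognizes `-- ` comments only when followed by a space, and drops all block comments, which would also drop MySQL's `/*! ... */` version-specific code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a hand-written splitter; not the MySQL Workbench code. */
final class SqlSplitter {

    /** Splits a script on ';' while skipping quoted text and comments. */
    static List<String> split(String script) {
        List<String> statements = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        int n = script.length();
        while (i < n) {
            char c = script.charAt(i);
            if (c == '\'' || c == '"' || c == '`') {
                // Copy the quoted run verbatim, honoring backslash escapes
                int end = i + 1;
                while (end < n && script.charAt(end) != c) {
                    if (script.charAt(end) == '\\' && c != '`') end++;
                    end++;
                }
                end = Math.min(end + 1, n);
                current.append(script, i, end);
                i = end;
            } else if (c == '#' || (c == '-' && script.startsWith("-- ", i))) {
                // Line comment: drop everything up to the end of the line
                int end = script.indexOf('\n', i);
                i = end < 0 ? n : end;
            } else if (script.startsWith("/*", i)) {
                // Block comment: drop everything up to the closing marker
                int end = script.indexOf("*/", i + 2);
                i = end < 0 ? n : end + 2;
            } else if (c == ';') {
                addIfNotBlank(statements, current);
                i++;
            } else {
                current.append(c);
                i++;
            }
        }
        addIfNotBlank(statements, current); // the last statement may lack a ';'
        return statements;
    }

    private static void addIfNotBlank(List<String> out, StringBuilder current) {
        String text = current.toString().trim();
        if (!text.isEmpty()) out.add(text);
        current.setLength(0);
    }
}
```

Each returned string can then be passed to a JDBC `Statement.execute(String)` call in turn.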
I'm sure this can be improved, but the simplest approach I could come up with was to create a listener whose constructor takes a Consumer<String> object. The listener visits each individual statement and recursively reconstructs its text. There is probably a more optimal solution, but I no longer have time to look for one.
/**
* @author Paul Nelson Baker
* @see GitHub
* @see LinkedIn
* @since 2018-09
*/
public class SqlStatementListener extends MySqlParserBaseListener {
private final Consumer<String> sqlStatementConsumer;
public SqlStatementListener(Consumer<String> sqlStatementConsumer) {
this.sqlStatementConsumer = sqlStatementConsumer;
}
@Override
public void enterSqlStatement(MySqlParser.SqlStatementContext ctx) {
if (ctx.getChildCount() > 0) {
StringBuilder stringBuilder = new StringBuilder();
recreateStatementString(ctx.getChild(0), stringBuilder);
stringBuilder.setCharAt(stringBuilder.length() - 1, ';');
String recreatedSqlStatement = stringBuilder.toString();
sqlStatementConsumer.accept(recreatedSqlStatement);
}
super.enterSqlStatement(ctx);
}
private void recreateStatementString(ParseTree currentNode, StringBuilder stringBuilder) {
if (currentNode instanceof TerminalNode) {
stringBuilder.append(currentNode.getText());
stringBuilder.append(' ');
}
for (int i = 0; i < currentNode.getChildCount(); i++) {
recreateStatementString(currentNode.getChild(i), stringBuilder);
}
}
}
Next you need to traverse the statements. The string consumer from earlier lets you lazily redirect the output wherever you need it: this can be as simple as printing to stdout, but it can just as easily append to a list.
public List<String> mySqlStatementsFrom(String sourceCode) {
List<String> statements = new ArrayList<>();
mySqlStatementsToConsumer(sourceCode, statements::add);
return statements;
}
public void mySqlStatementsToConsumer(String sourceCode, Consumer<String> mySqlStatementConsumer) {
CharStream codePointCharStream = CharStreams.fromString(sourceCode);
MySqlLexer tokenSource = new MySqlLexer(new CaseChangingCharStream(codePointCharStream, true));
TokenStream tokenStream = new CommonTokenStream(tokenSource);
MySqlParser mySqlParser = new MySqlParser(tokenStream);
SqlStatementListener statementListener = new SqlStatementListener(mySqlStatementConsumer);
ParseTreeWalker.DEFAULT.walk(statementListener, mySqlParser.sqlStatements());
}

What is the idiomatic way for printing all the differences in XMLUnit?

I'm trying to override the default XMLUnit behavior of reporting only the first difference found between two inputs with a (text) report that includes all the differences found.
I've so far accomplished this:
private static void reportXhtmlDifferences(String expected, String actual) {
Diff ds = DiffBuilder.compare(Input.fromString(expected))
.withTest(Input.fromString(actual))
.checkForSimilar()
.normalizeWhitespace()
.ignoreComments()
.withDocumentBuilderFactory(dbf).build();
DefaultComparisonFormatter formatter = new DefaultComparisonFormatter();
if (ds.hasDifferences()) {
StringBuffer expectedBuffer = new StringBuffer();
StringBuffer actualBuffer = new StringBuffer();
for (Difference d: ds.getDifferences()) {
expectedBuffer.append(formatter.getDetails(d.getComparison().getControlDetails(), null, true));
expectedBuffer.append("\n----------\n");
actualBuffer.append(formatter.getDetails(d.getComparison().getTestDetails(), null, true));
actualBuffer.append("\n----------\n");
}
throw new ComparisonFailure("There are HTML differences", expectedBuffer.toString(), actualBuffer.toString());
}
}
But I don't like:
- Having to iterate through the Differences in client code.
- Reaching into the internals of DefaultComparisonFormatter and calling getDetails with that null ComparisonType.
- Concatenating the differences with lines of dashes.
Maybe this is just coming from an unjustified bad gut feeling, but I'd like to know if anyone has some input on this use case.
XMLUnit suggests simply printing out the differences; see the section on "Old XMLUnit 1.x DetailedDiff": https://github.com/xmlunit/user-guide/wiki/Migrating-from-XMLUnit-1.x-to-2.x
Your code would look like this:
private static void reportXhtmlDifferences(String expected, String actual) {
Diff ds = DiffBuilder.compare(Input.fromString(expected))
.withTest(Input.fromString(actual))
.checkForSimilar()
.normalizeWhitespace()
.ignoreComments()
.withDocumentBuilderFactory(dbf).build();
if (ds.hasDifferences()) {
StringBuilder buffer = new StringBuilder();
for (Difference d: ds.getDifferences()) {
// One difference per line so the report stays readable
buffer.append(d.toString()).append('\n');
}
throw new RuntimeException("There are HTML differences\n" + buffer.toString());
}
}

How to edit Microsoft Word spacing from a Java application?

I'm trying to implement the word-shifting coding protocol from steganography on a Microsoft Word report using a Java application. Basically, it takes an existing report and edits its word spacing to embed some secret data: a wider space encodes a 1 bit and a narrower space encodes a 0 bit. So I wonder what kind of library I should use to start building this Java app or, if Java doesn't support this kind of communication with MS Word, what programming language I should use instead. Thank you for your time.
I would recommend using C# and the Microsoft.Office.Interop.Word. You can use the free Visual Studio Community version (https://www.visualstudio.com/products/visual-studio-community-vs), create a console application and add a reference for the interop namespace (in project explorer, right click on references, add reference: COM->Microsoft Word 16.0 Object Library).
Simple example:
namespace WordShiftingExample
{
class Program
{
private static int[] getSpaces(string text)
{
System.Collections.Generic.List<int> list = new System.Collections.Generic.List<int>();
// Collect the index of every space; text without spaces yields an empty array
int index = text.IndexOf(' ');
while (index >= 0)
{
list.Add(index);
index = text.IndexOf(' ', index + 1);
}
return list.ToArray();
}
static void Main(string[] args)
{
try
{
Microsoft.Office.Interop.Word.Application winword = new Microsoft.Office.Interop.Word.Application();
winword.ShowAnimation = false;
winword.Visible = false;
object missing = System.Reflection.Missing.Value;
Microsoft.Office.Interop.Word.Document document = winword.Documents.Add(ref missing, ref missing, ref missing, ref missing);
float zero = 0.1F;
float one = 0.15F;
document.Content.Text = "This is a test document.";
//set word-spacing for first two spaces
int[] spaces = getSpaces(document.Content.Text);
document.Range(spaces[0], spaces[0]+1).Font.Spacing=zero;
document.Range(spaces[1], spaces[1]+1).Font.Spacing = one;
//read word-spacing for first two spaces
System.Diagnostics.Debug.WriteLine(document.Range(spaces[0], spaces[0]+1).Font.Spacing); // prints 0.1
System.Diagnostics.Debug.WriteLine(document.Range(spaces[1], spaces[1]+1).Font.Spacing); // prints 0.15
//Save the document
object filename = System.Environment.GetEnvironmentVariable("USERPROFILE")+"\\temp1.docx";
document.SaveAs2(ref filename);
document.Close(ref missing, ref missing, ref missing);
document = null;
winword.Quit(ref missing, ref missing, ref missing);
winword = null;
}
catch (Exception ex)
{
System.Diagnostics.Debug.WriteLine(ex.StackTrace);
}
}
}
}
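The mapping between bits and spacing values does not depend on which Office API reads or writes the document. As a minimal sketch in Java, assuming the same two spacing values as the example above (0.1 for a 0 bit, 0.15 for a 1 bit), the encode and decode steps could look like this. `WordShiftCodec` and its methods are hypothetical names; applying the spacing to the document is left to whichever library is used.

```java
/** Hypothetical sketch of the bit-to-spacing mapping used by word-shift coding. */
final class WordShiftCodec {
    static final float ZERO = 0.10f; // narrower spacing encodes a 0 bit
    static final float ONE = 0.15f;  // wider spacing encodes a 1 bit
    private static final float THRESHOLD = (ZERO + ONE) / 2;

    /** Maps each bit character ('0' or '1') to the spacing for the matching word gap. */
    static float[] encode(String bits, int gapCount) {
        if (bits.length() > gapCount) {
            throw new IllegalArgumentException("not enough gaps between words");
        }
        float[] spacing = new float[bits.length()];
        for (int i = 0; i < bits.length(); i++) {
            char b = bits.charAt(i);
            if (b != '0' && b != '1') {
                throw new IllegalArgumentException("bits must contain only 0 and 1");
            }
            spacing[i] = (b == '1') ? ONE : ZERO;
        }
        return spacing;
    }

    /** Recovers the bit string by comparing each spacing value with the midpoint. */
    static String decode(float[] spacing) {
        StringBuilder bits = new StringBuilder();
        for (float s : spacing) {
            bits.append(s > THRESHOLD ? '1' : '0');
        }
        return bits.toString();
    }
}
```

Decoding against a midpoint threshold rather than exact values leaves some tolerance in case the application rounds the stored spacing.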

Handling errors in ANTLR4

The default behavior when the parser doesn't know what to do is to print messages to the terminal like:
line 1:23 missing DECIMAL at '}'
This is a good message, but in the wrong place. I'd rather receive this as an exception.
I've tried using the BailErrorStrategy, but this throws a ParseCancellationException without a message (caused by a InputMismatchException, also without a message).
Is there a way I can get it to report errors via exceptions while retaining the useful info in the message?
Here's what I'm really after--I typically use actions in rules to build up an object:
dataspec returns [DataExtractor extractor]
@init {
DataExtractorBuilder builder = new DataExtractorBuilder(layout);
}
@after {
$extractor = builder.create();
}
: first=expr { builder.addAll($first.values); } (COMMA next=expr { builder.addAll($next.values); })* EOF
;
expr returns [List<ValueExtractor> values]
: a=atom { $values = Arrays.asList($a.val); }
| fields=fieldrange { $values = values($fields.fields); }
| '%' { $values = null; }
| ASTERISK { $values = values(layout); }
;
Then when I invoke the parser I do something like this:
public static DataExtractor create(String dataspec) {
CharStream stream = new ANTLRInputStream(dataspec);
DataSpecificationLexer lexer = new DataSpecificationLexer(stream);
CommonTokenStream tokens = new CommonTokenStream(lexer);
DataSpecificationParser parser = new DataSpecificationParser(tokens);
return parser.dataspec().extractor;
}
All I really want is:
- for the dataspec() call to throw an exception (ideally a checked one) when the input can't be parsed
- for that exception to have a useful message and provide access to the line number and position where the problem was found
Then I'll let that exception bubble up the call stack to wherever is best suited to present a useful message to the user--the same way I'd handle a dropped network connection, reading a corrupt file, etc.
I did see that actions are now considered "advanced" in ANTLR4, so maybe I'm going about things in a strange way, but I haven't looked into what the "non-advanced" way to do this would be since this way has been working well for our needs.
Since I've had a little bit of a struggle with the two existing answers, I'd like to share the solution I ended up with.
First of all I created my own version of an ErrorListener like Sam Harwell suggested:
public class ThrowingErrorListener extends BaseErrorListener {
public static final ThrowingErrorListener INSTANCE = new ThrowingErrorListener();
@Override
public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine, String msg, RecognitionException e)
throws ParseCancellationException {
throw new ParseCancellationException("line " + line + ":" + charPositionInLine + " " + msg);
}
}
Note the use of a ParseCancellationException instead of a RecognitionException since the DefaultErrorStrategy would catch the latter and it would never reach your own code.
Creating a whole new ErrorStrategy like Brad Mace suggested is not necessary since the DefaultErrorStrategy produces pretty good error messages by default.
I then use the custom ErrorListener in my parsing function:
public static String parse(String text) throws ParseCancellationException {
MyLexer lexer = new MyLexer(new ANTLRInputStream(text));
lexer.removeErrorListeners();
lexer.addErrorListener(ThrowingErrorListener.INSTANCE);
CommonTokenStream tokens = new CommonTokenStream(lexer);
MyParser parser = new MyParser(tokens);
parser.removeErrorListeners();
parser.addErrorListener(ThrowingErrorListener.INSTANCE);
ParserRuleContext tree = parser.expr();
MyParseRules extractor = new MyParseRules();
return extractor.visit(tree);
}
(For more information on what MyParseRules does, see here.)
This will give you the same error messages as would be printed to the console by default, only in the form of proper exceptions.
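The question also asked for a checked exception that exposes the line and position. One possible approach, sketched here without any ANTLR types, is to let the listener throw an unchecked carrier (since syntaxError declares no checked exceptions) and convert it to a checked exception at the parse boundary. `SyntaxErrorSignal` and `SyntaxException` are hypothetical names, and the sketch assumes the carrier propagates out of the parser the same way the ParseCancellationException above does.

```java
/** Hypothetical unchecked carrier, thrown from inside the error listener. */
final class SyntaxErrorSignal extends RuntimeException {
    final int line;
    final int charPositionInLine;
    final String detail;

    SyntaxErrorSignal(int line, int charPositionInLine, String detail) {
        super("line " + line + ":" + charPositionInLine + " " + detail);
        this.line = line;
        this.charPositionInLine = charPositionInLine;
        this.detail = detail;
    }

    /** Converts to the checked form, keeping this signal as the cause. */
    SyntaxException toChecked() {
        return new SyntaxException(line, charPositionInLine, detail, this);
    }
}

/** Hypothetical checked exception exposing the error location to callers. */
final class SyntaxException extends Exception {
    private final int line;
    private final int charPositionInLine;

    SyntaxException(int line, int charPositionInLine, String detail, Throwable cause) {
        // Same text shape as the default console message
        super("line " + line + ":" + charPositionInLine + " " + detail, cause);
        this.line = line;
        this.charPositionInLine = charPositionInLine;
    }

    int getLine() { return line; }

    int getCharPositionInLine() { return charPositionInLine; }
}
```

In the listener, syntaxError would throw `new SyntaxErrorSignal(line, charPositionInLine, msg)`, and the parse method would wrap the parser call in `try { ... } catch (SyntaxErrorSignal s) { throw s.toChecked(); }` and declare `throws SyntaxException`.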
When you use the DefaultErrorStrategy or the BailErrorStrategy, the ParserRuleContext.exception field is set for any parse tree node in the resulting parse tree where an error occurred. The documentation for this field reads (for people that don't want to click an extra link):
The exception which forced this rule to return. If the rule successfully completed, this is null.
Edit: If you use DefaultErrorStrategy, the parse context exception will not be propagated all the way out to the calling code, so you'll be able to examine the exception field directly. If you use BailErrorStrategy, the ParseCancellationException thrown by it will include a RecognitionException if you call getCause().
if (pce.getCause() instanceof RecognitionException) {
RecognitionException re = (RecognitionException)pce.getCause();
ParserRuleContext context = (ParserRuleContext)re.getCtx();
}
Edit 2: Based on your other answer, it appears that you don't actually want an exception, but what you want is a different way to report the errors. In that case, you'll be more interested in the ANTLRErrorListener interface. You want to call parser.removeErrorListeners() to remove the default listener that writes to the console, and then call parser.addErrorListener(listener) for your own special listener. I often use the following listener as a starting point, as it includes the name of the source file with the messages.
public class DescriptiveErrorListener extends BaseErrorListener {
public static DescriptiveErrorListener INSTANCE = new DescriptiveErrorListener();
@Override
public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol,
int line, int charPositionInLine,
String msg, RecognitionException e)
{
if (!REPORT_SYNTAX_ERRORS) {
return;
}
String sourceName = recognizer.getInputStream().getSourceName();
if (!sourceName.isEmpty()) {
sourceName = String.format("%s:%d:%d: ", sourceName, line, charPositionInLine);
}
System.err.println(sourceName+"line "+line+":"+charPositionInLine+" "+msg);
}
}
With this class available, you can use the following to use it.
lexer.removeErrorListeners();
lexer.addErrorListener(DescriptiveErrorListener.INSTANCE);
parser.removeErrorListeners();
parser.addErrorListener(DescriptiveErrorListener.INSTANCE);
A much more complicated example of an error listener that I use to identify ambiguities which render a grammar non-SLL is the SummarizingDiagnosticErrorListener class in TestPerformance.
What I've come up with so far is based on extending DefaultErrorStrategy and overriding its reportXXX methods (though it's entirely possible I'm making things more complicated than necessary):
public class ExceptionErrorStrategy extends DefaultErrorStrategy {
@Override
public void recover(Parser recognizer, RecognitionException e) {
throw e;
}
@Override
public void reportInputMismatch(Parser recognizer, InputMismatchException e) throws RecognitionException {
String msg = "mismatched input " + getTokenErrorDisplay(e.getOffendingToken());
msg += " expecting one of "+e.getExpectedTokens().toString(recognizer.getTokenNames());
RecognitionException ex = new RecognitionException(msg, recognizer, recognizer.getInputStream(), recognizer.getContext());
ex.initCause(e);
throw ex;
}
@Override
public void reportMissingToken(Parser recognizer) {
beginErrorCondition(recognizer);
Token t = recognizer.getCurrentToken();
IntervalSet expecting = getExpectedTokens(recognizer);
String msg = "missing "+expecting.toString(recognizer.getTokenNames()) + " at " + getTokenErrorDisplay(t);
throw new RecognitionException(msg, recognizer, recognizer.getInputStream(), recognizer.getContext());
}
}
This throws exceptions with useful messages, and the line and position of the problem can be gotten from either the offending token, or if that's not set, from the current token by using ((Parser) re.getRecognizer()).getCurrentToken() on the RecognitionException.
I'm fairly happy with how this is working, though having six reportX methods to override makes me think there's a better way.
For anyone interested, here's the ANTLR4 C# equivalent of Sam Harwell's answer:
using System; using System.IO; using Antlr4.Runtime;
public class DescriptiveErrorListener : BaseErrorListener, IAntlrErrorListener<int>
{
public static DescriptiveErrorListener Instance { get; } = new DescriptiveErrorListener();
public void SyntaxError(TextWriter output, IRecognizer recognizer, int offendingSymbol, int line, int charPositionInLine, string msg, RecognitionException e) {
if (!REPORT_SYNTAX_ERRORS) return;
string sourceName = recognizer.InputStream.SourceName;
// never ""; might be "<unknown>" == IntStreamConstants.UnknownSourceName
sourceName = $"{sourceName}:{line}:{charPositionInLine}";
Console.Error.WriteLine($"{sourceName}: line {line}:{charPositionInLine} {msg}");
}
public override void SyntaxError(TextWriter output, IRecognizer recognizer, Token offendingSymbol, int line, int charPositionInLine, string msg, RecognitionException e) {
this.SyntaxError(output, recognizer, 0, line, charPositionInLine, msg, e);
}
static readonly bool REPORT_SYNTAX_ERRORS = true;
}
lexer.RemoveErrorListeners();
lexer.AddErrorListener(DescriptiveErrorListener.Instance);
parser.RemoveErrorListeners();
parser.AddErrorListener(DescriptiveErrorListener.Instance);
For people who use Python, here is the solution in Python 3 based on Mouagip's answer.
First, define a custom error listener:
from antlr4.error.ErrorListener import ErrorListener
from antlr4.error.Errors import ParseCancellationException
class ThrowingErrorListener(ErrorListener):
    def syntaxError(self, recognizer, offendingSymbol, line, column, msg, e):
        ex = ParseCancellationException(f'line {line}: {column} {msg}')
        ex.line = line
        ex.column = column
        raise ex
Then set this to lexer and parser:
lexer = MyScriptLexer(script)
lexer.removeErrorListeners()
lexer.addErrorListener(ThrowingErrorListener())
token_stream = CommonTokenStream(lexer)
parser = MyScriptParser(token_stream)
parser.removeErrorListeners()
parser.addErrorListener(ThrowingErrorListener())
tree = parser.script()
