Getting skipped whitespace in an ANTLR parser - Java

I have a grammar that ignores whitespace in the following way:
WS : [ \r\t\n]+ -> channel(HIDDEN) ;
That's fine, because whitespace isn't part of my grammar. But in the parser I need to know where the whitespace was. So far I have been unable to find any straightforward way to do this.
I am using the latest version of ANTLR 4.
Thanks in advance.

In v3 you would do something like this if you're looking for a token while parsing the tree:
getPreviousTokenInHiddenChannel(retval, input);

public String getPreviousTokenInHiddenChannel(TreeRuleReturnScope retval, TreeNodeStream input) {
    try {
        TokenStream tstream = input.getTokenStream();
        CommonTree node = (CommonTree) retval.start;
        int boundary = node.getTokenStopIndex();
        if (boundary <= 0) { // fix for antlr 3.3 bug; from 3.5, getTokenStartIndex should itself resolve the parent's boundaries if <= 0
            while (node.getTokenStartIndex() == -1) { // if node is imaginary
                node = (CommonTree) node.getParent();
                if (node == null) return ""; // means we are at the root
                boundary = node.getTokenStopIndex();
                if (boundary > 0) break;
            }
        }
        int i = boundary;
        while (i > 0) { // walk backwards through the token stream
            i--;
            Token tok = tstream.get(i);
            if (tok.getChannel() == HIDDEN) {
                // do what you want to do https://www.youtube.com/watch?v=JgRBkjgXHro
            }
        }
    } catch (Exception e) {
        // handle e
    }
    return "";
}
You can easily adapt that piece of code to v4 with something like this (pseudocode):
BufferedTokenStream bts;
// retrieve bts
List<Token> hiddenTokens = bts.getHiddenTokensToLeft(bts.index(), HIDDEN);
// loop backwards over the list
for (int i = hiddenTokens.size() - 1; i >= 0; i--) {
    Token t = hiddenTokens.get(i);
    // process your hidden token
}

See the Token stream API.
You must get used to looking at the API and source code. You can also buy the book cheaply; see page 206, "Accessing Hidden Channels".
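For a slightly more concrete v4 sketch, here is one way to read the skipped whitespace from a parse-tree listener. MyParser, MyBaseListener and the foo rule are placeholder names, not from the question; the stream passed in is assumed to be the CommonTokenStream the parser was built with, and Token.HIDDEN_CHANNEL is the channel the WS rule above writes to.
import java.util.List;
import org.antlr.v4.runtime.BufferedTokenStream;
import org.antlr.v4.runtime.Token;

public class WhitespaceAwareListener extends MyBaseListener {

    private final BufferedTokenStream tokens;

    public WhitespaceAwareListener(BufferedTokenStream tokens) {
        this.tokens = tokens;
    }

    @Override
    public void enterFoo(MyParser.FooContext ctx) {
        // All hidden-channel tokens (the skipped whitespace) directly to the
        // left of the first token of this rule; null if there are none.
        List<Token> hidden = tokens.getHiddenTokensToLeft(
                ctx.getStart().getTokenIndex(), Token.HIDDEN_CHANNEL);
        if (hidden != null) {
            for (Token ws : hidden) {
                System.out.println("whitespace before rule: '" + ws.getText() + "'");
            }
        }
    }
}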

Related

Apache POI: ${my_placeholder} is treated as three different runs

I have a .docx template with placeholders to be filled, such as ${programming_language}, ${education}, etc.
The placeholder keywords must be easily distinguishable from the other plain words, hence they are enclosed in ${ }. I iterate over the runs in the document's tables like this:
for (XWPFTable table : doc.getTables()) {
    for (XWPFTableRow row : table.getRows()) {
        for (XWPFTableCell cell : row.getTableCells()) {
            for (XWPFParagraph paragraph : cell.getParagraphs()) {
                for (XWPFRun run : paragraph.getRuns()) {
                    System.out.println("run text: " + run.text());
                    /** replace text here, etc. */
                }
            }
        }
    }
}
I want to extract the placeholders together with the enclosing ${ } characters. The problem is that it seems like the enclosing characters are treated as separate runs...
run text: ${
run text: programming_language
run text: }
run text: Some plain text here
run text: ${
run text: education
run text: }
Instead, I would like to achieve the following effect:
run text: ${programming_language}
run text: Some plain text here
run text: ${education}
I have tried using other enclosing characters, such as: { }, < >, # #, etc.
I do not want to do some weird concatenations of runs, etc. I want to have it in a single XWPFRun.
If I cannot find a proper solution, I will just make it like so: VAR_PROGRAMMING_LANGUAGE, VAR_EDUCATION, I think.
The current Apache POI 4.1.2 provides TextSegment to deal with those Word text-run issues. XWPFParagraph.searchText searches for a string in a paragraph and returns a TextSegment. This provides access to the begin run and the end run of that text in that paragraph (BeginRun and EndRun). It also provides access to the start character position in the begin run and the end character position in the end run (BeginChar and EndChar).
It additionally provides access to the index of the text element in the text run (BeginText and EndText). This should always be 0, because by default text runs only have one text element.
Having this, we can do the following:
Replace the found partial string in the begin run by the replacement. To do so, get the text part which was before the searched string and concatenate the replacement to it. After that, the begin run fully contains the replacement.
Delete all text runs between the begin run and the end run, as they contain parts of the searched string which are no longer needed.
Let only the text part after the searched string remain in the end run.
Doing so, we are able to replace text which spans multiple text runs.
The following example shows this.
import java.io.*;

import org.apache.poi.xwpf.usermodel.*;

import org.openxmlformats.schemas.wordprocessingml.x2006.main.*;

public class WordReplaceTextSegment {

    static public void replaceTextSegment(XWPFParagraph paragraph, String textToFind, String replacement) {
        TextSegment foundTextSegment = null;
        PositionInParagraph startPos = new PositionInParagraph(0, 0, 0);
        while ((foundTextSegment = paragraph.searchText(textToFind, startPos)) != null) { // search all text segments having text to find
            System.out.println(foundTextSegment.getBeginRun() + ":" + foundTextSegment.getBeginText() + ":" + foundTextSegment.getBeginChar());
            System.out.println(foundTextSegment.getEndRun() + ":" + foundTextSegment.getEndText() + ":" + foundTextSegment.getEndChar());

            // maybe there is text before textToFind in begin run
            XWPFRun beginRun = paragraph.getRuns().get(foundTextSegment.getBeginRun());
            String textInBeginRun = beginRun.getText(foundTextSegment.getBeginText());
            String textBefore = textInBeginRun.substring(0, foundTextSegment.getBeginChar()); // we only need the text before

            // maybe there is text after textToFind in end run
            XWPFRun endRun = paragraph.getRuns().get(foundTextSegment.getEndRun());
            String textInEndRun = endRun.getText(foundTextSegment.getEndText());
            String textAfter = textInEndRun.substring(foundTextSegment.getEndChar() + 1); // we only need the text after

            if (foundTextSegment.getEndRun() == foundTextSegment.getBeginRun()) {
                textInBeginRun = textBefore + replacement + textAfter; // if we have only one run, we need the text before, then the replacement, then the text after in that run
            } else {
                textInBeginRun = textBefore + replacement; // else we need the text before followed by the replacement in begin run
                endRun.setText(textAfter, foundTextSegment.getEndText()); // and the text after in end run
            }

            beginRun.setText(textInBeginRun, foundTextSegment.getBeginText());

            // runs between begin run and end run need to be removed
            for (int runBetween = foundTextSegment.getEndRun() - 1; runBetween > foundTextSegment.getBeginRun(); runBetween--) {
                paragraph.removeRun(runBetween); // remove not needed runs
            }
        }
    }

    public static void main(String[] args) throws Exception {
        XWPFDocument doc = new XWPFDocument(new FileInputStream("source.docx"));

        String textToFind = "${This is the text to find}"; // might be in different runs
        String replacement = "Replacement text";

        for (XWPFParagraph paragraph : doc.getParagraphs()) { // go through all paragraphs
            if (paragraph.getText().contains(textToFind)) { // paragraph contains text to find
                replaceTextSegment(paragraph, textToFind, replacement);
            }
        }

        FileOutputStream out = new FileOutputStream("result.docx");
        doc.write(out);
        out.close();
        doc.close();
    }
}
The above code does not work in all cases, because XWPFParagraph.searchText has bugs. So I will provide a better searchText method:
/**
 * This method parses the paragraph and searches for the given string.
 * If it finds the string, it returns a TextSegment describing where that
 * string is located; the search starts at the position given in startPos.
 *
 * @param searched
 * @param startPos
 */
static TextSegment searchText(XWPFParagraph paragraph, String searched, PositionInParagraph startPos) {
    int startRun = startPos.getRun(),
        startText = startPos.getText(),
        startChar = startPos.getChar();
    int beginRunPos = 0, candCharPos = 0;
    boolean newList = false;

    //CTR[] rArray = paragraph.getRArray(); //This does not contain all runs. It lacks hyperlink runs for ex.
    java.util.List<XWPFRun> runs = paragraph.getRuns();

    int beginTextPos = 0, beginCharPos = 0; //must be outside the for loop

    //for (int runPos = startRun; runPos < rArray.length; runPos++) {
    for (int runPos = startRun; runPos < runs.size(); runPos++) {
        //int beginTextPos = 0, beginCharPos = 0, textPos = 0, charPos; //int beginTextPos = 0, beginCharPos = 0 must be outside the for loop
        int textPos = 0, charPos;
        //CTR ctRun = rArray[runPos];
        CTR ctRun = runs.get(runPos).getCTR();
        XmlCursor c = ctRun.newCursor();
        c.selectPath("./*");
        try {
            while (c.toNextSelection()) {
                XmlObject o = c.getObject();
                if (o instanceof CTText) {
                    if (textPos >= startText) {
                        String candidate = ((CTText) o).getStringValue();
                        if (runPos == startRun) {
                            charPos = startChar;
                        } else {
                            charPos = 0;
                        }
                        for (; charPos < candidate.length(); charPos++) {
                            if ((candidate.charAt(charPos) == searched.charAt(0)) && (candCharPos == 0)) {
                                beginTextPos = textPos;
                                beginCharPos = charPos;
                                beginRunPos = runPos;
                                newList = true;
                            }
                            if (candidate.charAt(charPos) == searched.charAt(candCharPos)) {
                                if (candCharPos + 1 < searched.length()) {
                                    candCharPos++;
                                } else if (newList) {
                                    TextSegment segment = new TextSegment();
                                    segment.setBeginRun(beginRunPos);
                                    segment.setBeginText(beginTextPos);
                                    segment.setBeginChar(beginCharPos);
                                    segment.setEndRun(runPos);
                                    segment.setEndText(textPos);
                                    segment.setEndChar(charPos);
                                    return segment;
                                }
                            } else {
                                candCharPos = 0;
                            }
                        }
                    }
                    textPos++;
                } else if (o instanceof CTProofErr) {
                    c.removeXml();
                } else if (o instanceof CTRPr) {
                    //do nothing
                } else {
                    candCharPos = 0;
                }
            }
        } finally {
            c.dispose();
        }
    }
    return null;
}
This will be called like:
...
while((foundTextSegment = searchText(paragraph, textToFind, startPos)) != null) {
...
As someone has commented on your question, you can't control where or when Word will split a paragraph into runs. If the other answer still didn't help you, here is the way I got around it:
First of all, this "solution" has a big problem, but I will still put it here in case someone can solve it.
public void mainMethod(XWPFParagraph paragraph) {
    if (paragraph.getRuns().size() > 1) {
        String myRun = unifyRuns(paragraph.getRuns());
        // make the verification of placeholders ${...}
        paragraph.getRuns().get(0).setText(myRun, 0); // overwrite the text of the first run
        while (paragraph.getRuns().size() > 1) {
            paragraph.removeRun(1); // drop the remaining, now redundant runs
        }
    }
}

private String unifyRuns(List<XWPFRun> runElements) {
    StringBuilder unifiedRun = new StringBuilder();
    for (XWPFRun run : runElements) {
        unifiedRun.append(run.text()); // append the run's text, not the run object itself
    }
    return unifiedRun.toString();
}
The code may contain some errors since I'm writing it from memory.
The problem here is that when Word separates a paragraph into runs, it doesn't do it for nothing: when there is text with different formatting (like font family or font size), it puts the differently formatted text into different runs.
In the text "Here's my bold text", Word will split the text to separate the bold part from the normal text. So the code above is a bad solution if you are using POI to process large documents with different fonts. In that case you would need to check first whether the run is actually bold, and only then treat the placeholders (a rough sketch of such a check follows below).
Again, this is a "solution" that I found, and it's not complete yet. Sorry for any English errors; I'm using Google Translate to write this answer.
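As an illustration of that formatting check, here is a sketch (not part of the answer above): it merges neighbouring runs only when their bold flag matches, ignores all other formatting attributes, and assumes each run has a single text element, which is the default.
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class RunMerger {

    // Merge neighbouring runs that share the same (simplified) formatting so that a
    // placeholder like ${education} split across equally formatted runs ends up in
    // a single run. Runs with different formatting are left untouched.
    public static void mergeSameFormatting(XWPFParagraph paragraph) {
        int i = 0;
        while (i < paragraph.getRuns().size() - 1) {
            XWPFRun current = paragraph.getRuns().get(i);
            XWPFRun next = paragraph.getRuns().get(i + 1);
            if (current.isBold() == next.isBold()) {              // only the bold flag is compared here
                current.setText(current.text() + next.text(), 0); // concatenate into the first run
                paragraph.removeRun(i + 1);                       // drop the merged-away run
            } else {
                i++;                                              // formatting differs, keep both runs
            }
        }
    }
}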

Parse CACM collection in Java

I'm having a problem parsing the CACM collection in Java.
The collection has this format:
.I number
.T
title
.A
authors
multiple authors allowed
.W
body
multiple lines of body allowed
I'm trying to extract each of the fields with this extract method:
public static String extract(char campo, String text, Boolean allowEmpty)
{
    String[] lines = text.split("\\r?\\n");
    /*for(String line:lines)
        System.out.println(line);*/
    StringBuilder builder = new StringBuilder();
    boolean start = false;
    boolean end = false;
    for (String l : lines)
    {
        System.out.println(l);
        //System.out.println(line.charAt(0));
        if ((l.charAt(0) == '.') && (l.charAt(1) == campo))
        {
            System.out.println("Detectado campo " + l.charAt(1));
            start = true;
            builder.append(l.substring(2)).append("\n");
        }
        else
        {
            if (l.charAt(0) == '.')
            {
                //System.out.println(campo);
                break;
            }
            else if (start)
                builder.append(l);
        }
    }
    return builder.toString();
}
But I do not know why it only extracts the .I field, and I can't get it to work with any other field. I'm clueless as to where to correct the code, or whether the approach is even sound.
Any clues?
Thank you in advance.
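For reference, here is a minimal sketch of one way a record in this format could be split into fields. It is only an illustration based on the sample above, not a tested fix for the extract method; the field markers and the continuation-line behaviour are assumptions.
import java.util.LinkedHashMap;
import java.util.Map;

public class CacmRecordParser {

    // Split one CACM record into its fields. A line starting with '.' begins a new
    // field (.I, .T, .A, .W); every following line belongs to that field until the
    // next marker line.
    public static Map<Character, String> extractFields(String record) {
        Map<Character, String> fields = new LinkedHashMap<>();
        char currentField = 0;
        StringBuilder current = new StringBuilder();
        for (String line : record.split("\\r?\\n")) {
            if (line.startsWith(".") && line.length() >= 2) {
                if (currentField != 0) {
                    fields.put(currentField, current.toString().trim());
                }
                currentField = line.charAt(1);
                current = new StringBuilder();
                if (line.length() > 2) {                    // ".I 123" keeps "123" as the field content
                    current.append(line.substring(2).trim()).append('\n');
                }
            } else if (currentField != 0) {
                current.append(line).append('\n');          // continuation line of the current field
            }
        }
        if (currentField != 0) {
            fields.put(currentField, current.toString().trim());
        }
        return fields;
    }
}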

Modifying YAML in java while preserving comments

How can we modify an existing YAML file and preserve the comments in it?
Is there any Java parser which does that?
For example, if I have the following YAML:
#This is a test YAML
name: abcd
age: 23
#Test YAML ends here.
Is there a way I can edit this YAML using a Java parser and preserve the comments?
As of the time of writing, there is no round-tripping YAML parser for Java. There is the well-known SnakeYAML, which does not preserve comments (see the author's comment here), and a newer project named camel, which I know little of; but it definitely is not round-tripping.
What you can theoretically do is to use SnakeYaml's Yaml.parse and then iterate over the events. Each event has a start and an end mark, giving the start and end line & column of the event. This makes it possible to map the events back into the source and discover the portions of the source that were not parsed into events (presumably comments). Having this mapping, you can now modify the event list and write it back. Finally, you read the result in a second time and discover the gaps between your events where there were comments in the original YAML, but not in the modified YAML, and re-insert those comments, giving you the final YAML with your modifications and the comments.
Of course, this is very complex. I would not advise doing it unless you a) either have a solid understanding of how YAML is structured or are willing to learn it, and b) your use case justifies this amount of work.
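A minimal sketch of the first half of that idea, using only SnakeYAML's event API (the comment-gap detection and re-insertion described above would still have to be built on top of this):
import java.io.StringReader;
import org.yaml.snakeyaml.Yaml;
import org.yaml.snakeyaml.events.Event;

public class YamlEventMarks {
    public static void main(String[] args) {
        String source =
                "#This is a test YAML\n" +
                "name: abcd\n" +
                "age: 23\n" +
                "#Test YAML ends here.\n";

        // Yaml.parse produces the low-level event stream; each event carries start/end
        // marks (line and column) that can be mapped back onto the source text to find
        // the gaps where comments live.
        for (Event event : new Yaml().parse(new StringReader(source))) {
            System.out.printf("%-25s start %d:%d  end %d:%d%n",
                    event.getClass().getSimpleName(),
                    event.getStartMark().getLine(), event.getStartMark().getColumn(),
                    event.getEndMark().getLine(), event.getEndMark().getColumn());
        }
    }
}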
I wrote a Groovy script to solve this. The Java version is very similar (a rough Java equivalent is sketched after the script):
def key = "name"
def value = "efgh"
def yamlFile = new File("file.yaml")
def yamlFileLines = new StringBuilder()
def foundKey = false

yamlFile.text.eachLine { line ->
    if (!foundKey && line.contains("$key:")) {
        line = line.replaceAll(/$key:.*/, "$key: $value")
        foundKey = true
    }
    yamlFileLines.append("$line\n")
}

if (foundKey) {
    yamlFile.text = yamlFileLines.toString()
} else {
    throw new StopExecutionException("Could not find key '$key' in file ${yamlFile.getAbsolutePath()}")
}
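A rough Java equivalent of that script could look like the following sketch. It keeps the same line-by-line replacement approach (so comments are copied through untouched) and simply throws an IllegalStateException where the Groovy version used Gradle's StopExecutionException.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class YamlValueReplacer {
    public static void main(String[] args) throws IOException {
        String key = "name";
        String value = "efgh";
        Path yamlFile = Paths.get("file.yaml");

        StringBuilder out = new StringBuilder();
        boolean foundKey = false;
        for (String line : Files.readAllLines(yamlFile, StandardCharsets.UTF_8)) {
            if (!foundKey && line.contains(key + ":")) {
                line = line.replaceAll(key + ":.*", key + ": " + value); // replace only the value of the key
                foundKey = true;
            }
            out.append(line).append('\n'); // comments and untouched lines are copied verbatim
        }

        if (foundKey) {
            Files.write(yamlFile, out.toString().getBytes(StandardCharsets.UTF_8));
        } else {
            throw new IllegalStateException(
                    "Could not find key '" + key + "' in file " + yamlFile.toAbsolutePath());
        }
    }
}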
If you use SnakeYAML, you can modify the ScannerImpl file.
Notice: this makes the scanner read the in-line comment as text.
private Token scanPlain() {
    StringBuilder chunks = new StringBuilder();
    Mark startMark = reader.getMark();
    Mark endMark = startMark;
    int indent = this.indent + 1;
    String spaces = "";
    while (true) {
        int c;
        int length = 0;
        // A comment indicates the end of the scalar.
        // read the in-line comment as text
        // if (reader.peek() == '#' && ) {
        //     break;
        // }
        while (true) {
            c = reader.peek(length);
            if (Constant.NULL_BL_T_LINEBR.has(c)
                    || (c == ':' && Constant.NULL_BL_T_LINEBR.has(reader.peek(length + 1), flowLevel != 0 ? ",[]{}" : ""))
                    || (this.flowLevel != 0 && ",?[]{}".indexOf(c) != -1)) {
                break;
            }
            length++;
        }
        if (length == 0) {
            break;
        }
        this.allowSimpleKey = false;
        chunks.append(spaces);
        chunks.append(reader.prefixForward(length));
        endMark = reader.getMark();
        spaces = scanPlainSpaces();
        // System.out.printf("spaces[%s]\n", spaces);
        if (spaces.length() == 0
                // read the in-line comment as text
                // || reader.peek() == '#'
                || (this.flowLevel == 0 && this.reader.getColumn() < indent)) {
            break;
        }
    }
    return new ScalarToken(chunks.toString(), startMark, endMark, true);
}

Convert Iterator to a for loop with index in order to skip objects

I am using the Jericho HTML Parser to parse some malformed HTML. In particular, I am trying to get all text nodes, process the text and then replace it.
I want to skip specific elements from processing. For example, I want to skip all <a> elements, and any element that has the attribute class="noProcess". So, if a div has class="noProcess" then I want to skip this div and all its children from processing. However, I do want these skipped elements to remain in the output after processing.
Jericho provides an Iterator for all nodes but I am not sure how to skip complete elements from the Iterator. Here is my code:
private String doProcessHtml(String html) {
    Source source = new Source(html);
    OutputDocument outputDocument = new OutputDocument(source);
    for (Segment segment : source) {
        if (segment instanceof Tag) {
            Tag tag = (Tag) segment;
            System.out.println("FOUND TAG: " + tag.getName());
            // DO SOMETHING HERE TO SKIP ENTIRE ELEMENT IF IS <A> OR CLASS="noProcess"
        } else if (segment instanceof CharacterReference) {
            CharacterReference characterReference = (CharacterReference) segment;
            System.out.println("FOUND CHARACTERREFERENCE: " + characterReference.getCharacterReferenceString());
        } else {
            System.out.println("FOUND PLAIN TEXT: " + segment.toString());
            outputDocument.replace(segment, doProcessText(segment.toString()));
        }
    }
    return outputDocument.toString();
}
It doesn't look like using the ignoreWhenParsing() method works for me as the parser just treats the "ignored" element as text.
I was thinking that if I could convert the Iterator loop to a for (int i = 0; ...) loop, I would probably be able to skip the element and all its children by modifying i to point to the EndTag and then continuing the loop... but I'm not sure.
I think you might want to consider a redesign of the way your segments are built. Is there a way to parse the HTML so that each segment is a parent element that contains a nested list of child elements? That way you could do something like:
for (Segment segment : source) {
    if (segment instanceof Tag) {
        Tag tag = (Tag) segment;
        System.out.println("FOUND TAG: " + tag.getName());
        // DO SOMETHING HERE TO SKIP ENTIRE ELEMENT IF IS <A> OR CLASS="noProcess"
        continue;
    } else if (segment instanceof CharacterReference) {
        CharacterReference characterReference = (CharacterReference) segment;
        System.out.println("FOUND CHARACTERREFERENCE: " + characterReference.getCharacterReferenceString());
        for (Segment child : segment.childNodes()) {
            // Use recursion to process child elements.
            // You will want to put your for loop in a separate method so it can be called recursively.
        }
    } else {
        System.out.println("FOUND PLAIN TEXT: " + segment.toString());
        outputDocument.replace(segment, doProcessText(segment.toString()));
    }
}
Without more code to inspect, it's hard to determine whether restructuring the segment element is even possible or worth the effort.
I managed to get a working solution by using the getEnd() method of the Element object of the Tag. The idea is to skip every segment that begins before a position you set: you find the end position of the element you want to exclude, and you do not process anything else before that position:
final ArrayList<String> excludeTags = new ArrayList<String>(Arrays.asList(new String[] {"head", "script", "a"}));
final ArrayList<String> excludeClasses = new ArrayList<String>(Arrays.asList(new String[] {"noProcess"}));

Source.LegacyIteratorCompatabilityMode = true;
Source source = new Source(htmlToProcess);
OutputDocument outputDocument = new OutputDocument(source);

int skipToPos = 0;
for (Segment segment : source) {
    if (segment.getBegin() >= skipToPos) {
        if (segment instanceof Tag) {
            Tag tag = (Tag) segment;
            Element element = tag.getElement();
            // check excludeTags
            if (excludeTags.contains(tag.getName().toLowerCase())) {
                skipToPos = element.getEnd();
            }
            // check excludeClasses
            String classes = element.getAttributeValue("class");
            if (classes != null) {
                for (String theClass : classes.split(" ")) {
                    if (excludeClasses.contains(theClass.toLowerCase())) {
                        skipToPos = element.getEnd();
                    }
                }
            }
        } else if (segment instanceof CharacterReference) { // for future use. Source.LegacyIteratorCompatabilityMode = true;
            CharacterReference characterReference = (CharacterReference) segment;
        } else {
            outputDocument.replace(segment, doProcessText(segment.toString()));
        }
    }
}
return outputDocument.toString();
This should work.
String skipTag = null;
for (Segment segment : source) {
    if (skipTag != null) { // is skipping ON?
        if (segment instanceof EndTag && // if EndTag found for the
                skipTag.equals(((EndTag) segment).getName())) { // tag we're skipping
            skipTag = null; // set skipping OFF
        }
        continue; // continue skipping (or skip the EndTag)
    } else if (segment instanceof Tag) { // is tag?
        Tag tag = (Tag) segment;
        System.out.println("FOUND TAG: " + tag.getName());
        if (HTMLElementName.A.equals(tag.getName())) { // if <a> ?
            skipTag = tag.getName(); // set
            continue; // skipping ON
        } else if (tag instanceof StartTag) {
            if ("noProcess".equals( // if <tag class="noProcess" ..> ?
                    ((StartTag) tag).getAttributeValue("class"))) {
                skipTag = tag.getName(); // set
                continue; // skipping ON
            }
        }
    } // ...
}

EMV TLV Java Function

I'm looking for a way to decode an EMV response in Java, like this online tool does:
http://www.emvlab.org/tlvutils/
where you put something like this EMV response:
6f3a8407a0000000031010a52f500b56495341204352454449548701015f2d086573656e707466729f12074352454449544f9f1101019f38039f1a02
and it will show you everything perfectly. I started doing something by myself, but then I realized that we might run into two 9F38 (PDOL) strings. Not necessarily two of the same tag, since I know that's impossible, but maybe the value of one tag could end in 9F and the next byte could be 38, and that would give me an error... Now that I mention it, is that possible? Because that was one of the main reasons why I stopped writing my own function.
Have any of you written a function to do this already?
Thanks!
https://github.com/binaryfoo/emv-bertlv should do the trick.
Using your example, the following code:
List<DecodedData> decoded = new RootDecoder().decode("6f3a8407a0000000031010a52f500b56495341204352454449548701015f2d086573656e707466729f12074352454449544f9f1101019f38039f1a02", "EMV", "constructed");
new DecodedWriter(System.out).write(decoded, "");
Will output:
[6F (FCI template)] 8407A0000000031010A52F500B56495341204352454449548701015F...1A02
[84 (dedicated file name)] A0000000031010
[A5 (FCI proprietary template)] 500B56495341204352454449548701015F2D086573656E707466729F...1A02
[50 (application label)] VISA CREDIT
[87 (application priority indicator)] 01
[5F2D (language preference)] esenptfr
[9F12 (application preferred name)] CREDITO
[9F11 (issuer code table index)] 01
[9F38 (PDOL - Processing data object list)] 9F1A02
9F1A (terminal country code) 2 bytes
This project has code to deal with EMV data http://code.google.com/p/javaemvreader/
You are on the right track. You can easily build your own EMV parser using the technique called TLV (Tag Length Value). Your raw data always comes back with a tag; after the tag comes the length, and using the length you can get the value.
So create three methods:
method 1: contains all the short tags
method 2: contains all the long tags
method 3: contains all the proprietary tags
So when you pass in your raw EMV data:
6f3a8407a0000000031010a52f500b56495341204352454449548701015f2d086573656e707466729f12074352454449544f9f1101019f38039f1a02
loop through all those three methods and they will give you all the information that you need. (A minimal sketch of the underlying tag/length walk is shown below.)
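The sketch below is not the code from any of the libraries mentioned here; it just shows the raw BER-TLV walk (multi-byte tags and long-form lengths, without recursing into constructed templates). It also shows why the 9F/38 worry from the question cannot happen: the length field says exactly how many value bytes to skip, so value bytes are never scanned for tags.
public class TlvWalker {

    // Print every top-level tag/length/value of a BER-TLV encoded byte array.
    public static void walk(byte[] data) {
        int i = 0;
        while (i < data.length) {
            // Tag: first byte, plus continuation bytes if its low 5 bits are all set.
            int tagStart = i;
            if ((data[i] & 0x1F) == 0x1F) {
                do {
                    i++;
                } while ((data[i] & 0x80) == 0x80); // further tag bytes have bit 8 set
            }
            i++;
            String tag = toHex(data, tagStart, i);

            // Length: short form (one byte) or long form (0x81 xx, 0x82 xx xx, ...).
            // The indefinite form 0x80 is not used in EMV.
            int length = data[i++] & 0xFF;
            if (length > 0x80) {
                int numBytes = length & 0x7F;
                length = 0;
                for (int n = 0; n < numBytes; n++) {
                    length = (length << 8) | (data[i++] & 0xFF);
                }
            }

            // Value: exactly 'length' bytes, so the next tag always starts right after it.
            String value = toHex(data, i, i + length);
            i += length;
            System.out.println(tag + " (" + length + " bytes): " + value);
        }
    }

    private static String toHex(byte[] data, int from, int to) {
        StringBuilder sb = new StringBuilder();
        for (int k = from; k < to; k++) {
            sb.append(String.format("%02X", data[k]));
        }
        return sb.toString();
    }

    private static byte[] hexToBytes(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int k = 0; k < out.length; k++) {
            out[k] = (byte) Integer.parseInt(hex.substring(2 * k, 2 * k + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        walk(hexToBytes("6f3a8407a0000000031010a52f500b56495341204352454449548701015f2d08"
                + "6573656e707466729f12074352454449544f9f1101019f38039f1a02"));
    }
}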
Use the function below, which will give you a map of TLV values:
public LinkedHashMap parseBERTLVTag(String tlv) throws DecoderException {
    if (tlv == null || "".equalsIgnoreCase(tlv)) {
        return null;
    }
    System.out.println("============= START [" + tlv + "]==================");
    boolean inTagRead = true;
    Map<String, String> tags = new HashMap<>();
    LinkedHashMap<String, TLVModel> map = new LinkedHashMap<>(); // result map, keyed by tag (TLVModel is a simple tag/length/value bean)
    StringBuilder _tmp = new StringBuilder();
    String lastTag = "";
    int old_index = 0;
    boolean isFirstTagByte = true;
    int len = 0;
    boolean more = true;
    String data = "";
    // isLastTagByte and isLastLengthByte are helper methods (not shown) that detect the
    // final byte of a multi-byte tag and of a multi-byte length field respectively.
    while (more) {
        len = 0;
        String hByte = tlv.substring(old_index, (old_index = old_index + 2));
        if (inTagRead) {
            if (isLastTagByte(hByte, isFirstTagByte)) {
                inTagRead = false;
                _tmp.append(hByte);
                lastTag = _tmp.toString();
                System.out.println("Tag[" + lastTag + "]");
                tags.put(lastTag, null);
                _tmp = new StringBuilder();
            } else {
                _tmp.append(hByte);
            }
            isFirstTagByte = false;
        } else { // length
            isFirstTagByte = true;
            if (isLastLengthByte(hByte)) {
                inTagRead = true;
                _tmp.append(hByte);
                len = Integer.parseInt(_tmp.toString(), 16);
                // read len*2 hex characters as the value
                System.out.println(" Length [" + len + "]");
                data = tlv.substring(old_index, (old_index = old_index + len * 2));
                String tmpData = lastTag + ":" + _tmp.toString() + ":h" + data;
                System.out.println(" Data [" + tmpData + "]");
                _tmp = new StringBuilder();
                tags.put(lastTag, tmpData);
            } else {
                _tmp.append(hByte);
            }
        }
        more = tlv.length() <= old_index ? false : true;
        System.out.println("tag " + lastTag + " value " + data + " length " + len);
        if (lastTag.length() > 0 && data.length() > 0 && len > 0) {
            if (!map.containsKey(lastTag)) {
                map.put(lastTag, new TLVModel().setTag(lastTag).setLength(len).setValue(data));
            }
        }
    } // END OF WHILE
    System.out.println("------------ as MAP ---------------------");
    System.out.println("size " + map.size());
    for (Map.Entry mp : map.entrySet()) {
        System.out.println("key " + mp.getKey() + " value " + mp.getValue());
    }
    return map.size() > 0 ? map : null;
}
