Get the list of object containing text matching a pattern

I'm currently working with the Apache POI API and I'm trying to edit a Word document (*.docx) with it. A document is composed of paragraphs (XWPFParagraph objects), and a paragraph contains text embedded in 'runs' (XWPFRun). A paragraph can have many runs (depending on the text properties, though the split sometimes seems random). In my document I can have specific tags which I need to replace with data (all my tags follow this pattern: <#TAG_NAME#>).
So for example, if I process a paragraph containing the text Some text with a tag <#SOMETAG#>, I could get something like this:
XWPFParagraph paragraph = ... // Get a paragraph from the document
System.out.println(paragraph.getText());
// Prints: Some text with a tag <#SOMETAG#>
But if I want to edit the text of that paragraph I need to process the runs, and the number of runs is not fixed. So if I print the content of the runs with this code:
System.out.println("Number of runs: " + paragraph.getRuns().size());
for (XWPFRun run : paragraph.getRuns()) {
System.out.println(run.text());
}
Sometimes it can be like this:
// Output:
// Number of runs: 1
// Some text with a tag <#SOMETAG#>
And other times like this:
// Output:
// Number of runs: 4
// Some text with a tag
// <#
// SOMETAG
// #>
What I need to do is get the first run containing the start of the tag and the indexes of the following runs containing the rest of the tag (if the tag is split across several runs). I've managed to write a first version of that algorithm, but it only works if the beginning of the tag (<#) and the end of the tag (#>) aren't themselves split. Here's what I've already done.
So what I would like is an algorithm capable of handling that problem and, if possible, one that works with any given delimiters (not necessarily <# and #>, so I could replace them with something like {{{ and }}}).
Sorry if my English isn't perfect, don't hesitate to ask me to clarify any point you want.

In the end I found the answer myself by completely rethinking my original algorithm (I commented it, so it might help someone who is in the same situation I was):
// Before using the function, I'm sure that:
// paragraph.getText().contains(surroundedTag) == true
private void editParagraphWithData(XWPFParagraph paragraph, String surroundedTag, String replacement) {
List<Integer> runsToRemove = new LinkedList<Integer>();
StringBuilder tmpText = new StringBuilder();
int runCursor = 0;
// Processing the runs (in normal order) until the accumulated text contains my surroundedTag
while (!tmpText.toString().contains(surroundedTag)) {
tmpText.append(paragraph.getRuns().get(runCursor).text());
runsToRemove.add(runCursor);
runCursor++;
}
tmpText = new StringBuilder();
// Processing back (in reverse order) to only keep the runs I need to edit/remove
while (!tmpText.toString().contains(surroundedTag)) {
runCursor--;
tmpText.insert(0, paragraph.getRuns().get(runCursor).text());
}
// Edit the first run of the tag
XWPFRun runToEdit = paragraph.getRuns().get(runCursor);
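// String.replace treats the tag literally; replaceAll would read it as a regex (and fail on tags like {{{TAG}}})
runToEdit.setText(tmpText.toString().replace(surroundedTag, replacement), 0);
// Forget the runs I don't need to remove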
while (runCursor >= 0) {
runsToRemove.remove(0);
runCursor--;
}
// Remove the unused runs
Collections.reverse(runsToRemove);
for (Integer runToRemove : runsToRemove) {
paragraph.removeRun(runToRemove);
}
}
So now I process the runs of the paragraph until I find my surrounded tag, then I walk back through the paragraph to skip the leading runs that I don't need to edit.
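For anyone who only needs to know which runs a tag spans, here is a minimal sketch of a different approach: concatenate the run texts, find the tag once, and map its character range back to run indexes. It works on plain strings so it can be tried without POI; the class and method names (TagRunLocator, runsContaining) are made up for this example, and with POI the list would presumably be filled from run.text() for each run of paragraph.getRuns().

```java
import java.util.ArrayList;
import java.util.List;

final class TagRunLocator {

    /**
     * Returns the indexes of the runs that hold any part of the first
     * occurrence of tag, or an empty list if the tag is not present.
     */
    static List<Integer> runsContaining(List<String> runTexts, String tag) {
        // Join all run texts, exactly as paragraph.getText() would present them.
        StringBuilder all = new StringBuilder();
        for (String text : runTexts) {
            all.append(text);
        }

        List<Integer> result = new ArrayList<Integer>();
        int tagStart = all.indexOf(tag);
        if (tag.isEmpty() || tagStart < 0) {
            return result;
        }
        int tagEnd = tagStart + tag.length(); // exclusive

        // Walk the runs again, tracking each run's character range.
        int runStart = 0;
        for (int i = 0; i < runTexts.size(); i++) {
            int runEnd = runStart + runTexts.get(i).length();
            // A run belongs to the tag if its range overlaps the tag's range.
            if (runStart < tagEnd && runEnd > tagStart) {
                result.add(i);
            }
            runStart = runEnd;
        }
        return result;
    }
}
```

Because the tag is searched as a literal string, the delimiters can be anything (<# and #>, {{{ and }}}, and so on), and it does not matter where the run boundaries fall inside the tag.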

Related

How many times a text appears in webpage - Selenium Webdriver

Hi, I would like to count how many times a text (e.g. "VIM LIQUID MARATHI") appears on a page using Selenium WebDriver (Java). Please help.
I currently check whether the text appears on the page with the following line in the main class:
assertEquals(true,isTextPresent("VIM LIQUID MARATHI"));
and a function to return a boolean
protected boolean isTextPresent(String text){
try{
boolean b = driver.getPageSource().contains(text);
System.out.println(b);
return b;
}
catch(Exception e){
return false;
}
}
... but I do not know how to count the number of occurrences.

The problem with using getPageSource() is that there could be ids, class names, or other parts of the markup which match your String but don't actually appear on the page. I suggest calling getText() on the body element instead, which returns only the page's visible content and not the HTML. If I'm understanding your question correctly, I think that is closer to what you are looking for.
// get the text of the body element
WebElement body = driver.findElement(By.tagName("body"));
String bodyText = body.getText();
// count occurrences of the string
int count = 0;
// search for the String within the text
while (bodyText.contains("VIM LIQUID MARATHI")){
// when match is found, increment the count
count++;
// continue searching from where you left off
bodyText = bodyText.substring(bodyText.indexOf("VIM LIQUID MARATHI") + "VIM LIQUID MARATHI".length());
}
System.out.println(count);
The variable count contains the number of occurrences.
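The same counting loop can be pulled into a small helper that does not rebuild the string on every match. This is only a sketch: the names OccurrenceCounter and count are invented here, and it counts non-overlapping matches, just like the loop above. The text argument would be whatever body.getText() returned.

```java
final class OccurrenceCounter {

    /** Counts non-overlapping occurrences of needle in text. */
    static int count(String text, String needle) {
        if (needle.isEmpty()) {
            return 0; // an empty needle would otherwise loop forever
        }
        int count = 0;
        int from = 0;
        // indexOf(needle, from) resumes the search where the last match ended.
        while ((from = text.indexOf(needle, from)) >= 0) {
            count++;
            from += needle.length();
        }
        return count;
    }
}
```

Usage would then be something like int count = OccurrenceCounter.count(bodyText, "VIM LIQUID MARATHI");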

There are two different ways to do this:
int size = driver.findElements(By.xpath("//*[text()='text to match']")).size();
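This tells the driver to find all of the elements whose text is exactly the given string, and then takes the size of that list.
The second way is to search the HTML, like you said.
int size = driver.getPageSource().split(Pattern.quote("text to match"), -1).length - 1;
This gets the page source, splits the string wherever it finds the match, then counts the number of splits it made. Pattern.quote (from java.util.regex) keeps the text from being read as a regular expression, and the -1 limit keeps trailing empty strings, so a match at the very end of the source is still counted.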

You can try to execute javascript expression using webdriver:
((JavascriptExecutor)driver).executeScript("yourScript();");
If your page uses jQuery you can use jQuery's selectors:
((JavascriptExecutor)driver).executeScript("return jQuery([proper selector]).length");
[proper selector] - this should be a selector that matches the text you are searching for.

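Try this (note that it only counts links, i.e. <a> elements, whose visible text contains the string):
int size = driver.findElements(By.partialLinkText("VIM LIQUID MARATHI")).size();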

openCSV parse individual columns

I wrote a Groovy script that parses a .csv file using tokenize, which ended up not doing exactly what I needed. I am now trying to use the openCSV library, but I am unsure how to parse out individual columns. Here is my code so far:
List<String[]> rows = new CSVReader(
new InputStreamReader(getClass().classLoader.getResourceAsStream(inputFileString)))
.readAll()
rows.each { row ->
row.each { it ->
println it
}
}
and here is my input data:
1,"unknown","positive","full message","I love it."
So what I am trying to figure out is how to print select columns in said row. Thanks in advance; I am trying to get my head around Groovy/Java, as I come from a Ruby background.

Not sure what you mean by '...how to print select columns in said row'
But this script (for example) prints the 4th column for each row:
@Grab( 'net.sf.opencsv:opencsv:2.3' )
import au.com.bytecode.opencsv.CSVReader
// This sets example to a 2 line string
// I'm using it instead of a file, as it makes
// an easier example to follow
def example = '''1,"unknown","positive","full message","I love it."
|2,"tim","negative","whoop!","It's ok"'''.stripMargin()
List<String[]> rows = new CSVReader( new StringReader( example ) ).readAll()
rows.each {
// print the 4th column
println it[ 3 ]
}
That prints:
full message
whoop!
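If "select columns" means several columns at once, a small helper over the already-parsed rows may be enough. The sketch below is plain Java (which Groovy can call directly) and makes no openCSV calls itself; it assumes the rows are the List<String[]> returned by readAll(), and the names ColumnSelector and select are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class ColumnSelector {

    /** Returns, for every row, only the cells at the given zero-based column indexes. */
    static List<String[]> select(List<String[]> rows, int... columns) {
        List<String[]> selected = new ArrayList<String[]>();
        for (String[] row : rows) {
            String[] picked = new String[columns.length];
            for (int i = 0; i < columns.length; i++) {
                int column = columns[i];
                // Rows that are too short yield null instead of throwing.
                picked[i] = (column >= 0 && column < row.length) ? row[column] : null;
            }
            selected.add(picked);
        }
        return selected;
    }
}
```

Calling ColumnSelector.select(rows, 1, 3) on the two example rows would keep the 2nd and 4th columns of each row.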

caret position into the html of JEditorPane

The getCaretPosition method of JEditorPane gives an index into the text-only part of the HTML control. Is there a way to get the index into the HTML source instead?
To be more specific, suppose I have the following HTML text (where | denotes the caret position):
abcd<img src="1.jpg"/>123|<img src="2.jpg"/>
Now getCaretPosition gives 8 while I would need 25 as a result to read out the filename of the image.

I had mostly the same problem and solved it with the following method (I used JTextPane, but it should be the same for JEditorPane):
public int getCaretPositionHTML(JTextPane pane) {
HTMLDocument document = (HTMLDocument) pane.getDocument();
String text = pane.getText();
String x;
Random RNG = new Random();
while (true) {
x = RNG.nextLong() + "";
if (text.indexOf(x) < 0) break;
}
try {
document.insertString(pane.getCaretPosition(), x, null);
} catch (BadLocationException ex) {
ex.printStackTrace();
return -1;
}
text = pane.getText();
int i = text.indexOf(x);
pane.setText(text.replace(x, ""));
return i;
}
It just assumes your JTextPane won't contain all possible Long values ;)

The underlying model of the JEditorPane (some subclass of StyledDocument, in your case HTMLDocument) doesn't actually hold the HTML text as its internal representation. Instead, it has a tree of Elements containing style attributes. It only becomes HTML once that tree is run through the HTMLWriter. That makes what you're trying to do kinda tricky! I could imagine putting some flag attribute on the character element that you're currently on, and then using a specially crafted subclass of HTMLWriter to write out until that marker and count the characters, but that sounds like something of an epic hack. There is probably an easier way to get what you want there, though it's a bit unclear to me what that actually is.

I had the same problem, and solved it with the following code:
editor.getDocument().insertString(editor.getCaretPosition(),"String to insert", null);

I don't think you can make the caret count tags as characters. If your final aim is to read the image filename, you should use the editor kit (here editorPane stands for your JEditorPane instance):
HTMLEditorKit kit = (HTMLEditorKit) editorPane.getEditorKitForContentType("text/html");
For more information about its usage, see the Oracle HTMLEditorKit documentation and this O'Reilly PDF, which contains interesting examples.
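If the goal is only the image filename, one possible route is to skip the HTML source offset entirely and ask the document model which element sits before the caret. The sketch below is an assumption-laden illustration, not a tested recipe for every document: the class CaretImageLookup and its methods are invented names, it parses the HTML with HTMLEditorKit instead of using a live JEditorPane, and it interprets "the image at the caret" as the nearest img element before the caret position.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.Element;
import javax.swing.text.StyleConstants;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

final class CaretImageLookup {

    /** Parses HTML into a document model, as a JEditorPane would hold it. */
    static HTMLDocument parse(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        try {
            kit.read(new StringReader(html), doc, 0);
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException(e);
        }
        return doc;
    }

    /** Returns the src of the nearest img element before the caret, or null. */
    static String imageSourceBefore(HTMLDocument doc, int caret) {
        for (int pos = Math.min(caret, doc.getLength()) - 1; pos >= 0; pos--) {
            Element leaf = doc.getCharacterElement(pos);
            AttributeSet attrs = leaf.getAttributes();
            // In the model an image is a single-character leaf named IMG.
            if (attrs.getAttribute(StyleConstants.NameAttribute) == HTML.Tag.IMG) {
                Object src = attrs.getAttribute(HTML.Attribute.SRC);
                return src == null ? null : src.toString();
            }
        }
        return null;
    }

    /** Model offset just after the first occurrence of marker, or -1. */
    static int offsetAfter(HTMLDocument doc, String marker) {
        try {
            int index = doc.getText(0, doc.getLength()).indexOf(marker);
            return index < 0 ? -1 : index + marker.length();
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With a real editor, the document would come from (HTMLDocument) pane.getDocument() and the caret from pane.getCaretPosition(); offsetAfter exists only to make the sketch easy to try without a GUI.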

What approach to use for parsing a file with fixed length records, when the record layout isn't known until runtime?

I want to parse a file based on a record layout provided in another file.
Basically there will be a definition file, which is a comma delimited list of fields and their respective lengths. There will be many of these, a new one will be loaded each time I run the program.
firstName,text,20
middleInitial,text,1
lastName,text,20
salary,number,10
Then I will display a blank table with the supplied column headings, and an option to add data by clicking a button or whatever - I haven't decided yet.
I also want to have an option to both load data from a file, or save data to a file, with the file matching the format described in the definition file.
For example, a file to load (or one produced by the save function) for the above definition file might look like this.
Adam                DSmith               50000
Brent               GWilliams            45000
Harry               TThompson            47500
What kind of patterns could be useful here, and can anyone give me pointers or a rough guide on how to structure the way the data is internally stored and modeled?
I would like to think I can find my way around the java documentation alright, but if anyone can point me at somewhere to start looking, it would be greatly appreciated!
Thanks

It sounds to me like you have a howToParse file and an infoToParse file, holding the directions for how to parse the information and the information to parse, respectively.
First, I would read in the howToParse file and create some sort of dynamic Parser object. It looks like each line in this file is a different ParsingStep object. You read each line as a String and split it into the three parts of a ParsingStep: field name, type of data, length of data.
// Create new parser to hold parsing steps.
Parser dynamicParser = new Parser();
// Create new scanner to read through parse file.
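// Wrap the name in a java.io.File: new Scanner(String) would scan the name itself, not the file.
Scanner parseFileScanner = new Scanner(new File(howToParseFileName));
// *** Add exception handling as necessary (e.g. FileNotFoundException) *** this is just an example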
// Read till end of file.
while (parseFileScanner.hasNext()) {
String line = parseFileScanner.nextLine(); // Get next line in file.
String[] lineSplit = line.split(","); // Split on comma
String fieldName = lineSplit[0];
String dataType = lineSplit[1];
int dataLength = Integer.parseInt(lineSplit[2].trim()); // Field width in characters
ParsingStep step = new ParsingStep(fieldName, dataType, dataLength);
dynamicParser.addStep(step);
}
parseFileScanner.close();
At this point the parser knows how to parse a line, so you just need to read through the other file and store the information from it, probably in a list.
// Open infoToParse file and start reading.
// Again wrap the name in a java.io.File so the file's contents are read.
Scanner infoScanner = new Scanner(new File(infoToParseFileName));
// Add exception handling.
while (infoScanner.hasNext()) {
String line = infoScanner.nextLine();
// Parse line and return a Person object or maybe just a Map of field names to values
Map<String,String> personMap = dynamicParser.parse(line);
}
infoScanner.close();
The only remaining code is the parser itself, which applies the steps in the order they were added.
public class Parser {
private ArrayList<ParsingStep> steps;
public Parser() {
steps = new ArrayList<ParsingStep>();
}
public void addStep(ParsingStep step) {
steps.add(step);
}
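public Map<String,String> parse(String line) {
// LinkedHashMap (java.util) keeps the fields in definition order.
Map<String,String> map = new LinkedHashMap<String,String>();
String remainingLine = line;
for (ParsingStep step : steps) {
// Assumed contract: each step stores its own field in the map
// and returns the part of the line it did not consume.
remainingLine = step.parse(remainingLine, map);
}
return map;
}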
}
Personally, I would add some error checking in the parse steps just in case the infoToParse file is not in the proper format.
Hope this helps.
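For reference, the whole idea also fits in one small class if the per-field ParsingStep objects are replaced by two parallel lists. This is a minimal sketch under a few assumptions: the names FixedWidthLayout, fromDefinition and parse are invented here, the type column of the definition file is read but not used, values are trimmed of padding, and a record that is shorter than the layout simply yields shorter or empty trailing fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FixedWidthLayout {
    private final List<String> names = new ArrayList<String>();
    private final List<Integer> lengths = new ArrayList<Integer>();

    /** Builds a layout from definition lines such as "firstName,text,20". */
    static FixedWidthLayout fromDefinition(List<String> definitionLines) {
        FixedWidthLayout layout = new FixedWidthLayout();
        for (String line : definitionLines) {
            if (line.trim().isEmpty()) {
                continue; // tolerate blank lines
            }
            String[] parts = line.split(",");
            if (parts.length != 3) {
                throw new IllegalArgumentException("Bad definition line: " + line);
            }
            layout.names.add(parts[0].trim());
            // parts[1] is the data type; this sketch keeps every value as a String.
            layout.lengths.add(Integer.parseInt(parts[2].trim()));
        }
        return layout;
    }

    /** Cuts one record into fields, in definition order, trimming the padding. */
    Map<String, String> parse(String record) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        int offset = 0;
        for (int i = 0; i < names.size(); i++) {
            int end = offset + lengths.get(i);
            // Clamp both ends so a short record does not throw.
            int from = Math.min(offset, record.length());
            int to = Math.min(end, record.length());
            fields.put(names.get(i), record.substring(from, to).trim());
            offset = end;
        }
        return fields;
    }
}
```

The map keys double as the column headings for the table, and writing a record back out is the reverse operation: pad each value to its field length and concatenate.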

DocumentListener slows down Document.setCharacterAttributes method?

This is my first question on this site, though it's not the first time I've come here to clear up my doubts. Awesome site. :)
I'm writing a Java program that highlights code in a JTextPane, and I'm changing the way highlighting is done. I'm using a JTabbedPane to let the user edit more than one file at the same time. I used to perform document highlighting using a Timer; now I've built a highlight queue that runs in a separate thread and implemented a DocumentListener that queues the documents as changes take place.
But I have a really big problem: if I add the document via the DocumentListener, the highlight process takes a really long time, while if I add it in the main class by getting the document directly from the JTextPane, it takes just a few milliseconds.
I've run multiple benchmarks on my code and found that what takes so much time when the document is added from the DocumentListener is the method Document.setCharacterAttributes().
Here is the method that adds documents via DocumentListener:
// eventType: 0 - insertUpdate / 1- removeUpdate
private void queueChange(javax.swing.event.DocumentEvent e, int eventType){
StyledDocument doc = (StyledDocument) e.getDocument();
int changeLength = e.getLength();
int changeOffset = e.getOffset();
int length = doc.getLength();
String title = (String) doc.getProperty("title");
String text;
try {
text = doc.getText(0, length);
if (changeLength != 1) {
Element element = doc.getDefaultRootElement();
int startLn = element.getElement(element.getElementIndex(changeOffset)).getStartOffset();
int endLn = element.getElement(element.getElementIndex(changeOffset + changeLength)).getEndOffset() - 1;
Engine.addDocument(doc, startLn, endLn, title, text);
} else {
if(eventType == 1){
changeOffset = changeOffset - changeLength;
}
int startLn = text.lastIndexOf("\n", changeOffset) + 1;
int endLn = text.indexOf("\n", changeOffset);
if (endLn < 0) {
if (length != startLn) {
endLn = length;
Engine.addDocument(doc, startLn, endLn, title, text);
}
} else if (startLn != endLn && startLn < endLn) {
Engine.addDocument(doc, startLn, endLn, title, text);
}
}
} catch (BadLocationException ex) {
Engine.crashEngine();
}
}
If I add a document with 2k lines with this method, it takes ~1900 ms to highlight the whole document, while if I add the document to the highlight queue by using a caret listening method it takes ~500 ms.
Here's a part of the caret listening method that is used to highlight whole documents when they're loaded:
if (loadFile == true) {
isKey = false;
doc = edit[currentTab].Editor.getStyledDocument();
try {
Highlight.addDocument(doc, 0, doc.getLength(),
Scripts.getTitleAt(currentTab), doc.getText(0, doc.getLength()));
} catch (BadLocationException ex) {
ex.printStackTrace();
}
loadFile = false;
}
Note: the Highlight/Engine.addDocument() method has five parameters: (StyledDocument doc,int start, int end, String tabTitle, String docText). Start and end both indicate the region where highlighting is needed.
I would appreciate any help with this problem, because I've been trying to solve it for a few days and I can't find anything similar on the Internet. :(
Btw, does anyone know the actual difference between Document.setCharacterAttributes and Document.setParagraphAttributes? :P

Maybe you have some kind of recursion in your code that is causing the problem. With the DocumentEvent you should only worry about additions and removals. You don't need to worry about changes since those are attribute changes.
Maybe you add some text, which schedules the highlighting, but then when you change the attributes of the text you schedule another highlighting task.

You can try setting a flag that indicates whether a change comes from the user or from your own API calls. At the beginning of Engine.addDocument(), set the flag to the API state, and reset it once the changes are done.
In your listener, check the flag and skip the changes that come from the API.
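A minimal sketch of that flag, assuming a listener class of your own (GuardedHighlighter, highlight and demo are invented names, and the counter stands in for whatever the real queue does). Note that in a plain DefaultStyledDocument attribute changes are reported through changedUpdate, so leaving that method empty already filters most of them; the flag is an extra guard.

```java
import javax.swing.event.DocumentEvent;
import javax.swing.event.DocumentListener;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.StyleConstants;
import javax.swing.text.StyledDocument;

final class GuardedHighlighter implements DocumentListener {
    // True while our own code is restyling the document.
    private volatile boolean applyingStyles = false;
    private int queuedEdits = 0;

    @Override public void insertUpdate(DocumentEvent e) { queueIfUserEdit(); }
    @Override public void removeUpdate(DocumentEvent e) { queueIfUserEdit(); }
    @Override public void changedUpdate(DocumentEvent e) { /* attribute changes: ignore */ }

    private void queueIfUserEdit() {
        if (!applyingStyles) {
            queuedEdits++; // stand-in for adding the document to the highlight queue
        }
    }

    /** Applies a style with the guard raised, so no new work gets queued. */
    void highlight(StyledDocument doc, int offset, int length) {
        applyingStyles = true;
        try {
            SimpleAttributeSet bold = new SimpleAttributeSet();
            StyleConstants.setBold(bold, true);
            doc.setCharacterAttributes(offset, length, bold, false);
        } finally {
            applyingStyles = false; // always lower the guard again
        }
    }

    int getQueuedEdits() { return queuedEdits; }

    /** Returns the queue count after one user edit and again after highlighting. */
    static int[] demo() {
        DefaultStyledDocument doc = new DefaultStyledDocument();
        GuardedHighlighter highlighter = new GuardedHighlighter();
        doc.addDocumentListener(highlighter);
        try {
            doc.insertString(0, "it's a bold text piece", null);
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }
        int afterTyping = highlighter.getQueuedEdits();
        highlighter.highlight(doc, 7, 4); // restyle the word "bold"
        return new int[] { afterTyping, highlighter.getQueuedEdits() };
    }
}
```

With a highlighter running on a separate thread, a single flag like this can also hide a genuine user edit that arrives while the guard is raised, so it is worth checking whether that matters for your queue.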
You wrote "I use highlights the text by setting the character attributes of a portion of the Document, so the method is not inserting more text". I'm not sure that it doesn't insert text. E.g. you have "it's a bold text piece", then you select "bold" and change its attributes to bold. The original element is split and 3 new elements appear. I didn't test it, but it might call insertUpdate() and removeUpdate().
does anyone know the actual difference between Document.setCharacterAttributes and Document.setParagraphAttributes?
There are paragraph and char attributes. Char attributes are font size, family, style, colors. Paragraph attributes are alignment, indentation, line spacing.
In the element tree, paragraph elements are the parents of the character elements.
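A small sketch may make the difference concrete. It uses a DefaultStyledDocument with two lines of made-up text; the class and method names (AttributeScopes, styledSample) are invented for the example. Bold is applied to a character range only, while the alignment is applied to the whole paragraph that contains the given offset.

```java
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.StyleConstants;
import javax.swing.text.StyledDocument;

final class AttributeScopes {

    /** Builds a two-paragraph document with one character and one paragraph style. */
    static StyledDocument styledSample() {
        DefaultStyledDocument doc = new DefaultStyledDocument();
        try {
            doc.insertString(0, "first line\nsecond line", null);
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }

        // Character attribute: affects exactly the 5 characters of "first".
        SimpleAttributeSet bold = new SimpleAttributeSet();
        StyleConstants.setBold(bold, true);
        doc.setCharacterAttributes(0, 5, bold, false);

        // Paragraph attribute: affects the whole paragraph containing offset 0.
        SimpleAttributeSet centered = new SimpleAttributeSet();
        StyleConstants.setAlignment(centered, StyleConstants.ALIGN_CENTER);
        doc.setParagraphAttributes(0, 1, centered, false);

        return doc;
    }
}
```

Afterwards, doc.getCharacterElement(offset).getAttributes() reports bold only inside "first", whereas doc.getParagraphElement(offset).getAttributes() reports the centered alignment anywhere in the first line and not in the second.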
