unable to read "/" on device - java

I am processing an XML document and reading values from it. One of the values I am reading has a / in it. This is how the value looks: M/S John Smith At 4. While testing on the emulator, the correct value was shown. Now that I have deployed my app to my Samsung Galaxy S2, the value is not read correctly: the value field for that name shows only M.
I think it could be because / is a special character. Is there something I can do to escape the special character in the value and read the whole name as it is?
P.S.: I am not an experienced Java developer, so this question may sound stupid to you, but if you have the solution, please let me know.
When I print the value in the console window, this is how it reads in the xmlDocument after parsing: M/S John Smith At 4
This function reads the value:
public static String getCharacterDataFromElement(Element e) {
    Node child = e.getFirstChild();
    if (child instanceof CharacterData) {
        CharacterData cd = (CharacterData) child;
        return cd.getData();
    }
    return "";
}
In the above function, cd.getData() returns M.
After some more debugging:
When I look at the element in the watch window, the elements for the other names have only one child, but the element whose value contains / has 3 children. The text appears to be split because of the / in it. I guess I either have to change the function above to read all the child nodes, or I have to escape the character before I pass the value on.

If we're talking about a text node, have you tried Node's getNodeValue()?
public static String getCharacterDataFromElement(Element e)
{
    Node child = e.getFirstChild();
    return child.getNodeValue();
}
Documentation: http://docs.oracle.com/javase/1.4.2/docs/api/org/w3c/dom/Node.html#getNodeValue%28%29

This is what the new method now looks like:
public static String getCharacterDataFromElement(Element e) {
return e.getTextContent();
}
As of now this is working. I don't know for how long, but hopefully until I decide to do the right thing: iterate over the child nodes and concatenate the string values.
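For reference, the child-node iteration mentioned above could look roughly like the sketch below. It is only an illustration: the class name ChildTextReader and the helper sampleElement() are made up, and the three text children are built by hand to imitate the split seen in the debugger rather than produced by the device's parser.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class ChildTextReader {
    // Concatenates the values of all direct text and CDATA children of the element.
    static String getCharacterDataFromElement(Element e) {
        StringBuilder sb = new StringBuilder();
        for (Node child = e.getFirstChild(); child != null; child = child.getNextSibling()) {
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString();
    }

    // Hypothetical helper: builds an element with three text children,
    // imitating the split value observed in the watch window.
    static Element sampleElement() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element name = doc.createElement("name");
            name.appendChild(doc.createTextNode("M"));
            name.appendChild(doc.createTextNode("/"));
            name.appendChild(doc.createTextNode("S John Smith At 4"));
            return name;
        } catch (ParserConfigurationException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // prints "M/S John Smith At 4"
        System.out.println(getCharacterDataFromElement(sampleElement()));
    }
}
```

Unlike getTextContent(), this only looks at direct children, so text inside nested elements is deliberately left out.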

Related

Parse HTML using Jsoup - Need specific pattern

I am trying to get the text between tags and save it into variables, for example:
Here I want to save the value return, which is between the em tags. I also need the rest of the text, which is in the p tag:
the em tag value should be return, and
the p tag value should be only: an item, cancel an order, print a receipt, track your purchases or reorder items.
If some text comes before the em tag, that text should go into a different variable as well. Basically, if one p has multiple tags within it, it should be split and saved into different variables. If I knew how to get the text that is not inside inner tags, I could retrieve the rest.
I have written the code below, but it returns just "return", which is inside the em tags.
Here ep is basically doc.select("p"): I select the p tag and then iterate. I am not sure I am doing this the right way, so any other approaches are highly appreciated.
String text = "<p><em>return </em>an item, cancel an order, print a receipt, track your purchases or reorder items.</p>";
Elements italic_tags = ep.select("em");
for(Element em:italic_tags) {
if(em.tagName().equals("em")) {
System.out.println( em.select("em").text());
}
}
If you need to select each sub text and text enclosed by different tags you need to try selecting Node instead of Element. I modified your HTML to include more tags so the example is more complete:
String text = "<p><em>return </em>an item, <em>cancel</em> an order, <em>print</em> a receipt, <em>track</em> your purchases or reorder items.</p>";
Document doc = Jsoup.parse(text);
Element ep = doc.selectFirst("p");
List<Node> childNodes = ep.childNodes();
for (Node node : childNodes) {
if (node instanceof TextNode) {
// if it's a text, just display it
System.out.println(node);
} else {
// if it's another element, then display its first
// child which in this case is a text
System.out.println(node.childNode(0));
}
}
output:
return
an item,
cancel
an order,
print
a receipt,
track
your purchases or reorder items.

Get the list of object containing text matching a pattern

I'm currently working with the Apache POI API and I'm trying to edit a Word document with it (*.docx). A document is composed of paragraphs (XWPFParagraph objects) and a paragraph contains text embedded in 'runs' (XWPFRun). A paragraph can have many runs (depending on the text properties, but it's sometimes random). In my document I can have specific tags which I need to replace with data (all my tags follow this pattern: <#TAG_NAME#>).
So for example, if I process a paragraph containing the text Some text with a tag <#SOMETAG#>, I could get something like this
XWPFParagraph paragraph = ... // Get a paragraph from the document
System.out.println(paragraph.getText());
// Prints: Some text with a tag <#SOMETAG#>
But if I want to edit the text of that paragraph I need to process the runs, and the number of runs is not fixed. So if I show the content of the runs with this code:
System.out.println("Number of runs: " + paragraph.getRuns().size());
for (XWPFRun run : paragraph.getRuns()) {
System.out.println(run.text());
}
Sometimes it can be like this:
// Output:
// Number of runs: 1
// Some text with a tag <#SOMETAG#>
And other times like this:
// Output:
// Number of runs: 4
// Some text with a tag
// <#
// SOMETAG
// #>
What I need to do is get the first run containing the start of the tag and the indexes of the following runs containing the rest of the tag (if the tag is split across several runs). I've managed to get a first version of that algorithm working, but it only works if the beginning of the tag (<#) and the end of the tag (#>) aren't themselves split. Here's what I've already done.
So what I would like is an algorithm that can handle that problem and, if possible, works with any given tag delimiters (not necessarily <# and #>, so I could replace them with something like {{{ and }}}).
Sorry if my English isn't perfect; don't hesitate to ask me to clarify any point.
Finally I found the answer myself. I completely changed the approach of my original algorithm (I commented it so it might help someone in the same situation I was in):
// Before using the function, I'm sure that:
// paragraph.getText().contains(surroundedTag) == true
private void editParagraphWithData(XWPFParagraph paragraph, String surroundedTag, String replacement) {
    List<Integer> runsToRemove = new LinkedList<Integer>();
    StringBuilder tmpText = new StringBuilder();
    int runCursor = 0;
    // Process the runs (in normal order) until the surroundedTag is found
    while (!tmpText.toString().contains(surroundedTag)) {
        tmpText.append(paragraph.getRuns().get(runCursor).text());
        runsToRemove.add(runCursor);
        runCursor++;
    }
    tmpText = new StringBuilder();
    // Walk back (in reverse order) to keep only the runs I need to edit/remove
    while (!tmpText.toString().contains(surroundedTag)) {
        runCursor--;
        tmpText.insert(0, paragraph.getRuns().get(runCursor).text());
    }
    // Edit the first run of the tag; replace() treats the tag as literal text,
    // whereas replaceAll() would interpret it as a regular expression
    XWPFRun runToEdit = paragraph.getRuns().get(runCursor);
    runToEdit.setText(tmpText.toString().replace(surroundedTag, replacement), 0);
    // Forget the runs I don't need to remove
    while (runCursor >= 0) {
        runsToRemove.remove(0);
        runCursor--;
    }
    // Remove the unused runs, highest index first
    Collections.reverse(runsToRemove);
    for (Integer runToRemove : runsToRemove) {
        paragraph.removeRun(runToRemove);
    }
}
So now I process the runs of the paragraph until I find my surrounded tag, then I walk back through the paragraph to skip the leading runs that I don't need to edit.
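The same idea can be tried out without a Word document by working on the run texts alone. The sketch below is an illustration, not the POI code above: TagRunLocator and locate are invented names, the run texts are passed in as plain strings, and the tag is assumed to be non-empty. It returns the index of the first and last run that a tag spans, even when the delimiters themselves are split.

```java
import java.util.Arrays;
import java.util.List;

class TagRunLocator {
    // Returns {firstRun, lastRun} (inclusive) for the first occurrence of tag
    // in the concatenated run texts, or null if the tag does not occur.
    static int[] locate(List<String> runTexts, String tag) {
        StringBuilder joined = new StringBuilder();
        int[] starts = new int[runTexts.size()];
        for (int i = 0; i < runTexts.size(); i++) {
            starts[i] = joined.length();      // offset of run i in the joined text
            joined.append(runTexts.get(i));
        }
        int tagStart = joined.indexOf(tag);
        if (tagStart < 0) {
            return null;
        }
        int tagEnd = tagStart + tag.length() - 1; // offset of the tag's last character
        int first = -1;
        int last = -1;
        for (int i = 0; i < starts.length; i++) {
            if (runTexts.get(i).isEmpty()) {
                continue;                     // empty runs cannot hold part of the tag
            }
            int runEnd = starts[i] + runTexts.get(i).length() - 1;
            if (first < 0 && tagStart <= runEnd) {
                first = i;                    // first run reaching the tag's start
            }
            if (tagEnd <= runEnd) {
                last = i;                     // run containing the tag's last character
                break;
            }
        }
        return new int[] {first, last};
    }

    public static void main(String[] args) {
        List<String> runs = Arrays.asList("Some text with a tag ", "<#", "SOMETAG", "#>");
        // prints "[1, 3]"
        System.out.println(Arrays.toString(locate(runs, "<#SOMETAG#>")));
    }
}
```

Because the search is a plain indexOf on the joined text, any delimiters work, including {{{ and }}}.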

Preserving lines with Jsoup

I am using Jsoup to get some data from html, I have this code:
System.out.println("nie jest");
StringBuffer url=new StringBuffer("http://www.darklyrics.com/lyrics/");
url.append(args[0]);
url.append("/");
url.append(args[1]);
url.append(".html");
// extracting the relevant classes from our HTML
Document doc=Jsoup.connect(url.toString()).get();
Element lyrics=doc.getElementsByClass("lyrics").first();
Element tracks=doc.getElementsByClass("albumlyrics").first();
// list of tracks
int numberOfTracks=tracks.getElementsByTag("a").size();
Everything would be fine, since I extract the data I want, but when I do:
lyrics.text()
I get the text with no line breaks. How can I keep the line breaks in the displayed text? I read other threads on Stack Overflow about this, but they weren't helpful. I tried something like this:
TextNode tex=TextNode.createFromEncoded(lyrics.text(), lyrics.baseUri());
but I still can't get the text with line breaks. I looked at previous threads about this, such as
Removing HTML entities while preserving line breaks with JSoup
but I can't get the effect I want. What should I do?
Edit: I got the effect I wanted, but I don't think it is a very good solution:
for (Node nn:listOfNodes)
{
String s=Jsoup.parse(nn.toString()).text();
if ((nn.nodeName()=="#text" || nn.nodeName()=="h3"))
{
buf.append(s+"\n");
}
}
Does anyone have a better idea?
You could get the text nodes (the text between <br />s) by checking if the node is an instance of TextNode. This should work out for you:
Document document = Jsoup.connect(url.toString()).get();
Element lyrics = document.select(".lyrics").first();
StringWriter buffer = new StringWriter();
PrintWriter writer = new PrintWriter(buffer);
for (Node node : lyrics.childNodes()) {
if (node.nodeName().equals("h3")) {
writer.println(((Element) node).text());
} else if (node instanceof TextNode) {
writer.println(((TextNode) node).text());
}
}
System.out.println(buffer.toString());
(please note that comparing the object's internal value should be done by equals() method, not ==; strings are objects, not primitives)
Oh, I also suggest to read their privacy policy.

Recursive Traverse of Binary Tree Not Terminating At Return Statement

I have created a class that populates a binary tree with morse code, where traversing to the left signifies a DOT and traversing to the right signifies a DASH. Everything was going great until I started writing an encode method to convert an alphabetic character into a morse code string. The method should recursively do a preorder traversal of the tree (building a string of the morse code along the way) until it finds the target character, and then return that string.
However, for some reason my recursion won't terminate on my base case. It just keeps running the entire traversal. I attached my code for the method below. Why does the return statement in the if statement not trigger and end the method?
Sorry if this is ambiguous, but I didn't want to post 300 lines of code for my entire project when someone smarter than I am would notice the problem right off.
Thanks for any help.
//wrapper class
//#parameter character is the character to be encoded
//#return return the morse code as a string corresponding to the character
public String encode(char character){
return encode(morseTree, character, "");
}
//#parameters tree is the binary tree to be searched,
//target is the character being looked for, s is the string being used to build the morse code
//#return returns the morse code that corresponds to the target being checked
public String encode(BinaryTree<Character> tree, char target, String s){
    if(tree.getData() == target){ //if the data at the current tree is equal to the target element
        //return the string that is holding the morse code pattern for this current traversal
        return s;
    }else{
        if(tree.getLeftSubtree() != null){
            //Traverse the left side, add a DOT to the end of a string to change the morse code
            encode(tree.getLeftSubtree(), target, s + DOT);
        }
        if(tree.getRightSubtree() != null){
            //Traverse the right side, add a DASH to the end of a string to change the morse code
            encode(tree.getRightSubtree(), target, s + DASH);
        }
    }
    //The code should never get this far!
    return s;
}
Your calls in the else block don't return - they probably should, like this:
if (tree.getLeftSubtree() != null) {
    // Traverse the left side, add a DOT to the end of a string to
    // change the morse code
    return encode(tree.getLeftSubtree(), target, s + DOT);
}
if (tree.getRightSubtree() != null) {
    // Traverse the right side, add a DASH to the end of a string to
    // change the morse code
    return encode(tree.getRightSubtree(), target, s + DASH);
}
However, what do you want to happen if both the left and right subtrees are null? And if they're both non-null, what do you want to return?
Note that when your base case returns, it only returns from that single call, not from all the other calls in the stack. Recursing doesn't replace the stack frame with the new call; it just adds another stack frame [1]. Returning from that new stack frame just gets you back to where you were.
[1] Yes, I know about tail recursion. Let's not confuse things though.
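One possible way to answer the two open questions above is to return null for "not found" and to try the right subtree only when the left one fails. The sketch below assumes that convention; MorseSearch and its nested Node class are stand-ins invented for the example, not the asker's BinaryTree class, and "." and "-" stand in for the DOT and DASH constants.

```java
class MorseSearch {
    // Minimal stand-in for a binary tree node (hypothetical, for illustration only).
    static final class Node {
        final char data;
        final Node left;
        final Node right;

        Node(char data, Node left, Node right) {
            this.data = data;
            this.left = left;
            this.right = right;
        }
    }

    // Returns the morse path to target, or null if target is not in this subtree.
    static String encode(Node tree, char target, String path) {
        if (tree == null) {
            return null;                    // ran off the tree: nothing found here
        }
        if (tree.data == target) {
            return path;                    // base case: the path built so far is the code
        }
        String found = encode(tree.left, target, path + ".");
        if (found != null) {
            return found;                   // propagate the hit up the call stack
        }
        return encode(tree.right, target, path + "-");
    }

    // Small sample tree: root, then E/T, then I/A and N/M.
    static Node sampleTree() {
        Node e = new Node('e', new Node('i', null, null), new Node('a', null, null));
        Node t = new Node('t', new Node('n', null, null), new Node('m', null, null));
        return new Node(' ', e, t);
    }

    public static void main(String[] args) {
        System.out.println(encode(sampleTree(), 'a', "")); // prints ".-"
        System.out.println(encode(sampleTree(), 'n', "")); // prints "-."
    }
}
```

Each call hands its result back to its caller, which is what makes the search stop as soon as the target is found.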

Optimizing a lot of Scanner.findWithinHorizon(pattern, 0) calls

I'm building a process which extracts data from 6 csv-style files and two poorly laid out .txt reports and builds output CSVs, and I'm fully aware that there's going to be some overhead searching through all that whitespace thousands of times, but I never anticipated converting about 50,000 records would take 12 hours.
Excerpt of my manual matching code (I know it's horrible that I use lists of tokens like that, but it was the best thing I could think of):
public static String lookup(Pattern tokenBefore,
List<String> tokensAfter)
{
String result = null;
while(_match(tokenBefore)) { // block until all input is read
if(id.hasNext())
{
result = id.next(); // capture the next token that matches
if(_matchImmediate(tokensAfter)) // try to match tokensAfter to this result
return result;
} else
return null; // end of file; no match
}
return null; // no matches
}
private static boolean _match(List<String> tokens)
{
return _match(tokens, true);
}
private static boolean _match(Pattern token)
{
if(token != null)
{
return (id.findWithinHorizon(token, 0) != null);
} else {
return false;
}
}
private static boolean _match(List<String> tokens, boolean block)
{
if(tokens != null && !tokens.isEmpty()) {
if(id.findWithinHorizon(tokens.get(0), 0) == null)
return false;
for(int i = 1; i <= tokens.size(); i++)
{
if (i == tokens.size()) { // matches all tokens
return true;
} else if(id.hasNext() && !id.next().matches(tokens.get(i))) {
break; // break to blocking behaviour
}
}
} else {
return true; // empty list always matches
}
if(block)
return _match(tokens); // loop until we find something or nothing
else
return false; // return after just one attempted match
}
private static boolean _matchImmediate(List<String> tokens)
{
if(tokens != null) {
for(int i = 0; i <= tokens.size(); i++)
{
if (i == tokens.size()) { // matches all tokens
return true;
} else if(!id.hasNext() || !id.next().matches(tokens.get(i))) {
return false; // doesn't match, or end of file
}
}
return false; // we have some serious problems if this ever gets called
} else {
return true; // empty list always matches
}
}
Basically I'm wondering how I would work in an efficient string search (Boyer-Moore or similar). My Scanner id is scanning a java.lang.String; I figured buffering the file into memory would reduce I/O, since the search here is being performed thousands of times on a relatively small file. The performance increase compared to scanning a BufferedReader(FileReader(File)) was probably less than 1%, and the process still looks to be taking a LONG time.
I've also traced execution, and the slowness of my overall conversion process is definitely between the first and last line of the lookup method. In fact, so much so that I ran a shortcut process to count the number of occurrences of various identifiers in the .csv-style files (I use 2 lookup methods, this is just one of them) and the process completed indexing approx 4 different identifiers for 50,000 records in less than a minute. Compared to 12 hours, that's instant.
Some notes (updated 6/6/2010):
I still need the pattern-matching behaviour for tokensBefore.
All ID numbers I need don't necessarily start at a fixed position in a line, but it's guaranteed that after the ID token is the name of the corresponding object.
I would ideally want to return a String, not the start position of the result as an int or something.
Anything to help me out, even if it saves 1ms per search, will help, so all input is appreciated. Thank you!
Usage scenario 1: I have a list of objects in file A, who in the old-style system have an id number which is not in file A. It is, however, POSSIBLY in another csv-style file (file B) or possibly still in a .txt report (file C) which each also contain a bunch of other information which is not useful here, and so file B needs to be searched through for the object's full name (1 token since it would reside within the second column of any given line), and then the first column should be the ID number. If that doesn't work, we then have to split the search token by whitespace into separate tokens before doing a search of file C for those tokens as well.
Generalised code:
String field;
for (/* each record in file A */)
{
/* construct the rest of this object from file A info */
// now to find the ID, if we can
List<String> objectName = new ArrayList<String>(1);
objectName.add(Pattern.quote(thisObject.fullName));
field = lookup(objectSearchToken, objectName); // search file B
if(field == null) // not found in file B
{
lookupReset(false); // initialise scanner to check file C
objectName.clear(); // not using the full name
String[] tokens = thisObject.fullName.split(id.delimiter().pattern());
for(String s : tokens)
objectName.add(Pattern.quote(s));
field = lookup(objectSearchToken, objectName); // search file C
lookupReset(true); // back to file B
} else {
/* found it, file B specific processing here */
}
if(field != null) // found it in B or C
thisObject.ID = field;
}
The objectName tokens are all uppercase words with possible hyphens or apostrophes in them, separated by spaces (a person's name).
As per aioobe's answer, I have pre-compiled the regex for my constant search tokens, which in this case is just \r\n. The speedup noticed was about 20x in another one of the processes, where I compiled [0-9]{1,3}\\.[0-9]%|\r\n|0|[A-Z'-]+, although it was not noticed in the above code with \r\n. Working along these lines, it has me wondering:
Would it be better for me to match \r\n[^ ] if the only usable matches will be on lines beginning with a non-space character anyway? It may reduce the number of _match executions.
Another possible optimisation is this: concatenate all tokensAfter, and put a (.*) beforehand. It would reduce the number of regexes (all of which are literal anyway) that would be compiled by about 2/3, and also hopefully allow me to pull out the text from that grouping instead of keeping a "potential token" from every line with an ID on it. Is that also worth doing?
The above situation could be resolved if I could get java.util.Scanner to return the token previous to the current one after a call to findWithinHorizon.
Something to start with: Every single time you run id.next().matches(tokens.get(i)) the following code is executed:
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(input);
return m.matches();
Compiling a regular expression is non-trivial and you should consider compiling the patterns once and for all in your program:
pattern[i] = Pattern.compile(tokens.get(i));
And then simply invoke something like
pattern[i].matcher(str).matches()
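Put together, the precompiling suggestion might look like the sketch below. The class and method names (PrecompiledTokens, compileAll, matchesAll) are made up for the example, and the tokens are passed in as a list instead of being pulled from the Scanner, so this only illustrates the compile-once idea and is not a drop-in replacement for _matchImmediate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class PrecompiledTokens {
    // Compile every regex once, up front, instead of on every comparison.
    static List<Pattern> compileAll(List<String> regexes) {
        List<Pattern> patterns = new ArrayList<Pattern>(regexes.size());
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
        return patterns;
    }

    // True if each pattern fully matches the token at the same position.
    static boolean matchesAll(List<Pattern> patterns, List<String> tokens) {
        if (tokens.size() < patterns.size()) {
            return false;                   // not enough input tokens left
        }
        for (int i = 0; i < patterns.size(); i++) {
            if (!patterns.get(i).matcher(tokens.get(i)).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Pattern> name = compileAll(Arrays.asList(Pattern.quote("JOHN"), Pattern.quote("O'NEIL")));
        System.out.println(matchesAll(name, Arrays.asList("JOHN", "O'NEIL"))); // prints "true"
        System.out.println(matchesAll(name, Arrays.asList("JOHN", "SMITH")));  // prints "false"
    }
}
```

For tokens that are pure literals, as produced by Pattern.quote, a plain String.equals comparison avoids the regex engine entirely and may be worth measuring as well.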
