Preserving lines with Jsoup


I am using Jsoup to extract some data from HTML. I have this code:
System.out.println("nie jest");
StringBuffer url=new StringBuffer("http://www.darklyrics.com/lyrics/");
url.append(args[0]);
url.append("/");
url.append(args[1]);
url.append(".html");
// extract the relevant classes from our HTML
Document doc=Jsoup.connect(url.toString()).get();
Element lyrics=doc.getElementsByClass("lyrics").first();
Element tracks=doc.getElementsByClass("albumlyrics").first();
// list of tracks
int numberOfTracks=tracks.getElementsByTag("a").size();
Everything works and I can extract the data I want, but when I do:
lyrics.text()
I get the text with no line breaks. How can I keep the line breaks in the displayed text? I read other threads on Stack Overflow about this, but they weren't helpful. I tried something like this:
TextNode tex=TextNode.createFromEncoded(lyrics.text(), lyrics.baseUri());
but I still can't get the text with line breaks. I looked at previous threads about this, such as
Removing HTML entities while preserving line breaks with JSoup
but I can't get the effect I want. What should I do?
Edit: I got the effect I wanted, but I don't think it is a very good solution:
for (Node nn : listOfNodes)
{
    String s = Jsoup.parse(nn.toString()).text();
    if ((nn.nodeName()=="#text" || nn.nodeName()=="h3"))
    {
        buf.append(s+"\n");
    }
}
Does anyone have a better idea?

You could get the text nodes (the text between <br />s) by checking if the node is an instance of TextNode. This should work out for you:
Document document = Jsoup.connect(url.toString()).get();
Element lyrics = document.select(".lyrics").first();
StringWriter buffer = new StringWriter();
PrintWriter writer = new PrintWriter(buffer);
for (Node node : lyrics.childNodes()) {
    if (node.nodeName().equals("h3")) {
        writer.println(((Element) node).text());
    } else if (node instanceof TextNode) {
        writer.println(((TextNode) node).text());
    }
}
System.out.println(buffer.toString());
(Please note that string contents should be compared with the equals() method, not ==; strings are objects, not primitives.)
Oh, I also suggest reading their privacy policy.

Related

Parse HTML using JSOUP - Need specific pattern

I am trying to get the text between tags and save it into variables. In the example below, I want to save the value "return", which is between the em tags. I also need the rest of the text in the p tag:
the em tag value should be "return", and
the p tag value should be only: an item, cancel an order, print a receipt, track your purchases or reorder items.
If some text comes before the em tag, that text should go into a separate variable as well. Basically, if one p contains multiple tags, it should be split and each piece saved into a different variable. If I knew how to get the text that is not inside inner tags, I could retrieve the rest.
This is what I have written so far; it returns just "return", the text inside the em tags.
Here ep is basically doc.select("p"): I select the p tags and then iterate. I am not sure whether this is the right way, so any other approaches are highly appreciated.
String text = "<p><em>return </em>an item, cancel an order, print a receipt, track your purchases or reorder items.</p>";
Elements italic_tags = ep.select("em");
for (Element em : italic_tags) {
    if (em.tagName().equals("em")) {
        System.out.println(em.select("em").text());
    }
}

If you need both the loose text and the text enclosed by the different tags, iterate over Node objects instead of Element objects. I modified your HTML to include more tags so the example is more complete:
String text = "<p><em>return </em>an item, <em>cancel</em> an order, <em>print</em> a receipt, <em>track</em> your purchases or reorder items.</p>";
Document doc = Jsoup.parse(text);
Element ep = doc.selectFirst("p");
List<Node> childNodes = ep.childNodes();
for (Node node : childNodes) {
    if (node instanceof TextNode) {
        // if it's a text, just display it
        System.out.println(node);
    } else {
        // if it's another element, then display its first
        // child which in this case is a text
        System.out.println(node.childNode(0));
    }
}
output:
return
an item,
cancel
an order,
print
a receipt,
track
your purchases or reorder items.

Read a portion of ArrayList for n times of lines?

If you have an HTML page stored in an ArrayList of Strings and you want to, for example, read the whole <div> tag of a certain class, how do you read the following lines until you reach the end of the div tag?
for (String l : line) {
    if (l.contains("<div class=\"somne_class\">")) {
        // read the next n strings in the ArrayList until the </div> tag is reached
    }
}

Generally, it's a bad idea to store an HTML file as a list of raw strings. Why do you store it that way?
Imagine you have a string like <div id="outer_div"><div id="inner_div">Hei!</div></div>. Here you have multiple nested HTML tags on a single line, so you won't easily find the matching closing tag.
Consider using an HTML parser; then you can get the desired tag(s) by type or attribute. There are plenty of HTML parsers implemented in Java. One of the most popular is jsoup.

I agree with Vladimir, you're probably looking for an HTML parser.
To answer the exact question in the post: to simply find the next </div> tag, you can use a for loop instead of a foreach loop.
for (int i = 0; i < line.size(); ++i) {
    String l = line.get(i);
    if (l.contains("<div class=\"somne_class\">")) {
        for (int j = i; j < line.size(); ++j) {
            String l2 = line.get(j);
            if (l2.contains("</div>")) {
                // l2 is the next line that contains a </div> tag
                break; // stop at the first match
            }
        }
    }
}
Note that this might not be the matching closing tag for the opening tag, even if you assume that every tag is in a different line.
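To get closer to the matching closing tag, one option is to count the nesting depth while scanning. The following is only a minimal sketch under stated assumptions: the class DivBlockReader and its method names are made up for illustration, the matching is plain substring search (so comments, uppercase tags, or attribute values containing "<div" would confuse it), and a real HTML parser remains the more robust choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of any library.
class DivBlockReader {

    // Counts non-overlapping occurrences of token in s.
    static int count(String s, String token) {
        int n = 0;
        for (int i = s.indexOf(token); i >= 0; i = s.indexOf(token, i + token.length())) {
            n++;
        }
        return n;
    }

    // Returns the lines from the first line containing openingTag up to and
    // including the line on which the div nesting depth drops back to zero.
    // If the div is never closed, everything up to the end of the list is returned.
    static List<String> readDivBlock(List<String> lines, String openingTag) {
        List<String> block = new ArrayList<>();
        boolean inside = false;
        int depth = 0;
        for (String l : lines) {
            String part = l;
            if (!inside) {
                int start = l.indexOf(openingTag);
                if (start < 0) {
                    continue; // still before the div we are looking for
                }
                inside = true;
                part = l.substring(start); // ignore tags before the opening tag
            }
            depth += count(part, "<div") - count(part, "</div>");
            block.add(l);
            if (depth <= 0) {
                break; // the matching closing tag was on this line
            }
        }
        return block;
    }

    public static void main(String[] args) {
        List<String> line = Arrays.asList(
                "<html>",
                "<div class=\"somne_class\">",
                "<div>inner</div>",
                "text",
                "</div>",
                "<p>after</p>");
        for (String l : readDivBlock(line, "<div class=\"somne_class\">")) {
            System.out.println(l);
        }
    }
}
```

Because the depth is counted per line, several nested tags on one line (as in the outer_div/inner_div example above) are handled too, but only at line granularity: the whole line is returned, not just the part inside the div.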

I recommend using jsoup.
It is nice for parsing and writing HTML files. Although I haven't
dug into it much yet, here is an example of getting all the elements
with the tag div:
Document htmlFile = null;
// Read the HTML file
try {
    htmlFile = Jsoup.parse(new File("path"), "UTF-8"); // path, encoding
} catch (IOException e) {
    e.printStackTrace();
}
// Only query the document if it was parsed successfully
if (htmlFile != null) {
    Elements divs = htmlFile.getElementsByTag("div");
}
You can do much more than this with jsoup.

Get the list of object containing text matching a pattern

I'm currently working with the Apache POI API and I'm trying to edit a Word document with it (*.docx). A document is composed of paragraphs (XWPFParagraph objects), and a paragraph contains text embedded in 'runs' (XWPFRun). A paragraph can have many runs (depending on the text properties, but sometimes it seems random). In my document I can have specific tags which I need to replace with data (all my tags follow this pattern: <#TAG_NAME#>).
So for example, if I process a paragraph containing the text Some text with a tag <#SOMETAG#>, I could get something like this
XWPFParagraph paragraph = ... // Get a paragraph from the document
System.out.println(paragraph.getText());
// Prints: Some text with a tag <#SOMETAG#>
But if I want to edit the text of that paragraph, I need to process the runs, and the number of runs is not fixed. So if I show the content of the runs with this code:
System.out.println("Number of runs: " + paragraph.getRuns().size());
for (XWPFRun run : paragraph.getRuns()) {
System.out.println(run.text());
}
Sometimes it can be like this:
// Output:
// Number of runs: 1
// Some text with a tag <#SOMETAG#>
And other times like this:
// Output:
// Number of runs: 4
// Some text with a tag
// <#
// SOMETAG
// #>
What I need to do is get the first run containing the start of the tag and the indexes of the following runs containing the rest of the tag (if the tag is split across several runs). I've managed to write a first version of that algorithm, but it only works if the beginning of the tag (<#) and the end of the tag (#>) are not themselves split. Here's what I've already done.
So what I would like is an algorithm that can handle this problem and, if possible, works with any given delimiters (not necessarily <# and #>, so that I could use something like {{{ and }}} instead).
Sorry if my English isn't perfect, don't hesitate to ask me to clarify any point you want.

I finally found the answer myself by completely rethinking my original algorithm. I commented the code, so it might help someone who is in the same situation I was in:
// Before using the function, I'm sure that:
// paragraph.getText().contains(surroundedTag) == true
private void editParagraphWithData(XWPFParagraph paragraph, String surroundedTag, String replacement) {
    List<Integer> runsToRemove = new LinkedList<Integer>();
    StringBuilder tmpText = new StringBuilder();
    int runCursor = 0;
    // Process all the runs (in normal order) until the surroundedTag is found
    while (!tmpText.toString().contains(surroundedTag)) {
        tmpText.append(paragraph.getRuns().get(runCursor).text());
        runsToRemove.add(runCursor);
        runCursor++;
    }
    tmpText = new StringBuilder();
    // Process back (in reverse order) to keep only the runs I need to edit/remove
    while (!tmpText.toString().contains(surroundedTag)) {
        runCursor--;
        tmpText.insert(0, paragraph.getRuns().get(runCursor).text());
    }
    // Edit the first run of the tag
    // (replace() treats the tag literally; replaceAll() would interpret it as a regex)
    XWPFRun runToEdit = paragraph.getRuns().get(runCursor);
    runToEdit.setText(tmpText.toString().replace(surroundedTag, replacement), 0);
    // Forget the runs I don't need to remove
    while (runCursor >= 0) {
        runsToRemove.remove(0);
        runCursor--;
    }
    // Remove the unused runs
    Collections.reverse(runsToRemove);
    for (Integer runToRemove : runsToRemove) {
        paragraph.removeRun(runToRemove);
    }
}
So now I process the runs of the paragraph in order until I find my surrounded tag, then I walk back through the paragraph so that I can ignore the leading runs that I don't need to edit.
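The idea of locating which runs a tag spans can also be sketched without POI by modelling the run texts as plain strings. This is an illustration only, not the code above: TagRunLocator and locate are hypothetical names, the list of strings stands in for what run.text() returns for each run, and the tag is assumed to be non-empty.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; it works on plain strings instead of XWPFRun objects.
class TagRunLocator {

    // Returns {firstRun, lastRun} (both inclusive) for the first occurrence
    // of tag in the concatenated run texts, or null if the tag is absent.
    static int[] locate(List<String> runTexts, String tag) {
        StringBuilder joined = new StringBuilder();
        int[] starts = new int[runTexts.size()]; // offset of each run in the joined text
        for (int i = 0; i < runTexts.size(); i++) {
            starts[i] = joined.length();
            joined.append(runTexts.get(i));
        }
        int tagStart = joined.indexOf(tag);
        if (tagStart < 0) {
            return null;
        }
        int tagEnd = tagStart + tag.length() - 1; // offset of the last tag character
        int first = -1;
        int last = -1;
        for (int i = 0; i < starts.length; i++) {
            int runEnd = starts[i] + runTexts.get(i).length() - 1;
            if (first < 0 && tagStart <= runEnd) {
                first = i; // first run that reaches the start of the tag
            }
            if (tagEnd <= runEnd) {
                last = i; // run that contains the last tag character
                break;
            }
        }
        return new int[] {first, last};
    }

    public static void main(String[] args) {
        List<String> runs = Arrays.asList("Some text with a tag ", "<#", "SOMETAG", "#>");
        System.out.println(Arrays.toString(locate(runs, "<#SOMETAG#>"))); // prints [1, 3]
    }
}
```

Since the search runs on the concatenated text, it does not matter where the delimiters are split, and any delimiter pair such as {{{ and }}} works because the tag is matched literally.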

unable to read "/" on device

I am processing an XML document and reading values from it. One of the values I am reading contains a /. This is what the value looks like: M/S John Smith At 4. When I tested on the emulator, it showed the correct value. Now that I have deployed my app to my Samsung Galaxy S2, the value is not read correctly: it just shows M in the value field for that name.
I think it could be because / is a special character. Is there something I can do to escape the special character in the value and read the whole name as it is?
P.S.: I am not an experienced Java developer, so this question may sound stupid to you, but if you have the solution, please let me know.
When I print the value in the console window, this is how it reads in the xmlDocument after parsing: M/S John Smith At 4
This function reads the value:
public static String getCharacterDataFromElement(Element e) {
    Node child = e.getFirstChild();
    if (child instanceof CharacterData) {
        CharacterData cd = (CharacterData) child;
        return cd.getData();
    }
    return "";
}
In the above function, cd.getData() returns M.
After some more debugging:
When I look at the element in the watch window, it has only one child for the other names, but the element that contains / has 3 children. It seems the string is split because of the / in it. I guess I either have to change the function above to read all the child nodes, or I have to escape the character before I pass the value on.

If we're talking about a text node, have you tried Node's getNodeValue()?
public static String getCharacterDataFromElement(Element e)
{
    Node child = e.getFirstChild();
    return child.getNodeValue();
}
Documentation: http://docs.oracle.com/javase/1.4.2/docs/api/org/w3c/dom/Node.html#getNodeValue%28%29

This is what the new method now looks like:
public static String getCharacterDataFromElement(Element e) {
return e.getTextContent();
}
As of now this is working. I don't know for how long, but hopefully until I decide to do the right thing and iterate over the child nodes and concatenate the string values.
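Iterating over the child nodes and concatenating their values could look roughly like the sketch below. It uses only the standard org.w3c.dom API; the class name ChildTextReader and the sample helper are made up for the example, and the assumption that the value arrives as several adjacent text nodes comes from the debugging described in the question.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;

// Hypothetical helper, shown only to illustrate the idea.
class ChildTextReader {

    // Concatenates the data of all direct text (and CDATA) children of e,
    // instead of looking at the first child only.
    static String getCharacterDataFromElement(Element e) {
        StringBuilder sb = new StringBuilder();
        for (Node child = e.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Text) { // CDATASection is a subtype of Text
                sb.append(((Text) child).getData());
            }
        }
        return sb.toString();
    }

    // Builds an element whose value is split across several text nodes,
    // mimicking what was seen in the debugger (an assumption for this demo).
    static Element sample(String... parts) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element e = doc.createElement("name");
            for (String part : parts) {
                e.appendChild(doc.createTextNode(part));
            }
            return e;
        } catch (ParserConfigurationException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        Element e = sample("M", "/", "S John Smith At 4");
        System.out.println(getCharacterDataFromElement(e)); // prints M/S John Smith At 4
    }
}
```

Unlike getTextContent(), this only collects the direct text children, so the text of nested child elements is not included.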

caret position into the html of JEditorPane

The getCaretPosition method of JEditorPane gives an index into the text-only part of the HTML control. Is it possible to get the index into the HTML text?
To be more specific, suppose I have the following HTML text (where | denotes the caret position):
abcd<img src="1.jpg"/>123|<img src="2.jpg"/>
Now getCaretPosition gives 8 while I would need 25 as a result to read out the filename of the image.

I had mostly the same problem and solved it with the following method (I used JTextPane, but it should be the same for JEditorPane):
public int getCaretPositionHTML(JTextPane pane) {
    HTMLDocument document = (HTMLDocument) pane.getDocument();
    String text = pane.getText();
    String x;
    Random rng = new Random();
    // pick a random marker that does not occur in the HTML yet
    while (true) {
        x = rng.nextLong() + "";
        if (text.indexOf(x) < 0) break;
    }
    try {
        // insert the marker at the caret position
        document.insertString(pane.getCaretPosition(), x, null);
    } catch (BadLocationException ex) {
        ex.printStackTrace();
        return -1;
    }
    // find the marker in the HTML source, then remove it again
    text = pane.getText();
    int i = text.indexOf(x);
    pane.setText(text.replace(x, ""));
    return i;
}
It just assumes your JTextPane won't contain all possible Long values ;)

The underlying model of the JEditorPane (some subclass of StyledDocument, in your case HTMLDocument) doesn't actually hold the HTML text as its internal representation. Instead, it has a tree of Elements containing style attributes. It only becomes HTML once that tree is run through the HTMLWriter. That makes what you're trying to do kinda tricky! I could imagine putting some flag attribute on the character element that you're currently on, and then using a specially crafted subclass of HTMLWriter to write out until that marker and count the characters, but that sounds like something of an epic hack. There is probably an easier way to get what you want there, though it's a bit unclear to me what that actually is.

I had the same problem, and solved it with the following code:
editor.getDocument().insertString(editor.getCaretPosition(),"String to insert", null);

I don't think you can convert your caret position so that it counts tags as characters. If your final aim is to read the image filename, you should use the HTMLEditorKit (here editorPane is your JEditorPane instance):
HTMLEditorKit kit = (HTMLEditorKit) editorPane.getEditorKitForContentType("text/html");
For more information about its usage, see the Oracle HTMLEditorKit documentation and this O'Reilly PDF that contains interesting examples.
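If the goal is just the image filename, one possible route is to read it from the document model instead of the HTML source. The sketch below is an assumption-laden illustration rather than a definitive solution: ImageSourceFinder and its methods are hypothetical names, it parses a string with HTMLEditorKit instead of using a live JEditorPane, and it relies on the Swing HTML model exposing each image as a leaf element named HTML.Tag.IMG with an HTML.Attribute.SRC attribute.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.Element;
import javax.swing.text.ElementIterator;
import javax.swing.text.StyleConstants;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper, shown only to illustrate the idea.
class ImageSourceFinder {

    // Parses an HTML string into the same kind of model a JEditorPane uses.
    static HTMLDocument parse(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        try {
            kit.read(new StringReader(html), doc, 0);
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException(e);
        }
        return doc;
    }

    // Collects the src attribute of every img element, in document order.
    static List<String> imageSources(HTMLDocument doc) {
        List<String> sources = new ArrayList<>();
        ElementIterator it = new ElementIterator(doc);
        for (Element e = it.next(); e != null; e = it.next()) {
            AttributeSet attrs = e.getAttributes();
            if (attrs.getAttribute(StyleConstants.NameAttribute) == HTML.Tag.IMG) {
                sources.add(String.valueOf(attrs.getAttribute(HTML.Attribute.SRC)));
            }
        }
        return sources;
    }

    // Returns the src of the first image starting at or after the given
    // model position (for example a caret position), or null if there is none.
    static String imageSourceAfter(HTMLDocument doc, int position) {
        ElementIterator it = new ElementIterator(doc);
        for (Element e = it.next(); e != null; e = it.next()) {
            AttributeSet attrs = e.getAttributes();
            if (attrs.getAttribute(StyleConstants.NameAttribute) == HTML.Tag.IMG
                    && e.getStartOffset() >= position) {
                return String.valueOf(attrs.getAttribute(HTML.Attribute.SRC));
            }
        }
        return null;
    }

    public static void main(String[] args) {
        HTMLDocument doc = parse("abcd<img src=\"1.jpg\">123<img src=\"2.jpg\">");
        System.out.println(imageSources(doc));
    }
}
```

With a live component, the call would presumably be imageSourceAfter((HTMLDocument) pane.getDocument(), pane.getCaretPosition()), which works on model offsets and so avoids converting the caret position into an index in the HTML text.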
