Jsoup: Linebreaks in URL cannot be removed - java

I'm collecting links from HTML and using jsoup to get the links which I add to a collection. The problem is that I need each link to be on one line so when written to a file it can be parsed line by line.
The input is a WARC record and for each record I want to get all links.
The getContentUTF8() and getHeaderMetadataItem() methods come from a WarcRecord API found here.
The code:
String baseURL = getHeaderMetadataItem("WARC-Target-URI");
Vector<String> retVec = new Vector<String>();
Document doc = Jsoup.parse(getContentUTF8(), baseURL);
Elements links = doc.select("a[href]");
for (Element link : links){
String newLink = link.absUrl("href").replace("\n", "");
retVec.add(newLink);
System.out.println(newLink);
}
When writing stdout to a file, some links get split over two lines, for example:
1: http://somelink.com/submit?url=
2: http://someotherlink.net
While other might look like this (the way I want them):
1: http://somesite.com/submit?url=http://someothersite.com/
It looks like it always happens after a =.
EDIT: Added more information. Removing both carriage returns and newlines fixed some cases. However, now I am encountering tab characters within the absolute URLs from jsoup. I checked some of the source sites and they actually do have tabs inside the href values. It seems there can be a lot of cases to handle. I would like to think that there is a general solution for catching them?
<a class="MenuButton " href="/ features"> <em> Features </em> </a>
turns into the absolute URL:
http://archinect.com/ features
Since I store it in the file in the form URI \t <list of links>, this will break when I parse it.
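The tab/newline cases described in the edit can all be handled at once (a sketch, not from the original post): \n, \r and \t are all whitespace, and a URL cannot legally contain literal whitespace, so a single replaceAll("\\s+", "") on the absolute URL covers every case described above. UrlCleaner is a hypothetical helper name:

```java
// Sketch of a general cleanup for the whitespace cases described above.
// \s matches spaces, tabs, \r and \n alike, so one call handles all of them.
public class UrlCleaner {
    static String clean(String absUrl) {
        return absUrl.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        System.out.println(clean("http://archinect.com/ features"));
        System.out.println(clean("http://somelink.com/submit?url=\nhttp://someotherlink.net"));
    }
}
```

The same call can stand in for the .replace("\n", "") in the loop above.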

Related

text content located in second br tag can't be printed

I'm trying to print the text content located in the second br tag using the following XPath, but the text from all the br tags is printed to the console instead. What might be the reason?
driver.findElement(By.xpath("//*[text()[contains(.,'Telefon')]]")).getText();
The reason you can't get the text is because the text is not in the br tag.
A <br /> opens and closes in the same tag, so no text can sit inside it.
Additionally, if you read a bit more about it, even the /> is surplus to requirements. If you had just <br> the text wouldn't be contained within it because:
The <br> tag is an empty tag which means that it has no end tag.
The point is, all your text belongs to the h2. You need to deal with that the best you can.
To solve your issue you'll need to:
.getAttribute("innerHTML") - this will give you all the text of the h2 with the br tag
split your string on the string <br> (please note that in chrome my <br /> becomes <br> - you might need to adjust this)
either select item[2] directly, or use a lambda to pick the item that contains your text (do whichever you feel more comfortable with)
And those steps look like this:
//Get the element,
var h2Element = driver.findElement(By.xpath("//*[text()[contains(.,'Telefon')]]"));
var myTextArray = h2Element.getAttribute("innerHTML").split("<br>");
//approach 1 - just take the item at index [2]
var approach1Text = myTextArray[2];
System.out.println(approach1Text);
//approach 2 - use a lambda to select by contains
var approach2Text = Arrays.stream(myTextArray).filter(a -> a.contains("Telefon")).findFirst().get();
System.out.println(approach2Text);
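The split-and-filter step can be exercised without a browser. Assuming the h2's innerHTML looks something like the string below (hypothetical markup; the real page will differ), the selection logic on its own is:

```java
import java.util.Arrays;

public class BrSplitDemo {
    // Pick the fragment containing the wanted label, as in approach 2 above
    static String pick(String innerHtml, String label) {
        return Arrays.stream(innerHtml.split("<br>"))
                .filter(p -> p.contains(label))
                .findFirst()
                .orElse("")
                .trim();
    }

    public static void main(String[] args) {
        // Hypothetical innerHTML for the h2; the real markup may differ
        String innerHtml = "Adres: Some Street 1<br>Fax: 000 111<br>Telefon: 123 456";
        System.out.println(pick(innerHtml, "Telefon"));
    }
}
```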
For a bonus note - you probably had fun getting your xpath to work, because the br tag splits the text into separate text nodes. As a result your h2 actually has multiple text() values. It has text(), text()[2], text()[3], etc - as many as there are <br>s.
I put together a simple page to test this for you - just to show you what's going on: (note the xpath in dev tools)
This is text()[3] because xpaths are indexed from 1 (compared to the Java code above, which starts at 0). However, that's just an example of why it's tricky; I wouldn't recommend you do it that way.
The easy way to eliminate the effect of <br> (and other tags!) on the text is to use normalize-space().
An xpath like this works and is relatively simple to follow:
//h2[contains(normalize-space(),'Telefon')]
It maps to my sample page OK.
I share this extra bit in case you have any more text-split objects and it helps you down the line.
...All that said - good work on getting your original xpath to work. That's good too.
Note that driver.findElement() returns the first element matching the given XPath. To get all the matching elements in Selenium you would use driver.findElements() instead.
Try changing your xpath expression to
//h2/br/following-sibling::text()[contains(.,'Telefon')]
and see if it works.

How to preserve the meaning of tags like <br>, <ul>, <li>, <p> etc. when reading them in Java using the JSoup library?

I am writing a program that extracts some certain information from local HTML files. That information is then shown on a Java JFrame and is exported to an excel file. (I am using JSoup 1.9.2 library for the HTML parsing purposes)
I am running into an issue where, whenever I extract anything from an HTML file, JSoup does not take HTML tags like break tags, line tags etc. into account, so all the information is extracted as one big chunk of data without any proper newlines or formatting.
To show you an example, if this is the data that I want to read :
Title
Line 1
Line 2
Unordered List
element 1
element 2
The data is coming back as :
Title Line 1 Line 2 Unordered List element 1 element 2 (i.e. all the HTML tags are ignored)
This is the piece of code that I am using for reading in :
private String getTitle(Document doc) { // doc is the local HTML file
Elements title = doc.select(".title");
for (Element id : title) {
return id.text();
}
return "No Title Available ";
}
Can anyone suggest a way to preserve the meaning of the HTML tags, so that I can both display the data on the JFrame and export it to Excel in a more readable format?
Thanks.
Just to give everyone an update, I was able to find a solution (more like a workaround) to the formatting issue. What I am doing now is extracting the complete HTML using id.html() and storing it in a String. Then I use replaceAll() with a regular expression to get rid of all the HTML tags without pushing everything onto a single line: replaceAll("\\<[^>]*>", ""). My whole processHTML() function looks something like:
private String processHTML(String initial) { //initial is the String with all the HTML tags
String modified = initial;
modified = modified.replaceAll("\\<[^>]*>",""); //regular expression used
modified = modified.trim(); //To get rid of any unwanted space before and after the needed data
//All the replaceAll() calls below get rid of any HTML entities that might be left in the data extracted from the HTML
modified = modified.replaceAll("&nbsp;", " ");
modified = modified.replaceAll("&lt;", "<");
modified = modified.replaceAll("&gt;", ">");
modified = modified.replaceAll("&amp;", "&");
modified = modified.replaceAll("&quot;", "\"");
modified = modified.replaceAll("&apos;", "\'");
modified = modified.replaceAll("&cent;", "¢");
modified = modified.replaceAll("&copy;", "©");
modified = modified.replaceAll("&reg;", "®");
return modified;
}
Thank you all again for helping me with this.
Cheers.
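As an alternative sketch (not from the original answer): converting <br> and the closing tags of block elements into newlines before stripping the remaining tags keeps the line structure that plain tag-stripping loses, using only java.lang.String. HtmlToText is a hypothetical helper name:

```java
public class HtmlToText {
    static String toText(String html) {
        return html
                .replaceAll("(?i)<br\\s*/?>", "\n")          // line breaks become newlines
                .replaceAll("(?i)</(p|li|ul|ol|h\\d)>", "\n") // ends of block elements too
                .replaceAll("<[^>]*>", "")                   // strip whatever tags remain
                .replaceAll("&nbsp;", " ")
                .trim();
    }

    public static void main(String[] args) {
        System.out.println(toText(
            "<h1>Title</h1><p>Line 1<br>Line 2</p><ul><li>element 1</li><li>element 2</li></ul>"));
    }
}
```

This is the same idea as the regex workaround above, but with the tag-to-newline replacements done first so the formatting survives.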

jSoup get data using td-class tags from webpage

I would like to get data from http://www.futbol24.com/Live/?__igp=1&LiveDate=20141104 using jSoup. I know how to use jSoup - but I am finding it difficult to pinpoint the data that I need.
I would like the Time, Home Team and Away Team from each row of the tbody table. So the output from the first row should be:
08:30 Persipura Jayapura Pelita Bandung Raya
I can see the td class of each of these elements as "status alt", "home" and "guest".
Currently I have tried the below, but it doesn't seem to output anything... what am I doing wrong?
matches = new ArrayList<Match>();
//getHistory
String website = "http://www.futbol24.com/Live/?__igp=1&LiveDate=20141104";
Document doc = Jsoup.connect(website).get();
Element tblHeader = doc.select("tbody").first();
List<Match> data = new ArrayList<>();
for (Element element1 : tblHeader.children()){
Match match = new Match();
match.setTimeOfMatch(element1.select("td.status.alt").text());
match.setHomeTeam(element1.select("td.home").text());
match.setAwayTeam(element1.select("td.guest").text());
data.add(match);
System.out.println(data.toString());
}
Does anybody know how I can use jSoup to get these elements from each row of the table?
Thanks,
Rob
It seems the content of this site is generated via AJAX. Jsoup can't handle this, since it is not a browser and does not interpret JavaScript. To solve this scraping problem you may need something like Selenium WebDriver. I gave a longer answer to a generalized question about this before, so please look here:
Jsoup get dynamically generated HTML

Integrating a stemmer with Jsoup

I want to apply the KrovetzStemmer to the pages that I download. The biggest problem I have is that I cannot simply use body().text() on the given document and then stem all the words, because the href links must not be stemmed at all. So I thought that if I could get the body with the href links intact, I could split it around the links and then use a LinkedHashMap with Element keys and Boolean (or enum) values specifying whether each Element is text or a link.
So the problem is: let's say I am given this HTML:
<!DOCTYPE html>
<html>
<body>
<h1> This is the heading part. This is for testing purposes only.</h1>
<a href="...">First Link</a>
<p>This is the first paragraph to be considered.</p>
<a href="...">Second Link</a>
<p>This is the second paragraph to be considered.</p>
<img border="0" src="/images/pulpit.jpg" alt="Pulpit rock" width="304" height="228">
<a href="...">Third Link</a>
</body>
</html>
I want to be able to only get this:
This is the heading part. This is for testing purposes only.
First Link
This is the first paragraph to be considered.
Second Link
This is the second paragraph to be considered.
Third Link
And then split them and insert them into a LinkedHashMap, so that if I do something like this:
int i = 1;
for (Entry<Element, Boolean> entry : splitedList.entrySet()) {
if(!entry.getValue()) { System.out.println(i + ": " + entry.getKey());}
i++;
}
Then it would print:
1: This is the heading part. This is for testing purposes only.
3: This is the first paragraph to be considered.
5: This is the second paragraph to be considered.
So that I can apply stemming and keep the order of iteration.
Now, I have no idea how to implement this as I don't know how to:
a) Get the body text with href links only
b) Split the body (I know that with Strings we can always use split(), but I am talking about the Elements of the body of a page)
How would I be able to do these two things above?
Also I am not too sure whether my solution is a good solution or not. Are there better/easier ways to do that?
Now that I understand your requirement, I am updating the post with the new answer here:
So, considering you have the Document doc from parsing the given HTML,
you can get all the a tags and wrap them in <xmp> tags (look here):
for (Element element : doc.body().select("a"))
element.wrap("<xmp></xmp>");
Now you need to load the new HTML back into doc, so that Jsoup will avoid parsing the content inside the <xmp> tags:
doc = Jsoup.parse(doc.html());
System.out.println(doc.body().text());
The output would be:
This is the heading part. This is for testing purposes only.
First Link
This is the first paragraph to be considered.
Second Link
This is the second paragraph to be considered.
Third Link
Now you can go ahead and do what you want with the output.
Updating the code based on the comment for splitting
for (Element element : doc.body().select("a"))
element.wrap("<xmp>split-me-here</xmp>split-me-here");
doc = Jsoup.parse(doc.html());
int cnt = 0;
List<String> splitText = Arrays.asList(doc.body().text().split("split-me-here"));
for (String text : splitText) {
cnt++;
if (!text.contains("</a>"))
System.out.println(cnt + "." + text.trim());
}
The above code will print following output:
1.This is the heading part. This is for testing purposes only.
3.This is the first paragraph to be considered.
5.This is the second paragraph to be considered.

How to keep line breaks when using Jsoup.parse?

This is not a duplicate. There was a similar question, but none of those answers can deal with a real HTML file. One can save any HTML page, even this one, and try to run any of the solutions to that answer... none of them solves the problem completely.
The question is
I have a saved .htm file on my desktop. I need to get pure text from it. However, I do need to keep the line breaks so that the text is not all on just one or a couple of lines.
I tried the following and all methods from here
FileInputStream in = new FileInputStream("C:\\...myfile.htm");
String htmlText = IOUtils.toString(in);
for (String line : htmlText.split("\n")) {
String stripped = Jsoup.parse(line).text();
System.out.println(stripped);
}
This preserves only the physical lines of the html file. However, the text is still messed up, because tags such as </br> and <p> got removed. How can I parse it so that the text preserves all natural line breaks?
This is something I've noticed as a difference between jsoup and, say, Selenium: Selenium keeps the line breaks when extracting text and jsoup does not. With that said, I think the best route is to get the innerHtml of the node you are trying to extract text from, then do a replaceAll on that innerHtml to replace </br> and <p> with line breaks.
As a more complete solution, instead of reading the text file line by line, you can traverse the html more natively: walk the tree with a recursive function and, when you hit a TextNode, add that text to the stripped variable from your example. Then, when you hit a <p> or </br> element, add a linefeed as need be.
Something like:
Document doc = Jsoup.parse(htmlText);
Then pass that in a recursive function for each child node:
String getText(Element parentElement) {
String working = "";
for (Node child : parentElement.childNodes()) {
if (child instanceof TextNode) {
working += ((TextNode) child).text();
}
if (child instanceof Element) {
Element childElement = (Element)child;
// do more of these for p or other tags you want a new line for
if (childElement.tag().getName().equalsIgnoreCase("br")) {
working += "\n";
}
working += getText(childElement);
}
}
return working;
}
Then you can just call the function to strip the text.
strippedText = getText(doc);
Not the simplest solution, but one I can think of that should work if you want to extract all the text from an HTML file. I haven't run this code, just wrote it now, so if I missed something, I apologize. But it should give you the general idea.
