Integrating a stemmer with Jsoup - java

I want to apply a KrovetzStemmer to the pages that I download. The biggest problem is that I cannot simply call body().text() on the parsed document and then stem all the words, because I also need the href links, which should not be stemmed at all. So I thought that if I could get the body text together with the href links, I could split it by the links and then use a LinkedHashMap of Element to Boolean (or an enum type) specifying whether each Element is text or a link.
So the problem is let's say given HTML
<!DOCTYPE html>
<html>
<body>
<h1> This is the heading part. This is for testing purposes only.</h1>
First Link
<p>This is the first paragraph to be considered.</p>
Second Link
<p>This is the second paragraph to be considered.</p>
<img border="0" src="/images/pulpit.jpg" alt="Pulpit rock" width="304" height="228">
Third Link
</body>
</html>
I want to be able to only get this:
This is the heading part. This is for testing purposes only.
First Link
This is the first paragraph to be considered.
Second Link
This is the second paragraph to be considered.
Third Link
Then split them and insert them into a LinkedHashMap, so that if I do something like this:
int i = 1;
for (Entry<Element, Boolean> entry : splitList.entrySet()) {
    if (!entry.getValue()) { System.out.println(i + ": " + entry.getKey()); }
    i++;
}
Then it would print:
1: This is the heading part. This is for testing purposes only.
3: This is the first paragraph to be considered.
5: This is the second paragraph to be considered.
So that I can apply stemming and keep the order of iteration.
Now, I have no idea how to implement this as I don't know how to:
a) Get the body text with href links only
b) Split the body (I know that with Strings we can always use split(), but I am talking about the Elements of the body of a page)
How would I be able to do these two things above?
Also, I am not sure whether my solution is a good one. Are there better/easier ways to do this?

Now that I understand your requirement, I am updating the post with a new answer:
Assuming you have the Document doc from parsing the given HTML,
you can get all the a tags and wrap them in <xmp> tags:
for (Element element : doc.body().select("a"))
    element.wrap("<xmp></xmp>");
Now you need to reload the new HTML into doc, so that Jsoup skips parsing the content inside the <xmp> tags:
doc = Jsoup.parse(doc.html());
System.out.println(doc.body().text());
The output would be:
This is the heading part. This is for testing purposes only.
First Link
This is the first paragraph to be considered.
Second Link
This is the second paragraph to be considered.
Third Link
Now you can go ahead and do what you want with the output.
Update: here is the code for splitting, based on the comment:
for (Element element : doc.body().select("a"))
    element.wrap("<xmp>split-me-here</xmp>split-me-here");
doc = Jsoup.parse(doc.html());
int cnt = 0;
List<String> splitText = Arrays.asList(doc.body().text().split("split-me-here"));
for (String text : splitText) {
    cnt++;
    // chunks that still contain "</a>" are the wrapped links; skip them
    if (!text.contains("</a>"))
        System.out.println(cnt + "." + text.trim());
}
The above code will print following output:
1.This is the heading part. This is for testing purposes only.
3.This is the first paragraph to be considered.
5.This is the second paragraph to be considered.
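From there, stemming only the non-link chunks while keeping the iteration order could look roughly like the sketch below (my own illustration; it assumes your stemmer object exposes a stem(String) method for single terms, so adjust it to whatever your KrovetzStemmer implementation actually provides):
// Sketch: stem every word of the chunks that are not links. The link chunks still
// contain the literal "</a>" text because of the <xmp> wrapping above.
int cnt = 0;
for (String text : splitText) {
    cnt++;
    if (text.trim().isEmpty() || text.contains("</a>")) continue; // skip empty and link chunks
    StringBuilder stemmed = new StringBuilder();
    for (String word : text.trim().split("\\s+")) {
        stemmed.append(stemmer.stem(word)).append(' ');           // stem(String) is assumed
    }
    System.out.println(cnt + ". " + stemmed.toString().trim());
}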

Related

text content located in second br tag can't be printed

I'm trying to print the text content located in the second br tag using the following XPath, but the text from all the br tags gets printed to the console instead. What might be the reason?
driver.findElement(By.xpath("//*[text()[contains(.,'Telefon')]]")).getText();
The reason you can't get the text is that the text is not in the br tag. In <br />, the < opens the tag, br names it, and /> immediately closes it, so nothing can sit inside it.
Additionally, if you read a bit more about it, even the /> is surplus to requirements. If you had just <br> the text wouldn't be contained within it because:
The <br> tag is an empty tag which means that it has no end tag.
The point is, all your text belongs to the h2. You need to deal with that the best you can.
To solve your issue you'll need to:
.getAttribute("innerHTML") - this will give you all the text of the h2 with the br tag
split your string on the string <br> (please note that in Chrome my <br /> becomes <br> - you might need to adjust this)
either select item [2] or use a lambda to select the item that contains your text (do whichever you feel more comfortable with)
And those steps look like this:
// Get the element
var h2Element = driver.findElement(By.xpath("//*[text()[contains(.,'Telefon')]]"));
var myTextArray = h2Element.getAttribute("innerHTML").split("<br>");
// approach 1 - just print item [2], the chunk after the second <br>
var approach1Text = myTextArray[2];
System.out.println(approach1Text);
// approach 2 - use a lambda to select by contains
var approach2Text = Arrays.stream(myTextArray).filter(a -> a.contains("Telefon")).findFirst().get();
System.out.println(approach2Text);
For a bonus note - you probably had fun getting your xpath to work, because the br tag splits the text into separate nodes. As a result your h2 actually has multiple text() values. It has text(), text()[2], text()[3], etc - as many as there are br tags.
I put together a simple page to test this for you, just to show you what's going on.
It is text()[3] because xpaths are indexed from 1 (compared to the Java code above, which starts at 0). However, that's just an example of why it's tricky; I wouldn't recommend you do it that way.
The easy way to eliminate the effect of the <br> (and other tags!) on the text is to use normalize-space().
An xpath like this works and is relatively simple to follow:
//h2[contains(normalize-space(),'Telefon')]
It maps to my sample page OK.
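In case it is useful, here is a rough usage sketch of that simpler XPath (my own illustration, assuming the same driver and h2 content as above, with the usual org.openqa.selenium and java.util.Arrays imports):
// Sketch: match the <h2> via normalize-space(), then split its innerHTML on the
// <br> variants and keep the chunk that contains 'Telefon'.
WebElement h2 = driver.findElement(By.xpath("//h2[contains(normalize-space(),'Telefon')]"));
String telefonLine = Arrays.stream(h2.getAttribute("innerHTML").split("<br\\s*/?>"))
        .filter(part -> part.contains("Telefon"))
        .findFirst()
        .orElse("")
        .trim();
System.out.println(telefonLine);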
I share this extra bit in case you have any more text-split elements; it may help you down the line.
...All that said - good work on getting your original xpath to work. That's good too.
The driver.findElement() function returns only the first element matching the given XPath; driver.findElements() returns all of them. (The driver.find_element_by_xpath(fxp) form, where 'fxp' is the full XPath of the element, is the Python binding's equivalent of findElement.)
Try changing your xpath expression to
//h2/br/following-sibling::text()[contains(.,'Telefon')]
and see if it works.

In JSoup I am trying to get text from a span that has multiple classes with strange names the compiler is not liking

Here is my code:
text = text.toUpperCase();
Document doc = Jsoup.connect("https://finance.yahoo.com/quote/" + text + "?p=" + text).userAgent("Safari").get();
Element temp = doc.selectFirst("span.Trsdu(0.3s).Fw(b).Fz(36px).Mb(-4px).D(ib)");
System.out.println(temp);
here is the span I am trying to get:
<span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="35">1,119.50</span>
I am new to JSoup, so if I am being ignorant please let me know what I should do.
This may not be the answer, but I can't comment yet since I don't have 50 rep points, and I'd still like to help, so I'll post it here.
Jsoup has a lot of issues recognizing certain characters, which I've also run into.
For this particular example, I think you can use the data-reactid attribute to locate that element. You would select the span by that attribute, something like this: doc.select("span[data-reactid=35]")
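Putting that together, a rough sketch of what the lookup could look like (my own illustration, not tested against the live page; the data-reactid value 35 is taken from the snippet above and may differ per page):
// Sketch: select the quote span by its data-reactid attribute instead of the
// "Fz(36px)"-style class names, whose parentheses break CSS class selectors.
Document doc = Jsoup.connect("https://finance.yahoo.com/quote/" + text + "?p=" + text)
        .userAgent("Safari")
        .get();
Element price = doc.selectFirst("span[data-reactid=35]");
if (price != null) {
    System.out.println(price.text()); // e.g. 1,119.50
}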
Hope that helps.

Jsoup: Linebreaks in URL cannot be removed

I'm collecting links from HTML and using jsoup to get the links which I add to a collection. The problem is that I need each link to be on one line so when written to a file it can be parsed line by line.
The input is a WARC record and for each record I want to get all links.
The getContentUTF8() and getHeaderMetadataItem() methods come from a WarcRecord API found here.
The code:
String baseURL = getHeaderMetadataItem("WARC-Target-URI");
Vector<String> retVec = new Vector<String>();
Document doc = Jsoup.parse(getContentUTF8(), baseURL);
Elements links = doc.select("a[href]");
for (Element link : links) {
    String newLink = link.absUrl("href").replace("\n", "");
    retVec.add(newLink);
    System.out.println(newLink);
}
When writing stdout to file some links get split over two lines, for example:
1: http://somelink.com/submit?url=
2: http://someotherlink.net
While other might look like this (the way I want them):
1: http://somesite.com/submit?url=http://someothersite.com/
It looks like it always happens after a =.
EDIT: Added more information. It seems like removing both carriage returns and newlines fixed some cases. However, now I am encountering tab characters within absolute URLs from jsoup. I checked some source sites and they actually have tabs inside the href. It seems like there can be a lot of cases to handle. I would like to think that there is a general solution to catching them?
<a class="MenuButton " href="/ features"> <em> Features </em> </a>
turns into the absolute URL:
http://archinect.com/ features
Since I store it to a file in the form URI \t <list of links>, this will break when I parse it.
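For what it's worth, one general approach to the whitespace problem (my own suggestion, not from the thread) is to strip every whitespace character from the resolved URL in a single pass, since valid URLs cannot contain literal whitespace:
// Sketch: remove newlines, carriage returns, tabs and spaces in one regex pass.
// Note this turns "http://archinect.com/ features" into "http://archinect.com/features",
// which may or may not be the URL the site actually serves.
String newLink = link.absUrl("href").replaceAll("\\s+", "");
retVec.add(newLink);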

Getting text from a website using JSoup

I’m working with JSoup to parse the html website.
I want to get the article from (for example) Wikipedia.
I would like to get the text from the main page (http://en.wikipedia.org/wiki/Main_Page) from the table “From today’s featured article”.
Here’s the code:
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Main_Page");
Elements el = doc.select("div.mp-tfa");
System.out.println(el);
The problem is that it doesn’t work properly - it prints out just a blank line.
The “From today’s featured article” table is inserted in div class=“mp-tfa”.
How to get this text in my java program?
Thanks in advance.
Change:
doc.select("div.mp-tfa");
To:
doc.select("div#mp-tfa");
The better way would be to iterate over the Elements thus retrieved for the tag, class, or Element of your choice; simply put:
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Main_Page").get();
Elements el = doc.select("div#mp-tfa");
for (Element e : el) {
    System.out.println(e.text());
}
Would give:
The Boulonnais is a heavy draft horse breed from Fr....
I think it's supposed to be:
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Main_Page").get();
Elements el = doc.select("div#mp-tfa");
System.out.println(el);

separate html coded string and normal string

I want to split a single string containing normal text as well as HTML code into an array of strings. I tried to search on Google but have not found any suitable suggestion.
Consider the following string:
blahblahblahblahblahblahblahblahblahblah
blahblah First para blahblahblahblah
blahblahblahblahblahblahblahblahblahblah
<html>
<body>
<p>hello</p>
</body>
</html>
blahblahblahblahblahblahblahblahblahblah
blahblah Second Para lahblahblahblahblah
blahblahblahblahblahblahblahblahblahblah
this becomes:
s[0]=whole first para
s[1]=html code
s[2]=whole second para
Is it possible with jsoup, or do I need some other API?
It is possible with jQuery. Below is a code snippet.
var str = "blablabla <html><body><p>hello</p></body></html> blabla";
var parsedHTML = $.parseHTML(str);
var myList = [];
// loop through the parsed nodes and store each one based on its type
$.each(parsedHTML, function( i, el ) {
    if (el.nodeType < 3) myList[i] = el.nodeName;
    else myList[i] = el.data;
});
// use myList ...
Here is a fiddle which shows that it works. The only disadvantage is that the <html> and <body> tags are parsed away and not included in parsedHTML.
jsfiddle example
This can be done with JSoup
Simple use example:
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Then you can navigate the DOM structure to extract the information.
update
To get the text with all the tags you could wrap the entire string in <meta> ... </meta> tags; then parse it, access the individual components, and finally serialize the components back into strings.
Alternatively, if you believe the code is well formed (with matching beginning and end tags), you could search for the first match of the regex
/<(html|body)\s*>/
Depending on what the contents of the first tag (match) are, you then look for the last occurrence of the matching close tag.
This is more manual, more prone to error, and not recommended. But since you have a non-standard problem, it seems you might want a non-standard solution.
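For what it's worth, a minimal sketch of the parse-and-navigate route described above (my own illustration, not the answer's exact code; mixedString is your input string, and the usual org.jsoup.nodes and java.util imports are assumed). Jsoup parses the mixed string as a body fragment, and walking the top-level child nodes separates plain text from markup. As with the jQuery approach, the <html> and <body> wrappers inside the string are dropped:
// Sketch: split a mixed string into text chunks and HTML chunks, in document order.
Document doc = Jsoup.parseBodyFragment(mixedString);
List<String> parts = new ArrayList<>();
for (Node node : doc.body().childNodes()) {
    if (node instanceof TextNode) {
        String text = ((TextNode) node).text().trim();
        if (!text.isEmpty()) parts.add(text);        // plain text chunk (the paragraphs)
    } else if (node instanceof Element) {
        parts.add(((Element) node).outerHtml());     // HTML chunk (e.g. <p>hello</p>)
    }
}
// parts now holds s[0] = first para, s[1] = html code, s[2] = second para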
