JSoup always timing out

I'm trying to use JSoup to parse an HTML page that I generate through a Servlet. From what I have read, I need to declare a Document. When I run the code
Document doc = Jsoup.parse(URL, 10000);
it always times out. If I increase the timeout, it runs until it reaches that time. When I pass Integer.MAX_VALUE, it simply runs forever. I am working in Google Chrome on a MacBook Pro.
My questions are:
Is this just my computer, or am I doing something wrong?
Is there a way to fix this, or an entirely different way to parse the HTML page?

Alternative Solutions
As explained in the Jsoup documentation, if you have an accessible URL then you can get its content this way:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
If you have HTML in a string, this is how you should parse it:
Document doc = Jsoup.parse(htmlString);
If you have HTML in a local file, then:
Document doc = Jsoup.parse(new File("FilePath"), "UTF-8", "http://example.com/");
Your Solution
The way you are calling the Jsoup parser is correct, so the problem is probably with the URL. If you can provide details about it, we can figure out what is going wrong.
Make sure the HTML generated by your Servlet is actually accessible. If it is, the link you pass should be the URL of that Servlet.
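To rule out the network while debugging, the string and file variants above can be tried without any connection. The following is a minimal sketch, assuming jsoup is on the classpath; the class name ParseSources, the helper names and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of jsoup.
final class ParseSources {

    // Parse markup that is already in memory; no network access is involved.
    static String titleFromString(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    // Write the markup to a temporary file, then parse that file.
    static String titleFromTempFile(String html) {
        try {
            Path tmp = Files.createTempFile("servlet-output", ".html");
            try {
                Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
                // The third argument is the base URI used to resolve relative links.
                Document doc = Jsoup.parse(tmp.toFile(), "UTF-8", "http://example.com/");
                return doc.title();
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Sample markup standing in for the Servlet output (an assumption).
        String html = "<html><head><title>Generated page</title></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(titleFromString(html));
        System.out.println(titleFromTempFile(html));
    }
}
```

If both calls work but fetching the Servlet URL still hangs, the parser is probably fine and the URL or the server is the more likely culprit.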

Related

Jsoup connect(url) get encoding

I'm having some issues when connecting to a URL with Jsoup: I am unable to set the encoding of the HTML, and the text in the tags is displayed only as "?". I've searched exhaustively here in the forum and in the documentation, but I can't get any of the proposed solutions to work.
This is one of the HTML parts that gives me the issue when running the Jsoup connect
The result when running the connection is this:
If I try to use the parser, I get the following message: "Please enable JavaScript to view the page content"
As described in some threads here on Stack Overflow, I changed the output encoding to check whether that was the problem, but the result was the same. I also tried saving the content to a file in the correct ISO encoding, and that didn't work either: same output with the question marks.
The snippet I am using is still very simple, since I am just trying to get the HTML:
Document doc = Jsoup.connect(a)
.header("Content-Type", "application/x-www-form-urlencoded")
.postDataCharset("ISO-8859-1") // tried other encodings but no success as well, same output
.get();
System.out.println(doc);
Has anyone had this problem before when using connect().get() from Jsoup?
Update
Using another site, the issue does not occur:
String a = "https://flatschart.com/html5/descricao.html";
Document doc = Jsoup.connect(a)
.header("Content-Type", "application/x-www-form-urlencoded")
.postDataCharset("ISO-8859-1")
.get();
System.out.println(doc);

web scraping jsoup java unable to scrape full information

I have information to scrape from a website. I can scrape it, but not all of it comes through; a lot of data is lost. The following images help to explain:
I used Jsoup, connected to the URL, and then extracted this particular data using the following code:
Document doc = Jsoup.connect("https://www.awattar.com/tariffs/hourly#").userAgent("Mozilla/17.0").get();
Elements durationCycle = doc.select("g.x.axis g.tick text");
But I couldn't find any of the related information in the result. So I printed the whole document from the URL, and it shows the following:
I can see the information when I download the page and read it as an input file, but not when I connect directly to the URL. I want to connect directly to the URL, though. Are there any suggestions?
I hope my question is understandable. Let me know if it needs more explanation.

Jsoup limits how much of the response body it reads. You should use the maxBodySize parameter:
Document doc = Jsoup.connect("https://www.awattar.com/tariffs/hourly#").userAgent("Mozilla/17.0").maxBodySize(0).get();
A value of 0 means no limit.
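The setting can be checked without sending anything, because Jsoup.connect only builds the request. A minimal sketch, assuming jsoup is on the classpath; the class name UnlimitedBody and the 10-second timeout are illustrative choices, not part of the answer above.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

// Hypothetical helper class, not part of jsoup.
final class UnlimitedBody {

    // Build the request but do not send it; nothing is fetched
    // until get(), post() or execute() is called on the result.
    static Connection configure(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/17.0")
                .timeout(10_000)   // milliseconds; an assumed value
                .maxBodySize(0);   // 0 removes the response size limit
    }

    public static void main(String[] args) {
        Connection connection = configure("https://www.awattar.com/tariffs/hourly");
        // Read the settings back from the pending request.
        System.out.println(connection.request().maxBodySize());
        System.out.println(connection.request().timeout());
        // connection.get() would now fetch and parse the page.
    }
}
```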

Parsing shopping websites using jsoup

I have the following code:
doc = Jsoup.connect("http://www.amazon.com/gp/goldbox").userAgent("Mozilla").timeout(5000).get();
Elements hrefs = doc.select("div.a-row.layer");
System.out.println("Results:"+ hrefs); //I am trying to print out contents but not able to see the output.
Problem: I want to display all image src values within the div with class name "a-row layer", but I am unable to see any output.
What is the mistake in my query?

I have taken a look at the website and tested it myself. The issue seems to be that the piece of HTML you want to extract (div.a-row.layer) is generated by JavaScript.
Jsoup does not execute JavaScript, so it cannot see content generated by it. You would need a headless web browser to deal with this, such as HtmlUnit.
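The effect can be reproduced offline. In this minimal sketch (jsoup on the classpath assumed; the class name StaticOnly and the sample markup are made up), a script would insert the div in a browser, but Jsoup only sees the static markup and treats the script body as plain data.

```java
import org.jsoup.Jsoup;

// Hypothetical helper class, not part of jsoup.
final class StaticOnly {

    // Count the elements that match a CSS query in the given markup.
    static int countMatches(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).size();
    }

    public static void main(String[] args) {
        // In a browser the script fills #deals; Jsoup never runs it.
        String html = "<html><body><div id=\"deals\"></div>"
                + "<script>document.getElementById('deals').innerHTML = "
                + "'<div class=\"a-row layer\">deal</div>';</script>"
                + "</body></html>";

        System.out.println(countMatches(html, "div.a-row.layer")); // script-generated div
        System.out.println(countMatches(html, "div#deals"));       // static container
    }
}
```

An empty result for a selector that matches in the browser's inspector is therefore a strong hint that the element is created by a script.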

Jsoup not getting full html

I am trying to use Jsoup to parse the HTML from the URL http://www.threadflip.com/shop/search/john%20hardy
Jsoup appears to get only the data from the line
<![CDATA[ window.gon= ..............
Does anyone know why this would be?
Document doc = Jsoup.connect("http://www.threadflip.com/shop/search/john%20hardy").get();

The site you are trying to parse loads most of its content asynchronously via AJAX calls. JSoup does not interpret JavaScript and therefore does not behave like a browser. It seems that the store is filled by calling their API:
http://www.threadflip.com/api/v3/items?attribution%5Bapp%5D=web&item_collection_id=&q=john+hardy&page=1&page_size=30
So you may need to load the API URL directly in order to read the data you want. Note that the response is JSON, not HTML, so the JSoup HTML parser is not much help here. But there are great JSON libraries available; I use JSON-Simple.
Alternatively, you may switch to Selenium WebDriver, which remote-controls a real browser. It should have no trouble accessing all items on the page.
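Building the API URL for a different search term mostly comes down to encoding the query parameters. A minimal sketch using only the Java standard library (Java 10 or later for this URLEncoder overload); the class and method names are made up, and the parameter set is simply copied from the URL quoted above, not from any API documentation.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
final class ApiUrl {

    // Rebuild the quoted API URL for an arbitrary search term.
    static String searchUrl(String query, int page, int pageSize) {
        String base = "http://www.threadflip.com/api/v3/items";
        return base
                + "?" + encode("attribution[app]") + "=" + encode("web")
                + "&item_collection_id="
                + "&q=" + encode(query)
                + "&page=" + page
                + "&page_size=" + pageSize;
    }

    // Form-encode a single name or value (spaces become '+').
    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(searchUrl("john hardy", 1, 30));
    }
}
```

The resulting string can be fetched with any HTTP client. If you want to stay with Jsoup for the request, Jsoup.connect(url).ignoreContentType(true).execute().body() returns the raw response text, which can then be handed to a JSON library.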

HTML parsing using JSoup

I am new to jsoup. I want to parse HTML, but the problem is the URL that has to be specified in Jsoup.connect(url): I only get this URL at runtime, in the response from some other page. Is there any way to pass the received URL into Jsoup.connect? I have read something like:
String html = response.getContentAsString();
Document document = Jsoup.parse(html);
But I don't understand exactly how to use it. I would also love to know whether some other way of doing this is better than jsoup.

Jsoup.parse(String) treats its argument as HTML content, not as a URL.
According to the jsoup Javadoc, the following should solve your problem:
Document doc = Jsoup.connect("http://example.com").get();
Note that the connect method returns a Connection object but does not actually connect. The request is sent only by an additional call to get (or post, depending on which HTTP method the server-side handler expects).
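Since Jsoup.connect takes an ordinary String, a URL discovered at runtime can be passed in like any other variable. The sketch below (jsoup on the classpath assumed; the class name RuntimeUrl, the sample markup and the example.com addresses are made up) pulls the link out of a first page and resolves it to an absolute URL.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup.
final class RuntimeUrl {

    // Return the absolute URL of the first link in the markup, or "" if there is none.
    static String firstLink(String html, String baseUri) {
        // The base URI lets Jsoup resolve relative href values.
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        // Stand-in for the response body of the first page (an assumption).
        String html = "<p><a href=\"/next/page.html\">next</a></p>";
        String next = firstLink(html, "http://example.com/start");
        System.out.println(next);
        // The string can then be passed on: Jsoup.connect(next).get()
    }
}
```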
