I'm trying to use Jsoup to parse an HTML page that I generate through a servlet. From what I have read, I need to declare a Document. When I run the code
Document doc= Jsoup.parse(URL, 10000);
it always times out. If I increase the timeout, it runs until it reaches that time. When I put in Integer.MAX_VALUE, it simply runs forever. I am working in Google Chrome on a MacBook Pro.
My questions are:
Is this just my computer, or am I doing something wrong?
Is there a way to fix this, or an entirely different way to parse the HTML page?
Alternative Solutions
As explained in the Jsoup documentation, if you have an accessible URL then you can get its content this way:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
If you have HTML in a string this is how you should parse it:
Document doc = Jsoup.parse(htmlString);
If you have HTML in a local file then:
Document doc = Jsoup.parse(new File("FilePath"), "UTF-8", "http://example.com/");
Your Solution
The way you are using the Jsoup parser is correct, so the problem is probably with the link. If you can provide details about it, we can figure out what's going wrong.
Make sure the HTML generated by your servlet is actually accessible. If it is, the link you pass to Jsoup should be the URL of that servlet.
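One way to check accessibility is to fetch the servlet URL with short connect and read timeouts before handing it to Jsoup. The sketch below uses only the JDK. The local HttpServer, the /report path, and the class and method names are stand-ins invented for this example, not part of the original setup.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class ServletUrlCheck {

    /** Fetches the body at the given address, failing if the server does not answer in time. */
    static String fetch(String address, int timeoutMillis) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setConnectTimeout(timeoutMillis); // time allowed to open the socket
        conn.setReadTimeout(timeoutMillis);    // time allowed to wait for data
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    /** Starts a throwaway local server (a stand-in for the servlet) and fetches from it. */
    static String demo() {
        byte[] page = "<html><body><p>hello</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/report", exchange -> {
                exchange.sendResponseHeaders(200, page.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(page);
                }
            });
            server.start();
            try {
                int port = server.getAddress().getPort();
                return fetch("http://127.0.0.1:" + port + "/report", 2000);
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If fetch times out against your real servlet URL as well, the problem is on the server or in the URL rather than in Jsoup.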
Related
I'm having some issues when connecting to a URL with Jsoup: I am unable to set the encoding of the HTML, and the text in the tags is displayed only as "?". I've searched exhaustively here in the forum and in the documentation, but I can't get any of the proposed solutions to work.
This is one of the HTML parts that gives me the issue when running the Jsoup connect
The result when running the connection is this:
If I try to use the parser, I have the following message: "Please enable JavaScript to view the page content"
As described in some threads here on Stack Overflow, I changed the output encoding to check whether that was the problem, but the result was the same. I also tried saving the content to a file in the correct ISO encoding, and that didn't work either: same output with the question marks.
The snippet I am using is still very simple, since I am just trying to get the HTML:
Document doc = Jsoup.connect(a)
.header("Content-Type", "application/x-www-form-urlencoded")
.postDataCharset("ISO-8859-1") // tried other encodings but no success as well, same output
.get();
System.out.println(doc);
Has anyone had this problem before when using connect().get() from Jsoup?
Update
With another site, the issue does not occur:
String a = "https://flatschart.com/html5/descricao.html";
Document doc = Jsoup.connect(a)
.header("Content-Type", "application/x-www-form-urlencoded")
.postDataCharset("ISO-8859-1")
.get();
System.out.println(doc);
I have some information to scrape from a website. I can scrape it, but not all of the information comes through, and a lot of data is lost. The following images should help explain:
I used Jsoup, connected to the URL, and then extracted this particular data using the following code:
Document doc = Jsoup.connect("https://www.awattar.com/tariffs/hourly#").userAgent("Mozilla/17.0").get();
Elements durationCycle = doc.select("g.x.axis g.tick text");
But I couldn't find any of that information in the result. So I printed the whole document from the URL, and it shows the following:
I can see the information when I download the page and read it as an input file, but not when I connect directly to the URL. I want to connect to the URL, though. Are there any suggestions?
I hope my question is understandable. Let me know if anything is unclear.
Jsoup limits the size of the response body it reads by default, so a large page gets truncated. You should raise or remove the limit with the maxBodySize parameter:
Document doc = Jsoup.connect("https://www.awattar.com/tariffs/hourly#").userAgent("Mozilla/17.0").maxBodySize(0).get();
A value of 0 means no limit.
I have the following code:
doc = Jsoup.connect("http://www.amazon.com/gp/goldbox").userAgent("Mozilla").timeout(5000).get();
Elements hrefs = doc.select("div.a-row.layer");
System.out.println("Results:"+ hrefs); //I am trying to print out contents but not able to see the output.
Problem: I want to display every image src within the div with class name "a-row layer", but I am unable to see any output.
What is the mistake with my query?
I have taken a look at the website and tested it myself. The issue seems to be that the piece of html code you want to extract (div.a-row.layer) is generated by JavaScript.
Jsoup does not execute JavaScript, so it cannot see content that scripts generate after the page loads. You would need a headless web browser, such as HtmlUnit, to deal with this.
I am trying to use Jsoup to parse the HTML from the URL http://www.threadflip.com/shop/search/john%20hardy
Jsoup seems to get only the data from the line
<![CDATA[ window.gon= ..............
Does anyone know why this would be?
Document doc = Jsoup.connect("http://www.threadflip.com/shop/search/john%20hardy").get();
The site you are trying to parse loads most of its content asynchronously via AJAX calls. Jsoup does not interpret JavaScript and therefore does not act like a browser. It seems that the store is filled by calling their API:
http://www.threadflip.com/api/v3/items?attribution%5Bapp%5D=web&item_collection_id=&q=john+hardy&page=1&page_size=30
So you may need to load the API URL directly in order to read the data you want. Note that the response is JSON, not HTML, so the Jsoup HTML parser is not much help here. But there are great JSON libraries available. I use JSON-Simple.
Alternatively, you may switch to Selenium WebDriver, which actually remote-controls a real browser. This should have no trouble accessing all items from the page.
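If you go the API route, the query string can be assembled with the JDK's URLEncoder instead of being hand-escaped. This is only a sketch: the parameter names are copied from the URL quoted above, while the class and method names are made up for the example, and nothing here guarantees the endpoint still exists or accepts other values.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class ApiUrlBuilder {

    /** URL-encodes one query-string component (space becomes '+', '[' becomes %5B). */
    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    /** Appends the encoded key=value pairs to the base URL, in insertion order. */
    static String buildUrl(String base, Map<String, String> params) {
        String query = params.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
        return base + "?" + query;
    }

    /** Rebuilds the search URL from the answer for a given search term and page. */
    static String searchUrl(String term, int page) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("attribution[app]", "web");
        params.put("item_collection_id", "");
        params.put("q", term);
        params.put("page", String.valueOf(page));
        params.put("page_size", "30");
        return buildUrl("http://www.threadflip.com/api/v3/items", params);
    }

    public static void main(String[] args) {
        System.out.println(searchUrl("john hardy", 1));
    }
}
```

The resulting string can then be fetched and handed to a JSON parser. If you fetch it through Jsoup rather than a plain HTTP client, you would likely need ignoreContentType(true) on the connection, since the response is not HTML.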
I am new to Jsoup. I want to parse HTML, but the problem is the URL that has to be specified in Jsoup.connect(url): I will only get this URL at runtime, in the response from some other page. Is there any way to pass the received URL into Jsoup.connect? I have read something like:
String html = response.getContentAsString();
Document document = Jsoup.parse(html);
But I don't understand exactly how to use it. I would also love to know if some other way of doing this is better than Jsoup.
Jsoup.parse(String) accepts an HTML content string, not a URL.
According to the Jsoup Javadoc, the following should solve your problem:
Document doc = Jsoup.connect("http://example.com").get();
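Note that the "connect" method returns a Connection object but does not actually connect. An additional call to "get" (or "post", depending on what the handler on the server side expects) is needed to perform the request. The argument to "connect" is an ordinary String, so a URL received at runtime can be passed in as a variable.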
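Since the URL only arrives at runtime, it can be worth checking that it is an absolute http or https URL before passing it on. Below is a minimal sketch using only java.net.URI; the class and method names are hypothetical, and the Jsoup call is left as a comment because it needs the jsoup library on the classpath.

```java
import java.net.URI;

public class RuntimeUrl {

    /**
     * Returns the trimmed URL if it is an absolute http(s) URL with a host,
     * otherwise throws IllegalArgumentException.
     */
    static String requireHttpUrl(String candidate) {
        // URI.create itself throws IllegalArgumentException on malformed input.
        URI uri = URI.create(candidate.trim());
        String scheme = uri.getScheme();
        boolean http = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
        if (!http || uri.getHost() == null) {
            throw new IllegalArgumentException("Not an absolute http(s) URL: " + candidate);
        }
        return uri.toString();
    }

    public static void main(String[] args) {
        // In the real program this string would come from the other page's response.
        String received = " http://example.com/page ";
        String url = requireHttpUrl(received);
        System.out.println(url);
        // With jsoup available, the validated string is passed straight in:
        // Document doc = Jsoup.connect(url).get();
    }
}
```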