HTML parsing using JSoup - java

I am new to jsoup. I want to parse HTML, but the problem is with the URL that has to be passed to Jsoup.connect(url): I will only receive this URL at runtime, in the response from some other page. Is there any way to pass the received URL into Jsoup.connect? I had read something like:
String html = response.getContentAsString();
Document document = Jsoup.parse(html);
But I am not sure exactly how to use it. I would also love to know if some other way of doing this is better than jsoup.

"parse" function accepts html content string, not url.
According to the jsoup javadoc, the following should solve your problem:
Document doc = Jsoup.connect("http://example.com").get();
Pay attention to the fact that the connect method returns a Connection object but does not, in fact, connect. The request is only executed by the additional call to get (or post, depending on the handler type on the server side).
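Since the URL is just a String, a value received at runtime can be passed straight into connect. A minimal sketch of both paths (the method names here are illustrative, not part of any specific API):

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// url is whatever string you received at runtime from the other page's response
static Document fetchAtRuntime(String url) throws IOException {
    // connect() only builds the Connection; get() executes the request and parses the result
    return Jsoup.connect(url).get();
}

// If you already have the HTML body as a string (e.g. from response.getContentAsString()),
// parse it directly; the baseUri argument lets relative links in the document resolve
static Document parseBody(String html, String baseUri) {
    return Jsoup.parse(html, baseUri);
}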

Related

Jsoup connect(url) get encoding

I'm having some issues when connecting to a URL with Jsoup: I am unable to set the encoding of the HTML, and the text in the tags is displayed only as "?". I've searched exhaustively here in the forum and in the documentation, but I can't get any of the proposed solutions to work.
The HTML snippet that triggers the issue and the resulting output (full of question marks) were shown in screenshots (source: i.ibb.co) that are no longer available.
If I try to use the parser instead, I get the following message: "Please enable JavaScript to view the page content".
As described in some threads here on Stack Overflow, I've changed the output encoding to check whether that was the problem, but the result was the same. I also tried saving the content to a file in the correct ISO encoding, and that didn't work either: same output with the question marks.
The snippet that I am using is still very simple, since I am just trying to get the HTML:
Document doc = Jsoup.connect(a)
.header("Content-Type", "application/x-www-form-urlencoded")
.postDataCharset("ISO-8859-1") // tried other encodings but no success as well, same output
.get();
System.out.println(doc);
Has anyone had this problem before when using connect().get() from Jsoup?
Update
With another site, the issue does not occur:
String a = "https://flatschart.com/html5/descricao.html";
Document doc = Jsoup.connect(a)
.header("Content-Type", "application/x-www-form-urlencoded")
.postDataCharset("ISO-8859-1")
.get();
System.out.println(doc);
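For reference, one thing worth ruling out is the server mis-declaring its charset: you can fetch the raw bytes yourself and tell Jsoup explicitly which encoding to use when parsing. This is only a sketch, not a confirmed fix for this site; if the page really answers "Please enable JavaScript to view the page content", the content is likely rendered client-side, and no encoding setting will change that.

import java.io.ByteArrayInputStream;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Connection.Response res = Jsoup.connect(a).execute(); // a is the URL as above
// Parse the raw bytes with an explicit charset instead of trusting the response headers
Document doc = Jsoup.parse(new ByteArrayInputStream(res.bodyAsBytes()), "ISO-8859-1", a);
System.out.println(doc);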

web scraping jsoup java unable to scrape full information

I have some information to scrape from a website. I can scrape it, but not all of the information is retrieved; there is a lot of data loss. The images that illustrated this are no longer available.
I used Jsoup, connected it to URL and then extracted this particular data using the following code :
Document doc = Jsoup.connect("https://www.awattar.com/tariffs/hourly#").userAgent("Mozilla/17.0").get();
Elements durationCycle = doc.select("g.x.axis g.tick text");
But in the result, I couldn't find any of that related information at all. So I printed the whole document from the URL (the screenshot of that output is no longer available).
I can see the information when I download the page and read it as an input file, but not when I connect directly to the URL. However, I want to connect to the URL. Is there any suggestion?
I hope my question is understandable; let me know if it is not.
There is a response body size limit in Jsoup, which truncates large responses by default; you should use the maxBodySize parameter:
Document doc = Jsoup.connect("https://www.awattar.com/tariffs/hourly#").userAgent("Mozilla/17.0").maxBodySize(0).get();
A value of 0 means no limit.

Jsoup not getting full html

I am trying to use Jsoup to parse the HTML from the URL http://www.threadflip.com/shop/search/john%20hardy
Jsoup seems to only get the data from the line
<![CDATA[ window.gon= ..............
Does anyone know why this would be?
Document doc = Jsoup.connect("http://www.threadflip.com/shop/search/john%20hardy").get();
The site you are trying to parse loads most of its contents asynchronously via AJAX calls. Jsoup does not interpret JavaScript and therefore does not act like a browser. It seems that the store is filled by calling their API:
http://www.threadflip.com/api/v3/items?attribution%5Bapp%5D=web&item_collection_id=&q=john+hardy&page=1&page_size=30
So you may need to load the API URL directly in order to read the data you want. Note that the response is JSON, not HTML, so the Jsoup HTML parser is not much help here. But there are great JSON libraries available; I use JSON-Simple.
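A sketch of that approach, using Jsoup only as the HTTP client and JSON-Simple for the parsing (the "items" key is an assumption about the API's response shape; inspect the real response first):

import org.jsoup.Jsoup;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;

String apiUrl = "http://www.threadflip.com/api/v3/items?q=john+hardy&page=1&page_size=30";
String body = Jsoup.connect(apiUrl)
        .ignoreContentType(true) // Jsoup would otherwise reject a non-HTML response
        .execute()
        .body();
JSONObject json = (JSONObject) new JSONParser().parse(body);
System.out.println(json.get("items")); // assumed key, for illustration only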
Alternatively, you may switch to Selenium WebDriver, which actually remote-controls a real browser. It should have no trouble accessing all items on the page.

JSoup always timing out

I'm trying to use JSoup to parse an HTML file I have generated through a Servlet. From what I have read, I need to declare a Document. When I run the code
Document doc= Jsoup.parse(URL, 10000);
It always times out; if I increase the timeout, it will run until it reaches that time. When I put in Integer.MAX_VALUE, it simply runs forever. I am working in Google Chrome on a MacBook Pro.
My questions are:
Is this just my computer, or am I doing something wrong?
Is there a way to fix this, or an entirely different way to parse the HTML page?
Alternative Solutions
As explained in the Jsoup documentation, if you have an accessible URL then you can get its content this way:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
If you have HTML in a string, this is how you should parse it:
Document document = Jsoup.parse(htmlString);
If you have HTML in a local file then:
Document doc = Jsoup.parse(new File("FilePath"), "UTF-8", "http://example.com/");
Your Solution
The way you are using the Jsoup parser is correct, but the problem is probably with the link; if you can provide details about it, then we can figure out what's going wrong.
Make sure the HTML generated by your Servlet is accessible: the link you pass should be a URL to that Servlet.
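If the URL itself turns out to be reachable, the connect-based API also lets you set the timeout explicitly. A sketch with illustrative values (the Servlet URL is hypothetical):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = Jsoup.connect("http://localhost:8080/myapp/myservlet") // hypothetical Servlet URL
        .timeout(10000) // in milliseconds; 0 means an infinite timeout
        .get();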

Retrieving known invalid XML via Selenium WebDriver

I have a Selenium test that needs to get some raw XML from a web server. The problem I'm having is that one of the XML documents is known to be invalid because it is missing a root element. I'd like to get the raw source of the invalid XML and tack on my own root element, but every WebDriver flavor I've tried attempts to parse the XML and returns some form of error message. In short, I'm doing this:
WebDriver driver = new FirefoxDriver();
driver.get("http://some_URL_that_returns_xml_data");
String source = driver.getPageSource();
The source string contains the invalid-XML error message rendered by the browser, rather than the actual raw source I would see by viewing the source in the browser.
Does anyone know of a trick to get around this?
The standard way of doing this is to use an HTTP client such as Apache HttpClient and make the request yourself, asking for the correct content type, which is probably application/xml. That way the response comes back as raw XML rather than as the browser's rendering of it.
If the XML is invalid, the browser might render only part of the document, so if you want all of the text you might want to request a plain-text content type instead.
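One way to sidestep the browser entirely is to fetch the raw body yourself. A sketch using the JDK's built-in HttpClient (Java 11+) rather than an Apache library, assuming the server returns the document as-is to a non-browser client:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

HttpClient client = HttpClient.newHttpClient();
HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://some_URL_that_returns_xml_data"))
        .header("Accept", "application/xml")
        .build();
// The body arrives untouched by any browser-side XML parsing or rendering
String rawXml = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
String wrapped = "<root>" + rawXml + "</root>"; // tack on your own root element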
If the only thing wrong with the XML is the absence of a wrapper element, then it is a "well-formed external parsed entity", and you can retrieve it using an entity reference. Create a dummy document like this:
<!DOCTYPE doc [
<!ENTITY e SYSTEM "http://uri.com/realdata.xml">
]>
<doc>&e;</doc>
(where the string after "SYSTEM" is the location of your XML), and pass this dummy document to your XML parser. (But not in the browser, where XML parsers typically ignore external entities).
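A sketch of feeding such a dummy document to a Java XML parser (the file name is illustrative; note that many parsers disable external entities by default to prevent XXE attacks, so the feature may need to be enabled explicitly, and only for trusted sources):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
// Allow the parser to resolve the external entity; do this only for trusted URLs
dbf.setFeature("http://xml.org/sax/features/external-general-entities", true);
Document doc = dbf.newDocumentBuilder().parse(new File("dummy.xml"));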
Try:
WebElement element = driver.findElement(By.tagName("body"));
String elHtml = element.getAttribute("innerHTML");
OR:
String elHtml = driver.findElement(By.tagName("body")).getText();
