Jsoup connect(url) get encoding - java

I'm having an issue when connecting to a URL with Jsoup: I am unable to set the encoding of the HTML, and the text in the tags is displayed only as "?". I've searched exhaustively here in the forum and in the documentation, but I can't get any of the proposed solutions to work.
This is one of the HTML parts that gives me the issue when running the Jsoup connect
The result when running the connection is this:
If I try to use the parser, I have the following message: "Please enable JavaScript to view the page content"
As described in some threads here on Stack Overflow, I changed the output encoding to check whether that was the problem, but the result was the same. I also tried saving the content to a file with the correct ISO encoding, and that didn't work either: same output with the question marks.
The snippet I am using is still very simple, since I am just trying to get the HTML:
Document doc = Jsoup.connect(a)
        .header("Content-Type", "application/x-www-form-urlencoded")
        .postDataCharset("ISO-8859-1") // tried other encodings but no success as well, same output
        .get();
System.out.println(doc);
Has anyone had this problem before when using connect().get() from Jsoup?
Update
With another site, the issue does not occur:
String a = "https://flatschart.com/html5/descricao.html";
Document doc = Jsoup.connect(a)
        .header("Content-Type", "application/x-www-form-urlencoded")
        .postDataCharset("ISO-8859-1")
        .get();
System.out.println(doc);
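No answer is recorded here for this question. One thing worth noting is that postDataCharset controls how outgoing form data is encoded, so it is not expected to change how the response is decoded. As a rough way to tell a decoding problem apart from a console problem, the sketch below uses only the JDK (no Jsoup): it decodes the same bytes with two charsets and prints through a PrintStream with an explicit encoding. The class name, the helper and the sample word are made up for illustration, and this is not a diagnosis of the site in the question.

```java
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetCheck {
    // Hypothetical helper: decode raw bytes with the given charset.
    public static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) throws Exception {
        // A Portuguese word with c-cedilla and a-tilde, written with escapes
        // so the encoding of this source file does not matter.
        byte[] raw = "Descri\u00e7\u00e3o".getBytes(StandardCharsets.ISO_8859_1);

        // Correct: the bytes are ISO-8859-1, so decode them as ISO-8859-1.
        String right = decode(raw, StandardCharsets.ISO_8859_1);
        // Mismatch: the same bytes are not valid UTF-8, so the accented
        // characters turn into U+FFFD replacement characters.
        String wrong = decode(raw, StandardCharsets.UTF_8);

        // Print through an explicit UTF-8 stream so that the console's default
        // encoding is not what turns characters into "?".
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(right);
        out.println(wrong);
    }
}
```

If the correctly decoded string still shows up as "?" in a terminal or IDE console, the console encoding is the more likely suspect than the download itself.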

Related

web scraping jsoup java unable to scrape full information

I need to scrape some information from a website. I can scrape part of it, but not all of it: a lot of data is lost. The following images should help you understand:
I used Jsoup, connected to the URL, and then tried to extract this particular data with the following code:
Document doc = Jsoup.connect("https://www.awattar.com/tariffs/hourly#").userAgent("Mozilla/17.0").get();
Elements durationCycle = doc.select("g.x.axis g.tick text");
But in the result I couldn't find any of that information at all. So I printed the whole document from the URL, and it shows the following:
I can see the information when I download the page and read it as an input file, but not when I connect directly to the URL. However, I want to connect to the URL. Are there any suggestions?
I hope my question is understandable. Let me know if it is not clear enough.
Jsoup limits the size of the response body it reads. Use the maxBodySize parameter to raise or remove that limit:
Document doc = Jsoup.connect("https://www.awattar.com/tariffs/hourly#").userAgent("Mozilla/17.0").maxBodySize(0).get();
A value of 0 means no limit.

How to post and get data - JSOUP JAVA

I am trying to send a POST request to https://www.servientrega.com/wps/portal/Colombia/transacciones-personas/rastreo-envios and get the track-and-trace results. I need to send a tracking number, for example 2003159943. This is my code:
Connection.Response form = Jsoup
        .connect("https://www.servientrega.com/wps/portal/Colombia/transacciones-personas/rastreo-envios")
        .validateTLSCertificates(false)
        .method(Connection.Method.GET)
        .execute();
Document document = Jsoup
        .connect("https://www.servientrega.com/wps/portal/Colombia/transacciones-personas/rastreo-envios")
        .validateTLSCertificates(false)
        .data("txtNumGuia", "2003159943")
        .cookies(form.cookies())
        .post();
I need to get this history:
[Image: the data I want]
but this is what I get when I call println(document):
[Image: the result I got]
The data you want to obtain is set by JavaScript after the page is downloaded. Jsoup does not execute JavaScript; it only downloads the initial HTML.
If you examine which connections are made, for example with the browser's debugging tools, you will find that the data is downloaded with a request to this API: https://web.servientrega.com/PortalServientrega/WebServicePortal/tracking/api/envio/2003159943/1/es
The data you are looking for should be in the response.
Document document = Jsoup.connect("https://web.servientrega.com/PortalServientrega/WebServicePortal/tracking/api/envio/2003159943/1/es")
        .validateTLSCertificates(false)
        .ignoreContentType(true)
        .get();
System.out.println(document.text());
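Since the endpoint above returns data rather than an HTML page, the same request can also be made without Jsoup. The following is a minimal sketch that uses the JDK's java.net.http.HttpClient (Java 11+) instead. The class and method names are hypothetical, the trailing "/1/es" is copied verbatim from the URL above without knowing what it means, and the response format is not confirmed by the thread. Unlike the Jsoup snippet, this sketch keeps TLS certificate validation enabled.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class TrackingClient {
    // Base of the endpoint quoted in the answer above.
    static final String BASE =
        "https://web.servientrega.com/PortalServientrega/WebServicePortal/tracking/api/envio/";

    // Builds the request URI for a tracking number. The "/1/es" suffix is
    // taken as-is from the answer's URL; its meaning is an open assumption.
    public static URI trackingUri(String guideNumber) {
        return URI.create(BASE + guideNumber + "/1/es");
    }

    // Performs a real network call and returns the raw response body as text.
    public static String fetch(String guideNumber) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(trackingUri(guideNumber)).GET().build();
        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        // Only prints the URI; call fetch("2003159943") to send the request.
        System.out.println(trackingUri("2003159943"));
    }
}
```

Keeping the URI construction separate from the network call makes it easy to check the URL for a given tracking number before sending anything.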

jsoup won't extract email only website

I am having trouble with jsoup. It extracts only the website links, not the email link. Here is my code:
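try {
    Document doc = Jsoup.connect(url2).get();
    // Anchor tags: <a href="...">
    Elements links = doc.select("a[href]");
    for (Element web : links) {
        Log.i("websites/emails/etc.", web.attr("abs:href"));
    }
    // <link href="..."> tags (stylesheets, icons, etc.)
    Elements links2 = doc.select("link[href]");
    for (Element web : links2) {
        Log.i("websites/emails/etc.", web.attr("abs:href"));
    }
} catch (IOException e) {
    Log.e("websites/emails/etc.", "Failed to fetch page", e);
}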
The logcat is only showing the website links.
Here is the inspected page:
Edit --
I missed that you were using Android. I tested on the JVM and your code worked; I then re-tested on Android and saw the same thing you did. The solution appears to be to remove the abs: qualifier from your attr call.
Log.i("websites/emails/etc.", web.attr("href"));
Original answer, which may apply to other attempts to extract mailto:.
This is almost certainly the behavior intended by the website's creator. Because mailto: links are easily scraped by spambot email harvesters, a variety of techniques are used to make them non-obvious in the raw HTML: they are cleverly encoded, or are generated dynamically by JavaScript. See here for an example. Safari shows you the element because these techniques are designed to render correctly in the browser, even when the HTML alone looks odd. If you download the file with curl and look at the raw text, there is likely no "mailto:" link in it.
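Once the href values are read with attr("href"), as in the edit above, the mailto: entries can be separated from ordinary web links. The following is a small sketch in plain Java, independent of Jsoup and Android. The class and method names are hypothetical, and it assumes the address sits directly after "mailto:" with optional "?subject=..." parameters, which will not hold for obfuscated links.

```java
import java.util.Optional;

class MailtoLinks {
    // Returns the address from a mailto: href, or empty for any other link.
    public static Optional<String> emailFromHref(String href) {
        if (href == null) {
            return Optional.empty();
        }
        String trimmed = href.trim();
        // Case-insensitive check for the "mailto:" scheme (7 characters).
        if (!trimmed.regionMatches(true, 0, "mailto:", 0, 7)) {
            return Optional.empty();
        }
        String rest = trimmed.substring(7);
        int query = rest.indexOf('?'); // drop "?subject=..." style parameters
        String address = query >= 0 ? rest.substring(0, query) : rest;
        return address.isEmpty() ? Optional.empty() : Optional.of(address);
    }

    public static void main(String[] args) {
        System.out.println(emailFromHref("mailto:info@example.com?subject=Hello")); // Optional[info@example.com]
        System.out.println(emailFromHref("https://example.com/contact"));           // Optional.empty
    }
}
```

In the loop from the question, this would be called with web.attr("href") for each element, logging only the values for which an address is present.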

JSoup always timing out

I'm trying to use JSoup to parse an HTML file that I generate through a Servlet. From what I have read, I need to declare a Document. When I run the code
Document doc= Jsoup.parse(URL, 10000);
it always times out. If I increase the timeout, it runs until it reaches that time. When I pass Integer.MAX_VALUE, it simply runs forever. I am working in Google Chrome on a MacBook Pro.
My questions are:
Is this just my computer, or am I doing something wrong?
Is there a way to fix this, or an entirely different way to parse the HTML page?
Alternative Solutions
As explained in the Jsoup documentation, if you have an accessible URL then you can get its content this way:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
If you have HTML in a string this is how you should parse it:
Document doc = Jsoup.parse(htmlString);
If you have HTML in a local file then:
Document doc = Jsoup.parse(new File("FilePath"), "UTF-8", "http://example.com/");
Your Solution
The way you are using the Jsoup parser is correct, so the problem is probably with the URL. If you can provide details about it, we can figure out what is going wrong.
Make sure the HTML generated by your Servlet is actually accessible; the link you pass should be the URL of that Servlet.

HTML parsing using JSoup

I am new to jsoup. I want to parse HTML, but the problem is the URL that has to be specified in Jsoup.connect(url): I get this URL at runtime, in the response from another page. Is there any way to pass the received URL into Jsoup.connect? I have read something like:
String html = response.getContentAsString();
Document document = Jsoup.parse(html);
But I don't understand exactly how to use it. I would also like to know whether there is a better way of doing this than jsoup.
The "parse" function, when given a String, expects HTML content, not a URL.
According to the jsoup Javadoc, the following should solve your problem:
Document doc = Jsoup.connect("http://example.com").get();
Note that the "connect" method returns a Connection object but does not actually connect. An additional call to "get" (or "post", depending on which HTTP method the server-side handler expects) is therefore needed.
