Encoding of JSOUP get title of web page - java

How can I turn the string %D0%BC%D0%BE%D0%BE - Пошук Google into normal Russian text like: моо - Пошук Google
I get it from the title of a downloaded page, but it seems to be in the wrong encoding.

Try this:
public static void main(String[] args) throws UnsupportedEncodingException {
    String s = "%D0%BC%D0%BE%D0%BE - Пошук Google";
    System.out.println(URLDecoder.decode(s, "UTF-8"));
}
It will print:
моо - Пошук Google
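A small variation on the answer above, as a sketch only: it assumes Java 10 or later, where URLDecoder.decode accepts a Charset, so no checked UnsupportedEncodingException has to be declared. The names TitleDecoder and decodeTitle are made up for illustration. URLDecoder also turns '+' into a space and throws IllegalArgumentException on a malformed % sequence, so the sketch falls back to the raw string in that case.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of jsoup.
final class TitleDecoder {

    // Decodes percent-encoded UTF-8 sequences in a page title.
    // Returns the input unchanged if it contains a malformed escape.
    static String decodeTitle(String rawTitle) {
        try {
            return URLDecoder.decode(rawTitle, StandardCharsets.UTF_8);
        } catch (IllegalArgumentException e) {
            return rawTitle;
        }
    }

    public static void main(String[] args) {
        // Characters that are not percent-encoded pass through untouched.
        System.out.println(decodeTitle("%D0%BC%D0%BE%D0%BE - Google"));
    }
}
```

If the title can legitimately contain a literal '+' or '%', decoding the whole string this way is probably too aggressive; in that case it may be better to decode only the part known to come from a URL.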

Try this code when fetching the document:
Document doc = Jsoup.connect("url").get();
doc.charset(Charset.forName("UTF-8"));

Related

Page content couldn't be seen by Jsoup and HttpClient

Hi, I want to scrape information from a website, so I tried Jsoup (and also HttpClient). I realized that neither of them can "see" certain content of the HTML page: when I print the parsed HTML, I get an empty div like the one below, while other divs print just fine.
Here's my code:
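import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class Main {
    public static void main(String[] args) throws IOException, InterruptedException {
        String url = "https://chainlinklabs.com/jobs"; // the page used in the HtmlUnit attempt below
        Document doc = Jsoup.connect(url).get();
        System.out.println(doc.getElementsByClass("needed content"));
    }
}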
The result in the terminal is:
<div class="needed content"></div>
I searched Stack Overflow for answers. Some recommend using the Jackson library:
Java - How do I access a child of Div using JSoup
Some recommend embedding a browser in Java:
Is there a way to embed a browser in Java?
Some recommend using HtmlUnit:
Fail to get full content of page with JSoup
I just tried combining Jsoup with HtmlUnit and got the same result. Here's the code:
try (WebClient wc = new WebClient()) {
    wc.getOptions().setJavaScriptEnabled(true);
    wc.getOptions().setCssEnabled(false);
    wc.getOptions().setThrowExceptionOnScriptError(false);
    wc.getOptions().setTimeout(10000);
    HtmlPage page = wc.getPage("https://chainlinklabs.com/jobs");
    String pageXml = page.asXml();
    Document doc2 = Jsoup.parse(pageXml, url);
    System.out.println(doc2.getElementsByClass("needed content"));
    System.out.println("Thank God!");
}
My interpretation of the problem is that Jsoup is not showing part of the HTML content because that part depends on JavaScript. Am I heading in the right direction?
There is no need to re-parse the page from HtmlUnit into jsoup (it is a waste of resources). All the selection options are also available in HtmlUnit (see https://htmlunit.sourceforge.io/gettingStarted.html), and maybe more.
This simple code works for me. Parts of the page are generated by a JavaScript script that runs asynchronously, so you have to wait for these scripts before accessing the page.
public static void main(String[] args) throws IOException {
    String url = "https://chainlinklabs.com/jobs";
    try (final WebClient webClient = new WebClient()) {
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        HtmlPage page = webClient.getPage(url);
        webClient.waitForBackgroundJavaScriptStartingBefore(10_000);
        // System.out.println("--------------------------------");
        // System.out.println(page.asXml());
        // System.out.println("--------------------------------");
        System.out.println("- Jobs -------------------------");
        final DomNodeList<DomNode> jobTitles = page.querySelectorAll(".job-title");
        for (DomNode domNode : jobTitles) {
            System.out.println(domNode.asNormalizedText());
        }
        System.out.println("--------------------------------");
    }
}

Jsoup parser not working as expected for particular URL only

I am using Jsoup to download the page content and then to parse it.
public static void main(String[] args) throws IOException {
Document document = Jsoup.connect("http://www.toysrus.ch/product/index.jsp?productId=89689681").get();
final Elements elements = document.select("dt:contains(" + "EAN/ISBN:" + ")");
System.out.println(elements.size());
}
The problem: if you view the page source, there is a <dt> tag that contains the text EAN/ISBN:, but the code above prints 0 when it should print 1. I have already checked the HTML using document.html(); the tags seem to be there, but the tag I want appears escaped as &lt;dt&gt; instead of <dt>. The same code works for other product URLs from the same site.
I have worked with Jsoup before and developed many parsers, but I don't understand why this very simple code is not working. It's strange! Is it a Jsoup bug? Can anybody help me?
When using connect() or parse(), jsoup by default expects valid HTML and normalizes the input if needed. You can try the XML parser instead.
public static void main(String[] args) throws IOException {
    String url = "http://www.toysrus.ch/product/index.jsp?productId=89689681";
    Document document = Jsoup.parse(new URL(url).openStream(), "UTF-8", "", Parser.xmlParser());
    //final Elements elements = document.select("dt:contains(" + "EAN/ISBN:" + ")");
    // the same as above but more readable:
    final Elements elements = document.getElementsMatchingOwnText("EAN/ISBN");
    System.out.println(elements.size());
}
You need to put single quotes around the 'EAN/ISBN:' value; otherwise it will be interpreted as a variable.
Also, there is no need to break up the string and concatenate the pieces. Just write the whole selector as one string.

Getting Content From a Website with Java

I was curious how to pull information from a website with Java, and I found that jsoup (an HTML parser) was a popular suggestion. I have found quite a few examples online, but nothing that really explains how to use it. Say I wanted to get the temperature for Toronto using this URL, http://weather.gc.ca/city/pages/on-143_metric_e.html , how would I go about doing so?
I guess you have to specify tags, but in the HTML for that site the information I want is in a dd tag, and so is a lot of other information, so when I run my code
String url = "http://weather.gc.ca/city/pages/on-4_metric_e.html";
Document document = Jsoup.connect(url).get();
String temp = document.select("dd").text();
System.out.println("Title: " + temp);
I get a lot more information than I want.
For the temperature try this:
String url = "http://weather.gc.ca/city/pages/on-4_metric_e.html";
Document document = Jsoup.connect(url).get();
String temp = document.select("p").get(1).text();
System.out.println("Temperature: " + temp);
To formulate CSS queries, refer to the selector syntax sheet: http://jsoup.org/cookbook/extracting-data/selector-syntax
Also try http://try.jsoup.org/, which is great for testing!
Let's say I want to read the contents of mywebsite.com. This is how I'd do it:
import java.net.*;
import java.io.*;

class MyClass {
    public static void main(String[] arg) throws Exception {
        URL u = new URL("http://www.mywebsite.com");
        InputStream ins = u.openStream();
        InputStreamReader isr = new InputStreamReader(ins);
        BufferedReader br = new BufferedReader(isr);
        System.out.println(br.readLine());
    }
}
Hopefully this gets you started.
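The snippet above prints only the first line of the page and decodes it with the platform default charset. A sketch that reads everything and names the charset explicitly might look like the following. PageReader and readAll are illustrative names, and UTF-8 is an assumption about the page, not something the site guarantees.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

// Hypothetical helper for illustration.
final class PageReader {

    // Reads the whole stream as text in the given charset and closes it.
    // Line breaks are normalized to '\n'.
    static String readAll(InputStream in, Charset charset) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.lines().collect(Collectors.joining("\n"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.mywebsite.com"); // placeholder URL from the answer above
        System.out.println(readAll(url.openStream(), StandardCharsets.UTF_8));
    }
}
```

This still returns only what the server sends; content that a browser builds with JavaScript will not appear in the result.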

Trying to get exact source code from web page I see from my browser using Java.

I am very new to programming (it's my second day). I am looking at a finance webpage and trying to extract the stock symbols from it. Using the page's source code, I'd like a list that looks like ADK-A, AEH, AED, etc., which is a list of the symbols as they appear on the webpage and in the browser-generated source code.
Looking at the source code in Chrome, you can see the stock symbols, but with Java, even though I get some of the source code, the stock symbols and plenty of other code never show up, whichever way I try.
I have tried implementations using the URL class, the URLConnection class, and HtmlUnit. I don't know much, but I'm guessing this part of the source is generated by some sort of JavaScript? I figured HtmlUnit would help, as supposedly it can handle scripts. It didn't, at least the way I am using it. Anyway, this is what I tried:
private static String name1 = "http://www.quantumonline.com/pfdtable.cfm?Type=TaxAdvPfds&SortColumn=Company&SortOrder=ASC";
//Implementation 1
public static void main (String[] args) throws IOException {
URL thisUrl = new URL(name1);
BufferedReader thisUrlBufferedReader = new BufferedReader (new InputStreamReader(thisUrl.openStream()));
String currentline;
while( (currentline = thisUrlBufferedReader.readLine()) != null) {
if ((currentline.contains("href")) == true) {
System.out.println(currentline);
}
}
}
//Implementation 2. My understanding of addRequestProperty on a URLConnection was that it makes sure the website isn't restricting me based on my user agent.
//I honestly don't really know what it does, but I tried with and without it; it didn't help.
public static void main (String[] args) throws IOException {
URL thisUrl = new URL(name1);
URLConnection thisUrlConnect = thisUrl.openConnection();
thisUrlConnect.addRequestProperty("User-Agent", "the user agent i got from http://whatsmyuseragent.com/");
InputStream input = thisUrlConnect.getInputStream();
BufferedReader thisUrlBufferedReader = new BufferedReader (new InputStreamReader (input));
String currentline;
while( (currentline = thisUrlBufferedReader.readLine()) != null) {
System.out.println(currentline);
}
}
//Implementation 3. I also used WebClient(BrowserVersion.CHROME) plus all the other versions;
//nothing worked.
public static void main(String[] args) throws Exception {
WebClient webClient = new WebClient();
HtmlPage page = webClient.getPage(name1);
System.out.println(page.asXml());
}
}
If anyone has any ideas, I'm all ears. Thanks!

Using boilerpipe to extract non-english articles

I am trying to use the boilerpipe Java library to extract news articles from a set of websites.
It works great for text in English, but text with special characters, for example words with accent marks (história), is not extracted correctly. I think it is an encoding problem.
In the boilerpipe FAQ, it says "If you extract non-English text you might need to change some parameters" and then refers to a paper. I found no solution in this paper.
My question is: are there any parameters in boilerpipe where I can specify the encoding? Is there any way to work around this and get the text correctly?
How I'm using the library:
(first attempt, based on the URL):
URL url = new URL(link);
String article = ArticleExtractor.INSTANCE.getText(url);
(second attempt, based on the HTML source code):
String article = ArticleExtractor.INSTANCE.getText(html_page_as_string);
You don't have to modify inner Boilerpipe classes.
Just pass an InputSource object to the ArticleExtractor.INSTANCE.getText() method and force the encoding on that object. For example:
URL url = new URL("http://some-page-with-utf8-encodeing.tld");
InputSource is = new InputSource();
is.setEncoding("UTF-8");
is.setByteStream(url.openStream());
String text = ArticleExtractor.INSTANCE.getText(is);
Regards!
Well, from what I see, when you use it like that, the library chooses the encoding automatically. From the HTMLFetcher source:
public static HTMLDocument fetch(final URL url) throws IOException {
    final URLConnection conn = url.openConnection();
    final String ct = conn.getContentType();
    Charset cs = Charset.forName("Cp1252");
    if (ct != null) {
        Matcher m = PAT_CHARSET.matcher(ct);
        if(m.find()) {
            final String charset = m.group(1);
            try {
                cs = Charset.forName(charset);
            } catch (UnsupportedCharsetException e) {
                // keep default
            }
        }
    }
Try debugging their code a bit, starting with ArticleExtractor.getText(URL), and see if you can override the encoding.
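To see the fallback behaviour of that excerpt in isolation, here is a minimal sketch of the same idea using only the JDK. The regular expression is an assumption (boilerpipe's own PAT_CHARSET may differ), and CharsetSniffer and fromContentType are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not boilerpipe code.
final class CharsetSniffer {

    // Assumed pattern; boilerpipe's PAT_CHARSET may be written differently.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a Content-Type header value,
    // or the fallback if the header is missing, has no charset, or names an unknown one.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        Charset fallback = Charset.forName("Cp1252"); // same default as the excerpt above
        System.out.println(fromContentType("text/html; charset=UTF-8", fallback));
        System.out.println(fromContentType("text/html", fallback));
    }
}
```

The second call shows the likely cause of the garbled accents: when the server sends no charset in the header, the default is used even if the page itself is UTF-8.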
OK, I got a solution.
As Andrei said, I had to change the class HTMLFetcher, which is in the package de.l3s.boilerpipe.sax.
What I did was convert all the fetched text to UTF-8.
At the end of the fetch function, I had to add two lines and change the last one:
final byte[] data = bos.toByteArray(); // stays the same
byte[] utf8 = new String(data, cs).getBytes("UTF-8"); // new line (conversion); pass the Charset itself, since displayName() is not guaranteed to be a valid charset name
cs = Charset.forName("UTF-8"); // set the charset to UTF-8
return new HTMLDocument(utf8, cs); // edited line
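The conversion step can be tried on its own with plain JDK classes. This is only a sketch: Transcoder and toUtf8 are made-up names, and it assumes the source charset was detected correctly. If it was not, the wrong characters are simply re-encoded as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not boilerpipe code.
final class Transcoder {

    // Decodes the bytes with the source charset, then re-encodes them as UTF-8.
    static byte[] toUtf8(byte[] data, Charset source) {
        return new String(data, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "história" with the accented letter written as an escape.
        byte[] latin1 = "hist\u00f3ria".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        // The accented letter takes one byte in ISO-8859-1 and two in UTF-8.
        System.out.println(latin1.length + " bytes -> " + utf8.length + " bytes"); // prints "8 bytes -> 9 bytes"
    }
}
```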
Boilerpipe's ArticleExtractor uses some algorithms that have been specifically tailored to English, such as measuring the number of words in average phrases. In any language that is more or less verbose than English (i.e., every other language) these algorithms will be less accurate.
Additionally, the library uses some English phrases to try and find the end of the article (comments, post a comment, have your say, etc) which will clearly not work in other languages.
This is not to say that the library will outright fail - just be aware that some modification is likely needed for good results in non-English languages.
Java:
import java.net.URL;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class Boilerpipe {
    public static void main(String[] args) {
        try {
            URL url = new URL("http://www.azeri.ru/az/traditions/kuraj_pehlevanov/");
            InputSource is = new InputSource();
            is.setEncoding("UTF-8");
            is.setByteStream(url.openStream());
            String text = ArticleExtractor.INSTANCE.getText(is);
            System.out.println(text);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Eclipse:
Run > Run Configurations > Common Tab. Set Encoding to Other(UTF-8), then click Run.
I had the same problem; cnr's solution works great. I just had to change the encoding from UTF-8 to ISO-8859-1. Thanks!
URL url = new URL("http://some-page-with-utf8-encodeing.tld");
InputSource is = new InputSource();
is.setEncoding("ISO-8859-1");
is.setByteStream(url.openStream());
String text = ArticleExtractor.INSTANCE.getText(is);
