Jsoup parser not working as expected for particular URL only - java

I am using Jsoup to download the page content and then for parsing it.
public static void main(String[] args) throws IOException {
Document document = Jsoup.connect("http://www.toysrus.ch/product/index.jsp?productId=89689681").get();
final Elements elements = document.select("dt:contains(" + "EAN/ISBN:" + ")");
System.out.println(elements.size());
}
The problem: if you view the page source, there is a <dt> tag that contains the text EAN/ISBN:, but the code above prints 0 when it should print 1. I checked the HTML with document.html(); the tags seem to be there, but the tag I want has been escaped to &lt;dt&gt; when it should be <dt>. The same code works for other product URLs from the same site.
I have worked with Jsoup before and developed many parsers, but I cannot see why this very simple code does not work. Is it a Jsoup bug? Can anybody help me?

When you use connect() or parse(), jsoup by default expects valid HTML and reformats the input automatically if needed. You may try the XML parser instead.
public static void main(String [] args) throws IOException {
String url = "http://www.toysrus.ch/product/index.jsp?productId=89689681";
Document document = Jsoup.parse(new URL(url).openStream(), "UTF-8", "", Parser.xmlParser());
//final Elements elements = document.select("dt:contains(" + "EAN/ISBN:" + ")");
// the same as above but more readable:
final Elements elements = document.getElementsMatchingOwnText("EAN/ISBN");
System.out.println(elements.size());
}

You need to put single quotes around the 'EAN/ISBN:' value; otherwise it will be interpreted as a variable.
Also, there is no need to break up the string and concatenate pieces together. Just put the whole thing in one string.

Related

Page content couldn't be seen by Jsoup and HttpClient

Hi, I want to scrape information from a website, so I tried to use Jsoup (and also HttpClient). I realized that neither of them can "see" certain content of the HTML page: when I print the parsed HTML, I get an empty div (shown below), while other divs print just fine.
Here's my code:
class Main {
public static void main(String args[]) throws IOException, InterruptedException {
Document doc = Jsoup.connect(url).get();
System.out.println(doc.getElementsByClass("needed content"));
}
}
the result in the terminal is:
<div class="needed content"></div>
While searching for answers on Stack Overflow, I found that some recommend using the Jackson library:
Java - How do I access a child of Div using JSoup
Some recommend embedding a browser in Java:
Is there a way to embed a browser in Java?
Some recommend using HtmlUnit:
Fail to get full content of page with JSoup
I tried combining Jsoup with HtmlUnit and got the same result. Here's the code:
try(WebClient wc = new WebClient()){
wc.getOptions().setJavaScriptEnabled(true);
wc.getOptions().setCssEnabled(false);
wc.getOptions().setThrowExceptionOnScriptError(false);
wc.getOptions().setTimeout(10000);
HtmlPage page = wc.getPage("https://chainlinklabs.com/jobs");
String pageXml = page.asXml();
Document doc2 = Jsoup.parse(pageXml, url);
System.out.println(doc2.getElementsByClass("needed content"));
System.out.println("Thank God!");
}
My interpretation of the problem is that Jsoup does not show part of the HTML content because that part is generated by JavaScript. Am I heading in the right direction?
There is no need to re-parse the page from HtmlUnit into jsoup; it is a waste of resources. All the selection options are also available in HtmlUnit (see https://htmlunit.sourceforge.io/gettingStarted.html), and maybe more.
This simple code works for me. Parts of the page are generated by a JavaScript script that runs asynchronously, so you have to wait for these scripts before accessing the page.
public static void main(String[] args) throws IOException {
String url = "https://chainlinklabs.com/jobs";
try (final WebClient webClient = new WebClient()) {
webClient.getOptions().setThrowExceptionOnScriptError(false);
HtmlPage page = webClient.getPage(url);
webClient.waitForBackgroundJavaScriptStartingBefore(10_000);
// System.out.println("--------------------------------");
// System.out.println(page.asXml());
// System.out.println("--------------------------------");
System.out.println("- Jobs -------------------------");
final DomNodeList<DomNode> jobTitles = page.querySelectorAll(".job-title");
for (DomNode domNode : jobTitles) {
System.out.println(domNode.asNormalizedText());
}
System.out.println("--------------------------------");
}
}

Encoding of JSOUP get title of web page

How can I turn the string %D0%BC%D0%BE%D0%BE - Пошук Google into normal Russian text, like: моо - Пошук Google
I get it from the title of a downloaded page, but it seems to be in the wrong encoding.
Try this:
public static void main(String[] args) throws UnsupportedEncodingException {
String s = "%D0%BC%D0%BE%D0%BE - Пошук Google";
System.out.println(URLDecoder.decode(s, "UTF-8"));
}
it will print:
моо - Пошук Google
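One caveat with the snippet above: URLDecoder.decode throws IllegalArgumentException when a % is not followed by two hex digits, and it also turns + into a space, so a page title such as "100% cotton" would fail. The following is a minimal sketch of a more defensive variant, assuming Java 10 or later (for the decode overload that takes a Charset). The names TitleDecoder and decodeTitle are hypothetical, and the Cyrillic text is written with escapes only to keep the source file ASCII.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class TitleDecoder {

    // Hypothetical helper: decodes percent-escapes as UTF-8 and returns the
    // input unchanged when it is not valid percent-encoding.
    static String decodeTitle(String raw) {
        try {
            return URLDecoder.decode(raw, StandardCharsets.UTF_8);
        } catch (IllegalArgumentException e) {
            // A stray '%' (for example "100% cotton") ends up here.
            return raw;
        }
    }

    public static void main(String[] args) {
        // Same escaped prefix as in the question, followed by plain ASCII text.
        System.out.println(decodeTitle("%D0%BC%D0%BE%D0%BE - Google"));
        // Not percent-encoded at all: returned as-is instead of throwing.
        System.out.println(decodeTitle("100% cotton"));
    }
}
```

Falling back to the raw string is a design choice for titles that are only sometimes escaped; if the input is always supposed to be encoded, letting the exception propagate may be the better signal.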
Try this code when fetching the document:
Document doc = Jsoup.connect("url").get();
doc.charset(Charset.forName("UTF-8"));

Jsoup: null result in absUrl (abs:)

I tried to make an image link downloader with jsoup. I finished the part that downloads the HTML, and while working on the parsing part I noticed that some image links come without the leading (host) part. I found the absUrl solution, but for some reason it did not work (it gave me null). I then tried uri.resolve(), but it returned the link unchanged. Now I do not know how to solve it. Below is the part of my code responsible for parsing and writing the URLs to a string:
public static String finalcode(String textin) throws Exception {
String text = source(textin);
Document doc = Jsoup.parse(text);
Elements images = doc.getElementsByTag("img");
String Simages = images.toString();
int Limages = countLines(Simages);
StringBuilder src = new StringBuilder();
while (Limages > 0) {
Limages--;
Element image = images.get(Limages);
String href = image.attr("src");
src.append(href);
src.append("\n");
}
String result = src.toString();
return result;
}
It looks like you are parsing HTML from a String, not from a URL. Because of that, jsoup cannot know which URL this HTML came from, so it cannot create an absolute path.
To set this URL for the Document, parse it using the Jsoup.parse(String html, String baseUri) overload, like this:
String url = "http://server/pages/document.html";
String text = "<img src = '../images/image_name1.jpg'/><img src = '../images/image_name2.jpg'/>";
Document doc = Jsoup.parse(text, url);
Elements images = doc.getElementsByTag("img");
for (Element image : images){
System.out.println(image.attr("src")+" -> "+image.attr("abs:src"));
}
Output:
../images/image_name1.jpg -> http://server/images/image_name1.jpg
../images/image_name2.jpg -> http://server/images/image_name2.jpg
Another option is to let Jsoup fetch and parse the page directly by supplying the URL instead of a String with HTML:
Document doc = Jsoup.connect("http://example.com").get();
This way Document will know from which URL it came, so it will be able to create absolute paths.
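To see what the base URI contributes, the same kind of resolution can be sketched with java.net.URI from the JDK alone. This illustrates relative-URL resolution in general and is not jsoup's internal code; AbsoluteLinks and toAbsolute are hypothetical names. Without an absolute base there is nothing to resolve against, which is the situation when HTML is parsed from a String with no baseUri.

```java
import java.net.URI;
import java.util.Optional;

class AbsoluteLinks {

    // Hypothetical helper: resolves src against baseUri, or returns
    // Optional.empty() when there is no usable absolute base.
    static Optional<String> toAbsolute(String baseUri, String src) {
        if (baseUri == null || baseUri.isEmpty()) {
            return Optional.empty();
        }
        try {
            URI base = URI.create(baseUri);
            if (!base.isAbsolute()) {
                // A base without a scheme cannot produce an absolute URL.
                return Optional.empty();
            }
            return Optional.of(base.resolve(src).toString());
        } catch (IllegalArgumentException e) {
            // URI.create and resolve(String) reject syntactically invalid input.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String base = "http://server/pages/document.html";
        System.out.println(toAbsolute(base, "../images/image_name1.jpg"));
        // No base URI, as when only the HTML string is known.
        System.out.println(toAbsolute("", "../images/image_name1.jpg"));
    }
}
```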

cannot preserve newlines in text read from URL

I am reading text from a URL using Jsoup. The following question has some tips on preserving newlines when converting the body to text:
How do I preserve line breaks when using jsoup to convert html to plain text?
I use the following lines to convert the tags:
String prettyPrintedBodyFragment = Jsoup.clean(body, "", Whitelist
.none().addTags("br", "p", "h1"), new OutputSettings()
.prettyPrint(true));
System.out.println(prettyPrintedBodyFragment);
I still get the body content on a single line. Any clues, please?
EDIT: Here is the complete source code; I see the output on only one line:
public static void main(String[] args) throws Exception {
Connection conn = Jsoup.connect("http://finance.yahoo.com/");
Document doc = conn.get();
String body = doc.body().text();
String prettyPrintedBodyFragment = Jsoup.clean(body, "", Whitelist
.none().addTags("br", "p", "h1"), new OutputSettings()
.prettyPrint(true));
System.out.println(prettyPrintedBodyFragment);
}
Change:
String body = doc.body().text();
To:
String body = doc.body().html();
Calling text() has already stripped every tag from the body, so the Whitelist has no tags left to keep when formatting your text.

HTML Parser fetch link text

I'm using HTML Parser to fetch links from a web page. I need to store the URL, link text and the URL to the parent page containing the link. I have managed to get the link URL as well as the parent URL.
I still need to get the link text, that is, the text between the opening and closing tags of the link.
Unfortunately I'm having a hard time figuring it out; any help would be greatly appreciated.
public static List<LinkContainer> findUrls(String resource) {
String[] tagNames = {"A", "AREA"};
List<LinkContainer> urls = new ArrayList<LinkContainer>();
Tag tag;
String url;
String sourceUrl;
try {
for (String tagName : tagNames) {
Parser parser = new Parser(resource);
NodeList nodes = parser.parse(new TagNameFilter(tagName));
NodeIterator i = nodes.elements();
while (i.hasMoreNodes()) {
tag = (Tag) i.nextNode();
url = tag.getAttribute("href");
sourceUrl = tag.getPage().getUrl();
if (RegexUtil.verifyUrl(url)) {
urls.add(new LinkContainer(url, null, sourceUrl));
}
}
}
} catch (ParserException pe) {
pe.printStackTrace();
}
return urls;
}
Have you tried ((LinkTag) tag).getLinkText()? Personally I prefer an HTML parser that produces XML according to a widely used standard, e.g. Xerces or similar. This is what you get from using, e.g., http://nekohtml.sourceforge.net/.
You would need to check the children of each A Tag. If you assume that your A tags only have a single child (the text itself), you can use the getFirstChild() method. This should be an instance of TextNode, and you can call getText() on this to get the link text.
