jsoup - Not able to fetch a specific website - java

I'm using the latest jsoup (1.13.1) in the latest Eclipse IDE for Java Developers (includes Incubating components), Version: 2020-09 (4.17.0), Build id: 20200910-1200.
I'm trying to parse a very specific website, but with no success.
After I execute these lines:
doc = Jsoup.connect("http://pokehb.pw/%D7%A2%D7%95%D7%A0%D7%94/21/%D7%A4%D7%A8%D7%A7/43").get();
doc.select("title").forEach(System.out::println);
Nothing gets printed.
It's not just the <title>; no element or property of the page is available.
Yes, the URL is weird, but this is the one I need, and I can browse it fine in Chrome.
I also know this is not due to the Hebrew in the website, since other Hebrew sites work fine.
For example, using this URL seems fine: https://context.reverso.net/translation/hebrew-english/%D7%9C%D7%9B%D7%AA%D7%95%D7%91%D7%AA+url
Any hint on what can be done?

What I can tell you is that there's a "laravel_session" entry in the cookies. This suggests you'll need a more capable technology than jsoup. Try HtmlUnit instead; it might work.
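A minimal sketch of that HtmlUnit approach (untested against this particular site; the option settings are assumptions made so the site's script errors don't abort the load):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class FetchWithHtmlUnit {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // Assumed settings: tolerate the site's script and CSS problems
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setCssEnabled(false);

            HtmlPage page = webClient.getPage(
                    "http://pokehb.pw/%D7%A2%D7%95%D7%A0%D7%94/21/%D7%A4%D7%A8%D7%A7/43");

            // The title jsoup could not see
            System.out.println(page.getTitleText());

            // If you still want jsoup's selectors, feed it the rendered markup
            org.jsoup.nodes.Document doc = org.jsoup.Jsoup.parse(page.asXml());
            doc.select("title").forEach(System.out::println);
        }
    }
}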

What I ended up doing is using this command:
doc = Jsoup.parse(driver.getPageSource());
Which brought all of the page's source into the doc.
From there it was a simple use of getElementsByClass and getElementsByTag.
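For completeness, a minimal sketch of the whole Selenium-plus-jsoup round trip (the driver setup is an assumption, not part of the original post; it requires chromedriver on the PATH):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class FetchWithSelenium {
    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");
        ChromeDriver driver = new ChromeDriver(options);
        try {
            driver.get("http://pokehb.pw/%D7%A2%D7%95%D7%A0%D7%94/21/%D7%A4%D7%A8%D7%A7/43");
            // The browser has run the site's JavaScript; hand the result to jsoup
            Document doc = Jsoup.parse(driver.getPageSource());
            System.out.println(doc.title());
            doc.getElementsByTag("a").forEach(a -> System.out.println(a.attr("href")));
        } finally {
            driver.quit();
        }
    }
}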
Hope this helps someone, and thanks Rob for trying to answer.

Related

Using Java/Selenium to read data from Google spreadsheet document

I have a simple Google spreadsheet doc (a list of site URLs) that gets created by a script:
https://docs.google.com/spreadsheets/d/1aPpgCIWG0ASw1GgRQneVzJu8mHlLXWrk60_4WGELbaI
I want to navigate a Firefox browser to each URL and run my link checker against it. Is there a way to do this without authorizing tokens and such? I'm not having much luck finding good, current information on this. I am using Eclipse and Java 1.8, and I build with a POM.xml file and Maven.
Thanks in advance for any help provided!
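One token-free possibility, assuming the sheet is shared as viewable by anyone with the link: Google serves a CSV export of such sheets at a predictable URL, which plain Java can read before handing each row to Selenium. A minimal, untested sketch:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SheetUrls {
    public static void main(String[] args) throws Exception {
        // Works only if the sheet is readable by "anyone with the link"
        String export = "https://docs.google.com/spreadsheets/d/"
                + "1aPpgCIWG0ASw1GgRQneVzJu8mHlLXWrk60_4WGELbaI"
                + "/export?format=csv";
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL(export).openStream(), StandardCharsets.UTF_8))) {
            String row;
            while ((row = in.readLine()) != null) {
                // Each row holds one site URL; drive Firefox / the link checker here
                System.out.println(row);
            }
        }
    }
}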

Parse a page (partly generated by JavaScript) by using Selenium

I've got a problem: I want to parse a page (e.g. this one) to collect information about the offered apps and save that information into a database.
Moreover, I am using crawler4j to visit every (available) page. But the problem, as far as I can see, is that crawler4j needs links to follow in the source code.
In this case, however, the hrefs are generated by JavaScript, so crawler4j never gets new links to visit / pages to crawl.
So my idea was to use Selenium so that I can inspect elements as in a real browser like Chrome or Firefox (I'm quite new to this).
But, to be honest, I don't know how to get the "generated" HTML instead of the raw source code.
Can anybody help me?
To inspect elements you do not need the Selenium IDE; just use Firefox with the Firebug extension. Also, with the developer tools add-on you can view a page's source and also the generated source (this is mainly for PHP).
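If you need the generated HTML programmatically rather than in a browser tool, Selenium itself can return the DOM after the page's JavaScript has run. A minimal sketch (the URL and wait condition are placeholders; adjust both to the real page):

import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class GeneratedHtml {
    public static void main(String[] args) {
        FirefoxDriver driver = new FirefoxDriver();
        try {
            driver.get("https://example.com/apps"); // hypothetical app-listing page
            // Wait until the JavaScript-generated links actually exist in the DOM
            new WebDriverWait(driver, Duration.ofSeconds(10)).until(
                    ExpectedConditions.presenceOfElementLocated(By.cssSelector("a[href]")));
            // getPageSource() now reflects the DOM as modified by JavaScript
            String generated = driver.getPageSource();
            System.out.println(generated.length() + " chars of rendered markup");
        } finally {
            driver.quit();
        }
    }
}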
crawler4j cannot handle JavaScript like this. That job is better left to another, more advanced crawling library. See this response here:
Web Crawling (Ajax/JavaScript enabled pages) using java

CSS Validator library for Java

I am looking for a CSS validator library that I can use in my Java application. I have checked out this: http://jigsaw.w3.org/css-validator/manual.html. But as I understand it, that needs to be run on a server locally or used as a command-line tool. Correct me if I am wrong here.
Thanks in Advance
Behind the link you posted there is a web service which you can use to validate your CSS files, but you have to be online to do so. There is also an offline version available at http://jigsaw.w3.org/css-validator/DOWNLOAD.html, which you might be able to embed in your application.
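A minimal sketch of calling that web service from Java (the uri and output=soap12 parameters come from the validator's documented web-service interface; the stylesheet URL is a placeholder, and W3C asks automated clients to throttle their requests):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class CssValidatorClient {
    public static void main(String[] args) throws Exception {
        String target = "https://example.com/style.css"; // hypothetical stylesheet
        String request = "http://jigsaw.w3.org/css-validator/validator?uri="
                + URLEncoder.encode(target, StandardCharsets.UTF_8.name())
                + "&output=soap12";
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL(request).openStream(), StandardCharsets.UTF_8))) {
            // The SOAP response carries an <m:validity> element: true means the CSS is valid
            in.lines().filter(l -> l.contains("validity")).forEach(System.out::println);
        }
    }
}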

Need to extract results from google keywords external tool?

I need to build a little Java tool that gets the keyword suggestions and traffic estimates from the Google keywords tool at https://adwords.google.com/select/KeywordToolExternal .
The page is rendered in JavaScript, so simple scraping isn't possible. I have tried HtmlUnit, but it doesn't work (tried different browser versions, still no luck).
One way could be to embed a web browser in Java, but I haven't had any success with that.
Any suggestions or alternatives?
Have a look at the AdWords API and its client libraries (for Java, Python, etc.).

how to convert html page to image using java or php

I want to know how to convert an HTML file to an image. How do I do this?
You can check out the source code for the popular BrowserShots service:
http://browsershots.org/
If you're running Windows, and have the GD library installed, you can use imagegrabwindow. I've never used it myself, but as always, the PHP site has lots of documentation and examples.
Use WKHTMLTOPDF. It has bindings for PHP, or you can run it yourself from the command line.
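Since the question asks for an image rather than a PDF, note that the same project ships a wkhtmltoimage binary. A minimal sketch of driving it from Java (assumes the binary is installed and on the PATH):

import java.io.IOException;

public class HtmlToImage {
    public static void main(String[] args) throws IOException, InterruptedException {
        // wkhtmltoimage renders the page with its embedded WebKit engine
        Process p = new ProcessBuilder(
                "wkhtmltoimage", "http://example.com/", "screenshot.png")
                .inheritIO()   // surface the tool's progress output
                .start();
        int exit = p.waitFor();
        System.out.println(exit == 0 ? "wrote screenshot.png" : "failed with exit code " + exit);
    }
}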
The problem is that you would otherwise need to implement all the functionality of a browser and an HTTP stack (and that still does not deal with the case where the content is modified by JavaScript).
As John McCollum says, if you've got the website open in a browser on your PC, then you can use imagegrabwindow or snapsIE (MSIE only).
If you want to be able to get a snapshot using code only, you might want to look at one of the off-the-shelf solutions. AFAIK there are several programs (at least two of which are called html2pdf) that will generate a PDF of static HTML, and it's relatively easy, using standard tools, to trim this to window size and convert it to an image file.
e.g. https://metacpan.org/pod/distribution/PDF-FromHTML/script/html2pdf.pl
