Hi, I want to scrape information from a website, so I tried using Jsoup (and also HttpClient). I realized that neither of them could "see" certain content of the HTML page: when I print the parsed HTML, I get an empty div like the one below, while other divs print just fine.
Here's my code:
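import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Assumed: the same page that is used in the HtmlUnit attempt below
        String url = "https://chainlinklabs.com/jobs";
        Document doc = Jsoup.connect(url).get();
        System.out.println(doc.getElementsByClass("needed content"));
    }
}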
The result in the terminal is:
<div class="needed content"></div>
I searched for answers on Stack Overflow. Some recommend using the Jackson library:
Java - How do I access a child of Div using JSoup
Some recommend embedding a browser in Java:
Is there a way to embed a browser in Java?
Some recommend using HtmlUnit:
Fail to get full content of page with JSoup
I just tried combining Jsoup with HtmlUnit and got the same result. Here's the code:
try (WebClient wc = new WebClient()) {
    wc.getOptions().setJavaScriptEnabled(true);
    wc.getOptions().setCssEnabled(false);
    wc.getOptions().setThrowExceptionOnScriptError(false);
    wc.getOptions().setTimeout(10000);
    HtmlPage page = wc.getPage("https://chainlinklabs.com/jobs");
    String pageXml = page.asXml();
    Document doc2 = Jsoup.parse(pageXml, url);
    System.out.println(doc2.getElementsByClass("needed content"));
    System.out.println("Thank God!");
}
My interpretation of the problem is that Jsoup does not show part of the HTML content because that part depends on JavaScript. Am I heading in the right direction?
There is no need to re-parse the page from HtmlUnit with jsoup (it is a waste of resources). All the selection options are also available in HtmlUnit (see https://htmlunit.sourceforge.io/gettingStarted.html), and maybe more.
This simple code works for me. Parts of the page are generated by a JavaScript script that starts asynchronously, so you have to wait for these scripts before accessing the page.
public static void main(String[] args) throws IOException {
    String url = "https://chainlinklabs.com/jobs";
    try (final WebClient webClient = new WebClient()) {
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        HtmlPage page = webClient.getPage(url);
        // Block until background scripts scheduled to start within the next 10 s have finished
        webClient.waitForBackgroundJavaScriptStartingBefore(10_000);
        // System.out.println("--------------------------------");
        // System.out.println(page.asXml());
        // System.out.println("--------------------------------");
        System.out.println("- Jobs -------------------------");
        final DomNodeList<DomNode> jobTitles = page.querySelectorAll(".job-title");
        for (DomNode domNode : jobTitles) {
            System.out.println(domNode.asNormalizedText());
        }
        System.out.println("--------------------------------");
    }
}
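If a fixed wait turns out to be too short or needlessly long, another option is to poll until the expected content shows up. The sketch below uses only the standard library: waitUntil is a hypothetical helper (not part of HtmlUnit), and the AtomicReference is a stand-in for content that a background script fills in later. With HtmlUnit, the condition would presumably be a check that page.querySelectorAll(".job-title") is no longer empty.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BooleanSupplier;

class WaitForContent {

    // Hypothetical helper: polls the condition until it holds or the timeout elapses.
    // Returns true if the condition became true, false on timeout or interruption.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-in for page content that is empty at first and filled in asynchronously.
        AtomicReference<String> jobTitle = new AtomicReference<>("");
        CompletableFuture.runAsync(() -> {
            try {
                Thread.sleep(200); // simulated script delay
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            jobTitle.set("Software Engineer");
        });

        System.out.println("before wait: '" + jobTitle.get() + "'");
        boolean ready = waitUntil(() -> !jobTitle.get().isEmpty(), 5_000, 50);
        System.out.println("ready=" + ready + ", after wait: '" + jobTitle.get() + "'");
    }
}
```

Polling returns as soon as the content is there, so the timeout only matters when something goes wrong.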
Related
I am trying to run the tutorial here. The code looks like this:
public class Test {
    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        WebClient client = new WebClient(BrowserVersion.FIREFOX);
        HtmlPage page = client.getPage("https://google.com/");
        // Getting Form from google home page. tsf is the form name
        HtmlForm form = page.getHtmlElementById("tsf"); // Error occurs here
        form.getInputByName("q").setValueAttribute("test");
        // Creating a virtual submit button
        HtmlButton submitButton = (HtmlButton) page.createElement("button");
        submitButton.setAttribute("type", "submit");
        form.appendChild(submitButton);
        // Submitting the form and getting the result
        HtmlPage newPage = submitButton.click();
        // Getting the result as text (from the result page, not the original page)
        String text = newPage.asNormalizedText();
        System.out.println(text);
    }
}
But I am getting error message:
Exception in thread "main" com.gargoylesoftware.htmlunit.ElementNotFoundException: elementName=[*] attributeName=[id] attributeValue=[tsf]
at com.gargoylesoftware.htmlunit.html.HtmlPage.getHtmlElementById(HtmlPage.java:1670)
at Test.main(Test.java:20)
Since this tutorial is relatively old, the ID tsf might be outdated. However, when I look at the Google home page, I can't figure out the form name. Maybe I don't understand the meaning of the whole HtmlForm object. (I am completely new to this topic.)
There is no element with the ID tsf anymore. The best way to check is to go to the site and open your browser's web developer tools (F12 in most browsers). You can see the whole HTML document from there.
I am using Jsoup to download the page content and then parse it.
public static void main(String[] args) throws IOException {
    Document document = Jsoup.connect("http://www.toysrus.ch/product/index.jsp?productId=89689681").get();
    final Elements elements = document.select("dt:contains(" + "EAN/ISBN:" + ")");
    System.out.println(elements.size());
}
The problem: if you view the page source, there is a <dt> tag that contains the text EAN/ISBN:, but the code above prints 0 when it should print 1. I have already checked the HTML using document.html(). The tags seem to be there, but the tag I want is replaced by escaped characters like &lt;dt&gt; (angle brackets written as HTML entities), where it should be <dt>. The same code works for other product URLs from the same site.
I have already worked with Jsoup and developed many parsers, but I don't understand why the very simple code above is not working. It's strange! Is it a Jsoup bug? Can anybody help me?
When using connect() or parse(), jsoup by default expects valid HTML and reformats the input automatically if needed. You may try the XML parser instead.
public static void main(String[] args) throws IOException {
    String url = "http://www.toysrus.ch/product/index.jsp?productId=89689681";
    Document document = Jsoup.parse(new URL(url).openStream(), "UTF-8", "", Parser.xmlParser());
    //final Elements elements = document.select("dt:contains(" + "EAN/ISBN:" + ")");
    // the same as above but more readable:
    final Elements elements = document.getElementsMatchingOwnText("EAN/ISBN");
    System.out.println(elements.size());
}
You need to put single quotes around the 'EAN/ISBN:' value; otherwise it will be interpreted as a variable.
Also, there is no need to break up the string and concatenate pieces together. Just put the whole thing in one string.
I am trying to parse a website with HtmlUnit and Jsoup, and I am facing this problem.
I have different pages to parse, and I stored the links to these pages in a string array.
I want to loop over the array and parse each page, and I proceed in this way:
1) For loop on the length of link's array
2) Opening new webclient
3) Creating new HtmlPage from link with getPage method
4) Parsing and getting some elements
5) Closing webclient
6) go back to 2).
In this way, I obtain what I want, but the code is a little slow. So I tried to open and close the WebClient outside the for loop, like this:
1) Opening new webclient
2) For loop on the length of link's array
3) Creating new HtmlPage from link with getPage method
4) Parsing and getting some elements
5) go back to 2).
6) Closing webclient
It's much faster, but I'm not obtaining the same results as with the previous way.
Is it wrong to use the WebClient constructor in this way?
EDIT:
Here is the code I'm testing:
public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    // TODO Auto-generated method stub
    java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
    String[] links = {"http://www.oddsportal.com/tennis/china/atp-beijing/murray-andy-dimitrov-grigor-fTdGYm3q/#cs;2;6",
            "http://www.oddsportal.com/tennis/china/atp-beijing/murray-andy-dimitrov-grigor-fTdGYm3q/#cs;2;9"};
    String bm = null;
    String[] odds = new String[2];
    //Second way
    WebClient webClient = new WebClient(BrowserVersion.CHROME);
    System.out.println("Client opened");
    for (int i = 0; i < links.length; i++) {
        HtmlPage page = webClient.getPage(links[i]);
        System.out.println("Page loaded");
        Document csDoc = Jsoup.parse(page.asXml());
        System.out.println("Page parsed");
        Element table = csDoc.select("table.table-main.detail-odds.sortable").first();
        Elements cols = table.select("td:eq(0)");
        if (cols.first().text().trim().contains("bet365.it")) {
            bm = cols.first().text().trim();
            odds[i] = table.select("tbody > tr.lo").select("td.right.odds").first().text().trim();
        }
        else {
            Elements footTable = csDoc.select("table.table-main.detail-odds.sortable");
            Elements footRow = footTable.select("tfoot > tr.aver");
            odds[i] = footRow.select("td.right").text().trim();
            bm = "AVG";
        }
        webClient.close();
    }
    System.out.println(bm + "\t" + odds[0] + "\t" + odds[1]);
}
If I run this code, the results are right. If I move webClient.close(); outside the for loop, the results are not correct: in particular, odds[0] is equal to odds[1].
Think of WebClient as the replacement for your browser. Creating a new WebClient is like starting a new browser.
If you want to do something equivalent to opening a new tab in your browser, you can use WebClient#openWindow(..). From a memory point of view, it is a good idea to close the window when you are done.
If you are looking for performance, why do you re-parse the whole page with Jsoup? HtmlUnit retrieves the page, parses it, creates the whole DOM, and runs the JavaScript on top of this DOM before you get the page back from your getPage call.
Then you use HtmlUnit to serialize the DOM tree back to HTML and use Jsoup to parse the page again.
HtmlUnit offers many ways to search for elements on a page. I suggest using this API directly on the page you got.
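One detail in the question's code may also explain why odds[0] equals odds[1] with a shared client: the two links name the same document and differ only in the part after the #. A browser treats such a change as a jump within the already loaded page rather than a new request. Whether a reused WebClient behaves the same way here is an assumption, not something confirmed in this thread, but it is easy to check that the links really differ only in the fragment. A minimal standard-library sketch (the class and method names are made up for illustration):

```java
import java.net.URI;
import java.util.Objects;

class FragmentOnlyDiff {

    // Returns the URL with everything from '#' onward removed.
    static String withoutFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // True when both URLs name the same document and only the fragment differs.
    static boolean differOnlyInFragment(String a, String b) {
        boolean sameDocument = withoutFragment(a).equals(withoutFragment(b));
        String fragA = URI.create(a).getRawFragment();
        String fragB = URI.create(b).getRawFragment();
        return sameDocument && !Objects.equals(fragA, fragB);
    }

    public static void main(String[] args) {
        // The two links from the question
        String base = "http://www.oddsportal.com/tennis/china/atp-beijing/murray-andy-dimitrov-grigor-fTdGYm3q/";
        String first = base + "#cs;2;6";
        String second = base + "#cs;2;9";

        System.out.println(differOnlyInFragment(first, second)); // prints true
        System.out.println(URI.create(first).getRawFragment());  // prints cs;2;6
        System.out.println(URI.create(second).getRawFragment()); // prints cs;2;9
    }
}
```

If that is indeed the cause, the content selected by the fragment is applied by the page's own scripts, so the second page would need time (or a fresh load) for those scripts to update the table before it is read.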
I have problems getting content by URL. I'm using HtmlUnit to parse an HTML page, but when I run my application, I get the page without the content that is filled in after the JavaScript executes.
The HTML I get lacks the content I need.
Can anyone help me, please?
Example code:
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_38)) {
    webClient.waitForBackgroundJavaScript(30 * 1000);
    final HtmlPage page = webClient.getPage("http://.....some url");
    final String pageAsXml = page.asXml();
    final String pageAsText = page.asText();
}
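One thing that stands out in this snippet: waitForBackgroundJavaScript is called before getPage, so at that moment no page has been loaded and there are most likely no scripts to wait for. Moving the wait to after getPage would be the first thing to try. The effect of the ordering can be sketched without HtmlUnit. FakeClient and its methods below are hypothetical stand-ins for illustration only, not HtmlUnit's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class WaitOrderDemo {

    // Hypothetical stand-in for a browser-like client; not HtmlUnit's API.
    static final class FakeClient {
        private final List<CompletableFuture<Void>> jobs = new ArrayList<>();
        private volatile String content = "";

        // "Loading" a page starts a background job that fills in the content later.
        void load() {
            jobs.add(CompletableFuture.runAsync(() -> {
                try {
                    Thread.sleep(200); // simulated script delay
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                content = "filled by script";
            }));
        }

        // Blocks until every job started so far has finished; returns how many there were.
        int waitForBackgroundJobs() {
            int count = jobs.size();
            jobs.forEach(CompletableFuture::join);
            jobs.clear();
            return count;
        }

        String content() {
            return content;
        }
    }

    // Order used in the question: wait first, then load. Nothing is running yet,
    // so the wait returns immediately and the content is read too early.
    static String waitThenLoad() {
        FakeClient client = new FakeClient();
        client.waitForBackgroundJobs();
        client.load();
        return client.content();
    }

    // Load first, then wait: the background job has finished before the content is read.
    static String loadThenWait() {
        FakeClient client = new FakeClient();
        client.load();
        client.waitForBackgroundJobs();
        return client.content();
    }

    public static void main(String[] args) {
        System.out.println("wait, then load: '" + waitThenLoad() + "'");
        System.out.println("load, then wait: '" + loadThenWait() + "'");
    }
}
```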
I am doing a project in Java.
In this project I have to work with the DOM.
For that, I first load a dynamic page from a given URL using Selenium.
Then I parse it using Jsoup.
I want to get the dynamic page source code of the given URL.
Code snapshot:
public static void main(String[] args) throws IOException {
    // Selenium
    WebDriver driver = new FirefoxDriver();
    driver.get("ANY URL HERE");
    String html_content = driver.getPageSource();
    driver.close();
    // Jsoup makes DOM here by parsing HTML content
    Document doc = Jsoup.parse(html_content);
    // OPERATIONS USING DOM TREE
}
But the problem is that Selenium takes around 95% of the whole processing time, which is undesirable.
Selenium first opens Firefox, then loads the given page, then gets the dynamic page source code.
Can you tell me how I can reduce the time taken by Selenium, for example by replacing it with a more efficient tool? Any other advice would also be welcome.
Edit No. 1
There is some code given on this link.
FirefoxProfile profile = new FirefoxProfile();
profile.setPreference("general.useragent.override", "some UA string");
WebDriver driver = new FirefoxDriver(profile);
But I didn't understand what the second line here does, and Selenium's documentation is also very poor.
Edit No. 2
System.out.printf("Fetching %s...%n", url1);
System.out.printf("Fetching %s...%n", url2);
WebDriver driver = new FirefoxDriver(createFirefoxProfile());
driver.get(url1);
String html1 = driver.getPageSource();
driver.get(url2);
String html2 = driver.getPageSource();
driver.close();
Document doc1 = Jsoup.parse(html1);
Document doc2 = Jsoup.parse(html2);
Try this:
public static void main(String[] args) throws IOException {
    // Selenium
    WebDriver driver = new FirefoxDriver(createFirefoxProfile());
    driver.get("ANY URL HERE");
    String html_content = driver.getPageSource();
    driver.close();
    // Jsoup makes DOM here by parsing HTML content
    // OPERATIONS USING DOM TREE
}

private static FirefoxProfile createFirefoxProfile() {
    File profileDir = new File("/tmp/firefox-profile-dir");
    if (profileDir.exists())
        return new FirefoxProfile(profileDir);
    FirefoxProfile firefoxProfile = new FirefoxProfile();
    File dir = firefoxProfile.layoutOnDisk();
    try {
        profileDir.mkdirs();
        FileUtils.copyDirectory(dir, profileDir);
    } catch (IOException e) {
        e.printStackTrace();
    }
    return firefoxProfile;
}
The createFirefoxProfile() method creates a profile if one doesn't exist and reuses the profile if it already exists, so Selenium doesn't need to create the profile directory structure each and every time.
If you are confident about your code, you can go with PhantomJS. It is a headless browser and will get your results quickly, whereas Firefox takes more time to execute.