I am trying to parsing a website wiht HtmlUnit and Jsoup and i facing this problem.
I have different pages to parse and I stored this links of this pages in a string array.
I want to loop on array's length and parse each page and i proceed in this way.
1) For loop on the length of link's array
2) Opening new webclient
3) Creating new HtmlPage from link with getPage method
4) Parsing and getting some elements
5) Closing webclient
6) go back to 2).
In this way, i'm obtaining what I want, but code it's little bit slow. So i tried to open and close webClient outside the for loop. Like this:
1) Opening new webclient
2) For loop on the length of link's array
3) Creating new HtmlPage from link with getPage method
4) Parsing and getting some elements
5) go back to 2).
6) Closing webclient
It's much more faster but i'm not obtaining same results of previous way.
Is it wrong to use webclient constructor in this way?
EDIT:
Following the code I'm testing:
public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
// TODO Auto-generated method stub
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
String[] links = {"http://www.oddsportal.com/tennis/china/atp-beijing/murray-andy-dimitrov-grigor-fTdGYm3q/#cs;2;6",
"http://www.oddsportal.com/tennis/china/atp-beijing/murray-andy-dimitrov-grigor-fTdGYm3q/#cs;2;9"};
String bm = null;
String[] odds = new String[2];
//Second way
WebClient webClient = new WebClient(BrowserVersion.CHROME);
System.out.println("Client opened");
for (int i=0; i<links.length; i++) {
HtmlPage page = webClient.getPage(links[i]);
System.out.println("Page loaded");
Document csDoc = Jsoup.parse(page.asXml());
System.out.println("Page parsed");
Element table = csDoc.select("table.table-main.detail-odds.sortable").first();
Elements cols = table.select("td:eq(0)");
if (cols.first().text().trim().contains("bet365.it")) {
bm = cols.first().text().trim();
odds[i]=table.select("tbody > tr.lo").select("td.right.odds").first().text().trim();
}
else {
Elements footTable = csDoc.select("table.table-main.detail-odds.sortable");
Elements footRow = footTable.select("tfoot > tr.aver");
odds[i] = footRow.select("td.right").text().trim();
bm = "AVG";
}
webClient.close();
}
System.out.println(bm +"\t" +odds[0] + "\t" + odds[1]);
}
If i run this code results are right. If i move webClient.close(); outside the for loop results are not correct. In particular odds[0] is equal to odds[1];
Think about WebClient as the replacement of your browser. Creating a new WebClient is like starting a new browser.
If you like to do something equal to open a new tab in your browser, you can use WebClient#openWindow(..). And from the memory point of view it is a good idea to close the window if you are done.
If you are looking for performance, why you re-parse the whole page Jsoup. HtmlUnit retrieves the page, parses the page, creates the whole DOM and runs the javascript on top of this dom before your are getting back the page from your getPage call.
Then you are using HtmlUnit to serialize the Dom tree back to Html and use Jsoup to parse the page again.
HtmlUnit offers many ways to search for elements on a page. I'm suggesting to use this API directly on the page you got.
Related
Hi I want to scrap the information from a website so I tried to use Jsoup (also tried HttpClient) to do so. I realize that both of them couldn't "see" certain content of the html page. so when I tried to print out the parsed html, I got the empty div like this. It prints out some other div just fine.
here's my code:
Class Main{
public static void main(String args[]) throws IOException, InterruptedException {
Document doc = Jsoup.connect(url).get();
System.out.println(doc.getElementsByClass("needed content"));
}
}
the result in the terminal is:
<div class="needed content"></div>
I am searching for answers on stackoverflow, some recommends using Jackson Library
Java - How do I access a child of Div using JSoup
some recommend embed a browser in java
Is there a way to embed a browser in Java?
some recommend using htmlunit
Fail to get full content of page with JSoup
I just tried combining Jsoup with html unit, same result here's the code:
try(WebClient wc = new WebClient()){
wc.getOptions().setJavaScriptEnabled(true);
wc.getOptions().setCssEnabled(false);
wc.getOptions().setThrowExceptionOnScriptError(false);
wc.getOptions().setTimeout(10000);
HtmlPage page = wc.getPage("https://chainlinklabs.com/jobs");
String pageXml = page.asXml();
Document doc2 = Jsoup.parse(pageXml, url);
System.out.println(doc2.getElementsByClass("needed content"));
System.out.println("Thank God!");
}
My interpretation of the problem is Jsoup is not showing part of the html content because it contains javascript; am I heading to the right direction?
There is no need (and it is a waste of resources) to re-parse the page from HtmlUnit into jsoup. All the select options are available in HtmlUnit also (see https://htmlunit.sourceforge.io/gettingStarted.html) - and maybe more.
This simple code works for me - parts of the page are generated by an js script that starts asynchronous. Because of this you have to wait for these scripts before accessing the page.
public static void main(String[] args) throws IOException {
String url = "https://chainlinklabs.com/jobs";
try (final WebClient webClient = new WebClient()) {
webClient.getOptions().setThrowExceptionOnScriptError(false);
HtmlPage page = webClient.getPage(url);
webClient.waitForBackgroundJavaScriptStartingBefore(10_000);
// System.out.println("--------------------------------");
// System.out.println(page.asXml());
// System.out.println("--------------------------------");
System.out.println("- Jobs -------------------------");
final DomNodeList<DomNode> jobTitles = page.querySelectorAll(".job-title");
for (DomNode domNode : jobTitles) {
System.out.println(domNode.asNormalizedText());
}
System.out.println("--------------------------------");
}
}
I am trying to run the tutorial on here. The code looks like this:
public class Test {
public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
WebClient client = new WebClient(BrowserVersion.FIREFOX);
HtmlPage page = client.getPage("https://google.com/");
// Getting Form from google home page. tsf is the form name
HtmlForm form = page.getHtmlElementById("tsf"); // Error occurs here
form.getInputByName("q").setValueAttribute("test");
// Creating a virtual submit button
HtmlButton submitButton = (HtmlButton)page.createElement("button");
submitButton.setAttribute("type", "submit");
form.appendChild(submitButton);
// Submitting the form and getting the result
HtmlPage newPage = submitButton.click();
// Getting the result as text
String text = page.asNormalizedText();
System.out.println(text);
}
}
But I am getting error message:
Exception in thread "main" com.gargoylesoftware.htmlunit.ElementNotFoundException: elementName=[*] attributeName=[id] attributeValue=[tsf]
at com.gargoylesoftware.htmlunit.html.HtmlPage.getHtmlElementById(HtmlPage.java:1670)
at Test.main(Test.java:20)
Since this tutorial is relatively old, the ID tsf might be outdated. However, if I check the form name from the google home page, I cant figure it out. Maybe I dont understand the meaning of the whole HtmlForm object. (I am completely new to this topic)
There is no element with ID tsf anymore. Best way to check it out is to go to the site and use Web Developer Tools of your browser (f12) mostly on every browsers. You can see the whole HTML document from there.
I'm trying to change reorder pages of my PDF document, but i can't and I don't know why.
I read several articals about changing order, it's java(iText) and i have got few problems with it.(exampl1, exampl2, example3). This example on c#, but there is using other method(exampl4)
I want take my TOC on 12 page and put to 2 page. After 12 page I have other content. This is my template for change order of pages:
String.Format("1,%s, 2-%s, %s-%s", toc, toc-1, toc+1, n)
This is my method for changing order of pages:
public void ChangePageOrder(string path)
{
MemoryStream baos = new MemoryStream();
PdfReader sourcePDFReader = new PdfReader(path);
int toc = 12;
int n = sourcePDFReader.NumberOfPages;
sourcePDFReader.SelectPages(String.Format("1,%s, 2-%s, %s-%s", toc, toc-1, toc+1, n));
using (var fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
{
PdfStamper stamper = new PdfStamper(sourcePDFReader, fs);
stamper.Close();
}
}
Here is call to method:
...
doc.Close();
ChangePageOrder(filePath);
What am I doing not right?
Thank you.
Your code can't work because you are using path to create the PdfReader as well as to create the FileStream. You probably get an error such as "The file is in use" or "The file can't be accessed".
This is explained here:
StackOverflow: How to update a PDF without creating a new PDF?
Official web site:
How to update a PDF without creating a new PDF?
You create a MemoryStream() named baos, but you aren't using that object anywhere. One way to solve your problem, is to replace the FileStream when you first create your PDF by that MemoryStream, and then use the bytes stored in that memory stream to create a PdfReader instance. In that case, PdfStamper won't be writing to a file that is in use.
Another option would be to use a different path. For instance: first you write the document to a file named my_story_unordered.pdf (created by PdfWriter), then you write the document to a file named my_story_reordered.pdf (created by PdfStamper).
It's also possible to create the final document in one go. In that case, you need to switch to linear mode. There's an example in my book "iText in Action - Second Edition" that shows how to do this: MovieHistory1
In the C# port of this example, you have:
writer.SetLinearPageMode();
In normal circumstances, iText will create a page tree with branches and leaves. As soon a a branch has more than 10 leaves, a new branch is created. With setLinearPageMode(), you tell iText not to do this. The complete page tree will consist of one branch with nothing but leaves (no extra branches). This is bad from the point of view of performance when viewing the document, but it's acceptable if the number of pages in your document is limited.
Once you've switched to page mode, you can reorder the pages like this:
document.NewPage();
// get the total number of pages that needs to be reordered
int total = writer.ReorderPages(null);
// change the order
int[] order = new int[total];
for (int i = 0; i < total; i++) {
order[i] = i + toc;
if (order[i] > total) {
order[i] -= total;
}
}
// apply the new order
writer.ReorderPages(order);
Summarized: if your document doesn't have many pages, use the ReorderPages method. If your document has many pages, use the method you've been experimenting with, but do it correctly. Don't try to write to the file that you are still trying to read.
Without going into details about what you should do you can loop through all pages from a pdf, put them into a new pdf doc with all the pages. You can put your logic inside the for loop.
reader = new PdfReader(sourcePDFpath);
sourceDocument = new Document(reader.GetPageSizeWithRotation(startpage));
pdfCopyProvider = new PdfCopy(sourceDocument, new System.IO.FileStream(outputPDFpath, System.IO.FileMode.Create));
sourceDocument.Open();
for (int i = startpage; i <= endpage; i++)
{
importedPage = pdfCopyProvider.GetImportedPage(reader, i);
pdfCopyProvider.AddPage(importedPage);
}
sourceDocument.Close();
reader.Close();
I need to download a source code of this webpage: https://app.zonky.cz/#/marketplace/ so I could have the code checking if there is a new loan available. Unfortunate for me, the web page uses a loading spinner for the time the page is being loaded in the background. When I try to download the page's source using:
String url = "https://app.zonky.cz/#/marketplace/";
StringBuilder text = new StringBuilder();
try
{
URL pageURL = new URL(url);
Scanner scanner = new Scanner(pageURL.openStream(), "utf-8");
try {
while (scanner.hasNextLine()){
text.append(scanner.nextLine() + "\n");
}
}
finally{
scanner.close();
}
}
catch(Exception ex)
{
//
}
System.out.println(text.toString());
I get the page's source from the moment the spinner is being shown. Do you know of a better approach?
Solution:
public static String getSource() {
WebDriver driver = new FirefoxDriver();
driver.get("https://app.zonky.cz/#/marketplace/");
String output = driver.getPageSource();
driver.close();
return output;
}
You could always wait until the page has finished loading by checking if an element exists(one that is only loaded after the spinner disappears)
Also have you looked into using selenium ? it can be really useful for interacting with websites and handling tricky procedures such as waiting for elements :P
Edit: a pretty simple tutorial for Selenium waiting can be found here - http://docs.seleniumhq.org/docs/04_webdriver_advanced.jsp#explicit-and-implicit-waits
I am trying to verify if the image is present on the webpage or not. Can you suggest me the most feasible code. I am giving the necessary details below.
(Here I am referring to the main product image on left top side in green)
Page URL : http://www.marksandspencer.com/ditsy-floral-tunic/p/p60072079
Also I can send the screenshot if you want. Please send me the email id.
Please suggest at the earliest as I am in need of it.
try to check size of found elements by xpath:
".//*[#class='s7staticimage']/img"
if width is more than 10px -- picture will be shown =)
Here are some of the ways with which you can verify if the image is present,
Checking if the img src contains the file name
Checking if the img webelement size if greater than your desired size
This one is a little outside the scope of webdriver, you need to get the src attribute from the img webelement, then make a GET request to the src to see if you get a 200 OK.
These above verifications will only help ensure that there is an image present on the page, but you cannot verify if the image you want is being displayed unless you do image comparison.
If you want to do image comparisons then take a look at https://github.com/facebookarchive/huxley
Here is the code to verify all images in a webpage - selenium webdriver, TestNG, Java.
public void testAllImages() {
// test webpage - www.yahoo.com
wd.get("https://www.yahoo.com");
//Find total No of images on page and print In console.
List<WebElement> total_images = wd.findElements(By.tagName("img"));
System.out.println("Total Number of images found on page = " + total_images.size());
//for loop to open all images one by one to check response code.
boolean isValid = false;
for (int i = 0; i < total_images.size(); i++) {
String url = total_images.get(i).getAttribute("src");
if (url != null) {
//Call getResponseCode function for each URL to check response code.
isValid = getResponseCode(url);
//Print message based on value of isValid which Is returned by getResponseCode function.
if (isValid) {
System.out.println("Valid image:" + url);
System.out.println("----------XXXX-----------XXXX----------XXXX-----------XXXX----------");
System.out.println();
} else {
System.out.println("Broken image ------> " + url);
System.out.println("----------XXXX-----------XXXX----------XXXX-----------XXXX----------");
System.out.println();
}
} else {
//If <a> tag do not contain href attribute and value then print this message
System.out.println("String null");
System.out.println("----------XXXX-----------XXXX----------XXXX-----------XXXX----------");
System.out.println();
continue;
}
}
}