I'm trying to write my own Crawljax 3.6 plugin in Java. It should tell Crawljax, which is a well-known web crawler, to also download the files it finds on web pages (PDF, images, and so on). I don't want only the HTML or the actual DOM tree; I would like to get access to the files (PDF, JPG) it finds.
How can I tell crawljax to download PDF files, images and so on?
Thanks for any help!
This is what I have so far - a new class using the default plugin (CrawlOverview):
import java.io.File;
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import org.apache.commons.io.FileUtils;
import com.crawljax.browser.EmbeddedBrowser.BrowserType;
import com.crawljax.condition.NotXPathCondition;
import com.crawljax.core.CrawlSession;
import com.crawljax.core.CrawljaxRunner;
import com.crawljax.core.configuration.BrowserConfiguration;
import com.crawljax.core.configuration.CrawljaxConfiguration;
import com.crawljax.core.configuration.CrawljaxConfiguration.CrawljaxConfigurationBuilder;
import com.crawljax.core.configuration.Form;
import com.crawljax.core.configuration.InputSpecification;
import com.crawljax.plugins.crawloverview.CrawlOverview;
/**
* Example of running Crawljax with the CrawlOverview plugin on a single-page
* web app. The crawl will produce output using the {@link CrawlOverview}
* plugin.
*/
public final class Main {
private static final long WAIT_TIME_AFTER_EVENT = 200;
private static final long WAIT_TIME_AFTER_RELOAD = 20;
private static final String URL = "http://demo.crawljax.com";
/**
* Run this method to start the crawl.
*
* @throws IOException
* when the output folder cannot be created or emptied.
*/
public static void main(String[] args) throws IOException {
CrawljaxConfigurationBuilder builder = CrawljaxConfiguration
.builderFor(URL);
builder.addPlugin(new CrawlOverview());
builder.crawlRules().insertRandomDataInInputForms(false);
// click these elements
builder.crawlRules().clickDefaultElements();
builder.crawlRules().click("div");
builder.crawlRules().click("a");
builder.setMaximumStates(10);
builder.setMaximumDepth(3);
// Set timeouts
builder.crawlRules().waitAfterReloadUrl(WAIT_TIME_AFTER_RELOAD,
TimeUnit.MILLISECONDS);
builder.crawlRules().waitAfterEvent(WAIT_TIME_AFTER_EVENT,
TimeUnit.MILLISECONDS);
// Use a single Firefox browser.
builder.setBrowserConfig(new BrowserConfiguration(BrowserType.FIREFOX,
1));
CrawljaxRunner crawljax = new CrawljaxRunner(builder.build());
crawljax.call();
}
}
As far as images are concerned, I don't see any problem - Crawljax loads these just fine for me.
On the PDF topic:
Unfortunately Crawljax is hardcoded to skip links to PDF files.
See com.crawljax.core.CandidateElementExtractor:342:
/**
* @param href
* the string to check
* @return true if href has the pdf or ps pattern.
*/
private boolean isFileForDownloading(String href) {
final Pattern p = Pattern.compile(".+.pdf|.+.ps|.+.zip|.+.mp3");
Matcher m = p.matcher(href);
if (m.matches()) {
return true;
}
return false;
}
This could be solved by modifying the Crawljax source and introducing a configuration option for the pattern above.
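A hedged sketch of what that patch might look like (getDownloadFilePattern() and crawlRules are hypothetical stand-ins for whatever configuration option you introduce; they are not existing Crawljax 3.6 API):
// Hypothetical patch inside CandidateElementExtractor: read the pattern from the
// crawl configuration instead of hardcoding it. getDownloadFilePattern() is the
// new option you would add; crawlRules stands for the injected configuration.
private boolean isFileForDownloading(String href) {
    Pattern p = Pattern.compile(crawlRules.getDownloadFilePattern());
    return p.matcher(href).matches();
}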
After that, Selenium's limitations regarding non-HTML files apply: a PDF is either shown in Firefox's JavaScript PDF viewer, a download pop-up appears, or the file is downloaded. It is somewhat possible to interact with the JavaScript viewer; it is not possible to interact with the download pop-up, but if automatic downloading is enabled the file is saved to disk.
If you would like Firefox to download files automatically without showing a download dialog:
import javax.inject.Provider;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxProfile;
import com.crawljax.browser.EmbeddedBrowser;
import com.crawljax.browser.WebDriverBackedEmbeddedBrowser;
static class MyFirefoxProvider implements Provider<EmbeddedBrowser> {
@Override
public EmbeddedBrowser get() {
FirefoxProfile profile = new FirefoxProfile();
profile.setPreference("browser.download.folderList", 2);
profile.setPreference("browser.download.dir", "/tmp");
profile.setPreference("browser.helperApps.neverAsk.saveToDisk",
"application/octet-stream,application/pdf,application/x-gzip");
// disable Firefox's built-in PDF viewer
profile.setPreference("pdfjs.disabled", true);
// disable Adobe Acrobat PDF preview plugin
profile.setPreference("plugin.scan.plid.all", false);
profile.setPreference("plugin.scan.Acrobat", "99.0");
FirefoxDriver driver = new FirefoxDriver(profile);
return WebDriverBackedEmbeddedBrowser.withDriver(driver);
}
}
And use the newly created MyFirefoxProvider:
BrowserConfiguration bc =
new BrowserConfiguration(BrowserType.FIREFOX, 1, new MyFirefoxProvider());
Obtain the links manually using Jsoup by applying the CSS selector a[href] to getStrippedDom(), iterate through the elements, and use an HttpURLConnection / HttpsURLConnection to download them.
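A rough sketch of that approach (assumptions: strippedDom is the HTML string from getStrippedDom(), pageUrl is the URL of the crawled state, and /tmp/downloads is an arbitrary target directory):
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
// Parse the stripped DOM, collect file links and download each one over HTTP(S).
static void downloadLinkedFiles(String strippedDom, String pageUrl) throws IOException {
    Document doc = Jsoup.parse(strippedDom, pageUrl);
    for (Element link : doc.select("a[href]")) {
        String fileUrl = link.absUrl("href");
        if (!fileUrl.matches(".+\\.(pdf|ps|zip|mp3)")) {
            continue; // only fetch the file types we care about
        }
        HttpURLConnection connection = (HttpURLConnection) new URL(fileUrl).openConnection();
        try (InputStream in = connection.getInputStream()) {
            String fileName = fileUrl.substring(fileUrl.lastIndexOf('/') + 1);
            Files.copy(in, Paths.get("/tmp/downloads", fileName), StandardCopyOption.REPLACE_EXISTING);
        } finally {
            connection.disconnect();
        }
    }
}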
I have used iText PdfRender, which converts a non-OCR PDF to an image, after which I used iText PdfOcr to convert that image to an OCR'd PDF. Is there a tool that lets me perform this process in one step?
It would also be helpful if there were some documentation on how to process multi-page PDFs using PdfRender, which I can't seem to find. The following is the code I used to convert one image to an OCR'd PDF document.
import com.itextpdf.pdfocr.OcrPdfCreator;
import com.itextpdf.pdfocr.tesseract4.Tesseract4LibOcrEngine;
import com.itextpdf.pdfocr.tesseract4.Tesseract4OcrEngineProperties;
import com.itextpdf.kernel.pdf.PdfWriter;
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
public class img2pdfocr {
static final Tesseract4OcrEngineProperties tesseract4OcrEngineProperties = new Tesseract4OcrEngineProperties();
private static final List<File> LIST_IMAGES_OCR = Arrays.asList(new File("image1.jpg"));
private static String OUTPUT_PDF = "F:\\ITEXT_workspace\\jumpstart\\bizdoc.pdf";
public static void main(String[] args) throws IOException {
final Tesseract4LibOcrEngine tesseractReader = new Tesseract4LibOcrEngine(tesseract4OcrEngineProperties);
tesseract4OcrEngineProperties.setPathToTessData(new File("F:\\ITEXT_workspace\\jumpstart\\TESS_DATA_FOLDER"));
OcrPdfCreator ocrPdfCreator = new OcrPdfCreator(tesseractReader);
try (PdfWriter writer = new PdfWriter(OUTPUT_PDF)) {
ocrPdfCreator.createPdf(LIST_IMAGES_OCR, writer).close();
}
}
}
EDIT
As pointed out in the comments, I need not use PdfRender; iText core itself can be used to extract images from a PDF. Use this for the code; you can also check the documentation.
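For what it's worth, a rough sketch of that image-extraction approach with the iText 7 kernel parser (listener-based; treat this as an untested outline and check the iText documentation for the exact API):
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.EventType;
import com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor;
import com.itextpdf.kernel.pdf.canvas.parser.data.IEventData;
import com.itextpdf.kernel.pdf.canvas.parser.data.ImageRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.listener.IEventListener;
import com.itextpdf.kernel.pdf.xobject.PdfImageXObject;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.Set;
public class ExtractImages {
    public static void main(String[] args) throws IOException {
        try (PdfDocument pdf = new PdfDocument(new PdfReader("input.pdf"))) {
            // Listener that saves every image drawn on a page to disk.
            IEventListener listener = new IEventListener() {
                private int counter = 0;
                @Override
                public void eventOccurred(IEventData data, EventType type) {
                    PdfImageXObject image = ((ImageRenderInfo) data).getImage();
                    try {
                        Files.write(Paths.get("image-" + (counter++) + "." + image.identifyImageFileExtension()),
                                image.getImageBytes());
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                }
                @Override
                public Set<EventType> getSupportedEvents() {
                    return Collections.singleton(EventType.RENDER_IMAGE);
                }
            };
            PdfCanvasProcessor processor = new PdfCanvasProcessor(listener);
            for (int i = 1; i <= pdf.getNumberOfPages(); i++) {
                processor.processPageContent(pdf.getPage(i));
            }
        }
    }
}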
I need to extract the CoinMarketCap volume (e.g. Market Cap: $306,020,249,332) from the top of the page with Java; please see the attached picture.
I have used the jsoup library in Java (Eclipse) but could not extract the volume; jsoup only extracts other attributes. The problem probably comes from a JavaScript library.
I have also tried HtmlUnit without success:
import java.io.IOException;
import java.util.List;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class Testss {
public static void main(String[] args) throws IOException {
String url = "https://coinmarketcap.com/faq/";
WebClient client = new WebClient();
HtmlPage page = client.getPage(url);
List<?> anchors = page.getByXPath("//div[@class='col-sm-6 text-center']//a");
for (Object obj : anchors) {
HtmlAnchor a = (HtmlAnchor) obj;
System.out.println(a.getTextContent().trim());
}
}
}
How can I extract this volume from the site with Java?
Thanks!
Check the network tab and find out the exact request that fetches the data. In your case it is https://files.coinmarketcap.com/generated/stats/global.json (that is the request URL shown in the network tab).
So fetching the main URL will not give you what you require. You have to fetch the data from that request URL directly and parse it with any JSON library; json-simple is one I can suggest (a minimal parsing sketch follows the sample below).
This is the JSON data you will get after hitting the URL:
{
"bitcoin_percentage_of_market_cap": 55.95083004655126,
"active_cryptocurrencies": 1324,
"total_volume_usd": 21503093761,
"active_markets": 7009,
"total_market_cap_by_available_supply_usd": 301100436864
}
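A minimal sketch of fetching and parsing that response with json-simple (the key names come from the sample above):
import java.io.InputStreamReader;
import java.net.URL;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
public class GlobalStats {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://files.coinmarketcap.com/generated/stats/global.json");
        try (InputStreamReader reader = new InputStreamReader(url.openStream(), "UTF-8")) {
            // Parse the JSON document and read the fields shown in the sample response.
            JSONObject stats = (JSONObject) new JSONParser().parse(reader);
            System.out.println("Total market cap (USD): " + stats.get("total_market_cap_by_available_supply_usd"));
            System.out.println("Total volume (USD): " + stats.get("total_volume_usd"));
        }
    }
}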
How can I get a button that has no name, no ID and no type like button?
This is the HTML code I am trying to handle:
<a class="btnv6_blue_hoverfade btn_small" href="#"
onclick="DoAgeGateSubmit(); return false;">
<span>Fortfahren</span>
</a>
And this is the code I have at the moment:
package htmlParser;
import java.io.IOException;
import java.net.URL;
import org.jsoup.nodes.Element;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.RefreshHandler;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlButton;
import com.gargoylesoftware.htmlunit.html.HtmlDivision;
import com.gargoylesoftware.htmlunit.html.HtmlButtonInput;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlImage;
import com.gargoylesoftware.htmlunit.html.HtmlInput;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSelect;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;
public class HitTheDamnButton
{
public static void main(String[] args) throws Exception
{
String url = "http://store.steampowered.com/agecheck/app/72850/? snr=1_7_7_230_150_2";
WebClient webClient = new WebClient();
HtmlPage startPage = webClient.getPage(url);
HtmlForm form = (HtmlForm) startPage.getElementById("agecheck_form");
HtmlSelect dropDown1 = form.getSelectByName("ageDay");
HtmlSelect dropDown2 = form.getSelectByName("ageMonth");
HtmlSelect dropDown3 = form.getSelectByName("ageYear");
dropDown1.setSelectedAttribute("2", true);
dropDown2.setSelectedAttribute("February", true);
dropDown3.setSelectedAttribute("1970", true);
webClient.close();
}
}
How can I get this button clicked? I tried everything:
HTMLButton button = form.getButtonByName("a.btnv6_blue_hoverfade.btn_small");
... form.hasAttribute(), ... getSelectByName("name");
But nothing worked.
Thanks for any help in advance!
What you are looking for is an anchor, not a button.
Try something like startPage.getAnchorByText or startPage.getAnchors and then iterate and compare the class and/or text to get the right one.
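For example, a minimal sketch based on the code in the question (the class value comes from the HTML you posted; add the com.gargoylesoftware.htmlunit.html.HtmlAnchor import):
// Iterate over all anchors and click the one whose class attribute matches the age-gate button.
HtmlPage nextPage = null;
for (HtmlAnchor anchor : startPage.getAnchors()) {
    if (anchor.getAttribute("class").contains("btnv6_blue_hoverfade")) {
        nextPage = anchor.click(); // click() returns the page resulting from the click
        break;
    }
}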
OK, the advice to search for an anchor led to some results.
For testing purposes I switched to another site where I just have to click a button (no form has to be filled in; I first wanted to solve the simple click-on-the-button problem). I chose this site:
http://store.steampowered.com/app/324800/?snr=1_7_...
and it leads to the age check of "Shadow Warrior 2".
The button in question in the HTML code is:
<a class="btn_grey_white_innerfade btn_medium" href="#" onclick="HideAgeGate( 324800 )"><span>Weiter</span></a>
Now I managed to identify the button and click on it. But I'm not sure what I actually clicked in the end, because I wasn't redirected to the page behind the age check but to "Shadow Warrior Classics"...
The new URL I was directed to is:
http://store.steampowered.com/widget/238070/?dynamiclink=1
I don't get it.
Here is my program code:
List<HtmlAnchor> anchor = startPage.getAnchors();
// for(HtmlAnchor out : anchor)
// {
// System.out.println(out);
// }
HtmlAnchor anchorButton = anchor.get(143);
System.out.println(anchor.get(143));
// anchorButton.dblClick();
anchorButton.click();
document = Jsoup.connect(anchorButton.click().getUrl().toString()).timeout(0).get();
currentLink = startPage.getBaseURL();
url = currentLink.toString();
document = Jsoup.connect(url).timeout(0).get();
Element parentNode = document.getElementById("app_reviews_hash");
Elements childNodes = parentNode.getElementsByClass("user_reviews_filter_section");
for(Element out2 : childNodes)
{
String all = out2.getElementsByClass("user_reviews_count").text();
String steamPurchasers = out2.getElementsByClass("user_reviews_count").text();
System.out.println(all);
}
System.out.println(anchor.get(143));
shows the right button:
HtmlAnchor[<a class="btn_grey_white_innerfade btn_medium" href="#" onclick="HideAgeGate( 324800 )">]
but after I click on it (via anchorButton.click();) I am not redirected to the right site. The age check is still active...
And I still get a NullPointerException at this line:
Elements childNodes = parentNode.getElementsByClass("user_reviews_filter_section");
because on the wrongly linked page there is no such element for
Element parentNode = document.getElementById("app_reviews_hash");
so parentNode remains null.
What have I done wrong?
OK, I solved the problem. In short: I switched to Selenium WebDriver (for the Java code) and Selenium IDE (Firefox plugin).
________ Detailed description (step by step):
1. Install Selenium IDE for the Firefox browser:
Go to https://addons.mozilla.org/en-US/firefox/addon/selenium-ide/ (I cannot post clickable links because of my low reputation)
and click on the "+ Add to Firefox" button. After restarting Firefox, the
installation is done.
ATTENTION: Some errors may occur at this point (the "Selenium IDE" entry does not appear in the Firefox menu). If so,
try to install Selenium IDE via Firefox -> Add-ons -> Plugins: search for
Selenium and select:
"Selenium IDE 2.9.1.1-signed"
"Highlight Elements (Selenium IDE)"
"Selenium IDE Button 1.2.0.1-signed.1-signed"
Navigate in the Firefox menu to:
Tools -> Web Developer -> Get More Tools
(I don't know the exact English term because I'm using the German version of Firefox: Web-Entwickler -> Weitere Tools laden),
search for Selenium and choose:
"SeleniumX"
After the installation, the "Selenium IDE" entry appears in the Firefox menu under Tools -> Selenium IDE (German: Extras).
2. Install Selenium WebDriver for Eclipse / Dynamic Web Projects:
Go to http://www.seleniumhq.org/download/#selenium_ide (again, no clickable link because of my low reputation)
and download (first section on the site): Selenium Standalone Server
=> version 3.0.1 (date: 11.5.16 [month-day-year])
After downloading the .jar file, copy it in Eclipse into the following folder:
NameOfProject\WebContent\WEB-INF\lib
Note: you could also import it via Build Path -> Configure Build Path, but I prefer this faster way.
Note: To create a new "Dynamic Web Project" you have to install some new
software in Eclipse: Help -> Install New Software. In the first line
("Work with") choose the Luna update site (link omitted because of my low reputation; adjust it to your Eclipse version!).
WAIT until "Pending..." is done and then choose (last entry):
"Web, XML, Java EE and OSGi Enterprise Development"
3. Use Selenium IDE to identify web elements in the HTML code by creating "test cases" and exporting them as Java code to Eclipse:
Detailed tutorial: http://docs.seleniumhq.org/docs/02_selenium_ide.jsp
3.1. Open the Firefox browser and go to the website you want to inspect / crawl / parse. Then (after the page has loaded) open Selenium IDE (Tools -> Selenium IDE). Make sure the red button (it looks like the record button in some video tools)
at the right-most position of the menu bar (above the "Table / Source" tabs) is
activated (you can read a message on mouse-over). While recording, each
click on the website you want to inspect automatically creates an entry
in the "Table" tab (a sort of simple script command). Try to execute as
many actions as you can / need on the website you want to crawl, because
each action gives you the element in the HTML code and helps you later to
identify it from Java code!
3.2. After finishing your "inspection" with simple mouse clicks, you have
to save the test case you just created:
File -> Save Test Case: choose a name and confirm the save process.
Note: the default store location for your test cases is the "Mozilla
Firefox" folder on your PC (common path: C:\Programs\Mozilla Firefox).
3.3. Export the test case as Java code to Eclipse:
This is the most awesome feature of Selenium IDE!
Now, after saving your test case, go again in Selenium IDE to:
File -> Export Test Case As:
choose Java / JUnit 4 / WebDriver: again a file chooser opens (default: the
Firefox folder) and now you can save this export as a Java file.
IMPORTANT: the file ending has to be ".java" (e.g. "IHateLowReputation.java").
Then copy / import it into your Eclipse project. Now you can open this
.java file and inspect the generated Java code for the right web elements
you want to find / choose / manipulate.
You can use this to get a feeling for how Selenium WebDriver commands
have to be coded in Java. Copy the required code lines to your class.
_____________ And here is my solution code for the problem above:
package fixWrongEntries;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.concurrent.TimeUnit;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.By;
import org.openqa.selenium.NoSuchElementException;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.support.ui.Select;
import com.gargoylesoftware.htmlunit.ScriptResult;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlButton;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSelect;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;
import data.DB_Steam_Spiele;
import data.Spiel;
public class SolveButtonClick_FormSubmitt
{
public static void main(String[] args)
{
String agecheckButton = "Content in this product may not be appropriate for all ages, or may not be appropriate for viewing at work.";
String agecheckKonkret = "Please enter your birth date to continue:";
String noReviews = "There are no reviews for this product";
try
{
// turn off annoying htmlunit warnings
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
// Enabling JavaScript => true in brackets
HtmlUnitDriver driver = new HtmlUnitDriver(true);
// Link for agecheck Typ 1 (simply Button click)
String url = "http://store.steampowered.com/app/324800/?snr=1_7_...";
// Link for agecheck Typ 2 (fill out formular and submitt)
//String url = "http://store.steampowered.com/agecheck/app/72850/";
driver.get(url);
// System.out.println(driver.findElement(By.cssSelector("h2")).getText());
System.out.println(driver.getCurrentUrl());
/*********************************************************************
*
* Agecheck Typ 2
*
*********************************************************************/
if(driver.findElement(By.cssSelector("h2")).getText().equals(agecheckKonkret))
{
System.out.println("Achtung: Agecheck konkret!");
// Fill out form with age-specifications:
new Select(driver.findElement(By.name("ageDay"))).selectByVisibleText("18");
driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
new Select(driver.findElement(By.name("ageMonth"))).selectByVisibleText("April");
driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
new Select(driver.findElement(By.id("ageYear"))).selectByVisibleText("1970");
driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
// Click AgeCheck Formular Button: Fortfahren
driver.findElement(By.cssSelector("a.btnv6_blue_hoverfade.btn_small > span")).click();
if(driver.findElement(By.id("app_reviews_hash")).getText().contains(noReviews))
{
System.out.println("Keine Reviews vorhanden!");
return; // no reviews for this page, nothing more to do
}
else if(!driver.findElement(By.id("app_reviews_hash")).getText().contains(noReviews))
{
String all = driver.findElement(By.xpath("//div[@id='app_reviews_hash']/div[3]/div[2]/label"))
.getText();
String steamPurchaser = driver.findElement(By
.xpath("//div[#id='app_reviews_hash']/div[3]/div[2]/label[2]")).getText();
String communityURL = driver.findElement(By.cssSelector("a.btnv6_blue_hoverfade.btn_medium"))
.getAttribute("href");
}
}
/*********************************************************************
*
* Agecheck Type 1
*
*********************************************************************/
else if(driver.findElement(By.cssSelector("h2")).getText().equals(agecheckButton))
{
System.out.println("Achtung: Agecheck Button!");
driver.findElement(By.cssSelector("a.btn_grey_white_innerfade.btn_medium > span")).click();
if(driver.findElement(By.id("app_reviews_hash")).getText().contains(noReviews))
{
System.out.println("Keine Reviews vorhanden!");
return; // no reviews for this page, nothing more to do
}
else if(!driver.findElement(By.id("app_reviews_hash")).getText().contains(noReviews))
{
String all = driver.findElement(By.xpath("//div[@id='app_reviews_hash']/div[3]/div[2]/label"))
.getText();
String steamPurchaser = driver.findElement(By
.xpath("//div[#id='app_reviews_hash']/div[3]/div[2]/label[2]")).getText();
String communityURL = driver.findElement(By.cssSelector("a.btnv6_blue_hoverfade.btn_medium"))
.getAttribute("href");
}
}
/*********************************************************************
*
* No Agecheck
*
*********************************************************************/
else
{
if(driver.findElement(By.id("app_reviews_hash")).getText().contains(noReviews))
{
System.out.println("Keine Reviews vorhanden!");
return; // no reviews for this page, nothing more to do
}
else if(!driver.findElement(By.id("app_reviews_hash")).getText().contains(noReviews))
{
String all = driver.findElement(By.xpath("//div[@id='app_reviews_hash']/div[3]/div[2]/label"))
.getText();
String steamPurchaser = driver.findElement(By
.xpath("//div[#id='app_reviews_hash']/div[3]/div[2]/label[2]")).getText();
String communityURL = driver.findElement(By.cssSelector("a.btnv6_blue_hoverfade.btn_medium"))
.getAttribute("href");
}
}
}
catch(Throwable t)
{
System.out.println("Fehlermeldung aufgefangen");
t.printStackTrace();
}
}
private static boolean isElementPresent(WebDriver driver, By by)
{
try
{
driver.findElement(by);
return true;
}
catch(NoSuchElementException e)
{
return false;
}
}
}
I hope this will help people with a similar problem.
I have a two-page application:
/login
/profile
I want to get a .har file for the page /profile.
When I go to the page /login, a cookie is created with key=connect.sid and value="example value". This cookie is not yet active.
I added the cookie with an active connect.sid:
WebDriver webDriver = getDriver();
webDriver.get(LOGIN_PAGE);
webDriver.manage().addCookie(connectsSId);
It does not work because after the page loads, /login creates new cookies.
I also tried this code:
WebDriver webDriver = getDriver();
webDriver.get(PROFILE_PAGE);
webDriver.manage().deleteAllCookies();
webDriver.manage().addCookie(connectsSId);
and this does not work either. The cookie was added, but it seems it was too late.
WebDriver webDriver = getDriver();
LoginPage loginPage = new LoginPage(getDriver());
LandingPage landingPage = loginPage.login();
landingPage.openProfilePage();
This code created a .har file for the page /login.
For some reason the file is created only after the first call to the page. I cannot solve this problem.
Use PhantomJS with BrowserMob Proxy. PhantomJS helps with JavaScript-enabled pages. The following code works for HTTPS web addresses, too.
Place 'phantomjs.exe' in the C drive and you will get the 'HAR-Information.har' file in the C drive itself.
Make sure you DO NOT put a '/' at the end of the URL, like
driver.get("https://www.google.co.in/")
It should be
driver.get("https://www.google.co.in");
Otherwise, it won't work.
package makemyhar;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import net.lightbody.bmp.BrowserMobProxy;
import net.lightbody.bmp.BrowserMobProxyServer;
import net.lightbody.bmp.core.har.Har;
import net.lightbody.bmp.proxy.CaptureType;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriverService;
import org.openqa.selenium.remote.CapabilityType;
import org.openqa.selenium.remote.DesiredCapabilities;
public class MakeMyHAR {
public static void main(String[] args) throws IOException, InterruptedException {
//BrowserMobProxy
BrowserMobProxy server = new BrowserMobProxyServer();
server.start(0);
server.setHarCaptureTypes(CaptureType.getAllContentCaptureTypes());
server.enableHarCaptureTypes(CaptureType.REQUEST_CONTENT, CaptureType.RESPONSE_CONTENT);
server.newHar("Google");
//PHANTOMJS_CLI_ARGS
ArrayList<String> cliArgsCap = new ArrayList<>();
cliArgsCap.add("--proxy=localhost:"+server.getPort());
cliArgsCap.add("--ignore-ssl-errors=yes");
//DesiredCapabilities
DesiredCapabilities capabilities = new DesiredCapabilities();
capabilities.setCapability(CapabilityType.ACCEPT_SSL_CERTS, true);
capabilities.setCapability(CapabilityType.SUPPORTS_JAVASCRIPT, true);
capabilities.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS, cliArgsCap);
capabilities.setCapability(PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY,"C:\\phantomjs.exe");
//WebDriver
WebDriver driver = new PhantomJSDriver(capabilities);
driver.get("https://www.google.co.in");
//HAR
Har har = server.getHar();
FileOutputStream fos = new FileOutputStream("C:\\HAR-Information.har");
har.writeTo(fos);
server.stop();
driver.close();
}
}
Set preferences in your Selenium code:
profile.setPreference("devtools.netmonitor.har.enableAutoExportToFile", true);
profile.setPreference("devtools.netmonitor.har.defaultLogDir", String.valueOf(dir));
profile.setPreference("devtools.netmonitor.har.defaultFileName", "network-log-file-%Y-%m-%d-%H-%M-%S");
and open the developer tools network monitor:
Actions keyAction = new Actions(driver);
keyAction.keyDown(Keys.LEFT_CONTROL).keyDown(Keys.LEFT_SHIFT).sendKeys("q").keyUp(Keys.LEFT_CONTROL).keyUp(Keys.LEFT_SHIFT).perform();
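For completeness, a sketch of how a profile with these preferences might be attached to the driver (FirefoxProfile, FirefoxOptions and FirefoxDriver are from org.openqa.selenium.firefox; the log directory is just an example value):
// Build a Firefox profile with the HAR auto-export preferences and start the driver with it.
FirefoxProfile profile = new FirefoxProfile();
profile.setPreference("devtools.netmonitor.har.enableAutoExportToFile", true);
profile.setPreference("devtools.netmonitor.har.defaultLogDir", "/tmp/har"); // example directory
profile.setPreference("devtools.netmonitor.har.defaultFileName", "network-log-file-%Y-%m-%d-%H-%M-%S");
FirefoxOptions options = new FirefoxOptions();
options.setProfile(profile);
WebDriver driver = new FirefoxDriver(options);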
You can use browsermob proxy to capture all the request and response data
See here
I have also tried to get the HAR file using a proxy like BrowserMob Proxy.
I did a lot of research because the file I received was always empty.
What I did instead was enable the browser performance log.
Note that this works only with the Chrome driver.
This is my driver class (in Python):
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium import webdriver
from lib.config import config
class Driver:
global performance_log
capabilities = DesiredCapabilities.CHROME
capabilities['loggingPrefs'] = {'performance': 'ALL'}
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument("--headless")
mobile_emulation = {"deviceName": "Nexus 5"}
if config.Env().is_mobile():
chrome_options.add_experimental_option(
"mobileEmulation", mobile_emulation)
else:
pass
chrome_options.add_experimental_option(
'perfLoggingPrefs', {"enablePage": True})
def __init__(self):
self.instance = webdriver.Chrome(
executable_path='/usr/local/bin/chromedriver', options=self.chrome_options)
def navigate(self, url):
if isinstance(url, str):
self.instance.get(url)
self.performance_log = self.instance.get_log('performance')
else:
raise TypeError("URL must be a string.")
The amount of information found in the output is huge, so you'll have to filter the raw data and keep only the network request/response objects.
import json
import secrets
def digest_log_data(performance_log):
# write all raw data in a file
with open('data.json', 'w', encoding='utf-8') as outfile:
json.dump(performance_log, outfile)
# open the file and real it with encoding='utf-8'
with open('data.json', encoding='utf-8') as data_file:
data = json.loads(data_file.read())
return data
def digest_raw_data(data, mongo_object={}):
for idx, val in enumerate(data):
data_object = json.loads(data[idx]['message'])
if (data_object['message']['method'] == 'Network.responseReceived') or (data_object['message']['method'] == 'Network.requestWillBeSent'):
mongo_object[secrets.token_hex(30)] = data_object
else:
pass
We chose to push this data into a MongoDB database, which is later analysed by an ETL and pushed into a Redshift database to create statistics.
I hope this is what you are looking for.
The way I'm running the script is:
import codecs
from pprint import pprint
import urllib
from lib import mongo_client
from lib.test_data import test_data as data
from jsonpath_ng.ext import parse
from IPython import embed
from lib.output_data import process_output_data as output_data
from lib.config import config
from lib import driver
browser = driver.Driver()
# get the list of urls which we need to navigate
urls = data.url_list()
for url in urls:
browser.navigate(config.Env().base_url() + url)
print('Visiting ' + url)
# get performance log
performance_log = browser.performance_log
# digest the performace log
data = output_data.digest_log_data(performance_log)
# initiate an empty dict
mongo_object = {}
# prepare the data for the mongo document
output_data.digest_raw_data(data, mongo_object)
# load data into the mongo db
mongo_client.populate_mongo(mongo_object)
browser.instance.quit()
My main source was the following, which I adjusted to my needs:
https://www.reddit.com/r/Python/comments/97m9iq/headless_browsers_export_to_har/
Thanks
You can do it in the simplest way with Selenide + Java + JS.
Import java.nio.file.Files, java.nio.file.Paths and java.io.IOException in your class.
Then create this function:
public static void getHar() throws IOException {
open("http://you-task.com");
String scriptGetInfo = "performance.setResourceTimingBufferSize(1000000);" +
"return performance.getEntriesByType('resource').map(JSON.stringify).join('\\n')";
String har = executeJavaScript(scriptGetInfo);
Files.write(Paths.get("log.har"), har.getBytes());
}
It saves log.har in the root of your project.
Just call this function at the point where you want to save the HAR file.
I am trying to display images inside a Browser widget (SWT). These images can be found inside a JAR file (plug-in development). However, this is not directly possible, as the browser widget expects some kind of URL or URI information.
My idea is to turn SWT images into data-URI values, which I could inject into the src attribute of every given img element. I know that this is not a good solution from a performance point of view, but I don't mind the speed disadvantage.
I'd like to know how to turn an SWT image into a data-URI value for use in a browser widget.
My code so far:
package editor.plugin.editors.htmlprevieweditor;
import editor.plugin.Activator;
import org.eclipse.swt.browser.Browser;
import org.eclipse.swt.events.DisposeEvent;
import org.eclipse.swt.events.DisposeListener;
import org.eclipse.swt.graphics.ImageData;
import org.eclipse.swt.layout.FillLayout;
import org.eclipse.swt.widgets.Composite;
public class HtmlPreview extends Composite implements DisposeListener {
private final Browser content;
public HtmlPreview(final Composite parent, final int style) {
super(parent, style);
this.setLayout(new FillLayout());
content = new Browser(this, style);
final ImageData imageData = Activator.getImageDescriptor(Activator.IMAGE_ID + Activator.PREVIEW_SMALL_ID).getImageData();
content.setText("<html><body><img src=\"data:image/png;base64," + imageData + "\"/></body></html>"); // need help on changing imageData to a base64-encoded String of bytes?
this.addDisposeListener(this);
}
@Override
public void widgetDisposed(final DisposeEvent e) {
e.widget.dispose();
}
}
Any help is greatly appreciated :)!
Edit 1: I have read SWT Image to/from String , but unfortunately it does not seem to exactly cover my needs.
Edit 2: I don't know if it matters, but I am trying to load a PNG24 image with per-pixel alpha transparency.
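For reference, a minimal sketch of the data-URI conversion described above, using SWT's ImageLoader and java.util.Base64 inside the constructor shown earlier (additional imports assumed: java.io.ByteArrayOutputStream, java.util.Base64, org.eclipse.swt.SWT, org.eclipse.swt.graphics.ImageLoader):
// Encode the ImageData as PNG bytes, then Base64-encode them into a data URI.
ByteArrayOutputStream out = new ByteArrayOutputStream();
ImageLoader loader = new ImageLoader();
loader.data = new ImageData[] { imageData };
loader.save(out, SWT.IMAGE_PNG);
String dataUri = "data:image/png;base64," + Base64.getEncoder().encodeToString(out.toByteArray());
content.setText("<html><body><img src=\"" + dataUri + "\"/></body></html>");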
The question is too general if you only say "Browser in SWT". The Mozilla browser supports the jar URL protocol, so you can do this:
public static void main(String[] args) {
final Display display = new Display();
final Shell shell = new Shell(display);
shell.setLayout(new FillLayout());
final URL url = ShellSnippet.class.getResource("/icons/full/message_error.gif");
final Browser browser = new Browser(shell, SWT.MOZILLA);
final String html = String.format("<html><head/><body>image: <img src=\"%s\"/></body></html>", url);
browser.setText(html);
shell.open();
while (!shell.isDisposed()) {
if (!display.readAndDispatch()) {
display.sleep();
}
}
display.dispose();
}
It looks like this:
I used an image from the JFace jar to keep the snippet simple and still have it work for most people out of the box. It is a GIF, but I expect it to work just as well with PNG files.
If you use Internet Explorer (something I do not recommend, because your application then depends on the OS version), this does not work. It looks like this (after changing the style in the snippet from SWT.MOZILLA to SWT.NONE):
It does, however, understand the file protocol, and you can copy files to the temp folder and create URLs directly to the file using File.toURL(). This should work for any browser.
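A rough sketch of that file-based approach, reusing the resource path from the snippet above (toURI().toURL() avoids the deprecated File.toURL(); imports from java.io and java.nio.file assumed, inside a method that declares IOException):
// Copy the classpath resource to a temporary file and reference it via a file: URL.
InputStream in = ShellSnippet.class.getResourceAsStream("/icons/full/message_error.gif");
Path tmp = Files.createTempFile("preview", ".gif");
Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
String html = String.format("<html><head/><body>image: <img src=\"%s\"/></body></html>", tmp.toUri().toURL());
browser.setText(html);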
I cannot test the simple solution with the WEBKIT browser. If anyone can, please post the result in a comment.