Web-scraping: how to choose proxy - java

I want to write a web scraper that is as reusable as possible. I'm going to build it on Selenium + PhantomJS. PhantomJS will use a pool of IPs (proxies). There are huge lists of free proxies available.
How can I choose, at runtime, the best proxy for a specific URL? By the best I mean the fastest one that won't be blocked by the target resource.
Workaround
I deployed a very simple app on Heroku that serves some HTML content. I tried different proxies (with response time < 300ms) in place of 151.252.120.177:8080 (see code below) and noticed that most of them can't load even the simplest HTML within the 15-second timeout, while some others (nominally even slower) retrieved the content in a second. Why are some proxies unable to reach my content? Are they blacklisted by Heroku?
DesiredCapabilities caps = new DesiredCapabilities();
caps.setJavascriptEnabled(true);
caps.setCapability(PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY, "drivers/phantomjs");
// Command-line arguments that route PhantomJS traffic through the proxy
ArrayList<String> cliArgsCap = new ArrayList<>();
cliArgsCap.add("--proxy=151.252.120.177:8080");
cliArgsCap.add("--proxy-type=socks");
caps.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS, cliArgsCap);
WebDriver driver = new PhantomJSDriver(caps);
try {
    driver.get(REMOTE_URL);
    // Wait up to 15 seconds for the button to become clickable
    WebDriverWait wait = new WebDriverWait(driver, 15);
    WebElement element = wait.until(ExpectedConditions.elementToBeClickable(By.className("btn-success")));
    element.click();
} finally {
    // Always shut PhantomJS down, even if the wait times out
    driver.quit();
}

If you only need the scraping results, you can use a free scraper with proxy support. Try this one: http://datathief.verych.ru/
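The runtime choice asked about in the question can be sketched with the JDK alone: time one request through each candidate proxy against the target URL, discard the ones that fail or return a non-2xx status, and keep the fastest. This is only a minimal sketch under assumptions: the class `ProxyPicker` and its methods are hypothetical names, the proxies are assumed to be HTTP proxies, and a single probe is a rough measurement that should probably be repeated periodically because free proxies change quickly.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.util.Comparator;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: times each proxy against the target URL and picks the fastest that worked. */
final class ProxyPicker {

    /** Marker for a proxy that failed, timed out, or returned a non-2xx status. */
    static final long FAILED = -1L;

    /** Returns the elapsed milliseconds for one GET through the proxy, or FAILED. */
    static long probe(String host, int port, String targetUrl, int timeoutMs) {
        // Assumption: the candidate is an HTTP proxy
        Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(host, port));
        long start = System.nanoTime();
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(targetUrl).toURL().openConnection(proxy);
            conn.setConnectTimeout(timeoutMs);
            conn.setReadTimeout(timeoutMs);
            int status = conn.getResponseCode();
            conn.disconnect();
            if (status < 200 || status >= 300) {
                return FAILED; // treated as blocked or broken for this target
            }
            return (System.nanoTime() - start) / 1_000_000L;
        } catch (IOException e) {
            return FAILED;
        }
    }

    /** Picks the "host:port" key with the smallest non-failed timing, if there is one. */
    static Optional<String> fastest(Map<String, Long> timingsMs) {
        return timingsMs.entrySet().stream()
                .filter(e -> e.getValue() != FAILED)
                .min(Comparator.comparingLong(Map.Entry::getValue))
                .map(Map.Entry::getKey);
    }
}
```

The idea would be to fill a map with `probe(...)` results keyed by "host:port" and pass the winner from `fastest(...)` to the `--proxy=` argument shown in the question. Probing the actual target URL, rather than relying on a published response time, matters because a proxy can be fast in general and still be rejected by one specific site.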

Related

Java Selenium - Bypass driver.get() waiting time

I am trying to access a website using Selenium WebDriver, but the website keeps loading even though I can already interact with it. (The website is nitrotype.com, if you are wondering.) I think this is because driver.get() waits until the page is fully loaded. Can I bypass this and wait only until a certain element loads?
TL;DR
How do I stop driver.get() from waiting until a site is completely loaded before proceeding?
Look into page load strategy: https://www.selenium.dev/documentation/en/webdriver/page_loading_strategy/
Normal, Eager & None are your options. I suggest you combine your strategy with the proper explicit wait.
NORMAL
normal: Selenium WebDriver waits for the entire page to load, that is, until the load event has fired. This is the default when no strategy is provided.
EAGER
eager: Selenium WebDriver waits until the initial HTML document has been completely loaded and parsed, that is, until the DOMContentLoaded event has fired. It does not wait for stylesheets, images and subframes.
NONE
none: Selenium WebDriver only waits until the initial page is downloaded.
To implement:
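import org.openqa.selenium.PageLoadStrategy;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class PageLoadStrategyExample {
    public static void main(String[] args) {
        ChromeOptions chromeOptions = new ChromeOptions();
        // NONE: wait only until the initial page is downloaded
        chromeOptions.setPageLoadStrategy(PageLoadStrategy.NONE);
        WebDriver driver = new ChromeDriver(chromeOptions);
        try {
            // Navigate to URL
            driver.get("https://google.com");
        } finally {
            driver.quit();
        }
    }
}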

How can I work around with "Vector smash protection is enabled"?

This is an error message that I get when I try to make a request to a page (for instance, http://www.lipsum.com) that has Vector smash protection enabled. How can I work around this?
This is the exact error message:
Vector smash protection is enabled
You can deal with this problem by using ChromeOptions together with DesiredCapabilities. One thing to consider first:
1: the value you should put in 'user-data-dir' is the profile path shown on the chrome://version/ page in Google Chrome.
ChromeOptions options = new ChromeOptions();
DesiredCapabilities capabilities = DesiredCapabilities.chrome();
options.addArguments("user-data-dir=/Users/YourUser/Library/Application Support/Google/Chrome/Profile 1");
capabilities.setCapability(ChromeOptions.CAPABILITY, options);
Afterward, pass these capabilities to your driver:
driver = new ChromeDriver(capabilities);
With this setup you can make requests to a page that has Vector smash protection enabled.

Selenium Webdrivers: Load Page without any resources

I am trying to prevent JavaScript from changing the source code of the site I'm testing with Selenium. The problem is that I can't simply turn JavaScript off in the WebDriver, because I need it for a test. Here's what I'm doing for the Firefox WebDriver:
firefoxProfile.setPreference("permissions.default.image", 2);
firefoxProfile.setPreference("permissions.default.script", 2);
firefoxProfile.setPreference("permissions.default.stylesheet", 2);
firefoxProfile.setPreference("permissions.default.subdocument", 2);
I don't allow Firefox to load any Images, Scripts and Stylesheets.
How can I do this with the Internet Explorer Webdriver and the Chrome Webdriver? I have not found any similar preferences. Or is there even a more elegant way to stop the webdrivers from loading the site's JS Files after all?
Thank you!
The solution is to use a proxy. WebDriver integrates very well with BrowserMob Proxy: http://bmp.lightbody.net/
private WebDriver initializeDriver() throws Exception {
    // Start the server and get the Selenium proxy object
    ProxyServer server = new ProxyServer(proxy_port); // package net.lightbody.bmp.proxy
    server.start();
    server.setCaptureHeaders(true);
    // Blacklist Google Analytics
    server.blacklistRequests("https?://.*\\.google-analytics\\.com/.*", 410);
    // Or whitelist what you need (comma-separated regular expressions)
    server.whitelistRequests("https?://.*\\.yoursite\\.com/.*,https://.*\\.someOtherYourSite\\..*".split(","), 200);
    Proxy proxy = server.seleniumProxy(); // Proxy is package org.openqa.selenium.Proxy
    // Configure it as a desired capability
    DesiredCapabilities capabilities = new DesiredCapabilities();
    capabilities.setCapability(CapabilityType.PROXY, proxy);
    // Start the driver
    WebDriver driver = new FirefoxDriver(capabilities);
    //WebDriver driver = new InternetExplorerDriver();
    return driver;
}
Probably the easiest way to accomplish what you want in a cross-browser way is to use a proxy. This would allow you to intercept requests for resources, and block them. This would also have the advantage of using the same code for all browsers, rather than having to special-case each browser with settings unique to that browser.
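To make the intercept-and-block idea concrete, here is a minimal sketch of the per-request decision such a proxy makes, written with plain java.util.regex. It is not BrowserMob Proxy's implementation: `ResourceFilter`, `block`, `allow` and `shouldLoad` are hypothetical names, and the precedence used here (block list first, then allow list if one is configured) is an assumption chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical filter illustrating the allow/block decision a proxy makes per request URL. */
final class ResourceFilter {
    private final List<Pattern> blocked = new ArrayList<>();
    private final List<Pattern> allowed = new ArrayList<>();

    /** Adds a regular expression for URLs that must never be loaded. */
    ResourceFilter block(String regex) {
        blocked.add(Pattern.compile(regex));
        return this;
    }

    /** Adds a regular expression for URLs that may be loaded. */
    ResourceFilter allow(String regex) {
        allowed.add(Pattern.compile(regex));
        return this;
    }

    /** Returns true if the request for this URL should go through. */
    boolean shouldLoad(String url) {
        // Assumed precedence: a block-list match always wins
        for (Pattern p : blocked) {
            if (p.matcher(url).matches()) {
                return false;
            }
        }
        // With no allow list configured, everything not blocked is loaded
        if (allowed.isEmpty()) {
            return true;
        }
        for (Pattern p : allowed) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

Because the decision depends only on the URL, the same rules apply to Firefox, Chrome and Internet Explorer once each browser is pointed at the proxy, which is the cross-browser advantage described above.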

Selenium WebDriver get(url) speed issue

The get(url) method waits for the web page to be fully loaded. If the page has a lot of stuff on it, loading can be very slow.
Is there a way to navigate to the target web page and wait only for the WebElement of interest? (i.e. not the banners, ads, etc.)
Thanks!
You can use Page load timeout. As far as I know, this is definitely supported by FirefoxDriver and InternetExplorerDriver, though I'm not sure about other drivers.
driver.manage().timeouts().pageLoadTimeout(0, TimeUnit.MILLISECONDS);
try {
driver.get("http://google.com");
} catch (TimeoutException ignored) {
// expected, ok
}
Or you can do a nonblocking page load with JavaScript:
private JavascriptExecutor js;
// I like to do this right after driver is instantiated
if (driver instanceof JavascriptExecutor) {
js = (JavascriptExecutor)driver;
}
// later, in the test, instead of driver.get("http://google.com");
js.executeScript("window.location.href = 'http://google.com'");
Both of these examples load Google, but they return control of the driver instance to you immediately instead of waiting for the whole page to load. You can then simply wait for the one element you're looking for.
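Waiting for a single element amounts to polling: look the element up repeatedly until it appears or a deadline passes. Selenium's WebDriverWait already does this for you; the sketch below only illustrates the mechanism with the JDK, and `Poll.until` is a hypothetical helper, not a Selenium API.

```java
import java.util.function.Supplier;

/** Hypothetical polling helper that mirrors the idea behind an explicit wait. */
final class Poll {

    /**
     * Calls the lookup repeatedly until it returns a non-null value or the
     * timeout elapses. Returns null on timeout or if the thread is interrupted.
     */
    static <T> T until(Supplier<T> lookup, long timeoutMs, long intervalMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            T value = lookup.get();
            if (value != null) {
                return value;
            }
            // Overflow-safe deadline check
            if (System.nanoTime() - deadline >= 0) {
                return null;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up
                Thread.currentThread().interrupt();
                return null;
            }
        }
    }
}
```

With a driver, the lookup would typically wrap an element search that returns null while the element is still missing, so the test continues as soon as the element exists, regardless of banners or ads that are still loading.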
If you want this behavior not only for WebDriver#get() but also for a nonblocking click(), you can do one of these:
Use the page load timeout as shown above.
Use The Advanced User Interactions API (JavaDocs)
WebElement element = driver.findElement(By.whatever("anything"));
new Actions(driver).click(element).perform();
Use JavaScript again:
WebElement element = driver.findElement(By.whatever("anything"));
js.executeScript("arguments[0].click()", element);
The following links may help:
Temporarily bypassing implicit waits with WebDriver
https://code.google.com/p/selenium/issues/detail?id=4993

Selenium 2: New Browser Has No Bookmarks

When I use Selenium 2 code (Java) to open Firefox (or any other browser) for some automated tests, the new window opens without my bookmarks, or for that matter the bookmark bar. Additionally, I suspect that cookies aren't retrieved either, because sites I normally log into do not remember certain things from my previous history.
The relevant code:
//WebDriver driver = new FirefoxDriver();
WebDriver driver = new InternetExplorerDriver();
String baseUrl = "http://localhost:8080";
driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
//Navigate to login page
driver.navigate().to(baseUrl + "/myApp");
//obtain the username and password elements
WebElement username = driver.findElement(By.name("username"));
WebElement password = driver.findElement(By.name("password"));
//log in
username.sendKeys("myTestLogin");
password.sendKeys("myTestPwd");
driver.findElement(By.cssSelector("input.btnStyle")).click();
...
I think by default Selenium (WebDriver) tries to use as "clean" a profile as possible, so that browser settings configured by the user don't cause test failures. You can modify these settings if you need to. Check out http://code.google.com/p/selenium/wiki/TipsAndTricks and see if that helps get you on the right track. I haven't done this with IE before, though. I think with Firefox you can even have Selenium use an existing profile if you really need it to.
