Selenium Chrome Driver Limitations Web Scraping at Scale

I'm planning to use Selenium ChromeDriver for a project that scrapes multiple public websites (something like Kayak or Skyscanner). There will be a REST GET endpoint; on each request my backend would launch headless Chrome to scrape the sites and eventually return a processed JSON response.
I want to know how scalable ChromeDriver is, since it sounds like a headless Chrome instance needs to be launched whenever a request comes in.
Updated: Question using Google Chrome Headless

Here are the pros and cons of PhantomJS that I noticed during implementation. Hope this helps.
Cons:
1) It can fail to recognize elements located by id, XPath, or CSS selector in cases where ChromeDriver finds them.
2) If you have a login mechanism, redirects won't work as you expect, compared to ChromeDriver.
3) If you need screenshots for test failures, you have to implement the custom logic for them manually.
4) If you want to switch between multiple drivers (Chrome, HtmlUnit, etc.), it is very difficult.
Pros:
1) Test case execution is faster than with ChromeDriver.
2) No browser is required; it runs without a GUI.
3) Not much configuration is needed compared to ChromeDriver.
You can also go with the HtmlUnit driver, which is quite a bit faster than PhantomJS, but it has its own limitations that you need to take care of before implementation.

I am not sure that you really need to use PhantomJS.
Chrome implemented "headless" mode a couple of months ago.
"Headless Chrome" does the same job as PhantomJS, and does it better.
I heard that the PhantomJS authors even said that they will not support it anymore.
You can enable headless mode in Selenide with just one line:
Configuration.headless = true;

Did you think about headless chrome?
Headless Chrome
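
Coming back to the scalability concern in the question: none of the answers above spell it out, but one common way to avoid launching a browser per request is to keep a small, bounded pool of already-started instances and lend one to each request. The sketch below is only an illustration under that assumption. `BrowserPool`, `withBrowser`, and `idleCount` are hypothetical names, and the type parameter `T` stands in for whatever driver object you use (for example a ChromeDriver started with --headless).

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical bounded pool; T stands in for a browser/driver instance. */
class BrowserPool<T> {
    private final BlockingQueue<T> idle;

    /** Eagerly creates `size` instances using the supplied factory. */
    BrowserPool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get()); // assumption: e.g. () -> new ChromeDriver(options)
        }
    }

    /** Borrows an instance, runs the task, and always hands the instance back. */
    <R> R withBrowser(Function<T, R> task, long timeoutMillis) {
        T browser;
        try {
            // Wait for an idle instance instead of starting a new browser.
            browser = idle.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a browser", e);
        }
        if (browser == null) {
            throw new IllegalStateException("No idle browser within " + timeoutMillis + " ms");
        }
        try {
            return task.apply(browser);
        } finally {
            idle.add(browser); // return the instance even if the task failed
        }
    }

    /** Number of instances currently not lent out. */
    int idleCount() {
        return idle.size();
    }
}
```

With such a pool the number of concurrent Chrome processes is capped by the pool size, so memory use stays predictable; requests beyond that limit wait or fail fast instead of spawning more browsers. A real version would also need to replace instances that crash and to call quit() on shutdown, which this sketch leaves out.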

Related

Headless Chrome - getting blank page source

I'm trying to load a website with the Chrome browser in headless mode using Selenium WebDriver, and I face an issue with some specific websites. The page loads, for the first 2-3 seconds it shows a page saying "please enable javascript...", and after 3 seconds the page source goes blank.
I have been using Selenium, and Chrome in particular, for a long time and I am familiar with the platform. For this case I'm using Chrome version 73.0.3683.86 and ChromeDriver 2.46.628411 (which is compatible according to "Which ChromeDriver version is compatible with which Chrome Browser version?") on macOS. The Selenium Java version is the latest, 3.141.59.
I suspect that headless Chrome cannot handle specific content types such as "svg" and other GUI-related HTTP responses.
ChromeOptions chromeOptions = new ChromeOptions();
chromeOptions.addArguments("--headless");
WebDriver driver = new ChromeDriver(chromeOptions);
driver.get("https://identity.tescobank.com/login");
Thread.sleep(3000);
System.out.println(driver.getPageSource());
driver.quit();
The expected result is for the page source to be the same as in non-headless mode.

Headless Chrome should be able to handle everything the normal Chrome can do:
It brings all modern web platform features provided by Chromium and the Blink rendering engine to the command line.
(see https://developers.google.com/web/updates/2017/04/headless-chrome)
Since only the login page of a bank causes you trouble, my guess is that the security of the page detects an anomaly and decides not to serve you.
One way they can do that is by looking at the User Agent string which contains HeadlessChrome.
That said, unless you're writing integration tests for the bank, your behavior is at least suspicious. If you have a valid and legal concern, clear it with the bank first. Otherwise they might take action against you, such as blocking your IP address (which could affect many people) or asking the police to have a word with you.
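
To make the User Agent point concrete, here is a minimal sketch of the kind of check a server could run. It is not how any particular bank does it; `HeadlessCheck` and `looksHeadless` are hypothetical names, and the two User Agent strings are illustrative examples shaped like what Chrome 73 sends on a Mac.

```java
/** Hypothetical server-side check for the "HeadlessChrome" product token. */
class HeadlessCheck {

    /** Returns true if the User-Agent header contains the HeadlessChrome token. */
    static boolean looksHeadless(String userAgent) {
        return userAgent != null && userAgent.contains("HeadlessChrome");
    }

    public static void main(String[] args) {
        // Illustrative strings, not captured from a real session.
        String headless = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) HeadlessChrome/73.0.3683.86 Safari/537.36";
        String regular = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36";

        System.out.println(looksHeadless(headless)); // prints "true"
        System.out.println(looksHeadless(regular));  // prints "false"
    }
}
```

The only difference between the two strings is the product token, which is why a page can behave differently in headless mode even though the rendering engine is the same.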

I was facing a similar issue in my script after login. Somehow, refreshing the page resolved it.
driver.navigate().refresh();

Java Selenium Chromedriver webdriver as fast as possible

I am connecting a standalone program to a website, and I have to read some of its pages. At first I used Jsoup, but I discovered that some of the information I need is loaded after the page load, so I started looking at web drivers. (I am not looking for images or anything big; my content is all textual.)
Now I have found ChromeDriver, but it is too slow for my case because it has a lot of options and features.
In my case I need just one step more than what Jsoup can do.
Is it possible to disable most of ChromeDriver's options and features to reach this goal?
For example, I saw that plugins can be disabled, but only one by one, and not for every Chrome browser on every PC. I didn't find an option like "plugin.disable-all".
Furthermore, this way I cannot open more than a few instances of ChromeDriver. At the moment, every ChromeDriver instance opens a Google Chrome Helper that uses 100 MB of RAM.
I hope everything is clear.

HtmlUnit might be enough for your needs. It does support some JavaScript.
It can be used with WebDriver, but it might well be enough on its own.

To make your WebDriver run faster (but not that much faster), you can run the driver in headless mode.
Before starting the driver, add the --headless argument to ChromeOptions.
Headless mode can speed up your automation by not rendering the browser window, but keep in mind that doing a straight HTTP GET with JSoup would always be faster.
My advice would be to reverse engineer the page a bit more, and see if you can figure out how to query directly whatever the (presumably AJAX) calls are putting on the page. If you can treat those specific requests as an API and only query for exactly what you want, you will be able to get results faster than with browser automation through Selenium.
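
As a rough sketch of that last suggestion, the snippet below calls a JSON endpoint directly with the JDK's built-in HTTP client (Java 11+) instead of driving a browser. Everything site-specific is an assumption: the `ENDPOINT` URL, the `q` parameter, and the `X-Requested-With` header are placeholders for whatever you find in the browser's network tab, and `DirectAjaxCall` is a hypothetical class name.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

/** Hypothetical direct call to the endpoint a page's AJAX code would hit. */
class DirectAjaxCall {
    // Assumption: placeholder URL; replace with the request seen in the network tab.
    static final String ENDPOINT = "https://example.com/api/search";

    /** Builds the GET request the page's script would otherwise send. */
    static HttpRequest buildRequest(String query) {
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(ENDPOINT + "?q=" + encoded))
                .header("Accept", "application/json")
                // Some sites expect this header on AJAX calls; an assumption here.
                .header("X-Requested-With", "XMLHttpRequest")
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
    }

    /** Sends the request and returns the raw response body (typically JSON). */
    static String fetch(String query) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(query), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Calling `DirectAjaxCall.fetch("some text")` would return the response body as a string, which you can then parse with the JSON library of your choice. No browser process is started, so memory use per request is a small fraction of a ChromeDriver instance.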

Java - Go to new url in new window/tab

I want to be able to use Java to tell a browser that is already open/running to go to a given URL (my lingo is terrible). (Firefox/Chrome/IE is already up, and I want it to go from the default page to, let's say, Twitter.)
Most of the solutions use java.awt.Desktop to launch the native browser with a URL in it, but that isn't useful if I want to change the URL later on. (Already on the Twitter home page, but want to go to Twitter's Contact Us page afterwards.)
The other solutions I've seen involve Selenium WebDriver, but I also eventually need to learn how to make the Java program read a long list of URLs from an Excel file and simply verify that each URL isn't dead, and then do this on the native Android browser, for example. So Selenium might not be the right choice. Granted, you can also tell me it is an awesome choice for this too, if it truly is. I haven't really been exploring Selenium.
Sorry for asking such a basic question. The company wants QA automation without training or hiring an automation QA engineer. My end goal (aside from not getting canned) is to see if I can get a bunch of URLs to load on specific browsers. I should (praying) be able to do stuff with it afterwards.

A simple trick would be to create an add-on (if you know JavaScript), which will be quite similar in Chrome and Firefox (for IE I have no idea; in my day it needed a BHO), and send WebSocket commands from Java to your add-on. This needs a Java WebSocket server running, to which your add-on connects when the browser opens. The rest of the communication can follow whatever protocol your requirements call for.

There are multiple parts to your question.
1) Read URLs from Excel.
Use Apache POI for this. Your Selenium code can use the same library.
2) Check that the URLs are not dead.
Use any Java HTTP client (e.g. Apache HttpClient) to do that without even opening a browser. If a link is dead, it will be dead in every browser.
3) Open the links in multiple browsers.
Selenium is perfect for this. I am assuming that after the page is loaded you have a way of validating that the page is correct. Selenium is very powerful here.
4) Target the native Android browser too.
I do not see much difference between this and the previous point, unless you are also testing how the site displays at different browser sizes. The browser is more or less the same as Chrome, with the WebKit rendering engine.
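
For the dead-link check, a minimal sketch might look like the following. It uses the JDK's built-in java.net.http.HttpClient (Java 11+) rather than the Apache client mentioned above, simply to stay dependency-free. `LinkChecker`, `isAlive`, and `check` are hypothetical names, treating any 2xx or 3xx status as "alive" is an assumption you may want to tighten, and the URL in `main` is a placeholder for the list read from Excel.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;

/** Hypothetical dead-link checker that never opens a browser. */
class LinkChecker {

    /** Assumption: 2xx and 3xx responses count as "alive". */
    static boolean isAlive(int statusCode) {
        return statusCode >= 200 && statusCode < 400;
    }

    /** Sends a HEAD request; any network or URL error counts as dead. */
    static boolean check(HttpClient client, String url) {
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                    .timeout(Duration.ofSeconds(10))
                    .method("HEAD", HttpRequest.BodyPublishers.noBody())
                    .build();
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            return isAlive(response.statusCode());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (IOException | IllegalArgumentException e) {
            return false; // unreachable host, timeout, or malformed URL
        }
    }

    public static void main(String[] args) {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        // Placeholder list; in practice these would come from the spreadsheet.
        for (String url : List.of("https://example.com/")) {
            System.out.println(url + " -> " + (check(client, url) ? "alive" : "dead"));
        }
    }
}
```

Some servers reject HEAD requests, so a fallback to GET for those cases may be needed. Only the URLs that pass this check then need to be opened in real browsers with Selenium.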

chrome v62 enable flash

I have a Selenium (v2.53) test that visits a site containing a Flash player (I'm testing this player). Up until now everything was working fine, but after I updated Chrome to v62, Flash is disabled by default.
I can't change the Flash setting manually, since this test is automated and runs on remote machines.
I've tried adding some Chrome capabilities that worked on previous versions of Chrome, but they did not work on Chrome 62: allowing Flash is no longer enough, and a list of allowed sites is now also required.
How can I change both the enabled status and the list of sites using Selenium?
Also, is there a way to install Chrome with a config file that both enables Flash and populates the required sites list?
Thanks.
P.S. I'm working with Java 8

Your best bet is to simply use Chrome options. Why do you need a config file? That sounds overly complicated and unnecessary. You can enable it through chrome preferences. Try a fresh install of Chrome too.
Something along these lines:
chromeOptions: {
  args: ["--allow-running-insecure-content", "--allow-insecure-websocket-from-https-origin", "--allow-outdated-plugins"]
}
You didn't specify which language so I can't give you a language example.

Configuration of selenium webdriver with xorg-x11-server-Xvfb

We have developed Selenium WebDriver scripts with JUnit and Java using Eclipse on Windows 7. All the scripts work as expected, and now we are using them for load testing with JMeter. However, when the scripts run, the system opens multiple browsers (200), one per user thread, and this causes the system to hang. Is there any way to handle this, or can we run the scripts without opening a browser? I have come across the Xvfb tool, but I could not find a Java API for it to plug into Eclipse.
We have also tried HtmlUnitDriver, but since it does not support JavaScript the tests fail; we also tried HtmlUnit and found the same thing.
Note: we have written the WebDriver scripts to check the display of elements (autocomplete, image) on screen.
It would be great if anyone could help or provide more input on this.

Firstly, do not integrate Selenium scripts with JMeter for load testing! It's not a good approach, because of the obvious consequences that you have mentioned in your post. I followed a similar approach in the beginning, when I was new to JMeter and Selenium, but suffered a great deal when it came to running load tests: they spawned too many browser instances, which killed the OS.
You can go for HtmlUnitDriver or any other headless browser testing tool with JMeter, but they will still be running the browser internally in memory. Moreover, if your application uses JavaScript heavily, it won't help.
So I would suggest that you record a browsing session with the JMeter proxy, modify the script (the set of requests) according to your needs, and replay those requests alone with the desired number of threads.
From a higher level, you should be doing this:
Add a JMeter test plan, listeners, and a thread group, set up the JMeter proxy, and record a browsing session where you enter something into the autocomplete textbox and get certain results.
Stop your proxy and take a look at all the requests that come under your thread group.
As far as I know, when it comes to autocomplete plugins, a new request is sent every time you enter a letter into the textbox. For example, for the word 'stackoverflow':
Request1: q=s Request2: q=st Request3: q=sta and so on
Here you can simulate this effect by choosing words that all have the same length, so that each word produces the same number of requests to the server.
So in your test plan, you will pass one word per JMeter thread. You can pass the words to a request from a CSV file by using JMeter parameterization.
This will be a much more memory-efficient way of load testing than using Selenium with JMeter. I had asked a similar question. You can check out the responses.
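
To illustrate the request pattern described above, here is a small sketch that expands a word into the sequence of query strings an autocomplete box would send, one per typed letter. The `q=` parameter name comes from the example above; `AutocompleteRequests` and `prefixQueries` are hypothetical names. Output like this could be used to prepare or sanity-check the CSV data for the test plan.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that mimics the requests an autocomplete box sends. */
class AutocompleteRequests {

    /** Returns one query string per typed prefix: q=s, q=st, q=sta, ... */
    static List<String> prefixQueries(String word) {
        List<String> queries = new ArrayList<>();
        for (int end = 1; end <= word.length(); end++) {
            queries.add("q=" + word.substring(0, end));
        }
        return queries;
    }

    public static void main(String[] args) {
        // Print each simulated request on its own line.
        for (String query : prefixQueries("stackoverflow")) {
            System.out.println(query);
        }
    }
}
```

A word of n letters yields n requests, which is why words of equal length keep the load per thread comparable.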
