I'm trying to use JSoup to scrape search results from Google. This is my code so far:
public class GoogleOptimization {
    public static void main(String[] args) {
        Document doc;
        try {
            doc = Jsoup.connect("https://www.google.com/search?as_q=&as_epq=%22Yorkshire+Capital%22+&as_oq=fraud+OR+allegations+OR+scam&as_eq=&as_nlo=&as_nhi=&lr=lang_en&cr=countryCA&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=&as_filetype=&as_rights=").userAgent("Mozilla").ignoreHttpErrors(true).timeout(0).get();
            Elements links = doc.select("what should i put here?");
            for (Element link : links) {
                System.out.println("\n" + link.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
I'm just trying to get the title of each search result and the snippet below it, but I don't know which elements to select in order to scrape them. If anyone has a better method for scraping Google with Java, I'd love to hear it. Thanks.
Here you go.
public class ScanWebSO {
    public static void main(String[] args) {
        Document doc;
        try {
            doc = Jsoup.connect("https://www.google.com/search?as_q=&as_epq=%22Yorkshire+Capital%22+&as_oq=fraud+OR+allegations+OR+scam&as_eq=&as_nlo=&as_nhi=&lr=lang_en&cr=countryCA&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=&as_filetype=&as_rights=").userAgent("Mozilla").ignoreHttpErrors(true).timeout(0).get();
            Elements links = doc.select("li[class=g]");
            for (Element link : links) {
                Elements titles = link.select("h3[class=r]");
                String title = titles.text();
                Elements bodies = link.select("span[class=st]");
                String body = bodies.text();
                System.out.println("Title: " + title);
                System.out.println("Body: " + body + "\n");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Also, to work this out yourself I would suggest using Chrome. Right-click whatever you want to scrape and choose Inspect Element; it takes you to the exact spot in the HTML where that element is located. In this case you first want to find the root of all the result listings. Once you have it, specify the element and, preferably, a unique attribute to search it by. In this case the root element is
<ol eid="" id="rso">
Below that you will see a bunch of listings that start with
<li class="g">
This is what you want to put into your initial Elements collection; then, for each element, you find where the title and body are. In this case, I found the title under the
<h3 class="r" style="white-space: normal;">
element, so you search for that element in each listing. The same goes for the body: I found it under the <span class="st"> element, so I searched for that and called the .text() method, which returns all the text under that element. The key is to ALWAYS try to find the element by a distinctive attribute (using a class name is ideal). If you don't, and only search for something like "div", it will match ANY div on the entire page, so you will get WAY more results than you want. I hope this explains it well. Let me know if you have any more questions.
As Java is not my "best language", I would like to ask for support. I'm trying to validate whether an element from a list contains a specific value.
I have a locator that matches at least 5+ elements, and I fetch them like this:
eg. List<WebElement> elementsList = getWebElementList(target, false);
Then I try to validate whether a specific text is part of that list by doing something like:
elementsList.contains(valueToCheck);
but it does not work.
I also found an approach using the stream() method that looks more or less like this, but I'm having trouble finishing it:
elementsList.stream().filter(WebElement::getText.....
Can you please explain how lists are handled in Java in a modern way? In C# I mostly used LINQ, but I don't know if Java has similar functionality.
Once you get the element list, you can try something like this; it will return true or false:
elementsList.stream().anyMatch(e -> e.getText().trim().contains("specificElementText"));
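Since you asked about the LINQ analogue: Java streams cover the same ground. Here is a minimal, self-contained sketch of the same patterns using plain strings in place of each WebElement's getText() result (the list contents and method names are made up for illustration):

```java
import java.util.List;
import java.util.stream.Collectors;

public class StreamListDemo {
    // LINQ's Any(x => ...) maps to Stream.anyMatch(...)
    static boolean anyContains(List<String> texts, String needle) {
        return texts.stream().anyMatch(t -> t.trim().contains(needle));
    }

    // LINQ's Where(...).ToList() maps to filter(...).collect(Collectors.toList())
    static List<String> allContaining(List<String> texts, String needle) {
        return texts.stream()
                .filter(t -> t.contains(needle))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> texts = List.of("  Apples ", "Bananas", "Cherry pie");
        System.out.println(anyContains(texts, "Banana"));  // true
        System.out.println(allContaining(texts, "a"));     // [Bananas]
    }
}
```

With real WebElements you would first map each element to its text, e.g. `elementsList.stream().map(WebElement::getText)`, and then apply the same anyMatch or filter step.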
You cannot apply the contains() method directly to a list of WebElement objects to check for text: on a List it tests membership of a WebElement, not substring matching, which only works on a String.
You also cannot extract text directly from a list of WebElement objects.
What you can do is:
Get the list of WebElements matching the given locator, iterate over the WebElements in the list extracting their text contents, and check whether that text content contains the desired text, as follows:
public static boolean waitForTextInElementList(WebDriver driver, By locator, String expectedText, int timeout) {
    try {
        for (int i = 0; i < timeout; i++) {
            List<WebElement> list = driver.findElements(locator);
            for (WebElement element : list) {
                if (element.getText().contains(expectedText)) {
                    return true;
                }
            }
            Thread.sleep(1000); // poll once per second so `timeout` behaves as seconds, not raw iterations
        }
    } catch (Exception e) {
        ConsoleLogger.error("Could not find the expected text " + expectedText + " on specified element list: " + e.getMessage());
    }
    return false;
}
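The loop above is a generic poll-until-true pattern. Here is the same idea sketched without Selenium, so the timing logic is visible on its own; the BooleanSupplier stands in for the findElements-and-check step, and all names are made up for illustration:

```java
import java.util.function.BooleanSupplier;

public class PollUtil {
    // Polls `condition` up to `attempts` times, sleeping `delayMillis` between tries.
    static boolean waitFor(BooleanSupplier condition, int attempts, long delayMillis) {
        for (int i = 0; i < attempts; i++) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the interrupt flag and give up
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // The condition becomes true on the third attempt.
        boolean found = waitFor(() -> ++calls[0] >= 3, 5, 10);
        System.out.println(found + " after " + calls[0] + " attempts"); // true after 3 attempts
    }
}
```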
I am trying to access the nested element with class gwt-HTML from http://folkets-lexikon.csc.kth.se/folkets/#lookup&dricker&0, which contains the following text:
Böjningar: drack, druckit, drick, dricka, dricker
Some quick, relevant information about the site: it is an English-Swedish dictionary. All I need to do is slightly modify the URL each time and grab the text that follows the word Böjningar; in this case I would get 'drack, druckit, drick, dricka, dricker'.
Here is what I have tried so far
Document document = Jsoup.connect("http://folkets-lexikon.csc.kth.se/folkets/#lookup&dricker&0").get();
Elements elements = document.getElementsByClass("gwt-HTML");
if (!elements.isEmpty()) {
    for (Element element : elements) {
        System.out.println(element.data());
    }
} else {
    System.out.println("***********NO RESULTS !!!");
}
With the above code, I keep entering the else statement, even though when I inspect the elements of the site, I can see
<div class="gwt-HTML">Böjningar: drack, druckit, drick, dricka, dricker</div>
How can I gain access to this element?
Here is a screenshot of the data
Use select("div.gwt-HTML") instead of getElementsByClass("gwt-HTML"):
Document document = Jsoup.connect("http://folkets-lexikon.csc.kth.se/folkets/#lookup&dricker&0").get();
Elements elements = document.select("div.gwt-HTML");
if (!elements.isEmpty()) {
    for (Element element : elements) {
        System.out.println(element.text()); // text() returns the element's text; data() only returns data nodes such as script/style contents
    }
} else {
    System.out.println("***********NO RESULTS !!!");
}
I'm new to web scraping, so the question may not be framed perfectly. I am trying to extract all the drug-name links from a given page alphabetically, and thus collect all a-z drug links, then iterate over those links to extract information from within each one, such as generic name, brand, etc. I have some very basic code below that doesn't work. Some help in approaching this problem would be much appreciated.
public class WebScraper {
    public static void main(String[] args) throws Exception {
        String keyword = "a"; // will iterate through all the alphabets eventually
        String url = "http://www.medindia.net/drug-price/brand-index.asp?alpha=" + keyword;
        Document doc = Jsoup.connect(url).get();
        Element table = doc.select("table").first();
        Elements links = table.select("a[href]"); // a with href
        for (Element link : links) {
            System.out.println(link.attr("href"));
        }
    }
}
After looking at the website and what you are expecting to get, it looks like you are grabbing the wrong table element. You don't want the first table; you want the second.
To grab a specific table, you can use this:
Element table = doc.select("table").get(1);
This will get the table at index 1, i.e. the second table in the document.
I need to select the text that is returned based on my search operation.
The XPath differs for each search. These are the various XPaths returned:
.//*[@id='messageBoxForm']/div/div[1]/div[1]/div/div[1]/div[1]/span/input
.//*[@id='messageBoxForm']/div/div[1]/div[1]/div/div[2]/div/div/div[2]/div[2]/strong
You could place it in a try-catch block: use the first XPath in the try, catch the NoSuchElementException Selenium throws, and then try the other XPath.
Based on the criteria you posted, this should do the job:
WebElement element;
try {
    element = webDriver.findElement(By.xpath("xyz"));
} catch (NoSuchElementException e) {
    element = webDriver.findElement(By.xpath("abc"));
}
// ... do things with your element
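The same first-locator-that-succeeds idea can be sketched without Selenium: a small helper that tries suppliers in order and returns the first result produced without throwing. All names here are made up for illustration:

```java
import java.util.List;
import java.util.function.Supplier;

public class Fallback {
    // Tries each candidate in order; returns the first value produced without throwing.
    static <T> T firstSuccessful(List<Supplier<T>> candidates) {
        RuntimeException last = null;
        for (Supplier<T> candidate : candidates) {
            try {
                return candidate.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and fall through to the next candidate
            }
        }
        throw last != null ? last : new IllegalArgumentException("no candidates given");
    }

    public static void main(String[] args) {
        Supplier<String> primary = () -> { throw new RuntimeException("first locator failed"); };
        Supplier<String> fallback = () -> "found via second locator";
        System.out.println(firstSuccessful(List.of(primary, fallback))); // found via second locator
    }
}
```

In the Selenium case each supplier would be a `() -> webDriver.findElement(By.xpath(...))` call, and NoSuchElementException (a RuntimeException) triggers the fallback.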
public static WebElement drpdwn_selectMonth() throws Exception {
    try {
        WebElement monthSelector = driver.findElement(By.id("monthID"));
        monthSelector.click();
        driver.manage().timeouts().implicitlyWait(15, TimeUnit.SECONDS);
        monthSelector = driver.findElement(By.xpath("//*[@id='monthID']/option[2]"));
        monthSelector.click();
        driver.manage().timeouts().implicitlyWait(15, TimeUnit.SECONDS);
        return monthSelector;
    } catch (Exception e) {
        throw (e);
    }
}
How do I do a Boolean check that a value in the drop-down list is selected?
How do I print and get the value selected in the drop-down list?
Given the few details you provided, it can be done in the following way:
WebElement monthSelector = driver.findElement(By.id("monthID"));
monthSelector.click();
if (monthSelector.isSelected()) {
    Select sel = new Select(driver.findElement(By.id("monthID")));
    sel.selectByVisibleText("Your-dropdown-value");
} else {
    System.out.println("Sorry, dropdown not selected yet");
}
Please replace Your-dropdown-value with your dropdown's actual value, e.g. "January".
It would also help if you shared your HTML, in case the above does not work for you.
An HTML snippet would help, but here's my take. If your menu element is a <select> element, you can make use of the Select API.
Once it is instantiated with the WebElement representing the root of the menu, you can use the getAllSelectedOptions() or getFirstSelectedOption() methods to retrieve the selected option(s). From there, you can print the value, or validate the selected option in your assert statement.
This is only a high-level concept, but if you read through the API doc, you should be able to come up with a solution that fits your needs.