I recently used WebGrude to scrape some content from web pages. Then I tried to scrape some search results from eBay. Here is what I tried:
#Page("http://www.ebay.com/sch/{0}")
public class PirateBay {
public static void main(String[] args) {
//Search calls Browser, which loads the page on a PirateBay instance
PirateBay search = PirateBay.search("iPhone");
while (search != null) {
search.magnets.forEach(System.out::println);
search = search.nextPage();
}
}
public static PirateBay search(String term) {
return Browser.get(PirateBay.class, term);
}
private PirateBay() {
}
/*
* This selector matches all magnet links. The result is added to this String list.
* The default behaviour is to use the rendered html inside the matched tag, but here
* we want to use the href value instead.
*/
#Selector(value = "#ResultSetItems a[href*=magnet]", attr = "href")
public List<String> magnets;
/*
* This selector matches a link to the next page result, wich can be mapped to a PirateBay instance.
* The Link next gets the page on the href attribute of the link when method visit is called.
*/
#Selector("a:has(img[alt=Next])")
private Link<PirateBay> next;
public PirateBay nextPage() {
if (next == null)
return null;
return next.visit();
}
}
But the result is empty. How may I scrape search results using this?
The selector "#ResultSetItems a[href*=magnet]" selects the links where the href attribute has the string "magnet" on its value.
Here you can read more about Atribute selectors: attribute_selectors
What you want is "#ResultSetItems h3.lvtitle a"
To test your selectors there is this nice repl that uses Jsoup, the same library used by webgrude Try jsoup
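For reference, here is a minimal sketch of the corrected mapping. It is only an illustration: the EbayResults class name and field names are made up, and the selector assumes the #ResultSetItems / h3.lvtitle markup mentioned above.

@Page("http://www.ebay.com/sch/{0}")
public class EbayResults {

    // The rendered text of each result title link, i.e. the listing titles.
    @Selector("#ResultSetItems h3.lvtitle a")
    public List<String> titles;

    // The same links, but keeping the href attribute so each entry is a result URL.
    @Selector(value = "#ResultSetItems h3.lvtitle a", attr = "href")
    public List<String> itemLinks;

    public static EbayResults search(String term) {
        return Browser.get(EbayResults.class, term);
    }
}

Calling EbayResults.search("iPhone").titles.forEach(System.out::println) should then print the listing titles instead of an empty list, assuming the page structure has not changed.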
I'm new to web scraping, so the question may not be framed perfectly. I am trying to extract all the drug name links from a given page alphabetically, and as a result extract all a-z drug links, then iterate over these links to extract information from within each of them, such as the generic name, brand, etc. I have a very basic code below that doesn't work. Some help in approaching this problem would be much appreciated.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class WebScraper {

    public static void main(String[] args) throws Exception {
        String keyword = "a"; // will iterate through all the alphabets eventually
        String url = "http://www.medindia.net/drug-price/brand-index.asp?alpha=" + keyword;
        Document doc = Jsoup.connect(url).get();
        Element table = doc.select("table").first();
        Elements links = table.select("a[href]"); // a with href
        for (Element link : links) {
            System.out.println(link.attr("href"));
        }
    }
}
After looking at the website and what you are expecting to get, it looks like you are grabbing the wrong table element. You don't want the first table, you want the second.
To grab a specific table, you can use this:
Element table = doc.select("table").get(1);
This will get the table at index 1, i.e. the second table in the document.
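Putting the two pieces together, a possible version of the loop looks like this (a sketch only; absUrl is used so relative hrefs come back as absolute URLs that can be fetched later):

Document doc = Jsoup.connect(url).get();
Element table = doc.select("table").get(1); // the second table holds the brand links
Elements links = table.select("a[href]");
for (Element link : links) {
    System.out.println(link.text() + " -> " + link.absUrl("href"));
}

Each of those absolute URLs can then be passed to Jsoup.connect(...).get() again to load the individual drug page and pull out the generic name, brand and other details with further selectors.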
How can I tell Selenium WebDriver to wait for a specific element to have a specific attribute set in CSS?
I want to wait for:
element.getCssValue("display").equalsIgnoreCase("block")
Usually one waits for elements to be present like this:
webdriver.wait().until(ExpectedConditions.presenceOfElementLocated(By.id("some_input")));
How can I wait for a specific CSS value of the display attribute?
I think this would work for you.
webdriver.wait().until(ExpectedConditions.presenceOfElementLocated(By.xpath("//*[@id='some_input'][contains(@style, 'display: block')]")));
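If you prefer to stay with getCssValue instead of matching the style attribute via XPath, a lambda-based wait works too. This is only a sketch: webdriver is the WebDriver instance from the question, the id some_input is taken from it, and the Duration-based constructor is the Selenium 4 form (older versions take a plain seconds value).

// Wait until the computed CSS "display" value of #some_input becomes "block".
WebDriverWait wait = new WebDriverWait(webdriver, Duration.ofSeconds(10));
wait.until(driver -> driver.findElement(By.id("some_input"))
        .getCssValue("display")
        .equalsIgnoreCase("block"));

Newer Selenium releases also provide ExpectedConditions.attributeContains(By.id("some_input"), "style", "display: block"), which expresses roughly the same check against the inline style attribute.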
A small modification of the above answer helped for me. I had to use two opening parentheses before By.XPATH, not one (this is with the Python bindings):
WebDriverWait(browser,1000).until(EC.presence_of_element_located((By.XPATH,xpathstring)))
In C#, with an extension method:

public static WebDriverWait wait = new WebDriverWait(SeleniumInfo.Driver, TimeSpan.FromSeconds(20));

public static void WaitUntilAttributeValueEquals(this IWebElement webElement, String attributeName, String attributeValue)
{
    wait.Until<IWebElement>((d) =>
    {
        if (webElement.GetAttribute(attributeName) == attributeValue)
        {
            return webElement;
        }
        return null;
    });
}

Usage:

IWebElement x = SeleniumInfo.Driver.FindElement(By.XPath("//..."));
x.WaitUntilAttributeValueEquals("disabled", null); // verifies that the attribute "disabled" does not exist
x.WaitUntilAttributeValueEquals("style", "display: none;");
I would like to avoid activating a page if its content is empty. I do this with a servlet, as follows:

@SlingServlet(paths = "/bin/servlet", methods = "GET", resourceTypes = "sling/servlet/default")
public class ValidatorServlet extends SlingAllMethodsServlet {

    @Override
    protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response) {
        String page = "pathToPage";
        PageManager pageManager = request.adaptTo(PageManager.class);
        Page currentPage = pageManager.getPage(page);
        boolean result = pageHasContent(currentPage);
    }
}

Now, how do I check whether currentPage has content?
Please note that the following answer was posted in 2013, when CQ/AEM was a lot different from the current version. The following may not work consistently if used. Refer to Tadija Malic's answer below for more on this.
The hasContent() method of the Page class can be used to check whether the page has content or not. It returns true if the page has a jcr:content node, else it returns false.
boolean result = currentPage != null ? currentPage.hasContent() : false;
In case you would like to check for pages that have not been authored, one possible way is to check if there are any additional nodes that are present under jcr:content.
Node contentNode = currentPage.getContentResource().adaptTo(Node.class);
boolean result = contentNode.hasNodes();
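Combining the two checks with null guards gives something like the following sketch. Note that the adapted Node can be null and hasNodes() can throw a RepositoryException, so both cases are handled here (how you want to treat them is up to you):

boolean hasAuthoredContent = false;
try {
    if (currentPage != null && currentPage.hasContent()) {
        Node contentNode = currentPage.getContentResource().adaptTo(Node.class);
        // true only if there is at least one node below jcr:content
        hasAuthoredContent = contentNode != null && contentNode.hasNodes();
    }
} catch (RepositoryException e) {
    // treat unreadable content as empty; adjust to your own rules
    hasAuthoredContent = false;
}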
I would create an OSGi service that takes a Page and walks its content tree according to the rules that you set to find out whether the page has meaningful content.
Whether a page has actual content or not is application-specific, so creating your own service will give you full control on that decision.
One way is to create a new page using the same template, then iterate through the node list and calculate the hash of the components (or of their content, depending on what exactly you want to compare). Once you have the hash of an empty page template, you can compare the hash of any other page with it.
Note: this solution needs to be adapted to your own use case. Maybe it is enough for you to check which components are on the page and their order, and maybe you want to compare their configurations as well.
private boolean areHashesEqual(final Resource copiedPageRes, final Resource currentPageRes) {
    final Resource currentRes = currentPageRes.getChild(com.day.cq.commons.jcr.JcrConstants.JCR_CONTENT);
    return currentRes != null && ModelUtils.getPageHash(copiedPageRes).equals(ModelUtils.getPageHash(currentRes));
}

Model Utils:

public static String getPageHash(final Resource res) {
    long pageHash = 0;
    final Queue<Resource> components = new ArrayDeque<>();
    components.add(res);
    while (!components.isEmpty()) {
        final Resource currentRes = components.poll();
        final Iterable<Resource> children = currentRes.getChildren();
        for (final Resource child : children) {
            components.add(child);
        }
        pageHash = ModelUtils.getHash(pageHash, currentRes.getResourceType());
    }
    return String.valueOf(pageHash);
}
/**
 * This method returns the sum of the hashes of all parameters
 * @param args
 * @return long hash
 */
public static long getHash(final Object... args) {
    long result = 0;
    for (final Object arg : args) {
        if (arg != null) {
            result += arg.hashCode();
        }
    }
    return result;
}
Note: using a Queue means the order of the components is taken into account as well.
This was my approach, but I had a very specific use case. In general, you should consider whether you really want to calculate the hash of every component on every page you want to publish, since this will slow down the publishing process.
You can also compare hashes in each iteration and stop the calculation at the first difference.
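A sketch of that early-exit variant is below. It walks the two content trees in step and stops at the first resource type that differs; the haveSameStructure name is made up, and it assumes both pages come from the same template so the trees line up node for node:

private boolean haveSameStructure(final Resource pageResA, final Resource pageResB) {
    final Queue<Resource> queueA = new ArrayDeque<>();
    final Queue<Resource> queueB = new ArrayDeque<>();
    queueA.add(pageResA);
    queueB.add(pageResB);
    while (!queueA.isEmpty() && !queueB.isEmpty()) {
        final Resource resA = queueA.poll();
        final Resource resB = queueB.poll();
        if (!resA.getResourceType().equals(resB.getResourceType())) {
            return false; // first difference found, stop comparing
        }
        for (final Resource child : resA.getChildren()) {
            queueA.add(child);
        }
        for (final Resource child : resB.getChildren()) {
            queueB.add(child);
        }
    }
    // equal only if both trees were exhausted at the same time
    return queueA.isEmpty() && queueB.isEmpty();
}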
I'm trying to use JSoup to scrape the search results from Google. Currently this is my code.
public class GoogleOptimization {

    public static void main(String args[]) {
        Document doc;
        try {
            doc = Jsoup.connect("https://www.google.com/search?as_q=&as_epq=%22Yorkshire+Capital%22+&as_oq=fraud+OR+allegations+OR+scam&as_eq=&as_nlo=&as_nhi=&lr=lang_en&cr=countryCA&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=&as_filetype=&as_rights=").userAgent("Mozilla").ignoreHttpErrors(true).timeout(0).get();
            Elements links = doc.select("what should i put here?");
            for (Element link : links) {
                System.out.println("\n" + link.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
I'm just trying to get the titles of the search results and the snippets below them. So yeah, I just don't know which elements to look for in order to scrape these. If anyone has a better method to scrape Google using Java, I would love to know.
Thanks.
Here you go.
public class ScanWebSO {

    public static void main(String args[]) {
        Document doc;
        try {
            doc = Jsoup.connect("https://www.google.com/search?as_q=&as_epq=%22Yorkshire+Capital%22+&as_oq=fraud+OR+allegations+OR+scam&as_eq=&as_nlo=&as_nhi=&lr=lang_en&cr=countryCA&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=&as_filetype=&as_rights=").userAgent("Mozilla").ignoreHttpErrors(true).timeout(0).get();
            Elements links = doc.select("li[class=g]");
            for (Element link : links) {
                Elements titles = link.select("h3[class=r]");
                String title = titles.text();
                Elements bodies = link.select("span[class=st]");
                String body = bodies.text();
                System.out.println("Title: " + title);
                System.out.println("Body: " + body + "\n");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Also, to do this yourself I would suggest using Chrome. You just right-click on whatever you want to scrape and go to Inspect Element. It will take you to the exact spot in the HTML where that element is located. In this case you first want to find out where the root of all the result listings is. When you find that, you want to specify the element, and preferably a unique attribute to search it by. In this case the root element is
<ol eid="" id="rso">
Below that you will see a bunch of listings that start with
<li class="g">
This is what you want to put into your initial elements array, then for each element you will want to find the spot where the title and body are. In this case, I found the title to be under the
<h3 class="r" style="white-space: normal;">
element. So you will search for that element in each listing. The same goes for the body. I found the body to be under the <span class="st"> element, so I searched for that and used the .text() method, which returned all the text under that element. The key is to ALWAYS try to find the element by a unique attribute (using a class name is ideal). If you don't, and only search for something like "div", it will search the entire page for ANY element containing div and return that, so you will get WAY more results than you want. I hope this explains it well. Let me know if you have any more questions.
I'm working on a simplified website downloader (programming assignment) and I have to recursively go through the links in the given URL and download the individual pages to my local directory.
I already have a function to retrieve all the hyperlinks (href attributes) from a single page, Set<String> retrieveLinksOnPage(URL url). This function returns a set of hyperlinks. I have been told to download pages up to level 4 (level 0 being the home page). Therefore I basically want to retrieve all the links in the site, but I'm having difficulty coming up with the recursion algorithm. In the end, I intend to call my function like this:
retrieveAllLinksFromSite("http://www.example.com/ldsjf.html",0)
Set<String> Links = new HashSet<String>();

Set<String> retrieveAllLinksFromSite(URL url, int Level, Set<String> Links)
{
    if (Level == 4)
        return Links;
    else {
        // retrieveLinksOnPage(url, 0);
        // I'm pretty lost here, actually!
        return Links;
    }
}
Thanks!
Here is the pseudo code:
Set<String> retrieveAllLinksFromSite(int Level, Set<String> Links) {
    if (Level < 5) {
        Set<String> local_links = new HashSet<String>();
        for (String link : Links) {
            // do download link
            Set<String> new_links = ; // do parsing of the downloaded html of link
            local_links.addAll(retrieveAllLinksFromSite(Level + 1, new_links));
        }
        return local_links;
    } else {
        return Links;
    }
}
You will need to implement the things in the comments yourself. To run the function from a given single link, you need to create an initial set of links which contains only one initial link. However, it also works if you have multiple initial links.

Set<String> initial_link_set = new HashSet<String>();
initial_link_set.add("http://abc.com/");
Set<String> final_link_set = retrieveAllLinksFromSite(1, initial_link_set);
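If it helps, here is a fuller sketch of the same idea with Jsoup doing the downloading and a visited set so the same page is never fetched twice. The SiteCrawler and crawl names are made up, and retrieveLinksOnPage stands in for the helper described in the question:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class SiteCrawler {

    private static final int MAX_LEVEL = 4;
    private final Set<String> visited = new HashSet<String>();

    public void crawl(String url, int level) {
        if (level > MAX_LEVEL || !visited.add(url)) {
            return; // too deep, or this page was already downloaded
        }
        for (String link : retrieveLinksOnPage(url)) {
            crawl(link, level + 1);
        }
    }

    private Set<String> retrieveLinksOnPage(String url) {
        Set<String> links = new HashSet<String>();
        try {
            Document doc = Jsoup.connect(url).get();
            // this is also the place to write doc.outerHtml() to the local directory
            for (Element a : doc.select("a[href]")) {
                links.add(a.absUrl("href"));
            }
        } catch (IOException e) {
            // skip pages that cannot be downloaded
        }
        return links;
    }
}

Starting it with new SiteCrawler().crawl("http://www.example.com/ldsjf.html", 0) downloads the home page as level 0 and everything reachable from it down to level 4.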
You can use a HashMap instead of a Vector to store the links and their levels (since you need to recursively get all links down to level 4).
Also, it would be something like this (just giving an overall hint):

HashMap<String, Integer> Links = new HashMap<String, Integer>();

void retrieveAllLinksFromSite(URL url, int Level)
{
    if (Level == 4)
        return;
    else {
        retrieve the links on the current page and, for each retrieved link,
        do {
            download the link
            Links.put(the retrieved url, Level)   // store the link with its level in the hashmap
            retrieveAllLinksFromSite(the retrieved url, Level + 1)   // recursively call for further levels
        }
    }
}