Java - How do I extract Google News Titles and Links using Jsoup?

I am very new to using Jsoup and HTML. I was wondering how to extract the titles and links (if possible) from the stories on the front page of Google News. Here is my code:
org.jsoup.nodes.Document doc = null;
try {
    doc = Jsoup.connect("https://news.google.com/").get();
} catch (IOException e1) {
    e1.printStackTrace();
}

Elements titles = doc.select("titletext");
System.out.println("Titles: " + titles.text());
// non existent
for (org.jsoup.nodes.Element e : titles) {
    System.out.println("Title: " + e.text());
    System.out.println("Link: " + e.attr("href"));
}
For some reason my program seems unable to find titletext, since the only output when the code runs is "Titles: " with nothing after it.
I would really appreciate your help, thanks.

First, select all elements with the h2 HTML tag:
Elements elem = html.select("h2");
Each of these elements has child elements carrying the data (id, href, originalhref and so on). From these, retrieve whatever you need:
for (Element e : elem) {
    System.out.println(e.select("[class=titletext]").text());
    System.out.println(e.select("a").attr("href"));
}
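Putting the answer together, here is a minimal, self-contained sketch. The h2 wrapper and the titletext class are assumptions about Google News markup at the time of the question; the front page changes often, so verify the selectors against the live HTML first.

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class GoogleNewsTitles {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("https://news.google.com/")
                .userAgent("Mozilla/5.0")
                .get();
        // Assumed structure: each headline is an <h2> wrapping an <a class="titletext">.
        for (Element h2 : doc.select("h2")) {
            Element link = h2.selectFirst("a");
            if (link == null) {
                continue; // not a story headline
            }
            System.out.println("Title: " + h2.select("[class=titletext]").text());
            // absUrl resolves relative hrefs against the page URL.
            System.out.println("Link:  " + link.absUrl("href"));
        }
    }
}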

Related

Price extraction in java

I am trying to create a Discord bot that searches for an item entered by the user ("!price item") and then gives me a price I can work with later in the code. I have figured out how to get the HTML into a string or a Document, but I am struggling to find a way to extract only the prices.
Here is the code:
@Override
public void onMessageReceived(MessageReceivedEvent event) {
    String html;
    System.out.println("I received a message from " +
            event.getAuthor().getName() + ": " +
            event.getMessage().getContentDisplay());
    if (event.getMessage().getContentRaw().contains("!price")) {
        String input = event.getMessage().getContentDisplay();
        String item = input.substring(9).replaceAll(" ", "%20");
        String URL = "https://www.google.lt/search?q=" + item + "%20price";
        try {
            html = Jsoup.connect(URL).userAgent("Mozilla/49.0").get().html();
            html = html.replaceAll("[^\\ ,.£€eur0123456789]", " ");
        } catch (Exception e) {
            return;
        }
        System.out.println(html);
    }
}
The biggest problem is that I am using Google search, so the prices are not in the same place in the HTML. Is there a way I can extract only (numbers + EUR) or (a euro sign + price) from the HTML?
You can do that by scraping the website. Here's a simple working example using Jsoup:
import java.io.IOException;
import java.util.ArrayList;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {
    public static void main(String[] args) {
        try {
            String query = "oneplus";
            String url = "https://www.google.com/search?q=" + query + "%20price&client=firefox-b&source=lnms&tbm=shop&sa=X";
            int pricesToRetrieve = 3;
            ArrayList<String> prices = new ArrayList<String>();
            Document document = Jsoup.connect(url).userAgent("Mozilla/5.0").get();
            // Each Google Shopping result sits in a div with class "pslires".
            Elements elements = document.select("div.pslires");
            for (Element element : elements) {
                String price = element.select("div > div > b").text();
                String[] finalPrice = price.split(" ");
                prices.add(finalPrice[0] + finalPrice[1]);
                pricesToRetrieve -= 1;
                if (pricesToRetrieve == 0) {
                    break;
                }
            }
            System.out.println(prices);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
That piece of code will output:
[347,10€, 529,90€, 449,99€]
And if you want to retrieve more information, just point Jsoup at the Google Shopping URL with your desired query and scrape that. In this case I scraped Google Shopping for OnePlus to check its prices, but you can also get the URL to buy it, the full product name, etc. In this piece of code I retrieve the first 3 prices indexed in Google Shopping and add them to an ArrayList of String. Before adding each one, I split the retrieved text on spaces so I keep only the information I want: the price.
This is a simple scraping example; if you need anything else, feel free to ask! And if you want to learn more about scraping with Jsoup, check this link.
Hope this helped you!
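As for the question's original idea of matching (numbers + EUR) or (a euro sign + price) directly, here is a hedged regex sketch with java.util.regex. The pattern is an assumption about how the prices are formatted, so adjust it to the strings you actually see:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PriceExtractor {
    // Assumed formats: "347,10€", "€347.10", "529,90 EUR", and similar.
    private static final Pattern PRICE = Pattern.compile(
            "(?:€\\s*\\d+(?:[.,]\\d{1,2})?)|(?:\\d+(?:[.,]\\d{1,2})?\\s*(?:€|EUR))",
            Pattern.CASE_INSENSITIVE);

    public static List<String> extractPrices(String text) {
        List<String> prices = new ArrayList<>();
        Matcher m = PRICE.matcher(text);
        while (m.find()) {
            prices.add(m.group());
        }
        return prices;
    }

    public static void main(String[] args) {
        System.out.println(extractPrices("OnePlus from 347,10€ or 529,90 EUR"));
        // prints: [347,10€, 529,90 EUR]
    }
}

Running this over the stripped-down HTML from the question (or over document.text()) would collect every price-like token in one pass, regardless of where Google placed it.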

How to extract <a> from google scholar that contains link to download pdf file

I need to extract the <a> tag from the HTML of Google Scholar. I've written the script, but it extracts all the <a>'s, and I can't find any way to extract just the specific tag that holds the download link of the paper. Please help!
Below is the code
public static void main(String[] args) throws IOException {
    Document doc;
    try {
        doc = Jsoup.connect("https://scholar.google.com.pk/scholar?q=Bergmark%2C+D.+%282000%29.+Automatic+extraction+of+reference+linking+information+from+online+documents.+Technical+Report+CSTR2000-1821%2C+Cornell+Digital+Library+Research+Group&btnG=&hl=en&as_sdt=0%2C5").get();
        String title = doc.title();
        System.out.println("title : " + title);
        Elements links = doc.select("a[href]");
        // Elements link = doc.select(".pdf");
        for (Element link : links) {
            // get the value from the href attribute
            System.out.println("\nlink : " + link.attr("href"));
            System.out.println("text : " + link.text());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
And here is the structure of that tag:
<div class="gs_ggsd"><a href="..."><span class="gs_ctg2">[PDF]</span> cornell.edu</a></div>
Use div.gs_ggsd and a[href] as the CSS query.
Here, div.gs_ggsd selects all the div tags that have the class name gs_ggsd.
Example:
try {
    Document doc = Jsoup
            .connect("https://scholar.google.com.pk/scholar?q=Bangla+Speech+Recognition")
            .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36")
            .get();
    String title = doc.title();
    System.out.println("title : " + title);
    // Restrict the search to anchors inside the [PDF] box (div.gs_ggsd).
    Elements links = doc.select("div.gs_ggsd").select("a[href]");
    for (Element link : links) {
        System.out.println("\nlink : " + link.attr("href"));
        System.out.println("text : " + link.text());
    }
} catch (IOException e) {
    e.printStackTrace();
}
Read More : https://jsoup.org/cookbook/extracting-data/selector-syntax
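If you only want the first [PDF] anchor of each result as an absolute URL, here is a small hedged addendum continuing from the doc in the example above (gs_ggsd is taken from the answer; Scholar's markup may have changed since):

// For each result's [PDF] box, print the absolute link of its first anchor.
for (Element box : doc.select("div.gs_ggsd")) {
    Element a = box.selectFirst("a[href]");
    if (a != null) {
        System.out.println(a.absUrl("href")); // resolves relative URLs too
    }
}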

Selecting elements by class with Jsoup

Hi, I am trying to parse data from Yahoo Finance using Jsoup in Eclipse, selecting elements by their class with the code below.
This method has worked for me with other websites but will not work here. The attached link is the page I'm trying to parse; specifically, I want to parse out the "21.74". I have tried selecting table elements but nothing seems to work. This is my first question, so any suggestions are much appreciated!
public static final String YAHOOLINK = "http://finance.yahoo.com/quote/MMM/key-statistics?p=";
private String yahooLink;
private Document rawYahooData;
private static String CLASSNAME = "W(100%) Pos(r)";

public YahooDataCollector(String aStockTicker) {
    yahooLink = YAHOOLINK + aStockTicker;
    try {
        rawYahooData = Jsoup.connect(yahooLink).timeout(10 * 1000).get();
        Elements yahooElements = rawYahooData.getElementsByClass(CLASSNAME);
        for (Element e : yahooElements) {
            System.out.println(e.text());
        }
    } catch (IOException e) {
        System.out.println("Error Grabbing Raw Data For " + aStockTicker);
    }
}
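Two hedged observations on why this page resists the usual approach: getElementsByClass expects a single class name, and "W(100%) Pos(r)" is two class names, so it never matches anything; an attribute-value lookup sidesteps that. Also, much of Yahoo Finance is rendered by JavaScript, so the value may not be present in the HTML Jsoup downloads at all. A minimal sketch of the attribute-based selection:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class YahooClassSelect {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://finance.yahoo.com/quote/MMM/key-statistics?p=MMM")
                .userAgent("Mozilla/5.0")
                .timeout(10 * 1000)
                .get();
        // Match the whole class attribute verbatim, not a single class name.
        Elements rows = doc.getElementsByAttributeValue("class", "W(100%) Pos(r)");
        for (Element row : rows) {
            System.out.println(row.text());
        }
        // If this prints nothing, the figures are likely injected by JavaScript
        // and never appear in the raw HTML; a browser-driven tool such as
        // Selenium, or Yahoo's JSON endpoints, would be needed instead.
    }
}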

Get child pages in java component xwiki

I'm trying to get all the documents that list the current document as the parent, using:
List<DocumentReference> childDocs = doc.getChildrenReferences(xcontext);
where doc is the parent XWikiDocument.
However, the function only returns an empty list, although there are documents in the space.
If I print doc.getFullName() it's AAA.WebHome, and all the child spaces are listed under AAA. How should I reference doc to get, say, AAA.BBB.WebHome in the list? Or where am I going wrong?
I'm trying to write a recursive function that deletes all the child pages in the current space, but I can't list all the child pages. Here is the recursive function:
public void RECdeleteSpace(XWikiDocument doc, XWikiContext xcontext, boolean toTrash) {
    XWiki xwiki = xcontext.getWiki();
    List<DocumentReference> childDocs;
    try {
        childDocs = doc.getChildrenReferences(xcontext);
        System.out.println("REC " + doc.getFullName());
        System.out.println("CHLD " + childDocs.toString());
        System.out.println("----- ");
        Iterator<DocumentReference> docit = childDocs.iterator();
        while (docit.hasNext()) {
            DocumentReference chdocref = docit.next();
            XWikiDocument chdoc = xwiki.getDocument(chdocref, xcontext);
            System.out.println("DOC: " + chdoc.getFullName());
            RECdeleteSpace(chdoc, xcontext, toTrash);
        }
        xwiki.deleteDocument(doc, toTrash, xcontext);
    } catch (XWikiException e) {
        e.printStackTrace();
    }
}
The output is only:
INIT WebHome
REC AAA.WebHome
CHLD []
-----
And the iterator never reaches AAA.BBB.WebHome or AAA.BBB.CCC.WebHome.
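A hedged note on the likely cause: getChildrenReferences() resolves children via the explicit parent field stored on each document, while nested pages such as AAA.BBB.WebHome are children by their location in the page hierarchy and may not have their parent field set to AAA.WebHome at all. One way around this (a sketch assuming the QueryManager component is available in your context) is to query for documents whose space falls under the current one:

import java.util.List;
import org.xwiki.query.Query;
import org.xwiki.query.QueryException;
import org.xwiki.query.QueryManager;

public class ChildPageLister {
    // Assumes a QueryManager instance is available (e.g. @Inject-ed in a component).
    public List<String> childPagesBySpace(QueryManager queryManager, String space)
            throws QueryException {
        // XWQL: every document whose space is the given one or nested below it.
        Query query = queryManager.createQuery(
                "where doc.space = :space or doc.space like :prefix", Query.XWQL);
        query.bindValue("space", space);
        query.bindValue("prefix", space + ".%");
        return query.execute(); // full names such as AAA.BBB.WebHome
    }
}

The recursive delete could then iterate over these full names instead of relying on the parent field.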

Selenium Webdriver List with WebElements

Hello, I have to collect some <a> elements from a page; they have class=my_img. I save these elements in a List, then try to click the first element of the list, go back, and get the second element, but Selenium gives me this error:
Exception in thread "main" org.openqa.selenium.StaleElementReferenceException: Element not found in the cache - perhaps the page has changed since it was looked up
And this is my code:
List<WebElement> Element = drivers.findElements(By.cssSelector(".my_img"));
System.out.println("Megethos" + Element.size());
System.out.println("Pame stous epomenous \n");
for (i = 1; i < Element.size(); i++) {
    drivers.manage().timeouts().implicitlyWait(35, TimeUnit.SECONDS);
    System.out.println(i + " " + Element.size());
    System.out.println(i + " " + Element.get(i));
    action.click(Element.get(i)).perform();
    Thread.sleep(2000);
    System.out.println("go back");
    drivers.navigate().back();
    Thread.sleep(6000);
    drivers.navigate().refresh();
    Thread.sleep(6000);
}
Your action.click() and/or navigate() calls trigger a page load, so the WebElements in your list are no longer valid. Put the findElements() call inside the loop:
List<WebElement> Element = drivers.findElements(By.cssSelector(".my_img"));
for (i = 1; i < Element.size(); i++) {
    // Re-find the elements after each navigation so the references stay fresh.
    Element = drivers.findElements(By.cssSelector(".my_img"));
    drivers.manage().timeouts().implicitlyWait(35, TimeUnit.SECONDS);
    System.out.println(i + " " + Element.size());
    System.out.println(i + " " + Element.get(i));
    action.click(Element.get(i)).perform();
    Thread.sleep(2000);
    System.out.println("go back");
    drivers.navigate().back();
    Thread.sleep(6000);
    drivers.navigate().refresh();
    Thread.sleep(6000);
}
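A hedged variant of the same idea that makes the re-lookup explicit: capture only the count up front, then locate each link fresh on every pass. This reuses the drivers WebDriver from the question; the Actions usage and the start index are assumptions to adapt.

import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.interactions.Actions;

// ...inside the test method, with `drivers` already pointing at the page:
int count = drivers.findElements(By.cssSelector(".my_img")).size();
for (int i = 0; i < count; i++) { // start at 0 unless you really mean to skip the first link
    // The old references die with each page load, so look the element up again.
    WebElement link = drivers.findElements(By.cssSelector(".my_img")).get(i);
    new Actions(drivers).click(link).perform();
    drivers.navigate().back();
}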
If the primary purpose is to click on the links and get back to the previous page, it's better to collect the "href" attributes of all the "a" elements on the page and navigate to each of them. The approach you've followed will always result in a StaleElementReferenceException, because the DOM changes when you navigate back to the original page.
Below is the way I suggested:
List<WebElement> linkElements = driver.findElements(By.xpath("//a[@class='my_img']"));
System.out.println("The number of links under URL is: " + linkElements.size());

// Getting all the 'href' attributes from the 'a' tags into the String array linkhrefs
String[] linkhrefs = new String[linkElements.size()];
int j = 0;
for (WebElement e : linkElements) {
    linkhrefs[j] = e.getAttribute("href");
    j++;
}

// test each link
int k = 0;
for (String t : linkhrefs) {
    try {
        if (t != null && !t.isEmpty()) {
            System.out.println("Navigating to link number " + (++k) + ": '" + t + "'");
            driver.navigate().to(t);
            String title = driver.getTitle();
            System.out.println("title is: " + title);
            // Some known errors, if and when found in the navigated-to page.
            if (title.contains("You are not authorized to view this page")
                    || title.contains("Page not found")
                    || title.contains("503 Service Unavailable")
                    || title.contains("Problem loading page")) {
                System.err.println(t + " the link is not working because title is: " + title);
            } else {
                System.out.println("\"" + t + "\" is working.");
            }
        } else {
            System.err.println("Link's href is null.");
        }
    } catch (Throwable e) {
        System.err.println("Error came while navigating to link: " + t + ". Error message: " + e.getMessage());
    }
    System.out.println("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++");
}
