Hi, I am trying to parse data from Yahoo Finance using Jsoup in Eclipse, selecting elements by their class with the code below.
This method has worked for me with other websites but will not work here. The attached link is the page I'm trying to parse; the line I'm after contains 21.74, and specifically I want to parse out the "21.74". I have tried selecting table elements but nothing seems to work. This is my first question, so any suggestions are much appreciated!
public static final String YAHOOLINK = "http://finance.yahoo.com/quote/MMM/key-statistics?p=";

private String yahooLink;
private Document rawYahooData;
private static final String CLASSNAME = "W(100%) Pos(r)";

public YahooDataCollector(String aStockTicker) {
    yahooLink = YAHOOLINK + aStockTicker;
    try {
        rawYahooData = Jsoup.connect(yahooLink).timeout(10 * 1000).get();
        Elements yahooElements = rawYahooData.getElementsByClass(CLASSNAME);
        for (Element e : yahooElements) {
            System.out.println(e.text());
        }
    } catch (IOException e) {
        System.out.println("Error Grabbing Raw Data For " + aStockTicker);
    }
}
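One thing worth checking: Jsoup's getElementsByClass() matches a single class token, so a compound value like "W(100%) Pos(r)" will never match any element. Below is a minimal sketch of two ways around that, assuming the value you want appears in the server-rendered HTML at all (much of Yahoo Finance is filled in by JavaScript, which Jsoup does not execute):

Document doc = Jsoup.connect(YAHOOLINK + aStockTicker).timeout(10 * 1000).get();

// Match on a single class token instead of the full compound value
Elements byToken = doc.getElementsByAttributeValueContaining("class", "Pos(r)");
for (Element e : byToken) {
    System.out.println(e.text());
}

// Or simply walk the statistics tables and scan their rows
for (Element row : doc.select("table tr")) {
    System.out.println(row.text());
}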
Related
I am trying to create a Discord bot that searches for an item the user enters as "!price item" and then gives me a price I can work with later in the code. I have figured out how to get the HTML into a String or a Document, but I am struggling to find a way to extract only the prices.
Here is the code:
@Override
public void onMessageReceived(MessageReceivedEvent event) {
    String html;
    System.out.println("I received a message from " +
            event.getAuthor().getName() + ": " +
            event.getMessage().getContentDisplay());
    if (event.getMessage().getContentRaw().contains("!price")) {
        String input = event.getMessage().getContentDisplay();
        String item = input.substring(9).replaceAll(" ", "%20");
        String URL = "https://www.google.lt/search?q=" + item + "%20price";
        try {
            html = Jsoup.connect(URL).userAgent("Mozilla/49.0").get().html();
            html = html.replaceAll("[^\\ ,.£€eur0123456789]", " ");
        } catch (Exception e) {
            return;
        }
        System.out.println(html);
    }
}
The biggest problem is that I am using Google search, so the prices are not in the same place in the HTML code. Is there a way I can extract only (numbers + EUR) or (a euro sign + price) from the HTML code?
You can easily do that by scraping the website. Here's a simple working example of what you are looking for using Jsoup:
import java.io.IOException;
import java.util.ArrayList;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {
    public static void main(String[] args) {
        try {
            String query = "oneplus";
            String url = "https://www.google.com/search?q=" + query + "%20price&client=firefox-b&source=lnms&tbm=shop&sa=X";
            int pricesToRetrieve = 3;
            ArrayList<String> prices = new ArrayList<String>();
            Document document = Jsoup.connect(url).userAgent("Mozilla/5.0").get();
            Elements elements = document.select("div.pslires");
            for (Element element : elements) {
                String price = element.select("div > div > b").text();
                String[] finalPrice = price.split(" ");
                prices.add(finalPrice[0] + finalPrice[1]);
                pricesToRetrieve -= 1;
                if (pricesToRetrieve == 0) {
                    break;
                }
            }
            System.out.println(prices);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
That piece of code will output:
[347,10€, 529,90€, 449,99€]
If you want to retrieve more information, just connect Jsoup to the Google Shopping URL with your desired query and scrape the result. In this case I scraped Google Shopping for the OnePlus to check its prices, but you can also get the URL to buy it, the full product name, and so on. In this piece of code I retrieve the first 3 prices indexed in Google Shopping and add them to an ArrayList of String; before adding each one, I split the retrieved text on the space character so that I keep only the information I want, the price.
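If you would rather pull the prices straight out of raw HTML text, as the question suggests, a regular expression is another option. Here is a rough sketch; the pattern is only an assumption about the price formats you are likely to meet, so adjust it to your data:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PriceExtractor {
    // Matches strings like "349,99 EUR", "349.99€" or "€ 349" - adjust to the formats you actually see
    private static final Pattern PRICE = Pattern.compile(
            "(?:€\\s*\\d{1,6}(?:[.,]\\d{1,2})?)|(?:\\d{1,6}(?:[.,]\\d{1,2})?\\s*(?:€|EUR))",
            Pattern.CASE_INSENSITIVE);

    public static List<String> extractPrices(String html) {
        List<String> prices = new ArrayList<>();
        Matcher matcher = PRICE.matcher(html);
        while (matcher.find()) {
            prices.add(matcher.group());
        }
        return prices;
    }
}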
This is a simple scraping example; if you need anything else, feel free to ask! And if you want to learn more about scraping with Jsoup, check this link.
Hope this helped you!
I have an ArrayList containing Movie objects which I produced from a List containing File objects using this method:
// returns a list containing Movie objects with correct names
private static ArrayList<Movie> createMovieObjs(Collection<File> videoFiles) {
    ArrayList<Movie> movieArrayList = new ArrayList<>();
    Matcher matcher;
    for (File file : videoFiles) {
        matcher = Movie.NAME_PATTERN.matcher(file.getName());
        while (matcher.find()) {
            String movieName = matcher.group(1).replaceAll("\\.", " ");
            Movie movie = new Movie(movieName);
            if (!movieArrayList.contains(movie)) {
                movieArrayList.add(movie);
            }
        }
    }
    return movieArrayList;
}
Everything works fine in the above code; I get the correct ArrayList.
Then I want to parse info for each Movie object in this ArrayList and set that info on the Movie object:
// want to parse genre, release year and IMDb rating for every Movie object
for (Movie movie : movieArrayList) {
    try {
        movie.imdbParser();
    } catch (IOException e) {
        System.out.println("Parsing failed: " + e);
    }
}
Here is Movie.imdbParser, which uses Movie.createXmlLink (createXmlLink works fine on its own, and so does imdbParser - I tested both):
private String createXmlLink() {
    StringBuilder sb = new StringBuilder(XML_PART_ONE);
    // need to replace spaces in movie names with "+" - the API works that way
    String namePassedToXml = this.title.replaceAll(" ", "+");
    sb.append(namePassedToXml);
    sb.append(XML_PART_TWO);
    return sb.toString();
}

// parses the IMDb data and sets releaseYear, genre and imdbRating on Movie objects
public void imdbParser() throws IOException {
    String xmlLink = createXmlLink();
    // using "new URL(...)" because the XML is on the web, not on my disk
    Document doc = Jsoup.parse(new URL(xmlLink).openStream(), "UTF-8", "", Parser.xmlParser());
    Element movieFromXml = doc.select("movie").first();
    // using an array to extract only the last genre name - usually the most substantive one
    String[] genreArray = movieFromXml.attr("genre").split(", ");
    this.genre = genreArray[genreArray.length - 1];
    this.imdbRating = Float.parseFloat(movieFromXml.attr("imdbRating"));
    // using an array to extract only the year of release
    String[] dateArray = movieFromXml.attr("released").split(" ");
    this.releaseYear = Integer.parseInt(dateArray[2]);
}
The problem seems to be with accessing the Movie objects: it doesn't create a good XML link, so when it tries to access the genre in the XML it throws an NPE.
My error:
Exception in thread "main" java.lang.NullPointerException
at com.michal.Movie.imdbParser(Movie.java:79)
at com.michal.Main.main(Main.java:52)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Movie.java:79 is:
String[] genreArray = movieFromXml.attr("genre").split(", ");
and Main.java:52 is:
movie.imdbParser();
A lot of errors such as this are caused by the data not being as you expect. When dealing with a lot of data, it can take just a single record that does not "conform"!
You say that this line is causing you problems:
String[] genreArray = movieFromXml.attr("genre").split(", ");
So break the line down and debug your application to find the issue. If you can't debug, then write the intermediate results to a log file.
Check that movieFromXml is not null.
Check that movieFromXml contains a "genre" attribute as you expect.
Check that the data in the attribute can be split as you expect and produces a valid output.
Often it is good to view the data being retrieved prior to this call. I find it useful to dump the data out to a file and then load it in an external viewer to check that it is what I expect.
When you make a chain of method calls like movieFromXml.attr("genre").split(", "), you have to make sure that each preceding element is not null. In this case, you have to make sure that movieFromXml is not null and that movieFromXml.attr("genre") is not null.
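For the code in the question, a minimal sketch of what splitting that chain up might look like inside imdbParser (variable and field names are the ones from the question; note that select(...).first() returns null when nothing matches, while attr() returns an empty string rather than null):

Element movieFromXml = doc.select("movie").first();   // first() returns null if nothing matched
if (movieFromXml == null) {
    throw new IOException("No <movie> element returned for: " + xmlLink);
}

String genreAttr = movieFromXml.attr("genre");        // attr() returns "" (never null) when the attribute is missing
if (genreAttr.isEmpty()) {
    System.out.println("No genre returned for " + this.title);
} else {
    String[] genreArray = genreAttr.split(", ");
    this.genre = genreArray[genreArray.length - 1];
}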
I am very new to using Jsoup and HTML. I was wondering how to extract the titles and links (if possible) from the stories on the front page of Google News. Here is my code:
org.jsoup.nodes.Document doc = null;
try {
    doc = (org.jsoup.nodes.Document) Jsoup.connect("https://news.google.com/").get();
} catch (IOException e1) {
    // TODO Auto-generated catch block
    e1.printStackTrace();
}
Elements titles = doc.select("titletext");
System.out.println("Titles: " + titles.text());
// non existent
for (org.jsoup.nodes.Element e : titles) {
    System.out.println("Title: " + e.text());
    System.out.println("Link: " + e.attr("href"));
}
For some reason my program seems unable to find titletext, since the only output when the code runs is "Titles: ".
I would really appreciate your help, thanks.
First, get all elements with the h2 HTML tag:
Elements elem = doc.select("h2");
Each of these elements has child elements and attributes (id, href, originalhref and so on), from which you can retrieve the data you need:
for (Element e : elem) {
    System.out.println(e.select("[class=titletext]").text());
    System.out.println(e.select("a").attr("href"));
}
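Putting it together with the Document from the question (the h2/titletext structure is just what this answer assumes; Google News markup changes often, so verify it against the live page):

Document doc = Jsoup.connect("https://news.google.com/").get();

for (Element h2 : doc.select("h2")) {
    String title = h2.select("[class=titletext]").text();
    String link = h2.select("a").attr("abs:href"); // "abs:" resolves relative URLs against the page URL
    System.out.println("Title: " + title);
    System.out.println("Link: " + link);
}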
I'm trying to extract some data from HTML source code in my Java project. The HTML is taken from Bing image search, and I want to get all the images from the <a> tags. This is the HTML code:
<a href="/images/search?q=nba&view=detailv2&&&
id=FE19E7BB2916CE8B6CD78148F3BC0656D151049A&
selectedIndex=3&
ccid=2%2f7OBkGc&
simid=608035681734625885&
thid=JN.tdPCsRj4HyJzbwA%2bgXsS8g"
ihk="JN.tdPCsRj4HyJzbwA+gXsS8g"
m="{ns:"images",k:"5070",dirovr:"ltr",
mid:"FE19E7BB2916CE8B6CD78148F3BC0656D151049A",
surl:"http://www.nba.com/gallery/rookie/070727_1.html",
imgurl:"http://www.nba.com/media/draft_class_3_07_070727.jpg
",
ow:"300",docid:"608035681734625885",oh:"192",tft:"58"}"
mid="FE19E7BB2916CE8B6CD78148F3BC0656D151049A"
t1="The 2007 NBA Draft Class"
t2="625 x 400 · 374 kB · jpeg"
t3="www.nba.com/gallery/rookie/070727_1.html"
h="ID=images,5070.1"><img data-bm="16"
src="https://tse3.mm.bing.net/th?id=JN.tdPCsRj4HyJzbwA%2bgXsS8g&w=217&h=142&c=7&rs=1&qlt=90&o=4&pid=1.1"
style="width:217px;height:142px;" width="217" height="142">
</a>
and this is how I tried to extract it, but without success:
public static void main(String[] args) {
    String title = "dog";
    String url = "https://www.bing.com/images/search?q=" + title + "&FORM=HDRSC2";
    try {
        Document doc = Jsoup.connect(url).get();
        Elements img = doc.getElementsByTag("a");
        for (Element el : img) {
            String src1 = el.absUrl("imgurl");
            String src2 = el.absUrl("surl");
            System.out.println(src1 + " " + src2);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Any idea if it's possible?
As far as I understand, your <a> element has an attribute m, not imgurl or surl, and that m contains JSON which in turn contains imgurl and surl. So you should extract the JSON from m:
String m = el.attr("m");
And then parse that m as JSON, using any library you like, e.g. Gson:
class MJson {
    private String imgurl;
    private String surl;
    // ... plus getters such as getImgurl() and getSurl()
}
MJson mJson = new Gson().fromJson(m, MJson.class);
String src1 = mJson.getImgurl();
String src2 = mJson.getSurl();
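Put together, a rough end-to-end version might look like the sketch below. The field names come from the m attribute shown in the question, Bing's markup can change at any time, and Gson needs to be on the classpath:

import com.google.gson.Gson;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BingImageScraper {
    // Only the two fields we care about from the JSON in the m attribute
    static class MJson {
        String imgurl;
        String surl;
    }

    public static void main(String[] args) throws Exception {
        String url = "https://www.bing.com/images/search?q=dog&FORM=HDRSC2";
        Document doc = Jsoup.connect(url).userAgent("Mozilla/5.0").get();
        Gson gson = new Gson();

        // Only look at anchors that actually carry the m attribute
        for (Element a : doc.select("a[m]")) {
            MJson m = gson.fromJson(a.attr("m"), MJson.class); // fromJson is lenient, which helps with the unquoted keys
            if (m != null && m.imgurl != null) {
                System.out.println(m.imgurl + " " + m.surl);
            }
        }
    }
}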
I'm using the getElementsByTag method to extract data from the following XML document (the Yahoo Finance news API: http://finance.yahoo.com/rss/topfinstories).
I'm using the code below. It gets the news items and the titles no problem using getElementsByTag, but for some reason it won't pick up the link when searched for by tag; it only picks up the closing tag of the link element. Is it a problem with the XML document or a problem with Jsoup?
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class GetNewsXML {

    /**
     * @param args
     */
    public static void main(String args[]) {
        Document doc = null;
        String con = "http://finance.yahoo.com/rss/topfinstories";
        try {
            doc = Jsoup.connect(con).get();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        Elements collection = doc.getElementsByTag("item"); // gets each news item
        for (Element c : collection) {
            System.out.println(c.getElementsByTag("title"));
        }
        for (Element c : collection) {
            System.out.println(c.getElementsByTag("link"));
        }
    }
}
You get <link /> http://...; the link text is placed after the link tag as a text node.
But this is not a problem:
final String url = "http://finance.yahoo.com/rss/topfinstories";
Document doc = Jsoup.connect(url).get();

for (Element item : doc.select("item")) {
    final String title = item.select("title").first().text();
    final String description = item.select("description").first().text();
    final String link = item.select("link").first().nextSibling().toString();

    System.out.println(title);
    System.out.println(description);
    System.out.println(link);
    System.out.println("");
}
Explanation:
item.select("link") // Select the 'link' element of the item
.first() // Retrieve the first Element found (since there's only one)
.nextSibling() // Get the next Sibling after the one found; its the TextNode with the real URL
.toString() // Get it as a String
With your link this example prints all elements like this:
Tax Day Freebies and Deals
You made it through tax season. Reward yourself by taking advantage of some special deals on April 15.
http://us.rd.yahoo.com/finance/news/rss/story/SIG=14eetvku9/*http%3A//us.rd.yahoo.com/finance/news/topfinstories/SIG=12btdp321/*http%3A//finance.yahoo.com/news/tax-day-freebies-and-deals-133544366.html?l=1
(...)
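As an aside, the reason the URL ends up in a text node is that Jsoup's default HTML parser treats <link> as a void element. If you parse the feed as XML instead, the link keeps its text and the nextSibling() trick is not needed; a short sketch against the same feed URL:

Document doc = Jsoup.connect("http://finance.yahoo.com/rss/topfinstories")
        .parser(Parser.xmlParser())
        .get();

for (Element item : doc.select("item")) {
    System.out.println(item.select("title").text());
    System.out.println(item.select("link").text()); // no nextSibling() needed
    System.out.println("");
}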