I'm trying to extract some data from HTML source code in my Java project.
The HTML comes from a Bing image search, and I want to get the image URLs from each <a> tag. This is the HTML:
<a href="/images/search?q=nba&view=detailv2&&&
id=FE19E7BB2916CE8B6CD78148F3BC0656D151049A&
selectedIndex=3&
ccid=2%2f7OBkGc&
simid=608035681734625885&
thid=JN.tdPCsRj4HyJzbwA%2bgXsS8g"
ihk="JN.tdPCsRj4HyJzbwA+gXsS8g"
m="{ns:"images",k:"5070",dirovr:"ltr",
mid:"FE19E7BB2916CE8B6CD78148F3BC0656D151049A",
surl:"http://www.nba.com/gallery/rookie/070727_1.html",
imgurl:"http://www.nba.com/media/draft_class_3_07_070727.jpg",
ow:"300",docid:"608035681734625885",oh:"192",tft:"58"}"
mid="FE19E7BB2916CE8B6CD78148F3BC0656D151049A"
t1="The 2007 NBA Draft Class"
t2="625 x 400 · 374 kB · jpeg"
t3="www.nba.com/gallery/rookie/070727_1.html"
h="ID=images,5070.1"><img data-bm="16"
src="https://tse3.mm.bing.net/th?id=JN.tdPCsRj4HyJzbwA%2bgXsS8g&w=217&h=142&c=7&rs=1&qlt=90&o=4&pid=1.1"
style="width:217px;height:142px;" width="217" height="142">
</a>
This is how I tried to extract it, without success:
public static void main(String[] args) {
String title = "dog";
String url = "https://www.bing.com/images/search?q="+title+"&FORM=HDRSC2";
try {
Document doc = Jsoup.connect(url).get();
Elements img = doc.getElementsByTag("a");
for (Element el : img) {
String src1 = el.absUrl("imgurl");
String src2 = el.absUrl("surl");
System.out.println(src1 + " " + src2);
}
} catch (IOException e) {
e.printStackTrace();
}
}
Any idea if it's possible?
As far as I understand, your <a> element has an attribute m, not imgurl or surl, and m contains JSON which in turn contains imgurl and surl. So you should first read m from the element:
String m = el.attr("m");
Then parse m as JSON, using any library you like, e.g. GSON:
class MJson {
private String imgurl;
private String surl;
...
}
MJson mJson = new Gson().fromJson(m, MJson.class);
String src1 = mJson.getImgurl();
String src2 = mJson.getSurl();
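If adding a JSON library is not an option, a regular expression over the attribute value is another possibility. The following is only a minimal sketch using the JDK alone. It assumes the attribute text looks like the sample in the question (unquoted keys, double-quoted values); the class name MAttributeFields and the helper field are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MAttributeFields {

    // Returns the value of key:"value" inside the m attribute text, if present.
    // Assumption: values are double-quoted and contain no escaped quotes.
    static Optional<String> field(String m, String key) {
        Pattern pat = Pattern.compile("\\b" + Pattern.quote(key) + ":\"([^\"]*)\"");
        Matcher matcher = pat.matcher(m);
        return matcher.find() ? Optional.of(matcher.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Shortened copy of the m attribute from the question.
        String m = "{ns:\"images\",k:\"5070\",dirovr:\"ltr\","
                + "surl:\"http://www.nba.com/gallery/rookie/070727_1.html\","
                + "imgurl:\"http://www.nba.com/media/draft_class_3_07_070727.jpg\","
                + "ow:\"300\"}";
        System.out.println(field(m, "imgurl").orElse("(none)"));
        System.out.println(field(m, "surl").orElse("(none)"));
    }
}
```

With jsoup you would pass el.attr("m") as the first argument. A real JSON parser is still the more robust choice if the format of m ever changes.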
I am trying to create a Discord bot that looks up an item entered by the user ("!price item") and returns a price that I can work with later in the code. I figured out how to get the HTML into a string or a doc, but I am struggling to find a way to extract only the prices.
Here is the code:
@Override
public void onMessageReceived(MessageReceivedEvent event) {
String html;
System.out.println("I received a message from " +
event.getAuthor().getName() + ": " +
event.getMessage().getContentDisplay());
if (event.getMessage().getContentRaw().contains("!price")) {
String input = event.getMessage().getContentDisplay();
String item = input.substring(9).replaceAll(" ", "%20");
String URL = "https://www.google.lt/search?q=" + item + "%20price";
try {
html = Jsoup.connect(URL).userAgent("Mozilla/49.0").get().html();
html = html.replaceAll("[^\\ ,.£€eur0123456789]"," ");
} catch (Exception e) {
return;
}
System.out.println(html);
}
}
The biggest problem is that I am using Google search, so the prices are not in the same place in the HTML. Is there a way I can extract only (a number + EUR) or (a euro sign + a price) from the HTML?
You can do that by scraping the website. Here's a simple working example that does what you are looking for using JSOUP:
public class Main {
public static void main(String[] args) {
try {
String query = "oneplus";
String url = "https://www.google.com/search?q=" + query + "%20price&client=firefox-b&source=lnms&tbm=shop&sa=X";
int pricesToRetrieve = 3;
ArrayList<String> prices = new ArrayList<String>();
Document document = Jsoup.connect(url).userAgent("Mozilla/5.0").get();
Elements elements = document.select("div.pslires");
for (Element element : elements) {
String price = element.select("div > div > b").text();
String[] finalPrice = price.split(" ");
prices.add(finalPrice[0] + finalPrice[1]);
pricesToRetrieve -= 1;
if (pricesToRetrieve == 0) {
break;
}
}
System.out.println(prices);
} catch (IOException e) {
e.printStackTrace();
}
}
}
That piece of code will output:
[347,10€, 529,90€, 449,99€]
If you want to retrieve more information, just point JSOUP at the Google Shop URL with your desired query and scrape the result. In this case I scraped Google Shop for OnePlus to check its prices, but you can also get the URL to buy it, the full product name, etc. This piece of code retrieves the first 3 prices listed in Google Shop and adds them to an ArrayList of String. Before adding each one to the ArrayList, I split the retrieved text on spaces so that I keep only the information I want, the price.
This is a simple scraping example; if you need anything else, feel free to ask! And if you want to learn more about scraping using JSOUP, check this link.
Hope this helped you!
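As for the literal question (pulling "number + EUR" or "euro sign + price" out of arbitrary text), a regular expression is one possible approach. This is only a sketch under assumptions: amounts look like 347,10 or 529.90 with at most two decimals and no thousands separators, and the currency marker sits directly before or after the amount. The names EuroPrices and findPrices are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EuroPrices {

    // An amount such as 347,10 or 1479.00; the two-digit fraction is optional.
    private static final String AMOUNT = "\\d+(?:[.,]\\d{2})?";

    // Currency marker first ("\u20ac12,50", "EUR 12.50") or last ("12,50 \u20ac", "12.50 EUR").
    // \u20ac is the euro sign, escaped so the source file stays ASCII.
    private static final Pattern PRICE = Pattern.compile(
            "(?:\u20ac|EUR)\\s?" + AMOUNT + "|" + AMOUNT + "\\s?(?:\u20ac|EUR)");

    // Collects every price-looking match in the order it appears in the text.
    static List<String> findPrices(String text) {
        List<String> prices = new ArrayList<>();
        Matcher m = PRICE.matcher(text);
        while (m.find()) {
            prices.add(m.group());
        }
        return prices;
    }

    public static void main(String[] args) {
        String text = "OnePlus 5 from 347,10 \u20ac or \u20ac529.90, also 449 EUR; model 6T, 128 GB";
        System.out.println(findPrices(text));
    }
}
```

In the bot you would run findPrices on the text of the fetched document (for example Jsoup's Document.text()) rather than on the raw HTML, so that numbers inside tags and scripts are less likely to match.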
I have this code
public void descargarURL() {
try{
URL url = new URL("https://www.amazon.es/MSI-Titan-GT73EVR-7RD-1027XES-Ordenador/dp/B078ZYX4R5/ref=sr_1_1?ie=UTF8&qid=1524239679&sr=8-1");
BufferedReader lectura = new BufferedReader(new InputStreamReader(url.openStream()));
File archivo = new File("descarga2.txt");
BufferedWriter escritura = new BufferedWriter(new FileWriter(archivo));
BufferedWriter ficheroNuevo = new BufferedWriter(new FileWriter("nuevoFichero.txt"));
String texto;
while ((texto = lectura.readLine()) != null) {
escritura.write(texto);
}
lectura.close();
escritura.close();
ficheroNuevo.close();
System.out.println("Archivo creado!");
//}
}
catch(Exception ex) {
ex.printStackTrace();
}
}
public static void main(String[] args) throws FileNotFoundException, IOException {
Paginaweb2 pg = new Paginaweb2();
pg.descargarURL();
}
}
I want to remove from the URL the reference part, B078ZYX4R5, and this entity /
In the HTML that is saved to the text file there is a fragment *"<div id =" cerberus-data-metrics "style =" display: none; "data-asin =" B078ZYX4R5 "data-as-price = "1479.00" data-asin-shipping = "0" data-asin-currency-code = "EUR" data-substitute-count = "0" data-device-type = "WEB" data-display-code = "Asin is not eligible because it has a retail offer "> </ div>"*, and I want to get only the price from it, 1479.00, which is the value of the "data-as-price" attribute.
I don't want to use external libraries; I know that it can be done with split, indexOf, and substring.
Thanks!!!!
You could solve both tasks by using regular expressions. Yet for the second task (extraction of the price from the HTML) you could use JSOUP which is much better suited to extract content from HTML.
Here are some possible solutions based on regular expressions for your tasks:
1. Change URL
private static String modifyUrl(String str) {
return str.replaceFirst("/[^/]+(?=/ref)", "");
}
This is a single replacement with a regular expression that uses a positive look-ahead, (?=/ref), so the path segment in front of /ref is removed while /ref itself is kept (see https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html)
2. Extract Price
private static Optional<String> extractPrice(String html) {
Pattern pat = Pattern.compile("data-as-price\\s*=\\s*[\"'](?<price>.+?)[\"']", Pattern.MULTILINE);
Matcher m = pat.matcher(html);
if(m.find()) {
String price = m.group("price");
return Optional.of(price);
}
return Optional.empty();
}
Here you can use also a regular expression (data-as-price\s*=\s*["'](?<price>.+?)["']) to locate the price. With a named group ((?<price>.+?)) you can then extract the price.
I am returning an Optional here so that you can deal with the case that the price was not found.
Here is a simple test case for the two methods:
public static void main(String[] args) throws IOException {
String str = "https://www.amazon.es/MSI-Titan-GT73EVR-7RD-1027XES-Ordenador/dp/B078ZYX4R5/ref=sr_1_1?ie=UTF8&qid=1524239679&sr=8-1";
System.out.println(modifyUrl(str));
String html = "<div id =\" cerberus-data-metrics \"style =\" display: none; \"data-asin =\" B078ZYX4R5 \"data-as-price = \"1479.00\" data-asin-shipping = \"0\" data-asin-currency-code = \"EUR\" data-substitute-count = \"0\" data-device-type = \"WEB\" data-display-code = \"Asin is not eligible because it has a retail offer \"> </ div>";
extractPrice(html).ifPresent(System.out::println);
}
If you run this simple test case you will see on the console this output:
https://www.amazon.es/MSI-Titan-GT73EVR-7RD-1027XES-Ordenador/dp/ref=sr_1_1?ie=UTF8&qid=1524239679&sr=8-1
1479.00
Update
If you want to extract the reference from the URL, you can do it with code similar to the code used to extract the price. Here is a method which extracts a specific named group from a pattern:
private static Optional<String> extractNamedGroup(String str, Pattern pat, String reference) {
Matcher m = pat.matcher(str);
if (m.find()) {
return Optional.of(m.group(reference));
}
return Optional.empty();
}
Then you can use this method for extracting the reference and price:
private static Optional<String> extractReference(String str) {
Pattern pat = Pattern.compile("/(?<reference>[^/]+)(?=/ref)");
return extractNamedGroup(str, pat, "reference");
}
private static Optional<String> extractPrice(String html) {
Pattern pat = Pattern.compile("data-as-price\\s*=\\s*[\"'](?<price>.+?)[\"']", Pattern.MULTILINE);
return extractNamedGroup(html, pat, "price");
}
You can test the above methods with:
public static void main(String[] args) throws IOException {
String str = "https://www.amazon.es/MSI-Titan-GT73EVR-7RD-1027XES-Ordenador/dp/B078ZYX4R5/ref=sr_1_1?ie=UTF8&qid=1524239679&sr=8-1";
extractReference(str).ifPresent(System.out::println);
String html = "<div id =\" cerberus-data-metrics \"style =\" display: none; \"data-asin =\" B078ZYX4R5 \"data-as-price = \"1479.00\" data-asin-shipping = \"0\" data-asin-currency-code = \"EUR\" data-substitute-count = \"0\" data-device-type = \"WEB\" data-display-code = \"Asin is not eligible because it has a retail offer \"> </ div>";
extractPrice(html).ifPresent(System.out::println);
}
This will print:
B078ZYX4R5
1479.00
Update 2: Using URL
If you want to use the java.net.URL class to help you narrow down the search scope you can do it. But you cannot use this class to do the full extraction.
Since the token you want to extract is in the URL path you can extract the path and then apply the regular expression explained above to do the extraction.
Here is the sample code you can use to narrow down the search scope:
public static void main(String[] args) throws IOException {
String str = "https://www.amazon.es/MSI-Titan-GT73EVR-7RD-1027XES-Ordenador/dp/B078ZYX4R5/ref=sr_1_1?ie=UTF8&qid=1524239679&sr=8-1";
URL url = new URL(str);
extractReference(url.getPath() /* narrowing the search scope here */).ifPresent(System.out::println);
String html = "<div id =\" cerberus-data-metrics \"style =\" display: none; \"data-asin =\" B078ZYX4R5 \"data-as-price = \"1479.00\" data-asin-shipping = \"0\" data-asin-currency-code = \"EUR\" data-substitute-count = \"0\" data-device-type = \"WEB\" data-display-code = \"Asin is not eligible because it has a retail offer \"> </ div>";
extractPrice(html).ifPresent(System.out::println);
}
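Since the question asked for indexOf and substring specifically, here is a rough sketch of both tasks without regular expressions. It is not part of the solution above, just an alternative under the assumption that the attribute value is enclosed in double quotes and that the reference is the path segment directly before /ref. The names PriceBySubstring, attributeValue and removeReference are made up for illustration.

```java
import java.util.Optional;

public class PriceBySubstring {

    // Returns the double-quoted value that follows the given attribute name, if any.
    static Optional<String> attributeValue(String html, String attribute) {
        int start = html.indexOf(attribute);
        if (start < 0) return Optional.empty();
        int open = html.indexOf('"', start + attribute.length());
        if (open < 0) return Optional.empty();
        int close = html.indexOf('"', open + 1);
        if (close < 0) return Optional.empty();
        return Optional.of(html.substring(open + 1, close).trim());
    }

    // Removes the path segment directly in front of "/ref" (the product reference).
    static String removeReference(String url) {
        int ref = url.indexOf("/ref");
        if (ref < 0) return url;
        int segmentStart = url.lastIndexOf('/', ref - 1);
        if (segmentStart < 0) return url;
        return url.substring(0, segmentStart) + url.substring(ref);
    }

    public static void main(String[] args) {
        String html = "<div data-asin =\" B078ZYX4R5 \"data-as-price = \"1479.00\" data-asin-shipping = \"0\"> </ div>";
        attributeValue(html, "data-as-price").ifPresent(System.out::println);

        String url = "https://www.amazon.es/MSI-Titan-GT73EVR-7RD-1027XES-Ordenador/dp/B078ZYX4R5/ref=sr_1_1?ie=UTF8&qid=1524239679&sr=8-1";
        System.out.println(removeReference(url));
    }
}
```

This is more fragile than the regular expression (for example, it does not handle single-quoted attribute values), so treat it as a starting point only.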
Hi, I am trying to parse data from Yahoo Finance using Jsoup in Eclipse by selecting elements by their class with the code below.
This method has worked for me with other websites but will not work here. The attached link is the page I'm trying to parse. In this example I'm trying to parse the line containing 21.74; specifically, I want to extract the "21.74". I have tried selecting table elements but nothing seems to work. This is my first question, so any suggestions are much appreciated!
public static final String YAHOOLINK = new String("http://finance.yahoo.com/quote/MMM/key-statistics?p=");
private String yahooLink;
private Document rawYahooData;
private static String CLASSNAME = new String("W(100%) Pos(r)");
public YahooDataCollector(String aStockTicker){
yahooLink = new String(YAHOOLINK + aStockTicker);
try
{
rawYahooData = (Document) Jsoup.connect(yahooLink).timeout(10*1000).get();
Elements yahooElements = rawYahooData.getElementsByClass(CLASSNAME);
for(Element e : yahooElements)
{
System.out.println(e.text());
}
}
catch(IOException e)
{
System.out.println("Error Grabbing Raw Data For "+ aStockTicker);
}
}
I have a question about parsing XML. I use JDOM to parse my XML file, but I have a small problem.
A sample of my XML-File looks like this:
<IO name="Bus" type="Class">
<ResourceAttribute name="Bandwidth" type="KiloBitPerSecond" value="50" />
</IO>
Bus is an object instance of the class IO. The object has the properties name and type. In addition it has some attributes; in the sample, the attribute Bandwidth with the value 50 and the datatype KiloBitPerSecond.
So I loop over the file with:
for(Element packages : listPackages)
{
Map<String, Values> valueMap = new HashMap<String, Values>();
List<Element> objectInstanceList = packages.getChildren();
for(Element objects : objectInstanceList)
{
List<Element> listObjectClasses = objects.getChildren();
for(Element classes : listObjectClasses)
{
List<Element> listObjectAttributes = classes.getChildren();
for(Element objectAttributes : listObjectAttributes)
{
List<Attribute> listAttributes = objectAttributes.getAttributes();
for(Attribute attributes : listAttributes)
{
String name = attributes.getName();
String value = attributes.getValue();
AttributeType datatype = attributes.getAttributeType();
Values v = new Values(name, datatype, value);
valueMap.put(classes.getName(), v);
System.out.println(name + ":" + value);
}
}
}
}
//System.out.println(valueMap);
}
Values is a class which defines the object attribute:
public class Values{
private String name;
//private AttributeType datatype;
private String value;
That's the rest of the code. I have two questions about it. The first one has higher priority at the moment.
How do I get the values of the object (Attribute.Name = Bandwidth; Attribute.Value = 50)? Instead of that I get
name:Bus
type:Class
I thought about an additional for-loop, but the JDOM class Attribute doesn't have a method called getAttributes().
The second question has lower priority, because without an answer to question 1 I cannot go further. As you see in the sample, an attribute has 3 properties: name, type and value. How can I extract that triple out of the sample? JDOM seems to know just 2 properties for an Attribute, name and value.
Thanks a lot in advance; hopefully I managed to express myself.
Edit: Added an additional for-loop in it, so the output now is:
name:Bandwidth
type:KiloBitPerSecond
value:50
That means name is the name of that property and value is the value of name. I didn't know that. At least question one is clearer now and I can try working on 2, and the new information makes 2 clearer to me as well.
In XML, the opening tag of an element is enclosed between < and > (or />). After the < comes the name of the element, followed by a list of attributes in the format name="value". An element can be closed inline with /> or with a closing tag </[element name]>.
It would be preferable to parse your XML recursively instead of with nested for loops that are hard to read and maintain.
Here is how it could look:
@Test
public void parseXmlRec() throws JDOMException, IOException {
String xml = "<root>"
+ "<Package>"
+ "<IO name=\"Bus\" type=\"Class\">\r\n" +
" <ResourceAttribute name=\"Bandwidth\" type=\"KiloBitPerSecond\" value=\"50\" />\r\n" +
" </IO>"
+ "</Package>"
+ "</root>";
InputStream is = new ByteArrayInputStream(xml.getBytes());
SAXBuilder sb = new SAXBuilder();
Document document = sb.build(is);
is.close();
Element root = document.getRootElement();
List<Element> children = root.getChildren();
for(Element element : children) {
parseelement(element);
}
}
private void parseelement(Element element) {
System.out.println("Element: " + element.getName());
String name = element.getAttributeValue("name");
if(name != null) {
System.out.println("name: " + name);
}
String type = element.getAttributeValue("type");
if(type != null) {
System.out.println("type: " + type);
}
String value = element.getAttributeValue("value");
if(value != null) {
System.out.println("value: " + value);
}
List<Element> children = element.getChildren();
if(children != null) {
for(Element child : children) {
parseelement(child);
}
}
}
This outputs:
Element: Package
Element: IO
name: Bus
type: Class
Element: ResourceAttribute
name: Bandwidth
type: KiloBitPerSecond
value: 50
While parsing, check the name of each element and instantiate the corresponding objects. For that I would suggest writing a separate method to handle each element. For example:
void parsePackage(Element packageElement) { ... }
void parseIO(Element ioElement) { ... }
void parseResourceAttribute(Element resourceAttributeElement) { ... }
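To make the per-element idea a bit more concrete, here is a small self-contained sketch. Note that it uses the DOM parser that ships with the JDK (javax.xml.parsers and org.w3c.dom) instead of JDOM, only so that it runs without extra dependencies; with JDOM the structure would be the same, using getChildren() and getAttributeValue(). The ResourceAttribute holder class and the method names are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class IoXmlSketch {

    // Hypothetical holder for the name/type/value triple of one <ResourceAttribute>.
    static final class ResourceAttribute {
        final String name;
        final String type;
        final String value;

        ResourceAttribute(String name, String type, String value) {
            this.name = name;
            this.type = type;
            this.value = value;
        }
    }

    // Handles one <IO> element: collects its <ResourceAttribute> children.
    static List<ResourceAttribute> parseIO(Element io) {
        List<ResourceAttribute> result = new ArrayList<>();
        NodeList children = io.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Skip whitespace text nodes; dispatch on the element name.
            if (child instanceof Element && "ResourceAttribute".equals(child.getNodeName())) {
                result.add(parseResourceAttribute((Element) child));
            }
        }
        return result;
    }

    // Handles one <ResourceAttribute> element: reads its three XML attributes.
    static ResourceAttribute parseResourceAttribute(Element e) {
        return new ResourceAttribute(
                e.getAttribute("name"), e.getAttribute("type"), e.getAttribute("value"));
    }

    // Parses a document whose root element is <IO>.
    static List<ResourceAttribute> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return parseIO(doc.getDocumentElement());
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse XML", ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<IO name=\"Bus\" type=\"Class\">"
                + "<ResourceAttribute name=\"Bandwidth\" type=\"KiloBitPerSecond\" value=\"50\" />"
                + "</IO>";
        for (ResourceAttribute a : parse(xml)) {
            System.out.println(a.name + " / " + a.type + " / " + a.value);
        }
    }
}
```

This also covers the second question: name, type and value are three separate XML attributes of the ResourceAttribute element, so each one is read by its own name.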
I was wondering if anyone knows how to successfully parse the company name "Alcoa Inc." shown in the URL below. It would be much easier to show a picture but I do not have enough reputation. Any help would be appreciated.
http://www.google.com/finance?q=NYSE%3AAA&ei=LdwVUYC7Fp_YlgPBiAE
This is what I have tried so far using jsoup to parse the div class:
<div class="appbar-snippet-primary">
<span>Alcoa Inc.</span>
</div>
public Elements htmlParser(String url, String element, String elementType, String returnElement){
try {
Document doc = Jsoup.connect(url).get();
Document parse = Jsoup.parse(doc.html());
if (returnElement == null){
return parse.select(elementType + "." + element);
}
else {
return parse.select(elementType + "." + element + " " + returnElement);
}
} catch (IOException e) {
// the page could not be fetched: return an empty result
return new Elements();
}
}
public String htmlparseGoogleStocks(String url){
String pr = "pr";
String appbar_center = "appbar-snippet-primary";
String val = "val";
String span = "span";
String div = "div";
String td = "td";
Elements price_data;
Elements title_data;
Elements more_data;
price_data = htmlParser(url, pr, span, null);
title_data = htmlParser(url, appbar_center, div, span);
//more_data = htmlParser(url, val, td, null);
//String stockprice = price_data.text().toString();
String title = title_data.text().toString();
//System.out.println(more_data.text());
return title;
}
Myself, I'd analyze the page of interest's source HTML, and then just use JSoup to extract the information. For instance, using a very small JSoup program like so:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class GoogleFinance {
public static final String PAGE = "https://www.google.com/finance?q=NASDAQ:XONE";
public static void main(String[] args) throws IOException {
Document doc = Jsoup.connect(PAGE).get();
Elements title = doc.select("title");
System.out.println(title.text());
}
}
You get in return:
ExOne Co: NASDAQ:XONE quotes & news - Google Finance
It doesn't get much easier than that.