Error when parsing Google search results - Java

I am referencing the answer to How can you search Google Programmatically Java API to parse Google search results.
However, when I try that code, an error occurs.
How should I modify it?
import java.net.URLDecoder;
import java.net.URLEncoder;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JavaApplication22 {

    public static void main(String[] args) {
        String google = "http://www.google.com/search?q=";
        String search = "stackoverflow";
        String charset = "UTF-8";
        String userAgent = "ExampleBot 1.0 (+http://example.com/bot)"; // Change this to your company's name and bot homepage!
        Elements links = Jsoup.connect(google + URLEncoder.encode(search, charset)).userAgent(userAgent).get().select(".g>.r>a");
        for (Element link : links) {
            String title = link.text();
            String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
            url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
            if (!url.startsWith("http")) {
                continue; // Ads/news/etc.
            }
            System.out.println("Title: " + title);
            System.out.println("URL: " + url);
        }
    }
}
I guess the problem is with the libraries.
But I tried Ctrl+Shift+I, and it shows there is nothing to fix in the import statements.
Error
Exception in thread "main" java.lang.RuntimeException: Uncompilable
source code - unreported exception java.io.IOException; must be caught
or declared to be thrown at
javaapplication22.JavaApplication22.main(JavaApplication22.java:32)
How should I modify the code so that I can parse the Google Search result ?

Please replace your main method with the code below. Jsoup.connect(...).get() throws a checked java.io.IOException, so it must either be caught or declared to be thrown; the version below declares it on main:
public static void main(String[] args) throws UnsupportedEncodingException, IOException {
    String google = "http://www.google.com/search?q=";
    String search = "stackoverflow";
    String charset = "UTF-8";
    String userAgent = "ExampleBot 1.0 (+http://example.com/bot)"; // Change this to your company's name and bot homepage!
    Elements links = Jsoup.connect(google + URLEncoder.encode(search, charset)).userAgent(userAgent).get().select(".g>.r>a");
    for (Element link : links) {
        String title = link.text();
        String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
        url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
        if (!url.startsWith("http")) {
            continue; // Ads/news/etc.
        }
        System.out.println("Title: " + title);
        System.out.println("URL: " + url);
    }
}
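Note that you will also need to import java.io.IOException and java.io.UnsupportedEncodingException for that signature to compile. If you would rather not declare the exceptions on main, an alternative (a minimal sketch of my own, using the same class and imports as your question plus java.io.IOException) is to catch them inside the method instead:

public static void main(String[] args) {
    String google = "http://www.google.com/search?q=";
    String search = "stackoverflow";
    String charset = "UTF-8";
    String userAgent = "ExampleBot 1.0 (+http://example.com/bot)";
    try {
        Elements links = Jsoup.connect(google + URLEncoder.encode(search, charset))
                .userAgent(userAgent)
                .get()
                .select(".g>.r>a");
        for (Element link : links) {
            String title = link.text();
            String url = link.absUrl("href");
            url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
            if (!url.startsWith("http")) {
                continue; // Ads/news/etc.
            }
            System.out.println("Title: " + title);
            System.out.println("URL: " + url);
        }
    } catch (IOException e) {
        // UnsupportedEncodingException is a subclass of IOException, so one catch block covers
        // URLEncoder.encode, URLDecoder.decode and the Jsoup network call.
        e.printStackTrace();
    }
}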

Related

How to correctly parse HTML in Java

I'm trying to extract information from websites using Jsoup, but I don't get the same HTML code as in my browser.
I tried to use .userAgent(), but it didn't work. I currently use the following function, which works for Amazon.com:
public static String getHTML(String urlToRead) throws Exception {
    StringBuilder result = new StringBuilder();
    URL url = new URL(urlToRead);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 5.1; rv:19.0) Gecko/20100101 Firefox/19.0");
    conn.setRequestMethod("GET");
    BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
    String line;
    while ((line = rd.readLine()) != null) {
        result.append(line);
    }
    rd.close();
    return result.toString();
}
The website I'm trying to parse is http://www.asos.com/, but the price of the product is always missing.
I found this topic, which is pretty close to mine, but I would like to do it using only Java and no external app.
So after playing around with the site a little, I came up with a solution.
The site uses API responses to get the prices for each item; this is why you are not getting the prices in the HTML you receive from Jsoup. Unfortunately there's a little more code than first expected, and you'll have to work out how it should know which product ID to use instead of the hardcoded value. Other than that, however, the following code should work in your case.
I've included comments that hopefully explain each step, and I recommend taking a look at the API response, as there may be some other data you require. In fact, the same may apply to the product details and description, as further data will need to be parsed out of the elementById field.
Good luck, and let me know if you need any further help!
import org.json.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.*;
import org.jsoup.select.Elements;
import java.io.IOException;
public class Main
{
final String productID = "8513070";
final String productURL = "http://www.asos.com/prd/";
final Product product = new Product();
public static void main( String[] args )
{
new Main();
}
private Main()
{
getProductDetails( productURL, productID );
System.out.println( "ID: " + product.productID + ", Name: " + product.productName + ", Price: " + product.productPrice );
}
private void getProductDetails( String url, String productID )
{
try
{
// Append the product url and the product id to retrieve the product HTML
final String appendedURL = url + productID;
// Using Jsoup we'll connect to the url and get the HTML
Document document = Jsoup.connect( appendedURL ).get();
// We parse the HTML only looking for the product section
Element elementById = document.getElementById( "asos-product" );
// To simply get the title we look for the H1 tag
Elements h1 = elementById.getElementsByTag( "h1" );
// Because more than one H1 tag is returned we only want the tag that isn't empty
if ( !h1.text().isEmpty() )
{
// Add all data to Product object
product.productID = productID;
product.productName = h1.text().trim();
product.productPrice = getProductPrice(productID);
}
}
catch ( IOException e )
{
e.printStackTrace();
}
}
private String getProductPrice( String productID )
{
try
{
// Append the api url and the product id to retrieve the product price JSON document
final String apiURL = "http://www.asos.com/api/product/catalogue/v2/stockprice?productIds=" + productID + "&store=COM";
// Using Jsoup again we connect to the URL ignoring the content type and retrieve the body
String jsonDoc = Jsoup.connect( apiURL ).ignoreContentType( true ).execute().body();
// As its JSON we want to parse the JSONArray until we get to the current price and return it.
JSONArray jsonArray = new JSONArray( jsonDoc );
JSONObject currentProductPriceObj = jsonArray
.getJSONObject( 0 )
.getJSONObject( "productPrice" )
.getJSONObject( "current" );
return currentProductPriceObj.getString( "text" );
}
catch ( IOException e )
{
e.printStackTrace();
}
return "";
}
// Simple Product object to store the data
class Product
{
String productID;
String productName;
String productPrice;
}
}
Oh, and you'll also need org.json to parse the JSON response from the API.
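In case the org.json part is unfamiliar, here is a minimal, self-contained sketch of how those classes walk the structure the code above reads (productPrice, then current, then text). The JSON literal is a made-up example in that shape, not a real response from the ASOS API:

import org.json.JSONArray;

public class PriceJsonSketch {
    public static void main(String[] args) {
        // Hypothetical document shaped like the fields getProductPrice() reads; not real API output.
        String jsonDoc = "[{\"productId\": 8513070, \"productPrice\": {\"current\": {\"text\": \"£25.00\"}}}]";
        JSONArray jsonArray = new JSONArray(jsonDoc);
        String price = jsonArray.getJSONObject(0)      // first (and only) product entry
                .getJSONObject("productPrice")         // its price block
                .getJSONObject("current")              // the current price
                .getString("text");                    // formatted price string
        System.out.println(price);                     // £25.00
    }
}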

Dropbox Core API JAVA Authorization Code

Using the Dropbox Core API tutorial, I am able to upload a file.
However, my question is an exact replica of this SO post. That is, once I have my authorization code and comment out the user-auth lines (so that I don't have to manually re-authorize the app every time I use Dropbox), I get the following errors:
Exception in thread "main" com.dropbox.core.DbxException$BadRequest: {"error_description": "code has already been used", "error": "invalid_grant"}
OR
Exception in thread "main" com.dropbox.core.DbxException$BadRequest: {"error_description": "code has expired (within the last hour)", "error": "invalid_grant"}
I am positive I have the correct authorization code.
I hope that I'm missing something; otherwise, what's the point of an API if you have to induce manual intervention every time you use it?
Edit: My Exact Code (keys have been scrambled)
import com.dropbox.core.*;
import java.io.*;
import java.util.Locale;
public class DropboxUpload {
public static void main(String[] args) throws IOException, DbxException {
// Get your app key and secret from the Dropbox developers website.
final String APP_KEY = "2po9b49whx74h67";
final String APP_SECRET = "m98f734hnr92kmh";
DbxAppInfo appInfo = new DbxAppInfo(APP_KEY, APP_SECRET);
DbxRequestConfig config = new DbxRequestConfig("JavaTutorial/1.0",
Locale.getDefault().toString());
DbxWebAuthNoRedirect webAuth = new DbxWebAuthNoRedirect(config, appInfo);
// Have the user sign in and authorize your app.
//String authorizeUrl = webAuth.start();
//System.out.println("1. Go to: " + authorizeUrl);
//System.out.println("2. Click \"Allow\" (you might have to log in first)");
//System.out.println("3. Copy the authorization code.");
//String code = new BufferedReader(new InputStreamReader(System.in)).readLine().trim();
DbxAuthFinish authFinish = webAuth.finish("VtwxzitUoI8DDDLx0PlLut5Gjpw3");
String accessToken = authFinish.accessToken;
DbxClient client = new DbxClient(config, accessToken);
System.out.println("Linked account: " + client.getAccountInfo().displayName);
File inputFile = new File("/home/dropboxuser/Documents/test.txt");
FileInputStream inputStream = new FileInputStream(inputFile);
try {
DbxEntry.File uploadedFile = client.uploadFile("/Public/test.txt",
DbxWriteMode.add(), inputFile.length(), inputStream);
System.out.println("Uploaded: " + uploadedFile.toString());
} finally {
inputStream.close();
}
DbxEntry.WithChildren listing = client.getMetadataWithChildren("/");
System.out.println("Files in the root path:");
for (DbxEntry child : listing.children) {
System.out.println(" " + child.name + ": " + child.toString());
}
FileOutputStream outputStream = new FileOutputStream("test.txt");
try {
DbxEntry.File downloadedFile = client.getFile("/Public/test.txt", null,
outputStream);
System.out.println("Metadata: " + downloadedFile.toString());
} finally {
outputStream.close();
}
}
}
You should be storing and reusing the access token, not the authorization code.
So after doing this once:
String accessToken = authFinish.accessToken;
You should just replace the whole thing with
String accessToken = "<the one you already got>";
BTW, if you just need an access token for your own account, you can generate one with the click of a button! See https://www.dropbox.com/developers/blog/94/generate-an-access-token-for-your-own-account.
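For example, a minimal sketch of the token-reuse version (assuming the same Dropbox Core SDK 1.x classes used in your code; the class name and the placeholder token are mine):

import com.dropbox.core.DbxClient;
import com.dropbox.core.DbxException;
import com.dropbox.core.DbxRequestConfig;
import java.util.Locale;

public class DropboxWithStoredToken {
    public static void main(String[] args) throws DbxException {
        // Paste the access token you already obtained (or generated on the app console) here.
        final String ACCESS_TOKEN = "<the one you already got>";
        DbxRequestConfig config = new DbxRequestConfig("JavaTutorial/1.0",
                Locale.getDefault().toString());
        // No DbxWebAuthNoRedirect / webAuth.finish(...) needed any more.
        DbxClient client = new DbxClient(config, ACCESS_TOKEN);
        System.out.println("Linked account: " + client.getAccountInfo().displayName);
        // ... upload / download calls exactly as in your existing code ...
    }
}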

StringIndexOutOfBoundsException while trying to run the Google Search API

I am trying to run the Google search code from the SO link below:
How can you search Google Programmatically Java API
Here is my code:
public class RetrieveArticles {

    public static void main(String[] args) throws UnsupportedEncodingException, IOException {
        // TODO Auto-generated method stub
        String google = "http://www.google.com/news?&start=1&q=";
        String search = "Police Violence in USA";
        String charset = "UTF-8";
        String userAgent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"; // Change this to your company's name and bot homepage!
        Elements links = Jsoup.connect(google + URLEncoder.encode(search, charset)).userAgent(userAgent).get().children();
        for (Element link : links) {
            String title = link.text();
            String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
            url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
            if (!url.startsWith("http")) {
                continue; // Ads/news/etc.
            }
            System.out.println("Title: " + title);
            System.out.println("URL: " + url);
        }
    }
}
When I try to run this, I get the error below. Can anyone please help me fix it?
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.substring(String.java:1911)
at google.api.search.RetrieveArticles.main(RetrieveArticles.java:34)
Thanks in advance.
The problem is here:
url.substring(url.indexOf('=') + 1, url.indexOf('&'))
Either url.indexOf('=') or url.indexOf('&') returned -1, which is an illegal argument to substring.
You should validate the url you are parsing before assuming that it contains '=' and '&'.
Add System.out.println(url); before the line
url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
Then you will see whether the url string actually contains '=' and '&' or not.
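One way to make that part safe (a hypothetical helper of my own, not code from the linked answer) is to check both indexOf results before calling substring, and simply skip links that are not in the /url?q=<url>&... format:

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class UrlParamSketch {

    // Returns the decoded value between the first '=' and the first '&', or null if either is missing.
    static String extractFirstParam(String url) throws UnsupportedEncodingException {
        int eq = url.indexOf('=');
        int amp = url.indexOf('&');
        if (eq < 0 || amp < 0 || amp <= eq) {
            return null; // Not in the expected "/url?q=<url>&sa=U&ei=..." format
        }
        return URLDecoder.decode(url.substring(eq + 1, amp), "UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(extractFirstParam("http://www.google.com/url?q=http://example.com&sa=U")); // http://example.com
        System.out.println(extractFirstParam("http://www.google.com/imghp"));                         // null -> skip this link
    }
}

In the loop you would then call the helper and continue; whenever it returns null, instead of letting substring throw.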

Using Jsoup, how can I fetch all the information residing in each link?

package com.muthu;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.jsoup.select.NodeVisitor;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import org.jsoup.nodes.*;
public class TestingTool
{
public static void main(String[] args) throws IOException
{
Validate.isTrue(args.length == 0, "usage: supply url to fetch");
String url = "http://www.stackoverflow.com/";
print("Fetching %s...", url);
Document doc = Jsoup.connect(url).get();
Elements links = doc.select("a[href]");
System.out.println(doc.text());
Elements tags=doc.getElementsByTag("div");
String alls=doc.text();
System.out.println("\n");
for (Element link : links)
{
print(" %s ", link.attr("abs:href"), trim(link.text(), 35));
}
BufferedWriter bw = new BufferedWriter(new FileWriter(new File("C:/tool/linknames.txt")));
for (Element link : links) {
bw.write("Link: "+ link.text().trim());
bw.write(System.getProperty("line.separator"));
}
bw.flush();
bw.close();
}
private static void print(String msg, Object... args) {
System.out.println(String.format(msg, args));
}
private static String trim(String s, int width) {
if (s.length() > width)
return s.substring(0, width-1) + ".";
else
return s;
}
}
If you connect to a URL it will only parse the current page. But you can 1) connect to a URL, 2) parse the information you need, 3) select all further links, 4) connect to them, and 5) continue this as long as there are new links.
Considerations:
You need a list (or something similar) in which you store the links you have already parsed.
You have to decide whether you need only the links of this page or external ones too.
You have to skip pages like "about", "contact", etc.
Edit:
(Note: you will have to add some changes / error-handling code.)
List<String> visitedUrls = new ArrayList<>(); // Store all links you've already visited
public void visitUrl(String url) throws IOException
{
url = url.toLowerCase(); // now its case insensitive
if( !visitedUrls.contains(url) ) // Do this only if not visted yet
{
visitedUrls.add(url); // Remember this URL so it is not visited again
Document doc = Jsoup.connect(url).get(); // Connect to Url and parse Document
/* ... Select your Data here ... */
Elements nextLinks = doc.select("a[href]"); // Select next links - add more restriction!
for( Element next : nextLinks ) // Iterate over all Links
{
visitUrl(next.absUrl("href")); // Recursive call for all next Links
}
}
}
You have to add more restrictions / checks at the part where next links are selected (maybe you want to skip / ignore some); and some error handling.
Edit 2:
To skip ignored links you can use this:
Create a Set / List / whatever, where you store ignored keywords
Fill it with those keywords
Before you call the visitUrl() method with the new Link to parse, you check if this new Url contains any of the ignored keywords. If it contains at least one it will be skipped.
I modified the example a bit to do so (but it's not tested yet!).
List<String> visitedUrls = new ArrayList<>(); // Store all links you've already visited
Set<String> ignore = new HashSet<>(); // Store all keywords you want ignore
// ...
/*
* Add keywords to the ignorelist. Each link that contains one of this
* words will be skipped.
*
* Do this in eg. constructor, static block or a init method.
*/
ignore.add(".twitter.com");
// ...
public void visitUrl(String url) throws IOException
{
url = url.toLowerCase(); // Now its case insensitive
if( !visitedUrls.contains(url) ) // Do this only if not visted yet
{
Document doc = Jsoup.connect(url).get(); // Connect to Url and parse Document
/* ... Select your Data here ... */
Elements nextLinks = doc.select("a[href]"); // Select next links - add more restriction!
for( Element next : nextLinks ) // Iterate over all Links
{
boolean skip = false; // If false: parse the url, if true: skip it
final String href = next.absUrl("href"); // Select the 'href' attribute -> next link to parse
for( String s : ignore ) // Iterate over all ignored keywords - maybe there's a better solution for this
{
if( href.contains(s) ) // If the url contains ignored keywords it will be skipped
{
skip = true;
break;
}
}
if( !skip )
visitUrl(next.absUrl("href")); // Recursive call for all next Links
}
}
}
Parsing the next link is done by this:
final String href = next.absUrl("href");
/* ... */
visitUrl(next.absUrl("href"));
But possibly you should add some more stop-conditions to this part.
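For instance, a maximum crawl depth and a same-host check are common stop conditions. Below is my own rough sketch of that idea; the CrawlSketch class name, the MAX_DEPTH constant and the allowedHost parameter are illustrative additions, not part of the answer above:

import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CrawlSketch {

    private static final int MAX_DEPTH = 3;              // Stop condition 1: don't recurse forever
    private final List<String> visitedUrls = new ArrayList<>();

    public void visitUrl(String url, String allowedHost, int depth) throws IOException {
        url = url.toLowerCase();
        if (depth > MAX_DEPTH || visitedUrls.contains(url)) {
            return;                                       // Too deep, or already visited
        }
        String host = URI.create(url).getHost();
        if (host == null || !host.endsWith(allowedHost)) {
            return;                                       // Stop condition 2: stay on the same site
        }
        visitedUrls.add(url);                             // Remember the URL before recursing
        Document doc = Jsoup.connect(url).get();
        /* ... select your data here ... */
        for (Element next : doc.select("a[href]")) {
            visitUrl(next.absUrl("href"), allowedHost, depth + 1);
        }
    }
}

You would start it with something like new CrawlSketch().visitUrl("http://www.stackoverflow.com/", "stackoverflow.com", 0);.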

Extracting anchor tags from HTML using Java

I have several anchor tags in a text,
Input: <a href="http://stackoverflow.com" >Take me to StackOverflow</a>
Output:
http://stackoverflow.com
How can I find all those input strings and convert them to the output string in Java, without using a third-party API?
There are classes in the core API that you can use to get all href attributes from anchor tags (if present!):
import java.io.*;
import java.util.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class HtmlParseDemo {
public static void main(String [] args) throws Exception {
String html =
"<a href=\"http://stackoverflow.com\" >Take me to StackOverflow</a> " +
"<!-- " +
"<a href=\"http://ignoreme.com\" >...</a> " +
"--> " +
"<a href=\"http://www.google.com\" >Take me to Google</a> " +
"<a>NOOOoooo!</a> ";
Reader reader = new StringReader(html);
HTMLEditorKit.Parser parser = new ParserDelegator();
final List<String> links = new ArrayList<String>();
parser.parse(reader, new HTMLEditorKit.ParserCallback(){
public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
if(t == HTML.Tag.A) {
Object link = a.getAttribute(HTML.Attribute.HREF);
if(link != null) {
links.add(String.valueOf(link));
}
}
}
}, true);
reader.close();
System.out.println(links);
}
}
which will print:
[http://stackoverflow.com, http://www.google.com]
public static void main(String[] args) {
    String test = "qazwsx<a href=\"http://stackoverflow.com\" >Take me to StackOverflow</a>fdgfdhgfd"
            + "<a href=\"http://stackoverflow.com\" >Take me to StackOverflow2</a>dcgdf";
    String regex = "<a href=(\"[^\"]*\")[^<]*</a>";
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(test);
    System.out.println(m.replaceAll("$1"));
}
NOTE: All of Andrzej Doyle's points are valid; if you have anything more than the simple Y in your input, and you are sure it is parsable HTML, then you are better off with an HTML parser.
To summarize:
The regex I posted doesn't work if you have an <a> inside a comment. (You can treat it as a special case.)
It doesn't work if you have other attributes in the <a> tag. (Again, you can treat it as a special case.)
There are many other cases where regex won't work, and you cannot cover all of them with regex, since HTML is not a regular language.
However, if your requirement is always to replace Y with "X" without considering the context, then the code I've posted will work.
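If you do stay with regex and want to tolerate extra attributes in the <a> tag (the second special case above), a slightly more permissive pattern is possible. This is my own sketch, not the answer's regex, and it still inherits every limitation listed above:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorRegexSketch {
    public static void main(String[] args) {
        String test = "<a class=\"ext\" href=\"http://stackoverflow.com\" target=\"_blank\">Take me to StackOverflow</a>";
        // [^>]* keeps the match inside the opening tag, so href may be preceded or followed by other attributes.
        Pattern p = Pattern.compile("<a\\b[^>]*\\bhref=\"([^\"]*)\"[^>]*>(.*?)</a>", Pattern.DOTALL);
        Matcher m = p.matcher(test);
        while (m.find()) {
            System.out.println(m.group(1)); // http://stackoverflow.com
            System.out.println(m.group(2)); // Take me to StackOverflow
        }
    }
}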
You can use JSoup
String html = "<p>An <a href=\"http://stackoverflow.com\" >Take me to StackOverflow</a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String linkHref = link.attr("href"); // "http://stackoverflow.com"
Also see the Jsoup documentation for further examples.
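If you need every link rather than just the first one, the same Jsoup approach extends naturally. A small sketch of my own, fetching a live page (swap in whichever URL or HTML string you actually have):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupLinksSketch {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://stackoverflow.com").get();
        Elements links = doc.select("a[href]");            // every anchor that has an href attribute
        for (Element link : links) {
            System.out.println(link.attr("abs:href") + " -> " + link.text());
        }
    }
}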
The HtmlParseDemo example above works perfectly; if you want to parse an HTML document fetched from a URL, say, instead of concatenated strings, write something like the following to complement the code above.
The existing code has been modified into HtmlParser.java (based on HtmlParseDemo.java above), with the complementing HtmlPage.java below it. HtmlPage reads its target from an HtmlPage.properties resource bundle, whose main.url property is:
main.url=http://www.whatever.com/
That way you can just parse whatever URL you're after. :-)
Happy coding :-D
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
public class HtmlParser
{
public static void main(String[] args) throws Exception
{
String html = HtmlPage.getPage();
Reader reader = new StringReader(html);
HTMLEditorKit.Parser parser = new ParserDelegator();
final List<String> links = new ArrayList<String>();
parser.parse(reader, new HTMLEditorKit.ParserCallback()
{
public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
{
if (t == HTML.Tag.A)
{
Object link = a.getAttribute(HTML.Attribute.HREF);
if (link != null)
{
links.add(String.valueOf(link));
}
}
}
}, true);
reader.close();
// create the header
System.out.println("<html>\n<head>\n <title>Link City</title>\n</head>\n<body>");
// spit out the links and create href
for (String l : links)
{
System.out.print(" " + l + "\n");
}
// create footer
System.out.println("</body>\n</html>");
}
}
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.StringWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ResourceBundle;
public class HtmlPage
{
public static String getPage()
{
StringWriter sw = new StringWriter();
ResourceBundle bundle = ResourceBundle.getBundle(HtmlPage.class.getName().toString());
try
{
URL url = new URL(bundle.getString("main.url"));
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET");
connection.setDoOutput(true);
InputStream content = (InputStream) connection.getInputStream();
BufferedReader in = new BufferedReader(new InputStreamReader(content));
String line;
while ((line = in.readLine()) != null)
{
sw.append(line).append("\n");
}
} catch (Exception e)
{
e.printStackTrace();
}
return sw.getBuffer().toString();
}
}
For example, this will output links from http://ebay.com.au/ if viewed in a browser.
This is a subset, as there are a lot of links
Link City
#mainContent
http://realestate.ebay.com.au/
The most robust way (as has been suggested already) is to use regular expressions (java.util.regex), if you are required to build this without using third-party libs.
The alternative is to parse the HTML as XML, either using a SAX parser to capture and handle each instance of an "a" element, or as a DOM Document that you then search using XPath (see http://download.oracle.com/javase/6/docs/api/javax/xml/parsers/package-summary.html). This is problematic, though, since it requires the HTML page to be fully XML-compliant markup, a very dangerous assumption and not an approach I would recommend, since most "real" HTML pages are not XML compliant.
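For completeness, here is a rough sketch of the DOM + XPath variant just described. It is my own illustration, and, as warned above, it only works when the input really is well-formed XML/XHTML:

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathAnchorSketch {
    public static void main(String[] args) throws Exception {
        // Must be well-formed XML; typical "real" HTML will make the parser fail.
        String xhtml = "<html><body><a href=\"http://stackoverflow.com\">Take me to StackOverflow</a></body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hrefs = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);
        for (int i = 0; i < hrefs.getLength(); i++) {
            System.out.println(hrefs.item(i).getNodeValue()); // http://stackoverflow.com
        }
    }
}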
Still, I would recommend also looking at existing frameworks out there built for this purpose (like JSoup, also mentioned above). No need to reinvent the wheel.
