How to correctly parse HTML in Java

I'm trying to extract information from websites using Jsoup, but I don't get the same HTML as in my browser.
I tried setting .userAgent(), but it didn't help. I currently use the following function, which works for Amazon.com:
public static String getHTML(String urlToRead) throws Exception {
StringBuilder result = new StringBuilder();
URL url = new URL(urlToRead);
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 5.1; rv:19.0) Gecko/20100101 Firefox/19.0");
conn.setRequestMethod("GET");
BufferedReader rd = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
String line;
while ((line = rd.readLine()) != null) {
result.append(line);
}
rd.close();
return result.toString();
}
The website I'm trying to parse is http://www.asos.com/, but the price of the product is always missing.
I found this topic, which is pretty close to mine, but I would like to do it using only Java and no external app.

After playing around with the site for a while, I came up with a solution.
The site loads the price of each item from a separate API call, which is why the prices are missing from the HTML that Jsoup receives. Unfortunately, it takes a little more code than first expected, and you will have to work out how to determine the product ID instead of using the hardcoded value. Other than that, the following code should work in your case.
I've included comments that hopefully explain each step. I recommend taking a look at the API response, as there may be other data you need. The same may apply to the product details and description, which would have to be parsed out of the elementById element.
Good luck and let me know if you need any further help!
import org.json.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.*;
import org.jsoup.select.Elements;
import java.io.IOException;
public class Main
{
final String productID = "8513070";
final String productURL = "http://www.asos.com/prd/";
final Product product = new Product();
public static void main( String[] args )
{
new Main();
}
private Main()
{
getProductDetails( productURL, productID );
System.out.println( "ID: " + product.productID + ", Name: " + product.productName + ", Price: " + product.productPrice );
}
private void getProductDetails( String url, String productID )
{
try
{
// Append the product url and the product id to retrieve the product HTML
final String appendedURL = url + productID;
// Using Jsoup we'll connect to the url and get the HTML
Document document = Jsoup.connect( appendedURL ).get();
// We parse the HTML only looking for the product section
Element elementById = document.getElementById( "asos-product" );
// To simply get the title we look for the H1 tag
Elements h1 = elementById.getElementsByTag( "h1" );
// Because more than one H1 tag is returned we only want the tag that isn't empty
if ( !h1.text().isEmpty() )
{
// Add all data to Product object
product.productID = productID;
product.productName = h1.text().trim();
product.productPrice = getProductPrice(productID);
}
}
catch ( IOException e )
{
e.printStackTrace();
}
}
private String getProductPrice( String productID )
{
try
{
// Append the api url and the product id to retrieve the product price JSON document
final String apiURL = "http://www.asos.com/api/product/catalogue/v2/stockprice?productIds=" + productID + "&store=COM";
// Using Jsoup again we connect to the URL ignoring the content type and retrieve the body
String jsonDoc = Jsoup.connect( apiURL ).ignoreContentType( true ).execute().body();
// As its JSON we want to parse the JSONArray until we get to the current price and return it.
JSONArray jsonArray = new JSONArray( jsonDoc );
JSONObject currentProductPriceObj = jsonArray
.getJSONObject( 0 )
.getJSONObject( "productPrice" )
.getJSONObject( "current" );
return currentProductPriceObj.getString( "text" );
}
catch ( IOException e )
{
e.printStackTrace();
}
return "";
}
// Simple Product object to store the data
class Product
{
String productID;
String productName;
String productPrice;
}
}
Oh, and you'll also need the org.json library to parse the JSON response from the API.
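One way to avoid the hardcoded product ID is to pull it out of the product URL. The following is only a sketch: it assumes product URLs have the shape ".../prd/<digits>", as in the productURL constant above, and the class name ProductIdExtractor is made up for illustration. The real site may use other URL shapes.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProductIdExtractor {
    // Assumed URL shape: ".../prd/<digits>", optionally followed by '/', '?' or '#'.
    private static final Pattern PRODUCT_ID = Pattern.compile("/prd/(\\d+)(?:[/?#]|$)");

    /** Returns the numeric product ID, or an empty Optional if the URL does not match. */
    static Optional<String> extract(String productUrl) {
        if (productUrl == null) {
            return Optional.empty();
        }
        Matcher m = PRODUCT_ID.matcher(productUrl);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extract("http://www.asos.com/prd/8513070").orElse("not found")); // 8513070
        System.out.println(extract("http://www.asos.com/women/").orElse("not found"));      // not found
    }
}
```

The extracted ID could then be passed to getProductDetails() in place of the productID field.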

Related

Parsing Google search result Error

I am following the answer to this question to parse Google search results:
How can you search Google Programmatically Java API
However, when I try the code, an error occurs.
How should I modify it?
import java.net.URLDecoder;
import java.net.URLEncoder;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements ;
public class JavaApplication22 {
public static void main(String[] args) {
String google = "http://www.google.com/search?q=";
String search = "stackoverflow";
String charset = "UTF-8";
String userAgent = "ExampleBot 1.0 (+http://example.com/bot)"; // Change this to your company's name and bot homepage!
Elements links = Jsoup.connect(google + URLEncoder.encode(search, charset)).userAgent(userAgent).get().select(".g>.r>a");
for (Element link : links) {
String title = link.text();
String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
if (!url.startsWith("http")) {
continue; // Ads/news/etc.
}
System.out.println("Title: " + title);
System.out.println("URL: " + url);
}
}
}
I guessed it might be a problem with the libraries,
but when I tried Ctrl+Shift+I, it reported that there was nothing to fix in the import statements.
Error:
Exception in thread "main" java.lang.RuntimeException: Uncompilable
source code - unreported exception java.io.IOException; must be caught
or declared to be thrown at
javaapplication22.JavaApplication22.main(JavaApplication22.java:32)
How should I modify the code so that I can parse the Google search results?

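Replace your main method with the code below. It declares the checked exceptions that the compiler is complaining about (you will also need to import java.io.IOException and java.io.UnsupportedEncodingException):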
public static void main(String[] args) throws UnsupportedEncodingException, IOException {
String google = "http://www.google.com/search?q=";
String search = "stackoverflow";
String charset = "UTF-8";
String userAgent = "ExampleBot 1.0 (+http://example.com/bot)"; // Change this to your company's name and bot homepage!
Elements links = Jsoup.connect(google + URLEncoder.encode(search, charset)).userAgent(userAgent).get().select(".g>.r>a");
for (Element link : links) {
String title = link.text();
String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
if (!url.startsWith("http")) {
continue; // Ads/news/etc.
}
System.out.println("Title: " + title);
System.out.println("URL: " + url);
}
}
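The compiler error comes from Java's checked exceptions: a method that can throw one must either declare it with throws or catch it. The sketch below shows both options using only URLEncoder, so it runs without Jsoup. The class and method names are invented for this illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SearchUrlBuilder {
    private static final String GOOGLE = "http://www.google.com/search?q=";

    // Option 1: declare the checked exception and let the caller deal with it.
    static String buildDeclaring(String query) throws UnsupportedEncodingException {
        return GOOGLE + URLEncoder.encode(query, "UTF-8");
    }

    // Option 2: catch it locally and rethrow it as an unchecked exception.
    static String buildCatching(String query) {
        try {
            return GOOGLE + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is not available", e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(buildDeclaring("stackoverflow"));  // http://www.google.com/search?q=stackoverflow
        System.out.println(buildCatching("stack overflow"));  // http://www.google.com/search?q=stack+overflow
    }
}
```

The same choice applies to the IOException thrown by Jsoup's get(): declare it on main, as the answer does, or wrap the call in try/catch.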

Read JSP page and Write HTML file UTF-8 issues

I want to read a JSP page and write it to an HTML page. I have three methods in my parse class: readHTMLBody(), WriteNewHTML(), and ZipToEpub().
When I call these methods from within the parse class, they all work. But when they are called from a JSP or a web service, UTF-8 characters show up as "?" in readHTMLBody(). How can I fix it?
public String readHTMLBody() {
try {
String url = "http://localhost:8080/Library/part.jsp";
Document doc = Jsoup.parse((new URL(url)).openStream(), "utf-8", url);
String body = doc.html();
Elements title = doc.select("xxx");
linkURI = title.toString();
linkURI = linkURI.replaceAll("<xxx>", "");
linkURI = linkURI.replaceAll("</xxx>", "");
linkURI = linkURI.replaceAll("\\s", "");
resultBody = body;
resultBody = resultBody.replaceAll("part/" + linkURI + "/assets/", "assets/");
} catch (IOException e) {
}
return resultBody;
}

StringIndexOutofBoundsException while trying to run Google Search Api

I am trying to run the Google search code from the SO question below:
How can you search Google Programmatically Java API
Here is my code:
public class RetrieveArticles {
public static void main(String[] args) throws UnsupportedEncodingException, IOException {
// TODO Auto-generated method stub
String google = "http://www.google.com/news?&start=1&q=";
String search = "Police Violence in USA";
String charset = "UTF-8";
String userAgent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"; // Change this to your company's name and bot homepage!
Elements links = Jsoup.connect(google + URLEncoder.encode(search, charset)).userAgent(userAgent).get().children();
for (Element link : links) {
String title = link.text();
String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
url = URLDecoder.decode(url.substring(url.indexOf('=') +1, url.indexOf('&')), "UTF-8");
if (!url.startsWith("http")) {
continue; // Ads/news/etc.
}
System.out.println("Title: " + title);
System.out.println("URL: " + url);
}
}
}
When I try to run this, I get the error below. Can anyone please help me fix it?
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.substring(String.java:1911)
at google.api.search.RetrieveArticles.main(RetrieveArticles.java:34)
Thanks in advance.

The problem is here:
url.substring(url.indexOf('=') +1, url.indexOf('&'))
url.indexOf('&') returns -1 when the URL contains no '&', and -1 is not a valid end index for substring. (A missing '=' is also a problem: indexOf returns -1, so the substring silently starts at index 0.)
You should validate the URL you are parsing before assuming that it contains = and &.
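A minimal sketch of such a validation follows. The class and method names are made up, and treating a missing '&' as "take the rest of the string" is an assumption of this sketch, not something the original code did. It returns null for links that do not have the expected shape, for example the empty string that absUrl("href") yields for elements without an href attribute.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class GoogleLinkParser {
    /**
     * Returns the decoded value between the first '=' and the next '&',
     * or null when the link contains no '='.
     */
    static String extractTarget(String url) {
        if (url == null) {
            return null;
        }
        int start = url.indexOf('=');
        if (start < 0) {
            return null; // no "q=<url>" part at all
        }
        int end = url.indexOf('&', start + 1);
        if (end < 0) {
            end = url.length(); // assumption: no further parameters, use the rest
        }
        try {
            return URLDecoder.decode(url.substring(start + 1, end), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractTarget("http://www.google.com/url?q=http%3A%2F%2Fexample.com%2F&sa=U")); // http://example.com/
        System.out.println(extractTarget("")); // null
    }
}
```

In the loop from the question, a null result would simply be skipped with continue.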

Add System.out.println(url); before the line
url = URLDecoder.decode(url.substring(url.indexOf('=') +1, url.indexOf('&')), "UTF-8");
Then you will see whether the url string actually contains '=' and '&'.

Get URL content with Basic Authentication with Java and async-http-client

I am writing a Java lib and need to perform a request to a URL - currently using async-http-client from ning - and fetch its content. So I have a get method that returns a String with the content of the fetched document. However, to be able to get it, I must perform HTTP basic authentication, and I'm not succeeding at this in my Java code:
public String get(String token) throws IOException {
String fetchURL = "https://www.eventick.com.br/api/v1/events/492";
try {
String encoded = URLEncoder.encode(token + ":", "UTF-8");
return this.asyncClient.prepareGet(fetchURL)
.addHeader("Authorization", "Basic " + encoded).execute().get().getResponseBody();
}
}
The code returns no error, it just doesn't fetch the URL because the authentication header is not being properly set, somehow.
With curl -u option I can easily get what I want:
curl https://www.eventick.com.br/api/v1/events/492 -u 'xxxxxxxxxxxxxxx:'
Returns:
{"events":[{"id":492,"title":"Festa da Bagaceira","venue":"Mangueirão de Paulista",
"slug":"bagaceira-fest", "start_at":"2012-07-29T16:00:00-03:00",
"links":{"tickets":[{"id":738,"name":"Normal"}]}}]}
How can this be done in Java with the async-http-client lib? If you know how to do it another way, that is fine too.
Any help is welcome!

You're close. You need to Base64-encode rather than URL-encode. That is, you need
String encoded = Base64.getEncoder().encodeToString((user + ':' + password).getBytes(StandardCharsets.UTF_8));
rather than
String encoded = URLEncoder.encode(token + ":", "UTF-8");
(Note that for the benefit of others, since I'm answering 2 years later, in my answer I'm using the more standard "user:password" whereas your question has "token:". If "token:" is what you needed, then stick with that. But maybe that was part of the problem, too?)
Here is a short, self-contained, correct example:
package so17380731;
import com.ning.http.client.AsyncHttpClient;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.ws.rs.core.HttpHeaders;
public class BasicAuth {
public static void main(String... args) throws Exception {
try(AsyncHttpClient asyncClient = new AsyncHttpClient()) {
final String user = "StackOverflow";
final String password = "17380731";
final String fetchURL = "https://www.eventick.com.br/api/v1/events/492";
final String encoded = Base64.getEncoder().encodeToString((user + ':' + password).getBytes(StandardCharsets.UTF_8));
final String body = asyncClient
.prepareGet(fetchURL)
.addHeader(HttpHeaders.AUTHORIZATION, "Basic " + encoded)
.execute()
.get()
.getResponseBody(StandardCharsets.UTF_8.name());
System.out.println(body);
}
}
}
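To see the difference between the two encodings without making a network call, the following standalone sketch prints both results for the sample credentials "Aladdin" / "open sesame" used in the HTTP Basic authentication RFCs. The helper name basicAuth is invented for this illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class BasicAuthHeader {
    /** Builds the value of an Authorization header for HTTP Basic authentication. */
    static String basicAuth(String user, String password) {
        String credentials = user + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Correct: Base64 of "user:password"
        System.out.println(basicAuth("Aladdin", "open sesame"));
        // prints: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==

        // Wrong for Basic auth: URL encoding only escapes the ':' and the space
        System.out.println(URLEncoder.encode("Aladdin:open sesame", "UTF-8"));
        // prints: Aladdin%3Aopen+sesame
    }
}
```

For the "token:" form from the question, basicAuth(token, "") would produce the equivalent of curl's -u 'token:'.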

The documentation is very sketchy, but I think that you need to use a RequestBuilder following the pattern shown in the Request javadoc:
Request r = new RequestBuilder().setUrl("url")
.setRealm((new Realm.RealmBuilder()).setPrincipal(user)
.setPassword(admin)
.setRealmName("MyRealm")
.setScheme(Realm.AuthScheme.DIGEST).build());
r.execute();
(Obviously, this example is not Basic Auth, but there are clues as to how you would do it.)
FWIW, one problem with your current code is that a Basic Auth header uses Base64 encoding, not URL encoding; see RFC 2617 for details.

Basically, do it like this:
BoundRequestBuilder request = asyncHttpClient
.preparePost(getUrl())
.setHeader("Accept", "application/json")
.setHeader("Content-Type", "application/json")
.setRealm(org.asynchttpclient.Dsl.basicAuthRealm(getUser(), getPassword()))
// ^^^^^^^^^^^-- this is the important part
.setBody(json);
Test can be found here:
https://github.com/AsyncHttpClient/async-http-client/blob/master/client/src/test/java/org/asynchttpclient/BasicAuthTest.java

Here is another way of adding Basic authorization.
You can use either of two classes, AsyncHttpClient or HttpClient; in this case I will use AsyncHttpClient.
AsyncHttpClient client=new AsyncHttpClient();
Request request = client.prepareGet("https://www.eventick.com.br/api/v1/events/492").
setHeader("Content-Type","application/json")
.setHeader("Authorization","Basic b2pAbml1LXR2LmNvbTpnMGFRNzVDUnhzQ0ZleFQ=")
.setBody(jsonObjectRepresentation.toString()).build();
After adding the headers, execute the request:
ListenableFuture<Response> r = null;
try {
r = client.executeRequest(request);
System.out.println(r.get().getResponseBody());
} catch (IOException e) {
e.printStackTrace();
} catch (InterruptedException e) {
e.printStackTrace();
} catch (ExecutionException e) {
e.printStackTrace();
}
client.close();
This may be useful for you.

How to save bulk documents in couchdb using lightcouch api in java

I am using the lightcouch API to connect to couchdb through Java. I am able to save a single document using dbclient.save(object) method. However, my requirement is to save bulk documents at a time. I am not able to find any methods related to saving bulk documents using the Lightcouch api. Please suggest any possible solution.
Thanks in advance!

I decided to give it a go. I have a database holding documents that describe a person.
Here is my Person class, which extends LightCouch's Document class:
public class Person extends Document {
private String firstname = "";
private String lastname = "";
private int age = -1;
public Person(String firstname, String lastname, int age) {
super();
this.setFirstname(firstname);
this.setLastname(lastname);
this.setAge(age);
}
// setters and getters omitted for brevity
}
The algorithm is simple:
1. Create an array of type Document.
2. Put your documents into the array.
3. Create an HTTP POST request.
4. Put the JSON-converted array into the request body.
5. Send it.
Here is roughly what the code could look like.
Note: try/catch omitted for brevity! Of course you are expected to use them.
public static void main(String[] args) {
// You could also use a List and then convert it to an array
Document[] docs = new Document[2];
docs[0] = new Person("John", "Smith", 34);
docs[1] = new Person("Jane", "Smith", 30);
DefaultHttpClient httpClient = new DefaultHttpClient();
// Note the _bulk_docs
HttpPost post = new HttpPost("http://127.0.0.1:5984/persons/_bulk_docs");
Gson gson = new Gson();
StringEntity data =
new StringEntity("{ \"docs\": " + gson.toJson(docs) + "}");
data.setContentType("application/json");
post.setEntity(data);
HttpResponse response = httpClient.execute(post);
if (response.getStatusLine().getStatusCode() != 201) {
throw new RuntimeException("Failed. HTTP error code: "
+ response.getStatusLine().getStatusCode());
}
BufferedReader br = new BufferedReader(
new InputStreamReader((response.getEntity().getContent())));
String output;
while ((output = br.readLine()) != null) {
System.out.println(output);
}
httpClient.getConnectionManager().shutdown();
}
I'll describe the two noteworthy parts in this example.
First one is the collection of documents. In this case I used an array instead of a List for the example.
Document[] docs = new Document[2];
docs[0] = new Person("John", "Smith", 34);
docs[1] = new Person("Jane", "Smith", 30);
You could use a List as well and later convert it to an array using Java's utility methods.
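A minimal sketch of that List-to-array conversion, using a made-up Doc class as a stand-in for LightCouch's Document so that it runs without the library:

```java
import java.util.ArrayList;
import java.util.List;

class ListToArrayDemo {
    // Hypothetical stand-in for LightCouch's Document / the Person class above.
    static class Doc {
        final String name;

        Doc(String name) {
            this.name = name;
        }
    }

    /** Converts the list to an array of the same length and element order. */
    static Doc[] toArray(List<Doc> docs) {
        return docs.toArray(new Doc[0]);
    }

    public static void main(String[] args) {
        List<Doc> list = new ArrayList<>();
        list.add(new Doc("John"));
        list.add(new Doc("Jane"));

        Doc[] docs = toArray(list);
        System.out.println(docs.length + " documents, first: " + docs[0].name); // 2 documents, first: John
    }
}
```

Gson can also serialize a List directly, so the conversion is only needed if an array is required elsewhere.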
The second one is the StringEntity. According to CouchDB's documentation of the HTTP Bulk Document API, to modify multiple documents with a single request, the JSON structure of your request body should look like this:
{
"docs": [
DOCUMENT,
DOCUMENT,
DOCUMENT
]
}
This is the reason for the somewhat ugly StringEntity definition.
StringEntity data = new StringEntity("{ \"docs\": " + gson.toJson(docs) + "}");
As a response you'll get a JSON array containing objects whose fields represent the _id and _rev of the inserted document along with a transaction status indicator.
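The request body described above can also be assembled in a small helper, which keeps the string concatenation in one place. This is only a sketch: it assumes the documents have already been serialized to JSON strings (for example with gson.toJson), and the class and method names are invented.

```java
import java.util.Arrays;
import java.util.List;

class BulkDocsBody {
    /** Wraps already-serialized JSON documents in the {"docs": [...]} envelope. */
    static String wrap(List<String> jsonDocs) {
        return "{\"docs\":[" + String.join(",", jsonDocs) + "]}";
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "{\"firstname\":\"John\",\"lastname\":\"Smith\",\"age\":34}",
                "{\"firstname\":\"Jane\",\"lastname\":\"Smith\",\"age\":30}");
        // The result would be used as the content of the StringEntity.
        System.out.println(wrap(docs));
    }
}
```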

I did the same thing, but with Spring's RestTemplate.
I created a class to hold the documents to be updated, in the following way:
public class BulkUpdateDocument {
private final List<Object> docs;
public BulkUpdateDocument(List<Object> docs) {
this.docs = docs;
}
}
My REST code looks like this:
BulkUpdateDocument doc = new BulkUpdateDocument(ListOfObjects);
Gson gson = new Gson();
RestTemplate restTemplate = new RestTemplate();
HttpHeaders header = new HttpHeaders();
header.setContentType(MediaType.APPLICATION_JSON_UTF8);
HttpEntity<?> requestObject = new HttpEntity<Object>(gson.toJson(doc), header);
ResponseEntity<Object> postForEntity = restTemplate.postForEntity(path + "/_bulk_docs", requestObject, Object.class);
