I am scraping data from a publication website (ResearchGate) using HtmlUnit in Java.
I feed the scraper URLs from a text file that contains almost 4000 URLs (all the pages share a similar pattern but contain different data). When I try to run my logic over all 4000 URLs, I get this error:
com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException: 429 Too Many Requests for https://www.researchgate.net/application.RequestQuotaExceeded.html?tk=i1iSnVitFTozE0uu1nlOqH6CgwJA0vikMY_2VFnCBM3JDz4SZrupIy5I4yAT5KBOFAX-LySwTEIR4dak8u0FRHod9caWkRiNZS6RDGKXCY2Gn7kh80q72oaXjk8RWsXqqfcrNa3ULlnSHgQ
at com.gargoylesoftware.htmlunit.WebClient.throwFailingHttpStatusCodeExceptionIfNecessary(WebClient.java:537)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:362)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:434)
at com.pollak.library.Authenticator.autoLogin(Authenticator.java:70)
at com.pollak.library.FetchfromPublicationPage.main(FetchfromPublicationPage.java:34)
My code is :
package com.pollak.library;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.util.ArrayList;
import java.util.List;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class FetchfromPublicationPage {
public static void main(String a[]) throws Exception {
String path = "Path to the text file which contains 4000 URLs";
File file = new File(path);
BufferedReader br = new BufferedReader(new java.io.FileReader(file));
String line = null;
String baseUrl = "https://www.researchgate.net";
String login = <login_ID>;
String password = <password>;
File facurl = new File("Path to the file in which I want to save scraped information");
FileWriter fw = new FileWriter(facurl);
BufferedWriter bw = new BufferedWriter(fw);
int neha = 1;
try {
WebClient client = Authenticator.autoLogin(baseUrl + "/login", login, password);
String facultyprofileurl;
while ((facultyprofileurl = br.readLine()) != null) {
String info= "", ath = "";
String arr[] = facultyprofileurl.split(",");
HtmlPage page = client.getPage(arr[2]);
if (page.asText().contains("You need to sign in for access to this page")) {
throw new Exception(String.format("Error during login on %s , check your credentials", baseUrl));
}
List<HtmlElement> items = (List<HtmlElement>) page.getByXPath(
"//div[@class='nova-e-text nova-e-text--size-m nova-e-text--family-sans-serif nova-e-text--spacing-xxs nova-e-text--color-grey-700']");
List<HtmlElement> items2 = (List<HtmlElement>) page.getByXPath(
"//div[@class='nova-e-text nova-e-text--size-l nova-e-text--family-sans-serif nova-e-text--spacing-none nova-e-text--color-inherit nova-v-person-list-item__title nova-v-person-list-item__title--clamp-1']");
String print = "";
if (items.isEmpty()) {
System.out.println("No items found !");
} else {
for (HtmlElement htmlItem : items) {
HtmlElement articleinfo = ((HtmlElement) htmlItem.getFirstByXPath(".//ul"));
info += articleinfo.getTextContent().toString()+"**";
}
}
if (items2.isEmpty()) {
System.out.println("No items found !");
} else {
for (HtmlElement htmlItem : items2) {
HtmlAnchor authors = ((HtmlAnchor) htmlItem.getFirstByXPath(".//a"));
ath += authors.getTextContent().toString()+"**";
}
}
bw.write(neha + "," + info +","+ath);
bw.newLine();
neha = neha + 1;
}
} catch (Exception e) {
e.printStackTrace();
} finally {
// make sure buffered output is flushed and file handles are released
bw.close();
br.close();
}
}
}
Can anyone please guide me on how to solve this error?
I fear there is no simple solution for you. You will have to dig in yourself and figure out what is going on.
Maybe some hints.
First, get familiar with HTTP and the general way it works. Try to understand that, and read about the error code you got: 429 means the server thinks you are sending too many requests in too short a time.
Next, use a web proxy (e.g. Charles) to see what is happening on the wire. Maybe the server sends some additional information (headers) that contains a hint about the rules the server uses to detect this situation.
Then start with a simple program and try to find the number of requests that triggers your problem.
All in all, we can't do the analysis work for you. You have to learn about the way HTTP works and what HTTP servers are doing, and then you might find a way. But keep in mind that the people on the server side seem to block robots like yours (for various good reasons). Maybe you will find a solution, but maybe that solution will only work for some time.
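One common mitigation, once you know the server's limits, is to pace your requests and back off when a 429 comes back. Below is a minimal sketch of that idea using the HtmlUnit classes already imported above; the delay values are assumptions you would have to tune against the real limits you discover:
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class ThrottledFetcher {
// Assumed values; tune them against the limits you observe.
private static final long DELAY_BETWEEN_REQUESTS_MS = 2000;
private static final long BACKOFF_ON_429_MS = 60000;
private static final int MAX_RETRIES = 3;
static HtmlPage fetchWithBackoff(WebClient client, String url) throws Exception {
for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
try {
Thread.sleep(DELAY_BETWEEN_REQUESTS_MS); // pace every request
return client.getPage(url);
} catch (FailingHttpStatusCodeException e) {
if (e.getStatusCode() == 429) {
// The server says "slow down": wait longer, then retry.
Thread.sleep(BACKOFF_ON_429_MS * attempt);
} else {
throw e;
}
}
}
throw new Exception("Still rate-limited after " + MAX_RETRIES + " retries: " + url);
}
}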
I was able to connect Java to AWS S3, and I was able to perform basic operations like listing buckets. I need a way to read a CSV file without downloading it. I am attaching my current code here.
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.Bucket;
import com.amazonaws.services.s3.model.CannedAccessControlList;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.Properties;
public class test {
public static void main(String args[])throws IOException
{
AWSCredentials credentials =new BasicAWSCredentials("----","----");
AmazonS3 s3client = AmazonS3ClientBuilder
.standard()
.withCredentials(new AWSStaticCredentialsProvider(credentials))
.withRegion(Regions.US_EAST_2)
.build();
List<Bucket> buckets = s3client.listBuckets();
for(Bucket bucket : buckets) {
System.out.println(bucket.getName());
}
}
}
There is a way, with code like this. First I get the file we want to read into an S3Object, then I pass its content stream to an InputStreamReader():
S3Object obj = s3client.getObject("<Bucket Name>", "File Name");
BufferedReader reader = new BufferedReader(new InputStreamReader(obj.getObjectContent()));
// read the first row and store its characters in an array
String line = reader.readLine();
String row[] = line.split(",");
// this will fetch the number of columns
int length = row.length;
while((line = reader.readLine()) != null) {
// storing characters of the corresponding line in an array
String value[] = line.split(",");
for(int i = 0; i < length; i++) {
System.out.print(value[i] + " ");
}
System.out.println();
}
The answer by @jay and @Elikill58 is super helpful! This just adds a bit of clarity and accessibility to it.
The way to get an object from an S3 bucket, after you have done all the authentication work, is the .getObject(String bucketName, String fileName) function. Note what the documentation says about file names:
An Amazon S3 bucket has no directory hierarchy such as you would find in a typical computer file system. You can, however, create a logical hierarchy by using object key names that imply a folder structure. For example, instead of naming an object sample.jpg, you can name it photos/2006/February/sample.jpg.
To get an object from such a logical hierarchy, specify the full key name for the object in the GET operation. For a virtual hosted-style request example, if you have the object photos/2006/February/sample.jpg, specify the resource as /photos/2006/February/sample.jpg. For a path-style request example, if you have the object photos/2006/February/sample.jpg in the bucket named examplebucket, specify the resource as /examplebucket/photos/2006/February/sample.jpg.
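In code, that means passing the full key name as the second argument. For example, using the bucket and key from the documentation snippet above:
S3Object obj = s3client.getObject("examplebucket", "photos/2006/February/sample.jpg");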
Once you have the S3Object that is returned, just pass it into the function below (which is a modified version of @jay's that fixes a few errors)!
private static void parseCSVS3Object(S3Object data) {
BufferedReader reader = new BufferedReader(new InputStreamReader(data.getObjectContent()));
try {
// Get all the csv headers
String line = reader.readLine();
String[] headers = line.split(",");
// Get number of columns and print headers
int length = headers.length;
for (String header : headers) {
System.out.print(header + " ");
}
while((line = reader.readLine()) != null) {
System.out.println();
// get and print the next line (row)
String[] row = line.split(",");
for (String value : row) {
System.out.print(value + " ");
}
}
} catch (IOException e) {
throw new RuntimeException(e);
}
}
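Hypothetical usage, with placeholder bucket and key names:
parseCSVS3Object(s3client.getObject("examplebucket", "data.csv"));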
For your code to read the file, it needs the contents, and that means copying them to the local system.
However, you can use a "range" request to read just a part.
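With the v1 AWS SDK used in the code above, a ranged read looks roughly like this (a sketch; the bucket and key names are placeholders):
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
// Request only the first 1 KiB of the object instead of the whole file.
GetObjectRequest rangeRequest = new GetObjectRequest("<Bucket Name>", "File Name")
        .withRange(0, 1023); // the byte range is inclusive
S3Object partial = s3client.getObject(rangeRequest);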
I am trying to teach myself Java and Selenium by creating a robot that will scan job/career pages for a certain string (a job name, e.g. QA, developer...).
I'm trying to create Java code, using Selenium, that will read URL links from a CSV file and open each in a new tab.
The main goal is to add several URLs to the CSV and assert/locate a certain string in the designated URLs. For example: if there is a "Careers" link in a given URL, the test will pass for that specific URL.
created a selenium project
created new chromeDriver
Created CSV built from 3 columns (ID, company's name, URL) - and added it to the project
import org.openqa.selenium.chrome.ChromeDriver;
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
public class URLSearch {
public static void main(String[] args) {
ChromeDriver driver = new ChromeDriver();
driver.manage().window().maximize();
String fileName = "JobURLList.csv";
File file = new File(fileName); //read from file
try {
Scanner inputStream = new Scanner(file);
while (inputStream.hasNext()) {
String data = inputStream.next();
System.out.println(data);
}
inputStream.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
}
first line in the CSV - titles: id, name, url
Read the url from the second line - e.g. https://careers.google.com/jobs/
open a browser tab and start going over the url list (from the CSV)
locate a hardcoded string (e.g. "developer", "qa"...) in each url
if such a string was found, write to the console the url for which the test turned out positive (the string was found at that url)
if no such string was found, skip to the next url
To open the new tab do something like this (this assumes "driver" object is your WebDriver):
((JavascriptExecutor)driver).executeScript("window.open('about:blank', '_blank');");
Set<String> tab_handles = driver.getWindowHandles();
int number_of_tabs = tab_handles.size();
int new_tab_index = number_of_tabs-1;
driver.switchTo().window(tab_handles.toArray()[new_tab_index].toString());
You could then create a function that takes a list of key/value pairs, with a URL and the term to search for, and loop through it; see the sketch after the snippet below. Do you want to use a hashmap for this, or maybe an ArrayList of a class (id/name/url)? The code for finding the text would be something like this (it assumes you've defined a boolean field named "Pass"):
driver.get([var for URL]);
//driver will wait for pageready state, so you may
// not need the webdriver wait used below. Depends
// on if the page populates data after pagereadystate
String xpather = "//*[contains(text(), '" + [string var for text to search for] + "')]";
try
{
WebDriverWait wait = new WebDriverWait(driver, 10);
List<WebElement> element = wait.until(ExpectedConditions.visibilityOfAllElementsLocatedBy(By.xpath(xpather)));
this.Pass = false;
if (element.size() > 0)
{
this.Pass = true;
}
}
catch (Exception ex)
{
this.Pass = false;
System.out.println ("Exception finding text: " + ex.toString());
}
Then add your logic for the this.Pass == true and this.Pass == false cases.
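A possible shape for that loop, using a HashMap (the map, its sample entry, and the console message are illustrative assumptions; "driver" is again your WebDriver):
import java.util.HashMap;
import java.util.Map;
import org.openqa.selenium.By;
Map<String, String> urlToTerm = new HashMap<>();
urlToTerm.put("https://careers.google.com/jobs/", "developer"); // sample entry
for (Map.Entry<String, String> entry : urlToTerm.entrySet()) {
    driver.get(entry.getKey());
    String xpather = "//*[contains(text(), '" + entry.getValue() + "')]";
    // findElements returns an empty list (rather than throwing) when nothing matches
    if (!driver.findElements(By.xpath(xpather)).isEmpty()) {
        System.out.println("Found '" + entry.getValue() + "' on " + entry.getKey());
    } // otherwise skip to the next url
}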
I have searched a lot, read many blogs, articles and tutorials, but so far have not found a working example of using a Facebook account to log in to my application.
I know that I have to use OAuth, get tokens, authorizations, etc...
Can anyone share an example?
Here is how I do it on App Engine:
Step 1) Register an "app" on Facebook (cf. https://developers.facebook.com/ ). You give Facebook a name for the app and a url. The url you register is the url to the page (jsp or servlet) that you want to handle the login. From the registration you get two strings, an "app ID" and an "app secret" (the latter being your password, do not give this out or write it in html).
For this example, let's say the url I register is "http://myappengineappid.appspot.com/signin_fb.do".
Step 2) From a webpage, say with a button, you redirect the user to the following url on Facebook, substituting your app id for "myfacebookappid" in the below example. You also have to choose which permissions (or "scopes") you want to ask the user for (cf. https://developers.facebook.com/docs/reference/api/permissions/ ). In the example I ask for access to the user's email only.
(A useful thing to know is that you can also pass along an optional string that will be returned unchanged in the "state" parameter. For instance, I pass the user's datastore key, so I can retrieve the user when Facebook passes the key back to me. I do not do this in the example.)
Here is a jsp snippet:
<%@page import="java.net.URLEncoder" %>
<%
String fbURL = "http://www.facebook.com/dialog/oauth?client_id=myfacebookappid&redirect_uri=" + URLEncoder.encode("http://myappengineappid.appspot.com/signin_fb.do") + "&scope=email";
%>
<a href="<%=fbURL%>"><img src="/img/facebook.png" border="0" /></a>
3) Your user will be forwarded to Facebook, and asked to approve the permissions you ask for. Then, the user will be redirected back to the url you have registered. In this example, this is "http://myappengineappid.appspot.com/signin_fb.do" which in my web.xml maps to the following servlet:
import org.json.JSONObject;
import org.json.JSONException;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
public class SignInFB extends HttpServlet {
public void service(HttpServletRequest req, HttpServletResponse res) throws ServletException, IOException {
String code = req.getParameter("code");
if (code == null || code.equals("")) {
// an error occurred, handle this
}
String token = null;
try {
String g = "https://graph.facebook.com/oauth/access_token?client_id=myfacebookappid&redirect_uri=" + URLEncoder.encode("http://myappengineappid.appspot.com/signin_fb.do", "UTF-8") + "&client_secret=myfacebookappsecret&code=" + code;
URL u = new URL(g);
URLConnection c = u.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(c.getInputStream()));
String inputLine;
StringBuffer b = new StringBuffer();
while ((inputLine = in.readLine()) != null)
b.append(inputLine + "\n");
in.close();
token = b.toString();
if (token.startsWith("{"))
throw new Exception("error on requesting token: " + token + " with code: " + code);
} catch (Exception e) {
// an error occurred, handle this
}
String graph = null;
try {
String g = "https://graph.facebook.com/me?" + token;
URL u = new URL(g);
URLConnection c = u.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(c.getInputStream()));
String inputLine;
StringBuffer b = new StringBuffer();
while ((inputLine = in.readLine()) != null)
b.append(inputLine + "\n");
in.close();
graph = b.toString();
} catch (Exception e) {
// an error occurred, handle this
}
String facebookId;
String firstName;
String middleNames;
String lastName;
String email;
Gender gender;
try {
JSONObject json = new JSONObject(graph);
facebookId = json.getString("id");
firstName = json.getString("first_name");
if (json.has("middle_name"))
middleNames = json.getString("middle_name");
else
middleNames = null;
if (middleNames != null && middleNames.equals(""))
middleNames = null;
lastName = json.getString("last_name");
email = json.getString("email");
if (json.has("gender")) {
String g = json.getString("gender");
if (g.equalsIgnoreCase("female"))
gender = Gender.FEMALE;
else if (g.equalsIgnoreCase("male"))
gender = Gender.MALE;
else
gender = Gender.UNKNOWN;
} else {
gender = Gender.UNKNOWN;
}
} catch (JSONException e) {
// an error occurred, handle this
}
...
I have removed error handling code, as you may want to handle it differently than I do. (Also, "Gender" is of course a class that I have defined.) At this point, you can use the data for whatever you want, like registering a new user or look for an existing user to log in. Note that the "myfacebookappsecret" string should of course be your app secret from Facebook.
You will need the "org.json" package to use this code, which you can find at: http://json.org/java/ (just take the .java files and add them to your code in an org/json folder structure).
I hope this helps. If anything is unclear, please do comment, and I will update the answer.
Ex animo, - Alexander.
****UPDATE****
I want to add a few tidbits of information, my apologies if some of this seems a bit excessive.
To be able to log in a user by his/her Facebook account, you need to know which user in the datastore we are talking about. If it's a new user, it's easy: create a new user object (with a field called "facebookId", or whatever you want to call it, whose value you get from Facebook), persist it in the datastore and log the user in.
If the user exists, you need to have the field with the facebookId already stored. When the user is redirected back from Facebook, you can grab the facebookId and look in the datastore to find the user you want to log in (a sketch of this follows below).
If you already have users, you will need to let them log in the way you usually do, so you know who they are, then send them to Facebook, get the facebookId back and update their user object. This way, they can log in using Facebook the next time.
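As a hedged sketch of that create-or-look-up step (the User and UserDao types are hypothetical stand-ins for whatever persistence layer you use; they are not part of the flow above):
// Hypothetical persistence helper; substitute your own datastore access code.
User user = userDao.findByFacebookId(facebookId);
if (user == null) {
    // First visit: create and persist a new user keyed to the Facebook id.
    user = new User(facebookId, firstName, lastName, email);
    userDao.save(user);
}
// Either way, "user" is now the account to log in.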
Another small note: The user will be presented with a screen on Facebook asking whether to allow your app access to whatever scopes you ask for; there is no way around this (the fewer scopes you ask for, the less intrusive it seems, though). However, this only happens the first time a user is redirected (unless you ask for more scopes later, in which case it will ask again).
You can try face4j https://github.com/nischal/face4j/wiki . We've used it on our product http://grabinbox.com and have open sourced it for anyone to use. It works well on GAE.
There is an example on the wiki which should help you integrate login with facebook in a few minutes.
face4j makes use of OAuth 2.0 and the Facebook Graph API.
I had a lot of difficulty when trying to implement the OAuth signing myself. I spent a lot of time trying to debug an issue with my tokens not actually getting authorized - a common problem apparently. Unfortunately, none of the solutions worked for me so I ended up just using Scribe, a nifty Java OAuth library that has the added benefit of supporting other providers besides for Facebook (e.g. Google, Twitter, etc.)
You can take a look at LeanEngine, the server part: https://github.com/leanengine/LeanEngine-Server/tree/master/lean-server-lib/src/main/java/com/leanengine/server/auth
Check Facebook's Java APIs.
Other examples: http://code.google.com/p/facebook-java-api/wiki/Examples
I'm not sure if anyone else has encountered or asked about this before, but for my application I make use of two Yahoo! RSS feeds: Top News and Weather Forecast. I'm new to the idea of using these in the first place, but from what I've read, I simply need to make an HTTP GET request to a specific URL to retrieve an XML file, which I can then parse for the information I want. I have the parser working just fine, for I tested it with a sample XML file from each feed; however, a strange error occurs when I use an AJAX GET call to the urls:
The XML page cannot be displayed
Cannot view XML input using XSL style sheet. Please correct the error and then click the Refresh button, or try again later.
Whitespace is not allowed at this location.
Error processing resource 'http://localhost:8080/BBS/fservlet?p=n'. Line 28, P...
for (i = 0; i < s.length; i++){
-------------------^
Note that I have this application "BBS" currently deployed on my local system with Tomcat. I looked into whitespace errors like this, and most seem to point to some line within the XML file itself that's having a problem. In most cases, it had something to do with escaping the "&" symbol, but it appears as though IE is telling me that the error is within a for-loop. I'm no XML expert, but I've never seen a for-loop inside an XML file. Even so, I've gone to the url directly in my browser and viewed the XML file (it's the one I used to test my parsing) and found no such line. In addition, no such loop exists anywhere in my code. In other words, I'm not sure if this is an error on my end or a configuration setting. Here's the code I'm working with, however:
jQuery Code
// Located in my JSP file
var baseContext = "<%=request.getContextPath()%>";
$(document).ready(function() {
ParseWeather();
ParseNews();
});
// Located in a separate JS file
function ParseWeather() {
$.get(baseContext + "/servlet?p=w", function(data) {
// XML Parser
});
// Data Manipulation
}
function ParseNews() {
$.get(baseContext + "/servlet?p=n", function(data) {
// XML Parser
});
// Data Manipulation
}
Java Code
import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;
import javax.servlet.http.HttpServlet;
import java.net.URL;
public class FeedServlet extends HttpServlet {
protected void doGet(final HttpServletRequest request, final HttpServletResponse response) throws ServletException, IOException {
try {
response.setContentType("text/xml");
final URL url;
String line = "";
if(request.getParameter("p").equals("w")) {
// Configuration setting that returns: "http://xml.weather.yahoo.com/forecastrss?p=USOR0186"
url = new URL(AppConfiguration.getInstance().getForcastUrl());
} else {
// Configuration setting that returns: "http://news.yahoo.com/rss/"
url = new URL(AppConfiguration.getInstance().getNewsUrl());
}
final BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
final PrintWriter writer = response.getWriter();
while((line = reader.readLine()) != null) {
writer.println(line);
writer.flush();
}
writer.close();
} catch(IOException e) {
e.printStackTrace();
}
}
}
My company has an AppConfiguration class that allows certain variables, like the URLs, to be changed through the configuration page. At any rate, those two calls simply return the URLs...
Yahoo! Forecast RSS Feed:
http://xml.weather.yahoo.com/forecastrss?p=USOR0186
Yahoo! News: Top Stories Feed:
http://news.yahoo.com/rss/
Anyway, any help would be greatly appreciated.
for (i = 0; i < s.length; i++){
The error is at the less-than symbol, which means that the XML parser is reading your source code! Use WGET to get the resource and check that actual XML is returned and not source code.
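If you would rather check from Java than with wget, here is a quick sketch; the URL is the one from the error message, so adjust it to your deployment:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
public class CheckFeed {
public static void main(String[] args) throws Exception {
// Print the raw response so you can see whether it is XML or source code.
URL url = new URL("http://localhost:8080/BBS/fservlet?p=n");
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String line;
while ((line = in.readLine()) != null) {
System.out.println(line);
}
in.close();
}
}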
I have several anchor tags in a text.
Input: <a href="http://stackoverflow.com" >Take me to StackOverflow</a>
Output:
http://stackoverflow.com
How can I find all such input strings and convert them to the output string in Java, without using a 3rd-party API?
There are classes in the core API that you can use to get all href attributes from anchor tags (if present!):
import java.io.*;
import java.util.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class HtmlParseDemo {
public static void main(String [] args) throws Exception {
String html =
"<a href=\"http://stackoverflow.com\" >Take me to StackOverflow</a> " +
"<!-- " +
"<a href=\"http://ignoreme.com\" >...</a> " +
"--> " +
"<a href=\"http://www.google.com\" >Take me to Google</a> " +
"<a>NOOOoooo!</a> ";
Reader reader = new StringReader(html);
HTMLEditorKit.Parser parser = new ParserDelegator();
final List<String> links = new ArrayList<String>();
parser.parse(reader, new HTMLEditorKit.ParserCallback(){
public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
if(t == HTML.Tag.A) {
Object link = a.getAttribute(HTML.Attribute.HREF);
if(link != null) {
links.add(String.valueOf(link));
}
}
}
}, true);
reader.close();
System.out.println(links);
}
}
which will print:
[http://stackoverflow.com, http://www.google.com]
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public static void main(String[] args) {
String test = "qazwsx<a href=\"http://stackoverflow.com\" >Take me to StackOverflow</a>fdgfdhgfd"
+ "<a href=\"http://stackoverflow.com\" >Take me to StackOverflow2</a>dcgdf";
String regex = "<a href=(\"[^\"]*\")[^<]*</a>";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(test);
System.out.println(m.replaceAll("$1"));
}
NOTE: All of Andrzej Doyle's points are valid, and if you have more than simple anchor tags in your input, and you are sure it is parsable HTML, then you are better off with an HTML parser.
To summarize:
The regex I posted doesn't work if you have an <a> tag inside a comment. (You can treat that as a special case.)
It doesn't work if you have other attributes in the <a> tag. (Again, you can treat that as a special case.)
There are many other cases where the regex won't work, and you cannot cover all of them with a regex, since HTML is not a regular language.
However, if your requirement is always to replace <a href="X">Y</a> with "X" without considering the context, then the code I've posted will work.
You can use JSoup
String html = "<p>An <a href=\"http://stackoverflow.com\" >Take me to StackOverflow</a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String linkHref = link.attr("href"); // "http://stackoverflow.com"
The above example works perfectly; if you want to parse an HTML document, say, instead of concatenated strings, write something like the following to complement the code above.
The existing code above is modified into HtmlParser.java (HtmlParseDemo.java above), complemented by HtmlPage.java below. The content of the HtmlPage.properties file is at the bottom of this page.
The main.url property in the HtmlPage.properties file is:
main.url=http://www.whatever.com/
That way you can just parse the url that you're after. :-)
Happy coding :-D
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
public class HtmlParser
{
public static void main(String[] args) throws Exception
{
String html = HtmlPage.getPage();
Reader reader = new StringReader(html);
HTMLEditorKit.Parser parser = new ParserDelegator();
final List<String> links = new ArrayList<String>();
parser.parse(reader, new HTMLEditorKit.ParserCallback()
{
public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
{
if (t == HTML.Tag.A)
{
Object link = a.getAttribute(HTML.Attribute.HREF);
if (link != null)
{
links.add(String.valueOf(link));
}
}
}
}, true);
reader.close();
// create the header
System.out.println("<html>\n<head>\n <title>Link City</title>\n</head>\n<body>");
// spit out the links and create href
for (String l : links)
{
System.out.print(" " + l + "\n");
}
// create footer
System.out.println("</body>\n</html>");
}
}
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.StringWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ResourceBundle;
public class HtmlPage
{
public static String getPage()
{
StringWriter sw = new StringWriter();
ResourceBundle bundle = ResourceBundle.getBundle(HtmlPage.class.getName().toString());
try
{
URL url = new URL(bundle.getString("main.url"));
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestMethod("GET");
connection.setDoOutput(true);
InputStream content = (InputStream) connection.getInputStream();
BufferedReader in = new BufferedReader(new InputStreamReader(content));
String line;
while ((line = in.readLine()) != null)
{
sw.append(line).append("\n");
}
} catch (Exception e)
{
e.printStackTrace();
}
return sw.getBuffer().toString();
}
}
For example, this will output links from http://ebay.com.au/ if viewed in a browser.
This is a subset, as there are a lot of links
Link City
#mainContent
http://realestate.ebay.com.au/
The most robust way (as has been suggested already) is to use regular expressions (java.util.regex), if you are required to build this without using 3rd-party libs.
The alternative is to parse the html as XML, either using a SAX parser to capture and handle each instance of an "a" element, or as a DOM Document and then searching it using XPATH (see http://download.oracle.com/javase/6/docs/api/javax/xml/parsers/package-summary.html and the sketch below). This is problematic, though, since it requires the HTML page to be fully XML-compliant in its markup, a very dangerous assumption and not an approach I would recommend, since most "real" html pages are not XML-compliant.
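For completeness, the DOM + XPATH variant looks roughly like this; a sketch that assumes the input is well-formed XHTML, which, as noted, real pages often are not:
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
public class XPathLinkDemo {
public static void main(String[] args) throws Exception {
// Only works when the markup parses as XML.
String xhtml = "<html><body><a href=\"http://stackoverflow.com\">Take me to StackOverflow</a></body></html>";
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
.parse(new InputSource(new StringReader(xhtml)));
XPath xpath = XPathFactory.newInstance().newXPath();
NodeList hrefs = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);
for (int i = 0; i < hrefs.getLength(); i++) {
System.out.println(hrefs.item(i).getNodeValue());
}
}
}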
Still, I would recommend also looking at existing frameworks out there built for this purpose (like JSoup, also mentioned above). No need to reinvent the wheel.