I need to parse a lot of HTML web pages (100+) for specific content (a few lines of text that are almost the same on every page).
I have tried Scanner objects with regular expressions, and jsoup with its HTML parser.
Both approaches are slow, and with jsoup I get the following error:
java.net.SocketTimeoutException: Read timed out (Multiple computers with different connections)
Is there anything better?
EDIT:
Now that I've gotten jsoup to work, I think a better question is how do I speed it up?
Did you try lengthening the timeout on JSoup? It's only 3 seconds by default, I believe. See e.g. this.
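For what it's worth, a minimal sketch of both ideas (a longer timeout plus a small thread pool, since most of the wall time is network I/O) might look like this; the URLs and the selector in the comment are placeholders:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PageFetcher {
    public static void main(String[] args) throws InterruptedException {
        // Placeholder URLs -- substitute your 100+ pages.
        List<String> urls = Arrays.asList(
                "http://example.com/page1", "http://example.com/page2");

        // A small pool lets slow servers overlap instead of blocking
        // one another; fetching sequentially wastes time waiting on I/O.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (final String url : urls) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        Document doc = Jsoup.connect(url)
                                .timeout(10000)  // 10 s instead of the short default
                                .get();
                        // Extract the few lines of text you care about here,
                        // e.g. doc.select("div.rates").text();
                        System.out.println(url + " -> " + doc.title());
                    } catch (Exception e) {
                        System.err.println(url + " failed: " + e.getMessage());
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
    }
}
```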
I suggest Nutch, an open-source web-search solution that includes support for HTML parsing. It's a very mature library; it uses Lucene under the hood, and I find it to be a very reliable crawler.
A great skill to learn would be XPath; it would be perfect for this job! I just started learning it myself for automation testing. If you have questions, shoot me a message. I'd be glad to help you out, even though I'm not an expert.
Here's a nice link since you are interested in Java:
http://www.ibm.com/developerworks/library/x-javaxpathapi/index.html
XPath is also a good thing to know when you're not using Java, which is why I would choose that route.
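For a sense of what that looks like in Java, here is a minimal sketch using the JDK's built-in javax.xml.xpath API (the sample XML is invented; real-world HTML usually needs a cleaning parser in front of it, since this API expects well-formed input):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import org.w3c.dom.Document;

public class XPathDemo {
    public static void main(String[] args) throws Exception {
        // Invented sample document standing in for a fetched page.
        String xml = "<page><rate metal='gold'>1650.20</rate>"
                   + "<rate metal='silver'>19.85</rate></page>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Select the text of the <rate> element whose attribute matches.
        String gold = xpath.evaluate("//rate[@metal='gold']/text()", doc);
        System.out.println("gold = " + gold);
    }
}
```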
Related
I have hosted my web application on Google App Engine, and I need to display some Java code on my website, so I am looking for a Java parser for it. Please suggest some.
Actually, I want to put up a project that uses both C and Java, so I am looking for a parser that can handle both languages. Are there parsers available, or will I have to write my own?
Edit: my sole purpose 'now' is code highlighting.
If you just need code highlighting, there are tons of options out there. You could, for instance, use highlight.js, or even Google's own code prettifier.
At the moment, that is all I can get from your question, so until you clarify further, I won't be able to give a more precise answer.
I'm using PHP to scrape some information off web pages; however, I've discovered that the info I'm trying to scrape is loaded through some manner of AJAX/JavaScript. I thought I remembered that cURL could step through the JavaScript, but I've found that's not the case.
I seem to remember some sort of back-end "web browser" library/function that could trace through JavaScript and AJAX to arrive at the final page result that a full-featured browser would produce.
Is there a library or function that can do this? Any ideas on how to go about this, other than having to manually trace through the scripts/redirects myself? It doesn't have to be pretty -- I'm just looking to scrape the resulting text.
Maybe not in PHP, but in other languages there are: Watir/WatiN, Selenium, watir-webdriver/selenium-webdriver, capybara-webkit, Celerity, Node.js (which runs JS directly), and PhantomJS. There are also iMacros and similar commercial options.
But I usually find that I can get the data I want without any of these by just looking at the requests the page is making, recreating them, and parsing the response.
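As an illustration of that last approach (in Java, the language used elsewhere in this thread), a rough sketch might look like the following; the endpoint and query parameter are made up, and in practice you would copy the real ones from your browser's network tab:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class XhrReplay {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint discovered by watching the page's AJAX
        // traffic in the browser's developer tools.
        URL url = new URL("http://example.com/ajax/data?id=123");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Some endpoints only answer requests that look like AJAX calls.
        conn.setRequestProperty("X-Requested-With", "XMLHttpRequest");

        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        StringBuilder body = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            body.append(line).append('\n');
        }
        in.close();
        // Usually JSON or an HTML fragment; parse with your tool of choice.
        System.out.println(body);
    }
}
```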
I don't think such a library exists. If you're really desperate and have lots of time on your hands, you can, of course, download the source code of Firefox, for example, and build yourself something useful. However, I don't think that would be the best use of your or anybody else's resources.
Note that even Google's indexing bot does not process AJAX. Here is what Google has to say about it. It's quite possible that the site you're dealing with does support this scheme, in which case you can try Google's technique, but on the whole, unfortunately, you're out of luck.
I'm trying to build an Android app that will log into a website, scrape it for data specific to the user, and then format that data nicely on a mobile screen.
I've noticed that there are several similar questions to my own, and after reading some of the documentation, I am still very confused as to how I should go about this.
Here's what I know
The site that I want to log into is built with ASP.NET, and its login.aspx page uses POST for the login form.
There is no API for this website
There is also no single sign on
I'm very new to Android and a novice Java programmer at best. Will someone please help me chart the research I need to do in order to write this app? I feel that I mostly need help with connecting to the website and getting the data; I'll be able to figure out the layouts and formatting myself.
I am more than willing to research and read whatever is necessary, but I would like to minimize any irrelevant information that would ultimately lead to more confusion.
Thank you in advance for the help
For the purpose of accessing the website, all you need to know about is HTTP. It doesn't matter whether your target website is built with PHP, ASP.NET, or anything else; your only concern is how to communicate with it over HTTP, which is independent of the technology behind the site. You can try Wikipedia for descriptions of the HTTP methods.
It might be worthwhile reading the Java URL tutorials to learn the relevant Java classes. As for extracting the data itself, you might want to read up on parsers; this link might give you some first ideas.
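As a concrete starting point, a bare form POST in stock Java might look roughly like this; the field names are invented, and a real ASP.NET login form usually also requires hidden fields such as __VIEWSTATE copied from the form page:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class LoginPost {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://example.com/login.aspx");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type",
                "application/x-www-form-urlencoded");

        // Invented field names -- inspect the real form for the actual
        // ones (ASP.NET pages usually also post __VIEWSTATE etc.).
        String form = "username=" + URLEncoder.encode("me", "UTF-8")
                    + "&password=" + URLEncoder.encode("secret", "UTF-8");

        OutputStream out = conn.getOutputStream();
        out.write(form.getBytes("UTF-8"));
        out.close();

        System.out.println("HTTP " + conn.getResponseCode());
        // On success the server normally sets a session cookie; read it
        // from the Set-Cookie response header and send it back on
        // subsequent requests to stay logged in.
        System.out.println(conn.getHeaderField("Set-Cookie"));
    }
}
```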
I'd use the jsoup API. I've posted many threads on that issue; this one will probably help you log in. I don't know how Android manages SSL certificates, so you'll have to research that on your own, but this is a good start:
Jsoup Cookies for HTTPS scraping
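Roughly, a cookie-carrying jsoup login might look like the sketch below; the URLs and field names are placeholders, and the Connection.Response/cookies() calls assume a reasonably recent jsoup release:

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.Map;

public class JsoupLogin {
    public static void main(String[] args) throws Exception {
        // 1. POST the login form and capture the session cookies.
        Connection.Response res = Jsoup.connect("https://example.com/login.aspx")
                .data("username", "me")      // placeholder field names
                .data("password", "secret")
                .method(Connection.Method.POST)
                .execute();
        Map<String, String> cookies = res.cookies();

        // 2. Reuse the cookies on the page that holds the user's data.
        Document doc = Jsoup.connect("https://example.com/account")
                .cookies(cookies)
                .get();
        System.out.println(doc.title());
    }
}
```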
I'm new here and I hope I'm not asking something which has already been answered. I have searched everywhere but am yet to discover an adequate answer.
My objective is fairly simple: I want to create a program which will stream the live gold and silver rates from: this website
How would I be able to pinpoint the values that I want to download? Currently, I have managed to implement this using Microsoft Excel's web-query feature, which lets me select a table from the web page. However, I want to make this a standalone application.
By the way, I need to retrieve the rates to perform a calculation which is then displayed to the user.
I would greatly appreciate any ideas on how this can be achieved.
I think you need to scrape or parse the data from the website. For that, take a look at the HtmlCleaner and jsoup HTML parsers.
You can use XPath with HtmlCleaner to pinpoint the data. Here is a nice example: Xpath Example.
You can use Firefox's Firebug extension to get the XPath of an element, but judging by the website you mentioned, your XPath expression is going to be very long.
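A minimal HtmlCleaner-plus-XPath sketch might look like this; the URL and the XPath expression are placeholders, and TagNode.evaluateXPath() should be checked against the HtmlCleaner version you actually use:

```java
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

import java.net.URL;

public class RateScraper {
    public static void main(String[] args) throws Exception {
        HtmlCleaner cleaner = new HtmlCleaner();
        // Placeholder URL -- point this at the rates page.
        TagNode root = cleaner.clean(new URL("http://example.com/rates"));

        // Placeholder XPath -- Firebug will give you the real (long) one.
        Object[] hits = root.evaluateXPath("//table[@id='rates']//td");
        for (Object hit : hits) {
            System.out.println(((TagNode) hit).getText());
        }
    }
}
```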
Then you have to execute the code at an interval of your choice.
And if there is JavaScript in play, then you have to execute the JavaScript running behind the page from your Java code.
Try Rhino from Mozilla and its integration libraries, or the JDK 1.6 ScriptEngine facility.
For a ScriptEngine example, take a look here: http://metoojava.wordpress.com/2010/06/20/execute-javascript-from-java/
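As a rough illustration, the ScriptEngine facility built into the JDK since Java 6 can evaluate a snippet like this:

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class JsEval {
    public static void main(String[] args) throws Exception {
        // "JavaScript" resolves to Rhino on JDK 6/7 (Nashorn on JDK 8).
        ScriptEngine engine = new ScriptEngineManager()
                .getEngineByName("JavaScript");

        // Run a snippet of script logic; in practice you would feed in
        // the function(s) extracted from the page source.
        Object result = engine.eval("var rate = 1650.20; rate * 2;");
        System.out.println(result);   // 3300.4
    }
}
```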
In short, take a look at HTML parsers to parse the content of your page.
So here's the challenge... I need to create clean HTML from random web pages out there in the wild. My goal is to read in a page and pass it off to a library which will in turn give me back perfectly well-formed HTML.
Doesn't sound so tough, right? After all, every browser on the market effectively deals with malformed HTML, turning it into something renderable on nearly every page load. Each has its own slightly particular algorithm for cleaning up the contents (ahem... for HTML < 5, that is), but they tend to do a very good job of capturing what I like to refer to as the author's intention. So why can't I find a good Java library for this very task?
One thing to mention is that I'm not at all interested in parsing the HTML as XML. I've found that libraries such as NekoHTML, TagSoup, HtmlCleaner, and JTidy (to name a few) are more focused on converting HTML to valid XML, and in the process they lose sight of how the poorly formatted document should be restructured. With nasty HTML they frequently fail to capture the author's intention and spit out documents that render quite differently from the original source. And for this project, it's of the utmost importance that the two documents render similarly.
I am quite fond of Jericho HTML, but it doesn't seem to be the ideal candidate for this job... at least not without a lot of effort on my part. Also, native dependencies are a no-go, so the Mozilla parser is out.
Can anyone help me in my search for the perfect HTML parser? Thanks in advance!
JSoup, I would say.
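For what it's worth, jsoup's parser is built to mimic browser-style error recovery, so one sketch of the round-trip idea (not necessarily the perfect answer to the question) is simply parse-then-serialize:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Normalize {
    public static void main(String[] args) {
        // Deliberately broken markup: unclosed tags, an orphaned cell.
        String bad = "<html><body><p>one<p>two<td>orphan cell";

        // jsoup applies browser-style error recovery while parsing, then
        // serializes the repaired tree back out as well-formed HTML.
        Document doc = Jsoup.parse(bad);
        System.out.println(doc.html());
    }
}
```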
See also: which-html-parser-is-best
I have used HTML Tidy in the past.
TagSoup?