I want to write a program that retrieves some information from a URL.
For example, take the URL below, from
librarything
How can I retrieve all the words under the "TAGS" tab, such as
Black Library fantasy Thanquol & Boneripper Thanquol and Bone Ripper Warhammer?
I am thinking of using Java and designing a data-mining wrapper, but I am not sure how to start. Can anyone give me some advice?
EDIT:
You gave me excellent help, but I want to ask something else.
When we press the "number" button, we can see how many times each tag has been used. How can I retrieve that number as well?
You could use an HTML parser like Jsoup. It allows you to select HTML elements of interest using simple CSS selectors:
E.g.
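import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document document = Jsoup.connect("http://www.librarything.com/work/9767358/78536487").get();
Elements tags = document.select(".tags .tag a");
for (Element tag : tags) {
    System.out.println(tag.text());
}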
which prints
Black Library
fantasy
Thanquol & Boneripper
Thanquol and Bone Ripper
Warhammer
Please note that you should read the website's robots.txt (if any) and its terms of service (if any), or your server might be IP-banned sooner or later.
I've done this before using PHP with a page scrape, then parsing the HTML as a string using Regular Expressions.
Example here
I imagine there's something similar in Java and other languages. The concept would be similar:
Load page data.
Parse the data (i.e. with a regex, or via the DOM and some CSS or XPath selectors).
Do what you want with the data :)
It's worth remembering that some people might not appreciate you data mining their site and profiting from it or redistributing it on a large scale.
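The three steps above could look roughly like this in Java. This is only a sketch: the markup in SAMPLE_HTML is made up for illustration (the real page structure has to be checked by hand), and a fixed string stands in for the download step so the example runs without network access.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagScrapeSketch {
    // Hypothetical markup, NOT copied from any real site.
    static final String SAMPLE_HTML =
        "<div class=\"tags\">"
        + "<span class=\"tag\"><a href=\"/tag/fantasy\">fantasy</a></span>"
        + "<span class=\"tag\"><a href=\"/tag/Warhammer\">Warhammer</a></span>"
        + "</div>";

    // Step 2: parse the data. Here a regex collects the text of every <a> element.
    static List<String> extractLinkTexts(String html) {
        Pattern anchor = Pattern.compile("<a\\b[^>]*>(.*?)</a>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = anchor.matcher(html);
        List<String> texts = new ArrayList<>();
        while (m.find()) {
            texts.add(m.group(1).trim());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Step 1: load page data (a fixed string replaces the HTTP request here).
        String html = SAMPLE_HTML;
        // Step 3: do what you want with the data; this sketch just prints it.
        for (String text : extractLinkTexts(html)) {
            System.out.println(text);
        }
    }
}
```

A regex like this is fragile against nested or malformed markup, which is why a real parser is usually the safer choice for step 2.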
Basically, I need a table with all the books that exist, and I don't want to build it by hand, because I'm a very lazy person xD. So my question is: can I take a site I have in mind, cut away the parts I don't need, and keep only the search part (maybe with some layout changes), then run the search, find the book, and store in my database only the data that makes sense for me? Is that possible? I heard that Jsoup could help.
So, I just want some tips. (Thanks for reading.)
the site: http://www.isbn.bn.br/website/consulta/cadastro
Yes, you can do that using Jsoup. The main problem is that the URL you shared uses JavaScript, so you'll need to use Selenium to force the JS execution, or you can get the book URL and parse that directly.
The way to parse a web page using Jsoup is:
Document document = Jsoup.connect("YOUR-URL-GOES-HERE")
        .userAgent("Mozilla/5.0")
        .get();
Then you have the whole HTML in a Document, so you can get any Element contained in it using CSS selectors. For example, if you want to retrieve the title of the page, you can use:
Elements elements = document.select("title");
Do the same for every HTML tag that you want to retrieve information from. You can check the Jsoup documentation and the examples explained there: Jsoup
I hope it helps you!
I get an iframe link http:\\abc.com?=blahblahiframelink from a third-party REST service. I want to extract multiple values from the content of that iframe.
Here is simplified HTML. Please understand that the real HTML is far more complex, with multiple nested divs and tables.
.css stuff
<html>
<div>
<p> NEED THIS INFO </p>
....
blah blah
<img src="NEED THIS INFO" > </img>
</div>
</html>
I marked "NEED THIS INFO" in the code above to show what I want to extract: attribute values as well as element values.
I am thinking of first storing the iframe content in a Java string in my REST service and then using some crazy regex to get the information I want.
Before I attempt that, I want to check whether there is a more efficient way to do this. Is there an HTML parser I can use to get the content in a structured format?
If not, please tell me how to store the iframe content in a Java string.
Please let me know if you need more info.
For those coming here, there are a couple of ways to do this. The most efficient is to read the iframe into a string like this, using HttpURLConnection or HttpsURLConnection (conn is the connection). Iframes can be fetched directly from their links.
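import java.io.BufferedReader;
import java.io.InputStreamReader;

// conn is an already opened HttpURLConnection or HttpsURLConnection
BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream()));
StringBuilder sb = new StringBuilder();
String line;
while ((line = br.readLine()) != null) {
    sb.append(line).append("\n");
}
br.close();
String html = sb.toString(); // the whole iframe document as one string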
The most efficient approach is, of course, to limit the number of middlemen like Mechanize and the number of URL calls.
It is possible to do this with Java's java.net or java.nio packages, just by creating an HttpURLConnection or a javax.net.ssl HttpsURLConnection to get your page, the cookies, etc. From there the answer unfolds.
To parse the page in Java you have several options, with A and B being the better ones I know of:
A. Create an XML document and run an XPath query on it. I am short on time, so I've posted a resource for you. All you need is the page as a string. This fits your needs if you are not looking for something specific. Once you have the page, just extract everything you need.
http://www.mkyong.com/tutorials/java-xml-tutorials/
B. Regex. Look online to find a good solution; I am limited to two links. MyRegexTester is also a great free resource for learning and testing regex, which is less daunting than you think, especially in Java. Use wildcards and lookaheads.
C. Better yet, if you are not resource constrained (though that appears not to be the case), use a parser like Jsoup and set it to output XML. Jsoup does the XML parsing for you and allows you to use an XPath to get the result.
D. Use HttpUnit or a GUI-less browser like Mechanize in Python (http://www.pythonforbeginners.com/cheatsheet/python-mechanize-cheat-sheet/), Perl, or Ruby. My favorite is Python, since there are more ready-made modules and the speeds are about the same. Python also has a Jsoup plugin.
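As a rough illustration of option A, here is a sketch that runs XPath queries with the JDK's built-in javax.xml.xpath package. It assumes the markup is already well-formed XML, which real-world HTML often is not (it may need cleaning first). The SAMPLE string and the file name picture.png are invented for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathSketch {
    // Hypothetical, well-formed stand-in for the iframe content.
    static final String SAMPLE =
        "<html><body><div>"
        + "<p> NEED THIS INFO </p>"
        + "<img src=\"picture.png\"/>"
        + "</div></body></html>";

    // Evaluates one XPath expression against an XML string and returns the result as text.
    static String evaluate(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Element value: text of the <p>, with surrounding whitespace collapsed.
        System.out.println(evaluate(SAMPLE, "normalize-space(//div/p)"));
        // Attribute value: the src attribute of the <img>.
        System.out.println(evaluate(SAMPLE, "//div/img/@src"));
    }
}
```

This covers both cases from the question, an element value and an attribute value, without any regex.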
http://support.xbox.com/en-us/contact-us uses javascript to create some lists. I want to be able to parse these lists for their text. So for the above page I want to return the following:
Billing and Subscriptions
Xbox 360
Xbox LIVE
Kinect
Apps
Games
I was trying to use JSoup for a while before noticing it was generated using javascript. I have no idea how to go about parsing a page for its javascript generated content.
Where do I begin?
You'll want to use an HTML+JavaScript library like Cobra. It'll parse the DOM elements in the HTML as well as apply any DOM changes caused by JavaScript.
You could always import the whole page, split it into pieces (on line breaks, etc.), look for the piece containing the information, and then pull what you want out of that string. That is the dirty way of doing it; I'm not sure if there is a clean way.
I don't think that text is generated by javascript... If I disable javascript those options can be found inside the html at this location (a jquery selector just because it was easier to hand-write than figuring out the xpath without javascript enabled :))
'div#ShellNavigationBar ul.NavigationElements li ul li a'
Regardless, in direct answer to your query: you'd have to evaluate the JavaScript within the scope of the document, which I expect would be rather complex in Java. You'd have more luck identifying the JavaScript file generating the relevant content and parsing that directly.
What is the best way to detect data types inside an HTML page using Java facilities such as the DOM API, regexp, etc.?
I'd like to detect types the way the Skype plugin does for phone/Skype numbers, and similarly for addresses, emails, times, etc.
'Types' is an inappropriate term for the kind of information you are referring to. Choice of DOM API or regex depends upon the structure of information within the page.
If you know the structure, (for example tables being used for displaying information, you already know from which cell you can find phone number and which cell you can find email address), it makes sense to go with a DOM API.
Otherwise, you should use regex on plain HTML text without parsing it.
I'd use regexes in the following order:
Extract only the BODY content
Remove all tags to leave just plain text
Match relevant patterns in text
Of course, this assumes that markup isn't providing hints, and that you're purely extracting data, not modifying page context.
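A minimal sketch of those three steps in Java, using e-mail addresses as the pattern to match. The e-mail regex is deliberately simple and is my own assumption, not a complete validator; the SAMPLE page is invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlainTextMatchSketch {
    // Hypothetical page; the address in <head> should be ignored.
    static final String SAMPLE =
        "<html><head><title>x@head.example</title></head>"
        + "<body><p>Mail <b>jane@example.com</b> or bob@example.org.</p></body></html>";

    private static final Pattern BODY = Pattern.compile(
            "<body[^>]*>(.*?)</body>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    // Simplified e-mail pattern, good enough for a demo only.
    private static final Pattern EMAIL = Pattern.compile(
            "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    static List<String> findEmails(String html) {
        // 1. Extract only the BODY content (fall back to the whole input if there is none).
        Matcher body = BODY.matcher(html);
        String content = body.find() ? body.group(1) : html;
        // 2. Remove all tags to leave just plain text.
        String text = TAG.matcher(content).replaceAll(" ");
        // 3. Match the relevant pattern in the text.
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findEmails(SAMPLE));
    }
}
```

Phone numbers, times, and so on would each need their own pattern in place of EMAIL.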
Hope this helps,
Phil Lello
Is there a library which can transform any given HTML page, with JS and CSS all over it, into a minimalistic uniform format?
For instance, if we render stackoverflow homepage, I want it to be shown in a minimal format. I want all other sites to be rendered down.
Sort of like Lynx web browser but with minimal graphics.
The best tool for HTML to Lynx style text I have come across is Jericho's Renderer.
It's easy to use:
Source source=new Source(new URL(sourceUrlString)); // or new Source("<html>pass in raw html string</html>");
String renderedText=source.getRenderer().toString();
System.out.println("\nSimple rendering of the HTML document:\n");
System.out.println(renderedText);
(from here)
and handles HTML in the wild (badly formatted) very well.
Here are the first few lines of this page formatted this way using Jericho:
Stack Exchange log in | careers | chat
| meta | about | faq
Stack Overflow
* Questions
* Tags
* Users
* Badges
* Unanswered
* Ask Question
Java HTML normalizer?
**
IS there a library which can transform
any given HTML page with JS, CSS all
over it, into a minimalistic uniform
format?
For instance, if we render
stackoverflow homepage, I want it to
be shown in a minimal format. I want
all other sites to be rendered down.
Sort of like Lynx web browser but with
minimal graphics.
java lynx link|edit|flag asked 2 days
ago Kim Jong Woo 593112 89% accept
rate Do you want to transform your
HTML code to simpler HTML code, or do
your want to show this "minimalistic
uniform format" to your user? Or do
you want to create a image? – Paŭlo
Ebermann yesterday simpler html code
without sacrificing the relative
positioning of the elements. – Kim
Jong Woo 16 hours ago
2 Answers
To answer your firtst question: No. I
don'nt think there is a library for
that purpose. (At least this is what
my "googeling" resulted in).
And i think the reason for this is,
that what you want is a very special
need.
So as a solution for your problem you
can parse the html and display it the
way you want to in a JEditorpane or
whatever you are using for display.
I can only suggest a way i would do it
(this is because i am familiar with
xml and everything around it).
*
Use a library to ensure that your html conforms to xhtml:
http://htmlcleaner.sourceforge.net/release.php
*
then either parse the xml with DOM or SAX parsers and display it the
way you want.
or
* use xslt to transform the document into some other html document
which results in a view that fits your
needs.
or
* use one of the available html parser librarys. (The most of which i
found where kind of outdated (2006))
but they could be an option for you.
This is just one suggestion how you
could do it. I'm sure there are
thousands of other ways which will do
the same thing.
To answer your first question: no, I don't think there is a library for that purpose. (At least this is what my googling turned up.)
And I think the reason for this is that what you want is a very special need.
So as a solution for your problem, you can parse the HTML and display it the way you want in a JEditorPane or whatever you are using for display.
I can only suggest the way I would do it (because I am familiar with XML and everything around it).
Use a library to ensure that your HTML conforms to XHTML: http://htmlcleaner.sourceforge.net/release.php
Then either parse the XML with a DOM or SAX parser and display it the way you want,
or
use XSLT to transform the document into some other HTML document which results in a view that fits your needs,
or
use one of the available HTML parser libraries. (Most of the ones I found were kind of outdated (2006), but they could be an option for you.)
This is just one suggestion for how you could do it. I'm sure there are thousands of other ways to do the same thing.
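For the XSLT route, a small sketch using the JDK's javax.xml.transform API might look like this. It assumes the page has already been cleaned into well-formed XHTML without namespaces. The stylesheet (keep only h1 and p elements) and the sample page are my own invention, just to show the mechanics.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltSimplifySketch {
    // Hypothetical stylesheet: copy only headings and paragraphs into a bare page.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"/\">"
        + "<html><body><xsl:copy-of select=\"//h1 | //p\"/></body></html>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Hypothetical, already well-formed input page.
    static final String SAMPLE =
        "<html><head><script>track()</script></head><body>"
        + "<div class=\"nav\">menu</div><h1>Title</h1>"
        + "<div><p>First paragraph.</p></div></body></html>";

    // Applies the stylesheet to the XHTML string and returns the simplified markup.
    static String simplify(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(simplify(SAMPLE));
    }
}
```

Scripts, navigation, and everything else not selected by the stylesheet simply drop out of the result; a real stylesheet would need more rules to keep links, lists, and relative ordering.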