Basically, I would need a table with every book that exists, and I don't want to build that myself, because I'm a very lazy person xD. So, my question is: can I take a site that I have in mind, cut off the parts of the site I don't need, and leave only the search part (maybe with some changes to the layout)? Then run the search, find the book, and store in my database only the data that makes sense for me. Is that possible? I heard that Jsoup could help.
So, I just want some tips. (thx for reading).
the site: http://www.isbn.bn.br/website/consulta/cadastro
Yes, you can do that using Jsoup. The main problem is that the URL you shared relies on JavaScript, so you'll need to use Selenium to force the JS execution, or you can get the book's direct URL and parse that instead.
The way to parse a web page using Jsoup is:
Document document = Jsoup.connect("YOUR-URL-GOES-HERE")
.userAgent("Mozilla/5.0")
.get();
Then you get the whole HTML back as a Document, so you can retrieve any Element it contains using CSS selectors. For example, if you want to retrieve the title of the page, you can use:
Elements elements = document.select("title");
And so on for every HTML tag that you want to retrieve information from. You can check the Jsoup documentation and look at some of the examples explained there: Jsoup
I hope it helps you!
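Putting the pieces together, here is a minimal sketch. It parses an inline HTML string with Jsoup.parse so no network connection is needed; the HTML content is made up purely for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TitleExample {
    // Parse an HTML string and return the text inside the <title> tag.
    static String extractTitle(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("title").text();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>My Page</title></head><body></body></html>";
        System.out.println(extractTitle(html)); // prints "My Page"
    }
}
```

For a live page, you would swap Jsoup.parse(html) for the Jsoup.connect(...).userAgent("Mozilla/5.0").get() call shown above; the selecting part stays the same.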
I have written a Jsoup class file to scrape a page and grab the hrefs for every element on the page. What I would like to do from there is to extract the Xpath for each of the elements from their hrefs.
Is there a way to do this in Jsoup? If not, what is the best way to do this in Java (and are there any resources on it)?
Update
I want to clarify my question.
I want to scan a page for all the href identifiers and grab the links (that part is done). For my script, I need to get the xpath of all the elements I have identified and scraped from the (scanned) page.
The problem is that I assumed I could easily translate the href links to Xpath.
The comment from @Rishal dev Singh ended up being the right answer.
Check his link here:
http://stackoverflow.com/questions/7085539/does-jsoup-support-xpath
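As the linked question explains, Jsoup itself does not support XPath. One workaround, sketched below with only JDK classes, is to feed the markup to the standard DOM parser and evaluate XPath with javax.xml.xpath. This assumes the input is well-formed XHTML (real-world HTML would first need cleaning, e.g. with Jsoup's output):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathHrefs {
    // Collect every href attribute in the document via an XPath query.
    // Input must be well-formed XHTML; the DOM parser rejects sloppy HTML.
    static List<String> extractHrefs(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//a/@href", doc, XPathConstants.NODESET);
            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                hrefs.add(nodes.item(i).getNodeValue());
            }
            return hrefs;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><a href=\"a.html\">A</a><a href=\"b.html\">B</a></body></html>";
        System.out.println(extractHrefs(xhtml)); // [a.html, b.html]
    }
}
```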
Perhaps I'm doing something wrong, but I'm trying to parse this page using Jsoup, and for some reason it doesn't find the div I'm looking for:
doc = Jsoup.connect(params[0]).get();
content = doc.select("div.itemcontent").first().text();
Where am I going wrong here?
Thanks
The problem: you get a different website using Jsoup than using a browser. I set another user agent in Jsoup, but no luck. Possibly the content is changed through JavaScript?!
However, you can change the selector according to the website you actually get.
It's always a good idea to take a look at the document as it's parsed - a simple System.out.println(doc) is enough.
Here are some steps you can try:
Print your Document doc (eg. using System.out)
Search for the required value(s) in there
Select those tags instead
I just played around a bit, but maybe you can use this snippet:
content = doc.select("description").first().text();
It seems to me, <description>...</description> is what you're looking for.
http://support.xbox.com/en-us/contact-us uses JavaScript to create some lists. I want to be able to parse these lists for their text. So for the above page, I want to return the following:
Billing and Subscriptions
Xbox 360
Xbox LIVE
Kinect
Apps
Games
I was trying to use Jsoup for a while before noticing the content was generated using JavaScript. I have no idea how to go about parsing a page for its JavaScript-generated content.
Where do I begin?
You'll want to use an HTML+JavaScript library like Cobra. It'll parse the DOM elements in the HTML as well as apply any DOM changes caused by JavaScript.
You could always import the whole page, perform a string separation on it (splitting on newlines, etc.), look for the string containing the information, then return the string you want and pull pieces out of it. That is the dirty way of doing it; not sure if there is a clean way.
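The "dirty" string-separation approach above can be sketched with plain JDK string operations. This is fragile and for illustration only (the markup here is invented, not the actual Xbox page):

```java
public class DirtyExtract {
    // Crude extraction: return the text between the first occurrence
    // of startTag and the next occurrence of endTag, or null if absent.
    static String between(String page, String startTag, String endTag) {
        int start = page.indexOf(startTag);
        if (start < 0) return null;
        start += startTag.length();
        int end = page.indexOf(endTag, start);
        if (end < 0) return null;
        return page.substring(start, end).trim();
    }

    public static void main(String[] args) {
        String page = "<ul><li>Kinect</li><li>Apps</li></ul>";
        System.out.println(between(page, "<li>", "</li>")); // prints "Kinect"
    }
}
```

In a loop you would advance past each match to collect every list item, but a real HTML parser breaks far less often.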
I don't think that text is generated by JavaScript... If I disable JavaScript, those options can be found inside the HTML at this location (a jQuery selector, just because it was easier to hand-write than figuring out the XPath without JavaScript enabled :))
'div#ShellNavigationBar ul.NavigationElements li ul li a'
Regardless, in direct answer to your query: you'd have to evaluate the JavaScript within the scope of the document, which I expect would be rather complex in Java. You'd have more luck identifying the JavaScript file that generates the relevant content and just parsing that directly.
Please bear with this trivial question; the answer is available only in bits and pieces on Stack Overflow.
I have an HTML dump of a website in the form of a String. I want to extract text from specific tags in it.
In other words, I want to mimic:
Document doc = Jsoup.connect(url).userAgent("Mozilla").get();
Elements links = doc.getElementsByTag("cite");
I am not using Jsoup.connect because I don't want it to connect to the website (I have another service for that which returns the HTML dump as text). I found HTMLEditorKit for converting text to an HTMLDocument, but it doesn't seem to be as easy to use as Jsoup or HTMLParser, or I am unable to get it working.
Any help would be useful.
Thanks.
If you have used Jsoup before and it worked, you should continue using it; Jsoup can parse a String directly, without connecting anywhere.
Document doc = Jsoup.parse("<html>...");
Elements links = doc.getElementsByTag("cite");
should do.
see: The API
I want to make a program that will retrieve some information from a URL.
For example, I give the URL below, from
librarything
How can i retrieve all the words below the "TAGS" tab, like
Black Library fantasy Thanquol & Boneripper Thanquol and Bone Ripper Warhammer ?
I am thinking of using Java and designing a data-mining wrapper, but I am not sure how to start. Can anyone give me some advice?
EDIT:
You gave me excellent help, but I want to ask something else.
For every tag, we can see how many times it has been used when we press the "number" button. How can I retrieve that number as well?
You could use an HTML parser like Jsoup. It allows you to select HTML elements of interest using simple CSS selectors:
E.g.
Document document = Jsoup.connect("http://www.librarything.com/work/9767358/78536487").get();
Elements tags = document.select(".tags .tag a");
for (Element tag : tags) {
System.out.println(tag.text());
}
which prints
Black Library
fantasy
Thanquol & Boneripper
Thanquol and Bone Ripper
Warhammer
Please note that you should read the website's robots.txt (if any) and the website's terms of service (if any), or your server might be IP-banned sooner or later.
I've done this before using PHP with a page scrape, then parsing the HTML as a string using Regular Expressions.
Example here
I imagine there's something similar in java and other languages. The concept would be similar:
Load page data.
Parse the data (e.g. with a regex, or via the DOM model using CSS selectors or XPath selectors).
Do what you want with the data :)
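The steps above can be sketched in Java with java.util.regex. Bear in mind that regex is brittle on real-world HTML (the library page's actual markup will differ; this input is invented for illustration), which is why a parser like Jsoup is usually the safer route:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexScrape {
    // Pull the inner text of every <a>...</a> link with a regex.
    // Non-greedy (.*?) stops each match at the nearest closing tag.
    static List<String> linkTexts(String html) {
        Matcher m = Pattern.compile("<a[^>]*>(.*?)</a>").matcher(html);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/t/fantasy\">fantasy</a> <a href=\"/t/warhammer\">Warhammer</a>";
        System.out.println(linkTexts(html)); // [fantasy, Warhammer]
    }
}
```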
It's worth remembering that some people might not appreciate you data mining their site and profiting from / redistributing it on a large scale.