Object exists in the HTML but I'm unable to select it - java

I'm writing a scraper. When I use inspect element in Chrome I see the following:
But when I run Elements data = doc.select("div.item-header"); and print data, the object contains the following chunk of HTML:
<div class="item-header">
<h1 class="text size-20">Snake print bell sleeves top</h1>
<div class="text size-12 muted brandname ma_top5">
<!-- data here is irrelevant -->
</div>
</div>
So what I can't figure out is why my code gets different HTML from what is visible in Chrome's inspect element. What am I missing here?
I'm using Java, and the library is Jsoup. Any help is greatly appreciated.

Websites consist of HTML and JavaScript code. That JavaScript is often executed when the page loads, and it may modify the page's source or load additional content through asynchronous AJAX calls. Jsoup doesn't execute JavaScript, so it only sees the original HTML document.
Don't rely on Chrome's Inspect option, as it shows the HTML after any such transformations. Use View source (CTRL+U) instead. That way you'll see the original HTML source, unmodified by JavaScript (you can also try reloading the page with JavaScript disabled). That original source is what gets downloaded and parsed by Jsoup.
If that's the case and you really want to parse the data that's loaded by JavaScript, try observing the XHR requests in Chrome's Network tab. You can check this answer to see what I mean: How to Load Entire Contents of HTML - Jsoup
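The "View source" check can also be done from code. Below is a minimal sketch using only the JDK's HTTP client (Java 11+), so it sees the same unscripted HTML a static parser would. The class name, the method names and the inline sample page are made up for illustration; only the marker text comes from the question.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RawSourceCheck {

    // Fetches the HTML exactly as the server sends it; no JavaScript is executed.
    static String download(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }

    // True if the marker text is already present before any script has run.
    static boolean inRawSource(String rawHtml, String marker) {
        return rawHtml.contains(marker);
    }

    public static void main(String[] args) throws Exception {
        // With two arguments (URL, marker) the page is downloaded; otherwise a
        // made-up sample stands in for a page whose content is filled in by script.
        String html = args.length == 2
                ? download(args[0])
                : "<div class=\"item-header\"></div><script>/* fills the div later */</script>";
        String marker = args.length == 2 ? args[1] : "Snake print bell sleeves top";
        System.out.println(inRawSource(html, marker)
                ? "Present in the raw source: a static parser can select it."
                : "Missing from the raw source: it is probably added by JavaScript.");
    }
}
```

If the marker is missing from the raw source but visible in Inspect, the content is most likely injected after load, and the Network tab is the next place to look.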

Related

Jsoup with a plugin

I'm using Jsoup to scrape some online data from different stores, but I'm having trouble figuring out how to programmatically replicate what I do as a user. To get the data manually (after logging in), a user must select a store from a tree that pops up.
As best I can tell, the tree is not hard-coded into the site but is built interactively when your computer interacts with the server. When you look for the table in "view page source," there are no entries. When I inspect the tree, I do see the HTML and it seems to come from the "FancyTree" plugin.
As best as I can tell from tracking my activity on Developer Tools -- Network, the next step is a "GET" request which doesn't change the URL, so I'm not sure how my store selection is being transferred.
Any advice on how to get Jsoup or Java generally to programmatically interact with this table would be extremely helpful, thank you!
Jsoup can only parse the original source file, not the DOM that the browser builds after running scripts. To parse that DOM, you'll need to render the page with something like HtmlUnit. Then you can parse the generated HTML with Jsoup.
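import com.gargoylesoftware.htmlunit.WebClient;      // org.htmlunit.WebClient in HtmlUnit 3.x
import com.gargoylesoftware.htmlunit.html.HtmlPage;  // org.htmlunit.html.HtmlPage in HtmlUnit 3.x
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
// load page using HtmlUnit and fire scripts (myURL is the page address)
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(myURL);
// convert page to generated HTML and convert to document
Document doc = Jsoup.parse(myPage.asXml());
// do something with html content
System.out.println(doc.html());
// clean up resources
webClient.close();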
See Parsing Javascript Generated Page with Jsoup.

Jsoup not getting full html

I am trying to use Jsoup to parse the HTML from the URL http://www.threadflip.com/shop/search/john%20hardy
Jsoup seems to get only the data from the line
<![CDATA[ window.gon= ..............
Does anyone know why this would be?
Document doc = Jsoup.connect("http://www.threadflip.com/shop/search/john%20hardy").get();
The site you are trying to parse loads most of its content asynchronously via AJAX calls. Jsoup does not interpret JavaScript and therefore does not act like a browser. It seems that the store is filled by calling their API:
http://www.threadflip.com/api/v3/items?attribution%5Bapp%5D=web&item_collection_id=&q=john+hardy&page=1&page_size=30
So you may need to load the API URL directly in order to read the data you want. Note that the response is JSON, not HTML, so the Jsoup HTML parser is not much help here. But there are great JSON libraries available. I use JSON-Simple.
Alternatively, you may switch to Selenium WebDriver, which actually remote-controls a real browser. This should have no trouble accessing all items from the page.
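For the direct API route, a rough sketch with the JDK's own HTTP client (Java 11+) might look like the following. The endpoint and parameter names are copied from the URL above and may no longer be live; the class name, the method names and the --fetch switch are made up for illustration. Parsing the JSON is left to whichever library you prefer.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class ApiSearch {

    // Endpoint taken from the URL quoted in this answer; it may no longer exist.
    static final String ENDPOINT = "http://www.threadflip.com/api/v3/items";

    // Builds the query URL, form-encoding every parameter name and value.
    static String buildSearchUrl(String query, int page, int pageSize) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("attribution[app]", "web");
        params.put("item_collection_id", "");
        params.put("q", query);
        params.put("page", String.valueOf(page));
        params.put("page_size", String.valueOf(pageSize));
        String queryString = params.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
        return ENDPOINT + "?" + queryString;
    }

    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Downloads the raw JSON body as a string.
    static String fetchJson(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        String url = buildSearchUrl("john hardy", 1, 30);
        System.out.println(url);
        // Only touch the network when explicitly asked to (hypothetical switch).
        if (args.length > 0 && args[0].equals("--fetch")) {
            System.out.println(fetchJson(url));
        }
    }
}
```

Building the query string from a map keeps the encoding in one place, so a search term with spaces or brackets ends up encoded the same way as in the URL the browser sends.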

breakdown of individual xpath selectors and results

Assuming the following document:
<html>
<body>
<div>
<a>Home</a>
</div>
<div>
<a>Link to a page</a>
<b>Bold text</b>
<a>Link to another page</a>
</div>
</body>
</html>
If I run this XPath I get the following result:
/html/body/div/a/text() -> HomeLink to a pageLink to another page
I am looking for a way to reverse-engineer the results and extract the individual XPath selectors and their associated results as simply as possible. Something like:
/html/body/div[1]/a[1]/text() <-> Home
/html/body/div[2]/a[1]/text() <-> Link to a page
/html/body/div[2]/a[2]/text() <-> Link to another page
I could write a complicated program that traverses the DOM tree or uses SAX parsing, but that looks too complex.
Can someone figure out a simpler way to achieve this result in XPath (maybe helped by a bit of Java as well)? Basically, the problem is to know the index of each tag and the associated result for each successful combination.
Thanks
Unfortunately, I don't know Java.
Here is some sample Ruby code using the nokogiri gem:
require 'nokogiri'
doc = Nokogiri::HTML open('/tmp/input.html')
doc.xpath('//a//text()').each {|a| puts "#{a.path} -> #{a.text}" }
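For a Java-side equivalent, here is a minimal sketch that uses only the JDK's javax.xml APIs. It makes a few assumptions: the input is well-formed XML (real-world HTML would need cleaning first, which is not shown), the links are a elements as the XPath in the question implies, and every element gets a positional index, so paths read /html[1]/body[1]/... rather than /html/body/.... The class and method names are illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathBreakdown {

    // Walks up from an element to the root, counting same-named preceding
    // siblings at each level to build a path such as /html[1]/body[1]/div[2]/a[1].
    static String indexedPath(Node element) {
        StringBuilder path = new StringBuilder();
        for (Node n = element; n != null && n.getNodeType() == Node.ELEMENT_NODE; n = n.getParentNode()) {
            int index = 1;
            for (Node s = n.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
                if (s.getNodeType() == Node.ELEMENT_NODE && s.getNodeName().equals(n.getNodeName())) {
                    index++;
                }
            }
            path.insert(0, "/" + n.getNodeName() + "[" + index + "]");
        }
        return path.toString();
    }

    // Evaluates an expression that selects text nodes and pairs each hit
    // with the indexed path of its parent element.
    static List<String> breakdown(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                Node text = hits.item(i);
                lines.add(indexedPath(text.getParentNode()) + "/text() <-> " + text.getNodeValue().trim());
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query the document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html><body><div><a>Home</a></div>"
                + "<div><a>Link to a page</a><b>Bold text</b><a>Link to another page</a></div>"
                + "</body></html>";
        breakdown(xml, "/html/body/div/a/text()").forEach(System.out::println);
    }
}
```

Run against that sample, it prints:
/html[1]/body[1]/div[1]/a[1]/text() <-> Home
/html[1]/body[1]/div[2]/a[1]/text() <-> Link to a page
/html[1]/body[1]/div[2]/a[2]/text() <-> Link to another page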

Getting a NULL when calling getElementById

I have the HTML (partial) shown below. I want to find the element using:
org.jsoup.nodes.Element elem = doc.getElementById("date-2011-04-23");
But I always get a NULL. Can anyone help me? As a check, I've also coded this using VB.NET, and there I can access this element.
<td class="" id="date-2011-04-23" data-week="3" data-wkday="6">...</td>
Assuming that your tag looks like:
<td class="" id="date-2011-04-23" data-week="3" data-wkday="6">...</td>
You can use the JSoup Selector API for this:
for (Element element : doc.select("#date-2011-04-23")) {
    // Do something here
}
If you need only the first Element:
Element element = doc.select("#date-2011-04-23").first();
The reason you're not finding that content in the HTML is that the schedule is loaded from a JSON file by the browser executing Javascript, then adding it to the browser DOM. Jsoup does not execute Javascript, so it only can see what is in the source HTML.
If you use a debugging proxy like Charles (or the debugging network pane in Chrome / Firefox), you can see all the requests a browser makes to render a page. In this example, the schedule data is coming from http://mlb.mlb.com/gen/schedule/phi/2011_4.json

Display page data on same page without forwarding it to another page

In my webpage (Eg Link1: http://localhost:8086/MyStrutsApp/login.do) I have several links. When a user clicks on one of the links, he is taken to another page (Eg link2: http://localhost:8086/MyStrutsApp/AddBook.jsp) to fill an html form.
Now what I want to achieve is that when any user clicks on the link, that html form (Link2) is displayed on the same page (i.e. Link1).
I have no idea how to achieve this.
The AJAX way to achieve this is the following:
you have a DIV on your original page that will be replaced (i.e., it either has content that only makes sense in the original context or is completely empty)
your Link2 servlet produces only the contents of the above DIV (and not the contents of that page)
you use a tiny bit of Javascript to make an AJAX call and fill the DIV with the response.
If you want to use Dojo, the HTML page would look like this:
<!-- main content -->
<div id="leftpanel">
<h3>This content will be replaced</h3>
You can add a book
</div>
The Javascript code would look like this:
<script src="http://ajax.googleapis.com/ajax/libs/dojo/1.6/dojo/dojo.xd.js" type="text/javascript"></script>
<script type="text/javascript">
function display_wait(s) {
    var mainPanel = dojo.byId("leftpanel");
    mainPanel.innerHTML = '<div class="waitingmsg">' + s + '</div>';
}
function updateFromURL(url) {
    display_wait("loading content");
    dojo.xhrGet({
        url: url,
        load: function(result) {
            dojo.byId('leftpanel').innerHTML = result;
        }
    });
}
</script>
(As Rafa mentioned, you can use the same technique to display the new part in a dialog)
You can always use jQuery to present a dialog ... http://jqueryui.com/demos/dialog/
Retrieve the page with AJAX and present it inside the dialog.
To display one page within another, use an iframe. See the iframe docs.
To make a link on the outer page load its target page into the iframe, give the iframe a name attribute, and give the link a matching target attribute.
