Jsoup URL.get()/post() out of memory error - java

I'm executing this code:
//doc = Jsoup.connect(data[0].getURL()).get();
Document doc = Jsoup.connect(url).post();
and am getting an out of memory exception. Obviously the web page's HTML is too large to download. All I want from the web page are the elements inside the following tags:
<div class="animal-info">...</div>
Is there a way for me to do this using Jsoup without having to download the whole webpage, or a way to get around the out of memory exception?

Try
Document doc = Jsoup.connect(url).get();
Elements divElements = doc.getElementsByTag("div");
for (Element divElement : divElements) {
    // hasClass also matches when the div carries additional classes
    if (divElement.hasClass("animal-info")) {
        textList.add(divElement.text());
        text = textList.toString();
        Log.e("Content", text);
    }
}
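The out of memory error itself can be approached by limiting how much of the response is ever held in memory. Jsoup's Connection has a maxBodySize setting that limits how much of the response body is read. The sketch below shows the same idea with plain java.io, assuming the wanted div appears within the first maxBytes of the page and that the page is UTF-8 encoded. The class name CappedReader and the method readAtMost are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads at most maxBytes from a stream and returns the text.
final class CappedReader {

    static String readAtMost(InputStream in, int maxBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int remaining = maxBytes;
        try {
            while (remaining > 0) {
                // Never request more than the bytes still allowed by the cap
                int n = in.read(buffer, 0, Math.min(buffer.length, remaining));
                if (n == -1) {
                    break; // end of stream reached before the cap
                }
                out.write(buffer, 0, n);
                remaining -= n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Assumption: UTF-8 content; a multi-byte character cut at the cap may decode badly
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(
                "<div class=\"animal-info\">cat</div><p>rest</p>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAtMost(in, 35));
    }
}
```

The truncated string could then be handed to Jsoup.parse(html), which accepts incomplete HTML. Whether the cap is large enough to include the div depends on the page.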

Related

How to get URL of video or audio from a website with jsoup

I'm using jsoup to parse all the HTML from this website: news
I can fetch all the titles and descriptions by selecting the elements I need, but I can't find the video URL element to select. How can I get the video link with jsoup or another library? Thanks!
Maybe I misunderstood your question, but can't you search for <video> elements using JSoup?
A <video> element usually carries its URL in a src attribute, either on the element itself or on a nested <source> element.
Maybe try something like this?
// HTML from your webpage
final var html = "this should hold your HTML";
// Deconstruct into element objects
final var document = Jsoup.parse(html);
// Use CSS to select the first <video> element
final var videoElement = document.select("video").first();
// Grab the video's URL by fetching the "src" attribute
final var src = videoElement.attr("src");
I did not thoroughly check the website you linked, but some websites insert videos using JavaScript. If this website inserts the video tag after the page loads, you might be out of luck, as Jsoup does not run JavaScript. It only sees the initial HTML fetched from the server.
Jsoup is an HTML parser, which is why it parses the HTML as delivered and not HTML that scripts generate later in the browser.

JSoup Scraping based on custom attributes

So I am using JSoup to scrape a website that creates a bunch of divs with dynamic class names (they change every reload), but the same attribute names. E.g:
<div class="[random text here that changes] js_resultTile" data-listing-number="[some number]">
<div class="a12_regularTile js_rollover_container " itemscope itemtype="http://schema.org/Product" data-listing-number="[same number here]">
<a href...
I've tried multiple approaches to selecting those divs and saving them in elements, but I can't seem to get it right. I've tried by attribute:
Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = doc.select("div[data-listing-number]");
I've tried by class:
Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = doc.getElementsByClass("a12_regularTile");
And:
Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = doc.select("div[class*=js_resultTile]");
I've tried another attribute method:
Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = new Elements();
for (Element element : doc.getAllElements()) {
    for (Attribute attribute : element.attributes()) {
        if (attribute.getKey().equalsIgnoreCase("data-listing-number")) {
            myEls.add(element);
        }
    }
}
None of these work. I can select the doc that gets me all the HTML, but my myEls object is always empty. What can I use to select these elements?
Are you sure these elements are present in the HTML returned by the server? They may be added later by JavaScript. If the page relies on JavaScript to build this content, Jsoup alone won't see it. More details in my answer to a similar question here: JSoup: Difficulty extracting a single element
And one more tip. Instead of using your for-for-if construction you can use this:
for (Element element : doc.getAllElements()) {
    if (element.dataset().containsKey("listing-number")) {
        myEls.add(element);
    }
}

How to get navigable links to pages from a site using jsoup?

I am implementing a basic crawler that I intend to use later in a vulnerability scanner. I am using jsoup to connect to the site and to retrieve and parse the HTML document.
I manually supply the base/root of the intended site (www.example.com) and connect.
...
Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
Document htmlDocument = connection.get();
this.htmlDocument = htmlDocument;
...
Then I retrieve all the links on the page.
...
Elements linksOnPage = htmlDocument.select("a[href]");
...
After this I loop over the links and try to collect the links to all the pages on the site.
for (Element link : linksOnPage) {
this.links.add(link.absUrl("href"));
}
The problem is as follows. Some of the links I get do not point to new pages on the site, and some are not links to pages at all. For example, I got links like:
https://example.example.com/webmail
http://193.231.21.13
mailto:example.example#exampl.com
What I need help with is filtering the links so that I get only links to new pages under the same root/base site.
This is easy. Check that absUrl starts with your site's base URL and does not end with an image, JS, or CSS extension:
if (absUrl.startsWith("http://www.ics.uci.edu/") && !absUrl.matches(".*\\.(bmp|gif|jpg|png|js|css)$")) {
    // here absUrl starts with the domain name and is not an image, JS, or CSS file
}
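A prefix check on the URL string can miss cases such as a different scheme (https instead of http) or mixed-case hosts. As a rough sketch, the same filter can be written with java.net.URI so that the scheme, host, and path are compared separately. The class name SameSiteLinkFilter, the method isSameSitePage, and the host www.example.com are assumptions for illustration; the extension list is the one from the answer above plus jpeg.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;

// Hypothetical helper: keeps only http(s) links to pages on one host.
final class SameSiteLinkFilter {

    // File extensions treated as static assets rather than pages
    private static final Pattern STATIC_ASSET =
            Pattern.compile(".*\\.(bmp|gif|jpe?g|png|js|css)$", Pattern.CASE_INSENSITIVE);

    static boolean isSameSitePage(String absUrl, String siteHost) {
        if (absUrl == null || absUrl.isEmpty()) {
            return false; // absUrl("href") yields "" when the link cannot be resolved
        }
        final URI uri;
        try {
            uri = new URI(absUrl);
        } catch (URISyntaxException e) {
            return false; // malformed link, skip it
        }
        String scheme = uri.getScheme();
        if (scheme == null
                || !(scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))) {
            return false; // drops mailto:, javascript:, ftp: and similar
        }
        String host = uri.getHost();
        if (host == null || !host.equalsIgnoreCase(siteHost)) {
            return false; // other subdomains and raw IP addresses are excluded
        }
        String path = uri.getPath();
        return path == null || !STATIC_ASSET.matcher(path).matches();
    }

    public static void main(String[] args) {
        String host = "www.example.com"; // assumed root host of the crawl
        String[] samples = {
            "https://www.example.com/about",
            "https://example.example.com/webmail",
            "http://193.231.21.13",
            "mailto:example.example#exampl.com",
            "https://www.example.com/logo.png"
        };
        for (String link : samples) {
            System.out.println(link + " -> " + isSameSitePage(link, host));
        }
    }
}
```

In the crawler loop, this check would wrap the call that adds link.absUrl("href") to the list. Storing the accepted links in a Set would also avoid revisiting pages that were already seen.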

Web crawler find the whole html code

Note that I am using java in eclipse and jsoup library.
My code is:
Document doc = null;
String crawUrl = this.getCrawlUrl();
doc = Jsoup.connect(crawUrl).get();
Elements hrefs2=doc.select("html");
System.out.println(hrefs2);
I am trying to get the whole HTML code of a specific page, but nested elements, such as a div inside another div, are missing from the output.
How can I get the whole html code from specific page?
You can try:
Document doc = Jsoup.connect(crawUrl).get();
System.out.println(doc.toString());

Using java to extract a single value from an html page:

I am continuing work on a project that I've been at for some time now, and I have been struggling to pull some data from a website. The website has an iframe that pulls in some data from an unknown source. The data is in the iframe in a tag something like this:
<DIV id="number_forecast"><LABEL id="lblDay">9,000</LABEL></DIV>
There is a BUNCH of other crap above it but this div id / label is totally unique and is not used anywhere else in the code.
jsoup is probably what you want; it excels at extracting data from an HTML document.
There are many examples available showing how to use the API: http://jsoup.org/cookbook/extracting-data/selector-syntax
The process will be in two steps:
1. parse the page and find the URL of the iframe
2. parse the content of the iframe and extract the information you need
The code would look like this:
// let's find the iframe
Document document = Jsoup.parse(inputstream, "iso-8859-1", url);
Elements elements = document.select("iframe");
Element iframe = elements.first();
// now load the iframe
URL iframeUrl = new URL(iframe.absUrl("src"));
document = Jsoup.parse(iframeUrl, 15000);
// extract the div
Element div = document.getElementById("number_forecast");
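The text of the label is still a formatted string such as "9,000" (for example, the string that div.text() would return). If a numeric value is needed, one way to convert it is sketched below, assuming the page uses US-style grouping with a comma as the thousands separator. The class name ForecastValueParser and the method parseGroupedInt are made up for illustration.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

// Hypothetical helper: turns label text such as "9,000" into an int.
final class ForecastValueParser {

    static int parseGroupedInt(String labelText) {
        // Assumption: comma is the grouping separator, as in Locale.US
        NumberFormat format = NumberFormat.getIntegerInstance(Locale.US);
        try {
            return format.parse(labelText.trim()).intValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number: " + labelText, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseGroupedInt("9,000")); // prints 9000
    }
}
```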
In your page that contains the iframe, change the source of the iframe to your own URL. This URL will be handled by your own controller, which will read the content, parse it, extract everything you need, and write it to the response. If the iframe uses absolute references, this should work.
