Web crawler: find the whole HTML code - Java

Note that I am using Java in Eclipse with the jsoup library.
My code is:
Document doc = null;
String crawUrl = this.getCrawlUrl();
doc = Jsoup.connect(crawUrl).get();
Elements hrefs2=doc.select("html");
System.out.println(hrefs2);
I am trying to get the whole HTML code of a specific page, but when there is something like a div nested inside another div, I am not getting it.
How can I get the whole HTML code of a specific page?

You can try:
Document doc = Jsoup.connect(crawUrl).get();
System.out.println(doc.toString());
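If you only need the markup of one element rather than the whole document, selecting it and calling outerHtml() also works; nested divs are included in the output (a small sketch using the body element as an example):
// markup of the <body> element, including every nested child element
System.out.println(doc.body().outerHtml());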

Related

How to get URL of video or audio from a website with jsoup

I'm using jsoup to parse all the HTML from this website: news
I can fetch the title and description by selecting the Elements I need, but I can't find the video URL element to select. How can I get the video link with jsoup or another kind of library? Thanks!
Maybe I misunderstood your question, but can't you search for <video> elements using JSoup?
All <video> elements have a so-called src attribute.
Maybe try something like this?
// HTML from your webpage
final var html = "this should hold your HTML";
// Deconstruct into element objects
final var document = Jsoup.parse(html);
// Use CSS to select the first <video> element
final var videoElement = document.select("video").first();
// Grab the video's URL by fetching the "src" attribute
final var src = videoElement.attr("src");
Now I did not thoroughly check the website you linked, but some websites insert videos using JavaScript. If this website inserts the video tag after loading, you might be out of luck, as Jsoup does not run JavaScript; it only operates on the initial HTML fetched from the page.
Jsoup is an HTML parser, which is why it only parses HTML and not, say, JavaScript-generated HTML.
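One more thing worth checking (an assumption about this particular site, not something confirmed from it): some pages put the URL on a nested <source> element instead of on the <video> tag itself, in which case a selector like this may help:
// fall back to a nested <source> element when the <video> tag has no src attribute
final var sourceSrc = document.select("video source").attr("src");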

How to select an element in Jsoup using its html content?

I want to select an element in Jsoup using its html content.
Example: LOCATION:
How can I do it? I couldn't find any appropriate selector methods directly. Is there any workaround available?
Using the Jsoup library you can extract a value from HTML using the name, ID, or class of an element.
String html = "<html><head><title>Title</title></head> <body><div id='location'>Mumbai, India</div></body></html>";
Document document = Jsoup.parse(html);
String content = document.getElementById("location").outerHtml();
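If the element has no id and you really do need to match on its content, jsoup's :contains selector is a possible workaround (a sketch, assuming the text LOCATION appears inside the target element):
// select <div> elements whose text contains "LOCATION" (matching is case-insensitive)
Elements located = document.select("div:contains(LOCATION)");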
Happy Coding :-)

Open Link in HTML with JSOUP

I have a table in an HTML page that I have to iterate through to open the links into a next page where all the information is. On that page I extract the data I need and then return to my base page.
How do I change pages with the JSoup framework in Java? Is it actually possible?
If you look at the JSoup Cookbook, there is an example of getting all the links inside an HTML element. Iterate the Elements from this example and do a Document doc = Jsoup.connect(<url from Elements>).get();. You can then do String htmlFromLink = doc.toString(); and get the HTML from the link.
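A minimal sketch of that idea (the "table a[href]" selector and the URL here are assumptions; adjust them to your page):
// load the page that contains the table
Document basePage = Jsoup.connect("http://example.com/table-page").get();
// collect the links inside the table
Elements links = basePage.select("table a[href]");
for (Element link : links) {
    // follow each link and fetch the next page
    Document linkedPage = Jsoup.connect(link.absUrl("href")).get();
    String htmlFromLink = linkedPage.toString();
    // extract whatever data you need from linkedPage here
}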

Using Java to extract a single value from an HTML page

I am continuing work on a project that I've been at for some time now, and I have been struggling to pull some data from a website. The website has an iframe that pulls in some data from an unknown source. The data is in the iframe in a tag something like this:
<DIV id="number_forecast"><LABEL id="lblDay">9,000</LABEL></DIV>
There is a BUNCH of other crap above it but this div id / label is totally unique and is not used anywhere else in the code.
jsoup is probably what you want; it excels at extracting data from an HTML document.
There are many examples available showing how to use the API: http://jsoup.org/cookbook/extracting-data/selector-syntax
The process will be in two steps:
parse the page and find the url of the iframe
parse the content of the iframe and extract the information you need
The code would look like this:
// let's find the iframe
Document document = Jsoup.parse(inputstream, "iso-8859-1", url);
Elements elements = document.select("iframe");
Element iframe = elements.first();
// now load the iframe
URL iframeUrl = new URL(iframe.absUrl("src"));
document = Jsoup.parse(iframeUrl, 15000);
// extract the div
Element div = document.getElementById("number_forecast");
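To finish the second step, the actual value can then be read from the label inside that div (a small addition, reusing the lblDay id from the question's HTML):
// read the "9,000" value from the <LABEL id="lblDay"> inside the div
String value = document.getElementById("lblDay").text();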
In the page that contains the iframe, change the source of the iframe to your own URL. That URL will be processed by your own controller, which will read the content, parse it, extract everything you need, and write it to the response. If there are absolute references in your iframe, this should work.

Jsoup URL.get()/post() out of memory error

I'm executing this code:
//doc = Jsoup.connect(data[0].getURL()).get();
Document doc = Jsoup.connect(url).post();
and am getting an out of memory exception. Obviously the web page's HTML is too much to download. All I want from the webpage are the elements within the following tags
<div class="animal-info">...</div>
Is there a way for me to do this using Jsoup without having to download the whole webpage, or a way to get around the out of memory exception?
Try
Document doc = Jsoup.connect(url).get();
Elements divElements = doc.getElementsByTag("div");
List<String> textList = new ArrayList<>();
for (Element divElement : divElements) {
    if (divElement.attr("class").equals("animal-info")) {
        textList.add(divElement.text());
        String text = textList.toString();
        Log.e("Content", text);
    }
}
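As a side note, the same divs can be selected more directly with a class selector instead of scanning every div by hand (equivalent result, just shorter):
// select only the divs that carry the animal-info class
Elements animalDivs = doc.select("div.animal-info");
for (Element divElement : animalDivs) {
    textList.add(divElement.text());
}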
