Open Link in HTML with JSOUP - java

I have a table in an HTML page that I need to iterate through, opening each link to reach a second page where all the information is. On that page I extract the data I need and then return to my base page.
How do I change pages with the jsoup library in Java? Is it actually possible?

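The jsoup Cookbook has an example of getting all the links inside an HTML element. Iterate over the Elements from that example and, for each one, call Document doc = Jsoup.connect(<url from Elements>).get(); to fetch the linked page. You can then select what you need from doc, or call String htmlFromLink = doc.toString(); to get that page's HTML. jsoup has no notion of a current page, so there is nothing to navigate back to: the original Document stays in memory while you loop.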
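The approach above could be sketched roughly as follows. The table markup, the `table a[href]` selector, the example.com base URI, and the class and method names are assumptions for illustration, not taken from the question. Passing a base URI to `Jsoup.parse` lets `absUrl` resolve relative links.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LinkFollower {

    // Collects the absolute URL of every link found inside a table.
    static List<String> extractTableLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("table a[href]")) {
            String url = link.absUrl("href"); // resolved against baseUri
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }

    // Fetches each linked page in turn and prints its title.
    // Needs network access, so it is not called from main below.
    static void visitAll(List<String> urls) throws IOException {
        for (String url : urls) {
            Document detail = Jsoup.connect(url).get();
            System.out.println(detail.title());
        }
    }

    public static void main(String[] args) {
        // Hypothetical table markup standing in for the real page.
        String html = "<table><tr><td><a href='/item/1'>One</a></td></tr>"
                    + "<tr><td><a href='/item/2'>Two</a></td></tr></table>";
        System.out.println(extractTableLinks(html, "http://example.com/list"));
    }
}
```

In a real crawl you would replace the in-memory string with `Jsoup.connect(listUrl).get()` and pass the result of `extractTableLinks` to `visitAll`.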

Related

How to get URL of video or audio from a website with jsoup

I'm using jsoup to parse all the HTML from this website: news
I can fetch the title and description by selecting the elements I need, but I can't find the video URL element to select. How can I get the video link with jsoup or another library? Thanks!
Maybe I misunderstood your question, but can't you search for <video> elements using jsoup?
A <video> element usually carries its URL in its src attribute, or in nested <source> elements.
Maybe try something like this?
// requires: import org.jsoup.Jsoup;
// HTML from your webpage
final var html = "this should hold your HTML";
// Parse the HTML into a document of element objects
final var document = Jsoup.parse(html);
// Use a CSS selector to pick the first <video> element; first() returns null if there is none
final var videoElement = document.select("video").first();
// Grab the video's URL from the "src" attribute, guarding against a missing element
final var src = (videoElement != null) ? videoElement.attr("src") : "";
I did not check the website you linked thoroughly, but some websites insert videos using JavaScript. If this site adds the <video> tag after the page loads, you might be out of luck, because jsoup does not run JavaScript. It only sees the initial HTML returned by the server.
jsoup is an HTML parser, so it parses the markup it is given, not content that scripts generate later in a browser.
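As a slightly fuller sketch of the same idea, the helper below also looks at `<source>` children, since a `<video>` tag does not always carry `src` itself. The sample markup, URLs, and the `VideoUrlFinder` name are made up for illustration; this only works when the tags are present in the HTML the server returns.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class VideoUrlFinder {

    // Returns every URL declared on <video src> or on a <source src> nested in a <video>.
    static List<String> findVideoUrls(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element el : doc.select("video[src], video source[src]")) {
            urls.add(el.absUrl("src")); // relative paths are resolved against baseUri
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical markup: one video with a direct src, one with a nested <source>.
        String html = "<video src='/clips/a.mp4'></video>"
                    + "<video><source src='http://cdn.example.com/b.webm'></video>";
        findVideoUrls(html, "http://example.com/news").forEach(System.out::println);
    }
}
```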

How to select an element in Jsoup using its html content?

I want to select an element in Jsoup using its html content.
Example: LOCATION:
How can I do it? I couldn't find any appropriate selector method directly. Is there a workaround available?
Using the jsoup library you can retrieve a value from HTML by the name, ID, or class of an element.
String html = "<html><head><title>Title</title></head> <body><div id='location'>Mumbai, India</div></body></html>";
Document document = Jsoup.parse(html);
String content = document.getElementById("location").outerHtml();
Happy Coding :-)
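If the element has no usable ID or class, jsoup can also match on text: `Element.getElementsContainingOwnText(...)` (and the equivalent `:containsOwn(...)` pseudo-selector) finds elements whose own text contains a string, ignoring case. The sketch below assumes a hypothetical layout in which the "LOCATION:" label is immediately followed by a sibling element holding the value; the question's real markup was not shown, so adjust the sibling lookup to fit.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LabelLookup {

    // Finds the first element whose own text contains labelText and
    // returns the text of the element right after it, or null if absent.
    static String valueAfterLabel(String html, String labelText) {
        Document doc = Jsoup.parse(html);
        Element label = doc.getElementsContainingOwnText(labelText).first();
        if (label == null) {
            return null;
        }
        Element value = label.nextElementSibling();
        return value == null ? null : value.text();
    }

    public static void main(String[] args) {
        // Assumed markup: a label span followed by a value span.
        String html = "<div><span>LOCATION:</span> <span>Mumbai, India</span></div>";
        System.out.println(valueAfterLabel(html, "LOCATION:"));
    }
}
```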

Web crawler find the whole html code

Note that I am using Java in Eclipse with the jsoup library.
My code is:
Document doc = null;
String crawUrl = this.getCrawlUrl();
doc = Jsoup.connect(crawUrl).get();
Elements hrefs2=doc.select("html");
System.out.println(hrefs2);
I am trying to get the whole HTML code of a specific page, but when there is something like a div nested inside a div, I am not getting it.
How can I get the whole HTML code of a specific page?
You can try:
Document doc = Jsoup.connect(crawUrl).get();
System.out.println(doc.toString());

How to read/parse article content from link to string

I need some help.
How do I get the content of articles on websites with Java or on Android?
You can try http://jsoup.org/
Use it to fetch the page from link and parse the content.
Well, here is a sample,
String url = "http://inet.detik.com/read/2012/12/12/105558/2116258/796/produktif-kerja-mobile-dengan-samsung-ativ-smart-pc-yang-revolusioner";
Document doc = Jsoup.connect(url).timeout(20000).get();
Elements elements = doc.select("div[class=text_detail]");
if (elements.size() > 0) {
    System.out.println(elements.text());
}
The above code just prints out the entire text. If you want a nicely formatted version, you need to handle some HTML tags (such as br) yourself. jsoup makes it easy to walk the HTML tags, so spend some time with the documentation and write that code on your own.
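One possible way to handle br tags yourself is to walk the node tree and emit a newline for each one. This is only a sketch: the `ArticleText` class and its method names are invented here, the sample HTML is made up, and other block tags (p, div, li) would need similar treatment.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class ArticleText {

    // Returns the text under root, with each <br> turned into a newline.
    static String textWithLineBreaks(Element root) {
        StringBuilder out = new StringBuilder();
        appendText(root, out);
        return out.toString().trim();
    }

    // Depth-first walk: text nodes are appended, <br> becomes '\n',
    // other elements are descended into, everything else is skipped.
    private static void appendText(Node node, StringBuilder out) {
        if (node instanceof TextNode) {
            out.append(((TextNode) node).text());
        } else if (node instanceof Element) {
            Element el = (Element) node;
            if (el.tagName().equalsIgnoreCase("br")) {
                out.append('\n');
                return;
            }
            for (Node child : el.childNodes()) {
                appendText(child, out);
            }
        }
    }

    // Applies the walk to the same div[class=text_detail] selector used above.
    static String extract(String html) {
        Element body = Jsoup.parse(html).select("div[class=text_detail]").first();
        return body == null ? "" : textWithLineBreaks(body);
    }

    public static void main(String[] args) {
        String html = "<div class=\"text_detail\">First line<br>Second <b>bold</b> line</div>";
        System.out.println(extract(html));
    }
}
```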

Using java to extract a single value from an html page:

I am continuing work on a project that I've been at for some time now, and I have been struggling to pull some data from a website. The website has an iframe that pulls in some data from an unknown source. The data is in the iframe in a tag something like this:
<DIV id="number_forecast"><LABEL id="lblDay">9,000</LABEL></DIV>
There is a BUNCH of other crap above it but this div id / label is totally unique and is not used anywhere else in the code.
jsoup is probably what you want; it excels at extracting data from an HTML document.
There are many examples available showing how to use the API: http://jsoup.org/cookbook/extracting-data/selector-syntax
The process has two steps:
1. parse the page and find the URL of the iframe
2. parse the content of the iframe and extract the information you need
The code would look like this:
// let's find the iframe
Document document = Jsoup.parse(inputstream, "iso-8859-1", url);
Elements elements = document.select("iframe");
Element iframe = elements.first();
// now load the iframe
URL iframeUrl = new URL(iframe.absUrl("src"));
document = Jsoup.parse(iframeUrl, 15000);
// extract the div
Element div = document.getElementById("number_forecast");
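Once the iframe document is loaded, reading the value itself might look like the sketch below. It parses the snippet from the question from a string rather than from the network, and it assumes the label always holds an integer with comma thousands separators; `ForecastReader` and `readForecast` are names invented for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ForecastReader {

    // Reads the number inside <label id="lblDay"> under <div id="number_forecast">.
    static int readForecast(String iframeHtml) {
        Document doc = Jsoup.parse(iframeHtml);
        Element div = doc.getElementById("number_forecast");
        Element label = (div == null) ? null : div.getElementById("lblDay");
        if (label == null) {
            throw new IllegalStateException("forecast label not found");
        }
        // Assumption: the text looks like "9,000"; strip separators before parsing.
        return Integer.parseInt(label.text().replace(",", "").trim());
    }

    public static void main(String[] args) {
        String html = "<DIV id=\"number_forecast\"><LABEL id=\"lblDay\">9,000</LABEL></DIV>";
        System.out.println(readForecast(html));
    }
}
```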
In your page that contains the iframe, change the iframe's source to your own URL. That URL is handled by your own controller, which reads the content, parses it, extracts everything you need, and writes it to the response. If the references inside the iframe are absolute, this should work.
