Extract data from html code with Jsoup - java

I want to extract the word Mustafa from this HTML code with Jsoup.
<h1 id="firstHeading" class="firstHeading">Mustafa</h1>
<!-- /firstHeading -->
How can I do this?

With Jsoup you can use CSS selectors to select elements. An element with id="firstHeading" is selectable with CSS selector #firstHeading.
Thus, this should do:
Document document = Jsoup.parse(html);
String firstHeading = document.select("#firstHeading").text();
System.out.println(firstHeading); // Mustafa
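For completeness, here is a minimal self-contained sketch of the same lookup using getElementById, which returns null when no element has the given id (whereas select(...).text() returns an empty string in that case). The class and method names (HeadingExtractor, firstHeading) are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HeadingExtractor {

    /** Returns the text of the element with id="firstHeading", or null if there is none. */
    public static String firstHeading(String html) {
        Document document = Jsoup.parse(html);
        Element heading = document.getElementById("firstHeading");
        return heading == null ? null : heading.text();
    }

    public static void main(String[] args) {
        String html = "<h1 id=\"firstHeading\" class=\"firstHeading\">Mustafa</h1>";
        System.out.println(firstHeading(html)); // Mustafa
    }
}
```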

Related

How to select an element in Jsoup using its html content?

I want to select an element in Jsoup using its html content.
Example: LOCATION:
How can I do it? I couldn't find any appropriate selector methods directly. Is there any workaround available?
Using the Jsoup library you can look up an element by its tag name, ID, or class and then read its content.
String html = "<html><head><title>Title</title></head> <body><div id='location'>Mumbai, India</div></body></html>";
Document document= Jsoup.parse(html);
String content = document.getElementById("location").outerHtml();
Happy Coding :-)
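If the goal is to select by the content itself rather than by id, Jsoup's :containsOwn(text) pseudo-selector may be closer to what the question asks. The sketch below is only an illustration: the markup around LOCATION: is an assumption (the question's original example did not survive), and ContentSelector and findByOwnText are made-up names. The match is case-insensitive, and text containing parentheses would need escaping in the selector.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ContentSelector {

    /** Returns the first element whose own text contains the given text, or null if none does. */
    public static Element findByOwnText(String html, String text) {
        Document doc = Jsoup.parse(html);
        return doc.selectFirst("*:containsOwn(" + text + ")");
    }

    public static void main(String[] args) {
        // Assumed markup, for illustration only.
        String html = "<div><b>LOCATION:</b> Mumbai, India</div>";
        Element label = findByOwnText(html, "LOCATION:");
        System.out.println(label.tagName());       // b
        System.out.println(label.parent().text()); // LOCATION: Mumbai, India
    }
}
```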

how to replace certain text in hyperlink

This is on Android, and I need to fix up the HTML before it is loaded into the WebView.
Normally this could be done with
(<a[^>]+>)(.+?)(<\/a>)
to get group $1 and then replace the text.
But what if there are other, unknown children inside the <a> tag?
The example below has <a><p>... text</p></a>, but the <p> could be some other element that is not known in advance.
What I really want is to replace only the text content of any child inside the <a> element.
<a href="http://news.newsletter.com/" target="_blank">
<p><img alt="Socialbook" border="0" height="50"
src="http://news.newsletter.com/images/socialbook.gif" width="62">
THIS IS THE TEXT NEEDED TO REPLACE</p>
</a>
Can this be done in Java, or does it have to be done in the WebView's JavaScript?
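You can use any Java HTML parser, e.g. Jsoup:
String html = "<html><head><title>First parse</title></head>"
+ "<body><a href=\"http://news.newsletter.com/\">Parsed HTML into a doc.</a></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");
for (Element link : links) {
    // Note: text(String) clears all existing children of the link (including tags such as <img>).
    link.text("~" + link.text() + "~");
}
String fixedHtml = doc.html(); // the modified HTML to load into the WebView
See the Element API docs.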
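Because Element.text(String) discards child elements, a variant that touches only the text nodes may fit the question better. The following is a sketch under that assumption, not the answerer's method: it walks every element inside each link and rewrites only non-blank TextNode children, so tags such as <img> stay in place. LinkTextReplacer and replaceLinkText are hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class LinkTextReplacer {

    /** Replaces every non-blank text node inside <a> elements and returns the body HTML. */
    public static String replaceLinkText(String html, String replacement) {
        Document doc = Jsoup.parse(html);
        for (Element link : doc.select("a")) {
            // getAllElements() includes the link itself and all of its descendants.
            for (Element element : link.getAllElements()) {
                for (TextNode textNode : element.textNodes()) {
                    if (!textNode.isBlank()) {
                        textNode.text(replacement);
                    }
                }
            }
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://news.newsletter.com/\" target=\"_blank\">"
                + "<p><img alt=\"Socialbook\" src=\"http://news.newsletter.com/images/socialbook.gif\">"
                + "THIS IS THE TEXT NEEDED TO REPLACE</p></a>";
        System.out.println(replaceLinkText(html, "NEW TEXT"));
    }
}
```

The returned string can then be handed to the WebView in place of the original HTML.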

How to get tag name for given word in Jsoup?

I have some html code like this:
<div class="post-text" itemprop="text">sometext for example</div>
I'm searching for the word sometext using Jsoup and I want its tag name. For the above example it will be a href. Can anyone help me?
Try this CSS selector:
*:containsOwn(sometext)
DEMO
http://try.jsoup.org/~1FKtzLpHQFii4u8FFyUuh3GgdPI
SAMPLE CODE
String html = "<div class=\"post-text\" itemprop=\"text\">sometext for example</div>";
Document doc = Jsoup.parse(html);
Elements elts = doc.select("*:containsOwn(sometext)");
for (Element e : elts) {
    System.out.println(e.tagName()); // tag name of the element whose own text contains "sometext"
}
OUTPUT
div
SEE ALSO
:matchesOwn(regex) - If you want to find elements whose own text matches a more elaborate pattern.
Jsoup CSS selector - The complete reference on CSS selectors supported by Jsoup
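As a rough illustration of the :matchesOwn(regex) variant mentioned above, the sketch below collects the tag names of all elements whose own text matches a regular expression. The pattern and the names OwnTextMatcher and tagNamesMatching are assumptions chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OwnTextMatcher {

    /** Returns the tag names of all elements whose own text matches the given regex. */
    public static List<String> tagNamesMatching(String html, String regex) {
        Document doc = Jsoup.parse(html);
        List<String> tagNames = new ArrayList<>();
        for (Element element : doc.select("*:matchesOwn(" + regex + ")")) {
            tagNames.add(element.tagName());
        }
        return tagNames;
    }

    public static void main(String[] args) {
        String html = "<div class=\"post-text\" itemprop=\"text\">sometext for example</div>";
        System.out.println(tagNamesMatching(html, "^some[a-z]+ for")); // [div]
    }
}
```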

Get all elements with Jsoup

I'm trying to find all elements inside this kind of html:
<body>
My text without tag
<br>Some title</br>
<img class="image" src="url">
My second text without tag
<p>Some Text</p>
<p class="MsoNormal">Some text</p>
<ul>
<li>1</li>
<li>2</li>
</ul>
</body>
I need to get all elements, including the parts without any tag. How can I get them?
P.S.: I need to get array of "Element" for each element.
I'm not quite sure whether you are asking how to retrieve all the text within the HTML. To do that, you can simply do the following:
String html = "..."; // your html code
Document doc = Jsoup.parse(html); //parse the string
System.out.println(doc.text()); // get all the text from tags.
OUTPUT:
My text without tag Some title My second text without tag Some Text
Some text 1 2
In case you are using an HTML file, you can use the code below and retrieve each tag that you need. The API is Jsoup. You can find more examples at http://jsoup.org/
File input = new File(htmlFilePath);
// Jsoup can parse a file directly; this throws IOException and assumes the file is UTF-8 encoded.
Document htmlDoc = Jsoup.parse(input, "UTF-8");
Elements pElements = htmlDoc.select("p");
Element pElement1 = pElements.get(0); // first <p>; throws if the document has none
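Text that sits directly in <body> without a tag is not an Element in Jsoup but a TextNode, so an array of Element alone cannot hold it. One possible approach, sketched here with the made-up names BodyNodeLister and describeBodyNodes, is to iterate over the body's child nodes and handle both kinds:

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class BodyNodeLister {

    /** Lists the direct children of <body>, labelling bare text and elements separately. */
    public static List<String> describeBodyNodes(String html) {
        Document doc = Jsoup.parse(html);
        List<String> parts = new ArrayList<>();
        for (Node node : doc.body().childNodes()) {
            if (node instanceof TextNode) {
                TextNode textNode = (TextNode) node;
                if (!textNode.isBlank()) { // skip whitespace-only text between tags
                    parts.add("text: " + textNode.text().trim());
                }
            } else if (node instanceof Element) {
                parts.add("element: " + ((Element) node).tagName());
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String html = "<body>My text without tag<br>Some title<img class=\"image\" src=\"url\">"
                + "My second text without tag<p>Some Text</p><ul><li>1</li><li>2</li></ul></body>";
        for (String part : describeBodyNodes(html)) {
            System.out.println(part);
        }
    }
}
```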

Extracting links from HTML

I am trying to extract links from HTML. I am using the following regular expression
href=\"([^\"]*)\"
This also extracts links I don't need. How can I write a regular expression to extract only links with class="l" like
<a href="http://users.elite.net/runner/jennifers/hello.htm" class="l">
<a href="http://www.hellodesign.com/" class="l">
<a href="http://www.ipl.org/div/hello/" class="l">
Parsing HTML with regex is unnecessarily overcomplicated. Regex is the wrong tool for the job. Just use a normal HTML parser like Jsoup. It allows you to select HTML elements by normal CSS selectors.
Document document = Jsoup.parse(html);
Elements links = document.select("a.l"); // Select all <a class="l"> elements.
for (Element link : links) {
System.out.println(link.absUrl("href"));
}
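A self-contained version of the same idea might look like the sketch below. It additionally passes a base URI to Jsoup.parse so that absUrl can resolve relative links; the base URI, the [href] filter, and the names ClassLinkExtractor and extractLinks are assumptions added for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ClassLinkExtractor {

    /** Returns the absolute URLs of all <a class="l"> elements that have an href attribute. */
    public static List<String> extractLinks(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : document.select("a.l[href]")) {
            urls.add(link.absUrl("href")); // resolved against baseUri when the href is relative
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.hellodesign.com/\" class=\"l\">A</a>"
                + "<a href=\"/div/hello/\" class=\"l\">B</a>"
                + "<a href=\"http://other.example/\" class=\"x\">C</a>";
        for (String url : extractLinks(html, "http://www.ipl.org/")) {
            System.out.println(url);
        }
    }
}
```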
