How to extract specific text from a webpage? [duplicate] - java

This question already has answers here:
Text Extraction from HTML Java
(8 answers)
Closed 9 years ago.
I'm trying to extract specific text from a webpage.
This is the part of the webpage which contains the specific text:
<div class="module">
<div class="body">
<dl class="per_info">
<dt>F.Name:</dt>
<dd><a class="nm" href="http://">a Variable Name1</a></dd>
<dt>L.Name:</dt>
<dd><a class="nm" href="http://">a Variable Name2</a></dd>
</dl>
</div>
</div>
How can I extract the contents of Variable Name1 and Variable Name2?
Is there an HTML parser that can do this extraction?

You can try Selenium. It loads the HTML page into your Java code in a DOM-aware fashion, so that afterwards you can pick out the content of HTML elements by id, XPath, etc.
http://seleniumhq.org/

TagSoup is a SAX-compliant parser that is able to parse HTML found in the "wild", so the input does not need to be well-formed XML.

jsoup is a Java library that can parse HTML and extract element data. To use jsoup, you first create a jsoup Document by parsing it from a file, URL, whole document string, or HTML fragment string. An HTML fragment example is something like:
String html = "<div class='module'>" +
"<div class='body'>" +
"<dl class='per_info'>" +
"<dt>F.Name:</dt>" +
"<dd><a class='nm' href='http://'>a Variable Name1</a></dd>" +
"<dt>L.Name:</dt>" +
"<dd><a class='nm' href='http://'>a Variable Name2</a></dd>" +
"</dl>" +
"</div>" +
"</div>";
Document doc = Jsoup.parseBodyFragment(html);
With the document, you can use jsoup's selectors to locate specific elements:
// select all <a/> elements from the document
Elements anchors = doc.select("a");
With the element collection, you can iterate over the elements and extract their contents:
for (Element anchor : anchors) {
String contents = anchor.text();
System.out.println(contents);
}
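The selector above returns every link in the document. As a minimal sketch, assuming the two names always sit in a.nm links inside dl.per_info as in the question's markup, a narrower CSS selector picks out just those values. The class name PersonInfoExtractor and the method extractNames are made up for this illustration.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class PersonInfoExtractor {

    // Sample markup copied from the question.
    static final String SAMPLE_HTML =
            "<div class=\"module\"><div class=\"body\"><dl class=\"per_info\">"
            + "<dt>F.Name:</dt>"
            + "<dd><a class=\"nm\" href=\"http://\">a Variable Name1</a></dd>"
            + "<dt>L.Name:</dt>"
            + "<dd><a class=\"nm\" href=\"http://\">a Variable Name2</a></dd>"
            + "</dl></div></div>";

    // Returns the text of every a.nm link inside dl.per_info, in document order.
    static List<String> extractNames(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        return doc.select("dl.per_info dd a.nm").eachText();
    }

    public static void main(String[] args) {
        for (String name : extractNames(SAMPLE_HTML)) {
            System.out.println(name);
        }
    }
}
```

If the page has other links elsewhere, they are ignored because they do not match the dl.per_info ancestor.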

Related

JSOUP missing tag when converting html row

I'm having a problem with jsoup: I want to get a row of data that I will later insert into another HTML document. But when I inspect the parsed result, there are no <tr> and <td> tags. How can I solve this?
String htmlcontent = "<tr><td colspan=\"2\"><div class=\"content-wrapper\"><p><strong><span class=\"CLASS 1 CLASS 2 CLASS 3\">123</span></strong><br /><strong>DATA 1</strong></p></td><td></td><td></td><td></td><td></td></tr>";
Document docnewinput = Jsoup.parse(htmlcontent, "UTF-8");
[<html>
<head></head>
<body>
<div class="content-wrapper">
<p><strong><span class="CLASS 1 CLASS 2 CLASS 3">123</span></strong><br><strong>DATA 1</strong></p>
</div>
</body>
</html>]
You have a fragment of body HTML (e.g. a div containing a couple of p tags; as opposed to a full HTML document) that you want to parse.
Use the Jsoup.parseBodyFragment(String html) method.
String html = "<table><tr><td colspan=\"2\"><div class=\"content-wrapper\"><p><strong><span class=\"CLASS 1 CLASS 2 CLASS 3\">123</span></strong><br /><strong>DATA 1</strong></p></td><td></td><td></td><td></td><td></td></tr></table>";
Document doc = Jsoup.parseBodyFragment(html);
The parseBodyFragment method creates an empty shell document, and inserts the parsed HTML into the body element. If you used the normal Jsoup.parse(String html) method, you would generally get the same result, but explicitly treating the input as a body fragment ensures that any bozo HTML provided by the user is parsed into the body element.
The parser will make every attempt to create a clean parse from the HTML you provide, regardless of whether the HTML is well-formed or not. It handles:
unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
implicit tags (e.g. a naked <td>Table data</td> is wrapped into a <table><tr><td>...)
reliably creating the document structure (html containing a head and body, and only appropriate elements within the head)
EDIT:
By using Jsoup.parse():
String html = "<table><tr><td colspan=\"2\"><div class=\"content-wrapper\"><p><strong><span class=\"CLASS 1 CLASS 2 CLASS 3\">123</span></strong><br /><strong>DATA 1</strong></p></td><td></td><td></td><td></td><td></td></tr></table>";
Document doc = Jsoup.parse(html);
Working Demo: https://try.jsoup.org/~EdJSrHl_biDcQkyhL2BLH5ZNnck
You need to use xmlParser() so that the string is read as is, without the HTML parser restructuring it.
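A minimal sketch of the xmlParser() suggestion, assuming the row markup from the question and an empty base URI. The class and method names are made up for this illustration. The XML parser does not apply HTML table rules, so the bare tr and td elements should survive the parse.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

class TableRowFragment {

    // Row markup from the question: a bare <tr> with no enclosing <table>.
    static final String ROW_HTML =
            "<tr><td colspan=\"2\"><div class=\"content-wrapper\"><p><strong>"
            + "<span class=\"CLASS 1 CLASS 2 CLASS 3\">123</span></strong><br />"
            + "<strong>DATA 1</strong></p></td><td></td><td></td><td></td><td></td></tr>";

    // Parses the markup with jsoup's XML parser, which keeps the tags as written.
    static Document parseAsXml(String markup) {
        return Jsoup.parse(markup, "", Parser.xmlParser());
    }

    public static void main(String[] args) {
        Document xmlDoc = parseAsXml(ROW_HTML);
        System.out.println("tr elements: " + xmlDoc.select("tr").size());
        System.out.println("td elements: " + xmlDoc.select("td").size());
        System.out.println(xmlDoc.outerHtml());
    }
}
```

The trade-off is that the XML parser also skips the HTML parser's other repairs, so this is only a good fit when the fragment is reasonably well formed.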

Strip HTML from text but also specific content wrapped in html in Java

Let's say I have the following text:
<blockquote>
<div>This is text and html in a blockquote</div>
More text in a block quote.
</blockquote>
Here's some content <b> bolded </b> and <i> other random HTML tags </i>
I'd like to strip the entire blockquote out, and keep the content in other html tags. So the output would be:
Here's some bolded and other random HTML tags.
I know there are a hundred or more answers to "Stripping HTML from content", but I can't find an answer on stripping HTML tags and also the content that is wrapped in specific HTML tags.
How can I get the desired output in Java?
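You could use simple regular expressions: .replaceAll("(?s)<blockquote>.*?</blockquote>", "").replaceAll("<[^>]+>", ""). The (?s) flag lets . match line breaks, which is needed because the blockquote spans several lines, and the reluctant .*? stops at the first closing tag. That should be enough for simple input.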
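A minimal sketch of the regex approach using only the JDK, assuming plain <blockquote> tags with no attributes and no nesting. By default . does not match line terminators in Java, so the pattern uses the (?s) flag. The whitespace cleanup at the end is an addition for readable output, and the class and method names are made up for this illustration.

```java
class BlockquoteStripper {

    // Removes each <blockquote>...</blockquote> block with its content,
    // then strips the remaining tags and collapses runs of whitespace.
    static String stripBlockquotesAndTags(String html) {
        return html
                .replaceAll("(?s)<blockquote>.*?</blockquote>", "") // (?s): '.' also matches newlines
                .replaceAll("<[^>]+>", "")                          // drop remaining tags, keep their text
                .replaceAll("\\s+", " ")                            // collapse whitespace left behind
                .trim();
    }

    public static void main(String[] args) {
        String text = "<blockquote>\n"
                + "  <div>This is text and html in a blockquote</div>\n"
                + "  More text in a block quote.\n"
                + "</blockquote>\n"
                + "Here's some content <b> bolded </b> and <i> other random HTML tags </i>";
        System.out.println(stripBlockquotesAndTags(text));
    }
}
```

Regular expressions are fragile on real-world HTML (attributes on the tag, nested blockquotes, comments), which is why the parser-based answers are usually the safer choice.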
It may seem like overkill, but you could use jsoup to parse the HTML and operate on the Elements.
Maybe there is something more lightweight for your problem, but jsoup should do the job just fine. You can select elements with CSS selectors, remove the ones you don't need, and get the plain text (without tags) out of the rest.
Here is a simple sample:
final String html = "<html><div><bq>i do not want this</bq>but this <b>should</b> all <i>get</i> read</div></html>";
final Document document = Jsoup.parse(html);
final Elements div = document.select("div");
div.select("bq").remove();
System.out.println(div.text()); // prints but this should all get read
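You could also use jsoup this way. Note that Jsoup.clean with Whitelist.none() strips every tag but keeps the text, including the text inside the blockquote, and that newer jsoup releases rename Whitelist to Safelist: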
String text = "<blockquote>\n" +
" <div>This is text and html in a blockquote</div>\n" +
" More text in a block quote.\n" +
"</blockquote> \n" +
"Here's some content <b> bolded </b> and <i> other random HTML tags </i>";
Whitelist whitelist = Whitelist.none();
String cleanStr = Jsoup.clean(text, whitelist);

how to replace certain text in hyperlink

This is on Android, and I need to fix up the HTML before it is loaded into the WebView.
Normally it could be done with
(<a[^>]+>)(.+?)(<\/a>)
to capture the groups and then replace the text in group $2.
What if there are other unknown children inside the <a> tag?
The example below has <a><p>... text</p></a>, but the <p> could be something else that is not known in advance.
What I really want is to replace only the content of the text nodes of any child inside the <a> element.
<a href="http://news.newsletter.com/" target="_blank">
<p><img alt="Socialbook" border="0" height="50"
src="http://news.newsletter.com/images/socialbook.gif" width="62">
THIS IS THE TEXT NEEDED TO REPLACE</p>
</a>
Can this be done in Java, or does it have to be done in the WebView's JavaScript?
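You can use any Java HTML parser, e.g. jsoup:
// Sample document; substitute your own HTML containing <a> links
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");
for (Element link : links) {
// Note: text(String) replaces all of the link's children, including nested tags such as <img>
link.text("~" + link.text() + "~");
}
See the Element API docs.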
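Because Element.text(String) clears every child of the link, nested tags such as the img in the question would be lost. A minimal sketch of one alternative, assuming the goal is to overwrite each non-blank text node inside a link with the same replacement string while leaving the nested elements in place. The class and method names are made up for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

class LinkTextReplacer {

    // Replaces the text nodes inside every <a>, keeping child elements such as <img>.
    static String replaceLinkText(String html, String replacement) {
        Document doc = Jsoup.parseBodyFragment(html);
        // "a, a *" matches each link plus every element nested inside a link.
        for (Element element : doc.select("a, a *")) {
            // textNodes() returns only the direct text children of this element.
            for (TextNode textNode : element.textNodes()) {
                if (!textNode.isBlank()) {
                    textNode.text(replacement);
                }
            }
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://news.newsletter.com/\" target=\"_blank\">"
                + "<p><img alt=\"Socialbook\" src=\"http://news.newsletter.com/images/socialbook.gif\">"
                + "THIS IS THE TEXT NEEDED TO REPLACE</p></a>";
        System.out.println(replaceLinkText(html, "New text"));
    }
}
```

The resulting string can then be handed to the WebView. If a link contains several separate text nodes, each one receives the replacement, so a real implementation may want to pick a specific node.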

Get all elements with Jsoup

I'm trying to find all elements inside this kind of html:
<body>
My text without tag
<br>Some title</br>
<img class="image" src="url">
My second text without tag
<p>Some Text</p>
<p class="MsoNormal">Some text</p>
<ul>
<li>1</li>
<li>2</li>
</ul>
</body>
I need to get all elements, including the parts without any tag. How can I get them?
P.S.: I need to get array of "Element" for each element.
I'm not quite sure whether you are asking how to retrieve all the text within the HTML. To do that, you can simply do the following:
String html = "..."; // your HTML code goes here
Document doc = Jsoup.parse(html); // parse the string
System.out.println(doc.text()); // get all the text, with the tags stripped
OUTPUT:
My text without tag Some title My second text without tag Some Text
Some text 1 2
If you are working with an HTML file, you can use the code below and retrieve each tag that you need. The library is jsoup; you can find more examples at http://jsoup.org/
File input = new File(htmlFilePath);
// Jsoup.parse(File, String) reads and parses the file directly; it throws IOException
Document htmlDoc = Jsoup.parse(input, "UTF-8"); // assumes the file is UTF-8 encoded
Elements pElements = htmlDoc.select("p"); // all <p> elements
Element pElement1 = pElements.get(0); // first <p>; throws IndexOutOfBoundsException if there is none
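In jsoup, untagged text is represented as a TextNode rather than an Element, so an array of Element alone cannot hold those parts. A minimal sketch, assuming it is enough to walk the direct children of body and handle both node types. The class name, method name, and the "text:" / "element:" labels are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class BodyNodeLister {

    // Lists the direct children of <body> in document order,
    // including bare text that is not wrapped in any tag.
    static List<String> describeBodyChildren(String html) {
        List<String> result = new ArrayList<>();
        for (Node node : Jsoup.parse(html).body().childNodes()) {
            if (node instanceof TextNode) {
                TextNode textNode = (TextNode) node;
                if (!textNode.isBlank()) { // skip whitespace-only text between tags
                    result.add("text: " + textNode.text().trim());
                }
            } else if (node instanceof Element) {
                result.add("element: " + ((Element) node).tagName());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<body>My text without tag<br>Some title"
                + "<img class=\"image\" src=\"url\">My second text without tag"
                + "<p>Some Text</p><p class=\"MsoNormal\">Some text</p>"
                + "<ul><li>1</li><li>2</li></ul></body>";
        for (String line : describeBodyChildren(html)) {
            System.out.println(line);
        }
    }
}
```

If Element objects are strictly required, one option is to wrap each text node in an element of your choice before collecting them, but that changes the document.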

Get an URL hidden in HTML code with JSoup

I have a piece of HTML code of a web page (library thing) like:
<div class="qelcontent" id="4ed0e0ba4f1b16.47984984" style="display:block;">
<div class="description"><h4 class="first"><b>Amazon.com Product Description</b>
(ISBN 0860783227, Hardcover)</h4>
I want to get the absolute URL from an href attribute. I tried:
selector = document.select(".first .a[href]");
But it returned null. How can I get the value?
This solves this specific problem, but I'm not sure it will work with your entire dataset.
String html = "<div class=\"qelcontent\" id=\"4ed0e0ba4f1b16.47984984\" style=\"display:block;\">" +
"<div class=\"description\"><h4 class=\"first\"><b>Amazon.com Product Description</b>" +
"(ISBN 0860783227, Hardcover)</h4>";
Document doc = Jsoup.parse(html);
System.out.println(doc.select(".first").select("a").attr("href"));
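The selector in the question, .first .a[href], looks for elements with the class "a" rather than <a> tags, which is one reason it finds nothing. A minimal sketch for getting an absolute URL, assuming the h4 contains a link with a relative href. The link itself does not appear in the snippet above, so the anchor, its href, and the base URI http://www.example.com/ below are purely hypothetical, as are the class and method names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AbsoluteLinkExample {

    // Returns the absolute URL of the first link inside an element with class "first",
    // or null when there is no such link.
    static String firstAbsoluteLink(String html, String baseUri) {
        // The base URI is what jsoup resolves relative href values against.
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst(".first a[href]");
        return link == null ? null : link.absUrl("href");
    }

    public static void main(String[] args) {
        // Hypothetical markup: the anchor and its href are invented for this example.
        String html = "<div class=\"description\"><h4 class=\"first\">"
                + "<a href=\"/isbn/0860783227\">Amazon.com Product Description</a>"
                + " (ISBN 0860783227, Hardcover)</h4></div>";
        System.out.println(firstAbsoluteLink(html, "http://www.example.com/"));
    }
}
```

When the document is fetched with Jsoup.connect(url).get(), the base URI is set from the URL automatically, so absUrl("href") or attr("abs:href") works without passing it by hand.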
