How to replace certain text in a hyperlink - Java

This is on Android, and I need to fix up the HTML before it is loaded into the WebView.
Normally this could be done with
(<a[^>]+>)(.+?)(<\/a>)
capturing the link text as group $2 and then replacing it.
But what if there are other, unknown children inside the <a> tag?
The example below has <a><p>... text</p></a>, but the <p> could be something else that is not known in advance.
What I really want is to replace only the text content of any child inside the <a> element.
<a href="http://news.newsletter.com/" target="_blank">
<p><img alt="Socialbook" border="0" height="50"
src="http://news.newsletter.com/images/socialbook.gif" width="62">
THIS IS THE TEXT NEEDED TO REPLACE</p>
</a>
Can this be done in Java, or does it have to be done in the WebView's JavaScript?

You can use any Java HTML parser, for example Jsoup:
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");
for (Element link : links) {
    // Note: text(String) clears all existing children of the element.
    link.text("~" + link.text() + "~");
}
See the Element API docs.
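If adding a parser dependency is not an option, the regex idea from the question can be extended to cope with unknown children: match each anchor, then walk its body token by token and replace only the text runs, leaving nested tags untouched. The following is a minimal sketch using only java.util.regex. The class name AnchorTextReplacer and the method replaceAnchorText are made up for illustration, and the sketch assumes that no attribute value contains a literal > and that anchors are not nested.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class AnchorTextReplacer {
    // One <a ...>...</a> element; DOTALL lets the body span several lines.
    private static final Pattern ANCHOR = Pattern.compile(
            "(<a\\b[^>]*>)(.*?)(</a>)", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    // A tag, a run of text, or a stray '<' that belongs to neither.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^<]+|<");

    static String replaceAnchorText(String html, String replacement) {
        Matcher anchor = ANCHOR.matcher(html);
        StringBuffer out = new StringBuffer();
        while (anchor.find()) {
            StringBuilder body = new StringBuilder();
            Matcher token = TOKEN.matcher(anchor.group(2));
            while (token.find()) {
                String piece = token.group();
                // Only non-blank text runs are replaced; tags and pure
                // whitespace between tags are copied through unchanged.
                boolean isText = !piece.startsWith("<") && !piece.trim().isEmpty();
                body.append(isText ? replacement : piece);
            }
            String rebuilt = anchor.group(1) + body + anchor.group(3);
            // quoteReplacement keeps '$' and '\' in the HTML literal.
            anchor.appendReplacement(out, Matcher.quoteReplacement(rebuilt));
        }
        anchor.appendTail(out);
        return out.toString();
    }
}
```

Calling AnchorTextReplacer.replaceAnchorText(html, "NEW TEXT") on the HTML string before handing it to the WebView would rewrite every text run inside each link. Whitespace surrounding a text run is replaced along with it, which is normally harmless in HTML.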

Related

Strip HTML from text but also specific content wrapped in html in Java

Let's say I have the following text:
<blockquote>
<div>This is text and html in a blockquote</div>
More text in a block quote.
</blockquote>
Here's some content <b> bolded </b> and <i> other random HTML tags </i>
I'd like to strip the entire blockquote out, and keep the content in other html tags. So the output would be:
Here's some bolded and other random HTML tags.
I know there are a hundred or more answers to "Stripping HTML from content", but I can't find one on stripping HTML tags while also removing the content wrapped in specific HTML tags.
How can I get the desired output in Java?
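You could use simple regular expressions: .replaceAll("(?s)<blockquote>.*?</blockquote>", "").replaceAll("<[^>]+>", ""). The (?s) flag lets . match line breaks, which the multi-line blockquote needs, and the lazy .*? stops at the first closing tag. That should be enough for simple input.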
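As a rough, runnable illustration of the regex approach, the sketch below removes each blockquote together with its content, strips the remaining tags, and then collapses the leftover whitespace. The class name BlockquoteStripper and the method strip are hypothetical, and the sketch assumes blockquotes are not nested.

```java
// Hypothetical helper, not part of any library.
class BlockquoteStripper {
    static String strip(String html) {
        return html
                // (?s) lets '.' match newlines, (?i) ignores tag case,
                // and the lazy .*? stops at the first closing tag.
                .replaceAll("(?is)<blockquote\\b[^>]*>.*?</blockquote>", "")
                // Drop every remaining tag but keep the text between tags.
                .replaceAll("<[^>]+>", "")
                // Collapse the whitespace runs left behind by removed tags.
                .replaceAll("\\s+", " ")
                .trim();
    }
}
```

With the text from the question, BlockquoteStripper.strip(text) yields "Here's some content bolded and other random HTML tags".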
It may seem like overkill, but you could use Jsoup to parse the HTML and operate on the Elements.
There may be something more lightweight for your problem, but Jsoup should do the job just fine. You can select elements with CSS selectors, remove the ones you don't need, and get the plain text (without tags) out of the rest.
Here is a simple example:
final String html = "<html><div><bq>i do not want this</bq>but this <b>should</b> all <i>get</i> read</div></html>";
final Document document = Jsoup.parse(html);
final Elements div = document.select("div");
div.select("bq").remove();
System.out.println(div.text()); // prints but this should all get read
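You could also use Jsoup this way. Note that Jsoup.clean with Whitelist.none() strips every tag but keeps the text inside the blockquote, so the blockquote still has to be removed separately if its content must not appear in the output: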
String text = "<blockquote>\n" +
" <div>This is text and html in a blockquote</div>\n" +
" More text in a block quote.\n" +
"</blockquote> \n" +
"Here's some content <b> bolded </b> and <i> other random HTML tags </i>";
Whitelist whitelist = Whitelist.none(); // newer Jsoup releases name this class Safelist
String cleanStr = Jsoup.clean(text, whitelist);

Get all elements with Jsoup

I'm trying to find all elements inside this kind of html:
<body>
My text without tag
<br>Some title</br>
<img class="image" src="url">
My second text without tag
<p>Some Text</p>
<p class="MsoNormal">Some text</p>
<ul>
<li>1</li>
<li>2</li>
</ul>
</body>
I need to get all elements, including the parts without any tag. How can I get them?
P.S.: I need to get an array of "Element", one for each element.
I'm not quite sure whether you are asking how to retrieve all the text within the HTML. To do that, you can simply do the following:
String html = "..."; // your HTML code
Document doc = Jsoup.parse(html); // parse the string
System.out.println(doc.text()); // print the combined text of all elements
OUTPUT:
My text without tag Some title My second text without tag Some Text
Some text 1 2
If you are reading from an HTML file, you can use the code below to retrieve each tag that you need. The API is Jsoup; you can find more examples at http://jsoup.org/
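File input = new File(htmlFilePath);
InputStream is = new FileInputStream(input);
// IOUtils comes from Apache Commons IO; UTF-8 is an assumption here
String html = IOUtils.toString(is, StandardCharsets.UTF_8);
Document htmlDoc = Jsoup.parse(html);
Elements pElements = htmlDoc.select("p"); // all <p> elements
Element pElement1 = pElements.get(0); // the first <p> element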

Extracting links from HTML

I am trying to extract links from HTML. I am using the following regular expression:
href=\"([^\"]*)\"
but it also extracts links I don't need. How can I write a regular expression that extracts only links with class="l", like
<a href="http://users.elite.net/runner/jennifers/hello.htm" class="l">
<a href="http://www.hellodesign.com/" class="l">
<a href="http://www.ipl.org/div/hello/" class="l">
Parsing HTML with regex is unnecessarily overcomplicated. Regex is the wrong tool for the job. Just use a normal HTML parser like Jsoup. It allows you to select HTML elements by normal CSS selectors.
Document document = Jsoup.parse(html);
Elements links = document.select("a.l"); // Select all <a class="l"> elements.
for (Element link : links) {
System.out.println(link.absUrl("href"));
}

Get an URL hidden in HTML code with JSoup

I have a piece of HTML code from a web page (LibraryThing) like:
<div class="qelcontent" id="4ed0e0ba4f1b16.47984984" style="display:block;">
<div class="description"><h4 class="first"><b>Amazon.com Product Description</b>
(ISBN 0860783227, Hardcover)</h4>
I want to get the absolute URL from an href attribute. I tried:
selector = document.select(".first .a[href]");
But it returned null. How can I get the value?
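This solves this specific problem, though I'm not sure it will work with your entire dataset. In your selector, .a matches elements with the class "a" rather than <a> tags, so nothing is selected.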
String html = "<div class=\"qelcontent\" id=\"4ed0e0ba4f1b16.47984984\" style=\"display:block;\">" +
"<div class=\"description\"><h4 class=\"first\"><b>Amazon.com Product Description</b>" +
"(ISBN 0860783227, Hardcover)</h4>";
Document doc = Jsoup.parse(html);
System.out.println(doc.select(".first").select("a").attr("href"));

How to extract specific text from a webpage? [duplicate]

I'm trying to extract specific text from a web page.
This is the part of the webpage which contains the specific text:
<div class="module">
<div class="body">
<dl class="per_info">
<dt>F.Name:</dt>
<dd><a class="nm" href="http://">a Variable Name1</a></dd>
<dt>L.Name:</dt>
<dd><a class="nm" href="http://">a Variable Name2</a></dd>
</dl>
</div>
</div>
How can I extract the contents of Variable Name1 and Variable Name2?
Is there an HTML parser that could do this extraction?
Well, you can try Selenium. It loads the HTML page into your Java code in a DOM-aware fashion, so that afterwards you can pick out the content of HTML elements by id, XPath, etc.
http://seleniumhq.org/
TagSoup is a SAX-compliant parser that is able to parse HTML found in the "wild", so the input does not need to be well-formed XML.
jsoup is a Java library that can parse HTML and extract element data. To use jsoup, you first create a jsoup Document by parsing it from a file, URL, whole document string, or HTML fragment string. An HTML fragment example is something like:
String html = "<div class='module'>" +
"<div class='body'>" +
"<dl class='per_info'>" +
"<dt>F.Name:</dt>" +
"<dd><a class='nm' href='http://'>a Variable Name1</a></dd>" +
"<dt>L.Name:</dt>" +
"<dd><a class='nm' href='http://'>a Variable Name2</a></dd>" +
"</dl>" +
"</div>" +
"</div>";
Document doc = Jsoup.parseBodyFragment(html);
With the document, you can use jsoup's selectors to locate specific elements:
// select all <a/> elements from the document
Elements anchors = doc.select("a");
With the element collection, you can iterate over the elements and extract their contents:
for (Element anchor : anchors) {
String contents = anchor.text();
System.out.println(contents);
}
