Extracting links from HTML - java

I am trying to extract links from HTML using the following regular expression:
href=\"([^\"]*)\"
It also extracts links I do not need. How can I write a regular expression that extracts only links with class="l", like these?
<a href="http://users.elite.net/runner/jennifers/hello.htm" class="l">
<a href="http://www.hellodesign.com/" class="l">
<a href="http://www.ipl.org/div/hello/" class="l">

Parsing HTML with regex is unnecessarily overcomplicated. Regex is the wrong tool for the job. Just use a normal HTML parser like Jsoup. It allows you to select HTML elements by normal CSS selectors.
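Document document = Jsoup.parse(html);
Elements links = document.select("a.l"); // Select all <a class="l"> elements.
for (Element link : links) {
    // absUrl returns the href as an absolute URL; relative hrefs need a base URI passed to Jsoup.parse.
    System.out.println(link.absUrl("href"));
}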
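If adding a parser dependency is not an option, the regular expression from the question can be narrowed to the exact tag shape shown in the samples. The following is only a sketch under strong assumptions: double-quoted attributes, href immediately followed by class="l", and no other attributes in between. The class name ClassLinkRegex and the method extract are made up for illustration. Any markup that deviates from that shape will be missed, which is the reason the parser approach above is more robust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClassLinkRegex {
    // Matches <a href="..." class="l" ...> only in this attribute order (assumption).
    private static final Pattern LINK =
            Pattern.compile("<a\\s+href=\"([^\"]*)\"\\s+class=\"l\"[^>]*>");

    // Returns the href values of all tags that match the pattern above.
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            urls.add(m.group(1)); // group 1 is the href value
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.hellodesign.com/\" class=\"l\">x</a>"
                + "<a href=\"http://example.com/other\">y</a>";
        System.out.println(extract(html)); // prints [http://www.hellodesign.com/]
    }
}
```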

Related

How to select an element in Jsoup using its html content?

I want to select an element in Jsoup using its HTML content.
Example: LOCATION:
How can I do it? I couldn't find any appropriate selector methods directly. Is there any workaround available?
Using the Jsoup library, you can extract a value from HTML by the name, ID, or class of an element.
String html = "<html><head><title>Title</title></head> <body><div id='location'>Mumbai, India</div></body></html>";
Document document = Jsoup.parse(html);
String content = document.getElementById("location").outerHtml();
Happy Coding :-)

how to replace certain text in hyperlink

This is on Android, and I need to fix up the HTML before it is loaded into the WebView.
Normally it could be done with
(<a[^>]+>)(.+?)(<\/a>)
to capture the groups and then replace the text (group $2).
But what if there are other, unknown children inside the <a> tag?
The example below has <a><p>... text</p></a>, but the <p> could be something else that is not known in advance.
What I really want is to replace only the text content of any child inside the <a> element.
<a href="http://news.newsletter.com/" target="_blank">
<p><img alt="Socialbook" border="0" height="50"
src="http://news.newsletter.com/images/socialbook.gif" width="62">
THIS IS THE TEXT NEEDED TO REPLACE<p>
</a>
Can this be done in Java, or does it have to be done in the WebView's JavaScript?
You can use any Java HTML parser, e.g. JSoup:
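String html = "<html><head><title>First parse</title></head>"
        + "<body><p><a href=\"http://example.com/\">Parsed HTML into a doc.</a></p></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");
for (Element link : links) {
    // Note: text(String) replaces all of the link's contents, including child elements.
    link.text("~" + link.text() + "~");
}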
See Element api docs.
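Because Element.text(String) clears any existing children, calling it on the link from the question would also drop the <p> and <img>. One way to keep them, sketched here with Jsoup's TextNode API, is to walk the link and its descendants and rewrite only the non-blank text nodes. The class name LinkTextReplacer, the method replaceLinkText, and the shortened sample markup are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class LinkTextReplacer {
    // Replaces every non-blank text node inside each <a>, at any depth,
    // while leaving child elements such as <p> and <img> in place.
    static String replaceLinkText(String html, String replacement) {
        Document doc = Jsoup.parse(html);
        for (Element link : doc.select("a")) {
            // getAllElements() covers the link itself plus all of its descendants.
            for (Element el : link.getAllElements()) {
                for (TextNode node : el.textNodes()) {
                    if (!node.isBlank()) {
                        node.text(replacement);
                    }
                }
            }
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://news.newsletter.com/\" target=\"_blank\">"
                + "<p><img alt=\"Socialbook\" src=\"socialbook.gif\">"
                + "THIS IS THE TEXT NEEDED TO REPLACE</p></a>";
        System.out.println(replaceLinkText(html, "NEW TEXT"));
    }
}
```

The returned string can then be handed to the WebView, so the whole fix-up stays on the Java side.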

How to get tag name for given word in Jsoup?

I have some html code like this:
<div class="post-text" itemprop="text">sometext for example</div>
I'm searching for the word sometext using Jsoup and I want its tag name. For the above example it would be div. Can anyone help me?
Try this CSS selector:
*:containsOwn(sometext)
DEMO
http://try.jsoup.org/~1FKtzLpHQFii4u8FFyUuh3GgdPI
SAMPLE CODE
String html = "<div class=\"post-text\" itemprop=\"text\">sometext for example</div>";
Document doc = Jsoup.parse(html);
Elements elts = doc.select("*:containsOwn(sometext)");
for (Element e : elts) {
    System.out.println(e.outerHtml());
}
OUTPUT (line breaks may differ between Jsoup versions)
<div class="post-text" itemprop="text">sometext for example</div>
SEE ALSO
:matchesOwn(regex) - Use this if you want to find elements by a more elaborate text pattern.
Jsoup CSS selector - The complete reference on CSS selectors supported by Jsoup
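The sample code prints the whole element, while the question asks for the tag name. A minimal sketch of that last step, assuming Jsoup's Element.tagName() and Element.getElementsContainingOwnText() (the method form of :containsOwn), could look like this. The class name TagNameFinder and the method tagNamesContaining are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TagNameFinder {
    // Returns the tag names of all elements whose own text contains the word
    // (the match is case-insensitive, like :containsOwn).
    static List<String> tagNamesContaining(String html, String word) {
        Document doc = Jsoup.parse(html);
        List<String> names = new ArrayList<>();
        for (Element e : doc.getElementsContainingOwnText(word)) {
            names.add(e.tagName());
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "<div class=\"post-text\" itemprop=\"text\">sometext for example</div>";
        System.out.println(tagNamesContaining(html, "sometext")); // prints [div]
    }
}
```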

Java Html parser to extract specific data?

I have an HTML file like the following
...
<span itemprop="A">234</span>
...
<span itemprop="B">690</span>
...
From this I want to extract the values of A and B.
Can you suggest any HTML parser library for Java that can do this easily?
Personally, I favour JSoup over JTidy. It has CSS-like selectors, and the documentation is much better, imho. With JSoup, you can easily extract those values with the following lines:
Document doc = Jsoup.connect("your_url").get();
Elements spans = doc.select("span[itemprop]");
for (Element span : spans) {
    System.out.println(span.text()); // will print 234 and 690
}
http://jsoup.org/
JSoup is the way to go.
JTidy is a confusingly named yet respected HTML parser.
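If the values need to stay associated with their itemprop names rather than just being printed, the same span[itemprop] selector can feed a map. This is a sketch, not part of the answers above: the class name ItempropReader and the method read are made up, and the HTML is parsed from a string instead of a URL so the example is self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ItempropReader {
    // Maps each span's itemprop attribute value to the span's text, in document order.
    static Map<String, String> read(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> values = new LinkedHashMap<>();
        for (Element span : doc.select("span[itemprop]")) {
            values.put(span.attr("itemprop"), span.text());
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<span itemprop=\"A\">234</span><span itemprop=\"B\">690</span>";
        System.out.println(read(html)); // prints {A=234, B=690}
    }
}
```

To fetch a single value directly, the selector can be narrowed to span[itemprop=A].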

Extract data from html code with Jsoup

I want to extract from this HTML code the word Mustafa with Jsoup.
<h1 id="firstHeading" class="firstHeading">Mustafa</h1>
<!-- /firstHeading -->
How can I do this?
With Jsoup you can use CSS selectors to select elements. An element with id="firstHeading" is selectable with CSS selector #firstHeading.
Thus, this should do:
Document document = Jsoup.parse(html);
String firstHeading = document.select("#firstHeading").text();
System.out.println(firstHeading); // Mustafa
