Java : HTML Parsing - java

I have the HTML content shown below. The items I am looking for are the "img src" attribute and the "!important" style. Does Java provide any HTML parsing techniques?
<fieldset>
<table cellpadding='0'border='0'cellspacing='0'style="clear :both">
<tr valign='top' ><td width='35' >
<a href='http://mypage.rediff.com/android/32868898'class='space' onmousedown="return
enc(this,'http://track.rediff.com/clickurl=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F3 868898___&service=mypage_feeds&clientip=202.137.232.117&pos=0&feed_id=12942949154d255f839677925642&prc_id=32868898&rowid=2064549114')" >
<div style='width:25px;height:25px;overflow:hidden;'>
<img src='http://socialimg04.rediff.com/image.php?uid=32868898&type=thumb' width='25' vspace='0' /></div></a></td> <td><span>
<a href='http://mypage.rediff.com/android/32868898' class="space" onmousedown="return enc(this,'http://track.rediff.com/click?url=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F32868898___&service=mypage_feeds&clientip=202.137.232.117&pos=0&feed_id=12942949154d255f839677925642&prc_id=32868898&rowid=2064549114')" >Android </a> </span><span style='color:#000000
!important;'>android se updates...</span><div class='divtext'></div></td></tr><tr><td height='5' ></td></tr></table></fieldset><br/>

String value = Jsoup.parse(new File("d:\\1.html"), "UTF-8").select("img").attr("src");
System.out.println(value); //http://socialimg04.rediff.com/image.php?uid=32868898&type=thumb
System.out.println(Jsoup.parse(new File("d:\\1.html"), "UTF-8").select("span[style$=important;]").first().text());//android se updates...
See also: jsoup, and the question "What are the pros and cons of the leading Java HTML parsers?"

Try NekoHtml. This is the HTML parsing library used by various higher-level testing frameworks such as HtmlUnit.
NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.

I used jsoup. This library has a nice selector syntax (http://jsoup.org/cookbook/extracting-data/selector-syntax), and for your problem you can use code like this:
File input = new File("input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements pngs = doc.select("img[src$=.png]");

I like using Jericho: http://jericho.htmlparser.net/docs/index.html
It copes well with badly formed HTML, links leading to unavailable locations, etc.
There are a lot of examples on their page; you just get all IMG tags and analyze their attributes to extract those that meet your needs.

Related

Trying to replace <br>, <BR>, <br +attribute> tags with <br/>

I am attempting to convert a bunch of HTML documents to XML compliance (via a java method) and there are a lot of <br> tags that either (1) are unclosed or (2) contain attributes. For some reason the regex I'm using does not address the tags that contain attributes. Here is the code:
htmlString = htmlString.replaceAll("(?i)<br *>", "<br/>");
This code works fine for all the <br> tags in the documents; it replaces them with <br/>. However, for tags like
<BR style="PAGE-BREAK-BEFORE: always" clear=all>
it doesn't do anything. I'd like all br tags to just be <br/>, regardless of any attributes in the tag prior to conversion.
What do I need to add to my regex in order to achieve this?
This regex will do what you want: <(BR|br)[^>]*>
Here is a working example: Regex101
You probably want <br\b[^>]*> to match all tags that
Start with <br
Have a word boundary after the <br (so you wouldn't match a <brown> tag, for example)
Contain any number of non-> characters, including 0
End with a >
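As a minimal sketch of the <br\b[^>]*> pattern described above, the following wraps it in a small helper. The class name BrNormalizer and the method name normalize are made up for illustration; only java.util.regex from the standard library is used.

```java
import java.util.regex.Pattern;

// Hypothetical helper (name chosen for illustration only).
final class BrNormalizer {
    // <br, then a word boundary, then any run of non-'>' characters, then '>'.
    // CASE_INSENSITIVE covers <BR ...> as well as <br ...>.
    private static final Pattern BR_TAG =
            Pattern.compile("<br\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Replaces every br tag, with or without attributes, by <br/>.
    static String normalize(String html) {
        return BR_TAG.matcher(html).replaceAll("<br/>");
    }
}
```

Because [^>]* cannot cross a closing >, each tag is matched on its own even when several tags share a line, and the word boundary leaves tags such as <brown> alone.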
You have to use .* instead of * :
htmlString.replaceAll("(?i)<br .*>", "<br/>")
//-----------------------------^^
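because:
* matches the preceding character or subexpression zero or more times,
and
.* matches any character zero or more times.
Note that .* is greedy: if a line contains several tags, the match runs to the last > on that line. The space after <br is also required here, so a plain <br> is not matched by this pattern.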
So for your case :
String htmlString = "<BR style=\"PAGE-BREAK-BEFORE: always\" clear=all>";
System.out.println(htmlString.replaceAll("(?i)<br .*>", "<br/>"));
Output
<br/>
Using regular expressions to parse HTML is not a good idea because HTML is not regular. You should use a proper parsing library like NekoHTML.
NekoHTML is a simple HTML scanner and tag balancer that enables
application programmers to parse HTML documents and access the
information using standard XML interfaces. The parser can scan HTML
files and "fix up" many common mistakes that human (and computer)
authors make in writing HTML documents. NekoHTML adds missing parent
elements; automatically closes elements with optional end tags; and
can handle mismatched inline element tags.

Change HTML element's CSS style using java

I am using JSP to create my web page. I need to use java classes to access the data that I need to pull from another website's JSON (this CANNOT change).
Say I have the code:
<div class="fruit apple"></div>
<div class="fruit banana"></div>
//"fruit peach", "fruit orange", and so on...
style.fruit {display: none;}
I need to change the HTML element using JAVA, not javascript. In my JSP file, it will be in a <% %> tag.
<% var divClassINeedToChange = "banana";
//some sort of JAVA code that is equivalent to:
//document.getElementsByClass(divClassINeedToChange).style.display = "block"; %>
I cannot find the line of java code that is equivalent to the above line.
I hope this helps.
You can parse your page using a DOM or SAX parser, for example:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(new File(filename));
// Note: getElementById looks an element up by its ID attribute, not by its class
Element e = doc.getElementById(divClassINeedToChange);
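Since the question selects elements by class rather than by ID, here is a hedged sketch of how the DOM approach could be adapted. It assumes the markup is well-formed XML (the standard DOM parser rejects ordinary tag-soup HTML), and the names ClassStyler, parse, and showByClass are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (names chosen for illustration only).
final class ClassStyler {
    // Parses a well-formed XHTML fragment into a DOM document.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Input is not well-formed XML", e);
        }
    }

    // Sets style="display: block;" on every div whose class list contains
    // className, and returns how many elements were changed.
    static int showByClass(Document doc, String className) {
        NodeList divs = doc.getElementsByTagName("div");
        int changed = 0;
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            // The class attribute is a whitespace-separated list of names.
            String[] classes = div.getAttribute("class").trim().split("\\s+");
            if (Arrays.asList(classes).contains(className)) {
                div.setAttribute("style", "display: block;");
                changed++;
            }
        }
        return changed;
    }
}
```

In a JSP it is often simpler to decide the style while the markup is being written out, rather than parsing the page afterwards; the sketch only shows what the DOM route would look like.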

Java / Android HTML custom tag parser

I'm trying to figure out a way to parse a html file with custom tags in the form:
[custom tag="id"]
Here's an example of a file I'm working with:
<p>This is an <em>amazing</em> example. </p>
<p>Such amazement, <span>many wow.</span> </p>
<p>Oh look, a wild [custom tag="amaze"] appears.</p>
We need maor embeds <a href="http://youtu.be/F5nLu232KRo"> bro
What I would like (in an ideal world) is to get back a list of elements:
List foundElements = [text, custom tag, text, link, text]
where the elements in the above list contain:
Text:
<p>This is an <em>amazing</em> example. </p>
<p>Such amazement, <span>many wow.</span> </p>
<p>Oh look, a wild [custom tag="amaze"] appears.</p>
We need maor embeds
Custom tag:
[custom tag="amaze"]
Link:
<a href="http://youtu.be/F5nLu232KRo">
Text:
appears.</p>We need maor embeds
What I've tried:
Jsoup
Jsoup is great, it works perfectly for HTML. The issue is I can't define custom tags with opening "[" and closing "]". Correct me if I'm wrong?
Jericho
Again like Jsoup, Jericho works great..except for defining custom tags. You're required to use "<".
Java Regex
This is the option I really don't want to go for. It's not reliable and there's a lot of string manipulation that is brittle, especially when you're matching against a lot of regexes.
Last but not least, I'm looking for a performance orientated solution as this is done on an Android client.
All suggestions welcome!
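One possible direction, sketched under assumptions: a single pass over the input using indexOf, with no regular expressions, that separates [custom tag="..."] markers from the surrounding text. It does not handle the link case, and the class name CustomTagSplitter and the exact marker syntax are assumptions based on the example above. The text pieces could then be passed to an HTML parser such as Jsoup.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (names chosen for illustration only).
final class CustomTagSplitter {
    private static final String OPEN = "[custom tag=\"";
    private static final String CLOSE = "\"]";

    // Splits the input into alternating text pieces and custom-tag markers,
    // in document order. Each marker is returned verbatim.
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int start = input.indexOf(OPEN, pos);
            int end = start < 0 ? -1 : input.indexOf(CLOSE, start + OPEN.length());
            if (start < 0 || end < 0) {
                // No further complete marker: the rest is plain text.
                parts.add(input.substring(pos));
                break;
            }
            if (start > pos) {
                parts.add(input.substring(pos, start)); // text before the marker
            }
            parts.add(input.substring(start, end + CLOSE.length())); // the marker itself
            pos = end + CLOSE.length();
        }
        return parts;
    }
}
```

The scan visits each character a bounded number of times and allocates only the resulting substrings, which may matter on an Android client.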

Parsing XML with embedded data

I'm trying to parse an XML document on Android. My problem is that the XML is in a strange format. The entirety of the data I'm trying to parse is located inside one element.
Here is an example:
<a name="3"></a>
<div class="series_alpha">
<h2 class="series_alpha">3</h2>
<ul class="series_alpha"><li>3 Banme no Kareshi<span class="mangacompleted">[Completed]</span></li>
<li>3 Gatsu no Lion</li>
<li>337 Byooshi</li>
<li>360 Degrees Material</li>
<li>37 Degrees Kiss<span class="mangacompleted">[Completed]</span></li>
<li>3x3 Eyes</li>
</ul>
<div class="clear"></div>
</div>
The XML is a piece of the source from this webpage. The data I'm trying to retrieve is found in the <li> tag, specifically the link reference and the manga name. But I don't know how I would separate the link from the title.
After looking it up, I found that the information inside the tags is known as an "attribute" (I'm a noob, I know), and with some Google searches I found that the attributes.getValue("name of attribute here") method is what I was looking for.
source: XML Parsing to get Attribute Value
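A minimal sketch of the attributes.getValue approach with a SAX handler follows. It assumes each <li> wraps the title in an <a href="..."> element (the anchors are not visible in the excerpt above) and that the input is well-formed XML; the class name LinkCollector and the sample URL in the usage note are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (names chosen for illustration only).
final class LinkCollector {
    // Returns a map from link text (the title) to the href attribute value,
    // in document order.
    static Map<String, String> links(String xml) {
        Map<String, String> result = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            private String href;          // href of the <a> currently open
            private StringBuilder title;  // non-null while inside an <a>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                if ("a".equals(qName)) {
                    href = attributes.getValue("href");
                    title = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (title != null) {
                    title.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("a".equals(qName) && title != null) {
                    result.put(title.toString().trim(), href);
                    title = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return result;
    }
}
```

For example, links("<ul><li><a href=\"/manga/3x3\">3x3 Eyes</a></li></ul>") would map the title "3x3 Eyes" to "/manga/3x3", which keeps the link and the title separate.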

Read in html table to java

I need to pull data from an html page using Java code. The java part is required.
The page I am trying to pull info from is http://www.weather.gov/data/obhistory/KMCI.html.
I need to create a list of hashmaps, or some kind of data object that I can reference in later code.
This is all I have so far:
URL weatherDataKC = new URL("http://www.weather.gov/data/obhistory/KMCI.html");
InputStream is = weatherDataKC.openStream();
int cnt = 0;
StringBuffer buffer = new StringBuffer();
while ((cnt = is.read()) != -1){
buffer.append((char) cnt);
}
System.out.print(buffer.toString());
Any suggestions where to start?
There is a nice HTML parser called Neko:
NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.
More information here.
Use an HTML parser like CyberNeko
J2SE includes HTML parsing capabilities, in the packages javax.swing.text.html and javax.swing.text.html.parser. HTMLEditorKit.ParserCallback receives events pushed by DocumentParser (best used through ParserDelegator). The framework is very similar to the SAX parsers for XML.
Beware, there are some bugs. It won't be able to handle bad HTML very well.
Dealing with colspan and rowspan is your business.
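To illustrate the callback style described above, here is a rough sketch that collects the text of every table cell with ParserDelegator and HTMLEditorKit.ParserCallback. The class name TableCellReader is made up, and the sketch ignores colspan, rowspan, and nested tables; it is not tailored to the weather page in the question.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (names chosen for illustration only).
final class TableCellReader {
    // Returns the trimmed text content of each <td>, in document order.
    static List<String> cellTexts(String html) {
        List<String> cells = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private StringBuilder current; // non-null while inside a <td>

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TD) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (current != null) {
                    current.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.TD && current != null) {
                    cells.add(current.toString().trim());
                    current = null;
                }
            }
        };
        try {
            // The third argument tells the parser to ignore any charset declaration.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cells;
    }
}
```

Grouping the cells into rows (for example, by starting a new list on each TR start tag) would be the next step toward the list of hashmaps the question asks for.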
HTML scraping is notoriously difficult, unless you have a lot of "hooks" like unique IDs. For example, the table you want starts with this HTML:
<table cellspacing="3" cellpadding="2" border="0" width="670">
...which is very generic and may match several tables on the page. The other problem is, what happens if the HTML structure changes? You'll have to redefine all your parsing rules...
