I'm trying to figure out a way to parse an HTML file with custom tags in the form:
[custom tag="id"]
Here's an example of a file I'm working with:
<p>This is an <em>amazing</em> example. </p>
<p>Such amazement, <span>many wow.</span> </p>
<p>Oh look, a wild [custom tag="amaze"] appears.</p>
We need maor embeds <a href="http://youtu.be/F5nLu232KRo"> bro
What I would like (in an ideal world) is to get back a list of elements:
List foundElements = [text, custom tag, text, link, text]
Where the elements in the above list contain:
Text:
<p>This is an <em>amazing</em> example. </p>
<p>Such amazement, <span>many wow.</span> </p>
<p>Oh look, a wild [custom tag="amaze"] appears.</p>
We need maor embeds
Custom tag:
[custom tag="amaze"]
Link:
<a href="http://youtu.be/F5nLu232KRo">
Text:
appears.</p>We need maor embeds
What I've tried:
Jsoup
Jsoup is great, it works perfectly for HTML. The issue is I can't define custom tags with opening "[" and closing "]". Correct me if I'm wrong?
Jericho
Again, like Jsoup, Jericho works great, except for defining custom tags. You're required to use "<".
Java Regex
This is the option I really don't want to go for. It's not reliable and there's a lot of string manipulation that is brittle, especially when you're matching against a lot of regexes.
Last but not least, I'm looking for a performance-oriented solution, as this is done on an Android client.
All suggestions welcome!
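One way to get the list described above without a pile of regexes is a small hand-written scanner that walks the string once. The sketch below is only an illustration under assumptions: the names CustomTagScanner, Token, Kind and scan are made up here, a custom tag is assumed to start with the literal "[custom " and end at the next "]", and a link is assumed to be a lowercase "<a ...>" opening tag. Everything else is passed through as text, so the HTML in those text pieces can still be handed to Jsoup or Jericho afterwards.

```java
import java.util.ArrayList;
import java.util.List;

class CustomTagScanner {
    enum Kind { TEXT, CUSTOM_TAG, LINK }

    static final class Token {
        final Kind kind;
        final String value;
        Token(Kind kind, String value) { this.kind = kind; this.value = value; }
        @Override public String toString() { return kind + ":" + value; }
    }

    // Splits the input into text, [custom ...] tags and <a ...> opening tags in one pass.
    static List<Token> scan(String input) {
        List<Token> tokens = new ArrayList<>();
        int textStart = 0; // start of the text run collected so far
        int i = 0;
        while (i < input.length()) {
            Kind kind = null;
            char close = 0;
            if (input.startsWith("[custom ", i)) {
                kind = Kind.CUSTOM_TAG;
                close = ']';
            } else if (input.startsWith("<a ", i)) {
                kind = Kind.LINK;
                close = '>';
            }
            int end = (kind == null) ? -1 : input.indexOf(close, i);
            if (end < 0) {
                // Not a recognised tag (or it is never closed): treat as plain text.
                i++;
                continue;
            }
            if (i > textStart) {
                tokens.add(new Token(Kind.TEXT, input.substring(textStart, i)));
            }
            tokens.add(new Token(kind, input.substring(i, end + 1)));
            i = end + 1;
            textStart = i;
        }
        if (textStart < input.length()) {
            tokens.add(new Token(Kind.TEXT, input.substring(textStart)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String html = "<p>Oh look, a wild [custom tag=\"amaze\"] appears.</p>"
                + "We need maor embeds <a href=\"http://youtu.be/F5nLu232KRo\"> bro";
        for (Token t : scan(html)) {
            System.out.println(t);
        }
    }
}
```

On the sample from the question this yields five tokens in the order text, custom tag, text, link, text. It uses only indexOf and startsWith, so no regex is compiled and only the token substrings are allocated, which may matter on Android.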
Related
We have to extract all the text from an HTML file without using Jsoup or anything similar. What's the best (or only) way to do that? Our example looks like this:
<ul><li>Coffee</li><li>Tea</li><li>Milk</li></ul>
<h2>An Ordered HTML List</h2>
<ol><li>Coffee</li><li>Tea</li><li>Milk</li></ol>
We need to get all the text out of these HTML tags without using any libraries and, if a tag is not closed correctly, print an error message. Any help is appreciated.
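A possible starting point, assuming the input only contains simple paired tags like the ones in the example: keep a stack of open tag names, collect the text between tags, and report an error when a closing tag does not match the top of the stack. The names TagTextExtractor and extractText are hypothetical, and this sketch does not handle void elements such as <br>, or a ">" inside comments or attribute values.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class TagTextExtractor {

    // Returns the non-blank text runs found between tags, in document order.
    // Throws IllegalArgumentException when a tag is unterminated, mismatched or never closed.
    static List<String> extractText(String html) {
        List<String> texts = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>(); // stack of currently open tag names
        int pos = 0;
        while (pos < html.length()) {
            int lt = html.indexOf('<', pos);
            String run = (lt < 0 ? html.substring(pos) : html.substring(pos, lt)).trim();
            if (!run.isEmpty()) {
                texts.add(run);
            }
            if (lt < 0) {
                break; // no more tags
            }
            int gt = html.indexOf('>', lt);
            if (gt < 0) {
                throw new IllegalArgumentException("Unterminated tag at index " + lt);
            }
            String body = html.substring(lt + 1, gt).trim();
            if (body.startsWith("/")) {
                // Closing tag: it must match the most recently opened tag.
                String name = body.substring(1).trim().toLowerCase();
                if (open.isEmpty() || !open.peek().equals(name)) {
                    throw new IllegalArgumentException("Unexpected closing tag </" + name + ">");
                }
                open.pop();
            } else if (!body.endsWith("/") && !body.startsWith("!")) {
                // Opening tag: keep only the name, dropping any attributes.
                open.push(body.split("\\s+")[0].toLowerCase());
            }
            pos = gt + 1;
        }
        if (!open.isEmpty()) {
            throw new IllegalArgumentException("Unclosed tag <" + open.peek() + ">");
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<ul><li>Coffee</li><li>Tea</li><li>Milk</li></ul>\n"
                + "<h2>An Ordered HTML List</h2>";
        System.out.println(extractText(html));
        try {
            extractText("<ul><li>Coffee</ul>");
        } catch (IllegalArgumentException e) {
            System.out.println("Error: " + e.getMessage());
        }
    }
}
```

The stack is what makes the error check work: a closing tag is only accepted if it matches the most recently opened tag, and anything left on the stack at the end was never closed.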
How do I configure a jsoup Whitelist to allow internal anchor references, without allowing any arbitrary value?
Example html:
Jump To Section 1
<!-- ... -->
<a name="section1">Section 1</a>
If I attempt to clean the code with the relaxed Whitelist the href is removed.
Jsoup.clean(html, Whitelist.relaxed().addAttributes("a", "name", "target"));
returns the following:
<a target="_self">Jump To Section 1</a>
<!-- ... -->
<a name="section1">Section 1</a>
If I manually build a Whitelist and add the tags and attributes that I want, but don't call addProtocols(...), I can get jsoup to leave the href in place. That doesn't seem like a good solution, though, because it doesn't filter out hrefs that contain JavaScript. For example, I want the a tag (or at least the href) removed from the following:
Jump To Section 1
<a name="section1">Section 1</a>
Is this possible with jsoup?
I did see the following patch submission to jsoup, but it doesn't look like it made it into the jsoup code base: https://github.com/jhy/jsoup/pull/77
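Whitelist whitelist = new Whitelist();
whitelist.addAttributes("a", "accesskey", "dir", "lang", "style", "tabindex", "title", "href");
Cleaner cleaner = new Cleaner(whitelist);
// doc is the already-parsed Document; clean() returns a new, sanitized Document
Document cleaned = cleaner.clean(doc);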
If no protocols are provided/whitelisted, then all of them are implicitly allowed (see isSafeAttribute). If you want to allow internal anchors, then unfortunately you must never call addProtocols for your whitelist's anchor tags (or at least not for the href attribute). It looks like there was a pull request to add support, but it was never merged.
Be aware that if you allow all protocols, a malicious user can run JavaScript when a link is clicked:
Some text
so be cautious of that if you do not trust your HTML.
If you want to only allow say, http, https, and anchor tags, then I believe you are out of luck.
The reply with 3 upvotes doesn't answer the question at all.
The GitHub pull request mentioned in the OP has since been merged. For others who are looking for the answer:
Whitelist.relaxed().addProtocols("a", "href", "#")
Reference: Jsoup API Document
I have the HTML content given below. The items I am looking for here are "img src" and "!important". Does Java provide any HTML parsing techniques?
<fieldset>
<table cellpadding='0'border='0'cellspacing='0'style="clear :both">
<tr valign='top' ><td width='35' >
<a href='http://mypage.rediff.com/android/32868898'class='space' onmousedown="return
enc(this,'http://track.rediff.com/clickurl=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F3 868898___&service=mypage_feeds&clientip=202.137.232.117&pos=0&feed_id=12942949154d255f839677925642&prc_id=32868898&rowid=2064549114')" >
<div style='width:25px;height:25px;overflow:hidden;'>
<img src='http://socialimg04.rediff.com/image.php?uid=32868898&type=thumb' width='25' vspace='0' /></div></a></td> <td><span>
<a href='http://mypage.rediff.com/android/32868898' class="space" onmousedown="return enc(this,'http://track.rediff.com/click?url=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F32868898___&service=mypage_feeds&clientip=202.137.232.117&pos=0&feed_id=12942949154d255f839677925642&prc_id=32868898&rowid=2064549114')" >Android </a> </span><span style='color:#000000
!important;'>android se updates...</span><div class='divtext'></div></td></tr><tr><td height='5' ></td></tr></table></fieldset><br/>
Document doc = Jsoup.parse(new File("d:\\1.html"), "UTF-8");
String value = doc.select("img").attr("src");
System.out.println(value); // http://socialimg04.rediff.com/image.php?uid=32868898&type=thumb
System.out.println(doc.select("span[style$=important;]").first().text()); // android se updates...
JSoup
What are the pros and cons of the leading Java HTML parsers?
Try NekoHtml. This is the HTML parsing library used by various higher-level testing frameworks such as HtmlUnit.
NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.
I used jsoup. This library has a nice selector syntax (http://jsoup.org/cookbook/extracting-data/selector-syntax), and for your problem you can use code like this:
File input = new File("input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements pngs = doc.select("img[src$=.png]");
I like using Jericho: http://jericho.htmlparser.net/docs/index.html
It copes with badly formed HTML, links leading to unavailable locations, etc.
There are a lot of examples on their page; you just get all the IMG tags and analyze their attributes to extract those that meet your needs.
I am using this pattern to remove all HTML tags (Java code):
String html="text <a href=#>link</a> <b>b</b> pic<img src=#>";
html=html.replaceAll("\\<.*?\\>", "");
System.out.println(html);
Now I want to keep the <a ...> tag (with </a>) and the <img ...> tag.
I want the result to be:
text <a href=#>link</a> b pic<img src=#>
How can I do this?
I don't need an HTML parser for this, because I need this regex pattern to filter a lot of HTML fragments, so I want a regex solution.
You could do this using a negative lookahead:
"<(?!(?:a|/a|img)\\b).*?>"
Rubular
However this has a number of problems and I would recommend instead that you use an HTML parser if you want a robust solution.
For more information see this question:
What HTML parsing libraries do you recommend in Java
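To make the lookahead concrete, here is a minimal sketch that applies the pattern to the string from the question. The class and method names (KeepAnchorsAndImages, stripOtherTags) are made up for illustration. The second call shows one of the problems mentioned: a ">" inside an attribute value ends the match too early.

```java
class KeepAnchorsAndImages {
    // Matches any tag except <a ...>, </a> and <img ...>.
    static final String OTHER_TAGS = "<(?!(?:a|/a|img)\\b).*?>";

    static String stripOtherTags(String html) {
        return html.replaceAll(OTHER_TAGS, "");
    }

    public static void main(String[] args) {
        String html = "text <a href=#>link</a> <b>b</b> pic<img src=#>";
        // prints: text <a href=#>link</a> b pic<img src=#>
        System.out.println(stripOtherTags(html));

        // A '>' inside a quoted attribute value cuts the tag short,
        // so part of the tag is left behind in the output.
        System.out.println(stripOtherTags("<b title=\"a>b\">x</b>"));
    }
}
```

The \\b after the alternation is what stops tags such as <abbr> from being treated as <a>. Note also that "." does not match line breaks by default in Java, so a tag that spans several lines is not removed unless DOTALL is enabled.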
Check out http://sourceforge.net/projects/regexcreator/. It is a very handy GUI regex editor.
Hey! Here is your answer:
You can’t parse [X]HTML with regex.
Use a proper HTML parser, for example htmlparser, Jericho or the validator.nu HTML parser. Then use the parser’s API, SAX or DOM to pull out the stuff you’re interested in.
If you insist on using regular expressions, you’re almost certain to make some small mistake that will lead to breakage, and possibly to cross-site scripting attacks, depending on what you’re doing with the markup.
See also this answer.
I recommend you use strip_tags (a PHP function)
string strip_tags ( string $str [, string $allowable_tags ] )
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> Other text';
echo strip_tags($text);
echo "\n";
// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>
OUTPUT
Test paragraph. Other text
<p>Test paragraph.</p> Other text
I need to pull data from an HTML page using Java code. The Java part is required.
The page I am trying to pull info from is http://www.weather.gov/data/obhistory/KMCI.html.
I need to create a list of hashmaps, or some kind of data object that I can reference in later code.
This is all I have so far:
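import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

URL weatherDataKC = new URL("http://www.weather.gov/data/obhistory/KMCI.html");
StringBuilder buffer = new StringBuilder();
// Decode bytes as characters (UTF-8 is an assumption) and close the stream when done.
try (Reader in = new InputStreamReader(weatherDataKC.openStream(), StandardCharsets.UTF_8)) {
    int cnt;
    while ((cnt = in.read()) != -1) {
        buffer.append((char) cnt);
    }
}
System.out.print(buffer.toString());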
Any suggestions on where to start?
There is a nice HTML parser called Neko:
NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.
More information here.
Use an HTML parser like CyberNeko
J2SE includes HTML parsing capabilities in the packages javax.swing.text.html and javax.swing.text.html.parser. HTMLEditorKit.ParserCallback receives events pushed by DocumentParser (which is best used through ParserDelegator). The framework is very similar to the SAX parsers for XML.
Beware, there are some bugs. It won't be able to handle bad HTML very well.
Dealing with colspan and rowspan is your business.
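As a rough illustration of the callback approach, the sketch below collects the cell text of every table row into a list of lists. The class name TableRowCollector, the method parseRows and the sample table are made up for this example; it assumes well-formed, non-nested tables and, as noted, does nothing special for colspan or rowspan.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class TableRowCollector extends HTMLEditorKit.ParserCallback {
    private final List<List<String>> rows = new ArrayList<>();
    private List<String> currentRow;   // null while outside a <tr>
    private StringBuilder currentCell; // null while outside a <td>/<th>

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.TR) {
            currentRow = new ArrayList<>();
        } else if (tag == HTML.Tag.TD || tag == HTML.Tag.TH) {
            currentCell = new StringBuilder();
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Only keep text that appears inside a table cell.
        if (currentCell != null) {
            currentCell.append(data);
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if ((tag == HTML.Tag.TD || tag == HTML.Tag.TH) && currentCell != null) {
            if (currentRow != null) {
                currentRow.add(currentCell.toString().trim());
            }
            currentCell = null;
        } else if (tag == HTML.Tag.TR && currentRow != null) {
            rows.add(currentRow);
            currentRow = null;
        }
    }

    // Parses the given HTML and returns one list of cell strings per table row.
    static List<List<String>> parseRows(String html) {
        TableRowCollector collector = new TableRowCollector();
        try {
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.rows;
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>A</td><td>B</td></tr>"
                + "<tr><td>C</td><td>D</td></tr></table>";
        System.out.println(parseRows(html));
    }
}
```

To use it on a live page, the StringReader can be swapped for a reader over the URL's stream. Each inner list could then be turned into a map keyed by column header if a list of hashmaps is what you need.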
HTML scraping is notoriously difficult, unless you have a lot of "hooks" like unique IDs. For example, the table you want starts with this HTML:
<table cellspacing="3" cellpadding="2" border="0" width="670">
...which is very generic and may match several tables on the page. The other problem is, what happens if the HTML structure changes? You'll have to redefine all your parsing rules...