Can I use jsoup to parse non-standard markup, such as <LOCATION>, <PERSON>, or <ORGANIZATION>?
This is an example sentence in my corpus:
I HAD been hearing about vineyards in <LOCATION>Malibu</LOCATION> for some time,
but I wrote them off. Had to be a tourist gimmick, like
<PERSON>Knott</PERSON>'s <ORGANIZATION>Berry Farm</ORGANIZATION>
or the LaBrea Tar Pits. <LOCATION>Malibu</LOCATION> was the playground of the stars,
a surfers' mecca, but cabernet? No way.
I'd like to extract something like:
Location: Malibu
Person: Knott
Organization: Berry Farm
If the markup is not part of the HTML specification, the default parse method may not handle the custom tags the way you expect.
You can, however, tell jsoup to parse the input as XML:
Jsoup.parse(yourHtml, baseUriForLinks, Parser.xmlParser());
The call above returns a Document on which you can work with your custom markup.
Where:
yourHtml - the HTML with the custom markup as String
baseUriForLinks - the base URL of the HTML (so that JSoup can resolve relative links if any are present) also as String
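If adding jsoup is not an option, a similar result can be sketched with the XML parser that ships with the JDK. This is a swapped-in technique, not the jsoup call above. It assumes the sentence becomes well-formed XML once wrapped in a root element (the name `root` and the class `EntityTagExtractor` are made up for this illustration), so a stray `&` or an unclosed tag in the corpus would make it fail.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class EntityTagExtractor {

    // Returns one "Label: text" entry per tagged span, in document order.
    static List<String> extract(String annotatedText) {
        // Assumption: wrapping the text in a single root element makes it well-formed XML.
        String xml = "<root>" + annotatedText + "</root>";
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // "*" matches every descendant element, whatever its tag name.
            NodeList elements = doc.getDocumentElement().getElementsByTagName("*");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < elements.getLength(); i++) {
                Element e = (Element) elements.item(i);
                String tag = e.getTagName();
                // LOCATION -> Location
                String label = tag.substring(0, 1).toUpperCase(Locale.ROOT)
                        + tag.substring(1).toLowerCase(Locale.ROOT);
                result.add(label + ": " + e.getTextContent());
            }
            return result;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Input is not well-formed markup", ex);
        }
    }

    public static void main(String[] args) {
        String sentence = "vineyards in <LOCATION>Malibu</LOCATION>, like "
                + "<PERSON>Knott</PERSON>'s <ORGANIZATION>Berry Farm</ORGANIZATION>";
        extract(sentence).forEach(System.out::println);
    }
}
```

Repeated entities (Malibu appears twice in the example sentence) are reported once per occurrence; collect into a set if duplicates are unwanted.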
Related
I am consuming a REST web service and I get a response in text format. I want to convert this response to a Java object in order to apply some logic.
I have tried to parse the text as XML; unfortunately, it doesn't work.
Response from web service:
<html>
<title>rest response</title>
<body>
XY111NWA1
XY112NWA1
XY113NWA1
XY114NWA1
XY115NWA1
XY116NWA1
XY117NWA1
XY118NWA1
XY119NWA1
</body>
</html>
I expect the output to be in XML format so I can then parse it into a Java object, or
to get the data between the tags, so that the result is a Java list as follows:
XY111NWA1
XY112NWA1
XY113NWA1
XY114NWA1
XY115NWA1
XY116NWA1
XY117NWA1
XY118NWA1
XY119NWA1
That's not a typical REST response (it is neither JSON nor response-formatted XML), so you need to parse it as an HTML page (or HTML-like response).
WAY 1: I would recommend a tool such as jsoup:
Simple document parsing https://jsoup.org/cookbook/input/parse-document-from-string
More complete example which handles a list of links:
https://jsoup.org/cookbook/extracting-data/example-list-links
This way you can extract any node/value/attribute from your html response/file.
WAY 2: You can try more general SAX or JAXB parsers for parsing XML (even if it's HTML), as in https://www.mkyong.com/java/jaxb-hello-world-example/ . But that only works if your response follows some contract, and from what I see that's not the best approach here.
Relevant link: Which is the best library for XML parsing in java
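As a rough illustration of the general-parser route, here is a sketch that uses only the JDK's DOM parser. It works for the sample response only because that response happens to be well-formed; real-world HTML often is not, which is where a lenient parser such as jsoup is the safer choice. The class and method names are made up for this example.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class BodyTokenExtractor {

    // Returns the whitespace-separated tokens found inside the first <body> element.
    static List<String> bodyTokens(String response) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(response)));
            NodeList bodies = doc.getElementsByTagName("body");
            if (bodies.getLength() == 0) {
                return Collections.emptyList();
            }
            String text = bodies.item(0).getTextContent().trim();
            if (text.isEmpty()) {
                return Collections.emptyList();
            }
            // One entry per line or space-separated value.
            return Arrays.asList(text.split("\\s+"));
        } catch (Exception ex) {
            throw new IllegalArgumentException("Response is not well-formed markup", ex);
        }
    }

    public static void main(String[] args) {
        String response = "<html>\n<title>rest response</title>\n<body>\n"
                + "XY111NWA1\nXY112NWA1\nXY113NWA1\n</body>\n</html>";
        System.out.println(bodyTokens(response));
    }
}
```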
I'm trying to figure out a way to parse an HTML file with custom tags in the form:
[custom tag="id"]
Here's an example of a file I'm working with:
<p>This is an <em>amazing</em> example. </p>
<p>Such amazement, <span>many wow.</span> </p>
<p>Oh look, a wild [custom tag="amaze"] appears.</p>
We need maor embeds <a href="http://youtu.be/F5nLu232KRo"> bro
What I would like (in an ideal world) is to get back a list of elements:
List foundElements = [text, custom tag, text, link, text]
Where the elements in the above list contain:
Text:
<p>This is an <em>amazing</em> example. </p>
<p>Such amazement, <span>many wow.</span> </p>
<p>Oh look, a wild [custom tag="amaze"] appears.</p>
We need maor embeds
Custom tag:
[custom tag="amaze"]
Link:
<a href="http://youtu.be/F5nLu232KRo">
Text:
appears.</p>We need maor embeds
What I've tried:
Jsoup
Jsoup is great, it works perfectly for HTML. The issue is I can't define custom tags with opening "[" and closing "]". Correct me if I'm wrong?
Jericho
Again, like Jsoup, Jericho works great, except for defining custom tags. You're required to use "<".
Java Regex
This is the option I really don't want to go for. It's not reliable and there's a lot of string manipulation that is brittle, especially when you're matching against a lot of regexes.
Last but not least, I'm looking for a performance-oriented solution, as this is done on an Android client.
All suggestions welcome!
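One possible direction that avoids both regexes and an HTML parser is a single pass over the string with `indexOf`. The sketch below is only an illustration under assumptions: it splits the input into plain-text chunks and `[custom ...]` chunks, it does not treat `<a>` links separately, and the class name and the `"[custom "` prefix check are choices made for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class BracketTagScanner {

    private static final String OPEN = "[custom ";

    // Splits the input into alternating text chunks and [custom ...] tags, in order.
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int start = input.indexOf(OPEN, pos);
            int end = (start < 0) ? -1 : input.indexOf(']', start);
            if (start < 0 || end < 0) {
                // No further complete tag: the rest is plain text.
                parts.add(input.substring(pos));
                break;
            }
            if (start > pos) {
                parts.add(input.substring(pos, start)); // text before the tag
            }
            parts.add(input.substring(start, end + 1)); // the tag itself
            pos = end + 1;
        }
        return parts;
    }

    public static void main(String[] args) {
        for (String part : split("<p>Oh look, a wild [custom tag=\"amaze\"] appears.</p>")) {
            System.out.println("'" + part + "'");
        }
    }
}
```

Each character is visited a bounded number of times and no regex engine is involved, which may suit a constrained client; the text chunks can still be handed to an HTML parser afterwards if links need to be pulled out.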
I am traversing an XML document using the w3c DOM and I need to wrap a substring of the text content inside an org.w3c.dom.Element with some tag, based on some business logic.
For example, I want to turn
<title id="1">Java is a cool programming language</title>
into
<title id="1">Java is a <blah id="2">cool</blah> programming language</title>
I don't insist on using the w3c DOM library for my application so any suggestions are welcome in terms of other libraries that could accomplish this.
All text in an XML document is normally parsed by the parser.
But text inside a CDATA section is not interpreted as markup by the parser.
Try this:
<title id="1">Java is a <![CDATA[<blah id="2">cool</blah> ]]>programming language</title>
Typically you'd use &lt; and &gt; (and others) to construct such tags in your node values. These are so-called entity references; see Google/Bing/YourFavouriteSearchEngine for more details.
In your example, this would mean you'd use:
<title id="1">Java is a &lt;blah id="2"&gt;cool&lt;/blah&gt; programming language</title>
Cheers,
Wim
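For the wrapping question above, another option is to stay within the w3c DOM and split the text node around the target substring with `Text.splitText`, then move the middle piece into a new element. The following is a minimal sketch under assumptions: it only looks at text nodes that are direct children of the root element, wraps the first match only, and the class and method names are invented for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

final class SubstringWrapper {

    // Wraps the first occurrence of target (in a direct text child of the root) in <tagName id="...">.
    static String wrapFirst(String xml, String target, String tagName, String id) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();
            for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() != Node.TEXT_NODE) {
                    continue;
                }
                Text text = (Text) n;
                int start = text.getData().indexOf(target);
                if (start < 0) {
                    continue;
                }
                Text middle = text.splitText(start);   // middle now begins with target
                middle.splitText(target.length());     // splits off whatever follows target
                Element wrapper = doc.createElement(tagName);
                wrapper.setAttribute("id", id);
                root.replaceChild(wrapper, middle);    // put the wrapper where target was
                wrapper.appendChild(middle);           // move target inside the wrapper
                break;
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not process XML", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(wrapFirst(
                "<title id=\"1\">Java is a cool programming language</title>", "cool", "blah", "2"));
    }
}
```

Unlike the CDATA and entity-reference approaches, this produces a real child element in the tree rather than text that merely looks like a tag.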
My input html is
<p>
<span>first
</span>
<span>Google Cloud Connect for Microsoft Office</span>
</p>
I am using XSLT 1.0 to convert the HTML to XML. My output XML is
<Relationship Id="rId12700703801" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink" Target="http://tools.google.com/dlpage/cloudconnect#utm_campaign=launch&utm_source=en-na-us-gdb-GCC-Appsperience_02242011&utm_medium=blog" TargetMode="External"/></Relationships>
with the error "XML Parsing Error: not well-formed" at the position just after launch&utm_source in the Target attribute.
I want to escape the special characters present in the URL through XSLT and produce well-formed XML.
Please help me. Thanks in advance.
Are you generating the input HTML? If so, you can use URLEncoder.encode to properly encode the string so the transformer doesn't complain about the syntax.
If this is just a random HTML page and you have no control over it, then you probably need to use an HTML parser, such as TagSoup, to pre-correct it, as most HTML files are not well-formed.
XSLT expects XML as input, not HTML. You need to turn your HTML into XML if you want to transform it with XSLT.
I think it might be possible to do it with HTML Tidy.
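The parse error itself comes from the bare `&` in the attribute value: in XML it has to be written as `&amp;`. An XML serializer normally does this for you when the value reaches it as data, but if the string is assembled by hand, an escaping step along these lines may help. This is a small sketch with a made-up class name, not part of any XSLT API.

```java
final class XmlAttributeEscaper {

    // Replaces the five characters that have special meaning in XML attribute values.
    static String escape(String value) {
        StringBuilder sb = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String url = "http://tools.google.com/dlpage/cloudconnect#utm_campaign=launch&utm_source=x";
        System.out.println("<Relationship Target=\"" + escape(url) + "\"/>");
    }
}
```

Note that this is different from URL encoding: the URL keeps its meaning, and an XML parser reading the attribute back will see the original `&`.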
I am using this pattern to remove all HTML tags (Java code):
String html="text <a href=#>link</a> <b>b</b> pic<img src=#>";
html=html.replaceAll("\\<.*?\\>", "");
System.out.println(html);
Now I want to keep the <a ...> tag (with </a>) and the <img ...> tag.
I want the result to be:
text <a href=#>link</a> b pic<img src=#>
How can I do this?
I don't want to use an HTML parser for this,
because I need this regex pattern to filter a lot of HTML fragments,
so I want a solution that uses a regex.
You could do this using a negative lookahead:
"<(?!(?:a|/a|img)\\b).*?>"
Rubular
However this has a number of problems and I would recommend instead that you use an HTML parser if you want a robust solution.
For more information see this question:
What HTML parsing libraries do you recommend in Java
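Applied to the string from the question, the negative-lookahead pattern can be tried out as follows. This is just a sketch of that one pattern (the class and method names are made up), and it inherits the regex approach's weaknesses: for example, a `>` inside an attribute value or a tag name such as `<abbr>` written without a word boundary after `a` are cases worth testing before relying on it.

```java
import java.util.regex.Pattern;

final class TagFilter {

    // Matches any tag whose name is not a, /a or img (the lookahead rejects those three).
    private static final Pattern UNWANTED_TAG =
            Pattern.compile("<(?!(?:a|/a|img)\\b).*?>");

    // Removes every tag except <a ...>, </a> and <img ...>.
    static String stripExceptLinksAndImages(String html) {
        return UNWANTED_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "text <a href=#>link</a> <b>b</b> pic<img src=#>";
        System.out.println(stripExceptLinksAndImages(html));
        // prints: text <a href=#>link</a> b pic<img src=#>
    }
}
```

Compiling the pattern once and reusing it, as above, matters when many fragments are filtered; `String.replaceAll` recompiles the regex on every call.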
Check out http://sourceforge.net/projects/regexcreator/ . It is a very handy GUI regex editor.
Hey! Here is your answer:
You can’t parse [X]HTML with regex.
Use a proper HTML parser, for example htmlparser, Jericho or the validator.nu HTML parser. Then use the parser’s API, SAX or DOM to pull out the stuff you’re interested in.
If you insist on using regular expressions, you’re almost certain to make some small mistake that will lead to breakage, and possibly to cross-site scripting attacks, depending on what you’re doing with the markup.
See also this answer.
I recommend you use strip_tags (a PHP function)
string strip_tags ( string $str [, string $allowable_tags ] )
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> Other text';
echo strip_tags($text);
echo "\n";
// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>
OUTPUT
Test paragraph. Other text
<p>Test paragraph.</p> Other text