Java: enrich xml with elements through regex string parsing


I have a complex task: transforming a docx document to JATS XML. So far I have extracted everything I can from the docx with XSLT. The next step is to parse the XML file and update it by turning some strings (text inside nodes) into XML elements. I have not found this covered in similar questions on this forum. My input XML looks like this:
<article dtd-version="3.0" article-type="other">
<body>
<sec>
<title>mySuperTitle</title>
<p>
This is some scientific stuff [1]. Here is more complicated info. This text is even more bizarre [2,3].
</p>
<p>
Einstein formulas [4]. String theory [5,6]. Really don`t know what to write here[7,8].
</p>
</sec>
<sec>
<title>AnotherBoringTitle</title>
<p>
Another one section and obviously here is even more citations [9,10,11]
</p>
</sec>
</body>
</article>
Ideally, I want to replace all [citations], which are simple numbers in [], with XML elements. For example:
<article dtd-version="3.0" article-type="other">
<body>
<sec>
<title>mySuperTitle</title>
<p>
This is some scientific stuff [<xref ref-type="bibr" rid="bib1">1</xref>]. Here is more complicated info. This text is even more bizarre [<xref ref-type="bibr" rid="bib2">2</xref>,<xref ref-type="bibr" rid="bib3">3</xref>].
</p>
<p>
Einstein formulas [<xref ref-type="bibr" rid="bib4">4</xref>]. String theory [<xref ref-type="bibr" rid="bib5">5</xref>,<xref ref-type="bibr" rid="bib6">6</xref>]. Really don`t know what to write here [<xref ref-type="bibr" rid="bib7">7</xref>,<xref ref-type="bibr" rid="bib8">8</xref>].
</p>
</sec>
<sec>
<title>AnotherBoringTitle</title>
<p>
Another one section and obviously here is even more citations [<xref ref-type="bibr" rid="bib9">9</xref>,<xref ref-type="bibr" rid="bib10">10</xref>,<xref ref-type="bibr" rid="bib11">11</xref>]
</p>
</sec>
</body>
</article>
I don't have much experience in Java, but I have already tried DOM, XPath and regex for this task. The problem is that once I parse the document and find the node, I have to get it from the DOM, convert it to a string, replace the numbers in the string, convert the result back to an element, and write the output. I find it difficult to turn this string into an element (this seems to require creating a new DocumentBuilder) and to replace the right node in the DOM so that the new XML is output.
Is there an easy solution, or do I have to write many lines of code?

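This works using DOM and regexes.
I assume you know how to find the right Text node.
You then need to:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

// Match against a snapshot of the text; all match offsets refer to this original string.
// Note: this pattern only matches a single number in brackets, such as [4], not lists such as [5,6].
int prevSplitOffset = 0; // offset in the original string where textNode currently starts
Matcher m = Pattern.compile("\\[(\\d+)\\]").matcher(textNode.getData());
while (m.find()) {
    // Split off the text before the number, then the number itself:
    Text number = textNode.splitText(m.start(1) - prevSplitOffset);
    textNode = number.splitText(m.group(1).length());
    // Replace the number with a new <xref> element that wraps it:
    Element xref = document.createElement("xref");
    xref.setAttribute("rid", "bib" + m.group(1));
    xref.setAttribute("ref-type", "bibr");
    number.getParentNode().replaceChild(xref, number);
    xref.appendChild(number);
    // textNode now holds everything after the number:
    prevSplitOffset = m.end(1);
}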
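The question also contains lists such as [2,3] and [9,10,11], which a single-number pattern does not cover. Below is a minimal, self-contained sketch of one way to handle both cases with the same splitText technique. It assumes the citations only occur in text directly inside <p> elements. The names CitationLinker and link are made up for illustration, and the two-pattern approach (find the bracketed list, then each number inside it) is one option, not something taken from the answer above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
final class CitationLinker {
    // A bracketed, comma-separated list of numbers such as [7] or [9,10,11].
    private static final Pattern GROUP = Pattern.compile("\\[\\d+(?:\\s*,\\s*\\d+)*\\]");
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    /** Parses the XML, wraps every cited number found in <p> text in an <xref>, and serializes the result. */
    static String link(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList paragraphs = doc.getElementsByTagName("p");
            for (int i = 0; i < paragraphs.getLength(); i++) {
                // Snapshot the text children first, because splitting adds new siblings.
                List<Text> texts = new ArrayList<>();
                for (Node n = paragraphs.item(i).getFirstChild(); n != null; n = n.getNextSibling()) {
                    if (n.getNodeType() == Node.TEXT_NODE) {
                        texts.add((Text) n);
                    }
                }
                for (Text t : texts) {
                    linkTextNode(doc, t);
                }
            }
            Transformer tf = TransformerFactory.newInstance().newTransformer();
            tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            tf.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not process XML", e);
        }
    }

    private static void linkTextNode(Document doc, Text textNode) {
        String data = textNode.getData(); // all offsets below refer to this snapshot
        int consumed = 0;                 // offset in data where textNode currently starts
        Matcher group = GROUP.matcher(data);
        while (group.find()) {
            // Look for the individual numbers only inside the current bracketed list.
            Matcher num = NUMBER.matcher(data).region(group.start(), group.end());
            while (num.find()) {
                Text number = textNode.splitText(num.start() - consumed);
                textNode = number.splitText(num.end() - num.start());
                Element xref = doc.createElement("xref");
                xref.setAttribute("ref-type", "bibr");
                xref.setAttribute("rid", "bib" + num.group());
                number.getParentNode().replaceChild(xref, number);
                xref.appendChild(number);
                consumed = num.end();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(link("<p>String theory [5,6]. Einstein formulas [4].</p>"));
    }
}
```

Restricting the inner matcher to the bracketed region keeps plain numbers elsewhere in the sentence from being wrapped. The brackets and commas stay in the text, as in the desired output in the question.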

Related

Trying to replace <br>, <BR>, <br +attribute> tags with <br/>

I am attempting to make a bunch of HTML documents XML-compliant (via a Java method), and there are a lot of <br> tags that either (1) are unclosed or (2) contain attributes. For some reason, the regex I'm using does not handle the tags that contain attributes. Here is the code:
htmlString = htmlString.replaceAll("(?i)<br *>", "<br/>");
This code works fine for all the <br> tags in the documents; it replaces them with <br/>. However, for tags like
<BR style="PAGE-BREAK-BEFORE: always" clear=all>
it doesn't do anything. I'd like all br tags to just be <br/>, regardless of any attributes in the tag prior to conversion.
What do I need to add to my regex in order to achieve this?

This regex will do what you want: <(BR|br)[^>]*>
Here is a working example: Regex101

You probably want <br\b[^>]*> to match all tags that:
Start with <br
Have a word boundary after the <br (so you wouldn't match a <brown> tag, for example)
Contain any number of non-> characters, including 0
End with a >
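The four rules above can be tried out with a short sketch. The class name BrNormalizer and the method normalize are made up for illustration, and the (?i) flag is carried over from the question so that <BR ...> matches as well.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class BrNormalizer {
    // (?i)  : case-insensitive, so <BR> and <br> both match
    // \b    : word boundary, so <brown> does not match
    // [^>]* : any attributes (or a trailing slash) up to the closing >
    private static final Pattern BR = Pattern.compile("(?i)<br\\b[^>]*>");

    static String normalize(String html) {
        return BR.matcher(html).replaceAll("<br/>");
    }

    public static void main(String[] args) {
        String input = "a<BR style=\"PAGE-BREAK-BEFORE: always\" clear=all>b<br>c<brown>d";
        System.out.println(normalize(input));
    }
}
```

One limitation of this sketch: an attribute value that itself contains a > character would end the match early, which is one reason an HTML parser is more robust for arbitrary input.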

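You have to use .* instead of * :
htmlString.replaceAll("(?i)<br .*>", "<br/>")
//-----------------------------^^
because:
* matches the preceding character or subexpression zero or more times (in the original pattern, that is only the space),
and
.* matches any character zero or more times.
Note that .* is greedy, so if more tags follow on the same line it matches up to the last > on that line, and the required space after br means a plain <br> no longer matches; [^>]* avoids both problems.
So for your case: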
String htmlString = "<BR style=\"PAGE-BREAK-BEFORE: always\" clear=all>";
System.out.println(htmlString.replaceAll("(?i)<br .*>", "<br/>"));
Output
<br/>
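The single-tag test above does not show how .* behaves when more markup follows on the same line. Here is a small sketch comparing the two patterns from this thread on such an input. The class name GreedyDemo and its method names are made up for illustration.

```java
// Hypothetical demo class comparing the two patterns discussed in this thread.
final class GreedyDemo {
    /** Uses the greedy pattern: .* runs to the last > on the line. */
    static String greedy(String html) {
        return html.replaceAll("(?i)<br .*>", "<br/>");
    }

    /** Uses a negated character class: the match stops at the first >. */
    static String negated(String html) {
        return html.replaceAll("(?i)<br\\b[^>]*>", "<br/>");
    }

    public static void main(String[] args) {
        String input = "<br class=x>one<b>two</b>";
        System.out.println(greedy(input));  // prints <br/>
        System.out.println(negated(input)); // prints <br/>one<b>two</b>
    }
}
```

With the greedy pattern, everything after the br tag on that line is swallowed, so the negated character class is the safer choice when a line can hold more than one tag.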

Using regular expressions to parse HTML is not a good idea because HTML is not regular. You should use a proper parsing library like NekoHTML.
NekoHTML is a simple HTML scanner and tag balancer that enables
application programmers to parse HTML documents and access the
information using standard XML interfaces. The parser can scan HTML
files and "fix up" many common mistakes that human (and computer)
authors make in writing HTML documents. NekoHTML adds missing parent
elements; automatically closes elements with optional end tags; and
can handle mismatched inline element tags.

Java / Android HTML custom tag parser

I'm trying to figure out a way to parse a html file with custom tags in the form:
[custom tag="id"]
Here's an example of a file I'm working with:
<p>This is an <em>amazing</em> example. </p>
<p>Such amazement, <span>many wow.</span> </p>
<p>Oh look, a wild [custom tag="amaze"] appears.</p>
We need maor embeds <a href="http://youtu.be/F5nLu232KRo"> bro
What I would like (in an ideal world) is to get back a list of elements:
List foundElements = [text, custom tag, text, link, text]
where the elements in the above list contain:
Text:
<p>This is an <em>amazing</em> example. </p>
<p>Such amazement, <span>many wow.</span> </p>
<p>Oh look, a wild [custom tag="amaze"] appears.</p>
We need maor embeds
Custom tag:
[custom tag="amaze"]
Link:
<a href="http://youtu.be/F5nLu232KRo">
Text:
appears.</p>We need maor embeds
What I've tried:
Jsoup
Jsoup is great, it works perfectly for HTML. The issue is I can't define custom tags with opening "[" and closing "]". Correct me if I'm wrong?
Jericho
Again like Jsoup, Jericho works great..except for defining custom tags. You're required to use "<".
Java Regex
This is the option I really don't want to go for. It's not reliable and there's a lot of string manipulation that is brittle, especially when you're matching against a lot of regexes.
Last but not least, I'm looking for a performance-oriented solution, as this is done on an Android client.
All suggestions welcome!
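One possible direction, sketched under assumptions: since the bracket markers are not HTML, they can be cut out with a plain indexOf scan (no regex) before the remaining pieces are handed to an HTML parser such as Jsoup. The class name CustomTagSplitter and the fixed marker format [custom tag="..."] are assumptions taken from the example above. Link extraction is left to the HTML parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of any library.
final class CustomTagSplitter {
    static final String OPEN = "[custom tag=\"";
    static final String CLOSE = "\"]";

    /** Splits the input into plain HTML pieces and whole [custom tag="..."] markers, in order. */
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int start = input.indexOf(OPEN, pos);
            int end = start < 0 ? -1 : input.indexOf(CLOSE, start + OPEN.length());
            if (end < 0) {
                // No further complete marker: the rest is plain text.
                parts.add(input.substring(pos));
                break;
            }
            if (start > pos) {
                parts.add(input.substring(pos, start));
            }
            parts.add(input.substring(start, end + CLOSE.length()));
            pos = end + CLOSE.length();
        }
        return parts;
    }

    public static void main(String[] args) {
        for (String part : split("<p>Oh look, a wild [custom tag=\"amaze\"] appears.</p>")) {
            System.out.println(part);
        }
    }
}
```

The scan is a single pass over the string with no backtracking, which fits the performance concern. Each non-marker piece can then be parsed as an HTML fragment.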

I would like to parse an html source string to find a specific tag in Java

So I have the following html source:
<form action='http://example.com' method='get'>
<P>Some example text here.</P>
<input type='text' class='is-input' id='agent_name' name='deviceName' placeholder='Device Name'>
<input type='hidden' name='p' value='firefox'>
<input type='hidden' name='email' value='example#example.com'>
<input type='hidden' name='k' value='cITBk236gyd56oiY0fhk6lpuo9nt61Va'>
<p><input type='submit' class='btn-blue' style='margin-top:15px;' value='Install'></p>
</form>
Unfortunately, this HTML source is stored as a string.
I would like to parse it using something like jsoup and obtain the following string:
<input type='hidden' name='k' value='cITBk236gyd56oiY0fhk6lpuo9nt61Va'>
or, better yet, grab only the value: cITBk236gyd56oiY0fhk6lpuo9nt61Va
The problem I'm running into is that the value cITBk236gyd56oiY0fhk6lpuo9nt61Va is constantly changing, so I cannot search for the entire HTML tag.
So I am looking for a better way to do this.
Here is what I currently have that does not seem to be working:
// I tried to use this, but Java was angry for some reason
Jsoup.parse(myString);
// so I used this instead.
org.jsoup.nodes.Document doc = Jsoup.parse(myString);
// in this case I just tried to select the entire tag.
Elements elements = doc.select("<input name=\"k\" value=\"cITBkdxJTFd56oiY0fhk6lUu8Owt61Va\" type=\"hidden\">");
// yeah, this does not seem to work. I assume it's not a string anymore but a document. Not sure if it
// would attempt to print anyway.
System.out.println(elements);
So I guess I can't use select this way, but even if it worked, I am not sure how to select just that part of the tag and put it into a new string.

You can try this way
Document doc = Jsoup.parse(myString);
Elements elements = doc.select("input[name=k]");
System.out.println(elements.attr("value"));
output:
cITBk236gyd56oiY0fhk6lpuo9nt61Va

Try this call to select to get the elements:
elements = doc.select("input[name=k][value=cITBkdxJTFd56oiY0fhk6lUu8Owt61Va]")
In this context, elements must be an Elements object. If you need to extract data from elements, you can use one of these (among others, obviously):
elements.html(); // HTML of all elements
elements.text(); // Text contents of all elements
elements.get(i).html(); // HTML of the i-th element
elements.get(i).text(); // Text contents of the i-th element
elements.get(i).attr("value"); // The contents of the "value" attribute of the i-th element
To iterate over elements, you can use any of these:
for (Element element : elements)
    element.html(); // Or whatever you want
for (int i = 0; i < elements.size(); i++)
    elements.get(i).html(); // Or whatever you want
Jsoup is an excellent library. The select method uses (lightly) modified CSS selectors for document queries. You can check the valid syntax for the method in the Jsoup javadocs.

How to escape the special char in SAX parsing

I am parsing the xml file below:
<description>
<p>
<a href="http://news.yahoo.com/jessica-chastain-talks-princess-diana-biopic- 164102608.html">
<img src="http://l3.yimg.com/bt/api/res/1.2/zD3Iwxezk8JVGQwhow7y4Q--/YXBwaWQ9eW5ld3M7Zmk9ZmlsbDtoPTg2O3E9ODU7dz0xMzA-/http://media.zenfs.com/en_us/News/Reuters/2011-11-07T171906Z_01_BTRE7A61C3Y00_RTROPTP_2_FILM-US-JESSICACHASTAIN.JPG"
alt="photo"
align="left"
title="Actress Chastain poses for photographers as she arrives on the "Wilde Salome" red carpet at the 68th Venice Film Festival" border="0" />
</a>NEW YORK (TheWrap.com) - Jessica Chastain may not win Oscar gold this year, but it appears she will wear a crown.
</p>
<br clear="all"/>
</description>
I am using a SAX parser and trying to read the title attribute of the img tag. But because of the quotes around "Wilde Salome" in the text, I am getting an ExpatParser exception.
Could you please let me know how this can be solved?

The XML is not well-formed. An attribute value delimited by double quotes must not contain unescaped double quotes ("). The program that generated it should replace the inner " characters with &quot;.
If you print &quot; to a web page, the browser will automatically show the " character in its place.
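To illustrate the point, here is a minimal sketch using the JDK's built-in SAX parser rather than Android's Expat-based one. The class name QuotedAttributeDemo and the method imgTitle are made up for illustration. The parse fails on the raw quotes and succeeds once they are written as &quot;, and the parser hands back the attribute value with the entity already decoded.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of any library.
final class QuotedAttributeDemo {
    /** Returns the title attribute of the first <img>, or null if parsing fails or no title is found. */
    static String imgTitle(String xml) {
        final String[] title = new String[1];
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String localName,
                                                 String qName, Attributes atts) {
                            if ("img".equals(qName) && title[0] == null) {
                                title[0] = atts.getValue("title");
                            }
                        }
                    });
        } catch (Exception e) {
            // Includes the SAXParseException thrown for input that is not well-formed.
            return null;
        }
        return title[0];
    }

    public static void main(String[] args) {
        String broken = "<p><img title=\"on the \"Wilde Salome\" red carpet\"/></p>";
        String fixed = "<p><img title=\"on the &quot;Wilde Salome&quot; red carpet\"/></p>";
        System.out.println(imgTitle(broken)); // prints null
        System.out.println(imgTitle(fixed));  // prints on the "Wilde Salome" red carpet
    }
}
```

The fix belongs in whatever produces the feed. If that is out of your control, the text has to be repaired before it reaches the XML parser.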

Removing HTML entities while preserving line breaks with JSoup

I have been using JSoup to parse lyrics and it has been great until now, but have run into a problem.
I can use Node.html() to return the full HTML of the desired node, which retains line breaks as such:
Glóandi augu, silfurnátt
<br />Blóð alvöru, starir á
<br />Óður hundur er í vígamóð, í maga... mér
<br />
<br />Kolniður gref, kvik sem dreg hér
<br />Kolniður svart, hvergi bjart né
But has the unfortunate side-effect, as you can see, of retaining HTML entities and tags.
However, if I use Node.text(), I can get a better looking result, free of tags and entities:
Glóandi augu, silfurnátt Blóð alvöru, starir á Óður hundur er í vígamóð, í maga... mér Kolniður gref, kvik sem dreg hér Kolniður svart,
Which has another unfortunate side-effect of removing the line breaks and compressing into a single line.
Simply replacing <br /> from the node before calling Node.text() yields the same result, and it seems that that method is compressing the text onto a single line in the method itself, ignoring newlines.
Is it possible to have the best of both worlds, with tags and entities replaced correctly while preserving the line breaks, or is there another way of decoding entities and removing tags without having to replace them manually?

(disclaimer) I haven't used this API ...
but a quick look at the docs suggests that you could visit each descendant node and dump out its text contents. Breaks could be inserted when special tags like <br> are encountered.
The TextNode.getWholeText() call also looks useful.

Based on another answer from Stack Overflow, I added a few fixes and came up with:
String text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2nl").replaceAll("\n", "br2nl")).text();
text = text.replaceAll("br2nl ", "\n").replaceAll("br2nl", "\n").trim();
Hope this helps
