How to decode XHTML and/or HTML5 entities in Java?

I have some strings that contain XHTML character entities:
"They're quite varied"
"Sometimes the string ∈ XML standard, sometimes ∈ HTML4 standard"
"Therefore -> I need an XHTML entity decoder."
"Sadly, some strings are not valid XML & are not-quite-so-valid HTML <- but I want them to work, too."
Is there any easy way to decode the entities? (I'm using Java)
I'm currently using StringEscapeUtils.unescapeHtml4(myString.replace("&apos;", "\'")) as a temporary hack. Sadly, org.apache.commons.lang3.StringEscapeUtils has unescapeHtml4 and unescapeXml, but no unescapeXhtml.
EDIT: I do want to handle invalid XML, for example I want "&amp;&xyzzy;" to decode to "&&xyzzy;"
EDIT: I think HTML5 has almost the same character entities as XHTML, so I think HTML 5 decoder would be fine too.

This may not be directly relevant but you may wish to adopt JSoup which handles things like that albeit from a higher level. Includes web page cleaning routines.

Have you tried implementing an XHTMLStringEscapeUtils based on the facilities provided by org.apache.commons.text.StringEscapeUtils?
import org.apache.commons.text.StringEscapeUtils;
import org.apache.commons.text.translate.*;

public class XHTMLStringEscapeUtils {

    // Escapes the HTML 4 entity set, then applies the XML 1.1 escaping rules.
    public static final CharSequenceTranslator ESCAPE_XHTML =
        new AggregateTranslator(
            new LookupTranslator(EntityArrays.BASIC_ESCAPE),
            new LookupTranslator(EntityArrays.ISO8859_1_ESCAPE),
            new LookupTranslator(EntityArrays.HTML40_EXTENDED_ESCAPE)
        ).with(StringEscapeUtils.ESCAPE_XML11);

    // Unescapes the HTML 4 entity set, numeric entities, and &apos;.
    public static final CharSequenceTranslator UNESCAPE_XHTML =
        new AggregateTranslator(
            new LookupTranslator(EntityArrays.BASIC_UNESCAPE),
            new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
            new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
            new NumericEntityUnescaper(),
            new LookupTranslator(EntityArrays.APOS_UNESCAPE)
        );

    public static final String escape(final String input) {
        return ESCAPE_XHTML.translate(input);
    }

    public static final String unescape(final String input) {
        return UNESCAPE_XHTML.translate(input);
    }
}
Thanks to the modular design of Apache commons-text lib, it's easy to create custom escape utils.
You can find a full project with tests here xhtml-string-escape-utils
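If adding a dependency is not an option, the lenient behaviour the question asks for (decode what is recognised, pass everything else through) can also be sketched with the standard library alone. The class name LenientEntityDecoder and its tiny entity table are made up for illustration; a real decoder would need the full HTML5 entity list, so treat this as a starting point rather than a complete solution.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library: decodes a few named entities
// plus numeric references and leaves everything it does not recognise untouched.
final class LenientEntityDecoder {

    // Illustrative subset only; the real HTML5 table is far larger.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "apos", "'",
            "isin", "\u2208", "rarr", "\u2192", "larr", "\u2190");

    // Matches &name;, &#123; and &#x1F; style references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    static String decode(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = resolve(m.group(1));
            // Unknown or invalid references are copied through unchanged.
            m.appendReplacement(out, Matcher.quoteReplacement(
                    replacement != null ? replacement : m.group()));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text, or null if the reference is not recognised.
    private static String resolve(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body);
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        try {
            int codePoint = hex
                    ? Integer.parseInt(body.substring(2), 16)
                    : Integer.parseInt(body.substring(1));
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint)) : null;
        } catch (NumberFormatException e) {
            return null; // value too large to be a code point
        }
    }

    public static void main(String[] args) {
        // prints: They're 'quoted' &&xyzzy;
        System.out.println(decode("They&apos;re &#39;quoted&#39; &amp;&xyzzy;"));
    }
}
```

Because each reference is matched and replaced in a single pass, the "&" produced from "&amp;" is never re-examined, which is why "&amp;&xyzzy;" ends up as "&&xyzzy;".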

Related

Java, Stanford NLP : Extract specific speech labels from parser

I recently discovered the Stanford NLP parser and it seems quite amazing. I currently have a working instance of it running in our project, but I am facing the two problems below.
How can I parse text and then extract only specific part-of-speech tags from the parsed data? For example, how can I extract only the NNPS and PRP words from a sentence?
Our platform works in both English and German, so the text may be in either language. How can I accommodate this scenario? Thank you.
Code :
private final String PCG_MODEL = "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz";
private final TokenizerFactory<CoreLabel> tokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "invertible=true");
public void testParser() {
LexicalizedParser lp = LexicalizedParser.loadModel(PCG_MODEL);
String sent="Complete Howto guide to install EC2 Linux server in Amazon Web services cloud.";
Tree parse;
parse = lp.parse(sent);
List taggedWords = parse.taggedYield();
System.out.println(taggedWords);
}
The above example works, but as you can see I am loading the English data. Thank you.
Try this:
for (Tree subTree : parse) { // traverse every node of the sentence's parse tree
    if (subTree.label().value().equals("NNPS")) { // the node's label is NNPS
        // do what you want with subTree here
    }
}
For query 1, I don't think Stanford NLP has an option to extract only specific POS tags.
However, the same can be achieved with custom-trained models. I tried something similar for NER (named entity recognition) with custom models.

JAVA How to retrieve text content from custom JLabel without the HTML taggings?

How do I retrieve the text from a JLabel without the HTML tags?
E.g.
CustomJLabel:
public class CustomJLabel extends JLabel {
    private String text;

    public CustomJLabel(String text) {
        super("<html><div style='text-align: center;'>" + text + "</div></html>");
        this.text = text;
    }
}
Main method:
CustomJLabel testCustomLbl = new CustomJLabel("Testing");
System.out.println(testCustomLbl.getText());
Output I got:
<html><div style='text-align: center;'>Testing</div></html>
Desired output:
Testing
There are three options:
You pick your favorite HTML parser and parse the HTML; see here for some inspiration. This is by far the most robust and straightforward solution, but of course it is costly.
If you know the exact HTML content that goes into your labels, you could turn to regular expressions or other means of string parsing. The problem is that if you don't control those strings, coming up with your own custom "parsing" is hard, because any change to the HTML that goes in might break your little parser.
You rework your whole design: if having HTML text is such a core thing in your application, you might consider really "representing" that in your class. For example, create your own version of JLabel that takes some HtmlString input and simply remembers which parts are HTML and which are "pure text".
And whoops; the code you are showing is already suited for option 3. So if you want that getText() returns that original text, you could add a simple
@Override
public String getText() {
    return this.text;
}
to your CustomJLabel class.
Edit: alternatively, you could simply add a new method like
public String getTextWithoutHtmlTags()
or something similar, since overriding the inherited method changes the "contract" of that method, which (depending on the context) might be OK, or not so OK.
There's no need for complex code or 3rd party JARS / Libraries.
Here's a simple solution using RegEx:
String htmlStr = "<html><h1>Heading</h1> ...... </html>";
String noHtmlStr = htmlStr.replaceAll("\\<.*?\\>", "");
Works great for me.
Hope this helps.

How to extract one boolean field from XML?

I have a model which is in XML format as shown below and I need to parse the XML and check whether my XML has internal-flag flag set as true or not. In my other models, it might be possible, that internal-flag flag is set as false. And sometimes, it is also possible that this field won't be there so by default it will be false from my code.
<?xml version="1.0"?>
<ClientMetadata
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.google.com client.xsd"
xmlns="http://www.google.com">
<client id="200" version="13">
<name>hello world</name>
<description>hello hello</description>
<organization>TESTER</organization>
<author>david</author>
<internal-flag>true</internal-flag>
<clock>
<clock>
<for>
<init>val(tmp1) = 1</init>
<clock>
<eval><![CDATA[result("," + $convert(val(tmp1)))]]></eval>
</clock>
</for>
<for>
<incr>val(tmp1) -= 1</incr>
<clock>
<eval><![CDATA[result("," + $convert(val(tmp1)))]]></eval>
</clock>
</for>
</clock>
</clock>
</client>
</ClientMetadata>
I have a POJO in which I am storing my above model -
public class ModelMetadata {
private int modelId;
private String modelValue; // this string will have my above XML data as string
// setters and getters here
}
Now what is the best way to determine whether my model has internal-flag set as true or not?
// this list will have all my Models stored
List<ModelMetadata> metadata = getModelMetadata();
for (ModelMetadata model : metadata) {
// my model will be stored in below variable in XML format
String modelValue = model.getModelValue();
// now parse modelValue variable and extract `internal-flag` field property
}
Do I need to use XML parsing for this or is there any better way to do this?
Update:-
I have started using Stax and this is what I have tried so far but not sure how can I extract that field -
InputStream is = new ByteArrayInputStream(modelValue.getBytes());
XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(is);
while(r.hasNext()) {
// now what should I do here?
}
There is an easy solution using XMLBeam (Disclosure: I'm affiliated with that project), just a few lines:
import java.io.IOException;
import org.xmlbeam.XBProjector;
import org.xmlbeam.annotation.XBRead;

public class ReadBoolean {

    public interface ClientMetaData {
        // XPath to the flag; xbdefaultns refers to the document's default namespace
        @XBRead("//xbdefaultns:internal-flag")
        boolean hasFlag();
    }

    public static void main(String[] args) throws IOException {
        ClientMetaData clientMetaData = new XBProjector().io().url("res://xmlWithBoolean.xml").read(ClientMetaData.class);
        System.out.println("Has flag:" + clientMetaData.hasFlag());
    }
}
This program prints out
Has flag:true
for your XML.
You could also do some simple string parsing, but this will only work for small cases with proper XML and if there's only a single <internal-flag> element.
This is a simple solution to your problem without using any XML parsing utilities. Other solutions may be more robust or powerful.
Find the index of the string literal <internal-flag>. If it doesn't exist, return false.
Go forward "<internal-flag>".length (15) characters. Read up to the next </internal-flag>, which should be the string true or false.
Take that string, use Boolean.parseBoolean(String) to get a boolean value.
If you want me to help you out with the code just drop a comment!
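The three steps above can be sketched as follows. The class and method names are made up for illustration, and the same caveats apply: this is plain string searching, so it assumes well-formed input with at most one <internal-flag> element and no attributes on it.

```java
// Hypothetical helper following the three steps above; no XML parser involved.
final class InternalFlagScanner {

    private static final String OPEN = "<internal-flag>";
    private static final String CLOSE = "</internal-flag>";

    static boolean hasInternalFlag(String xml) {
        // Step 1: find the opening tag; a missing element means false.
        int start = xml.indexOf(OPEN);
        if (start < 0) {
            return false;
        }
        // Step 2: skip past the opening tag and read up to the closing tag.
        int valueStart = start + OPEN.length();
        int end = xml.indexOf(CLOSE, valueStart);
        if (end < 0) {
            return false; // unterminated element, treat as not set
        }
        // Step 3: anything other than "true" (ignoring case) parses as false.
        return Boolean.parseBoolean(xml.substring(valueStart, end).trim());
    }

    public static void main(String[] args) {
        String xml = "<client><internal-flag>true</internal-flag></client>";
        System.out.println(hasInternalFlag(xml)); // prints: true
    }
}
```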
If you are willing to consider adding Groovy to your mix (e.g. see the book Making Java Groovy), then using Groovy's XmlParser and associated classes will make this simple.
If you need to stick to Java, let me put in a shameless plug for my Xen library, which mimics a lot of the "Groovy way". The answer to your question would be:
Xen doc = new XenParser().parseText(YOUR_XML_STRING);
String internalFlag = doc.getText(".client.internal-flag");
boolean isSet = "true".equals(internalFlag);
If the XML comes from a File, Stream, or URI, that can be handled too.
Caveat emptor, (even though it is free) this is a fairly new library, written solely by a random person (me), and not thoroughly tested on all the crazy XML out there. If anybody knows of a similar, more "mainstream" library I'd be very interested in hearing about it.
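For completeness, the StAX loop from the question's update can also be finished using only the JDK. This is a minimal sketch rather than a definitive implementation: the class name InternalFlagStax is made up, the element is matched by local name only (so the default namespace is ignored), and a missing element is treated as false, as the question requires.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams through the XML and stops at the first <internal-flag>.
final class InternalFlagStax {

    static boolean hasInternalFlag(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (r.hasNext()) {
                    // next() advances the cursor and returns the event type
                    if (r.next() == XMLStreamConstants.START_ELEMENT
                            && "internal-flag".equals(r.getLocalName())) {
                        // getElementText() reads the text content of a text-only element
                        return Boolean.parseBoolean(r.getElementText().trim());
                    }
                }
                return false; // element absent: default to false
            } finally {
                r.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed model XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<ClientMetadata xmlns=\"http://www.google.com\">"
                + "<client id=\"200\"><internal-flag>true</internal-flag></client>"
                + "</ClientMetadata>";
        System.out.println(hasInternalFlag(xml)); // prints: true
    }
}
```

Reading from a StringReader instead of modelValue.getBytes() avoids depending on the platform default charset.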

Java REGEX XML parse/cut-down while maintaining structure HowTo

I am writing a RESTful web service in Java.
The idea is to "cut down" an XML document and strip away all the unneeded content (~98%) and leave only the tags we're interested in, while maintaining the document's structure, which is as follows (I cannot provide the actual XML content for confidentiality reasons):
<sear:SEGMENTS xmlns="http://www.exlibrisgroup.com/xsd/primo/primo_nm_bib" xmlns:sear="http://www.exlibrisgroup.com/xsd/jaguar/search">
<sear:JAGROOT>
<sear:RESULT>
<sear:DOCSET IS_LOCAL="true" TOTAL_TIME="176" LASTHIT="9" FIRSTHIT="0" TOTALHITS="262" HIT_TIME="11">
<sear:DOC SEARCH_ENGINE_TYPE="Local Search Engine" SEARCH_ENGINE="Local Search Engine" NO="1" RANK="0.086826384" ID="2347460">
[
<PrimoNMBib>
<record>
<display>
<title></title>
</display>
<sort>
<author></author>
</sort>
</record>
</PrimoNMBib>
]
</sear:DOC>
</sear:DOCSET>
</sear:RESULT>
</sear:JAGROOT>
</sear:SEGMENTS>
Of course, this is the structure of only the tags we are interested in - there are hundreds more tags, but they are irrelevant.
The square brackets ([]) are not part of the XML; they indicate that <PrimoNMBib></PrimoNMBib> is one element in a list of children and occurs more than once: once per match of the search from the RESTful service.
I've been trying to parse the document with regular expressions, as to leave only the segments of the structure as shown above along with the values of <title> and <author> while removing everything else in-between the tags including other tags, however I can't get it to work for the life of me...
Previously I tried it using XSLT, however for unresolved reasons that didn't work either... I'd already asked a question for the XSLT implementation...
Anyway, I would very much appreciate a tip/hint/solution as how to solve this problem using regex and Java...
I wouldn't recommend using regex to manipulate XML.
Alternative Approach
You could use a StAX parser that leverages a StreamFilter to cut down the document and still maintain a valid structure.
How a StreamFilter Works
A StreamFilter receives each event from the XMLStreamReader. If you want the event reported, return true; otherwise return false. In the example below the StreamFilter will reject anything in the "http://www.exlibrisgroup.com/xsd/jaguar/search" namespace. You will need to tweak the logic to get it to match the requirements of your use case.
http://docs.oracle.com/javase/6/docs/api/javax/xml/stream/StreamFilter.html
Demo
package forum10351473;

import java.io.FileReader;
import javax.xml.stream.*;

public class Demo {

    public static void main(String[] args) throws Exception {
        XMLInputFactory xif = XMLInputFactory.newFactory();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("src/forum10351473/input.xml"));
        xsr = xif.createFilteredReader(xsr, new StreamFilter() {

            // Remembers the decision made at the last start/end element so that
            // the events in between (text, comments) get the same treatment.
            private boolean reportContent = false;

            @Override
            public boolean accept(XMLStreamReader reader) {
                if (reader.isStartElement() || reader.isEndElement()) {
                    reportContent = !"http://www.exlibrisgroup.com/xsd/jaguar/search".equals(reader.getNamespaceURI());
                }
                return reportContent;
            }
        });

        // The XMLStreamReader (xsr) will now only report the events you care about.
        // You can process the XMLStreamReader yourself or pass as input to something
        // like JAXB.
        while (xsr.hasNext()) {
            if (xsr.isStartElement()) {
                System.out.println(xsr.getLocalName());
            }
            xsr.next();
        }
    }
}
Output
PrimoNMBib
record
display
title
sort
author

HTML entity decoding in Java: apostrophe

I have to decode, using Java, HTML strings which contain the following entities: "&#39" and "&apos".
I'm using Apache Commons Lang, but it doesn't decode those two entities, so, I'm currently doing as follows, but I'm looking for the fastest way to do what I want.
import org.apache.commons.lang.StringEscapeUtils;
public class StringUtil {
public static String decodeHTMLString(String s) {
return StringEscapeUtils.unescapeHtml((s.replace("&#39;", "`").replace("&apos;", "'")));
}
}
I searched for older questions, but none seems to answer my question.
Well, I would imagine that part of the problem is that one of your entities is double encoded: "&amp;#39;". That will not be turned into an apostrophe by any decoder.
As for "&apos;", apparently that one is not technically part of the HTML entity set.
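To illustrate what double encoding means here: if the ampersand of a numeric reference has itself been escaped, a single decoding pass only recovers the literal text of the reference, and a second pass is needed to get the apostrophe. The sketch below uses a made-up, deliberately tiny decodeOnce helper (it knows just two references) purely to show the effect; it is not a replacement for a real decoder.

```java
// Hypothetical demo class; decodeOnce handles only the two references needed here.
final class DoubleEncodingDemo {

    static String decodeOnce(String s) {
        // Decode &#39; before &amp; so that an ampersand produced in this pass
        // is not decoded a second time within the same pass.
        return s.replace("&#39;", "'").replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String doubleEncoded = "It&amp;#39;s";
        String once = decodeOnce(doubleEncoded);
        String twice = decodeOnce(once);
        System.out.println(once);  // prints: It&#39;s
        System.out.println(twice); // prints: It's
    }
}
```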
