Ignore whitespace while converting XML to JSON using XStream - java

I'm attempting to establish a reliable and fast way to transform XML to JSON using Java, and I've started using XStream for this. However, the test below fails because of the whitespace (including newlines) between elements; if I remove these characters, the test passes.
@Test
public void testXmlWithWhitespaceBeforeStartElementCanBeConverted() throws Exception {
String xml =
"<root>\n" +
" <foo>bar</foo>\n" + // remove the newlines and white space to make the test pass
"</root>";
String expectedJson = "{\"root\": {\n" +
" \"foo\": bar\n" +
"}}";
String actualJSON = transformXmlToJson(xml);
Assert.assertEquals(expectedJson, actualJSON);
}
private String transformXmlToJson(String xml) throws XmlPullParserException {
XmlPullParser parser = XppFactory.createDefaultParser();
HierarchicalStreamReader reader = new XppReader(new StringReader(xml), parser, new NoNameCoder());
StringWriter write = new StringWriter();
JsonWriter jsonWriter = new JsonWriter(write);
HierarchicalStreamCopier copier = new HierarchicalStreamCopier();
copier.copy(reader, jsonWriter);
jsonWriter.close();
return write.toString();
}
The test fails with the exception:
com.thoughtworks.xstream.io.json.AbstractJsonWriter$IllegalWriterStateException: Cannot turn from state SET_VALUE into state START_OBJECT for property foo
at com.thoughtworks.xstream.io.json.AbstractJsonWriter.handleCheckedStateTransition(AbstractJsonWriter.java:265)
at com.thoughtworks.xstream.io.json.AbstractJsonWriter.startNode(AbstractJsonWriter.java:227)
at com.thoughtworks.xstream.io.json.AbstractJsonWriter.startNode(AbstractJsonWriter.java:232)
at com.thoughtworks.xstream.io.copy.HierarchicalStreamCopier.copy(HierarchicalStreamCopier.java:36)
at com.thoughtworks.xstream.io.copy.HierarchicalStreamCopier.copy(HierarchicalStreamCopier.java:47)
at testConvertXmlToJSON.transformXmlToJson(testConvertXmlToJSON.java:30)
Is there a way to tell the copy process to ignore the ignorable whitespace? I cannot find any obvious way to enable this behaviour, but I think it should be there. I know I can pre-process the XML to remove the whitespace, or maybe just use another library.
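For reference, the pre-processing route mentioned above could be sketched roughly as follows. This is only a sketch using the JDK's DOM and Transformer APIs, not an XStream feature; the class name XmlWhitespace and the method stripIgnorableWhitespace are hypothetical. It loads the whole document into memory, so it is not suited to very large inputs, and it only drops whitespace-only text inside elements that also contain child elements.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: removes whitespace-only text nodes that sit between elements.
class XmlWhitespace {

    static String stripIgnorableWhitespace(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            prune(doc.getDocumentElement());

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException | TransformerException e) {
            throw new IllegalArgumentException("Could not process XML", e);
        }
    }

    // Walks the tree; whitespace-only text is removed only where the parent
    // also has element children, so values like <a> </a> are left alone.
    private static void prune(Node node) {
        boolean mixed = hasElementChild(node);
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE) {
                if (mixed && child.getNodeValue().trim().isEmpty()) {
                    node.removeChild(child);
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                prune(child);
            }
            child = next;
        }
    }

    private static boolean hasElementChild(Node node) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                return true;
            }
        }
        return false;
    }
}
```

The cleaned string could then be passed to transformXmlToJson in place of the original xml.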
update
I can work around the issue by decorating the HierarchicalStreamReader interface and suppressing the whitespace node manually, but this still does not feel ideal. It would look something like the code below, which makes the test pass.
public class IgnoreWhitespaceHierarchicalStreamReader implements HierarchicalStreamReader {
private HierarchicalStreamReader innerHierarchicalStreamReader;
public IgnoreWhitespaceHierarchicalStreamReader(HierarchicalStreamReader hierarchicalStreamReader) {
this.innerHierarchicalStreamReader = hierarchicalStreamReader;
}
public String getValue() {
String getValue = innerHierarchicalStreamReader.getValue();
System.out.printf("getValue = '%s'\n", getValue);
if(innerHierarchicalStreamReader.hasMoreChildren() && getValue.length() >0) {
if(getValue.matches("^\\s+$")) {
System.out.printf("*** White space value suppressed\n");
getValue = "";
}
}
return getValue;
}
// rest of interface ...
Any help is appreciated.

Comparing two XML documents as String objects is not a good idea. How are you going to handle the case where the XML is the same but the nodes are not in the same order?
e.g.
<xml><node1>1</node1><node2>2</node2></xml>
is similar to
<xml><node2>2</node2><node1>1</node1></xml>
but when you do a String compare it will always return false.
Instead, use a tool like XMLUnit. See the following question for more details:
Best way to compare 2 XML documents in Java

Related

When parsing JSON with java, how to getText() bounded by a maximum amount?

I am attempting to parse the output of Apache Tika Server's rmeta web service endpoint: https://cwiki.apache.org/confluence/display/TIKA/TikaServer#TikaServer-RecursiveMetadataandContent
Its payloads look like the following:
[
{"Application-Name":"Microsoft Office Word",
"Application-Version":"15.0000",
"X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.microsoft.ooxml.OOXMLParser"],
"X-TIKA:content":"this content string can be many MB large"
...
},
{"Content-Encoding":"ISO-8859-1",
"Content-Length":"8",
"Content-Type":"text/plain; charset=ISO-8859-1",
"X-TIKA:content":"again, this content string can be many MB large",
...
}
...
]
As indicated, the X-TIKA:content strings can be quite oppressively large. Enough to OOM my JVM if I load the entire string into memory.
So if I were to use JsonParser.getText() like this:
private void parseRmetaResponse(CloseableHttpResponse response) {
ObjectMapper objectMapper = new ObjectMapper();
JsonFactory jsonFactory = objectMapper.getFactory();
JsonParser jsonParser = jsonFactory.createParser(response.getEntity().getContent());
JsonToken arrayStartToken = jsonParser.nextToken();
if (arrayStartToken != JsonToken.START_ARRAY) {
throw new IllegalStateException("The first element of the Json structure was expected to be a start array token, but it was: " + arrayStartToken);
}
JsonToken nextToken = jsonParser.nextToken();
while (nextToken != JsonToken.END_ARRAY) {
parseNextField(jsonParser);
}
}
private String getTextContents(JsonParser jsonParser, OutputStream os, Metadata metadata) throws IOException {
String nextAttr = jsonParser.nextFieldName();
if ("X-TIKA:content".equals(nextAttr)) {
return jsonParser.getText();
}
// ...
}
It would be prone to OOM crashes, because I cannot load that whole string into memory without eating up all of the JVM heap.
Instead, I have a parameter maxChars, and I want to stop reading characters from X-TIKA:content once that many have been read.
How can I say "get me text, but only read up to maxChars characters, and discard any additional characters"?
I can use GSON, Fasterxml Jackson, or any other library that helps me do what I need to do here.
Instead of calling String getText(), you can call int getText(Writer writer).
Give it a custom Writer that works similarly to StringWriter, but discards any characters beyond a given threshold.
Then you would use it like this:
if ("X-TIKA:content".equals(nextAttr)) {
try (LimitedStringWriter writer = new LimitedStringWriter(maxParseChars)) {
jsonParser.getText(writer);
return writer.toString();
}
}
Writing the LimitedStringWriter class is your job to do.
Added by questioner (Nicholas DiPiazza):
Here is an implementation you could use as an example: https://github.com/ow2-proactive/scheduling/blob/master/common/common-api/src/main/java/org/ow2/proactive/utils/BoundedStringWriter.java
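A minimal sketch of such a writer, assuming that silently dropping the overflow is acceptable. The class name LimitedStringWriter comes from the answer above; the implementation itself is only one possible way to write it and is not taken from the linked project. Note that a hard cut at maxChars can split a surrogate pair.

```java
import java.io.Writer;

// Collects at most maxChars characters and silently discards the rest.
class LimitedStringWriter extends Writer {
    private final StringBuilder buffer = new StringBuilder();
    private final int maxChars;

    LimitedStringWriter(int maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must be >= 0");
        }
        this.maxChars = maxChars;
    }

    @Override
    public void write(char[] cbuf, int off, int len) {
        int remaining = maxChars - buffer.length();
        if (remaining > 0) {
            // Keep only as many characters as still fit under the limit.
            buffer.append(cbuf, off, Math.min(len, remaining));
        }
        // Characters beyond the limit are dropped without raising an error.
    }

    @Override
    public void flush() {
        // Nothing to flush: everything is kept in memory.
    }

    @Override
    public void close() {
        // Nothing to release.
    }

    @Override
    public String toString() {
        return buffer.toString();
    }
}
```

Because close() declares no checked exception, the writer can be used in a try-with-resources block as in the snippet above.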

Parsing curl response with Java

Before you write something like "why don't you use a Java HTTP client such as Apache's", I need you to know that the reason is SSL. I wish I could, as they are very convenient, but I can't.
None of the available HTTP clients support the GOST cipher suite, so I get a handshake exception every time. The ones that do support the suite don't support SNI (and they are also proprietary): I am handed the wrong certificate and get a handshake exception again.
The only solution was to configure openssl (with the gost engine) and curl, and then execute the command from Java.
Having said that, I wrote a simple snippet for executing a command and getting input stream response:
public static InputStream executeCurlCommand(String finalCurlCommand) throws IOException
{
return Runtime.getRuntime().exec(finalCurlCommand).getInputStream();
}
Additionally, I can convert the returned IS to a string like that:
public static String convertResponseToString(InputStream isToConvertToString) throws IOException
{
StringWriter writer = new StringWriter();
IOUtils.copy(isToConvertToString, writer, "UTF-8");
return writer.toString();
}
However, I can't see a pattern according to which I could get a good response or a desired response header:
Here's what I mean
After executing a command (with -i flag), there might be lots and lots of information like in the screen below:
At first, I thought that I could just split it with '\n', but the thing is that a required response's header or a response itself may not satisfy the criteria (prettified JSON or long redirect URL break the rule).
Also, the static line GOST engine already loaded is a bit annoying (but I hope I'll be able to get rid of it, and that no other unrelated output like it will appear).
I do believe that there's a pattern which I can use.
For now I can only do that:
public static String getLocationRedirectHeaderValue(String curlResponse)
{
String locationHeaderValue = curlResponse.substring(curlResponse.indexOf("Location: "));
locationHeaderValue = locationHeaderValue.substring(0, locationHeaderValue.indexOf("\n")).replace("Location: ", "");
return locationHeaderValue;
}
Which is not nice, obviously.
Thanks in advance.
Instead of reading the whole result as a single string you might want to consider reading it line by line using a scanner.
Then keep a few status variables around. The main task would be to separate header from body. In the body you might have a payload you want to treat differently (e.g. use GSON to make a JSON object).
The nice thing: Header and Body are separated by an empty line. So your code would be along these lines:
boolean inHeader = true;
StringBuilder b = new StringBuilder();
String lastLine = "";
// Technically you would need Multimap
Map<String,String> headers = new HashMap<>();
Scanner scanner = new Scanner(yourInputStream);
while (scanner.hasNextLine()) {
String line = scanner.nextLine();
if (line.length() == 0) {
inHeader = false;
} else {
if (inHeader) {
// if line starts with space it is
// continuation of previous header
treatHeader(line, lastLine);
lastLine = line; // remember the previous header line for continuations
} else {
b.append(line);
b.append(System.lineSeparator());
}
}
}
String body = b.toString();
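The outline above can be turned into a small self-contained sketch. The class name CurlOutput and its parse method are hypothetical. It assumes a single header block followed by one empty line, and it treats any line without a colon (the status line, or noise such as GOST engine already loaded) as something to skip; a noise line that happens to contain a colon would be mistaken for a header. Repeated headers overwrite each other here, which is why a multimap would be needed in general.

```java
import java.util.Map;
import java.util.Scanner;
import java.util.TreeMap;

// Hypothetical holder for the parsed output of "curl -i".
class CurlOutput {
    // Header names are case-insensitive, hence the comparator.
    final Map<String, String> headers = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    String body = "";

    static CurlOutput parse(String raw) {
        CurlOutput result = new CurlOutput();
        StringBuilder body = new StringBuilder();
        boolean inHeader = true;
        String lastName = null;

        try (Scanner scanner = new Scanner(raw)) {
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine();
                if (!inHeader) {
                    body.append(line).append('\n');
                } else if (line.isEmpty()) {
                    // The first empty line separates headers from the body.
                    inHeader = false;
                } else if ((line.startsWith(" ") || line.startsWith("\t")) && lastName != null) {
                    // A line starting with whitespace continues the previous header.
                    result.headers.merge(lastName, line.trim(), (a, b) -> a + " " + b);
                } else {
                    int colon = line.indexOf(':');
                    if (colon > 0) {
                        lastName = line.substring(0, colon).trim();
                        result.headers.put(lastName, line.substring(colon + 1).trim());
                    }
                    // Lines without a colon (status line, engine noise) are ignored.
                }
            }
        }
        result.body = body.toString();
        return result;
    }
}
```

With this, the redirect target would be read as CurlOutput.parse(curlResponse).headers.get("Location"), and the body can be handed to a JSON library separately.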

Handling errors in ANTLR4

The default behavior when the parser doesn't know what to do is to print messages to the terminal like:
line 1:23 missing DECIMAL at '}'
This is a good message, but in the wrong place. I'd rather receive this as an exception.
I've tried using the BailErrorStrategy, but this throws a ParseCancellationException without a message (caused by an InputMismatchException, also without a message).
Is there a way I can get it to report errors via exceptions while retaining the useful info in the message?
Here's what I'm really after--I typically use actions in rules to build up an object:
dataspec returns [DataExtractor extractor]
@init {
DataExtractorBuilder builder = new DataExtractorBuilder(layout);
}
@after {
$extractor = builder.create();
}
: first=expr { builder.addAll($first.values); } (COMMA next=expr { builder.addAll($next.values); })* EOF
;
expr returns [List<ValueExtractor> values]
: a=atom { $values = Arrays.asList($a.val); }
| fields=fieldrange { $values = values($fields.fields); }
| '%' { $values = null; }
| ASTERISK { $values = values(layout); }
;
Then when I invoke the parser I do something like this:
public static DataExtractor create(String dataspec) {
CharStream stream = new ANTLRInputStream(dataspec);
DataSpecificationLexer lexer = new DataSpecificationLexer(stream);
CommonTokenStream tokens = new CommonTokenStream(lexer);
DataSpecificationParser parser = new DataSpecificationParser(tokens);
return parser.dataspec().extractor;
}
All I really want is
for the dataspec() call to throw an exception (ideally a checked one) when the input can't be parsed
for that exception to have a useful message and provide access to the line number and position where the problem was found
Then I'll let that exception bubble up the call stack to wherever is best suited to present a useful message to the user, the same way I'd handle a dropped network connection, reading a corrupt file, etc.
I did see that actions are now considered "advanced" in ANTLR4, so maybe I'm going about things in a strange way, but I haven't looked into what the "non-advanced" way to do this would be since this way has been working well for our needs.
Since I've had a little bit of a struggle with the two existing answers, I'd like to share the solution I ended up with.
First of all I created my own version of an ErrorListener like Sam Harwell suggested:
public class ThrowingErrorListener extends BaseErrorListener {
public static final ThrowingErrorListener INSTANCE = new ThrowingErrorListener();
@Override
public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine, String msg, RecognitionException e)
throws ParseCancellationException {
throw new ParseCancellationException("line " + line + ":" + charPositionInLine + " " + msg);
}
}
Note the use of a ParseCancellationException instead of a RecognitionException since the DefaultErrorStrategy would catch the latter and it would never reach your own code.
Creating a whole new ErrorStrategy like Brad Mace suggested is not necessary since the DefaultErrorStrategy produces pretty good error messages by default.
I then use the custom ErrorListener in my parsing function:
public static String parse(String text) throws ParseCancellationException {
MyLexer lexer = new MyLexer(new ANTLRInputStream(text));
lexer.removeErrorListeners();
lexer.addErrorListener(ThrowingErrorListener.INSTANCE);
CommonTokenStream tokens = new CommonTokenStream(lexer);
MyParser parser = new MyParser(tokens);
parser.removeErrorListeners();
parser.addErrorListener(ThrowingErrorListener.INSTANCE);
ParserRuleContext tree = parser.expr();
MyParseRules extractor = new MyParseRules();
return extractor.visit(tree);
}
(For more information on what MyParseRules does, see here.)
This will give you the same error messages as would be printed to the console by default, only in the form of proper exceptions.
When you use the DefaultErrorStrategy or the BailErrorStrategy, the ParserRuleContext.exception field is set for any parse tree node in the resulting parse tree where an error occurred. The documentation for this field reads (for people that don't want to click an extra link):
The exception which forced this rule to return. If the rule successfully completed, this is null.
Edit: If you use DefaultErrorStrategy, the parse context exception will not be propagated all the way out to the calling code, so you'll be able to examine the exception field directly. If you use BailErrorStrategy, the ParseCancellationException thrown by it will include a RecognitionException if you call getCause().
if (pce.getCause() instanceof RecognitionException) {
RecognitionException re = (RecognitionException)pce.getCause();
ParserRuleContext context = (ParserRuleContext)re.getCtx();
}
Edit 2: Based on your other answer, it appears that you don't actually want an exception, but what you want is a different way to report the errors. In that case, you'll be more interested in the ANTLRErrorListener interface. You want to call parser.removeErrorListeners() to remove the default listener that writes to the console, and then call parser.addErrorListener(listener) for your own special listener. I often use the following listener as a starting point, as it includes the name of the source file with the messages.
public class DescriptiveErrorListener extends BaseErrorListener {
public static DescriptiveErrorListener INSTANCE = new DescriptiveErrorListener();
@Override
public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol,
int line, int charPositionInLine,
String msg, RecognitionException e)
{
if (!REPORT_SYNTAX_ERRORS) {
return;
}
String sourceName = recognizer.getInputStream().getSourceName();
if (!sourceName.isEmpty()) {
sourceName = String.format("%s:%d:%d: ", sourceName, line, charPositionInLine);
}
System.err.println(sourceName+"line "+line+":"+charPositionInLine+" "+msg);
}
}
With this class available, you can use the following to use it.
lexer.removeErrorListeners();
lexer.addErrorListener(DescriptiveErrorListener.INSTANCE);
parser.removeErrorListeners();
parser.addErrorListener(DescriptiveErrorListener.INSTANCE);
A much more complicated example of an error listener that I use to identify ambiguities which render a grammar non-SLL is the SummarizingDiagnosticErrorListener class in TestPerformance.
What I've come up with so far is based on extending DefaultErrorStrategy and overriding its reportXXX methods (though it's entirely possible I'm making things more complicated than necessary):
public class ExceptionErrorStrategy extends DefaultErrorStrategy {
@Override
public void recover(Parser recognizer, RecognitionException e) {
throw e;
}
@Override
public void reportInputMismatch(Parser recognizer, InputMismatchException e) throws RecognitionException {
String msg = "mismatched input " + getTokenErrorDisplay(e.getOffendingToken());
msg += " expecting one of "+e.getExpectedTokens().toString(recognizer.getTokenNames());
RecognitionException ex = new RecognitionException(msg, recognizer, recognizer.getInputStream(), recognizer.getContext());
ex.initCause(e);
throw ex;
}
@Override
public void reportMissingToken(Parser recognizer) {
beginErrorCondition(recognizer);
Token t = recognizer.getCurrentToken();
IntervalSet expecting = getExpectedTokens(recognizer);
String msg = "missing "+expecting.toString(recognizer.getTokenNames()) + " at " + getTokenErrorDisplay(t);
throw new RecognitionException(msg, recognizer, recognizer.getInputStream(), recognizer.getContext());
}
}
This throws exceptions with useful messages, and the line and position of the problem can be gotten from either the offending token, or if that's not set, from the current token by using ((Parser) re.getRecognizer()).getCurrentToken() on the RecognitionException.
I'm fairly happy with how this is working, though having six reportX methods to override makes me think there's a better way.
For anyone interested, here's the ANTLR4 C# equivalent of Sam Harwell's answer:
using System; using System.IO; using Antlr4.Runtime;
public class DescriptiveErrorListener : BaseErrorListener, IAntlrErrorListener<int>
{
public static DescriptiveErrorListener Instance { get; } = new DescriptiveErrorListener();
public void SyntaxError(TextWriter output, IRecognizer recognizer, int offendingSymbol, int line, int charPositionInLine, string msg, RecognitionException e) {
if (!REPORT_SYNTAX_ERRORS) return;
string sourceName = recognizer.InputStream.SourceName;
// never ""; might be "<unknown>" == IntStreamConstants.UnknownSourceName
sourceName = $"{sourceName}:{line}:{charPositionInLine}";
Console.Error.WriteLine($"{sourceName}: line {line}:{charPositionInLine} {msg}");
}
public override void SyntaxError(TextWriter output, IRecognizer recognizer, Token offendingSymbol, int line, int charPositionInLine, string msg, RecognitionException e) {
this.SyntaxError(output, recognizer, 0, line, charPositionInLine, msg, e);
}
static readonly bool REPORT_SYNTAX_ERRORS = true;
}
lexer.RemoveErrorListeners();
lexer.AddErrorListener(DescriptiveErrorListener.Instance);
parser.RemoveErrorListeners();
parser.AddErrorListener(DescriptiveErrorListener.Instance);
For people who use Python, here is the solution in Python 3 based on Mouagip's answer.
First, define a custom error listener:
from antlr4.error.ErrorListener import ErrorListener
from antlr4.error.Errors import ParseCancellationException
class ThrowingErrorListener(ErrorListener):
def syntaxError(self, recognizer, offendingSymbol, line, column, msg, e):
ex = ParseCancellationException(f'line {line}: {column} {msg}')
ex.line = line
ex.column = column
raise ex
Then set this to lexer and parser:
lexer = MyScriptLexer(script)
lexer.removeErrorListeners()
lexer.addErrorListener(ThrowingErrorListener())
token_stream = CommonTokenStream(lexer)
parser = MyScriptParser(token_stream)
parser.removeErrorListeners()
parser.addErrorListener(ThrowingErrorListener())
tree = parser.script()

Ideal Java library for cleaning html, and escaping malformed fragments

I've got some HTML files that need to be parsed and cleaned, and they occasionally have content with special characters like <, >, ", etc. which have not been properly escaped.
I have tried running the files through jTidy, but the best I can get it to do is just omit the content it sees as malformed html. Is there a different library that will just escape the malformed fragments instead of omitting them? If not, any recommendations on what library would be easiest to modify?
Clarification:
Sample input: <p> blah blah <M+1> blah </p>
Desired output: <p> blah blah &lt;M+1&gt; blah </p>
You can also try TagSoup. TagSoup emits regular old SAX events so in the end you get what looks like a well-formed XML document.
I have had very good luck with TagSoup and I'm always surprised at how well it handles poorly constructed HTML files.
Ultimately I solved this by running a regular expression first and an unmodified TagSoup second.
Here is my regular expression code to escape unknown tags like <M+1>
private static String escapeUnknownTags(String input) {
Scanner scan = new Scanner(input);
StringBuilder builder = new StringBuilder();
while (scan.hasNext()) {
String s = scan.findWithinHorizon("[^<]*</?[^<>]*>?", 1000000);
if (s == null) {
builder.append(escape(scan.next(".*")));
} else {
processMatch(s, builder);
}
}
return builder.toString();
}
private static void processMatch(String s, StringBuilder builder) {
if (!isKnown(s)) {
String escaped = escape(s);
builder.append(escaped);
}
else {
builder.append(s);
}
}
private static String escape(String s) {
s = s.replaceAll("<", "&lt;");
s = s.replaceAll(">", "&gt;");
return s;
}
private static boolean isKnown(String s) {
Scanner scan = new Scanner(s);
if (scan.findWithinHorizon("[^<]*</?([^<> ]*)[^<>]*>?", 10000) == null) {
return false;
}
MatchResult mr = scan.match();
try {
String tag = mr.group(1).toLowerCase();
if (HTML.getTag(tag) != null) {
return true;
}
}
catch (Exception e) {
// Should never happen
e.printStackTrace();
}
return false;
}
HTML cleaner
HtmlCleaner is open-source HTML parser written in Java. HTML found on
Web is usually dirty, ill-formed and unsuitable for further
processing. For any serious consumption of such documents, it is
necessary to first clean up the mess and bring the order to tags,
attributes and ordinary text. For the given HTML document, HtmlCleaner
reorders individual elements and produces well-formed XML. By default,
it follows similar rules that the most of web browsers use in order to
create Document Object Model. However, user may provide custom tag and
rule set for tag filtering and balancing.
Ok, I suspect it is this. Use the following code, it will help.
javax.swing.text.html.HTML

How can I force a SAX parser to use a DTD if one is not specified in the input file?

How can I force a SAX parser (specifically, Xerces in Java) to use a DTD when parsing a document without having any doctype in the input document? Is this even possible?
Here are some more details of my scenario:
We have a bunch of XML documents that conform to the same DTD that are generated by multiple different systems (none of which I can change). Some of these systems add a doctype to their output documents, others do not. Some use named character entities, some do not. Some use named character entities without declaring a doctype. I know that's not kosher, but it's what I have to work with.
I'm working on a system that needs to parse these files in Java. Currently, it's handling the above cases by first reading in the XML document as a stream, attempting to detect if it has a doctype defined, and adding a doctype declaration if one isn't already present. The problem is that this code is buggy, and I'd like to replace it with something cleaner.
The files are large, so I can't use a DOM-based solution. I'm also trying to get character entities resolved, so it doesn't help to use an XML Schema.
If you have a solution, could you please post it directly instead of linking to it? It doesn't do Stack Overflow much good if in the future there's a correct solution with a dead link.
I don't think there is a sane way to set a DOCTYPE if the document doesn't have one. A possible solution is to write a fake one, as you already do. If you're using SAX, you can use the fake InputStream and DefaultHandler implementation below. (It will only work for Latin-1, single-byte encodings.)
I know this solution is also ugly, but it is the only one that works well with big data streams.
Here is some code.
private enum State {readXmlDec, readXmlDecEnd, writeFakeDoctipe, writeEnd};
private class MyInputStream extends InputStream{
private final InputStream is;
private StringBuilder sb = new StringBuilder();
private int pos = 0;
private String doctype = "<!DOCTYPE register SYSTEM \"fake.dtd\">";
private State state = State.readXmlDec;
private MyInputStream(InputStream source) {
is = source;
}
@Override
public int read() throws IOException {
int bit;
switch (state){
case readXmlDec:
bit = is.read();
sb.append(Character.toChars(bit));
if(sb.toString().equals("<?xml")){
state = State.readXmlDecEnd;
}
break;
case readXmlDecEnd:
bit = is.read();
if(Character.toChars(bit)[0] == '>'){
state = State.writeFakeDoctipe;
}
break;
case writeFakeDoctipe:
bit = doctype.charAt(pos++);
if(doctype.length() == pos){
state = State.writeEnd;
}
break;
default:
bit = is.read();
break;
}
return bit;
}
@Override
public void close() throws IOException {
super.close();
is.close();
}
}
private static class MyHandler extends DefaultHandler {
@Override
public InputSource resolveEntity(String publicId, String systemId) throws IOException, SAXException {
System.out.println("resolve "+ systemId);
// get real dtd
InputStream is = ClassLoader.class.getResourceAsStream("/register.dtd");
return new InputSource(is);
}
... // rest of code
}
