Parsing Invalid XML Characters using XStream parser - Java [duplicate] - java

This question already has answers here:
How to parse invalid (bad / not well-formed) XML?
(4 answers)
Closed 5 years ago.
I have a classic XML validation question.
I need to parse incoming XML from other applications that don't use a proper XML formatter, so the files contain
broken tags and XML special characters embedded in the data (not wrapped in CDATA sections).
I am using the simple XStream parser to unmarshal the incoming stream, since it does simple serialization and is not a strict parser. For special characters it throws a ConverterException and won't parse the file.
I want to know whether there is any other parser that can be used to parse invalid XML files (special characters etc.).
We have no control over what is sent as the input stream and, as part of an auditing application, need to read as many good records from the incoming file as possible.
Is there a better parsing option available, or do I need to write a custom parser to parse these files?
I am using Spring Batch to do the batch processing and XStream (1.x) to parse the XML files.
As XSD validation is failing, I am wondering whether it is even worth exploring other parsers or a custom parser.
Looking for your expert opinions on XML validation.

I understand that you are trying to make the best of messy input. Unfortunately, since there doesn't seem to be a clear specification of the format of that input, you are on your own. One approach is to first convert the input files to valid XML, which is essentially what you would be doing by writing your own parser. In Java you could do this by reading and parsing the files with your own specialized code and exposing the result through a standard Java XML interface (SAX, DOM, etc.). Depending on your background, though, it may be faster to use a different language that specializes in text processing.
My experience is that the only real long-term solution is to make the data suppliers provide valid XML. Although you can do your best to turn invalid data into valid data, there is always the risk that your interpretation is wrong, and half-valid data is often worse than no data at all. IMHO it is best to leave the responsibility for correct data with the suppliers.
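As a rough illustration of the "convert to valid XML first" idea, here is a minimal sketch. It assumes the damage is limited to characters that XML 1.0 forbids and to unescaped ampersands; it does not attempt to repair broken tags. The class and method names (XmlSanitizer, sanitize) are made up for this example and are not part of XStream or Spring Batch.

```java
import java.util.regex.Pattern;

final class XmlSanitizer {
    // Anything outside the XML 1.0 "Char" production (control characters, lone surrogates, ...).
    private static final Pattern ILLEGAL_CHARS = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // An '&' that does not start something shaped like an entity or character reference.
    private static final Pattern BARE_AMPERSAND = Pattern.compile(
            "&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    /** Removes forbidden characters, then escapes stray ampersands. */
    static String sanitize(String raw) {
        String withoutIllegal = ILLEGAL_CHARS.matcher(raw).replaceAll("");
        return BARE_AMPERSAND.matcher(withoutIllegal).replaceAll("&amp;");
    }
}
```

The cleaned string could then be handed to the existing unmarshalling step. Note the guesswork involved: text such as "AT&T;" looks like an entity reference and is left alone, which is exactly the kind of interpretation risk described above.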

Related

CSV to XML using JAXB?

Background
I have a situation where I can get data either in the form of an XML file or as Excel/CSV files. When the data comes in a non-XML format, it is divided into several different files/tables, representing different subsections of the XML. The end goal is to validate the data and generate a valid XML file using an existing schema, regardless of the format of the input data.
When receiving an XML file, the idea is to unmarshal and validate it. For simple errors, automatic fixes will be applied, and in the end a new XML file will be marshalled from the JAXB classes.
Question
In order to be able to generalize as much as possible of the solution, my idea was to try to generate a JAXB representation of the non-XML data too, and then generate the end XML-file from those classes. I have been trying to find a good tutorial or introduction to converting non-XML to a JAXB representation, but I haven't really been able to find anything useful, which makes me wonder, is this a really bad approach? Any better suggestions for how to solve this problem? In the majority of the cases the files are likely to be non-XML, so I am willing to throw out the current approach if anyone has better solution that uses some other technology.
I've worked with univocity parsers before. They work well and are simple to use for converting CSV to Java objects, which you can then serialize using JAXB.
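The CSV to object to XML pipeline can be sketched with the JDK alone. This is a stand-in and not the univocity or JAXB API: a naive split replaces the CSV library, and XMLStreamWriter replaces the JAXB marshaller. The Person type, the column order (name, city) and the element names are all invented for the example.

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class CsvToXmlSketch {
    // Hypothetical record type; with JAXB this would be an annotated class.
    static final class Person {
        final String name;
        final String city;
        Person(String name, String city) { this.name = name; this.city = city; }
    }

    // Naive CSV reading: first line is a header, no quoted fields, two columns.
    static List<Person> parse(String csv) {
        List<Person> people = new ArrayList<>();
        String[] lines = csv.split("\\R");
        for (int i = 1; i < lines.length; i++) {
            if (lines[i].trim().isEmpty()) continue;
            String[] cols = lines[i].split(",", -1);
            people.add(new Person(cols[0].trim(), cols[1].trim()));
        }
        return people;
    }

    // Writes the objects out as XML; the writer escapes special characters.
    static String toXml(List<Person> people) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement("people");
            for (Person p : people) {
                w.writeStartElement("person");
                w.writeStartElement("name");
                w.writeCharacters(p.name);
                w.writeEndElement();
                w.writeStartElement("city");
                w.writeCharacters(p.city);
                w.writeEndElement();
                w.writeEndElement();
            }
            w.writeEndElement();
            w.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
        return out.toString();
    }
}
```

Keeping the intermediate objects is what lets the XML and non-XML inputs share the same validation and output code.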

xml validate allowed values in java

I have a question about the best way to validate XML against an XSD. I need to validate allowed values in the XML, which can easily be done in XSD using an enumeration. The problem is that the list of allowed values is quite big, and doing this in XSD could be painful. Also, the allowed values can change from time to time, so I would like to avoid changing the XSD schema. I was thinking of filtering these values in Java, e.g. creating a config file for each XML tag, filled with its allowed values, and checking the values against it when validating the XML. If the content of an XML tag is not in the config file, an error would be raised.
My other question is which parser is best for this. The XML file has around 40 XML elements/tags, and one XML file could have around 40k records.
And my last question is how I can change the language of the parser's default error messages, which are in English. I have read some tutorials on which parser to use, but your experience would be really helpful. Thank you.
example of values:
<order>pancake</order>
<order>milk</order>
Pancake is an allowed value, so no error is raised. Milk is not allowed, so an error would be raised: Milk is not allowed.
Read this JAXB tutorial to see how to convert from and to XML.
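The config-driven check described in the question could look roughly like the sketch below. It uses the JDK's streaming StAX reader, which avoids holding a large file in memory. It assumes the allowed values have already been loaded into a Set (loading the config file is left out) and that the checked tag contains text only. The class name and the wording of the message are made up; because the messages are produced by your own code, they can be in any language.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class AllowedValuesCheck {
    /** Returns one message per occurrence of the tag whose text is not in the allowed set. */
    static List<String> check(String xml, String tag, Set<String> allowed) {
        List<String> errors = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(tag)) {
                    // getElementText() reads up to the matching end tag.
                    String value = reader.getElementText().trim();
                    if (!allowed.contains(value)) {
                        errors.add(value + " is not allowed.");
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Input is not well-formed XML", e);
        }
        return errors;
    }
}
```

For the example above (wrapped in a root element, since XML needs one), checking the order tag against a set containing only pancake would report milk.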

Unable to unmarshal strange XML format using Java and JAXB

I need to retrieve financial data using the Open Financial Exchange (OFX) protocol. In order to do this, I am using JAXB to marshal an object tree into an XML string that specifies data request parameters, and then I am sending this XML string to a bank's server. The bank then responds with an XML string containing the requested data, which I unmarshal into an object tree using JAXB. For the first couple of banks I tried, I received the data back in well-formed XML that conformed to the published OFX schema, and I was able to unmarshal it easily using JAXB.
However, when I requested data from Citigroup, they sent me back the following:
OFXHEADER:100
DATA:OFXSGML
VERSION:102
SECURITY:NONE
ENCODING:USASCII
CHARSET:1252
COMPRESSION:NONE
OLDFILEUID:NONE
NEWFILEUID:NONE
<OFX>
<SIGNONMSGSRSV1>
<SONRS>
<STATUS>
<CODE>0
<SEVERITY>INFO
</STATUS>
<DTSERVER>20150513180826.000
<LANGUAGE>ENG
<FI>
<ORG>Citigroup
<FID>24909
</FI>
</SONRS>
</SIGNONMSGSRSV1>
</OFX>
Note that this is an abbreviated form of the actual output, but it is enough to illustrate the problem. The problem is that I cannot figure out how to use JAXB to unmarshal this content. It is not well-formed XML because (1) it doesn't have an XML header, (2) the custom processing instructions (the first nine lines above) are not enclosed in <?...?> tags, and (3) most importantly, the simpleTypes have only opening tags but no closing tags.
I have searched all over for an answer to this and found a similar XML-ish format in a couple of places, and one of those places indicated that this may even be a valid format for sending XML over the web. But I haven't found any information that can help me unmarshal it or parse it.
Does anyone have any suggestions? I am usually pretty resourceful when it comes to these types of problems (hence why this is my first question on here), but this one has me stumped. Thanks in advance for any help you can provide.
Your basic problem is that the input you show here is not XML, it's SGML (see DATA:OFXSGML). You will have to preprocess it to make it acceptable to an XML parser. The kind of preprocessing you have to do will be application specific, as there's no general mechanism to deal well with that. If you have the SGML DTD, you might be able to get a product such as omnimark to "mostly" fix it up.
Well, maybe you need to handle this bank's service in some other manner. For example, when you receive data from this bank, read the stream line by line, identify where the first tag begins and where the last tag ends, and discard the rest of the stream. The string that remains is the XML that you need, so pass it through your already implemented JAXB code.
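A minimal sketch of such a preprocessing step follows. It goes one step further than dropping the header and also closes the leaf elements, since an XML parser needs that too. It assumes the layout of the sample above: one tag per line, leaf elements written as an opening tag followed by a value, and values that contain no raw markup characters. Real OFX responses may not follow that layout, and the class name is made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OfxSgmlToXml {
    // An opening tag followed by a value, with no closing tag on the same line.
    private static final Pattern OPEN_WITH_VALUE =
            Pattern.compile("^<([A-Za-z0-9_.]+)>([^<]+)$");

    static String convert(String ofx) {
        // Everything before the first '<' is the colon-separated header block.
        int start = ofx.indexOf('<');
        if (start < 0) throw new IllegalArgumentException("No markup found");
        StringBuilder xml = new StringBuilder();
        for (String line : ofx.substring(start).split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;
            Matcher m = OPEN_WITH_VALUE.matcher(trimmed);
            if (m.matches()) {
                // Leaf element: add the missing closing tag.
                xml.append('<').append(m.group(1)).append('>')
                   .append(m.group(2).trim())
                   .append("</").append(m.group(1)).append('>');
            } else {
                // Aggregate open or close tag: already acceptable to an XML parser.
                xml.append(trimmed);
            }
            xml.append('\n');
        }
        return xml.toString();
    }
}
```

The result can then be passed to the existing JAXB unmarshalling code. The header values (version, charset and so on) are thrown away here; keep them separately if you need them.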

Convert Arbitrary JSON with invalid XML characters to XML in Java

Background:
I'm calling web APIs that are in JSON format and passing them through
a data orchestration tool that needs them in XML format.
Orchestration tool allows custom Java procedures.
Problem:
JSON can contain elements that when converted to XML cause issues.
For example twitter handles #john: somevalue is fine for a key in JSON
but when converted to XML <#john>somevalue causes the
orchestration tool to throw errors.
I'm hitting a wide variety of web APIs that change often. I need to
be able to convert arbitrary JSON to XML with little to no maintenance.
Research so far:
I've found several ways to convert JSON to XML in Java but many of them are for fixed input structures.
This StackOverflow post seems like what I want but I'm having issues getting it to work and tracking down all of the JARs required.
I've seen some libraries will do some basic character escapes for &, <, >, ' and ". Is there one that is more robust?
I ended up deserializing the JSON and traversing the data, using the following regex to find the nodes and then removing or replacing the non-Latin characters.
Regex that grabs JSON node
"(.*?)":

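For the key clean-up step, a minimal sketch of turning an arbitrary JSON key into something usable as an XML element name might look like this. It deliberately keeps only a conservative ASCII subset of what XML allows in names, and the class and method names are invented for the example.

```java
final class XmlNames {
    /** Maps an arbitrary JSON key to a conservative, ASCII-only XML element name. */
    static String toXmlName(String key) {
        if (key.isEmpty()) return "_";
        StringBuilder name = new StringBuilder(key.length() + 1);
        for (int i = 0; i < key.length(); i++) {
            char c = key.charAt(i);
            boolean allowed = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9') || c == '_' || c == '-' || c == '.';
            // Replace anything outside the safe subset with an underscore.
            name.append(allowed ? c : '_');
        }
        // A name may not start with a digit, hyphen or dot.
        char first = name.charAt(0);
        if ((first >= '0' && first <= '9') || first == '-' || first == '.') {
            name.insert(0, '_');
        }
        return name.toString();
    }
}
```

With this mapping the key #john becomes _john. Different keys can collapse to the same name (for example #john and @john), so if the original key matters it would need to be carried separately, for instance in an attribute.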
validating xml in java as the document is built

I am working on converting an Excel spreadsheet into an XML document that needs to be validated against a schema. I am currently building the XML document using the DOM API, and validating at the end using SAX and a custom error handler. However, I would really like to be able to validate the XML produced from each cell as I parse the Excel document, so I can indicate which cells are problematic in a friendlier way.
The problem that I am currently encountering is that, after validating the XML for the simple types, once they are built into a complex type all the child nodes get validated again, producing redundant errors.
I found this question here at SO but it is using C# and the Microsoft API.
Thoughts? Thanks!
Sorry, but I don't see the problem. You are producing the XML, so what's the point in validating the XML while you produce it?
Are you looking to validate the cell contents? If yes, then write validation logic into your code. This validation logic may replicate the schema, but I suspect that it will actually be much more detailed than the schema.
Are you looking to validate your program's output? If yes, then write unit tests.
You could try having your parsing code fire SAX events instead of directly constructing a DOM. Then you could just register a validating SAX ContentHandler to listen to it and have that build your DOM for you. That should detect validation errors as they're encountered.
So the solution that I decided to go with, and am almost finished implementing, was to use XSOM to parse the XSD. Then, when parsing the Excel file, I looked up the column name in the parsed XSD to pull out the restrictions (since the column headers map to simple types in the XSD) and then did manual validation against the restrictions. I am still building the tree so that at the end I can validate the entire XML tree against the XSD, since there are some things that I can't catch at the cell level.
Thanks for all of your input.
Try building schemas at multiple levels of granularity. Test the simple (Cells) ones against the most granular, and the complex ones (Rows?) against a less granular schema that doesn't decompose the complex types.
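A minimal sketch of the most granular level, using the JDK's javax.xml.validation API: it assumes a separate cell-level schema in which every column is declared as a global element with a simple type, so a single cell can be wrapped in its element and validated on its own. The CellValidator class is made up for the example; the messages come from the JDK's validator.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

final class CellValidator {
    private final Schema schema;

    /** Compiles the (assumed) cell-level XSD once; the Schema object can be reused. */
    CellValidator(String cellLevelXsd) {
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(cellLevelXsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("Schema could not be compiled", e);
        }
    }

    /** Returns null when the cell is valid, otherwise the validator's error message. */
    String validate(String column, String value) {
        String xml = "<" + column + ">" + escape(value) + "</" + column + ">";
        try {
            Validator validator = schema.newValidator();
            // Without a custom ErrorHandler, the first validation error is thrown.
            validator.validate(new StreamSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Each non-null result can be reported together with the row and column of the cell that produced it. The whole document would still be validated against the full schema at the end to catch what cannot be seen at the cell level.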
