Parsing 'pseudo' XML (that is, not well formed) in java?

I have some XML that looks like this:
<xml><name>oscar</name><race>puppet</race><class>grouch</class></xml>
The tags change and are variable, so there won't always be a 'name' tag.
I've tried 3 or 4 parsers and they all seem to choke on it. Any hints?

Just because it doesn't have a defined schema, doesn't mean it isn't "valid" XML - your sample XML is "well formed".
The dom4j library will do it for you. Once parsed (your XML will parse OK) you can iterate through child elements, no matter what their tag name, and work with your data.
Here's an example of how to use it:
import java.util.Iterator;
import org.dom4j.*;

String text = "<xml><name>oscar</name><race>puppet</race><class>grouch</class></xml>";
Document document = DocumentHelper.parseText(text); // throws DocumentException on malformed input
Element root = document.getRootElement();
for (Iterator<Element> i = root.elementIterator(); i.hasNext(); ) {
    Element element = i.next();
    String tagName = element.getName(); // getQName() returns a QName object, not a String
    String contents = element.getText();
    // do something
}
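If dom4j is not available, the same idea can be sketched with the DOM parser that ships with the JDK. This is only an illustration: the class and method names (ChildElementLister, childElements) are made up for this example, and it assumes the interesting data sits in the direct children of the root element.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class ChildElementLister {

    // Returns tag name -> text content for each direct child element of the root,
    // in document order. If a tag name repeats, only its last value is kept.
    static Map<String, String> childElements(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Map<String, String> result = new LinkedHashMap<>();
            for (Node n = doc.getDocumentElement().getFirstChild(); n != null; n = n.getNextSibling()) {
                // Skip whitespace text nodes, comments and so on.
                if (n.getNodeType() == Node.ELEMENT_NODE) {
                    result.put(n.getNodeName(), n.getTextContent());
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String text = "<xml><name>oscar</name><race>puppet</race><class>grouch</class></xml>";
        childElements(text).forEach((tag, value) -> System.out.println(tag + " = " + value));
    }
}
```

Nothing in the loop depends on a particular tag name, so it keeps working when the 'name' tag is absent or new tags appear.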

This is valid xml; try adding an XML Schema that allows for optional elements. If you can write an xml schema, you can use JAXB to parse it. XML allows for having optional elements; it isn't too "strict" about it.

Your XML sample is well-formed XML, and if anything "chokes" on it then it would be useful for us to know exactly what the symptoms of the "choking" are.

Related

How to stop Jackson from parsing an element?

I have an XML document where there are nested tags that should not be interpreted as XML tags.
For example, something like this:
<something>cbaabc</something> should be parsed as a plain String "cbaabc" (it should be mentioned that the document has other elements as well that get parsed just fine). Jackson, though, tries to interpret it as an Object and I don't know how to prevent this. I tried using @JacksonXmlText, turning off wrapping and a custom Deserializer, but I didn't get it to work.

The <a should be escaped as &lt;a. This back-and-forth conversion normally happens with every XML API: setting and getting text will use those entities (&...;).
Another option is to wrap the text in a CDATA section: <![CDATA[ ... ]]>.
<something><![CDATA[cbaabc]]></something>
If you cannot correct that, and have to live with an already corrupted XML text, you must do your own hack:
Load the wrong XML in a String
Repair the XML
Pass the XML string to jackson
Repairing:
String xml = ...
xml = xml.replaceAll("<(/?a\\b[^>]*)>", "&lt;$1&gt;"); // Links: escape <a ...> and </a> so they become text
StringReader in = new StringReader(xml);
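To see what the repair step does, here is a small self-contained sketch. The input with a nested <a href="x"> tag is an assumption made up for illustration, as are the class and method names; the JDK DOM parser stands in for Jackson only to show that the repaired string is read back as plain text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Jackson.
class AnchorEscaper {

    // Escapes the angle brackets of <a ...> and </a> tags so a parser treats them as text.
    // Other tags (including ones that merely start with "a", such as <abc>) are left alone.
    static String escapeAnchors(String xml) {
        return xml.replaceAll("<(/?a\\b[^>]*)>", "&lt;$1&gt;");
    }

    // Parses the XML and returns the text content of the root element.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String broken = "<something>cba<a href=\"x\">abc</a></something>"; // assumed sample input
        String repaired = escapeAnchors(broken);
        System.out.println(repaired);
        System.out.println(rootText(repaired));
    }
}
```

A regex like this is a hack: it only covers the one tag name it was written for, so it is worth fixing the producer of the XML if that is at all possible.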

Creating java parser for json string

I'm trying to translate a Java method that uses XPath to parse XML into one that uses JsonPath instead, and I'm having trouble working out what the XPath parser is doing so I can replicate it using JsonPath.
Here is the code that currently parses "String body".
public static String parseXMLBody(String body, String searchToken) {
String xPathExpression;
try {
// we use xPath to parse the XML formatted response body
xPathExpression = String.format("//*[1]/*[local-name()='%s']", searchToken);
XPath xPath = XPathFactory.newInstance().newXPath();
return (xPath.evaluate(xPathExpression, new InputSource(new StringReader(body))));
} catch (Exception e) {
throw new RuntimeException(e); // simple exception handling, please review it
}
}
Can anyone help translate this into a method that uses JsonPath or something similar?
Thanks

I can explain the XPath for you.
//*[1] is meant to select the first element node in the document, which is the document element. Strictly speaking, //*[1] matches every element that is the first element child of its parent, so it can match more than the document element; (//*)[1] or simply /* selects the document element alone.
/*/* returns all element child nodes of the document element. //*[1]/* returns those too, plus the children of any other first-child element.
[local-name()='tagname'] filters nodes by their local name (the tag name without the namespace prefix).
The full expression //*[1]/*[local-name()='tagname'] is evidently intended to fetch the direct child nodes of the document element with the provided tag name, ignoring namespaces. /*/*[local-name()='tagname'] expresses exactly that. Because xPath.evaluate returns a String here, the result is the text of the first matching node.
Without knowing the JSON, there is no way to say what the JsonPath should look like. I would not expect the JSON to have a root element, but I expect the items to be different because in JSON you cannot have multiple siblings with the same key (you can have multiple siblings with the same node name in XML).
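To make the XPath side concrete, here is a minimal sketch of the simplified expression /*/*[local-name()='...'] using only the JDK. The class name, method name and sample document are assumptions for illustration; the sketch parses namespace-aware so that a prefixed name is split into prefix and local name.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
class LocalNameLookup {

    // Returns the text of the first child of the document element whose local name
    // equals searchToken, or an empty string when nothing matches.
    static String firstChildText(String body, String searchToken) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // split "s:status" into prefix and local name
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(body)));
            String expression = String.format("/*/*[local-name()='%s']", searchToken);
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample response body with a namespace prefix.
        String body = "<s:reply xmlns:s=\"urn:example\"><s:status>ok</s:status><s:code>7</s:code></s:reply>";
        System.out.println(firstChildText(body, "status"));
    }
}
```

If the JSON turns out to be a flat object, the closest equivalent would be a lookup of a single top-level key, but that depends entirely on the shape of the JSON.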

How to convert xpath to java code

I have the XPath of an element and need to write Java code that gives me exactly that element as an object. I believe I need to use SAX or DOM? I'm a total newbie.
xpath :
/*[local-name(.)='feed']/*[local-name(.)='entry']/*[local-name(.)='title']

Your comment suggests you want to use DOM4J, which supports XPath out of the box:
SAXReader reader = new SAXReader();
Document doc = reader.read(new File(....)); // or URL, or wherever the XML comes from
Node selectedNode = doc.selectSingleNode("/*[local-name(.)='feed']/*[local-name(.)='entry']/*[local-name(.)='title']");
(or there's also selectNodes which returns a List, if there might be more than one node matching that XPath expression - quite likely if this is an Atom feed).
But rather than using the local-name hack like this, if you know the namespace URI of the elements in your XML you can declare a prefix for this namespace and select the nodes by their fully qualified name:
SAXReader reader = new SAXReader();
Map<String, String> namespaces = new HashMap<>();
namespaces.put("atom", "http://www.w3.org/2005/Atom");
reader.getDocumentFactory().setXPathNamespaceURIs(namespaces);
Document doc = reader.read(new File(....)); // or URL, or wherever the XML comes from
List selectedNodes = doc.selectNodes("/atom:feed/atom:entry/atom:title");
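For comparison, the same prefix-based lookup can be sketched without dom4j, using the XPath API in the JDK. The catch is that javax.xml.xpath needs a NamespaceContext implementation, and the JDK does not ship a general-purpose one, so the sketch writes a tiny one by hand. The class name and the sample feed are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
class AtomTitles {

    static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    // Binds only the "atom" prefix; every other prefix is reported as unbound.
    static final NamespaceContext ATOM_CONTEXT = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            return "atom".equals(prefix) ? ATOM_NS : XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String uri) {
            return ATOM_NS.equals(uri) ? "atom" : null;
        }
        @Override public Iterator<String> getPrefixes(String uri) {
            return ATOM_NS.equals(uri)
                    ? Collections.singletonList("atom").iterator()
                    : Collections.<String>emptyIterator();
        }
    };

    // Returns the text of every /feed/entry/title element in the Atom namespace.
    static List<String> entryTitles(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed so prefixes resolve to namespace URIs
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(ATOM_CONTEXT);
            NodeList nodes = (NodeList) xpath.evaluate(
                    "/atom:feed/atom:entry/atom:title", doc, XPathConstants.NODESET);
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                titles.add(nodes.item(i).getTextContent());
            }
            return titles;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }
}
```

Note that the prefix used in the XPath ("atom") does not have to match the prefix in the document; only the namespace URI has to match, which is why this works on feeds that declare Atom as the default namespace.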

Read here:
https://howtodoinjava.com/java/xml/java-xpath-tutorial-example/
I found it while searching for how to convert an XPath PMD rule to a Java rule. It did not have what I needed, but you may find what you need in it.

How to retrieve all the elements name from a xml schema

I am having a problem getting the names of schema elements in Java. I am creating a small XML editor which can load an XML schema and validate an XML file against it. I want to parse a schema, get every element's name and then put it in my content assistant, so the user can see all the available elements.
I have already read the XSOM User's Guide, but I didn't understand much...
Can someone help me implement my addElementsFromSchema(File xsdfile) function? I got lost trying.
public static void addElementsFromSchema(File xsdfile){
}

It sounds like your primary need, at least for now, is to get the element names. You can get the element names with something like:
XSOMParser parser = new XSOMParser();
parser.parse(xsdfile);
XSSchemaSet schemas = parser.getResult();
Iterator<XSElementDecl> i = schemas.iterateElementDecls();
while (i.hasNext()) {
XSElementDecl element = i.next();
String name = element.getName();
// Add to editor
}
Showing element definitions is a lot more difficult, as element declarations in XML schemas can get quite complex.
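If pulling in XSOM is not an option, a rougher alternative (a different technique from the XSOM code above) is to treat the XSD as ordinary XML and collect the name attribute of every xs:element. The sketch below uses only the JDK; the class and method names are made up, it takes the schema as a String for brevity, and it does not follow xs:include or xs:import. Unlike a schema-aware library it has no notion of scope, so it lists locally declared elements alongside top-level ones.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
class SchemaElementNames {

    // Returns the "name" attribute of every xs:element in the schema text, in document order.
    // References (xs:element ref="...") carry no name attribute and are skipped.
    static List<String> elementNames(String xsdText) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // match on namespace URI, whatever prefix the schema uses
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xsdText)));
            NodeList decls = doc.getElementsByTagNameNS(
                    XMLConstants.W3C_XML_SCHEMA_NS_URI, "element");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < decls.getLength(); i++) {
                String name = ((Element) decls.item(i)).getAttribute("name");
                if (!name.isEmpty()) {
                    names.add(name);
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read schema", e);
        }
    }
}
```

For a File, DocumentBuilder.parse(File) can be used in place of the InputSource.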

Parsing an XML file without root in Java

I have this XML file which doesn't have a root node. Other than manually adding a "fake" root element, is there any way I would be able to parse an XML file in Java? Thanks.

I suppose you could create a new implementation of InputStream that wraps the one you'll be parsing from. This implementation would return the bytes of the opening root tag before the bytes from the wrapped stream and the bytes of the closing root tag afterwards. That would be fairly simple to do.
I may be faced with this problem too. Legacy code, eh?
Ian.
Edit: You could also look at java.io.SequenceInputStream which allows you to append streams to one another. You would need to put your prefix and suffix in byte arrays and wrap them in ByteArrayInputStreams but it's all fairly straightforward.
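A minimal sketch of the SequenceInputStream idea, using only the JDK, might look like this. The class name, method name and the choice of root element name are assumptions for illustration. It also assumes the fragment stream is UTF-8 compatible and does not start with an XML declaration or a byte order mark, since either would end up after the synthetic opening tag and break the parse.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper for illustration only.
class RootlessXml {

    // Parses a stream of sibling elements by surrounding it with <rootName> ... </rootName>.
    // The fragment stream is consumed lazily, so it is never copied into one big String.
    static Document parseWithSyntheticRoot(InputStream fragments, String rootName) {
        try {
            InputStream open = new ByteArrayInputStream(
                    ("<" + rootName + ">").getBytes(StandardCharsets.UTF_8));
            InputStream close = new ByteArrayInputStream(
                    ("</" + rootName + ">").getBytes(StandardCharsets.UTF_8));
            try (InputStream wrapped = new SequenceInputStream(
                    Collections.enumeration(Arrays.asList(open, fragments, close)))) {
                return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(wrapped);
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse wrapped XML", e);
        }
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(
                "<a>1</a><b>2</b>".getBytes(StandardCharsets.UTF_8));
        Document doc = parseWithSyntheticRoot(in, "root");
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

Closing the SequenceInputStream also closes the wrapped fragment stream, so the caller should not expect to reuse it afterwards.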

Your XML document needs a root xml element to be considered well formed. Without this you will not be able to parse it with an xml parser.

One way is to provide your own dummy wrapper without touching the original 'xml' (the not well-formed 'xml'). The mechanism for that is an external entity declared in the DOCTYPE:
Syntax
<!DOCTYPE some_root_elem SYSTEM "/home/ego/some.dtd"
[
<!ENTITY entity-name "Some value to be inserted at the entity">
]>
Example:
<!DOCTYPE dummy [
<!ENTITY data SYSTEM "http://wherever-my-data-is">
]>
<dummy>
&data;
</dummy>

You could use another parser like Jsoup. It can parse XML without a root.

I think that even if some API had an option for this, it would only return the first node of the "XML", which looks like a root, and discard the rest.
So the answer is probably to do it yourself. Scanner or StringTokenizer might do the trick.
Some HTML parsers might help too; they are usually less strict.

Here's what I did:
There's an old java.io.SequenceInputStream class, which is so old that it takes Enumeration rather than List or such.
With it, you can prepend and append the root element tags (<div> and </div> in my case) around your no-root XML stream. (You shouldn't do it by concatenating Strings due to performance and memory reasons.)
public void tryExtractHighestHeader(ParserContext context)
{
    String xhtmlString = context.getBody();
    if (xhtmlString == null || "".equals(xhtmlString))
        return;

    // The XHTML needs to be wrapped, because it has no root element.
    ByteArrayInputStream divStart = new ByteArrayInputStream("<div>".getBytes(StandardCharsets.UTF_8));
    ByteArrayInputStream divEnd = new ByteArrayInputStream("</div>".getBytes(StandardCharsets.UTF_8));
    ByteArrayInputStream is = new ByteArrayInputStream(xhtmlString.getBytes(StandardCharsets.UTF_8));
    // java.util.Collections.enumeration adapts the list to the Enumeration that SequenceInputStream expects.
    Enumeration<InputStream> streams = Collections.enumeration(Arrays.asList(new InputStream[]{divStart, is, divEnd}));

    try (SequenceInputStream wrapped = new SequenceInputStream(streams)) {
        DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = builderFactory.newDocumentBuilder();
        Document xmlDocument = builder.parse(wrapped);

        // From here you can do whatever you like, but keep in mind the extra element.
        XPath xPath = XPathFactory.newInstance().newXPath();
    }
    catch (Exception e) {
        // Keep the original exception as the cause so the stack trace is not lost.
        throw new RuntimeException("Failed parsing XML: " + e.getMessage(), e);
    }
}
