Remove illegal markup from xml string (node starting with number)


I am converting a JSON array to an XML string:
JSONArray json = new JSONArray(response);
xml = XML.toString(json);
and unfortunately the result contains nodes like
<24x24>blah</24x24>
Afterwards I want to create a 'real' XML node with SAXBuilder, which produces the following error:
The content beginning "<2" is not legal markup. Perhaps the "2" ( ) character should be a letter.
Does anybody know how to remove this illegal markup from the XML String?
Maybe a regex which replaces <24x24>blah</24x24> with <t24x24>blah</t24x24>?
Thank you!

You can try the String.replaceAll() method with a regex:
System.out.println("<24x24>blah</24x24>".replaceAll("(<\\/?)(?=\\d)", "$1t"));
output:
<t24x24>blah</t24x24>
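The same replacement can be wrapped in a helper and checked by parsing the result. This is only a minimal sketch: the class name TagPrefixer and its method names are made up for illustration, and it uses the JDK's built-in DOM parser rather than SAXBuilder. It only handles names that start with a digit, not other illegal name characters.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class TagPrefixer {

    // Inserts the prefix after "<" or "</" whenever a digit follows.
    static String prefixNumericTags(String xml, String prefix) {
        return xml.replaceAll("(</?)(?=\\d)", "$1" + Matcher.quoteReplacement(prefix));
    }

    // Parses the string and returns the root element name; fails if the XML is not well-formed.
    static String rootName(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTagName();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String fixed = prefixNumericTags("<24x24>blah</24x24>", "t");
        System.out.println(fixed);            // <t24x24>blah</t24x24>
        System.out.println(rootName(fixed));  // t24x24
    }
}
```

In well-formed XML a literal "<" inside text is escaped, so the pattern should only hit tag names, but comments and CDATA sections that contain "<" followed by a digit would be rewritten as well.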

Related

Java Unescaping XML/HTML before JAXB parsing doesn't work

Can anyone help me?
In HTML/XML:
A numeric character reference refers to a character by its Universal Character Set/Unicode code point, and uses the format:
&#nnnn;
or
&#x hhhh;
I have to unescape (convert to unicode) these references before I use the JAXB parser.
When I use Apache StringEscapeUtils.unescapeXml(), &amp ; and &gt ; and &lt ; are unescaped as well, and that is not what I want because then parsing will fail.
Is there a library that only converts the &#nnnn to unicode? But does not unescape the rest?
Example:
begin-tag Adam &lt ;&gt ; Sl.meer 4 & 5 &# 55357;&# 56900; end-tag
I have added spaces after &# otherwise you do not see the notation.
For now I fixed it like this, but I want to use a better solution.
String unEncapedString = StringEscapeUtils.unescapeXml(xmlData).replaceAll("&", "&amp;")
.replaceAll("<>", "&lt;&gt;");
StringReader reader = new StringReader(unEncapedString.codePoints().filter(c -> isValidXMLChar(c))
.collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append).toString());
return (Xxxx) createUnmarshaller().unmarshal(reader);
I looked in the Apache Commons Text library and finally found the solution:
NumericEntityUnescaper numericEntityUnescaper = new NumericEntityUnescaper(
NumericEntityUnescaper.OPTION.semiColonRequired);
xmlData = numericEntityUnescaper.translate(xmlData);
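If pulling in Commons Text is not an option, roughly the same effect can be sketched with the JDK alone. The class NumericRefs below is hypothetical and is not how NumericEntityUnescaper is implemented. It converts decimal and hexadecimal references and leaves named entities such as &amp; and &lt; untouched. As an extra assumption it also leaves references to "&" and "<" themselves alone, because turning those into literal characters would break the XML again.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stdlib-only helper; a sketch, not the Commons Text implementation.
class NumericRefs {

    // Group 1: hex digits of &#xhhhh; , group 2: decimal digits of &#nnnn;
    private static final Pattern REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    static String unescapeNumeric(String input) {
        Matcher m = REF.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            int cp;
            try {
                cp = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
            } catch (NumberFormatException e) {
                cp = -1; // too many digits to be a code point
            }
            if (Character.isValidCodePoint(cp) && cp != '&' && cp != '<') {
                // Also works for surrogate halves written as two separate references.
                out.appendCodePoint(cp);
            } else {
                out.append(m.group()); // keep the reference as it was
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescapeNumeric("Adam &lt;&gt; 4 &amp; 5 &#65;&#x42;")); // Adam &lt;&gt; 4 &amp; 5 AB
    }
}
```

The result may still contain characters that are not legal in XML 1.0, so a filter like the isValidXMLChar step in the question can still be needed afterwards.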

Get an Array or List of Strings between some Strings (Search multiple Strings)

I have a large String which contains some XML. This XML contains input like:
<xyz1>...</xyz1>
<hello>text between strings #1</hello>
<xyz2>...</xyz2>
<hello>text between strings #2</hello>
<xyz3>...</xyz3>
I want to get all these <hello>text between strings</hello>.
So in the end I want to have a List or any Collection which contains all <hello>...</hello>
I tried it with Regex and Matcher, but it doesn't work with large strings; with smaller strings it works. I read a blog post about this which says that Java regex is broken for alternation over large strings.
Is there any easy and good way to do this?
Edit:
An attempt is...
String pattern1 = "<hello>";
String pattern2 = "</hello>";
List<String> helloList = new ArrayList<String>();
String regexString = Pattern.quote(pattern1) + "(.*?)" + Pattern.quote(pattern2);
Pattern pattern = Pattern.compile(regexString);
Matcher matcher = pattern.matcher(scannerString);
while (matcher.find()) {
String textInBetween = matcher.group(1); // Since (.*?) is capturing group 1
// You can insert match into a List/Collection here
helloList.add(textInBetween);
logger.info("-------------->>>> " + textInBetween);
}

If you have to parse an XML file, I suggest you use the XPath language. Basically you have to do these steps:
Parse the XML String inside a DOM object
Create an XPath query
Query the DOM
An example of what you have to do is this:
String xml = ...;
try {
// Build structures to parse the String
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
// Parse the XML string into a DOM object
Document document = builder.parse(new ByteArrayInputStream(xml.getBytes(java.nio.charset.StandardCharsets.UTF_8)));
// Create an XPath query
XPath xPath = XPathFactory.newInstance().newXPath();
// Query the DOM object with the query '//hello'
NodeList nodeList = (NodeList) xPath.compile("//hello").evaluate(document, XPathConstants.NODESET);
} catch (Exception e) {
e.printStackTrace();
}
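The snippet above stops once it has the NodeList. A possible way to turn that into the List the question asks for is sketched below. The class name HelloExtractor is hypothetical, and the sketch assumes the fragments are wrapped in a single root element, because the parser rejects several top-level elements.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper illustrating the three steps: parse, build query, query the DOM.
class HelloExtractor {

    static List<String> extract(String xml, String tagName) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//" + tagName, document, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><xyz1>...</xyz1><hello>text between strings #1</hello>"
                + "<xyz2>...</xyz2><hello>text between strings #2</hello></root>";
        System.out.println(extract(xml, "hello")); // [text between strings #1, text between strings #2]
    }
}
```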

You have to parse your XML with an XML parser. It is easier than using regular expressions.
The DOM parser is the simplest to use, but if your XML is very big, use the SAX parser.
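For the SAX route, a handler along these lines might do. It is a sketch with the hypothetical class name HelloSaxReader, it assumes a single root element around the fragments, and it reads from a String only to keep the example short. For really large input a stream or file would be passed to the parser instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that collects the text of every <hello> element.
class HelloSaxReader extends DefaultHandler {

    private final List<String> hellos = new ArrayList<>();
    private StringBuilder current; // non-null only while inside <hello>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("hello".equals(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so append instead of assigning.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("hello".equals(qName) && current != null) {
            hellos.add(current.toString());
            current = null;
        }
    }

    static List<String> read(String xml) {
        HelloSaxReader handler = new HelloSaxReader();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return handler.hellos;
    }
}
```

Calling HelloSaxReader.read(...) on the wrapped sample from the question should return the two hello texts in document order, without building the whole tree in memory.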

I would highly recommend using one of the multiple public XML parsers available:
Woodstox
StAX
dom4j
It is simply easier to achieve what you're trying to achieve (even if you wish to elaborate on your request in the future). If you have no issues with speed and memory, go ahead and use dom4j. There are a lot of resources online; I can post good examples in this answer if you wish, since right now it simply points you to alternative options and I'm not sure what your limitations are.
Regarding REGEX when parsing XML, Dour High Arch gave a great response:
XML is not a regular language. You cannot parse it using a regular expression. An expression you think will work will break when you get nested tags, then when you fix that it will break on XML comments, then CDATA sections, then processor directives, then namespaces, ... It cannot work, use an XML parser.
Parsing XML with REGEX in Java

With Java 8 you could use the Dynamics library to do this in a straightforward way
XmlDynamic xml = new XmlDynamic(
"<bunch_of_data>" +
"<xyz1>...</xyz1>" +
"<hello>text between strings #1</hello>" +
"<xyz2>...</xyz2>" +
"<hello>text between strings #2</hello>" +
"<xyz3>...</xyz3>" +
"</bunch_of_data>");
List<String> hellos = xml.get("bunch_of_data").children()
.filter(XmlDynamic.hasElementName("hello"))
.map(hello -> hello.asString())
.collect(Collectors.toList()); // ["text between strings #1", "text between strings #2"]
See https://github.com/alexheretic/dynamics#xml-dynamics

Parsing 'pseudo' XML (that is, not well formed) in java?

I have some xml that looks like this:
<xml><name>oscar</name><race>puppet</race><class>grouch</class></xml>
The tags change and are variable, so there won't always be a 'name' tag.
I've tried 3 or 4 parses and they all seem to choke on it. Any hints?

Just because it doesn't have a defined schema, doesn't mean it isn't "valid" XML - your sample XML is "well formed".
The dom4j library will do it for you. Once parsed (your XML will parse OK) you can iterate through child elements, no matter what their tag name, and work with your data.
Here's an example of how to use it:
import org.dom4j.*;
String text = "<xml><name>oscar</name><race>puppet</race><class>grouch</class></xml>";
Document document = DocumentHelper.parseText(text);
Element root = document.getRootElement();
for ( Iterator<Element> i = root.elementIterator(); i.hasNext(); ) {
Element element = i.next();
// getName() returns the tag name as a String (getQName() returns a QName object)
String tagName = element.getName();
String contents = element.getText();
// do something
}
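If adding dom4j is not possible, the same "iterate over whatever children are there" idea can be sketched with the DOM parser that ships with the JDK. The class name ChildElements is hypothetical. The sketch assumes each child tag appears at most once, since a repeated tag name would overwrite the earlier entry in the map.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: maps each child tag name of the root to its text content.
class ChildElements {

    static Map<String, String> read(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Map<String, String> result = new LinkedHashMap<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node node = children.item(i);
                // Skip whitespace text nodes and comments; keep only elements.
                if (node instanceof Element) {
                    result.put(((Element) node).getTagName(), node.getTextContent());
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String text = "<xml><name>oscar</name><race>puppet</race><class>grouch</class></xml>";
        System.out.println(read(text)); // {name=oscar, race=puppet, class=grouch}
    }
}
```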

This is valid xml; try adding an XML Schema that allows for optional elements. If you can write an xml schema, you can use JAXB to parse it. XML allows for having optional elements; it isn't too "strict" about it.

Your XML sample is well-formed XML, and if anything "chokes" on it then it would be useful for us to know exactly what the symptoms of the "choking" are.

Regex Email addresses out of xml

My question: What's a good way to parse the information below?
I have a Java program that gets its input from XML. I have a feature which will send an error email if there was any problem in the processing. Because parsing the XML could be a problem, I want to have a feature that would be able to regex the emails out of the XML (because if parsing was the problem then I couldn't get the error e-mails out of the XML normally).
Requirements:
I want to be able to parse the to, cc, and bcc attributes separately
There are other elements which have to, cc, and bcc attributes
Whitespace does not matter, so my example may show the attributes on a newline, but that's not always the case.
The order of the attributes does not matter.
Here's an example of the xml:
<error_options
to="your_email#your_server.com"
cc="cc_error#your_server.com"
bcc="bcc_error#your_server.com"
reply_to="someone_else#their_server.com"
from="bo_error#some_server.org"
subject="Error running System at ##TIMESTAMP##"
force_send="false"
max_email_size="10485760"
oversized_email_action="zip;split_all"
>
I tried this error_options.{0,100}?to="(.*?)", but that matched me down to reply_to. That made me think there are probably some cases I might miss, which is why I'm posting this as a question.

This piece will put all attributes from your String s="<error_options..." into a map:
Pattern p = Pattern.compile("\\s+?(.+?)=\"(.+?)\\s*?\"",Pattern.DOTALL);
Map<String, String> a = new HashMap<>() ;
Matcher m = p.matcher(s) ;
while( m.find() ) {
String key = m.group(1).trim() ;
String val = m.group(2).trim() ;
a.put(key, val) ;
}
...then you can extract the values that you're interested in from that map.

This question is similar to RegEx match open tags except XHTML self-contained tags. Never ever parse XML or HTML with regular expressions. There are many XML parser implementations in Java to do this task properly. Read the document and parse the attributes one by one.
Don't worry if the user's XML is not well-formed; the parsers can handle a lot of sloppiness.

/<error_options(?=\s)[^>]*?(?<=\n)\s*to="([^"]*)"/s;
/<error_options(?=\s)[^>]*?(?<=\n)\s*cc="([^"]*)"/s;
/<error_options(?=\s)[^>]*?(?<=\n)\s*bcc="([^"]*)"/s;
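A Java take on the same goal could work in two steps: first isolate the error_options start tag, then pick to, cc and bcc out of it with a word boundary so that reply_to is not mistaken for to. This is a sketch under assumptions: the class name ErrorOptionsEmails is made up, attribute values are assumed to be double-quoted and to contain no ">" character, and unlike the patterns above it does not require a newline before the attribute.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical fallback extractor for when the XML cannot be parsed normally.
class ErrorOptionsEmails {

    // The start tag only; \b keeps names like <error_options_extra from matching.
    private static final Pattern TAG = Pattern.compile("<error_options\\b([^>]*)>");

    // "_" is a word character, so \bto cannot match inside reply_to, and \bcc not inside bcc.
    private static final Pattern ATTR =
            Pattern.compile("\\b(to|cc|bcc)\\s*=\\s*\"([^\"]*)\"");

    static Map<String, String> extract(String xml) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher tag = TAG.matcher(xml);
        if (tag.find()) {
            Matcher attr = ATTR.matcher(tag.group(1));
            while (attr.find()) {
                result.put(attr.group(1), attr.group(2).trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<other to=\"ignored\"/>\n<error_options\n"
                + "to=\"your_email#your_server.com\"\n"
                + "cc=\"cc_error#your_server.com\" bcc=\"bcc_error#your_server.com\"\n"
                + "reply_to=\"someone_else#their_server.com\"\n>";
        System.out.println(extract(xml));
    }
}
```

Only the first error_options tag is examined, which also keeps the to, cc and bcc attributes of other elements out of the result.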

Invalid character while converting from JSON to XML using jsonlib

I'm trying to convert a JSON string to XML using jsonlib in Java.
JSONObject json = JSONObject.fromObject(jsonString);
XMLSerializer serializer = new XMLSerializer();
String xml = serializer.write( json );
System.out.println(xml);
The error that I get is
nu.xom.IllegalNameException: 0x24 is not a legal NCName character
The problem here is that some property names in my JSON contain characters that are invalid in XML names, e.g. I have a property named "$t". The XMLSerializer throws the exception while trying to create an XML tag with this name because $ is not allowed in XML tag names. Is there any way in which I can override this XML well-formedness check done by the serializer?

First I'd suggest to add the language you are using (it is Java, right?).
You could override the method where it checks your XML tag name to do nothing.

I took a look at the spec for the json-lib XMLSerializer and to my surprise it seems to have no option for serialising a JSON object whose keys are not valid XML names. If that's the case then I think you will need to find a different library.

You could loop over json.keySet (recursively if necessary) and replace invalid keys with valid ones (using remove and add).
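The key-renaming idea can be sketched independently of json-lib by working on plain java.util.Map structures. Everything here is an assumption for illustration: the class XmlNames, the choice of "_" as the replacement character, and the simplified notion of a legal name (letters, digits, "_", "-" and "." only, starting with a letter or "_"), which is stricter than the real XML name rules. Two different keys can also end up with the same sanitized name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; does not use or imitate the json-lib API.
class XmlNames {

    // Replaces characters outside the simplified name alphabet and fixes the first character.
    static String toXmlName(String key) {
        StringBuilder sb = new StringBuilder();
        for (char c : key.toCharArray()) {
            boolean ok = Character.isLetterOrDigit(c) || c == '_' || c == '-' || c == '.';
            sb.append(ok ? c : '_');
        }
        if (sb.length() == 0 || !(Character.isLetter(sb.charAt(0)) || sb.charAt(0) == '_')) {
            sb.insert(0, '_');
        }
        return sb.toString();
    }

    // Returns a copy of the map with sanitized keys, descending into nested maps.
    @SuppressWarnings("unchecked")
    static Map<String, Object> sanitizeKeys(Map<String, Object> input) {
        Map<String, Object> output = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : input.entrySet()) {
            Object value = entry.getValue();
            if (value instanceof Map) {
                value = sanitizeKeys((Map<String, Object>) value);
            }
            output.put(toXmlName(entry.getKey()), value);
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(toXmlName("$t"));     // _t
        System.out.println(toXmlName("24x24"));  // _24x24
    }
}
```

Building a new map avoids modifying the key set while iterating over it, which the remove-and-add approach would have to work around.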
