How to replace invalid characters using Java

Invalid XML: Error on line 190: An invalid XML character (Unicode: 0x10) was found in the CDATA section.
I get this error while parsing an XML file. I used String.replaceAll to remove the character, but my regex pattern seems to be incorrect.
The following targets a different string, but it just gives me back the original string. How should I do it?
str = str.replaceAll("\\^p", "");

Use this:
String replaced = your_original_string.replaceAll("\\x10", "");
\xhh is the Java regex syntax for matching a single character by its two-digit hexadecimal code.
Your error says Unicode: 0x10, so the pattern is \x10.

str = str.replace("\u0010", "");
Or, if you need a space in its place:
str = str.replace("\u0010", " ");
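If other control characters might show up as well, a more general option is to drop every character that XML 1.0 does not allow. The following is only a minimal sketch and is not taken from the answers above; the class and method names (XmlCharFilter, isLegalXml10, stripIllegal) are made up for illustration, and the ranges follow the Char production of the XML 1.0 specification.

```java
final class XmlCharFilter {

    // True if the code point is allowed by the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Copies the input, skipping every code point that XML 1.0 forbids.
    // Working on code points keeps valid surrogate pairs (e.g. emoji) intact.
    static String stripIllegal(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(XmlCharFilter::isLegalXml10)
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "before\u0010after"; // contains the 0x10 control character
        System.out.println(stripIllegal(dirty)); // prints "beforeafter"
    }
}
```

Whether silently dropping such characters is acceptable depends on the data; replacing them with a space, as in the answer above, is an equally reasonable choice.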

Related

Getting JsonException while using JSONML.toJSONObject()

I'm using JSONML for converting xml String to JSONObject.
This is my xml String
"<soapenv:Body xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\"><jsonArray><jsonElement><message>entity is deleted<\/message><errorCode>ENTITY_IS_DELETED<\/errorCode><\/jsonElement><jsonElement><message>entity is deleted<\/message><errorCode>ENTITY_IS_DELETED<\/errorCode><\/jsonElement><\/jsonArray><\/soapenv:Body>"
when I try JSONML.toJSONObject() It gives me
Caused by: org.json.JSONException: Bad character in a name at 32 [character 33 line 1]
at org.json.JSONTokener.syntaxError(JSONTokener.java:433)
at org.json.XMLTokener.nextToken(XMLTokener.java:288)
at org.json.JSONML.parse(JSONML.java:173)
at org.json.JSONML.toJSONObject(JSONML.java:286)
at org.json.JSONML.toJSONObject(JSONML.java:304)
at com.thbs.automaton.commonUtils.TestcaseUtils.compareXml(TestcaseUtils.java:144)
... 57 more
It's due to the escape character (\). I tried resolving this by removing all the \ characters, which solved my problem. However, I don't think it's good practice.
Can anyone suggest a better approach?
The "\" characters show that the original String is not an "XML String" but an "escaped XML String". You should find out why and how the XML String was escaped.
Maybe it is because it was transferred as JSON. In that case, you should first decode the original (JSON) String into its data value, that is, the XML String, with code like this:
String xmlString = jsonParser(originalString, String.class);
After that, run your code as before:
JSONML.toJSONObject(xmlString);

How to solve the IllegalDataException in jdom2 library?

I am using JDOM 2.0.6 and received this IllegalDataException:
Error in setText for tokenization:
It fails when calling the setText() method.
Element text = new Element("Text");
text.setText(doc.getText());
It seems 'text' contains some characters that it doesn't accept. Two examples:
Originally Posted by Yvette H( 45) Odd socks, yes, no undies yes, no coat yes, no shoes odd. 🏻
ParryOtter said: Posted
Should I specify an encoding somewhere, or is there some other reason?
In fact, you just have to wrap the text that contains illegal characters in CDATA:
Element text = new Element("Text");
text.setContent(new CDATA(doc.getText()));
The reverse operation is transparent in JDOM2: when reading text wrapped in CDATA, you won't have to unescape it.
For my tests, I added an illegal character at the end of my text by creating one from the hex value 0x2, like this:
String text = doc.getText();
int hex = 0x2;
text += (char) hex;

Illegal Character in XML are not being replaced

SOLUTION: So this was not an XML issue at all. My XML escapes were done properly; however, there was an encoding issue. I would like to share my solution with everyone, and I hope you find it useful.
public static String entityEncode(String text) throws UnsupportedEncodingException {
String result = text;
if (result == null) {
return result;
}
byte ptext[] = result.getBytes("ISO-8859-1");
String value = new String(ptext, "UTF-8");
String temp = XMLStringUtil.escapeControlChrs(value);
return temp;
}
EXPLANATION: The XML function above is for XML 1.0. We take the given text and convert it into a byte array using ISO-8859-1, since a String by itself does not carry the encoding of the bytes it was read from. We then create a new String from those bytes, decoding them as UTF-8. That is also why Java only reported the character reference error with &#: it couldn't recognize the character at fault. Now that the text is decoded as UTF-8, there are no issues and the XML escape proceeds properly!
EDIT: How do I print out all illegal XML characters in the provided string, according to the StringEscapeUtils.escapeXml parameters? The problem is that I don't want to escape everything, because it doesn't decode properly afterwards. So right now I just need to find out which characters in the text are invalid: the ones that are causing issues and need to be encoded.
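To illustrate why the re-decoding step helps, here is a small self-contained sketch. It assumes the text consisted of UTF-8 bytes that were wrongly decoded as ISO-8859-1 somewhere upstream; the class and method names (MojibakeDemo, reinterpretAsUtf8) are hypothetical, and StandardCharsets is used instead of charset names so no checked exception is needed.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {

    // Re-interprets a String that was wrongly decoded as ISO-8859-1
    // although the underlying bytes were really UTF-8.
    static String reinterpretAsUtf8(String wronglyDecoded) {
        // ISO-8859-1 maps every char 0..255 back to the same byte value,
        // so this recovers the original bytes unchanged.
        byte[] originalBytes = wronglyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String intended = "caf\u00e9"; // "cafe" with an accented e

        // Simulate the upstream bug: UTF-8 bytes decoded as ISO-8859-1.
        String garbled = new String(intended.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);

        System.out.println(garbled.length());                            // prints 5
        System.out.println(reinterpretAsUtf8(garbled).equals(intended)); // prints true
    }
}
```

This only repairs text that was damaged in exactly this way; applying it to a correctly decoded String would corrupt any non-ASCII characters.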
I have the following error message:
ERROR: 'Character reference "&#'
ERROR: 'com.sun.org.apache.xml.internal.utils.WrappedRuntimeException: Character reference "&#'
It does not tell me specifically which character it is, which is a problem.
I first do my original XML parse to convert the text to an XML document. After that, I sanitize further to remove illegal characters:
String xml10pattern = "[^"
+ "\u0009\r\n"
+ "\u0020-\uD7FF"
+ "\uE000-\uFFFD"
+ "\ud800\udc00-\udbff\udfff"
+ "]";
However, it's not removing them, so I'm not sure how to go about this. Currently I have:
String temp = entityEncode(temp);
String legal = temp.replaceAll(xml10pattern , "");
item.setResponseBody(legal);
entityEncode just uses a standard XML utility method, XMLStringUtil.escapeControlChrs, to escape characters. It is based on StringEscapeUtils.escapeXml and only adds escapes; nothing is removed. But something is being missed.
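For the question of how to find out which characters are invalid, one possible approach is to walk the string by code point and report everything outside the XML 1.0 character ranges. This is a sketch under that assumption, not the asker's code; IllegalXmlCharReport, isLegalXml10 and findIllegal are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;

final class IllegalXmlCharReport {

    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns one entry per illegal code point, formatted like "index 2: U+0002".
    // The index is the char offset in the String where the code point starts.
    static List<String> findIllegal(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isLegalXml10(cp)) {
                hits.add(String.format("index %d: U+%04X", i, cp));
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return hits;
    }

    public static void main(String[] args) {
        // Prints "index 2: U+0002" and then "index 6: U+0010".
        findIllegal("ok\u0002bad\u0010").forEach(System.out::println);
    }
}
```

Running this on the text before escaping shows whether any illegal characters are present at all, which helps separate a character problem from an encoding problem.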

conditional replaceAll java

I have HTML code with img src attributes pointing to URLs. Some have mysite.com/myimage.png as src, others have mysite.com/1234/12/12/myimage.png. I want to replace these URLs with a cache file path. I'm looking for something like this:
String website = "mysite.com";
String text = webContent.replaceAll(website+ "\\d{4}\\/\\d{2}\\/\\d{2}", String.valueOf(cacheDir));
This code, however, does not work when the URL does not have the extra date stamp. Does anyone know how I might achieve this? Thanks!
Try this one:
mysite\.com/(\d{4}/\d{2}/\d{2}/)?
Here ? means zero or one occurrence, so the date part is optional.
Note: use the escaped form \. to match a literal dot, because an unescaped . matches any character in a regex.
Sample code:
String[] webContents = new String[] { "mysite.com/myimage.png",
"mysite.com/1234/12/12/myimage.png" };
for (String webContent : webContents) {
String text = webContent.replaceAll("mysite\\.com/(\\d{4}/\\d{2}/\\d{2}/)?",
String.valueOf("mysite.com/abc/"));
System.out.println(text);
}
output:
mysite.com/abc/myimage.png
mysite.com/abc/myimage.png
You are missing a forward slash between the website (mysite.com) and the first four digits.
String text = webContent.replaceAll(Pattern.quote(website) + "/\\d{4}\\/\\d{2}\\/\\d{2}", String.valueOf(cacheDir));
I'd also recommend treating your website value as a literal (that is what the Pattern.quote part does).
Finally, you are also missing the last forward slash after the last two digits, so it won't be replaced, but that may be on purpose...
Try:
String text = webContent.replaceAll("(?<="+website+")(.*)(?=\\/)",
String.valueOf(cacheDir));
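The suggestions above can be combined into one sketch: quote the website so its dot is literal, make the date segment optional, and quote the replacement so that a cache path containing backslashes or dollar signs is inserted verbatim. The class name, method name and the cache directory values below are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CachePathRewriter {

    // Replaces "<website>/" plus an optional "dddd/dd/dd/" segment with cacheDir.
    static String rewrite(String content, String website, String cacheDir) {
        // Pattern.quote makes the dot in the website literal;
        // (?: ... )? makes the date stamp optional without capturing it.
        String regex = Pattern.quote(website) + "/(?:\\d{4}/\\d{2}/\\d{2}/)?";
        // quoteReplacement protects '\' and '$' in the replacement text.
        return content.replaceAll(regex, Matcher.quoteReplacement(cacheDir));
    }

    public static void main(String[] args) {
        String cacheDir = "/cache/images/"; // hypothetical cache location
        // Both lines print "/cache/images/myimage.png".
        System.out.println(rewrite("mysite.com/myimage.png", "mysite.com", cacheDir));
        System.out.println(rewrite("mysite.com/1234/12/12/myimage.png", "mysite.com", cacheDir));
    }
}
```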

Making sure a path string is a valid java path string

This is how I try to make sure a path given in a properties file is a valid Java path (with \\ instead of \):
String path = props.getProperty("path");
if (path.length()>1) path=path.replaceAll("\\\\", "\\");
if (path.length()>1) path=path.replaceAll("\\", "\\\\");
In the first replace I'm making sure that if the path is already valid (has \\ instead of \), it won't get doubled to \\\\ instead of \\ in the second replace...
Anyway, I get this weird exception:
java.lang.StringIndexOutOfBoundsException: String index out of range: 1
at java.lang.String.charAt(Unknown Source)
at java.util.regex.Matcher.appendReplacement(Unknown Source)
at java.util.regex.Matcher.replaceAll(Unknown Source)
at java.lang.String.replaceAll(Unknown Source)
at com.hw.Launcher.main(Launcher.java:56)
Can anyone tell me why?
replaceAll expects regular expressions; use replace instead.
You can find the JavaDocs here
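Some background on the exception: in replaceAll, the backslash is special in the replacement string as well as in the pattern. The replacement "\\" in the question is a single backslash with nothing after it to escape, which is why the stack trace ends in Matcher.appendReplacement. A minimal sketch of two ways to double backslashes safely follows; the class and method names are illustrative only.

```java
import java.util.regex.Matcher;

final class BackslashReplaceDemo {

    // Literal replacement: String.replace uses no regex and gives
    // backslashes no special meaning on either side.
    static String doubleBackslashes(String path) {
        return path.replace("\\", "\\\\");
    }

    // Same result with replaceAll: the pattern "\\\\" is the regex for one
    // backslash, and quoteReplacement escapes the two-backslash replacement.
    static String doubleBackslashesRegex(String path) {
        return path.replaceAll("\\\\", Matcher.quoteReplacement("\\\\"));
    }

    public static void main(String[] args) {
        String path = "c:\\this\\that"; // the value has single backslashes
        // Both lines print the path with every backslash doubled.
        System.out.println(doubleBackslashes(path));
        System.out.println(doubleBackslashesRegex(path));
    }
}
```

For purely literal substitutions like this one, String.replace is the simpler choice.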
If you want to be sure the path is valid, how about trying
File f = new File("c:\\this\\that");
f.getCanonicalPath();
The File class is made for taking apart paths. It's probably the best way to verify that a path is valid.
(Let me spell it out for newbies too.)
In a text file or in a String value, normally only a single backslash should occur.
In Java source code, inside a string or character literal, the backslash is the escape character, giving the next character a special meaning. A backslash itself must be written doubled, as \\. The string value itself will contain only one backslash character.
If you read special text that uses backslash escaping (like \n for a line break), then use the non-regex replace of String:
// First, the other escapes:
path = path.replace("\\n", "\n"); // Text `\n` -> linefeed
path = path.replace("\\t", "\t"); // Text `\t` -> tab
// Then the escaped backslash:
path = path.replace("\\\\", "\\"); // Text `\\` -> backslash
For file paths only the last one might make sense, but it should not be needed.
