Can Java String.toUpperCase() ever fail?

Situation: There is a Java ESB that takes input (a family name) from a Vaadin web form and should guarantee that it is upper-cased before it is saved to the DB.
I was assigned to investigate a reported issue: lower-case characters sometimes appear in the DB. I have learned that the program calls String.toUpperCase() just before saving the data through EntityManager (it is the only place that modifies the received data).
So I wonder whether this is enough. So far I haven't found any "well-known" problems related to toUpperCase(), but I want to be sure.
So the question: does String.toUpperCase() always do its job? Or are there characters or circumstances in which an error may occur and the letters may not be upper-cased?

Can Java String.toUpperCase() ever fail?
It depends on whether you are passing in locale sensitive Strings (see below).
In the implementation of java.lang.String, the no-argument method simply delegates to the default locale:
public String toUpperCase() {
return toUpperCase(Locale.getDefault());
}
toUpperCase(Locale) converts all of the characters in this String to upper case using the rules of the given Locale. Case mapping is based on the Unicode Standard version specified by the Character class. Since case mappings are not always 1:1 char mappings, the resulting String may be a different length than the original String.
This method is locale sensitive, and may produce unexpected results if used for strings that are intended to be interpreted locale independently. Examples are programming language identifiers, protocol keys, and HTML tags.
To obtain correct results for locale insensitive strings, use toUpperCase(Locale.ENGLISH).
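To make the locale sensitivity concrete, here is a minimal sketch. The class and method names are made up for illustration, and Locale.ROOT is used as the locale-neutral choice (Locale.ENGLISH, as suggested above, gives the same result for these inputs). It shows the Turkish dotted capital I and a mapping that changes the length of the string.

```java
import java.util.Locale;

final class UpperCaseLocaleDemo {
    // Upper-cases with an explicit locale instead of the JVM default.
    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // Turkish rules map 'i' to the dotted capital I (U+0130), not to 'I'.
        System.out.println(upper("istanbul", turkish));
        // Locale-neutral rules map 'i' to plain 'I'.
        System.out.println(upper("istanbul", Locale.ROOT));
        // Mappings are not always 1:1: the German sharp s becomes "SS".
        System.out.println(upper("stra\u00dfe", Locale.ROOT));
    }
}
```

If the server's default locale is Turkish, Azerbaijani or Lithuanian, the no-argument toUpperCase() can therefore store something other than what an English-speaking reader expects, even though no exception is thrown.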
In case you are interested in how toUpperCase(Locale) is implemented:
public String toUpperCase(Locale locale) {
if (locale == null) {
throw new NullPointerException();
}
int firstLower;
final int len = value.length;
/* Now check if there are any characters that need to be changed. */
scan: {
for (firstLower = 0 ; firstLower < len; ) {
int c = (int)value[firstLower];
int srcCount;
if ((c >= Character.MIN_HIGH_SURROGATE)
&& (c <= Character.MAX_HIGH_SURROGATE)) {
c = codePointAt(firstLower);
srcCount = Character.charCount(c);
} else {
srcCount = 1;
}
int upperCaseChar = Character.toUpperCaseEx(c);
if ((upperCaseChar == Character.ERROR)
|| (c != upperCaseChar)) {
break scan;
}
firstLower += srcCount;
}
return this;
}
/* result may grow, so i+resultOffset is the write location in result */
int resultOffset = 0;
char[] result = new char[len]; /* may grow */
/* Just copy the first few upperCase characters. */
System.arraycopy(value, 0, result, 0, firstLower);
String lang = locale.getLanguage();
boolean localeDependent =
(lang == "tr" || lang == "az" || lang == "lt");
char[] upperCharArray;
int upperChar;
int srcChar;
int srcCount;
for (int i = firstLower; i < len; i += srcCount) {
srcChar = (int)value[i];
if ((char)srcChar >= Character.MIN_HIGH_SURROGATE &&
(char)srcChar <= Character.MAX_HIGH_SURROGATE) {
srcChar = codePointAt(i);
srcCount = Character.charCount(srcChar);
} else {
srcCount = 1;
}
if (localeDependent) {
upperChar = ConditionalSpecialCasing.toUpperCaseEx(this, i, locale);
} else {
upperChar = Character.toUpperCaseEx(srcChar);
}
if ((upperChar == Character.ERROR)
|| (upperChar >= Character.MIN_SUPPLEMENTARY_CODE_POINT)) {
if (upperChar == Character.ERROR) {
if (localeDependent) {
upperCharArray =
ConditionalSpecialCasing.toUpperCaseCharArray(this, i, locale);
} else {
upperCharArray = Character.toUpperCaseCharArray(srcChar);
}
} else if (srcCount == 2) {
resultOffset += Character.toChars(upperChar, result, i + resultOffset) - srcCount;
continue;
} else {
upperCharArray = Character.toChars(upperChar);
}
/* Grow result if needed */
int mapLen = upperCharArray.length;
if (mapLen > srcCount) {
char[] result2 = new char[result.length + mapLen - srcCount];
System.arraycopy(result, 0, result2, 0, i + resultOffset);
result = result2;
}
for (int x = 0; x < mapLen; ++x) {
result[i + resultOffset + x] = upperCharArray[x];
}
resultOffset += (mapLen - srcCount);
} else {
result[i + resultOffset] = (char)upperChar;
}
}
return new String(result, 0, len + resultOffset);
}

Without further information about which character (the one you consider to be lower-case) is stored in the database, I would guess that the cause is similar to the cases explained in these blog posts:
by Heinz Kabutz
http://www.javaspecialists.eu/archive/Issue209.html
http://www.javaspecialists.eu/archive/Issue211.html
by Elliotte Rusty Harold
http://cafe.elharo.com/blogroll/turkish/
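Edit: It could be that the database contains a character that looks similar (depending on the font) to a Latin character and for which no upper-case letter exists.
One example is GREEK LETTER YOT (U+03F3), which looks similar to LATIN SMALL LETTER J and, in the Unicode version supported by Java at the time of writing, has no upper-case letter. (Unicode 7.0 later added GREEK CAPITAL LETTER YOT, U+037F, so newer Java versions may print a different result for it.)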
Small snippet for demonstration.
int[] codePoints = { 0x03F3, 0x006A};
for (int codePoint : codePoints) {
char lowerCase = (char) Character.toLowerCase(codePoint);
char upperCase = (char) Character.toUpperCase(codePoint);
System.out.printf("Unicode name: %s%n", Character.getName(codePoint));
System.out.printf("lowercase : %s%n", lowerCase);
System.out.printf("uppercase : %s (%s)%n", upperCase,
Character.isUpperCase(upperCase));
}
The output is
Unicode name: GREEK LETTER YOT
lowercase : ϳ
uppercase : ϳ (false)
Unicode name: LATIN SMALL LETTER J
lowercase : j
uppercase : J (true)
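To track down such characters in the stored data, one option is to upper-case the value and then list whatever is still classified as lower-case. The following is only a sketch: the class and method names are invented, and U+0138 (LATIN SMALL LETTER KRA) is used as the example because it has no upper-case mapping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class LowerCaseLeftovers {
    // Returns the code points that are still lower-case letters after upper-casing.
    static List<Integer> stillLowerCase(String input) {
        String upper = input.toUpperCase(Locale.ROOT);
        List<Integer> leftovers = new ArrayList<>();
        for (int i = 0; i < upper.length(); ) {
            int cp = upper.codePointAt(i);
            if (Character.isLowerCase(cp)) {
                leftovers.add(cp);
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return leftovers;
    }

    public static void main(String[] args) {
        for (int cp : stillLowerCase("m\u0138ller")) {
            System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
        }
    }
}
```

Running a check like this over the suspicious rows would show the exact code points involved, which is more reliable than judging by how the glyphs look.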

Related

the counter is not updating with my loop and I don't know why

I'm trying to practice for a technical test where I have to count the number of characters in a DNA sequence, but no matter what I do the counter won't update. This is really frustrating, as I learnt to code with Ruby, where it would update, but Java seems to have an issue. I know there's something wrong with my syntax, but for the life of me I can't figure it out.
public class DNA {
public static void main(String[] args) {
String dna1 = "ATGCGATACGCTTGA";
String dna2 = "ATGCGATACGTGA";
String dna3 = "ATTAATATGTACTGA";
String dna = dna1;
int aCount = 0;
int cCount = 0;
int tCount = 0;
for (int i = 0; i <= dna.length(); i++) {
if (dna.substring(i) == "A") {
aCount+= 1;
}
else if (dna.substring(i) == "C") {
cCount++;
}
else if (dna.substring(i) == "T") {
tCount++;
}
System.out.println(aCount);
}
}
}
It just keeps returning zero instead of adding one to it if the conditions are met and reassigning the value.
Good time to learn some basic debugging!
Let's look at what's actually in that substring you're looking at. Add
System.out.println(dna.substring(i));
to your loop. You'll see:
ATGCGATACGCTTGA
TGCGATACGCTTGA
GCGATACGCTTGA
CGATACGCTTGA
GATACGCTTGA
ATACGCTTGA
TACGCTTGA
ACGCTTGA
CGCTTGA
GCTTGA
CTTGA
TTGA
TGA
GA
A
So, substring doesn't mean what you thought it did - it's taking the substring starting at that index and going to the end of the string. Only the last character has a chance of matching your conditions.
Though, that last one still won't match your condition, which is understandably surprising if you're new to the language. In Java, == is "referential equality" - when applied to non-primitives, it's asserting the two things occupy the same location in memory. For strings in particular, this can give surprising and inconsistent results. Java keeps a special section of memory for strings, and tries to avoid duplicates (but doesn't try that hard.) The important takeaway is that string1.equals(string2) is the correct way to check.
It's a good idea to do some visibility and sanity checks like that, when your program isn't doing what you think it is. With a little practice you'll get a feel for what values to inspect.
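Both points can be seen in a few lines. This is a small sketch with made-up names, not part of the original program; it contrasts substring(i) with the two-argument form and == with equals.

```java
final class SubstringEqualityDemo {
    // Single-character substring at index i (what the question intended).
    static String letterAt(String s, int i) {
        return s.substring(i, i + 1);
    }

    public static void main(String[] args) {
        String dna = "ATGCGATACGCTTGA";
        System.out.println(dna.substring(12));   // "TGA": from index 12 to the end
        System.out.println(letterAt(dna, 12));   // "T": exactly one character

        String built = new String("A");          // a distinct object with the same content
        System.out.println(built == "A");        // false: different objects
        System.out.println(built.equals("A"));   // true: same characters
    }
}
```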
Edward Peters is right about misuse of substring that returns a String.
In Java, a String literal must be placed between double quotes. A String is an object, and you must use the equals method to compare two objects:
String a = "first string";
String b = "second string";
boolean result = a.equals(b);
In your case, you should consider using charAt(int) instead. A char literal must be placed between single quotes. A char is a primitive type (not an object), and you must use a double equals sign to compare two of them:
char a = '6';
char b = 't';
boolean result = (a==b);
So, your code should look like this:
public class DNA {
public static void main(String[] args) {
String dna1 = "ATGCGATACGCTTGA";
String dna2 = "ATGCGATACGTGA";
String dna3 = "ATTAATATGTACTGA";
String dna = dna1;
int aCount = 0;
int cCount = 0;
int tCount = 0;
for (int i = 0; i < dna.length(); i++) {
if (dna.charAt(i) == 'A') {
aCount += 1;
} else if (dna.charAt(i) == 'C') {
cCount++;
} else if (dna.charAt(i) == 'T') {
tCount++;
}
System.out.println(aCount);
}
}
}
substring(i) doesn't select one character but all the characters from i to the end of the string. You also made the wrong comparison: == checks object identity, while you want to check whether the two strings are equal.
You could substitute
if (dna.substring(i) == "A")
with:
if (dna.charAt(i) == 'A')
This works because charAt(i) returns a primitive char, so you can correctly compare it to 'A' using ==.
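If it helps, the same idea can be pulled into a small helper so that each base is counted with one call. This is just a sketch; the class and method names are not from the question.

```java
final class NucleotideCounter {
    // Counts how often the given base occurs, comparing primitive chars with ==.
    static int countOf(String dna, char base) {
        int count = 0;
        for (int i = 0; i < dna.length(); i++) {
            if (dna.charAt(i) == base) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String dna = "ATGCGATACGCTTGA";
        System.out.println("A: " + countOf(dna, 'A')); // 4
        System.out.println("C: " + countOf(dna, 'C')); // 3
        System.out.println("T: " + countOf(dna, 'T')); // 4
    }
}
```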
One of the problems, as stated, was the way you are comparing Strings. Here is a way that uses a switch statement and iterates over an array of characters (the arrow-style case labels need Java 14 or later). I put all the strings in an array. If you only have one string, the outer loop can be eliminated.
public class DNA {
public static void main(String[] args) {
String dna1 = "ATGCGATACGCTTGA";
String dna2 = "ATGCGATACGTGA";
String dna3 = "ATTAATATGTACTGA";
String[] dnaStrings =
{dna1,dna2,dna3};
int aCount = 0;
int cCount = 0;
int tCount = 0;
int gCount = 0;
for (String dnaString : dnaStrings) {
for (char c : dnaString.toCharArray()) {
switch (c) {
case 'A' -> aCount++;
case 'T' -> tCount++;
case 'C' -> cCount++;
case 'G' -> gCount++;
}
}
}
System.out.println("A's = " + aCount);
System.out.println("T's = " + tCount);
System.out.println("C's = " + cCount);
System.out.println("G's = " + gCount);
}
}
prints
A's = 14
T's = 13
C's = 6
G's = 10

How to access an array when it is within an arraylist?

The overall goal of what I'm trying to do is to compare a string to index 0 of an array (that is held within an arraylist), and if the strings are the same (ignoring case), call a method that matches the case of the string to the translated word (held at index 1 of the array inside an arraylist). When I run this code and I print out the contents of my translated arraylist, I get all "no match" characters. I'm assuming this is because I'm not accessing the index I want in the correct manner. Please help!
public static String translate(String word, ArrayList<String[]> wordList) {
if (word == "." || word == "!" || word == ";" || word == ":") {
return word;
}
for (int i = 0; i < wordList.size(); i++) {
String origWord = wordList.get(i)[0];
String transWord = wordList.get(i)[1];
if (word.equalsIgnoreCase(origWord)) { //FIXME may need to change if you need to switch from translated to original
String translated = matchCase(word, transWord);
return translated;
}
}
String noMatch = Character.toString(Config.LINE_CHAR);
return noMatch;
}
Sample Data and expected result
word = "hello"
wordList.get(i)[0] = "Hello"
wordList.get(i)[1] = "Hola"
(word and wordList.get(i)[0] match, so the next step is executed)
match case method is called and returns the translated word with the same case as the original word ->
translated = "hola"
returns the translated word.
(the for loop iterates through the entire wordList until it finds a match, then it calls the translate method)
Match Case's Code
public static String matchCase(String template, String original) {
String matched = "";
if (template.length() > original.length()) {
for (int i = 1; i <= original.length(); i++) {
if (template.charAt(i-1) >= 'a' && template.charAt(i-1) <= 'z') {
if (i == original.length()) {
matched += original.substring(original.length() - 1).toLowerCase();
}
else {
matched += original.substring((i-1), i).toLowerCase();
}
}
else if (template.charAt(i-1) >= 'A' && template.charAt(i-1) <= 'Z') {
if (i == original.length()) {
matched += original.substring(original.length() - 1).toUpperCase();
}
else {
matched += original.substring((i-1), i).toUpperCase();
}
}
}
return matched;
}
else if (template.length() < original.length()) {
int o;
original.toLowerCase();
for (int i = 1; i <= template.length(); i++) {
if (template.charAt(i-1) >= 'a' && template.charAt(i-1) <= 'z') {
if (i == template.length()) {
matched += original.substring(original.length() - 1).toLowerCase();
}
else {
matched += original.substring((i-1), i).toLowerCase();
}
}
else if (template.charAt(i-1) >= 'A' && template.charAt(i-1) <= 'Z') {
if (i == template.length()) {
matched += original.substring(original.length() - 1).toUpperCase();
}
else {
matched += original.substring((i-1), i).toUpperCase();
}
}
String newMatched = matched + original.substring(i, original.length() - 1);
matched = newMatched;
newMatched = "";
}
return matched;
}
return original;
}
I have tested your code and it works rather well with the example you have provided, so I cannot reproduce your bug.
There are, however, some bugs to point out and improvements to suggest:
matchCase fails when the template is shorter than the translated word.
Never compare strings with ==. Use the equals method instead, and look up why.
This is not really important, but why is noMatch computed on every call? Why don't you declare it once as a constant?
public static final String NO_MATCH = String.valueOf(Config.LINE_CHAR);
More importantly, I think that matchCase is not really pertinent by design and is overcomplicated. You should just determine whether the word to translate is all lower case, all upper case, or has its first letter in upper case and the following letters in lower case. What you do (comparing the case letter by letter) is not really pertinent when the lengths differ.
When you consider a single character, use charAt instead of substring; it is simpler and faster.
You might also have a look at regular expressions to analyze your Strings.
Have you considered Maps for your translation lookup?
...
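A rough sketch of how the Map lookup and the simplified case matching could fit together is shown below. Everything here is an assumption for illustration: the class name, the add method, and the "-" marker that stands in for Config.LINE_CHAR. Punctuation handling from the original translate method is left out.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class MapTranslator {
    // Hypothetical stand-in for Character.toString(Config.LINE_CHAR).
    static final String NO_MATCH = "-";

    // Keys are stored in lower case so the lookup ignores case.
    private final Map<String, String> dictionary = new HashMap<>();

    void add(String original, String translation) {
        dictionary.put(original.toLowerCase(Locale.ROOT), translation);
    }

    String translate(String word) {
        String translation = dictionary.get(word.toLowerCase(Locale.ROOT));
        return translation == null ? NO_MATCH : matchCase(word, translation);
    }

    // Applies one of three patterns: ALL CAPS, Capitalised, or lower case.
    static String matchCase(String template, String translation) {
        if (template.isEmpty() || translation.isEmpty()) {
            return translation;
        }
        String lower = translation.toLowerCase(Locale.ROOT);
        boolean allCaps = template.equals(template.toUpperCase(Locale.ROOT))
                && !template.equals(template.toLowerCase(Locale.ROOT));
        if (allCaps) {
            return translation.toUpperCase(Locale.ROOT);
        }
        if (Character.isUpperCase(template.charAt(0))) {
            return Character.toUpperCase(lower.charAt(0)) + lower.substring(1);
        }
        return lower;
    }
}
```

With this shape the loop over the ArrayList of arrays disappears, and the length of the translated word no longer matters for the case handling.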

Checking if a string matches all characters but one of another string

I have a list of strings, and I want to check each string's characters against every other string to see whether all of its characters are identical except for one.
For instance a check that would return true would be checking
rock against lock
clock and flock have one character that is different, no more no less.
rock against dent will obviously return false.
I have been thinking about first looping through the list and then having a secondary loop within that one to check the first string against the second.
And then using split(""); to create two arrays containing the characters of each string and then checking the array elements against each other (i.e. comparing each string with the same position in the other array 1-1 2-2 etc...) and so long as only one character comparison fails then the check for those two strings is true.
Anyway I have a lot of strings (4029) and considering what I am thinking of implementing at the moment would contain 3 loops each within the other that would result in a cubic loop(?) which would take a long long time with that many elements wouldn't it?
Is there an easier way to do this? Or will this method actually work okay? Or -hopefully not- but is there some sort of potential logical flaw in the solution I have proposed?
Thanks a lot!
Why not do it the naive way?
boolean matchesAlmost(String str1, String str2) {
if (str1.length() != str2.length())
return false;
int same = 0;
for (int i = 0; i < str1.length(); ++i) {
if (str1.charAt(i) == str2.charAt(i))
same++;
}
return same == str1.length() - 1;
}
Now you can just use a quadratic algorithm to check every string against every other.
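For what it's worth, the quadratic pass could look roughly like the sketch below (names are invented, and the comparison exits early once a second difference is found). With 4029 strings that is about 8.1 million pair comparisons, each over a handful of characters, which should be perfectly manageable.

```java
import java.util.ArrayList;
import java.util.List;

final class AlmostMatchPairs {
    // True if both strings have the same length and differ in exactly one position.
    static boolean differByExactlyOne(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        int differences = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i) && ++differences > 1) {
                return false; // second mismatch: no need to look further
            }
        }
        return differences == 1;
    }

    // Compares every string with every later string, so each pair is checked once.
    static List<String[]> findPairs(List<String> words) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            for (int j = i + 1; j < words.size(); j++) {
                if (differByExactlyOne(words.get(i), words.get(j))) {
                    pairs.add(new String[] {words.get(i), words.get(j)});
                }
            }
        }
        return pairs;
    }
}
```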
Assuming the length of two strings are equal
String str1 = "rock";
String str2 = "lick";
if( str1.length() != str2.length() )
System.out.println( "failed");
else{
if( str2.contains( str1.substring( 0, str1.length()-1)) || str2.contains( str1.substring(1, str1.length() )) ){
System.out.println( "Success ");
}
else{
System.out.println( "Failed");
}
}
Not sure if this is the best approach but this one works even when two strings are not of same length. For example : cat & cattp They differ by one character p and t is repeated. Looks like O(n) time solution using additional space for hashmap & character arrays.
/**
* Returns true if two strings differ by one character
* @param s1 input string1
* @param s2 input string2
* @return true if strings differ by one character
*/
boolean checkIfTwoStringDifferByOne(String s1, String s2) {
char[] c1, c2;
if(s1.length() < s2.length()){
c1 = s1.toCharArray();
c2 = s2.toCharArray();
}else{
c1 = s2.toCharArray();
c2 = s1.toCharArray();
}
HashSet<Character> hs = new HashSet<Character>();
for (int i = 0; i < c1.length; i++) {
hs.add(c1[i]);
}
int count = 0;
for (int j = 0; j < c2.length; j++) {
if (! hs.contains(c2[j])) {
count = count +1;
}
}
if(count == 1)
return true;
return false;
}
Assuming that all the strings have the same length, I think this would help:
public boolean differByOne(String source, String destination)
{
int difference = 0;
for(int i=0;i<source.length();i++)
{
if(source.charAt(i)!=destination.charAt(i))
{
difference++;
if(difference>1)
{
return false;
}
}
}
return difference == 1;
}
Best way is to concatenate strings together one forward and other one in reverse order. Then check in single loop for both ends matching chars and also start from middle towards ends matching char. If more than 2 chars mismatch break.
If one mismatch stop and wait for the next one to complete if it reaches the same position then it matches otherwise just return false.
public static void main(String[] args) {
New1 x = new New1();
x.setFunc();
}
static void setFunc() {
Set s = new HashSet < Character > ();
String input = " aecd";
String input2 = "abcd";
String input3 = new StringBuilder(input2).reverse().toString();
String input4 = input.concat(input3);
int length = input4.length();
System.out.println(input4);
int flag = 0;
for (int i = 1, j = length - 1; j > i - 1; i++, j--) {
if (input4.charAt(i) != input4.charAt(j)) {
System.out.println(input4.charAt(i) + " doesnt match with " + input4.charAt(j));
if (input4.charAt(i + 1) != input4.charAt(j)) {
System.out.println(input4.charAt(i + 1) + " doesnt match with " + input4.charAt(j));
flag = 1;
continue;
} else if (input4.charAt(i) != input4.charAt(j - 1)) {
System.out.println(input4.charAt(i) + " doesnt match with " + input4.charAt(j - 1));
flag = 1;
break;
} else if (input4.charAt(i + 1) != input4.charAt(j - 1) && i + 1 <= j - 1) {
System.out.println(input4.charAt(i + 1) + " doesnt match with xxx " + input4.charAt(j - 1));
flag = 1;
break;
}
} else {
continue;
}
}
if (flag == 0) {
System.out.println("Strings differ by one place");
} else {
System.out.println("Strings does not match");
}
}

How to remove surrogate characters in Java?

I am facing a situation where I get surrogate characters in text that I am saving to MySQL 5.1. As UTF-16 is not supported there, I want to remove these surrogate pairs manually with a Java method before saving the text to the database.
I have written the following method for now and I am curious to know if there is a direct and optimal way to handle this.
Thanks in advance for your help.
public static String removeSurrogates(String query) {
StringBuffer sb = new StringBuffer();
for (int i = 0; i < query.length() - 1; i++) {
char firstChar = query.charAt(i);
char nextChar = query.charAt(i+1);
if (Character.isSurrogatePair(firstChar, nextChar) == false) {
sb.append(firstChar);
} else {
i++;
}
}
if (Character.isHighSurrogate(query.charAt(query.length() - 1)) == false
&& Character.isLowSurrogate(query.charAt(query.length() - 1)) == false) {
sb.append(query.charAt(query.length() - 1));
}
return sb.toString();
}
Here's a couple things:
Character.isSurrogate(char c):
A char value is a surrogate code unit if and only if it is either a low-surrogate code unit or a high-surrogate code unit.
Checking for pairs seems pointless, why not just remove all surrogates?
x == false is equivalent to !x
StringBuilder is better in cases where you don't need synchronization (like a variable that never leaves local scope).
I suggest this:
public static String removeSurrogates(String query) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < query.length(); i++) {
char c = query.charAt(i);
// !isSurrogate(c) in Java 7
if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
sb.append(c);
}
}
return sb.toString();
}
Breaking down the if statement
You asked about this statement:
if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
sb.append(c);
}
One way to understand it is to break each operation into its own function, so you can see that the combination does what you'd expect:
static boolean isSurrogate(char c) {
return Character.isHighSurrogate(c) || Character.isLowSurrogate(c);
}
static boolean isNotSurrogate(char c) {
return !isSurrogate(c);
}
...
if (isNotSurrogate(c)) {
sb.append(c);
}
Java strings are stored as sequences of 16-bit chars, but what they represent is sequences of unicode characters. In unicode terminology, they are stored as code units, but model code points. Thus, it's somewhat meaningless to talk about removing surrogates, which don't exist in the character / code point representation (unless you have rogue single surrogates, in which case you have other problems).
Rather, what you want to do is to remove any characters which will require surrogates when encoded. That means any character which lies beyond the basic multilingual plane. You can do that with a simple regular expression:
return query.replaceAll("[^\u0000-\uffff]", "");
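The same filtering can also be written in terms of code points rather than a regular expression. The sketch below is one possible way to do it (class and method names are made up, and it needs Java 8 for codePoints()); it additionally drops unpaired surrogates, which the code point view would otherwise let through.

```java
final class BmpOnly {
    // Keeps only code points in the Basic Multilingual Plane and drops
    // anything that would need a surrogate pair, plus any unpaired surrogate.
    static String removeSupplementary(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        input.codePoints()
             .filter(cp -> Character.isBmpCodePoint(cp)
                     && !Character.isSurrogate((char) cp))
             .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F600 is encoded as the surrogate pair D83D DE00.
        System.out.println(removeSupplementary("a\uD83D\uDE00b")); // ab
    }
}
```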
Why not simply:
for (int i = 0; i < query.length(); i++) {
char c = query.charAt(i);
if (!Character.isHighSurrogate(c) && !Character.isLowSurrogate(c))
sb.append(c);
}
You should probably replace them with "?" instead of erasing them outright.
Just curious: if a char is a high surrogate, is there a need to check the next one? It is supposed to be a low surrogate. The modified version would be:
public static String removeSurrogates(String query) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < query.length(); i++) {
char ch = query.charAt(i);
if (Character.isHighSurrogate(ch))
i++; // skip the next char, as it's supposed to be the low surrogate
else
sb.append(ch);
}
return sb.toString();
}
If you want to remove them, all of these solutions are useful,
but if you want to replace them, the code below is better:
StringBuffer sb = new StringBuffer();
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if(Character.isHighSurrogate(c)){
sb.append('*');
}else if(!Character.isLowSurrogate(c)){
sb.append(c);
}
}
return sb.toString();

Control code 0x6 causing XML error

I have a Java application that fetches data as XML, but once in a while the data contains some sort of control code:
An invalid XML character (Unicode: 0x6) was found in the CDATA section.
org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x6) was found in the CDATA section.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at domain.Main.processLogFromUrl(Main.java:342)
at domain.Main.<init>(Main.java:67)
at domain.Main.main(Main.java:577)
Can anyone explain what exactly this control code does, as I cannot find much info?
Thanks in advance.
You need to write a FilterInputStream to filter the data before the SAX parser gets it. It must either remove or recode the bad data.
Apache have a super-flexible example. You may wish to put together a much simpler one.
Here's one of mine which does other cleaning up but I am sure it will be a good start.
/* Cleans up often very bad xml.
*
* 1. Strips leading white space.
* 2. Recodes £ etc to &#...;.
* 3. Recodes lone & as &amp;.
*
*/
public class XMLInputStream extends FilterInputStream {
private static final int MIN_LENGTH = 2;
// Everything we've read.
StringBuilder red = new StringBuilder();
// Data I have pushed back.
StringBuilder pushBack = new StringBuilder();
// How much we've given them.
int given = 0;
// How much we've read.
int pulled = 0;
public XMLInputStream(InputStream in) {
super(in);
}
public int length() {
// NB: This is a Troll length (i.e. it goes 1, 2, many) so 2 actually means "at least 2"
try {
StringBuilder s = read(MIN_LENGTH);
pushBack.append(s);
return s.length();
} catch (IOException ex) {
log.warning("Oops ", ex);
}
return 0;
}
private StringBuilder read(int n) throws IOException {
// Input stream finished?
boolean eof = false;
// Read that many.
StringBuilder s = new StringBuilder(n);
while (s.length() < n && !eof) {
// Always get from the pushBack buffer.
if (pushBack.length() == 0) {
// Read something from the stream into pushBack.
eof = readIntoPushBack();
}
// Pushback only contains deliverable codes.
if (pushBack.length() > 0) {
// Grab one character
s.append(pushBack.charAt(0));
// Remove it from pushBack
pushBack.deleteCharAt(0);
}
}
return s;
}
// Returns true at eof.
// Might not actually push back anything but usually will.
private boolean readIntoPushBack() throws IOException {
// File finished?
boolean eof = false;
// Next char.
int ch = in.read();
if (ch >= 0) {
// Discard whitespace at start?
if (!(pulled == 0 && isWhiteSpace(ch))) {
// Good code.
pulled += 1;
// Parse out the &stuff;
if (ch == '&') {
// Process the &
readAmpersand();
} else {
// Not an '&', just append.
pushBack.append((char) ch);
}
}
} else {
// Hit end of file.
eof = true;
}
return eof;
}
// Deal with an ampersand in the stream.
private void readAmpersand() throws IOException {
// Read the whole word, up to and including the ;
StringBuilder reference = new StringBuilder();
int ch;
// Should end in a ';'
for (ch = in.read(); isAlphaNumeric(ch); ch = in.read()) {
reference.append((char) ch);
}
// Did we tidily finish?
if (ch == ';') {
// Yes! Translate it into a &#nnn; code.
String code = XML.hash(reference);
if (code != null) {
// Keep it.
pushBack.append(code);
} else {
throw new IOException("Invalid/Unknown reference '&" + reference + ";'");
}
} else {
// Did not terminate properly!
// Perhaps an & on its own or a malformed reference.
// Either way, escape the &
pushBack.append("&amp;").append(reference).append((char) ch);
}
}
private void given(CharSequence s, int wanted, int got) {
// Keep track of what we've given them.
red.append(s);
given += got;
log.finer("Given: [" + wanted + "," + got + "]-" + s);
}
@Override
public int read() throws IOException {
StringBuilder s = read(1);
given(s, 1, 1);
return s.length() > 0 ? s.charAt(0) : -1;
}
@Override
public int read(byte[] data, int offset, int length) throws IOException {
int n = 0;
StringBuilder s = read(length);
for (int i = 0; i < Math.min(length, s.length()); i++) {
data[offset + i] = (byte) s.charAt(i);
n += 1;
}
given(s, length, n);
return n > 0 ? n : -1;
}
@Override
public String toString() {
String s = red.toString();
String h = "";
// Hex dump the small ones.
if (s.length() < 8) {
Separator sep = new Separator(" ");
for (int i = 0; i < s.length(); i++) {
h += sep.sep() + Integer.toHexString(s.charAt(i));
}
}
return "[" + given + "]-\"" + s + "\"" + (h.length() > 0 ? " (" + h + ")" : "");
}
private boolean isWhiteSpace(int ch) {
switch (ch) {
case ' ':
case '\r':
case '\n':
case '\t':
return true;
}
return false;
}
private boolean isAlphaNumeric(int ch) {
return ('a' <= ch && ch <= 'z')
|| ('A' <= ch && ch <= 'Z')
|| ('0' <= ch && ch <= '9');
}
}
Quite why you've got that character will depend on what the data is meant to represent. (Apparently it's ACK, but that's odd to represent in a file...) However, the important point is that it makes the XML invalid - you simply can't represent that character in XML.
From the XML 1.0 spec, section 2.2:
Character Range
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
| [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Note how this excludes Unicode values below U+0020 other than U+0009 (tab), U+000A (line-feed) and U+000D (carriage return).
If you have any influence over the data coming back, you should change it to return valid XML. If not, you'll have to do some preprocessing on it before parsing it as XML. Quite what you'll want to do with unwanted control characters depends on what meaning they have in your situation.
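If removing the offending characters is acceptable in your case, the preprocessing can be as small as the sketch below. It is only an illustration: the names are invented, it works on text that has already been read into a String, and it simply drops anything outside the Char production quoted above.

```java
final class XmlCharFilter {
    // True if the code point matches the Char production of XML 1.0, section 2.2.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the text with every character that XML 1.0 forbids removed.
    static String stripInvalid(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints()
            .filter(XmlCharFilter::isValidXmlChar)
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripInvalid("a\u0006b")); // ab
    }
}
```

The cleaned String could then be handed to the parser through an InputSource wrapping a StringReader, instead of letting the parser read the URL directly.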
Try to define your XML as version 1.1:
<?xml version="1.1"?>
