Handling Strings with octal ASCII code (in Java)

I'm having some trouble with a text file that contains strings like this one:
Grandchamp-le-Ch\303\242teau
It's the name of a Wikipedia page, by the way. I think the two octal escapes together represent "â".
Is there any piece of software that easily converts the string above into
Grandchamp-le-Château
or maybe
Grandchamp-le-Ch%C3%A2teau
I would prefer a Java-based solution, but any other idea is just as welcome!
Any advice or hint is very much appreciated!

This is a slightly hacky way to achieve your goal:
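import java.util.regex.*;
import static java.nio.charset.StandardCharsets.*;

final String name = "Grandchamp-le-Ch\\303\\242teau";
// Match a backslash followed by three octal digits that fit in one byte (000-377).
final Matcher m = Pattern.compile("\\\\([0-3][0-7]{2})").matcher(name);
final StringBuffer out = new StringBuffer();
while (m.find()) {
    final String ch = String.valueOf((char) Integer.parseInt(m.group(1), 8));
    // quoteReplacement keeps a decoded '$' or '\' from being read as a group reference.
    m.appendReplacement(out, Matcher.quoteReplacement(ch));
}
m.appendTail(out);
final String decoded = new String(out.toString().getBytes(ISO_8859_1), UTF_8);
System.out.println(decoded);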
How it works:
the regular expression matches the octal character notation;
the original string is transformed by replacing each such octal notation with a char whose numeric value equals that octal number;
the new string (now in "mojibake" state) is written out as bytes using ISO_8859_1, which maps each char from U+0000 to U+00FF to the single byte with the same value (an encoding without that one-to-one mapping, such as windows-1252, would corrupt some bytes);
the bytes are re-read, now assuming they are a UTF-8-encoded string.
The code will print out
Grandchamp-le-Château
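If you would rather skip the intermediate "mojibake" string, the same steps can be sketched by collecting raw bytes directly and decoding them as UTF-8 once at the end. This is only a minimal sketch: the class name OctalEscapes and its decode method are made up for illustration, and it assumes every escape is a backslash followed by exactly three octal digits in the range 000 to 377.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class OctalEscapes {
    // A backslash followed by three octal digits whose value fits in one byte.
    private static final Pattern OCTAL = Pattern.compile("\\\\([0-3][0-7]{2})");

    static String decode(String input) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Matcher m = OCTAL.matcher(input);
        int last = 0;
        while (m.find()) {
            // Copy the literal text before the escape as UTF-8 bytes.
            byte[] literal = input.substring(last, m.start()).getBytes(StandardCharsets.UTF_8);
            bytes.write(literal, 0, literal.length);
            // The escape itself stands for one raw byte.
            bytes.write(Integer.parseInt(m.group(1), 8));
            last = m.end();
        }
        byte[] tail = input.substring(last).getBytes(StandardCharsets.UTF_8);
        bytes.write(tail, 0, tail.length);
        // Interpret the whole byte sequence as UTF-8.
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // prints Grandchamp-le-Château
        System.out.println(decode("Grandchamp-le-Ch\\303\\242teau"));
    }
}
```

Working on bytes means text that already contains non-ASCII characters next to the escapes is left intact, which the ISO_8859_1 round trip cannot guarantee.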

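Here you are (this works when the text is a Java string literal, because the compiler itself interprets the octal escapes \303 and \242; text read from a file still contains literal backslashes, which have to be decoded first):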
String myString = "Grandchamp-le-Ch\303\242teau";
byte[] byteArray = myString.getBytes("ISO-8859-1");
String result = new String(byteArray, "UTF-8");
System.out.println(result);
This prints:
Grandchamp-le-Château

Related

How to Convert UTF-16 Surrogate Decimal to UNICODE in Java

I have some string data like
&#55357 ;&#56842 ;
These are UTF-16 surrogate pairs in decimal format.
How can I convert them to Unicode code points in Java, so that my client can understand the Unicode decimal HTML entity without the surrogate pair?
Example: &#128522 ; is the response I want for the string above.
Assuming you already parsed the string to get the 2 numbers, just create a String from those two char values:
String s = new String(new char[] { 55357, 56842 });
System.out.println(s);
Output
😊
To get the code point of that:
s.codePointAt(0) // returns 128522
You don't have to create a string though:
Character.toCodePoint((char) 55357, (char) 56842) // returns 128522
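For the "already parsed" part, here is a rough sketch of how a whole string could be rewritten so that each surrogate pair of entities becomes one code point entity. The class SurrogateEntities and its merge method are hypothetical names, and the sketch assumes well-formed entities written without the space before the semicolon. It relies on the fact that every surrogate value (55296 to 57343) has exactly five decimal digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the class and method names are made up.
final class SurrogateEntities {
    // Two adjacent five-digit decimal entities; group 2 is the whole second entity.
    private static final Pattern TWO_ENTITIES =
            Pattern.compile("&#(\\d{5});(&#(\\d{5});)");

    static String merge(String text) {
        StringBuilder out = new StringBuilder();
        Matcher m = TWO_ENTITIES.matcher(text);
        int pos = 0; // everything before pos has already been copied to out
        while (m.find(pos)) {
            int first = Integer.parseInt(m.group(1));
            int second = Integer.parseInt(m.group(3));
            if (first <= 0xFFFF && second <= 0xFFFF
                    && Character.isSurrogatePair((char) first, (char) second)) {
                // Replace both entities with a single code point entity.
                out.append(text, pos, m.start());
                int codePoint = Character.toCodePoint((char) first, (char) second);
                out.append("&#").append(codePoint).append(';');
                pos = m.end();
            } else {
                // Not a pair: keep the first entity and retry starting at the second.
                out.append(text, pos, m.start(2));
                pos = m.start(2);
            }
        }
        out.append(text, pos, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // prints &#128522;
        System.out.println(merge("&#55357;&#56842;"));
    }
}
```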

Convert ASCII representation of unicode to unicode

I have an application that gets some Strings via JSON.
The problem is that I think they are sent as ASCII, while the text really should be in Unicode.
For example, parts of the string contain "\u00f6", which is the Swedish letter "ö".
For example, the Swedish word for "buy" is "köpa", and the string I get is "k\u00f6pa".
Is there an easy way, after I have received this String in Java, to convert it to the correct representation?
That is, I want to convert strings like "k\u00f6pa" to "köpa".
Thanks for all help!
Well, that is easy enough, just use a JSON library. With Jackson for instance you will:
final ObjectMapper mapper = new ObjectMapper();
final JsonNode node = mapper.readTree(jsonSource); // jsonSource is a placeholder for your JSON input, e.g. a String or an InputStream
The JsonNode will in fact be a TextNode; you can just retrieve the text as:
node.textValue()
Note that this IS NOT an "ASCII representation" of a String; it just happens that JSON strings can contain UTF-16 code unit character escapes like this one.
(you will lose the quotes around the value, though, but that is probably what you expect anyway)
The hex code is just a 16-bit value, which an int can handle just fine, so you can use Integer.parseInt(s, 16), where s is the string without the "\u" prefix. Then you just narrow that int to a char, which is guaranteed to fit.
Throw in some regex (to validate the string and also extract the hex code), and you're all done.
Pattern p = Pattern.compile("\\\\u([0-9a-fA-F]{4})");
Matcher m = p.matcher(arg);
// matches() succeeds only when arg consists of exactly one escape sequence.
if (m.matches()) {
    String code = m.group(1);
    int i = Integer.parseInt(code, 16);
    char c = (char) i;
    System.out.println(c);
}
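The snippet above handles a string that is nothing but one escape. To decode escapes embedded in longer text such as "k\u00f6pa", the same regex can be applied with find() in a loop. This is a minimal sketch under assumptions: UnicodeEscapes and decode are made-up names, and it does not handle other JSON escapes or an escaped backslash in front of the "u", so a real JSON parser remains the more robust route.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class UnicodeEscapes {
    // A backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static String decode(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous escape and this one unchanged.
            out.append(input, last, m.start());
            // Four hex digits always fit in a char (one UTF-16 code unit).
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // prints köpa
        System.out.println(decode("k\\u00f6pa"));
    }
}
```

Because each escape is one UTF-16 code unit, two consecutive escapes that form a surrogate pair end up next to each other in the result and combine into a single supplementary character.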

Convert a text with special unicode to normal text (java)

I have a text which includes numerous unicode (?) characters in it, like the following:
passaic$002c new jersey
Which should be : passaic, new jersey
Albert_W$002E_Barney
Which should be : albert w. barney
Roosevelt_High_School_$0028Yonkers$002C_New_York$0029
which should be: Roosevelt_High_School_(Yonkers,_New_York)
I searched the web and there is a big list of these characters: http://colemak.com/pub/mac/wordherd_source.txt
Do you know any fast method that I can replace these characters with their original characters? Note that I don't want to replace each of these characters one by one (like using replaceAll.) Instead I want to use a function that has already implemented this (maybe an external library)
Try the native2ascii tool that ships with the JDK. See http://docs.oracle.com/javase/7/docs/technotes/tools/solaris/native2ascii.html
Assuming those are UTF-16BE encoded values, you can just parse the values and cast to char:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String parse(CharSequence csq) {
    StringBuilder out = new StringBuilder();
    // A dollar sign followed by exactly four hex digits.
    Matcher matcher = Pattern.compile("\\$(\\p{XDigit}{4}+)").matcher(csq);
    int last = 0;
    while (matcher.find()) {
        // Copy the text before the code, then the decoded character.
        out.append(csq.subSequence(last, matcher.start()));
        String hex = matcher.group(1);
        char ch = (char) Integer.parseInt(hex, 16);
        out.append(ch);
        last = matcher.end();
    }
    out.append(csq.subSequence(last, csq.length()));
    return out.toString();
}
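The same decoding can also be sketched with appendReplacement, run here against the samples from the question. DollarHex and decode are hypothetical names. The one thing to watch with this variant is that a decoded character may itself be "$" or a backslash, which appendReplacement would treat as a group reference unless it is quoted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class DollarHex {
    // A dollar sign followed by exactly four hex digits.
    private static final Pattern CODE = Pattern.compile("\\$(\\p{XDigit}{4})");

    static String decode(String input) {
        Matcher m = CODE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String ch = String.valueOf((char) Integer.parseInt(m.group(1), 16));
            // Quote the replacement so a decoded '$' is inserted literally.
            m.appendReplacement(out, Matcher.quoteReplacement(ch));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints passaic, new jersey
        System.out.println(decode("passaic$002c new jersey"));
        // prints Roosevelt_High_School_(Yonkers,_New_York)
        System.out.println(decode("Roosevelt_High_School_$0028Yonkers$002C_New_York$0029"));
    }
}
```

Turning the underscores into spaces, as in the "albert w. barney" example, would be a separate replace('_', ' ') step.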

Convert String to unicode?

How do I convert Strings to Unicode? Characters are easy, but if I have "C" stored as a String, how can I convert it to Unicode? For characters you can just use (int) charVariable, but what do you do for Strings?
Actually, I am using String.split() to split a String and then want to check whether the first character is upper or lower case. Integer.parseInt is not working; it throws a NumberFormatException.
You may try this -
byte[] bytes = new byte[10];
String str = new String(bytes, Charset.forName("UTF-8"));
System.out.println(str);
For more detail, you can see this tutorial.
To check the first character, you may use str.charAt(0).
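For the actual goal in the question (the numeric value of the first character and whether it is upper case), something along these lines may be closer to what is needed. It is only a sketch: FirstCharacter and startsWithUpperCase are made-up names. Integer.parseInt fails here because it parses digit strings such as "67", not letters.

```java
// Hypothetical helper for illustration; not part of any library.
final class FirstCharacter {
    // True if the string is non-empty and its first code point is an upper-case letter.
    static boolean startsWithUpperCase(String s) {
        return !s.isEmpty() && Character.isUpperCase(s.codePointAt(0));
    }

    public static void main(String[] args) {
        String s = "C";
        // codePointAt gives the Unicode value of the character at that index.
        System.out.println(s.codePointAt(0)); // prints 67
        // Check each piece produced by split().
        for (String part : "Hello world".split(" ")) {
            System.out.println(part + " -> " + startsWithUpperCase(part));
        }
    }
}
```

Using codePointAt rather than charAt also covers characters outside the BMP, where the first char alone would only be half of a surrogate pair.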

How to convert a string representation of unicode hex "0x20000" to the int code point 0x20000 in Java

I have a list of String representations of unicode hex values such as "0x20000" (𠀀) and "0x00F8" (ø) that I need to get the int code point of so that I can use functions such as:
char[] chars = Character.toChars(0x20000);
This should cover the BMP as well as supplementary characters. I cannot find any way to do it so would be glad of some help.
You can create your own NumberFormat implementation, but easier than that you can do something like this:
String hexString = "0x20000";
int hexInt = Integer.parseInt(hexString.substring(2), 16);
String stringRepresentation = new String(Character.toChars(hexInt));
System.out.println(stringRepresentation); //prints "𠀀"
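As a variation, Integer.decode understands the "0x" prefix itself, so the substring step can be dropped. The sketch below assumes every input carries such a prefix. HexCodePoints and fromHex are hypothetical names, and the validity check is an addition that the answer above does not have.

```java
// Hypothetical helper for illustration; not part of any library.
final class HexCodePoints {
    // Turns a string such as "0x20000" into the character(s) for that code point.
    static String fromHex(String hex) {
        // Integer.decode accepts the 0x, 0X and # prefixes for hexadecimal input.
        int codePoint = Integer.decode(hex);
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + hex);
        }
        // toChars returns one char for the BMP and a surrogate pair above it.
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String supplementary = fromHex("0x20000");
        System.out.println(supplementary.length());        // prints 2
        System.out.println(supplementary.codePointAt(0));  // prints 131072
        System.out.println(fromHex("0x00F8"));             // prints ø
    }
}
```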
