Converting to utf-8 in java - java

I just have the string \u0130smail and I want to convert it to
İsmail and also convert
\u0130 --> İ
\u00E7 --> ç
I tried
String str = "\u0130smail";
System.out.println(str);
and it worked, but whenever I get the string "\u0130smail" from the DB or the internet it doesn't give the correct result.
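static String deneme(String string) {
    String string2 = null;
    try {
        // encode to UTF-8 bytes, then decode the same bytes again
        byte[] utf8 = string.getBytes("UTF-8");
        string2 = new String(utf8, "UTF-8");
    } catch (UnsupportedEncodingException e) {
        // cannot happen: every JVM is required to support UTF-8
    }
    return string2;
}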
didn't work either.

Strings "\u0130smail" and "İsmail" are absolutely the same from the language standpoint. If you mean that you get a string "\\u0130smail" (note that I've escaped the backslash), then you will have to find the pattern of the unicode code points and convert them to normal unicode letters or just print the number, whichever you need. Regular expressions could help you in this case.
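If the string really does contain a literal backslash followed by u and four hex digits, a regular expression can do the conversion. The following is a minimal sketch, assuming Java 9 or later (for Matcher.replaceAll with a function); the class name UnicodeUnescaper and the method name unescape are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeUnescaper {
    // A literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    /** Replaces every six-character escape with the UTF-16 code unit it names. */
    static String unescape(String input) {
        return ESCAPE.matcher(input).replaceAll(match -> {
            char codeUnit = (char) Integer.parseInt(match.group(1), 16);
            // The result is interpreted as a replacement pattern, so quote it.
            return Matcher.quoteReplacement(String.valueOf(codeUnit));
        });
    }

    public static void main(String[] args) {
        // The doubled backslash produces the 11-character text a database might return.
        String fromDb = "\\u0130smail";
        System.out.println(unescape(fromDb));
    }
}
```

Escaped surrogate pairs come out right as well, because the two decoded code units end up next to each other in the result.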

Converting the existing string to bytes and back again isn't going to help you. You need to look at the exact characters in the string you've got - and work out how you got them.
I suggest you print out the integer value of each character in the string (ideally in hex) to find out exactly what you've got... then trace it back as far as you can, to work out what's going wrong.
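A small helper along these lines makes the difference visible. It is only a sketch; the names CodeUnitDump and dump are hypothetical.

```java
final class CodeUnitDump {
    /** Formats every UTF-16 code unit of the string as U+XXXX, separated by spaces. */
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A real letter followed by s: two code units.
        System.out.println(dump("\u0130s"));
        // A textual escape followed by s: seven ordinary ASCII characters.
        System.out.println(dump("\\u0130s"));
    }
}
```

The first call prints U+0130 U+0073. The second prints U+005C U+0075 U+0030 U+0031 U+0033 U+0030 U+0073, which shows that the string holds a backslash and digits rather than the letter itself.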

Related

How to convert special characters in a string to unicode?

I couldn't find an answer to this problem, having tried combining several answers here to find something that works, to no avail.
An application I'm working on uses a user's name to create PDFs with that name in them. However, when someone's name contains a special character, as in "Yağmur", the PDF creator freaks out and omits the special character.
However, when it gets the unicode equivalent ("Ya&#287;mur"), it prints "Yağmur" in the PDF as it should.
How do I check a name/string for any special character (regex = "[^a-z0-9 ]") and, when one is found, replace that character with its unicode equivalent and return the new string?
I will try to give the solution in a generic way, as the framework you are using is not mentioned in your problem statement.
I faced the same kind of issue a long time ago. The PDF engine should handle this if you set the text/character encoding to UTF-8. Find out how to set the encoding for PDF generation in your framework and try it out. Hope it helps!
One hackish way to do this would be as follows:
/*
 * TODO: poorly named
 */
public static String convertUnicodePoints(String input) {
    // initializing output
    StringBuilder sb = new StringBuilder();
    // iterating input by code point, so a surrogate pair counts as one character
    for (int i = 0; i < input.length(); ) {
        int codePoint = input.codePointAt(i);
        // checking character code point to infer whether "conversion" is required
        // here, picking an arbitrary code point 125 as boundary
        if (codePoint < 125) {
            sb.appendCodePoint(codePoint);
        }
        // need to "convert", code point >= boundary
        else {
            // for hex representation, zero-padded to at least 4 digits:
            // sb.append(String.format("&#x%04X;", codePoint));
            // for decimal representation, which is what you want here
            sb.append(String.format("&#%d;", codePoint));
        }
        // advancing by 1 char, or by 2 for a surrogate pair
        i += Character.charCount(codePoint);
    }
    return sb.toString();
}
If you execute: System.out.println(convertUnicodePoints("Yağmur"));...
... you'll get: Ya&#287;mur.
Of course, you can play with the "conversion" logic and decide which ranges get converted.

Convert ASCII representation of unicode to unicode

I have an application that gets some Strings via JSON.
The problem is that I think they are being sent as ASCII, while the text really should be in unicode.
For example, there are parts of the string that are "\u00f6", which is the Swedish letter "ö".
For example, the Swedish word for "buy" is "köpa" and the string I get is "k\u00f6pa".
Is there an easy way, after I have received this String in Java, to convert it to the correct representation?
That is, I want to convert strings like "k\u00f6pa" to "köpa".
Thanks for all help!
Well, that is easy enough: just use a JSON library. With Jackson, for instance, you would write:
final ObjectMapper mapper = new ObjectMapper();
final JsonNode node = mapper.readTree(your, source, here);
The JsonNode will in fact be a TextNode; you can just retrieve the text as:
node.textValue()
Note that this IS NOT an "ASCII representation" of a String; it just happens that JSON strings can contain UTF-16 code unit character escapes like this one.
(you will lose the quotes around the value, though, but that is probably what you expect anyway)
The hex code is just a 16-bit value, which an int can handle just fine, so you can use Integer.parseInt(s, 16), where s is the four hex digits without the "\u" prefix. Then you narrow that int to a char, which is guaranteed to fit.
Throw in some regex (to validate the string and also extract the hex code), and you're all done.
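Pattern p = Pattern.compile("\\\\u([0-9a-fA-F]{4})");
Matcher m = p.matcher(arg);
StringBuffer sb = new StringBuffer();
// find() visits every escape inside arg; matches() would succeed only if arg were exactly one escape
while (m.find()) {
    String code = m.group(1);
    int i = Integer.parseInt(code, 16);
    char c = (char) i;
    m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
}
m.appendTail(sb);
System.out.println(sb);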

Handling Strings with octal ASCII code (in Java)

I'm having some trouble with a text file that contains strings like these:
Grandchamp-le-Ch\303\242teau
It's the name of a Wikipedia page, by the way. I think the two octal codes together represent "â".
Is there any piece of software that easily converts the string above into
Grandchamp-le-Château
or maybe
Grandchamp-le-Ch%C3%A2teau
I would prefer a Java-based solution, but any other idea is just as welcome!
Any advice or hint is very much appreciated!
This is a slightly hacky way to achieve your goal:
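// assumes static imports: Integer.parseInt, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8
final String name = "Grandchamp-le-Ch\\303\\242teau";
// [0-3][0-7]{2} accepts only octal values that fit in one byte (000 to 377)
final Matcher m = Pattern.compile("\\\\([0-3][0-7]{2})").matcher(name);
final StringBuffer out = new StringBuffer();
while (m.find()) {
    // quoteReplacement keeps a decoded '$' or backslash from being read as a group reference
    m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf((char) parseInt(m.group(1), 8))));
}
m.appendTail(out);
final String decoded = new String(out.toString().getBytes(ISO_8859_1), UTF_8);
System.out.println(decoded);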
How it works:
the regular expression matches the octal character notation;
the original string is transformed by replacing each such octal notation with a char whose numeric value equals that octal number;
the new string (now in "mojibake" state) is written out as bytes, using ISO_8859_1, which maps each char from U+0000 to U+00FF to the byte with the same value, so every byte comes back out unchanged;
the bytes are re-read, now assuming they are a UTF-8-encoded string.
The code will print out
Grandchamp-le-Château
Here you are:
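// In Java source, \303 and \242 are octal escapes, so this literal already holds the chars U+00C3 and U+00A2.
// Text read from a file contains real backslashes instead and needs the octal notation decoded first.
String myString = "Grandchamp-le-Ch\303\242teau";
byte[] byteArray = myString.getBytes(StandardCharsets.ISO_8859_1); // java.nio.charset.StandardCharsets
String result = new String(byteArray, StandardCharsets.UTF_8);
System.out.println(result);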
This prints:
Grandchamp-le-Château
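The octal escapes can also be turned into bytes directly, which skips the intermediate "mojibake" string. This is only a sketch of that variation, not the method from either answer above; the class OctalEscapes and its method names are invented for illustration, and it assumes each escape is a backslash followed by exactly three octal digits.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class OctalEscapes {
    /** Collects escapes as raw bytes, then decodes the whole byte sequence as UTF-8. */
    static String decode(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int i = 0;
        while (i < text.length()) {
            if (isOctalEscape(text, i)) {
                // Three octal digits, at most 377 (255 decimal), so the value fits in one byte.
                bytes.write(Integer.parseInt(text.substring(i + 1, i + 4), 8));
                i += 4;
            } else {
                // Ordinary characters are re-encoded as UTF-8 so they survive the final decode.
                int codePoint = text.codePointAt(i);
                byte[] encoded = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
                bytes.write(encoded, 0, encoded.length);
                i += Character.charCount(codePoint);
            }
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    /** True if a backslash and three octal digits (first digit 0 to 3) start at index i. */
    private static boolean isOctalEscape(String s, int i) {
        return i + 4 <= s.length()
                && s.charAt(i) == '\\'
                && s.charAt(i + 1) >= '0' && s.charAt(i + 1) <= '3'
                && s.charAt(i + 2) >= '0' && s.charAt(i + 2) <= '7'
                && s.charAt(i + 3) >= '0' && s.charAt(i + 3) <= '7';
    }

    public static void main(String[] args) {
        // Doubled backslashes: the text as it would appear in the file.
        System.out.println(decode("Grandchamp-le-Ch\\303\\242teau"));
    }
}
```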

Convert String to unicode?

How to convert Strings to unicode? Characters are easy. But if I have "C" stored as a String, how can I convert it to unicode? For characters you can just use (int)charvariable, but how do you do it for strings?
Actually I am using String.split() to split a String and then want to check if the 1st character is capital or small. Integer.parseInt is not working. It says NumberFormatException.
You may try this -
byte[] bytes = new byte[10];
String str = new String(bytes, Charset.forName("UTF-8"));
System.out.println(str);
for more detail you can see this tutorial
and for checking the first character you may use str.charAt(0)
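For the split-and-check use case described in the question, something like the following might be enough. Integer.parseInt("C") throws NumberFormatException because it expects decimal digits; the numeric value of a character comes from codePointAt (or charAt) instead. The class and method names here are hypothetical.

```java
final class FirstLetterCase {
    /** Returns the Unicode code point of the first character of a non-empty string. */
    static int firstCodePoint(String s) {
        return s.codePointAt(0);
    }

    /** True if the string is non-empty and starts with an upper-case letter. */
    static boolean startsWithUpperCase(String s) {
        return !s.isEmpty() && Character.isUpperCase(s.codePointAt(0));
    }

    public static void main(String[] args) {
        for (String word : "Convert this String".split(" ")) {
            System.out.println(word + " -> " + firstCodePoint(word)
                    + ", upper case: " + startsWithUpperCase(word));
        }
    }
}
```

For "C" the code point is 67, the same number that (int) 'C' gives for a char.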

How to build the longest String with different Unicode characters

Thanks in advance for your patience. This is my problem.
I'm writing a program in Java that works best with a big set of different characters.
I have to store all the characters in a String. I started with
private static final String values = "0123456789";
Then I added A-Z, a-z and all the common symbols.
But they are still too few, so I thought that maybe Unicode could be the solution.
The problem is now: what is the best way to get all the unicode characters that can be displayed in Eclipse (my algorithm will probably fail if there are unrecognized characters - those displayed like little rectangles). Is it possible to build a string (or some strings) with all the characters present here (en.wikipedia.org/wiki/List_of_Unicode_characters) correctly displayed?
I can do a rough copy-paste from http://www.terena.org/activities/multiling/euroml/tests/test-ucspages1ucs.html or http://zenoplex.jp/tools/unicoderange_generator.html, but I would appreciate some cleaner solution.
I don't know if there is a way to extract characters from a font (the Unifont one). Or maybe I should parse this (www.utf8-chartable.de/unicode-utf8-table.pl) webpage.
Moreover, by adding all the characters into a String I will probably get the error:
"The type generates a string that requires more than 65535 bytes to encode in Utf8 format in the constant pool" (discussed in this question on SO: /questions/10798769/how-to-process-a-string-with-823237-characters).
Hybrid solutions can be accepted. I can remove duplicates following this question on SO (questions/4989091/removing-duplicates-from-a-string-in-java).
Finally: every solution to get the longest only-different-characters string is accepted.
Thanks!
You are mixing some things up. The question whether a character can be displayed in Eclipse depends on the font you have chosen; and whether the source file can be processed correctly depends on which character encoding you have set up for the source file. When choosing UTF-8 and a good unicode font you can use and display almost any character, at least more than fit into a single String literal.
But is it really required to show the character in Eclipse? You can use the unicode escapes, e.g. \u20ac to refer to characters, regardless of whether they can be displayed or if the file encoding can handle them.
And if it is not a requirement to blow up your source code, it’s easy to create a String containing all existing characters:
// all chars (i.e. UTF-16 values)
StringBuilder sb=new StringBuilder(Character.MAX_VALUE);
for(char c=0; c<Character.MAX_VALUE; c++) sb.append(c);
String s=sb.toString();
// if it should behave like a compile-time constant:
s=s.intern();
or
// all unicode characters (aka code points)
StringBuilder sb=new StringBuilder(2162686);
for(int c=0; c<Character.MAX_CODE_POINT; c++) sb.appendCodePoint(c);
String s=sb.toString();
// if it should behave like a compile-time constant:
s=s.intern();
If you want the String to contain valid unicode characters only you can use if(Character.isDefined(c)) … inside the loop. But that’s a moving target: newer JREs will most probably know more defined characters.
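A sketch of that filtered variant could look like this. The class name DefinedCharacters is invented here, and skipping the surrogate range is an extra assumption on my part: Character.isDefined reports those values as defined, but they are halves of a pair rather than characters of their own.

```java
final class DefinedCharacters {
    /** Builds a string holding every code point this JRE considers defined, each exactly once. */
    static String allDefined() {
        StringBuilder sb = new StringBuilder();
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            // Skip U+D800 to U+DFFF: surrogates only make sense in pairs.
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue;
            }
            if (Character.isDefined(cp)) {
                sb.appendCodePoint(cp);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = allDefined();
        // The exact numbers depend on the Unicode version the running JRE supports.
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        System.out.println("chars:       " + s.length());
    }
}
```

Whether every one of these can be displayed still depends on the font, and control or private-use characters may need filtering out as well.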
Simply use the Apache Commons class org.apache.commons.lang3.RandomStringUtils (commons-lang3); it can serve your purpose.
http://commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/apache/commons/lang3/RandomStringUtils.html
Also please refer to below code for api usage,
import org.apache.commons.lang3.RandomStringUtils;
public class RandomString {
public static void main(String[] args) {
// Random string only with numbers
String string = RandomStringUtils.random(64, false, true);
System.out.println("Random 0 = " + string);
// Random alphabetic string
string = RandomStringUtils.randomAlphabetic(64);
System.out.println("Random 1 = " + string);
// Random ASCII string
string = RandomStringUtils.randomAscii(32);
System.out.println("Random 2 = " + string);
// Create a random string with indexes from the given array of chars
string = RandomStringUtils.random(32, 0, 20, true, true, "bj81G5RDED3DC6142kasok".toCharArray());
System.out.println("Random 3 = " + string);
}
}
