How to convert "Æàìáûë" to readable cyrillic in Java? - java

I tryed to get byte and then convert with Utf-8.
byte ptext[] = first_name.getBytes();
Log.i("", new String(ptext,"UTF-8"));
But it's not working .Sorry for my dumbness. I'm very confused.

try {
String s = new String("Æàìáûë".getBytes(StandardCharsets.ISO_8859_1), "Windows-1251");
Files.write(Paths.get("C:/cyrillic.txt"),
("\uFEFF" + s).getBytes(StandardCharsets.UTF_8));
} catch (IOException e) {
e.printStackTrace();
}
Assuming that the editor and compiler are set to UTF-8 to have a correct erroneous string literal.
This treats the characters as single bytes, abusing ISO-8859-1. Then tries the Windows-1251 encoding for Cyrillic (there are others).
This way we have a java String (always in Unicode).
This we'll write to a text file in UTF-8, with a BOM, so Windows Notepad will identify the file as UTF-8.
Writing to any Cyrillic encoding will be no problem.
Жамбыл

Your byte array must have some encoding. The encoding cannot be ASCII if you've got negative values. Once you figure that out, you can convert a set of bytes to a String using:
byte[] bytes = {...}
String str = new String(bytes, "UTF-8"); // for UTF-8 encoding
Log.i("value", str);
There are a bunch of encodings you can use, look at the Charset class in the Sun javadocs..

Seems your original encoding is Cp1251:
byte ptext[] = first_name.getBytes();
Log.i("", new String(ptext, "Cp1251")); // <- put it here
Resulting word is Жамбыл.

Related

String format when reading from file

I have this example. It reads a line "hello" from a file saved as utf-8. Here is my question:
Strings are stored in java in UTF-16 format. So when it reads the line hello it converts it to a utf-16 format. So string s is in a utf-16 with a utf-16 BOM... Am i right?
filereader = new FileReader(file);
read= new BufferedReader(filereader);
String s= null;
while ((s= read.readLine()) != null)
{
System.out.println(s);
}
So when i do this:
s= s.replace("\uFEFF","A");
nothing happens. Should the above find and replace the UTF-16 BOM? Or is it eventually a utf-8 format? Am a little bit confused about this.
Thank you
Try to use the Apache Commons library and the class org.apache.commons.io.input.BOMInputStream to get rid of this kind of problems.
Example:
String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(file);
try
{
BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
ByteOrderMark bom = bOMInputStream.getBOM();
String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
// your code...
}
finally
{
inputStream.close();
}
For what concerns the BOM itself, as #seand said, it's just meta data being used for reading/writing/storing strings in memory. It's present in the strings themselves, but you cannot replace or modify it unless working at binary level or re-encoding the strings.
Let's make a few examples:
String str = "Hadoop";
byte bt1[] = str.getBytes();
System.out.println(bt1.length); // 6
byte bt2a[] = str.getBytes("UTF-16");
System.out.println(bt2a.length); // 14
byte bt2b[] = str.getBytes("UTF-16BE");
System.out.println(bt2b.length); // 14
byte bt3[] = str.getBytes("UTF-16LE");
System.out.println(bt3.length); // 12
In the UTF-16 (which defaults to Big Endian) and UTF-16BE versions, you get 14 bytes because of the BOM being inserted to distinguish between BE and LE. If you specify UTF-16LE you get 12 bytes because of no BOM is being added.
You cannot strip the BOM from a string with a simple replace, as you tried. Because the BOM, if present, is only part of the underlying byte stream that, memory side, is being handled as a string by the java framework. And you can't manipulate it like you manipulate characters that are part of the string itself.

Java convert String UTF-8 to UTF-16

I try to convert String a = "try" to String UTF-16
I did this :
try {
String ulany = new String("357810087745445");
System.out.println(ulany.getBytes().length);
String string = new String(ulany.getBytes(), "UTF-16");
System.out.println(string.getBytes().length);
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
And ulany.getBytes().length = 15
and System.out.println(string.getBytes().length) = 24 but I think that it should be 30 what I did wrong ?
String (and char) hold Unicode. So nothing is needed.
However if you want bytes, binary data, that are in some encoding, like UTF-16, you need a conversion:
ulany.getBytes("UTF-16") // Those bytes are in UTF-16 big endian
ulany.getBytes("UTF-16LE")
However System.out uses the operating systems encoding, so one cannot just pick some different encoding.
In fact char is UTF-16 encoded.
What happens
//String ulany = new String("357810087745445");
String ulany = "357810087745445";
The String copy constructor stems from the C++ beginning, and is senseless.
System.out.println(ulany.getBytes().length);
This will run on different platforms differently, as getBytes() uses
the default Charset. Better
System.out.println(ulany.getBytes("UTF-8").length);
String string = new String(ulany.getBytes(), "UTF-16");
This interpretes those bytes pairwise; having 15 bytes is already wrong.
Evidently one gets 7 (8?) special characters, as the high byte is not zero.
System.out.println(string.getBytes().length);
Now getting 24 means an average 3 bytes per char. Hence the default platform encoding is probably UTF-8 creating multibyte sequences.
The string will contain something like:
String string = "\u3533\u3837\u3031\u3830\u3737\u3534\u3434?";
You can also include a text encoding in getBytes(). For example:
String string = new String(ulany.getBytes("UTF-8"), "UTF-16");

How to convert UTF8 string to UTF16

I'm getting a UTF8 string by processing a request sent by a client application. But the string is really UTF16. What can I do to get it into my local string is a letter followed by \0 character? I need to convert that String into UTF16.
Sample received string: S\0a\0m\0p\0l\0e (UTF8).
What I want is : Sample (UTF16)
FileItem item = (FileItem) iter.next();
String field = "";
String value = "";
if (item.isFormField()) {
try{
value=item.getString();
System.out.println("====" + value);
}
The bytes from the server are not UTF-8 if they look like S\0a\0m\0p\0l\0e. They are UTF-16. You can convert UTF16 bytes to a Java String with:
byte[] bytes = ...
String string = new String(bytes, "UTF-16");
Or you can use UTF-16LE or UTF-16BE as the character set name if you know the endian-ness of the byte stream coming from the server.
If you've already (mistakenly) constructed a String from the bytes as if it were UTF-8, you can convert to UTF-16 with:
string = new String(string.getBytes("UTF-8"), "UTF-16");
However, as JB Nizet points out, this round trip (bytes -> UTF-8 string -> bytes) is potentially lossy if the bytes weren't valid UTF-8 to start with.
I propose the following solution:
NSString *line_utf16[ENOUGH_MEMORY_SIZE];
line_utf16= [NSString stringWithFormat: #"%s", line_utf8];
ENOUGH_MEMORY_SIZE is at least twice exceeds memory used for line_utf8
I suppose memory for
line_utf16
has to be dynamically or statically allocated at least twice of the size of
line_utf8.
If you run into similar problem please add a couple of sentences!

Encode String to UTF-8

I have a String with a "ñ" character and I have some problems with it. I need to encode this String to UTF-8 encoding. I have tried it by this way, but it doesn't work:
byte ptext[] = myString.getBytes();
String value = new String(ptext, "UTF-8");
How do I encode that string to utf-8?
How about using
ByteBuffer byteBuffer = StandardCharsets.UTF_8.encode(myString)
String objects in Java use the UTF-16 encoding that can't be modified*.
The only thing that can have a different encoding is a byte[]. So if you need UTF-8 data, then you need a byte[]. If you have a String that contains unexpected data, then the problem is at some earlier place that incorrectly converted some binary data to a String (i.e. it was using the wrong encoding).
* As a matter of implementation, String can internally use a ISO-8859-1 encoded byte[] when the range of characters fits it, but that is an implementation-specific optimization that isn't visible to users of String (i.e. you'll never notice unless you dig into the source code or use reflection to dig into a String object).
In Java7 you can use:
import static java.nio.charset.StandardCharsets.*;
byte[] ptext = myString.getBytes(ISO_8859_1);
String value = new String(ptext, UTF_8);
This has the advantage over getBytes(String) that it does not declare throws UnsupportedEncodingException.
If you're using an older Java version you can declare the charset constants yourself:
import java.nio.charset.Charset;
public class StandardCharsets {
public static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");
public static final Charset UTF_8 = Charset.forName("UTF-8");
//....
}
Use byte[] ptext = String.getBytes("UTF-8"); instead of getBytes(). getBytes() uses so-called "default encoding", which may not be UTF-8.
A Java String is internally always encoded in UTF-16 - but you really should think about it like this: an encoding is a way to translate between Strings and bytes.
So if you have an encoding problem, by the time you have String, it's too late to fix. You need to fix the place where you create that String from a file, DB or network connection.
You can try this way.
byte ptext[] = myString.getBytes("ISO-8859-1");
String value = new String(ptext, "UTF-8");
In a moment I went through this problem and managed to solve it in the following way
first i need to import
import java.nio.charset.Charset;
Then i had to declare a constant to use UTF-8 and ISO-8859-1
private static final Charset UTF_8 = Charset.forName("UTF-8");
private static final Charset ISO = Charset.forName("ISO-8859-1");
Then I could use it in the following way:
String textwithaccent="Thís ís a text with accent";
String textwithletter="Ñandú";
text1 = new String(textwithaccent.getBytes(ISO), UTF_8);
text2 = new String(textwithletter.getBytes(ISO),UTF_8);
String value = new String(myString.getBytes("UTF-8"));
and, if you want to read from text file with "ISO-8859-1" encoded:
String line;
String f = "C:\\MyPath\\MyFile.txt";
try {
BufferedReader br = Files.newBufferedReader(Paths.get(f), Charset.forName("ISO-8859-1"));
while ((line = br.readLine()) != null) {
System.out.println(new String(line.getBytes("UTF-8")));
}
} catch (IOException ex) {
//...
}
I have use below code to encode the special character by specifying encode format.
String text = "This is an example é";
byte[] byteText = text.getBytes(Charset.forName("UTF-8"));
//To get original string from byte.
String originalString= new String(byteText , "UTF-8");
A quick step-by-step guide how to configure NetBeans default encoding UTF-8. In result NetBeans will create all new files in UTF-8 encoding.
NetBeans default encoding UTF-8 step-by-step guide
Go to etc folder in NetBeans installation directory
Edit netbeans.conf file
Find netbeans_default_options line
Add -J-Dfile.encoding=UTF-8 inside quotation marks inside that line
(example: netbeans_default_options="-J-Dfile.encoding=UTF-8")
Restart NetBeans
You set NetBeans default encoding UTF-8.
Your netbeans_default_options may contain additional parameters inside the quotation marks. In such case, add -J-Dfile.encoding=UTF-8 at the end of the string. Separate it with space from other parameters.
Example:
netbeans_default_options="-J-client -J-Xss128m -J-Xms256m
-J-XX:PermSize=32m -J-Dapple.laf.useScreenMenuBar=true -J-Dapple.awt.graphics.UseQuartz=true -J-Dsun.java2d.noddraw=true -J-Dsun.java2d.dpiaware=true -J-Dsun.zip.disableMemoryMapping=true -J-Dfile.encoding=UTF-8"
here is link for Further Details
This solved my problem
String inputText = "some text with escaped chars"
InputStream is = new ByteArrayInputStream(inputText.getBytes("UTF-8"));

Java InputStream encoding/charset

Running the following (example) code
import java.io.*;
public class test {
public static void main(String[] args) throws Exception {
byte[] buf = {-27};
InputStream is = new ByteArrayInputStream(buf);
BufferedReader r = new BufferedReader(
new InputStreamReader(is, "ISO-8859-1"));
String s = r.readLine();
System.out.println("test.java:9 [byte] (char)" + (char)s.getBytes()[0] +
" (int)" + (int)s.getBytes()[0]);
System.out.println("test.java:10 [char] (char)" + (char)s.charAt(0) +
" (int)" + (int)s.charAt(0));
System.out.println("test.java:11 string below");
System.out.println(s);
System.out.println("test.java:13 string above");
}
}
gives me this output
test.java:9 [byte] (char)? (int)63
test.java:10 [char] (char)? (int)229
test.java:11 string below
?
test.java:13 string above
How do I retain the correct byte value (-27) in the line-9 printout? And consequently receive the expected output of the System.out.println(s) command (å).
If you want to retain byte values, don't use a Reader at all, ideally. To represent arbitrary binary data in text and convert it back to binary data later, you should use base16 or base64 encoding.
However, to explain what's going on, when you call s.getBytes() that's using the default character encoding, which apparently doesn't include Unicode character U+00E5.
If you call s.getBytes("ISO-8859-1") everywhere instead of s.getBytes() I suspect you'll get back the right byte value... but relying on ISO-8859-1 for this is kinda dirty IMO.
As noted, getBytes() (no-arguments) uses the Java platform default encoding, which may not be ISO-8859-1. Simply printing it should work, provided your terminal and the default encoding match and support the character. For instance, on my system, the terminal and default Java encoding are both UTF-8. The fact that you're seeing a '?' indicates that yours don't match or å is not supported.
If you want to manually encode to UTF-8 on your system, do:
String s = r.readLine();
byte[] utf8Bytes = s.getBytes("UTF-8");
It should give a byte array with {-61, -91}.

Categories

Resources