Java Output UTF-8 to Real Characters?

In Java, how can I print a string of Unicode escapes as the real characters?
我们
\u6211\u4eec
String str = new String("\u6211\u4eec");
System.out.println(str); // still outputs \u6211\u4eec, but I expect 我们
-----
String tmp = request.getParameter("tag");
System.out.println("request:"+tmp);
System.out.println("character set :"+request.getCharacterEncoding());
String tmp1 = new String("\u6211\u4eec");
System.out.println("string equal:"+(tmp.equalsIgnoreCase(tmp1)));
String tag = new String(tmp);
System.out.println(tag);
request:\u6211\u4eec
character set :UTF-8
string equal:false
\u6211\u4eec
From the output, the value from the request is the same as the string value of tmp1, but why does equalsIgnoreCase output false?

Did you try to display just one of them? For example:
String str = new String("\u6211");
System.out.println(str);
I bet there is a problem in how you create that string.

Java Strings are encoded in UTF-16. I do not see any problem in your code; I believe the problem comes from your console, which does not display the content of the String correctly.
If you are using Eclipse, change your console encoding to UTF-8 here:
Eclipse > Preferences > General > Workspace > Text file encoding
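
One possible explanation for the false comparison, assuming the request parameter carries the escapes as literal text (a backslash, the letter u, and four hex digits) rather than the two characters themselves: the Java compiler translates such escapes only inside source code, so tmp would hold twelve characters while tmp1 holds two. A minimal sketch of decoding such text follows; the class UnescapeDemo and the helper unescape are hypothetical names, not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnescapeDemo {
    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Hypothetical helper: replaces each literal escape with the character it names.
    static String unescape(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String fromRequest = "\\u6211\\u4eec"; // literal text, 12 characters
        String fromSource = "\u6211\u4eec";    // translated by the compiler, 2 characters
        System.out.println(fromRequest.length());                      // 12
        System.out.println(fromSource.length());                       // 2
        System.out.println(fromRequest.equals(fromSource));            // false
        System.out.println(unescape(fromRequest).equals(fromSource));  // true
    }
}
```

If the console still shows the escapes or question marks after decoding, the console encoding mentioned above is the next thing to check.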

Related

Base64 encoding string in Java doubles its length

Problem
I am trying to encode the contents of .doc and .pdf files to a Base64 string in Java.
The encoded string is almost double the length of the original (115k -> 230k).
Encoding the same file contents in Python, PHP, or any online tool only gives an increase of about a third (115k -> 154k).
What causes this size increase in Java, and is there any way to get a result equivalent to the other tools?
Code
import java.util.Base64;
...
//String content;
System.out.println(content.length());
String encodedStr = new String(Base64.getEncoder().encode(content.getBytes()));
System.out.println(encodedStr.length());
String urlEncodedStr = new String(Base64.getUrlEncoder().encode(content.getBytes()));
System.out.println(urlEncodedStr.length());
String mimeEncodedStr = new String(Base64.getMimeEncoder().encode(content.getBytes()));
System.out.println(mimeEncodedStr.length());
Output
For pdf file
115747
230816
230816
236890
For doc file
13685
26392
26392
27086

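First, don't wrap the encoder output in new String(...); Base64.Encoder.encodeToString returns a String directly. Second, pass an explicit encoding to String.getBytes(String) (e.g. content.getBytes(encoding)) instead of relying on the platform default. For example,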
String encodedStr = Base64.getEncoder()
.encodeToString(content.getBytes("UTF-8"));
or
String encodedStr = Base64.getEncoder()
.encodeToString(content.getBytes("US-ASCII"));
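
A plausible cause of the doubling, assuming content was produced by reading the binary file into a String: every byte above 0x7F becomes a character that takes two or more bytes when getBytes() re-encodes it, so the encoder sees far more input than the file contains. The sketch below simulates that with a synthetic byte array; the class Base64SizeDemo and its helpers are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class Base64SizeDemo {
    // Length of the Base64 text for the given raw bytes.
    static int encodedLength(byte[] data) {
        return Base64.getEncoder().encodeToString(data).length();
    }

    // Simulates the suspected problem: binary data turned into a String,
    // then converted back to bytes as UTF-8.
    static byte[] viaString(byte[] data) {
        String content = new String(data, StandardCharsets.ISO_8859_1);
        return content.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = new byte[115747];   // same size as the PDF in the question
        Arrays.fill(raw, (byte) 0xFF);   // synthetic non-ASCII content
        System.out.println(encodedLength(raw));             // 154332 = 4 * ceil(115747 / 3)
        System.out.println(encodedLength(viaString(raw)));  // larger, because the input grew
    }
}
```

For a real file, the bytes would come from Files.readAllBytes(path) and go straight to the encoder without passing through a String.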

char from ldap not displaying correctly in java

In the Eclipse LDAP Browser plugin, I see an attribute value that contains a non-ASCII character (a lowercase n with a ~ above it). In UTF-8 this is the byte sequence c3 b1; in UCS-2, which I've read Java uses for its Strings, it is 00F1. But when I print it to the log file with the following code, it shows up as an uppercase A with a ~ above it, followed by a +/- symbol. All three output statements show the same thing.
while(allAttributes.hasNext()) {
LDAPAttribute attribute = (LDAPAttribute)allAttributes.next();
String attributeValue = new String(attribute.getStringValue());
byte[] attByteValue = attribute.getByteValue();
String utf8Str = new String( attByteValue, "UTF-8");
log.debug("attribute.getStringValue="+attribute.getStringValue());
log.debug("attributeValue="+attributeValue);
log.debug("utf8Str="+utf8Str);
boolean isValidUTF8 = Base64.isValidUTF8(attByteValue, true);
if (isValidUTF8) log.debug("string contains all valid UTF8 chars and UCS2 chars");
else log.debug("string contains invalid UTF8 char(s) or invalid UCS2 char(s)");
}
isValidUTF8 returns true, so it seems there are no invalid chars. Any suggestions on how to make it display correctly in the log?
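
The symptom described (an uppercase A with a tilde followed by a +/- sign) is what the two UTF-8 bytes c3 b1 look like when they are interpreted as ISO-8859-1. The sketch below only reproduces that effect in isolation; it assumes nothing about the LDAP library, and the class MojibakeDemo is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    static String asUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static String asLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xC3, (byte) 0xB1};  // UTF-8 encoding of n with tilde
        String correct = asUtf8(utf8);             // one character, code point 0xF1
        String misread = asLatin1(utf8);           // two characters, 0xC3 and 0xB1
        System.out.println(correct.length());                          // 1
        System.out.println(misread.length());                          // 2
        System.out.println(Integer.toHexString(correct.charAt(0)));    // f1
        System.out.println(Integer.toHexString(misread.charAt(0)));    // c3
        System.out.println(Integer.toHexString(misread.charAt(1)));    // b1
    }
}
```

If the String is already correct inside Java, the same picture appears when a UTF-8 log file is opened in a viewer that assumes ISO-8859-1 or Windows-1252, so the encoding of the log appender and of the viewer are worth checking.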

Android convert diamond question marks to UTF-8 Arabic string

I'm using an API that sends and receives raw bytes.
But I have a problem displaying the Arabic words that come over the API: they show up as diamond question marks "���".
I've tried converting the string from and to UTF-8.
This example returns plain question marks, without the black diamond ("??? ???"):
String str = new String(originalStr.getBytes("ISO-8859-1"), "UTF-8");
This one returns empty string :
String str = new String(originalStr.getBytes("WINDOWS-1256"), "UTF-8");
And this one also returns an empty string :
String str = new String(originalStr.getBytes("WINDOWS-1252"), "UTF-8");
I've succeeded in displaying the Arabic words in PHP by converting from cp1256 to utf-8:
echo iconv('cp1256', 'utf-8', $string);
The correct character encoding for this Arabic text is cp1256.
How can I achieve the same in Java?
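
A likely reason the attempts above fail is that originalStr was already decoded with the wrong charset, so the original bytes were replaced and cannot be recovered with getBytes. A minimal sketch of the usual approach, decoding the raw bytes from the API directly as windows-1256, is shown below. It assumes the runtime provides that charset and that the raw byte array is still available; Cp1256Demo and decode are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1256Demo {
    // Assumption: the runtime ships the windows-1256 charset.
    private static final Charset CP1256 = Charset.forName("windows-1256");

    // Decode the bytes exactly once, with the charset the sender used.
    static String decode(byte[] rawBytes) {
        return new String(rawBytes, CP1256);
    }

    public static void main(String[] args) {
        String expected = "\u0627\u0628";             // Arabic letters alef and beh
        byte[] rawBytes = expected.getBytes(CP1256);  // stands in for the bytes from the API

        System.out.println(decode(rawBytes).equals(expected));  // true

        // Decoding the same bytes as UTF-8 yields replacement characters,
        // which is where the diamond question marks come from.
        String wrong = new String(rawBytes, StandardCharsets.UTF_8);
        System.out.println(wrong.contains("\uFFFD"));           // true
    }
}
```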

java convert String windows-1251 to utf8

Scanner sc = new Scanner(System.in);
System.out.println("Enter text: ");
String text = sc.nextLine();
try {
String result = new String(text.getBytes("windows-1251"), Charset.forName("UTF-8"));
System.out.println(result);
} catch (UnsupportedEncodingException e) {
System.out.println(e);
}
I'm trying to convert text between keyboard layouts (Latin and Cyrillic). Example: qwerty -> йцукен
It doesn't work. Can anyone tell me what I'm doing wrong?

First, text in Java (String, char, Reader, Writer) is internally Unicode, so it can combine all scripts.
This is a major difference from, for instance, C/C++, where there is no such standard.
Now, System.in is an InputStream for historical reasons, so it needs to be told which encoding is used.
Scanner sc = new Scanner(System.in, "Windows-1251");
The above explicitly sets the conversion for System.in to Cyrillic. Without this optional parameter the default encoding is used, which is the platform encoding unless the software changed it. So your original Scanner might have been correct too.
Now text is correct, containing the Cyrillic from System.in as Unicode.
You would get the UTF-8 bytes as:
byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
The old "recoding" of text was wrong; drop that line. In fact, not all Windows-1251 byte sequences are valid UTF-8 multi-byte sequences.
String result = text;
System.out is a PrintStream, a rather rarely used historic class. It prints using the default platform encoding, so you more or less have to rely on that default being correct.
System.out.println(result);
For printing to an UTF-8 encoded file:
byte[] bytes = ("\uFEFF" + text).getBytes(StandardCharsets.UTF_8);
Path path = Paths.get("C:/Temp/test.txt");
Files.writeAllBytes(path, bytes);
Here I have added a Unicode BOM character in front so that Windows Notepad may recognize the encoding as UTF-8. In general one should avoid using a BOM. It is a zero-width no-break space (invisible) and plays havoc with all kinds of formats: CSV, XML, file concatenation, cut-copy-paste.
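
To illustrate the two points above, here is a small sketch under stated assumptions: brokenRecode repeats the question's "recoding" to show that it damages the text, and printAsUtf8 prints through a PrintStream with an explicit encoding instead of the platform default. The class and method names are hypothetical, and a byte buffer stands in for the console so that the result can be inspected.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RecodingDemo {
    // The conversion from the question: encode as windows-1251, decode as UTF-8.
    static String brokenRecode(String text) {
        byte[] cp1251 = text.getBytes(Charset.forName("windows-1251"));
        return new String(cp1251, StandardCharsets.UTF_8);
    }

    // Prints through a PrintStream whose encoding is set explicitly.
    static byte[] printAsUtf8(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(buffer, true, "UTF-8");
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u0439"; // Cyrillic short i, the first letter of the example
        System.out.println(brokenRecode(text).equals(text)); // false: the text was damaged
        System.out.println(printAsUtf8(text).length);        // 2: one letter, two UTF-8 bytes
    }
}
```

The same constructor can wrap System.out when the console is known to expect UTF-8.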

The reason you got an answer to a different question, and nobody answered yours, is that your title doesn't fit the question. You were not trying to convert between charsets, but between keyboard layouts.
Here you shouldn't worry about character encoding at all: simply read the line, convert it to an array of characters, go through them, and convert each one using a predefined map.
The code will be something like this:
Map<Character, Character> table = new TreeMap<Character, Character>(); // type arguments must be wrapper types, not primitives
table.put('q', 'й');
table.put('Q', 'Й');
table.put('w', 'ц');
// .... etc
String text = sc.nextLine();
char[] cArr = text.toCharArray();
for(int i=0; i<cArr.length; ++i)
{
if(table.containsKey(cArr[i]))
{
cArr[i] = table.get(cArr[i]);
}
}
text = new String(cArr);
System.out.println(text);
I haven't had time to test that code, but you should get the idea of how to do your task.
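
A self-contained version of the same idea might look like the sketch below. It is an illustration under assumptions: only the six keys from the qwerty example are mapped, so the rest of the layout would have to be added, and the class LayoutConverter and method convert are hypothetical names. The Cyrillic letters are written as escapes so the source compiles regardless of file encoding.

```java
import java.util.HashMap;
import java.util.Map;

class LayoutConverter {
    // Partial table: the keys q, w, e, r, t, y and the Cyrillic letters on the same keys.
    private static final String LATIN = "qwerty";
    private static final String CYRILLIC = "\u0439\u0446\u0443\u043a\u0435\u043d";

    private static final Map<Character, Character> TABLE = new HashMap<>();

    static {
        for (int i = 0; i < LATIN.length(); i++) {
            char from = LATIN.charAt(i);
            char to = CYRILLIC.charAt(i);
            TABLE.put(from, to);
            TABLE.put(Character.toUpperCase(from), Character.toUpperCase(to));
        }
    }

    // Replaces every mapped character and leaves all others unchanged.
    static String convert(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            Character mapped = TABLE.get(c);
            out.append(mapped != null ? mapped : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("qwerty").equals(CYRILLIC)); // true
        System.out.println(convert("123"));                     // 123
    }
}
```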

Java Replacing Help Needed

Hey guys, I'm trying to strip everything after the last slash so that only /hello/what/ remains, without REMOVEThis4.PNG. I don't want to use string.replace("REMOVEThis4.PNG", "") because I want this to work on other strings too, not only that one.
Any help would be great. My code:
String sFile = "/hello/what/REMOVEThis4.PNG";
if (sFile.contains("/")){
String Replaced = sFile.replaceAll("(?s)", "");
System.out.println(Replaced);
}
I want the output to be
/hello/what/
only. Thanks a lot!

If you are trying to parse a path, I recommend finding the last index of / and taking the substring up to that index plus one. So
string = string.substring(0, string.lastIndexOf("/") + 1);
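
As a runnable sketch of this approach, the helper below handles the case where the string contains no slash explicitly, and a second helper shows a regular expression that gives the same result for anyone who prefers replaceAll. The class PathPrefixDemo and both method names are hypothetical.

```java
class PathPrefixDemo {
    // Keeps everything up to and including the last slash.
    // Returns an empty string when there is no slash at all.
    static String directoryPart(String path) {
        int lastSlash = path.lastIndexOf('/');
        return lastSlash < 0 ? "" : path.substring(0, lastSlash + 1);
    }

    // Regex alternative: removes the trailing run of non-slash characters.
    static String directoryPartRegex(String path) {
        return path.replaceAll("[^/]*$", "");
    }

    public static void main(String[] args) {
        String sFile = "/hello/what/REMOVEThis4.PNG";
        System.out.println(directoryPart(sFile));       // /hello/what/
        System.out.println(directoryPartRegex(sFile));  // /hello/what/
    }
}
```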

No need to use regular expressions in your case:
String sFile = "/hello/what/REMOVEThis4.PNG";
// TODO check actual last index of "/" against -1
System.out.println(sFile.substring(0, sFile.lastIndexOf("/") + 1));
Output
/hello/what/
Note
In case you are dealing with actual files, you can probably spare yourself the String manipulation and use File.getParent() instead:
File file = new File("/hello/what/REMOVEThis4.PNG");
System.out.println(file.getParent());
Output (may change depending on your system)
\hello\what

Use Java's File API:
String example = "/hello/what/REMOVEThis4.PNG";
File file = new File(example);
System.out.println(example);
String absolutePath = file.getAbsolutePath();
String filePath = absolutePath.substring(0, absolutePath.lastIndexOf(File.separator));
System.out.println(filePath);
