UTF-8 issue in Java code


I'm getting a string 'ÐÐ°Ð»ÐµÐ½Ð´Ð°ÑÐ' instead of getting 'Календари' in Java code. How can I convert 'ÐÐ°Ð»ÐµÐ½Ð´Ð°ÑÐ' to 'Календари'?
I tried both of the following:
String converted = new String(convert.getBytes("ISO-8859-1"), "UTF-8");
String converted = new String(convert.getBytes(), "UTF-8");

I believe your code is okay. The more likely problem is that your "real" input was not encoded the way you assume, so the conversion starts from the wrong bytes. To test this, I would do an explicit step-by-step Charset encode/decode to see where things break.
The encoding names you use are valid; see http://docs.oracle.com/javase/1.6/docs/guide/intl/encoding.doc.html
And the following runs normally:
// I suspect your problem is here: make sure you are encoding the string correctly from the byte/char stream. That is, make sure "iso-8859-1" really is the encoding of your input characters.
Charset charsetE = Charset.forName("iso-8859-1");
CharsetEncoder encoder = charsetE.newEncoder();
//i believe from here to the end will probably stay the same, as per your posted example.
Charset charsetD = Charset.forName("UTF-8");
CharsetDecoder decoder = charsetD.newDecoder();
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(inputString));
CharBuffer cbuf = decoder.decode(bbuf);
final String result = cbuf.toString();
System.out.println(result);
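To check the conversion itself in isolation, here is a minimal sketch that first produces the garbled form on purpose and then reverses it. The class and method names are made up for illustration. It assumes the damage was caused by decoding UTF-8 bytes as ISO-8859-1; if a different charset (for example windows-1252) was involved, the reverse step may not recover every character.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeRepairDemo {

    // Simulates the damage: UTF-8 bytes wrongly decoded as ISO-8859-1.
    static String misdecode(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverses it: recover the original bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The Russian word from the question, written with Unicode escapes.
        String original = "\u041A\u0430\u043B\u0435\u043D\u0434\u0430\u0440\u0438";
        String garbled = misdecode(original);
        System.out.println(garbled.length());                 // 18: two chars per Cyrillic letter
        System.out.println(repair(garbled).equals(original)); // true
    }
}
```

If this round trip works but your real data still comes out wrong, the input was probably damaged differently before it reached this code.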

Use Unicode escape values instead of non-ASCII string literals. For more information, see:
Russian on-screen keyboard (hover over for Unicode values)
And how about a list of Unicode characters?
Edit -
Note that it's important to use an output font that supports displaying Unicode values (e.g. Arial Unicode MS).
Example -
import java.awt.FlowLayout;
import javax.swing.JButton;
import javax.swing.JFrame;
import javax.swing.SwingUtilities;
final class RussianDisplayDemo extends JFrame
{
private static final long serialVersionUID = -3843706833781023204L;
/**
* Constructs a frame that is initially invisible to display Russian text
*/
RussianDisplayDemo()
{
setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
getContentPane().setLayout(new FlowLayout());
add(getRussianButton());
setLocationRelativeTo(null);
pack();
}
/**
* Returns a button with Russian text
*
* @return a button with Russian text
*/
private final JButton getRussianButton()
{
final JButton button = new JButton("\u0417\u0430\u043d\u044f\u0442\u043e"); // Russian for "Busy"
return button;
}
public static final void main(final String[] args)
{
SwingUtilities.invokeLater(new Runnable()
{
@Override
public final void run()
{
final RussianDisplayDemo demo = new RussianDisplayDemo();
demo.setVisible(true);
}
});
}
}

Related

Using Google Translate Java Library, Languages with special chars return question marks

I have set up a Java program for my apprenticeship project that takes a JSON file of English strings and outputs a JSON file in a different language, which is chosen in the console. Some languages, like French and Italian, come out with the correct translations, whereas Russian or Japanese come out as question marks.
I searched around and saw that I needed to get the bytes of my string and then encode them as UTF-8. I did this but was still getting question marks, so I started using the standard charsets built into Java and tried different ways of encoding/decoding the string.
The code below is one of those attempts; it gave me a different output: Ð?Ñ?Ð¸Ð²ÐµÑ?
package com.bis.propertyfiletranslator;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.List;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.googleapis.json.GoogleJsonResponseException;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.translate.Translate;
import com.google.api.services.translate.model.TranslationsListResponse;
import com.google.api.services.translate.model.TranslationsResource;
public class Translator {
public static Translate.Translations.List list;
private static final Charset UTF_8 = Charset.forName("UTF-8");
private static final Charset ISO = Charset.forName("ISO-8859-1");
public static void translateJSONMapThroughGoogle(String input, String output, String API, String language,
List<String> subLists) throws IOException, GeneralSecurityException {
Translate t = new Translate.Builder(GoogleNetHttpTransport.newTrustedTransport(),
JacksonFactory.getDefaultInstance(), null).setApplicationName("PhoenUX-Google-Translate").build();
try {
list = t.new Translations().list(subLists, language).setFormat("text");
list.setKey(API);
} catch (GoogleJsonResponseException e) {
if (e.getDetails().getMessage().equals("Invalid Value")) {
System.err.println(
"\n Language not currently supported, check the accepted language codes and try again.\n\n Language Requested: "
+ language);
} else {
System.out.println(e.getDetails().getMessage());
}
}
for (TranslationsResource translationsResource : response.getTranslations()) {
for (String key : JSONFunctions.jsonHashMap.keySet()) {
JSONFunctions.jsonHashMap.remove(key);
String value = translationsResource.getTranslatedText();
String encoded = new String(value.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
JSONFunctions.jsonHashMap.put(key, encoded);
System.out.println(encoded);
break;
}
}
JSONFunctions.outputTranslationsBackToJson(output);
}
}
This uses the Google Cloud library. I added a System.out.println so I could see the results of what I had tried, so this code should be all you need to replicate it.
I expect the output for "Hello" to be "Привет" (Russian); the actual output is ???? or Ð?Ñ?Ð¸Ð²ÐµÑ?, depending on the encoding I use.

String encoded = new String(...) is dead wrong. Just call
put(key, value).
Note that System.out.println may still show problems, because the OS console encoding might be some Windows ANSI code page. Such an encoding is likely not Unicode-capable, whereas a Java String holds Unicode.
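A small sketch of both points, with made-up class and method names. The first method repeats the re-encoding line from the question and shows why it garbles the text; the second prints through a PrintStream with an explicit UTF-8 encoding. An in-memory stream is used here only so the bytes can be inspected.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class TranslatedTextDemo {

    // The conversion from the question: it turns each UTF-8 byte into one char.
    static String reencodeLikeQuestion(String value) {
        return new String(value.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Prints the string through a PrintStream that encodes as UTF-8.
    static byte[] printAsUtf8(String value) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, "UTF-8")) {
            out.print(value);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String privet = "\u041F\u0440\u0438\u0432\u0435\u0442"; // the expected translation
        System.out.println(privet.length());                       // 6
        System.out.println(reencodeLikeQuestion(privet).length()); // 12: already garbled
        System.out.println(printAsUtf8(privet).length);            // 12 bytes of valid UTF-8
    }
}
```

Several of the 12 garbled characters are control characters that a console cannot display, which would match the "Ð?Ñ?Ð¸Ð²ÐµÑ?" output in the question.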

Java GZip makes small differences when compressing file and decompressing it again

After a week of work I designed a binary file format, and made a Java reader for it. It's just an experiment, which works fine, unless I'm using the GZip compression function.
I called my binary type MBDF (Minimal Binary Database Format), and it can store 8 different types:
Integer (There is nothing like a byte, short, long or anything like that, since it is stored in flexible space (bigger numbers take more space))
Float-32 (32-bits floating point format, like java's float type)
Float-64 (64-bits floating point format, like java's double type)
String (A string in UTF-16 format)
Boolean
Null (Just specifies a null value)
Array (Something like java's ArrayList<Object>)
Compound (A String - Object map)
I used this data as test data:
COMPOUND {
float1: FLOAT_32 3.3
bool2: BOOLEAN true
float2: FLOAT_64 3.3
int1: INTEGER 3
compound1: COMPOUND {
xml: STRING "two length compound"
int: INTEGER 23
}
string1: STRING "Hello world!"
string2: STRING "3"
arr1: ARRAY [
STRING "Hello world!"
INTEGER 3
STRING "3"
FLOAT_32 3.29
FLOAT_64 249.2992
BOOLEAN true
COMPOUND {
str: STRING "one length compound"
}
BOOLEAN false
NULL null
]
bool1: BOOLEAN false
null1: NULL null
}
The xml key in a compound does matter!!
I made a file from it using this java code:
MBDFFile.writeMBDFToFile(
"/Users/<anonymous>/Documents/Java/MBDF/resources/file.mbdf",
b.makeMBDF(false)
);
Here, the variable b is a MBDFBinary object, containing all the data given above. With the makeMBDF function it generates the ISO 8859-1 encoded string and if the given boolean is true, it compresses the string using GZip. Then, when writing, an extra information character is added at the beginning of the file, containing information about how to read it back.
Then, after writing the file, I read it back into java and parse it
MBDF mbdf = MBDFFile.readMBDFFromFile("/Users/<anonymous>/Documents/Java/MBDF/resources/file.mbdf");
System.out.println(mbdf.getBinaryObject().parse());
This prints exactly the information mentioned above.
Then I try to use compression:
MBDFFile.writeMBDFToFile(
"/Users/<anonymous>/Documents/Java/MBDF/resources/file.mbdf",
b.makeMBDF(true)
);
I do exactly the same to read it back as I did with the uncompressed file, which should work. It prints this information:
COMPOUND {
float1: FLOAT_32 3.3
bool2: BOOLEAN true
float2: FLOAT_64 3.3
int1: INTEGER 3
compound1: COMPOUND {
xUT: STRING 'two length compound'
int: INTEGER 23
}
string1: STRING 'Hello world!'
string2: STRING '3'
arr1: ARRAY [
STRING 'Hello world!'
INTEGER 3
STRING '3'
FLOAT_32 3.29
FLOAT_64 249.2992
BOOLEAN true
COMPOUND {
str: STRING 'one length compound'
}
BOOLEAN false
NULL null
]
bool1: BOOLEAN false
null1: NULL null
}
Compared with the original data, the name xml has changed into xUT for some reason...
After some research I found small differences between the binary data before compression and the data after it. Patterns such as 110011 change into 101010.
When I make the name xml longer, like xmldm, it is parsed correctly as xmldm.
So far I have only seen the problem occur with three-character names.
Directly compressing and decompressing the generated string (without saving it to a file and reading it back) does work, so maybe the bug is caused by the file encoding.
As far as I know, the string output is in ISO 8859-1 format, but I couldn't get the file encoding right. When a file is read, its raw bytes are read and every byte is decoded as an ISO 8859-1 character.
I can think of a few possible causes, but I don't know how to test them:
The GZip output has a different encoding than the uncompressed data, causing small differences when it is stored as a file.
The file is stored as UTF-8, ignoring the request to use ISO 8859-1 encoding (I don't know how to explain it better :) )
There is a small bug in the Java GZip libraries.
Which of these is true, and if none of them is, what is the real reason for this bug?
I haven't been able to figure it out so far.
The MBDFFile class, reading and storing the files:
/* MBDFFile.java */
package com.redgalaxy.mbdf;
import java.io.*;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
public class MBDFFile {
public static MBDF readMBDFFromFile(String filename) throws IOException {
// FileInputStream is = new FileInputStream(filename);
// InputStreamReader isr = new InputStreamReader(is, "ISO-8859-1");
// BufferedReader br = new BufferedReader(isr);
//
// StringBuilder builder = new StringBuilder();
//
// String currentLine;
//
// while ((currentLine = br.readLine()) != null) {
// builder.append(currentLine);
// builder.append("\n");
// }
//
// builder.deleteCharAt(builder.length() - 1);
//
//
// br.close();
Path path = Paths.get(filename);
byte[] data = Files.readAllBytes(path);
return new MBDF(new String(data, "ISO-8859-1"));
}
private static void writeToFile(String filename, byte[] txt) throws IOException {
// BufferedWriter writer = new BufferedWriter(new FileWriter(filename));
//// FileWriter writer = new FileWriter(filename);
// writer.write(txt.getBytes("ISO-8859-1"));
// writer.close();
// PrintWriter pw = new PrintWriter(filename, "ISO-8859-1");
FileOutputStream stream = new FileOutputStream(filename);
stream.write(txt);
stream.close();
}
public static void writeMBDFToFile(String filename, MBDF info) throws IOException {
writeToFile(filename, info.pack().getBytes("ISO-8859-1"));
}
}
The pack function generates the final string for the file, in ISO 8859-1 format.
For all the other code, see my MBDF Github repository.
I left the approaches I tried commented out in the code.
My workspace:
- Macbook Air '11 (High Sierra)
- IntelliJ IDEA Community 2017.3
- JDK 1.8
I hope this is enough information, this is actually the only way to make clear what I'm doing, and what exactly isn't working.
Edit: MBDF.java
/* MBDF.java */
package com.redgalaxy.mbdf;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
public class MBDF {
private String data;
private InfoTag tag;
public MBDF(String data) {
this.tag = new InfoTag((byte) data.charAt(0));
this.data = data.substring(1);
}
public MBDF(String data, InfoTag tag) {
this.tag = tag;
this.data = data;
}
public MBDFBinary getBinaryObject() throws IOException {
String uncompressed = data;
if (tag.isCompressed) {
uncompressed = GZipUtils.decompress(data);
}
Binary binary = getBinaryFrom8Bit(uncompressed);
return new MBDFBinary(binary.subBit(0, binary.getLen() - tag.trailing));
}
public static Binary getBinaryFrom8Bit(String s8bit) {
try {
byte[] bytes = s8bit.getBytes("ISO-8859-1");
return new Binary(bytes, bytes.length * 8);
} catch( UnsupportedEncodingException ignored ) {
// This is not gonna happen because encoding 'ISO-8859-1' is always supported.
return new Binary(new byte[0], 0);
}
}
public static String get8BitFromBinary(Binary binary) {
try {
return new String(binary.getByteArray(), "ISO-8859-1");
} catch( UnsupportedEncodingException ignored ) {
// This is not gonna happen because encoding 'ISO-8859-1' is always supported.
return "";
}
}
/*
* Adds leading zeroes to the binary string, so that the final amount of bits is 16
*/
private static String addLeadingZeroes(String bin, boolean is16) {
int len = bin.length();
long amount = (long) (is16 ? 16 : 8) - len;
// Create zeroes and append binary string
StringBuilder zeroes = new StringBuilder();
for( int i = 0; i < amount; i ++ ) {
zeroes.append(0);
}
zeroes.append(bin);
return zeroes.toString();
}
public String pack(){
return tag.getFilePrefixChar() + data;
}
public String getData() {
return data;
}
public InfoTag getTag() {
return tag;
}
}
This class contains the pack() method. data is already compressed here (if it should be).
For the other classes, please see the GitHub repository; I don't want to make my question too long.

Solved it myself!
The cause was the way I read and wrote the files. When I exported a file, I built a string by using the ISO-8859-1 table to turn bytes into characters. I then wrote that string to a text file, which is UTF-8. The big problem was that I used FileWriter instances to write it, and those are meant for text files.
Reading did the inverse. The complete file was read into memory as a string (memory consuming!!) and was then decoded.
I didn't realize that a file is binary data, and that text is just one specific way of formatting it; ISO-8859-1 and UTF-8 are two such formats. I had problems with UTF-8 because it split some characters into two bytes, which I couldn't handle...
My solution was to use streams. Java has FileInputStream and FileOutputStream, which can be used to read and write binary files. I hadn't used them because I thought there was no big difference ("files are text, so what's the problem?"), but there is... I implemented this (by writing a new, similar library) and I'm now able to pass any input stream to the decoder and any output stream to the encoder. To make uncompressed files, you pass a FileOutputStream. GZipped files can use a GZIPOutputStream wrapped around a FileOutputStream. If someone wants a string with the binary data, a ByteArrayOutputStream can be used. The same rules apply to reading, where the InputStream variant of each of these streams should be used.
No more UTF-8 or ISO-8859-1 problems, and it works, even with GZip!
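For anyone who wants to see the difference, here is a minimal sketch (not the code from my library; the class and method names are made up for this example). It compresses and decompresses on byte arrays only, and then shows what happens to the same compressed bytes when they take a detour through text that gets encoded as UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipBytesDemo {

    // Compresses raw bytes; no String is involved at any point.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(sink)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray(); // complete, because gz was closed above
    }

    // Decompresses GZIP bytes back into the raw bytes.
    static byte[] gunzip(byte[] packed) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gz.read(buffer)) != -1) {
                sink.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // What a text writer using UTF-8 does to binary data held in a String:
    // every byte >= 0x80 becomes two bytes in the output.
    static byte[] throughUtf8Text(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = "xml: two length compound".getBytes(StandardCharsets.ISO_8859_1);
        byte[] packed = gzip(data);
        System.out.println(Arrays.equals(gunzip(packed), data));           // true
        System.out.println(throughUtf8Text(packed).length > packed.length); // true: bytes were altered
    }
}
```

The second check is always true because the GZIP header itself contains the byte 0x8B, which UTF-8 encodes as two bytes.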

load RTF into JTextPane

I created a class that extends JTextPane in my text editor program. It has text and rich-text subclasses that inherit from my main JTextPane class. However, I'm unable to load RTF into my rich-text editor, because the method for reading from a FileInputStream isn't in the superclass JTextPane. So how do I read rich text into a JTextPane? This seems very simple, so I must be missing something. I see lots of examples that use RTFEditorKit to fill a JTextPane, but none where the pane is a subclass like mine.
public class RichTextEditor extends TextEditorPane {
private final String extension = ".rtf";
private final String filetype = "text/richtext";
public RichTextEditor() {
// super( null, "", "Untitled", null );
super();
// this.setContentType( "text/richtext" );
}
/**
* Constructor for tabs with content.
*
* @param stream
* @param path
* @param fileName
* @param color
*/
public RichTextEditor( FileInputStream stream, String path, String fileName, Color color, boolean saveEligible ) {
super( path, fileName, color, saveEligible );
super.getScrollableTracksViewportWidth();
//RTFEditorKit rtf = new RTFEditorKit();
//this.setEditorKit( rtf );
setEditor();
this.read(stream, this.getDocument(), 0);
//this.read( stream, "RTFEditorKit" );
this.getDocument().putProperty( "file name", fileName );
}
private void setEditor() {
this.setEditorKit( new RTFEditorKit() );
}
the line:
this.read(stream, this.getDocument(), 0);
tells me
The method read(InputStream, Document) in the type JEditorPane is not applicable for the arguments (FileInputStream, Document, int)

To be able to access your editor kit, you should keep a reference to it. Your setEditor() method is named setXXX, so it should be a setter (in fact, I'm not convinced that you need to set the kit more than once, so it may be that this method should not exist at all). Define a field:
private RTFEditorKit kit = new RTFEditorKit();
Then in the constructor,
setEditorKit( kit );
kit.read(...);
If you insist on keeping the method, its code should be
kit = new RTFEditorKit();
setEditorKit( kit );
And if you call this method from the constructor, remember to initialize kit to null so as not to create an extra object that will be immediately discarded.

I've been looking for a java implementation for loading an RTF document into a JTextPane. Besides this thread, I couldn't find anything else. Thus, I'll post here my solution in case this helps other developers:
private static final RTFEditorKit RTF_KIT = new RTFEditorKit();
(...)
_txtHelp.setContentType("text/rtf");
final InputStream inputStream = new FileInputStream(_helpFile);
final DefaultStyledDocument styledDocument = new DefaultStyledDocument(new StyleContext());
RTF_KIT.read(inputStream, styledDocument, 0);
_txtHelp.setDocument(styledDocument);
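In case a self-contained version is useful, here is a minimal sketch of the same idea. The class name, the method name and the tiny RTF sample are made up for this example; it reads from an in-memory stream so it can run without a file, and a FileInputStream works the same way.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;

public class RtfLoadDemo {

    // Parses RTF markup with RTFEditorKit and returns the plain text of the result.
    static String loadRtfText(String rtf) {
        RTFEditorKit kit = new RTFEditorKit();
        DefaultStyledDocument document = new DefaultStyledDocument();
        try (InputStream in = new ByteArrayInputStream(rtf.getBytes(StandardCharsets.US_ASCII))) {
            // The three-argument read(...) belongs to the editor kit, not to JTextPane.
            kit.read(in, document, 0);
            return document.getText(0, document.getLength());
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("Could not read RTF", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(loadRtfText("{\\rtf1\\ansi Hello RTF}"));
    }
}
```

To show the result in a pane, call setEditorKit(kit) and setDocument(document) on the JTextPane, as in the snippet above.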

GSON - use unicode characters

In my GSON testing class I have a class with a string that has to be serialized.
The problem is, that special unicode characters like the \u06A4 or the ► are converted to ?. That is not how I want this to work.
Here's my class:
public final class JSONvsBinary {
public static final void run() throws Exception {
A a1 = new A();
a1.a = "bla, blu., € # xyz Ø, \u06A4 ►";
GsonBuilder builder = new GsonBuilder();
builder.excludeFieldsWithModifiers(Modifier.TRANSIENT);
builder.setPrettyPrinting();
builder.disableHtmlEscaping();
builder.serializeNulls();
builder.serializeSpecialFloatingPointValues();
Gson gson = builder.create();
final String gsonString = gson.toJson(a1, A.class);
final byte[] gsonBytes = gsonString.getBytes("UTF8");
System.out.println("GSON:\n" + new String(gsonBytes, "UTF8"));
System.out.println("GSON bytes: " + gsonBytes.length);
}
@SuppressWarnings("unused")
private static final class A {
public String a;
}
}
And that's the output:
GSON:
{
"a": "bla, blu., € # xyz Ø, ? ?"
}
GSON bytes: 44
I set the byte encoding to UTF-8 but it doesn't work...

One, make sure your compiler and editor are using the same encoding. This is usually not an issue in an IDE.
The problem is probably here: System.out.println.
From the documentation for PrintStream:
All characters printed by a PrintStream are converted into bytes using the platform's default character encoding.
So, depending on the platform encoding, System.out can lose data.
On top of this, the device that System.out sends data to must have a glyph available for each code point in the character data in order to render it.
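The data loss can be reproduced without a console. This is a minimal sketch with made-up names; ISO-8859-1 stands in for whatever the platform default encoding is on the affected machine, which is an assumption since the question does not say.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UnmappableDemo {

    // Encodes and decodes with the same charset, as a stream writer plus a terminal would.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String special = "\u06A4 \u25BA"; // the two characters from the question
        // Characters the charset cannot represent are replaced with '?'.
        System.out.println(roundTrip(special, StandardCharsets.ISO_8859_1));             // ? ?
        // UTF-8 can represent every code point, so nothing is lost.
        System.out.println(roundTrip(special, StandardCharsets.UTF_8).equals(special)); // true
    }
}
```

The encoding to UTF-8 bytes in the question was therefore fine; the question marks only appear when the text is encoded a second time on its way to the console.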

How to replace ï¿½ in a string

I have a string that contains the character ï¿½, and I haven't been able to replace it correctly.
String.replace("ï¿½", "");
doesn't work, does anyone know how to remove/replace the ï¿½ in the string?

That's the Unicode replacement character, \uFFFD.
Something like this should work:
String strImport = "For some reason my �double quotes� were lost.";
strImport = strImport.replaceAll("\uFFFD", "\"");

Character issues like this are difficult to diagnose because information is easily lost through misinterpretation of characters via application bugs, misconfiguration, cut'n'paste, etc.
As I (and apparently others) see it, you've pasted three characters:
codepoint  glyph  escaped  windows-1252  info
=======================================================================
U+00ef     ï      \u00ef   ef            LATIN_1_SUPPLEMENT, LOWERCASE_LETTER
U+00bf     ¿      \u00bf   bf            LATIN_1_SUPPLEMENT, OTHER_PUNCTUATION
U+00bd     ½      \u00bd   bd            LATIN_1_SUPPLEMENT, OTHER_NUMBER
To identify the character, download and run the program from this page. Paste your character into the text field and select the glyph mode; paste the report into your question. It'll help people identify the problematic character.

You are asking to replace the character "�", but for me it comes through as three characters: 'ï', '¿' and '½'. This might be your problem... If you are using a Java version prior to 1.5, you only get UCS-2 characters, that is, only the first 65K Unicode code points (the Basic Multilingual Plane). Based on other comments, the character you are looking for is most likely '�', the Unicode replacement character. This is the character that is "used to replace an incoming character whose value is unknown or unrepresentable in Unicode".
Actually, looking at the comment from Kathy, the other issue you might be having is that javac is not interpreting your .java file as UTF-8, assuming that is what you are writing it in. Try using:
javac -encoding UTF-8 xx.java
Or, modify your source code to do:
String.replaceAll("\uFFFD", "");
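To see how the three characters and the single replacement character relate, here is a minimal sketch with made-up names. It assumes the three-character form arose from the UTF-8 bytes of U+FFFD being decoded as ISO-8859-1 (windows-1252 gives the same result for these three bytes), and it removes both forms.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {

    // Decodes the UTF-8 bytes of a string as ISO-8859-1, one char per byte.
    static String asLatin1Mojibake(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Removes U+FFFD itself as well as its three-character mis-decoded form.
    static String clean(String text) {
        return text.replace("\uFFFD", "").replace("\u00EF\u00BF\u00BD", "");
    }

    public static void main(String[] args) {
        // U+FFFD is EF BF BD in UTF-8, which reads as three Latin-1 characters.
        System.out.println(asLatin1Mojibake("\uFFFD").equals("\u00EF\u00BF\u00BD")); // true
        System.out.println(clean("a\uFFFDb"));             // ab
        System.out.println(clean("a\u00EF\u00BF\u00BDb")); // ab
    }
}
```

Either way the original character is already gone by the time it shows up like this, so removing it only tidies the text; the real fix is to read the input with the correct encoding.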

As others have said, you posted 3 characters instead of one. I suggest you run this little snippet of code to see what's actually in your string:
public static void dumpString(String text)
{
for (int i=0; i < text.length(); i++)
{
System.out.println("U+" + Integer.toString(text.charAt(i), 16)
+ " " + text.charAt(i));
}
}
If you post the results of that, it'll be easier to work out what's going on. (I haven't bothered padding the string - we can do that by inspection...)

Change the encoding to UTF-8 while parsing. This will remove the special characters.

Use the unicode escape sequence. First you'll have to find the codepoint for the character you seek to replace (let's just say it is ABCD in hex):
str = str.replaceAll("\uABCD", "");

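For more detail, the following program checks whether a string starts with the UTF-8 byte-order mark (BOM) and prints it with the BOM skipped: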
import java.io.UnsupportedEncodingException;
/**
* File: BOM.java
*
* check if the bom character is present in the given string print the string
* after skipping the utf-8 bom characters print the string as utf-8 string on a
* utf-8 console
*/
public class BOM
{
private final static String BOM_STRING = "ï»¿Hello World";
private final static String ISO_ENCODING = "ISO-8859-1";
private final static String UTF8_ENCODING = "UTF-8";
private final static int UTF8_BOM_LENGTH = 3;
public static void main(String[] args) throws UnsupportedEncodingException {
final byte[] bytes = BOM_STRING.getBytes(ISO_ENCODING);
if (isUTF8(bytes)) {
printSkippedBomString(bytes);
printUTF8String(bytes);
}
}
private static void printSkippedBomString(final byte[] bytes) throws UnsupportedEncodingException {
int length = bytes.length - UTF8_BOM_LENGTH;
byte[] barray = new byte[length];
System.arraycopy(bytes, UTF8_BOM_LENGTH, barray, 0, barray.length);
System.out.println(new String(barray, ISO_ENCODING));
}
private static void printUTF8String(final byte[] bytes) throws UnsupportedEncodingException {
System.out.println(new String(bytes, UTF8_ENCODING));
}
private static boolean isUTF8(byte[] bytes) {
if ((bytes[0] & 0xFF) == 0xEF &&
(bytes[1] & 0xFF) == 0xBB &&
(bytes[2] & 0xFF) == 0xBF) {
return true;
}
return false;
}
}

Dissect the URL encoding and the Unicode error. This symbol has shown up for me as well, in Google Translate with Armenian text and sometimes with broken Burmese.

profilage basï¿½ sur l'analyse de l'esprit (French)
should be translated as:
profilage basé sur l'analyse de l'esprit
so, in this case ï¿½ = é

None of the answers above resolved my issue. When I download the XML, ï»¿ is prepended to it (ï»¿<xml ...). I simply do:
xml = parser.getXmlFromUrl(url);
xml = xml.substring(3); // removes the first three characters from the string
Now it runs correctly.
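A slightly more defensive variant, as a sketch with made-up names: it only strips the prefix when it is really there, and it also handles the case where the download was decoded as UTF-8, in which the byte-order mark arrives as the single character U+FEFF.

```java
public class BomStripDemo {

    // Removes a leading byte-order mark, if present, and leaves other text alone.
    static String stripBom(String text) {
        if (text == null || text.isEmpty()) {
            return text;
        }
        // BOM decoded correctly as UTF-8: one char, U+FEFF.
        if (text.charAt(0) == '\uFEFF') {
            return text.substring(1);
        }
        // BOM bytes EF BB BF decoded as ISO-8859-1 or windows-1252: three chars.
        if (text.startsWith("\u00EF\u00BB\u00BF")) {
            return text.substring(3);
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(stripBom("\uFEFF<xml/>"));             // <xml/>
        System.out.println(stripBom("\u00EF\u00BB\u00BF<xml/>")); // <xml/>
        System.out.println(stripBom("<xml/>"));                   // <xml/>
    }
}
```

An unconditional substring(3) cuts off real content whenever the server sends the document without a BOM.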
