I think the easiest way to explain my problem is with a little example:
My string at the beginning is: PÃ¢tes, and the result should be: Pâtes. What I get as the result is still PÃ¢tes. How can I fix this?
Here is the code:
private String encode(String string) {
    try {
        byte ptext[] = string.getBytes("UTF8");
        string = new String(ptext, "UTF8");
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
    return string;
}
There are two problems with your code. The first is that you're using UTF8, whereas the canonical charset name is UTF-8 (most JVMs accept UTF8 as an alias, so this one isn't fatal, but prefer the canonical name or StandardCharsets.UTF_8).
The second is that you're essentially performing a no-op. string.getBytes("UTF-8") encodes the string's characters to UTF-8 bytes, and new String(ptext, "UTF-8") decodes those same bytes straight back, so you end up with exactly the string you started with.
What I think you mean is that the input is ISO-8859-1 and you want to convert it to UTF-8. (This fits with the example input and output you've given).
Try:
private String encode(String string) {
    try {
        byte[] ptext = string.getBytes("ISO-8859-1");
        string = new String(ptext, "UTF-8");
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
    return string;
}
This assumes that your initial string was originally read from somewhere and contains only ISO-8859-1 characters. As mentioned in a comment, you should try to ensure the data is loaded correctly from the source (i.e. while it is still just an array of bytes).
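For illustration, here is a minimal, self-contained sketch of the full round trip (using StandardCharsets, which also avoids the checked UnsupportedEncodingException):

import java.nio.charset.StandardCharsets;

public class MojibakeRoundTrip {
    public static void main(String[] args) {
        String original = "Pâtes";
        // Simulate the corruption: encode as UTF-8, then decode the bytes as ISO-8859-1.
        String mojibake = new String(original.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(mojibake); // PÃ¢tes
        // Reverse it, which is exactly what encode() above does.
        String repaired = new String(mojibake.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
        System.out.println(repaired); // Pâtes
    }
}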
Related
I have file encoded in ISO-8859-1. I'm trying to read it in as a single String, do some regex substitutions on it, and write it back out in the same encoding.
However, the resulting file I get always seems to be UTF-8 (according to Notepad++ at least), mangling some characters.
Can anyone see what I'm doing wrong here?
private static void editFile(File source, File target) {
    // Source and target encoding
    Charset iso88591charset = Charset.forName("ISO-8859-1");

    // Read the file as a single string
    String fileContent = null;
    try (Scanner scanner = new Scanner(source, iso88591charset)) {
        fileContent = scanner.useDelimiter("\\Z").next();
    } catch (IOException exception) {
        LOGGER.error("Could not read input file as a single String.", exception);
        return;
    }

    // Do some regex substitutions on the fileContent string
    String newContent = regex(fileContent);

    // Write the file back out in target encoding
    try (BufferedWriter writer = new BufferedWriter(
            new OutputStreamWriter(new FileOutputStream(target), iso88591charset))) {
        writer.write(newContent);
    } catch (Exception exception) {
        LOGGER.error("Could not write out edited file!", exception);
    }
}
There is nothing actually wrong with your code. Notepad++ reports the file as UTF-8 because, at the byte level, there is often no difference between UTF-8 and the encoding you're using: the two are identical across the ASCII range and only diverge in the upper half, where ISO-8859-1 has just 256 code points in total while UTF-8 covers all of Unicode. A file that happens to contain only bytes valid in both is indistinguishable, so the editor has to guess. You can read more by searching for ISO-8859-1 vs UTF-8.
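A quick sketch of why the detection is ambiguous: for the ASCII range the two encodings produce byte-for-byte identical output, and they only diverge for accented characters and the like:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingOverlap {
    public static void main(String[] args) {
        String ascii = "plain ASCII text";
        // Identical bytes in both encodings, so an editor cannot tell them apart.
        System.out.println(Arrays.equals(
                ascii.getBytes(StandardCharsets.ISO_8859_1),
                ascii.getBytes(StandardCharsets.UTF_8))); // true
        // Outside ASCII they diverge: 'â' is one byte in ISO-8859-1, two in UTF-8.
        System.out.println("Pâtes".getBytes(StandardCharsets.ISO_8859_1).length); // 5
        System.out.println("Pâtes".getBytes(StandardCharsets.UTF_8).length);      // 6
    }
}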
I've created a simple project with your code and tested it with characters that differ between the two encodings; the result is a file that IntelliJ (and probably Notepad++ as well; I can't easily check, I'm on Linux) recognizes as ISO-8859-1. Apart from that, I've added another class that uses the new (JDK 11) methods on the Files class. The Scanner(source, charset) constructor that you've used was added in JDK 10, so I think you may already be on 11. Here's the simplified code:
private static void editFile(File source, File target) {
    Charset charset = StandardCharsets.ISO_8859_1;

    String fileContent;
    try {
        fileContent = Files.readString(source.toPath(), charset);
    } catch (IOException exception) {
        System.err.println("Could not read input file as a single String.");
        exception.printStackTrace();
        return;
    }

    String newContent = regex(fileContent);

    try {
        Files.writeString(target.toPath(), newContent, charset);
    } catch (IOException exception) {
        System.err.println("Could not write out edited file!");
        exception.printStackTrace();
    }
}
Feel free to clone the repository or check it on GitHub and use whichever code version you prefer.
I receive an XML file with a tag whose value is "97ò00430 ò", although this tag initially contains only numbers. The encoding used is ISO-8859-1.
How can I detect the bad characters (ò...) in Java, please?
I guess you could use a regex to check the format of your tag (here, "\d+" if you want numbers only), as sketched below.
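A minimal sketch of that check (the variable name is made up for illustration):

// Hypothetical tag value taken from the question; "\\d+" accepts digits only.
String tagValue = "97ò00430 ò";
if (!tagValue.matches("\\d+")) {
    System.out.println("Tag contains unexpected characters: " + tagValue);
}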
public static String encode(String chr) {
    try {
        // Interpret the string's characters as ISO-8859-1 bytes.
        byte[] bytes = chr.getBytes("ISO-8859-1");
        // If those bytes don't form valid UTF-8, leave the string untouched.
        if (!validUTF8(bytes))
            return chr;
        // Otherwise re-decode them as UTF-8 to undo the mojibake.
        return new String(bytes, "UTF-8");
    } catch (UnsupportedEncodingException e) {
        throw new IllegalStateException("No char " + e.getMessage());
    }
}
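The snippet above references a validUTF8 helper that isn't shown; a minimal sketch of what it presumably does, using a strict CharsetDecoder (whose default error action is to report malformed input):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

private static boolean validUTF8(byte[] bytes) {
    try {
        // decode() throws CharacterCodingException on malformed UTF-8 sequences.
        StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
        return true;
    } catch (CharacterCodingException e) {
        return false;
    }
}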
I have a string like this in Java:
"\xd0\xb5\xd0\xbd\xd0\xb4\xd0\xbf\xd0\xbe\xd0\xb9\xd0\xbd\xd1\x82"
How can I convert it to a human readable equivalent?
Note:
actually it is GWT, and this string is coming from Python as part of JSON data.
The JSONParser transforms it into something that is totally irrelevant, so I want to be able to convert the string prior to parsing.
The expected, so called by me "human readable", should be "ендойнт" (https://mothereff.in/utf-8#%D0%B5%D0%BD%D0%B4%D0%BF%D0%BE%D0%B9%D0%BD%D1%82)
Assuming that the pattern is a repetition of escapes of the form "\x00", where 00 is two hex digits ([0-9a-fA-F]), note that each escape is one byte of the UTF-8 encoding, not one character: casting a single byte to char decodes it as Latin-1 and can never produce the multi-byte Cyrillic characters. Collect all the bytes first, then decode the whole array as UTF-8:
// Uses java.io.ByteArrayOutputStream and java.nio.charset.StandardCharsets.
String values = "\\xd0\\xb5\\xd0\\xbd\\xd0\\xb4\\xd0\\xbf\\xd0\\xbe\\xd0\\xb9\\xd0\\xbd\\xd1\\x82";
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
for (String val : values.split("\\\\x")) {
    // Each token is one hex-encoded byte of the UTF-8 representation.
    if (val.length() > 0) buffer.write(Integer.parseInt(val, 16));
}
System.out.println(new String(buffer.toByteArray(), StandardCharsets.UTF_8)); // ендойнт
Note that the if condition is due to the first delimiter: see How to prevent java.lang.String.split() from creating a leading empty string?
I don't know whether it was just my console misrendering the output, but you may try this code:
import java.io.UnsupportedEncodingException;
import javax.xml.bind.DatatypeConverter; // part of JAXB; needs a separate dependency on Java 11+

public class Utf8Decoder {

    public static void main(String[] args) {
        String url = "\\xd0\\xb5\\xd0\\xbd\\xd0\\xb4\\xd0\\xbf\\xd0\\xbe\\xd0\\xb9\\xd0\\xbd\\xd1\\x82";
        // Remove the \x markers, leaving a plain hex string.
        url = url.replaceAll("\\\\x", "");
        // Parse the hex into bytes and decode them as UTF-8.
        String result = "";
        try {
            byte[] bytes = DatatypeConverter.parseHexBinary(url);
            result = new String(bytes, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        System.out.print("decoded value:" + result);
    }
}
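Since javax.xml.bind was removed from the JDK in Java 11, here is a sketch of the same parsing on Java 17+ using java.util.HexFormat instead (no extra dependency needed):

import java.nio.charset.StandardCharsets;
import java.util.HexFormat;

public class Utf8DecoderModern {
    public static void main(String[] args) {
        String url = "\\xd0\\xb5\\xd0\\xbd\\xd0\\xb4\\xd0\\xbf\\xd0\\xbe\\xd0\\xb9\\xd0\\xbd\\xd1\\x82";
        // Strip the literal "\x" markers, leaving plain hex digits.
        byte[] bytes = HexFormat.of().parseHex(url.replace("\\x", ""));
        System.out.println("decoded value:" + new String(bytes, StandardCharsets.UTF_8));
    }
}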
I have a function that returns a String:
public String getString(String password) {
    ......
    try {
        .......
        encodedPassword = Base64.encodeToString(msgDigest, 1);
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    } catch (NoSuchAlgorithmException e) {
        e.printStackTrace();
    }
    return encodedPassword;
}
I want to concatenate a "=" to the string returned from the function.
I tried this:
encrptdPassword = getString("1234");
encrptdPassword = encrptdPassword+"=";
Or:
encrptdPassword = encrptdPassword .concat("=");
but the result looks like two different objects (with a space or a break in between).
I think the problem is in Base64.encodeToString, but I must use a Base64 string.
Function getString returns me:
A6xnQhbz4Vx2HuGl4lXwZ5U2I8iziLRFnhP5eNfIRvQ
I want to add = to the returning string as:
A6xnQhbz4Vx2HuGl4lXwZ5U2I8iziLRFnhP5eNfIRvQ=
but I receive this on output
A6xnQhbz4Vx2HuGl4lXwZ5U2I8iziLRFnhP5eNfIRvQ =
Or:
A6xnQhbz4Vx2HuGl4lXwZ5U2I8iziLRFnhP5eNfIRvQ
=
...like 2 different strings.
Where am I going wrong?
I assume you're using Base64 from Apache Commons Codec.
The default constructor for this class uses "\r\n" as a line separator, which it adds to the end of every encoded line. If you don't want this, construct the object as:
new Base64(76, new byte[0]);
If this isn't the class you're calling (it looks like from your code sample you're calling a static method), check the API and see if you can set a line separator for the conversion.
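For example (assuming Commons Codec really is what's on the classpath), a line length of 0 disables chunking altogether, so no separator is ever appended:

import java.nio.charset.StandardCharsets;
import org.apache.commons.codec.binary.Base64;

public class NoWrapEncode {
    public static void main(String[] args) {
        byte[] msgDigest = "example digest bytes".getBytes(StandardCharsets.UTF_8);
        Base64 codec = new Base64(0); // lineLength 0 = no chunking, no line separator
        System.out.println(codec.encodeToString(msgDigest) + "=");
    }
}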
Isn't the 1 in Base64.encodeToString(msgDigest, 1) padding?
If it's not, then you could just trim() the string to remove the whitespace.
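If this is actually Android's android.util.Base64 (a guess, since the question doesn't name the class), that would explain everything: flag 1 is Base64.NO_PADDING, which is why the = is missing, and the DEFAULT encoder appends a line terminator, which is where the stray whitespace comes from. Passing NO_WRAP keeps the padding and drops the newline:

import android.util.Base64;

// NO_WRAP omits the trailing line terminator; padding ('=') is emitted by default,
// so there is no need to append it by hand. msgDigest is from getString() above.
encodedPassword = Base64.encodeToString(msgDigest, Base64.NO_WRAP);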
Hi,
I have a scenario where the default Charset should be overridden by UTF-8. I am using the class below, but I am not getting the expected output: I compare the results against a Unix system whose default charset is UTF-8. Am I going wrong somewhere in this program?
public class CharsetDisplay {

    public static void main(String[] args) {
        System.out.println(Charset.defaultCharset().name());
        System.out.println(Charset.isSupported("UTF-8"));
        final Charset UTF8_CHARSET = Charset.forName("UTF-8");
        try {
            byte[] byteArray = new byte[] {34, 34, 0};
            String str = new String(byteArray, UTF8_CHARSET);
            System.out.println("String*** " + str);
            // stringToHex is a helper method not shown in the question
            System.out.println("String to Hex *** " + stringToHex(str));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Prints output as
windows-1252
true
String*** ""
Note that after the "" in the string output I get a special character, which I don't get in a Unix env.
What do you expect the zero byte to render as in this environment? Your output looks exactly correct to me.
Don't forget that any differences that you encounter between environments might not be down to Java. If you're invoking your Java program from a console (which I expect you are), it's up to the console to actually convert the program's output to what you see on the screen. So depending on the charset the console is using, it's entirely possible for Java to output the characters that you expect, but for the console to fail to render them properly.
If Java doesn't pick up your locale's encoding properly you may have to tell it explicitly, at the command-line:
java -Dfile.encoding=utf-8 CharsetDisplay
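A quick way to verify what the JVM actually picked up (note that since JDK 18, JEP 400 makes UTF-8 the default charset regardless of the OS locale, so this mostly matters on older JDKs):

import java.nio.charset.Charset;

public class EncodingCheck {
    public static void main(String[] args) {
        // file.encoding is the system property the -D flag above sets.
        System.out.println(System.getProperty("file.encoding"));
        System.out.println(Charset.defaultCharset());
    }
}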