Check if a string contains characters with bad encoding

I receive an XML file with a tag whose value is "97ÃÂ²00430 ÃÂ²", although this tag originally contained only digits. The encoding used is "ISO-8859-1".
How can I detect the bad characters (ÃÂ²...) in Java?
LNA

I guess you could use a Regex to check the format of your tag (here, "\d+" if you want numbers only).
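A minimal sketch of that regex check, assuming the tag value is already available as a String. The class and method names (TagCheck, isDigitsOnly) are made up for illustration; the digits-only rule comes from the question.

```java
import java.util.regex.Pattern;

class TagCheck {
    // In Java, \d matches only the ASCII digits 0-9 unless UNICODE_CHARACTER_CLASS is set.
    private static final Pattern DIGITS_ONLY = Pattern.compile("\\d+");

    // Returns true when the whole value consists of one or more digits.
    static boolean isDigitsOnly(String value) {
        return DIGITS_ONLY.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDigitsOnly("9700430"));                // true
        System.out.println(isDigitsOnly("97\u00C3\u00B200430"));    // false
    }
}
```

This only tells you that the value is not purely numeric; it does not say which encoding step went wrong.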

public static String encode(String chr) {
    try {
        // Recover the raw bytes, assuming the string was decoded as ISO-8859-1
        byte[] bytes = chr.getBytes("ISO-8859-1");
        // validUTF8 is a helper that is not shown in this answer
        if (!validUTF8(bytes))
            return chr;
        return new String(bytes, "UTF-8");
    } catch (UnsupportedEncodingException e) {
        throw new IllegalStateException("No char" + e.getMessage());
    }
}
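The validUTF8 helper is called above but never shown. A minimal sketch of what it might look like, assuming it only needs to report whether the bytes are well-formed UTF-8. The class name Utf8Check and the strict-decoder approach are assumptions, not the answerer's code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // Hypothetical stand-in for the validUTF8 helper used by encode().
    static boolean validUTF8(byte[] bytes) {
        // A decoder configured with REPORT throws instead of silently
        // substituting the replacement character for bad input.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(validUTF8(new byte[] {(byte) 0xC3, (byte) 0xA2})); // true
        System.out.println(validUTF8(new byte[] {(byte) 0xE2}));              // false
    }
}
```

Keep in mind that pure ASCII input is also valid UTF-8, so such strings pass the check and come back unchanged.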

Related

Reading and writing file in ISO-8859-1 encoding?

I have a file encoded in ISO-8859-1. I'm trying to read it in as a single String, do some regex substitutions on it, and write it back out in the same encoding.
However, the resulting file I get always seems to be UTF-8 (according to Notepad++ at least), mangling some characters.
Can anyone see what I'm doing wrong here?
private static void editFile(File source, File target) {
    // Source and target encoding
    Charset iso88591charset = Charset.forName("ISO-8859-1");
    // Read the file as a single string
    String fileContent = null;
    try (Scanner scanner = new Scanner(source, iso88591charset)) {
        fileContent = scanner.useDelimiter("\\Z").next();
    } catch (IOException exception) {
        LOGGER.error("Could not read input file as a single String.", exception);
        return;
    }
    // Do some regex substitutions on the fileContent string
    String newContent = regex(fileContent);
    // Write the file back out in target encoding
    try (BufferedWriter writer = new BufferedWriter(
            new OutputStreamWriter(new FileOutputStream(target), iso88591charset))) {
        writer.write(newContent);
    } catch (Exception exception) {
        LOGGER.error("Could not write out edited file!", exception);
    }
}

There is nothing actually wrong with your code. Notepad++ most likely reports the file as UTF-8 because an editor has to guess the encoding from the bytes, and for the ASCII range (0x00 to 0x7F) ISO-8859-1 and UTF-8 produce exactly the same bytes. The two differ only above that range: ISO-8859-1 stores each of its 256 characters in a single byte, while UTF-8 uses multi-byte sequences and can also represent the many characters that ISO-8859-1 lacks. You can read more here or by simply searching ISO-8859-1 vs UTF-8 in Google.
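The point about the two encodings sharing bytes for plain ASCII text can be checked directly. A minimal sketch using only the JDK; the class and method names (EncodingBytes, sameBytes) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingBytes {
    // True when the string encodes to the same bytes in ISO-8859-1 and UTF-8.
    static boolean sameBytes(String s) {
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(latin1, utf8);
    }

    public static void main(String[] args) {
        // Only ASCII characters: both encodings yield identical bytes.
        System.out.println(sameBytes("plain ASCII text")); // true
        // U+00E9 (e with acute) is one byte in ISO-8859-1 but two bytes in UTF-8.
        System.out.println(sameBytes("caf\u00e9"));        // false
    }
}
```

A file for which sameBytes is true carries no information that would let an editor tell the two encodings apart.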
I've created a simple project with your code and tested it with characters that are different for the ISO encoding - the result is a file that IntelliJ (and probably Notepad++ as well - cannot easily check, I'm on Linux) recognizes as ISO-8859-1. Apart from that, I've added another class that makes use of new (JDK11) features from Files class. The new Scanner(source, charset) that you've used was added in JDK10, so I think that you may be using 11 already. Here's the simplified code:
private static void editFile(File source, File target) {
    Charset charset = StandardCharsets.ISO_8859_1;
    String fileContent;
    try {
        fileContent = Files.readString(source.toPath(), charset);
    } catch (IOException exception) {
        System.err.println("Could not read input file as a single String.");
        exception.printStackTrace();
        return;
    }
    String newContent = regex(fileContent);
    try {
        Files.writeString(target.toPath(), newContent, charset);
    } catch (IOException exception) {
        System.err.println("Could not write out edited file!");
        exception.printStackTrace();
    }
}
Feel free to clone the repository or check it on GitHub and use whichever code version you prefer.

Java: Convert encoded characters to regular string

I have a string like this in Java:
"\xd0\xb5\xd0\xbd\xd0\xb4\xd0\xbf\xd0\xbe\xd0\xb9\xd0\xbd\xd1\x82"
How can I convert it to a human readable equivalent?
Note:
Actually this is GWT, and the string comes from Python as part of JSON data.
The JSONParser transforms it into something totally unrelated, so I want to be able to convert the string prior to parsing.
The expected result, which I call "human readable", is "ендпойнт" (https://mothereff.in/utf-8#%D0%B5%D0%BD%D0%B4%D0%BF%D0%BE%D0%B9%D0%BD%D1%82)

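Assuming that the pattern is a repetition of escapes in the form "\xHH", where HH is two hexadecimal digits ([0-9a-fA-F]), you can convert it with something like this:
String values = "\\xd0\\xb5\\xd0\\xbd\\xd0\\xb4\\xd0\\xbf\\xd0\\xbe\\xd0\\xb9\\xd0\\xbd\\xd1\\x82";
String[] parts = values.split("\\\\x"); // parts[0] is the leading empty string
byte[] bytes = new byte[parts.length - 1];
for (int i = 1; i < parts.length; i++) {
    bytes[i - 1] = (byte) Integer.parseInt(parts[i], 16);
}
// Decode all bytes together as UTF-8 (java.nio.charset.StandardCharsets)
System.err.print(new String(bytes, StandardCharsets.UTF_8));
Note that the loop starts at index 1 because of the first delimiter: see How to prevent java.lang.String.split() from creating a leading empty string? The bytes are collected and decoded together because each Cyrillic letter here is a two-byte UTF-8 sequence; casting every byte to char on its own would print Latin-1 characters instead.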

I don't know if it's just my console or it's not working, but you may try this code:
import java.io.UnsupportedEncodingException;
import javax.xml.bind.DatatypeConverter; // not bundled with the JDK since Java 11
public class Utf8Decoder {
    public static void main(String[] args) {
        String url = "\\xd0\\xb5\\xd0\\xbd\\xd0\\xb4\\xd0\\xbf\\xd0\\xbe\\xd0\\xb9\\xd0\\xbd\\xd1\\x82";
        url = url.replaceAll("\\\\x", ""); // remove the \x on the string...
        // it is now hex so let's parse it
        // convert to human readable text
        String result = "";
        try {
            byte[] bytes = DatatypeConverter.parseHexBinary(url);
            result = new String(bytes, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        System.out.print("decoded value:" + result);
    }
}
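Both answers assume the whole string consists of \xHH escapes. Since the question mentions that the string arrives inside JSON data, here is a sketch that also tolerates ordinary text between the escapes and needs nothing outside the core JDK. The class and method names (HexEscapeDecoder, decode) are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class HexEscapeDecoder {
    // Replaces every \xHH escape by the byte it names, copies all other
    // characters as their UTF-8 bytes, then decodes the result as UTF-8.
    static String decode(String input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < input.length()) {
            if (isEscapeAt(input, i)) {
                out.write(Integer.parseInt(input.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                // Work on code points so surrogate pairs stay intact.
                int cp = input.codePointAt(i);
                byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                out.write(utf8, 0, utf8.length);
                i += Character.charCount(cp);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // True when a backslash, an 'x' and two hex digits start at index i.
    private static boolean isEscapeAt(String s, int i) {
        return i + 4 <= s.length()
                && s.startsWith("\\x", i)
                && Character.digit(s.charAt(i + 2), 16) >= 0
                && Character.digit(s.charAt(i + 3), 16) >= 0;
    }

    public static void main(String[] args) {
        String raw = "\\xd0\\xb5\\xd0\\xbd\\xd0\\xb4\\xd0\\xbf\\xd0\\xbe\\xd0\\xb9\\xd0\\xbd\\xd1\\x82";
        // Compare against the expected Cyrillic word written with Unicode escapes.
        System.out.println(decode(raw).equals("\u0435\u043d\u0434\u043f\u043e\u0439\u043d\u0442")); // true
    }
}
```

Byte sequences that are not valid UTF-8 are replaced by U+FFFD here, because the String constructor substitutes rather than throws.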

How to convert utf-8 into base64 string?

I have a Base64 string, and I have a function that encodes strings with "utf-8" like this:
public void send(Car charact, String data) {
    try {
        Log.i(TAG, "data " + URLEncoder.encode(data, "utf-8"));
        charact.setValue(URLEncoder.encode(data, "utf-8"));
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
}
So when I call send(myBase64String), some characters are replaced by %2F, %2B, as described here: http://www.degraeve.com/reference/urlencoding.php.
I looked but didn't find a way to convert it back into a Base64 string. Can you advise, please? Thanks
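For context on the %2F and %2B: URLEncoder.encode does percent-encoding for URLs rather than a charset conversion, and it escapes '+', '/' and '=', all of which occur in standard Base64. A minimal sketch showing the effect and how URLDecoder reverses it; the class name Base64UrlDemo and the sample value are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class Base64UrlDemo {
    // Percent-encodes a string the same way the send() method in the question does.
    static String urlEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    // Reverses urlEncode, restoring the original Base64 text.
    static String urlDecode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String base64 = "+/8="; // sample Base64 text for the two bytes 0xFB 0xFF
        String escaped = urlEncode(base64);
        System.out.println(escaped);            // %2B%2F8%3D
        System.out.println(urlDecode(escaped)); // +/8=
    }
}
```

If the receiver does not expect URL-encoded text, the simpler route may be to pass the Base64 string through without calling URLEncoder at all.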

Java Problems encoding UTF8

I think the easiest way to explain my problem is with a little example:
My string at the beginning is: PÃ¢tes, and the result should be: Pâtes. What I get as a result is still PÃ¢tes. How can I fix this?
Here is the code:
private String encode(String string) {
    try {
        byte ptext[] = string.getBytes("UTF8");
        string = new String(ptext, "UTF8");
    }
    catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
    return string;
}

There are two problems with your code. The first is minor: you're using UTF8. Java accepts that as an alias, but the canonical character set name is UTF-8.
The second is that you're essentially performing a no-op. Calling byte ptext[] = string.getBytes("UTF-8"); encodes the string as UTF-8 bytes, and new String(ptext, "UTF-8") decodes those same bytes straight back, so you end up with the string you started with.
What I think you mean is that the input is ISO-8859-1 and you want to convert it to UTF-8. (This fits with the example input and output you've given).
Try:
private String encode(String string) {
    try {
        byte ptext[] = string.getBytes("ISO-8859-1");
        string = new String(ptext, "UTF-8");
    }
    catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
    return string;
}
This assumes that your initial string was originally read from somewhere and only contains ISO-8859-1 characters. As mentioned in a comment you should try to ensure the data is loaded in correctly from the source (i.e. when it is still just an array of bytes).
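The fix can be checked against the example from the question. A minimal sketch, assuming the string really is UTF-8 text that was decoded as ISO-8859-1; the class and method names (MojibakeFix, fix) are made up, and the sample strings are written with Unicode escapes so the source file's own encoding cannot interfere.

```java
import java.nio.charset.StandardCharsets;

class MojibakeFix {
    // Re-encode as ISO-8859-1 to recover the original bytes, then decode them as UTF-8.
    static String fix(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'P', U+00C3 (A with tilde), U+00A2 (cent sign), "tes": the broken input
        String broken = "P\u00C3\u00A2tes";
        // 'P', U+00E2 (a with circumflex), "tes": the expected output
        String expected = "P\u00E2tes";
        System.out.println(fix(broken).equals(expected)); // true
    }
}
```

Using the StandardCharsets constants instead of charset names also removes the need to catch UnsupportedEncodingException.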

getting number value from string

If I have the string
thisIsSomthing=4891\r\n
thisIsSomthingElse=27398472\r\n
thisIsNumber1=1\r\n
how would I find
thisIsNumber1
and then return 1 using a regex?

This assumes you really have that posted content in a string, and not in a file. As you're dealing with properties, you should use Properties and not a regex:
String yourString = ...
Properties prop = new Properties();
try {
    prop.load(new StringReader(yourString));
    String result = prop.getProperty("thisIsNumber1");
    System.out.println(result);
} catch (IOException e) {
    System.out.println("Error loading properties:");
    e.printStackTrace();
}

/thisIsNumber1=(\d+).*/
It'll be in capture group 1.
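In Java, that pattern could be applied roughly as follows. This is a sketch rather than the answerer's code: the class and method names (NumberFinder, find) are hypothetical, and the pattern is anchored to the start of a line, which the original regex does not do.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberFinder {
    // MULTILINE lets ^ match at the start of every line, not just the start of the input.
    private static final Pattern NUMBER1 =
            Pattern.compile("^thisIsNumber1=(\\d+)", Pattern.MULTILINE);

    // Returns the digits after "thisIsNumber1=", or null when the key is absent.
    static String find(String text) {
        Matcher m = NUMBER1.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "thisIsSomthing=4891\r\n"
                + "thisIsSomthingElse=27398472\r\n"
                + "thisIsNumber1=1\r\n";
        System.out.println(find(text)); // 1
    }
}
```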

String line = "thisIsNumber1=1\r\n";
String temp = line.split("\\r?\\n")[0].split("=")[1];
// Prints "Value=1*"; the "*" shows nothing is concatenated after the character in the output
System.out.println("Value=" + temp + "*");
