Converting UTF-8 to ISO-8859-1 in Java

I am reading an XML document (UTF-8) and ultimately displaying the content on a Web page using ISO-8859-1. As expected, a few characters are not displayed correctly, such as “, – and ’ (they display as ?).
Is it possible to convert these characters from UTF-8 to ISO-8859-1?
Here is a snippet of code I have written to attempt this:
BufferedReader br = new BufferedReader(new InputStreamReader(urlConnection.getInputStream(), "UTF-8"));
StringBuilder sb = new StringBuilder();
String line = null;
while ((line = br.readLine()) != null) {
    sb.append(line);
}
br.close();
byte[] latin1 = sb.toString().getBytes("ISO-8859-1");
return new String(latin1);
I'm not quite sure what's going awry, but I believe it's readLine() that's causing the grief (since the strings would be Java/UTF-16 encoded?). Another variation I tried was to replace latin1 with
byte[] latin1 = new String(sb.toString().getBytes("UTF-8")).getBytes("ISO-8859-1");
I have read previous posts on the subject and I'm learning as I go. Thanks in advance for your help.

I'm not sure if there is a normalization routine in the standard library that will do this. I do not think conversion of "smart" quotes is handled by the standard Unicode normalizer routines - but don't quote me.
The smart thing to do is to dump ISO-8859-1 and start using UTF-8. That said, it is possible to encode any normally allowed Unicode code point into an HTML page encoded as ISO-8859-1. You can encode them using numeric escape sequences, as shown here:
public final class HtmlEncoder {
    private HtmlEncoder() {}

    public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
            T out) throws java.io.IOException {
        for (int i = 0; i < sequence.length(); i++) {
            char ch = sequence.charAt(i);
            if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
                out.append(ch);
            } else {
                int codepoint = Character.codePointAt(sequence, i);
                // handle supplementary range chars (skip the low surrogate)
                i += Character.charCount(codepoint) - 1;
                // emit a numeric character reference
                out.append("&#x");
                out.append(Integer.toHexString(codepoint));
                out.append(";");
            }
        }
        return out;
    }
}
Example usage:
String foo = "This is Cyrillic Ya: \u044F\n"
        + "This is fraktur G: \uD835\uDD0A\n"
        + "This is a smart quote: \u201C";
StringBuilder sb = HtmlEncoder.escapeNonLatin(foo, new StringBuilder());
System.out.println(sb.toString());
Above, the character LEFT DOUBLE QUOTATION MARK (U+201C “) is encoded as &#x201c;. A couple of other arbitrary code points are likewise encoded.
Care needs to be taken with this approach: if the text also needs to be escaped for HTML, do that before running the code above, or the ampersands in the emitted entities will themselves be escaped.

Depending on your default encoding, the following lines can cause problems:
byte[] latin1 = sb.toString().getBytes("ISO-8859-1");
return new String(latin1);
In Java, String/char is always UTF-16. A specific encoding is only involved when you convert between characters and bytes. Say your default encoding is UTF-8: the latin1 buffer is then treated as UTF-8, and some Latin-1 byte sequences form invalid UTF-8 sequences, so you get ?.
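To make that failure mode concrete, here is a small self-contained sketch of mine (assuming Java 7+, so StandardCharsets can be used instead of charset-name strings):
import java.nio.charset.StandardCharsets;

public class Latin1RoundTrip {
    public static void main(String[] args) {
        String text = "smart quote: \u201C";
        // Encoding: U+201C has no Latin-1 mapping, so it becomes '?' here.
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        // Decode with the same charset that produced the bytes.
        // Plain new String(latin1) uses the platform default and can mangle
        // any byte above 0x7F when that default is UTF-8.
        String back = new String(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(back); // prints: smart quote: ?
    }
}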

With Java 8, McDowell's answer can be simplified like this (while preserving correct handling of surrogate pairs):
import java.util.PrimitiveIterator;

public final class HtmlEncoder {
    private HtmlEncoder() {}

    public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
            T out) throws java.io.IOException {
        for (PrimitiveIterator.OfInt iterator = sequence.codePoints().iterator(); iterator.hasNext(); ) {
            int codePoint = iterator.nextInt();
            if (Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.BASIC_LATIN) {
                out.append((char) codePoint);
            } else {
                out.append("&#x");
                out.append(Integer.toHexString(codePoint));
                out.append(";");
            }
        }
        return out;
    }
}

When you instantiate your String object, you need to indicate which encoding to use.
So replace:
return new String(latin1);
by
return new String(latin1, "ISO-8859-1");
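On Java 7 and later you can use the constant java.nio.charset.StandardCharsets.ISO_8859_1 instead of the charset name, which also avoids the checked UnsupportedEncodingException:
return new String(latin1, StandardCharsets.ISO_8859_1);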

Related

Regular expression for matching "Shift-JIS" string against given set of ranges

Problem statement:
Let's call 0x8140~0x84BE, 0x889F~0x9872, 0x989F~0x9FFC, 0xE040~0xEAA4, 0x8740~0x879C, 0xED40~0xEEFC, 0xFA40~0xFC4B, 0xF040~0xF9FC the allowed ranges.
I want to validate whether an input String contains a kanji that is not in the above ranges.
Here are examples of input kanji characters that are not in the above ranges, together with the results my code currently gives:
龔 --> OK
鑫 --> OK
璐 --> Need Change
Expected result should be "Need Change" for all of them.
Please help.
Here is the code:
import java.util.regex.Pattern;

public class RegExpDemo2 {

    private boolean validateMnpName(String name) {
        try {
            byte[] utf8Bytes = name.getBytes("UTF-8");
            String string = new String(utf8Bytes, "UTF-8");
            byte[] shiftJisBytes = string.getBytes("Shift-JIS");
            String strName = new String(shiftJisBytes, "Shift-JIS");
            System.out.println("ShiftJIS Str name : " + strName);
            final String regex = "([\\x{8140}-\\x{84BE}]+)|([\\x{889F}-\\x{9872}]+)|([\\x{989F}-\\x{9FFC}]+)|([\\x{E040}-\\x{EAA4}]+)|([\\x{8740}-\\x{879C}]+)|([\\x{ED40}-\\x{EEFC}]+)|([\\x{FA40}-\\x{FC4B}]+)|([\\x{F040}-\\x{F9FC}]+)";
            return Pattern.compile(regex).matcher(strName).find();
        } catch (Exception e) {
            e.printStackTrace();
            return false;
        }
    }

    public static void main(String[] args) {
        RegExpDemo2 obj = new RegExpDemo2();
        if (obj.validateMnpName("ロ")) {
            System.out.println("OK");
        } else {
            System.out.println("Need Change");
        }
    }
}
Your approach cannot work, because a String is Unicode in Java.
As observed by @VGR and myself, a round trip through a Shift-JIS byte array does not change that: you simply converted Unicode to Shift-JIS and back to Unicode.
There are two approaches possible:
1. Convert the Java String (which is Unicode) into an array of bytes in the Shift-JIS encoding, and then examine the byte array for the allowed/forbidden values (sketched below).
2. Convert the 'allowed' ranges into Unicode (a single range in Shift-JIS may not be a single range in Unicode) and work with the String representation in Unicode.
Neither way is pretty, but if you have to use old character codes instead of the not-quite-so-old (only 30 years!) Unicode, this is necessary.
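Here is a rough sketch of the first approach; the helper name is mine, and the range constants come straight from the question. The handling of half-width katakana and of characters that Shift-JIS cannot represent (they encode as '?') is simplified and would need hardening in real code:
import java.nio.charset.Charset;

public class ShiftJisRangeCheck {

    // True if every two-byte Shift-JIS character in the name falls in one of
    // the allowed ranges from the problem statement.
    static boolean isInAllowedRange(String name) {
        byte[] sjis = name.getBytes(Charset.forName("Shift_JIS"));
        for (int i = 0; i < sjis.length; i++) {
            int b1 = sjis[i] & 0xFF;
            if (b1 <= 0x7F) continue;               // single-byte (ASCII) char
            if (b1 >= 0xA1 && b1 <= 0xDF) continue; // half-width katakana, single byte
            int b2 = sjis[++i] & 0xFF;              // second byte of the pair
            int code = (b1 << 8) | b2;              // e.g. 0x8140
            boolean allowed =
                   (code >= 0x8140 && code <= 0x84BE)
                || (code >= 0x8740 && code <= 0x879C)
                || (code >= 0x889F && code <= 0x9872)
                || (code >= 0x989F && code <= 0x9FFC)
                || (code >= 0xE040 && code <= 0xEAA4)
                || (code >= 0xED40 && code <= 0xEEFC)
                || (code >= 0xF040 && code <= 0xF9FC)
                || (code >= 0xFA40 && code <= 0xFC4B);
            if (!allowed) {
                return false;
            }
        }
        return true;
    }
}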

Characters not appearing when I print when I import a file?

I'm importing a file into my code and trying to print it. The file contains
i don't like cake.
pizza is good.
i don’t like "cookies" to.
17.
29.
The second "don't" has a "right single quotation" character, and when I print it the output is
don�t
The character prints as a blank square (the replacement character shown above). Is there a way to convert it to a regular apostrophe?
EDIT:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;

public class Somethingsomething {
    public static void main(String[] args) throws FileNotFoundException, IOException {
        ArrayList<String> list = new ArrayList<String>();
        File file = new File("D:\\project1Test.txt");
        if (file.exists()) { // checks if the file exists
            FileInputStream fileStream = new FileInputStream(file);
            InputStreamReader input = new InputStreamReader(fileStream);
            BufferedReader reader = new BufferedReader(input);
            String line;
            while ((line = reader.readLine()) != null) {
                list.add(line);
            }
            for (int i = 0; i < list.size(); i++) {
                System.out.println(list.get(i));
            }
        }
    }
}
It should print normally, but the second "don't" has a white block on the apostrophe.
this is the file I'm using https://www.mediafire.com/file/8rk7nwilpj7rn7s/project1Test.txt
Edit: if it helps even more, the full document where the character is found is here:
https://www.nytimes.com/2018/03/25/business/economy/labor-professionals.html
It’s all about character encoding. The way characters are represented isn't always the same and they tend to get misinterpreted.
Characters are usually stored as numbers that depend on the encoding standard (and there are many of them). For example, in ASCII "a" is 97 in decimal, which UTF-8 stores as the single byte 0x61.
Now when you see odd symbols such as the � above (the replacement character), it usually means that text in one encoding is being misinterpreted as another, and the replacement character is substituted for the unknown or misinterpreted character.
To fix your problem you need to tell your reader to read your file using a specific character encoding, say SOME-CHARSET.
Replace this:
InputStreamReader input = new InputStreamReader(fileStream);
with this:
InputStreamReader input = new InputStreamReader(fileStream, "SOME-CHARSET");
A list of charsets is available here. Unfortunately, you might have to go through them one by one. A short list of the most common ones can be found here.
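For this particular file, UTF-8 is the most likely candidate (the ’ in "don’t" is a strong hint), so a sketch would be (class name is mine; the path comes from the question):
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadUtf8 {
    public static void main(String[] args) throws IOException {
        // Same file as in the question; the fix is the explicit charset.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream("D:\\project1Test.txt"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}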
Your problem is almost certainly the encoding scheme you are using. You can read a file in almost any encoding scheme you want; just tell Java how your input was encoded. UTF-8 is common on Linux; the native encoding on Western European Windows systems is windows-1252 (Cp1252).
This is the sort of problem you have all the time if you are processing files created on a different OS.
See here and here.
I'll give you a different approach...
Use the appropriate means for reading plain text files. Try this:
public static String getTxtContent(String path)
{
    // Note: FileReader reads with the platform default charset.
    try (BufferedReader br = new BufferedReader(new FileReader(path)))
    {
        StringBuilder sb = new StringBuilder();
        String line = br.readLine();
        while (line != null) {
            sb.append(line);
            sb.append(System.lineSeparator());
            line = br.readLine();
        }
        return sb.toString();
    } catch (IOException fex) {
        return null;
    }
}
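Note that FileReader, as used above, reads with the platform default charset, which is exactly the kind of mismatch that caused the original problem; on Java 11 and later it accepts an explicit charset, e.g. new FileReader(path, StandardCharsets.UTF_8).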

Java replace ascii char

I have a file (prueba.txt) and I would like to replace the characters 0xE1 (á) with 0x14, 0xE9 (é) with 0x15, 0xF3 (ó) with 0x16, and so on. With a String this is possible using String.replace(), but what I get here is a char.
import java.io.File;
import java.util.Scanner;

public class Reemplazar {
    public static void main(String[] args) throws Exception {
        Scanner archivo = new Scanner(new File("prueba.txt"));
        while (archivo.hasNextLine()) {
            String frase = archivo.nextLine();
            for (int i = 0; i < frase.length(); i++) {
                char current = frase.charAt(i);
                if (current == 0xe1) {
                    System.out.println("contains á: '" + frase + "'");
                }
                if (current == 0xe9) {
                    System.out.println("contains é: '" + frase + "'");
                }
            }
        }
    }
}
I guess this code could be much improved, but...
Greetings.
First read the text file, then replace the characters.
Reading
A text file has a specific character set and encoding. You must know exactly which it is, or else assume it is the system default ANSI character set and encoding; "ANSI" is not one specific encoding.
But since you said ANSI, you probably meant the system default. The Scanner constructor you used reads with Java's default charset, and you can reasonably assume that Java's default correctly matches the system default.
Replacing characters
All "characters" in Java's String, char and Character datatypes and in an analyzed Java source file are UTF-16 code units, one or two of which encode a Unicode codepoint. Unescaped literal strings and characters are going to be in the encoding of the source file. (Of course, that should be UTF-8.) Regardless, if you type it, see it, save it and compile it with the same encoding, the characters will be what you think they are.
So, once you have text in a string, you can replace, replace, replace, like this:
frase
    .replace('á', '¶')
    .replace('é', '§')
    .replace('ñ', '▬')
or
frase
    .replace('\u00E1', '\u00B6')
    …
BTW: 0x14, 0x15 and 0x16 are the encodings for ¶, § and ▬ in the OEM 437 (CP437) character set.
If you'd rather iterate through the elements of the String, you could do it by UTF-16 code unit, using charAt. That works best if all your text consists of characters that UTF-16 encodes with just one code unit; given that your file encoding is one of the ANSI character sets for a European language, that is likely the case. Or, you can iterate with a codepoint-aware technique as seen in the Java documentation on CharSequence.
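A sketch of the codepoint-aware route using Java 8 streams, applied to the asker's frase (the mappings are the ones from the question):
// Map á (U+00E1) to 0x14, é (U+00E9) to 0x15 and ó (U+00F3) to 0x16,
// iterating by codepoint so supplementary characters are not split.
String replaced = frase.codePoints()
        .map(cp -> cp == 0xE1 ? 0x14
                 : cp == 0xE9 ? 0x15
                 : cp == 0xF3 ? 0x16
                 : cp)
        .collect(StringBuilder::new,
                 StringBuilder::appendCodePoint,
                 StringBuilder::append)
        .toString();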
It is even better that it is a char, because then you can do something like this:
yourStringToReplace.replace((char) 0xE1, (char) 0x14);
A char is an integer that is treated like a character instead of a number (simply speaking).
This replaces the characters and creates a new file, "nueva_prueba.txt", with the changed text:
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;

public class Reemplazar {
    public static void main(String[] args) throws IOException {
        File f = new File("nueva_prueba.txt");
        f.createNewFile();
        BufferedWriter out = new BufferedWriter(new FileWriter(f));
        Scanner archivo = new Scanner(new File("prueba.txt"));
        while (archivo.hasNextLine()) {
            String frase = archivo.nextLine();
            for (int i = 0; i < frase.length(); i++) {
                char current = frase.charAt(i);
                switch (current) {
                    case 0xe1:
                        System.out.println("contains á: '" + frase + "'");
                        frase = frase.replace((char) 0xe1, (char) 0x14);
                        System.out.println("new phrase: " + frase);
                        break;
                    case 0xe9:
                        System.out.println("contains é: '" + frase + "'");
                        frase = frase.replace((char) 0xe9, (char) 0x15);
                        System.out.println("new phrase: " + frase);
                        break;
                    case 0xf3:
                        System.out.println("contains ó: '" + frase + "'");
                        frase = frase.replace((char) 0xf3, (char) 0x16);
                        System.out.println("new phrase: " + frase);
                        break;
                    // ... others
                    default:
                        break;
                }
            }
            try {
                out.write(frase);
                out.newLine();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        archivo.close();
        out.close();
    }
}
Hope this helps!

Special characters coming through as ? in SMPP and Java

I've spent a crazy amount of time trying to get special characters to come through properly in our application. Our provider told us to use "GSM0338, also known as ISO-8859". To me, this means ISO-8859-1, since we want Spanish characters.
The flow: (Telling you everything, since I've been playing around with this for a while.)
Used Notepad++ to create the message files in UTF-8 encoding. (No option to save as ISO-8859-1.)
Sent each file through a quick Java program which converts and writes new files:
String text = readTheFile(....);
byte[] output = text.getBytes("ISO-8859-1");
FileOutputStream fos = new FileOutputStream(filesPathWithoutName + "\\converted\\" + filename);
fos.write(output);
fos.close();
SMPP test class in another project reads these files:
private static String readMessageFile(final String filenameOfFirstMessage) throws IOException {
    BufferedReader br = new BufferedReader(new FileReader(filenameOfFirstMessage));
    String message;
    try {
        StringBuilder sb = new StringBuilder();
        String line = br.readLine();
        while (line != null) {
            sb.append(line);
            sb.append("\n");
            line = br.readLine();
        }
        message = sb.toString();
    } finally {
        br.close();
    }
    return message;
}
Calls send
public void send(final String message, final String targetPhone) throws MessageException {
    SmppMessage smppMessage = toSmppMessage(message, targetPhone);
    smppSmsService.sendMessage(smppMessage);
}

private SmppMessage toSmppMessage(final String message, final String targetPhone) {
    SmppMessage smppMessage = new SmppMessage();
    smppMessage.setMessage(message);
    smppMessage.setRecipientAddress(toGsmAddress(targetPhone));
    smppMessage.setSenderAddress(getSenderGsmAddress());
    smppMessage.setMessageType(SmppMessage.MSG_TYPE_DATA);
    smppMessage.setMessageMode(SmppMessage.MSG_MODE_SAF);
    smppMessage.requestStatusReport(true);
    return smppMessage;
}
Problem:
SMSs containing the letters ñ, í and ó are delivered, but with these letters displayed as question marks.
Configuration:
smpp.smsc.charset=ISO-8859-1
smpp.data.coding=0x03
Absolutely any help with this would be GREATLY appreciated. Thank you so much for reading.
Well, your provider is wrong. GSM 03.38 is not ISO-8859-1. They are the same up through "Z" (0x5A), but after that they diverge. For instance, in GSM 03.38, ñ is 0x7D, while in ISO-8859-1, it is 0xF1. Since GSM 03.38 is a 7-bit code, anything above 0x7F is going to come out as a "?". Anything after 0x5A is going to come out as something unexpected.
Since Java doesn't usually come with GSM 03.38 support, you're going to have to decode by hand. It shouldn't be too difficult to do, and the following piece of software might already do most of what you need:
Java GSM 03.38 SMS Character Set Translator
You might also find this translation table between GSM 03.38 and Unicode useful.
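To make the divergence concrete, here is a hand-rolled sketch of mine that encodes a few Spanish characters into the GSM 03.38 default alphabet. The class name and mapping table are my own; the values come from the GSM 03.38 default alphabet. Note that í and ó are not in that alphabet at all, so a real implementation has to substitute them (e.g. with i/o) or switch the message to UCS-2:
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

public class Gsm0338Sketch {

    // Selected GSM 03.38 default-alphabet values that differ from ASCII/ISO-8859-1.
    private static final Map<Character, Byte> GSM = new HashMap<>();
    static {
        GSM.put('@', (byte) 0x00);
        GSM.put('$', (byte) 0x02);
        GSM.put('é', (byte) 0x05);
        GSM.put('¡', (byte) 0x40);
        GSM.put('Ñ', (byte) 0x5D);
        GSM.put('¿', (byte) 0x60);
        GSM.put('ñ', (byte) 0x7D);
        GSM.put('ü', (byte) 0x7E);
    }

    public static byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (char ch : text.toCharArray()) {
            Byte mapped = GSM.get(ch);
            if (mapped != null) {
                out.write(mapped);
            } else if ((ch >= 0x20 && ch <= 0x5A) || (ch >= 0x61 && ch <= 0x7A)) {
                out.write(ch); // ranges where GSM 03.38 happens to match ASCII
            } else {
                out.write(0x3F); // '?' for anything this sketch does not cover
            }
        }
        return out.toByteArray();
    }
}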

Decode a string in Java

How do I properly decode the following string in Java?
http%3A//www.google.ru/search%3Fhl%3Dru%26q%3Dla+mer+powder%26btnG%3D%u0420%A0%u0421%u045F%u0420%A0%u0421%u2022%u0420%A0%u0421%u2018%u0420%u040E%u0420%u0453%u0420%A0%u0421%u201D+%u0420%A0%u0420%u2020+Google%26lr%3D%26rlz%3D1I7SKPT_ru
When I use URLDecoder.decode() I get the following error:
java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "u0"
Thanks,
Dave
According to Wikipedia, "there exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a Unicode value".
Continuing: "This behavior is not specified by any RFC and has been rejected by the W3C".
Your URL contains such tokens, and the Java URLDecoder implementation doesn't support those.
The %uXXXX encoding is non-standard and was actually rejected by the W3C, so it's natural that URLDecoder does not understand it.
You can write a small function which fixes it by replacing each occurrence of %uXXYY with %XX%YY in your encoded string. Then you can proceed and decode the fixed string normally.
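A minimal sketch of that idea (the class name is mine, and it assumes each %uXXXX escape stands for a single UTF-16 code unit while the remaining %XX escapes are UTF-8):
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class NonStandardUrlDecoder {

    private static final Pattern UNICODE_ESCAPE = Pattern.compile("%u([0-9A-Fa-f]{4})");

    public static String decode(String s) throws UnsupportedEncodingException {
        // Resolve each non-standard %uXXXX escape into the literal character...
        Matcher m = UNICODE_ESCAPE.matcher(s);
        StringBuffer resolved = new StringBuffer();
        while (m.find()) {
            char ch = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(resolved, Matcher.quoteReplacement(String.valueOf(ch)));
        }
        m.appendTail(resolved);
        // ...then let URLDecoder handle the standard %XX escapes.
        return URLDecoder.decode(resolved.toString(), "UTF-8");
    }
}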
We started with Vartec's solution but found additional issues. This solution works for UTF-16, but it can be changed to return UTF-8. The replaceAll calls are left in for clarity, and you can read more at http://www.cogniteam.com/wiki/index.php?title=DecodeEncodeJavaScript
static public String unescape(String escaped) throws UnsupportedEncodingException {
    // This code is needed so that the UTF-16 won't be malformed
    String str = escaped.replaceAll("%0", "%u000");
    str = str.replaceAll("%1", "%u001");
    str = str.replaceAll("%2", "%u002");
    str = str.replaceAll("%3", "%u003");
    str = str.replaceAll("%4", "%u004");
    str = str.replaceAll("%5", "%u005");
    str = str.replaceAll("%6", "%u006");
    str = str.replaceAll("%7", "%u007");
    str = str.replaceAll("%8", "%u008");
    str = str.replaceAll("%9", "%u009");
    str = str.replaceAll("%A", "%u00A");
    str = str.replaceAll("%B", "%u00B");
    str = str.replaceAll("%C", "%u00C");
    str = str.replaceAll("%D", "%u00D");
    str = str.replaceAll("%E", "%u00E");
    str = str.replaceAll("%F", "%u00F");
    // Here we split each 4-digit escape into two 2-digit escapes, so that decode won't fail
    String[] arr = str.split("%u");
    Vector<String> vec = new Vector<String>();
    if (!arr[0].isEmpty()) {
        vec.add(arr[0]);
    }
    for (int i = 1; i < arr.length; i++) {
        if (!arr[i].isEmpty()) {
            vec.add("%" + arr[i].substring(0, 2));
            vec.add("%" + arr[i].substring(2));
        }
    }
    str = "";
    for (String string : vec) {
        str += string;
    }
    // Here we return the decoded string
    return URLDecoder.decode(str, "UTF-16");
}
After having had a good look at the solution presented by @ariy, I created a Java-based solution that is also resilient against encoded characters that have been chopped into two parts (i.e. half of the encoded character is missing). This happens in my use case, where I need to decode long URLs that are sometimes chopped at a length of 2000 chars. See What is the maximum length of a URL in different browsers?
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Pattern;

public class Utils {
    private static Pattern validStandard = Pattern.compile("%([0-9A-Fa-f]{2})");
    private static Pattern choppedStandard = Pattern.compile("%[0-9A-Fa-f]{0,1}$");
    private static Pattern validNonStandard = Pattern.compile("%u([0-9A-Fa-f][0-9A-Fa-f])([0-9A-Fa-f][0-9A-Fa-f])");
    private static Pattern choppedNonStandard = Pattern.compile("%u[0-9A-Fa-f]{0,3}$");

    public static String resilientUrlDecode(String input) {
        String cookedInput = input;
        if (cookedInput.indexOf('%') > -1) {
            // Transform all standard %XX escapes into UTF-16-style %00%XX escapes.
            cookedInput = validStandard.matcher(cookedInput).replaceAll("%00%$1");
            // Discard a chopped encoded char at the end of the line (there is no way to know what it was).
            cookedInput = choppedStandard.matcher(cookedInput).replaceAll("");
            // Handle the non-standard (rejected by W3C) encoding that is used anyway by some.
            // See: https://stackoverflow.com/a/5408655/114196
            if (cookedInput.contains("%u")) {
                // Transform the non-standard %uXXYY escapes into standard %XX%YY escapes.
                cookedInput = validNonStandard.matcher(cookedInput).replaceAll("%$1%$2");
                // Discard a chopped encoded char at the end of the line.
                cookedInput = choppedNonStandard.matcher(cookedInput).replaceAll("");
            }
        }
        try {
            return URLDecoder.decode(cookedInput, "UTF-16");
        } catch (UnsupportedEncodingException e) {
            // Will never happen because the encoding is hardcoded.
            return null;
        }
    }
}
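For example, Utils.resilientUrlDecode("%41%42%43%u201") yields "ABC": the three standard escapes decode normally, and the chopped trailing "%u201" is dropped instead of triggering the IllegalArgumentException from the question.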
