java convert String windows-1251 to utf8

java convert String windows-1251 to utf8 - java

Scanner sc = new Scanner(System.in);
System.out.println("Enter text: ");
String text = sc.nextLine();
try {
String result = new String(text.getBytes("windows-1251"), Charset.forName("UTF-8"));
System.out.println(result);
} catch (UnsupportedEncodingException e) {
System.out.println(e);
}
I'm trying change keyboard: input cyrylic keyboard, output latin. Example: qwerty +> йцукен
It doesn't work, can anyone tell me what i'm doing wrong?

First java text, String/char/Reader/Writer is internally Unicode, so it can combine all scripts.
This is a major difference with for instance C/C++ where there is no such standard.
Now System.in is an InputStream for historical reasons. That needs an indication of encoding used.
Scanner sc = new Scanner(System.in, "Windows-1251");
The above explicitly sets the conversion for System.in to Cyrillic. Without this optional parameter the default encoding is taken. If that was not changed by the software, it would be the platform encoding. So this might have been correct too.
Now text is correct, containing the Cyrillic from System.in as Unicode.
You would get the UTF-8 bytes as:
byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
The old "recoding" of text was wrong; drop this line. in fact not all Windows-1251 bytes are valid UTF-8 multi-byte sequences.
String result = text;
System.out.println(result);
System.out is a PrintStream, a rather rarely used historic class. It prints using the default platform encoding. More or less rely on it, that the default encoding is correct.
System.out.println(result);
For printing to an UTF-8 encoded file:
byte[] bytes = ("\uFEFF" + text).getBytes(StandardCharsets.UTF_8);
Path path = Paths.get("C:/Temp/test.txt");
Files.writeAllBytes(path, bytes);
Here I have added a Unicode BOM character in front, so Windows Notepad may recognize the encoding as UTF-8. In general one should evade using a BOM. It is a zero-width space (=invisible) and plays havoc with all kind of formats: CSV, XML, file concatenation, cut-copy-paste.

The reason why you have gotten the answer to a different question, and nobody answered yours, is because your title doesn't fit the question. You were not attempting to convert between charsets, but rather between keyboard layouts.
Here you shouldn't worry about character layout at all, simply read the line, convert it to an array of characters, go through them and using a predefined map convert these.
The code will be something like this:
Map<char, char> table = new TreeMap<char, char>();
table.put('q', 'й');
table.put('Q', 'Й');
table.put('w', 'ц');
// .... etc
String text = sc.nextLine();
char[] cArr = text.toCharArray();
for(int i=0; i<cArr.length; ++i)
{
if(table.containsKey(cArr[i]))
{
cArr[i] = table.get(cArr[i]);
}
}
text = new String(cArr);
System.out.println(text);
Now, i don't have time to test that code, but you should get the idea of how to do your task.

Related

Problem with input from user saved to file by RandomAccessFile methods

I've got a problem with input from user. I need to save input from user into binary file and when I read it and show it on the screen it isn't working properly. I dont want to put few hundreds of lines, so I will try to dexcribe it in more compact form. And encoding in NetBeans in properties of project is "UTF-8"
I got input from user, in NetBeans console or cmd console. Then I save it to object made up of strings, then add it to ArrayList<Ksiazka> where Ksiazka is my class (basically a book's properties). Then I save whole ArrayList object to file baza.bin. I do it by looping through whole list of objects of class Ksiazka, taking each String one by one and saving it into file baza.bin using method writeUTF(oneOfStrings). When I try to read file baza.bin I see question marks instead of special characters (ą, ć, ę, ł, ń, ó, ś, ź). I think there is a problem in difference in encoding of file and input data, but to be honest I don't have any idea ho to solve that.
Those are attributes of my class Ksiazka:
private String id;
private String tytul;
private String autor;
private String rok;
private String wydawnictwo;
private String gatunek;
private String opis;
private String ktoWypozyczyl;
private String kiedyWypozyczona;
private String kiedyDoOddania;
This is method for reading data from user:
static String podajDana(String[] tab, int coPokazac){
System.out.print(tab[coPokazac]);
boolean podawajDalej = true;
String linia = "";
Scanner klawiatura = new Scanner(System.in, "utf-8");
do{
try {
podawajDalej = false;
linia = klawiatura.nextLine();
}
catch(NoSuchElementException e){
System.err.println("Wystąpił błąd w czasie podawania wartości!"
+ " Spróbuj jeszcze raz!");
}
catch(IllegalStateException e){
System.err.println("Wewnętrzny błąd programu typu 2! Zgłoś to jak najszybciej"
+ " razem z tą wiadomością");
}
}while(podawajDalej);
return linia;
}
String[] tab is just array of strings I want to be able to show on the screen, each set (array) has its own function, int coPokazac is number of line from an array I want to show.
and this one saves all data from ArrayList<Ksiazka> to file baza.bin:
static void zapiszZmiany(ArrayList<Ksiazka> bazaKsiazek){
try{
RandomAccessFile plik = new RandomAccessFile("baza.bin","rw");
for(int i = 0; i < bazaKsiazek.size(); i++){
plik.writeUTF(bazaKsiazek.get(i).zwrocId());
plik.writeUTF(bazaKsiazek.get(i).zwrocTytul());
plik.writeUTF(bazaKsiazek.get(i).zwrocAutor());
plik.writeUTF(bazaKsiazek.get(i).zwrocRok());
plik.writeUTF(bazaKsiazek.get(i).zwrocWydawnictwo());
plik.writeUTF(bazaKsiazek.get(i).zwrocGatunek());
plik.writeUTF(bazaKsiazek.get(i).zwrocOpis());
plik.writeUTF(bazaKsiazek.get(i).zwrocKtoWypozyczyl());
plik.writeUTF(bazaKsiazek.get(i).zwrocKiedyWypozyczona());
plik.writeUTF(bazaKsiazek.get(i).zwrocKiedyDoOddania());
}
plik.close();
}
catch (FileNotFoundException ex){
System.err.println("Nie znaleziono pliku z bazą książek!");
}
catch (IOException ex){
System.err.println("Błąd zapisu bądź odczytu pliku!");
}
}
I think that there is a problem in one of those two methods (either I do something wrong while reading it or something wrong when it is saving data to file using writeUTF()) but even tho I tried few things to solve it, none of them worked.
After quick talk with lecturer I got information that I can use at most JDK 8.

You are using different techniques for reading and writing, and they are not compatible.
Despite the name, the writeUTF method of RandomAccessFile does not write a UTF-8 string. From the documentation:
Writes a string to the file using modified UTF-8 encoding in a machine-independent manner.
First, two bytes are written to the file, starting at the current file pointer, as if by the writeShort method giving the number of bytes to follow. This value is the number of bytes actually written out, not the length of the string. Following the length, each character of the string is output, in sequence, using the modified UTF-8 encoding for each character.
writeUTF will write a two-byte length, then write the string as UTF-8, except that '\u0000' characters are written as two UTF-8 bytes and supplementary characters are written as two UTF-8 encoded surrogates, rather than single UTF-8 codepoint sequences.
On the other hand, you are trying to read that data using new Scanner(System.in, "utf-8") and klawiatura.nextLine();. This approach is not compatible because:
The text was not written as a true UTF-8 sequence.
Before the text was written, two bytes indicating its numeric length were written. They are not readable text.
writeUTF does not write a newline. It does not write any terminating sequence at all, in fact.
The best solution is to remove all usage of RandomAccessFile and replace it with a Writer:
Writer plik = new FileWriter(new File("baza.bin"), StandardCharsets.UTF_8);
for (int i = 0; i < bazaKsiazek.size(); i++) {
plik.write(bazaKsiazek.get(i).zwrocId());
plik.write('\n');
plik.write(bazaKsiazek.get(i).zwrocTytul());
plik.write('\n');
// ...

Print unicode character in java

Displaying unicode character in java shows "?" sign. For example, i tried to print "अ". Its unicode Number is U+0905 and html representation is "अ".
The below codes prints "?" instead of unicode character.
char aa = '\u0905';
String myString = aa + " result" ;
System.out.println(myString); // displays "? result"
Is there a way to display unicode character directly from unicode itself without using unicode numbers? i.e "अ" is saved in file now display the file in jsp.

Java defines two types of streams, byte and character.
The main reason why System.out.println() can't show Unicode characters is that System.out.println() is a byte stream that deal with only the low-order eight bits of character which is 16-bits.
In order to deal with Unicode characters(16-bit Unicode character), you have to use character based stream i.e. PrintWriter.
PrintWriter supports the print( ) and println( ) methods. Thus, you can use these methods
in the same way as you used them with System.out.
PrintWriter printWriter = new PrintWriter(System.out,true);
char aa = '\u0905';
printWriter.println("aa = " + aa);

try to use utf8 character set -
Charset utf8 = Charset.forName("UTF-8");
Charset def = Charset.defaultCharset();
String charToPrint = "u0905";
byte[] bytes = charToPrint.getBytes("UTF-8");
String message = new String(bytes , def.name());
PrintStream printStream = new PrintStream(System.out, true, utf8.name());
printStream.println(message); // should print your character

Your myString variable contains the perfectly correct value. The problem must be the output from System.out.println(myString) which has to send some bytes to some output to show the glyphs that you want to see.
System.out is a PrintStream using the "platform default encoding" to convert characters to byte sequences - maybe your platform doesn't support that character. E.g. on my Windows 7 computer in Germany, the default encoding is CP1252, and there's no byte sequence in this encoding that corresponds to your character.
Or maybe the encoding is correct, but simply the font that creates graphical glyphs from characters doesn't have that charater.
If you are sending your output to a Windows CMD.EXE window, then maybe both reasons apply.
But be assured, your string is correct, and if you send it to a destination that can handle it (e.g. a Swing JTextField), it'll show up correctly.

I ran into the same problem wiht Eclipse. I solved my problem by switching the Encoding format for the console from ISO-8859-1 to UTF-8. You can do in the Run/Run Configurations/Common menu.
https://eclipsesource.com/blogs/2013/02/21/pro-tip-unicode-characters-in-the-eclipse-console/

Unicode is a unique code which is used to print any character or symbol.
You can use unicode from --> https://unicode-table.com/en/
Below is an example for printing a symbol in Java.
package Basics;
/**
*
* #author shelc
*/
public class StringUnicode {
public static void main(String[] args) {
String var1 = "Cyntia";
String var2 = new String(" is my daughter!");
System.out.println(var1 + " \u263A" + var2);
//printing heart using unicode
System.out.println("Hello World \u2665");
}
}
******************************************************************
OUTPUT-->
Cyntia ☺ is my daughter!
Hello World ♥

Check whether data can be represented in a specified encoding

I'm writing a Java program that saves data to UTF8 text files. However, I'd also like to provide the option to save to IBM437 for compatibility with an old program that uses the same sort of data files.
How can I check to see if the data the user is trying to save isn't representable in IBM437? At the moment the file saves without complaining but results in unusual characters being replaced with question marks.
I'd prefer it if I could show a warning to the user that the data they are saving isn't supported in IBM437. The user could then have the option of manually replacing characters with the nearest ASCII equivalent.
Current code for saving is:
String encoding = "UTF-8";
if (forceLegacySupport)
{
// Force character encoding to IBM437
encoding = "IBM437";
}
BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(saveFile.getAbsoluteFile()), encoding));
IOController.writeFileToDisk(bw);
bw.close();

As mentioned by JB Nizet in comments you can use charset encoder
and for creating text/String as UTF-8
just a suggestion from my end:
public static char[] cookie = "HEADER_COOKIE".toCharArray();
byte[] cookieInBytes = new byte[COOKIE_SIZE];
for(int i=0;i<cookie.length;i++)
{
if(i < cookie.length)
cookieInBytes[i] = (byte)cookie[i];
}
String headerStr = new String(cookieInBytes,StandardCharsets.UTF_8);

How to get rid of "Rogue Chars" in an .txt encoded under UTF-8

My program is reading from a .txt encoded with UTF-8. The reason why I'm using UTF-8 is to handle the characters åäö. The problem I come across is when the lines are read is that there seems to be some "rogue" characters sneaking in to the string which causes problems when I'm trying to store those lines into variables. Here's the code:
public void Läsochlista()
{
String Content = "";
String[] Argument = new String[50];
int index = 0;
Log.d("steg1", "steg1");
try{
InputStream inputstream = openFileInput("text.txt");
if(inputstream != null)
{
Log.d("steg2", "steg2");
//InputStreamReader inputstreamreader = new InputStreamReader(inputstream);
//BufferedReader bufferreader = new BufferedReader(inputstreamreader);
BufferedReader in = new BufferedReader(new InputStreamReader(inputstream, "UTF-8"));
String reciveString = "";
StringBuilder stringbuilder = new StringBuilder();
while ((reciveString = in.readLine()) != null)
{
Argument[index] = reciveString;
index++;
if(index == 6)
{
Log.d(Argument[0], String.valueOf((Argument[0].length())));
AllaPlatser.add(new Platser(Float.parseFloat(Argument[0]), Float.parseFloat(Argument[1]), Integer.parseInt(Argument[2]), Argument[3], Argument[4], Integer.parseInt(Argument[5])));
Log.d("En ny plats skapades", Argument[3]);
Arrays.fill(Argument, null);
index = 0;
}
}
inputstream.close();
Content = stringbuilder.toString();
}
}
catch (FileNotFoundException e){
Log.e("Filen", " Hittades inte");
} catch (IOException e){
Log.e("Filen", " Ej läsbar");
}
}
Now, I'm getting the error
Invalid float: "61.193521"
where the line only contains the chars "61.193521". When i print out the length of the string as read within the program, the output shows "10" which is one more character than the string is supposed to contain. The question; How do i get rid of those invisible "Rouge" chars? and why are they there in the first place?

When you save a file as "UTF-8", your editor may be writing a byte-order mark (BOM) at the beginning of the file.
See if there's an option in your editor to save UTF-8 without the BOM.
Apparently the BOM is just a pain in the butt: What's different between UTF-8 and UTF-8 without BOM?
I know you want to be able to have extended characters in your data; however, you may want to pick a different encoding like Latin-1 (ISO 8859-1).
Or you can just read & discard the first three bytes from the input stream before you wrap it with the reader.

Unfortunately you have not provided the sample text file so testing with your code exactly is not possible and here is the theoretical answer based on guess, what could have been the reasons:
Looks like it is BOM related issue and you may have to treat this. Some related detail is given here: http://www.rgagnon.com/javadetails/java-handle-utf8-file-with-bom.html
And some information here: What is XML BOM and how do I detect it?
Basically there are various situation:
In one of the situation we face issues when we don't read and write using correct encoding.
In another situation we use an editor or reader which doesn't support UTF-8
Third is when we are using correct encoding for reading and writing, we are not facing issue in a text editor but facing issue in some other application or program. I think your issues is related to third case.
In third situation we may have to remove the BOM using a program or deal with it according to our context.
Here is some solution I guess you may find interesting:
UTF-8 file reading: the first character issue
You can use code given in this threads answer or use apache commons to deal with it:
Byte order mark screws up file reading in Java

Java : How to determine the correct charset encoding of a stream

With reference to the following thread:
Java App : Unable to read iso-8859-1 encoded file correctly
What is the best way to programatically determine the correct charset encoding of an inputstream/file ?
I have tried using the following:
File in = new File(args[0]);
InputStreamReader r = new InputStreamReader(new FileInputStream(in));
System.out.println(r.getEncoding());
But on a file which I know to be encoded with ISO8859_1 the above code yields ASCII, which is not correct, and does not allow me to correctly render the content of the file back to the console.

You cannot determine the encoding of a arbitrary byte stream. This is the nature of encodings. A encoding means a mapping between a byte value and its representation. So every encoding "could" be the right.
The getEncoding() method will return the encoding which was set up (read the JavaDoc) for the stream. It will not guess the encoding for you.
Some streams tell you which encoding was used to create them: XML, HTML. But not an arbitrary byte stream.
Anyway, you could try to guess an encoding on your own if you have to. Every language has a common frequency for every char. In English the char e appears very often but ê will appear very very seldom. In a ISO-8859-1 stream there are usually no 0x00 chars. But a UTF-16 stream has a lot of them.
Or: you could ask the user. I've already seen applications which present you a snippet of the file in different encodings and ask you to select the "correct" one.

I have used this library, similar to jchardet for detecting encoding in Java:
https://github.com/albfernandez/juniversalchardet

check this out:
http://site.icu-project.org/ (icu4j)
they have libraries for detecting charset from IOStream
could be simple like this:
BufferedInputStream bis = new BufferedInputStream(input);
CharsetDetector cd = new CharsetDetector();
cd.setText(bis);
CharsetMatch cm = cd.detect();
if (cm != null) {
reader = cm.getReader();
charset = cm.getName();
}else {
throw new UnsupportedCharsetException()
}

Here are my favorites:
TikaEncodingDetector
Dependency:
<dependency>
<groupId>org.apache.any23</groupId>
<artifactId>apache-any23-encoding</artifactId>
<version>1.1</version>
</dependency>
Sample:
public static Charset guessCharset(InputStream is) throws IOException {
return Charset.forName(new TikaEncodingDetector().guessEncoding(is));
}
GuessEncoding
Dependency:
<dependency>
<groupId>org.codehaus.guessencoding</groupId>
<artifactId>guessencoding</artifactId>
<version>1.4</version>
<type>jar</type>
</dependency>
Sample:
public static Charset guessCharset2(File file) throws IOException {
return CharsetToolkit.guessEncoding(file, 4096, StandardCharsets.UTF_8);
}

You can certainly validate the file for a particular charset by decoding it with a CharsetDecoder and watching out for "malformed-input" or "unmappable-character" errors. Of course, this only tells you if a charset is wrong; it doesn't tell you if it is correct. For that, you need a basis of comparison to evaluate the decoded results, e.g. do you know beforehand if the characters are restricted to some subset, or whether the text adheres to some strict format? The bottom line is that charset detection is guesswork without any guarantees.

Which library to use?
As of this writing, they are three libraries that emerge:
GuessEncoding
ICU4j
juniversalchardet
I don't include Apache Any23 because it uses ICU4j 3.4 under the hood.
How to tell which one has detected the right charset (or as close as possible)?
It's impossible to certify the charset detected by each above libraries. However, it's possible to ask them in turn and score the returned response.
How to score the returned response?
Each response can be assigned one point. The more points a response have, the more confidence the detected charset has. This is a simple scoring method. You can elaborate others.
Is there any sample code?
Here is a full snippet implementing the strategy described in the previous lines.
public static String guessEncoding(InputStream input) throws IOException {
// Load input data
long count = 0;
int n = 0, EOF = -1;
byte[] buffer = new byte[4096];
ByteArrayOutputStream output = new ByteArrayOutputStream();
while ((EOF != (n = input.read(buffer))) && (count <= Integer.MAX_VALUE)) {
output.write(buffer, 0, n);
count += n;
}
if (count > Integer.MAX_VALUE) {
throw new RuntimeException("Inputstream too large.");
}
byte[] data = output.toByteArray();
// Detect encoding
Map<String, int[]> encodingsScores = new HashMap<>();
// * GuessEncoding
updateEncodingsScores(encodingsScores, new CharsetToolkit(data).guessEncoding().displayName());
// * ICU4j
CharsetDetector charsetDetector = new CharsetDetector();
charsetDetector.setText(data);
charsetDetector.enableInputFilter(true);
CharsetMatch cm = charsetDetector.detect();
if (cm != null) {
updateEncodingsScores(encodingsScores, cm.getName());
}
// * juniversalchardset
UniversalDetector universalDetector = new UniversalDetector(null);
universalDetector.handleData(data, 0, data.length);
universalDetector.dataEnd();
String encodingName = universalDetector.getDetectedCharset();
if (encodingName != null) {
updateEncodingsScores(encodingsScores, encodingName);
}
// Find winning encoding
Map.Entry<String, int[]> maxEntry = null;
for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
if (maxEntry == null || (e.getValue()[0] > maxEntry.getValue()[0])) {
maxEntry = e;
}
}
String winningEncoding = maxEntry.getKey();
//dumpEncodingsScores(encodingsScores);
return winningEncoding;
}
private static void updateEncodingsScores(Map<String, int[]> encodingsScores, String encoding) {
String encodingName = encoding.toLowerCase();
int[] encodingScore = encodingsScores.get(encodingName);
if (encodingScore == null) {
encodingsScores.put(encodingName, new int[] { 1 });
} else {
encodingScore[0]++;
}
}
private static void dumpEncodingsScores(Map<String, int[]> encodingsScores) {
System.out.println(toString(encodingsScores));
}
private static String toString(Map<String, int[]> encodingsScores) {
String GLUE = ", ";
StringBuilder sb = new StringBuilder();
for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
sb.append(e.getKey() + ":" + e.getValue()[0] + GLUE);
}
int len = sb.length();
sb.delete(len - GLUE.length(), len);
return "{ " + sb.toString() + " }";
}
Improvements:
The guessEncoding method reads the inputstream entirely. For large inputstreams this can be a concern. All these libraries would read the whole inputstream. This would imply a large time consumption for detecting the charset.
It's possible to limit the initial data loading to a few bytes and perform the charset detection on those few bytes only.

As far as I know, there is no general library in this context to be suitable for all types of problems. So, for each problem you should test the existing libraries and select the best one which satisfies your problem’s constraints, but often none of them is appropriate. In these cases you can write your own Encoding Detector! As I have wrote ...
I’ve wrote a meta java tool for detecting charset encoding of HTML Web pages, using IBM ICU4j and Mozilla JCharDet as the built-in components. Here you can find my tool, please read the README section before anything else. Also, you can find some basic concepts of this problem in my paper and in its references.
Bellow I provided some helpful comments which I’ve experienced in my work:
Charset detection is not a foolproof process, because it is essentially based on statistical data and what actually happens is guessing not detecting
icu4j is the main tool in this context by IBM, imho
Both TikaEncodingDetector and Lucene-ICU4j are using icu4j and their accuracy had not a meaningful difference from which the icu4j in my tests (at most %1, as I remember)
icu4j is much more general than jchardet, icu4j is just a bit biased to IBM family encodings while jchardet is strongly biased to utf-8
Due to the widespread use of UTF-8 in HTML-world; jchardet is a better choice than icu4j in overall, but is not the best choice!
icu4j is great for East Asian specific encodings like EUC-KR, EUC-JP, SHIFT_JIS, BIG5 and the GB family encodings
Both icu4j and jchardet are debacle in dealing with HTML pages with Windows-1251 and Windows-1256 encodings. Windows-1251 aka cp1251 is widely used for Cyrillic-based languages like Russian and Windows-1256 aka cp1256 is widely used for Arabic
Almost all encoding detection tools are using statistical methods, so the accuracy of output strongly depends on the size and the contents of the input
Some encodings are essentially the same just with a partial differences, so in some cases the guessed or detected encoding may be false but at the same time be true! As about Windows-1252 and ISO-8859-1. (refer to the last paragraph under the 5.2 section of my paper)

The libs above are simple BOM detectors which of course only work if there is a BOM in the beginning of the file. Take a look at http://jchardet.sourceforge.net/ which does scans the text

If you use ICU4J (http://icu-project.org/apiref/icu4j/)
Here is my code:
String charset = "ISO-8859-1"; //Default chartset, put whatever you want
byte[] fileContent = null;
FileInputStream fin = null;
//create FileInputStream object
fin = new FileInputStream(file.getPath());
/*
* Create byte array large enough to hold the content of the file.
* Use File.length to determine size of the file in bytes.
*/
fileContent = new byte[(int) file.length()];
/*
* To read content of the file in byte array, use
* int read(byte[] byteArray) method of java FileInputStream class.
*
*/
fin.read(fileContent);
byte[] data = fileContent;
CharsetDetector detector = new CharsetDetector();
detector.setText(data);
CharsetMatch cm = detector.detect();
if (cm != null) {
int confidence = cm.getConfidence();
System.out.println("Encoding: " + cm.getName() + " - Confidence: " + confidence + "%");
//Here you have the encode name and the confidence
//In my case if the confidence is > 50 I return the encode, else I return the default value
if (confidence > 50) {
charset = cm.getName();
}
}
Remember to put all the try-catch need it.
I hope this works for you.

If you don't know the encoding of your data, it is not so easy to determine, but you could try to use a library to guess it. Also, there is a similar question.

I found a nice third party library which can detect actual encoding:
http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding
I didn't test it extensively but it seems to work.

For ISO8859_1 files, there is not an easy way to distinguish them from ASCII. For Unicode files however one can generally detect this based on the first few bytes of the file.
UTF-8 and UTF-16 files include a Byte Order Mark (BOM) at the very beginning of the file. The BOM is a zero-width non-breaking space.
Unfortunately, for historical reasons, Java does not detect this automatically. Programs like Notepad will check the BOM and use the appropriate encoding. Using unix or Cygwin, you can check the BOM with the file command. For example:
$ file sample2.sql
sample2.sql: Unicode text, UTF-16, big-endian
For Java, I suggest you check out this code, which will detect the common file formats and select the correct encoding: How to read a file and automatically specify the correct encoding

An alternative to TikaEncodingDetector is to use Tika AutoDetectReader.
Charset charset = new AutoDetectReader(new FileInputStream(file)).getCharset();

A good strategy to handle this, is with a way to auto detect the input charset.
I use org.xml.sax.InputSource in Java 11 to solve it:
...
import org.xml.sax.InputSource;
...
InputSource inputSource = new InputSource(inputStream);
inputStreamReader = new InputStreamReader(
inputSource.getByteStream(), inputSource.getEncoding()
);
Input sample:
<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:dc="https://purl.org/dc/elements/1.1/" version="2.0">
<channel>
...**strong text**

In plain Java:
final String[] encodings = { "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16" };
List<String> lines;
for (String encoding : encodings) {
try {
lines = Files.readAllLines(path, Charset.forName(encoding));
for (String line : lines) {
// do something...
}
break;
} catch (IOException ioe) {
System.out.println(encoding + " failed, trying next.");
}
}
This approach will try the encodings one by one until one works or we run out of them.
(BTW my encodings list has only those items because they are the charsets implementations required on every Java platform, https://docs.oracle.com/javase/9/docs/api/java/nio/charset/Charset.html)

Can you pick the appropriate char set in the Constructor:
new InputStreamReader(new FileInputStream(in), "ISO8859_1");

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

java convert String windows-1251 to utf8 - java

Related

Problem with input from user saved to file by RandomAccessFile methods

Print unicode character in java

Check whether data can be represented in a specified encoding

How to get rid of "Rogue Chars" in an .txt encoded under UTF-8

Java : How to determine the correct charset encoding of a stream

Categories

Resources