I have an input file in XML format; it is well formed and the accented characters are written correctly. The file is created with a PHP script that works fine.
But when I read the XML file and write it to another XML file using a Java program, it puts strange characters in place of the accented characters.
This is the method that reads the XML File:
public static String getArchivo(FileInputStream fileinputstream)
{
    String s = null;
    try
    {
        byte abyte0[] = new byte[1024];
        int i = fileinputstream.read(abyte0);
        if (i != -1)
        {
            s = new String(abyte0, 0, i);
            for (int j = fileinputstream.read(abyte0); j != -1; j = fileinputstream.read(abyte0))
            {
                s = s + new String(abyte0, 0, j);
            }
        }
    }
    catch (IOException ioexception)
    {
        s = null;
    }
    return s;
}
Since the file is read as raw bytes, how do I replace the "bad" bytes with the correct bytes for the accented characters?
If reading files like these byte by byte is not a good idea, how can I do it better?
The characters that i need are: á, é, í, ó, ú, Á, É, Í, Ó, Ú, ñ, Ñ and °.
Thanks in advance
Probably you are reading the file with the wrong charset. Try changing from UTF-8 to UTF-16.
Something like
InputStream in = ...
InputSource is = new InputSource(new InputStreamReader(in, "utf-16"));
As Jordi correctly said, there are no special chars outside of UTF-8 (UTF-8 can encode every Unicode character), so consider the first part only as information for files in other charsets.
Looking deeper at your code, I see that you read bytes and convert them to a String using the platform default charset. Don't convert. Read bytes and write bytes to be sure the data will not be changed.
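For example, a minimal sketch of that advice (input.xml and output.xml are placeholder names): copy the file byte for byte, so no charset decoding can corrupt the accents.
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class ByteCopy {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream("input.xml");
             FileOutputStream out = new FileOutputStream("output.xml")) {
            byte[] buffer = new byte[1024];
            int n;
            // Copy raw bytes; no String is ever built, so the bytes of
            // the accented characters pass through unchanged.
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }
}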
Works for me using charset ISO-8859-1. Syntax in Kotlin:
val inputStream: InputStream = FileInputStream(filePath)
val json = inputStream.bufferedReader(Charsets.ISO_8859_1).use { it.readText() }
When you read the file, it is best to use UTF-8 encoding:
BufferedReader rd = new BufferedReader(new InputStreamReader(is, "utf-8"));
When writing, also use UTF-8:
OutputStreamWriter writer = new OutputStreamWriter( new FileOutputStream(filePath, true), "utf-8");
This worked for me.
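Put together, a minimal round-trip sketch (input.xml and output.xml are placeholder names):
import java.io.*;

public class Utf8RoundTrip {
    public static void main(String[] args) throws IOException {
        try (BufferedReader rd = new BufferedReader(new InputStreamReader(
                     new FileInputStream("input.xml"), "utf-8"));
             Writer writer = new OutputStreamWriter(
                     new FileOutputStream("output.xml"), "utf-8")) {
            int c;
            // Decode and re-encode with the same charset, character by character.
            while ((c = rd.read()) != -1) {
                writer.write(c);
            }
        }
    }
}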
When viewing the file in vi or another editor, change the default encoding to UTF-8; you can check your locale settings with:
locale charmap
LANG=en_US.UTF-8
Related
I tried to use java.io.FileReader to read some text files and convert them into a string, but I found the result is wrongly encoded and not readable at all.
Here's my environment:
Windows 2003, OS encoding: CP1252
Java 5.0
My files are UTF-8 encoded or CP1252 encoded, and some of them (UTF-8 encoded files) may contain Chinese (non-Latin) characters.
I use the following code to do my work:
private static String readFileAsString(String filePath)
        throws java.io.IOException {
    StringBuffer fileData = new StringBuffer(1000);
    FileReader fileReader = new FileReader(filePath);
    //System.out.println(fileReader.getEncoding());
    BufferedReader reader = new BufferedReader(fileReader);
    char[] buf = new char[1024];
    int numRead = 0;
    while ((numRead = reader.read(buf)) != -1) {
        String readData = String.valueOf(buf, 0, numRead);
        fileData.append(readData);
        buf = new char[1024];
    }
    reader.close();
    return fileData.toString();
}
The above code doesn't work. I found the FileReader's encoding is CP1252 even if the text is UTF-8 encoded. But the JavaDoc of java.io.FileReader says that:
The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate.
Does this mean that I am not required to set the character encoding myself if I am using FileReader? But I did get wrongly encoded data; what's the correct way to deal with my situation? Thanks.
Yes, you need to specify the encoding of the file you want to read.
Yes, this means that you have to know the encoding of the file you want to read.
No, there is no general way to guess the encoding of any given "plain text" file.
The one-argument constructors of FileReader always use the platform default encoding, which is generally a bad idea.
Since Java 11 FileReader has also gained constructors that accept an encoding: new FileReader(file, charset) and new FileReader(fileName, charset).
In earlier versions of Java, you need to use new InputStreamReader(new FileInputStream(pathToFile), <encoding>).
FileReader uses Java's platform default encoding, which depends on the system settings of the computer it's running on and is generally the most popular encoding among users in that locale.
If this "best guess" is not correct then you have to specify the encoding explicitly. Unfortunately, FileReader does not allow this (major oversight in the API). Instead, you have to use new InputStreamReader(new FileInputStream(filePath), encoding) and ideally get the encoding from metadata about the file.
For Java 7+ you can use this:
BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
All supported charsets are listed in the Charset documentation. For example, if your file is in CP1252, use:
Charset.forName("windows-1252");
The canonical names of Java encodings, for both java.io and java.nio, are listed in the supported encodings documentation.
If you do not know exactly which encoding a file uses, you can use a third-party charset detector such as juniversalchardet, which works fairly well.
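For instance, a sketch with juniversalchardet (an assumption: the library must be on the classpath; the UniversalDetector API below comes from that library):
import java.io.FileInputStream;
import java.io.IOException;
import org.mozilla.universalchardet.UniversalDetector;

public class DetectEncoding {
    public static void main(String[] args) throws IOException {
        UniversalDetector detector = new UniversalDetector(null);
        try (FileInputStream fis = new FileInputStream(args[0])) {
            byte[] buf = new byte[4096];
            int nread;
            // Feed the detector until it is confident or the file ends.
            while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
        }
        detector.dataEnd();
        System.out.println(detector.getDetectedCharset()); // null if detection failed
    }
}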
Since Java 11 you can use this:
public FileReader(String fileName, Charset charset) throws IOException;
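A minimal usage sketch (file.txt is a placeholder; needs java.io.* and java.nio.charset.StandardCharsets):
try (BufferedReader br = new BufferedReader(
        new FileReader("file.txt", StandardCharsets.UTF_8))) {
    br.lines().forEach(System.out::println);
}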
FileInputStream with InputStreamReader is better than directly using FileReader, because FileReader (before Java 11) doesn't let you specify the encoding.
Here is an example using BufferedReader, FileInputStream and InputStreamReader together, so that you can read lines from a file.
List<String> words = new ArrayList<>();
List<String> meanings = new ArrayList<>();

public void readAll() throws IOException {
    String fileName = "College_Grade4.txt";
    String charset = "UTF-8";
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(
                    new FileInputStream(fileName), charset));
    String line;
    while ((line = reader.readLine()) != null) {
        line = line.trim();
        if (line.length() == 0) continue;
        int idx = line.indexOf("\t");
        words.add(line.substring(0, idx));
        meanings.add(line.substring(idx + 1));
    }
    reader.close();
}
For non-Latin scripts, for example Cyrillic, you can use something like this:
FileReader fr = new FileReader("src/text.txt", StandardCharsets.UTF_8);
and be sure that your .txt file is saved in UTF-8 (not the default ANSI) format. Cheers!
I have this example. It reads the line "hello" from a file saved as UTF-8. Here is my question:
Strings are stored in Java internally in UTF-16 format. So when the program reads the line "hello", it converts it to UTF-16. So string s is UTF-16 with a UTF-16 BOM... am I right?
filereader = new FileReader(file);
read = new BufferedReader(filereader);
String s = null;
while ((s = read.readLine()) != null)
{
    System.out.println(s);
}
So when I do this:
s = s.replace("\uFEFF", "A");
nothing happens. Should the above find and replace the UTF-16 BOM? Or does the string end up in UTF-8 format? I'm a little bit confused about this.
Thank you
Try the Apache Commons IO library and the class org.apache.commons.io.input.BOMInputStream to get rid of this kind of problem.
Example:
String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(file);
try
{
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    // your code...
}
finally
{
    inputStream.close();
}
For what concerns the BOM itself, as @seand said, it's just metadata used when reading/writing/storing strings. It's present in the underlying bytes, but you cannot replace or modify it unless you work at the binary level or re-encode the string.
Let's make a few examples:
String str = "Hadoop";
byte bt1[] = str.getBytes();
System.out.println(bt1.length); // 6
byte bt2a[] = str.getBytes("UTF-16");
System.out.println(bt2a.length); // 14
byte bt2b[] = str.getBytes("UTF-16BE");
System.out.println(bt2b.length); // 12
byte bt3[] = str.getBytes("UTF-16LE");
System.out.println(bt3.length); // 12
In the plain "UTF-16" version (which defaults to big-endian) you get 14 bytes, because a BOM is inserted to distinguish BE from LE. If you specify "UTF-16BE" or "UTF-16LE" explicitly you get 12 bytes, because no BOM is added.
You cannot strip the BOM from a string with a simple replace, as you tried. The BOM, if present, is part of the underlying byte stream that, on the memory side, is handled as a string by the Java framework, and you can't manipulate it like the characters that are part of the string itself.
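If you really need the BOM gone without a library, strip the bytes before decoding. A minimal sketch for a UTF-8 file starting with the UTF-8 BOM (EF BB BF); the file name is a placeholder and java.io imports are assumed:
PushbackInputStream in = new PushbackInputStream(new FileInputStream("file.txt"), 3);
byte[] bom = new byte[3];
int n = in.read(bom, 0, 3);
// Push the bytes back unless they are exactly the UTF-8 BOM.
if (n != 3 || bom[0] != (byte) 0xEF || bom[1] != (byte) 0xBB || bom[2] != (byte) 0xBF) {
    if (n > 0) in.unread(bom, 0, n);
}
BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
// reader now yields BOM-free text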
FileInputStream fin = new FileInputStream("D:\\testout.txt");
BufferedInputStream bin = new BufferedInputStream(fin);
int i;
while ((i = bin.read()) != -1) {
    System.out.print((char) i);
}
bin.close();
fin.close();
output: ÿþGreat
I have checked the file testout.txt; it contains only one word, i.e. Great.
When you're working with text, you should use a Reader, e.g.:
try (
    BufferedReader reader = Files.newBufferedReader(
        Paths.get("D:\\testout.txt"),
        StandardCharsets.UTF_8)
) {
    int i;
    while ((i = reader.read()) != -1) {
        System.out.print((char) i);
    }
}
That's most probably the byte order mark (BOM), optional but allowed in files using a Unicode character encoding. Some programs (e.g. Notepad) account for this possibility, some don't. Java by default doesn't strip it.
One utility to solve this is the BOMInputStream from Apache Commons IO.
Also, Notepad will write the byte order mark in the file when you save it as UTF-8.
ÿþ is the byte order mark of UTF-16 little-endian (the bytes FF FE shown through a single-byte charset). You can convert the text to UTF-8 with java.io by reading it as UTF-16 and writing it back out as UTF-8.
Please use the UTF-8 character encoding to resolve this kind of issue:
byte[] utf_8 = input.getBytes("UTF-8"); // encode the Unicode string as UTF-8 bytes
String test = new String(utf_8, "UTF-8"); // decode with the same charset
I tried to get the bytes and then convert them with UTF-8:
byte ptext[] = first_name.getBytes();
Log.i("", new String(ptext,"UTF-8"));
But it's not working. Sorry for my dumbness, I'm very confused.
try {
    String s = new String("Æàìáûë".getBytes(StandardCharsets.ISO_8859_1), "Windows-1251");
    Files.write(Paths.get("C:/cyrillic.txt"),
            ("\uFEFF" + s).getBytes(StandardCharsets.UTF_8));
} catch (IOException e) {
    e.printStackTrace();
}
This assumes that the editor and compiler are set to UTF-8, so the garbled string literal is preserved exactly.
It treats the characters as single bytes by abusing ISO-8859-1, then decodes those bytes as Windows-1251, one of the Cyrillic encodings (there are others).
This way we get a proper Java String (always Unicode internally).
We then write it to a text file in UTF-8, with a BOM, so Windows Notepad will identify the file as UTF-8.
Writing it in any Cyrillic encoding would be no problem either. The resulting file contains:
Жамбыл
Your byte array must have some encoding. The encoding cannot be ASCII if you've got negative byte values. Once you figure out which one it is, you can convert the bytes to a String using:
byte[] bytes = {...}
String str = new String(bytes, "UTF-8"); // for UTF-8 encoding
Log.i("value", str);
There are a bunch of encodings you can use; look at the Charset class in the Sun javadocs.
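For instance, you can list every charset the running JVM supports:
import java.nio.charset.Charset;

public class ListCharsets {
    public static void main(String[] args) {
        // availableCharsets() maps canonical name -> Charset, sorted by name
        Charset.availableCharsets().keySet().forEach(System.out::println);
        System.out.println("default: " + Charset.defaultCharset());
    }
}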
Seems your original encoding is Cp1251:
byte ptext[] = first_name.getBytes();
Log.i("", new String(ptext, "Cp1251")); // <- put it here
Resulting word is Жамбыл.
I'm trying to read a (Japanese) file that is encoded as UTF-16.
When I read it using an InputStreamReader with the charset "UTF-16", the file is read correctly:
try {
    InputStreamReader read = new InputStreamReader(new FileInputStream("JapanTest.txt"), "UTF-16");
    BufferedReader in = new BufferedReader(read);
    String str;
    while ((str = in.readLine()) != null) {
        System.out.println(str);
    }
    in.close();
} catch (Exception e) {
    System.out.println(e);
}
However, when I use File Channels and read from a byte array the Strings aren't always converted correctly:
File f = new File("JapanTest.txt");
FileInputStream fis = new FileInputStream(f);
FileChannel channel = fis.getChannel();
MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0L, channel.size());
buffer.position(0);
int get = Math.min(buffer.remaining(), 1024);
byte[] barray = new byte[1024];
buffer.get(barray, 0, get);
Charset charSet = Charset.forName("UTF-16");
// endOfLinePos is a calculated value and defines the number of bytes to read
String rowString = new String(barray, 0, endOfLinePos, charSet);
System.out.println(rowString);
The problem I've found is that I can only read characters correctly if the MappedByteBuffer is at position 0. If I increment the position of the MappedByteBuffer and then read a number of bytes into a byte array, which is then converted to a string using the charset UTF-16, then the bytes are not converted correctly. I haven't faced this issue if a file is encoded in UTF-8, so is this only an issue with UTF-16?
More Details:
I need to be able to read any line from the file channel, so to do this I build a list of line ending byte positions and then use those positions to be able to get the bytes for any given line and then convert them to a string.
The code unit of UTF-16 is 2 bytes, not 1 byte as in UTF-8. The bit pattern and single-byte code unit length make UTF-8 self-synchronizing: it can start reading correctly at any point, and if it lands on a continuation byte it can either backtrack or lose only a single character.
With UTF-16 you must always work with pairs of bytes; you cannot start or stop reading at an odd byte offset. You also must know the endianness and use either UTF-16LE or UTF-16BE when not reading from the start of the file, because there will be no BOM at that position.
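For example, a sketch of decoding one line from the mapped buffer, assuming the stored byte positions are even and the file turned out to be little-endian (use UTF_16BE instead if yours is big-endian):
// lineStart and lineEnd are hypothetical even offsets from your list of line endings
byte[] slice = new byte[lineEnd - lineStart];
buffer.position(lineStart);   // the MappedByteBuffer from the question
buffer.get(slice, 0, slice.length);
String row = new String(slice, StandardCharsets.UTF_16LE);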
You can also encode the file as UTF-8.
Possibly the InputStreamReader does some transformations that a plain new String(...) does not. As a workaround (and to verify this assumption) you could try to wrap the data read from the channel like new InputStreamReader(new ByteArrayInputStream(barray)).
Edit: Forget that :) - Channels.newReader() would be the way to go.
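Something like this, for instance (a sketch for sequential reading from the start of the channel, where the BOM is present; needs java.io, java.nio.channels and java.nio.file imports):
try (FileChannel channel = FileChannel.open(Paths.get("JapanTest.txt"));
     BufferedReader in = new BufferedReader(Channels.newReader(channel, "UTF-16"))) {
    String str;
    while ((str = in.readLine()) != null) {
        System.out.println(str);
    }
}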