I have this example. It reads a line "hello" from a file saved as UTF-8. Here is my question:
Strings are stored in Java in UTF-16 format. So when it reads the line "hello" it converts it to UTF-16. So string s is UTF-16 with a UTF-16 BOM... am I right?
FileReader filereader = new FileReader(file);
BufferedReader read = new BufferedReader(filereader);
String s = null;
while ((s = read.readLine()) != null)
{
    System.out.println(s);
}
So when I do this:
s = s.replace("\uFEFF", "A");
nothing happens. Should the above find and replace the UTF-16 BOM? Or is it eventually in UTF-8 format? I'm a little bit confused about this.
Thank you
Try the Apache Commons IO library and its org.apache.commons.io.input.BOMInputStream class to get rid of this kind of problem.
Example:
String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(file);
try
{
BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
ByteOrderMark bom = bOMInputStream.getBOM();
String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
// your code...
}
finally
{
inputStream.close();
}
As for the BOM itself, as @seand said, it's just metadata used for reading, writing and storing strings. It belongs to the encoded byte stream rather than to the text itself, and you cannot replace or modify it unless you work at the binary level or re-encode the strings.
Let's make a few examples:
String str = "Hadoop";
byte bt1[] = str.getBytes();
System.out.println(bt1.length); // 6
byte bt2a[] = str.getBytes("UTF-16");
System.out.println(bt2a.length); // 14
byte bt2b[] = str.getBytes("UTF-16BE");
System.out.println(bt2b.length); // 12
byte bt3[] = str.getBytes("UTF-16LE");
System.out.println(bt3.length); // 12
With "UTF-16" (which defaults to big endian) you get 14 bytes because a BOM is inserted to mark the byte order. If you name the byte order explicitly with "UTF-16BE" or "UTF-16LE", no BOM is added and you get 12 bytes.
You cannot strip the BOM from a string with a simple replace, as you tried, because the BOM, if present, belongs to the underlying byte stream that the Java framework decodes into a string. You can't manipulate it the way you manipulate the characters that are part of the string itself.
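If you want to see that byte-level BOM yourself, here's a minimal sketch (the file name is hypothetical) that prints the first raw bytes of the file before any charset decoding happens:
try (FileInputStream in = new FileInputStream("hello.txt")) { // hypothetical file name
    byte[] head = new byte[4];
    int n = in.read(head);
    for (int i = 0; i < n; i++) {
        System.out.printf("%02X ", head[i] & 0xFF); // EF BB BF = UTF-8 BOM, FE FF / FF FE = UTF-16 BOMs
    }
    System.out.println();
}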
Related
I'm trying to convert String a = "try" to a UTF-16 String.
I did this:
try {
    String ulany = new String("357810087745445");
    System.out.println(ulany.getBytes().length);
    String string = new String(ulany.getBytes(), "UTF-16");
    System.out.println(string.getBytes().length);
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
And ulany.getBytes().length = 15
and string.getBytes().length = 24, but I think that it should be 30. What did I do wrong?
String (and char) already hold Unicode, so nothing is needed for that.
However, if you want bytes, binary data in some encoding like UTF-16, you need a conversion:
ulany.getBytes("UTF-16") // Those bytes are in UTF-16 big endian
ulany.getBytes("UTF-16LE")
However, System.out uses the operating system's encoding, so you cannot just pick some different encoding for the output.
In fact, char itself is UTF-16 encoded.
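For instance, here's a minimal sketch (my own illustration, not part of the original code) of what the lengths look like when the encoding is named explicitly on both sides:
String ulany = "357810087745445";                            // 15 characters
byte[] utf16   = ulany.getBytes(StandardCharsets.UTF_16);    // 32 bytes: 2-byte BOM + 15 * 2
byte[] utf16le = ulany.getBytes(StandardCharsets.UTF_16LE);  // 30 bytes: no BOM
String back = new String(utf16, StandardCharsets.UTF_16);    // decodes back to the original string
System.out.println(utf16.length + " " + utf16le.length + " " + back.equals(ulany)); // 32 30 true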
What happens
//String ulany = new String("357810087745445");
String ulany = "357810087745445";
The String copy constructor is a leftover from Java's C++-influenced beginnings and is pointless here.
System.out.println(ulany.getBytes().length);
This will behave differently on different platforms, as getBytes() uses the platform default Charset. Better:
System.out.println(ulany.getBytes("UTF-8").length);
String string = new String(ulany.getBytes(), "UTF-16");
This interprets those bytes pairwise; having an odd count of 15 bytes is already wrong. You get 7 characters from the 14 complete pairs, plus a replacement character for the dangling 15th byte (8 characters in total), and they are "special" because the high byte of each pair is not zero.
System.out.println(string.getBytes().length);
Now getting 24 means 3 bytes per char on average (8 × 3). Hence the default platform encoding is probably UTF-8, which turns each of those characters into a multi-byte sequence.
The string will contain something like:
String string = "\u3533\u3837\u3031\u3830\u3737\u3534\u3434?";
You can also pass a character encoding to getBytes(). For example:
String string = new String(ulany.getBytes("UTF-8"), "UTF-16");
FileInputStream fin = new FileInputStream("D:\\testout.txt");
BufferedInputStream bin = new BufferedInputStream(fin);
int i;
while ((i = bin.read()) != -1) {
    System.out.print((char) i);
}
bin.close();
fin.close();
output: ÿþGreat
I have checked the file testout.txt; it contains only one word, i.e. Great.
When you're working with text, you should use a Reader, e.g.:
try(
    BufferedReader reader = Files.newBufferedReader(
        Paths.get("D:\\testout.txt"),
        StandardCharsets.UTF_16)  // the ÿþ prefix shows this file is UTF-16, not UTF-8
){
    int i;
    while ((i = reader.read()) != -1) {
        System.out.print((char) i);
    }
}
That's most probably the byte order mark (BOM), which is optional but allowed in files using the UTF-8 and UTF-16 character encodings. Some programs (e.g. Notepad) account for this possibility, some don't. Java by default doesn't strip it.
One utility to solve this is the BOMInputStream from Apache Commons IO.
Also, Notepad will write the byte order mark in the file when you save it as UTF-8.
ÿþ is the byte order mark of UTF-16 (little endian). You can convert your string to UTF-8 with java.io.
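For example, here's a minimal sketch of that conversion (the output file name is hypothetical): read the file through a UTF-16 decoder and write it back out through a UTF-8 encoder.
try (BufferedReader in = new BufferedReader(new InputStreamReader(
         new FileInputStream("D:\\testout.txt"), "UTF-16"));
     Writer out = new OutputStreamWriter(
         new FileOutputStream("D:\\testout-utf8.txt"), "UTF-8")) {
    int c;
    while ((c = in.read()) != -1) {
        out.write(c); // characters are re-encoded as UTF-8 on the way out
    }
}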
Please use the UTF-8 character encoding to resolve this kind of issue.
byte[] utf_8 = input.getBytes("UTF-8"); // encode the Unicode string as UTF-8 bytes
String test = new String(utf_8, "UTF-8"); // decode with the same charset, not the platform default
I have an input file in XML format and it is well formed, with the accents written correctly. The file is created with a PHP script that works fine.
But when I read the XML file and write it to another XML file using a Java program, it puts strange characters in place of the accented characters.
This is the method that reads the XML File:
public static String getArchivo(FileInputStream fileinputstream)
{
    String s = null;
    try
    {
        byte abyte0[] = new byte[1024];
        int i = fileinputstream.read(abyte0);
        if(i != -1)
        {
            s = new String(abyte0, 0, i);
            for(int j = fileinputstream.read(abyte0); j != -1; j = fileinputstream.read(abyte0))
            {
                s = s + new String(abyte0, 0, j);
            }
        }
    }
    catch(IOException ioexception)
    {
        s = null;
    }
    return s;
}
Since the file is read byte by byte, how do I replace the "bad" bytes with the correct bytes for the accented characters?
If reading files like these byte by byte is not a good idea, how can I do it better?
The characters that I need are: á, é, í, ó, ú, Á, É, Í, Ó, Ú, ñ, Ñ and °.
Thanks in advance
Probably you are reading the file with UTF-8 charset. Special chars are not part of the UTF-8 charset. Change from UTF-8 to UTF-16
Something like
InputStream in = ...
InputSource is = new InputSource(new InputStreamReader(in, "utf-16"));
As Jordi said, the charset matters; but note that there are no special chars outside of UTF-8, so consider the first part of his answer as information for other special chars.
Looking deeper at your code, I see that you read bytes and convert them to a String. Don't convert them. Read the bytes and write the bytes to be sure the data will not be changed.
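A minimal sketch of that byte-for-byte copy (the file names are hypothetical):
try (InputStream in = new FileInputStream("input.xml");
     OutputStream out = new FileOutputStream("output.xml")) {
    byte[] buffer = new byte[4096];
    int n;
    while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n); // bytes pass through untouched, so the accents survive
    }
}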
Works for me using Charset ISO-8859-1. Syntax in Kotlin:
val inputStream : InputStream = FileInputStream(filePath)
val json = inputStream.bufferedReader(Charsets.ISO_8859_1).use { it.readText()}
When you read the file, it is best to use the UTF-8 encoding:
BufferedReader rd = new BufferedReader(new InputStreamReader(is, "utf-8"));
When writing, also use UTF-8:
OutputStreamWriter writer = new OutputStreamWriter( new FileOutputStream(filePath, true), "utf-8");
This worked for me.
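To tie the two snippets above together, here's a minimal sketch (the file names are hypothetical) that copies a file line by line, decoding and re-encoding as UTF-8 on both sides:
try (BufferedReader rd = new BufferedReader(new InputStreamReader(
         new FileInputStream("input.xml"), "utf-8"));
     OutputStreamWriter writer = new OutputStreamWriter(
         new FileOutputStream("output.xml"), "utf-8")) {
    String line;
    while ((line = rd.readLine()) != null) {
        writer.write(line);
        writer.write(System.lineSeparator()); // readLine() strips line endings, so add them back
    }
}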
When reading the file in vi or another editor, change the default encoding to UTF-8. You can check and set the locale, e.g.:
locale charmap
LANG=en_US.UTF-8
I'm trying to read a (Japanese) file that is encoded as UTF-16.
When I read it using an InputStreamReader with a charset of "UTF-16", the file is read correctly:
try {
    InputStreamReader read = new InputStreamReader(new FileInputStream("JapanTest.txt"), "UTF-16");
    BufferedReader in = new BufferedReader(read);
    String str;
    while ((str = in.readLine()) != null) {
        System.out.println(str);
    }
    in.close();
} catch (Exception e) {
    System.out.println(e);
}
However, when I use File Channels and read from a byte array the Strings aren't always converted correctly:
File f = new File("JapanTest.txt");
fis = new FileInputStream(f);
channel = fis.getChannel();
MappedByteBuffer buffer = channel.map( FileChannel.MapMode.READ_ONLY, 0L, channel.size());
buffer.position(0);
int get = Math.min(buffer.remaining(), 1024);
byte[] barray = new byte[1024];
buffer.get(barray, 0, get);
Charset charSet = Charset.forName("UTF-16");
//endOfLinePos is a calculated value and defines the number of bytes to read
rowString = new String(barray, 0, endOfLinePos, charSet);
System.out.println(rowString);
The problem I've found is that I can only read characters correctly if the MappedByteBuffer is at position 0. If I increment the position of the MappedByteBuffer and then read a number of bytes into a byte array, which is then converted to a string using the charset UTF-16, then the bytes are not converted correctly. I haven't faced this issue if a file is encoded in UTF-8, so is this only an issue with UTF-16?
More Details:
I need to be able to read any line from the file channel, so to do this I build a list of line ending byte positions and then use those positions to be able to get the bytes for any given line and then convert them to a string.
The code unit of UTF-16 is 2 bytes, not 1 byte as in UTF-8. The bit pattern and the single-byte code unit length make UTF-8 self-synchronizing: a decoder can start reading correctly at any point, and if it lands on a continuation byte it can either backtrack or lose at most a single character.
With UTF-16 you must always work with pairs of bytes: you cannot start or stop reading at an odd byte offset. You also must know the endianness, and use either UTF-16LE or UTF-16BE when you are not reading from the start of the file, because there will be no BOM at that point.
You can also encode the file as UTF-8.
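To illustrate the even-offset rule, here's a minimal self-contained sketch (the bytes are hard-coded, not taken from the question's file): the slice starts after the 2-byte BOM, its length stays even, and the endianness is named explicitly because the BOM is no longer part of the slice.
// FF FE is the UTF-16LE BOM; the remaining bytes are "ab" in UTF-16LE.
byte[] barray = { (byte) 0xFF, (byte) 0xFE, 0x61, 0x00, 0x62, 0x00 };
int start = 2;                          // skip the BOM
int len = (barray.length - start) & ~1; // only whole 2-byte code units
String row = new String(barray, start, len, StandardCharsets.UTF_16LE);
System.out.println(row);                // prints "ab"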
Possibly, the InputStreamReader does some transformations the normal new String(...) does not. As a work-around (and to verify this assumption) you could try to wrap the data read from the channel like new InputStreamReader( new ByteArrayInputStream( barray ) ).
Edit: Forget that :) - Channels.newReader() would be the way to go.
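A minimal sketch of that approach, assuming the same UTF-16 file from the question:
try (FileChannel channel = FileChannel.open(Paths.get("JapanTest.txt"), StandardOpenOption.READ);
     BufferedReader in = new BufferedReader(Channels.newReader(channel, "UTF-16"))) {
    String str;
    while ((str = in.readLine()) != null) {
        System.out.println(str); // the channel's bytes are decoded for you, BOM included
    }
}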
Running the following (example) code
import java.io.*;
public class test {
    public static void main(String[] args) throws Exception {
        byte[] buf = {-27};
        InputStream is = new ByteArrayInputStream(buf);
        BufferedReader r = new BufferedReader(
            new InputStreamReader(is, "ISO-8859-1"));
        String s = r.readLine();
        System.out.println("test.java:9 [byte] (char)" + (char) s.getBytes()[0] +
            " (int)" + (int) s.getBytes()[0]);
        System.out.println("test.java:10 [char] (char)" + (char) s.charAt(0) +
            " (int)" + (int) s.charAt(0));
        System.out.println("test.java:11 string below");
        System.out.println(s);
        System.out.println("test.java:13 string above");
    }
}
gives me this output
test.java:9 [byte] (char)? (int)63
test.java:10 [char] (char)? (int)229
test.java:11 string below
?
test.java:13 string above
How do I retain the correct byte value (-27) in the line-9 printout, and consequently get the expected output (å) from the System.out.println(s) call?
If you want to retain byte values, don't use a Reader at all, ideally. To represent arbitrary binary data in text and convert it back to binary data later, you should use base16 or base64 encoding.
However, to explain what's going on, when you call s.getBytes() that's using the default character encoding, which apparently doesn't include Unicode character U+00E5.
If you call s.getBytes("ISO-8859-1") everywhere instead of s.getBytes() I suspect you'll get back the right byte value... but relying on ISO-8859-1 for this is kinda dirty IMO.
As noted, getBytes() with no arguments uses the Java platform default encoding, which may not be ISO-8859-1. Simply printing the string should work, provided your terminal and the default encoding match and support the character. For instance, on my system the terminal and default Java encoding are both UTF-8. The fact that you're seeing a '?' indicates that yours don't match, or that å is not supported.
If you want to manually encode to UTF-8 on your system, do:
String s = r.readLine();
byte[] utf8Bytes = s.getBytes("UTF-8");
It should give a byte array with {-61, -91}.
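If you also want the printout itself to be independent of the platform default, one option (a sketch; the Charset overload requires Java 10+) is to print through a PrintStream with an explicit encoding:
PrintStream utf8Out = new PrintStream(System.out, true, StandardCharsets.UTF_8); // Java 10+
utf8Out.println(s); // shows å, provided the terminal really is UTF-8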