println(char), characters turn into Chinese?

Please help me troubleshoot this problem.
I have an input file 'Trial.txt' with the content "Thanh Le".
Here is the function I used to try to read from the file:
public char[] importSeq(){
File file = new File("G:\\trial.txt");
char temp_seq[] = new char[100];
try{
FileInputStream fis = new FileInputStream(file);
BufferedInputStream bis = new BufferedInputStream(fis);
DataInputStream dis = new DataInputStream(bis);
int i = 0;
// Try to read all characters until the end of the file
while(dis.available() != 0){
temp_seq[i]=dis.readChar();
i++;
}
System.out.println(" imported");
} catch (FileNotFoundException e){
e.printStackTrace();
} catch (IOException e){
e.printStackTrace();
}
return temp_seq;
}
And the main function:
public static void main(String[] args) {
Sequence s1 = new Sequence();
char result[];
result = s1.importSeq();
int i = 0;
while(result[i] != 0){
System.out.println(result[i]);
i++;
}
}
And this is the output.
run:
imported
瑨
慮
栠
汥
BUILD SUCCESSFUL (total time: 0 seconds)

Frankly, that is a rather clumsy way to read a text file into a char[].
Here is a better example, assuming that the text file contains only ASCII characters.
File file = new File("G:/trial.txt");
char[] content = new char[(int) file.length()];
Reader reader = null;
try {
reader = new FileReader(file);
reader.read(content);
} finally {
if (reader != null) try { reader.close(); } catch (IOException ignore) {}
}
return content;
And then to print the char[], just do:
System.out.println(content);
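Note that InputStream#available() doesn't necessarily do what you're expecting: it returns an estimate of how many bytes can be read without blocking, not the number of bytes remaining in the file.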
See also:
Java IO tutorial
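As a rough sketch of the point about available(): a read loop can rely on read() returning -1 at the end of the stream instead. The class name ReadAllChars and the helper readAll are made up for this illustration, and a StringReader stands in for the FileReader so that the example runs without a file on disk.

```java
import java.io.CharArrayWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReadAllChars {
    // Hypothetical helper: drains a Reader into a char[] without calling available().
    static char[] readAll(Reader reader) {
        CharArrayWriter out = new CharArrayWriter();
        char[] buffer = new char[4096];
        try {
            int n;
            // read() may return fewer chars than requested; -1 signals end of stream.
            while ((n = reader.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toCharArray();
    }

    public static void main(String[] args) {
        // A StringReader is used here in place of new FileReader(file).
        char[] content = readAll(new StringReader("Thanh Le"));
        System.out.println(content);
    }
}
```

Unlike a single reader.read(content) call, the loop does not assume that one call fills the whole array.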

Because a Java char is 2 bytes wide, readChar() reads the file's bytes two at a time and combines each pair into a single Unicode character.
You can avoid this by using readByte() instead.
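A small sketch of the difference, under a few assumptions: the class and method names are invented for the illustration, a ByteArrayInputStream stands in for the file, and the bytes are assumed to be plain ASCII (the sample text is written in lowercase here).

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReadCharVsReadByte {
    // Consumes the data two bytes at a time, as the code in the question does.
    static String viaReadChar(byte[] data) {
        StringBuilder sb = new StringBuilder();
        try (DataInputStream dis = new DataInputStream(new ByteArrayInputStream(data))) {
            for (int i = 0; i < data.length / 2; i++) {
                sb.append(dis.readChar()); // high byte first, then low byte
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Consumes the data one byte at a time; each byte becomes one char.
    static String viaReadByte(byte[] data) {
        StringBuilder sb = new StringBuilder();
        try (DataInputStream dis = new DataInputStream(new ByteArrayInputStream(data))) {
            for (int i = 0; i < data.length; i++) {
                // Masking keeps the value in 0..255; only correct for ASCII/ISO-8859-1 data.
                sb.append((char) (dis.readByte() & 0xFF));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = "thanh le".getBytes(StandardCharsets.US_ASCII);
        System.out.println(viaReadChar(data)); // four chars built from byte pairs
        System.out.println(viaReadByte(data)); // the original eight characters
    }
}
```

readByte() only helps for single-byte encodings; for anything else, a Reader with an explicit charset is the safer choice.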

Some code to demonstrate what exactly is happening. A char in Java consists of two bytes and holds one UTF-16 code unit, which for most characters is the whole character. Java uses UTF-16 for char and String in memory, and readChar() builds each char from two bytes, high byte first. Your file used one byte per character, probably ASCII. So each readChar() call consumes two bytes, and thus two ASCII characters, from your file.
The following code shows how the single ASCII bytes 't' and 'h' become one Chinese character.
public class Main {
public static void main(String[] args) {
System.out.println((int)'t'); // 116 == x74 (116 is 74 in Hex)
System.out.println((int)'h'); // 104 == x68
System.out.println((int)'瑨'); // 29800 == x7468
// System.out.println('\u0074'); // t
// System.out.println('\u0068'); // h
// System.out.println('\u7468'); // 瑨
char th = (('t' << 8) + 'h'); //x74 x68
System.out.println(th); //瑨 == 29800 == '\u7468'
}
}

Related

How to add a comma after each byte read from a file and write the byte and the comma to another file?

I have a text file that contains only numbers, and I want to copy those numbers to another file, putting a comma after each digit.
I have tried writing a second byte, the ASCII code for the comma, after each byte copied from one file to the other, but it seems to overwrite the digit or get merged into it.
To fix this I tried calling flush(), but nothing changes.
BufferedInputStream input = null;
BufferedOutputStream output = null;
try {
// inPath and outPath are already defined
input = new BufferedInputStream(new FileInputStream(inPath));
output = new BufferedOutputStream(new FileOutputStream(outPath));
int c;
while ((c = input.read()) != -1) {
if (c >= 48 && c <= 57) { // making sure that the byte is a number
output.write(c);
output.write(44); // 44 is the decimal representation of the comma (,)
}
}
} finally {
if (input != null) {
input.close();
}
if (output != null) {
output.close();
}
}
If I have numbers like this in the first file:
123456789
I expect to see them in the other file like this:
1,2,3,4,5,6,7,8,9
but I'm seeing things like this:
ⰱⰲⰳⰴⰵⰶⰷⰸⰹ

Files.write(Paths.get("e:/numbersSeparated.txt"),
new String(Files.readAllBytes(Paths.get("e:/numbers.txt")), StandardCharsets.UTF_8)
.replaceAll(".(?!$)", "$0,").getBytes(StandardCharsets.UTF_8));
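One possible explanation for the strange glyphs, which the thread itself does not state: the bytes written by the question's loop look correct, but a viewer that guesses UTF-16LE would pair each digit byte with the following comma byte (0x31 0x2C becomes U+2C31). The sketch below only illustrates that pairing; the class and method names are invented, and the UTF-16LE guess is an assumption about the viewer.

```java
import java.nio.charset.StandardCharsets;

class CommaMojibake {
    // Encodes ASCII text to bytes, then decodes those bytes as if they were UTF-16LE.
    static String decodeAsUtf16Le(String ascii) {
        byte[] bytes = ascii.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // The question's loop writes a comma after every digit, including the last one.
        String misread = decodeAsUtf16Le("1,2,3,");
        for (char c : misread.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

If that is what happened, opening the output file as UTF-8 or ASCII should show the digits and commas as written.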

How to convert escape-decimal text back to unicode in Java

A third-party library in our stack is munging strings containing emoji etc like so:
"Ben \240\159\144\144\240\159\142\169"
That is, decimal bytes, not hexadecimal shorts.
Surely there is an existing routine to turn this back into a proper Unicode string, but all the discussion I've found about this expects the format \u12AF, not \123.

I am not aware of any existing routine, but something simple like this should do the job (assuming the input is available as a string):
public static String unEscapeDecimal(String s) {
try {
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Writer writer = new OutputStreamWriter(baos, "utf-8");
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == '\\') {
writer.flush();
baos.write(Integer.parseInt(s.substring(i+1, i+4)));
i += 3;
} else {
writer.write(c);
}
}
writer.flush();
return new String(baos.toByteArray(), "utf-8");
} catch (IOException e) {
throw new RuntimeException(e);
}
}
The writer is only there to make sure that characters with code points > 127 are encoded correctly, should they occur unescaped in the string. If all non-ASCII characters are escaped, the byte array output stream alone should be sufficient.
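A minimal sketch of the simpler variant mentioned above, assuming that every non-ASCII character in the input is escaped and that each escape is a backslash followed by exactly three decimal digits. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class DecimalUnescape {
    // Assumes unescaped characters are ASCII and escapes look like \ddd (three digits).
    static String unescape(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 3 < s.length()) {
                // The three digits after the backslash are one byte value (0..255).
                bytes.write(Integer.parseInt(s.substring(i + 1, i + 4)));
                i += 3;
            } else {
                bytes.write(c); // only the low 8 bits are written, hence the ASCII assumption
            }
        }
        // The collected bytes are interpreted as UTF-8, as in the answer above.
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String decoded = unescape("Ben \\240\\159\\144\\144\\240\\159\\142\\169");
        // Print the code points rather than the glyphs, since consoles vary.
        decoded.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Malformed escapes (fewer than three digits, or non-digits) would make Integer.parseInt throw, so real input may need validation first.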

Java remove diacritic

I am trying to write a function which will remove diacritics (I deliberately don't want to use Normalizer). The function looks like this:
private static String normalizeCharacter(Character curr) {
String sdiac = "áäčďéěíĺľňóôőöŕšťúůűüýřžÁÄČĎÉĚÍĹĽŇÓÔŐÖŔŠŤÚŮŰÜÝŘŽ";
String bdiac = "aacdeeillnoooorstuuuuyrzAACDEEILLNOOOORSTUUUUYRZ";
char[] s = sdiac.toCharArray();
char[] b = bdiac.toCharArray();
String ret;
for(int i = 0; i < sdiac.length(); i++){
if(curr == s[i])
curr = b[i];
}
ret = curr.toString().toLowerCase();
ret = ret.replace("\n", "").replace("\r","");
return ret;
}
The function is called like this (every character from the file is sent to it):
private static String readFile(String fName) {
File f = new File(fName);
StringBuilder sb = new StringBuilder();
try{
FileInputStream fStream = new FileInputStream(f);
Character curr;
while(fStream.available() > 0){
curr = (char) fStream.read();
sb.append(normalizeCharacter(curr));
System.out.print(normalizeCharacter(curr));
}
}catch(IOException e){
e.printStackTrace();
}
return sb.toString();
}
The file text.txt contains this: ľščťžýáíéúäôň and I expect lcstzyaieuaon in return from the program, but instead of the expected string I get this: ¾è yaieuaoò. I know that the problem is somewhere in the encoding, but I don't know where. Any ideas?

You are trying to convert bytes into characters.
However, the character ľ is not represented as a single byte. Its unicode representation is U+013E, and its UTF-8 representation is C4 BE. Thus, it is represented by two bytes. The same is true for the other characters.
Suppose the encoding of your file is UTF-8. Then you read the byte value C4, and then you convert it to a char. This will give you the character U+00C4 (Ä), not U+013E. Then you read the BE, and it is converted to the character U+00BE (¾).
So don't confuse bytes and characters. Instead of using the InputStream directly, you should wrap it in a Reader. A Reader reads characters, decoding the bytes according to the encoding it was created with:
BufferedReader reader = new BufferedReader(
new InputStreamReader(
new FileInputStream(f), StandardCharsets.UTF_8
)
);
Now you'll be able to read characters or even whole lines, and the decoding is handled for you.
int readVal;
while ( ( readVal = reader.read() ) != -1 ) {
curr = (char)readVal;
// ... the rest of your code
}
Remember that you are still reading an int if you are going to use read() without parameters.
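Putting the pieces together, a shortened sketch might look like the following. The class name, the strip method and the five-letter mapping are assumptions made for the illustration (the question's full mapping strings would work the same way), and a ByteArrayInputStream stands in for the FileInputStream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class StripDiacritics {
    // Shortened mapping, written with escapes so the source file encoding does not matter.
    private static final String ACCENTED = "\u013E\u0161\u010D\u0165\u017E"; // l, s, c, t, z with diacritics
    private static final String PLAIN = "lsctz";

    // Decodes the stream as UTF-8 and replaces mapped characters one char at a time.
    static String strip(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            int value;
            while ((value = reader.read()) != -1) { // read() returns an int; -1 means end of stream
                char c = (char) value;
                int index = ACCENTED.indexOf(c);
                sb.append(index >= 0 ? PLAIN.charAt(index) : c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u013E\u0161\u010D\u0165\u017E".getBytes(StandardCharsets.UTF_8);
        System.out.println(strip(new ByteArrayInputStream(utf8)));
    }
}
```

This only works if the file really is UTF-8; a file saved in another encoding needs the matching charset passed to InputStreamReader.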

Creating a new String using a byte array is giving a strange result

I am reading a file using the readFully method of the RandomAccessFile class, but the results are not exactly what I expected.
This is the simple function which reads the file and returns a new String using the byte array where all the bytes are stored:
public String read(int start)
{
setFilePointer(start);//Sets the file pointer
byte[] bytes = new byte[(int) (_file.length() - start)];
try
{
_randomStream.readFully(bytes);
}
catch(IOException e)
{
e.printStackTrace();
}
return new String(bytes);
}
In the main:
public static void main(String[] args)
{
String newline = System.getProperty("line.separator");
String filePath = "C:/users/userZ/Desktop/myFile.txt";
RandomFileManager rfmanager = new RandomFileManager(filePath, FileOpeningMode.READ_WRITE);
String content = rfmanager.read(10);
System.out.println("\n"+content);
rfmanager.closeFile();
}
This function is called in the constructor of the RandomFileManager. It creates the file, if it doesn't exist already.
private void setRandomFile(String filePath, String mode)
{
try
{
_file = new File(filePath);
if(!_file.exists())
{
_file.createNewFile();// Throws IOException
System.out.printf("New file created.");
}
else System.out.printf("A file already exists with that name.");
_randomStream = new RandomAccessFile(_file, mode);
}
catch(IOException e)
{
e.printStackTrace();
}
}
I write to the file using this write method:
public void write(String text)
{
//You can also write
if(_mode == FileOpeningMode.READ_WRITE)
{
try
{
_randomStream.writeChars(text);
}
catch(IOException e)
{
e.printStackTrace();
}
}
else System.out.printf("%s", "Warning!");
}
Output:

I used the writeChars method.
This writes every character as two bytes, high byte first, which is UTF-16BE and is unlikely to be your platform's default encoding. If you decode the bytes with the UTF-16BE character encoding, you will get the original characters back.
If you only need characters between (char) 0 and (char) 255, I suggest using the ISO-8859-1 encoding, as the file will be half the size.

The problem is that you are not specifying a Charset and so the "platform default" is being used. This is almost always a bad idea. Instead, use this constructor: String(byte[], Charset) and be explicit about the encoding the file was written with. Given the output you are showing, it appears to be a two-byte encoding, likely UTF-16BE.
Short answer: bytes are not characters.
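A short sketch of the round trip, with invented class and method names. A DataOutputStream over a byte array stands in for the RandomAccessFile; both implement DataOutput, whose writeChars writes each char as two bytes, high byte first.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class WriteCharsRoundTrip {
    // Produces the same bytes that writeChars(text) would put in the file.
    static byte[] writeChars(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeChars(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Decodes with the charset that matches how the bytes were written.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        byte[] written = writeChars("hello");
        System.out.println(written.length);   // two bytes per char
        System.out.println(decode(written));  // explicit charset restores the text
        System.out.println(new String(written, StandardCharsets.ISO_8859_1).length()); // wrong charset: one char per byte
    }
}
```

The alternative is to change the writing side, for example by writing text.getBytes(charset) with a charset of your choice and decoding with the same one.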

Reading from InflaterInputStream and parsing the result

I am quite new to Java; I just started yesterday. Since I am a big fan of learning by doing, I am making a small project with it, but I am stuck on this part. I have written a file using this function:
public static boolean writeZippedFile(File destFile, byte[] input) {
try {
// create file if doesn't exist part was here
try (OutputStream out = new DeflaterOutputStream(new FileOutputStream(destFile))) {
out.write(input);
}
return true;
} catch (IOException e) {
// error handling was here
}
}
Now that I have successfully written a compressed file using the above method, I want to read it back to the console. First I need to be able to read the decompressed content and write a string representation of that content to the console. However, I have a second problem: I don't want to write the characters up to the first \0 null character. Here is how I attempt to read the compressed file:
try (InputStream is = new InflaterInputStream(new FileInputStream(destFile))) {
}
and I am completely stuck here. The question is how to discard the first few characters up to '\0' and then write the rest of the decompressed file to the console.

I understand that your data contains text, since you want to print a string representation. I further assume that the text contains Unicode characters. If this is true, then your console should also support Unicode for the characters to be displayed correctly.
So you should first read the data byte by byte until you encounter the \0 character, and then you can use a BufferedReader to print the rest of the data as lines of text.
try (InputStream is = new InflaterInputStream(new FileInputStream(destFile))) {
// read the stream a single byte each time until we encounter '\0'
int aByte = 0;
while ((aByte = is.read()) != -1) {
if (aByte == '\0') {
break;
}
}
// from now on we want to print the data
BufferedReader b = new BufferedReader(new InputStreamReader(is, "UTF8"));
String line = null;
while ((line = b.readLine()) != null) {
System.out.println(line);
}
b.close();
} catch (IOException e) {
// handle
}

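Skip the first few characters using InputStream#read(). Note that this loop never terminates if the stream ends before a '\0' is found, because read() keeps returning -1 at the end of the stream: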
while (is.read() != '\0');
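A self-contained sketch of the whole round trip, with a skip loop that also stops at the end of the stream. The class and method names are hypothetical, byte arrays stand in for the file, UTF-8 is assumed for the text after the '\0', and readAllBytes() requires Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

class SkipHeader {
    // Compresses the input the same way the question's writeZippedFile does, but in memory.
    static byte[] deflate(byte[] input) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(bytes)) {
            out.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Decompresses, discards everything up to and including the first 0 byte,
    // and returns the rest decoded as UTF-8.
    static String readAfterNul(byte[] compressed) {
        try (InputStream is = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            int b;
            while ((b = is.read()) != -1 && b != 0) {
                // skipping header bytes; the -1 check prevents an endless loop
            }
            return new String(is.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] compressed = deflate("header\0payload".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAfterNul(compressed));
    }
}
```

If the stream contains no '\0' at all, this version returns an empty string instead of hanging.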

