I have a text file that I'm trying to print, but it prints boxes between two characters. My code works fine for all the text files except this particular one. I cannot copy-paste this box character, so I can't check whether a given character is that box and skip printing it with an if condition. Please help. Thanks
Without a sample of the text you are trying to print, I think that it might be an issue with encoding. Here is a list of encodings supported by the Java platform. You might then want to do something like this:
Charset charset = Charset.forName("US-ASCII");
String s = ...;
try (BufferedWriter writer = Files.newBufferedWriter(file, charset)) {
    writer.write(s, 0, s.length());
} catch (IOException x) {
    System.err.format("IOException: %s%n", x);
}
(Example taken from here.)
My guess is that your document is UTF-16 encoded. Try to re-encode it to UTF-8 or ASCII.
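If you want to do the re-encoding in Java, a minimal sketch might look like this (the file names are placeholders, and it assumes the source really is UTF-16):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Reencode {
    public static void main(String[] args) throws IOException {
        Path source = Paths.get("input.txt");   // hypothetical input path
        Path target = Paths.get("output.txt");  // hypothetical output path
        // Decode using the suspected source encoding...
        String text = new String(Files.readAllBytes(source), StandardCharsets.UTF_16);
        // ...then write the same text back out as UTF-8.
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }
}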
I have a CSV file, which I saved from Excel as CSV UTF-8 encoded.
My Java code reads the file as a byte array and then does:
String result = new String(b, 0, b.length, "UTF-8");
But somehow the content "Montréal" becomes "Montr?al" when saved to the DB. What might be the problem?
The environment is Unix with:
-bash-4.1$ locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
BTW it works on my Windows machine: when I run my code there, I see the correct "Montréal" in the DB. So my guess is that the Unix environment has some default locale setting that forces the use of a default encoding.
Thanks
I don't have your complete code, but I tried the following code and it works for me:
String x = "c:/Book2.csv";
BufferedReader br = null;
try {
    br = new BufferedReader(new InputStreamReader(new FileInputStream(x), "UTF-8"));
    String b;
    while ((b = br.readLine()) != null) {
        System.out.println(b);
    }
} finally {
    if (br != null) {
        br.close();
    }
}
If you see "Montr?al" printed on your console, don't worry. It does not mean that the program is not working. Your console may simply not support printing UTF-8 characters, so set a breakpoint and inspect the variable to check whether it holds what you want.
If you see the correct value in the debugger but a "?" in your output, you can rest assured that the String variable holds the right value, and you can write it back to any file or DB as needed.
If you see "?" when you query your DB, the tool you are using may not be printing the output correctly. Try reading the DB value in Java code and check it with a breakpoint. I usually use PuTTY to query the DB, since it displays double-byte characters correctly. That's all I have; hope that helps.
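As one way to inspect the value independently of the console, here is a hypothetical sketch that dumps each character's Unicode code point:

// Sketch: print each character's code point so the value can be verified
// even on a console that cannot render accented characters.
String result = "Montréal"; // stand-in for the string decoded from the CSV bytes
for (int i = 0; i < result.length(); i++) {
    System.out.printf("%c = U+%04X%n", result.charAt(i), (int) result.charAt(i));
}
// "é" should appear as U+00E9; seeing U+00C3 U+00A9 instead would mean
// the UTF-8 bytes were decoded with the wrong charset.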
You have to use ISO/IEC 8859, not UTF-8; if you look at the list of character encodings on the Wikipedia page, you'll understand the difference.
Basically, UTF-8 is the common encoding used by Western countries...
Also, you can check your terminal encoding; maybe the problem is there.
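One quick check along those lines is to print the JVM's default charset, which is derived from the environment's locale (a sketch; file.encoding is the standard system property for this):

import java.nio.charset.Charset;

public class ShowEncoding {
    public static void main(String[] args) {
        // With LANG unset and LC_* = "C" as in the question, the default
        // is likely a plain ASCII charset, which would explain the "?".
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset());
    }
}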
My assignment is to create a program that does compression using the Huffman algorithm. My program must be able to compress any type of file, which is why I'm not using a Reader, since those work with characters.
I'm not understanding how to build some kind of frequency table when encoding a binary file.
EDIT!! Problem solved.
public static void main(String args[]) {
    try {
        FileInputStream in = new FileInputStream("./src/hello.jpg");
        int[] frequencies = new int[256]; // one counter per possible byte value
        int currentByte;
        while ((currentByte = in.read()) != -1) {
            // read every byte in the file and count it in the frequency table
            frequencies[currentByte]++;
        }
        in.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
I'm not sure what you mean by "reading from an image and looking at the characters", but for text files (as you're reading one in your code example) this works most of the time by casting the read byte to char:
char charVal = (char) currentByte;
It mostly works because most data is ASCII and most charsets contain ASCII. It gets more complicated with non-ASCII characters, because a simple cast is equivalent to using the charset ISO-8859-1. This will still produce correct results most of the time, because e.g. Windows' cp1252 (on German systems) differs from ISO-8859-1 only at the Euro sign.
Things start to run havoc with charsets like UTF-8, where non-ASCII characters are encoded with multiple bytes, so you will see things like Ã¤ instead of an ä. The same goes for files encoded as UTF-16 ("Unicode"), where every second byte is most likely a binary zero.
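To make that concrete, here is a small self-contained sketch (not from the original thread) showing how the same bytes decode differently depending on the charset:

import java.nio.charset.StandardCharsets;

public class CastVsDecode {
    public static void main(String[] args) {
        // UTF-8 encodes "ä" as the two bytes 0xC3 0xA4.
        byte[] utf8Bytes = "ä".getBytes(StandardCharsets.UTF_8);

        // Casting each byte to char (effectively ISO-8859-1) mangles it:
        StringBuilder cast = new StringBuilder();
        for (byte b : utf8Bytes) {
            cast.append((char) (b & 0xFF)); // mask to avoid sign extension
        }
        System.out.println(cast); // prints "Ã¤"

        // Decoding with the correct charset restores the character:
        System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8)); // prints "ä"
    }
}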
You could use Files.readAllBytes and then iterate over this array.
Path path = Paths.get("hello.txt");
try {
    byte[] array = Files.readAllBytes(path);
    for (byte b : array) {
        // process each byte here, e.g. count it in a frequency table
    }
} catch (IOException e) {
    e.printStackTrace();
}
I'm having a strange issue trying to write strings that contain characters like "ñ", "á", and so on into text files. Let me first show you my little piece of code:
import java.io.*;

public class test {
    public static void main(String[] args) throws Exception {
        String content = "whatever";
        int c;
        c = System.in.read();
        content = content + (char) c;
        FileWriter fw = new FileWriter("filename.txt");
        BufferedWriter bw = new BufferedWriter(fw);
        bw.write(content);
        bw.close();
    }
}
In this example, I'm just reading a char from the keyboard input and appending it to a given string, then writing the final string into a txt file. The problem is that if I type an "ñ", for example (I have a Spanish-layout keyboard), when I check the txt file it shows a strange char "¤" where there should be an "ñ"; that is, the content of the file is "whatever¤". The same happens with "ç", "ú", etc. However, it writes fine ("whateverñ") if I just forget about the keyboard input and write:
...
String content = "whateverñ";
...
or
...
content = content + "ñ";
...
It makes me think that there might be something wrong with the read() method? Or maybe I'm using it wrongly? Or should I use a different method to get the keyboard input? I'm a bit lost here.
(I'm using JDK 7u45 on Windows 7 Pro x64)
So ...
It works (i.e. you can read the accented characters on the output file) if you write them as literal strings.
It doesn't work when you read them from System.in and then write them.
This suggests that the problem is on the input side. Specifically, I think your console / keyboard must be using a character encoding for the input stream that does not match the encoding that Java thinks should be used.
You should be able to confirm this tentative diagnosis by outputting the characters you are reading in hexadecimal, and then checking the codes against the Unicode tables (which you can find at unicode.org, for example).
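A throwaway diagnostic along these lines (a sketch, not a fix) would show exactly which byte values arrive from the keyboard:

// Sketch: dump the raw byte values read from System.in in hex.
// On a Spanish Windows console the OEM code page (likely cp850) encodes
// "ñ" as 0xA4, which cp1252 renders as "¤", matching the symptom above.
int c;
while ((c = System.in.read()) != -1) {
    System.out.printf("0x%02X ", c);
}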
It strikes me as "odd" that the "platform default encoding" appears to be working on the output side, but not the input side. Maybe someone else can explain ... and offer a concrete suggestion for fixing it. My gut feeling is that the problem is in the way your keyboard is configured, not in Java or your application.
Files do not remember their encoding format; when you look at a .txt file, the text editor makes a "best guess" at the encoding used.
If you try to read the file back into your program, the text should be back to normal.
Also, try printing the "strange" character directly.
I'm writing a simple program that writes data to a selected file.
Everything is going great except the line breaks: the string is written to the file, but without line breaks.
I've tried \n and \n\r but nothing changed.
The program:
public void prepare() {
    String content = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n\r<data>\n\r<user><username>root</username><password>root</password></user>\n\r</data>";
    FileOutputStream fos = null;
    try {
        fos = new FileOutputStream(file);
    } catch (FileNotFoundException ex) {
        System.out.println("File Not Found .. prepare()");
    }
    byte b[] = content.getBytes();
    try {
        fos.write(b);
        fos.close();
    } catch (IOException ex) {
        System.out.println("IOException .. prepare()");
    }
}

public static void main(String args[]) {
    File f = new File("D:\\test.xml");
    Database data = new Database(f);
    data.prepare();
}
Line endings for Windows follow the form \r\n, not \n\r. However, you may want to use platform-dependent line endings. To determine the standard line endings for the current platform, you can use:
System.lineSeparator()
...if you are running Java 7 or later. On earlier versions, use:
System.getProperty("line.separator")
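Applied to the code above, a sketch of the fix (assuming Java 7 or later) would build the content like this:

// Sketch: build the same XML with platform-appropriate line breaks.
String sep = System.lineSeparator(); // "\r\n" on Windows, "\n" on Unix
String content = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>" + sep
        + "<data>" + sep
        + "<user><username>root</username><password>root</password></user>" + sep
        + "</data>";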
My guess is that you're using Windows. Write \r\n instead of \n\r - as \r\n is the linebreak on Windows.
I'm sure you'll find that the characters you're writing into the file are there - but you need to understand that different platforms use different default line breaks... and different clients will handle things differently. (Notepad on Windows only understands \r\n, other text editors may be smarter.)
The correct linebreak sequence on windows is \r\n not \n\r.
Also your viewer may interpret them differently. For example, notepad will only display CRLF linebreaks, but Write or Word have no problem displaying CR or LF alone.
You should use System.lineSeparator() if you want to find the linebreak sequence for the current platform. You are correct in writing them explicitly if you are attempting to force a linebreak format regardless of the current platform.
I'm outputting a byte array to a text file using the following method:
try {
    FileOutputStream fos = new FileOutputStream(filePath + ".8102");
    fos.write(concatenatedIVCipherMAC);
    fos.close();
} catch (Exception e) {
    e.printStackTrace();
}
which outputs UTF-16 encoded data to the file, for example:
¢¬6î)ªÈP~m˜LïiƟê•Àe»/#Ó ö¹¥‘þ²XhÃ&¼lG:Öé )GU3«´DÃ{+í—Ã]íò
However, when I'm reading it back in, I get þÿ prepended to the front of the data, e.g.:
þÿ¢¬6î)ªÈP~m˜LïiƟê•Àe»/?#Ó ö¹¥‘þ²XhÃ&¼lG:Öé )GU3«´DÃ{+í—Ã]íò
This is the method I'm using to read in the file:
private String getFilesContents() {
    String fileContents = "";
    Scanner sc = null;
    try {
        sc = new Scanner(file, "UTF-16");
        System.out.println("Can read file: " + file.canRead());
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    while (sc.hasNextLine()) {
        fileContents += sc.nextLine();
    }
    sc.close();
    return fileContents;
}
and then byte[] contentsOfFile = fileContents.getBytes("UTF-16"); to convert the String into a byte array.
A quick Google told me that þÿ represents the byte order, but is it Java putting it there or Windows? How can I avoid having þÿ prepended to the start of the data I'm reading in? I was thinking of just ignoring the first two bytes, but if it is Windows doing it, that will obviously break the program on other platforms.
edit: changed appended to prepended.
The file is the IV+data+MAC. It's not meant to be readable text. Should I be doing something differently?
Yes. You shouldn't be trying to treat it as text anywhere.
If you really need to convert arbitrary binary data into text, use Base64 to convert it. Other than that, stick to byte arrays, InputStream and OutputStream.
I don't know exactly why you're supposedly getting extra characters, but the fact that you haven't got real text to start with suggests that it's not really worth diagnosing that side. Just start handling binary data as binary data instead.
EDIT: Have a look at Guava's IO helpers for simplicity...
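If Base64 text is acceptable, a minimal round-trip sketch with java.util.Base64 (Java 8+; the payload and file name here are placeholders) could look like this:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Base64;

public class Base64RoundTrip {
    public static void main(String[] args) throws Exception {
        byte[] concatenatedIVCipherMAC = {1, 2, 3}; // placeholder binary payload
        Path path = Paths.get("out.8102");          // placeholder file name

        // Encode the binary data as ASCII text and write it out.
        String encoded = Base64.getEncoder().encodeToString(concatenatedIVCipherMAC);
        Files.write(path, encoded.getBytes(StandardCharsets.US_ASCII));

        // Read it back and decode: no charset guessing, no BOM.
        String readBack = new String(Files.readAllBytes(path), StandardCharsets.US_ASCII);
        byte[] original = Base64.getDecoder().decode(readBack);
    }
}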
þÿ is the Unicode byte order mark (BOM), saved as UTF-16BE and then interpreted as ISO-8859-1 (the bytes 0xFE 0xFF are the characters þ and ÿ in that charset).
You shouldn't treat binary data as text (in whatever encoding) if you want to avoid such errors.
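In practice that means reading the file back with byte-oriented APIs instead of a Scanner; a minimal sketch:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReadBinary {
    public static void main(String[] args) throws IOException {
        // Read the raw bytes back exactly as written: no decoding step,
        // so no spurious BOM characters and no corrupted bytes.
        Path path = Paths.get("out.8102"); // placeholder file name
        byte[] contentsOfFile = Files.readAllBytes(path);
        System.out.println(contentsOfFile.length + " bytes read");
    }
}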