Java: read/write Unicode / UTF-8 filenames (not contents)

I have a few directories/files with Japanese characters in their names. If I try to read a filename (not the contents) containing, for example, a ク, I receive a String containing a �. If I try to create a file/directory whose name contains a ク, a file/directory appears whose name contains a ?.
As an example, I list the files with:
File file = new File(".");
String[] filesAndDirs = file.list();
The filesAndDirs array now contains the directories with the special characters, but each String contains only ����. It seems there is nothing left to decode, because getBytes() shows only "-17 -65 -67" for every such char in the filename, even for different characters.
I use Mac OS X 10.8.2, Java 7u10, and NetBeans.
Any ideas?
Thank you in advance :)

Those bytes are 0xEF 0xBF 0xBD, which is the UTF-8-encoded form of the \ufffd replacement character you're seeing instead of the Japanese characters. It appears that whatever OS function Java is using to list the files is in fact returning those incorrect characters.
Perhaps Files.newDirectoryStream will be more reliable. Try this instead:
try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get("."))) {
    for (Path child : dir) {
        String filename = child.getFileName().toString();
        System.out.println("name=" + filename);
        for (char c : filename.toCharArray()) {
            System.out.printf("%04x ", (int) c);
        }
        System.out.println();
    }
}

It's a bug in the old Java File API (maybe just on a Mac). Anyway, it's all fixed in the new java.nio.
I have several files containing Unicode characters in the filename and content that failed to load using java.io.File and related classes. After converting all my code to use java.nio.Path, EVERYTHING started working. And I replaced org.apache.commons.io.FileUtils (which has the same problem) with java.nio.Files...
...and be sure to read and write the contents of files using an appropriate charset, for example: Files.readAllLines(myPath, StandardCharsets.UTF_8)
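For instance, a minimal sketch of reading and writing with java.nio and an explicit charset (the file name here is hypothetical):
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical file; any Path with non-ASCII characters in its name works the same way
Path myPath = Paths.get("ノート.txt");

// Read all lines, decoding explicitly as UTF-8 instead of the platform default
List<String> lines = Files.readAllLines(myPath, StandardCharsets.UTF_8);

// Write them back, again with an explicit charset
Files.write(myPath, lines, StandardCharsets.UTF_8);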

Related

Unzip files that contain Chinese characters

I have a zip file. It contains some files whose names contain Chinese characters, so I used:
ZipInputStream zipStream = new ZipInputStream(
        new BufferedInputStream(new FileInputStream(zipFilePath), BUFFER_SIZE),
        Charset.forName("ISO-8859-1")
);
......
FileOutputStream fileOutput = new FileOutputStream(uncompressedFileName);
while (zipStream.available() > 0) {
    fileOutput.write(zipStream.read());
}
Extraction runs successfully. After that I want to use an encodingDetect method to find the encoding, but now the service does not work; it returns nomatch. If I send the files directly to the service, it works: it finds the charset properly, like UTF-8.
I guess that Charset.forName("ISO-8859-1") extracts the files but the format is corrupted. Do you have any idea?
The problem is the charset of the file names in the zip. UTF-8 raises an error (the file names are evidently not in UTF-8), as UTF-8 requires a special format for multi-byte sequences, and evidently there are invalid "multibyte" sequences.
ISO-8859-1 is a single-byte encoding, so it accepts any byte sequence, including garbage.
What you should do is try the small number of Chinese charsets (e.g. GBK, GB18030, Big5), so the file name Strings are filled correctly. A Java String contains Unicode, so it can hold text from any charset. Help from someone who speaks Chinese would probably make sense.
And then try writing files with those names. If that is not successful on your PC, you must use artificial file names, maybe a transliteration from the Chinese.
A translation table from the original Chinese file name to the actual file name may be created as a UTF-8 text file, maybe with a BOM (\uFEFF) at the beginning of the file.
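For example, a minimal sketch that opens the zip with GBK (one common Chinese charset; GB18030 and Big5 are the other usual candidates) and extracts entry by entry; the archive name here is hypothetical:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.nio.charset.Charset;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Try GBK for the entry names; if they come out wrong, try GB18030 or Big5
try (ZipInputStream zip = new ZipInputStream(
        new BufferedInputStream(new FileInputStream("archive.zip")),
        Charset.forName("GBK"))) {
    byte[] buffer = new byte[8192];
    ZipEntry entry;
    while ((entry = zip.getNextEntry()) != null) {
        System.out.println("entry name: " + entry.getName());
        if (entry.isDirectory()) {
            continue;
        }
        // Copy the entry's bytes unchanged; only the names are charset-sensitive
        try (FileOutputStream out = new FileOutputStream(entry.getName())) {
            int n;
            while ((n = zip.read(buffer)) > 0) {
                out.write(buffer, 0, n);
            }
        }
    }
}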
The ISO-8859-1 charset most definitely does not support the Chinese language. Use UTF-8 instead of ISO-8859-1.

Reading path from ini file, backslash escape character disappearing?

I'm reading in an absolute pathname from an ini file and storing the pathname as a String value in my program. However, when I do this, the value that gets stored somehow loses the backslashes, so the path just comes out as one big jumbled mess. For example, the ini file would have a key/value pair of:
key=C:\folder\folder2\filename.extension
and the value that gets stored comes out as C:folderfolder2filename.extension.
Would anyone know how to escape the values before they get read in?
Let's also assume that changing the values in the ini file is not an alternative, because it's not a file that I create.
Try setting the escape property to false in Ini4j.
http://ini4j.sourceforge.net/apidocs/org/ini4j/Config.html#setEscape%28boolean%29
You can try:
Config.getGlobal().setEscape(false);
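Untested sketch: assuming you are working with an org.ini4j.Ini instance directly (the file, section, and key names are hypothetical), wiring in a non-global Config would look roughly like this:
import java.io.File;
import org.ini4j.Config;
import org.ini4j.Ini;

// Turn off backslash-escape interpretation before loading
Config config = new Config();
config.setEscape(false);

Ini ini = new Ini();
ini.setConfig(config);
ini.load(new File("settings.ini")); // hypothetical file name

// The backslashes in the value should now survive intact
String path = ini.get("section", "key"); // hypothetical section/key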
If you read the file and then translate each \ to a / before processing, that would work. The library you are using has a method Ini#load(InputStream) that takes the INI file contents; call it like this:
byte[] data = Files.readAllBytes(Paths.get("directory", "file.ini"));
String contents = new String(data).replaceAll("\\\\", "/");
InputStream stream = new ByteArrayInputStream(contents.getBytes());
ini.load(stream);
The processor must be doing the interpretation of the backslashes, so this will give it data with forward slashes instead. Or, you could escape the backslashes before processing, like this:
String contents = new String(data).replaceAll("\\\\", "\\\\\\\\");

How to preserve correct offset of string which is read from a file

I have a text.txt file which contains the following text:
Kontagent Announces Partnership with Global Latino Social Network Quepasa
Released By Kontagent
I read this text file into a String documentText.
documentText.substring(0, 9) gives Kontagent, which is good.
But documentText.substring(87, 96) gives y Kontage on Windows (IntelliJ IDEA) and gives Kontagent in a Unix environment. I am guessing it is happening because of the blank line in the file (after which the offset gets shifted). But I cannot understand why I get two different results. I need to get the same result in both environments.
To read the file as a String I used all of the functions discussed in How do I create a Java string from the contents of a file?, but I still get the same results after using any of them.
Currently I am using this function to read the file into documentText String:
public static String readFileAsString(String fileName)
{
    File file = new File(fileName);
    StringBuilder fileContents = new StringBuilder((int) file.length());
    Scanner scanner = null;
    try {
        scanner = new Scanner(file);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    String lineSeparator = System.getProperty("line.separator");
    try {
        while (scanner.hasNextLine()) {
            fileContents.append(scanner.nextLine() + lineSeparator);
        }
        return fileContents.toString();
    } finally {
        scanner.close();
    }
}
EDIT: Is there a way to write a general function which will work in both Windows and Unix environments, even if the file is copied in text mode?
Because, unfortunately, I cannot guarantee that everyone who is working on this project will always copy files in binary mode.
The Unix file probably uses the native Unix EOL char: \n, whereas the Windows file uses the native Windows EOL sequence: \r\n. Since you have two EOLs in your file, there is a difference of 2 chars. Make sure to use a binary file transfer, and all the bytes will be preserved, and everything will run the same way on both OSes.
EDIT: in fact, you are the one who appends an OS-specific EOL (System.getProperty("line.separator")) at the end of each line. Just read the file as a char array using a Reader, and everything will be fine. Or use Guava's method, which does it for you:
String s = CharStreams.toString(new FileReader(fileName));
On Windows, the newline character \n is preceded by \r, the carriage return character. This is non-existent in Linux. Transferring the file from one operating system to the other will not strip/append such characters, but occasionally text editors will auto-format them for you.
Because your file does not include \r characters (presumably it was transferred straight from Linux), System.getProperty("line.separator") will return \r\n and account for non-existent \r characters. This is why your output is 2 characters behind.
Good luck!
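If you would rather avoid Guava, a minimal sketch of the same idea in plain java.nio (assuming the file is UTF-8 or plain ASCII) is:
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Decode the raw bytes directly: no line splitting, so the file's own
// separators (\n or \r\n) are kept and the offsets stay exactly as stored
String documentText = new String(
        Files.readAllBytes(Paths.get("text.txt")),
        StandardCharsets.UTF_8);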
Based on the input you guys provided, I wrote something like this:
documentText = CharStreams.toString(new FileReader("text.txt"));
documentText = documentText.replaceAll("\\r", "");
to strip off the extra \r characters if a file has them.
Now I am getting the expected result in the Windows environment as well as Unix. Problem solved!
It works fine irrespective of the mode the file was copied in.
:) I wish I could choose both of your answers, but Stack Overflow doesn't allow it.

Display Hindi language in console using Java

StringBuffer contents = new StringBuffer();
BufferedReader input = new BufferedReader(new FileReader("/home/xyz/abc.txt"));
String line = null; // not declared within while loop
while ((line = input.readLine()) != null) {
    contents.append(line);
}
System.out.println(contents.toString());
File abc.txt contains
\u0905\u092d\u0940 \u0938\u092e\u092f \u0939\u0948 \u091c\u0928\u0924\u093e \u091c\u094b \u091a\u093e\u0939\u0924\u0940 \u0939\u0948 \u092
I want to display Hindi text in the console using Java.
If I simply print it like this:
String str = "\u0905\u092d\u0940 \u0938\u092e\u092f \u0939\u0948 \u091c\u0928\u0924\u093e \u091c\u094b \u091a\u093e\u0939\u0924\u0940 \u0939\u0948 \u092";
System.out.println(str);
then it works fine, but when I try to read it from a file it doesn't work.
Help me out.
Use Apache Commons Lang.
import org.apache.commons.lang3.StringEscapeUtils;
// open the file as ASCII, read it into a string, then
String escapedStr; // = "\u0905\u092d\u0940 \u0938\u092e\u092f \u0939\u0948 ..."
// (to include such a string in a Java program you would have to double each \)
String hindiStr = StringEscapeUtils.unescapeJava( escapedStr );
System.out.println(hindiStr);
(Make sure your console is set up to display Hindi (correct fonts, etc) and the console's encoding matches your Java encoding. The Java code above is just the bare bones.)
You should store the contents in the file as UTF-8 encoded Hindi characters. For instance, in your case it would be अभी समय है जनता जो चाहती है. That is, instead of saving Unicode escapes, directly save the raw Hindi characters. You can then simply read them normally.
You just have to make sure that the editor you use saves it using UTF-8 encoding. See Spanish language chars are not displayed properly?
Otherwise, you'll have to make the file a .properties file and read it using java.util.Properties, as it offers Unicode unescaping support inherently.
Also read Reading unicode character in java
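As a minimal sketch of that approach (the path comes from the question; the rest is the original loop with an explicit charset), reading the raw UTF-8 file could look like this:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

StringBuilder contents = new StringBuilder();
// FileReader decodes with the platform default charset; wrap the stream
// in an InputStreamReader so the bytes are decoded as UTF-8 instead
BufferedReader input = new BufferedReader(new InputStreamReader(
        new FileInputStream("/home/xyz/abc.txt"), StandardCharsets.UTF_8));
String line;
while ((line = input.readLine()) != null) {
    contents.append(line);
}
input.close();
System.out.println(contents.toString());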

How to format system path to a file in java?

I want to read a file, say c.txt, in Java on Windows. Can anybody suggest how I can convert a system path to a file, say D:\a\b\c.txt, to D:/a/b/c.txt in Java? I know it will work like this: D:\\a\\b\\c.txt, but I want to use D:/a/b/c.txt. Thanks!
I'm not sure what your problem is, but it is rarely good practice to hard-code / or \. Use Java's File.separator to help you.
You could use the char replace method: http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#replace%28char,%20char%29
Example:
String pathToFile = "D:\\a\\b\\c.txt";
pathToFile = pathToFile.replace('\\', '/'); // with ' and not "
Documentation of replace(char, char):
Returns a new string resulting from replacing all occurrences of
oldChar in this string with newChar.
You can use the File API:
File f = new File("c.txt");
System.out.println(f.getAbsolutePath());
System.out.println(f.getCanonicalPath());
or simply use replace:
String fname = "D:\\a\\b\\c.txt".replace('\\', '/');
System.out.println(fname);
String file = "D:\\a\\b\\c.txt";
file = file.replace('\\', '/');
System.out.println(file);
Output: D:/a/b/c.txt
But if you are trying to make it more platform independent, you should use File.separator (for replacement based on Strings) or File.separatorChar (for replacement based on chars).
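For instance, a minimal sketch (the drive and folder names are just examples):
import java.io.File;

// Let the platform supply the separator instead of hard-coding / or \
String path = "D:" + File.separator + "a" + File.separator
        + "b" + File.separator + "c.txt";
System.out.println(path); // D:\a\b\c.txt on Windows, D:/a/b/c.txt elsewhere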
