I want to display characters (Chinese or another language) from a property file on a Windows box.
Let's say I read a property server.location=上海的位置 from a System property, which is set when the server is started.
I tried to do this
new String(locationStr.getBytes(System.getProperty("file.encoding")), "UTF-8");
This works on Linux, but I couldn't get it working on Windows.
The following is a summarized snippet, omitting the code that sets the System property:
URL fileURL = new URL("file:filePathAndName");
InputStream iStream = fileURL.openStream () ;
Properties prop = new Properties();
prop.load(iStream);
//Enumerate over prop and set System.setProperty (key, value);
The property is then read as System.getProperty("server.location").
This is done centrally for all property files, so modifying how they are read or setting a specific encoding could affect the others, which makes it inadvisable.
I also tried encoding with URLEncoder.encode, but it didn't help.
I do not see any specific encoding being set. Java uses UTF-16 internally; on Windows the default encoding is 'Cp1252'. What am I missing here?
Any help to make this work, or to throw some light on it, is appreciated. I also went through existing questions, but the answers didn't apply directly, hence this new question.
Thanks
Edit:
I couldn't convert the obtained String to UTF-8. In the end I convinced people to read the properties the way Joop mentioned, and the String is now retrieved properly.
String/char/Reader/Writer in Java contain Unicode text. Binary data (byte[], InputStream/OutputStream) must be associated with an encoding to be convertible to text, a String.
It seems your properties file is in UTF-8. In that case, specify a fixed encoding when loading the properties:
InputStream iStream = fileURL.openStream();
Reader reader = new BufferedReader(new InputStreamReader(iStream, StandardCharsets.UTF_8));
Properties prop = new Properties();
prop.load(reader);
Here the InputStreamReader bridges the transition from binary data to (Unicode) text by a conversion specifying the encoding of the InputStream.
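For the original scenario (loading the properties centrally and then copying them into the System properties), a minimal sketch along the same lines, assuming all the property files are UTF-8:
URL fileURL = new URL("file:filePathAndName");
Properties prop = new Properties();
try (Reader reader = new BufferedReader(
        new InputStreamReader(fileURL.openStream(), StandardCharsets.UTF_8))) {
    prop.load(reader);
}
// Copy every loaded key/value into the System properties
for (String key : prop.stringPropertyNames()) {
    System.setProperty(key, prop.getProperty(key));
}
// Later, anywhere in the application:
String location = System.getProperty("server.location"); // 上海的位置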
Properties prop = new Properties();
InputStream input = null;
String filename = "config.properties";
input = ClassName.class.getClassLoader().getResourceAsStream(filename);
//loading properties
prop.load(input);
//getting the properties
System.out.println(prop.getProperty("propertyname1"));
System.out.println(prop.getProperty("propertyName2"));
System.out.println(prop.getProperty("propertyName3"));
Or you can enumerate over the properties:
Enumeration<?> e = prop.propertyNames();
while (e.hasMoreElements()) {
String key = (String) e.nextElement();
System.out.println(key + " -- " + prop.getProperty(key));
}
This is how you should actually get properties from a property file, and you don't have to worry about the UTF-8 characters.
In the property file it looks like this:
value1=CÍMSOR (NEM KÖTELEZŐ)
value2=NEM KÖTELEZŐ/NEM ALKALMAZHATÓ
I have loaded this into a Java Properties object, which holds exactly what is shown above; the problem comes when storing the values to a String. In both values, 'Ő' comes out as '?'.
I have to store them in Strings and save all these values in a CSV (comma-separated) file.
First, I read the property file:
FileInputStream input = new FileInputStream(new File(fileName));
Properties prop = new Properties();
prop.load(new InputStreamReader(input, Charset.forName("UTF-8")));
This part works properly.
Second, I read the values:
String value1 = prop.getProperty("value1");
value1 comes back as "4. CÍMSOR (NEM KÖTELEZ?)"; instead of 'Ő' it contains '?'.
The third part is to write these values into the CSV file:
OutputStream stream = new FileOutputStream(outputCSVLocation);
CSVWriter csvWriter = new CSVWriter(new OutputStreamWriter(stream, "UTF-8"));
csvWriter.writeNext(new String[] { value1 });
I found the answer.
For conversion to a String, use this:
StringUtils.toEncodedString("".getBytes(Charset.forName("UTF-8")), Charset.forName("UTF-8"));
For writing these values into the CSV, use the below:
OutputStream fileWriter = new FileOutputStream(outputCSVLocation);
// Write the UTF-8 BOM so readers recognize the file's encoding
byte[] enc = new byte[] { (byte)0xEF, (byte)0xBB, (byte)0xBF };
fileWriter.write(enc);
CSVWriter csvWriter = new CSVWriter(new OutputStreamWriter(fileWriter, Charset.forName("UTF-8")));
Now when I open the file, the text appears in the language that was in the property file.
But then I hit another issue: when I read the CSV file back, the first row's first column value alone comes out like ?"header value". This happens because of the 3 encoding bytes (the BOM) I added.
Any idea how to fix this?
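(A sketch of one possible fix, assuming the file starts with the UTF-8 BOM written above: peek at the first three bytes and skip them only if they are the BOM, before handing the stream to the CSV reader.)
PushbackInputStream in = new PushbackInputStream(new FileInputStream(outputCSVLocation), 3);
byte[] bom = new byte[3];
int n = in.read(bom);
// Push the bytes back unless they are exactly the UTF-8 BOM
if (n > 0 && !(n == 3 && bom[0] == (byte) 0xEF && bom[1] == (byte) 0xBB && bom[2] == (byte) 0xBF)) {
    in.unread(bom, 0, n);
}
Reader csvReader = new InputStreamReader(in, Charset.forName("UTF-8"));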
Are you sure your input Properties file is really ISO-8859-1? If so then you should be loading the properties with:
prop.load(new InputStreamReader(input, StandardCharsets.ISO_8859_1));
Or you can call the default Properties load(InputStream), which assumes the input stream is in the ISO-8859-1 character set and allows for other Unicode characters as escape sequences of the form \uNNNN.
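For example, a properties file meant for the default load(InputStream) would carry the non-Latin-1 characters as escapes (a sketch; \u0150 is 'Ő'):
value1=C\u00CDMSOR (NEM K\u00D6TELEZ\u0150)
value2=NEM K\u00D6TELEZ\u0150/NEM ALKALMAZHAT\u00D3
Such a file loads correctly with the plain prop.load(new FileInputStream(fileName)); no Reader is needed. (This is also the format the JDK's native2ascii tool produces.)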
Let's say I have a property file test.properties.
Some key/value pairs are already defined in it, e.g.:
key1=value1
key2=value2
key3=value3
I change the value of some of these properties in memory (let's say only one key's value). I would like to store the changes into the property file, but to write really only the changed key/value, not rewrite the whole file.
Is that possible?
Is there some library implementation with which I could achieve something like that?
String fileName = "C:\\test\\test.txt";
File f = new File(fileName);
InputStream is = new FileInputStream(f);
Properties p = new Properties();
p.load(is);
p.setProperty("key3","value4");
OutputStream os = new FileOutputStream(f);
p.store(os,"comments");
But I think this will overwrite the entire properties file.
Look at java.util.prefs.Preferences
EDIT:
This is a Java utility class that does what you seem to want -- store key/value pairs (only strings as keys) without having to (re)write an entire file of them to change one value. Java has implemented them with system-dependent backing so they're portable.
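A minimal sketch of the Preferences approach (the node name and keys here are just placeholders):
import java.util.prefs.Preferences;

public class PrefsDemo {
    public static void main(String[] args) throws Exception {
        // One node per logical group of settings; the backing store is
        // system-dependent (e.g. the registry on Windows)
        Preferences prefs = Preferences.userRoot().node("test");
        prefs.put("key1", "value1");
        prefs.put("key2", "value2");
        // Updating one key does not require rewriting the others
        prefs.put("key3", "value4");
        prefs.flush(); // persist pending changes
        System.out.println(prefs.get("key3", "<missing>"));
    }
}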
I am building an app where users have to guess a secret word, and I have *.txt files in the assets folder. The problem is that the words are in the Albanian language. Our language uses letters like "ë" and "ç", so whenever I try to read a word containing any of those characters from a file, I get some wicked symbol, and I cannot get string.compare() to work for these characters. I have tried many options with UTF-8 and changed Eclipse settings, but still the same error.
I would really appreciate it if someone has any advice.
The code I use to read the files is:
AssetManager am = getAssets();
strOpenFile = "fjalet.txt";
InputStream fins = am.open(strOpenFile);
reader = new BufferedReader(new InputStreamReader(fins));
ArrayList<String> stringList = new ArrayList<String>();
String aDataRow;
String aBuffer = "";
while ((aDataRow = reader.readLine()) != null) {
aBuffer += aDataRow + "\n";
stringList.add(aDataRow);
}
Otherwise the code works fine, except for the characters mentioned.
It seems pretty clear that the default encoding that is in force when you create the InputStreamReader does not match the file.
If the file you are trying to read is UTF-8, then this should work:
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
If the file is not UTF-8, then that won't work. Instead you should use the name of the file's true encoding. (My guess is that it is in ISO/IEC 8859-1 or ISO/IEC 8859-16.)
Once you have figured out what the file's encoding really is, you need to try to understand why it does not correspond to your Java platform's default encoding ... and then make a pragmatic decision on what to do about it. (Should you hard-wire the encoding into your application ... as above? Should you make it a configuration property or command parameter? Should you change the default encoding? Should you change the file?)
You need to determine the character encoding that was used when creating the file, and specify this encoding when reading it. If it's UTF-8, for example, use
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
or
reader = new BufferedReader(new InputStreamReader(fins, StandardCharsets.UTF_8));
if you're on Java 7 or later.
Text editors like Notepad++ have good heuristics to guess what the encoding of a file is. Try opening it with such an editor and see which encoding it has guessed (if the characters appear correctly).
You should know the encoding of the file.
The InputStream class reads the file as binary. Although you can interpret the input as characters, that would be implicit guessing, which may be wrong.
The InputStreamReader class converts binary to chars, but it needs to know the character set.
You should use the constructor overload that lets you pass the character set.
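Presumably the intended snippet is the same overload shown in the other answers, assuming a UTF-8 file:
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));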
UPDATE
Don't just assume you have a UTF-8 encoded file; that may be wrong. Here in Russia we have encodings such as CP866, WIN1251 and KOI8, which all differ from UTF-8. You probably have some popular Albanian encoding for your text files. Check your OS settings for a hint.
I am trying to write a Java app that will run on a Linux server but will process files generated on legacy Windows machines using cp-1252 as the character set. Is there any way to encode these files as UTF-8 instead of the cp-1252 they are generated in?
If the file names as well as the content are a problem, the easiest way to solve it is to set the locale on the Linux machine to something based on ISO-8859-1 rather than UTF-8. You can use locale -a to list the available locales. For example, if you have en_US.iso88591 you could use:
export LANG=en_US.iso88591
This way Java will use ISO-8859-1 for file names, which is probably good enough. To run the Java program you still have to set the file.encoding system property:
java -Dfile.encoding=cp1252 -cp foo.jar:bar.jar blablabla
If no ISO-8859-1 locale is available you can generate one with localedef. Installing it requires root access though. In fact, you could generate a locale that uses CP-1252, if it is available on your system. For example:
sudo localedef -f CP1252 -i en_US en_US.cp1252
export LANG=en_US.cp1252
This way Java should use CP1252 by default for all I/O, including file names.
Expanded further here: http://jonisalonen.com/2012/java-and-file-names-with-invalid-characters/
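To verify which default the JVM actually picked up, a quick sanity-check sketch:
// Prints the default charset the JVM derived from the locale / file.encoding
System.out.println(System.getProperty("file.encoding"));
System.out.println(java.nio.charset.Charset.defaultCharset());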
You can read and write text data in any encoding that you wish. Here's a quick code example:
public static void main(String[] args) throws Exception
{
// List all supported encodings
for (String cs : Charset.availableCharsets().keySet())
System.out.println(cs);
File file = new File("SomeWindowsFile.txt");
StringBuilder builder = new StringBuilder();
// Construct a reader for a specific encoding
Reader reader = new InputStreamReader(new FileInputStream(file), "windows-1252");
int c;
while ((c = reader.read()) != -1)
{
builder.append((char) c); // append as a char, not as an int code
}
reader.close();
String string = builder.toString();
// Construct a writer for a specific encoding
Writer writer = new OutputStreamWriter(new FileOutputStream(file), "UTF8");
writer.write(string);
writer.flush();
writer.close();
}
If this still 'chokes' on read, see if you can verify that the original encoding is what you think it is. In this case I've specified windows-1252, which is the Java charset name for cp-1252.
I'm reading a file line by line, like this:
FileReader myFile = new FileReader(file);
BufferedReader InputFile = new BufferedReader(myFile);
// Read the first line
String currentRecord = InputFile.readLine();
while(currentRecord != null) {
currentRecord = InputFile.readLine();
}
But if other types of files are uploaded, it will still read their contents. For instance, if the uploaded file is an image, it will output junk characters when read. So my question is: how can I check for sure that the file is a CSV before reading it?
Checking the file's extension is kind of lame, since someone can upload a file that is not a CSV but has a .csv extension. Thanks in advance.
Determining the MIME type of a file is not something easy to do, especially if ASCII sections can be mixed with binary ones.
Actually, when you look at how a Java mail system determines the MIME type of an email, it involves reading all the bytes in it and applying some "rules".
Check out MimeUtility.java:
If the primary type of this datasource is "text" and if all the bytes in its input stream are US-ASCII, then the encoding is "7bit".
If more than half of the bytes are non-US-ASCII, then the encoding is "base64".
If less than half of the bytes are non-US-ASCII, then the encoding is "quoted-printable".
If the primary type of this datasource is not "text", then if all the bytes of its input stream are US-ASCII, the encoding is "7bit".
If there is even one non-US-ASCII character, the encoding is "base64".
@return "7bit", "quoted-printable" or "base64"
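As a rough sketch of what those rules amount to (not the actual MimeUtility code, just the logic described above):
// Decide a transfer encoding from the byte content, per the rules above
static String chooseEncoding(byte[] bytes, boolean isText) {
    int nonAscii = 0;
    for (byte b : bytes) {
        if (b < 0) nonAscii++; // bytes >= 0x80 are non-US-ASCII
    }
    if (nonAscii == 0) return "7bit";
    if (!isText) return "base64"; // any non-ASCII in non-text data
    return nonAscii > bytes.length / 2 ? "base64" : "quoted-printable";
}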
As mentioned by mmyers in a deleted comment, JavaMimeType is supposed to do the same thing, but:
it has been dead since 2006
it does involve reading the whole content!
For example, using the jMimeMagic library:
File file = new File("/home/bibi/monfichieratester");
InputStream inputStream = new FileInputStream(file);
ByteArrayOutputStream byteArrayStream = new ByteArrayOutputStream();
int readByte;
while ((readByte = inputStream.read()) != -1) {
byteArrayStream.write(readByte);
}
String mimetype = "";
byte[] bytes = byteArrayStream.toByteArray();
MagicMatch m = Magic.getMagicMatch(bytes);
mimetype = m.getMimeType();
So... since you are reading the whole content of the file anyway, you could take advantage of that to determine the type based on that content and your own rules.
Java Mime Magic may be of use. It'll analyse mime-types from files and input streams. I can't vouch for its functionality, however.
This link may provide further info. It provides several different means of determining how to do what you want (or at least something similar).
I would perhaps be tempted to write something specific to your problem domain, e.g. determining the number of comma-separated values per line and rejecting the file if it's not within certain limits. Then split on the commas and parse each entry according to your requirements (e.g. are they doubles/floats/valid Strings, and if Strings, what encoding?). I think you may have to do this anyway, given that someone may upload a file that starts like a CSV but is corrupted half-way through.
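A minimal sketch of that kind of domain-specific check (the sample size and minimum column count are arbitrary assumptions, and it deliberately ignores quoted commas):
// Heuristic: accept the file only if the first lines have a consistent column count
static boolean looksLikeCsv(File file) throws IOException {
    try (BufferedReader r = new BufferedReader(
            new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
        int expectedColumns = -1;
        for (int i = 0; i < 10; i++) { // sample the first 10 lines
            String line = r.readLine();
            if (line == null) break;
            int columns = line.split(",", -1).length; // naive: ignores quoted commas
            if (expectedColumns == -1) expectedColumns = columns;
            else if (columns != expectedColumns) return false;
        }
        return expectedColumns > 1; // require at least two columns
    }
}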