BufferedReader automatic encoding type - java

I am using BufferedReader to get data from a URL.
URL url = new URL("http://");
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "windows-1251"));
On some URLs the encoding is windows-1251 (Cyrillic), so I specified that in the reader. But on some others the encoding is different, e.g. KOI8-R. Is there any way to get the data from both sources without using another reader? I really can use only one here.

No, the BufferedReader cannot examine the response headers. You have to supply the charset yourself (it is usually given in the charset parameter of the Content-Type response header), or use a library for encoding recognition/detection.
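Since the question rules out a second reader, one hedged option is to pull the charset out of the Content-Type response header at runtime. The helper below is a minimal sketch; the class name, method name, and the windows-1251 fallback are my own choices, not from the original answer:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*([^ ;]+)", Pattern.CASE_INSENSITIVE);

    // Extracts the charset parameter from a Content-Type value such as
    // "text/xml; charset=KOI8-R". Falls back to the supplied default when
    // the header is missing or carries no charset parameter.
    public static String charsetOf(String contentType, String fallback) {
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                return m.group(1);
            }
        }
        return fallback;
    }
}
```

With this, a single reader can serve both kinds of pages, e.g. new InputStreamReader(con.getInputStream(), CharsetSniffer.charsetOf(con.getContentType(), "windows-1251")).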

Related

Default character encoding in java for inputStream of HTTPUrlConnection

I am using Java's InputStream from HttpURLConnection to get the body of a URL and write it to a file.
Things work fine on my laptop (Ubuntu/CentOS desktop version), but on the server (CentOS 6.5 server edition), special characters in the body get garbled into question marks.
I compared Java's Charset.defaultCharset() and System.getProperty("file.encoding"); both are the same on the laptop and the server.
Can anyone please help me find out what differs between the laptop and the server OS with regard to character encoding?
StringBuilder response = new StringBuilder();
URL obj = new URL("http://www.Some URL That Has spl Char (eg. EN Dash)");
HttpURLConnection con = (HttpURLConnection) obj.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    response.append(inputLine);
}
In the headers the encoding is often given (connection.getContentEncoding(), for instance; it could be null). This is useful for text, to convert an InputStream to a Reader (InputStreamReader) and such.
If you are using InputStream/OutputStream, you are working with binary data as-is, hence no corruption will occur. But you'll lose the header info that might have said something about the encoding. You might want to store any data with a given encoding as UTF-8 for consistency. However, in HTML the encoding may also be given in the content.
On the given code
The input is decoded with the default charset, which varies by platform and even by user settings.
Better to use an explicit encoding.
// Nice if the connection has in its headers an encoding
// or in Content-Type charset=...
String encoding = con.getContentEncoding();
if (encoding == null) {
    // Otherwise ISO-8859-1 is the HTTP standard, and
    // browsers extend ISO-8859-1 to Windows-1252.
    encoding = "Windows-1252";
}
Charset charset = Charset.forName(encoding);
BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream(), charset));
Of course, the String from the StringBuilder must then also be written to its destination with the right encoding.
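For the writing side, a minimal sketch of what that could look like (the file name response.txt and the sample content are hypothetical):

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriteWithCharset {
    public static void main(String[] args) throws IOException {
        StringBuilder response = new StringBuilder("ü ä ö and an EN dash: –");
        // OutputStreamWriter with an explicit charset is the writing-side
        // counterpart of InputStreamReader with a charset; a plain FileWriter
        // would silently use the platform default encoding.
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream("response.txt"), StandardCharsets.UTF_8)) {
            out.write(response.toString());
        }
    }
}
```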

How can I change the text encoding of my Java program?

I have a Java program, which I develop with NetBeans.
I changed the settings in NetBeans so that it will understand UTF-8.
But if I clean and build my program and use it on my Windows system, the text encoding changes and letters like "ü", "ä", and "ö" aren't displayed and used properly anymore.
How can I communicate with my OS and tell it to use UTF-8?
Or is there any good workaround?
EDIT: Sorry for being so unspecific.
Well, first of all: I use docx4j and Apache POI with their getText() methods to get some text from .doc, .docx, and .pdf files and save it in a String.
Then I'm trying to match keywords within those texts, which I read out of a .txt file.
Those keywords are displayed in a combo box in the runnable Java file.
I can see the encoding problems there: it won't match any of the keywords containing the characters described above.
In my IDE it's working fine.
I'm trying to post some code here, after I redesign it.
The .txt file is in UTF-8. If I convert it to ANSI I see the same problems as in the jar.
reading out of it:
if (inputfile.exists() && inputfile.canRead()) {
    try {
        FileReader reader = new FileReader(inputfilepath);
        BufferedReader in = new BufferedReader(reader);
        String zeile = null;
        while ((zeile = in.readLine()) != null) {
            while (zeile.startsWith("#")) {
                if (zeile.startsWith(KUERZELTITEL)) {
                    int cut = zeile.indexOf('=');
                    zeile = zeile.substring(cut, zeile.length());
                    eingeleseneTagzeilen.put(KUERZELTITEL, zeile.substring(1));
                    kuerzel = zeile.substring(1);
                }
                ...
this did it for me:
File readfile = new File(inputfilepath);
BufferedReader in = new BufferedReader(
new InputStreamReader(
new FileInputStream(readfile), "UTF8"));
Thx!
Congratulations; I also use UTF-8 for my projects, which seems best.
Simply make sure that the editor and the compiler use the same encoding. This ensures that string literals in Java are correctly encoded in the .class files inside the jar.
In NetBeans 7.3 there is now one setting for this (I am using Maven builds).
Properties files are historically in ISO-8859-1, or encoded with \uXXXX escapes, so there you have to take care.
Internally Java uses Unicode, so there should be no other problems.
FileReader reader = new FileReader(inputfilepath);
should be
BufferedReader reader = new BufferedReader(new InputStreamReader(
        new FileInputStream(inputfilepath), "UTF-8"));
The same procedure (an explicit extra encoding parameter) applies to FileWriter (OutputStreamWriter + encoding), String.getBytes(encoding), and new String(bytes, encoding).
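A small sketch of what the explicit parameter buys you; the sample string is mine, and the byte count shown holds for UTF-8 specifically:

```java
import java.nio.charset.StandardCharsets;

public class ExplicitEncoding {
    public static void main(String[] args) {
        String s = "Grüße";
        // Encoding and decoding with the same explicit charset round-trips
        // safely; the no-argument overloads use the platform default instead,
        // which may differ between machines (the bug described above).
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        String back = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(s.equals(back)); // prints "true"
        // "ü" and "ß" each occupy two bytes in UTF-8, so 5 chars -> 7 bytes.
        System.out.println(utf8.length);    // prints "7"
    }
}
```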
Try passing -Dfile.encoding=utf-8 as JVM argument.

Reading Unicode characters in Java

I am using FileInputStream and FileReader to read data from a file which contains Unicode characters.
When I set the default encoding to cp-1252, both read junk data; when I set the default encoding to UTF-8, both read fine.
Is it true that both of these use the system default encoding to read the data?
Then what's the benefit of using a character stream if it depends on the system encoding?
Is there any way, apart from:
BufferedReader fis = new BufferedReader(new InputStreamReader(new FileInputStream("some unicode file"), "UTF-8"));
to read the data correctly when the default encoding is other than UTF-8?
FileReader and FileWriter should IMHO be deprecated.
Use
new InputStreamReader(new FileInputStream(file), "UTF-8")
or so.
There also exists an overloaded version without the encoding parameter, which uses the default platform encoding, System.getProperty("file.encoding").

Error when reading non-English language character from file

I am building an app where users have to guess a secret word. I have *.txt files in the assets folder. The problem is that the words are in the Albanian language. Our language uses letters like "ë" and "ç", so whenever I try to read a word containing any of those characters from the file, I get some wicked symbol and I cannot use string.compare() for these characters. I have tried many options with UTF-8 and changed Eclipse settings, but still the same error.
I would really appreciate it if someone has any advice.
The code I use to read the files is:
AssetManager am = getAssets();
strOpenFile = "fjalet.txt";
InputStream fins = am.open(strOpenFile);
reader = new BufferedReader(new InputStreamReader(fins));
ArrayList<String> stringList = new ArrayList<String>();
while ((aDataRow = reader.readLine()) != null) {
    aBuffer += aDataRow + "\n";
    stringList.add(aDataRow);
}
Otherwise the code works fine, except for the mentioned characters.
It seems pretty clear that the default encoding that is in force when you create the InputStreamReader does not match the file.
If the file you are trying to read is UTF-8, then this should work:
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
If the file is not UTF-8, then that won't work. Instead you should use the name of the file's true encoding. (My guess is that it is ISO-8859-1 or ISO-8859-16.)
Once you have figured out what the file's encoding really is, you need to try to understand why it does not correspond to your Java platform's default encoding ... and then make a pragmatic decision on what to do about it. (Should you hard-wire the encoding into your application ... as above? Should you make it a configuration property or command parameter? Should you change the default encoding? Should you change the file?)
You need to determine the character encoding that was used when creating the file, and specify this encoding when reading it. If it's UTF-8, for example, use
reader = new BufferedReader(new InputStreamReader(fins, "UTF-8"));
or
reader = new BufferedReader(new InputStreamReader(fins, StandardCharsets.UTF_8));
if you're on Java 7 or later.
Text editors like Notepad++ have good heuristics to guess what the encoding of a file is. Try opening it with such an editor and see which encoding it has guessed (if the characters appear correctly).
You need to know the encoding of the file.
The InputStream class reads the file as binary data. Although you can interpret the input as characters, that would be implicit guessing, which may be wrong.
The InputStreamReader class converts binary data to chars, but it has to be told the character set.
You should use the constructor overload that takes a character set.
UPDATE
Don't assume you have a UTF-8 encoded file; that may be wrong. Here in Russia we have such encodings as CP866, WIN1251, and KOI8, which all differ from UTF-8. Probably you have some popular Albanian encoding of text files. Check your OS settings to guess.
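One cheap check before guessing between encodings: look for a UTF-8 byte order mark (EF BB BF) at the start of the file. The sketch below is only a hint, not detection proper, since most UTF-8 files carry no BOM at all, so a negative result proves nothing:

```java
public class BomSniff {
    // True if the buffer starts with the UTF-8 byte order mark EF BB BF.
    // Read the first few bytes of the file into such a buffer to use it.
    public static boolean hasUtf8Bom(byte[] head) {
        return head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }
}
```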

Reading UTF-8 encoded XML from URL in java

I'm trying to read XML data from the Google weather web service. The response contains some Spanish characters. The problem is that these characters are not displayed properly. I've tried to convert everything to UTF-8, but that does not seem to help. The code is given below:
public static void main(String[] args) {
    try {
        URL url = new URL("http://www.google.com/ig/api?weather=Noja&hl=es");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(
                con.getInputStream(), "UTF-8"));
        String str = in.readLine();
        // this does not work either:
        // String str = new String(in.readLine().getBytes("UTF-8"), "UTF-8");
        System.out.println(str);
        in.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Output is given below (trimmed to keep the post within limits). Notice "mi�" and "s�b":
<day_of_week data="mi�"/><day_of_week data="s�b"/><low data="11"/><high data="16"/><icon data="/ig/images/weather/chance_of_rain.gif"/><condition data="Posibilidad de lluvia"/></forecast_conditions></weather></xml_api_reply>
If that page is XML, then you should usually pass the InputStream directly to the XML parser and let it automatically detect the encoding. Otherwise you should look at the charset parameter of the Content-Type response header to determine the correct encoding and create the appropriate InputStreamReader.
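A minimal sketch of the parser-detects-it approach, here fed ISO-8859-1 bytes from memory rather than the long-defunct Google endpoint (the sample XML is mine):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XmlAutoDetect {
    public static void main(String[] args) throws Exception {
        // The XML declaration names the encoding, so the parser can decode
        // the raw bytes correctly on its own; no Reader, no guessing.
        byte[] xml = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<day_of_week data=\"mié\"/>")
                .getBytes(StandardCharsets.ISO_8859_1);
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml));
        System.out.println(doc.getDocumentElement().getAttribute("data"));
    }
}
```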
Edit: That server is indeed responding with different encodings to the browser and to the Java client, probably depending on the Accept-Charset request header. For Firefox this header has the value
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n
This means both charsets are accepted, with no preference for either one. The server responds with a Content-Type header of text/xml; charset=UTF-8. The Java client does not send this header, and the server responds with text/xml; charset=ISO-8859-1.
To use the charset supplied by the server you can use code like the following:
Matcher matcher = Pattern.compile("charset\\s*=\\s*([^ ;]+)").matcher(contentType);
String charset = "utf-8"; // default
if (matcher.find()) {
    charset = matcher.group(1);
}
System.out.println(con.getContentType());
BufferedReader in = new BufferedReader(new InputStreamReader(
con.getInputStream(), charset));
Edit 2: Turns out the server decides the charset to use based on the user-agent header. If you add the following line, it responds with a charset of utf-8.
con.setRequestProperty("User-Agent", "Mozilla/5.0");
Anyway, the Content-Type response header contains the correct charset to use.
Your input may be correct, although I would use an XML parser to read the XML rather than try to interpret it as a line-by-line feed. However, your output may be incorrect.
What's the default character encoding of your JVM? Check (and set) the confusingly named property -Dfile.encoding=UTF-8.
Do the requisite fonts etc. exist on your system? Can you check the actual character codes you're outputting, rather than relying on your terminal settings? I suspect this is the case, since the encoding/decoding appears to work and you're just missing those individual characters.
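Checking the actual character codes can be done like this sketch, which sidesteps any terminal font or encoding problem by printing numeric code points (the sample string is mine):

```java
public class CodePoints {
    public static void main(String[] args) {
        String s = "mié";
        // U+00E9 is "é" regardless of how the terminal renders it, so if
        // U+00E9 appears here the decoding was correct and only the
        // display is at fault.
        s.codePoints().forEach(cp ->
                System.out.printf("U+%04X%n", cp));
    }
}
```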
