Cannot get URL content as UTF-8 - java

I'm trying to read content from a URL, but it returns strange symbols instead of "è", "à", etc.
This is the code I'm using:
public static String getPageContent(String _url) {
URL url;
InputStream is = null;
BufferedReader dis;
String line;
String text = "";
try {
url = new URL(_url);
is = url.openStream();
//This line should open the stream as UTF-8
dis = new BufferedReader(new InputStreamReader(is, "UTF-8"));
while ((line = dis.readLine()) != null) {
text += line + "\n";
}
} catch (MalformedURLException mue) {
mue.printStackTrace();
} catch (IOException ioe) {
ioe.printStackTrace();
} finally {
try {
if (is != null) is.close(); // is stays null if the URL could not be opened
} catch (IOException ioe) {
// nothing to see here
}
}
return text;
}
I saw other questions like this, and all of them were answered with something like
Declare your inputstream as
new InputStreamReader(is, "UTF-8")
But I can't get it to work.
For example, if the URL content contains
è uno dei più
I get
Ã¨ uno dei piÃ¹
What am I missing?

Judging by your example, you do receive a multibyte UTF-8 byte stream, but your text editor reads it as ISO-8859-1. Tell your editor to read the bytes as UTF-8!
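The effect described above can be reproduced in a few lines. This is only a sketch; the class and method names (MojibakeDemo, misread, readCorrectly) are made up for illustration. It encodes a string as UTF-8 and then decodes the same bytes once with the wrong charset and once with the right one:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Decodes the UTF-8 bytes of the text as ISO-8859-1 (the wrong charset).
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Decodes the same bytes as UTF-8 (the right charset).
    static String readCorrectly(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00e8 uno dei pi\u00f9"; // "è uno dei più"
        System.out.println(misread(original));       // two characters per accented letter
        System.out.println(readCorrectly(original)); // the original text
    }
}
```

Each accented letter takes two bytes in UTF-8, so reading those bytes as ISO-8859-1 yields two characters per letter. Whether the second line prints correctly still depends on the encoding of the console or editor that displays it.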

I don't really know why this does not work. However, the Java 7 way is to use StandardCharsets.UTF_8, see
http://docs.oracle.com/javase/7/docs/api/java/nio/charset/StandardCharsets.html
with the constructor InputStreamReader(InputStream in, Charset cs), see
http://docs.oracle.com/javase/7/docs/api/java/io/InputStreamReader.html.
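A minimal sketch of that suggestion, assuming Java 8 or later (for UncheckedIOException). The class name Utf8StreamReader and the method readAll are hypothetical; in the question's code the InputStream would come from url.openStream():

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Utf8StreamReader {
    // Reads the whole stream as UTF-8 text, one line at a time, and closes it.
    static String readAll(InputStream in) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Stand-in for a network stream: the UTF-8 bytes of "è uno dei più".
        byte[] bytes = "\u00e8 uno dei pi\u00f9".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(bytes)));
    }
}
```

Using the Charset constant instead of the string "UTF-8" removes the need to handle UnsupportedEncodingException, and the StringBuilder avoids the repeated string concatenation in the loop.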

Related

Java fast stream copy with ISO-8859-1

I have the following code, which reads files as ISO-8859-1, since that's what this application requires:
private static String readFile(String filename) throws IOException {
String lineSep = System.getProperty("line.separator");
File f = new File(filename);
StringBuffer sb = new StringBuffer();
if (f.exists()) {
BufferedReader br =
new BufferedReader(
new InputStreamReader(
new FileInputStream(filename), "ISO-8859-1"));
String nextLine = "";
while ((nextLine = br.readLine()) != null) {
sb.append(nextLine+ " ");
// note: BufferedReader strips the EOL character.
// sb.append(lineSep);
}
br.close();
}
return sb.toString();
}
The problem is that it is pretty slow. I have this function, which is MUCH faster, but I cannot find where to specify the character encoding:
private static String fastStreamCopy(String filename)
{
String s = "";
FileChannel fc = null;
try
{
fc = new FileInputStream(filename).getChannel();
MappedByteBuffer byteBuffer = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
int size = byteBuffer.capacity();
if (size > 0)
{
byteBuffer.clear();
byte[] bytes = new byte[size];
byteBuffer.get(bytes, 0, bytes.length);
s = new String(bytes);
}
fc.close();
}
catch (FileNotFoundException fnfx)
{
System.out.println("File not found: " + fnfx);
}
catch (IOException iox)
{
System.out.println("I/O problems: " + iox);
}
finally
{
if (fc != null)
{
try
{
fc.close();
}
catch (IOException ignore)
{
}
}
}
return s;
}
Does anyone have an idea of where I should put the ISO encoding?
From the code you posted, you're not trying to "copy" the stream, but read it into a string.
You can simply provide the encoding in the String constructor:
s = new String(bytes, "ISO-8859-1");
Personally I'd just replace the whole method with a call to the Guava method Files.toString():
String content = Files.toString(new File(filename), StandardCharsets.ISO_8859_1);
If you're using Java 6 or earlier, you'll need to use the Guava field Charsets.ISO_8859_1 instead of StandardCharsets.ISO_8859_1 (which was only introduced in Java 7).
However, your use of the term "copy" suggests that you want to write the result to some other file (or stream). If that is true, then you don't need to care about the encoding at all, since you can just handle the byte[] directly and avoid the (unnecessary) conversion to and from String.
Specify the encoding wherever you convert bytes to a String, e.g. s = new String(bytes, encoding);, or vice versa.
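If Guava is not available, the same idea can be sketched with the JDK alone. This assumes Java 7 or later for java.nio.file.Files (and Java 8 for UncheckedIOException); the class and method names are hypothetical:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Latin1FileReader {
    // Decodes raw bytes as ISO-8859-1; every byte maps to exactly one char.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Reads the whole file into memory and decodes it as ISO-8859-1.
    static String readFile(String filename) {
        try {
            return decode(Files.readAllBytes(Paths.get(filename)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 0xE9 is 'é' and 0xF8 is 'ø' in ISO-8859-1.
        System.out.println(decode(new byte[] {(byte) 0xE9, (byte) 0xF8}));
    }
}
```

Like the memory-mapped version in the question, this loads the entire file at once, so it suits files that fit comfortably in memory.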

special char Android (ø)

I have to retrieve some data from a txt file and then show it inside my app.
My problem is that if the txt file contains the special char 'ø', it is not shown; a '?' is shown instead.
I tried to check the data like this:
if(string.charAt(i) == 'ø') do sth
or
string.replace('ø' , 'O')
but neither works, and I think Java does not recognize that char at all.
Do you have any idea?
Thanks
edit
This is how I read the data:
String[] obj = getText(getActivity(), "myTXT.txt").split("\n");
where getText is:
public String getText(Context c, String fileName){
ByteArrayOutputStream outputStream = null;
try {
AssetManager am = c.getAssets();
InputStream is = am.open(fileName);
outputStream = new ByteArrayOutputStream();
byte buf[] = new byte[1024];
int len;
while ((len = is.read(buf)) != -1){
outputStream.write(buf,0,len);
}
outputStream.close();
is.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
return outputStream.toString();
}
These chars must be UTF-8 encoded, so check which encoding your file is saved in. Then create an InputStreamReader using the constructor that takes an encoding.
InputStreamReader r= new InputStreamReader(new FileInputStream(myFile),"UTF-8");
// read your contents here.
r.close();
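The getText method in the question ends with outputStream.toString(), which decodes with the platform default charset. A small change is to name the charset there. The sketch below assumes the file is saved as UTF-8; the class name AssetTextDecoder and the method readUtf8 are hypothetical, and on Android the InputStream would come from AssetManager.open:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class AssetTextDecoder {
    // Copies the stream into a buffer and decodes the bytes as UTF-8.
    // The caller remains responsible for closing the stream.
    static String readUtf8(InputStream is) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        int len;
        try {
            while ((len = is.read(buf)) != -1) {
                out.write(buf, 0, len);
            }
            return out.toString("UTF-8"); // explicit charset instead of the default
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "sm\u00f8rrebr\u00f8d".getBytes(StandardCharsets.UTF_8);
        String text = readUtf8(new ByteArrayInputStream(bytes));
        System.out.println(text.indexOf('\u00f8')); // position of the first 'ø'
    }
}
```

If the file was saved in a different encoding (for example ISO-8859-1), that charset name has to be used instead; decoding with the wrong one is what produces the '?'.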

UTF-8 encoding in JLabel on Windows

I have a problem with encoding in a JLabel on Windows (on *nix OSes everything is okay).
Here's an image: http://i.imgur.com/DEkj3.png (the problematic character is the L with ` on the top, it should be ł) and here the code:
public void run()
{
URL url;
HttpURLConnection conn;
BufferedReader rd;
String line;
String result = "";
try {
url = new URL(URL);
conn = (HttpURLConnection) url.openConnection();
conn.setRequestMethod("GET");
rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
while ((line = rd.readLine()) != null) {
result += line;
}
rd.close();
} catch (Exception e) {
try
{
throw e;
}
catch (Exception e1)
{
Window.news.setText("");
}
}
Window.news.setText(result);
}
I've tried Window.news.setText(new String(result.getBytes(), "UTF-8"));, but it hasn't helped. Maybe I need to run my application with specified JVM flags?
You are breaking the data before it gets to the window when you use new InputStreamReader with no explicit charset. This uses the platform default charset, which is probably cp1252 on Windows, hence your broken characters.
If you know the charset of the data you are reading, you should specify it explicitly, e.g.:
new InputStreamReader(conn.getInputStream(), "UTF-8")
When downloading data from an arbitrary URL, however, you should probably prefer the charset given in the 'Content-Type' header, if present.
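One way to pick up the charset from the header is sketched below. The helper charsetOf and the class name ContentTypeCharset are hypothetical, and the parsing is deliberately simple (it only looks for a charset= parameter); the fallback charset is an assumption the caller has to choose:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharset {
    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=UTF-8", or the fallback if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetOf("text/html", StandardCharsets.UTF_8));
    }
}
```

With the code from the question this could be used as new InputStreamReader(conn.getInputStream(), charsetOf(conn.getContentType(), StandardCharsets.UTF_8)), where getContentType() returns the value of the Content-Type header or null.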

Reading from binary file and converting certain info to string

I have to read data such as the S/W version, vendor, etc. from a binary file.
I have to show the output in a text area. After reading the configuration, the user can send the selected file through a serial port.
I have written some code here:
InputStream is=null;
try {
File urt=filebrowser.getSelectedFile();
is = new FileInputStream(urt);
DataInputStream in = new DataInputStream(is);
long l=urt.length();
char[] bytes= new char[(int)l];
int o=bytes.length;
errlabel.setText(String.valueOf(o));
String content;
int offset;
BufferedReader br = new BufferedReader(new InputStreamReader(in));
int numRead;
try {
br.read(bytes, 0, 46);
while ((content = br.readLine()) != null) {
StringBuilder sb=new StringBuilder(content);
jTextArea1.setText(sb.toString());
errlabel.setText(""+sb.length());
}
} catch (IOException ex) {
Logger.getLogger(MyBoxUpdator.class.getName()).log(Level.SEVERE, null, ex);
}
} catch (FileNotFoundException ex) {
Logger.getLogger(MyBoxUpdator.class.getName()).log(Level.SEVERE, null, ex);
} finally {
try {
is.close();
} catch (IOException ex) {
Logger.getLogger(MyBoxUpdator.class.getName()).log(Level.SEVERE, null, ex);
}
}
And the output:
EE��6**UT�h��}�(:�萢Ê�*:�茢���_��(RQ��N���S��h����rMQ��(_Q����9mTT��\�nE�PtP�!E�UtBߌz��z���������
What may be wrong?
You are converting bytes into chars, so you must say which encoding to use.
If you don't specify one, the InputStreamReader (that is, a reader that reads chars from an input stream of bytes) will use the platform default. I'm fairly sure the default is not what you need.
Try this:
new InputStreamReader(in, "UTF-8"); // or whatever encoding you need
As a general rule: always specify the encoding when converting chars to bytes and vice versa! :)
Edit
Of course, I'm assuming your file has TEXT encoded in it. If it's binary, as @alfasin said, well, it's normal to see garbage. You should read bytes and write chars that represent them (for example, a hex representation of each byte).
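A hex representation of the raw bytes could look like the sketch below. The class name HexDump and the method toHex are made up for illustration; in the question's code the bytes would be read from the FileInputStream directly, without going through a Reader:

```java
public class HexDump {
    // Formats each byte as two uppercase hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 3);
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so negative bytes are not sign-extended.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] sample = {0x45, 0x45, (byte) 0xFF, 0x00};
        System.out.println(toHex(sample)); // prints "45 45 FF 00"
    }
}
```

The resulting string is plain ASCII, so it can be shown in a text area without any encoding concerns.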

Java UTF-8 encoding not set to URLConnection

I'm trying to retrieve data from http://api.freebase.com/api/trans/raw/m/0h47
As you can see, the text contains signs like this: /ælˈdʒɪəriə/.
When I fetch the page source, I get text with sequences like &#250; instead.
So far I've tried the following code:
urlConnection.setRequestProperty("Accept-Charset", "UTF-8");
urlConnection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded;charset=utf-8");
What am I doing wrong?
My entire code:
URL url = null;
URLConnection urlConn = null;
DataInputStream input = null;
try {
url = new URL("http://api.freebase.com/api/trans/raw/m/0h47");
} catch (MalformedURLException e) {e.printStackTrace();}
try {
urlConn = url.openConnection();
} catch (IOException e) { e.printStackTrace(); }
urlConn.setRequestProperty("Accept-Charset", "UTF-8");
urlConn.setRequestProperty("Content-Type", "text/plain; charset=utf-8");
urlConn.setDoInput(true);
urlConn.setUseCaches(false);
StringBuffer strBseznam = new StringBuffer();
if (strBseznam.length() > 0)
strBseznam.deleteCharAt(strBseznam.length() - 1);
try {
input = new DataInputStream(urlConn.getInputStream());
} catch (IOException e) { e.printStackTrace(); }
String str = "";
StringBuffer strB = new StringBuffer();
strB.setLength(0);
try {
while (null != ((str = input.readLine())))
{
strB.append(str);
}
input.close();
} catch (IOException e) { e.printStackTrace(); }
The HTML page is in UTF-8 and could contain Arabic characters and the like. But characters above code point 127 are still encoded as numeric entities like &#250;. An Accept-Charset header will not help, and loading the page as UTF-8 is entirely right.
You have to decode the entities yourself. Something like:
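// Needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
String decodeNumericEntities(String s) {
    StringBuffer sb = new StringBuffer();
    // Matches decimal numeric entities such as &#250;
    Matcher m = Pattern.compile("&#(\\d+);").matcher(s);
    while (m.find()) {
        int uc = Integer.parseInt(m.group(1));
        m.appendReplacement(sb, ""); // copy the text before the match, drop the entity
        sb.appendCodePoint(uc);      // append the decoded character
    }
    m.appendTail(sb);
    return sb.toString();
}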
By the way, those entities may stem from processed HTML form input, that is, from the editing side of the web app.
After the code was added to the question:
I have replaced DataInputStream with a (Buffered)Reader for text. InputStreams read binary data (bytes); Readers read text (Strings). An InputStreamReader takes an InputStream and an encoding, and is itself a Reader.
try {
BufferedReader input = new BufferedReader(
new InputStreamReader(urlConn.getInputStream(), "UTF-8"));
StringBuilder strB = new StringBuilder();
String str;
while (null != (str = input.readLine())) {
strB.append(str).append("\r\n");
}
input.close();
} catch (IOException e) {
e.printStackTrace();
}
Try adding also the user agent to your URLConnection:
urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36");
This solved my decoding problem like a charm.
I think the problem is in how you read from the stream. DataInputStream.readLine does not decode the bytes with a charset, and readUTF only works on data written by DataOutputStream.writeUTF, so neither fits here. What I would do is create an InputStreamReader with the encoding set, and then read from a BufferedReader line by line (this would go inside your existing try/catch):
Charset charset = Charset.forName("UTF8");
InputStreamReader stream = new InputStreamReader(urlConn.getInputStream(), charset);
BufferedReader reader = new BufferedReader(stream);
StringBuffer responseBuffer = new StringBuffer();
String read = "";
while ((read = reader.readLine()) != null) {
responseBuffer.append(read);
}
