InputStream and UTF-8 sometimes shows "?" characters - java

So I've been dealing with this problem for over a month now, and I've checked almost every related solution here and on Google, but I couldn't find anything that really solved my case.
My problem is that I'm trying to download the HTML source of a website, but in most cases some of the text comes back with "?" characters in it, most likely because the site is in Hebrew.
Here's my code:
public static InputStream openHttpGetConnection(String url)
        throws Exception {
    InputStream inputStream = null;
    HttpClient httpClient = new DefaultHttpClient();
    HttpResponse httpResponse = httpClient.execute(new HttpGet(url));
    inputStream = httpResponse.getEntity().getContent();
    return inputStream;
}
public static String downloadSource(String url) {
    int BUFFER_SIZE = 1024;
    InputStream inputStream = null;
    try {
        inputStream = openHttpGetConnection(url);
    } catch (Exception e) {
        // TODO: handle exception
    }
    int bytesRead;
    String str = "";
    byte[] inputBuffer = new byte[BUFFER_SIZE];
    try {
        while ((bytesRead = inputStream.read(inputBuffer)) > 0) {
            String read = new String(inputBuffer, 0, bytesRead, "UTF-8");
            str += read;
        }
    } catch (Exception e) {
        // TODO: handle exception
    }
    return str;
}
Thanks.

To read characters from a byte stream with a given encoding, use a Reader. In your case it would be something like:
InputStreamReader isr = new InputStreamReader(inputStream, "UTF-8");
char[] inputBuffer = new char[BUFFER_SIZE];
int charsRead;
while ((charsRead = isr.read(inputBuffer, 0, BUFFER_SIZE)) > 0) {
    String read = new String(inputBuffer, 0, charsRead);
    str += read;
}
You can see that the bytes are read in directly as characters: it's the reader's job to know whether it needs one byte or two (or more) to produce the next character in the buffer. It's basically your approach, but decoding the bytes as they are read in instead of afterwards.
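Put together, a minimal sketch of downloadSource rewritten around a Reader (UTF-8 is assumed here, as in the question; the StringBuilder and try-with-resources are additions, not part of the original answer):

public static String downloadSource(String url) throws Exception {
    StringBuilder str = new StringBuilder();
    // The reader decodes bytes to chars as they stream in, so multi-byte
    // sequences are never split across reads.
    try (Reader reader = new InputStreamReader(openHttpGetConnection(url), "UTF-8")) {
        char[] inputBuffer = new char[1024];
        int charsRead;
        while ((charsRead = reader.read(inputBuffer)) > 0) {
            str.append(inputBuffer, 0, charsRead);
        }
    }
    return str.toString();
}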

Converting an InputStream to a String entails specifying an encoding, just as you do in new String(inputBuffer, 0, bytesRead, "UTF-8").
But your approach has several drawbacks.
How do you know you have to use UTF-8?
When retrieving HTTP content, generally speaking, you cannot know in advance what encoding the HTTP response will use. But HTTP provides a mechanism for specifying it: the Content-Type header.
More specifically, your response object should have a Content-Type header with an attribute called charset. In the response, it should look something like:
Content-Type: text/html; charset=UTF-8
You should use whatever comes after the charset= part to transform your bytes to chars.
Since you seem to use Apache HttpClient, its documentation states:
You can set the content type header for a request with the addRequestHeader method in each method and retrieve the encoding for the response body with the getResponseCharSet method.
If the response is known to be a String, you can use the getResponseBodyAsString method, which will automatically use the encoding specified in the Content-Type header, or ISO-8859-1 if no charset is specified.
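Note that the quoted methods (getResponseCharSet, getResponseBodyAsString) belong to the older Commons HttpClient 3.x API. With the DefaultHttpClient (HttpClient 4.x) used in the question, a rough equivalent is EntityUtils.toString, which applies the charset from the Content-Type header and falls back to ISO-8859-1:

HttpResponse httpResponse = httpClient.execute(new HttpGet(url));
// org.apache.http.util.EntityUtils reads the entity fully and decodes it
// with the charset found in the Content-Type header.
String body = EntityUtils.toString(httpResponse.getEntity());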
Alternate way
If there is no Content-Type header and you know your content is HTML, you can decode it as a String using some encoding (UTF-8 or ISO Latin-1, preferably), look for content matching <meta charset="UTF-8">, and use that as the charset. This should only be a fallback.
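A minimal sketch of that fallback, assuming the raw response bytes are already in a rawBytes array (the 2048-byte prefix and the regex are rough assumptions, not a real HTML parser):

// Decode a prefix as ISO-8859-1 (safe: every byte maps to some char),
// then look for a <meta charset=...> declaration (java.util.regex).
String prefix = new String(rawBytes, 0, Math.min(rawBytes.length, 2048),
        StandardCharsets.ISO_8859_1);
Matcher m = Pattern.compile("<meta\\s+charset=[\"']?([A-Za-z0-9_-]+)",
        Pattern.CASE_INSENSITIVE).matcher(prefix);
String charsetName = m.find() ? m.group(1) : "UTF-8"; // the default is an assumption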
Not every byte sequence is convertible to a String
Drawback number two is that you read an arbitrary number of bytes from your stream and try to convert them to a String, which may not be possible.
In practice, UTF-8 encodes some characters across several bytes. For example, "é" is encoded as the two bytes 0xC3 0xA9. Say the response consists of two "é" characters. If your first call to read returns:
[c3, a9, c3]
your conversion to a String using new String(bytes, offset, length, charset) will mangle the last byte: the dangling 0xC3 is not a complete UTF-8 sequence, so it is decoded as the replacement character.
Your following read will get what's left to read:
[a9]
which, on its own, is not an "é" character either (a lone 0xA9 also decodes to the replacement character).
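The effect is easy to reproduce; a minimal demonstration using the UTF-8 bytes of "é" from above:

byte[] bytes = "éé".getBytes(StandardCharsets.UTF_8);            // [C3, A9, C3, A9]
String first = new String(bytes, 0, 3, StandardCharsets.UTF_8);  // "é" plus replacement char
String second = new String(bytes, 3, 1, StandardCharsets.UTF_8); // replacement char only
System.out.println(first + second); // garbage, not "éé"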
Bottom line: you cannot reliably convert even a valid UTF-8 byte stream to a String using your pattern, because a read may stop in the middle of a character.
Going forward: since you use HttpClient, use its methods for HTTP response to String conversion.
If you wish to do it yourself, the easy way is to copy your input to a byte array first, and then convert the whole byte array at once. Something along these lines:
ByteArrayOutputStream responseContent = new ByteArrayOutputStream();
byte[] chunk = new byte[4096];
int n;
while ((n = responseInputStream.read(chunk)) != -1) {
    responseContent.write(chunk, 0, n);
}
byte[] rawResponse = responseContent.toByteArray();
String stringResponse = new String(rawResponse, encoding); // encoding from Content-Type
But you could also use a CharsetDecoder if you want a fully streamed implementation (one that does not buffer the whole response in memory), or, as @jas answers, wrap your InputStream in a Reader and concatenate the output (preferably into a StringBuilder, which is faster when many concatenations occur).
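A hedged sketch of the CharsetDecoder route (UTF-8 is an assumption, use the charset taken from the response; the decoder buffers dangling bytes between reads, so split sequences decode correctly):

// java.nio.charset.* and java.nio.channels.Channels
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
Reader reader = Channels.newReader(Channels.newChannel(responseInputStream), decoder, -1);
StringBuilder sb = new StringBuilder();
char[] buf = new char[1024];
int n;
while ((n = reader.read(buf)) != -1) {
    sb.append(buf, 0, n);
}
String stringResponse = sb.toString();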

Related

How to get two-sequence representation of UTF-8 character using JavaMail's MimeUtility or Apache Commons and quoted-printable?

I have a string which contains the German ü character. Its Unicode code point is U+00FC, but its quoted-printable sequence should actually be =C3=BC (the UTF-8 encoding) instead of =FC. However, using JavaMail's MimeUtility as below, I can only get the single-sequence representation.
String s = "Für";
ByteArrayOutputStream baos = new ByteArrayOutputStream();
OutputStream encodedOut = MimeUtility.encode(baos, "quoted-printable");
encodedOut.write(s.getBytes(StandardCharsets.UTF_8));
String encoded = baos.toString(); // F=FCr
(Using StandardCharsets.US_ASCII instead of UTF_8 resulted in F?r, which is obviously not what I want.)
I have also already taken a look into Apache Commons' QuotedPrintableCodec, which I used like this:
String s = "Für";
QuotedPrintableCodec qpc = new QuotedPrintableCodec();
String encoded = qpc.encode(s, StandardCharsets.UTF_8);
However, this resulted in F=EF=BF=BDr, which is similar to what Java's URLEncoder would produce (F%EF%BF%BDr, with % instead of = as the escape character), and which I don't understand.
I'm getting the string from a JavaMail MimeMessage using a ByteArrayOutputStream like so:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
message.writeTo(baos);
String s = baos.toString();
On the initial store operation, I receive a string containing a literal � (whose quoted-printable sequence would be =EF=BF=BD) instead of the umlaut u. However, on any subsequent request Thunderbird makes (e.g. copying to Sent), I receive the correct ü. Is that something I can fix?
What I would like to receive is the two-sequence representation as required by IMAP and the respective mail clients. How would I go about that?

Is it possible to peek HttpClient's org.apache.http.conn.EofSensorInputStream?

I am trying to peek at an input stream contents from HttpClient, up to 64k bytes.
The stream comes from an HttpGet, nothing unusual about it:
HttpGet requestGet = new HttpGet(encodedUrl);
HttpResponse httpResponse = httpClient.execute(requestGet);
int status = httpResponse.getStatusLine().getStatusCode();
if (status == HttpStatus.SC_OK) {
    return httpResponse.getEntity().getContent();
}
The input stream it returns is of type org.apache.http.conn.EofSensorInputStream
Our use-case is such that we need to "peek" at the first (up to 64k) bytes of the input stream. I use the algorithm described here: How do I peek at the first two bytes in an InputStream?
PushbackInputStream pis = new PushbackInputStream(inputStream, DEFAULT_PEEK_BUFFER_SIZE);
byte[] peekBytes = new byte[DEFAULT_PEEK_BUFFER_SIZE];
int read = pis.read(peekBytes);
if (read < DEFAULT_PEEK_BUFFER_SIZE) {
    byte[] trimmed = new byte[read];
    System.arraycopy(peekBytes, 0, trimmed, 0, read);
    peekBytes = trimmed;
}
pis.unread(peekBytes);
When I use a ByteArrayInputStream, this works with no problem.
The issue: when using org.apache.http.conn.EofSensorInputStream, I only get a small number of bytes at the beginning of the stream, usually around 400 bytes, when I expected up to 64k.
I also tried a BufferedInputStream, reading up to the first 64k bytes and then calling .reset(), but that doesn't work either. Same issue.
Why might this be? I do not think anything is closing the stream, because if I call IOUtils.toString(inputStream) I do get all the content.
See the InputStream#read(byte[] b, int off, int len) contract:
Reads up to len bytes of data from the input stream into an array of bytes. An attempt is made to read as many as len bytes, but a smaller number may be read. The number of bytes actually read is returned as an integer.
Instead of using this method, use IOUtils.read, which loops internally until it has read the number of bytes you requested or the stream ends.
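A sketch of the peek with Commons IO (IOUtils.read(InputStream, byte[]) exists since Commons IO 2.2):

PushbackInputStream pis = new PushbackInputStream(inputStream, DEFAULT_PEEK_BUFFER_SIZE);
byte[] peekBytes = new byte[DEFAULT_PEEK_BUFFER_SIZE];
int read = IOUtils.read(pis, peekBytes); // keeps reading, unlike a single read()
if (read < DEFAULT_PEEK_BUFFER_SIZE) {
    peekBytes = Arrays.copyOf(peekBytes, read); // java.util.Arrays
}
pis.unread(peekBytes);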

Java Charset InputStreamReader, File Channel Differences

I'm trying to read a (Japanese) file that is encoded as a UTF-16 file.
When I read it using an InputStreamReader with a charset of "UTF-16", the file is read correctly:
try {
    InputStreamReader read = new InputStreamReader(new FileInputStream("JapanTest.txt"), "UTF-16");
    BufferedReader in = new BufferedReader(read);
    String str;
    while ((str = in.readLine()) != null) {
        System.out.println(str);
    }
    in.close();
} catch (Exception e) {
    System.out.println(e);
}
However, when I use File Channels and read from a byte array the Strings aren't always converted correctly:
File f = new File("JapanTest.txt");
fis = new FileInputStream(f);
channel = fis.getChannel();
MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0L, channel.size());
buffer.position(0);
int get = Math.min(buffer.remaining(), 1024);
byte[] barray = new byte[1024];
buffer.get(barray, 0, get);
Charset charSet = Charset.forName("UTF-16");
// endOfLinePos is a calculated value and defines the number of bytes to read
rowString = new String(barray, 0, endOfLinePos, charSet);
System.out.println(rowString);
The problem I've found is that I can only read characters correctly if the MappedByteBuffer is at position 0. If I increment the position of the MappedByteBuffer and then read a number of bytes into a byte array, which is then converted to a string using the charset UTF-16, then the bytes are not converted correctly. I haven't faced this issue if a file is encoded in UTF-8, so is this only an issue with UTF-16?
More Details:
I need to be able to read any line from the file channel, so to do this I build a list of line ending byte positions and then use those positions to be able to get the bytes for any given line and then convert them to a string.
The code unit of UTF-16 is 2 bytes, not a single byte as in UTF-8. UTF-8's byte patterns and single-byte code units make it self-synchronizing: a decoder can start reading at any point, and if it lands on a continuation byte it can backtrack or lose at most a single character.
With UTF-16 you must always work with pairs of bytes: you cannot start or stop reading at an odd byte offset. You must also know the endianness and use either UTF-16LE or UTF-16BE explicitly when not reading from the start of the file, because mid-file there is no BOM.
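A minimal sketch under those constraints (lineStart and lineEnd are hypothetical precomputed byte positions, both assumed even; big-endian is also an assumption, use whatever the BOM at the start of the file indicates):

// Offsets must be even, and the charset must carry explicit endianness.
int len = lineEnd - lineStart;
byte[] lineBytes = new byte[len];
buffer.position(lineStart);
buffer.get(lineBytes, 0, len);
String rowString = new String(lineBytes, 0, len, Charset.forName("UTF-16BE"));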
You can also encode the file as UTF-8.
Possibly, the InputStreamReader does some transformations that a plain new String(...) does not. As a workaround (and to verify this assumption) you could try wrapping the data read from the channel, e.g. new InputStreamReader(new ByteArrayInputStream(barray)).
Edit: Forget that :) - Channels.newReader() would be the way to go.
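A minimal sketch of that suggestion (evenByteOffset is hypothetical and must land on a code-unit boundary past the BOM):

FileChannel channel = new FileInputStream("JapanTest.txt").getChannel();
channel.position(evenByteOffset); // position the channel before wrapping it
BufferedReader in = new BufferedReader(Channels.newReader(channel, "UTF-16BE"));
String line = in.readLine();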

Corrupt Gzip string due to character encoding

I have some corrupted Gzip log files that I'm trying to restore. The files were transferred to our servers through a Java-backed web page. The files have always been sent as plain text, but we recently started to receive log files that were gzipped. These gzipped files appear to be corrupted and cannot be unzipped, and the originals have been deleted. I believe this is due to the character encoding in the method below.
Is there any way to revert the process to restore the files to their original zipped format? I have the resulting String's binary data in a database BLOB.
Thanks for any help you can give!
private String convertStreamToString(InputStream is) throws IOException {
    /*
     * To convert the InputStream to String we use the
     * Reader.read(char[] buffer) method. We iterate until the
     * Reader returns -1, which means there's no more data to
     * read. We use the StringWriter class to produce the string.
     */
    if (is != null) {
        Writer writer = new StringWriter();
        char[] buffer = new char[1024];
        try {
            Reader reader = new BufferedReader(
                    new InputStreamReader(is, "UTF-8"));
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        } finally {
            is.close();
        }
        return writer.toString();
    } else {
        return "";
    }
}
If this is the method that was used to convert the InputStream to a String, then your data is almost certainly lost.
The problem is that UTF-8 has quite a few byte sequences that are simply not legal (i.e. they don't encode any character). When decoded, these sequences are replaced with the Unicode replacement character.
That character is the same no matter which invalid byte sequence was decoded, so the specific information in those bytes is lost.
If that's the code you have, you never should have converted to a Reader (or, in fact, to a String). Only preserving the data as a stream (or byte array) would avoid corrupting binary files. Once it's read into the string, the illegal sequences (and there are many in UTF-8) will be discarded.
So no, unless you are quite lucky, there is no way to recover the information. You'll have to set up another process that handles the raw stream and inserts it as a pure BLOB, not a CLOB.
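A minimal sketch of such a binary-safe replacement for convertStreamToString (no Reader, no String, so nothing gets re-encoded):

private byte[] convertStreamToBytes(InputStream is) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try {
        byte[] buffer = new byte[1024];
        int n;
        while ((n = is.read(buffer)) != -1) {
            out.write(buffer, 0, n); // raw bytes, never decoded to chars
        }
    } finally {
        is.close();
    }
    return out.toByteArray(); // store this as a BLOB
}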

Regex and ISO-8859-1 charset in java

I have some text encoded in ISO-8859-1 which I then extract some data from using Regex.
The problem is that the strings I get from the matcher object are in the wrong format, with scrambled chars like "ÅÄÖ".
How do I stop the regex library from scrambling my chars?
Edit: Here's some code:
private HttpResponse sendGetRequest(String url) throws ClientProtocolException, IOException
{
    HttpGet get = new HttpGet(url);
    return hclient.execute(get);
}

private static String getResponseBody(HttpResponse response) throws IllegalStateException, IOException
{
    InputStream input = response.getEntity().getContent();
    StringBuilder builder = new StringBuilder();
    int read;
    byte[] tmp = new byte[1024];
    while ((read = input.read(tmp)) != -1)
    {
        builder.append(new String(tmp), 0, read - 1);
    }
    return builder.toString();
}
HttpResponse response = sendGetRequest(url);
String html = getResponseBody(response);
Matcher matcher = forum_pattern.matcher(html);
while(matcher.find()) // do stuff
This is probably the immediate cause of your problem, and it's definitely an error:
builder.append(new String(tmp), 0, read-1);
When you call one of the new String(byte[]) constructors that doesn't take a Charset, it uses the platform default encoding. Apparently, the default encoding on your platform is not ISO-8859-1. You should be able to get the charset name from the response headers so you can supply it to the constructor. (Note, too, that new String(tmp) decodes the entire 1024-byte buffer regardless of how much was read, and that the exclusive end index read - 1 then drops a character and conflates byte counts with char counts.)
But you shouldn't be using a String constructor for this anyway; the proper way is to use an InputStreamReader. If the encoding were a multi-byte one like UTF-8, you could easily corrupt the data when a chunk of bytes happened to end in the middle of a character.
In any case, never, ever use a new String(byte[]) constructor or a String.getBytes() method that doesn't accept a Charset parameter. Those methods should be deprecated, and should emit ferocious warnings when anyone uses them.
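A hedged rewrite of getResponseBody along those lines (ContentType.getOrDefault is HttpClient 4.x API; the ISO-8859-1 fallback mirrors HTTP's historical default):

private static String getResponseBody(HttpResponse response) throws IOException {
    HttpEntity entity = response.getEntity();
    Charset charset = ContentType.getOrDefault(entity).getCharset();
    if (charset == null) {
        charset = Charset.forName("ISO-8859-1");
    }
    StringBuilder builder = new StringBuilder();
    Reader reader = new InputStreamReader(entity.getContent(), charset);
    char[] buf = new char[1024];
    int read;
    while ((read = reader.read(buf)) != -1) {
        builder.append(buf, 0, read); // chars, not raw bytes, and no off-by-one
    }
    return builder.toString();
}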
It's html from a website.
Use a HTML parser and this problem and all future potential problems will disappear.
I can recommend picking Jsoup for the job.
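For example, a minimal sketch with Jsoup (the selector is hypothetical; Jsoup works out the charset from the HTTP headers and meta tags itself):

Document doc = Jsoup.connect(url).get();
for (Element link : doc.select("a.forum-link")) { // hypothetical selector
    System.out.println(link.text());
}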
See also:
Regular Expressions - Now you have two problems
Parsing HTML - The Cthulhu way
Pros and cons of HTML parsers in Java
