Regex and ISO-8859-1 charset in java

I have some text encoded in ISO-8859-1 which I then extract some data from using Regex.
The problem is that the strings I get from the matcher object are in the wrong encoding, with characters like "ÅÄÖ" scrambled.
How do I stop the regex library from scrambling my chars?
Edit: Here's some code:
private HttpResponse sendGetRequest(String url) throws ClientProtocolException, IOException
{
    HttpGet get = new HttpGet(url);
    return hclient.execute(get);
}
private static String getResponseBody(HttpResponse response) throws IllegalStateException, IOException
{
    InputStream input = response.getEntity().getContent();
    StringBuilder builder = new StringBuilder();
    int read;
    byte[] tmp = new byte[1024];
    while ((read = input.read(tmp))!=-1)
    {
        builder.append(new String(tmp), 0,read-1);
    }
    return builder.toString();
}
HttpResponse response = sendGetRequest(url);
String html = getResponseBody(response);
Matcher matcher = forum_pattern.matcher(html);
while(matcher.find()) // do stuff

This is probably the immediate cause of your problem, and it's definitely an error:
builder.append(new String(tmp), 0, read-1);
When you call one of the new String(byte[]) constructors that doesn't take a Charset, it uses the platform default encoding. Apparently, the default encoding on your platform is not ISO-8859-1. You should be able to get the charset name from the response headers so you can supply it to the constructor.
But you shouldn't be using a String constructor for this anyway; the proper way is to use an InputStreamReader. If the encoding were one of the multi-byte ones like UTF-8, you could easily corrupt the data because a chunk of bytes happened to end in the middle of a character.
In any case, never, ever use a new String(byte[]) constructor or a String.getBytes() method that doesn't accept a Charset parameter. Those methods should be deprecated, and should emit ferocious warnings when anyone uses them.
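As a rough illustration of the InputStreamReader approach described above, here is a minimal sketch. The class and method names (ResponseBodyReader, readAll) are made up for this example, and a ByteArrayInputStream stands in for response.getEntity().getContent(); in real code the charset would come from the response headers rather than being hard-coded.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ResponseBodyReader {
    // Hypothetical helper: decodes the whole stream with an explicit charset.
    // The reader holds back incomplete multi-byte sequences until more bytes
    // arrive, so chunk boundaries cannot split a character.
    static String readAll(InputStream input, Charset charset) {
        StringBuilder builder = new StringBuilder();
        try (Reader reader = new InputStreamReader(input, charset)) {
            char[] buffer = new char[1024];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                builder.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // The three letters from the question, as ISO-8859-1 bytes.
        byte[] latin1 = {(byte) 0xC5, (byte) 0xC4, (byte) 0xD6};
        String text = readAll(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1);
        System.out.println(text.equals("\u00C5\u00C4\u00D6")); // true
    }
}
```

The resulting String can then be handed to the Matcher as before; the regex library itself does not change any characters.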

It's html from a website.
Use an HTML parser and this problem and all future potential problems will disappear.
I can recommend picking Jsoup for the job.
See also:
Regular Expressions - Now you have two problems
Parsing HTML - The Cthulhu way
Pros and cons of HTML parsers in Java

Related

How to get two-sequence representation of UTF-8 character using JavaMail's MimeUtility or Apache Commons and quoted-printable?

I have a string which contains the German ü character. Its Unicode code point is U+00FC (0xFC), but its quoted-printable sequence should actually be =C3=BC instead of =FC. However, using JavaMail's MimeUtility like below, I can only get the single-sequence representation.
String s = "Für";
ByteArrayOutputStream baos = new ByteArrayOutputStream ();
OutputStream encodedOut = MimeUtility.encode (baos, "quoted-printable");
encodedOut.write (s.getBytes (StandardCharsets.UTF_8));
String encoded = baos.toString (); // F=FCr
(Defining StandardCharsets.US_ASCII instead of UTF_8 resulted in F?r, which is - obviously - not what I want.)
I have also already taken a look into Apache Commons' QuotedPrintableCodec, which I used like this:
String s = "Für";
QuotedPrintableCodec qpc = new QuotedPrintableCodec ();
String encoded = qpc.encode (s, StandardCharsets.UTF_8);
However, this resulted in F=EF=BF=BDr, which is similar to the result Java's URLEncoder would produce (% instead of = as an escape character, F%EF%BF%BDr), and which is not understandable to me.
I'm getting the string from a JavaMail MimeMessage using a ByteArrayOutputStream like so:
ByteArrayOutputStream baos = new ByteArrayOutputStream ();
message.writeTo (baos);
String s = baos.toString ();
On the initial store procedure, I receive a string containing a literal � (whose correct quoted-printable sequence seems to be =EF=BF=BD) instead of an umlaut-u. However, on any consecutive request Thunderbird makes (e.g. copying to Sent), I receive the correct ü. Is that something I can fix?
What I would like to receive is the two-sequence representation as required by IMAP and the respective mail clients. How would I go about that?
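For reference, the byte sequences mentioned in this question can be reproduced with a small hand-rolled sketch. This is not the JavaMail or Commons Codec API: QuotedPrintableSketch.encode is a hypothetical helper that only escapes bytes and ignores the line-length and trailing-whitespace rules of real quoted-printable. It suggests that =C3=BC appears whenever the String really contains ü and is encoded as UTF-8, while =EF=BF=BD means the String already contains U+FFFD, which may point to an earlier decoding step such as baos.toString() using the platform default charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class QuotedPrintableSketch {
    // Hypothetical helper: escapes every byte outside printable ASCII
    // (and '=' itself) as =XX. Soft line breaks are not implemented.
    static String encode(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            int value = b & 0xFF;
            if (value >= 32 && value <= 126 && value != '=') {
                out.append((char) value);
            } else {
                out.append(String.format("=%02X", value));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The u-umlaut is written as an escape to avoid source-file encoding issues.
        String s = "F\u00FCr";
        System.out.println(encode(s, StandardCharsets.UTF_8));          // F=C3=BCr
        System.out.println(encode(s, StandardCharsets.ISO_8859_1));     // F=FCr
        System.out.println(encode("F\uFFFDr", StandardCharsets.UTF_8)); // F=EF=BF=BDr
    }
}
```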

Why is encoding needed in an HTTP request?

I am trying to learn how to send a request to a server and retrieve data over HTTP in Java. This is the code I found in the Oracle tutorial (Tutorial > Networking); the full code is pasted at the bottom of the question.
Question 1: in out.write("string=" + stringToReverse); why isn't "string=" encoded like the stringToReverse variable?
String stringToReverse = URLEncoder.encode(args[1], "UTF-8");
Question 2:
There are two snippets below, one from the Oracle tutorial and the other from an Android Studio tutorial.
Code in the Oracle tutorial:
BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
Code in the Android tutorial:
inputStream = urlConnection.getInputStream();
InputStreamReader inputStreamReader = new InputStreamReader(inputStream, Charset.forName("UTF-8"));
BufferedReader reader = new BufferedReader(inputStreamReader);
Why is Charset.forName("UTF-8") missing in the Oracle code?
Note: explaining from basics is very much useful :)
import java.io.*;
import java.net.*;
public class Reverse {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: java Reverse "
                + "http://<location of your servlet/script>"
                + " string_to_reverse");
            System.exit(1);
        }
        String stringToReverse = URLEncoder.encode(args[1], "UTF-8");
        URL url = new URL(args[0]);
        URLConnection connection = url.openConnection();
        connection.setDoOutput(true);
        OutputStreamWriter out = new OutputStreamWriter(
            connection.getOutputStream());
        out.write("string=" + stringToReverse);
        out.close();
        BufferedReader in = new BufferedReader(
            new InputStreamReader(
                connection.getInputStream()));
        String decodedString;
        while ((decodedString = in.readLine()) != null) {
            System.out.println(decodedString);
        }
        in.close();
    }
}

Question 1:
There is no need to encode "string=": the key "string" contains no special characters (see https://docs.oracle.com/javase/6/docs/api/java/net/URLEncoder.html), and "=" is the separator between key and value, so it has to stay unencoded.
Question 2:
The charset in the following example is not explicitly defined:
BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
Therefore the default charset is used (which may not be UTF-8):
Every instance of the Java virtual machine has a default charset,
which may or may not be one of the standard charsets. The default
charset is determined during virtual-machine startup and typically
depends upon the locale and charset being used by the underlying
operating system. (https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html)
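To make the effect of the charset argument concrete, here is a small sketch under the assumption that the server sends UTF-8. ExplicitCharsetDemo and firstLine are names invented for this example, and a byte array stands in for connection.getInputStream(). The same two bytes decode to different text depending on the charset handed to InputStreamReader, which is why relying on the default can give different results on different machines.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    // Hypothetical helper: reads the first line of a response body
    // using the charset passed in, instead of the JVM default.
    static String firstLine(byte[] body, Charset charset) {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(body), charset))) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] body = {(byte) 0xC3, (byte) 0xA9}; // e-acute encoded as UTF-8
        System.out.println("Default charset: " + Charset.defaultCharset());
        // Decoded with the right charset: one character, U+00E9.
        System.out.println(firstLine(body, StandardCharsets.UTF_8).equals("\u00E9"));
        // Decoded as ISO-8859-1: two characters, U+00C3 and U+00A9.
        System.out.println(firstLine(body, StandardCharsets.ISO_8859_1).equals("\u00C3\u00A9"));
    }
}
```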

In a URL, the string after ? is called the query string:
example.com/users/profile?key1=value1&key2=value2
So for the above URL the query string is "key1=value1&key2=value2"
A query string contains key/value pairs that a server script can access. These pairs are called request parameters and are separated by &. Characters such as ?, & and space are called special characters in a URL because they are treated specially.
So what happens if value1 itself contains an & character? The server will inadvertently end the value at the &. With the value user1&23=hello, for example, the server would read name as just user1:
name=user1&23=hello&place=hyd
The example above will not work as expected.
That is why you use URL encoding to convert special characters like &, ? and space into non-special sequences when they are used in a query string. The server converts them back to their actual form once the request is received.
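The example above can be tried out with java.net.URLEncoder. This is only a sketch; QueryEncodingDemo and its encode wrapper are names invented here, and the wrapper exists just to hide the checked UnsupportedEncodingException.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryEncodingDemo {
    // Hypothetical wrapper around URLEncoder.encode with a fixed charset.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Only the values are encoded; the separators "=" and "&" are added
        // afterwards so that they keep their special meaning.
        String query = "name=" + encode("user1&23=hello") + "&place=" + encode("hyd");
        System.out.println(query); // name=user1%2623%3Dhello&place=hyd
    }
}
```

Note that encoding the whole query string in one call would also escape the separators, so the server could no longer split it into parameters.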
Now, coming to your question 1: URL encoding as such is not needed in your case, because you are not sending string_to_reverse as a request parameter in the query string. As Jesper pointed out, this is not URL encoding; you are sending it as the body through the output stream.
Now question 2: the documentation of the URLEncoder class (http://docs.oracle.com/javase/7/docs/api/java/net/URLEncoder.html) states:
Utility class for HTML form encoding. This class contains static methods for converting a String to the application/x-www-form-urlencoded MIME format.
So HTML form data is posted as application/x-www-form-urlencoded, and in your case URLEncoder takes care of that. If no charset is specified, the default character set is used (see "How to Find the Default Charset/Encoding in Java?").
The name URLEncoder is a little misleading, as the class is not really used for encoding a URL here but for encoding the request body (string_to_reverse) as application/x-www-form-urlencoded.

inputStream and utf 8 sometimes shows "?" characters

I've been dealing with this problem for over a month now. I checked almost every related solution here and on Google, but I couldn't find anything that really solved my case.
My problem is that I'm trying to download the HTML source of a website, but in most cases some of the text shows "?" characters, most likely because the site is in Hebrew.
Here's my code:
public static InputStream openHttpGetConnection(String url)
        throws Exception {
    InputStream inputStream = null;
    HttpClient httpClient = new DefaultHttpClient();
    HttpResponse httpResponse = httpClient.execute(new HttpGet(url));
    inputStream = httpResponse.getEntity().getContent();
    return inputStream;
}
public static String downloadSource(String url) {
    int BUFFER_SIZE = 1024;
    InputStream inputStream = null;
    try {
        inputStream = openHttpGetConnection(url);
    } catch (Exception e) {
        // TODO: handle exception
    }
    int bytesRead;
    String str = "";
    byte[] inpputBuffer = new byte[BUFFER_SIZE];
    try {
        while ((bytesRead = inputStream.read(inpputBuffer)) > 0) {
            String read = new String(inpputBuffer, 0, bytesRead,"UTF-8");
            str +=read;
        }
    } catch (Exception e) {
        // TODO: handle exception
    }
    return str;
}
Thanks.

To read characters from a byte stream with a given encoding, use a Reader. In your case it would be something like:
InputStreamReader isr = new InputStreamReader(inputStream, "UTF-8");
char[] inputBuffer = new char[BUFFER_SIZE];
int charsRead;
while ((charsRead = isr.read(inputBuffer, 0, BUFFER_SIZE)) > 0) {
    String read = new String(inputBuffer, 0, charsRead);
    str += read;
}
You can see that the bytes are read in directly as characters; it's the reader's job to know whether it needs one or two bytes, for example, to create a character in the buffer. It's basically your approach, but decoding while the bytes are being read instead of afterwards.

Converting an InputStream to a String entails specifying an encoding, just as you do at new String(inpputBuffer, 0, bytesRead,"UTF-8");.
But your approach has several drawbacks.
How do you know you have to use UTF-8?
When retrieving HTTP content, generally speaking, you cannot know in advance what encoding will be used in the HTTP response. But HTTP provides a mechanism for specifying that, using the Content-Type header.
More specifically, your response object should have a Content-Type header with a parameter called charset. In the response, it should look something like:
Content-Type: text/html; charset=UTF-8
You should use whatever comes after charset= to transform your bytes to chars.
Since you seem to use Apache HttpClient, their documentation states:
You can set the content type header for a request with the addRequestHeader method in each method and retrieve the encoding for the response body with the getResponseCharSet method.
If the response is known to be a String, you can use the getResponseBodyAsString method which will automatically use the encoding specified in the Content-Type header or ISO-8859-1 if no charset is specified.
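If you read the header yourself instead of relying on the library, the lookup could be sketched roughly as follows. ContentTypeCharset.charsetOf is a hypothetical helper written for this example, not an HttpClient method; it mirrors the documented fallback to ISO-8859-1 when no usable charset is present and does not attempt full header parsing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeCharset {
    private static final String KEY = "charset=";

    // Hypothetical helper: extracts the charset parameter from a Content-Type
    // header value, falling back to ISO-8859-1 if it is missing or unknown.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, KEY, 0, KEY.length())) {
                    String name = p.substring(KEY.length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // illegal or unsupported name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8")); // UTF-8
        System.out.println(charsetOf("text/html"));                // ISO-8859-1
    }
}
```

The returned Charset can be passed straight to an InputStreamReader or to new String(bytes, charset).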
Alternate way
If there is no Content-Type header, and if you know your content is HTML, then you can try to convert it to a String using some encoding (UTF-8 or ISO-8859-1 preferably), look for something matching <meta charset="UTF-8">, and use that as the charset. This should only be a fallback.
Not every byte sequence is convertible to a String
Drawback number two is that you read an arbitrary number of bytes from your stream and try to convert them to a String, which may not be possible.
In practice, UTF-8 encodes some characters across several bytes. For example, "é" is encoded as the two bytes 0xC3 0xA9. So say, for example, that the response consists of two "é" characters. If your first call to read returns:
[c3, a9, c3]
Your conversion to a String using new String(bytes, offset, length, charsetName) cannot decode the last byte, because on its own it is not a complete UTF-8 sequence; it ends up as the replacement character U+FFFD.
Your following read will get what's left to read:
[a9]
Which is (whatever that is) not an "é" character.
Bottom line: you cannot reliably convert even a valid UTF-8 byte sequence to a String using your pattern.
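The two-"é" scenario can be reproduced in a few lines. This is a sketch with invented names (SplitChunkDemo, decodeInChunks, decodeWhole); the chunk size of 3 is chosen only to force the split described above.

```java
import java.nio.charset.StandardCharsets;

class SplitChunkDemo {
    // Two e-acute characters encoded as UTF-8: C3 A9 C3 A9.
    static final byte[] TWO_E_ACUTE = {(byte) 0xC3, (byte) 0xA9, (byte) 0xC3, (byte) 0xA9};

    // Decodes each chunk on its own, like the loop in the question.
    static String decodeInChunks(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Decodes all bytes in one go.
    static String decodeWhole(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String whole = decodeWhole(TWO_E_ACUTE);
        String chunked = decodeInChunks(TWO_E_ACUTE, 3);
        System.out.println(whole.equals("\u00E9\u00E9")); // true
        System.out.println(chunked.equals(whole));        // false
        System.out.println(chunked.indexOf('\uFFFD') >= 0); // true: replacement characters appear
    }
}
```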
Going forward: since you use HttpClient, use its method for converting the HTTP response to a String.
If you wish to do it yourself, the easy way is to copy your input to a byte array, and then convert the byte array. Something along the lines of:
ByteArrayOutputStream responseContent = new ByteArrayOutputStream();
byte[] chunk = new byte[4096];
int n;
while ((n = responseInputStream.read(chunk)) != -1) {
    responseContent.write(chunk, 0, n); // copy raw bytes, no decoding yet
}
byte[] rawResponse = responseContent.toByteArray();
String stringResponse = new String(rawResponse, encoding); // decode once, at the end
But you could also use a CharsetDecoder if you want a fully streamed implementation (one that does not buffer the whole response in memory), or, as @jas answers, wrap your InputStream in a Reader and concatenate the output (preferably into a StringBuilder, which should be faster if many concatenations occur).

Corrupt Gzip string due to character encoding

I have some corrupted Gzip log files that I'm trying to restore. The files were transferred to our servers through a Java-backed web page. The files have always been sent as plain text, but we recently started to receive log files that were Gzipped. These Gzipped files appear to be corrupted and cannot be unzipped, and the originals have been deleted. I believe this is caused by the character encoding in the method below.
Is there any way to revert the process to restore the files to their original zipped format? I have the resulting String's byte array data in a database blob.
Thanks for any help you can give!
private String convertStreamToString(InputStream is) throws IOException {
    /*
     * To convert the InputStream to String we use the
     * Reader.read(char[] buffer) method. We iterate until the
     * Reader return -1 which means there's no more data to
     * read. We use the StringWriter class to produce the string.
     */
    if (is != null) {
        Writer writer = new StringWriter();
        char[] buffer = new char[1024];
        try {
            Reader reader = new BufferedReader(
                new InputStreamReader(is, "UTF-8"));
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        } finally {
            is.close();
        }
        return writer.toString();
    } else {
        return "";
    }
}

If this is the method that was used to convert the InputStream to a String, then your data is almost certainly lost.
The problem is that UTF-8 has quite a few byte sequences that are simply not legal (i.e. they don't represent any value). These sequences will be replaced with the Unicode replacement character.
That character is the same no matter which invalid byte sequence was decoded. Therefore the specific information in those bytes is lost.

If that's the code you have, you should never have converted to a Reader (or, in fact, to a String). Only keeping the data as a stream (or byte array) would avoid corrupting binary files. Once it has been read into the String, illegal sequences (and there are many in UTF-8) WILL be replaced, and the original bytes are gone.
So no, unless you are quite lucky, there is no way to recover the info. You'll have to provide another process where you handle the raw stream and insert it as a BLOB, not a CLOB.
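The loss is easy to reproduce. The following sketch (class and method names are invented for this example) compresses a short text, pushes the compressed bytes through a UTF-8 String as the method in the question effectively does, and compares the result with the original bytes. The second byte of every gzip stream is 0x8B, which is not valid at that position in UTF-8, so the round trip cannot return the same bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPOutputStream;

class GzipRoundTripDemo {
    // Compresses the given bytes with gzip.
    static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decodes binary data as UTF-8 text and encodes it again.
    // Invalid sequences become U+FFFD, so the original bytes are not restored.
    static byte[] throughUtf8String(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] original = gzip("some log line".getBytes(StandardCharsets.UTF_8));
        byte[] damaged = throughUtf8String(original);
        System.out.println(Arrays.equals(original, damaged)); // false
    }
}
```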

Encode String to UTF-8

I have a String with a "ñ" character and I have some problems with it. I need to encode this String to UTF-8. I have tried it this way, but it doesn't work:
byte ptext[] = myString.getBytes();
String value = new String(ptext, "UTF-8");
How do I encode that string to utf-8?

How about using
ByteBuffer byteBuffer = StandardCharsets.UTF_8.encode(myString);

String objects in Java use the UTF-16 encoding that can't be modified*.
The only thing that can have a different encoding is a byte[]. So if you need UTF-8 data, then you need a byte[]. If you have a String that contains unexpected data, then the problem is at some earlier place that incorrectly converted some binary data to a String (i.e. it was using the wrong encoding).
* As a matter of implementation, String can internally use a ISO-8859-1 encoded byte[] when the range of characters fits it, but that is an implementation-specific optimization that isn't visible to users of String (i.e. you'll never notice unless you dig into the source code or use reflection to dig into a String object).
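A short sketch may make the point about byte[] more tangible. EncodeDemo and hex are names made up for this example; the String stays the same throughout, and only the byte arrays produced from it differ by encoding.

```java
import java.nio.charset.StandardCharsets;

class EncodeDemo {
    // Hypothetical helper: formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String myString = "\u00F1"; // n with tilde, written as an escape
        System.out.println(hex(myString.getBytes(StandardCharsets.UTF_8)));      // C3 B1
        System.out.println(hex(myString.getBytes(StandardCharsets.ISO_8859_1))); // F1
        System.out.println(hex(myString.getBytes(StandardCharsets.UTF_16BE)));   // 00 F1
    }
}
```

So "encoding a String to UTF-8" means producing the first of these byte arrays; wrapping those bytes in a new String again only decodes them back.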

In Java 7 and later you can use:
import static java.nio.charset.StandardCharsets.*;
byte[] ptext = myString.getBytes(ISO_8859_1);
String value = new String(ptext, UTF_8);
This has the advantage over getBytes(String) that it does not declare throws UnsupportedEncodingException.
If you're using an older Java version you can declare the charset constants yourself:
import java.nio.charset.Charset;
public class StandardCharsets {
public static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");
public static final Charset UTF_8 = Charset.forName("UTF-8");
//....
}

Use byte[] ptext = myString.getBytes("UTF-8"); instead of getBytes(). getBytes() uses the so-called "default encoding", which may not be UTF-8.

A Java String is internally always encoded in UTF-16 - but you really should think about it like this: an encoding is a way to translate between Strings and bytes.
So if you have an encoding problem, by the time you have String, it's too late to fix. You need to fix the place where you create that String from a file, DB or network connection.

You can try this way.
byte ptext[] = myString.getBytes("ISO-8859-1");
String value = new String(ptext, "UTF-8");

At one point I ran into this problem and managed to solve it in the following way.
First I needed to import:
import java.nio.charset.Charset;
Then I had to declare constants for UTF-8 and ISO-8859-1:
private static final Charset UTF_8 = Charset.forName("UTF-8");
private static final Charset ISO = Charset.forName("ISO-8859-1");
Then I could use it in the following way:
String textwithaccent="Thís ís a text with accent";
String textwithletter="Ñandú";
String text1 = new String(textwithaccent.getBytes(ISO), UTF_8);
String text2 = new String(textwithletter.getBytes(ISO), UTF_8);

String value = new String(myString.getBytes("UTF-8"));
And if you want to read from a text file encoded in ISO-8859-1:
String line;
String f = "C:\\MyPath\\MyFile.txt";
try {
BufferedReader br = Files.newBufferedReader(Paths.get(f), Charset.forName("ISO-8859-1"));
while ((line = br.readLine()) != null) {
System.out.println(new String(line.getBytes("UTF-8")));
}
} catch (IOException ex) {
//...
}

I have used the code below to encode special characters by specifying the encoding.
String text = "This is an example é";
byte[] byteText = text.getBytes(Charset.forName("UTF-8"));
//To get original string from byte.
String originalString= new String(byteText , "UTF-8");

A quick step-by-step guide on how to configure UTF-8 as the NetBeans default encoding. As a result, NetBeans will create all new files in UTF-8.
Go to etc folder in NetBeans installation directory
Edit netbeans.conf file
Find netbeans_default_options line
Add -J-Dfile.encoding=UTF-8 inside quotation marks inside that line
(example: netbeans_default_options="-J-Dfile.encoding=UTF-8")
Restart NetBeans
NetBeans now uses UTF-8 as its default encoding.
Your netbeans_default_options may contain additional parameters inside the quotation marks. In that case, add -J-Dfile.encoding=UTF-8 at the end of the string, separated from the other parameters by a space.
Example:
netbeans_default_options="-J-client -J-Xss128m -J-Xms256m
-J-XX:PermSize=32m -J-Dapple.laf.useScreenMenuBar=true -J-Dapple.awt.graphics.UseQuartz=true -J-Dsun.java2d.noddraw=true -J-Dsun.java2d.dpiaware=true -J-Dsun.zip.disableMemoryMapping=true -J-Dfile.encoding=UTF-8"

This solved my problem
String inputText = "some text with escaped chars";
InputStream is = new ByteArrayInputStream(inputText.getBytes("UTF-8"));
