URLDecoder decode several times - java

Is there any method that fully decodes a String? For example, I have
monta%25C3%25B1a. If I use the URLDecoder.decode method ONCE, it returns monta%C3%B1a, and if I decode AGAIN, it finally returns montaña (the fully decoded string). Is there any method or library in Java that achieves this result?

monta[%25]C3[%25]B1a
monta[%C3][%B1]a (a UTF-8 multi-byte sequence)
montaña
It is important to decode with the same Charset that was used for encoding.
Evidently the value was URL-encoded twice: first into UTF-8, and then the % signs were encoded once more.
The double encoding itself should be repaired; otherwise an inelegant patch remains, namely decoding twice:
s = URLDecoder.decode(s, StandardCharsets.UTF_8);
s = URLDecoder.decode(s, StandardCharsets.UTF_8);
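If you cannot fix the upstream double encoding, a "decode until stable" helper is another option. Here is a minimal sketch (my own illustration, not a standard library method):

import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class FullyDecode {

    // Repeatedly decode until the string stops changing.
    // Caution: this mangles input that legitimately contains literal
    // %-sequences or '+' characters, so prefer fixing the double
    // encoding at its source.
    static String decodeFully(String s) {
        String prev;
        do {
            prev = s;
            s = URLDecoder.decode(s, StandardCharsets.UTF_8);
        } while (!s.equals(prev));
        return s;
    }

    public static void main(String[] args) {
        System.out.println(decodeFully("monta%25C3%25B1a")); // montaña
    }
}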


How to detect encoding mismatch

I have a bunch of old AES-encrypted Strings encrypted roughly like this:
String is converted to bytes with ISO-8859-1 encoding
Bytes are encrypted with AES
Result is converted to BASE64 encoded char array
Now I would like to change the encoding to UTF-8 for new values (e.g. '€' cannot be represented in ISO-8859-1). This will of course cause problems if I try to decrypt the old ISO-8859-1 encoded values with UTF-8 encoding:
org.junit.ComparisonFailure: expected:<!#[¤%&/()=?^*ÄÖÖÅ_:;>½§#${[]}<|'äöå-.,+´¨]'-Lorem ipsum dolor ...> but was:<!#[�%&/()=?^*����_:;>��#${[]}<|'���-.,+��]'-Lorem ipsum dolor ...>
I'm thinking of creating some automatic encoding fallback for this.
So the main question is: is it enough to inspect the decrypted char array for '�' characters to detect an encoding mismatch? And what is the 'correct' way to write that '�' symbol when comparing?
if (new String(utf8decryptedCharArray).contains("�")) {
    // Revert to doing the decrypting with ISO-8859-1
    decryptAsISO...
}
When decrypting, you get back the original byte sequence (result of your step 1), and then you can only guess whether these bytes denote characters according to the ISO-8859-1 or the UTF-8 encoding.
From a byte sequence, there's no way to clearly tell how it is to be interpreted.
A few ideas:
You could migrate all the old encrypted strings (decrypt, decode to string using ISO-8859-1, encode to byte array using UTF-8, encrypt). Then the problem is solved once and for all.
You could try to decode the byte array in both versions, see if one version is illegal, or if both versions are equal, and if it is still ambiguous, take the one with the higher probability according to the expected characters. I wouldn't recommend going that way, as it needs a lot of work and still leaves some probability of error.
For the new entries, you could prepend the string / byte sequence with some marker that doesn't appear in ISO-8859-1 text. E.g. some people follow the convention of prepending a Byte Order Mark at the beginning of UTF-8 encoded files. Although the resulting bytes (EF BB BF) aren't strictly illegal in ISO-8859-1 (they read as ï»¿), they are highly unlikely. Then, when your decrypted bytes start with EF BB BF, decode to string using UTF-8, otherwise using ISO-8859-1 (see the sketch below). Still, there's a non-zero probability of error.
If at all possible, I'd go for migrating the existing entries. Otherwise, you'll have to carry the "old-format compatibility stuff" in your code base forever, and still can't absolutely guarantee correct behaviour.
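A minimal sketch of that marker idea (the class and method names here are hypothetical, my own):

import java.nio.charset.StandardCharsets;

final class BomMarker {
    // UTF-8 Byte Order Mark
    private static final byte[] BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Before encrypting a new entry, encode as UTF-8 with the BOM prepended.
    static byte[] markUtf8(String s) {
        byte[] text = s.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[BOM.length + text.length];
        System.arraycopy(BOM, 0, out, 0, BOM.length);
        System.arraycopy(text, 0, out, BOM.length, text.length);
        return out;
    }

    // After decrypting, use the presence of the BOM to choose the charset.
    static String decode(byte[] bytes) {
        if (bytes.length >= 3
                && bytes[0] == BOM[0] && bytes[1] == BOM[1] && bytes[2] == BOM[2]) {
            return new String(bytes, 3, bytes.length - 3, StandardCharsets.UTF_8);
        }
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}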
When decoding bytes to text, don't rely on the � character to detect malformed input. Use a strict decoder. Here is a helper method for that:
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
    return charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes))
            .toString();
}
Here is the corresponding strict encoder helper method, in case you need it:
import java.nio.CharBuffer;
import java.util.Arrays;

static byte[] encodeStrict(String str, Charset charset) throws CharacterCodingException {
    ByteBuffer buf = charset.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .encode(CharBuffer.wrap(str));
    byte[] bytes = buf.array();
    // The buffer's backing array may be larger than the encoded
    // content, so trim it to the buffer's limit if necessary.
    if (bytes.length == buf.limit())
        return bytes;
    return Arrays.copyOfRange(bytes, 0, buf.limit());
}
Since ISO-8859-1 assigns a character to every byte value, you can't use it to detect malformed input. UTF-8, however, is validating, so it is very likely to detect malformed input. That is not 100% guaranteed, but it's the best we can do.
So, try decoding using strict UTF-8, and then fall back to ISO-8859-1 if it fails:
static String decode(byte[] bytes) {
    try {
        return decodeStrict(bytes, StandardCharsets.UTF_8);
    } catch (CharacterCodingException e) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
Test
System.out.println(decode("señor".getBytes(StandardCharsets.ISO_8859_1))); // prints: señor
System.out.println(decode("señor".getBytes(StandardCharsets.UTF_8))); // prints: señor
System.out.println(decode("€100".getBytes(StandardCharsets.UTF_8))); // prints: €100

Java Encoding for "GB2312": character ® replaced with question mark (?)

I'm trying to get the encoded value using the GB2312 charset, but I'm getting '?' instead of '®'.
Below is my sample code:
new String("Test ®".getBytes("GB2312"));
but I'm getting Test ? instead of Test ®.
Has anyone faced this issue?
Java version: JDK 6
Platform: Windows 7
I'm not familiar with Chinese character encodings, so any suggestions are welcome.
For better understanding, the statement can be divided into two parts:
byte[] bytes = "Test ®".getBytes("GB2312"); // bytes, encoding the string to GB2312
new String(bytes); // back to string, using default encoding
Probably ® is not a valid GB2312 character, so it is converted to ?. See the result of
Charset.forName("GB2312").newEncoder().canEncode("®")
Based on the documentation of getBytes:
The behavior of this method when this string cannot be encoded in the given charset is unspecified. The CharsetEncoder class should be used when more control over the encoding process is required.
which also suggests using CharsetEncoder.
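For instance, here is a small sketch (assuming the GB2312 charset is available on your platform) that checks encodability up front and makes the failure explicit instead of silently producing '?':

import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class Gb2312Check {
    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312");

        // canEncode tells you up front whether the text is representable.
        System.out.println(gb2312.newEncoder().canEncode("®")); // false

        try {
            // With REPORT, an unmappable character throws an exception
            // instead of being replaced with '?'.
            gb2312.newEncoder()
                  .onUnmappableCharacter(CodingErrorAction.REPORT)
                  .encode(CharBuffer.wrap("Test ®"));
        } catch (CharacterCodingException e) {
            System.out.println("Cannot encode: " + e);
        }
    }
}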

Why do I have to encode a UTF-8 parameter String to ISO-Latin and then decode as UTF-8 to get a Java UTF-8 String?

I have a Java servlet that takes a parameter String (inputString) that may contain Greek letters from a web page marked up as utf-8. Before I send it to a database I have to convert it to a new String (utf8String) as follows:
String utf8String = new String(inputString.getBytes("8859_1"), "UTF-8");
This works, but, as I hope will be appreciated, I hate doing something I don't understand, even if it works.
From the method description in the Java doc the getBytes() method "Encodes this String into a sequence of bytes using the named charset, storing the result into a new byte array" i.e. I am encoding it in 8859_1 — isoLatin. And from the Constructor description "Constructs a new String by decoding the specified array of bytes using the specified charset" i.e. decodes the byte array to utf-8.
Can someone explain to me why this is necessary?
My question is based on a misconception regarding the character set used for the HTTP request. I had assumed that because I marked up the web page from which the request was sent as UTF-8 the request would be sent as UTF-8, and so the Greek characters in the parameter sent to the servlet would be read as a UTF-8 String (‘inputString’ in my line of code) by the HttpRequest.getParameter() method. This is not the case.
HTTP request parameters are decoded as ISO-8859-1 (POST) or ASCII (GET) by default, which for this purpose amounts to the same thing. This is part of the URI syntax specification; thanks to Andreas for pointing me to http://wiki.apache.org/tomcat/FAQ/CharacterEncoding where this is explained.
I had also forgotten that Greek letters such as α are URL-encoded in the request, producing %CE%B1. getParameter() handles this by decoding the result as two ISO-8859-1 characters, %CE and %B1, i.e. Î and ± (I checked this).
I now understand why this needs to be turned back into a byte array and the bytes interpreted as UTF-8: 0xCE does not represent a one-byte character in UTF-8, so it is combined with the next byte, 0xB1, and interpreted as α. (Î is 0xC3 0x8E and ± is 0xC2 0xB1 in UTF-8.)
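To make the repair step concrete, here is a minimal sketch of that round trip:

import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    public static void main(String[] args) {
        // What getParameter() returned: %CE %B1 decoded as the two
        // ISO-8859-1 characters Î and ±.
        String mojibake = "\u00CE\u00B1"; // "Î±"

        // Re-encoding as ISO-8859-1 recovers the original raw bytes...
        byte[] raw = mojibake.getBytes(StandardCharsets.ISO_8859_1); // {0xCE, 0xB1}

        // ...which then decode correctly as UTF-8.
        System.out.println(new String(raw, StandardCharsets.UTF_8)); // α
    }
}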
When decoding, could you not create a class with a decoder method that takes the byte[] as a parameter and returns it as a string? Here is an example that I have used before.
public class Decoder
{
    public String decode(byte[] bytes)
    {
        // Turns the byte array into a string using the
        // platform's default charset
        String decodedString = new String(bytes);
        return decodedString;
    }
}
Try using this instead of .getBytes(). Hope this works.
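Note that new String(bytes) relies on the platform's default charset, which is exactly what varies between machines. A variant with an explicit charset (UTF-8 here, as an assumption about the data) avoids that dependence:

import java.nio.charset.StandardCharsets;

public class ExplicitDecoder
{
    public String decode(byte[] bytes)
    {
        // Decode with an explicit charset rather than the platform default
        return new String(bytes, StandardCharsets.UTF_8);
    }
}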

Java 8 Base64 Encode (Basic) doesn't add new line anymore. How can I reimplement this?

I essentially have the exact opposite problem as
new-line-appending-on-my-encrypted-string
It seems like the old Java Base64 utility would always add new lines every 76 characters when returning a string, but using the following code, I don't get those breaks I need.
Path path = Paths.get(file);
byte[] data = Files.readAllBytes(path);
String txt= Base64.getEncoder().encodeToString(data);
Is there an easy way to tell the encoder to add the newlines?
I've tried inserting the newlines with a StringBuilder, but it ends up changing the entire output (I copy the text from the Java console into the HxD editor and compare against my known working 'BLOB' with newlines).
String txt = Base64.getEncoder().encodeToString(data);
// CR and LF bytes for the newline
byte b1 = 0x0D;
byte b2 = 0x0A;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < txt.length(); i++) {
    if (i > 0 && (i % 76 == 0)) {
        sb.append((char) b1);
        sb.append((char) b2);
    }
    sb.append(txt.charAt(i));
}
EDIT (in response to question):
It's not the easiest thing to explain, but when I don't use string builder, the output of the encode will start like this:
AAAAPAog4lBVgGJrT2b+mQVicHN3d////////3hhcDJiLWVtMjUwLWVtMjUwLWRldjA0NTUAAAAAAA
But I want it to look like this:
AAAAPAog4lBVgGJrT2b+mQVicHN3d////////3hhcDJiLWVtMjUwLWVtMjUwLWRldjA0NTUAAAAA..AA
As you can see, the ".." represents 0x0D and 0x0A, i.e. a newline, which is inserted at the 76th character (this is what the old Base64 would output).
However, when I append the bytes b1 and b2 (newline) after the 76th character, the output becomes:
BPwAFHwA0CUFoG8AgDRCAAIlQgAAJUIAAhUfNEIAAiUkmw/0fADQFSInART/ADUlfADQFQE0fADQ..
So it looks like the ".." is in the right spot, but everything before it is different.
Thanks!
You want getMimeEncoder instead:
MIME
Uses the "The Base64 Alphabet" as specified in Table 1 of RFC 2045 for encoding and decoding operation. The encoded output must be represented in lines of no more than 76 characters each and uses a carriage return '\r' followed immediately by a linefeed '\n' as the line separator. No line separator is added to the end of the encoded output. All line separators or other characters not found in the base64 alphabet table are ignored in decoding operation.
(emphasis mine)
Note that the encoding scheme is otherwise the same as the basic encoder from getEncoder: both use the same Base64 alphabet, and only the line separators differ.
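So the fix is a one-line change, sketched here against the code from the question (file is the path string from the question):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Base64;

Path path = Paths.get(file);
byte[] data = Files.readAllBytes(path);
// The MIME encoder inserts "\r\n" after every 76 output characters.
String txt = Base64.getMimeEncoder().encodeToString(data);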
Today I split a Base64 representation of an X509Certificate with the following code:
StringBuilder sb = new StringBuilder();
int chunksCount = str.length() / 76;
for (int i = 0; i < chunksCount; i++) {
    sb.append(str.substring(76 * i, 76 * (i + 1))).append("\r\n");
}
if (str.length() % 76 != 0) sb.append(str.substring(76 * chunksCount)).append("\r\n");
I think appending big chunks is better than iterating over each letter. Also, some libraries provide a Base64 encoder with a parameter for the required part size, but I had to use a library without such a feature.
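For reference, Java 8's built-in MIME encoder can do the same splitting directly, with a configurable line length and separator (certBytes below is a hypothetical byte array holding the certificate's encoded form):

import java.util.Base64;

// 76-character lines separated by CRLF, as in the manual loop above
String pem = Base64.getMimeEncoder(76, new byte[] {'\r', '\n'})
        .encodeToString(certBytes);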

Java Strings Character Encoding - For French - Dutch Locales

I have the following piece of code
public static void main(String[] args) throws UnsupportedEncodingException {
    System.out.println(Charset.defaultCharset().toString());
    String accentedE = "é";
    String utf8 = new String(accentedE.getBytes("utf-8"), Charset.forName("UTF-8"));
    System.out.println(utf8);
    utf8 = new String(accentedE.getBytes(), Charset.forName("UTF-8"));
    System.out.println(utf8);
    utf8 = new String(accentedE.getBytes("utf-8"));
    System.out.println(utf8);
    utf8 = new String(accentedE.getBytes());
    System.out.println(utf8);
}
The output of the above is as follows
windows-1252
é
?
é
é
Can someone help me understand what this does? Why this output?
If you already have a String, there is no need to encode and decode it right back; the string is already the result of someone having decoded raw bytes.
In the case of a string literal, the someone is the compiler reading your source as raw bytes and decoding it in the encoding you have specified to it. If you have physically saved your source file in Windows-1252 encoding, and the compiler decodes it as Windows-1252, all is well. If not, you need to fix this by declaring the correct encoding for the compiler to use when compiling your source...
The line
String utf8 = new String(accentedE.getBytes("utf-8"), Charset.forName("UTF-8"));
Does absolutely nothing. (Encode as UTF-8, Decode as UTF-8 == no-op)
The line
utf8 = new String(accentedE.getBytes(), Charset.forName("UTF-8"));
Encodes the string as Windows-1252, and then decodes it as UTF-8. The result must be decoded as Windows-1252 only (because it was encoded as Windows-1252, duh), otherwise you will get strange results.
The line
utf8 = new String(accentedE.getBytes("utf-8"));
Encodes a string as UTF-8, and then decodes it as Windows-1252. Same principles apply as in previous case.
The line
utf8 = new String(accentedE.getBytes());
Does absolutely nothing. (Encode as Windows-1252, Decode as Windows-1252 == no-op)
Analogy with integers that might be easier to understand:
int a = 555;
//The case of encoding as X and decoding right back as X
a = Integer.parseInt(String.valueOf(a), 10);
//a is still 555
int b = 555;
//The case of encoding as X and decoding right back as Y
b = Integer.parseInt(String.valueOf(b), 15);
//b is now 1205 I.E. strange result
Both of these are useless because we already have what we needed before doing any of the code, the integer 555.
There is a need for encoding your string into raw bytes when it leaves your system, and there is a need for decoding raw bytes into a string when they come into your system. There is no need to encode and decode right back within the system.
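A minimal sketch of that boundary rule, using an explicit charset on both sides:

import java.nio.charset.StandardCharsets;

public class Boundary {
    public static void main(String[] args) {
        // Encode only when the string leaves the system...
        byte[] wire = "é".getBytes(StandardCharsets.UTF_8);

        // ...and decode with the same charset when the bytes come back in.
        String back = new String(wire, StandardCharsets.UTF_8);
        System.out.println(back); // é
    }
}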
Line #1 - the default character set on your system is windows-1252.
Line #2 - you created a String by encoding a String literal to UTF-8 bytes, and then decoding it using the UTF-8 scheme. The result is correctly formed String, which can be output correctly using windows-1252 encoding.
Line #3 - you created a String by encoding a string literal as windows-1252, and then decoding it using UTF-8. The UTF-8 decoder has detected a sequence that cannot possibly be UTF-8, and has replaced the offending character with a question mark "?". (The UTF-8 format says that any byte with the top bit set to 1 is one byte of a multi-byte character, but the windows-1252 encoding here is just one byte long... ergo, this is bad UTF-8.)
Line #4 - you created a String by encoding in UTF-8 and then decoding in windows-1252. In this case the decoding has not "failed", but it has produced garbage (aka mojibake). The reason you got 2 characters of output is that the UTF-8 encoding of "é" is a 2 byte sequence.
Line #5 - you created a String by encoding as windows-1252 and decoding as windows-1252. This produces the correct output.
And the overall lesson is that if you encode characters to bytes with one character encoding, and then decode with a different character encoding you are liable to get mangling of one form or another.
When you call the String.getBytes() method, it:
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
So whenever you do:
accentedE.getBytes()
it takes the contents of the accentedE String as bytes encoded in the default OS code page, in your case cp-1252.
This line:
new String(accentedE.getBytes(), Charset.forName("UTF-8"))
takes the accentedE bytes (encoded in cp-1252) and tries to decode them as UTF-8, hence the error. The same situation, from the other side, applies to:
new String(accentedE.getBytes("utf-8"))
The getBytes method takes the accentedE bytes encoded in cp-1252 and re-encodes them in UTF-8, but then the String constructor decodes them with the default OS code page, which is cp-1252.
Constructs a new String by decoding the specified array of bytes using the platform's default charset. The length of the new String is a function of the charset, and hence may not be equal to the length of the byte array.
I strongly recommend reading this excellent article:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
UPDATE:
In short, every character is stored as a number. In order to know which character is which number, the OS uses code pages. Consider the following snippet:
String accentedE = "é";
System.out.println(String.format("%02X ", accentedE.getBytes("UTF-8")[0]));
System.out.println(String.format("%02X ", accentedE.getBytes("UTF-8")[1]));
System.out.println(String.format("%02X ", accentedE.getBytes("windows-1252")[0]));
which outputs:
C3
A9
E9
That is because the small accented e is stored in UTF-8 as two bytes with the values C3 A9, while in cp-1252 it is stored as a single byte with the value E9. For a detailed explanation, read the linked article.
