URLDecoder.decode method not working as expected in Java

I was trying to decode URL encoded post body and came across this problem.
I was using this method to decode (it decodes multiple encoded urls too) :
public static String decodeUrl(String url)
{
    try {
        String prevURL = "";
        String decodeURL = url;
        while (!prevURL.equals(decodeURL))
        {
            prevURL = decodeURL;
            decodeURL = URLDecoder.decode(decodeURL, "UTF-8");
        }
        return decodeURL;
    } catch (UnsupportedEncodingException e) {
        return "Issue while decoding" + e.getMessage();
    }
}
When the input URL was "a%20%2B%20b%20%3D%3D%2013%25!", control somehow never got past the decodeURL = ... line when debugging, and no exceptions were raised either.
The issue is that control doesn't go beyond the decodeURL line.
What might be causing the issue? You should be able to reproduce the problem in a debugger.

Just tested it on Java 8u151. This throws an IllegalArgumentException on the second iteration of the loop: "URLDecoder: Incomplete trailing escape (%) pattern". That's because after the first decoding you have "a + b == 13%!", and during the second decoding that % is supposed to introduce an escape sequence but does not. I think that's the expected behaviour, even if the standard libraries of other languages disagree. Python 3.6, for example:
>>> from urllib.parse import unquote
>>> result = unquote('a%20%2B%20b%20%3D%3D%2013%25!')
>>> result
'a + b == 13%!'
>>> unquote(result)
'a + b == 13%!'
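If you want to keep the decode-until-stable behaviour from the question, one option is to treat the IllegalArgumentException as the stopping condition. A minimal sketch (the name decodeFully and the stop-on-exception policy are mine, not part of the original code):

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public static String decodeFully(String url) {
    String prev = null;
    String current = url;
    while (!current.equals(prev)) {
        prev = current;
        try {
            current = URLDecoder.decode(current, "UTF-8");
        } catch (IllegalArgumentException e) {
            // A stray '%' means the string is no longer valid URL-encoding,
            // so treat it as fully decoded and stop.
            return current;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
    return current;
}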

Related

How to percent encode in Java?

How do I do percent encoding of a string, as described in RFC 3986? I.e. I do not want the (IMO, weird) www-form-urlencoded scheme, as that is different.
If it matters, I am encoding data that is not necessarily an entire URL.
As you have identified, the standard libraries don't cope very well with the problem.
Try either Guava's PercentEscaper, or directly one of its URL escapers, depending on which part of the URL you're trying to encode.
Guava's com.google.common.net.PercentEscaper (marked "Beta" and therefore unstable):
UnicodeEscaper basicEscaper = new PercentEscaper("-", false);
String escaped = basicEscaper.escape(s);
Workaround with java.net.URLEncoder:
try {
    String encoded = URLEncoder.encode(s, "UTF-8").replace("+", "%20");
} catch (UnsupportedEncodingException e) {
    // ..
}
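To see the difference between the two encodings, a quick illustration (the input string is made up; encode declares a checked UnsupportedEncodingException, hence the throws clause):

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class PercentEncodeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String raw = "a b+c";
        // www-form-urlencoded: the space becomes '+'
        System.out.println(URLEncoder.encode(raw, "UTF-8"));                     // a+b%2Bc
        // RFC 3986 percent encoding: the space must be '%20' instead
        System.out.println(URLEncoder.encode(raw, "UTF-8").replace("+", "%20")); // a%20b%2Bc
    }
}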

java or node.js removes one character when sending or receiving a string

I'm trying to send a string from a Java (Android) app to a node.js server.
But one character disappears somewhere in the middle and I can't really figure out why.
To send, I use an HttpURLConnection (conn) and write the string like this:
try {
    OutputStream os = conn.getOutputStream();
    os.write(json.getBytes());
    os.close();
} catch (Exception e) {
    e.printStackTrace();
}
Here is the base64-encoded string when sent, and the string when received:
khVGUBH2kNAR5PPRy7v5dO5iz48Rc7benYARu78\/9wY=\n
khVGUBH2kNAR5PPRy7v5dO5iz48Rc7benYARu78/9wY=\n
so one backslash has been removed.
In node I use this:
exports.getString = function(req, res) {
    var string = req.body.thestring;
}
which yields the latter of the two strings.
var express = require('express'),
    http = require('http'),
    stylus = require('stylus'),
    nib = require('nib');

var app = express();
app.configure(function () {
    app.use(express.logger('dev'));
    //app.use(express.bodyParser());
    app.use(express.json());
    app.use(express.urlencoded());
    app.use(app.router);
});
Any ideas on how I can get the missing character back?
The missing backslash character is most probably disappearing on the node.js side.
As per the chosen answer on the following question:
Two part question on JavaScript forward slash
As far as JS is concerned, "/" and "\/" are identical inside a string.
So a fix on the Java side might solve your problem: use String's replaceAll method to replace all occurrences of \/ with \\/:
os.write(json.replaceAll("\\/", "\\\\/").getBytes());
Note that replaceAll returns the new string and doesn't change the original string.
Making the base64 encoding URL-safe solved my problem.
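For reference, Java 8's java.util.Base64 has a URL-safe variant built in; a minimal sketch (the byte values are made up for illustration):

import java.util.Base64;

public class UrlSafeBase64Demo {
    public static void main(String[] args) {
        // Hypothetical payload standing in for the hash bytes from the question.
        byte[] data = {(byte) 0x92, 0x15, 0x46, 0x50, 0x11, (byte) 0xF6, (byte) 0x90, (byte) 0xD0};

        // The standard alphabet may emit '+' and '/' (plus '=' padding), all of
        // which can be mangled by URL handling or JSON escaping in transit.
        String standard = Base64.getEncoder().encodeToString(data);

        // The URL-safe alphabet uses '-' and '_' instead, so nothing needs escaping.
        String urlSafe = Base64.getUrlEncoder().withoutPadding().encodeToString(data);

        System.out.println(standard + " vs " + urlSafe);
    }
}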

how to avoid recursive url-decoding

I am writing something like an image proxy: I receive URLs from my site's front end, download the images, resize them, and return smaller images for the front end and client to download from the "proxy".
This means I need to take care of all sorts of URL patterns, which is why I chose to decode the given URL and then re-encode it, using URIUtil:
private String fixUrl(String fromUrl) throws URIException {
    fromUrl = URIUtil.decode(fromUrl);
    fromUrl = URIUtil.encodeQuery(fromUrl);
    return fromUrl;
}
This should help me take care of URLs that are already encoded.
My problem is that some of the URLs are double encoded, and from what I saw, URIUtil.decode performs a recursive decode; in the case of double-encoded URLs I therefore end up with a bad URL that does not work.
Is there a simple way to decode only once?
I'd check whether the URL still contains the character %. If it doesn't contain any %, it is not encoded and you can stop your decoding procedure.
The easiest option I know of is the built-in java.net.URLDecoder.decode, which decodes exactly once.
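For illustration, a single call performs exactly one decoding pass, so a double-encoded string comes back single-encoded (made-up input; error handling elided, since decode declares a checked UnsupportedEncodingException):

// "%25" decodes to "%", so a double-encoded string loses one layer per call
String once = java.net.URLDecoder.decode("a%2520b", "UTF-8");  // -> "a%20b"
String twice = java.net.URLDecoder.decode(once, "UTF-8");      // -> "a b"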
If, like me, you have cases where URLs are double or triple encoded, you can use this recursive function to decode again and again until there is no "%" or "+" left:
private static String completeDecode(String url) {
    if (url.contains("%") || url.contains("+")) {
        try {
            return completeDecode(java.net.URLDecoder.decode(url, "UTF-8"));
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
    }
    return url;
}
Cheers

Desktop.Action.MAIL encoding subject and body strings correctly for mailto: in URI

I have a question specific to the Java Desktop API in Java 6, more specifically desktop.mail(URI uri).
I was wondering if there is a function one could use to ensure that the Subject and Body in, for example:
mailToURI = new URI("mailto", getToEmails() + "?SUBJECT=" + getEmailSubject()
        + "&BODY=" + getEmailBody(), null);
desktop.mail(mailToURI);
will be kept in accordance with RFC 2368 and still be displayed correctly in the email application?
Right now, examples of problematic text are the Scandinavian letters æøå / ÆØÅ, and complex URLs in the body containing ampersands (&) and the like, e.g. http://www.whatever.com?a=b&c=d.
Is there a function in Java that ensures the integrity sought above is preserved when using the mailto: URI scheme with Java Desktop's mail(URI) function?
Would it be possible to make one?
At this point I have tried everything I can think of, including:
MimeUtility.encodeText()
URLEncoder.encode(...)
A custom function encodeUnusualChars():
private static final Pattern SIMPLE_CHARS = Pattern.compile("[a-zA-Z0-9]");

private String encodeUnusualChars(String aText) {
    StringBuilder result = new StringBuilder();
    CharacterIterator iter = new StringCharacterIterator(aText);
    for (char c = iter.first(); c != CharacterIterator.DONE; c = iter.next()) {
        char[] chars = {c};
        String character = new String(chars);
        if (isSimpleCharacter(character)) {
            result.append(c);
        } else {
            //hexEncode(character, "UTF-8", result);
        }
    }
    return result.toString();
}

private boolean isSimpleCharacter(String aCharacter) {
    Matcher matcher = SIMPLE_CHARS.matcher(aCharacter);
    return matcher.matches();
}
/**
 * For the given character and encoding, appends one or more hex-encoded characters.
 * For double-byte characters, two hex-encoded items will be appended.
 */
private static void hexEncode(String aCharacter, String aEncoding, StringBuilder aOut) {
    try {
        String HEX_DIGITS = "0123456789ABCDEF";
        byte[] bytes = aCharacter.getBytes(aEncoding);
        for (int idx = 0; idx < bytes.length; idx++) {
            aOut.append('%');
            aOut.append(HEX_DIGITS.charAt((bytes[idx] & 0xf0) >> 4));
            aOut.append(HEX_DIGITS.charAt(bytes[idx] & 0xf));
        }
    } catch (UnsupportedEncodingException ex) {
        Logger.getLogger(LocalMail.class.getName()).log(Level.SEVERE, null, ex);
    }
}
And many more...
At best I end up with the encoded text showing in the email that is opened.
Not providing any special encoding causes æøå or similar characters to break further processing of the content.
I feel I am missing something crucial. Could anyone please enlighten me with a solution to this?
For line breaks I use String NL = System.getProperty("line.separator");
Perhaps there is some system-specific setup that needs to be done to make this work?
By the way, I am currently on Mac OS X 10.6.8 with Mail 4.5:
marius$ java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03-384-10M3425)
Java HotSpot(TM) Client VM (build 20.1-b02-384, mixed mode)
I really feel there must be a way; otherwise the subject and message parts of the desktop.mail(URI) function are unreliable to the point of being useless.
Any help to point me in the right direction is greatly appreciated!!
Thanks Marius, it's a very useful line of code.
I modified it a bit for performance...
It's better to use replace instead of replaceAll when you are not using regular expressions.
This:
.replace("+", "%20")
is faster than:
.replaceAll("\\+", "%20")
Both replace ALL occurrences, but the first one does not have to do any regexp parsing.
http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#replace%28java.lang.CharSequence,%20java.lang.CharSequence%29
Also, if the original string already has \r\n for new lines, the second replaceAll will double the \r. It's not a big issue, but I prefer to remove that one and provide a proper input string:
String result = java.net.URLEncoder.encode(src, "utf-8").replace("+", "%20");
Try this; I hope it will work for you.
String result = java.net.URLEncoder.encode(src, "utf-8").replaceAll("\\+", "%20").replaceAll("\\%0A", "%0D%0A");
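Putting the pieces together, a minimal sketch of building the mailto: URI with a pre-encoded subject and body (the address and strings are made up; the single-string URI constructor is used deliberately so the URI class does not re-encode the %-escapes, and the helper assumes bare \n line breaks in its input to avoid the \r doubling mentioned above):

import java.awt.Desktop;
import java.net.URI;
import java.net.URLEncoder;

public class MailToDemo {
    // Encode one mailto: header value: form-encode it, then repair the two
    // places where www-form-urlencoding differs from what mailto: expects.
    static String encodeMailtoValue(String s) throws Exception {
        return URLEncoder.encode(s, "UTF-8")
                .replace("+", "%20")        // spaces
                .replace("%0A", "%0D%0A");  // bare LF -> CRLF line breaks
    }

    public static void main(String[] args) throws Exception {
        String subject = encodeMailtoValue("Hei æøå");
        String body = encodeMailtoValue("See http://www.whatever.com?a=b&c=d\nbye");
        URI mailToURI = new URI("mailto:someone@example.com?SUBJECT=" + subject
                + "&BODY=" + body);
        Desktop.getDesktop().mail(mailToURI);
    }
}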

How to determine if a String contains invalid encoded characters

Usage scenario
We have implemented a webservice that our web frontend developers use (via a PHP API) internally to display product data. On the website the user enters something (i.e. a query string). Internally the web site makes a call to the service via the API.
Note: we use Restlet, not Tomcat.
Original Problem
Firefox 3.0.10 seems to respect the encoding selected in the browser and encodes the URL according to that selection. This results in different query strings for ISO-8859-1 and UTF-8.
Our web site forwards the input from the user and does not convert it (which it should), so it may end up calling the webservice with a query string that contains German umlauts.
I.e. for a query part looking like
...v=abcädef
if "ISO-8859-1" is selected, the sent query part looks like
...v=abc%E4def
but if "UTF-8" is selected, the sent query part looks like
...v=abc%C3%A4def
Desired Solution
As we control the service (we implemented it), we want to check on the server side whether the call contains non-UTF-8 characters and, if so, respond with a 4xx HTTP status.
Current Solution In Detail
For each character (== string.substring(i, i+1)), check:
whether character.getBytes()[0] equals 63 ('?')
whether Character.getType(character.charAt(0)) returns OTHER_SYMBOL
Code
protected List<String> getNonUnicodeCharacters(String s) {
    final List<String> result = new ArrayList<String>();
    for (int i = 0, n = s.length(); i < n; i++) {
        final String character = s.substring(i, i + 1);
        final boolean isOtherSymbol =
                (int) Character.OTHER_SYMBOL == Character.getType(character.charAt(0));
        final boolean isNonUnicode = isOtherSymbol
                && character.getBytes()[0] == (byte) 63;
        if (isNonUnicode)
            result.add(character);
    }
    return result;
}
Question
Will this catch all invalid (non-UTF-8-encoded) characters?
Do any of you have a better (easier) solution?
Note: I checked URLDecoder with the following code:
final String[] test = new String[] {
    "v=abc%E4def",
    "v=abc%C3%A4def"
};
for (int i = 0, n = test.length; i < n; i++) {
    System.out.println(java.net.URLDecoder.decode(test[i], "UTF-8"));
    System.out.println(java.net.URLDecoder.decode(test[i], "ISO-8859-1"));
}
This prints:
v=abc?def
v=abcädef
v=abcädef
v=abcädef
and it does not throw an IllegalArgumentException. Sigh.
I asked the same question:
Handling Character Encoding in URI on Tomcat
I recently found a solution that works pretty well for me; you might want to give it a try. Here is what you need to do:
Leave your URI encoding as Latin-1. On Tomcat, add URIEncoding="ISO-8859-1" to the Connector in server.xml.
If you have to manually URL decode, use Latin1 as charset also.
Use the fixEncoding() function to fix up encodings.
For example, to get a parameter from query string,
String name = fixEncoding(request.getParameter("name"));
You can always do this; a string that is already correctly encoded is not changed.
The code is attached. Good luck!
public static String fixEncoding(String latin1) {
    try {
        byte[] bytes = latin1.getBytes("ISO-8859-1");
        if (!validUTF8(bytes))
            return latin1;
        return new String(bytes, "UTF-8");
    } catch (UnsupportedEncodingException e) {
        // Impossible, throw unchecked
        throw new IllegalStateException("No Latin1 or UTF-8: " + e.getMessage());
    }
}

public static boolean validUTF8(byte[] input) {
    int i = 0;
    // Check for BOM
    if (input.length >= 3 && (input[0] & 0xFF) == 0xEF
            && (input[1] & 0xFF) == 0xBB && (input[2] & 0xFF) == 0xBF) {
        i = 3;
    }

    int end;
    for (int j = input.length; i < j; ++i) {
        int octet = input[i];
        if ((octet & 0x80) == 0) {
            continue; // ASCII
        }

        // Check for UTF-8 leading byte
        if ((octet & 0xE0) == 0xC0) {
            end = i + 1;
        } else if ((octet & 0xF0) == 0xE0) {
            end = i + 2;
        } else if ((octet & 0xF8) == 0xF0) {
            end = i + 3;
        } else {
            // Not a valid leading byte (0xF8-0xFF)
            return false;
        }
        if (end >= input.length) {
            // Truncated multi-byte sequence
            return false;
        }
        while (i < end) {
            i++;
            octet = input[i];
            if ((octet & 0xC0) != 0x80) {
                // Not a valid trailing byte
                return false;
            }
        }
    }
    return true;
}
EDIT: Your approach doesn't work, for various reasons. When there are encoding errors, you can't count on what you are getting from Tomcat. Sometimes you get � or ?. Other times you get nothing at all: getParameter() returns null. Say you could check for "?": what happens when your query string contains a valid "?"?
Besides, you shouldn't reject any request. This is not your user's fault. As I mentioned in my original question, the browser may encode the URL in either UTF-8 or Latin-1; the user has no control over this. You need to accept both. Changing your servlet to Latin-1 preserves all the characters, even if they are wrong, which gives us a chance to fix them up or throw them away.
The solution I posted here is not perfect but it's the best one we found so far.
You can use a CharsetDecoder configured to throw an exception if invalid chars are found:
CharsetDecoder UTF8Decoder =
        Charset.forName("UTF8").newDecoder().onMalformedInput(CodingErrorAction.REPORT);
See CodingErrorAction.REPORT
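For example, a small validity check built on such a decoder (a sketch; StandardCharsets assumes Java 7+, on older JVMs use Charset.forName("UTF-8") instead):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true if the bytes form well-formed UTF-8.
    public static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;  // malformed or unmappable input
        }
    }
}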
This is what I used to check the encoding:
CharsetDecoder ebcdicDecoder = Charset.forName("IBM1047").newDecoder();
ebcdicDecoder.onMalformedInput(CodingErrorAction.REPORT);
ebcdicDecoder.onUnmappableCharacter(CodingErrorAction.REPORT);

CharBuffer out = CharBuffer.wrap(new char[3200]);
// decode() reports malformed/unmappable input via an error result, and
// overflow means the output buffer was too small; underflow after all
// input has been consumed is the success case, not an error.
CoderResult result = ebcdicDecoder.decode(ByteBuffer.wrap(bytes), out, true);
if (result.isError() || result.isOverflow()) {
    System.out.println("Cannot decode EBCDIC");
} else {
    result = ebcdicDecoder.flush(out);
    if (result.isOverflow())
        System.out.println("Cannot decode EBCDIC");
    if (result.isUnderflow())
        System.out.println("EBCDIC decoded successfully");
}
Edit: updated with Vouze's suggestion.
Replace all control characters with the empty string:
value = value.replaceAll("\\p{Cntrl}", "");
URLDecoder will decode to a given encoding. This should flag errors appropriately. However the documentation states:
There are two possible ways in which this decoder could deal with illegal strings. It could either leave illegal characters alone or it could throw an IllegalArgumentException. Which approach the decoder takes is left to the implementation.
So you should probably try it. Note also (from the decode() method documentation):
The World Wide Web Consortium Recommendation states that UTF-8 should be used. Not doing so may introduce incompatibilities
so there's something else to think about!
EDIT: Apache Commons' URLCodec claims to throw appropriate exceptions for bad encodings.
I've been working on a similar "guess the encoding" problem. The best solution involves knowing the encoding. Barring that, you can make educated guesses to distinguish between UTF-8 and ISO-8859-1.
To answer the general question of how to detect if a string is properly encoded UTF-8, you can verify the following things:
No byte is 0x00, 0xC0, 0xC1, or in the range 0xF5-0xFF.
Tail bytes (0x80-0xBF) are always preceded by a head byte 0xC2-0xF4 or another tail byte.
Head bytes should correctly predict the number of tail bytes (e.g., any byte in 0xC2-0xDF should be followed by exactly one byte in the range 0x80-0xBF).
If a string passes all those tests, then it's interpretable as valid UTF-8. That doesn't guarantee that it is UTF-8, but it's a good predictor.
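A sketch of those rules in Java (a heuristic implementing exactly the byte ranges listed above, not a full decoder; it does not reject overlong 3- and 4-byte forms or surrogates):

public static boolean looksLikeUtf8(byte[] bytes) {
    int expectedTail = 0;                            // tail bytes still owed
    for (byte raw : bytes) {
        int b = raw & 0xFF;
        if (b == 0x00 || b == 0xC0 || b == 0xC1 || b >= 0xF5) {
            return false;                            // never valid in UTF-8
        }
        if (expectedTail > 0) {
            if (b < 0x80 || b > 0xBF) return false;  // missing tail byte
            expectedTail--;
        } else if (b >= 0x80 && b <= 0xBF) {
            return false;                            // stray tail byte
        } else if (b >= 0xC2 && b <= 0xDF) {
            expectedTail = 1;                        // 2-byte head
        } else if (b >= 0xE0 && b <= 0xEF) {
            expectedTail = 2;                        // 3-byte head
        } else if (b >= 0xF0 && b <= 0xF4) {
            expectedTail = 3;                        // 4-byte head
        }
        // anything else (0x01-0x7F) is plain ASCII and needs no tail
    }
    return expectedTail == 0;                        // no truncated sequence
}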
Legal input in ISO-8859-1 will likely have no control characters (0x00-0x1F and 0x80-0x9F) other than line separators. Looks like 0x7F isn't defined in ISO-8859-1 either.
(I'm basing this off of Wikipedia pages for UTF-8 and ISO-8859-1.)
You might want to include a known parameter in your requests, e.g. "...&encTest=ä€", to safely differentiate between the different encodings.
You need to set up the character encoding from the start. Try sending the proper Content-Type header, for example Content-Type: text/html; charset=utf-8, to fix the right encoding. Standard conformance refers to UTF-8 and UTF-16 as the proper encodings for web services. Examine your response headers.
Also, on the server side (in case the browser does not properly handle the encoding sent by the server), force the encoding by allocating a new String. You can also check each byte in the encoded UTF-8 string by computing each_byte & 0x80 and verifying that the result is non-zero.
boolean utfEncoded = true;
byte[] strBytes = queryString.getBytes();
for (int i = 0; i < strBytes.length; i++) {
    if ((strBytes[i] & 0x80) != 0) {
        continue;
    } else {
        /* treat the string as non utf encoded */
        utfEncoded = false;
        break;
    }
}
String realQueryString = utfEncoded ?
        queryString : new String(queryString.getBytes(), "iso-8859-1");
Also, take a look at this article; I hope it helps you.
The following regular expression might be of interest to you:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/185624
I use it in Ruby as follows:
module Encoding
  UTF8RGX = /\A(
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
    |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*\z/x unless defined? UTF8RGX

  def self.utf8_file?(fileName)
    count = 0
    File.open("#{fileName}").each do |l|
      count += 1
      unless utf8_string?(l)
        puts count.to_s + ": " + l
      end
    end
    return true
  end

  def self.utf8_string?(a_string)
    UTF8RGX === a_string
  end
end
Try to use UTF-8 as the default everywhere you can touch it (database, memory, and UI).
A single charset encoding can reduce a lot of problems, and it can actually speed up your web server's performance: a great deal of processing power and memory is wasted on encoding/decoding.
