How do I encode URI parameter values? - java

I want to send a URI as the value of a query/matrix parameter. Before I can append it to an existing URI, I need to encode it according to RFC 2396. For example, given the input:
http://google.com/resource?key=value1 & value2
I expect the output:
http%3a%2f%2fgoogle.com%2fresource%3fkey%3dvalue1%2520%26%2520value2
Neither java.net.URLEncoder nor java.net.URI will generate the right output. URLEncoder is meant for HTML form encoding which is not the same as RFC 2396. URI has no mechanism for encoding a single value at a time so it has no way of knowing that value1 and value2 are part of the same key.

Jersey's UriBuilder encodes URI components using application/x-www-form-urlencoded and RFC 3986 as needed. According to the Javadoc:
Builder methods perform contextual encoding of characters not permitted in the corresponding URI component following the rules of the application/x-www-form-urlencoded media type for query parameters and RFC 3986 for all other components. Note that only characters not permitted in a particular component are subject to encoding so, e.g., a path supplied to one of the path methods may contain matrix parameters or multiple path segments since the separators are legal characters and will not be encoded. Percent encoded values are also recognized where allowed and will not be double encoded.

You could also use Spring's UriUtils

I don't have enough reputation to comment on answers, but I just wanted to note that downloading the JSR-311 API by itself will not work. You need to download the reference implementation (Jersey).
Downloading only the API from the JSR page will give you a ClassNotFoundException when the API tries to look for an implementation at runtime.

I wrote my own, it's short, super simple, and you can copy it if you like:
http://www.dmurph.com/2011/01/java-uri-encoder/

It seems that CharEscapers from the Google GData-java-client has what you want. It has a uriPathEscaper method, a uriQueryStringEscaper method, and a generic uriEscaper method. (All of them return an Escaper object, which does the actual escaping.) It is released under the Apache License.

I think that the URI class is the one that you are looking for.

Hmm, I know you've already discarded URLEncoder, but despite what the docs say, I decided to give it a try.
You said:
For example, given an input:
http://google.com/resource?key=value
I expect the output:
http%3a%2f%2fgoogle.com%2fresource%3fkey%3dvalue
So:
C:\oreyes\samples\java\URL>type URLEncodeSample.java
import java.net.*;
public class URLEncodeSample {
public static void main( String [] args ) throws Throwable {
System.out.println( URLEncoder.encode( args[0], "UTF-8" ));
}
}
C:\oreyes\samples\java\URL>javac URLEncodeSample.java
C:\oreyes\samples\java\URL>java URLEncodeSample "http://google.com/resource?key=value"
http%3A%2F%2Fgoogle.com%2Fresource%3Fkey%3Dvalue
As expected.
What would be the problem with this?
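One difference shows up as soon as the value contains a space, as in the original question. URLEncoder implements application/x-www-form-urlencoded, so a space becomes + rather than %20. The sketch below is one common workaround, not a complete RFC 2396 encoder: it form-encodes a single value and then rewrites + as %20. The class and method names (QueryValueEncoder, encodeValue) are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class QueryValueEncoder {
    // Form-encodes one value, then turns '+' (the form-style space) into "%20".
    // A literal '+' in the input has already become "%2B" at that point,
    // so the replacement only touches encoded spaces.
    static String encodeValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) throws Exception {
        String raw = "http://google.com/resource?key=value1 & value2";
        System.out.println(URLEncoder.encode(raw, "UTF-8"));
        // prints http%3A%2F%2Fgoogle.com%2Fresource%3Fkey%3Dvalue1+%26+value2
        System.out.println(encodeValue(raw));
        // prints http%3A%2F%2Fgoogle.com%2Fresource%3Fkey%3Dvalue1%20%26%20value2
    }
}
```

The hex digits come out in upper case, which is equivalent to the lower-case form shown in the question.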

Related

How can I get value after hashtag from URL in Java

I have a URL and I want to print in my graphical user interface the ID value after the hashtag.
For example, we have www.site.com/index.php#hello and I want to print hello value on a label in my GUI.
How can I do this using Java in Netbeans?
Simple solution is getRef() in URL class:
URL url = new URL("http://www.anyhost.com/index.php#hello");
jLabel.setText(url.getRef());
EDIT: According to @Henry's comment:
I would recommend to use the java.net.URI as it also deals with encoding. The Javadocs say: "Note, the URI class does perform escaping of its component fields in certain circumstances. The recommended way to manage the encoding and decoding of URLs is to use URI, and to convert between these two classes using toURI() and URI.toURL()."
and this comment:
Why not just doing uri.getFragment()
URI uri = new URI("http://www.anyhost.com/index.php#hello");
jLabel.setText(uri.getFragment());
Use the String.split() Method.
public static String getId(String url) {
return url.split("#")[1];
}
String.split() returns an array of Strings that are delimited, or "Split," by the value you pass to it, or in this case #.
Because you want only the string after the #, you can just use the second item in the array that it returns by adding [1] to the end of it.
For more on String.split() go to Tutorials Point.
By the way, the part of the URL you are referencing is called the fragment identifier. It usually matches an element ID and is used to jump to that element on a webpage.
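A minimal sketch of the URI-based approach, written so that a URL without a # is handled too (url.split("#")[1] throws an ArrayIndexOutOfBoundsException in that case, and getFragment() returns null). The class and method names (FragmentReader, fragmentOf) are hypothetical.

```java
import java.net.URI;

final class FragmentReader {
    // Returns the text after '#', or an empty string when there is no fragment.
    static String fragmentOf(String url) {
        String fragment = URI.create(url).getFragment();
        return fragment == null ? "" : fragment;
    }

    public static void main(String[] args) {
        System.out.println(fragmentOf("http://www.anyhost.com/index.php#hello"));
        // prints hello
        System.out.println(fragmentOf("http://www.anyhost.com/index.php").isEmpty());
        // prints true
    }
}
```

The result can then be passed to jLabel.setText(...) as in the answers above.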

UTF-8 for URL, Java

So I'm trying to scrape a grammar website that gives you conjugations of verbs, but I'm having trouble accessing the pages that require accents, such as the page for the verb "fág".
Here is my current code:
String url = "http://www.teanglann.ie/en/gram/"+ URLEncoder.encode("fág","UTF-8");
System.out.println(url);
I've tried this both with and without the URLEncoder.encode() method, and it just keeps giving me a '?' in place of the 'á' when working with it, and my URL search returns nothing. Basically, I was wondering if there was something similar to Python's 'urllib.parse.quote_plus'. I've tried searching and tried many different methods from StackOverflow, all to no avail. Any help would be greatly appreciated.
Eventually, I'm going to replace the given string with a user-supplied argument. I'm just using it to test at the moment.
Solution: It wasn't Java, but IntelliJ.
Summary from comment
The test code works fine.
import java.io.UnsupportedEncodingException;
import static java.net.URLEncoder.encode;
public class MainApp {
public static void main(String[] args) throws UnsupportedEncodingException {
String url = "http://www.teanglann.ie/en/gram/"+ encode("fág", "UTF-8");
System.out.println(url);
}
}
It prints:
http://www.teanglann.ie/en/gram/f%C3%A1g
which resolves to the correct page.
The correct steps are:
Ensure that the source code encoding is correct. (IntelliJ probably cannot always guess it correctly.)
Run the program with the appropriate encoding (UTF-8 in this case). See "What is the default encoding of the JVM?" for a relevant discussion.
Edit from Wyzard's comment
The code above works by accident (for example, because the word contains no whitespace). The correct way to get an encoded URL is shown below:
..
String url = "http://www.teanglann.ie/en/gram/fág";
System.out.println(new URI(url).toASCIIString());
This uses URI.toASCIIString(), which adheres to RFC 2396 (Uniform Resource Identifiers (URI): Generic Syntax).
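A short sketch of this approach follows. Note that the single-argument URI constructor rejects a space with a URISyntaxException, so the second helper builds the URI from components, which lets java.net.URI quote the illegal character. The class name, the helper names, and the second verb ("fág mór") are invented for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class AsciiUrlExample {
    // Percent-encodes non-ASCII characters in an otherwise legal URL.
    static String encodeUrl(String url) {
        return URI.create(url).toASCIIString();
    }

    // Builds the URL from components so that characters that are illegal
    // in a path, such as a space, are quoted instead of rejected.
    static String encodePath(String host, String path) {
        try {
            return new URI("http", host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // The escapes \u00e1 (a-acute) and \u00f3 (o-acute) keep this file pure
        // ASCII, so the result does not depend on the source encoding.
        System.out.println(encodeUrl("http://www.teanglann.ie/en/gram/f\u00e1g"));
        // prints http://www.teanglann.ie/en/gram/f%C3%A1g
        System.out.println(encodePath("www.teanglann.ie", "/en/gram/f\u00e1g m\u00f3r"));
        // prints http://www.teanglann.ie/en/gram/f%C3%A1g%20m%C3%B3r
    }
}
```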

apache commons-validator alternative for new gTLDS

I need to validate emails and domains. I just need a formal validation, no whois or other forms of domain lookup needed.
Currently I'm using Apache's commons-validator v1.4.0.
Unfortunately my customers use the new gTLDs, like .bike or .productions that are not yet supported by the DomainValidator class.
See Apache's Jira issue for more details.
Are there any sound alternatives that I may easily include in my Maven POM?
If you are not concerned about internationalized addresses, you could replace the last part of the address (the TLD) and continue to use Apache Commons.
This approach is based on the fact that whatever the TLD is, the validity of the whole domain name is equivalent to the validity of the same domain name with the TLD replaced with com. For example:
abc.def.com is valid. Similarly abc.def.name, abc.def.xx--kput3i, abc.def.uk are valid.
ab,de.com is not valid. Similarly ab,de.name, ab,de.xx-kput3i, ab,de.uk are not valid.
So instead of calling
return EmailValidator.getInstance().isValid(userEmail);
You can call
if ( userEmail == null ) {
return false;
}
return EmailValidator.getInstance().isValid(userEmail.trim().replaceFirst("\\.\\p{Alpha}[\\p{Alnum}-]*\\p{Alnum}$", ".com"));
Explanation
The regular expression "\\.\\p{Alpha}[\\p{Alnum}-]*\\p{Alnum}$" checks for the TLD part: it's at the end of the string (because of the $), it starts with a dot and contains no other dot, and it conforms to the standards: begins with an ASCII Alpha character, followed by zero or more alphanumerics or dashes, and ends with an alphanumeric character.
I am using trim() because until now, if you used EmailValidator, it allows spaces before and after the address. Removing the spaces just makes it easier to replace the TLD, and it shouldn't matter as far as the validity of the address is concerned.
If the string doesn't have a valid TLD at the end, String.replaceFirst() will return it as is. It could still be valid, because email addresses of the format x@[n.n.n.n], where n.n.n.n is a valid IP address, are valid. So basically, if you didn't find a TLD, you let EmailValidator decide the validity issue itself.
Of course, if the TLD is not an IANA-recognized TLD, this validation will not tell you that. An e-mail like david@galaxy.hoopie-frood will be accepted as legal, but IANA doesn't have that TLD as yet.
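To see what the replacement step does on its own, here is a small sketch that uses only the JDK. It applies the same regular expression but leaves out the call to EmailValidator, so it shows the rewritten string rather than a validity verdict. The class and method names (TldSwap, withComTld) are hypothetical.

```java
final class TldSwap {
    // Same pattern as in the answer: a final label that starts with an ASCII
    // letter and ends with an ASCII letter or digit.
    static final String TLD_PATTERN = "\\.\\p{Alpha}[\\p{Alnum}-]*\\p{Alnum}$";

    // Trims the input and replaces a trailing TLD, if any, with ".com".
    static String withComTld(String address) {
        return address.trim().replaceFirst(TLD_PATTERN, ".com");
    }

    public static void main(String[] args) {
        System.out.println(withComTld("user@example.bike"));   // user@example.com
        System.out.println(withComTld("ab,de.productions"));   // ab,de.com
        System.out.println(withComTld(" x@[10.0.0.1] "));      // x@[10.0.0.1]
    }
}
```

The rewritten string is what would then be handed to EmailValidator.getInstance().isValid(...).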
Checking a domain is similar, without the trim() part:
if (userDomain == null ) {
return false;
}
return DomainValidator.getInstance().isValid(userDomain.replaceFirst("\\.\\p{Alpha}[\\p{Alnum}-]*\\p{Alnum}$", ".com"));
I have also tried JavaMail's email address validation, but I don't really like it: it allows completely invalid domain names such as net-name.net- (ending with a dash) or IP addresses (which are not allowed for e-mail without square brackets around them), and it's only good for e-mail addresses, not for domains.
Internationalization
If you need to check for internationalized domains and e-mails, it's a bit different. It's easy to check for internationalized domains (for example 元気。テスト). All you need to do is convert them to ASCII with java.net.IDN.toASCII() (yielding xn--z4qx76d.xn--zckzah for my example domain - this is a valid TLD), and then do the same as I wrote above.
Internationalized e-mails are a different story. If the local part is ASCII, you can convert the domain part to ASCII. If you have to display the email address, you need to use the Unicode version, and if you have to send an email message, you use the ASCII version.
But recently a standard has been introduced for internationalized local parts as well, which also allows sending to the unicode version of the domain name without translating it to ASCII first. Whether you want to support that or not requires some thought, as not many mail servers and mail transfer agents support it at the moment.
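The ASCII conversion described above can be sketched with java.net.IDN. This is only an illustration under two assumptions: the local part of the e-mail address is already ASCII, and the address contains at most one meaningful @ (the last one). The class and method names are made up.

```java
import java.net.IDN;

final class IdnExample {
    // Converts an internationalized domain name to its ASCII (Punycode) form.
    static String asciiDomain(String domain) {
        return IDN.toASCII(domain);
    }

    // Converts only the domain part of an address; the local part is
    // assumed to be ASCII already and is left untouched.
    static String asciiEmail(String email) {
        int at = email.lastIndexOf('@');
        if (at < 0) {
            return email; // no domain part to convert
        }
        return email.substring(0, at + 1) + IDN.toASCII(email.substring(at + 1));
    }

    public static void main(String[] args) {
        // Unicode escapes for the example domain from the answer above
        // (two kanji, an ideographic full stop U+3002, three katakana).
        String domain = "\u5143\u6c17\u3002\u30c6\u30b9\u30c8";
        System.out.println(asciiDomain(domain));
        System.out.println(asciiEmail("david@example.\u30c6\u30b9\u30c8"));
    }
}
```

The ASCII result can then go through the same TLD replacement and validation as any other domain.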
Copied the implementation from DomainValidator and replaced the TOP_LABEL_REGEX expression with "\\p{Alpha}[\\p{Alnum}-]*\\p{Alpha}".
In addition, I removed validation against the hard coded list of approved gTLDs. This is, basically, quite weak in that it doesn't validate against the actual domains. But I think it's good enough (catches the gTLDs similar to XN--YGBI2AMMX).
See full list of approved gTLDs here.
// Copied from org.apache.commons.validator.routines.DomainValidator
private static final String DOMAIN_LABEL_REGEX = "\\p{Alnum}(?>[\\p{Alnum}-]*\\p{Alnum})*";
// Changed to include new gTLD - http://data.iana.org/TLD/tlds-alpha-by-domain.txt
private static final String TOP_LABEL_REGEX = "\\p{Alpha}[\\p{Alnum}-]*\\p{Alpha}";
// Copied from org.apache.commons.validator.routines.DomainValidator
private static final String DOMAIN_NAME_REGEX = "^(?:" + DOMAIN_LABEL_REGEX + "\\.)+" + "(" + TOP_LABEL_REGEX + ")$";
private static final RegexValidator domainRegex = new RegexValidator(DOMAIN_NAME_REGEX);
private static final EmailValidator EMAIL_VALIDATOR = new EmailValidator();
public static boolean isValidDomain(String domain) {
String[] groups = domainRegex.match(domain);
return groups != null && groups.length > 0;
}
What I often do in this situation is check out the source code for the library in question (it's open source, remember?), modify it to suit my requirements, and then contribute the patch back to the project.
Your use case certainly sounds like it would be a useful contribution.
I made you a public suffix list Java API. The method PublicSuffixList.getRegistrableDomain() can be used for Domain validation:
PublicSuffixListFactory factory = new PublicSuffixListFactory();
PublicSuffixList suffixList = factory.build();
assertNull(suffixList.getRegistrableDomain("galaxy.hoopie-frood"));
assertNotNull(suffixList.getRegistrableDomain("example.bike"));
While DomainValidator is missing some of the new TLDs, for me the best solution was to update TLD.
DomainValidator.updateTLDOverride(ArrayType.COUNTRY_CODE_PLUS, new String[]{"someTLD"});
And then obtain the EmailValidator instance:
EmailValidator.getInstance(false, true)

Camel: How to include an ampersand as data in a URI (NOT as a delimiter)?

(Camel 2.9.2)
Very simple use case, but I can't seem to find the answer. My code boils down to this:
String user = "user";
String password = "foo&bar";
String uri = "smtp://hostname:25?username=" + user +
"&password=" + password +
"&to=somthing@something.com"; // etc. You get the idea
from("seda:queue:myqueue").to(uri);
Camel throws a ResolveEndpointFailedException with "Unknown parameters=[{bar=null}]."
If I try "foo%26bar," I get the same result.
If I try "foo&amp;bar", Camel responds with "Unknown parameters=[{amp;bar=null}]."
I tried using URISupport to create the URI. It escapes the & to %26, and then I get "Unknown parameters=[{bar=null}]" again.
Any ideas?
As from Camel 2.11 you could use raw syntax
For instance:
.to("ftp:joe@myftpserver.com?password=RAW(se+re?t&23)&binary=true")
In the above example, we have declared the password value as raw, and
the actual password would be as typed, e.g. se+re?t&23
https://cwiki.apache.org/confluence/display/CAMEL/How+do+I+configure+endpoints
You can specify the password as part of the authority of the URI, i.e. at the front.
Also, the & should be escaped as %26, but there was a bug in Camel that didn't parse the escaped value too well. Try 2.10 when it's out.
The RAW() syntax works, but it is Camel-proprietary syntax. In our use case it complicated the subsequent processing of the URI.
We used an alternative solution: a component configured to use raw URIs (Component.useRawUri() == true). Component parameters are then simply encoded once (foo%26bar) and pass through Camel unchanged. I consider this solution better, as percent-encoding is the standard way of expressing sensitive characters.

Substitute {0}, {1} .. {n} in a template with given varargs

Consider a string template of the following format:
String template = "The credentials you provided were username '{0}' with password '{1}'";
Substitution variable fields are of the form {n}, where n is a zero based index.
This is the template format used in Adobe Flex, see StringUtil.substitute(...). And also .NET, IIRC.
Since I want to re-use the templates used by the Flex code I'm looking for an Java equivalent. I'm aware of String.format(...) but the template structure is not identical.
What is the best way to get the same "Flex compatible" template functionality in Java?
Basically this is the desired end-result:
assert StringUtil.substitute(template, "powerUser", "difficultPassword").equals("The credentials you provided were username 'powerUser' with password 'difficultPassword'");
Use MessageFormat
You want java.text.MessageFormat http://download-llnw.oracle.com/javase/6/docs/api/java/text/MessageFormat.html
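One caveat when reusing the Flex templates as-is: MessageFormat treats a single quote as a quoting character, so '{0}' in the template above is emitted as the literal text {0} instead of being substituted. A minimal sketch of a wrapper that doubles the quotes first is shown below. The class name FlexStyleTemplate is hypothetical, and the sketch assumes the templates contain no other MessageFormat syntax (such as stray braces).

```java
import java.text.MessageFormat;

final class FlexStyleTemplate {
    // Substitutes {0}, {1}, ... in a Flex-style template.
    static String substitute(String template, Object... args) {
        // Double every single quote so MessageFormat prints it literally
        // instead of treating it as the start of quoted text.
        String pattern = template.replace("'", "''");
        return MessageFormat.format(pattern, args);
    }

    public static void main(String[] args) {
        String template =
            "The credentials you provided were username '{0}' with password '{1}'";
        System.out.println(substitute(template, "powerUser", "difficultPassword"));
        // prints The credentials you provided were username 'powerUser' with password 'difficultPassword'
    }
}
```

Non-String arguments such as numbers and dates are formatted according to the default locale, so converting them to strings beforehand may be preferable when exact output matters.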
