Dealing with special characters in a URL using Java - java

I've written a Java program to generate an m3u file based on a CD ripped from k3b which pretty much preserves special character encodings in artist, album and track names. I then place these m3u files on a server and generate a GWT web application where the m3u file name is the target of an HTML anchor tag. For 99+% of cases, this all works perfectly. For a few cases, special characters cause the link to fail.
One failing example is the Movits! album Äppelknyckarjazz (note the first character which gets encoded by a URI constructor as %C3%84). Since the client is GWT, view source does not show the link, :-( But when hovering over the link Firefox shows the correctly decoded URL. When clicking on the link, Firefox fails with: "...Äppelknyckarjazz.m3u was not found on this server" It is as though different character encoding schemes are at play but frankly my brain is hurting in trying to unravel the puzzle at this level.
So there are really two questions:
1) Is my problem an encoding scheme issue?
2) Assuming it is, how can I maintain consistency given the various pieces of the application (Java m3u generator, GWT client, Firefox browser, Apache web server)?

String result = java.net.URLEncoder.encode("Äppelknyckarjazz", "UTF-8");
I think this is a solution for you.

Ä can be encoded as %C3%84 (UTF-8) or %C4 (Latin-1). It sounds like you are using a mixture of Latin-1 and UTF-8. You need to make sure the same encoding is used across all your systems.
In the rare case that you can't control the encoding, see my answer to this question,
How to determine if a String contains invalid encoded characters
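To make the two encodings above concrete, here is a minimal sketch using java.net.URLEncoder. The class name EncodingComparison and the helper encode are made up for illustration, and Ä is written as the escape \u00C4 so the result does not depend on how the source file itself is saved.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EncodingComparison {

    // Hypothetical helper: percent-encodes text using the named charset.
    static String encode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String album = "\u00C4ppelknyckarjazz"; // "Äppelknyckarjazz"

        // UTF-8 turns Ä into two bytes, Latin-1 into a single byte.
        System.out.println(encode(album, "UTF-8"));      // %C3%84ppelknyckarjazz
        System.out.println(encode(album, "ISO-8859-1")); // %C4ppelknyckarjazz
    }
}
```

If the link is generated with one charset and the server decodes the request path with the other, the decoded name will not match the file on disk, which would be consistent with the "not found" error described in the question.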

First you have to declare a charset on your HTML page. UTF-8 is the best choice.
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Then you should configure your webserver to interpret requests from clients as UTF-8. When using tomcat, set the URIEncoding-parameter on your Connector-tag:
<Connector port="8080" protocol="HTTP/1.1" URIEncoding="UTF-8" />

Related

How should I store Unicode file paths in MySQL and on FTP?

I have written a web application where users from all over the world can upload files. These may have different character encodings. For that reason I have defined my MySQL database in UTF-8, so it renders all special chars correctly.
However, when a user tries to access a file, the web browser says that it cannot be found, with a weird encoded result. I have tried some different encoding approaches (such as URLEncoder/Decoder), but there are still some use-cases where they don't work.
I want to know how Dropbox, Google, etc. solve this encoding problem.
Do they save the string in a form like test+test%C3%BC.txt?
Do they rename the file and store it with a secure name (like 123890123.file)?
Do they use some other technique?
I am also wondering whether the URLEncoder is the best approach to get things working. For example, it replaces a space with a +, instead of with %20 — why does it do that if no browser can handle a plus for a space?
I need to keep the base URL the same, and encode only the filename.
For example,
www.example.com/folder/blä äh.txt
should be encoded as
www.example.com/folder/bl%..+%...txt
How can I achieve this?
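One possible approach, sketched below, is to let java.net.URI do the quoting of the path instead of URLEncoder. The class name FilePathEncoder, the helper toAsciiUrl, and the choice of the http scheme are assumptions made for the example. Note that this produces %20 for a space rather than the + shown in the question, since + is a form-encoding convention and a literal character in a URL path.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class FilePathEncoder {

    // Hypothetical helper: builds an ASCII-only URL from a host and an unencoded path.
    // The multi-argument URI constructor quotes illegal characters such as spaces;
    // toASCIIString() then percent-encodes non-ASCII characters as UTF-8.
    static String toAsciiUrl(String host, String rawPath) {
        try {
            URI uri = new URI("http", host, rawPath, null);
            return uri.toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build URI for path: " + rawPath, e);
        }
    }

    public static void main(String[] args) {
        // "\u00E4" is "ä"; written as an escape to avoid source-encoding surprises.
        String url = toAsciiUrl("www.example.com", "/folder/bl\u00E4 \u00E4h.txt");
        System.out.println(url); // http://www.example.com/folder/bl%C3%A4%20%C3%A4h.txt
    }
}
```

The slashes in the path are left alone, so the base URL stays the same and only the characters that need escaping are changed.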

JSP contentType and pageEncoding not working, I've tried it all

I'm completely puzzled. I've set my Apache and Tomcat config files, Java Servlet project and JSP pages, ALL OF IT, to "UTF-8" to support Spanish characters (á, í, ó, etc). I've systematically followed all guidelines found in the documentation and forums. UTF-8 seemed to be the easier choice, but after 4 days of trial and error I've decided (since my servlet will support only Spanish characters) to switch to "ISO-8859-1", which is actually mostly working.
The only problem is that my JSP pages alone still say "UTF-8" when I right click --> Properties. The page directive and meta tag are correct (ISO-8859-1), but when I open the page, the browser says "Windows-1252".
I have no idea why this is happening. If I switch to "UTF-8" (all of it, including server config, Java project, etc), the characters appear garbled in the browser, e.g.: "puntuaciÃ³n" instead of "puntuación".
So, to iron this question out...
Does anyone know how to implement UTF-8 correctly and make Spanish characters work everywhere?
or
Does anyone know how to change JSP pages to be "ISO-8859-1" everywhere? Right now, it's UTF-8 in the Properties window, ISO-8859-1 in the @page directive (contentType charset and pageEncoding), and Windows-1252 in the browser.
As always, I'm more than grateful in advance for your patience and support.
After adding the "CharsetFilter" class to my project, everything worked just fine.
Follow these guidelines: How to get UTF-8 working in Java webapps?
PS: I've completely removed all lines mentioning:
response.setCharacterEncoding("UTF-8");
response.setContentType("text/html;charset=UTF-8");
But I kept on the JSPs:
Happy coding!

How to configure content type and content encoding using glassfish server?

I am building a web application using JSF and Spring in eclipse indigo using Glassfish server 3.1.2 . Everything is going great but it is showing me this error in firebug in 2 JavaScript files.
When I checked those files, I didn't find any illegal characters in them, but Firebug still shows this.
I have used these files in an ASP.NET project and they caused no problems there, so I compared their content types in both projects and found that in the ASP.NET project these files have
Content-Type = application/x-javascript
whereas in my JSP-Spring (Java) project they have
Content-Type = text/javascript;charset=ISO-8859-1
So you can see that the same files are served with a different content type. I found that this can be changed through configuration in the GlassFish server, so I want to change the Content-Type of my JS files to match the one in the ASP.NET project.
If anyone has another solution, please share it, because I haven't found any solution other than changing this in the GlassFish server. Thanks
Those strange characters you are seeing are the UTF-8 Byte Order Mark. They are a special set of bytes that indicate how a document is encoded. Unfortunately, when interpreted as ISO-8859-1, you wind up with the problem you have. There are two ways to resolve this.
The first way is to change the output character set to UTF-8. I believe this can be done in your server configuration, in your web.xml configuration, or by setting the character set on the HTTP response object in code; for example: yourServletResponse.setCharacterEncoding("UTF-8");
The second way is to remove the BOM from your Javascript files. You can do this by opening them in Notepad++, going to Encoding > Convert to ANSI, and then saving them. Alternatively, open them in Notepad, then go to Save As and ensure that the Encoding option is set to ANSI before saving them. Note that this may cause issues if you have non-ISO-8859-1 text in your Javascript files, although this is unlikely.
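If editing the files by hand is impractical, the BOM can also be detected and removed programmatically. The following is a minimal sketch, assuming the file content is already available as a byte array; the class name BomStripper and its methods are hypothetical names chosen for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BomStripper {

    // The UTF-8 byte order mark is the three bytes EF BB BF.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Returns true if the data starts with the UTF-8 BOM.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= UTF8_BOM.length
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2];
    }

    // Returns a copy without the leading BOM, or the input unchanged if there is none.
    static byte[] stripUtf8Bom(byte[] data) {
        return hasUtf8Bom(data) ? Arrays.copyOfRange(data, UTF8_BOM.length, data.length) : data;
    }

    public static void main(String[] args) {
        byte[] script = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'v', 'a', 'r'};

        // Decoded as ISO-8859-1, the BOM becomes three visible junk characters
        // (U+00EF, U+00BB, U+00BF) in front of the script text.
        String misread = new String(script, StandardCharsets.ISO_8859_1);
        System.out.println(misread.length()); // 6: three junk characters plus "var"

        byte[] cleaned = stripUtf8Bom(script);
        System.out.println(new String(cleaned, StandardCharsets.ISO_8859_1)); // var
    }
}
```

Unlike converting the whole file to ANSI, this only drops the first three bytes, so any UTF-8 text later in the file is left as it was.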

Encountering encoding issues on linux box, not Windows

I'm running into an encoding issue that has stumped me for a few weeks and nothing seems to work. I have a website that works fine on my local machine, but when I push the jsp files to a Linux box for review, characters that previously rendered fine are now displaying as funky characters.
For some reason, some characters display just fine, but other characters will not encode properly. All text on the page is being read from java .properties files and output to the page using beans.
I've added a meta tag to the page to set the encoding, which did nothing. I also added <%@ page contentType="text/html; charset=UTF-8" pageEncoding="UTF-8"%> but this did nothing on the Linux box and actually made the encoding errors appear on my local Windows machine.
Any help would be greatly appreciated.
Check that the method loading the properties is using the character encoding that the property files are actually written in.
Without setting this explicitly, the platform's default encoding is used, which is ISO-Latin-1 (or its Windows-1252 variant) on Windows, and UTF-8 on some Linux distributions.
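As a minimal sketch of loading with an explicit encoding, assuming the property files are saved as UTF-8: the class name PropertiesEncodingDemo and both helper methods are hypothetical, and the byte array stands in for the file content. One detail worth knowing is that java.util.Properties.load(InputStream) is documented to read ISO-8859-1, so UTF-8 files need the Reader overload.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class PropertiesEncodingDemo {

    // Hypothetical helper: decodes the bytes explicitly as UTF-8 before parsing.
    static Properties loadAsUtf8(byte[] fileBytes) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Hypothetical helper: uses the InputStream overload, which assumes ISO-8859-1.
    static Properties loadAsLatin1(byte[] fileBytes) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(fileBytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Stand-in for a UTF-8 encoded .properties file; \u00F3 is "ó".
        byte[] fileBytes = "label=puntuaci\u00F3n\n".getBytes(StandardCharsets.UTF_8);

        String good = loadAsUtf8(fileBytes).getProperty("label");
        String bad = loadAsLatin1(fileBytes).getProperty("label");

        System.out.println(good.length()); // 10: the accented letter is one character
        System.out.println(bad.length());  // 11: it was split into two wrong characters
    }
}
```

If the files cannot be read through a Reader, keeping them pure ASCII with \uXXXX escapes is another option that works with either overload.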
The following need to play together for character encoding to work properly in Nixes and Nuxes:
file system encoding
database encoding (does not seem to apply)
database connector encoding
Java-internal string encoding (UTF-16, if I remember correctly)
Java output encoding
HTML page encoding
With your page directive, you only addressed the last bullet. In other words, you are instructing the browser to decode the page as UTF-8, but that's not what you are sending.
Take a look at this (admittedly a few years old) paper, chapter 11 in particular.
Also, check the physical files on both machines. I've seen several FTP clients muck up files during transfer. A quick check is to push your file as html instead of jsp. You'll get garbage for all the <% %> sequences, but the other text should show up unchanged. You've also taken the app server out of the equation. If the text is still funky, it's your FTP or WebDAV client trying to "help".
Look at the http headers sent by the server. That is the first place the browser looks for encoding before anything else.

Unicode/Japanese characters in a Java applet

I'm writing an applet that's supposed to show both English and Japanese (unicode) characters on a JLabel. The Japanese characters show up fine when I run the applet on my system, but all I get is mojibake when I run it from the web page. The page can display Japanese characters if they're hard-coded into the HTML, but not in the applet. I'm pretty sure I've seen this sort of thing working before. Is there anything I can do in the Java code to fix this?
My first guess would be that the servlet container is not sending back the right character set for your webapp resources. Have a look at the response in an HTTP sniffer to see what character set is included - if the response says that the charset is e.g. CP-1252, then Japanese characters would not be decoded correctly.
You may be able to fix this in code by explicitly setting a Content-Type header with the right charset; but I'd argue it's more appropriate to fix the servlet container's config to return the correct character set for the relevant resources.
Well I'm not sure what was causing the problem, but I set EVERYTHING to read in and display out in UTF-8 and it seems to work now.
