I have a Java webapp which reads from a file on disk and returns the needed values. The file on disk contains UTF-8 characters.
Example of the file content:
lähedus teeb korterist atraktiivse üüriobjekti välismaalastele
When the webapp is run on localhost, the servlet reads from disk and returns:
lähedus teeb korterist atraktiivse üüriobjekti välismaalastele
When I run the same app on a separate server the same request returns this:
l??hedus teeb korterist atraktiivse ????riobjekti v??lismaalastele
This is purely an encoding issue but I don't know how to solve it.
What I have tried:
I added this to Tomcat's conf/server.xml:
<Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000"
           redirectPort="8443" URIEncoding="UTF-8"/> <!-- THIS PART -->
But it didn't help.
What should I change in config to have it working on server as well?
Thanks!
EDIT
I am reading from a .txt file on the server containing JSON strings.
I am using a Java BufferedReader to read the content. As I mentioned in the comments, this problem is not caused by the reader, because the same code works on localhost.
I am sending the response via a servlet which just flushes the JSON string out. Again, the same story as with the reader.
I get the question marks on any client from which I make the request (browser, Android, etc.).
Your local file seems to be in UTF-8, wrongly converted somewhere to a single-byte encoding: each multi-byte UTF-8 sequence for a special character ends up as two unconvertible characters (??).
The application is reading it without specifying the encoding, hence falling back to the system's default encoding. That is not something you want.
You then need to find the code doing the wrong reading: often there is an overloaded method where one can pass the encoding. Notorious here is FileReader, a utility class that always uses the default encoding. Check occurrences of:
InputStreamReader
new String
String.getBytes
Scanner
For completeness, though probably not the cause here: any response yielding that text should also specify the charset in its Content-Type header.
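A minimal sketch of both points, assuming a placeholder file path and servlet name (not the asker's actual code):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class JsonServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // Read with an explicit charset instead of the platform default.
        String json = new String(
                Files.readAllBytes(Paths.get("/path/to/data.txt")), // placeholder path
                StandardCharsets.UTF_8);
        // Declare the charset so every client decodes the body the same way.
        resp.setContentType("application/json;charset=UTF-8");
        resp.getWriter().write(json);
    }
}

With both the read and the response pinned to UTF-8, the behavior no longer depends on whatever default encoding the server happens to have.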
Related
We are trying to parse a text file from AWS S3 (SDK 2) which has some accented characters like î. We are using the Camel Bindy @FixedLengthRecord format to map the file rows to our DTO, but these accented characters are getting mapped to question marks (?).
We are not sure yet about the source file's encoding, but Notepad++ shows it as ANSI and displays the character properly in the input file.
So far we have tried multiple approaches, like overriding the default charset with different ones (US-ASCII, cp1252):
System.setProperty("org.apache.camel.default.charset", "cp1252");
along with .convertBodyTo(String.class, "UTF-8") in our route definition, but none of them seems to work.
We tried reading the Camel documentation https://camel.apache.org/components/latest/file-component.html and similar questions on Stack Overflow but didn't find any matching solution yet; any other pointers will be highly appreciated.
Finally got a clue in the way Camel's AWS2S3Endpoint was reading the S3 objects. It was defaulting to UTF-8:
Reader reader = new BufferedReader(new InputStreamReader(s3Object, Charset.forName(StandardCharsets.UTF_8.name())));
This is being fixed in the latest 3.6.0 snapshot version, as mentioned on the Camel mailing list. We could test it successfully with the snapshot version, along with convertBodyTo(String.class, "ISO-8859-1") in the Camel route.
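For reference, a minimal sketch of that kind of route; the endpoint URI, DTO class, and log endpoint are placeholders, not our actual setup:

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.dataformat.bindy.fixed.BindyFixedLengthDataFormat;

public class S3FixedLengthRoute extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        from("aws2-s3://my-bucket?deleteAfterRead=false") // placeholder URI
            // Re-decode the raw body with the encoding the file actually uses.
            .convertBodyTo(String.class, "ISO-8859-1")
            // Then let Bindy map each fixed-length row onto the DTO.
            .unmarshal(new BindyFixedLengthDataFormat(MyRecordDto.class)) // placeholder DTO
            .to("log:parsed");
    }
}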
My issue is as follows:
I am having an issue with character encoding when writing to a text file: characters are not showing the intended value. For example, I am writing ' ' (which is probably a tab character), and 'Â' is what is displayed in the text file.
Background information
This data is being stored in an MSSQL database. The database collation is SQL_Latin1_General_CP1_CI_AS and the fields are varchar. I've come to learn that the collation and type determine what character encoding is used on the database side. Values are stored correctly, so no issues there.
My Java application runs queries to pull the data from the DB, and this also looks OK. I have debugged the code and seen that all the Strings have the correct representation before writing to the file.
Next I write the text to the .TXT file using an OutputStreamWriter, as follows:
public OfferFileBuilder(String clientAppName, boolean isAppend) throws IOException, URISyntaxException {
    String exportFileLocation = getExportedFileLocation();
    File offerFile = new File(getDatedFileName(exportFileLocation + "/" + clientAppName + "_OFFERRECORDS"));
    // Explicit UTF-8, so the writer does not fall back to the platform default encoding.
    bufferedWriter = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(offerFile, isAppend), "UTF-8"));
}
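For comparison, an equivalent java.nio form (a sketch with a placeholder path, not the asker's code); StandardCharsets.UTF_8 is a constant, so the charset name cannot be mistyped, and append mode becomes an explicit option:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class OfferFileWriterSketch {
    public static void main(String[] args) throws IOException {
        // CREATE + APPEND mirrors new FileOutputStream(file, true).
        BufferedWriter writer = Files.newBufferedWriter(
                Paths.get("/tmp/CLIENTAPP_OFFERRECORDS.txt"), // placeholder path
                StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        writer.write("sample line with î and ü\n");
        writer.close();
    }
}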
Now once I open the file on the Linux server by running cat on it, or open it in Notepad++, some of the characters display incorrectly.
I've run the following commands on the server to check its encoding: locale charmap prints UTF-8, echo $LANG prints en_US.UTF-8, and echo $LC_CTYPE prints nothing.
Here is what I've attempted so far.
I've attempted to change the character encoding used by the OutputStreamWriter: I've tried UTF-8 and CP1252. When switching encodings, some characters are fixed while others then display incorrectly.
My Question is this:
Which encoding should my OutputStreamWriter be using?
(Bonus question) How are we supposed to avoid issues like this? The rule of thumb I was given was "use UTF-8 and you will never run into problems", but that isn't the case for me right now.
Running the file -bi command on the server revealed that the file was encoded as ASCII instead of UTF-8. Removing the file completely and rerunning the process fixed this for me.
I have seen strange behavior with Jersey and Tomcat multipart requests.
I have files in different languages, for example:
минуты назад.txt or 您好.txt
With the help of another post I figured out that we need to convert this to UTF-8.
Something like
String fileName=new String(bodyPart.getContentDisposition().getFileName().getBytes(),"UTF-8");
With this I see that the names are converted back, but some characters are garbled with question marks. The above file names come out as something like:
мин�?�?�? назад.txt and �?�好.txt
I am not sure why only a few characters are lost. In the above code, bodyPart is nothing but a Jersey FormDataBodyPart.
Is there any additional configuration needed in Tomcat? I tried adding URIEncoding="UTF-8", but that did not help.
Need some help understanding this.
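For what it's worth, a common cause of exactly this symptom (an assumption on my part, not confirmed above): the container decodes the Content-Disposition header as ISO-8859-1, while getBytes() without an argument, as in the snippet above, re-encodes with the platform default charset and garbles anything that charset cannot represent. Naming both charsets explicitly avoids that; a sketch, assuming Jersey 2's package names:

import java.nio.charset.StandardCharsets;
import org.glassfish.jersey.media.multipart.FormDataBodyPart;

public final class FileNameFix {
    // The header was decoded as ISO-8859-1 (assumed); that decode is lossless,
    // so re-encoding with ISO-8859-1 recovers the raw header bytes, which can
    // then be decoded as the UTF-8 they really are.
    static String utf8FileName(FormDataBodyPart bodyPart) {
        String raw = bodyPart.getContentDisposition().getFileName();
        return new String(
                raw.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
    }
}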
I have a Java servlet which fetches RSS feeds and converts them to JSON. It works great on Windows, but it fails on CentOS.
The RSS feed contains Arabic, and it shows unintelligible characters on CentOS. I am using these lines to encode the RSS feed:
byte[] utf8Bytes = Xml.getBytes("Cp1256");
// byte[] defaultBytes = Xml.getBytes();
String roundTrip = new String(utf8Bytes, "UTF-8");
I tried it on GlassFish and Tomcat. Both have the same problem: it works on Windows, but fails on CentOS. How is this caused and how can I solve it?
byte[] utf8Bytes = Xml.getBytes("Cp1256");
String roundTrip = new String(utf8Bytes, "UTF-8");
This is an attempt to correct a badly-decoded string. At some point prior to this operation you read in Xml using the default encoding, which on your Windows box is code page 1256 (Windows Arabic). Here you are encoding that string back to code page 1256 to retrieve its original bytes, then decoding it properly as the encoding you actually wanted, UTF-8.
On your Linux server, it fails, because the default encoding is something other than Cp1256; it would also fail on any Windows server not installed in an Arabic locale.
The commented-out line that uses the default encoding instead of explicit Cp1256 is more likely to work on a Linux server. However, the real fix is to find where Xml is being read and fix that operation to use the correct encoding(*) instead of the default. Allowing the default encoding to be used is almost always a mistake, as it makes applications dependent on configuration that varies between servers.
(*: for this feed, that's UTF-8, which is the most common encoding, but it may differ for others. Finding the right encoding for a feed depends on the Content-Type header returned for the resource and the <?xml ... ?> encoding declaration. By far the best way to cope with this is to fetch and parse the resource using a proper XML library that knows about this, for example with DocumentBuilder.parse(uri).)
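A minimal sketch of that last suggestion, with a placeholder feed URL; DocumentBuilder reads the raw bytes itself and honors the feed's declared encoding, so no manual charset juggling is needed:

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class FeedFetch {
    public static void main(String[] args) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // parse(String uri) fetches the resource and decodes it according to
        // its own <?xml encoding=...?> declaration, not the platform default.
        Document doc = builder.parse("https://example.com/feed.xml"); // placeholder URL
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}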
There are many places where the wrong encoding can be used. Here is the complete list: http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q8
I am facing a problem about encoding.
For example, I have a message in XML whose declared encoding is "UTF-8".
<message>
<product_name>apple</product_name>
<price>1.3</price>
<product_name>orange</product_name>
<price>1.2</price>
.......
</message>
Now, this message supports multiple languages:
Traditional Chinese (Big5),
Simplified Chinese (GB),
English (UTF-8)
And it only changes the encoding in specific fields.
For example (Traditional Chinese):
<message>
<product_name>蘋果</product_name>
<price>1.3</price>
<product_name>橙</product_name>
<price>1.2</price>
.......
</message>
Only "蘋果" and "橙" are using big5, "<product_name>" and "</product_name>" are still using utf-8.
<price>1.3</price> and <price>1.2</price> are using utf-8.
How do I know which word is using which encoding?
It looks like whoever is providing the XML is producing invalid XML: they should be using a single consistent encoding.
http://sourceforge.net/projects/jchardet/files/ is a pretty good heuristic charset detector.
It's a port of the detector used in Firefox to detect the encoding of pages that are missing a charset in the Content-Type header or a BOM.
You could use it to try to figure out the encoding of substrings in a malformed XML file if you can't get the provider to fix their output.
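If pulling in jchardet is not an option, a cruder pure-JDK sketch of the same idea: strictly try UTF-8 first and fall back to Big5 when the bytes don't form valid UTF-8 (this assumes only those two encodings occur, per the question):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public final class FieldDecoder {
    // Decode a field's raw bytes: valid UTF-8 wins, otherwise assume Big5.
    static String decodeField(byte[] bytes) {
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strictUtf8.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException notUtf8) {
            return new String(bytes, Charset.forName("Big5"));
        }
    }
}

It is a heuristic: some Big5 byte sequences can accidentally be valid UTF-8, which is why a statistical detector like jchardet is the better tool when available.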
You should use only one encoding in an XML file; the Big5 characters all have counterparts in UTF-8.
Since I cannot get the provider to fix the output, I have to handle it myself, and I cannot use an external library in this project.
I can only solve it like this,
String str = new String(big5String.getBytes("UTF-8"));
before displaying the message.
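Note that the line above encodes with UTF-8 and then decodes with the platform default charset, so it only works when that default happens to match. The general pattern for undoing a wrong decode, sketched under the assumption that the Big5 bytes were originally mis-decoded as ISO-8859-1, is to re-encode with the wrongly-used charset and decode with the right one:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public final class ReDecode {
    // misDecoded is assumed to be Big5 bytes that were wrongly read as
    // ISO-8859-1; that decode is lossless, so re-encoding recovers the
    // original bytes, which can then be decoded as Big5.
    static String repair(String misDecoded) {
        return new String(
                misDecoded.getBytes(StandardCharsets.ISO_8859_1),
                Charset.forName("Big5"));
    }
}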