Requirements
Downloading a CSV File
Code
I have a csvFormattedString like
String csvFormattedString = "\"Column_One\",\"Column_Two\"\n\"Row_Col1\",\"Row_Col2\"\n";
This CSV String is written to the response print writer using
response.getWriter().write(csvFormattedString);
I have set the headers for a forced download (application/force-download) and have set the character encoding to UTF-8.
I would like to send the response length back to the user as well.
The csvFormattedString.length() does not seem to be correct, as some of my characters get truncated.
csvFormattedString.length() counts char values (UTF-16 code units), not bytes.
Use s.getBytes("UTF-8").length to get the number of bytes used when that string is encoded as UTF-8.
You have to catch UnsupportedEncodingException in order to use getBytes(String encoding).
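For example, here is a minimal sketch of the servlet code with the length computed from the byte count rather than the character count; writeCsv is a hypothetical helper name, and using StandardCharsets.UTF_8 avoids the checked exception entirely:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.servlet.http.HttpServletResponse;

void writeCsv(HttpServletResponse response, String csvFormattedString) throws IOException {
    // Content-Length must be the UTF-8 byte count; multi-byte characters
    // make this larger than csvFormattedString.length().
    byte[] utf8Bytes = csvFormattedString.getBytes(StandardCharsets.UTF_8);
    response.setCharacterEncoding("UTF-8"); // must come before getWriter()
    response.setContentLength(utf8Bytes.length);
    response.getWriter().write(csvFormattedString);
}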
Related
I have a Java servlet that takes a parameter String (inputString) that may contain Greek letters from a web page marked up as utf-8. Before I send it to a database I have to convert it to a new String (utf8String) as follows:
String utf8String = new String(inputString.getBytes("8859_1"), "UTF-8");
This works, but, as I hope will be appreciated, I hate doing something I don't understand, even if it works.
From the method description in the Java doc the getBytes() method "Encodes this String into a sequence of bytes using the named charset, storing the result into a new byte array" i.e. I am encoding it in 8859_1 — isoLatin. And from the Constructor description "Constructs a new String by decoding the specified array of bytes using the specified charset" i.e. decodes the byte array to utf-8.
Can someone explain to me why this is necessary?
My question is based on a misconception regarding the character set used for the HTTP request. I had assumed that because I marked up the web page from which the request was sent as UTF-8 the request would be sent as UTF-8, and so the Greek characters in the parameter sent to the servlet would be read as a UTF-8 String (‘inputString’ in my line of code) by the HttpRequest.getParameter() method. This is not the case.
HTTP requests are sent as ISO-8859-1 (POST) or ASCII (GET), which are generally the same. This is part of the URI Syntax specification — thanks to Andreas for pointing me to http://wiki.apache.org/tomcat/FAQ/CharacterEncoding where this is explained.
I had also forgotten that the encoding of Greek letters such as α for the request is URL-encoding, which produces %CE%B1. The getParameter() handles this by decoding it as two ISO-8859-1 characters, %CE and %B1 — Î and ± (I checked this).
I now understand why this needs to be turned into a byte array and the bytes interpreted as UTF-8. 0xCE does not represent a one-byte character in UTF-8, and hence it is combined with the next byte, 0xB1, and interpreted as α. (Î is 0xC3 0x8E and ± is 0xC2 0xB1 in UTF-8.)
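Here is a small, self-contained sketch of that round trip using only JDK classes; the string literal stands in for what getParameter() returns when the URL-encoded %CE%B1 is decoded as ISO-8859-1:

import java.nio.charset.StandardCharsets;

public class GreekParamDemo {
    public static void main(String[] args) {
        // What getParameter() hands back: bytes 0xCE and 0xB1 decoded as ISO-8859-1.
        String inputString = "\u00CE\u00B1"; // looks like "Î±"
        // Reverse the wrong decoding to recover the raw bytes 0xCE 0xB1 ...
        byte[] raw = inputString.getBytes(StandardCharsets.ISO_8859_1);
        // ... and decode those bytes as UTF-8 to get the Greek letter back.
        String utf8String = new String(raw, StandardCharsets.UTF_8);
        System.out.println(utf8String); // prints: α
    }
}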
When decoding, could you not create a class with a decoder method that takes the byte[] as a parameter and returns it as a String? Here is an example I have used before.
public class Decoder
{
    // Turns the byte array into a String. Note that new String(bytes) alone
    // would use the platform default charset; passing the charset explicitly
    // keeps the result the same on every platform.
    public String decode(byte[] bytes)
    {
        return new String(bytes, java.nio.charset.StandardCharsets.UTF_8);
    }
}
Try using this instead of .getBytes(). Hope this works.
I have a large XML document. It contains words with special characters, such as ZÖE, DÉCOR, CIARÁN. I am using Java and MarkLogic as my DB. I am unable to read my XML when these words are present; when I remove them and check, it works perfectly.
My Java Code:
DatabaseClient client = DatabaseClientFactory.newClient(IP, PORT,
DATABASE_NAME, USERNAME, PWD, Authentication.DIGEST);
XMLDocumentManager docMgr = client.newXMLDocumentManager();
DOMHandle xmlhandle = new DOMHandle();
docMgr.read("/" + filename, xmlhandle);
Changed question:
As I said, I was unable to read special characters. Now, how can I insert the special characters so that when I read them back I get the same form as inserted?
Example:
When I insert characters like CIARÁN AURÉLIE BARGÈME they are saved, but when I read them back the data comes out as CIARAN AURELIE BARGEME, not as inserted.
DatabaseClient client = DatabaseClientFactory.newClient(IP, PORT,
        DATABASE_NAME, USERNAME, PWD, Authentication.DIGEST);
XMLDocumentManager docMgr = client.newXMLDocumentManager();
DOMHandle xmlhandle = new DOMHandle();
docMgr.read("/" + filename, xmlhandle);
String doc = xmlhandle.toString();
// NFD splits each accented letter into a base letter plus a combining mark,
// and the replaceAll then deletes every non-ASCII character, i.e. the accents.
String data = Normalizer.normalize(doc, Normalizer.Form.NFD)
        .replaceAll("[^\\p{ASCII}]", "");
I am using Normalizer to read the special characters; otherwise, the plain xmlhandle is fine.
According to their official documentation:
If you specify the encoding and it turns out to be the wrong encoding, then the conversion will likely not turn out as you expect.
MarkLogic Server stores text, XML, and JSON as UTF-8. In Java, characters in memory and reading streams are UTF-16. The Java API converts characters to and from UTF-8 automatically.
When writing documents to the server, you need to know if they are already UTF-8 encoded. If a document is not UTF-8, you must specify its encoding or you are likely to end up with data that has incorrect characters due to the incorrect encoding. If you specify a non-UTF-8 encoding, the Java API will automatically convert the encoding to UTF-8 when writing to MarkLogic.
https://docs.marklogic.com/guide/java/document-operations#id_11208
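Based on that, here is a sketch of a write that declares the source encoding so the Java API performs the UTF-8 conversion; the file path, URI, and ISO-8859-1 charset are assumptions for illustration, and ReaderHandle is the handle type that accepts a character Reader:

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import com.marklogic.client.io.ReaderHandle;

// Read the file in its actual encoding; the Reader yields proper Java
// characters, which the API then writes to MarkLogic as UTF-8.
Reader reader = new InputStreamReader(
        new FileInputStream("names.xml"),      // assumed source file
        StandardCharsets.ISO_8859_1);          // assumed source encoding
docMgr.write("/names.xml", new ReaderHandle(reader));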
I am having trouble converting an email attachment (a simple text file in windows-1251 encoding with Latin and Cyrillic characters) to a String; that is, I have a problem converting the Cyrillic.
I received the attachment file as a Base64-encoded String like this:
[screenshot: Base64-encoded email attachment]
[screenshot: original file]
So when I try to decode it, I get "?" instead of the Cyrillic characters.
How can I get the right Cyrillic (Russian) characters instead of "?"?
I've already tried this code with all available encodings, but none of them yields the correct Russian characters.
// sun.misc.BASE64Decoder is an internal API; java.util.Base64 (Java 8+)
// is the supported replacement and needs no checked-exception handling.
byte[] decoded = java.util.Base64.getDecoder().decode(encoded);
// Print the decoded attachment in every charset the JVM knows,
// to see which one renders the Cyrillic correctly.
for (Map.Entry<String, Charset> entry : Charset.availableCharsets().entrySet()) {
    System.out.println("K=" + entry.getKey() + " Value:" + entry.getValue());
    System.out.println(new String(decoded, entry.getValue()));
}
Thank you in advance.
I am not very familiar with BPEL and the protocols it uses. If you communicate between nodes using some binary protocol, then you must 1) ensure the client and the receiver use the same charset and 2) convert the Java string into the proper bytes in that encoding. Java stores strings internally in UTF-16. So when you execute String correct = new String(commonName.getBytes("ISO-8859-1"), "ISO-8859-5") you get a correct string in UTF-16. Then you need to export it to bytes in the requested encoding, e.g. byte[] buff = correct.getBytes("UTF-8"), assuming the encoding you use between nodes is UTF-8. If the encoding happens to be different, you must make sure it actually supports Cyrillic characters (e.g. ISO-8859-1 does not).
If you use XML for data exchange, make sure it declares a suitable encoding in <?xml version="1.0" encoding="UTF-8"?>. You then don't need to play with bytes; you just need to correctly "import" the string (see the correct variable above). Writing to XML converts characters automatically, but the declared encoding must support the characters you want to write. So if you set encoding="ISO-8859-1", then you will get those question marks again.
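Here is a self-contained sketch of that repair-and-export sequence; the first two lines only simulate a Cyrillic name that arrived as ISO-8859-5 bytes but was mis-decoded as ISO-8859-1:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CyrillicRepairDemo {
    public static void main(String[] args) {
        Charset iso8859_5 = Charset.forName("ISO-8859-5");
        // Simulate the damage: ISO-8859-5 bytes read back as ISO-8859-1.
        byte[] wireBytes = "Привет".getBytes(iso8859_5);
        String commonName = new String(wireBytes, StandardCharsets.ISO_8859_1);
        // The repair from above: recover the original bytes, decode correctly.
        String correct = new String(commonName.getBytes(StandardCharsets.ISO_8859_1), iso8859_5);
        System.out.println(correct); // prints: Привет
        // Export in the encoding agreed between the nodes, e.g. UTF-8.
        byte[] buff = correct.getBytes(StandardCharsets.UTF_8);
    }
}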
I am using JAX-RS and have a simple POST web service that takes an InputStream containing a MIME message (XML plus a file).
The MIME message is in UTF-8; the file contained as a body part is an email message in MIME RFC 822 format, in ISO-8859-1 encoding, which I'm converting to PDF using Aspose.
When running as a web service, the resulting PDF has incorrect characters (ø, å, etc.). But when I tried to use the exact same input, reading it from a file instead and calling the method with a FileInputStream, the resulting PDF is OK.
Here is the simplified version of the code:
@POST
@Path(value = "/documents/convert/{flag}")
@Produces("text/plain")
public String convertFile(InputStream input, @PathParam("flag") String flag) throws WebApplicationException {
    FileInfo info = convertToPdf(input);
    return info.getResponse();
}
If I run this as a web service, it produces a PDF with incorrectly encoded characters, with a "box" instead of some characters (such as ø, å, etc.). When I run the same code with the same input by calling
FileInputStream fis = new FileInputStream(file);
convertFile(fis);
the resulting PDF has the correct encoding (the WS runs on a server; the file test is done on my local machine).
Could this be an incorrect locale setting on the server?
Do you use an InputStreamReader to read the FileInputStream? If so, did you initialize it using the two-parameter constructor, with Charset.forName("UTF-8") as the second argument (since, as you mentioned, the incoming stream is already in UTF-8)?
You might need to tell the container that it's UTF-8, something like:
@Produces("text/plain; charset=utf-8")
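Applied to the method from the question, only the annotation changes; a sketch:

@POST
@Path(value = "/documents/convert/{flag}")
@Produces("text/plain; charset=utf-8") // charset added so the container labels the response
public String convertFile(InputStream input, @PathParam("flag") String flag) throws WebApplicationException {
    FileInfo info = convertToPdf(input);
    return info.getResponse();
}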
Apparently your local file and your MIME message body are not encoded the same way.
Your post states that the file is encoded in ISO-8859-1.
If you are using an InputStreamReader (as Xavier Coulon suggests), you should pass the expected encoding to it, in this case
Charset.forName("ISO-8859-1")
If this does not help, could you please provide the content of the convertToPdf(InputStream is) method?
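For illustration, here is a sketch of what the charset-aware read could look like inside convertToPdf; since that method's body is not shown in the question, bodyPartStream is a hypothetical InputStream for the RFC 822 body part:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Decode the body part with its declared encoding instead of the JVM's
// platform default, which can differ between the server and a local machine.
BufferedReader reader = new BufferedReader(
        new InputStreamReader(bodyPartStream, StandardCharsets.ISO_8859_1));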
I have a JSON response which I want to store in a DB and display in a TextView or EditText. This JSON response is encoded in UTF-8.
The response is something like
"currencies": [[0,"RUR"," ",1,0],[1,"EUR","â¬",1.44,100],[2,"GBP","£",1.6,100],[3,"JPY","Â¥",0.0125,100],[4,"AUD","$",1.1,100]]}
where â¬, £, Â¥ are currency symbols. I have to decode this and then display it. These symbols are Unicode characters (transferred as UTF-8). How can I convert these encoded symbols? Please help.
I tried this, but it didn't work:
byte[] b = stringSymbol.getBytes("UTF-8"); // â¬,£,Â¥
final String str = new String(b);
You're showing the text with non-currency symbols... it's as if you're taking the original text, then encoding that as UTF-8, then decoding it as ISO-8859-1.
It's just text - you shouldn't need to do anything to it afterwards, and you should never see it in this broken format. If you have to convert the text back to bytes and then to a string again, that means you've already lost, basically.
Check the headers on the HTTP response which returns the JSON - I suspect you'll find that it's claiming the data is ISO-8859-1 rather than UTF-8. The actual encoding has to match the encoding that's specified in the headers, otherwise you end up with this sort of effect.
Another possibility is that whatever's returning the JSON is accurately giving you the data that it knows about, and that the data is broken upstream. You should follow the data step by step (assuming you own all the links in the chain) until you can see where you're first encountering this brokenness.
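To see that this is exactly that mix-up, here is a small sketch that reproduces the broken strings from the question (the middle byte of € maps to an invisible control character in ISO-8859-1, which is why it shows as â¬ rather than three visible characters):

import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        for (String symbol : new String[] { "€", "£", "¥" }) {
            // Encode correctly as UTF-8, then decode with the wrong charset.
            byte[] utf8 = symbol.getBytes(StandardCharsets.UTF_8);
            String broken = new String(utf8, StandardCharsets.ISO_8859_1);
            System.out.println(symbol + " -> " + broken); // e.g. € -> â¬
        }
    }
}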