"En dash" being garbled during http response handling or text manipulation

I'm writing code to work with text from Wikipedia and am having issues with en dashes being garbled. I haven't worked with en dashes or other non-standard characters before (non-standard to me being characters that don't appear on my keyboard ;), so I'm not sure which part of my code to blame. Here's what is happening, along with code snippets.
I send a request to Wikipedia (I'm using the Apache HttpComponents client API for communicating with Wikipedia) for the contents of an article and save it in a String:
DefaultHttpClient client = new DefaultHttpClient();
HttpGet queryRequest = new HttpGet(query); // query is the URL for retrieving the article contents.
ResponseHandler<String> responseHandler = new BasicResponseHandler();
String responseBody = client.execute(queryRequest, responseHandler);
At this point if I were to send "responseBody" to System.out, en dashes are displayed in my Eclipse console as '?'. This might just be an Eclipse console display issue so I'll move on.
I manipulate the text, ignoring the en dashes, and then send the text back to Wikipedia.
List<NameValuePair> postParams = new ArrayList<NameValuePair>();
postParams.add(new BasicNameValuePair("text", content)); // content is a String with the article text
UrlEncodedFormEntity entity = new UrlEncodedFormEntity(postParams, "UTF-8");
HttpPost queryRequest = new HttpPost(url); // url is the basic URL for the Wikipedia api
queryRequest.setEntity(entity);
queryRequest.addHeader("Content-Type", "application/x-www-form-urlencoded");
ResponseHandler<String> responseHandler = new BasicResponseHandler();
String responseBody = client.execute(queryRequest, responseHandler);
When the text, now uploaded to Wikipedia, is displayed in a web browser, what were en dashes before are now displayed as '?' in a box (unknown character?). So somewhere I am inadvertently changing or mis-encoding the en dashes, but I'm not sure exactly where.
Can someone point me in the right direction?

Now for the real answer. The mangled non-English characters had nothing to do with Apache HttpComponents or with any Java string handling or manipulation. The problem was the Eclipse IDE running on Windows.
Eclipse's run configuration defaults to the system's default encoding, which is Cp1252 on Windows. Cp1252 cannot represent every Unicode character, so problems arise. I found the solution here. In Eclipse, open Run Configurations, select the configuration for the project you are running, and go to the 'Common' tab. In the encoding section, change "Default" to "Other" and set the encoding to UTF-8.
All is now well.
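The effect can be reproduced without Eclipse. The following is a minimal sketch using only the JDK; the class name ConsoleEncodingDemo and the sample text are made up for illustration, and US-ASCII merely stands in for any console charset that cannot represent a given character. It shows an unmappable character turning into '?' on output, and one way to print with an explicit encoding instead of the platform default.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ConsoleEncodingDemo {

    // Encode text with the given charset and decode it again with the same one.
    // Characters the charset cannot represent come back as '?'.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "1914\u20131918"; // hypothetical sample containing an en dash (U+2013)

        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("ASCII round trip: " + roundTrip(text, StandardCharsets.US_ASCII));

        // Wrapping System.out with an explicit encoding avoids relying on the default.
        // Whether the console then renders it correctly still depends on the console.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println("UTF-8 output: " + text);
    }
}
```

If the in-memory string is intact and only the printed form shows '?', the data itself is probably fine and only the output encoding needs changing.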

I have yet to figure out why the en dash is getting mangled, but I do have a (possibly kludgy) fix in the meantime.
String unknownUTF = String.copyValueOf(Character.toChars(65533));
content = content.replace(unknownUTF, "\u2013");
I'm basically replacing every instance of the Unicode replacement character (U+FFFD, code point 65533) with the en dash. This works only if the original content contains no other characters that are being converted into the replacement character.
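For context on where code point 65533 comes from: Java's charset decoders substitute U+FFFD when the input bytes are not valid in the charset being used. The sketch below (class name ReplacementCharDemo is hypothetical) illustrates this with a single byte; it does not claim to reproduce the exact path by which the en dash was lost in the code above.

```java
import java.nio.charset.StandardCharsets;

class ReplacementCharDemo {

    // Decode raw bytes as UTF-8; malformed sequences become U+FFFD.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0x96 is the en dash in windows-1252, but on its own it is not valid UTF-8.
        byte[] cp1252EnDash = { (byte) 0x96 };
        String decoded = decodeAsUtf8(cp1252EnDash);
        System.out.println(Integer.toHexString(decoded.charAt(0))); // fffd

        // The same character encoded as UTF-8 decodes cleanly.
        byte[] utf8EnDash = "\u2013".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeAsUtf8(utf8EnDash).equals("\u2013")); // true
    }
}
```

Once a character has been replaced this way the original byte is gone, which is why the replace() workaround can only guess that every U+FFFD used to be an en dash.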

Related

Code is not translating german characters from Google Books API correctly

I have produced a little app that searches Google Books and displays the retrieved data in a neat but simple fashion. Everything works so far, but there is an issue directly at the source: although Google correctly returns German-language search results, all special German characters (probably Ä, Ö, Ü and ß) show up as the "�" dummy or sometimes just "?".
I was able to confirm that the JSONObject built from the InputStream already contains those mistakes. It seems the original InputStream from Google is not being read correctly. Oddly, I pass "UTF-8" (which should cover German characters) to my InputStreamReader, but apparently to no avail.
Here is the http-request procedure I am using:
public class HttpRequest {
public static String request(String urlString) throws IOException {
URL url = new URL(urlString);
URLConnection connection = url.openConnection();
connection.setConnectTimeout(5000);
connection.setReadTimeout(10000);
BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF-8"));
StringBuilder builder = new StringBuilder();
String inputLine;
while((inputLine = in.readLine()) != null)
builder.append(inputLine);
in.close();
return builder.toString();
}
}
What else could be going wrong? I checked the StringBuilder already, but the mistakes are already in the inputLine(s) that get read out of the BufferedReader.
Also, I was unable to find any language or encoding specific settings in the official google books api guide, so I guess they should come with universal encoding, but then the "UTF-8" flag should detect them, or not?

The easiest approach is to check the raw data another way, such as in a browser. Looking at a Google Books API response in the browser is simple: open the URL and the response comes back as JSON. A JSON viewer plugin is optional and not needed for this.
For example use this url:
https://www.googleapis.com/books/v1/volumes?q=Latein+key=NO
Checking the HTTP headers (in the browser developer tools, for example), you can see that the response declares the expected encoding:
content-type: application/json; charset=UTF-8
Looking at the text of some German results, the special German characters are correct for some books but not for all, depending on the book in question.
Conclusion: UTF-8 is indeed the correct encoding, and for some texts the source data itself has missing or wrong German characters.
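The header check described above can also be done in code rather than by hard-coding "UTF-8". This is a rough sketch, not a full Content-Type parser; the class and method names are hypothetical, and the fallback charset is an assumption the caller has to choose.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentTypeCharset {

    // Extract the charset parameter from a Content-Type value, falling back
    // to the given default when none is declared or the name is unknown.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // unknown or malformed charset name
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = fromContentType("application/json; charset=UTF-8",
                StandardCharsets.ISO_8859_1);
        System.out.println(cs); // UTF-8
    }
}
```

In the request method from the question, the value of connection.getContentType() could be passed in and the result handed to the InputStreamReader.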

Is it necessarily so that you can POST a byte stream to any API that will accept a file, or does it depend on the API?

I have come to understand that not knowing this points to a gap in my knowledge of how REST-like APIs work, and if someone can provide a reference where I can learn the background behind this question, I would appreciate it. In the meantime, though, I would also appreciate help answering this question!
I have a Java application that posts files from the local filesystem to an API. Instead of having millions of files sitting on the volume with all of their file handles, I want to leave the files in a .tar.gz archive, pull them out of the archive in memory, and POST them without writing them to disk. I know that I can write them to disk, POST them, and then delete them, but I view that option as a last resort.
So here's code that works to POST a file that exists in the file system, not in an archive
public CloseableHttpResponse submit (File file) throws IOException {
CloseableHttpClient client = HttpClients.createDefault();
HttpPost post = new HttpPost(API_LOCATION + API_BASE);
MultipartEntityBuilder builder = MultipartEntityBuilder.create();
builder.addBinaryBody("files", file, ContentType.APPLICATION_OCTET_STREAM, null);
HttpEntity multipartEntity = builder.build();
post.setEntity(multipartEntity);
CloseableHttpResponse response = client.execute(post);
System.out.println("response: " + IOUtils.toString(response.getEntity().getContent(),"UTF-8"));
client.close();
return response;
}
I get back a JSON response from my particular API that looks like this
response: {"data":[<bunch of json>]}
I've put the same file into a .tar.gz archive and have used apache commons compress to unzip the file and pull out each file as a TarArchiveEntry, and I've tested that it works properly by writing the text file to disk and opening it manually outside of java - I am definitely getting the entry into memory correctly. I tried changing the entity attached to the POST to a ByteArrayEntity and converting the archive entry to a byte stream, but the API insists it will only accept a multipart entity. So looking at the API for MultipartEntityBuilder.addBinaryBody it appears I'm left with two options: I can either post a byte array or an InputStream. I've tried both and I can't get either to work - I'll post my example code for the byte array approach, but I can't figure out how to convert the tar archive to an InputStream - at least not without converting it to a byte array first, which seems sorta silly at that point.
public CloseableHttpResponse submit (byte[] xmlBytes) throws IOException {
CloseableHttpClient client = HttpClients.createDefault();
HttpPost post = new HttpPost(API_LOCATION + API_BASE);
MultipartEntityBuilder builder = MultipartEntityBuilder.create();
builder.addBinaryBody("files", xmlBytes, ContentType.APPLICATION_OCTET_STREAM, null);
HttpEntity multipartEntity = builder.build();
post.setEntity(multipartEntity);
CloseableHttpResponse response = client.execute(post);
System.out.println("response: " + IOUtils.toString(response.getEntity().getContent(),"UTF-8"));
System.out.println(response.getStatusLine().getStatusCode());
client.close();
return response;
}
I believe the code is identical with the exception of the data type of the input parameter. Here is my empty response, which comes with a status code 207:
response: {"data":[]}
So here is my real question: Can any API that accept files also accept a file in the form of a byte stream or byte array? Can the API tell the difference, and what is really happening when I POST a file? Does the API have to be specifically configured to accept this file in the form of a byte stream or a byte array? A link to a reference along with a short explanation would be highly appreciated - I really need to learn this stuff and understand it well.
Is there some easy-to-correct mistake that I'm making? Am I using the wrong Content-Type or something? I'm not even sure what the last argument to MultipartEntityBuilder.addBinaryBody means (the one I've left null).
Any help is appreciated, thank you very much!

It appears that an API that accepts a file doesn't care whether it comes from a File object or a byte array. Per JB Nizet:
You're passing null as the file name. When passing a File as argument, the actual name of the File is used if you passed null as file name. That obviously doesn't happen if you pass a byte array. So specify a non-null file name as last argument. That can only be found out by reading the javadoc and the source code of MultipartEntityBuilder. It's open source: use that as an advantage.
In this specific case, passing any non-null string as the last argument of addBinaryBody fixes the problem, and the API accepts the byte array as a file.
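To make "what is really happening when I POST a file" more concrete, here is a hand-rolled sketch of a single-part multipart/form-data body built with the JDK alone. It is for illustration only: real code should keep using MultipartEntityBuilder, and the boundary, field name and file name are made-up values. The point is that the server only ever sees bytes plus part headers, and the filename parameter is the one piece that goes missing when null is passed alongside a byte array.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class MultipartSketch {

    // Build a single-part multipart/form-data body by hand to show what goes on the wire.
    static byte[] buildBody(String boundary, String fieldName, String fileName, byte[] content) {
        String disposition = "Content-Disposition: form-data; name=\"" + fieldName + "\"";
        if (fileName != null) {
            // Assumption: like many servers, this API uses the filename parameter
            // to recognise the part as a file upload.
            disposition += "; filename=\"" + fileName + "\"";
        }
        String header = "--" + boundary + "\r\n"
                + disposition + "\r\n"
                + "Content-Type: application/octet-stream\r\n\r\n";
        byte[] head = header.getBytes(StandardCharsets.ISO_8859_1);
        byte[] tail = ("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.ISO_8859_1);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(head, 0, head.length);
        out.write(content, 0, content.length); // same bytes whether they came from disk or memory
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] xml = "<doc/>".getBytes(StandardCharsets.UTF_8); // hypothetical archive entry
        byte[] body = buildBody("XyZ123", "files", "entry.xml", xml);
        System.out.println(new String(body, StandardCharsets.ISO_8859_1));
    }
}
```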

send arabic SMS on mobile in java

My application supports both Arabic and English, but I am facing a problem: when the mobile receives an Arabic SMS, it is displayed as ??? ???? (question marks). The mobile I am using for testing supports Arabic, and all other Arabic text in the application works fine; the problem occurs only when an Arabic SMS is received by my mobile.
String ff = new String(smsContent.getBytes("UTF-8"), "UTF-8");
StringWriter stringBuffer = new StringWriter();
PrintWriter pOut = new PrintWriter(stringBuffer);
pOut.print("<?xml version=\"1.0\" encoding=\"utf-8\"?>");
pOut.print("<!DOCTYPE MESSAGE SYSTEM \"http://127.0.0.1/psms/dtd/messagev12.dtd\" >");
pOut.print("<MESSAGE VER=\"1.2\"><USER USERNAME=\""+userName+"\" PASSWORD=\""+password+"\"/>");
pOut.print("<SMS UDH=\"0\" CODING=\"1\" TEXT=\""+ff+"\" PROPERTY=\"0\" ID=\"2\">");
pOut.print("<ADDRESS FROM=\""+fromNo+"\" TO=\""+toNO+"\" SEQ=\"1\" TAG=\"\" />");
pOut.print("</SMS>");
pOut.print("</MESSAGE>");
pOut.flush();
pOut.close();
URL url = new URL("url");
HttpURLConnection connection = (HttpURLConnection)url.openConnection();
connection.setDoOutput(true);
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(connection.getOutputStream()));
out.write("data="+message+"&action=send");
out.flush();
SMS in English works fine in my application.

First, new String(smsContent.getBytes("UTF-8"), "UTF-8") is a redundant roundtrip, equivalent to smsContent. First you encode the string as bytes via UTF-8, and then immediately decode it back from the bytes again.
Second, your method of puzzling together XML is completely broken. You can't just concatenate strings and hope to end up with well-formed XML. Just for example think about what happens if someone tries to send a "? Use an XML library.
Third, you're implicitly using the platform default encoding for your OutputStreamWriter instead of explicitly specifying one, which means your code only works on those machines which randomly happen to have the correct encoding as default. I'm guessing yours does not.
Fourth, your method of puzzling together POST parameters is broken. You haven't specified what the variable message is. I'm guessing it's the complete XML document, but then you're trying to send it as a POST parameter to some kind of HTTP service, in which case it needs to be escaped/url-encoded. Just for example, what happens if someone tries to send the message &data=<whatever>&? Please clarify.
See also Using java.net.URLConnection to fire and handle HTTP requests
Fifth, since you're sending to some HTTP service, there's probably some documentation for that service what encoding to send or how to specify it, possibly with a HTTP header (Probably "Content-type: application/x-www-form-urlencoded; charset=UTF-8"?). Point us to the documentation if you can't figure it out yourself.
Edit: Found the documentation: http://www.google.se/search?q=valuefirst+pace
It pretty clearly states that you need to url encode the XML document, so that's probably what you're missing, in which case the encoding for the OutputStreamWriter won't matter as long as it's ASCII-compatible.
However, the documentation does not specify which character encoding to use for url-encoding, which is pretty weak. UTF-8 is the most likely though.
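A minimal sketch of the URL-encoding step, assuming (as guessed above) that the service expects UTF-8. The class name, the helper and the trimmed-down XML are hypothetical; only the data and action parameter names come from the question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class FormBodySketch {

    // URL-encode one form parameter with an explicit charset.
    static String param(String name, String value) {
        try {
            return URLEncoder.encode(name, "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, trimmed-down message with Arabic text in the attribute.
        String xml = "<SMS TEXT=\"\u0645\u0631\u062d\u0628\u0627\"/>";
        String body = param("data", xml) + "&" + param("action", "send");
        System.out.println(body);
    }
}
```

The encoded body contains only ASCII characters, which is why the writer's charset stops mattering once the XML has been URL-encoded.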

From what I've read, SMS messages in Arabic (and some other languages) are encoded with UCS-2, not UTF-8. Changing the encoding is worth a try.

You are using your platform's default encoding for the request data, which may very well differ from UTF-8. Try specifying UTF-8 in the OutputStreamWriter:
... new OutputStreamWriter(connection.getOutputStream(), "UTF-8") ...
Another issue, of course, is that your hand-made XML document will break as soon as any of your parameters contains characters that have to be escaped in XML, but that's a different story. Why don't you use an XML library instead?
One additional point: the documentation Christoffer points to also explains that the request example you are using is only suitable for text messages with characters in the standard SMS character set. For Unicode character support, you have to use a different request.
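On the XML escaping point, here is a sketch of what using a library could look like with the JDK's built-in StAX writer. It emits only the SMS element with two of the attributes from the question, and the class and method names are hypothetical; the provider's required document structure is not reproduced here.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class SmsXmlSketch {

    // Write an <SMS> element whose TEXT attribute is escaped by the XML library.
    static String smsElement(String text) {
        StringWriter buffer = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(buffer);
            xml.writeStartElement("SMS");
            xml.writeAttribute("CODING", "1");
            xml.writeAttribute("TEXT", text); // special characters are escaped here
            xml.writeEndElement();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // Hypothetical message containing characters that break string concatenation.
        System.out.println(smsElement("Tom & \"Jerry\" <3"));
    }
}
```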

Java (Android) UTF-8 character in string

Here's my problem: I am receiving a string from a SOAP web service which contains the UTF-8 percent-encoded sequence %c3%89. This string is a URL I have to request to get a picture that contains part of the URL in it.
My problem is that the server generating the picture doesn't recognize the %c3%89 encoding and thus doesn't create the right picture. When the sequence is replaced with its normal representation (i.e. É), the server generates the picture correctly.
My question is: how can I replace the encoded character in the string?
PS: I don't have access to the server side.
Here's my code:
URL aURL = new URL(URLDecoder.decode(url));
URLConnection conn = aURL.openConnection();
conn.connect();
InputStream is = conn.getInputStream();
BufferedInputStream bis = new BufferedInputStream(is);
bm = BitmapFactory.decodeStream(bis);
Thanks a lot :)
Hush

You need to pass the character encoding as 2nd argument to URLDecoder#decode(), otherwise it will use the platform default character encoding.
System.out.println(URLDecoder.decode("%c3%89", "ISO-8859-1")); // Ã?
System.out.println(URLDecoder.decode("%c3%89", "UTF-8")); // É

I just realized that the website understood the URL perfectly on earlier versions of Android (say, before 2.2). I'm starting to wonder what has changed in the URLConnection framework since then. Anyway, I will try to work around this problem by hosting the required picture on the web service rather than returning the URL.
Thank you

Google App Engine datastore encoding?

I'm using the GAE datastore for a Java application, and storing some text that will be in numerous languages. In my servlet, I'm first checking to see if there's any data in the data store, and, if not, I'm creating some, similar to the following:
ArrayList<Lang> list = new ArrayList<Lang>();
list.add(new Lang("EN", "English", 1));
list.add(new Lang("ES", "Español", 0));
//more languages here...
PersistenceManager pm = PMF.get().getPersistenceManager();
for(Lang l : list) {
pm.makePersistent(l);
}
Since this is using JDO, I guess I should include the relevant parts of the Lang class too:
@PersistenceCapable
public class Lang {
@PrimaryKey
private String code;
@Persistent
private String name;
@Persistent
private int popularity;
// getters & setters & constructors...
}
However, the non-ASCII characters are giving me grief. I've set my Eclipse project to use the UTF-8 encoding instead of the default Cp1252, so I think I'm okay from that perspective, but when I use the App Engine Data Viewer to look at my data, that Español entry becomes EspaÃ±ol, and when I click on it to view it, I get a 500 Server Error. (There are some other entries with right-to-left text that don't even show up in the Data Viewer at all, but one problem at a time...)
Is there anything special I can do in my code to set the character encoding, or specify to GAE that the data I'm storing is UTF-8? Or is the problem on the Eclipse side, and is there something I should be doing with my Java code?

I fixed the same issue by setting both the request and response encoding to UTF-8. Setting the request encoding results in a valid string being stored in the datastore; without it, values are stored as "????...".
Requests: if you use Apache HTTP Client, this is done in the following way:
Get request:
NameValuePair... params;
...
String url = urlBase + URLEncodedUtils.format(Arrays.asList(params), "UTF-8");
HttpGet httpGet = new HttpGet(url);
Post request:
NameValuePair... params;
...
HttpPost httpPost = new HttpPost(url);
httpPost.setEntity(new UrlEncodedFormEntity(Arrays.asList(params), "UTF-8"));
Response: if you build your response in an HttpServlet, this is done in the following way:
HttpServletResponse resp;
...
resp.setContentType("text/html; charset=utf-8");

Are you sure you have a problem with your data? I encountered similar issues before, but it turned out to be a problem in the Python version of the Data Viewer. I can retrieve my data fine in Java.

I think I had the same encoding problem several months ago. You can take a look at my sources; maybe they'll help:
1) http://code.google.com/p/vocrecaptor/source/browse/trunk/vocrecaptorweb/src/com/vocrecaptor/web/server/DictionaryServiceImpl.java
2) And class /com/vocrecaptor/web/server/servlet/AbstractServiceServlet.java

I notice that you have already set your Eclipse project to use UTF-8 text encoding. Did you double-check the text encoding of the Java file containing strings like "Español"?
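The EspaÃ±ol symptom from the question is what UTF-8 bytes look like when some step reads them with a single-byte charset such as ISO-8859-1 or Cp1252. The sketch below (class name MojibakeDemo is hypothetical) reproduces it; it does not tell you which step is at fault, only what kind of mismatch to look for.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {

    // Decode UTF-8 bytes with a single-byte charset, as a misconfigured tool would.
    static String misread(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \u00f1 is used instead of a literal so the result does not depend on
        // the encoding the compiler assumes for this source file.
        String name = "Espa\u00f1ol";
        System.out.println(misread(name).equals("Espa\u00c3\u00b1ol")); // true
    }
}
```

If the source file is saved as UTF-8 but compiled under a different assumed encoding, string literals can end up misread in the same way; passing -encoding UTF-8 to javac, or using \u escapes as above, rules that out.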
