Google App Engine datastore encoding?

I'm using the GAE datastore for a Java application, and storing some text that will be in numerous languages. In my servlet, I'm first checking to see if there's any data in the data store, and, if not, I'm creating some, similar to the following:
ArrayList<Lang> list = new ArrayList<Lang>();
list.add(new Lang("EN", "English", 1));
list.add(new Lang("ES", "Español", 0));
//more languages here...
PersistenceManager pm = PMF.get().getPersistenceManager();
for(Lang l : list) {
pm.makePersistent(l);
}
Since this is using JDO, I guess I should include the relevant parts of the Lang class too:
@PersistenceCapable
public class Lang {
    @PrimaryKey
    private String code;
    @Persistent
    private String name;
    @Persistent
    private int popularity;
    // getters & setters & constructors...
}
However, the non-ASCII characters are giving me grief. I've set my Eclipse project to use the UTF-8 encoding instead of the default Cp1252, so I think I'm okay from that perspective, but when I use the App Engine Data Viewer to look at my data, that Español entry becomes EspaÃ±ol, and when I click on it to view it, I get a 500 Server Error. (There are some other entries with right-to-left text that don't even show up in the Data Viewer at all, but one problem at a time...)
Is there anything special I can do in my code to set the character encoding, or specify to GAE that the data I'm storing is UTF-8? Or is the problem on the Eclipse side, and is there something I should be doing with my Java code?

I fixed the same issue by setting both the request and the response encoding to UTF-8. Setting the request encoding is what gets a valid string stored in the datastore; without it, values are stored as "????...".
Requests: if you use Apache HTTP Client, this is done in the following way:
Get request:
NameValuePair... params;
...
String url = urlBase + URLEncodedUtils.format(Arrays.asList(params), "UTF-8");
HttpGet httpGet = new HttpGet(url);
Post request:
NameValuePair... params;
...
HttpPost httpPost = new HttpPost(url);
httpPost.setEntity(new UrlEncodedFormEntity(Arrays.asList(params), "UTF-8"));
Response: if you build your response in an HttpServlet, this is done in the following way:
HttpServletResponse resp;
...
resp.setContentType("text/html; charset=utf-8");

Are you sure you have a problem with your data? I ran into a similar issue before, but it turned out to be a problem in the Python version of the Data Viewer. I can retrieve my data fine in Java.

I think I had the same encoding problem several months ago. You can take a look at my sources; maybe they'll help:
1) http://code.google.com/p/vocrecaptor/source/browse/trunk/vocrecaptorweb/src/com/vocrecaptor/web/server/DictionaryServiceImpl.java
2) And class /com/vocrecaptor/web/server/servlet/AbstractServiceServlet.java

I notice that you already set your Eclipse project to use UTF-8 text encoding. Did you double-check the text encoding of the Java file that contains the string literal "Español"?
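One way to check what a literal really contains, independent of how a console or the Data Viewer renders it, is to print its code points. The following is only a minimal sketch; the class and method names are made up for illustration. It writes the literal with a Unicode escape so that the source file's encoding cannot affect it, and it also simulates a UTF-8 source file being read as a single-byte charset.

```java
import java.nio.charset.StandardCharsets;
import java.util.StringJoiner;

// Hypothetical helper, not part of the original question.
final class CodePointDump {

    // Renders every code point as U+XXXX, separated by spaces.
    static String dump(String s) {
        StringJoiner out = new StringJoiner(" ");
        s.codePoints().forEach(cp -> out.add(String.format("U+%04X", cp)));
        return out.toString();
    }

    public static void main(String[] args) {
        // The escape keeps the literal correct whatever encoding the file is saved in.
        String escaped = "Espa\u00f1ol";
        System.out.println(dump(escaped));

        // Simulates UTF-8 bytes being interpreted as ISO-8859-1.
        String misread = new String(
                escaped.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(dump(misread));
    }
}
```

If the dump of your own literal shows U+00C3 U+00B1 where U+00F1 should be, the string was already damaged before it reached the datastore. For these two bytes Cp1252 and ISO-8859-1 decode identically, which is why ISO-8859-1 is used as the stand-in here.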

Related

Encoding query parameters in URL using Java with valid charset

I am trying to understand what is the difference and importance of different charsets available while encoding and decoding text.
I have a scenario, where I want to call a RestAPI. The RestAPI has a base URL, for ex: https://myrestapiurl.com. Now to perform a GET request, the URL is formed by appending the id of the entity that I want to fetch, like: https://myrestapiurl.com('id')
id : It has no limitations on valid characters!
I have encountered an id: باقی ریسورس , So before calling the RestAPI, I need to encode it. Using Java's URLEncoder, I tried the following:
String s ="باقی ریسورس";
String encodedID = URLEncoder.encode(s, StandardCharsets.UTF_8.name());
Using the encodedID, I try to make a request using PostMan. The request fails with 404 or 400 when I use different charset. It only succeeds when I encode using ISO_8859_1 as follows:
String encodedID = URLEncoder.encode(s, StandardCharsets.ISO_8859_1.name());
String URL = "https://myrestapiurl.com('" + encodedID + "')";
This works fine, through code as well as PostMan. My question is:
How can I decide which charset to use before encoding? Or should I have fallbacks, that is, if it fails with UTF_8 then try UTF_16, and so on? That seems very inefficient: if the entity doesn't actually exist, all of those retries are pure overhead.
Also, when I visit https://www.w3schools.com/tags/ref_urlencode.ASP and enter the text to be encoded, it provides the valid encoded string with ISO_8859_1 , how does it manage to do so?
How can this be done in Java without using any other extra libraries like apache? We don't have choice to add extra dependencies!
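Before choosing a charset, it may help to check whether it can represent the id at all. The sketch below uses only the JDK (Java 10 or later for the Charset overloads); the class and method names are hypothetical. ISO-8859-1 has no Arabic-script characters, so URLEncoder substitutes '?' for each one and the result is "%3F%3F%3F%3F+%3F%3F%3F%3F%3F%3F". That string no longer identifies the original text, whatever the server does with it. Note also that URLEncoder produces form encoding, so a space becomes '+'.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for comparing charsets on a given id.
final class IdEncoding {

    // True if every character of the text exists in the charset.
    static boolean representable(String text, Charset cs) {
        return cs.newEncoder().canEncode(text);
    }

    // True if encoding and then decoding with the same charset returns the original.
    static boolean roundTrips(String text, Charset cs) {
        String encoded = URLEncoder.encode(text, cs);
        return URLDecoder.decode(encoded, cs).equals(text);
    }

    public static void main(String[] args) {
        // The id from the question, written with escapes to avoid source-encoding issues.
        String id = "\u0628\u0627\u0642\u06cc \u0631\u06cc\u0633\u0648\u0631\u0633";
        for (Charset cs : new Charset[] {StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1}) {
            System.out.println(cs + " representable=" + representable(id, cs)
                    + " roundTrips=" + roundTrips(id, cs)
                    + " encoded=" + URLEncoder.encode(id, cs));
        }
    }
}
```

This does not explain why the API in question accepted the ISO-8859-1 form; it only shows that the form is lossy, so a fallback chain over charsets would not be comparing equivalent requests.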

Is it necessarily so that you can POST a byte stream to any API that will accept a file, or does it depend on the API?

I have come to the understanding that knowing this is indicative of a lack of knowledge of how REST-like APIs work, and if someone can provide me a reference where I can learn the background behind this question, I would appreciate it. In the meantime, though, I would also appreciate help answering this question!
I have a java application that posts files from the local filesystem to an API. My goal is to instead of having millions of files sitting on the volume with all of their file handles, I want to leave the files in a .tar.gz file, and then in memory pull them out of archive and POST them without writing them to disk. I know that I can write them to disk, POST them, and then delete them, but I view that option as a last resort.
So here's code that works to POST a file that exists in the file system, not in an archive
public CloseableHttpResponse submit (File file) throws IOException {
CloseableHttpClient client = HttpClients.createDefault();
HttpPost post = new HttpPost(API_LOCATION + API_BASE);
MultipartEntityBuilder builder = MultipartEntityBuilder.create();
builder.addBinaryBody("files", file, ContentType.APPLICATION_OCTET_STREAM, null);
HttpEntity multipartEntity = builder.build();
post.setEntity(multipartEntity);
CloseableHttpResponse response = client.execute(post);
System.out.println("response: " + IOUtils.toString(response.getEntity().getContent(),"UTF-8"));
client.close();
return response;
}
I get back a JSON response from my particular API that looks like this
response: {"data":[<bunch of json>]}
I've put the same file into a .tar.gz archive and have used apache commons compress to unzip the file and pull out each file as a TarArchiveEntry, and I've tested that it works properly by writing the text file to disk and opening it manually outside of java - I am definitely getting the entry into memory correctly. I tried changing the entity attached to the POST to a ByteArrayEntity and converting the archive entry to a byte stream, but the API insists it will only accept a multipart entity. So looking at the API for MultipartEntityBuilder.addBinaryBody it appears I'm left with two options: I can either post a byte array or an InputStream. I've tried both and I can't get either to work - I'll post my example code for the byte array approach, but I can't figure out how to convert the tar archive to an InputStream - at least not without converting it to a byte array first, which seems sorta silly at that point.
public CloseableHttpResponse submit (byte[] xmlBytes) throws IOException {
CloseableHttpClient client = HttpClients.createDefault();
HttpPost post = new HttpPost(API_LOCATION + API_BASE);
MultipartEntityBuilder builder = MultipartEntityBuilder.create();
builder.addBinaryBody("files", xmlBytes, ContentType.APPLICATION_OCTET_STREAM, null);
HttpEntity multipartEntity = builder.build();
post.setEntity(multipartEntity);
CloseableHttpResponse response = client.execute(post);
System.out.println("response: " + IOUtils.toString(response.getEntity().getContent(),"UTF-8"));
System.out.println(response.getStatusLine().getStatusCode());
client.close();
return response;
}
I believe the code is identical with the exception of the data type of the input parameter. Here is my empty response, which comes with a status code 207:
response: {"data":[]}
So here is my real question: Can any API that accept files also accept a file in the form of a byte stream or byte array? Can the API tell the difference, and what is really happening when I POST a file? Does the API have to be specifically configured to accept this file in the form of a byte stream or a byte array? A link to a reference along with a short explanation would be highly appreciated - I really need to learn this stuff and understand it well.
Is there some easy-to-correct mistake that I'm making? Am I using the wrong Content-Type or something? I'm not even sure what the last argument to MultipartEntityBuilder.addBinaryBody means (the one I've left null).
Any help is appreciated, thank you very much!

It appears that an API that accepts a file doesn't care if it comes from a file object or a byte array. Per JB Nizet:
You're passing null as the file name. When passing a File as argument, the actual name of the File is used if you passed null as file name. That obviously doesn't happen if you pass a byte array. So specify a non-null file name as last argument. That can only be found out by reading the javadoc and the source code of MultipartEntityBuilder. It's open source: use that as an advantage.
In this specific case, passing an arbitrary non-null string as the last argument of the addBinaryBody method fixes the problem, and the API accepts the byte array as a file.
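To make the difference concrete, here is a simplified illustration of the Content-Disposition line that a multipart/form-data part carries on the wire. It is a sketch of the general format only, not HttpClient's actual code; the class name and the file name "entry.xml" are made up, and quoting of special characters is ignored. The receiver never sees a File object or a byte array, only the part headers and the bytes, and some server-side parsers treat a part as an uploaded file only when a filename parameter is present.

```java
// Hypothetical illustration of the multipart/form-data part header.
final class MultipartPartHeader {

    // Builds the Content-Disposition line for one part.
    // When filename is null, no filename parameter is written.
    static String contentDisposition(String fieldName, String filename) {
        StringBuilder sb = new StringBuilder("Content-Disposition: form-data; name=\"")
                .append(fieldName)
                .append('"');
        if (filename != null) {
            sb.append("; filename=\"").append(filename).append('"');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Byte array with a null file name: no filename parameter.
        System.out.println(contentDisposition("files", null));
        // Byte array with an explicit file name: looks like a file upload.
        System.out.println(contentDisposition("files", "entry.xml"));
    }
}
```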

ISO-8859-1 encoded strings out of /into JSON in Java

My application has a Java servlet that reads a JSONObject out of the request and constructs some Java objects that are used elsewhere. I'm running into a problem because there are strings in the JSON that are encoded in ISO-8859-1. When I extract them into Java strings, the encoding appears to get interpreted as UTF-16. I need to be able to get the correctly encoded string back at some point to put into another JSON object.
I've tried mucking around with ByteBuffers and CharBuffers, but then I don't get any characters at all. I can't change the encoding, as I have to play nicely with other applications that use ISO-8859-1.
Any tips would be greatly appreciated.
It's a legacy application using Struts 1.3.8. I'm using net.sf.json 2.2.4 for JSONObject and JSONArray.
A snippet of the parsing code is:
final JSONObject a = (JSONObject) i;
final JSONObject attr = a.getJSONObject("attribute");
final String category = attr.getString("category");
final String value = attr.getString("value");
I then create POJOs using that information, that are retrieved by another action class to create JSON to pass to the client for display, or to pass to other applications.
So to clarify, if the JSON contains the string "Juan Guzmán", the Java String contains something like Juan Guzm?_An (I don't have the exact one in front of me). I'm not sure how to get the correct diacritical back. I believe that if I can get a Java String that contains the correct representation, that Mezzie's solution, below, will allow me to create the string with the correct encoding to put back into the JSON to serve back.

I had the same issue, and I am using the same technology as you are. In our case the data was UTF-8, so if yours is UTF-16, just change "UTF-8" to "UTF-16" below.
public static String UTF8toISO( String str )
{
    try
    {
        // Recover the raw bytes, then decode them as UTF-8.
        return new String( str.getBytes( "ISO-8859-1" ), "UTF-8" );
    }
    catch ( UnsupportedEncodingException e )
    {
        e.printStackTrace();
    }
    return str;
}
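To see why re-decoding can work, here is a small round trip using the name from the question. It is a sketch under the assumption that the damage is UTF-8 bytes decoded as ISO-8859-1; the class and method names are invented for the example. ISO-8859-1 maps every byte value to exactly one character, so getBytes with that charset recovers the original bytes without loss. Using StandardCharsets also avoids the checked UnsupportedEncodingException.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration of the fault and its reversal.
final class MojibakeRepair {

    // Simulates the fault: UTF-8 bytes interpreted as ISO-8859-1.
    static String misdecode(String correct) {
        return new String(correct.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverses it: recover the raw bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "Juan Guzm\u00e1n";
        String garbled = misdecode(name);
        System.out.println(garbled.length() + " chars after misdecoding, "
                + repair(garbled).length() + " after repair");
    }
}
```

If the string was damaged some other way, for example if characters were already replaced with '?', the bytes are gone and this repair cannot bring them back.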

"En dash" being garbled during http response handling or text manipulation

I'm writing code to work with text from Wikipedia and am having issues with en dashes being garbled. I haven't worked with en dashes or other non-standard characters before (non-standard to me being characters that don't appear on my keyboard ;), so I'm not sure where to point the finger or what I'm doing wrong. Here's what is happening, along with code snippets.
I send a request to Wikipedia (I'm using the Apache HttpComponents client API for communicating with Wikipedia) for the contents of an article and save it in a String:
DefaultHttpClient client = new DefaultHttpClient();
HttpGet queryRequest = new HttpGet(query); // query is the URL for retrieving the article contents.
ResponseHandler<String> responseHandler = new BasicResponseHandler();
String responseBody = client.execute(queryRequest, responseHandler);
At this point if I were to send "responseBody" to System.out, en dashes are displayed in my Eclipse console as '?'. This might just be an Eclipse console display issue so I'll move on.
I manipulate the text, ignoring the en dashes, and then send the text back to Wikipedia.
List<NameValuePair> postParams = new ArrayList<NameValuePair>();
postParams.add(new BasicNameValuePair("text", content)); // content is a String with the article text
UrlEncodedFormEntity entity = new UrlEncodedFormEntity(postParams, "UTF-8");
HttpPost queryRequest = new HttpPost(url); // url is the basic URL for the Wikipedia api
queryRequest.setEntity(entity);
queryRequest.addHeader("Content-Type", "application/x-www-form-urlencoded");
ResponseHandler<String> responseHandler = new BasicResponseHandler();
String responseBody = client.execute(queryRequest, responseHandler);
When the text, now uploaded to Wikipedia, is displayed in a web browser what was en dashes before are now displayed as '?' in a box (unknown character?). Therefore, somewhere I am inadvertently changing or miscoding the en dashes, but I'm not sure exactly where.
Can someone point me in the right direction?

Now for the real answer. The problem with the non-English characters getting mangled had nothing to do with Apache HttpComponents or with any Java string handling or manipulation. The problem was with the Eclipse IDE running on Windows.
By default, an Eclipse run configuration uses the system's default encoding, which is Cp1252 on Windows. Cp1252 cannot represent every Unicode character, so problems arise. I found the solution here. In Eclipse, open Run Configurations, select the configuration for the project you are running, and go to the 'Common' tab. In the encoding section, change the setting from "Default" to "Other" and set the encoding to UTF-8.
All is now well.
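The '?' symptom can be reproduced without Eclipse. The sketch below (Java 10 or later; class and method names are made up) prints an en dash through a PrintStream bound to a given charset and returns the bytes that would reach the console. US-ASCII is used purely as a stand-in for "a charset that lacks the character"; the example makes no claim about which characters Cp1252 itself covers.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration of how output encoding affects what is printed.
final class ConsoleEncodingDemo {

    // Prints the text through a PrintStream using the given charset
    // and returns the bytes that were written.
    static byte[] printedBytes(String text, Charset cs) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, cs)) {
            out.print(text);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset());
        // An unmappable character is written as a single '?' byte (0x3F).
        System.out.println(printedBytes("\u2013", StandardCharsets.US_ASCII).length);
        // UTF-8 writes the en dash as three bytes: E2 80 93.
        System.out.println(printedBytes("\u2013", StandardCharsets.UTF_8).length);
    }
}
```

Printing Charset.defaultCharset() from inside the run configuration is a quick way to confirm whether the encoding setting took effect.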

I have yet to figure out why the en dash is getting mangled. I do have a (possibly kludgy) fix in the meantime.
String unknownUTF = String.copyValueOf(Character.toChars(65533));
content = content.replace(unknownUTF, "\u2013");
I'm basically replacing every instance of the Unicode replacement character (U+FFFD, code point 65533) with the en dash character. This only works if the original content contains no other characters that are being converted into the replacement character.
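For context, U+FFFD is what Java's decoder substitutes when it meets bytes that are not valid in the charset being decoded. The following sketch shows one way that can happen, assuming the text arrived as windows-1252 bytes and was decoded as UTF-8; whether that is what happened in this case is not established, and the class and method names are invented. In windows-1252 the byte 0x96 is an en dash and 0x97 is an em dash, yet both decode to the same replacement character, which is why mapping U+FFFD back to an en dash is a guess.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration of where U+FFFD can come from.
final class ReplacementCharDemo {

    static final char REPLACEMENT = '\uFFFD';

    // Decodes raw bytes as UTF-8; malformed input becomes U+FFFD.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Counts replacement characters, i.e. places where information was lost.
    static long countReplacements(String s) {
        return s.chars().filter(c -> c == REPLACEMENT).count();
    }

    public static void main(String[] args) {
        String fromEnDash = decodeUtf8(new byte[] {(byte) 0x96});
        String fromEmDash = decodeUtf8(new byte[] {(byte) 0x97});
        // Two different original characters end up indistinguishable.
        System.out.println(fromEnDash.equals(fromEmDash));
        System.out.println(countReplacements("a" + fromEnDash + "b" + fromEmDash));
    }
}
```

Counting the replacement characters before substituting gives at least a rough idea of how much content the workaround is guessing at.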

Concise example of file upload via Java lib Apache Commons

[edit]
I've removed my convoluted and badly malformed question so that it doesn't detract from the very neat and correct answer beneath. Given the (surprising) difficulty of finding an on-line example for doing this incredibly common task, I hope Yoni gets a few more up-ticks for his response.
So... the question in a nutshell...
How do I use Apache Commons to upload a file to some destination? I'm using it on Android and uploading to a PHP script, but obviously it can work from any Java program and to any HTTP-based listener.

From the api of MultipartRequestEntity:
File f = new File("/path/fileToUpload.txt");
PostMethod filePost = new PostMethod("http://host/some_path");
Part[] parts = {
new StringPart("param_name", "value"),
new FilePart(f.getName(), f)
};
filePost.setRequestEntity(
new MultipartRequestEntity(parts, filePost.getParams())
);
HttpClient client = new HttpClient();
int status = client.executeMethod(filePost);
I don't think you need the content-disposition part, that is used for the other direction (when the browser downloads a file and needs to know what to do with it).
getParams.setParameter is optional. You can also set it directly on the HttpClient instance.
AFAIK, the order of setting request headers is irrelevant, as long as they are all set before you set the request body.
