Saving Chinese to MongoDB 2.4.8 causes unreadable strings - java

Before, I used MongoDB 2.0.6 and everything was fine.
Recently I started using MongoDB 2.4.8 with the Java Play Framework, and I found that when I save Chinese text, MongoDB actually stores it as an unreadable string such as &#21457;&#29983;; what is shown on the web is the same string. Does anyone know why?
What should I do? How can I convert it back to readable Chinese?

I think your string gets converted to an unreadable string somewhere in between, as I tested this in the console and it works fine for me:
$ mongo test
MongoDB shell version: 2.4.8
connecting to: test
> var doc = { "message" :"你好" }
> db.ChineseWord.save(doc)
> db.ChineseWord.find().pretty()
{ "_id" : ObjectId("529da2018170273efa43e181"), "message" : "你好" }

From what you have posted I suspect that this may be an artefact of the Play Framework, as both these characters can be stored directly in MongoDB.
> db.test1.insert({x:"𡑗 and 𩦃"})
> db.test1.find();
{ "_id" : ObjectId("52a12237e7c9d6190f6feb95"), "x" : "𡑗 and 𩦃" }
The characters you posted as &#21457; and &#29983; are the decimal character references for 发 (U+53D1) and 生 (U+751F), so I would suspect that the Play Framework is converting your text into a representation of the characters' Unicode values rather than passing the raw text through. (The two characters in my example above are from the "CJK Unified Ideographs Extension B" block.)
You can view that whole set of characters here: http://codepoints.net/cjk_unified_ideographs_extension_b
This looks to be a similar issue to one reported in the play-framework Google group.

I just wrote a quick test and this works just fine.
package com.mongodb;

import com.mongodb.util.TestCase;
import org.junit.Assert;
import org.junit.Test;

public class EncodingTest extends TestCase {
    String chinese = "你好";

    @Test
    public void saveChinese() {
        DBCollection collection = getDatabase().getCollection("chinese");
        collection.insert(new BasicDBObject().append("message", chinese));
        DBObject object = collection.findOne();
        Assert.assertEquals(chinese, object.get("message"));
    }
}
That text saves and loads without error. It would help to see what code you're using to test.

While I have no experience with the Play Framework specifically, the general approach to resolving your issue is to log/dump the string right before it is passed to your MongoDB driver:
If the string is still encoded as UTF-8 text, not entities (&#...), check whether your MongoDB driver for 2.4 added some new option that converts UTF-8 into entities.
If the string has already been converted to entities, you have at least ruled out the MongoDB driver and should track down the conversion within the Play Framework instead.
As others have mentioned, MongoDB itself does not care whether your input contains entities or not, as long as it is UTF-8 encoded; the Play Framework or the MongoDB driver is the more likely culprit.
PS: I assume "unreadable" means the text was converted to entities (&#...), not encoded incorrectly.
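To apply the logging suggestion above, one quick way to see which case you are in is to print the code points of the string right before it reaches the driver. This is a minimal sketch with a hard-coded string (&#21457;&#29983; are the decimal character references for 发生); in your app you would dump the actual value instead:

```java
public class CodePointDump {
    public static void main(String[] args) {
        // Healthy UTF-8 text prints the code points 21457 and 29983.
        // If instead you see 38 ('&'), 35 ('#') and digit code points, the
        // string was already converted to entities upstream of the driver.
        String s = "发生";
        s.codePoints().forEach(System.out::println);
    }
}
```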

Related

AWS: how to fix S3 event replacing space with '+' sign in object key names in json

I have a lambda function to copy objects from bucket 'A' to bucket 'B', and everything was working fine until an object named 'New Text Document.txt' was created in bucket 'A'. In the JSON built for the S3 event, the key appears as "key": "New+Text+Document.txt".
The spaces got replaced with '+'. I know it is a known issue from searching the web.
But I am not sure how to fix it, because the incoming JSON itself can contain a '+', and '+' can actually be part of the file name, like 'New+Text Document.txt'.
So I cannot blindly replace '+' with ' ' in my lambda function.
Due to this issue, when the code tries to find the file in the bucket, it fails to find it.
Please suggest a fix.
I came across this looking for a solution for a lambda written in Python instead of Java; urllib.parse.unquote_plus worked for me and properly handled a file name with both spaces and '+' signs:
from urllib.parse import unquote_plus
import boto3
bucket = 'testBucket1234'
# uploaded file with name 'foo + bar.txt' for test, s3 Put event passes following encoded object_key
object_key = 'foo %2B bar.txt'
print(object_key)
object_key = unquote_plus(object_key)
print(object_key)
client = boto3.client('s3')
client.get_object(Bucket=bucket, Key=object_key)
NodeJS, Javascript or Typescript
Since we are sharing for other runtimes here is how to do it in NodeJS:
const srcKey = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, " "));
I would say this is an official solution, since it comes from the AWS docs.
What I have done to fix this is:
java.net.URLDecoder.decode(b.getS3().getObject().getKey(), "UTF-8")
{
    "Records": [
        {
            "s3": {
                "object": {
                    "key": "New+Text+Document.txt"
                }
            }
        }
    ]
}
So now the JSON value "New+Text+Document.txt" gets correctly converted to "New Text Document.txt".
This has fixed my issue. Please suggest whether this is the right solution, and whether there are any corner cases that could break my implementation.
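On the corner-case question: URLDecoder.decode performs x-www-form-urlencoded decoding, which is the encoding S3 event keys use, so a literal '+' in a file name arrives as %2B and is restored correctly. A runnable sketch (the key names are made up for illustration):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class S3KeyDecodeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // A space in the object name arrives as '+':
        System.out.println(URLDecoder.decode("New+Text+Document.txt", "UTF-8"));
        // A literal '+' in the object name arrives as '%2B', so nothing is lost:
        System.out.println(URLDecoder.decode("New%2BText+Document.txt", "UTF-8"));
    }
}
```

Because S3 percent-encodes a real '+' as %2B before building the event, decoding cannot confuse it with an encoded space.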
I think in Java you should use the
getS3().getObject().getUrlDecodedKey()
method, which returns the decoded key, instead of
getS3().getObject().getKey()
ASP.NET has UrlDecode. A sample is below:
HttpUtility.UrlDecode(s3entity.Object.Key, Encoding.UTF8)
I was facing the same issue with special characters, as the AWS S3 event URL-encodes them in the object key. To resolve it, I used the AWS SDK's decode API SdkHttpUtils.urlDecode(String key) to decode the object key, and it worked as expected.
You can check below link to get more details about SdkHttpUtils API.
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/util/SdkHttpUtils.html#urlDecode-java.lang.String-
Agree with Scott. For me, the create-object event was encoding the colon ':' as '%3A', and I had to replace it in two steps to get the correct S3 key.
Python code:
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    logger.info('Event: %s' % json.dumps(event))
    source_bucket = event['Records'][0]['s3']['bucket']['name']
    key_old = event['Records'][0]['s3']['object']['key']
    # '%3A' is the URL encoding of ':'; the two replaces below undo it
    key_new = key_old.replace('%3', ':')
    key = key_new.replace(':A', ':')
    logger.info('key value')
    logger.info(key)

non-basic characters in java, how to handle the encoding correctly

When I try to call a method with a parameter in my Polish language, e.g.
node.call("ąćęasdasdęczć")
I get these characters as input:
Ä?Ä?Ä?asdasdÄ?czÄ
I don't know where to set the correct encoding: in the Maven pom.xml, or in my IDE? I tried changing UTF-8 to ISO-8859-2 in my IDE settings, but it didn't work. I searched similar questions, but I didn't find the answer.
Edit 1
Sample code:
public void findAndSendKeys(String vToSet, By vLocator){
    WebElement element;
    element = webDriverWait.until(ExpectedConditions.presenceOfElementLocated(vLocator));
    element.sendKeys(vToSet);
}
By nameLoc = By.id("First_Name");
findAndSendKeys("ąćęasdasdęczć" , nameLoc );
Then in the input field I get Ä?Ä?Ä?asdasdÄ?czÄ. Converting the string to Basic Latin in my IDE helps, but it's not the solution I need.
I also have problems with fields in classes, e.g. a class in which I have to convert the String to Basic Latin:
public class Contacts{
    private static final By LOC_ADDRESS_BTN = By.xpath("//button[contains(@aria-label,'Wybór adresu')]");
    // it doesn't work; I have to use Basic Latin and replace "ó" with "\u00f3" in my IDE
}
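For what it's worth, the exact shape of the garbage in the question is what you get when UTF-8 bytes are decoded with a wrong single-byte charset. This sketch reproduces it (ISO-8859-1 is an assumption about which wrong charset is involved), which points at a source-file/runtime encoding mismatch rather than at Selenium:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "ąćęasdasdęczć";
        // Encode as UTF-8, then decode the bytes with the wrong charset:
        String garbled = new String(original.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        // Each Polish letter becomes "Ä" followed by an unprintable byte,
        // which renders as "Ä?" - the same shape as in the question.
        System.out.println(garbled);
    }
}
```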
Edit 2 - changed the encoding, but the problem still exists (screenshot omitted)
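On the question of where the encoding goes in the Maven pom.xml: the standard place is the project-wide source-encoding property, which the compiler plugin picks up. This is a sketch of the build side only; the IDE's file encoding must match it as well:

```xml
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
```

In IntelliJ, also set the File Encodings settings to UTF-8 so the editor saves sources in the encoding the build expects.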

UTF-8 for URL, Java

So I'm trying to scrape a grammar website that gives you conjugations of verbs, but I'm having trouble accessing the pages that require accents, such as the page for the verb "fág".
Here is my current code:
String url = "http://www.teanglann.ie/en/gram/"+ URLEncoder.encode("fág","UTF-8");
System.out.println(url);
I've tried this both with and without the URLEncoder.encode() method, and it just keeps giving me a '?' in place of the 'á', and my URL search returns nothing. Basically, I was wondering if there is something similar to Python's urllib.parse.quote_plus. I've tried many different methods from Stack Overflow, all to no avail. Any help would be greatly appreciated.
Eventually, I'm going to replace the given string with a user-inputted argument; I'm just using this one to test at the moment.
Solution: It wasn't Java, but IntelliJ.
Summary from comment
The test code works fine.
import java.io.UnsupportedEncodingException;
import static java.net.URLEncoder.encode;

public class MainApp {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String url = "http://www.teanglann.ie/en/gram/" + encode("fág", "UTF-8");
        System.out.println(url);
    }
}
It emits the following:
http://www.teanglann.ie/en/gram/f%C3%A1g
which goes to the correct page.
Correct steps are:
Ensure that the source code encoding is correct (IntelliJ probably cannot guess it all correctly).
Run the program with the appropriate encoding (UTF-8 in this case).
(See "What is the default encoding of the JVM?" for a relevant discussion.)
Edit from Wyzard's comment:
The code above works only by accident (the URL happens to contain no whitespace; URLEncoder does HTML form encoding, not URL encoding). The correct way to get an encoded URL is as below:
..
String url = "http://www.teanglann.ie/en/gram/fág";
System.out.println(new URI(url).toASCIIString());
This uses URI.toASCIIString(), which adheres to RFC 2396, "Uniform Resource Identifiers (URI): Generic Syntax".
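A self-contained version of that suggestion; URI.create avoids the checked exception of the URI constructor:

```java
import java.net.URI;

public class UriEncodeDemo {
    public static void main(String[] args) {
        // toASCIIString() percent-encodes the non-ASCII 'á' as UTF-8 (%C3%A1)
        // while leaving the URL structure (scheme, slashes) intact:
        String url = URI.create("http://www.teanglann.ie/en/gram/fág").toASCIIString();
        System.out.println(url); // http://www.teanglann.ie/en/gram/f%C3%A1g
    }
}
```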

Error when using Esapi validation

I hope someone can help me with an issue.
I'm using OWASP ESAPI 2.1.0 with Java EE to help me validate some entries in a web application. At some point I needed to validate a Windows file path, so I added a new property entry in 'validation.properties' like this one:
Validator.PathFile=^([a-zA-Z]:)?(\\\\[\\w. -]+)+$
When I try to validate, for example, a string like "C:\TEMP\file.txt" via ESAPI, I get a ValidationException:
ESAPI.validator().getValidInput("PathFile", "C:\\TEMP\\file.txt", "PathFile", 100, false);
Alternatively, I also tried the java.util.regex.Pattern class to test the same regular expression with the same string example and it works OK:
Pattern.matches("^([a-zA-Z]:)?(\\\\[\\w. -]+)+$", "C:\\TEMP\\file.txt")
I must say that I added other regexes to 'validation.properties' and they worked OK. Why is this one so hard? Could anyone help me out with it?
This is happening because the call to validator().getValidInput("PathFile", "C:\\TEMP\\file.txt", "PathFile", 100, false) wraps a call to ESAPI.encoder().canonicalize(), which transforms the input into the character sequence (not the literal string!) C:TEMP'0x0C'ile.txt before it is passed to the regex engine.
Except for the second "\" getting converted to the character 0x0C, this is normally desired behavior; that conversion could be a bug in ESAPI.
What you want is to make a call to ESAPI.validator().getValidDirectoryPath() instead.
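To see why the second backslash ends up as 0x0C: "\f" is the standard escape sequence for the form-feed character, so a canonicalizer that interprets backslash escapes mangles Windows paths. This sketch only simulates that interpretation with plain string replaces; it is not ESAPI's actual decoder:

```java
public class CanonicalizeDemo {
    public static void main(String[] args) {
        String input = "C:\\TEMP\\file.txt"; // the literal path C:\TEMP\file.txt
        // Simulate escape interpretation: "\T" is not a known escape, so its
        // backslash is dropped; "\f" is the form-feed escape, so it becomes
        // the control character 0x0C.
        String canonical = input.replace("\\T", "T").replace("\\f", "\f");
        System.out.println((int) canonical.charAt(6)); // prints 12 (0x0C)
    }
}
```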

HTML entity decoding in Java: apostrophe

I have to decode, using Java, HTML strings which contain the entities "&#39;" and "&apos;".
I'm using Apache Commons Lang, but it doesn't decode those two entities, so I'm currently doing as follows, though I'm looking for the fastest way to do this.
import org.apache.commons.lang.StringEscapeUtils;

public class StringUtil {
    public static String decodeHTMLString(String s) {
        return StringEscapeUtils.unescapeHtml(s.replace("&apos;", "'").replace("&#39;", "'"));
    }
}
I searched for older questions, but none seems to answer my question.
Well, I would imagine that part of the problem is that one of your entities is double encoded: "&amp;#39;". That will not be turned into an apostrophe by any decoder.
As for "&apos;", apparently that one is not technically part of the HTML 4 entity set (it is defined in XML/XHTML).
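If you want to stay on the standard library, a minimal decoder for exactly these two cases (decimal references like &#39; plus the XML-only &apos;) can be sketched as below. The helper name is made up for illustration, and for arbitrary HTML a real parser is the better choice; this handles only the entities discussed here:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecodeDemo {
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d+);");

    // Hypothetical helper: decodes &apos; plus decimal references like &#39;.
    static String decodeApostrophes(String s) {
        Matcher m = DECIMAL_REF.matcher(s.replace("&apos;", "'"));
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            m.appendReplacement(out,
                    Matcher.quoteReplacement(new String(Character.toChars(codePoint))));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeApostrophes("it&#39;s &apos;quoted&apos;"));
    }
}
```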
