Emojis in Tweets showing as "?" in MongoDB - java

Currently, I am collecting tweets based on emotions and analysing them. The tweets contain emojis, but during collection each emoji simply comes back as a question mark.
For example:
Original tweet (in Twitter):
lipton ice tea💛
After collection (in MongoDB):
lipton ice tea?
I am using the Twitter4J Java package with MongoDB.

MongoDB uses UTF-8 by default, so unless you configured it not to, it is perfectly capable of storing the emojis.
One time I spent a whole week banging my head against the wall because MongoDB wouldn't store Latin special characters. It turned out MongoDB worked just fine; it was Log4j that wasn't configured to print logs using UTF-8, so all I saw in the logs was ???? instead of ñáçü.
If you connect to your MongoDB instance using the Mongo Shell (<mongo installation dir>/bin/mongo.exe on Windows), as I did, and query your data, you should be able to see the emojis. Here's a quick reference for the Mongo Shell.
Your problem lies in your JSON viewer, or in the encoding of the strings you're sending to MongoDB.
In Java, you might want to set the file.encoding system property to UTF-8, to make sure your program uses the right encoding when reading from files, input streams, etc.
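Even safer than relying on file.encoding is passing the charset explicitly at every read and write. A minimal sketch of the round trip (class and file names are illustrative):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8RoundTrip {
    public static void main(String[] args) throws IOException {
        // "lipton ice tea" followed by the yellow heart emoji (U+1F49B)
        String tweet = "lipton ice tea\uD83D\uDC9B";

        Path tmp = Files.createTempFile("tweet", ".txt");
        // Pass the charset explicitly instead of trusting the platform default
        Files.write(tmp, tweet.getBytes(StandardCharsets.UTF_8));
        String readBack = new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
        Files.delete(tmp);

        System.out.println(tweet.equals(readBack)); // the emoji survives the round trip
    }
}
```

If this prints true but the emoji still shows as ? somewhere downstream, the corruption is happening in that downstream component, not in your Java strings.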

If you're using Robomongo, this is a Robomongo problem:
Robomongo displays a ? instead of emojis in table mode.

Related

UTF-8 not working in Servlets with MySQL via JDBC

I have a Java Servlet running on a Tomcat Server with a MySQL database connection using JDBC.
If the following line is present, the hard-coded HTML is displayed correctly, but everything that comes from the database is displayed incorrectly:
response.setContentType("text/html;charset=UTF-8")
If I remove the line, the text from the database is displayed correctly, but not the basic HTML.
In the database and in Eclipse, everything is set to UTF-8.
At first sight it looks as if the text from the database is being converted once too often.
So the first check is the database: for instance, the length of "löl" should be 3. Verify that the data is stored correctly and read correctly. As @StanislavL mentioned, not only does the database need the proper encoding; in MySQL the Java driver that communicates with it also needs to be told the encoding with ?useUnicode=yes&characterEncoding=UTF-8.
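For MySQL Connector/J those parameters go on the JDBC URL. A sketch (host, database and credentials are placeholders):

```java
public class JdbcUrlExample {
    public static void main(String[] args) {
        // Placeholder host and database; the query parameters are the relevant part
        String url = "jdbc:mysql://localhost:3306/mydb"
                + "?useUnicode=yes&characterEncoding=UTF-8";
        // This URL would normally be passed to DriverManager.getConnection(url, user, password)
        System.out.println(url);
    }
}
```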
Maybe write or debug a small piece of code reading the database.
If stored correctly the culprit might be String.getBytes() or new String(bytes).
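A small illustration of how String.getBytes() / new String(bytes) with mismatched charsets produces exactly this kind of corruption ("löl" really is 3 characters, but its UTF-8 form is 4 bytes):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String s = "löl";
        System.out.println(s.length());  // 3 characters

        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 4 bytes: 'ö' takes two in UTF-8

        // Decoding UTF-8 bytes as Windows-1252 is the classic double-conversion bug
        String garbled = new String(utf8, Charset.forName("windows-1252"));
        System.out.println(garbled);     // "lÃ¶l"
    }
}
```

If you see two-character sequences like Ã¶ in your output, you are almost certainly decoding UTF-8 bytes with a single-byte charset somewhere.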
In the browser inspect the encoding or save the pages.
With a programmer's editor like Notepad++ or JEdit, inspect the HTML. These tools allow reloading the file with a different encoding, so you can see which encodings are actually in play.
Most likely the first page is in UTF-8 and the second in Windows-1252 or some other encoding.
Ensure that the HTML source text is correct: you might use "\u00FC" for ĂŒ in a JSP.

Special characters incorrectly stored in Postgres DB

Project is based on
Postgres database version 9.3.5,
Java 7, org.hibernate hibernate-core 3.6.10.Final
Problem :
I have two separate systems running the same web application. On one of the systems everything is persisted correctly; on the other, Strings containing Unicode characters that are sent to the Postgres database, such as 'nnés', are persisted as 'nns' or 'nnés-2'. The only difference I noticed between the two systems is that one displays UNICODE and the other UTF8 as the client encoding when running SHOW client_encoding; in the console. The one running UNICODE works correctly; the other does not.
My question is
Is it possible that the client encoding got stuck/hardcoded somehow and is not being selected based on the real client encoding, which would mean the strings sent in Unicode aren't converted to UTF8 but just persisted as-is?
What can be the reasons for such behavior?
try request.setCharacterEncoding("UTF-8");
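The dropped or substituted characters usually come from encoding through a charset that cannot represent them. A minimal illustration of how an 'é' becomes a '?' (the string is the one from the question):

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    public static void main(String[] args) {
        String s = "nnés";
        // US-ASCII cannot represent 'é'; getBytes substitutes '?' for it
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // nn?s
    }
}
```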

iOS - Java - Issue with POST containing Unicode

Our iOS app talks to Java web services and saves data to a MySQL DB.
Now the problem is: whenever data containing Unicode characters is sent from iOS to the web services, it is not saved in the DB. In place of the Unicode characters it stores "?".
We checked the DB and made sure its collation is set to UTF-8 by default. And if we store those characters manually via any IDE, they display correctly there.
We checked whether the Unicode characters are reaching the web services or not.
Then we made a small Java application that just fires a query to insert a row with Unicode data in it. But even that Java app cannot insert the hard-coded string into the DB.
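One frequent cause with MySQL specifically: the legacy utf8 charset stores at most 3 bytes per character, while emoji and other characters outside the BMP need 4 bytes in UTF-8, so the column and connection charset must be utf8mb4. A quick sketch of the byte count (using 💛 as an example character):

```java
import java.nio.charset.StandardCharsets;

public class EmojiBytes {
    public static void main(String[] args) {
        String emoji = "\uD83D\uDC9B"; // 💛, U+1F49B, outside the BMP
        int byteLen = emoji.getBytes(StandardCharsets.UTF_8).length;
        // 4 bytes: too wide for MySQL's legacy 3-byte "utf8"; use utf8mb4 instead
        System.out.println(byteLen); // 4
    }
}
```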

Maximizer CRM Data Migration: Decoding Email Text from a Binary Image Column

Hi. Firstly, I realise this is not a direct programming question as it is more data related, so if it needs to go elsewhere, fair enough.
I'm trying to extract email message text from a Maximizer CRM system for a migration.
This information appears to reside in AMGR_Letter_Tbl; however, I'm having a few issues decoding it.
The column is described in Maximizer CRM's documentation at database level as a "Binary Image". This appears accurate, and for some entries (Documents) in the table, casting via MSSQL yields a readable response (see the bottom two rows in my query results).
However, for email messages there appears to be at least one extra level of encoding or encryption applied (see my decoding attempts below).
I'm hoping someone will either have encountered this issue before, know from experience with Maximizer CRM what's needed, or recognise the next step from my decoding attempts.
If you do, please describe what decoding, casting or other procedures are needed, and their required order of application.
I will be fitting this into a bigger Talend migration once I know what decoding is needed, so any code examples in Talend OS or Java would be appreciated.
Cheers Andy
Just to let you know, I solved this issue: the contents of the table are Microsoft OLE encoded (OLE file format).
I now have an extraction method that recovers documents, emails and email attachments stored this way.
The scripts use a number of system tools and third-party Java libraries controlled via Talend.
However, I can't give too much away, as creating these took a significant amount of time and effort.
If you want more info, please get in contact directly.
Cheers

Unicode characters from a Rails app appear as ??? in the MySQL database

Unicode characters from a Rails app appear as ??? in the MySQL database (i.e. when I view them through PuTTY or a Linux console), but my Rails app reads them properly and shows them as intended. I have another Java application which reads from the Rails database, stores the values in its own database, and then tries to show them from its own database. But in the webpage they appear as ??? instead of the Unicode characters.
How is it that the Rails application can show them properly but the Java application cannot? Do I need to specify an encoding within the Java application?
You really need to find out whether it's the Java app that's wrong, the Rails app that's wrong, or both. Using PuTTY or a Linux console isn't a great way of checking this, as they may well not support the relevant Unicode characters. Ideally, you should find a GUI client you can connect to the database with, and use that to check the values. Alternatively, find some MySQL functions which will return the Unicode code points directly.
It's quite possible that the Rails app is doing the wrong thing, but in a reversible way (possibly not always reversible - you may just be lucky at the moment). I've seen this before, where a developer has consistently used the same incorrect code page when both encoding and decoding text, and managed to get the right results out without actually storing the correct data. Obviously that screws up any other system trying to get at the same data.
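On the Java side, the equivalent of checking code points directly is String.codePoints(), which sees through UTF-16 surrogate pairs. A quick sketch (using 💛 as an example character):

```java
public class CodePointCheck {
    public static void main(String[] args) {
        String s = "\uD83D\uDC9B"; // 💛, stored as a surrogate pair in the String
        // One code point, U+1F49B, even though length() reports 2 chars
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
        System.out.println(s.length()); // 2 (UTF-16 code units)
    }
}
```

If the code points you print from the value fetched over JDBC don't match what the Rails app wrote, the corruption happened on the way into or out of the database rather than in the display layer.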
You may want to check the connection parameters: http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html
I guess your Java application may be using the wrong encoding when reading from the Rails database, the wrong encoding for its own database, or the wrong encoding in its connections to either.
