Foreign language character decoding in Java (&iacute;) - java

I want to display characters of foreign languages in Jasper Reports. The report passes the text to Java code for RTF formatting. Unfortunately the MySQL database returns the string in HTML-entity form, like
&iacute;
What I want to display is
í
Any suggestions on how to do this in Java?
Text: bebida fría
From database: bebida fr&iacute;a

Those are HTML entities. You can use StringEscapeUtils.unescapeHtml4 from the Apache Commons Lang library.
It still remains to be seen how your RTF handles Unicode.
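A minimal sketch, assuming commons-lang3 is on the classpath (in newer versions the same method lives in org.apache.commons.text.StringEscapeUtils):
import org.apache.commons.lang3.StringEscapeUtils;

String fromDb = "bebida fr&iacute;a";
String decoded = StringEscapeUtils.unescapeHtml4(fromDb); // resolves the HTML entities
System.out.println(decoded); // bebida fría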

If I understand your question correctly, you could use a Unicode literal:
System.out.println("bebida fr\u00EDa");
The output is (as requested)
bebida fría

Check the database table encoding. You can also try encoding your string explicitly; note that the String constructor needs an explicit charset too, otherwise the platform default is used:
ByteBuffer encoded = Charset.forName("UTF-8").encode(myString);
String encodedStr = new String(encoded.array(), 0, encoded.limit(), StandardCharsets.UTF_8);

Related

Character coding between mysql and java

I have an error in printing special characters in Java.
The system reads a product name from a MySQL database, and checking the database from the command line, the data displays the registered symbol ® correctly.
A Java program then runs a database read to get order information to print out as a PDF, but when the print is produced the ® symbol becomes 'fi'.
Is there a way of retaining the MySQL character coding when handling it in Java?
Before printing to the PDF, you can replace the special characters with the corresponding Unicode characters, as below (note that the method must return the result):
public static String specialCharactersConversion( String charString ) {
    if ( isNotEmpty( charString ) ) {
        // the literal "(R)" in the source text becomes the registered sign U+00AE
        charString = charString.replaceAll( "\\(R\\)", "\u00AE" );
    }
    return charString;
}
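For example (a hypothetical call, assuming the symbol arrives from the database as the literal text "(R)"):
String fixed = specialCharactersConversion( "Acme Widget(R)" ); // "Acme Widget®"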
There is an open-source Java library, MgntUtils, that has a utility that converts Strings to Unicode sequences and vice versa:
result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World
So, before converting your text to PDF, you can convert special characters or the entire text to Unicode sequences. The answer is copied, with modifications, from this question: Convert International String to \u Codes in java
The library can be found at Maven Central or on Github. It comes as a Maven artifact with sources and javadoc.
Here is the javadoc for the class StringUnicodeEncoderDecoder.

Reversed Hebrew or numbers after using iText to parse a PDF document

I'm working with iText 5 to parse a PDF written mostly in Hebrew.
To extract the text I use PdfTextExtractor.getTextFromPage. I didn't find a way to change the encoding in the library, and the text comes out as gibberish.
I tried to fix the encoding like this:
new String(pdfPage.getBytes(Charset1), Charset2).
I went through all the charsets from Charset.availableCharsets(), and a few of them gave me Hebrew instead of gibberish, but reversed.
Now I thought I could reverse the text line by line, but Hebrew is right-to-left while numbers and English are left-to-right. So if I reverse the line, it fixes the Hebrew but breaks the numbers/English.
Example:
PdfTextExtractor.getTextFromPage returns 87.55 úåáééçúä ééåëéð ë"äñ
new String(text.getBytes(Charset.forName("ISO-8859-1")), Charset.forName("windows-1255")) returns 87.55 תובייחתה ייוכינ כ"הס
if I reverse this then I get סה"כ ניכויי התחייבות 55.78
The number should be 87.55 and not 55.78
The only solution I found is to split it into Hebrew and the rest (English/numbers) and reverse only the Hebrew parts and then merge it back.
Isn't there an easier solution? I feel like I'm missing something with the encoding/RTL
I can't share the document I'm working on because it contains PII. But after searching Google for PDFs with gibberish, I found this document - the last paragraph of the document has exactly the same problem I have in my documents.
I can only analyze the data given, so in this case only the linked government paper, from which the paragraph in question (shown as an image in the original answer) is extracted as
ìëéî ìù "íééç éøåùéë" øôñá ,äéãôåìòôäá íéáø úåðåéòø ãåò àåöîì ïúéð 􀂛
.ãåòå úéëåðéçä äééæëøîá ,567 'îò ,ïîöìæ éìéðå ì÷ðøô äéæø ,ïîæåø
And in this case the reason for the gibberish output is simple: The PDF claims that this gibberish is indeed the text there!
Thus, the problem is not the text extractor, be it the iText PdfTextExtractor, Adobe Reader copy&paste, or whichever. Instead, the problem is the document, which lies about its contents.
In more detail
The font TT1 used for this paragraph has a ToUnicode entry with the following mappings:
28 beginbfchar
<0003> <0020>
<0005> <0022>
<000a> <0027>
<000f> <002C>
<0011> <002E>
<001d> <003A>
<0069> <00E1>
<006a> <00E0>
<006b> <00E2>
<006c> <00E4>
<006d> <00E3>
<006e> <00E5>
<006f> <00E7>
<0070> <00E9>
<0071> <00E8>
<0074> <00ED>
<0075> <00EC>
<0078> <00F1>
<0079> <00F3>
<007a> <00F2>
<007b> <00F4>
<007c> <00F6>
<007e> <00FA>
<007f> <00F9>
<0096> <00E6>
<0097> <00F8>
<00ab> <00F7>
<00d5> <00F0>
endbfchar
3 beginbfrange
<0018> <001a> <0035>
<0072> <0073> <00EA>
<0076> <0077> <00EE>
endbfrange
I.e. all codes are mapped to Unicode values between U+0020 and U+00F9, a Unicode range in which the Hebrew characters one sees in the screenshot clearly are not located. More exactly: aside from space, some punctuation, and digits (which are extracted correctly), the values are in the range between U+00E0 and U+00F9, a region where Latin letters with accents and their ilk are located.
You mention that in some case you could retrieve the Hebrew text by applying
new String(text.getBytes(Charset.forName("ISO-8859-1")), Charset.forName("windows-1255"))
So probably the PDF creation tool has put mappings to the Windows-1255 codepage into the ToUnicode map. Which obviously is wrong, the ToUnicode map must contain mappings to Unicode.
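This also explains why the gibberish lands in the accented-Latin range: the Windows-1255 bytes for Hebrew letters, reinterpreted as Latin-1/Unicode code points, become accented Latin letters. A small illustration (assuming the JDK build includes the windows-1255 charset, which standard ones do):
// Alef (א) is byte 0xE0 in Windows-1255; code point 0xE0 in
// Latin-1/Unicode is à, i.e. the U+00E0..U+00F9 range seen above
byte[] alef = "א".getBytes(Charset.forName("windows-1255"));
System.out.println(new String(alef, StandardCharsets.ISO_8859_1)); // prints à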
That all being said, even if the ToUnicode mappings were correct, you'd still have to fight with reversed Hebrew output. This indeed is a limitation of iText 5.x text extraction: it has no special support for RTL languages. Thus, you'll have to change the order of the characters in the result string yourself.
In this answer you'll find an example of such a re-ordering method. It is for Arabic script and it is in C# but it clearly shows how to proceed.
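For illustration, here is a simplified Java re-ordering sketch using the JDK's java.text.Bidi (my own sketch, not taken from that answer; it reorders a single line into visual order and ignores refinements such as bracket mirroring and combining marks):
import java.text.Bidi;

static String toVisualOrder(String logical) {
    // force a right-to-left base direction, as in the Hebrew lines above
    Bidi bidi = new Bidi(logical, Bidi.DIRECTION_RIGHT_TO_LEFT);
    int n = bidi.getRunCount();
    byte[] levels = new byte[n];
    Integer[] runs = new Integer[n];
    for (int i = 0; i < n; i++) {
        levels[i] = (byte) bidi.getRunLevel(i);
        runs[i] = i;
    }
    // rearrange the run indices from logical into visual order
    Bidi.reorderVisually(levels, 0, runs, 0, n);
    StringBuilder visual = new StringBuilder();
    for (int run : runs) {
        String text = logical.substring(bidi.getRunStart(run), bidi.getRunLimit(run));
        if ((bidi.getRunLevel(run) & 1) == 1) {
            // odd embedding level = RTL run: reverse its characters
            text = new StringBuilder(text).reverse().toString();
        }
        visual.append(text);
    }
    return visual.toString();
}

With this, Hebrew runs are reversed while embedded numbers and Latin text keep their left-to-right order.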
First of all, the most appropriate Hebrew byte character set is "ISO-8859-8" (better than windows-1255); try to play with this. Also, I would try to extract the String using charset UTF-8. In addition, there is a great diagnostic tool that helped me diagnose and resolve countless thorny encoding issues related to Hebrew and Arabic: the open-source Java library MgntUtils, which has a utility that converts Strings to Unicode sequences and vice versa:
result = "שלום את";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u05e9\u05dc\u05d5\u05dd\u0020\u05d0\u05ea
שלום את
Here is the javadoc for the class StringUnicodeEncoderDecoder. As you can see, the Unicode values for Hebrew are U+05**, where the first Hebrew letter (Alef - א) is U+05D0 and the last Hebrew letter (Tav - ת) is U+05EA. The library can be found at Maven Central or on Github. It comes as a Maven artifact with sources and javadoc. So what I would do first is take your original String, convert it to a Unicode sequence, and see what you are actually getting there. If the data is not correct, then try to extract the bytes and build a String with UTF-8. Anyway, I would strongly recommend using this utility, as it has helped me many times.
Using ICU did the job:
// com.ibm.icu.text.Bidi from ICU4J, not java.text.Bidi
Bidi bidi = new Bidi();
bidi.setPara(input, Bidi.RTL, null);                    // base direction: right-to-left
String output = bidi.writeReordered(Bidi.DO_MIRRORING); // visual order, with mirrored brackets

Store base64 encoded string in HBase

I have a very specific requirement to store PDF data in HBase columns. The source of the data is MongoDB, from which the base64-encoded data is read, and I will need to bulk upload it to an HBase table.
I realized that the base64-encoded string contains a lot of "\n" characters, which split the entire string into parts. Not sure if it is because of this, but when I store the string as it is, using a Put:
put.add(Bytes.toBytes(ColFamilyName), Bytes.toBytes(columnName), Bytes.toBytes(data.replaceAll("\n","").toString()));
It stores only the first line of the entire encoded string. E.g., if the actual content was something like this:
"JVBERi0xLjQKJaqrrK0KNCAwIG9iago8PAovQ3JlYXRvciAoQXBhY2hlIEZPUCBWZXJzaW9uIDEu
" +
"MSkKL1Byb2R1Y2VyIChBcGFjaGUgRk9QIFZlcnNpb24gMS4xKQovQ3JlYXRpb25EYXRlIChEOjIw\n" +
"MTUwODIyMTIxMjM1KzAzJzAwJykKPj4KZW5kb2JqCjUgMCBvYmoKPDwKICAvTiAzCiAgL0xlbmd0\n" +
it stores only the first line, which is:
JVBERi0xLjQKJaqrrK0KNCAwIG9iago8PAovQ3JlYXRvciAoQXBhY2hlIEZPUCBWZXJzaW9uIDEu
in the column. Even after trying to remove the "\n" manually, the output is the same.
Could someone please guide me in the right direction here ?
Currently, I am also working on Base64 encoding. As per my understanding, you should try using the
org.apache.hadoop.hbase.util.Base64.encodeBytes(byte[] source, int option)
method, where DONT_BREAK_LINES can be passed as the option.
Please let me know if this works fine.
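A sketch of the suggested call (an assumption on my part: this Base64 class ships with older HBase versions and was later removed in favor of java.util.Base64; pdfBytes stands in for your decoded PDF bytes):
import org.apache.hadoop.hbase.util.Base64;

// DONT_BREAK_LINES suppresses the "\n" otherwise inserted every 76 characters
String encoded = Base64.encodeBytes(pdfBytes, Base64.DONT_BREAK_LINES);

The JDK's java.util.Base64.getEncoder() never inserts line breaks in the first place.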
Managed to solve it. The issue was in reading the base64-encoded data from the MongoDB source. I read the data from the MongoDB document's DBObject as:
jsonObj.get("receiptContent").toString().replaceAll("\n","")
and stored it as such in HBase. Now I can even see the PDF content from the Hue HBase UI browser.

Trouble viewing an XML file encoded in UTF-8 with non-ASCII characters

I have an XML file which gets its data from an Oracle table of CLOB type. I wrote into the CLOB value using a Unicode character stream:
Writer value = clob.setCharacterStream(0L);
value.write(strValue);
When I write non-ASCII characters like Chinese and then access the CLOB attribute using PL/SQL Developer, I see the characters showing up as they are. However, when I put the value in an XML file encoded in UTF-8 and try to open the XML file in IE, I get the error message
"an invalid character was found in text content. Error processing
resource ...".
The other interesting thing is that when I write into the CLOB using an ASCII stream, like:
OutputStream value = clob.getAsciiOutputStream();
value.write(strValue.getBytes("UTF-8"));
then the characters appear correctly in the XML in the browser, but are garbled in the DB when accessed using PL/SQL Developer.
Is there any problem in converting Unicode characters to UTF-8? Any suggestions, please?

How to decode UTF-8 encoded String using java?

I have a String in UTF-8 encoded form in a mail. I want to decode it. I use Java's MimeUtility.decodeText, but it doesn't decode properly.
Example String
=?UTF-8?B?0J/RgNC40LLQtdGC?==?UTF-8?B?0JfQtNGA0LDQstGB0YLQstGD0LnRgtC1?=
When I used
MimeUtility.decodeText("=?UTF-8?B?0J/RgNC40LLQtdGC?==?UTF-8?B?0JfQtNGA0LDQstGB0YLQstGD0LnRgtC1?=")
it yields
Привет=?UTF-8?B?0JfQtNGA0LDQstGB0YLQstGD0LnRgtC1?=
Please help me. Thanks in advance.
It is MIME-encoded -- the "B" encoding, to be specific (RFC 2047 section 4.1).
I think you can decode it using the JavaMail javax.mail.internet.InternetHeaders or MimeUtility class.
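One possible workaround (an assumption, not part of the answer above): RFC 2047 expects adjacent encoded-words to be separated by whitespace, and MimeUtility.decodeText stops at the unseparated ?==? boundary. Inserting a space between the two encoded-words first lets the whole header decode, since decodeText ignores whitespace between adjacent encoded-words:
import javax.mail.internet.MimeUtility;

String raw = "=?UTF-8?B?0J/RgNC40LLQtdGC?==?UTF-8?B?0JfQtNGA0LDQstGB0YLQstGD0LnRgtC1?=";
// split "?==?" into two properly separated encoded-words
String decoded = MimeUtility.decodeText(raw.replaceAll("\\?==\\?", "?= =?"));
System.out.println(decoded); // ПриветЗдравствуйте

Note that decodeText declares UnsupportedEncodingException, so the caller must handle it.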
