Character coding between mysql and java - java

I have an error in printing special characters in Java.
The system reads a product name from a mysql database, and checking the database from the command line, the data displays the Registered Symbol ® correctly.
A java program then runs a database read to get information of orders to print out as a PDF, but when the print is produced the ® symbol becomes 'fi'.
Is there a way of retaining the MySQL character encoding when handling the data in Java?

Before printing to PDF, you can replace the special characters with the unicode characters as below.
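public static String specialCharactersConversion( String charString ) {
    // Replace the ASCII stand-in "(R)" with the registered sign U+00AE
    if( charString != null && !charString.isEmpty() ){
        charString = charString.replaceAll( "\\(R\\)", "\u00AE" );
    }
    return charString;
}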

There is an Open Source Java library, MgntUtils, that has a utility that converts Strings to Unicode sequences and vice versa:
String result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World
So, before converting your text to PDF, you can convert the special characters, or the entire text, to Unicode sequences. This answer is copied with modifications from this question: Convert International String to \u Codes in java
The library can be found at Maven Central or at GitHub. It comes as a Maven artifact with sources and Javadoc.
Here is the Javadoc for the class StringUnicodeEncoderDecoder
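If pulling in a library just for this check is not desirable, the same kind of conversion can be sketched with the JDK alone. The class below is a minimal illustration, not the MgntUtils implementation; the names UnicodeEscapes, encode and decode are made up for this example, and it works on UTF-16 code units, so characters outside the Basic Multilingual Plane come out as two escapes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, JDK only; not part of MgntUtils.
final class UnicodeEscapes {
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Turns every UTF-16 code unit into a backslash-u escape with four lowercase hex digits.
    static String encode(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 6);
        for (int i = 0; i < text.length(); i++) {
            sb.append(String.format("\\u%04x", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    // Turns backslash-u escapes back into characters; any other text is copied unchanged.
    static String decode(String escaped) {
        Matcher m = ESCAPE.matcher(escaped);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String encoded = encode("Hello World");
        System.out.println(encoded);
        System.out.println(decode(encoded));
    }
}
```

Running encode on the product name straight after the database read shows whether the registered sign arrives as code unit 00ae. If it does, the string is intact in Java and the problem lies later, in the PDF generation or font.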

Related

Convert euro sign from a PHP sent JSON string to a Java string (utf-8)

I am building an Android application that prints to a thermal printer, and I have problems printing the € symbol.
Specifically, my Android application:
makes a GET request to a php file
the php file build a JSON object
one of my JSON values contains a "\u20AC", that is, euro sign in unicode.
php file stringifies the JSON + send it back to Android
which sends the data UTF-8 encoded (plain text)
My PHP sending-back code is, basically, something like that:
header("Content-Type: text/html; charset=UTF-8\n");
...
$currency_symbol = '\u20AC';
...
$blah = array("id"=>$order['id'], ... , "currency_symbol"=> $currency_symbol);
echo json_encode( $blah );
exit;
Before all that, I was able to print the € sign to the printer by:
Changing to the correct code page of the thermal printer
Calling the following code:
"\u20ac".getBytes( Charset.forName("Windows-1252") );
then sending the euro bytes directly to the printer.
With the JSON solution, I cannot render the euro sign anymore: every attempt, even the one that worked previously, prints this text to the printer instead of the sign:
\u20AC
PS. I have no problem with the other UTF-8 strings, as I am able to print them like this:
String.format("- " + json_obj.getString("address") + "\n").getBytes( charset )
where json_obj is the JSON object that came from PHP and charset is the code page the printer is set to (as a Charset).
I solved it by using the following code:
String currency_symbol_hex = order_obj.getString("currency_symbol");
String currency_symbol_str = Character.toString((char) Integer.parseInt(currency_symbol_hex,16));
BT_write( String.format("%s", currency_symbol_str).getBytes( Charset.forName("Windows-1252") ) );
where order_obj.getString("currency_symbol") is a JSON value sent by the PHP that contains only 20AC (rather than \u20AC), and BT_write basically writes the bytes to a connected Bluetooth socket.
The currency_symbol is stored in JSON like that:
return array(
...
"euro"=>array("symbol"=>"€","symbol_unicode"=>"20AC")
-------^
...
);
and I just return it through the currency_symbol key in my JSON.
This is obviously not the best solution, as different currencies may require a different character set than Windows-1252, and it also requires a special case on the PHP side (sending two or more Unicode codes, e.g. \u0631.\u0639. (the Omani rial currency symbol), won't work unless you use an array and parse each one, etc.), but at least it's a start. Thanks!
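The two limitations mentioned above (symbols made of several code points, and code pages that lack the character) can be handled on the Java side. The following is only a sketch under assumptions: the class and method names are invented here, and the format of one or more hex code points separated by spaces (for example "0631 002E 0639 002E") is a hypothetical convention for the JSON value, not something the original PHP code sends.

```java
import java.nio.charset.Charset;

// Hypothetical helper; the space-separated hex format is an assumption of this sketch.
final class CodePointPrinter {

    // Parses hex code points such as "20AC" or "0631 002E 0639 002E" into a String.
    static String fromHexCodePoints(String hex) {
        StringBuilder sb = new StringBuilder();
        for (String part : hex.trim().split("\\s+")) {
            sb.appendCodePoint(Integer.parseInt(part, 16));
        }
        return sb.toString();
    }

    // Encodes text for the printer and fails loudly if the code page lacks a character,
    // instead of silently sending '?' bytes.
    static byte[] encodeFor(String text, Charset printerCharset) {
        if (!printerCharset.newEncoder().canEncode(text)) {
            throw new IllegalArgumentException(
                    "Text cannot be represented in " + printerCharset.name());
        }
        return text.getBytes(printerCharset);
    }

    public static void main(String[] args) {
        String euro = fromHexCodePoints("20AC");
        byte[] bytes = encodeFor(euro, Charset.forName("windows-1252"));
        System.out.println(bytes.length + " byte(s), first = 0x"
                + Integer.toHexString(bytes[0] & 0xFF));
    }
}
```

The resulting byte array can be passed to BT_write as before. When encodeFor throws, the printer would need to be switched to a code page that contains the symbol before the bytes are sent.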

Reversed Hebrew or numbers after using iText to parse a PDF document

I'm working with iText5 to parse a pdf written mostly in Hebrew.
To extract the text I use PdfTextExtractor.getTextFromPage. I didn't find a way to change the encoding in the library, and the text appears as gibberish.
I tried to fix the encoding like this:
new String(pdfPage.getBytes(Charset1), Charset2).
I went through all the charsets returned by Charset.availableCharsets(), and a few of them gave me Hebrew instead of gibberish, but reversed.
I then thought I could reverse the text line by line, but Hebrew is written right to left while numbers and English are left to right. So if I reverse the line, it fixes the Hebrew but breaks the numbers and English.
Example:
PdfTextExtractor.getTextFromPage returns 87.55 úåáééçúä ééåëéð ë"äñ
new String(text.getBytes(Charset.forName("ISO-8859-1")), Charset.forName("windows-1255")) returns 87.55 תובייחתה ייוכינ כ"הס
if I reverse this then I get סה"כ ניכויי התחייבות 55.78
The number should be 87.55 and not 55.78
The only solution I found is to split it into Hebrew and the rest (English/numbers) and reverse only the Hebrew parts and then merge it back.
Isn't there an easier solution? I feel like I'm missing something with the encoding/RTL
I can't share the document I'm working on because it contains PII. But after searching Google for PDFs with gibberish, I found this document - the last paragraph of the document has exactly the same problem I have in my documents.
I can only analyze the data given, so in this case only the linked government paper, from which the paragraph in question is extracted as
ìëéî ìù "íééç éøåùéë" øôñá ,äéãôåìòôäá íéáø úåðåéòø ãåò àåöîì ïúéð 􀂛
.ãåòå úéëåðéçä äééæëøîá ,567 'îò ,ïîöìæ éìéðå ì÷ðøô äéæø ,ïîæåø
And in this case the reason for the gibberish output is simple: The PDF claims that this gibberish is indeed the text there!
Thus, the problem is not the text extractor, be it the iText PdfTextExtractor, Adobe Reader copy&paste, or whichever. Instead, the problem is the document, which lies about its contents.
In more detail
The font TT1 used for this paragraph has a ToUnicode entry with the following mappings:
28 beginbfchar
<0003> <0020>
<0005> <0022>
<000a> <0027>
<000f> <002C>
<0011> <002E>
<001d> <003A>
<0069> <00E1>
<006a> <00E0>
<006b> <00E2>
<006c> <00E4>
<006d> <00E3>
<006e> <00E5>
<006f> <00E7>
<0070> <00E9>
<0071> <00E8>
<0074> <00ED>
<0075> <00EC>
<0078> <00F1>
<0079> <00F3>
<007a> <00F2>
<007b> <00F4>
<007c> <00F6>
<007e> <00FA>
<007f> <00F9>
<0096> <00E6>
<0097> <00F8>
<00ab> <00F7>
<00d5> <00F0>
endbfchar
3 beginbfrange
<0018> <001a> <0035>
<0072> <0073> <00EA>
<0076> <0077> <00EE>
endbfrange
I.e. all codes are mapped to Unicode values between U+0020 and U+00F9, a Unicode range in which clearly the Hebrew characters one sees in the screen shot are not located. More exactly: aside from space, some punctuation, and digits (which are extracted correctly) the values are in the range between U+00E0 and U+00F9, a region where Latin letters with accents and their ilk are located.
You mention that in some case you could retrieve the Hebrew text by applying
new String(text.getBytes(Charset.forName("ISO-8859-1")), Charset.forName("windows-1255"))
So probably the PDF creation tool has put mappings to the Windows-1255 code page into the ToUnicode map. This is obviously wrong: the ToUnicode map must contain mappings to Unicode.
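The effect of such a wrong ToUnicode map can be reproduced in a few lines. This is a sketch only: the class and method names are invented for this example, and it assumes the running JDK ships the windows-1255 charset, which is part of the extended charsets and may be absent from trimmed-down runtimes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the workaround quoted from the question.
final class ToUnicodeRepair {

    // Treats each extracted character (U+0000 to U+00FF) as the windows-1255 byte
    // with the same value and decodes those bytes properly.
    // Characters above U+00FF cannot be represented and become '?'.
    static String redecode(String extracted) {
        byte[] raw = extracted.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName("windows-1255"));
    }

    public static void main(String[] args) {
        // U+00F9 U+00EC U+00E5 U+00ED are the "Unicode" values a mis-built map
        // would report for the four Hebrew letters of the word shalom.
        System.out.println(redecode("\u00F9\u00EC\u00E5\u00ED"));
    }
}
```

Digits, spaces and basic punctuation have the same values in both encodings, which is why they survive extraction unchanged while the Hebrew letters do not.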
That all being said, even if the ToUnicode mappings were correct, you'd still have to fight with reversed Hebrew output. This is indeed a limitation of iText 5.x text extraction: it has no special support for RTL languages. Thus, you'll have to change the order of the characters in the result string yourself.
In this answer you'll find an example of such a re-ordering method. It is for Arabic script and it is in C# but it clearly shows how to proceed.
First of all, the most appropriate Hebrew byte character set is "ISO-8859-8" (better than windows-1255), so try experimenting with it. I would also try to extract the String using the UTF-8 charset. There is also a great diagnostic tool that has helped me diagnose and resolve countless thorny encoding issues related to Hebrew and Arabic: the Open Source Java library MgntUtils has a utility that converts Strings to Unicode sequences and vice versa:
String result = "שלום את";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u05e9\u05dc\u05d5\u05dd\u0020\u05d0\u05ea
שלום את
Here is the Javadoc for the class StringUnicodeEncoderDecoder. As you can see, the Unicode code points for Hebrew letters are in the U+05xx block, where the first Hebrew letter (Alef, א) is U+05D0 and the last Hebrew letter (Tav, ת) is U+05EA. The library can be found at Maven Central or at GitHub. It comes as a Maven artifact with sources and Javadoc. So what I would do first is take your original String, convert it to a Unicode sequence, and see what you are actually getting there. If the data is not correct, try to extract the bytes and build a String with UTF-8. In any case, I strongly recommend this utility, as it has helped me many times.
Using ICU did the job:
Bidi bidi = new Bidi();
bidi.setPara(input, Bidi.RTL, null);
String output = bidi.writeReordered(Bidi.DO_MIRRORING);
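Where adding ICU is not an option, the manual approach described in the question (reverse the whole line, then restore the left-to-right runs) can be sketched with the JDK alone. This is a rough approximation and not the Unicode bidirectional algorithm: the class name and the regular expression are assumptions of this example, only ASCII letters and digits are treated as left-to-right, and paired characters such as brackets are not mirrored.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical JDK-only fallback; ICU's Bidi handles far more cases.
final class VisualToLogical {
    // A run of ASCII letters or digits, optionally joined by spaces or simple punctuation.
    private static final Pattern LTR_RUN =
            Pattern.compile("[0-9A-Za-z]+(?:[ .,:/-]+[0-9A-Za-z]+)*");

    static String reorderLine(String visual) {
        // Step 1: reverse the whole line, which fixes the Hebrew but breaks numbers and English.
        String reversed = new StringBuilder(visual).reverse().toString();

        // Step 2: reverse each left-to-right run again so it reads correctly.
        Matcher m = LTR_RUN.matcher(reversed);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String run = new StringBuilder(m.group()).reverse().toString();
            m.appendReplacement(out, Matcher.quoteReplacement(run));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // A number followed by the word shalom in visual (reversed) order.
        System.out.println(reorderLine("87.55 \u05DD\u05D5\u05DC\u05E9"));
    }
}
```

For the example line from the question, this keeps the amount as 87.55 while the Hebrew words are put back into reading order.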

Foreign Language Characters decode in Java **& iacute;**

I want to display characters of foreign languages in Jasper reports. The report passes the text to Java code for RTF formatting. Unfortunately, the MySQL database returns an encoded string like the one below (with the space removed)
& iacute;
what I want to display is
í
Any suggestions how to do it with java?
text: bebida fría
from database: bebida fr& iacute;a
Those are HTML entities. You can use StringEscapeUtils.unescapeHtml4 from the Apache Commons library (Commons Lang 3, or Commons Text in newer versions).
It still remains to be seen how your RTF output handles Unicode.
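To illustrate what such an unescape step does, here is a deliberately small JDK-only sketch. It is not the Commons implementation: the class name is invented, and the table of named entities holds only a handful of entries chosen for this example, whereas the real library covers the full HTML 4 set. It requires Java 9 or later.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, minimal stand-in for a real HTML unescaper.
final class EntityDecoder {
    // Illustrative subset only.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"",
            "iacute", "\u00ED", "eacute", "\u00E9", "ntilde", "\u00F1");

    private static final Pattern ENTITY =
            Pattern.compile("&(#x[0-9A-Fa-f]{1,6}|#[0-9]{1,7}|[A-Za-z][A-Za-z0-9]*);");

    static String decode(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            m.appendReplacement(out, Matcher.quoteReplacement(resolve(m.group(1), m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded character, or the original entity text if it is unknown or invalid.
    private static String resolve(String body, String original) {
        if (body.startsWith("#")) {
            boolean hex = body.startsWith("#x");
            int codePoint = Integer.parseInt(body.substring(hex ? 2 : 1), hex ? 16 : 10);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : original;
        }
        return NAMED.getOrDefault(body, original);
    }

    public static void main(String[] args) {
        System.out.println(decode("bebida fr&iacute;a"));
    }
}
```

Unknown names are left untouched on purpose, so text that merely contains an ampersand and a semicolon is not corrupted.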
If I understand your question, then you could use the unicode literal,
System.out.println("bebida fr\u00EDa");
Output is (the requested)
bebida fría
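Check the database table encoding. You can also try to encode your string with the proper encoding.
ByteBuffer encode = StandardCharsets.UTF_8.encode(myString);
// Decode only the bytes actually written, and name the charset explicitly
String encodedStr = new String(encode.array(), encode.arrayOffset(), encode.remaining(), StandardCharsets.UTF_8);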

Eliminating character e280a8 from a Java String

My application takes a Java string and puts it in a JSON response, it works on IE but for some reason on Chrome and Firefox I don't see the data on the page, I don't get any console errors and I do get the Response Object with the ability to examine it on Firebug and Chrome debugging tools.
I am working with Java 6, and the String in question is created from a CLOB column from an Oracle DB:
4:42 PM<
This is the hex code of the above String as it is on Oracle:
34,3a,34,32,20,50,4d,e2,80,a8,3c
As you can see, between the "M" (4d) and the "<" (3c) we have the values e2,80,a8, which according to UTF-8 is a line separator (e280a8), I've tested my application by adding only the substring until the "M" and it works on all browsers, but the moment I include one more character it breaks. So it is safe to say that the character is causing the issue.
The Java console outputs the string as:
4:42 PM
<
And its byte values as:
52,58,52,50,32,80,77,-30,-128,-88,60
Since I know that there should not be a line break or anything else between the "M" and "<", I think the solution would be to scrub that character, but desc = desc.replaceAll("
", ""); doesn't seem to work.
Any suggestions?
The bytes are in UTF-8, and it is the Unicode line separator "\u2028". You are right.
desc = desc.replace("\u2028", "");
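A short sketch that reproduces the situation with the byte values from the question follows. The class and method names are invented for this example. The second method is an alternative that goes beyond the answer: it keeps the character but writes it as an escape, which can help when the JSON ends up evaluated as JavaScript, since older JavaScript engines treated U+2028 inside a string literal as a line break.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper around the one-line fix above.
final class LineSeparatorScrub {

    // Removes the Unicode line separator U+2028.
    static String strip(String s) {
        return s.replace("\u2028", "");
    }

    // Alternative: keep the character but emit it as a backslash-u escape.
    static String escape(String s) {
        return s.replace("\u2028", "\\u2028");
    }

    public static void main(String[] args) {
        // The hex dump from the Oracle column: 34 3a 34 32 20 50 4d e2 80 a8 3c
        byte[] raw = {0x34, 0x3a, 0x34, 0x32, 0x20, 0x50, 0x4d,
                (byte) 0xe2, (byte) 0x80, (byte) 0xa8, 0x3c};
        String desc = new String(raw, StandardCharsets.UTF_8);
        System.out.println(desc.length());        // 9 characters, one of them U+2028
        System.out.println(strip(desc));
    }
}
```

Note that replace works on literal text, so no regular expression escaping is needed, unlike with replaceAll.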

Input Special characters to PPTX using docx4j

I got a special character from its ASCII value and created a presentation containing that character using the docx4j library. If I want to print the "£" sign, it is printed as "£". Is there a special way to input special characters into the PPT?
I used following code.
String iChar = new Character((char)163).toString();
t.setTextContent(iChar);
Please unzip your pptx, and have a look at the content of the slide. It should contain something like:
<a:t>£</a:t>
You can create a p containing that with:
// Create object for p
CTTextParagraph textparagraph = dmlObjectFactory.createCTTextParagraph();
textbody.getP().add( textparagraph);
// Create object for r
CTRegularTextRun regulartextrun = dmlObjectFactory.createCTRegularTextRun();
textparagraph.getEGTextRun().add( regulartextrun);
regulartextrun.setT( "£");
or by unmarshalling a string. In either case, you can just provide the £ char directly.
You can generate suitable code using the docx4j webapp at http://webapp.docx4java.org/
