Input Special characters to PPTX using docx4j


I built a special character from its numeric character code and wrote it into a presentation using the docx4j library. When I write the "£" sign, it shows up as "Â£". Is there a special way to put special characters into a PPTX?
I used the following code:
String iChar = new Character((char)163).toString();
t.setTextContent(iChar);

Please unzip your pptx, and have a look at the content of the slide. It should contain something like:
<a:t>£</a:t>
You can create a p containing that with:
// Create object for p
CTTextParagraph textparagraph = dmlObjectFactory.createCTTextParagraph();
textbody.getP().add( textparagraph);
// Create object for r
CTRegularTextRun regulartextrun = dmlObjectFactory.createCTRegularTextRun();
textparagraph.getEGTextRun().add( regulartextrun);
regulartextrun.setT( "£");
or by unmarshalling a string. In either case, you can just provide the £ char directly.
You can generate suitable code using the docx4j webapp at http://webapp.docx4java.org/
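The "Â£" symptom is typical of UTF-8 bytes being read back with a single-byte charset such as ISO-8859-1 or windows-1252. Whether that is what happens in the asker's setup is an assumption, but the effect itself is easy to reproduce with the JDK alone. In this minimal sketch the class and method names are made up for illustration:

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode the text as UTF-8, then (wrongly) decode those bytes as ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String pound = String.valueOf((char) 163); // U+00A3, the pound sign
        String broken = misdecode(pound);
        // Print code points instead of glyphs so the console encoding does not matter.
        broken.chars().forEach(c -> System.out.printf("U+%04X%n", c));
        // prints U+00C2 (capital A with circumflex) followed by U+00A3 (pound sign)
    }
}
```

If the unzipped slide XML contains the two-character sequence rather than a single £, the text was probably mis-decoded before it reached docx4j. If the XML is correct, the viewer or a later processing step is the more likely culprit.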

Related

Character coding between mysql and java

I have a problem printing special characters in Java.
The system reads a product name from a MySQL database. When I check the database from the command line, the data displays the registered symbol ® correctly.
A Java program then reads order information from the database to print it out as a PDF, but in the printed output the ® symbol becomes 'fi'.
Is there a way of retaining the MySQL character encoding when handling the data in Java?

Before printing to PDF, you can replace the special characters with their Unicode equivalents, as below.
public static String specialCharactersConversion( String charString ) {
// Replace the literal text "(R)" with the registered sign (U+00AE)
if( charString != null && !charString.isEmpty() ){
charString = charString.replaceAll( "\\(R\\)", "\u00AE" );
}
return charString;
}

There is an open-source Java library, MgntUtils, that has a utility for converting Strings to Unicode sequences and vice versa:
String result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World
So, before converting your text to PDF, you can convert the special characters, or the entire text, to Unicode sequences. This answer is copied with modifications from the question "Convert International String to \u Codes in java".
The library is available from Maven Central and GitHub as a Maven artifact, with sources and Javadoc. See the Javadoc for the class StringUnicodeEncoderDecoder.

Reversed Hebrew or numbers after using iText to parse a PDF document

I'm working with iText 5 to parse a PDF written mostly in Hebrew.
To extract the text I use PdfTextExtractor.getTextFromPage. I didn't find a way to change the encoding in the library, and the text comes out as gibberish.
I tried to fix the encoding like this:
new String(pdfPage.getBytes(Charset1), Charset2).
I went through all possible charsets using Charset.availableCharsets(), and a few of them gave me Hebrew instead of gibberish, but reversed.
I thought I could reverse the text line by line, but Hebrew is written right-to-left while numbers and English are left-to-right. So if I reverse the line, it fixes the Hebrew but breaks the numbers and English.
Example:
PdfTextExtractor.getTextFromPage returns 87.55 úåáééçúä ééåëéð ë"äñ
new String(text.getBytes(Charset.forName("ISO-8859-1")), Charset.forName("windows-1255")) returns 87.55 תובייחתה ייוכינ כ"הס
if I reverse this then I get סה"כ ניכויי התחייבות 55.78 
The number should be 87.55 and not 55.78
The only solution I found is to split it into Hebrew and the rest (English/numbers) and reverse only the Hebrew parts and then merge it back.
Isn't there an easier solution? I feel like I'm missing something with the encoding/RTL

I can't share the document I'm working on because it contains PII. But after searching Google for PDFs with gibberish text, I found this document: its last paragraph has exactly the same problem I have in my documents.

I can only analyze the data given, so in this case only the linked government paper, from which the paragraph in question is extracted as
ìëéî ìù "íééç éøåùéë" øôñá ,äéãôåìòôäá íéáø úåðåéòø ãåò àåöîì ïúéð 􀂛
.ãåòå úéëåðéçä äééæëøîá ,567 'îò ,ïîöìæ éìéðå ì÷ðøô äéæø ,ïîæåø
And in this case the reason for the gibberish output is simple: The PDF claims that this gibberish is indeed the text there!
Thus, the problem is not the text extractor, be it the iText PdfTextExtractor, Adobe Reader copy&paste, or whichever. Instead the problem is the document, which lies about its contents.
In more detail:
The font TT1 used for this paragraph has a ToUnicode entry with the following mappings:
28 beginbfchar
<0003> <0020>
<0005> <0022>
<000a> <0027>
<000f> <002C>
<0011> <002E>
<001d> <003A>
<0069> <00E1>
<006a> <00E0>
<006b> <00E2>
<006c> <00E4>
<006d> <00E3>
<006e> <00E5>
<006f> <00E7>
<0070> <00E9>
<0071> <00E8>
<0074> <00ED>
<0075> <00EC>
<0078> <00F1>
<0079> <00F3>
<007a> <00F2>
<007b> <00F4>
<007c> <00F6>
<007e> <00FA>
<007f> <00F9>
<0096> <00E6>
<0097> <00F8>
<00ab> <00F7>
<00d5> <00F0>
endbfchar
3 beginbfrange
<0018> <001a> <0035>
<0072> <0073> <00EA>
<0076> <0077> <00EE>
endbfrange
I.e. all codes are mapped to Unicode values between U+0020 and U+00FA, a Unicode range that clearly does not contain the Hebrew characters one sees in the screenshot. More exactly: aside from space, some punctuation, and digits (which are extracted correctly), the values are in the range between U+00E0 and U+00FA, a region where Latin letters with accents and their ilk are located.
You mention that in some case you could retrieve the Hebrew text by applying
new String(text.getBytes(Charset.forName("ISO-8859-1")), Charset.forName("windows-1255"))
So the PDF creation tool has probably put mappings to the Windows-1255 code page into the ToUnicode map. That is obviously wrong: the ToUnicode map must contain mappings to Unicode.
All that being said, even if the ToUnicode mappings were correct, you'd still have to fight with reversed Hebrew output. This is indeed a limitation of iText 5.x text extraction: it has no special support for RTL languages. Thus, you'll have to change the order of the characters in the result string yourself.
In this answer you'll find an example of such a re-ordering method. It is for Arabic script and it is in C# but it clearly shows how to proceed.
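The ISO-8859-1 to windows-1255 workaround quoted above can be sketched with the JDK alone. This is only a repair for documents whose ToUnicode map is broken in this particular way. The class and method names are made up for illustration, and it assumes the running JVM ships the windows-1255 charset (part of the extended charsets in standard JDK builds):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class HebrewRedecode {
    // Turn each extracted char back into a single byte, then read the bytes as windows-1255.
    static String redecode(String extracted) {
        byte[] raw = extracted.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName("windows-1255"));
    }

    public static void main(String[] args) {
        // The last word of the question's sample line, written with escapes:
        // e-diaeresis, double quote, a-diaeresis, n-tilde.
        String extracted = "\u00EB\"\u00E4\u00F1";
        redecode(extracted).chars().forEach(c -> System.out.printf("U+%04X%n", c));
        // prints U+05DB, U+0022, U+05D4, U+05E1 (kaf, quote, he, samekh), still in visual order
    }
}
```

Characters above U+00FF cannot be represented in ISO-8859-1 and would be replaced by '?', so this round trip is only safe for text that really consists of mis-mapped single bytes.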

First of all, the most appropriate Hebrew single-byte character set is "ISO-8859-8" (better than windows-1255), so try experimenting with that. I would also try extracting the String using the UTF-8 charset. In addition, there is a great diagnostic tool that has helped me diagnose and resolve countless thorny encoding issues related to Hebrew and Arabic: the open-source Java library MgntUtils has a utility that converts Strings to Unicode sequences and vice versa:
String result = "שלום את";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u05e9\u05dc\u05d5\u05dd\u0020\u05d0\u05ea
שלום את
As you can see, the Unicode code points for Hebrew are U+05**, where the first Hebrew letter (Alef, א) is U+05D0 and the last Hebrew letter (Tav, ת) is U+05EA. See the Javadoc for the class StringUnicodeEncoderDecoder. The library is available from Maven Central and GitHub as a Maven artifact, with sources and Javadoc.
So what I would do first is take your original String, convert it to a Unicode sequence, and see what you are actually getting there. If the data is not correct, then try to extract the bytes and build a String with UTF-8. In any case, I strongly recommend this utility, as it has helped me many times.

Using ICU (the com.ibm.icu.text.Bidi class from ICU4J, not java.text.Bidi) did the job:
Bidi bidi = new Bidi();
bidi.setPara(input, Bidi.RTL, null);
String output = bidi.writeReordered(Bidi.DO_MIRRORING);
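For readers who cannot add ICU4J, a similar reordering can be approximated with the JDK's own java.text.Bidi. This is a swapped-in technique, not the ICU call above. It is a rough sketch with made-up class and method names: it splits the line into directional runs, flips the characters of right-to-left runs, and reorders the runs. It does no character mirroring, so it is not a full equivalent of writeReordered(Bidi.DO_MIRRORING).

```java
import java.text.Bidi;

class VisualToLogical {
    static String reorder(String visual) {
        // Analyze the line with a right-to-left base direction.
        Bidi bidi = new Bidi(visual, Bidi.DIRECTION_RIGHT_TO_LEFT);
        int runCount = bidi.getRunCount();
        byte[] levels = new byte[runCount];
        String[] runs = new String[runCount];
        for (int i = 0; i < runCount; i++) {
            levels[i] = (byte) bidi.getRunLevel(i);
            String run = visual.substring(bidi.getRunStart(i), bidi.getRunLimit(i));
            // Odd levels are right-to-left runs: flip their characters.
            runs[i] = (levels[i] % 2 == 1)
                    ? new StringBuilder(run).reverse().toString()
                    : run;
        }
        // Reorder whole runs according to their embedding levels.
        Bidi.reorderVisually(levels, 0, runs, 0, runCount);
        return String.join("", runs);
    }

    public static void main(String[] args) {
        // A number followed by two Hebrew letters stored in visual order (bet, alef).
        String visual = "87.55 \u05D1\u05D0";
        reorder(visual).chars().forEach(c -> System.out.printf("U+%04X ", c));
        System.out.println();
    }
}
```

The digits form their own left-to-right run, so the number stays 87.55 while only the Hebrew run is flipped. That is the split-and-merge step the question describes doing by hand.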

How to validate a file name in java

I am working on a Coverity issue that requires me to validate a file name
using a regex in Java. My application supports .pdf, .txt, .csv, etc. The
file name comes from the user as, for example, xxx.txt. I want to check that
the file name has a proper extension and contains no special characters
other than the dot (e.g. .txt).
filePath = properties.getProperty("DOCUMENT.LIBRARY.LOCATION");
String fileName = (String) request.getParameter("read");
The code below should run only after the file name has passed proper validation.
filePath += "/" + fileName;

The other answer is a poor fit, as it only verifies that the filename ends with the desired extension and doesn't verify the rest of the filename, as requested in the original question. Something more like this would be MUCH better:
fileName.matches("[-_. A-Za-z0-9]+\\.(pdf|txt|csv)");
This ensures the filename contains only ONE OR MORE -, _, PERIOD, SPACE, or alphanumeric characters, followed by exactly one of .pdf, .txt or .csv at the end of the filename. Your system might allow other characters in filenames and you could add them to this list if desired. An alternate, less secure approach is to prevent 'bad' characters something like:
fileName.matches("[^/\\\\]+\\.(pdf|txt|csv)");
Which simply prevents / or \ characters from being in the file name before the required ending extension. But this doesn't prevent potentially other dangerous characters, like NULL bytes, for example.
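The whitelist variant can be wrapped in a small helper so the pattern is compiled once and null input is rejected. The class and method names here are hypothetical, and the extension list is just the three from the question:

```java
import java.util.regex.Pattern;

class FileNameValidator {
    // Whitelist: letters, digits, '-', '_', '.', space, then exactly one allowed extension.
    private static final Pattern SAFE_NAME =
            Pattern.compile("[-_. A-Za-z0-9]+\\.(pdf|txt|csv)");

    static boolean isValid(String fileName) {
        // matches() requires the whole name to fit the pattern, not just a part of it.
        return fileName != null && SAFE_NAME.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("report.txt"));     // true
        System.out.println(isValid("../etc/passwd"));  // false: contains '/'
        System.out.println(isValid("notes.txt.exe"));  // false: wrong extension
    }
}
```

Only append fileName to filePath after isValid returns true. The extension check is case-sensitive as written, so "REPORT.TXT" would be rejected unless the pattern is extended.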

Have a look at String.endsWith() method
if (fileName.endsWith(".pdf")) {
// do something
}
Or use the method String.matches(), which has to match the entire string:
fileName.matches(".*\\.(pdf|txt|csv)")

Display Hindi language in console using Java

StringBuffer contents=new StringBuffer();
BufferedReader input = new BufferedReader(new FileReader("/home/xyz/abc.txt"));
String line = null; //not declared within while loop
while (( line = input.readLine()) != null){
contents.append(line);
}
System.out.println(contents.toString());
File abc.txt contains
\u0905\u092d\u0940 \u0938\u092e\u092f \u0939\u0948 \u091c\u0928\u0924\u093e \u091c\u094b \u091a\u093e\u0939\u0924\u0940 \u0939\u0948 \u092
I want to display it in Hindi in the console using Java.
If I simply print it like this
String str="\u0905\u092d\u0940 \u0938\u092e\u092f \u0939\u0948 \u091c\u0928\u0924\u093e \u091c\u094b \u091a\u093e\u0939\u0924\u0940 \u0939\u0948 \u092";
System.out.println(str);
then it works fine, but when I try to read it from a file it doesn't work.
Please help me out.

Use Apache Commons Lang.
import org.apache.commons.lang3.StringEscapeUtils;
// open the file as ASCII, read it into a string, then
String escapedStr; // = "\u0905\u092d\u0940 \u0938\u092e\u092f \u0939\u0948 ..."
// (to include such a string in a Java program you would have to double each \)
String hindiStr = StringEscapeUtils.unescapeJava( escapedStr );
System.out.println(hindiStr);
(Make sure your console is set up to display Hindi (correct fonts, etc) and the console's encoding matches your Java encoding. The Java code above is just the bare bones.)
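The string literal works because the Java compiler translates \uXXXX escapes in source code at compile time. Text read from a file at run time is never passed through the compiler, so the backslash, the 'u' and the four hex digits stay six separate characters. If adding Commons Lang is not an option, the same unescaping can be sketched with the JDK's regex classes. The class and method names are made up, and only four-digit escapes are handled:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeUnescape {
    // A literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static String unescape(String text) {
        Matcher m = ESCAPE.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());                    // copy text before the escape
            out.append((char) Integer.parseInt(m.group(1), 16));  // append the decoded char
            last = m.end();
        }
        out.append(text, last, text.length());                    // copy the remaining tail
        return out.toString();
    }

    public static void main(String[] args) {
        // Doubled backslashes put a literal backslash into the string, as a file would contain.
        String fromFile = "\\u0905\\u092d\\u0940 \\u0938\\u092e\\u092f";
        System.out.println(unescape(fromFile));
    }
}
```

Incomplete escapes, such as the truncated one at the end of the sample line, are left untouched rather than causing an exception.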

You should store the contents in the file as UTF-8 encoded Hindi characters. For instance, in your case it would be अभी समय है जनता जो चाहती है. That is, instead of saving unicode escapes, directly save the raw Hindi characters. You can then simply read like normal.
You just have to make sure that the editor you use saves it using UTF-8 encoding. See Spanish language chars are not displayed properly?
Otherwise, you'll have to make the file a .properties file and read using java.util.Properties as it offers unicode unescaping support inherently.
Also read Reading unicode character in java
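The java.util.Properties route can be sketched as follows. The key name "text" and the class name are made up for illustration, and the input is given as a string here where a real program would open the .properties file with a Reader:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropertiesUnescape {
    // Load key=value text and return the value for the given key;
    // Properties.load decodes backslash-u escapes while parsing.
    static String readValue(String propertiesText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // Doubled backslashes produce the literal escapes a file would contain.
        String fileContent = "text=\\u0905\\u092d\\u0940 \\u0938\\u092e\\u092f";
        System.out.println(readValue(fileContent, "text"));
    }
}
```

This only fits if the file can be restructured into key=value lines. For free-form text, saving the file as UTF-8 is the simpler option.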

Convert HTML symbols and HTML names to HTML number using Java

I have an XML file which contains many special symbols like ® (HTML number &#174;) etc.
and HTML names like &atilde; (HTML number &#227;) etc.
I am trying to replace these HTML symbols and HTML names with the corresponding HTML numbers using Java. For this, I first read the XML file into a string and then used the replaceAll method as:
File fn = new File("myxmlfile.xml");
String content = FileUtils.readFileToString(fn);
content = content.replaceAll("®", "&\#174");
FileUtils.writeStringToFile(fn, content);
But this is not working.
Can anyone please tell me how to do it?
Thanks !!!

The signature for the replaceAll method is:
public String replaceAll(String regex, String replacement)
You have to be careful that your first parameter is a valid regular expression. The Java Pattern class describes the constructs used in a Java regular expression.
Based on what I see in the Pattern class description, I don't see what's wrong with:
content = content.replaceAll("®", "&\#174");
You could try:
content = content.replaceAll("\\p(®)", "&\#174");
and see if that works better.

I don't think that \# is a valid escape sequence.
BTW, what's wrong with "&#174" ?

If you want HTML numbers, first try escaping for XML.
Use StringEscapeUtils from Apache Commons Lang.
Java may have trouble dealing with it, so I prefer to escape for Java first, and then for XML or HTML.
String escapedStr = StringEscapeUtils.escapeJava(yourString);
escapedStr = StringEscapeUtils.escapeXml(yourString);
escapedStr = StringEscapeUtils.escapeHtml(yourString);
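A library-free alternative is to walk the string and emit a numeric character reference for every non-ASCII code point. The sketch below uses made-up class and method names. Its table of named entities is only an illustrative subset (the two names from the question), not a complete list:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class NumericEntities {
    // Illustrative subset only; a real table would cover every named entity in use.
    private static final Map<String, Integer> NAMED = new LinkedHashMap<>();
    static {
        NAMED.put("&reg;", 174);
        NAMED.put("&atilde;", 227);
    }

    static String toNumeric(String text) {
        String result = text;
        for (Map.Entry<String, Integer> e : NAMED.entrySet()) {
            // String.replace treats both arguments literally (no regex involved).
            result = result.replace(e.getKey(), "&#" + e.getValue() + ";");
        }
        StringBuilder out = new StringBuilder();
        result.codePoints().forEach(cp -> {
            if (cp > 127) {
                out.append("&#").append(cp).append(';'); // non-ASCII: numeric reference
            } else {
                out.appendCodePoint(cp);                 // ASCII: copy unchanged
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toNumeric("Acme\u00AE S&atilde;o")); // Acme&#174; S&#227;o
    }
}
```

Using String.replace instead of replaceAll sidesteps the regex and escape-sequence questions raised in the answers above, because neither argument is interpreted as a pattern.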
