How do we build a normalized text file from a denormalized one? - java

Thanks for your replies/time.
We need to build a normalized text file from a denormalized text file. We explored a couple of options such as a unix shell script and loading into a database. I am looking to pick up better ideas for a resolution from this community.
The input text file has variable-length, comma-delimited records. The content may look like this:
XXXXXXXXXX , YYYYYYYYYY, TTTTTTTTTTT, UUUUUUUUUU, RRRRRRRRR,JJJJJJJJJ
111111111111, 22222222222, 333333333333, 44444444, 5555555, 666666
EEEEEEEE,WWWWWW,QQQQQQQ,PPPPPPPP
We would like to normalize it as follows:
XXXXXXXXXX , YYYYYYYYYY
TTTTTTTTTTT, UUUUUUUUUU
RRRRRRRRR,JJJJJJJJJ
111111111111, 22222222222
333333333333, 44444444
5555555, 666666
EEEEEEEE,WWWWWW
QQQQQQQ,PPPPPPPP
Is there any simple approach to get the above?
Thanks for helping.
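One possible approach, sketched below in plain Java (input.txt and output.txt are placeholder file names): collect every comma-separated field from the input file, then write the fields back out two per line.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class Normalizer {
    public static void main(String[] args) throws IOException {
        List<String> fields = new ArrayList<>();
        // Collect every comma-separated field from the input file
        for (String line : Files.readAllLines(Paths.get("input.txt"))) {
            for (String field : line.split(",")) {
                fields.add(field.trim());
            }
        }
        // Write the fields back out two per line
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < fields.size(); i += 2) {
            pairs.add(fields.get(i) + ", " + fields.get(i + 1));
        }
        Files.write(Paths.get("output.txt"), pairs);
    }
}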

Related

How to get "sout" shorthand to work in Sublime Text 3?

I am trying to get the "sout" shorthand to work in Sublime Text 3 for Java. In vscode and other editors typing "sout + [tab]" will fill in "System.out.println". When I try this in Sublime Text it instead prints "southPane".
This is something that can be done via a snippet or a completion; which one you use depends largely on the complexity of the text you want to insert and how many you have.
The main difference is that a snippet is an XML-based format where each file contains a single completion, whereas a sublime-completions file is a JSON-formatted file that can contain many completions at once. Additionally, all snippets are automatically added to the command palette and made available only in files to which they apply.
Thus the XML-based snippet is good for larger stretches of code (e.g. blocks) or for any text that needs to contain characters that would be a pain to encode as JSON, whereas the JSON-based completions are favored for shorter sequences of text, since you can pack more of them into a file.
To demonstrate a snippet, use Tools > Developer > New Snippet to generate a stub, then replace the stub with this content and save it as a file in the default offered location (your User package) as a sublime-snippet file; the name doesn't matter, but the extension does:
<snippet>
    <content><![CDATA[
System.out.println($0);
]]></content>
    <tabTrigger>sout</tabTrigger>
    <scope>source.java</scope>
</snippet>
This says that in a Java file, the abbreviation sout followed by Tab will expand out to the text System.out.println(); with the cursor left inside the parentheses.
Alternately, create a file with the following content and save it in your User package as a sublime-completions file (the name doesn't matter, only the extension, and you can use Preferences > Browse Packages to find the User package):
{
    "scope": "source.java",
    "completions": [
        { "trigger": "sout", "contents": "System.out.println($0);" },
    ]
}
This does the same as the above example, but the file is smaller, and you can include multiple items in it, say for example by also adding:
{ "trigger": "serr", "contents": "System.err.println($0);" },

Reversed Hebrew or numbers after using iText to parse a PDF document

I'm working with iText5 to parse a pdf written mostly in Hebrew.
To extract the text I use PdfTextExtractor.getTextFromPage. I didn't find a way to change the encoding in the library, and the text appears as gibberish.
I tried to fix the encoding like this:
new String(pdfPage.getBytes(Charset1), Charset2).
I went through all possible charsets using Charset.availableCharsets() and a few of them gave me Hebrew instead of gibberish, but reversed.
Now I thought I could reverse the text line by line, but Hebrew is right-to-left while numbers and English are left-to-right. So if I reverse the line, it fixes the Hebrew but breaks the numbers/English.
Example:
PdfTextExtractor.getTextFromPage returns 87.55 úåáééçúä ééåëéð ë"äñ
new String(text.getBytes(Charset.forName("ISO-8859-1")), Charset.forName("windows-1255")) returns 87.55 תובייחתה ייוכינ כ"הס
If I reverse this then I get סה"כ ניכויי התחייבות 55.78
The number should be 87.55 and not 55.78.
The only solution I found is to split it into Hebrew and the rest (English/numbers) and reverse only the Hebrew parts and then merge it back.
Isn't there an easier solution? I feel like I'm missing something with the encoding/RTL.
I can't share the document I'm working on because it contains PII. But after searching Google for a PDF with gibberish, I found this document - the last paragraph of the document has exactly the same problem I have in my documents.
I can only analyze the data given, so in this case only the linked government paper, from which the paragraph in question is extracted as
ìëéî ìù "íééç éøåùéë" øôñá ,äéãôåìòôäá íéáø úåðåéòø ãåò àåöîì ïúéð 􀂛
.ãåòå úéëåðéçä äééæëøîá ,567 'îò ,ïîöìæ éìéðå ì÷ðøô äéæø ,ïîæåø
And in this case the reason for the gibberish output is simple: The PDF claims that this gibberish is indeed the text there!
Thus, the problem is not the text extractor, be it the iText PdfTextExtractor, Adobe Reader copy&paste, or whichever. Instead, the problem is the document, which lies about its contents.
In more detail
The font TT1 used for this paragraph has a ToUnicode entry with the following mappings:
28 beginbfchar
<0003> <0020>
<0005> <0022>
<000a> <0027>
<000f> <002C>
<0011> <002E>
<001d> <003A>
<0069> <00E1>
<006a> <00E0>
<006b> <00E2>
<006c> <00E4>
<006d> <00E3>
<006e> <00E5>
<006f> <00E7>
<0070> <00E9>
<0071> <00E8>
<0074> <00ED>
<0075> <00EC>
<0078> <00F1>
<0079> <00F3>
<007a> <00F2>
<007b> <00F4>
<007c> <00F6>
<007e> <00FA>
<007f> <00F9>
<0096> <00E6>
<0097> <00F8>
<00ab> <00F7>
<00d5> <00F0>
endbfchar
3 beginbfrange
<0018> <001a> <0035>
<0072> <0073> <00EA>
<0076> <0077> <00EE>
endbfrange
I.e. all codes are mapped to Unicode values between U+0020 and U+00F9, a Unicode range in which clearly the Hebrew characters one sees in the screen shot are not located. More exactly: aside from space, some punctuation, and digits (which are extracted correctly) the values are in the range between U+00E0 and U+00F9, a region where Latin letters with accents and their ilk are located.
You mention that in some case you could retrieve the Hebrew text by applying
new String(text.getBytes(Charset.forName("ISO-8859-1")), Charset.forName("windows-1255"))
So probably the PDF creation tool has put mappings to the Windows-1255 codepage into the ToUnicode map, which obviously is wrong: the ToUnicode map must contain mappings to Unicode.
That all being said, even if the ToUnicode mappings were correct, you'd still have to fight with reversed Hebrew output. This indeed is a limitation of iText 5.x text extraction: it has no special support for RTL languages. Thus, you'll have to change the order of the characters in the result string yourself.
In this answer you'll find an example of such a re-ordering method. It is for Arabic script and it is in C# but it clearly shows how to proceed.
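For Hebrew, a rough Java sketch of that split-and-reverse idea follows (a hypothetical helper, not the linked C# code, and it assumes the extracted line is in visual order): reverse the whole line, then re-reverse the runs of digits and Latin letters so they read left-to-right again.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VisualOrderFixer {
    // Runs of digits/Latin letters (and embedded dots) that must stay left-to-right
    private static final Pattern LTR_RUN = Pattern.compile("[0-9A-Za-z.]+");

    // Reverse the whole visual-order line, then re-reverse the LTR runs
    static String reorderVisualLine(String line) {
        String reversed = new StringBuilder(line).reverse().toString();
        StringBuilder out = new StringBuilder();
        Matcher m = LTR_RUN.matcher(reversed);
        int last = 0;
        while (m.find()) {
            out.append(reversed, last, m.start());
            out.append(new StringBuilder(m.group()).reverse());
            last = m.end();
        }
        out.append(reversed, last, reversed.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // The example from the question: the number should stay 87.55
        System.out.println(reorderVisualLine("87.55 תובייחתה ייוכינ כ\"הס"));
    }
}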
First of all, the most appropriate Hebrew byte character set is "ISO-8859-8" (better than windows-1255); try to play with this. Also, I would try to extract the String using charset UTF-8. There is also a great diagnostic tool that helped me to diagnose and resolve countless thorny encoding issues related to Hebrew and Arabic: the open source Java library MgntUtils has a utility that converts Strings to Unicode sequences and vice versa:
String result = "שלום את";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u05e9\u05dc\u05d5\u05dd\u0020\u05d0\u05ea
שלום את
Here is the javadoc for the class StringUnicodeEncoderDecoder. As you can see, the Unicode symbols for Hebrew are U+05**, where the first Hebrew letter (Alef - א) is U+05d0 and the last Hebrew letter (Tav - ת) is U+05ea. The library can be found at Maven Central or on Github; it comes as a Maven artifact with sources and javadoc. So what I would do first is take your original String, convert it to a Unicode sequence, and see what you are actually getting there. If the data is not correct, then try to extract the bytes and build a String with UTF-8. Anyway, I would strongly recommend using this utility as it has helped me many times.
Using ICU did the job:
// Bidi here is com.ibm.icu.text.Bidi from the ICU4J library
Bidi bidi = new Bidi();
bidi.setPara(input, Bidi.RTL, null);
String output = bidi.writeReordered(Bidi.DO_MIRRORING);

PCL format generation from PDF in Java

I want to export a JasperReports report in PCL format, but I didn't find a way to do it, so I generated it as a PDF.
I want to create a class that converts this PDF to PCL5 format. Please can you give me a starting point and suggestions?
Thank you in advance.
I don't think there are any products left that will export any level of PCL. You can print to a PCL driver, but not directly export it.
You can use any PCL5 driver with Acrobat to print to FILE:, and there are many products that will batch-print PDFs.
However, a lot depends on what you expect to do with the PCL5, and why.
Bob Pooley
bp#pagetech.com
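If you want to drive such a print queue from Java rather than Acrobat, a minimal sketch with javax.print follows (the queue name is a placeholder, and this only works if the queue's driver accepts PDF input and converts it to PCL itself):
import java.io.FileInputStream;
import javax.print.Doc;
import javax.print.DocFlavor;
import javax.print.PrintService;
import javax.print.PrintServiceLookup;
import javax.print.SimpleDoc;

public class PrintPdfToPclQueue {
    public static void main(String[] args) throws Exception {
        // Look up an installed print queue whose driver produces PCL5 (queue name is an assumption)
        PrintService target = null;
        for (PrintService s : PrintServiceLookup.lookupPrintServices(null, null)) {
            if (s.getName().contains("PCL")) {
                target = s;
                break;
            }
        }
        if (target == null) {
            throw new IllegalStateException("No PCL print queue found");
        }
        try (FileInputStream pdf = new FileInputStream("report.pdf")) {
            // Only works if the queue accepts PDF input directly
            Doc doc = new SimpleDoc(pdf, DocFlavor.INPUT_STREAM.PDF, null);
            target.createPrintJob().print(doc, null);
        }
    }
}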

Subtitle editor [.srt to .ssa]

I have been working on a subtitling system in Java.
The normal .srt file can be saved and the subtitles are displayed fine.
I want the subtitles to have different properties like a different font/color/size. These properties are not encoded in a normal .srt; the file has to be saved as .ssa (SubStation Alpha) with extra fields like [V4+ Styles] and [Events].
I want to know whether there are any libraries which I can use to export directly to .ssa, or do I have to write a method which produces the [V4+ Styles] section myself?
Thank you.
Jubler is an open source library that seems to support the SubStation Alpha format.
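If you end up writing the .ssa yourself, the format is just plain text sections. A minimal hedged sketch (with a trimmed-down field list for illustration; real players usually expect the full [V4+ Styles]/[Events] field sets) could look like this:
import java.io.IOException;
import java.io.PrintWriter;

public class SsaWriter {
    public static void main(String[] args) throws IOException {
        try (PrintWriter w = new PrintWriter("subtitles.ssa", "UTF-8")) {
            w.println("[Script Info]");
            w.println("Title: Example");
            w.println("ScriptType: v4.00+");
            w.println();
            // Styles carry the font/colour/size information .srt cannot express
            w.println("[V4+ Styles]");
            w.println("Format: Name, Fontname, Fontsize, PrimaryColour, Bold, Italic");
            w.println("Style: Default,Arial,24,&H00FFFFFF,0,0");
            w.println();
            // Each Dialogue line references a style by name
            w.println("[Events]");
            w.println("Format: Layer, Start, End, Style, Text");
            w.println("Dialogue: 0,0:00:01.00,0:00:04.00,Default,Hello world");
        }
    }
}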

Splitting a Word file into multiple smaller Word files using OLE Automation from Java

I have been using OLE Automation from Java to access methods of Word.
I managed to do the following using OLE Automation:
Open a Word document template file.
Mail merge the Word document template with a CSV data source file.
Save the mail-merged file to a new Word document file.
What I need to do now is to be able to open the mail-merged file and then, using OLE, programmatically split it into multiple files. Meaning, if the original mail-merged file has 6000 pages and my max-pages-per-file property is set to 3000 pages, I need to create two new Word document files and place the first 3000 pages in one and the last 3000 pages in the other.
On my first attempts I took the number of rows in the CSV file and multiplied it by the number of pages in the template to get the total number of pages after the merge. Then I used the merging to create the multiple files. The problem however is that I cannot exactly calculate how many pages the merged document will be, because in some cases not all of, say, 9 pages of the template will be used, depending on the data and the merge fields used. So in some cases one row will only create 3 pages (using the 9-page template) and others might create 9 pages (using the 9-page template) at mail merge time.
So the only solution is to merge all rows into one document and then split it into multiple documents thereafter, to ensure that the exact number of pages (like the 3000-pages property) is indeed in each file until there are no more pages left from the original merged file.
I have tried a few things already by using the MSDN site to get methods and their properties etc. but have been unable to do this.
On my last attempts I have been trying to use GoTo to get to a specific page number and then remove the page. I was going to try to do this one by one for each page until I get to where I want the file to start from, and then save it as a new file, but have been unable to do so as well.
Please can anyone suggest something that could help me out?
Thanks and Regards
Sean
An example of opening a Word file using OLE Automation from Java is included below:
Code sample
OleAutomation documentsAutomation = this.getChildAutomation(this.wordAutomation, "Documents");
int[] id = documentsAutomation.getIDsOfNames(new String[]{"Open"});
Variant[] arguments = new Variant[1];
arguments[0] = new Variant(fileName); // where fileName is the absolute path to the docx file
Variant invokeResult = documentsAutomation.invoke(id[0], arguments);

private OleAutomation getChildAutomation(OleAutomation automation, String childName) {
    int[] id = automation.getIDsOfNames(new String[]{childName});
    Variant pVarResult = automation.getProperty(id[0]);
    return pVarResult.getAutomation();
}
Sounds like you've pegged it already. Another approach you could take, which would avoid building and then deleting, would be to look at the parts of your template that can make the biggest difference to the number of pages in your template (that is, where the data can be multi-line). If you then take these fields and look at the font, line-spacing and line-width type of properties, you'll be able to calculate the room your data will take in the template and limit your data at that point. Java FontMetrics can help you with that.
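A small sketch of that FontMetrics idea (the font, size and field width below are assumptions): measure how wide a merge-field value will render and estimate how many lines it will wrap into.
import java.awt.Font;
import java.awt.FontMetrics;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class FieldSizeEstimator {
    public static void main(String[] args) {
        // A throwaway image just to obtain a Graphics2D and its FontMetrics
        BufferedImage scratch = new BufferedImage(1, 1, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = scratch.createGraphics();
        FontMetrics fm = g.getFontMetrics(new Font("Times New Roman", Font.PLAIN, 11));

        String value = "Some long multi-line merge field value taken from the CSV row";
        int fieldWidthPx = 450; // assumed usable width of the field in the template
        int textWidthPx = fm.stringWidth(value);
        int estimatedLines = (int) Math.ceil((double) textWidthPx / fieldWidthPx);

        System.out.println("Roughly " + estimatedLines + " line(s), "
                + fm.getHeight() + " px per line");
        g.dispose();
    }
}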
