replace string with unicode text in pdf file using PDFbox? - java

I need to read the strings from PDF file and replace it with the Unicode text.If it is ASCII chars everything is fine. But with Unicode characters, it showing question marks/junk text.No problem with font file(ttf) I am able to write a unicode text to the pdf file with a different class (PDFContentStream). With this class, there is no option to replace text but we can add new text.
Sample unicode text
Bɐɑɒ
issue (Address column)
https://drive.google.com/file/d/1DbsApTCSfTwwK3txsDGW8sXtDG_u-VJv/view?usp=sharing
I am using PDFBox.
Please help me with this.....
check the code I am using.....
enter image description herepublic static PDDocument _ReplaceText(PDDocument document, String searchString, String replacement)
throws IOException {
if (StringUtils.isEmpty(searchString) || StringUtils.isEmpty(replacement)) {
return document;
}
for (PDPage page : document.getPages()) {
PDResources resources = new PDResources();
PDFont font = PDType0Font.load(document, new File("arial-unicode-ms.ttf"));
//PDFont font2 = PDType0Font.load(document, new File("avenir-next-regular.ttf"));
resources.add(font);
//resources.add(font2);
//resources.add(PDType1Font.TIMES_ROMAN);
page.setResources(resources);
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
List tokens = parser.getTokens();
for (int j = 0; j < tokens.size(); j++) {
Object next = tokens.get(j);
if (next instanceof Operator) {
Operator op = (Operator) next;
String pstring = "";
int prej = 0;
// Tj and TJ are the two operators that display strings in a PDF
if (op.getName().equals("Tj")) {
// Tj takes one operator and that is the string to display so lets update that
// operator
COSString previous = (COSString) tokens.get(j - 1);
String string = previous.getString();
string = string.replaceFirst(searchString, replacement);
previous.setValue(string.getBytes());
} else if (op.getName().equals("TJ")) {
COSArray previous = (COSArray) tokens.get(j - 1);
for (int k = 0; k < previous.size(); k++) {
Object arrElement = previous.getObject(k);
if (arrElement instanceof COSString) {
COSString cosString = (COSString) arrElement;
String string = cosString.getString();
if (j == prej) {
pstring += string;
} else {
prej = j;
pstring = string;
}
}
}
if (searchString.equals(pstring.trim())) {
COSString cosString2 = (COSString) previous.getObject(0);
cosString2.setValue(replacement.getBytes());
int total = previous.size() - 1;
for (int k = total; k > 0; k--) {
previous.remove(k);
}
}
}
}
}
// now that the tokens are updated we will replace the page content stream.
PDStream updatedStream = new PDStream(document);
OutputStream out = updatedStream.createOutputStream(COSName.FLATE_DECODE);
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens(tokens);
out.close();
page.setContents(updatedStream);
}
return document;
}

Your code utterly breaks the PDF, cf. the Adobe Preflight output:
The cause is obvious, your code
PDResources resources = new PDResources();
PDFont font = PDType0Font.load(document, new File("arial-unicode-ms.ttf"));
resources.add(font);
page.setResources(resources);
drops the pre-existing page Resources and your replacement contains only a single font the name of which you allow PDFBox to choose arbitrarily.
You must not drop existing resources as they are used in your document.
Inspecting the content of your PDF page it becomes obvious that the encoding of the originally used fonts T1_0 and T1_1 either is a single byte encoding or a mixed single/multi-byte encoding; the lower single byte values appear to be encoded ASCII-like.
I would assume that the encoding is WinAnsiEncoding or a subset thereof. As a corollary your task
to read the strings from PDF file and replace it with the Unicode text
cannot be implemented as a simple replacement, at least not with arbitrary Unicode code points in mind.
What you can implement instead is:
First run your source PDF through a customized text stripper which instead of extracting the plain text searches for your strings to replace and returns their positions. There are numerous questions and answers here that show you how to determine coordinates of strings in text stripper sub classes, a recent one being this one.
Next remove those original strings from your PDF. In your case an approach similar to your original code above (without dropping the resource, obviously), replacing the strings by equally long strings of spaces might work even it is a dirty hack.
Finally add your replacements at the determined positions using a PDFContentStream in append mode; for this add your new font to the existing resources.
Please be aware, though, that PDF is not designed to be used like this. Template PDFs can be used as background for new content, but attempting to replace content therein usually is a bad design leading to trouble. If you need to mark positions in the template, use annotations which can easily be dropped during fill-in. Or use AcroForm forms, the native PDF form technology, to start with.

Related

While converting a PDF to regular text using PDFBox, how can I surround superscript text with braces?

Using PDFBox I want to convert a very large PDF file into regular text. I would like to mark any supertext with braces. Being relatively new to PDFBox, how can I surround superscript text with braces?
Example:
PDF: This is text with the X being superscript.
Output: This is text with the (X) being superscript.
Hope you can help. I have seen this post, but that one does not give an easy approach.
My code so far is:
try (PDDocument document = PDDocument.load( new File("files/my-input.pdf"));
FileWriter fileWriter = new FileWriter( "files/my-output.txt")) {
PDFTextStripper tStripper = new PDFTextStripper();
int numberOfPages = document.getNumberOfPages();
for( int i = 1; i <= numberOfPages; i++) {
tStripper.setStartPage(i);
tStripper.setEndPage(i);
tStripper.writeText( document, fileWriter);
fileWriter.flush();
}
}
Subclassing the PDFTextStripper class and simply overruling the writeString() does not work because it will interfere with the original method. The string.getHeight() shows the heigth of the character - so could be used.

Handle many unicode caracters with PDFBox

I am writing a Java function which takes a String as a parameter and produce a PDF as an output with PDFBox.
Everything is working fine as long as I use latin characters.
However, I don't know in advance what will be the input, and it might be some English as well as Chinese or Japanese characters.
In the case of non latin characters, here is the error I get:
Exception in thread "main" java.lang.IllegalArgumentException: U+3053 ('kohiragana') is not available in this font Helvetica encoding: WinAnsiEncoding
at org.apache.pdfbox.pdmodel.font.PDType1Font.encode(PDType1Font.java:426)
at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:324)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showTextInternal(PDPageContentStream.java:509)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showText(PDPageContentStream.java:471)
at com.mylib.pdf.PDFBuilder.generatePdfFromString(PDFBuilder.java:122)
at com.mylib.pdf.PDFBuilder.main(PDFBuilder.java:111)
If I understand correctly, I have to use a specific font for Japanese, another one for Chinese and so on, because the one that I am using (Helvetiva) doesn't handle all required unicode characters.
I could also use a font which handle all these unicode characters, such as Arial Unicode. However this font is under a specific license so I cannot use it and I haven't found another one.
I found some projects that want to overcome this issue, like the Google NOTO project.
However, this project provides multiple font files. So I would have to choose, at runtime, the correct file to load depending on the input I have.
So I am facing 2 options, one of which I don't know how to implement properly:
Keep searching for a font that handle almost every unicode character (where is this grail I am desperately seeking?!)
Try to detect which language is used and select a font depending on it.
Despite the fact that I don't know (yet) how to do that, I don't find it to be a clean implementation, as the mapping between the input and the font file will be hardcoded, meaning I will have to hardcode all the possible mappings.
Is there another solution?
Am I completely off tracks?
Thanks in advance for your help and guidance!
Here is the code I use to generate the PDF:
public static void main(String args[]) throws IOException {
String latinText = "This is latin text";
String japaneseText = "これは日本語です";
// This works good
generatePdfFromString(latinText);
// This generate an error
generatePdfFromString(japaneseText);
}
private static OutputStream generatePdfFromString(String content) throws IOException {
PDPage page = new PDPage();
try (PDDocument doc = new PDDocument();
PDPageContentStream contentStream = new PDPageContentStream(doc, page)) {
doc.addPage(page);
contentStream.setFont(PDType1Font.HELVETICA, 12);
// Or load a specific font from a file
// contentStream.setFont(PDType0Font.load(this.doc, new File("/fontPath.ttf")), 12);
contentStream.beginText();
contentStream.showText(content);
contentStream.endText();
contentStream.close();
OutputStream os = new ByteArrayOutputStream();
doc.save(os);
return os;
}
}
A better solution than waiting for a font or guessing a text's language is to have a multitude of fonts and selecting the correct font on a glyph-by-glyph base.
You already found the Google Noto Fonts which are a good base collection of fonts for this task.
Unfortunately, though, Google publishes the Noto CJK fonts only as OpenType fonts (.otf), not as TrueType fonts (.ttf), a policy that isn't likely to change, cf. the Noto fonts issue 249 and others. On the other hand PDFBox does not support OpenType fonts and isn't actively working on OpenType support either, cf. PDFBOX-2482.
Thus, one has to convert the OpenType font somehow to TrueType. I simply took the file shared by djmilch in his blog post FREE FONT NOTO SANS CJK IN TTF.
Font selection per character
So you essentially need a method which checks your text character by character and dissects it into chunks which can be drawn using the same font.
Unfortunately I don't see a better method to ask a PDFBox PDFont whether it knows a glyph for a given character than to actually try to encode the character and consider a IllegalArgumentException a "no".
I, therefore, implemented that functionality using the following helper class TextWithFont and method fontify:
class TextWithFont {
final String text;
final PDFont font;
TextWithFont(String text, PDFont font) {
this.text = text;
this.font = font;
}
public void show(PDPageContentStream canvas, float fontSize) throws IOException {
canvas.setFont(font, fontSize);
canvas.showText(text);
}
}
(AddTextWithDynamicFonts inner class)
List<TextWithFont> fontify(List<PDFont> fonts, String text) throws IOException {
List<TextWithFont> result = new ArrayList<>();
if (text.length() > 0) {
PDFont currentFont = null;
int start = 0;
for (int i = 0; i < text.length(); ) {
int codePoint = text.codePointAt(i);
int codeChars = Character.charCount(codePoint);
String codePointString = text.substring(i, i + codeChars);
boolean canEncode = false;
for (PDFont font : fonts) {
try {
font.encode(codePointString);
canEncode = true;
if (font != currentFont) {
if (currentFont != null) {
result.add(new TextWithFont(text.substring(start, i), currentFont));
}
currentFont = font;
start = i;
}
break;
} catch (Exception ioe) {
// font cannot encode codepoint
}
}
if (!canEncode) {
throw new IOException("Cannot encode '" + codePointString + "'.");
}
i += codeChars;
}
result.add(new TextWithFont(text.substring(start, text.length()), currentFont));
}
return result;
}
(AddTextWithDynamicFonts method)
Example use
Using the method and the class above like this
String latinText = "This is latin text";
String japaneseText = "これは日本語です";
String mixedText = "Tこhれiはs日 本i語sで すlatin text";
generatePdfFromStringImproved(latinText).writeTo(new FileOutputStream("Cccompany-Latin-Improved.pdf"));
generatePdfFromStringImproved(japaneseText).writeTo(new FileOutputStream("Cccompany-Japanese-Improved.pdf"));
generatePdfFromStringImproved(mixedText).writeTo(new FileOutputStream("Cccompany-Mixed-Improved.pdf"));
(AddTextWithDynamicFonts test testAddLikeCccompanyImproved)
ByteArrayOutputStream generatePdfFromStringImproved(String content) throws IOException {
try ( PDDocument doc = new PDDocument();
InputStream notoSansRegularResource = AddTextWithDynamicFonts.class.getResourceAsStream("NotoSans-Regular.ttf");
InputStream notoSansCjkRegularResource = AddTextWithDynamicFonts.class.getResourceAsStream("NotoSansCJKtc-Regular.ttf") ) {
PDType0Font notoSansRegular = PDType0Font.load(doc, notoSansRegularResource);
PDType0Font notoSansCjkRegular = PDType0Font.load(doc, notoSansCjkRegularResource);
List<PDFont> fonts = Arrays.asList(notoSansRegular, notoSansCjkRegular);
List<TextWithFont> fontifiedContent = fontify(fonts, content);
PDPage page = new PDPage();
doc.addPage(page);
try ( PDPageContentStream contentStream = new PDPageContentStream(doc, page)) {
contentStream.beginText();
for (TextWithFont textWithFont : fontifiedContent) {
textWithFont.show(contentStream, 12);
}
contentStream.endText();
}
ByteArrayOutputStream os = new ByteArrayOutputStream();
doc.save(os);
return os;
}
}
(AddTextWithDynamicFonts helper method)
I get
for latinText = "This is latin text"
for japaneseText = "これは日本語です"
and for mixedText = "Tこhれiはs日 本i語sで すlatin text"
Some asides
I retrieved the fonts as Java resources but you can use any kind of InputStream for them.
The font selection mechanism above can quite easily be combined with the line breaking mechanism shown in this answer and the justification extension thereof in this answer
Below is another implementation of splitting a plain text into the chunks of TextWithFont objects. Algorithm does character-by-character encoding and always tries to encode with a main font and only in the case of a failure will proceed with the next fonts in the list of fallback fonts.
Main classwith properties:
public class SplitByFontsProcessor {
/** Text to be processed */
private String text;
/** List of fonts to be used for processing */
private List<PDFont> fonts;
/** Main font to be used for processing */
private PDFont mainFont;
/** List of fallback fonts to be used for processing. It does not contain the main font. */
private List<PDFont> fallbackFonts;
........
}
Methods within the same class:
private List<TextWithFont> splitUsingFallbackFonts() throws IOException {
final List<TextWithFont> fontifiedText = new ArrayList<>();
final StringBuilder strBuilder = new StringBuilder();
boolean isHandledByMainFont = false;
// Iterator over Unicode codepoints in Java string
final PrimitiveIterator.OfInt iterator = text.codePoints().iterator();
while (iterator.hasNext()) {
int codePoint = iterator.nextInt();
final String stringCodePoint = new String(Character.toChars(codePoint));
// try to encode Unicode codepoint
try {
// Multi-byte encoding with 1 to 4 bytes.
mainFont.encode(stringCodePoint); // fails here if can not be handled by the font
strBuilder.append(stringCodePoint); // append if succeeded to encode
isHandledByMainFont = true;
} catch(IllegalArgumentException ex) {
// IllegalArgumentException is thrown if character can not be handled by a given Font
// Adding successfully handled characters so far
if (StringUtils.isNotEmpty(strBuilder.toString())) {
fontifiedText.add(new TextWithFont(strBuilder.toString(), mainFont));
strBuilder.setLength(0);// clear StringBuilder
}
handleByFallbackFonts(fontifiedText, stringCodePoint);
isHandledByMainFont = false;
} // end main font try-catch
}
// If this is the last successful run that was handled by main font, then add result
if (isHandledByMainFont) {
fontifiedText.add(new TextWithFont(strBuilder.toString(), mainFont));
}
return mergeAdjacents(fontifiedText);
}
Method handleByFallbackFonts():
private void handleByFallbackFonts(List<TextWithFont> fontifiedText, String stringCodePoint)
throws IOException {
final StringBuilder strBuilder = new StringBuilder();
boolean isHandledByFallbackFont = false;
// Retry with fallback fonts
final Iterator<PDFont> fallbackFontsIterator = fallbackFonts.iterator();
while(fallbackFontsIterator.hasNext()) {
try {
final PDFont fallbackFont = fallbackFontsIterator.next();
fallbackFont.encode(stringCodePoint); // fails here if can not be handled by the font
isHandledByFallbackFont = true;
strBuilder.append(stringCodePoint);
fontifiedText.add(new TextWithFont(strBuilder.toString(), fallbackFont));
break; // if successfully handled - break the loop
} catch(IllegalArgumentException exception) {
// do nothing, proceed to the next font
}
} // end while
// If character was not handled and this is the last font - throw an exception
if (!isHandledByFallbackFont) {
final String fontNames = fonts.stream()
.map(PDFont::getName)
.collect(Collectors.joining(", "));
int codePoint = stringCodePoint.codePointAt(0);
throw new TextProcessingException(
String.format("Unicode code point [%s] can not be handled by configured fonts: [%s]",
codePoint, fontNames));
}
}
Method splitUsingFallbackFonts() returns a list of TextWithFont objects in which adjacent objects with the same font will not be necessarily belong to the same object. This happens because an algorithm will always first retry to render a character by the main font, and in case it fails, it will create a new object with the font capable of rendering the character. So we need to call a utility method, mergeAdjacents(), which will merge them together.
private static List<TextWithFont> mergeAdjacents(final List<TextWithFont> fontifiedText) {
final Deque<TextWithFont> result = new LinkedList<>();
for (TextWithFont elem : fontifiedText) {
final TextWithFont resElem = result.peekLast();
if (resElem == null || !resElem.getFont().equals(elem.getFont())) {
result.addLast(elem);
} else {
result.addLast(merge(result.pollLast(), elem));
}
}
return new ArrayList<>(result);
}

Remove special characters from text/PDF with Apache Tika

I am parsing PDF file to extract text with Apache Tika.
//Create a body content handler
BodyContentHandler handler = new BodyContentHandler();
//Metadata
Metadata metadata = new Metadata();
//Input file path
FileInputStream inputstream = new FileInputStream(new File(faInputFileName));
//Parser context. It is used to parse InputStream
ParseContext pcontext = new ParseContext();
try
{
//parsing the document using PDF parser from Tika.
PDFParser pdfparser = new PDFParser();
//Do the parsing by calling the parse function of pdfparser
pdfparser.parse(inputstream, handler, metadata,pcontext);
}catch(Exception e)
{
System.out.println("Exception caught:");
}
String extractedText = handler.toString();
Above code works and text from the PDF is extcted.
There are some special characters in the PDF file (like #/&/£ or trademark sign, etc). How can I remove those special charaters during or after the extraction process?
PDF uses unicode code points you may well have strings that contain surrogate pairs, combining forms (eg for diacritics) etc, and may wish to preserve these as their closest ASCII equivalent, eg normalise é to e. If so, you can do something like this:
import java.text.Normalizer;
String normalisedText = Normalizer.normalize(handler.toString(), Normalizer.Form.NFD);
If you are simply after ASCII text then once normalised you could filter the string you get from Tika using a regular expression as per this answer:
extractedText = normalisedText.replaceAll("[^\\p{ASCII}]", "");
However, since regular expressions can be slow (particularly on large strings) you may want to avoid the regex and do a simple substitution (as per this answer):
public static String flattenToAscii(String string) {
char[] out = new char[string.length()];
String normalized = Normalizer.normalize(string, Normalizer.Form.NFD);
int j = 0;
for (int i = 0, n = normalized.length(); i < n; ++i) {
char c = normalized.charAt(i);
if (c <= '\u007F') out[j++] = c;
}
return new String(out);
}

JSoup extract only specific parts from Wikipedia

I have managed to extract the information in the "tables" on the right side of a Wikipedia article. However I also want to get paragraphs from the main text of the articles.
The code I'm using atm is only working about 60% of the time(Nullpointers or no text at all). In the example below I'm only interested in the tho first paragraphs, however that is irrelevant for my question.
In the picture below I show what parts I want the text from. I want to be able to iterate through all ... parts in the < divid="mw-content-text"....class="mw-content-ltr"> block.
StringBuilder sb = new StringBuilder();
String url = baseUrl + location;
Document doc = Jsoup.connect(url).get();
Elements paragraphs = doc.select(".mw-content-ltr p");
Element firstParagraph = paragraphs.first();
Element elementTwo = firstParagraph.nextElementSibling();
if (elementTwo == null) {
for (int i = 0; i < 2; i++) {
sb.append(paragraphs.get(i).text());
}
} else {
sb.append(elementTwo.text());
}
return sb.toString();

Identify hidden text Word 2003/2007 using Apache POI

I am converting a Word (2003 and 2007) document to HTML format. I have managed to read the text, formats etc from the Word document. But the document contains some hidden text like 'Header Change History' which need not be displayed on the page. Is there any way to identify hidden texts from a Word document.
Any help will be much valuable.
I am not sure if this is a complete (or even accurate) solution, but for the files in the DOCX format, it seems that you can check if a character run is hidden by
XWPFRun cr;
if (cr.getCTR().getRPr().getVanish() != null){
// it is hidden
}
Got this from reverse-engineering the XML, and at least in my usage it seems to work. Would be very glad for additional (more informed) input, and a way to do the same thing in the old binary file format.
The following code snippet helps in identifying if the text is hidden
POIFSFileSystem fs = null;
boolean isHidden = false;
try {
fs = new POIFSFileSystem(new FileInputStream(filesname));
HWPFDocument doc = new HWPFDocument(fs);
WordExtractor we = new WordExtractor(doc);
String[] paragraphs = we.getParagraphText();
System.out.println("Word Document has " + paragraphs.length
+ " paragraphs");
Range range = doc.getRange();
for (int k = 0; k < range.numParagraphs(); k++) {
org.apache.poi.hwpf.usermodel.Paragraph paragraph = range
.getParagraph(k);
paragraph.text().trim();
paragraph.text().replaceAll("\\cM?\r?\n", "");
for (int j = 0; j < paragraph.numCharacterRuns(); j++) {
org.apache.poi.hwpf.usermodel.CharacterRun cr = paragraph
.getCharacterRun(j);
if (cr.isVanished()) {
// it is hidden
System.out.println("text is hidden ");
isHidden = true;
break;
}
}

Categories

Resources