I am parsing a PDF file to extract text with Apache Tika.
import java.io.File;
import java.io.FileInputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;

//Create a body content handler to collect the extracted text
BodyContentHandler handler = new BodyContentHandler();
//Metadata object to receive the document metadata
Metadata metadata = new Metadata();
//Input stream for the source PDF
FileInputStream inputstream = new FileInputStream(new File(faInputFileName));
//Parser context. It is used to parse the InputStream
ParseContext pcontext = new ParseContext();
try {
    //Parse the document using the PDF parser from Tika
    PDFParser pdfparser = new PDFParser();
    pdfparser.parse(inputstream, handler, metadata, pcontext);
} catch (Exception e) {
    System.out.println("Exception caught: " + e);
}
String extractedText = handler.toString();
The above code works and the text from the PDF is extracted.
There are some special characters in the PDF file (like #, &, £, or the trademark sign, etc.). How can I remove those special characters during or after the extraction process?
Since PDF uses Unicode code points, you may well have strings that contain surrogate pairs, combining forms (e.g. for diacritics), etc., and you may wish to preserve these as their closest ASCII equivalent, e.g. normalise é to e. If so, you can do something like this:
import java.text.Normalizer;
String normalisedText = Normalizer.normalize(handler.toString(), Normalizer.Form.NFD);
If you are simply after ASCII text, then once normalised you can filter the string you get from Tika using a regular expression, as per this answer:
extractedText = normalisedText.replaceAll("[^\\p{ASCII}]", "");
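For example (my own quick check, not from the original answer):
String s = Normalizer.normalize("Résumé £20", Normalizer.Form.NFD)
        .replaceAll("[^\\p{ASCII}]", "");
System.out.println(s); // prints "Resume 20" (the accents and £ are gone)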
However, since regular expressions can be slow (particularly on large strings) you may want to avoid the regex and do a simple substitution (as per this answer):
public static String flattenToAscii(String string) {
    char[] out = new char[string.length()];
    String normalized = Normalizer.normalize(string, Normalizer.Form.NFD);
    int j = 0;
    for (int i = 0, n = normalized.length(); i < n; ++i) {
        char c = normalized.charAt(i);
        // keep only the ASCII range; decomposed combining marks fall above it
        if (c <= '\u007F') out[j++] = c;
    }
    // use only the j chars actually written, so no trailing NUL characters
    return new String(out, 0, j);
}
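For example (my own quick check; NFD leaves compatibility characters like ™ intact, so they are simply dropped):
System.out.println(flattenToAscii("café £10™")); // prints "cafe 10"
If you also want ™ folded to its ASCII spelling "TM", use Normalizer.Form.NFKD instead of NFD.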
Related
I need to read the strings from a PDF file and replace them with Unicode text. If it is ASCII characters, everything is fine, but with Unicode characters it shows question marks/junk text. There is no problem with the font file (ttf); I am able to write Unicode text to the PDF file with a different class (PDPageContentStream). With that class there is no option to replace text, but we can add new text.
Sample unicode text
Bɐɑɒ
issue (Address column)
https://drive.google.com/file/d/1DbsApTCSfTwwK3txsDGW8sXtDG_u-VJv/view?usp=sharing
I am using PDFBox.
Please help me with this. Here is the code I am using:
public static PDDocument _ReplaceText(PDDocument document, String searchString, String replacement)
        throws IOException {
    if (StringUtils.isEmpty(searchString) || StringUtils.isEmpty(replacement)) {
        return document;
    }
    for (PDPage page : document.getPages()) {
        PDResources resources = new PDResources();
        PDFont font = PDType0Font.load(document, new File("arial-unicode-ms.ttf"));
        //PDFont font2 = PDType0Font.load(document, new File("avenir-next-regular.ttf"));
        resources.add(font);
        //resources.add(font2);
        //resources.add(PDType1Font.TIMES_ROMAN);
        page.setResources(resources);
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List<Object> tokens = parser.getTokens();
        for (int j = 0; j < tokens.size(); j++) {
            Object next = tokens.get(j);
            if (next instanceof Operator) {
                Operator op = (Operator) next;
                String pstring = "";
                int prej = 0;
                // Tj and TJ are the two operators that display strings in a PDF
                if (op.getName().equals("Tj")) {
                    // Tj takes one operand, the string to display, so update that operand
                    COSString previous = (COSString) tokens.get(j - 1);
                    String string = previous.getString();
                    string = string.replaceFirst(searchString, replacement);
                    previous.setValue(string.getBytes());
                } else if (op.getName().equals("TJ")) {
                    COSArray previous = (COSArray) tokens.get(j - 1);
                    for (int k = 0; k < previous.size(); k++) {
                        Object arrElement = previous.getObject(k);
                        if (arrElement instanceof COSString) {
                            COSString cosString = (COSString) arrElement;
                            String string = cosString.getString();
                            if (j == prej) {
                                pstring += string;
                            } else {
                                prej = j;
                                pstring = string;
                            }
                        }
                    }
                    if (searchString.equals(pstring.trim())) {
                        COSString cosString2 = (COSString) previous.getObject(0);
                        cosString2.setValue(replacement.getBytes());
                        int total = previous.size() - 1;
                        for (int k = total; k > 0; k--) {
                            previous.remove(k);
                        }
                    }
                }
            }
        }
        // now that the tokens are updated, replace the page content stream
        PDStream updatedStream = new PDStream(document);
        OutputStream out = updatedStream.createOutputStream(COSName.FLATE_DECODE);
        ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
        tokenWriter.writeTokens(tokens);
        out.close();
        page.setContents(updatedStream);
    }
    return document;
}
Your code utterly breaks the PDF, as the Adobe Preflight output shows.
The cause is obvious: your code
PDResources resources = new PDResources();
PDFont font = PDType0Font.load(document, new File("arial-unicode-ms.ttf"));
resources.add(font);
page.setResources(resources);
drops the pre-existing page Resources, and your replacement contains only a single font whose name you allow PDFBox to choose arbitrarily.
You must not drop existing resources as they are used in your document.
Inspecting the content of your PDF page, it becomes obvious that the encoding of the originally used fonts T1_0 and T1_1 is either a single-byte encoding or a mixed single/multi-byte encoding; the lower single-byte values appear to be encoded ASCII-like.
I would assume that the encoding is WinAnsiEncoding or a subset thereof. As a corollary your task
to read the strings from a PDF file and replace them with Unicode text
cannot be implemented as a simple replacement, at least not with arbitrary Unicode code points in mind.
What you can implement instead is:
First run your source PDF through a customized text stripper which, instead of extracting the plain text, searches for your strings to replace and returns their positions. There are numerous questions and answers here that show how to determine coordinates of strings in text stripper subclasses, a recent one being this one.
Next remove those original strings from your PDF. In your case an approach similar to your original code above (without dropping the resources, obviously), replacing the strings with equally long strings of spaces, might work, even though it is a dirty hack.
Finally add your replacements at the determined positions using a PDPageContentStream in append mode; for this, add your new font to the existing resources. A sketch of steps 1 and 3 follows below.
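A rough sketch of steps 1 and 3 (my own, not from the original answer; the class, field, and coordinate names are assumptions, and the PDFBox 2.x API is assumed):
// Step 1: a position-recording text stripper; the name SearchStripper is mine
class SearchStripper extends PDFTextStripper {
    final String needle;
    final List<float[]> hits = new ArrayList<>();
    SearchStripper(String needle) throws IOException {
        this.needle = needle;
    }
    @Override
    protected void writeString(String text, List<TextPosition> positions) throws IOException {
        int idx = text.indexOf(needle);
        if (idx >= 0) {
            // record the position of the first glyph of the match; note that
            // these stripper coordinates are top-down and may need converting
            // to PDF user space (bottom-up) before drawing in step 3
            TextPosition p = positions.get(idx);
            hits.add(new float[] { p.getXDirAdj(), p.getYDirAdj() });
        }
        super.writeString(text, positions);
    }
}

// Step 3: append the replacement at a determined position x, y; setFont
// registers the font in the existing page resources instead of replacing them
PDFont font = PDType0Font.load(document, new File("arial-unicode-ms.ttf"));
try (PDPageContentStream cs = new PDPageContentStream(document, page,
        PDPageContentStream.AppendMode.APPEND, true, true)) {
    cs.beginText();
    cs.setFont(font, 10);
    cs.newLineAtOffset(x, y);
    cs.showText(replacement);
    cs.endText();
}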
Please be aware, though, that PDF is not designed to be used like this. Template PDFs can be used as background for new content, but attempting to replace content therein usually is a bad design leading to trouble. If you need to mark positions in the template, use annotations which can easily be dropped during fill-in. Or use AcroForm forms, the native PDF form technology, to start with.
I'm wondering if there is a way to obtain the content of a PDF file (raw bytes) as a String using Apache PDFBox 2.0.8. What I'm doing is saving the PDDocument object to a ByteArrayOutputStream and then creating a new String from the ByteArrayOutputStream's byte array. But if I save the String to a file, the result is a blank PDF. The reason is that the PDF's stream section bytes come out different from those of a PDF written directly from the PDDocument object to a file. After discovering this, I tried to detect the ByteArrayOutputStream's character encoding using juniversalchardet, but no luck. So, is there a way to accomplish this?
This is what I have tried so far:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
PDDocument doc = new PDDocument();
... //Add page, font, pdPageContentStream and text only to doc object with some latin chars (áéíóú)
doc.save(baos);
So, if I create a file using the baos object, the PDF file looks as expected, but if I do this:
String str = new String(baos.toByteArray());
And then create a file from the bytes of str, the PDF file shows only a blank page.
Hope I was clear enough this time :)
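As an aside (my own suggestion, not part of the original exchange): if the goal is a String that preserves the raw bytes exactly, a charset that maps every byte to the code point of the same value, such as ISO-8859-1, makes the round trip lossless; the no-argument String constructor uses the platform default charset and corrupts binary data:
import java.nio.charset.StandardCharsets;

// ISO-8859-1 maps each byte 0..255 to the code point of the same value,
// so bytes -> String -> bytes loses nothing
String str = new String(baos.toByteArray(), StandardCharsets.ISO_8859_1);
byte[] roundTripped = str.getBytes(StandardCharsets.ISO_8859_1);
// roundTripped is byte-for-byte identical to baos.toByteArray()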
Using a PDFTextStripper, just extract the text and append everything to a String:
static String extractText(String path) throws IOException {
    StringBuilder sb = new StringBuilder();
    try (PDDocument document = PDDocument.load(new File(path))) {
        if (!document.isEncrypted()) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setSortByPosition(true);
            String pdfFileInText = stripper.getText(document);
            // split the extracted text into lines and concatenate them
            String[] lines = pdfFileInText.split("\\r?\\n");
            for (String line : lines) {
                sb.append(line);
            }
        }
    }
    return sb.toString();
}
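A hypothetical call site (extractText is the wrapper name introduced above):
String text = extractText("your\\path\\file.pdf");
System.out.println(text);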
I'm working on an e-book reader written in Java. The primary file type is fb2, which is an XML-based format.
Images inside these books are stored inside <binary> tags as a long text line (at least it looks like text in a text editor).
How can I transform this text into actual pictures in Java? For working with the XML I'm using the JDOM2 library.
What I've tried does not produce valid pictures (jpeg files):
private void saveCover(Object book) {
    // Necessary cast to process the book
    Document doc = (Document) book;
    // Document root and namespace
    Element root = doc.getRootElement();
    Namespace ns = root.getNamespace();
    Element binaryEl = root.getChild("binary", ns);
    String binaryText = binaryEl.getText();
    File cover = new File(tempFolderPath + "cover.jpeg");
    try (FileOutputStream fileOut = new FileOutputStream(cover);
         BufferedOutputStream bufferOut = new BufferedOutputStream(fileOut)) {
        bufferOut.write(binaryText.getBytes());
    } catch (IOException e) {
        e.printStackTrace();
    }
}
The image content is specified as being base64 encoded (see: http://wiki.mobileread.com/wiki/FB2#Binary ).
As a consequence, you have to take the text from the binary element and decode it into binary data (in Java 8 use java.util.Base64 and this method: http://docs.oracle.com/javase/8/docs/api/java/util/Base64.html#getDecoder-- ).
If you take the binaryText value from your code and feed it into the decoder's decode() method, you should get the right byte[] value for the image.
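A minimal sketch of that fix (my own, layered on the question's code; the MIME decoder is used here because the base64 payload in <binary> tags is typically wrapped across several lines, which the basic decoder rejects):
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

// binaryText and tempFolderPath come from the question's code
byte[] imageBytes = Base64.getMimeDecoder().decode(binaryText);
Files.write(Paths.get(tempFolderPath + "cover.jpeg"), imageBytes);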
I am trying to read text from a PDF file, split it into paragraphs, put them into an ArrayList, and print the elements of the ArrayList, but I get no output.
String path = "E:\\test.pdf";
PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;
COSDocument cosDoc = null;
File file = new File(path);
PDFParser parser = new PDFParser(new FileInputStream(file));
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(1);
String page = pdfStripper.getText(pdDoc);
String[] paragraph = page.split("\n");
ArrayList<String> ramy = new ArrayList<>();
String p = "";
for (String x : paragraph) {
    if ((x.endsWith("\\.")) || (x.endsWith("\\." + "\\s+"))) {
        p += x;
        ramy.add(p);
        p = "";
    } else {
        p += x;
    }
}
for (String x : ramy) {
    System.out.print(x + "\n\n");
}
Note: I am using NetBeans 8.0.2, Windows 8.1, and the PDFBox library to read from the PDF file.
The most crippling bug you have is that you are calling endsWith() with "\\.", which is two characters: a literal backslash and a literal dot (not an escaped dot), and again with "\\.\\s+" (again, all literal characters). It's clear you (incorrectly) believed that the method accepts a regex, which it doesn't.
Assuming your logic is sound, change your test to use a regex-based test:
if (x.matches(".*\\.\\s*"))
This combines the intention of your two checks into one test.
Note that you don't need to end the regex with $, because matches() must match the whole string to return true, so ^ and $ are implied at the start/end of the pattern.
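A quick demonstration of the difference (my own example, not from the original answer):
String x = "This line ends with a period.";
System.out.println(x.endsWith("\\."));      // false: "\\." is a literal backslash + dot
System.out.println(x.endsWith("."));        // true: endsWith() compares plain characters
System.out.println(x.matches(".*\\.\\s*")); // true: matches() interprets the regex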
I'd like to convert all the fonts, embedded or otherwise, of a PDF to another font using iText. I understand that line-height, kerning and a bunch of other things would be bungled up, but I truly don't mind how ugly the output is.
I have seen how to embed fonts into existing pdfs here, but I don't know how to set ALL EXISTING text in the document to that font.
I understand that this isn't as straightforward as I make it out to be. Perhaps it would be easier just to take all the raw text from the document and create a new document using the new font (again, layout/readability is a non-issue to me).
The example EmbedFontPostFacto.java from chapter 16 of iText in Action — 2nd Edition shows how to embed an originally not embedded font. The central method is this:
public void manipulatePdf(String src, String dest) throws IOException, DocumentException {
    // the font file
    RandomAccessFile raf = new RandomAccessFile(FONT, "r");
    byte fontfile[] = new byte[(int) raf.length()];
    raf.readFully(fontfile);
    raf.close();
    // create a new stream for the font file
    PdfStream stream = new PdfStream(fontfile);
    stream.flateCompress();
    stream.put(PdfName.LENGTH1, new PdfNumber(fontfile.length));
    // create a reader object
    PdfReader reader = new PdfReader(RESULT1);
    int n = reader.getXrefSize();
    PdfObject object;
    PdfDictionary font;
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(RESULT2));
    PdfName fontname = new PdfName(FONTNAME);
    for (int i = 0; i < n; i++) {
        object = reader.getPdfObject(i);
        if (object == null || !object.isDictionary())
            continue;
        font = (PdfDictionary) object;
        if (PdfName.FONTDESCRIPTOR.equals(font.get(PdfName.TYPE))
                && fontname.equals(font.get(PdfName.FONTNAME))) {
            PdfIndirectObject objref = stamper.getWriter().addToBody(stream);
            font.put(PdfName.FONTFILE2, objref.getIndirectReference());
        }
    }
    stamper.close();
    reader.close();
}
This (without the fontname.equals(font.get(PdfName.FONTNAME)) test) may be a starting point for the simple cases of your task.
You'll have to do quite a lot of tests concerning encoding and add some individual translations for a more generic solution. You may want to study section 9 Text of the PDF specification ISO 32000-1 for this.