How to extract parameter from pdf file using java code & pdfbox

How to extract parameter from pdf file using java code & pdfbox - java

I am doing a java program which is to extract parameter from pdf files. I would like to extract the pdf to get the parameter like
obj
endobj
stream
endstream
xref
trailer
startxref
/Page
/Encrypt
/ObjStm
/JS
/JavaScript
/AA
/OpenAction
/JBIG2Decode
/RichMedia
/Launch
/XFA
parameter:
so I wish to get the output shown in the picture below:

Going by comment above So you want to extract the text from the PDF, and then count the occurences?, you can do as follows:
Read the PDF file in:
String[] words = null;
try (PDDocument document = PDDocument.load(new File("C:\\path\\to\\file.pdf"))) {
document.getClass();
if (!document.isEncrypted()) {
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition(true);
PDFTextStripper tStripper = new PDFTextStripper();
String pdfFileInText = tStripper.getText(document);
words = pdfFileInText.split("\\s+");
}
}
And then print the occurrences of words:
Arrays.stream(words)
.collect(Collectors.groupingBy(s -> s))
.forEach((k, v) -> System.out.println(k + " " + v.size()));
You may need to tweak this slightly to your own needs.

Related

While converting a PDF to regular text using PDFBox, how can I surround superscript text with braces?

Using PDFBox I want to convert a very large PDF file into regular text. I would like to mark any supertext with braces. Being relatively new to PDFBox, how can I surround superscript text with braces?
Example:
PDF: This is text with the X being superscript.
Output: This is text with the (X) being superscript.
Hope you can help. I have seen this post, but that one does not give an easy approach.
My code so far is:
try (PDDocument document = PDDocument.load( new File("files/my-input.pdf"));
FileWriter fileWriter = new FileWriter( "files/my-output.txt")) {
PDFTextStripper tStripper = new PDFTextStripper();
int numberOfPages = document.getNumberOfPages();
for( int i = 1; i <= numberOfPages; i++) {
tStripper.setStartPage(i);
tStripper.setEndPage(i);
tStripper.writeText( document, fileWriter);
fileWriter.flush();
}
}
Subclassing the PDFTextStripper class and simply overruling the writeString() does not work because it will interfere with the original method. The string.getHeight() shows the heigth of the character - so could be used.

Java Pdf content to String

I'm wondering if is there a way to obtain the content of a pdf file (raw bytes) as a String using Apache PdfBox 2.0.8. What I'm doing is to save the PDDocument object to a ByteArrayOutputStream and then create a new String getting ByteArrayOutputStream's byte array. But if I save the String to a file, the result is a blank pdf. The reason for this is because pdf's stream section bytes are different from a pdf created directly from PdDocument object to a file. After knowing this, I tried to get the ByteArrayOutputStream's character encoding using juniversalchardet, but no luck. So, is there a way to acomplish this?
This is what I have tried so far:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
PDDocument doc = new PDDocument();
... //Add page, font, pdPageContentStream and text only to doc object with some latin chars (áéíóú)
doc.save(baos);
So, if I create a file using baos object, the pdf file looks as expected, but if I do this:
String str = new String(baos.toByteArray());
And then create a file using str bytes, the pdf file only shows a blank page.
Hope I was clear enough this time :)

Using this, just append everything to a String.
StringBuilder sb = new StringBuilder();
try (PDDocument document = PDDocument.load(new File("your\\path\\file.pdf"))) {
document.getClass();
if (!document.isEncrypted()) {
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition(true);
PDFTextStripper tStripper = new PDFTextStripper();
String pdfFileInText = tStripper.getText(document);
String lines[] = pdfFileInText.split("\\r?\\n");
for (String line : lines) {
sb.append(line);
}
}
}
return sb.toString();

Text associated to PDF paragraph in document content object wit PDFBox

I'm trying to get the text associated to a paragraph navigating through the content tree of a PDF file. I am using PDFBox and cannot find the link between the paragraph and the text that it contains (see code below):
public class ReadPdf {
public static void main( String[] args ) throws IOException{
MyBufferedWriter out = new MyBufferedWriter(new FileWriter(new File(
"C:/Users/wip.txt")));
RandomAccessFile raf = new RandomAccessFile(new File(
"C:/Users/mypdf.pdf"), "r");
PDFParser parser = new PDFParser(raf);
parser.parse();
COSDocument cosDoc = parser.getDocument();
out.write(cosDoc.getXrefTable().toString());
out.write(cosDoc.getObjects().toString());
PDDocument document = parser.getPDDocument()
document.getClass();
COSParser cosParser = new COSParser(raf);
PDStructureTreeRoot treeRoot = document.getDocumentCatalog().getStructureTreeRoot();
for (Object kid : treeRoot.getKids()){
for (Object kid2 :((PDStructureElement)kid).getKids()){
PDStructureElement kid2c = (PDStructureElement)kid2;
if (kid2c.getStandardStructureType() == "P"){
for (Object kid3 : kid2c.getKids()){
if (kid3 instanceof PDStructureElement){
PDStructureElement kid3c = (PDStructureElement)kid3;
}
else{
for (Entry<COSName, COSBase>entry : kid2c.getCOSObject().entrySet()){
// Print all the Keys in the paragraph COSDictionary
System.out.println(entry.getKey().toString());
System.out.println(entry.getValue().toString());}
}}}}}}}
When I print the contents I get the following Keys:
/P : Reference to Parent
/A : Format of the paragraph
/K : Position of the paragraph in the section
/C : Name of the paragraph (!= text)
/Pg : Reference to the page
Example output:
COSName{K}
COSInt{2}
COSName{Pg}
COSObject{12, 0}
COSName{C}
COSName{Normal}
COSName{A}
COSObject{434, 0}
COSName{S}
COSName{Normal}
COSName{P}
COSObject{421, 0}
Now none of these points to the actual text inside the paragraph.
I know that the relation can be obtained as it is parsed when I open the document with acrobat (see pic below):

I found a way to do this through the parsing of the Content Stream from a page.
Navigating through the PDF Specification Chapter 10.6.3 there is a link between the numbering of each Text Stream which comes under \P \MCID and an attribute of the Tag (PDStructureElement in PDFBox) which can be found in the COSObject.
1) To get the text and the MCID:
PDPage pdPage;
Iterator<PDStream> inputStream = pdPage.getContentStreams();
while (inputStream.hasNext()) {
try {
PDFStreamParser parser2 = new PDFStreamParser((PDStream)inputStream.next());
parser2.parse();
List<Object> tokens = parser2.getTokens();
for (int j = 0; j < tokens.size(); j++){
tokenString = (tokenString + tokens.get(j).toString()}
// here comes the parsing of the string. Chapter 5 specifies what each of the operators Tj (actual text), Tm, BDC, BT, ET, EMC mean, MCID
Then to get the tags and their attribute that matches MCID:
PDStructureElement pDStructureElement;
pDStructureElement .getCOSObject().getInt(COSName.K)
That should do it. In documents without Tags (document.getDocumentCatalog().getStructureTreeRoot() is empty of children) this match cannot be performed but the text can still be read using step 1.

How to Convert splitted pdf to Excel file using java pdfbox

Im new to this PDFBOX. I hve one pdf file, which contain 60 pages. Im using Apache PDFBox-app-1.8.10. jar splitting up the PDF files.
public class SplitDemo {
public static void main(String[] args) throws IOException {
JButton open = new JButton();
JFileChooser fc = new JFileChooser();
fc.setCurrentDirectory(new java.io.File("C:/Users"));
fc.setDialogTitle("Select PDF");
if(fc.showOpenDialog(open)== JFileChooser.APPROVE_OPTION)
{
}
String a = null;
a = fc.getSelectedFile().getAbsolutePath();
PDDocument document = new PDDocument();
document = PDDocument.load(a);
// Create a Splitter object
Splitter splitter = new Splitter();
// We need this as split method returns a list
List<PDDocument> listOfSplitPages;
// We are receiving the split pages as a list of PDFs
listOfSplitPages = splitter.split(document);
// We need an iterator to iterate through them
Iterator<PDDocument> iterator = listOfSplitPages.listIterator();
// I am using variable i to denote page numbers.
int i = 1;
while(iterator.hasNext()){
PDDocument pd = iterator.next();
try{
// Saving each page with its assumed page no.
pd.save("G://PDFCopy/Page " + i++ + ".pdf");
} catch (COSVisitorException anException){
// Something went wrong with a PDF object
System.out.println("Something went wrong with page " + (i-1) + "\n Here is the error message" + anException);
}
}
}
}
**In PDFCopy Folder i hve list of pdf files. How can I convert all pdf to excel format and need to save it in the target folder. i am fully confused in this conversion. **

Combining XFA with PDFBox

I would like to fill a PDF form with the PDFBox java library.
The PDF form is created with Adobe Live Designer, so it uses the XFA format.
I try to find resources about filling XFA PDF forms with PDFBox, but i haven't any luck so far. I saw that a PDAcroForm.setXFA method is available in the API, but i don't see how to use it.
Do you know if it is possible to fill a PDF Form with PDFBox ?
If yes, is there anywhere a code sample or a tutorial to achieve this ?
If no, what are the best alternatives to achieve this ?

This it the best I was able to manage in the time I was allocated on the problem. I get the pdf saved (in Life Cycle) as optimized (I'm not the one doing the pdf). This is the PDF openning part, XML duplication and then saving:
PDDocument document = PDDocument.load(fileInputStream);
fileInputStream.close();
document.setAllSecurityToBeRemoved(true);
Map<String, String> values = new HashMap<String, String>();
values.put("variable_name", "value");
setFields(document, values); // see code below
PDAcroForm form = document.getDocumentCatalog().getAcroForm();
Document documentXML = form.getXFA().getDocument();
NodeList dataElements = documentXML.getElementsByTagName("xfa:data");
if (dataElements != null) {
for (int i = 0; i < dataElements.getLength(); i++) {
setXFAFields(dataElements.item(i), values);
}
}
COSStream cosout = new COSStream(new RandomAccessBuffer());
TransformerFactory.newInstance().newTransformer()
.transform(new DOMSource(documentXML), new StreamResult(cosout.createUnfilteredStream()));
form.setXFA(new PDXFA(cosout));
FileOutputStream fios = new FileOutputStream(new File(docOut + ".pdf"));
document.save(fios);
document.close();
try {
fios.flush();
} finally {
fios.close();
}
then the methods who set values for fields. I set both the XFA and the AcroForm:
public void setXFAFields(Node pNode, Map<String, String> values) throws IOException {
if (values.containsKey(pNode.getNodeName())) {
pNode.setTextContent(values.get(pNode.getNodeName()));
} else {
NodeList childNodes = pNode.getChildNodes();
if (childNodes != null) {
for (int i = 0; i < childNodes.getLength(); i++) {
setXFAFields(childNodes.item(i), values);
}
}
}
}
public void setFields(PDDocument pdfDocument, Map<String, String> values) throws IOException {
#SuppressWarnings("unchecked")
List<PDField> fields = pdfDocument.getDocumentCatalog().getAcroForm().getFields();
for (PDField pdField : fields) {
setFields(pdField, values);
}
}
private void setFields(PDField field, Map<String, String> values) throws IOException {
List<COSObjectable> kids = field.getKids();
if (kids != null) {
for (COSObjectable pdfObj : kids) {
if (pdfObj instanceof PDField) {
setFields((PDField) pdfObj, values);
}
}
} else {
// remove the [0] from the name to match values in our map
String partialName = field.getPartialName().replaceAll("\\[\\d\\]", "");
if (!(field instanceof PDSignatureField) && values.containsKey(partialName)) {
field.setValue(values.get(partialName));
}
}
}
This work, but not for all "kind" of PDF life Cycle produce, some got a warning message about "extended fonction" not enabled anymore but still work. The optimize version is the only one I found who don't prompt message when openned after being filled.
I fill the XFA and the Acroform otherwise it don't work in all viewer.

The question specifically identifies the PDFBox library in the subject; you do not need iText, the XFA manipulation can be done using the PDXFA object available in PDFBox 1.8.
Many thanks to Maruan Sahyoun for his great work on PDFBox + XFA.
This code only works when you remove all security on the PDDocument.
It also assumes the COS object in PDXFA is a COSStream.
The simplistic example below reads the xml stream and writes it back into the PDF.
PDDocument doc = PDDocument.load("filename");
doc.setAllSecurityToBeRemoved(true);
PDDocumentCatalog docCatalog = doc.getDocumentCatalog();
PDAcroForm form = docCatalog.getAcroForm();
PDXFA xfa = form.getXFA();
COSBase cos = xfa.getCOSObject();
COSStream coss = (COSStream) cos;
InputStream cosin = coss.getUnfilteredStream();
Document document = documentBuilder.parse(cosin);
COSStream cosout = new COSStream(new RandomAccessBuffer());
OutputStream out = cosout.createUnfilteredStream();
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer();
DOMSource source = new DOMSource(xmlDoc);
StreamResult result = new StreamResult(out);
transformer.transform(source, result);
PDXFA xfaout = new PDXFA(cosout);
form.setXFA(xfaout);

I'm not familiar with pdfbox but you can do this with iText (http://itextpdf.com/) once you get access to the XFA (XML) DOM.

Try this and it will merge all pdf with no xfa and with XFA(this is when using PDBox only)
PDAcroForm form = document.getDocumentCatalog().getAcroForm();
if(form != null) {
document.setAllSecurityToBeRemoved(true);
form.flatten();
if(form.hasXFA()) {
form.setXFA(null);
}
}
merge.appendDocument(anyPDFDoc, document);

AcroForm is for PDF with static fields. If PDF have xfa forms you can use itext (Java) or itextsharp ( .net) to Populate your Data . Only problem with XFA Forms are they cannot be Flatten with itext only way to Flatten I found is using bullzip or similar pdf creator to open that xfa pdf created with itext and pass it through bullzip which will spit out flatten pdf version. Hope this will give u some ideas.
Below code just a rough idea how xfa is filled.
XfaForm xfa = pdfFormFields.Xfa;
dynamic bytes = Encoding.UTF8.GetBytes("<?xml version=\"1.0\" encoding=\"UTF-8\"?> <form1> <staticform>" + "\r\n<barcode>" + barcode + "</barcode></staticform> <flowForm><Extra>" + Extra + "</Extra></flowForm> </form1>");
MemoryStream ms = new MemoryStream(bytes);
pdfStamper.AcroFields.Xfa.FillXfaForm(ms);
you can now use the xfa pdf you created and print through bullzip
const string Printer_Name = "Bullzip PDF Printer";
PdfSettings pdfSettings = new PdfSettings();
pdfSettings.PrinterName = Printer_Name;
pdfSettings.SetValue("Output", flatten_pdf);
pdfSettings.SetValue("ShowPDF", "no");
pdfSettings.SetValue("ShowSettings", "never");
pdfSettings.SetValue("ShowSaveAS", "never");
pdfSettings.SetValue("ShowProgress", "no");
pdfSettings.SetValue("ShowProgressFinished", "no");
pdfSettings.SetValue("ConfirmOverwrite", "no");
pdfSettings.WriteSettings(PdfSettingsFileType.RunOnce);
PdfUtil.PrintFile(xfa_pdffile, Printer_Name);
output file will be flatten pdf..

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to extract parameter from pdf file using java code & pdfbox - java

Related

While converting a PDF to regular text using PDFBox, how can I surround superscript text with braces?

Java Pdf content to String

Text associated to PDF paragraph in document content object wit PDFBox

How to Convert splitted pdf to Excel file using java pdfbox

Combining XFA with PDFBox

Categories

Resources