Text associated to PDF paragraph in document content object with PDFBox

I'm trying to get the text associated with a paragraph by navigating through the content tree of a PDF file. I am using PDFBox and cannot find the link between the paragraph and the text that it contains (see code below):
public class ReadPdf {
    public static void main(String[] args) throws IOException {
        MyBufferedWriter out = new MyBufferedWriter(new FileWriter(new File(
                "C:/Users/wip.txt")));
        RandomAccessFile raf = new RandomAccessFile(new File(
                "C:/Users/mypdf.pdf"), "r");
        PDFParser parser = new PDFParser(raf);
        parser.parse();
        COSDocument cosDoc = parser.getDocument();
        out.write(cosDoc.getXrefTable().toString());
        out.write(cosDoc.getObjects().toString());
        PDDocument document = parser.getPDDocument();
        document.getClass();
        COSParser cosParser = new COSParser(raf);
        PDStructureTreeRoot treeRoot = document.getDocumentCatalog().getStructureTreeRoot();
        for (Object kid : treeRoot.getKids()) {
            for (Object kid2 : ((PDStructureElement) kid).getKids()) {
                PDStructureElement kid2c = (PDStructureElement) kid2;
                // Compare strings with equals(), not ==
                if ("P".equals(kid2c.getStandardStructureType())) {
                    for (Object kid3 : kid2c.getKids()) {
                        if (kid3 instanceof PDStructureElement) {
                            PDStructureElement kid3c = (PDStructureElement) kid3;
                        } else {
                            for (Entry<COSName, COSBase> entry : kid2c.getCOSObject().entrySet()) {
                                // Print all the keys in the paragraph COSDictionary
                                System.out.println(entry.getKey().toString());
                                System.out.println(entry.getValue().toString());
                            }
                        }
                    }
                }
            }
        }
    }
}
When I print the contents I get the following Keys:
/P : Reference to Parent
/A : Format of the paragraph
/K : Position of the paragraph in the section
/C : Name of the paragraph (!= text)
/Pg : Reference to the page
Example output:
COSName{K}
COSInt{2}
COSName{Pg}
COSObject{12, 0}
COSName{C}
COSName{Normal}
COSName{A}
COSObject{434, 0}
COSName{S}
COSName{Normal}
COSName{P}
COSObject{421, 0}
None of these points to the actual text inside the paragraph.
I know that the relation can be obtained, because it is resolved when I open the document with Acrobat (see pic below):

I found a way to do this by parsing the content stream of a page.
According to section 10.6.3 of the PDF Specification, there is a link between the number of each piece of text in the content stream, which is marked with /P and /MCID, and an attribute of the tag (PDStructureElement in PDFBox), which can be found in its COSObject.
1) To get the text and the MCID:
PDPage pdPage; // the page whose content streams are parsed
String tokenString = "";
Iterator<PDStream> inputStream = pdPage.getContentStreams();
while (inputStream.hasNext()) {
    try {
        PDFStreamParser parser2 = new PDFStreamParser(inputStream.next());
        parser2.parse();
        List<Object> tokens = parser2.getTokens();
        for (int j = 0; j < tokens.size(); j++) {
            tokenString = tokenString + tokens.get(j).toString();
        }
        // Here comes the parsing of the string. Chapter 5 of the specification explains
        // the operators Tj (the actual text), Tm, BDC, BT, ET and EMC, and the MCID.
    } catch (IOException e) {
        e.printStackTrace();
    }
}
2) Then, to get the tags and the attribute of each tag that matches the MCID:
PDStructureElement pDStructureElement; // a tag taken from the structure tree
int mcid = pDStructureElement.getCOSObject().getInt(COSName.K);
That should do it. In documents without tags (where document.getDocumentCatalog().getStructureTreeRoot() has no children), this match cannot be performed, but the text can still be read using step 1.
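The string parsing left open in step 1 could look roughly like the following. This is only a minimal sketch under strong assumptions: it works on the raw text of a content stream (not on the toString() output of PDFBox tokens), it only handles simple literal strings shown with Tj, and it ignores TJ arrays, escaped parentheses, nested marked content and font encodings. The class and method names (McidTextSketch, textByMcid) are hypothetical and not part of PDFBox.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class McidTextSketch {
    // Matches "/Tag <</MCID n>> BDC ... EMC"; group 1 is the MCID, group 2 the body.
    private static final Pattern MARKED_CONTENT = Pattern.compile(
            "/\\w+\\s*<<\\s*/MCID\\s+(\\d+)\\s*>>\\s*BDC(.*?)EMC", Pattern.DOTALL);

    // Matches a simple literal string shown with Tj, e.g. "(Hello) Tj".
    private static final Pattern SHOW_TEXT =
            Pattern.compile("\\(([^()\\\\]*)\\)\\s*Tj");

    // Returns the concatenated Tj strings of each marked-content sequence, keyed by MCID.
    static Map<Integer, String> textByMcid(String contentStream) {
        Map<Integer, String> result = new LinkedHashMap<>();
        Matcher mc = MARKED_CONTENT.matcher(contentStream);
        while (mc.find()) {
            StringBuilder text = new StringBuilder();
            Matcher tj = SHOW_TEXT.matcher(mc.group(2));
            while (tj.find()) {
                text.append(tj.group(1));
            }
            // If the same MCID occurs more than once, append to the earlier text
            result.merge(Integer.parseInt(mc.group(1)), text.toString(), String::concat);
        }
        return result;
    }
}
```

The MCID keys of the resulting map can then be compared with the integer read from the /K entry of a tag in step 2.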

Related

How to extract parameter from pdf file using java code & pdfbox

I am writing a Java program to extract parameters from PDF files. I would like to read the PDF and get parameters such as:
obj
endobj
stream
endstream
xref
trailer
startxref
/Page
/Encrypt
/ObjStm
/JS
/JavaScript
/AA
/OpenAction
/JBIG2Decode
/RichMedia
/Launch
/XFA
parameter:
so I wish to get the output shown in the picture below:

Going by the comment above ("So you want to extract the text from the PDF, and then count the occurrences?"), you can do as follows:
Read the PDF file in:
String[] words = new String[0]; // stays empty if the document is encrypted
try (PDDocument document = PDDocument.load(new File("C:\\path\\to\\file.pdf"))) {
    if (!document.isEncrypted()) {
        // Extract the text of the whole document, then split it on whitespace
        PDFTextStripper tStripper = new PDFTextStripper();
        String pdfFileInText = tStripper.getText(document);
        words = pdfFileInText.split("\\s+");
    }
}
And then print the occurrences of words:
Arrays.stream(words)
.collect(Collectors.groupingBy(s -> s))
.forEach((k, v) -> System.out.println(k + " " + v.size()));
You may need to tweak this slightly to your own needs.
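One possible tweak is to count only the keywords of interest and to report 0 for those that never occur. The sketch below uses plain Java streams; the class and method names (KeywordCountSketch, count) are hypothetical. Note the assumption that the words array actually contains the keywords: tokens such as obj or endobj are part of the PDF file syntax rather than the page text, so they would have to come from the raw file contents and not from PDFTextStripper.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

final class KeywordCountSketch {
    // Counts how often each keyword occurs among the whitespace-separated words.
    static Map<String, Long> count(String[] words, List<String> keywords) {
        Map<String, Long> counts = Arrays.stream(words)
                .filter(keywords::contains)
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
        // Keywords that never occur are reported with a count of 0
        for (String keyword : keywords) {
            counts.putIfAbsent(keyword, 0L);
        }
        return counts;
    }
}
```

Using Collectors.counting() avoids building a list per word only to call size() on it.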

How to extract elements from a String with jsoup?

I want to write a small piece of code that will extract the "Kategorie" out of a href with jsoup.
Herrscher des Mittelalters
In this case I am searching for Herrscher des Mittelalters.
My code reads the first line of a .txt file with the BufferedReader.
BufferedReader r = new BufferedReader(new InputStreamReader(new FileInputStream(new File(FilePath)), Charset.forName("UTF-8")));
Document doc = Jsoup.parse(r.readLine());
Element elem = doc;
I know there are commands to get the href-link but I don't know commands to search for elements in the href-link.
Any suggestions?
Additional information: My .txt file contains full Wikipedia HTML pages.

This should get you all titles from links. You can split the titles further as you need:
Document d = Jsoup.parse("Herrscher des Mittelalters");
Elements links = d.select("a");
Set<String> categories = new HashSet<>();
for (Element link : links) {
    // Collect the non-empty title attribute of each link
    String title = link.attr("title");
    if (title.length() > 0) {
        categories.add(title);
    }
}
System.out.println(categories);

You can use the getElementsContainingText() method (org.jsoup.nodes.Document) to search for elements containing any given text.
Elements elements = doc.getElementsContainingText("Herrscher des Mittelalters");
for(int i=0; i<elements.size();i++) {
Element element = elements.get(i);
System.out.println(element.text());
}

Spliting paragraphs that endswith "." and new line after dot in Java

I am trying to read text from a PDF file, split it into paragraphs, put each paragraph into an ArrayList and print the elements of the ArrayList, but I get no output.
String path = "E:\\test.pdf";
PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;
COSDocument cosDoc = null;
File file = new File(path);
PDFParser parser = new PDFParser(new FileInputStream(file));
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(1);
String page = pdfStripper.getText(pdDoc);
String[] paragraph = page.split("\n");
ArrayList<String> ramy = new ArrayList<>();
String p = "";
for (String x : paragraph) {
if ((x.endsWith("\\.")) || (x.endsWith("\\." + "\\s+"))) {
p += x;
ramy.add(p);
p = "";
} else {
p += x;
}
}
for (String x : ramy) {
System.out.print(x + "\n\n");
}
Note: I am using NetBeans 8.0.2, Windows 8.1 and the PDFBox library to read from the PDF file.

The most crippling bug you have is that you are calling endsWith() with "\\.", which is two characters; a literal backslash and a literal dot (not an escaped dot) and again with "\\.\\s+" (again all literal characters). It's clear you (incorrectly) believed that the method accepts regex, which it doesn't.
Assuming your logic is sound, change your test to use a regex-based test:
if (x.matches(".*\\.\\s*"))
This test combines the intention of your code into one test.
Note that you don't need to end the regex with $, because matches() must match the whole string to return true, so ^ and $ are implied at the start/end of the pattern.
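To make the difference concrete, here is a small self-contained sketch of the same joining loop with the regex-based test. It keeps the question's logic as it is (lines are concatenated without a separator, and trailing lines that never end with a dot are dropped); the class and method names (ParagraphSplitSketch, joinLines) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ParagraphSplitSketch {
    // Joins consecutive lines until one ends with a dot (optionally followed by whitespace).
    static List<String> joinLines(String[] lines) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            current.append(line);
            // matches() takes a regex and must match the whole line
            if (line.matches(".*\\.\\s*")) {
                paragraphs.add(current.toString());
                current.setLength(0);
            }
        }
        return paragraphs;
    }
}
```

By contrast, "Second.".endsWith("\\.") is false, because endsWith() looks for the two literal characters backslash and dot.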

docx4j adding style to paragraph destroys document

I'm trying to add a paragraph containing a headline (with style) and some plain unformatted text. The following code destroys the document.
Edit: After executing the following code and trying to open the document in Word, I get the error message "Unspecified Error", Location: Part /word/document.xml, Line 1, Column 0.
ObjectFactory factory = new ObjectFactory();
P complete = factory.createP();
org.docx4j.wml.P headline=factory.createP();
R hrun = factory.createR();
Text htxt = new Text();
hrun.getContent().add(htxt);
htxt.setValue(View_Beta.this.falseAlarmChoice.getSelectedItem().toString());
headline.getContent().add(hrun);
org.docx4j.wml.PPr pPr = factory.createPPr();
headline.setPPr(pPr);
org.docx4j.wml.PPrBase.PStyle pStyle = factory.createPPrBasePStyle();
pPr.setPStyle(pStyle);
pStyle.setVal("Title");
complete.getContent().add(headline);
P ptext = factory.createP();
R rtext = factory.createR();
Text ttext = new Text();
rtext.getContent().add(ttext);
ptext.getContent().add(rtext);
ttext.setValue(falseAlarmChoice.getSelectedItem()
+ falseAlarmDsc.getText());
complete.getContent().add(ptext);
//add to document context
View_Beta.this.c.insertAtPos(complete,
paragraphlst.getSelectedIndex());

In your code,
complete.getContent().add(headline)
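adds a paragraph inside a paragraph, which is invalid per the Open XML spec: a w:p element cannot contain another w:p. Insert headline and ptext into the document as two separate paragraphs instead.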

How do you find/replace a placeholder in a .docx file with Apache POI?

I have a file, "template.docx", in which I would like to have placeholders (e.g. [serial number]) that can be replaced with a string or maybe a table. I am using Apache POI and no, I cannot use docx4j.
Is there a way to have the program iterate over all occurrences of "[serial number]" and replace them with a string? Many of these tags will be inside a large table, so is there some Apache POI equivalent of pressing Ctrl+F in Word and using Replace All?
Any suggestions would be appreciated, thanks.

XWPFDocument (docx) has different kind of sub-elements like XWPFParagraphs, XWPFTables, XWPFNumbering etc.
Once you create XWPFDocument object via:
document = new XWPFDocument(inputStream);
You can iterate through all of Paragraphs:
document.getParagraphsIterator();
When you iterate through the paragraphs, each paragraph gives you multiple XWPFRuns, which are blocks of text with the same styling. Sometimes text with the same styling is split into multiple XWPFRuns; in that case you should look into this question to avoid the splitting of your runs, since doing so helps identify your placeholders without merging multiple runs within the same paragraph. From this point on, assume that your placeholder is not split across multiple runs. If that is the case, you can iterate over the XWPFRuns of each paragraph and look for text matching your placeholder; something like this will help:
XWPFParagraph para = (XWPFParagraph) xwpfParagraphElement;
for (XWPFRun run : para.getRuns()) {
if (run.getText(0) != null) {
String text = run.getText(0);
Matcher expressionMatcher = expression.matcher(text);
if (expressionMatcher.find() && expressionMatcher.groupCount() > 0) {
System.out.println("Expression Found...");
}
}
}
Here expression is a Pattern compiled from a regular expression for the particular placeholder, and expressionMatcher is the Matcher it produces. Try a regex that also matches something optional before and after your placeholder, e.g. \([]*)(PlaceHolderGroup)([]*)^; trust me, it works best.
Once you find the right XWPFRun, extract the text of interest from it and create the replacement text, which should be easy enough. Then replace the previous text of this particular run with the new text:
run.setText(text, 0);
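Building the replacement text itself can be done with plain java.util.regex. The following is a minimal sketch that assumes the placeholder sits entirely inside one run's text; the class and method names (PlaceholderSketch, replacePlaceholder) are hypothetical and not part of Apache POI. Pattern.quote is used because a placeholder such as [serial number] contains regex metacharacters, and Matcher.quoteReplacement protects values containing $ or backslashes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlaceholderSketch {
    // Replaces every literal occurrence of the placeholder in the text of one run.
    static String replacePlaceholder(String runText, String placeholder, String value) {
        // Quote the placeholder so that '[' and ']' are matched literally
        Pattern expression = Pattern.compile(Pattern.quote(placeholder));
        // Quote the value so that '$' and '\' are not treated as group references
        return expression.matcher(runText).replaceAll(Matcher.quoteReplacement(value));
    }
}
```

The returned string would then be written back to the run with run.setText(newText, 0).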
If you were to replace this whole XWPFRun with a completely new XWPFRun, or insert a new paragraph or table after the paragraph owning this run, you would probably run into a few problems: (A) a ConcurrentModificationException, which means you cannot modify the list of XWPFRuns you are iterating over, and (B) finding the position at which to insert the new element. To resolve these issues, keep a List<XWPFParagraph> holding the paragraphs after which a new element is to be inserted. Once you have this list, you can iterate over it and, for each paragraph, simply get a cursor and insert the new element at that cursor:
for (XWPFParagraph para: paras) {
XmlCursor cursor = (XmlCursor) para.getCTP().newCursor();
XWPFTable newTable = para.getBody().insertNewTbl(cursor);
//Generate your XWPF table based on what's inside para with your own logic
}
To create an XWPFTable, read this.
Hope this helps someone.

// Text nodes begin with w:t in the word document
final String XPATH_TO_SELECT_TEXT_NODES = "//w:t";
try {
// Open the input file
String fileName="test.docx";
String[] splited=fileName.split("\\."); // split() takes a regex, so the dot must be escaped
File dir=new File("D:\\temp\\test.docx");
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new FileInputStream(dir));
// Build a list of "text" elements
List<?> texts = wordMLPackage.getMainDocumentPart().getJAXBNodesViaXPath(XPATH_TO_SELECT_TEXT_NODES, true);
HashMap<String, String> mappings = new HashMap<String, String>();
mappings.put("1", "one");
mappings.put("2", "two");
// Loop through all "text" elements
Text text = null;
for (Object obj : texts) {
text = (Text) ((JAXBElement<?>) obj).getValue();
String textToReplace = text.getValue();
if (mappings.keySet().contains(textToReplace)) {
text.setValue(mappings.get(textToReplace));
}
}
wordMLPackage.save(new java.io.File("D:/temp/forPrint.docx"));//your path
} catch (Exception e) {
e.printStackTrace(); // do not swallow failures silently
}
