How to extract text from a pdf file using Aspose PDF in Java?
I'm looking for this functionality from Aspose API (no code samples?)
edit-
Req:
Let's say a pdf has this text at random locations along with some other data.
First Name: John
Last Name: Doe
City: New York
Phone: (999)-999-9999
Note: I can easily get these values if they are fields of the pdf file. These are in some random locations, not separate fields.
Here the values John, Doe, New York and (999)-999-9999 change for each document.
I should be able to search for First Name, Last Name, City and Phone and get back the value that follows each label as well.
Any suggestions?
@intruder, you can use regular expressions to retrieve the required text strings. Aspose.PDF for Java accepts regular expressions; please try the following code:
Java
import com.aspose.pdf.*;

Document pdfDocument = new Document("source.pdf");
// Regular expression matching year ranges such as 1999-2000
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("\\d{4}-\\d{4}");
// Passing true tells the absorber to treat the search phrase as a regular expression
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
textFragmentAbsorber.setTextSearchOptions(textSearchOptions);
// Run the search over every page of the document
pdfDocument.getPages().accept(textFragmentAbsorber);
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
for (TextFragment textFragment : (Iterable<TextFragment>) textFragmentCollection) {
    System.out.println("Text :- " + textFragment.getText());
}
I work with Aspose as a Developer Evangelist.
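For the labelled values in the question, a capture group can return the text that follows each label. The following is only a minimal sketch in plain Java: it assumes the page text has already been extracted into a String by some PDF library, and that each label and its value end up on the same line of that text, which may not hold for every layout. The class name LabelValueExtractor and the method extract are hypothetical names used for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LabelValueExtractor {

    // Returns label -> value for every label found in the text.
    // Assumption: the text has the form "Label: value", with the value ending at the line break.
    static Map<String, String> extract(String text, String... labels) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String label : labels) {
            // Pattern.quote keeps characters in the label from being read as regex syntax.
            // "." does not match line breaks by default, so the group stops at the end of the line.
            Pattern pattern = Pattern.compile(Pattern.quote(label) + "[ \\t]*:[ \\t]*(.+)");
            Matcher matcher = pattern.matcher(text);
            if (matcher.find()) {
                values.put(label, matcher.group(1).trim());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String text = "Some header\nFirst Name: John\nLast Name: Doe\n"
                + "City: New York\nPhone: (999)-999-9999\nOther data";
        System.out.println(extract(text, "First Name", "Last Name", "City", "Phone"));
    }
}
```

Only the first occurrence of each label is returned; a document that repeats a label would need a loop over matcher.find() instead.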
Related
I am trying to automate a docx report generation process using Java and docx4j. I have a template document containing only a single page. I would like to copy that page, modify it, and save it in another docx document. The output report consists of multiple similar pages, each a modified copy of the template. How do I go about it?
PS : java and docx4j are my first choice but I am open to solutions apart from java and docx4j.
Leaving it up to you to modify the template, here is how you could add one document to the end of another document. Suppose base.docx contains "This is the base document." and template.docx contains "The time is:", then after executing this code:
WordprocessingMLPackage doc = Docx4J.load(new File("base.docx"));
WordprocessingMLPackage template = Docx4J.load(new File("template.docx"));
MainDocumentPart main = doc.getMainDocumentPart();
Br pageBreak = Context.getWmlObjectFactory().createBr();
pageBreak.setType(STBrType.PAGE);
main.addObject(pageBreak);
for (Object obj : template.getMainDocumentPart().getContent()) {
    main.addObject(obj);
}
main.addParagraphOfText(LocalDateTime.now().toString());
doc.save(new File("result.docx"));
Then result.docx will contain something like:
This is the base document.
^L
The time is:
2018-04-16T17:37:13.541984200
(Where ^L represents a page break.)
To be more precise, my original template contains only a header and some styling components.
This kind of information can be stored in a Word stylesheet (.dotx file).
PS : java and docx4j are my first choice but I am open to solutions apart from java and docx4j.
A good tool would be pxDoc: you can specify a dedicated stylesheet in your document generator, or use "variable styles" and specify the stylesheet only when you launch the document generation.
I have a LINQ Reporting Engine Word file with a field <<[ABC]>> that gets its value from a MySQL database.
The field displays comments. The column type in SQL is LONGTEXT, so it can store a large number of words. The problem is that when a report is generated, the text in <<[ABC]>> is cut off: only about 380 characters are printed. Is there a specific limit on what a LINQ field can display? And what can I do to print all of the text without any limit?
You can populate <<[ABC]>> with long text. There is no limit on the number of characters. You can test this with the following code example: create a text file, e.g. "text.txt", with some text and execute the code.
DocumentBuilder builder = new DocumentBuilder();
builder.Write("<<[ABC]>>");
ReportingEngine engine = new ReportingEngine();
engine.BuildReport(builder.Document, File.ReadAllText(MyDir + "text.txt"), "ABC");
builder.Document.Save(MyDir + "18.4.docx");
I am working as Developer Evangelist at Aspose.
I am developing a module that is supposed to print documents from the server. The requirements are:
The module should be able to print a PDF from a URL, with and without saving it.
The module should be able to accept page numbers as parameters and print/save only those pages.
The module should be able to accept the printer name as a parameter and use only that printer.
Is there any library available for this? How should I go about implementing this?
The answer was Apache PDFBox. I was able to load the PDF into a PDDocument object like this:
PDDocument pdf = PDDocument.load(new URL(download_pdf_from).openStream());
Splitting the document was as easy as :
Splitter splitter = new Splitter();
List<PDDocument> splittedDocuments = splitter.split(pdf);
Now, to get a reference to any particular page:
splittedDocuments.get(pageNo);
Saving the entire document or even a given page number :
pdf.save(path); //saving the entire document to device
splittedDocuments.get(pageNo).save(path); //saving a particular page number to device
For the printing part, this answer helped me.
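The "page numbers as parameters" requirement still needs the parameter turned into list indices before calling get(pageNo) on the split list. A minimal sketch of that step in plain Java, assuming the parameter arrives as a string such as "1-3, 5" with 1-based page numbers (the format, the class name PageSelection and the method parse are assumptions for illustration, not part of PDFBox):

```java
import java.util.ArrayList;
import java.util.List;

public class PageSelection {

    // Converts a spec like "1-3, 5" (1-based, inclusive) into zero-based list indices.
    // pageCount is the number of pages in the document, e.g. the size of the split list.
    static List<Integer> parse(String spec, int pageCount) {
        List<Integer> indices = new ArrayList<>();
        for (String part : spec.split(",")) {
            String token = part.trim();
            if (token.isEmpty()) {
                continue; // tolerate stray commas
            }
            int dash = token.indexOf('-');
            int first = Integer.parseInt(dash < 0 ? token : token.substring(0, dash).trim());
            int last = dash < 0 ? first : Integer.parseInt(token.substring(dash + 1).trim());
            if (first < 1 || last < first || last > pageCount) {
                throw new IllegalArgumentException("Invalid page range: " + token);
            }
            for (int page = first; page <= last; page++) {
                indices.add(page - 1); // List.get expects a zero-based index
            }
        }
        return indices;
    }

    public static void main(String[] args) {
        System.out.println(parse("1-3, 5", 6));
    }
}
```

Each returned index can then be passed to the get call on the split list shown above, provided the document was split into single pages.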
I have some huge technical pdf ebooks and I would like to split them in a way that helps me find and read exactly the parts I want from each book. I am talking about indexed pdf files, with contents (parts and chapters). I have come up with the following splitting scheme, based on the pdf's contents:
1. Read book's contents.
2. Create a root folder for the entire book
3. Create one subfolder for each part of the book
4. Split the book into one pdf file per chapter and place the pdfs (chapters) in the corresponding subfolder (part).
How can this be done using a Java or Python pdf library?
You can use PyPDF2 to read and split your PDF files.
Here is how you can export PDF pages:
import PyPDF2


def export_pdf_pages(input_pdf_path, page_first, page_last, output_pdf_path):
    # page_first and page_last are 1-based and inclusive
    with open(input_pdf_path, "rb") as input_stream:
        input_pdf = PyPDF2.PdfFileReader(input_stream)
        output = PyPDF2.PdfFileWriter()
        for index in range(page_first - 1, page_last):
            try:
                page = input_pdf.getPage(index)
            except IndexError:
                fmt = 'Missing page {page_num} in "{input_pdf_path}"'
                msg = fmt.format(page_num=index + 1, input_pdf_path=input_pdf_path)
                raise IndexError(msg)
            output.addPage(page)
        # Write while the input file is still open: pages are read lazily
        with open(output_pdf_path, "wb") as output_stream:
            output.write(output_stream)
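Exporting a page range covers step 4; steps 1 to 3 come down to turning the contents into one page range and one output path per chapter. A rough sketch of that planning step in Java, assuming the contents have already been read into a list of (part, chapter title, first page) entries sorted by page, and that every chapter starts on a new page. ChapterPlanner, Chapter and Job are hypothetical names, and reading the outline itself is left to whichever PDF library is used.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ChapterPlanner {

    // One contents entry: the part it belongs to, its title, and its first page (1-based).
    static final class Chapter {
        final String part;
        final String title;
        final int firstPage;

        Chapter(String part, String title, int firstPage) {
            this.part = part;
            this.title = title;
            this.firstPage = firstPage;
        }
    }

    // One export job: inclusive 1-based page range plus the target file.
    static final class Job {
        final int firstPage;
        final int lastPage;
        final Path output;

        Job(int firstPage, int lastPage, Path output) {
            this.firstPage = firstPage;
            this.lastPage = lastPage;
            this.output = output;
        }
    }

    // A chapter ends on the page before the next chapter starts; the last one ends with the book.
    static List<Job> plan(String bookName, List<Chapter> chapters, int totalPages) {
        List<Job> jobs = new ArrayList<>();
        for (int i = 0; i < chapters.size(); i++) {
            Chapter chapter = chapters.get(i);
            int lastPage = (i + 1 < chapters.size())
                    ? chapters.get(i + 1).firstPage - 1
                    : totalPages;
            // Layout: <book>/<part>/<chapter>.pdf
            Path output = Paths.get(bookName, chapter.part, chapter.title + ".pdf");
            jobs.add(new Job(chapter.firstPage, lastPage, output));
        }
        return jobs;
    }

    public static void main(String[] args) {
        List<Chapter> contents = new ArrayList<>();
        contents.add(new Chapter("Part 1", "Introduction", 1));
        contents.add(new Chapter("Part 1", "Basics", 15));
        contents.add(new Chapter("Part 2", "Advanced", 40));
        for (Job job : plan("MyBook", contents, 100)) {
            System.out.println(job.firstPage + "-" + job.lastPage + " -> " + job.output);
        }
    }
}
```

Each job's page range and output path map directly onto the arguments of an export function like the one above; the parent folders would need to be created first, for example with Files.createDirectories. Titles containing characters that are not valid in file names would also need sanitising, which this sketch does not do.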
I have been using OLE automation from Java to access Word's methods.
I have managed to do the following using OLE automation:
Open a Word document template file.
Mail merge the Word document template with a CSV data source file.
Save the mail-merged file to a new Word document file.
What I need to do now is open the mail-merged file and then, using OLE, programmatically split it into multiple files. That is, if the original mail-merged file has 6000 pages and my "max pages per file" property is set to 3000, I need to create two new Word document files and place the first 3000 pages in one and the last 3000 pages in the other.
In my first attempts I took the number of rows in the CSV file and multiplied it by the number of pages in the template to get the total number of pages after the merge. Then I used the merge itself to create the multiple files. The problem, however, is that I cannot calculate exactly how many pages the merged document will have, because in some cases not all (say) 9 pages of the template will be used, depending on the data and the merge fields. So at mail merge time one row might create only 3 pages from the 9-page template while another creates all 9.
So the only solution is to merge all rows into one document and then split it into multiple documents thereafter, to ensure that each file holds exactly the number of pages set by the property (e.g. 3000) until no pages are left from the original merged file.
I have already tried a few things, using the MSDN site to look up methods and their properties, but have been unable to do this.
In my latest attempts I have been trying to use GoTo to get to a specific page number and then remove the page. I was going to do this one page at a time until I reached the page where I want the file to start, and then save the result as a new file, but I have been unable to do that either.
Can anyone suggest something that could help me out?
Thanks and Regards
Sean
An example of opening a Word file using OLE automation from Java is included below:
Code sample
OleAutomation documentsAutomation = this.getChildAutomation(this.wordAutomation, "Documents");
int [ ] id = documentsAutomation.getIDsOfNames(new String[]{"Open"});
Variant[] arguments = new Variant[1];
arguments[0] = new Variant(fileName); // where filename is the absolute path to the docx file
Variant invokeResult = documentsAutomation.invoke(id[0], arguments);
private OleAutomation getChildAutomation(OleAutomation automation, String childName) {
    int[] id = automation.getIDsOfNames(new String[]{childName});
    Variant pVarResult = automation.getProperty(id[0]);
    return pVarResult.getAutomation();
}
Code sample
Sounds like you've pegged it already. Another approach, which would avoid building and then deleting, is to look at the parts of your template that make the biggest difference to the number of pages (that is, where the data can be multi-line). If you take these fields and look at properties such as font, line spacing and line width, you can calculate the room your data will take in the template and limit your data at that point. Java's FontMetrics can help you with that.
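If the merge-then-split route from the question is kept, the file boundaries themselves are plain arithmetic once the merged page count is known. A small sketch, assuming that count has already been read from Word and using the "max pages per file" value from the question; PageChunks and chunks are hypothetical names, and the OLE calls that actually copy or delete pages are not shown.

```java
import java.util.ArrayList;
import java.util.List;

public class PageChunks {

    // Returns inclusive 1-based {firstPage, lastPage} pairs, one per output file.
    static List<int[]> chunks(int totalPages, int maxPagesPerFile) {
        if (maxPagesPerFile < 1) {
            throw new IllegalArgumentException("maxPagesPerFile must be at least 1");
        }
        List<int[]> ranges = new ArrayList<>();
        for (int first = 1; first <= totalPages; first += maxPagesPerFile) {
            // The last file may hold fewer pages than the maximum
            int last = Math.min(first + maxPagesPerFile - 1, totalPages);
            ranges.add(new int[] {first, last});
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (int[] range : chunks(7000, 3000)) {
            System.out.println(range[0] + "-" + range[1]);
        }
    }
}
```

With 6000 pages and a maximum of 3000 this yields the two ranges described in the question, 1-3000 and 3001-6000.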