Extracting text using PDF Clown's TextExtractor - Java

I am getting an error while using the TextExtractor class of the PDF Clown library. The code I used is:
TextExtractor textExtractor = new TextExtractor(true, true);
for(final Page page : file.getDocument().getPages())
{
    System.out.println("\nScanning page " + (page.getIndex()+1) + "...\n");
    // Extract the page text!
    Map textStrings = textExtractor.extract(page);
}
Part of the error I got is:
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.pdfclown.document.contents.fonts.encoding.put
at ......
at ......
<about 30 such lines>
Caused by: java.lang.NullPointerException
at java.io.Reader.<init>(Reader.java:78)
at java.io.InputStreamReader
<about 30 lines more>
I also found out that this happens when my PDF contains bullets, for example:
item 1
item 2
item 3
Please help me extract the text from such PDFs.

(The following comment turned out to be the solution:)
Using your highlighter.java class (shared on your Google Drive in a comment) together with the current PDF Clown trunk version as a jar, the PDF was processed without incident, in particular without a NullPointerException (although some of the highlights were not at the right position).
After looking at your shared Google Drive contents, though, I assumed you did not use a PDF Clown jar but instead merely compiled the classes from the distribution source folder and used them.
The PDF Clown jar files contain additional resources, though, which your setup consequently did not include. Thus:
Your highlighter.java has to be used with pdfclown.jar on the classpath.
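A NullPointerException raised inside java.io.Reader, as in the trace in the question, is what you would typically see when code wraps a classpath resource that is not there: getResourceAsStream returns null for a missing resource, and constructing an InputStreamReader over null fails. The following is only a minimal sketch of that mechanism, not PDF Clown's actual loading code; the class name ResourceCheck and the resource path are hypothetical.

```java
import java.io.InputStream;
import java.io.InputStreamReader;

class ResourceCheck {
    /** Returns true if the named resource can be found on the classpath. */
    static boolean isOnClasspath(String resourcePath) {
        return ResourceCheck.class.getResource(resourcePath) != null;
    }

    /** Returns true if wrapping a missing (null) stream in a reader fails with an NPE. */
    static boolean nullStreamThrowsNpe() {
        // getResourceAsStream returns null when the resource is absent.
        InputStream missing = null;
        try {
            new InputStreamReader(missing);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Hypothetical resource path, used only for illustration.
        String path = "/fonts/example-glyph-list.txt";
        System.out.println(path + " on classpath: " + isOnClasspath(path));
        System.out.println("Reader over null stream throws NPE: " + nullStreamThrowsNpe());
    }
}
```

Running a check like this against the real setup, once with the compiled classes only and once with pdfclown.jar on the classpath, should show whether the bundled resources are visible.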

PDDocument.load(file) isn't a method (PDFBox)

I wanted to make a simple program to get text content from a pdf file through Java. Here is the code:
PDFTextStripper ts = new PDFTextStripper();
File file = new File("C:\\Meeting IDs.pdf");
PDDocument doc1 = PDDocument.load(file);
String allText = ts.getText(doc1);
String gradeText = allText.substring(allText.indexOf("GRADE 10B"), allText.indexOf("GRADE 10C"));
System.out.println("Meeting ID for English: "
+ gradeText.substring(gradeText.indexOf("English") + 7, gradeText.indexOf("English") + 20));
This is just part of the code, but this is the part with the problem.
The error is: The method load(File) is undefined for the type PDDocument
I learned to use PDFBox from JavaTPoint. I followed the instructions for installing the PDFBox libraries and adding them to the Build Path.
My PDFBox version is 3.0.0
I have also searched the source files and their methods, and I am unable to find the load method there.
Thank you in advance.
As per the 3.0 migration guide, the PDDocument.load method has been replaced with the Loader methods:
For loading a PDF PDDocument.load has been replaced with the Loader
methods. The same is true for loading a FDF document.
When saving a PDF this will now be done in compressed mode per
default. To override that use PDDocument.save with
CompressParameters.NO_COMPRESSION.
PDFBox now loads a PDF Document incrementally reducing the initial
memory footprint. This will also reduce the memory needed to consume a
PDF if only certain parts of the PDF are accessed. Note that, due to
the nature of PDF, uses such as iterating over all pages, accessing
annotations, signing a PDF etc. might still load all parts of the PDF
overtime leading to a similar memory consumption as with PDFBox 2.0.
The input file must not be used as output for saving operations. It
will corrupt the file and throw an exception as parts of the file are
read the first time when saving it.
So you can either switch to an earlier 2.x version of PDFBox or use the new Loader class (org.apache.pdfbox.Loader). I believe this should work:
File file = new File("C:\\Meeting IDs.pdf");
PDDocument doc1 = Loader.loadPDF(file);
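Independently of how the document is loaded, the substring calls in the question assume that both markers occur in the extracted text. If one is missing, indexOf returns -1 and substring throws a StringIndexOutOfBoundsException. A guarded version could look like the sketch below; the helper name between and the sample text are made up for illustration, and unlike the question's code it excludes the start marker from the result.

```java
class MarkerText {
    /**
     * Returns the text between the first occurrence of start and the next
     * occurrence of end after it, or null if either marker is missing.
     */
    static String between(String text, String start, String end) {
        int from = text.indexOf(start);
        if (from < 0) {
            return null;
        }
        int to = text.indexOf(end, from + start.length());
        if (to < 0) {
            return null;
        }
        return text.substring(from + start.length(), to);
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the text returned by PDFTextStripper.
        String allText = "GRADE 10B English 123 GRADE 10C";
        String gradeText = between(allText, "GRADE 10B", "GRADE 10C");
        System.out.println(gradeText == null ? "markers not found" : gradeText.trim());
    }
}
```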

How to get only the name of my PDF file

I'm developing a project for college which consists of reading a CSV file and converting it to a PDF file. That part is fine, I have already done that.
In the end I need to show the name of the PDF file without the full path of where it was created. In other words, I just want to show the name.
I searched a lot to see if there is a simple method that shows only the name, the way Java does for a File:
file.getName();
Whenever you use iText to create a PDF file, your code sets the target which usually is an OutputStream. If you use a FileOutputStream there, you know the file it writes to.
Thus, all you have to do to show the name of the PDF file is to inspect your own code and check which target it sets.
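As a minimal sketch of that idea, assuming the target path is kept in a String before it is handed to FileOutputStream: the class and method names below are hypothetical, and only java.io.File from the standard library is used.

```java
import java.io.File;

class PdfName {
    /** Returns the last path segment, e.g. "report.pdf". */
    static String fileName(String path) {
        return new File(path).getName();
    }

    /** Returns the file name without its last extension, e.g. "report". */
    static String baseName(String path) {
        String name = fileName(path);
        int dot = name.lastIndexOf('.');
        // Keep names without a dot, or with only a leading dot, unchanged.
        return dot > 0 ? name.substring(0, dot) : name;
    }

    public static void main(String[] args) {
        // Hypothetical target path; in real code this is the path given to FileOutputStream.
        String target = "out/reports/report.pdf";
        System.out.println(fileName(target));
        System.out.println(baseName(target));
    }
}
```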
Use getBaseName in Apache Commons IO.
getBaseName
public static String getBaseName(String filename)
Gets the base name, minus the full path and extension, from a full
filename.
This method will handle a file in either Unix or Windows format. The
text after the last forward or backslash and before the last dot is
returned.
a/b/c.txt --> c
a.txt --> a
a/b/c --> c
a/b/c/ --> ""
The output will be the same irrespective of the machine that the code
is running on.
Parameters:
filename - the filename to query, null returns null
Returns:
the name of the file without the path, or an empty string if none exists. Null bytes inside string will be removed
Source: https://commons.apache.org/proper/commons-io/javadocs/api-2.5/org/apache/commons/io/FilenameUtils.html#getBaseName(java.lang.String)
If you also need the extension, use getExtension. It will almost always be pdf, but it is perfectly valid to have a PDF file without the .pdf filename extension. No sane person would do that, but it is better to be prepared for insane users.

Using the created document through FPDF with PHP/JAVA

I created a PDF document with PHP using FPDF. The next thing I want to do is silently print the document without downloading the PDF file to the computer.
I've made the following code:
$pdfprintable = $pdf->Output(''.'.pdf','S');
$printcmd = "java -classpath jPDFPrint.jar;pdfprintcli.jar cli.PDFPrintCLI $pdfprintable";
exec($printcmd);
And it returns the following error message:
Warning: exec(): NULL byte detected. Possible attack in C:\Users\Jordy\Desktop\XAMPP\htdocs\php\stickers\pdf.php on line 392
If I echo the $pdfprintable in PHP it shows a lot of weird characters.
Are you sure the java command is supposed to be used with a hexadecimal string representation of the PDF?
Use the F option to save the PDF to a file instead:
$pdfprintable = $pdf->Output('USEAFULLPATHTOFILE.pdf','F');
With the above, the PDF is written to that path, and you can then try to print that file with the Java application, if that application works.
Also, if you are loading the PDF correctly in FPDF, you should be able to use the D option in ->Output:
$pdfprintable = $pdf->Output('USEAFULLPATHTOFILE.pdf','D');
Use this to verify that the PDF is loaded and handled correctly by FPDF.
Also note that your example code is very limited.
If you need more troubleshooting, please show the Java source and the full PHP source relevant to the printing operation and to loading or creating the PDF in FPDF.

Splitting a Word file into multiple smaller Word files using OLE Automation from Java

I have been using OLE automation from Java to access methods for Word.
I managed to do the following using OLE automation:
Open word document template file.
Mail merge the word document template with a csv datasource file.
Save mail merged file to a new word document file.
What I need to do now is open the mail-merged file and then use OLE to split it programmatically into multiple files. That is, if the original mail-merged file has 6000 pages and my max-pages-per-file property is set to 3000, I need to create two new Word documents and place the first 3000 pages in one and the last 3000 pages in the other.
In my first attempts I took the number of rows in the CSV file and multiplied it by the number of pages in the template to get the total number of pages after merging. Then I used the merge itself to create the multiple files. The problem, however, is that I cannot calculate exactly how many pages the merged document will have, because in some cases not all of, say, 9 pages of the template will be used, depending on the data and the merge fields. So one row might produce only 3 pages (using the 9-page template) and another might produce 9 pages during the mail merge.
So the only solution is to merge all rows into one document and then split it into multiple documents afterwards, so that each file contains exactly the configured number of pages (for example 3000) until no pages are left from the original merged file.
I have tried a few things already, using the MSDN site to look up methods and their properties, but have been unable to do this.
In my latest attempts I have been trying to use GoTo to get to a specific page number and then remove the page. I was going to do this page by page until I reached the page where the new file should start and then save the result as a new file, but I have been unable to do that either.
Can anyone suggest something that could help me out?
Thanks and Regards
Sean
An example of opening a Word file using OLE automation from Java is included below:
Code sample
OleAutomation documentsAutomation = this.getChildAutomation(this.wordAutomation, "Documents");
int[] id = documentsAutomation.getIDsOfNames(new String[]{"Open"});
Variant[] arguments = new Variant[1];
arguments[0] = new Variant(fileName); // where fileName is the absolute path to the docx file
Variant invokeResult = documentsAutomation.invoke(id[0], arguments);

private OleAutomation getChildAutomation(OleAutomation automation, String childName) {
    int[] id = automation.getIDsOfNames(new String[]{childName});
    Variant pVarResult = automation.getProperty(id[0]);
    return pVarResult.getAutomation();
}
Code sample
Sounds like you've pegged it already. Another approach, which would avoid building and then deleting, would be to look at the parts of your template that can make the biggest difference to the number of pages (that is, where the data can be multi-line). If you then take these fields and look at properties such as font, line spacing and line width, you can calculate the room your data will take in the template and limit your data at that point. Java's FontMetrics can help you with that.
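Whichever way the pages end up being copied, the split itself reduces to computing page ranges from the total page count of the merged document and the max-pages-per-file setting described in the question. The sketch below covers only that arithmetic and makes no Word or OLE calls; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class PageChunks {
    /** Returns 1-based inclusive {first, last} page ranges of at most maxPerFile pages each. */
    static List<int[]> ranges(int totalPages, int maxPerFile) {
        if (totalPages < 0 || maxPerFile <= 0) {
            throw new IllegalArgumentException("totalPages must be >= 0 and maxPerFile > 0");
        }
        List<int[]> result = new ArrayList<>();
        for (int first = 1; first <= totalPages; first += maxPerFile) {
            // The last chunk may be shorter than maxPerFile.
            int last = Math.min(first + maxPerFile - 1, totalPages);
            result.add(new int[] {first, last});
        }
        return result;
    }

    public static void main(String[] args) {
        // Figures from the question: 6000 merged pages, 3000 pages per file.
        for (int[] range : ranges(6000, 3000)) {
            System.out.println("pages " + range[0] + "-" + range[1]);
        }
    }
}
```

Each range would then drive one select-and-save step in Word. The total page count has to be read from the merged document itself, since, as the question notes, it cannot be predicted reliably from the row count.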

Running a JavaScript command from MATLAB to fetch a PDF file

I'm currently writing some MATLAB code to interact with my company's internal reports database. So far I can access the HTML abstract page using code which looks like this:
import com.mathworks.mde.desk.*;
wb=com.mathworks.mde.webbrowser.WebBrowser.createBrowser;
wb.setCurrentLocation(ReportURL(8:end));
pause(1);
s={};
while isempty(s)
s=char(wb.getHtmlText);
pause(.1);
end
desk=MLDesktop.getInstance;
desk.removeClient(wb);
I can extract out various bits of information from the HTML text which ends up in the variable s, however the PDF of the report is accessed via what I believe is a JavaScript command (onClick="gotoFulltext('','[Report Number]')").
Any ideas as to how I execute this JavaScript command and get the contents of the PDF file into a MATLAB variable?
(MATLAB sits on top of Java, so I believe a Java solution would work...)
I think you should take a look at the JavaScript that is being called and see what the final request to the webserver looks like.
You can do this quite easily in Firefox using the FireBug plugin.
https://addons.mozilla.org/en-US/firefox/addon/1843
Once you have found the real server request then you can just request this URL or post to this URL instead of trying to run the JavaScript.
Once you have gotten the correct URL (a la the answer from pjp), your next problem is to "get the contents of the PDF file into a MATLAB variable". Whether or not this is possible may depend on what you mean by "contents"...
If you want to get the raw data in the PDF file, I don't think there is a way currently to do this in MATLAB. The URLREAD function was the first thing I thought of to read content from a URL into a string, but it has this note in the documentation:
s = urlread('url') reads the content
at a URL into the string s. If the
server returns binary data, s will
be unreadable.
Indeed, if you try to read a PDF as in the following example, s contains some text intermingled with mostly garbage:
s = urlread('http://samplepdf.com/sample.pdf');
If you want to get the text from the PDF file, you have some options. First, you can use URLWRITE to save the contents of the URL to a file:
urlwrite('http://samplepdf.com/sample.pdf','temp.pdf');
Then you should be able to use one of two submissions on The MathWorks File Exchange to extract the text from the PDF:
Extract text from a PDF document by Dimitri Shvorob
PDF Reader by Tom Gaudette
If you simply want to view the PDF, you can just open it in Adobe Acrobat with the OPEN function:
open('temp.pdf');
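Since the question mentions that a Java solution would be acceptable, here is a hedged sketch of reading the raw bytes behind a URL with plain java.net and java.io classes, which sidesteps the binary-data limitation quoted for urlread. It assumes the real request URL has already been found as described above and that no login or cookies are needed; the class and method names are hypothetical. The looksLikePdf check relies on PDF files starting with the bytes "%PDF".

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;

class PdfFetch {
    /** Reads the stream to its end and returns all bytes. */
    static byte[] readFully(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Downloads the content at the given URL as raw bytes. */
    static byte[] fetch(String url) {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return readFully(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns true if the data starts with the "%PDF" header. */
    static boolean looksLikePdf(byte[] data) {
        return data.length >= 4
                && data[0] == '%' && data[1] == 'P' && data[2] == 'D' && data[3] == 'F';
    }

    public static void main(String[] args) {
        // Pass the real request URL as the first argument.
        if (args.length > 0) {
            byte[] data = fetch(args[0]);
            System.out.println(data.length + " bytes, PDF header: " + looksLikePdf(data));
        }
    }
}
```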
wb=com.mathworks.mde.webbrowser.WebBrowser.createBrowser;
wb.executeScript('javascript:alert(''Some code from a link'')');
desk=com.mathworks.mde.desk.MLDesktop.getInstance;
desk.removeClient(wb);
