PDFBox : Extraction of data from table

How can I extract data from a table in a PDF using PDFBox?
The position and contents of the text can be found using the PDContentStream and PDFTextStripper classes. The positions of the lines in the table also have to be found. Can anyone help with which class to extend and which method to override?
I have tried the following to extract the starting position of each line of text:
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.util.List;
// PDFBox 2.x package names; in 1.8 PDFTextStripper and TextPosition live in org.apache.pdfbox.util
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
public class Tables {
    public static void main(String[] args) throws IOException {
        File input = new File("test.pdf");
        File output = new File("SampleText.txt");
        // try-with-resources closes the document and flushes/closes the writer
        try (PDDocument pd = PDDocument.load(input);
             BufferedWriter wr = new BufferedWriter(
                     new OutputStreamWriter(new FileOutputStream(output)))) {
            PDFTextStripper stripper = new PDFTextStripper() {
                boolean startOfLine = true;
                @Override
                protected void startPage(PDPage page) throws IOException {
                    startOfLine = true;
                    super.startPage(page);
                }
                @Override
                protected void writeLineSeparator() throws IOException {
                    startOfLine = true;
                    super.writeLineSeparator();
                }
                @Override
                protected void writeString(String text, List<TextPosition> textPositions)
                        throws IOException {
                    if (startOfLine) {
                        // Prefix each line with the Y coordinate of its first character
                        TextPosition firstPosition = textPositions.get(0);
                        writeString(String.format("[%s]", firstPosition.getYDirAdj()));
                        startOfLine = false;
                    }
                    super.writeString(text, textPositions);
                }
            };
            stripper.writeText(pd, wr);
        }
    }
}

I recently did a similar project in which I had to extract data from tables.
You have two options:
1) Use Tabula, an open source tool for extracting tables from PDFs: http://tabula.technology/
https://github.com/tabulapdf/tabula
You can call the Tabula command line tool from your code and extract the data from a specific region.
2) Devise your own algorithm for extracting the tabular data.
If you go for the second option, you will also need to extract the coordinates of the text. You can override the writeString method of the PDFTextStripper class to get them (examples are easy to find online). Then you need to work out how to use that information to get the details you need; the coordinates can be very helpful for this.
If your PDFs have a standard format, I suggest using Tabula, as there is not much left to do.
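For the second option, a minimal sketch of one possible approach: group the text fragments into rows by their Y coordinate, then order each row by X. The Fragment class, the toRows method and the tolerance parameter are hypothetical names for illustration and not part of PDFBox; the coordinates are assumed to come from TextPosition.getXDirAdj() and getYDirAdj(), collected in an overridden writeString.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RowGrouper {

    /** Hypothetical holder for one piece of text and its page coordinates. */
    public static final class Fragment {
        final float x;
        final float y;
        final String text;

        public Fragment(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Groups fragments into rows. A fragment joins the current row when its y
     * differs from the y of the row's first fragment by at most the tolerance.
     */
    public static List<List<String>> toRows(List<Fragment> fragments, float tolerance) {
        // Work on a copy sorted top to bottom so rows can be built in one pass
        List<Fragment> byY = new ArrayList<>(fragments);
        byY.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        float rowY = 0f;
        for (Fragment f : byY) {
            if (rows.isEmpty() || Math.abs(f.y - rowY) > tolerance) {
                rows.add(new ArrayList<>());
                rowY = f.y;
            }
            rows.get(rows.size() - 1).add(f);
        }

        // Within each row, order the fragments left to right
        List<List<String>> result = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            List<String> texts = new ArrayList<>();
            for (Fragment f : row) {
                texts.add(f.text);
            }
            result.add(texts);
        }
        return result;
    }
}
```

Each inner list then approximates one table row. Splitting rows into columns by X ranges would be the next step, and how to do that depends on the layout of your PDF.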

Related

Read two lines of a file at once in a flink streaming process

I want to process files with a Flink stream in which two lines belong together. The first line contains a header and the second line the corresponding text.
The files are located on my local file system. I am using the readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo) method with a custom FileInputFormat.
My streaming job class looks like this:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Read> inputStream = env.readFile(new ReadInputFormatTest("path/to/monitored/folder"), "path/to/monitored/folder", FileProcessingMode.PROCESS_CONTINUOUSLY, 100);
inputStream.print();
env.execute("Flink Streaming Java API Skeleton");
and my ReadInputFormatTest like this:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.flink.api.common.io.FileInputFormat;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.flink.core.fs.FileSystem;
public class ReadInputFormatTest extends FileInputFormat<Read> {
    private transient FileSystem fileSystem;
    private transient BufferedReader reader;
    private final String inputPath;
    private String headerLine;
    private String readLine;
    public ReadInputFormatTest(String inputPath) {
        this.inputPath = inputPath;
    }
    @Override
    public void open(FileInputSplit inputSplit) throws IOException {
        FileSystem fileSystem = getFileSystem();
        this.reader = new BufferedReader(new InputStreamReader(fileSystem.open(inputSplit.getPath())));
        // Pre-read the first header/text pair
        this.headerLine = reader.readLine();
        this.readLine = reader.readLine();
    }
    private FileSystem getFileSystem() {
        if (fileSystem == null) {
            try {
                fileSystem = FileSystem.get(new URI(inputPath));
            } catch (URISyntaxException | IOException e) {
                throw new RuntimeException(e);
            }
        }
        return fileSystem;
    }
    @Override
    public boolean reachedEnd() throws IOException {
        return headerLine == null;
    }
    @Override
    public Read nextRecord(Read r) throws IOException {
        r.setHeader(headerLine);
        r.setSequence(readLine);
        // Advance to the next header/text pair
        headerLine = reader.readLine();
        readLine = reader.readLine();
        return r;
    }
}
As expected, the headers and the text are stored together in one object. However, the file is read eight times. So the problem is the parallelization. Where and how can I specify that a file is processed only once, but several files in parallel?
Or do I have to change my custom FileInputFormat even further?

I would modify your source to emit the available filenames (instead of the actual file contents) and then add a new processor to read a name from the input stream and then emit pairs of lines. In other words, split the current source into a source followed by a processor. The processor can be made to run at any degree of parallelism and the source would be a single instance.
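As a rough sketch of what the processor's core could look like, here is a plain Java helper that reads one file and returns pairs of lines. The class name LinePairReader and the method readPairs are hypothetical, and the Flink wiring (a single-instance source that emits file names, followed by a parallel operator that calls this helper) is an assumption and is not shown.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class LinePairReader {

    /** Reads a file and returns {header, text} pairs, one per two consecutive lines. */
    public static List<String[]> readPairs(Path file) throws IOException {
        List<String[]> pairs = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String header;
            while ((header = reader.readLine()) != null) {
                String text = reader.readLine();
                if (text == null) {
                    // The file ends after a header: this sketch skips the incomplete record
                    break;
                }
                pairs.add(new String[] {header, text});
            }
        }
        return pairs;
    }
}
```

Because each file name is handled by exactly one call, every file is read once, while different files can be handled by different parallel instances of the processor.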

DailyFileRolling without log4j & logback

I am looking for a simple daily file rolling utility that creates a file named <filename>-yyyy-mm-dd.txt with the current date
and keeps writing to it; when the date changes,
it should create a new file and continue writing to that one.
When I googled this, most of the results used logback or log4j.
Is there any way to get a daily file roller in Java without logback or log4j?

Assuming you're talking about java.util.logging - you will have to implement your own java.util.logging.Handler.
I haven't tried this myself, but take a look at Tomcat's JULI package: it includes a Handler that should be able to do rolling-file logging. The class is org.apache.juli.FileHandler and can be found on Maven Central (version 9.0.0M4).

The simplest thing I've done is to extend Writer and use a date format. It should be optimized to open the file only when needed, but it works.
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.text.SimpleDateFormat;
import java.util.Date;
public class StackOverflowFileRolling extends Writer {
    String baseFilePath;
    String datePattern;
    public StackOverflowFileRolling(String baseFilePath, String datePattern) {
        super();
        this.baseFilePath = baseFilePath;
        this.datePattern = datePattern;
    }
    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        // The file name is recomputed on every write, so a date change selects a new file
        String name = baseFilePath + new SimpleDateFormat(datePattern).format(new Date()) + ".log";
        File file = new File(name);
        file.getParentFile().mkdirs();
        // Open in append mode and close again after each write
        try (FileWriter fileWriter = new FileWriter(file, true)) {
            fileWriter.write(cbuf, off, len);
        }
    }
    @Override
    public void flush() throws IOException {
        // Nothing to flush: each write opens and closes its own FileWriter
    }
    @Override
    public void close() throws IOException {
        // Nothing to close: no file is kept open between writes
    }
}
Then use it by wrapping it in a PrintWriter:
try (
    PrintWriter logger = new PrintWriter(
        new StackOverflowFileRolling("/var/log/stackoverflowfilerolling", "yyyy-MM-dd"))
) {
    // this will append 2 lines to the log, since the FileWriter is in append mode
    logger.println("test");
    logger.println("test2");
}
EDIT: some improvements to create directories and use a PrintWriter.
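A possible way to do the optimization mentioned above is to keep the file open and reopen it only when the date changes. The following is a sketch under assumptions, not a tested production class: DailyRollingWriter, the date supplier parameter and the ".txt" suffix are hypothetical choices, and the supplier exists mainly so the date can be controlled in tests.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.time.LocalDate;
import java.util.function.Supplier;

public class DailyRollingWriter extends Writer {
    private final String basePath;            // file name prefix, for example "/var/log/app-"
    private final Supplier<LocalDate> today;  // source of the current date
    private LocalDate openDate;               // date of the currently open file
    private Writer current;                   // null until the first write

    public DailyRollingWriter(String basePath, Supplier<LocalDate> today) {
        this.basePath = basePath;
        this.today = today;
    }

    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        synchronized (lock) {
            LocalDate now = today.get();
            // Open a new file on the first write and whenever the date has changed
            if (current == null || !now.equals(openDate)) {
                roll(now);
            }
            current.write(cbuf, off, len);
        }
    }

    private void roll(LocalDate date) throws IOException {
        if (current != null) {
            current.close();
        }
        // LocalDate.toString() yields the ISO form yyyy-MM-dd
        Path file = Paths.get(basePath + date + ".txt");
        if (file.getParent() != null) {
            Files.createDirectories(file.getParent());
        }
        current = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        openDate = date;
    }

    @Override
    public void flush() throws IOException {
        synchronized (lock) {
            if (current != null) {
                current.flush();
            }
        }
    }

    @Override
    public void close() throws IOException {
        synchronized (lock) {
            if (current != null) {
                current.close();
                current = null;
            }
        }
    }
}
```

In normal use the supplier would simply be LocalDate::now, and the writer can be wrapped in a PrintWriter in the same way as above.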

Invalidate Stream without Closing

This is a follow-up to "anonymous file streams reusing descriptors".
As per my previous question, I can't depend on code like this (it happens to work in JDK 8, for now):
RandomAccessFile r = new RandomAccessFile(...);
FileInputStream f_1 = new FileInputStream(r.getFD());
// some io, not shown
f_1 = null;
FileInputStream f_2 = new FileInputStream(r.getFD());
// some io, not shown
f_2 = null;
FileInputStream f_3 = new FileInputStream(r.getFD());
// some io, not shown
f_3 = null;
However, to prevent accidental errors and as a form of self-documentation, I would like to invalidate each file stream after I'm done using it - without closing the underlying file descriptor.
Each FileInputStream is meant to be independent, with positioning controlled by the RandomAccessFile. I share the same FileDescriptor to prevent any race conditions arising from opening the same path multiple times. When I'm done with one FileInputStream, I want to invalidate it so as to make it impossible to accidentally read from it while using the second FileInputStream (which would cause the second FileInputStream to skip data).
How can I do this?
Notes:
- The libraries I use require compatibility with java.io.*
- If you suggest a library (I prefer built-in Java semantics if at all possible), it must be commonly available (packaged) for Linux (the main target) and usable on Windows (experimental target)
- That said, Windows support isn't absolutely required
Edit: in response to a comment, here is my workflow:
RandomAccessFile r = new RandomAccessFile(path, "r");
int header_read;
int header_remaining = 4; // header length, initially
byte[] ba = new byte[header_remaining];
ByteBuffer bb = ByteBuffer.allocate(header_remaining);
while ((header_read = r.read(ba, 0, header_remaining)) > 0) {
    header_remaining -= header_read;
    bb.put(ba, 0, header_read);
}
byte[] header = bb.array();
// process header, not shown
// the RandomAccessFile above reads only a small amount, so buffering isn't required
r.seek(0);
FileInputStream f_1 = new FileInputStream(r.getFD());
Library1Result result1 = library1.Main.entry_point(f_1);
// process result1, not shown
// Library1 reads the InputStream in large chunks, so buffering isn't required
// invalidate f_1 (this question)
r.seek(0);
int read;
byte[] buffer = new byte[4096];
while ((read = r.read(buffer)) > 0 && library1.shouldContinue()) {
    library2.process(buffer, read);
}
// the RandomAccessFile above is read in large chunks, so buffering isn't required
// in a previous edit the RandomAccessFile was used to create a FileInputStream. Obviously that's not required, so ignore
r.seek(0);
Reader r_1 = new BufferedReader(new InputStreamReader(new FileInputStream(r.getFD())));
Library3Result result3 = library3.Main.entry_point(r_1);
// process result3, not shown
// I'm not sure how Library3 uses the reader, so I'm providing buffering
// invalidate r_1 (this question) - bonus: frees the buffer
r.seek(0);
FileInputStream f_2 = new FileInputStream(r.getFD());
result1 = library1.Main.entry_point(f_2);
// process result1 (reassigned), not shown
// Yes, I actually have to call 'library1.Main.entry_point' *again* - same comments apply as from before
// invalidate f_2 (this question)
//
// I've been told to be careful when opening multiple streams from the same
// descriptor if one is buffered. This is very vague. I assume because I only
// ever use any stream once and exclusively, this code is safe.
//

A pure Java solution might be to create a forwarding decorator that checks on each method call whether the stream is validated or not. For InputStream this decorator may look like this:
import java.io.IOException;
import java.io.InputStream;
public final class CheckedInputStream extends InputStream {
    final InputStream delegate;
    boolean validated;
    public CheckedInputStream(InputStream stream) {
        delegate = stream;
        validated = true;
    }
    public void invalidate() {
        validated = false;
    }
    // Every forwarding method calls this first, so an invalidated stream fails fast
    void checkValidated() {
        if (!validated) {
            throw new IllegalStateException("Stream is invalidated.");
        }
    }
    @Override
    public int read() throws IOException {
        checkValidated();
        return delegate.read();
    }
    @Override
    public int read(byte[] b) throws IOException {
        checkValidated();
        return read(b, 0, b.length);
    }
    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        checkValidated();
        return delegate.read(b, off, len);
    }
    @Override
    public long skip(long n) throws IOException {
        checkValidated();
        return delegate.skip(n);
    }
    @Override
    public int available() throws IOException {
        checkValidated();
        return delegate.available();
    }
    @Override
    public void close() throws IOException {
        checkValidated();
        delegate.close();
    }
    @Override
    public synchronized void mark(int readlimit) {
        checkValidated();
        delegate.mark(readlimit);
    }
    @Override
    public synchronized void reset() throws IOException {
        checkValidated();
        delegate.reset();
    }
    @Override
    public boolean markSupported() {
        checkValidated();
        return delegate.markSupported();
    }
}
You can use it like:
CheckedInputStream f_1 = new CheckedInputStream(new FileInputStream(r.getFD()));
// some io, not shown
f_1.invalidate();
f_1.read(); // throws IllegalStateException
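The same forwarding idea can also be written more compactly on top of java.io.FilterInputStream. The sketch below is a variation and makes two choices of its own that are assumptions, not requirements: it throws IOException instead of IllegalStateException, and its close() only invalidates the wrapper, leaving the underlying stream (and therefore the shared descriptor) open. The class name InvalidatableInputStream is hypothetical.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class InvalidatableInputStream extends FilterInputStream {
    private volatile boolean valid = true;

    public InvalidatableInputStream(InputStream in) {
        super(in);
    }

    /** Makes every later read, skip, available or reset call fail. */
    public void invalidate() {
        valid = false;
    }

    private void ensureValid() throws IOException {
        if (!valid) {
            throw new IOException("Stream is invalidated.");
        }
    }

    @Override
    public int read() throws IOException {
        ensureValid();
        return super.read();
    }

    // FilterInputStream.read(byte[]) delegates to this method, so it is covered too
    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        ensureValid();
        return super.read(b, off, len);
    }

    @Override
    public long skip(long n) throws IOException {
        ensureValid();
        return super.skip(n);
    }

    @Override
    public int available() throws IOException {
        ensureValid();
        return super.available();
    }

    @Override
    public synchronized void reset() throws IOException {
        ensureValid();
        super.reset();
    }

    /** Invalidates this wrapper without closing the wrapped stream. */
    @Override
    public void close() {
        invalidate();
    }
}
```

With this variant, a library that closes the stream it was given does not close the descriptor owned by the RandomAccessFile; closing r remains the caller's job.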

Under Unix you could generally avoid such problems by dup'ing the file descriptor.
Since Java does not offer such a feature, one option would be a native library that exposes it; jnr-posix does that, for example. On the other hand, jnr depends on a lot more JDK implementation details than the code in your original question.

How to add HTML headers and footers to a page?

How can I add a header to a PDF from an HTML source using iText?
Currently, we have extended PdfPageEventHelper and overridden these methods. It works fine, but it throws a RuntimeWorkerException when I get to 2+ pages.
@Override
void onStartPage(PdfWriter writer, Document document) {
    InputStream is = new ByteArrayInputStream(header?.getBytes());
    XMLWorkerHelper.getInstance().parseXHtml(writer, document, is);
}
@Override
void onEndPage(PdfWriter writer, Document document) {
    InputStream is = new ByteArrayInputStream(footer?.getBytes());
    XMLWorkerHelper.getInstance().parseXHtml(writer, document, is);
}

It is forbidden to add content in the onStartPage() event in general. It is forbidden to add content to the document object in the onEndPage(). You should add your header and your footer in the onEndPage() method using PdfWriter, NOT document. Also: you are wasting plenty of CPU by parsing the HTML over and over again.
Please take a look at the HtmlHeaderFooter example.
It has two snippets of HTML, one for the header, one for the footer.
public static final String HEADER =
"<table width=\"100%\" border=\"0\"><tr><td>Header</td><td align=\"right\">Some title</td></tr></table>";
public static final String FOOTER =
"<table width=\"100%\" border=\"0\"><tr><td>Footer</td><td align=\"right\">Some title</td></tr></table>";
Note that there are better ways to describe the header and footer than by using HTML, but maybe it's one of your requirements, so I won't ask you why you don't use any of the methods that are explained in the official documentation. By the way: all the information you need to solve your problem can also be found in that free ebook, so you may want to download it...
We will read these HTML snippets only once in our page event and then we'll render the elements over and over again on every page:
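import java.io.IOException;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Element;
import com.itextpdf.text.ExceptionConverter;
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.ColumnText;
import com.itextpdf.text.pdf.PdfPageEventHelper;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.tool.xml.ElementList;
import com.itextpdf.tool.xml.XMLWorkerHelper;
public class HeaderFooter extends PdfPageEventHelper {
    protected ElementList header;
    protected ElementList footer;
    public HeaderFooter() throws IOException {
        // Parse the HTML snippets once, not on every page
        header = XMLWorkerHelper.parseToElementList(HEADER, null);
        footer = XMLWorkerHelper.parseToElementList(FOOTER, null);
    }
    @Override
    public void onEndPage(PdfWriter writer, Document document) {
        try {
            ColumnText ct = new ColumnText(writer.getDirectContent());
            // Header area at the top of the page
            ct.setSimpleColumn(new Rectangle(36, 832, 559, 810));
            for (Element e : header) {
                ct.addElement(e);
            }
            ct.go();
            // Footer area at the bottom of the page
            ct.setSimpleColumn(new Rectangle(36, 10, 559, 32));
            for (Element e : footer) {
                ct.addElement(e);
            }
            ct.go();
        } catch (DocumentException de) {
            throw new ExceptionConverter(de);
        }
    }
}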
Do you see the mechanism we use to add the Element objects obtained from XML Worker? We create a ColumnText object that writes to the direct content of the writer (using the document is forbidden). We define a Rectangle and we use go() to render the elements.
The result is shown in html_header_footer.pdf.

Bruno's answer is correct, but it didn't work completely for me: XMLWorkerHelper.parseToElementList was not able to handle some system fonts, whereas XMLWorkerHelper.getInstance().parseXHtml(writer, document, is) handled them correctly. So I had to go down the route of an element handler, which worked a treat. Here's the code in C#:
/// <summary>
/// returns pdf in bytes.
/// </summary>
/// <param name="contentsHtml">contents.</param>
/// <param name="headerHtml">header contents.</param>
/// <param name="footerHtml">footer contents.</param>
/// <returns></returns>
public Byte[] GetPDF(string contentsHtml, string headerHtml, string footerHtml)
{
// Create a byte array that will eventually hold our final PDF
Byte[] bytes;
// Boilerplate iTextSharp setup here
// Create a stream that we can write to, in this case a MemoryStream
using (var ms = new MemoryStream())
{
// Create an iTextSharp Document which is an abstraction of a PDF but **NOT** a PDF
using (var document = new Document(PageSize.A4, 40, 40, 120, 120))
{
// Create a writer that's bound to our PDF abstraction and our stream
using (var writer = PdfWriter.GetInstance(document, ms))
{
// Open the document for writing
document.Open();
var headerElements = new HtmlElementHandler();
var footerElements = new HtmlElementHandler();
XMLWorkerHelper.GetInstance().ParseXHtml(headerElements, new StringReader(headerHtml));
XMLWorkerHelper.GetInstance().ParseXHtml(footerElements, new StringReader(footerHtml));
writer.PageEvent = new HeaderFooter(headerElements.GetElements(), footerElements.GetElements());
// Read your html by database or file here and store it into finalHtml e.g. a string
// XMLWorker also reads from a TextReader and not directly from a string
using (var srHtml = new StringReader(contentsHtml))
{
// Parse the HTML
iTextSharp.tool.xml.XMLWorkerHelper.GetInstance().ParseXHtml(writer, document, srHtml);
}
document.Close();
}
}
// After all of the PDF "stuff" above is done and closed but **before** we
// close the MemoryStream, grab all of the active bytes from the stream
bytes = ms.ToArray();
}
return bytes;
}
}
The page event and element handler code is here:
public partial class HeaderFooter : PdfPageEventHelper
{
private ElementList HeaderElements { get; set; }
private ElementList FooterElements { get; set; }
public HeaderFooter(ElementList headerElements, ElementList footerElements)
{
HeaderElements = headerElements;
FooterElements = footerElements;
}
public override void OnEndPage(PdfWriter writer, Document document)
{
base.OnEndPage(writer, document);
try
{
ColumnText headerText = new ColumnText(writer.DirectContent);
foreach (IElement e in HeaderElements)
{
headerText.AddElement(e);
}
headerText.SetSimpleColumn(document.Left, document.Top, document.Right, document.GetTop(-100), 10, Element.ALIGN_MIDDLE);
headerText.Go();
ColumnText footerText = new ColumnText(writer.DirectContent);
foreach (IElement e in FooterElements)
{
footerText.AddElement(e);
}
footerText.SetSimpleColumn(document.Left, document.GetBottom(-100), document.Right, document.GetBottom(-40), 10, Element.ALIGN_MIDDLE);
footerText.Go();
}
catch (DocumentException de)
{
throw new Exception(de.Message);
}
}
}
public class HtmlElementHandler : IElementHandler
{
public ElementList Elements { get; set; }
public HtmlElementHandler()
{
Elements = new ElementList();
}
public ElementList GetElements()
{
return Elements;
}
public void Add(IWritable w)
{
if (w is WritableElement)
{
foreach (IElement e in ((WritableElement)w).Elements())
{
Elements.Add(e);
}
}
}
}

iText - Fail on second attempt to generate PDF

I have a Java desktop application that is using iText to generate PDFs from a resultset. The first time you generate a PDF, it works fine. The problem comes when you try to generate a second one. It throws a DocumentException saying that the document is closed. I have tried to find other examples of people having this problem, and I come up with very little, which leads me to believe that I have made a very simple mistake and I cannot find it.
The code below is a snippet of the event handler that calls the report class:
RptPotReport report = new RptPotReport();
try {
report.rptPot();
} catch (DocumentException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
And here is the code for the report class itself. The error occurs on the second run through this code:
public class RptPotReport {
public static void main(String[] args) throws IOException, DocumentException, SQLException {
new RptPotReport().rptPot();
}
String fileOutput = "Potting Report.pdf";
public void rptPot() throws DocumentException, IOException {
File f = new File("Potting Report.pdf");
if (f.exists()) {
f.delete();
}
Document document = new Document();
document = pdfSizes.getPdfLetter();
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(fileOutput));
document.open();
Phrase title = new Phrase();
title.add(new Chunk("Potting Report"));
document.add(title); // ******* DocumentException here: "The document has been closed. You can't add any Elements."
document.close();
try {
File pdfFile = new File(fileOutput);
if (pdfFile.exists()) {
if (Desktop.isDesktopSupported()) {
Desktop.getDesktop().open(pdfFile);
} else {
System.out.println("Awt Desktop is not supported!");
}
}
} catch (Exception ex) {
ex.printStackTrace();
}
}
}
EDIT: At someone's suggestion, I tried calling the RptPotReport from a second thread, but that did not change anything. Looking into it further, the Document class of iText creates a new thread when it's instantiated. So I'm right back where I started, still stuck.

What does this line do exactly in your application:
document = pdfSizes.getPdfLetter();
Without seeing that code, and going by your explanation, it seems that this line points the document variable at the object returned by pdfSizes.getPdfLetter(). If that object is reused between runs, you no longer hold the reference created by the new Document() statement, and on the second run you get a document that has already been closed.
I tend to think the bug is in the pdfSizes.getPdfLetter() method.
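Since the code of pdfSizes is not shown, the following is only an illustration of the suspected cause. It uses a stand-in Doc class instead of the iText Document, and CachingSizes, FreshSizes and generate are hypothetical names. A helper that hands out one cached instance fails on the second run, while a helper that creates a new instance per call does not.

```java
import java.util.function.Supplier;

public class SharedDocumentDemo {

    /** Stand-in for a document that rejects additions after close() (not the iText class). */
    public static final class Doc {
        private boolean closed;

        public void add(String element) {
            if (closed) {
                throw new IllegalStateException("The document has been closed.");
            }
        }

        public void close() {
            closed = true;
        }
    }

    /** Hypothetical helper that returns the same instance on every call. */
    public static final class CachingSizes {
        private static final Doc LETTER = new Doc();

        public static Doc getLetter() {
            return LETTER;
        }
    }

    /** Hypothetical helper that returns a new instance on every call. */
    public static final class FreshSizes {
        public static Doc getLetter() {
            return new Doc();
        }
    }

    /** Simulates one report run; returns false if the document was already closed. */
    public static boolean generate(Supplier<Doc> factory) {
        Doc doc = factory.get();
        try {
            doc.add("Potting Report");
            return true;
        } catch (IllegalStateException e) {
            return false;
        } finally {
            doc.close();
        }
    }
}
```

If getPdfLetter() turns out to behave like CachingSizes here, making it create a new Document on each call should remove the error.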
