Parse PDF table and display it as CSV (Java)

I am trying to parse a TABLE in a PDF file and display it as CSV. I have attached sample data from the PDF below (only a few columns) and the sample output for the same. Each column width is fixed, say Company Name (18 chars), Amount (8 chars), Type (5 chars), etc. I tried using the iText and PDFBox jars to get each page's data and parsed it line by line, but it seems this is not a great solution because the line breaks and page breaks in the PDF are not reliable. Please let me know if there is a more appropriate solution. We want to use open-source software for this.

This is a very complex problem. There are even multiple master's dissertations about this topic.
An easy analogy: I have 5000 puzzle pieces, all of them perfectly square, so any of them could fit anywhere. Some of them have pieces of lines on them, some have snippets of text.
However, that does not mean it can't be done. It'll just take work.
General approach:
- Use iText (specifically IEventListener) to get information on all rendering events for every page; a minimal listener sketch follows this list.
- Select those rendering events that make sense for your application: PathRenderInfo and TextRenderInfo.
- Events in a PDF do not need to appear in order according to the spec. Solve this by implementing a comparator over IEventData that sorts according to reading order. This implies you might have to implement some basic language detection, since not every language reads left to right.
- Once sorted, you can start clustering items together according to any of the various heuristics you find in the literature. For instance, two characters can be grouped into a snippet of text if they follow each other in the sorted list of events (meaning they appear next to each other in reading order), if their y-positions do not differ too much (subscript and superscript might interfere with this), and if their x-positions do not differ too much (kerning).
- Continue clustering characters until you have formed words.
- Once you have formed words, use a similar algorithm to merge words into lines. Use PathRenderInfo to withhold merging words if they intersect with a line.
- Once you have managed to create lines, look for tables. One possible approach is to apply horizontal and vertical projections and look for those sub-areas of the page that (when projected) show a grid-like structure.
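As a very rough sketch of the first two bullets only (iText 7 API; the Snippet class and the naive reading-order comparator are my own simplifications, and all of the clustering work is left out):

import com.itextpdf.kernel.geom.Vector;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.EventType;
import com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor;
import com.itextpdf.kernel.pdf.canvas.parser.data.IEventData;
import com.itextpdf.kernel.pdf.canvas.parser.data.TextRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.listener.IEventListener;
import java.util.*;

public class TextEventCollector implements IEventListener {

    // Small value object: copy what we need out of TextRenderInfo immediately.
    static class Snippet {
        final String text;
        final float x, y;
        Snippet(String text, float x, float y) { this.text = text; this.x = x; this.y = y; }
    }

    final List<Snippet> snippets = new ArrayList<>();

    public void eventOccurred(IEventData data, EventType type) {
        if (type == EventType.RENDER_TEXT) {
            TextRenderInfo info = (TextRenderInfo) data;
            Vector start = info.getBaseline().getStartPoint();
            snippets.add(new Snippet(info.getText(), start.get(Vector.I1), start.get(Vector.I2)));
        }
        // RENDER_PATH events (PathRenderInfo) would be collected here too, to find cell borders.
    }

    public Set<EventType> getSupportedEvents() {
        return EnumSet.of(EventType.RENDER_TEXT, EventType.RENDER_PATH);
    }

    // Naive "reading order": top to bottom (larger y first in PDF coordinates), then left to right.
    public List<Snippet> inReadingOrder() {
        snippets.sort(Comparator.comparingDouble((Snippet s) -> -s.y)
                                .thenComparingDouble(s -> s.x));
        return snippets;
    }

    public static void main(String[] args) throws Exception {
        PdfDocument doc = new PdfDocument(new PdfReader("table.pdf"));
        TextEventCollector collector = new TextEventCollector();
        new PdfCanvasProcessor(collector).processPageContent(doc.getFirstPage());
        // Clustering into words, lines and table cells would start from collector.inReadingOrder().
        doc.close();
    }
}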
This high-level approach should make it painfully obvious why this is not a widely available thing. It is very hard to implement, and it requires domain knowledge of PDF, fonts, and machine learning.
If you are ok with commercial solutions, try out pdf2Data. It's an iText add-on that features this exact functionality.
http://itextpdf.com/itext7/pdf2Data

Related

How to combine two pre-trained Word2Vec models?

I successfully followed the deeplearning4j.org tutorial on Word2Vec, so I am able to load an already trained model or train a new one based on some raw text (more specifically, I am using GoogleNews-vectors-negative300 and the Emoji2Vec pre-trained model).
However, I would like to combine these two models for the following reason: given a sentence (for example, a comment from Instagram or Twitter that contains emoji), I want to identify the emoji in the sentence and then map it to the word it is related to. In order to do that, I was planning to iterate over all the words in the sentence and calculate the closeness (how near the emoji and the word are located in the vector space).
I found code showing how to continue training an already existing model. However, it is mentioned that new words are not added in this case and only the weights for existing words will be updated based on the new text corpus.
I would appreciate any help or ideas on the problem I have. Thanks in advance!
Combining two models trained from different corpora is not a simple, supported operation in the word2vec libraries with which I'm most familiar.
In particular, even if the same word appears in both corpora, and even in similar contexts, the randomization used by this algorithm during initialization and training, plus the extra randomization injected by multithreaded training, means that the word may land in wildly different places in each model. Only the relative distances/orientations with respect to other words should be roughly similar, not the specific coordinates/rotations.
So merging two models requires translating one model's coordinates into the other's. That will typically involve learning a projection from one space to the other, then moving the unique words from the source space into the surviving space. I don't know if DL4J has a built-in routine for this; the Python gensim library has a TranslationMatrix example class in recent versions which can do this, motivated by the use of word-vectors for language-to-language translation.
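As a rough illustration of that idea only (this is not a DL4J feature; the class and method names are made up, and a real implementation would solve the least-squares problem properly rather than with toy gradient descent), learning a projection from shared anchor words and then copying over the unique words could look like this in plain Java on word-to-vector maps, however you load them:

import java.util.*;

class SpaceMerger {
    // Learn a dims x dims matrix W so that sourceVec * W is close to targetVec for shared words.
    static float[][] learnProjection(Map<String, float[]> source, Map<String, float[]> target,
                                     int dims, int epochs, float lr) {
        float[][] w = new float[dims][dims];
        for (int i = 0; i < dims; i++) w[i][i] = 1f;          // start from the identity
        List<String> shared = new ArrayList<>(source.keySet());
        shared.retainAll(target.keySet());                     // anchor words present in both models
        for (int e = 0; e < epochs; e++) {
            for (String word : shared) {
                float[] x = source.get(word), y = target.get(word);
                float[] pred = multiply(x, w);
                for (int i = 0; i < dims; i++) {
                    float err = pred[i] - y[i];
                    for (int j = 0; j < dims; j++) {
                        w[j][i] -= lr * err * x[j];            // simple gradient step
                    }
                }
            }
        }
        return w;
    }

    static float[] multiply(float[] x, float[][] w) {
        float[] out = new float[w[0].length];
        for (int j = 0; j < w[0].length; j++)
            for (int i = 0; i < x.length; i++)
                out[j] += x[i] * w[i][j];
        return out;
    }

    // Copy words that only exist in the source model (e.g. emoji) into the target space.
    static void mergeUniqueWords(Map<String, float[]> source, Map<String, float[]> target, float[][] w) {
        for (Map.Entry<String, float[]> entry : source.entrySet()) {
            if (!target.containsKey(entry.getKey())) {
                target.put(entry.getKey(), multiply(entry.getValue(), w));
            }
        }
    }
}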

Detecting an object (words) in an image

I want to detect an object (the city name) on a license plate. I have an image:
and I want to detect if the image contains the word "بابل":
I have tried a template matching method using OpenCV and also MATLAB, but the results are poor when tested with other images.
I have also read this page, but I was not able to get a good understanding of what to do from it.
Can anyone help me or give me a step-by-step way to solve this?
I have a project to recognize license plates; we can already detect and recognize the numbers, but I also need to detect and recognize the words (the same words appear across many cars).
Your question is very broad, but I will do my best to explain optical character recognition (OCR) in a programmatic context and give you a general project workflow followed by successful OCR algorithms.
The problem you face is easier than most, because instead of having to recognize/differentiate between different characters, you only have to recognize a single image (assuming this is the only city you want to recognize). You are, however, subject to many of the limitations of any image recognition algorithm (quality, lighting, image variation).
Things you need to do:
1) Image isolation
You'll have to isolate your image from a noisy background:
I think that the best isolation technique would be to first isolate the license plate, and then isolate the specific characters you're looking for. Important things to keep in mind during this step:
Does the license plate always appear in the same place on the car?
Are cars always in the same position when the image is taken?
Is the word you are looking for always in the same spot on the license plate?
The difficulty/implementation of the task depends greatly on the answers to these three questions.
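To make this step concrete, here is a rough sketch using the OpenCV Java bindings (OpenCV 3+): find a large, wide rectangle and crop to it. The contour-filtering thresholds are guesses you would tune for your own camera setup, not values from this answer.

import org.opencv.core.*;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;
import java.util.ArrayList;
import java.util.List;

public class PlateIsolation {
    public static Mat isolatePlate(Mat bgr) {
        Mat gray = new Mat(), edges = new Mat();
        Imgproc.cvtColor(bgr, gray, Imgproc.COLOR_BGR2GRAY);
        Imgproc.GaussianBlur(gray, gray, new Size(5, 5), 0);
        Imgproc.Canny(gray, edges, 50, 150);

        List<MatOfPoint> contours = new ArrayList<>();
        Imgproc.findContours(edges, contours, new Mat(),
                Imgproc.RETR_EXTERNAL, Imgproc.CHAIN_APPROX_SIMPLE);

        Rect best = null;
        for (MatOfPoint contour : contours) {
            Rect r = Imgproc.boundingRect(contour);
            double aspect = (double) r.width / r.height;
            // Plates are wide and reasonably large; reject thin or tiny rectangles.
            if (aspect > 2 && aspect < 6 && r.area() > 2000
                    && (best == null || r.area() > best.area())) {
                best = r;
            }
        }
        return best == null ? bgr : new Mat(bgr, best);   // crop to the plate region if found
    }

    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
        Mat car = Imgcodecs.imread("car.jpg");
        Imgcodecs.imwrite("plate.jpg", isolatePlate(car));
    }
}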
2) Image capture/preprocessing
This is a very important step for your particular implementation. Although possible, it is highly unlikely that your image will look like this:
as your camera would have to be directly in front of the license plate. More likely, your image may look like one of these:
depending on the perspective the image is taken from. Ideally, all of your images will be taken from the same vantage point and you'll simply be able to apply a single transform so that they all look similar (or not apply one at all). If you have photos taken from different vantage points, you need to normalize them, or else you will be comparing two different images. Also, especially if you are taking images from only one vantage point and decide not to apply a transform, make sure that the text your algorithm is looking for is rendered from the same point of view. If you don't, you'll have a not-so-great success rate that is difficult to debug.
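A sketch of that normalization step with OpenCV's Java bindings, assuming you already have the four plate corners from the previous step (corner ordering and output size are illustrative choices):

import org.opencv.core.*;
import org.opencv.imgproc.Imgproc;

public class PlateWarp {
    // corners must be ordered: top-left, top-right, bottom-right, bottom-left
    public static Mat normalize(Mat plateImage, Point[] corners, int outW, int outH) {
        MatOfPoint2f src = new MatOfPoint2f(corners);
        MatOfPoint2f dst = new MatOfPoint2f(
                new Point(0, 0), new Point(outW, 0),
                new Point(outW, outH), new Point(0, outH));
        Mat transform = Imgproc.getPerspectiveTransform(src, dst);   // 3x3 homography
        Mat warped = new Mat();
        Imgproc.warpPerspective(plateImage, warped, transform, new Size(outW, outH));
        return warped;
    }
}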
3) Image optimization
You'll probably want to (a) convert your images to black-and-white and (b) reduce the noise in your images. These two processes are called binarization and despeckling, respectively. There are many implementations of these algorithms available in many different languages, most of them easy to find with a Google search. You can batch process your images using any language/free tool you like, or find an implementation that works with whatever language you decide to work in.
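OpenCV's Java bindings ship both operations; a minimal sketch (the median kernel size and Otsu thresholding are choices to tune, not requirements):

import org.opencv.core.Mat;
import org.opencv.imgproc.Imgproc;

public class Cleanup {
    public static Mat binarizeAndDespeckle(Mat gray) {
        Mat despeckled = new Mat(), binary = new Mat();
        Imgproc.medianBlur(gray, despeckled, 3);                       // despeckling
        Imgproc.threshold(despeckled, binary, 0, 255,
                Imgproc.THRESH_BINARY | Imgproc.THRESH_OTSU);          // binarization
        return binary;
    }
}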
4) Pattern recognition
If you only want to search for the name of this one city (only one word ever), you'll most likely want to implement a matrix matching strategy. Many people also refer to matrix matching as pattern recognition so you may have heard it in this context before. Here is an excellent paper detailing an algorithmic implementation that should help you immensely should you choose to use matrix matching. The other algorithm available is feature extraction, which attempts to identify words based on patterns within letters (i.e. loops, curves, lines). You might use this if the font style of the word on the license plate ever changes, but if the same font will always be used, I think matrix matching will have the best results.
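One simple way to approximate matrix matching is OpenCV's template matching: slide a template of the city name over the binarized plate and accept the best normalized correlation if it clears a threshold. The 0.7 cut-off below is an assumption you would tune against your own images.

import org.opencv.core.*;
import org.opencv.imgproc.Imgproc;

public class CityMatcher {
    public static boolean containsCityName(Mat plateBinary, Mat cityTemplate) {
        Mat result = new Mat();
        Imgproc.matchTemplate(plateBinary, cityTemplate, result, Imgproc.TM_CCOEFF_NORMED);
        Core.MinMaxLocResult best = Core.minMaxLoc(result);
        return best.maxVal > 0.7;   // highest correlation anywhere on the plate
    }
}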
5) Algorithm training
Depending on the approach you take (if you use a learning algorithm), you may need to train your algorithm with tagged data. This means you have a series of images that you have identified as True (contains the city name) or False (does not). Here's a pseudocode example of how this works:
train = [(img1, True), (img2, True), (img3, False), (img4, False)]
img_recognizer = algorithm(train)
Then, you apply your trained algorithm to identify untagged images.
test_untagged = [img5, img6, img7]
for image in test_untagged:
    img_recognizer(image)
Your training set should be much larger than four data points; in general, the bigger the better. Just make sure, as I said before, that all the images have been put through an identical transformation.
Here is a very, very high-level code flow that may be helpful in implementing your algorithm:
img_in = capture_image()
cropped_img = isolate(img_in)
scaled_img = normalize_scale(cropped_img)
img_desp = despeckle(scaled_img)
img_final = binarize(img_desp)
# train the matcher, then apply it to a new image
match = train_match(training_set)
boolCity = match(img_final)
The processes above have been implemented many times and are thoroughly documented in many languages. Below are some implementations in the languages tagged in your question.
Pure Java
cvBlob in OpenCV (check out this tutorial and this blog post too)
tesseract-ocr in C++
Matlab OCR
Good luck!
If you ask "I want to detect if the image contains the word "بابل" - this is classic problem which is solved using http://code.opencv.org/projects/opencv/wiki/FaceDetection like classifier.
But I assume you still want more. Years ago I tried to solve simiar problems and I provide example image to show how good/bad it was:
To detected licence plate I used very basic rectangle detection which is included in every OpenCV samples folder. And then used perspective transform to fix layout and size. It was important to implement multiple checks to see if rectangle looks good enough to be licence plate. For example if rectangle is 500px tall and 2px wide, then probably this is not what I want and was rejected.
Use https://code.google.com/p/cvblob/ to extract arabic text and other components on detected plate. I just had similar need yesterday on other project. I had to extract Japanese kanji symbols from page:
CvBlob does a lot of work for you.
Next step use technique explained http://blog.damiles.com/2008/11/basic-ocr-in-opencv/ to match city name. Just teach algorithm with example images of different city names and soon it will tell 99% of them just out of box. I have used similar approaches on different projects and quite sure they work

Understand if two different pdf are the same research paper

I'm thinking of writing a simple research paper manager.
The idea is to have a repository containing, for each paper, its metadata:
paper_id -> [title, authors, journal, comments...]
Since it would be nice to be able to import a friend's paper dump, I'm thinking about how to generate the paper_id of a paper: IMHO it should be produced from the text of the PDF, to guarantee that two different collections have the same ids only for the same papers.
At the moment, I extract the text of the first page using the iText library (removing any annotations), and I compute a simhash footprint from the text.
The main problem is that sometimes the text is slightly different (yes, it happens! for example this and this), so I would like to be tolerant.
With simhash I can compute how similar two documents are, so if a footprint is not in the repo, I'll have to iterate over the collection looking for 'near' footprints.
I'm not convinced by this method; could you suggest a better way to produce a signature (short, numerical or alphanumerical) for this kind of document?
UPDATE: I had this idea: divide the first page into 8 (more or less) non-overlapping squares covering the whole page, then consider the text in each square and generate a simhash signature for it. In the end I'll have an 8x64 = 512-bit signature, and I can consider two papers the same if the sum of the differences between their simhash signature sets is under a certain threshold.
If you actually have a function that takes two texts and returns a measure of their similarity, you do not have to iterate over the entire repository.
Given an article that is not in the repository, you can compare it only against articles that have approximately the same length. For example, given an article of 1000 characters, you would compare it to articles having between 950 and 1050 characters. For this you will need a data structure that maps length ranges to articles, and you will have to fine-tune the size of the range: too large and each range holds too many items; too small and you have a higher chance of missing a near-duplicate.
Of course this will fail on some edge cases. For example, if you have two documents where the second is simply the first copy-pasted twice, you would probably want them to be considered equal, but you will not even compare them since they are too far apart in length. There are methods to deal with that too, but you probably 'ain't gonna need it'.
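A minimal Java sketch of the length-bucket idea, assuming 64-bit simhash footprints, a Hamming-distance comparison, and an arbitrary plus/minus 5% length window (all of these are illustrative choices, not part of the answer):

import java.util.*;

public class PaperIndex {
    private final TreeMap<Integer, List<Long>> byLength = new TreeMap<>();

    public void add(int textLength, long simhash) {
        byLength.computeIfAbsent(textLength, k -> new ArrayList<>()).add(simhash);
    }

    // True if a "near" footprint already exists among papers of similar text length.
    public boolean containsNear(int textLength, long simhash, int maxHammingDistance) {
        int lo = (int) (textLength * 0.95), hi = (int) (textLength * 1.05);
        for (List<Long> bucket : byLength.subMap(lo, true, hi, true).values()) {
            for (long other : bucket) {
                if (Long.bitCount(simhash ^ other) <= maxHammingDistance) return true;
            }
        }
        return false;
    }
}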

how to approach phrase queries and term grouping

I am new to Lucene and my project is to provide specialized search for a set
of booklets. I am using Lucene Java 3.1.
The basic idea is to help people know where to look for information in the (rather
large and dry) booklets by consulting the index to find out what booklet and page numbers match their query. Each Document in my index represents a particular page in one of the booklets.
So far I have been able to successfully scrape the raw text from the booklets,
insert it into an index, and query it just fine using StandardAnalyzer on both
ends.
So here's my general question:
Many queries on the index will involve searching for place names mentioned in the
booklets. Some place names use notational variants. For instance, in the body text
it will be called "Ship Creek" on one page, but in a map diagram elsewhere it might be listed as "Ship Cr." or even "Ship Ck.". What I need to know is how to approach treating the two consecutive words as a single term and add the notational variants as synonyms.
My goal is of course to search with any of the variants and catch all occurrences. If I search for (Ship AND (Cr Ck Creek)) this does not give me what I want because other words may appear between [ship] and [cr]/[ck]/[creek] leading to false positives.
So, in a nutshell I probably still need the basic stuff provided by StandardAnalyzer, but with specific term grouping to emit place names as complete terms and possibly insert synonyms to cover the variants.
For instance, the text "...allowed from the mouth of Ship Creek upstream to ..." would
result in tokens [allowed],[mouth],[ship creek],[upstream]. Perhaps via a TokenFilter along
the way, the [ship creek] term would expand into [ship creek][ship ck][ship cr].
As a bonus it would be nice to treat the trickier text "..except in Ship, Bird, and
Campbell creeks where the limit is..." as [except],[ship creek],[bird creek],
[campbell creek],[where],[limit].
This seems like a pretty basic use case, but it's not clear to me how I might be able to use existing components from Lucene contrib or SOLR to accomplish this. Should the detection and merging be done in some kind of TokenFilter? Do I need a custom Analyzer implementation?
Some of the term grouping can probably be done heuristically (any word immediately followed by [creek] becomes a single [... creek] term), but I also have an exhaustive list of the places mentioned in the text, if that helps.
Thanks for any help you can provide.
You can use Solr's Synonym Filter. Just set up "creek" to have synonyms "ck", "cr" etc.
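Roughly, and assuming a Solr 3.x-era schema, that setup could look like the following; the field type name and synonym entries are only illustrative.

# synonyms.txt
creek, cr, ck

<!-- schema.xml field type -->
<fieldType name="text_places" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>

With expand="true", a phrase query for "ship creek" will also match text indexed as "Ship Cr." or "Ship Ck.", because the variants are emitted at the same token position.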
I'm not aware of any existing functionality to solve your "bonus" problem.

Which is easier? iText or Mail Merge

I'm working on a project for a school fundraiser, and I'm supposed to output the results to a PDF or Word doc so that I can easily automate printing a sheet with the same page content but different results. I'm hoping to make the page look interesting as well, with bright colors and images.
I've been looking around and these two things caught my eye; which would you suggest I use, iText or Mail Merge with Office? (If you recommend one over the other, can you also add resources for me to use?)
Thank you!
Mail Merge, no question about it. Of course ultimately iText gives you the power to rewrite the whole page based on where you are sending it (like making a report), but that is not what you are looking for. If by "different results" you mean things like the donor's name and amount of donation, then go for the Mail Merge.
If you are saying you have all kinds of different bar charts and content differences per person, then I might think differently, but unless you are a super-amazing high school programmer, you aren't going to figure out iText in time to get that done. It is a relatively big deal, programming-wise, to put together a PDF from scratch using iText, compared to putting something together in Microsoft Word.
I agree with Jishai about keeping it simple if time is likely to be short (as it might be on a fundraiser schedule). The JODReports or Docmosis systems might also be handy, since they have command-line calls you can use to have documents generated based on a mail-merge requirement.
Hope that helps.
