Understand if two different PDFs are the same research paper


I'm thinking of writing a simple research paper manager.
The idea is to have a repository that stores the metadata of each paper:
paper_id -> [title, authors, journal, comments...]
Since it would be nice to be able to import a friend's paper dump,
I'm thinking about how to generate the paper_id of a paper: IMHO it should be derived
from the text of the PDF, to guarantee that two different collections have the same ids only for the same papers.
At the moment, I extract the text of the first page using the iText library (removing any annotations), and I compute a simhash fingerprint from the text.
The main problem is that sometimes the text is slightly different (yes, it happens! for example this and this), so I would like to be tolerant.
With simhash I can compute how similar the original documents are, so if the fingerprint is not in the repo, I'll have to iterate over the collection looking for
'near' fingerprints.
I'm not convinced by this method. Could you suggest a better way to produce a signature
(short, numerical or alphanumerical) for this kind of document?
UPDATE: I had this idea: divide the first page into 8 (more or less) non-overlapping squares covering the whole page, then take the text in each square
and generate a simhash signature. At the end I'll have an 8x64 = 512-bit signature, and I can consider
two papers the same if the sum of the differences between their sets of simhash signatures is under a certain threshold.

If you already have a function that takes two texts and returns a measure of their similarity, you do not have to iterate over the entire repository.
Given an article that is not in the repository, you can compare it only against articles of approximately the same length. For example, given an article that has 1000 characters, you compare it to articles having between 950 and 1050 characters. For this you need a data structure that maps length ranges to articles, and you will have to fine-tune the size of the range. Range too large: too many items in each range. Range too small: higher potential of a miss.
Of course this will fail on some edge cases. For example, if the second of two documents is simply the first one copy-pasted twice, you would probably want them to be considered equal, but you will not even compare them since they are too far apart in length. There are methods to deal with that too, but you probably 'Ain't gonna need it'.
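A minimal sketch of the length-range lookup described above, combined with a bit-difference check on 64-bit simhash values. The class and field names (PaperIndex, Entry, findNear) and the tolerance parameters are hypothetical, chosen only for illustration; it assumes the simhash and the text length have already been computed elsewhere.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical index that maps text length to the papers of that length. */
final class PaperIndex {
    /** One stored paper: id, 64-bit simhash of the first page, and text length. */
    static final class Entry {
        final String paperId;
        final long simhash;
        final int textLength;
        Entry(String paperId, long simhash, int textLength) {
            this.paperId = paperId;
            this.simhash = simhash;
            this.textLength = textLength;
        }
    }

    // Sorted by length so that a range of lengths can be queried cheaply.
    private final TreeMap<Integer, List<Entry>> byLength = new TreeMap<>();

    void add(Entry e) {
        byLength.computeIfAbsent(e.textLength, k -> new ArrayList<>()).add(e);
    }

    /** Number of differing bits between two 64-bit simhash values. */
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    /**
     * Returns stored papers whose length is within +/- lengthTolerance characters
     * and whose simhash differs from the query in at most maxBits bits.
     */
    List<Entry> findNear(long simhash, int textLength, int lengthTolerance, int maxBits) {
        List<Entry> hits = new ArrayList<>();
        Map<Integer, List<Entry>> window = byLength.subMap(
                textLength - lengthTolerance, true, textLength + lengthTolerance, true);
        for (List<Entry> bucket : window.values()) {
            for (Entry e : bucket) {
                if (hammingDistance(simhash, e.simhash) <= maxBits) {
                    hits.add(e);
                }
            }
        }
        return hits;
    }
}
```

For a 1000-character article, `findNear(hash, 1000, 50, 3)` would only look at papers of 950 to 1050 characters. Both thresholds are guesses that would need tuning on real data.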

Related

Parse PDF table and display it as CSV(Java)

I am trying to parse a table in a PDF file and display it as CSV. I have attached sample data from the PDF below (only a few columns) and sample output for the same. Each column width is fixed, let's say Company Name (18 chars), Amount (8 chars), Type (5 chars), etc. I tried using the iText and PDFBox jars to get each page's data and parse it line by line, but that does not seem to be a great solution, as the line breaks and page breaks in the PDF are not reliable. Please let me know if there is any other appropriate solution. We want to use open source software for this.

This is a very complex problem. There are even multiple master's dissertations about it.
An easy analogy: I have 5000 puzzle-pieces, all of them are perfectly square and could fit anywhere. Some of them have pieces of lines on them, some of them have snippets of text.
However, that does not mean it can't be done. It'll just take work.
General approach:
1) Use iText (specifically IEventListener) to get information on all rendering events for every page.
2) Select the rendering events that make sense for your application, such as PathRenderInfo and TextRenderInfo.
3) According to the spec, events in a PDF do not need to appear in reading order. Solve this problem by implementing a comparator over IEventData that sorts according to reading order. This implies you might have to implement some basic language detection, since not every language reads left-to-right.
4) Once the events are sorted, you can start clustering items together according to any of the various heuristics you find in the literature. For instance, two characters can be grouped into a snippet of text if they follow each other in the sorted list of events (meaning they appear next to each other in reading order), if the y-position does not differ too much (subscript and superscript might interfere with this), and if the x-position does not differ too much (kerning).
5) Continue clustering characters until you have formed words.
6) Once you have formed words, use a similar algorithm to form words into lines. Use PathRenderInfo to hold off on merging words if they intersect with a drawn line.
7) Once you have managed to create lines, look for tables. One possible approach is to apply a horizontal and vertical projection and look for those sub-areas of the page that (when projected) show a grid-like structure.
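The character-clustering part of the approach above can be sketched without any PDF library. In this illustration, Glyph is a hypothetical stand-in for whatever position data you extract from the rendering events (it is not an iText class), the sort assumes left-to-right text with y growing upwards, and yTolerance and maxGap are tuning values you would have to pick per document.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative clustering of positioned characters into words. */
final class GlyphClusterer {
    /** One character with its left edge x, baseline y and advance width, in page units. */
    static final class Glyph {
        final String text;
        final double x, y, width;
        Glyph(String text, double x, double y, double width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Sorts glyphs into an approximate reading order (rows from top to bottom,
     * then left to right) and merges neighbours into words.
     */
    static List<String> toWords(List<Glyph> glyphs, double yTolerance, double maxGap) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        // Quantise y into rows so that small baseline jitter does not break the ordering.
        sorted.sort(Comparator.comparingLong((Glyph g) -> -Math.round(g.y / yTolerance))
                .thenComparingDouble(g -> g.x));

        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : sorted) {
            boolean sameLine = prev != null && Math.abs(g.y - prev.y) <= yTolerance;
            boolean adjacent = prev != null && g.x - (prev.x + prev.width) <= maxGap;
            if (prev != null && !(sameLine && adjacent)) {
                words.add(current.toString());   // gap or line change: close the word
                current.setLength(0);
            }
            current.append(g.text);
            prev = g;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }
}
```

The same pattern (sort, then merge neighbours under a distance threshold) can be repeated one level up to join words into lines, with a larger horizontal gap.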
This high-level approach should make it painfully obvious why this is not a widely available thing. It's very hard to implement. It requires domain knowledge of PDF, fonts, and machine learning.
If you are ok with commercial solutions, try out pdf2Data. It's an iText add-on that features this exact functionality.
http://itextpdf.com/itext7/pdf2Data

BTrees and Disk Persistance

For some time I have been working on creating an index for a very large data set (around 190 million records). I have a B-tree that can insert entries (typically an object) and search for a key. While searching for a way to persist the data to files on disk, I came across this amazing article (http://www.javaworld.com/article/2076333/java-web-development/use-a-randomaccessfile-to-build-a-low-level-database.html#resources). It pretty much gives me a starting point.
There they index String keys to binary objects (blobs). Their file format is divided into 3 regions: a header (stores the start point of the indexes), an index (stores each index entry and its corresponding location) and a data region (stores the data). They use RandomAccessFile to access the data.
How do I define a similar file format for a B-tree? All I know is that for every read made to disk, I have to get one node (typically one block of 512 bytes). There are many similar questions on how to persist a B-tree, but it is a little difficult to understand the big picture of why a particular design was chosen, as in this question (Persisting B-Tree nodes to RandomAccessFile -[SOLVED]). Please share your thoughts.

Here is an alternative take on the question, based on problem specifics that have become known in the meantime. This post is based on the following assumptions:
record count about 190 million, fixed
keys are 64-byte hashes (a raw SHA-256 digest is 32 bytes; 64 bytes corresponds to its hex encoding or to a SHA-512 digest)
values are filenames: variable length, but sensible (average length < 64 bytes, max < page)
page size 4 KiByte
Efficient representation of filenames in a database is a different topic that cannot be addressed here. Should the filenames be awkward - longish on average and/or Unicode - then the hashing solution will punish you with increased disk read counts (more overflows, more chaining) or reduced average occupancy (more wasted space). A B-tree solution reacts somewhat more benignly, though, since an optimum tree can be constructed in any case.
The most efficient solution in this situation - and the simplest to implement by a wide margin - is hashing, since your keys are perfect hashes already. Take the first 23 bits of the hash as the page number, and lay out the pages like this:
page header
    uint32_t next_page;
    uint16_t key_count;
key/offset vector
    uint16_t value_offset;
    byte key[64];
... unallocated space ...
last arrived filename
...
2nd arrived filename
1st arrived filename
Values (filenames) are stored from the end of the page downwards, prefixed with their 16-bit length, and the key/offset vector grows upwards. That way neither low/high key counts nor short/long values can cause unnecessary waste of space, as would be the case with fixed-size structures. Nor do you have to parse variable-length structures during key searches. Apart from that I've aimed for the greatest possible simplicity - no premature optimisation. The bottom of the heap can be stored in the page header, in KO.[PH.key_count].value_offset (my preference), or computed as KO.Take(PH.key_count).Select(r => r.value_offset).Min(), whatever pleases you most.
The key/offset vector needs to be kept sorted on the keys so that you can use binary search but the values can be written as they arrive, they do not need to be in any particular order. If the page overflows, allocate a new one just like it at the current end of the file (growing the file by one page) and stash its page number in the appropriate header slot. This means that you can binary search within a page but all chained pages need to be read and searched one by one. Also, you do not need any kind of file header, since the file size is otherwise available and that's the only piece of global management information that needs to be maintained.
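A minimal sketch of the in-page binary search described above, reading one page through a ByteBuffer. Several details are assumptions made for illustration and are not fixed by the layout: big-endian integers, a 6-byte header at offset 0, key/offset entries packed directly after the header, and filenames stored as UTF-8. The class name HashPage is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Read-only view of one page of the layout: header, sorted key/offset vector, values at the end. */
final class HashPage {
    static final int PAGE_SIZE = 4096;
    static final int HEADER_SIZE = 6;            // uint32 next_page + uint16 key_count
    static final int KEY_SIZE = 64;
    static final int ENTRY_SIZE = 2 + KEY_SIZE;  // uint16 value_offset + key

    /** Page number of the next page in the chain; 0 is assumed to mean "no overflow page". */
    static long nextPage(ByteBuffer page) {
        return page.getInt(0) & 0xFFFFFFFFL;
    }

    /** Returns the filename stored for the key, or null if the key is not in this page. */
    static String lookup(ByteBuffer page, byte[] key) {
        int keyCount = page.getShort(4) & 0xFFFF;
        byte[] candidate = new byte[KEY_SIZE];
        int lo = 0, hi = keyCount - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int entry = HEADER_SIZE + mid * ENTRY_SIZE;
            for (int i = 0; i < KEY_SIZE; i++) {
                candidate[i] = page.get(entry + 2 + i);
            }
            int cmp = compareUnsigned(candidate, key);
            if (cmp == 0) {
                int valueOffset = page.getShort(entry) & 0xFFFF;
                int length = page.getShort(valueOffset) & 0xFFFF;   // 16-bit length prefix
                byte[] value = new byte[length];
                for (int i = 0; i < length; i++) {
                    value[i] = page.get(valueOffset + 2 + i);
                }
                return new String(value, StandardCharsets.UTF_8);
            }
            if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return null;
    }

    /** Compares two equally long byte arrays as unsigned values, byte by byte. */
    static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return 0;
    }
}
```

When lookup returns null and nextPage is non-zero, the caller would load the chained page and search it the same way.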
Create the file as a sparse file with the number of pages as indicated by your chosen number of hash key bits (e.g. 8388608 pages for 23 bits). Empty pages in a sparse file don't take up any disk space and read as all 0s, which works perfectly fine with our page layout/semantics. Extend the file by one page whenever you need to allocate an overflow page. Note: the 'sparse file' thing isn't very important here since almost all pages will have been written to when you're done building the file.
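As a rough illustration of the addressing scheme, the following sketch derives the page number from the leading bits of a hash and sizes the file accordingly. The names (PageAddressing, pageNumber, create, readPrimaryPage) are hypothetical. Whether setLength actually produces a sparse file depends on the operating system and file system, so treat that part as an assumption.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Maps the leading bits of a hash to a page number and a file offset. */
final class PageAddressing {
    static final int PAGE_SIZE = 4096;

    /** Interprets the first `bits` bits (1..32) of the hash as an unsigned page number. */
    static long pageNumber(byte[] hash, int bits) {
        long first32 = ((hash[0] & 0xFFL) << 24) | ((hash[1] & 0xFFL) << 16)
                | ((hash[2] & 0xFFL) << 8) | (hash[3] & 0xFFL);
        return first32 >>> (32 - bits);
    }

    /** Sizes the file to 2^bits pages; unwritten pages read back as zeros. */
    static void create(RandomAccessFile file, int bits) throws IOException {
        file.setLength((1L << bits) * PAGE_SIZE);
    }

    /** Reads the primary (non-overflow) page that the hash maps to. */
    static byte[] readPrimaryPage(RandomAccessFile file, byte[] hash, int bits) throws IOException {
        byte[] page = new byte[PAGE_SIZE];
        file.seek(pageNumber(hash, bits) * PAGE_SIZE);
        file.readFully(page);
        return page;
    }
}
```

With 23 bits the page numbers run from 0 to 8388607, which matches the 8388608 primary pages mentioned above; overflow pages are appended after them.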
For maximum efficiency you need to run some analyses on your data. In my simulation - with random numbers as stand-ins for the hashes, and on the assumption that average filename size is 62 bytes or less - the optimum turned out to be making 2^23 = 8388608 buckets/pages. This means that you take the first 23 bit of the hash as the page number to load. Here are the details:
# bucket statistics for K = 23 and N = 190000000 ... 7336,5 ms
average occupancy 22,6 records
0 empty buckets (min: 3 records)
310101/8388608 buckets with 32+ records (3,7%)
That keeps the chaining to a minimum, on average you need to read just 1.04 pages per search. Increasing the hash key size by one single bit to 24 reduces the expected number of overflowing pages to 3 but doubles the file size and reduces average occupancy to 11.3 records per page/bucket. Reducing the key to 22 bits means that almost all pages (98.4%) can be expected to overflow - meaning the file is virtually the same size as that for 23 bits but you have to do twice as many disk reads per search.
Hence you see how important it is to run a detailed analysis on the data to decide on the proper number of bits to use for hash addressing. You should run an analysis that uses the actual filename sizes and tracks the per-page overhead, to see what the actual picture looks like for 22 bits to 24 bits. It'll take a while to run but that's still way faster than building a multi-gigabyte file blindly and then finding that you have wasted 70% of space or that searches take significantly more than 1.05 page reads on average.
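Before running a full simulation, a quick back-of-the-envelope estimate is possible by modelling the number of records per bucket as Poisson distributed. This is a swapped-in approximation, not the simulation used above, and it ignores variable filename sizes: it simply assumes a bucket overflows once it holds `capacity` records (32 in the statistics above). The class name BucketEstimate is hypothetical.

```java
/** Poisson estimate of how many hash buckets overflow for a given number of address bits. */
final class BucketEstimate {
    /** Probability that a bucket receives at least `capacity` records. */
    static double overflowProbability(long records, int hashBits, int capacity) {
        double mean = (double) records / (1L << hashBits);  // average records per bucket
        double term = Math.exp(-mean);                      // P(X = 0)
        double below = 0.0;
        for (int k = 0; k < capacity; k++) {
            below += term;                                  // accumulate P(X = k)
            term *= mean / (k + 1);                         // step to P(X = k + 1)
        }
        return Math.max(0.0, 1.0 - below);
    }

    /** Expected number of overflowing buckets out of 2^hashBits. */
    static double expectedOverflows(long records, int hashBits, int capacity) {
        return overflowProbability(records, hashBits, capacity) * (1L << hashBits);
    }
}
```

Calling `overflowProbability(190_000_000L, bits, 32)` for 22, 23 and 24 bits gives a first impression of the trade-off; the analysis with real filename sizes is still needed for the final decision.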
Any B-tree based solution would be much more involved (read: complicated) but could not reduce the page read count per search below 1.000, for obvious reasons, and even that only on the assumption that a sufficient number of internal nodes can be kept cached in memory. If your system has such humongous amounts of RAM that data pages can be cached to a significant degree then the hashing solution will benefit just as much as one that is based on some kind of B-tree.
As much as I would like an excuse for building a screamingly fast hybrid radix/B+tree, the hashing solution delivers essentially the same performance for a tiny fraction of the effort. The only thing where B-treeish solutions can outdo hashing here is space efficiency, since it is trivial to construct an optimum tree for existing pre-sorted data.

There are plenty of open source key/value stores and full database engines - take a week off and start Googling. Even if you end up using none of them, you still need to study a representative cross section (architecture, design histories, key implementation details) to get enough of an overview of the subject matter so that you can make informed decisions and ask intelligent questions. For a brief overview, try to Google details on index file formats, both historic ones like IDX or NTX, and current ones used in various database engines.
If you want to roll your own then you might consider hitching yourself to the bandwagon of an existing format, like the dBASE variants Clipper and Visual FoxPro (my favourite). This gives you the ability to work your data with existing tools, including Total Commander plugins and whatnot. You don't need to support the full formats, just the single binary instance of the format that you choose for your project. Great for debugging, reindexing, ad hoc queries and so on. The format itself is dead simple and easy to generate even if you don't use any of the existing libraries. The index file formats aren't quite as trivial but still manageable.
If you want to roll your own from scratch then you've got quite a road ahead of you, since the basics of intra-node (intra-page) design and practice are poorly represented on the Internet and in literature. For example, some old DDJ issues contained articles about efficient key matching in connection with prefix truncation (a.k.a. 'prefix compression') and so on but I found nothing comparable out there on the 'net at the moment, except buried deeply in some research papers or source code repositories.
The single most important item here is the algorithm for searching prefix-truncated keys efficiently. Once you've got that, the rest more or less falls into place. I have found only one resource on the 'net, which is this DDJ (Dr Dobb's Journal) article:
Supercharging Sequential Searches by Walter Williams
A lot of tricks can also be gleaned from papers like
Efficient index compression in DB2 LUW
For more details and pretty much everything else you could do a lot worse than reading the following two books cover to cover (both of them!):
Goetz Graefe: Modern B-Tree Techniques (ISBN 1601984820)
Jim Gray: Transaction Processing. Concepts and Techniques (ISBN 1558601902)
An alternative to the latter might be
Philip E. Bernstein: Principles of Transaction Processing (ISBN 1558606238)
It covers a similar spectrum and it seems to be a bit more hands-on, but it does not seem to have quite the same depth. I cannot say for certain, though (I've ordered a copy but haven't got it yet).
These books give you a complete overview over all that's involved, and they are virtually free of fat - i.e. you need to know almost everything that's in there. They will answer gazillions of questions that you didn't know you had, or that you should have asked yourself. And they cover the whole ground - from B-tree (and B+tree) basics to detailed implementation issues like concurrency, locking, page replacement strategies and so forth. And they enable you to utilise the information that is scattered over the 'net, like articles, papers, implementation notes and source code.
Having said that, I'd recommend matching the node size to the architecture's RAM page size (4 KB or 8 KB), because then you can utilise the paging infrastructure of your OS instead of running afoul of it. And you're probably better off keeping index and blob data in separate files. Otherwise you couldn't put them on different volumes and the data would b0rken the caching of the index pages in subsystems that are not part of your program (hardware, OS and so forth).
I'd definitely go with a B+tree structure instead of watering down the index pages with data as in a normal B-tree. I'd also recommend using an indirection vector (Graefe has some interesting details there) in connection with length-prefixed keys. Treat the keys as raw bytes and keep all the collation/normalisation/upper-lower nonsense out of your core engine. Users can feed you UTF8 if they want - you don't want to have to care about that, trust me.
There is something to be said for using only suffix truncation in internal nodes (i.e. for distinguishing between 'John Smith' and 'Lucky Luke', 'K' or 'L' work just as well as the given keys) and only prefix truncation in leaves (i.e. instead of 'John Smith' and 'John Smythe' you store 'John Smith' and 7+'ythe').
It simplifies the implementation, and gives you most of the bang that could be got. I.e. shared prefixes tend to be very common at the leaf level (between neighbouring records in index order) but not so much in internal nodes, i.e. at higher index levels. Conversely, the leaves need to store the full keys anyway and so there's nothing to truncate and throw away there, but internal nodes only need to route traffic and you can fit a lot more truncated keys in a page than non-truncated ones.
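The two truncation schemes can be illustrated with a small sketch. For readability it works on Java Strings compared by char value, whereas a real engine would work on raw bytes as recommended above; the names KeyTruncation, Truncated, truncateLeaf and shortestSeparator are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Prefix truncation for leaf keys and suffix truncation for separator keys. */
final class KeyTruncation {
    /** A leaf entry: number of leading chars shared with the previous key, plus the remainder. */
    static final class Truncated {
        final int sharedPrefix;
        final String suffix;
        Truncated(int sharedPrefix, String suffix) {
            this.sharedPrefix = sharedPrefix;
            this.suffix = suffix;
        }
    }

    static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    /** Prefix truncation: each key is stored relative to its predecessor in sort order. */
    static List<Truncated> truncateLeaf(List<String> sortedKeys) {
        List<Truncated> out = new ArrayList<>();
        String prev = "";
        for (String key : sortedKeys) {
            int shared = commonPrefixLength(prev, key);
            out.add(new Truncated(shared, key.substring(shared)));
            prev = key;
        }
        return out;
    }

    /**
     * Suffix truncation: a shortest separator s with left < s <= right.
     * Requires left < right, which guarantees that right is longer than the common prefix.
     */
    static String shortestSeparator(String left, String right) {
        return right.substring(0, commonPrefixLength(left, right) + 1);
    }
}
```

For the examples above, the leaf stores 'John Smythe' as 7 plus 'ythe', and the separator between 'John Smith' and 'Lucky Luke' comes out as 'L'.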
Key matching against a page full of prefix-truncated keys is extremely efficient - on average you compare a lot less than one character per key - but it's still a linear scan, even with all the hopping forward based on skip counts. This limits effective page sizes somewhat, since binary search is more complicated in the face of truncated keys. Graefe has a lot of details on that. One workaround for enabling bigger node sizes (many thousands of keys instead of hundreds) is to lay out the node like a mini B-tree with two or three levels. It can make things lightning-fast (especially if you respect magic thresholds like 64-byte cache line size), but it also makes the code hugely more complicated.
I'd go with a simple lean and mean design (similar in scope to IDA's key/value store), or use an existing product/library, unless you are in search of a new hobby...

Fastest Hash algorithm in Java for Strings

To make it simple, my question is: how to hash a String (about 200 characters) as quickly as possible. Security is not important, but collisions ARE a big deal.
Note: After a quick investigation, it seems like MurmurHash3 might be the best choice. I am open to any comment saying otherwise, though.
First, I know that there are plenty of other similar questions, but I couldn't find a convincing answer yet.
I have a list of objects, each containing a list of about 3k paragraphs, which is saved to a database. Every X hours, those paragraphs are regenerated and I need to find out whether any paragraph has changed, and if so, push only the new paragraphs.
The quickest way I found to find the differences (knowing that most of the time the content will be identical) is to create a Merkle tree, save it to the DB, and iterate over the Merkle tree to find the differences, instead of comparing the paragraphs themselves.
This implies, in my case, that I will be creating tens of thousands of hashes per second to compare with what is in the DB. Therefore, I need a very efficient way to create those hashes. I don't care about security; I only need to ensure that the number of collisions remains very, very low.
What would be the best algorithm available in Java for that?
In my case, the main object is composed of Sections, which are composed of Languages, which are composed of Paragraphs. The comparison strategy is:
1) If the object hash is identical, stop, otherwise go to 2)
2) Loop over all Sections, keep only the Sections with a different hash
3) Loop over all Languages of those Sections, keep only the Languages with a different hash
4) Loop over all the Paragraphs of all those Languages; if the hash is different, then push the new content.
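The comparison strategy above amounts to a recursive walk that stops at every subtree whose hash is unchanged. A minimal sketch, assuming both trees have the same shape with children aligned by index (added or removed paragraphs are not handled) and using hypothetical names HashTreeDiff and Node:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Finds changed leaves by comparing a stored hash tree with a freshly computed one. */
final class HashTreeDiff {
    /** A node of the tree: object, Section, Language, or Paragraph (leaf). */
    static final class Node {
        final String name;
        final long hash;
        final List<Node> children;
        Node(String name, long hash, List<Node> children) {
            this.name = name;
            this.hash = hash;
            this.children = children;
        }
        static Node leaf(String name, long hash) {
            return new Node(name, hash, Collections.<Node>emptyList());
        }
    }

    /** Adds the names of all leaves whose hash differs to `out`. */
    static void changedLeaves(Node stored, Node fresh, List<String> out) {
        if (stored.hash == fresh.hash) {
            return;                       // identical subtree: nothing below needs checking
        }
        if (fresh.children.isEmpty()) {
            out.add(fresh.name);          // changed paragraph: push its new content
            return;
        }
        for (int i = 0; i < fresh.children.size(); i++) {
            // Assumption: same number of children on both sides, in the same order.
            changedLeaves(stored.children.get(i), fresh.children.get(i), out);
        }
    }

    static List<String> changedLeaves(Node stored, Node fresh) {
        List<String> out = new ArrayList<>();
        changedLeaves(stored, fresh, out);
        return out;
    }
}
```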

This amazing answer on Programmers Stack Exchange tells you all you need to know.
The short version is: use FNV-1a, a.k.a. the Fowler–Noll–Vo hash function. It has excellent performance, high randomness and low collisions.
Any further explanation I might shed on this question would just be a copy and paste from that Programmers.SE answer, which incidentally is the second highest voted answer on the entire site.
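For reference, a minimal sketch of the 64-bit variant of FNV-1a, using the published offset basis and prime. Hashing the UTF-8 bytes of the String is an assumption made here; any fixed encoding works as long as it is used consistently. The class name Fnv1a is hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** 64-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime. */
final class Fnv1a {
    private static final long OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long PRIME = 0x100000001b3L;

    static long hash64(byte[] data) {
        long h = OFFSET_BASIS;
        for (byte b : data) {
            h ^= (b & 0xFF);   // XOR first (this order is what makes it FNV-1a, not FNV-1)
            h *= PRIME;        // multiplication wraps modulo 2^64, as intended
        }
        return h;
    }

    static long hash64(String s) {
        return hash64(s.getBytes(StandardCharsets.UTF_8));
    }
}
```

With 64-bit hashes and tens of thousands of paragraphs, accidental collisions are unlikely, but you would still need to benchmark and check this against your own data.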
Some other thoughts:
Ultimately, you have a pretty niche use case. Most people aren't dealing with 1 billion entry datasets regularly. As such, you may have to do your own benchmarking.
That said, high randomness suggests that the algorithm is likely to scale well when hashing English text.
You haven't really talked about other issues; are you able to keep the entire data set in memory? What are your footprint requirements?
See also: Fastest Hash Algorithm for Text Data

How to search for multiple strings in a text file

I am working with text files. I want to implement a search algorithm in Java. I have a text file that I need to search.
If I want to find one word, I can do it by putting all the text into a HashMap and storing each word's occurrences. But is there an algorithm if I want to search for two strings (or maybe more)? Should I hash the strings in pairs?

It depends a lot on the size of the text file. There are usually several cases you should consider:
Lots of queries on very short documents (web pages, texts of essay length, etc.) with a text distribution like normal language. A simple O(n^2) algorithm is fine. For a query of length n, just take a window of length n and slide it over the text. Compare and move the window until you find a match. This algorithm does not care about words, so you treat the whole search as one big string (including spaces). This is probably what most browsers do. KMP or Boyer-Moore is not worth the effort, since the O(n^2) case is very rare.
Lots of queries on one large document. Preprocess your document and store it preprocessed. Common storage options are suffix trees and inverted lists. If you have multiple documents, you can build one document from them by concatenating them and storing the document boundaries separately. This is the way to go for document databases where the collection is almost constant.
If you have several documents with high redundancy and your collection changes often, use KMP or Boyer-Moore. For example, if you want to find certain sequences in DNA data and you often get new sequences to find as well as new DNA from experiments, the O(n^2) part of the naive algorithm would kill your time.
There are probably lots more possibilities that need different algorithms and data structures, so you should figure out which one is the best in your case.
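For the preprocessing case, an inverted list also answers the multi-word question directly: look up each word and intersect the results. A minimal sketch, assuming whole-word search, case-insensitive matching and a simple split on non-word characters (which only handles ASCII letters and digits well). The class name InvertedIndex and its methods are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Maps each word to the ids of the lines (or documents) that contain it. */
final class InvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Indexes one line or document under the given id. */
    void add(int id, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(id);
            }
        }
    }

    /** Returns the ids that contain every one of the given words. */
    Set<Integer> containingAll(String... words) {
        Set<Integer> result = null;
        for (String w : words) {
            Set<Integer> ids = postings.getOrDefault(
                    w.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(ids);   // copy, so the index itself is not modified
            } else {
                result.retainAll(ids);         // intersect with the next word's posting list
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }
}
```

Indexing each line of the file with its line number as id, `containingAll("quick", "fox")` returns the lines that contain both words, so there is no need to hash word pairs in advance.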

Some more detail is required before suggesting an approach:
Are you searching for whole words only or any substring?
Are you going to search for many different words in the same unchanged file?
Do you know the words you want to search for all at once?
There are many efficient (linear) search algorithms for strings. If possible I'd suggest using one that's already been written for you.
http://en.wikipedia.org/wiki/String_searching_algorithm
One simple idea is to use a sliding window hash with the window the same size as the search string. Then in a single pass you can quickly check to see where the window hash matches the hash of your search string. Where it matches you double check to see if you've got a real match.
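The sliding window hash idea is commonly known as the Rabin-Karp algorithm. A minimal sketch with a polynomial rolling hash follows; the base and modulus are arbitrary choices made for this illustration, and the class name RollingHashSearch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Substring search with a rolling hash over a window the size of the pattern. */
final class RollingHashSearch {
    private static final long MOD = 1_000_000_007L;
    private static final long BASE = 257;

    /** Returns the start indices of all occurrences of pattern in text. */
    static List<Integer> findAll(String text, String pattern) {
        List<Integer> hits = new ArrayList<>();
        int n = text.length();
        int m = pattern.length();
        if (m == 0 || m > n) {
            return hits;
        }
        long high = 1;                               // BASE^(m-1) mod MOD, weight of the leftmost char
        for (int i = 1; i < m; i++) {
            high = high * BASE % MOD;
        }
        long target = 0;
        long window = 0;
        for (int i = 0; i < m; i++) {
            target = (target * BASE + pattern.charAt(i)) % MOD;
            window = (window * BASE + text.charAt(i)) % MOD;
        }
        for (int start = 0; ; start++) {
            // Hashes can collide, so every hash match is double-checked character by character.
            if (window == target && text.regionMatches(start, pattern, 0, m)) {
                hits.add(start);
            }
            if (start + m >= n) {
                break;
            }
            window = (window - text.charAt(start) * high % MOD + MOD) % MOD;  // drop leftmost char
            window = (window * BASE + text.charAt(start + m)) % MOD;          // append next char
        }
        return hits;
    }
}
```

To search for several strings of the same length in one pass, the single target hash could be replaced by a set of target hashes.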

Need help in latent semantic indexing

I am sorry if my question sounds stupid :)
Can you please recommend any pseudocode or a good algorithm for an LSI implementation in Java?
I am not a math expert. I tried to read some articles on Wikipedia and other websites about
LSI (latent semantic indexing), but they were full of math.
I know LSI is full of math, but if I see some source code or an algorithm, I understand things more
easily. That's why I asked here, because so many gurus are here!
Thanks in advance

The idea of LSA is based on one assumption: the more often two words occur in the same documents, the more similar they are. Indeed, we can expect that the words "programming" and "algorithm" will occur in the same documents much more often than, say, "programming" and "dog-breeding".
The same goes for documents: the more common/similar words two documents have, the more similar they are. So, you can express the similarity of documents by the frequencies of words, and vice versa.
Knowing this, we can construct a co-occurrence matrix, where column names represent documents, row names represent words, and each cell cells[i][j] holds the frequency of word words[i] in document documents[j]. Frequency may be computed in many ways; IIRC, the original LSA uses the tf-idf index.
Having such a matrix, you can find the similarity of two documents by comparing the corresponding columns. How to compare them? Again, there are several ways. The most popular is cosine distance. You may remember from school maths that a matrix can be treated as a bunch of vectors, so each column is just a vector in some multidimensional space. That's why this model is called the "Vector Space Model". More on VSM and cosine distance here.
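As a small illustration of the column comparison, here is the cosine of the angle between two document vectors. It returns 1 for vectors pointing in the same direction and 0 for vectors with no words in common. Returning 0 for an all-zero vector is a convention chosen for this sketch; the class name Cosine is hypothetical.

```java
/** Cosine similarity between two equally long vectors, e.g. two columns of the matrix. */
final class Cosine {
    static double similarity(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;   // an empty document is treated as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```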
But we have one problem with such a matrix: it is big. Very, very big. Working with it is too computationally expensive, so we have to reduce it somehow. LSA uses SVD (singular value decomposition) to keep only the most "important" vectors. After the reduction, the matrix is ready to use.
So, the algorithm for LSA will look something like this:
Collect all documents and all unique words from them.
Extract frequency information and build co-occurrence matrix.
Reduce matrix with SVD.
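The first two steps of the list above can be sketched in plain Java. This illustration uses raw word counts rather than tf-idf and a naive tokenizer that splits on non-word characters; both are simplifying assumptions, and the class name TermDocumentMatrix is hypothetical. The SVD step is left to a matrix library.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

/** Word-by-document count matrix: rows are words, columns are documents. */
final class TermDocumentMatrix {
    final List<String> words;   // row labels in sorted order
    final double[][] cells;     // cells[i][j] = count of words.get(i) in document j

    TermDocumentMatrix(List<String> documents) {
        // Step 1: collect all documents and all unique words.
        List<List<String>> tokenized = new ArrayList<>();
        SortedSet<String> vocabulary = new TreeSet<>();
        for (String doc : documents) {
            List<String> tokens = new ArrayList<>();
            for (String t : doc.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!t.isEmpty()) {
                    tokens.add(t);
                }
            }
            tokenized.add(tokens);
            vocabulary.addAll(tokens);
        }
        words = new ArrayList<>(vocabulary);

        // Step 2: fill in the frequency of every word in every document.
        cells = new double[words.size()][documents.size()];
        for (int j = 0; j < tokenized.size(); j++) {
            for (String t : tokenized.get(j)) {
                cells[Collections.binarySearch(words, t)][j] += 1.0;
            }
        }
    }

    /** Returns column j, i.e. the vector that represents document j. */
    double[] column(int j) {
        double[] col = new double[words.size()];
        for (int i = 0; i < col.length; i++) {
            col[i] = cells[i][j];
        }
        return col;
    }
}
```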
If you're going to write an LSA library yourself, a good place to start is the Lucene search engine, which makes steps 1 and 2 much easier, together with some implementation of high-dimensional matrices with SVD capability, like Parallel Colt or UJMP.
Also pay attention to other techniques that grew out of LSA, like Random Indexing. RI uses the same idea and shows approximately the same results, but it doesn't use the full matrix stage and is completely incremental, which makes it much more computationally efficient.

This may be a bit late, but I always liked Sujit Pal's blog http://sujitpal.blogspot.com/2008/09/ir-math-with-java-tf-idf-and-lsi.html and I have written a bit on my site if you are interested.
The process is way less complicated than it is often written up as. Really, all you need is a library that can do singular value decomposition of a matrix.
If you are interested, I can explain it in a couple of short take-away bits:
1) You create a matrix/dataset/etc. with word counts of various documents: the different documents will be your columns, and the rows the distinct words.
2) Once you've created the matrix, you use a library like Jama (for Java) or SmartMathLibrary (for C#) and run the singular value decomposition. All this does is take your original matrix and break it up into three different parts/matrices that essentially represent your documents, your words, and a kind of multiplier (sigma); these are called the vectors.
3) Once you have your word, document and sigma vectors, you shrink them equally (to size k) by copying just the smaller parts of each vector/matrix, and then multiply them back together. Shrinking them kind of normalizes your data, and this is LSI.
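In matrix notation, steps 2) and 3) are usually written as below. The symbols are introduced here for illustration: A is the word-by-document matrix with m words and n documents, U holds the word vectors, V the document vectors, and Sigma is the diagonal matrix of singular values sorted from largest to smallest.

```latex
A = U \Sigma V^{T},
\qquad U \in \mathbb{R}^{m \times r},\;
\Sigma \in \mathbb{R}^{r \times r},\;
V \in \mathbb{R}^{n \times r},\;
r = \operatorname{rank}(A)

A_k = U_k \Sigma_k V_k^{T},
\qquad k < r
```

Here U_k and V_k keep only the first k columns of U and V, and Sigma_k keeps the k largest singular values. A_k has the same shape as A but rank k, which is the "shrink and multiply back together" step.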
here are some fairly clear resources:
http://puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html
http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf
http://www.soe.ucsc.edu/classes/cmps290c/Spring07/proj/Flynn_talk.pdf
Hope this helps you out a bit.
Eric
