Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
We explored many APIs, such as Tika, PDFBox and iText, to extract page numbers from a PDF file, but we were not able to do this. In iText we found PdfPageLabels.getPageLabels(reader), but the behaviour of this method is not uniform.
The reason why you don't find any software that is able to extract page numbers from a PDF is simple: the concept of a page number doesn't exist in PDF.
Allow me to predict your response.
"Wait a minute!" you say, "When I open a PDF in Adobe Reader, I can clearly see a page number in the document!"
Well yes, you can see that page number with your eyes and your human intelligence, but to a machine that number is just some text drawn on a canvas. A machine consuming the document has no idea what all the glyphs and lines and shapes on a page are about. Hence, software cannot give you the page number you see as a human. A machine doesn't know where to look!
If you know something about PDF, I can predict your next reply.
"Wait a minute!" you say, "What about Tagged PDF? Doesn't Tagged PDF mean that the semantics of a document are stored along with the representation?"
Well yes, when a PDF is tagged, a snippet of text knows that it is part of a title, or a paragraph, or a list, and so on. But Tagged PDF is there to define the structure of the real content. Page numbers, however, are not part of the real content. They are marked as artifacts along with headers, footers and other items on a page that are not considered to be real content. There is no way to distinguish page numbers from those other artifacts.
"Then what are these page labels about?" you ask.
Well, page labels are optional. They are present in some PDFs that are well conceived, but they will be absent in a large majority of the PDFs you'll find in the wild.
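To make the page-label idea concrete, here is a minimal Python sketch of how label ranges expand into one label per page. It assumes the ranges have already been read from the PDF; the dictionary layout and all function names are made up for illustration, while the style codes (D, r, R, a, A) and the prefix and start-number keys follow the page label dictionaries in the PDF specification. It does not read a PDF file, and it says nothing about the number printed on the page itself.

```python
def to_roman(n):
    """Convert a positive integer to lowercase roman numerals."""
    pairs = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
             (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
             (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def to_letters(n):
    """a..z for 1..26, then aa..zz for 27..52, and so on."""
    return chr(ord("a") + (n - 1) % 26) * ((n - 1) // 26 + 1)


def format_number(n, style):
    """Format n in one of the numbering styles; no style means no number."""
    if style == "D":
        return str(n)
    if style == "r":
        return to_roman(n)
    if style == "R":
        return to_roman(n).upper()
    if style == "a":
        return to_letters(n)
    if style == "A":
        return to_letters(n).upper()
    return ""  # the label then consists of the prefix only


def expand_page_labels(ranges, page_count):
    """ranges maps a zero-based start page index to a dict with optional
    keys "S" (style), "P" (prefix) and "St" (first number, default 1).
    Returns None when the document defines no page labels at all."""
    if not ranges:
        return None
    labels = []
    for page in range(page_count):
        # Find the last range that starts at or before this page.
        start = max((s for s in ranges if s <= page), default=None)
        if start is None:
            # Assumption for this sketch: fall back to the physical page number.
            labels.append(str(page + 1))
            continue
        entry = ranges[start]
        number = entry.get("St", 1) + (page - start)
        labels.append(entry.get("P", "") + format_number(number, entry.get("S")))
    return labels


print(expand_page_labels({0: {"S": "r"}, 2: {"S": "D"}}, 5))
print(expand_page_labels({}, 5))
```

The first call models a document with two roman-numbered front-matter pages followed by a decimal-numbered body. The second models a PDF without any page labels, which is the common case: there is simply nothing to expand.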
This is the long answer. The short answer is simple: You are asking for something that is impossible (in general, not only with iText, Tika, PdfBox, or any other tool you might try).
Closed 3 years ago.
I have hundreds of images of handwritten notes. They were written by different people but they are in sequence, so you know that, for example, person1 wrote img1.jpg -> img100.jpg. The style of handwriting varies a lot from person to person, but there are parts of the notes which are always fixed; I imagine that could help an algorithm (it helps me!).
I tried tesseract and it failed pretty badly at recognizing the text. I'm thinking, since each person has about 100 images, is there an algorithm I can train by feeding it a small number of examples, like 5 or fewer, so that it can learn from that? Or would that not be enough data? From searching around, it seems like I need to implement a CNN (e.g. this paper).
My knowledge of AI is limited, though. Is this something that I could still do using a library and some studying? If so, what should I do going forward?
This is called OCR and there has been progress. Actually, here is an example of how simple it is to parse an image file to text using tesseract:
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract

def ocr_core(file):
    # Open the image file and let Tesseract return the text it recognizes
    text = pytesseract.image_to_string(Image.open(file))
    return text

print(ocr_core('sample.png'))
BUT
I am not very sure that it can recognize different types of handwriting. You can give it a try yourself to find out. If you want to try the Python example, you need to import pytesseract, but first things first: install Tesseract on your OS and add it to your PATH.
There are many OCR engines out there and some perform better than others. However, this is a field that has improved a lot recently with deep neural networks. I would consider using a cloud provider such as Azure, Google Cloud or Amazon. You upload the image and they return the recognized text along with metadata.
For instance:
https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/
If you don't want to use cloud services for any reason, I would consider using TensorFlow... but some knowledge is required:
Tensorflow model for OCR
Closed 5 years ago.
I'm trying to add some data to a PDF with iText 7 in a Java application.
I can't manage to open the PDF in append mode. I looked for solutions online, but they all concern iText 5 (and use classes that don't exist any more).
What can I do?
It depends on what you want specifically:
1. merge two documents:
https://developers.itextpdf.com/content/itext-7-examples/itext-7-merging-pdf-documents
2. add content at the end of a document:
Similar to before, you could create a new document (to a byte output stream), and merge the two together.
3. add content to an existing page:
Hard to do, since that typically requires re-layout of the document, which no PDF-engine can currently do.
4. fill in forms in the document:
https://developers.itextpdf.com/content/itext-7-examples/itext-7-form-examples
5. add an attachment to the document:
https://developers.itextpdf.com/examples/miscellaneous/clone-embedded-files
extra (3):
Adding content to a PDF, in the middle of existing content is extremely hard.
To understand why, here is some information on how PDF documents are built internally:
PDF documents contain instructions for a viewer to render, rather than plain text
instructions and their arguments are grouped in 'objects'
objects can be compressed to reduce file size
a PDF document keeps an internal index of all of these objects, this is called the XREF table
the index inside a PDF document uses byte-offsets to tell a renderer where (in the file) an object can be found
Suppose you want to change (or add) something.
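You'd shift the byte-offsets of every object that comes after the change, so the XREF would no longer match the file. Unless the index is rebuilt, a viewer would not be able to find those objects again.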
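Here is a toy illustration of the byte-offset problem, in Python. This is not a real PDF writer: the build and found_at functions and the two objects are made up for the example, and the "index" is just a list of offsets rather than a real XREF table.

```python
def build(objects):
    """Concatenate objects into one byte string and record where each starts."""
    data = b"%PDF-1.4\n"
    offsets = []
    for number, content in enumerate(objects, start=1):
        offsets.append(len(data))
        data += b"%d 0 obj\n" % number + content + b"\nendobj\n"
    return data, offsets


def found_at(data, offset, number):
    """True if object `number` really starts at `offset`, as a viewer expects."""
    return data.startswith(b"%d 0 obj" % number, offset)


objects = [b"<< /Type /Catalog >>", b"<< /Type /Pages >>"]
data, offsets = build(objects)

# Edit the first object in place without touching the recorded offsets.
edited = data.replace(b"/Catalog", b"/Catalog /Lang (en)", 1)

print(found_at(data, offsets[1], 2))    # lookup against the original bytes
print(found_at(edited, offsets[1], 2))  # same offset after the in-place edit
```

Real tools get around this by rewriting the whole file with fresh offsets, or by appending the changed objects and an additional XREF section at the end of the file (an incremental update), rather than by patching bytes in the middle.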
Then there is the fact that the PDF does not contain layout information. If you added something new, and existing content would need to move, you need layout information (what objects make a sentence? which sentences make a paragraph?). Only by having layout information can you sensibly re-layout the document.
Closed 6 years ago.
I would like to be able to copy the content of an Excel cell, let's say B3 in the tab (or sheet) "my sheet", into a Word document.
I know exactly where to put the content of B3 in the Word document, but I don't know how I can do it using Java.
I just finished a project that creates an Excel workbook from scratch using Java. It was my first time doing it, but I would recommend Apache POI. To share some sources I found along the way, here is an overview of the core classes the library offers, with some useful method descriptions.
And if you find this is something you may want to use here are a bunch of examples that I found quite useful.
My answer is as non-technical as they come, but hopefully this is helpful in some way.
Edit: Just to give an example, you could do something like:
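// Sheet is an interface, so get one from a Workbook (XSSFWorkbook for .xlsx)
Workbook workbook = new XSSFWorkbook();
Sheet sheet = workbook.createSheet("my sheet");
sheet.createRow(rowNumber).createCell(cellNumber).setCellValue(someValue);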
And you can get fancy and iterate through every row, cell, column, etc. I found it to be pretty flexible. It even offers styling options.
Edit2: Just realized you don't need to set the Excel cell value. Oh well, there it is anyway. Still a useful library one way or the other.
Closed 5 years ago.
I need to compare two PDF files and get all the visual differences between them. I know there are some related questions on Stack Overflow, but they do not fulfil my need.
I'm currently using PDFBox to generate images for pages in PDF and comparing the bytes of the images.
With this approach I can tell that a particular page differs.
But I need to find finer details, such as the font size of some text; for example, that "The text" differs on page 6 of the PDFs.
Not only for text but I need to take care of all the visual differences such as images, text in the charts etc.
Please suggest some way to achieve this.
PS: I tried using Apache Tika, but I get the sense that it can only be used to get structured text in XHTML and metadata. Fine details such as font size and font weight do not appear in the structured text. Please correct me if I've got it wrong.
PDF to image using Java
Convert PDF to thumbnail image in Java (there's an example of pdf-renderer use here)
https://www.google.com.br/search?q=PixelGraber&ie=utf-8&oe=utf-8&rls=org.mozilla:pt-BR:official&client=firefox-a&gws_rd=cr&ei=K1PhUqD2Jei0sQTQs4DoAw
A good library for converting PDF to TIFF?
Convert jpeg/png to an array of pixels in java
int pixels array to bmp in java
Finding pixel position
Get Pixel Color around an image
For extraction of text using PDFBox: Extracting text from PDF file using pdfbox
There are classes in PDFBox for detecting font position, type, size and maybe (didn't search deeper) other settings. (Links below) You could, then, extract text from both PDFs, compare them to check if texts are equal, then - if they are equal - compare their format. If there's something different, mark for display into another text, image or PDF.
http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/util/TextPosition.html
http://pdfbox.apache.org/docs/1.8.2/javadocs/org/apache/pdfbox/pdmodel/graphics/PDFontSetting.html
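The comparison step could look roughly like this Python sketch. It assumes the text has already been extracted from both PDFs into lists of (text, font name, font size) tuples, one per run of text. That tuple layout, the function name and the sample values are invented for the illustration and are not PDFBox output.

```python
def compare_runs(runs_a, runs_b):
    """Report runs whose text, font name or font size differs.

    Each run is a (text, font_name, font_size) tuple. Runs are compared
    position by position, so the format checks are only meaningful where
    the text of both documents is the same.
    """
    differences = []
    for index, (a, b) in enumerate(zip(runs_a, runs_b)):
        text_a, font_a, size_a = a
        text_b, font_b, size_b = b
        if text_a != text_b:
            differences.append((index, "text", text_a, text_b))
            continue  # format comparison makes no sense for different text
        if font_a != font_b:
            differences.append((index, "font", font_a, font_b))
        if size_a != size_b:
            differences.append((index, "size", size_a, size_b))
    return differences


# Made-up sample data standing in for two extracted pages.
page_a = [("The text", "Helvetica", 12.0), ("Summary", "Helvetica-Bold", 14.0)]
page_b = [("The text", "Helvetica", 10.0), ("Summary", "Helvetica-Bold", 14.0)]
print(compare_runs(page_a, page_b))
```

Each reported entry carries the position of the run, what kind of difference was found, and both values, which is enough to mark the difference in another text, image or PDF.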
Check out this Java package: https://java.net/projects/pdf-renderer
You can convert the pdf to an image and then traverse the image as a 2D array and compare differences like that.
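As a rough Python sketch of that traversal, assuming both pages have already been rendered to equally sized 2D arrays of pixel values (the nested lists below stand in for real rendered images, and the function name is made up):

```python
def diff_bounding_box(image_a, image_b):
    """Return (top, left, bottom, right) of the region where two equally
    sized 2D pixel arrays differ, or None if they are identical."""
    if len(image_a) != len(image_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(image_a, image_b)
    ):
        raise ValueError("images must have the same dimensions")

    # Collect the (row, column) position of every pixel that differs.
    changed = [
        (y, x)
        for y, (row_a, row_b) in enumerate(zip(image_a, image_b))
        for x, (pixel_a, pixel_b) in enumerate(zip(row_a, row_b))
        if pixel_a != pixel_b
    ]
    if not changed:
        return None
    rows = [y for y, _ in changed]
    cols = [x for _, x in changed]
    return min(rows), min(cols), max(rows), max(cols)


# Tiny made-up "pages": 0 is background, 1 is ink.
page_a = [[0, 0, 0, 0],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]
page_b = [[0, 0, 0, 0],
          [0, 1, 0, 0],
          [0, 0, 1, 0]]
print(diff_bounding_box(page_a, page_b))
```

The bounding box tells you where on the page to look, which narrows things down further than comparing the raw bytes of the whole image.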
Closed 3 years ago.
I want to write code in either Java or PHP (CodeIgniter) to extract information such as the email and phone number of a user uploading his resume or CV to the site. Basically I want to build a CV parser.
Need help for this.
thanks
EDIT
The cv format will be in doc.
Since there is no standard CV format, parsing will be next to impossible.
Instead, consider collecting contact information in an HTML form when they upload.
I'd suggest you to build it using a set of regular expressions.
If you just want to extract phone number and email the parser is very simple. It will work almost 100% for emails and (I believe) 98% for phone numbers.
If you wish to extract other information it will be more complicated because there are no standards for CVs; information may be formatted in different ways. Anyway, good luck!
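For what it's worth, here is a minimal Python sketch of the regular-expression approach (the same patterns should carry over to Java or PHP regex functions with little change). It assumes the CV has already been converted to plain text. The patterns are deliberately loose illustrations that have not been validated against real CVs, and the function name and sample data are made up.

```python
import re

# Loose patterns for illustration only; real-world data will need tuning.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Optional leading "+", then digits mixed with spaces, dots, dashes or
# parentheses, at least 9 characters in total and ending in a digit.
PHONE_RE = re.compile(r"\+?\d[\d ().-]{7,}\d")


def extract_contacts(text):
    """Return all email-like and phone-like strings found in the text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


sample = "Jane Doe\nEmail: jane.doe@example.com\nPhone: +1 (555) 123-4567\n"
print(extract_contacts(sample))
```

Note that a loose phone pattern like this one also matches things that are not phone numbers, for instance a year range written as "2010 - 2014", so the results still need a sanity check.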
You should use Python and write your own scraper. It's easy and can be done really quickly in your case with modules like Beautiful Soup, urllib2 ...
what its this all about
beautiful soup documentation
Ditto AlexR. If ALL you want to find is email address and phone number, you could scan for strings of characters in the appropriate format. A couple of simple regular expressions could do that fairly reliably. Even that wouldn't be 100%. If someone included, "Learned Java@Technocorp. US citizen." etc, you might easily be fooled into thinking that's an email address of "java@technocorp.us". Okay, that's a strained example, but it's the sort of thing that shoots down natural language parsing.
If you want more than that, there is no easy answer. You could search for keywords, like to find where he went to school you could look for the words "college" or "university". But even then, someone might put "Graduate of Foobar College" or "College: Foobar" or "BA from Foobar" or many many other possible formats.
As @Corbin said, there is no standard CV format. It will be quite difficult to parse with 100% accuracy.
Though, you can try Apache Tika - A Content Analysis Toolkit to parse resumes in doc/docx format. Tika also supports many other document formats, including PDF, TXT, XML, ODF, etc.
Btw, extracting the email and phone number from a resume can be achieved with a few lines of code with the help of regex, after getting the whole contents of the CV using Apache Tika.
Let me know if you get stuck.
Hope this helps!
Note- (I am working on resume summarizer).