I need to compare two PDF files and find all the visual differences between them. I know there are related questions on Stack Overflow, but they don't fulfill my need.
I'm currently using PDFBox to render the pages of each PDF to images and comparing the bytes of those images.
With this approach I can tell that a particular page differs.
But I also need finer details, such as the font size of some text; for example, that "The text" differs on, say, page 6 of the two PDFs.
And not only text: I need to cover all visual differences, such as images, text in charts, etc.
Please suggest a way to achieve this.
PS: I tried Apache Tika, but my impression is that it only produces structured text as XHTML plus metadata; fine details such as font size and font weight do not appear in the structured text. Please correct me if I'm wrong.
Some related questions and resources:
PDF to image using Java
Convert PDF to thumbnail image in Java (there's an example of pdf-renderer use here)
https://www.google.com.br/search?q=PixelGraber&ie=utf-8&oe=utf-8&rls=org.mozilla:pt-BR:official&client=firefox-a&gws_rd=cr&ei=K1PhUqD2Jei0sQTQs4DoAw
A good library for converting PDF to TIFF?
Convert jpeg/png to an array of pixels in java
int pixels array to bmp in java
Finding pixel position
Get Pixel Color around an image
For extraction of text using PDFBox: Extracting text from PDF file using pdfbox
There are classes in PDFBox for detecting font position, type, size and perhaps (I didn't search deeper) other settings (links below). You could then extract text from both PDFs, compare the texts to check whether they are equal, and, if they are, compare their formatting. If something differs, mark it for display in another text file, image or PDF.
http://pdfbox.apache.org/docs/1.8.3/javadocs/org/apache/pdfbox/util/TextPosition.html
http://pdfbox.apache.org/docs/1.8.2/javadocs/org/apache/pdfbox/pdmodel/graphics/PDFontSetting.html
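To illustrate, here is a minimal sketch (assuming PDFBox 2.x, where these classes live in org.apache.pdfbox.text; the FontInfoStripper name is made up) that records each glyph's position, font and size so the records from the two PDFs can be diffed:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Hypothetical stripper: collects one "glyph @ (x,y) font/size" record per glyph.
public class FontInfoStripper extends PDFTextStripper {
    public final List<String> records = new ArrayList<>();

    public FontInfoStripper() throws IOException {
        super();
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        records.add(text.getUnicode() + " @ (" + text.getXDirAdj() + "," + text.getYDirAdj()
                + ") " + text.getFont().getName() + " " + text.getFontSizeInPt() + "pt");
        super.processTextPosition(text);
    }
}

Run new FontInfoStripper().getText(PDDocument.load(file)) over both documents and diff the two records lists: equal text with unequal font or position entries points to a formatting difference.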
Check out this Java package: https://java.net/projects/pdf-renderer
You can convert the PDF to an image and then traverse the image as a 2D array, comparing pixels as you go; a rough sketch follows.
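For example, a minimal sketch using PDFBox's PDFRenderer (PDFBox is already in use here; the pdf-renderer package above offers similar rendering, and the 150 DPI value and file names are placeholders):

import java.awt.image.BufferedImage;
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

public class PixelDiff {
    public static void main(String[] args) throws Exception {
        try (PDDocument a = PDDocument.load(new File("a.pdf"));
             PDDocument b = PDDocument.load(new File("b.pdf"))) {
            // Render the first page of each document at the same resolution.
            BufferedImage imgA = new PDFRenderer(a).renderImageWithDPI(0, 150);
            BufferedImage imgB = new PDFRenderer(b).renderImageWithDPI(0, 150);
            // Walk the overlapping area pixel by pixel and report mismatches.
            for (int y = 0; y < Math.min(imgA.getHeight(), imgB.getHeight()); y++) {
                for (int x = 0; x < Math.min(imgA.getWidth(), imgB.getWidth()); x++) {
                    if (imgA.getRGB(x, y) != imgB.getRGB(x, y)) {
                        System.out.println("Pixel differs at (" + x + "," + y + ")");
                    }
                }
            }
        }
    }
}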
I have hundreds of images of handwritten notes. They were written by different people, but they are in sequence, so you know that, for example, person 1 wrote img1.jpg through img100.jpg. The style of handwriting varies a lot from person to person, but some parts of the notes are always fixed; I imagine that could help an algorithm (it helps me!).
I tried Tesseract and it failed pretty badly at recognizing the text. Since each person has about 100 images, is there an algorithm I can train by feeding it a small number of examples, say 5 or fewer, so that it learns from them? Or would that not be enough data? From searching around, it looks like I need to implement a CNN (e.g. this paper).
My knowledge of AI is limited, though. Is this something I could still do using a library and some studying? If so, how should I proceed?
This is called OCR, and there has been progress in the field. Here is an example of how simple it is to parse an image file to text using Tesseract:
try:
    from PIL import Image  # Pillow provides the Image module
except ImportError:
    import Image  # fall back to the legacy PIL package
import pytesseract

def ocr_core(file):
    # Open the image and let Tesseract extract whatever text it finds.
    text = pytesseract.image_to_string(Image.open(file))
    return text

print(ocr_core('sample.png'))
BUT I am not at all sure that it can recognize different styles of handwriting; give it a try yourself to find out. To run the Python example you need to import pytesseract, but first things first: install Tesseract itself on your OS and add it to your PATH.
There are many OCRs out there, and some perform better than others. This is a field that has improved a lot recently thanks to deep neural networks. I would consider using a cloud provider such as Azure, Google Cloud or Amazon: you upload the image and they return the metadata.
For instance:
https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/
If you don't want to use cloud services for any reason, I would consider using TensorFlow, but some knowledge is required:
Tensorflow model for OCR
I'm trying to add some data to a PDF with iText 7 in a Java application.
I can't manage to open the PDF in append mode. I looked for solutions online, but they all concerned iText 5 (and use classes that no longer exist).
What can I do?
It depends on what you want specifically:
merge two documents:
https://developers.itextpdf.com/content/itext-7-examples/itext-7-merging-pdf-documents (a minimal sketch follows this list)
add content at the end of a document:
Similar to before, you could create a new document (to a byte output stream), and merge the two together
add content to an existing page:
Hard to do, since that typically requires re-laying out the document, which no PDF engine can currently do.
fill in forms in the document:
https://developers.itextpdf.com/content/itext-7-examples/itext-7-form-examples
add an attachment to the document:
https://developers.itextpdf.com/examples/miscellaneous/clone-embedded-files
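For the merge case (first item above), a minimal sketch using iText 7's PdfMerger class from the kernel module; the file names are placeholders:

import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.kernel.utils.PdfMerger;

// Append all pages of second.pdf to first.pdf, writing the result to merged.pdf.
PdfDocument dest = new PdfDocument(new PdfReader("first.pdf"), new PdfWriter("merged.pdf"));
PdfDocument src = new PdfDocument(new PdfReader("second.pdf"));
PdfMerger merger = new PdfMerger(dest);
merger.merge(src, 1, src.getNumberOfPages());
src.close();
dest.close();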
Extra notes on the third item (adding content to an existing page):
Adding content to a PDF in the middle of existing content is extremely hard.
To understand why, here is some information on how PDF documents are built internally:
PDF documents contain instructions for a viewer to render, rather than plain text
instructions and their arguments are grouped in 'objects'
objects can be compressed to reduce file size
a PDF document keeps an internal index of all of these objects, this is called the XREF table
the index inside a PDF document uses byte-offsets to tell a renderer where (in the file) an object can be found
Suppose you want to change (or add) something: you'd mess up all the byte offsets in the XREF table, and no viewer would be able to find any object again.
Then there is the fact that a PDF does not contain layout information. If you added something new and existing content had to move, you would need layout information (which objects make up a sentence? which sentences make up a paragraph?). Only with layout information can you sensibly re-lay out the document.
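That said, the format's own escape hatch is the incremental update: new objects and a new XREF section are appended after the original bytes, so the old offsets stay valid. In iText 7 this corresponds to opening the document in append mode, as in this minimal sketch (file names and coordinates are placeholders):

import com.itextpdf.kernel.font.PdfFontFactory;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.kernel.pdf.StampingProperties;
import com.itextpdf.kernel.pdf.canvas.PdfCanvas;

public class AppendModeExample {
    public static void main(String[] args) throws Exception {
        // Append mode: the original bytes are left untouched and the
        // changes are written after them as an incremental update.
        PdfDocument pdf = new PdfDocument(
                new PdfReader("in.pdf"),
                new PdfWriter("out.pdf"),
                new StampingProperties().useAppendMode());
        // Safest kind of addition: a brand-new page at the end.
        PdfCanvas canvas = new PdfCanvas(pdf.addNewPage());
        canvas.beginText()
              .setFontAndSize(PdfFontFactory.createFont(), 12)
              .moveText(36, 750)
              .showText("Appended without touching the original pages")
              .endText();
        pdf.close();
    }
}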
I have an image in a byte[] variable and my goal is to convert it to JPG format and then create a BufferedImage variable from it.
Speed is very important here. Just to create a BufferedImage from the byte[] on a 500kb image takes 0.5 seconds.
One approach (but a very slow one) is:
create a BufferedImage from the image byte[]
use ImageIO.write to convert the image to jpg and write it to disk
read the image from disk and create a BufferedImage from it
Is there any faster way to do this, please?
Edit: the byte array contains the content of a valid PNG, JPG or GIF image that I have read from the HDD.
There's a library called OpenCV. It's a C/C++ library, but it also supports Java through a JNI wrapper. I have to mention that OpenCV specializes in computer vision and general image processing, not in compressing images or in rocket-science methods for writing images to disk.
A JPG image of 500 kB is probably much larger as raw data once you read it into memory.
Honestly, if I were you, I'd stick with the stuff that's already there in the Java API (such as ImageIO).
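For example, the whole conversion can stay in memory by pairing ImageIO with byte-array streams, which removes the disk round-trip from the three-step approach in the question (a minimal sketch; imageBytes and toJpgImage are placeholder names):

import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

public static BufferedImage toJpgImage(byte[] imageBytes) throws IOException {
    // Decode the original PNG/JPG/GIF bytes directly from memory.
    BufferedImage original = ImageIO.read(new ByteArrayInputStream(imageBytes));

    // Re-encode as JPEG into a byte buffer (images with an alpha channel
    // may need to be redrawn onto an RGB image first).
    ByteArrayOutputStream jpgBuffer = new ByteArrayOutputStream();
    ImageIO.write(original, "jpg", jpgBuffer);

    // Decode the JPEG bytes back into a BufferedImage.
    return ImageIO.read(new ByteArrayInputStream(jpgBuffer.toByteArray()));
}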
If you plan on processing loads of images, you could build a list of the files you wish to convert, design your application to be multi-threaded, and spread the tasks over multiple threads, as sketched below. That's really the way to win back time.
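A rough sketch of that idea, reusing the hypothetical toJpgImage method from above (allImageBytes is a placeholder for your list of images):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Spread the conversions over one thread per CPU core.
ExecutorService pool = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors());
for (byte[] imageBytes : allImageBytes) {         // List<byte[]> of images to convert
    pool.submit(() -> toJpgImage(imageBytes));    // each conversion runs in parallel
}
pool.shutdown();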
I have a web app that I am changing to use xPressions from EMC2. At one point xPressions returns a PDF document inside a Java servlet. Before we added xPressions, we would combine several of these PDFs into one large PDF and send it back to the user/screen, but xPressions can only process one PDF at a time, and it returns each PDF as a byte[] array. So I am trying to find a way to take the byte[] arrays and combine them into one large PDF to send back to the user/screen.
Before we had xPressions, we were using an old version of Big Faceless (bfo.com) to combine the individual PDFs into one PDF in the servlet, but I have not been able to turn the byte[] arrays into a valid PDF using the old bfo.com software. I have searched on Google and here on Stack Overflow for another technique and found answers that are close, but most use Linux or C#. Also, these PDFs are created inside the Java servlet and do not exist on a hard drive where I could read them in and convert them; I have to work with the byte[] arrays. So, does anyone have any ideas for me? Thanks in advance!
You can use PDFBox to merge your PDF files. The PDFMergerUtility class has an addSource method that takes an InputStream, so you can wrap each byte array in a ByteArrayInputStream and add it as a source. A sketch (assuming PDFBox 2.x, where the class lives in org.apache.pdfbox.multipdf; firstPdfBytes, secondPdfBytes and response are placeholders):
import java.io.ByteArrayInputStream;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

PDFMergerUtility merger = new PDFMergerUtility();
merger.addSource(new ByteArrayInputStream(firstPdfBytes));   // wrap each byte[] in a stream
merger.addSource(new ByteArrayInputStream(secondPdfBytes));
merger.setDestinationStream(response.getOutputStream());     // stream the merged PDF to the client
merger.mergeDocuments();
We explored many APIs, such as Tika, PDFBox and iText, to extract the page number from a PDF file, but we were not able to do it. In iText we found PdfPageLabels.getPageLabels(reader), but the behaviour of this method is not uniform.
The reason why you don't find any software that is able to extract page numbers from a PDF is simple: the concept of a page number doesn't exist in PDF.
Allow me to predict your response.
*"Wait a minute!" you say, "When I open a PDF in Adobe Reader, I can clearly see a page number in the document!"
Well yes, you can see that page number with your eyes and your human intelligence, but to a machine that number is just some text drawn on a canvas. A machine consuming the document has no idea what all the glyphs, lines and shapes on a page are about. Hence, software cannot give you the page number you see as a human: a machine doesn't know where to look!
If you know something about PDF, I can predict your next reply.
"Wait a minute!" you say, "What about Tagged PDF? Doesn't Tagged PDF mean that the semantics of a document are stored along with the representation?"
Well yes, when a PDF is tagged, a snippet of text knows that it is part of a title, or a paragraph, or a list... But Tagged PDF is there to define the structure of the real content. Page numbers, however, are not part of the real content: they are marked as artifacts, along with headers, footers and other items on a page that are not considered real content. There is no way to tell page numbers apart from the other artifacts.
"Then what are these page labels about?" you ask.
Well, page labels are optional. They are present in some PDFs that are well conceived, but they are absent from the large majority of the PDFs you'll find in the wild.
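When labels are present, you can read them; here is a minimal sketch using the iText 5 class mentioned in the question (my understanding is that getPageLabels returns null when the document defines no page labels):

import com.itextpdf.text.pdf.PdfPageLabels;
import com.itextpdf.text.pdf.PdfReader;

PdfReader reader = new PdfReader("input.pdf");          // placeholder file name
String[] labels = PdfPageLabels.getPageLabels(reader);  // null when no labels are defined
if (labels == null) {
    System.out.println("This PDF defines no page labels.");
} else {
    for (int i = 0; i < labels.length; i++) {
        System.out.println("Page " + (i + 1) + " is labeled \"" + labels[i] + "\"");
    }
}
reader.close();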
This is the long answer. The short answer is simple: you are asking for something that is impossible (in general, not only with iText, Tika, PDFBox, or any other tool you might try).