I'm using the Google Drive API to store and retrieve PDF files, and I would like to query these files using the search parameters.
Before I start implementing this, I would like to know how Google handles the indexing of large PDF files (600+ pages, 25 MB+). I am asking specifically about text-based PDFs, which don't need OCR.
I've tried some searches on the Drive website, and they don't always work.
I would like to know whether there are any limitations and what they are.
According to this page for PDFs with OCR:
The maximum size for images (.jpg, .gif, .png) and PDF files (.pdf) is 2 MB. For PDF files, we only look at the first 10 pages when searching for text to extract.
And this page for PDFs with text:
You can search for text in PDF and image files by:
Typing a query in the search box in Google Drive on the web.
Opening the Google Drive viewer and using the search box in the upper right.
In theory you should be able to search the first 100 pages of any text documents or text-based PDFs that you've uploaded. You'll also be able to search for text found on the first ten pages of any image PDFs on your Drive.
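For reference, here is a minimal sketch of how such a search could be expressed through the API's `q` search parameter, assuming the `fullText contains`, `mimeType` and `trashed` query terms of the Drive file search syntax. The class and method names are hypothetical, and the backslash escaping is a defensive assumption. The sketch only builds the query string; whether text deep inside a large PDF actually matches still depends on the indexing limits asked about above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: builds the value of the Drive API `q` search parameter. */
final class DrivePdfQuery {

    private DrivePdfQuery() {}

    /** Escapes a user-supplied term for use inside a single-quoted query string. */
    static String escape(String term) {
        // Backslashes first, so the ones added for quotes are not doubled afterwards.
        // Escaping backslashes at all is an assumption made for safety.
        return term.replace("\\", "\\\\").replace("'", "\\'");
    }

    /** Full-text search restricted to PDF files that are not in the trash. */
    static String fullTextInPdfs(String term) {
        return "mimeType = 'application/pdf'"
                + " and trashed = false"
                + " and fullText contains '" + escape(term) + "'";
    }

    /** URL-encodes the query so it can be sent as the q parameter of a list request. */
    static String asUrlParameter(String query) {
        return "q=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }
}
```

Calling `DrivePdfQuery.fullTextInPdfs("owner's manual")` yields a query that can be passed to the client library's list call or, via `asUrlParameter`, appended to a raw HTTP request.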
Related
I'm using PDFBox (1.8) to handle PDFs on Windows (7 and above). I need to take an input PDF and convert it to a PDF made of the same pages, but with each page rendered as an image (no selectable text, etc.). With small files I have no problem, but with bigger files I run into massive memory use and don't know how to proceed.
I will post some code if it helps, but the approach I'm using is simple: create a document from all the pages of the source PDF, each saved as an image.
I'm looking for a more memory- and time-efficient way to do this (I have to handle PDFs with 1,000 to 2,000 pages).
I would like to implement lazy loading of PDFs. From what I've read on the forums, we need to make the PDF documents 'linearized'; the first few pages will then load quickly because the page references are stored at the start of the file. Will this also resolve the problem below?
AWS charges separately for data transfer. Users want to see only the first pages, but our system has to download the entire PDF document, which may be a huge file, so we have to pay more. If linearized PDF solves this, how can it be implemented in Java for the download?
The way a linearized PDF is processed by a PDF reader is implementation dependent, but I dare to say that in all cases, it is very likely that the reader will just continue to download the rest of the file in the background after showing the first page. The goal of linearization is not to prevent the rest of the file from being downloaded.
An alternative could be to split the file into sections and provide bookmarks that point to external PDF documents, so that users can navigate through all the sections.
I know that PDFBox and iText already exist, but they can't extract visual content, and they need the PDF to be available offline. I want a way to do text and visual content extraction online, without downloading the PDF file first. What API or library is there for Java?
EDIT: for those who find this unclear, some more explanation:
With any HTML parser you can parse a page online, build the DOM or SAX tree, walk its elements, and extract photos and text based on the content of the nodes. For photos you at least get the corresponding HTML tags, and for text you get the tags plus the actual text. Is there anything similar for PDFs, that is, a way to go through text and images without downloading the file?
Gnostice PDFOne (for Java) has a getPageElements() method that can parse a PDF page for text and image elements. Text in a PDF is not held in a DOM as in an HTML or XML document. It is simply placed at various x-y coordinates, which happen to render as well-formatted text. However, PDFOne has some PDF text extraction methods that reconstruct those text elements into user-friendly sentences. DISCLOSURE: I work for the company that makes this library.
PDFImageStream can do that. There is a free version with only one restriction: it can only be used in single-threaded applications.
Dear Stack Overflow developers, I need your help. I am stuck using Apache Lucene in a Java Swing application. The problem is complex enough that I'm not even sure how to ask it.
Please try to understand my actual requirement.
The case is simple: I have to provide HTML files that the client can access in a Swing application, and for the search facility I decided to use Apache Lucene indexing. This gives me search, but now I want to display the contents of the HTML files that matched the search criteria. I'm using Swing, and JEditorPane is the control in which I have to display the contents of the HTML file. Please suggest how I should index the HTML files and how I can get their contents back from the Lucene index.
The HTML files contain not only text but also links, images, etc.
Thanks in advance for your help.
Regards
In one of our projects where we employed Lucene for full text indexing & search, we handled HTML files as follows:
Stored the HTML document as is on disk (you can store in the DB as well).
Using Jericho HTMLParser's HTML->Text converter, we extracted the text, links etc., out of the HTML documents.
The Lucene document has attributes that store metadata about the HTML file, in addition to the text content of the HTML in tokenized form.
Used StandardAnalyzer to keep certain tokens like email, website links as is during the tokenization process before indexing.
Upon searching the index, the hits returned contained the metadata of the HTML files that matched the criteria. So, we were able to identify the HTML content to be displayed for a given search result.
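The steps above can be sketched roughly as follows. This is not Lucene or Jericho code: it is a self-contained stand-in that uses a tiny in-memory inverted index and a crude regex tag stripper, only to show the pattern of indexing the extracted text while keeping the file path as metadata. The class name `HtmlSearchSketch` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for the Lucene index described above. */
final class HtmlSearchSketch {

    // term -> paths of the HTML files whose extracted text contains that term
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Crude tag stripper; a real parser such as Jericho handles far more cases. */
    static String extractText(String html) {
        return html
                .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
                .replaceAll("(?s)<[^>]+>", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    /** Indexes the extracted text; only the path is kept as metadata. */
    void index(String path, String html) {
        for (String term : tokenize(extractText(html))) {
            postings.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(path);
        }
    }

    /** Returns the paths of the files that contain every term of the query. */
    List<String> search(String query) {
        Set<String> hits = null;
        for (String term : tokenize(query)) {
            Set<String> paths = postings.getOrDefault(term, Set.of());
            if (hits == null) {
                hits = new LinkedHashSet<>(paths);
            } else {
                hits.retainAll(paths);
            }
        }
        return hits == null ? List.of() : new ArrayList<>(hits);
    }

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }
}
```

Because a hit is just the path of the untouched HTML file, the Swing side can load the original document, links and images included, for example with `editorPane.setPage(new File(path).toURI().toURL())`.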
HTH.
How can I load a Word document in a Jasper report in Java?
You have the following options:
Put the Word document somewhere on a shared network drive and add a link to it in the report.
You can convert the Word document to PDF and then link to that in your report. That way, almost anyone on the planet (not only Windows users) can read it.
You can convert the Word document to PDF and then use a tool to convert that to PNG or JPEG images (one page = one image) and then include the images in your report. This will make the report very large.
You can hope that Microsoft will implement an MS Word reader in Java and tell your boss in the meantime that it is not possible without some drawbacks.