Input the file and count the characters - Java

I am writing code that counts the characters in a file.
My problem is that when I read the file in, the count is larger than what the file actually contains. For example, the file's content is abcdabcd, but when I run the code, the console shows Total no. of letters: 194.
I am using the NetBeans IDE on a Mac. When I inspect the file instead of opening it directly, I can see a lot of extra text in front of abcdabcd, so I guess that is the reason, but I don't know how to fix this problem in my code.
Can anyone help me solve this problem?
Thanks!

The input file is RTF (Rich Text Format), despite being saved with a .txt extension. RTF is commonly used to do things plain text cannot, such as bold and italic. To accomplish this, the file is filled with all sorts of formatting directives alongside the normal text, in your case abcdabcd. Of course, Java reads all of this as if it were a plain text file and ends up counting the RTF formatting as well.
I assume you're using TextEdit, so look at this tutorial to see how to use only plain text, so the extra formatting is not included in the final file.
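For reference, once the file really is plain text, a minimal character-counting sketch looks like this (the file name is a placeholder):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CharCount {
    public static void main(String[] args) throws IOException {
        long count = 0;
        // Read one character at a time and tally everything in the file.
        try (BufferedReader in = new BufferedReader(new FileReader("abcd.txt"))) {
            while (in.read() != -1) {
                count++;
            }
        }
        System.out.println("Total no. of letters: " + count);
    }
}

Run against a plain-text file containing abcdabcd, this prints 8; against the RTF version, it counts every formatting character too.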
Hope this helped :)

Related

Best way to extract text from PDF in Java

I want to make a program that is able to read PDF files and parse their contents.
Thus I need to extract the text using some kind of library. I found 3 ways to do so:
OCR libraries (like Tesseract)
ScanPdf libraries (like iText)
Converters from PDF to text.
I fail to understand the big differences between them, since all of them will produce a text file from the PDF in the end. So which is the best way to go about this?
PDF is a complex format. If you open a PDF and you're staring at a bunch of text, that doesn't really tell you much. It could be that you're staring at an image file someone decided to wrap in a PDF. That is 99%+ certain to be what you have if someone scanned a document and told their scanner to 'scan to PDF', and 100% certain if someone took a PNG or JPG and chose 'save as PDF' or 'print to PDF'.
There is no text in the PDF then. There are pixels.
To turn pixels into text, that's where OCR libraries come in. That's what they do; that is all they do. It's an AI bonanza and error-prone. No guarantees.
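To make that concrete, here is a minimal OCR sketch in Java using the Tess4J wrapper around Tesseract (the data path and image name are assumptions for illustration):

import java.io.File;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class OcrDemo {
    public static void main(String[] args) throws TesseractException {
        ITesseract ocr = new Tesseract();
        // Point Tesseract at its language data; the location varies per install.
        ocr.setDatapath("/usr/share/tessdata");
        ocr.setLanguage("eng");
        // OCR a rendered page image; the output is a best-effort guess.
        String text = ocr.doOCR(new File("page-1.png"));
        System.out.println(text);
    }
}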
However, PDF is more complex than that; it isn't like PNG/JPG. It's more like HTML: you can put actual text in there.
This has different issues, though. You can place text blobs (i.e. a 'rectangle' with coordinates, and then the text that is supposed to go inside). Again, a lot like HTML, you can do something like:
<p class="foo">
World!
</p>
<p class="bar">
Hello,
</p>
and then create CSS so that the foo block is rendered after the bar block (it can be as simple as .foo, .bar { display: block; } .foo { float: right }).
Turning that HTML into "World! Hello," is not all that tricky. Realizing that during a render you end up seeing "Hello, World!", and thus writing code that returns "Hello, World!", is way more complicated.
The same problem applies to PDF. For simple PDFs, extracting the raw text inside is not too difficult, but be aware that for even mildly complex PDFs, the text can arrive in a jumbled mess.
iText is trying to give you enough power, at least, to provide the latter: To give you a full hierarchical breakdown. It returns 'here is a text box, here is its positioning, and here is the text inside. and now here is another text box, etc'. It does not return a big string.
In other words: the answer depends a lot on what PDFs you have / what PDFs you expect to be able to read, and how complex they are. If they are scans, you need an OCR library. If they are simple, a basic pdf2text converter will do fine. If you want to attempt to take into account fancily positioned PDFs with forms inside and 'popups' that can be opened and closed, oof. Probably all these tools are insufficient and you're signing up for many person-weeks' worth of effort.
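For the simple end of that spectrum, a minimal text-extraction sketch with iText 5 looks like this (the file name is a placeholder; on mildly complex PDFs, expect exactly the jumbling described above):

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class Pdf2Text {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("input.pdf");
        StringBuilder text = new StringBuilder();
        // Walk the pages and let iText's default strategy reassemble the text blobs.
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            text.append(PdfTextExtractor.getTextFromPage(reader, i)).append('\n');
        }
        reader.close();
        System.out.println(text);
    }
}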
There definitely IS text embedded in PDFs; it is NOT always just pixels.
It depends on whether the PDF is a "true" PDF (i.e. you can highlight the text and copy and paste it elsewhere) or a scanned image.
With scanned images, you'll have to use an OCR API. All of the major cloud providers have OCR APIs (e.g. Amazon Textract, Google Document AI, Microsoft Form Recognizer). If it's a true PDF, then I've found the pdf.js library (https://mozilla.github.io/pdf.js/) quite helpful for doing a direct text extraction.
Just know that doing this only gets you the text that is literally on the page, and there's still quite a bit of work to do to get key/value data fields programmatically across many documents.
This is something that my startup (www.sensible.so/) is working on too, if you're interested in something more powerful!

MS Word line breaks in Java

I'm building a grammatical text interpreter with a JavaFX GUI where you can just paste your text into a TextArea. This creates something of a problem, though, since line breaks from MS Word are somehow formatted in a special way (both paragraph breaks (ENTER) and line breaks (SHIFT+ENTER)).
If I copy the text to a plain .txt file and open it in a browser (tried Chrome, Safari and Firefox), I can copy it again from there without any problem. They somehow fix the issue.
Pasting the text into a new e-mail in Apple's Mail and then copying it from there fixes it; even pasting it into a JEditorPane (Java Swing) fixes it.
Notepad, TextEdit, Notes and Pages do NOT solve it.
But I need the ability to copy the text directly from a variety of sources and asking people to copy it to somewhere else is just not an option.
I searched for a solution before posting, and the general solution I could find was to replace "\r" with "\r\n", but this doesn't seem to help, as the line break is already lost as soon as the text has been pasted into the TextArea.
A different solution would of course be to simply use a JEditorPane instead of the TextArea, but when I create an AnchorPane and put a JEditorPane into it as a Swing component, I get an area the size of the initial text, and editing that text at runtime doesn't change the size of that area (which is NOT the area of the contentPane).
Now, I know that I present different solution ideas here, and that I probably should just ask how to get one specific thing to work, but I'm really getting a headache over this. So whether it's a way to get the contentPane to work properly with the layout, a way to make the TextArea keep these line breaks, or something different entirely, it's fine. I've just spent too much time trying to solve it now :(
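One avenue worth sketching (an assumption, not a verified fix): Word marks soft line breaks with a vertical-tab character, U+000B, which some controls silently drop. A TextFormatter filter can normalize pasted text before the TextArea stores it, provided the control hands the raw clipboard string to the filter:

import javafx.scene.control.TextArea;
import javafx.scene.control.TextFormatter;

public class PasteNormalizer {
    public static TextArea createTextArea() {
        TextArea area = new TextArea();
        // Rewrite Word's soft breaks (U+000B) and bare carriage returns
        // into plain '\n' before the change is applied to the control.
        area.setTextFormatter(new TextFormatter<String>(change -> {
            if (change.isContentChange()) {
                change.setText(change.getText()
                        .replace("\u000B", "\n")
                        .replace("\r\n", "\n")
                        .replace("\r", "\n"));
            }
            return change;
        }));
        return area;
    }
}

Whether the vertical tabs survive the clipboard transfer into JavaFX is platform-dependent, so treat this as an experiment to run, not a guaranteed cure.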

iText: get special letters from PDF

I am trying to extract accented words from a PDF e-book. The best results are produced with the iText library, but I fail to get the accents on the words.
Example:
побеђивање -should come out as- побеђи́ва̄ње (the accents are missing)
The letters are Serbian Cyrillic.
I tried many of the OCR solutions, but they all give bad results. Is there a way for me to extract all of this PDF's data the way it appears in the PDF using iText? I know that this has a lot to do with the way PDF works and that this is a hard thing to get, but I really need this; the alternative is to retype all of the data.
The PDF file: pdf example file
The sample document actually contains one big image, a scanned page, and invisible text information on top of the scanned printed letters. Most likely this text information is the result of some OCR process.
Unfortunately, this text information is already missing the accents in question. E.g. the text for the first entry is added as
(\340\361\362\340\353\367\355)Tj 0 Tc (\236)Tj
...
As you can see, the same letter code \340 is used at positions 1 and 4, while according to the scanned page one of the matching printed letters has an accent and the other does not.
This happens throughout the whole page.
Thus, any attempt at regular text extraction will fail to return the accents in question. The only chance you have is to use OCR.
You say you
tried many of the OCR solutions but they all give bad results
Probably you applied the OCR applications to the PDF or a rendered version of it. I would suggest you instead extract the scanned images; this way you get all the quality there is. iText can help you with image extraction.
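A sketch of that image extraction with iText 5's parser API (file names and output handling are placeholders):

import java.io.FileOutputStream;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.PdfImageObject;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

public class ExtractScans {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("book.pdf");
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            final int page = i;
            parser.processContent(i, new RenderListener() {
                public void renderImage(ImageRenderInfo info) {
                    try {
                        // Write the embedded scan out in its native format,
                        // untouched, so the OCR engine gets full quality.
                        PdfImageObject img = info.getImage();
                        try (FileOutputStream out = new FileOutputStream(
                                "page-" + page + "." + img.getFileType())) {
                            out.write(img.getImageAsBytes());
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
                public void beginTextBlock() {}
                public void renderText(TextRenderInfo info) {}
                public void endTextBlock() {}
            });
        }
        reader.close();
    }
}

You can then feed those images to an OCR engine trained for Serbian Cyrillic, with the accent marks still intact in the source pixels.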

iText PDF Text Extraction with fonts and styles

I am using iText to extract text from a PDF into a String, but I have encountered a problem with some PDFs. When I try to extract the text, the reader extracts only blanks/destroyed text on SOME PDFs.
Example of destroyed text:
"th isbe long to t he t est fo r extr act ion tex t"
What is the cause of this problem?
I am thinking of removing the fonts and changing the font to a suitable one that the reader can handle. I have tried researching this, but what I found does not help me.
This is caused by the way text is stored in the PDF file. The file just puts letters on the page with rendering and location information. The text-extraction algorithm is smart in that it finds letters that seem to be close together and, if so, it puts them together. If they aren't that close, it puts in some space.
I can't tell you what to do about it, though.
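You can at least see that heuristic at work by comparing iText 5's two stock extraction strategies; which one misjudges the gaps depends on the file (the file name is a placeholder):

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;

public class CompareStrategies {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("broken.pdf");
        // Simple: follows the order of operations in the content stream.
        System.out.println(PdfTextExtractor.getTextFromPage(
                reader, 1, new SimpleTextExtractionStrategy()));
        // Location-based: sorts glyphs by position and re-inserts spaces
        // wherever neighbouring glyphs sit far enough apart.
        System.out.println(PdfTextExtractor.getTextFromPage(
                reader, 1, new LocationTextExtractionStrategy()));
        reader.close();
    }
}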

How can I get highlighted words from a PDF file?

I am developing a new program, and I need to allow the user to highlight words in a PDF file; I then want to process the file to get a list of the highlighted words along with their locations.
How can I do that in Java?
Thanks in advance.
PDF is descended from PostScript and is very difficult to process. I doubt there's an easy way.
Take a look at http://java-source.net/open-source/pdf-libraries, but be aware you might have some difficulty.
Also, read http://partners.adobe.com/public/developer/en/pdf/HighlightFileFormat.pdf for the specs of the highlight format. Depending on what "place" information you need, that might be enough.
How are you displaying the PDF? If you are displaying the image, you just need the word coordinates. Something like PDFBox or JPedal or maybe iText can do this.
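As a sketch of getting word coordinates with PDFBox 2.x (the file name is a placeholder, and matching the coordinates against the highlight annotations is the remaining work):

import java.io.File;
import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class WordLocations {
    public static void main(String[] args) throws IOException {
        try (PDDocument doc = PDDocument.load(new File("input.pdf"))) {
            PDFTextStripper stripper = new PDFTextStripper() {
                @Override
                protected void writeString(String text, List<TextPosition> positions)
                        throws IOException {
                    // Each TextPosition carries a glyph and its page coordinates.
                    TextPosition first = positions.get(0);
                    System.out.printf("%s at x=%.1f y=%.1f%n",
                            text, first.getXDirAdj(), first.getYDirAdj());
                    super.writeString(text, positions);
                }
            };
            stripper.getText(doc); // triggers the writeString callbacks
        }
    }
}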
