I have a PDF textbook which has math equations like this:
However, if I attempt a simple text extraction I get something along the lines of:
V(r) = - 3 - -
2R R2
This is not an image; it is text, but I don't know how to preserve the way it looks and get the actual characters into a text file.
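For context, the "simple text extraction" I attempted looks roughly like this (a minimal sketch using Apache PDFBox 2.x; "textbook.pdf" is just a placeholder name, and the problem is the same whichever extraction library is used):

    import java.io.File;
    import java.io.IOException;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    // Plain text extraction: the stripper walks the page content and emits the
    // characters in reading order, but layout details such as superscripts and
    // fraction bars are lost, which is why formulas come out garbled.
    public class SimpleExtraction {
        public static void main(String[] args) throws IOException {
            // "textbook.pdf" is a placeholder path.
            try (PDDocument document = PDDocument.load(new File("textbook.pdf"))) {
                String text = new PDFTextStripper().getText(document);
                System.out.println(text);
            }
        }
    }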
The problem you are running into is a frequently encountered one. PDF essentially doesn't care about structure. It has no notion of a column, a paragraph, a line of text or even a word, let alone a mathematical formula with lots of special formatting.
PDF - essentially - is only interested in placing things on a page at a specific location. And that's exactly what it does with your formulas as well: it will take the characters and graphics needed for your formulas and put them somewhere on the page, without any additional information that you could use afterwards to figure out that these characters and graphics even belong to a formula, let alone reconstruct it while doing text extraction.
Two additional points:
1) If you share an example of such a PDF document, we could have a look if there is some useful information in it that could be used to extract this formula in a more competent way; but the chance is close to zero.
2) You would also have to define what a "useful way" is from your point of view. Formulas don't translate well to plain text files, so you probably need something like MathML to store them in.
Related
I have an unknown file type uploaded. It can be doc, pdf, xls, etc.
My ultimate goal is to:
1) Determine if there are paragraphs of text in the file (as opposed to, say, a bunch of picture captions or text from a chart or table).
2) If (1) is true and there are paragraphs of text, extract a few sample paragraphs from the file.
I know that I can use a program like Apache Tika to extract the file to a String.
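For example, something like this minimal sketch is what I mean by extracting to a String (assuming the Tika core and parser dependencies are on the classpath; "unknown-file" is just a placeholder name):

    import java.io.File;
    import java.io.IOException;

    import org.apache.tika.Tika;
    import org.apache.tika.exception.TikaException;

    // Tika detects the file type (doc, pdf, xls, ...) and returns the extracted
    // text as one flat String, with no structural information attached.
    public class ExtractText {
        public static void main(String[] args) throws IOException, TikaException {
            Tika tika = new Tika();
            // "unknown-file" is a placeholder path.
            String text = tika.parseToString(new File("unknown-file"));
            System.out.println(text);
        }
    }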
However, I would like to also get the format of the extracted text and determine where there are paragraphs of full, written text (as opposed to captions, etc.).
So I also would like a way to analyze the extracted text. Specifically, I would like a library that can identify full, written paragraphs, as opposed to text that was simply taken from things like photo captions, charts, etc.
While Tika is a rather large library, I would be willing to add it if it can perform the tasks that I need.
However, I cannot find anything in Tika that would allow me to analyze the structure of the text in such a way.
Is there something I missed?
Other than Tika, I am aware of some APIs for analyzing text, specifically Comprehend or Textract, but I still couldn't find something that can ensure the extraction of full, written paragraphs as I require.
I am looking for any suggestion using the libraries I listed above or others. Again, I'd like to avoid things like photo captions and such and only get text that was part of full, written paragraphs.
Is there any library that can help me with this or will I have to code the logic myself (for detecting paragraphs as well as detecting the difference between full paragraphs and text that was extracted from charts and captions)?
I'm working with HTML tags, and I need to interpret HTML documents. Here's what I need to achieve:
1) I have to recognize and remove HTML tags without removing the original content.
2) I have to store the index of the previously existing markups.
So here's an example. Imagine that I have the following markup:
This <strong>is a</strong> message.
In this example, we have a String sequence of 35 characters, marked up with a strong tag. As we know, HTML markup has a start and an end, and if we interpret the start and end tags as sequences of characters, each one also has a start and an end (a character index).
Again, in the previous example, the beginning index of the open/start tag is 5 (counting from index 0), and the end index is 13. The same logic applies to the close tag.
Now, once we remove the markup, we end up with the following:
This is a message.
The question:
How can I remember, given this stripped sequence, the places where I would have to insert the markup again?
For example, once the markup has been removed, how do I know that I have to insert the opening tag at position/index X, and the closing tag at position/index Y... Like so:
This is a message.
5 9
index 5 = <strong>
index 9 = </strong>
Keep in mind that it is possible to run into situations like the following:
<a>T<b attribute="value">h<c>i<d>s</a> <g>i<h>s</h></g> </b>a</c> <e>t</e>e<f>s</d>t</f>.
I need to implement this in Java. I've figured out how to get the start and end index of each tag in a document; for this, I'm using regular expressions (Pattern and Matcher), but I still do not know how to insert the tags again properly (as described). I would like a working example (if possible). It does not have to be the best solution in the world, as long as it works correctly for any kind of situation.
If anything about my question is unclear, please comment and I will improve it.
Thanks in advance.
EDIT
People in the comments are saying that I should not use regular expressions to work with HTML. I am not attached to regular expressions for solving this problem; I just want to solve it, no matter how (but of course, in the most appropriate way).
I mentioned that I'm using regular expressions, but I do not mind using another approach that leads to the same solution. I read that an XML parser could be the solution. Is that correct? Is there an XML parser capable of doing everything I need?
Again, Thanks in advance.
EDIT 2
I'm making this edit to explain the applicability of my problem (as requested). Well, before I start, I want to say that what I'm trying to do is something I've never done before; it's not in my area, so it may not be the most appropriate way to do it. Anyway...
I'm developing a site where users are allowed to read content but cannot edit it (change or remove text). However, users can still mark/highlight excerpts (ranges) of the content (with some stylization). This is the big summary.
Now the problem is how to do this (in Java). On the client side, for now, I was thinking of using TinyMCE to enable styling of content without text editing. I could save the stylized text to a database, but this would take up a lot of space, since every client is allowed to do this and there are many clients. So if a client highlights snippets of a paragraph, saving the whole paragraph back in the database for each client in the system is somewhat costly in terms of storage.
So I thought of just saving the ranges (indexes) of the markups made by users in a database. It is much easier to save just a few numbers than all the text with the styling required. For example, I could save a line/record in a table that says:
In paragraph X, from index Y to index Z, user P applied stylization ABC.
This would require a translation/conversion from database to HTML, and from HTML to database. Writing a converter should be easy (I guess), but I do not know how to get the indexes (following this logic). And so we arrive back at the beginning of my question.
Just to make it clear:
If someone offers a solution that will cost money, such as a paid API, tool, or something similar, unfortunately this option is not feasible for me. I'm sorry :/
In a similar way, I know it would be ideal to do this processing with JavaScript (client-side). It turns out that I do not have a specialized JavaScript team, so this needs to be done on the server side (unfortunately), which is written in Java. I can only use a JavaScript solution if it is already ready, easy and quick to use. Would you know of any ready-made, easy-to-use library that can do it in a simple way? Does it exist?
You can't use a regular expression to parse HTML. See this question (which includes this rather epic answer as well as several other interesting answers) for more information, but HTML isn't a regular language because it has a recursive structure.
Any language that allows recursion isn't regular by definition, so you can't parse it with a regex.
Keep in mind that HTML is a context-free language (or, at least, pretty close to context-free). See also the Chomsky hierarchy.
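For the index-recording part of the question, here is a minimal sketch. It deliberately avoids regex and simply scans the string, so it assumes well-formed tags and no stray < or > characters in text or attributes; for real-world documents you would hand the markup to an HTML parser such as Jsoup and walk its nodes instead, but the bookkeeping idea is the same:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: strip tags from an HTML fragment while recording, for every tag,
    // the index in the *stripped* text at which it must be re-inserted.
    // Assumes well-formed tags; this is not a substitute for a real HTML parser.
    public class TagOffsets {

        static final class TagAtIndex {
            final int index;   // position in the stripped text
            final String tag;  // the raw tag, e.g. "<strong>" or "</strong>"
            TagAtIndex(int index, String tag) { this.index = index; this.tag = tag; }
            @Override public String toString() { return "index " + index + " = " + tag; }
        }

        static List<TagAtIndex> strip(String html, StringBuilder strippedOut) {
            List<TagAtIndex> tags = new ArrayList<>();
            int i = 0;
            while (i < html.length()) {
                char c = html.charAt(i);
                if (c == '<') {
                    int end = html.indexOf('>', i);
                    // Record the tag at the current length of the stripped text.
                    tags.add(new TagAtIndex(strippedOut.length(), html.substring(i, end + 1)));
                    i = end + 1;
                } else {
                    strippedOut.append(c);
                    i++;
                }
            }
            return tags;
        }

        static String reinsert(String stripped, List<TagAtIndex> tags) {
            StringBuilder sb = new StringBuilder(stripped);
            // Insert from the last tag to the first so earlier indexes stay valid.
            for (int i = tags.size() - 1; i >= 0; i--) {
                sb.insert(tags.get(i).index, tags.get(i).tag);
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            StringBuilder stripped = new StringBuilder();
            List<TagAtIndex> tags = strip("This <strong>is a</strong> message.", stripped);
            System.out.println(stripped);          // This is a message.
            tags.forEach(System.out::println);     // index 5 = <strong>, index 9 = </strong>
            System.out.println(reinsert(stripped.toString(), tags));
        }
    }

The important detail is that re-insertion happens from the last recorded tag to the first, so earlier indexes are not shifted by the insertions.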
Is there a way I can edit the text of a PDF document, like finding and replacing specific text?
I have a PDF document which contains placeholder text that I need to identify and replace, or just delete.
I am able to edit the PDF at specific coordinates (x, y), but I am unable to identify and replace text. All the libraries that I saw create PDFs from scratch and offer only limited editing functionality.
Is there any way I can do the editing described above using iText?
Please advise... thank you!
Example: A PDF document contains the following paragraph. In this paragraph, I need to identify DATE: and FROM: as text and replace them with something else.
The oldest classical Greek and Latin writing had little or no spaces between words or other ones, and could be written in boustrophedon (alternating directions). Over time, text direction (left to right) became standardized, and word dividers and terminal punctuation became common.
DATE:
FROM:
The first way to divide sentences into groups was the original paragraphos, similar to an underscore at the beginning of the new group.
Allow me to copy the intro of chapter 6 of my book:
When I wrote the first book about iText, the publisher didn’t like the
subtitle “Creating and Manipulating PDF.” He didn’t like the word
manipulating because of some of its pejorative meanings. If you consult the dictionary on Yahoo! education, you’ll find the
following definitions:
To influence or manage shrewdly or deviously
To tamper with or falsify for personal gain
Obviously, that’s not what the book is about. The publisher suggested
“Creating and Editing PDF” as a better subtitle. I explained that
PDF isn’t a document format well suited for editing. PDF is an end
product. It’s a display format. It’s not a word processing
format.
In a word processing format, the content is distributed over different
pages when you open the document in an application, not earlier. This
has some disadvantages: if you open the same document in different
applications, you can end up with a different page count. The same
text snippet can be on page X when looked at in Microsoft Word, and
on page Y when viewed in Open Office. That’s exactly the kind of
problem you want to avoid by choosing PDF.
In a PDF document, every character or glyph on a PDF page has its
fixed position, regardless of the application that’s used to view the
document. This is an advantage, but it also comes with a disadvantage.
Suppose you want to replace the word “edit” with the word “manipulate”
in a sentence, you’d have to reflow the text. You’d have to reposition
all the characters that follow that word. Maybe you’d even have to
move a portion of the text to the next page. That’s not trivial, if
not impossible.
If you want to “edit” a PDF, it’s advised that you change the original
source of the document and remake the PDF. If the original document
was written using Microsoft Word, change the Word document, and make
the PDF from the new version of the Word document. Don’t expect any
tool to be able to edit a PDF file the same way you’d edit a Word
document.
This being said, the verb “to manipulate” also means
To move, arrange, operate, or control by the hands or by mechanical means, especially in a skillful manner
That’s exactly what you’re going to do in this chapter. Using iText,
you’re going to manipulate the pages of a PDF file in a skillful
manner. You’re going to treat a PDF document as if it were made of
digital paper.
In your question, you say: "All the libraries that I saw create PDFs from scratch and offer only limited editing functionality."
Well, that's only normal. It's inherent to the document format you've chosen. Your design that involves "placeholders for text that you need to identify and replace or just delete" is seriously flawed. It suffers from a wrong choice of document format. You should have chosen a format that is suited for editing. PDF isn't such a format.
I have a PDF that contains placeholders like <%DATE_OF_BIRTH%>. I want to be able to read in the PDF and change the placeholder values to text using iText.
So: read in the PDF, use something like a replaceString() method to change the placeholders, then generate the new PDF.
Is this possible?
Thanks.
The use of placeholders in PDF is very, very limited. Theoretically it can be done and there are some instances where it would be feasible to do what you say, but because PDF doesn't know much about structure, it's hard:
simply extracting words is difficult so recognising your placeholders in the PDF would already be difficult in many cases.
Replacing text in PDF is a nightmare because PDF files generally don't have a concept of words, lines and paragraphs. Hence no nice reflow of text for example.
Like I said, it could theoretically work under special conditions, but it's not a very good solution.
What would be a better approach depends on your use case:
1) For some forms it may be acceptable to have the complete form as a background image or PDF file and then generate your text as an overlay to that background (filling in the blanks, so to speak). As pointed out by Bruno and mlk in comments, in this case you can also look into using form fields, which can be dynamically filled (see the sketch after this list).
2) For other forms it may be better to have your template in a structured format such as XML or HTML, do the text replacement in that format and then convert it into PDF.
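To illustrate option 1), here is a minimal sketch of filling a form field with iText 5. It assumes the template PDF has been prepared with a real AcroForm field (named date_of_birth here) instead of a <%DATE_OF_BIRTH%> text placeholder; the file names and the field name are placeholders:

    import java.io.FileOutputStream;
    import java.io.IOException;

    import com.itextpdf.text.DocumentException;
    import com.itextpdf.text.pdf.AcroFields;
    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.PdfStamper;

    public class FillForm {
        public static void main(String[] args) throws IOException, DocumentException {
            // "template.pdf", "filled.pdf" and "date_of_birth" are placeholders.
            PdfReader reader = new PdfReader("template.pdf");
            PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("filled.pdf"));
            AcroFields form = stamper.getAcroFields();
            form.setField("date_of_birth", "1990-01-01");
            stamper.setFormFlattening(true); // bake the value into the page content
            stamper.close();
            reader.close();
        }
    }

Because a form field has a fixed position and size on the page, filling it doesn't require reflowing any text, which sidesteps the problems described above.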
I have a text file that was provided to me and no one knows the encoding on it. Looking at it in a text editor, everything looks fine, aligned properly into neat columns.
However, I'm seeing some anomalies when I read the data. Even though, visually, the field "Foo" appears in the same columns in the text file (for instance, in columns 15-20), when I try to pull it out using substring(15,20) my data varies wildly. Sometimes I'll pull bytes 11-16, sometimes 18-23, sometimes 15-20...there's no consistency between records.
I suspect that there are some special characters, invisible in my text editor, but readable by (and counted in the index of) the String methods. Is there any way in Java to dump the contents of the file with any special characters made visible, so I can see which strings I need to replace with a regex?
If not in Java, can anyone recommend a tool that may be able to help me out?
I would start with having a look at the file directly. Any code adds a layer of doubt. Take Total Commander (or an equivalent on your platform), view the file (F3) and switch to hex mode. You suggest that the special-character behavior is not even consistent between lines, so you should get some visual clue about the format before you even attempt to fix it algorithmically.
Have you tried printing the contents of the file as individual integers or bytes? That way you can see if there are any hidden characters.
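A minimal sketch of that idea: read the file as raw bytes and show anything that isn't printable ASCII as a hex escape, so tabs, non-breaking spaces, BOMs and other invisible characters stand out ("data.txt" is a placeholder name; if the file turns out to be UTF-16 or another multi-byte encoding, the byte view will make that obvious too):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ShowHiddenChars {
        public static void main(String[] args) throws IOException {
            // "data.txt" is a placeholder path.
            byte[] bytes = Files.readAllBytes(Paths.get("data.txt"));
            StringBuilder line = new StringBuilder();
            for (byte value : bytes) {
                int b = value & 0xFF;
                if (b >= 0x20 && b < 0x7F) {
                    line.append((char) b);                    // printable ASCII as-is
                } else if (b == '\n') {
                    System.out.println(line);                 // keep the line structure
                    line.setLength(0);
                } else {
                    line.append(String.format("\\x%02X", b)); // everything else as hex
                }
            }
            if (line.length() > 0) System.out.println(line);
        }
    }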