I am using XSLFPowerPointExtractor to extract text from a pptx file. However, all the text in the file is returned to me as a single string. Is there any way I can get the text of each slide separately? I am completely new to this, so please give detailed answers.
I looked up the API documentation and it seems that, with this extractor, it's all or nothing. Its getText() method returns the entire text of all the slides, which is exactly the behavior you are observing.
A bit more googling showed me that the way to do it is to use another API, namely XMLSlideShow, which gives you slide-by-slide access to the presentation.
From there, you can access the different shapes including the text areas from which you can read the text. As a matter of fact, this is explained in this other SO question which I believe will help you resolve your issue: How to get pptx slide notes text using apache poi?
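To make the slide-by-slide idea concrete, here is a minimal sketch. It assumes a reasonably recent poi-ooxml on the classpath (one where getSlides() and getShapes() return lists); the class name SlideTextDemo, the helper textPerSlide and the file name slides.pptx are made up for illustration. It only reads top-level text shapes, so text inside grouped shapes or tables is not picked up.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.xslf.usermodel.XMLSlideShow;
import org.apache.poi.xslf.usermodel.XSLFShape;
import org.apache.poi.xslf.usermodel.XSLFSlide;
import org.apache.poi.xslf.usermodel.XSLFTextShape;

class SlideTextDemo {

    // Returns one string per slide, in the order the slides appear in the presentation.
    static List<String> textPerSlide(XMLSlideShow ppt) {
        List<String> result = new ArrayList<>();
        for (XSLFSlide slide : ppt.getSlides()) {
            StringBuilder sb = new StringBuilder();
            for (XSLFShape shape : slide.getShapes()) {
                // Only top-level text shapes are handled in this sketch;
                // grouped shapes and tables are skipped.
                if (shape instanceof XSLFTextShape) {
                    sb.append(((XSLFTextShape) shape).getText()).append('\n');
                }
            }
            result.add(sb.toString().trim());
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // "slides.pptx" is a placeholder; pass your own path as the first argument.
        String path = args.length > 0 ? args[0] : "slides.pptx";
        try (InputStream in = new FileInputStream(path);
             XMLSlideShow ppt = new XMLSlideShow(in)) {
            List<String> texts = textPerSlide(ppt);
            for (int i = 0; i < texts.size(); i++) {
                System.out.println("--- Slide " + (i + 1) + " ---");
                System.out.println(texts.get(i));
            }
        }
    }
}
```

Each entry of the returned list holds the text of one slide, so you can store or process the slides independently instead of splitting one big string.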
So I am working on a Java/HTML/PHP project.
I am reading the OOXML out of a docx and saving it all into a String, because in that String I can search for the "relevant" tags with a regex. After that I save those tags in a MySQL database, and then I want to display the data from the database on an HTML/PHP website.
For example:
I found the OOXML tags that represent "Hello sqrt(2)", where sqrt(2) is the square root of 2 (written as a symbol), and saved them as a String in my database.
<w:t>Hello</w:t><w:t xml:space="preserve"> </w:t><m:oMath><m:rad><m:radPr><m:degHide m:val="1"/><m:ctrlPr><w:rPr><w:rFonts w:ascii="Cambria Math" w:eastAsiaTheme="minorEastAsia" w:hAnsi="Cambria Math"/><w:i/></w:rPr></m:ctrlPr></m:radPr><m:deg/><m:e><m:r><w:rPr><w:rFonts w:ascii="Cambria Math" w:eastAsiaTheme="minorEastAsia" w:hAnsi="Cambria Math"/></w:rPr><m:t>2</m:t></m:r></m:e></m:rad></m:oMath>
My problem now is that the square-root-of-2 symbol is not displayed in my browser when I print that String from my database, because browsers like Chrome do not support OMML (the <m:oMath> tags).
So my question is: is there a way to display it?
I found some similar questions, and they suggest transforming the OMML math tags into MathML, which HTML supports. (Well, I figured out that Chrome does not support MathML right now, but there is a script called MathJax that helps out...)
So a solution could be to transform all the OMML tags into MathML. But how can I do that? There is a solution here: Reading equations from Word (*.docx) to HTML together with their text context using apache poi
But that seems very complicated to me, because it works with the Word file. I already have the XML as a String.
So if you have an idea how I could solve this problem, or if you have other ideas, I would be very thankful. Perhaps I am overcomplicating this and there is an easy way of showing OMML math formulas in HTML that I don't know about.
Thanks a lot!
If someone has the same issue in the future:
In the end, I did it as recommended in
Reading equations from Word (*.docx) to HTML together with their text context using apache poi
I converted the whole docx document to HTML and MathML. After that I searched for my relevant tags with regex again.
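For anyone who only has the OMML as a String, the transformation step on its own can be sketched with the JDK's built-in XSLT support. This is a rough sketch under a few assumptions: the stylesheet is OMML2MML.XSL (it ships with Microsoft Office, the path below is a placeholder), the String has already been cut down to a single <m:oMath> element, and the class and method names (OmmlTransformDemo, bindNamespaces, transform) are made up for illustration. A fragment taken out of a docx has no namespace declarations for the m: and w: prefixes, so they are added first, otherwise the XML parser rejects the String.

```java
import java.io.File;
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class OmmlTransformDemo {

    static final String M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math";
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Adds the m: and w: namespace declarations to the first <m:oMath> start tag.
    // Assumes the fragment's root element is m:oMath (not m:oMathPara or w:t).
    static String bindNamespaces(String oMathXml) {
        return oMathXml.replaceFirst("<m:oMath(?=[\\s>/])",
                "<m:oMath xmlns:m=\"" + M_NS + "\" xmlns:w=\"" + W_NS + "\"");
    }

    // Applies an XSLT stylesheet to an XML String and returns the result as a String.
    static String transform(String xml, Source stylesheet) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer(stylesheet);
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // Placeholder path; point it at your copy of OMML2MML.XSL.
        File xsl = new File(args.length > 0 ? args[0] : "OMML2MML.XSL");
        String omml = "<m:oMath><m:r><m:t>2</m:t></m:r></m:oMath>";
        System.out.println(transform(bindNamespaces(omml), new StreamSource(xsl)));
    }
}
```

The resulting MathML String can then be stored in the database and rendered in the page, for example with MathJax as mentioned in the question.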
I am trying to turn some existing PDFs into templates.
Because these documents hold real data, I am replacing data such as names and addresses with dummy placeholders.
Examples
[[Name]]
[[Address1]]
When I replace the text programmatically with the iText 5 library, I can use the template.
To speed things up, I tried using Adobe DC instead.
With that method, the template stops working.
Any ideas?
From what I understand of your question:
you have (or want to have) a template document
you fill in the template with data from a program
you turn the result back into a PDF
You can easily achieve some of your goals with iText.
I suggest you look into http://developers.itextpdf.com/examples/form-examples/clone-filling-out-forms
I am using the last answer to this question:
Replacing a text in Apache POI XWPF not working.
Thanks to Josh.
It works perfectly in almost all scenarios, but sometimes it does not apply the color to the replaced text properly.
Am I missing something?
Runs are funny things. I know that the solution in this Stack Overflow question works great to replace sections of paragraphs or parts of runs that have different formatting (bold, embossed, etc) scattered throughout a given paragraph. For my particular use-case, the replace function was able to replace strings mid-run and handle any particular formatting that we were encountering. I didn't personally look at the color, but it appears to have functionality to do so: newRun.setColor(run.getColor());
Note that I originally was using Apache POI 3.11 and the code was giving me a lot of errors like "The method isEmbossed() is undefined for the type XWPFRun". Upgrading to 3.15 solved this.
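To illustrate the formatting copy, here is a small sketch (not the code from the linked answer). The class and method names (RunCopyDemo, appendStyledLike) are made up. It copies bold, italic and color from an existing run to a new one. One possible cause of a missing color, which is an assumption on my part and not confirmed by the question, is that getColor() only reports a color set directly on the run, so a color inherited from a paragraph or character style comes back as null; the sketch skips setColor in that case.

```java
import java.io.IOException;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

class RunCopyDemo {

    // Appends a new run with the given text and copies basic formatting from 'source'.
    static XWPFRun appendStyledLike(XWPFParagraph paragraph, XWPFRun source, String text) {
        XWPFRun newRun = paragraph.createRun();
        newRun.setText(text);
        newRun.setBold(source.isBold());
        newRun.setItalic(source.isItalic());
        // getColor() returns null when no color is set directly on the run,
        // so only copy it when there is something to copy.
        String color = source.getColor();
        if (color != null) {
            newRun.setColor(color);
        }
        return newRun;
    }

    public static void main(String[] args) throws IOException {
        try (XWPFDocument doc = new XWPFDocument()) {
            XWPFParagraph paragraph = doc.createParagraph();
            XWPFRun original = paragraph.createRun();
            original.setText("Hello [[Name]]");
            original.setColor("FF0000"); // hex RGB, set directly on the run
            XWPFRun copy = appendStyledLike(paragraph, original, "Hello World");
            System.out.println("Copied color: " + copy.getColor());
        }
    }
}
```

If the color still goes missing for some runs, it may be worth checking in the document XML whether the color is defined on the run itself or comes from a style.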
I've been looking for an answer for quite a long time, but I haven't found anything.
My problem is with parsing a PDF: I have a page laid out as some kind of table.
I've already written some code that extracts information from a specified rectangle, but I declare those values in code, so it is not as dynamic as it should be. I want to find information about the cells, and with that information I will be able to get the strings I need. I haven't found anything useful in the PDFBox API.
I would be grateful for any tips.
Good morning, fellas. I have been assigned a task in which I am supposed to extract text from a PDF file (a bank invoice) according to a given specification of fields and sections. This specification is given in a YAML file. Each field is expressed as a pair of coordinates (the top-left and bottom-right corners of the rectangle in which the text resides) plus the name of the field. I am using SnakeYAML to load this info into objects, and I have been successful up to this point. For the next part, where I have to extract text from PDFs using this data, well... I am kind of stuck. For one, I have not yet been able to decide which PDF parsing library to use. Can you please suggest a PDF parsing library suited to my task, and how I should go about accomplishing it? Thanks!
PDFBox is able to extract text from a given area. Have a look at PDFTextStripperByArea!
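A minimal sketch of how that could look, assuming PDFBox 2.x (in 3.x the document is loaded with Loader.loadPDF instead of PDDocument.load). The class name AreaExtractDemo, the helpers toRegion and extract, the file name invoice.pdf and the sample coordinates are all made up. The region is given in points with the origin at the top-left corner of the page, so a field defined by its top-left and bottom-right corners first has to be turned into x, y, width and height.

```java
import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

class AreaExtractDemo {

    // Converts top-left / bottom-right corners into the rectangle PDFBox expects.
    static Rectangle2D toRegion(double left, double top, double right, double bottom) {
        return new Rectangle2D.Double(left, top, right - left, bottom - top);
    }

    // Extracts the text that lies inside 'region' on the given page.
    static String extract(PDPage page, Rectangle2D region) throws IOException {
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.setSortByPosition(true);
        stripper.addRegion("field", region); // "field" is just a lookup key
        stripper.extractRegions(page);
        return stripper.getTextForRegion("field").trim();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder file name and coordinates; in the real task they come from the YAML spec.
        File file = new File(args.length > 0 ? args[0] : "invoice.pdf");
        try (PDDocument document = PDDocument.load(file)) { // PDFBox 2.x API
            PDPage firstPage = document.getPage(0);
            Rectangle2D region = toRegion(50, 70, 350, 110);
            System.out.println(extract(firstPage, region));
        }
    }
}
```

For several fields on the same page you can register all regions on one stripper with different keys, call extractRegions once, and then read each field with getTextForRegion.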