I am working on a project that involves PDFs containing both English and Spanish text. I am only interested in the English part, which I want to save to a database. I am using Apache PDFBox to extract the text. How can I skip the Spanish content and keep only the English text? I tried libraries such as Apache Tika and https://code.google.com/p/language-detection/, but they do not give correct results in some cases. Can anyone suggest a reliable solution, or another way to achieve this?
Thanks in advance.
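One rough fallback, when the detection libraries misfire on short passages, is to score each block of extracted text against small stop-word lists and keep only the blocks where English wins. The sketch below is only that: the class name `EnglishFilter`, the tiny word lists, and the assumption that the extracted text arrives as blank-line-separated blocks are all mine, not something PDFBox guarantees.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: keeps the blocks of text that look more English than Spanish.
final class EnglishFilter {
    // Deliberately tiny, illustrative stop-word lists; real input needs larger ones.
    private static final Set<String> EN = new HashSet<>(Arrays.asList(
            "the", "and", "of", "to", "is", "in", "that", "for", "with",
            "this", "are", "was", "it", "on", "be"));
    private static final Set<String> ES = new HashSet<>(Arrays.asList(
            "el", "la", "los", "las", "de", "que", "y", "en", "es",
            "por", "con", "para", "una", "un", "del", "se"));

    // True when the block contains more English stop words than Spanish ones.
    static boolean looksEnglish(String block) {
        int en = 0;
        int es = 0;
        // Split on anything that is not a letter, so punctuation is ignored.
        for (String word : block.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (EN.contains(word)) en++;
            if (ES.contains(word)) es++;
        }
        return en > es;
    }

    // Assumes blocks are separated by blank lines; returns the English-looking ones.
    static String keepEnglish(String text) {
        StringBuilder out = new StringBuilder();
        for (String block : text.split("\\R\\s*\\R")) {
            if (looksEnglish(block)) {
                if (out.length() > 0) out.append("\n\n");
                out.append(block.trim());
            }
        }
        return out.toString();
    }
}
```

Blocks with no stop words at all (headings, numbers) are dropped by this version, so it may need a tie-breaking rule for your documents.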
Related
In my application, users enter notes in the browser. These notes can be formatted (font, size, color, etc.) and are saved in the database as strings of HTML tags.
Now I want to export this formatted text to PPTX. Is there any solution for that? So far I have tried Apache POI, which supports formatted text but does not accept an HTML string as input.
I am looking for an open source library, so using Aspose is difficult. Somehow, I need to render this HTML text and then copy it as-is to the PPTX.
Any solution or approach would be helpful.
EDIT: I am thinking of custom parsing the HTML string: using JAXB to convert the tags into objects, and then some Java logic to feed them to POI. Any workaround or help with achieving this would be appreciated.
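As a rough illustration of the custom-parsing idea, here is a sketch that turns an HTML string into a flat list of text runs carrying bold and italic flags. Note that it swaps JAXB for the HTML parser that ships with the JDK (`javax.swing.text.html`), since note HTML is rarely well-formed XML. The `HtmlRuns` and `Run` names are hypothetical, and font, size, and color are left out. Each run could then be mapped onto a POI text run, for example with `XSLFTextRun.setBold` and `setItalic`, which I have not wired up here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: flattens an HTML string into formatted text runs.
final class HtmlRuns {
    // One stretch of text with uniform formatting.
    static final class Run {
        final String text;
        final boolean bold;
        final boolean italic;

        Run(String text, boolean bold, boolean italic) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
        }
    }

    static List<Run> parse(String html) {
        final List<Run> runs = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Nesting depth of bold and italic tags currently open.
            private int bold = 0;
            private int italic = 0;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.B || tag == HTML.Tag.STRONG) bold++;
                if (tag == HTML.Tag.I || tag == HTML.Tag.EM) italic++;
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.B || tag == HTML.Tag.STRONG) bold = Math.max(0, bold - 1);
                if (tag == HTML.Tag.I || tag == HTML.Tag.EM) italic = Math.max(0, italic - 1);
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Record the text together with the formatting active at this point.
                runs.add(new Run(new String(data), bold > 0, italic > 0));
            }
        };
        try {
            // The last argument tells the parser to ignore charset directives.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return runs;
    }
}
```

Keeping the parsing separate from POI means the run list can be checked on its own before any slide is generated.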
Aspose.Slides lets you import HTML text into a presentation and also export a presentation to HTML. I suggest visiting the following documentation link for this purpose. You are right that Aspose.Slide
I work as a developer evangelist at Aspose.
I know this may sound silly to some of you experienced guys out there, but it's really important for me and my group at school. We need to create software that lets the user create a new RTF document from scratch (like an editor where you can center text, change font size and style, save, and insert pictures). It also needs to be able to read a .docx document, with its images and formatting, and save it as an RTF document.
What we have done so far is open the .docx document, extract the text without formatting, and write it out to an RTF document. In other words, using the docx4j library we have been able to transform the text of a .docx document to .rtf: no pictures, no formatting, just plain text surrounded by [ ].
We made some progress today but we can't figure out the next steps. Since the delivery date is in 72 hours, I thought it would be a good idea to ask for help from people more experienced than us.
Please leave your answers or ask for more info about the project; we'll be glad to learn from you guys.
To convert a .docx to .rtf use a library like https://code.google.com/p/jodconverter/. It will do all the heavy lifting for you.
Anyway, now about your editor itself. If I had to do it as fast as I could, I would use JavaFX to make my interface. There is a control called "Rich Text Editor" (http://docs.oracle.com/javafx/2/ui_controls/editor.htm) which you can just put into your application.
The trick here is that you can extract the editor's content as HTML using getHtmlText(), and then convert that HTML to RTF using... yes, a library. I suspect that jodconverter can do this too, but if not, you can look at this question: Convert HTML to RTF in java?.
This should give you a better idea of how to do your project. There are Java libraries to handle conversion between HTML and RTF, so you can use an HTML editor (provided by JavaFX). And of course, a .docx can be converted to HTML too. Let libraries do all the dirty work :).
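If you want to try the HTML-to-RTF step before picking a converter library, the JDK itself ships Swing's `HTMLEditorKit` and `RTFEditorKit`, and the two can be chained. This is a minimal sketch of that substitute route, not what jodconverter does. The `HtmlToRtf` class name is made up, and as far as I know most formatting is lost along the way, so treat it as a way to check the pipeline rather than the final converter.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper: HTML string in, RTF string out, using only JDK classes.
final class HtmlToRtf {
    static String convert(String html) {
        HTMLEditorKit htmlKit = new HTMLEditorKit();
        Document doc = htmlKit.createDefaultDocument();
        // Avoid a charset-change exception if the HTML carries a charset meta tag.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try {
            // Parse the HTML into a Swing document model.
            htmlKit.read(new StringReader(html), doc, 0);
            // Serialize that document model as RTF.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            new RTFEditorKit().write(out, doc, 0, doc.getLength());
            return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("HTML to RTF conversion failed", e);
        }
    }
}
```

The string returned by getHtmlText() can be passed straight to `HtmlToRtf.convert`, and the result written to a .rtf file.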
I have a project in which I have to get the title and author information from inside a PDF file (not from the metadata). So I am trying to read text from the PDF at given coordinates and to get the fonts of that text.
Is there any way to do that? Can anyone give advice? Or is there another solution for my project?
Thanks for any help and thoughts you share with me.
There are multiple PDF libraries for Java that allow you to extract text, my favourite being iText. For examples of text parsing, have a look at ExtractPageContentArea and the other examples from chapter 15 of iText in Action, 2nd edition.
Currently there is no example that makes use of the font information, but that information is available to the RenderListeners.
I want to convert an HTML page into an MS Word document. I want to know which APIs would be helpful, and whether there is any other way to do this.
The entire page is to be converted into a .doc (e.g. if there is a table in the HTML page, a similar table must be created in the Word document).
Apache POI does not provide an option to format the word document as in the HTML page.
I need something that can give me a completely formatted word document.
Some of the options I am looking at are JSOUP, docx4j, Jasper Reports, and JOD Converter.
I tried parsing the HTML page using JSOUP, and I get the contents of the page in my Java program. Now I need to pass these contents to a doc/docx file. Can docx4j help me get a formatted docx file?
Please help.
Thank you.
I would go with Ashwini Raman's suggestion. It won't work in every scenario: for a complex HTML document with many images and so on, Word will not do a good job. But for most cases it should be fine. Otherwise, there is a complex task ahead of you. You will have to parse your HTML document, using the jsoup library for example, and then use the docx4j library to create your Word document.
Links to both are here:
http://www.docx4java.org/trac/docx4j
http://jsoup.org/
Even when you do it this way, the formatting might be iffy.
To answer your original question: no, there is no ready-made library that does what you are expecting. At least I haven't come across any.
I found a workaround for this. First I get the parsed objects using JSOUP and pass them to a document template. I am now looking for options that make it easy to create templates and generate the document dynamically.
I have asked another question about that.
I have an Arabic PDF, and I want to parse it into a text document using Java. I have tried many times, and the English words parse successfully but the Arabic words don't.
Can anyone recommend a solution that will convert the Arabic words properly as well?
Several libraries come to mind. Apache Tika, iText or PDFBox will all more or less solve your problem. I must put in a word for Tika, though, as it supports language detection and can handle other document types too.
I think you can use iText for PDF manipulation in Java. It supports Arabic too.