pdf parse to text in java - java

I have an Arabic PDF that I want to parse into a text document using Java. I have tried many times: the English words parse successfully, but the Arabic words don't.
Can anyone recommend a solution that will convert the Arabic words properly as well?

Several libraries come to mind: Apache Tika, iText, or PDFBox will all more or less solve your problem. I must put in a word for Tika, though, as it supports language detection and can also handle other document types.

I think you can use iText for pdf manipulation using Java. It supports Arabic too.

Related

Java Detect language of Content from large String

I am working on a project where the PDFs contain content in both English and Spanish. I am interested only in the English part, which I want to save to a database. I am using Apache PDFBox to extract the text. How can I skip the Spanish content and keep only the English text? I tried libraries such as Apache Tika and https://code.google.com/p/language-detection/, but they do not give correct results in some cases. Can anyone suggest a reliable solution or any other way to achieve this?
Thanks in advance.

Issue with russian-letter links in Aspose pdf-viewer

I am facing an encoding problem while using the embedded Aspose PDF previewer and doc-to-PDF converter in my Java project.
When I convert a .doc file containing clickable links with Russian characters to a PDF file using the com.aspose.words.Document.saveToPdf(...) method, I get a good PDF file. But when I open this file in the standard Aspose PDF previewer and follow those links, I see a "wrong url" error.
The links themselves look fine (the Russian letters display correctly), but the mouseover tooltip shows wrongly encoded symbols instead of the Russian ones.
How can I deal with this problem?
Should I convert the doc file with some specific options, or should I configure the PDF previewer differently?
The Document.saveToPdf() method is no longer available in the latest version of the library; you can use the Document.save("filename.ext") method to save to PDF or any other supported format.
Try the latest version; chances are that this bug has already been fixed. When I converted a Word document with Russian letters in a link to PDF, the encoding seemed to work fine.
I work as a Developer Evangelist for Aspose.

convert docx to doc with java

I have legacy software that produces an XML file and then, with the help of docx4j, a docx document. I must also create a Microsoft .doc document from the XML file with Java.
How can I do that? I'd really appreciate any help.
Thanks
Look into Apache POI. It's pretty much the de facto standard for modifying Microsoft documents with Java.
docx4j has POI as a dependency, and POI has reasonable support for the legacy binary doc format (hwpf). So you could use that to convert to doc without introducing additional dependencies. Basically, iterate through your content, and emit each paragraph/table/image in doc format. That would be the reverse of convert/in/Doc.java.
However, the devil is in the detail, and it would be a lot of work if your documents contain a variety of features. This holds whether you go from docx4j to binary doc (hwpf) or from POI's own xwpf to hwpf, since POI doesn't have a common interface across the two of them.
So instead of using POI for this, I'd use JODConverter to drive LibreOffice (or OpenOffice, their docx features are a bit different) to convert docx to legacy binary .doc.
The JODConverter approach is definitely the path of least resistance, and will generally give good results. The downside with it is that if you find something which isn't supported properly, you'll have to wait for the LO/OO guys to fix it, which wouldn't be the case if you did decide to build binary doc output for docx4j using POI. If you did build this, we'd happily accept it as a contribution :-)
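For a quick experiment before wiring in JODConverter, the same LibreOffice conversion can be triggered from Java by calling the LibreOffice command line directly. The sketch below is not the JODConverter API; it simply shells out with ProcessBuilder. The class name, method names, and the assumption that a `soffice` binary is installed and reachable are all illustrative.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: converts docx to legacy .doc by running LibreOffice headless.
class DocxToDocConverter {

    // Builds the command line; sofficeBinary is an assumption (e.g. "soffice" on the PATH).
    static List<String> buildCommand(String sofficeBinary, Path docx, Path outDir) {
        return Arrays.asList(
                sofficeBinary,
                "--headless",            // run without a GUI
                "--convert-to", "doc",   // target format chosen by extension
                "--outdir", outDir.toString(),
                docx.toString());
    }

    // Runs the conversion and returns the process exit code (0 usually means success).
    static int convert(String sofficeBinary, Path docx, Path outDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(sofficeBinary, docx, outDir))
                .inheritIO()             // show LibreOffice's own messages on the console
                .start();
        return process.waitFor();
    }
}
```

Starting a new LibreOffice process per document is slow, which is one reason to move to JODConverter once the approach has proven itself: it keeps an office instance running and talks to it instead.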

Putting data in doc template using poi

I need to insert data into a doc template and return the changed result. I decided to use POI, but if there are other ways to solve my problem I may change the library. I can change the string using Range.replaceText(), but this way I lose my text formatting, and the text itself turns into a plain document with no styles or tables. Is there any way to replace some characters while preserving the formatting? I tried RTFTemplate, but it was of little help because it depends on Spring, and I use Vaadin in my project.
Thanks in advance
Several years ago I solved a similar problem. The easiest way was to use RTF files as templates and avoid any parsing library, because MS Office RTF is not as standard as you might expect, and any library that tries to "understand" this format tends to lose part of the formatting.
So I just opened the RTF files as plain text and searched for my keywords within them. There was a problem when these keywords were split into several parts, separated by non-meaningful control sequences.
I will search my Delphi sources and will try to port them to Java later this week.
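The plain-text approach described in this answer might look roughly like the following in Java. This is a minimal sketch, not the answerer's Delphi code: the `%%NAME%%` placeholder syntax, the class name, and the method names are assumptions made for illustration. It only works when Word has stored each placeholder as one contiguous run, which is exactly the split-keyword problem mentioned above.

```java
import java.util.Map;

// Hypothetical helper: fills %%KEY%% placeholders in an RTF template treated as plain text.
class RtfTemplateFiller {

    // Escapes a value so it can be placed into RTF body text without breaking the markup.
    static String escapeRtf(String value) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '\\' || c == '{' || c == '}') {
                out.append('\\').append(c);      // RTF control characters
            } else if (c == '\n') {
                out.append("\\line ");           // line break inside a paragraph
            } else if (c < 128) {
                out.append(c);                   // plain ASCII passes through
            } else {
                // \uN takes a signed 16-bit decimal, followed by a fallback character
                out.append("\\u").append((int) (short) c).append('?');
            }
        }
        return out.toString();
    }

    // Replaces every %%KEY%% in the template with the escaped value for KEY.
    static String fill(String rtf, Map<String, String> values) {
        String result = rtf;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            String placeholder = "%%" + entry.getKey() + "%%";
            result = result.replace(placeholder, escapeRtf(entry.getValue()));
        }
        return result;
    }
}
```

Because the surrounding control words are never touched, fonts, styles, and tables in the template stay as Word wrote them. When loading the file, decoding the bytes as ISO-8859-1 (and writing them back the same way) keeps every byte of the template intact.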

Can Java POI write image to word document?

Does anyone know if it is possible?
Is there any sample code for this?
Or is there any other Java API that can do this?
The Office 2007 format is based on XML and so can probably be written to using XML tools. However there is this library which claims to be able to write DocX format word documents.
The only other alternative is to use a Java-COM Bridge and use COM to manipulate word. This is probably not a good idea though - I would suggest finding a simpler way.
For example, Word can easily read RTF documents and you can generate .rtf documents from within Java. You don't have to use the Microsoft Word format!
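To give an idea of how little is needed for the RTF route, here is a minimal sketch that writes a text-only RTF document using nothing but the JDK. The class and method names are made up for illustration, and the font and size in the header are arbitrary choices. Embedding an image would additionally require an RTF picture group, which is not shown here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: builds a very small RTF document that Word can open.
class MinimalRtfWriter {

    // Escapes RTF control characters and encodes non-ASCII characters as \uN?.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' || c == '{' || c == '}') {
                out.append('\\').append(c);
            } else if (c < 128) {
                out.append(c);
            } else {
                out.append("\\u").append((int) (short) c).append('?');
            }
        }
        return out.toString();
    }

    // Returns the complete RTF source with one paragraph per argument.
    static String buildRtf(String... paragraphs) {
        StringBuilder sb = new StringBuilder();
        // Header: RTF version 1, ANSI charset, default font 0 = Arial, 12pt (\fs is in half-points)
        sb.append("{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Arial;}}\\f0\\fs24 ");
        for (String paragraph : paragraphs) {
            sb.append(escape(paragraph)).append("\\par ");
        }
        sb.append("}");
        return sb.toString();
    }

    // Writes the document to disk; the output is pure ASCII thanks to escape().
    static void write(Path target, String... paragraphs) throws IOException {
        Files.write(target, buildRtf(paragraphs).getBytes(StandardCharsets.US_ASCII));
    }
}
```

Calling `MinimalRtfWriter.write(Paths.get("hello.rtf"), "Hello", "World")` should produce a file that Word opens as a two-paragraph document.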
As others have said POI isn't going to allow you to do anything really fancy - plus it doesn't support Office 2007+ formats. Treating MS Word as a component that provides this type of functionality via COM is most likely the best approach here (unless you are running on a non-Windows OS or just can't guarantee that Word will be installed on the machine).
If you do go the COM route, I recommend that you look into the JACOB project. You do need to be somewhat familiar with COM (which has a very steep learning curve), but the library works quite well and is easier than trying to do it in native code with a JNI wrapper.
If you are using docx, you could try docx4j.
See the AddImage sample
Surely:
Take a look at this: http://code.google.com/p/java2word
Word 2004+ is XML based. The above framework takes the image, converts it to a Base64 representation, and adds it to the XML.
When you open your Word document, your image will be there.
It is as simple as this:
IDocument myDoc = new Document2004();
myDoc.getBody().addEle("path/myImage.png");
Java2Word is an API for generating Word documents from Java code. J2W takes care of all the implementation and XML generation behind the scenes.
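The Base64 step mentioned above can be illustrated with the JDK alone. The sketch below is not Java2Word's actual implementation; it only shows the general idea of turning image bytes into Base64 text and wrapping them in a `w:binData` element, which is how the Word 2003 XML format carries embedded binary data. The class name, method name, and the `wordml://` naming are assumptions for illustration, and a real document needs further markup around this element to display the picture.

```java
import java.util.Base64;

// Hypothetical helper: shows how image bytes can be carried inside Word XML as Base64 text.
class WordMlImageSketch {

    // Encodes the raw image bytes and wraps them in a binData element with the given name.
    static String toBinData(String name, byte[] imageBytes) {
        String base64 = Base64.getEncoder().encodeToString(imageBytes);
        return "<w:binData w:name=\"wordml://" + name + "\">" + base64 + "</w:binData>";
    }
}
```

In practice the bytes would come from something like `Files.readAllBytes(Paths.get("path/myImage.png"))`.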
As far as can be gathered from the project website: no.
POI's HWPF can extract an MS Word document's text and perform simple modifications (basically deleting and inserting text).
AFAIK it can't do much more than that.
Also keep in mind that HWPF works only with the older MS Word (97) format, not the latest ones.
I'm not sure whether Java can do it directly out of the box, but I've read about a component that can do pretty much anything in terms of automating Word document generation without having Word installed: Aspose.Words.
JasperReports uses this API as an alternative to POI, because it supports images:
JExcelAPI
I haven't tried it yet and don't know how good or bad it is.
