How is shift-enter represented in a Word doc? - java

I'm using Java and Apache POI to read a Word document template and generate a new document from it. The original document has line breaks entered with Shift+Enter; I thought that would allow a line break while continuing the paragraph. But as I sequence through the runs, I seem to get an empty string at that point. There are 'flags' on the run; do they indicate the line break somehow? I want to keep the break in the resulting document; I think what's happening is that I detect it as an empty string and leave it out. How can I detect its presence so I can preserve it in the resulting document after I've processed the template?
As a side note, are those flags documented anywhere?

I suspect you are talking about XWPF of Apache POI, which is the Apache POI component that handles the Office Open XML file format *.docx.
All Office Open XML file formats are ZIP archives containing XML files and other files in a special directory structure. So one can simply unzip a *.docx file and have a look into it.
For an explicit line break (Shift+Enter) you will find the following XML in /word/document.xml in that ZIP archive:
...
<w:r ...>
  <w:br/>
</w:r>
...
So it is a run element (w:r) containing one or more break elements (w:br).
The run element (w:r) is the low-level source for an XWPFRun in Apache POI. It is represented by an org.openxmlformats.schemas.wordprocessingml.x2006.main.CTR, which can be obtained via XWPFRun.getCTR().
So if you have an XWPFRun run, you can get the explicit line breaks like so:
...
// each <w:br/> element in the run is one explicit line break
for (int i = 0; i < run.getCTR().getBrList().size(); i++) {
    System.out.println("<BR />");
}
...
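For example, here is a minimal sketch (my own illustration, not from the original post) of copying a paragraph into a new document while preserving those explicit breaks, using XWPFRun.addBreak() to re-create each <w:br/>:

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

// Hypothetical helper: copy one source paragraph into the target document,
// re-adding an explicit line break for every <w:br/> element found in a run.
static void copyParagraph(XWPFParagraph source, XWPFDocument target) {
    XWPFParagraph newParagraph = target.createParagraph();
    for (XWPFRun sourceRun : source.getRuns()) {
        XWPFRun newRun = newParagraph.createRun();
        String text = sourceRun.text();
        if (!text.isEmpty()) {
            newRun.setText(text);
        }
        // one addBreak() per <w:br/> in the source run
        int breakCount = sourceRun.getCTR().getBrList().size();
        for (int i = 0; i < breakCount; i++) {
            newRun.addBreak();
        }
    }
}

Note that this simplified sketch appends the breaks after the run's text; in a real template the <w:t> and <w:br/> children of the run may be interleaved, so you may need to walk the CTR children in order.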
Is this documented anywhere?
There is ECMA-376 for Office Open XML.
The org.openxmlformats.schemas.wordprocessingml.x2006.main.* classes are auto-generated from this specification. Unfortunately there is no publicly available API documentation for them. So one needs to download the sources from ooxml-schemas (up to Apache POI 4) or poi-ooxml-full (from Apache POI 5 onward) and generate the Javadoc from them.

Related

Apache POI replaceText() side effect, changes line space

I am using POI 3.15 in Java to replace some text in my .doc template.
private HWPFDocument replaceText(HWPFDocument doc, String findText, String replaceText) {
    Range r = doc.getRange();
    for (int i = 0; i < r.numSections(); ++i) {
        Section s = r.getSection(i);
        for (int j = 0; j < s.numParagraphs(); j++) {
            Paragraph p = s.getParagraph(j);
            for (int k = 0; k < p.numCharacterRuns(); k++) {
                CharacterRun run = p.getCharacterRun(k);
                String text = run.text();
                if (text.contains(findText)) {
                    run.replaceText(findText, replaceText);
                }
            }
        }
    }
    return doc;
}
After I save the document, all of the content is correct, but the style of the document is not: the spacing between lines has changed. The original gap between lines is missing and all lines are packed closely together.
Why? How do I keep the style of my template?
The HWPF library may not support all features that exist in your doc file, and this may result in changed formatting. It may also result in unreadable files.
Some years ago I created a customized HWPF library which could properly modify and write a wide variety of doc files for one of my clients, and I gained a lot of experience with the doc file format and the HWPF library.
The problem is that one has to properly support, in HWPF, all features that may be present in the doc file. For instance, if clipart is included in the file, there are separate internal tables which maintain the position and properties of the clipart. If the content (text) is changed without adjusting the addresses in those other internal tables, formatting can be shifted, ignored or lost (or, in the worst case, the document becomes unreadable).
I am not sure about the status of HWPF these days, but I expect that it does not fully support the main relevant doc file features.
If you want to use HWPF for modifying / writing doc files, you may succeed with files that have a reduced "feature set": for instance no tables, no clipart, no text boxes, things like that. If you need to support almost any document a user may provide, I'd recommend finding a different solution.
One option could be to use RTF files that are named .doc. Or use the XWPF library, which works with .docx files, as sketched below.
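If you do switch to XWPF / .docx, a minimal sketch of the same text replacement could look like this (my own sketch, not part of the original answer; it assumes the placeholder is contained within a single run, which Word does not guarantee, since it often splits text across several runs):

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

// Replace findText with replaceText in every run of a .docx document.
private XWPFDocument replaceText(XWPFDocument doc, String findText, String replaceText) {
    for (XWPFParagraph p : doc.getParagraphs()) {
        for (XWPFRun run : p.getRuns()) {
            String text = run.getText(0);
            if (text != null && text.contains(findText)) {
                run.setText(text.replace(findText, replaceText), 0);
            }
        }
    }
    return doc;
}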

issues using apache tika Parser object to parse .doc and .docx file formats

When I try to use org.apache.tika.parser.Parser and DefaultDetector() to detect and parse the .doc and .docx file formats, I get an error (not an exception) thrown from the Tika jars, and it doesn't have a helpful stack trace that I could post here. I can confirm that it happens only for .doc and .docx; PDF, JPEG and text files are fine. Has anyone come across this problem with the .doc and .docx file formats? Is there a solution that you have adopted?
My Code is the following:
unzippedBytes = loadUnzippedByteCode(attachment.getContents()); /* This is a utility method written using the native Java Zip library - returns a byte array byte[] */
/* All the objects below were declared beforehand, but not initialised until now */
parseContextObj = new ParseContext();
dObj = new DefaultDetector();
detectedParser = new AutoDetectParser(dObj);
parseContextObj.set(Parser.class, detectedParser);
OutputStream outputstream = new ByteArrayOutputStream();
metadata = new Metadata();
InputStream input = TikaInputStream.get(unzippedBytes, metadata);
ContentHandler handler = new BodyContentHandler(outputstream);
detectedParser.parse(input, handler, metadata, parseContextObj); // This is where it throws NoSuchMethodError - cannot understand why and also cannot get the stack trace - using tika 1.10
input.close();
The code above is something I found in another SO question and decided to use for my work. Also, the byte[] that I am using comes from the very old Struts 1.0 FormFile interface (getFileData(), which returns byte[]). I used to use Bullhorn's irex parser, but decided to switch to Tika for numerous reasons. The byte[] works fine with irex, but has issues whenever I try to parse .docx and .doc contents.
The following is the stack trace which I masked certain parts of due to privacy reasons:
2016-01-15 16:21:06,947 [http-apr-80-exec-3] [ERROR] XXXXX.XXXX.XXXXService - java.lang.NoSuchMethodError: org.apache.poi.util.POILogger.log(I[Ljava/lang/Object;)V
at org.apache.poi.openxml4j.opc.PackageRelationshipCollection.parseRelationshipsPart(PackageRelationshipCollection.java:313)
at org.apache.poi.openxml4j.opc.PackageRelationshipCollection.<init>(PackageRelationshipCollection.java:163)
at org.apache.poi.openxml4j.opc.PackageRelationshipCollection.<init>(PackageRelationshipCollection.java:131)
at org.apache.poi.openxml4j.opc.PackagePart.loadRelationships(PackagePart.java:561)
at org.apache.poi.openxml4j.opc.PackagePart.<init>(PackagePart.java:109)
at org.apache.poi.openxml4j.opc.PackagePart.<init>(PackagePart.java:80)
at org.apache.poi.openxml4j.opc.PackagePart.<init>(PackagePart.java:125)
at org.apache.poi.openxml4j.opc.ZipPackagePart.<init>(ZipPackagePart.java:78)
at org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:245)
at org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:684)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:227)
at org.apache.tika.parser.pkg.ZipContainerDetector.detectOPCBased(ZipContainerDetector.java:208)
at org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:145)
at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:88)
at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
I realised that my classpath has POI jar version 2.5.1, and according to the Maven Central repo I am a dinosaur (it seems); is that possibly why? I am also getting the error after switching to versions 3.13 and 2.60 for the POI and xmlbeans artifacts respectively (suggested by #venkyreddy in that answer).
UPDATE
I tried building a new project separately from my original work and used tika-app-1.10.jar ONLY in my classpath. I also investigated tika-app-1.10.jar and found out that all the POI dependencies are actually there, including xmlbeans and 'xml-schema'. After keeping only tika-app-1.10.jar in my classpath, I am getting the following Error (not Exception):
java.lang.NoClassDefFoundError: org/apache/poi/POIXMLTypeLoader
at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:158)
at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:167)
at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:119)
at org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:59)
at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:204)
at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at xxx.xxx.xxx.xxx.xxxxxAttachmentWithTika(xxxService.java:792)
I browsed the package and couldn't find any POIXMLTypeLoader class. Is this a known issue? Could someone please respond?
Make sure there are no outdated POI jars and use the version of POI which matches the version of Tika that you are trying to use.
The class POIXMLTypeLoader was added to POI after POI 3.13 was released, so it seems you are somehow mixing in newer versions. Only release POI 3.14-beta1 knows about this class! Make sure you do not include that version somehow.
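One way to find out which jar a conflicting class is actually loaded from at runtime is to print its code source; a small diagnostic sketch (the classes checked here are simply the ones appearing in the stack traces above):

// Prints the jar (or directory) each class was loaded from,
// which makes mixed POI versions on the classpath easy to spot.
System.out.println(org.apache.poi.util.POILogger.class
        .getProtectionDomain().getCodeSource().getLocation());
System.out.println(org.apache.poi.openxml4j.opc.OPCPackage.class
        .getProtectionDomain().getCodeSource().getLocation());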

extracting text using pdfclown function 'textextractor'

I am getting an error while using the TextExtractor of the PDF Clown library. The code I used is:
TextExtractor textExtractor = new TextExtractor(true, true);
for (final Page page : file.getDocument().getPages())
{
    System.out.println("\nScanning page " + (page.getIndex()+1) + "...\n");
    // Extract the page text!
    Map textStrings = textExtractor.extract(page);
}
A part of the error I got is:
Exception in thread "main" java.lang.ExceptionInInitializerError
    at org.pdfclown.documents.contents.fonts.Encoding.put
    at ......
    at ......
    <about 30 such lines>
Caused by: java.lang.NullPointerException
    at java.io.Reader.<init>(Reader.java:78)
    at java.io.InputStreamReader.<init>
    <about 30 lines more>
I also found out that this happens when my PDF contains some bullets, for example:
item 1
item 2
item 3
Please help me extract the text from such PDFs.
(The following comment turned out to be the solution:)
Using your highlighter.java class (provided on your Google Drive in a comment) together with the current PDF Clown trunk version as a jar, the PDF was processed without incident, in particular without a NullPointerException (though the highlights were partially not at the right position).
After looking at your shared Google Drive contents, though, I assumed you did not use a PDF Clown jar but instead merely compiled the classes from the distribution source folder and used them.
The PDF Clown jar files contain additional resources, though, which your setup consequently did not include. Thus:
Your highlighter.java has to be used with pdfclown.jar in the classpath.

Add image using WordML

I am trying to add an image to a document using WordML. I have used the XML from the jpg example here as a basis: http://www.codeproject.com/KB/office/WordML.aspx. I have managed to write Java which creates this exact XML (WordML) in the document; however, when I try to open the generated file in MS Word 2007, it says the file is invalid or corrupt.
The xml for the document that won't open is here:
http://pastebin.com/RNEkbvYG (Raw xml)
Sorry for the long paste, this is the shortest example I could create; there's a load of gumph at the top and bottom, but you can clearly see the image data in the middle.
http://pastebin.com/download.php?i=RNEkbvYG (download, rename from txt to xml and open with word)
I would greatly appreciate if anybody could look at the xml at the link above and see if they can see why it won't open in word.
<w:pict>
  <w:binData w:name="wordml://02000001.jpg">/9j/4AA..Xof/9k=</w:binData>
  <v:shape id="_x0000_i1025" style="width:100%;height:auto" type="#_x0000_t75">
    <v:imagedata o:title="network" src="wordml://02000001.jpg"/>
  </v:shape>
</w:pict>
That is 2003 WordML. There is no w:binData element in the 2007 docx format / ECMA standard.
You might try docx4j instead :-)
See http://dev.plutext.org/svn/docx4j/trunk/docx4j/src/main/java/org/docx4j/samples/AddImage.java
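For orientation, a rough sketch along the lines of that AddImage sample (the file names, alt text and ID arguments here are placeholders of mine; check the exact method signatures against the docx4j version you use):

import java.io.File;
import java.nio.file.Files;
import org.docx4j.dml.wordprocessingDrawing.Inline;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.BinaryPartAbstractImage;
import org.docx4j.wml.Drawing;
import org.docx4j.wml.ObjectFactory;
import org.docx4j.wml.P;
import org.docx4j.wml.R;

public class AddImageSketch {
    public static void main(String[] args) throws Exception {
        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
        byte[] bytes = Files.readAllBytes(new File("network.jpg").toPath());

        // Create the image part and an inline drawing that references it
        BinaryPartAbstractImage imagePart =
                BinaryPartAbstractImage.createImagePart(wordMLPackage, bytes);
        Inline inline = imagePart.createImageInline("network", "network diagram", 1, 2, false);

        // Wrap the inline drawing in a run inside a paragraph
        ObjectFactory factory = new ObjectFactory();
        P p = factory.createP();
        R r = factory.createR();
        Drawing drawing = factory.createDrawing();
        drawing.getAnchorOrInline().add(inline);
        r.getContent().add(drawing);
        p.getContent().add(r);

        wordMLPackage.getMainDocumentPart().addObject(p);
        wordMLPackage.save(new File("image.docx"));
    }
}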

How to convert .doc or .docx files to .txt

I'm wondering how you can convert Word .doc/.docx files to text files through Java. I understand that there's an option where I can do this through Word itself but I would like to be able to do something like this:
java DocConvert somedocfile.doc converted.txt
Thanks.
If you're interested in a Java library that deals with Word document files, you might want to look at e.g. Apache POI. A quote from the website:
Why should I use Apache POI?
A major use of the Apache POI api is for Text Extraction applications such as web spiders, index builders, and content management systems.
P.S.: If, on the other hand, you're simply looking for a conversion utility, Stack Overflow may not be the most appropriate place to ask for this.
Edit: If you don't want to use an existing library but do all the hard work yourself, you'll be glad to hear that Microsoft has published the required file format specifications. (The Microsoft Open Specification Promise lists the available specifications. Just google for any of them that you're interested in. In your case, you'd need e.g. the OLE2 Compound File Format, the Word 97 binary file format, and the Open XML formats.)
Use the command line utility Apache Tika. Tika supports a wide range of formats (e.g. doc, docx, pdf, html, rtf ...):
java -jar tika-app-1.3.jar -t somedocfile.doc > converted.txt
Programmatically:
File inputFile = ...;
Tika tika = new Tika();
String extractedText = tika.parseToString(inputFile);
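For reference, here is a self-contained sketch of the little command-line tool the question asks for, built on the Tika facade (the class name DocConvert and the output handling are my own; only the Tika call comes from the snippet above):

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Usage: java DocConvert somedocfile.doc converted.txt
public class DocConvert {
    public static void main(String[] args) throws IOException, TikaException {
        Tika tika = new Tika();
        String text = tika.parseToString(new File(args[0]));
        Files.write(Paths.get(args[1]), text.getBytes(StandardCharsets.UTF_8));
    }
}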
You can use Apache POI too. It has a tool for extracting text from doc/docx (Text Extraction). If you want to extract only the text, you can use the code below. If you want to extract rich text (such as formatting and styling), you can use Apache Tika.
Extract doc:
InputStream fis = new FileInputStream(...);
POITextExtractor extractor;
if (fileName.toLowerCase().endsWith(".docx")) {
    // .docx (Office Open XML)
    XWPFDocument doc = new XWPFDocument(fis);
    extractor = new XWPFWordExtractor(doc);
} else {
    // .doc (binary format)
    POIFSFileSystem fileSystem = new POIFSFileSystem(fis);
    extractor = ExtractorFactory.createExtractor(fileSystem);
}
String extractedText = extractor.getText();
You should consider using this library. It's Apache POI.
Excerpt from the website
In short, you can read and write MS Excel files using Java. In addition, you can read and write MS Word and MS PowerPoint files using Java. Apache POI is your Java Excel solution (for Excel 97-2008). We have a complete API for porting other OOXML and OLE2 formats and welcome others to participate.
Docmosis can read a doc and spit out the text in it. Requires some infrastructure to be installed (such as OpenOffice).
You can also use JODConverter.
