I am trying to add an image to a document using WordML. I based the XML on the JPG example from http://www.codeproject.com/KB/office/WordML.aspx. I have managed to write Java that creates this exact XML (WordML) in the document; however, when I try to open the generated file in MS Word 2007, it says the file is invalid or corrupt.
The xml for the document that won't open is here:
http://pastebin.com/RNEkbvYG (Raw xml)
Sorry for the long paste; this is the shortest example I could create. There's a load of gumph at the top and bottom, but you can clearly see the image data in the middle.
http://pastebin.com/download.php?i=RNEkbvYG (download, rename from txt to xml and open with Word)
I would greatly appreciate it if anybody could look at the XML at the link above and see why it won't open in Word.
<w:pict>
<w:binData w:name="wordml://02000001.jpg">/9j/4AA..Xof/9k=</w:binData>
<v:shape id="_x0000_i1025" style="width:100%;height:auto" type="#_x0000_t75">
<v:imagedata o:title="network" src="wordml://02000001.jpg"/>
</v:shape>
</w:pict>
is 2003 WordML. There is no w:binData element in the 2007 docx format / ECMA standard.
You might try docx4j instead :-)
See http://dev.plutext.org/svn/docx4j/trunk/docx4j/src/main/java/org/docx4j/samples/AddImage.java
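To check quickly whether a generated XML file is still using the 2003 vocabulary, a minimal sketch along these lines may help. It uses only the JDK's XML parser; the class and method names (WordXmlFlavor, count2003BinData, rootNamespace) are made up for illustration and are not part of docx4j or Word. It assumes the XML is available as a string.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not part of any library.
class WordXmlFlavor {
    // Namespace used by Word 2003 XML (WordML), where w:binData lives.
    static final String WORDML_2003 =
            "http://schemas.microsoft.com/office/word/2003/wordml";
    // Main WordprocessingML namespace used inside 2007+ .docx packages.
    static final String WML_2006 =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Parses the XML with namespace awareness switched on.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Number of binData elements in the 2003 namespace.
    static int count2003BinData(String xml) throws Exception {
        return parse(xml).getElementsByTagNameNS(WORDML_2003, "binData").getLength();
    }

    // Namespace URI of the root element, e.g. WORDML_2003 or WML_2006.
    static String rootNamespace(String xml) throws Exception {
        return parse(xml).getDocumentElement().getNamespaceURI();
    }
}
```

If rootNamespace returns the 2006 namespace but count2003BinData is greater than zero, the file mixes the two vocabularies, which is one plausible reason for Word to reject it.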
I'm using Java and Apache POI to read a Word document template and generate a new document from it. The original document has line breaks entered with "shift-enter"; I thought that would allow a line break while continuing the paragraph. But as I iterate through the runs, I seem to get an empty string at that point. There are 'flags' on the run; do they indicate the line break somehow? I want to keep it in the resulting document; I think what's happening is that I detect it as an empty string and leave it out. How can I detect its presence so I can keep it in the resulting document after I've processed the template?
As a side note, are those flags documented anywhere?
I suspect you are talking about XWPF, the part of Apache POI that handles the Office Open XML file format *.docx.
All Office Open XML file formats are ZIP archives containing XML files and other files in a special directory structure. So one can simply unzip a *.docx file and have a look inside.
For an explicit line break (Shift+Enter) you will find the following XML in /word/document.xml in that ZIP archive:
...
<w:r ...>
<w:br/>
</w:r>
...
So it is a run element (w:r) containing one or more break elements (w:br).
The run element (w:r) is the low-level source for an XWPFRun in Apache POI. It is represented by an org.openxmlformats.schemas.wordprocessingml.x2006.main.CTR, which can be obtained via XWPFRun.getCTR.
So if you have an XWPFRun run, you can get the explicit line breaks like so:
...
for (int i = 0; i < run.getCTR().getBrList().size(); i++) {
System.out.println("<BR />");
}
...
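To confirm what the raw XML contains without going through POI at all, here is a minimal sketch that uses only the JDK (java.util.zip plus the built-in DOM parser). The class name DocxBreakCounter and its method are hypothetical. It assumes the main part is stored under word/document.xml, and it treats a w:br with no w:type attribute, or with w:type="textWrapping", as a Shift+Enter line break, as opposed to a page or column break.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Apache POI.
class DocxBreakCounter {
    static final String WML =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Counts text-wrapping <w:br> elements in word/document.xml.
    // Returns -1 if the archive has no such entry.
    static int countLineBreaks(byte[] docxBytes) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docxBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                // Copy the entry first so the XML parser never touches the zip stream.
                ByteArrayOutputStream xml = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int n;
                while ((n = zip.read(buffer)) != -1) {
                    xml.write(buffer, 0, n);
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                Document doc = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(xml.toByteArray()));
                NodeList breaks = doc.getElementsByTagNameNS(WML, "br");
                int count = 0;
                for (int i = 0; i < breaks.getLength(); i++) {
                    // getAttributeNS returns "" when the attribute is absent.
                    String type = ((Element) breaks.item(i)).getAttributeNS(WML, "type");
                    if (type.isEmpty() || "textWrapping".equals(type)) {
                        count++;
                    }
                }
                return count;
            }
        }
        return -1;
    }
}
```

Calling countLineBreaks(Files.readAllBytes(Paths.get("template.docx"))) on the template (the file name is just an example) shows whether the breaks are really present before any POI processing happens.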
Is this documented anywhere?
There is ECMA-376 for Office Open XML.
The org.openxmlformats.schemas.wordprocessingml.x2006.main.* classes are auto-generated from this specification. Unfortunately, there is no publicly available API documentation. So one needs to download the sources of ooxml-schemas (up to Apache POI 4) or poi-ooxml-full (from Apache POI 5 on) and then generate the Javadoc from them.
I am using PDFBox in Java to extract text from a PDF file. This is how I load the file:
PDDocument document = PDDocument.load(new File(path1));
As you can see, it opens the file and loads everything inside it. This can cause a problem when, say, I try to load a huge file containing 10 million words of text: it throws an OutOfMemoryError: Java heap space.
I actually tested this and it does throw an error. The culprit was the line above.
Is there a way to open the file without loading its content in PDFBox?
I appreciate any suggestion.
Use:
PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());
This sets up memory buffering to use only temporary files, with no size restriction.
I'm trying to extract some information from a set of PDFs. This works so far, but one PDF is giving me trouble.
I'm using PDFBox 1.8.8, with Java 7.
PDDocument document = PDDocument.load(pdfFile);
PDFTextStripper stripper = new PDFTextStripper();
System.out.println("File: "+pdfFile.getAbsolutePath()+" readable: "+pdfFile.canRead()+" size: "+pdfFile.length());
System.out.println(stripper.getText(document));
It just prints
File: /foo/bar/mypdf.pdf readable: true size: 1267743
Then it terminates. Usually I use the writeText method and funnel the text through a stream, but above code was used for simplification. I've tried converting the PDF with pdftotext - it works just like the others.
I get no exception, no nothing. Any ideas?
EDIT:
Additional Info: Created with Acrobat Distiller 9.0.0 (Windows), Format PDF-1.6; The other PDFs are Version 1.4 and 1.5
Doesn't seem to contain exotic characters. I can mark/copy text in Evince PDF-viewer
EDIT2:
Dang it. File property dialog (Nautilus) said "Security: No", but pdfinfo gives me:
Encrypted: yes (print:yes copy:no change:no addNotes:no algorithm:AES)
Any way to circumvent that? After all, pdftotext could get the text out.
The document was "encrypted" (write protected), but with no user password set. This Stack Overflow answer shows how you can remove the encryption and simply read the file: remove encryption from pdf with pdfbox, like qpdf
I am using docx4j to generate PDF documents based on Microsoft Word templates.
The Word template contains some Mail Merge fields, which should be replaced.
I am able to replace the Mail Merge fields, but they are displayed incorrectly in the generated PDF.
In the output PDF I always get text like MERGEFIELD ContractNo * MERGEFORMAT.
In Word, you can switch between field views with ALT+F9, but how can I get the generated PDF to show a different view of the mail merge fields?
Instead of MERGEFIELD ContractNo * MERGEFORMAT I want to show only ContractNo.
Should "just work" with a current nightly build (as opposed to 2.8.1).
Use Content Controls instead of MERGEFIELDs. I've posted an example on github complete with a sample template and a sample XML data file: https://github.com/sylnsr/docx4j-ws ...
MergeFields are deprecated and not (IMHO) recommended for continued use.
I'm currently writing some MATLAB code to interact with my company's internal reports database. So far I can access the HTML abstract page using code which looks like this:
import com.mathworks.mde.desk.*;
wb=com.mathworks.mde.webbrowser.WebBrowser.createBrowser;
wb.setCurrentLocation(ReportURL(8:end));
pause(1);
s={};
while isempty(s)
s=char(wb.getHtmlText);
pause(.1);
end
desk=MLDesktop.getInstance;
desk.removeClient(wb);
I can extract various bits of information from the HTML text, which ends up in the variable s. However, the PDF of the report is accessed via what I believe is a JavaScript command (onClick="gotoFulltext('','[Report Number]')").
Any ideas as to how I execute this JavaScript command and get the contents of the PDF file into a MATLAB variable?
(MATLAB sits on top of Java, so I believe a Java solution would work...)
I think you should take a look at the JavaScript that is being called and see what the final request to the webserver looks like.
You can do this quite easily in Firefox using the Firebug plugin.
https://addons.mozilla.org/en-US/firefox/addon/1843
Once you have found the real server request then you can just request this URL or post to this URL instead of trying to run the JavaScript.
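Since the question mentions that a Java solution would do, here is a minimal sketch of requesting the discovered URL directly, using only java.net from the JDK. The class ReportDownloader and its methods are hypothetical, and the sketch assumes a plain GET with no authentication or cookies; an internal database may well need more than that.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper for fetching the PDF once the real URL is known.
class ReportDownloader {
    // Reads a stream to the end and returns its bytes.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    // Issues a GET request and returns the raw response body.
    static byte[] fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        try (InputStream in = conn.getInputStream()) {
            return readAll(in);
        } finally {
            conn.disconnect();
        }
    }

    // A PDF file starts with the bytes "%PDF-"; handy for checking that the
    // server returned the report rather than an HTML error or login page.
    static boolean looksLikePdf(byte[] data) {
        return data.length >= 5
                && data[0] == '%' && data[1] == 'P' && data[2] == 'D'
                && data[3] == 'F' && data[4] == '-';
    }
}
```

The bytes returned by fetch can then be checked with looksLikePdf before being saved or passed on to a text extractor.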
Once you have gotten the correct URL (a la the answer from pjp), your next problem is to "get the contents of the PDF file into a MATLAB variable". Whether or not this is possible may depend on what you mean by "contents"...
If you want to get the raw data in the PDF file, I don't think there is a way currently to do this in MATLAB. The URLREAD function was the first thing I thought of to read content from a URL into a string, but it has this note in the documentation:
s = urlread('url') reads the content
at a URL into the string s. If the
server returns binary data, s will
be unreadable.
Indeed, if you try to read a PDF as in the following example, s contains some text intermingled with mostly garbage:
s = urlread('http://samplepdf.com/sample.pdf');
If you want to get the text from the PDF file, you have some options. First, you can use URLWRITE to save the contents of the URL to a file:
urlwrite('http://samplepdf.com/sample.pdf','temp.pdf');
Then you should be able to use one of two submissions on The MathWorks File Exchange to extract the text from the PDF:
Extract text from a PDF document by Dimitri Shvorob
PDF Reader by Tom Gaudette
If you simply want to view the PDF, you can just open it in Adobe Acrobat with the OPEN function:
open('temp.pdf');
wb=com.mathworks.mde.webbrowser.WebBrowser.createBrowser;
wb.executeScript('javascript:alert(''Some code from a link'')');
desk=com.mathworks.mde.desk.MLDesktop.getInstance;
desk.removeClient(wb);