How to show or read a docx file - java

I am new to rendering files in Android, and I want to render or display a docx file in my application.
I have already extracted the text from the docx file, but now I want to extract the images from it as well.
I've found several ways to display images in pure Java, but are there any good examples for Android?
I tried this code to fetch the images, but it is not working:
public void extractImages(Document xmlDoc)
{
    NodeList binDataList = xmlDoc.getElementsByTagName("w:drawings");
    String fileName = "";
    Node currentNode;
    for(int i = 0; i < binDataList.getLength(); i++)
    {
        currentNode = binDataList.item(i);
        if(currentNode.getNodeType() == Node.ELEMENT_NODE && ((Element)currentNode).hasAttribute("w:name"))
        {
            File newImageFile = new File(picDirectory, ((Element)currentNode).getAttribute("w:name").replaceFirst("wordml://", ""));
            if(newImageFile.exists())
            {
                //Image was already written earlier; nothing to do
            }
            else
            {
                if(writeImage(newImageFile, currentNode))
                {
                    //Print some success message
                }
            }
        }
    }
}

Have a look at AndroidDocxToHtml, which I made to demonstrate using docx4j on Android.
A couple of caveats.
First, that project does not include all docx4j dependencies, only the ones required for docx to HTML conversion. So if you want to do other things, you may need additional dependencies.
Second, docx4j requires JAXB - see this blog post re JAXB on Android - and JAXB context init on app startup takes a while depending on the device. There are ways to work around this, but they take extra effort.
If all you want to do is extract the images, and you don't care how they relate to the text, you could just look for image parts. You might use OpenXML4J for that, and avoid JAXB.

The easiest way to create an image in Android is to use the BitmapFactory factory methods.
The BitmapFactory class has methods for creating a Bitmap from a byte array, a file or an InputStream.
Once you have a Bitmap object you can display it by setting it on an ImageView in your layout using the setImageBitmap method.

You can just unzip the file (rename to .zip and open it) then you can investigate the folder structure, where the images are located etc.
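Because a docx file is an ordinary ZIP archive, the unzip approach can also be done in code with nothing but java.util.zip, which is available on Android as well. The following is a minimal sketch, not a definitive implementation: the class name DocxImageExtractor and the method extractImages are made up for illustration, and it assumes the images sit under word/media/, which is where Word conventionally stores them (the authoritative location is given by the package relationships).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: reads a docx as a ZIP archive and collects its image parts.
public class DocxImageExtractor {

    // Assumption: embedded images live under this folder inside the archive.
    private static final String MEDIA_PREFIX = "word/media/";

    /** Returns a map of archive entry name to raw image bytes, in archive order. */
    public static Map<String, byte[]> extractImages(InputStream docx) {
        Map<String, byte[]> images = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory() || !entry.getName().startsWith(MEDIA_PREFIX)) {
                    continue; // skip everything that is not an image part
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int read;
                while ((read = zip.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
                images.put(entry.getName(), out.toByteArray());
            }
        } catch (IOException e) {
            throw new IllegalStateException("Could not read docx archive", e);
        }
        return images;
    }
}
```

On Android, each byte array could then be passed to BitmapFactory.decodeByteArray(bytes, 0, bytes.length) and the resulting Bitmap shown with setImageBitmap. Note that this only yields the images themselves, not where they belong in the text.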

Related

PDF Box getXObjectNames() does not recognize bar code on my PDF, however it does recognize it on a PDF file I got off the internet

I am trying to fetch and read bar codes from my PDF using getXObjectNames() of PDResources.
My code is very similar to this link: https://issues.apache.org/jira/browse/PDFBOX-2124
If you see the above JIRA item, you will see a PDF file attached to it.
When I run the code on that PDF file I get the desired output (i.e. the bar code type is printed.)
However when I run it on my PDF, it does not recognize the bar code in it. (I have checked that the bar code is in fact an image and not text.)
Also it may sound weird, but it did work on my PDF once and I haven't made any changes since then, but it definitely does not work now. (I cannot share the PDF for some reason.)
Has anyone faced a similar issue?
Also this is my first question on Stack Overflow. Please tell me if I am wrong anywhere.
Here is a link to that pdf:
https://drive.google.com/file/d/1PzVApIePg4U9XL399BpAd2oeY6Q2tLEB/view?usp=drivesdk

In General
As you don't show your code but only describe it as very similar to that in PDFBOX-2124, and as you say you cannot share the PDF for some reason, I only have that code to analyze. Thus, I cannot tell what the issue really is; I can merely enumerate some possible problems.
First of all, that code only inspects the immediate resources of the given page for bitmap images:
PDResources pdResources = pdPage.getResources();
Map<String, PDXObject> xobjects = (Map<String, PDXObject>) pdResources.getXObjects();
if (xobjects != null)
{
    for (String key : xobjects.keySet())
    {
        PDXObject xobject = xobjects.get(key);
        if (xobject instanceof PDImageXObject)
        {
            PDImageXObject imageObject = (PDImageXObject) xobject;
            String suffix = imageObject.getSuffix();
            if (suffix != null)
            {
                BufferedImage image = imageObject.getImage();
                extractBarcodeArrayByAreas(image, this.maximumBlankPixelDelimiterCount);
            }
        }
    }
}
(PDFBOX-2124 PdPageBarcodeScanner method scan)
Bitmap images can also be stored elsewhere, e.g.
- in the separate resources of form xobjects, patterns, or Type 3 fonts used on the page; to find them one has to inspect other page resources, too, even recursively as the image might be a resource of a pattern used in a form xobject used on the page;
- in the separate resources of annotations of the page; thus, one has to recurse into annotation resources, too;
- inlined in some content stream; thus, one also has to search the content streams of the page itself, of page resources (recursively), and page annotations and their resources (recursively).
Furthermore, the bitmap might be given in some format (in particular with some colorspace) which PDFBox does not know how to export as BufferedImage.
Also the bar code may be constructed using some mask applied to a purely black bitmap in which case your code probably only tries to scan that purely black image.
Furthermore, you say:
"I have checked that the bar code is in fact an image and not text."
If you only checked that the bar code is not text, it may not only be a bitmap image but it can also be drawn by vector graphics instructions. Thus, you also have to check all content streams for vector graphics instructions drawing a bar code.
Also there may be combinations, e.g. a soft mask of vector graphics may be active when drawing a purely black inlined bitmap image etc.
And I'm sure I've missed a number of options here.
As a next step you may want to analyze the PDF you cannot share to find out how exactly that bar code is drawn.
Alternatively, you can render the page as a bitmap image and search that large bitmap for bar codes using zxing.
Sample PDF.pdf
You provided a link to a sample PDF. So I tried to extract the bar code using code very similar to that from PDFBOX-2124. Apparently the code there was for some PDFBox 2.0.0-SNAPSHOT, so it had to be corrected a bit. In particular the method getXObjectNames() you mention in the question title finally is used:
PDResources pdResources = pdPage.getResources();
int index = 0;
for (COSName name : pdResources.getXObjectNames()) {
    PDXObject xobject = pdResources.getXObject(name);
    if (xobject instanceof PDImageXObject)
    {
        PDImageXObject imageObject = (PDImageXObject) xobject;
        String suffix = imageObject.getSuffix();
        if (suffix != null)
        {
            BufferedImage image = imageObject.getImage();
            File file = new File(RESULT_FOLDER, String.format("Sample PDF-1-%s.%s", index, imageObject.getSuffix()));
            ImageIO.write(image, imageObject.getSuffix(), file);
            index++;
            System.out.println(file);
        }
    }
}
(ExtractImages test testExtractSamplePDFJayshreeAtak)
The output: one bitmap image is exported as "Sample PDF-1-0.tiff".
Thus, I cannot reproduce your issue:
"PDF Box getXObjectNames() does not recognize bar code on my PDF, however it does recognize it on a PDF file I got off the internet"
Obviously getXObjectNames() does return the name of the bitmap image xobject resource, and PDFBox exports it just fine.
Please check with your code whether as claimed the image is not extracted or whether some later step simply cannot deal with it.
If in your case the image is indeed not extracted,
- update your PDFBox version (I used the current development head but the newest released version should return the same),
- update your Java,
- check whether you have extra JAI jars that might cause trouble.
If in your case the image is extracted but not analyzed as expected by later code,
- debug more thoroughly to find out where the analysis fails,
- create a new question here focusing on the bar code image analysis,
- and provide enough code and the tiff file to allow people to actually reproduce the issue.

Is it possible to embed images in exported html

I'm trying to use the JasperHtmlExporterBuilder to generate an HTML version of a report that has images. The two options that I seem to have are:
1. Use JasperHtmlExporterBuilder and .setImagesURI("image?image="); This method relies on the code living in some kind of web container (like Tomcat) and generates IMG tags to grab images from the server.
2. Use the setOutputImagesToDir option of JasperHtmlExporterBuilder and force the images to be written separately to a local directory on disk.
I was wondering whether there might be a 3rd option where the images are base64 encoded and put directly into the HTML that's generated.
This would be ideal for me as I'd really like to return one complete result that's entirely self-contained.
One way I can "hack" it would be to use option #2 from above, then iterate over the images that get outputted, read them in, convert to base64 and manually replace the src part of the generated HTML.
Update: Below is my actual implementation based on the "hack" I describe above. It would be nice to do this better, but the code below is doing what I need (though it is not very memory friendly).
public String toHtmlString() throws IOException, DRException {
    File tempFile = Files.createTempFile("tempInvoiceHTML", "").toFile();
    Path tempDir = Files.createTempDirectory("");
    FileOutputStream fileOutputStream = new FileOutputStream(tempFile);
    JasperHtmlExporterBuilder htmlExporter = export.htmlExporter(fileOutputStream).setImagesURI("");
    htmlExporter.setOutputImagesToDir(true);
    htmlExporter.setImagesDirName(tempDir.toUri().getPath());
    htmlExporter.setUsingImagesToAlign(false);
    reportBuilder.toHtml(htmlExporter);
    String html = new String(Files.readAllBytes(Paths.get(tempFile.toURI())));
    for (Path path : Files.list(Paths.get(tempDir.toUri().getPath())).collect(Collectors.toList())) {
        String fileName = path.getFileName().toString();
        byte[] encode = Base64.encode(FileUtils.readFileToByteArray(path.toFile()));
        html = html.replaceAll(fileName, "data:image/png;base64,"+ new String(encode));
    }
    return html;
}
Is there a better way to do this?
Thanks!
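The replacement step of the "hack" above can be sketched with the JDK alone. This is only an illustration under stated assumptions: the class InlineImages and its methods are hypothetical names, the generated HTML is assumed to reference each image as src="<file name>", and the MIME type is guessed from the file extension instead of being fixed to image/png. It uses String.replace rather than replaceAll, so file names are not interpreted as regular expressions.

```java
import java.util.Base64;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: turns image references in exported HTML into data URIs.
public class InlineImages {

    /** Guesses a MIME type from the file extension, with a generic fallback. */
    static String mimeTypeFor(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".png")) return "image/png";
        if (lower.endsWith(".jpg") || lower.endsWith(".jpeg")) return "image/jpeg";
        if (lower.endsWith(".gif")) return "image/gif";
        if (lower.endsWith(".svg")) return "image/svg+xml";
        return "application/octet-stream";
    }

    /** Builds a data URI such as data:image/png;base64,.... */
    static String toDataUri(String fileName, byte[] content) {
        return "data:" + mimeTypeFor(fileName) + ";base64,"
                + Base64.getEncoder().encodeToString(content);
    }

    /** Replaces every src="name" attribute whose name is a key of the map. */
    public static String inlineImages(String html, Map<String, byte[]> imagesByName) {
        String result = html;
        for (Map.Entry<String, byte[]> image : imagesByName.entrySet()) {
            String name = image.getKey();
            // Literal replacement: the name is not treated as a regex.
            result = result.replace("src=\"" + name + "\"",
                    "src=\"" + toDataUri(name, image.getValue()) + "\"");
        }
        return result;
    }
}
```

Matching the whole src attribute rather than the bare file name avoids accidentally rewriting other text that happens to contain the same characters.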

Enterprise Architect scripting with java - create and modify linked document

My question: how can I create a new linked document and insert it into (or connect it to) an element (in my case a Note element of an activity diagram)?
The Element class supports these three methods:
GetLinkedDocument ()
LoadLinkedDocument (string Filename)
SaveLinkedDocument (string Filename)
I am missing a function like
CreateLinkedDocument (string Filename)
My goal: I create an activity diagram programmatically, and some notes are too big to be displayed nicely in the activity diagram. So I want to put this text into a linked document instead of directly in the activity diagram.
Regards
EDIT
Thank you very much to Uffe for the solution of my problem. Here is my solution code:
public void addLinkedDocumentToElement(Element element, String noteText) throws IOException {
    String filePath = "C:\\rtfNote.rtf";
    //convert string to ea-rtf format
    String rtfText = repository.GetFormatFromField("RTF", noteText);
    //create new file on the disk and write content to it;
    //try-with-resources closes the writer even if writing fails
    try (PrintWriter writer = new PrintWriter(filePath, "UTF-8")) {
        writer.write(rtfText);
    }
    //create linked document to element by loading the before created rtf file
    element.LoadLinkedDocument(filePath);
    element.Update();
}
EDIT EDIT
It is also possible to work with a temporary file:
File f = File.createTempFile("rtfdoc", ".rtf");
FileOutputStream fos = new FileOutputStream(f);
String rtfText = repository.GetFormatFromField("RTF", noteText);
fos.write(rtfText.getBytes());
fos.flush();
fos.close();
element.LoadLinkedDocument(f.getAbsolutePath());
element.Update();

First up, let's separate the linked document, which is stored in the EA project and displayed in EA's built-in RTF viewer, from an RTF file, which is stored on disk.
Element.LoadLinkedDocument() is the only way to create a linked document. It reads an RTF file and stores its contents as the element's linked document. An element can only have one linked document, and I think it is overwritten if the method is called again but I'm not absolutely sure (you could get an error instead, but the EA API tends not to work that way).
In order to specify the contents of your linked document, you must create the file and then load it. The only other way would be to go hacking around in EA's internal and undocumented database, which people sometimes do but which I strongly advise against.
In .NET you can create RTF documents using Microsoft's Word API, but to my knowledge there is no corresponding API for Java. A quick search turns up jRTF, an open-source RTF library for Java. I haven't tested it but it looks as if it'll do the trick.
You can also use EA's API to create RTF data. You would then create your intended content in EA's internal display format and use Repository.GetFormatFromField() to convert it to RTF, which you would then save in the file.
If you need to, you can use Repository.GetFieldFromFormat() to convert plain-text or HTML-formatted text to EA's internal format.
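If you would rather produce the RTF file yourself than go through a library, plain text can be wrapped in a very small RTF document by hand. The sketch below is only an assumption-laden illustration: RtfNoteFile and its methods are hypothetical names, the header (font table, Arial, 10 pt) is an arbitrary minimal choice, and I have not tested whether EA accepts this exact output. The escaping itself follows the RTF rules: backslash and braces are prefixed with a backslash, and non-ASCII characters are written as \uN? with N as a signed 16-bit decimal.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: writes plain text as a minimal RTF file.
public class RtfNoteFile {

    /** Escapes plain text so it can be placed inside an RTF group. */
    static String escapeRtf(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' || c == '{' || c == '}') {
                sb.append('\\').append(c);          // RTF control characters
            } else if (c == '\n') {
                sb.append("\\par ");                // paragraph break
            } else if (c == '\r') {
                // dropped: a following \n already produces the break
            } else if (c < 128) {
                sb.append(c);
            } else {
                // \uN takes a signed 16-bit value; '?' is the ANSI fallback
                sb.append("\\u").append((int) (short) c).append('?');
            }
        }
        return sb.toString();
    }

    /** Wraps the escaped text in a minimal RTF header (assumed font settings). */
    static String toRtf(String text) {
        return "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Arial;}}\\f0\\fs20 "
                + escapeRtf(text) + "}";
    }

    /** Writes the RTF to a temporary file and returns its path. */
    public static Path writeTempRtf(String text) throws IOException {
        Path file = Files.createTempFile("rtfdoc", ".rtf");
        Files.write(file, toRtf(text).getBytes(StandardCharsets.US_ASCII));
        return file;
    }
}
```

The returned path (as a string) would then be handed to Element.LoadLinkedDocument(), followed by Element.Update(), as in the question's own solution code.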

Extracting media objects present in LibreOffice Impress using LibreOffice APIs

I am trying to get details of the media content (video, audio) in a LibreOffice Impress document through the LibreOffice API in Java. The details I want to extract are the types of media content present in the document, and I am also looking for ways to export them. I have gone through the Java examples given on the website but could not find anything relevant to the type of video or audio present in a file or to the extraction of video files. I have gone through the example for exporting images from Impress documents using GraphicExportFilter, but it cannot export video or audio files present in the document. I also tried to determine the type of media content using XShape (code below), but that only gives the name of the media content and not its type (audio, video, or media extension).
For exporting, I am also aware of the method of converting the document to pptx, renaming it, and extracting all the media files. But I suppose that would take more time in a practical application (correct me if I am wrong), so I was trying to do the same with the LibreOffice API.
XComponent xDrawDoc = Helper.loadDocument( xOfficeContext,fileName, "_blank", 0, pPropValues );
XDrawPage xPage = PageHelper.getDrawPageByIndex( xDrawDoc,nPageIndex );
XIndexAccess xIndexAccess = UnoRuntime.queryInterface(XIndexAccess.class,xPage);
long shapeNumber = xIndexAccess.getCount();
for(int j=0;j < shapeNumber;j++)
{
    XShape xShape =UnoRuntime.queryInterface(XShape.class, xPage.getByIndex(j));
    XNamed xShapeNamed =UnoRuntime.queryInterface(XNamed.class, xShape);
    System.out.println(j+":"+xShapeNamed.getName());
}
(This code gives me the names of the media contents present in Impress but not its type or extension)
Thanks in Advance..
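As a point of comparison for the archive route mentioned above: an .odp file is itself a ZIP package, just like a .pptx, so its entries can be listed directly without converting first. The following is a rough sketch, not a LibreOffice API solution. PresentationMedia and listMedia are hypothetical names, the extension lists are illustrative rather than complete, and it only finds media that is embedded in the package (linked media files are not inside the archive).

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: lists embedded audio/video entries of a presentation archive.
public class PresentationMedia {

    // Assumption: illustrative, incomplete lists of common media extensions.
    private static final Set<String> VIDEO =
            new HashSet<>(Arrays.asList("mp4", "avi", "mov", "wmv", "mkv", "webm", "ogv"));
    private static final Set<String> AUDIO =
            new HashSet<>(Arrays.asList("mp3", "wav", "ogg", "flac", "m4a", "wma"));

    /** Returns "video", "audio", or null if the extension is not recognized. */
    static String classify(String entryName) {
        int dot = entryName.lastIndexOf('.');
        if (dot < 0) {
            return null;
        }
        String ext = entryName.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (VIDEO.contains(ext)) return "video";
        if (AUDIO.contains(ext)) return "audio";
        return null;
    }

    /** Maps each media entry name in the archive to "audio" or "video". */
    public static Map<String, String> listMedia(InputStream presentation) {
        Map<String, String> media = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(presentation)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String kind = entry.isDirectory() ? null : classify(entry.getName());
                if (kind != null) {
                    media.put(entry.getName(), kind);
                }
            }
        } catch (IOException e) {
            throw new IllegalStateException("Could not read presentation archive", e);
        }
        return media;
    }
}
```

Only the entry names are read here, so the listing itself is cheap; copying out the bytes of a matching entry works the same way as for any other ZIP entry.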

Extracting font-color for text in a pdf in Java using PDFBox [duplicate]

I have just started working with PDFBox, extracting text and so on. One thing I am interested in is the colour of the text itself that I am extracting. However I cannot seem to find any way of getting that information.
Is it possible at all to use PDFBox to get the colour information of a document and if so, how would I go about doing so?
Many thanks.

All color information should be stored in the class PDGraphicsState, and the color used (stroking/nonstroking etc.) depends on the text rendering mode in use (via the pdfbox mailing list).
Here is a small sample I tried:
After creating a pdf with just one line ("Sample" written in RGB=[146,208,80]), the following program will output:
DeviceRGB
146.115
208.08
80.07
Here's the code:
PDDocument doc = null;
try {
    doc = PDDocument.load("C:/Path/To/Pdf/Sample.pdf");
    PDFStreamEngine engine = new PDFStreamEngine(ResourceLoader.loadProperties("org/apache/pdfbox/resources/PageDrawer.properties"));
    PDPage page = (PDPage)doc.getDocumentCatalog().getAllPages().get(0);
    engine.processStream(page, page.findResources(), page.getContents().getStream());
    PDGraphicsState graphicState = engine.getGraphicsState();
    System.out.println(graphicState.getStrokingColor().getColorSpace().getName());
    float colorSpaceValues[] = graphicState.getStrokingColor().getColorSpaceValue();
    for (float c : colorSpaceValues) {
        System.out.println(c * 255);
    }
}
finally {
    if (doc != null) {
        doc.close();
    }
}
Take a look at PageDrawer.properties to see how PDF operators are mapped to Java classes.
As I understand it, as PDFStreamEngine processes a page stream, it sets various variable states depending on what operators it is processing at the moment. So when it hits green text, it will change the PDGraphicsState because it will encounter appropriate operators. So for CS it calls org.apache.pdfbox.util.operator.SetStrokingColorSpace as defined by mapping CS=org.apache.pdfbox.util.operator.SetStrokingColorSpace in the .properties file. RG is mapped to org.apache.pdfbox.util.operator.SetStrokingRGBColor and so on.
In this case, the PDGraphicsState hasn't changed because the document has just text and the text it has is in just one style. For something more advanced, you would need to extend PDFStreamEngine (just like PageDrawer, PDFTextStripper and other classes do) to do something when color changes. You could also write your own mappings in your own .properties file.
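The fractional numbers in the sample output (146.115, 208.08, 80.07) come from multiplying color components stored as floats in the range 0 to 1 by 255. To get back the 8-bit RGB values the text was written in, the products can be rounded. A small sketch follows; the class PdfColorComponents is a hypothetical name, and the component values 0.573, 0.816 and 0.314 used in the usage note are inferred from the printed output, not read from the actual file.

```java
// Hypothetical helper: converts 0..1 color components to 8-bit channel values.
public class PdfColorComponents {

    /** Scales each component to 0..255, rounds it, and clamps out-of-range input. */
    public static int[] toRgb8(float[] components) {
        int[] rgb = new int[components.length];
        for (int i = 0; i < components.length; i++) {
            int value = Math.round(components[i] * 255f);
            rgb[i] = Math.max(0, Math.min(255, value));
        }
        return rgb;
    }
}
```

With the inferred components, toRgb8(new float[] {0.573f, 0.816f, 0.314f}) gives 146, 208 and 80, matching the RGB values the sample PDF was created with.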
