Extracting media objects present in LibreOffice Impress using LibreOffice APIs

I am trying to get details of the media content (video, audio) in a LibreOffice Impress document through the LibreOffice API in Java. Specifically, I want to find out the type of each media object in the document, and also how to export it. I have gone through the Java examples on the website but could not find anything about determining the type of video or audio in a file, or about extracting video files. I have also gone through the example for exporting images from Impress documents using GraphicExportFilter, but it cannot export the video or audio files in the document. I also tried to get the media type via XShape (code below), but that only gives the name of the media object, not its type (audio, video, or file extension).
For exporting, I am aware of the approach of converting the document to pptx, renaming it, and extracting all the media files. But I suspect that would take more time in a practical application (correct me if I am wrong), so I was trying to do the same with the LibreOffice API.
XComponent xDrawDoc = Helper.loadDocument(xOfficeContext, fileName, "_blank", 0, pPropValues);
XDrawPage xPage = PageHelper.getDrawPageByIndex(xDrawDoc, nPageIndex);
XIndexAccess xIndexAccess = UnoRuntime.queryInterface(XIndexAccess.class, xPage);
long shapeNumber = xIndexAccess.getCount();
for (int j = 0; j < shapeNumber; j++)
{
    XShape xShape = UnoRuntime.queryInterface(XShape.class, xPage.getByIndex(j));
    XNamed xShapeNamed = UnoRuntime.queryInterface(XNamed.class, xShape);
    System.out.println(j + ":" + xShapeNamed.getName());
}
(This code gives me the names of the media contents present in Impress but not its type or extension)
Thanks in Advance..
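One option that avoids both the UNO API and a conversion step: an .odp file is itself a ZIP package, so its entries can be listed directly. The following is a minimal sketch, assuming the media is embedded in the package (not linked) and that each entry's file extension reflects its type. The class name OdpMediaLister, the extension table, and the placeholder path slides.odp are illustrative and not part of any LibreOffice API. Embedded media usually sits under a Media/ folder, but the sketch does not rely on that.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class OdpMediaLister {

    enum MediaKind { AUDIO, VIDEO, OTHER }

    // Illustrative extension table (an assumption); extend it for the formats you expect.
    static MediaKind classify(String entryName) {
        int dot = entryName.lastIndexOf('.');
        String ext = dot < 0 ? "" : entryName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "mp3": case "wav": case "ogg": case "flac": case "m4a":
                return MediaKind.AUDIO;
            case "mp4": case "avi": case "mkv": case "webm": case "mov": case "wmv":
                return MediaKind.VIDEO;
            default:
                return MediaKind.OTHER;
        }
    }

    // Returns entry name -> kind for every audio or video entry in the package.
    static Map<String, MediaKind> listMedia(Path odpFile) throws IOException {
        Map<String, MediaKind> result = new LinkedHashMap<>();
        try (ZipFile zip = new ZipFile(odpFile.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }
                MediaKind kind = classify(entry.getName());
                if (kind != MediaKind.OTHER) {
                    result.put(entry.getName(), kind);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // "slides.odp" is a placeholder path.
        Path file = Paths.get(args.length > 0 ? args[0] : "slides.odp");
        listMedia(file).forEach((name, kind) -> System.out.println(kind + "  " + name));
    }
}
```

Exporting a listed entry is then a matter of copying its bytes out with ZipFile.getInputStream(entry).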

Related

How to detect Adobe Illustrator files in Tika version 2.6?

I want to use Tika 2.6 to detect files with the MIME type 'application/illustrator'. When I use the following code snippet, I always get the MIME type 'application/pdf':
public MediaType detectMimeTypeFromContent(@NonNull File file) throws IOException {
    TikaConfig config = TikaConfig.getDefaultConfig();
    Detector detector = config.getDetector();
    Metadata metadata = new Metadata();
    TikaInputStream tikaStream = TikaInputStream.get(file, metadata);
    MediaType mediaType = detector.detect(tikaStream, metadata);
    tikaStream.close();
    return mediaType;
}
I use these dependencies:
implementation 'org.apache.tika:tika-core:2.6.0'
implementation 'org.apache.tika:tika-parsers:2.6.0'
How can I detect Adobe Illustrator files correctly?

Adobe's documentation shows that they internally use three application/type settings:
List of document mime-types that are considered to be PDF or Illustrator documents.
PDF
Postscript
Illustrator
It also goes on to say:
Adobe Illustrator’s file format is a variant of PDF. The main differences, in the context of Experience Manager Assets, is the following:
Adobe Illustrator documents consist of a single page with multiple layers. Each layer is extracted as a PNG subasset under the main Illustrator asset.
PDF documents consist of one or more pages. Each page is extracted as a single page PDF subasset under the main multi-page PDF document.
So Adobe applications have an in-house type set to distinguish application/illustrator; however, that is not a registered MIME type (AI is a variant of PDF, as above).
Other applications may struggle with hybrids in which one format is wrapped around the other. As one example,
Linguist reports a content type of application/postscript for *.ai, whilst others report application/pdf, which may be due to:
"Early versions [over 24 years ago] of the AI file format [Illustrator versions 3 through to 8 saved artwork as specialised EPS files,] are true EPS files with a restricted, compact syntax, with additional semantics represented by Illustrator-specific DSC comments that conform to DSC's Open Structuring Conventions."
Confused? Don't be. Simply do as the current MIME type registry does and accept that AI files that are PDF-like are application/pdf.
I often refer to text/pdf as the legacy format for ansi/pdf, but those are not listed either.
If a file starts with the 40-bit signature %PDF-, then it is application/pdf, irrespective of version or content.
The RFC: https://www.rfc-editor.org/rfc/rfc8118.html
PDF Versions
The PDF format has gone through several revisions, primarily for the
addition of features. PDF features have generally been added in a
way that older viewers "fail gracefully", because they can just
ignore features they do not recognize. Even so, the older the PDF
version produced, the more legacy viewers will support that version,
but the fewer features will be enabled. The "application/pdf" media
type is used for all versions. See [ISOPDF2] Annex I, "PDF Versions
and Compatibility".
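The signature rule from the answer above (a file beginning with the five bytes %PDF- is treated as application/pdf) can be sketched in plain Java. This is only an illustration of that rule, not how Tika implements detection; the class name PdfSignature and its method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class PdfSignature {

    // The five ASCII bytes "%PDF-" (40 bits).
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True if the given leading bytes begin with the PDF signature.
    static boolean startsWithPdfSignature(byte[] head) {
        if (head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads just enough of the stream to check the signature.
    static boolean isPdfLike(InputStream in) throws IOException {
        byte[] head = new byte[MAGIC.length];
        int read = 0;
        while (read < head.length) {
            int n = in.read(head, read, head.length - read);
            if (n < 0) {
                break; // stream ended before five bytes were available
            }
            read += n;
        }
        return read == head.length && startsWithPdfSignature(head);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: PdfSignature <file>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(isPdfLike(in) ? "application/pdf" : "not PDF-like");
        }
    }
}
```

A PDF-based .ai file passes this check just like any other PDF, which is consistent with the detector reporting application/pdf for it.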

Vaadin file preview

Can anyone support me with the following issue:
I am working with Vaadin and need to develop code that will allow the user to preview files in formats such as PDF, image, video and audio. All files are stored in a database and may be of various types. For PDF, it is enough to add the following code:
StreamResource resource = file.downloadFileFromDatabase();
Embedded pdf = new Embedded("", resource);
pdf.setMimeType("application/pdf");
pdf.setType(Embedded.TYPE_BROWSER);
pdf.setSizeFull();
pdf.setHeight("310px");
verticalLayout.setSizeFull();
verticalLayout.addComponent(pdf);
verticalLayout.setExpandRatio(pdf, 1.0f);
But this code doesn't work with video and audio files. Do I need to add some sort of if-else statement to arrange the preview according to the file's format? Thank you in advance.

Playing videos in Vaadin is possible; the Video class exists for this.
But since it renders an HTML5 <video> tag, it requires a URL as the video source. In addition, you need a browser that supports the video tag and the specific encodings your videos use.
This add-on can help with player elements for controlling playback:
https://vaadin.com/directory#!addon/mediaelementjs-player
Here is some more info for debugging your potential problem:
https://vaadin.com/forum#!/thread/2445276
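On the if-else part of the question: one way to keep the branching in a single place is to map the stored MIME type to a preview kind first and pick the component afterwards. This is a minimal sketch in plain Java, assuming the MIME type is stored alongside each file; PreviewSelector and PreviewKind are hypothetical names, and wiring each kind to a Vaadin component (Embedded for PDF as in the question, Video as in the answer) is left out.

```java
import java.util.Locale;

final class PreviewSelector {

    enum PreviewKind { PDF, IMAGE, VIDEO, AUDIO, UNSUPPORTED }

    // Maps a MIME type such as "video/mp4" to the kind of preview to build.
    static PreviewKind forMimeType(String mimeType) {
        if (mimeType == null) {
            return PreviewKind.UNSUPPORTED;
        }
        String m = mimeType.trim().toLowerCase(Locale.ROOT);
        if (m.equals("application/pdf")) {
            return PreviewKind.PDF;
        }
        if (m.startsWith("image/")) {
            return PreviewKind.IMAGE;
        }
        if (m.startsWith("video/")) {
            return PreviewKind.VIDEO;
        }
        if (m.startsWith("audio/")) {
            return PreviewKind.AUDIO;
        }
        return PreviewKind.UNSUPPORTED;
    }

    public static void main(String[] args) {
        for (String type : new String[] {"application/pdf", "video/mp4", "audio/mpeg", "text/plain"}) {
            System.out.println(type + " -> " + forMimeType(type));
        }
    }
}
```

A switch over the returned PreviewKind then replaces a chain of format checks scattered through the UI code.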

How to update the content of a file in Google Drive?

I am trying to update the content of a Google Doc file with the content of another Google Doc file. The reason I don't use the copy method of the API is because that creates another file with another ID. My goal is to keep the current ID of the file. This is a code snippet which unfortunately does nothing:
com.google.api.services.drive.Drive.Files.Get getDraft = service.files().get(draftID);
File draft = driveManager.getFileBackoffExponential(getDraft);
com.google.api.services.drive.Drive.Files.Update updatePublished = service.files().update(publishedID, draft);
driveManager.updateFileBackoffExponential(updatePublished);
The two backoffExponential functions just launch the execute method on the object.
Googling around I found out that the update method offers another constructor:
public Update update(java.lang.String fileId, com.google.api.services.drive.model.File content, com.google.api.client.http.AbstractInputStreamContent mediaContent)
Thing is, I have no idea how to retrieve the mediaContent of a Google file such as a Google Doc.
The last resort could be a Google Apps Script but I'd rather avoid that since it's awfully slow and unreliable.
Thank you.
EDIT: I am using Drive API v3.

Try the Google Drive REST update.
Updates a file's metadata and/or content with patch semantics.
This method supports an /upload URI and accepts uploaded media with
the following characteristics:
Maximum file size: 5120GB
Accepted Media MIME types: */*
To download a Google File in the format that's usable, you need to specify the mime-type. Since you're using Spreadsheets, you can try application/vnd.openxmlformats-officedocument.spreadsheetml.sheet. Link to Download files for more info.
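To make the "specify the mime-type" step concrete: each Google editor file type has a matching Office Open XML MIME type that can be requested when exporting. This is a small lookup sketch, assuming only the three common editor types matter. The class GoogleExportTypes is a hypothetical helper, not part of the Drive client library, and Drive supports other export formats too.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class GoogleExportTypes {

    private static final Map<String, String> EXPORT = new HashMap<>();

    static {
        // Google Docs -> .docx
        EXPORT.put("application/vnd.google-apps.document",
                "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
        // Google Sheets -> .xlsx
        EXPORT.put("application/vnd.google-apps.spreadsheet",
                "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
        // Google Slides -> .pptx
        EXPORT.put("application/vnd.google-apps.presentation",
                "application/vnd.openxmlformats-officedocument.presentationml.presentation");
    }

    // Returns the export MIME type for a Google editor MIME type, if one is listed.
    static Optional<String> exportTypeFor(String googleMimeType) {
        return Optional.ofNullable(EXPORT.get(googleMimeType));
    }

    public static void main(String[] args) {
        System.out.println(exportTypeFor("application/vnd.google-apps.document").orElse("no export type listed"));
    }
}
```

For the Google Doc in the question, the lookup yields the wordprocessingml type; that string is what an export request would be given as its target MIME type.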

how to Show or Read docx file

I am new to rendering files in Android, and I want to render or display a docx file in my application.
I have already extracted the text from the docx file, but now I want to extract the images as well.
I've found several ways to display images in pure Java, but are there any good examples for Android?
I tried this code to fetch the images, but it is not working:
public void extractImages(Document xmlDoc)
{
    NodeList binDataList = xmlDoc.getElementsByTagName("w:drawings");
    String fileName = "";
    Node currentNode;
    for (int i = 0; i < binDataList.getLength(); i++)
    {
        currentNode = binDataList.item(i);
        if (currentNode.getNodeType() == Node.ELEMENT_NODE && ((Element) currentNode).hasAttribute("w:name"))
        {
            File newImageFile = new File(picDirectory, ((Element) currentNode).getAttribute("w:name").replaceFirst("wordml://", ""));
            if (newImageFile.exists())
            {
                // Image was already extracted; nothing to do
            }
            else
            {
                if (writeImage(newImageFile, currentNode))
                {
                    //Print some success message
                }
            }
        }
    }
}

Have a look at AndroidDocxToHtml, which I made to demonstrate using docx4j on Android.
A couple of caveats.
First, that project does not include all docx4j dependencies, only the ones required for docx to HTML conversion. So if you want to do other things, you may need some of the other dependencies.
Second, docx4j requires JAXB - see this blog post re JAXB on Android - and JAXB context init on app startup takes a while depending on the device. There are ways to work around this, but at extra effort.
If all you want to do is extract the images, and you don't care how they relate to the text, you could just look for image parts. You might use OpenXML4J for that, and avoid JAXB.

The easiest way to create an image in Android is to use the BitmapFactory factory methods.
The BitmapFactory class has methods for creating a Bitmap from a byte array, a file or an InputStream.
Once you have a Bitmap object you can display it by setting it on an ImageView in your layout using the setImageBitmap method.

You can just unzip the file (rename to .zip and open it) then you can investigate the folder structure, where the images are located etc.
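The unzip approach can also be done in code, since java.util.zip is available on Android. The following is a minimal sketch, assuming the images sit under word/media/ inside the package (the usual location in Word-generated files, but worth verifying by unzipping one of your own documents). DocxImageExtractor is a hypothetical class name.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class DocxImageExtractor {

    // Assumed location of embedded images inside the docx package.
    private static final String MEDIA_PREFIX = "word/media/";

    // Returns entry name -> raw image bytes for every file under word/media/.
    static Map<String, byte[]> extractImages(InputStream docx) throws IOException {
        Map<String, byte[]> images = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory() || !entry.getName().startsWith(MEDIA_PREFIX)) {
                    continue;
                }
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                // read() returns -1 at the end of the current entry
                while ((n = zip.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                images.put(entry.getName(), buffer.toByteArray());
            }
        }
        return images;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: DocxImageExtractor <file.docx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            extractImages(in).forEach((name, bytes) -> System.out.println(name + " (" + bytes.length + " bytes)"));
        }
    }
}
```

On Android, each byte array can then be handed to BitmapFactory.decodeByteArray, as suggested in the BitmapFactory answer above. The main method here uses java.nio.file only for a desktop demo; on Android you would pass in whatever InputStream you already have for the document.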

Running a JavaScript command from MATLAB to fetch a PDF file

I'm currently writing some MATLAB code to interact with my company's internal reports database. So far I can access the HTML abstract page using code which looks like this:
import com.mathworks.mde.desk.*;
wb=com.mathworks.mde.webbrowser.WebBrowser.createBrowser;
wb.setCurrentLocation(ReportURL(8:end));
pause(1);
s={};
while isempty(s)
    s=char(wb.getHtmlText);
    pause(.1);
end
desk=MLDesktop.getInstance;
desk.removeClient(wb);
I can extract out various bits of information from the HTML text which ends up in the variable s, however the PDF of the report is accessed via what I believe is a JavaScript command (onClick="gotoFulltext('','[Report Number]')").
Any ideas as to how I execute this JavaScript command and get the contents of the PDF file into a MATLAB variable?
(MATLAB sits on top of Java, so I believe a Java solution would work...)

I think you should take a look at the JavaScript that is being called and see what the final request to the webserver looks like.
You can do this quite easily in Firefox using the FireBug plugin.
https://addons.mozilla.org/en-US/firefox/addon/1843
Once you have found the real server request then you can just request this URL or post to this URL instead of trying to run the JavaScript.

Once you have gotten the correct URL (a la the answer from pjp), your next problem is to "get the contents of the PDF file into a MATLAB variable". Whether or not this is possible may depend on what you mean by "contents"...
If you want to get the raw data in the PDF file, I don't think there is a way currently to do this in MATLAB. The URLREAD function was the first thing I thought of to read content from a URL into a string, but it has this note in the documentation:
s = urlread('url') reads the content
at a URL into the string s. If the
server returns binary data, s will
be unreadable.
Indeed, if you try to read a PDF as in the following example, s contains some text intermingled with mostly garbage:
s = urlread('http://samplepdf.com/sample.pdf');
If you want to get the text from the PDF file, you have some options. First, you can use URLWRITE to save the contents of the URL to a file:
urlwrite('http://samplepdf.com/sample.pdf','temp.pdf');
Then you should be able to use one of two submissions on The MathWorks File Exchange to extract the text from the PDF:
Extract text from a PDF document by Dimitri Shvorob
PDF Reader by Tom Gaudette
If you simply want to view the PDF, you can just open it in Adobe Acrobat with the OPEN function:
open('temp.pdf');
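Since the question notes that a Java solution would work from MATLAB, the raw bytes can also be read on the Java side, which sidesteps the string conversion that makes the URLREAD result unreadable. This is a minimal sketch, assuming the real PDF URL has already been found and needs no login cookies or POST data; BinaryFetch is a hypothetical class name.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;

final class BinaryFetch {

    // Drains a stream into a byte array without any character decoding.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    // Opens the URL and returns the response body as raw bytes.
    static byte[] fetch(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return readAll(in);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: BinaryFetch <url>");
            return;
        }
        byte[] data = fetch(URI.create(args[0]).toURL());
        System.out.println("read " + data.length + " bytes");
    }
}
```

The manual read loop is used instead of newer convenience methods so the sketch also compiles on older Java versions. Calling it from MATLAB would require the compiled class to be on the Java class path.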

wb=com.mathworks.mde.webbrowser.WebBrowser.createBrowser;
wb.executeScript('javascript:alert(''Some code from a link'')');
desk=com.mathworks.mde.desk.MLDesktop.getInstance;
desk.removeClient(wb);
