Can you recommend a library that lets me add XMP data to a TIFF file? Preferably a library that can be used with Java.
There is JempBox which is open source and allows the manipulation of XMP streams, but it doesn't look like it will read/write the XMP data in a TIFF file.
There is also Chilkat which is not open source, but does appear to do what you want.
It's been a while, but it may still be useful to someone: Apache Commons has a library called Sanselan suitable for this task. It's a bit dated and the documentation is sparse, but it does the job well nevertheless:
File file = new File("path/to/your/file");
// Get XMP xml data from a file
String xml = Sanselan.getXmpXml(file);
// Process the XML data
xml = processXml(xml);
// Write XMP xml data back to the file
Map<String, Object> params = new HashMap<String, Object>();
params.put(Sanselan.PARAM_KEY_XMP_XML, xml);
BufferedImage image = Sanselan.getBufferedImage(file);
Sanselan.writeImage(image, file, Sanselan.guessFormat(file), params);
You may have to be careful with multipage TIFFs though, because Sanselan.getBufferedImage will probably only read the first page (so only the first page gets written back).
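If you want to guard against that, one option is to count the pages before rewriting anything. Below is a minimal sketch that uses only the JDK, assuming Java 9 or later (whose javax.imageio ships a TIFF reader). TiffPageCount and countPages are names made up for this example; they are not part of Sanselan.

```java
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import java.io.File;
import java.io.IOException;
import java.util.Iterator;

// Hypothetical helper, not part of Sanselan: counts the images stored in a file.
public class TiffPageCount {

    /** Returns the number of images (pages) in the file, or -1 if ImageIO cannot read it. */
    public static int countPages(File file) throws IOException {
        try (ImageInputStream in = ImageIO.createImageInputStream(file)) {
            if (in == null) {
                return -1; // no stream could be opened for this file
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                return -1; // no installed reader recognises the format
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                // true = scan the whole stream if the count is not known up front
                return reader.getNumImages(true);
            } finally {
                reader.dispose();
            }
        }
    }
}
```

If countPages(file) returns more than 1, it is probably safer to skip the Sanselan.writeImage step for that file and handle the extra pages separately.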
Related
I am using Java and I am trying to extract some metadata with Apache Tika, but I cannot extract the expected value for the 'subject' metadata. The file is a JPG image. Here is my code:
First I am parsing the file like this:
inputStream = new FileInputStream(fileToExtract);
Parser parser = new AutoDetectParser();
ContentHandler contentHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
parser.parse(inputStream, contentHandler, metadata, new ParseContext());
and then I am trying to print these:
metadata.get(Metadata.AUTHOR) --> "MyAuthor"
metadata.get(TikaCoreProperties.CREATOR) --> "MyCreator"
metadata.get(TikaCoreProperties.TITLE) --> "MyTitle"
metadata.get(Metadata.SUBJECT) --> **null**
metadata.get(TikaCoreProperties.KEYWORDS) --> **null**
So I get all the other values correctly, but I get null for the subject. I added the metadata manually (right click -> Properties, on Windows).
Am I doing something wrong?
PS: Note that "TikaCoreProperties.KEYWORDS" is another way to retrieve the subject, according to the Apache Tika documentation.
Apache Tika tries to return consistent metadata across all file formats. It shouldn't matter if one format calls it Author, another Creator, another Created By and another Creator[0]; Tika maps those all onto a consistent key. Typically, those keys are based on well-known external standards, such as Dublin Core.
If you want to see the mappings that Tika applies to Microsoft Office documents, you'll need to look in SummaryExtractor. If you want to know what all the metadata keys and values are that Tika can extract from a given file, either use the tika-app cli tool with --metadata, or call names() on the Metadata object to get the list of metadata keys Tika found.
I'm looking to modify certain tags (like comments, keywords, etc) of a .DOC file. I've been able to do this for DOCX using docx4j but I haven't been able to find anything that lets me change the tags for a .DOC format.
Is there a way to programmatically change the content of certain tags in a .DOC file?
Apache POI will quite happily let you read and edit the metadata of supported documents. For the older OLE2 formats (.doc, .xls etc), you'll want to use HPSF, likely via POIDocument. For the OOXML formats (.docx, .xlsx etc), use POIXMLDocument and POIXMLProperties.
To modify the OLE2 properties, you can either follow the detailed instructions and code in the HPSF documentation, or on newer versions of POI you can shortcut quite a bit of that with HPSFPropertiesOnlyDocument, e.g.:
NPOIFSFileSystem fs = new NPOIFSFileSystem(new File("test.doc"));
HPSFPropertiesOnlyDocument doc = new HPSFPropertiesOnlyDocument(fs);
SummaryInformation si = doc.getSummaryInformation();
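if (si == null) {
    // Create the missing property sets, then fetch the new SummaryInformation so si is no longer null
    doc.createInformationProperties();
    si = doc.getSummaryInformation();
}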
si.setAuthor("StackOverflow");
si.setTitle("Properties Demo!");
FileOutputStream out = new FileOutputStream("changed.doc");
doc.write(out);
out.close();
I'm trying to extract metadata from a PNG image. I'm using this library: http://code.google.com/p/metadata-extractor/
Even though it claims that the PNG format is supported, I get the error "File format is not supported" when I try it with a PNG image. From the source (in the method readMetadata), it also looks like it doesn't support the PNG format: http://code.google.com/p/metadata-extractor/source/browse/Source/com/drew/imaging/ImageMetadataReader.java?r=1aae00f3fe64388cd14401b2593b580677980884
I've also tried this piece of code, but it doesn't extract the metadata from the PNG either.
BTW, I'm adding metadata on PNG with imagemagick like this:
mogrify -comment "Test" Demo/myimage.png
Has anyone used this library for PNG format or are there other ways to extract metadata from PNG image?
You can try PNGJ (I'm the developer)
See e.g. here for an example that dumps all chunks.
If you want to read a particular text chunk (recall that in PNG each textual metadata entry has a key and a value), you could write:
pngr.getMetadata().getTxtForKey("mykey")
A useful little Windows program to peek inside PNG chunk structure is TweakPNG
Update: If you want to check all textual chunks (bear in mind that there are three types with some differences, but...)
PngReader pngr = FileHelper.createPngReader(new File(file));
pngr.readSkippingAllRows();
for (PngChunk c : pngr.getChunksList().getChunks()) {
    if (!ChunkHelper.isText(c)) continue;
    PngChunkTextVar ct = (PngChunkTextVar) c;
    String key = ct.getKey();
    String val = ct.getVal();
    // ...
}
Bear also in mind that textual chunks with repeated keys are allowed.
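For comparison, here is a rough sketch of the same idea without any library, reading the chunk layout directly with the JDK. It only handles the uncompressed tEXt type (not zTXt or iTXt), does not verify CRCs, and loads every chunk into memory, so treat it as an illustration rather than a replacement for PNGJ. PngTextChunks and readText are names made up for this example.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: lists tEXt key/value pairs from a PNG stream.
public class PngTextChunks {
    private static final long PNG_SIGNATURE = 0x89504E470D0A1A0AL;

    /** Returns every tEXt key/value pair in file order; repeated keys are kept. */
    public static List<Map.Entry<String, String>> readText(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        if (in.readLong() != PNG_SIGNATURE) {
            throw new IOException("Not a PNG stream");
        }
        List<Map.Entry<String, String>> entries = new ArrayList<>();
        byte[] typeBytes = new byte[4];
        while (true) {
            int length = in.readInt();          // chunk data length, big-endian
            if (length < 0) {
                throw new IOException("Invalid chunk length");
            }
            in.readFully(typeBytes);            // four-letter chunk type
            String type = new String(typeBytes, StandardCharsets.US_ASCII);
            byte[] data = new byte[length];
            in.readFully(data);
            in.readInt();                       // CRC, read but not verified in this sketch
            if (type.equals("IEND")) {
                return entries;                 // last chunk of the file
            }
            if (type.equals("tEXt")) {
                // tEXt data is: keyword, a zero byte, then the text (both Latin-1)
                int sep = 0;
                while (sep < data.length && data[sep] != 0) {
                    sep++;
                }
                if (sep < data.length) {
                    String key = new String(data, 0, sep, StandardCharsets.ISO_8859_1);
                    String val = new String(data, sep + 1, data.length - sep - 1,
                            StandardCharsets.ISO_8859_1);
                    entries.add(new SimpleEntry<>(key, val));
                }
            }
        }
    }
}
```

Returning a list of entries rather than a Map is deliberate: a Map would silently drop chunks whose key appears more than once.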
I have many JPEG images which contain corrupted XMP XML blocks. I can easily fix these blocks but I'm unsure how to write the 'fixed' data back to the image files.
I'm currently using Java but am open to anything that will make this task easy.
This is the goal for another question around XMP XML asked earlier.
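In Java you can use the Apache Sanselan library:
String newXmpXmlString = "<the><new/><xmp/><xml/></the>";
File source = new File("path/to/file");
// Hypothetical target path; it must differ from source, because FileOutputStream truncates the file as soon as it is opened
File target = new File("path/to/file.fixed");
OutputStream out = new BufferedOutputStream(new FileOutputStream(target));
new JpegXmpRewriter().updateXmpXml(new ByteSourceFile(source), out, newXmpXmlString);
out.close();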
For a more detailed example of the solution outlined above, there is an open source project on Google Code that houses a small JPEG XMP XML Trimmer.
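To check that a rewritten file really contains the new packet, you can read the XMP back without any library. In a JPEG, XMP normally sits in an APP1 segment whose payload starts with the namespace string "http://ns.adobe.com/xap/1.0/" followed by a zero byte. The sketch below relies on that layout; it ignores extended XMP and unusual marker padding, and JpegXmpReader and readXmp are names made up for this example.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: pulls the XMP packet out of a JPEG stream.
public class JpegXmpReader {
    private static final byte[] XMP_HEADER =
            "http://ns.adobe.com/xap/1.0/\0".getBytes(StandardCharsets.US_ASCII);

    /** Returns the XMP packet as text, or null if none is found before the image data. */
    public static String readXmp(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        if (in.readUnsignedShort() != 0xFFD8) {          // SOI marker
            throw new IOException("Not a JPEG stream");
        }
        while (true) {
            int marker = in.readUnsignedShort();
            if (marker == 0xFFDA || marker == 0xFFD9) {  // start of scan or end of image
                return null;
            }
            int length = in.readUnsignedShort() - 2;     // the length field counts itself
            if (length < 0) {
                throw new IOException("Invalid segment length");
            }
            byte[] payload = new byte[length];
            in.readFully(payload);
            if (marker == 0xFFE1 && startsWith(payload, XMP_HEADER)) {
                return new String(payload, XMP_HEADER.length,
                        length - XMP_HEADER.length, StandardCharsets.UTF_8);
            }
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Comparing the returned string with newXmpXmlString after the rewrite is a cheap sanity check before replacing the original files.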
I'm wondering how you can convert Word .doc/.docx files to text files through Java. I understand that there's an option where I can do this through Word itself but I would like to be able to do something like this:
java DocConvert somedocfile.doc converted.txt
Thanks.
If you're interested in a Java library that deals with Word document files, you might want to look at e.g. Apache POI. A quote from the website:
Why should I use Apache POI?
A major use of the Apache POI api is for Text Extraction applications such as web spiders, index builders, and content management systems.
P.S.: If, on the other hand, you're simply looking for a conversion utility, Stack Overflow may not be the most appropriate place to ask for this.
Edit: If you don't want to use an existing library but do all the hard work yourself, you'll be glad to hear that Microsoft has published the required file format specifications. (The Microsoft Open Specification Promise lists the available specifications. Just google for any of them that you're interested in. In your case, you'd need e.g. the OLE2 Compound File Format, the Word 97 binary file format, and the Open XML formats.)
Use the Apache Tika command-line utility. Tika supports a wide range of formats (e.g. doc, docx, pdf, html, rtf ...)
java -jar tika-app-1.3.jar -t somedocfile.doc > converted.txt
Programmatically:
File inputFile = ...;
Tika tika = new Tika();
String extractedText = tika.parseToString(inputFile);
You can use Apache POI too. It has a text extraction tool for doc/docx (see its Text Extraction page). If you want to extract only the text, you can use the code below. If you want to extract rich text (such as formatting and styling), you can use Apache Tika.
Extract doc:
InputStream fis = new FileInputStream(...);
POITextExtractor extractor;
// if docx
if (fileName.toLowerCase().endsWith(".docx")) {
    XWPFDocument doc = new XWPFDocument(fis);
    extractor = new XWPFWordExtractor(doc);
} else {
    // if doc
    POIFSFileSystem fileSystem = new POIFSFileSystem(fis);
    extractor = ExtractorFactory.createExtractor(fileSystem);
}
String extractedText = extractor.getText();
You should consider using this library. It's Apache POI.
Excerpt from the website:
In short, you can read and write MS Excel files using Java. In addition, you can read and write MS Word and MS PowerPoint files using Java. Apache POI is your Java Excel solution (for Excel 97-2008). We have a complete API for porting other OOXML and OLE2 formats and welcome others to participate.
Docmosis can read a doc and spit out the text in it. Requires some infrastructure to be installed (such as OpenOffice).
You can also use JODConverter.