Optimize metadata writing in DOC, XLS files - java

I'm writing a program that modifies only the metadata (standard and custom) in DOC, XLS, PPT and VSD files. The program works correctly, but I wonder if there is a way to do this without loading the entire file into memory:
POIFSFileSystem poifs = new POIFSFileSystem(new FileInputStream("file.xls"));
The NPOIFSFileSystem approach is faster and consumes less memory, but it is read-only.
I'm using Apache POI 3.9

You could map the desired part of the file into memory and then work on it using java.nio.channels.FileChannel.
In addition to the familiar read, write, and close operations of byte channels, this class defines the following file-specific operations:
Bytes may be read or written at an absolute position in a file in a way that does not affect the channel's current position.
A region of a file may be mapped directly into memory; for large files this is often much more efficient than invoking the usual read or write methods.
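For illustration, a minimal sketch of mapping just a region of a file with FileChannel might look like this (assuming Java 7+ for try-with-resources; the offset and length are placeholders, and locating the actual OLE2 property streams within the file is a separate problem):
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MapRegionExample {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("file.xls", "rw");
             FileChannel channel = raf.getChannel()) {
            // Map only part of the file into memory (placeholder offset/length)
            long offset = 0;
            long length = Math.min(channel.size(), 4096);
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, offset, length);

            // Bytes can be read or modified in place at absolute positions
            byte first = buffer.get(0);
            buffer.put(0, first);

            // Push the changes back to the underlying file
            buffer.force();
        }
    }
}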

At the time of your question, there sadly wasn't a very low-memory way to do it. The good news is that as of 2014-04-28 it is possible! (This code should be in POI 3.11 when that's released, but for now it's too new to be in a stable release.)
Now that NPOIFS supports writing, including in-place write, what you'll want to do is something like:
// Open the file, and grab the entries for the summary streams
NPOIFSFileSystem poifs = new NPOIFSFileSystem(file, false);
DirectoryNode root = poifs.getRoot();
DocumentNode sinfDoc =
        (DocumentNode) root.getEntry(SummaryInformation.DEFAULT_STREAM_NAME);
DocumentNode dinfDoc =
        (DocumentNode) root.getEntry(DocumentSummaryInformation.DEFAULT_STREAM_NAME);

// Open and parse the metadata
SummaryInformation sinf = (SummaryInformation) PropertySetFactory.create(
        new NDocumentInputStream(sinfDoc));
DocumentSummaryInformation dinf = (DocumentSummaryInformation) PropertySetFactory.create(
        new NDocumentInputStream(dinfDoc));

// Make some metadata changes
sinf.setAuthor("Changed Author");
sinf.setTitle("Le titre \u00e9tait chang\u00e9");
dinf.setManager("Changed Manager");

// Update the metadata streams in the file
sinf.write(new NDocumentOutputStream(sinfDoc));
dinf.write(new NDocumentOutputStream(dinfDoc));

// Write out our changes
poifs.writeFilesystem();
poifs.close();
You ought to be able to do all of that using well under 20% of your file's size in memory, quite possibly much less for larger files!
(If you want to see more on this, look at the ModifyDocumentSummaryInformation example and the HPSF TestWrite unit test)
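The question also mentions custom metadata; continuing from the snippet above, a rough sketch of updating the custom properties through the same DocumentSummaryInformation could look like this (the property name and value are placeholders, not anything from the original question):
// Custom properties live in the document summary information stream
CustomProperties custom = dinf.getCustomProperties();
if (custom == null) {
    custom = new CustomProperties();
}
custom.put("MyCustomProperty", "Some value"); // placeholder name/value
dinf.setCustomProperties(custom);

// Then write that stream back in place, exactly as above
dinf.write(new NDocumentOutputStream(dinfDoc));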

Related

Large file to be read in chunks

I need to process a large file and insert its contents into a DB, and I don't want to spend a lot of RAM doing so. I know we can read the file line by line in streaming mode using the Apache Commons API or a BufferedReader, but I want to insert into the DB in batch mode, say 1000 insertions in one go rather than one by one. Is reading the file line by line, adding to a list, counting its size, inserting, and then clearing the list the only way to achieve this?
According to your description, Spring Batch fits very well.
Basically, it uses a chunk concept to read/process/write the content, and it can also run concurrently for better performance.
@Bean
protected Step loadFeedDataToDbStep() {
    return stepBuilder.get("load new fincon feed").<com.xxx.Group, FinconFeed>chunk(250)
            .reader(itemReader(OVERRIDDEN_BY_EXPRESSION))
            .processor(itemProcessor(OVERRIDDEN_BY_EXPRESSION, OVERRIDDEN_BY_EXPRESSION_DATE, OVERRIDDEN_BY_EXPRESSION))
            .writer(itemWriter())
            .listener(archiveListener())
            .build();
}
You can refer to the Spring Batch documentation for more details.
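If pulling in Spring Batch is more than you need, the same chunking idea can be sketched with plain JDBC batching, flushing every 1000 rows; the connection URL, table, and line parsing below are placeholders:
import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class ChunkedInsert {
    private static final int BATCH_SIZE = 1000;

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:test"); // placeholder URL
             PreparedStatement ps = conn.prepareStatement("INSERT INTO lines(content) VALUES (?)");
             BufferedReader reader = Files.newBufferedReader(Paths.get("large-file.txt"), StandardCharsets.UTF_8)) {

            conn.setAutoCommit(false);
            int count = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                ps.setString(1, line); // real code would parse the line into its columns
                ps.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch(); // flush a chunk of 1000 rows
                }
            }
            ps.executeBatch(); // flush any remaining rows
            conn.commit();
        }
    }
}
This is essentially the read/accumulate/flush loop described in the question, just without holding the lines in a list.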

Keeping Encoding when Reading Image File

I'm currently reading through a file which contains metadata and a TIFF image, like so:
private String readFile(String filename) throws IOException {
    File file = new File(filename);
    int size = (int) file.length();
    byte[] bytes = new byte[size];
    BufferedInputStream buf = new BufferedInputStream(new FileInputStream(file));
    buf.read(bytes, 0, bytes.length); // note: a single read() call is not guaranteed to fill the array
    buf.close();
    ...
}
I parse the metadata + image content, then I try to output the TIFF like this, where img is a String:
writer = new BufferedWriter(new FileWriter("img.tiff"));
writer.write(img);
writer.close();
Why is the encoding of the TIFF image file being lost?
Why are you trying to rewrite the file?
If the answer is "I'm trying to alter some metadata within the file," I strongly suggest that you use a set of tools specifically geared towards working with TIFF metadata, especially if you intend to manipulate/alter that metadata, as there are several special-case data elements in TIFF files that really don't like being moved around blithely.
My day-to-day job involves understanding the TIFF spec, so I always get a little antsy when I see people mucking around with the internals of TIFFs without first consulting the spec, or without being concerned with some of the bizarre special cases that exist in the wild. Those cases now need to be handled because someone who didn't fully grok the spec created a commercial product that generated thousands of these beasts (I'm looking at you, Microsoft, for making "old style JPEG compression" TIFFs). I've also seen a Java product that defined a type of image using floating-point numbers for the component values without bothering to (1) normalize them as the spec would have you do, or (2) define a standard for what the expected min and max of the component values would be.
In my code base (and this is a commercial product), you can do your work like this:
TiffFile myTiff = new TiffFile();
myTiff.read(someImageInputStream);
for (TiffDirectory dir : myTiff.getImages())
{
    // a TiffDirectory contains a collection of TiffTag objects, from which the
    // metadata for each image in the document can be read/edited
    // (TiffTag definitions can be found in the TIFF 6.0 specification)
}
myTiff.save(someImageOutputStream); // writes the whole TIFF back
In general, we've found that it's really only advanced customers who want to do this. For the most part, customers are more concerned with higher-level operations like combining TIFF files into a single document or extracting pages, for which we have a different, much lighter-weight API that doesn't require you to know the TIFF specification (as it should be).
Try specifying the encoding in your writer.
http://docs.oracle.com/javase/7/docs/api/java/io/OutputStreamWriter.html#OutputStreamWriter%28java.io.OutputStream,%20java.nio.charset.CharsetEncoder%29
Wrap your stream:
out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), "UTF-8"));
For images you should look into the ImageIO package.
http://docs.oracle.com/javase/7/docs/api/javax/imageio/ImageIO.html#getImageWriter%28javax.imageio.ImageReader%29
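As a rough sketch of the ImageIO route (TIFF read/write support is only built into ImageIO from Java 9 onwards; on older JDKs you would need a plugin such as JAI or TwelveMonkeys): keeping the data as pixels/bytes rather than a String avoids the encoding problem entirely. Note, though, that this particular round trip keeps only the first page's pixel data and drops the TIFF metadata, which is exactly why the first answer recommends a metadata-aware library. File names are placeholders.
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class TiffRoundTrip {
    public static void main(String[] args) throws Exception {
        // Read the image as pixel data, not as text
        BufferedImage image = ImageIO.read(new File("input.tiff")); // placeholder path

        // Write it back out as TIFF; no character encoding is involved
        boolean written = ImageIO.write(image, "TIFF", new File("img.tiff"));
        if (!written) {
            System.err.println("No TIFF writer available on this JVM");
        }
    }
}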

Failing for Larger Input Files Only: FileServiceFactory getBlobKey throws IllegalArgumentException

I have a Google App Engine app that converts XML to CSV files. It works fine for small XML inputs, but refuses to finalize the file for larger input XML. The XML is read from, and the resulting CSV files are written to, many times before finalization, over a long-running (multi-day) task. My problem is different from this FileServiceFactory getBlobKey throws IllegalArgumentException question, since my code works fine in both production and development with small input files, so it's not that I'm neglecting to write to the file before closing/finalizing. The failure only happens when I read from a larger XML file: the input XML file is ~150 MB, and each of the resulting 5 CSV files is much smaller (perhaps 10 MB each). I persisted the file URLs for the new CSV files, and even tried to close them with some static code, but I just reproduce the same error, which is
java.lang.IllegalArgumentException: creation_handle: String properties must be 500 characters or less. Instead, use com.google.appengine.api.datastore.Text, which can store strings of any length.
at com.google.appengine.api.datastore.DataTypeUtils.checkSupportedSingleValue(DataTypeUtils.java:242)
at com.google.appengine.api.datastore.DataTypeUtils.checkSupportedValue(DataTypeUtils.java:207)
at com.google.appengine.api.datastore.DataTypeUtils.checkSupportedValue(DataTypeUtils.java:173)
at com.google.appengine.api.datastore.Query$FilterPredicate.<init>(Query.java:900)
at com.google.appengine.api.datastore.Query$FilterOperator.of(Query.java:75)
at com.google.appengine.api.datastore.Query.addFilter(Query.java:351)
at com.google.appengine.api.files.FileServiceImpl.getBlobKey(FileServiceImpl.java:329)
But I know that it's not a String/Text data type issue, since I am already using file service URLs of similar length from the previous successful attempts with smaller files. It also wasn't an issue for the other Stack Overflow post I linked above. I also tried putting one last meaningless write before finalizing, just in case it would help as it did for the other post, but it made no difference. So there's really no way for me to debug this... Here is my file-closing code that is not working. It's pretty similar to the Google how-to example at http://developers.google.com/appengine/docs/java/blobstore/overview#Writing_Files_to_the_Blobstore.
log.info("closing out file 1");
try {
    // locked set to true
    FileWriteChannel fwc1 = fileService.openWriteChannel(csvFile1, true);
    fwc1.closeFinally();
} catch (IOException ioe) {
    ioe.printStackTrace();
}
// You can't get the blob key until the file is finalized
BlobKey blobKeyCSV1 = fileService.getBlobKey(csvFile1);
log.info("csv blob storage key is:" + blobKeyCSV1.getKeyString());
csvUrls[i-1] = blobKeyCSV1.getKeyString();
break;
At this point, I just want to finalize my new blob files, for which I have the URLs, but I cannot. How can I get around this issue, and what may be the cause? Again, my code works for small files (~60 KB), but the ~150 MB input file fails. Thank you for any advice on what is causing this or how to get around it! Also, how long will my unfinalized files stick around before being deleted?
This issue was a bug in the Java MapReduce and Files API, which was recently fixed by Google. Read the announcement here: groups.google.com/forum/#!topic/google-appengine/NmjYYLuSizo

FileOutputStream is really slow

I am downloading databases from the network, which are between 100 KB and 500 KB in size. Here is my code (irrelevant code removed):
URLConnection uConnection = downloadUrl.openConnection();
InputStream iS = uConnection.getInputStream();
BufferedInputStream bIS = new BufferedInputStream(iS);
byte[] buffer = new byte[1024];
FileOutputStream fOS = new FileOutputStream(db);
int bufferLength = 0;
while ((bufferLength = bIS.read(buffer)) > 0) {
fOS.write(buffer, 0, bufferLength);
}
fOS.close();
My problem is that it takes a long time to finish the while loop. Have I messed up the code somewhere? It shouldn't take that long for such small files, should it? I'm talking about a minute for three files no larger than 1 MB altogether... Thanks in advance!
"Slow" is really rather ambiguous. That being said, considering what you're trying to do you shouldn't be using a BufferedInputStream and your buffer is way too small.
The buffered wrappers are for optimizing small reads/writes. Since all you're doing is trying to read a ton of data as fast as you can, you should just read directly from the InputStream, and use a large buffer (Say, 64k since the underlying native code is probably going to chunk at that size anyway).
byte[] buffer = new byte[65536];
...
while ((bufferLength = iS.read(buffer, 0, buffer.length)) > 0) {
...
The real solution arrived in JDK 1.7, and it is reliable, fast, and simple; it pretty much draws a veil over the older java.io solutions. Although the web is still full of examples that copy files in Java using In/OutputStreams, I warmly suggest that everyone use the simple method java.nio.file.Files.copy(Path source, Path target), with optional parameters for replacing the destination and copying file attribute metadata (and the companion Files.move, which can even attempt an atomic move if permitted by the underlying OS). That's a really good piece of work, and one we waited a long time for! You can easily convert code from copy(File file1, File file2) by appending .toPath() to the File instances (e.g. file1.toPath(), file2.toPath()). Note also that the boolean method Files.isSameFile(file1.toPath(), file2.toPath()) is already used inside the copy method, but is easily usable wherever you need it. If you can't upgrade to 1.7, the community libraries from Apache (commons-io) or Google (Guava) are still the suggested route.
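Applied to the download loop from the question, a minimal NIO.2 sketch might look like this (the URL and file name are placeholders):
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class DownloadWithNio {
    public static void main(String[] args) throws Exception {
        URL downloadUrl = new URL("http://example.com/database.db"); // placeholder
        try (InputStream in = downloadUrl.openConnection().getInputStream()) {
            // Streams the bytes straight to disk with an internal buffer,
            // replacing the manual read/write loop entirely
            Files.copy(in, Paths.get("database.db"), StandardCopyOption.REPLACE_EXISTING);
        }
    }
}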

Library for writing XMP to a multipage TIFF

Can you recommend a library that lets me add XMP data to a TIFF file? Preferably a library that can be used with Java.
There is JempBox which is open source and allows the manipulation of XMP streams, but it doesn't look like it will read/write the XMP data in a TIFF file.
There is also Chilkat which is not open source, but does appear to do what you want.
It's been a while, but it may still be useful to someone: Apache Commons has a library called Sanselan (since renamed Apache Commons Imaging) that is suitable for this task. It's a bit dated and the documentation is sparse, but it does the job well nevertheless:
File file = new File("path/to/your/file");

// Get the XMP XML data from the file
String xml = Sanselan.getXmpXml(file);

// Process the XML data
xml = processXml(xml);

// Write the XMP XML data back to the file
Map<String, Object> params = new HashMap<String, Object>();
params.put(Sanselan.PARAM_KEY_XMP_XML, xml);
BufferedImage image = Sanselan.getBufferedImage(file);
Sanselan.writeImage(image, file, Sanselan.guessFormat(file), params);
You may have to be careful with multipage TIFFs, though, because Sanselan.getBufferedImage will probably only get the first page (so only the first page gets written back).
