Keeping Encoding when Reading Image File - Java

I'm currently reading through a file which contains metadata and a TIFF image, like so:
private String readFile(String filename) throws IOException {
    File file = new File(filename);
    int size = (int) file.length();
    byte[] bytes = new byte[size];
    BufferedInputStream buf = new BufferedInputStream(new FileInputStream(file));
    buf.read(bytes, 0, bytes.length); // note: a single read() is not guaranteed to fill the array
    buf.close();
    ...
}
I parse the metadata and image content, then I try to output the TIFF like this, where img is a String:
writer = new BufferedWriter( new FileWriter( "img.tiff"));
writer.write(img);
writer.close();
Why is the encoding of the TIFF image file being lost?

Why are you trying to rewrite the file?
If the answer is "I'm trying to alter some metadata within the file," I strongly suggest that you use a set of tools specifically geared towards working with TIFF metadata, especially if you intend to manipulate/alter that metadata, as there are several special-case data elements in TIFF files that really don't like being moved around blithely.
My day-to-day job involves understanding the TIFF spec, so I always get a little antsy when I see people mucking around with the internals of TIFFs without first consulting the spec or considering the bizarre special cases that exist in the wild. These cases now need handling because someone who didn't fully grok the spec created a commercial product that generated thousands of these beasts (I'm looking at you, Microsoft, for making "old-style JPEG compression" TIFFs; I've also seen a Java product that defined an image type using floating-point component values without bothering to (1) normalize them as the spec would have you do or (2) define a standard for the expected min and max of those values).
In my code base (and this is a commercial product), you can do your work like this:
TiffFile myTiff = new TiffFile();
myTiff.read(someImageInputStream);
for (TiffDirectory dir : myTiff.getImages())
{
    // a TiffDirectory contains a collection of TiffTag objects, from which the
    // metadata for each image in the document can be read/edited
    // (TiffTag definitions are documented in the TIFF specification)
}
myTiff.save(someImageOutputStream); // writes the whole TIFF back
In general, we've found that it's really only advanced customers who want to do this. For the most part, customers are more concerned with higher-level operations like combining TIFF files into a single document or extracting pages, for which we have a different API that is much lighter weight and doesn't require you to know the TIFF specification (as it should be).

Try specifying the encoding in your writer.
http://docs.oracle.com/javase/7/docs/api/java/io/OutputStreamWriter.html#OutputStreamWriter%28java.io.OutputStream,%20java.nio.charset.CharsetEncoder%29
Wrap your stream:
out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), "UTF-8"));
For images you should look into the ImageIO package.
http://docs.oracle.com/javase/7/docs/api/javax/imageio/ImageIO.html#getImageWriter%28javax.imageio.ImageReader%29
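Since a TIFF is binary data rather than text, though, any round-trip through a String and a charset can corrupt bytes; the safest route is to avoid Readers and Writers entirely and copy raw bytes. A minimal sketch (file names are placeholder assumptions):

import java.io.*;

public class BinaryCopy {
    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream("input.tiff"));
             OutputStream out = new BufferedOutputStream(new FileOutputStream("img.tiff"))) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n); // copy the bytes verbatim; no charset involved
            }
        }
    }
}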

Related

Optimize metadata writing in DOC, XLS files

I'm writing a program that modifies only the metadata (standard and custom) in DOC, XLS, PPT and VSD files. The program works correctly, but I wonder if there is a way to do this without loading the entire file into memory:
POIFSFileSystem poifs = new POIFSFileSystem(new FileInputStream("file.xls"));
The NPOIFSFileSystem method is faster and consumes less memory, but is read-only.
I'm using Apache POI 3.9
You could map the desired part to memory and then work on it using java.nio.FileChannel.
In addition to the familiar read, write, and close operations of byte channels, this class defines the following file-specific operations:
Bytes may be read or written at an absolute position in a file in a way that does not affect the channel's current position.
A region of a file may be mapped directly into memory; for large files this is often much more efficient than invoking the usual read or write methods.
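For illustration, a minimal sketch of mapping a region of a file with FileChannel; the file name, offset and length are placeholder assumptions:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MapRegion {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("file.xls", "rw");
             FileChannel channel = raf.getChannel()) {
            // map only the first 4 KB instead of buffering the whole file
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            byte first = buf.get(0); // read at an absolute position
            buf.put(0, first);       // write back in place without moving the channel position
        }
    }
}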
At the time of your question, there sadly wasn't a very low memory way to do it. The good news is that as of 2014-04-28 it is possible! (This code should be in 3.11 when that's released, but for now it's too new)
Now that NPOIFS supports writing, including in-place write, what you'll want to do is something like:
// Open the file, and grab the entries for the summary streams
NPOIFSFileSystem fs = new NPOIFSFileSystem(file, false);
DirectoryEntry root = fs.getRoot();
DocumentNode sinfDoc =
    (DocumentNode)root.getEntry(SummaryInformation.DEFAULT_STREAM_NAME);
DocumentNode dinfDoc =
    (DocumentNode)root.getEntry(DocumentSummaryInformation.DEFAULT_STREAM_NAME);

// Open and parse the metadata
SummaryInformation sinf = (SummaryInformation)PropertySetFactory.create(
    new NDocumentInputStream(sinfDoc));
DocumentSummaryInformation dinf = (DocumentSummaryInformation)PropertySetFactory.create(
    new NDocumentInputStream(dinfDoc));

// Make some metadata changes
sinf.setAuthor("Changed Author");
sinf.setTitle("Le titre \u00e9tait chang\u00e9");
dinf.setManager("Changed Manager");

// Update the metadata streams in the file
sinf.write(new NDocumentOutputStream(sinfDoc));
dinf.write(new NDocumentOutputStream(dinfDoc));

// Write out our changes
fs.writeFilesystem();
fs.close();
You ought to be able to do all of that in under 20% of the memory of the size of your file, quite possibly less than that for larger files!
(If you want to see more on this, look at the ModifyDocumentSummaryInformation example and the HPSF TestWrite unit test)

What is the difference between using IOUtils and ImageIO for writing an image file

I have a TIFF image stored as a Base64-encoded String in a file. My aim is to create a TIFF file out of it. This is what I am doing:
String base64encodedTiff = IOUtils.toString(new FileInputStream("C:/tiff-attachment.txt"));
byte[] imgBytes = DatatypeConverter.parseBase64Binary(base64encodedTiff);
BufferedImage bufImg = ImageIO.read(new ByteArrayInputStream(imgBytes));
ImageIO.write(bufImg, "tiff", new File("c:/new-darksouls-imageIO-tiff.tiff"));
ImageIO.write() is throwing an IllegalArgumentException because bufImg is null. I don't understand what I am doing wrong here.
On the contrary if I use IOUtils to write, it works fine:
IOUtils.write(imgBytes, new FileOutputStream("c:/new-darksouls-io-tiff.tiff"));
Please help me understand:
why ImageIO is throwing the exception, and
what the right API and approach are for what I am trying to achieve.
ImageIO would be useful if, for example, you wanted to convert a PNG to a JPEG. Since you don't need to manipulate the image or convert to another format, don't bother with ImageIO. Just use IOUtils.write() to save the TIFF data verbatim.
ImageIO.read() is returning a null image because it can't read the TIFF file, probably because TIFF isn't one of the standard ImageIO plugin formats. The standard supported image formats are listed here:
http://docs.oracle.com/javase/6/docs/api/javax/imageio/package-summary.html
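As a quick diagnostic, you can ask ImageIO at runtime which formats it can actually decode; a small sketch:

import java.util.Arrays;
import javax.imageio.ImageIO;

public class CheckTiffSupport {
    public static void main(String[] args) {
        // formats the current JRE (plus any installed plugins) can read
        System.out.println(Arrays.toString(ImageIO.getReaderFormatNames()));
        // or ask directly whether a TIFF reader is registered
        boolean hasTiff = ImageIO.getImageReadersByFormatName("tiff").hasNext();
        System.out.println("TIFF reader available: " + hasTiff);
    }
}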
An additional note -- the code you posted buffers the entire image in memory. If you're concerned about using memory efficiently, consider using some kind of Base64 decoding input stream to perform the decoding on the fly. That might look like this:
try (FileOutputStream out = new FileOutputStream("c:/new-darksouls-io-tiff.tiff");
     FileInputStream in = new FileInputStream("C:/tiff-attachment.txt");
     Base64InputStream decodedIn = new Base64InputStream(in)) {
    IOUtils.copy(decodedIn, out);
}

Filtering Wikipedia's XML dump: error on some accents

I'm trying to index Wikipedia dumps. My SAX parser makes Article objects from the XML with only the fields I care about, then sends them to my ArticleSink, which produces Lucene Documents.
I want to filter special/meta pages like those prefixed with Category: or Wikipedia:, so I made an array of those prefixes and test the title of each page against this array in my ArticleSink, using article.getTitle().startsWith(prefix). In English, everything works fine: I get a Lucene index with all the pages except those with matching prefixes.
In French, the prefixes with no accent also work (i.e. they filter the corresponding pages), some of the accented prefixes don't work at all (like Catégorie:), and some work most of the time but fail on some pages (like Wikipédia:), yet I cannot see any difference between the corresponding lines (in less).
I can't really inspect all the differences in the file because of its size (5 GB), but it looks like correct UTF-8 XML. If I take a portion of the file using grep or head, the accents are correct (even on the incriminated pages; the <title>Catégorie:something</title> is correctly displayed by grep). On the other hand, when I recreate a wiki XML by tail/head-cutting the original file, the same page (here Catégorie:Rock par ville) gets filtered in the small file but not in the original…
Any idea?
Alternatives I tried:
Getting the file (commented lines were tried without success*):
FileInputStream fis = new FileInputStream(new File(xmlFileName));
//ReaderInputStream ris = ReaderInputStream.forceEncodingInputStream(fis, "UTF-8" );
//(custom function opening the stream,
//reading it as UTF-8 into a Reader and returning another byte stream)
//InputSource is = new InputSource( fis ); is.setEncoding("UTF-8");
parser.parse(fis, handler);
Filtered prefixes:
ignoredPrefix = new String[] {"Catégorie:", "Modèle:", "Wikipédia:",
"Cat\uFFFDgorie:", "Mod\uFFFDle:", "Wikip\uFFFDdia:", //invalid char
"Catégorie:", "Modèle:", "Wikipédia:", // UTF-8 as ISO-8859-1
"Image:", "Portail:", "Fichier:", "Aide:", "Projet:"}; // those last always work
* ERRATUM
Actually, my bad, that one I tried does work; I had tested the wrong index:
InputSource is = new InputSource( fis );
is.setEncoding("UTF-8"); // force UTF-8 interpretation
parser.parse(is, handler);
Since you write the prefixes as plain strings into your source file, you want to make sure that you save that .java file in UTF-8, too (or any other encoding that supports the special characters you're using). Then, however, you have to tell the compiler which encoding the file is in with the -encoding flag:
javac -encoding utf-8 *.java
For the XML source, you could try
Reader r = new InputStreamReader(new FileInputStream(xmlFileName), "UTF-8");
InputStreams do not deal with encodings since they are byte-based, not character-based. So, here we create a Reader from a FileInputStream - the latter (stream) doesn't know about encodings, but the former (reader) does, because we give the encoding in the constructor.
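You can then hand that Reader to the SAX parser through an InputSource so it never has to sniff the bytes itself. A minimal sketch, with a stand-in DefaultHandler where your ArticleSink handler would go:

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ParseUtf8Xml {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        Reader r = new InputStreamReader(new FileInputStream(args[0]), "UTF-8");
        // the Reader fixes the decoding, so the InputSource carries characters, not bytes
        parser.parse(new InputSource(r), new DefaultHandler() /* your handler here */);
    }
}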

Library for writing XMP to a multipage TIFF

Can you recommend a library that lets me add XMP data to a TIFF file? Preferably a library that can be used with Java.
There is JempBox which is open source and allows the manipulation of XMP streams, but it doesn't look like it will read/write the XMP data in a TIFF file.
There is also Chilkat which is not open source, but does appear to do what you want.
It's been a while, but it may still be useful to someone: Apache Commons has a library called Sanselan suitable for this task. It's a bit dated and the documentation is sparse, but it does the job well nevertheless:
File file = new File("path/to/your/file");
// Get the XMP XML data from the file
String xml = Sanselan.getXmpXml(file);
// Process the XML data
xml = processXml(xml);
// Write the XMP XML data back to the file
Map<String, Object> params = new HashMap<String, Object>();
params.put(Sanselan.PARAM_KEY_XMP_XML, xml);
BufferedImage image = Sanselan.getBufferedImage(file);
Sanselan.writeImage(image, file, Sanselan.guessFormat(file), params);
You may have to be careful with multipage TIFFs though, because Sanselan.getBufferedImage will probably only get the first (so only the first gets written back).
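To at least detect the multipage case before pages get dropped, Sanselan can return every frame in the file. A hedged sketch, assuming the org.apache.sanselan.Sanselan API and a placeholder path:

import java.awt.image.BufferedImage;
import java.io.File;
import java.util.List;
import org.apache.sanselan.Sanselan;

public class CountTiffPages {
    public static void main(String[] args) throws Exception {
        File file = new File("path/to/your/file.tiff");
        // getAllBufferedImages returns one BufferedImage per page/frame
        List<BufferedImage> pages = Sanselan.getAllBufferedImages(file);
        if (pages.size() > 1) {
            System.out.println("Multipage TIFF (" + pages.size()
                    + " pages); writeImage would keep only the first.");
        }
    }
}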

CSV file validation with Java

I'm reading a file line by line, like this:
FileReader myFile = new FileReader(file);
BufferedReader inputFile = new BufferedReader(myFile);
// Read the first line
String currentRecord = inputFile.readLine();
while (currentRecord != null) {
    // process currentRecord here
    currentRecord = inputFile.readLine();
}
But if other types of files are uploaded, it will still read their contents. For instance, if the uploaded file is an image, it will output junk characters when reading it. So my question is: how can I make sure the file is a CSV before reading it?
Checking the file's extension is kind of lame, since someone can upload a file that is not CSV but has a .csv extension. Thanks in advance.
Determining the MIME type of a file is not something easy to do, especially if ASCII sections can be mixed with binary ones.
Actually, when you look at how a Java mail system determines the MIME type of an email, it involves reading all the bytes and applying some "rules".
Check out MimeUtility.java
If the primary type of this datasource is "text" and if all the bytes in its input stream are US-ASCII, then the encoding is "7bit".
If more than half of the bytes are non-US-ASCII, then the encoding is "base64".
If less than half of the bytes are non-US-ASCII, then the encoding is "quoted-printable".
If the primary type of this datasource is not "text", then if all the bytes of its input stream are US-ASCII, the encoding is "7bit".
If there is even one non-US-ASCII character, the encoding is "base64".
@return "7bit", "quoted-printable" or "base64"
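A minimal sketch of that byte-counting heuristic applied to a file; the thresholds follow the rules above, while the streaming is simplified:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class AsciiHeuristic {
    // returns "7bit", "quoted-printable" or "base64", mirroring MimeUtility's text rules
    static String guessEncoding(String path) throws IOException {
        long total = 0, nonAscii = 0;
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            int b;
            while ((b = in.read()) != -1) {
                total++;
                if (b > 0x7F) nonAscii++; // outside US-ASCII
            }
        }
        if (nonAscii == 0) return "7bit";
        return nonAscii * 2 > total ? "base64" : "quoted-printable";
    }

    public static void main(String[] args) throws IOException {
        System.out.println(guessEncoding(args[0]));
    }
}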
As mentioned by mmyers in a deleted comment, JavaMimeType is supposed to do the same thing, but:
it has been dead since 2006
it involves reading the whole content!
For example:
File file = new File("/home/bibi/monfichieratester");
InputStream inputStream = new FileInputStream(file);
ByteArrayOutputStream byteArrayStream = new ByteArrayOutputStream();
int readByte;
while ((readByte = inputStream.read()) != -1) {
byteArrayStream.write(readByte);
}
String mimetype = "";
byte[] bytes = byteArrayStream.toByteArray();
MagicMatch m = Magic.getMagicMatch(bytes);
mimetype = m.getMimeType();
So... since you are reading the whole content of the file anyway, you could take advantage of that to determine the type based on that content and your own rules.
Java Mime Magic may be of use. It'll analyse MIME types from files and input streams. I can't vouch for its functionality, however.
This link may provide further info. It provides several different means of determining how to do what you want (or at least something similar).
I would perhaps be tempted to write something specific to your problem domain. e.g. determining the number of comma-separated values per line and rejecting if it's not within certain limits. Then split on the commas and parse each entry according to requirements (e.g. are they doubles/floats/valid Strings - and if strings, what encoding). I think you may have to do this anyway, given that someone may upload a file that starts like a CSV but is corrupted half-way through.
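A rough sketch of that kind of domain-specific check; the expected field count and the simple comma split are placeholder assumptions (a real CSV parser would also handle quoted fields):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CsvSanityCheck {
    // true if every line splits into the expected number of comma-separated fields
    static boolean looksLikeCsv(String path, int expectedFields) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                // -1 keeps trailing empty fields instead of silently dropping them
                if (line.split(",", -1).length != expectedFields) {
                    return false; // reject: wrong number of comma-separated values
                }
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(looksLikeCsv(args[0], 5));
    }
}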
