Processing zipped XML files in Hadoop using MapReduce - Java

I have a file structure like this:
a.zip contains a1.zip, a2.zip, and a3.zip, and each of these inner archives contains one XML file.
I need to process these XML files. Currently I extract the inner zip files from a.zip, store them in HDFS, and run an MR job that processes a1.zip, a2.zip, ... using a custom input format and record reader.
Can anyone suggest a better solution where I don't have to unzip a.zip and can still process the files in parallel?

Why don't you write a normal Java pre-processor class which you can call from the main program? The steps would be:
1) The pre-processor class programmatically extracts a.zip into a temp location.
2) It programmatically adds the child zip files to HDFS.
3) Fire the XML processing the way you are doing it now.
4) If you wish, you can extend the pre-processor class to place the XML files in HDFS directly, which would keep the XML processing program simpler.
Let me know if something is not clear here.
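
If the goal is to avoid unpacking a.zip to disk at all, one option is to read the nested archives as streams. The following is only a sketch under stated assumptions: the class and method names (NestedZipReader, readNestedXml) are made up for illustration, the XML is assumed to be UTF-8, and the outer stream could come from any source (for example an HDFS input stream). Note that a single zip stream is read sequentially, so this removes the extraction step but does not by itself parallelize the work inside one a.zip.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: reads XML entries from zips nested inside an outer zip.
public class NestedZipReader {

    /** Returns the XML text of every inner entry, keyed by "inner.zip!/file.xml". */
    public static Map<String, String> readNestedXml(InputStream outer) throws IOException {
        Map<String, String> xmlByPath = new LinkedHashMap<>();
        try (ZipInputStream outerZip = new ZipInputStream(outer)) {
            ZipEntry outerEntry;
            while ((outerEntry = outerZip.getNextEntry()) != null) {
                if (outerEntry.isDirectory() || !outerEntry.getName().endsWith(".zip")) {
                    continue;
                }
                // Wrap the current outer entry. Do not close innerZip:
                // closing it would also close outerZip.
                ZipInputStream innerZip = new ZipInputStream(outerZip);
                ZipEntry innerEntry;
                while ((innerEntry = innerZip.getNextEntry()) != null) {
                    if (innerEntry.getName().endsWith(".xml")) {
                        // Assumption: the XML files are UTF-8 encoded.
                        String xml = new String(readAll(innerZip), StandardCharsets.UTF_8);
                        xmlByPath.put(outerEntry.getName() + "!/" + innerEntry.getName(), xml);
                    }
                }
            }
        }
        return xmlByPath;
    }

    // Reads the current entry to its end without closing the stream.
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

Inside a custom record reader, the same loop could emit one record per inner XML file instead of collecting everything into a map.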

Related

Java Servlet 3.0 File Upload to input stream - without intermediate folders or files being created

I don't know how to do this, or whether it is possible or wise, so any form of answer that points me to a library, example or reasoning will be helpful.
I need to upload and process some Java XML files (actually, XSLT files - XML Excel files).
I don't want to store the file on the server and then invoke processing on it. Instead, I want to stream the file in, and process it as a stream.
I also want to be able to process multipart file uploads, but still process that as an input stream.
I am expressly trying to avoid creating a file on disk for this.

How to decode and read a chm file in java?

I see many suggestions to use the jchm jar to do this. However I don't see any code which shows how to read the chm file using the jchm jar.
I used hh.exe -decompile [destination_folder] chmfile.chm to decompile the .chm into its underlying .htm files. I am now writing java code to parse all these .htm to create a tree structure to store the complex structure of hyperlinks in the .chm file.

How can I add a file to another file, such as one named store.dat, in Java?

All data must be stored in one single persistent file named secure_store.dat.
The following command should add new files to the Secure Store realm:
put [path_on_OS] [file_name]
How can I do this?
How can I add a file that is on my PC to secure.store? Thank you.
If you don't mind that secure_store.dat will be a zipped file then you can use standard Java handling for zipped files...
Edit:
When you add multiple files together into one single file, you must store them in a way that preserves their boundaries. If you fail to do that, the files will become a garbled mess.
java.util.zip provides all the features that you seem to need. It will create a zipped archive file with a separate entry for each file that you add. It provides functionality to add/extract/remove files from the archive too.
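
A rough sketch of the put command, under some assumptions: the class and method names (SecureStore, put, get) are hypothetical, and it uses the NIO zip file system provider instead of the java.util.zip streams directly, because adding an entry to an existing archive with ZipOutputStream means rewriting the whole archive yourself. The store file must either not exist yet or already be a valid zip archive.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.HashMap;
import java.util.Map;

// Hypothetical store that keeps every file as one entry of a zip archive.
public class SecureStore {

    /** Adds (or replaces) the file at source under the entry name fileName. */
    public static void put(Path store, Path source, String fileName) throws IOException {
        Map<String, String> env = new HashMap<>();
        env.put("create", "true"); // create the archive if it does not exist yet
        try (FileSystem zipFs = FileSystems.newFileSystem(toZipUri(store), env)) {
            Files.copy(source, zipFs.getPath(fileName), StandardCopyOption.REPLACE_EXISTING);
        }
    }

    /** Reads one stored entry back into memory. */
    public static byte[] get(Path store, String fileName) throws IOException {
        try (FileSystem zipFs = FileSystems.newFileSystem(toZipUri(store), new HashMap<String, String>())) {
            return Files.readAllBytes(zipFs.getPath(fileName));
        }
    }

    // The zip file system provider is addressed with a "jar:" URI.
    private static URI toZipUri(Path store) {
        return URI.create("jar:" + store.toAbsolutePath().toUri());
    }
}
```

With this in place, `put [path_on_OS] [file_name]` would map to something like `SecureStore.put(Paths.get("secure_store.dat"), Paths.get(pathOnOs), fileName)`.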

EXIFTool JSON to EXIF batch processing

I have >400 JPG files and a JSON file for each which contains the image tags, description and title. I've found this command
exiftool -json=picture.json picture.jpg
But I don't want to run this for each and every file.
How can I run this command for the folder containing the JPGs and JSONs or is there another way I can batch process these?
Each JSON file has the same name as its JPG counterpart, so it's easy to identify which files match up to each other.
Assuming your JPGs and JSONs have the same filename but different extensions (e.g. picture001.jpg has an associated picture001.json, etc.), a shell for loop might work.
Assuming you've already cd-ed into the folder and the files aren't nested in subfolders, something like this should work:
for jpg in *.jpg; do exiftool -json="${jpg%.jpg}.json" "$jpg"; done
Note that this isn't tested. I recommend making a copy of your folder and testing there beforehand to make sure you don't irreversibly damage them.
I've also noticed you're using the java tag. I had to work with EXIF data in Java a while back (on Android then) and I used the JHeader library. If you want to roll your own little java command line tool, you should be able to use Java's IO classes to traverse your directory and files and the JHeader library to modify the EXIF data.

Two applications need to export and import a single file which needs to include data and images, best file type?

I'm making two Java applications: one to collect data, another to use it. The one collecting will import a file from the other; that file will include data and images and will be decrypted.
I'm unsure what file type to use. So far all of the data is in XML and works great, but I need the images too and was hoping not to have to ship all the images in a folder with a path reference.
Ideas?
Well, I think the best way is to create your own format (.myformat or .data). This file will in fact be a zip file that contains your XML file and images.
There is no perfect example written in Java as far as I know. However, here are some examples:
Not in Java
The best example is, as @Bolo said, the odt format. OpenOffice writes the document into an XML file and stores the images alongside it. All of that is wrapped in a single odt file.
The .exe file is another example. The compiled code and the resources are put in a single file. Try opening one with 7-Zip and you'll see.
The Skyrim plugins are .esp files that contain the dds, the scripts, the niffs (textures)...
In Java
The Minecraft texture packs are zip files that contain a .mcmeta file (the metadata) and the textures (.png).
Jar files are like exe.
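
As a rough illustration of the "zip with your own extension" idea, here is a sketch using java.util.zip. The class name BundleFile, the entry name data.xml, and the images/ prefix are all assumptions chosen for the example, not part of any existing format.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical container: one zip holding data.xml plus images/<name>.
public class BundleFile {

    /** Writes the XML and all images into a single zip-based bundle file. */
    public static void write(Path bundle, byte[] xml, Map<String, byte[]> images) throws IOException {
        try (ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(bundle))) {
            zip.putNextEntry(new ZipEntry("data.xml"));
            zip.write(xml);
            zip.closeEntry();
            for (Map.Entry<String, byte[]> image : images.entrySet()) {
                zip.putNextEntry(new ZipEntry("images/" + image.getKey()));
                zip.write(image.getValue());
                zip.closeEntry();
            }
        }
    }

    /** Reads every entry of the bundle back into memory, keyed by entry name. */
    public static Map<String, byte[]> read(Path bundle) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipFile zip = new ZipFile(bundle.toFile())) {
            Enumeration<? extends ZipEntry> all = zip.entries();
            while (all.hasMoreElements()) {
                ZipEntry entry = all.nextElement();
                try (InputStream in = zip.getInputStream(entry)) {
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buffer = new byte[8192];
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                    entries.put(entry.getName(), out.toByteArray());
                }
            }
        }
        return entries;
    }
}
```

The exporting application would call write, and the importing one would call read and look up data.xml and the images/ entries by name.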
If both programs are in Java you could also go with serialization, which is basically saving an object as a file (the suffix will be .ser, I think) and then being able to retrieve it. You should google it; even if it doesn't help right now, it is quite good to know about.
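
A minimal sketch of that serialization approach, assuming a made-up ExportData class that carries the XML text and one image as raw bytes. Both applications would need the same class on their classpath.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical example of saving and loading one object with Java serialization.
public class ExportSerialization {

    /** Illustrative payload: the XML text plus one image as raw bytes. */
    public static class ExportData implements Serializable {
        private static final long serialVersionUID = 1L;
        public final String xml;
        public final byte[] image;

        public ExportData(String xml, byte[] image) {
            this.xml = xml;
            this.image = image;
        }
    }

    /** Writes the object to a file (for example export.ser). */
    public static void save(ExportData data, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(data);
        }
    }

    /** Reads the object back; the ExportData class must be on the classpath. */
    public static ExportData load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (ExportData) in.readObject();
        }
    }
}
```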
I'd suggest using JSON. Gson is a decent library.
You can embed images as byte arrays.
Save the serialized string in a file with a preferred extension, read it from the second application, de-serialize, and reconstruct images.
You can convert binary image data to text with Base64 encoding (http://en.wikipedia.org/wiki/Base64), and this way you can embed your images in XML.
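
For instance, with java.util.Base64 (available since Java 8) the embedding could look roughly like this. The element name image and its name attribute are assumptions for illustration, and the sketch assumes the name contains no characters that need XML escaping.

```java
import java.util.Base64;

// Hypothetical helper for embedding image bytes in XML as Base64 text.
public class Base64ImageXml {

    /** Wraps the Base64 form of the image bytes in an <image> element. */
    public static String toXmlElement(String name, byte[] imageBytes) {
        // The Base64 alphabet (A-Z, a-z, 0-9, +, /, =) is safe inside XML text.
        String encoded = Base64.getEncoder().encodeToString(imageBytes);
        return "<image name=\"" + name + "\">" + encoded + "</image>";
    }

    /** Decodes the element's text content back into the original bytes. */
    public static byte[] fromBase64(String text) {
        return Base64.getDecoder().decode(text.trim());
    }
}
```

Keep in mind that Base64 makes the data about a third larger than the raw bytes.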
