Let's say I have this structure on HDFS:
/dir1
    /dir2
        /Name1_2015/
            file1.lzo
            file2.lzo
            file3.lzo
        /Name2_2015
            file1.lzo
            file2.lzo
    Name1_2015.lzo
I would like to merge the files in each directory under 'dir2' and append the result to the file /dir1/DirName.lzo.
For example, for /dir1/dir2/Name1_2015, I want to merge file1.lzo, file2.lzo, and file3.lzo and append the result to /dir1/Name1_2015.lzo.
Each file is LZO compressed.
How can I do it?
Thanks
If you don't care much about parallelism, here's a bash one-liner:
for d in `hdfs dfs -ls /dir2 | grep -oP '(?<=/)[^/]+$'` ; do hdfs dfs -cat /dir2/$d/*.lzo | lzop -d | lzop | hdfs dfs -put - /dir1/$d.lzo ; done
You can extract all the files in parallel using MapReduce, but how do you create a single archive from multiple files in parallel? As far as I know, it is not possible to write to a single HDFS file from multiple processes concurrently, so we end up with a single-node solution anyway.
I would do this with Hive, as follows:
Rename the subdirectories to name=1_2015 and name=2_2015
CREATE EXTERNAL TABLE sending_table
(
  all_content string
)
PARTITIONED BY (name string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY {a column delimiter that you know doesn't show up in any of the lines}
LOCATION '/dir1/dir2';
Make a second table that looks like the first, named "receiving", but with no partitions, and in a different directory.
Run this:
SET mapreduce.job.reduces=1;  -- this guarantees it'll make one file
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
insert into table receiving
select all_content from sending_table;
You can try archiving all the individual LZO files into a HAR (Hadoop Archive). I think merging them all into a single LZO file is unnecessary overhead.
Related
I have a jar file, for example apache-cassandra-3.11.6.jar.
First, I split/chunked it into multiple jars like below:
apache-cassandra1.jar
apache-cassandra2.jar
apache-cassandra3.jar
apache-cassandra4.jar
apache-cassandra5.jar
apache-cassandra6.jar
Then I reassembled them into a new jar file, i.e. apache-cassandra_Merged.jar.
Now the problem comes.
When I compare the original jar file (apache-cassandra-3.11.6.jar) with the new jar file (apache-cassandra_Merged.jar), they do not match.
The newly created apache-cassandra_Merged.jar is also smaller in size.
Please find my code below for reference:
/// Chunking/splitting into multiple jars
Path path = Paths.get("/Original_Jar/apache-cassandra-3.11.6.jar");
byte[] data = Files.readAllBytes(path); // reads all bytes of the jar at once

// Divide the bytes into equal parts and write each part into its own small jar.
int chunkSize = data.length / 6 + 1;    // 6 chunks, as listed above
int count = 0;
for (int start = 0; start < data.length; start += chunkSize) {
    int end = Math.min(start + chunkSize, data.length);
    byte[] rangeData = Arrays.copyOfRange(data, start, end);
    try (FileOutputStream fileOutputStream1 = new FileOutputStream(
            "/Cassandra_Image/Chunked_Jar/apache-cassandra" + (++count) + ".jar")) {
        fileOutputStream1.write(rangeData);
    }
}
//Merging back to one jar
For merging, I used the same approach: I read each small/chunked jar into a byte array and wrote them one by one into FileOutputStream("/Merged_Jar/apache-cassandra_Merged.jar").
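Roughly, the merge step looks like this (a simplified sketch of what I described; the chunk count and paths are shortened):

// Merge sketch: read each chunk in order and append its bytes to one output jar.
try (FileOutputStream merged =
         new FileOutputStream("/Merged_Jar/apache-cassandra_Merged.jar")) {
    for (int i = 1; i <= 6; i++) {   // 6 chunks, matching the split above
        byte[] chunk = Files.readAllBytes(
                Paths.get("/Cassandra_Image/Chunked_Jar/apache-cassandra" + i + ".jar"));
        merged.write(chunk);
    }
}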
Please let me know if I should use some other method/algorithm to split the jar and reassemble it that keeps the data identical to the original after chunking and merging.
Note: I actually want to transfer the jars to a server/directory that only accepts files up to a limited size, so for big jars I need to split them into small jars, send them one by one, and then reassemble them in the target directory/place so that the result is identical to the original jar.
Thanks in advance.
This may not be the answer, but I'm providing it as information: Java also provides a packed format, where you can compress a jar file and then uncompress it again.
The tool is called pack200.
How to compress
<java_location>...\jre\lib>pack200 -J-Xmx256m small.jar.gz big.jar
How to uncompress
<java_location>...\jre\lib>unpack200 small.jar.gz big.jar
You can refer to the following links:
https://docs.oracle.com/javase/1.5.0/docs/tooldocs/share/pack200.html
https://docs.oracle.com/javase/7/docs/technotes/tools/share/unpack200.html
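For reference, the same pack/unpack round trip is also available programmatically through the java.util.jar.Pack200 API (present up to JDK 13, removed later). A minimal sketch, with placeholder file names:

import java.io.*;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Pack200;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class PackExample {
    public static void main(String[] args) throws IOException {
        // Pack: big.jar -> small.jar.gz (gzip on top, like the pack200 tool does for .gz names)
        try (JarFile jar = new JarFile("big.jar");
             OutputStream out = new GZIPOutputStream(new FileOutputStream("small.jar.gz"))) {
            Pack200.newPacker().pack(jar, out);
        }
        // Unpack: small.jar.gz -> big.jar
        try (InputStream in = new GZIPInputStream(new FileInputStream("small.jar.gz"));
             JarOutputStream out = new JarOutputStream(new FileOutputStream("big.jar"))) {
            Pack200.newUnpacker().unpack(in, out);
        }
    }
}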
I was able to solve the issue with shell scripting.
I put the code below in my shell script file and ran it from my Java code.
split -b 1000000 src.jar target.jar
cat target.jaraa target.jarab target.jarac target.jarad target.jarae > merged.jar
Comparing with any algorithm, e.g. a sha256 checksum, works fine and shows they are equal, and the sizes are equal as well.
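If you also want to verify the result from Java, here is a small sketch (the file paths are just placeholders):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.Arrays;

public class JarChecksum {
    static byte[] sha256(String file) throws Exception {
        // Hash the whole file in one go; fine for jars that fit comfortably in memory.
        return MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(Paths.get(file)));
    }

    public static void main(String[] args) throws Exception {
        byte[] original = sha256("/Original_Jar/apache-cassandra-3.11.6.jar");
        byte[] merged = sha256("/Merged_Jar/merged.jar");   // placeholder path
        System.out.println("checksums match: " + Arrays.equals(original, merged));
    }
}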
I have a bunch of tar.gz files which I would like to process with Spark without decompressing them.
A single archive is about ~700MB and contains 10 different files but I'm interested only in one of them (which is ~7GB after decompression).
I know that context.textFile supports tar.gz, but I'm not sure it's the right tool when an archive contains more than one file. What happens is that Spark returns the content of all files in the archive (line by line), including the file names mixed with some binary data.
Is there any way to select which file from tar.gz I would like to map?
AFAIK, I'd suggest the sc.binaryFiles method; please see the doc below. Since both the file name and the file content are present, you can map over the pairs, pick the file you want, and process it.
public RDD<scala.Tuple2<String,PortableDataStream>> binaryFiles(String path,
int minPartitions)
Get an RDD for a Hadoop-readable dataset as PortableDataStream for each file (useful for binary data)
For example, if you have the following files:
hdfs://a-hdfs-path/part-00000
hdfs://a-hdfs-path/part-00001
...
hdfs://a-hdfs-path/part-nnnnn
Do val rdd = sparkContext.binaryFiles("hdfs://a-hdfs-path"),
then rdd contains
(a-hdfs-path/part-00000, its content)
(a-hdfs-path/part-00001, its content)
...
(a-hdfs-path/part-nnnnn, its content)
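Putting that together for your tar.gz case, here is a rough Java sketch rather than a drop-in solution: the glob path and the entry name are assumptions, and it uses Apache Commons Compress (my choice) to walk the tar stream and read only the entry you care about:

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.input.PortableDataStream;

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class TarGzEntryReader {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("targz-entry-reader"));

        // One (archivePath, stream) pair per archive; archives are never split across tasks.
        JavaPairRDD<String, PortableDataStream> archives =
                sc.binaryFiles("hdfs://a-hdfs-path/*.tar.gz");        // path is an assumption

        // For each archive, skip to the single entry we care about and process it line by line.
        JavaPairRDD<String, Long> lineCounts = archives.mapValues(stream -> {
            long lines = 0;
            try (TarArchiveInputStream tar = new TarArchiveInputStream(
                    new GzipCompressorInputStream(stream.open()))) {
                TarArchiveEntry entry;
                while ((entry = tar.getNextTarEntry()) != null) {
                    if (!entry.getName().endsWith("interesting-file.txt")) {  // entry name is an assumption
                        continue;                                             // skip the other files
                    }
                    BufferedReader reader = new BufferedReader(new InputStreamReader(tar));
                    while (reader.readLine() != null) {
                        lines++;                       // real per-line processing would go here
                    }
                    break;
                }
            }
            return lines;
        });

        lineCounts.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2() + " lines"));
        sc.stop();
    }
}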
Also, check this
Hey, this is more of a Java question, but it is related to Hadoop.
I have this line of code in my MapReduce Java job:
JobConf conf= new JobConf(WordCount.class);
conf.setJobName("Word Count");
.............
.............
.............
FileInputFormat.addInputPath(conf, new Path(args[0]));
Instead of "giving" a directory with many files, how do I set a specific file name?
From the book "Hadoop: The Definitive Guide":
An input path is specified by calling the static addInputPath() method
on FileInputFormat, and it can be a single file, a directory (in which
case the input forms all the files in that directory), or a file
pattern. As the name suggests, addInputPath() can be called more than
once to use input from multiple paths.
So to answer your question, you should be able to just pass the path of your specific single file, and it will be used as the only input (as long as you do not make further calls to addInputPath() with other paths).
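For instance (the file name here is only an illustration):

// Point the job at one specific file instead of a whole directory.
FileInputFormat.addInputPath(conf, new Path("/user/data/input/part-00042.txt"));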
If you only want to do map-reduce work on one file, a quick and easy workaround is to move that file into a folder by itself and then provide that folder's path to addInputPath.
If you're trying to read a whole file per map task then might I suggest taking a look at this post:
Reading file as single record in hadoop
What exactly are you trying to do?
I would have posted this as a comment, but apparently I don't have sufficient privileges...
I am trying to run a graph verifier app on a distributed system using Hadoop.
I have the input in the following format:
Directory1
---file1.dot
---file2.dot
…..
---filen.dot
Directory2
---file1.dot
---file2.dot
…..
---filen.dot
Directory670
---file1.dot
---file2.dot
…..
---filen.dot
The .dot files store the graphs.
Is it enough for me to add the input directory paths using FileInputFormat.addInputPath()?
I want Hadoop to process the contents of each directory on the same node, because the files in a directory contain data that depends on the other files in that same directory.
Will the Hadoop framework take care of distributing the directories evenly across the nodes of the cluster (e.g. directory 1 to node 1, directory 2 to node 2, and so on) and process them in parallel?
The files in each directory depend on each other for data. To be precise, each directory contains a file main.dot that holds an acyclic graph whose vertices are the names of the other files; my verifier traverses each vertex of the graph in main.dot, looks for the file with the same name in the same directory and, if found, processes the data in that file. All the files are processed in the same way, and the combined output after processing each file in the directory is displayed; the same procedure goes for the rest of the directories.
To cut a long story short:
As in the famous word-count application (if the input is a single book), Hadoop splits the input and distributes the work to each node of the cluster, where the mapper processes each line and counts the relevant words.
How can I split the task here (and do I need to split it at all, by the way)?
How can I leverage Hadoop's power for this scenario? Some sample code or a template would certainly help. :)
The solution given by Alexey Shestakov will work, but it does not leverage MapReduce's distributed processing framework: probably only one map process will read the file (the file containing the paths of all inputs) and then process the input data.
How can we allocate all the files in a directory to a single mapper, so that the number of mappers equals the number of directories?
One solution could be the "org.apache.hadoop.mapred.lib.MultipleInputs" class:
use MultipleInputs.addInputPath() to add each directory path along with the map class for that directory. Each mapper then gets one directory and processes all the files within it.
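A rough sketch of that driver setup (the driver and per-directory mapper classes are hypothetical placeholders, not existing classes):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;

// Inside the job driver: one addInputPath() call per directory, each with its own mapper class.
JobConf conf = new JobConf(GraphVerifierDriver.class);              // hypothetical driver class
MultipleInputs.addInputPath(conf, new Path("/input/Directory1"),
        TextInputFormat.class, Directory1Mapper.class);             // hypothetical mapper
MultipleInputs.addInputPath(conf, new Path("/input/Directory2"),
        TextInputFormat.class, Directory2Mapper.class);             // hypothetical mapper
// ...and so on, one call per directory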
You can create a file with list of all directories to process:
/path/to/directory1
/path/to/directory2
/path/to/directory3
Each mapper would process one directory, for example:
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    // 'value' is one line of the listing file, i.e. the path of a single directory
    FileSystem fs = FileSystem.get(context.getConfiguration());
    for (FileStatus status : fs.listStatus(new Path(value.toString()))) {
        // process file, e.g. open it with fs.open(status.getPath())
    }
}
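One way to make sure each mapper really receives exactly one line of that listing file (i.e. one directory) is NLineInputFormat. A rough driver sketch, with placeholder class names and paths (classes from the org.apache.hadoop.mapreduce packages):

Configuration config = new Configuration();
Job job = Job.getInstance(config, "per-directory-processing");
job.setJarByClass(DirectoryListDriver.class);        // hypothetical driver class
job.setMapperClass(DirectoryMapper.class);           // the mapper with the map() shown above
job.setInputFormatClass(NLineInputFormat.class);     // splits the listing file by lines
NLineInputFormat.setNumLinesPerSplit(job, 1);        // 1 line per split -> one directory per mapper
FileInputFormat.addInputPath(job, new Path("/path/to/directory-list.txt"));
FileOutputFormat.setOutputPath(job, new Path("/path/to/output"));
job.setNumReduceTasks(0);                            // map-only job
System.exit(job.waitForCompletion(true) ? 0 : 1);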
Will the Hadoop framework take care of distributing the directories evenly across the nodes of the cluster (e.g. directory 1 to node 1, directory 2 to node 2, and so on) and process them in parallel?
No, it won't. Files are not distributed to each node in the sense that the files are copied to the node to be processed. Instead, to put it simply, each node is given a set of file paths to process, with no guarantee on location or data locality. The task running on that node then pulls those files from HDFS and processes them.
There's no reason why you can't just open other files you may need directly from HDFS.
The Pig script outputs a few part files (part-m-00000, part-m-00001, etc.) along with .pig_header and .pig_schema, and I am trying to join them into one output CSV.
I tried to use hadoop getmerge:
hadoop fs -getmerge ./output output.csv
but the .pig_schema file gets merged in as well, so the result looks something like this:
header1,header2,header3
{"fields":[{"name": "header1", "type":...}]}
value1,value2,value3
How do I join them correctly without the .pig_schema included?
Thanks!
Use a file glob: hadoop fs -getmerge ./output/part* output.csv