The Pig script outputs a few part files (part-m-00000, part-m-00001, etc.) along with .pig_header and .pig_schema files, and I am trying to join them into one output CSV.
I tried to use the Hadoop merge command:
hadoop fs -getmerge ./output output.csv
but the .pig_schema file gets merged in as well, so the result looks something like
header1,header2,header3
{"fields":[{"name": "header1", "type":...}]}
value1,value2,value3
How do I join them correctly without the .pig_schema included?
Thanks!
Use a fileglob: hadoop fs -getmerge ./output/part* output.csv
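If you'd rather do the merge programmatically, here is a rough Java sketch of my own (not part of the answer above) that globs only the part files with the Hadoop FileSystem API and concatenates them into a local CSV; the paths and class name are placeholders:

import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class MergePartFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        try (OutputStream out = Files.newOutputStream(Paths.get("output.csv"))) {
            // Only part-* files match, so .pig_header and .pig_schema are skipped.
            FileStatus[] parts = fs.globStatus(new Path("output/part-*"));
            if (parts == null) {
                return; // nothing matched the glob
            }
            for (FileStatus status : parts) {
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.copyBytes(in, out, conf, false); // false: keep 'out' open
                }
            }
        }
    }
}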
I have a bunch of tar.gz files which I would like to process with Spark without decompressing them.
A single archive is about ~700MB and contains 10 different files but I'm interested only in one of them (which is ~7GB after decompression).
I know that context.textFile supports tar.gz, but I'm not sure whether it is the right tool when an archive contains more than one file. What happens is that Spark returns the content of all files in the archive (line by line), including the file names, with some binary data.
Is there any way to select which file from the tar.gz I would like to map?
AFAIK, I'd suggest the sc.binaryFiles method; please see the doc below. Since the file name and the file content are both present, you can map over it, pick the file you want, and process that.
public RDD<scala.Tuple2<String, PortableDataStream>> binaryFiles(String path,
                                                                 int minPartitions)
Get an RDD for a Hadoop-readable dataset as PortableDataStream for each file (useful for binary data)
For example, if you have the following files:
hdfs://a-hdfs-path/part-00000
hdfs://a-hdfs-path/part-00001
...
hdfs://a-hdfs-path/part-nnnnn
Do val rdd = sparkContext.binaryFiles("hdfs://a-hdfs-path"),
then rdd contains
(a-hdfs-path/part-00000, its content)
(a-hdfs-path/part-00001, its content)
...
(a-hdfs-path/part-nnnnn, its content)
Also, check this
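To actually pick just one entry out of each archive, here is a rough Java sketch of my own (not from the answer above) built on JavaSparkContext.binaryFiles plus Apache Commons Compress. It assumes Spark 2.x (Iterator-returning flatMap); the archive path and entry name are placeholders, and it buffers the selected entry's lines inside a single task, which may be heavy for a ~7GB file:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.input.PortableDataStream;

public class TarGzSelect {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("tar-gz-select"));

        // binaryFiles gives one (path, stream) pair per archive; each archive is
        // handled whole by a single task.
        JavaRDD<String> lines = sc.binaryFiles("hdfs:///archives/*.tar.gz")
            .flatMap(pair -> {
                PortableDataStream pds = pair._2();
                List<String> out = new ArrayList<>();
                try (TarArchiveInputStream tar =
                         new TarArchiveInputStream(new GzipCompressorInputStream(pds.open()))) {
                    TarArchiveEntry entry;
                    while ((entry = tar.getNextTarEntry()) != null) {
                        // Keep only the one file we care about (hypothetical name).
                        if (!entry.getName().endsWith("interesting.csv")) {
                            continue;
                        }
                        BufferedReader reader = new BufferedReader(new InputStreamReader(tar));
                        String line;
                        while ((line = reader.readLine()) != null) {
                            out.add(line);
                        }
                    }
                }
                return out.iterator();
            });

        System.out.println(lines.count());
        sc.stop();
    }
}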
My Task:
1 - Read an Excel file, parse it, and save the data in a txt file.
2 - Read that txt file and pass it to a batch.sh file. This batch.sh file does one thing: it picks up the data from the above-mentioned txt file and saves it in a database table.
When I run the batch.sh file from the terminal (and give it the txt file), it works fine. It inserts the records into the database just as I want.
The problem is, when I try to do the same from Java code, the batch.sh file does not work, and no exception is thrown.
Some Explanation: I am using Java 7, Oracle, SQL-Loader, Linux.
Additional Information: If I rename the batch.sh file to batch.bat and run it in a Windows environment, it works perfectly fine. The batch.sh file also works fine when executed from the terminal. The only problem is that it does not work from the Java code.
I am listing the Java code snippet, the batch.sh file, and the control.txt file for this task below.
Java Code:
try {
    String target = new String("/path/to/batchFile/batch.sh");
    Runtime rt = Runtime.getRuntime();
    Process proc = rt.exec(target);
    proc.waitFor();
} catch (Exception e) {
    LOG.error("Exception Occured with Message : " + e.getMessage());
}
Batch.sh:
sqlldr username/password#sid control='/path/to/batchFile/control.txt' log='/path/to/batchFile/Results.log' bad='/path/to/batchFile/BadFile.bad' ERRORS=5000
control.txt:
options ( skip=1 )
load data
infile '/path/to/batchFile/TxtFileData.txt'
truncate into table TABLE_NAME
fields terminated by ","
(
column_name "replace(:column_name,'-','')"
)
P.S.: I have read many posts regarding the same issue and tried every solution, but none is working. The current Java code example is taken from another StackOverflow thread.
Any help will be highly appreciated.
You'd want Runtime.getRuntime().exec(new String[]{"/bin/bash", "-c", "/path/to/batchFile/batch.sh"})
See here: How to execute command with parameters?
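If it still fails silently, a slightly fuller sketch (my own, not part of the answer) that uses ProcessBuilder, surfaces the script's output, and checks the exit code can help diagnose the problem; the paths are the ones from the question:

import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;

public class RunBatch {
    public static void main(String[] args) throws Exception {
        ProcessBuilder pb = new ProcessBuilder("/bin/bash", "/path/to/batchFile/batch.sh");
        // Merge stderr into stdout so nothing is lost, and set the working directory
        // in case the script relies on relative paths.
        pb.redirectErrorStream(true);
        pb.directory(new File("/path/to/batchFile"));

        Process proc = pb.start();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(proc.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // sqlldr output, useful for debugging
            }
        }
        int exitCode = proc.waitFor();
        System.out.println("batch.sh exited with code " + exitCode);
    }
}

A non-zero exit code or an error line from sqlldr (for example, a PATH or permissions issue) usually explains why the terminal run works while the Java run does not.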
Let's say I have this structure on HDFS:
/dir1
    /dir2
        /Name1_2015/
            file1.lzo
            file2.lzo
            file3.lzo
        /Name2_2015
            file1.lzo
            file2.lzo
    Name1_2015.lzo
I would like to merge all the files of each directory in 'dir2' and append the result to /dir1/<DirName>.lzo.
For example, for /dir1/dir2/Name1_2015, I want to merge file1.lzo, file2.lzo, and file3.lzo and append the result to /dir1/Name1_2015.lzo.
Each file is LZO compressed.
How can I do it?
Thanks
If you don't care much about parallelism here's a bash one-liner:
for d in `hdfs dfs -ls /dir2 | grep -oP '(?<=/)[^/]+$'` ; do hdfs dfs -cat /dir2/$d/*.lzo | lzop -d | lzop | hdfs dfs -put - /dir1/$d.lzo ; done
You can extract all the files in parallel using MapReduce. But how do you create one archive from multiple files in parallel? As far as I know, it is not possible to write to a single HDFS file from multiple processes concurrently. Since that is not possible, we end up with a single-node solution anyway.
I would do this with Hive, as follows:
Rename the subdirectories to name=1_2015 and name=2_2015
CREATE EXTERNAL TABLE sending_table
(
    all_content string
)
PARTITIONED BY (name string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY {a column delimiter that you know doesn't show up in any of the lines}
LOCATION "/dir1/dir2";
Make a second table that looks like the first, named "receiving", but with no partitions, and in a different directory.
Run this:
SET mapreduce.job.reduces=1;  -- this guarantees it'll make one file
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;

insert into table receiving
select all_content from sending_table;
You can try to archive all the individual LZO files into a HAR (Hadoop Archive). I think it's overhead to merge all the files into a single LZO.
Hey, this is more of a Java question, but it is related to Hadoop.
I have this line of code in my MapReduce Java job:
JobConf conf= new JobConf(WordCount.class);
conf.setJobName("Word Count");
.............
.............
.............
FileInputFormat.addInputPath(conf, new Path(args[0]));
Instead of "giving" a directory with many files, how do I set a specific file name?
From the book "Hadoop: The Definitive Guide":
An input path is specified by calling the static addInputPath() method
on FileInputFormat, and it can be a single file, a directory (in which
case the input forms all the files in that directory), or a file
pattern. As the name suggests, addInputPath() can be called more than
once to use input from multiple paths.
So to answer your question, you should be able to just pass the path to your specific single file, and it will be used as the only input (as long as you do not make additional addInputPath() calls with other paths).
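For example, with the JobConf-based API from the question, pointing the job at one file could look like the sketch below (my own illustration; the paths and class name are placeholders, and the mapper/reducer setup is elided as in the question):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SingleFileInputExample {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SingleFileInputExample.class);
        conf.setJobName("Word Count");
        // ... mapper/reducer/output-class setup as in the question ...

        // A single file works just as well as a directory.
        FileInputFormat.addInputPath(conf, new Path("/data/input/part-00000"));
        // addInputPath() can be called again to add further files or directories.
        FileInputFormat.addInputPath(conf, new Path("/data/other/more-input.txt"));

        FileOutputFormat.setOutputPath(conf, new Path("/data/output"));
        JobClient.runJob(conf);
    }
}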
If you only want to do map-reduce work on one file, a quick and easy workaround is to move that file into a folder by itself and then provide that folder's path to your addInputPath.
If you're trying to read a whole file per map task then might I suggest taking a look at this post:
Reading file as single record in hadoop
What exactly are you trying to do?
I would have posted this as a comment, but apparently I don't have sufficient privileges...
I'm using Cloudera Hadoop. I'm able to run a simple MapReduce program where I provide a file as input to the MapReduce program.
This file contains the list of all the other files to be processed by the mapper function.
But, I'm stuck at one point.
/folder1
- file1.txt
- file2.txt
- file3.txt
How can I specify the input path to the MapReduce program as "/folder1", so that it can start processing each file inside that directory?
Any ideas ?
EDIT:
1) Initially, I provided inputFile.txt as input to the MapReduce program. It was working perfectly.
>inputFile.txt
file1.txt
file2.txt
file3.txt
2) But now, instead of giving an input file, I want to provide an input directory as arg[0] on the command line.
hadoop jar ABC.jar /folder1 /output
The problem is that FileInputFormat doesn't read files recursively in the input path directory.
Solution: Use the following code. Add
FileInputFormat.setInputDirRecursive(job, true);
before this line in your MapReduce code:
FileInputFormat.addInputPath(job, new Path(args[0]));
You can check here for which version it was fixed.
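A minimal driver sketch of my own (not from the answer; class names and paths are placeholders) using the newer mapreduce API with recursive input enabled:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FolderInputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "process folder1");
        job.setJarByClass(FolderInputDriver.class);

        // Hypothetical mapper/reducer classes; plug in your own.
        // job.setMapperClass(MyMapper.class);
        // job.setReducerClass(MyReducer.class);

        // Pick up files in subdirectories of the input path as well.
        FileInputFormat.setInputDirRecursive(job, true);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /folder1
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}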
You could use FileSystem.listStatus to get the file list from the given directory; the code could be as below:
// get the FileSystem, you will need to initialize it properly
FileSystem fs = FileSystem.get(conf);
// get the FileStatus list from the given dir
FileStatus[] status_list = fs.listStatus(new Path(args[0]));
if (status_list != null) {
    for (FileStatus status : status_list) {
        // add each file to the list of inputs for the map-reduce job
        FileInputFormat.addInputPath(conf, status.getPath());
    }
}
You can use HDFS wildcards in order to provide multiple files.
So, the solution:
hadoop jar ABC.jar /folder1/* /output
or
hadoop jar ABC.jar /folder1/*.txt /output
Use MultipleInputs class.
MultipleInputs.addInputPath(Job job, Path path,
                            Class<? extends InputFormat> inputFormatClass,
                            Class<? extends Mapper> mapperClass)
Have a look at working code
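As a hedged illustration of how that call can be wired up (the file paths, mapper, and driver class here are placeholders of mine, not from the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultipleInputsDriver {

    // Hypothetical mapper; swap in your own logic.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(value, new IntWritable(1));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multiple inputs example");
        job.setJarByClass(MultipleInputsDriver.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // One addInputPath() call per input file, each with its own format and mapper.
        MultipleInputs.addInputPath(job, new Path("/folder1/file1.txt"),
                TextInputFormat.class, PassThroughMapper.class);
        MultipleInputs.addInputPath(job, new Path("/folder1/file2.txt"),
                TextInputFormat.class, PassThroughMapper.class);

        FileOutputFormat.setOutputPath(job, new Path("/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}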