Process small file with MapReduce in Hadoop - Java

I have a 456 KB file which is read from HDFS and given as input to the mapper function. Every line contains an integer, for which I download some files and store them on the local system. I have Hadoop set up on a two-node cluster, and the split size is changed from the program to open 8 mappers:
import org.apache.hadoop.conf.Configuration;

Configuration configuration = new Configuration();
// Force ~60 KB splits so the 456 KB input is divided across 8 mappers
configuration.setLong("mapred.max.split.size", 60000L);
configuration.setLong("mapred.min.split.size", 60000L);
8 mappers are created, but the same data is downloaded on both servers. I think this happens because the block size is still set to the default 256 MB and the input file is processed twice. So my question is: can we process a small file with MapReduce?

If downloading the files takes time, you might be suffering from what's called speculative execution in Hadoop, which is enabled by default. It's just a guess, though, since you said you are getting the same files downloaded more than once.
With speculative execution turned on, the same input can be processed multiple times in parallel to exploit differences in machine capabilities. As most of the tasks in a job come to a close, the Hadoop platform schedules redundant copies of the remaining tasks across several nodes which do not have other work to perform.
You can disable speculative execution for the mappers and reducers by setting the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false, respectively.
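A minimal sketch of turning both options off on the job configuration (assuming the old mapred property names quoted above):
Configuration conf = new Configuration();
// Disable redundant task attempts so each input line is downloaded only once
conf.setBoolean("mapred.map.tasks.speculative.execution", false);
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);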

Related

Cloud Dataflow - Records dropped in streaming job but not in batch job

I have a Dataflow pipeline that takes many files from a GCS bucket, extracts the records, applies some transformations, and finally outputs them into Parquet files. It continuously watches the bucket for new files, making this a streaming pipeline, though for now we have a termination condition that stops the pipeline once 1 minute has passed since the last new file. We are testing with a fixed set of files in the bucket.
I initially ran this pipeline in batch mode (no continuous file watching), and querying the Parquet files in BigQuery showed about 36 million records. However, when I enabled continuous file watching and reran the pipeline, the Parquet files contained only ~760k records. I double-checked that in both runs the input bucket had the same set of files.
The metrics on the streaming job details page do not match up at all with what was output. Going by the Elements added (Approximate) section, ~21 million records (which is wrong) were added to the input collection for the final Parquet-writing step, even though the files contained ~760k records.
The same step on the batch job had the correct number (36 million) for Elements added (Approximate), and that matched the number of records in the output Parquet files.
I haven't seen anything unusual in the logs.
Why is Cloud Dataflow marking the streaming job as Succeeded even though a ton of records were dropped while writing the output?
Why is there an inconsistency in metrics reporting between batch and streaming jobs on Cloud Dataflow with the same input?
For both jobs I have set 3 workers with a machine type of n1-highmem-4. I pretty much reached my quota for the project.
I suspect this might be due to the way you have configured windows and triggers for your streaming pipeline. By default, Beam/Dataflow triggers data when the watermark passes the end of the window, and the default window configuration sets allowed lateness to zero, so any late data will be discarded by the pipeline. To change this behavior you can try setting an allowed-lateness value or a different trigger. See here for more information.
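As a hedged Beam (Java) sketch of such a window configuration; the window size, lateness value, element type, and the records collection are placeholders, not from the question:
// imports assumed from org.apache.beam.sdk.transforms.windowing and org.joda.time
PCollection<String> windowed = records.apply(
    Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
        .withAllowedLateness(Duration.standardMinutes(10))      // keep late records for 10 minutes
        .triggering(AfterWatermark.pastEndOfWindow()
            .withLateFirings(AfterPane.elementCountAtLeast(1))) // re-fire when late data arrives
        .discardingFiredPanes());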

Spark Dataframe Write to CSV creates _temporary directory file in Standalone Cluster Mode

I am running a Spark job on a cluster with 2 worker nodes. I am using the code below (Spark Java) to save the computed DataFrame as CSV to the worker nodes.
dataframe.write().option("header","false").mode(SaveMode.Overwrite).csv(outputDirPath);
I am trying to understand how spark writes multiple part files on each worker node.
Run 1) worker1 has part files and _SUCCESS; worker2 has _temporary/task*/part*, where each task directory holds its own part files.
Run 2) worker1 has part files and also a _temporary directory; worker2 has multiple part files.
Can anyone help me understand this behavior?
1) Should I consider the records in outputDir/_temporary as part of the output, along with the part files in outputDir?
2) Is the _temporary dir supposed to be deleted after the job run, with its part files moved to outputDir?
3) Why can't it create the part files directly under the output dir?
coalesce(1) and repartition(1) are not options, since the outputDir file itself will be around 500 GB.
Spark 2.0.2 and 2.1.3, Java 8, no HDFS.
After analysis, I observed that my Spark job was using FileOutputCommitter version 1, which is the default.
I then configured it to use FileOutputCommitter version 2 instead and tested on a 10-node Spark standalone cluster in AWS. All part-* files were generated directly under the outputDirPath specified in dataframe.write().option("header","false").mode(SaveMode.Overwrite).csv(outputDirPath).
We can set the property in either of two ways:
by including --conf 'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2' in the spark-submit command,
or by setting it on the context: javaSparkContext.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version", "2")
I understand the consequences in case of failures, as outlined in the Spark docs, but I achieved the desired result!
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version (default value: 1): the file output committer algorithm version, valid versions 1 or 2. Version 2 may have better performance, but version 1 may handle failures better in certain situations, as per MAPREDUCE-4815.
TL;DR To properly write (or read, for that matter) data using a file-system-based source, you'll need shared storage.
The _temporary directory is part of the basic commit mechanism used by Spark: data is first written to a temporary directory and, once all tasks have finished, atomically moved to the final destination. You can read more about this process in Spark _temporary creation reason.
For this process to succeed you need a shared file system (HDFS, NFS, and so on) or equivalent distributed storage (like S3). Since you don't have one, the failure to clean up the temporary state is expected; see Saving dataframe to local file system results in empty results.
The behavior you observed (data partially committed and partially not) can occur when some executors are co-located with the driver and share its file system, enabling a full commit for that subset of the data.
The number of part files is based on your DataFrame's partitioning: the number of files written depends on the number of partitions the DataFrame has at the time you write out the data. By default, one file is written per partition.
You can control this with coalesce or repartition; you can decrease or increase the partition count.
If you coalesce to 1, you won't see multiple part files, but this prevents the data from being written in parallel.
// outputDirPath = /tmp/multiple.csv
dataframe
    .coalesce(1)
    .write()
    .option("header", "false")
    .mode(SaveMode.Overwrite)
    .csv(outputDirPath);
On your question of how to refer to the output: refer to /tmp/multiple.csv, which contains all of the parts below.
/tmp/multiple.csv/part-00000.csv
/tmp/multiple.csv/part-00001.csv
/tmp/multiple.csv/part-00002.csv
/tmp/multiple.csv/part-00003.csv

Hadoop MapReduce Out of Memory on Small Files

I'm running a MapReduce job against about 3 million small files on Hadoop (I know, I know, but there's nothing we can do about it - it's the nature of our source system).
Our code is nothing special - it uses CombineFileInputFormat to wrap a bunch of these files together, then parses the file name to add it into the contents of the file, and spits out some results. Easy peasy.
So, we have about 3 million ~7 KB files in HDFS. If we run our task against a small subset of these files (one folder, maybe 10,000 files), we get no trouble. If we run it against the full list of files, we get an out-of-memory error.
The error comes out on STDOUT:
#
# java.lang.OutOfMemoryError: GC overhead limit exceeded
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 15690"...
I'm assuming what's happening is this: whatever JVM is running the process that computes the input splits is getting totally overwhelmed trying to handle 3 million files, it's using too much memory, and YARN is killing it. I'm willing to be corrected on this theory.
So, what I need to know is how to increase the memory limit YARN allows for the container that calculates the input splits, not for the mappers or reducers. Then, I need to know how to make this take effect. (I've Googled pretty extensively on this, but with all the iterations of Hadoop over the years, it's hard to find a solution that works with the most recent versions...)
This is Hadoop 2.6.0, using the MapReduce API, YARN framework, on AWS Elastic MapReduce 4.2.0.
I would spin up a new EMR cluster and throw a larger master instance at it to see if that is the issue.
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.4xlarge InstanceGroupType=CORE,InstanceCount=1,InstanceType=m3.xlarge
If the master is running out of memory when computing the input splits, you can modify the configuration; see EMR Configuration.
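If it is the MapReduce ApplicationMaster rather than the client that runs out of memory while handling the splits, a hedged sketch of the knobs to raise (Hadoop 2.x property names; the values are placeholders):
Configuration conf = new Configuration();
conf.setInt("yarn.app.mapreduce.am.resource.mb", 4096);      // container size for the AM
conf.set("yarn.app.mapreduce.am.command-opts", "-Xmx3276m"); // heap for the AM JVM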
Instead of running MapReduce on 3 million individual files, you can merge them into manageable bigger files using any of the following approaches:
1. Create Hadoop Archive (HAR) files from the small files.
2. Create a sequence file for every 10K-20K files using a MapReduce program (a sketch follows after this list).
3. Create a sequence file from your individual small files using the forqlift tool.
4. Merge your small files into bigger files using Hadoop-Crush.
Once you have the bigger files ready, you can run the MapReduce on your whole data set.
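For approach 2, a minimal sketch that packs a directory of small files into one SequenceFile, with the file name as key and the raw bytes as value (the paths and class name are placeholders):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path input = new Path("/data/small-files"); // placeholder
    Path output = new Path("/data/merged.seq"); // placeholder

    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(output),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class))) {
      for (FileStatus status : fs.listStatus(input)) {
        if (!status.isFile()) continue;
        byte[] contents = new byte[(int) status.getLen()];
        try (FSDataInputStream in = fs.open(status.getPath())) {
          IOUtils.readFully(in, contents, 0, contents.length); // read the whole small file
        }
        writer.append(new Text(status.getPath().getName()), new BytesWritable(contents));
      }
    }
  }
}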

HDFS - load massive number of files

For testing purposes I'm trying to load a massive number of small files into HDFS. We are talking about 1 million (1,000,000) files ranging from 1 KB to 100 KB in size. I generated those files with an R script on a Linux system, all in one folder. Each file has an information structure: a header with product information followed by a varying number of columns of numeric data.
The problem is when I try to upload those local files into HDFS with the command:
hdfs dfs -copyFromLocal /home/user/Documents/smallData /
Then I get one of the following Java heap size errors:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
I use the Cloudera CDH5 distribution with a Java heap size of about 5 GB. Is there another way than increasing this heap size even more? Maybe a better way to load this mass of data into HDFS?
I'm very thankful for every helpful comment!
Even if you increase the memory and manage to store the files in HDFS, you will run into many problems at processing time.
Problems with small files and HDFS
A small file is one which is significantly smaller than the HDFS block size (default 64MB). If you’re storing small files, then you probably have lots of them (otherwise you wouldn’t turn to Hadoop), and the problem is that HDFS can’t handle lots of files.
Every file, directory and block in HDFS is represented as an object in the namenode’s memory, each of which occupies 150 bytes, as a rule of thumb. So 10 million files, each using a block, would use about 3 gigabytes of memory. Scaling up much beyond this level is a problem with current hardware. Certainly a billion files is not feasible.
Furthermore, HDFS is not geared up to efficiently accessing small files: it is primarily designed for streaming access of large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.
Problems with small files and MapReduce
Map tasks usually process a block of input at a time (using the default FileInputFormat). If the file is very small and there are a lot of them, then each map task processes very little input, and there are a lot more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1GB file broken into 16 64MB blocks, and 10,000 or so 100KB files. The 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file.
There are a couple of features to help alleviate the bookkeeping overhead: task JVM reuse for running multiple map tasks in one JVM, thereby avoiding some JVM startup overhead (see the mapred.job.reuse.jvm.num.tasks property), and MultiFileInputSplit which can run more than one split per map.
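As a hedged illustration of the JVM-reuse knob (old mapred API; -1 means reuse the JVM for an unlimited number of tasks):
JobConf jobConf = new JobConf(); // org.apache.hadoop.mapred.JobConf
jobConf.setNumTasksToExecutePerJvm(-1); // equivalent to setting mapred.job.reuse.jvm.num.tasks to -1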
Solution
Hadoop Archives (HAR files)
Create .HAR File
Hadoop Archives (HAR files) were introduced to HDFS in 0.18.0 to alleviate the problem of lots of files putting pressure on the namenode’s memory. HAR files work by building a layered filesystem on top of HDFS. A HAR file is created using the hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files:
hadoop archive -archiveName name -p <parent> <src>* <dest>
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
Sequence Files
The usual response to questions about “the small files problem” is: use a SequenceFile. The idea here is that you use the filename as the key and the file contents as the value. This works very well in practice. Going back to the 10,000 100KB files, you can write a program to put them into a single SequenceFile, and then you can process them in a streaming fashion (directly or using MapReduce) operating on the SequenceFile. There are a couple of bonuses too. SequenceFiles are splittable, so MapReduce can break them into chunks and operate on each chunk independently. They support compression as well, unlike HARs. Block compression is the best option in most cases, since it compresses blocks of several records (rather than per record)
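A minimal sketch of the MapReduce side, assuming a SequenceFile packed with Text filenames as keys and BytesWritable contents as values (the path and job name are placeholders; imports from org.apache.hadoop.mapreduce and its lib.input package assumed):
Job job = Job.getInstance(new Configuration(), "process-packed-files");
job.setInputFormatClass(SequenceFileInputFormat.class); // MapReduce can split the SequenceFile into chunks
FileInputFormat.addInputPath(job, new Path("/data/packed.seq"));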
HBase
If you are producing lots of small files, then, depending on the access pattern, a different type of storage might be more appropriate. HBase stores data in MapFiles (indexed SequenceFiles), and is a good choice if you need to do MapReduce style streaming analyses with the occasional random look up. If latency is an issue, then there are lots of other choices
First of all: if this isn't a stress test on your namenode, it's ill-advised to do this. But I assume you know what you are doing. (Expect slow progress on this.)
If the objective is just to get the files onto HDFS, try doing this in smaller batches or set a higher heap size on your Hadoop client.
You do this, as rpc1 mentioned in his answer, by prefixing HADOOP_HEAPSIZE=<mem in MB here> to your hadoop -put command.
Try increasing HADOOP_HEAPSIZE:
HADOOP_HEAPSIZE=2048 hdfs dfs -copyFromLocal /home/user/Documents/smallData /
look here
The Hadoop Distributed File System is not good with many small files, but it is good with many big files. HDFS keeps a record in a lookup table that points to every file/block in HDFS, and this lookup table is usually loaded into memory. So you should not just increase the Java heap size but also increase the heap size of the namenode inside hadoop-env.sh; these are the defaults:
export HADOOP_HEAPSIZE=1000
export HADOOP_NAMENODE_INIT_HEAPSIZE="1000"
If you are going to do processing on those files, you should expect low performance on the first MapReduce job you run on them (Hadoop creates as many map tasks as there are files/blocks, and this will overload your system unless you use CombineFileInputFormat). I advise you to either merge the files into big files (64 MB / 128 MB) or use another data source (not HDFS).
To solve this problem, I build a single file in a simple format. The content of the file is all the small files. The format looks like this:
<DOC>
<DOCID>1</DOCID>
<DOCNAME>Filename</DOCNAME>
<DOCCONTENT>
Content of file 1
</DOCCONTENT>
</DOC>
This structure can have more or fewer fields, but the idea is the same. For example, I have used this structure:
<DOC>
<DOCID>1</DOCID>
Content of file 1
</DOC>
And it has handled more than six million files.
If you want each file to be processed by a single map task, you can delete the \n characters between the <DOC> and </DOC> tags so that each document sits on one line. After this, you only parse the structure and read the doc identifier and content.

Combining large number of small files for mapreduce input

I am new to Hadoop & MapReduce. We are developing a network monitoring tool (in Java). We collect various information about monitored devices periodically, say every 5 seconds, and write that information to HDFS through a Java client, each piece of information as a new file (since we're not using the HDFS append facility). In HDFS our data organization looks like this:
/monitored_info
/f1.txt
/f2.txt
.......
/f1020010.txt
Thus each file is typically less than 2 KB in size.
I know each map task can take at most one file, so the job will spawn as many map tasks as there are files and be inefficient. To get rid of this we used the merging facility of FileUtil before submitting the job:
FileUtil.copyMerge(fileSystem, new Path("monitored_info"), fileSystem,
new Path("mapInputfile"), false, conf, null);
Is this good practice? Or is there another mechanism used for such requirements? Please help...
Check out Apache Kafka and Apache Flume. You can aggregate logs and move them to your data store with them.
I'd use Flume personally. Easier to use, imho.
If you want to use MapReduce, there are different ways to do that:
Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS
blocks more efficiently, thereby reducing namenode memory usage while still allowing
transparent access to files
The situation is alleviated somewhat by CombineFileInputFormat, which was designed
to work well with small files. Where FileInputFormat creates a split per file,
CombineFileInputFormat packs many files into each split so that each mapper has more
to process. Crucially, CombineFileInputFormat takes node and rack locality into account
when deciding which blocks to place in the same split, so it does not compromise the
speed at which it can process the input in a typical MapReduce job.
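A minimal sketch of wiring that up with the new API; CombineTextInputFormat is the concrete text-oriented subclass, and the path and split cap are placeholders:
Job job = Job.getInstance(new Configuration(), "monitored-info");
job.setInputFormatClass(CombineTextInputFormat.class);
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024); // cap each combined split at ~128 MB
FileInputFormat.addInputPath(job, new Path("/monitored_info"));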
One technique for avoiding the many small files case is to merge small files
into larger files by using a SequenceFile: the keys can act as filenames (or a constant such as NullWritable, if not needed) and the values as file contents.
