Combining a large number of small files for MapReduce input - java

I am new to Hadoop and MapReduce. We are developing a network monitoring tool (in Java). We periodically collect various information about the monitored devices, say every 5 seconds, and write each piece of information to HDFS through a Java client as a new file (since we are not using the HDFS append facility). In HDFS our data organization looks like this:
/monitored_info
    /f1.txt
    /f2.txt
    .......
    /f1020010.txt
Thus each file is typically less than 2 KB in size.
I know that each map task takes at most one file, so the job will spawn as many map tasks as there are files and will be inefficient. To get around this, we used the merging facility of FileUtil before submitting the job:
FileUtil.copyMerge(fileSystem, new Path("monitored_info"), fileSystem,
new Path("mapInputfile"), false, conf, null);
Is this a good practice? Or is there any other mechanism for such requirements? Please help...

Check out Apache Kafka and Apache Flume. You can aggregate logs with them and move the data to your data store.
I'd use Flume personally. Easier to use, imho.

If you want to use MapReduce, there are several ways to do this:
Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS
blocks more efficiently, thereby reducing namenode memory usage while still allowing
transparent access to files.
The situation is alleviated somewhat by CombineFileInputFormat, which was designed
to work well with small files. Where FileInputFormat creates a split per file,
CombineFileInputFormat packs many files into each split so that each mapper has more
to process. Crucially, CombineFileInputFormat takes node and rack locality into account
when deciding which blocks to place in the same split, so it does not compromise the
speed at which it can process the input in a typical MapReduce job.
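As an illustration, a driver can opt into this with just a couple of lines. This is only a minimal sketch against the new org.apache.hadoop.mapreduce API; the 128 MB split cap is an illustrative value, and conf is assumed to be your Configuration:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

Job job = Job.getInstance(conf, "small-files-job");
// pack many small files into each split, up to roughly 128 MB per split
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L);
FileInputFormat.addInputPath(job, new Path("monitored_info"));
// ... set mapper/reducer/output as usual and submit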
One technique for avoiding the many small files case is to merge small files
into larger files by using a SequenceFile: the keys can act as filenames (or a constant such as NullWritable, if not needed) and the values as file contents.
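For the original question above, that merge could be done on the client before submitting the job. The following is only a minimal sketch, assuming the monitored_info layout from the question; the output name monitored_info.seq is a placeholder:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path("monitored_info.seq")),   // placeholder output name
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class))) {
    for (FileStatus st : fs.listStatus(new Path("monitored_info"))) {
        byte[] buf = new byte[(int) st.getLen()];
        try (FSDataInputStream in = fs.open(st.getPath())) {
            in.readFully(buf);                                       // each file is only ~2 KB
        }
        // key = original file name, value = raw file contents
        writer.append(new Text(st.getPath().getName()), new BytesWritable(buf));
    }
}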

Related

Write 1 million rows of CSV into S3 by batches

I'm trying to build a very large CSV file on S3.
I want to build this file on S3.
I want to append rows to this file in batches.
The number of rows could be anywhere between 10k and 1M.
The size of each batch could be < 5 MB (so multipart upload is not feasible).
What would be the right way of accomplishing something like this?
Traditionally in Big Data processing ("Data Lakes"), information related to a single table is stored in a directory rather than a single file. So, appending information to a table is as simple as adding another file to the directory. All files within the directory need to have the same schema (such as CSV columns, or JSON data).
The directory of files can then be used with tools such as:
Spark, Hive and Presto on Hadoop
Amazon Athena
Amazon Redshift Spectrum
A benefit of this method is that the above systems can process multiple files in parallel rather than being restricted to processing a single file in a single-threaded method.
Also common is to compress the files using technologies like gzip. This lowers storage requirements and makes it faster to read data from disk. Adding additional files is easy (just add another csv.gz file) rather than having to unzip, append and re-zip a file.
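For example, each batch could be compressed and written as its own object under a common prefix. This is only a hedged sketch using the AWS SDK for Java v1; the bucket name, key prefix, and helper method are made up for illustration:
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.GZIPOutputStream;

AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();

void writeBatch(List<String> csvRows, int batchNo) throws IOException {
    File tmp = File.createTempFile("batch-" + batchNo + "-", ".csv.gz");
    try (Writer w = new OutputStreamWriter(
            new GZIPOutputStream(new FileOutputStream(tmp)), StandardCharsets.UTF_8)) {
        for (String row : csvRows) {
            w.write(row);
            w.write('\n');
        }
    }
    // one object per batch; query engines read the whole prefix as a single "table"
    s3.putObject("my-bucket", "my_table/batch-" + batchNo + ".csv.gz", tmp);
    tmp.delete();
}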
Bottom line: It would be advisable to re-think your requirements for "one great big CSV file".
'One big file' isn't going to work for you. You can't append rows to an S3 object without first downloading the entire file, adding the rows, and then uploading the new file over the old one. For small files this will work, but as the file gets larger the bandwidth and processing will grow geometrically, and it may get very slow and possibly expensive.
You're better off refactoring your design to work with lots of little files instead of one big one.
Leave a 5 MB garbage object sitting on S3 and do a multipart concatenation with it, where part 1 = the 5 MB garbage object and part 2 = the file you want to upload and concatenate. Keep repeating this for each fragment, and finally use a ranged copy to strip out the 5 MB garbage.

Hadoop MapReduce Out of Memory on Small Files

I'm running a MapReduce job against about 3 million small files on Hadoop (I know, I know, but there's nothing we can do about it - it's the nature of our source system).
Our code is nothing special - it uses CombineFileInputFormat to wrap a bunch of these files together, then parses the file name to add it into the contents of the file, and spits out some results. Easy peasy.
So, we have about 3 million ~7kb files in HDFS. If we run our task against a small subset of these files (one folder, maybe 10,000 files), we get no trouble. If we run it against the full list of files, we get an out of memory error.
The error comes out on STDOUT:
#
# java.lang.OutOfMemoryError: GC overhead limit exceeded
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 15690"...
I'm assuming what's happening is this - whatever JVM is running the process that defines the input splits is getting totally overwhelmed trying to handle 3 million files, it's using too much memory, and YARN is killing it. I'm willing to be corrected on this theory.
So, what I need to know is how to increase the memory limit for the YARN container that's calculating the input splits, not for the mappers or reducers. Then, I need to know how to make this take effect. (I've Googled pretty extensively on this, but with all the iterations of Hadoop over the years, it's hard to find a solution that works with the most recent versions...)
This is Hadoop 2.6.0, using the MapReduce API, YARN framework, on AWS Elastic MapReduce 4.2.0.
I would spin up a new EMR cluster and throw a larger master instance at it to see if that is the issue.
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.4xlarge InstanceGroupType=CORE,InstanceCount=1,InstanceType=m3.xlarge
If the master is running out of memory when configuring the input splits, you can modify the configuration:
EMR Configuration
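For example, assuming the splits are computed in the job-submitting JVM on the master (and possibly in the MapReduce application master), one option is to raise those heaps at submit time. This is only a sketch: the jar and class names below are placeholders, and the -D flags only take effect if the driver uses ToolRunner/GenericOptionsParser:
HADOOP_CLIENT_OPTS="-Xmx8g" hadoop jar my-job.jar com.example.MyDriver \
    -D yarn.app.mapreduce.am.resource.mb=6144 \
    -D yarn.app.mapreduce.am.command-opts=-Xmx5120m \
    /input /output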
Instead of running the MapReduce on 3 million individual files, you can merge them into manageable bigger files using any of the following approaches.
1. Create Hadoop Archive (HAR) files from the small files.
2. Create a sequence file for every 10K-20K files using a MapReduce program.
3. Create a sequence file from your individual small files using the forqlift tool.
4. Merge your small files into bigger files using Hadoop-Crush.
Once you have the bigger files ready, you can run the MapReduce on your whole data set.

HDFS - load mass amount of files

For testing purposes I'm trying to load a massive amount of small files into HDFS. We are actually talking about 1 million (1'000'000) files ranging in size from 1 KB to 100 KB. I generated those files with an R script on a Linux system, all in one folder. Every file has an information structure consisting of a header with product information and a varying number of columns with numeric information.
The problem is that when I try to upload those local files into HDFS with the command:
hdfs dfs -copyFromLocal /home/user/Documents/smallData /
Then I get one of the following Java heap errors:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
I use the Cloudera CDH5 distribution with a Java heap size of about 5 GB. Is there another way than increasing this Java heap size even more? Maybe a better way to load this mass amount of data into HDFS?
I'm very thankful for every helpful comment!
Even if you increase the memory and manage to store the files in HDFS, you will run into many problems at processing time.
Problems with small files and HDFS
A small file is one which is significantly smaller than the HDFS block size (default 64MB). If you’re storing small files, then you probably have lots of them (otherwise you wouldn’t turn to Hadoop), and the problem is that HDFS can’t handle lots of files.
Every file, directory and block in HDFS is represented as an object in the namenode’s memory, each of which occupies 150 bytes, as a rule of thumb. So 10 million files, each using a block, would use about 3 gigabytes of memory. Scaling up much beyond this level is a problem with current hardware. Certainly a billion files is not feasible.
Furthermore, HDFS is not geared up to efficiently accessing small files: it is primarily designed for streaming access of large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.
Problems with small files and MapReduce
Map tasks usually process a block of input at a time (using the default FileInputFormat). If the file is very small and there are a lot of them, then each map task processes very little input, and there are a lot more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1GB file broken into 16 64MB blocks, and 10,000 or so 100KB files. The 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file.
There are a couple of features to help alleviate the bookkeeping overhead: task JVM reuse for running multiple map tasks in one JVM, thereby avoiding some JVM startup overhead (see the mapred.job.reuse.jvm.num.tasks property), and MultiFileInputSplit which can run more than one split per map.
Solution
Hadoop Archives (HAR files)
Create .HAR File
Hadoop Archives (HAR files) were introduced to HDFS in 0.18.0 to alleviate the problem of lots of files putting pressure on the namenode’s memory. HAR files work by building a layered filesystem on top of HDFS. A HAR file is created using the hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files.
hadoop archive -archiveName name -p <parent> <src>* <dest>
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
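The archive stays transparently readable afterwards; it can be listed, or passed as a job input path, through the har:// filesystem (paths here reuse the example above):
hadoop fs -ls har:///user/zoo/foo.har/dir1
FileInputFormat.addInputPath(job, new Path("har:///user/zoo/foo.har/dir1"));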
Sequence Files
The usual response to questions about “the small files problem” is: use a SequenceFile. The idea here is that you use the filename as the key and the file contents as the value. This works very well in practice. Going back to the 10,000 100KB files, you can write a program to put them into a single SequenceFile, and then you can process them in a streaming fashion (directly or using MapReduce) operating on the SequenceFile. There are a couple of bonuses too. SequenceFiles are splittable, so MapReduce can break them into chunks and operate on each chunk independently. They support compression as well, unlike HARs. Block compression is the best option in most cases, since it compresses blocks of several records (rather than per record).
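To sketch the consuming side (class names and output types here are illustrative, assuming Text keys holding filenames and BytesWritable values holding file contents, and that job is your org.apache.hadoop.mapreduce.Job):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

job.setInputFormatClass(SequenceFileInputFormat.class);
job.setMapperClass(SmallFileMapper.class);

public static class SmallFileMapper extends Mapper<Text, BytesWritable, Text, IntWritable> {
    @Override
    protected void map(Text fileName, BytesWritable contents, Context context)
            throws IOException, InterruptedException {
        // each call sees one original small file: its name as the key, its bytes as the value
        String body = new String(contents.getBytes(), 0, contents.getLength(), StandardCharsets.UTF_8);
        context.write(fileName, new IntWritable(body.length()));    // e.g. emit (file, character count)
    }
}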
HBase
If you are producing lots of small files, then, depending on the access pattern, a different type of storage might be more appropriate. HBase stores data in MapFiles (indexed SequenceFiles), and is a good choice if you need to do MapReduce-style streaming analyses with the occasional random lookup. If latency is an issue, then there are lots of other choices.
First of all: if this isn't a stress test on your namenode, it's ill-advised to do this. But I assume you know what you are doing. (Expect slow progress on this.)
If the objective is just to get the files onto HDFS, try doing this in smaller batches or set a higher heap size on your Hadoop client.
You do this, as rpc1 mentioned in his answer, by prefixing HADOOP_HEAPSIZE=<mem in Mb here> to your hadoop -put command.
Try to increase the HEAPSIZE:
HADOOP_HEAPSIZE=2048 hdfs dfs -copyFromLocal /home/user/Documents/smallData
look here
The Hadoop Distributed File System is not good with many small files; it is designed for many big files. HDFS keeps a record in a lookup table that points to every file/block in HDFS, and this lookup table is usually loaded in memory. So you should not just increase the Java heap size but also increase the heap size of the namenode inside hadoop-env.sh; these are the defaults:
export HADOOP_HEAPSIZE=1000
export HADOOP_NAMENODE_INIT_HEAPSIZE="1000"
If you are going to do processing on those files, you should expect low performance on the first MapReduce job you run on them (Hadoop creates as many map tasks as there are files/blocks, and this will overload your system unless you use CombineFileInputFormat). I advise you either to merge the files into big files (64 MB/128 MB) or to use another data source (not HDFS).
To solve this problem, I build a single file with a custom format. The contents of this file are all the small files. The format looks like this:
<DOC>
<DOCID>1</DOCID>
<DOCNAME>Filename</DOCNAME>
<DOCCONTENT>
Content of file 1
</DOCCONTENT>
</DOC>
This structure could have more or fewer fields, but the idea is the same. For example, I have used this structure:
<DOC>
<DOCID>1</DOCID>
Content of file 1
</DOC>
And it has handled more than six million files.
If you want each original file to be processed as a single map input record, you can delete the \n characters between the <DOC> and </DOC> tags. After that, you only need to parse the structure to get the document identifier and the content.
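To illustrate that last step, once each document sits on a single line, the ordinary TextInputFormat hands one document per map() call and a simple regex can recover the pieces. A rough sketch (tag names follow the example above; the parsing is deliberately naive):
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class DocMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final Pattern DOC = Pattern.compile("<DOC><DOCID>(.*?)</DOCID>(.*)</DOC>");

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        Matcher m = DOC.matcher(line.toString());
        if (m.matches()) {
            // group(1) = document identifier, group(2) = document content
            context.write(new Text(m.group(1)), new Text(m.group(2)));
        }
    }
}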

Processing large set of small files with Hadoop

I am using the Hadoop example program WordCount to process a large set of small files/web pages (ca. 2-3 kB). Since this is far from the optimal file size for Hadoop, the program is very slow. I guess it is because the cost of setting up and tearing down the job is far greater than the job itself. Such small files also cause depletion of the namespace for file names.
I read that in this case I should use an HDFS archive (HAR), but I am not sure how to modify this WordCount program to read from these archives. Can the program continue to work without modification, or is some modification necessary?
Even if I pack a lot of files into archives, the question remains whether this will improve performance. I read that even if I pack multiple files, the files inside one archive will not be processed by one mapper but by many, which in my case (I guess) will not improve performance.
If this question is too simple, please understand that I am a newbie to Hadoop and have very little experience with it.
Using HDFS won't change the fact that you are making Hadoop handle a large quantity of small files. The best option in this case is probably to cat the files into a single file (or a few large files).
This will reduce the number of mappers you have, which will reduce the number of things required to be processed.
Using HDFS can improve performance if you are operating on a distributed system. If you are only running in pseudo-distributed mode (one machine), then HDFS isn't going to improve performance; the limitation is the machine.
When you operate on a large number of small files, that requires a large number of mappers and reducers. The setup/teardown can be comparable to the processing time of the file itself, causing a large overhead. cat-ing the files should reduce the number of mappers Hadoop runs for the job, which should improve performance.
The benefit you could see from using the HDFS to store the files would be in distributed mode, with multiple machines. The files would be stored in blocks (default 64MB) across machines and each machine would be capable of processing a block of data that resides on the machine. This reduces network bandwidth use so it doesn't become a bottleneck in processing.
Archiving the files, if Hadoop is going to unarchive them anyway, will just result in Hadoop still having a large number of small files.
Hope this helps your understanding.
From my still limited understanding of Hadoop, I believe the right solution would be to create SequenceFile(s) containing your HTML files as values and possibly the URLs as keys. If you do an M/R job over the SequenceFile(s), each mapper will process many files (depending on the split size). Each file will be presented to the map function as a single input.
You may want to use SequenceFileAsTextInputFormat as the InputFormat to read these files.
Also see: Providing several non-textual files to a single map in Hadoop MapReduce
I bookmarked this article recently to read later and found the same question here :) The entry is a bit old, so I'm not exactly sure how relevant it still is. Changes to Hadoop are happening at a very rapid pace.
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
The blog entry is by Tom White, who is also the author of "Hadoop: The Definitive Guide, Second Edition", a recommended read for those who are getting started with Hadoop.
http://oreilly.com/catalog/0636920010388
Can you concatenate files before submitting them to Hadoop?
CombineFileInputFormat can be used in this case; it works well for a large number of small files. It packs many such files into a single split, so each mapper has more to process (1 split = 1 map task).
The overall MapReduce processing time will also fall, since fewer mappers are running.
Since there is no archive-aware InputFormat, using CombineFileInputFormat will improve performance.

Text file split libraries in Java

My program receives large CSV files and transforms them into XML files. In order to get better performance I would like to split these files into smaller segments of (for example) 500 lines. What Java libraries are available for splitting text files?
I don't understand what you'd gain by splitting the CSV file into smaller ones. With Java, you can read and process the file as you go; you don't have to read it all at once...
What do you intend to do with this data?
If it is just record-by-record processing, then event-oriented (SAX or StAX) parsing is the way to go. For record-by-record processing, an existing "pipeline" toolkit may also be applicable.
You can pre-process your file with a splitter function like this one or this Splitter.java.
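For reference, a split like that does not really need a library; a short utility in plain Java is enough. This is only a sketch (the file naming scheme is arbitrary, and it ignores quoted CSV fields that contain embedded newlines):
import java.io.*;
import java.nio.charset.StandardCharsets;

public static void split(File input, int linesPerChunk) throws IOException {
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(input), StandardCharsets.UTF_8))) {
        PrintWriter out = null;
        String line;
        int lineNo = 0, chunkNo = 0;
        while ((line = reader.readLine()) != null) {
            if (lineNo++ % linesPerChunk == 0) {        // start a new segment every N lines
                if (out != null) out.close();
                out = new PrintWriter(new File(input.getParent(),
                        input.getName() + ".part" + chunkNo++), "UTF-8");
            }
            out.println(line);
        }
        if (out != null) out.close();
    }
}
Splitting into 500-line segments would then be a call like split(new File("data.csv"), 500).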
How are you planning on distributing the work once the files have been split?
I have done something similar to this on a framework called GridGain - it's a grid computing framework which allows you to execute tasks on a grid of computers.
With this in hand you can then use a cache provider such as JBoss Cache to distribute the file to multiple nodes, specify a start and end line number and process. This is outlined in the following GridGain example: http://www.gridgainsystems.com/wiki/display/GG15UG/Affinity+MapReduce+with+JBoss+Cache
Alternatively you could look at something like Hadoop and the Hadoop File System for moving the file between different nodes.
The same concept could be applied on your local machine by loading the file into a cache and then assigning certain "chunks" of the file to be worked on by separate threads. The grid computing stuff really is only for very large problems, or to provide some level of scalability transparently to your solution. You might need to watch out for IO bottlenecks and locks, but a simple thread pool into which you dispatch "jobs" after the file is split could work.
