Processing large set of small files with Hadoop - java

I am using the Hadoop example program WordCount to process a large set of small files/web pages (roughly 2-3 kB each). Since this is far from the optimal file size for Hadoop, the program is very slow. I guess the cost of setting up and tearing down the tasks is far greater than the work itself. Such small files also deplete the namenode's namespace for file names.
I read that in this case I should use a HDFS archive (HAR), but I am not sure how to modify the WordCount program to read from these archives. Can the program continue to work without modification, or is some modification necessary?
Even if I pack a lot of files into archives, the question remains whether this will improve performance. I read that even if I pack multiple files, the files inside one archive will not be processed by one mapper but by many, which in my case (I guess) will not improve performance.
If this question is too simple, please understand that I am a newbie to Hadoop and have very little experience with it.

Using HDFS won't change the fact that you are causing Hadoop to handle a large quantity of small files. The best option in this case is probably to cat the files into a single file (or a few large files).
This will reduce the number of mappers you have, which reduces the number of things that have to be scheduled and processed.
Using HDFS can improve performance if you are operating on a distributed system. If you are only running pseudo-distributed (one machine), then HDFS isn't going to improve performance; the limitation is the machine.
When you are operating on a large number of small files, that requires a large number of mappers and reducers. The setup/teardown of each task can be comparable to the processing time of the file itself, causing a large overhead. Concatenating the files should reduce the number of mappers Hadoop runs for the job, which should improve performance.
The benefit you could see from using the HDFS to store the files would be in distributed mode, with multiple machines. The files would be stored in blocks (default 64MB) across machines and each machine would be capable of processing a block of data that resides on the machine. This reduces network bandwidth use so it doesn't become a bottleneck in processing.
Archiving the files, if Hadoop is going to unarchive them anyway, will just leave Hadoop with a large number of small files.
Hope this helps your understanding.
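To make the concatenation suggestion concrete, here is a minimal Java sketch, assuming the inputs are plain-text files sitting in a single local directory (the directory and output names are just placeholders):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ConcatSmallFiles {
    public static void main(String[] args) throws IOException {
        Path inputDir = Paths.get("/home/user/pages");    // placeholder: directory of small files
        Path merged = Paths.get("/home/user/merged.txt"); // placeholder: single output file

        try (OutputStream out = Files.newOutputStream(merged);
             DirectoryStream<Path> dir = Files.newDirectoryStream(inputDir)) {
            for (Path file : dir) {
                Files.copy(file, out); // append this file's bytes to the merged output
                out.write('\n');       // keep the last line of one file separate from the next
            }
        }
    }
}

The merged file can then be uploaded to HDFS and processed by far fewer map tasks.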

From my still limited understanding of Hadoop, I believe the right solution would be to create SequenceFile(s) containing your HTML files as values and possibly the URL as the key. If you run an M/R job over the SequenceFile(s), each mapper will process many files (depending on the split size). Each file will be presented to the map function as a single input.
You may want to use SequenceFileAsTextInputFormat as the InputFormat to read these files.
Also see: Providing several non-textual files to a single map in Hadoop MapReduce
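A rough sketch of what the job driver could look like with SequenceFileAsTextInputFormat and the new MapReduce API; WordCountMapper and WordCountReducer are placeholders for the example's own map/reduce classes, and note that the mapper's input key type becomes Text (the URL) rather than LongWritable:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileAsTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountOverSequenceFiles {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-seqfile");
        job.setJarByClass(WordCountOverSequenceFiles.class);

        // Keys (e.g. URLs) and values (page contents) arrive in the mapper as Text.
        job.setInputFormatClass(SequenceFileAsTextInputFormat.class);

        job.setMapperClass(WordCountMapper.class);    // placeholder: your existing mapper
        job.setReducerClass(WordCountReducer.class);  // placeholder: your existing reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/hadoop/pages.seq")); // placeholder path
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/wc-out"));  // placeholder path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}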

I bookmarked this article recently to read it later and found the same question here :) The entry is a bit old, not exactly sure how relevant it is now. The changes to Hadoop are happening at a very rapid pace.
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
The blog entry is by Tom White, who is also the author of "Hadoop: The Definitive Guide, Second Edition", a recommended read for those who are getting started with Hadoop.
http://oreilly.com/catalog/0636920010388

Can you concatenate files before submitting them to Hadoop?

CombineFileInputFormat can be used in this case; it works well for a large number of small files. It packs many such files into a single split, so each mapper has more to process (1 split = 1 map task).
The overall MapReduce processing time will also fall, since fewer mappers are running.
Since there is no archive-aware InputFormat, using CombineFileInputFormat will improve performance.
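A hedged sketch of a driver using CombineTextInputFormat, the concrete text subclass that ships with Hadoop 2.x; the split size and paths are illustrative, and the default identity mapper/reducer are kept just to show the input side:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(CombineSmallFilesDriver.class);

        // Pack many small files into each split so far fewer map tasks are launched.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024); // ~128 MB per split (illustrative)

        // The default identity mapper/reducer pass records straight through; plug in your own classes as usual.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        CombineTextInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}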

Related

HDFS - load mass amount of files

For testing purposes I'm trying to load a massive number of small files into HDFS. We are talking about 1 million (1'000'000) files with sizes from 1 KB to 100 KB. I generated those files with an R script on a Linux system, all in one folder. Every file has an information structure consisting of a header with product information followed by a varying number of columns of numeric data.
The problem occurs when I try to upload those local files into HDFS with the command:
hdfs dfs -copyFromLocal /home/user/Documents/smallData /
Then I get one of the following Java heap size errors:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
I use the Cloudera CDH5 distribution with a Java heap size of about 5 GB. Is there another way than increasing this heap size even more? Maybe a better way to load this mass of data into HDFS?
I'm very thankful for every helpful comment!
Even if you increase the memory and manage to store the files in HDFS, you will run into many problems at processing time.
Problems with small files and HDFS
A small file is one which is significantly smaller than the HDFS block size (default 64MB). If you’re storing small files, then you probably have lots of them (otherwise you wouldn’t turn to Hadoop), and the problem is that HDFS can’t handle lots of files.
Every file, directory and block in HDFS is represented as an object in the namenode’s memory, each of which occupies 150 bytes, as a rule of thumb. So 10 million files, each using a block, would use about 3 gigabytes of memory. Scaling up much beyond this level is a problem with current hardware. Certainly a billion files is not feasible.
Furthermore, HDFS is not geared up to efficiently accessing small files: it is primarily designed for streaming access of large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.
Problems with small files and MapReduce
Map tasks usually process a block of input at a time (using the default FileInputFormat). If the file is very small and there are a lot of them, then each map task processes very little input, and there are a lot more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1GB file broken into 16 64MB blocks, and 10,000 or so 100KB files. The 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file.
There are a couple of features to help alleviate the bookkeeping overhead: task JVM reuse for running multiple map tasks in one JVM, thereby avoiding some JVM startup overhead (see the mapred.job.reuse.jvm.num.tasks property), and MultiFileInputSplit which can run more than one split per map.
Solution
Hadoop Archives (HAR files)
Create .HAR File
Hadoop Archives (HAR files) were introduced to HDFS in 0.18.0 to alleviate the problem of lots of files putting pressure on the namenode’s memory. HAR files work by building a layered filesystem on top of HDFS. A HAR file is created using the hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files.
hadoop archive -archiveName name -p <parent> <src>* <dest>
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
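To tie this back to the WordCount question above: the MapReduce code itself needs no changes to read from the archive; only the input path does, since HAR exposes its contents through the har:// filesystem. A hedged sketch, reusing the paths from the example command (note that with the default FileInputFormat each archived small file still becomes its own split):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// job is the Job instance from your existing driver.
FileInputFormat.addInputPath(job, new Path("har:///user/zoo/foo.har/dir1"));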
Sequence Files
The usual response to questions about “the small files problem” is: use a SequenceFile. The idea here is that you use the filename as the key and the file contents as the value. This works very well in practice. Going back to the 10,000 100KB files, you can write a program to put them into a single SequenceFile, and then you can process them in a streaming fashion (directly or using MapReduce) operating on the SequenceFile. There are a couple of bonuses too. SequenceFiles are splittable, so MapReduce can break them into chunks and operate on each chunk independently. They support compression as well, unlike HARs. Block compression is the best option in most cases, since it compresses blocks of several records (rather than per record).
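A minimal sketch of such a packing program, assuming plain-text inputs in a local directory, the filename as the key, and block compression with the default codec (the paths are placeholders):

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class PackIntoSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/user/hadoop/pages.seq"); // placeholder: target SequenceFile on HDFS

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(CompressionType.BLOCK, new DefaultCodec()))) {
            // placeholder: local directory holding the small files
            for (File f : new File("/home/user/Documents/smallData").listFiles()) {
                String content = new String(Files.readAllBytes(f.toPath()), StandardCharsets.UTF_8);
                writer.append(new Text(f.getName()), new Text(content)); // key = filename, value = contents
            }
        }
    }
}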
HBase
If you are producing lots of small files, then, depending on the access pattern, a different type of storage might be more appropriate. HBase stores data in MapFiles (indexed SequenceFiles), and is a good choice if you need to do MapReduce style streaming analyses with the occasional random look up. If latency is an issue, then there are lots of other choices.
First of all: if this isn't a stress test on your namenode, it's ill-advised to do this. But I assume you know what you are doing. (Expect slow progress on this.)
If the objective is to just get the files on HDFS, try doing this in smaller batches or set a higher heap size on your hadoop client.
You do this, as rpc1 mentioned in his answer, by prefixing HADOOP_HEAPSIZE=<mem in MB here> to your hadoop fs -put command.
Try to increase HEAPSIZE
HADOOP_HEAPSIZE=2048 hdfs dfs -copyFromLocal /home/user/Documents/smallData
look here
The Hadoop Distributed File System is not good with many small files, but it is with many big files. HDFS keeps a record in a lookup table that points to every file/block in HDFS, and this lookup table is usually loaded in memory. So you should not just increase the client's Java heap size but also increase the heap size of the namenode inside hadoop-env.sh; these are the defaults:
export HADOOP_HEAPSIZE=1000
export HADOOP_NAMENODE_INIT_HEAPSIZE="1000"
If you are going to process those files, you should expect low performance on the first MapReduce job you run on them (Hadoop creates as many map tasks as there are files/blocks, and this will overload your system unless you use CombineFileInputFormat). I advise you to either merge the files into big files (64 MB / 128 MB) or use another data source (not HDFS).
To solve this problem, I build a single file with a custom format. The content of the file is all the small files. The format looks like this:
<DOC>
<DOCID>1</DOCID>
<DOCNAME>Filename</DOCNAME>
<DOCCONTENT>
Content of file 1
</DOCCONTENT>
</DOC>
This structure could have more or fewer fields, but the idea is the same. For example, I have used this structure:
<DOC>
<DOCID>1</DOCID>
Content of file 1
</DOC>
And handled more than six million files.
If you want each file to be processed in a single map call, you can delete the \n chars between the <DOC> and </DOC> tags. After this, you only parse the structure and have the document identifier and content.
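A rough sketch of the map side under that scheme, assuming each record has been flattened onto a single line so the default TextInputFormat delivers one document per map call (the tag names follow the format above):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DocRecordMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String record = line.toString();

        // Pull the document id and the content out of one flattened <DOC> record.
        String docId = between(record, "<DOCID>", "</DOCID>");
        String content = between(record, "<DOCCONTENT>", "</DOCCONTENT>");
        if (docId != null && content != null) {
            context.write(new Text(docId), new Text(content));
        }
    }

    // Tiny helper: the substring between two markers, or null if either is missing.
    private static String between(String s, String open, String close) {
        int start = s.indexOf(open);
        int end = s.indexOf(close);
        if (start < 0 || end < 0 || end < start) {
            return null;
        }
        return s.substring(start + open.length(), end);
    }
}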

Distributing indexes of a large file over several computers

I want to distribute the processing of a large file (almost 1 GB) over many machines. One option would be to store the whole file on all the machines and pass indexes from the master machine, but I am not able to understand how to do this in Java. I want to do this mainly for optimization. More specifically, I want each machine to process a different part of the file and return the result. The problem is that file reading cannot start from a specific line, so on each machine the file would have to be read from the beginning, which wastes time because the same reading is repeated several times. Is there any solution for this?
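One common way to avoid re-reading from the beginning (a hedged sketch, not a complete distribution framework) is to hand each machine a byte range rather than a line range: seek to the start offset, discard the partial line, and stop once the end offset has been passed. This is essentially how Hadoop's own input splits work. The path and offsets are placeholders:

import java.io.IOException;
import java.io.RandomAccessFile;

public class FileSlice {
    // Process the lines of the byte range [start, end) of the given file.
    public static void processSlice(String path, long start, long end) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(start);
            if (start > 0) {
                raf.readLine(); // discard the partial line; the previous slice owns it
            }
            String line;
            // Lines that start inside the range belong to this slice, even if they run past the end offset.
            while (raf.getFilePointer() < end && (line = raf.readLine()) != null) {
                System.out.println(line); // placeholder for the real per-line processing
            }
        }
    }
}

RandomAccessFile.readLine() is byte-based and fairly slow; wrapping a bounded stream in a BufferedReader would be faster, but this keeps the sketch short.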

Combining large number of small files for mapreduce input

I am new to Hadoop & MapReduce. We are developing a network monitoring tool (in Java). We collect various information from monitored devices periodically, say every 5 seconds, and write that information to HDFS through a Java client, each piece of information as a new file (since we're not using the HDFS append facility). In HDFS our data organization looks like this:
/monitored_info
/f1.txt
/f2.txt
.......
/f1020010.txt
Thus each file is typically less than 2 KB in size.
I know each map task takes at most one file as input, so Hadoop will spawn as many map tasks as there are files and the job will be inefficient. To get around this we used the merging facility of FileUtil before submitting the job:
FileUtil.copyMerge(fileSystem, new Path("monitored_info"), fileSystem,
new Path("mapInputfile"), false, conf, null);
Is this a good practice? Or is there another mechanism used for such requirements? Please help...
Check for Apache Kafka and Apache Flume. You can aggregate logs and move to your data store with them.
I'd use Flume personally. Easier to use imho.
If you want to use MapReduce, there are different ways to do that:
Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS blocks more efficiently, thereby reducing namenode memory usage while still allowing transparent access to files.
The situation is alleviated somewhat by CombineFileInputFormat, which was designed to work well with small files. Where FileInputFormat creates a split per file, CombineFileInputFormat packs many files into each split so that each mapper has more to process. Crucially, CombineFileInputFormat takes node and rack locality into account when deciding which blocks to place in the same split, so it does not compromise the speed at which it can process the input in a typical MapReduce job.
One technique for avoiding the many small files case is to merge small files into larger files by using a SequenceFile: the keys can act as filenames (or a constant such as NullWritable, if not needed) and the values as file contents.

Search in big text log files

Let's say you have a game server which creates text log files of gamers' actions, and from time to time you need to look something up in those log files (like investigating a scam or a lost item). Just as an example, you have 100 files and each file has a size between 20 MB and 50 MB - how would you search them quickly?
What I have already tried is to create several threads, where each individual thread maps its own file into memory (let's say memory should not be a problem as long as it doesn't exceed 500 MB of RAM) and performs the search there. The result was something around 1 second per file:
File:a26.log - read in: 0.891, lines: 625282, matches: 78848
Is there a better way to do that? It seems kind of slow to me.
Thanks.
(Java was used for this case.)
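For reference, a stripped-down sketch of the approach described above: a fixed thread pool where each task memory-maps one file and counts occurrences of an ASCII search term (the file list and the search term are placeholders):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelLogSearch {
    public static void main(String[] args) throws Exception {
        List<Path> logs = List.of(Paths.get("a26.log"));                       // placeholder file list
        byte[] needle = "suspicious_item".getBytes(StandardCharsets.US_ASCII); // placeholder search term

        ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        for (Path log : logs) {
            pool.submit(() -> {
                try {
                    System.out.println(log + ": matches = " + countMatches(log, needle));
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
    }

    // Memory-map the whole file and do a plain byte-by-byte scan for the needle.
    static long countMatches(Path file, byte[] needle) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long count = 0;
            for (int i = 0; i <= buf.limit() - needle.length; i++) {
                int j = 0;
                while (j < needle.length && buf.get(i + j) == needle[j]) {
                    j++;
                }
                if (j == needle.length) {
                    count++;
                }
            }
            return count;
        }
    }
}

For files of this size the scan itself is rarely the bottleneck; disk throughput usually is, which is why the indexing suggestion below pays off when the same files are searched repeatedly.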
Tim Bray was investigating approaches to process Apache log files here: http://www.tbray.org/ongoing/When/200x/2007/09/20/Wide-Finder
Seems like there may be a lot in common with your situation.
You can use combinations of Unix commands such as find and grep.
For ad-hoc searching of large text files, I would use the UNIX grep, fgrep or egrep utilities. They have been around a long time, and have had the benefit of many people working on them to make them fast.
On the other hand, the ultimate bottleneck in searching text files (that haven't been previously indexed) will be the speed at which the application + operating system can move data from a disk file into memory. You seem to be managing 20 MB or more per second, which seems reasonably fast ... to me.
I should probably have mentioned in the first post that the game server runs on Win64 - and I wonder whether grep for Windows is on the same performance level as grep for Unix?
Of course there is a better way: you index the contents before searching. The way you index depends on how you want to search the logs, but in general, you might do well using Lucene (or Solr, if the log entries can easily be restructured into xml documents).
The amount of performance and resource use optimization put into tools like the above should give you orders of magnitude better performance than an ad-hoc solution.
This is all assuming you search each file many times. If this is not the case, you might as well grep the files and be done with it.
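A hedged sketch of what the indexing step could look like with Lucene's core API (the field names and paths are placeholders; exact class locations vary a little between Lucene versions):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class IndexGameLogs {
    public static void main(String[] args) throws Exception {
        Path indexDir = Paths.get("log-index"); // placeholder: where the index is written
        Path logFile = Paths.get("a26.log");    // placeholder: one of the log files

        try (IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
                                                  new IndexWriterConfig(new StandardAnalyzer()));
             Stream<String> lines = Files.lines(logFile)) {
            lines.forEach(line -> {
                try {
                    Document doc = new Document();
                    doc.add(new StringField("file", logFile.getFileName().toString(), Field.Store.YES));
                    doc.add(new TextField("entry", line, Field.Store.YES)); // the searchable log line
                    writer.addDocument(doc);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
        }
    }
}

Once the index is built, an IndexSearcher query returns matching entries in milliseconds instead of re-scanning every file.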

Text file split libraries in Java

My program receives large CSV files and transforms them into XML files. In order to get better performance I would like to split these files into smaller segments of (for example) 500 lines. What are the available Java libraries for splitting text files?
I don't understand what you'd gain by splitting up the CSV file into smaller ones. With Java, you can read and process the file as you go; you don't have to read it all at once...
What do you intend to do with that data?
If it is just record-by-record processing, then event-oriented (SAX or StAX) parsing is the way to go. For record-by-record processing, an existing "pipeline" toolkit may be applicable.
You can pre-process your file with a splitter function like this one or this Splitter.java.
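If a library feels like overkill, a splitter is only a few lines of plain Java. A minimal sketch using 500 lines per chunk as in the question (it makes no attempt to handle quoted CSV fields that contain newlines):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CsvSplitter {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get("input.csv"); // placeholder input file
        int linesPerChunk = 500;

        try (BufferedReader reader = Files.newBufferedReader(input)) {
            String line;
            int lineNo = 0;
            BufferedWriter writer = null;
            while ((line = reader.readLine()) != null) {
                if (lineNo % linesPerChunk == 0) {
                    if (writer != null) {
                        writer.close(); // finish the previous chunk
                    }
                    writer = Files.newBufferedWriter(Paths.get("chunk-" + (lineNo / linesPerChunk) + ".csv"));
                }
                writer.write(line);
                writer.newLine();
                lineNo++;
            }
            if (writer != null) {
                writer.close();
            }
        }
    }
}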
How are you planning on distributing the work once the files have been split?
I have done something similar to this on a framework called GridGain - it's a grid computing framework which allows you to execute tasks on a grid of computers.
With this in hand you can then use a cache provider such as JBoss Cache to distribute the file to multiple nodes, specify a start and end line number and process. This is outlined in the following GridGain example: http://www.gridgainsystems.com/wiki/display/GG15UG/Affinity+MapReduce+with+JBoss+Cache
Alternatively you could look at something like Hadoop and the Hadoop File System for moving the file between different nodes.
The same concept could be applied on your local machine by loading the file into a cache and then assigning certain "chunks" of the file to be worked on by separate threads. The grid computing stuff is really only for very large problems, or to provide some level of scalability transparently to your solution. You might need to watch out for IO bottlenecks and locks, but a simple thread pool which you dispatch "jobs" into after the file is split could work.
