Distributing indexes of a large file over several computers

I want to distribute the processing of a large file (almost 1 GB) over many machines. One option would be to store the whole file on every machine and pass indexes from the master machine, but I don't understand how to do this in Java. I want to do this mainly for optimization: each machine should process a different part of the file and return its result. The problem is that file reading cannot start from a specific line, so each machine would have to read the file from the beginning, which wastes time because the same reading is repeated on every machine. Is there a solution for this?
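One possible approach, sketched here under assumptions rather than taken from an accepted answer: the master hands out byte ranges instead of line numbers, and each machine seeks straight to its range with `RandomAccessFile` and aligns itself to the next line boundary. The class name `ChunkReader` and the method `readLines` are hypothetical, and the sketch assumes a single-byte encoding such as ASCII, since `RandomAccessFile.readLine` maps each byte to one character.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class ChunkReader {
    // Returns every line whose first byte lies in the byte range [start, end).
    static List<String> readLines(String path, long start, long end) throws IOException {
        List<String> lines = new ArrayList<>();
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            if (start > 0) {
                // Step back one byte and discard up to the next newline so that
                // reading begins on a line boundary. If byte start-1 is itself a
                // newline, this discards an empty string and stays at 'start'.
                file.seek(start - 1);
                file.readLine();
            }
            String line;
            // A line that starts before 'end' is read completely, even if it
            // runs past 'end'; the next worker skips it via the step above.
            while (file.getFilePointer() < end && (line = file.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Demo on a small temporary file split into two byte ranges.
        Path demo = Files.createTempFile("chunk-demo", ".txt");
        Files.write(demo, "a\nbb\nccc\ndddd\n".getBytes(StandardCharsets.US_ASCII));
        long size = Files.size(demo);
        long middle = size / 2;
        System.out.println(readLines(demo.toString(), 0, middle));
        System.out.println(readLines(demo.toString(), middle, size));
        Files.delete(demo);
    }
}
```

With this scheme every line is processed by exactly one worker, and no worker reads the part of the file before its own range. `RandomAccessFile.readLine` is unbuffered, so a real implementation would probably want buffered reads on top of the same seek-and-align idea.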

Related

HDFS - load mass amount of files

For testing purposes I'm trying to load a massive number of small files into HDFS. We are talking about 1 million (1,000,000) files ranging in size from 1 KB to 100 KB. I generated those files with an R script on a Linux system in one folder. Every file has an information structure that contains a header with product information and a varying number of columns with numeric information.
The problem is when I try to upload those local files into HDFS with the command:
hdfs dfs -copyFromLocal /home/user/Documents/smallData /
Then I get one of the following Java heap size errors:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
I use the Cloudera CDH5 distribution with a Java heap size of about 5 GB. Is there another way than increasing the Java heap size even more? Maybe a better way to load this mass of data into HDFS?
I'm very thankful for every helpful comment!

Even if you increase the memory and manage to store the files in HDFS, you will run into many problems when you process them.
Problems with small files and HDFS
A small file is one which is significantly smaller than the HDFS block size (default 64MB). If you’re storing small files, then you probably have lots of them (otherwise you wouldn’t turn to Hadoop), and the problem is that HDFS can’t handle lots of files.
Every file, directory and block in HDFS is represented as an object in the namenode’s memory, each of which occupies 150 bytes, as a rule of thumb. So 10 million files, each using a block, would use about 3 gigabytes of memory. Scaling up much beyond this level is a problem with current hardware. Certainly a billion files is not feasible.
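The 3 GB figure follows from counting one namenode object per file plus one per block. A small sketch of that arithmetic, using the 150-byte rule of thumb quoted above; the class and method names are made up for illustration, and directories are ignored for simplicity:

```java
class NamenodeMemoryEstimate {
    // Rule-of-thumb size of one namenode object, as quoted in the text.
    static final long BYTES_PER_OBJECT = 150;

    // One object per file plus one object per block of each file.
    // Directory objects are left out of this rough estimate.
    static long estimateBytes(long files, long blocksPerFile) {
        long objects = files + files * blocksPerFile;
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // 10 million single-block files: 20 million objects at 150 bytes each.
        long bytes = estimateBytes(10_000_000L, 1);
        System.out.println(bytes + " bytes"); // prints "3000000000 bytes"
    }
}
```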
Furthermore, HDFS is not geared up to efficiently accessing small files: it is primarily designed for streaming access of large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.
Problems with small files and MapReduce
Map tasks usually process a block of input at a time (using the default FileInputFormat). If the file is very small and there are a lot of them, then each map task processes very little input, and there are a lot more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1GB file broken into 16 64MB blocks, and 10,000 or so 100KB files. The 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file.
There are a couple of features to help alleviate the bookkeeping overhead: task JVM reuse for running multiple map tasks in one JVM, thereby avoiding some JVM startup overhead (see the mapred.job.reuse.jvm.num.tasks property), and MultiFileInputSplit which can run more than one split per map.
Solution
Hadoop Archives (HAR files)
Create .HAR File
Hadoop Archives (HAR files) were introduced to HDFS in 0.18.0 to alleviate the problem of lots of files putting pressure on the namenode’s memory. HAR files work by building a layered filesystem on top of HDFS. A HAR file is created using the hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files:
hadoop archive -archiveName name -p <parent> <src>* <dest>
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
Sequence Files
The usual response to questions about “the small files problem” is: use a SequenceFile. The idea here is that you use the filename as the key and the file contents as the value. This works very well in practice. Going back to the 10,000 100KB files, you can write a program to put them into a single SequenceFile, and then you can process them in a streaming fashion (directly or using MapReduce) operating on the SequenceFile. There are a couple of bonuses too. SequenceFiles are splittable, so MapReduce can break them into chunks and operate on each chunk independently. They support compression as well, unlike HARs. Block compression is the best option in most cases, since it compresses blocks of several records (rather than per record).
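To make the "filename as key, contents as value" idea concrete, here is a plain-Java stand-in that packs many small files into one stream of key/value records. It is not Hadoop's SequenceFile format or API, only an illustration of the record layout idea; the class `KeyValuePack` and its methods are hypothetical, and it works on in-memory byte arrays to stay self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class KeyValuePack {
    // Writes each entry as: key (UTF string), value length (int), value bytes.
    static byte[] pack(Map<String, byte[]> files) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> entry : files.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeInt(entry.getValue().length);
                out.write(entry.getValue());
            }
        }
        return buffer.toByteArray();
    }

    // Reads the records back one after another, in streaming fashion.
    static Map<String, byte[]> unpack(byte[] packed) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(packed))) {
            while (in.available() > 0) {
                String name = in.readUTF();
                byte[] content = new byte[in.readInt()];
                in.readFully(content);
                files.put(name, content);
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("a.txt", "first file".getBytes(StandardCharsets.UTF_8));
        files.put("b.txt", "second file".getBytes(StandardCharsets.UTF_8));
        byte[] packed = pack(files);
        System.out.println(unpack(packed).keySet());
    }
}
```

A real SequenceFile additionally carries a header and sync markers, which is what makes it splittable and compressible; this sketch has neither.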
HBase
If you are producing lots of small files, then, depending on the access pattern, a different type of storage might be more appropriate. HBase stores data in MapFiles (indexed SequenceFiles), and is a good choice if you need to do MapReduce style streaming analyses with the occasional random look up. If latency is an issue, then there are lots of other choices.

First of all: If this isn't a stress test on your namenode it's ill advised to do this. But I assume you know what you are doing. (expect slow progress on this)
If the objective is to just get the files on HDFS, try doing this in smaller batches or set a higher heap size on your hadoop client.
You can do this, as rpc1 mentioned in his answer, by prefixing HADOOP_HEAPSIZE=<mem in MB here> to your hadoop -put command.

Try increasing HADOOP_HEAPSIZE:
HADOOP_HEAPSIZE=2048 hdfs dfs -copyFromLocal /home/user/Documents/smallData

The Hadoop Distributed File System is good with many big files, not with many small files. HDFS keeps a record in a lookup table that points to every file/block in HDFS, and this lookup table is usually loaded in memory. So you should not just increase the Java heap size but also increase the heap size of the name node in hadoop-env.sh. These are the defaults:
export HADOOP_HEAPSIZE=1000
export HADOOP_NAMENODE_INIT_HEAPSIZE="1000"
If you are going to process those files, you should expect low performance on the first MapReduce job you run on them (Hadoop creates as many map tasks as there are files/blocks, and this will overload your system unless you use CombineFileInputFormat). I advise you to either merge the files into big files (64 MB/128 MB) or use another data source (not HDFS).

To solve this problem, I build a single file in a simple format whose content is all the small files. The format looks like this:
<DOC>
<DOCID>1</DOCID>
<DOCNAME>Filename</DOCNAME>
<DOCCONTENT>
Content of file 1
</DOCCONTENT>
</DOC>
This structure could have more or fewer fields, but the idea is the same. For example, I have used this structure:
<DOC>
<DOCID>1</DOCID>
Content of file 1
</DOC>
With it I handle more than six million files.
If you want each file to be processed by one map task, you could delete the \n characters between the <DOC> and </DOC> tags. After that, you only need to parse the structure to get the document identifier and content.

Writing multiple files of same data amount vs writing a single large file of same data amount

I want to write a big file to the local disk.
I split the big file into many small files and tried to write them to the disk. I observed that writing the split files took much longer than writing the single file.
I also copy the files from one disk and write them to another computer's disk (the reducer), and I observed a big increase in read time as well. Can anybody explain the reason? I am working with Hadoop.
Thanks!

That's due to the underlying file system and hardware.
There's overhead for each file in addition to its contents, for example the MFT for NTFS (on Windows). For a single large file the file system has less bookkeeping to do, so it's faster.
As arranged by your OS, a single big file tends to be written to consecutive sectors of the hard drive where possible, but multiple small files may or may not be written that way. The resulting increase in seek time may account for the increased reading time for many small files.
The efficiency of your OS may also play a big part: for example, whether it prefetches file contents, how it makes use of buffers, etc. For many small files it's more difficult for the OS to use the buffer (and deal with other issues) efficiently. (Under different scenarios it can behave differently.)
EDIT: As for the copy process you mentioned, your OS generally does it in the following steps:
read data from disk -> write data to buffer -> read from buffer -> write to (possibly another) disk
This is usually done in multiple threads. When dealing with many small files, the OS may fail to coordinate these threads efficiently (some threads are very busy, while others must wait). For a single large file the OS doesn't have to deal with these issues.

Every file system defines a smallest (non-shareable) unit for storing data, called a page here (often called a block or cluster). Say, for example, your file system has a page size of 4 KB. If you save one big file of 8 KB, it consumes 2 pages on the disk. But if you break it into 4 files of 2 KB each, it consumes 4 half-filled pages, taking up 16 KB of disk space.
Similarly, if you break the file into 8 small files of 1 KB each, it consumes 8 partially filled pages, and 32 KB of disk space is used.
The same is true for reading overhead. If your data spans several pages, they might be scattered across the disk, which leads to high seek/access time overhead.
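The space figures in this answer come from rounding each file up to a whole number of pages. A short sketch of that arithmetic, with the 4 KB page size taken from the example and hypothetical class and method names:

```java
class AllocatedSize {
    // Disk space one file occupies when it must fill whole pages.
    static long allocatedBytes(long fileBytes, long pageBytes) {
        long pages = (fileBytes + pageBytes - 1) / pageBytes; // round up
        return pages * pageBytes;
    }

    // Disk space for 'fileCount' files of equal size.
    static long totalAllocated(int fileCount, long bytesPerFile, long pageBytes) {
        return fileCount * allocatedBytes(bytesPerFile, pageBytes);
    }

    public static void main(String[] args) {
        long page = 4096; // 4 KB page size, as in the example
        System.out.println(totalAllocated(1, 8192, page)); // one 8 KB file: 8192
        System.out.println(totalAllocated(4, 2048, page)); // four 2 KB files: 16384
        System.out.println(totalAllocated(8, 1024, page)); // eight 1 KB files: 32768
    }
}
```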

Something like a message digest but that progressively describes a file

I have a set of geographically remote nodes with heterogeneous operating systems which need to transfer files and updates around using a Java program I am writing. At present I need to send the entire file again if the file changes. Is there a way to determine the sections of a file that are different and only send those? (Note that these files are not necessarily text; they could be any format.) The only way I can think of is to split the file into blocks, hash the blocks, and send the hashes back to the requester, which then requests only the blocks it needs. But for small blocks and large files this is a large overhead. Is there any way to send a message describing my file such that this single message can be analysed to produce a list of the blocks that need to be transmitted?
Most digest functions are designed so that a small change to the data results in a large change over the whole hash output. I basically need the reverse of this, and it must work on all operating systems.

If I understand your question correctly, you need to keep files in sync on two systems. There is a tool called rsync that can synchronize two files (or whole directories) by only sending the changes made to the file.
You may also be interested in the Rsync algorithm.
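For reference, the block-hashing scheme described in the question can be sketched in a few lines of Java. This is the simple fixed-offset variant, not the rsync algorithm: rsync additionally uses a rolling checksum so that matching blocks are still found after data has been inserted or removed. The class `BlockDiff`, its method names, and the choice of SHA-256 are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BlockDiff {
    // Hashes the data in consecutive fixed-size blocks (the last may be shorter).
    static List<byte[]> blockHashes(byte[] data, int blockSize) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
        List<byte[]> hashes = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += blockSize) {
            int length = Math.min(blockSize, data.length - offset);
            digest.update(data, offset, length);
            hashes.add(digest.digest()); // digest() also resets for the next block
        }
        return hashes;
    }

    // Indexes of blocks in the new version that are missing or differ in the old one.
    static List<Integer> changedBlocks(List<byte[]> oldHashes, List<byte[]> newHashes) {
        List<Integer> changed = new ArrayList<>();
        for (int i = 0; i < newHashes.size(); i++) {
            if (i >= oldHashes.size() || !Arrays.equals(oldHashes.get(i), newHashes.get(i))) {
                changed.add(i);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        byte[] oldData = "aaaabbbbcccc".getBytes(StandardCharsets.US_ASCII);
        byte[] newData = "aaaaXbbbcccc".getBytes(StandardCharsets.US_ASCII);
        // Only the second 4-byte block differs between the two versions.
        System.out.println(changedBlocks(blockHashes(oldData, 4), blockHashes(newData, 4)));
    }
}
```

The receiver sends its block hashes, the sender compares them against its own and transmits only the blocks listed by `changedBlocks`. A single inserted byte shifts every later block in this variant, which is the weakness the rolling checksum in rsync addresses.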

Processing large set of small files with Hadoop

I am using the Hadoop example program WordCount to process a large set of small files/web pages (approx. 2-3 kB each). Since this is far from the optimal file size for Hadoop, the program is very slow. I guess this is because the cost of setting up and tearing down the job is far greater than the job itself. Such small files also deplete the namespace for file names.
I read that in this case I should use HDFS archives (HAR), but I am not sure how to modify the WordCount program to read from these archives. Can the program continue to work without modification, or is some modification necessary?
Even if I pack a lot of files into archives, the question remains whether this will improve performance. I read that even if I pack multiple files, the files inside one archive will not be processed by one mapper but by many, which in my case (I guess) will not improve performance.
If this question is too simple, please understand that I am a newbie to Hadoop and have very little experience with it.

Using the HDFS won't change that you are causing hadoop to handle a large quantity of small files. The best option in this case is probably to cat the files into a single (or few large) file(s).
This will reduce the number of mappers you have, which will reduce the number of things required to be processed.
Using HDFS can improve performance if you are operating on a distributed system. If you are only running pseudo-distributed (one machine), then HDFS isn't going to improve performance. The limitation is the machine.
Operating on a large number of small files requires a large number of mappers and reducers. The setup/teardown time can be comparable to the processing time of the file itself, causing a large overhead. Concatenating the files should reduce the number of mappers Hadoop runs for the job, which should improve performance.
The benefit you could see from using the HDFS to store the files would be in distributed mode, with multiple machines. The files would be stored in blocks (default 64MB) across machines and each machine would be capable of processing a block of data that resides on the machine. This reduces network bandwidth use so it doesn't become a bottleneck in processing.
Archiving the files, if Hadoop is going to unarchive them, will just result in Hadoop still having a large number of small files.
Hope this helps your understanding.

From my still limited understanding of Hadoop, I believe the right solution would be to create SequenceFile(s) containing your HTML files as values and possibly the URL as the key. If you do a M/R job over the SequenceFile(s), each mapper will process many files (depending on the split size). Each file will be presented to the map function as a single input.
You may want to use SequenceFileAsTextInputFormat as the InputFormat to read these files.
Also see: Providing several non-textual files to a single map in Hadoop MapReduce

I bookmarked this article recently to read it later and found the same question here :) The entry is a bit old, not exactly sure how relevant it is now. The changes to Hadoop are happening at a very rapid pace.
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
The blog entry is by Tom White, who is also the author of "Hadoop: The Definitive Guide, Second Edition", a recommended read for those who are getting started with Hadoop.
http://oreilly.com/catalog/0636920010388

Can you concatenate files before submitting them to Hadoop?

CombineFileInputFormat can be used in this case; it works well for a large number of small files. It packs many such files into a single split, so each mapper has more to process (1 split = 1 map task).
The overall processing time for MapReduce will also fall, since fewer mappers are running.
Since there is no archive-aware InputFormat, using CombineFileInputFormat will improve performance.

How to decompress faster in Java?

Our system has a problem with too many files. They are used by a webapp that must be available all the time, which means the files cannot be deleted, and there are so many of them that they make the system (which runs Windows) slow. We would like to zip up the files and, when a file is requested, unzip that particular file.
I've tried the Java ZipFile class, and the performance is not good enough, because many people will be using the webapp and requesting files. From my observation, unzipping takes between 0.5 and 2 seconds, and when there are too many users the system cannot keep up with them.
For example, I've used JMeter to simulate 30 users using the system with a random delay between 0.3 and 0.6 seconds. Although I doubt there will be that many requests, I cannot know in advance how many people will use the webapp. Is there any other method to solve this problem?
Thanks in advance!!
P.S. If any third-party library is needed, it must be free!
P.S. The number of files is simply too large, and it hangs the machine. We would like to do this: zip up 2000 files into one zip file, so that the number of files decreases and hopefully the system won't hang anymore, and unzip individual files when needed.

Okay, here's some thoughts. It appears to me that your core problem is the slowness of your system and that you're trying to fix it by compressing the files and decompressing them on demand. Then you've found that the decompression is too slow and you need a faster way to do that.
Now I'm not entirely certain why you think this compression will speed things up instead of making things slower.
I would go back to the original problem and work more on solving that. Why is the number of files making your system slow? If you can figure that out, you can fix it in a way that doesn't involve things going even slower.
If it's an issue with too many files in a directory, think about splitting into multiple directories. But I have no idea whether NTFS even has that problem (FAT did). For example, if you have a directory with files for every minute of the last ten years (five million files), you can split them into day directories (three and a half thousand directories with fifteen hundred files in each).
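As a rough illustration of the day-directory idea, the following sketch derives a per-day subdirectory from a file's timestamp. The class `DayBuckets`, the method `bucketFor`, and the use of UTC dates as directory names are assumptions for the example, not something prescribed above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

class DayBuckets {
    // Maps a file to root/<yyyy-MM-dd>/<fileName>, using the UTC calendar day
    // of its timestamp, so one directory holds at most one day's worth of files.
    static Path bucketFor(Path root, String fileName, Instant timestamp) {
        LocalDate day = timestamp.atZone(ZoneOffset.UTC).toLocalDate();
        return root.resolve(day.toString()).resolve(fileName);
    }

    public static void main(String[] args) {
        Path target = bucketFor(Paths.get("data"), "report.txt",
                Instant.parse("2020-01-15T10:30:00Z"));
        System.out.println(target); // e.g. data/2020-01-15/report.txt on Unix-like systems
    }
}
```

With one file per minute, each day directory ends up with 24 * 60 = 1440 files, which matches the "fifteen hundred files in each" estimate.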
Compression won't reduce the number of files, just the space taken by them.
If it's an issue with the number of files on the system (rather than in a directory), there are plenty of ways to split files between systems as well. For example, hive off 10% of the entire file set to each of ten different machines and forward incoming requests for a specific file to the relevant machine.
But, I have to say, I've seen Windows machines handle absolute bucket-loads of files so I'd be very surprised if the problem lay there. I think you're probably just going to have to track down what's actually causing your "hangs".

Compressing/uncompressing the files will not make Windows faster.

If zip doesn't provide a performance gain (despite having a native implementation in Java), you can try to improve things at the filesystem level. Folders with too many (>10000) files don't work well under some Windows filesystems, so try dividing the files into several folders, tuning the NTFS filesystem (cluster size, reserved space for the filesystem), disabling antivirus, disabling indexing, buying an SLC SSD...
