My server application stores files of widely varying sizes. The database grows to 2-4 TB; about 90% of the files are smaller than 100 KB, and the rest range up to 2 GB.
Which would be the best way of storing these files given the following conditions:
* The database should be portable.
It already is in the current solution, since the index, metadata, etc. are stored in a SQLite3 database. However, copying 10 million individual small files is a pain, so I'd prefer to store the small files inside archives, for example 1000 archives of 2 GB each.
* Some files are quite big, so storing them as BLOBs may become a performance problem
Additionally, some of the bigger files (5 MB - 2 GB) may need random-access reading, which isn't possible with most archive formats
* Some transactions need to read thousands of small files at once
...which works well with BLOBs, but I'm not sure whether something like a zip archive would perform well in this case. The server can in most cases put all of these files in one archive, but the objects may have to be read in random order.
* Stored files should not be (directly) executable, but there is no need for encryption
Which system would you recommend for that?
I'm trying to build a very large CSV file on S3.
I want to build this file on S3
I want to append rows to this file in batches.
Number of rows could be anywhere between 10k to 1M
Size of each batch could be < 5 MB (so multi-part upload is not feasible)
What would be the right way of accomplishing something like this?
Traditionally in Big Data processing ("Data Lakes"), information related to a single table is stored in a directory rather than a single file. So, appending information to a table is as simple as adding another file to the directory. All files within the directory need to have the same schema (such as the same CSV columns or JSON fields).
The directory of files can then be used with tools such as:
Spark, Hive and Presto on Hadoop
Amazon Athena
Amazon Redshift Spectrum
A benefit of this method is that the above systems can process multiple files in parallel rather than being restricted to reading a single file in a single thread.
Also common is to compress the files using technologies like gzip. This lowers storage requirements and makes it faster to read data from disk. Adding additional files is easy (just add another csv.gz file) rather than having to unzip, append and re-zip a file.
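The "add another csv.gz file" idea can be sketched as follows. This is only an illustration under stated assumptions: a local directory stands in for the S3 prefix, and the function names and the part-file naming scheme are hypothetical, not part of any of the tools listed above.

```python
import csv
import gzip
import tempfile
import uuid
from pathlib import Path


def append_batch(table_dir, rows):
    """Write one batch as a new gzip-compressed CSV part file.

    Existing part files are never read or rewritten.
    """
    table_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: any unique name with a .csv.gz suffix works.
    part = table_dir / f"part-{uuid.uuid4().hex}.csv.gz"
    with gzip.open(part, "wt", newline="", encoding="utf-8") as fh:
        csv.writer(fh).writerows(rows)
    return part


def read_table(table_dir):
    """Read every part file in the directory as one logical table."""
    rows = []
    for part in sorted(table_dir.glob("*.csv.gz")):
        with gzip.open(part, "rt", newline="", encoding="utf-8") as fh:
            rows.extend(csv.reader(fh))
    return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        table = Path(tmp) / "my_table"
        append_batch(table, [["1", "alice"], ["2", "bob"]])
        append_batch(table, [["3", "carol"]])
        print(len(read_table(table)), "rows in", len(list(table.glob("*.csv.gz"))), "part files")
```

Each batch costs one small write, no matter how large the table has already become.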
Bottom line: It would be advisable to re-think your requirements for "one great big CSV file".
'One big file' isn't going to work for you. You can't append rows to an S3 object without first downloading the entire object, adding the rows, and then uploading the new object over the old one. For small files this works, but as the file gets larger every append has to transfer the whole file again, so the cumulative bandwidth and processing grow roughly with the square of the file size, and it may get very slow and possibly expensive.
You are better off refactoring your design to work with lots of little files instead of one big one.
Leave a 5 MB garbage object sitting on S3 and concatenate with it, where part 1 = the 5 MB garbage object and part 2 = the file that you want to upload and concatenate. Keep repeating this for each fragment, and finally use a range copy to strip out the 5 MB of garbage.
For testing purposes I'm trying to load a massive number of small files into HDFS. We are talking about 1 million (1,000,000) files with sizes from 1 KB to 100 KB. I generated those files with an R script on a Linux system in one folder. Every file has an information structure that contains a header with product information and a varying number of columns with numeric information.
The problem is when I try to upload those local files into HDFS with the command:
hdfs dfs -copyFromLocal /home/user/Documents/smallData /
Then I get one of the following Java heap space errors:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
I use the Cloudera CDH5 distribution with a Java heap size of about 5 GB. Is there another way than increasing the Java heap size even more? Maybe a better way to load this mass of data into HDFS?
I'm very thankful for every helpful comment!
If you increase the memory and store the files in HDFS as they are, you will run into many problems when you process them.
Problems with small files and HDFS
A small file is one which is significantly smaller than the HDFS block size (default 64MB). If you’re storing small files, then you probably have lots of them (otherwise you wouldn’t turn to Hadoop), and the problem is that HDFS can’t handle lots of files.
Every file, directory and block in HDFS is represented as an object in the namenode’s memory, each of which occupies 150 bytes, as a rule of thumb. So 10 million files, each using a block, would use about 3 gigabytes of memory. Scaling up much beyond this level is a problem with current hardware. Certainly a billion files is not feasible.
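The 3 GB figure follows directly from the rule of thumb. A minimal sketch of the arithmetic, assuming one file object plus one block object per file and the 150-byte estimate quoted above (the function name is made up for this example):

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb figure from the text, not a measured value


def namenode_bytes(num_files, blocks_per_file=1, num_dirs=0):
    """Estimate namenode memory: one object per file, per block and per directory."""
    objects = num_files + num_files * blocks_per_file + num_dirs
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    # 10 million single-block files: 20 million objects in total.
    print(namenode_bytes(10_000_000) / 1e9, "GB")
    # The 1 million files from the question.
    print(namenode_bytes(1_000_000) / 1e9, "GB")
```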
Furthermore, HDFS is not geared up to efficiently accessing small files: it is primarily designed for streaming access of large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.
Problems with small files and MapReduce
Map tasks usually process a block of input at a time (using the default FileInputFormat). If the file is very small and there are a lot of them, then each map task processes very little input, and there are a lot more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1GB file broken into 16 64MB blocks, and 10,000 or so 100KB files. The 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file.
There are a couple of features to help alleviate the bookkeeping overhead: task JVM reuse for running multiple map tasks in one JVM, thereby avoiding some JVM startup overhead (see the mapred.job.reuse.jvm.num.tasks property), and MultiFileInputSplit which can run more than one split per map.
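The comparison between one 1 GB file and 10,000 small files can be checked with a little arithmetic. The sketch below is a simplified model, assuming one map task per block and that a block never spans two files; it ignores the details of how the input format actually computes splits.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size used in the text


def map_task_count(file_sizes, block_size=BLOCK_SIZE):
    """Simplified model: every file needs at least one map task, one per block."""
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)


if __name__ == "__main__":
    one_big_file = [1024 ** 3]               # a single 1 GB file
    many_small_files = [100 * 1024] * 10_000  # 10,000 files of 100 KB
    print(map_task_count(one_big_file))       # 16 tasks
    print(map_task_count(many_small_files))   # 10000 tasks
```

Both inputs hold roughly the same amount of data, but the small-file case pays the per-task overhead 10,000 times instead of 16 times.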
Solution
Hadoop Archives (HAR files)
Create .HAR File
Hadoop Archives (HAR files) were introduced to HDFS in 0.18.0 to alleviate the problem of lots of files putting pressure on the namenode’s memory. HAR files work by building a layered filesystem on top of HDFS. A HAR file is created using the hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files:
hadoop archive -archiveName name -p <parent> <src>* <dest>
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
Sequence Files
The usual response to questions about “the small files problem” is: use a SequenceFile. The idea here is that you use the filename as the key and the file contents as the value. This works very well in practice. Going back to the 10,000 100KB files, you can write a program to put them into a single SequenceFile, and then you can process them in a streaming fashion (directly or using MapReduce) operating on the SequenceFile. There are a couple of bonuses too. SequenceFiles are splittable, so MapReduce can break them into chunks and operate on each chunk independently. They support compression as well, unlike HARs. Block compression is the best option in most cases, since it compresses blocks of several records (rather than per record).
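The "filename as key, contents as value" idea can be illustrated without Hadoop. The sketch below is not the SequenceFile format; it is a hypothetical length-prefixed container that only shows how many small files become one stream of key/value records that can be read back sequentially.

```python
import io
import struct


def write_record(out, key, value):
    """Append one record: two 4-byte big-endian lengths, then key bytes, then value bytes."""
    key_bytes = key.encode("utf-8")
    out.write(struct.pack(">II", len(key_bytes), len(value)))
    out.write(key_bytes)
    out.write(value)


def read_records(src):
    """Yield (key, value) pairs in the order they were written."""
    while True:
        header = src.read(8)
        if not header:
            return  # clean end of stream
        key_len, value_len = struct.unpack(">II", header)
        key = src.read(key_len).decode("utf-8")
        value = src.read(value_len)
        yield key, value


if __name__ == "__main__":
    container = io.BytesIO()  # stands in for one large file on HDFS
    write_record(container, "product_0001.csv", b"header\n1,2,3\n")
    write_record(container, "product_0002.csv", b"header\n4,5,6\n")
    container.seek(0)
    for name, content in read_records(container):
        print(name, len(content))
```

A real SequenceFile adds sync markers and optional compression on top of this basic layout, which is what makes it splittable.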
HBase
If you are producing lots of small files, then, depending on the access pattern, a different type of storage might be more appropriate. HBase stores data in MapFiles (indexed SequenceFiles), and is a good choice if you need to do MapReduce style streaming analyses with the occasional random look up. If latency is an issue, then there are lots of other choices.
First of all: if this isn't a stress test on your namenode, it's ill-advised to do this. But I assume you know what you are doing (expect slow progress on this).
If the objective is just to get the files onto HDFS, try doing this in smaller batches or set a higher heap size on your Hadoop client.
You do this, as rpc1 mentioned in his answer, by prefixing HADOOP_HEAPSIZE=<mem in MB here> to your hadoop -put command.
Try increasing HADOOP_HEAPSIZE:
HADOOP_HEAPSIZE=2048 hdfs dfs -copyFromLocal /home/user/Documents/smallData /
The Hadoop Distributed File System is not good with many small files, but works well with many big files. HDFS keeps a record in a lookup table that points to every file/block in HDFS, and this lookup table is usually loaded in memory. So you should not just increase the Java heap size of the client but also increase the heap size of the namenode in hadoop-env.sh. These are the defaults:
export HADOOP_HEAPSIZE=1000
export HADOOP_NAMENODE_INIT_HEAPSIZE="1000"
If you are going to do processing on those files, you should expect low performance on the first MapReduce job you run on them (Hadoop creates as many map tasks as there are files/blocks, and this will overload your system unless you use CombineFileInputFormat). I advise you to either merge the files into big files (64 MB / 128 MB) or use another data store (not HDFS).
To solve this problem, I build a single file in a simple format. The file contains all of the small files. The format looks like this:
<DOC>
<DOCID>1</DOCID>
<DOCNAME>Filename</DOCNAME>
<DOCCONTENT>
Content of file 1
</DOCCONTENT>
</DOC>
This structure could have more or fewer fields, but the idea is the same. For example, I have used this structure:
<DOC>
<DOCID>1</DOCID>
Content of file 1
</DOC>
and handled more than six million files with it.
If you want each file to be handled as one record by a map task, you could delete the \n characters between the <DOC> and </DOC> tags. After this, you only need to parse the structure to get the doc identifier and the content.
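A minimal sketch of this packing scheme, using the shorter <DOC>/<DOCID> structure. It assumes that the content never contains the tags themselves and that replacing newlines inside a document with spaces is acceptable; the function names are made up for this example.

```python
import re

# Matches one record, whether it is spread over several lines or flattened to one.
DOC_RE = re.compile(r"<DOC>\s*<DOCID>(.*?)</DOCID>(.*?)</DOC>", re.DOTALL)


def pack(docs):
    """Build one text blob from (doc_id, content) pairs, one <DOC> record per line."""
    lines = []
    for doc_id, content in docs:
        flat = content.replace("\n", " ")  # lossy: keeps every record on a single line
        lines.append(f"<DOC><DOCID>{doc_id}</DOCID>{flat}</DOC>")
    return "\n".join(lines) + "\n"


def unpack(text):
    """Recover (doc_id, content) pairs from the packed text."""
    return [(doc_id, content.strip()) for doc_id, content in DOC_RE.findall(text)]


if __name__ == "__main__":
    packed = pack([("1", "Content of file 1"), ("2", "Content of\nfile 2")])
    print(packed, end="")
    print(unpack(packed))
```

With one record per line, a line-oriented input format hands each original file to the mapper as a single value.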
I want to write a big file to the local disk.
I split the big file into many small files and I tried to write it to the disk. But I observed that when I split the files and tried to write, there was a big increase in disk write time.
Also, I copy the files from one disk and write them to another computer's disk (reducer). I observed that there was a big increase in read time as well. Can anybody explain the reason to me? I am working with Hadoop.
Thanks!
That's due to the underlying file system and hardware.
There's overhead for each file in addition to its contents, for example the MFT for NTFS (on Windows). So for a single large file the file system has less bookkeeping to do, and thus it's faster.
As arranged by your OS, a single big file tends to be written to consecutive sectors of the hard drive where possible, but multiple small files may or may not be written that way. So the resulting increased seek time may account for the increased reading time for many small files.
The efficiency of your OS may also play a big part, for example whether it prefetches file contents, how it makes use of buffers, etc. For many small files it's more difficult for the OS to use the buffers (and deal with other issues) efficiently. (Under different scenarios it can behave differently.)
EDIT: As for the copy process you mentioned, generally your OS does it in the following steps:
read data from disk->writing data to buffer->read from buffer->write to (possibly another) disk
This is usually done in multiple threads. When dealing with many small files, the OS may fail to coordinate these threads efficiently (some threads are very busy, while others must wait). For a single large file the OS doesn't have to deal with these issues.
Every file system has a smallest allocation unit (not shareable between files) for storing data, here called a page (often called a block or cluster). Say, for example, your file system has a page size of 4 KB. If you save one file of 8 KB, it will consume 2 pages on the disk. But if you break it into 4 files of 2 KB each, they will consume 4 half-filled pages, using 16 KB of disk space.
Similarly, if you break the file into 8 small files of 1 KB each, they will consume 8 partially filled pages, and 32 KB of disk space is used.
The same is true for the reading overhead. If your data occupies several pages, they might be scattered across the disk, which leads to high seek/access time overhead.
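The space arithmetic above can be reproduced with a few lines. This is a sketch of the rounding rule only: it assumes the 4 KB page size from the example and ignores per-file metadata and any file-system-specific handling of very small files.

```python
import math


def allocated_bytes(file_sizes, page_size=4096):
    """Disk space used when every file is rounded up to whole pages."""
    return sum(math.ceil(size / page_size) * page_size for size in file_sizes)


if __name__ == "__main__":
    KB = 1024
    print(allocated_bytes([8 * KB]) // KB, "KB")      # one 8 KB file
    print(allocated_bytes([2 * KB] * 4) // KB, "KB")  # four 2 KB files
    print(allocated_bytes([1 * KB] * 8) // KB, "KB")  # eight 1 KB files
```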
I have a bunch of files (around 4000), each weighing 1-5K more or less,
all created using the serialization mechanism of Java.
I'd like to compress them and send them over a network as a single file.
(They total for around 200-300MB).
I'm looking for a way to increase the compression / decompression speed, without hurting the file size too much (as it should still be sent over the network and get stored in the server).
Currently using the zip package that comes with Apache Ant.
I read that zip files store metadata for each file, so I guess zip files won't be the best choice here.
So what's preferable?
Gzip / Tar?
Or not compressing at all?
Which java library would you recommend for this case?
Thanks in advance.
Not compressing at all would be fastest, but the resulting file size is the downside.
One reason why tar.gz produces smaller file sizes than zip alone is that gzip gets to work with a bigger buffer of data (the whole tar file), while in your case, zip only gets to work with the data from one file at a time (usually a lot less than the size of the tar file, if there are a lot of files).
So gzip gets to compress an entire book, with its chapters and pages, in one go, while zip compresses each chapter of the book separately and then wraps the compressed chapters up in a book, i.e. a compressed collection of objects is usually smaller than a collection of compressed objects.
To produce a similar result to tar.gz, you can zip up the files in a first pass using the 'store' algorithm (no compression), and then zip up the resulting zip file using the default deflate algorithm.
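The "compressed collection versus collection of compressed objects" effect is easy to measure. The sketch below uses zlib's deflate directly rather than real zip or tar.gz containers, so it shows only the compression effect and not the per-entry metadata; the sample data is made up to resemble many small, similarly structured files.

```python
import zlib


def per_file_size(files):
    """Total size when every file is deflated on its own (roughly what zip does)."""
    return sum(len(zlib.compress(data)) for data in files)


def solid_size(files):
    """Size when all files are concatenated first and deflated once (roughly what tar.gz does)."""
    return len(zlib.compress(b"".join(files)))


if __name__ == "__main__":
    # Hypothetical sample: 4000 small files that share most of their structure.
    files = [
        f"class=com.example.Record;id={i};payload=0123456789abcdef".encode() * 20
        for i in range(4000)
    ]
    print("per-file:", per_file_size(files), "bytes")
    print("solid:   ", solid_size(files), "bytes")
```

For data like this, the single stream can reuse patterns it has already seen in neighbouring files, which the per-file approach cannot do.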
A lot depends on the network that you are using.
If it's over the internet, you might be better off sending (say) 50 zipped-up files rather than one file. If you transfer the data in one file and the file copy fails, you will have to send it all again.
Copying as separate files will allow you to transfer some in parallel and to minimise the risk of a large upload failing.
Another possibility might be switching to another Serialization mechanism. JBoss Serialization is API and functionality compatible, but produces 30% less data.
I'm currently developing a system where the user will end up having large arrays (on Android). However, the JVM is at risk of running out of memory, so to prevent this I was thinking of creating a temporary database and storing the data there. One of my concerns is that the SD card is limited in read and write speed. Another is the overhead of such an operation. Can anyone clear up my concerns and also suggest a good alternative for handling large arrays? (In the end these arrays will be uploaded to a website by writing a CSV file and uploading it.)
Thanks,
Faisal
A couple of thoughts:
You could store them using a DBMS like Derby, which is built into many versions of Java.
You could store them in a compressed output stream that writes to bytes. This would work especially well if the data is easily compressed, i.e. regularly repeating numbers, text, etc.
You could upload portions of the arrays at a time, i.e. as you generate them, begin uploading pieces of the data to the server in chunks.
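The last two suggestions can be sketched together. The example is in Python rather than Java and is only an illustration of the idea: the function names are made up, and the actual upload call is left out because it depends on your server.

```python
import gzip
import io


def compress_rows(rows):
    """Serialize rows as CSV lines into a gzip-compressed byte string held in memory."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for row in rows:
            gz.write((",".join(str(v) for v in row) + "\n").encode("utf-8"))
    return buf.getvalue()


def chunks(rows, size):
    """Yield successive slices so each one can be uploaded and then discarded."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


if __name__ == "__main__":
    data = [[i, i % 10, 42] for i in range(10_000)]
    for piece in chunks(data, 2_500):
        payload = compress_rows(piece)
        # A real application would send `payload` to the server here.
        print(len(piece), "rows ->", len(payload), "compressed bytes")
```

Only one chunk and its compressed form need to be in memory at a time, which is the point of uploading while the data is being generated.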