Our system has a problem with too many files. They are used by a webapp that needs to be available all the time, which means the files cannot be deleted, and there are so many of them that the system (which runs Windows) is slow. We would like to zip up the files and, when a particular file is requested, unzip just that file on demand.
I've tried the Java ZipFile class, but the performance is not good enough, because many people will be using the webapp and requesting files. From my observation, unzipping a file takes between 0.5 and 2 seconds, and when there are too many users the system cannot keep up.
For example, I've used JMeter to simulate 30 users hitting the system with a random delay of 0.3 to 0.6 seconds between requests. There may not actually be that many requests, but I cannot know in advance how many people will use the webapp. Is there any other way to solve this problem?
Thanks in advance!!
P.S. If any 3rd-party library is needed, it must be free!
P.S. The number of files is just too large and it hangs the machine. What we would like to do is zip up every 2000 files into one zip file, so that the number of files drops and hopefully the system stops hanging, and then unzip individual files when they are needed.
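For reference, the on-demand extraction I tried looks roughly like this (the archive and entry names are just placeholders):

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class OnDemandUnzip {
    // Extract one entry from a zip archive when a request for it comes in.
    static void extract(String zipPath, String entryName, String targetPath) throws Exception {
        try (ZipFile zip = new ZipFile(zipPath)) {
            ZipEntry entry = zip.getEntry(entryName);
            try (InputStream in = zip.getInputStream(entry);
                 OutputStream out = Files.newOutputStream(Paths.get(targetPath))) {
                in.transferTo(out); // Java 9+; on older JDKs copy with a byte[] buffer
            }
        }
    }

    public static void main(String[] args) throws Exception {
        extract("batch_0001.zip", "report_123.html", "report_123.html"); // placeholder names
    }
}
```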
Okay, here are some thoughts. It appears to me that your core problem is the slowness of your system and that you're trying to fix it by compressing the files and decompressing them on demand. Then you've found that the decompression is too slow and you need a faster way to do that.
Now I'm not entirely certain why you think this compression will speed things up instead of making things slower.
I would go back to the original problem and work more on solving that. Why is the number of files making your system slow? If you can figure that out, you can fix it in a way that doesn't involve things going even slower.
If it's an issue with too many files in a directory, think about splitting into multiple directories. But I have no idea whether NTFS even has that problem (FAT did). For example, if you have a directory with files for every minute of the last ten years (five million files), you can split them into day directories (three and a half thousand directories with fifteen hundred files in each).
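For instance, here's a sketch of deriving a per-day subdirectory from a timestamp embedded in the file name (the file-name format here is made up for illustration):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class DayBuckets {
    // Turn a flat file name like "20240131-0915.dat" (made-up format) into
    // a path like data/2024/01/31/20240131-0915.dat
    static Path bucketFor(Path root, String fileName) {
        LocalDateTime ts = LocalDateTime.parse(fileName.substring(0, 13),
                DateTimeFormatter.ofPattern("yyyyMMdd-HHmm"));
        return root.resolve(ts.format(DateTimeFormatter.ofPattern("yyyy/MM/dd")))
                   .resolve(fileName);
    }

    public static void main(String[] args) throws Exception {
        Path target = bucketFor(Paths.get("data"), "20240131-0915.dat");
        Files.createDirectories(target.getParent()); // create the day directory if it's missing
        System.out.println(target);                  // data/2024/01/31/20240131-0915.dat
    }
}
```

Moving each existing file into its bucket is then just a Files.move from the old flat path to the computed one.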
Compression won't reduce the number of files, just the space taken by them.
If it's an issue with the number of files on the system (rather than in a directory), there are plenty of ways to split files between systems as well. For example, hive off 10% of the entire file set to each of ten different machines and forward incoming requests for a specific file to the relevant machine.
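A sketch of that request-forwarding idea, assuming ten backend hosts and routing purely by a hash of the file name (the host names are placeholders):

```java
import java.util.List;

public class FileRouter {
    // Placeholder backend hosts, each holding roughly 10% of the files.
    private static final List<String> HOSTS = List.of(
            "files-00", "files-01", "files-02", "files-03", "files-04",
            "files-05", "files-06", "files-07", "files-08", "files-09");

    // Pick the machine responsible for a given file name.
    static String hostFor(String fileName) {
        int bucket = Math.floorMod(fileName.hashCode(), HOSTS.size());
        return HOSTS.get(bucket);
    }

    public static void main(String[] args) {
        System.out.println(hostFor("report_123.html")); // always maps to the same host
    }
}
```

The same hash is used both when a file is stored and when a request arrives, so each lookup goes straight to the machine that holds the file.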
But, I have to say, I've seen Windows machines handle absolute bucket-loads of files so I'd be very surprised if the problem lay there. I think you're probably just going to have to track down what's actually causing your "hangs".
Compressing/uncompressing the files will not make Windows faster.
If zip doesn't provide a performance gain (despite having a native implementation in Java), you can try improving things at the filesystem level. Folders with too many (>10,000) files don't work well under some Windows filesystems, so try to divide the files into several folders, tune the NTFS filesystem (cluster size, space reserved for the filesystem), disable anti-virus, disable indexing, or buy an SLC SSD...
Related
I have an issue with fragmentation on my drive. I have a program that generates over 50,000 files in different folders, and each file grows over time. Each file will end up around 500 MB in size, and I need to read the files fast.
The issue I am facing is that each file ends up spread across the drive, and defragmentation would take over four weeks.
I heard about a filesystem that spreads the files across the drive so that the gap between each file is the same size. I searched the internet for that filesystem but couldn't find anything.
My program is written in Java; maybe there is a way to place the beginning of a file at a specific byte position on the drive.
I would be glad if someone could help me with this issue.
I heard about a filesystem that spreads the files across the drive so that the gap between each file is the same size. I searched the internet for that filesystem but couldn't find anything.
Most likely you did not find it, because it does not exist...
But we do have RAID systems (Redundant Array of Inexpensive Disks), which could ease your pain...
As Timothy said, you can't get to that level by using Java.
I haven't heard of that filesystem either; it doesn't make much sense anyway.
Perhaps, if you are storing text, you could use a NoSQL database (like MongoDB) that stores data in a binary format. You'll probably get good speeds, and the Java driver is easy to use.
Use a Linux filesystem like ext4, where disk fragmentation is very low, but also make sure you have plenty of disk space left; otherwise fragmentation will happen anyway.
I also don't know of a filesystem that does this. However, I have some info that may help:
If you used an SSD, fragmentation would be much less of a concern for read performance. SSDs store data in chunks (NAND flash pages, e.g. 16 KB) that are always physically scattered due to the wear-levelling algorithm, which is very unlike how hard disks work in practice, and pages on an SSD are accessed in a highly parallel fashion. As a result, fragmentation has far less impact on read performance with an SSD, though it would still carry some penalty for writes and deletions.
RAID would also allow for higher performance on reads as Timothy mentions.
I am using the Hadoop example program WordCount to process a large set of small files/web pages (roughly 2-3 kB each). Since this is far from the optimal file size for Hadoop, the program is very slow. I guess the cost of setting up and tearing down each task is far greater than the work itself. Such small files also exhaust the namenode's namespace for file names.
I read that in this case I should use HDFS archives (HAR), but I am not sure how to modify the WordCount program to read from these archives. Can the program continue to work without modification, or is some modification necessary?
Even if I pack a lot of files into archives, the question remains whether this will improve performance. I read that even if I pack multiple files into one archive, the files inside it will not be processed by one mapper but by many, which in my case (I guess) will not improve performance.
If this question is too simple, please understand that I am a newbie to Hadoop and have very little experience with it.
Using HDFS won't change the fact that you are making Hadoop handle a large quantity of small files. The best option in this case is probably to cat the files into a single file (or a few large files).
This will reduce the number of mappers you have, which will reduce the number of things required to be processed.
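A minimal sketch of that concatenation step, assuming the small files already sit in an HDFS directory (the paths here are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.InputStream;

public class ConcatSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path("/data/pages");          // directory full of small files (illustrative)
        Path merged = new Path("/data/pages-merged.txt"); // single large output file

        try (FSDataOutputStream out = fs.create(merged, true)) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (!status.isFile()) continue;
                try (InputStream in = fs.open(status.getPath())) {
                    // append the small file's bytes to the end of the big file
                    IOUtils.copyBytes(in, out, conf, false);
                }
                out.write('\n'); // keep a line boundary between files for WordCount
            }
        }
    }
}
```

After that, you point WordCount at the merged file instead of the original directory.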
Using HDFS can improve performance if you are operating on a distributed system. If you are only running in pseudo-distributed mode (one machine), then HDFS isn't going to improve performance; the limitation is the machine.
When you are operating on a large number of small files, you need a large number of mappers and reducers. The setup/teardown can be comparable to the processing time of the file itself, causing a large overhead. Concatenating the files should reduce the number of mappers Hadoop runs for the job, which should improve performance.
The benefit you could see from using HDFS to store the files would be in distributed mode, with multiple machines. The files would be stored in blocks (64 MB by default) across machines, and each machine would be able to process the blocks of data that reside on it. This reduces network bandwidth use so it doesn't become a bottleneck in processing.
Archiving the files, if Hadoop is just going to unarchive them, will still leave Hadoop with a large number of small files.
Hope this helps your understanding.
From my still limited understanding of Hadoop, I believe the right solution would be to create SequenceFile(s) containing your HTML files as values, and possibly the URLs as the keys. If you run an M/R job over the SequenceFile(s), each mapper will process many files (depending on the split size). Each file will be presented to the map function as a single input.
You may want to use SequenceFileAsTextInputFormat as the InputFormat to read these files.
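A minimal sketch of packing the pages into a SequenceFile, assuming they sit in a local directory and using the file name as the key (both of those are just assumptions for illustration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class PackPagesIntoSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path output = new Path("pages.seq"); // illustrative output path

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(Text.class))) {

            for (File page : new File("pages").listFiles()) { // local directory of small HTML files
                String html = new String(Files.readAllBytes(page.toPath()), StandardCharsets.UTF_8);
                // key = something identifying the page (here just the file name), value = whole page
                writer.append(new Text(page.getName()), new Text(html));
            }
        }
    }
}
```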
Also see: Providing several non-textual files to a single map in Hadoop MapReduce
I bookmarked this article recently to read later and found the same question here :) The entry is a bit old, so I'm not exactly sure how relevant it still is; changes to Hadoop are happening at a very rapid pace.
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
The blog entry is by Tom White, who is also the author of "Hadoop: The Definitive Guide, Second Edition", a recommended read for those who are getting started with Hadoop.
http://oreilly.com/catalog/0636920010388
Can you concatenate files before submitting them to Hadoop?
CombineFileInputFormat can be used in this case; it works well for a large number of small files. It packs many such files into a single split, so each mapper has more to process (1 split = 1 map task).
The overall MapReduce processing time will also fall, since fewer mappers are running.
Since there is no archive-aware InputFormat, using CombineFileInputFormat will improve performance.
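For illustration, a sketch of the job setup using Hadoop's CombineTextInputFormat (a concrete, text-oriented subclass of CombineFileInputFormat); the split size and the mapper/reducer placeholders are just examples:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombinedSplits {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-combined");
        job.setJarByClass(WordCountWithCombinedSplits.class);

        // Pack many small files into each split instead of one split per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024); // ~128 MB per split (example)

        // The usual WordCount mapper/reducer classes would be set here, e.g.:
        // job.setMapperClass(TokenizerMapper.class);
        // job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```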
Using log4j on Unix, which appender would perform best when writing 1000 MB of logs:
1) Using a RollingFileAppender to write 10 files of 100 MB each
or
2) Using a FileAppender to write a single 1000 MB file
In other words, using Java on Unix, does the file size matter?
Thank you
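For concreteness, option 1) corresponds to something like this log4j 1.x setup, shown programmatically here just to be explicit (the file name, sizes and layout pattern are placeholders):

```java
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;
import org.apache.log4j.RollingFileAppender;

public class RollingSetup {
    public static void main(String[] args) throws Exception {
        // Option 1: roll over at 100 MB, keep 9 backups -> at most ~1000 MB on disk.
        RollingFileAppender appender = new RollingFileAppender(
                new PatternLayout("%d %-5p %c - %m%n"), "app.log");
        appender.setMaxFileSize("100MB");
        appender.setMaxBackupIndex(9);
        Logger.getRootLogger().addAppender(appender);

        Logger.getLogger(RollingSetup.class).info("logging goes to app.log, app.log.1, ...");
    }
}
```

Option 2) would be the same with a plain FileAppender and no size limit or backups.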
There is no Java-side performance difference between writing to a small file and writing to a large file. There might be a small difference at the OS level when a file gets big enough that an extra level of index blocks is required (filesystem dependent), but it is probably not worth worrying about.
There will be a performance cost in implementing the file rolling behavior. The appender has to:
test / remember how big the file is,
close the current one,
rename it,
open a new file.
My gut feeling is that this is not likely to be significant. (However, it would be worth measuring to see if the performance impact should be a concern. Also, you should probably ask yourself if you are not doing too much logging.)
You have to compare all of the above against the advantages of file rolling:
Having a bounded size on log files means that your logging won't fill the disk, causing problems for the application and potentially others on the same machine.
Smaller log files can make it easier / quicker to search for events at specific times. (Running less on a 1000 MB file can be painful ...)
They'll both easily write 1000MB files. I don't see why they should perform differently.
You do need the RollingFileAppender, though, in order to cap the total size that the log file(s) can reach. Otherwise you may run out of hard disk space, assuming that your application gets traffic.
I think it's always preferable to use small files rather than large files, because they are more manageable. Also, consider that with large files you may run into trouble when the filesystem fills up, because you risk having to remove the log file to free up disk space while the process is up and running.
Just wondering if it is generally a good idea to compress jar files that will be shipped with a desktop application (no network access to the jars), or whether the decompression will have a bigger impact than the file I/O.
EDIT: Thanks for the answers so far, and sorry for being a bit unclear. I was not asking about shipping the jars to the customer, but about the optimal format for the jar files on disk when the app starts up. I know that jar files are zip files and can be built with different compression levels (or no compression at all), and I was wondering specifically how compression would affect startup performance, not only on my dev box (which has a fast SSD in it) but also on slower disks.
I expect that the answer depends on your application. However, it should be easy to determine experimentally if compressed JARs give faster or slower startup for your application. Just build your application JAR file with compression on and compression off, and compare the application startup times. (Try it on different machines; e.g. with slow discs, fast discs, SSD, and with different amounts of RAM. Bear in mind that some OSes cache files aggressively, and take this into account in your timing measurements.)
While you are at it, you should also investigate the impact of different compression levels (via the jar command options) and of using pack200.
Having said that, my gut feeling is that the difference between compressed and uncompressed for locally installed JARs will be small enough that the user will hardly notice the difference.
In almost any reasonable desktop situation, the cost of disk IO is way higher than the cost of compression. It'll almost certainly be a win to compress files.
That said, a JAR file is already compressed. Doubly compressing things is generally not worth the effort. So I'd say no, don't compress your JAR files as they are already compressed.
Let's say you have a game server that creates text log files of the players' actions, and from time to time you need to look something up in those log files (like investigating a scam or a lost item). Just as an example, you have 100 files, and each file is between 20 MB and 50 MB in size. How would you search them quickly?
What I have already tried is creating several threads, where each thread maps its own file into memory (let's say memory is not a problem as long as it doesn't exceed 500 MB of RAM) and performs the search there. The result was around 1 second per file:
File:a26.log - read in: 0.891, lines: 625282, matches: 78848
Is there a better way to do this? It seems kind of slow to me.
Thanks.
(Java was used for this.)
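Roughly, each worker thread does something like this (the keyword and file name are just placeholders):

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedLogSearch {

    // Count occurrences of an ASCII keyword in one log file via a memory mapping.
    static int countMatches(Path logFile, String keyword) throws Exception {
        byte[] needle = keyword.getBytes(StandardCharsets.US_ASCII);
        try (FileChannel channel = FileChannel.open(logFile, StandardOpenOption.READ)) {
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            int matches = 0;
            for (int i = 0; i <= buf.limit() - needle.length; i++) {
                int j = 0;
                while (j < needle.length && buf.get(i + j) == needle[j]) {
                    j++;
                }
                if (j == needle.length) {
                    matches++;
                }
            }
            return matches;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countMatches(Paths.get("a26.log"), "item_dropped")); // placeholder keyword
    }
}
```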
Tim Bray was investigating approaches to process Apache log files here: http://www.tbray.org/ongoing/When/200x/2007/09/20/Wide-Finder
Seems like there may be a lot in common with your situation.
You can use combinations of the Unix find and grep commands.
For ad-hoc searching of large text files, I would use the UNIX grep, fgrep or egrep utilities. They have been around a long time, and have had the benefit of many people working on them to make them fast.
On the other hand, the ultimate bottleneck in searching text files (that haven't been previously indexed) will be the speed at which the application + operating system can move data from a disk file into memory. You seem to be managing 20 MB or more per second, which seems reasonably fast ... to me.
I should probably have mentioned in the first post that the game server runs on 64-bit Windows, and I'm wondering whether grep for Windows performs at the same level as grep on Unix.
Of course there is a better way: you index the contents before searching. The way you index depends on how you want to search the logs, but in general, you might do well using Lucene (or Solr, if the log entries can easily be restructured into XML documents).
The amount of performance and resource use optimization put into tools like the above should give you orders of magnitude better performance than an ad-hoc solution.
This is all assuming you search each file many times. If this is not the case, you might as well grep the files and be done with it.
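A minimal sketch of what indexing the log lines with Lucene could look like (field names, paths and the example query are made up, and the exact API varies a little between Lucene versions; treat this as an outline rather than drop-in code):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LogIndexer {
    public static void main(String[] args) throws Exception {
        Directory indexDir = FSDirectory.open(Paths.get("log-index")); // where the index lives
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // 1. Index every line of every log file once, up front.
        try (IndexWriter writer = new IndexWriter(indexDir, new IndexWriterConfig(analyzer));
             DirectoryStream<Path> logs = Files.newDirectoryStream(Paths.get("logs"))) {
            for (Path log : logs) {
                for (String line : Files.readAllLines(log)) {
                    Document doc = new Document();
                    doc.add(new StringField("file", log.getFileName().toString(), Field.Store.YES));
                    doc.add(new TextField("line", line, Field.Store.YES));
                    writer.addDocument(doc);
                }
            }
        }

        // 2. Later searches hit the index instead of re-reading every file.
        try (DirectoryReader reader = DirectoryReader.open(indexDir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("line", analyzer).parse("trade AND sword"); // example query
            for (ScoreDoc hit : searcher.search(query, 20).scoreDocs) {
                Document doc = searcher.doc(hit.doc);
                System.out.println(doc.get("file") + ": " + doc.get("line"));
            }
        }
    }
}
```

The indexing pass is paid once (or incrementally as new log lines arrive); after that each search is an index lookup rather than a full scan of every file.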