Let's say you have a game server that creates text log files of players' actions, and from time to time you need to look something up in those log files (like investigating a scam or a lost item). Just as an example, you have 100 files and each file is between 20MB and 50MB in size - how would you search them quickly?
What I have already tried is creating several threads, each of which maps its own file into memory (let's say memory should not be a problem as long as it doesn't exceed 500MB of RAM) and performs the search there. The result was around 1 second per file:
File:a26.log - read in: 0.891, lines: 625282, matches: 78848
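Roughly, the per-file search looks like this (a minimal sketch of what I described above, not my exact code; the file name, charset and search term are just placeholders, and in practice one such call runs per worker thread):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class MappedLogSearch {
    // Counts occurrences of a term in one log file using a memory-mapped buffer.
    static int countMatches(String path, String term) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r");
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte[] data = new byte[buf.remaining()];
            buf.get(data);
            // Decode once and scan; fine for 20-50MB files, simplified for the sketch.
            String text = new String(data, StandardCharsets.UTF_8);
            int matches = 0;
            for (int i = text.indexOf(term); i >= 0; i = text.indexOf(term, i + term.length())) {
                matches++;
            }
            return matches;
        }
    }

    public static void main(String[] args) throws Exception {
        // "a26.log" and "gold" are placeholders for a real log file and search term.
        System.out.println("matches: " + countMatches("a26.log", "gold"));
    }
}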
Is there a better way to do that? It seems kind of slow to me.
Thanks.
(Java was used in this case.)
Tim Bray was investigating approaches to process Apache log files here: http://www.tbray.org/ongoing/When/200x/2007/09/20/Wide-Finder
Seems like there may be a lot in common with your situation.
You can use combinations of Unix commands such as find and grep.
For ad-hoc searching of large text files, I would use the UNIX grep, fgrep or egrep utilities. They have been around a long time, and have had the benefit of many people working on them to make them fast.
On the other hand, the ultimate bottleneck in searching text files (that haven't been previously indexed) will be the speed at which the application plus the operating system can move data from a disk file into memory. You seem to be managing 20MB or more per second, which seems reasonably fast ... to me.
I should probably have mentioned in the first post that the game server runs on 64-bit Windows - and I'm wondering whether grep performs at the same level on Windows as it does on Unix.
Of course there is a better way: you index the contents before searching. The way you index depends on how you want to search the logs, but in general you might do well using Lucene (or Solr, if the log entries can easily be restructured into XML documents).
The amount of performance and resource use optimization put into tools like the above should give you orders of magnitude better performance than an ad-hoc solution.
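If you go the Lucene route, a minimal indexing-and-search sketch might look like the following. This is only an illustration of the idea, not a drop-in solution: the field names, analyzer, index location and query are assumptions, and the exact API varies between Lucene versions (this assumes a reasonably recent release).

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class LogIndexer {
    public static void main(String[] args) throws Exception {
        Path indexDir = Paths.get("log-index");       // assumed location for the index
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index: one Lucene document per log line, keeping the file name for reference.
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
                                                  new IndexWriterConfig(analyzer));
             Stream<String> lines = Files.lines(Paths.get("a26.log"))) {  // placeholder file
            lines.forEach(line -> {
                try {
                    Document doc = new Document();
                    doc.add(new StringField("file", "a26.log", Field.Store.YES));
                    doc.add(new TextField("line", line, Field.Store.YES));
                    writer.addDocument(doc);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
        }

        // Search: find lines mentioning a player name ("playerName" is a placeholder query).
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(indexDir))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new QueryParser("line", analyzer).parse("playerName"), 100);
            System.out.println("total hits: " + hits.totalHits);
        }
    }
}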
This is all assuming you search each file many times. If this is not the case, you might as well grep the files and be done with it.
Related
I'm working at a well-known company on a project that should integrate with another system that produces one 27GB CSV file per hour. The goal is to query these files without importing them (the main problem is bureaucracy; nobody wants responsibility if some data changes).
The main filters on these files are by date; the end user must enter a start/end date range. After that, the results can be filtered by a few strings.
Context: Spring Boot microservices
Server: Xeon processor, 24 cores, 256GB RAM
Filesystem: NFS mounted from an external server
Test data: 1000 files, each 1GB
To improve performance I'm indexing the files by date: each file name carries the date range it contains, and the files sit in a folder structure like yyyy/mm/dd. For each of the following tests, the first step was to build the raw list of file paths to be read.
The search reads all the files:
Spring Batch - buffered reader, parsing into objects: 12,097 sec
Plain Java - thread pool, buffered reader, parsing into objects: 10,882 sec
Linux egrep with a regex and parallel, run from Java, parsing into objects: 7,701 sec
The dirtiest is also the fastest. I want to avoid it because the security department warned me about all the checks needed on the input data to prevent shell injection.
Googling, I found the MariaDB CONNECT engine, which can also point at huge CSVs, so I'm now going down this path, creating a temporary table over the files a given search is interested in; the bad part is that I have to create one table per query, since the dates can differ.
For the first year we're expecting no more than 5 parallel searches at the same time, with an average range of 3 weeks. These queries will be run asynchronously.
Do you know of anything that can help me with this? Not only for speed, but also good practices to apply.
Thanks a lot folks.
To answer your question:
No. There are no best practices. And, AFAIK there are no generally applicable "good" practices.
But I do have some general advice. If you allow considerations such as bureaucracy and (to a lesser extent) security edicts to dictate your technical solutions, then you are liable to end up with substandard solutions; i.e. solutions that are slow or costly to run and keep running. (If "they" want it to be fast, then "they" shouldn't put impediments in your way.)
I don't think we can give you an easy solution to your problem, but I can say some things about your analysis.
You said about the grep solution:
"I want to avoid it because the security department warned me about all the checks needed on the input data to prevent shell injection."
The solution to that concern is simple: don't use an intermediate shell. The dangerous injection attacks will be via shell trickery rather than grep itself. Java's ProcessBuilder doesn't use a shell unless you explicitly ask it to. The grep program itself can only read the files that are specified in its arguments and write to standard output and standard error.
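For example, something along these lines runs grep directly, with no shell in between, so a user-supplied search string is passed to grep as an ordinary argument rather than being interpreted by a shell (the pattern and file path below are purely illustrative):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class GrepRunner {
    // Runs grep directly (no shell), so the user-supplied pattern is just an argument,
    // not something a shell could reinterpret.
    static List<String> grep(String userPattern, String filePath) throws Exception {
        ProcessBuilder pb = new ProcessBuilder("grep", "-F", "--", userPattern, filePath);
        pb.redirectErrorStream(true);
        Process p = pb.start();
        List<String> matches = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                matches.add(line);
            }
        }
        p.waitFor();
        return matches;
    }

    public static void main(String[] args) throws Exception {
        // The date string and path are illustrative.
        grep("2020-01-15", "/data/2020/01/15/part-00.csv").forEach(System.out::println);
    }
}

The -F flag treats the pattern as a fixed string and "--" ends option parsing, which closes off the remaining ways a hostile input could change grep's behaviour.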
You said about the general architecture:
"The target is query these files without import them (the main problem is bureaucracy, nobody want responsibility if some data change)."
I don't understand the objection here. We know that the CSV files are going to change. You are getting a new 27GB CSV file every hour!
If the objection is that the format of the CSV files is going to change, well, that affects your ability to effectively query them. But with a little ingenuity, you could detect the change in format and adjust the ingestion process on the fly.
"We're expecting not more than 5 parallel researches in same time, with an average of 3 weeks of range."
If you haven't done this already, you need to do some analysis to see whether your proposed solution is going to be viable. Estimate how much CSV data needs to be scanned to satisfy a typical query. Multiply that by the number of queries to be performed in (say) 24 hours. Then compare that against your NFS server's ability to satisfy bulk reads. Then redo the calculation assuming a given number of queries running in parallel.
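To make that concrete with purely illustrative numbers (the 500MB/s NFS figure is an assumption, not a measurement): a 3-week range at one 27GB file per hour is about 21 × 24 × 27GB ≈ 13.6TB per query. At a sustained 500MB/s from the NFS server, a single full scan of that range takes roughly 7.5 hours, so five such queries per day simply don't fit unless the date-based folder structure lets you read far less data per query, or the aggregate throughput is much higher.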
Consider what happens if your (above) expectations are wrong. You only need a couple of "idiot" users doing unreasonable things ...
Having a 24-core server for doing the queries is one thing, but the NFS server also needs to be able to supply the data fast enough. You can improve things with NFS tuning (e.g. by tuning block sizes, the number of NFS daemons, using FS-Cache), but the ultimate bottlenecks will be getting the data off the NFS server's disks and across the network to your server. Bear in mind that there could be other servers "hammering" the NFS server while your application is doing its thing.
I'm testing data structure performance with very large data.
As a temporary workaround (see here) I want to write memory to disk.
I want to test with very big datasets - how can I make it so that when the Java VM runs out of memory, it writes some of it to disk?
Since we're talking about temporary fixes here, you could always increase your page file if you need a little extra space (a swap file in most Linux distros).
Here's a link from Microsoft:
http://windows.microsoft.com/en-us/windows-vista/change-the-size-of-virtual-memory
Linux:
http://www.cyberciti.biz/faq/linux-add-a-swap-file-howto/
Now let me say that this isn't a good long-term fix, but I understand that sometimes developers just need to make it work. If this is something that will ever see a production environment, you may want to look at a tool like Hadoop. It allows you to distribute your data processing over multiple JVMs; it's a tool built for "big data" applications like the one you're describing.
Maybe you can use a stream, or a buffered one. I think that would be the best choice for testing such a structure. If you can read from disk using a stream that doesn't create any additional objects (only those that are necessary), you can keep almost all of the JVM memory for your structure. But maybe you can describe your problem in more detail?
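A minimal sketch of that idea, reading the file lazily so only the current line is held on the heap (the path and the processing step are placeholders):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class StreamRead {
    public static void main(String[] args) throws IOException {
        // Reads the file lazily: only one line is materialized at a time,
        // leaving the rest of the heap free for the data structure under test.
        try (Stream<String> lines = Files.lines(Paths.get("big-dataset.txt"))) {  // placeholder path
            long count = lines.filter(line -> !line.isEmpty())   // placeholder processing
                              .count();
            System.out.println("non-empty lines: " + count);
        }
    }
}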
I am using the Hadoop example program WordCount to process a large set of small files/web pages (ca. 2-3 kB). Since this is far from the optimal file size for Hadoop, the program is very slow. I guess it is because the cost of setting up and tearing down the job is far greater than the job itself. Such small files also exhaust the namespace for file names.
I read that in this case I should use a Hadoop archive (HAR), but I am not sure how to modify the WordCount program to read from these archives. Can the program continue to work without modification, or is some modification necessary?
Even if I pack a lot of files into archives, the question remains whether this will improve performance. I read that even if I pack multiple files, the files inside one archive will not be processed by one mapper but by many, which in my case (I guess) will not improve performance.
If this question is too simple, please understand that I am a newbie to Hadoop and have very little experience with it.
Using HDFS won't change the fact that you are making Hadoop handle a large number of small files. The best option in this case is probably to cat the files into a single file (or a few large files).
This will reduce the number of mappers you have, which will reduce the number of things required to be processed.
Using HDFS can improve performance if you are operating on a distributed system. If you are only running pseudo-distributed (one machine), then HDFS isn't going to improve performance; the limitation is the machine.
When you are operating on a large number of small files, a large number of mappers and reducers is required. The setup/teardown can be comparable to the processing time of the file itself, causing large overhead. Concatenating the files should reduce the number of mappers Hadoop runs for the job, which should improve performance.
The benefit you could see from using the HDFS to store the files would be in distributed mode, with multiple machines. The files would be stored in blocks (default 64MB) across machines and each machine would be capable of processing a block of data that resides on the machine. This reduces network bandwidth use so it doesn't become a bottleneck in processing.
Archiving the files, if Hadoop is just going to unarchive them anyway, will still leave Hadoop with a large number of small files.
Hope this helps your understanding.
From my still limited understanding of Hadoop, I believe the right solution would be to create SequenceFile(s) containing your HTML files as values and possibly the URL as the key. If you do an M/R job over the SequenceFile(s), each mapper will process many files (depending on the split size). Each file will be presented to the map function as a single input.
You may want to use SequenceFileAsTextInputFormat as the InputFormat to read these files.
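A rough sketch of building such a SequenceFile is below. The input and output paths are placeholders, Text is used for both key and value, and it uses the classic SequenceFile.createWriter(FileSystem, Configuration, ...) overload, which newer Hadoop versions deprecate in favour of the Writer.Option variant:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class HtmlToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("pages.seq");              // placeholder output path

        File[] pages = new File("html-pages").listFiles();  // placeholder input directory
        if (pages == null) {
            return;  // directory missing or unreadable
        }

        // One record per small file: key = file name (or URL), value = file contents.
        try (SequenceFile.Writer writer =
                 SequenceFile.createWriter(fs, conf, out, Text.class, Text.class)) {
            for (File page : pages) {
                String content = new String(Files.readAllBytes(page.toPath()), StandardCharsets.UTF_8);
                writer.append(new Text(page.getName()), new Text(content));
            }
        }
    }
}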
Also see: Providing several non-textual files to a single map in Hadoop MapReduce
I bookmarked this article recently to read it later and found the same question here :) The entry is a bit old, not exactly sure how relevant it is now. The changes to Hadoop are happening at a very rapid pace.
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
The blog entry is by Tom White, who is also the author of "Hadoop: The Definitive Guide, Second Edition", a recommended read for those who are getting started with Hadoop.
http://oreilly.com/catalog/0636920010388
Can you concatenate files before submitting them to Hadoop?
CombineFileInputFormat can be used in this case; it works well for a large number of small files. It packs many such files into a single split, so each mapper has more to process (1 split = 1 map task).
The overall MapReduce processing time will also fall, since fewer mappers are running.
Since there is no archive-aware InputFormat, using CombineFileInputFormat will improve performance.
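As a sketch, the standard WordCount driver mainly needs the input format swapped and a maximum split size set. Everything below follows the stock example, and the 128MB split size is just an illustrative value to tune for your cluster:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineWordCount {

    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine word count");
        job.setJarByClass(CombineWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The key change versus plain WordCount: pack many small files into each split.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024); // ~128MB, tune as needed

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}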
Our system has a problem with too many files. It is used in a webapp that needs to be available all the time, which means the files cannot be deleted, and there are so many of them that they make the system (which runs Windows) slow. We would like to zip up the files and, when a file is requested, unzip that particular file.
I've tried the Java ZipFile class, and the performance is not good enough, because many people will be using the webapp and requesting the files. From my observation, unzipping takes between 0.5 and 2 seconds, and when there are too many users the system cannot keep up.
For example, I've used JMeter to simulate a situation where 30 users use the system, with a random delay between 0.3 and 0.6 seconds. Although there may not be that many requests in practice, I cannot know in advance how many people will use the webapp. I would like to ask: is there any other method to solve this problem?
Thanks in advance!!
P.S. If any 3rd-party library is needed, it must be free!
P.S. Because the number of files is just too large, and it hangs the machine, we would like to do this: zip up 2000 files into one zip file so that the number of files decreases and hopefully the system won't hang anymore, and when needed, unzip a particular file out.
Okay, here's some thoughts. It appears to me that your core problem is the slowness of your system and that you're trying to fix it by compressing the files and decompressing them on demand. Then you've found that the decompression is too slow and you need a faster way to do that.
Now I'm not entirely certain why you think this compression will speed things up instead of making things slower.
I would go back to the original problem and work more on solving that. Why is the number of files making your system slow? If you can figure that out, you can fix it in a way that doesn't involve things going even slower.
If it's an issue with too many files in a directory, think about splitting into multiple directories. But I have no idea whether NTFS even has that problem (FAT did). For example, if you have a directory with files for every minute of the last ten years (five million files), you can split them into day directories (three and a half thousand directories with fifteen hundred files in each).
Compression won't reduce the number of files, just the space taken by them.
If it's an issue with the number of files on the system (rather than in a directory), there are plenty of ways to split files between systems as well. For example, hive off 10% of the entire file set to each of ten different machines and forward incoming requests for a specific file to the relevant machine.
But, I have to say, I've seen Windows machines handle absolute bucket-loads of files so I'd be very surprised if the problem lay there. I think you're probably just going to have to track down what's actually causing your "hangs".
Compressing/uncompressing the files will not make Windows faster.
If zip doesn't provide a performance gain (despite having a native implementation in Java), you can try to improve things at the filesystem level. Folders with too many (>10,000) files don't work well under some Windows filesystems, so try to divide the files into several folders, tune the NTFS filesystem (cluster size, reserved space for the filesystem), disable antivirus, disable indexing, buy an SLC SSD...
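As a sketch of the folder-splitting idea (the base directory, bucket count and file locations are all assumptions): hash each file name into one of, say, 256 sub-folders so that no single directory ends up with an enormous number of entries.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FolderSplitter {
    // Spreads files across 256 sub-folders based on a hash of the file name,
    // so no single directory holds an enormous number of entries.
    static Path bucketFor(Path baseDir, String fileName) {
        int bucket = Math.abs(fileName.hashCode() % 256);
        return baseDir.resolve(String.format("%02x", bucket)).resolve(fileName);
    }

    public static void main(String[] args) throws IOException {
        Path base = Paths.get("files");               // placeholder base directory
        Path source = Paths.get("incoming/a26.log");  // placeholder file to place
        Path target = bucketFor(base, source.getFileName().toString());
        Files.createDirectories(target.getParent());
        Files.move(source, target);
        System.out.println("moved to " + target);
    }
}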
I am writing a servlet which will examine a directory on the server (external to the web container), and recursively search for certain files (by certain files, I mean files that are of a certain extension as well as a certain naming convention). Once these files are found, the servlet responds with a long list of all of the found files (including the full path to the files). My problem is that there are so many files and directories that my servlet goes extremely slow. I was wondering if there was a best practice or existing servlet for this type of problem? Would it be more efficient to simply compile the entire list of files and do the filtering via js/jquery on the client side?
Disk access is slow and as the number of files and directories increases, you'll rapidly reach a point where your servlet will be useless when using the conventional recursive search through the directory tree. You'll especially reach this limit quickly if you have a large number of concurrent users performing the same search at the same time.
Instead, it's much better to use an external batch job to generate the list of files, which can then be read into the servlet through a database call or even by just parsing a file containing all the file names separated by newline characters. Using "find" on Linux is a simple way to do this, e.g.:
find <path_to_directory> -name '*.bin' > list_files.txt
This would list every file name that ends with .bin in a particular directory and output it into a file named list_files.txt. Your servlet could then read in that file and create the list of files from there.
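Reading that pre-built list back inside the servlet is then trivial; a minimal sketch (the list location and the naming-convention filter are placeholders):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class FileListLoader {
    // Loads the list produced by the external "find" job; one path per line.
    static List<String> loadMatchingFiles(String listFile, String namePattern) throws IOException {
        return Files.readAllLines(Paths.get(listFile)).stream()
                    .filter(path -> path.contains(namePattern))  // placeholder naming-convention check
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // "list_files.txt" comes from the find command above; "report_" is an illustrative convention.
        loadMatchingFiles("list_files.txt", "report_").forEach(System.out::println);
    }
}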
If you really have loads of files, you might think about spawning an external process to do the searching. If you're running on a unix-like server (like linux), you might get speed gains by having the "find" command do the searching for you, and parse its output.
You can google for many examples of how to use "find".
I see two possible reasons why this process might be going slowly:
1) Disk I/O is taking too long. This'll be a real constraint that you can't do much about. Usually the operating system is pretty good at keeping structures in memory that allow it to find files in your folders much quicker. If it is too slow regardless, you might have to build an index yourself in memory. This all depends on how you're doing it.
In any case, if this is the issue (you can try measuring), then there's no way doing the filtering client side will help, as that shouldn't really take very long, no matter where you do it. Instead you're going to make the client slower by sending it more data to sort through.
2) There's something wrong with your directory traversal. You say it's "recursive". If you mean it's actually recursive, i.e. a method that calls itself whenever it encounters a new directory, then that might well be slowing you down (the overhead really adds up). There's some stuff about tree traversal on wikipedia, but basically just use a queue or stack to keep track of where you are in the traversal, instead of using your method state to do so.
Note that a file system isn't actually a tree, but I'm assuming that it is in this case. It gets a bit hairier otherwise.
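As a rough sketch of that iterative traversal (the root directory and the extension check are placeholders for your real naming convention):

import java.io.File;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class IterativeFileSearch {
    // Walks the directory tree with an explicit work queue instead of recursive calls.
    static List<File> findFiles(File root, String extension) {
        List<File> result = new ArrayList<>();
        Deque<File> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            File current = pending.pop();
            File[] children = current.listFiles();
            if (children == null) {
                continue;  // not a directory, or not readable
            }
            for (File child : children) {
                if (child.isDirectory()) {
                    pending.push(child);
                } else if (child.getName().endsWith(extension)) {
                    result.add(child);  // plug in the real naming-convention check here
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "/data/files" and ".bin" are placeholders.
        findFiles(new File("/data/files"), ".bin").forEach(f -> System.out.println(f.getAbsolutePath()));
    }
}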
I don't agree with the other posters that you can't implement it in-process. It should work pretty well up to a certain point, no need for batch jobs just yet.
I think your servlet is slow because of hard drive speed. If the list of files is fairly static, you should load it into memory.