I have decided, as a side project, to try to write a sorting method for large files. Unfortunately, I only have my 4-core laptop available for this research right now. For data, I am only using characters for each record. A typical record looks like this:
AAAAM_EL,QMOIXYGB,LAD_HNTU,BYFKKHWY,AVVCIXMC,KWVGCIUB,YWD_LQNU,HDTKUFK_,W_E_LT_M,MW_HEQKE,VHEDHK_U,SAIUAVGH,DQTSMK_L,RNUBFKUX,OXEVMHNR,EMEEJHJB,BKYQWYAP,MKMWKAAT,MIAEDTDY,RANAGVOM
All the fields are randomly generated; however, I am sorting using the complete record as the key. A file containing 1 million records is about 181 million bytes. I have noticed the following on my laptop:
Using a Unix shell and executing the Unix sort command against the file, it takes approximately 15 to 22 seconds to sort and write the file back to disk as another file.
I tried using the sort command with its --parallel option, but that would not work in my Windows bash environment.
Using a quicksort algorithm I implemented in Java, it takes 3 seconds to read the file into memory, sort it, and write it back out to a new file.
An experimental multi-threaded Java application that I implemented takes about as long as the Unix sort command did.
Does anyone have some reliable approximate times for sorting a file of this size? I plan on sorting much larger files once I have studied the multi-threaded approach I currently have; I am sure it needs a lot of improvement. However, I need some good target times to try to achieve. Does anyone know of such target times, any examples on the net, or any sorting research papers that would give me a hint as to how much time it should take?
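For reference, the single-threaded version boils down to something like the sketch below (file names are placeholders, and Collections.sort stands in for my hand-written quicksort):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

public class WholeRecordSort {
    public static void main(String[] args) throws IOException {
        Path in = Paths.get("records.txt");      // placeholder input file (1M records, ~181 MB)
        Path out = Paths.get("records.sorted");  // placeholder output file
        // Read every record into memory, sort on the complete record as the key, write back out.
        List<String> records = Files.readAllLines(in, StandardCharsets.UTF_8);
        Collections.sort(records);               // stands in for the hand-written quicksort
        Files.write(out, records, StandardCharsets.UTF_8);
    }
}

A multi-threaded variant could, for example, copy the records into an array and use Arrays.parallelSort instead.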
I'm working for a well-known company on a project that should integrate with another system that produces one 27 GB CSV file per hour. The target is to query these files without importing them (the main problem is bureaucracy; nobody wants responsibility if some data changes).
The main filters on these files are by date: the end user must enter a start-end date range. After that, the results can be filtered by a few strings.
Context: Spring Boot microservices
Server: 24-core Xeon processor, 256 GB RAM
Filesystem: NFS mounted from an external server
Test data: 1000 files, each one 1 GB
To improve performance, I'm indexing the files by date: each file name records the date range it contains, and the files are arranged in a yyyy/mm/dd folder structure. For each of the following tests, the first step was to build a raw list of the file paths to be read.
Each research (query) reads all the files:
Spring Batch - buffered reader and parse into objects: 12,097 sec
Plain Java - thread pool, buffered reader and parse into objects: 10,882 sec
Linux egrep with regex and parallel, run from Java, then parse into objects: 7,701 sec
The dirtiest is also the fastest. I want to avoid it because the security department warned me about all the checks I would have to make on input data to prevent shell injection.
Googling, I found the MariaDB CONNECT engine, which can also point at huge CSVs, so now I'm going down this road, creating a temporary table over the files a research is interested in; the bad part is that I have to create one table per query, since the dates can be different.
For the first year we're expecting no more than 5 parallel researches at the same time, with an average range of 3 weeks. These queries will be executed asynchronously.
Do you know of anything that can help me with this? Not only for speed, but also good practices to apply.
Thanks a lot folks.
To answer your question:
No. There are no best practices. And, AFAIK there are no generally applicable "good" practices.
But I do have some general advice. If you allow considerations such as bureaucracy and (to a lesser extent) security edicts to dictate your technical solutions, then you are liable to end up with substandard solutions; i.e. solutions that are slow or costly to run and keep running. (If "they" want it to be fast, then "they" shouldn't put impediments in your way.)
I don't think we can give you an easy solution to your problem, but I can say some things about your analysis.
You said this about the grep solution:
"I want avoid it because security department warned me about all checks to make on input data to prevent shell injection."
The solution to that concern is simple: don't use an intermediate shell. The dangerous injection attacks would be via shell trickery rather than grep itself. Java's ProcessBuilder doesn't use a shell unless you explicitly tell it to. The grep program itself can only read the files specified in its arguments and write to standard output and standard error.
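For example, a sketch along these lines runs grep directly through ProcessBuilder with no shell in between, so the user-supplied pattern and file path are passed as ordinary arguments rather than being interpreted by a shell (class name and example arguments are just for illustration):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class GrepRunner {
    // Runs: grep -F -- <pattern> <file>
    // -F treats the pattern as a fixed string (no regex), and "--" ends option parsing
    // so a pattern starting with "-" cannot be mistaken for an option.
    public static List<String> grep(String pattern, String file)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("grep", "-F", "--", pattern, file);
        pb.redirectErrorStream(true);
        Process p = pb.start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                lines.add(line);
            }
        }
        p.waitFor();
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // Example: java GrepRunner "2021-03-01" /data/2021/03/01/file.csv
        grep(args[0], args[1]).forEach(System.out::println);
    }
}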
You said about the general architecture:
"The target is query these files without import them (the main problem is bureaucracy, nobody want responsibility if some data change)."
I don't understand the objection here. We know that the CSV files are going to change. You are getting a new 27GB CSV file every hour!
If the objection is that the format of the CSV files is going to change, well, that affects your ability to effectively query them. But with a little ingenuity, you could detect the change in format and adjust the ingestion process on the fly.
"We're expecting not more than 5 parallel researches in same time, with an average of 3 weeks of range."
If you haven't done this already, you need to do some analysis to see whether your proposed solution is going to be viable. Estimate how much CSV data needs to be scanned to satisfy a typical query. Multiply that by the number of queries to be performed in (say) 24 hours. Then compare that against your NFS server's ability to satisfy bulk reads. Then redo the calculation assuming a given number of queries running in parallel.
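For example, a back-of-the-envelope calculation with the figures above: a 3-week range is roughly 21 × 24 = 504 hourly files, i.e. about 504 × 27 GB ≈ 13.6 TB scanned for one query; five such queries in a 24-hour window means on the order of 68 TB read from NFS per day, or a sustained read rate of roughly 800 MB/s before allowing for any contention.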
Consider what happens if your (above) expectations are wrong. You only need a couple of "idiot" users doing unreasonable things ...
Having a 24-core server for doing the queries is one thing, but the NFS server also needs to be able to supply the data fast enough. You can improve things with NFS tuning (e.g. by tuning block sizes, the number of NFS daemons, using FS-Cache) but the ultimate bottlenecks will be getting the data off the NFS server's disks and across the network to your server. Bear in mind that there could be other servers "hammering" the NFS server while your application is doing its thing.
I'm trying to annotate 200k documents, one document at a time, with Stanford CoreNLP. Each document contains about 200 sentences on average, or roughly 6k tokens.
I'm not familiar with Java, so I'm using pycorenlp. I start the server with the following command as suggested (I edited it with extra arguments later).
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
I'm using Java 1.8 and Python 3.6. Below are the problems I encountered and the ways I've tried to solve them, followed by my questions:
1. Java OutOfMemoryError: GC overhead limit exceeded
I did: increased the Java heap size and added -XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics (the full command is shown below).
Effect: So far it's fine; not sure what will happen later. I've not yet been able to process all the text.
Problem: None yet.
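For reference, with the increased heap the start command looks something like this (8 GB is only an example value):

java -mx8g -XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer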
2. Connection/broken pipe issues
I did: shut down the server every four documents in my code. (I am able to process a maximum of 4 documents; sometimes I'm unable to process even one document.)
Effect: It seems to work fine. However, it is rather slow after the server restarts.
Question: Any better or smarter solutions? If I keep doing this, will I get myself into trouble in terms of server usage, like being barred from using the server?
3. Low Speed
I did: increased the number of threads to 12, then 18, when calling the server.
Effect: Works a lot better than 1 thread.
Question: Are there any suggestions for speeding this up? It takes almost half an hour to process even one document, due to the length of the documents, although I'm calling a few annotators. (I understand it takes more time when more annotators are used, but I still need to use them.)
4. No response from the server at all. Not even with errors.
This is the most painful problem. Since I don't really have an IT background, it is very hard for me to figure out where the problem lies. Below is where the program gets stuck: no warnings, no errors. Maybe one hour later it will continue, or it can stay there forever until I kill the program.
[pool-1-thread-2] INFO CoreNLP - [/127.0.0.1:56846] API call w/annotators tokenize,ssplit,pos,depparse,lemma,ner,parse,dcoref,natlog,openie
the same field as the previous one is detected with magnitude #xmath107 photometric redshift, like borg_ 0240- 1857_ 129, is peaked at #xmath111, with a broad higher- redshift wing......(further content omitted)
Any prompt response will be much appreciated, especially for the third and fourth issues. I've looked thoroughly through the official documentation and GitHub, but I couldn't find any solutions. The official documentation says to limit the size of a document, say to a chapter, rather than a whole novel, so I presume the length of a single document in my dataset is fine.
I am using the Hadoop example program WordCount to process a large set of small files/web pages (ca. 2-3 kB). Since this is far from the optimal file size for Hadoop, the program is very slow. I guess it is because the cost of setting up and tearing down the job is far greater than the job itself. Such small files also deplete the namespace for file names.
I read that in this case I should use HDFS archives (HAR), but I am not sure how to modify the WordCount program to read from these archives. Can the program continue to work without modification, or is some modification necessary?
Even if I pack a lot of files into archives, the question remains whether this will improve performance. I read that even if I pack multiple files, the files inside one archive will not be processed by one mapper but by many, which in my case (I guess) will not improve performance.
If this question is too simple, please understand that I am a newbie to Hadoop and have very little experience with it.
Using HDFS won't change the fact that you are causing Hadoop to handle a large quantity of small files. The best option in this case is probably to cat the files into a single (or a few large) file(s).
This will reduce the number of mappers you have, which will reduce the number of things required to be processed.
Using HDFS can improve performance if you are operating on a distributed system. If you are only running pseudo-distributed (one machine), then HDFS isn't going to improve performance; the limitation is the machine.
When you are operating on a large number of small files, that requires a large number of mappers and reducers. The setup/teardown can be comparable to the processing time of the file itself, causing a large overhead. Concatenating the files should reduce the number of mappers Hadoop runs for the job, which should improve performance.
The benefit you could see from using HDFS to store the files would be in distributed mode, with multiple machines. The files would be stored in blocks (64 MB by default) across machines, and each machine would be capable of processing a block of data that resides on it. This reduces network bandwidth use so it doesn't become a bottleneck in processing.
Archiving the files, if Hadoop is going to unarchive them, will just result in Hadoop still having a large number of small files.
Hope this helps your understanding.
From my still limited understanding of Hadoop, I believe the right solution would be to create SequenceFile(s) containing your HTML files as values and possibly the URLs as keys. If you do an M/R job over the SequenceFile(s), each mapper will process many files (depending on the split size). Each file will be presented to the map function as a single input.
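A minimal sketch of the packing step could look like this (Hadoop 2.x API; here the local file name simply stands in for the URL key, and the output path is a placeholder):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackPagesIntoSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("pages.seq");  // placeholder output path (local or HDFS)
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            // args: the small HTML files to pack; the file name stands in for the URL key.
            for (String f : args) {
                String html = new String(Files.readAllBytes(Paths.get(f)), StandardCharsets.UTF_8);
                writer.append(new Text(f), new Text(html));
            }
        }
    }
}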
You may want to use SequenceFileAsTextInputFormat as the InputFormat to read these files.
Also see: Providing several non-textual files to a single map in Hadoop MapReduce
I bookmarked this article recently to read it later and found the same question here :) The entry is a bit old, not exactly sure how relevant it is now. The changes to Hadoop are happening at a very rapid pace.
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
The blog entry is by Tom White, who is also the author of "Hadoop: The Definitive Guide, Second Edition", a recommended read for those who are getting started with Hadoop.
http://oreilly.com/catalog/0636920010388
Can you concatenate files before submitting them to Hadoop?
CombineFileInputFormat can be used in this case; it works well for a large number of small files. It packs many such files into a single split, so each mapper has more to process (1 split = 1 map task).
The overall MapReduce processing time will also fall, since fewer mappers are running.
Since there is no archive-aware InputFormat, using CombineFileInputFormat will improve performance.
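As a rough sketch of how the stock WordCount driver could change (assuming the newer MapReduce API and that CombineTextInputFormat, the text-oriented subclass of CombineFileInputFormat, is available in your Hadoop version; the 128 MB split cap is only an example):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinedWordCount {

    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count over many small files");
        job.setJarByClass(CombinedWordCount.class);
        // The only real change from the stock WordCount driver: pack many small files
        // into each split so one mapper processes many of them.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap the combined split size (128 MB here); the exact property handling can vary by Hadoop version.
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}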
Let's say you have a game server that creates text log files of gamers' actions, and from time to time you need to look something up in those log files (like investigating a scam or a lost item). Just for example, you have 100 files, and each file is between 20 MB and 50 MB in size - how would you search them quickly?
What I have already tried is to create several threads, where each individual thread maps its own file into memory (let's say memory is not a problem as long as it does not exceed 500 MB of RAM) and performs the search there. The result was something around 1 second per file:
File:a26.log - read in: 0.891, lines: 625282, matches: 78848
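What each thread does boils down to roughly the following single-file sketch (simplified; the matching here is just a naive byte-by-byte scan rather than the real matching logic):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedLogSearch {
    // Counts occurrences of needle in a memory-mapped log file (fine for 20-50 MB files).
    static long countMatches(Path file, byte[] needle) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long matches = 0;
            for (int i = 0; i <= buf.limit() - needle.length; i++) {
                boolean hit = true;
                for (int j = 0; j < needle.length; j++) {
                    if (buf.get(i + j) != needle[j]) { hit = false; break; }
                }
                if (hit) matches++;
            }
            return matches;
        }
    }

    public static void main(String[] args) throws IOException {
        // Example: java MappedLogSearch a26.log playerName
        byte[] needle = args[1].getBytes(StandardCharsets.UTF_8);
        System.out.println("matches: " + countMatches(Paths.get(args[0]), needle));
    }
}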
Is there a better way to do that? It seems kind of slow to me.
Thanks.
(Java was used in this case.)
Tim Bray was investigating approaches to process Apache log files here: http://www.tbray.org/ongoing/When/200x/2007/09/20/Wide-Finder
Seems like there may be a lot in common with your situation.
You can use combinations of Unix commands such as find and grep.
For ad-hoc searching of large text files, I would use the UNIX grep, fgrep or egrep utilities. They have been around a long time, and have had the benefit of many people working on them to make them fast.
On the other hand, the ultimate bottleneck in searching text files (that haven't been previously indexed) will be the speed at which the application + operating system can move data from a disk file into memory. You seem to be managing 20 MB or more per second, which seems reasonably fast ... to me.
I should probably have mentioned in the first post that the game server is written for 64-bit Windows - and I wonder whether grep for Windows is at the same performance level as grep for Unix?
Of course there is a better way: you index the contents before searching. The way you index depends on how you want to search the logs, but in general, you might do well using Lucene (or Solr, if the log entries can easily be restructured into xml documents).
The amount of performance and resource use optimization put into tools like the above should give you orders of magnitude better performance than an ad-hoc solution.
This is all assuming you search each file many times. If this is not the case, you might as well grep the files and be done with it.
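As a rough sketch of the Lucene route (Lucene 5+ style API; the index location, log file, field name, and query string are all placeholders), indexing each log line as a document and then querying it:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

public class LogIndex {
    public static void main(String[] args) throws Exception {
        Path indexDir = Paths.get("log-index");  // placeholder index location
        Path logFile = Paths.get("a26.log");     // placeholder log file
        StandardAnalyzer analyzer = new StandardAnalyzer();

        try (FSDirectory dir = FSDirectory.open(indexDir)) {
            // Index: one Lucene document per log line.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
                 Stream<String> lines = Files.lines(logFile)) {
                lines.forEach(line -> {
                    try {
                        Document doc = new Document();
                        doc.add(new TextField("line", line, Field.Store.YES));
                        writer.addDocument(doc);
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                });
            }

            // Search: find lines mentioning, say, a player or item name.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query q = new QueryParser("line", analyzer).parse("itemName");  // placeholder query
                for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
                    System.out.println(searcher.doc(hit.doc).get("line"));
                }
            }
        }
    }
}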
I've read some documentation about Hadoop and seen impressive results. I get the bigger picture but am finding it hard to tell whether it would fit our setup. The question isn't programming related, but I'm eager to get the opinion of people who currently work with Hadoop on how it would fit our setup:
We use Oracle for backend
Java (Struts2/Servlets/iBatis) for frontend
Nightly we get data which needs to be summarized; this runs as a batch process (takes 5 hours)
We are looking for a way to cut those 5 hours to a shorter time.
Where would hadoop fit into this picture? Can we still continue to use Oracle even after hadoop?
The chances are you can dramatically reduce the elapsed time of that batch process with some straightforward tuning. I offer this analysis on the simple basis of past experience. Batch processes tend to be written very poorly, precisely because they are autonomous and so don't have irate users demanding better response times.
Certainly I don't think it makes any sense at all to invest a lot of time and energy re-implementing our application in a new technology - no matter how fresh and cool it may be - until we have exhausted the capabilities of our current architecture.
If you want some specific advice on how to tune your batch query, well that would be a new question.
Hadoop is designed to parallelize a job across multiple machines. To determine whether it will be a good candidate for your setup, ask yourself these questions:
Do I have many machines on which I can run Hadoop, or am I willing to spend money on something like EC2?
Is my job parallelizable? (If your 5 hour batch process consists of 30 10-minute tasks that have to be run in sequence, Hadoop will not help you).
Does my data require random access? (This is actually pretty significant - Hadoop is great at sequential access and terrible at random access. In the latter case, you won't see enough speedup to justify the extra work / cost).
As far as where it "fits in" - you give Hadoop a bunch of data, and it gives you back output. One way to think of it is like a giant Unix process - data goes in, data comes out. What you do with it is your business. (This is of course an overly simplified view, but you get the idea.) So yes, you will still be able to write data to your Oracle database.
The Hadoop distributed filesystem supports highly parallel batch processing of data using MapReduce.
So your current process takes 5 hours to summarize the data. Off the bat, general summarization tasks are one of the 'types' of job MapReduce excels at. However, you need to understand whether your processing requirements will translate into a MapReduce job. By this I mean: can you achieve the summaries you need using the key/value pairs MapReduce limits you to?
Hadoop requires a cluster of machines to run. Do you have the hardware to support a cluster? This usually comes down to how much data you are storing in HDFS and how fast you want to process it. Generally, when running MapReduce on Hadoop, the more machines you have, the more data you can store or the faster you can run a job. Having an idea of the amount of data you process each night would help a lot here.
You can still use Oracle. You can use Hadoop/MapReduce to do the data crunching and then use custom code to insert the summary data into the Oracle DB.
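As a rough illustration of that last step (the table, columns, JDBC URL, and credentials are all hypothetical, and the Oracle JDBC driver must be on the classpath), the summary rows produced by the MapReduce job could be loaded with plain JDBC batch inserts:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class SummaryLoader {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; use your real host/service/credentials.
        String url = "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1";
        try (Connection conn = DriverManager.getConnection(url, "app_user", "app_password");
             PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO nightly_summary (summary_date, metric_name, metric_value) VALUES (?, ?, ?)")) {
            // In practice these values would be read from the MapReduce output files.
            ps.setDate(1, java.sql.Date.valueOf("2024-01-01"));
            ps.setString(2, "example_metric");
            ps.setLong(3, 42L);
            ps.addBatch();
            ps.executeBatch();
        }
    }
}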