I was reading this SO article and was amazed to find that no one mentioned how long it takes to run a data dump with the appcfg.py tool.
Now, obviously, this is a function of how much data you have and the hardware you're dumping to, but I was wondering if anyone had ever benchmarked this. If I have a decent amount of data (say, 50 GB), how long can I expect to wait for the entire datastore to dump to a YAML file?
Minutes? Hours? Days?
The factors affecting the time to download a dataset are:
the number of entities
the size of the entities
the speed of your network
how the App Engine backend is performing ;-)
I am on a slow link (I typically achieve only 150 kbps over 3G wireless), and my 675 MB of entities (1.5 million in total) take about 2 days to download to the local machine using appcfg.
So unfortunately there won't be a good hard number you can rely on - you will just have to give it a go and see how long it takes in your situation.
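As a rough sanity check on numbers like the ones above, it can help to work out the raw transfer time first (this back-of-envelope sketch assumes the full 150 kbps is usable and ignores protocol overhead; the gap between it and the observed 2 days is presumably per-entity RPC overhead on the appcfg side):

```python
# Back-of-envelope: raw transfer time for 675 MB at 150 kbps.
size_bits = 675 * 1024 * 1024 * 8   # dataset size in bits
rate_bps = 150 * 1000               # link speed in bits per second

hours = size_bits / rate_bps / 3600
print(round(hours, 1))              # prints: 10.5
```

So the wire alone accounts for about half a day; the rest of the observed time is overhead per entity, which is why entity count matters as much as total size.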
I can't tell you about dumping/downloading to your computer, but I have used Datastore backup, which saves datastore data to Blobstore: I have seen backup speeds of around 100-150 MB per minute.
Related
I want to create a hashmap kind of thing for fast lookup between IDs and assigned names.
The number of entries will be a few hundred thousand, so I don't want to keep everything in memory. On the other hand, since performance counts, I don't want to make a database query for each ID either.
So, what are my chances? How could I get fast lookups on large datasets?
A quick search found these:
Production-ready:
MapDB - I can personally recommend this. It's the successor to JDBM, if you found that one while googling.
Chronicle-Map
Voldemort
Probably not production-ready, but worth looking at:
https://github.com/aloksingh/disk-backed-map
https://github.com/reines/persistenthashmap
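For what it's worth, the disk-backed-map idea these libraries implement can be illustrated with Python's standard-library shelve module (a persistent, hash-indexed map on disk). This is only a sketch of the concept, not a substitute for the Java libraries above:

```python
import os
import shelve
import tempfile

# A shelve is a persistent, disk-backed dict: string keys, pickled values.
# Lookups go through an on-disk hash index, so the full dataset never has
# to fit in memory -- the same idea MapDB/Chronicle-Map implement for Java.
path = os.path.join(tempfile.mkdtemp(), "id_to_name")

with shelve.open(path) as db:
    for i in range(1000):               # a few hundred thousand in practice
        db["id-%d" % i] = "name-%d" % i

with shelve.open(path) as db:           # reopen: the data survived on disk
    print(db["id-42"])                  # prints: name-42
```

Single-key lookups stay fast regardless of map size because only the relevant bucket is read from disk.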
Well, there are a couple of solutions in my mind!
1) Go for Lucene -> store in files
2) Make views in the database -> store in the database
So it's up to you which one you go for!
I had a similar requirement a few years back and avoided using databases, thinking they would have high lookup times. Similar to you, I had a large set of values and could not use in-memory data structures, so I decided to sequentially parse the file system. It was a bit slow, but I could do nothing about it.
Then I explored DBs further and used one for my application, just to test. Initially it was slower than the file system, but after indexing the table and optimizing the database, it proved to be at least 10-15 times faster. I can't remember the exact performance results, but it took just 150-200 ms to read data from a large dataset (around 700 MB on the file system), whereas the same read from the file system took 3.5 seconds.
I used a DB2 database, and this guide for performance tuning of DB2.
Besides, once the DB is set up, you can reuse it for multiple applications over the network.
If you are looking for a fast solution, the answer is an in-memory database:
Redis, Memcached, Hazelcast, VoltDB, etc.
We are working with Google BigQuery (using Java) for one of our cloud solutions and are facing a lot of issues in development. Our observations and issues are as follows:
We are using Query Jobs for data retrieval (for example: the jobs().insert()/jobs().query() methods first, and then tabledata().list() for the data). Job execution takes 2-3 seconds (we only have data in the MBs right now). We looked into the sample code on code.google.com and github.com and tried to implement it, but we have not been able to achieve execution faster than 2-3 seconds. What is the fastest way to retrieve data from BigQuery tables? Is there a way to improve Job execution speed? If so, can you provide links to sample code?
In our screens, we need to fetch data from different tables (different queries) and display it. So we insert multiple query jobs, and the total execution time gets summed up (for example: with two jobs, i.e. two queries, it takes 6-7 seconds). The Google documentation mentions that we can run concurrent Jobs. Is there any sample code available for this?
Waiting for your valuable responses.
Querying cached results can be much faster, if you can run the query independently ahead of time so that its results are already cached.
Check that the bottleneck is not related to the network, pagination, page rendering, etc. You can do this by executing only the second step on its own.
Parallel jobs might be queued on BigQuery's end, depending on its current load.
My recommendation would be to separate the query from presentation. Run the BQ queries, retrieve the "Small size" data to a fast access data store (flat file, Cache, Cloud SQL, etc.) and present it from there.
As Pentium10 says, BQ is excellent for huge datasets (and returns results faster and more cheaply than any comparable solution). But if you are looking for a back-end for a fast reporting/visualization tool, I am afraid BQ might not be your solution.
1) BigQuery is a highly scalable database before being a "super fast" database. It's designed to process HUGE amounts of data, distributing the processing among several different machines using a technique named Dremel. Because it's designed to use several machines and parallel processing, you should expect super-scalability with good performance.
2) BigQuery is an asset when you want to analyze billions of rows.
For example: analyzing all the Wikipedia revisions in 5-10 seconds isn't bad, is it? But even a much smaller table would take about the same time, even if it has only 10k rows.
3) Under this size, you'll be better off using more traditional data storage solutions such as Cloud SQL or the App Engine Datastore. If you want to keep SQL capability, Cloud SQL is the best bet.
Sybase IQ, for example, is often installed as a single database and doesn't use Dremel; that said, it's going to be faster than BigQuery in many scenarios... as designed.
4) Certainly the performance differs in a dedicated environment. You can get your own dedicated environment for $20K a month.
tl;dr summary: Are there standard solutions for limiting the length of database tables and number of file system files based on number, disk space or time?
I have a Java web service that allows users to run operations that are internally handled as jobs. In order to access results of previously run jobs or asynchronous jobs the user gets a handle in the form of a job ID. I save all this information in a few database tables of a relational database (currently Apache Derby) because it's much more convenient than inventing a new file format (and also probably much more reliable and performant). The actual job results are saved as XML files in the file system.
Jobs may be executed very frequently (1/s and up), so the tables/directories might get quite large after some time. What I need is a method for pruning the oldest entries from the job history based on:
job count (a maximum of n jobs and their results should be saved)
table/directory size (the tables should take at most n GB of space on the hard drive)
when the job was run (keep only jobs that completed at most n days ago)
I haven't decided which criterion to use yet, so the more flexibility the better. I fear that if I implement this myself the solution might be quite error-prone, and it would take some time to make the system robust. The software I'm developing should be able to run for a very long time without any interruption (OK, whose shouldn't...).
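For illustration, the count- and age-based policies are straightforward in plain SQL; here is a minimal sketch against an in-memory SQLite table (table and column names are invented). A disk-size cap is usually approximated by combining a row-count limit with an average row/file size:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, finished REAL)")
conn.executemany("INSERT INTO jobs (finished) VALUES (?)",
                 [(float(t),) for t in range(100)])

# Policy 1: job count -- keep only the newest MAX_JOBS rows.
MAX_JOBS = 50
conn.execute("""DELETE FROM jobs WHERE id NOT IN
                (SELECT id FROM jobs ORDER BY finished DESC LIMIT ?)""",
             (MAX_JOBS,))

# Policy 2: age -- drop jobs that finished before the cutoff timestamp.
cutoff = 75.0
conn.execute("DELETE FROM jobs WHERE finished < ?", (cutoff,))

print(conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0])  # prints: 25
```

Whichever rows the SQL prunes, the corresponding XML result files must be deleted in the same pass, so the two deletions belong in one transaction-like cleanup step.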
I've read some documentation about Hadoop and seen the impressive results. I get the bigger picture but am finding it hard to tell whether it would fit our setup. The question isn't programming related, but I'm eager to get the opinion of people who currently work with Hadoop on how it would fit our setup:
We use Oracle for backend
Java (Struts2/Servlets/iBatis) for frontend
Nightly we get data which needs to be summarized; this runs as a batch process (it takes 5 hours)
We are looking for a way to cut those 5 hours to a shorter time.
Where would hadoop fit into this picture? Can we still continue to use Oracle even after hadoop?
The chances are you can dramatically reduce the elapsed time of that batch process with some straightforward tuning. I offer this analysis on the simple basis of past experience. Batch processes tend to be written very poorly, precisely because they are autonomous and so don't have irate users demanding better response times.
Certainly I don't think it makes any sense at all to invest a lot of time and energy re-implementing your application in a new technology - no matter how fresh and cool it may be - until you have exhausted the capabilities of your current architecture.
If you want some specific advice on how to tune your batch query, well that would be a new question.
Hadoop is designed to parallelize a job across multiple machines. To determine whether it will be a good candidate for your setup, ask yourself these questions:
Do I have many machines on which I can run Hadoop, or am I willing to spend money on something like EC2?
Is my job parallelizable? (If your 5 hour batch process consists of 30 10-minute tasks that have to be run in sequence, Hadoop will not help you).
Does my data require random access? (This is actually pretty significant - Hadoop is great at sequential access and terrible at random access. In the latter case, you won't see enough speedup to justify the extra work / cost).
As far as where it "fits in" - you give Hadoop a bunch of data, and it gives you back output. One way to think of it is like a giant Unix process - data goes in, data comes out. What you do with it is your business. (This is of course an overly simplified view, but you get the idea.) So yes, you will still be able to write data to your Oracle database.
The Hadoop distributed file system (HDFS) supports highly parallel batch processing of data using MapReduce.
So your current process takes 5 hours to summarize the data. Off the bat, general summarization tasks are one of the 'types' of job MapReduce excels at. However, you need to understand whether your processing requirements will translate into a MapReduce job. By this I mean: can you achieve the summaries you need using the key/value pairs MapReduce limits you to?
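As a toy illustration of that key/value constraint, a nightly per-customer summary maps naturally onto the map/shuffle/reduce phases. This is plain Python with invented sample records, not actual Hadoop code:

```python
from itertools import groupby
from operator import itemgetter

# Toy nightly records: (customer, minutes_of_usage)
records = [("a", 10), ("b", 5), ("a", 3), ("b", 2), ("a", 1)]

# Map phase: emit (key, value) pairs -- here each record already is one.
mapped = [(customer, minutes) for customer, minutes in records]

# Shuffle/sort: Hadoop groups all values for a key between the phases.
mapped.sort(key=itemgetter(0))

# Reduce phase: one summary value per key.
summary = {key: sum(v for _, v in group)
           for key, group in groupby(mapped, key=itemgetter(0))}
print(summary)  # prints: {'a': 14, 'b': 7}
```

If your summaries can be expressed this way (a key per group, an associative aggregation per key), they translate to MapReduce; if they need cross-group joins or random access, the fit is much worse.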
Hadoop requires a cluster of machines to run. Do you have the hardware to support a cluster? This usually comes down to how much data you are storing in HDFS and how fast you want to process it. Generally, when running MapReduce on Hadoop, the more machines you have, the more data you can store or the faster you can run a job. Having an idea of the amount of data you process each night would help a lot here.
You can still use Oracle. You can use Hadoop/MapReduce to do the data crunching, and then use custom code to insert the summary data into an Oracle DB.
I am on a LAMP stack for a website I am managing. There is a need to roll up usage statistics (a variety of things related to our desktop product).
I initially tackled the problem with PHP (being that I had a bunch of classes to work with the data already). All worked well on my dev box which was using 5.3.
Long story short, 5.1's memory management seems to be a lot worse, and I've had to do a lot of fooling around to get the long-term roll-up scripts to run in a fixed memory footprint. Our server guys are unwilling to upgrade PHP at this time, so I've since moved my dev server back to 5.1 so I don't run into this problem again.
For mining of MySQL databases to roll up statistics for different periods and resolutions, potentially running a process that does this all the time in the future (as opposed to on a cron schedule), what language choice do you recommend? I was looking at Python (I know it more or less), Java (don't know it that well), or sticking it out with PHP (know it quite well).
Edit: design clarification for commenter
Resolutions: The way the rollup script currently works, I have some classes for defining resolutions and buckets. I have year, month, week, and day; given a "bucket number", each class gives a start and end timestamp that defines the time range for that bucket, based on an arbitrary epoch date. The system maintains "complete" records, i.e. it will complete its rolled-up dataset for each resolution since the last time it was run.
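A minimal sketch of that bucket idea, assuming a hypothetical epoch and day resolution (the epoch date and function name are invented for illustration):

```python
import datetime

EPOCH = datetime.datetime(2000, 1, 1)   # the arbitrary epoch; an assumption

def day_bucket(n):
    """Start and end timestamps for day-resolution bucket number n."""
    start = EPOCH + datetime.timedelta(days=n)
    return start, start + datetime.timedelta(days=1)

start, end = day_bucket(3)
print(start.date(), end.date())   # prints: 2000-01-04 2000-01-05
```

Week, month, and year resolutions work the same way, just with coarser deltas; month and year need calendar-aware stepping rather than a fixed timedelta.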
SQL strategy: The base stats are located in many dissimilar schemas and tables. For the most part I do individual queries for each rolled-up stat, then fill one record for insert. You are suggesting nested subqueries such as:
INSERT INTO rolled_up_stats (someval, someval, ...)
VALUES ((SELECT SUM(somestat) FROM someschema),
        (SELECT AVG(somestat2) FROM someschema2))
Those subqueries will generate temporary tables, right? In my experience that has been slow as molasses in the past. Is it a better approach?
Edit 2: Adding some inline responses to the question
Language was a bottleneck in the case of PHP 5.1 -- I was essentially told I had made the wrong language choice (though the scripts worked fine on 5.3). You mention Python, which I am checking out for this task. To be clear, what I am doing is providing a management tool for usage statistics of a desktop product (the logs are actually written by an EJB server to MySQL tables). I do Apache log file analysis, as well as more custom web reporting, on the web side, but this project is separate. The approach I've taken so far is aggregate tables. I'm not sure what these message-queue products could do for me; I'll take a look.
To go a bit further -- the data is being used to chart activity over time at the service and customer level, to allow management to understand how the product is being used. You might select a time period (April 1 to April 10) and retrieve a graph of total minutes of usage of a certain feature at different granularities (hours, days, months, etc.) depending on the time period selected. It's essentially an after-the-fact analysis of usage. The need seems to be trending towards real-time, however (look at the last hour of usage).
There are a lot of different approaches to this problem, some of which are mentioned here, but what you're doing with the data post-rollup is unclear...?
If you want to utilize this data to provide digg-like 'X diggs' buttons on your site, or summary graphs or something like that which needs to be available on some kind of ongoing basis, you can actually utilize memcache for this, and have your code keep the cache key for the particular statistic up to date by incrementing it at the appropriate times.
You could also keep aggregation tables in the database, which can work well for more complex reporting. In this case, depending on how much data you have and what your needs are, you might be able to get away with having an hourly table, and then just creating views based on that base table to represent days, weeks, etc.
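A sketch of that hourly-table-plus-views approach, here against SQLite (table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hourly (day TEXT, hour INTEGER, hits INTEGER)")
conn.executemany("INSERT INTO hourly VALUES (?, ?, ?)",
                 [("2024-01-01", h, 10) for h in range(24)])

# A view re-aggregates the hourly base table on demand, so no second
# rollup job is needed for the coarser resolution.
conn.execute("""CREATE VIEW daily AS
                SELECT day, SUM(hits) AS hits FROM hourly GROUP BY day""")

print(conn.execute("SELECT * FROM daily").fetchall())
# prints: [('2024-01-01', 240)]
```

The trade-off: views recompute on every read, so for very large hourly tables you may still want materialized daily/weekly tables refreshed by the rollup job.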
If you have tons and tons of data, and you need aggregate tables, you should look into offloading statistics collection (and perhaps the database queries themselves) to a queue like RabbitMQ or ActiveMQ. On the other side of the queue put a consumer daemon that just sits and runs all the time, updating things in the database (and perhaps the cache) as needed.
One thing you might also consider is your web server's logs. I've seen instances where I was able to get a fairly large portion of the required statistics from the web server logs themselves, after just minor tweaks to the log format rules in the config. You can roll the logs at regular intervals, then process them offline, recording the results in a reporting database.
I've done all of these things with Python (I released loghetti for dealing with Apache combined format logs, specifically), though I don't think language is a limiting factor or bottleneck here. Ruby, Perl, Java, Scala, or even awk (in some instances) would work.
I have worked on a project doing a similar thing in the past, so I have actual experience with the performance. You would be hard pressed to beat the performance of "INSERT ... SELECT" (not "INSERT ... VALUES (SELECT ...)"). Please see http://dev.mysql.com/doc/refman/5.1/en/insert-select.html
The advantage, especially if you keep the roll-up code in MySQL procedures, is that all you need from the outside is a cron job to poke the DB into performing the right roll-ups at the right times -- as simple as a shell script running 'mysql <correct DB arguments etc.> "CALL RollupProcedure"'.
This way, you are guaranteed zero memory-allocation bugs, and you get decent performance even when the MySQL DB is on a separate machine (no moving of data across the machine boundary...).
EDIT: Hourly resolution is fine -- just run an hourly cron-job...
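A minimal demonstration of the INSERT ... SELECT form, shown here against SQLite rather than MySQL (table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats (service TEXT, minutes INTEGER)")
conn.executemany("INSERT INTO stats VALUES (?, ?)",
                 [("chat", 10), ("chat", 20), ("video", 5)])
conn.execute("CREATE TABLE rolled_up_stats (service TEXT, total INTEGER)")

# INSERT ... SELECT: the aggregation runs entirely inside the database,
# so no rows cross the client/server boundary.
conn.execute("""INSERT INTO rolled_up_stats (service, total)
                SELECT service, SUM(minutes) FROM stats GROUP BY service""")

print(conn.execute(
    "SELECT service, total FROM rolled_up_stats ORDER BY service").fetchall())
# prints: [('chat', 30), ('video', 5)]
```

This is the shape the answer recommends: one statement both aggregates and stores, instead of pulling rows into application code and inserting them back.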
If you are running mostly SQL commands, why not just use MySQL etc. on the command line? You could create a simple table that holds the aggregate data, then run a command like mysql -u[user] -p[pass] < commands.sql to pass the SQL in from a file.
Or, split the work into smaller chunks and run them sequentially (as PHP files if that's easiest).
If you really need it to be a continual, long-running process, then a programming language like Python or Java would be better, since you can create a loop and keep it running indefinitely. PHP is not suited to that kind of thing. It would also be pretty easy to convert any PHP classes to Java.