I need to know the best way to store persistent data on Google App Engine in a situation like mine:
I only need to store a few relatively small amounts of data (a few hashmaps of strings to strings, with no string exceeding ~300 characters). I'm estimating maximum usage at about 5 of these hashmaps and 3000 keys per hashmap... The data needs to stay around indefinitely and will be updated once every ~30 minutes. I would REALLY prefer not to pay for the service, as that would require a ton of bureaucratic procedures (corporate policy).
I've been trying to wrap my head around how data storage really works on App Engine...
Can I just continue to use a static HashMap<String, String> to store all the data? This is how I currently have my deployment set up, and it seems to have been working just fine for the past couple of hours (no data reset). I don't think I have any backend instances or anything set up.
How long can I expect this data to last if it is in memcache or JCache? The documentation says it's volatile, but if something is accessed incredibly infrequently, will it just stay in cache forever?
I've been looking at the Datastore options, but I'm confused about what, if anything, will cost money.
Can I store it in just a raw .txt file somehow?
Any help would be appreciated. Thanks!
You can set up a small table in the Datastore, so that your pairs are stored in a non-volatile place. You cannot rely on memcache if this data needs to be preserved.
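A minimal sketch of that approach, using the App Engine low-level Datastore Java API; the "MapEntry" kind and the key naming scheme are just illustrative choices, not anything prescribed:

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;

public class KeyValueStore {
    private static final DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

    // Store one key/value pair under a hypothetical "MapEntry" kind,
    // keyed by "<mapName>:<key>" so several maps can share the kind.
    public static void put(String mapName, String key, String value) {
        Entity entry = new Entity("MapEntry", mapName + ":" + key);
        entry.setProperty("value", value);
        datastore.put(entry);
    }

    // Read a value back, or null if it was never stored.
    public static String get(String mapName, String key) {
        try {
            Key k = KeyFactory.createKey("MapEntry", mapName + ":" + key);
            return (String) datastore.get(k).getProperty("value");
        } catch (EntityNotFoundException e) {
            return null;
        }
    }
}
```

Whether this stays inside the free quota depends on how many entities you rewrite per refresh, so check the current quota limits before relying on it.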
For point #4 (the raw .txt file): you cannot use the filesystem on backend instances.
Hope this helps
My question may be very broad but I really need to ask it. I am planning to use a key-value NoSQL database and I am completely new to the NoSQL world. I was going through the Wikipedia page https://en.wikipedia.org/wiki/Key-value_database
According to that page, KV databases are categorized into the 4 categories below.
KV – eventually consistent
KV – ordered
KV – RAM
KV – solid-state drive or rotating disk
I am not able to understand the exact differences between them. If anyone can explain them to me, that would be great.
This typology is interesting and maybe a little misleading. I can comment on some of these:
RAM - I've worked with Redis, so what I say applies to that DB. Redis is designed to keep data in RAM, not on disk. Basically, if you turn it off the data is gone, so it's more of a cache than a DB. It can be configured to persist data to disk, but it's not designed for that.
eventually consistent - I've worked with Cassandra. EC means that read operations are not guaranteed to return the most up-to-date value. They will eventually return the most up-to-date value, but not necessarily immediately after an update. In Cassandra, you specify which level of consistency you need, and this affects how fast you can read/write: less consistency == faster (it's a tradeoff; see the sketch after this list). Unlike Redis, Cassandra saves to disk.
solid-state drive or rotating disk - I've worked with Couchbase. Couchbase saves to disk and is guaranteed to always be consistent (on read operations of keys, but not on views/indexes), so when you read a key from CB you know it's always the most up-to-date value. I think the name of this category is a little misleading: other databases (like Cassandra) also save to disk, so the name is a little off.
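As a rough illustration of that consistency/speed trade-off, here is a hedged sketch using the DataStax Java driver (an assumed dependency; the "demo" keyspace and "kv" table are made up):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class ConsistencyDemo {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("demo");

        // ONE: fastest, but may return a stale value shortly after an update.
        Statement fastRead = new SimpleStatement("SELECT value FROM kv WHERE key = 'a'")
                .setConsistencyLevel(ConsistencyLevel.ONE);

        // QUORUM: slower, a majority of replicas must answer before returning.
        Statement safeRead = new SimpleStatement("SELECT value FROM kv WHERE key = 'a'")
                .setConsistencyLevel(ConsistencyLevel.QUORUM);

        ResultSet fast = session.execute(fastRead);
        ResultSet safe = session.execute(safeRead);
        System.out.println(fast.one() + " / " + safe.one());

        cluster.close();
    }
}
```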
hope this helps.
I am currently working on an app that has a 15000+ row, 20+ column array in the database. If a user searches in the app, there are two things I can do:
1. Load all the data from the database when the app starts, and then search it for the data the user wants.
2. Convert the user's input into a query, fire it at the database, and retrieve the matching data at that moment.
Additionally, I want to ask: if I store a 10^6 x 20 array in the app, how much space will it take, and how will the app behave when it reloads?
And if I fetch on demand, what will the worst-case time complexity of fetching be?
Thanks in advance.
With only about 200 MB or less of data (a 10^6 x 20 array is 2 x 10^7 cells; at roughly 10 bytes per cell that is on the order of 200 MB) either way will be fast, assuming you allocate enough heap to the process. In the modern era that is not much data. Which will be faster? That all depends.
How fast is your I/O system? What database would you use? How would you code your algorithm? Are you doing any processing in parallel? What bugs have you written into your code? What amount of overall program time does access to these data use? What else is running on the system?
Only actual measurement of both approaches under very well controlled conditions that simulate realistic loads will reveal which is faster.
But at what cost? More complex systems have more risk and costs. Complicated premature "optimization" code makes code brittle and hard to understand. Workarounds take time and effort, wasted time and effort if there's nothing to work around.
So instead of asking which is fast(er), ask what makes sense. What gets the job done at all? What costs the least to do? What's more correct? What minimizes risk?
We have a Java-based product which keeps a Calculation object in the database as a blob. At runtime we keep this object in memory for fast performance. There is another process which updates this Calculation object in the database at regular intervals. What would be the best strategy so that when the object gets updated in the database, the cache removes the stored object and fetches it again from the database?
I would prefer not to use a caching framework unless it is absolutely necessary.
I would appreciate responses on this.
It is very difficult to give you a good answer to your question without any knowledge of your system architecture, design constraints, your IT strategy, etc.
Personally, I would use the Messaging pattern to solve this issue. A few advantages of that pattern are as follows:
Your system components (Calculation process, update process) can be loosely coupled.
Depending on the implementation of the Messaging pattern, you can "connect" many Calculation processes (scaling out) and many update processes (with a master-slave approach).
However, implementing the Messaging pattern can be a very challenging task, so I would recommend using one of the existing frameworks or products.
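For instance, a hedged sketch of the receiving side using plain JMS (the broker wiring and the "calculation" cache key are my assumptions, not part of the original question): the update process publishes a message after writing the blob, and a listener like this evicts the cached copy so the next read falls through to the database.

```java
import java.util.concurrent.ConcurrentMap;
import javax.jms.Message;
import javax.jms.MessageListener;

public class CalculationUpdateListener implements MessageListener {
    private final ConcurrentMap<String, Object> cache;

    public CalculationUpdateListener(ConcurrentMap<String, Object> cache) {
        this.cache = cache;
    }

    @Override
    public void onMessage(Message message) {
        // Drop the stale entry; the next reader reloads it from the database.
        cache.remove("calculation");
    }
}
```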
I hope that will help at least a bit.
I did some work similar to your scenario before; generally there are two ways.
One: the cache holder polls the database regularly, fetches the data it needs, and keeps it in memory. The data can be stored in a HashMap or some other collection. This approach is simple and easy to implement, and no extra framework or library is needed. But users will have to put up with stale data from time to time. Besides, polling puts a lot of pressure on the DB if the number of pollers is large or the query is not fast enough. Still, it is generally not a bad option if your real-time requirements are not strict and the scale of your system is relatively small.
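A minimal sketch of that polling approach, assuming a loadCalculation() method you would implement against your own database:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PollingCache {
    private final Map<String, Object> cache = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start() {
        // Re-read the blob every 5 minutes; stale data is possible between polls.
        scheduler.scheduleAtFixedRate(this::refresh, 0, 5, TimeUnit.MINUTES);
    }

    private void refresh() {
        cache.put("calculation", loadCalculation());
    }

    private Object loadCalculation() {
        // Hypothetical: fetch and deserialize the Calculation blob from the database.
        return new Object();
    }
}
```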
The other approach is for the cache holder to subscribe to notifications from the data updater and update its data after being notified. This provides a better user experience, but it brings more complexity to your system because you have to get messaging infrastructure, such as JMS, involved. Developing and tuning it is more time-consuming.
I know I am quite late responding to this, but it might help somebody searching for the same issue.
Here was my problem: I was storing requests-per-minute information in a HashMap in a Java filter which gets loaded when the application starts. The problem is that if somebody updates the DB with new information, the map doesn't know about it.
Solution: I added an updateTime variable to my Java filter which stores when the HashMap was last updated. On every request it checks whether more than 24 hours have passed; if so, it reloads the HashMap from the database. So every 24 hours it simply refreshes the whole HashMap.
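A minimal sketch of that idea, with illustrative names (the real filter, map contents, and DB call will differ):

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.servlet.*;

public class RefreshingFilter implements Filter {
    private static final long REFRESH_INTERVAL_MS = 24L * 60 * 60 * 1000; // 24 hours

    private volatile Map<String, String> cache = new ConcurrentHashMap<>();
    private volatile long lastUpdateTime = 0L;

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        long now = System.currentTimeMillis();
        if (now - lastUpdateTime > REFRESH_INTERVAL_MS) {
            synchronized (this) {
                if (now - lastUpdateTime > REFRESH_INTERVAL_MS) {
                    cache = loadFromDatabase(); // hypothetical DB call you would implement
                    lastUpdateTime = now;
                }
            }
        }
        chain.doFilter(req, res);
    }

    private Map<String, String> loadFromDatabase() {
        return new ConcurrentHashMap<>(); // replace with a real query that rebuilds the map
    }

    @Override public void init(FilterConfig cfg) { }
    @Override public void destroy() { }
}
```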
My use case did not require real-time updates, so this fits it well.
Not-so-critical data can be stored in memcache. However, how can complicated data be stored there while being updated simultaneously by different user sessions?
Say a graph, tree or linked list? It is OK to miss a node, but it is bad to lose the whole graph/tree/list if a node is evicted. An example here is sending update notifications among online users.
App Engine's Appstats uses a predetermined 1000 buckets, which works (I think there is no dependency among them). But I am thinking about more complicated cases...
Any tutorial, example or theory would be considered helpful...
(The memcached tag was added, but I know it is not for App Engine.)
If you want to store more complex data, you need to serialize it before you place it into shared memory.
When you want to update this data, you will need to deserialize it back into your complex structure, update the structure, then serialize it again to place it back into shared memory.
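For App Engine specifically, here is a hedged sketch of that deserialize/update/re-serialize cycle using the memcache compare-and-set API, so two sessions updating at once don't silently overwrite each other; the "friend-graph" key and the edge-list representation are illustrative assumptions:

```java
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;
import java.util.ArrayList;

public class GraphCache {
    private static final MemcacheService cache = MemcacheServiceFactory.getMemcacheService();

    // Add an edge to a serialized edge list stored under one key, retrying if
    // another session has modified the value since we read it.
    @SuppressWarnings("unchecked")
    public static void addEdge(String from, String to) {
        while (true) {
            MemcacheService.IdentifiableValue current = cache.getIdentifiable("friend-graph");
            ArrayList<String[]> edges = (current == null)
                    ? new ArrayList<String[]>()
                    : new ArrayList<String[]>((ArrayList<String[]>) current.getValue());
            edges.add(new String[] {from, to});

            boolean stored = (current == null)
                    ? cache.put("friend-graph", edges, null,
                                MemcacheService.SetPolicy.ADD_ONLY_IF_NOT_PRESENT)
                    : cache.putIfUntouched("friend-graph", current, edges);
            if (stored) {
                return; // otherwise another session won the race; loop and retry
            }
        }
    }
}
```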
I am curious: why memcache? There are many other shared-memory storage systems out there, such as MemBase, Redis and Hazelcast, with Hazelcast adding some help to hide the complexity of storing more complex structures (like lists and maps). Hazelcast also adds nice features like cluster-wide locks and data listeners, which can come in useful (full disclosure: I decided on Hazelcast).
Of course if you want to spend real money for licensing, you always have Terracotta which can completely abstract you from this complexity.
If you're storing a graph, and don't mind nodes going missing, just store each node under its own memcache key.
If all the data going away at once is "bad", though, you probably shouldn't be using memcache in the first place. Complete wipes are rare, but can happen.
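A minimal sketch of that per-node layout on App Engine memcache (the key prefix and value type are illustrative); losing one key loses only that node:

```java
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;
import java.util.HashSet;

public class NodeCache {
    private static final MemcacheService cache = MemcacheServiceFactory.getMemcacheService();

    // Each node's adjacency set lives under its own key.
    public static void putNode(String nodeId, HashSet<String> neighbours) {
        cache.put("node:" + nodeId, neighbours);
    }

    @SuppressWarnings("unchecked")
    public static HashSet<String> getNode(String nodeId) {
        // Returns null if this particular node was evicted; the rest of the graph survives.
        return (HashSet<String>) cache.get("node:" + nodeId);
    }
}
```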
I am planning to develop an application for connecting with friends of friends of friends. It may look like Facebook or Twitter, but initially I am planning to implement it to learn more about NoSQL databases.
There are a number of NoSQL database tools. I have gone through many database types, like document stores, key-value stores, column stores, and graph databases, and I finally came up with two candidates: Cassandra and Neo4j. Is it right to choose either one? If not, correct me and give me your valuable opinions.
One more thing: the language binding I have chosen is Java.
My question is,
Which database tool suits for my application?
Awaiting your valuable opinions. Thanks for spending your valuable time.
Tim, you really should have posted your question separately, rather than as an answer to the OP, which it wasn't.
But to answer, first, go read Ben Black's slides at http://www.slideshare.net/benjaminblack/introduction-to-cassandra-replication-and-consistency.
Done? Okay, now for the specific questions:
"How would differences in [replica] data-state be reconciled on a subsequent read?"
The highest timestamp wins.
"Do all zones work off the same system clock?"
Timestamps are provided by clients (i.e., your app server). They should be synchronized with e.g. ntpd (which is good practice anyway), but high precision is not required, because if ordering matters you should avoid conflicts either by using unique column names or by using external locking.
For example: if you have a list of users following you in a Twitter clone, you should give each follower its own column and there will be no way to lose data no matter how out of sync the clocks are.
If you have an admin tool for your website and two admins upload a new favicon "simultaneously," one update is going to win and it doesn't really matter which. Here, you do want your clocks synchronized but "within a few ms" is close enough.
If you are managing user registration and you want to allow creating the account "jbellis" only if it doesn't already exist, you need a lock manager no matter how closely synchronized your clocks are.
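A hedged sketch of the follower-per-column idea, expressed in CQL through the DataStax Java driver (an assumption on my part; the original answer predates CQL). Because each follower lands in its own cell, concurrent writes from different app servers cannot clobber each other:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class FollowersExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("twitter_clone"); // illustrative keyspace

        // One row per (user, follower) pair; each follower is its own cell.
        session.execute("CREATE TABLE IF NOT EXISTS followers ("
                + "username text, follower text, PRIMARY KEY (username, follower))");

        // Two writers adding different followers "simultaneously" touch different cells:
        session.execute("INSERT INTO followers (username, follower) VALUES (?, ?)", "jbellis", "alice");
        session.execute("INSERT INTO followers (username, follower) VALUES (?, ?)", "jbellis", "bob");

        cluster.close();
    }
}
```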
"Would stale data get returned?"
A node (a better unit to think about than a "zone") will not have data it missed during its downtime until it is sent that data by read repair, hinted handoff, or anti-entropy repair. In the meantime, it will reply to read requests with stale data; if you use a high enough consistency level, read requests will wait for enough other replies to make sure you always see the most recent version anyway, which may mean not being able to fulfil requests if enough other replicas are down.
Otherwise, a low consistency level (e.g. ONE) implicitly means "I understand that the higher availability and lower latency I get with this lower consistency level mean I'm okay with seeing stale data temporarily after downtime."
I'm not sure I understand all of the implications of the Cassandra consistency model with respect to data agreement across multiple availability zones.
Given multiple zones, and given that the coordinator node in Cassandra has used a consistency level that does not require all zones to report back, but only a quorum, how would differences in zone data-state be reconciled on a subsequent read?
Do all zones work off the same system clock? Or does each zone have its own clock? If they don't work off the same clock, how are they synchronized so that timestamps can be compared during the "healing" process when differences are reconciled?
Let's say that a zone that does have accurate, up-to-date data is now offline, and a zone that was offline during a previous write (so it didn't get updated and contains stale data) is now back online. Would stale data get returned? Would the coordinator have any way to know the data were stale?
If you don't need to scale in the short term I'd go with Neo4j because it is designed to store networks like the one you described. (If you eventually do need to scale, maybe you can throw Gizzard in front of it or something. Good luck!)
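For a feel of how naturally that kind of network maps onto Neo4j, here is a hedged sketch using the embedded Neo4j Java API (3.x assumed; the label, relationship name, and store path are illustrative):

```java
import java.io.File;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class FriendGraphExample {
    public static void main(String[] args) {
        GraphDatabaseService db = new GraphDatabaseFactory()
                .newEmbeddedDatabase(new File("data/friends.db"));

        try (Transaction tx = db.beginTx()) {
            Node alice = db.createNode(Label.label("Person"));
            alice.setProperty("name", "alice");
            Node bob = db.createNode(Label.label("Person"));
            bob.setProperty("name", "bob");

            // Friendship is just a relationship; friends-of-friends become path traversals.
            alice.createRelationshipTo(bob, RelationshipType.withName("FRIEND"));
            tx.success();
        }

        db.shutdown();
    }
}
```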
Have you looked at the Riak database? It has the same background as Cassandra, but you don't need to care about timestamp synchronization (it uses a different method for resolving data status).
My first application was built on a Cassandra database, but I am now trying Riak because it is more suitable. It is not only the difference in keys (keys - values / super column - keys - values); it goes further with its document-store features.
It has a way to create complex queries using MapReduce. Cassandra also has this option via Hadoop, but it sounds difficult.
Furthermore, it uses a well-known, well-defined access protocol over http/s, so it's easy to manage the server when you have a lot of traffic.
The only bad point is that it is slower than Cassandra. But usually you will read records more than you write (and Cassandra is optimised for writes, not reads), so the end result should be OK.
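To illustrate that HTTP access, here is a hedged sketch using plain HttpURLConnection against Riak's classic /riak/<bucket>/<key> REST interface (host, port, bucket, and key are made up):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RiakHttpExample {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8098/riak/users/jbellis");

        // Store a value with a plain HTTP PUT.
        HttpURLConnection put = (HttpURLConnection) url.openConnection();
        put.setRequestMethod("PUT");
        put.setDoOutput(true);
        put.setRequestProperty("Content-Type", "text/plain");
        try (OutputStream out = put.getOutputStream()) {
            out.write("hello".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("PUT status: " + put.getResponseCode());

        // Read it back with a GET (the default request method).
        HttpURLConnection get = (HttpURLConnection) url.openConnection();
        System.out.println("GET status: " + get.getResponseCode());
    }
}
```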