I'm looking for resources to help migrate my design skills from a traditional RDBMS data store over to the AppEngine DataStore (i.e. 'Soft Schema' style). I've seen several presentations, and all touch on the overarching themes and some specific techniques.
I'm wondering if there's a place we could pool knowledge from experience ("from the trenches") on real-world approaches to rethinking how data is structured, especially porting existing applications. We're heavily Hibernate based and have probably travelled a bit down the wrong path with our data model already, generating some gnarly queries which our DB is struggling with.
Please respond if:
You have ported a non-trivial application over to AppEngine
You've created a common type of application from scratch in AppEngine
You've done neither 1 nor 2, but are considering it and want to share your own findings so far.
I'm wondering if there's a place we could pool knowledge from experience
Various Google Groups are good for that, though I don't know if any are directly applicable to Java-GAE yet -- my GAE experience so far is all-Python (I'm kind of proud to say that Guido van Rossum, inventor of Python and now working at Google on App Engine, told me I had taught him a few things about how his brainchild worked -- his recommendation mentioning that is now the one I'm proudest of among all those on my LinkedIn profile;-). [I work at Google but my impact on App Engine was very peripheral -- I worked on "building the cloud", cluster and network management SW, and App Engine is about making that infrastructure useful for third party developers].
There are indeed many essays & presentations on how best to denormalize and shard your data for optimal GAE scaling and performance -- they're of varying quality, though. The books that are out so far are so-so; many more are coming in the next few months, hopefully better ones (I had a project to write one of those, with two very skilled friends, but we're all so busy that we ended up dropping it). In general, I'd recommend the Google I/O videos and the essays that Google blessed in its app engine site and blogs, PLUS every bit of content from appenginefan's blog -- what Guido commended me for teaching him about GAE, I in turn mostly learned from appenginefan (partly through the wonderful app engine meetup in Palo Alto, but his blog is great too;-).
I played around with Google App Engine for Java and found that it had many shortcomings:
This is not general purpose Java application hosting. In particular, you do not have access to a full JRE (e.g. you cannot create threads). Given this fact, you pretty much have to build your application from the ground up with the Google App Engine JRE in mind. Porting any non-trivial application would be impossible.
More pertinent to your datastore questions...
The datastore performance is abysmal. I was trying to write 5000 weather observations per hour -- nothing too massive -- but I could not do it because I kept running into timeout exceptions, both with the datastore and with the HTTP request. Using the "low-level" datastore API helped somewhat, but not enough.
I wanted to delete those weather observations after 24 hours so as not to fill up my quota. Again, I could not do it because the delete operation took too long. This problem in turn led to my datastore quota filling up. Insanely, you cannot easily delete large swaths of data in the GAE datastore.
There are some features that I did like. Eclipse integration is snazzy. The appspot application server UI is a million times better than working with Tomcat (e.g. nice views of logs). But the minuses far outweighed those benefits for me.
In sum, I constantly found myself having to shave the yak, in order to do something that would have been pretty trivial in any normal Java / application hosting environment.
The timeouts are tight and performance was ok but not great, so I found myself using extra space to save time; for example I had a many-to-many relationship between trading cards and players, so I duplicated the information of who owns what: Card objects have a list of Players and Player objects have a list of Cards.
Normally storing all your information twice would have been silly (and prone to get out of sync) but it worked really well.
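As a rough illustration of that duplication, here is a minimal sketch in plain Python rather than the actual datastore API. The class names, field names, and the give_card/take_card helpers are hypothetical; the point is only that every change to ownership is written to both sides in one place, which is what keeps the two copies from drifting apart.

```python
from dataclasses import dataclass, field


@dataclass
class Player:
    name: str
    card_names: set = field(default_factory=set)  # duplicated ownership info


@dataclass
class Card:
    name: str
    player_names: set = field(default_factory=set)  # same relation, other side


def give_card(player, card):
    """Record ownership on both objects so neither side needs a join."""
    player.card_names.add(card.name)
    card.player_names.add(player.name)


def take_card(player, card):
    """Remove ownership from both sides to keep the copies in sync."""
    player.card_names.discard(card.name)
    card.player_names.discard(player.name)


alice = Player("alice")
ace = Card("ace of spades")
give_card(alice, ace)
```

Reading "which cards does this player own" or "who owns this card" is then a single lookup on one object, at the cost of two writes per change.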
In Python they recently released a remote API so you can get an interactive shell to the datastore so you can play with your datastore without any timeouts or limits (for example, you can delete large swaths of data, or refactor your models); this is fantastically useful since otherwise as Julien mentioned it was very difficult to do any bulk operations.
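For the bulk-delete case specifically, the usual shape is a loop that removes a bounded batch per call. The sketch below is plain Python with an in-memory dict standing in for the datastore; fetch_keys, delete_keys, and the batch size of 500 are assumptions made for illustration, not App Engine API names.

```python
def delete_in_batches(fetch_keys, delete_keys, batch_size=500):
    """Delete everything fetch_keys returns, one bounded batch per call."""
    deleted = 0
    while True:
        keys = fetch_keys(batch_size)
        if not keys:
            return deleted
        delete_keys(keys)
        deleted += len(keys)


# In-memory stand-in for a datastore, used only to exercise the loop.
store = {"obs-%d" % i: {"temp": i} for i in range(1200)}


def fetch_keys(limit):
    """Return up to `limit` keys that still exist."""
    return list(store)[:limit]


def delete_keys(keys):
    for key in keys:
        del store[key]


total = delete_in_batches(fetch_keys, delete_keys, batch_size=500)
```

Run from a shell without request deadlines, the loop can simply continue until nothing is left; inside a request it would have to stop after a few batches and be resumed later.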
The non relational database design essentially involves denormalization wherever possible.
Example: Since BigTable doesn't provide enough aggregation features, the sum(cash) operation you would use in the RDBMS world is not available. Instead, the sum has to be stored on the model, and the model's save method must be overridden to compute the denormalized sum field.
The basic design that comes to mind is that each template has its own model, in which all the fields the template needs are present in denormalized form; the cost is an entire signals-update-bots complexity going on in the models.
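A minimal sketch of that idea, in plain Python with a dict standing in for the datastore. The Account class, its field names, and the storage dict are hypothetical; a real model would do the same recomputation inside its overridden save/put method.

```python
class Account:
    """Hypothetical model that keeps a running total alongside its entries."""

    def __init__(self):
        self.cash_entries = []
        self.cash_sum = 0  # denormalized field, stored with the model

    def add_entry(self, amount):
        self.cash_entries.append(amount)

    def save(self, storage):
        # Recompute the denormalized sum on every save, so readers never
        # need an aggregate query.
        self.cash_sum = sum(self.cash_entries)
        storage["account"] = {
            "cash_entries": list(self.cash_entries),
            "cash_sum": self.cash_sum,
        }


storage = {}
account = Account()
account.add_entry(10)
account.add_entry(32)
account.save(storage)
```

Anything that renders the total now reads cash_sum directly instead of summing at query time.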
Related
A few months ago I created a shop manager software for one of our customers.
One of the requirements was Adobe ColdFusion. Nevertheless, I came up with a really nice product: simple and fast, with a few nice ideas and some positive feedback.
Now I think I am ready for letting "bigger" customers see my project.
I prefer not to write that next version in ColdFusion, I'm looking for something much more scalable and reliable.
The cloud computing, actually, is making me really curious! In particular, Google AppEngine seems to have all I need:
I know Java
I could start from scratch, without paying anything
It's Google, what's more reliable than it?
I made a few helloworld-s, looking for the best technology to use.
GWT is really nice, but my dev team loves html+css "page centric" apps, so I think it would be too big a jump. Instead I was considering:
Spring MVC 3.x
Objectify 4 (as a persistence manager, instead of JDO/JPA)
My questions are:
Based on your experiences, do you think that GAE is suitable for developing and hosting "shop manager" software, which will manage tables (CRUD), make reports, and so on? My project is really simple.
Are the two technologies I mentioned good and sufficient for such a project? What will I need in addition?
I've made a pretty standard shop, using: Appengine + Spring + Groovy + Objectify (and backbonejs + google closure templates for client side)
And from my experience I can say:
It's possible :)
Such a system requires a lot of transactions. That is possible on App Engine, but not trivial, and it takes a lot of work.
Reports are better prepared in the background (taskqueue/cron/prospectivesearch); it is hard to prepare 'on-demand' or custom reports (in practice they must be done in the background anyway).
I'm happy with my current implementation, but I can see that a standard RDBMS fits this type of project much better.
PS: You can also take a look at CloudFoundry. I haven't tried it yet, but it looks good too, and it offers PostgreSQL as a service.
I'm trying to implement part of the facebook ads api, the auto complete function ads.getAutoCompleteData
Basically, Facebook supplies a 39MB file, updated weekly, which contains ad targeting data including colleges, college majors, workplaces, locales, countries, regions and cities.
Our application needs to access all of those objects and supply auto completion using this file's data.
I'm thinking about the best way to solve this. I was considering one of the following options:
Loading it into memory using a trie (Patricia trie); the disadvantage, of course, is that it will take too much memory on the server.
Using a dedicated search platform such as Solr on a different machine; the disadvantage is that this is perhaps over-engineering (though the file size will probably grow considerably in the future).
(Fill here cool, easy and speed of light option) ?
Well, what do you think?
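For reference, the in-memory option can be sketched as follows. This is a plain, uncompressed trie rather than the Patricia variant mentioned above, and the class and method names are made up for illustration; it is meant to show the lookup shape, not to settle the memory question.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def complete(self, prefix, limit=10):
        """Return up to `limit` stored words starting with `prefix`, sorted."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [(node, prefix)]
        while stack and len(results) < limit:
            current, word = stack.pop()
            if current.is_word:
                results.append(word)
            # Push children in reverse order so the smallest pops first.
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], word + ch))
        return results


trie = Trie()
for name in ["stanford", "stanford law", "stony brook", "harvard"]:
    trie.insert(name)
```

A call such as trie.complete("sta") walks down to the prefix node once and then enumerates only the subtree below it, so lookup cost depends on the prefix length and the number of results rather than on the size of the whole file.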
I would stick with a service oriented architecture (especially if the product is supposed to handle high volumes) and go with Solr. That being said, 39 MB is not a lot to hold in memory if it's going to be a singleton. With indexes and all, this will get up to what? 400MB? This of course depends on what your product does and what kind of hardware you wish to run it on.
I would go with Solr or write your own service that reads the file into a fast DB like MySQL's MyISAM table (or even in-memory table) and use mysql's text search feature to serve up results. Barring that I would try to use Solr as a service.
The benefit of writing my own service is that I know what is going on; the downside is that it'll be nowhere near as powerful as Solr. However, I suspect writing my own service will take less time to implement.
Consider writing your own service that serves up requests in an async manner (if your product is a website, then using ajax). The trouble with Solr or Lucene is that if you get stuck, there is not a lot of help out there.
Just my 2 cents.
I have to develop an ERP system for an organisation with 2,000+ end users.
Could you please suggest, with points of comparison, which technology (Java or .Net) I should invest money and time in? I have done some average projects in both, but this project is going to be very big in the near future in terms of scalability.
I would like to hear your experiences and tips, so that I can develop and deploy this project efficiently.
I rate .Net > Java for this project only because of the limited development time available.
We have to use some Rapid App Development technology.
I have to deploy this on Cloud (Azure or Google App engine).
It would be best to get answers from people who work in both (.Net and Java).
I will appreciate answers from your experiences.
I would suggest creating a very small proof-of-concept project in both technologies that does something real: for example, let people log in, see messages, type in new messages, and log out again.
Even if the project is laughably small, if you do it well, you will have a finished product on each platform that has shown you by experience how things work and whether you like the way you had to do them. You will be able to see if you can debug in the cloud, if you can profile when load testing, and if you can do fast work in-house which then works well when deployed to the cloud.
And you will need to figure out things. Are the online resources good? How responsive is the StackOverflow community for each platform when you ask questions?
Personally, I consider the ".NET is Windows-only" point to be important. Except for that, I do not believe there is any technical showstopper for either platform.
I think both approaches can be used to deliver this successfully. I would expect you to have the same amount of success/pain with either choice. When it comes to making a decision you should base it on the amount of expertise that you have to hand. That is, your own and that of your existing colleagues and the resources that you can acquire (new recruits, contractors, consultants etc.).
That said a couple of technical notes:
The Java approach tends to have more freedom, i.e. more frameworks and choice of technologies for various solutions (although GAE will bring in some restrictions).
There is less choice in the .NET space, but that is not always a bad thing. E.g. you tend not to end up in tiresome debates about logging frameworks.
Java is starting to age as a language and C# is a bit nicer; however, there are a number of newer languages that run on the Java VM (Scala, Groovy, Ruby, Clojure).
I am a computer science undergraduate currently in my final year. As my final year project, I am thinking of creating a matlab-like numerical computing environment as SAAS that supports matrix manipulations, plotting of functions and data, image processing operations etc. The project is going to be created in Java + Scala. Scala will be used for the application's DSL. The rest of the application is going to be programmed in Java.
I was thinking of implementing this system on google app engine so that we could parallelize various algorithms across a number of servers and thus obtain faster results. However, I do not have any prior experience with web development (except some simple sites in PHP).
So I had the following key questions:
First of all does it make sense to have an application like matlab hosted on cloud?
How easy or difficult would it be to write such an application on google app engine, considering my limited experience with web development?
Can you please point me to some already existing projects that parallelize mathematical, graph and image processing algorithms.
I know the question is very much subjective but I still request you all not to close it as I am very much confused regarding my project and need some expert advice.
Any help would be greatly appreciated!
Thanks!
About half a year ago I thought about making such a thing.
Thoughts ended up with nothing except some code at http://code.google.com/p/metaplasm...
In fact, the tricky thing with GAE is that computation must be sliced into thirty-second slices with no shared memory (only memcache and database). Once you've accomplished that, everything else will go smoothly :-)
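To make the slicing concrete, here is a minimal sketch in plain Python. The function name and the state layout are hypothetical, and a step budget (max_steps) stands in for the wall-clock deadline so the example is deterministic; the state is a plain dict precisely because it would have to be saved to memcache or the database between slices.

```python
def sum_of_squares_slice(state, max_steps):
    """Advance the computation by at most max_steps, then return new state.

    `state` is a plain dict so it could be serialized between requests;
    nothing is kept in shared memory.
    """
    i, total, n = state["next"], state["total"], state["n"]
    steps = 0
    while i < n and steps < max_steps:
        total += i * i
        i += 1
        steps += 1
    return {"next": i, "total": total, "n": n, "done": i >= n}


# Each loop iteration here stands in for one separate request or task.
state = {"next": 0, "total": 0, "n": 1000, "done": False}
slices = 0
while not state["done"]:
    state = sum_of_squares_slice(state, max_steps=300)
    slices += 1
```

Each call picks up exactly where the previous one stopped, so a slice that is cut off by the platform can simply be rerun from the last saved state.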
App Engine probably isn't the right platform for this. App Engine is targeted at web applications where each request does a modest amount of computation, but you need to service a lot of them - most traditional webapps, such as social networking sites, blogs, web-based games, and so on and so forth. It isn't targeted at services that need to do intensive computation for a single user request, and while it has services to do parallel background processing, they're asynchronous, which is probably also not what you want for your use-case.
What I would recommend is looking at other cloud environments, such as Amazon's EC2, for the processing power and parallelism you need. App Engine would still do an admirable job as a frontend for such a service, though! For example, you could use an App Engine app to manage jobs, dispatch them to backends, and turn up and down VM instances as required by load.
This absolutely makes sense, and there are two existing projects that run numerical routines in the cloud.
Biocep (free, runs R & Scilab on EC2 or Eucalyptus) and Monkey Analytics (commercial, runs R, Octave or Python on EC2).
Why not try BOINC, the open-source distributed computing system?
http://boinc.berkeley.edu/
It supports multiple platforms and multiple hosting environments, and serves all kinds of numerical computation jobs that depend on parallel environments.
Moreover, you don't need any web development knowledge. You just need to create a new project in BOINC and try running it in the existing volunteer computing environment.
You might encounter issues with this type of service on GAE, as it's quite restrictive about what you are allowed to do in the sandbox. From the GAE docs:
An App Engine application cannot:
spawn a sub-process or thread. A web request to an application must be handled in a single process within a few seconds. Processes that take a very long time to respond are terminated to avoid overloading the web server.
This could make it tricky to offer the types of services you describe. The scaling that GAE offers enables you to grow the number of requests you can handle but doesn't really offer you good tools for scaling the CPU resources for a single request.
Sounds like an interesting idea for a project though, good luck.
It makes little sense to me to write the rest in Java. That's precisely where I think Scala would make the most difference.
I'm hosting my Java math online demo on Google appengine. This non parallelized demo of course hits the Google Appengine quota limits for time expensive requests.
But with the help of the appengine-mapreduce library you can parallelize your mathematical algorithms and avoid these limits.
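The general shape of that approach, independent of any particular library, looks roughly like this. The sketch is plain Python and does not use the appengine-mapreduce API; the function names and the shard size are assumptions for illustration. Each shard is small enough to be handled within one request, and the partial results are merged afterwards.

```python
from functools import reduce


def split(values, shard_size):
    """Cut the input into independent shards, one per worker request."""
    return [values[i:i + shard_size] for i in range(0, len(values), shard_size)]


def map_shard(shard):
    """Per-shard work: a partial sum and count, cheap enough for one request."""
    return (sum(shard), len(shard))


def combine(a, b):
    """Reduce step: merge two partial results."""
    return (a[0] + b[0], a[1] + b[1])


values = list(range(1, 101))
partials = [map_shard(s) for s in split(values, 25)]  # each could run separately
total, count = reduce(combine, partials)
mean = total / count
```

This only works for computations whose partial results can be merged in any order, which is worth checking before committing an algorithm to this style.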
I plan to start a mid sized web project, what language + framework would you recommend?
I know Java and Python. I am looking for something simple.
Is App Engine a good option? I like the overall simplicity and free hosting, but I am worried about the datastore (how difficult is it to make it similarly fast as a standard SQL solution? + I need fulltext search + I need to filter objects by several parameters).
What about Java with Stripes? Should I use another framework in addition to Stripes (e.g. for database).
UPDATE:
Thanks for the advice, I finally decided to use Django with Eclipse/PyDev as an IDE.
Python/Django is simple and elegant, it's widely used, and there is great documentation. A small disadvantage is that perhaps I'll have to buy a VPS, but it shouldn't be very hard to port the project to App Engine, which is free to some extent.
Since you mentioned python, I would suggest looking into Django. You may need to look harder for hosting options, however...
Is App Engine a good option? I like the overall simplicity and free hosting, but I am worried about the datastore (how difficult is it to make it similarly fast as a standard SQL solution? + I need fulltext search + I need to filter objects by several parameters).
App Engine is nice. It supports Python or Java (with some limitations), and it provides free hosting for small needs (rare, at least for Java). But I wouldn't expect exactly the same performance as with dedicated servers: the cloud is about scalability, not performance (you won't always get the fastest response time for a single hit; however, GAE would handle gazillions of concurrent hits without any problem while your servers would be on fire). This scalability is not without cost; if you don't need it, the development tradeoffs may be too much trouble. Also note that it does not support full-text search out of the box (what an irony), so you will have to use extra tooling.
What about Java with Stripes? Should I use another framework besides Stripes (e.g. for database).
I like Stripes very much. I love its convention over configuration approach; it's a very elegant and simple framework (but still powerful). Definitely not a bad choice. For persistence, if you go for GAE, you will have to use JPA or JDO. If you don't, it's at your discretion (although I would go for JPA).
See also
Google AppEngine - A Second Look
As with many things in life, this depends on what your goals are. If you intend to learn a web framework that is used in corporate environments, then choose a Java solution. If not, don't. Python is certainly more elegant and generally more fun in pretty much every way.
As to which framework to use, django has the most mindshare, as evidenced by the number of questions asked about it here. My understanding is that it's also pretty good. It's best suited for CMS-like web sites, though - at least that's what it's coming from and what it's optimized for. You might also have a look at one of the simpler, nimbler ones, such as the relatively new flask. All of these are enjoyable, though they may not all have all features on AppEngine.
Kay and Tipfy are excellent Python framework choices when you target specifically GAE. Kay is modelled after and similar to Django, but is better suited to GAE.
I've been kicking App Engine around a little bit, and so far the DataStore is pretty quick... there is a bit of a learning curve compared to SQL, but I've had no real issues. I'm not sure about fulltext search; filtering, however, is simple: you just apply each filter one at a time.
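from google.appengine.ext import db

class DBModel(db.Model):
    field1 = db.StringProperty()
    field2 = db.StringProperty()
    field3 = db.IntegerProperty()

# Each filter() call narrows the same query; the filters are ANDed together.
GQLObj = DBModel.all().filter('field1 =', 'Foo')
GQLObj = GQLObj.filter('field2 =', 'Bar')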
As far as hosting goes, with GAE I'm not sure you even get a choice; I know you can register your own domain with Google, though.
I don't think the datastore is a problem. Many people will reject it out of hand because they want a standard relational database; if you are willing to consider a datastore in general then I doubt you will have any problems with the GAE datastore. Personally, I quite like it.
The thing that might trip you up is the operational limitations. For example, did you know that an HTTP request must complete within 10 seconds?
What if you get 50% of the way through a project and then find that a web service you are using sometimes takes 15 seconds to respond? Now you are toast. You can't pay extra to get the limit raised or anything like that.
So, my point is that you must approach GAE with great care. Learn about the limitations and make sure that they will not be a problem before you start using it.
It depends on your personality. There's no right answer to this question any more than there's a right answer to "what kind of car should I drive?"
If you're artistic and believe code should be beautiful, use Rails.
If you're a real hacker type, I think you'll find a full-stack framework such as Rails or Django to be unsatisfying. These frameworks are "opinionated" software, which means you have to really embrace the author's vision to be most productive.
The wonderful thing about web development in the Python world is that there are several great minimal frameworks. I've used several, including web.py, GAE's webapp, and cherrypy. These frameworks are like "here's a request, give me a string to serve up." It's raw. Don't think you'll be stuck in Python concatenating strings though, God no. There are also several excellent templating libraries for Python. I can personally recommend Cheetah, but Mako also looks good.
Google App Engine + GWT and you have a pretty powerful combination for developing web applications. The datastore is quite fast, and it has so far done the job quite nicely for me.
In my project I had to do a lot of redesigning of my database model, because it was made for a traditional relational database, and some things were not (directly) possible with the datastore.
GWT has a fairly moderate learning curve, but it gets the job done very well. The gui code is really easy to get started with, but it's the asynchronous way of thinking that's the hardest part.
As for search I don't think it's supported in the framework. Filtering is possible on parameters.
There are some limitations to GAE, and you should consider them before putting all your eggs in that basket. The fact that GAE uses J2EE distribution standards makes the application very easy to move to a dedicated server, should the limitations of GAE become a problem. In fact, I think you would only have to refactor the part of your code that makes the queries and stores the data (which shouldn't be much more than 100 lines).
I've built several apps on GAE (with Python) over the last year. It's hard to beat the ease with which you can get an app up and running quickly. Don't discount the value in that alone.
While you may not understand the datastore yet, it is extremely well documented and there are great resources - including this one - to help you get past any problem you might have.