We are designing the architecture of a social networking website with a highly interconnected dataset (users can follow other users, places, and interests, and recommendations are based on that). The feed would come from directly followed entities as well as from indirectly connected ones (places and interests can themselves be connected to other places and interests in an inverted tree-like hierarchy).
We plan to use Neo4j to store the complex relationships between entities, keyed by their IDs, and to store the actual data for each entity in MySQL. We want to keep the graph database content to a minimal size (but with all of the relationships, which are essential for feeds) so that we can load the entire graph into RAM at runtime for fast retrieval. Once we get the object IDs from Neo4j, we can run normal SQL queries against MySQL.
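To make that last step concrete, here is a rough sketch of the MySQL side once Neo4j has handed back a list of IDs (JDBC is used purely for illustration, the PHP/PDO version is the same idea; the users table and its columns are invented):

```java
import java.sql.*;
import java.util.*;

// Once the graph query has produced entity IDs, hydrate the actual content from MySQL.
// Table and column names (users, id, name, avatar_url) are made up for illustration.
public class FeedHydrator {
    static List<Map<String, Object>> loadFeedEntities(Connection conn, List<Long> feedIds) throws SQLException {
        String placeholders = String.join(",", Collections.nCopies(feedIds.size(), "?"));
        String sql = "SELECT id, name, avatar_url FROM users WHERE id IN (" + placeholders + ")";
        List<Map<String, Object>> rows = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (int i = 0; i < feedIds.size(); i++) {
                ps.setLong(i + 1, feedIds.get(i));
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    Map<String, Object> row = new HashMap<>();
                    row.put("id", rs.getLong("id"));
                    row.put("name", rs.getString("name"));
                    row.put("avatarUrl", rs.getString("avatar_url"));
                    rows.add(row);
                }
            }
        }
        return rows;
    }
}
```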
We are using a PHP and MySQL combination. We have learned that Neo4j, when run in embedded mode, is well suited to complex algorithms and fast data retrieval, so we now need to integrate Neo4j with PHP. Our plan is to expose the Neo4j implementation to PHP through RESTful (or SOAP) Java APIs.
We would have at least 1 million nodes and 10 million relationships. With proper indexing, can Neo4j traverse 1 million nodes in 1-5 seconds without performance glitches?
Please let me know whether this would work, especially if you have already done this kind of thing before. Any guidance would be highly appreciated.
Thank you.
P.S.: I am attaching some project relationship diagrams to give you a better understanding. Please ask if you need more input from me.
https://drive.google.com/file/d/0B-XA2uVZaFFTWDdwUEViZ2ZsbkE/edit?usp=sharing
https://drive.google.com/file/d/0B-XA2uVZaFFTTGV4d1IySXlWRGs/edit?usp=sharing
I published an unmanaged extension some time ago that represents a kind of activity stream. Feel free to have a look; you would consume it from PHP via a simple HTTP REST call.
https://github.com/jexp/neo4j-activity-stream
A picture of the domain model is here:
Yes, 10M relationships and 1M nodes should be no problem even to hold in memory. For the fastest retrieval I would build a server extension in Java, use the embedded API or even Cypher, and expose a custom REST endpoint that your PHP environment talks to; see http://docs.neo4j.org/chunked/milestone/server-plugins.html
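A minimal sketch of such an unmanaged extension, assuming Neo4j 2.x, a FOLLOWS relationship type, and a plain JSON-ish response (the path, relationship type, and one-hop traversal are placeholders for the real feed logic):

```java
import java.util.*;
import javax.ws.rs.*;
import javax.ws.rs.core.*;
import org.neo4j.graphdb.*;

// Unmanaged extension: exposes GET /feed/{userId} and returns the IDs of followed entities.
@Path("/feed")
public class FeedResource {

    private final GraphDatabaseService db;

    public FeedResource(@Context GraphDatabaseService db) {
        this.db = db;
    }

    @GET
    @Path("/{userId}")
    @Produces(MediaType.APPLICATION_JSON)
    public Response feed(@PathParam("userId") long userId) {
        List<Long> ids = new ArrayList<>();
        try (Transaction tx = db.beginTx()) {
            Node user = db.getNodeById(userId);
            // Walk outgoing FOLLOWS relationships; a real feed would traverse deeper.
            for (Relationship rel : user.getRelationships(Direction.OUTGOING,
                    DynamicRelationshipType.withName("FOLLOWS"))) {
                ids.add(rel.getEndNode().getId());
            }
            tx.success();
        }
        return Response.ok(ids.toString()).build(); // e.g. "[17, 42, 99]"
    }
}
```

The class's package is registered in conf/neo4j-server.properties via org.neo4j.server.thirdparty_jaxrs_classes, and PHP then just issues an HTTP GET against the mounted path.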
I'm starting a team of two to develop a chat server (both of us are college students). We did some research and found that Netty is the most suitable framework for this kind of concurrency-heavy app.
We have no experience developing server-side applications in Java; this is our first time tackling this kind of project, and I just need the right direction so we can build this server the right way.
Our goal is to build something like WhatsApp, Kik Messenger, Line, or WeChat.
The real question is: how do we make our Netty app scalable? Do we need to use Redis for data persistence? Do we need MySQL for storing relationships, or a NoSQL database like MongoDB?
I hope someone can guide us.
You could have a look at the documentation if you haven't done so yet (a minimal bootstrap sketch follows the links below):
SecureChat example
Netty User Guide
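For orientation, a minimal Netty 4 bootstrap for a line-based chat server looks roughly like this (the inline echo handler is a placeholder for real message routing; the SecureChat example adds TLS on top of the same structure):

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.*;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.codec.LineBasedFrameDecoder;
import io.netty.handler.codec.string.StringDecoder;
import io.netty.handler.codec.string.StringEncoder;

public class ChatServer {
    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup boss = new NioEventLoopGroup(1);   // accepts incoming connections
        EventLoopGroup workers = new NioEventLoopGroup(); // handles traffic on accepted channels
        try {
            ServerBootstrap bootstrap = new ServerBootstrap()
                .group(boss, workers)
                .channel(NioServerSocketChannel.class)
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        ch.pipeline().addLast(
                            new LineBasedFrameDecoder(1024),
                            new StringDecoder(),
                            new StringEncoder(),
                            new SimpleChannelInboundHandler<String>() {
                                @Override
                                protected void channelRead0(ChannelHandlerContext ctx, String msg) {
                                    // Placeholder: a real chat server would route/broadcast here.
                                    ctx.writeAndFlush("echo: " + msg + "\n");
                                }
                            });
                    }
                });
            bootstrap.bind(8080).sync().channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            workers.shutdownGracefully();
        }
    }
}
```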
Scalability is a complex question. You could think about making your application able to run across multiple servers (horizontal scalability), but then it really depends on how your information/context/sessions are made available and updated...
You could of course think about using Redis for data persistence.
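For example, with the Jedis client you could keep a capped per-room message history; the key layout and the 500-message cap below are just placeholders:

```java
import java.util.List;
import redis.clients.jedis.Jedis;

// Append a chat message to a room's history and keep only the most recent 500 entries.
public class ChatHistory {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.rpush("chat:room:42", "alice: hi there");
            jedis.ltrim("chat:room:42", -500, -1);                        // cap the list
            List<String> recent = jedis.lrange("chat:room:42", -50, -1);  // last 50 messages
            recent.forEach(System.out::println);
        }
    }
}
```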
As for the database, it mainly depends on what your data looks like and whether you need relationships handled in SQL or whether your application can handle them for you (to be clear: do you want the database to do the joins in your SQL queries, or do you want the application to do them?). It also depends on the amount of data (1 million rows? 1 billion?) and on the number of connections.
So the choice is yours...
Then you can come back with any specific issues you run into.
I want to implement a hotel booking system with Play Framework 2.0 (Java). The app will handle hotel bookings with no banking transactions (a credit card is provided only as an identification method to prevent fraud): the user selects the desired room and date range, and the app makes the booking and updates the room availability.
I am considering MongoDB over MySQL for performance reasons, and also because my models will have fields translatable into a few languages, which would require a lot of joins in MySQL.
For the availability check, I can't quite figure out whether it is simpler in MySQL or in MongoDB.
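For reference, the SQL version of an availability check usually boils down to a date-overlap query; the sketch below uses an invented rooms/bookings schema purely to illustrate:

```java
import java.sql.*;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Rooms of a hotel with no booking overlapping [checkIn, checkOut).
// The schema (rooms, bookings, check_in, check_out) is hypothetical.
public class Availability {
    static List<Long> availableRooms(Connection conn, long hotelId,
                                     LocalDate checkIn, LocalDate checkOut) throws SQLException {
        String sql =
            "SELECT r.id FROM rooms r " +
            "WHERE r.hotel_id = ? AND NOT EXISTS (" +
            "  SELECT 1 FROM bookings b " +
            "  WHERE b.room_id = r.id AND b.check_in < ? AND b.check_out > ?)";
        List<Long> roomIds = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, hotelId);
            ps.setDate(2, Date.valueOf(checkOut)); // an existing booking starts before the requested check-out...
            ps.setDate(3, Date.valueOf(checkIn));  // ...and ends after the requested check-in
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    roomIds.add(rs.getLong("id"));
                }
            }
        }
        return roomIds;
    }
}
```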
Is MongoDB suitable for that kind of web app, or am I on the wrong path?
Two simple statements:
Stick with what you know. If you know plain SQL, go for that.
Premature optimization is the root of all evil. If you are already thinking about optimization before even starting your application, there is something terribly wrong with your way of working.
You can use MongoDB for this kind of application. I personally have some security concerns when it comes to sensitive data (i.e. credit card information), as deleting the information does not automatically mean it is unrecoverable.
From Wikipedia on MongoDB:
E-commerce. Several sites are using MongoDB as the core of their ecommerce infrastructure (often in combination with an RDBMS for the final order processing and accounting).
There are some very interesting articles on the pros and cons on Stack Overflow: Pros and cons of MongoDB?
If you want to learn a new NoSQL technology, by all means go for it, but if you want to play it safe, stick to the MySQL solution. Booking applications often work well with BI/data-mining solutions, and this fact alone would make a NoSQL approach a no-go for me.
I am currently gathering information about which database service we should use.
I am still very new to web development, but we think we want a NoSQL database.
We are using Java with Play! 2.
We only need a database for user registration.
I am already familiar with GAE's ndb, which is a key-value store like DynamoDB; MongoDB is a document database.
I am not sure what advantages each solution has.
I also know that DynamoDB runs on SSDs and that MongoDB works in memory.
An advantage of MongoDB would be that Play! for Java already "supports" MongoDB.
We don't expect much database usage at first, but we would need to scale pretty fast if our app grows.
What alternatives do I have? What pros/cons do they have?
Considering:
Pricing
Scaling
Ease of use
Play! support?
(Disclosure: I'm a founder of MongoHQ, and would obviously prefer you choose us)
The biggest difference from a developer's perspective is the querying capability. On DynamoDB, you need the exact key for a given document, or you need to build your keys in such a way that you can use them for range-based queries. In Mongo, you can query on the structure of the document, add secondary indexes, do aggregations, etc.
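A rough illustration of that flexibility with the 2.x-era Mongo Java driver (database, collection, and field names are invented):

```java
import com.mongodb.*;

// With Mongo you can index and query on any field inside the document,
// rather than needing the exact key as with a pure key/value store.
public class UserQueries {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("localhost");
        DBCollection users = client.getDB("myapp").getCollection("users");

        users.createIndex(new BasicDBObject("profile.city", 1)); // secondary index

        DBCursor cursor = users.find(new BasicDBObject("profile.city", "Berlin")
                .append("age", new BasicDBObject("$gte", 21)));
        while (cursor.hasNext()) {
            System.out.println(cursor.next().get("email"));
        }
    }
}
```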
The advantage of going key/value only is that it forces you to build your application in a way that DynamoDB can scale. The advantage of Mongo's flexible queries against your documents is much faster development, even if you discount what the Play framework includes. It's always going to be quicker to do new development with something like Mongo because you don't have to make your scaling decisions from the get-go.
Implementation-wise, both Mongo and DynamoDB can grow basically unbounded. Dynamo abstracts away most of the decisions on storage, RAM, and processor power. Mongo requires that you (or someone like us) make decisions on how much RAM to have, what kind of disks to use, how to manage bottlenecks, etc. The operational hurdles are different, but the end result is very similar. We run multiple Mongo DBs on top of very fast SSDs and it works phenomenally well.
Pricing is incredibly difficult to compare, unfortunately. DynamoDB pricing is based on a nominal per GB fee, but you pay for data access. You need to be sure you understand how your costs are going to grow as your database gets more active. I'm not sure I can predict DynamoDB pricing effectively, but I know we've had customers who've been surprised (to say the least) at how expensive Dynamo ended up being for the stuff they wanted to do.
Running Mongo is much more predictable cost-wise. You likely need 1GB of RAM for every 10GB of data, running a redundant setup doubles your price, etc. It's a much easier equation to wrap your head around and you're not in for quite as nasty of a shock if you have a huge amount of traffic one day.
By far the biggest advantage of Mongo (and MongoHQ) is this: you can leave your provider at any time. If you get irked at your Mongo provider, it's only a little painful to migrate away. If you get irked at Amazon, you're going to have to rewrite your app to work with an entirely different engine. This has huge implications for the support you should expect to receive: hosting Mongo is competitive enough that you get very good support from just about any Mongo-specific company you choose (or we'd die).
I addressed scaling a little above, but the simplest answer is this: if you define your data model well, either option will scale out just about as far as you can imagine you'd need to go. You are likely not to get this right with Mongo at first, though, since you'll probably be developing quickly. This means that once you can't scale vertically any more (by adding RAM, disk speed, etc. to a single server), you will have to be careful about how you choose to shard. The biggest difference between Mongo and Dynamo scaling is when you choose to make your "how do I scale my data?" decisions, not overall scaling ability.
So I'd choose Mongo (duh!). I think you can build a fantastic app on top of DynamoDB, though.
As you said, MongoDB is one step ahead of the other options because you can use the Morphia plugin to simplify DB interactions (you have JPA support as well). The Play framework also provides a CRUD module (admin console) and a Secure module (for your overall login system), so I strongly suggest you have a look at them.
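A small sketch of what Morphia buys you (the entity and fields are made up, and depending on the version the package is com.google.code.morphia or org.mongodb.morphia):

```java
import org.bson.types.ObjectId;
import org.mongodb.morphia.Datastore;
import org.mongodb.morphia.Morphia;
import org.mongodb.morphia.annotations.Entity;
import org.mongodb.morphia.annotations.Id;
import com.mongodb.MongoClient;

// A plain class mapped straight to a MongoDB document.
@Entity("users")
class User {
    @Id ObjectId id;
    String email;
    String displayName;
}

public class MorphiaExample {
    public static void main(String[] args) {
        Morphia morphia = new Morphia();
        morphia.map(User.class);
        Datastore ds = morphia.createDatastore(new MongoClient("localhost"), "myapp");

        User u = new User();
        u.email = "alice@example.com";
        u.displayName = "Alice";
        ds.save(u); // persisted as a document, no SQL or manual mapping

        User found = ds.find(User.class).field("email").equal("alice@example.com").get();
        System.out.println(found.displayName);
    }
}
```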
I'm trying to implement part of the Facebook Ads API: the autocomplete function ads.getAutoCompleteData.
Basically, Facebook supplies a 39 MB file, updated weekly, which contains ad-targeting data including colleges, college majors, workplaces, locales, countries, regions, and cities.
Our application needs to access all of those objects and provide autocompletion based on this file's data.
I'm thinking about the best way to solve this. I was considering one of the following options:
Loading it into memory using a trie (Patricia trie); the disadvantage, of course, is that it will take too much memory on the server (see the sketch after this list).
Using a dedicated search platform such as Solr on a separate machine; the disadvantage is that it is perhaps over-engineering (though the file will probably grow considerably in the future).
(Insert a cool, easy, lightning-fast option here)?
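As a point of reference for the first option, a Patricia trie from Apache Commons Collections makes the in-memory prefix lookup itself almost trivial; the file name and tab-separated layout below are assumptions:

```java
import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.SortedMap;
import org.apache.commons.collections4.trie.PatriciaTrie;

// Load the targeting names into a Patricia trie and answer prefix queries from memory.
// Assumes one entry per line with the display name in the first tab-separated column.
public class AutoComplete {
    public static void main(String[] args) throws Exception {
        PatriciaTrie<String> trie = new PatriciaTrie<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get("targeting.tsv"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String name = line.split("\t", 2)[0];
                trie.put(name.toLowerCase(), name); // lower-cased key, original value
            }
        }
        SortedMap<String, String> matches = trie.prefixMap("harv"); // e.g. "Harvard", ...
        matches.values().stream().limit(10).forEach(System.out::println);
    }
}
```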
Well, what do you think?
I would stick with a service-oriented architecture (especially if the product is supposed to handle high volumes) and go with Solr. That being said, 39 MB is not a lot to hold in memory if it's going to be a singleton. With indexes and all, this will get up to what, 400 MB? That of course depends on what your product does and what kind of hardware you wish to run it on.
I would go with Solr, or write your own service that reads the file into a fast DB such as a MySQL MyISAM table (or even an in-memory table) and uses MySQL's full-text search feature to serve up results. Barring that, I would try to use Solr as a service.
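A sketch of the MyISAM full-text variant (the targets table, its columns, and the boolean-mode prefix query are assumptions):

```java
import java.sql.*;

// Prefix-style autocomplete against a MyISAM full-text index, e.g.:
//   CREATE TABLE targets (id BIGINT PRIMARY KEY, name VARCHAR(255), FULLTEXT (name)) ENGINE=MyISAM;
public class MySqlAutoComplete {
    public static void main(String[] args) throws SQLException {
        String sql = "SELECT name FROM targets WHERE MATCH(name) AGAINST (? IN BOOLEAN MODE) LIMIT 10";
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/ads", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, "harv*"); // the '*' makes it a prefix match in boolean mode
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("name"));
                }
            }
        }
    }
}
```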
The benefit of writing my own service is that I know what is going on; the downside is that it will be nowhere near as powerful as Solr. However, I suspect writing my own service would take less time to implement.
Consider writing your own service that serves up requests asynchronously (if your product is a website, then using Ajax). The trouble with Solr or Lucene is that if you get stuck, there is not a lot of help out there.
Just my 2 cents.
I'm looking for resources to help migrate my design skills from traditional RDBMS data stores over to the AppEngine Datastore (i.e. 'soft schema' style). I've seen several presentations, and all touch on the overarching themes and some specific techniques.
I'm wondering if there's a place we could pool knowledge from experience ("from the trenches") on real-world approaches to rethinking how data is structured, especially when porting existing applications. We're heavily Hibernate-based and have probably travelled a bit down the wrong path with our data model already, generating some gnarly queries that our DB is struggling with.
Please respond if:
You have ported a non-trivial application over to AppEngine
You've created a common type of application from scratch in AppEngine
You've done neither of the above, but are considering it and want to share your own findings so far.
I'm wondering if there's a place we could pool knowledge from experience
Various Google Groups are good for that, though I don't know if any are directly applicable to Java GAE yet -- my GAE experience so far is all Python (I'm kind of proud to say that Guido van Rossum, inventor of Python and now working at Google on App Engine, told me I had taught him a few things about how his brainchild worked -- his recommendation mentioning that is now the one I'm proudest of, amongst all those on my LinkedIn profile ;-). [I work at Google, but my impact on App Engine was very peripheral -- I worked on "building the cloud", cluster and network management software, and App Engine is about making that infrastructure useful for third-party developers.]
There are indeed many essays and presentations on how best to denormalize and shard your data for optimal GAE scaling and performance -- they're of varying quality, though. The books that are out so far are so-so; many more are coming in the next few months, hopefully better ones (I had a project to write one of those, with two very skilled friends, but we're all so busy that we ended up dropping it). In general, I'd recommend the Google I/O videos and the essays that Google has blessed on its App Engine site and blogs, plus every bit of content from appenginefan's blog -- what Guido commended me for teaching him about GAE, I in turn mostly learned from appenginefan (partly through the wonderful App Engine meetup in Palo Alto, but his blog is great too ;-).
I played around with Google App Engine for Java and found that it had many shortcomings:
This is not general-purpose Java application hosting. In particular, you do not have access to a full JRE (e.g. you cannot create threads). Given this, you pretty much have to build your application from the ground up with the Google App Engine JRE in mind; porting any non-trivial application would be impossible.
More pertinent to your datastore questions...
The datastore performance is abysmal. I was trying to write 5000 weather observations per hour -- nothing too massive -- but I could not do it because I kept running into timeout exceptions from both the datastore and the HTTP request. Using the "low-level" datastore API helped somewhat, but not enough.
I wanted to delete those weather observations after 24 hours so as not to fill up my quota. Again, I could not do it because the delete operation took too long, which in turn led to my datastore quota filling up. Insanely, you cannot easily delete large swaths of data from the GAE datastore.
There are some features that I did like. Eclipse integration is snazzy. The appspot application server UI is a million times better than working with Tomcat (e.g. nice views of logs). But the minuses far outweighed those benefits for me.
In sum, I constantly found myself yak shaving in order to do something that would have been pretty trivial in any normal Java application hosting environment.
The timeouts are tight and performance was OK but not great, so I found myself using extra space to save time; for example, I had a many-to-many relationship between trading cards and players, so I duplicated the information about who owns what: Card objects have a list of Players, and Player objects have a list of Cards.
Normally, storing all your information twice would be silly (and prone to getting out of sync), but it worked really well.
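A sketch of that duplication with the low-level datastore API (the kinds and property names follow the Card/Player example above; nothing keeps the two lists in sync except your own code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import com.google.appengine.api.datastore.*;

// Store the ownership relation on both sides so either direction is a single get.
public class CardOwnership {
    @SuppressWarnings("unchecked")
    static void recordOwnership(DatastoreService ds, Key playerKey, Key cardKey)
            throws EntityNotFoundException {
        Entity player = ds.get(playerKey);
        Entity card = ds.get(cardKey);

        List<Key> cardKeys = (List<Key>) player.getProperty("cardKeys");
        if (cardKeys == null) cardKeys = new ArrayList<>();
        List<Key> playerKeys = (List<Key>) card.getProperty("playerKeys");
        if (playerKeys == null) playerKeys = new ArrayList<>();

        cardKeys.add(cardKey);
        playerKeys.add(playerKey);
        player.setProperty("cardKeys", cardKeys);
        card.setProperty("playerKeys", playerKeys);

        ds.put(Arrays.asList(player, card)); // written twice, read back cheaply from either side
    }
}
```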
For Python they recently released a remote API that gives you an interactive shell to the datastore, so you can work with your data without any timeouts or limits (for example, you can delete large swaths of data or refactor your models); this is fantastically useful, since otherwise, as Julien mentioned, it is very difficult to do any bulk operations.
Non-relational database design essentially involves denormalizing wherever possible.
Example: since BigTable doesn't provide enough aggregation features, the SUM(cash) option you would have in the RDBMS world is not available. Instead, the sum has to be stored on the model, and the model's save method must be overridden to compute the denormalized field.
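Roughly, the same idea in the Java low-level datastore API looks like this (the Account/CashEntry kinds and property names are invented; the original answer describes the Python equivalent of overriding the model's save):

```java
import com.google.appengine.api.datastore.*;

// Keep a denormalized running total on the Account entity instead of relying on SUM().
public class CashTotals {
    static void addCash(DatastoreService ds, Key accountKey, long amountCents)
            throws EntityNotFoundException {
        Transaction txn = ds.beginTransaction();
        try {
            Entity account = ds.get(txn, accountKey);
            Long total = (Long) account.getProperty("cashTotal");
            account.setProperty("cashTotal", (total == null ? 0L : total) + amountCents); // the "overridden save" step
            ds.put(txn, account);

            Entity entry = new Entity("CashEntry", accountKey); // child entity, same entity group
            entry.setProperty("amount", amountCents);
            ds.put(txn, entry);

            txn.commit();
        } finally {
            if (txn.isActive()) txn.rollback();
        }
    }
}
```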
The basic design that comes to mind is that each template has its own model, with all the fields it needs to populate stored, denormalized, in that model; you then have a whole signals/update-bots layer of complexity going on in the models.