What are the possible ways to distribute data selectively?
Let me explain my question with an example.
Consider a central database that holds all the data. This database is located in a certain geographical location.
Application A needs a subset of the information present in the central database. Also, application A may be located in a geographical location different (and maybe far) from the one where the central database is located.
So I thought about creating a new database, at the same location as application A, that would contain a subset of the information in the central database.
Which technology/product would allow me to deploy such a configuration?
Thanks
Look for database replication. SQL Server can do this for sure, others (Oracle, MySQL, ...) should have it, too.
The idea is that the other location maintains a (subset) copy. Updates are exchanged incrementally. The way to treat conflicts depends on your application.
Most major database software, such as MySQL and SQL Server, can do the job, but it
is not a good model. As the application grows (traffic and users),
not only will you put load on the central database server (which might be serving
other applications), but you will also strain your network bandwidth by transferring data
between the faraway database and the application server.
A better model is to keep your data close to the application server, and use the far away
database for backup and recovery purposes only. You can use an FC/IP SAN (or any other
storage network architecture) as your storage network model, based on your applications' needs.
One big question that you didn't address is whether Application A needs read-only access to the data or read-write access.
The immediate concept that comes to mind when reading your requirements is sharding. In MySQL, this can be accomplished with partitioning. That being said, before you jump into partitions, make sure you read up on their pros and cons. There are instances where partitioning can slow things down if your indexes are not well chosen, or your partitioning scheme is not well thought out.
If your needs are read-only, then this should be a fairly simple solution. You can use MySQL in a Master-Slave context, and use App A off a slave. If you need read-write, then this becomes much more complex.
Depending on your write needs, you can split your reads to your slave, and your writes to the master, but that significantly adds complexity to your code structure (need to deal with multiple connections to multiple dbs). The advantage of this kind of layout is that you don't need to have complex DB infrastructure.
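For illustration, here is a minimal sketch of that read/write split with plain JDBC; the host names, credentials, database name and table are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

/**
 * Minimal sketch of a master/slave read-write split.
 * Host names and credentials are placeholders.
 */
public class ReadWriteRouter {

    private static final String MASTER_URL = "jdbc:mysql://master-host:3306/app";
    private static final String SLAVE_URL  = "jdbc:mysql://slave-host:3306/app";

    /** Writes always go to the master. */
    public Connection writeConnection() throws SQLException {
        return DriverManager.getConnection(MASTER_URL, "app", "secret");
    }

    /** Reads can be served by the (possibly slightly stale) slave. */
    public Connection readConnection() throws SQLException {
        return DriverManager.getConnection(SLAVE_URL, "app", "secret");
    }
}
```

In practice you would pool these connections rather than opening one per call; the point is only that your code now has to know which statements are reads and which are writes.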
On the flip side, you can keep your code as is and use master-master replication in MySQL. Although not officially supported by Oracle, a lot of people have had success with this. A quick Google search will find you a huge list of blogs, howtos, etc. Just keep in mind that your code has to be properly written to support this (e.g. you cannot use plain auto-increment fields for PKs).
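To make the auto-increment caveat concrete, one common workaround is to give each master a different increment/offset so that generated keys never collide. A sketch, assuming two nodes and a JDBC connection with sufficient privileges (`auto_increment_increment` and `auto_increment_offset` are standard MySQL system variables):

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class AutoIncrementOffsets {
    /**
     * Run with offset 1 on node 1 and offset 2 on node 2, so one node
     * generates keys 1,3,5,... and the other 2,4,6,...
     * (Setting these globally requires the SUPER privilege.)
     */
    public static void configure(Connection conn, int offset) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute("SET GLOBAL auto_increment_increment = 2");
            st.execute("SET GLOBAL auto_increment_offset = " + offset);
        }
    }
}
```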
If you have cash to spend, then you can look at some of the more commercial offerings. Oracle DB and SQL Server both support this.
You can also use block-based data replication, such as DRBD (and MySQL on DRBD), to handle the replication between your nodes, but the problem you will always encounter is what happens if the link between the two nodes fails.
The biggest issue you will encounter is how to handle conflicting updates in 2 separate DB nodes. If your data is geographically dependent, then this may not be an issue for you.
Long story short, this is not an easy (or inexpensive) problem to resolve.
It's important to address the possibility of conflicts at the design phase anytime you are talking about replicating databases.
Moving on from that, SAP's Sybase Replication Server will allow you to do just that, either with Sybase databases or 3rd-party databases.
In Sybase's world this is frequently called a corporate roll-up environment. There may be multiple geographically separated databases, each with a subset of data over which it has primary control. At the HQ, there is a server that contains all the various subsets in one repository. You can choose to replicate whole tables, or replicate based on values in individual rows/columns.
This keeps the databases in a loosely consistent state. Transaction rates, geographic separation, and the latency inherent to the network will affect how quickly updates move from one database to another. If a network connection is temporarily down, Sybase Replication Server will queue up transactions and send them as soon as the link comes back up, but the reliability and stability of the replication system will depend on the stability of the network connection.
Again, as others have stated, it's not cheap, but it's relatively straightforward to implement and maintain.
Disclaimer: I have worked for Sybase, and am still part of the SAP family of companies.
My employer has currently given me a project that has me scratching my head about synchronization.
I'm going to first talk about the situation I'm in:
I've been asked to create a PDF report/quotation tool. It takes its data from CSV files: the actual database lives behind old IBM software and, for unknown reasons, they don't want to allow any direct access to it, so instead of copying the data to another database they apparently found it perfectly fine to just fill a folder on the server with loads and loads of CSV files. The tool is to load that data into the application, query it, transform it where needed, do calculations, and then return a PDF file to the end user.
The problem here is that getting, querying, and calculating things takes a fair amount of time. The other problem: they want it to be a WebApp, because the business team does not want to install any new software; they're mostly moving towards doing everything online (since the start of the pandemic). It being a WebApp means that every computation, and the data loading, has to be done by the WebApp.
My question: Is each call to a servlet by a separate user treated as a separate servlet and should I only synchronize the methods on the business logic (getting and using the data); or should I write some code that puts itself in the middle of the servlet, receives a user-id (as reference), that then runs the business-logic in a synchronized-fashion, then receiving data and returning the pdf-file?
(I hope you get the gist of it...)
Everything will run on Apache Tomcat 8 if that helps. The build is Java 11 LTS.
Sorry, no code yet. But I've made some drawings.
With Java web applications, the usual pattern is for the components not to have conversational state (meaning information specific to a particular user's request). If you need to keep state for a user on the server, you can use the HTTP session. With a SPA or Ajax application it's often easier to keep a lot of that kind of state in the browser. The less state you keep on the server, the easier things are as your application scales: you don't have to pin sessions to servers (messing up load balancing) or copy lots of session state across a cluster.
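For example, per-user state can be parked in the HTTP session rather than in fields of the servlet (the attribute name and servlet are made up; Tomcat 8 uses the `javax.servlet` packages shown here):

```java
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

public class QuotationServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
        HttpSession session = req.getSession();            // created on demand
        Long userId = (Long) session.getAttribute("userId");
        // ... build the report for this user; no servlet fields involved
    }
}
```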
For simple (non-reactive) web apps that do blocking i/o, each request-response cycle gets its own dedicated thread from tomcat's pool. That thread delivers the http request to the servlet, handles the business logic and blocks while talking to the database, then carries the http response.
(Reactive webapps are going to be more complex to build, you will need a non-blocking database driver, and you will have fewer choices for databases, so I would steer clear of those, at least for your first web application.)
The thread pool used by Tomcat has to protect itself from concurrent access, but that doesn't impact your code. Likewise there are 3rd-party middle-tier caching libraries that have to deal with concurrency, but you can avoid dealing with it directly. All of your logic is confined to one thread, so it doesn't interfere with processing done by other threads unless there are shared mutable data structures. Those data structures would be the part of the application where synchronization might be one of several possible solutions.
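Here is a sketch of the one place synchronization typically shows up, assuming you cache parsed CSV data in a structure shared by all request threads (names are illustrative):

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * A cache shared by all request threads. ConcurrentHashMap does the
 * locking internally, so the servlet code needs no synchronized blocks.
 */
public class CsvCache {
    private final ConcurrentMap<String, List<String[]>> cache = new ConcurrentHashMap<>();

    public List<String[]> rowsFor(String fileName) {
        // computeIfAbsent guarantees the loader runs at most once per key
        return cache.computeIfAbsent(fileName, this::loadCsv);
    }

    private List<String[]> loadCsv(String fileName) {
        // ... parse the CSV file here; placeholder
        return List.of();
    }
}
```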
Synchronization or other locking schemes are local to one instance of the application. If you want to stand up multiple instances of this application then you need to be aware each one would be locking separately from the others. So for some things it's better to do locking in the database, since that is shared across webapp instances.
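If you do need a lock that holds across several webapp instances, a database row lock is the usual tool. A sketch with plain JDBC (table and column names are invented):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class DbLockExample {
    /** Locks one row until the transaction commits or rolls back. */
    public void updateCounter(Connection conn, long id) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT value FROM counters WHERE id = ? FOR UPDATE")) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                // ... read, compute, then UPDATE the row;
                // other webapp instances block here until commit
            }
        }
        conn.commit();
    }
}
```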
If you can make use of a database to store your data, so that you can rely on the database for caching and indexing, then it seems likely your application should be able to avoid doing a lot of locking.
If you want examples there are a lot of small examples for building web apps using spring at https://spring.io/guides. These are spring boot applications that are self hosted so you can put them together quickly and run them right away.
Going rogue with a database may not be the best course, since databases need looking after by DBAs. My advice is to put together two project plans, one for using a database and one for using the flat files. The flat-file plan will have to allow for addressing issues like caching, indexing data, replicating data from the legacy database, and not having standard tools that generate PDFs from SQL queries. The alternative plan using a database should involve a lot less sorting out of infrastructure and a shorter time until you can get down to cranking out reports.
I am developing a Spring MVC JPA web application. When this application is deployed live, the same DB that my application interacts with will be used by two other .NET and VB applications at the same time. I am managing the concurrency of my JPA app through a version column.
Will there be any issue when running these 3 apps at the same time for the same database? All systems are using the same tables.
I will have to build another application (most probably Spring MVC + JPA) for the same DB in the future. Will there be any issues when running both apps at the same time (in terms of persisting the same tables in both apps etc.)?
Having multiple applications access the same database tables can be (and usually is) a concurrency nightmare. Just adding a version column to tables does not help, because each application may use a different concurrency-management mechanism. Common problems encountered when sharing a database across applications (assuming read-write access for all, and in no particular order):
Concurrency mismatch: Imagine one application using optimistic locking throughout (see the sketch after this list), another using pessimistic locking and a third using no locking at all (since all are maintained by different teams). Even if there is a central architecture group handing out good application-architecture advice to everyone, developers can do whatever they like and end up causing concurrency hell.
Deadlocks: Imagine one application using SERIALIZABLE isolation level for all transactions and performing long-running operations on the database, causing queues, deadlocks and timeouts. Even if other applications do not corrupt data, they may end up seeing too many errors due to deadlocks and timeouts, reducing their usefulness.
Data validity: It is common for developers to use in-memory caches for storing fixed or slow-changing data to save repeated database roundtrips. In a JPA application backed by Hibernate, developers could use a second-level cache. However, if another application updates the database, the cache would be holding stale (and therefore inaccurate) data.
Data integrity: Different applications may use different parts of the data. If all are allowed to update the common data independently, how is referential integrity maintained? What about business rules? Will they have to be duplicated across the applications?
Team communication overhead: Each team has to keep the other teams informed about changes they need to make to the schema so that they do not step on each other's toes. This may even reduce agility if other teams do not agree with changes required by a team or if priorities cannot be aligned.
Schema errors: What happens if someone deletes, renames, moves or archives tables or columns required by an application?
Access control: If the underlying data is sensitive and requires authentication and authorization, access control checks will have to be duplicated across the applications.
Ownership: Who owns the common tables becomes a challenge as each team may have its own stakeholders, roadmap, constraints and priorities.
Portability: If the underlying database (vendor and type) has to be changed some day, all applications will need to be changed.
Auditing and versioning: If common data needs to be audited and/or versioned (multiple versions of data rows), the code will either have to be duplicated across applications or built into the database (which may not be easy within the database, for example, knowing which application user changed a record). If done within the database, portability is affected again because the syntax may vary from one database vendor to another.
Database-specific optimizations: Some scenarios (for example, reporting) may require native queries or other optimizations that may not be possible, or may be too hard, to perform using an ORM technology like JPA. If multiple applications require access to optimized queries, such functionality will either have to be built into the database (stored procedures) or duplicated across the applications (in possibly different ways).
Archival and partitioning: Different applications may have different archival and partitioning requirements. Who makes sure that everyone's needs are met equally?
Standardization: If different teams are allowed to manage the common database objects on their own, how are things such as data dictionary, naming conventions, etc. managed? A table that is used by different teams may have differently (and mostly annoyingly) named columns, constraints, etc.
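To make the optimistic-locking point from the first item concrete: in JPA, a version column only protects writers that go through a versioned entity like the sketch below. A .NET or VB application writing the same row with plain SQL bypasses it entirely. (Entity and column names are illustrative.)

```java
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;

@Entity
public class Customer {
    @Id
    private Long id;

    private String name;

    @Version                 // JPA bumps this on every update and throws
    private long version;    // OptimisticLockException on a conflicting write
}
```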
A better approach is to expose the common database tables using a service. The service can keep things consistent and controlled at all times. A common team handling changes to the service can ensure that unexpected changes are minimized and also communicated to all affected parties in a timely manner.
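A minimal sketch of that service idea, using Spring MVC (which the question already uses); the endpoint, repository interface and entity (reused from the sketch above) are assumptions, not a prescribed design:

```java
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

/** Hypothetical data-access abstraction over the shared tables. */
interface CustomerRepository {
    Customer findById(long id);
}

/**
 * Single owner of the shared tables: the .NET, VB and Java applications
 * all call this HTTP API instead of touching the database directly, so
 * locking, validation and auditing live in exactly one place.
 */
@RestController
public class CustomerSharedDataService {

    private final CustomerRepository repository;

    public CustomerSharedDataService(CustomerRepository repository) {
        this.repository = repository;
    }

    @GetMapping("/customers/{id}")
    public Customer get(@PathVariable long id) {
        return repository.findById(id);
    }
}
```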
I am going to make a desktop application with a MySQL database. My database tables are frequently changing -- almost 60% of the tables -- so I think caching may be a bad idea. Can anyone suggest:
How can I make a fast desktop application with a remote database ?
My language is Java.
The biggest problem with most projects that have performance as their primary concern is that people tend to make exotic choices that end up complicating the project without any real benefit. Unless you have previous hands-on experience with the environment you will be working in, start simple.
Set some realistic goals about how often you have to refresh your data before you start. If your data changes very frequently, eg. every second, does it make sense to try and show the changes in real time? A query every second will make everyone involved miserable.
Use a thread to take care of the queries. You don't need more than one, since any more will only make the race conditions in the database worse.
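For instance, a single-threaded executor serializes all queries while keeping them off the UI thread (a sketch; the DAO call is a placeholder):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DbWorker {
    // One dedicated thread: queries run off the UI thread, never in parallel
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    public Future<String> fetchUserName(long id) {
        return executor.submit(() -> loadUserName(id)); // runs on the DB thread
    }

    private String loadUserName(long id) {
        // ... JDBC/ORM call goes here; placeholder
        return "user-" + id;
    }
}
```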
Design your database layer to be insulated from the rest of the application. Also time your DB-related operations from the beginning in order to track the impact of your optimizations.
Start with Hibernate / ORMLite. Although I cannot talk about ORMLite, I have used (optimized) Hibernate in heavy load environments without any problems. If you have complicated objects you should give it a try, it sure beats using plain JDBC and implementing the cache mechanism yourself.
Find out when you need lazy loading and when it's slowing you down (due to the select n+1 problem).
If you have performance issues, optimize. You don't have to map every single relationship. Use custom SQL in separate methods to get the objects you need when you need them. You can write a query that only returns table ids and afterwards ask Hibernate to load the corresponding objects (see the sketch below).
Optimize your SQL. Avoid joins; use subselects, WHERE id IN (...), etc.
Implement (database) paging if it makes sense.
If all else fails, start using plain SQL. You'll have already written the most complex queries, and you'll know where your bigger bottlenecks are.
You could use a local SQLite to save the less volatile data and talk to the database mainly to get lists of ids and the stuff that you're missing. For example if you have users and orders, you can assume that you will have many more new orders per minute/second than users per hour.
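Here is a sketch of the ids-then-objects and paging ideas mentioned above, using the Hibernate `Session` API (`Order` is an assumed mapped entity):

```java
import java.util.List;
import org.hibernate.Session;

public class OrderQueries {
    /** Page through ids cheaply, then load full objects for one page only. */
    public List<Order> page(Session session, int pageNo, int pageSize) {
        List<Long> ids = session
                .createQuery("select o.id from Order o order by o.id", Long.class)
                .setFirstResult(pageNo * pageSize)   // database-side paging
                .setMaxResults(pageSize)
                .getResultList();
        return session
                .createQuery("from Order o where o.id in (:ids)", Order.class)
                .setParameterList("ids", ids)
                .getResultList();
    }
}
```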
To sum up, set clear performance goals before you start, always use a separate thread for data retrieval, avoid reinventing the wheel and keep it as simple as possible.
Here are some generic approaches to the problem.
0) HW: make sure you don't have bottlenecks in your hardware that you can cheaply remove (adding HW is faster and cheaper than dev hours in most cases).
1) Caching:
Perhaps you can cache (locally, or in a distributed cache like memcache) the 40% of data that tends to be immutable, and invalidate the cache when data gets modified (see the sketch after this list). You should choose the right entities and granularity level for building the keys.
2) Replication:
If the first is too much overhead, you could create slaves of your MySQL server and read from those. Again, you have to know when you can afford to have some stale data.
3) NoSQL:
Moving in that direction, but increasing the dev effort, you could move to some distributed store (take a look at the CAP theorem before making a choice).
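A local-cache sketch for option 1 (a distributed cache such as memcached would follow the same get/invalidate pattern; names are illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Cache the ~40% of data that rarely changes; invalidate on writes. */
public class EntityCache<K, V> {
    private final ConcurrentMap<K, V> cache = new ConcurrentHashMap<>();

    public V get(K key, Function<K, V> loader) {
        return cache.computeIfAbsent(key, loader); // load once, reuse after
    }

    /** Call this from every code path that modifies the entity. */
    public void invalidate(K key) {
        cache.remove(key);
    }
}
```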
Hope it helps
Depends on your database structure and application. You can use an object-relational mapping library like ORMLite and refresh objects loaded from the database in the background with threads. With ORMLite you may also use a LazyForeignCollection to load only the required data in your application.
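A sketch of that lazy-collection idea (the annotations are real ORMLite ones; the entities are made up):

```java
import com.j256.ormlite.dao.ForeignCollection;
import com.j256.ormlite.field.DatabaseField;
import com.j256.ormlite.field.ForeignCollectionField;
import com.j256.ormlite.table.DatabaseTable;

@DatabaseTable(tableName = "accounts")
public class Account {
    @DatabaseField(generatedId = true)
    private int id;

    @DatabaseField
    private String name;

    // eager = false gives a LazyForeignCollection: the orders are fetched
    // from the database only when the collection is actually iterated
    @ForeignCollectionField(eager = false)
    private ForeignCollection<Order> orders;
}
```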
Minimize unnecessary database calls.
If your database fields are changing, you can shift from a relational to a NoSQL database like MongoDB.
You can use multithreading on the server side for data processing, and cluster your application servers. While using multithreading, use it effectively; be aware of the synchronized keyword, as it will degrade performance to some extent.
Follow coding best practices: don't use more instance variables than necessary, and prefer local variables, which also helps with thread safety.
You can use MyBatis as the ORM, also for large queries.
You can cache at the DAO layer, the service layer and even on the client side, but be sure to synchronize with the database; there are different caching solutions you can use.
You can add database indexes for faster retrieval.
Do not use the same service for querying large data sets; break it down into different services, which will help you process in a multithreaded way.
If the application is not a very hard real-time system, you can also use a messaging solution, i.e. asynchronous processing of data.
I have a large set of data (more than 1 TB) that will be accessed by more than 1000 people concurrently. Storing it in one database will make the application really slow, so I was planning to store it across different databases. Does MongoDB support routing between different databases, or should this be done in our application? I am developing in Java and use the Spring framework to interact with Mongo.
Given that the reason for splitting your data into multiple databases is to improve performance, I would suggest sharding a single database rather than splitting it across multiple ones. If location is granular enough and you would like to split load across servers, you could use tag-aware sharding to pin specific locations or location ranges to a specific server. There is a good tutorial on this available here.
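A hedged sketch of that tag-aware (zone) sharding setup from the Java driver, assuming an already-sharded collection keyed on `location`; the shard name, namespace and ranges are made up, and the same commands are usually run from the mongo shell via `sh.addShardToZone()` and `sh.updateZoneKeyRange()`:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class ZoneShardingSetup {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://router-host:27017")) {
            MongoDatabase admin = client.getDatabase("admin");

            // Pin a shard to a zone (shard name is made up)
            admin.runCommand(new Document("addShardToZone", "shard0000")
                    .append("zone", "EU"));

            // Route one shard-key range to that zone
            admin.runCommand(new Document("updateZoneKeyRange", "app.events")
                    .append("min", new Document("location", "DE"))
                    .append("max", new Document("location", "FR"))
                    .append("zone", "EU"));
        }
    }
}
```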
Before following this route I would suggest performing load tests on your application, with your database on the hardware you plan to use for your system. It is worth confirming that you really do need to shard/split data and, if so, the number of servers you may need. If your database is going to be read- rather than write-intensive, it could be that a non-sharded database would handle your load, provided your working set fits in memory.
I need to create project in which there are two databases local and remote. Remote database needs to be synchronized daily with local database reflecting changes made in local database.
I am using JAVA. Database is ORACLE. I have JAVA/JPA code that does CRUD operations on local database.
How do I synchronize changes to the remote database?
I would not do this in Java, but look for native Oracle database synchronization mechanisms/tools. This will
be quicker to implement
be more robust
have faster replication events
be more 'correct'
Please look at some synchronization products. SQL Anywhere from Sybase, where I work, is one such product. You may be able to get a developer/evaluation copy that you can use to explore your options. I am sure Oracle has something similar too.
The basic idea is to be able to track the changes that have happened in the central database. This is typically done by keeping a timestamp for each row. During the synchronization, the remote database provides the last sync time and the server sends to it all rows that have changed since then. Note that the rows that have been deleted in the central database will need some special handling to ensure they get deleted from the remote database.
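A minimal sketch of that timestamp scheme with JDBC (table and column names are invented, and a real system would also need the deleted-row handling mentioned above):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;

public class SyncPull {
    /** Fetch every row changed in the central DB since the last sync. */
    public void pullChanges(Connection central, Connection remote,
                            Timestamp lastSync) throws SQLException {
        try (PreparedStatement ps = central.prepareStatement(
                "SELECT id, name, last_modified FROM customers WHERE last_modified > ?")) {
            ps.setTimestamp(1, lastSync);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    upsert(remote, rs.getLong("id"), rs.getString("name"));
                }
            }
        }
    }

    private void upsert(Connection remote, long id, String name) throws SQLException {
        // UPDATE-or-INSERT into the remote copy; placeholder
    }
}
```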
A true two-way synchronization is a lot more complex. You need to also upload the changes from the remote database to the central one, and some conflict-resolution strategies have to be implemented for the cases when the same row has been changed in incompatible ways in both the remote and the central database.
The general problem is too complex to be explained in a response here, but I hope I have been able to provide some useful pointers.
The problem is that what you are asking can range from moderately difficult (for a simple, not very robust system) to a very complex product that could keep a small team busy for a year depending on requirements.
That's why the other answers said "Find another way" (basically)
If you have to do this for a class assignment or something, it's possible but it probably won't be quick, robust or easy.
You need server software on each side, a way to translate unknown tables to data that can be transferred over the wire (along with enough meta-data to re-create it on the other side) and you'll probably want to track database changes (perhaps with a flag or timestamp) so that you don't have to send each record over every time.
It's a hard enough problem that we can't really help much. If I HAD to do that for a customer, I'd quote him at least a man year of work to get it even moderately reliable.
Good Luck
Oracle has sophisticated replication functionality to synchronise databases. Find out more.
From your comments it appears you're using Oracle Lite: this supports replication, which is covered in the Lite documentation.
Never worked with it, but http://symmetricds.codehaus.org/ might be of use.