Risk of data contamination due to in-memory processing - Java

I am developing a Java application based on the Spring Framework.
It:
1. Connects to a MySQL database
2. Reads data from MySQLTable1 into POJOs
3. Manipulates it (updates, deletes) in memory
4. Inserts the result into a Netezza database table
The above four steps are done for each client (A, B, C) every hour.
I am using a Spring JdbcTemplate to get the data like this:
SELECT COL1,COL2,COL3 FROM MySQLTable1 WHERE CLIENTID='A' AND COL4='CONDITION'
and read each record into a POJO before writing it to the Netezza table.
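For illustration, a minimal sketch of that read step, assuming a hypothetical MySqlRecord POJO and string-typed columns (the question does not show the actual class):

    // ClientDataReader.java -- reads one client's rows into POJOs.
    import java.util.List;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class ClientDataReader {

        private final JdbcTemplate jdbcTemplate;

        public ClientDataReader(JdbcTemplate jdbcTemplate) {
            this.jdbcTemplate = jdbcTemplate;
        }

        public List<MySqlRecord> readForClient(String clientId) {
            // Parameterized query: one POJO per row via a RowMapper lambda.
            return jdbcTemplate.query(
                    "SELECT COL1, COL2, COL3 FROM MySQLTable1"
                            + " WHERE CLIENTID = ? AND COL4 = 'CONDITION'",
                    (rs, rowNum) -> new MySqlRecord(
                            rs.getString("COL1"),
                            rs.getString("COL2"),
                            rs.getString("COL3")),
                    clientId);
        }
    }

    // MySqlRecord.java -- the hypothetical POJO holding one row.
    public class MySqlRecord {
        private final String col1, col2, col3;

        public MySqlRecord(String col1, String col2, String col3) {
            this.col1 = col1;
            this.col2 = col2;
            this.col3 = col3;
        }

        public String getCol1() { return col1; }
        public String getCol2() { return col2; }
        public String getCol3() { return col3; }
    }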
There are going to be multiple instances of this application running every hour through a scheduler.
So Client A and Client B can be running concurrently, but each SELECT will be unique;
I mean the data for:
SELECT COL1,COL2,COL3 FROM MySQLTable1 WHERE CLIENTID='A' AND COL4='CONDITION'
will be different from
SELECT COL1,COL2,COL3 FROM MySQLTable1 WHERE CLIENTID='B' AND COL4='CONDITION'
But remember, all of this is held in memory as POJOs.
My questions are:
Is there a risk of data contamination?
Is there a need to implement database transactions using the Spring transaction manager?
Does my application really need to use something like Spring Batch to deal with this?
I appreciate your thoughts and feedback.
I know this is a perfect scenario for using an ETL tool, but that is out of scope.

Is there a risk of data contamination?
It depends on what you are doing with your data, but I don't see how you can have data contamination if every instance is independent; you just have to make sure that instances running concurrently are not working on the same data (client ID).
Is there a need to implement database transactions using the Spring transaction manager?
You will probably need a transaction for the insertion into the Netezza table. You certainly want your data to be in a consistent state in the result table; if an error occurs in the middle of the process, you'll probably want to roll back everything that was inserted before the failure. Regarding the transaction manager, you don't specifically need the Spring transaction manager, but since you are already using Spring it is a good option.
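A minimal sketch of that with programmatic transaction management via Spring's TransactionTemplate; NetezzaWriter and the table name are placeholders, and MySqlRecord is carried over from the read sketch above:

    import java.util.List;
    import org.springframework.jdbc.core.JdbcTemplate;
    import org.springframework.transaction.PlatformTransactionManager;
    import org.springframework.transaction.support.TransactionTemplate;

    public class NetezzaWriter {

        private final JdbcTemplate netezzaJdbcTemplate;
        private final TransactionTemplate transactionTemplate;

        public NetezzaWriter(JdbcTemplate netezzaJdbcTemplate,
                             PlatformTransactionManager netezzaTxManager) {
            this.netezzaJdbcTemplate = netezzaJdbcTemplate;
            this.transactionTemplate = new TransactionTemplate(netezzaTxManager);
        }

        public void writeAll(List<MySqlRecord> records) {
            // All inserts commit together; any exception rolls everything back.
            transactionTemplate.execute(status -> {
                for (MySqlRecord r : records) {
                    netezzaJdbcTemplate.update(
                            "INSERT INTO NetezzaTable1 (COL1, COL2, COL3) VALUES (?, ?, ?)",
                            r.getCol1(), r.getCol2(), r.getCol3());
                }
                return null;
            });
        }
    }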
Does my application really need to use something like Spring Batch to deal with this?
Does it really need it? Probably not, but Spring Batch was made for this kind of application, so it might help you structure your application (Spring Batch provides reusable functions that are essential in processing large volumes of records, including logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management). Everything can be done without the framework, and it might be overkill for a really small application. But in the end, if you need those features, you'll probably want to use it.

Spring Batch is an ETL framework, so it would be a good fit for this use case and also a good alternative to a commercial ETL tool.
Is there a risk of data contamination? Clients A and B read separate data, so they can never interfere with each other by accidentally reading or writing the same data. The risk would arise if two clients with the same ID were created, but that is not the case here.
Is there a need to implement database transactions using the Spring transaction manager?
There is no mandatory need to do that, although programmatic transaction management has many pitfalls and is best avoided. Spring Batch would manage transactions for you, as well as other aspects such as paging.
Does my application really need to use something like Spring Batch to deal with this? There is no mandatory need, although it would help a lot, especially with paging. How will you handle queries that return thousands of rows? Without a framework, this has to be handled manually.
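To make that concrete, here is a rough, hypothetical sketch of the scenario from the question as a chunk-oriented Spring Batch job; the config class name, the hard-coded client ID, and the Netezza table are all assumptions, and MySqlRecord is the POJO from the question's read step:

    import javax.sql.DataSource;
    import org.springframework.batch.core.Job;
    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
    import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.database.JdbcBatchItemWriter;
    import org.springframework.batch.item.database.JdbcCursorItemReader;
    import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
    import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    @EnableBatchProcessing
    public class ClientCopyJobConfig {

        @Bean
        public JdbcCursorItemReader<MySqlRecord> reader(DataSource mySqlDataSource) {
            // Streams rows with a cursor instead of loading them all into memory.
            return new JdbcCursorItemReaderBuilder<MySqlRecord>()
                    .name("mySqlReader")
                    .dataSource(mySqlDataSource)
                    .sql("SELECT COL1, COL2, COL3 FROM MySQLTable1"
                            + " WHERE CLIENTID = ? AND COL4 = 'CONDITION'")
                    .preparedStatementSetter(ps -> ps.setString(1, "A")) // would come from job parameters
                    .rowMapper((rs, rowNum) -> new MySqlRecord(
                            rs.getString("COL1"), rs.getString("COL2"), rs.getString("COL3")))
                    .build();
        }

        @Bean
        public JdbcBatchItemWriter<MySqlRecord> writer(DataSource netezzaDataSource) {
            return new JdbcBatchItemWriterBuilder<MySqlRecord>()
                    .dataSource(netezzaDataSource)
                    .sql("INSERT INTO NetezzaTable1 (COL1, COL2, COL3)"
                            + " VALUES (:col1, :col2, :col3)")
                    .beanMapped() // maps :col1 etc. to MySqlRecord getters
                    .build();
        }

        @Bean
        public Step copyStep(StepBuilderFactory steps,
                             JdbcCursorItemReader<MySqlRecord> reader,
                             JdbcBatchItemWriter<MySqlRecord> writer) {
            // Chunk-oriented: read 100 rows, write them in one transaction, repeat.
            return steps.get("copyStep")
                    .<MySqlRecord, MySqlRecord>chunk(100)
                    .reader(reader)
                    .writer(writer)
                    .build();
        }

        @Bean
        public Job copyJob(JobBuilderFactory jobs, Step copyStep) {
            return jobs.get("copyJob").start(copyStep).build();
        }
    }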

Related

Should I use MongoDB with entity relations to be non-blocking end to end in my Spring 5 project?

I started a Spring WebFlux project some time ago; the goal of this project is to offer a REST API which collects its data from a database.
I currently take a reactive approach thanks to the Reactor project included in the Spring 5 release, and I have created reactive controllers. I need to persist normalized data with relations in my database, which is why I use PostgreSQL.
At the time I am writing these lines, no reactive programming support is provided for JDBC, and therefore none for JPA. But my controllers are only truly non-blocking if the other components they work with are also non-blocking. If I write Spring WebFlux controllers that still depend on blocking repositories, then my reactive controllers will be blocked waiting for them to produce data.
I would like to be non-blocking end to end, so I am wondering about moving to one of the NoSQL databases supported by Spring Data: Cassandra or MongoDB. I don't think Cassandra really fits my needs; I would need to rewrite my entities and rethink my database structure to be query-oriented.
I read that it is possible to keep some relations between my entities with MongoDB, especially with the latest 4.0 version, without completely refactoring my db schema. But I wonder which option is worth it:
Switch to MongoDB even if I need to keep relational data
Keep fetching data in a blocking fashion and then translate it into a reactive type as soon as possible
Forget Spring WebFlux and go back to Spring MVC (probably not)
Thank you for any help and advice!
I think it depends on your context. Moving to a document db might not be a good fit for your data, as it seems fully relational, unless you are sure you can model your data as a bunch of aggregates; otherwise you might end up with other problems, such as transactional consistency when checking consistency rules across your models. As a first option I would try to fetch the data on another thread, perhaps wrapping the call in an RxJava observable. Although it is still a blocking call, it will not block the main thread, and you will be able to make better use of resources.
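The same idea can be sketched in Reactor terms (which the project already pulls in via Spring 5): wrap the blocking call in a Mono and subscribe it on a scheduler meant for blocking work. UserRepository and User here are placeholders for whatever blocking repository is in play:

    import java.util.Optional;
    import reactor.core.publisher.Mono;
    import reactor.core.scheduler.Schedulers;

    public class UserHandler {

        // Placeholder for a blocking repository, e.g. Spring Data JPA.
        interface UserRepository {
            Optional<User> findById(Long id);
        }

        static class User { }

        private final UserRepository userRepository;

        public UserHandler(UserRepository userRepository) {
            this.userRepository = userRepository;
        }

        public Mono<User> findUser(Long id) {
            // The blocking JDBC call runs on boundedElastic (elastic on older
            // Reactor versions), so WebFlux event-loop threads stay free.
            return Mono.fromCallable(() -> userRepository.findById(id)
                            .orElseThrow(IllegalArgumentException::new))
                    .subscribeOn(Schedulers.boundedElastic());
        }
    }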
Those are my 2 cents.
Regards

Should I use Akka for the periodic tasks

I have a terminal server monitoring project. In the backend I use Spring MVC, MyBatis and PostgreSQL. Basically, I query session information from the DB, send it back to the front end and display it to users. But there are some large queries (like counting total users, total sessions, etc.) which slow down the system when a user opens the website, so I want to run these queries as asynchronous tasks so that the website opens quickly rather than waiting for the query. Also, I would like to check the terminal server state from the DB periodically (every hour), and if a terminal server fails or its average load is too high, notify the admins. I don't know what I should use, maybe Akka, or some other way to do these two jobs (1. run the large queries asynchronously, 2. run periodic queries)? Please help me, thanks!
You can achieve this using Spring and caching where necessary.
If the data you're displaying is not required to be "real-time" but can be "near real-time", you can read the data from the DB periodically and cache it. Your app then reads from the cache.
There are different approaches you can explore.
You can try to create a materialized view in PostgreSQL which will hold the statistics data you need. Depending on your requirements, you have to decide how to handle refresh intervals etc.
Another approach is to use an application-level cache; you can leverage Spring for that (Spring docs). You can populate the cache on start-up and refresh it as necessary.
The task that runs every hour can again be implemented with Spring's @Scheduled annotation (Spring docs).
To answer your question: don't use Akka; you have all the tools necessary to achieve this in the Spring ecosystem.
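A minimal sketch of that combination, with StatsDao standing in as a hypothetical name for whatever runs the heavy queries (remember to put @EnableScheduling on a configuration class):

    import java.util.concurrent.atomic.AtomicReference;
    import org.springframework.scheduling.annotation.Scheduled;
    import org.springframework.stereotype.Component;

    @Component
    public class SessionStatsCache {

        // Placeholders for the real query layer.
        public interface StatsDao {
            Stats loadStats();
        }

        public static class Stats {
            public final long totalUsers;
            public final long totalSessions;

            public Stats(long totalUsers, long totalSessions) {
                this.totalUsers = totalUsers;
                this.totalSessions = totalSessions;
            }
        }

        private final StatsDao statsDao;
        private final AtomicReference<Stats> latest =
                new AtomicReference<>(new Stats(0, 0));

        public SessionStatsCache(StatsDao statsDao) {
            this.statsDao = statsDao;
        }

        // The heavy queries run off the request path every 10 minutes;
        // page loads just read the cached copy.
        @Scheduled(fixedRate = 600_000)
        public void refresh() {
            latest.set(statsDao.loadStats());
        }

        public Stats current() {
            return latest.get();
        }
    }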
Akka is not very relevant here; it is for an event-driven programming model which deals with concurrency issues to build highly scalable multithreaded applications.
You can use the Spring task scheduler to run heavy queries periodically. If you want to keep it simple, you can solve your problem by simply storing data like total users, total sessions, etc., in the global application context, and periodically updating it from the database using the Spring scheduler. You can also store the same data in a separate database table so that it can easily be loaded at initialization time.
I really don't see why you would need memcached, materialized views, WebSockets or other heavy technologies and frameworks for caching a small set of data. All you need is to maintain a set of global parameters in your application context and keep them updated by a scheduled task, as frequently as desired.

How to handle concurrent SQL updates, given that the database structure can change at runtime

I am developing a Spring MVC application.
For now I am using InnoDB MySQL, but I have to develop the application to support other databases as well.
Can anyone please suggest how to handle concurrent SQL updates on a single record?
Suppose two users are trying to update the same record; how should such a scenario be handled?
Note: my database structure depends on some configuration (it can change at runtime), and my Spring controller is a singleton.
Thanks.
Update:
Just for reference, I am going to implement versioning as in https://stackoverflow.com/a/3618445/3898076.
Transactions are the way to go when it comes to concurrent SQL updates; in Spring you can use a transaction manager.
As for the database structure: as far as I know, MySQL does not support transactions for DDL commands, that is, if you change the structure concurrently with updates you are likely to run into problems.
To handle multiple users working on the same data, you need to implement a manual "lock" or "version" field on the table to keep track of the last update.
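A minimal sketch of such a version column with plain JDBC; the table and column names are made up, and because it is a single standard UPDATE it stays portable across databases:

    import org.springframework.dao.OptimisticLockingFailureException;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class RecordUpdater {

        private final JdbcTemplate jdbcTemplate;

        public RecordUpdater(JdbcTemplate jdbcTemplate) {
            this.jdbcTemplate = jdbcTemplate;
        }

        public void updateName(long id, long expectedVersion, String newName) {
            // The update succeeds only if nobody else bumped the version meanwhile.
            int rows = jdbcTemplate.update(
                    "UPDATE my_table SET name = ?, version = version + 1"
                            + " WHERE id = ? AND version = ?",
                    newName, id, expectedVersion);
            if (rows == 0) {
                // Another user updated the record first; caller can reload and retry.
                throw new OptimisticLockingFailureException(
                        "Record " + id + " was modified concurrently");
            }
        }
    }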

Java web application memory handling

I have a Java web application which uses Hibernate for storing data in the database and retrieving it.
The strategy I am currently using is to load everything from the database into the application at start-up, and to save/update it in the database as the user interacts with the application.
I have also kept track of the transaction history for each user as part of the business logic (so this transaction history is all loaded at application start-up).
The problem I can see is that I shouldn't load the full transaction history for every user: if there is a lot of transaction history and users don't necessarily need to see it, that could use up a lot of memory, so it is not efficient.
I was wondering whether there is something similar to what a PHP script can do, which is to query the database only when the user requests to see the transaction history, so that it is not using server resources (aside from querying the database). Or, what are some suggestions/comments regarding what I am facing right now?
Thank you.
Query Hibernate when you need a given piece of information and let Hibernate manage writing it back to the database. This will allow Hibernate to manage the caching.
Note that when using Hibernate, you should let Hibernate manage the data completely. Do not add or change data yourself using raw SQL.
If you are using a modern container, you should consider migrating to JPA, as it is the standard in Java EE containers, allowing you to be more flexible when you need to scale. JPA is very close to Hibernate, but it is an API, not an implementation, so you have more than one implementation to choose from.
Why not query Hibernate for every request that comes in and release the session after the response? This is a common approach.
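For instance, a query-on-demand DAO might look roughly like this; TransactionHistory is a hypothetical mapped entity, and pagination keeps memory use bounded even for users with long histories:

    import java.util.List;
    import org.hibernate.Session;
    import org.hibernate.SessionFactory;

    // TransactionHistory is the (hypothetical) mapped entity for one history row.
    public class TransactionHistoryDao {

        private final SessionFactory sessionFactory;

        public TransactionHistoryDao(SessionFactory sessionFactory) {
            this.sessionFactory = sessionFactory;
        }

        // Fetch one user's history only when the page is actually requested,
        // instead of loading every user's history at start-up.
        public List<TransactionHistory> findByUser(long userId, int page, int pageSize) {
            try (Session session = sessionFactory.openSession()) {
                return session.createQuery(
                                "from TransactionHistory t where t.userId = :userId"
                                        + " order by t.createdAt desc",
                                TransactionHistory.class)
                        .setParameter("userId", userId)
                        .setFirstResult(page * pageSize)
                        .setMaxResults(pageSize)
                        .list();
            }
        }
    }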

Can I use Hibernate for data-centric applications?

I was going through a Hibernate tutorial where they say that Hibernate is not suitable for data-centric applications. I am very impressed by the 'object-oriented structure' it gives to the program, but my application is very much data-centric (it fetches and updates a huge number of records, though I don't use any stored procedures). Can't I use Hibernate? Are there any wrappers written over Hibernate which I could use for my application? Any help is appreciated.
I am not sure about the specific meaning of the phrase data-centric; aren't all database applications data-centric? However, if you do process tons of data, Hibernate may not be the best choice. Hibernate is best at representing object models mapped to the database, and it may have a role in any application, but for ETL (extract/transform/load) tasks you may need to write very efficient SQL by hand.
In principle you can, but it tends to be slow. Hibernate more or less creates an object for every row retrieved from the database; if you do this with large volumes of data, performance takes a serious hit. Also, updates of many rows using a single UPDATE have only very basic support.
A wrapper won't help, at least not with the object-creation issue.
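For what it's worth, the "basic support" mentioned above is bulk HQL: one statement pushed to the database instead of loading and dirty-checking each entity, at the cost of bypassing the session cache and entity lifecycle events. The Order entity and its status values here are made up:

    import org.hibernate.Session;
    import org.hibernate.SessionFactory;
    import org.hibernate.Transaction;

    public class BulkOrderUpdate {

        private final SessionFactory sessionFactory;

        public BulkOrderUpdate(SessionFactory sessionFactory) {
            this.sessionFactory = sessionFactory;
        }

        // Closes all stale orders with a single UPDATE; Order is a hypothetical entity.
        public int closeStaleOrders() {
            try (Session session = sessionFactory.openSession()) {
                Transaction tx = session.beginTransaction();
                int updated = session.createQuery(
                                "update Order o set o.status = 'CLOSED'"
                                        + " where o.status = 'STALE'")
                        .executeUpdate();
                tx.commit();
                return updated;
            }
        }
    }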
There are many advantages to using Hibernate. When you get your object model right, there is, as a developer, a lot of appeal in interacting with the database via objects, but in practice I have found that Hibernate is great initially and becomes very frustrating when you run into issues like performance and fault-finding.
When it comes to the decision on the DA (data access) layer, I ask myself this question:
Am I writing an application which is required to run on different databases?
If the answer is yes, then I will consider an ORM like Hibernate.
If it's no, then I will normally just use JDBC, usually via Spring.
I feel that interacting with the database via JDBC is a lot more transparent and makes it easier to find faults and tune performance.
