Which is more efficient for querying the database: filtering in SQL or in Java?

Should we query the table with more filtering, by adding multiple conditions/WHERE clauses to the SQL query, and get only the specific data? Or should we pull all the data and do the filtering in our Java class?
I am looking for efficient coding practices.
Example:
A table with columns Id, Name, Place.
I need to pull the list of ids where Place is from placesList and Name matches namesList.
1)
@SqlQuery("SELECT id "
        + "FROM Person p "
        + "WHERE p.name IN (<name_list>) "
        + "AND p.place IN (<place_list>) "
        + "ORDER BY p.id ASC")
public List<Long> getIds(@BindIn("name_list") List<String> nameList,
                         @BindIn("place_list") List<String> placeList);
or
2)
@SqlQuery("SELECT id, name, place FROM Person p")
public List<Person> getAll();
then apply Java 8 stream filters to the result.
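For reference, the option 2 filtering in the JVM would look something like the sketch below. This is a minimal sketch only; Person and its accessors are assumed, and note that filtering on name and place requires fetching those columns too, not just the id:
import java.util.List;
import java.util.stream.Collectors;

// Option 2: filter the fully fetched rows in the application.
List<Long> ids = persons.stream()
        .filter(p -> namesList.contains(p.getName()))
        .filter(p -> placesList.contains(p.getPlace()))
        .map(Person::getId)
        .sorted()
        .collect(Collectors.toList());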
Note: I used name and place in the above example for easy understanding. In real life the data is huge, with many fields and many rows in the table, and the lists used to filter are also large.

The best approach is to query with the required filters on the database side. This reduces the amount of data fetched back and forth between the application and the database, and it cuts time spent on I/O operations (transferring a large amount of data over the network adds latency).
It also reduces the memory overhead of processing a large amount of data on the application side.
In addition, when you query and filter on multiple fields you can add indexes (where necessary) on those fields, which will improve query fetch time.
Hope this answers it.

You always want to perform things in the database if possible. You want to avoid transferring data from the database to your application, using up memory, only to discard most of it there.
Databases are very efficient at these things, so you'll want to use them to their full extent.

Query the database directly instead of downloading the data into your Java application; it avoids the latency of moving the data from the database to your application.
But be very careful when using user input in the filters. Make sure you have sanitized (or, better, parameterized) the user input before using it in a query, to avoid SQL injection.
If you are worried about security more than performance, then filter the data in the Java app (if the data is not massive in size).
But I strongly recommend filtering the data on the database itself, with the necessary safeguards in place.
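For example, a minimal sketch of a parameterized query with plain JDBC; the surrounding connection, the userName/userPlace inputs, and the ids list are assumed:
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Bound parameters are escaped by the driver, which prevents SQL injection.
String sql = "SELECT id FROM Person WHERE name = ? AND place = ?";
try (PreparedStatement ps = connection.prepareStatement(sql)) {
    ps.setString(1, userName);   // user-supplied values are bound,
    ps.setString(2, userPlace);  // never concatenated into the SQL string
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            ids.add(rs.getLong("id"));
        }
    }
}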

Related

Fetch large volumes of data from DynamoDB?

I am developing a Spring Boot REST API which has to fetch a large volume of data (100-200k records) from a DynamoDB table based on search conditions and return the response to the API consumer without loading the entire object list in memory. With a SQL database I have used JdbcTemplate's queryForStream method for a similar requirement, but for a NoSQL database like DynamoDB I could not find a similar method for streaming the data.
One sample scenario is to fetch all passengers who booked a business class ticket on the Christmas weekend from the xyz airline DynamoDB database.
note: Edited for clarity.
Reading GBs of data per request from DynamoDB does not seem scalable. Does the end user really require all that data, and what is the purpose?
DynamoDB can only return 1 MB per request, so for a single end-user API call you would have to make many paginated requests to DynamoDB.
If you are using Scan then your solution is not scalable at all, and I would suggest considering a different database.
This is not a good use case for REST in general. Have you considered storing the query result in S3?
Your REST API would return a task id that the client can then use to check the progress of the query and eventually download the result.
This way you get practically unlimited scalability and can run huge numbers of parallel DynamoDB scans or queries.
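A minimal Spring sketch of that task-id pattern; everything here (the endpoint paths, the SearchConditions type, and the in-memory status map, which a real system would replace with a persistent store) is illustrative only:
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.springframework.web.bind.annotation.*;

@RestController
public class ExportController {
    private final Map<String, String> taskStatus = new ConcurrentHashMap<>();
    private final ExecutorService executor = Executors.newFixedThreadPool(4);

    @PostMapping("/exports")
    public String startExport(@RequestBody SearchConditions conditions) { // SearchConditions is hypothetical
        String taskId = UUID.randomUUID().toString();
        taskStatus.put(taskId, "RUNNING");
        executor.submit(() -> {
            // run the DynamoDB query/scan here, write the result to S3,
            // and record the S3 location alongside the status
            taskStatus.put(taskId, "DONE");
        });
        return taskId;
    }

    @GetMapping("/exports/{taskId}")
    public String status(@PathVariable String taskId) {
        return taskStatus.getOrDefault(taskId, "UNKNOWN");
    }
}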
The fastest way to do this is going to be a parallel Scan operation. Assuming you have sufficient read capacity on the DynamoDB table, this will give you very fast results.
See "parallel scan using Java" at https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ScanJavaDocumentAPI.html for an example.
DynamoDB
Think of DDB the way Amazon uses it for e-commerce: small, sub-100-item lists of paginated data, where the items are usually small in size but must be easy to update.
In that model you would never need to store or fetch GBs of data from the tables.
Your needs as a 'how might we...' question
How might we store GBs of data in AWS and retrieve that data quickly?
AWS Best Practices
Before we dive into solving the 'hmw' question above, we need to understand some core tenets of AWS:
operational excellence
security
reliability
performance efficiency
cost optimisation
sustainability
AWS calls these tenets, or 'pillars', its Well-Architected Framework.
You can read more about these here https://aws.amazon.com/architecture/well-architected/
Most of them are as described: monitoring, security, reliability, performance, cost efficiency, and being computationally cheap (which also means environmentally friendly).
A sprawling buffet of solutions
Storage
Your need is storing GBs of data.
It still depends on what you are storing, but for most storage needs you would use S3.
To keep things 'compliant' with the Well-Architected Framework you would enable encryption (in transit and at rest), block public bucket access, and so on.
To make everything cost efficient, think about when you want to access this data. If it is accessed regularly you will need 'hot' storage; otherwise the 'cold' S3 storage classes are cheaper, but you trade away retrieval time.
Notable mentions
If you have specific data science needs you should check out Data Lakes (still S3 under the hood), Glue, and Athena (a query layer on top of S3).
If you are storing text-based data and require near-instant searching and retrieval, use OpenSearch - it is very useful for chat-related data.
Data storage
This depends on your app, but most people still keep a DynamoDB table that acts as a map for S3 queries.
DDB is query-optimised and extremely performant once you fully understand your data queries, i.e. your access patterns.
Design your table around your access patterns, not around entities.
e.g.
Option 1: One table
PK                  SK
type#order          timestamp
type#transaction    timestamp
...
Option 2: Multiple entity-based tables
Order table:
PK    SK           Attr
id    timestamp    productIDs
Transactions table:
PK    SK           Attr
id    timestamp    amount, orderId
Products table:
PK    SK
id    category
The one-table design simplifies retrieving the data in a small number of requests, but you do need to play with your table design until it's just right.
My recommendation: be creative and mix and match the table styles to your needs. Entity-based tables are still useful in most apps.
Also expect to redo your tables once you find out new things.
It's crucial here that you use an infrastructure-as-code tool to make it easier to tear down and recreate tables - CDK is great for this.
Remember that you are billed per read and write unit. This is where a well-designed table (one that matches your access patterns) helps you make concise queries at a low cost, as the query sketch below shows.
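A minimal sketch (AWS SDK for Java v2; the table name is an assumption) of fetching all order items from the one-table design above with a single key-condition query:
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

DynamoDbClient dynamo = DynamoDbClient.create();
QueryRequest request = QueryRequest.builder()
        .tableName("AppTable") // assumed table name
        .keyConditionExpression("PK = :pk")
        .expressionAttributeValues(Map.of(
                ":pk", AttributeValue.builder().s("type#order").build()))
        .build();
QueryResponse response = dynamo.query(request);
// each item is a Map<String, AttributeValue> keyed by attribute name
response.items().forEach(System.out::println);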
Data retrieval
This is where you have some options, depending on your app.
Again, I recommend storing big items in S3, not DynamoDB; that way it is relatively easy to download GBs of data from S3.
You can also store data in optimised formats such as Parquet.
If you choose to use DynamoDB as a hash map for the S3 bucket, you can quickly find your files and their locations and then place those in a queue, so that retrieval happens in the background.
You can also copy files within the bucket to a job folder, zip the data, and provide the user with the URL to that zip.
You can also use DataSync to copy across buckets.
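For handing out that download URL, a presigned S3 URL is the usual approach. A minimal sketch with the AWS SDK for Java v2, where the bucket name, key, and expiry are assumptions:
import java.time.Duration;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.presigner.S3Presigner;
import software.amazon.awssdk.services.s3.presigner.model.GetObjectPresignRequest;
import software.amazon.awssdk.services.s3.presigner.model.PresignedGetObjectRequest;

// Generate a time-limited download link for the zipped job result.
try (S3Presigner presigner = S3Presigner.create()) {
    GetObjectRequest getObject = GetObjectRequest.builder()
            .bucket("my-export-bucket")   // assumed bucket
            .key("jobs/1234/result.zip")  // assumed key
            .build();
    GetObjectPresignRequest presignRequest = GetObjectPresignRequest.builder()
            .signatureDuration(Duration.ofMinutes(15))
            .getObjectRequest(getObject)
            .build();
    PresignedGetObjectRequest presigned = presigner.presignGetObject(presignRequest);
    System.out.println(presigned.url()); // hand this URL back to the caller
}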
Final notes
It sounds to me like you are storing data in AWS and downloading it for processing.
Most teams approach this by moving both the processing and the storage to AWS and running the whole pipeline in the cloud.

How to query an in-memory cache on multiple keys in Java?

I have some persistent data in an RDBMS and in CSV files (they are independent objects; I mention it only because they live in different mediums).
I cannot rely on what the RDBMS provides, because I do not want to make a trip to the database for the next hour, even if the data gets stale. I need to store the data in memory for performance, query it (read only, no other operations) on multiple columns, and refresh the data every hour.
In my case, what is a good way to store and query in-memory objects, other than implementing my own object store and query methods? For instance, can you provide an example/link to replace an SQL query such as
select * from employees where emplid like '%input%' or surname like '%input%' or email like '%input%';
Sorry for the dummy query, but it shows what kinds of queries are possible.
Go find yourself a key-value store implementation with the features you want. Use your query string as the key and the result as the value. https://github.com/ben-manes/caffeine has quite a few features, including record timeouts (like an hour).
For my own work, I use an LRU key store (limited to X entries) containing objects with timeout information, and I manually decide whether a record is stale before I use it. An LRU cache is basically a linked list that moves "read" records to the head of the list and drops the tail when records are added beyond the maximum desired size. This keeps the popular records in the store longer.
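A minimal sketch of the Caffeine suggestion above; the one-hour expiry mirrors the hourly refresh in the question, while the Employee value type and the runSearch loader are assumptions:
import java.util.List;
import java.util.concurrent.TimeUnit;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

// Query string -> result rows, with entries expiring an hour after they are written.
Cache<String, List<Employee>> cache = Caffeine.newBuilder()
        .maximumSize(10_000)                  // bounded, LRU-like eviction
        .expireAfterWrite(1, TimeUnit.HOURS)  // matches the hourly refresh
        .build();

// Load on a cache miss, then serve repeats from memory.
List<Employee> rows = cache.get(query, q -> runSearch(q)); // runSearch is an assumed loader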

Is an SQL IN query better for performance, or the Java containsAll method?

I have a scenario where the user will select a bulk input of up to 100k entries, and I need to validate that this data belongs to the user and satisfies X other conditions. Should I use a complex Oracle SQL query - a composite IN (id, column) - to validate it, or
should I fetch the data for this user satisfying the conditions into application memory and use List.containsAll - that is, first get all the data (with all the other conditions) for this particular user into a dbList, and then validate dbList.containsAll(inputList)?
Which one is better performance-wise: a composite IN in the DB to send the bulk input, or fetching the data and validating it with containsAll?
I tried running the SQL query in the SIT environment; it takes around 70-90 seconds, which is too slow. It would be better in prod, but I still feel the DB has to sort through a huge amount of data, even though it is indexed by user id.
In the DB I am using count(*) with IN, like below:
SQL query:
select count(*) from user_table where user_id = 'X123' and <X conditions> and user_input IN (
('id','12344556'),
('id','789954334'),
('id','343432443'),
('id','455543545')
------- 50k entries
);
There are also other AND conditions for validating that the user_input values are valid entries.
Sample Java code:
List<String> userInputList = request.getInputList();
List<String> userDBList = sqlStatement.execute(getConditionedQuery);
boolean validData = userDBList.containsAll(userInputList);
getConditionedQuery = "select user_backend_id from user_table where user_id = 'X123' AND <X complex conditions>";
The SQL query with the composite IN condition takes around 70-90 seconds in lower environments, whereas the Java containsAll code looks much faster.
Incidentally, I don't want to use a temp table and execute a procedure, because bulk-inserting the input into the DB is again a hassle. I am using the ATG framework and the module is RESTful, so performance matters most here.
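One side note on the containsAll approach: List.containsAll against another large List is O(n×m), so if the in-JVM check wins, copying the DB results into a HashSet first makes each membership test O(1). A minimal sketch reusing the variable names from the question:
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// HashSet lookups are O(1), so this runs in O(n + m) instead of O(n * m).
Set<String> dbSet = new HashSet<>(userDBList);
boolean validData = dbSet.containsAll(userInputList);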
I personally believe that you should apply all filters on the database side, for several reasons. First, exchanging that much data over the network consumes unnecessary bandwidth. Second, bringing all that data into the JVM and processing it consumes more memory. Third, databases can be tuned and optimised for complex queries. Talk to your DBA: give him the query and ask him to run an analysis. The analysis will tell you whether you need to add any indexes to optimise the query.
Also, contrary to your belief, my experience says that if a query takes 70-90 seconds in SIT, it will take MORE time in prod. Although the prod machines are much faster, the amount of data in prod is much, much higher than in SIT, so it will take longer. But that does not mean you should haul the data over the network and process it in the JVM. Plus, the JVM's heap memory is much smaller than database memory.
Also, as we move to cloud-enabled, containerised application architectures, network bandwidth is billed. For example, if your application is in the cloud and the database is on premise, imagine the amount of data you would move back and forth just to finally filter 10 rows out of a million.
I recommend that you write a good query, optimise it, and process as many conditions as possible on the database side. Hope it helps!
In general it's a good idea to push as much of the processing as possible to the database. Even though it might look like a bottleneck, the database is generally well optimised and can work over large amounts of data faster than you can in the application.
For read queries like the one you're describing, you can even offload the work to read replicas so they don't overwhelm the master.

Improving application performance with large queries

I was recently looking into caching to improve website performance and thought of implementing it in my application.
I have a scenario where I execute a large query to get a list of orders based on order status and customer/supplier; other filters can be selected at runtime by the user. The query is also complex, with a lot of joins between tables.
Can we use caching (memcached/Redis) to improve performance in this area, by storing the order objects in the cache with the order id as key and the order object as value when they are created, and writing some logic to fetch the list of orders with the runtime filters applied? Or do I need to run the query itself to fetch the orders?
There are also simple queries like getting the order object by orderNumber. Can we cache these to avoid a DB query?
Please let me know of any possible solutions for improving site performance.
Thanks in advance :)
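For the simple "order by orderNumber" case, a cache-aside pattern is the usual starting point. A minimal sketch with Jedis, where the connection details, the JSON (de)serialization helpers, the DAO call, and the one-hour TTL are all assumptions:
import redis.clients.jedis.Jedis;

// Cache-aside: try Redis first, fall back to the DB and populate on a miss.
try (Jedis jedis = new Jedis("localhost", 6379)) {
    String key = "order:" + orderNumber;
    String json = jedis.get(key);
    if (json == null) {
        Order order = orderDao.findByNumber(orderNumber); // assumed DAO call
        json = toJson(order);                             // assumed serializer
        jedis.setex(key, 3600, json);                     // expire after an hour
    }
    Order result = fromJson(json);                        // assumed deserializer
}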

Database Data Filtering Best Practice

I am currently using raw JDBC to query records in a MySQL database; each record in the resulting ResultSet is extracted, placed in a domain-specific model, and stored in a List instance.
My question is: in circumstances where there is a requirement to further filter that data (based on columns that exist in the SAME table), which of the following approaches would generally be considered best practice?
1. Issuing further WHERE clause calls to the database. This effectively offloads the filtering to the database, but obviously results in an additional query (or queries) each time a filter is applied.
2. Explicitly filtering the aforementioned preprocessed List at the application level, thus avoiding additional calls to the database each time the records are filtered.
3. Some hybrid of the two, where all filtering operations are initially done by the database server, but the results are then mapped to an application-specific model and implicitly cached in a collection for some finite amount of time. Further filter queries received within that interval are then serviced from the data in the cache.
It is important to note that the database server in this scenario is located on an external machine, so the overhead and latency of sending query traffic over the network also has to be factored into the approach we ultimately take.
I am patently aware of the age-old mantra that "the database server should be used to do what it's good at", but in this scenario it just seems inadequate to make numerous calls to the database to filter data that I ALREADY HAVE at the application level.
Your thoughts and insights would be greatly appreciated.
I have used the hybrid approach on many applications with good results.
Database filtering works well, especially for columns that are indexed. It reduces network overhead, since fewer rows are sent to the application.
Database filtering can be really slow for some columns, though, depending on the number of rows in the results and the lack of indexes. The network overhead can be negligible compared to the database query time, so application filtering may be faster in that situation.
I also find application filtering in Java easier to write and understand than complex SQL.
I usually experiment manually to get the fewest rows in a reasonable time with plain SQL, then write Java to refine to the desired rows, as sketched below.
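A minimal sketch of that hybrid: plain JDBC for the coarse, indexed cut, then streams for the refinement. The employees table is borrowed from the earlier question, and connection, deptId, input, and the Employee class are assumptions:
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Stage 1: let the database do the cheap, indexed cut.
List<Employee> coarse = new ArrayList<>();
String sql = "SELECT emplid, surname, email FROM employees WHERE dept_id = ?"; // assumed indexed column
try (PreparedStatement ps = connection.prepareStatement(sql)) {
    ps.setInt(1, deptId);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            coarse.add(new Employee(rs.getString("emplid"),
                                    rs.getString("surname"),
                                    rs.getString("email")));
        }
    }
}

// Stage 2: refine the cached list in the application as the user changes filters.
List<Employee> refined = coarse.stream()
        .filter(e -> e.getSurname().contains(input))
        .collect(Collectors.toList());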
I appreciate this question, as I too faced a similar situation a few days back. Since you have already discussed all the available options, I would prefer the second option: handling it at the application level rather than filtering at the DB level.
