There are 2 ways to fetch multiple documents in Couchbase:
1. N1QL query
2. Reactive client (source)
I understand that the 2nd has backpressure and so on because it is reactive in nature. But what other functional differences are there between the 2 methods? (For example, does the 1st fire the get query at all shards while the 2nd goes to a particular shard only?) Can someone help me understand the functional differences and caveats of using the 1st approach over the 2nd?
My use case is to get multiple documents by ID.
The best I can tell, when you use the Flux.fromIterable as in that example, it will use the key-value API behind the scenes. This is different from the N1QL approach in a number of ways that include (but probably aren't limited to):
N1QL can be used to fetch documents by attributes other than the document key (e.g. `SELECT * FROM foo WHERE name LIKE '%best wishes%'`).
N1QL queries will use the Couchbase query service and (usually) the index service and (most likely) the data service.
The key-value API will go directly to the data service.
I think using N1QL to fetch documents by ID may not need to use the index service (assuming you use the right syntax), but will still need to use the query service. So there is some overhead.
Key-value access is always the fastest way to retrieve data from Couchbase. However, depending on your document size, concurrency needs, other operations, and what overhead the Reactive client introduces (if any--I don't know), the difference in overall performance could be anywhere from 0 to way-way-way better.
My gut recommendation is to go with Reactive (and therefore key-value) for your use case of "get multiple document by id".
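To make the "one key-value lookup per ID" idea concrete, here is a minimal sketch in plain Java. It does not use the Couchbase SDK: the `lookup` function is a hypothetical stand-in for whatever asynchronous get-by-key call your client offers, and `BulkGetById` is just an illustrative name. It starts every lookup first, then collects the documents that exist.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

class BulkGetById {
    // 'lookup' is a hypothetical stand-in for an async key-value get (not a real SDK call).
    // Returns the documents that were found, in the order the IDs were requested.
    static <V> Map<String, V> getAll(List<String> ids,
                                     Function<String, CompletableFuture<Optional<V>>> lookup) {
        // Start all lookups before waiting on any of them, so they can run concurrently.
        List<CompletableFuture<Optional<V>>> pending = new ArrayList<>();
        for (String id : ids) {
            pending.add(lookup.apply(id));
        }
        // Wait for each result; IDs with no document are simply skipped.
        Map<String, V> found = new LinkedHashMap<>();
        for (int i = 0; i < ids.size(); i++) {
            Optional<V> doc = pending.get(i).join();
            if (doc.isPresent()) {
                found.put(ids.get(i), doc.get());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // An in-memory map plays the role of the data service in this demo.
        Map<String, String> store = new HashMap<>();
        store.put("a", "doc-a");
        store.put("b", "doc-b");
        Function<String, CompletableFuture<Optional<String>>> lookup =
                id -> CompletableFuture.supplyAsync(() -> Optional.ofNullable(store.get(id)));
        System.out.println(getAll(Arrays.asList("a", "missing", "b"), lookup));
    }
}
```

Note that this sketch has no backpressure at all; that is one of the things the reactive client gives you on top of plain concurrent lookups.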
I am developing a Spring Boot REST API that has to fetch a large volume of data (100-200k records) from a DynamoDB table based on search conditions and return the response to the API consumer without loading the entire object list into memory. With SQL-based databases, I have used JdbcTemplate's queryForStream method for a similar requirement. But for a NoSQL database like DynamoDB, I could not find similar methods to stream the data.
One sample scenario is to fetch all passengers who booked a business class ticket on Christmas weekend from the xyz airline DynamoDB database.
note: Edited for clarity.
Reading GBs of data per request from DynamoDB does not seem scalable. Does the end user require all that data? What is the purpose?
DynamoDB returns at most 1 MB of data per Query or Scan request, so for a single end-user API call you would have to make many paginated requests to DynamoDB.
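A rough sketch of what that pagination loop looks like, written in plain Java without the AWS SDK. The `Page` class and the `fetchPage` function are hypothetical stand-ins: the token plays the role of the "where to continue from" marker a paginated API hands back. Each item is passed on as soon as its page arrives, so only one page is held in memory at a time.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

class PagedReader {
    // One page of results plus the token needed to request the next page
    // (null when there are no more pages). Hypothetical type for this sketch.
    static final class Page<T> {
        final List<T> items;
        final String nextToken;

        Page(List<T> items, String nextToken) {
            this.items = items;
            this.nextToken = nextToken;
        }
    }

    // Requests page after page and hands every item to the sink immediately.
    // Returns the total number of items seen.
    static <T> int streamAll(Function<String, Page<T>> fetchPage, Consumer<T> sink) {
        int count = 0;
        String token = null; // null means "start from the beginning"
        do {
            Page<T> page = fetchPage.apply(token);
            for (T item : page.items) {
                sink.accept(item);
                count++;
            }
            token = page.nextToken;
        } while (token != null);
        return count;
    }
}
```

In a REST handler the sink would write each item to the response stream instead of collecting it, which is what keeps the full result list out of memory.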
If you are using Scan then your solution is not at all scalable and I would possibly suggest using a different database.
This is not a good use case for REST in general. Have you considered storing the query result in S3?
Your REST API would return a task ID that you can then use to check the progress of the query and eventually download the result.
This way the work scales independently of the API, and you can run large numbers of parallel DynamoDB scans or queries.
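The task-ID pattern can be sketched in a few lines of plain Java. Everything here is illustrative: `ExportJobs`, the status strings, and the idea that the background work returns the location of the finished file are assumptions, and a real service would keep job state somewhere more durable than an in-memory map.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

class ExportJobs {
    private final Map<String, CompletableFuture<String>> jobs = new ConcurrentHashMap<>();

    // Starts the long-running export in the background and returns a task ID at once.
    // The supplier stands in for "run the query and store the result"; it returns
    // the (hypothetical) location of the finished file.
    String submit(Supplier<String> export) {
        String taskId = UUID.randomUUID().toString();
        jobs.put(taskId, CompletableFuture.supplyAsync(export));
        return taskId;
    }

    // What a "check progress" endpoint could report for a task ID.
    String status(String taskId) {
        CompletableFuture<String> job = jobs.get(taskId);
        if (job == null) return "UNKNOWN";
        if (!job.isDone()) return "RUNNING";
        return job.isCompletedExceptionally() ? "FAILED" : "DONE";
    }

    // Location of the result once the job has finished successfully, otherwise null.
    String resultLocation(String taskId) {
        CompletableFuture<String> job = jobs.get(taskId);
        if (job == null || !job.isDone() || job.isCompletedExceptionally()) return null;
        return job.join();
    }
}
```

The client calls the submit endpoint once, polls the status endpoint with the task ID, and downloads from the returned location when the status is done.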
The fastest way to do this is going to be using a parallel Scan operation. Assuming you have sufficient read capacity on the DynamoDB table, this is going to give you very high speed results.
See "parallel scan using Java" at https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ScanJavaDocumentAPI.html for an example
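To show the shape of a parallel scan without pulling in the AWS SDK, here is a small simulation in plain Java. A parallel Scan splits the table into `TotalSegments` logical segments and each worker scans one `Segment`; in this sketch an in-memory list stands in for the table and the segment assignment by index is made up purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ParallelScanSketch {
    // Simulates scanning one segment: the worker only sees the items of its own segment.
    // (Assigning items by index is an assumption of this sketch, not how DynamoDB does it.)
    static List<Integer> scanSegment(List<Integer> table, int segment, int totalSegments) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < table.size(); i++) {
            if (i % totalSegments == segment) {
                out.add(table.get(i));
            }
        }
        return out;
    }

    // Runs one worker per segment on its own thread and merges the results.
    static List<Integer> parallelScan(List<Integer> table, int totalSegments) {
        ExecutorService pool = Executors.newFixedThreadPool(totalSegments);
        try {
            List<CompletableFuture<List<Integer>>> workers = new ArrayList<>();
            for (int s = 0; s < totalSegments; s++) {
                final int segment = s;
                workers.add(CompletableFuture.supplyAsync(
                        () -> scanSegment(table, segment, totalSegments), pool));
            }
            List<Integer> all = new ArrayList<>();
            for (CompletableFuture<List<Integer>> worker : workers) {
                all.addAll(worker.join());
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }
}
```

The merged result comes back grouped by segment rather than in table order, which is also what you should expect from a real parallel scan.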
DynamoDB
Think of DDB in the way that Amazon uses it, for e-commerce: small lists (in the sub-100s) of paginated data, where the items are usually small in size but must be easy to update.
In this case you would never need to store / fetch GBs of data from the tables
Your needs as a 'how might we...' question
How might we store GBs of data in AWS and retrieve that data quickly?
AWS Best Practices
Before we dive into solving the 'hmw' question above, we need to understand some core tenets of AWS:
operational excellence
security
reliability
performance efficiency
cost optimisation
sustainability
AWS call these tenets or 'pillars' their Well-Architected Framework
You can read more about these here https://aws.amazon.com/architecture/well-architected/
Most of them are as described: monitoring, security, reliability, performance, cost efficient, computationally cheap (which means environmentally friendly)
A sprawling buffet of solutions
Storage
Your needs are the storage of GBs of data
It still depends on what you're trying to store here, but for most storage needs you'd use S3.
To make sure we keep things 'compliant' with the Well-Architected Framework we'd need to enable encryption (in transit and at rest), block public bucket access, etc.
To make everything cost efficient we will have to think about when we want to access this data. If accessed regularly then we'll have to use 'hot' storage, otherwise 'cold' storage S3 options are cheaper but you trade retrieval time.
Notable mentions
If you have specific data science needs you should check out: Data Lakes (still uses S3 under the hood), Glue, Athena (a query layer on top of S3)
If you're storing text-based data and require near-instant searching and retrieval, use OpenSearch. This is very useful for chat-related data.
Data storage
This depends on your app, but most people still keep a DynamoDB table that acts as a map for S3 queries.
DDB is query-optimised and super performant when you fully understand your data queries or access patterns.
Design your table around your access patterns, not around entities.
eg.
Option 1: One table
PK                SK
type#order        timestamp
type#transaction  timestamp
....
Option 2: Multiple entity-based tables
Order table
PK  SK         Attr
id  timestamp  productIDs
Transactions table
PK  SK         Attr
id  timestamp  amount, orderId
Products table
PK  SK
id  category
The one-table design just simplifies the retrieval of data in a small number of requests, but you do need to play with your table design until it's just right.
My recommendation: be creative and mix and match the table styles to your needs. Entity-based tables are still useful in most apps.
Also expect to redo your tables once you find out new things.
It's crucial here that you use an infrastructure-as-code tool to make it easier to tear down and recreate tables. CDK is great for this.
Remember that you are billed per Read and Write units. This is where a well-designed table (to match your access patterns) will help you make concise queries at a low cost.
Data retrieval
This is where you have some options, depending on your app
Again I would recommend the storage for big items in S3 not DynamoDB, so in this case it's relatively easy to download GBs of data from S3.
You can also store data in optimised formats such as Parquet.
Also if you choose to use DynamoDB as a hash map for the S3 bucket you can quickly find your files and locations and then place those in a queue, so that the retrieval happens in the background.
You can also copy files within the bucket to a job folder, zip the data and provide the user with the URL to that zip.
You can also use DataSync for copying across buckets.
Final notes
It sounds to me like you are storing data in AWS and downloading for processing.
Most teams approach this by moving their processing and storage to AWS, running the whole process in the cloud.
I am supposed to join some huge SQL tables with the JSON from some REST services by a common key (we are talking about multiple SQL tables and a few REST service calls). The thing is, this data is not a real-time/infinite stream, and I also don't think I could order the output of the REST services by the join columns. Now the silly way would be to bring in all the data and then match the rows, but that would imply storing everything in memory or in some storage like Cassandra or Redis.
But I was wondering if Flink could use some kind of stream window to join, say, X elements (so really storing just those elements in RAM at any point) while also storing the unmatched elements for a later match, maybe in some kind of hash map. This is what I mean by a smart join.
The devil is in the details, but yes, in principle this kind of data enrichment is quite doable with Flink. Your requirements aren't entirely clear, but I can provide some pointers.
For starters you will want to acquaint yourself with Flink's managed state interfaces. Using these interfaces will ensure your application is fault tolerant, upgradeable, rescalable, etc.
If you wanted to simply preload some data, then you might use a RichFlatMapFunction and load the data in the open() method. In your case a CoProcessFunction might be more appropriate. This is a streaming operator with two inputs that can hold state and also has access to timers (which can be used to expire state that is no longer needed, and to emit results after waiting for out-of-order data to arrive).
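The buffering logic such a two-input operator would carry can be sketched in plain Java, without any Flink APIs. This is only an illustration of the idea: `BufferingJoin` is a made-up class, the two hash maps stand in for Flink's keyed state, there are no timers to expire old entries, and it assumes each key shows up at most once on each side.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Joins two inputs by key, holding only the elements that have not found a partner yet.
class BufferingJoin<K, L, R> {
    private final Map<K, L> pendingLeft = new HashMap<>();
    private final Map<K, R> pendingRight = new HashMap<>();
    private final BiConsumer<L, R> output;

    BufferingJoin(BiConsumer<L, R> output) {
        this.output = output;
    }

    // Called for each element of the first input (e.g. a SQL row).
    void onLeft(K key, L left) {
        R right = pendingRight.remove(key);
        if (right != null) {
            output.accept(left, right); // partner was waiting: emit and forget both
        } else {
            pendingLeft.put(key, left); // no partner yet: buffer for a later match
        }
    }

    // Called for each element of the second input (e.g. a REST response).
    void onRight(K key, R right) {
        L left = pendingLeft.remove(key);
        if (left != null) {
            output.accept(left, right);
        } else {
            pendingRight.put(key, right);
        }
    }

    // Number of elements currently waiting for a match.
    int buffered() {
        return pendingLeft.size() + pendingRight.size();
    }
}
```

Memory use is bounded by the number of unmatched elements rather than by the total input size, which is the property the question is after.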
Flink also has support for asynchronous I/O, which can make working with external services more efficient.
One could also consider approaching this with Flink's higher level SQL and Table APIs, by wrapping the REST service calls as user-defined functions.
I have a table from which I extract 8 columns, said columns will be properties of a pojo, say MyPojo.
I want to remove duplicates.
I came up with three strategies.
1-Let Oracle take care of this with the distinct keyword
select distinct c1,c2...c8 from TABLE where...
2-Do this in java with cqengine (https://code.google.com/p/cqengine/wiki/DeduplicationStrategies#Logical_Elimination_Strategy):
DeduplicationOption deduplication = deduplicate(DeduplicationStrategy.LOGICAL_ELIMINATION);
ResultSet<Car> results = cars.retrieve(query, queryOptions(deduplication));
3-Do this in java with a set
simply storing rows inside of a Set<MyPojo>
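For the Set approach to remove anything, MyPojo has to define equals and hashCode over the extracted columns. A minimal sketch, with two made-up fields (`c1`, `c2`) standing in for the eight columns and a hypothetical `Dedup` helper:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;

final class MyPojo {
    private final String c1; // stands in for the first column
    private final int c2;    // stands in for the second column

    MyPojo(String c1, int c2) {
        this.c1 = c1;
        this.c2 = c2;
    }

    // Two rows are duplicates when all extracted columns are equal.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof MyPojo)) return false;
        MyPojo that = (MyPojo) other;
        return c2 == that.c2 && Objects.equals(c1, that.c1);
    }

    @Override
    public int hashCode() {
        return Objects.hash(c1, c2);
    }
}

class Dedup {
    // LinkedHashSet drops duplicates while keeping the first occurrence of each row in order.
    static List<MyPojo> distinct(List<MyPojo> rows) {
        return new ArrayList<>(new LinkedHashSet<>(rows));
    }
}
```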
From a performance point of view which one is better?
Let the database do the work. In this case you don't send unnecessary data over the network which will - probably - have the biggest positive impact on performance.
Also it is the most compact solution in terms of code size.
The best way to decide these things is to model it.
What are the access patterns in your application?
If this would be a one-off request: have the database do the filtering.
If you expect to get many such identical requests: have the database do the filtering, and consider caching results in the application.
If you expect to get a variety of queries on the same dataset, consider caching the unfiltered dataset into the application tier, and querying it with CQEngine.
There is no rule of thumb such as "always have the database do the work". If your application operates at any kind of scale, you will not want every request to hit the database. You need to scale out your application tier.
On the other hand, you should not over-engineer. The answer depends on the traffic volume and data access patterns that you expect.
I have found the jQuery DataTables plug-in extremely useful for simple, read-only applications where I'd like to give the user pagination, sorting and searching of very large sets of data (millions of rows using server-side processing).
I have a system for reusing this code but I end up doing the same thing over and over a lot. I'd like to write a very generalized API where I essentially just need to configure the SQL needed to retrieve the data used in the table. I am looking for a good design pattern/approach to do this. I've seen articles like this http://www.codeproject.com/Articles/359750/jQuery-DataTables-in-Java-Web-Applications and have a complete understanding of how server-side processing works (I have done it in Java and ASP.NET many times). For someone to answer, you will probably need to have a deep understanding of how server-side processing works in Java, but here are some issues that come up when attempting to do this:
I generally run three separate queries: a count without the search clause, a count with the clause included, and the query for the actual data. I haven't found an efficient way to do all 3 at once, and doing so requires a lot of extra data to come back from the DB (i.e. counts over and over). The API needs to support behavior based on these three different queries, and complex queries at that. I generally use ROW_NUMBER() OVER an index so that pagination is relatively speedy with large data.
*where clause changes dynamically (user can search over a variable number of rows).
*order by clause changes for the same reason.
Overall, each case is often pretty specific to the data we need. Is there a good way to abstract this so that I can do minimal work when I want to use the plug-in server side?
So, the steps are as follows in most projects:
*extract the params the plug-in sends to the server (a lot of times my own are added, mostly date ranges)
*build the unfiltered count query (this is rarely dynamic).
*build the filtered count query (is dynamic)
*build the data query
*construct a model object of the table and return it as json.
A lot of the issues occur setting the prepared statements with a variable number of parameters. Dynamically generating the sql in a general way (say based on just column names) seems unlikely. I am wondering if someone else has created something they are using for this or if it sounds like a specific pattern is applicable. It has just occurred to me that creating a reusable filter may be helpful in java. Any advice would be greatly appreciated. Feel free to be language agnostic as the architecture is what I'm trying to figure out.
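One way to tame the variable number of parameters is to build the SQL text and the parameter list together, so they can never get out of step. Below is a minimal sketch under some assumptions: `FilterQuery` is a made-up class, the search is a simple LIKE over every searchable column, and column names come from a server-side whitelist (they cannot be bound as `?` parameters, so they must never come straight from the request).

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class FilterQuery {
    final String sql;
    final List<Object> params;

    private FilterQuery(String sql, List<Object> params) {
        this.sql = sql;
        this.params = params;
    }

    // Builds the data query: an optional search over the whitelisted columns
    // and an optional ORDER BY on one of them.
    static FilterQuery build(String table, List<String> allowedColumns,
                             String search, String orderBy, boolean ascending) {
        StringBuilder sql = new StringBuilder("SELECT * FROM ").append(table);
        List<Object> params = new ArrayList<>();
        if (search != null && !search.isEmpty() && !allowedColumns.isEmpty()) {
            sql.append(" WHERE ");
            for (int i = 0; i < allowedColumns.size(); i++) {
                if (i > 0) sql.append(" OR ");
                sql.append(allowedColumns.get(i)).append(" LIKE ?");
                params.add("%" + search + "%"); // one parameter per placeholder
            }
        }
        if (orderBy != null) {
            if (!allowedColumns.contains(orderBy)) {
                throw new IllegalArgumentException("Unknown sort column: " + orderBy);
            }
            sql.append(" ORDER BY ").append(orderBy).append(ascending ? " ASC" : " DESC");
        }
        return new FilterQuery(sql.toString(), params);
    }

    // Binds the collected values in order; JDBC parameter indexes start at 1.
    void bind(PreparedStatement statement) throws SQLException {
        for (int i = 0; i < params.size(); i++) {
            statement.setObject(i + 1, params.get(i));
        }
    }
}
```

For example, `FilterQuery.build("orders", Arrays.asList("name", "city"), "bob", "city", false)` yields a statement with two placeholders and two matching parameters. The filtered count query could reuse the same WHERE-building code with a different SELECT prefix.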
We have a base search criteria class where all request parameters relevant to DataTables are mapped onto class properties (fields), and a custom search criteria class that extends the base and contains fields specific to the business logic for custom search. On the server side we also have a repository class that takes the custom search criteria as an argument and makes the queries to the database.
If you are familiar with C#, you could check out custom binding code and example of usage.
You could do such custom binding in your Java code as well.
I am developing a search component of a web application using Lucene. I would like to save the user queries to an index and use them to suggest alternate queries to users, and to keep query statistics (most often used queries, top scoring queries, ...).
To use this data for alternate query suggestions, I would analyze the queries to see which terms are most often used with one another and use that to create a suggestion to the user.
But I can't figure out in which form to index the data. I was thinking of simply adding the queries into the index, but in that way there could be a lot of redundant data since many documents in the index would have the same content. Does anyone have any ideas about the way this can be accomplished?
Thanks for the help.
"I was thinking of simply adding the queries into the index, but in that way there could be a lot of redundant data since many documents in the index would have the same content"
You can tell Lucene not to store document content, which means that the principal overhead will be the unique Terms, and the index itself. So, it might not be a large overhead to store each query as a unique Document...this way you will not be throwing away any information.
First, I believe that you should store the queries separately from the existing index. The problem is not redundant data but rather "watering down" your index - storing the queries in the same index may harm the relevance of your searches. Some options for this are:
Use a separate Lucene index.
Use Solr, with two separate cores, one for the documents and the other for the queries.
Use a query log. Store scores with the queries. Build query statistics using post-processing. As this is a web application, you can probably use the logs of a servlet container, such as Tomcat, for this.
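As a rough idea of what the post-processing could look like, here is a plain-Java sketch that takes logged query strings and counts both the most frequent queries and how often two terms appear together. It does not use Lucene; `QueryLogStats`, the lowercase/whitespace tokenization, and the `term1|term2` pair key are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

class QueryLogStats {
    // Counts how often each query appears, ignoring case and surrounding whitespace.
    static Map<String, Integer> queryCounts(List<String> log) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String query : log) {
            String normalized = query.trim().toLowerCase(Locale.ROOT);
            if (!normalized.isEmpty()) {
                counts.merge(normalized, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Counts how often two distinct terms occur in the same query.
    // Pairs are keyed as "first|second" with the terms in alphabetical order.
    static Map<String, Integer> termPairCounts(List<String> log) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String query : log) {
            String normalized = query.trim().toLowerCase(Locale.ROOT);
            if (normalized.isEmpty()) {
                continue;
            }
            // TreeSet removes repeated terms and sorts them, so each pair has one canonical key.
            List<String> terms = new ArrayList<>(new TreeSet<>(Arrays.asList(normalized.split("\\s+"))));
            for (int i = 0; i < terms.size(); i++) {
                for (int j = i + 1; j < terms.size(); j++) {
                    counts.merge(terms.get(i) + "|" + terms.get(j), 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

The pair counts are the raw material for "terms most often used with one another"; suggesting an alternate query then comes down to looking up the highest-count partners of the terms the user typed.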
Second, Auto-Suggest From Popular Queries Using EdgeNGrams suggests an alternative implementation of query suggestion using Solr.