The application I worked on has a user search API that searches for users by name, address, date of birth, and pincode. Of these, only pincode is mandatory, and any 2 of the other fields (date of birth, address, name) can be sent. In one request, clients can also search for multiple users, so against each search key we have to respond with either a found USER object or a NOT FOUND response.
This is how our algorithm works for each search key:
1) Fetch all the user profiles from the table with the given pincode.
2) Iterate over the users from step 1 and check whether any two of date of birth, name, and address match; if so, that user is a valid candidate. We do this using a predicate, as shown below:

StreamSupport.stream(profiles.spliterator(), false)
        .filter(new UserSearchPredicate(key))
        .collect(Collectors.toList());

The predicate checks whether any 2 of the values match and, if so, returns true.
3) If step 2 yields exactly one user, we are DONE; otherwise we throw an exception.
4) Return the user.
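For reference, a minimal sketch of what such a predicate can look like (the nested User and SearchKey classes here are hypothetical stand-ins for illustration, not our actual domain classes):

import java.util.Objects;
import java.util.function.Predicate;

public class UserSearchPredicate implements Predicate<UserSearchPredicate.User> {

    // Hypothetical stand-ins for the real domain classes.
    public static class User {
        String name, address, dateOfBirth;
    }
    public static class SearchKey {
        String name, address, dateOfBirth;
    }

    private final SearchKey key;

    public UserSearchPredicate(SearchKey key) {
        this.key = key;
    }

    @Override
    public boolean test(User user) {
        int matches = 0;
        if (key.name != null && Objects.equals(key.name, user.name)) matches++;
        if (key.address != null && Objects.equals(key.address, user.address)) matches++;
        if (key.dateOfBirth != null && Objects.equals(key.dateOfBirth, user.dateOfBirth)) matches++;
        return matches >= 2; // any 2 of the 3 optional fields matched
    }
}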
The above algorithm has a problem: it searches the users table by pincode. There are 20 million records in the table, which makes the query slow. Against one pincode we may get 16,000 or more records, and iterating over them is another bottleneck. This also has to be done for every search key, and one request can carry 50 or more search keys, so a request took almost a minute to respond.
What we have done:
Pincode column indexing is properly done in the database (we are using an Oracle DB). We also tried parallel processing, but it caused problems with the DB connections in the connection pool, so we concluded it is not a solution.
Adding one more column to the query is not logically possible: if I add name, the query becomes where pincode='' and name='', but the matching record may instead match on pincode and date of birth. Apart from pincode, nothing is mandatory.
Question: Is there a better way to handle this problem? I am particularly looking for a better algorithm. Please help.
I'm going to build an app. Until now everything has run very well, but now I have a problem. The app gets its content from a MySQL database. One column is called item. I have a RatingBar where the user can rate the item. Every time the user rates an item, the value is stored in the database in the respective item row, and the values are added up. In other words, when a user rates 20 times with 5 stars, the value adds up to 100, and so on.
I want to limit this: the user should only be able to rate each item once per day, and I want to do it without a registration form for the user. How can I solve this problem?
I know that I can read the Wi-Fi MAC address and other unique identifiers, but how can I solve this with them?
I cannot use an SQLite database, because the items have to be updated over time from the MySQL database.
A registration form is not completely ruled out; if the problem can only be solved with one, then I will add it.
I am looking forward to your comments.
Every computer has a machine ID; you can hash and encrypt that to use as your identifier. Most telecoms do not like using MAC addresses as IDs.
One option would be to create a UUID during every installation and send this UUID to your server along with every request. On the server you can then control whether the user may provide feedback once a day, or whatever your requirement is. Please refer to this link on how to create a UUID:
https://developer.android.com/reference/java/util/UUID.html
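For illustration, a minimal sketch of the idea in plain Java (on Android you would typically persist the value in SharedPreferences instead of a file; the file name here is just an assumption):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;

// Create the identifier once, persist it, and reuse it on every request.
public class InstallationId {

    private static final Path ID_FILE = Paths.get("installation.id");

    public static synchronized String get() throws IOException {
        if (Files.exists(ID_FILE)) {
            return new String(Files.readAllBytes(ID_FILE), "UTF-8").trim();
        }
        String id = UUID.randomUUID().toString();
        Files.write(ID_FILE, id.getBytes("UTF-8"));
        return id;
    }
}

The server can then key its once-per-day check on this identifier plus the item being rated.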
I have a cluster of three Cassandra nodes with more or less default configuration. On top of that, I have a web layer consisting of two nodes for load balancing, both web nodes querying Cassandra all the time. After some time, with the data stored in Cassandra becoming non-trivial, one and only one of the web nodes started getting ReadTimeoutException on a specific query. The web nodes are identical in every way.
The query is very simple (? is placeholder for date, usually a few minutes before the current moment):
SELECT * FROM table WHERE time > ? LIMIT 1 ALLOW FILTERING;
The table is created with this query:
CREATE TABLE table (
    user_id varchar,
    article_id varchar,
    time timestamp,
    PRIMARY KEY (user_id, time));
CREATE INDEX articles_idx ON table(article_id);
When it times out, the client waits a bit more than 10s, which, not surprisingly, is the timeout configured in cassandra.yaml for most connects and reads.
There are a couple of things that are baffling me:
the query only times out when one of the web nodes executes it - one node always fails, the other always succeeds.
the query returns instantaneously when I run it from cqlsh (although it seems it only hits one node when I run it from there)
there are other queries issued which take 2-3 minutes (a lot longer than the 10s timeout) that do not time out at all
I cannot trace the query in Java because it times out. Tracing the query in cqlsh didn't provide much insight. I'd rather not change the Cassandra timeouts, as this is a production system and I'd like to exhaust non-invasive options first. The Cassandra nodes all have plenty of heap, their heap is far from full, and GC times seem normal.
Any ideas/directions will be much appreciated, I'm totally out of ideas. Cassandra version is 2.0.2, using com.datastax.cassandra:cassandra-driver-core:2.0.2 Java client.
A few things I noticed:
While you are using time as a clustering key, it doesn't really help you because your query is not restricting by your partition key (user_id). Cassandra only orders by clustering keys within a partition. So right now your query is pulling back the first row which satisfies your WHERE clause, ordered by the hashed token value of user_id. If you really do have tens of millions of rows, then I would expect this query to pull back data from the same user_id (or same select few) every time.
"although it seems it only hits one node when I run it from there" Actually, your queries should only hit one node when you run them. Introducing network traffic into a query makes it really slow. I think the default consistency in cqlsh is ONE. This is where Carlo's idea comes into play.
What is the cardinality of article_id? Remember, secondary indexes work the best on "middle-of-the-road" cardinality. High (unique) and low (boolean) are both bad.
The ALLOW FILTERING clause should not be used in (production) application-side code. Like ever. If you have 50 million rows in this table, then ALLOW FILTERING is first pulling all of them back, and then trimming down the result set based on your WHERE clause.
Suggestions:
Carlo might be on to something with the suggestion of trying a different (lower) consistency level. Try setting a consistency level of ONE in your application and see if that helps.
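For example, with the 2.0 Java driver you are already using, something like this should work (the contact point and keyspace are placeholders, and the query is the one from your question):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class ConsistencyExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace");

        // Lower the consistency level on this one statement only.
        SimpleStatement stmt = new SimpleStatement(
                "SELECT * FROM table WHERE time > '2015-05-19 15:00:00-0500' LIMIT 1 ALLOW FILTERING");
        stmt.setConsistencyLevel(ConsistencyLevel.ONE);

        ResultSet rs = session.execute(stmt);
        System.out.println(rs.one());

        cluster.close();
    }
}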
Either perform an ALLOW FILTERING query, or a secondary index query. They both suck, but definitely do not do both together. I would not use either. But if I had to pick, I would expect a secondary index query to suck less than an ALLOW FILTERING query.
To solve this adequately at the scale you are describing, I would duplicate the data into a query table, since it looks like you are concerned with organizing time-sensitive data and getting at the most-recent data. A query table like this should do it:
CREATE TABLE tablebydaybucket (
    user_id varchar,
    article_id varchar,
    time timestamp,
    day_bucket varchar,
    PRIMARY KEY (day_bucket, time))
WITH CLUSTERING ORDER BY (time DESC);
Populate this table with your data, and then this query will work:
SELECT * FROM tablebydaybucket
WHERE day_bucket='20150519' AND time > '2015-05-19 15:38:49-0500' LIMIT 1;
This will partition your data by day_bucket and cluster it by time. This way, you won't need ALLOW FILTERING or a secondary index, your query is guaranteed to hit only one node, and Cassandra will not have to pull all of your rows back and apply your WHERE clause after the fact. Clustering on time in DESCending order also helps your most-recent rows come back quicker.
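And a rough sketch of how your application might derive day_bucket when writing (same 2.0 driver as in your client; the contact point and keyspace are placeholders):

import java.text.SimpleDateFormat;
import java.util.Date;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class DayBucketWriter {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace");

        PreparedStatement insert = session.prepare(
                "INSERT INTO tablebydaybucket (day_bucket, time, user_id, article_id) "
                        + "VALUES (?, ?, ?, ?)");

        // The bucket is just the event's day, formatted to match the query above.
        Date now = new Date();
        String bucket = new SimpleDateFormat("yyyyMMdd").format(now);
        session.execute(insert.bind(bucket, now, "some-user", "some-article"));

        cluster.close();
    }
}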
I am trying to store emails for a newsletter mailing app in Cassandra.
The current schema is:
CREATE TABLE emails (
    email varchar,
    comment varchar,
    PRIMARY KEY (email));
I don't know how to get the emails ordered by added time (so that emails can be processed in parallel on different nodes).
PlayOrm on Cassandra can do that sort of stuff under the covers for you, as long as you are able to partition your data so you can still scale; you can query into your partitions. The ORDER BY is not there yet, but a trick is to use where time > 0 to get everything after the 1970 epoch, which forces it to use the time index, and then just traverse the cursor backwards for reverse order (or forwards for sorted order).
Cassandra orders on write based on your column comparator; you can't order results by any arbitrary column in your predicate. If you want to retrieve in time order, you must insert with your timestamp as your column name (or as the first element in a composite name). You can also create a second column family that stores time-ordered records that you can query when needed. Unfortunately, CQL gives the illusion of RDBMS-like query capability, when in reality it's still a column store with the associated query capabilities. My suggestion is to either avoid CQL (and use Thrift-based queries instead) or make sure you understand what it's doing under the covers.
I'm designing a database for capturing clinical trial data. The data are entered twice by two people, independently, and the results must be matched. What are the best database tools to achieve this? Does anyone have similar experience?
Your help is highly appreciated.
Thanks.
Are you designing a database, or the app to enter data into the database?
If you are simply looking at the database, I would capture the following information:
1) user A item X entered data
2) user A userID
3) user A item X entered date/time
4) user B item X entered data
5) user B userID
6) user B item X entered date/time
I'd then conclude that there was something called a "Datapoint" that contained the fields
-- entering userID
-- entry date
-- entry data (double value)
I'd also assign it a unique ID for the entry
-- entryID (autoinc)
I would then state that there is something called a "data trial" that has two of these things called "data entries"
If I believed that this number of entries per data trial might be 3 verifications instead of 2, I might change my design, but initially I would give my "Data Trial" the following definition:
-- data trial name
-- data trial creation date
-- user creating data trial (userID)
-- data entry 1 (dataPointID)
-- data entry 2 (dataPointID)
-- entries verified (boolean)
and give each of these a unique ID also
-- data trial ID (autoinc)
(I can't add comments yet...) Adding to Zak's answer, if there is any doubt over how many people will enter these values (say it jumps from two to three, like Zak says) I'd break the Data entry 1 and 2 (both dataPointIDs) into another table with two columns:
--data trial id
--data entry id
This way you could theoretically have as many different users inserting the data as needed, and the data trial table would then contain only metadata about the trial, not "business logic," which having exactly 2 data entries per trial essentially is.
A similar setup could be used if different trials contain different amounts of data values to be entered.
If you are looking for a good database tool, you should consider using an Entity-Relationship Designer to model your database, such as Case Studio or Embarcadero ER/Studio.
Databases are not designed to solve this issue. Double entry is an application issue and violates normalization. I would implement a verification field to indicate that the data has been verified, and if it failed or not. I would likely include an audit table containing each set of entries entered.
The application would need a lookup function to determine if this is the first entry or a subsequent entry. There are a number of design issues related to this.
Verification can't find the first entry.
How to correct the data if it doesn't match on verification.
How to handle unverified data which should be verified.
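To illustrate the kind of lookup/verification logic I mean, a rough sketch (class, method, and status names here are made up for illustration):

// Sketch: decide what to do when an entry for an item is saved.
public class DoubleEntryVerifier {

    public enum Status { AWAITING_SECOND_ENTRY, VERIFIED, MISMATCH }

    // firstValue is the entry already stored for this item, or null if
    // this is the first entry.
    public static Status verify(Double firstValue, double secondValue) {
        if (firstValue == null) {
            return Status.AWAITING_SECOND_ENTRY; // store and wait for re-entry
        }
        // Exact agreement is the point of double entry; a mismatch is
        // flagged for correction rather than silently accepted.
        return firstValue.doubleValue() == secondValue
                ? Status.VERIFIED
                : Status.MISMATCH;
    }
}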
I want to display a list of all the users on my site, but I only want to display 10 people per age. I don't know exactly how to do this. I know how to do it by just displaying all the users on one page, but that's not very good, is it?
If I do it with the code I have now, it will only get the first ten users over and over again.
I want to be able to get all the users in a one-time query, store the list globally, and then move through it, retrieving the next 10 for display, and so on.
I am developing on App Engine using Java and the Spring Framework. Some of the solutions I have been thinking about:
Store the list in the session and step through it (very bad, I guess).
Hand it to the JSP, specifically to one of the scopes (page, request, etc.), but I think request will not work.
Look for a Spring controller that can handle this.
Generally speaking, you would use a form variable on your page (via GET or POST) called 'page', which would be a number. When you receive it in the servlet, you calculate a range based on the page number and the configured rows per page, as in the sketch below.
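A minimal sketch of that calculation (names are illustrative only):

// Translate a page number from the request into a row range.
public final class PageRange {

    final int firstRow;     // zero-based offset of the first row on the page
    final int rowsPerPage;  // configured rows per page

    PageRange(int page, int rowsPerPage) {
        this.firstRow = (page - 1) * rowsPerPage;
        this.rowsPerPage = rowsPerPage;
    }

    public static void main(String[] args) {
        PageRange r = new PageRange(3, 10);  // ?page=3 with 10 rows per page
        System.out.println(r.firstRow);      // 20, i.e. fetch rows 20..29
    }
}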
Take a look at Paging through large datasets (yes it's Python but the same principles apply) and Queries and Indexes from the Google App Engine documentation.
Take a look at http://valuelist.sourceforge.net/
If you keep the page size at 10, then you can retrieve your 10 users per age group for each page based on the page number:
SELECT TOP 10 users FROM myusers
WHERE AGE = function(page_number)
ORDER BY some_ordering
I hope that JPA + App Engine supports this type of query.
Some database engines provide handy extensions to SQL for just this purpose. For example, in MySQL you can say something like "select ... whatever ... limit 50,10", where 50 is the row to start with and 10 is the number of rows to retrieve. Then on your display page you simply put next and previous buttons that pass the appropriate starting row number back to the server for the next run of the query.
If the SQL engine you're using has no such handy function, then you have to build a query-specific "where" clause based on the sort order.
To take a simple case, suppose in your example you are displaying the records ordered by user_name. You can use Statement.setMaxRows(10) to limit any query to 10 rows. On your first call you execute, say, "select ... whatever ... from user order by user_name" and save the last user_name found. Your next button passes this user_name back to the server, and the query for the next call is "select ... whatever ... from user where user_name>'xxx' order by user_name", where 'xxx' is the last user_name from the previous call. Do the setMaxRows again so you are again limited to 10 rows of output. The user can then step through the entire output this way, as in the sketch below.
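Here is what that looks like in JDBC (the connection URL and the table/column names are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class KeysetPager {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb", "user", "pass")) {
            // "" works as the starting key for the first page.
            String lastSeen = "";
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT user_name FROM user WHERE user_name > ? ORDER BY user_name")) {
                ps.setMaxRows(10);          // never return more than one page
                ps.setString(1, lastSeen);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        lastSeen = rs.getString("user_name");
                        System.out.println(lastSeen);
                    }
                }
            }
            // Send lastSeen back with the rendered page; the next request
            // binds it as the parameter to fetch the following 10 rows.
        }
    }
}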
Letting the user go backwards is a bit of a pain. I've done it by keeping a table in a session variable with the starting key value for each page.