Best table structure for *set* of ManyToMany

I have 2 Java Entities - Students and Courses.
The relation between them is ManyToMany - each student has a lot of courses and each course can be taken by a lot of students.
Because many sets of courses repeat, we decided to keep a table of course sets. Whenever a student wants to take a set of courses, we check whether an identical set already exists: if it does, the student uses its id; if not, the set is created. When the student is done with the set, we put null in the setId field.
The problem is that the course table is dynamic - we can add and remove courses at any time (unless a student is using the course), and there is no limit on the number of courses - there can even be 150.
We need the check for the existence of a set to be as quick as possible.
We thought about:
A HashCode, but it would need to be dynamic and to support as many as 2^100 possibilities.
Concatenating the ids of the courses into a string and searching for that string to see if the set exists.
Assigning each course a prime number, so that the id of the set is the product of its primes - the problem is that the biggest number (the set of all courses) can be too big, and factoring the number can take a long time.
What would be the best implementation for these requirements? Any idea is welcome!
(The reason we don't want to use the traditional ManyToMany table is a performance issue. There is a big calculation that causes it.)
Thanks!

A prediction...
100 courses will lead to a few dozen very popular "sets", plus tens of thousands of infrequent sets.
You will find that the number of "sets" will be too unwieldy to be practical.

I faced the same problem in my project, where I show a document listing whose default columns and ordering come from one table. If a user changes the order or makes another field visible in his listing, a new entry is written to that table and its row id is set for the current user.
In my case the order mattered; you can ignore it and store the user's selected course list as JSON. If the list already exists in the table, assign that set id to the user.
A HashCode will not be unique.
Plain concatenation can run into blank-spacing issues, whereas the JSON of a SortedSet<Integer> is always the same for the same courses.
For example: SortedSet<Integer> setA = new TreeSet<>();
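As a rough illustration of the sorted-set idea, here is a minimal sketch that turns a set of course ids into one canonical string, plus a fixed-length SHA-256 digest of it that could serve as an indexed lookup column. The class name CourseSetKey, the comma separator and the use of SHA-256 are assumptions made for this example, not something the question or answer prescribes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Collection;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical helper: builds a canonical key for a set of course ids.
class CourseSetKey {

    // Sorted ascending, de-duplicated, comma-separated, no spaces:
    // the same courses always give the same string.
    static String canonical(Collection<Integer> courseIds) {
        return new TreeSet<>(courseIds).stream()
                .map(String::valueOf)
                .collect(Collectors.joining(","));
    }

    // Fixed-length hex digest of the canonical string (assumption: SHA-256).
    // Short enough to index even when a set holds 150 courses.
    static String digest(Collection<Integer> courseIds) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(canonical(courseIds).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

A lookup would then search by the digest and, since no hash is guaranteed unique, compare the stored canonical string before reusing a set id.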

Related

How to re-generate deleted sequence numbers in hibernate?

As we know, the Hibernate annotation below generates a new number each time, starting from 1. Consider a situation where I have a set of records with ids 1-5. Now the record with id 3 is deleted from the table, so the number 3 is missing from the sequence 1-5. I have a requirement to re-generate and reassign that number 3 when I add a new record to the table. How can I do this?
@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
private int id;

I don't think this is a great idea. A sequence is just a number incremented by 1 each time. That makes it fast, yet it is already a write bottleneck in a distributed database, since all the nodes need to synchronize on that number.
To find the first available integer, you basically need a full table scan: order the records by id and look for the first missing one. That is extremely costly and inefficient for something that should be as cheap as possible.
You should view the id as a technical ID without functional meaning, and thus not care whether there are holes in the sequence.
Edit:
I would also add that the implications go deeper, even in business terms.
If I get an ID for an article I sell as a merchant, and I model its deletion either as removing the record or, even better, as putting a "deleted" status on it (potentially with a date and a reason for deletion), I have much easier bookkeeping. Actually, I would prefer the latter design: keep the record and give it a status that is dynamic and potentially has a history. The item could be unavailable for 1 year and be used again if I sell it again.
If, on the contrary, I silently reuse the ID, my system may display an old bill with the data of the new article. Instead of ski boots that I don't sell anymore, it may show a PS5 or 1kg of rice. This is error-prone.
This may not apply to all business cases, of course, but it's better to consider this kind of usage before going with a design that deletes data.

I agree with Nicolas, but just to clarify:
You are using an "Identity" and not a "Sequence". There are some differences between them and in how they are declared and used (each database may have its own proprietary implementation).
A Sequence is an independent object in your database with some properties (like start, end, increment, ...), while an identity is a "property" of the column, and its behaviour depends on how the database handles it.
With a sequence (and, depending on the database, with some identities) you can create a "cyclic" sequence that repeats its numbers after the cycle ends. But a sequence or identity never scans for "gaps" in the ids. (As Nicolas said, that is really bad for performance.)
So, depending on how your code works, you could create a cycle in a sequence to avoid an always-increasing value, but only if you are sure there will be no conflicts when inserting new records.

How to redistribute unique integer ids in a MySQL database?

Consider this:
I have a database with 10 rows.
Each row has a unique id (int) paired with some value e.g. name (varchar).
These ids are incremented from 1 to 10.
I delete 2 of the records - 2 and 8.
I add 2 more records 11 and 12.
Questions:
Is there a good way to redistribute the unique ids in this database so they go from 1 to 10 again?
Would this be considered bad practice?
I ask because, after the database has been in use for a while, with values being added and deleted, the ids would drift significantly.

One way to approach this would be to just generate the row numbers you want at the time you actually query, something like this:
SET @rn = 0;
SELECT
    (@rn:=@rn + 1) AS rn, name
FROM yourTable
ORDER BY id;
Generally speaking, you should not be worrying about the auto increment values which MySQL is assigning. MySQL will make sure that the values are unique without your intervention.

If you set the ID column to be the primary key and auto-increment as well, "resetting" is not really necessary, because it will keep assigning unique IDs anyway.
If what bothers you are the "gaps" among the existing values, then you might resort to "soft deletion" by adding an is_deleted column with bit/boolean values. The default value would be 0 (or b0), of course. In fact, soft-deleting is advised whenever the data is important and might be useful later on, especially for payment-related entries that a user can delete either by mistake or deliberately.
There is no simple way to delete a value and re-arrange the remaining IDs so that the sequence stays contiguous. A workaround might be the following steps:
DELETE the entry first, i.e. delete from <table> where ID = _value
INSERT INTO ... SELECT (without the id column) into a backup table. Please note that the tables need to be identical in terms of columns and types for this query to work properly; you can also use a temporary table as the backup_table, i.e. insert into <backup_table> select <column1, column2, ...> from <table>
TRUNCATE your table, i.e. truncate table <table>
Copy the values from the temp table back into the existing table. You can use INSERT INTO ... SELECT once again, but make sure to drop the temp table at the end.
Please note that I would NOT advise you to do this, mainly because most applications use some sort of caching and have specific ways of deciding whether two objects are the same.
For example, in Java the equals() and hashCode() methods of POJOs are overridden, and programmers generally rely on IDs as a permanent way of identifying a specific object. The method above breaks that whole concept, and for this reason before anything else I would not advise changing an object's auto-increment ID value.
Essentially, what you want to do is an anti-pattern: it turns common patterns and practices employed by experienced programmers into solutions that are prone to unexpected issues and failures. This applies especially when advanced features are involved, such as a Galera cluster or application-level caching.

Need advice on most effective List to use, and the best practice to generate unique ids to each member

So I've got this school project, and I would really like to approach it with the best practices.
I need to make a list of customers for an insurance company. Each customer shall have a unique customer number, generated in ascending order.
Every customer can have zero to many insurances, stored in a separate list for each customer. Adding insurances will happen more often than adding customers.
Every customer can also have any number of claims. Every claim also has a unique id number.
If a customer cancels all insurances, all data on this customer will remain as history.
All data need to be stored via one of the file classes in the Java Standard Library. Databases are not allowed.
Actions such as showing of statistics will also be available.
Users of the program will be employees, with rights to edit every data field.
Questions:
What Collection class would be the most effective one to use? LinkedList, ArrayList, Hashmap or any other?
What file class would be the best one for saving the lists? ObjectOutputStream?
What is the best method of generating new unique ids for both customers and claims? As private fields in the customer list class? Information on the next unique id has to be restored every time the program exits and restarts.
Edit:
Not looking for help with any code. Just advice on the most common classes to use in a scenario like this.

What Collection class would be the most effective one to use?
LinkedList, ArrayList, Hashmap or any other?
Ans - LinkedList and ArrayList are types of List. HashMap is a type of Map.
Which implementation of List to use depends on your requirements. If you frequently insert and remove elements at arbitrary points of a List, LinkedList can make more sense: unlinking an element in the middle is cheap once you have reached it (for example through an iterator), although reaching that position still requires a traversal. Otherwise prefer ArrayList.
What is the best method of generating new unique ids for both
customers and claims? As private fields in the customer list class?
Information on the next unique id has to be restored every time the
program exits and restarts.
You may want to use a Singleton to generate IDs, and also persist them to a file.
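One way this could look, as a minimal sketch only: a small generator that reads the next id from a text file at start-up and writes it back after every call. The class name IdGenerator and the one-number-per-file format are assumptions for illustration; the answer's Singleton would simply hold one such instance per counter (one for customers, one for claims).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical id generator whose state survives program restarts.
class IdGenerator {
    private final Path file;   // holds the next id to hand out, as plain text
    private long next;

    IdGenerator(Path file) throws IOException {
        this.file = file;
        // Restore the counter if the file exists, otherwise start at 1.
        this.next = Files.exists(file)
                ? Long.parseLong(new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim())
                : 1L;
    }

    // Returns ids in ascending order and persists the counter on every call.
    synchronized long nextId() throws IOException {
        long id = next++;
        Files.write(file, Long.toString(next).getBytes(StandardCharsets.UTF_8));
        return id;
    }
}
```

Writing the file on every call is the simplest choice; saving the counter together with the customer list when the program exits would also satisfy the requirement.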

Huge Leaderboard ranking with filtering

We are building a massive multi-player educational game with some millions of entries in the leader-board (based on aggregated XPs gained). After a game finishes, we need to show the leaderboard and how this player/student is ranked.
But there are a couple of filters for this leaderboard (global/by country, by month/year/today, by age etc) that can be mixed together e.g. 'Get me the leaderboard for my Country for the last month'. Number of combinations is ~20.
My problem is how to store such a structure when it is updated regularly; rankings must be recalculated after each game. A typical full leaderboard at the moment has ~5 million entries for players coming from >150 countries.
I used to have a MySQL Cluster table (userid, xps, countryid) with 3 nodes, but ordering by XPs (either in the DBMS or in the application, which required fetching all data from the DB) proved to be too slow as the numbers got bigger (>20K users). This is an interesting post, but again half a second for each query is too much.
Then we used REDIS (see this post), but filtering is the problem here. We used separate lists for TOP 5 and the rest. TOP 5 was updated instantly, for the rest there was some delay of 20-30 minutes. We in fact ranked this user based on a cached instance of the Leaderboard (using the real XPs though, not the cached), so this was acceptable. Real-time on non-Top5 is not a prerequisite.
This is fine for one global ranking, but how to filter the results based on month and/or country and/or age. Do we need to keep a list for every filtering combination?
We also tested custom structures in Java (using it as a Java caching server similar in functionality with REDIS), still experimenting with it. Which is the best combination of structures to achieve our goal? We ended up using one list per filtering combination e.g. Map<FilteringCombination, SortedList<User>> and then doing binary search to the list of a specific key. This way, a finished game requires a couple of insertions say X, but it requires X*NumOfPlayers space, which is X times more than keeping a single list (not sure if this can fit to memory but we can always create a cluster here by splitting combinations to different servers). There is an issue here on how to rebuild the cache in case of failure, but that is another problem we can deal with.
Extending the above method, we might slightly improve performance if we define scoring buckets inside each list (eg a bucket for 0-100xp, another for 101 - 1000xp, another for 1001 - 10000xp etc). The bucket splitting policy will be based on the players' xp distribution in our game. It's true that this distribution is dynamic in real world, but we have seen that after a few months changes are minor, having in mind that XPs are always increasing but new users are coming as well.
We are also testing Cassandra's natural ordering by utilizing clustering keys and the wide-rows feature, although we know that having some millions of rows may not be easy to handle.
All in all, that is what we need to achieve. If a user (let's name her UserX) is not included in the Top5 list, we need to show this user's ranking together with some surrounding players (eg 2 above and 2 below) as the example below:
Global TOP 5            My Global Ranking (425)   My Country Ranking   Other Rankings
1. karen (12000xp)      423. george               1. david
2. greg (11280xp)       424. nancy                2. donald
3. philips (10293xp)    **425. UserX**            3. susan
4. jason (9800xp)       426. rebecca              **4. UserX**
5. barbara (8000xp)     427. james                5. teresa
I've studied many SO and other posts, but still cannot find a solution for efficiently updating and filtering large leaderboard tables. Which candidate solution would you choose, and what are the possible performance improvements (space + memory + insertion/searching CPU cost)?

That's a very interesting problem - thanks for posting. In general, databases excel at this type of problem, in which there are large amounts of data that need to be filtered and searched. My first guess is that you are not using MySQL indexes correctly. Having said that, you clearly need to regularly find the nth row in an ordered list, which is something SQL is not at all good at.
If you are looking at some form of in-memory database, then you'll need something more sophisticated than REDIS. I would suggest you look at VoltDB, which is very fast but not cheap.
If you would like to build your own in-memory store, then you'll need to calculate memory use to see if it's feasible. You will need an index (discussed later in this answer) for each field you want to search or filter on, along with the record for each user. However, even for 10 million rows and 20 fields it's still going to be less than 1 GB of RAM, which should be fine on modern computers.
Now for the data structures. I believe you are on the right track using maps to lists. I don't think the lists need to be sorted - you just need to be able to get the set of users for particular value. In fact sets may be more appropriate (again worth testing performance). Here is my suggestion to try (I've just added country and age fields - I assume you'll need others but it's a reasonable example to start with):
enum Country {
    ...
}
class User {
    String givenName;
    String familyName;
    int xp;
    Country country;
    int age;
}
class LeaderBoard {
    Set<User> users;
    Map<Integer, Set<User>> xpIndex;
    Map<Country, Set<User>> countryIndex;
    Map<Integer, Set<User>> ageIndex;
}
Each of the indices will need to be updated when a field changes. For example:
private void setUserAge(User user, int age) {
    assert users.contains(user);
    assert ageIndex.get(user.getAge()).contains(user);
    // Move the user from the old age bucket to the new one.
    ageIndex.get(user.getAge()).remove(user);
    if (!ageIndex.containsKey(age)) {
        ageIndex.put(age, new TreeSet<>());
    }
    ageIndex.get(age).add(user);
    user.setAge(age);
}
Getting all users, by rank, that satisfy a given combination can be done in a number of ways:
countryIndex.get(Country.Germany).stream()
.filter(ageIndex.get(20)::contains)
.sorted(User::compareRank)
...
or
SortedSet<User> germanUsers = new TreeSet<>(User::compareRank);
germanUsers.addAll(countryIndex.get(Country.Germany));
germanUsers.retainAll(ageIndex.get(20));
You'll need to check which of these is more efficient - I would guess the stream implementation will be. It can also easily be converted to a parallelStream.
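To make the stream variant concrete, here is a small self-contained sketch of two indices and a filtered, ranked query. It simplifies the classes above: the country is a plain String instead of the Country enum, buckets are HashSets, and rank is taken to be descending XP. All of these, along with the class name IndexedLeaderBoard, are assumptions for the example.

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical, simplified leaderboard with one index per filter field.
class IndexedLeaderBoard {

    static final class User {
        final String name;
        final String country;
        final int age;
        final int xp;

        User(String name, String country, int age, int xp) {
            this.name = name;
            this.country = country;
            this.age = age;
            this.xp = xp;
        }
    }

    private final Map<String, Set<User>> countryIndex = new HashMap<>();
    private final Map<Integer, Set<User>> ageIndex = new HashMap<>();

    // Registers the user in every index.
    void add(User u) {
        countryIndex.computeIfAbsent(u.country, k -> new HashSet<>()).add(u);
        ageIndex.computeIfAbsent(u.age, k -> new HashSet<>()).add(u);
    }

    // Users matching both filters, highest XP first.
    List<User> ranked(String country, int age) {
        Set<User> byAge = ageIndex.getOrDefault(age, Collections.emptySet());
        return countryIndex.getOrDefault(country, Collections.emptySet()).stream()
                .filter(byAge::contains)   // intersect with the second index
                .sorted(Comparator.comparingInt((User u) -> u.xp).reversed())
                .collect(Collectors.toList());
    }
}
```

Streaming over the smaller of the two buckets and probing the larger one keeps the intersection cheap; which bucket is smaller depends on the data.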
You mention a concern with update efficiency. I would be very surprised if this was an issue unless there were many updates a second. In general with these types of applications you will get many more reads than writes.
I see no reason to manually partition the indexes as you are suggesting unless you are going to have hundreds of millions of entries. Better would be to experiment with HashMap vs TreeMap for the concrete instantiation of the indices.
The next obvious enhancement if you need better performance is to multithread the application. That should not be too complex as you have relatively simple data structures to synchronize. Use of parallel streams in the searches helps of course (and you get them for free in Java 8).
So my recommendation is to go with these simple data structures and eke out performance using multithreading and by adjusting the concrete implementations (e.g. hash functions) before trying anything more sophisticated.

Although I am still in the middle of benchmarks, I am updating the status of the current development.
Best performance rates come when using:
Map<Country, Map<Age, Map <TimingIdentifier, List<User>>>>
(List is sorted)
Some notes on the keys: I added a Country called World in order to have an instance of the full leader-board country-independent (as if the Country filter is not selected). I did the same for Age (All-Ages) and TimeIdentifier (All-Time). TimeIdentifier key values are [All-Time, Month, Week, Day]
The above can be extended for other filters, so it can be applied for other scenarios as well.
Map<Filter1,Map<Filter2,Map<Filter3,Map<Filter4 ..other Map Keys here..,List<User>>>>>
Update: Instead of using multiple Map wrappers, a class used as a key in a single Map with the above fields is slightly faster. Of course, we need a multiton like pattern to create all available FilterCombination objects:
class FilterCombination {
private int CountryId;
private int AgeId;
private int TimeId;
...
}
then we define the Map<FilterCombination, List<User>> (sorted List)
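For a class like this to work as a Map key it has to implement equals() and hashCode() over its fields. A minimal sketch follows; the lower-case field names, the constructor and the immutability are choices made for this example rather than details given above.

```java
import java.util.Objects;

// Sketch of an immutable map key combining the three filters.
final class FilterCombination {
    private final int countryId;
    private final int ageId;
    private final int timeId;

    FilterCombination(int countryId, int ageId, int timeId) {
        this.countryId = countryId;
        this.ageId = ageId;
        this.timeId = timeId;
    }

    // Two keys are equal when all three filter ids match.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof FilterCombination)) return false;
        FilterCombination other = (FilterCombination) o;
        return countryId == other.countryId
                && ageId == other.ageId
                && timeId == other.timeId;
    }

    // Must be consistent with equals() so HashMap lookups find the key.
    @Override
    public int hashCode() {
        return Objects.hash(countryId, ageId, timeId);
    }
}
```

With value-based equality, a lookup such as map.get(new FilterCombination(1, 2, 3)) finds the list stored under an equal key, so pre-creating every combination becomes an optimization rather than a requirement.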
I could use a TreeSet but I didn't. Why? Basically, I was looking for an Order Statistic Tree (see here), but it seems there is no official Java implementation (see here). That is probably the way to go instead of a sorted List, because List.add(index, Object) is O(n). A LinkedList would be better for .add(index, Object), but unfortunately it is slow at getting the k-th element (ranking is O(n)). So every structure has its pros and cons for such a task.
At the moment, I ended up using a sorted List. The reason is that when adding an element to the sorted list, I use a slightly modified binary search algorithm (see here). This gives me the current User's rank at the insertion phase (so no additional search query is required); it is O(logn + n), i.e. O(n) overall (binary search for the index + List.add(index, Object)).
Is there any other structure that performs better than O(logn + n) for insert + get rank together?
*Of course, if I need to ask for a User's ranking at a later time, I will again do a binary search, based on the User's XP (+ timestamp, as you see below) and not the Id, because I cannot search by User-Id in a List.
**As a comparator I use the following criteria
1st: XP points
in case of a draw - 2nd criterion: timestamp of last XP update
so it is highly likely that ties in the sorted list will be very, very few. Moreover, I wouldn't mind if two users with the same XP were ranked in reverse order (even with our sample data of some millions of games, I found very few ties, not including zero XPs, which I don't care about at all).
An XP update requires some work and resources. Fortunately, the second comparison criterion significantly improved the search for a User inside this List (binary search again): before updating a User's XPs, I have to remove the previous entries for this User from the lists, but since I look them up via her previous XPs and timestamp, it is log(n).
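The insert-and-rank step described above can be sketched with Collections.binarySearch, which returns the insertion point when the element is absent. The names RankedList and Entry are illustrative, and the tie-break direction (earlier timestamp ranks first) is an assumption, since only the criteria themselves are stated.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sorted list that reports a user's rank on insertion.
class RankedList {

    static final class Entry {
        final String userId;
        final int xp;
        final long lastUpdate;

        Entry(String userId, int xp, long lastUpdate) {
            this.userId = userId;
            this.xp = xp;
            this.lastUpdate = lastUpdate;
        }
    }

    // Higher XP first; on equal XP, earlier update first (assumed direction).
    static final Comparator<Entry> ORDER =
            Comparator.comparingInt((Entry e) -> e.xp).reversed()
                    .thenComparingLong(e -> e.lastUpdate);

    private final List<Entry> entries = new ArrayList<>();

    // Inserts the entry and returns its 1-based rank.
    int insert(Entry e) {
        int pos = Collections.binarySearch(entries, e, ORDER);   // O(log n)
        if (pos < 0) {
            pos = -pos - 1;   // not found: decode the insertion point
        }
        entries.add(pos, e);  // O(n) because later elements shift
        return pos + 1;
    }

    // Removes the entry stored under an old (xp, timestamp) pair, if any.
    // If two users share both values, either of them may be removed.
    boolean remove(Entry old) {
        int pos = Collections.binarySearch(entries, old, ORDER);
        if (pos < 0) {
            return false;
        }
        entries.remove(pos);
        return true;
    }
}
```

An XP update is then a remove with the old pair followed by an insert with the new one, and the value returned by the insert is the user's new rank.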

The easiest option is to use Redis' sorted sets with master-slave replication: turn on RDB on each slave and back the RDB files up to S3. Use Kafka to persist all writes before they go to Redis, so that missing transactions can be replayed later on.

Avoiding exploding indices and entity-group write-rate limits with appengine

I have an application in which there are Courses, Topics, and Tags. Each Topic can be in many Courses and have many Tags. I want to look up every Topic that has a specific Tag x and is in specific Course y.
1. Naively, I give each Topic a list of Course ids and Tag ids, so I can select * from Topic where tagIds = x && courseIds = y. I think this query would require an exploding index: with 30 courses and 30 tags we're looking at ~900 index entries, right? At 50 x 20 I'm well over the 5000-entry limit.
2. I could just select * from Topic where tagIds = x, and then use a for loop to go through the result, choosing only Topics whose courseIds.contain(y). This returns way more results than I'm interested in and spends a lot of time deserializing those results, but the index stays small.
3. I could select __KEY__ from Topic where tagIds = x AND select __KEY__ from Topic where courseIds = y and find the intersection in my application code. If the sets are small this might not be unreasonable.
4. I could make a sort of join table, TopicTagLookup, with a tagId and a courseId field. The parent key of these entities would point to the relevant Topic. Then I would need to make one of these TopicTagLookup entities for every combination of courseId x tagId x relevant topic id. This is effectively like creating my own index. It would still explode, but there would be no 5000-entry limit. Now, however, I need to write 5000 entities to the same entity group, which would run up against the entity-group write-rate limit!
5. I could precalculate each query. A TopicTagQueryCache entity would hold a tagId, courseId, and a List<TopicId>. Then the query looks like select * from TopicTagQueryCache where tagId=x && courseId = y, fetching the list of topic ids, and then using a getAllById call on the list. Similar to #4, but I only have one entity per courseId x tagId. There's no need for entity groups, but now I have this potentially huge list to maintain transactionally.
Appengine seems great for queries you can precalculate. I just don't quite see a way to precalculate this query efficiently. The question basically boils down to:
What's the best way to organize data so that we can do set operations like finding the Topics in the intersection of a Course and a Tag?
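For reference, the application-side intersection of two keys-only result sets is only a few lines. The sketch below is generic over the key type and leaves out the datastore calls themselves; the class and method names are made up for the example.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: intersects the key lists returned by two queries.
class KeyIntersection {

    static <K> Set<K> intersect(Collection<K> a, Collection<K> b) {
        // Hash the smaller collection, then scan the larger one once.
        Collection<K> small = a.size() <= b.size() ? a : b;
        Collection<K> large = (small == a) ? b : a;
        Set<K> lookup = new HashSet<>(small);
        Set<K> result = new LinkedHashSet<>();
        for (K key : large) {
            if (lookup.contains(key)) {
                result.add(key);
            }
        }
        return result;
    }
}
```

The cost is linear in the combined size of the two key lists, which is why this approach is only attractive while both per-tag and per-course result sets stay small.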

Your assessment of your options is correct. If you don't need any sort criteria, though, option 3 is more or less already done for you by the App Engine datastore, with the merge join strategy. Simply do a query as you detail in option 1, without any sorts or inequality filters, and App Engine will do a merge join internally in the datastore, and return only the relevant results.
Options 4 and 5 are similar to the relation index pattern documented in this talk.

I like #5 - you are essentially creating your own (exploding) index. It will be fast to query.
The only downsides are that you have to manually maintain it (next paragraph), and retrieving the Topic entity will require an extra query (first you query TopicTagQueryCache to get the topic ID and then you need to actually retrieve the topic).
Updating the TopicTagQueryCache you suggested shouldn't be a problem either. I wouldn't worry about doing it transactionally - this "index" will just be stale for a short period of time when you update a Topic (at worst, your Topic will temporarily show up in results it should no longer show up in, and perhaps take a moment before it shows up in new results it should show up in - this doesn't seem so bad). You can even do this update on the task queue (to make sure this potentially large number of database writes all succeed, and so that you can quickly finish the request and your user isn't left waiting).

As you said yourself, you should arrange your data to facilitate the scaling of your app. So, on the question of "What's the best way to organize data so that we can do set operations like finding the Topics in the intersection of a Course and a Tag?":
You can hold your own indexes of these sets by creating CourseRef and TopicRef objects which consist of a Key only, with the ID portion being the actual Key of the corresponding entity. These "Ref" entities live under a specific Tag, so there are no actual Key duplicates. The structure for a given Tag is: Tag\CourseRef...\TopicRef...
This way, given a Tag and a Course, you construct the Key Tag\CourseRef and do an ancestor query, which gets you a set of keys you can fetch. This is extremely fast, as it is effectively direct access, and it should handle large lists of courses or topics without the issues of List properties.
This method will require you to use the DataStore API to some extent.
As you can see, this answers one specific question; the model will do no good for other types of set operations.
