Storing multiple values and picking the entity with the lowest value - Java

I'm currently looking to add a section to my program where engineers are selected for jobs based on their distance from a job. I've used the Google API to get the distance from one address to the other, but how would I store each engineer's ID and their distance from the address, and then select who's closest to the destination? A linked list?
Thanks

You could use a linked list whose nodes are instances of an Engineer class that you write yourself, with ID, distance and job attributes, then traverse the list to find the Engineer(s) with the smallest distance.
Or, if all you care about is ID and distance, you could use a HashMap that maps IDs to distances, so that you have just the one map instead of a collection of linked-list node objects.
These are two solutions out of many that you could use, though; my suggestion would be to use whichever you are most comfortable with.
If you need a more specific answer, please ask a more specific question.
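For the HashMap approach, a minimal sketch might look like this (the class name ClosestEngineer, the IDs and the distances are all made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class ClosestEngineer {
    // Map each engineer's ID to their distance from the job address.
    // Returns the ID with the smallest distance, or -1 if the map is empty.
    public static int closest(Map<Integer, Double> distancesById) {
        int bestId = -1;
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<Integer, Double> e : distancesById.entrySet()) {
            if (e.getValue() < bestDistance) {
                bestDistance = e.getValue();
                bestId = e.getKey();
            }
        }
        return bestId;
    }

    public static void main(String[] args) {
        Map<Integer, Double> distances = new HashMap<>();
        distances.put(101, 12.4);   // engineer 101 is 12.4 km away (invented data)
        distances.put(102, 3.1);
        distances.put(103, 8.0);
        System.out.println(closest(distances)); // prints 102
    }
}
```

A PriorityQueue ordered by distance would also work if you expect to repeatedly pull out the next-closest engineer.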

Related

What data structure could be used to store objects with multiple comparable attributes

I want to build a data structure to store information about multiple houses, so that later a user can retrieve the desired housing information through a search query. To achieve a fast search, I will use a red-black tree. The problem I am facing is that the key of each node only contains one attribute of the house, e.g. price; the others, such as number of beds, land size etc., cannot be stored in a single tree. What would be a good data structure for this problem? Initially I thought of a tree nested in a tree; is this viable, or considered good?
The problem you are facing can be solved using secondary indexes on top of your data. Secondary indexes are a concept studied intensively in the database world, and you should have no trouble finding resources that explain how they are implemented in real databases.
So, you currently have a primary key for your data: the object's memory reference, or maybe an index into a collection of references. For each attribute that you want to query, you will need a fast way of looking up matching objects. The exact data structure you use will depend on the type of queries you perform, but some kind of search tree is a good general-purpose choice and is usually efficient for updates, which matters a lot for databases. The data structure should take a query on the specific attribute and return the references, or primary keys, of all the objects that match it.
In your example you might have one red-black tree for price and another for number-of-beds. To answer the query "price = 30 or number-of-beds = 4", you query your price data structure and then your number-of-beds data structure; since the query is an "or", you simply take the union of the primary keys returned (take the intersection for "and"s).
Notice that if you add to or update your objects, you will also need to update every index that is affected. This is a trade-off you also see in real databases: faster reads for slower writes.
A nested-tree approach might work depending on the kind of queries you make, but it quickly becomes unsuitable if the data is not static: updating the tree whenever you update your objects will be very slow.
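A minimal sketch of this idea, assuming integer house IDs as primary keys and TreeMap standing in for the red-black trees (the class name HouseIndex and the method names are invented):

```java
import java.util.HashSet;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

public class HouseIndex {
    // Secondary indexes: attribute value -> set of primary keys (house IDs).
    // TreeMap's red-black tree plays the role of the trees in the question.
    private final NavigableMap<Integer, Set<Integer>> byPrice = new TreeMap<>();
    private final NavigableMap<Integer, Set<Integer>> byBeds = new TreeMap<>();

    public void add(int id, int price, int beds) {
        byPrice.computeIfAbsent(price, k -> new HashSet<>()).add(id);
        byBeds.computeIfAbsent(beds, k -> new HashSet<>()).add(id);
    }

    // "price = p OR beds = b": union of the two result sets.
    public Set<Integer> priceOrBeds(int p, int b) {
        Set<Integer> result = new HashSet<>(byPrice.getOrDefault(p, Set.of()));
        result.addAll(byBeds.getOrDefault(b, Set.of()));
        return result;
    }

    // "price = p AND beds = b": intersection of the two result sets.
    public Set<Integer> priceAndBeds(int p, int b) {
        Set<Integer> result = new HashSet<>(byPrice.getOrDefault(p, Set.of()));
        result.retainAll(byBeds.getOrDefault(b, Set.of()));
        return result;
    }
}
```

Range queries ("price between 20 and 40") would use TreeMap's subMap over the same indexes, which is exactly where the tree-based index pays off over a hash-based one.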

Hashing with Linked Lists

I am trying to do a form of coalesced hashing, and to do so I need to maintain multiple linked lists that get created when you try to insert something into the table and it collides with another object. How would I go about creating multiple linked lists inside an add(Object x) function, and then be able to reach the same list again in a find(Object x) function?
For example, if my hash value is 5 and bucket 5 is occupied, I create a linked list with bucket 5 as its head, and then create a new node where the object I tried to put into 5 ends up. This way, when I try to find the object later, rather than probing the table, I can just follow the linked list referenced from slot 5 to my object.
My issue is that I cannot figure out how to maintain multiple linked lists for different collisions, and then reach the appropriate list later on when I try to find the object. Any help is greatly appreciated.
If you're trying to replicate something like HashMap (and it sounds very much like you are), you'll want to keep the linked lists in a search tree (or, as HashMap itself does, an array indexed by bucket), so that you can find the right list for inserting and for finding an object in reasonable time.
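A minimal sketch of separate chaining, which is a close cousin of what the question describes (true coalesced hashing stores the chain nodes inside the table itself, while here each bucket owns its own list; the class name is invented):

```java
import java.util.LinkedList;

public class ChainedHashTable {
    // One linked list per occupied bucket; the array slot for a hash value
    // is how add() and find() locate the same list again later.
    private final LinkedList<Object>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int capacity) {
        buckets = (LinkedList<Object>[]) new LinkedList[capacity];
    }

    private int bucketFor(Object x) {
        // floorMod keeps the index non-negative even for negative hash codes
        return Math.floorMod(x.hashCode(), buckets.length);
    }

    public void add(Object x) {
        int i = bucketFor(x);
        if (buckets[i] == null) {
            buckets[i] = new LinkedList<>();   // list created lazily on first use
        }
        buckets[i].add(x);                     // collisions simply extend the same list
    }

    public boolean find(Object x) {
        int i = bucketFor(x);                  // the same hash leads back to the same list
        return buckets[i] != null && buckets[i].contains(x);
    }
}
```

The key point for the question: you never need to "remember" which list an object went into, because recomputing the hash in find() leads you back to it.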

Huge Leaderboard ranking with filtering

We are building a massive multiplayer educational game with several million entries in the leaderboard (based on aggregated XPs gained). After a game finishes, we need to show the leaderboard and how this player/student is ranked.
But there are a couple of filters for this leaderboard (global/by country, by month/year/today, by age etc.) that can be mixed together, e.g. 'Get me the leaderboard for my country for the last month'. The number of combinations is ~20.
My problem is how to store such a structure that is updated regularly; rankings must be recalculated after each game. A typical full leaderboard at the moment has ~5 million entries for players coming from >150 countries.
We used to have a MySQL Cluster table (userid, xps, countryid) with 3 nodes, but ordering by XPs (either in the DBMS or in the application, which required pulling all data from the DB) proved too slow as the numbers got bigger (>20K users). This is an interesting post, but again, half a second per query is too much.
Then we used Redis (see this post), but filtering is the problem there. We used separate lists for the TOP 5 and the rest. The TOP 5 was updated instantly; for the rest there was a delay of 20-30 minutes. In fact we ranked each user against a cached instance of the leaderboard (using the real XPs though, not the cached ones), so this was acceptable. Real-time ranking outside the Top 5 is not a prerequisite.
This is fine for one global ranking, but how do we filter the results by month and/or country and/or age? Do we need to keep a list for every filtering combination?
We also tested custom structures in Java (using it as a Java caching server, similar in functionality to Redis) and are still experimenting with them. Which combination of structures best achieves our goal? We ended up using one list per filtering combination, e.g. Map<FilteringCombination, SortedList<User>>, and then doing a binary search on the list for a specific key. This way a finished game requires a couple of insertions, say X, but it takes X * NumOfPlayers space, which is X times more than keeping a single list (not sure this fits in memory, but we can always create a cluster here by splitting combinations across different servers). There is also the issue of how to rebuild the cache in case of failure, but that is another problem we can deal with.
Extending the above method, we might slightly improve performance by defining scoring buckets inside each list (e.g. a bucket for 0-100 XP, another for 101-1000 XP, another for 1001-10000 XP, etc.). The bucket-splitting policy would be based on the players' XP distribution in our game. It's true that this distribution is dynamic in the real world, but we have seen that after a few months the changes are minor, bearing in mind that XPs always increase but new users keep arriving as well.
We are also testing Cassandra's natural ordering by utilizing clustering keys and the wide-rows feature, although we know that having some millions of rows may not be easy to handle.
All in all, this is what we need to achieve: if a user (let's name her UserX) is not in the Top 5 list, we need to show her ranking together with some surrounding players (e.g. 2 above and 2 below), as in the example below:
Global TOP 5            My Global Ranking (425)    My Country Ranking    Other Rankings
1. karen (12000xp)      423. george                1. david
2. greg (11280xp)       424. nancy                 2. donald
3. philips (10293xp)    **425. UserX**             3. susan
4. jason (9800xp)       426. rebecca               **4. UserX**
5. barbara (8000xp)     427. james                 5. teresa
I've studied many SO and other posts but still cannot find a solution for efficiently updating and filtering large leaderboard tables. Which candidate solution would you choose, and what are the possible performance improvements (space + memory + insertion/search CPU cost)?
That's a very interesting problem - thanks for posting. In general, databases excel at this type of problem, in which there are large amounts of data that need to be filtered and searched. My first guess is that you are not using your MySQL indexes correctly. Having said that, you clearly need to regularly find the nth row in an ordered list, which is something SQL is not at all good at.
If you are looking for some form of in-memory database, you'll need something more sophisticated than Redis. I would suggest you look at VoltDB, which is very fast but not cheap.
If you would like to build your own in-memory store, you'll need to calculate the memory use to see if it's feasible. You will need an index (discussed later in this answer) for each field you want to search or filter on, along with the record for each user. However, even for 10 million rows and 20 fields, it's still going to be less than 1 GB of RAM, which should be fine on modern machines.
Now for the data structures. I believe you are on the right track using maps to lists. I don't think the lists need to be sorted - you just need to be able to get the set of users for a particular value. In fact, sets may be more appropriate (again, worth testing performance). Here is my suggestion to try (I've just added country and age fields - I assume you'll need others, but it's a reasonable example to start with):
enum Country {
    ...
}

class User {
    String givenName;
    String familyName;
    int xp;
    Country country;
    int age;
}

class LeaderBoard {
    Set<User> users;
    Map<Integer, Set<User>> xpIndex;
    Map<Country, Set<User>> countryIndex;
    Map<Integer, Set<User>> ageIndex;
}
Each of the indices will need to be updated when a field changes. For example:
private void setUserAge(User user, int age) {
    assert users.contains(user);
    assert ageIndex.get(user.getAge()).contains(user);
    ageIndex.get(user.getAge()).remove(user);
    if (!ageIndex.containsKey(age)) {
        ageIndex.put(age, new TreeSet<>());
    }
    ageIndex.get(age).add(user);
    user.setAge(age);
}
Getting all users, by rank, that satisfy a given combination can be done in a number of ways:
countryIndex.get(Country.Germany).stream()
    .filter(ageIndex.get(20)::contains)
    .sorted(User::compareRank)
    ...
or
SortedSet<User> germanUsers = new TreeSet<>(User::compareRank);
germanUsers.addAll(countryIndex.get(Country.Germany));
germanUsers.retainAll(ageIndex.get(20));
You'll need to check which of these is more efficient - I would guess the stream implementation will be. It can also easily be converted to a parallelStream.
You mention a concern with update efficiency. I would be very surprised if this was an issue unless there were many updates a second. In general with these types of applications you will get many more reads than writes.
I see no reason to manually partition the indexes as you are suggesting unless you are going to have hundreds of millions of entries. Better would be to experiment with HashMap vs TreeMap for the concrete instantiation of the indices.
The next obvious enhancement if you need better performance is to multithread the application. That should not be too complex as you have relatively simple data structures to synchronize. Use of parallel streams in the searches helps of course (and you get them for free in Java 8).
So my recommendation is to go with these simple data structures and eke out performance using multithreading and adjusting the concrete implementations (e.g. hash functions) before trying anything more sophisticated.
Although I am still in the middle of benchmarking, here is an update on the current state of development.
Best performance rates come when using:
Map<Country, Map<Age, Map<TimeIdentifier, List<User>>>>
(each List is kept sorted)
Some notes on the keys: I added a Country called World in order to have an instance of the full leaderboard that is country-independent (as if no Country filter were selected). I did the same for Age (All-Ages) and TimeIdentifier (All-Time). The TimeIdentifier key values are [All-Time, Month, Week, Day].
The above can be extended for other filters, so it can be applied for other scenarios as well.
Map<Filter1, Map<Filter2, Map<Filter3, Map<Filter4, ..other map keys here.., List<User>>>>>
Update: instead of using multiple Map wrappers, a class used as the key of a single Map, with the above fields, is slightly faster. Of course, we need a multiton-like pattern to create all the available FilterCombination objects:
class FilterCombination {
    private int countryId;
    private int ageId;
    private int timeId;
    ...
}
We then define a Map<FilterCombination, List<User>> (with each List kept sorted).
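One detail worth noting: for FilterCombination to work as a map key, it needs equals() and hashCode() overrides. A minimal sketch, with field names following the snippet above and a constructor that is assumed rather than taken from the post:

```java
import java.util.Objects;

// Value object suitable for use as a Map key: immutable fields,
// plus consistent equals() and hashCode().
public final class FilterCombination {
    private final int countryId;
    private final int ageId;
    private final int timeId;

    public FilterCombination(int countryId, int ageId, int timeId) {
        this.countryId = countryId;
        this.ageId = ageId;
        this.timeId = timeId;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof FilterCombination)) return false;
        FilterCombination other = (FilterCombination) o;
        return countryId == other.countryId
                && ageId == other.ageId
                && timeId == other.timeId;
    }

    @Override
    public int hashCode() {
        return Objects.hash(countryId, ageId, timeId);
    }
}
```

Without these overrides, two FilterCombination objects built from the same filter values would hash to different buckets, and map lookups would silently miss.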
I could have used a TreeSet, but I didn't. Why? Basically, I was looking for an order statistic tree (see here), but there seem to be no official Java implementations (see here). That would probably be the way to go versus a sorted List, because of the inefficiency of List.add(index, Object), which is O(n). A LinkedList would be better for add(index, Object), but unfortunately it is slow at getting the k-th element (ranking would be O(n)). So every structure has its pros and cons for such a task.
At the moment I have ended up using a sorted List. The reason is that when adding an element to it, I use a slightly modified binary search algorithm (see here). This method gives me the current user's rank at insertion time (so no additional search query is required); it is O(log n + n) (binary search for the index + List.add(index, Object)).
Is there any other structure that performs better than O(log n + n) for insert + get-rank together?
*Of course, if I need to ask for a user's ranking at a later time, I will again do a binary search, based on the user's XP (+ timestamp, as you see below) and not the ID, because a List cannot be searched by user ID.
**As a comparator I use the following criteria:
1st criterion: XP points
2nd criterion (in case of a draw): timestamp of the last XP update
So it is highly likely that ties in the sorted list will be very rare. Moreover, I wouldn't mind if two users with the same XP were ranked in reverse order (even with our sample data of some millions of games, I found very few ties, not counting zero XPs, which I don't care about at all).
An XP update requires some work and resources. Fortunately, the second comparison criterion significantly improved user search inside this List (binary search again), because before updating a user's XPs I have to remove that user's previous entries from the lists - but I can look them up via the previous XP and timestamp, so it is O(log n).
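The insert-and-get-rank step described above can be sketched with Collections.binarySearch on a descending list (the class and method names are invented, and ties are ignored here for simplicity; the post's real comparator breaks them by timestamp):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RankedList {
    // Kept in descending XP order, so rank 1 is index 0.
    private final List<Integer> xps = new ArrayList<>();

    // Inserts an XP value and returns the 1-based rank it lands at.
    // binarySearch is the O(log n) part; List.add(index, e) shifting
    // elements is the O(n) part, matching the cost discussed above.
    public int insertAndRank(int xp) {
        int i = Collections.binarySearch(xps, xp, Collections.reverseOrder());
        if (i < 0) {
            i = -i - 1;   // binarySearch encodes a missing key's insertion point as -(point) - 1
        }
        xps.add(i, xp);
        return i + 1;
    }
}
```

An order statistic tree would replace the O(n) shift with O(log n), which is exactly why the post asks about one.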
The easiest option is to use Redis' sorted sets, with master-slave replication. Turn on RDB on each slave and back the RDB files up to S3. Use Kafka to persist all writes before they go to Redis, so that missing transactions can be replayed later.

Calculating shortest paths with data extracted from a database

So I need to create a web service that will communicate with my Android application. In the Android app the client chooses two points, a start and an arrival; these two points are sent to my web service, which must find the bus that has the shortest path between them. I have a problem with the web service side.
I tried to use Dijkstra's algorithm to find the shortest path between two points. To test the algorithm I must extract the data from a MySQL database rather than hard-coding it into the algorithm, but I don't know how to do that.
In my database I have two tables. One contains the bus route (bus number), code (station id) and pt_arret (station name). The other contains code (station id), latitude, longitude, and distance (the distance between a station and the station that precedes it).
You've got to create a structure that will let you use Dijkstra's algorithm. To do that, you must read all the relevant data from the database. The transition from relational data to object-oriented is always awkward.
Ideally you want to use a single, simple SQL SELECT per table to get the data. Optimization is tricky: a single SELECT statement can grab a million rows almost as fast as it can grab one row, and one SELECT will get a million rows faster than 10 SELECTs will grab 10 rows (in my experience). But grabbing too many unneeded rows might take too long if your DB connection is slow (has narrow bandwidth).
Use Maps (TreeMap or HashMap) to keep track of what you read, so you can find "station" objects that have already been read and placed in your structure and add connections to them.
Once you have your data structure set up in memory, try to keep it around as long as possible to limit delays from rereading the database.
Keep an eye on your memory and timings. You are in danger of running too slow for your users or running low on memory. You need to pay attention to performance (which does not seem to be a common need, these days). I've made some suggestions, but I can't really know what will happen with your hardware and data. (For instance, reading the DB may not be as slow as I suspect.)
Hope this helps. If you have not done anything like this before, you've got a lot of work and learning ahead of you. I worked on a major program like this (but it also wrote to the DB), and I felt like I was swimming upstream all the way.
Addition:
What you want in memory is a set of stations (Station class objects) and routes (Route class objects). Each station would hold all the data needed to describe one station, including its location. Critically, it also needs an ID. The stations should go in a TreeMap keyed by ID. (This is my prejudice; many people would use a HashMap.)
Each route will have references to the two stations it links, a distance or travel time, and any other needed information.
Each station will also contain a list of routes that reference it. I'd recommend a LinkedList for flexibility. (In this case, ArrayList is apt to waste a lot of space with unused array elements.) You will want to read the stations from the db, then read route info. As you read each route's info, create the Route object, locate the two stations, add references to them to the Route, then add the Route to both stations' route lists.
Now for each station you can spot all the routes leaving it, and then spot all the stations you can get to with one bus trip. From those stations, you can work your way on, all through your network. This structure really is a "sparse array", if you want to think of it that way.
Applying Dijkstra's algorithm--or any other algorithm--is quite straightforward. You'll want various flags on the stations and routes (fields in the Station and Route classes) to track which nodes (stations) and connections (routes) you've already used for various purposes. It might help to draw the map (start with a small one!) on a sheet of paper to track what your code is doing. My experience has been that it takes very little code to do all this, but it takes a lot of careful thought.
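The structure described above can be sketched roughly like this. The class names follow the answer's description, but everything else (field names, the shortestDistance method, the in-memory test data) is invented for illustration; in the real program the stations and routes would be populated from the MySQL ResultSets rather than constructed by hand:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

public class BusNetwork {
    static class Station {
        final int id;
        final List<Route> routes = new ArrayList<>();   // routes referencing this station
        Station(int id) { this.id = id; }
    }

    static class Route {
        final Station a, b;
        final double distance;
        Route(Station a, Station b, double distance) {
            this.a = a;
            this.b = b;
            this.distance = distance;
            a.routes.add(this);   // register the route with both endpoints
            b.routes.add(this);
        }
        Station other(Station s) { return s == a ? b : a; }
    }

    // Stations keyed by ID, as the answer suggests.
    final TreeMap<Integer, Station> stations = new TreeMap<>();

    Station station(int id) {
        return stations.computeIfAbsent(id, Station::new);
    }

    // Standard Dijkstra over the station/route structure.
    double shortestDistance(int fromId, int toId) {
        Map<Integer, Double> settled = new HashMap<>();
        PriorityQueue<double[]> pq =   // entries are {distance, stationId}
                new PriorityQueue<>(Comparator.comparingDouble((double[] e) -> e[0]));
        pq.add(new double[] {0.0, fromId});
        while (!pq.isEmpty()) {
            double[] cur = pq.poll();
            double d = cur[0];
            int id = (int) cur[1];
            if (settled.containsKey(id)) continue;   // already reached by a shorter path
            settled.put(id, d);
            if (id == toId) return d;
            for (Route r : stations.get(id).routes) {
                Station next = r.other(stations.get(id));
                if (!settled.containsKey(next.id)) {
                    pq.add(new double[] {d + r.distance, next.id});
                }
            }
        }
        return Double.POSITIVE_INFINITY;   // unreachable
    }
}
```

Note how each Route registers itself with both of its stations in the constructor, exactly the wiring step the answer describes for when route rows are read from the database.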

what to use for this requirement, Array, List, Map,?

While writing my program I have come across a requirement to assign unique IDs to some objects that I create. I create the objects dynamically in the GUI, and initially I used a simple counter to assign an int value to each created node, which worked just fine.
However, the problem with this approach is that if a node has to be deleted while building the GUI, its ID is removed and never used again. For each new node I always use the latest counter value, which leaves a lot of missing int values when nodes are deleted along the way.
I want to reuse those missing IDs when creating new nodes, but I am confused about which approach to adopt.
My ideas:
1. Use an ArrayList that contains the available values; when a node is deleted, its ID is added to this list. I sort the list and use the minimum value for the new node. Fine, but when I use that value and remove it from the list, the index is not deleted, and this causes problems.
2. A HashMap: similarly to the above, I add available IDs and remove used ones, but I'm not sure how to sort a HashMap.
Can you suggest how I should go about it? Maybe I need some kind of stack where I can push values, sort it and use the minimum value; once a value is used, it is removed from the stack. Please give me some ideas on how to accomplish this.
Keep a list of the deleted IDs, and when you create a new node, check that list for an ID to re-use (doesn't matter which you take); if the list is empty (as it will be initially), get a new ID "the old way". Even more clever: make the list an object that will generate a new ID if there aren't any deleted ones in it, so the caller doesn't have to worry about HOW the ID was arrived at.
You could use a TreeSet (which automatically keeps all entries sorted from least to greatest) to store the deleted IDs (myTreeSet.add(oldId)). That way, when you go to create a new instance, you first check whether there are any entries in the TreeSet. To grab the lowest value, you would use myTreeSet.first() (which should be a fast operation). If the TreeSet is empty, meaning all known IDs are currently in use, you go ahead and use the next available ID as normal.
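A minimal sketch of that idea (the class and method names are invented for illustration):

```java
import java.util.TreeSet;

public class IdAllocator {
    // Deleted IDs waiting to be reused; TreeSet keeps them sorted ascending.
    private final TreeSet<Integer> freed = new TreeSet<>();
    private int nextId = 0;   // counter for brand-new IDs

    // Reuse the smallest freed ID if there is one, otherwise mint a new one.
    public int acquire() {
        if (!freed.isEmpty()) {
            return freed.pollFirst();   // removes and returns the minimum
        }
        return nextId++;
    }

    public void release(int id) {
        freed.add(id);
    }
}
```

This also matches the earlier answer's suggestion of wrapping the whole decision in one object, so callers never need to know how an ID was arrived at.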
How about a TreeSet to store the used IDs? You could then scan it in order and take the first gap as the lowest free ID (note that higher(0) alone only returns the smallest used ID greater than 0, not the lowest free one). If the set is empty, you know there are no used IDs.
The first solution works fine only if you have few nodes. Imagine an application with thousands of nodes: what about memory consumption? The HashMap solution is better suited to your aims and needs fewer checks.
