Need help storing/retrieving data - Java

A user has an ArrayList of numbers. Those numbers correspond to the applications that they have access to. Certain applications have different entitlement numbers (e.g. 3, 72, etc.). What I want to do is store this data in a map so that when the user logs in, they can look their numbers up in the Map and quickly get the applications that they are entitled to. However, some applications require 2 to 3 entitlement numbers. For example, one entry could be: ("101 and 234", "Application 1"). I was wondering if there was an ideal way of retrieving all of the values from the map that the user's entitlement numbers satisfy.
As I currently have it, the program compares the user's data against each application and confirms or denies access. This seems inefficient. Any help is greatly appreciated!
Note: I am reading the applications and their numbers in from an XML file, so I can store them however I wish.

If there are large numbers of numbers required per application, the best approach is to use set intersection. If the numbers are contiguous or at least dense, you can optimize this into a bitset. For only one or two numbers though, I'd recommend just testing each number individually, since it's likely to be faster than full set operations.
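As a minimal sketch of the per-app membership check (App here is a hypothetical class holding the entitlement numbers it requires in a Set):

    import java.util.*;

    class App {
        final String name;
        final Set<Integer> required;              // entitlement numbers this app needs

        App(String name, Integer... required) {
            this.name = name;
            this.required = new HashSet<>(Arrays.asList(required));
        }
    }

    class EntitlementCheck {
        static List<App> entitledApps(Set<Integer> userEntitlements, List<App> allApps) {
            List<App> result = new ArrayList<>();
            for (App app : allApps) {
                // containsAll() effectively tests each required number individually against the user's set
                if (userEntitlements.containsAll(app.required)) {
                    result.add(app);
                }
            }
            return result;
        }
    }

If the entitlement numbers are dense, you could swap the HashSet for a java.util.BitSet and test each required number with get(int).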

The solution:
Define a class for each application (let's call it App). The class contains the name of the application, and a (sorted) List/array of entitlements.
Use a Map to map from String to App: Map<String, App> for all single-entitlement apps (you can use HashMap or TreeMap - your choice). If there are multiple apps that need one and the same entitlement, consider Map<String, List<App>>. Exclude the apps that need multiple entitlements from the Map, and store them in a separate List/array instead.
When you are given a list of entitlements to retrieve apps for, loop through the list and just grab everything that the Map maps each String to. For the apps that need multiple entitlements, check them individually (you can speed up the checking a bit by sorting the given list of entitlements, and storing the entitlements of each App in sorted order - but it may not even matter, since the size is small).
The solution reduces the time complexity of the operation. However, a few hundred apps times around 10 entitlements each is quite small in my opinion, unless you call this many times. It's better to time your original approach against this approach - since the overhead might shadow any improvement in time.
A possible further improvement (or not) is:
Use Map<String, List<App>> and include even the apps that need multiple entitlements (those apps will be mapped to by several entitlements).
When we search for apps, we use a Map<App, Integer> to keep track of how many entitlements we have confirmed so far for each multi-entitlement app. The flow looks like:
new mapAppInteger
foreach entitlement in inputListOfEntitlement
    listOfApps = mapStringAppList.get(entitlement)
    if listOfApps found
        for each app in listOfApps
            if app needs singleEntitlement
                put app in output list
            else // needs multiple
                if app is in mapAppInteger
                    map app --> count + 1
                    if mapAppInteger.get(app) == app.numberOfRequiredEntitlement
                        put app in output list
                        remove app from mapAppInteger
                else // not in mapAppInteger
                    map app --> 1
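A minimal Java version of that flow, assuming a hypothetical App class with a name and a requiredCount field, and an index from each entitlement number to the apps that list it:

    import java.util.*;

    class EntitlementResolver {
        static List<App> resolveApps(List<Integer> userEntitlements,
                                     Map<Integer, List<App>> appsByEntitlement) {
            List<App> output = new ArrayList<>();
            Map<App, Integer> matched = new HashMap<>();   // confirmed entitlements per multi-entitlement app
            for (Integer entitlement : userEntitlements) {
                List<App> candidates = appsByEntitlement.get(entitlement);
                if (candidates == null) {
                    continue;
                }
                for (App app : candidates) {
                    if (app.requiredCount == 1) {
                        output.add(app);                   // single-entitlement app: entitled immediately
                    } else {
                        int seen = matched.getOrDefault(app, 0) + 1;
                        if (seen == app.requiredCount) {
                            output.add(app);               // all of its required entitlements have been seen
                            matched.remove(app);
                        } else {
                            matched.put(app, seen);
                        }
                    }
                }
            }
            return output;
        }
    }

Note that the user's entitlement list should be de-duplicated first (e.g. copied into a Set), otherwise a repeated number would inflate the counts.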

Related

Speed up a search cache without using too much memory

I have to access a database with 380,000 entries. I don't have write access to the DB, I can just read it. I've made a search function using a map to search for users by firstname. Here is my process:
1 - Load everything from the DB
2 - Store everything into a Map<Character, ArrayList<User>>, using letters of the alphabet as keys to group users by the first letter of their first name.
<A> {Alba, jessica, Alliah jane, etc ...}
<B> {Birsmben bani, etc ...}
When someone searches for a user, I take the first letter of the first name typed and use map.get(firstLetter), then iterate over the ArrayList to find all the matching users.
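For reference, a minimal sketch of that indexing scheme (User and allUsers are assumed to exist elsewhere in the application):

    import java.util.*;

    Map<Character, List<User>> usersByInitial = new HashMap<>();
    for (User user : allUsers) {
        char initial = Character.toUpperCase(user.getFirstName().charAt(0));
        usersByInitial.computeIfAbsent(initial, k -> new ArrayList<>()).add(user);
    }

    // lookup: one map.get(), then a linear scan of that letter's bucket
    List<User> candidates = usersByInitial.getOrDefault('A', Collections.emptyList());
    for (User user : candidates) {
        if (user.getFirstName().equalsIgnoreCase("Alba")) {
            // match
        }
    }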
The Map takes up a huge amount of memory, I guess (380,000 User objects). I had to increase the heap size.
I want to make it faster, for example by using the first name as the key for the Map (there are many people with the same first name).
I have two solutions in mind:
1 - Still use a map with firstname as key (increasing the heap size again?)
2 - Use files on disk instead of the Map (Alba.dat would contain all the Albas, for example) and open the right file for each search. No need to increase the heap size, but are there any side effects?
Which one is better? (pros and cons)
Update with more info
It's a database of customers who call our customer service on the phone. The person who takes the call has to search by the customer's name (usually first name and then last name). Using the DB directly is too slow to search. The solution I've implemented is already much faster (1/2 seconds vs 26 seconds using the DB), but I want to improve it.
IMHO, you don't have to cache all the entries in memory, just a part of them. Maybe:
Just use a ring buffer, or
More complicated, but it makes more sense: implement an LFU cache that keeps only the N most frequently accessed items. See this question for a hint on how to implement such a cache.
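The linked question covers LFU eviction; as a simpler illustration of the bounded-cache idea, here is a size-limited LRU cache using LinkedHashMap's access order (a standard JDK trick, shown only as a sketch):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // keeps only the maxEntries most recently accessed entries
    class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxEntries;

        LruCache(int maxEntries) {
            super(16, 0.75f, true);          // accessOrder = true: get() moves an entry to the end
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries;      // evict the least recently used entry on overflow
        }
    }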
There are several issues with your approach:
It implies that the number of users doesn't change; a good application design would work with any number of users without software changes.
It implies that the current problem is the only one. What happens if the next requirement that needs implementation is "search by caller id" or "search by zip code"?
It is reinventing the wheel, you are currently starting to write a database, index or information retrieval solution (however you want to name it) from scratch
The right thing to do is to export the user data into a database engine which provides proper search capabilities. The export/extraction can hopefully be sped up if you have modification timestamps, or if you can intercept updates and reapply them to your search index.
What you use for your search does not matter too much; a simple database on a modern system is fast enough, and most also provide indexing capabilities to speed up your search. If you want something which can be embedded in your application, is specialized in search, and solves the problems above, I'd recommend using Lucene.
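For illustration, a rough sketch of indexing and searching first names with Lucene (Lucene 5+ style; the field names and index path are made up, and the exact API varies slightly between versions):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.*;

    Directory dir = FSDirectory.open(Paths.get("user-index"));

    // index once (or incrementally, if you can intercept updates)
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
        Document doc = new Document();
        doc.add(new TextField("firstname", "Alba Jessica", Field.Store.YES));
        writer.addDocument(doc);
    }

    // search by first name
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        QueryParser parser = new QueryParser("firstname", new StandardAnalyzer());
        TopDocs hits = searcher.search(parser.parse("alba"), 10);
        for (ScoreDoc hit : hits.scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("firstname"));
        }
    }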

Best practice for holding huge lists of data in Java

I'm writing a small system in Java in which I extract n-gram features from text files and later need to perform a Feature Selection process in order to select the most discriminative features.
The Feature Extraction process for a single file returns a Map which contains, for each unique feature, its occurrences in the file. I merge all the files' Maps into one Map that contains the Document Frequency (DF) of all unique features extracted from all the files. The unified Map can contain more than 10,000,000 entries.
Currently the Feature Extraction process is working great and I want to perform Feature Selection, for which I need to implement Information Gain or Gain Ratio. I will have to sort the Map first, perform the computations and save the results, in order to finally get a list of (for each feature, its Feature Selection score).
My question is:
What is the best practice and the best data structure to hold this large amount of data (~10M) and perform computations?
This is a very broad question, so the answer is going to be broad too. The solution depends on (at least) these three things:
The size of your entries
Storing 10,000,000 integers will require about 40MiB of memory, while storing 10,000,000 x 1KiB records will require more than 9GiB. These are two different problems. Ten million integers are trivial to store in memory in any stock Java collection, while keeping 9GiB in memory will force you to tweak and tune the Java Heap and garbage collector. If the entries are even larger, say 1MiB, then you can forget about in-memory storage entirely. Instead, you'll need to focus on finding a good disk backed data structure, maybe a database.
The hardware you're using
Storing ten million 1KiB records on a machine with 8 GiB of RAM is not the same as storing them on a server with 128 GiB. Things that are pretty much impossible with the former machine are trivial with the latter.
The type of computation(s) you want to do
You've mentioned sorting, so things like TreeMap or maybe PriorityQueue come to mind. But is that the most intensive computation? And what is the key you're using to sort them? Do you plan on locating (getting) entities based on other properties that aren't the key? If so, that requires separate planning. Otherwise you'd need to iterate over all ten million entries.
Do your computations run in a single thread or multiple threads? If you might have concurrent modifications of your data, that requires a separate solution. Data structures such as TreeMap and PriorityQueue would have to be either locked or replaced with concurrent structures such as ConcurrentLinkedHashMap or ConcurrentSkipListMap.
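For instance, if the end goal is just the highest-scoring features, a bounded min-heap avoids sorting all ~10M entries (scores here is a hypothetical Map<String, Double> of feature -> Information Gain score):

    import java.util.Map;
    import java.util.PriorityQueue;

    int k = 50_000;                                              // how many top features to keep
    PriorityQueue<Map.Entry<String, Double>> topK =
            new PriorityQueue<>(Map.Entry.comparingByValue());   // min-heap ordered by score
    for (Map.Entry<String, Double> entry : scores.entrySet()) {
        topK.offer(entry);
        if (topK.size() > k) {
            topK.poll();                                         // drop the current minimum
        }
    }
    // topK now holds the k highest-scoring features (in heap order, not fully sorted)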
You can use a caching system; check out MapDB, it's very efficient and has a tree map implementation (so you can have your data ordered without any effort). It also provides data stores to save your data to disk when it cannot be held in memory.
// here is a sample that uses off-heap (direct) memory to back the map (older MapDB 1.x-style API)
import java.util.Map;
import org.mapdb.DBMaker;

Map<String, String> map = DBMaker.newMemoryDirectDB().make().getTreeMap("words");
// put some stuff into the map
map.put("aa", "bb");
map.put("cc", "dd");
My intuition is that you could take inspiration from the original MapReduce paradigm, partition your problem into several smaller but similar problems, and then aggregate the partial results in order to reach the complete solution.
If you solve one smaller problem instance at a time (i.e. one file chunk), the space consumption penalty is bounded by the space requirements of that single instance.
This approach of processing the files lazily works regardless of the data structure you choose.
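As a sketch of that idea: process the files one at a time and fold each partial result into the global document-frequency map, so only one per-file Map is alive at any moment (extractFeatures is a hypothetical per-file extractor):

    import java.nio.file.Path;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    Map<String, Integer> documentFrequency = new HashMap<>();
    for (Path file : files) {
        Map<String, Integer> partial = extractFeatures(file);    // hypothetical: feature -> occurrences in this file
        for (String feature : partial.keySet()) {
            documentFrequency.merge(feature, 1, Integer::sum);   // DF counts documents containing the feature
        }
    }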

Store and search sets (with many possible values) in a database (from Java)

The problem is how to store (and search) a set of items a user likes and dislikes. Although each user may have 2-100 items in their set, the number of possible item values is in the tens of thousands (and expanding).
Associated with each item is a value say from 10 (like) to 0 (neutral) to -10 (dislike).
So given a user with a particular set, how to find users with similar sets (say a percentage overlap on the intersection)? Ideally the set of matches could be reduced via a filter that includes only items with like/dislike values within a certain percentage.
I don't see how to use a key/value or column store for this, and walking a relational table of items for each user would seem to consume too many resources. Turning the sets into documents would seem to lose clarity.
The web app is in Java. I've searched ORMS, NoSQL, ElasticSearch and related tools and databases. Any suggestions?
OK, it seems like the actual storage isn't the problem; rather, you want to build a suggestion system based on the likes/dislikes.
The point is that you can store things however you want, even in SQL; most SQL RDBMSs will be good enough as your data store, but you can of course also use anything else you want. The point is that no SQL solution (that I know of) will give you good results with this on its own. The thing you are looking for is a suggestion system based on artificial intelligence, and the best one for distributed systems, with many algorithms already implemented, is Apache Mahout.
According to what I’ve learned about it so far, it can do what you need basically out of the box. I know that it’s based on Hadoop and Yarn but I’m not sure if you can import data from anywhere you want, or need to have it in HDFS.
The other option would be to implement a machine learning algorithm on your own, which would run on only one machine, but you just won't get the results you want with a simple query in any SQL system.
The reason you need machine learning algorithms, and why a query with some numbers won't be enough in most cases, is the diversity of the users you are facing. What if you have a user B who liked/disliked everything he has in common with user A in the same way - but the overlap is only 15%? On the other hand you have a user C who is pretty similar to A (while not at 100%, the directions are pretty much the same) and C has rated over 90% of the things which A also rated. In this scenario C is much closer to A than B would be, even though B agrees 100% on what he has rated. There are many other scenarios where simple percentages won't be enough, and that's why many companies with suggestion systems (Amazon, Netflix, Spotify, ...) use Apache Mahout and similar systems to get those done.
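For a sense of how little code the ready-made route needs, here is a rough sketch using Mahout's non-distributed Taste API (the CSV path and numbers are made up, and the exact classes may differ between Mahout versions):

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    // ratings.csv: userId,itemId,preference (e.g. -10..10 for dislike..like)
    DataModel model = new FileDataModel(new File("ratings.csv"));
    PearsonCorrelationSimilarity similarity = new PearsonCorrelationSimilarity(model);
    NearestNUserNeighborhood neighborhood = new NearestNUserNeighborhood(25, similarity, model);
    GenericUserBasedRecommender recommender =
            new GenericUserBasedRecommender(model, neighborhood, similarity);
    List<RecommendedItem> suggestions = recommender.recommend(42L, 5);   // 5 suggestions for user 42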

Data structure with two way O(1) lookup. Hashtable?

I'm implementing a system where I have a list of names and each person has 1 phone number. I need to be able to take a name and look up the phone number, or take a phone number and look up a name.
I know I could do this by having two hashtables - one which goes from names to phone numbers and one which goes from phone numbers to names. Then I can look up in either direction in O(1) time. However this seems like I am storing too much data - every name and every phone number is stored twice.
Is there any way to do this more efficiently? What data structure should I use to store the names and phone numbers?
I am coding in Java if that is relevant.
Many thanks!
Java does not provide a two-way hash table out-of-the-box. Your solution that relies on two hash tables is as good as it gets, unless you are willing to go with third-party libraries (which would hide the two hash tables for you) or re-implement a significant portion of HashMap<K,V>.
Then I can look up in either direction in O(1) time. However this seems like I am storing too much data - every name and every phone number is stored twice.
Not necessarily: you can use the same object that represents the phone number, in which case there would be a single object for the phone number, with two references to it stored from two hash tables.
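In other words, something like this, where PhoneNumber is a hypothetical value class with proper equals()/hashCode():

    import java.util.HashMap;
    import java.util.Map;

    Map<String, PhoneNumber> numberByName = new HashMap<>();
    Map<PhoneNumber, String> nameByNumber = new HashMap<>();

    PhoneNumber number = new PhoneNumber("+1-555-0100");   // one object...
    numberByName.put("Alice", number);                     // ...referenced from both maps,
    nameByNumber.put(number, "Alice");                     // so the number itself is stored only once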
Consider using Guava's HashBiMap, which is basically two HashMaps linked together behind the scenes. See also the BiMap interface and its related article.
Remember that the object itself is stored only once, and not in both maps. You only need double the number of references - so it might not be that bad.
You can use Guava's BiMap, which offers exactly that functionality (HashBiMap is its hash-based implementation).
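For reference, a short sketch of the Guava BiMap in use (two-way O(1) lookup via inverse()):

    import com.google.common.collect.BiMap;
    import com.google.common.collect.HashBiMap;

    BiMap<String, String> numberByName = HashBiMap.create();
    numberByName.put("Alice", "+1-555-0100");

    String number = numberByName.get("Alice");                // name -> number
    String name = numberByName.inverse().get("+1-555-0100");  // number -> name, also O(1)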

Memory Friendly Fast Key-Value Access Solution for Android

I have an Android application that iterates through an array of thousands of integers and it uses them as key values to access pairs of integers (let us call them id's) in order to make calculations with them. It needs to do it as fast as possible and in the end, it returns a result which is crucial to the application.
I tried loading a HashMap into memory for fast access to those numbers, but it resulted in an OOM exception. I also tried writing those ids to a RandomAccessFile and storing their offsets in the file in another HashMap, but it was way too slow. Also, the new HashMap that only stores the offsets still occupies a lot of memory.
Now I am considering SQLite but I am not sure if it will be any faster. Are there any structures or libraries that could help me with that?
EDIT: The number of keys is more than 20 million, whereas I only need to access thousands of them. I do not know which ones I will access beforehand, because it changes with user input.
You could use Trove's TIntLongHashMap to map primitive ints to primitive longs (which store the ints of your value pair). This saves you the object overhead of a plain vanilla Map, which forces you to use wrapper types.
EDIT
Since your update states you have more than 20 million mappings, there will likely be more space-efficient structures than a hash map. An approach that partitions your keys into buckets, combined with some sub-key compression, will likely save you half the memory compared to even the most efficient hash map implementation.
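Coming back to the original suggestion, a minimal sketch of packing the id pair into one primitive long and storing it in Trove's TIntLongHashMap (the package layout shown is the Trove 3.x one):

    import gnu.trove.map.hash.TIntLongHashMap;

    class IdPairStore {
        private final TIntLongHashMap pairsByKey = new TIntLongHashMap();

        void put(int key, int firstId, int secondId) {
            // pack both 32-bit ids into a single 64-bit primitive value
            pairsByKey.put(key, ((long) firstId << 32) | (secondId & 0xFFFFFFFFL));
        }

        int firstId(int key)  { return (int) (pairsByKey.get(key) >>> 32); }
        int secondId(int key) { return (int) pairsByKey.get(key); }
    }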
SQLite is an embedded relational database, which uses indexes. I would bet it is much faster than using RandomAccessFile. You can give it a try.
My suggestion is to rearrange the keys into buckets. What I mean is: identify (more or less) the distribution of your keys, then create files that correspond to ranges of keys (the point is that every file must contain only as many integers as can fit in memory, and no more than that). Then, when you search for a key, you just read the whole corresponding file into memory and look for it there.
For example, assuming the distribution of the keys is uniform, store the 500k values corresponding to keys 0-500k in one file, the 500k values corresponding to keys 500k-1M in the next, and so on...
EDIT: if you did try this approach and it still went too slowly, I still have some tricks up my sleeve:
First make sure that your division is actually close to equal across all the buckets.
Try to make the buckets smaller, by making more buckets.
The idea behind dividing into buckets by ranges correctly is that when you search for a key, you go to the corresponding range bucket, and the key is either in it or it is not in the collection at all, so there is no point in concurrently reading another bucket.
I have never done this, because I'm not sure concurrency helps with I/O, but it may be helpful to read the whole file with 2 threads, one from top to bottom and the other from bottom to top, until they meet (or something like that).
While you read the whole bucket into memory, split it into 3-4 ArrayLists and run 3-4 worker threads to search for your key in each of the arrays; the search should finish much faster that way.
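A rough sketch of the range-bucket lookup described at the start of this answer (the file names, record layout and bucket size are all made up; each record here is a 4-byte key followed by an 8-byte packed value):

    import java.io.*;
    import java.util.NoSuchElementException;

    class BucketLookup {
        static final int BUCKET_SIZE = 500_000;    // keys per bucket file

        static long lookup(int key, File bucketDir) throws IOException {
            File bucket = new File(bucketDir, "bucket_" + (key / BUCKET_SIZE) + ".dat");
            long records = bucket.length() / 12;   // 4-byte key + 8-byte value per record
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(bucket)))) {
                for (long i = 0; i < records; i++) {
                    int candidate = in.readInt();
                    long value = in.readLong();
                    if (candidate == key) {
                        return value;
                    }
                }
            }
            throw new NoSuchElementException("key " + key + " not in " + bucket.getName());
        }
    }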
