A little confused about access count - java

My responsibility in the project is an access-count module.
If a user logs in repeatedly within two hours, it should be treated as a single access.
I use a ConcurrentHashMap to store the user id and the last access time.
private static Map<String,Date> loginTimeMap = new ConcurrentHashMap<String, Date>();
Every time the user accesses the index page, the program compares the times.
Date date = loginTimeMap.get(user.getSuUserId());
if (date == null || DateUtil.getHourInterval(new Date(), date) >= DefinedValue.LIMIT_TIME) {
    accessCount = accessCount + 1;
    loginTimeMap.put(user.getSuUserId(), new Date());
}
In the code, LIMIT_TIME is a constant that represents two hours.
Will the loginTimeMap slow down the server if the size of the map exceeds 10000?
Sorry for my poor English!

Will the loginTimeMap slow down the server if the size of the map exceeds 10000?
A HashMap lookup has a time complexity of O(1): it hashes the key and goes straight to the bucket holding the value rather than searching through all entries. That means its performance is not proportional to the number of elements in the map, although with 10,000 entries it might be somewhat memory heavy!
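If memory or thread safety is the worry, here is a minimal sketch (assuming Java 8; the class and method names are invented, and it stores epoch milliseconds instead of Date) of doing the check-and-update atomically per user and evicting stale entries so the map stays bounded:
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class AccessCounter {
    private static final long LIMIT_MILLIS = TimeUnit.HOURS.toMillis(2); // two-hour window
    private static final Map<String, Long> loginTimeMap = new ConcurrentHashMap<>();
    private static final AtomicLong accessCount = new AtomicLong();

    // Counts an access only if the user has not been seen within the last two hours.
    // compute() runs atomically per key, so two concurrent requests for the same
    // user cannot both increment the counter.
    public static void recordAccess(String userId) {
        long now = System.currentTimeMillis();
        loginTimeMap.compute(userId, (id, last) -> {
            if (last == null || now - last >= LIMIT_MILLIS) {
                accessCount.incrementAndGet();
                return now;      // start a new two-hour window
            }
            return last;         // still inside the window: keep the old timestamp
        });
    }

    // Optional: call from a scheduled task so the map does not grow without bound.
    public static void evictStale() {
        long cutoff = System.currentTimeMillis() - LIMIT_MILLIS;
        loginTimeMap.values().removeIf(last -> last < cutoff);
    }

    public static long getAccessCount() {
        return accessCount.get();
    }
}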

HashMap vs array performance difference in following approach

To solve a dynamic programming problem I used two approaches to store table entries: one using a multidimensional array, e.g. tb[m][n][p][q], and the other using a HashMap, with the indexes from the first approach joined into a string key such as "m,n,p,q". But on one input the first approach completes in 2 minutes while the other takes more than 3 minutes.
If the access times of both the HashMap and the array are asymptotically equal, then why is there such a big difference in performance?
As mentioned here:
HashMap uses an array underneath so it can never be faster than using an array correctly.
You are right that array and HashMap access times are both O(1), but that only says the cost is independent of the input size or the current size of your collection. It doesn't say anything about the actual work that has to be done for each access.
To access an entry of an array you only have to calculate its memory address: the array's base address + (index * size of an element).
To access an entry of a HashMap, you first have to hash the given key (which costs many CPU cycles), then use the hash to index into the HashMap's internal array, whose slot holds a list of entries (depending on the implementation details of the HashMap), and finally you have to linearly search that list for the correct entry (those lists are very short most of the time, so this is still treated as O(1)).
So, loosely speaking, it is more like O(10) for arrays and O(5000) for hash maps. Or, more precisely, T(array access) for arrays versus T(hashing) + T(array access) + T(linear search) for HashMaps, where T(X) is the actual time of action X.
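As a rough illustration of that constant-factor difference (the class name and the memo layout are invented, not from the question):
import java.util.Map;

public class MemoComparison {
    // Approach 1: a direct array lookup is a single address computation.
    static int lookupArray(int[][][][] tb, int m, int n, int p, int q) {
        return tb[m][n][p][q];
    }

    // Approach 2: the HashMap lookup first builds a new String, hashes it
    // character by character, then probes a bucket and compares keys.
    // That extra per-call work is the constant factor hidden inside "O(1)".
    static Integer lookupMap(Map<String, Integer> memo, int m, int n, int p, int q) {
        String key = m + "," + n + "," + p + "," + q;   // allocation + concatenation
        return memo.get(key);                           // hashCode() over the whole key
    }
}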

Querying for today, within a week, and within a month

I'm new. I'm writing an app for a laser tag place where we've got kids of many ages coming to shoot beams at each other. We're making a highscore screen that'll display the best scores of the day, of the week, and of the month. The idea is that people will feel proud being on the list, and there'll also be prizes once a month.
I'm getting stuck at the whole filtering by date thing.
I basically modified the classic guestbook example to the point where I can add scores and customer info, and sort them by score.
Key guestbookKey = KeyFactory.createKey("Guestbook", guestbookName);
String fornavn = req.getParameter("fornavn");
Integer score = Integer.parseInt(req.getParameter("score"));
String email = req.getParameter("email");
String tlf = req.getParameter("tlf");
Date date = new Date();
Entity highscore = new Entity("Highscore", guestbookKey);
highscore.setProperty("date", date);
highscore.setProperty("fornavn", fornavn);
highscore.setProperty("score", score);
highscore.setProperty("email", email);
highscore.setProperty("tlf", tlf);
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
datastore.put(highscore);
And in the jsp there's a query that grabs the overall top 5.
Query query = new Query("Highscore", highscoreKey).addSort("score", Query.SortDirection.DESCENDING);
List<Entity> greetings = datastore.prepare(query).asList(FetchOptions.Builder.withLimit(5));
And there's a form that sends the user input to the .java. Any tips on how I should set up the dates? Saving a week # and a month # and querying based on those seems cumbersome.
From what I can tell, your "HighScore" kind is actually a "Score" kind that keeps track of all scores.
Instead of querying for the high score for the week/month, you're probably better off having a single HighScore entity (that's separate from normal "Score" entities) that you update whenever you enter a score. Every time a new score is entered, check if the high score should be updated.
You never need a fancy query, you just need to fetch the high score entity.
Or you might want a separate high score entity for each month/week etc so you can keep track of the history. In this case you may want to encode week or month into the entity key, so you can get the current week/month's HighScore easily.
There are two possible approaches for a requirement like yours, where you want to show high scores for a day, week, month, etc.:
1. The first option is to use your current model, where you store the date and the score. Since App Engine allows an inequality filter on only one property, you need to apply an inequality filter on the date and then find the n highest scores. But since the result is sorted first by the property with the inequality filter and only then by any additional property, you cannot fetch just the first n entries to find the top n, because the top scores need not be contiguous. See this post to understand this better. So you will have to fetch all the scores for the date range and then sort the query result further on your client to find the top n. This approach is fine if the total number of scores for a week or a month is not too large compared to the value of n. Otherwise it is not a scalable option.
2. The second approach is to redesign your model so that sorting happens on scores, meaning that to get the top n scores for a particular period you only need to fetch the first n entries. This makes the approach suitable even when the number of scores is very large. It requires converting your date into something suitable for equality filtering, for example storing a month number, a week number and a calendar year with each entry. Then, if you want the top n scores in the 3rd month, you can query for month=3, sort by score descending and fetch the first n matching entries (see the sketch below). Similarly, you can query for a particular week using the week number.
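As a rough sketch of that second approach, using the same low-level datastore API as the question (the kind and property names are illustrative assumptions):
import com.google.appengine.api.datastore.*;
import java.util.Calendar;
import java.util.List;

public class MonthlyHighScores {
    // Store the month/week/year alongside the score so they can be used in
    // equality filters later.
    public static void saveScore(DatastoreService datastore, long score) {
        Calendar cal = Calendar.getInstance();
        Entity e = new Entity("Score");
        e.setProperty("score", score);
        e.setProperty("month", cal.get(Calendar.MONTH) + 1);   // 1-12
        e.setProperty("week", cal.get(Calendar.WEEK_OF_YEAR));
        e.setProperty("year", cal.get(Calendar.YEAR));
        datastore.put(e);
    }

    // Top n of a given month: equality filter on "month", sort on "score".
    // Only the first n entities are fetched, so this stays cheap as data grows.
    public static List<Entity> topScoresForMonth(DatastoreService datastore, int month, int n) {
        Query q = new Query("Score")
                .setFilter(new Query.FilterPredicate("month", Query.FilterOperator.EQUAL, month))
                .addSort("score", Query.SortDirection.DESCENDING);
        return datastore.prepare(q).asList(FetchOptions.Builder.withLimit(n));
    }
}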
This is very similar to another high-score SO question; I have copied my answer to it below. Approaching this with a database query may cause you to join the ranks of folks who complain about GAE. You will be using a custom index, your query will likely be an order of magnitude slower per request than needed, and you will need to index thousands, perhaps millions of records. This costs you money, perhaps lots of it, both for data storage (indices) and for instances, due to the high latency of what will likely be a frequently called handler function. Please think differently about it. My copied answer is not specific to your setup, but it can easily be extended. I hope it prompts you to consider a lower-resource, lower-cost alternative. As always... HTH. -stevep
Previous high score answer:
You may want to consider an alternate approach. The query-based solution carries a lot of index overhead, which raises your costs, makes the handler executing this function an order of magnitude slower, and leaves moments where the eventual consistency of index updates affects maintenance of this data. If you have a busy site, you will surely not be happy with the latency and costs associated with that approach.
There are a number of alternate approaches; your expected site transactions per second will affect which you choose. Here is a very simple alternative: create an ndb entity with a TextProperty. Serialize the top score entries as strings such as score_userid and store them in the text field, joined by a unique character. When a new score comes in, use get_by_id to retrieve this record (ndb automatically handles memcaching for you) and split it into an array. Split the last (lowest) element of the array and check its score against the new score. If it is less than the new score, drop it, append the new score_userid string to the array, sort the array, join it, and put() the new TextProperty. If you want, you could set up an end-of-day cron to scan the day's scores and check whether your process was affected by the very small chance of two scores arriving at nearly the same time and one overwriting the other. HTH. -stevep
Previous SO high score answer link:
GAE datastore query with filter and sort using objectify
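The steps above are described in terms of Python ndb; as a rough, datastore-agnostic Java sketch of the same bookkeeping (the names, the separator, and the list size are illustrative assumptions), the serialized list can be rebuilt in memory and written back with a single put:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class TopScoreList {
    private static final int MAX_ENTRIES = 5;
    private static final String SEPARATOR = "|";   // unique character joining entries

    // Each entry is serialized as "score_userid", e.g. "4200_kid17". Takes the
    // current serialized list, inserts the new score, trims to the top entries,
    // and returns the new serialized list to be written back in one put.
    public static String updateTopScores(String serialized, int newScore, String userId) {
        List<String> entries = new ArrayList<>();
        if (serialized != null && !serialized.isEmpty()) {
            entries.addAll(Arrays.asList(serialized.split("\\" + SEPARATOR)));
        }
        entries.add(newScore + "_" + userId);

        // Highest score first; keep only the top MAX_ENTRIES.
        entries.sort(Comparator.comparingInt(
                (String e) -> Integer.parseInt(e.substring(0, e.indexOf('_')))).reversed());
        if (entries.size() > MAX_ENTRIES) {
            entries = entries.subList(0, MAX_ENTRIES);
        }
        return String.join(SEPARATOR, entries);
    }
}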

Java: optimize hashset for large-scale duplicate detection

I am working on a project where I am processing a lot of tweets; the goal is to remove duplicates as I process them. I have the tweet IDs, which come in as strings of the format "166471306949304320".
I have been using a HashSet<String> for this, which works fine for a while. But by the time I get to around 10 million items I am drastically bogged down and eventually get a GC error, presumably from the rehashing. I tried defining a better size/load with
tweetids = new HashSet<String>(220000,0.80F);
and that lets it get a little farther, but it is still excruciatingly slow (by around 10 million it is taking 3x as long to process). How can I optimize this? Given that I have an approximate idea of how many items should be in the set by the end (in this case, around 20-22 million), should I create a HashSet that rehashes only two or three times, or would the overhead for such a set incur too many time penalties? Would things work better if I weren't using a String, or if I defined a different hashCode function (which, for a particular instance of String, I'm not sure how to do)? This portion of the implementation code is below.
tweetids = new HashSet<String>(220000, 0.80F); // in constructor
duplicates = 0;
...
// In loop: for (each tweet)
String twid = (String) tweet_twitter_data.get("id");
// Check that we have not processed this tweet already
if (!(tweetids.add(twid))) {
    duplicates++;
    continue;
}
SOLUTION
Thanks to your recommendations, I solved it. The problem was the amount of memory required for the hash representations: first, a HashSet<String> was simply enormous and unnecessary, because storing and hashing full String objects is exorbitant at this scale. Next I tried a Trie, but it crashed at just over 1 million entries; reallocating the arrays was problematic. I used a HashSet<Long> to better effect and almost made it, but speed decayed and it finally crashed on the last leg of the processing (around 19 million). The solution came from departing from the standard library and using Trove. It finished 22 million records a few minutes faster than not checking duplicates at all. The final implementation was simple and looked like this:
import gnu.trove.set.hash.TLongHashSet;
...
TLongHashSet tweetids; // class variable
...
tweetids = new TLongHashSet(23000000, 0.80F); // in constructor
...
// inside for (each record)
String twid = (String) tweet_twitter_data.get("id");
if (!(tweetids.add(Long.parseLong(twid)))) {
    duplicates++;
    continue;
}
You may want to look beyond the Java collections framework. I've done some memory-intensive processing and you will face several problems:
The number of buckets for large hash maps and hash sets causes a lot of (memory) overhead. You can influence this by using some kind of custom hash function and a modulo of e.g. 50000.
Strings are represented using 16-bit characters in Java. You can halve that by using UTF-8 encoded byte arrays for most scripts.
HashMaps are in general quite wasteful data structures, and HashSets are basically just a thin wrapper around them.
Given that, take a look at Trove or Guava for alternatives. Also, your ids look like longs, which are 64 bits, quite a bit smaller than the string representation.
An alternative you might want to consider is using Bloom filters (Guava has a decent implementation). A Bloom filter tells you if something is definitely not in a set, and with reasonable (but less than 100%) certainty whether something is contained. That, combined with some disk-based solution (e.g. a database, MapDB, memcached, ...), should work reasonably well. You could buffer up incoming new ids, write them in batches, and use the Bloom filter to check whether you need to look in the database at all, avoiding expensive lookups most of the time (see the sketch below).
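For illustration, a minimal sketch of that Bloom-filter pre-check using Guava (the sizing numbers are guesses and the backing-store lookup is a placeholder):
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class TweetIdDeduper {
    // Sized for roughly 22 million ids with a 1% false-positive rate.
    private final BloomFilter<Long> seen =
            BloomFilter.create(Funnels.longFunnel(), 22000000, 0.01);

    // Returns true if the id has not been seen before. A "maybe seen" answer from
    // the filter still has to be confirmed against the authoritative store, because
    // Bloom filters can report false positives (but never false negatives).
    public boolean isNew(long tweetId) {
        if (!seen.mightContain(tweetId)) {
            seen.put(tweetId);
            return true;                       // definitely not seen before
        }
        boolean duplicate = existsInBackingStore(tweetId);
        if (!duplicate) {
            seen.put(tweetId);                 // false positive; record it now
        }
        return !duplicate;
    }

    private boolean existsInBackingStore(long tweetId) {
        // Assumption: some disk-based store (database, MapDB, ...) holds the full id set.
        return true;
    }
}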
If you are just checking for the existence of Strings, then I would suggest you try using a Trie (also called a prefix tree). The total space used by a Trie should be less than that of a HashSet, and it is quicker for string lookups.
The main disadvantage is that it can be slower when used from a hard disk, since it loads a tree rather than a linearly stored structure like a hash table. So make sure it can be held in RAM.
The link I gave is a good list of pros/cons of this approach.
As an aside, the Bloom filters suggested by Jilles Van Gurp make great, fast prefilters.
Simple, untried and possibly stupid suggestion: Create a Map of Sets, indexed by the first/last N characters of the tweet ID:
Map<String, Set<String>> sets = new HashMap<String, Set<String>>();
String tweetId = "166471306949304320";
sets.put(tweetId.substring(0, 5), new HashSet<String>());
sets.get(tweetId.substring(0, 5)).add(tweetId);
assert (sets.containsKey(tweetId.substring(0, 5)) && sets.get(tweetId.substring(0, 5)).contains(tweetId));
That easily lets you keep the maximum size of the hashing space(s) below a reasonable value.

How to delete duplicate/aggregate rows faster in a file using Java (no DB)

I have a 2 GB text file with 5 columns delimited by tabs.
A row is considered a duplicate only if 4 out of its 5 columns match.
Right now, I am de-duping by first loading each column into a separate List,
then iterating through the lists, deleting duplicate rows as they are encountered, and aggregating.
The problem: it is taking more than 20 hours to process one file.
I have 25 such files to process.
Can anyone please share how they would go about such de-duping?
This de-duping will be throw-away code, so I am looking for a quick and dirty solution to get the job done as soon as possible.
Here is my pseudo code (roughly):
Iterate over the rows
    i = current_row_no.
    Iterate over row no. i+1 to last_row
        if (col1 matches    // find duplicate
            && col2 matches
            && col3 matches
            && col4 matches)
        {
            col5List.set(i, get col5);    // aggregate
        }
Duplicate example
A and B are duplicates: A=(1,1,1,1,1), B=(1,1,1,1,2), C=(2,1,1,1,1), and the output would be A=(1,1,1,1,1+2), C=(2,1,1,1,1) [notice that B has been kicked out].
A HashMap will be your best bet. In a single, constant-time operation, you can both check for duplication and fetch the appropriate aggregation structure (a Set in my code). This means that you can traverse the entire file in O(n). Here's some example code:
public void aggregate() throws Exception
{
    BufferedReader bigFile = new BufferedReader(new FileReader("path/to/file.csv"));
    // Notice the parameter for initial capacity. Use something that is large enough to prevent rehashings.
    Map<String, HashSet<String>> map = new HashMap<String, HashSet<String>>(500000);
    String line;
    while ((line = bigFile.readLine()) != null)
    {
        int lastTab = line.lastIndexOf('\t');
        String firstFourColumns = line.substring(0, lastTab);
        // See if the map already contains an entry for the first 4 columns
        HashSet<String> set = map.get(firstFourColumns);
        // If set is null, then the map hasn't seen these columns before
        if (set == null)
        {
            // Make a new Set (for aggregation), and add it to the map
            set = new HashSet<String>();
            map.put(firstFourColumns, set);
        }
        // At this point we either found the set or created it ourselves
        String lastColumn = line.substring(lastTab + 1);
        set.add(lastColumn);
    }
    bigFile.close();
    // A demo that shows how to iterate over the map and set structures
    for (Map.Entry<String, HashSet<String>> entry : map.entrySet())
    {
        String firstFourColumns = entry.getKey();
        System.out.print(firstFourColumns + "=");
        HashSet<String> aggregatedLastColumns = entry.getValue();
        for (String column : aggregatedLastColumns)
        {
            System.out.print(column + ",");
        }
        System.out.println("");
    }
}
A few points:
The initialCapacity parameter for the HashMap is important. If the number of entries grows past the capacity times the load factor, the structure is re-hashed, which is very slow. The default initial capacity is 16, which would cause many rehashes for you. Pick a value that you know is greater than the number of unique combinations of the first four columns.
If ordered output in the aggregation is important, you can switch the HashSet for a TreeSet.
This implementation will use a lot of memory. If your text file is 2 GB, then you'll probably need a lot of RAM in the JVM. You can add the JVM argument -Xmx4096m to increase the maximum heap size to 4 GB. If you don't have at least 4 GB, this probably won't work for you.
This is also a parallelizable problem, so if you're desperate you could thread it. That would be a lot of effort for throw-away code, though. [Edit: This point is likely not true, as pointed out in the comments]
I would sort the whole list on the first four columns and then traverse the list knowing that all the duplicates are adjacent. This gives you O(N log N) for the sort and O(N) for the traversal, rather than the O(N^2) of your nested loops.
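For illustration, a hedged sketch of that sort-then-scan idea (it assumes, like the HashMap answer above, that the whole file fits in memory; the class and method names are invented):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class SortDedup {
    public static void dedup(String path) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);

        // Sort by the first four tab-separated columns so duplicates become adjacent.
        lines.sort((a, b) -> keyOf(a).compareTo(keyOf(b)));

        String currentKey = null;
        StringBuilder aggregated = null;
        for (String line : lines) {
            String key = keyOf(line);
            String col5 = line.substring(line.lastIndexOf('\t') + 1);
            if (key.equals(currentKey)) {
                aggregated.append('+').append(col5);      // duplicate: aggregate column 5
            } else {
                if (currentKey != null) {
                    System.out.println(currentKey + '\t' + aggregated);
                }
                currentKey = key;
                aggregated = new StringBuilder(col5);
            }
        }
        if (currentKey != null) {
            System.out.println(currentKey + '\t' + aggregated);
        }
    }

    // First four columns = everything before the last tab.
    private static String keyOf(String line) {
        return line.substring(0, line.lastIndexOf('\t'));
    }
}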
I would use a HashSet of the records. This can lead to O(n) timing instead of O(n^2). You can create a class that holds each of the fields, with one instance per row.
You need to have a decent amount of memory, but 16 to 32 GB is pretty cheap these days.
I would do something similar to Eric's solution, but instead of storing the actual strings in the HashMap, I'd just store line numbers. So for a particular four-column hash, you'd store a list of line numbers that hash to that value. Then, on a second pass through the data, you can remove the duplicates at those line numbers and add the +x aggregation as needed.
This way, your memory requirements will be a LOT smaller.
The solutions already posted are nice if you have enough (free) RAM. As Java tends to "still work" even when it is heavily swapping, make sure you don't have too much swap activity if you suspect RAM could be the limiting factor.
An easy "throwaway" solution in case you really have too little RAM is partitioning the file into multiple files first, depending on the data in the first four columns (for example, if the third column's values are more or less uniformly distributed, partition by the last two digits of that column). Just go over the file once and write the records, as you read them, into 100 different files depending on the partition value. This needs a minimal amount of RAM, and you can then process the resulting files (only about 20 MB each, if the partitioning values are well distributed) with far less memory and concatenate the results again (a sketch of the partitioning pass follows this answer).
Just to be clear: If you have enough RAM (don't forget that the OS wants to have some for disk cache and background activity too), this solution will be slower (maybe even by a factor of 2, since twice the amount of data needs to be read and written), but in case you are swapping to death, it might be a lot faster :-)
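A rough sketch of that first partitioning pass (the partition count, the file names, and the choice to hash the first four columns are assumptions, not from the answer):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

public class Partitioner {
    public static void partition(String inputPath, int partitions) throws IOException {
        PrintWriter[] writers = new PrintWriter[partitions];
        for (int i = 0; i < partitions; i++) {
            writers[i] = new PrintWriter("partition-" + i + ".txt");
        }
        try (BufferedReader in = new BufferedReader(new FileReader(inputPath))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Hash only the first four columns so duplicate rows land in the same file.
                String firstFour = line.substring(0, line.lastIndexOf('\t'));
                int bucket = Math.floorMod(firstFour.hashCode(), partitions);
                writers[bucket].println(line);
            }
        } finally {
            for (PrintWriter w : writers) {
                w.close();
            }
        }
    }
}
// Each partition file can then be de-duplicated on its own with the HashMap approach.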

java- storing string values- which is most efficient- linked list, array list or hashmap

In a Java application, I have a requirement where a user defines a string value and then keeps appending further string values to the original value.
There can be multiple different named strings defined by the user.
Of HashMap, ArrayList and LinkedList, which one should I use based on the following criteria:
(1) Most memory-efficient
(2) Max possible space per single string value
Also, what is the maximum possible size of a single string value in all 3 options (HashMap / ArrayList / LinkedList)?
If the user is typing the string, you shouldn't need to worry. The maximum String length is over 2 billion characters.
The fastest typing speed ever recorded is 216 words per minute:
http://en.wikipedia.org/wiki/Words_per_minute
This means even a fast typist takes about a minute to write 1 K of letters. Writing one String of maximum length would take 1491 days, non-stop (assuming their keyboard, their computer, and the user don't die in the attempt).
It is highly unlikely you need the most efficient data structure; using the simplest and most obvious choice is a better approach (again, because users cannot type fast enough for it to ever matter).
A Kindle can store thousands of books on a device that costs less than 100 pounds. A user could write all their life and not produce enough to fill up a small, cheap mobile device.
Save your time and use StringBuilder or StringBuffer (if you need thread safety).
You will need an ArrayList<StringBuffer>.
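As one possible arrangement, not prescribed above, here is a minimal sketch (names are illustrative) that keeps one StringBuilder per user-defined name in a HashMap, since the strings are named and only ever appended to:
import java.util.HashMap;
import java.util.Map;

public class NamedStrings {
    private final Map<String, StringBuilder> values = new HashMap<>();

    // Appends to the named string, creating it on first use.
    public void append(String name, String value) {
        values.computeIfAbsent(name, k -> new StringBuilder()).append(value);
    }

    public String get(String name) {
        StringBuilder sb = values.get(name);
        return sb == null ? "" : sb.toString();
    }
}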
If you are creating a text editor where the user can jump to anywhere in the string and start changing it, a gap buffer is a fairly good data structure: http://en.wikipedia.org/wiki/Gap_buffer
