Querying for today, within a week, and within a month - Java

I'm new. I'm writing an app for a laser tag place where we've got kids of many ages coming to shoot beams at each other. We're making a highscore screen that'll display the best scores of the day, of the week, and of the month. The idea is that people will feel proud being on the list, and there'll also be prizes once a month.
I'm getting stuck at the whole filtering by date thing.
I basically modified the classic guestbook example to the point where I can add scores and customer info, and sort them by score.
Key highscoreKey = KeyFactory.createKey("Highscore", guestbookName);
String fornavn = req.getParameter("fornavn"); // first name
Integer score = Integer.parseInt(req.getParameter("score"));
String email = req.getParameter("email");
String tlf = req.getParameter("tlf"); // phone number
Date date = new Date();
Entity highscore = new Entity("Highscore", highscoreKey); // kind must match the query below
highscore.setProperty("date", date);
highscore.setProperty("fornavn", fornavn);
highscore.setProperty("score", score);
highscore.setProperty("email", email);
highscore.setProperty("tlf", tlf);
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
datastore.put(highscore);
And in the JSP there's a query that grabs the overall top 5.
Query query = new Query("Highscore", highscoreKey).addSort("score", Query.SortDirection.DESCENDING);
List<Entity> greetings = datastore.prepare(query).asList(FetchOptions.Builder.withLimit(5));
And there's a form that sends the user input to the servlet. Any tips on how I should set up the dates? Should I save a week number and a month number and query on those? That seems cumbersome.

From what I can tell, your "HighScore" kind is actually a "Score" kind that keeps track of all scores.
Instead of querying for the high score for the week/month, you're probably better off having a single HighScore entity (that's separate from normal "Score" entities) that you update whenever you enter a score. Every time a new score is entered, check if the high score should be updated.
You never need a fancy query, you just need to fetch the high score entity.
Or you might want a separate high score entity for each month/week etc so you can keep track of the history. In this case you may want to encode week or month into the entity key, so you can get the current week/month's HighScore easily.
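For example, a minimal sketch with the low-level datastore API (the "2014-W23"-style key name is just one illustrative convention, not something from the question):
// Sketch: one HighScore entity per week, addressed by a well-known key,
// so reading or updating it is a cheap get/put rather than a query.
DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
Key weekKey = KeyFactory.createKey("HighScore", "2014-W23");
try {
    Entity weekHigh = ds.get(weekKey);
    long best = (Long) weekHigh.getProperty("score"); // the datastore stores integers as Long
    if (score > best) {
        weekHigh.setProperty("score", score);
        weekHigh.setProperty("fornavn", fornavn);
        ds.put(weekHigh);
    }
} catch (EntityNotFoundException e) {
    Entity weekHigh = new Entity(weekKey); // first score this week
    weekHigh.setProperty("score", score);
    weekHigh.setProperty("fornavn", fornavn);
    ds.put(weekHigh);
}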

There are two possible approaches for a requirement like yours, where you want to show high scores for a day, week, month, etc.:
1. The first option is to keep your current model, where you store a date and a score. Since App Engine allows an inequality filter on only one property, you would apply the inequality filter to the date and then look for the n highest scores. But because results are sorted first by the property with the inequality filter and only then by any additional sort property, you cannot fetch just the first n entries to get the top n; the top scores need not be contiguous in that ordering. See this post to understand this better. So you will have to fetch all the scores in the date range and then sort the query result on the client to find the top n. This approach is fine if the total number of scores for a week or a month is not too large compared to n. If it is, this is not a scalable option.
2. The second approach is to redesign your model so that the sort happens on scores, meaning that to get the top n scores for a particular period you need to fetch only the first n entries. This remains workable even when the number of scores is very large. It requires converting your date into properties suitable for equality filtering: for each entry, store a month number, a week number, and the calendar year. Then, for example, to find the top n scores in the 3rd month, query for month = 3, sort by score descending, and fetch the first n matching entries. Similarly, you can query for a particular week using the week number.
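As a sketch of the second approach (property names are assumptions, and an equality filter combined with a descending sort on another property will need a composite index):
// Sketch: store month/week/year as extra properties, then filter by
// equality and sort by score so only the first n entities are fetched.
Query query = new Query("Highscore")
        .setFilter(new Query.FilterPredicate("month", Query.FilterOperator.EQUAL, 3))
        .addSort("score", Query.SortDirection.DESCENDING);
List<Entity> topN = datastore.prepare(query).asList(FetchOptions.Builder.withLimit(5));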

This is very similar to another high-score SO question; I have copied/pasted my answer to it below. Approaching this with a datastore query may cause you to join the ranks of folks who complain about GAE. You will need a custom index, your query will likely run many milliseconds slower than necessary per request, and you will need to index thousands, perhaps millions, of records. This costs you money, perhaps lots of it, both for data storage (indices) and for instances, given the high latency of what will likely be a frequently called handler. Think different, please. My copy/paste is not specific to your setup, but it can be extended easily. I hope it prompts you to think about a lower-resource, lower-cost alternative. As always... HTH. -stevep
Previous high score answer:
You may want to consider an alternate approach. This is a lot of index overhead, which will raise your costs, make the handler executing this function an order of magnitude slower, and expose you to moments where the eventual consistency of index updates affects the maintenance of this data. If you have a busy site, you will surely not be happy with the latency and costs of this approach.
There are a number of alternate approaches; your expected site transactions per second would affect which you choose. Here is a very simple alternative. Create an ndb entity with a TextProperty. Serialize the top score entries as strings such as score_userid, and store them in the text field joined by a unique character. When a new score comes in, use get_by_id to retrieve this record (ndb automatically handles memcaching for you) and split it into an array. Split the last element of the array and check it against the new score. If it is less than the new score, drop it and append the new score_userid string to the array. Sort the array, join it, and put() the new TextProperty. If you want, you could set up an end-of-day cron to scan the day's scores and check whether this process was affected by the very small chance that two scores arrived at nearly the same time, causing one to overwrite the other. HTH. -stevep
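The original answer is written in Python/ndb terms; a rough Java analog using the low-level datastore API might look like this (the kind, key name, newScore/userId variables, and the score_userid format are illustrative only):
// Sketch: keep the whole top-5 list in one Text property of one entity.
DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
Key key = KeyFactory.createKey("TopScores", "current");
Entity entity;
try {
    entity = ds.get(key); // one cheap get, no query
} catch (EntityNotFoundException e) {
    entity = new Entity(key);
    entity.setProperty("scores", new Text(""));
}
String raw = ((Text) entity.getProperty("scores")).getValue();
List<String> entries = new ArrayList<>();
if (!raw.isEmpty()) entries.addAll(Arrays.asList(raw.split("\\|")));
entries.add(newScore + "_" + userId); // newScore and userId are assumed inputs
entries.sort(Comparator.comparingInt((String s) -> Integer.parseInt(s.split("_")[0])).reversed());
if (entries.size() > 5) entries = entries.subList(0, 5); // keep only the top 5
entity.setProperty("scores", new Text(String.join("|", entries)));
ds.put(entity);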
Previous SO high score answer link:
GAE datastore query with filter and sort using objectify

Related

Elasticsearch - Further filter out results based on match score using Java [duplicate]

We currently limit the search to the top 50 documents, but the score of each document can vary enormously, e.g. the score of the 1st object is 5 while the score of the 50th is 0.0001. In such cases we need to filter out the objects with very low scores.
What kind of statistical distribution/formula should we use when querying Elasticsearch? I am thinking of standard deviation, but I'm not so sure.
From what I understand, you want a minimum score to reduce the results. Did you read about the min_score parameter?
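For example, with the Elasticsearch high-level REST client for Java (the index name, field, and threshold are all assumptions to illustrate min_score):
// Sketch: drop hits whose relevance score falls below a fixed threshold.
SearchSourceBuilder source = new SearchSourceBuilder()
        .query(QueryBuilders.matchQuery("text", "search terms"))
        .minScore(0.5f) // tune per query; scores are not comparable across queries
        .size(50);
SearchRequest request = new SearchRequest("my-index").source(source);
Note that relevance scores are query-dependent, so a statistical cutoff (e.g. relative to the top hit's score) would have to be applied client-side after the response arrives.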

PlanningEntities split across multiple collections

I am writing a staff scheduler for my office. There are at least two shifts per day (7 days a week) and I want the optimizer to make sure no one staff member works drastically more weekend shifts than another.
I have a simple working program that assigns one Staff to each Shift.
The program structure is as follows:
SchedulingSolution is the @PlanningSolution
SchedulingSolution contains a List<Shift>, which is the @PlanningEntityCollectionProperty
Shift is the @PlanningEntity
Shift contains a Staff, which is the @PlanningVariable
SchedulingSolution contains a @ValueRangeProvider which returns our staff roster as a List<Staff>.
This working solution schedules all staff equally but does not consider weekends and weekdays. To delineate weekdays from weekends, I have replaced the List<Shift> in SchedulingSolution with a List<Day>. Every Day contains its own List<Shift> representing the shifts that occur on that day. Now when I want to compute the number of weekends a staff member has worked, I can find all Day objects that represent weekends and count only the Shifts contained in those days.
Unfortunately, this places the List<Shift>, which is the @PlanningEntityCollectionProperty, in a class that is not a @PlanningSolution. The @PlanningEntityCollectionProperty annotation is now ignored and the program fails.
Have I missed an obvious way to restructure my program, or is my only option to keep my original program structure and modify my shift objects to record which day they occur on?
Thanks in advance. I'll try to pass the help forward if I can.
I would restructure your planning solution. Why don't you do the following:
Keep the List<Shift> in the planning solution
Discard the List<Day> wrapper (and its per-day shift lists)
Add a day field to Shift, possibly even an indicator of whether that day is a weekend, using an enum, as sketched below
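A minimal sketch of that Shift (the annotation usage is standard OptaPlanner; the field and value-range names are assumptions):
// Sketch: Shift carries a fixed Day indicator; only staff is planned.
public enum Day { WEEKDAY, WEEKEND }

@PlanningEntity
public class Shift {
    private Day day;     // a problem fact, set when the schedule is built
    private Staff staff; // filled in by the solver

    @PlanningVariable(valueRangeProviderRefs = {"staffRange"})
    public Staff getStaff() { return staff; }
    public void setStaff(Staff staff) { this.staff = staff; }

    public Day getDay() { return day; }
    public void setDay(Day day) { this.day = day; }
}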
You can then easily calculate the number of weekend shifts worked by each staff member with Drools, like so:
(Absolutely untested!!!)
rule "balancedStaffWeekend"
when
$staff: Staff()
accumulate(
Shift(
staff == $staff
day == Day.WEEKEND);
$count: count()
);
)
then
scoreHolder.addSoftConstraintMatch(kcontext, -Math.pow(count, 2));
end
Penalising the solution by the count squared balances the number of weekend shifts across staff.

Better algorithmic approach to showing trends of data per week

Suppose I have a list of projects with a start date and an end date. I also have a range of weeks, which varies (it could span months, years, etc.).
I would like to display a graph showing 4 values per week:
projects started
projects closed
total projects started
total projects closed
I could loop over the range of weeks and, for each week, iterate through my list of projects and calculate the values of these 4 trends for that week. This would have algorithmic complexity O(nm), where n is the number of weeks and m is the number of projects. That's not so great.
Is there a more efficient approach, and if so, what would it be?
If it's pertinent, I'm coding in Java
While what user yurib has said is true, there is a more efficient solution. Keep two arrays in memory, projects_started and projects_ended, both of size 52. Loop through your list of projects and, for each project, increment the corresponding value in both arrays. Something like:
projects_started[projects[i].start_week]++;
projects_ended[projects[i].end_week]++;
After the loop you have all the data you need to make a graph. Complexity is O(m).
EDIT: okay, so the maximum number of weeks can vary, but as long as it is smaller than some ludicrous number (say, more than a million), this algorithm still works. Just replace 52 with n. Time complexity is O(m), space complexity is O(n).
EDIT: in order to determine the value of total projects started and ended you have to iterate through the two arrays that you now have and just add up the values. You could do this while populating the graph:
for (int i = 0; i < n; i++)
{
    total_started += projects_started[i]; // running total of projects started
    total_ended += projects_ended[i];     // running total of projects ended
    // add new item to the graph
}
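Put together, a runnable sketch (Project and its week accessors are assumed to exist):
// Consolidated sketch: one pass to bucket the counts, one pass for running totals.
int[] projects_started = new int[n]; // n = number of weeks in the range
int[] projects_ended = new int[n];
for (Project p : projects) { // O(m)
    projects_started[p.getStartWeek()]++;
    projects_ended[p.getEndWeek()]++;
}
int total_started = 0, total_ended = 0;
for (int i = 0; i < n; i++) { // O(n)
    total_started += projects_started[i];
    total_ended += projects_ended[i];
    // graph values for week i: projects_started[i], projects_ended[i],
    // total_started, total_ended
}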
I'm not sure what the difference between "project" and "total" is, but here's a simple O(n log n) way to calculate the number of projects started and closed in each week:
For each project, add its start and end points to a list.
Sort the list in increasing order.
Walk through the list, pulling out time points until you hit a time point that occurs in a later week. At this point, "projects started" is the total number of start points you have hit, and "projects ended" is the total number of end points you have hit: report these counters, and reset them both to zero. Then continue on to process the next week.
Incidentally, if there are weeks in which no projects start or end, this procedure will skip them. If you want to report these weeks as "0, 0" totals, then whenever you output a week with a nonzero total, first output as many "0, 0" weeks as it takes to fill the gap since the last nonzero-total week. (This is easy to do by setting a lastNonzeroWeek variable each time you output a nonzero-total week.)
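A sketch of that sweep (the Project accessors and the report() output hook are assumptions):
// Sketch: sort (week, isEnd) time points, then sweep and flush per week.
List<int[]> points = new ArrayList<>(); // {week, 0 = start / 1 = end}
for (Project p : projects) {
    points.add(new int[] { p.getStartWeek(), 0 });
    points.add(new int[] { p.getEndWeek(), 1 });
}
points.sort(Comparator.comparingInt(a -> a[0])); // O(n log n)
int week = points.isEmpty() ? 0 : points.get(0)[0];
int started = 0, ended = 0;
for (int[] pt : points) {
    if (pt[0] != week) {              // a later week begins: flush counters
        report(week, started, ended); // hypothetical output hook
        started = 0;
        ended = 0;
        week = pt[0];
    }
    if (pt[1] == 0) started++; else ended++;
}
if (!points.isEmpty()) report(week, started, ended); // flush the last week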
First of all, I suspect performance won't actually be an issue here; this looks like a case of "premature optimization". First do it, then do it right, then do it fast.
I suggest you use maps, which will make your code more readable and hide implementation details (like performance).
Create a HashMap from int (the week number) to Set<Project>, then iterate over your projects and put each one into the map at the right place. After that, iterate over the map's key set (= all non-empty weeks) and do your processing for each one.
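A sketch of that map-based bucketing (Project and its accessor are assumptions):
// Sketch: bucket projects by start week; the key set is exactly the
// non-empty weeks.
Map<Integer, Set<Project>> projectsByWeek = new HashMap<>();
for (Project p : projects) {
    projectsByWeek.computeIfAbsent(p.getStartWeek(), w -> new HashSet<>()).add(p);
}
for (Map.Entry<Integer, Set<Project>> e : projectsByWeek.entrySet()) {
    // process week e.getKey() with its projects e.getValue()
}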

A little confused about access count

My current responsibility in the project is an access-count module.
If a user logs in repeatedly within two hours, it should be treated as a single access.
I use a ConcurrentHashMap to store the user id and access time.
private static Map<String,Date> loginTimeMap = new ConcurrentHashMap<String, Date>();
Every time the user accesses the index page, the program compares the times.
Date date = loginTimeMap.get(user.getSuUserId());
if (date == null || DateUtil.getHourInterval(new Date(), date) >= DefinedValue.LIMIT_TIME) {
    accessCount = accessCount + 1;
    loginTimeMap.put(user.getSuUserId(), new Date());
}
In the code, LIMIT_TIME is a constant representing two hours.
Will the loginTimeMap slow the server if the size of the map exceeds 10,000?
Really sorry for my poor English!
Will the loginTimeMap slow the server if the size of the map exceeds 10,000?
A HashMap lookup has average time complexity O(1): it hashes the key and goes straight to the value instead of searching for it. That means its performance is not proportional to the number of elements in the map, although with 10,000 entries it might be somewhat memory-heavy!

Large Array Comparison

I have ~25,000 distinct names in an SQL database, and I would like to perform edit-distance comparisons on all of them in order to normalize e.g. John Doe & Jhon Doe.
When the db was only around 1000 names I used to store all distinct names in an array. Then I would use two for-loops on that array, thereby comparing each element in the array to each of the others. When the edit-distance gave a match of say >0.9 I would execute an SQL-query substituting one value for the other in all records.
With my much larger database this is not possible anymore. What would you guys do?
ps: I'm also curious about any multithreaded solutions to this because the process is taking ages now.
pps: I'm coding in Java
What about computing the soundex of each of your names and possibly storing it in the database? You can even do that on DB side, for instance there's a MySQL SOUNDEX function.
After computing the soundex of each name, all you have to do is group the rows by identical soundex.
EDIT:
If soundex is too coarse for your application, you can first select candidates by comparing their soundexes, and use your usual metric on each set of candidates.
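For example, with Apache Commons Codec's Soundex implementation (the library choice is an assumption; MySQL's SOUNDEX function gives you the same grouping on the DB side):
// Sketch: group names by Soundex code, then run the expensive edit-distance
// comparison only within each (much smaller) bucket.
Soundex soundex = new Soundex();
Map<String, List<String>> buckets = new HashMap<>();
for (String name : names) {
    buckets.computeIfAbsent(soundex.encode(name), c -> new ArrayList<>()).add(name);
}
// Pairwise edit-distance comparisons now happen only inside each bucket.
Note that this encoder maps only the letters A-Z, so names containing accented or other non-ASCII letters may need normalising first.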
There is no way around pairwise matching; that is as efficient as it gets.
If you need your record linkage to be faster, try using a string distance metric that requires less computation than the edit distance (Bonacci distance, Jaro–Winkler distance, etc.).
You could also use another metric as a preprocessing step, and then compute edit distance to confirm or deny the match.
