Efficient way to implement 'events since x' in Java - java

I want to be able to ask an object 'how many events have occurred in the last x seconds', where x is an argument.
e.g. how many events have occurred in the last 120 seconds?
My approach is linear in the number of recorded events, but I wanted to see what the most efficient way (in space and time) to meet this requirement would be:
public class TimeSinceStat {
    private final List<DateTime> eventTimes = new ArrayList<>();
    public void apply() {
        eventTimes.add(DateTime.now());
    }
    public int eventsSince(int seconds) {
        DateTime startTime = DateTime.now().minus(Seconds.seconds(seconds));
        for (int i = 0; i < eventTimes.size(); i++) {
            DateTime dateTime = eventTimes.get(i);
            // events are appended in time order, so everything from i onwards is in range
            if (dateTime.compareTo(startTime) > 0)
                return eventTimes.size() - i;
        }
        return 0;
    }
}
(PS - I'm using JodaTime for the date/time representation)
Edit:
The key to this algorithm is finding all events that have happened in the last x seconds; the exact start time (e.g. now - 30 seconds) may or may not be in the collection.

Store the DateTime in a TreeSet and then use tailSet to get the most recent events. This saves you from finding the starting point by iteration (which is O(n)); the tree finds it by searching instead (which is O(log n)).
TreeSet<DateTime> eventTimes;
public int eventsSince(int seconds) {
return eventTimes.tailSet(DateTime.now().minus(Seconds.seconds(seconds)), true).size();
}
Of course, you could also binary search on your sorted list, but this does the work for you.
Edit
If it's a concern that multiple events could occur at the same DateTime, you can take the exact same approach with a SortedMultiset from Guava:
TreeMultiset<DateTime> eventTimes;
public int eventsSince(int seconds) {
return eventTimes.tailMultiset(
DateTime.now().minus(Seconds.seconds(seconds)),
BoundType.CLOSED
).size();
}
Edit x2
Here's a much more efficient approach that leverages the fact that you only log events that happened after all other events. With each event, store the number of events up to that date:
NavigableMap<DateTime, Integer> eventCounts = initEventMap();
public NavigableMap<DateTime, Integer> initEventMap() {
    NavigableMap<DateTime, Integer> map = new TreeMap<>();
    // prime the map to make subsequent operations much cleaner
    map.put(DateTime.now().minus(Seconds.seconds(1)), 0);
    return map;
}
private int totalCount() {
    // you can handle the edge condition here; the priming entry keeps the map non-empty
    return eventCounts.lastEntry().getValue();
}
public void logEvent() {
    eventCounts.put(DateTime.now(), totalCount() + 1);
}
Then getting the count since a date is super efficient, just take the total and subtract the count of events that occurred before that date.
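public int eventsSince(int seconds) {
    DateTime startTime = DateTime.now().minus(Seconds.seconds(seconds));
    // lowerEntry returns null when no entry precedes startTime, so nothing needs subtracting
    Map.Entry<DateTime, Integer> before = eventCounts.lowerEntry(startTime);
    return totalCount() - (before == null ? 0 : before.getValue());
}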
This eliminates the inefficient iteration. Each query is just two lookups in the map (the last entry and the entry below the start time), and each lookup costs at most O(log n).
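The running-total idea above can be sketched in a self-contained form. This version is only an illustration: it uses plain epoch-millisecond timestamps passed in by the caller instead of Joda-Time's DateTime.now(), the class name RunningEventCounter is made up, and it assumes events are logged in non-decreasing time order.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch: a running total keyed by timestamp (epoch millis).
public class RunningEventCounter {
    // timestamp -> number of events logged up to and including that timestamp
    private final NavigableMap<Long, Integer> totals = new TreeMap<>();

    /** Records one event; timestamps are assumed to arrive in non-decreasing order. */
    public void logEvent(long timestampMillis) {
        // a repeated timestamp simply overwrites its entry with the larger running total
        totals.put(timestampMillis, totalCount() + 1);
    }

    /** Counts events whose timestamp is at or after sinceMillis. */
    public int eventsSince(long sinceMillis) {
        Map.Entry<Long, Integer> before = totals.lowerEntry(sinceMillis);
        int countBefore = (before == null) ? 0 : before.getValue();
        return totalCount() - countBefore;
    }

    private int totalCount() {
        Map.Entry<Long, Integer> last = totals.lastEntry();
        return (last == null) ? 0 : last.getValue();
    }
}
```

Handling the empty map with null checks replaces the priming entry; either choice works, this one just avoids depending on the clock at construction time.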

If you were implementing a data structure from scratch, and the data are not in sorted order, you'd want to construct a balanced order statistic tree (also see code here). This is just a regular balanced tree with the size of the tree rooted at each node maintained in the node itself.
The size fields enable efficient calculation of the "rank" of any key in the tree. You can do the desired range query by making two O(log n) probes into the tree for the rank of the min and max range value, finally taking their difference.
The proposed tree and set tail operations are great except the tail views will need time to construct, even though all you need is their size. The asymptotic complexity is the same as the OST, but the OST avoids this overhead completely. The difference could be meaningful if performance is very critical.
Of course I'd definitely use the standard library solution first and consider the OST only if the speed turned out to be inadequate.

Since DateTime already implements the Comparable interface, I would recommend storing the data in a TreeMap instead, and you could use TreeMap#tailMap to get a view of the DateTimes that fall within the desired time range.
Based on your code:
public class TimeSinceStat {
//just in case two or more events start at the "same time"
private NavigableMap<DateTime, Integer> eventTimes = new TreeMap<>();
//if this class needs to be used in multiple threads, use ConcurrentSkipListMap instead of TreeMap
public void apply() {
DateTime dateTime = DateTime.now();
// add one to the count already stored for this instant, or start at 1
eventTimes.merge(dateTime, 1, Integer::sum);
}
public int eventsSince(int seconds) {
DateTime startTime = DateTime.now().minus(Seconds.seconds(seconds));
NavigableMap<DateTime, Integer> eventsInRange = eventTimes.tailMap(startTime, true);
int counter = 0;
for (Integer time : eventsInRange.values()) {
counter += time;
}
return counter;
}
}

Assuming the list is sorted, you could do a binary search. Java Collections already provides Collections.binarySearch, and DateTime implements Comparable (according to the JodaTime JavaDoc). binarySearch returns the index of the value you want if it exists in the list; otherwise it returns (-(insertion point) - 1), where the insertion point is the index of the first element greater than the value. So, all you need to do is (in your eventsSince method):
// find the time you want.
int index = Collections.binarySearch(eventTimes, startTime);
if (index < 0) {
    index = -(index + 1); // not found: the insertion point is the first event after startTime
} else {
    // found: binarySearch may land on any duplicate, so step back to the first one
    while (index > 0 && eventTimes.get(index - 1).equals(startTime)) {
        index--;
    }
}
// everything from index onwards happened at or after startTime
return eventTimes.size() - index;
This should be a faster way to do what you want.
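Because Collections.binarySearch gives no guarantee about which of several equal elements it finds, a hand-written lower-bound search is one way to avoid the duplicate scan altogether. The following is a minimal sketch under assumptions: timestamps are stored as epoch milliseconds rather than DateTime, events are appended in ascending order, and the class name SortedEventLog is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: counts recent events with a lower-bound binary search.
public class SortedEventLog {
    private final List<Long> eventTimes = new ArrayList<>(); // ascending epoch millis

    /** Appends an event; callers are assumed to add timestamps in ascending order. */
    public void add(long timestampMillis) {
        eventTimes.add(timestampMillis);
    }

    /** Counts events whose timestamp is at or after sinceMillis. */
    public int eventsSince(long sinceMillis) {
        int lo = 0;
        int hi = eventTimes.size();
        // invariant: everything before lo is < sinceMillis, everything from hi on is >= sinceMillis
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (eventTimes.get(mid) < sinceMillis) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return eventTimes.size() - lo;
    }
}
```

The search always lands on the first element that is not smaller than the bound, so runs of identical timestamps cost nothing extra.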

Related

Data structure to check if multiple periods overlap

I have a class Period that is represented by start and end dates, where end is after start. I need to write a function to check if periods overlap.
The straightforward approach is to check every period with every other period. Is there a way to introduce a data structure that will perform faster?
class Period {
LocalDateTime start;
LocalDateTime end;
}
boolean isOverlap(Set<Period> periods) {
// TODO put the code here
}
isOverlap should return true when at least two of the periods overlap.
Checking every period against every other period will have an O(n^2) time complexity. Instead, I'd sort them by start and end times and then iterate over the list. This way, a period can only overlap the periods directly before and after it (or multiple subsequent ones before or after it - but that's inconsequential, since you're looking for a single overlap to return true). You can iterate over the list and check this. The total cost of this algorithm would be the cost of the sorting, O(n log n):
boolean isOverlap(Set<Period> periods) {
List<Period> sorted =
periods.stream()
.sorted(Comparator.comparing((Period p) -> p.start)
.thenComparing(p -> p.end))
.collect(Collectors.toList());
for (int i = 0; i < sorted.size() - 1; ++i) {
if (sorted.get(i).end.compareTo(sorted.get(i + 1).start) > 0) {
return true;
}
}
return false;
}

Find the top N most popular elements

I have a List of TrackDay objects for a runner going around a track field on different days. Each pair of start/finish times signal a single lap run by the runner. We are guaranteed that there is a matching start/finish date (in the order in which they appear in the appropriate lists) :
class TrackDay {
    List<DateTime> startTimes;
    List<DateTime> finishTimes;
}
I would like to find the top N days (let's say 3) on which the runner ran the most. This translates to finding the N longest total start/finish times per TrackDay object. The naive way would be to do the following:
for (TrackDay td : listOftrackDays) {
// loop through each start/finish lists and find out the finish-start time for each pair.
// Add the delta times (finish-start) up for each pair of start/finish objects.
// Create a map to store the time for each TrackDay
// sort the map and get the first N entries
}
Is there a better, more clean/efficient way to do the above?
The problem you're trying to solve is a well-known one, handled by selection algorithms, in particular Quickselect. While sorting generally works well, for large collections it is worth considering this approach, since it runs in linear time on average instead of N*log(N).
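A rough sketch of Quickselect applied to this problem is shown below. It is written under assumptions: the total running time of each TrackDay has already been summed into a long (for example, seconds), and the names TopDurations and topK are invented for the example. It partitions in descending order around a random pivot until the k largest totals occupy the first k slots.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch: Quickselect for the k largest totals.
public class TopDurations {
    private static final Random RANDOM = new Random();

    /** Returns the k largest values, largest first (k is capped at the array length). */
    public static long[] topK(long[] totals, int k) {
        long[] a = totals.clone();          // leave the caller's array untouched
        k = Math.min(k, a.length);
        if (k <= 0) {
            return new long[0];
        }
        int lo = 0;
        int hi = a.length - 1;
        int target = k - 1;                 // final index of the k-th largest value
        while (lo < hi) {
            int p = partition(a, lo, hi);
            if (p == target) {
                break;
            } else if (p < target) {
                lo = p + 1;                 // k-th largest lies to the right of the pivot
            } else {
                hi = p - 1;                 // k-th largest lies to the left of the pivot
            }
        }
        long[] top = Arrays.copyOf(a, k);   // the k largest, in arbitrary order
        Arrays.sort(top);                   // sorting only k values keeps this step cheap
        for (int i = 0, j = k - 1; i < j; i++, j--) {
            swap(top, i, j);                // reverse into descending order
        }
        return top;
    }

    // Lomuto partition in descending order: values greater than the pivot move left.
    private static int partition(long[] a, int lo, int hi) {
        swap(a, lo + RANDOM.nextInt(hi - lo + 1), hi);
        long pivot = a[hi];
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (a[i] > pivot) {
                swap(a, i, store++);
            }
        }
        swap(a, store, hi);
        return store;
    }

    private static void swap(long[] a, int i, int j) {
        long tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}
```

To get back to TrackDay objects rather than bare totals, the same partitioning could be run over an array of indices or of (total, TrackDay) pairs; for very small N a simple bounded scan is likely just as good.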
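This solution should be linear time in the number of TrackDay objects for a fixed limit (each day is compared against at most limit stored durations). I have assumed that startTimes and finishTimes support random access. I don't know what API your DateTime is part of, so I have used java.time.LocalDateTime.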
public List<TrackDay> findTop(List<TrackDay> trackDays, int limit) {
limit = Math.min(limit, trackDays.size());
List<Duration> durations = new ArrayList<>(Collections.nCopies(limit, Duration.ZERO));
List<TrackDay> result = new ArrayList<>(Collections.nCopies(limit, null));
int lastIndex = limit - 1;
for (TrackDay trackDay : trackDays) {
Duration duration = Duration.ZERO;
for (int i = 0, n = trackDay.startTimes.size(); i < n; i++) {
duration = duration.plus(Duration.between(trackDay.startTimes.get(i), trackDay.finishTimes.get(i)));
}
Integer destinationIndex = null;
for (int i = lastIndex; i >= 0; i--) {
if (durations.get(i).compareTo(duration) >= 0) {
break;
}
destinationIndex = i;
}
if (destinationIndex != null) {
durations.remove(lastIndex);
result.remove(lastIndex);
durations.add(destinationIndex, duration);
result.add(destinationIndex, trackDay);
}
}
return result;
}

When should a Spliterator stop splitting?

I understand that there is overhead in setting up the processing of a parallel Stream, and that processing in a single thread is faster if there are few items or the processing of each item is fast.
But, is there a similar threshold for trySplit(), a point where decomposing a problem into smaller chunks is counterproductive? I'm thinking by analogy to a merge sort switching to insertion sort for the smallest chunks.
If so, does the threshold depend on the relative cost of trySplit() and consuming an item in the course of tryAdvance()? Consider a split operation that's a lot more complicated than advancing an array index—splitting a lexically-ordered multiset permutation, for example. Is there a convention for letting clients specify the lower limit for a split when creating a parallel stream, depending on the complexity of their consumer? A heuristic the Spliterator can use to estimate the lower limit itself?
Or, alternatively, is it always safe to let the lower limit of a Spliterator be 1, and let the work-stealing algorithm take care of choosing whether to continue splitting or not?
In general you have no idea how much work is done in the consumer passed to tryAdvance or forEachRemaining. Neither stream pipeline nor FJP knows this as it depends on user supplied code. It can be either much faster or much slower than the splitting procedure. For example, you may have two-elements input but the processing of each element takes one hour, so splitting this input is very reasonable.
I usually split the input as much as I can. There are three tricks which can be used to improve the splitting:
If it's hard to split evenly, but you can track (or at least roughly estimate) the size of each sub-part, feel free to split unevenly. The stream implementation will do more further splitting for the bigger part. Don't forget about SIZED and SUBSIZED characteristics.
Move the hard part of splitting to the next tryAdvance/forEachRemaining call. For example, suppose that you have a known number of permutations and in trySplit you are going to jump to other permutation. Something like this:
public class MySpliterator implements Spliterator<String> {
private long position;
private String currentPermutation;
private final long limit;
MySpliterator(long position, long limit, String currentPermutation) {
this.position = position;
this.limit = limit;
this.currentPermutation = currentPermutation;
}
@Override
public Spliterator<String> trySplit() {
if(limit - position <= 1)
return null;
long newPosition = (position+limit)>>>1;
Spliterator<String> prefix =
new MySpliterator(position, newPosition, currentPermutation);
this.position = newPosition;
this.currentPermutation = calculatePermutation(newPosition); // hard part
return prefix;
}
...
}
Move the hard part to the next tryAdvance call like this:
@Override
public Spliterator<String> trySplit() {
if(limit - position <= 1)
return null;
long newPosition = (position+limit)>>>1;
Spliterator<String> prefix =
new MySpliterator(position, newPosition, currentPermutation);
this.position = newPosition;
this.currentPermutation = null;
return prefix;
}
@Override
public boolean tryAdvance(Consumer<? super String> action) {
if(currentPermutation == null)
currentPermutation = calculatePermutation(position); // hard part
...
}
This way the hard part will also be executed in parallel with prefix processing.
If only a few elements are left in the current spliterator (for example, fewer than 10) and a split is requested, it's probably good just to advance through half of your elements, collecting them into an array, and then create an array-based spliterator for this prefix (similarly to how it's done in AbstractSpliterator.trySplit()). Here you control all the code, so you can measure in advance how much slower a normal trySplit is than tryAdvance and estimate the threshold at which you should switch to the array-based split.
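The array-based fallback described in the last trick might look roughly like the sketch below. It is illustrative only: the source is a trivial range of long values standing in for an expensive sequence such as permutations, and the class name RangeSpliterator and the threshold of 10 are assumptions rather than recommendations.

```java
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;

// Hypothetical sketch: halve by index while large, batch into an array when small.
public class RangeSpliterator implements Spliterator<Long> {
    private static final int BATCH_THRESHOLD = 10; // assumed cut-over; measure for real workloads
    private long position;      // next value to emit
    private final long limit;   // exclusive upper bound

    public RangeSpliterator(long position, long limit) {
        this.position = position;
        this.limit = limit;
    }

    @Override
    public boolean tryAdvance(Consumer<? super Long> action) {
        if (position >= limit) {
            return false;
        }
        action.accept(position++);
        return true;
    }

    @Override
    public Spliterator<Long> trySplit() {
        long remaining = limit - position;
        if (remaining <= 1) {
            return null;
        }
        if (remaining < BATCH_THRESHOLD) {
            // few elements left: advance through half of them and hand them over as an array
            Object[] batch = new Object[(int) (remaining / 2)];
            for (int i = 0; i < batch.length; i++) {
                final int slot = i;
                tryAdvance(value -> batch[slot] = value);
            }
            return Spliterators.spliterator(batch, ORDERED | IMMUTABLE | NONNULL);
        }
        long mid = (position + limit) >>> 1;
        Spliterator<Long> prefix = new RangeSpliterator(position, mid);
        position = mid;
        return prefix;
    }

    @Override
    public long estimateSize() {
        return limit - position;
    }

    @Override
    public int characteristics() {
        return ORDERED | SIZED | SUBSIZED | IMMUTABLE | NONNULL;
    }
}
```

It can be handed to StreamSupport.stream(new RangeSpliterator(0, 1000), true) to obtain a parallel stream. For a range this cheap the batching buys nothing; the point is only to show where the switch would sit.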

How can I evaluate a hash table implementation? (Using HashMap as reference)

Problem:
I need to compare 2 hash table implementations (well basically HashMap with another one) and make a reasonable conclusion.
I am not interested in 100% accuracy but just being in the right direction in my estimation.
I am interested in the difference not only per operation but mainly on the hashtable as a "whole".
I don't have a strict requirement on speed so if the other implementation is reasonably slower I can accept it but I do expect/require that the memory usage be better (since one of the hashtables is backed by primitive table).
What I did so far:
Originally I created my own custom "benchmark" with loops and many calls to hint for gc to get a feeling of the difference but I am reading online that using a standard tool is more reliable/appropriate.
Example of my approach (MapInterface is just a wrapper so I can switch among implementations.):
Integer[] keys = new Integer[10000000];
String[] values = new String[10000000];
for (int i = 0; i < keys.length; ++i) {
    keys[i] = i;
    values[i] = "" + i;
}
if (operation.equals("put")) {
    runOperation(map, keys, values);
}
public long[] runOperation(MapInterface map, Integer[] keys, String[] values) {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    long run = 0;
    for (int i = 0; i < 10; ++i) {
        long start = System.currentTimeMillis();
        for (int j = 0; j < keys.length; ++j) {
            map.put(keys[j], values[j]);
        }
        long total = System.currentTimeMillis() - start;
        System.out.println(total / 1000d + " seconds");
        if (total < min) {
            min = total;
        }
        if (total > max) {
            max = total;
        }
        run += total;
        // drop the filled map and start the next round with a fresh one
        map = null;
        map = createNewHashMap();
        hintsToGC();
    }
    return new long[] {min, max, run};
}
public void hintsToGC() {
for(int i = 0; i < 20; ++i) {
System.out.print(". ");
System.gc();
try {
Thread.sleep(100);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
private HashMapInterface<String> createNewHashMap() {
if(jdk) {
return new JDKHashMapWrapper<String>();
}
else {
return new AlternativeHashMapWrapper<String>();
}
}
public class JDKHashMapWrapper implements HashMapInterface<String> {
HashMap<Integer, String> hashMap;
JDKHashMapWrapper() {
hashMap = new HashMap<Integer, String>();
}
public String put(Integer key, String value) {
return hashMap.put(key, value);
}
//etc
}
(I want to test put, get, contains and the memory utilization)
Can I be sure by using my approach that I can get reasonable measurements?
If not what would be the most appropriate tool to use and how?
Update:
- I also test with random numbers (also ~10M random numbers) using SecureRandom.
- When the hash table resizes I print the logical size of the hash table/size of the actual table to get the load factor
Update:
For my specific case, where I am also interested in integers, what kind of pitfalls are there with my approach?
UPDATE after @dimo414 comments:
Well at a minimum the hashtable as a "whole" isn't meaningful
I mean how the hashtable behaves under various loads both at runtime and in memory consumption.
Every data structure is a tradeoff of different methods
I agree. My trade-off is an acceptable access penalty for memory improvement
You need to identify what features you're interested in verifying
1) put(key, value);
2) get(key);
3) containsKey(key);
4) all the above when having many entries in the hash table
Some key considerations when using hash tables are the number of "buckets" allocated, the collision resolution strategy, and the shape of your data. Essentially, a hash table takes the key supplied by the application and hashes it to a bucket index smaller than the number of allocated buckets. When two key values hash to the same bucket, the implementation has to resolve the collision and return the right value. For example, one could have a sorted linked list for each bucket and that list is searched.
If your data happens to have a lot of collisions, then your performance will suffer, because the hash table implementation will spend too much time resolving the collisions. On the other hand, if you have a very large number of buckets, you solve the collision problem at the expense of memory. Also, Java's built-in HashMap implementation will "rehash" (resize its table) once the number of entries exceeds the capacity multiplied by the load factor (0.75 by default); this is an expensive operation that is worth avoiding.
Since your key data is the consecutive integers from 0 up to 10M, your test data looks good. I would also ensure that the different hash table implementations are initialized to the same number of buckets for a given test, otherwise it's not a fair comparison. Finally, I would vary the number of buckets over a pretty significant range and rerun the tests to see how the implementations change their behavior.
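To keep rehashing out of a put benchmark, the map can be sized up front. The sketch below shows the arithmetic for java.util.HashMap under its documented default load factor of 0.75; the helper names PresizedMaps, capacityFor and newPresizedHashMap are invented for the example, and another implementation may need a different formula.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: pick an initial capacity large enough to avoid any resize.
public class PresizedMaps {
    /** Smallest capacity request such that expectedEntries <= capacity * loadFactor. */
    public static int capacityFor(int expectedEntries, float loadFactor) {
        return (int) Math.ceil(expectedEntries / (double) loadFactor);
    }

    /** Creates a HashMap that should hold expectedEntries without resizing. */
    public static <K, V> Map<K, V> newPresizedHashMap(int expectedEntries) {
        float loadFactor = 0.75f; // HashMap's documented default
        return new HashMap<>(capacityFor(expectedEntries, loadFactor), loadFactor);
    }
}
```

For the 10M-entry test this asks for a capacity of 13,333,334, which HashMap rounds up to the next power of two internally.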
As I understand you are interested in both operations execution time and memory consumption of the maps in the test.
I will start with memory consumption, as this seems not to have been answered at all. What I propose is to use a small library called Classmexer. I have personally used it whenever I needed the exact memory consumption of an object. It takes the Java agent approach (because it uses the Instrumentation API), which means that you need to add it as a parameter to the JVM executing your tests:
-javaagent:[PATH_TO]/classmexer.jar
The usage of the Classmexer is very simple. At any point of time you can get the memory consumption in bytes by executing:
MemoryUtil.deepMemoryUsageOf(mapIamInterestedIn, VisibilityFilter.ALL)
Note that with visibility filter you can specify if the memory calculation should be done for the object (our map) plus all other reachable object through references. That's what VisibilityFilter.ALL is for. However, this would mean that the size you get back includes all objects you used for keys and values. Thus if you have 100 Integer/String entries the reported size will include those as well.
For the timing aspect I would propose the JMH tool, as it is made for microbenchmarking. There are plenty of examples online; for example, this article has map testing examples that can guide you pretty well.
Note that you should be careful about when you call Classmexer's MemoryUtil, as it will interfere with the timing results if you call it while time is being measured. Furthermore, I am sure that there are many other tools similar to Classmexer, but I like it because it is small and simple.
I was just doing something similar to this, and I ended up using the built in profiler in the Netbeans IDE. You can get really detailed info on both CPU and memory usage. I had originally written all my code in Eclipse, but Netbeans has an import feature for bringing in Eclipse projects and it set it all up no problem, if that is possibly your situation too.
For timing, you might also look at the StopWatch class in Apache Commons. It's a much more intuitive way of tracking time on targeted operations, e.g.:
StopWatch myMapTimer = new StopWatch();
HashMap<Integer, Integer> hashMap = new HashMap<>();
myMapTimer.start();
for (int i = 0; i < numElements; i++)
hashMap.put(i, i);
myMapTimer.stop();
System.out.println(myMapTimer.getTime()); // time will be in milliseconds

Huge performance difference between Vector and HashSet

I have a program which fetches records from database (using Hibernate) and fills them in a Vector. There was an issue regarding the performance of the operation and I did a test with the Vector replaced by a HashSet. With 300000 records, the speed gain is immense - 45 mins to 2 mins!
So my question is, what is causing this huge difference? Is it just the point that all methods in Vector are synchronized or the point that internally Vector uses an array whereas HashSet does not? Or something else?
The code is running in a single thread.
EDIT:
The code is only inserting the values in the Vector (and in the other case, HashSet).
If it's trying to use the Vector as a set, and checking for the existence of a record before adding it, then filling the vector becomes an O(n^2) operation, compared with O(n) for HashSet. It would also become an O(n^2) operation if you insert each element at the start of the vector instead of at the end.
If you're just using collection.add(item) then I wouldn't expect to see that sort of difference - synchronization isn't that slow.
If you can try to test it with different numbers of records, you could see how each version grows as n increases - that would make it easier to work out what's going on.
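The gap between the two growth rates can be made visible without timing anything, by counting equals() calls instead. This is only a sketch: Rec is a made-up stand-in for the Hibernate record class, every record is distinct, and its hashCode is simply its id.

```java
import java.util.Collection;

// Hypothetical sketch: counts equals() calls for "check, then add" on any collection.
public class DuplicateCheckCost {
    static long equalsCalls = 0;

    /** Made-up record type that counts how often equals() is invoked. */
    public static final class Rec {
        final int id;

        public Rec(int id) {
            this.id = id;
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof Rec && ((Rec) o).id == id;
        }

        @Override
        public int hashCode() {
            return id;
        }
    }

    /** Inserts n distinct records, checking contains() first; returns the equals() calls made. */
    public static long fill(Collection<Rec> target, int n) {
        equalsCalls = 0;
        for (int i = 0; i < n; i++) {
            Rec r = new Rec(i);
            if (!target.contains(r)) {   // linear scan for a Vector, hash lookup for a HashSet
                target.add(r);
            }
        }
        return equalsCalls;
    }
}
```

Calling fill(new Vector<>(), n) performs n(n-1)/2 comparisons, because each contains() scans everything inserted so far, while fill(new HashSet<>(), n) needs at most a handful. At n = 300000 the Vector version comes to roughly 45 billion equals() calls.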
EDIT: If you're just using Vector.add then it sounds like something else could be going on - e.g. your database was behaving differently between your different test runs. Here's a little test application:
import java.util.*;
public class Test {
public static void main(String[] args) {
long start = System.currentTimeMillis();
Vector<String> vector = new Vector<String>();
for (int i = 0; i < 300000; i++) {
vector.add("dummy value");
}
long end = System.currentTimeMillis();
System.out.println("Time taken: " + (end - start) + "ms");
}
}
Output:
Time taken: 38ms
Now obviously this isn't going to be very accurate - System.currentTimeMillis isn't the best way of getting accurate timing - but it's clearly not taking 45 minutes. In other words, you should look elsewhere for the problem, if you really are just calling Vector.add(item).
Now, changing the code above to use
vector.add(0, "dummy value"); // Insert item at the beginning
makes an enormous difference - it takes 42 seconds instead of 38ms. That's clearly a lot worse - but it's still a long way from being 45 minutes - and I doubt that my desktop is 60 times as fast as yours.
If you are inserting them at the middle or beginning instead of at the end, then the Vector needs to shift all the following elements along on every insert. A HashSet, on the other hand, has no positions to maintain, so it doesn't have to move anything.
Vector is outdated and should not be used anymore. Profile with ArrayList or LinkedList (depends on how you use the list) and you will see the difference (sync vs unsync).
Why are you using Vector in a single threaded application at all?
Vector is synchronized by default; HashSet is not. That's my guess. Obtaining a monitor for access takes time.
I don't know if there are reads in your test, but Vector and HashSet are both O(1) if get() is used to access Vector entries.
Under normal circumstances, it is totally implausible that inserting 300,000 records into a Vector will take 43 minutes longer than inserting the same records into a HashSet.
However, I think there is a possible explanation of what might be going on.
First, the records coming out of the database must have a very high proportion of duplicates. Or at least, they must be duplicates according to the semantics of the equals/hashcode methods of your record class.
Next, I think you must be pushing very close to filling up the heap.
So the reason that the HashSet solution is so much faster is that most of the records are rejected as duplicates by the set.add operation. By contrast, the Vector solution is keeping all of the records, and the JVM is spending most of its time trying to squeeze out that last 0.05% of memory by running the GC over and over and over.
One way to test this theory is to run the Vector version of the application with a much bigger heap.
Irrespective, the best way to investigate this kind of problem is to run the application using a profiler, and see where all the CPU time is going.
import java.util.*;
public class Test {
    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        Vector<String> vector = new Vector<String>();
        for (int i = 0; i < 300000; i++) {
            String value = "dummy value " + i;
            // contains() scans the whole vector, so every insert costs O(n)
            if (!vector.contains(value)) {
                vector.add(value);
            }
        }
        long end = System.currentTimeMillis();
        System.out.println("Time taken: " + (end - start) + "ms");
    }
}
If you check for a duplicate element before inserting each element into the vector, the time taken grows with the size of the vector. The best way to get high performance is to use a HashSet, because a HashSet does not allow duplicates and there is no need to check for a duplicate element before inserting.
Dr Heinz Kabutz wrote the following in one of his newsletters:
The old Vector class implements serialization in a naive way. They simply do the default serialization, which writes the entire Object[] as-is into the stream. Thus if we insert a bunch of elements into the List, then clear it, the difference between Vector and ArrayList is enormous.
import java.util.*;
import java.io.*;
public class VectorWritingSize {
public static void main(String[] args) throws IOException {
test(new LinkedList<String>());
test(new ArrayList<String>());
test(new Vector<String>());
}
public static void test(List<String> list) throws IOException {
insertJunk(list);
for (int i = 0; i < 10; i++) {
list.add("hello world");
}
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ObjectOutputStream out = new ObjectOutputStream(baos);
out.writeObject(list);
out.close();
System.out.println(list.getClass().getSimpleName() +
" used " + baos.toByteArray().length + " bytes");
}
private static void insertJunk(List<String> list) {
for(int i = 0; i<1000 * 1000; i++) {
list.add("junk");
}
list.clear();
}
}
When we run this code, we get the following output:
LinkedList used 107 bytes
ArrayList used 117 bytes
Vector used 1310926 bytes
Vector can use a staggering amount of bytes when being serialized. The lesson here? Don't ever use Vector as Lists in objects that are Serializable. The potential for disaster is too great.
