I have a structure which contains consecutive time periods (without overlap) and a certain value.
class Record {
private TimeWindow timeWindow;
private String value;
}
interface TimeWindow {
LocalDate getBeginDate();
LocalDate getEndDate(); //Can be null
}
My goal is to implement a function which takes a date and figures out the value.
A naive implementation could be to loop through all records until the date matches the window.
class RecordHistory {
private List<Record> history;
public String getValueForDate(LocalDate date) {
for (Record record : history) {
if (record.dateMatchesWindow(date)){
return record.getValue();
}
}
return null; //or something similar
}
}
class Record {
private TimeWindow timeWindow;
private String value;
public boolean dateMatchesWindow(LocalDate subject) {
return !subject.isBefore(timeWindow.getBeginDate()) && (timeWindow.getEndDate() == null || !subject.isAfter(timeWindow.getEndDate()));
}
public String getValue(){
return value;
}
}
These values come from database queries (there is no chance to change the structure of the tables). The list of Records could be small or huge, and the queried dates range from the start of the history to its end. However, the same date will not be looked up twice for the same RecordHistory. There will be multiple RecordHistory objects, whose values represent different attributes.
Is there an efficient way to search this structure?
You can use binary search to get the matching Record (if such a record exists) in O(log n) time.
Java already has data structures that do this for you, e.g. the TreeMap. You can map every Record to its starting date, then get the floorEntry for a given date and check whether it is a match.
// create map (done only once, of course)
TreeMap<LocalDate, Record> records = new TreeMap<>();
for (Record r : recordList) {
    records.put(r.getTimeWindow().getBeginDate(), r);
}
// find record for a given date
public String getValueForDate(LocalDate date) {
    // floorEntry returns null when the date precedes every begin date
    Map.Entry<LocalDate, Record> floor = records.floorEntry(date);
    if (floor != null && floor.getValue().dateMatchesWindow(date)) {
        return floor.getValue().getValue();
    }
    return null;
}
If the entries are non-overlapping and the floor entry is not a match, then no other entry will be.
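For reference, here is a minimal self-contained sketch of the same floorEntry lookup. The `Period` class is a hypothetical, simplified stand-in for the question's Record and TimeWindow pair (it is not part of the original code), and a null end date is assumed to mean an open-ended window.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class RecordHistorySketch {

    /** Hypothetical stand-in for Record + TimeWindow; end may be null (open-ended). */
    static final class Period {
        final LocalDate begin;
        final LocalDate end;
        final String value;

        Period(LocalDate begin, LocalDate end, String value) {
            this.begin = begin;
            this.end = end;
            this.value = value;
        }

        boolean matches(LocalDate date) {
            return !date.isBefore(begin) && (end == null || !date.isAfter(end));
        }
    }

    // Periods keyed by begin date; built once, queried many times.
    private final TreeMap<LocalDate, Period> byBegin = new TreeMap<>();

    RecordHistorySketch(List<Period> periods) {
        for (Period p : periods) {
            byBegin.put(p.begin, p);
        }
    }

    /** Returns the value whose window contains the date, or null if there is none. */
    String getValueForDate(LocalDate date) {
        // Greatest begin date <= date; null if the date precedes all periods.
        Map.Entry<LocalDate, Period> floor = byBegin.floorEntry(date);
        if (floor != null && floor.getValue().matches(date)) {
            return floor.getValue().value;
        }
        return null;
    }

    public static void main(String[] args) {
        RecordHistorySketch history = new RecordHistorySketch(List.of(
                new Period(LocalDate.of(2020, 1, 1), LocalDate.of(2020, 1, 31), "a"),
                new Period(LocalDate.of(2020, 3, 1), null, "b")));
        System.out.println(history.getValueForDate(LocalDate.of(2020, 1, 15))); // a
        System.out.println(history.getValueForDate(LocalDate.of(2020, 2, 15))); // null (gap)
        System.out.println(history.getValueForDate(LocalDate.of(2021, 6, 1)));  // b (open-ended)
    }
}
```

Building the map costs O(n log n) once, so this only pays off if each RecordHistory is queried more than a handful of times.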
Summary
We have recently changed our String-based ID schema in a complex retrieval engine and observed a severe performance drop. In essence, we changed the IDs from XXX-00000001 to X384840564 (see below for details on the ID schema) and suffer from almost doubled runtimes. Choosing a different string hash function solved the problem, but we still lack a good explanation. Thus, our questions are:
Why do we see such a strong performance drop when changing from the old to the new ID schema?
Why does our solution of using the “parent hash” actually work?
To approach the problem, we hereafter provide (a) detailed information about the ID schemata and hash functions used, (b) a minimal working example in Java that highlights the performance defect, and (c) our performance results and observations.
(Despite the lengthy description, we have already massively reduced the code example to 4 performance critical lines – see phase 2 in the listing.)
(a) Old and new ID schema; hash functions
Our ID objects consist of a parent ID object (string of 16 characters in [A-Z0-9]) and a child ID string. The same parent ID string is on average used by 1–10 child IDs. The old child IDs had a three-letter prefix, a dash, and a zero-padded running index number of length 8, for example, XXX-00000001 (12 characters in total; X may be any letter [A-Z]). The new child IDs have one letter and 9 non-consecutive digits, for example, X384840564 (10 characters in total, X may be any letter [A-Z]). An obvious difference is that the old child ID strings are often recurring (i.e., the string ABC-00000002 occurs with multiple different parent IDs, as the running index typically starts with 1), while the new child IDs with their arbitrary digit combinations typically occur only a few times or even only with a single parent ID.
Since the ID objects are put into HashSets and HashMaps, the choice of a hash function seems crucial. Currently, the system uses the standard string hash for the parent IDs. For the child IDs, we used to XOR the string hashes of parent and child ID – called XOR hash henceforth. In theory, this should distribute different child IDs quite well. As a variant, we experimented with using only the string hash of the parent ID as the hash code of the child ID – called parent hash henceforth. That is, all child IDs sharing the same parent ID share the same hash. In theory, the parent hash could be suboptimal, as all children sharing the same parent ID end up in the same bucket, while the XOR hash should yield a better data distribution.
(b) Minimal working example
Please refer to the following listing (explanation below):
package mwe;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
public class Main {
private static final Random RANDOM = new Random(42);
private static final String DIGITS = "0123456789";
private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" + DIGITS;
private static final int NUM_IDS = 5_000_000;
private static final int MAX_CHILDREN = 5;
private static final int REPETITIONS = 5;
private static final boolean RUN_ID_OLD = true; // e.g., 8IBKMAO2T1ORICNZ__XXX-00000002
private static final boolean RUN_ID_NEW = false; // e.g., 6TEG9R5JP1KHJN55__X580104176
private static final boolean USE_PARENT_HASH = false;
private static final boolean SHUFFLE_SET = false;
private abstract static class BaseID {
protected int hash;
public abstract BaseID getParentID();
@Override
public int hashCode() {
return this.hash;
}
}
private static class ParentID extends BaseID {
private final String id;
public ParentID(final String id) {
this.id = id;
this.hash = id.hashCode();
}
@Override
public BaseID getParentID() {
return null;
}
@Override
public boolean equals(final Object obj) {
if (this == obj) {
return true;
}
if (obj instanceof ParentID) {
final ParentID o = (ParentID) obj;
return this.id.equals(o.id);
}
return false;
}
@Override
public String toString() {
return this.id;
}
}
private static class ChildID extends BaseID {
private final String id;
private final BaseID parent;
public ChildID(final String id, final BaseID parent) {
this.id = id;
this.parent = parent;
// Initialize the hash code of the child ID:
if (USE_PARENT_HASH) {
// Only use the parent hash (i.e., all children have the same hash).
this.hash = parent.hashCode();
} else {
// XOR parent and child hash.
this.hash = parent.hashCode() ^ id.hashCode();
}
}
@Override
public BaseID getParentID() {
return this.parent;
}
@Override
public boolean equals(final Object obj) {
if (this == obj) {
return true;
}
if (this.hash != obj.hashCode()) {
return false;
}
if (obj instanceof ChildID) {
final ChildID o = (ChildID) obj;
final BaseID oParent = o.getParentID();
if (this.parent == null && oParent != null) {
return false;
}
if (this.parent != null && oParent == null) {
return false;
}
if (this.parent == null || !this.parent.equals(oParent)) {
return false;
}
return this.id.equals(o.id);
}
return false;
}
@Override
public String toString() {
return this.parent.toString() + "__" + this.id;
}
}
public static void run(final int repetitions, final boolean useVariant2IDs) throws IOException {
for (int i = 0; i < repetitions; i++) {
System.gc(); // Force memory reset for the next repetition.
// -- PHASE 1: CREATE DATA --------------------------------------------------------------------------------
// Fill a set of several millions random IDs. Each ID is a child ID with a reference to its parent ID.
// Each parent ID has between 1 and MAX_CHILDREN children.
Set<BaseID> ids = new HashSet<>(NUM_IDS);
for (int parentIDIdx = 0; parentIDIdx < NUM_IDS; parentIDIdx++) {
// Generate parent ID: 16 random characters.
final StringBuilder parentID = new StringBuilder();
for (int k = 0; k < 16; k++) {
parentID.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
}
// Generate between 1 and MAX_CHILDREN child IDs.
final int childIDCount = RANDOM.nextInt(MAX_CHILDREN) + 1;
for (int childIDIdx = 0; childIDIdx < childIDCount; childIDIdx++) {
final StringBuilder childID = new StringBuilder();
if (useVariant2IDs) {
// Variant 2: Child ID = letter X plus 9 random digits.
childID.append("X");
for (int k = 0; k < 9; k++) {
childID.append(DIGITS.charAt(RANDOM.nextInt(DIGITS.length())));
}
} else {
// Variant 1: Child ID = XXX- plus zero-padded index of length 8.
childID.append("XXX-").append(String.format("%08d", childIDIdx + 1));
}
final BaseID id = new ChildID(childID.toString(), new ParentID(parentID.toString()));
ids.add(id);
}
}
System.out.print(ids.iterator().next().toString());
System.out.flush();
if (SHUFFLE_SET) {
final List<BaseID> list = new ArrayList<>(ids);
Collections.shuffle(list);
ids = new LinkedHashSet<>(list);
}
System.gc(); // Clean up temp data before starting the timer.
// -- PHASE 2: INDEX DATA ---------------------------------------------------------------------------------
// Iterate over the ID set and fill a map indexed by parent IDs. The map values are irrelevant here, so
// use empty objects.
final long timer = System.currentTimeMillis();
final HashMap<BaseID, Object> map = new HashMap<>();
for (final BaseID id : ids) {
map.put(id.getParentID(), new Object());
}
System.out.println("\t" + (System.currentTimeMillis() - timer));
// Ensure that map and IDs are not GC:ed before the timer stops.
if (map.get(new ParentID("_do_not_gc")) == null) {
map.put(new ParentID("_do_not_gc"), new Object());
}
ids.add(new ParentID("_do_not_gc"));
}
}
public static void main(final String[] args) throws IOException {
if (RUN_ID_OLD) {
run(REPETITIONS, false);
}
if (RUN_ID_NEW) {
run(REPETITIONS, true);
}
}
}
In essence, the program first generates a HashSet of IDs and then indexes these IDs by their parent ID in a HashMap. In detail:
The first phase (PHASE 1) generates 5 million parent IDs, each with 1 to MAX_CHILDREN child IDs (5 in the listing above), using either the old (e.g., XXX-00000001) or the new ID schema (e.g., X384840564) and one of the two hash functions. The generated child IDs are collected in a HashSet. We explicitly create new parent ID objects for each child ID to match the functionality of the original system. For experimentation, the IDs can optionally be shuffled in a LinkedHashSet to distort the hash-based ordering (cf. boolean SHUFFLE_SET).
The second phase (PHASE 2) simulates the performance-critical path. It reads all IDs (child IDs with their parents) from the HashSet and puts them into a HashMap with the parent IDs as keys (i.e., aggregate IDs by parent).
Note: The actual retrieval system has a more complex logic, such as reading IDs from multiple sets and merging child IDs as the map entry’s values, but it turned out that none of these steps was responsible for the strong performance gap in question.
The remaining lines try to control for the GC, such that the data structures are not GC:ed too early. We’ve tried different alternatives for controlling the GC, but the results seemed pretty stable overall.
When running the program, the constants RUN_ID_OLD and RUN_ID_NEW activate the old and the new ID schema, respectively (best activate only one at a time). USE_PARENT_HASH switches between the XOR hash (false) and the parent hash (true). SHUFFLE_SET distorts the item order in the ID set. All other constants can remain as they are.
(c) Results
All results here are based on a typical Windows desktop with OpenJDK 11. We also tested Oracle JDK 8 and a Linux machine, and observed similar effects in all cases. For the following figure, we tested each configuration in independent runs, where each run repeats the timing 5 times. To reduce the influence of outliers, we report the median of the repetitions. Note, however, that the timings of the repetitions do not differ much. The performance is measured in milliseconds.
Observations:
Using the XOR hash yields a substantial performance drop in the HashSet setting when switching to the new ID schema. This hash function seems suboptimal, but we lack a good explanation.
Using the parent hash function speeds up the process regardless of the ID schema. We speculate that the internal HashSet order is beneficial, since the resulting HashMap will build up the same order (because ID.hash = ID.parent.hash). Interestingly, this effect can also be observed if the HashSet is split into, say, 50 parts, each holding a random partition of the full HashSet. This leaves us puzzled.
The entire process seems to depend heavily on the reading order in the for loop of the second phase (i.e., the internal order of the HashSet). If we distort the order in the shuffled LinkedHashSet, we end up in a worst-case scenario, regardless of the ID schema.
In a separate experiment, we also measured the number of collisions when filling the HashMap, but could not find obvious differences when changing the ID schema.
Who can shed more light on explaining these results?
Update
The image below shows some profiling results (using VisualVM) for the non-shuffled runs. Indent indicates nested calls. All percentage values are relative to the phase 2 timing (100%).
An obvious difference seems to be the self time of HashMap.putVal. There was no obvious difference for treeifying buckets.
I decided to create a Map to store metric names and the Ranges representing live periods for each metric. At first I used a TreeRangeMap to store the Ranges but since each Metric contains a single Range I switched to Ranges as shown below.
My goal is to keep the latest time range in the DEFAULT_METRICS_MAP when I receive a Range for the metric from external API.
When I had a TreeRangeMap representing Ranges, comparing them was easy. I added new metric to the TreeRangeMap and then got the max range like this:
private static Optional<Range<Long>> maxRange(TreeRangeSet<Long> rangeSet) {
Set<Range<Long>> ranges = rangeSet.asRanges();
return ranges.stream().max(Comparator.comparing(Range::upperEndpoint));
}
What would be the correct way to compare Ranges when they are not wrapped into a TreeRangeMap?
public static final Map<String, Range<Long>> DEFAULT_METRICS_MAP;
static {
Map<String, Range<Long>> theMap = new HashMap<>();
theMap.put("Metric1", Range.closed(Long.MIN_VALUE, Long.MAX_VALUE));
theMap.put("Metric2", Range.closed(10L, 20L));
theMap.put("Metric3", Range.closed(30L, 50L));
DEFAULT_METRICS_MAP = Collections.unmodifiableMap(theMap);
}
First of all, it was the correct decision to avoid using TreeRangeMap/TreeRangeSet in this particular case. As I understand it (correct me if I'm wrong), you don't need to keep all the ranges for all the metrics. What you need is the latest range for each metric at every moment in time.
Ideally you would like to have a very fast method of retrieving it, like:
Range<Long> range = getRange(metric);
The most efficient way is to compare Range objects on receiving them:
public void setRange(String metric, Range<Long> newRange) {
Range<Long> oldRange = metricRanges.get(metric);
if (comparator.compare(newRange, oldRange) > 0) {
metricRanges.put(metric, newRange);
}
}
Here is the full example:
// Better keep this map encapsulated
private final Map<String, Range<Long>> metricRanges = new HashMap<>();
private final Comparator<Range<Long>> comparator =
Comparator.nullsFirst(Comparator.comparing(Range::upperEndpoint));
{
// Instance initializer (metricRanges is an instance field): fill in your map with default ranges
}
public void setRange(String metric, Range<Long> newRange) {
Range<Long> oldRange = metricRanges.get(metric);
if (comparator.compare(newRange, oldRange) > 0) {
metricRanges.put(metric, newRange);
}
}
public Range<Long> getRange(String metric) {
return metricRanges.get(metric);
}
If you still need Optional:
public Optional<Range<Long>> getRange(String metric) {
return Optional.ofNullable(metricRanges.get(metric)); // ofNullable: get() returns null for an unknown metric
}
PriorityQueue<StoreEmail> emails = new PriorityQueue<StoreEmail> (n,
new Comparator<StoreEmail> () {
public int compare(StoreEmail a, StoreEmail b) {
if(a.urgency != b.urgency){
return b.urgency - a.urgency;
}
else{
return Long.compare(a.timestamp, b.timestamp); // timestamps are long, so a plain subtraction does not fit the int return type
}
}
}
);
public class StoreEmail
{
String emailContent;
int urgency;
long timestamp;
StoreEmail(String emailContent, int urgency,long timestamp){
this.emailContent = emailContent;
this.urgency = urgency;
this.timestamp = timestamp;
}
}
Inserting in the queue
StoreEmail storeEmail = new StoreEmail(in.next(),in.nextInt(),System.currentTimeMillis());
emails.add(storeEmail);
With the above comparator, I insert the following values into the priority queue:
store email value
store email5 4
store email4 4
store email3 4
store email2 4
store email1 4
It gives a different result on each run, which means the comparator is not working properly and is not able to sort based on the timestamp.
Note: I want to sort based on the email value while maintaining FIFO order.
Can somebody help me resolve this problem?
Thanks in advance. I have wasted a lot of time on this already.
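One likely cause (an assumption, since the thread does not confirm it) is that System.currentTimeMillis() returns the same value for emails inserted within the same millisecond, so the tie-break between them is undefined. A minimal sketch of a FIFO tie-break follows, assuming the timestamp is replaced by a monotonically increasing sequence number. The `seq` field, the `SEQUENCE` counter, and the `drain` helper are hypothetical names introduced here for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.atomic.AtomicLong;

class EmailQueueSketch {

    // Hypothetical insertion counter; unlike a millisecond clock it never repeats.
    private static final AtomicLong SEQUENCE = new AtomicLong();

    static final class StoreEmail {
        final String emailContent;
        final int urgency;
        final long seq; // assumption: replaces the timestamp field of the question

        StoreEmail(String emailContent, int urgency) {
            this.emailContent = emailContent;
            this.urgency = urgency;
            this.seq = SEQUENCE.getAndIncrement();
        }
    }

    // Higher urgency first; among equal urgencies, earlier insertion first (FIFO).
    static final Comparator<StoreEmail> ORDER =
            Comparator.comparingInt((StoreEmail e) -> e.urgency).reversed()
                      .thenComparingLong(e -> e.seq);

    /** Polls the queue until it is empty and returns the contents in poll order. */
    static List<String> drain(PriorityQueue<StoreEmail> queue) {
        List<String> result = new ArrayList<>();
        while (!queue.isEmpty()) {
            result.add(queue.poll().emailContent);
        }
        return result;
    }

    public static void main(String[] args) {
        PriorityQueue<StoreEmail> emails = new PriorityQueue<>(ORDER);
        for (String name : new String[] {"email5", "email4", "email3", "email2", "email1"}) {
            emails.add(new StoreEmail(name, 4));
        }
        // With equal urgency the poll order now follows insertion order on every run.
        System.out.println(drain(emails)); // [email5, email4, email3, email2, email1]
    }
}
```

Note that iterating over a PriorityQueue (or printing it directly) does not visit elements in priority order; only repeated poll() calls do.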
I have a product class
public class Produs {
private String denumire;
private String categorie;
private String taraOrigine;
private double pret;
}
with different constructors to fit my needs. I have an ArrayList of this type in which all the products have all fields set (the list is generated by parsing a file), and another list in which the products have only the name and country of origin filled in (the rest of the fields are null). This second list is also generated from another list.
My question is: how can I search the first list, using the known fields of a product located in the second list, so that I can complete every object in the first list?
I have tried with
public Produs getProdus(Produs p)
{
for(Produs prod:produse)
{
if ((prod.getDenumire().equals(p.getDenumire()) && (prod.getTaraOrigine().equals(p.getTaraOrigine()))));
{
return prod;
}
}
return null;
}
where produse is my list of products where all fields have values and p is a Product constructed using only 2 fields.
I have also tried overriding equals and hashCode. The problem is that when it finds the element, the loop stops.
You need to populate it before returning the actual object.
public Produs getProdus(Produs p)
{
for(Produs prod:produse)
{
if (prod.getDenumire().equals(p.getDenumire()) && prod.getTaraOrigine().equals(p.getTaraOrigine())) // no semicolon here, otherwise the block below always runs
{
if (prod.getCategorie() == null) {
prod.setCategorie(p.getCategorie());//assuming you have getter and setter already in Produs
}
return prod;//remove this statement, if you want multiple products to be updated and make this method as void type instead of returning Produs type. Remove return null as well from end of this method.
}
}
return null;
}
If you want to list all the products that match the criteria, you could create a list and populate it like below:
public void getProdus(Produs p)
{
List<Produs> productList = new ArrayList<Produs>();
for(Produs prod:produse)
{
if (prod.getDenumire().equals(p.getDenumire()) && prod.getTaraOrigine().equals(p.getTaraOrigine())) // no semicolon here, otherwise every product is added
{
productList.add(prod);
}
}
for(Produs prod:productList) {//iterate over the list who matched the criteria and amend it with properties from p.
}
}
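The logic of your getProdus() function is correct, apart from the stray semicolon after the if condition, which makes the block that follows run unconditionally, so the first product is always returned. Once that is removed, you need to call the function in a loop for every object in the list of incomplete products.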
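If both lists are large, calling a linear search for every incomplete product becomes quadratic. A possible alternative, sketched below under the assumption that name plus country of origin identifies a product uniquely, is to index the complete list once in a HashMap. The `key`, `index`, and `resolve` helpers and the two constructors are hypothetical and only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ProdusLookupSketch {

    static final class Produs {
        private final String denumire;
        private final String categorie;
        private final String taraOrigine;
        private final double pret;

        Produs(String denumire, String categorie, String taraOrigine, double pret) {
            this.denumire = denumire;
            this.categorie = categorie;
            this.taraOrigine = taraOrigine;
            this.pret = pret;
        }

        // Partial product: only name and country of origin are known.
        Produs(String denumire, String taraOrigine) {
            this(denumire, null, taraOrigine, 0.0);
        }

        String getDenumire() { return denumire; }
        String getCategorie() { return categorie; }
        String getTaraOrigine() { return taraOrigine; }
        double getPret() { return pret; }
    }

    // Composite key; assumes (name, country) is unique among the complete products.
    static List<String> key(Produs p) {
        return Arrays.asList(p.getDenumire(), p.getTaraOrigine());
    }

    /** Builds the lookup table once from the list of complete products. */
    static Map<List<String>, Produs> index(List<Produs> complete) {
        Map<List<String>, Produs> byKey = new HashMap<>();
        for (Produs p : complete) {
            byKey.put(key(p), p);
        }
        return byKey;
    }

    /** Returns the complete product for each partial one, skipping those without a match. */
    static List<Produs> resolve(List<Produs> partial, Map<List<String>, Produs> byKey) {
        List<Produs> result = new ArrayList<>();
        for (Produs p : partial) {
            Produs full = byKey.get(key(p));
            if (full != null) {
                result.add(full);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Produs> complete = Arrays.asList(
                new Produs("Mere", "Fructe", "Romania", 3.5),
                new Produs("Mere", "Fructe", "Polonia", 3.0));
        List<Produs> partial = Arrays.asList(
                new Produs("Mere", "Polonia"),
                new Produs("Pere", "Italia"));
        for (Produs p : resolve(partial, index(complete))) {
            System.out.println(p.getDenumire() + " " + p.getTaraOrigine() + " " + p.getPret());
        }
    }
}
```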
I have created the following method:
public List<String> listAll() {
List worldCountriesByLocal = new ArrayList();
for (Locale locale : Locale.getAvailableLocales()) {
final String isoCountry = locale.getDisplayCountry();
if (isoCountry.length() > 0) {
worldCountriesByLocal.add(isoCountry);
Collections.sort(worldCountriesByLocal);
}
}
return worldCountriesByLocal;
}
It's pretty simple and it returns a list of world countries in the user's locale. I then sort it to get it into alphabetical order. This all works perfectly (except that I seem to occasionally get duplicates of countries!).
Anyway, what I need is to place the US, and UK at the top of the list regardless. The problem I have is that I can't isolate the index or the string that will be returned for the US and UK because that is specific to the locale!
Any ideas would be really appreciated.
It sounds like you should implement your own Comparator<Locale> to compare two locales with the following steps:
If the locales are the same, return 0
If one locale is the US, make that "win"
If one locale is the UK, make that "win"
Otherwise, use o1.getDisplayCountry().compareTo(o2.getDisplayCountry()) (i.e. delegate to existing behaviour)
(This will put the US before the UK.)
Then call Collections.sort with an instance of your custom comparator.
Do all of this before extracting the country names - then extract them from the sorted list.
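The steps above could be sketched roughly as follows. This is only an illustration: the `rank` helper, the `ORDER` constant, and `sortedCountryNames` are hypothetical names, the country codes "US" and "GB" are used to recognise the two priority countries independently of the display language, and duplicates are dropped with a LinkedHashSet after sorting.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class PriorityLocaleSketch {

    // 0 for the US, 1 for the UK, 2 for every other country.
    static int rank(Locale locale) {
        if ("US".equals(locale.getCountry())) {
            return 0;
        }
        if ("GB".equals(locale.getCountry())) {
            return 1;
        }
        return 2;
    }

    // Priority countries "win"; everything else is ordered by display name.
    static final Comparator<Locale> ORDER = (a, b) -> {
        int byRank = Integer.compare(rank(a), rank(b));
        return byRank != 0 ? byRank : a.getDisplayCountry().compareTo(b.getDisplayCountry());
    };

    /** Sorts the locales first, then extracts the display names without duplicates. */
    static List<String> sortedCountryNames(Locale[] locales) {
        List<Locale> withCountry = new ArrayList<>();
        for (Locale locale : locales) {
            if (!locale.getCountry().isEmpty()) {
                withCountry.add(locale);
            }
        }
        withCountry.sort(ORDER);
        // LinkedHashSet keeps the sorted order and drops repeated country names.
        Set<String> names = new LinkedHashSet<>();
        for (Locale locale : withCountry) {
            names.add(locale.getDisplayCountry());
        }
        return new ArrayList<>(names);
    }

    public static void main(String[] args) {
        System.out.println(sortedCountryNames(Locale.getAvailableLocales()));
    }
}
```

For linguistically correct ordering of the remaining names, a java.text.Collator could replace the plain String comparison.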
You could also use a TreeSet to eliminate duplicates and your own Comparator to bring US and GB up to the start.
You are getting duplicates (which this will eliminate) because there is often more than one locale per country. There is a US (Spanish) as well as a US (English), and there are three Switzerlands (French, German and Italian), for example.
public class AllLocales {
// Which Locales get priority.
private static final Locale[] priorityLocales = {
Locale.US,
Locale.UK
};
private static class MyLocale implements Comparable<MyLocale> {
// My Locale.
private final Locale me;
public MyLocale(Locale me) {
this.me = me;
}
// Convenience
public String getCountry() {
return me.getCountry();
}
@Override
public int compareTo(MyLocale it) {
// No duplicates in the country field.
if (getCountry().equals(it.getCountry())) {
return 0;
}
// Check for priority ones.
for (int i = 0; i < priorityLocales.length; i++) {
Locale priority = priorityLocales[i];
// I am a priority one.
if (getCountry().equals(priority.getCountry())) {
// I come first.
return -1;
}
// It is a priority one.
if (it.getCountry().equals(priority.getCountry())) {
// It comes first.
return 1;
}
}
// Default to straight comparison.
return getCountry().compareTo(it.getCountry());
}
}
public static List<String> listAll() {
Set<MyLocale> byLocale = new TreeSet<>();
// Gather them all up.
for (Locale locale : Locale.getAvailableLocales()) {
final String isoCountry = locale.getDisplayCountry();
if (isoCountry.length() > 0) {
//System.out.println(locale.getCountry() + ":" + isoCountry + ":" + locale.getDisplayName());
byLocale.add(new MyLocale(locale));
}
}
// Roll them out of the set.
ArrayList<String> list = new ArrayList<>();
for (MyLocale l : byLocale) {
list.add(l.getCountry());
}
return list;
}
public static void main(String[] args) throws InterruptedException {
// Some demo usages.
List<String> locales = listAll();
System.out.println(locales);
}
}
Yes, when you sort, just provide your own comparator:
Collections.sort(worldCountriesByLocal, new Comparator<String>() {
    @Override
    public int compare(String o1, String o2) {
        // Equal strings must compare as 0 to keep the comparator consistent
        if (o1.equals(o2))
            return 0;
        if (o1.equals(TOP_VALUE))
            return -1;
        if (o2.equals(TOP_VALUE))
            return 1;
        return o1.compareTo(o2);
    }
});
where TOP_VALUE is the value that you always want on top.
I would write my own POJO with a sort token consisting of an integer assigning priority (e.g. 0 for US, 1 for UK, 2 for everyone else), then some delimiter and then the country name. Then I would put the array in a HashMap keyed by that sort ID and the POJO as the val. Then I would sort the keys out of the map and iterate through the sorting and retrieve the plain country name for each sorted key.
E.g.
2.Sweden
2.France
2.Tanzania
0.US
1.UK
sorts
0.US
1.UK
2.France
2.Sweden
2.Tanzania
EDIT: a POJO is needed only if you have more fields other than the country name. If it is just the country name, I would set the sort ID as the hash key and the country name as the val and skip the POJO part.