My program gets very slow as more and more records are processed. I initially thought it was due to excessive memory consumption, since my program is String-intensive (I am using Java 11, so compact strings should be used whenever possible), so I increased the JVM heap:
-Xms2048m
-Xmx6144m
I also increased the task manager's memory as well as the heartbeat timeout in flink-conf.yaml:
jobmanager.heap.size: 6144m
heartbeat.timeout: 5000000
However, none of this helped. The program still slows down at about the same point, after processing roughly 3.5 million records, with only about 0.5 million to go. As it approaches the 3.5 million mark it gets slower and slower until it eventually times out; total execution time is about 11 minutes.
I checked the memory consumption in VisualVM, but it never goes above about 700 MB. My Flink pipeline looks as follows:
final StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironment(1);
environment.setParallelism(1);
DataStream<Tuple> stream = environment.addSource(new TPCHQuery3Source(filePaths, relations));
stream.process(new TPCHQuery3Process(relations)).addSink(new FDSSink());
environment.execute("FlinkDataService");
The bulk of the work is done in the process function, where I implement database join algorithms with the columns stored as Strings. Specifically, I am implementing query 3 of the TPC-H benchmark (see https://examples.citusdata.com/tpch_queries.html if you are interested).
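Roughly, the join state inside the process function is kept in plain Java maps along these lines (a simplified sketch only; the names and layout here are illustrative, the real code is more involved):
// Simplified sketch of the in-memory join state; names and layout are illustrative.
private final Map<String, Tuple> customersByKey = new HashMap<>();           // c_custkey -> customer row
private final Map<String, Tuple> ordersByKey = new HashMap<>();              // o_orderkey -> order row
private final Map<String, List<Tuple>> lineitemsByOrder = new HashMap<>();   // l_orderkey -> lineitem rows
// Each incoming row is put into its map, probed against the other maps,
// and emitted if the o_orderdate / l_shipdate predicates hold.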
The timeout error is this:
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id <id> timed out.
Once I got this error as well:
Exception in thread "pool-1-thread-1" java.lang.OutOfMemoryError: Java heap space
Here is my VisualVM monitoring; the screenshot was captured at the point where things get very slow:
Here is the run loop of my source function:
while (run) {
    readers.forEach(reader -> {
        try {
            String line = reader.readLine();
            if (line != null) {
                Tuple tuple = lineToTuple(line, counter.get() % filePaths.size());
                if (tuple != null && isValidTuple(tuple)) {
                    sourceContext.collect(tuple);
                }
            } else {
                closedReaders.add(reader);
                if (closedReaders.size() == filePaths.size()) {
                    System.out.println("ALL FILES HAVE BEEN STREAMED");
                    cancel();
                }
            }
            counter.getAndIncrement();
        } catch (IOException e) {
            e.printStackTrace();
        }
    });
}
I basically read a line from each of the 3 files I need and, based on the order of the files, construct a tuple object (my custom Tuple class, representing a row in a table) and emit that tuple if it is valid, i.e. it fulfils certain conditions on the date.
I am also suggesting that the JVM do a garbage collection at the 1 millionth, 1.5 millionth, 2 millionth and 2.5 millionth records like this:
System.gc()
Any thoughts on how I can optimize this?
String.intern() saved me. I interned every string before storing it in my maps, and that worked like a charm.
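In case it helps someone, a minimal sketch of what that change looks like (the map and field names here are just placeholders, not my actual classes):
import java.util.*;

// Sketch only: intern strings before they go into the long-lived join maps, so that
// repeated column values (keys, dates, flags) share a single instance in the string pool
// instead of millions of duplicate char arrays.
Map<String, List<String[]>> ordersByCustomer = new HashMap<>();

void index(String custKey, String[] columns) {
    for (int i = 0; i < columns.length; i++) {
        columns[i] = columns[i].intern();
    }
    ordersByCustomer.computeIfAbsent(custKey.intern(), k -> new ArrayList<>()).add(columns);
}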
These are the properties that I changed on my Flink standalone cluster to compute TPC-H query 03:
jobmanager.memory.process.size: 1600m
heartbeat.timeout: 100000
taskmanager.memory.process.size: 8g # default: 1728m
I implemented this query so that only the Order table is streamed and the other tables are kept as state. I am also computing it as a windowless query, which I think makes more sense and is faster.
public class TPCHQuery03 {

    private final String topic = "topic-tpch-query-03";

    public TPCHQuery03() {
        this(PARAMETER_OUTPUT_LOG, "127.0.0.1", false, false, -1);
    }

    public TPCHQuery03(String output, String ipAddressSink, boolean disableOperatorChaining, boolean pinningPolicy, long maxCount) {
        try {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
            if (disableOperatorChaining) {
                env.disableOperatorChaining();
            }

            DataStream<Order> orders = env
                .addSource(new OrdersSource(maxCount)).name(OrdersSource.class.getSimpleName()).uid(OrdersSource.class.getSimpleName());

            // Filter market segment "AUTOMOBILE"
            // customers = customers.filter(new CustomerFilter());

            // Filter all Orders with o_orderdate < 12.03.1995
            DataStream<Order> ordersFiltered = orders
                .filter(new OrderDateFilter("1995-03-12")).name(OrderDateFilter.class.getSimpleName()).uid(OrderDateFilter.class.getSimpleName());

            // Join customers with orders and package them into a ShippingPriorityItem
            DataStream<ShippingPriorityItem> customerWithOrders = ordersFiltered
                .keyBy(new OrderKeySelector())
                .process(new OrderKeyedByCustomerProcessFunction(pinningPolicy)).name(OrderKeyedByCustomerProcessFunction.class.getSimpleName()).uid(OrderKeyedByCustomerProcessFunction.class.getSimpleName());

            // Join the last join result with Lineitems
            DataStream<ShippingPriorityItem> result = customerWithOrders
                .keyBy(new ShippingPriorityOrderKeySelector())
                .process(new ShippingPriorityKeyedProcessFunction(pinningPolicy)).name(ShippingPriorityKeyedProcessFunction.class.getSimpleName()).uid(ShippingPriorityKeyedProcessFunction.class.getSimpleName());

            // Group by l_orderkey, o_orderdate and o_shippriority and compute revenue sum
            DataStream<ShippingPriorityItem> resultSum = result
                .keyBy(new ShippingPriority3KeySelector())
                .reduce(new SumShippingPriorityItem(pinningPolicy)).name(SumShippingPriorityItem.class.getSimpleName()).uid(SumShippingPriorityItem.class.getSimpleName());

            // emit result
            if (output.equalsIgnoreCase(PARAMETER_OUTPUT_MQTT)) {
                resultSum
                    .map(new ShippingPriorityItemMap(pinningPolicy)).name(ShippingPriorityItemMap.class.getSimpleName()).uid(ShippingPriorityItemMap.class.getSimpleName())
                    .addSink(new MqttStringPublisher(ipAddressSink, topic, pinningPolicy)).name(OPERATOR_SINK).uid(OPERATOR_SINK);
            } else if (output.equalsIgnoreCase(PARAMETER_OUTPUT_LOG)) {
                resultSum.print().name(OPERATOR_SINK).uid(OPERATOR_SINK);
            } else if (output.equalsIgnoreCase(PARAMETER_OUTPUT_FILE)) {
                StreamingFileSink<String> sink = StreamingFileSink
                    .forRowFormat(new Path(PATH_OUTPUT_FILE), new SimpleStringEncoder<String>("UTF-8"))
                    .withRollingPolicy(
                        DefaultRollingPolicy.builder().withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
                            .withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
                            .withMaxPartSize(1024 * 1024 * 1024).build())
                    .build();
                resultSum
                    .map(new ShippingPriorityItemMap(pinningPolicy)).name(ShippingPriorityItemMap.class.getSimpleName()).uid(ShippingPriorityItemMap.class.getSimpleName())
                    .addSink(sink).name(OPERATOR_SINK).uid(OPERATOR_SINK);
            } else {
                System.out.println("discarding output");
            }

            System.out.println("Stream job: " + TPCHQuery03.class.getSimpleName());
            System.out.println("Execution plan >>>\n" + env.getExecutionPlan());
            env.execute(TPCHQuery03.class.getSimpleName());
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws Exception {
        new TPCHQuery03();
    }
}
The UDFs are here: OrderSource, OrderKeyedByCustomerProcessFunction, ShippingPriorityKeyedProcessFunction, and SumShippingPriorityItem. I am using com.google.common.collect.ImmutableList since the state will not be updated. I am also keeping only the necessary columns in the state, such as ImmutableList<Tuple2<Long, Double>> lineItemList.
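To illustrate the last point, the keyed state declares only the columns the query actually needs, roughly like this (a sketch with placeholder names; the real code is in the linked UDFs):
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import com.google.common.collect.ImmutableList;

// Sketch: per-key state holding only (orderkey, revenue-relevant value) pairs
// instead of whole lineitem rows.
private transient ValueState<ImmutableList<Tuple2<Long, Double>>> lineItemList;

@Override
public void open(Configuration parameters) {
    lineItemList = getRuntimeContext().getState(
        new ValueStateDescriptor<>("lineItemList",
            TypeInformation.of(new TypeHint<ImmutableList<Tuple2<Long, Double>>>() {})));
}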
Related
I am writing to an in-memory distributed database in a user-defined batch size in a multithreaded environment, but I want to limit the number of rows written to, e.g., 1000 rows/sec. The reason for this requirement is that my producer is writing too fast and the consumer is running into leaf-memory errors. Is there any standard practice for throttling while batch-processing the records?
dataStream.map(line => readJsonFromString(line)).grouped(memsqlBatchSize).foreach { recordSet =>
  val dbRecords = recordSet.map(m => (m, Events.transform(m)))
  dbRecords.map { record =>
    try {
      Events.setValues(eventInsert, record._2)
      eventInsert.addBatch
    } catch {
      case e: Exception =>
        logger.error(s"error adding batch: ${e.getMessage}")
        val error_event = Events.jm.writeValueAsString(mapAsJavaMap(record._1.asInstanceOf[Map[String, Object]]))
        logger.error(s"event: $error_event")
    }
  }

  // Bulk Commit Records
  try {
    eventInsert.executeBatch
  } catch {
    case e: java.sql.BatchUpdateException =>
      val updates = e.getUpdateCounts
      logger.error(s"failed commit: ${updates.toString}")
      updates.zipWithIndex.filter { case (v, i) => v == Statement.EXECUTE_FAILED }.foreach { case (v, i) =>
        val error = Events.jm.writeValueAsString(mapAsJavaMap(dbRecords(i)._1.asInstanceOf[Map[String, Object]]))
        logger.error(s"insert error: $error")
        logger.error(e.getMessage)
      }
  } finally {
    connection.commit
    eventInsert.clearBatch
    logger.debug(s"committed: ${dbRecords.length.toString}")
  }
}
I was hoping I could pass a user-defined argument, say throttleMax, and if the total number of records written by each thread reaches throttleMax, Thread.sleep() would be called for 1 second. But this is going to make the entire process very slow. Is there any other effective method that can be used to throttle the loading of the data to 1000 rows/sec?
As others have suggested (see the comments on the question), you have better options available to you than throttling here. However, you can throttle an operation in Java with some simple code like the following:
/**
 * Given an Iterator `inner`, returns a new Iterator which will emit items upon
 * request, but throttled to at most one item every `minDelayMs` milliseconds.
 */
public static <T> Iterator<T> throttledIterator(Iterator<T> inner, int minDelayMs) {
    return new Iterator<T>() {
        private long lastEmittedMillis = System.currentTimeMillis() - minDelayMs;

        @Override
        public boolean hasNext() {
            return inner.hasNext();
        }

        @Override
        public T next() {
            long now = System.currentTimeMillis();
            long waitMs = minDelayMs - (now - lastEmittedMillis); // time still to wait before the next emit
            if (waitMs > 0) {
                try {
                    Thread.sleep(waitMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // restore the flag and resume
                }
            }
            lastEmittedMillis = System.currentTimeMillis();
            return inner.next();
        }
    };
}
The above code uses Thread.sleep, so it is not suitable for use in a reactive system. In that case, you would want to use the throttle implementation provided by that system, e.g. throttle in Akka.
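For completeness, a hypothetical usage for the 1000 rows/sec case (the element type and write call below are placeholders):
// Illustrative only: at most one element every 1 ms, i.e. roughly 1000 rows/sec.
Iterator<MyRecord> throttled = throttledIterator(rows.iterator(), 1);
while (throttled.hasNext()) {
    writeToDatabase(throttled.next()); // placeholder for the actual DB write
}
Note that the 1 ms minimum delay only caps the rate; if the downstream write itself takes longer than that, the effective rate will simply be lower.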
I have a requirement to update a hashmap. In my Spark job I have a JavaPairRDD whose wrapper contains 9 different hashmaps. Each hashmap has roughly 40-50 crore (400-500 million) keys. While merging two maps (reduceByKey in Spark) I am getting a Java heap OutOfMemory exception. Below is the code snippet.
private HashMap<String, Long> getMergedMapNew(HashMap<String, Long> oldMap,
        HashMap<String, Long> newMap) {
    for (Entry<String, Long> entry : newMap.entrySet()) {
        try {
            String imei = entry.getKey();
            Long oldTimeStamp = oldMap.get(imei);
            Long newTimeStamp = entry.getValue();
            if (oldTimeStamp != null && newTimeStamp != null) {
                if (oldTimeStamp < newTimeStamp) {
                    oldMap.put(imei, newTimeStamp);
                } else {
                    oldMap.put(imei, oldTimeStamp);
                }
            } else if (oldTimeStamp == null) {
                oldMap.put(imei, newTimeStamp);
            } else if (newTimeStamp == null) {
                oldMap.put(imei, oldTimeStamp);
            }
        } catch (Exception e) {
            logger.error("{}", Utils.getStackTrace(e));
        }
    }
    return oldMap;
}
This method works on a small dataset but fails with a large one. The same method is used for all 9 hashmaps. I looked into increasing heap memory, but I have no idea how to do this in Spark since it runs on a cluster. My cluster is also fairly large (around 300 nodes). Please help me find a solution.
Thanks.
Firstly, I'd focus on 3 parameters: spark.driver.memory=45g, spark.executor.memory=6g, and spark.driver.maxResultSize=8g. Don't take these values for granted; this is simply a configuration that works on my setup without OOM errors. Check how much memory you have available in the UI. You want to give the executors as much memory as you can. By the way, spark.driver.memory gives the driver more heap space.
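If it helps, these can go into spark-defaults.conf (example values from my setup; adjust them to your cluster):
# spark-defaults.conf (example values; tune for your cluster)
spark.driver.memory        45g
spark.executor.memory      6g
spark.driver.maxResultSize 8g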
As far as I can see, this code is executed on the Spark driver. I would recommend converting those two hashmaps to DataFrames with two columns, imei and timestamp, then joining both with an outer join on imei and selecting the appropriate timestamp using when.
This code will be executed on the workers and parallelized, so you won't run into the memory problems. If you really plan on doing this on the driver, then follow the instructions given by Jarek and increase spark.driver.memory.
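A rough, untested sketch of that approach (column and method names are illustrative; it assumes the two maps can be loaded as DataFrames with columns imei and ts):
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Sketch: merge two (imei, ts) DataFrames, keeping the newer timestamp per imei.
public static Dataset<Row> mergeTimestamps(Dataset<Row> oldDf, Dataset<Row> newDf) {
    Dataset<Row> o = oldDf.withColumnRenamed("ts", "old_ts");
    Dataset<Row> n = newDf.withColumnRenamed("ts", "new_ts");
    Column joinCond = o.col("imei").equalTo(n.col("imei"));
    return o.join(n, joinCond, "full_outer")
            .select(
                coalesce(o.col("imei"), n.col("imei")).as("imei"),
                when(col("old_ts").isNull(), col("new_ts"))
                    .when(col("new_ts").isNull(), col("old_ts"))
                    .when(col("old_ts").lt(col("new_ts")), col("new_ts"))
                    .otherwise(col("old_ts"))
                    .as("ts"));
}
Because the comparison then runs as a distributed join on the executors, the per-record logic from getMergedMapNew is expressed entirely in the when/otherwise columns.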
I'm trying to create a generic method in Java for querying HBase.
I currently have one written which takes 3 arguments:
A Range (to scan the table)
A Column (to be returned) ... and
A Condition (i.e. browser==Chrome)
So a statement (if written in a SQLish language) may look like
SELECT OS FROM TABLE WHERE BROWSER==CHROME IN RANGE (5 WEEKS AGO -> 2 WEEKS AGO)
Now, I know I'm not using HBase properly (using common column queries for rowkey etc.) but for the sake of experimentation I'd like to try it, to help me learn.
So the first thing I do is set a range on the Scan (5 weeks ago to 2 weeks ago); since the rowkey is the timestamp, this is very efficient.
Then I set a SingleColumnValueFilter (browser = Chrome); after the range filter, this is pretty fast.
Then I store all the rowkeys (from the scan) into an array.
For each rowkey (in the array) I perform a GET operation to get the corresponding OS.
I have tried using MultiGet, which sped up the process a lot.
I then tried using normal GET requests, each spawning a new thread, all running concurrently, which halved the query time! But still not fast enough.
I have considered limiting the number of threads using a single connection to the database. i.e - 100 threads per connection.
Given my circumstances, what is the most efficient way to perform these GETs, or am I totally approaching it incorrectly?
Any help is hugely appreciated.
EDIT (Here is my threaded GET attempt)
List<String> newresults = Collections.synchronizedList(new ArrayList<String>());

for (String rowkey : result) {
    spawnGetThread(rowkey, colname);
}

public void spawnGetThread(String rk, String cn) {
    new Thread(new Runnable() {
        public void run() {
            String rt = "";
            Get get = new Get(Bytes.toBytes(rk));
            get.addColumn(COL_FAM, Bytes.toBytes(cn));
            try {
                Result getResult = tb.get(get);
                rt = (Bytes.toString(getResult.value()));
            } catch (IOException e) {
            }
            newresults.add(rt);
        }
    }).start();
}
Given my circumstances, what is the most efficient way to perform these GETs, or am I totally approaching it incorrectly?
I would suggest the following approach.
Get is good if you know upfront which rowkeys you need to access.
In that case you can use a method like the one below; it will return an array of Results.
/**
 * Method getDetailRecords.
 *
 * @param listOfRowKeys List<String>
 * @return Result[]
 * @throws IOException
 */
private Result[] getDetailRecords(final List<String> listOfRowKeys) throws IOException {
    final HTableInterface table = HBaseConnection.getHTable(TBL_DETAIL);
    final List<Get> listOFGets = new ArrayList<Get>();
    Result[] results = null;
    try {
        for (final String rowkey : listOfRowKeys) { // prepare batch of gets with row keys
            // System.err.println("get 'yourtablename', '" + saltIndexPrefix + rowkey + "'");
            final Get get = new Get(Bytes.toBytes(saltedRowKey(rowkey)));
            get.addColumn(COLUMN_FAMILY, Bytes.toBytes(yourcolumnname));
            listOFGets.add(get);
        }
        results = table.get(listOFGets);
    } finally {
        table.close();
    }
    return results;
}
Additional note: row filters are always faster than column value filters (which do a full table scan).
I would suggest going through HBase: The Definitive Guide --> Client API: Advanced Features.
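For reference, the scan-with-filter part of the question could also be collapsed into a single pass that returns the OS column directly, so the per-row GETs are not needed at all. A rough sketch, reusing tb and COL_FAM from the question's code (the column names and timestamp rowkey encoding are placeholders):
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch only: bound the scan by the timestamp rowkey first (cheap), then filter on the
// browser column within that range, and read the OS column in the same scan.
long fiveWeeksAgo = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(35);
long twoWeeksAgo = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(14);

Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes(fiveWeeksAgo));
scan.setStopRow(Bytes.toBytes(twoWeeksAgo));
scan.addColumn(COL_FAM, Bytes.toBytes("browser"));
scan.addColumn(COL_FAM, Bytes.toBytes("os"));

SingleColumnValueFilter browserFilter = new SingleColumnValueFilter(
        COL_FAM, Bytes.toBytes("browser"), CompareOp.EQUAL, Bytes.toBytes("Chrome"));
browserFilter.setFilterIfMissing(true); // skip rows without a browser column
scan.setFilter(browserFilter);

ResultScanner scanner = tb.getScanner(scan);
try {
    for (Result r : scanner) {
        String os = Bytes.toString(r.getValue(COL_FAM, Bytes.toBytes("os")));
        // collect os ...
    }
} finally {
    scanner.close();
}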
I'm using Spark to calculate the PageRank of user reviews, but I keep getting a java.lang.StackOverflowError when I run my code on a big dataset (40k entries). When running the code on a small number of entries it works fine, though.
Entry Example :
product/productId: B00004CK40 review/userId: A39IIHQF18YGZA review/profileName: C. A. M. Salas review/helpfulness: 0/0 review/score: 4.0 review/time: 1175817600 review/summary: Reliable comedy review/text: Nice script, well acted comedy, and a young Nicolette Sheridan. Cusak is in top form.
The Code:
public void calculatePageRank() {
    sc.clearCallSite();
    sc.clearJobGroup();

    JavaRDD<String> rddFileData = sc.textFile(inputFileName).cache();
    sc.setCheckpointDir("pagerankCheckpoint/");

    JavaRDD<String> rddMovieData = rddFileData.map(new Function<String, String>() {
        @Override
        public String call(String arg0) throws Exception {
            String[] data = arg0.split("\t");
            String movieId = data[0].split(":")[1].trim();
            String userId = data[1].split(":")[1].trim();
            return movieId + "\t" + userId;
        }
    });

    JavaPairRDD<String, Iterable<String>> rddPairReviewData = rddMovieData.mapToPair(new PairFunction<String, String, String>() {
        @Override
        public Tuple2<String, String> call(String arg0) throws Exception {
            String[] data = arg0.split("\t");
            return new Tuple2<String, String>(data[0], data[1]);
        }
    }).groupByKey().cache();

    JavaRDD<Iterable<String>> cartUsers = rddPairReviewData.map(f -> f._2());
    List<Iterable<String>> cartUsersList = cartUsers.collect();
    JavaPairRDD<String, String> finalCartesian = null;
    int iterCounter = 0;
    for (Iterable<String> out : cartUsersList) {
        JavaRDD<String> currentUsersRDD = sc.parallelize(Lists.newArrayList(out));
        if (finalCartesian == null) {
            finalCartesian = currentUsersRDD.cartesian(currentUsersRDD);
        } else {
            finalCartesian = currentUsersRDD.cartesian(currentUsersRDD).union(finalCartesian);
            if (iterCounter % 20 == 0) {
                finalCartesian.checkpoint();
            }
        }
    }

    JavaRDD<Tuple2<String, String>> finalCartesianToTuple = finalCartesian.map(m -> new Tuple2<String, String>(m._1(), m._2()));
    finalCartesianToTuple = finalCartesianToTuple.filter(x -> x._1().compareTo(x._2()) != 0);
    JavaPairRDD<String, String> userIdPairs = finalCartesianToTuple.mapToPair(m -> new Tuple2<String, String>(m._1(), m._2()));
    JavaRDD<String> userIdPairsString = userIdPairs.map(new Function<Tuple2<String, String>, String>() {
        // Tuple2<Tuple2<MovieId, userId>, Tuple2<movieId, userId>>
        @Override
        public String call(Tuple2<String, String> t) throws Exception {
            return t._1 + " " + t._2;
        }
    });

    try {
        // calculate pagerank using this https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaPageRank.java
        JavaPageRank.calculatePageRank(userIdPairsString, 100);
    } catch (Exception e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    sc.close();
}
I have multiple suggestions that will help you greatly improve the performance of the code in your question.
Caching: caching should be used on data sets that you need to refer to again and again for the same or different operations (iterative algorithms).
An example is RDD.count — to tell you the number of lines in the file, the file needs to be read. So if you write RDD.count, at this point the file will be read, the lines will be counted, and the count will be returned.
What if you call RDD.count again? The same thing: the file will be read and counted again. So what does RDD.cache do? Now, if you run RDD.count the first time, the file will be loaded, cached, and counted. If you call RDD.count a second time, the operation will use the cache. It will just take the data from the cache and count the lines, no recomputing.
Read more about caching here.
In your code sample you are not reusing anything that you've cached. So you may remove the .cache from there.
Parallelization: in the code sample you've parallelized every individual element of your RDD, which is already a distributed collection. I suggest merging the rddFileData, rddMovieData and rddPairReviewData steps so that it all happens in one go.
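For example, a rough sketch keeping the same parsing logic as in the question:
// Sketch: parse each line once and go straight to (movieId, userId) pairs,
// instead of chaining a map and a mapToPair over the same data.
JavaPairRDD<String, Iterable<String>> rddPairReviewData = sc.textFile(inputFileName)
    .mapToPair(line -> {
        String[] data = line.split("\t");
        return new Tuple2<>(data[0].split(":")[1].trim(),   // movieId
                            data[1].split(":")[1].trim());  // userId
    })
    .groupByKey();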
Get rid of .collect, since that brings the results back to the driver and may be the actual reason for your error.
This problem occurs when your DAG grows big and there are too many levels of transformations in your code. The JVM is not able to hold all the operations needed for lazy execution when an action is finally performed.
Checkpointing is one option. I would suggest using Spark SQL for this kind of aggregation: if your data is structured, try loading it into DataFrames and perform the grouping and other SQL functions there.
When your for loop grows really large, Spark can no longer keep track of the lineage. Enable checkpointing in your for loop to checkpoint your RDD every 10 iterations or so (see the sketch below); checkpointing will fix the problem. Don't forget to clean up the checkpoint directory afterwards.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
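Sketched against the loop from the question (note that iterCounter is never incremented in the original snippet, so the checkpoint branch never runs):
// Sketch: checkpoint and materialize the accumulated RDD every few iterations so the
// lineage is truncated instead of growing without bound.
sc.setCheckpointDir("pagerankCheckpoint/");
JavaPairRDD<String, String> finalCartesian = null;
int iterCounter = 0;
for (Iterable<String> out : cartUsersList) {
    JavaRDD<String> currentUsersRDD = sc.parallelize(Lists.newArrayList(out));
    finalCartesian = (finalCartesian == null)
        ? currentUsersRDD.cartesian(currentUsersRDD)
        : currentUsersRDD.cartesian(currentUsersRDD).union(finalCartesian);
    iterCounter++;
    if (iterCounter % 10 == 0) {
        finalCartesian.checkpoint();
        finalCartesian.count(); // an action forces the checkpoint to be written, cutting the lineage
    }
}
If the recomputation triggered by count() becomes expensive, caching finalCartesian just before the checkpoint avoids computing it twice (once for the checkpoint file and once for later use).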
The things below fixed the StackOverflowError for me. As others have pointed out, it is caused by the lineage that Spark keeps building, especially when you have a loop/iteration in the code.
Set checkpoint directory
spark.sparkContext.setCheckpointDir("./checkpoint")
Checkpoint the DataFrame/RDD you are modifying/operating on in each iteration
modifyingDf.checkpoint()
Cache DataFrames which are reused in each iteration
reusedDf.cache()
I expected the new mapWithState API for Spark 1.6+ to near-immediately remove objects that are timed-out, but there is a delay.
I'm testing the API with the adapted version of the JavaStatefulNetworkWordCount below:
SparkConf sparkConf = new SparkConf()
    .setAppName("JavaStatefulNetworkWordCount")
    .setMaster("local[*]");
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));
ssc.checkpoint("./tmp");

StateSpec<String, Integer, Integer, Tuple2<String, Integer>> mappingFunc =
    StateSpec.function((word, one, state) -> {
        if (state.isTimingOut()) {
            System.out.println("Timing out the word: " + word);
            return new Tuple2<String, Integer>(word, state.get());
        } else {
            int sum = one.or(0) + (state.exists() ? state.get() : 0);
            Tuple2<String, Integer> output = new Tuple2<String, Integer>(word, sum);
            state.update(sum);
            return output;
        }
    });

JavaMapWithStateDStream<String, Integer, Integer, Tuple2<String, Integer>> stateDstream =
    ssc.socketTextStream(args[0], Integer.parseInt(args[1]),
            StorageLevels.MEMORY_AND_DISK_SER_2)
        .flatMap(x -> Arrays.asList(SPACE.split(x)))
        .mapToPair(w -> new Tuple2<String, Integer>(w, 1))
        .mapWithState(mappingFunc.timeout(Durations.seconds(5)));

stateDstream.stateSnapshots().print();
Together with nc (nc -l -p <port>)
When I type a word into the nc window I see the tuple being printed in the console every second, but the timing-out message does not get printed out 5 s later, as I would expect based on the timeout set. The time it takes for the tuple to expire seems to vary between 5 and 20 s.
Am I missing some configuration option, or is the timeout perhaps only performed at the same time as checkpoints?
Once an event times out it's NOT deleted right away, but is only marked for deletion by saving it to a 'deltaMap':
override def remove(key: K): Unit = {
  val stateInfo = deltaMap(key)
  if (stateInfo != null) {
    stateInfo.markDeleted()
  } else {
    val newInfo = new StateInfo[S](deleted = true)
    deltaMap.update(key, newInfo)
  }
}
Then, timed-out events are collected and sent to the output stream only at checkpoint. That is, events which time out at batch t will appear in the output stream only at the next checkpoint - by default, after 5 batch intervals on average, i.e. at batch t+5:
override def checkpoint(): Unit = {
  super.checkpoint()
  doFullScan = true
}
...
removeTimedoutData = doFullScan // remove timedout data only when full scan is enabled
...
// Get the timed out state records, call the mapping function on each and collect the
// data returned
if (removeTimedoutData && timeoutThresholdTime.isDefined) {
  ...
Elements are actually removed only when there are enough of them and the state map is being serialized, which currently also happens only at checkpoint:
/** Whether the delta chain length is long enough that it should be compacted */
def shouldCompact: Boolean = {
  deltaChainLength >= deltaChainThreshold
}
// Write the data in the parent state map while copying the data into a new parent map for
// compaction (if needed)
val doCompaction = shouldCompact
...
By default checkpointing occurs every 10 batch intervals, thus in the example above every 10 seconds; since your timeout is 5 seconds, timed-out events are expected to be emitted within 5-15 seconds.
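One thing worth experimenting with (I have not verified that it changes the internal state checkpointing, so treat it as a hypothesis) is setting an explicit, shorter checkpoint interval on the state stream from your example, since the full scan that emits timed-out entries is tied to checkpoints:
// Hypothetical tweak: checkpoint the state stream more often than the default
// (10 x batch interval) so timed-out entries may be emitted sooner, at the cost of more I/O.
stateDstream.checkpoint(Durations.seconds(2));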
EDIT: Corrected and elaborated answer following comments by @YuvalItzchakov
Am I missing some configuration option, or is the timeout perhaps only performed at the same time as snapshots?
Every time a mapWithState is invoked (with your configuration, around every 1 second), the MapWithStateRDD will internally check for expired records and time them out. You can see it in the code:
// Get the timed out state records, call the mapping function on each and collect the
// data returned
if (removeTimedoutData && timeoutThresholdTime.isDefined) {
  newStateMap.getByTime(timeoutThresholdTime.get).foreach { case (key, state, _) =>
    wrappedState.wrapTimingOutState(state)
    val returned = mappingFunction(batchTime, key, None, wrappedState)
    mappedData ++= returned
    newStateMap.remove(key)
  }
}
(Other than the time taken to execute each job, it turns out that newStateMap.remove(key) actually only marks entries for deletion. See "Edit" for more.)
You have to take into account the time it takes for each stage to be scheduled, and the amount of time it takes for each execution of such a stage to actually take its turn and run. It isn't exact, because this runs as a distributed system where other factors can come into play, making your timeout more or less accurate than you expect it to be.
Edit
As @etov rightly points out, newStateMap.remove(key) doesn't actually remove the element from the OpenHashMapBasedStateMap[K, S], but simply marks it for deletion. This is also a reason why you're seeing the expiration time adding up.
The actual relevant piece of code is here:
// Write the data in the parent state map while
// copying the data into a new parent map for compaction (if needed)
val doCompaction = shouldCompact
val newParentSessionStore = if (doCompaction) {
  val initCapacity = if (approxSize > 0) approxSize else 64
  new OpenHashMapBasedStateMap[K, S](initialCapacity = initCapacity, deltaChainThreshold)
} else { null }

val iterOfActiveSessions = parentStateMap.getAll()

var parentSessionCount = 0

// First write the approximate size of the data to be written, so that readObject can
// allocate appropriately sized OpenHashMap.
outputStream.writeInt(approxSize)

while (iterOfActiveSessions.hasNext) {
  parentSessionCount += 1

  val (key, state, updateTime) = iterOfActiveSessions.next()
  outputStream.writeObject(key)
  outputStream.writeObject(state)
  outputStream.writeLong(updateTime)

  if (doCompaction) {
    newParentSessionStore.deltaMap.update(
      key, StateInfo(state, updateTime, deleted = false))
  }
}

// Write the final limit marking object with the correct count of records written.
val limiterObj = new LimitMarker(parentSessionCount)
outputStream.writeObject(limiterObj)

if (doCompaction) {
  parentStateMap = newParentSessionStore
}
If deltaMap should be compacted (tracked by the doCompaction variable), then (and only then) is the map cleared of all the deleted instances. How often does that happen? Once the delta chain exceeds the threshold:
val DELTA_CHAIN_LENGTH_THRESHOLD = 20
That is, the delta chain is longer than 20 items, and there are items that have been marked for deletion.