Kafka Streams heap usage - Java

I am using Kafka Streams 2.1.
I am trying to aggregate a stream of messages by their ID. We have approximately 20 messages with the same ID produced almost at the same time (a couple hundred ms maximum between two messages). So I am using a session window with an inactivity gap of 500 ms and a grace time of 5 seconds.
The incoming records have the ID as the key, and the value is made of a few string fields plus a map that can contain from zero to a few hundred entries (the key is a string, the value is an object with one string field).
Here is the code:
private final Duration INACTIVITY_GAP = Duration.ofMillis(500);
private final Duration GRACE_TIME = Duration.ofMillis(5000);

KStream<String, MyCustomMessage> source =
        streamsBuilder.stream("inputTopic", Consumed.with(Serdes.String(), myCustomSerde));

source
    .groupByKey(Grouped.with(Serdes.String(), myCustomSerde))
    .windowedBy(SessionWindows.with(INACTIVITY_GAP).grace(GRACE_TIME))
    .aggregate(
        // initializer
        () -> new CustomAggMessage(),
        // aggregates records in the same session
        (id, message, aggMessage) -> {
            // ...
            return aggMessage;
        },
        // merges sessions
        (id, aggMessage1, aggMessage2) -> {
            // ... merge
            return aggMessage2;
        },
        Materialized.with(Serdes.String(), myCustomAggSerde)
    )
    .suppress(Suppressed.untilWindowCloses(unbounded()))
    .toStream()
    .selectKey((windowed, aggMessage) -> windowed.key())
    .to("outputTopic");
I also tried another Suppressed config: .suppress(Suppressed.untilTimeLimit(Duration.ofSeconds(30), maxBytes(1_000_000_000L).emitEarlyWhenFull())). It did not help.
I also have a custom RocksDB config:
public class CustomRocksDbConfig implements RocksDBConfigSetter {
    @Override
    public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
        BlockBasedTableConfig tableConfig = (BlockBasedTableConfig) options.tableFormatConfig();
        tableConfig.setCacheIndexAndFilterBlocks(true);
        options.setTableFormatConfig(tableConfig);
    }
}
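A config setter like this only takes effect once it is registered in the Streams configuration. A minimal sketch of the wiring, assuming props is the Properties object used to build the KafkaStreams instance (the application ID and bootstrap servers are placeholders):

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-aggregation-app"); // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
// Register the custom RocksDB config setter shown above.
props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, CustomRocksDbConfig.class);

KafkaStreams streams = new KafkaStreams(streamsBuilder.build(), props);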
The input topic has 32 partitions (about 10k msg/second) and we run 8 instances with 4 stream threads each.
When running this, the heap usage is very high (max heap is set to 4 GB, the machine has 8 GB), which makes the app crash and restart, so the lag increases.
Does someone know why? What could I change to make this work? Are the session window and its parameters the right way to achieve this?

Related

How can I find Ignite nodes that aren't actively computing a Job?

I'm using Ignite 2.13 in a Windows environment.
I have exclusive use of a 110-node compute grid. I want to complete a ComputeTask that maps to 100 ComputeJobs. The jobs take a variable amount of time to complete. As the compute progresses, many of the nodes become idle while waiting for the remaining jobs to complete. Midway through the compute I'd like to find all the idle nodes and then kill some subset of them, leaving a few in case one of the remaining jobs randomly fails and needs to fail over.
In the ComputeTask, the "result" method is called as jobs complete. From the "result" method I'm trying to find all the grid nodes with an active job. Any node that does not have an active job is assumed idle and potentially eligible to be killed off.
I have two methods for finding the nodes with an active job, getActiveNodesOld and getActiveNodesLocal. They return drastically different answers.
// This version uses this node's view of the whole cluster.
// This status lags behind reality by 2 seconds.
public static Collection<UUID> getActiveNodesOld(Ignite grid)
{
    Set<UUID> retval = new LinkedHashSet<>();
    Collection<ClusterNode> nodes = grid.cluster().nodes();
    for (ClusterNode node : nodes)
    {
        int activeJobs = node.metrics().getCurrentActiveJobs();
        if (activeJobs > 0)
        {
            retval.add(node.id());
        }
    }
    return retval;
}
// This version asks each node about its own status. The results are supposed
// to be more accurate.
public static Collection<UUID> getActiveNodesLocal(Ignite grid)
{
    Set<UUID> retval = new LinkedHashSet<>();
    IgniteCompute compute = grid.compute();
    Collection<IgniteBiTuple<UUID, Integer>> metricsMap =
            compute.broadcast(new ActiveJobCallable());
    for (Map.Entry<UUID, Integer> entry : metricsMap)
    {
        UUID id = entry.getKey();
        int activeJobs = entry.getValue();
        if (activeJobs > 0)
        {
            retval.add(id);
        }
        if (activeJobs > 1)
        {
            logger.info("Warning: Node:" + id + " has " + activeJobs + " active jobs. ");
        }
    }
    return retval;
}
private static class ActiveJobCallable implements IgniteCallable<IgniteBiTuple<UUID, Integer>>
{
    @IgniteInstanceResource
    private transient Ignite ignite;

    public ActiveJobCallable()
    {
    }

    @Override
    public IgniteBiTuple<UUID, Integer> call()
    {
        ClusterNode clusterNode = ignite.cluster().localNode();
        ClusterMetrics nodeMetrics = clusterNode.metrics();
        return new IgniteBiTuple<>(clusterNode.id(), nodeMetrics.getCurrentActiveJobs());
    }
}
These two methods return very different results with regard to which nodes are active. The first time through, getActiveNodesLocal reports that there are 0 active nodes. I can see how this is possible - perhaps a job is not considered complete until "result" returns - so I'll give it the benefit of the doubt. The other method, getActiveNodesOld, indicates there is a large number of idle nodes, even though it is being called for the first completed compute job, so the remainder of the grid should still be working on compute jobs.
As the compute progresses, neither method produces the answers I'd expect. The nodes seem to start reporting 1 active job and then never report 0 active jobs, even after the job is complete. Do I need to be resetting the ClusterMetrics somehow?
The method getActiveNodesLocal broadcasts an IgniteCallable into the grid. Does that callable get counted in the results of getCurrentActiveJobs()? It doesn't seem to, but I thought that might explain the weird results I'm seeing. I never see the logger message I added that should be triggered if "activeJobs > 1".
Is there some other way I should accomplish my goal of finding nodes that are idling? Can I find, for example, which node each job is currently assigned to and then determine which nodes don't have a job assignment? I don't see a method for the ComputeTask to determine the node<->job mapping, but maybe I'm just overlooking it.
I suppose I could have the nodes send a grid message when they start and also when they complete a job, and I could track which jobs are active, but that feels like the wrong solution.
Skip ClusterMetrics entirely. Create an IgniteCache and put the node UUID in the cache when the Job is started and remove the UUID when the job is completed.
Finding the nodes that are still in the cluster and have a job can be done like this:
public static Collection<UUID> getActiveNodesCache(Ignite grid)
{
    Set<UUID> retval = new TreeSet<>();
    IgniteCache<UUID, ComputeJob> workingOnMap = grid.cache(JOB_CACHE);
    workingOnMap.query(new ScanQuery<>(null)).forEach(entry -> retval.add((UUID) entry.getKey()));

    // We count on the nodes to add and remove themselves from the cache when they
    // start/complete jobs. Nodes that crashed aren't removed from this cache - they
    // just die. So retval could contain nodes that are gone at this point.
    Set<UUID> inGrid = grid.cluster().nodes().stream().map(ClusterNode::id).collect(Collectors.toSet());
    Set<UUID> notInGrid = new LinkedHashSet<>(retval);
    notInGrid.removeAll(inGrid);
    retval.retainAll(inGrid); // Remove any nodes that are no longer in the cluster.
    logger.log(Level.INFO,
            "Found {0} nodes in grid. {1} of those are active. {2} nodes were active but left grid.",
            new Object[]{inGrid.size(), retval.size(), notInGrid.size()});
    return retval;
}
The ComputeJobs can create the cache like this:
IgniteCache<UUID, ComputeJob> workingOnMap = grid.getOrCreateCache(getInProgressConfig());
Because you want to kill the nodes, be sure the cache is REPLICATED across the grid so that dead nodes don't take portions of the cache with them.
public static CacheConfiguration<UUID, ComputeJob> getInProgressConfig()
{
    CacheConfiguration<UUID, ComputeJob> config = new CacheConfiguration<>();
    config.setName(JOB_CACHE);
    config.setCacheMode(CacheMode.REPLICATED);
    config.setAtomicityMode(CacheAtomicityMode.ATOMIC);
    config.setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC);
    return config;
}
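On the job side, a minimal sketch of the add/remove bookkeeping, assuming each job can obtain the local Ignite instance via @IgniteInstanceResource (the class name and job body are placeholders, not part of the original answer):

// Hypothetical sketch: a ComputeJob that registers its node in JOB_CACHE while it runs.
private static class TrackedJob implements ComputeJob
{
    @IgniteInstanceResource
    private transient Ignite ignite;

    @Override
    public Object execute()
    {
        UUID nodeId = ignite.cluster().localNode().id();
        IgniteCache<UUID, ComputeJob> workingOnMap = ignite.getOrCreateCache(getInProgressConfig());
        workingOnMap.put(nodeId, this); // mark this node as busy
        try
        {
            // ... do the actual work ...
            return null;
        }
        finally
        {
            workingOnMap.remove(nodeId); // mark this node as idle again
        }
    }

    @Override
    public void cancel()
    {
    }
}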

Flink: How to only process specific key in window function in sliding window

I have a Flink job that processes Metric(name, type, timestamp, value) objects. Metrics are keyed by (name, type, timestamp). I am trying to process metrics with a specific timestamp starting at timestamp + 50 seconds. Each timestamp comes at an interval of 10 seconds. I am currently trying window(SlidingEventTimeWindows.of(Time.seconds(50), Time.seconds(10))) with a ProcessWindowFunction:
@Override
public void process(Tuple3<String, Integer, Long> key, Context context, Iterable<Metric> input, Collector<Metric> collector) {
    long windowStartTime = context.window().getStart();
    long timestamp = key.f2;
    // Only emit when the key's timestamp falls in the first 10-second slice of the window.
    if (windowStartTime <= timestamp && timestamp < windowStartTime + 10_000L) {
        collector.collect(input.iterator().next()); // to some reducer
    }
}
However, I only get the first wave of output and stop receiving anything after that. I also tried adding an isProcessed field in Metric, marking it in the reducer function, and applying an Evictor, but that doesn't seem to work.
The source and sink are a Kafka consumer and producer. I also have watermarks set up:
.assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor<Metric>(Time.seconds(50)) {
        @Override
        public long extractTimestamp(Metric metrics) {
            return metrics.getTimestamp() * 1000; // convert to milliseconds
        }
    })
The reason why you are not getting more events in each window is that you have included the timestamp in the key. This has the effect of forcing each window to only include events that all have the same timestamp.
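A minimal sketch of the fix, assuming Metric exposes getName() and getType() (the selector and window function names are illustrative): key only by (name, type) and let the window do the grouping by time.

// Hypothetical sketch: drop the timestamp from the key so a window can
// contain events with different timestamps.
metricStream
    .keyBy(new KeySelector<Metric, Tuple2<String, Integer>>() {
        @Override
        public Tuple2<String, Integer> getKey(Metric metric) {
            return Tuple2.of(metric.getName(), metric.getType());
        }
    })
    .window(SlidingEventTimeWindows.of(Time.seconds(50), Time.seconds(10)))
    .process(new MyProcessWindowFunction()); // the timestamp check inside process() is then unnecessary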

Size is not unloading from Caffeine caches

Right now I'm just wondering why; I could probably use some help too.
When I add values to my set they are added, and after 5 minutes they are removed - that works.
But if I call users.size() it still says 1, even though there are no values inside.
public class Test implements Runnable {
    private ExpiringSet<User> users; // User is just my object class

    public Test() {
        users = new ExpiringSet<User>(5, TimeUnit.MINUTES); // sets up the users
        users.add(new User("John"));
    }

    @Override
    public void run() {
        // This will always say 1, even after I wait 1 hour. I've seen it increase
        // after a while, then it said 2 - and there are no values in users.
        System.out.println("Size: " + users.size());

        // But this is empty after 5 min
        for (Iterator<User> iter = users.iterator(); iter.hasNext(); ) {
            User user = iter.next();
            System.out.println("User " + user.getName());
        }
    }
}
public class ExpiringSet<E> extends ForwardingSet<E> {
    private final Set<E> setView;

    public ExpiringSet(long duration, TimeUnit unit) {
        Cache<E, Boolean> cache = CaffeineFactory.newBuilder().expireAfterAccess(duration, unit).build();
        this.setView = Collections.newSetFromMap(cache.asMap());
    }

    @Override
    protected Set<E> delegate() {
        return this.setView;
    }
}
Basically I just want users.size() to be 0 if there are no values inside users.
It just seems weird when it says 1 and there are no values.
By default, expired entries are removed lazily during routine maintenance, which is triggered by a few reads or a write. That results in the size of the underlying map being exposed, whereas the entries are hidden if queried for and are evicted at some later point. Cache.cleanUp() can be called to run the maintenance work explicitly.
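A minimal sketch of using that in the ExpiringSet above, assuming the class keeps a reference to the underlying cache (the extra cache field is an addition, not part of the original class):

// Hypothetical variant: run the cache maintenance before reporting the size.
public class ExpiringSet<E> extends ForwardingSet<E> {
    private final Cache<E, Boolean> cache; // kept so cleanUp() can be triggered
    private final Set<E> setView;

    public ExpiringSet(long duration, TimeUnit unit) {
        this.cache = CaffeineFactory.newBuilder().expireAfterAccess(duration, unit).build();
        this.setView = Collections.newSetFromMap(cache.asMap());
    }

    @Override
    protected Set<E> delegate() {
        return this.setView;
    }

    @Override
    public int size() {
        cache.cleanUp(); // evict expired entries now so size() reflects reality
        return this.setView.size();
    }
}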
A background thread is required if you want expired entries to be removed promptly regardless of cache activity. This can be enabled by configuring a Scheduler on the builder. Java 9+ includes a built-in JVM-wide scheduling thread, rather than spinning up a new one, so the cache can be configured as:
LoadingCache<Key, Graph> graphs = Caffeine.newBuilder()
    .scheduler(Scheduler.systemScheduler())
    .expireAfterWrite(10, TimeUnit.MINUTES)
    .build(key -> createExpensiveGraph(key));

Flink Task Manager timeout

My program gets very slow as more and more records are processed. I initially thought it was due to excessive memory consumption, as my program is String intensive (I am using Java 11, so compact strings should be used whenever possible), so I increased the JVM heap:
-Xms2048m
-Xmx6144m
I also increased the task manager's memory as well as the timeout, in flink-conf.yaml:
jobmanager.heap.size: 6144m
heartbeat.timeout: 5000000
However, none of this helped. The program still gets very slow at about the same point, after processing roughly 3.5 million records, with only about 0.5 million more to go. As the program approaches the 3.5 million mark it gets very, very slow until it eventually times out; total execution time is about 11 minutes.
I checked the memory consumption in VisualVM, but it never goes above about 700 MB. My Flink pipeline looks as follows:
final StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironment(1);
environment.setParallelism(1);
DataStream<Tuple> stream = environment.addSource(new TPCHQuery3Source(filePaths, relations));
stream.process(new TPCHQuery3Process(relations)).addSink(new FDSSink());
environment.execute("FlinkDataService");
The bulk of the work is done in the process function, where I am implementing database join algorithms; the columns are stored as Strings. Specifically, I am implementing query 3 of the TPC-H benchmark (see https://examples.citusdata.com/tpch_queries.html if you wish).
The timeout error is this:
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id <id> timed out.
Once I got this error as well:
Exception in thread "pool-1-thread-1" java.lang.OutOfMemoryError: Java heap space
Also, my VisualVM monitoring screenshot was captured at the point where things get very slow.
Here is the run loop of my source function:
while (run) {
    readers.forEach(reader -> {
        try {
            String line = reader.readLine();
            if (line != null) {
                Tuple tuple = lineToTuple(line, counter.get() % filePaths.size());
                if (tuple != null && isValidTuple(tuple)) {
                    sourceContext.collect(tuple);
                }
            } else {
                closedReaders.add(reader);
                if (closedReaders.size() == filePaths.size()) {
                    System.out.println("ALL FILES HAVE BEEN STREAMED");
                    cancel();
                }
            }
            counter.getAndIncrement();
        } catch (IOException e) {
            e.printStackTrace();
        }
    });
}
I basically read a line from each of the 3 files I need; based on the order of the files, I construct a tuple object (my custom class called Tuple, representing a row in a table) and emit that tuple if it is valid, i.e. it fulfills certain conditions on the date.
I am also suggesting that the JVM do garbage collection at the 1 millionth, 1.5 millionth, 2 millionth and 2.5 millionth record, like this:
System.gc()
Any thoughts on how I can optimize this?
String.intern() saved me. I interned every string before storing it in my maps, and that worked like a charm.
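A minimal sketch of what that can look like, assuming a line parser similar to the lineToTuple() used in the source above (the Tuple.addField() helper is hypothetical):

// Hypothetical sketch: intern parsed column values before storing them, so
// identical strings share one instance instead of millions of duplicates.
private Tuple lineToTuple(String line, int tableIndex) {
    String[] columns = line.split("\\|"); // TPC-H files are pipe-delimited
    Tuple tuple = new Tuple(tableIndex);
    for (String column : columns) {
        tuple.addField(column.intern()); // deduplicate via the string pool
    }
    return tuple;
}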
These are the properties that I changed on my Flink stand-alone cluster to compute TPC-H query 03:
jobmanager.memory.process.size: 1600m
heartbeat.timeout: 100000
taskmanager.memory.process.size: 8g # default: 1728m
I implemented this query to stream only the Orders table and I kept the other tables as state. Also, I am computing it as a windowless query, which I think makes more sense and is faster.
public class TPCHQuery03 {
    private final String topic = "topic-tpch-query-03";

    public TPCHQuery03() {
        this(PARAMETER_OUTPUT_LOG, "127.0.0.1", false, false, -1);
    }

    public TPCHQuery03(String output, String ipAddressSink, boolean disableOperatorChaining, boolean pinningPolicy, long maxCount) {
        try {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
            if (disableOperatorChaining) {
                env.disableOperatorChaining();
            }

            DataStream<Order> orders = env
                .addSource(new OrdersSource(maxCount)).name(OrdersSource.class.getSimpleName()).uid(OrdersSource.class.getSimpleName());

            // Filter market segment "AUTOMOBILE"
            // customers = customers.filter(new CustomerFilter());

            // Filter all Orders with o_orderdate < 12.03.1995
            DataStream<Order> ordersFiltered = orders
                .filter(new OrderDateFilter("1995-03-12")).name(OrderDateFilter.class.getSimpleName()).uid(OrderDateFilter.class.getSimpleName());

            // Join customers with orders and package them into a ShippingPriorityItem
            DataStream<ShippingPriorityItem> customerWithOrders = ordersFiltered
                .keyBy(new OrderKeySelector())
                .process(new OrderKeyedByCustomerProcessFunction(pinningPolicy)).name(OrderKeyedByCustomerProcessFunction.class.getSimpleName()).uid(OrderKeyedByCustomerProcessFunction.class.getSimpleName());

            // Join the last join result with LineItems
            DataStream<ShippingPriorityItem> result = customerWithOrders
                .keyBy(new ShippingPriorityOrderKeySelector())
                .process(new ShippingPriorityKeyedProcessFunction(pinningPolicy)).name(ShippingPriorityKeyedProcessFunction.class.getSimpleName()).uid(ShippingPriorityKeyedProcessFunction.class.getSimpleName());

            // Group by l_orderkey, o_orderdate and o_shippriority and compute the revenue sum
            DataStream<ShippingPriorityItem> resultSum = result
                .keyBy(new ShippingPriority3KeySelector())
                .reduce(new SumShippingPriorityItem(pinningPolicy)).name(SumShippingPriorityItem.class.getSimpleName()).uid(SumShippingPriorityItem.class.getSimpleName());

            // Emit the result
            if (output.equalsIgnoreCase(PARAMETER_OUTPUT_MQTT)) {
                resultSum
                    .map(new ShippingPriorityItemMap(pinningPolicy)).name(ShippingPriorityItemMap.class.getSimpleName()).uid(ShippingPriorityItemMap.class.getSimpleName())
                    .addSink(new MqttStringPublisher(ipAddressSink, topic, pinningPolicy)).name(OPERATOR_SINK).uid(OPERATOR_SINK);
            } else if (output.equalsIgnoreCase(PARAMETER_OUTPUT_LOG)) {
                resultSum.print().name(OPERATOR_SINK).uid(OPERATOR_SINK);
            } else if (output.equalsIgnoreCase(PARAMETER_OUTPUT_FILE)) {
                StreamingFileSink<String> sink = StreamingFileSink
                    .forRowFormat(new Path(PATH_OUTPUT_FILE), new SimpleStringEncoder<String>("UTF-8"))
                    .withRollingPolicy(
                        DefaultRollingPolicy.builder().withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
                            .withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
                            .withMaxPartSize(1024 * 1024 * 1024).build())
                    .build();
                resultSum
                    .map(new ShippingPriorityItemMap(pinningPolicy)).name(ShippingPriorityItemMap.class.getSimpleName()).uid(ShippingPriorityItemMap.class.getSimpleName())
                    .addSink(sink).name(OPERATOR_SINK).uid(OPERATOR_SINK);
            } else {
                System.out.println("discarding output");
            }

            System.out.println("Stream job: " + TPCHQuery03.class.getSimpleName());
            System.out.println("Execution plan >>>\n" + env.getExecutionPlan());
            env.execute(TPCHQuery03.class.getSimpleName());
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws Exception {
        new TPCHQuery03();
    }
}
The UDFs are here: OrdersSource, OrderKeyedByCustomerProcessFunction, ShippingPriorityKeyedProcessFunction, and SumShippingPriorityItem. I am using com.google.common.collect.ImmutableList since the state will not be updated. Also, I am keeping only the necessary columns in the state, such as ImmutableList<Tuple2<Long, Double>> lineItemList.
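For illustration, a hedged sketch of what such keyed state could look like inside one of the process functions (the descriptor name and open() wiring are assumptions, not the actual UDF code):

// Hypothetical sketch: keyed state holding only the columns needed for the join.
private transient ValueState<ImmutableList<Tuple2<Long, Double>>> lineItemListState;

@Override
public void open(Configuration parameters) {
    ValueStateDescriptor<ImmutableList<Tuple2<Long, Double>>> descriptor =
        new ValueStateDescriptor<>(
            "lineItemList",
            TypeInformation.of(new TypeHint<ImmutableList<Tuple2<Long, Double>>>() {}));
    lineItemListState = getRuntimeContext().getState(descriptor);
}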

Spark streaming mapWithState timeout delayed?

I expected the new mapWithState API for Spark 1.6+ to remove timed-out objects near-immediately, but there is a delay.
I'm testing the API with an adapted version of the JavaStatefulNetworkWordCount below:
SparkConf sparkConf = new SparkConf()
        .setAppName("JavaStatefulNetworkWordCount")
        .setMaster("local[*]");
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));
ssc.checkpoint("./tmp");

StateSpec<String, Integer, Integer, Tuple2<String, Integer>> mappingFunc =
        StateSpec.function((word, one, state) -> {
            if (state.isTimingOut()) {
                System.out.println("Timing out the word: " + word);
                return new Tuple2<String, Integer>(word, state.get());
            } else {
                int sum = one.or(0) + (state.exists() ? state.get() : 0);
                Tuple2<String, Integer> output = new Tuple2<String, Integer>(word, sum);
                state.update(sum);
                return output;
            }
        });

JavaMapWithStateDStream<String, Integer, Integer, Tuple2<String, Integer>> stateDstream =
        ssc.socketTextStream(args[0], Integer.parseInt(args[1]),
                StorageLevels.MEMORY_AND_DISK_SER_2)
            .flatMap(x -> Arrays.asList(SPACE.split(x)))
            .mapToPair(w -> new Tuple2<String, Integer>(w, 1))
            .mapWithState(mappingFunc.timeout(Durations.seconds(5)));

stateDstream.stateSnapshots().print();
Together with nc (nc -l -p <port>):
When I type a word into the nc window, I see the tuple being printed in the console every second. But the timing-out message doesn't seem to get printed 5 s later, as expected based on the timeout set. The time it takes for the tuple to expire seems to vary between 5 and 20 s.
Am I missing some configuration option, or is the timeout perhaps only performed at the same time as checkpoints?
Once an event times out it's NOT deleted right away, but is only marked for deletion by saving it to a 'deltaMap':
override def remove(key: K): Unit = {
  val stateInfo = deltaMap(key)
  if (stateInfo != null) {
    stateInfo.markDeleted()
  } else {
    val newInfo = new StateInfo[S](deleted = true)
    deltaMap.update(key, newInfo)
  }
}
Then, timed out events are collected and sent to the output stream only at checkpoint. That is: events which time out at batch t, will appear in the output stream only at the next checkpoint - by default, after 5 batch-intervals on average, i.e. batch t+5:
override def checkpoint(): Unit = {
  super.checkpoint()
  doFullScan = true
}
...
removeTimedoutData = doFullScan // remove timedout data only when full scan is enabled
...
// Get the timed out state records, call the mapping function on each and collect the
// data returned
if (removeTimedoutData && timeoutThresholdTime.isDefined) {
  ...
Elements are actually removed only when there are enough of them, and when the state map is being serialized - which currently also happens only at checkpoint:

/** Whether the delta chain length is long enough that it should be compacted */
def shouldCompact: Boolean = {
  deltaChainLength >= deltaChainThreshold
}

// Write the data in the parent state map while copying the data into a new parent map for
// compaction (if needed)
val doCompaction = shouldCompact
...
By default, checkpointing occurs every 10 iterations, thus in the example above every 10 seconds; since your timeout is 5 seconds, events are expected to expire within 5-15 seconds.
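If the delay matters, one option (a sketch, reusing the stateDstream variable from the example above) is to shorten the checkpoint interval of the state stream, at the cost of more frequent checkpoint I/O:

// Hypothetical tweak: checkpoint the state stream every 5 seconds instead of
// the default ~10 batch intervals, so timed-out entries are emitted sooner.
stateDstream.checkpoint(Durations.seconds(5));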
EDIT: Corrected and elaborated answer following comments by @YuvalItzchakov
Am I missing some configuration option, or is the timeout perhaps only
performed at the same time as snapshots?
Every time mapWithState is invoked (with your configuration, around every 1 second), the MapWithStateRDD will internally check for expired records and time them out. You can see it in the code:
// Get the timed out state records, call the mapping function on each and collect the
// data returned
if (removeTimedoutData && timeoutThresholdTime.isDefined) {
  newStateMap.getByTime(timeoutThresholdTime.get).foreach { case (key, state, _) =>
    wrappedState.wrapTimingOutState(state)
    val returned = mappingFunction(batchTime, key, None, wrappedState)
    mappedData ++= returned
    newStateMap.remove(key)
  }
}
(Other than the time taken to execute each job, it turns out that newStateMap.remove(key) actually only marks entries for deletion. See "Edit" for more.)
You have to take into account the time it takes for each stage to be scheduled, and the amount of time it takes for each execution of such a stage to actually take its turn and run. It isn't accurate because this runs as a distributed system, where other factors can come into play, making your timeout more or less accurate than you expect it to be.
Edit
As @etov rightly points out, newStateMap.remove(key) doesn't actually remove the element from the OpenHashMapBasedStateMap[K, S], but simply marks it for deletion. This is also a reason why you're seeing the expiration time adding up.
The actual relevant piece of code is here:
// Write the data in the parent state map while
// copying the data into a new parent map for compaction (if needed)
val doCompaction = shouldCompact
val newParentSessionStore = if (doCompaction) {
  val initCapacity = if (approxSize > 0) approxSize else 64
  new OpenHashMapBasedStateMap[K, S](initialCapacity = initCapacity, deltaChainThreshold)
} else { null }

val iterOfActiveSessions = parentStateMap.getAll()

var parentSessionCount = 0

// First write the approximate size of the data to be written, so that readObject can
// allocate appropriately sized OpenHashMap.
outputStream.writeInt(approxSize)

while (iterOfActiveSessions.hasNext) {
  parentSessionCount += 1
  val (key, state, updateTime) = iterOfActiveSessions.next()
  outputStream.writeObject(key)
  outputStream.writeObject(state)
  outputStream.writeLong(updateTime)

  if (doCompaction) {
    newParentSessionStore.deltaMap.update(
      key, StateInfo(state, updateTime, deleted = false))
  }
}

// Write the final limit marking object with the correct count of records written.
val limiterObj = new LimitMarker(parentSessionCount)
outputStream.writeObject(limiterObj)

if (doCompaction) {
  parentStateMap = newParentSessionStore
}
If the deltaMap should be compacted (marked by the doCompaction variable), then (and only then) is the map cleared of all the deleted instances. How often does that happen? Once the delta chain exceeds the threshold:

val DELTA_CHAIN_LENGTH_THRESHOLD = 20

That is, when the delta chain is longer than 20 items and there are items that have been marked for deletion.
