Redis insertion lag for sorted sets - java

I am trying to push a small amount of data (about 50 bytes) from my application (written in Java using the jedis driver) into a sorted set with about 360 members (each member also contains a very small amount of data). I am experiencing a 60-90 second lag time between my application making the insert and seeing the result on my redis server (a separate server in a different data center). This happens consistently. At first I thought there was something in my application causing the query to hang and then execute a minute later, but that's not the case because I can shut my application server down entirely immediately after running the query and the new item still shows up a minute later in Redis. In addition, when I removed all elements of the set and tried the insert again, it happened immediately (which is the intended behavior).
This is entirely in a test environment with no other traffic hitting either server; my Redis server holds hardly any data and has plenty of RAM and CPU. Latency between my application server and the Redis server is approximately 50 ms.
Is there a configuration setting that I'm missing that could be causing such a delay?
Thanks in advance.
EDIT: Here is my code for inserting
public void save(RedisEntity entity) {
    String key = entity.getKey();
    String value = entity.getValue();
    Jedis jedis = database.getJedis();
    try {
        if (entity instanceof RedisListEntity) {
            if (((RedisListEntity)entity).isSavePushesLeft()) {
                jedis.lpush(key, value);
            } else {
                jedis.rpush(key, value);
            }
        } else if (entity instanceof RedisSortedSetEntity) {
            Long score = ((RedisSortedSetEntity)entity).getScore();
            jedis.zadd(key, score, value);
        } else if (entity instanceof RedisSetEntity) {
            jedis.sadd(key, value);
        } else {
            jedis.set(key, value);
        }
    } catch (JedisConnectionException e) {
        raiseErrorForSave(key, value);
        database.returnBrokenResourceToPool(jedis);
    } finally {
        database.returnResourceToPool(jedis);
    }
}
And retrieval (although keep in mind the key doesn't show up in redis-cli on the redis server for 1-2 minutes after insertion, so this code was never even part of the problem):
public Set<String> getSortedSetMembersWithRangeRankRev(String key, double min, double max, int start, int size) {
    Set<String> result = null;
    Jedis jedis = database.getJedis();
    try {
        result = jedis.zrevrangeByScore(key, max, min, start, size);
    } catch (JedisConnectionException e) {
        logger.warn("Failed to retrieve set for key " + key);
        logger.warn(e.getMessage());
        database.returnBrokenResourceToPool(jedis);
    } finally {
        database.returnResourceToPool(jedis);
    }
    return result;
}
EDIT: Some more info - I restarted my Redis server after it had been on for a few days (with almost no traffic as it is in a development environment) and updates seem to be coming through instantaneously as the set reaches 1000 members. This problem is still troubling for me though and I would like to identify the cause and prevent it from happening in the future - until then I cannot release this into a production environment.
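One detail in the save method above may be worth ruling out: when a JedisConnectionException is caught, the connection is handed back as broken in the catch block and then handed back again in finally. Whether this is related to the lag is not established here. The following is only a sketch of a return-exactly-once pattern; RecordingPool is a hypothetical stand-in for the question's database wrapper and merely records which calls were made, and a plain RuntimeException stands in for the Jedis exception.

```java
import java.util.ArrayList;
import java.util.List;

class SingleReturnSketch {

    // Hypothetical stand-in for the question's `database` wrapper; it only records calls.
    static class RecordingPool {
        final List<String> calls = new ArrayList<>();

        Object getResource() {
            calls.add("get");
            return new Object();
        }

        void returnResource(Object resource) {
            calls.add("return");
        }

        void returnBrokenResource(Object resource) {
            calls.add("returnBroken");
        }
    }

    // Runs the given work and hands the connection back exactly once,
    // as broken if the work threw, as healthy otherwise.
    static void withConnection(RecordingPool pool, Runnable work) {
        Object connection = pool.getResource();
        boolean broken = false;
        try {
            work.run();
        } catch (RuntimeException e) {
            // Stand-in for catching JedisConnectionException and reporting the error.
            broken = true;
        } finally {
            if (broken) {
                pool.returnBrokenResource(connection);
            } else {
                pool.returnResource(connection);
            }
        }
    }
}
```

With this shape, a failing call produces one "returnBroken" and no additional "return" for the same connection.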

Related

How to write large volumes of unique data to Postgres without storing it all in memory

I have a Spring Boot application that generates images. I'm trying to scale it to the point it can generate an unlimited number of images.
When an image is generated, I create a MurmurHash3 hash of the image's Base64-encoded value; this is then added to an object as a @Lob value. The hash is how I decide whether images are unique. The images are then pushed into Postgres.
So far everything is fine, and this creates ~1,000 images in a few seconds without problems. Where I'm having issues is when I want to create 100,000+ images.
When the images are generated there is a pretty good chance of duplicates, so I thought it would be a good idea to create 'chunks' of images using a HashSet, to rule out duplicates at least within a given chunk.
public class CreateImages {
    //...
    @EventListener(ApplicationReadyEvent.class)
    public void process() {
        while (repository.count() < 100_000) {
            createChunk();
        }
    }

    private void createChunk() {
        Set<TokenUri> result = new HashSet<>();
        while (result.size() < 1000) {
            final ImageWrapper imageWrapper = svgService.create(-1);
            result.add(TokenUri.builder()
                .hash(imageWrapper.hash())
                .data(encodeService.encode(imageWrapper))
                .build());
        }
        try {
            repository.saveAll(result);
        } catch (Exception e) {
            log.error("Failed to save chunk {}", e.getMessage());
        }
        log.info("Created {} images", repository.count());
    }
}
Setting aside the time taken to create the images (it's all single-threaded), this does what I expect: each chunk contains no duplicates. However, it will more than likely contain duplicates of images in previously generated chunks.
So to try and solve that I added a @Column(unique = true) annotation to the hash column being saved, thinking Postgres would reject duplicates but allow 'non-duplicates' to be saved.
What seems to happen, though, is that the whole batch write fails because the constraint is violated, and it never gets past that point.
2022-01-02 15:38:56.071 ERROR 19292 --- [ main] o.h.engine.jdbc.spi.SqlExceptionHelper : ERROR: duplicate key value violates unique constraint "uk_90jgw9r7w8bhtgw17fmi79j0w"
Detail: Key (hash)=(1625765490) already exists.
Even when I attempt to catch those with a generic Exception, either I'm not handling it correctly or it doesn't do what I expect.
Even this feels rather hacky and not like a correct solution.
So, TL;DR: how can I generate an unknown number (assume millions) of unique objects, without keeping them all in memory to check for uniqueness, and safely store them in Postgres?
Is there some standard pattern for this kind of thing?
This could be a possible solution. You keep only the hash property in a Set; that way you get unique hashes across chunks, and it is memory-efficient because you store only a String rather than an object with all its properties. You also need to override the hashCode and equals methods in TokenUri (to use the hash property), because otherwise Set<TokenUri> won't de-duplicate by hash.
public class CreateImages {
    //...
    @EventListener(ApplicationReadyEvent.class)
    public void process() {
        Set<String> hashCodesOfSavedImages = new HashSet<>();
        while (hashCodesOfSavedImages.size() < 100_000) {
            hashCodesOfSavedImages.addAll(createChunkOfImages(hashCodesOfSavedImages, 1000));
        }
    }

    private Set<String> createChunkOfImages(Set<String> hashCodesOfSavedImages, int chunkSize) {
        Set<TokenUri> chunkOfImages = new HashSet<>();
        while (chunkOfImages.size() < chunkSize) {
            final ImageWrapper imageWrapper = svgService.create(-1);
            // O(1) time complexity (contains)
            if (!hashCodesOfSavedImages.contains(imageWrapper.hash())) {
                chunkOfImages.add(TokenUri.builder()
                    .hash(imageWrapper.hash())
                    .data(encodeService.encode(imageWrapper))
                    .build());
            }
        }
        repository.saveAll(chunkOfImages);
        return chunkOfImages.stream().map(TokenUri::hash).collect(Collectors.toSet());
    }
}
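The remark about overriding equals and hashCode could look roughly like the sketch below. TokenUriSketch is a simplified, hypothetical stand-in for the question's TokenUri (no JPA or builder annotations, and the hash is assumed to be a String as in the answer's Set<String>); the point is only that equality is defined by the hash field, so a HashSet treats two objects with the same hash as one element.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Simplified stand-in for the TokenUri entity; field types are assumptions.
final class TokenUriSketch {
    private final String hash;
    private final String data;

    TokenUriSketch(String hash, String data) {
        this.hash = hash;
        this.data = data;
    }

    String hash() {
        return hash;
    }

    String data() {
        return data;
    }

    // Two instances are equal when their hashes are equal; data is ignored.
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof TokenUriSketch)) {
            return false;
        }
        return Objects.equals(hash, ((TokenUriSketch) other).hash);
    }

    // Must be consistent with equals, so it is derived from the hash field only.
    @Override
    public int hashCode() {
        return Objects.hashCode(hash);
    }

    // Helper for illustration: number of distinct items according to equals/hashCode.
    static int distinctCount(TokenUriSketch... items) {
        Set<TokenUriSketch> set = new HashSet<>();
        for (TokenUriSketch item : items) {
            set.add(item);
        }
        return set.size();
    }
}
```

Without these overrides, HashSet falls back to object identity, and two separately built objects with the same hash would both be kept in a chunk.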

Flink Task Manager timeout

My program gets very slow as more and more records are processed. I initially thought it is due to excessive memory consumption as my program is String intensive (I am using Java 11 so compact strings should be used whenever possible) so I increased the JVM Heap:
-Xms2048m
-Xmx6144m
I also increased the task manager's memory as well as timeout, flink-conf.yaml:
jobmanager.heap.size: 6144m
heartbeat.timeout: 5000000
However, none of this helped. The program still gets very slow at about the same point, after processing roughly 3.5 million records with only about 0.5 million more to go. As it approaches the 3.5 million mark it gets slower and slower until it eventually times out; total execution time is about 11 minutes.
I checked the memory consumption in VisualVM, but it never goes above about 700 MB. My Flink pipeline looks as follows:
final StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironment(1);
environment.setParallelism(1);
DataStream<Tuple> stream = environment.addSource(new TPCHQuery3Source(filePaths, relations));
stream.process(new TPCHQuery3Process(relations)).addSink(new FDSSink());
environment.execute("FlinkDataService");
The bulk of the work is done in the process function, where I am implementing database join algorithms with the columns stored as Strings. Specifically, I am implementing query 3 of the TPC-H benchmark; see https://examples.citusdata.com/tpch_queries.html if you wish.
The timeout error is this:
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id <id> timed out.
Once I got this error as well:
Exception in thread "pool-1-thread-1" java.lang.OutOfMemoryError: Java heap space
Also, my VisualVM monitoring, screenshot is captured at the point where things get very slow:
Here is the run loop of my source function:
while (run) {
    readers.forEach(reader -> {
        try {
            String line = reader.readLine();
            if (line != null) {
                Tuple tuple = lineToTuple(line, counter.get() % filePaths.size());
                if (tuple != null && isValidTuple(tuple)) {
                    sourceContext.collect(tuple);
                }
            } else {
                closedReaders.add(reader);
                if (closedReaders.size() == filePaths.size()) {
                    System.out.println("ALL FILES HAVE BEEN STREAMED");
                    cancel();
                }
            }
            counter.getAndIncrement();
        } catch (IOException e) {
            e.printStackTrace();
        }
    });
}
I basically read one line from each of the 3 files I need. Based on the order of the files, I construct a tuple object (my custom class Tuple, representing a row in a table) and emit that tuple if it is valid, i.e. if it fulfills certain conditions on the date.
I am also asking the JVM to run garbage collection at the 1 millionth, 1.5 millionth, 2 millionth and 2.5 millionth record, like this:
System.gc()
Any thoughts on how I can optimize this?
String intern() saved me. I did intern on every string before storing it in my maps and that worked like a charm.
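As a minimal sketch of what interning changes: values parsed from different lines are separate String objects even when their text is identical, whereas intern() returns one canonical instance per distinct text, so maps holding many repeated column values keep far fewer objects alive. The class name and the sample value are made up for illustration.

```java
class InternSketch {

    // Two strings with the same text, built separately as a line parser would build them.
    static boolean sameInstanceWithoutIntern() {
        String first = new String("AUTOMOBILE");
        String second = new String("AUTOMOBILE");
        return first == second; // reference comparison on purpose: distinct objects
    }

    // After intern(), both references point at the single canonical instance.
    static boolean sameInstanceWithIntern() {
        String first = new String("AUTOMOBILE").intern();
        String second = new String("AUTOMOBILE").intern();
        return first == second; // reference comparison on purpose: same object
    }

    public static void main(String[] args) {
        System.out.println(sameInstanceWithoutIntern()); // false
        System.out.println(sameInstanceWithIntern());    // true
    }
}
```

The benefit depends on how often values repeat; for mostly unique strings, interning adds lookup cost without saving memory.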
These are the properties that I changed on my Flink standalone cluster to compute TPC-H query 03.
jobmanager.memory.process.size: 1600m
heartbeat.timeout: 100000
taskmanager.memory.process.size: 8g # default: 1728m
I implemented this query to stream only the Order table, and I kept the other tables as state. I am also computing it as a windowless query, which I think makes more sense and is faster.
public class TPCHQuery03 {
    private final String topic = "topic-tpch-query-03";

    public TPCHQuery03() {
        this(PARAMETER_OUTPUT_LOG, "127.0.0.1", false, false, -1);
    }

    public TPCHQuery03(String output, String ipAddressSink, boolean disableOperatorChaining, boolean pinningPolicy, long maxCount) {
        try {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
            if (disableOperatorChaining) {
                env.disableOperatorChaining();
            }
            DataStream<Order> orders = env
                .addSource(new OrdersSource(maxCount)).name(OrdersSource.class.getSimpleName()).uid(OrdersSource.class.getSimpleName());
            // Filter market segment "AUTOMOBILE"
            // customers = customers.filter(new CustomerFilter());
            // Filter all Orders with o_orderdate < 12.03.1995
            DataStream<Order> ordersFiltered = orders
                .filter(new OrderDateFilter("1995-03-12")).name(OrderDateFilter.class.getSimpleName()).uid(OrderDateFilter.class.getSimpleName());
            // Join customers with orders and package them into a ShippingPriorityItem
            DataStream<ShippingPriorityItem> customerWithOrders = ordersFiltered
                .keyBy(new OrderKeySelector())
                .process(new OrderKeyedByCustomerProcessFunction(pinningPolicy)).name(OrderKeyedByCustomerProcessFunction.class.getSimpleName()).uid(OrderKeyedByCustomerProcessFunction.class.getSimpleName());
            // Join the last join result with Lineitems
            DataStream<ShippingPriorityItem> result = customerWithOrders
                .keyBy(new ShippingPriorityOrderKeySelector())
                .process(new ShippingPriorityKeyedProcessFunction(pinningPolicy)).name(ShippingPriorityKeyedProcessFunction.class.getSimpleName()).uid(ShippingPriorityKeyedProcessFunction.class.getSimpleName());
            // Group by l_orderkey, o_orderdate and o_shippriority and compute revenue sum
            DataStream<ShippingPriorityItem> resultSum = result
                .keyBy(new ShippingPriority3KeySelector())
                .reduce(new SumShippingPriorityItem(pinningPolicy)).name(SumShippingPriorityItem.class.getSimpleName()).uid(SumShippingPriorityItem.class.getSimpleName());
            // emit result
            if (output.equalsIgnoreCase(PARAMETER_OUTPUT_MQTT)) {
                resultSum
                    .map(new ShippingPriorityItemMap(pinningPolicy)).name(ShippingPriorityItemMap.class.getSimpleName()).uid(ShippingPriorityItemMap.class.getSimpleName())
                    .addSink(new MqttStringPublisher(ipAddressSink, topic, pinningPolicy)).name(OPERATOR_SINK).uid(OPERATOR_SINK);
            } else if (output.equalsIgnoreCase(PARAMETER_OUTPUT_LOG)) {
                resultSum.print().name(OPERATOR_SINK).uid(OPERATOR_SINK);
            } else if (output.equalsIgnoreCase(PARAMETER_OUTPUT_FILE)) {
                StreamingFileSink<String> sink = StreamingFileSink
                    .forRowFormat(new Path(PATH_OUTPUT_FILE), new SimpleStringEncoder<String>("UTF-8"))
                    .withRollingPolicy(
                        DefaultRollingPolicy.builder().withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
                            .withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
                            .withMaxPartSize(1024 * 1024 * 1024).build())
                    .build();
                resultSum
                    .map(new ShippingPriorityItemMap(pinningPolicy)).name(ShippingPriorityItemMap.class.getSimpleName()).uid(ShippingPriorityItemMap.class.getSimpleName())
                    .addSink(sink).name(OPERATOR_SINK).uid(OPERATOR_SINK);
            } else {
                System.out.println("discarding output");
            }
            System.out.println("Stream job: " + TPCHQuery03.class.getSimpleName());
            System.out.println("Execution plan >>>\n" + env.getExecutionPlan());
            env.execute(TPCHQuery03.class.getSimpleName());
        } catch (IOException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws Exception {
        new TPCHQuery03();
    }
}
The UDFs are here: OrderSource, OrderKeyedByCustomerProcessFunction, ShippingPriorityKeyedProcessFunction, and SumShippingPriorityItem. I am using the com.google.common.collect.ImmutableList since the state will not be updated. Also I am keeping only the necessary columns on the state, such as ImmutableList<Tuple2<Long, Double>> lineItemList.

Seeking to a Kafka Offset with Spring Cloud Stream

I have an event-sourced service that listens to a Kafka topic and saves state in a relational DB.
Considering a suitable restoration strategy for this service (i.e. how to restore the DB in a disaster recovery scenario), one option would be to save the current offset in the DB, take snapshots, and restore from a snapshot. In this scenario the service would need to seek to the offset when started in 'restoration mode'.
I am using Spring Cloud Stream, and was wondering if the framework provides any mechanism for seeking to an offset?
I realise another option for restoration would be to simply play all the events from scratch, but that's not an ideal option for some of my microservices.
If you're talking about a disaster, what makes you think you can write anything to the DB?
In other words, you may end up dealing with de-duplication for at least one event (at least you have to account for that), and if so, de-duplication is still something you have to deal with.
I understand your concern with replay (you simply don't want to replay from the beginning), but you can store periodic snapshots, which would ensure you have a relatively fixed number of events that may need to be reprocessed/de-duplicated.
That said, Kafka maintains the current offset, so you can rely on Kafka's natural transaction features to ensure that the next time you start your microservice it will begin from the last offset that was not successfully processed.
There is a KafkaBindingRebalanceListener interface that you can use:
@Slf4j
@Component
public class KafkaRebalanceListener implements KafkaBindingRebalanceListener {
    @Value("${config.kafka.topics.offsets:null}")
    private String topicOffsets;

    @Override
    public void onPartitionsAssigned(String bindingName, Consumer<?, ?> consumer, Collection<TopicPartition> partitions, boolean initial) {
        if (topicOffsets != null && initial) {
            final Optional<Map<TopicPartition, Long>> offsetsOptional = parseOffset(topicOffsets);
            if (offsetsOptional.isPresent()) {
                final Map<TopicPartition, Long> offsetsMap = offsetsOptional.get();
                partitions.forEach(tp -> {
                    if (offsetsMap.containsKey(tp)) {
                        final Long offset = offsetsMap.get(tp);
                        try {
                            log.info("Seek topic {} partition {} to offset {}", tp.topic(), tp.partition(), offset);
                            consumer.seek(tp, offset);
                        } catch (Exception e) {
                            log.error("Unable to set offset {} for topic {} and partition {}", offset, tp.topic(), tp.partition());
                        }
                    }
                });
            }
        }
    }

    private Optional<Map<TopicPartition, Long>> parseOffset(String offsetParam) {
        if (offsetParam == null || offsetParam.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(Arrays.stream(offsetParam.split(","))
            .flatMap(slice -> {
                String[] items = slice.split("\\|");
                String topic = items[0];
                return Arrays.stream(Arrays.copyOfRange(items, 1, items.length))
                    .map(r -> {
                        String[] record = r.split(":");
                        int partition = Integer.parseInt(record[0]);
                        long offset = Long.parseLong(record[1]);
                        return new AbstractMap.SimpleEntry<>(new TopicPartition(topic, partition), offset);
                    });
            }).collect(Collectors.toMap(AbstractMap.SimpleEntry::getKey, AbstractMap.SimpleEntry::getValue)));
    }
}
The config.kafka.topics.offsets value looks like this, but you can use any format:
String topicOffsets = "topic2|1:100|2:120|3:140,topic3|1:1000|2:1200|3:1400";
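To make the format concrete, here is a dependency-free sketch of the same parsing. It mirrors parseOffset above but, to stay runnable without the Kafka client on the classpath, it keys the result by a "topic-partition" string instead of a TopicPartition; the class and method names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class OffsetFormatSketch {

    // Parses "topic|partition:offset|partition:offset,topic|..." into
    // a map from "topic-partition" to the offset to seek to.
    static Map<String, Long> parse(String offsetParam) {
        Map<String, Long> result = new LinkedHashMap<>();
        if (offsetParam == null || offsetParam.isEmpty()) {
            return result;
        }
        for (String slice : offsetParam.split(",")) {
            String[] items = slice.split("\\|");
            String topic = items[0];
            // Every item after the topic name is a "partition:offset" pair.
            for (int i = 1; i < items.length; i++) {
                String[] record = items[i].split(":");
                int partition = Integer.parseInt(record[0]);
                long offset = Long.parseLong(record[1]);
                result.put(topic + "-" + partition, offset);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("topic2|1:100|2:120|3:140,topic3|1:1000|2:1200|3:1400"));
    }
}
```

For the sample value, partition 1 of topic2 maps to offset 100 and partition 3 of topic3 maps to offset 1400.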

Neo4j, SDN4, ActiveMQ multiple consumers and data synchronization

In order to speed up data consumption in my application (Spring Boot, Neo4j database, Spring Data Neo4j 4), I have introduced Apache ActiveMQ and configured 10 concurrent consumers.
Right after that I ran into an issue with counter updates.
I execute the following create method from my Apache ActiveMQ consumer:
@Override
public Decision create(String name, String description, String url, String imageUrl, boolean multiVotesAllowed, Long parentDecisionId, User user) {
    Decision parentDecision = null;
    if (parentDecisionId != null) {
        parentDecision = ofNullable(findById(parentDecisionId)).orElseThrow(() -> new EntityNotFoundException("Parent decision with a given id not found"));
    }
    Decision decision = decisionRepository.save(new Decision(name, description, url, imageUrl, multiVotesAllowed, parentDecision, user), user);
    if (parentDecision != null) {
        updateTotalChildDecisions(parentDecision, 1);
    }
    return decision;
}
Inside this method I do some logic and then update the parentDecision.totalChildDecisions counter:
@Override
public Decision updateTotalChildDecisions(Decision decision, Integer increment) {
    decision.setTotalChildDecisions(decision.getTotalChildDecisions() + increment);
    return decisionRepository.save(decision);
}
After execution in the concurrent environment this counter doesn't match the actual state in the database, but in a single-threaded environment (1 ActiveMQ consumer) everything works fine.
I think the main issue is that during the totalChildDecisions update, parentDecision refers to an old SDN 4 object with stale data (totalChildDecisions). How do I correctly update parentDecision.totalChildDecisions?
How do I properly synchronize my code so that the counters work with concurrent ActiveMQ consumers?
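The symptom described here matches a classic lost update, which can be reproduced deterministically without a database. The sketch below is only an illustration under that assumption: Store is a hypothetical in-memory stand-in for the repository, and its increment method stands for whatever atomic or locked update the real database offers. It is not a Neo4j or SDN API.

```java
import java.util.HashMap;
import java.util.Map;

class LostUpdateSketch {

    // Hypothetical in-memory "database": decision id -> totalChildDecisions.
    static final class Store {
        private final Map<Long, Integer> counters = new HashMap<>();

        synchronized int load(long id) {
            return counters.getOrDefault(id, 0);
        }

        synchronized void save(long id, int value) {
            counters.put(id, value);
        }

        // Read-modify-write done as one step inside the store.
        synchronized int increment(long id, int delta) {
            return counters.merge(id, delta, Integer::sum);
        }
    }

    // Two consumers both load the parent before either one saves,
    // so the second save overwrites the first.
    static int staleReadModifyWrite() {
        Store store = new Store();
        int seenByConsumerA = store.load(1L);
        int seenByConsumerB = store.load(1L);
        store.save(1L, seenByConsumerA + 1);
        store.save(1L, seenByConsumerB + 1);
        return store.load(1L);
    }

    // The same two increments applied atomically are both kept.
    static int atomicIncrement() {
        Store store = new Store();
        store.increment(1L, 1);
        store.increment(1L, 1);
        return store.load(1L);
    }

    public static void main(String[] args) {
        System.out.println(staleReadModifyWrite()); // 1
        System.out.println(atomicIncrement());      // 2
    }
}
```

The interleaving in staleReadModifyWrite is what updateTotalChildDecisions permits when two consumers hold copies of the same parent loaded before either has saved.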

Hashmap Over Large dataset giving OutOfMemory in spark

I have a requirement to update a HashMap. In my Spark job I have a JavaPairRDD whose wrapper value holds 9 different HashMaps. Each HashMap has roughly 40-50 crore (400-500 million) keys. While merging two maps (reduceByKey in Spark) I get a Java heap OutOfMemory exception. Below is the code snippet.
private HashMap<String, Long> getMergedMapNew(HashMap<String, Long> oldMap,
        HashMap<String, Long> newMap) {
    for (Entry<String, Long> entry : newMap.entrySet()) {
        try {
            String imei = entry.getKey();
            Long oldTimeStamp = oldMap.get(imei);
            Long newTimeStamp = entry.getValue();
            if (oldTimeStamp != null && newTimeStamp != null) {
                if (oldTimeStamp < newTimeStamp) {
                    oldMap.put(imei, newTimeStamp);
                } else {
                    oldMap.put(imei, oldTimeStamp);
                }
            } else if (oldTimeStamp == null) {
                oldMap.put(imei, newTimeStamp);
            } else if (newTimeStamp == null) {
                oldMap.put(imei, oldTimeStamp);
            }
        } catch (Exception e) {
            logger.error("{}", Utils.getStackTrace(e));
        }
    }
    return oldMap;
}
This method works on a small dataset but fails with a large dataset. The same method is used for all 9 HashMaps. I searched for how to increase heap memory but have no idea how to do this in Spark, as it runs on a cluster. My cluster is also large (nearly 300 nodes). Please help me find a solution.
Thanks.
Firstly I'd focus on 3 parameters: spark.driver.memory=45g spark.executor.memory=6g spark.driver.maxResultSize=8g. Don't take this config for granted; it is simply what works on my setup without OOM errors. Check how much memory you have available in the UI. You want to give executors as much memory as you can. By the way, spark.driver.memory enables more heap space.
As far as I can see, this code is executed on the Spark driver. I would recommend converting those two HashMaps to DataFrames with 2 columns, imei and timestamp. Then join both using an outer join on imei and select the appropriate timestamps using when.
This code will be executed on the workers and be parallelized, and consequently you won't run into the memory problems. If you really plan on doing this on the driver, then follow the instructions given by Jarek and increase spark.driver.memory.
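Wherever the merge ends up running, the rule in getMergedMapNew (keep the later timestamp per IMEI) can be written more compactly with Map.merge. This is a sketch of the same rule, not a fix for the memory problem; the class name is made up, and unlike the original it skips null timestamps instead of storing them, because Map.merge rejects null values.

```java
import java.util.HashMap;
import java.util.Map;

class MergeLatestSketch {

    // Merges newMap into oldMap, keeping the larger (later) timestamp per key.
    static Map<String, Long> mergeLatest(Map<String, Long> oldMap, Map<String, Long> newMap) {
        for (Map.Entry<String, Long> entry : newMap.entrySet()) {
            Long newTimeStamp = entry.getValue();
            if (newTimeStamp != null) {
                // Inserts when the key is absent, otherwise keeps the maximum.
                oldMap.merge(entry.getKey(), newTimeStamp, Long::max);
            }
        }
        return oldMap;
    }

    public static void main(String[] args) {
        Map<String, Long> oldMap = new HashMap<>();
        oldMap.put("imei-a", 10L);
        oldMap.put("imei-b", 50L);
        Map<String, Long> newMap = new HashMap<>();
        newMap.put("imei-a", 20L);
        newMap.put("imei-b", 40L);
        newMap.put("imei-c", 5L);
        System.out.println(mergeLatest(oldMap, newMap));
    }
}
```

The same "take the maximum" function is also what a per-key reduce over (imei, timestamp) pairs would need, which is the direction the DataFrame suggestion above points in.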
