I'm creating a pipeline where the inputs are JSON messages containing a timestamp field, which is used to set the event time. The problem is that some records could arrive late or duplicated, and these situations need to be handled. To avoid duplicates I tried the following solution:
.assignTimestampsAndWatermarks(new RecordWatermark()
.withTimestampAssigner(new ExtractRecordTimestamp()))
.keyBy(new MetricGrouper())
.window(TumblingEventTimeWindows.of(Time.seconds(60)))
.trigger(ContinuousEventTimeTrigger.of(Time.seconds(3)))
.process(new WindowedFilter())
.keyBy(new MetricGrouper())
.window(TumblingEventTimeWindows.of(Time.seconds(180)))
.trigger(ContinuousEventTimeTrigger.of(Time.seconds(15)))
.process(new WindowedCountDistinct())
.map((value) -> value.toString());
where the first windowing operation filters the records based on timestamps saved in a set, as follows:
public class WindowedFilter extends ProcessWindowFunction<MetricObject, MetricObject, String, TimeWindow> {
HashSet<Long> previousRecordTimestamps = new HashSet<>();
@Override
public void process(String s, Context context, Iterable<MetricObject> inputs, Collector<MetricObject> out) throws Exception {
String windowStart = DateTimeFormatter.ISO_INSTANT.format(Instant.ofEpochMilli(context.window().getStart()));
String windowEnd = DateTimeFormatter.ISO_INSTANT.format(Instant.ofEpochMilli(context.window().getEnd()));
log.info("window start: '{}', window end: '{}'", windowStart, windowEnd);
Long watermark = context.currentWatermark();
log.info(inputs.toString());
for (MetricObject in : inputs) {
Long recordTimestamp = in.getTimestamp().toEpochMilli();
if (!previousRecordTimestamps.contains(recordTimestamp)) {
log.info("timestamp not contained");
previousRecordTimestamps.add(recordTimestamp);
out.collect(in);
}
}
}
}
This solution works, but I have the feeling that I'm not considering something important, or that it could be done in a better way.
One potential problem with using windows for deduplication is that the windows implemented in Flink's DataStream API are always aligned to the epoch. This means that, for example, an event occurring at 11:59:59, and a duplicate occurring at 12:00:01, will be placed into different minute-long windows.
However, in your case it appears that the duplicates you are concerned about also carry the same timestamp. In that case, what you're doing will produce correct results, so long as you're not concerned about the watermarking producing late events.
The other issue with using windows for deduplication is the latency they impose on the pipeline, and the workarounds used to minimize that latency.
This is why I prefer to implement deduplication with a RichFlatMapFunction or a KeyedProcessFunction. Something like this will perform better than a window:
private static class Event {
public final String key;
public Event(String key) {
this.key = key;
}
}
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.addSource(new EventSource())
.keyBy(e -> e.key)
.flatMap(new Deduplicate())
.print();
env.execute();
}
public static class Deduplicate extends RichFlatMapFunction<Event, Event> {
ValueState<Boolean> seen;
@Override
public void open(Configuration conf) {
StateTtlConfig ttlConfig = StateTtlConfig
.newBuilder(Time.minutes(1))
.setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
.cleanupFullSnapshot()
.build();
ValueStateDescriptor<Boolean> desc = new ValueStateDescriptor<>("seen", Types.BOOLEAN);
desc.enableTimeToLive(ttlConfig);
seen = getRuntimeContext().getState(desc);
}
@Override
public void flatMap(Event event, Collector<Event> out) throws Exception {
if (seen.value() == null) {
out.collect(event);
seen.update(true);
}
}
}
Here the stream is being deduplicated by key, and the state involved is being automatically cleared after one minute.
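For completeness, the KeyedProcessFunction alternative mentioned above could be sketched like this (an untested sketch of my own, reusing the Event type from above; the one-minute event-time timer plays the role of the state TTL):

public static class DeduplicateWithTimer extends KeyedProcessFunction<String, Event, Event> {
    ValueState<Boolean> seen;

    @Override
    public void open(Configuration conf) {
        seen = getRuntimeContext().getState(new ValueStateDescriptor<>("seen", Types.BOOLEAN));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Event> out) throws Exception {
        if (seen.value() == null) {
            seen.update(true);
            out.collect(event);
            // clear the dedup state one minute of event time after the first occurrence;
            // note: ctx.timestamp() is null if no timestamp assigner is in place
            ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 60_000);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Event> out) {
        seen.clear();
    }
}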
My goal is to cache data in memory for 60s. As soon as an entry is read from the cache, I want to remove it (permit single reads only).
If those 60s expire in the meantime and the entry is still in the cache, I want to write-behind the entry into a database.
Is there any existing technology/spring/apache framework that already offers such a cache?
(sidenote: I don't want to use complex libraries like Redis, Ehcache etc. for such a simple use case).
If set up manually, I'd do it as follows. But probably there are better options?
@Service
public class WriteBehindCache {
static class ObjectEntry {
Object data;
LocalDateTime timestamp;
public ObjectEntry(Object data) {
this.data = data;
timestamp = LocalDateTime.now();
}
}
Map<String, ObjectEntry> cache = new ConcurrentHashMap<>();
//batch every minute
@Scheduled(fixedRate = 60000)
public void writeBehind() {
LocalDateTime now = LocalDateTime.now();
List<Map.Entry<String, ObjectEntry>> outdated = cache.entrySet().stream()
.filter(entry -> entry.getValue().timestamp.plusSeconds(60).isBefore(now))
.collect(Collectors.toList());
databaseService.persist(outdated.stream().map(Map.Entry::getValue).collect(Collectors.toList()));
outdated.forEach(entry -> cache.remove(entry.getKey()));
}
//always keep most recent entry
public void add(String key, Object data) {
cache.put(key, new ObjectEntry(data));
}
//fallback lookup to database if cache is empty
public Object get(String key) {
ObjectEntry entry = cache.remove(key);
if (entry != null) {
return entry.data;
}
Object data = databaseService.query(key);
if (data != null) databaseService.remove(key);
return data;
}
}
Your solution has two problems:
You are doing a sequential scan for persisting, which will get costly when there are a lot of entries.
The code has race conditions.
Due to the race conditions, the code does not satisfy your requirements. It's possible to construct a concurrent access sequence where an entry is removed from the cache but was also written to the database: for example, writeBehind selects an entry for persisting while a concurrent get removes and returns the same entry, so the value is both read once and persisted.
Is there any existing technology/spring/apache framework that already offers such a cache? (sidenote: I don't want to use complex libraries like Redis, Ehcache etc. for such a simple use case).
I think you can solve the concurrency issues based on ConcurrentHashMap, but I don't know an elegant way to handle the timeout. Still, a possible solution is to use a caching library. I'd like to offer an example based on cache2k, which is not heavy (about a 400K jar) and has other nice use cases as well. As an extra, there is also good support for the Spring caching abstraction.
public static class WriteBehindCache {
Cache<String, Object> cache = Cache2kBuilder.of(String.class, Object.class)
.addListener((CacheEntryExpiredListener<String, Object>) (cache, entry)
-> persist(entry.getKey(), entry.getValue()))
.expireAfterWrite(60, TimeUnit.SECONDS)
.build();
public void add(String key, Object data) {
cache.put(key, data);
}
public Object get(String key) {
return cache.invoke(key, e -> {
if (e.exists()) {
Object v = e.getValue();
e.remove();
return v;
}
return loadAndRemove(e.getKey());
});
}
// stubs
protected void persist(String key, Object value) {
}
protected Object loadAndRemove(String key) {
return null;
}
}
With this wiring the cache blocks out concurrent operations on an entry, so it is certain that only one database operation runs for one entry at a time.
You can do it in similar ways with other caching libraries. Using the JCache/JSR107 API the code would look almost identical.
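For illustration, here is a rough JCache sketch of the same get operation (a sketch under assumptions, not tested against a particular provider; the expiry listener registration is omitted because JSR107 makes listener wiring considerably more verbose):

import javax.cache.Cache;
import javax.cache.Caching;
import javax.cache.configuration.MutableConfiguration;
import javax.cache.expiry.CreatedExpiryPolicy;
import javax.cache.expiry.Duration;
import javax.cache.processor.EntryProcessor;

public class JCacheWriteBehind {
    Cache<String, Object> cache = Caching.getCachingProvider().getCacheManager()
            .createCache("writeBehind", new MutableConfiguration<String, Object>()
                    .setTypes(String.class, Object.class)
                    .setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(Duration.ONE_MINUTE)));

    public Object get(String key) {
        // the EntryProcessor executes atomically per entry, like cache2k's invoke
        return cache.invoke(key, (EntryProcessor<String, Object, Object>) (entry, args) -> {
            if (entry.exists()) {
                Object v = entry.getValue();
                entry.remove();
                return v;
            }
            return null; // or fall back to a load-and-remove against the database
        });
    }
}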
A "lighter" approach is to use jhalterman's expiringmap.
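For example, the single-read-plus-write-behind pattern with expiringmap might look like this (a sketch assuming net.jodah:expiringmap on the classpath; persist stands in for your database call). Note that, unlike cache2k's invoke, the remove-on-read here is not atomic with respect to the expiry listener, so a small race window remains:

import java.util.Map;
import java.util.concurrent.TimeUnit;
import net.jodah.expiringmap.ExpirationPolicy;
import net.jodah.expiringmap.ExpiringMap;

public class ExpiringMapWriteBehind {
    Map<String, Object> cache = ExpiringMap.builder()
            .expiration(60, TimeUnit.SECONDS)
            .expirationPolicy(ExpirationPolicy.CREATED)
            // invoked when an entry lives out its 60 seconds unread
            .expirationListener((String key, Object value) -> persist(key, value))
            .build();

    public void add(String key, Object data) {
        cache.put(key, data);
    }

    public Object get(String key) {
        return cache.remove(key); // permit single reads: remove on access
    }

    void persist(String key, Object value) { /* database call goes here */ }
}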
Personally, I believe a cache should be in every developer's toolbox. However, I am the author of cache2k, so of course I would say that.
Below is our code to delete a property for a given entity type:
@Override
public boolean deleteProperty(String instance, String storeName, String propertyName) {
final boolean[] success = {false};
final PersistentEntityStore entityStore = manager.getPersistentEntityStore(xodusRoot, instance);
try {
entityStore.executeInTransaction(new StoreTransactionalExecutable() {
@Override
public void execute(@NotNull final StoreTransaction txn) {
EntityIterable entities = txn.findWithProp(storeName, propertyName);
final boolean[] hasError = {false};
entities.forEach(entity -> {
if(!entity.deleteProperty(propertyName)) {
hasError[0] = true;
}
});
success[0] = !hasError[0];
}
});
} finally {
//entityStore.close();
}
return success[0];
}
I understand that Xodus is transactional and that if one of the deleteProperty operations fails, it will roll back (I would like this confirmed).
Still, is there an official way to delete a property for all existing entities of a given type?
I understand that Xodus is transactional and that if one of the deleteProperty operations fails, it will roll back (I would like this confirmed).
Yes, it's true. Here the transaction will be flushed after the StoreTransactionalExecutable performs its job. But you can split the EntityIterable into batches (of size 100, for example) and call txn.flush() after processing each batch. Do not forget to check the flush result, since it returns a boolean.
Still, is there an official way to delete a property for all existing entities of a given type?
No, there isn't. Only manually, as I described above.
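A sketch of that batching idea (my illustration on top of your existing code; the batch size of 100 is arbitrary):

entityStore.executeInTransaction(txn -> {
    final EntityIterable entities = txn.findWithProp(storeName, propertyName);
    int processed = 0;
    for (Entity entity : entities) {
        entity.deleteProperty(propertyName);
        if (++processed % 100 == 0) {
            // flush() persists the batch and returns false if it failed
            if (!txn.flush()) {
                throw new IllegalStateException("flush failed after " + processed + " entities");
            }
        }
    }
    // whatever remains is flushed on commit when executeInTransaction returns
});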
I'm trying to set up a cache using Guava, with the following code:
private List<Profile> buildCache() {
LoadingCache cache = CacheBuilder.newBuilder()
.expireAfterWrite(10, TimeUnit.MINUTES)
.maximumSize(40)
.build(
new CacheLoader<Profile, List<Profile>>() {
@Override
public List<Profile> load(Profile profile) throws Exception {
Profile profile1 = new Profile();
profile1.setEmployed(true);
return profileDAO.getAllProfiles(profile1, null);
}
}
);
return (List<Profile>) cache;
}
public List<Profile> getAllProfiles(Profile profile, Integer size) throws Exception {
return profileDAO.getAllProfiles(profile, size);
}
The idea here is that this will create a cache of getAllProfiles results. That method takes a new Profile object used to set a boolean for whether the employee is employed or not. The size parameter determines how many results are returned; when it is null, the method defaults to the top 10.
I have two issues:
1. This is the first time I have ever used a cache, so I really do not know if I am doing this correctly.
2. I cannot find anything in the documentation on how to implement this within my app. How am I supposed to call this? I tried modifying the getAllProfiles method to return it:
public List<Profile> getAllProfiles(Profile profile, Integer size) throws Exception {
return buildCache();
}
But that simply throws an exception saying the cache cannot be cast to a Java List:
Exception occurred: java.lang.ClassCastException: com.google.common.cache.LocalCache$LocalLoadingCache cannot be cast to java.util.List
If it's any help, my app is also using Spring, so I've been researching that as well. Is there any difference between springframework.cache.guava and google.common.cache, or is the former just Spring's wrapper around Guava's cache?
OK, I think I managed to figure it out:
private LoadingCache<Integer, List<Profile>> loadingCache = CacheBuilder.newBuilder()
.refreshAfterWrite(10,TimeUnit.MINUTES)
.maximumSize(100).build(
new CacheLoader<Integer, List<Profile>>() {
@Override
public List<Profile> load(Integer integer) throws Exception {
Profile profile= new Profile();
if (integer == null) {
integer = 10;
}
return profileDAO.getAllProfiles(profile, integer);
}
}
);
First, I should have specified the key and value types passed to the LoadingCache, in this case an Integer and a List of Profile. Also, when declaring the new CacheLoader in the build function, I should have kept that same key and value layout. Finally, when calling get, I should have loaded using the Integer key, not a Profile object.
As for calling the function:
public List<Profile> getAllProfiles(Profile profile, Integer size) throws Exception {
return loadingCache.get(size);
}
This serves to get lists of certain lengths that are stored in the cache. If a list of that length is not in the cache, the loader will run, calling getAllProfiles with the size variable you pass in.
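One caveat worth adding (based on Guava's documented behavior, not part of the original post): a LoadingCache rejects null keys, so loadingCache.get(null) throws a NullPointerException before load ever runs, which makes the if (integer == null) check inside the loader dead code. Applying the default before the lookup avoids that:

public List<Profile> getAllProfiles(Profile profile, Integer size) throws Exception {
    // Guava disallows null keys, so apply the default size before the lookup
    return loadingCache.get(size == null ? 10 : size);
}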
@Eugene, thank you for your help. Your explanation of the load method really helped put the cache into perspective for me.
I have two processes, and for each process I get a different Record object that I need to validate. This means I cannot use a single validator, as I have to validate different fields for each process.
For processA, I am using ValidatorA class to validate its Record object.
For processB, I am using ValidatorB class to validate its Record object.
If they are valid, I move forward; otherwise I don't. Below is my process code, for both A and B:
public class ProcessConsumer implements Runnable {
private static final Logger logger = Logger.getInstance(ProcessConsumer.class);
private final String processName;
private final Validator validator;
private final RecordProcessor<byte[], byte[]> process;
public ProcessConsumer(String processName, Validator validator) {
this.processName = processName;
this.validator = validator;
this.process = new RecordProcessor<>();
}
@Override
public void run() {
try {
process.subscribe(getTopicsBasedOnProcessName(processName));
....
while (true) {
ConsumerRecords<byte[], byte[]> crs = process.poll(2000);
for (ConsumerRecord<byte[], byte[]> cr : crs) {
// record object will be different for my both the processes.
Record record = decoder.decode(cr.value());
Optional<DataHolder> validatedHolder = validator.getDataHolder(processName, record);
if (!validatedHolder.isPresent()) {
logger.logError("records dropped. payload= ", record);
continue;
}
// send validatedHolder to processor class
Processor.getInstance().execute(validatedHolder);
}
}
} catch (Exception ex) {
logger.logError("error= ", ExceptionUtils.getStackTrace(ex));
}
}
}
Below is my ValidatorA class, in which I validate a few fields on the record object; if it is valid, I return a DataHolder:
public class ValidatorA extends Validator {
private static final Logger logger = Logger.getInstance(ValidatorA.class);
@Override
public Optional<DataHolder> getDataHolder(String processName, Record record) {
Optional<DataHolder> dataHolder = Optional.absent();
if (isValid(processName, record))
dataHolder = Optional.of(buildDataHolder(processName, record));
return dataHolder;
}
private boolean isValid(String processName, Record record) {
return isValidClientIdDeviceId(processName, record) && isValidPayId(processName, record)
&& isValidHolder(processName, record);
}
private DataHolder buildDataHolder(String processName, Record record) {
Map<String, String> holder = (Map<String, String>) DataUtils.extract(record, "holder");
String deviceId = (String) DataUtils.extract(record, "deviceId");
Integer payId = (Integer) DataUtils.extract(record, "payId");
String clientId = (String) DataUtils.extract(record, "clientId");
// add mandatory fields in the holder map after getting other fields
holder.put("isClientId", (clientId == null) ? "false" : "true");
holder.put("isDeviceId", (clientId == null) ? "true" : "false");
holder.put("abc", (clientId == null) ? deviceId : clientId);
return new DataHolder.Builder(record).setClientId(clientId).setDeviceId(deviceId)
.setPayId(String.valueOf(payId)).setHolder(holder).build();
}
private boolean isValidHolder(String processName, Record record) {
Map<String, String> holder = (Map<String, String>) DataUtils.extract(record, "holder");
if (MapUtils.isEmpty(holder)) {
logger.logError("invalid holder is coming.");
return false;
}
return true;
}
private boolean isValidPayId(String processName, Record record) {
Integer payId = (Integer) DataUtils.extract(record, "payId");
if (payId == null) {
logger.logError("invalid payId is coming.");
return false;
}
return true;
}
private boolean isValidClientIdDeviceId(String processName, Record record) {
String deviceId = (String) DataUtils.extract(record, "deviceId");
String clientId = (String) DataUtils.extract(record, "clientId");
if (Strings.isNullOrEmpty(clientId) && Strings.isNullOrEmpty(deviceId)) {
logger.logError("invalid clientId and deviceId is coming.");
return false;
}
return true;
}
}
And below is my ValidatorB class, in which I validate a few different fields than ValidatorA; if the record is valid, I return a DataHolder:
public class ValidatorB extends Validator {
private static final Logger logger = Logger.getInstance(ValidatorB.class);
@Override
public Optional<DataHolder> getDataHolder(String processName, Record record) {
Optional<DataHolder> dataHolder = Optional.absent();
if (isValid(processName, record))
dataHolder = Optional.of(buildDataHolder(processName, record));
return dataHolder;
}
private boolean isValid(String processName, Record record) {
return isValidType(processName, record) && isValidDatumId(processName, record) && isValidItemId(processName, record);
}
private DataHolder buildDataHolder(String processName, Record record) {
String type = (String) DataUtils.extract(record, "type");
String datumId = (String) DataUtils.extract(record, "datumId");
String itemId = (String) DataUtils.extract(record, "itemId");
return new DataHolder.Builder(record).setType(type).setDatumId(datumId)
.setItemId(itemId).build();
}
private boolean isValidType(String processName, Record record) {
String type = (String) DataUtils.extract(record, "type");
if (Strings.isNullOrEmpty(type)) {
logger.logError("invalid type is coming.");
return false;
}
return true;
}
private boolean isValidDatumId(String processName, Record record) {
String datumId = (String) DataUtils.extract(record, "datumId");
if (Strings.isNullOrEmpty(datumId)) {
logger.logError("invalid datumId is coming.");
return false;
}
return true;
}
private boolean isValidItemId(String processName, Record record) {
String itemId = (String) DataUtils.extract(record, "itemId");
if (Strings.isNullOrEmpty(itemId)) {
logger.logError("invalid itemId is coming.");
return false;
}
return true;
}
}
And below is my abstract class:
public abstract class Validator {
public abstract Optional<DataHolder> getDataHolder(String processName, Record record);
}
Question:
This is how I am calling both of my processes. As you can see, I am passing the processName and its particular validator as constructor arguments.
ProcessConsumer processA = new ProcessConsumer("processA", new ValidatorA());
ProcessConsumer processB = new ProcessConsumer("processB", new ValidatorB());
Is this a good design, where I pass each process its validator?
Is there any way to avoid passing that and internally figure out which validator to use based on the processName? I already have an enum with all my process names. I need to make this design extensible so that if I add a new process in the future, it scales.
Also, is the way I have my abstract class Validator right? It doesn't look like it is doing anything useful.
Each of my validators basically tries to determine whether the record object is valid. If it is, it builds and returns a DataHolder; otherwise it returns Optional.absent().
I saw this post where they talked about using the decorator pattern, but I am not sure how that would help me in this case.
First, when I look at their declaration and implementation:
public abstract class Validator {
public abstract Optional<DataHolder> getDataHolder(String processName, Record record);
}
I don't think "Validator" is the best term. Your validators are not only validators: their main function is to extract data for a specific process. The extraction requires validation, but validation is not the main function, while the main function of a validator is validating.
So I think you could call them something like ProcessDataExtractor.
ProcessConsumer processA = new ProcessConsumer("processA", new ValidatorA());
ProcessConsumer processB = new ProcessConsumer("processB", new ValidatorB());
Is this a good design, where I pass each process its validator? Is there any way to avoid passing that and internally figure out which validator to use based on the processName? I already have an enum with all my process names. I need to make this design extensible so that if I add a new process in the future, it scales.
Scalability is another thing.
Having an extensible design broadly means having a design which doesn't require important or risky modifications whenever a new "normal" requirement appears during the life of the application.
If a new process consumer is added, you have to add a ProcessDataExtractor according to your needs, and the client should be aware of this new process consumer.
If the client code instantiates its process consumer and its data extractor at compile time, using an enum and a map to represent process consumers and data extractors doesn't make your design inextensible, since it requires very little modification and the changes are isolated.
If you want still less modification in your code, you could instantiate the extractors by reflection, using a naming convention to retrieve them: for example, place them always in the same package and name each extractor with the same prefix, such as ProcessDataExtractorXXX, where XXX is the variable part.
The problem with this solution appears at compile time: clients don't necessarily know which ProcessDataExtractors are available.
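A minimal sketch of that reflection-based lookup (the package name and prefix are illustrative assumptions):

public static ProcessDataExtractor forProcess(String name) {
    try {
        // naming convention: com.example.extractors.ProcessDataExtractorA, ...B, etc.
        Class<?> clazz = Class.forName("com.example.extractors.ProcessDataExtractor" + name);
        return (ProcessDataExtractor) clazz.getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
        throw new IllegalArgumentException("no extractor found for process " + name, e);
    }
}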
If you want the addition of a new process consumer and extractor to be dynamic, that is, to happen while the application is running, with clients able to retrieve them at runtime too, that is another subject, I think.
At compile time, the design could be better, because so far clients of the ProcessConsumer and ProcessDataExtractor classes may misuse them (that is, use process A with ProcessDataExtractor B).
To avoid that, there are several ways of doing it.
But you have guessed the idea: making the initialization and the mapping between ProcessConsumer and ProcessDataExtractor in a dedicated place, in a protected way.
To achieve it, I advise you to introduce an interface for ProcessConsumer which provides only the functional method from Runnable:
public interface IProcessConsumer extends Runnable {
}
From now on, clients who want to consume a process should only use this interface to perform their task. We don't want to provide the client a method or constructor that lets it choose the data extractor.
To do that, the concrete ProcessConsumer class should be a private inner class, so that clients cannot instantiate it directly.
They will have to use a factory to address this need.
In this way, clients are able to create a specific process consumer with its required data extractor by asking a factory of processes, which is responsible for keeping each data extractor consistent with its process, and which also guarantees that a new process consumer is instantiated at each call (your processes are stateful, so you have to create a new process consumer for each one you start).
Here is the ProcessConsumerFactory class :
import java.util.HashMap;
import java.util.Map;
public class ProcessConsumerFactory {
public static enum ProcessType {
A("processA"), B("processB");
private String name;
ProcessType(String name) {
this.name = name;
}
public String getName() {
return name;
}
}
private class ProcessConsumer implements IProcessConsumer {
private final ProcessType processType;
private final ProcessDataExtractor extractor;
private final RecordProcessor<byte[], byte[]> process;
public ProcessConsumer(ProcessType processType, ProcessDataExtractor extractor) {
this.processType = processType;
this.extractor = extractor;
this.process = new RecordProcessor<>();
}
@Override
public void run() {
// your implementation...
}
}
private static ProcessConsumerFactory instance = new ProcessConsumerFactory();
private Map<ProcessType, ProcessDataExtractor> extractorsByProcessName;
private ProcessConsumerFactory() {
extractorsByProcessName = new HashMap<>();
extractorsByProcessName.put(ProcessType.A, new ProcessDataExtractorA());
extractorsByProcessName.put(ProcessType.B, new ProcessDataExtractorB());
// add a new element in the map to add a new mapping
}
public static ProcessConsumerFactory getInstance() {
return instance;
}
public IProcessConsumer createNewProcessConsumer(ProcessType processType) {
ProcessDataExtractor extractor = extractorsByProcessName.get(processType);
if (extractor == null) {
throw new IllegalArgumentException("processType " + processType + " not recognized");
}
IProcessConsumer processConsumer = new ProcessConsumer(processType, extractor);
return processConsumer;
}
}
Now clients of the process consumer classes can instantiate them like this:
IProcessConsumer processConsumer = ProcessConsumerFactory.getInstance().createNewProcessConsumer(ProcessType.A);
Also, is the way I have my abstract class Validator right? It doesn't look like it is doing anything useful.
You use an abstract class for the validators, but for the moment you have not moved any common behavior into it, so you could use an interface instead:
public interface ProcessDataExtractor{
Optional<DataHolder> getDataHolder(ProcessType processType, Record record);
}
You could introduce the abstract class later if it becomes suitable.
There are a few problems with your design:
Catch invalid data as early as possible.
Post-construction validation is not the right way. Once an object, in this case Record, is constructed, it should have valid state. This means your validation should be performed prior to constructing the Record:
Get data from somewhere: the web, a database, a text file, user input or whatever.
Validate the data.
Construct the Record object. At this point, either the Record object has valid state, or it fails construction, for example by raising an exception.
Now, if the source from which you get the data contains mostly invalid data, that should be dealt with there, because it is a problem in itself. If the source is right but reading or getting the data has problems, that should be solved first.
Assuming the above issues, if any exist, are solved, the sequence of the program should be something like this:
// Get the data from some source
// Perform validation on the data. This is generic validation,
// like validation of data read from an input form etc.
validate deviceId
validate payId
validate clientId
...
if invalid do something
else if valid proceed
// construct Record object
Record record = new Record(deviceId, payId, clientId, ...)
// At this point record has valid data
public class Record {
deviceId
payId
clientId
Record(deviceId, payId, clientId, ...) {
// Perform business rule validation, pertaining to Record's requirements.
// For example, if deviceId must be in certain range etc.
// if data is valid, perform construction.
// else if invalid, don't construct. throw exception or something
// to deal with invalid condition
}
}
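To make that concrete, here is a minimal sketch of such a Record (field names taken from your code; the exact rules are placeholders), which also provides its own getters, as discussed next:

public class Record {
    private final String deviceId;
    private final Integer payId;
    private final String clientId;

    public Record(String deviceId, Integer payId, String clientId) {
        // business-rule validation runs before any state is assigned
        if (payId == null) {
            throw new IllegalArgumentException("payId must not be null");
        }
        if (isNullOrEmpty(clientId) && isNullOrEmpty(deviceId)) {
            throw new IllegalArgumentException("either clientId or deviceId must be set");
        }
        this.deviceId = deviceId;
        this.payId = payId;
        this.clientId = clientId;
    }

    private static boolean isNullOrEmpty(String s) {
        return s == null || s.isEmpty();
    }

    // getters instead of a utils class reaching into the record
    public String getDeviceId() { return deviceId; }
    public Integer getPayId() { return payId; }
    public String getClientId() { return clientId; }
}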
Another problem is that you use some utils class to extract internal data from Record. That is not right either; Record itself should provide getters for its attributes. Right now, what is related to Record is scattered between Record, Utils, and Validator.
I think your code needs a thorough re-evaluation. I suggest you take a pause and start again, but this time at a higher level. Start designing without code, for example with some sort of diagramming. Start with only boxes and arrows (something like a class diagram, but no need for a UML tool; pencil and paper will do). Decide what should go where. Think about things like:
What are the entities you are dealing with?
What properties does each entity have: attributes, methods etc.?
What are the relationships among these entities?
In what sequence are these entities used, and how do they interact? Then keep refining it.
Without this high-level view, dealing with design questions at the code level is difficult and almost always gives bad results.
Once you have dealt with the design at a higher level, putting it into code is much easier. Of course you can refine it at that level too, but the high-level structure should come first.