Flink deduplication and processWindowFunction - java

I'm creating a pipeline whose inputs are JSON messages containing a timestamp field, which is used to set the event time. The problem is that some records may arrive late or be duplicated, and these situations need to be handled. To avoid duplicates I tried the following solution:
.assignTimestampsAndWatermarks(new RecordWatermark()
.withTimestampAssigner(new ExtractRecordTimestamp()))
.keyBy(new MetricGrouper())
.window(TumblingEventTimeWindows.of(Time.seconds(60)))
.trigger(ContinuousEventTimeTrigger.of(Time.seconds(3)))
.process(new WindowedFilter())
.keyBy(new MetricGrouper())
.window(TumblingEventTimeWindows.of(Time.seconds(180)))
.trigger(ContinuousEventTimeTrigger.of(Time.seconds(15)))
.process(new WindowedCountDistinct())
.map((value) -> value.toString());
where the first windowing operation filters the records based on timestamps saved in a set, as follows:
public class WindowedFilter extends ProcessWindowFunction<MetricObject, MetricObject, String, TimeWindow> {
    HashSet<Long> previousRecordTimestamps = new HashSet<>();
    @Override
    public void process(String s, Context context, Iterable<MetricObject> inputs, Collector<MetricObject> out) throws Exception {
        String windowStart = DateTimeFormatter.ISO_INSTANT.format(Instant.ofEpochMilli(context.window().getStart()));
        String windowEnd = DateTimeFormatter.ISO_INSTANT.format(Instant.ofEpochMilli(context.window().getEnd()));
        log.info("window start: '{}', window end: '{}'", windowStart, windowEnd);
        Long watermark = context.currentWatermark();
        log.info(inputs.toString());
        for (MetricObject in : inputs) {
            Long recordTimestamp = in.getTimestamp().toEpochMilli();
            // Emit a record only the first time its timestamp is seen
            if (!previousRecordTimestamps.contains(recordTimestamp)) {
                log.info("timestamp not contained");
                previousRecordTimestamps.add(recordTimestamp);
                out.collect(in);
            }
        }
    }
}
This solution works, but I have the feeling that I'm not considering something important, or that it could be done in a better way.

One potential problem with using windows for deduplication is that the windows implemented in Flink's DataStream API are always aligned to the epoch. This means that, for example, an event occurring at 11:59:59, and a duplicate occurring at 12:00:01, will be placed into different minute-long windows.
However, in your case it appears that the duplicates you are concerned about also carry the same timestamp. In that case, what you're doing will produce correct results, so long as you're not concerned about the watermarking producing late events.
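The epoch alignment mentioned above can be checked with plain arithmetic. The following is a minimal sketch, not Flink's own code: the class name `WindowAlignment` and the helper `windowStart` are made up for illustration, the date is arbitrary, and a window offset of zero is assumed.

```java
import java.time.Duration;
import java.time.Instant;

class WindowAlignment {
    // Start of the epoch-aligned tumbling window that contains the timestamp
    // (assumption: window offset is zero).
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, windowSizeMillis);
    }

    public static void main(String[] args) {
        long size = Duration.ofMinutes(1).toMillis();
        long first = Instant.parse("2021-01-01T11:59:59Z").toEpochMilli();
        long duplicate = Instant.parse("2021-01-01T12:00:01Z").toEpochMilli();

        // The two events are two seconds apart but land in different windows.
        System.out.println(Instant.ofEpochMilli(windowStart(first, size)));     // 2021-01-01T11:59:00Z
        System.out.println(Instant.ofEpochMilli(windowStart(duplicate, size))); // 2021-01-01T12:00:00Z
    }
}
```

Two records with an identical timestamp always map to the same window start, which is why the window-based approach is still correct for the case described in the question.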
The other issue with using windows for deduplication is the latency they impose on the pipeline, and the workarounds used to minimize that latency.
This is why I prefer to implement deduplication with a RichFlatMapFunction or a KeyedProcessFunction. Something like this will perform better than a window:
private static class Event {
    public final String key;
    Event(String key) {
        this.key = key;
    }
}
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.addSource(new EventSource())
        .keyBy(e -> e.key)
        .flatMap(new Deduplicate())
        .print();
    env.execute();
}
public static class Deduplicate extends RichFlatMapFunction<Event, Event> {
    // Keyed state: one flag per key, set once that key has been seen
    ValueState<Boolean> seen;
    @Override
    public void open(Configuration conf) {
        // Let the flag expire one minute after it was written
        StateTtlConfig ttlConfig = StateTtlConfig
            .newBuilder(Time.minutes(1))
            .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
            .cleanupFullSnapshot()
            .build();
        ValueStateDescriptor<Boolean> desc = new ValueStateDescriptor<>("seen", Types.BOOLEAN);
        desc.enableTimeToLive(ttlConfig);
        seen = getRuntimeContext().getState(desc);
    }
    @Override
    public void flatMap(Event event, Collector<Event> out) throws Exception {
        // Forward only the first event for each key
        if (seen.value() == null) {
            out.collect(event);
            seen.update(true);
        }
    }
}
Here the stream is being deduplicated by key, and the state involved is being automatically cleared after one minute.
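To make the behaviour of the keyed flag with a one-minute TTL easier to picture, here is a plain-Java model of it. This is only a sketch of the semantics, not Flink code: `TtlDeduplicator` and `firstOccurrence` are hypothetical names, and time is passed in explicitly instead of coming from the runtime.

```java
import java.util.HashMap;
import java.util.Map;

class TtlDeduplicator {
    private final long ttlMillis;
    // Per key: the time at which the key was first seen (the "seen" flag plus its write time).
    private final Map<String, Long> firstSeenAt = new HashMap<>();

    TtlDeduplicator(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // Returns true if the event should be forwarded, false if it is a duplicate.
    boolean firstOccurrence(String key, long nowMillis) {
        Long seenAt = firstSeenAt.get(key);
        if (seenAt != null && nowMillis - seenAt < ttlMillis) {
            return false; // flag still alive: treat as duplicate
        }
        // No flag, or the flag has expired: remember the key again.
        firstSeenAt.put(key, nowMillis);
        return true;
    }

    public static void main(String[] args) {
        TtlDeduplicator dedup = new TtlDeduplicator(60_000L);
        System.out.println(dedup.firstOccurrence("a", 0L));      // true
        System.out.println(dedup.firstOccurrence("a", 30_000L)); // false
        System.out.println(dedup.firstOccurrence("b", 30_000L)); // true
        System.out.println(dedup.firstOccurrence("a", 60_000L)); // true
    }
}
```

The last call shows the trade-off of the TTL: a duplicate that arrives after the flag has expired is forwarded again, so the TTL should be longer than the largest expected gap between duplicates.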

Related

Apache Drill: Write general-purpose array_agg UDF

I would like to create an array_agg UDF for Apache Drill to be able to aggregate all values of a group to a list of values.
This should work with any major types (required, optional) and minor types (varchar, dict, map, int, etc.)
However, I get the impression that Apache Drill's UDF API does not really make use of inheritance and generics. Each type has its own writer and handler, and they cannot be abstracted to handle any type. E.g., the ValueHolder interface seems to be purely cosmetic and cannot be used to have type-agnostic hooking of UDFs to any type.
My current implementation
I tried to solve this by using Java's reflection so I could use the ListHolder's write function independent of the holder of the original value.
However, I then ran into the limitations of the @FunctionTemplate annotation.
I cannot create a general UDF annotation for any value (I tried it with the interface ValueHolder: @Param ValueHolder input).
So to me it seems like the only way to support different types is to have separate classes for each type. But I can't even abstract much and work on any @Param input, because input is only visible in the class where it's defined (i.e. type specific).
I based my implementation on https://issues.apache.org/jira/browse/DRILL-6963
and created the following two classes for required and optional varchars (how can this be unified in the first place?)
@FunctionTemplate(
    name = "array_agg",
    scope = FunctionScope.POINT_AGGREGATE,
    nulls = NullHandling.INTERNAL
)
public static class VarChar_Agg implements DrillAggFunc {
    @Param org.apache.drill.exec.expr.holders.VarCharHolder input;
    @Workspace ObjectHolder agg;
    @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out;
    @Override
    public void setup() {
        agg = new ObjectHolder();
    }
    @Override
    public void reset() {
        agg = new ObjectHolder();
    }
    @Override
    public void add() {
        if (agg.obj == null) {
            // Initialise list object for output
            agg.obj = out.rootAsList();
        }
        org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter =
            (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj;
        listWriter.varChar().write(input);
    }
    @Override
    public void output() {
        ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj).endList();
    }
}
@FunctionTemplate(
    name = "array_agg",
    scope = FunctionScope.POINT_AGGREGATE,
    nulls = NullHandling.INTERNAL
)
public static class NullableVarChar_Agg implements DrillAggFunc {
    @Param NullableVarCharHolder input;
    @Workspace ObjectHolder agg;
    @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out;
    @Override
    public void setup() {
        agg = new ObjectHolder();
    }
    @Override
    public void reset() {
        agg = new ObjectHolder();
    }
    @Override
    public void add() {
        if (agg.obj == null) {
            // Initialise list object for output
            agg.obj = out.rootAsList();
        }
        // Skip null inputs
        if (input.isSet != 1) {
            return;
        }
        org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter =
            (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj;
        // Copy the nullable holder into a required holder before writing
        org.apache.drill.exec.expr.holders.VarCharHolder outHolder = new org.apache.drill.exec.expr.holders.VarCharHolder();
        outHolder.start = input.start;
        outHolder.end = input.end;
        outHolder.buffer = input.buffer;
        listWriter.varChar().write(outHolder);
    }
    @Override
    public void output() {
        ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj).endList();
    }
}
Interestingly, I can't import org.apache.drill.exec.vector.complex.writer.BaseWriter to make the whole thing easier because then Apache Drill would not find it.
So I have to put the entire package path for everything in org.apache.drill.exec.vector.complex.writer in the code.
Furthermore, I'm using the deprecated ObjectHolder. Any better solution?
Anyway: These work so far, e.g. with this query:
SELECT
MIN(tbl.`timestamp`) AS start_view,
MAX(tbl.`timestamp`) AS end_view,
array_agg(tbl.eventLabel) AS label_agg
FROM `dfs.root`.`/path/to/avro/folder` AS tbl
WHERE tbl.data.slug IS NOT NULL
GROUP BY tbl.data.slug
However, when I use ORDER BY, I get this:
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: UnsupportedOperationException: NULL
Fragment 0:0
Additionally, I tried more complex types, namely maps/dicts.
Interestingly, when I call SELECT sqlTypeOf(tbl.data) FROM tbl, I get MAP.
But when I write UDFs, the query planner complains about having no UDF array_agg for type dict.
Anyway, I wrote a version for dicts:
@FunctionTemplate(
    name = "array_agg",
    scope = FunctionScope.POINT_AGGREGATE,
    nulls = NullHandling.INTERNAL
)
public static class Map_Agg implements DrillAggFunc {
    @Param MapHolder input;
    @Workspace ObjectHolder agg;
    @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out;
    @Override
    public void setup() {
        agg = new ObjectHolder();
    }
    @Override
    public void reset() {
        agg = new ObjectHolder();
    }
    @Override
    public void add() {
        if (agg.obj == null) {
            // Initialise list object for output
            agg.obj = out.rootAsList();
        }
        org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter =
            (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj;
        //listWriter.copyReader(input.reader);
        input.reader.copyAsValue(listWriter);
    }
    @Override
    public void output() {
        ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj).endList();
    }
}
@FunctionTemplate(
    name = "array_agg",
    scope = FunctionScope.POINT_AGGREGATE,
    nulls = NullHandling.INTERNAL
)
public static class Dict_agg implements DrillAggFunc {
    @Param DictHolder input;
    @Workspace ObjectHolder agg;
    @Output org.apache.drill.exec.vector.complex.writer.BaseWriter.ComplexWriter out;
    @Override
    public void setup() {
        agg = new ObjectHolder();
    }
    @Override
    public void reset() {
        agg = new ObjectHolder();
    }
    @Override
    public void add() {
        if (agg.obj == null) {
            // Initialise list object for output
            agg.obj = out.rootAsList();
        }
        org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter listWriter =
            (org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj;
        //listWriter.copyReader(input.reader);
        input.reader.copyAsValue(listWriter);
    }
    @Override
    public void output() {
        ((org.apache.drill.exec.vector.complex.writer.BaseWriter.ListWriter) agg.obj).endList();
    }
}
But here, I get an empty list in the field data_agg for my query:
SELECT
MIN(tbl.`timestamp`) AS start_view,
MAX(tbl.`timestamp`) AS end_view,
array_agg(tbl.data) AS data_agg
FROM `dfs.root`.`/path/to/avro/folder` AS tbl
GROUP BY tbl.data.viewSlag
Summary of questions
Most importantly: How do I create an array_agg UDF for Apache Drill?
How do I make UDFs type-agnostic/general purpose? Do I really have to implement an entire class for each Nullable, Required and Repeated version of all types? That's a lot to do and quite tedious. Isn't there a way to handle values in a UDF agnostic to the underlying types?
I wish Apache Drill would just use what Java offers here with generic function types, specialised function overloading and inheritance of their own type system. Am I missing something on how to do that?
How can I fix the NULL problem when I use ORDER BY on my varchar version of the aggregate?
How can I fix the problem where my aggregate of maps/dicts is an empty list?
Is there an alternative to using the deprecated ObjectHolder?
To answer your question, unfortunately you've run into one of the limits of the Drill Aggregate UDF API, which is that it can only return simple data types. It would be a great improvement to Drill to fix this, but that is the current status. If you're interested in discussing that further, please start a thread on the Drill user group and/or Slack channel. I don't think it is impossible, but it would require some modification to the Drill internals. IMHO it would be well worth it because there are a few other UDFs that I'd like to implement that need this feature.
The second part of your question is how to make UDFs type agnostic and once again... you've found yet another bit of ugliness in the UDF API. :-) If you do some digging in the codebase, you'll see that most of the Math functions have separate versions that accept FLOAT, INT, etc.
Regarding the aggregate of null or empty lists. I actually have some good news here... The current way of doing that is to provide two versions of the function, one which accepts regular holders and the second which accepts nullable holders and returns an empty list or map if the inputs are null. Yes, this sucks, but the additional good news is that I'm working on cleaning this up and hopefully will have a PR submitted soon that will eliminate the need to do this.
Regarding the ObjectHolder, I wrote a median function that uses a few Stacks to compute a streaming median and I used the ObjectHolder for that. I think it will be with us for some time as there is no alternative at the moment.
I hope this answers your questions.

Spring Batch : Write a List to a database table using a custom batch size

Background
I have a Spring Batch job where:
1) FlatFileItemReader - Reads one row at a time from the file
2) ItemProcessor - Transforms the row from the file into a List<MyObject> and returns the List. That is, each row in the file is broken down into a List<MyObject> (1 row in file transformed to many output rows).
3) ItemWriter - Writes the List<MyObject> to a database table. (I used this implementation to unpack the list received from the processor and delegate to a JdbcBatchItemWriter)
Question
At point 2) The processor can return a List of 100000 MyObject instances.
At point 3), The delegate JdbcBatchItemWriter will end up writing the entire List with 100000 objects to the database.
My question is: The JdbcBatchItemWriter does not allow a custom batch size. For all practical purposes, the batch size equals the commit-interval of the step. With this in mind, is there another implementation of an ItemWriter available in Spring Batch that writes to the database and allows a configurable batch size? If not, how do I go about writing a custom writer myself to achieve this?
I see no obvious way to set the batch size on the JdbcBatchItemWriter. However, you can extend the writer and use a custom BatchPreparedStatementSetter to specify the batch size. Here is a quick example:
public class MyCustomWriter<T> extends JdbcBatchItemWriter<T> {
    @Override
    public void write(List<? extends T> items) throws Exception {
        namedParameterJdbcTemplate.getJdbcOperations().batchUpdate("your sql", new BatchPreparedStatementSetter() {
            @Override
            public void setValues(PreparedStatement ps, int i) throws SQLException {
                // set values on your sql
            }
            @Override
            public int getBatchSize() {
                return items.size(); // or any other value you want
            }
        });
    }
}
The StagingItemWriter in the samples is an example of how to use a custom BatchPreparedStatementSetter as well.
The answer from Mahmoud Ben Hassine and the comments pretty much cover all aspects of the solution, and it is the accepted answer.
Here is the implementation I used if anyone is interested :
public class JdbcCustomBatchSizeItemWriter<W> extends JdbcDaoSupport implements ItemWriter<W> {
private int batchSize;
private ParameterizedPreparedStatementSetter<W> preparedStatementSetter;
private String sqlFileLocation;
private String sql;
public void initReader() {
this.setSql(FileUtilities.getFileContent(sqlFileLocation));
}
public void write(List<? extends W> arg0) throws Exception {
getJdbcTemplate().batchUpdate(sql, Collections.unmodifiableList(arg0), batchSize, preparedStatementSetter);
}
public void setBatchSize(int batchSize) {
this.batchSize = batchSize;
}
public void setPreparedStatementSetter(ParameterizedPreparedStatementSetter<W> preparedStatementSetter) {
this.preparedStatementSetter = preparedStatementSetter;
}
public void setSqlFileLocation(String sqlFileLocation) {
this.sqlFileLocation = sqlFileLocation;
}
public void setSql(String sql) {
this.sql = sql;
}
}
Note:
The use of Collections.unmodifiableList avoids the need for any explicit casting.
I use sqlFileLocation to specify an external file that contains the SQL, and FileUtilities.getFileContent simply returns the contents of this SQL file. This can be skipped, and one can pass the SQL directly to the class when creating the bean.
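To illustrate what a configurable batch size means for a list of 100000 objects, here is a small plain-Java sketch of splitting one large list into fixed-size batches. It only models the idea; it is not Spring's implementation, and `BatchSplitter` and `split` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class BatchSplitter {
    // Splits items into consecutive sub-lists of at most batchSize elements.
    static <T> List<List<T>> split(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < items.size(); from += batchSize) {
            int to = Math.min(from + batchSize, items.size());
            batches.add(items.subList(from, to)); // view on the original list, no copy
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            items.add(i);
        }
        // 100000 items with a batch size of 1000 give 100 batches.
        System.out.println(split(items, 1000).size()); // 100
    }
}
```

Each sub-list would correspond to one JDBC batch sent to the database, so the batch size can be tuned independently of how many objects the processor returns for a single row.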
I wouldn't do this. It presents issues for restartability. Instead, modify your reader to produce individual items rather than having your processor take in an object and return a list.

Add Fields to Csv with Spark

So, I have a CSV which contains spatial (latitude, longitude) and temporal (timestamp) data.
To be useful for us, we converted the spatial information to "geohash", and the temporal information to "timehash".
The problem is: how do we add the geohash and timehash as fields for each row in the CSV with Spark (since the data is about 200 GB)?
We tried to use JavaPairRDD and its function mapToPair, but the problem remains of how to convert back to a JavaRDD and then to CSV. So I think this was a bad solution, and I'm asking for a simpler way.
Update of question:
After @Alvaro's help I have created this Java class:
public class Hash {
public static SparkConf Spark_Config;
public static JavaSparkContext Spark_Context;
UDF2 geohashConverter = new UDF2<Long, Long, String>() {
public String call(Long latitude, Long longitude) throws Exception {
// convert here
return "calculate_hash";
}
};
UDF1 timehashConverter = new UDF1<Long, String>() {
public String call(Long timestamp) throws Exception {
// convert here
return "calculate_hash";
}
};
public Hash(String path) {
SparkSession spark = SparkSession
.builder()
.appName("Java Spark SQL Example")
.config("spark.master", "local")
.getOrCreate();
spark.udf().register("geohashConverter", geohashConverter, DataTypes.StringType);
spark.udf().register("timehashConverter", timehashConverter, DataTypes.StringType);
Dataset df=spark.read().csv(path)
.withColumn("geohash", callUDF("geohashConverter", col("_c6"), col("_c7")))
.withColumn("timehash", callUDF("timehashConverter", col("_c1")))
.write().csv("C:/Users/Ahmed/Desktop/preprocess2");
}
public static void main(String[] args) {
String path = "C:/Users/Ahmed/Desktop/cabs_trajectories/cabs_trajectories/green/2013";
Hash h = new Hash(path);
}
}
and then I get a serialization problem, which disappears when I delete write().csv()
One of the most efficient ways is to load the CSV using the Datasets API and use User Defined Functions to convert the columns you've specified. This way, your data always remains structured and you don't have to deal with tuples.
First of all, you create your User Defined Functions: geohashConverter, which takes two values (latitude and longitude), and timehashConverter, which only takes the timestamp.
UDF2 geohashConverter = new UDF2<Long, Long, String>() {
    @Override
    public String call(Long latitude, Long longitude) throws Exception {
        // convert here
        return "calculate_hash";
    }
};
UDF1 timehashConverter = new UDF1<Long, String>() {
    @Override
    public String call(Long timestamp) throws Exception {
        // convert here
        return "calculate_hash";
    }
};
Once created, you have to register them:
spark.udf().register("geohashConverter", geohashConverter, DataTypes.StringType);
spark.udf().register("timehashConverter", timehashConverter, DataTypes.StringType);
Then read your CSV file and apply the User Defined Functions by calling withColumn. It creates a new column based on the User Defined Function you call with callUDF. callUDF always receives a String with the name of the registered UDF you want to call, and one or more Columns whose values will be passed to the UDF.
Finally, save your dataset by calling write().csv("path"):
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.callUDF;
spark.read().csv("/source/path")
.withColumn("geohash", callUDF("geohashConverter", col("latitude"), col("longitude")))
.withColumn("timehash", callUDF("timehashConverter", col("timestamp")))
.write().csv("/path/to/save");
Hope it helped!
Update
It would be pretty helpful if you post the code which is causing problems, because the exception says almost nothing about what part of the code is not serializable.
Anyway, from my personal experience with Spark, I think the problem is the object you are using to calculate the hashes. Bear in mind that this object has to be distributed through the cluster. If this object cannot be serialized, Spark will throw a Task not serializable exception. You have two options to work around it:
Implement the Serializable interface in the class that you use to calculate the hash.
Create a static method that generates the hashes and call this method from the UDF.
Update 2
and then I get a serialization problem, which disappears when I delete write().csv()
This is expected behaviour. When you delete write().csv() you are executing nothing. You should know how Spark works: in this code, all the methods called before csv() are transformations. In Spark, transformations are not executed until an action like csv(), show() or count() is called.
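The same "nothing runs until an action is called" behaviour can be observed with plain Java streams, which are also lazy. This is only an analogy using java.util.stream, not Spark code; the class `LazyEvaluationDemo` and the method `countCalls` are made up for this illustration.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazyEvaluationDemo {
    // Returns how often the mapping function ran before and after the terminal operation.
    static int[] countCalls() {
        AtomicInteger calls = new AtomicInteger();

        // map() plays the role of a transformation: it is only recorded here.
        Stream<String> pipeline = Stream.of(1L, 2L, 3L)
                .map(value -> {
                    calls.incrementAndGet();
                    return "hash-" + value;
                });
        int before = calls.get();

        // collect() plays the role of an action: only now is the function executed.
        List<String> result = pipeline.collect(Collectors.toList());
        int after = calls.get();

        return new int[] {before, after, result.size()};
    }

    public static void main(String[] args) {
        int[] counts = countCalls();
        System.out.println("calls before terminal operation: " + counts[0]); // 0
        System.out.println("calls after terminal operation: " + counts[1]);  // 3
    }
}
```

By the same logic, removing the final write in the Spark job removes the only step that would have forced the UDFs to be shipped and executed, so the serialization error never gets a chance to occur.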
The problem is that you are creating and executing the Spark job in a non-serializable class (and, even worse, in a constructor!).
Creating the Spark job in a static method solves the problem. Bear in mind that your Spark code must be distributed through the cluster and, consequently, it must be serializable. It worked for me and should work for you:
public class Hash {
public static void main(String[] args) {
String path = "in/prueba.csv";
UDF2 geohashConverter = new UDF2<Long, Long, String>() {
public String call(Long latitude, Long longitude) throws Exception {
// convert here
return "calculate_hash";
}
};
UDF1 timehashConverter = new UDF1<Long, String>() {
public String call(Long timestamp) throws Exception {
// convert here
return "calculate_hash";
}
};
SparkSession spark = SparkSession
.builder()
.appName("Java Spark SQL Example")
.config("spark.master", "local")
.getOrCreate();
spark.udf().register("geohashConverter", geohashConverter, DataTypes.StringType);
spark.udf().register("timehashConverter", timehashConverter, DataTypes.StringType);
spark
.read()
.format("com.databricks.spark.csv")
.option("header", "true")
.load(path)
.withColumn("geohash", callUDF("geohashConverter", col("_c6"), col("_c7")))
.withColumn("timehash", callUDF("timehashConverter", col("_c1")))
.write().csv("resultados");
}
}
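Why moving the UDFs into a static method helps can be reproduced without Spark, using plain Java serialization. In this sketch, an anonymous class created inside an instance of a non-serializable class keeps a reference to that instance, so serializing it fails, while the same anonymous class created in a static method has no such reference. All names here (`SerializationDemo`, `Udf`, `Job`, `canSerialize`) are hypothetical, and the instance version deliberately reads a field of the enclosing object so that the reference to it is guaranteed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SerializationDemo {
    // Stand-in for a serializable UDF interface.
    interface Udf extends Serializable {
        String call(Long value);
    }

    // Not Serializable, similar to a job class that creates its UDFs as instance fields.
    static class Job {
        String prefix = "hash-";
        // Reads an instance field, so the anonymous object references the enclosing Job.
        Udf instanceUdf = new Udf() {
            public String call(Long value) {
                return prefix + value;
            }
        };
    }

    // Created in a static context: there is no enclosing instance to drag along.
    static Udf staticUdf() {
        return new Udf() {
            public String call(Long value) {
                return "hash-" + value;
            }
        };
    }

    static boolean canSerialize(Object candidate) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(candidate);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(new Job().instanceUdf)); // false
        System.out.println(canSerialize(staticUdf()));           // true
    }
}
```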

Flatten processing result in spring batch

Does anyone know how, in spring-batch (3.0.7), I can flatten the result of a processor that returns a list of entities?
Example:
I got a processor that returns List
public class MyProcessor implements ItemProcessor < Long , List <Entity>> {
public List<Entity> process ( Long id )
}
Now all following processors / writers need to work on List < Entity >. Is there any way to flatten the result to a simple Entity, so that the further processors in the given step can work on single Entities?
The only way is to persist the list somehow with a writer and then create a separate step that would read from the persisted data.
Thanks in advance!
As you know, processors in spring-batch can be chained with a composite processor. Within the chain, you can change the processing type from processor to processor, but of course the input and output types of two neighbouring processors have to match.
However, the input or output type is always treated as one item. Therefore, if the output type of a processor is a List, this list is regarded as one item. Hence, the following processor needs to have the input type List or, if a writer follows, the writer needs to have a List-of-List as the type of its write method.
Moreover, a processor cannot multiply its elements. There can only be one output item for every input item.
Basically, there is nothing wrong with having a chain like
Reader<Integer>
ProcessorA<Integer,List<Integer>>
ProcessorB<List<Integer>,List<Integer>>
Writer<List<Integer>> (which leads to a write method write(List<List<Integer>> items))
Depending on the context, there could be a better solution.
You could mitigate the impact (for instance on reusability) by using wrapper processors and a wrapper writer, like the following code examples:
public class ListWrapperProcessor<I,O> implements ItemProcessor<List<I>, List<O>> {
    ItemProcessor<I,O> delegate;
    public void setDelegate(ItemProcessor<I,O> delegate) {
        this.delegate = delegate;
    }
    // delegate.process declares "throws Exception", so this method must declare it too
    public List<O> process(List<I> itemList) throws Exception {
        List<O> outputList = new ArrayList<>();
        for (I item : itemList) {
            O outputItem = delegate.process(item);
            // a null result means the delegate filtered the item out
            if (outputItem != null) {
                outputList.add(outputItem);
            }
        }
        // returning null filters out the whole list if nothing is left
        if (outputList.isEmpty()) {
            return null;
        }
        return outputList;
    }
}
public class ListOfListItemWriter<T> implements InitializingBean, ItemStreamWriter<List<T>> {
    private ItemStreamWriter<T> itemWriter;
    @Override
    public void write(List<? extends List<T>> listOfLists) throws Exception {
        if (listOfLists.isEmpty()) {
            return;
        }
        // flatten the list of lists into one list and pass it to the delegate
        List<T> all = listOfLists.stream().flatMap(Collection::stream).collect(Collectors.toList());
        itemWriter.write(all);
    }
    @Override
    public void afterPropertiesSet() throws Exception {
        Assert.notNull(itemWriter, "The 'itemWriter' may not be null");
    }
    public void setItemWriter(ItemStreamWriter<T> itemWriter) {
        this.itemWriter = itemWriter;
    }
    @Override
    public void close() {
        this.itemWriter.close();
    }
    @Override
    public void open(ExecutionContext executionContext) {
        this.itemWriter.open(executionContext);
    }
    @Override
    public void update(ExecutionContext executionContext) {
        this.itemWriter.update(executionContext);
    }
}
Using such wrappers, you can still implement "normal" processors and writers, and then use the wrappers to move the List handling out of them.
Unless you can provide a compelling reason, there's no reason to send a List of Lists to your ItemWriter. This is not the way the ItemProcessor was intended to be used. Instead, you should create/configure an ItemReader to return one object with its relevant objects.
For example, if you're reading from the database, you could use the HibernateCursorItemReader and a query that looks something like this:
"from ParentEntity parent left join fetch parent.childrenEntities"
Your data model SHOULD have a parent table with the Long id that you're currently passing to your ItemProcessor, so leverage that to your advantage. The reader would then pass back ParentEntity objects, each with a collection of ChildEntity objects that go along with it.

How to build my own Application Setting

I want to build an ApplicationSetting for my application. The ApplicationSetting can be stored in a properties file or in a database table. The settings are stored in key-value pairs. E.g.
ftp.host = blade
ftp.username = dummy
ftp.pass = pass
content.row_pagination = 20
content.title = How to train your dragon.
I have designed it as follows:
Application settings reader:
interface IApplicationSettingReader {
    Map<String, String> read();
}
class DatabaseApplicationSettingReader implements IApplicationSettingReader {
    dao appSettingDao;
    public Map<String, String> read() {
        List<AppSettingEntity> listEntity = appSettingDao.findAll();
        Map<String, String> map = new HashMap<String, String>();
        for (AppSettingEntity entity : listEntity) {
            map.put(entity.getConfigName(), entity.getConfigValue());
        }
        return map;
    }
}
class PropertiesApplicationSettingReader implements IApplicationSettingReader {
    public Map<String, String> read() {
        //read from some properties file
        return map;
    }
}
Application settings class:
class AppSettings {
    private static AppSettings instance = new AppSettings();
    private Map<String, String> map;
    private AppSettings() {
    }
    public static AppSettings getInstance() {
        if (instance == null) {
            throw new RuntimeException("Object not configured yet");
        }
        return instance;
    }
    // static method: there is no "this", so the map is set on the singleton instance
    public static void configure(IApplicationSettingReader reader) {
        instance.map = reader.read();
    }
    public String getFtpSetting(String param) {
        return map.get("ftp." + param);
    }
    public String getContentSetting(String param) {
        return map.get("content." + param);
    }
}
Test class:
class AppSettingsTest {
    IApplicationSettingReader reader;
    @Before
    public void setUp() throws Exception {
        reader = new DatabaseApplicationSettingReader();
    }
    @Test
    public void getContentSetting_should_get_content_title() {
        AppSettings.configure(reader);
        AppSettings settings = AppSettings.getInstance();
        String title = settings.getContentSetting("title");
        assertNotNull(title);
        System.out.println(title);
    }
}
My questions are:
Can you give your opinion about my code? Is there something wrong?
I configure my application settings once, when the application starts, with the appropriate reader (DbReader or PropertiesReader). I made it a singleton because the application has just one instance of ApplicationSettings. The problem is that when some user edits the database or the file directly, I can't get the changed values. Now I want to implement something like an ApplicationSettingChangeListener, so that if the data changes, I refresh my application settings. Do you have any suggestions on how this can be implemented?
I haven't thoroughly inspected your code, but there seems to be a concurrency issue. The map is not thread-safe (HashMap), so if you mutate it through config() while other threads access the map, you have a problem.
Though you could use a ConcurrentHashMap instead of a HashMap, a batch operation on a ConcurrentHashMap is not atomic. This means that, if you use it, readers can see a "half-way" modified config. Depending on your app, that might not be acceptable.
So, the solution for this is to use this:
private volatile ImmutableMap map;
public void config() {
    // build the complete new map first, then publish it with a single volatile write
    ImmutableMap newMap = createNewMap();
    this.map = newMap;
}
This will change your configs atomically (no intermediate state is visible).
As for updating your config on the fly, log4j does it using a background thread that monitors the config file. You could of course monitor a db table instead by polling it periodically.
In that case, your Config class will have preferably a ScheduledExecutor with a task that will monitor files/db and call config() periodically.
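The two ideas above (swapping a complete immutable map and polling on a schedule) can be combined as in the following sketch. It uses only the JDK; the class `ReloadingSettings`, its methods, and the `Supplier` standing in for the database or properties reader are assumptions made for this illustration, not part of the code in the question.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class ReloadingSettings implements AutoCloseable {
    private final Supplier<Map<String, String>> reader; // stand-in for the DB or file reader
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    // Readers always see either the old or the new complete map, never a mix.
    private volatile Map<String, String> settings = Collections.emptyMap();

    ReloadingSettings(Supplier<Map<String, String>> reader) {
        this.reader = reader;
        reload();
    }

    // Builds a fresh read-only snapshot and publishes it with one volatile write.
    void reload() {
        settings = Collections.unmodifiableMap(new HashMap<>(reader.get()));
    }

    // Re-reads the source periodically; a failed read keeps the previous snapshot.
    void startPolling(long period, TimeUnit unit) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                reload();
            } catch (RuntimeException e) {
                // keep the last good settings and try again on the next run
            }
        }, period, period, unit);
    }

    String get(String key) {
        return settings.get(key);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

A caller would create it once at startup, for example `new ReloadingSettings(() -> reader.read())`, call `startPolling(30, TimeUnit.SECONDS)`, and close it on shutdown. The polling interval of 30 seconds is arbitrary.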
The answer to question #2 is to use a thread and check periodically if the file has been changed or to simply reinitialize your settings with the file contents.
