We are using Spring Data to implement a (distributed) sequence generator on top of DynamoDB, relying on DynamoDB's conditional updates feature - specifically the optimistic locking provided by the Amazon DynamoDB SDK's @DynamoDBVersionAttribute.
The annotated POJO for the counter item:
@Data
@DynamoDBTable(tableName = "counter")
public class Counter {

    @DynamoDBHashKey
    private String key = "counter";

    @DynamoDBVersionAttribute
    private Long value;
}
The Spring Data repository (we are using Boost Chicken's Spring Data community lib for DynamoDB):
@Repository
interface CounterRepository extends CrudRepository<Counter, String> {
}
and the implementation itself:
@Slf4j
@Component
@RequiredArgsConstructor
public class SequenceGenerator {

    private static final int MAX_VALUE = 1_000_000;

    private final CounterRepository repository;

    public int next() {
        try {
            var counter = repository.findById("counter").orElse(new Counter());
            var updated = repository.save(counter);
            var value = (int) (updated.getValue() % MAX_VALUE);
            log.debug("Generated new sequence number {}", value);
            return value;
        } catch (ConditionalCheckFailedException ex) {
            log.debug("Detected an optimistic lock while trying to generate a new sequence number. Will try to generate a new one.");
            return next();
        }
    }
}
Our solution seems to work just fine, but I'm a little bit worried about performance.
Testing shows that 5 concurrent threads (i.e., 5 concurrent microservice instances in the end) looping to generate sequence numbers take around 4 seconds to generate 25 numbers, because they constantly run into optimistic locking conflicts.
Is there a better way to achieve our goal? I've looked into atomic counters, but the docs specifically state:
An atomic counter would not be appropriate where overcounting or undercounting can't be tolerated (for example, in a banking application). In this case, it is safer to use a conditional update instead of an atomic counter.
which rules them out for our case, where numbers must be unique.
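For context, a DynamoDB atomic counter is just an UpdateItem call with an ADD expression. Below is a minimal sketch using the AWS SDK v1; the table, key, and attribute names mirror the Counter POJO above, but the client setup and exact names are assumptions, not part of our actual code.

// Minimal sketch of a DynamoDB atomic counter (the approach the docs warn about).
// Assumes an AmazonDynamoDB client named "dynamoDb" is already configured.
Map<String, AttributeValue> key = Map.of("key", new AttributeValue("counter"));

UpdateItemRequest request = new UpdateItemRequest()
        .withTableName("counter")
        .withKey(key)
        .withUpdateExpression("ADD #v :incr")
        .withExpressionAttributeNames(Map.of("#v", "value"))
        .withExpressionAttributeValues(Map.of(":incr", new AttributeValue().withN("1")))
        .withReturnValues(ReturnValue.UPDATED_NEW);

long newValue = Long.parseLong(
        dynamoDb.updateItem(request).getAttributes().get("value").getN());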
Could you help me with one thing? Imagine I have a simple RESTful microservice with one GET method which simply responds with a random String.
I collect all the strings in a concurrent Set<String> that holds all answers.
There is a sloppy implementation below; the main point is that the Set<String> is thread-safe and can be modified concurrently.
@RestController
public class Controller {

    private final StringService stringService;
    private final CacheService cacheService;

    public Controller(final StringService stringService, final CacheService cacheService) {
        this.stringService = stringService;
        this.cacheService = cacheService;
    }

    @GetMapping
    public String get() {
        final String str = stringService.random();
        cacheService.add(str);
        return str;
    }
}
public class CacheService {

    private final Set<String> set = ConcurrentHashMap.newKeySet();

    public void add(final String str) {
        set.add(str);
    }
}
While you are reading this line, my endpoint is being used by 1 billion people.
I want to shard the cache. Since my system is heavily loaded, I can't hold all the strings on one server. I want to have 256 servers/instances and distribute my cache uniformly, using str.hashCode() % 256 to determine on which server/instance a string should be kept.
Could you tell me what I should do next?
Assume that currently I only have a Spring Boot application running locally.
You should check out Hazelcast. It is open source and has proved useful for me in a case where I wanted to share data among multiple instances of my application. The in-memory data grid provided by Hazelcast might be just the thing you are looking for.
I agree with Vicky, this is what Hazelcast is made for. It's a single jar, a couple of lines of code, and instead of a HashMap you have an IMap (an extension of java.util.Map), and you're good to go. All the distribution, sharding, concurrency, etc. is done for you. Check out:
https://docs.hazelcast.org/docs/3.11.1/manual/html-single/index.html#map
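A minimal sketch of what that could look like for this use case (the member setup and the set name are my assumptions, not part of the original answer):

// Minimal sketch: each application instance joins the same Hazelcast cluster
// and writes into a distributed collection; Hazelcast partitions the data across members.
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class DistributedCacheService {

    private final HazelcastInstance hazelcast = Hazelcast.newHazelcastInstance();

    public void add(final String str) {
        // "answers" is an arbitrary name; every member sees the same distributed set
        hazelcast.getSet("answers").add(str);
    }
}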
Try the following code. But it is a bad approach; it's best to use a Map only to cache data within a single instance. If you need a distributed application, try a distributed cache service like Redis.
class CacheService {

    /**
     * Assume read operations are more frequent than write operations.
     */
    private static final List<Set<String>> sets = new CopyOnWriteArrayList<>();

    static {
        for (int i = 0; i < 256; i++) {
            sets.add(ConcurrentHashMap.newKeySet());
        }
    }

    public void add(final String str) {
        // hashCode() can be negative, so use floorMod to keep the index in [0, 255]
        int insertIndex = Math.floorMod(str.hashCode(), 256);
        sets.get(insertIndex).add(str);
    }
}
In our project I have written a small class which is designed to take the result of an ElasticSearch query containing a named aggregation and return information about each of the buckets in the result in a neutral format, suitable for passing on to our UI.
public class AggsToSimpleChartBasicConverter {

    private SearchResponse searchResponse;
    private String aggregationName;

    private static final Logger logger = LoggerFactory.getLogger(AggsToSimpleChartBasicConverter.class);

    public AggsToSimpleChartBasicConverter(SearchResponse searchResponse, String aggregationName) {
        this.searchResponse = searchResponse;
        this.aggregationName = aggregationName;
    }

    public void setChartData(SimpleChartData chart,
                             BucketExtractors.BucketNameExtractor keyExtractor,
                             BucketExtractors.BucketValueExtractor valueExtractor) {
        Aggregations aggregations = searchResponse.getAggregations();
        Terms termsAggregation = aggregations.get(aggregationName);
        if (termsAggregation != null) {
            for (Terms.Bucket bucket : termsAggregation.getBuckets()) {
                chart.add(keyExtractor.extractKey(bucket),
                        Long.parseLong(valueExtractor.extractValue(bucket).toString()));
            }
        } else {
            logger.warn("Aggregation " + aggregationName + " could not be found");
        }
    }
}
I want to write a unit test for this class by calling setChartData() and performing some assertions against the object passed in, since the mechanics of it are reasonably simple. However in order to do so I need to construct an instance of org.elasticsearch.action.search.SearchResponse containing some test data, which is required by my class's constructor.
I looked at implementing a solution similar to this existing question, but the process for adding aggregation data to the result is more involved and requires the use of private internal classes which would likely change in a future version, even if I could get it to work initially.
I reviewed the ElasticSearch docs on unit testing and there is a mention of a class org.elasticsearch.test.ESTestCase.java (source) but there is no guidance on how to use this class and I'm not convinced it is intended for this scenario.
How can I easily unit test this class in a manner which is not likely to break in future ES releases?
Note, I do not want to have to start up an instance of ElasticSearch, embedded or otherwise, since that is overkill for this simple unit test and would significantly slow down the execution.
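One approach that avoids starting ElasticSearch entirely is to mock just the pieces the converter touches. This is only a sketch under assumptions I'm making (Mockito on the classpath, the extractor and chart types stubbed with assumed return types, and the stubbed ES methods not being final in your version; if they are final, the mockito-inline mock maker would be needed):

// Sketch of a Mockito-based unit test for AggsToSimpleChartBasicConverter.
// Requires: import static org.mockito.Mockito.*; import java.util.Collections; import org.junit.Test;
@Test
public void setChartDataAddsOneEntryPerBucket() {
    SearchResponse searchResponse = mock(SearchResponse.class);
    Aggregations aggregations = mock(Aggregations.class);
    Terms terms = mock(Terms.class);
    Terms.Bucket bucket = mock(Terms.Bucket.class);

    when(searchResponse.getAggregations()).thenReturn(aggregations);
    when(aggregations.get("myAgg")).thenReturn(terms);
    // doReturn avoids generics friction with the wildcard return type of getBuckets()
    doReturn(Collections.singletonList(bucket)).when(terms).getBuckets();

    BucketExtractors.BucketNameExtractor keyExtractor = mock(BucketExtractors.BucketNameExtractor.class);
    BucketExtractors.BucketValueExtractor valueExtractor = mock(BucketExtractors.BucketValueExtractor.class);
    when(keyExtractor.extractKey(bucket)).thenReturn("key1");
    when(valueExtractor.extractValue(bucket)).thenReturn("42");

    SimpleChartData chart = mock(SimpleChartData.class);

    new AggsToSimpleChartBasicConverter(searchResponse, "myAgg")
            .setChartData(chart, keyExtractor, valueExtractor);

    verify(chart).add("key1", 42L);
}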
I have a Java class that has a Guava LoadingCache<String, Integer> and in that cache, I'm planning to store two things: the average time active employees have worked for the day and their efficiency. I am caching these values because it would be expensive to compute every time a request comes in. Also, the contents of the cache will be refreshed (refreshAfterWrite) every minute.
I was thinking of using a CacheLoader for this situation; however, its load method only loads one value per key. In my CacheLoader, I was planning to do something like:
private Service service = new Service();

public Integer load(String key) throws Exception {
    if (key.equals("employeeAvg"))
        return calculateEmployeeAvg(service.getAllEmployees());
    if (key.equals("employeeEff"))
        return calculateEmployeeEff(service.getAllEmployees());
    return -1;
}
For me, I find this very inefficient since in order to load both values, I have to invoke service.getAllEmployees() twice because, correct me if I'm wrong, CacheLoaders should be stateless.
Which made me think to use the LoadingCache.put(key, value) method so I can just create a utility method that invokes service.getAllEmployees() once and calculates the values on the fly. However, if I do use LoadingCache.put(), I won't have the refreshAfterWrite feature since it's dependent on a cache loader.
How do I make this more efficient?
It seems like your problem stems from using strings to represent value types (Effective Java Item 50). Instead, consider defining a proper value type that stores this data, and use a memoizing Supplier to avoid recomputing them.
public static class EmployeeStatistics {
    private final int average;
    private final int efficiency;

    // constructor, getters and setters
}
Supplier<EmployeeStatistics> statistics = Suppliers.memoize(
    new Supplier<EmployeeStatistics>() {
        @Override
        public EmployeeStatistics get() {
            List<Employee> employees = new Service().getAllEmployees();
            return new EmployeeStatistics(
                calculateEmployeeAvg(employees),
                calculateEmployeeEff(employees));
        }});
You could even move these calculation methods inside EmployeeStatistics and simply pass in all employees to the constructor and let it compute the appropriate data.
If you need to configure your caching behavior more than Suppliers.memoize() or Suppliers.memoizeWithExpiration() can provide, consider this similar pattern, which hides the fact that you're using a Cache inside a Supplier:
Supplier<EmployeeStatistics> statistics = new Supplier<EmployeeStatistics>() {
    private final Object key = new Object();
    private final LoadingCache<Object, EmployeeStatistics> cache =
        CacheBuilder.newBuilder()
            // configure your builder
            .build(
                new CacheLoader<Object, EmployeeStatistics>() {
                    public EmployeeStatistics load(Object key) {
                        // same behavior as the Supplier above
                    }});

    @Override
    public EmployeeStatistics get() {
        return cache.get(key);
    }};
However, if I do use LoadingCache.put(), I won't have the refreshAfterWrite feature since it's dependent on a cache loader.
I'm not sure, but you might be able to call it from inside the load method. I mean, compute the requested value as you do, and put the other one in as well. However, this feels hacky.
If service.getAllEmployees is expensive, then you could cache it. If both calculateEmployeeAvg and calculateEmployeeEff are cheap, then recompute them when needed. Otherwise, it looks like you could use two caches.
I guess a method computing both values at once could be a reasonable solution. Create a tiny Pair-like class aggregating them and use it as the cache value. There will only be a single key.
Concerning your own solution, it could be as trivial as
class EmployeeStatsCache {

    private static final long VALIDITY_MILLIS = 60_000;  // refresh interval

    private final Service service = new Service();

    private long validUntil;
    private List<Employee> employeeList;
    private Integer employeeAvg;
    private Integer employeeEff;

    private boolean isValid() {
        return System.currentTimeMillis() <= validUntil;
    }

    private synchronized List<Employee> getEmployeeList() {
        if (!isValid() || employeeList == null) {
            employeeList = service.getAllEmployees();
            validUntil = System.currentTimeMillis() + VALIDITY_MILLIS;
        }
        return employeeList;
    }

    public synchronized int getEmployeeAvg() {
        if (!isValid() || employeeAvg == null) {
            employeeAvg = calculateEmployeeAvg(getEmployeeList());
        }
        return employeeAvg;
    }

    public synchronized int getEmployeeEff() {
        if (!isValid() || employeeEff == null) {
            employeeEff = calculateEmployeeEff(getEmployeeList());
        }
        return employeeEff;
    }
}
Instead of synchronized methods you may want to synchronize on a private final field. There are other possibilities (e.g. Atomic*), but the basic design is probably simpler than adapting Guava's Cache.
Now, I see that there's Suppliers#memoizeWithExpiration in Guava. That's probably even simpler.
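For reference, a minimal sketch of that variant; the one-minute expiration mirrors the refresh interval mentioned in the question, and the surrounding types (Service, EmployeeStatistics, the calculate methods) are the ones from the snippets above:

// Recomputes the statistics at most once per minute; getAllEmployees() is called once per refresh.
Supplier<EmployeeStatistics> statistics = Suppliers.memoizeWithExpiration(
    () -> {
        List<Employee> employees = new Service().getAllEmployees();
        return new EmployeeStatistics(
            calculateEmployeeAvg(employees),
            calculateEmployeeEff(employees));
    },
    1, TimeUnit.MINUTES);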
Just for the sake of a thought exercise, how could the uniqueness of an attribute be enforced for each instance of a given class?
Uniqueness here can be defined as being on a single JVM and within a single user session.
This is at the Java level and has nothing to do with databases; the main purpose is to verify whether a collision has occurred.
The first obvious step is to have a static attribute at class level.
Having an ArrayList or other container seems impractical as the number of instances rises.
Incrementing a numeric counter at class level appears to be the simplest approach, but the id must always follow the last-used id.
Enforcing a hash or non-numeric id could be problematic.
Concurrency might be of concern. If it is possible for two instances to get an id at the same time, then this should be prevented.
How should this problem be tackled? What solutions/approaches might already exist?
If you care about performance, here is a thread-safe, fast (lock-free) and collision-free version of unique id generation:
public class Test {

    private static AtomicInteger lastId = new AtomicInteger();

    private int id;

    public Test() {
        id = lastId.incrementAndGet();
    }
    ...
Simply use the UUID class in Java (http://docs.oracle.com/javase/6/docs/api/java/util/UUID.html). Create a field of type UUID in the classes under inspection and initialize this field in the constructor.
public class Test {

    public UUID id;

    public Test() {
        id = UUID.randomUUID();
    }
}
When it comes time to detect collisions, simply compare the string representations of the objects' UUIDs like this:
Test testObject1 = new Test();
Test testObject2 = new Test();
boolean collision = testObject1.id.toString().equals(testObject2.id.toString());
Or, more simply, use the compareTo() method of the UUID class:
boolean collision = testObject2.id.compareTo(testObject1.id) == 0;
0 means that the ids are the same; +1 and -1 mean they are not equal.
Merit: universally unique (can be time-based or random) and hence should take care of threading issues (someone should confirm this... this is to the best of my knowledge). More information here and here.
To confirm it is thread-safe, refer to this question on SO: is java.util.UUID thread safe?
Demerit: it will require a change in the structure of the classes under inspection, i.e. the id field will have to be added in the source of the classes themselves, which might or might not be convenient.
UUID is a good solution, but UUID.randomUUID() under the hood calls:
synchronized public void SecureRandom.nextBytes(byte[] bytes)
So it is slow: threads lock a single monitor object on each id-generation operation.
The AtomicInteger approach is better because it loops in a CAS operation. But again, a synchronization operation must be done for each id generation.
In the solution below, only prime number generation is synchronized. Synchronization is on an AtomicInteger, so it is fast and thread-safe. Given a set of primes, many ids are generated per iteration.
Fixed number of threads
Edit: Solution for a fixed number of threads
If you know a priori how many threads will use the id generation, then you can generate ids with values
id = i mod X + n*X
where X is the number of threads, i is the thread number, and n is a local variable that is incremented for each id generation. The code for this solution is really simple (see the sketch below), but it must be integrated with the whole program infrastructure.
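A minimal sketch of that scheme (the class and field names are mine, not from the original answer):

// Each thread owns one generator instance; ids from different threads never collide
// because they fall into disjoint residue classes modulo the thread count.
public class PerThreadIdGenerator {

    private final int threadIndex;   // i, in the range [0, X)
    private final int threadCount;   // X, fixed up front
    private long n = 0;              // local counter, no synchronization needed

    public PerThreadIdGenerator(int threadIndex, int threadCount) {
        this.threadIndex = threadIndex;
        this.threadCount = threadCount;
    }

    public long nextId() {
        return threadIndex % threadCount + (n++) * threadCount;
    }
}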
Ids generated from primes
The idea is to generate the ids as products of powers of prime numbers:
id = p_1^f1 * p_2^f2 * p_3^f3 * ... * p_n^fn
We use different prime numbers in each thread, so each thread generates a different set of ids.
Assuming that we use primes (2,3,5), the sequence will be:
2, 2^2, 2^3, 2^4, 2^5,..., 2^62
Then, when we see that an overflow would be generated, we roll the factor over to the next prime:
3, 2*3 , 2^2*3, 2^3*3, 2^4*3, 2^5*3,..., 2^62*3
and next
3^2, 2*3^2 , 2^2*3^2, .....
Generation class
Edit: prime order generation must be done on an AtomicInteger to be correct
Each instance of class IdFactorialGenerator will generate different sets of ids.
To have thread-safe generation of ids, just use ThreadLocal to get a per-thread instance. Synchronization happens only during prime number generation.
package eu.pmsoft.sam.idgenerator;

import java.util.concurrent.atomic.AtomicInteger;

public class IdFactorialGenerator {

    private static final AtomicInteger nextPrimeNumber = new AtomicInteger(0);

    private int usedSlots;
    private int[] primes = new int[64];
    private int[] factors = new int[64];
    private long id;

    public IdFactorialGenerator() {
        usedSlots = 1;
        primes[0] = Sieve$.MODULE$.primeNumber(nextPrimeNumber.getAndAdd(1));
        factors[0] = 1;
        id = 1;
    }

    public long nextId() {
        for (int factorToUpdate = 0; factorToUpdate < 64; factorToUpdate++) {
            if (factorToUpdate == usedSlots) {
                factors[factorToUpdate] = 1;
                primes[factorToUpdate] = Sieve$.MODULE$.primeNumber(nextPrimeNumber.getAndAdd(1));
                usedSlots++;
            }
            int primeToExtend = primes[factorToUpdate];
            if (primeToExtend < Long.MAX_VALUE / id) {
                // id * primeToExtend < Long.MAX_VALUE
                factors[factorToUpdate] = factors[factorToUpdate] * primeToExtend;
                id = id * primeToExtend;
                return id;
            } else {
                factors[factorToUpdate] = 1;
                id = 1;
                for (int i = 0; i < usedSlots; i++) {
                    id = id * factors[i];
                }
            }
        }
        throw new IllegalStateException("I can not generate more ids");
    }
}
To get the prime numbers I use an implementation in Scala provided here, in problem 7: http://pavelfatin.com/scala-for-project-euler/
object Sieve {

  def primeNumber(position: Int): Int = ps(position)

  private lazy val ps: Stream[Int] = 2 #:: Stream.from(3).filter(i =>
    ps.takeWhile(j => j * j <= i).forall(i % _ > 0))
}
I'm running an import job that had worked pretty well until a couple of days ago, when the number of entities increased dramatically.
What happens is that I get a "Lock wait timeout exceeded" error. The application then retries and an exception is thrown since I call em.getTransaction().begin(); one more time.
To get rid of this problem I changed innodb_lock_wait_timeout to 120 and lowered the batch size to 50 entities.
What I can't figure out is how to handle all of this properly in code. I don't want the entire import to fail because of locking. How would you handle this? Do you have any code examples? Maybe some other thoughts? Please go nuts!
My BatchPersister:
public class BatchPersister implements Persister {

    private final static Log log = getLog(BatchPersister.class);
    private WorkLogger workLog = WorkLogger.instance();

    private static final int BATCH_SIZE = 500;

    private int persistedObjects;
    private long startTime;
    private UpdateBatch batch;
    private String dataSource;

    public BatchPersister(String dataSource) {
        this.dataSource = dataSource;
    }

    public void persist(Persistable obj) {
        persistedObjects++;
        logProgress(100);

        if (batch == null)
            batch = new UpdateBatch(BATCH_SIZE, dataSource);

        batch.add(obj);

        if (batch.isFull()) {
            batch.persist();
            batch = null;
        }
    }
}
UpdateBatch
public class UpdateBatch {

    private final static Log log = LogFactory.getLog(UpdateBatch.class);
    private WorkLogger workLogger = WorkLogger.instance();

    private final Map<Object, Persistable> batch;
    private final EntityManager em;
    private int size;

    /**
     * Initializes the batch and specifies its size.
     */
    public UpdateBatch(int size, String dataSource) {
        this.size = size;
        batch = new LinkedHashMap<Object, Persistable>();
        em = EmFactory.getEm(dataSource);
    }

    public void persist() {
        log.info("Persisting " + this);
        em.getTransaction().begin();
        persistAllToDB();
        em.getTransaction().commit();

        WorkLog batchLog = new WorkLog(IMPORT_PERSIST, IN_PROGRESS);
        batchLog.setAffectedItems(batch.size());
        workLogger.log(batchLog);
        em.close();
    }

    /**
     * Persists all data in this update batch.
     */
    private void persistAllToDB() {
        for (Persistable persistable : batch.values())
            em.persist(persistable);
    }

    @Override
    public String toString() {
        final ArrayList<Persistable> values = new ArrayList<Persistable>(batch.values());
        Persistable first = values.get(0);
        Persistable last = values.get(values.size() - 1);
        return "UpdateBatch[" +
                first.getClass().getSimpleName() + "(" + first.getId() + ")" +
                " - " +
                last.getClass().getSimpleName() + "(" + last.getId() + ")" +
                "]";
    }
}
Solution 1.
Do not use JPA; it was not designed for massive database operations. Since you have access to your DataSource and you are managing transactions manually, there is nothing stopping you from using plain old SQL.
Solution 2.
There might be a performance problem connected with the persistence context's first-level cache: every persisted entity is kept in that cache, and when this cache gets large it may hurt performance (mostly memory).
To improve the situation, set the hibernate.jdbc.batch_size property (or the equivalent, if you are not using the Hibernate implementation of JPA) to roughly 20; thanks to that, statements will be sent to the database in packs of 20.
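For instance, the property can be passed when creating the EntityManagerFactory; a minimal sketch, where the persistence unit name is just a placeholder:

// Hypothetical setup: enable JDBC batching for the Hibernate-backed persistence unit.
Map<String, String> props = new HashMap<>();
props.put("hibernate.jdbc.batch_size", "20");
props.put("hibernate.order_inserts", "true"); // optional: group inserts by entity type
EntityManagerFactory emf = Persistence.createEntityManagerFactory("import-unit", props);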
Secondly, clear the persistence context every 20 operations, forcing synchronization with the database:
private void persistAllToDB() {
    int counter = 0;
    for (Persistable persistable : batch.values()) {
        em.persist(persistable);
        counter++;
        if (counter % 20 == 0) {
            em.flush();
            em.clear();
        }
    }
}
Solution 3.
Tune the MySQL InnoDB engine (see http://dev.mysql.com/doc/refman/5.1/en/insert-speed.html and http://dev.mysql.com/doc/refman/5.0/en/innodb-tuning.html). If your table is heavily indexed, it may hurt insert performance.
Those are my speculations; I hope something here helps you.
Pitor already named a couple of options. I would point out that a variation of his Solution 2 would be to leverage the Hibernate StatelessSession API instead of using a Session and clearing it (see the sketch below).
However, something else you should consider is that a transaction is a grouping of statements that are expected to fail or succeed as a whole. If you have a bunch of statements, one in the middle fails, and you want all the preceding statements to be persistent, then you should not be grouping them together in a single transaction. Group your statements properly into transactions. It is generally a good idea to enable JDBC batching in Hibernate anyway; it typically leads to more efficient database communication.
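A minimal sketch of that variation, assuming you can obtain (or unwrap) the underlying Hibernate SessionFactory; the names are illustrative, not from the original code:

// StatelessSession bypasses the first-level cache entirely, so nothing accumulates
// between inserts and there is no need to flush/clear.
StatelessSession session = sessionFactory.openStatelessSession();
Transaction tx = session.beginTransaction();
try {
    for (Persistable persistable : batch.values()) {
        session.insert(persistable);
    }
    tx.commit();
} catch (RuntimeException e) {
    tx.rollback();
    throw e;
} finally {
    session.close();
}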