How to shard a Set?

Could you help me with one thing? Imagine I have a simple RESTful microservice with one GET method that simply responds with a random String.
I collect all the strings in a concurrent Set<String> that holds every answer.
Below is a sloppy implementation; the main point is that the Set<String> is thread-safe and can be modified concurrently.
@RestController
public class Controller {
private final StringService stringService;
private final CacheService cacheService;
public Controller(final StringService stringService, final CacheService cacheService) {
this.stringService = stringService;
this.cacheService = cacheService;
}
@GetMapping
public String get() {
final String str = stringService.random();
cacheService.add(str);
return str;
}
}
public class CacheService {
private final Set<String> set = ConcurrentHashMap.newKeySet();
public void add(final String str) {
set.add(str);
}
}
While you are reading this line, my endpoint is being used by 1 billion people.
I want to shard the cache. Since my system is heavily loaded, I can't hold all the strings on one server. I want to have 256 servers/instances and distribute the cache uniformly, using str.hashCode() % 256 to determine on which server/instance a string should be kept.
Could you tell me what I should do next?
Assume that currently I only have a Spring Boot application running locally.

You should check out Hazelcast. It is open source and has proved useful for me in a case where I wanted to share data among multiple instances of my application. The in-memory data grid provided by Hazelcast might be just the thing you are looking for.

I agree with Vicky; this is what Hazelcast is made for. It's a single jar and a couple of lines of code: instead of a HashMap you have an IMap, which implements the standard Map (ConcurrentMap) interface, and you're good to go. All the distribution, sharding, concurrency, etc. are done for you. Check out:
https://docs.hazelcast.org/docs/3.11.1/manual/html-single/index.html#map

Try the following code. It is not a good approach, though: within a single instance you are better off caching your data in one Map. If you need to build a distributed application, try a distributed cache service such as Redis.
class CacheService {
    /**
     * Assumes read operations are more frequent than write operations.
     */
    private static final List<Set<String>> sets = new CopyOnWriteArrayList<>();
    static {
        for (int i = 0; i < 256; i++) {
            sets.add(ConcurrentHashMap.newKeySet());
        }
    }
    public void add(final String str) {
        // hashCode() can be negative, and % would then produce a negative index
        int insertIndex = Math.floorMod(str.hashCode(), 256);
        sets.get(insertIndex).add(str);
    }
}
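One detail of the shard index is worth spelling out: String.hashCode() can be negative, and Java's % operator keeps the sign of the dividend, so a plain hashCode() % 256 can produce a negative index. The following is a minimal single-JVM sketch of shard routing that avoids this; ShardRouter and its method names are made up for illustration and are not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: maps a string to one of N in-memory shards.
final class ShardRouter {
    private final List<Set<String>> shards = new ArrayList<>();

    ShardRouter(int shardCount) {
        for (int i = 0; i < shardCount; i++) {
            shards.add(ConcurrentHashMap.newKeySet());
        }
    }

    // floorMod always returns a value in [0, shardCount), even for negative hash codes.
    int shardFor(String str) {
        return Math.floorMod(str.hashCode(), shards.size());
    }

    // Returns true if the string was not already present in its shard.
    boolean add(String str) {
        return shards.get(shardFor(str)).add(str);
    }

    boolean contains(String str) {
        return shards.get(shardFor(str)).contains(str);
    }
}
```

In a real multi-server deployment, shardFor would presumably pick the instance that receives the request rather than a local set; that routing layer is outside this sketch.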

Related

Concurrent inserting to DB

I made a parser based on Jsoup. It handles a paginated page; each page contains, for example, 100 links to be parsed. I created a main loop that goes over the pagination, and I need to run async tasks to parse each of the 100 items on each page. As I understand it, Jsoup does not support asynchronous request handling. After handling each item I need to save it to the DB. I want to avoid errors during inserts into the DB table (for example, threads using the same id for different items at the same time, if that is possible). What would you suggest?
Could I use a simple Thread instance to parse each item:
public class ItemParser extends Thread {
private String url;
private MySpringDataJpaRepository repo;
public ItemParser(String url, MySpringDataJpaRepository repoReference) {
this.url = url;
this.repo = repoReference;
}
@Override
public void run() {
final MyItem item = jsoupParseItem();
repo.save(item);
}
}
And run this like:
public class Parser {
@Autowired
private MySpringDataJpaRepository repoReference; // <-- SINGLETON
public static void main(String[] args) {
int pages = 10000;
for (int i = 0; i < pages; i++) {
Document currentPage = Jsoup.parse();
List<String> links = currentPage.extractLinks(); // contains 100 links to be parsed on each for-loop iteration
links.forEach(link -> new ItemParser(link, repoReference).start());
}
}
}
I know that this code is not compilable, I just want to show you my idea.
Or maybe it's better to use Spring Batch?
What is best practice to solve this?
What do you think?

If you use row-level locking you should be fine. Making each insert its own transaction might avoid problems, but this has implications given the whole notion of a transaction as a unit of work (i.e. if a single insert fails, do you want the whole run to fail and roll back?).
Also, if you use UUIDs or DB-generated ids you won't have any collision issues.
As to how to structure the code, I'd look at using a Runnable for each task and a thread pool executor. With too many threads the system loses efficiency trying to manage them all. I notice you're using Spring, so take a look at https://docs.spring.io/spring/docs/current/spring-framework-reference/html/scheduling.html
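A rough sketch of the thread pool idea, using only the JDK. The parsing step and the database are stand-ins here (parseItem and an in-memory map are assumptions, not the asker's code); the point is the bounded pool and the UUID keys, which cannot collide between threads the way a shared counter might.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PooledItemParser {
    // Stand-in for the database table: generated id -> parsed item.
    private final Map<String, String> savedItems = new ConcurrentHashMap<>();

    // Placeholder for the real Jsoup parsing of one link.
    private String parseItem(String url) {
        return "parsed:" + url;
    }

    // Parses all links on a bounded pool and waits for every task to finish.
    Map<String, String> parseAll(List<String> links, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String link : links) {
                pending.add(pool.submit(() -> {
                    String item = parseItem(link);
                    savedItems.put(UUID.randomUUID().toString(), item);
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // waits for the task and rethrows any failure
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parsing failed", e);
        } finally {
            pool.shutdown();
        }
        return savedItems;
    }
}
```

With a real repository, the save call would replace the map insert; whether each save runs in its own transaction is the design decision discussed above.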

How should I implement Guava cache when I plan to cache multiple values efficiently?

I have a Java class that has a Guava LoadingCache<String, Integer> and in that cache, I'm planning to store two things: the average time active employees have worked for the day and their efficiency. I am caching these values because it would be expensive to compute every time a request comes in. Also, the contents of the cache will be refreshed (refreshAfterWrite) every minute.
I was thinking of using a CacheLoader for this situation, however, its load method only loads one value per key. In my CacheLoader, I was planning to do something like:
private Service service = new Service();
public Integer load(String key) throws Exception {
if (key.equals("employeeAvg"))
return calculateEmployeeAvg(service.getAllEmployees());
if (key.equals("employeeEff"))
return calculateEmployeeEff(service.getAllEmployees());
return -1;
}
I find this very inefficient: in order to load both values I have to invoke service.getAllEmployees() twice because, correct me if I'm wrong, a CacheLoader should be stateless.
Which made me think to use the LoadingCache.put(key, value) method so I can just create a utility method that invokes service.getAllEmployees() once and calculate the values on the fly. However, if I do use LoadingCache.put(), I won't have the refreshAfterWrite feature since it's dependent on a cache loader.
How do I make this more efficient?

It seems like your problem stems from using strings to represent value types (Effective Java Item 50). Instead, consider defining a proper value type that stores this data, and use a memoizing Supplier to avoid recomputing them.
public static class EmployeeStatistics {
private final int average;
private final int efficiency;
// constructor and getters (the fields are final, so no setters)
}
Supplier<EmployeeStatistics> statistics = Suppliers.memoize(
new Supplier<EmployeeStatistics>() {
@Override
public EmployeeStatistics get() {
List<Employee> employees = new Service().getAllEmployees();
return new EmployeeStatistics(
calculateEmployeeAvg(employees),
calculateEmployeeEff(employees));
}});
You could even move these calculation methods inside EmployeeStatistics and simply pass in all employees to the constructor and let it compute the appropriate data.
If you need to configure your caching behavior more than Suppliers.memoize() or Suppliers.memoizeWithExpiration() can provide, consider this similar pattern, which hides the fact that you're using a Cache inside a Supplier:
Supplier<EmployeeStatistics> statistics = new Supplier<EmployeeStatistics>() {
private final Object key = new Object();
private final LoadingCache<Object, EmployeeStatistics> cache =
CacheBuilder.newBuilder()
// configure your builder
.build(
new CacheLoader<Object, EmployeeStatistics>() {
public EmployeeStatistics load(Object key) {
// same behavior as the Supplier above
}});
@Override
public EmployeeStatistics get() {
// getUnchecked avoids the checked ExecutionException that get(key) declares
return cache.getUnchecked(key);
}};

However, if I do use LoadingCache.put(), I won't have the refreshAfterWrite feature since it's dependent on a cache loader.
I'm not sure, but you might be able to call it from inside the load method. I mean, compute the requested value as you do and put in the other. However, this feels hacky.
If service.getAllEmployees is expensive, then you could cache it. If both calculateEmployeeAvg and calculateEmployeeEff are cheap, then recompute them when needed. Otherwise, it looks like you could use two caches.
I guess, a method computing both values at once could be a reasonable solution. Create a tiny Pair-like class aggregating them and use it as the cache value. There'll be a single key only.
Concerning your own solution, it could be as trivial as
class EmployeeStatsCache {
    private long validUntil;
    private List<Employee> employeeList;
    private Integer employeeAvg;
    private Integer employeeEff;
    private boolean isValid() {
        return System.currentTimeMillis() <= validUntil;
    }
    private synchronized List<Employee> getEmployeeList() {
        if (!isValid() || employeeList == null) {
            employeeList = service.getAllEmployees();
            validUntil = System.currentTimeMillis() + VALIDITY_MILLIS;
            // the list changed, so force both derived values to be recomputed
            employeeAvg = null;
            employeeEff = null;
        }
        return employeeList;
    }
    public synchronized int getEmployeeAvg() {
        if (!isValid() || employeeAvg == null) {
            employeeAvg = calculateEmployeeAvg(getEmployeeList());
        }
        return employeeAvg;
    }
    public synchronized int getEmployeeEff() {
        if (!isValid() || employeeEff == null) {
            employeeEff = calculateEmployeeEff(getEmployeeList());
        }
        return employeeEff;
    }
}
Instead of synchronized methods you may want to synchronize on a private final field. There are other possibilities (e.g. Atomic*), but the basic design is probably simpler than adapting Guava's Cache.
Now, I see that there's Suppliers#memoizeWithExpiration in Guava. That's probably even simpler.
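For reference, the behaviour of an expiring memoized supplier can be sketched in plain Java. This is not Guava's implementation, only an illustration of the idea; ExpiringSupplier, EmployeeStats, and the injectable clock are assumptions introduced for the example.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Illustrative only: caches one computed value and recomputes it after it expires.
final class ExpiringSupplier<T> implements Supplier<T> {
    private final Supplier<T> delegate;
    private final long validityMillis;
    private final LongSupplier clock; // injectable so expiry can be tested without sleeping
    private T value;
    private long validUntil;
    private boolean loaded;

    ExpiringSupplier(Supplier<T> delegate, long validityMillis, LongSupplier clock) {
        this.delegate = delegate;
        this.validityMillis = validityMillis;
        this.clock = clock;
    }

    @Override
    public synchronized T get() {
        long now = clock.getAsLong();
        if (!loaded || now > validUntil) {
            value = delegate.get();
            validUntil = now + validityMillis;
            loaded = true;
        }
        return value;
    }
}

// Both statistics travel together, so a single fetch of the employees feeds both.
final class EmployeeStats {
    final int average;
    final int efficiency;

    EmployeeStats(int average, int efficiency) {
        this.average = average;
        this.efficiency = efficiency;
    }
}
```

In production code the clock would typically be System::currentTimeMillis, and the delegate would call the employee service once and compute both numbers from the same list.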

Java pattern for parameters of which only one needs to be non-null?

Lately I often write long functions that have several parameters but use only one of them, where the functionality differs only at a few key points scattered around the function. Splitting the function would thus create too many small functions without a purpose. Is this good style, or is there a good general refactoring pattern for this? To be more clear, an example:
public void performSearch(DataBase dataBase, List<List<String>> segments) {performSearch(dataBase,null,null,segments);}
public void performSearch(DataBaseCache dataBaseCache,List<List<String>> segments) {performSearch(null,dataBaseCache,null,segments);}
public void performSearch(DataBase dataBase, List<String> keywords) {performSearch(dataBase,null,keywords,null);}
public void performSearch(DataBaseCache dataBaseCache,List<String> keywords) {performSearch(null,dataBaseCache,keywords,null);}
/** either dataBase or dataBaseCache may be null, dataBaseCache is used if it is non-null, else dataBase is used (slower). */
private void performSearch(DataBase dataBase, DataBaseCache dataBaseCache, List<String> keywords, List<List<String>> segments)
{
SearchObject search = new SearchObject();
search.setFast(true);
...
search.setNumberOfResults(25);
if(dataBaseCache!=null) {search.setSource(dataBaseCache);}
else {search.setSource(dataBase);}
... do some stuff ...
if(segments==null)
{
// create segments from keywords
....
segments = ...
}
}
This style of code works, but I don't like all those null parameters and the possibility of calling methods like this wrongly (both parameters null; what happens if both are non-null?), and I don't want to write 4 separate functions either... I know this may be too general, but maybe someone has a general solution to this kind of problem :-)
P.S.: I don't like to split up a long function if there is no reason for it other than it being long (i.e. if the subfunctions are only ever called in that order and only by this one function) especially if they are tightly interwoven and would need a big amount of parameters transported around them.

I think it is a very bad procedural style. Try to avoid such coding. Since you already have a lot of such code, it may be very hard to refactor because each method contains its own logic that is slightly different from the others. BTW, the fact that it is hard is evidence that the style is bad.
I think you should use behavioral patterns like
Chain of responsibilities
Command
Strategy
Template method
that can help you to change your procedural code to object oriented.
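As a rough illustration of the Strategy suggestion, the "which source?" decision can move behind an interface so that the search takes exactly one non-null argument. All names below (SearchSource, DirectSource, CachedSource, Searcher) are hypothetical and only loosely modelled on the question's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical abstraction over "where the data comes from".
interface SearchSource {
    List<String> lookup(String keyword);
}

// Stand-in for the slower direct database access.
final class DirectSource implements SearchSource {
    @Override
    public List<String> lookup(String keyword) {
        return Arrays.asList("db:" + keyword);
    }
}

// Stand-in for the faster cache-backed access.
final class CachedSource implements SearchSource {
    @Override
    public List<String> lookup(String keyword) {
        return Arrays.asList("cache:" + keyword);
    }
}

final class Searcher {
    private final SearchSource source;

    Searcher(SearchSource source) {
        // Exactly one source, never null: no "either may be null" contract to document.
        this.source = Objects.requireNonNull(source, "source");
    }

    // The shared algorithm is written once; only the lookup differs per strategy.
    List<String> performSearch(List<String> keywords) {
        List<String> results = new ArrayList<>();
        for (String keyword : keywords) {
            results.addAll(source.lookup(keyword));
        }
        return results;
    }
}
```

The keywords-versus-segments choice could be handled the same way, with a second small interface that produces segments.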

Could you use something like this
public static <T> T firstNonNull(T...parameters) {
for (T parameter: parameters) {
if (parameter != null) {
return parameter;
}
}
throw new IllegalArgumentException("At least one argument must be non null");
}
It does not check if more than one parameter is not null and they must be of the same type, but you could use it like this:
search.setSource(firstNonNull(dataBaseCache, dataBase));

Expecting nulls is an anti-pattern because it litters your code with NullPointerExceptions waiting to happen. Use the builder pattern to construct the SearchObject. This is the signature you want, I'll let you figure out the implementation:
class SearchBuilder {
SearchObject search = new SearchObject();
List<String> keywords = new ArrayList<String>();
List<List<String>> segments = new ArrayList<List<String>>();
public SearchBuilder(DataBase dataBase) {}
public SearchBuilder(DataBaseCache dataBaseCache) {}
public void addKeyword(String keyword) {}
public void addSegment(String... segment) {}
public void performSearch() {}
}

I agree with what Alex said. Without knowing the problem, I would recommend the following structure, based on what was in the example:
public interface SearchEngine {
public SearchEngineResult findByKeywords(List<String> keywords);
}
public class JDBCSearchEngine implements SearchEngine {
private DataSource dataSource;
public JDBCSearchEngine(DataSource dataSource) {
this.dataSource = dataSource;
}
public SearchEngineResult findByKeywords(List<String> keywords) {
// Find from JDBC datasource
// It might be useful to use a DAO instead of the datasource, if you have database operations other than searching
}
}
public class CachingSearchEngine implements SearchEngine {
private SearchEngine searchEngine;
public CachingSearchEngine(SearchEngine searchEngine) {
this.searchEngine = searchEngine;
}
public SearchEngineResult findByKeywords(List<String> keywords) {
// First check from cache
...
// If not found, then fetch from real search engine
SearchEngineResult result = searchEngine.findByKeywords(keywords);
// Then add to cache
// Return the result
return result;
}
}

Is there anything in Java close to the parallel collections in Scala?

What is the simplest way to implement a parallel computation (e.g. on a multiple core processor) using Java.
I.E. the java equivalent to this Scala code
val list = aLargeList
list.par.map(_*2)
There is this library, but it seems overwhelming.

http://gee.cs.oswego.edu/dl/jsr166/dist/extra166ydocs/
Don't give up so fast, snappy! ))
From the javadocs (with changes to map to your f) the essential matter is really just this:
ParallelLongArray a = ... // you provide
a.replaceWithMapping (new LongOp() { public long op(long a){return a*2L;}});
is pretty much this, right?
val list = aLargeList
list.par.map(_*2)
And if you are willing to live with a bit less terseness, the above can be a reasonably clean and clear three-liner (and of course, if you reuse functions, then it's exactly the same thing as Scala: inline functions):
ParallelLongArray a = ... // you provide
LongOp f = new LongOp() { public long op(long a){return a*2L;}};
a.replaceWithMapping (f);
[edited above to show concise complete form ala OP's Scala variant]
and here it is in maximal verbose form where we start from scratch for demo:
import java.util.Random;
import jsr166y.ForkJoinPool;
import extra166y.Ops.LongGenerator;
import extra166y.Ops.LongOp;
import extra166y.ParallelLongArray;
public class ListParUnaryFunc {
public static void main(String[] args) {
int n = Integer.parseInt(args[0]);
// create a parallel long array
// with random long values
ParallelLongArray a = ParallelLongArray.create(n-1, new ForkJoinPool());
a.replaceWithGeneratedValue(generator);
// use it: apply unaryLongFuncOp in parallel
// to all values in array
a.replaceWithMapping(unaryLongFuncOp);
// examine it
for(Long v : a.asList()){
System.out.format("%d\n", v);
}
}
static final Random rand = new Random(System.nanoTime());
static LongGenerator generator = new LongGenerator() {
@Override
public final long op() { return rand.nextLong(); }
};
static LongOp unaryLongFuncOp = new LongOp() {
@Override public final long op(long a) { return a * 2L; }
};
}
Final edit and notes:
Also note that a simple class such as the following (which you can reuse across your projects):
/**
* The very basic form w/ TODOs on checks, concurrency issues, init, etc.
*/
final public static class ParArray {
private ParallelLongArray parr;
private final long[] arr;
public ParArray (long[] arr){
this.arr = arr;
}
public final ParArray par() {
if(parr == null)
parr = ParallelLongArray.createFromCopy(arr, new ForkJoinPool()) ;
return this;
}
public final ParallelLongArray map(LongOp op) {
return parr.replaceWithMapping(op);
}
public final long[] values() { return parr.getArray(); }
}
and something like that will allow you to write more fluid Java code (if terseness matters to you):
long[] arr = ... // you provide
LongOp f = ... // you provide
ParArray list = new ParArray(arr);
list.par().map(f);
And the above approach can certainly be pushed to make it even cleaner.

Doing that on one machine is pretty easy, but not as easy as Scala makes it. The java.util.concurrent part of that JSR 166 work has been included in Java since version 5. Probably the simplest thing to use is an ExecutorService. It manages a pool of threads that can run on any processor: you submit tasks to it and get results back.
http://download.oracle.com/javase/1,5.0/docs/api/java/util/concurrent/ThreadPoolExecutor.html
http://www.fromdev.com/2009/06/how-can-i-leverage-javautilconcurrent.html
I'd suggest using ExecutorService.invokeAll(), which returns a list of Futures. Then you can check them to see if they're done.
If you're using Java 7 then you could use the fork/join framework, which might save you some work. With all of these you can build something very similar to Scala parallel arrays, so using it is fairly concise.
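A minimal sketch of the invokeAll() approach applied to the question's "multiply every element by 2" example. The class name ParallelDoubler and the chunking scheme are assumptions chosen for illustration; other ways of splitting the work are equally valid.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelDoubler {
    // Doubles every element by splitting the list into chunks, one task per chunk.
    static List<Integer> doubleAll(List<Integer> input, int threads) {
        if (input.isEmpty()) {
            return new ArrayList<>();
        }
        int chunk = (input.size() + threads - 1) / threads; // ceiling division
        List<Callable<List<Integer>>> tasks = new ArrayList<>();
        for (int start = 0; start < input.size(); start += chunk) {
            List<Integer> slice = input.subList(start, Math.min(start + chunk, input.size()));
            tasks.add(() -> {
                List<Integer> out = new ArrayList<>(slice.size());
                for (int v : slice) {
                    out.add(v * 2);
                }
                return out;
            });
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Integer> result = new ArrayList<>(input.size());
            // invokeAll blocks until all tasks finish and returns futures in task order.
            for (Future<List<Integer>> f : pool.invokeAll(tasks)) {
                result.addAll(f.get());
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel map failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the futures come back in the same order as the submitted tasks, concatenating their results preserves the original element order.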

Use threads; Java doesn't have this sort of thing built in.

There will be an equivalent in Java 8: http://www.infoq.com/articles/java-8-vs-scala
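That equivalent arrived as the Stream API in Java 8. A short sketch of the Scala one-liner in that style (the class name ParallelStreamDoubler is just for illustration):

```java
import java.util.List;
import java.util.stream.Collectors;

final class ParallelStreamDoubler {
    // Rough Java 8 counterpart of Scala's list.par.map(_ * 2).
    static List<Integer> doubleAll(List<Integer> input) {
        return input.parallelStream()
                    .map(x -> x * 2)
                    .collect(Collectors.toList());
    }
}
```

Collecting an ordered parallel stream into a list keeps the encounter order, so the result lines up element by element with the input.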

How do I synchronize to prevent a java.util.ConcurrentModificationException

I have a program consisting of a number of classes. I have a problem with the interaction of two of the classes - WebDataCache and Client. The relevant classes are listed below.
WebData:
This is just a data class representing some data retrieved from the internet.
WebService:
This is just a web service wrapper class which connects to a particular web service, reads some data and stores it in an object of type WebData.
WebDataCache:
This is a class which uses the WebService class to retrieve data, which is cached in a map keyed by the id fields of the data.
Client:
This is a class which contains a reference to an instance of the WebDataCache class and uses the cached data.
The problem is (as illustrated below) that while the Client class is looping through the cached data, it is possible for the WebDataCache to update the underlying collection.
My question is how do I synchronize access to the cache?
I don't want to synchronize the whole cache, as there are multiple instances of the Client class, each instantiated with a unique id (i.e. new Client(0,...), new Client(1,...), new Client(2,...), etc.), and each instance is only interested in data keyed by the id it was instantiated with.
Are there any relevant design patterns I can use?
class WebData {
private final int id;
private final long id2;
public WebData(int id, long id2) {
this.id = id;
this.id2 = id2;
}
public int getId() { return this.id; }
public long getId2() { return this.id2; }
}
class WebService {
Collection<WebData> getData(int id) {
Collection<WebData> a = new ArrayList<WebData>();
// populate A with data from a webservice
return a;
}
}
class WebDataCache implements Runnable {
private Map<Integer, Map<Long, WebData>> cache =
new HashMap<Integer, Map<Long, WebData>>();
private Collection<Integer> requests =
new ArrayList<Integer>();
@Override
public void run() {
WebService webSvc = new WebService();
// get data from some web service
while(true) {
for (int id : requests) {
Collection<WebData> webData = webSvc.getData(id);
Map<Long, WebData> row = cache.get(id);
if (row == null)
row = cache.put(id, new HashMap<Long, WebData>());
else
row.clear();
for (WebData webDataItem : webData) {
row.put(webDataItem.getId2(), webDataItem);
}
}
Thread.sleep(2000);
}
}
public synchronized Collection<WebData> getData(int id){
return cache.get(id).values();
}
public synchronized void requestData(int id) {
requests.add(id);
}
}
class Client implements Runnable {
private final WebDataCache cache;
private final int id;
public Client(int id, WebDataCache cache){
this.id = id;
this.cache = cache;
}
@Override
public void run() {
cache.requestData(id);
while (true) {
for (WebData item : cache.getData(id)) {
// java.util.ConcurrentModificationException is thrown here...
// I understand that the collection is probably being modified in WebDataCache::run()
// my question: what's the best way to synchronize this code snippet?
}
}
}
}
Thanks!

Use java.util.concurrent.ConcurrentHashMap instead of plain old java.util.HashMap. From the Javadoc:
A hash table supporting full concurrency of retrievals and adjustable expected concurrency for updates. This class obeys the same functional specification as Hashtable, and includes versions of methods corresponding to each method of Hashtable. However, even though all operations are thread-safe, retrieval operations do not entail locking, and there is not any support for locking the entire table in a way that prevents all access. This class is fully interoperable with Hashtable in programs that rely on its thread safety but not on its synchronization details.
http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/ConcurrentHashMap.html
So you would replace:
private Map<Integer, Map<Long, WebData>> cache =
new HashMap<Integer, Map<Long, WebData>>();
With
private Map<Integer, Map<Long, WebData>> cache =
new ConcurrentHashMap<Integer, Map<Long, WebData>>();

My best recommendation is to use an existing cache implementation such as JCS or EhCache - these are battle tested implementations.
Otherwise, you have a couple of things going on in your code. Things that can break in funny ways.
HashMap can get stuck in infinite loops when modified concurrently by multiple threads. So don't do that; use java.util.concurrent.ConcurrentHashMap instead.
The ArrayList that you use for WebDataCache.requests isn't thread-safe either and you have inconsistent synchronization - either change it to a safer list implementation from java.util.concurrent or make sure that all access to it is synchronizing on the same lock.
Lastly, have your code checked with FindBugs and/or properly reviewed by someone with solid knowledge and experience on writing multi-threaded code.
If you want to read a book on this stuff, I can recommend Java Concurrency in Practice by Brian Goetz.

In addition to the other posted recommendations, consider how often the cache is updated versus just being read. If the reading dominates and updating is rare, and it's not critical that the reading loop be able to see every update immediately, consider using a CopyOnWriteArraySet. It and its sibling CopyOnWriteArrayList allow concurrent reading and updating of the members; the reader sees a consistent snapshot unaffected by any mutation of the underlying collection -- analogous to the SERIALIZABLE isolation level in a relational database.
The problem here, though, is that neither of these two structures give you your dictionary or associative array storage (a la Map) out of the box. You'd have to define a composite structure to store the key and value together, and, given that CopyOnWriteArraySet uses Object#equals() for membership testing, you'd have to write an unconventional key-based equals() method for your structure.
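The snapshot behaviour described above can be seen in a few lines. This is a small single-threaded sketch (the class and method names are made up): the loop modifies the list it is iterating, which a CopyOnWriteArrayList tolerates and an ArrayList does not.

```java
import java.util.ArrayList;
import java.util.List;

final class SnapshotIterationDemo {
    // Iterates the list while appending to it; returns the elements the loop actually saw.
    static List<String> iterateWhileAdding(List<String> list) {
        List<String> seen = new ArrayList<>();
        for (String s : list) {
            seen.add(s);
            // With CopyOnWriteArrayList the iterator keeps reading its snapshot;
            // with ArrayList the next iteration step throws ConcurrentModificationException.
            list.add(s + "-copy");
        }
        return seen;
    }
}
```

Passing a CopyOnWriteArrayList containing "a" and "b" makes the loop see only those two elements, while the list itself ends up with four. The price is that every write copies the backing array, which is why the structure only suits read-mostly data.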

The answer from LES2 is good except that you would have to replace:
row = cache.put(id, new HashMap<Long, WebData>());
with:
row = cache.put(id, new ConcurrentHashMap<Long, WebData>());
because that is the map that holds the "problematic" collection, not the cache as a whole.
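A simplified sketch of that idea with concurrent maps at both levels. RowCache is a hypothetical stand-in for WebDataCache that stores plain strings; it also uses computeIfAbsent, since Map.put returns the previous value (null for a new key) rather than the row that was just inserted.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified cache: outer key is the client id, inner key is id2.
final class RowCache {
    private final Map<Integer, Map<Long, String>> cache = new ConcurrentHashMap<>();

    // computeIfAbsent creates the row atomically and returns the row that is actually in the map.
    void put(int id, long id2, String value) {
        cache.computeIfAbsent(id, k -> new ConcurrentHashMap<>()).put(id2, value);
    }

    // The returned view is weakly consistent: iterating it while another thread
    // writes to the row never throws ConcurrentModificationException.
    Collection<String> getData(int id) {
        Map<Long, String> row = cache.get(id);
        return row == null ? Collections.<String>emptyList() : row.values();
    }
}
```

A reader iterating getData(id) may or may not see entries added during the iteration. If clients need an all-or-nothing view of a refresh, building a new row and swapping it in with a single put would be one alternative.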

You can synchronize on the row returned by the cache, since in the end that is the object holding the collection being shared.
On WebDataCache:
Map<Long, WebData> row = cache.get(id);
if (row == null) {
row = cache.put(id, new HashMap<Long, WebData>());
} else synchronized( row ) {
row.clear();
}
for (WebData webDataItem : webData) synchronized( row ) {
row.put(webDataItem.getId2(), webDataItem);
}
// it doesn't make sense to synchronize the whole cache here.
public Collection<WebData> getData(int id){
return cache.get(id).values();
}
On Client:
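Collection<WebData> data = cache.getData(id);
// Note: values() returns a view object that is distinct from the row map itself,
// so this only excludes the writer if the writer locks the same object.
synchronized( data ) {
for (WebData item : data) {
}
}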
Of course this is far from perfect; it just answers the question of what to synchronize. In this case that is the access to the underlying collection: row.clear() and row.put() in the cache, and the iteration in the client.
BTW, why do you have a Map in the cache but use a Collection in the client? You should use the same structure in both and not expose the underlying implementation.
