I am not sure if I can put my question in the clearest fashion, but I will try my best.
Let's say I am retrieving some information from a third-party API. The retrieved information will be huge. To gain performance, instead of retrieving all the info in one go, I will retrieve it in a paged fashion (the API gives me that facility, basically an iterator). The return type is basically a list of objects.
My aim is to process the information I have in hand (that includes comparing it, storing it in the DB, and many other operations) while I keep receiving the paged responses.
My question to the expert community is: what data structure do you prefer in such a case? Also, does a framework like Spring Batch help you get performance gains in such cases?
I know the question is a bit vague, but I am looking for general ideas, tips, and pointers.
In these cases, the data structure for me is java.util.concurrent.CompletionService.
For purposes of example, I'm going to assume a couple of additional constraints:
You want only one outstanding request to the remote server at a time
You want to process the results in order.
Here goes:
// a class that knows how to update the DB given a page of results
class DatabaseUpdater implements Callable<Object> { ... }

// a background thread to do the work
final CompletionService<Object> exec =
        new ExecutorCompletionService<>(Executors.newSingleThreadExecutor());

// first call
List<Object> results = ThirdPartyAPI.getPage( ... );

// start loading those results into the DB on the background thread
exec.submit(new DatabaseUpdater(results));

while ( /* you need to */ ) {
    // another call to the remote service
    results = ThirdPartyAPI.getPage( ... );

    // wait for the existing work to complete (get() rethrows any exception)
    exec.take().get();

    // send more work to the background thread
    exec.submit(new DatabaseUpdater(results));
}

// wait for the last task to complete
exec.take().get();
This is just a simple two-thread design. The first thread is responsible for getting data from the remote service and the second is responsible for writing to the database.
Any exceptions thrown by DatabaseUpdater will be propagated to the main thread when the result is retrieved (via exec.take().get()).
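For completeness, here is a minimal sketch of what DatabaseUpdater could look like; the dao.save(...) call is a placeholder, not part of the original design:

import java.util.List;
import java.util.concurrent.Callable;

// Writes one page of results to the database on the worker thread.
class DatabaseUpdater implements Callable<Object> {

    private final List<Object> page;

    DatabaseUpdater(List<Object> page) {
        this.page = page;
    }

    @Override
    public Object call() throws Exception {
        for (Object item : page) {
            // dao.save(item);  // placeholder for your real persistence call
        }
        return null;  // any exception thrown here surfaces in exec.take().get()
    }
}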
Good luck.
In terms of doing the actual parallelism, one very useful construct in Java is the ThreadPoolExecutor. A rough sketch of what that might look like is this:
public class YourApp {

    static class Processor implements Runnable {
        private final Widget toProcess;

        public Processor(Widget toProcess) {
            this.toProcess = toProcess;
        }

        public void run() {
            // commit the Widget to the DB, etc
        }
    }

    public static void main(String[] args) {
        ThreadPoolExecutor executor =
            new ThreadPoolExecutor(1, 10, 30,
                                   TimeUnit.SECONDS,
                                   new LinkedBlockingDeque<Runnable>());

        while (thereAreStillWidgets()) {
            ArrayList<Widget> widgets = doExpensiveDatabaseCall();

            for (Widget widget : widgets) {
                Processor processor = new Processor(widget);
                executor.execute(processor);
            }
        }

        executor.shutdown();  // let queued tasks finish, accept no new ones
    }
}
But as I said in a comment: calls to an external API are expensive. It's very likely that the best strategy is to pull all the Widget objects down from the API in one call, and then process them in parallel once you've got them. Every additional API call adds the overhead of sending the data all the way from the server to you, so it's probably best to pay that cost as few times as you can.
Also, keep in mind that if you're doing DB operations, it's possible that your DB doesn't allow for parallel writes, so you might get a slowdown there.
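A minimal sketch of that strategy, reusing the Widget type from the code above; ThirdPartyAPI.getAllWidgets() and saveToDatabase() are assumed names, not a real API:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SingleFetchExample {

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8);

        // one expensive remote call instead of many
        List<Widget> widgets = ThirdPartyAPI.getAllWidgets();

        List<Callable<Void>> tasks = new ArrayList<>();
        for (Widget widget : widgets) {
            tasks.add(() -> {
                saveToDatabase(widget);  // placeholder for the real DB write
                return null;
            });
        }

        pool.invokeAll(tasks);  // blocks until every widget has been processed
        pool.shutdown();
    }

    private static void saveToDatabase(Widget widget) {
        // commit the Widget to the DB, keeping in mind the parallel-write caveat above
    }
}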
I'm looking for some help, since I don't know how to optimize a process.
I have to invoke a service that returns a list with more than 500K elements (I don't know why; these services belong to the client). For each element of the list, I have to invoke 2 more services and then save some attributes in our database. This last step is not the problem, but the entire process takes between 1 and 2 seconds per element, so it is going to take more than 100 hours to complete.
My approach is the following: in my main method I get the large list, then I use a parallelStream to iterate over the elements of the list, and then I use a CompletableFuture to call the method that invokes the 2 services mentioned above. I've tried changing the parallelStream to stream and for-each, tried splitting the main list into smaller lists, and many other things, but I don't see better performance. I think the problem is the invocation of those 2 services, but I want to try my luck asking here.
I'm using Java 11 and Spring, and for the invocation of the services I'm using RestTemplate. This is my code:
public void updateDiscount() {
    // List with 500k elements
    var relationshipList = relationshipService.getLargeList();

    // CompletableFuture to make the async calls to the method above
    relationshipList.parallelStream().forEach(level1 -> {
        CompletableFuture.runAsync(() -> relationshipService.asyncDiscountSave(level1));
    });
}
// Second class
@Async("nameOfThePool")
public void asyncDiscountSave(ElementOfList element) {
    // Logic to create the request
    // .........
    var responseClients = anotherClass.getClients(element.getGroup1());   // get the first response with RestTemplate
    var responseProducts = anotherClass.getProducts(element.getGroup2()); // get the second response with RestTemplate

    for (var client : responseClients) {
        for (var product : responseProducts) {
            // Here we just save some attributes of these objects in our DB
        }
    }
}
Thanks for the help.
UPDATE:
For this particular case, the only improvement I can make is to pass a thread pool to the CompletableFuture; the real problem is the response time of the services that I need to invoke.
I decided to follow a second approach, and it took about 5 hours to complete. Compared with the first approach, this is acceptable.
As you haven't defined an executor, you are using the default pool. Adding an executor allows you to create as many threads as you need and as the server's resources can manage:
public void updateDiscount() {
    Executor executor = Executors.newFixedThreadPool(100); // Define the number according to server resources performance

    // List with 500k elements
    var relationshipList = relationshipService.getLargeList();

    // CompletableFuture to make the async calls to the method above
    relationshipList.parallelStream().forEach(level1 -> {
        CompletableFuture.runAsync(() -> relationshipService.asyncDiscountSave(level1), executor);
    });
}
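One possible refinement, not part of the original answer: collect the futures and wait for all of them, so the method does not return while work is still queued, and shut the pool down afterwards. A minimal sketch, assuming the same relationshipService:

public void updateDiscount() {
    ExecutorService executor = Executors.newFixedThreadPool(100);
    var relationshipList = relationshipService.getLargeList();

    // keep a handle on every future so we can wait for all of them
    CompletableFuture<?>[] futures = relationshipList.stream()
            .map(level1 -> CompletableFuture.runAsync(
                    () -> relationshipService.asyncDiscountSave(level1), executor))
            .toArray(CompletableFuture[]::new);

    CompletableFuture.allOf(futures).join();  // block until every element has been processed
    executor.shutdown();
}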
I need to select data from a table, manipulate it, and then insert it into another table. This only happens when the app is opened for the first time that day, and it isn't going to be used in the UI. I don't want to use LiveData because the data doesn't need to be observed, but when I was looking into how to do this, most people said I should use LiveData. I've tried using AsyncTask, but I get the error "Cannot access database on the main thread since it may potentially....".
Here is the code for my AsyncTask
public class getAllClothesArrayAsyncTask extends AsyncTask<ArrayList<ClothingItem>, Void, ArrayList<ClothingItem>[]> {
    private ClothingDao mAsyncDao;

    getAllClothesArrayAsyncTask(ClothingDao dao) {
        mAsyncDao = dao;
    }

    @Override
    protected ArrayList<ClothingItem>[] doInBackground(ArrayList<ClothingItem>... arrayLists) {
        List<ClothingItem> clothingList = mAsyncDao.getAllClothesArray();
        ArrayList<ClothingItem> arrayList = new ArrayList<>(clothingList);
        return arrayLists;
    }
}
And this is how I'm calling it in my activity
mClothingViewModel = new ViewModelProvider(this).get(ClothingViewModel.class);
clothingItemArray = mClothingViewModel.getClothesArray();
What is the best practice in this situation?
Brief summary:
Room really doesn't allow you to do anything (query | insert | update | delete) on the main thread. You can switch this check off on the RoomDatabase.Builder (allowMainThreadQueries()), but you'd better not.
If you don't care about the UI, you can minimally just put your Room code (a Runnable) on one of Thread, Executor, or AsyncTask (which was deprecated last year)... I've put examples below.
The best practice for a one-shot DB operation like this is, I think, coroutines (for those who use Kotlin in their projects) or RxJava (for those who use Java; Single|Maybe as return types). They give you many more possibilities, but you have to invest time to understand these mechanisms. A minimal RxJava sketch is shown right after this list.
To observe a data stream from Room there are LiveData, coroutines Flow, and RxJava (Flowable).
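For illustration, a minimal RxJava sketch (my own addition; it assumes the room-rxjava2 artifact is on the classpath and that the DAO is changed to return a Single):

// Hypothetical DAO method:
// @Query("SELECT * FROM clothing_table")
// Single<List<ClothingItem>> getAllClothesArray();

Disposable disposable = mAsyncDao.getAllClothesArray()
        .subscribeOn(Schedulers.io())  // run the query off the main thread
        .subscribe(
                clothingList -> {
                    // manipulate the list and insert it into the other table
                },
                throwable -> Log.e("ClothingRepo", "query failed", throwable));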
Several examples of thread-switching with lambdas (if for some reason you don't want to learn the more advanced stuff):
Just a Thread
new Thread(() -> {
    List<ClothingItem> clothingList = mAsyncDao.getAllClothesArray();
    // ... next operations
}).start();
Executor
Executors.newSingleThreadExecutor().submit(() -> {
    List<ClothingItem> clothingList = mAsyncDao.getAllClothesArray();
    // ... next operations
});
AsyncTask
AsyncTask.execute(() -> {
    List<ClothingItem> clothingList = mAsyncDao.getAllClothesArray();
    // ... next operations
});
If you use the Repository pattern, you can put all of this thread-switching there.
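For example, a minimal sketch of such a repository; ClothingRepository and moveClothesToOtherTable are made-up names, only the ClothingDao comes from the question:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ClothingRepository {

    private final ClothingDao clothingDao;
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    public ClothingRepository(ClothingDao clothingDao) {
        this.clothingDao = clothingDao;
    }

    // Callers never touch threads; the repository does the switching itself.
    public void moveClothesToOtherTable() {
        executor.submit(() -> {
            List<ClothingItem> clothingList = clothingDao.getAllClothesArray();
            // manipulate the list and insert it into the other table
        });
    }
}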
Another useful link to read about life after AsyncTask deprecation
I am trying to get a grasp on Google App Engine programming and wonder what the difference between these two methods is - if there even is a practical difference.
Method A)
public Collection<Conference> getConferencesToAttend(Profile profile)
{
    List<String> keyStringsToAttend = profile.getConferenceKeysToAttend();
    List<Conference> conferences = new ArrayList<Conference>();

    for (String conferenceString : keyStringsToAttend)
    {
        conferences.add(ofy().load().key(Key.create(Conference.class, conferenceString)).now());
    }

    return conferences;
}
Method B)
public Collection<Conference> getConferencesToAttend(Profile profile)
{
    List<String> keyStringsToAttend = profile.getConferenceKeysToAttend();
    List<Key<Conference>> keysToAttend = new ArrayList<>();

    for (String keyString : keyStringsToAttend) {
        keysToAttend.add(Key.<Conference>create(keyString));
    }

    return ofy().load().keys(keysToAttend).values();
}
the "conferenceKeysToAttend" list is guaranteed to only have unique Conferences - does it even matter then which of the two alternatives I choose? And if so, why?
Method A loads entities one by one, while Method B does a bulk load, which is cheaper, since you're making just one network round trip to Google's datacenter. You can observe this by measuring the time taken by both methods while loading a bunch of keys multiple times.
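A crude way to measure that difference (my own sketch; getConferencesToAttendA and getConferencesToAttendB are just the two methods above under hypothetical names):

long start = System.nanoTime();
getConferencesToAttendA(profile);   // Method A: one datastore round trip per key
long oneByOne = System.nanoTime() - start;

start = System.nanoTime();
getConferencesToAttendB(profile);   // Method B: a single bulk load
long bulk = System.nanoTime() - start;

System.out.println("one-by-one: " + oneByOne / 1_000_000 + " ms, bulk: " + bulk / 1_000_000 + " ms");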
While doing a bulk load, you need to be careful about the loaded entities if the datastore operation throws an exception: the operation might succeed even when some of the entities were not loaded.
The answer depends on the size of the list. If we are talking about hundreds or more, you should not make a single batch call. I couldn't find documentation on what the limit is, but there is a limit. If it is not that many, definitely go with loading them one by one. But you should make the calls asynchronous by not using the now() function:
List<LoadResult<Conference>> conferences = new ArrayList<>();
conferences.add(ofy().load().key(Key.create(Conference.class, conferenceString)));
And when you need the actual data:
for (LoadResult<Conference> result : conferences) {
    Conference c = result.now();   // blocks only if the async load hasn't finished yet
    ......
}
Please show me where I'm missing something.
I have a cache built by CacheBuilder inside a DataPool. DataPool is a singleton object whose instance various threads can get and act on. Right now I have a single thread which produces data and adds it into the said cache.
To show the relevant part of the code:
private InputDataPool() {
    cache = CacheBuilder.newBuilder()
        .expireAfterWrite(1000, TimeUnit.NANOSECONDS)
        .removalListener(new RemovalListener() {
            {
                logger.debug("Removal Listener created");
            }

            public void onRemoval(RemovalNotification notification) {
                System.out.println("Going to remove data from InputDataPool");
                logger.info("Following data is being removed:" + notification.getKey());
                if (notification.getCause() == RemovalCause.EXPIRED) {
                    logger.fatal("This data expired:" + notification.getKey());
                } else {
                    logger.fatal("This data didn't expired but evacuated intentionally" + notification.getKey());
                }
            }
        })
        .build(new CacheLoader() {
            @Override
            public Object load(Object key) throws Exception {
                logger.info("Following data being loaded" + (Integer) key);
                Integer uniqueId = (Integer) key;
                return InputDataPool.getInstance().getAndRemoveDataFromPool(uniqueId);
            }
        });
}
public static InputDataPool getInstance() {
    if (clsInputDataPool == null) {
        synchronized (InputDataPool.class) {
            if (clsInputDataPool == null) {
                clsInputDataPool = new InputDataPool();
            }
        }
    }
    return clsInputDataPool;
}
From the said thread the call being made is as simple as
while (true) {
    inputDataPool.insertDataIntoPool(inputDataPacket);
    // call some logic which comes with inputDataPacket and sleep for 2 seconds
}
and where inputDataPool.insertDataIntoPool is like
inputDataPool.insertDataIntoPool(InputDataPacket inputDataPacket) {
    cache.get(inputDataPacket.getId());
}
Now the question is: the elements in the cache are supposed to expire after 1000 nanoseconds. So when inputDataPool.insertDataIntoPool is called the second time, the data that was inserted the first time should be evicted, since it must have expired, the call being made 2 seconds after its insertion. Correspondingly, the removal listener should be called.
But this is not happening. I looked into the cache stats and evictionCount is always zero, no matter how many times cache.get(id) is called.
But importantly, if I extend inputDataPool.insertDataIntoPool like this:
inputDataPool.insertDataIntoPool(InputDataPacket inputDataPacket) {
    cache.get(inputDataPacket.getId());
    try {
        Thread.sleep(2000);
    } catch (InterruptedException ex) {
        ex.printStackTrace();
    }
    cache.get(inputDataPacket.getId());
}
then the eviction takes place as expected, with the removal listener being called.
Now I'm very much clueless about where I'm missing something. Please help me see it, if you spot anything.
P.S. Please ignore any typos. Also, no checks are being made and no generics have been used, as this is just in the phase of testing the CacheBuilder functionality.
Thanks
As explained in the javadoc and in the user guide, there is no thread that makes sure entries are removed from the cache as soon as the delay has elapsed. Instead, entries are removed during write operations, and occasionally during read operations if writes are rare. This allows for high throughput and low latency. And of course, not every write operation causes a cleanup:
Caches built with CacheBuilder do not perform cleanup and evict values
"automatically," or instantly after a value expires, or anything of
the sort. Instead, it performs small amounts of maintenance during
write operations, or during occasional read operations if writes are
rare.
The reason for this is as follows: if we wanted to perform Cache
maintenance continuously, we would need to create a thread, and its
operations would be competing with user operations for shared locks.
Additionally, some environments restrict the creation of threads,
which would make CacheBuilder unusable in that environment.
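If you really need expired entries to be removed (and the removal listener fired) promptly even when the cache is idle, one option, my own addition rather than something from the quoted documentation, is to schedule the cache's own maintenance method yourself:

// Periodically triggers Guava's own maintenance pass so expired entries
// are evicted without waiting for the next write to the cache.
ScheduledExecutorService cleaner = Executors.newSingleThreadScheduledExecutor();
cleaner.scheduleAtFixedRate(cache::cleanUp, 1, 1, TimeUnit.SECONDS);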
I had the same issue, and I found this in Guava's documentation for CacheBuilder.removalListener:
Warning: after invoking this method, do not continue to use this cache
builder reference; instead use the reference this method returns. At
runtime, these point to the same instance, but only the returned
reference has the correct generic type information so as to ensure
type safety. For best results, use the standard method-chaining idiom
illustrated in the class documentation above, configuring a builder
and building your cache in a single statement. Failure to heed this
advice can result in a ClassCastException being thrown by a cache
operation at some undefined point in the future.
So by changing your code to use the builder reference that is returned after adding the removalListener, this problem can be resolved:
CacheBuilder builder = CacheBuilder.newBuilder()
    .expireAfterWrite(1000, TimeUnit.NANOSECONDS)
    .removalListener(new RemovalListener() {
        {
            logger.debug("Removal Listener created");
        }

        public void onRemoval(RemovalNotification notification) {
            System.out.println("Going to remove data from InputDataPool");
            logger.info("Following data is being removed:" + notification.getKey());
            if (notification.getCause() == RemovalCause.EXPIRED) {
                logger.fatal("This data expired:" + notification.getKey());
            } else {
                logger.fatal("This data didn't expired but evacuated intentionally" + notification.getKey());
            }
        }
    });

cache = builder.build(new CacheLoader() {
    @Override
    public Object load(Object key) throws Exception {
        logger.info("Following data being loaded" + (Integer) key);
        Integer uniqueId = (Integer) key;
        return InputDataPool.getInstance().getAndRemoveDataFromPool(uniqueId);
    }
});
This problem will be resolved. It is kind of weird, but I guess it is what it is :)
I'm looking to create a filter that can give me two things: the number of requests per minute, and the average response time per minute. I already have the individual readings; I'm just not sure how to add them up.
My filter captures every request, and it records the time each request takes:
public void doFilter(ServletRequest request, ...)
{
    long start = System.currentTimeMillis();
    chain.doFilter(request, response);
    long stop = System.currentTimeMillis();
    String time = Util.getTimeDifferenceInSec(start, stop);
}
This information will be used to create some pretty Google Chart charts. I don't want to store the data in any database; I just want a way to get the current numbers out when requested.
As this is a high-volume application, low overhead is essential.
I'm assuming my application server doesn't provide this information.
I did something similar once. If I remember correctly, I had something like this:
public class StatisticsFilter implements ...
{
    public static Statistics stats;

    public class PeriodicDumpStat extends Thread
    {
        ...
    }

    public void doFilter(ServletRequest request, ...)
    {
        long start = System.currentTimeMillis();
        chain.doFilter(request, response);
        long stop = System.currentTimeMillis();

        stats.add( stop - start );
    }

    public void init()
    {
        Thread t = new PeriodicDumpStat();
        t.setDaemon( true );
        t.start();
    }
}
(That's only a sketch)
Make sure that the Statistics object is correctly synchronized, as it will be accessed concurrently.
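A minimal sketch of what such a thread-safe Statistics class could look like (my own illustration; the field and method names are made up):

import java.util.concurrent.atomic.AtomicLong;

public class Statistics {

    private final AtomicLong requestCount = new AtomicLong();
    private final AtomicLong totalMillis = new AtomicLong();

    // called from doFilter on every request; lock-free, so the overhead stays low
    public void add(long elapsedMillis) {
        requestCount.incrementAndGet();
        totalMillis.addAndGet(elapsedMillis);
    }

    // called by the periodic dump thread; resets the counters for the next interval
    // (the two resets are not atomic together, which is usually fine for coarse stats)
    public long[] snapshotAndReset() {
        long count = requestCount.getAndSet(0);
        long total = totalMillis.getAndSet(0);
        long avg = (count == 0) ? 0 : total / count;
        return new long[] { count, avg };
    }
}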
I had a background DumpStatistics thread that periodically dumped the stats to an XML file, to be processed later. For better encapsulation, I had the thread as an inner class. You can of course use a Runnable as well. As @Trevor Tippins pointed out, it's also good to flag the thread as a daemon thread.
I also used Google Chart and had another ShowStatisticsServlet that would read the XML file and turn the data into a nice chart. The servlet did not depend on the filter, only on the XML file, so the two were actually decoupled. The XML file can be created as a temporary file with File.createTempFile. (Another variant would of course be to keep all the data in memory, but storing the data was handy for us to back up the results of performance tests and analyze them later.)
A colleague claimed that the synchronization in the Statistics object would "kill" the app's performance, but in practice it was really negligible, as was the overhead of dumping the file, given that it was done every 10 seconds or so.
Hope this helps, or gives you some ideas.
PS: As @William Louth commented, you should write such infrastructure code only if you can't solve your issue with an existing solution. In my case, I was also benchmarking the internal time of my code, not only the complete request processing time.