Multi Threading in Google App Engine Datastore

How can I make getting and setting a property in the datastore thread safe?
Currently, I have code that puts tasks in a queue. Each task does its work and then updates a property called numberOfTasks, which is of type int: it fetches the current value of the property and increments it.
However, as tasks are executed from the queue, the final value is not correct because of a threading issue. Sometimes two tasks try to update the property at the same time, and one of the increments is lost.
Could anyone please help me get this right?
Datastore Property Getter Method:
private String doGet(String rowId) throws EntityNotFoundException {
Key egsKey = KeyFactory.createKey(DATASTORE_KIND, rowId);
Entity egsEntity = datastore.get(egsKey);
// schema changed from String to Text type. Transparently handle that here.
Object propertyValue = egsEntity.getProperty(PROPERTY_KEY);
if (propertyValue instanceof String) {
return (String) propertyValue;
}
Text text = (Text) propertyValue;
return text.getValue();
}
Datastore Property Setter Method:
// Takes a String value, matching the call in setPendingUsersForProcessing and the cast in doGet.
private void doPut(String rowId, String value) {
Entity entity = new Entity(DATASTORE_KIND, rowId);
entity.setProperty(PROPERTY_KEY, value);
datastore.put(entity);
}
Setter and Getter Methods:
public synchronized int getPendingUsersForProcessing() {
String pendingUsersForProcessingAsString = null;
try {
pendingUsersForProcessingAsString = doGet(PENDING_USERS_FOR_PROCESSING);
return Integer.valueOf(pendingUsersForProcessingAsString);
} catch (NumberFormatException e) {
throw new IllegalStateException("The num of last batches processed in Datastore is not a number: "
+ pendingUsersForProcessingAsString);
} catch (EntityNotFoundException e) {
return DEFAULT_PENDING_USERS_FOR_PROCESSING;
}
}
/** {@inheritDoc} */
@Override
public synchronized void setPendingUsersForProcessing(int pendingUsersForProcessing) {
doPut(PENDING_USERS_FOR_PROCESSING, String.valueOf(pendingUsersForProcessing));
LOG.info("Number of Pending Users For Processing is set to : " + pendingUsersForProcessing);
}
Code Where I am trying to update the property:
int pendingUsers = appProperties.getPendingUsersForProcessing();
int requestUsers = request.getUserKeys().size();
appProperties.setPendingUsersForProcessing(pendingUsers + requestUsers);

This is not exactly a threading issue: you may have multiple instances of your app performing the tasks, and those instances do not know about each other, so synchronized methods cannot protect the value. Even within one instance, the read and the write are two separate calls, so two tasks can read the same value before either writes it back. This is a contention situation.
You have several options to resolve it:
1. Use sharding for your counters.
2. Instead of constantly updating the same entity, create a new entity for each completed task, using the time when the task was completed as an id. The advantage of this approach is that it creates an audit trail, and you can always get stats like the number of tasks completed today, within the last hour, etc. To count the entities you can use a keys-only query, which is almost free and very fast. The disadvantage is the higher cost of writing these entities: this is not a solution if you have a very large number of tasks to complete.
3. Instead of counting tasks, count the results of these tasks. For example, if a task updates a user status, you can count the number of users with "pending" status using a free and fast keys-only query. This is a very good approach if you already have an indexed property that you can use as a flag to count the completed tasks.
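As a rough illustration of the sharding option, the sketch below shows the idea behind a sharded counter in plain Java. It is an in-memory stand-in, not Datastore code: in a real application each slot of the array would be a separate shard entity, updated inside a transaction. The class name ShardedCounter and the shard count are assumptions made for this example.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLongArray;

/** In-memory illustration of a sharded counter (hypothetical; not the Datastore API). */
final class ShardedCounter {
    // Each slot stands in for one shard entity in the datastore.
    private final AtomicLongArray shards;

    ShardedCounter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shards = new AtomicLongArray(shardCount);
    }

    /** Adds delta to one randomly chosen shard, so writers rarely hit the same slot. */
    void add(long delta) {
        int index = ThreadLocalRandom.current().nextInt(shards.length());
        shards.addAndGet(index, delta);
    }

    /** Reads the counter by summing every shard. */
    long total() {
        long sum = 0;
        for (int i = 0; i < shards.length(); i++) {
            sum += shards.get(i);
        }
        return sum;
    }
}
```

With this shape, each task would call something like counter.add(request.getUserKeys().size()) and readers would call counter.total(). Writes are spread over several slots, at the price of a slightly more expensive read.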

Related

Java thread slowing Postgres DB update

I have a piece of legacy code which is basically like:
// instance variable
List<Future<Object>> futureList;
methodA() {
List list = getListOfMessages();
for (Object o : list) {
methodB(o);
}
}
void methodB(Object o) {
// after multi-threading, this statement takes ~2 mins
someDAO.update(o.value);
// some other tasks
}
This works fine, except that about a million records are retrieved into the list via getListOfMessages(). So I was asked to multithread it, and I changed it to something like this:
methodA() {
List list = getListOfMessages();
// created executorservice here
for(Object o : list) {
Future future = executorService.submit(methodB(o));
futureList.add(future);
}
// call another method to see the status of each task
checkFutureStatus(futureList);
}
void checkFutureStatus(List<Future<Object>> list) {
for(Future<Object> future : list) {
try {
future.get(1000, TimeUnit.MILLISECONDS);
} catch (InterruptedException | ExecutionException e) {
} catch (TimeoutException e) {
}
}
}
So basically, I pass each list item to methodB but have it handled by a separate thread. Once all the tasks have been submitted, I check their status, but every one of them throws a TimeoutException. On debugging, I see that the threads take too long for the DB updates, around 1-2 minutes.
Just to be sure that the threads are not competing with each other, I had getListOfMessages() return just one message. Even that takes 1-2 minutes. If I revert everything and go back to the non-threaded approach, the DB update takes 1 ms. I can't figure out why the multithreaded implementation causes the DB update to take so long.
I'm using Postgres 10 and the db update is via jdbctemplate.
Thank you in advance.
Edit:
Added method to explain how I'm checking the status of each thread.

How to wait for some period of time and after that just return default value?

I have below code which tells me whether my data is PARTIAL or FULL. It works fine most of the time.
public static String getTypeOfData() {
DataType type = new SelectTypes().getType();
if (type == DataType.partial || type == DataType.temp) {
return "partial";
}
return "full";
}
But sometimes the line DataType type = new SelectTypes().getType(); hangs and waits forever. This code is not under my control, as it is developed by another team.
What I want is this: if the line DataType type = new SelectTypes().getType(); takes more than 10 seconds (or any default number of seconds), my method should return a default string, which can be partial.
Is this possible? Any example will help me understand better.
I am using Java 7.

The ExecutorService provides methods which allow you to schedule tasks and invoke them with timeout options. This should do what you are after, however, please pay attention since terminating threads could leave your application in an inconsistent state.
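A minimal sketch of the timeout approach described above, written in Java 7 style. The class name TimeoutCall, the method callWithTimeout and the fallback parameter are assumptions made for this example; only ExecutorService, Future and the related java.util.concurrent types come from the JDK.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: runs a task and falls back to a default value on timeout. */
final class TimeoutCall {
    private TimeoutCall() {
    }

    static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(new ThreadFactory() {
            @Override
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, "timeout-call");
                t.setDaemon(true); // a hung call must not keep the JVM alive
                return t;
            }
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Interrupts the worker; a call that ignores interruption keeps running.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The task itself threw an exception.
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

In getTypeOfData(), the call new SelectTypes().getType() would be wrapped in a Callable and passed in with a 10 second timeout and DataType.partial as the fallback. Creating an executor per call keeps the sketch short; a shared executor would be the usual choice if the method is called often.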
If possible, you should contact the owners of the API and ask for clarification or more information.
EDIT: As per your comment, would caching be a possibility? That is, on start-up, or at some other point, your application goes through the SelectTypes, gets their types and stores them. Assuming that these do not change often, you can save them and update them periodically.
EDIT 2: As per your other comment, I cannot really add much more detail. You would need to add a method call that allows your application to set these up the moment it is launched (this will depend on what framework you are using, if any).
A possible way would be to make the class containing the getTypeOfData() method a singleton. You would then amend the class to pull this information as part of its creation mechanism. Lastly, you would create a Map<String, Type> in which you store all your types. You could use getClass().getName() to populate the key of your map, and what you are doing now for the value part.

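If you are not familiar with ExecutorService, the simplest way to achieve this is the thread wait and notify mechanism:
private static final Object lock = new Object();
private static DataType type = null;
private static boolean done = false;

public static String getTypeOfData() {
    synchronized (lock) {
        done = false; // reset shared state; this sketch assumes one caller at a time
    }
    new Thread(new Runnable() {
        @Override
        public void run() {
            fetchData();
        }
    }).start();
    synchronized (lock) {
        try {
            long deadline = System.currentTimeMillis() + 10000; // wait at most 10 seconds
            long remaining = 10000;
            while (!done && remaining > 0) { // the loop guards against spurious wakeups
                lock.wait(remaining);
                remaining = deadline - System.currentTimeMillis();
            }
        } catch (InterruptedException e1) {
            Thread.currentThread().interrupt();
        }
        // "partial" is also the default when the call timed out
        if (!done || type == DataType.partial || type == DataType.temp) {
            return "partial";
        }
        return "full";
    }
}

private static void fetchData() {
    // Call the slow API outside the lock, so a hang cannot block the waiting thread.
    DataType fetched = new SelectTypes().getType();
    synchronized (lock) {
        type = fetched;
        done = true;
        lock.notifyAll();
    }
}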
You might have to make some small changes to make this work and look better; for example, instead of creating a new thread directly you could use a Job, plus other changes based on your requirements. But the main idea remains the same: the calling thread waits at most 10 seconds for the response.

Can Java Objects reinstantiate on their own?

I have a process in my application that uploads a document to my server via a Servlet and waits for completion. The server then processes the file using 2 threads and keeps a Status while it is running.
This is how the Status class looks:
class Status implements Serializable {
private Integer read;
private Integer validated = 0;
private Integer processed = 0;
private Integer failed = 0;
public Status (int read) {
this.read = read;
}
/*
* Getter methods go here.
* No Setter methods.
*/
public void incrementValidated() {
synchronized(validated) { validated++; }
}
public void incrementProcessed() {
synchronized(processed) { processed++; }
}
public void incrementFailed() {
synchronized(failed) { failed++; }
}
}
Now, the server processes the file in this way:
A thread validates the read rows according to DB values, putting in a queue those that are OK.
A thread waits until it has a batch of items in the queue, and then it persists the batch of X items.
The Status is updated when the items are OK (incrementValidated), when the items are persisted (incrementProcessed), and when an item is invalid (incrementFailed).
The Status is stored in a ConcurrentHashMap<String, Status>, where the key is the user's sessionID (because this process can handle multiple requests).
While the process is running, the client polls the server, also via a Servlet, and all that does is return statusMap.get(sessionId); until the process is complete.
My problem occurs with files that run for a long time, for example 5 minutes. While the process is running and the client is polling the server for the status, sometimes all the values are set back to 0, and the only value that stays the same is the read property.
I'm not sure how that is possible, since the object has no setters. All I can imagine is that the object is being re-instantiated with the same constructor argument, and therefore keeps the same read value.
Is that even possible? Or am I missing something?
(it looks like the address changes when this happens)

Your synchronization is broken. When you do validated++; you create a new object (remember that Integer is immutable), so the field now refers to a different Integer and threads can end up locking different objects. In effect, there is no synchronization at all.
To fix this, make the fields of primitive type int (as suggested in a comment) and make the three methods synchronized.
int validated;
...
public synchronized void incrementValidated() {
validated++;
}

Your synchronization is invalid. You create a new object each time you do ++, so synchronization occurs on different objects.
Use primitive types with dedicated Object locks, or AtomicIntegers.
Besides, are you certain you don't need to synchronize all the integer variables with the same lock? In that case you can synchronize on the Status itself by marking the methods as synchronized.
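The AtomicInteger suggestion could look roughly like the sketch below. The class name AtomicStatus is an assumption chosen so that it does not clash with the Status class in the question; the field and method names follow the original.

```java
import java.io.Serializable;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of the Status class using atomic counters instead of synchronized Integer fields. */
final class AtomicStatus implements Serializable {
    private static final long serialVersionUID = 1L;

    private final int read;
    // The AtomicInteger references never change, so every thread updates the same objects.
    private final AtomicInteger validated = new AtomicInteger();
    private final AtomicInteger processed = new AtomicInteger();
    private final AtomicInteger failed = new AtomicInteger();

    AtomicStatus(int read) {
        this.read = read;
    }

    int getRead() {
        return read;
    }

    int getValidated() {
        return validated.get();
    }

    int getProcessed() {
        return processed.get();
    }

    int getFailed() {
        return failed.get();
    }

    void incrementValidated() {
        validated.incrementAndGet();
    }

    void incrementProcessed() {
        processed.incrementAndGet();
    }

    void incrementFailed() {
        failed.incrementAndGet();
    }
}
```

Each counter is updated atomically on its own. If the three values must always be read as one consistent snapshot, a single lock on the status object is the better fit.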

Performance Issues with Multithreaded code when using CallableTask/Futures and ObjectMapper

I am working on a REST service based project with two components:
A client, which builds the necessary URLs for the service component.
A service (REST service) component, which uses those URLs to get the data from the database.
In general, a URL looks like this:
http://host.qa.ebay.com:8080/deservice/DEService/get/USERID=9012/PROFILE.ACCOUNT,PROFILE.ADVERTISING,PROFILE.DEMOGRAPHIC,PROFILE.FINANCIAL
The URL above means: for USERID 9012, give me the data from the database for these columns:
[PROFILE.ACCOUNT, PROFILE.ADVERTISING, PROFILE.DEMOGRAPHIC, PROFILE.FINANCIAL]
I am currently benchmarking the client component, and I found that the method below takes a long time: around 15 ms at the 95th percentile.
The method accepts two parameters:
List<DEKey> keys: sample data in keys is USERID=9012
List<String> reqAttrNames: sample data for reqAttrNames is
[PROFILE.ACCOUNT, PROFILE.ADVERTISING, PROFILE.DEMOGRAPHIC, PROFILE.FINANCIAL]
Below is the code:
public DEResponse getDEAttributes(List<DEKey> keys, List<String> reqAttrNames) {
DEResponse response = null;
try {
String url = buildGetUrl(keys,reqAttrNames);
if(url!=null){
List<CallableTask<DEResponse>> tasks = new ArrayList<CallableTask<DEResponse>>();
CallableTask<DEResponse> task = new DEResponseTask(url);
tasks.add(task);
// STEP 2: Execute worker threads for all the generated urls
List<LoggingFuture<DEResponse>> futures = null;
try {
long waitTimeout = getWaitTimeout(keys);
futures = executor.executeAll(tasks, null, waitTimeout, TimeUnit.MILLISECONDS);
// STEP 3: Consolidate results of the executed worker threads
if(futures!=null && futures.size()>0){
LoggingFuture<DEResponse> future = futures.get(0);
response = future.get();
}
} catch (InterruptedException e1) {
logger.log(LogLevel.ERROR,"Transport:getDEAttributes Request timed-out :",e1);
}
}else{
//
}
} catch(Throwable th) {
}
return response;
}
And the above method will give me back the DEResponse object.
Below is the DEResponseTask class
public class DEResponseTask extends BaseNamedTask implements CallableTask<DEResponse> {
private final ObjectMapper m_mapper = new ObjectMapper();
@Override
public DEResponse call() throws Exception {
URL url = null;
DEResponse DEResponse = null;
try {
if(buildUrl!=null){
url = new URL(buildUrl);
DEResponse = m_mapper.readValue(url, DEResponse.class);
}else{
logger.log(LogLevel.ERROR, "DEResponseTask:call is null ");
}
} catch (MalformedURLException e) {
}catch (Throwable th) {
}finally{
}
return DEResponse;
}
}
Is there any problem with the way this multithreaded code is written? If yes, how can I make it efficient?
Below is the signature of the executeAll method of the executor. My company has its own executor, which implements the Sun Executor interface:
/**
* Executes the given tasks, returning a list of futures holding their
* status and results when all complete or the timeout expires, whichever
* happens first. <tt>Future.isDone()</tt> is <tt>true</tt> for each
* element of the returned list. Upon return, tasks that have not completed
* are cancelled. Note that a <i>completed</i> task could have terminated
* either normally or by throwing an exception. The results of this method
* are undefined if the given collection is modified while this operation is
* in progress. This is entirely analogous to
* <tt>ExecutorService.invokeAll()</tt> except for a couple of important
* differences. First, it cancels but does not <b>interrupt</b> any
* unfinished tasks, unlike <tt>ExecutorService.invokeAll()</tt> which
* cancels and interrupts unfinished tasks. This results in a better
* adherence to the specified timeout value, as interrupting threads may
* have unexpected delays depending on the nature of the tasks. Also, all
* eBay-specific features apply when the tasks are submitted with this
* method.
*
* @param tasks the collection of tasks
* @param timeout the maximum time to wait
* @param unit the time unit of the timeout argument
* @return a list of futures representing the tasks, in the same sequential
* order as produced by the iterator for the given task list. If the
* operation did not time out, each task will have completed. If it did
* time out, some of these tasks will not have completed.
* @throws InterruptedException if interrupted while waiting, in which case
* unfinished tasks are cancelled
*/
public <V> List<LoggingFuture<V>> executeAll(Collection<? extends CallableTask<V>> tasks,
Options options,
long timeout, TimeUnit unit)
throws InterruptedException {
return executeAll(tasks, options, timeout, unit, false);
}
Update:
This component starts taking time as soon as I increase the load of my benchmarking program by raising the number of threads to 20:
newFixedThreadPool(20)
But I believe it works fine if I use:
newSingleThreadExecutor
The only reason I can think of is that there is a blocking call in the code above, so threads get blocked, and that is why it takes so long.
Update 2:
So should this part be written like this?
if(futures!=null && futures.size()>0){
LoggingFuture<DEResponse> future = futures.get(0);
//response = future.get();//replace this with below code-
while(!future.isDone()) {
Thread.sleep(500);
}
response = future.get();
}

I don't see anything that should be causing a performance hit other than that you're using a complicated non-standard Executor. I realize that you don't have any choice in the matter of which Executor you can use, but out of curiosity I'd try replacing it with a ThreadPoolExecutor to see if this makes any difference, and bring it up with the powers that be at your work if you notice a major improvement - at my job we discovered that the encryption library written by another division was absolute crap (80-90% of our CPU time was spent in their code) and successfully lobbied them into rewriting it.
edit:
public class Aggregator implements Runnable {
private static ConcurrentLinkedQueue<Future<DEResponse>> queue = new ConcurrentLinkedQueue<>();
private static ArrayList<DEResponse> aggregation = new ArrayList<>();
public static void offer(Future<DEResponse> future) {
queue.offer(future);
}
public static ArrayList<DEResponse> getAggregation() {
return aggregation;
}
public void run() {
// Make sure that all of the futures are added before this loop starts. Better still, if you know
// how many worker threads there are, keep a count of the results and quit this loop when
// aggregation.size() == [expected number of futures].
while(!queue.isEmpty()) {
try {
aggregation.add(queue.poll().get());
} catch (InterruptedException | ExecutionException e) {
// get() throws checked exceptions; handle or log the failed task here
}
}
}
}
public void getDEAttributes(List<DEKey> keys, List<String> reqAttrNames) {
try {
if(url!=null){
try {
futures = executor.executeAll(tasks, null, waitTimeout, TimeUnit.MILLISECONDS);
if(futures!=null && futures.size()>0){
Aggregator.offer(futures.get(0));
}
}
}
}
}

If I read your code correctly, there is one clear performance issue. This:
public class DEResponseTask extends BaseNamedTask implements CallableTask<DEResponse> {
private final ObjectMapper m_mapper = new ObjectMapper();
is executed once per task, and creating ObjectMapper instances is very expensive.
There are many ways to fix this, but you probably want to either:
Make the m_mapper reference static (created just once); mappers are safe to share once configured, OR
Pass in a shared ObjectMapper (sharing is safe)
Doing this should make a big difference for JSON handling efficiency.

Loop through the ResultSet in an efficient way and add column values to List<String>

I am working on a multithreaded project in which each thread randomly picks columns of a table, uses those columns in a SELECT SQL query, and then executes that query. After executing the query, I loop through the result set and add the data for each column to a List<String>.
Here columnsList contains the column names delimited by commas. For example:
col1, col2, col3
Below is my code.
class ReadTask implements Runnable {
public ReadTask() {
}
@Override
public void run() {
...
while ( < 60 minutes) {
.....
final int id = generateRandomId(random);
final String columnsList = getColumns(table.getColumns());
final String selectSql = "SELECT " + columnsList + " from " + table.getTableName() + " where id = ?";
resultSet = preparedStatement.executeQuery();
List<String> colData = new ArrayList<String>(columnsList.split(",").length);
boolean foundData = false;
if (id >= startValidRange && id <= endValidRange) {
if (resultSet.next()) {
foundData = true;
for (String column : columnsList.split(",")) {
colData.add(resultSet.getString(column.trim()));
}
resultSet.next();//do I need this here?
}
} else if (resultSet.next()) {
addException("Data Present for Non Valid ID's", Read.flagTerminate);
}
....
}
}
private static void addException(String cause, boolean flagTerminate) {
AtomicInteger count = exceptionMap.get(cause);
if (count == null) {
count = new AtomicInteger();
AtomicInteger curCount = exceptionMap.putIfAbsent(cause, count);
if (curCount != null) {
count = curCount;
}
}
count.incrementAndGet();
if(flagTerminate) {
System.exit(1);
}
}
}
Problem Statement:
After executing the SELECT SQL query, I have two scenarios:
If the id is within the valid range, check whether resultSet has any data. If it has data, loop over the columns from columnsList and add each value to the colData list of String.
If the id is not in the valid range, I need to check that I am not getting any data back from the resultSet. If I do get data back and the flag to stop the program is true, exit the program. If I get data back but the flag is false, count how many times this happens. For this I have created the addException method.
Can anyone tell me whether the way I handle these two scenarios is right? It looks like I could improve the if/else code for them.

There are probably a few things you could do to make the code a bit faster:
Regarding the query part, if the table never changes, you could move the columnsList initialization outside the while loop, and perhaps even make it static if all the threads use the same query. Likewise, you recompute the split and trimmed column list from this variable for each query result. This could be done once, outside the loop.
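A small sketch of splitting and trimming the column list once, so that the result can be reused both for sizing the list and for reading the values. The helper name ColumnNames.parse is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: parses a comma-delimited column list a single time. */
final class ColumnNames {
    private ColumnNames() {
    }

    /** Returns the trimmed, non-empty column names in their original order. */
    static List<String> parse(String columnsList) {
        List<String> names = new ArrayList<String>();
        for (String column : columnsList.split(",")) {
            String trimmed = column.trim();
            if (!trimmed.isEmpty()) {
                names.add(trimmed);
            }
        }
        return Collections.unmodifiableList(names);
    }
}
```

The loop body would then create colData with new ArrayList<String>(columns.size()) and iterate over columns directly, instead of calling columnsList.split(",") twice per query.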
Regarding the test itself, you could indeed reverse the nesting. You're currently doing something like:
if (B) {
if (A) ok;
}
else if (A) error;
when it could be more simply written:
if (A){
if (B) ok;
else error;
}
Your code could be better stated as follows:
if (resultSet.next()) {
if (id >= startValidRange && id <= endValidRange) {
foundData = true;
for (String column : columnsList.split(",")) {
colData.add(resultSet.getString(column.trim()));
}
}
else
addException("Data Present for Non Valid ID's", Read.flagTerminate);
}
Regarding the exception logging part, you should avoid having a static method update the map directly. It is a fairly strong source of contention and will keep your threads from focusing on their real work, which is launching and processing queries. Accessing a map is not free (O(log n) for a tree-based map, O(1) on average for a hash-based one), and looking at your code, you access it twice and do all sorts of checks to make sure that the accounting is correct. By comparison, pushing a value onto a queue is a constant time operation, and synchronisation is handled by the queue itself.
So, my advice here is to delegate the handling of the map to a dedicated thread, and add a synchronized queue through which your query threads hand it their exceptions. That way you won't need to deal with concurrent access to your map (this can be messy). From the query threads' standpoint, logging becomes a simple "fire and forget" action, and the logging thread just has to pull new messages from the queue and add them to the map.
If you don't already know how to build such a setup, there's the Oracle tutorial. There are also several SO questions and answers on the topic (producer-consumer).
Update: if you want to reduce contention even further, you can create one queue per query thread and make the map thread check all the queues in turn. The risk of concurrent access is then reduced to two threads at a time: one query thread and the map thread. It means a bit more work in the map thread, but it avoids a lot of thread rescheduling (which happens each time a thread is blocked by a lock). The less rescheduling happens, the less time is spent on thread management, and the more time is available for real work.
Note that, at any rate, you should take care that not too many items pile up in the queue(s). If that scenario is likely to happen (which I doubt, but I don't know the details of your data to be certain), you might want to use BlockingQueues (look up the class description and SO questions on the topic for further details).
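The single-queue variant of this setup could be sketched as follows. The class name ExceptionTally, its methods and the stop sentinel are assumptions made for this example; only BlockingQueue and LinkedBlockingQueue come from the JDK.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch: query threads enqueue causes, a single thread owns the counts map. */
final class ExceptionTally implements Runnable {
    // Unique sentinel ("poison pill"), recognised by identity rather than by content.
    private static final String STOP = new String("<stop>");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
    // Only the tally thread touches this map, so a plain HashMap is enough.
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    /** Called by query threads: fire and forget. */
    void report(String cause) {
        queue.add(cause);
    }

    /** Asks the tally thread to finish after it has processed the earlier messages. */
    void stop() {
        queue.add(STOP);
    }

    @Override
    public void run() {
        try {
            while (true) {
                String cause = queue.take();
                if (cause == STOP) {
                    return;
                }
                Integer old = counts.get(cause);
                counts.put(cause, old == null ? 1 : old + 1);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Read this only after the tally thread has finished, for example after Thread.join(). */
    Map<String, Integer> counts() {
        return counts;
    }
}
```

A typical use would start the tally with new Thread(tally).start(), call tally.report(cause) from the query threads, and finish with tally.stop() followed by a join on the tally thread. The flagTerminate check from addException would stay in the query thread, since it decides whether to exit the program.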
