Apache Flink: Weird FlatMap behaviour - java

I'm ingesting a stream of data into Flink. For each 'instance' of this data, I have a timestamp. I can detect whether the machine I'm getting the data from is 'producing' or 'not producing'; this is done via a custom flat map function located in its own static class.
I want to calculate how long the machine has been producing / not producing.
My current approach is to collect the production and non-production timestamps in two plain lists. For each 'instance' of the data, I calculate the current production/non-production duration by subtracting the earliest timestamp from the latest one. This gives me incorrect results, though. When the production state changes from producing to non-producing, I clear the timestamp list for producing, and vice versa, so that if production starts again, the duration starts from zero.
I've looked into the two lists I collect the respective timestamps in and I see things I don't understand. My assumption is that, as long as the machine 'produces', the first timestamp in the production timestamp list stays the same, while new timestamps are added to the list per new instance of data.
Apparently, this assumption is wrong: I get seemingly random timestamps in the lists. They are still correctly ordered, though.
Here's my code for the flatmap function:
public static class ImaginePaperDataConverterRich extends RichFlatMapFunction<ImaginePaperData, String> {
private static final long serialVersionUID = 4736981447434827392L;
private transient ValueState<ProductionState> stateOfProduction;
SimpleDateFormat dateFormat = new SimpleDateFormat("dd.MM.yyyy HH:mm:ss.SS");
DateFormat timeDiffFormat = new SimpleDateFormat("dd HH:mm:ss.SS");
String timeDiffString = "00 00:00:00.000";
List<String> productionTimestamps = new ArrayList<>();
List<String> nonProductionTimestamps = new ArrayList<>();
public String calcProductionTime(List<String> timestamps) {
if (!timestamps.isEmpty()) {
try {
Date firstDate = dateFormat.parse(timestamps.get(0));
Date lastDate = dateFormat.parse(timestamps.get(timestamps.size()-1));
long timeDiff = lastDate.getTime() - firstDate.getTime();
if (timeDiff < 0) {
System.out.println("Something weird happened. Maybe EOF.");
return timeDiffString;
}
timeDiffString = String.format("%02d %02d:%02d:%02d.%03d",
TimeUnit.MILLISECONDS.toDays(timeDiff),
TimeUnit.MILLISECONDS.toHours(timeDiff) % TimeUnit.DAYS.toHours(1),
TimeUnit.MILLISECONDS.toMinutes(timeDiff) % TimeUnit.HOURS.toMinutes(1),
TimeUnit.MILLISECONDS.toSeconds(timeDiff) % TimeUnit.MINUTES.toSeconds(1),
TimeUnit.MILLISECONDS.toMillis(timeDiff) % TimeUnit.SECONDS.toMillis(1));
} catch (ParseException e) {
e.printStackTrace();
}
System.out.println("State duration: " + timeDiffString);
}
return timeDiffString;
}
@Override
public void open(Configuration config) {
ValueStateDescriptor<ProductionState> descriptor = new ValueStateDescriptor<>(
"stateOfProduction",
TypeInformation.of(new TypeHint<ProductionState>() {}),
ProductionState.NOT_PRODUCING);
stateOfProduction = getRuntimeContext().getState(descriptor);
}
@Override
public void flatMap(ImaginePaperData ImaginePaperData, Collector<String> output) throws Exception {
List<String> warnings = new ArrayList<>();
JSONObject jObject = new JSONObject();
String productionTime = "0";
String nonProductionTime = "0";
// Data analysis
if (stateOfProduction == null || stateOfProduction.value() == ProductionState.NOT_PRODUCING && ImaginePaperData.actSpeedCl > 60.0) {
stateOfProduction.update(ProductionState.PRODUCING);
} else if (stateOfProduction.value() == ProductionState.PRODUCING && ImaginePaperData.actSpeedCl < 60.0) {
stateOfProduction.update(ProductionState.NOT_PRODUCING);
}
if(stateOfProduction.value() == ProductionState.PRODUCING) {
if (!nonProductionTimestamps.isEmpty()) {
System.out.println("Production has started again, non production timestamps cleared");
nonProductionTimestamps.clear();
}
productionTimestamps.add(ImaginePaperData.timestamp);
System.out.println(productionTimestamps);
productionTime = calcProductionTime(productionTimestamps);
} else {
if(!productionTimestamps.isEmpty()) {
System.out.println("Production has stopped, production timestamps cleared");
productionTimestamps.clear();
}
nonProductionTimestamps.add(ImaginePaperData.timestamp);
warnings.add("Production has stopped.");
System.out.println(nonProductionTimestamps);
//System.out.println("Production stopped");
nonProductionTime = calcProductionTime(nonProductionTimestamps);
}
// The rest is just JSON stuff
Do I maybe have to hold these two timestamp lists in a ListState?
EDIT: Because another user asked, here is the data I'm getting.
{'szenario': 'machine01', 'timestamp': '31.10.2018 09:18:39.432069', 'data': {1: 100.0, 2: 100.0, 101: 94.0, 102: 120.0, 103: 65.0}}
The behaviour I expect is that my Flink program collects the timestamps in the two lists productionTimestamps and nonProductionTimestamps. Then I want my calcProductionTime method to subtract the first timestamp in the list from the last one, to get the duration between when I first detected the machine "producing" / "not producing" and when it stopped doing so.

I found out that the reason for the 'seemingly random' timestamps is Apache Flink's parallel execution. When the parallelism is set to a value greater than 1, the order of events is no longer guaranteed.
My quick fix was to set the parallelism of my program to 1; as far as I know, this guarantees the order of events.
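For reference, here is a minimal sketch of both options: the parallelism-1 quick fix, and the less drastic alternative of keying the stream by machine, so per-key order is preserved and the state can live in keyed ValueState/ListState. The source class and the machine-id field (szenario, taken from the sample data above) are assumptions, not part of the original code:
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class ProductionJob {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Quick fix: a single parallel instance keeps the global event order
env.setParallelism(1);
// Hypothetical source; replace with the actual ingestion source
DataStream<ImaginePaperData> input = env.addSource(new ImaginePaperSource());
// Alternative to parallelism 1: key by machine so all events of one machine
// hit the same subtask; keyed state then works per machine
input.keyBy(data -> data.szenario)
.flatMap(new ImaginePaperDataConverterRich());
env.execute("production state job");
}
}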

Related

Improve performance of loading 100,000 records from database

We created a program to make using the database easier in other programs, so the code I'm showing gets used in multiple other programs.
One of those programs gets about 10,000 records from one of our clients and has to check whether these are already in our database. If not, we insert them (they can also change and then have to be updated).
To make this easy, we load all the entries from our whole table (at the moment 120,000), create an object for every record we get, and put all of them into a HashMap.
Loading the whole table this way takes around 5 minutes, and we sometimes have to restart the program because we run into a GC overhead error, since we work on limited hardware. Do you have an idea how we can improve the performance?
Here is the code to load all entries (we have a global limit of 10,000 entries per query, so we use a loop):
public Map<String, IMasterDataSet> getAllInformationObjects(ISession session) throws MasterDataException {
IQueryExpression qe;
IQueryParameter qp;
// our main SDP class
Constructor<?> constructorForSDPbaseClass = getStandardConstructor();
SimpleDateFormat itaTimestampFormat = new SimpleDateFormat("yyyyMMddHHmmssSSS");
// search in standard time range (modification date!)
Calendar cal = Calendar.getInstance();
cal.set(2010, Calendar.JANUARY, 1);
Date startDate = cal.getTime();
Date endDate = new Date();
Long startDateL = Long.parseLong(itaTimestampFormat.format(startDate));
Long endDateL = Long.parseLong(itaTimestampFormat.format(endDate));
IDescriptor modDesc = IBVRIDescriptor.ModificationDate.getDescriptor(session);
// count once before to determine initial capacities for hash map/set
IBVRIArchiveClass SDP_ARCHIVECLASS = getMasterDataPropertyBag().getSDP_ARCHIVECLASS();
qe = SDP_ARCHIVECLASS.getQueryExpression(session);
qp = session.getDocumentServer().getClassFactory()
.getQueryParameterInstance(session, new String[] {SDP_ARCHIVECLASS.getDatabaseName(session)}, null, null);
qp.setExpression(qe);
qp.setHitLimitThreshold(0);
qp.setHitLimit(0);
int nrOfHitsTotal = session.getDocumentServer().queryCount(session, qp, "*");
int initialCapacity = (int) (nrOfHitsTotal / 0.75 + 1);
// MD sets; and objects already done (here: document ID)
HashSet<String> objDone = new HashSet<>(initialCapacity);
HashMap<String, IMasterDataSet> objRes = new HashMap<>(initialCapacity);
qp.close();
// do queries until hit count is smaller than 10.000
// use modification date
boolean keepGoing = true;
while(keepGoing) {
// construct query expression
// - basic part: Modification date & class type
// a. doc. class type
qe = SDP_ARCHIVECLASS.getQueryExpression(session);
// b. ID
qe = SearchUtil.appendQueryExpressionWithANDoperator(session, qe,
new PlainExpression(modDesc.getQueryLiteral() + " BETWEEN " + startDateL + " AND " + endDateL));
// 2. Query Parameter: set database; set expression
qp = session.getDocumentServer().getClassFactory()
.getQueryParameterInstance(session, new String[] {SDP_ARCHIVECLASS.getDatabaseName(session)}, null, null);
qp.setExpression(qe);
// order by modification date; hitlimit = 0 -> no hitlimit, but the usual 10.000 max
qp.setOrderByExpression(session.getDocumentServer().getClassFactory().getOrderByExpressionInstance(modDesc, true));
qp.setHitLimitThreshold(0);
qp.setHitLimit(0);
// Do not sort by modification date;
qp.setHints("+NoDefaultOrderBy");
keepGoing = false;
IInformationObject[] hits = null;
IDocumentHitList hitList = null;
hitList = session.getDocumentServer().query(qp, session);
IDocument doc;
if (hitList.getTotalHitCount() > 0) {
hits = hitList.getInformationObjects();
for (IInformationObject hit : hits) {
String objID = hit.getID();
if(!objDone.contains(objID)) {
// do something with this object and the class
// here: construct a new SDP sub class object and give it back via interface
doc = (IDocument) hit;
IMasterDataSet mdSet;
try {
mdSet = (IMasterDataSet) constructorForSDPbaseClass.newInstance(session, doc);
} catch (Exception e) {
// cause for this
String cause = (e.getCause() != null) ? e.getCause().toString() : MasterDataException.ERRMSG_PART_UNKNOWN;
throw new MasterDataException(MasterDataException.ERRMSG_NOINSTANCE_POSSIBLE, this.getClass().getSimpleName(), e.toString(), cause);
}
objRes.put(mdSet.getID(), mdSet);
objDone.add(objID);
}
}
doc = (IDocument) hits[hits.length - 1];
Date lastModDate = ((IDateValue) doc.getDescriptor(modDesc).getValues()[0]).getValue();
startDateL = Long.parseLong(itaTimestampFormat.format(lastModDate));
keepGoing = (hits.length >= 10000 || hitList.isResultSetTruncated());
}
qp.close();
}
return objRes;
}
Loading 120,000 rows (and more) each time will not scale very well, and your solution may stop working as the table grows. Instead, let the database server handle the problem.
Your table needs a primary key or unique key based on the columns of the records. Iterate through the 10,000 records, performing a JDBC SQL update to modify all field values, with a where clause that exactly matches the primary/unique key.
update BLAH set COL1 = ?, COL2 = ? where PKCOL = ?; // ... AND PKCOL2 = ? ...
This modifies an existing row or does nothing at all, and JDBC executeUpdate() will return 0 or 1, indicating the number of rows changed. If the number of rows changed was zero, you have detected a new record which does not exist yet, so perform an insert for that new record only.
insert into BLAH (COL1, COL2, ... PKCOL) values (?, ?, ..., ?);
You can decide whether to run 10,000 updates followed by however many inserts are needed, or to do update + optional insert per record; and remember that JDBC batch statements and turning auto-commit off may help speed things up.
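A minimal sketch of the update-then-insert loop with batching, assuming a java.sql.Connection named conn and a hypothetical Record DTO carrying the client's values; the table and column names are the placeholders from above:
try (PreparedStatement update = conn.prepareStatement(
"update BLAH set COL1 = ?, COL2 = ? where PKCOL = ?");
PreparedStatement insert = conn.prepareStatement(
"insert into BLAH (COL1, COL2, PKCOL) values (?, ?, ?)")) {
conn.setAutoCommit(false); // commit once at the end, not per statement
for (Record r : clientRecords) { // Record is a stand-in for your DTO
update.setString(1, r.getCol1());
update.setString(2, r.getCol2());
update.setString(3, r.getPk());
if (update.executeUpdate() == 0) { // 0 rows changed -> record is new
insert.setString(1, r.getCol1());
insert.setString(2, r.getCol2());
insert.setString(3, r.getPk());
insert.addBatch();
}
}
insert.executeBatch(); // flush all pending inserts in one round trip
conn.commit();
}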

Improving code design to enhance performance

I am trying to do something like the below. I don't like the design, as I am using four for-loops to achieve it. Can I improve the design?
Create a map with dates as keys.
Sort the list values inside the map on dates (the dates have hours and minutes here).
Give an incremental id to each DTO.
int serialNumber = 1;
if (hList != null && !hList.isEmpty()) {
// create a Map with dates as keys
HashMap<String, ArrayList<BookDTO>> mapObj = new HashMap<>();
for (int count = 0; count < hList.size(); count++) {
BookDTO bookDTO = (BookDTO) hList.get(count);
ArrayList<BookDTO> list = new ArrayList<>();
list.add(bookDTO);
Calendar depDate = bookDTO.getDepartureDate();
SimpleDateFormat format = new SimpleDateFormat("dd-MM-yyyy");
if (depDate != null) {
String formattedDate = format.format(depDate.getTime());
if (mapObj.containsKey(formattedDate)) {
mapObj.get(formattedDate).add(bookDTO);
} else {
mapObj.put(formattedDate, list);
}
}
}
// Sort the values inside the map based on dates
for (Entry<String, ArrayList<BookDTO>> entry : mapObj.entrySet()) {
Collections.sort(entry.getValue(), new BookDTOComparator(DATES));
}
for (Entry<String, ArrayList<BookDTO>> entry : mapObj.entrySet()) {
serialNumber = setItinerarySerialNumber(entry.getValue(), serialNumber);
}
I believe you can merge the last two loops, so we would end up with only two loops (right now I can see three).
You can also try Arrays.parallelSort on an array copy of entry.getValue() if the lists are very large and it is applicable.
Also, if applicable, see the code below:
int serialNumber = 1;
SimpleDateFormat format = new SimpleDateFormat("dd-MM-yyyy");
ArrayList<BookDTO> hListCopy = new ArrayList<>(hList);
Collections.sort(hListCopy, new NewBookDTOComparator()); // 2. sorting
HashMap<String, ArrayList<BookDTO>> mapObj = new HashMap<>();
for (BookDTO bookDTO : hListCopy) {
serialNumber = setItinerarySerialNumber(bookDTO, serialNumber); // 3. serialNumber
Calendar depDate = bookDTO.getDepartureDate();
if (depDate != null) {
String formattedDate = format.format(depDate.getTime());
if (mapObj.containsKey(formattedDate)) {
mapObj.get(formattedDate).add(bookDTO);
} else {
ArrayList<BookDTO> list = new ArrayList<>();
list.add(bookDTO);
mapObj.put(formattedDate, list);
}
}
}
So, only one loop (and one sort).
The list copy-constructor uses System.arraycopy internally; you can look up its performance characteristics.
You can sort hList itself instead of creating a new hListCopy, if that is applicable.
Beware of NewBookDTOComparator: you should sort not only by minutes and hours, but also by the full departure date.
I think SimpleDateFormat format should be a static field or a class field.
You can also try Arrays.parallelSort on an array copy of hListCopy if the lists are very large and it is applicable.
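As a further simplification (a sketch under the same assumptions about BookDTO and the hypothetical comparator), Map.computeIfAbsent collapses the contains/put branching in the grouping step into one line:
for (BookDTO bookDTO : hListCopy) {
Calendar depDate = bookDTO.getDepartureDate();
if (depDate != null) {
String formattedDate = format.format(depDate.getTime());
// creates the list on first access, then appends
mapObj.computeIfAbsent(formattedDate, k -> new ArrayList<>()).add(bookDTO);
}
}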

Android: get UsageStats per hour

I use the UsageStats feature of Android, but the smallest interval it offers is INTERVAL_DAILY.
long time = System.currentTimeMillis();
List<UsageStats> appList = manager.queryUsageStats(UsageStatsManager.INTERVAL_DAILY, time - DAY_IN_MILLI_SECONDS, time);
How can I get UsageStats in an hourly interval?
All credit goes to this answer. I have learned from that one.
How can we collect app usage data for a customized time range (e.g. per hour)?
We have to call the queryEvents(long begin_time, long end_time) method, as it will provide us all the events from begin_time to end_time. It gives us each app's data through foreground and background events, instead of the total time spent as queryUsageStats() does. So, using the foreground and background event timestamps, we can count the number of times an app has been launched and also find the usage duration of each app.
Implementation to Collect Last 1 Hour App Usage Data
First, add the following line to the AndroidManifest.xml file, and also request the user to grant the usage-access permission.
<uses-permission android:name="android.permission.PACKAGE_USAGE_STATS" />
Add the following lines inside any method
long hour_in_mil = 1000*60*60; // In Milliseconds
long end_time = System.currentTimeMillis();
long start_time = end_time - hour_in_mil;
Then, call the method getUsageStatistics()
getUsageStatistics(start_time, end_time);
The getUsageStatistics method:
@RequiresApi(api = Build.VERSION_CODES.LOLLIPOP)
void getUsageStatistics(long start_time, long end_time) {
UsageEvents.Event currentEvent;
// List<UsageEvents.Event> allEvents = new ArrayList<>();
HashMap<String, AppUsageInfo> map = new HashMap<>();
HashMap<String, List<UsageEvents.Event>> sameEvents = new HashMap<>();
UsageStatsManager mUsageStatsManager = (UsageStatsManager)
context.getSystemService(Context.USAGE_STATS_SERVICE);
if (mUsageStatsManager != null) {
// Get all apps data from starting time to end time
UsageEvents usageEvents = mUsageStatsManager.queryEvents(start_time, end_time);
// Put these data into the map
while (usageEvents.hasNextEvent()) {
currentEvent = new UsageEvents.Event();
usageEvents.getNextEvent(currentEvent);
if (currentEvent.getEventType() == UsageEvents.Event.ACTIVITY_RESUMED ||
currentEvent.getEventType() == UsageEvents.Event.ACTIVITY_PAUSED) {
// allEvents.add(currentEvent);
String key = currentEvent.getPackageName();
if (map.get(key) == null) {
map.put(key, new AppUsageInfo(key));
sameEvents.put(key,new ArrayList<UsageEvents.Event>());
}
sameEvents.get(key).add(currentEvent);
}
}
// Traverse through each app data which is grouped together and count launch, calculate duration
for (Map.Entry<String,List<UsageEvents.Event>> entry : sameEvents.entrySet()) {
int totalEvents = entry.getValue().size();
if (totalEvents > 1) {
for (int i = 0; i < totalEvents - 1; i++) {
UsageEvents.Event E0 = entry.getValue().get(i);
UsageEvents.Event E1 = entry.getValue().get(i + 1);
if (E1.getEventType() == 1 || E0.getEventType() == 1) {
map.get(E1.getPackageName()).launchCount++;
}
if (E0.getEventType() == 1 && E1.getEventType() == 2) {
long diff = E1.getTimeStamp() - E0.getTimeStamp();
map.get(E0.getPackageName()).timeInForeground += diff;
}
}
}
// If the first event type is ACTIVITY_PAUSED, add the difference between the event time and start_time, because the application was already running.
if (entry.getValue().get(0).getEventType() == 2) {
long diff = entry.getValue().get(0).getTimeStamp() - start_time;
map.get(entry.getValue().get(0).getPackageName()).timeInForeground += diff;
}
// If the last event type is ACTIVITY_RESUMED, add the difference between end_time and the event time, because the application is still running.
if (entry.getValue().get(totalEvents - 1).getEventType() == 1) {
long diff = end_time - entry.getValue().get(totalEvents - 1).getTimeStamp();
map.get(entry.getValue().get(totalEvents - 1).getPackageName()).timeInForeground += diff;
}
}
smallInfoList = new ArrayList<>(map.values());
// Concatenating data to show in a text view. You may do according to your requirement
for (AppUsageInfo appUsageInfo : smallInfoList)
{
// Do according to your requirement
strMsg = strMsg.concat(appUsageInfo.packageName + " : " + appUsageInfo.launchCount + "\n\n");
}
TextView tvMsg = findViewById(R.id.MA_TvMsg);
tvMsg.setText(strMsg);
} else {
Toast.makeText(context, "Sorry...", Toast.LENGTH_SHORT).show();
}
}
The AppUsageInfo class:
import android.graphics.drawable.Drawable;
class AppUsageInfo {
Drawable appIcon; // you may also collect this, if you wish
String appName, packageName;
long timeInForeground;
int launchCount;
AppUsageInfo(String pName) {
this.packageName=pName;
}
}
How can I customize this code to collect per-hour data?
As you want per-hour data, change the end_time and start_time values for every hour of data. For instance, to collect the past two hours of data, I would do the following:
long end_time = System.currentTimeMillis();
long start_time = end_time - (1000*60*60);
getUsageStatistics(start_time, end_time);
end_time = start_time;
start_time = start_time - hour_in_mil;
getUsageStatistics(start_time, end_time);
However, you may use a Handler to avoid rewriting start_time and end_time by hand each time. Every time data has been collected for one hour, a task completes, the variable values are advanced automatically, and you call the getUsageStatistics method again; a plain-loop variant is sketched below.
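The same idea as a plain loop (no Handler), as a minimal sketch; it collects the past hours one-hour windows using the getUsageStatistics method above:
void collectPastHours(int hours) {
long hourInMillis = 1000L * 60 * 60;
long end = System.currentTimeMillis();
for (int i = 0; i < hours; i++) {
long start = end - hourInMillis;
getUsageStatistics(start, end); // one call per one-hour window
end = start; // slide the window back one hour
}
}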
Note: you may not be able to retrieve data from more than about the past 7.5 days, as events are only kept by the system for a few days.
Calendar cal = (Calendar) Calendar.getInstance().clone();
// I used this and it worked, but only up to 7.5 days ago
if (daysAgo == 0) {
//Today - I only count from 00h00m00s today to present
end = cal.getTimeInMillis();
start = LocalDate.now().toDateTimeAtStartOfDay().toInstant().getMillis();
} else {
long todayStartOfDayTimeStamp = LocalDate.now().toDateTimeAtStartOfDay().toInstant().getMillis();
if (daysAgo == -6) {
//6 days ago, only get events in time -7 days to -7.5 days
cal.setTimeInMillis(System.currentTimeMillis());
cal.add(Calendar.DATE, daysAgo + 1);
end = cal.getTimeInMillis();
start = end - 43200000;
} else {
//get events from 00h00m00s to 23h59m59s
//Current calendar point to 0h0m today
cal.setTimeInMillis(todayStartOfDayTimeStamp);
cal.add(Calendar.DATE, daysAgo + 1);
end = cal.getTimeInMillis();
cal.add(Calendar.DATE, -1);
start = cal.getTimeInMillis();
}
}
I don't think it's possible: even if you ask for data in the middle of an interval, the data appears to be stored in buckets, and the minimum bucket is a day.
In UsageStatsManager documentation, it says:
A request for data in the middle of a time interval will include that interval.
Also, INTERVAL_BEST is not a real interval; it just selects one of the available intervals for the given time range. The UsageStatsManager.java source code says:
/**
* The number of available intervals. Does not include {@link #INTERVAL_BEST}, since it
* is a pseudo interval (it actually selects a real interval).
* {@hide}
*/
public static final int INTERVAL_COUNT = 4;
Yes, Android provides INTERVAL_DAILY as the minimum. But for the best result you can use INTERVAL_BEST: Android then picks the best available interval for the time range given to queryUsageStats(int, long, long).
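As a sketch of that call (same manager and constants as in the question; note that INTERVAL_BEST only picks the finest stored bucket covering the range, it does not create an hourly bucket):
List<UsageStats> appList = manager.queryUsageStats(
UsageStatsManager.INTERVAL_BEST, time - DAY_IN_MILLI_SECONDS, time);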
Happy coding...

How to add elements to ConcurrentHashMap using ExecutorService

I have a requirement to read user information from two different sources (databases) per userId and store the consolidated information in a Map with the userId as key. The number of users can vary based on the period they have opted for, and groups of users may belong to different periods of the year, e.g. daily, weekly, or monthly users.
I used HashMap and LinkedHashMap to get this done. As that slows down the process, I thought of using threading to make it faster.
After reading some tutorials and examples, I am now using ConcurrentHashMap and ExecutorService.
In some cases, based on validation, I want to skip the current iteration and move on to the next user's info. The compiler does not allow the continue keyword within the Callable as it would in a for-loop. Is there a way to achieve the same thing in multithreaded code?
Moreover, the code below, though it works, is not significantly faster than the code without threading, which creates doubt about whether the ExecutorService is implemented correctly.
How do we debug errors in multithreaded code? Execution holds at a breakpoint, but not consistently, and it does not move to the next line with F6.
Can someone point out if I am missing something in the code? Any other example of a similar use case would also be of great help.
public void getMap() throws UserException
{
long startTime = System.currentTimeMillis();
Map<String, Map<Integer, User>> map = new ConcurrentHashMap<String, Map<Integer, User>>();
//final String key = "";
try
{
final Date todayDate = new Date();
List<String> applyPeriod = db.getPeriods(todayDate);
for (String period : applyPeriod)
{
try
{
final String key = period;
List<UserTable1> eligibleUsers = db.findAllUsers(key);
Map<Integer, User> userIdMap = new ConcurrentHashMap<Integer, User>();
ExecutorService executor = Executors.newFixedThreadPool(eligibleUsers.size());
CompletionService<User> cs = new ExecutorCompletionService<User>(executor);
int userCount=0;
for (UserTable1 eligibleUser : eligibleUsers)
{
try
{
cs.submit(
new Callable<User>()
{
public User call()
{
int userId = eligibleUser.getUserId();
List<EmployeeTable2> empData = db.findByUserId(userId);
EmployeeTable2 emp = null;
if (null != empData && !empData.isEmpty())
{
emp = empData.get(0);
}else{
String errorMsg = "No record found for given User ID in emp table";
logger.error(errorMsg);
//continue;
// continue does not work here.
}
User user = new User();
user.setUserId(userId);
user.setFullName(emp.getFullName());
return user;
}
}
);
userCount++;
}
catch(Exception ex)
{
String errorMsg = "Error while creating map :" + ex.getMessage();
logger.error(errorMsg);
}
}
for (int i = 0; i < userCount ; i++ ) {
try {
User user = cs.take().get();
if (user != null) {
userIdMap.put(user.getUserId(), user);
}
} catch (ExecutionException e) {
} catch (InterruptedException e) {
}
}
executor.shutdown();
map.put(key, userIdMap);
}
catch(Exception ex)
{
String errorMsg = "Error while creating map :" + ex.getMessage();
logger.error(errorMsg);
}
}
}
catch(Exception ex){
String errorMsg = "Error while creating map :" + ex.getMessage();
logger.error(errorMsg);
}
logger.info("Size of Map : " + map.size());
Set<String> periods = map.keySet();
logger.info("Size of periods : " + periods.size());
for(String period :periods)
{
Map<Integer, User> mapOfuserIds = map.get(period);
Set<Integer> userIds = mapOfuserIds.keySet();
logger.info("Size of Set : " + userIds.size());
for(Integer userId : userIds){
User inf = mapOfuserIds.get(userId);
logger.info("User Id : " + inf.getUserId());
}
}
long endTime = System.currentTimeMillis();
long timeTaken = (endTime - startTime);
logger.info("All threads are completed in " + timeTaken + " milisecond");
logger.info("******END******");
}
You really don't want to create a thread pool with as many threads as users you've read from the db. That rarely makes sense: keep in mind that threads need to run somewhere, and there are not many servers out there with 10, 100, or even 1000 cores reserved for your application. A much smaller value, maybe 5, is often enough, depending on your environment.
And as always with performance topics: first test what your actual bottleneck is. Your application may simply not benefit from threading because, for example, you are reading from a db which only allows 5 concurrent connections at the same time. In that case, all your other 995 threads will simply wait.
Another thing to consider is network latency: reading single users from multiple threads may even increase the total round-trip time needed to get the data. An alternative approach might be to read not one user at a time, but the data for all 10,000 of them at once. That way your perhaps available 10 GBit Ethernet connection to the database might really speed things up, because the communication overhead stays small while one answer can carry all the data you need.
So in short: in my opinion, your question is about performance optimization of your problem in general, but you don't yet know enough to decide which way to go.
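To make the pool-size point concrete, and to show how to 'skip' a user inside a Callable (return null instead of continue), here is a minimal sketch reusing the question's db, logger, User, UserTable1, and EmployeeTable2 types:
ExecutorService executor = Executors.newFixedThreadPool(5); // small fixed pool
List<Future<User>> futures = new ArrayList<>();
for (UserTable1 eligibleUser : eligibleUsers) {
final int userId = eligibleUser.getUserId();
futures.add(executor.submit(() -> {
List<EmployeeTable2> empData = db.findByUserId(userId);
if (empData == null || empData.isEmpty()) {
logger.error("No record found for given User ID in emp table");
return null; // the Callable equivalent of 'continue'
}
User user = new User();
user.setUserId(userId);
user.setFullName(empData.get(0).getFullName());
return user;
}));
}
for (Future<User> f : futures) {
try {
User user = f.get(); // blocks until that task is done
if (user != null) { // skipped users simply come back as null
userIdMap.put(user.getUserId(), user);
}
} catch (InterruptedException | ExecutionException e) {
logger.error("Error while creating map: " + e.getMessage());
}
}
executor.shutdown();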
You could try something like this:
List<String> periods = db.getPeriods(todayDate);
// ConcurrentHashMap, because the parallel forEach writes from multiple threads
Map<String, Map<Integer, User>> hm = new ConcurrentHashMap<>();
periods.parallelStream().forEach(period -> {
List<UserTable1> eligibleUsers = db.findAllUsers(period);
hm.put(period, eligibleUsers.parallelStream().collect(
Collectors.toConcurrentMap(UserTable1::getUserId,
u -> createUserForId(u.getUserId()))));
});
And in createUserForId you do your db reading:
private User createUserForId(Integer userId) {
List<EmployeeTable2> empData = db.findByUserId(userId);
EmployeeTable2 emp = empData.get(0); // assumes a matching record exists
User user = new User();
user.setUserId(userId);
user.setFullName(emp.getFullName());
return user;
}

Proving that SimpleDateFormat is not threadsafe

I want to show a colleague, through a simple JUnit test, that SimpleDateFormat is not thread-safe. The following class fails to make my point (reusing a SimpleDateFormat in a multithreaded environment), and I don't understand why. Can you spot what is preventing my use of SDF from throwing a runtime exception?
public class SimpleDateFormatThreadTest
{
@Test
public void test_SimpleDateFormat_MultiThreaded() throws ParseException{
Date aDate = (new SimpleDateFormat("dd/MM/yyyy").parse("31/12/1999"));
DataFormatter callable = new DataFormatter(aDate);
ExecutorService executor = Executors.newFixedThreadPool(1000);
Collection<DataFormatter> callables = Collections.nCopies(1000, callable);
try{
List<Future<String>> futures = executor.invokeAll(callables);
for (Future f : futures){
try{
assertEquals("31/12/1999", (String) f.get());
}
catch (ExecutionException e){
e.printStackTrace();
}
}
}
catch (InterruptedException e){
e.printStackTrace();
}
}
}
class DataFormatter implements Callable<String>{
static SimpleDateFormat sdf = new SimpleDateFormat("dd/MM/yyyy");
Date date;
DataFormatter(Date date){
this.date = date;
}
@Override
public String call() throws RuntimeException{
try{
return sdf.format(date);
}
catch (RuntimeException e){
e.printStackTrace();
return "EXCEPTION";
}
}
}
Lack of thread safety doesn't necessarily mean that the code will throw an exception. This was explained in Andy Grove's article, SimpleDateFormat and Thread Safety, which is no longer available online. In it, he demonstrated SimpleDateFormat's lack of thread safety by showing that the output would not always be correct, given different inputs.
When I run this code, I get the following output:
java.lang.RuntimeException: date conversion failed after 3 iterations.
Expected 14-Feb-2001 but got 01-Dec-2007
Note that "01-Dec-2007" isn't even one of the strings in the test data. It is actually a combination of the dates being processed by the other two threads!
While the original article is no longer available online, the following code illustrates the issue. It was created based on articles that appeared to have been based on Andy Grove's initial article.
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;
import java.util.Locale;
public class SimpleDateFormatThreadSafety {
private final SimpleDateFormat dateFormat = new SimpleDateFormat("dd-MMM-yyyy", Locale.US);
public static void main(String[] args) {
new SimpleDateFormatThreadSafety().dateTest(List.of("01-Jan-1999", "14-Feb-2001", "31-Dec-2007"));
}
public void dateTest(List<String> testData) {
testData.stream()
.map(d -> new Thread(() -> repeatedlyParseAndFormat(d)))
.forEach(Thread::start);
}
private void repeatedlyParseAndFormat(String value) {
for (int i = 0; i < 1000; i++) {
Date d = tryParse(value);
String formatted = dateFormat.format(d);
if (!value.equals(formatted)) {
throw new RuntimeException("date conversion failed after " + i
+ " iterations. Expected " + value + " but got " + formatted);
}
}
}
private Date tryParse(String value) {
try {
return dateFormat.parse(value);
} catch (ParseException e) {
throw new RuntimeException("parse failed");
}
}
}
Sometimes this conversion fails by returning the wrong date, and sometimes it fails with a NumberFormatException:
java.lang.NumberFormatException: For input string: ".E2.31E2"
Isn't this part of the SimpleDateFormat javadoc sufficient proof of it?
Synchronization
Date formats are not synchronized. It is recommended to create separate format instances for each thread. If multiple threads access a format concurrently, it must be synchronized externally.
And the major observation of its not being thread-safe is getting unexpected results, not an exception.
It is not thread-safe because of this code in SimpleDateFormat (in the Sun JVM 1.7.0_02):
private StringBuffer format(Date date, StringBuffer toAppendTo,
FieldDelegate delegate) {
// Convert input date to time field list
calendar.setTime(date);
....
}
Each call to format stores the date in the calendar member variable of the SimpleDateFormat, and then applies the formatting to the contents of that calendar field (not to the date parameter).
So, while one call to format is running, concurrent calls from other threads may change (depending on the memory coherence model of your architecture) the data in the calendar member variable it is reading.
So if you run multiple concurrent calls to format, you may not get an exception, but each call may return a result derived from the date of one of the other calls to format, or a hybrid combination of data from many different calls.
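For completeness, a sketch of the two usual workarounds: give each thread its own instance via ThreadLocal, or switch to java.time's DateTimeFormatter, which is immutable and safe to share:
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
class SafeFormatting {
// Option 1: one SimpleDateFormat per thread
private static final ThreadLocal<SimpleDateFormat> SDF =
ThreadLocal.withInitial(() -> new SimpleDateFormat("dd/MM/yyyy"));
static String formatLegacy(java.util.Date date) {
return SDF.get().format(date); // each thread uses its own copy
}
// Option 2: DateTimeFormatter is immutable, so one shared instance is safe
private static final DateTimeFormatter DTF =
DateTimeFormatter.ofPattern("dd/MM/yyyy");
static String formatModern(LocalDate date) {
return date.format(DTF);
}
}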
