I have an application which reads and writes a number of files. The aim is to prevent a specific file from being read or written while it is being written by another thread. I do not want to lock reading and writing of all files while a single file is being written, as this causes unnecessary locking.
To try to achieve this I am using a ConcurrentHashMap in conjunction with a synchronized block, but if there is a better solution I am open to it.
Here is the rough code.
private static final ConcurrentMap<String, String> lockMap = new ConcurrentHashMap<>();

private void createCache(String templatePath, String cachePath) throws IOException {
    // get template
    String temp = getTemplate(templatePath);
    String myRand = randomString();
    lockMap.put(cachePath, myRand);
    // save cache file
    try {
        // ** is lockMap.get(cachePath) still threadsafe if another thread has changed the row's value?
        synchronized (lockMap.get(cachePath)) {
            Files.write(Paths.get(cachePath), temp.getBytes(StandardCharsets.UTF_8));
        }
    } finally {
        // remove lock if not locked by another thread in the meantime
        lockMap.remove(cachePath, myRand);
    }
}
private String getCache(String cachePath) {
    String output = null;
    // only lock if this specific file is being written at the moment
    if (lockMap.containsKey(cachePath)) {
        synchronized (lockMap.get(cachePath)) {
            output = getFile(cachePath);
        }
    } else {
        output = getFile(cachePath);
    }
    return output;
}
// main event
private String cacheToString(String templatePath, String cachePath) throws IOException {
    File cache = new File(cachePath);
    if (!cache.exists()) {
        createCache(templatePath, cachePath);
    }
    return getCache(cachePath);
}
The problem I have is that although the thread will only remove the lock for the requested file if it is unchanged by another thread, it is still possible for another thread to update the value in the lockMap for this entry. If that happens, will the synchronization fail?
I would write a new temporary file each time and rename it when finished. Renaming is atomic.
// a unique counter across restarts
final AtomicLong counter = new AtomicLong(System.currentTimeMillis() * 1000);

// requires: import static java.nio.file.StandardCopyOption.*;
private void createCache(String templatePath, String cachePath) throws IOException {
    // get template
    String temp = getTemplate(templatePath);
    Path path = Paths.get(cachePath);
    Path tmpPath = Paths.get(path.getParent().toString(), counter.getAndIncrement() + ".tmp");
    // save cache file, then atomically move it into place
    Files.write(tmpPath, temp.getBytes(StandardCharsets.UTF_8));
    Files.move(tmpPath, path, ATOMIC_MOVE, REPLACE_EXISTING);
}
If multiple threads try to write to the same file, the last one to perform a move wins.
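With this approach readers need no locking at all, since a reader can only ever observe a complete file - either the old one or the new one. A minimal sketch of the read side, assuming the same getCache role as in the question:
    // No lock needed: ATOMIC_MOVE guarantees a reader sees either the old
    // complete file or the new complete file, never a partial write.
    private String getCache(String cachePath) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(cachePath));
        return new String(bytes, StandardCharsets.UTF_8);
    }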
Related
I have a watch service running on a folder. When I try to modify an existing file using eventKind == ENTRY_MODIFY (basically pasting the same file without removing the current one), I get a FileNotFoundException (The process cannot access the file because it is being used by another process.)
if (eventKind == StandardWatchEventKinds.ENTRY_MODIFY) {
    String newFileChecksum = null;
    if (eventPath.toFile().exists()) {
        newFileChecksum = getFileChecksum(eventPath.toFile());
    }
    if (fileMapper.containsKey(eventPath)) {
        String existingFileChecksum = fileMapper.get(eventPath);
        if (!existingFileChecksum.equals(newFileChecksum)) {
            fileMapper.replace(eventPath, existingFileChecksum, newFileChecksum);
            log.info("listener.filemodified IN");
            for (DirectoryListener listener : this.listeners) {
                listener.fileModified(this, eventPath);
            }
            log.info("listener.filemodified OUT");
        } else {
            log.info("existing checksum");
            log.debug(String.format(
                    "Checksum for file [%s] has not changed. Skipping plugin processing.",
                    eventPath.getFileName()));
        }
    }
}
In the code, when getFileChecksum() is called:
    if (eventPath.toFile().exists()) {
        newFileChecksum = getFileChecksum(eventPath.toFile());
    }
ideally eventPath.toFile().exists() is TRUE, so the code goes inside the if, but when getFileChecksum() is called it goes to this method...
private synchronized String getFileChecksum(File file) throws IOException, NoSuchAlgorithmException {
    MessageDigest md5Digest = MessageDigest.getInstance("MD5");
    FileInputStream fis = null;
    if (file.exists()) {
        try {
            fis = new FileInputStream(file);
        } catch (Exception e) {
            e.printStackTrace();
        }
    } else {
        log.warn("File not detected.");
    }
    if (fis == null) {
        return null; // guard: avoid a NullPointerException when the stream could not be opened
    }
    byte[] byteArray = new byte[1024];
    int bytesCount = 0;
    while ((bytesCount = fis.read(byteArray)) != -1) {
        md5Digest.update(byteArray, 0, bytesCount);
    }
    fis.close();
    byte[] bytes = md5Digest.digest();
    StringBuilder stringBuilder = new StringBuilder();
    for (int i = 0; i < bytes.length; i++) {
        stringBuilder.append(Integer.toString((bytes[i] & 0xff) + 0x100, 16).substring(1));
    }
    return stringBuilder.toString();
}
The exception comes from fis = new FileInputStream(file); even though the file is present in the folder:
FileNotFoundException (The process cannot access the file because it is being used by another process.)
I created a RandomAccessFile and a channel to release any LOCK placed on the file, but it is not working. Please suggest what could be happening here.
UPDATE --> This is the infinite while loop that I have. What is happening: when I put a file in the folder, one CREATE and two MODIFY events are fired; when I delete the file, one DELETE and one MODIFY are fired; and if I put the same file back into the folder, I get a CREATE, but before the CREATE handling finishes a MODIFY is fired, so the create logic does not run and the modify logic runs instead.
I fixed this issue by putting Thread.sleep(500) between
    WatchKey wk = watchService.take();
    Thread.sleep(500);
    for (WatchEvent<?> event : wk.pollEvents()) {
But I don't think I can justify the use of sleep here. Please help.
WatchService watchService = null;
WatchKey watchKey = null;
while (!this.canceled && (watchKey == null)) {
    watchService = watchService == null
            ? FileSystems.getDefault().newWatchService() : watchService;
    watchKey = this.directory.register(watchService,
            StandardWatchEventKinds.ENTRY_MODIFY, StandardWatchEventKinds.ENTRY_DELETE,
            StandardWatchEventKinds.ENTRY_CREATE);
}
while (!this.canceled) {
    try {
        WatchKey wk = watchService.take();
        for (WatchEvent<?> event : wk.pollEvents()) {
            Kind<?> eventKind = event.kind();
            System.out.println("Event kind : " + eventKind);
            Path dir = (Path) wk.watchable();
            Path eventPath = (Path) event.context();
            Path fullPath = dir.resolve(eventPath);
            fireEvent(eventKind, fullPath);
        }
        wk.reset();
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
    }
}
I have a better approach: use a while loop on a flag isFileReady, and inside the loop try to open the file, setting the flag only once the open succeeds.
    boolean isFileReady = false;
    while (!isFileReady) {
        try (FileInputStream fis = new FileInputStream(file)) { // 'file' is the file being watched
            isFileReady = true; // opening succeeded, the writer has released the file
        } catch (IOException e) {
            // file not ready yet (still locked by another process); retry
        }
    }
This will solve your problem.
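In practice you might also add a short Thread.sleep between retries, so the loop does not spin at 100% CPU while the other process still holds the file.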
The WatchService is verbose and may report multiple ENTRY_MODIFY events for a single save operation - even when another application is part way through its writes or is writing repeatedly. Your code is probably acting on a modify event while the other app is still writing, with a second ENTRY_MODIFY on its way.
A safer strategy for using the WatchService is to collate the events you receive and only act on the changes when there is a pause. Something like this will ensure that you block on the first event but then poll the watch service with a small timeout to see if more changes are present before you act on the previous set:
WatchService ws = ...
HashSet<Path> modified = new HashSet<>();
while (appIsRunning) {
    int countNow = modified.size();
    WatchKey k = countNow == 0 ? ws.take() : ws.poll(1, TimeUnit.MILLISECONDS);
    if (k != null) {
        // Loop through k.pollEvents() and put each modified file path into the set.
        // DO NOT CALL fireEvent here - save the path instead:
        for (WatchEvent<?> event : k.pollEvents()) {
            if (event.kind() == StandardWatchEventKinds.ENTRY_MODIFY) {
                Path filePath = ((Path) k.watchable()).resolve((Path) event.context());
                modified.add(filePath);
            }
        }
        k.reset(); // re-arm the key so further events are delivered
    }
    // Don't act on changes unless no new events arrived:
    if (countNow == modified.size()) {
        // ACT ON the modified set here - the watch service did not report new changes
        for (Path filePath : modified) {
            fireEvent(filePath);
        }
        // reset the set so the next watch call is take(), not poll(1)
        modified.clear();
    }
}
If you are also looking out for CREATE and DELETE operations alongside MODIFY, you will have to collate and ignore some of the earlier events, because the last recorded event type can take precedence over a previously recorded type. For example, if calling take() then poll(1) until nothing new is reported, as sketched after this list:
Any DELETE then CREATE => you might want to consider it a MODIFY
Any CREATE then MODIFY => you might want to consider it a CREATE
Any CREATE or MODIFY then a DELETE => treat it as a DELETE
Your logic would then also want to act only when the value of modified.size() + created.size() + deleted.size() has not changed between runs.
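A minimal collation sketch under those rules - assuming hypothetical created/modified/deleted sets filled in while polling, in the style of the loop above:
    // Hypothetical helper: fold one raw event into net per-file outcomes.
    static void collate(WatchEvent.Kind<?> kind, Path path,
            Set<Path> created, Set<Path> modified, Set<Path> deleted) {
        if (kind == StandardWatchEventKinds.ENTRY_CREATE) {
            if (deleted.remove(path)) {
                modified.add(path); // DELETE then CREATE => consider it a MODIFY
            } else {
                created.add(path);
            }
        } else if (kind == StandardWatchEventKinds.ENTRY_MODIFY) {
            if (!created.contains(path)) {
                modified.add(path); // CREATE then MODIFY => still just a CREATE
            }
        } else if (kind == StandardWatchEventKinds.ENTRY_DELETE) {
            created.remove(path); // any CREATE or MODIFY then DELETE => DELETE
            modified.remove(path);
            deleted.add(path);
        }
    }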
Let me guess...
The modify event gets fired when you modify a file. To modify the file you most likely use a separate tool, like Notepad, that opens and LOCKS the file.
Your watcher gets an event that the file is being modified (right now), but you cannot open it again (which FileInputStream wants to do) since it is already locked.
A process I've been working on for a little while now. The process was running fine until performance took a hit. I figured out a way to get it to perform very fast, but I'm really unsure what is happening behind the scenes. It's now throwing warnings and errors and I'm not sure what to do. The file is getting processed, but I'm not sure if all threads are complete, and I don't believe I am shutting down the app correctly. Here is everything you need to know...
The file is read using a buffered reader, and we run some data quality checks on each record. Every record that is read and passes the checks is turned into a Java object and inserted into a List. Once the List is 1000 objects big, we call an OracleService class, which has a Repo autowired, and execute a saveAll method with the List. We then continue reading the file and do this until the file is done being read. I am passing an ExecutorService object into the service, so every time we call that service it gets a new List object containing my objects (this object is basically the table we are loading) and the ExecutorService object. The process runs fine but throws a ton of exceptions once I try to shut down. Here is all my code...
My controller class's run method. This gets called from another class which implements CommandLineRunner:
public void run() throws ParseException, IOException, InterruptedException {
    logger.info("******************** Aegis Check Inclearing DDA Trial Balance Table Load starting ********************");
    try (BufferedReader reader = new BufferedReader(new FileReader(inputFile))) {
        String line = reader.readLine();
        int count = 0;
        TrialBalanceBuilder builder = new TrialBalanceBuilder();
        while (line != null) {
            if (line.startsWith("D")) {
                if (dataQuality(line)) {
                    TrialBalance trialBalance = builder.buildTrialBalanceObject(line, procDt, time);
                    insertList.add(trialBalance);
                    count++;
                    if (count == 1000) {
                        oracleService.loadToTableTrialBalance(insertList, executorService);
                        count = 0;
                        insertList.clear();
                    }
                } else {
                    logger.info("Data quality check FAILED for record: " + line);
                    oracleService.revertInserts("DDA_TRIAL_BAL_STG", procDt.toString());
                    System.exit(111);
                }
            }
            line = reader.readLine();
        }
        logger.info("Leftover record count is " + insertList.size());
        oracleService.loadToTableTrialBalance(insertList, executorService);
    } catch (IOException e) {
        e.printStackTrace();
    }
    logger.info("Updating Metadata table with new batch proc date");
    InclearingBatchMetadataBuilder inclearingBatchMetadataBuilder = new InclearingBatchMetadataBuilder();
    InclearingBatchMetadata inclearingBatchMetadata = inclearingBatchMetadataBuilder.buildInclearingBatchMetadataObject("DDA_TRIAL_BAL_STG", procDt, time, Constants.bankID);
    oracleService.insertBatchProcDtIntoMetaTable(inclearingBatchMetadata);
    logger.info("Successfully updated Metadata table with new batch proc date: " + procDt);
    Thread.sleep(10000);
    oracleService.cleanUpGOS("DDA_TRIAL_BAL_STG", 1);
    executorService.shutdownNow();
    logger.info("******************** Aegis Check Inclearing DDA Trial Balance Table Load ended successfully ********************");
}
I'm passing in an ExecutorService object to the service class. This is defined as...
private final ThreadFactory threadFactory = new ThreadFactoryBuilder().setNameFormat("Orders-%d").setDaemon(true).build();
private ExecutorService executorService = Executors.newFixedThreadPool(10, threadFactory);
My service class looks like this:
@Service("oracleService")
public class OracleService {
    private static final Logger logger = LoggerFactory.getLogger(OracleService.class);

    @Autowired
    TrialBalanceRepo trialBalanceRepo;

    @Transactional
    public void loadToTableTrialBalance(List<TrialBalance> trialBalanceList, ExecutorService executorService) {
        logger.debug("Let's load to the database");
        logger.debug(trialBalanceList.toString());
        List<TrialBalance> multiThreadList = new ArrayList<>(trialBalanceList);
        try {
            executorService.execute(() -> trialBalanceRepo.saveAll(multiThreadList));
        } catch (ConcurrentModificationException | DataIntegrityViolationException ignored) {}
        logger.debug("Successfully loaded to database");
    }
}
In my run method I then call a few more methods in that service class which create native queries and execute them on the database (for purging etc.).
Anyway, I never know when the threads are complete. And I am finding in pre-production, when running with a lot of data, that we shut down the app before all the data is completely loaded. I also don't know if this is even the best design. Do I keep passing in these ExecutorService objects? The whole point of this was to get optimal parallelism going so that our performance was better. Perhaps there is a better way (preferably without redesigning the entire app and using something other than JPA).
I'm trying to build a process that will watch a list of directories (populated via JPA), and when a new file is detected in a folder a new thread is started to process that folder. At most one thread should be running per folder, but multiple threads can run across different folders.
I've got that working somewhat with the code below, but here is the issue I've found: say 1 out of 5 files has moved so far. A thread is made immediately once the first file is detected, and the ProcessDatasource thread then loops through the dir and makes a file object for the one file present to process. In the meantime the other 4 files trigger the file system watcher but are blocked because a datasource thread is already running on that folder. Since the watcher will have already fired when those files landed, it won't run again, which leaves those 4 files in limbo until another file lands in that folder...
To solve this I thought that if a file lands while a thread is already running, I could call a method within the thread to add the file to the list of files it's currently processing, but I'm struggling to do that when the threads are made dynamically in the loop below. Of course this could just be an awful way of doing all this, so I'm open to any suggestions.
private boolean checkThreadRunning(String threadName) {
    Set<Thread> threadSet = Thread.getAllStackTraces().keySet();
    for (Thread t : threadSet) {
        if (t.getThreadGroup() == Thread.currentThread().getThreadGroup() && t.getName().equals(threadName)) {
            return true;
        }
    }
    return false;
}
public void run(String... args) throws IOException, InterruptedException {
    WatchService watchService = FileSystems.getDefault().newWatchService();
    List<DataSource> datasourceList = readDataSources(); // Load a list of DataSource objects into the datasourceList.
    Map<WatchKey, DataSource> keys = registerKeys(watchService, datasourceList);
    WatchKey key;
    while ((key = watchService.take()) != null) {
        DataSource dataSource = keys.get(key);
        for (WatchEvent<?> event : key.pollEvents()) {
            String dataSourceName = dataSource.getDatasourceName();
            String threadName = "datasourceThread-" + dataSourceName;
            // Check if there is already a thread running on this datasource (folder)
            if (checkThreadRunning(threadName)) {
                System.out.println("Found another file for datasource " + dataSourceName + " but an instance is already running");
                // Need something here to pass this new file into the currently running thread to be processed...
            } else {
                // If not, start a thread which will work through processing the files within the folder.
                new Thread(new ProcessDatasource(threadName, dataSource)).start();
            }
        }
        key.reset();
    }
}
I am trying to take a very long file of strings and convert it to XML according to a schema I was given. I used JAXB to create classes from that schema. Since the file is very large I created a thread pool to improve performance, but since then it only processes one line of the file per thread, marshalling it to the XML file.
Below is my home class, where I read from the file. Each line is a record of a transaction; for every new user encountered, a list is made to store all of that user's transactions, and each list is put into a HashMap. I made it a ConcurrentHashMap because multiple threads will work on the map simultaneously - is this the correct thing to do?
After the lists are created, a thread is made for each user. Each thread runs the ProcessCommands method below and receives from home the list of transactions for its user.
public class home {
    public static File XMLFile = new File("LogFile.xml");
    Map<String, List<String>> UserMap = new ConcurrentHashMap<String, List<String>>();
    String[] UserNames = new String[5000];
    int numberOfUsers = 0;

    { // instance initializer block
        String[] parsed;
        try {
            BufferedReader reader = new BufferedReader(new FileReader("test.txt"));
            String line;
            while ((line = reader.readLine()) != null) {
                parsed = line.split(",|\\s+");
                if (!parsed[2].equals("./testLOG")) {
                    if (Utilities.checkUserExists(parsed[2], UserNames) == false) { // User does not already exist
                        System.out.println("New User: " + parsed[2]);
                        UserMap.put(parsed[2], new ArrayList<String>()); // Create list of transactions for new user
                        UserMap.get(parsed[2]).add(line); // Add first item to new list
                        UserNames[numberOfUsers] = parsed[2]; // Add new user
                        numberOfUsers++;
                    } else { // User already existed
                        UserMap.get(parsed[2]).add(line);
                    }
                }
            }
            reader.close();
        } catch (IOException x) {
            System.err.println(x);
        }
        // get start time
        long startTime = new Date().getTime();
        int tCount = numberOfUsers;
        ExecutorService threadPool = Executors.newFixedThreadPool(tCount);
        for (int i = 0; i < numberOfUsers; i++) {
            System.out.println("Starting Thread " + i + " for user " + UserNames[i]);
            Runnable worker = new ProcessCommands(UserMap.get(UserNames[i]), UserNames[i], XMLFile);
            threadPool.execute(worker);
        }
        threadPool.shutdown();
        while (!threadPool.isTerminated()) {
        }
        System.out.println("Finished all threads");
    }
}
Here is the ProcessCommands class. The thread receives the list for its user and creates a marshaller. From what I understand, marshalling is not thread safe, so it is best to create one for each thread - is this the best way to do that?
When I create the marshallers, I know that each one (from each thread) will want to access the created file, causing conflicts, so I used synchronized - is that correct?
As the thread iterates through its list, each line calls for a certain case. There are a lot, so I just made pseudo-cases for clarity. Each case calls the function below.
public class ProcessCommands implements Runnable {
    private static final boolean DEBUG = false;
    private List<String> list = null;
    private String threadName;
    private File XMLfile = null;
    public Thread myThread;

    public ProcessCommands(List<String> list, String threadName, File XMLfile) {
        this.list = list;
        this.threadName = threadName;
        this.XMLfile = XMLfile;
    }
    public void run() {
        Date start = null;
        int transactionNumber = 0;
        String[] parsed = new String[8];
        String[] quoteParsed = null;
        String[] universalFormatCommand = new String[9];
        String userCommand = null;
        Connection connection = null;
        Statement stmt = null;
        Map<String, UserObject> usersMap = null;
        Map<String, Stack<BLO>> buyMap = null;
        Map<String, Stack<SLO>> sellMap = null;
        Map<String, QLO> stockCodeMap = null;
        Map<String, BTO> buyTriggerMap = null;
        Map<String, STO> sellTriggerMap = null;
        Map<String, USO> usersStocksMap = null;
        String SQL = null;
        int amountToAdd = 0;
        int tempDollars = 0;
        UserObject tempUO = null;
        BLO tempBLO = null;
        SLO tempSLO = null;
        Stack<BLO> tempStBLO = null;
        Stack<SLO> tempStSLO = null;
        BTO tempBTO = null;
        STO tempSTO = null;
        USO tempUSO = null;
        QLO tempQLO = null;
        String stockCode = null;
        String quoteResponse = null;
        int usersDollars = 0;
        int dollarAmountToBuy = 0;
        int dollarAmountToSell = 0;
        int numberOfSharesToBuy = 0;
        int numberOfSharesToSell = 0;
        int quoteStockInDollars = 0;
        int shares = 0;
        Iterator<String> itr = null;
        int transactionCount = list.size();
        System.out.println("Starting " + threadName + " - listSize = " + transactionCount);
        //UO dollars, reserved
        usersMap = new HashMap<String, UserObject>(3); //userName -> UO
        //USO shares
        usersStocksMap = new HashMap<String, USO>(); //userName+stockCode -> shares
        //BLO code, timestamp, dollarAmountToBuy, stockPriceInDollars
        buyMap = new HashMap<String, Stack<BLO>>(); //userName -> Stack<BLO>
        //SLO code, timestamp, dollarAmountToSell, stockPriceInDollars
        sellMap = new HashMap<String, Stack<SLO>>(); //userName -> Stack<SLO>
        //BTO code, timestamp, dollarAmountToBuy, stockPriceInDollars
        buyTriggerMap = new ConcurrentHashMap<String, BTO>(); //userName+stockCode -> BTO
        //STO code, timestamp, dollarAmountToBuy, stockPriceInDollars
        sellTriggerMap = new HashMap<String, STO>(); //userName+stockCode -> STO
        //QLO timestamp, stockPriceInDollars
        stockCodeMap = new HashMap<String, QLO>(); //stockCode -> QLO
        //create user object and initialize stacks
        usersMap.put(threadName, new UserObject(0, 0));
        buyMap.put(threadName, new Stack<BLO>());
        sellMap.put(threadName, new Stack<SLO>());
        try {
            //Marshaller marshaller = getMarshaller();
            synchronized (this) {
                Marshaller marshaller = init.jc.createMarshaller();
                marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
                marshaller.setProperty(Marshaller.JAXB_FRAGMENT, true);
                marshaller.marshal(LogServer.Root, XMLfile);
                marshaller.marshal(LogServer.Root, System.out);
            }
        } catch (JAXBException M) {
            M.printStackTrace();
        }
        Date timing = new Date();
        //universalFormatCommand = new String[8];
        parsed = new String[8];
        //iterate through workload file
        itr = this.list.iterator();
        while (itr.hasNext()) {
            userCommand = (String) itr.next();
            itr.remove();
            parsed = userCommand.split(",|\\s+");
            transactionNumber = Integer.parseInt(parsed[0].replaceAll("\\[", "").replaceAll("\\]", ""));
            universalFormatCommand = Utilities.FormatCommand(parsed, parsed[0]);
            if (transactionNumber % 100 == 0) {
                System.out.println(this.threadName + " - " + transactionNumber + " - " + (new Date().getTime() - timing.getTime()) / 1000);
            }
            /*System.out.print("UserCommand " + transactionNumber + ": ");
            for(int i = 0; i < 8; i++) System.out.print(universalFormatCommand[i] + " ");
            System.out.print("\n");*/
            //switch for user command
            switch (parsed[1].toLowerCase()) {
                case "one":
                    // *Do stuff*
                    LogServer.create_Log(universalFormatCommand, transactionNumber, CommandType.ADD);
                    break;
                case "two":
                    // *Do stuff*
                    LogServer.create_Log(universalFormatCommand, transactionNumber, CommandType.ADD);
                    break;
            }
        }
    }
}
The function create_Log has multiple cases, so as before, for clarity I left just one. The case "QUOTE" only calls one object-creation function, but other cases can create multiple objects. The type 'log' is a complex XML type that defines all the other object types, so in each call to create_Log I create a log type called Root. The class 'log' generated by JAXB includes a function to create a list of objects. The statement:
Root.getUserCommandOrQuoteServerOrAccountTransaction().add(quote_QuoteType);
takes the root element I created, creates a list, and adds the newly created object 'quote_QuoteType' to that list. Before I added threading, this method successfully created a list of as many objects as I wanted and then marshalled them. So I'm pretty positive the bit in class 'LogServer' is not the issue. It is something to do with the marshalling and synchronization in the ProcessCommands class above.
public class LogServer {
    public static log Root = new log();

    public static QuoteServerType Log_Quote(String[] input, int TransactionNumber) {
        ObjectFactory factory = new ObjectFactory();
        QuoteServerType quoteCall = factory.createQuoteServerType();
        // **Populate the QuoteServerType object called quoteCall**
        return quoteCall;
    }

    public static void create_Log(String[] input, int TransactionNumber, CommandType Command) {
        System.out.print("TRANSACTION " + TransactionNumber + " is " + Command + ": ");
        for (int i = 0; i < input.length; i++) System.out.print(input[i] + " ");
        System.out.print("\n");
        switch (input[1]) {
            case "QUOTE":
                System.out.print("QUOTE CASE");
                QuoteServerType quote_QuoteType = Log_Quote(input, TransactionNumber);
                Root.getUserCommandOrQuoteServerOrAccountTransaction().add(quote_QuoteType);
                break;
        }
    }
}
So you wrote a lot of code, but have you checked whether it actually works? After a quick look I doubt it. You should test your code's logic part by part, not write everything through to the end. It seems you are just starting with Java. I would recommend practicing first on simple single-threaded applications. Sorry if I sound harsh, but I will try to be constructive as well:
Per convention, class names start with a capital letter and variable names with a lowercase one; you do it the other way around.
You should put your code in a method in your home (Home) class, not in an initializer block.
You are reading the whole file into memory; you do not process it line by line. After home is initialized, literally the whole content of the file will be under the UserMap variable. If the file is really large you will run out of heap memory. If you assume a large file then you cannot do this, and you have to redesign your app to store partial results somewhere. If your file is smaller than memory you could keep it like that (but you said it is large).
No need for UserNames; UserMap.containsKey will do the job.
Your thread pool size should be in the range of your cores, not the number of users, or you will get thread thrashing (if you have blocking operations in your code make tCount = 2 * processors; if not, keep it at the number of processors). Once one ProcessCommands finishes, the executor will start another one until you finish all of them, and you will be efficiently using all your processor cores.
DO NOT busy-wait on while(!threadPool.isTerminated()); this line will completely consume one processor as it will be constantly checking. Call awaitTermination instead (see the sketch after this list).
Your ProcessCommands has a few map variables which will only ever hold one entry, because, as you said, each thread processes the data of one user.
The synchronized(this) in ProcessCommands will not work, as each thread will synchronize on a different object (a different instance of ProcessCommands).
I believe creating a marshaller is thread-safe (check it), so there is no need for the synchronization at all.
You save your log (whatever it is) before you have done the actual processing of the transaction lists.
The marshalling will overwrite the content of the file with the current state of LogServer.Root. If it is shared between your ProcessCommands instances (it seems so), what is the point of saving it from each thread? Do it once you are finished.
You don't need itr.remove();
The log class (for the Root variable!) needs to be thread-safe, as all the threads will call operations on it (so the list inside the log class must be a concurrent list, etc.).
And so on.....
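To illustrate the pool-sizing and shutdown points above, a minimal sketch (the one-hour timeout is an arbitrary choice; pick a bound that fits your workload):
    int processors = Runtime.getRuntime().availableProcessors();
    ExecutorService threadPool = Executors.newFixedThreadPool(processors);
    // ... submit all the ProcessCommands workers ...
    threadPool.shutdown(); // stop accepting new tasks
    // awaitTermination blocks without spinning; it throws InterruptedException, so handle or declare it
    if (!threadPool.awaitTermination(1, TimeUnit.HOURS)) {
        threadPool.shutdownNow();
    }
And on the marshaller point: the JAXBContext is the expensive, thread-safe piece, while Marshaller instances are cheap and not guaranteed thread-safe, so a common pattern is one shared context with a fresh marshaller per thread - which is what init.jc.createMarshaller() in the question already does:
    // init.jc is the shared JAXBContext from the question; each thread
    // creates its own Marshaller, so no synchronized block is needed here.
    Marshaller marshaller = init.jc.createMarshaller();
    marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);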
I would recommend that you:
Start with a simple single-threaded version that actually works.
Deal with processing line by line (store results for each user in a different file; you can have a cache with transactions for recently used users so as not to keep writing to the disk all the time - see Guava cache).
Process each user's transactions into your user log objects multithreaded (again, if there are a lot you have to save them to disk, not keep them all in memory).
Write code that combines the logs from different users into one (again, you may want to do it multithreaded), though it will be mostly IO operations, so there is not much gain and it is more tricky to do.
Good luck
I'm trying to run KMeans on AWS, and I ran into the following exception when trying to read updated cluster centroids from the DistributedCache:
java.io.IOException: The distributed cache object s3://mybucket/centroids_6/part-r-00009 changed during the job from 4/8/13 2:20 PM to 4/8/13 2:20 PM
    at org.apache.hadoop.filecache.TrackerDistributedCacheManager.downloadCacheObject(TrackerDistributedCacheManager.java:401)
    at org.apache.hadoop.filecache.TrackerDistributedCacheManager.localizePublicCacheObject(TrackerDistributedCacheManager.java:475)
    at org.apache.hadoop.filecache.TrackerDistributedCacheManager.getLocalCache(TrackerDistributedCacheManager.java:191)
    at org.apache.hadoop.filecache.TaskDistributedCacheManager.setupCache(TaskDistributedCacheManager.java:182)
    at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1246)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1237)
    at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1152)
    at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2541)
    at java.lang.Thread.run(Thread.java:662)
What sets this question apart from this one is the fact that this error appears intermittently. I've run the same code successfully on a smaller dataset. Furthermore, when I change the number of centroids from 12 (seen above in the code) to 8, it fails on iteration 5 instead of 6 (which you can see in the centroids_6 name above).
Here's the relevant DistributedCache code in the main driver that runs the KMeans loop:
int iteration = 1;
long changes = 0;
do {
    // First, write the previous iteration's centroids to the dist cache.
    Configuration iterConf = new Configuration();
    Path prevIter = new Path(centroidsPath.getParent(),
            String.format("centroids_%s", iteration - 1));
    FileSystem fs = prevIter.getFileSystem(iterConf);
    Path pathPattern = new Path(prevIter, "part-*");
    FileStatus[] list = fs.globStatus(pathPattern);
    for (FileStatus status : list) {
        DistributedCache.addCacheFile(status.getPath().toUri(), iterConf);
    }
    // Now, set up the job.
    Job iterJob = new Job(iterConf);
    iterJob.setJobName("KMeans " + iteration);
    iterJob.setJarByClass(KMeansDriver.class);
    Path nextIter = new Path(centroidsPath.getParent(),
            String.format("centroids_%s", iteration));
    KMeansDriver.delete(iterConf, nextIter);
    // Set input/output formats.
    iterJob.setInputFormatClass(SequenceFileInputFormat.class);
    iterJob.setOutputFormatClass(SequenceFileOutputFormat.class);
    // Set Mapper, Reducer, Combiner.
    iterJob.setMapperClass(KMeansMapper.class);
    iterJob.setCombinerClass(KMeansCombiner.class);
    iterJob.setReducerClass(KMeansReducer.class);
    // Set MR formats.
    iterJob.setMapOutputKeyClass(IntWritable.class);
    iterJob.setMapOutputValueClass(VectorWritable.class);
    iterJob.setOutputKeyClass(IntWritable.class);
    iterJob.setOutputValueClass(VectorWritable.class);
    // Set input/output paths.
    FileInputFormat.addInputPath(iterJob, data);
    FileOutputFormat.setOutputPath(iterJob, nextIter);
    iterJob.setNumReduceTasks(nReducers);
    if (!iterJob.waitForCompletion(true)) {
        System.err.println("ERROR: Iteration " + iteration + " failed!");
        System.exit(1);
    }
    iteration++;
    changes = iterJob.getCounters().findCounter(KMeansDriver.Counter.CONVERGED).getValue();
    iterJob.getCounters().findCounter(KMeansDriver.Counter.CONVERGED).setValue(0);
} while (changes > 0);
How else would the files be modified? The only possibility I can think of is that, at the completion of one iteration, the loop begins again before the centroids from the previous job have finished writing. But as you can see in the code, I invoke the job with waitForCompletion(true), so there shouldn't be any residual parts of the job running when the loop starts over. Any ideas?
This isn't really an answer, but I did realize it was silly to use the DistributedCache in the way I was, as opposed to reading the results from the previous iteration directly from HDFS. I instead wrote this method in the main driver:
public static HashMap<Integer, VectorWritable> readCentroids(Configuration conf, Path path)
        throws IOException {
    HashMap<Integer, VectorWritable> centroids = new HashMap<Integer, VectorWritable>();
    FileSystem fs = FileSystem.get(path.toUri(), conf);
    FileStatus[] list = fs.globStatus(new Path(path, "part-*"));
    for (FileStatus status : list) {
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
        IntWritable key = null;
        VectorWritable value = null;
        try {
            key = (IntWritable) reader.getKeyClass().newInstance();
            value = (VectorWritable) reader.getValueClass().newInstance();
        } catch (InstantiationException e) {
            e.printStackTrace();
        } catch (IllegalAccessException e) {
            e.printStackTrace();
        }
        while (reader.next(key, value)) {
            centroids.put(new Integer(key.get()),
                    new VectorWritable(value.get(), value.getClusterId(), value.getNumInstances()));
        }
        reader.close();
    }
    return centroids;
}
This is invoked in the setup() method of the Mapper and Reducer during each iteration, to read the centroids of the previous iteration.
protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    Path centroidsPath = new Path(conf.get(KMeansDriver.CENTROIDS));
    centroids = KMeansDriver.readCentroids(conf, centroidsPath);
}
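For this to work, the driver presumably just points that config key at the previous iteration's output before launching each job - a minimal sketch, assuming KMeansDriver.CENTROIDS is the same key read in setup() above:
    // In the driver loop, in place of the DistributedCache block:
    iterConf.set(KMeansDriver.CENTROIDS, prevIter.toString());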
This allowed me to remove the block of code in the loop in my original question which writes the centroids to the DistributedCache. I tested it, and it now works on both large and small datasets.
I still don't know why I was getting the error I posted about (how would something in the read-only DistributedCache be changed? especially when I was changing HDFS paths on every iteration?), but this seems to both work and be a much less hack-y way of reading the centroids.