In my project, data is fetched from a database through a query.
There is an Iterator over the result set, and data is continuously added to it.
By iterating over the Iterator, the results are added to an ArrayList.
Once we have all the entries (more than 200,000), they are written to a file.
But since this uses a lot of JVM heap space, I want to use a worker thread that runs in the background and writes the data to the file.
As I am new to multithreading, I thought of using an ExecutorService with a fixed thread pool of 1 thread, and whenever the entries reach a count of 50,000, submitting them to the executor to append to the file.
Please suggest whether this approach is fine or whether I should follow another approach.
I don't think you need a thread pool in order to manage a single thread. You can do it by creating the thread yourself (sketch):
List<Entry> list = new ArrayList<Entry>(); // class member that will hold the entries from the result set. I assume the entry type is `Entry` here
....
void addEntry(Entry entry) {
    list.add(entry);
    if (list.size() >= 20000) {
        // assign the current list to a temp list in order to reinitialize the list for the next set of entries
        final List<Entry> tempList = list; // tempList has 20000 entries!
        list = new ArrayList<Entry>();     // list is reinitialized
        // start a thread to write tempList to the file
        Thread t = new Thread(new Runnable() {
            public void run() {
                // stuff that will write `tempList` to the file
            }
        });
        t.start(); // start the writer thread. It runs in the background, and the
                   // calling thread (the one that called `addEntry()`) continues adding new entries to the reinitialized list
    } // end of if condition
}
Note: you mentioned heap space - even if you use a thread, the entries still occupy heap.
Executing the work in a thread frees the main thread to do other stuff.
It will not solve your heap space problem.
The heap space problem is caused by the number of entries returned by the query. You could change your query to return only a set number of rows, process those, and then run the query again starting from the last row you processed.
If you are using MS SQL, there is already an answer here on how to split your queries.
Row offset in SQL Server
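For example, a minimal JDBC sketch of that paging pattern, assuming SQL Server 2012+ syntax, a hypothetical `entries` table with a stable ordering key `id`, and a hypothetical `writeRowToFile` helper:
// Fetch pageSize rows at a time and append each page to the file
// before fetching the next one.
static void exportInPages(Connection conn, int pageSize) throws SQLException {
    int offset = 0;
    while (true) {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT * FROM entries ORDER BY id "
                + "OFFSET ? ROWS FETCH NEXT ? ROWS ONLY")) {
            ps.setInt(1, offset);
            ps.setInt(2, pageSize);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) return; // no more rows, done
                do {
                    writeRowToFile(rs); // hypothetical helper
                } while (rs.next());
            }
        }
        offset += pageSize;
    }
}
This keeps at most one page of entries on the heap at any time, at the cost of running the query several times.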
You don't need to fetch all 20000 entries before writing them to the file, unless they have some dependencies on each other.
In the simplest case you can write the entries directly to the file as you're fetching them, making large amounts of heap unnecessary.
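A minimal sketch of that simple case, streaming each row straight from the ResultSet to a BufferedWriter so only one row is in memory at a time (the query, column name, and file name are assumptions):
try (Statement st = conn.createStatement();
     ResultSet rs = st.executeQuery("SELECT data FROM entries");
     BufferedWriter out = new BufferedWriter(new FileWriter("out.txt"))) {
    while (rs.next()) {
        out.write(rs.getString("data")); // write each row as it is fetched
        out.newLine();
    }
}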
An advanced version of that is the producer-consumer pattern, which you can then adjust to get different speed/memory use characteristics.
I created a worker thread which processes entries in the background, starting this thread before fetching entries and stopping it when finished fetching all entries:
public class WriteToOutputFile implements Runnable {
    // Sentinel ("poison pill") that tells the worker to stop once every
    // entry queued before it has been written. Assumes Entry has a
    // no-arg constructor.
    static final Entry LAST_ENTRY = new Entry();

    private final BlockingQueue<Entry> queue;
    private final File file;

    WriteToOutputFile(BlockingQueue<Entry> queue, File file) {
        this.queue = queue;
        this.file = file;
    }

    @Override
    public void run() {
        try {
            while (true) {
                Entry entry = queue.take();     // blocks until an entry arrives
                if (entry == LAST_ENTRY) break; // all preceding entries written
                // logic to write the entry to `file`
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public void stop() throws InterruptedException {
        queue.put(LAST_ENTRY); // processed after all pending entries
    }
}
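A minimal usage sketch of the class above (the `Entry` type and the result-set iterator are placeholders): start the worker before fetching, feed it entries as they arrive, then stop it via the sentinel.
static void export(Iterator<Entry> results) throws InterruptedException {
    BlockingQueue<Entry> queue = new LinkedBlockingQueue<Entry>();
    WriteToOutputFile writer = new WriteToOutputFile(queue, new File("out.txt"));
    Thread worker = new Thread(writer);
    worker.start();

    while (results.hasNext()) {
        queue.put(results.next()); // hand each fetched entry to the worker
    }

    writer.stop(); // enqueues the sentinel entry
    worker.join(); // wait until everything has been written
}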
I'm trying to use wholeTextFiles to read all the file names in a folder and process them one by one separately (for example, I'm trying to get the SVD vector of each data set, and there are 100 sets in total). The data are saved in .txt files split by spaces and arranged in different lines (like a matrix).
The problem I came across is that after I use wholeTextFiles("path with all the text files"), it's really difficult to read and parse the data, and I just can't use the method I used when reading only one file. That method works fine when I read just one file, and it gives me the correct output. Could someone please let me know how to fix it here? Thanks!
public static void main (String[] args) {
SparkConf sparkConf = new SparkConf().setAppName("whole text files").setMaster("local[2]").set("spark.executor.memory","1g");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
JavaPairRDD<String, String> fileNameContentsRDD = jsc.wholeTextFiles("/Users/peng/FMRITest/regionOutput/");
JavaRDD<String[]> lineCounts = fileNameContentsRDD.map(new Function<Tuple2<String, String>, String[]>() {
@Override
public String[] call(Tuple2<String, String> fileNameContent) throws Exception {
String content = fileNameContent._2();
String[] sarray = content.split(" ");
double[] values = new double[sarray.length];
for (int i = 0; i < sarray.length; i++){
values[i] = Double.parseDouble(sarray[i]);
}
pd.cache();
RowMatrix mat = new RowMatrix(pd.rdd());
SingularValueDecomposition<RowMatrix, Matrix> svd = mat.computeSVD(84, true, 1.0E-9d);
Vector s = svd.s();
}});
Quoting the scaladoc of SparkContext.wholeTextFiles:
wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
In other words, wholeTextFiles might not simply be what you want.
Since by design "Small files are preferred" (see the scaladoc), you could mapPartitions or collect (with filter) to grab a subset of the files to apply the parsing to.
Once you have the files per partitions in your hands, you could use Scala's Parallel Collection API and schedule Spark jobs to execute in parallel:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
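As a starting point, here is a minimal sketch of the per-file parsing, under the assumption that each file holds a space-separated matrix with one row per line (the rank 84 and the mllib classes are taken from the question; collect() is only reasonable here because wholeTextFiles is meant for small files):
// Pull the (path, content) pairs to the driver, then parse each file's
// single content string line by line into a local RowMatrix.
List<Tuple2<String, String>> files = fileNameContentsRDD.collect();
for (Tuple2<String, String> file : files) {
    List<Vector> rows = new ArrayList<Vector>();
    for (String line : file._2().split("\n")) { // one matrix row per line
        String[] tokens = line.trim().split(" ");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i]);
        }
        rows.add(Vectors.dense(values));
    }
    RowMatrix mat = new RowMatrix(jsc.parallelize(rows).rdd());
    SingularValueDecomposition<RowMatrix, Matrix> svd =
            mat.computeSVD(84, true, 1.0E-9d);
    Vector s = svd.s(); // the singular values for this file
}
The key difference from the single-file version is that each file arrives as one big string, so you have to split it into lines yourself before splitting on spaces.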
This is an SAP PI requirement.
Source System: XY_Client
Middleware: PI System
Target System: SAP
XML files are received by the PI system; for each XML file, an internal file is generated to keep track of the store_number and the count of XML files.
How it works: suppose XML_FILE_1 reaches PI; an internal file called sequence_gen is created. The file contains the store number present in the XML file, and the count is initialized to 1.
So the first time,
the sequence_gen file contains Store: 1001 Count: 1
(After some time interval) if XML_FILE_2 reaches PI, the second time,
the sequence_gen file contains Store: 1001 Count: 2
and so on.
My question is: if 'n' files arrive at the PI system at the same time, the 1st file will lock the sequence_gen file. So how will the 2nd file update its value in the sequence_gen file? How do I tackle this problem?
I thought of creating a thread instance for every call, storing it in a database, retrieving each instance, performing the function, returning the result to the XML call, and deleting that instance. Is this possible? How do I go forward with this?
Rather than keeping track of all of the threads that lock and unlock the file, you could have a single thread that is in charge of changing it. Have each thread place a change request into a concurrent queue, which then notifies the Sequence_Gen thread to write to its own file. In essence:
Sequence_Gen thread:
@Override
public synchronized void run() {
    while (true) { // or some termination condition
        while (queue.isEmpty()) {
            try {
                this.wait(); // releases the lock so producers can add items
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
        Object obj = queue.poll();
        // open the file, write obj, close the file
    }
}
Then, in any other thread, just queue the item and notify that there is something to write. Note that addItem must be synchronized on the same Sequence_Gen object whose monitor run() waits on:
public synchronized void addItem(Object item) {
    queue.add(item);
    this.notifyAll();
}
I am not sure if I can put my question in the clearest fashion, but I will try my best.
Let's say I am retrieving some information from a third-party API. The retrieved information will be huge in size. For a performance gain, instead of retrieving all the info in one go, I will be retrieving it in a paged fashion (the API gives me that facility, basically an iterator). The return type is basically a list of objects.
My aim here is to process the information I have in hand (which includes comparing, storing in the DB, and many other operations) while I get the paged responses to the request.
My question to the expert community is: what data structure would you prefer in such a case? And does a framework like Spring Batch help in getting performance gains in such cases?
I know the question is a bit vague, but I am looking for general ideas, tips, and pointers.
In these cases, the data structure for me is java.util.concurrent.CompletionService.
For purposes of example, I'm going to assume a couple of additional constraints:
You want only one outstanding request to the remote server at a time
You want to process the results in order.
Here goes:
// a class that knows how to update the DB given a page of results
class DatabaseUpdater implements Callable<Object> { ... }
// a background thread to do the work
final CompletionService<Object> exec = new ExecutorCompletionService<Object>(
Executors.newSingleThreadExecutor());
// first call
List<Object> results = ThirdPartyAPI.getPage( ... );
// Start loading those results to DB on background thread
exec.submit(new DatabaseUpdater(results));
while( you need to ) {
// Another call to remote service
results = ThirdPartyAPI.getPage( ... );
// wait for existing work to complete
exec.take();
// send more work to background thread
exec.submit(new DatabaseUpdater(results));
}
// wait for the last task to complete
exec.take();
This is just a simple two-thread design. The first thread is responsible for getting data from the remote service and the second is responsible for writing to the database.
Any exceptions thrown by DatabaseUpdater will be propagated to the main thread when the corresponding Future is retrieved via exec.take() and its get() method is called.
Good luck.
In terms of doing the actual parallelism, one very useful construct in Java is the ThreadPoolExecutor. A rough sketch of what that might look like is this:
public class YourApp {
static class Processor implements Runnable {
Widget toProcess;
public Processor(Widget toProcess) {
this.toProcess = toProcess;
}
public void run() {
// commit the Widget to the DB, etc
}
}
public static void main(String[] args) {
ThreadPoolExecutor executor =
new ThreadPoolExecutor(1, 10, 30,
TimeUnit.SECONDS,
new LinkedBlockingDeque<Runnable>());
while(thereAreStillWidgets()) {
ArrayList<Widget> widgets = doExpensiveDatabaseCall();
for(Widget widget : widgets) {
Processor processor = new Processor(widget);
executor.execute(processor);
}
}
}
}
But as I said in a comment: calls to an external API are expensive. It's very likely that the best strategy is to pull all the Widget objects down from the API in one call, and then process them in parallel once you've got them. Doing more API calls gives you the overhead of sending the data all the way from the server to you, every time -- it's probably best to pay that cost the fewest number of times that you can.
Also, keep in mind that if you're doing DB operations, it's possible that your DB doesn't allow for parallel writes, so you might get a slowdown there.
This is the first time I've encountered something like the following.
Multiple threads (inner classes implementing Runnable) share a data structure (an instance variable of the outer class).
Working: took the classes from the Eclipse project's bin folder and ran them on a Unix machine.
NOT WORKING: compiled the src directly on the Unix machine and used those class files. The code compiles and then runs with no errors/warnings, but one thread is not able to access the shared resource properly.
PROBLEM: One thread adds elements to the common DS above. The second thread does the following...
while(true){
if(myArrayList.size() > 0){
//do stuff
}
}
The log shows that the size is updated in Thread 1.
For some mystic reason, the workflow is not entering the if() block...
The same exact code runs perfectly if I directly paste the class files from Eclipse's bin folder.
I apologize if I missed anything obvious.
Code:
ArrayList<CSRequest> newCSRequests = new ArrayList<CSRequest>();
//Thread 1
private class ListeningSocketThread implements Runnable {
ServerSocket listeningSocket;
public void run() {
try {
LogUtil.log("Initiating...");
init(); // creates socket
processIncomingMessages();
listeningSocket.close();
} catch (IOException e) {
e.printStackTrace();
}
}
private void processIncomingMessages() throws IOException {
while (true) {
try {
processMessage(listeningSocket.accept());
} catch (ClassNotFoundException e) {
e.printStackTrace();
}
}
}
private void processMessage(Socket s) throws IOException, ClassNotFoundException {
// read message
ObjectInputStream ois = new ObjectInputStream(s.getInputStream());
Object message = ois.readObject();
LogUtil.log("adding...: before size: " + newCSRequests.size());
synchronized (newCSRequests) {
newCSRequests.add((CSRequest) message);
}
LogUtil.log("adding...: after size: " + newCSRequests.size()); // YES, THE SIZE IS UPDATED TO > 0
//closing....
}
........
}
//Thread 2
private class CSRequestResponder implements Runnable {
public void run() {
LogUtil.log("Initiating..."); // REACHES..
while (true) {
// LogUtil.log("inside while..."); // IF NOT COMMENTED, FLOODS THE CONSOLE WITH THIS MSG...
if (newCSRequests.size() > 0) { // DOES NOT PASS
LogUtil.log("inside if size > 0..."); // NEVER REACHES....
try {
handleNewCSRequests();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
....
}
UPDATE
The solution was to add synchronized(myArrayList) before checking the size in Thread 2.
To access a shared structure in a multi-threaded environment, you should use implicit or explicit locking to ensure safe publication and access among threads.
Using the code above, it should look like this:
while(true){
synchronized (myArrayList) {
if(myArrayList.size() > 0){
//do stuff
}
}
//sleep(...) // outside the lock!
}
Note: This pattern looks much like a producer-consumer and is better implemented using a queue. LinkedBlockingQueue is a good option for that and provides built-in concurrency control capabilities. It's a good structure for safe publishing of data among threads.
Using a concurrent data structure lets you get rid of the synchronized block:
BlockingQueue<Data> queue = new LinkedBlockingQueue<Data>();
...
while(true){
    Data data = queue.take(); // this will wait until there's data in the queue
    doStuff(data);
}
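The producer side needs no locking either; for example (a sketch, with `Data` as a placeholder type):
// The producing thread just puts items on the queue:
void publish(Data item) throws InterruptedException {
    queue.put(item); // wakes up the consumer blocked in take()
}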
Every time you modify a given shared variable inside a parallel region (a region with multiple threads running in parallel), you must ensure mutual exclusion. You can guarantee mutual exclusion in Java by using synchronized or locks; normally you use locks when you want finer-grained synchronization.
If the program only performs reads on a given shared variable, there is no need to synchronize/lock access to it.
Since you are new to this subject, I recommend this tutorial.
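For illustration, a minimal sketch of both mechanisms guarding a shared list (the names are hypothetical; in real code you would pick one mechanism for a given piece of data, not mix both):
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

class SharedList {
    private final List<String> items = new ArrayList<String>();
    private final ReentrantLock lock = new ReentrantLock();

    // Implicit locking: the object's own monitor guards the list.
    synchronized void addImplicit(String item) {
        items.add(item);
    }

    // Explicit locking: same mutual exclusion, but finer control
    // (tryLock with timeout, fairness, multiple Conditions, ...).
    void addExplicit(String item) {
        lock.lock();
        try {
            items.add(item);
        } finally {
            lock.unlock();
        }
    }
}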
If I got this right... there are at least 2 threads that work with the same shared data structure, the array you mentioned. One thread adds values to the array, and the second thread "does stuff" if the size of the array is > 0.
There is a chance that the thread scheduler ran the second thread (the one that checks whether the collection size is > 0) before the first thread got a chance to run and add a value.
Running the classes from bin or recompiling them has nothing to do with it. If you ran the application again from the bin directory, you might see the issue again. How many times did you run the app?
It might not reproduce consistently, but at some point you might see the issue again.
You could access the data structure in a serial fashion, allowing only one thread at a time to access the array. Still, that does not guarantee that the first thread will run first and only then the second one will check whether the size is > 0.
Depending on what you need to accomplish, there might be better/other ways to achieve it, not necessarily using an array to coordinate the threads.
Check the return value of
newCSRequests.add((CSRequest) message);
I am guessing it's possible that it didn't get added for some reason. If it were a HashSet or similar, it could be because the hash codes of multiple objects return the same value. What is the equals implementation of the message object?
You could also use
List list = Collections.synchronizedList(new ArrayList(...));
to ensure the ArrayList is always synchronized correctly, though note that iteration over a synchronized list still needs manual locking.
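A short sketch of that caveat (documented in the Collections.synchronizedList javadoc); the `process` call is a placeholder:
List<CSRequest> requests =
        Collections.synchronizedList(new ArrayList<CSRequest>());

requests.add(request); // individual calls are thread-safe by themselves

synchronized (requests) { // but iteration must hold the list's own lock
    for (CSRequest r : requests) {
        process(r); // placeholder
    }
}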
HTH
I have a stateful EJB which calls a stateless EJB method that parses Web pages.
Here is my stateful bean code:
@Override
public void parse() {
while(true) {
if(false == _activeMode) {
break;
}
for(String url : _urls){
if(false == _activeMode) {
break;
}
for(String prioritaryUrl : _prioritaryUrls) {
if(false == _activeMode)
break;
boursoramaStateless.parseUrl(prioritaryUrl);
}
boursoramaStateless.parseUrl(url);
}
}
}
No problem here.
I have some asynchronous calls (via JMS) that add values to my _urls variable (a List). The goal is to parse new URLs inside my infinite loop.
I get a ConcurrentModificationException when I try to add a new URL to my List via the JMS onMessage method, but it seems to work anyway, because the new URL does get parsed.
When I wrap the loop body in a synchronized block:
while(true){
synchronized(_url){
// code...
}
}
My new URL is never parsed; I expected it to be parsed after the for() loop finished...
So my question is: how can I modify the List while it is being iterated in a loop, without getting a ConcurrentModificationException?
I just want 2 threads to modify a shared resource at the same time, without a synchronized block...
You may want a CopyOnWriteArrayList.
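A minimal sketch of the swap, reusing the names from the question: iteration walks a snapshot of the list, so a concurrent add() from onMessage never throws ConcurrentModificationException; the new element is picked up on the next pass of the outer while(true) loop.
List<String> _urls = new CopyOnWriteArrayList<String>();

// Parsing thread: the for-each iterates over a snapshot of the list.
for (String url : _urls) {
    boursoramaStateless.parseUrl(url);
}

// JMS thread (onMessage): safe to call while the loop above is running.
_urls.add(((TextMessage) message).getText());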
for (String s : urls) uses an Iterator internally. The iterator checks for concurrent modification so that its behavior is well defined.
You can use a for(int i = 0; ...) loop instead. This way, no exception is thrown, and if elements are only added to the end of the List, you still get a consistent snapshot (the list as it existed at some point during the iteration). If elements in the list are moved around, you may see missing entries.
If you want to use synchronized, you need to synchronize on both ends, but that way you lose concurrent reads.
If you want concurrent access AND consistent snapshots, you can use any of the collections in the java.util.concurrent package.
CopyOnWriteArrayList has already been mentioned. Other interesting options are LinkedBlockingQueue and ArrayBlockingQueue (Collections, but not Lists), but that's about all.
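A sketch of the indexed-loop variant mentioned above, assuming elements are only ever appended at the end of the list:
// No Iterator is involved, so no ConcurrentModificationException is possible.
// size() is re-read on every pass, so entries appended during the loop are
// picked up too (best effort, since there is no synchronization).
for (int i = 0; i < _urls.size(); i++) {
    String url = _urls.get(i);
    boursoramaStateless.parseUrl(url);
}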
OK, thank you guys.
So I made some modifications.
1) Added an iterator and kept the synchronized blocks (inside the parse() function and around the addUrl() function, which adds new URLs to my List)
--> it works like a charm; no ConcurrentModificationException is thrown
2) Added an iterator and removed the synchronized blocks
--> ConcurrentModificationException is still thrown...
For now, I will read more about your answers and test your solutions.
Thank you again, guys.
First, forget about synchronized when running inside a Java EE container. It interferes with the container's management of threads and will not work in a clustered environment.
Second, it seems that your design is wrong. You should not update a private field of the bean using JMS; this is what causes the ConcurrentModificationException. You should probably modify your bean to retrieve the collection from a database, and your MDB to store the URLs in the database.
Another, easier solution is the following.
Retrieve the currently existing URLs and copy them to another collection, then iterate over the copy. When the global collection is updated via JMS, the update is not visible in the copied collection, so no exception is thrown:
while(true) {
for (String url : copyUrls(_prioritaryUrls)) {
// deal with url
}
}
private List<String> copyUrls(List<String> urls) {
    return new ArrayList<String>(urls); // this creates a copy of the source list
}
//........
public void onMessage(Message message) {
_prioritaryUrls.add(((TextMessage)message).getText());
}