I'm chasing some memory issues in an app that pulls file names from a Kafka queue and does some processing on each. This app runs in Docker with one instance per partition.
Each instance has a single consumer handle that retrieves the next file name and puts it into an ArrayBlockingQueue. Meanwhile, several threads take the next file from this queue and do the processing. I'm using this secondary queuing because each file can take some time to copy and process (there are instances of "exponential backoff", i.e. a thread may be sleeping), so it seemed prudent to have several files 'in the pipeline' simultaneously.
My question is about the relative benefits (with respect to memory management) of doing it this way (several 'permanent' threads reading from a shared queue) versus launching a new task for each file as it gets pulled from the queue. In this alternative I imagine a FixedThreadPool to which a new task is submitted as each file is pulled from Kafka.
Is there any advantage to one method vs the other?
Edit:
My primary concern is minimizing GC time. I want to avoid having anything substantial promoted to old-gen. This makes me think that the second model is the better way to go.
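For reference, here is a minimal sketch of the two models I'm comparing (the BlockingQueue of file names stands in for the Kafka consumer handle; class and method names are made up for illustration):

import java.util.concurrent.*;

public class PipelineSketch {
    // Model 1: a bounded hand-off queue plus a fixed set of long-lived worker threads.
    static void fixedWorkers(BlockingQueue<String> fileNames, int workers) {
        ArrayBlockingQueue<String> pipeline = new ArrayBlockingQueue<>(10); // bounded buffer
        // single "consumer handle" thread feeding the pipeline
        new Thread(() -> {
            try {
                while (true) pipeline.put(fileNames.take()); // blocks when the pipeline is full
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }).start();
        for (int i = 0; i < workers; i++) {
            new Thread(() -> {
                try {
                    while (true) process(pipeline.take()); // long-lived worker
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            }).start();
        }
    }

    // Model 2: submit one task per file to a fixed-size pool; the pool reuses its threads.
    static void taskPerFile(BlockingQueue<String> fileNames, int workers) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        while (true) {
            String file = fileNames.take();
            pool.submit(() -> process(file)); // short-lived task object per file
        }
    }

    static void process(String fileName) { /* copy and process the file */ }
}

Note that Executors.newFixedThreadPool backs the pool with an unbounded internal queue, so the explicit ArrayBlockingQueue bound disappears in the second model.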
Related
I need to process many PDF files. I have a list of files (files that are in some folder or zip file). I want a subtask per PDF, and then a subtask per page so each page can be processed.
I was thinking of using a fork/join pool, but that just keeps creating more subtasks to read more files, and I run out of memory.
Sometimes I get many small files, sometimes I get large files with many pages. It makes no sense loading more documents when there are already many pages queued up to be processed.
1. Each PDF file from a folder is read and a subtask (2) is created, forked, and joined.
2. For each page, a subtask (3) is created, forked, and joined.
3. Process this page.
There's ForkJoinTask.helpQuiesce(), which might be good enough in some situations. I can just call ForkJoinTask.helpQuiesce() after creating some subtasks. This way the subtasks are more likely to be processed before more data is loaded.
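A minimal sketch of that idea, assuming a RecursiveAction per file that forks one page task per page and then calls helpQuiesce() before returning (class names and the loadDocument helper are illustrative):

import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.RecursiveAction;

class FileTask extends RecursiveAction {
    private final String fileName;
    FileTask(String fileName) { this.fileName = fileName; }

    @Override
    protected void compute() {
        int pageCount = loadDocument(fileName); // loads the PDF and returns its page count
        for (int page = 0; page < pageCount; page++) {
            new PageTask(fileName, page).fork();
        }
        // Help run queued page tasks before this worker goes on to load another file.
        ForkJoinTask.helpQuiesce();
    }

    private int loadDocument(String fileName) { return 0; /* open the PDF here */ }
}

class PageTask extends RecursiveAction {
    private final String fileName;
    private final int page;
    PageTask(String fileName, int page) { this.fileName = fileName; this.page = page; }

    @Override
    protected void compute() { /* process one page */ }
}

This keeps the current worker busy with queued page tasks before it returns to load another file, though it does not stop other workers from picking up new file tasks in the meantime.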
But I can't find anything to set the priority of a subtask. Wouldn't that be a lot easier? If I understand the documentation correctly, there is one submission queue and then one task queue per worker thread. Is there no way to control which tasks from the submission queue are processed first? I can pass a factory for the worker threads, but not for the submission queue.
Like in the divide-and-conquer metaphor: It might make more sense to plunder all cities before you invade a new country or even a new continent, so you get enough resources needed for those tasks. But how is this controlled?
I know Fork/Join uses work stealing and you usually don't have to bother. But I need to build a batch processing tool, and I can't have it load gigabytes of data into memory before it even begins processing any of the pages. And I don't need a framework like Hadoop for a bunch of PDF files; that would be overkill.
I could use a PriorityQueue<E>, but that seems to be a lot more work as this is only a simple data structure, while Fork/Join is a framework.
Is there no way of controlling the order in which tasks are processed? What am I missing? Is there some other priority queue based solution available in Java?
I designed a Java application. A friend suggested using multi-threading; he claims that running my application as several threads will decrease the run time significantly.
In my main class, I carry out several operations (out of scope here) to fill global static variables and hash maps that are used across the whole lifetime of the process. Then I run the core of the application on the entries of an array list.
for (int customerID : customers) {
    ConsumerPrinter consumerPrinter = new ConsumerPrinter();
    consumerPrinter.runPE(docsPath, outputPath, customerID);
    System.out.println("Customer with CustomerID:" + customerID + " Done");
}
For each iteration of this loop, the given customer's XMLs are fetched from the machine, parsed, and calculations are performed on the parsed data. Later, the processed results are written to a text file (fetched and written data can reach several gigabytes at most, and about 50 MB on average). More than one iteration can write to the same file.
Should I make this piece of code multi-threaded so each group of customers are taken in an independent thread?
How can I know the most optimal number of threads to run?
What are the best practices to take into consideration when implementing multi-threading?
Should I make this piece of code multi-threaded so each group of customers are taken in an independent thread?
Yes, multi-threading will save processing time. While iterating over your list you can spawn a new thread for each iteration and do the customer processing in it. But you need to do proper synchronization: if processing two customers requires operating on the same resource, you must synchronize that operation to avoid possible race conditions or memory inconsistency issues.
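For illustration, a minimal sketch of this using a fixed-size thread pool instead of raw threads (ConsumerPrinter and runPE come from the question; the pool size and the wait timeout are just examples):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

void processAll(List<Integer> customers, String docsPath, String outputPath) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(4); // e.g. roughly one thread per core
    for (int customerID : customers) {
        pool.submit(() -> {
            // each task must touch shared resources (output files, maps) in a thread-safe way
            ConsumerPrinter consumerPrinter = new ConsumerPrinter();
            consumerPrinter.runPE(docsPath, outputPath, customerID);
            System.out.println("Customer with CustomerID:" + customerID + " Done");
        });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS); // wait for all customers to finish
}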
How can I know the most optimal number of threads to run?
You cannot really know without actually measuring the processing time for n customers with different numbers of threads. It will depend on the number of cores your processor has and on what processing actually takes place for each customer.
What are the best practices to take into consideration when implementing multi-threading?
The first and foremost criterion is that you must have multiple cores and your OS must support multi-threading. Almost every system does nowadays, but it is still worth checking. Secondly, you must analyze all the possible scenarios that may lead to race conditions. Every resource that you know will be shared among multiple threads must be thread-safe. You must also look out for possible memory inconsistency issues (declare such variables as volatile). Finally, there are things you cannot predict or analyze until you actually run test cases, such as deadlocks (analyze a thread dump) or memory leaks (analyze a heap dump).
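As a small illustration of the shared-resource point, here is a sketch of two common ways to make a shared counter thread-safe (the counter itself is a made-up example):

import java.util.concurrent.atomic.AtomicInteger;

class SharedCounters {
    // Option 1: guard the shared variable with synchronized methods.
    private int processed = 0;
    synchronized void incrementProcessed() { processed++; }
    synchronized int getProcessed() { return processed; }

    // Option 2: use a thread-safe class from java.util.concurrent.
    private final AtomicInteger processedAtomic = new AtomicInteger();
    void incrementAtomic() { processedAtomic.incrementAndGet(); }
    int getAtomic() { return processedAtomic.get(); }
}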
The idea of multi-threading is to move some heavy processing into another, let's say, "block of memory".
Any UI updates have to be done on the main/default thread, like printing messages or inflating a view, for example. You can ask the app to draw a bitmap, download images from the internet, or run a heavy validation/loop block on a separate thread; imagine that you are creating a second short-lived app to handle those tasks for you.
Remember, you can ask the app to download/draw an image on another thread, but you have to display that image on the screen on the main thread.
This is commonly used to load a large bitmap on a separate thread, do the math to resize the large image, and then, on the main thread, inflate/print/paint/show the smaller version of that image to the user.
In your case, I don't know how heavy the runPE() method is or what it does; you could try to create another thread for it, but the rest should stay on the main thread, since that is the main process of your UI.
You could also optimize your loop by placing "ConsumerPrinter consumerPrinter = new ConsumerPrinter();" before the "for(...)"; since it does not change dynamically, you can remove it from inside the loop to avoid creating the same object on every iteration :)
While straight java multi-threading can be used (java.util.concurrent) as other answers have discussed, consider also alternate programming approaches to multi-threading, such as the actor model. The actor model still uses threads underneath, but much complexity is handled by the actor framework rather than directly by you the programmer. In addition, there is less (or no) need to reason about synchronizing on shared state between threads because of the way programs using the actor model are created.
See Which Actor model library/framework for Java? for a discussion of popular actor model libraries.
In our multithreaded Java app, we are using a separate LinkedBlockingDeque instance for each consumer thread; assume the threads are c1, c2, ..., c200.
Threads T1 & T2 receive data from a socket and add the object to the specific consumer's queue (one of c1 to c200).
Each consumer has an infinite loop inside its run(), which calls LinkedBlockingDeque.take().
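A minimal sketch of that consumer loop (the Request class and the handle() method are placeholders, not our actual code):

import java.util.concurrent.LinkedBlockingDeque;

class Consumer implements Runnable {
    private final LinkedBlockingDeque<Request> queue = new LinkedBlockingDeque<>();

    // Producers T1/T2 call this when data arrives on the socket.
    void enqueue(Request request) { queue.add(request); }

    @Override
    public void run() {
        try {
            while (true) {
                Request request = queue.take(); // blocks until a request is available
                handle(request);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void handle(Request request) { /* per-consumer processing */ }
}

class Request { /* payload received from the socket */ }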
Under load, the CPU usage for the Java process itself is 40%. When we add up the other processes in the system, the overall CPU usage reaches 90%.
According to Java VisualVM, run() is taking the most CPU, and we suspect LinkedBlockingDeque.take().
We tried alternatives like wait/notify and Thread.sleep(0), but there was no change.
The reasons why each consumer has a separate queue:
1. There might be more than one request for consumer c1 from T1 or T2.
2. If we dump all requests into a single queue, the search time for c1 to c200 will be higher and the search criteria will grow.
3. It lets each consumer have its own queue to process its own requests.
We are trying to reduce the CPU usage and need your input...
SD
Do profiling and make sure that the queue methods really take a relatively large share of CPU time. Is your message processing so simple that it is comparable to putting/taking to/from the queue?
How many messages are processed per second? How many CPUs are there? If each CPU is processing less than 100K messages per second, then it's likely that the reason is not the access to the queues, but message handling itself.
Putting into a LinkedBlockingDeque creates an instance of a helper node object, and I suspect each new message is also allocated from the heap, so that is two allocations per message. Try to use a pool of preallocated messages and circular buffers (a minimal sketch of this appears below).
200 threads is way too many; it means too many context switches. Try to use actor libraries and thread pools, for example https://github.com/rfqu/df4j (yes, it's mine).
Check whether http://code.google.com/p/disruptor/ would fit your needs.
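To illustrate the pooling suggestion above, here is a minimal sketch of a pool of preallocated messages backed by an ArrayBlockingQueue (the Message class and its fields are only illustrative):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class MessagePool {
    private final BlockingQueue<Message> free;

    MessagePool(int capacity) {
        free = new ArrayBlockingQueue<>(capacity);
        for (int i = 0; i < capacity; i++) {
            free.add(new Message()); // preallocate all messages up front
        }
    }

    // A producer borrows a message instead of allocating a new one.
    Message acquire() throws InterruptedException { return free.take(); }

    // A consumer returns the message once it has been processed.
    void release(Message message) { message.clear(); free.add(message); }
}

class Message {
    byte[] payload = new byte[1024]; // reused buffer
    int length;
    void clear() { length = 0; }
}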
I have a multi-threaded application that fetches different web pages. For this, I've devised a parent-child relationship between the threads.
The parent simply takes different URLs from its page and spawns new threads. These threads keep continuously fetching the pages until the page changes. The main thread polls the main page for changes every 2 minutes (and creates new threads if there is any change).
Main thread algo
while (true) {
    find_new_instances(...);
    if (we get any new) {
        Thread.start(...);
    }
    Thread.sleep(120000);
}
The main thread has a String ArrayList that stores the URL of each new thread it creates. I've heard that threads should only use immutable objects for writes. Is the use of a mutable list here causing problems?
In the child threads, activities such as page fetches and database inserts take place.
However, the application gradually increases its memory requirements and eventually deadlocks/starves into a frozen state, or throws OutOfMemoryError if the number of threads is too large.
I am at a loss as to what to try. If you have experienced similar problems, kindly suggest a fix.
I faced a similar issue while I was developing a GUI-based application.
Reasons for crashing:
1. Create a thread pool and reuse the available threads in your application. You cannot create an unlimited number of threads; that will cause your application to crash.
2. Maybe you are creating new objects (or strings) and holding on to the data. If possible, reuse the same object and just assign it the new value. If the data is large, store it in a file or database rather than holding it in memory all the time.
If I understood your problem right, you can work around the OutOfMemory error by setting references to the old threads to null and performing a gc().
You have a memory leak. I suggest you take a heap dump when you run out of memory and analyse it to see where the leak is.
To trigger a heap dump automatically, you can use the option
-XX:+HeapDumpOnOutOfMemoryError
and perhaps
-XX:HeapDumpPath=/path/to/heap/dumps
If you want a pool of worker threads, I suggest you use an ExecutorService, or even a ScheduledExecutorService to perform a task at a regular interval. (However, this is unlikely to be your problem.)
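For example, here is a minimal sketch of the question's 2-minute polling loop using a ScheduledExecutorService plus a bounded fetcher pool (findNewInstances and fetchUntilChanged are placeholders for the asker's own code):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class PagePoller {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final ExecutorService fetchers = Executors.newFixedThreadPool(10); // bounded pool instead of unbounded threads

    void start() {
        scheduler.scheduleAtFixedRate(() -> {
            List<String> newUrls = findNewInstances(); // poll the main page for new URLs
            for (String url : newUrls) {
                fetchers.submit(() -> fetchUntilChanged(url)); // reuse pooled threads per URL
            }
        }, 0, 2, TimeUnit.MINUTES);
    }

    private List<String> findNewInstances() { return List.of(); /* parse the main page here */ }
    private void fetchUntilChanged(String url) { /* the child-thread work from the question */ }
}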
I was asked a question during an interview today. First they asked how to provide synchronization between threads. Then they asked how to provide synchronization between processes. I told them the variables inside one process cannot be shared with another process, so they asked me to explain how two processes can communicate with each other, how to provide synchronization between them, and where to declare the shared variable. The interview is now finished, but I want to know the answer; can anyone explain? Thank you.
I think the interviewer(s) may not be using the proper terminology. A process runs in its own address space, and as has been mentioned in other answers, you have to use OS-specific mechanisms to communicate between processes. This is called IPC, for Inter-Process Communication.
Using sockets is a common practice, but can be grossly inefficient, depending on your application. But if working with pure Java, this may be the only option since sockets are universally supported.
Shared memory is another technique, but that is OS-specific and requires OS-specific calls. You would have to use something like JNI for a Java application to access shared memory services. Shared memory access is not synchronized, so you will likely have to use semaphores to synchronize access among multiple processes.
Unix-like systems provide multiple IPC mechanisms, and which one to use depends on the nature of your application. Shared memory can be a limited resource, so it may not be the best method. Googling this topic provides numerous hits with useful information on the technical details.
A process is a collection of virtual memory space, code, data, and system resources. A thread is code that is to be serially executed within a process. A processor executes threads, not processes, so each application has at least one process, and a process always has at least one thread of execution, known as the primary thread. A process can have multiple threads in addition to the primary thread. Prior to the introduction of multiple threads of execution, applications were all designed to run on a single thread of execution.
When a thread begins to execute, it continues until it is killed or until it is interrupted by a thread with higher priority (by a user action or the kernel's thread scheduler). Each thread can run separate sections of code, or multiple threads can execute the same section of code. Threads executing the same block of code maintain separate stacks. Each thread in a process shares that process's global variables and resources.
To communicate between two processes, I suppose you can use a ServerSocket and Socket to manage process synchronization. You would bind to a specific port (acquire the lock), and if another process is already bound, you connect to that socket (block) and wait until the server socket is closed.
private static final int KNOWN_PORT = 11000; // arbitrary valid port
private ServerSocket socket;

public void acquireProcessLock() {
    InetAddress localhostInetAddress = ...; // e.g. the loopback address
    try {
        // Binding succeeds only if no other process currently holds the "lock" port.
        socket = new ServerSocket();
        socket.bind(new InetSocketAddress(localhostInetAddress, KNOWN_PORT));
    } catch (IOException failed) {
        try {
            // Another process holds the lock: connect and block until its server socket closes.
            Socket client = new Socket(localhostInetAddress, KNOWN_PORT);
            client.getInputStream().read(); // block
        } catch (IOException ex) {
            acquireProcessLock(); // other process invoked releaseProcessLock(); retry
        }
    }
}

public void releaseProcessLock() throws IOException {
    socket.close();
}
Not sure if this is actually the best means of doing it, but I think it's worth considering.
Synchronization is for threads only; it won't work across processes in Java. There would be no utility in it working across processes, since processes do not share any state that would need to be synchronized. A variable in one process will not have the same data as a variable in the other process.
From a system point of view, a thread is defined by its "state" and its "instruction pointer".
The instruction pointer (eip) contains the address of the next instruction to be executed.
A thread "state" can be : the registers (eax, ebx,etc), the signals, the open files, the code, the stack, the data managed by this thread (variables, arrays, etc) and also the heap.
A process is a group of threads that share a part of their "state": it might be the code, the data, the heap.
Hope i answer your question ;)
EDIT:
Processes can communicate via IPC (inter-process communication). There are three mechanisms: shared memory, message queues, and semaphores. Synchronization between processes can be done with semaphores.
Thread synchronization can be done with mutexes (pthread_mutex_lock, pthread_mutex_unlock, etc.).
Check the Terracotta Cluster or Terracotta DSO Clustering documentation to see how this issue can be solved (bytecode manipulation, maintaining the semantics of the Java Language Specification at the putfield/getfield level, etc.).
The simplest answer is that a process is a program under execution, and a program is nothing but a collection of functions.
A thread is a part of the process, because threads are essentially functions.
In other words, a process may have multiple threads.
The OS always allocates memory to a process, and that memory is distributed among the threads of that process. The OS does not allocate memory to threads directly.
In one sentence: processes are designed to be more independent than threads are.
Their major differences can be described at the memory level. Different processes share nothing with each other, from registers and stack memory to heap memory, which keeps them safe on their own tracks. Threads, however, are normally designed to share a common heap memory, which provides a more closely connected way for multiple threads to work on a computing task and a more efficient use of computation resources.
E.g. if I compute with 3 processes, I have to let each of them finish its job and wait for its results at the system level, and in the meantime their registers and stack memory stay taken up. However, if I do it with 3 threads and thread 2 luckily finishes its job earlier, then because the result it computed has already been stored in the common heap memory, we can simply let it go without waiting for the others to deliver their results, and its released registers and stack memory can be used for other purposes.
Process:
A process is nothing but a program under execution.
Each process has its own memory address space.
Processes are used for heavyweight tasks, i.e. basically the execution of applications.
The cost of communication between processes is high.
Switching from one process to another requires some time for saving and loading registers, memory maps, etc.
A process is an operating system construct.
Threads:
A thread is a lightweight sub-process.
Threads share the same address space.
The cost of communication between threads is low.
Note: at least one process is required for each thread.
I suppose the processes can also communicate through a third party: a file or a database...