How can we do parallel programming in Java? Is there a special framework for it? How do we make it work?
Let me tell you what I need: suppose I have developed a web crawler that crawls a lot of data from the internet. One crawling system alone will not get the job done, so I need more systems working in parallel. If that is the case, can I apply parallel computing? Can you guys give me an example?
If you are asking about pure parallel programming, i.e. not just concurrent programming, then you should definitely try MPJ Express (http://mpj-express.org/). It is a thread-safe implementation of mpiJava, and it supports both distributed and shared memory models. I have tried it and found it very reliable.
 1  import mpi.*;
 2
 3  /**
 4   * Compile: impl specific.
 5   * Execute: impl specific.
 6   */
 7
 8  public class Send {
 9
10      public static void main(String[] args) throws Exception {
11
12          MPI.Init(args);
13
14          int rank = MPI.COMM_WORLD.Rank();   // The current process.
15          int size = MPI.COMM_WORLD.Size();   // Total number of processes.
16          int peer;
17
18          int buffer[] = new int[10];
19          int len = 1;
20          int dataToBeSent = 99;
21          int tag = 100;
22
23          if (rank == 0) {
24
25              buffer[0] = dataToBeSent;
26              peer = 1;
27              MPI.COMM_WORLD.Send(buffer, 0, len, MPI.INT, peer, tag);
28              System.out.println("process <" + rank + "> sent a msg to " +
29                      "process <" + peer + ">");
30
31          } else if (rank == 1) {
32
33              peer = 0;
34              Status status = MPI.COMM_WORLD.Recv(buffer, 0, buffer.length,
35                      MPI.INT, peer, tag);
36              System.out.println("process <" + rank + "> recv'ed a msg\n" +
37                      "\tdata <" + buffer[0] + ">\n" +
38                      "\tsource <" + status.source + ">\n" +
39                      "\ttag <" + status.tag + ">\n" +
40                      "\tcount <" + status.count + ">");
41
42          }
43
44          MPI.Finalize();
45
46      }
47
48  }
One of the most common pieces of functionality provided by messaging libraries like MPJ Express is support for point-to-point communication between executing processes. In this context, two processes belonging to the same communicator (for instance the MPI.COMM_WORLD communicator) may communicate with each other by sending and receiving messages. A variant of the Send() method is used to send the message from the sender process. On the other side, the sent message is received by the receiver process using a variant of the Recv() method. Both sender and receiver specify a tag that is used to match incoming messages at the receiver side.
After initializing the MPJ Express library using the MPI.Init(args) method on line 12, the program obtains its rank and the size of the MPI.COMM_WORLD communicator. Both processes initialize an integer array of length 10 called buffer on line 18. The sender process—rank 0—stores the value of dataToBeSent (99) in the first element of the buffer array. A variant of the Send() method is used to send an element of the buffer array to the receiver process.
The sender process calls the Send() method on line 27. The first three arguments describe the data being sent: the sending buffer—the buffer array—is the first argument, followed by 0 (offset) and 1 (count). The data being sent is of type MPI.INT and the destination is 1 (the peer variable); the datatype and destination are specified as the fourth and fifth arguments to the Send() method. The sixth and last argument is the tag variable. A tag is used to identify messages at the receiver side; a message tag is typically an identifier of a particular message within a specific communicator.
On the other hand the receiver process (rank 1) receives the message using the blocking receive method.
Java supports threads, so you can have a multithreaded Java application. I strongly recommend the book Concurrent Programming in Java: Design Principles and Patterns for that:
http://java.sun.com/docs/books/cp/
You want to look at the Java Parallel Processing Framework (JPPF)
You can have a look at Hadoop and the Hadoop Wiki. It is an Apache framework inspired by Google's MapReduce, and it enables you to do distributed computing using multiple systems. Many companies such as Yahoo and Twitter use it (Sites Powered by Hadoop). Check the Hadoop book for more information on how to use it.
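To give a rough idea of the programming model, here is a minimal word-count mapper sketch against the standard org.apache.hadoop.mapreduce API; the class name and the whitespace tokenisation are illustrative choices, not from any particular tutorial.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative sketch only: emits (word, 1) for every token of an input line.
// A matching Reducer would sum the counts per word across the cluster.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // Hadoop shuffles these pairs to the reducers
            }
        }
    }
}

The point is that you only write the per-record logic; Hadoop handles distributing the input, scheduling the tasks across machines, and collecting the results.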
In Java, parallel processing is done using threads, which are part of the runtime library.
The Concurrency Tutorial should answer a lot of questions on this topic if you're new to Java and parallel programming.
As far as I know, on most operating systems Java's threading mechanism is based on real kernel threads. This is good from the parallel programming perspective. Other languages, like Python, simply time-multiplex the processor (that is, if you run a heavily multithreaded application on a multiprocessor machine, you'll see only one processor busy).
You can easily find something just by googling it: for example, this is the first result for "java threading":
http://download-llnw.oracle.com/javase/tutorial/essential/concurrency/
Basically it boils down to extending the Thread class, overriding the run method with the code that should run on the other thread, and calling the start method on an instance of the class you extended.
Also, if you need to make something thread safe, have a look at synchronized methods.
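As a minimal sketch of that approach (the class name, URLs and shared counter are purely illustrative):

// Sketch: extend Thread, override run(), call start().
public class CrawlerThread extends Thread {

    private final String site;          // hypothetical input, just for illustration
    private static int pagesCrawled;    // shared state, guarded by a synchronized method

    public CrawlerThread(String site) {
        this.site = site;
    }

    @Override
    public void run() {
        // ... crawl the site here ...
        recordPage();
    }

    // synchronized keeps concurrent updates to the shared counter consistent
    private static synchronized void recordPage() {
        pagesCrawled++;
    }

    public static void main(String[] args) throws InterruptedException {
        CrawlerThread t1 = new CrawlerThread("http://example.org/a");
        CrawlerThread t2 = new CrawlerThread("http://example.org/b");
        t1.start();  // start() runs run() on a new thread; calling run() directly would not
        t2.start();
        t1.join();   // wait for both threads to finish
        t2.join();
    }
}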
This is the parallel programming resource I've been pointed to in the past:
http://www.jppf.org/
I have no idea whether it's any good or not, just that someone recommended it a while ago.
I heard about one at a conference a few years ago - ParJava. But I'm not sure about the current status of the project.
Read the section on threads in the Java tutorial: http://download-llnw.oracle.com/javase/tutorial/essential/concurrency/procthread.html
The java.util.concurrent package and Brian Goetz's book "Java Concurrency in Practice".
There are also a lot of resources about parallel patterns here, by Ralph Johnson (one of the GoF design pattern authors):
http://parlab.eecs.berkeley.edu/wiki/patterns/patterns
Is the Ateji PX parallel-for loop what you're looking for?
This will crawl all sites in parallel (notice the double bar next to the for keyword):
for||(Site site : sites) {
crawl(site);
}
If you need to compose the results of crawling, then you'll probably want to use a parallel comprehension, such as :
Set result = set for||{ crawl(site) | Site site : sites }
Further reading here : http://www.ateji.com/px/whitepapers/Ateji%20PX%20for%20Java%20v1.0.pdf
You might want to check out Hadoop. It's designed to have jobs running over an arbitrary number of boxes and takes care of all the bookkeeping for you. It's inspired by Google's MapReduce and their related tools, and it even has its roots in web indexing.
Have you looked at this:
http://www.javacodegeeks.com/2013/02/java-7-forkjoin-framework-example.html?ModPagespeed=noscript
The Fork / Join Framework?
I am also trying to learn a bit about this.
Parallelism
Parallelism means that an application splits its tasks up into smaller subtasks which can be processed in parallel, for instance on multiple CPUs at the exact same time.
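If you want to experiment with the Fork/Join framework mentioned above, here is a rough, self-contained sketch of a RecursiveTask that splits a sum into smaller subtasks; the array, threshold, and class name are just illustrative.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Illustrative Fork/Join sketch: recursively split a sum until chunks are small enough.
public class SumTask extends RecursiveTask<Long> {

    private static final int THRESHOLD = 1_000; // arbitrary cut-off for this example
    private final long[] data;
    private final int from, to;

    public SumTask(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) / 2;
        SumTask left = new SumTask(data, from, mid);
        SumTask right = new SumTask(data, mid, to);
        left.fork();                          // run the left half asynchronously
        return right.compute() + left.join(); // compute the right half, then wait for the left
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        long sum = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(sum);
    }
}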
You can use JCSP (http://www.cs.kent.ac.uk/projects/ofa/jcsp/). The library implements CSP (Communicating Sequential Processes) principles in Java; parallelisation is abstracted away from the thread level, and you deal with processes instead.
Java SE 5 and 6 introduced a set of packages in java.util.concurrent.* which provide powerful concurrency building blocks.
check this for more information.
http://www.oracle.com/technetwork/articles/java/fork-join-422606.html
You might try Parallel Java 2 Library.
On the website Prof. Alan Kaminsky wrote:
Fast forward to 2013, when I started developing PJ2. Parallel computing had expanded far beyond what it was a decade earlier. Multicore parallel computers were equipped with many more CPU cores and much larger main memory, such that computations that used to require a whole cluster now could be done on a single multicore node. New kinds of parallel computing hardware had become commonplace, notably graphics processing unit (GPU) accelerators. Cloud computing services, such as Amazon's EC2, allowed anyone to run parallel programs on a virtual supercomputer with thousands of cores. New application areas for parallel computing had opened up, notably big data analytics. New parallel programming APIs had arisen, such as OpenCL and NVIDIA Corporation's CUDA for GPU parallel programming, and map-reduce frameworks like Apache's Hadoop for big data computing. To explore and take advantage of all these trends, I decided that a completely new Parallel Java 2 Library was needed.
In early 2013 when PJ2 wasn't yet available (although an earlier version was), I tried Java Parallel Processing Framework (JPPF). JPPF was okay but at first glance PJ2 looks interesting.
There is a library called Habanero-Java (HJ), developed at Rice University that was built using lambda expressions and can run on any Java 8 JVM.
HJ-lib integrates a wide range of parallel programming constructs (e.g., async tasks, futures, data-driven tasks, forall, barriers, phasers, transactions, actors) in a single programming model that enables unique combinations of these constructs (e.g., nested combinations of task and actor parallelism).
The HJ runtime is responsible for orchestrating the creation, execution, and termination of HJ tasks, and features both work-sharing and work-stealing schedulers. You can follow the tutorial to set it up on your computer.
Here is a simple HelloWorld example:
import static edu.rice.hj.Module1.*;

public class HelloWorld {
    public static void main(final String[] args) {
        launchHabaneroApp(() -> {
            finish(() -> {
                async(() -> System.out.println("Hello World - 1!"));
                async(() -> System.out.println("Hello World - 2!"));
                async(() -> System.out.println("Hello World - 3!"));
                async(() -> System.out.println("Hello World - 4!"));
            });
        });
    }
}
Each async method runs in parallel with the other async methods, while the content within each of them runs sequentially. The program doesn't continue until all code within the finish block completes.
Short answer with example library
If you are interested in parallel processing using Java, I would recommend you to give a try to Hazelcast Jet.
No more words needed from my side. Just check the website and learn from their examples. It gives you a pretty solid background and a good idea of what it means to process data in parallel.
https://jet.hazelcast.org/
Related
I am reading Java 8 in Action, chapter 11 (about CompletableFutures), and it got me thinking about my company's code base.
The Java 8 in Action book says that if you have code like the one I show below, you will only run 4 CompletableFutures at a time (if you have a 4-core computer). That means that if you want to perform, for example, 10 operations asynchronously, you will first run the first 4 CompletableFutures, then the next 4, and then the 2 remaining ones, because the default ForkJoinPool.commonPool() only provides a number of threads equal to Runtime.getRuntime().availableProcessors().
In my company's code base, there are @Service classes called AsyncHelpers that contain a load() method which uses CompletableFutures to load information about a product asynchronously in separate chunks. I was wondering if they only use 4 threads at a time.
There are several such async helpers in my company's code base; for example, there's one for the product list page (PLP) and one for the product details page (PDP). A product details page is a page dedicated to a specific product, showing its detailed characteristics, cross-sell products, similar products and many more things.
There was an architectural decision to load the details of the PDP page in chunks. The loading is supposed to happen asynchronously, and the current code uses CompletableFutures. Let's look at the pseudocode:
static PdpDto load(String productId) {
CompletableFuture<Details> photoFuture =
CompletableFuture.supplyAsync(() -> loadPhotoDetails(productId));
CompletableFuture<Details> characteristicsFuture =
CompletableFuture.supplyAsync(() -> loadCharacteristics(productId));
CompletableFuture<Details> variations =
CompletableFuture.supplyAsync(() -> loadVariations(productId));
// ... many more futures
try {
return new PdpDto( // construct Dto that will combine all Details objects into one
photoFuture.get(),
characteristicsFuture.get(),
variations.get(),
// .. many more future.get()s
);
} catch (ExecutionException|InterruptedException e) {
return new PdpDto(); // something went wrong, return an empty DTO
}
}
As you can see, the code above uses no custom executors.
Does this mean that if that load method has 10 CompletableFutures and there are currently 2 people loading the PDP page, and we have 20 CompletableFutures to load in total, then all those 20 CompletableFutures won't be executed all at once, but only 4 at a time?
My colleague told me that each user will get 4 threads, but I think the JavaDoc quite clearly states this:
public static ForkJoinPool commonPool()
Returns the common pool instance. This pool is statically constructed; its run state is unaffected by attempts to shutdown() or shutdownNow(). However this pool and any ongoing processing are automatically terminated upon program System.exit(int). Any program that relies on asynchronous task processing to complete before program termination should invoke commonPool().awaitQuiescence, before exit.
Which means that there's only 1 pool with 4 threads for all users of our website.
Yes, but it’s worse than that...
The default size of the common pool is 1 less than the number of processors/cores (or 1 if there’s only 1 processor), so you’re actually processing 3 at a time, not 4.
But your biggest performance hit is with parallel streams (if you use them), because they use the common pool too. Streams are meant to be used for super fast processing, so you don’t want them to share their resources with heavy tasks.
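You can verify the pool size on your own machine with a couple of lines; a quick sketch:

import java.util.concurrent.ForkJoinPool;

// Quick check: the common pool's parallelism is typically availableProcessors() - 1.
public class CommonPoolSize {
    public static void main(String[] args) {
        System.out.println("processors: " + Runtime.getRuntime().availableProcessors());
        System.out.println("common pool parallelism: " + ForkJoinPool.commonPool().getParallelism());
    }
}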
If you have a task that is designed to be async (i.e. take more than a few milliseconds), then you should create a pool to run such tasks in. Such a pool can be statically created and reused by all calling threads, which avoids the overhead of creating a pool per use. You should also tune the pool size by stress testing your code to find the optimum size that maximises throughput and minimises response time.
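A minimal sketch of that idea as it might apply to the load() method from the question; the loader method here is a dummy stand-in for the question's loadPhotoDetails(), and the pool size of 20 is just a placeholder you would tune by stress testing.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: give the PDP loading tasks their own statically created pool instead of
// competing for the small ForkJoinPool.commonPool().
public class PdpLoader {

    // Shared by all callers; the size is a placeholder to be tuned.
    private static final ExecutorService PDP_POOL = Executors.newFixedThreadPool(20);

    // Stand-in for the question's loadPhotoDetails(productId); returns a dummy value.
    private static String loadPhotoDetails(String productId) {
        return "photo details for " + productId;
    }

    public static void main(String[] args) throws Exception {
        // Passing the executor as the second argument keeps supplyAsync off the common pool.
        CompletableFuture<String> photoFuture =
                CompletableFuture.supplyAsync(() -> loadPhotoDetails("sku-123"), PDP_POOL);

        System.out.println(photoFuture.get());
        PDP_POOL.shutdown();
    }
}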
In my company's code base, there are [...] classes [...] that contain a method load(), that uses CompletableFutures to load information [...]
So, are you saying that the load() method waits for I/O to complete?
If so, and if what @Bohemian says is true, then you should not be using the default thread pool.
@Bohemian says that the default pool has approximately the same number of threads as your host has CPUs. That's great if your application has a lot of compute-bound tasks to perform in the background. But it's not so great if your application has a lot of threads that are waiting for replies from different network services. That's a whole different story.
I am not an expert in the subject, and I don't know how (apart from doing experiments) to find out what the best number of threads is, but whatever that number is, it's going to have little to do with how many CPUs your system has, and therefore, you should not be using the default pool for that purpose.
So, I've recently been injected with the Node virus which is spreading in the Programming world very fast.
I am fascinated by its "Non-Blocking IO" approach and have indeed tried out a couple of programs myself.
However, I fail to understand certain concepts at the moment.
I need answers in layman's terms (I'm someone coming from a Java background).
1. Multithreading & Non-Blocking IO.
Let's consider a practical scenario. Say, we have a website where users can register. Below would be the code.
..
..
// Read HTTP Parameters
// Do some Database work
// Do some file work
// Return a confirmation message
..
..
In a traditional programming language, the above happens in a sequential way. And, if there are multiple requests for registration, the web server creates a new thread and the rest is history. Of course, programmers can create threads of their own to work on Line 2 and Line 3 simultaneously.
In Node, as I understand it, lines 2 & 3 will run in parallel while the rest of the program gets executed, and the interpreter polls lines 2 & 3 every 'x' ms.
Now, my question is: if Node is single-threaded, then who does the job of lines 2 & 3 while the rest of the program is being executed?
2. Scalability
I recently read that LinkedIn adopted Node as a back-end for their mobile apps and saw massive improvements.
Can anyone explain how it has made such a difference?
3. Adopting it in other programming languages
If people claim that Node makes a big difference when it comes to performance, why haven't other programming languages adopted this non-blocking IO paradigm?
I'm sure I'm missing something. If you can explain it to me and point me to some links, that would be helpful.
Thanks.
A similar question was asked and probably contains all the info you're looking for: How the single threaded non blocking IO model works in Node.js
But I'll briefly cover your 3 parts:
1.
Lines 2 and 3 in a very simple form could look like:
db.query(..., function(query_data) { ... });
fs.readFile('/path/to/file', function(file_data) { ... });
Now the function(query_data) and function(file_data) are callbacks. The functions db.query and fs.readFile will send the actual I/O requests but the callbacks allow the processing of the data from the database or the file to be delayed until the responses are received. It doesn't really "poll lines 2 and 3". The callbacks are added to an event loop and associated with some file descriptors for their respective I/O events. It then polls the file descriptors to see if they are ready to perform I/O. If they are, it executes the callback functions with the I/O data.
I think the phrase "Everything runs in parallel except your code" sums it up well. For example, something like "Read HTTP parameters" would execute sequentially, but I/O functions like in lines 2 and 3 are associated with callbacks that are added to the event loop and execute later. So basically the whole point is it doesn't have to wait for I/O.
2.
Because of the things explained in 1., Node scales well for I/O intensive requests and allows many users to be connected simultaneously. It is single threaded, so it doesn't necessarily scale well for CPU intensive tasks.
3.
This paradigm has been used with JavaScript because JavaScript has support for callbacks, event loops and closures that make this easy. This isn't necessarily true in other languages.
I might be a little off, but this is the gist of what's happening.
Q1. " what does the job of lines 2 & 3 while the rest of the program is being executed?"
Answer: "Nothing". Lines 2 and 3 each themselves start their respective jobs, but those jobs cannot be done immediately because (for example) the disk sectors required are not loaded in yet - so the operating system issues a call to the disk to go get those sectors, then "Nothing happens" (node goes on with it's next task) until the disk subsystem (later) issues an interrupt to report they're ready, at which point node returns control to lines #2 and #3.
Q2. single-thread non-blocking dedicates almost no resources to each incoming connection (just some housekeeping data about the connected socket). It's very memory efficient. Traditional web servers "fork" a whole new process to handle each new connection - that means making a humongous copy of every bit of code and data variables needed, and time-slicing the CPU to deal with it all. That's massively wasteful of resources. Thus - if your load is a lot of idle connections waiting for stuff, as was theirs, node makes loads more sense.
Q3. Almost every programming language already has non-blocking I/O if you want to use it. Node is not a programming language; it's a web server that runs JavaScript and uses non-blocking I/O (e.g. I personally wrote my own identical thing 10 years ago in Perl, as did Google (in C) when they started, and I'm sure loads of other people have similar web servers too). The non-blocking I/O is not the hard part - getting the programmer to understand how to use it is the tricky bit. JavaScript happens to work well for that, because those programmers are already familiar with event programming.
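Java itself is an example: it has had non-blocking I/O in java.nio for a long time, and callback-style asynchronous I/O since NIO.2 in Java 7. A minimal sketch, conceptually similar to the Node snippets above (the file path is illustrative, and the sleep is just a crude way to keep the example alive until the callback fires):

import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Sketch of callback-style (non-blocking) file I/O in plain Java using NIO.2.
public class AsyncReadExample {
    public static void main(String[] args) throws Exception {
        AsynchronousFileChannel channel = AsynchronousFileChannel.open(
                Paths.get("/tmp/example.txt"), StandardOpenOption.READ); // illustrative path

        ByteBuffer buffer = ByteBuffer.allocate(1024);
        channel.read(buffer, 0, buffer, new CompletionHandler<Integer, ByteBuffer>() {
            @Override
            public void completed(Integer bytesRead, ByteBuffer buf) {
                System.out.println("read " + bytesRead + " bytes"); // runs when the read finishes
            }

            @Override
            public void failed(Throwable exc, ByteBuffer buf) {
                exc.printStackTrace();
            }
        });

        System.out.println("read submitted, main thread keeps going"); // not blocked on the read
        Thread.sleep(1000); // crude wait so the JVM doesn't exit before the callback runs
        channel.close();
    }
}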
Even though node.js has been around for a few years, its performance model is still a bit mysterious.
I recently started a blog and decided that the node.js model would be a good first topic since I wanted to understand it better myself and it would be helpful to others to share what I learned. Here are a couple of articles I wrote that explain the high level concepts and some tradeoffs:
Blocking vs. Non-Blocking I/O – What’s going on?
Understanding node.js Performance
Scenario:
I want to test a communication between 2 devices. They communicate by frames.
I start up the application (on device 1) and I send a number of frames (each frame contains a unique (int) ID). Device 2 receives each frame and either sends an acknowledgement (simply echoing the ID back) or it doesn't (when the frame gets lost).
When device 1 receives the ACK, I want to measure the time it took to send the frame and receive the ACK back.
From looking around SO
How do I measure time elapsed in Java?
System.nanoTime() is probably the best way to measure the elapsed time. However, this all happens in different threads, following the classic producer-consumer pattern, where one thread (on device 1) is always reading and another manages the process (and also writes the frames). Now, thank you for bearing with me; my question is:
Question: I need to convey the unique ID from the ACK frame from the reading thread to the managing thread. I've done some research, and this seems to be a good candidate for a wait/notify system, or not? Or perhaps I just need a shared array that contains data about each frame sent? But then how does the managing thread know it happened?
Context: I want to compare these times because I want to research what factors can hamper communication.
Why don't you just populate a shared map with <unique id, timestamp> pairs? You can expire old entries by periodically removing entries older than a certain age.
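A minimal sketch of that idea with a ConcurrentHashMap (the class and method names are hypothetical): the managing/writing thread records a timestamp when it sends a frame, and the reading thread looks the timestamp up when the ACK arrives.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative sketch of the shared-map approach: no wait/notify is needed just to
// correlate an ACK with its send time.
public class RoundTripTracker {

    private final ConcurrentMap<Integer, Long> sentAt = new ConcurrentHashMap<>();

    // Called by the writing/managing thread when a frame goes out.
    public void frameSent(int frameId) {
        sentAt.put(frameId, System.nanoTime());
    }

    // Called by the reading thread when the ACK with this ID comes back.
    // Returns the round-trip time in nanoseconds, or -1 if the ID is unknown.
    public long ackReceived(int frameId) {
        Long start = sentAt.remove(frameId); // removing also expires the entry
        return start == null ? -1 : System.nanoTime() - start;
    }
}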
I suggest you reformulate your problem in terms of tasks (Callable). Create a task for the writer role and one for the reader role. Submit these in pairs to an ExecutorService and let the Java concurrency framework handle the concurrency for you. You only have to think about what the result of a task will be and how you want to use it.
// Pseudo code
ExecutorService EXC = Executors.newCachedThreadPool();
Future<List<Timestamp>> readerFuture = EXC.submit(new ReaderRole(sentFrameNum));
Future<List<Timestamp>> writerFuture = EXC.submit(new WriterRole(sentFrameNum));
List<Timestamp> writeResult = writerFuture.get(); // wait for the completion of writing
List<Timestamp> readResult = readerFuture.get(); // wait for the completion of reading
This is pretty complex stuff, but it is much cleaner and more stable than a custom-developed synchronization solution.
Here is a pretty good tutorial for the Java concurrency framework: http://www.vogella.com/articles/JavaConcurrency/article.html#threadpools
All,
What is a really simple way of having a program do more than one thing at once, even if the computer does not necessarily have multiple 'cores'? Can I do this by creating more than one Thread?
My goal is to be able to have two computers networked (through Sockets) to respond to each-other's requests, while my program will at the same time be able to be managing a UI. I want the server to potentially handle more than one client at the same time as well.
My understanding is that the communication is done with BufferedReader.readLine() and PrintWriter.println(). My problem is that I want the server to be waiting on multiple readLine() requests, and also be doing other things. How do I handle this?
Many thanks,
Jonathan
Yes, you can do this by having multiple threads inside your Java program.
As the mechanisms in Java get rather complicated when you do this, have a look at the appropriate section in the Java Tutorial:
http://java.sun.com/docs/books/tutorial/essential/concurrency/
Yes, just create multiple threads. They will run concurrently, whether or not the processor has multiple cores. (With a single core, the OS simply suspends the execution of the running thread at certain points and runs another thread for a while, so in effect, multiple ones seem to be running at the same time).
Here's a good concurrency tutorial: http://java.sun.com/docs/books/tutorial/essential/concurrency/
The standard Java tutorial for Sockets is a good start. I wrote the exact program you are describing using this as a base. The last point on the page "Supporting Multiple Clients" describes how threads are implemented.
http://java.sun.com/docs/books/tutorial/networking/sockets/clientServer.html
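The approach in that "Supporting Multiple Clients" section boils down to something like the following sketch (the port number and the echo logic are placeholders): accept connections in a loop and hand each socket off to its own thread, so a blocking readLine() only ever ties up the thread serving that one client.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// Sketch: one thread per client, so a blocking readLine() for one client
// never stops the server from accepting or serving others.
public class MultiClientServer {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(4444)) { // illustrative port
            while (true) {
                Socket client = server.accept();       // blocks until a client connects
                new Thread(() -> handle(client)).start();
            }
        }
    }

    private static void handle(Socket client) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(client.getInputStream()));
             PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
            String line;
            while ((line = in.readLine()) != null) {
                out.println("echo: " + line);          // placeholder for real request handling
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Your UI can keep running on its own thread (for Swing, the event dispatch thread) while these client threads block on I/O.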
Have a look at this page: http://www.ashishmyles.com/tutorials/tcpchat/index.html -- it gives a good description of threads, UI details, etc, and gives a chat example where they merge the two together.
Also, consider using Apache MINA. It's quite lightweight, doesn't rely on any external libraries (apart from slf4j), and makes it very easy to get stuff from sockets without needing to go around in a loop; it's also quite non-blocking (or blocking when you need it to be). So, you have a class which implements IoHandler, and then you register that with an acceptor or some other MINA connection class. Then, it notifies you when packets are received. It handles all the usually-crippling backend stuff for you in a pleasant way (i.e., manually creating multiple threads for clients and then managing them).
It also has codec support, where you can transform sent and received messages. So, say you want to receive Java objects on either end of your connection -- this will do the conversion for you. Perhaps you also want to zip them up to make it more efficient? You can write that too, adding that to the chain below the object codec.
Can I do this by creating more than one Thread?
What is a really simple way of having
a program do more than one thing at
once, even if the computer does not
necessarily have multiple 'cores'. Can
I do this by creating more than one
Thread?
If you have a single core, then officially only one task can be executed at a time. But because your computer's processor is so fast and executes so many instructions per second, it creates the illusion that your computer is doing multiple tasks simultaneously, even though in every small time slice it only executes one task. You can create this illusion in Java by creating threads, which get scheduled by your operating system to run for short periods of time.
My advice is to have a look at the java.util.concurrent package, because it contains a lot of helpful tools that make playing around with threads a lot easier (back in the days when this package did not exist, it was a lot harder). I, for example, like to use
ExecutorService es = Executors.newCachedThreadPool();
to create a thread pool which I can submit tasks to, to run simultaneously. Then when I have a task I would like to run, I call
es.execute(runnable);
where runnable looks like:
Runnable runnable = new Runnable() {
    public void run() {
        // code to run.
    }
};
For example say you run the following code:
/*
* To change this template, choose Tools | Templates
* and open the template in the editor.
*/
package mytests;
import java.util.Date;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
/**
 *
 * @author alfred
 */
public class Main {

    /**
     * @param args the command line arguments
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        // TODO code application logic here
        final CountDownLatch latch = new CountDownLatch(2);
        final long start = System.nanoTime();
        ExecutorService es = Executors.newCachedThreadPool();
        Runnable runnable = new Runnable() {
            public void run() {
                sleep(1);
                System.out.println(new Date());
                latch.countDown();
            }
        };
        es.submit(runnable);
        es.submit(runnable);
        latch.await(); // waits until latch.countDown() has been called 2 times.
        // 1 nanosecond is equal to 1/1000000000 of a second.
        long total = (System.nanoTime() - start) / 1000000;
        System.out.println("total time: " + total);
        es.shutdown();
    }

    public static void sleep(int i) {
        try {
            Thread.sleep(i * 1000);
        } catch (InterruptedException ie) {}
    }
}
The output would look like
run:
Fri Apr 02 03:34:14 CEST 2010
Fri Apr 02 03:34:14 CEST 2010
total time: 1055
BUILD SUCCESSFUL (total time: 1 second)
But I ran 2 tasks which each ran for at least 1 second (because of the sleep of 1 second). If I had run those 2 tasks sequentially, it would have taken at least 2 seconds, but because I used threads it only took 1 second. This is what you wanted, and it is easily accomplished using the java.util.concurrent package.
I want the server to potentially handle more than one client at the same time as well.
My goal is to be able to have two
computers networked (through Sockets)
to respond to each-other's requests,
while my program will at the same time
be able to be managing a UI. I want
the server to potentially handle more
than one client at the same time as
well.
I would advise you to have a look at the Netty framework (written by the creator of MINA; in my opinion Netty is the better of the two, with more active development):
The Netty project is an effort to
provide an asynchronous event-driven
network application framework and
tools for rapid development of
maintainable high performance & high
scalability protocol servers &
clients.
It will do all the heavy lifting for you. When I read the user guide I was totally amazed by Netty. Netty uses NIO, which is the newer way of doing I/O for highly concurrent servers and scales much better. Like I said before, this framework does all the heavy lifting for you.
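To give a feel for what that looks like, here is a rough echo-server sketch in the Netty 4 style; treat it as an assumption-laden outline rather than a definitive example (the original recommendation predates Netty 4, and the port and handler are illustrative).

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

// Rough sketch: Netty owns the threads and the NIO event loop; your code only
// supplies small callback-style handlers for each connection.
public class NettyEchoServer {
    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup boss = new NioEventLoopGroup();
        EventLoopGroup workers = new NioEventLoopGroup();
        try {
            ServerBootstrap bootstrap = new ServerBootstrap();
            bootstrap.group(boss, workers)
                     .channel(NioServerSocketChannel.class)
                     .childHandler(new ChannelInitializer<SocketChannel>() {
                         @Override
                         protected void initChannel(SocketChannel ch) {
                             ch.pipeline().addLast(new ChannelInboundHandlerAdapter() {
                                 @Override
                                 public void channelRead(ChannelHandlerContext ctx, Object msg) {
                                     ctx.writeAndFlush(msg); // echo the received bytes back
                                 }
                             });
                         }
                     });
            bootstrap.bind(8080).sync()               // illustrative port
                     .channel().closeFuture().sync(); // block until the server channel closes
        } finally {
            boss.shutdownGracefully();
            workers.shutdownGracefully();
        }
    }
}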
My problem is that I want the server to be waiting on multiple readLine() requests, and also be doing other things
My understanding is that the
communication is done with
BufferedReader.readLine() and
PrintWriter.println(). My problem is
that I want the server to be waiting
on multiple readLine() requests, and
also be doing other things. How do I
handle this?
Again, when you look at Netty's user guide and examples, you will find that it does all the heavy lifting for you in an efficient way. You only have to specify some simple callbacks to get the data from the clients.
Hopefully this has answered all your questions. Otherwise, I would advise you to leave a comment so I can try to explain it better.
I have a job that takes too long in Java, so I want to divide it among threads and run them. When the threads finish their jobs, they should return to my service, and the service should give them new jobs. Is ThreadGroup suitable for this, or do you have another recommendation?
First of all, you need threads if either:
a) You have a multiprocessor machine
or b) You have a single processor but your jobs are IO-intensive (and not CPU-intensive)
Otherwise, you will gain nothing when using threads.
What you need here is a thread pool.
Check out the ExecutorCompletionService - it does exactly this.
Example: [pulled from Java 6 API JavaDocs]
void solve(Executor e, Collection<Callable<Result>> solvers)
        throws InterruptedException, ExecutionException {
    CompletionService<Result> ecs =
            new ExecutorCompletionService<Result>(e);
    for (Callable<Result> s : solvers)
        ecs.submit(s);
    int n = solvers.size();
    for (int i = 0; i < n; ++i) {
        Result r = ecs.take().get();
        if (r != null)
            use(r);
    }
}
Not sure what state of development your project is currently in, since your problem statement is quite limited, but you might want to consider having a look at the fork/join framework coming in JDK 7: http://www.ibm.com/developerworks/java/library/j-jtp11137.html
There's a lot to gain & learn from looking at that, and since it's all open source you can already download the code as a patch and have a go at working with it.
(It might not be applicable to anything you have to implement right now, but it is worth a look nonetheless if you intend to develop or maintain your application for some time in the future.)
Take a look at the java.util.concurrent package.
There's a tutorial where you should find everything you need to know here:
http://java.sun.com/docs/books/tutorial/essential/concurrency/
Focus on the High Level Concurrency Objects in particular.
ThreadGroup isn't generally of much use to application code. It's not a great deal of use to container code either. The Java PlugIn uses ThreadGroup to distinguish which applet a thread belongs to.
java.util.concurrent, in particular ExecutorService, provides amongst other things handy utilities for handling threads and concurrency.
For computationally intensive fine-grained tasks, the fork-join framework in JDK7 will be useful.
Before starting on this difficult code, you might want to consider whether it is worth it. Can you do other optimisations that don't require large-scale thread use? Is it I/O latency you are trying to deal with? If the work is CPU-intensive, there is not much point in using many more threads than you have hardware threads.