This question is about using the gm4java library to interact with GraphicsMagick (in Scala).
I've been testing the PooledGMService as it is demonstrated here with Scala, and it's working well.
However, I noticed that it does not behave like batch mode in the gm command-line interface (gm batch batchfile.gm). When I run a gm batch file from the command line with any number of images, it launches 1 gm process. But if I:
val config = new GMConnectionPoolConfig()
val service = new PooledGMService(config)
and then share the instance of service across 4 threads, where I perform some operation on one image per thread like:
service.execute(
  "convert",
  srcPath.toString(),
  "-resize", percent + "%",
  outPath.toString()
)
I see that 4 separate gm processes are created.
I believe this has performance impacts (in a test with 100 images, the code mentioned above takes the same time as the gm CLI with a batch file, but my Scala code uses 4x as much CPU).
My question is: how do I use gm4java so that a single gm process works on several images (or at least performs several kinds of conversions on the same image), just like the CLI batch mode? I've made a few attempts (some desperately silly) with no luck here.
My exact Scala code can be found here, if you are curious.
Update 05/27/14
With the guidance of a comment by gm4java's author I realized that I was benchmarking two different gm commands. The updated benchmarking results are:
100 x 30MB images (3.09GB total)
on an i7 quad-core (8 logical CPUs w/ hyper-threading)
Criteria               Time
gm cli batchfile       106s
my code, 1 thread      112s
my code, 4 threads      40s
my code, 6 threads      31s
my code, 7 threads      31s
my code, 8 threads      28s
Upon closer inspection, I also saw that while my code ran, the same gm processes with the same process IDs stayed up the whole time. This alleviated my worries that I was losing performance to overhead from starting and terminating gm processes.
Rephrasing
I guess the heart of my question is: what can I do to make gm4java as fast as possible? The tip about matching the gm process count to the machine's core count is useful. Is there anything else that comes to mind?
My particular use case is resizing input images (30MB on average, 50-60MB occasionally, and 100-500MB very rarely) to a few set sizes (with thumbnails being the most important and highest priority). Deployment will probably be on Amazon EC2 with 7 or 14 "compute units".
The design of PooledGMService is to make maximal use of your computing power by starting multiple GM processes to handle your image-manipulation requests in a highly concurrent manner. 100 images is too small a sample size for a performance test. If your goal is to make the best use of your multi-CPU server to convert images, you need to test with a large number of samples (at least a few thousand) and tweak the configuration to find the best number of concurrent GM processes. See the documentation of GMConnectionPoolConfig for all the configuration options.
If you have only 8 CPUs, don't start more than 7 GM processes. If you are testing on a 2-CPU laptop, don't run more than 2 GM processes. In the example, you accepted all the default configuration settings, which will start up to 8 GM processes on demand. But that won't be the right configuration for processing just 100 images on a mere 2-CPU laptop.
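For example, here is a minimal sketch (in Java, since gm4java is a Java library) of capping the pool below the CPU count; setMaxActive is assumed from the commons-pool naming convention the config appears to follow, so verify the exact setter name against the GMConnectionPoolConfig Javadoc:

import org.gm4java.engine.support.GMConnectionPoolConfig;
import org.gm4java.engine.support.PooledGMService;

public class GmPoolSizing {
    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        GMConnectionPoolConfig config = new GMConnectionPoolConfig();
        // Leave one CPU free for the JVM itself, per the advice above.
        // setMaxActive is an assumption based on commons-pool conventions.
        config.setMaxActive(Math.max(1, cores - 1));
        PooledGMService service = new PooledGMService(config);
        service.execute("convert", "in.jpg", "-resize", "50%", "out.jpg");
    }
}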
If all you want is to mimic the command-line batch mode, then SimpleGMService is your best friend. Look at the usage pattern here.
The right solution depends very much on your real use case. If you can tell us more about what exactly you are trying to achieve, your hardware environment, etc., we will be better equipped to help you.
Related
I have a problem where I need to run a complex function on a large 3D array. For each array row, I will execute anywhere from 100 to 1000 instructions, and depending on the data in that row, some instructions will or will not be executed.
This array is large but would still fit inside GPU memory (around 2GB in size). I could execute these instructions on separate parts of the array, given that they don't need to be processed in order, so I'm thinking executing on the GPU could be a good option. I'm not entirely sure, because the instructions executed will change depending on the data itself (lots of if/then/else in there), and I've read branching could be an issue.
These instructions are an abstract syntax tree representing a short program that operates over the array row and returns a value.
Does this look like an appropriate problem to be tackled by the GPU?
What other info would be needed to determine that?
I'm thinking of writing this in Java and using JCuda.
Thanks!
Eduardo
It depends. How big is your array, i.e. how many parallel tasks does your array provide? (In your case it sounds like the number of rows is the number of parallel tasks you're going to execute.) If you have few rows (ASTs) but many columns (commands), then maybe it's not worth it. The other way round would work better, because more work can be parallelized.
Branching can indeed be an issue if you're not careful. You can do some optimizations to mitigate that cost, though, after you've got your initial prototype running and can do some comparison measurements.
The issue with branching is that all cores of a streaming multiprocessor in one "block" need to execute the same instruction. If one core does not need that instruction, it sleeps. So if you have two ASTs, each with 100 distinct commands, the multiprocessors will take 200 commands to complete the calculation; some of the cores will be sleeping while the others execute their commands.
If you have 1000 commands max and some ASTs only use a subset, the processor will take as many commands as the AST with the most commands has, in the optimal case. E.g. a set of (100, 240, 320, 1, 990) will run for at least 990 commands, even though one of the ASTs only uses one command. And if that one command isn't in the set of 990 commands from the last AST, it even runs for 991 commands.
You can mitigate this (after you have the prototype working and can do actual measurements) by optimizing the array you send to the GPU, so that one set of streaming multiprocessors (a block) has a similar set of instructions to do. Since different SMs don't interfere with each other at the execution level, they don't need to wait on each other. The block size is also configurable when you launch the code, so you can adjust it somewhat here.
For even more optimization: only 32 (NVIDIA "warp") or 64 (AMD "wavefront") of the threads in a block are executed at the same time, so if you organize your array to exploit this, you can gain even a bit more.
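As a rough illustration of that reordering idea, you could sort the rows host-side before the upload so that neighbouring warp lanes take similar branches. The signature function here is invented; a real one would be derived from each row's AST:

import java.util.Arrays;
import java.util.Comparator;

public class WarpFriendlyOrdering {
    // Toy signature: a 64-bit mask of which instruction kinds a row would
    // trigger. A real version would derive this from the AST for that row.
    static long signatureOf(int[] row) {
        long sig = 0L;
        for (int v : row) sig |= 1L << (v & 63);
        return sig;
    }

    // Returns a copy of the rows ordered so that consecutive rows (and hence
    // consecutive warp lanes) tend to share the same instruction signature.
    static int[][] reorderForWarps(int[][] rows) {
        int[][] sorted = rows.clone();
        Arrays.sort(sorted, Comparator.comparingLong(WarpFriendlyOrdering::signatureOf));
        return sorted;
    }
}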
How much of a difference those optimizations make depends on how sparse/dense/mixed your command array will be. Also, not all optimizations actually improve your execution time; testing and comparing is key here. Another source of optimization is your memory layout, but with your described use case it shouldn't be a problem. You can look up memory coalescing for more info on that.
I need to determine an ideal number of threads for a batch program that runs in a batch framework supporting parallel mode, like the parallel step in Spring Batch.
As far as I know, having too many threads executing the steps of a program is not good; it may have a negative effect on performance. Several factors can cause performance degradation: context switching, race conditions when using shared resources (locking, synchronization), and so on. (Are there any other factors?)
Of course, the best way of getting the ideal number of threads is to run actual tests of the program while adjusting the number of threads. But in my situation, it is not that easy to run actual tests, because many things are needed for them (people, test scheduling, test data, etc.) that are too difficult for me to prepare now. So, before getting to actual tests, I want a way to make a reasonable guess at the ideal number of threads for my program.
What should I consider to get the ideal number of threads (steps) for my program? The number of CPU cores? The number of processes on the machine my program will run on? The number of database connections?
Is there a rational way such as a formula in a situation like this?
The most important consideration is whether your application/calculation is CPU-bound or IO-bound.
If it's IO-bound (a single thread spends most of its time waiting for external resources such as database connections, file systems, or other external sources of data), then you can assign (many) more threads than the number of available processors. Of course, how many also depends on how well the external resource scales; local file systems, probably not that much.
If it's (mostly) CPU-bound, then slightly over the number of available processors is probably best.
General equation:
Number of threads <= (number of cores) / (1 - blocking factor)
where 0 <= blocking factor < 1
Number of cores of a machine: Runtime.getRuntime().availableProcessors()
You can see the parallelism available to you by printing out ForkJoinPool.commonPool(); its parallelism is the number of cores of your machine minus 1, because one core is reserved for the main thread.
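As a quick worked example of the equation above (the 0.9 blocking factor is just an assumed IO-heavy estimate):

public class ThreadCountEstimate {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        double blockingFactor = 0.9; // assumed: tasks wait on IO ~90% of the time
        // From the equation above: threads <= cores / (1 - blocking factor)
        int ioBound = (int) (cores / (1 - blockingFactor)); // e.g. 8 cores -> 80 threads
        int cpuBound = (int) (cores / (1 - 0.0));           // CPU-bound -> one per core
        System.out.println("IO-bound: " + ioBound + ", CPU-bound: " + cpuBound);
    }
}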
Source link
Time : 1:09:00
What should I consider to get the ideal number of threads (steps) for my program? The number of CPU cores? The number of processes on the machine my program will run on? The number of database connections? Is there a rational way, such as a formula, in a situation like this?
This is tremendously difficult to do without a lot of knowledge of the actual code that you are threading. As @Erwin mentions, whether the operations are IO- or CPU-bound is the key piece of knowledge needed before you can determine whether threading the application will result in any improvement at all. Even if you did manage to find the sweet spot for your particular hardware, you might boot on another server (or a different instance of a virtual cloud node) and see radically different performance numbers.
One thing to consider is to change the number of threads at runtime. The ThreadPoolExecutor.setCorePoolSize(...) is designed to be called after the thread-pool is in operation. You could expose some JMX hooks to do this for you manually.
You could also have your application monitor application or system CPU usage at runtime and tweak the values based on that feedback. You could also keep AtomicLong throughput counters and dial the threads up and down at runtime, trying to maximize throughput. Getting that right might be tricky, however.
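A hypothetical sketch of that feedback loop: grow the pool while measured throughput keeps improving. The 5-second window is an arbitrary choice, and a real tuner would also shrink the pool when throughput drops:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class AdaptivePoolTuner {
    public static void tune(ThreadPoolExecutor pool, AtomicLong completedTasks) {
        ScheduledExecutorService tuner = Executors.newSingleThreadScheduledExecutor();
        long[] state = {0L, 0L}; // state[0] = last count, state[1] = last window's delta
        tuner.scheduleAtFixedRate(() -> {
            long now = completedTasks.get();
            long delta = now - state[0];
            if (delta > state[1]) {
                // Throughput improved in the last window: try one more thread.
                // Raise the max first; core size must stay <= max size.
                pool.setMaximumPoolSize(pool.getMaximumPoolSize() + 1);
                pool.setCorePoolSize(pool.getCorePoolSize() + 1);
            }
            state[0] = now;
            state[1] = delta;
        }, 5, 5, TimeUnit.SECONDS);
    }
}

Each submitted task would call completedTasks.incrementAndGet() as it finishes; that counter is what drives the feedback.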
I typically try to:
make a best guess at a thread number
instrument the application so you can determine the effects of different numbers of threads
allow the number to be tweaked at runtime via JMX so you can see the effects
make sure the number of threads is configurable (via a system property, maybe) so you don't have to re-release to try different thread numbers
I know that to keep a responsive interface in Android, the heavy work must be done in a separate thread. I understand well how to accomplish this (by using AsyncTask, etc.), and that is not the point of the question; just so everybody knows.
But I've been struggling for a while with a very simple parallel program. This program searches for the highest integer in an array whose length is 15,000,000.
I implemented this runnable:
...
public void run() {
    // scan this thread's segment: integers[firstIndex] .. integers[secondIndex - 1]
    highestInteger = integers[firstIndex];
    for (int i = firstIndex + 1; i < secondIndex; i++) {
        if (highestInteger < integers[i]) {
            highestInteger = integers[i];
        }
    }
}
... so I could look for the highest integer in the first half of the array (in one thread) and look for the highest integer in the other half of the array (in the second thread).
The program works very well on my computer (as a plain Java, non-Android program), and by very well I mean that the parallel times are shorter (almost by half) than the serial ones.
But on my Android tablet (4 cores) the times are often the same, and the serial ones are almost always shorter.
I have noticed (with the debugger) that on my tablet there are several threads running:
The main/UI thread (3 cores left)
Binder 1 thread (2 cores left)
Binder 2 thread (1 core left :( )
Binder 3 thread (sometimes I see it in the debugger, sometimes I don't).
So there are 3 threads running, and I need at least 2 free cores for my program to run efficiently. I've read a bit about binder threads, but I don't really understand them very well.
Is there a way to solve this? Is there a way to avoid the automatic creation of those binder threads? Or is this kind of threading not going to work until we have something like a 6-core device?
I have noticed (with the debugger) that on my tablet there are several threads running
You have several threads that have been created. Most will be blocked waiting on I/O.
Is there a way to solve this or not?
The decision of core allocation is made by the operating system and will take into account other programs, plus power consumption (keeping all four cores running at all times would be really bad for the battery), as Andy Fadden (of the core Android team) points out in this SO comment and this SO comment. Note that there are ~750 million Android devices in use today, the vast majority of which have fewer than four cores, and most of those only a single core, so you need to take that into account as well.
Is there a way in which I can avoid the automatic creation of those binder threads or not?
Only by not writing an Android app. Those threads are used for inter-process communication, which is essential to running an Android app.
Or it is not possible to get this kind of threading to work until we have like a 6 core device?
It is certainly possible. Andy Fadden demonstrates it in this StackOverflow answer. There may be ways to reorganize your algorithm to make better use of SMP on Android, as outlined in the documentation. You might also consider RenderScript Compute as an alternative to doing your work in Java.
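One sketch of such a reorganization, using plain JDK executors: size the pool from availableProcessors() instead of hard-coding two threads, and let the scheduler do the rest. This assumes a non-empty array, and on Android it should still be run off the UI thread:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelMax {
    public static int max(final int[] data) throws Exception {
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        int chunk = (data.length + workers - 1) / workers;
        List<Future<Integer>> parts = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            final int from = w * chunk;
            final int to = Math.min(data.length, from + chunk);
            if (from >= to) break;
            parts.add(pool.submit(() -> {
                int best = data[from]; // scan this worker's chunk
                for (int i = from + 1; i < to; i++) {
                    if (best < data[i]) best = data[i];
                }
                return best;
            }));
        }
        int best = Integer.MIN_VALUE;
        for (Future<Integer> part : parts) best = Math.max(best, part.get());
        pool.shutdown();
        return best;
    }
}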
I want to change the number of threads for a JMeter test plan at runtime.
I have Googled my problem and found a proposed solution that uses JMeter Plugins. But with this solution I would have to schedule the thread group before running the test plan, which I don't want. I also found another potential solution that changes a property, but that doesn't affect test-plan behavior at run time.
Ultimately, what I am trying to do is change the thread number given in a thread group and have it immediately increase or decrease the number of threads in the current running test plan.
Is this possible?
IMHO that's just a fancy feature that has no real benefit when doing proper performance testing.
In order to generate relevant test output (report), you need repeatability, and clearly defined test methodology and scenarios. In order to compare impact of any application/server/infrastructure changes, you need repeatability.
What do you mean by
We can't predict the user of our site
That's why we do performance testing in the first place: to find out what our application/infrastructure limit is.
That is, the most significant metric you can produce is how your application's response time changes when the number of parallel users changes. But that number should change deliberately, not erratically at run time.
With JMeter Plugins' Ultimate Thread Group you can cover any imaginable scenario.
The short answer is: no, you cannot change the number of threads dynamically during runtime. Each thread count value is only read once when the test plan is first compiled and is not resolved again after this point, so it remains fixed.
This feature is indeed useful, and surprisingly difficult to implement even with commercial tools such as LoadRunner. I would compare it to finding a loudspeaker's maximum volume: you would manually turn the volume up until it started to crackle, then turn it back down slightly to maintain that maximum volume. In the same way, to find the peak capacity of an application, you want to "turn the volume up" until errors are seen, then back it down slightly to see if it stabilizes. You can then maintain that load to find where the bottleneck is.
Anyway, to answer the question: what I have done in the past is use an external influence, such as a file name or similar. Combined with each thread's unique reference, you can then control which threads run and which are held (by pausing or similar).
For example, if you start with 100 threads and then create a file called '5.txt' in a specific location, you can add code such that if a thread sees that its own reference is equal to or lower than that number, it can run; if not, it drops into a pause. At the start of this example, 5 threads would run and 95 would pause. You can then rename the file to '25.txt', and threads 6 to 25 would start running. It works the other way too: changing it to '20.txt' would make threads 21-25 pause again.
The key is to start enough threads to exceed your expected peak.
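A hypothetical BeanShell Sampler sketch of that gate check (the /tmp/gate directory and the 5-second poll interval are invented for illustration; ctx is the context object JMeter provides to the sampler):

import java.io.File;

int me = ctx.getThreadNum() + 1;  // JMeter thread numbers are 0-based
while (true) {
    int allowed = 0;
    File[] gates = new File("/tmp/gate").listFiles();  // invented gate directory
    if (gates != null && gates.length > 0) {
        String name = gates[0].getName();              // e.g. "25.txt"
        allowed = Integer.parseInt(name.substring(0, name.indexOf('.')));
    }
    if (me <= allowed) break;  // this thread is enabled: fall through and run
    Thread.sleep(5000);        // held: re-check the gate file every 5 seconds
}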
You can change it based on a variable which you set in a setup thread group. See below.
In Jmeter how do I set a variable number of threads using a beanshell sampler variable?
However, once the thread group has started, you cannot modify it. To the person who said this feature would not be useful: I disagree. There are many types of load tests, and they do not all have the same number of users running for the duration. Here are just a few example types of enterprise load tests we conduct at the bank where I work:
Duration test - the same number of users run the entire time (possibly with a short ramp-up period)
Break-point test - ramp up the number of users incrementally until the application breaks
Spike test - run with a constant number of users, but sporadically throw in a large number of users
A break-point test ramps up the number of users until the application breaks (the point being to see how high your app can scale). You can sort of do this using the thread group's "ramp-up period" property: if you set the ramp-up time to 1000 and the number of threads to 100, it will add 1 thread every 10 seconds.
Spike tests are like duration tests, but at some intervals a large number of users log in. This is used to gauge the application's response time during peak hours, or how it will respond if you suddenly get a large number of users (a very real scenario).
I find that JMeter does not handle all the load-test scenarios needed in enterprise load testing. One workaround I'm considering is to start all the threads but find a way to make some of them sleep. So you could set the number of threads to 1000 but somehow make 980 of them sleep or do nothing. Then maybe when time_in_minutes % 5 == 0 (every 5 minutes) you allow the other threads to run, simulating a spike test. The idea is that you can hard-code the thread count to 1000 and will always have 1000 threads running; they just don't all have to be doing something at all times.
(in other words you can probably find a way but you have to get creative)
Update:
I just found this plugin which allows different types of testing. Have not tried it yet but looks promising:
http://jmeter-plugins.org/wiki/ThroughputShapingTimer/
You can set the number of threads when launching the test using a command-line option...
You can use function calls, or variable references to User Parameters (which in turn could be functions), or variable references to variables set up by functions earlier in the test. There's more than one way to do it.
Suppose you want to be able to vary the number of threads in a test plan. Choose a suitable property name, say group1.threads. Replace the thread count in the GUI (or the JMX, if you're feeling brave!) with the following function call:
${__property(group1.threads)}
Then, when starting JMeter, define the property on the command line:
jmeter -Jgroup1.threads=10
We can't predict the user of our site.
Sure you can. This is what the HTTP logs of your existing site are for. You can also use logs from tools like Omniture, or your CDN logs. If you look at the combination of actual user IP address, request, and referer tags in the logs, you will be able to build a traversal map of every single user on your site. You will be able to profile the high-hit, unique leaf-node pages of a given business process to understand how many times a particular business process happens in an hour. You will be able to examine abandonment by taking a look at the funnel in tools such as Omniture. If you need tools for this analysis, I recommend Splunk. It's easy to install and configure, and time to value is very fast.
The more log data you use to build your profile, the closer you can come to actuals for what users do during a day/week/month/spot sale/end of quarter/end of year/etc. You need to combine actuals at one point in time with actuals from earlier points in time to project growth over time, since you will need to allow for growth in your performance-testing model.
If you don't get the values right then the value of your test as a predictor of what will/can happen in production will be quite low. This is not a failure of any given tool, but a failure in process on the planning front for the actual load model used as part of the test requirements. If you cannot build these models then you need to pull someone into your team who can.
This ability to produce a valid load model independent of tool is the difference between tests which reduce risk and throwing load.
By enabling the BeanShell server you can vary properties at runtime.
Just enable it and telnet on port 9001 (warning: not secure!)
Based on a test I did, unfortunately, it appears that the thread count is not applied at runtime. However, you can still manipulate the load of the test by other means: for example, apply a Constant Throughput Timer parametrized with a property named "throughput" and vary it at runtime like this:
setprop("throughput","2000");
It's well explained in the guide.
I'm working on a system at the moment. It's a complex system but it boils down to a Solver class with a method like this:
public int solve(int problem); // returns the solution, or 0 if no solution found
Now, when the system is up and running, a run time of about 5 seconds for this method is expected and is perfectly fast enough. However, I plan to run some tests that look a bit like this:
List<Integer> problems = getProblems();
List<Integer> solutions = new ArrayList<Integer>(problems.size());
Solver solver = getSolver();
for (int problem : problems) {
    solutions.add(solver.solve(problem));
}
// see what percentage of solutions are zero
// get arithmetic mean of non-zero solutions
// etc etc
The problem is that I want to run this on a large number of problems and don't want to wait forever for the results. So say I have a million test problems and I want the tests to complete in the time it takes me to make a cup of tea. I have two questions:
Say I have a million-core processor, instances of Solver are thread-safe but use no locking (they're immutable or something), and all the computation they do is in memory (i.e. there's no disk or network or other stuff going on). Can I just replace the solutions list with a thread-safe list and kick off threads to solve each problem, and expect it to be faster? How much faster? Can it run in 5 seconds?
Is there a decent cloud computing service out there for Java where I can buy 5 million seconds of time and get this code to run in five seconds? What do I need to do to prepare my code for running on such a cloud? How much does 5 million seconds cost anyway?
Thanks.
You have expressed your problem with two major points of serialisation: Problem production and solution consumption (currently expressed as Lists of integers). You want to get the first problems as soon as you can (currently you won't get them until all problems are produced).
I am assuming as well that there is a correlation between the problem list order and the solution list order, i.e. solutions.get(3) is the solution for problems.get(3); this would be a huge problem for parallelising it. You'd be better off having a Pair<P, S> of problem/solution so you don't need to maintain the correlation.
Parallelising the solver method will not be difficult, although exactly how you do it will depend a lot on the compute cost of each solve call (generally, the more expensive the method, the lower the relative overhead of parallelising; so if these are very cheap, you need to batch them). If you end up with a distributed solution you'll have much higher costs, of course. The Executor framework and the fork/join extensions would be a great starting point.
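For instance, a minimal sketch under the question's assumptions (thread-safe Solver, purely in-memory work); the Futures preserve the problem/solution correlation without a shared mutable list:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSolve {
    static List<Integer> solveAll(final Solver solver, List<Integer> problems)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Future<Integer>> futures = new ArrayList<Future<Integer>>(problems.size());
        for (final int problem : problems) {
            futures.add(pool.submit(new Callable<Integer>() {
                public Integer call() { return solver.solve(problem); }
            }));
        }
        List<Integer> solutions = new ArrayList<Integer>(problems.size());
        for (Future<Integer> f : futures) {
            solutions.add(f.get()); // blocks until that particular problem is solved
        }
        pool.shutdown();
        return solutions;
    }
}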
You're asking extremely big questions. There is overhead for threads, and a key thing to note is that they run in the parent process. If you wanted to run a million of these solvers at the same time, you'd have to fork them into their own processes.
You can use one program per input and then use a simple batch scheduler like Condor (for Linux) or HPC (for Windows). You can run those on Amazon too, but there's a bit of a learning curve; it's not just "upload Java code & go".
Sure, you could use a standard worker-thread paradigm to run things in parallel. But there will be some synchronization overhead (e.g., updates to the solutions list will cause lock contention when everything tries to finish at the same time), so it won't run in exactly 5 seconds. But it would be faster than 5 million seconds :-)
Amazon EC2 runs between US$0.085 and US$0.68 per hour depending on how much CPU you need (see pricing). So, maybe about $120. Of course, you'll need to set up something separate to distribute your jobs across the various CPUs. One option might be just to use Hadoop (see this question about whether Hadoop is right for running simulations).
You could read things like Guy Steele's talk on parallelism for more info on how to think parallel.
Use an appropriate Executor. Have a look at http://download.oracle.com/javase/6/docs/api/java/util/concurrent/Executors.html#newCachedThreadPool()
Check out these articles on concurrency:
http://www.vogella.de/articles/JavaConcurrency/article.html
http://www.baptiste-wicht.com/2010/09/java-concurrency-part-7-executors-and-thread-pools/
Basically, Java 7's new fork/join framework will work really well for this approach. Essentially, you can set up your million-plus tasks and it will spread them as best it can across all available processors. You would have to provide your own custom "cloud" task executor, but it can be done.
This assumes, of course, that your "solving" algorithm is embarrassingly parallel. In short, as long as the Solver is fully self-contained, the tasks should be able to be split among an arbitrary number of processors.
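A rough fork/join sketch of that setup (Java 7+); Solver and solve come from the question, while the threshold and the flat array representation are invented for the example:

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class SolveAllTask extends RecursiveAction {
    private static final int THRESHOLD = 1000; // tune by measurement

    private final Solver solver;
    private final int[] problems;
    private final int[] solutions;
    private final int from, to;

    SolveAllTask(Solver solver, int[] problems, int[] solutions, int from, int to) {
        this.solver = solver;
        this.problems = problems;
        this.solutions = solutions;
        this.from = from;
        this.to = to;
    }

    @Override
    protected void compute() {
        if (to - from <= THRESHOLD) {
            for (int i = from; i < to; i++) {
                solutions[i] = solver.solve(problems[i]);
            }
        } else {
            int mid = (from + to) >>> 1; // split and solve the halves in parallel
            invokeAll(new SolveAllTask(solver, problems, solutions, from, mid),
                      new SolveAllTask(solver, problems, solutions, mid, to));
        }
    }
}
// usage:
// int[] solutions = new int[problems.length];
// new ForkJoinPool().invoke(new SolveAllTask(solver, problems, solutions, 0, problems.length));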