More efficient way to blend pixels (semi-transparency)?


I'm working on drawing semi-transparent images on top of other images for a small 2D game. Currently, to blend the images, I'm using the formula found here: https://en.wikipedia.org/wiki/Alpha_compositing#Alpha_blending
My implementation of this is as follows:
private static int blend(int source, int dest, int trans)
{
double alpha = ((double) trans / 255.0);
int sourceRed = (source >> 16 & 0xff);
int sourceGreen = (source >> 8 & 0xff);
int sourceBlue = (source & 0xff);
int destRed = (dest >> 16 & 0xff);
int destGreen = (dest >> 8 & 0xff);
int destBlue = (dest & 0xff);
int blendedRed = (int) (alpha * sourceRed + (1.0 - alpha) * destRed);
int blendedGreen = (int) (alpha * sourceGreen + (1.0 - alpha) * destGreen);
int blendedBlue = (int) (alpha * sourceBlue + (1.0 - alpha) * destBlue);
return (blendedRed << 16) + (blendedGreen << 8) + blendedBlue;
}
Now, it works fine, but it has a pretty high overhead since it's being called for every single pixel every single frame. I get a performance drop of around 30% FPS as opposed to simply rendering the image without blending.
I just wanted to know if anyone can think of a better way to optimise this code as I'm probably doing too many bit operations.

Not a Java coder (so read with prejudice), but you are doing some things really wrong (from my C++ and low-level gfx perspective):
mixing integers and floating point
That requires conversions, which are sometimes really costly. It's much better to use integer weights (alpha) in the range <0..255> and then just divide by 255 or bit-shift by 8. That would most likely be much faster.
bitshifting/masking to obtain bytes
Yes, it's fine, but there are simpler and faster methods, simply by using
enum{
_b=0, // db
_g=1,
_r=2,
_a=3,
};
union color
{
DWORD dd; // 1x32 bit unsigned int
BYTE db[4]; // 4x8 bit unsigned int
};
color col;
col.dd=some_rgba_color;
r = col.db[_r]; // get red channel (index the byte array, not the 32-bit value)
col.db[_b]=5; // set blue channel
Decent compilers could optimize some parts of your code to this internally on their own, but I doubt they can do it everywhere...
You can also use pointers instead of a union in the same way...
function overhead
You have a function that blends a single pixel. That means it will be called a lot. It's usually much faster to blend a region (rectangle) per call than to call stuff on a per-pixel basis, because you trash the stack this way. To limit this you can try these (for functions that are called massively):
Recode your app so you can blend regions instead of pixels, causing far fewer function calls.
Lower the stack trashing by reducing the operands, return values and internal variables of the called function, to limit the amount of RAM being allocated/freed/overwritten/copied on each call. For example, use static or global variables: the alpha will most likely not be changing much. Or you can use alpha encoded in the color directly instead of having alpha as an operand.
Use inline or macros like #define to place the source code directly into the calling code instead of making a function call.
For starters I would try to recode your function body to something like this:
enum{
_b=0, // db
_g=1,
_r=2,
_a=3,
};
union color
{
unsigned int dd; // 1x32 bit unsigned int
unsigned char db[4]; // 4x8 bit unsigned int
};
private static unsigned int blend(unsigned int src, unsigned int dst, unsigned int alpha)
{
unsigned int i,a,_alpha=255-alpha;
color s,d;
s.dd=src;
d.dd=dst;
for (i=0;i<3;i++)
{
a=(((unsigned int)(s.db[i]))*alpha) + (((unsigned int)(d.db[i]))*_alpha);
a>>=8;
d.db[i]=a;
}
return d.dd;
}
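Java has no unions or unsigned integer types, so the same idea can be sketched there with shifts and plain integer arithmetic. The names below (IntBlend, blend, blendRegion) are hypothetical, and the sketch assumes pixels packed as 0xRRGGBB and an alpha in the range 0..255. It divides by 255 rather than shifting by 8, so that alpha = 255 reproduces the source colour exactly.

```java
final class IntBlend {
    // Blends one 0xRRGGBB pixel over another; alpha is an integer in [0, 255].
    static int blend(int src, int dst, int alpha) {
        int inv = 255 - alpha;
        int r = (((src >> 16) & 0xff) * alpha + ((dst >> 16) & 0xff) * inv) / 255;
        int g = (((src >> 8) & 0xff) * alpha + ((dst >> 8) & 0xff) * inv) / 255;
        int b = ((src & 0xff) * alpha + (dst & 0xff) * inv) / 255;
        return (r << 16) | (g << 8) | b;
    }

    // Blends a whole run of pixels in one call, writing the result into dst.
    static void blendRegion(int[] src, int[] dst, int alpha) {
        int n = Math.min(src.length, dst.length);
        for (int i = 0; i < n; i++) {
            dst[i] = blend(src[i], dst[i], alpha);
        }
    }
}
```

Calling blendRegion once per sprite row (or once per sprite) keeps the per-pixel work inside a single loop, which is the "blend regions instead of pixels" suggestion from the list above.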
However if you want true speed use GPU (OpenGL Blending).

Related

Bitshift four integers to form a float

I'm working on an application that uses a rather strange format for its colours. It uses a variation of ARGB, with a float storing all the data. The colours themselves are hardcoded into the classes, and are decoded by this operation:
float alpha = (float)(color >> 24 & 255) / 255.0F;
float red = (float)(color >> 16 & 255) / 255.0F;
float blue = (float)(color >> 8 & 255) / 255.0F;
float green = (float)(color & 255) / 255.0F;
Four 8-bit integers are extracted from the 32 bits of the colour value by bit-shifting, and each one is converted to a float between 0 and 1. They are then processed further by a graphics library.
I wrote a class to represent colours, since I really don't want to work with all these magical floats. I need a method to reverse the process, but I'm clueless as to how this reversal would work. I started by converting my floats back to integers between 0 and 255, but that's about as far as I got. How can I take these values and "glue" them together in the same order as the original values?
public int toARGB() {
System.out.println(fromARGB(1615855616)); // gives Color{red=0.3137255, green=0.0, blue=0.0, alpha=0.3764706}
System.out.println(fromARGB(-2130706433)); // gives Color{red=1.0, green=1.0, blue=1.0, alpha=0.5019608}
int closeAlpha = (int) (alpha*255.0F);
int closeRed = (int) (red*255.0F);
int closeGreen = (int) (green*255.0F);
int closeBlue = (int) (blue*255.0F);
//How do I proceed here?
return null;
}
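One possible way to reverse the decode is to scale each channel back to 0..255, shift it to the position it was read from, and OR the four pieces together. The sketch below is only an illustration: the class name ArgbPacker and the parameter list are hypothetical, and the bit positions simply mirror the decode shown in the question (alpha at bit 24, red at 16, the value called blue at 8, the value called green at 0). Math.round is used instead of a plain cast so that values such as 95.99999 do not truncate to 95.

```java
final class ArgbPacker {
    // Packs four channels in [0, 1] into one int, mirroring the question's decode order.
    static int toARGB(float alpha, float red, float green, float blue) {
        int a = Math.round(alpha * 255.0F) & 0xff;
        int r = Math.round(red * 255.0F) & 0xff;
        int g = Math.round(green * 255.0F) & 0xff;
        int b = Math.round(blue * 255.0F) & 0xff;
        // The question reads "blue" from bits 8-15 and "green" from bits 0-7.
        return (a << 24) | (r << 16) | (b << 8) | g;
    }
}
```

With the two sample colours printed in the question, this should reproduce the original integers 1615855616 and -2130706433.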

Tinting pixels in Java - Need a faster method

I'm making a doom style pseudo-3D game.
The world is rendered pixel by pixel into a buffered image, which is later displayed on the JPanel. I want to keep this approach so that lighting individual pixels will be easier.
I want to be able to color the textures in the game to many different colors.
Coloring the whole texture and storing it in a separate buffered image takes too much time and memory for my purpose. So I am tinting each pixel of the texture during the rendering stage.
The problem I am having is that tinting each pixel is quite expensive. When an uncolored wall covers the entire screen, I get around 65 fps. And when a colored wall covers the screen, I get 30 fps.
This is my function for tinting the pixels:
//Change the color of the pixel using its brightness.
public static int tintABGRPixel(int pixelColor, Color tintColor) {
//Calculate the luminance. The decimal values are pre-determined.
double lum = ((pixelColor>>16 & 0xff) * 0.2126 +
(pixelColor>>8 & 0xff) * 0.7152 +
(pixelColor & 0xff) * 0.0722) / 255;
//Calculate the new tinted color of the pixel and return it.
return ((pixelColor>>24 & 0xff) << 24) |
((int)(tintColor.getBlue()*lum) & 0xff) |
(((int)(tintColor.getGreen()*lum) & 0xff) << 8) |
(((int)(tintColor.getRed()*lum) & 0xff) << 16);
}
Sorry for the illegible code. This function calculates the brightness of the original pixel, multiplies the new color by the brightness, and converts it back into an int.
It only contains simple operations, but this function is called up to a million times per frame in the worst case. The bottleneck is the calculation in the return statement.
Is there a more efficient way to calculate the new color?
Would it be best if I changed my approach?
Thanks

Do the work in Parallel
Threads aren't necessarily the only way to parallelise code. CPUs often have instruction sets such as SIMD, which allow you to perform the same arithmetic on multiple numbers at once. GPUs take this idea and run with it, allowing you to run the same function on hundreds to thousands of numbers in parallel. I don't know how to do this in Java, but I'm sure with some googling it's possible to find a method that works.
Algorithm - Do less work
Is it possible to reduce the number of times the function needs to be called? Calling any function a million times per frame is going to hurt. Unless the overhead of each function call is managed (inlining it, reusing the stack frame, caching the result if possible), you'll want to do less work.
Possible options could be:
Make the window/resolution of the game smaller.
Work with a different representation. Are you doing a lot of operations that are easier to do when pixels are HSV instead of RGB? Then only convert to RGB when you are about to render the pixel.
Use a limited number of colours for each pixel. That way you can work out the possible tints in advance, so they are only a lookup away, as opposed to a function call.
Tint as little as possible. Maybe there is some UI that is tinted and shouldn't be. Maybe lighting effects only travel so far.
As a last resort, make tinted the default. If tinting pixels is done so much then possibly "untinting" happens far less and you can get better performance by doing that.
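The "work out the possible tints in advance" option can be sketched as a 256-entry lookup table per tint colour, indexed by an integer luminance. This is a minimal sketch under assumptions: the class TintTable and its methods are hypothetical, pixels are taken to be packed as 0xAARRGGBB, and the luminance is quantised to 0..255, so results can differ by one level from the floating-point version in the question.

```java
final class TintTable {
    private final int[] table = new int[256];

    // Precomputes the tinted RGB value for every luminance level 0..255.
    TintTable(int tintR, int tintG, int tintB) {
        for (int lum = 0; lum < 256; lum++) {
            int r = tintR * lum / 255;
            int g = tintG * lum / 255;
            int b = tintB * lum / 255;
            table[lum] = (r << 16) | (g << 8) | b;
        }
    }

    // Tints one pixel: integer luminance, then a single table lookup.
    int tint(int pixel) {
        int r = (pixel >> 16) & 0xff;
        int g = (pixel >> 8) & 0xff;
        int b = pixel & 0xff;
        int lum = (2126 * r + 7152 * g + 722 * b) / 10000; // weights sum to 10000, so lum <= 255
        return (pixel & 0xff000000) | table[lum];
    }
}
```

One table per tint colour costs 1 KB, so building a handful of them up front is cheap compared with storing a recoloured copy of every texture.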
Performance - (Micro-)optimising the code
If you can settle for an "approximate tint", this SO answer gives an approximation for the brightness (lum) of a pixel that should be cheaper to compute. (The formula from the link is Y = 0.33 R + 0.5 G + 0.16 B, which can be written Y = (R+R+B+G+G+G)/6.)
The next step is to time your code (profile is a good term to know for googling) to see what takes up the most resources. It may well be that it isn't this function here, but another piece of code. Or waiting for textures to load.
From this point on we will assume the function provided in the question takes up the most time. Let's see what it is spending its time on. I don't have the rest of your code, so I can't benchmark all of it, but I can compile it and look at the bytecode that is produced. Using javap on a class containing the function I get the following (bytecode has been cut where there are repeats).
public static int tintABGRPixel(int, Color);
Code:
0: iload_0
1: bipush 16
3: ishr
4: sipush 255
7: iand
8: i2d
9: ldc2_w #2 // double 0.2126d
12: dmul
13: iload_0
...
37: dadd
38: ldc2_w #8 // double 255.0d
41: ddiv
42: dstore_2
43: iload_0
44: bipush 24
46: ishr
47: sipush 255
50: iand
51: bipush 24
53: ishl
54: aload_1
55: pop
56: invokestatic #10 // Method Color.getBlue:()I
59: i2d
60: dload_2
61: dmul
62: d2i
63: sipush 255
66: iand
67: ior
68: aload_1
69: pop
...
102: ireturn
This can look scary at first, but Java bytecode is nice, in that you can match each line (or instruction) to a point in your function. It hasn't done anything crazy like rewrite it or vectorize it or anything that makes it unrecognizable.
The general method to see if a change has made an improvement, is to measure the code before and after. With that knowledge you can decide if a change is worth keeping. Once the performance is good enough, stop.
Our poor man's profiling is to look at each instruction and see (on average, according to online sources) how expensive it is. This is a little naive, as how long each instruction takes to execute can depend on a multitude of things, such as the hardware it is running on, the versions of software on the computer, and the instructions around it.
I don't have a comprehensive list of the time cost for each instruction, so I'm going to go with some heuristics.
integer operations are faster than floating operations.
constants are faster than local memory, which is faster than global memory.
powers of two can allow for powerful optimisations.
I stared at the bytecode for a while, and all I noticed was that from lines [8 - 42] there are a lot of floating point operations. This section of code works out lum (the brightness). Other than that, nothing else stands out, so let's rewrite the code with our first heuristic in mind. If you don't care for the explanation, I'll provide the final code at the end.
Let us just consider what the blue colour value (which we will label B) will be by the end of the function. The changes will apply to red and green too, but we will leave them out for brevity.
double lum = ((pixelColor>>16 & 0xff) * 0.2126 +
(pixelColor>>8 & 0xff) * 0.7152 +
(pixelColor & 0xff) * 0.0722) / 255;
...
... | ((int)(tintColor.getBlue()*lum) & 0xff) | ...
This can be rewritten as
int x = (pixelColor>>16 & 0xff), y = (pixelColor>>8 & 0xff), z = (pixelColor & 0xff);
double a = 0.2126, b = 0.7152, c = 0.0722;
double lum = (a*x + b*y + c*z) / 255;
int B = (int)(tintColor.getBlue()*lum) & 0xff;
We don't want to be doing as many floating point operations, so let us do some refactoring. The idea is that the floating point constants can be written as fractions. For example, 0.2126 can be written as 2126/10000.
int x = (pixelColor>>16 & 0xff), y = (pixelColor>>8 & 0xff), z = (pixelColor & 0xff);
int a = 2126, b = 7152, c = 722;
int top = a*x + b*y + c*z;
double temp = (double)(tintColor.getBlue() * top) / 10000 / 255;
int B = (int)temp & 0xff;
So now we do three integer multiplications (imul) instead of three dmuls. The cost is one extra floating division, which alone would probably not be worth it. We can avoid this issue by piggybacking on the other division that we are already doing. Combining the two sequential divisions into one division is as simple as changing / 10000 / 255 to /2550000. We can also setup the code for one more optimization by moving the casting and division to one line.
int x = (pixelColor>>16 & 0xff), y = (pixelColor>>8 & 0xff), z = (pixelColor & 0xff);
int a = 2126, b = 7152, c = 722;
int top = a*x + b*y + c*z;
int temp = (int)((double)(tintColor.getBlue()*top) / 2550000);
int B = temp & 0xff;
This could be a good place to stop. However, if you need to squeeze a tiny bit more performance out of this function, we can optimise dividing by a constant and casting a double to an int (which I believe are two expensive operations) to a multiply (by a long) and a shift.
int x = (pixelColor>>16 & 0xff), y = (pixelColor>>8 & 0xff), z = (pixelColor & 0xff);
int a = 2126, b = 7152, c = 722;
int top = a*x + b*y + c*z;
int Btemp = (int)((tintColor.getBlue() * top * 1766117501L) >> 52);
int B = Btemp & 0xff;
where the magic numbers are two that were magicked up when I compiled a c++ version of the code with clang. I am not able to explain how to produce this magic, but it works as far as I have tested with a couple of values for x, y, z, and tintColor.getBlue(). When testing I assumed all the values are in the range [0 - 256), and I tried only a couple of examples.
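One reading of where the constant comes from (this is not stated in the answer) is the usual "divide by a constant via multiply and shift" trick: 2^52 / 2550000 is roughly 1766117500.93, and 1766117501 is that value rounded up. The product tintColor.getBlue() * top is at most 255 * 2550000 = 650250000, and over that range the multiply-and-shift can be compared directly against plain integer division. The class and method names in this sketch are hypothetical.

```java
final class MagicDivide {
    static final long MAGIC = 1766117501L; // ceil(2^52 / 2550000)
    static final int SHIFT = 52;
    static final int DIVISOR = 2550000;    // 10000 * 255

    // Division by 2550000 expressed as a 64-bit multiply followed by a shift.
    static int viaMultiplyShift(int n) {
        return (int) ((n * MAGIC) >> SHIFT);
    }

    // Reference result using ordinary integer division.
    static int viaDivision(int n) {
        return n / DIVISOR;
    }

    // Compares both methods on a sampled sweep of the valid input range.
    static boolean agreeOnSample(int step) {
        for (int n = 0; n <= 255 * DIVISOR; n += step) {
            if (viaMultiplyShift(n) != viaDivision(n)) {
                return false;
            }
        }
        return true;
    }
}
```

A sweep like agreeOnSample is a cheap way to test any proposed replacement constant before trusting it in the renderer.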
The final code is below. Be warned that this is not well tested and may have edge cases that I've missed, so let me know if there are any bugs. Hopefully it is fast enough.
public static int tintABGRPixel(int pixelColor, Color tintColor) {
// Calculate the luminance. The decimal values are pre-determined.
int x = pixelColor>>16 & 0xff, y = pixelColor>>8 & 0xff, z = pixelColor & 0xff;
int top = 2126*x + 7152*y + 722*z; // weights 0.2126, 0.7152, 0.0722 scaled by 10000
int Btemp = (int)((tintColor.getBlue() * top * 1766117501L) >> 52);
int Gtemp = (int)((tintColor.getGreen() * top * 1766117501L) >> 52);
int Rtemp = (int)((tintColor.getRed() * top * 1766117501L) >> 52);
//Calculate the new tinted color of the pixel and return it.
return ((pixelColor>>24 & 0xff) << 24) | Btemp & 0xff | (Gtemp & 0xff) << 8 | (Rtemp & 0xff) << 16;
}
EDIT: Alex found that the magic number should be 1755488566L instead of 1766117501L.

To get better performance you'll have to get rid of objects like Color during image manipulation. Also, if you know that a method is going to be called millions of times (image.width * image.height times), it's best to inline it. In general the JVM would probably inline this method itself, but you should not take the risk.
You can use PixelGrabber to get all the pixels into an array. Here's a general usage
final int[] pixels = new int[width * height];
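final PixelGrabber pixelgrabber = new PixelGrabber(image, 0, 0, width, height, pixels, 0, width); // last argument is the scansize (row stride)
pixelgrabber.grabPixels(); // blocks until the array is filled; throws InterruptedException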
for(int i = 0; i < height; i++) {
for(int j = 0; j < width; j++) {
int p = pixels[i * width + j]; // same as image.getRGB(j, i);
int alpha = ( ( p >> 24) & 0xff );
int red = ( ( p >> 16) & 0xff );
int green = ( ( p >> 8) & 0xff );
int blue = ( p & 0xff );
//do something i.e. apply luminance
}
}
Above is just an example of how to iterate row and column indexes, however in your case nested loop is not needed. This should reasonably improve the performance.
This can probably be parallelized also using Java 8 streams easily, however be careful before using streams while dealing with images, as streams are a lot slower than plain old loops.
You can also try replacing int with byte where applicable (i.e. individual color components don't need to be stored in int). Basically try using primitive datatypes and even in primitive datatypes use smallest that's applicable.

At this point you are really close to the metal on this calculation. I think you'll have to change your approach to really improve things, but a quick idea is to cache the lum calculation. That is a simple function of pixel color and your lum isn't dependent on anything but that. If you cache that it could save you a lot of calcs. While you're caching you could cache this calc too:
((pixelColor>>24 & 0xff) << 24)
I don't know if that'll save you a ton of time, but I think at this point that is just about all you could do from a micro-optimization stand point.
Now you could refactor your pixel loop to use parallelism, and do those pixel calcs in parallel on your CPU this might set you up for the next idea too.
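As a rough illustration of the parallel pixel loop, the sketch below uses a parallel IntStream over the indices of a pixel array. The class name ParallelPixels and the method mapParallel are hypothetical, and whether this beats a plain loop depends on the image size and the per-pixel cost, so it should be measured before being adopted.

```java
import java.util.function.IntUnaryOperator;
import java.util.stream.IntStream;

final class ParallelPixels {
    // Applies a per-pixel function to every element of the array.
    // Each index is written exactly once, so no locking is needed.
    static int[] mapParallel(int[] pixels, IntUnaryOperator perPixel) {
        int[] out = new int[pixels.length];
        IntStream.range(0, pixels.length)
                 .parallel()
                 .forEach(i -> out[i] = perPixel.applyAsInt(pixels[i]));
        return out;
    }
}
```

For example, mapParallel(pixels, p -> p | 0xff000000) would force every pixel to be fully opaque.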
If neither of those above ideas work I think you might need to try and push color calculations off to the GPU card. This is all bare metal math that has to happen millions of times which is what graphics cards do best. Unfortunately this is a deep topic with lots of education that has to happen in order to pick the best option. Here were some interesting things to research:
https://code.google.com/archive/p/java-gpu/
https://github.com/nativelibs4java/JavaCL
http://jogamp.org/jogl/www/
https://www.lwjgl.org/
I know some of those are huge frameworks, which isn't what you asked for. But they might contain other relatively unknown libs that you could use to push these math calcs off to the GPU. The @Parallel annotation or the JavaCL bindings looked like they could be the most useful.

Fastest way to sum digits using bit operations when summing color values

My particular case of summing digits deals with colors represented as integer. Java function BufferedImage.getRGB returns image in 0x00RRGGBB format. I'm making a function that gives you grayscale (color independent) sum of colors on the image. Currently, my operation looks very naive:
//Just a pseudocode
int sum = 0;
for(x->width) {
for(y->height) {
int pixel = image.getRGB(x,y);
sum+=(pixel&0x00FF0000)+(pixel&0x0000FF00)+(pixel&0x000000FF);
}
}
//The average value for any color then equals:
float avg = sum/(width*height*3);
I was wondering if I could do it even faster with some bit-shifting logic. And I am mostly asking this question to learn more about bit-shifting as I doubt any answer will speed up the program really significantly.

R, G and B do not contribute equally to the perceived intensity. A better way to sum things up than this:
sum+=(pixel&0x00FF0000)+(pixel&0x0000FF00)+(pixel&0x000000FF);
would be, with the necessary bit-shifting and weighting (assuming 00RRGGBB):
sum+= ((pixel&0x00FF0000)>>16) * .30 / 255
+ ((pixel&0x0000FF00)>> 8) * .59 / 255
+ (pixel&0x000000FF) * .11 / 255;
You might want to leave the /255 part out here and replace the floating point numbers with scaled-up integer numbers (like 30, 59 and 11), bearing in mind that you'll need a long sum to prevent overflow to a reasonable degree.
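The integer variant described above could look like the following sketch. It works on an int[] of 0x00RRGGBB pixels rather than calling getRGB per pixel. The class and method names are hypothetical, and the weights 30, 59 and 11 are the scaled-up versions of .30, .59 and .11.

```java
final class GraySum {
    // Returns the weighted average intensity, in [0, 255], of 0x00RRGGBB pixels.
    static double averageIntensity(int[] pixels) {
        long sum = 0; // long, so large images cannot overflow the accumulator
        for (int pixel : pixels) {
            int r = (pixel >> 16) & 0xff;
            int g = (pixel >> 8) & 0xff;
            int b = pixel & 0xff;
            sum += 30 * r + 59 * g + 11 * b; // integer weights, total 100
        }
        // Undo the scale factor of 100 once, at the end.
        return pixels.length == 0 ? 0.0 : sum / (100.0 * pixels.length);
    }
}
```

Dividing once at the end keeps the inner loop free of floating-point work.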

Adjusting audio volume in real time from byte[]?

I am trying to write a simple application which plays sound and can alter the volume of that sound at any time during playing. I am doing this by converting each pair of bytes in the byte array of the sound into an int, then multiplying that int by increase or decrease in volume and then writing them back as two bytes (i.e. 1 sample). However, this results in extreme distortion in the sound. Is it possible that I have got the bit shifting wrong? My sound format is:
.wav 44100.0hz, 16bit, little-endian
At the moment the byte array that I pass the adjustVolume method represents a 10th of a second of audio data. i.e. sampleRate/10
Is there something I am missing here that is causing it to distort and not scale the volume properly? Have I got the writing of bytes back and forth wrong?
private byte[] adjustVolume(byte[] audioSamples, double volume) {
byte[] array = new byte[audioSamples.length];
for (int i = 0; i < array.length; i += 2) {
// convert byte pair to int
int audioSample = (int) (((audioSamples[i + 1] & 0xff) << 8) | (audioSamples[i] & 0xff));
audioSample = (int) (audioSample * volume);
// convert back
array[i] = (byte) audioSample;
array[i + 1] = (byte) (audioSample >> 16);
}
return array;
}
This code is based off of: Audio: Change Volume of samples in byte array in which the asker is trying to do the same thing. However, having used the code from his question (which I think was not updated after he got his answer) I can't get it to work and I am not exactly sure what it is doing.

I suggest you wrap your byte array in a ByteBuffer (not forgetting to set its .order() to little endian), read a short, manipulate it, write it again.
Sample code:
// Necessary in order to convert negative shorts!
private static final int USHORT_MASK = (1 << 16) - 1;
final ByteBuffer buf = ByteBuffer.wrap(audioSamples)
.order(ByteOrder.LITTLE_ENDIAN);
final ByteBuffer newBuf = ByteBuffer.allocate(audioSamples.length)
.order(ByteOrder.LITTLE_ENDIAN);
int sample;
while (buf.hasRemaining()) {
sample = (int) buf.getShort() & USHORT_MASK;
sample *= volume;
newBuf.putShort((short) (sample & USHORT_MASK));
}
return newBuf.array();
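A variant worth comparing, assuming the data is signed 16-bit PCM (the usual encoding for 16-bit WAV files), keeps each sample signed and clamps the scaled value to the short range instead of masking it. This is a different treatment of the sign than the snippet above, shown here as a sketch; the class name VolumeScaler is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class VolumeScaler {
    // Scales little-endian signed 16-bit samples by volume, clamping to avoid wrap-around.
    static byte[] adjustVolume(byte[] audioSamples, double volume) {
        ByteBuffer in = ByteBuffer.wrap(audioSamples).order(ByteOrder.LITTLE_ENDIAN);
        ByteBuffer out = ByteBuffer.allocate(audioSamples.length).order(ByteOrder.LITTLE_ENDIAN);
        while (in.remaining() >= 2) {
            long scaled = Math.round(in.getShort() * volume); // getShort() keeps the sign
            scaled = Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, scaled));
            out.putShort((short) scaled);
        }
        return out.array(); // a trailing odd byte, if any, is left as zero
    }
}
```

Clamping means that turning the volume up on a loud passage clips at full scale rather than wrapping to the opposite sign, which is one common source of harsh distortion.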

Changing alpha of rgb bit

I am setting the alpha of an RGB pixel of a BufferedImage in Java.
This code changes the alpha value, but I can't retrieve the same value after saving the file.
How can I overcome this problem?
// ================ Code for setting alpha ===============
int alpha=140;
// alpha value to set in rgb
int b=alpha<<24;
b=b|0x00ffffff;
ialpha.setRGB(0, 0,ialpha.getRGB(0, 0)&b);
// ialpha is a bufferedimage of type TYPE_INT_ARGB
ImageIO.write(ialpha, "png", new File("C:/newimg.png"));
System.out.println("\nFile saved !");
// ================ Code for getting alpha ===============
int val=(ialpha.getRGB(0, 0)&0xff000000)>>24;
if(val<0)
val=256+val;
System.out.println("Returned alpha value:"+val);
This just returns 255 as the alpha value. It does not return the value I set, i.e. 140.
Please help me retrieve the alpha value I previously set.

The problem is in the code for getting the alpha. In the second bit shift operation, you don't take the sign bit into consideration.
int val=(ialpha.getRGB(0, 0) & 0xff000000) >> 24;
This will give the value 0xffffff8c (given your initial alpha of 140, or 0x8c).
See Bitwise and Bit Shift Operators for more detail. In particular:
The unsigned right shift operator ">>>" shifts a zero into the leftmost position, while the leftmost position after ">>" depends on sign extension.
You need to do either:
int val = (ialpha.getRGB(0, 0) & 0xff000000) >>> 24; // shift zero into left
Or:
int val = (ialpha.getRGB(0, 0) >> 24) & 0xff; // mask out the sign part
PS: I tend to prefer the latter, because most people (myself included) never remember what the >>> operator actually does.. ;-)
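The difference between the shift operators can be seen in isolation with a small sketch. The class and method names are hypothetical, and the sample value 0x8C123456 is simply an ARGB int whose alpha byte is 0x8c (140).

```java
final class AlphaShift {
    // Signed shift: the sign bit is copied into the vacated high bits.
    static int alphaSigned(int argb) {
        return argb >> 24;
    }

    // Unsigned shift: zeros are shifted in from the left.
    static int alphaUnsigned(int argb) {
        return argb >>> 24;
    }

    // Signed shift followed by a mask that discards the sign-extended bits.
    static int alphaMasked(int argb) {
        return (argb >> 24) & 0xff;
    }
}
```

For 0x8C123456, alphaSigned yields 0xffffff8c (that is, -116), while alphaUnsigned and alphaMasked both yield 140.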
