I'm not very experienced with Rust and I'm trying to diagnose a performance problem. Below is a fairly fast Java program (it runs in 7 seconds) and what I believe should be the equivalent Rust code. However, the Rust code runs very slowly (yes, I compiled it with --release), and it also appears to overflow. Changing i32 to i64 just pushes the overflow later, but it still happens. I suspect there is a bug in what I wrote, but after staring at the problem for a long time, I decided to ask for help.
public class Blah {
    static final int N = 100;
    static final int K = 50;

    public static void main(String[] args) {
        // initialize S
        int[] S = new int[N];
        for (int n = 1; n <= N; n++) S[n-1] = n*n;
        // compute maxsum and minsum
        int maxsum = 0;
        int minsum = 0;
        for (int n = 0; n < K; n++) {
            minsum += S[n];
            maxsum += S[N-n-1];
        }
        // initialize x and y
        int[][] x = new int[K+1][maxsum+1];
        int[][] y = new int[K+1][maxsum+1];
        y[0][0] = 1;
        // bottom-up DP over n
        for (int n = 1; n <= N; n++) {
            x[0][0] = 1;
            for (int k = 1; k <= K; k++) {
                int e = S[n-1];
                for (int s = 0; s < e; s++) x[k][s] = y[k][s];
                for (int s = 0; s <= maxsum-e; s++) {
                    x[k][s+e] = y[k-1][s] + y[k][s+e];
                }
            }
            int[][] t = x;
            x = y;
            y = t;
        }
        // sum of unique K-subset sums
        int sum = 0;
        for (int s = minsum; s <= maxsum; s++) {
            if (y[K][s] == 1) sum += s;
        }
        System.out.println(sum);
    }
}
extern crate ndarray;

use ndarray::prelude::*;
use std::mem;

fn main() {
    let numbers: Vec<i32> = (1..101).map(|x| x * x).collect();
    let deg: usize = 50;
    let mut min_sum: usize = 0;
    for i in 0..deg {
        min_sum += numbers[i] as usize;
    }
    let mut max_sum: usize = 0;
    for i in deg..numbers.len() {
        max_sum += numbers[i] as usize;
    }
    // Make an array
    let mut x = OwnedArray::from_elem((deg + 1, max_sum + 1), 0i32);
    let mut y = OwnedArray::from_elem((deg + 1, max_sum + 1), 0i32);
    y[(0, 0)] = 1;
    for n in 1..numbers.len() + 1 {
        x[(0, 0)] = 1;
        println!("Completed step {} out of {}", n, numbers.len());
        for k in 1..deg + 1 {
            let e = numbers[n - 1] as usize;
            for s in 0..e {
                x[(k, s)] = y[(k, s)];
            }
            for s in 0..max_sum - e + 1 {
                x[(k, s + e)] = y[(k - 1, s)] + y[(k, s + e)];
            }
        }
        mem::swap(&mut x, &mut y);
    }
    let mut ans = 0;
    for s in min_sum..max_sum + 1 {
        if y[(deg, s)] == 1 {
            ans += s;
        }
    }
    println!("{}", ans);
}
To diagnose a performance issue in general, I:

1. Get a baseline time or rate. Preferably create a test case that only takes a few seconds, as profilers tend to slow down the system a bit. You will also want to iterate frequently.
2. Compile in release mode with debugging symbols (for Rust, see the Cargo.toml sketch below).
3. Run the code in a profiler. I'm on OS X, so my main choice is Instruments, but I also use Valgrind.
4. Find the hottest code path, think about why it's slow, try something, measure.

The last step is the hard part.
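For step 2, when building with Cargo, a minimal sketch (assuming a standard Cargo project layout) is to keep debug symbols in optimized builds via Cargo.toml:

[profile.release]
debug = true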
In your case, you have a separate implementation that you can use as your baseline. Comparing the two implementations, we can see that your data structures differ. In Java, you are building nested arrays, but in Rust you are using the ndarray crate. I know that crate has a good maintainer, but I personally don't know anything about its internals or which use cases it fits best.
So I rewrote the code using the standard-library Vec.
The other thing I know is that direct array indexing isn't as fast as using an iterator: indexed access has to perform a bounds check on every element, while iterators bake the bounds check into themselves. Many times this means using methods on Iterator.
The other change is to perform bulk data transfers when you can. Instead of copying element by element, move whole slices around using methods like copy_from_slice.
With those changes the code looks like this (apologies for poor variable names, I'm sure you can come up with semantic names for them):
use std::mem;

const N: usize = 100;
const DEGREE: usize = 50;

fn main() {
    let numbers: Vec<_> = (1..N+1).map(|v| v*v).collect();

    let min_sum = numbers[..DEGREE].iter().fold(0, |a, &v| a + v as usize);
    let max_sum = numbers[DEGREE..].iter().fold(0, |a, &v| a + v as usize);

    // x and y use a different data type here: nested Vecs instead of ndarray!
    let mut x = vec![vec![0; max_sum+1]; DEGREE+1];
    let mut y = vec![vec![0; max_sum+1]; DEGREE+1];
    y[0][0] = 1;

    for &e in &numbers {
        let e2 = max_sum - e + 1;
        let e3 = e + e2;

        x[0][0] = 1;

        for k in 0..DEGREE {
            let current_x = &mut x[k+1];
            let prev_y = &y[k];
            let current_y = &y[k+1];

            // bulk copy
            current_x[0..e].copy_from_slice(&current_y[0..e]);

            // more bulk copy
            current_x[e..e3].copy_from_slice(&prev_y[0..e2]);

            // avoid array index
            for (x, y) in current_x[e..e3].iter_mut().zip(&current_y[e..e3]) {
                *x += *y;
            }
        }

        mem::swap(&mut x, &mut y);
    }

    let sum = y[DEGREE][min_sum..max_sum+1].iter().enumerate().filter(|&(_, &v)| v == 1).fold(0, |a, (i, _)| a + i + min_sum);
    println!("{}", sum);
    println!("{}", sum == 115039000);
}
2.060s - Rust 1.9.0
2.225s - Java 1.7.0_45-b18
On OS X 10.11.5 with a 2.3 GHz Intel Core i7.
I'm not experienced enough with Java to know what kinds of optimizations it can do automatically.
The biggest potential next step I see is to leverage SIMD instructions when performing the addition; it's pretty much exactly what SIMD is made for.
As pointed out by Eli Friedman, avoiding array indexing by zipping isn't currently the most performant way of doing this.
With the changes below, the time is now 1.267s.
let xx = &mut current_x[e..e3];
xx.copy_from_slice(&prev_y[0..e2]);

let yy = &current_y[e..e3];
for i in 0..(e3-e) {
    xx[i] += yy[i];
}
This generates assembly that appears to unroll the loop as well as using SIMD instructions:
+0x9b0 movdqu -48(%rsi), %xmm0
+0x9b5 movdqu -48(%rcx), %xmm1
+0x9ba paddd %xmm0, %xmm1
+0x9be movdqu %xmm1, -48(%rsi)
+0x9c3 movdqu -32(%rsi), %xmm0
+0x9c8 movdqu -32(%rcx), %xmm1
+0x9cd paddd %xmm0, %xmm1
+0x9d1 movdqu %xmm1, -32(%rsi)
+0x9d6 movdqu -16(%rsi), %xmm0
+0x9db movdqu -16(%rcx), %xmm1
+0x9e0 paddd %xmm0, %xmm1
+0x9e4 movdqu %xmm1, -16(%rsi)
+0x9e9 movdqu (%rsi), %xmm0
+0x9ed movdqu (%rcx), %xmm1
+0x9f1 paddd %xmm0, %xmm1
+0x9f5 movdqu %xmm1, (%rsi)
+0x9f9 addq $64, %rcx
+0x9fd addq $64, %rsi
+0xa01 addq $-16, %rdx
+0xa05 jne "slow::main+0x9b0"
Related
How would I go about parallelizing this piece of code with the use of Threads in Java? It extracts all the contours from an image and creates a new image containing only those contours.
import java.io.*;
import java.awt.image.*;
import javax.imageio.ImageIO;
import java.awt.Color;

public class Contornos {
    static int h, w;
    static float debugTime;

    public static void main(String[] args) {
        try {
            File fichImagen = new File("test.jpg");
            BufferedImage image = ImageIO.read(fichImagen);
            w = image.getWidth();
            h = image.getHeight();
            int[] inicial = new int[w * h];
            int[] resultadoR = new int[w * h];
            int[] resultadoG = new int[w * h];
            int[] resultadoB = new int[w * h];
            int[][] procesarR = new int[h][w];
            int[][] procesarG = new int[h][w];
            int[][] procesarB = new int[h][w];
            int[][] procesarBN = new int[h][w];
            int[][] binaria = new int[h][w];
            int[] resultado = new int[w * h];
            image.getRGB(0, 0, w, h, inicial, 0, w);
            for (int i = 0; i < w * h; i++) {
                Color c = new Color(inicial[i]);
                resultadoR[i] = c.getRed();
                resultadoG[i] = c.getGreen();
                resultadoB[i] = c.getBlue();
            }
            int k = 0;
            for (int i = 0; i < h; i++) {
                for (int j = 0; j < w; j++) {
                    procesarR[i][j] = resultadoR[k];
                    procesarG[i][j] = resultadoG[k];
                    procesarB[i][j] = resultadoB[k];
                    k++;
                }
            }
            for (int i = 0; i < h; i++) {
                for (int j = 0; j < w; j++) {
                    procesarBN[i][j] = (int) (0.2989 * procesarR[i][j] + 0.5870 * procesarG[i][j] + 0.1140 * procesarB[i][j]);
                }
            }
            binaria = extraerContornos(procesarBN);
            k = 0;
            for (int i = 0; i < h; i++) {
                for (int j = 0; j < w; j++) {
                    resultado[k++] = binaria[i][j];
                }
            }
            image.setRGB(0, 0, w, h, resultado, 0, w);
            ImageIO.write(image, "JPG", new File("allJPG.jpg"));
        } catch (IOException e) {
        }
    }

    static void debugStart() {
        debugTime = System.nanoTime();
    }

    static void debugEnd() {
        float elapsedTime = System.nanoTime() - debugTime;
        System.out.println((elapsedTime / 1000000) + " ms ");
    }

    private static int[][] extraerContornos(int[][] matriz) {
        int modx, mody;
        int[][] sobelx = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
        int[][] sobely = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};
        int[][] modg = new int[h][w];
        double[][] theta = new double[h][w];
        int[][] thetanor = new int[h][w];
        int[][] contorno = new int[h][w];
        int umbral = 10;
        int superan = 0, ncontorno = 0;
        double t;
        int signo;
        int uno, dos;
        for (int i = 0; i < h; i++) {
            for (int j = 0; j < w; j++) {
                if (i == 0 || i == h - 1 || j == 0 || j == w - 1) {
                    modg[i][j] = 0;
                    theta[i][j] = 0.0;
                    thetanor[i][j] = 0;
                } else {
                    modx = 0;
                    mody = 0;
                    for (int k = -1; k <= 1; k++) {
                        for (int l = -1; l <= 1; l++) {
                            modx += matriz[i + k][j + l] * sobelx[k + 1][l + 1];
                            mody += matriz[i + k][j + l] * sobely[k + 1][l + 1];
                        }
                    }
                    modx = modx / 4;
                    mody = mody / 4;
                    modg[i][j] = (int) Math.sqrt(modx * modx + mody * mody);
                    theta[i][j] = Math.atan2(mody, modx);
                    thetanor[i][j] = (int) (theta[i][j] * 256.0 / (2.0 * Math.PI));
                }
            }
        }
        for (int i = 1; i < h - 1; i++) {
            for (int j = 1; j < w - 1; j++) {
                contorno[i][j] = 0;
                if (modg[i][j] >= umbral) {
                    superan++;
                    t = Math.tan(theta[i][j]);
                    if (t >= 0.0) {
                        signo = 1;
                    } else {
                        signo = -1;
                    }
                    if (Math.abs(t) < 1.0) {
                        uno = interpolar(modg[i][j + 1], modg[i - signo][j + 1], t);
                        dos = interpolar(modg[i][j - 1], modg[i + signo][j - 1], t);
                    } else {
                        t = 1 / t;
                        uno = interpolar(modg[i - 1][j], modg[i - 1][j + signo], t);
                        dos = interpolar(modg[i + 1][j], modg[i + 1][j - signo], t);
                    }
                    if (modg[i][j] > uno && modg[i][j] >= dos) {
                        ncontorno++;
                        contorno[i][j] = 255;
                    }
                }
            }
        }
        debugEnd();
        return contorno;
    }

    private static int interpolar(int valor1, int valor2, double tangente) {
        return (int) (valor1 + (valor2 - valor1) * Math.abs(tangente));
    }
}
I believe I can use Threads in the extraerContornos method (for the for loops), and join() them at the end to get the results, but that's just my guess.
Would that be a correct way to parallelize this? Any tips in general on how to know when and where you should start parallelizing any code?
Tips in general on how to know when and where you should start parallelizing any code?
Well, never ever start parallelizing any code without having quantitatively supported evidence that it will improve system performance.
NEVER EVER,
even if any academicians or wannabe gurus tell you to do so.
First collect a fair amount of evidence that it makes any sense at all, and how big a positive edge such code re-engineering will bring over the original, pure-[SERIAL], code-execution flow.
It is like in nature or in business -- who will ever pay a single cent more for getting the same result?
Who will pay X [man*hours] of work at current salary rates for getting just the first 1.01x improvement in performance ( not speaking of wannabe-parallel-gangstas, who manage to deliver even worse than the original performance because of the unseen, hidden costs of add-on overheads ) -- who will ever pay for this?
How do you start to analyse the possible benefits versus the negative impacts?
First of all, try to understand the "mechanics" of how the layered, composite system -- consisting of [ O/S-kernel, programming language, user program ] -- can orchestrate going forward using either "just"-[CONCURRENT] or true-[PARALLEL] process-scheduling.
Without knowing this, one can never quantify the actual costs of entry. Sometimes people even pay all such costs without ever realising that the resulting processing flow is not even "just"-[CONCURRENT] processing -- think of the central "concurrency-preventing" exclusive locking of the Python GIL, which may well help mask some sorts of I/O latency but never improves any kind of CPU-bound processing performance, while one still has to pay all the immense costs of spawning full copies of the process execution environment plus the Python-internal state -- all that for receiving nothing at the end. NOTHING. Yes, things can go that badly if poor or missing knowledge precedes a naive attempt at "go parallelize" activism.
Ok, once you feel comfortable with the operating-system "mechanics" available for spawning threads and processes, you can guesstimate, or better benchmark, the costs of doing so -- and start working quantitatively: knowing how many [ns] one will have to pay to spawn the first, second, ... thirty-ninth child thread or separate O/S process, or what the add-on costs will be for using some higher-level language constructor that fans out a herd of threads/processes, distributes some amount of work, and finally collects the heaps of results back to the original requestor ( using just the high-level syntax of .map(...){...}, .foreach(...){...} et al., which at their lower ends do all the dirty job hidden from the sight of the user-program designer, not to speak of "just"-coders who spend zero effort on a fully responsible understanding of the "mechanics" + "economy" of the costs of their "just"-coded work ).
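For illustration, a minimal Java sketch of benchmarking the spawn-plus-join cost of a thread ( methodology deliberately simplified; real numbers vary by platform and should be taken with a grain of salt ):

public class SpawnCost {
    public static void main(String[] args) throws InterruptedException {
        final int TRIALS = 1000;
        long total = 0;
        for (int i = 0; i < TRIALS; i++) {
            long start = System.nanoTime();
            Thread t = new Thread(() -> { /* empty body: measure pure spawn/join overhead */ });
            t.start();
            t.join();
            total += System.nanoTime() - start;
        }
        System.out.println("avg spawn+join cost: " + (total / TRIALS) + " [ns]");
    }
}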
Without knowing the actual costs in [ns] ( technically not depicted, for clarity and brevity, in Fig. 1, yet principally always present, being detailed and discussed in the trailer sections ), it makes almost no sense for anyone to try to read, and to try to understand in its full depth and code-design context, the criticism of Amdahl's Law.
It is so easy to pay more than one will receive at the end ...
For more details on this risk, check this and follow the link from the first paragraph, leading to a fully interactive GUI-simulator of the actual costs of overheads, once introduced into the costs/benefits formula.
Back to your code:
The Sobel-filter kernel introduces non-local dependencies under a ( naive ) thread-mapping, so it is better to start with a simpler section, where absolute independence is directly visible.
The following may save all the repetitive for(){...}-constructor overheads and increase performance:
for ( int i = 0; i < h; i++ ) {
    for ( int j = 0; j < w; j++ ) {
        Color c = new Color( inicial[i * w + j] );
        procesarBN[i][j] = (int) ( 0.2989 * c.getRed()
                                 + 0.5870 * c.getGreen()
                                 + 0.1140 * c.getBlue()
                                 );
    }
}
Instead of these triple-for(){...}-s:
for (int i = 0; i < w * h; i++) {
    Color c = new Color(inicial[i]);
    resultadoR[i] = c.getRed();
    resultadoG[i] = c.getGreen();
    resultadoB[i] = c.getBlue();
}
int k = 0;
for (int i = 0; i < h; i++) {
    for (int j = 0; j < w; j++) {
        procesarR[i][j] = resultadoR[k];
        procesarG[i][j] = resultadoG[k];
        procesarB[i][j] = resultadoB[k];
        k++;
    }
}
for (int i = 0; i < h; i++) {
    for (int j = 0; j < w; j++) {
        procesarBN[i][j] = (int) (0.2989 * procesarR[i][j] + 0.5870 * procesarG[i][j] + 0.1140 * procesarB[i][j]);
    }
}
Effects?
In the [SERIAL]-part of the Amdahl's Law:
at net zero add-on costs: improved / eliminated 2/3 of the for(){...}-constructor looping overhead costs
at net zero add-on costs: improved / eliminated the ( 4 * h * w * 3 )-memIO ( i.e. not paying ~ h * w * 1.320+ [us] each !!! )
at net zero add-on costs: improved / eliminated the ( 4 * h * w * 3 * 4 )-memALLOCs, again saving a remarkable amount of resources in both the [TIME] and [SPACE] polynomially scaled domains of the complexity-ZOO taxonomy
and you may also feel safe to run this pass in [CONCURRENT] processing, as the pixel-value processing is principally independent here ( but not in the Sobel, nor in the contour-detector algorithm ) -- see the thread-pool sketch at the end of this answer.
So, here,
any [CONCURRENT] or [PARALLEL] process-scheduling may help, if,
at some non-zero add-on cost, the processing gets to harness multiple computing resources ( more than the 1 CPU-core that was operated in the original, pure-[SERIAL], code-execution ), with the work safely pixel-grid-mapped onto such an ( available, resources-supported ) thread-pool or other code-processing facility.
Yet,
any attempt to go non-[SERIAL] makes sense if and only if the lump sum of all the process-allocation / deallocation et al. add-on costs gets at least justified by an increased amount of [CONCURRENT]-ly processed calculations.
Paying more than receiving is definitely not a smart move...
So, benchmark, benchmark and benchmark, before deciding what may have a positive effect on production code.
Always try to get improvements in the pure-[SERIAL] sections first, as these have zero add-on costs and yet may reduce the overall processing time.
Q.E.D. above.
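As a sketch only ( the band-splitting and pool size are illustrative assumptions -- benchmark them, never trust them ), the independent grayscale pass above could be mapped onto a thread-pool, one disjoint row-band per task:

import java.awt.Color;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelGrayscale {
    // Each task writes a disjoint band of rows of procesarBN,
    // so no synchronization on the output array is needed.
    static void parallelGrayscale(final int[] inicial, final int[][] procesarBN,
                                  final int h, final int w, int nThreads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        int band = (h + nThreads - 1) / nThreads; // rows per task, rounded up
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int t = 0; t < nThreads; t++) {
            final int rowStart = t * band;
            final int rowEnd = Math.min(h, rowStart + band);
            tasks.add(() -> {
                for (int i = rowStart; i < rowEnd; i++) {
                    for (int j = 0; j < w; j++) {
                        Color c = new Color(inicial[i * w + j]);
                        procesarBN[i][j] = (int) (0.2989 * c.getRed()
                                                + 0.5870 * c.getGreen()
                                                + 0.1140 * c.getBlue());
                    }
                }
                return null;
            });
        }
        pool.invokeAll(tasks); // blocks until all bands are done
        pool.shutdown();
    }
}

Whether this beats the [SERIAL] loop depends on the image size and on the pool / task add-on costs discussed above -- measure it.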
In my computer architecture class I just learned that running an algebraic expression involving multiplication through a multiplication circuit can be more costly than running it through an addition circuit, if the number of required multiplications is less than 3. For example: 3x. If I'm doing this type of computation a few billion times, does it pay off to write it as x + x + x, or does the JIT optimizer handle this?
I wouldn't expect there to be a huge difference between writing it one way or the other.
The compiler will probably take care of making all of those equivalent.
You can try each method and measure how long it takes, that could give you a good hint to answer your own question.
Here's some code that does the same calculations 10 million times using different approaches (x + x + x, 3*x, and a bit shift followed by a subtraction).
They seem to all take approx the same amount of time as measured by System.nanoTime.
Sample output for one run:
sum : 594599531
mult : 568783654
shift : 564081012
You can also take a look at this question that talks about how compiler's optimization can probably handle those and more complex cases: Is shifting bits faster than multiplying and dividing in Java? .NET?
Code:
import java.util.Random;

public class TestOptimization {
    public static void main(String args[]) {
        Random rn = new Random();
        long l1 = 0, l2 = 0, l3 = 0;
        long nano1 = System.nanoTime();
        for (int i = 1; i < 10000000; i++) {
            int num = rn.nextInt(100);
            l1 += sum(num);
        }
        long nano2 = System.nanoTime();
        for (int i = 1; i < 10000000; i++) {
            int num = rn.nextInt(100);
            l2 += mult(num);
        }
        long nano3 = System.nanoTime();
        for (int i = 1; i < 10000000; i++) {
            int num = rn.nextInt(100);
            l3 += shift(num);
        }
        long nano4 = System.nanoTime();
        System.out.println(l1);
        System.out.println(l2);
        System.out.println(l3);
        System.out.println("sum : " + (nano2 - nano1));
        System.out.println("mult : " + (nano3 - nano2));
        System.out.println("shift : " + (nano4 - nano3));
    }

    private static long sum(long x) {
        return x + x + x;
    }

    private static long mult(long x) {
        return 3 * x;
    }

    private static long shift(long x) {
        return (x << 2) - x;
    }
}
I have a question which actually requires a bit of understanding of the Euclidean Algorithm. The problem is simple. An int First and an int Second are given by the user via Scanner.
Then we need to find their greatest common divisor. The process goes as explained below:
Now assume that the First number is 42 and the Second is 30 -- as given by the user.
int x, y;
(x * First) + (y * Second) = gcd(First, Second); // x ? y ?
To find the GCD you may use gcd(First, Second); the code is below:
public static int gcd(int a, int b)
{
    if (a == 0 || b == 0) return a + b; // base case
    return gcd(b, a % b);
}
Sample Input: First: 24 Second: 48 and Output should be x: (-3) and y: 2
Sample Input: First: 42 Second: 30 and Output should be x: (-2) and y: 3
Sample Input: First: 35 Second: 05 and Output should be x: (0) and y: 1
(x * First) + (y * Second) = gcd(First, Second); // How can we find x and y ?
I would very much appreciate it if you could show a solution code-wise in Java. Thanks for checking!
The Extended Euclidean Algorithm is described in this Wikipedia article. The basic algorithm is stated like this (it looks better in the Wikipedia article):
More precisely, the standard Euclidean algorithm with a and b as input consists of computing a sequence q_1, ..., q_k of quotients and a sequence r_0, ..., r_{k+1} of remainders such that

r_0 = a, r_1 = b, ...
r_{i+1} = r_{i-1} - q_i * r_i and 0 <= r_{i+1} < |r_i| ...

It is the main property of Euclidean division that the inequalities on the right define r_{i+1} uniquely from r_{i-1} and r_i.
The computation stops when one reaches a remainder r_{k+1} which is zero; the greatest common divisor is then the last non-zero remainder r_k.
The extended Euclidean algorithm proceeds similarly, but adds two other sequences defined by

s_0 = 1, s_1 = 0, t_0 = 0, t_1 = 1, ...
s_{i+1} = s_{i-1} - q_i * s_i
t_{i+1} = t_{i-1} - q_i * t_i
This should be easy to implement in Java, but the mathematical way it's expressed may make it hard to understand. I'll try to break it down.
Note that this is probably going to be easier to implement in a loop than recursively.
In the standard Euclidean algorithm, you compute r_{i+1} in terms of r_{i-1} and r_i. This means that you have to save the two previous versions of r. This part of the formula:

r_{i+1} = r_{i-1} - q_i * r_i and 0 <= r_{i+1} < |r_i| ...

just means that r_{i+1} will be the remainder when r_{i-1} is divided by r_i. q_i is the quotient, which you don't use in the standard Euclidean algorithm, but you do use in the extended one. So Java code to perform the standard Euclidean algorithm (i.e. compute the GCD) might look like:
prevPrevR = a;
prevR = b;
while ([something]) {
    nextR = prevPrevR % prevR;
    quotient = prevPrevR / prevR; // not used in the standard algorithm
    prevPrevR = prevR;
    prevR = nextR;
}
Thus, at any point, prevPrevR will essentially be r_{i-1}, and prevR will be r_i. The algorithm computes the next r, r_{i+1}, then shifts everything, which in essence increments i by 1.
The extended Euclidean algorithm is done the same way, saving two s values, prevPrevS and prevS, and two t values, prevPrevT and prevT. I'll let you work out the details.
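A minimal Java sketch of carrying those extra sequences along ( variable names are illustrative, and the loop condition is one possible choice ):

// Returns {gcd, x, y} such that x*a + y*b == gcd(a, b).
static long[] extendedGcd(long a, long b) {
    long prevPrevR = a, prevR = b;
    long prevPrevS = 1, prevS = 0; // s_0 = 1, s_1 = 0
    long prevPrevT = 0, prevT = 1; // t_0 = 0, t_1 = 1
    while (prevR != 0) {
        long quotient = prevPrevR / prevR;
        long nextR = prevPrevR - quotient * prevR; // same as prevPrevR % prevR
        long nextS = prevPrevS - quotient * prevS;
        long nextT = prevPrevT - quotient * prevT;
        prevPrevR = prevR; prevR = nextR;
        prevPrevS = prevS; prevS = nextS;
        prevPrevT = prevT; prevT = nextT;
    }
    // When prevR reaches zero, prevPrevR holds the gcd and
    // (prevPrevS, prevPrevT) are the Bezout coefficients.
    return new long[] { prevPrevR, prevPrevS, prevPrevT };
}

For a = 42, b = 30 this returns gcd 6 with x = -2 and y = 3, matching the sample above.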
Thanks for helping me out, ajb. I solved it after digging into your answer. For the people who would like to see it code-wise:
import java.util.Scanner;

public class Main
{
    @SuppressWarnings("resource")
    public static void main (String args[])
    {
        System.out.println("How many times would you like to try?");
        Scanner read = new Scanner(System.in);
        int len = read.nextInt();
        for(int w = 0; w < len; w++)
        {
            System.out.print("Please give the numbers separated by a space: ");
            read.nextLine();
            long tmp = read.nextLong();
            long m = read.nextLong();
            long n;
            if (m < tmp) {
                n = m;
                m = tmp;
            }
            else {
                n = tmp;
            }
            long[] l1 = {m, 1, 0};
            long[] l2 = {n, 0, 1};
            long[] l3 = new long[3];
            while (l1[0] - l2[0] * (l1[0] / l2[0]) > 0) {
                for (int j = 0; j < 3; j++) l3[j] = l2[j];
                long q = l1[0] / l2[0];
                for (int i = 0; i < 3; i++) {
                    l2[i] = (l1[i] - l2[i] * q);
                }
                for (int k = 0; k < 3; k++) l1[k] = l3[k];
            }
            System.out.printf("%d %d %d", l2[1], l2[2], l2[0]); // first two: Bezout's identity; last one: gcd
        }
    }
}
Here is the code that I came up with, if anyone is still looking. It is in C#, but it is similar to Java. Enjoy.
using System;
using System.Collections.Generic;

class Program
{
    static void Main(string[] args)
    {
        List<long> U = new List<long>();
        List<long> V = new List<long>();
        List<long> W = new List<long>();
        long a, b, d, x, y;
        Console.Write("Enter value for a: ");
        string firstInput = Console.ReadLine();
        long.TryParse(firstInput, out a);
        Console.Write("Enter value for b: ");
        string secondInput = Console.ReadLine();
        long.TryParse(secondInput, out b);
        long temp;
        // Make sure that a > b
        if (a < b)
        {
            temp = a;
            a = b;
            b = temp;
        }
        // Initialise list U
        U.Add(a);
        U.Add(1);
        U.Add(0);
        // Initialise list V
        V.Add(b);
        V.Add(0);
        V.Add(1);
        while (V[0] > 0)
        {
            decimal difference = U[0] / V[0];
            var roundedDown = Math.Floor(difference);
            long rounded = Convert.ToInt64(roundedDown);
            for (int i = 0; i < 3; i++)
                W.Add(U[i] - rounded * V[i]);
            U.Clear();
            for (int i = 0; i < 3; i++)
                U.Add(V[i]);
            V.Clear();
            for (int i = 0; i < 3; i++)
                V.Add(W[i]);
            W.Clear();
        }
        d = U[0];
        x = U[1];
        y = U[2];
        Console.WriteLine("\nd = {0}, x = {1}, y = {2}", d, x, y);
        // Check the equation
        Console.WriteLine("\nEquation check: d = ax + by\n");
        Console.WriteLine("\t{0} = {1}({2}) + {3}({4})", d, a, x, b, y);
        Console.WriteLine("\t{0} = {1} + {2}", d, a * x, b * y);
        Console.WriteLine("\t{0} = {1}", d, (a * x) + (b * y));
        if (d == (a * x) + (b * y))
            Console.WriteLine("\t***Equation is satisfied!***");
        else
            Console.WriteLine("\tEquation is NOT satisfied!");
    }
}
I'm working on a practice program at InterviewStreet and I have a solution that runs with a time of 5.15xx seconds, while the maximum time allowed for a Java solution is 5 seconds. Is there anything I can do with what I've got here to get it under 5 seconds? There's also a limit of 256 MB, so as near as I can tell this is both the most time- and memory-efficient solution to the problem...
edit:
The possible values for N and K are N <= 10^9 and K <= N, which is why I chose to do everything using BigInteger. The maximum number of trials is 10000. So basically, you input the number of trials, then a pair of integer values for each trial, and the program computes the three versions of the binomial coefficient for the equation in the second loop. I figured it would be faster to read everything into the arrays, then process the arrays and put the results into a third array to be processed by the third loop. I tried doing everything in the same loop and it ran slower.
I've tried three or four different algorithms for calculating the binomial coefficient (nCr - or n choose r, all are different ways of saying the same thing). Some of the algorithms involve a two dimensional array like c[n][k]. This is the only solution I've submitted that didn't come back with some sort of memory error. The answer needs to be output mod (10 ^ 6) + 3 because the answers of nCr * nCr get pretty huge. A sample run of the program is:
3
4 1
5 2
90 13
2
5
815483
I can't run it on a faster machine because it needs to pass on their machine to count; basically, I submit the code and they run it against their test cases. I have no idea what their test cases are, just that the inputs are within the bounds given above.
And the program itself:
import java.math.BigInteger;
import java.util.Scanner;

public class Solution {
    public BigInteger nCr(int n, int r) {
        if (r > n) {
            return BigInteger.ZERO;
        }
        if (r > n / 2) {
            r = n - r;
        }
        BigInteger result = BigInteger.ONE;
        for (int i = 0; i < r; i++) {
            result = result.multiply(BigInteger.valueOf(n - i));
            result = result.divide(BigInteger.valueOf(i + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        Scanner input = new Scanner(System.in);
        BigInteger m = BigInteger.valueOf(1000003);
        Solution p = new Solution();
        short T = input.nextShort(); // Number of trials
        BigInteger intermediate = BigInteger.ONE;
        int[] r = new int[T];
        int[] N = new int[T];
        int[] K = new int[T];
        short x = 0;
        while (x < T) {
            N[x] = input.nextInt();
            K[x] = input.nextInt();
            x++;
        }
        x = 0;
        while (x < T) {
            if (N[x] >= 3) {
                r[x] = ((p.nCr(N[x] - 3, K[x]).multiply(p.nCr(N[x] + K[x], N[x] - 1))).divide(BigInteger.valueOf((N[x] + K[x]))).mod(m)).intValue();
            } else {
                r[x] = 0;
            }
            x++;
        }
        x = 0;
        while (x < T) {
            System.out.println(r[x]);
            x++;
        }
    }
}
Not entirely sure what the algorithm is trying to accomplish but I'm guessing something with binomial coefficients based on your tagging of the post.
You'll need to check if my suggestion modifies the result but it looks like you could merge two of your while loops:
Original:
while (x < T) {
    N[x] = input.nextInt();
    K[x] = input.nextInt();
    x++;
}
x = 0;
while (x < T) {
    if (N[x] >= 3) {
        r[x] = ((p.nCr(N[x] - 3, K[x]).multiply(p.nCr(N[x] + K[x], N[x] - 1))).divide(BigInteger.valueOf((N[x] + K[x]))).mod(m)).intValue();
    } else {
        r[x] = 0;
    }
    x++;
}
New:
x = 0;
while (x < T) {
    // this has been moved from your first while loop
    N[x] = input.nextInt();
    K[x] = input.nextInt();
    if (N[x] >= 3) {
        r[x] = ((p.nCr(N[x] - 3, K[x]).multiply(p.nCr(N[x] + K[x], N[x] - 1))).divide(BigInteger.valueOf((N[x] + K[x]))).mod(m)).intValue();
    } else {
        r[x] = 0;
    }
    x++;
}
Try running it under a profiler, for example jvisualvm: run this class with
-Dcom.sun.management.jmxremote
then attach to the process and start a profile.
I'm sure I'm making a rookie mistake with Java (this is actually my first program). I am trying to port some working Python code I made into Java (as a learning/testing exercise to learn a bit about the differences), but I'm getting different results between the two.
My program takes a list of data and generates another list based on it (basically it checks whether a value can be broken down as a sum). Python correctly gives 2,578 results while Java only gives 12. I tried to find the equivalent commands in Java and thought I had, but I can't seem to figure out why the results differ. (In the past I have had problems with multithreading and the syncing of variables; I wasn't sure if Java was doing anything behind the scenes, so I added a while loop to keep running until the results stabilize, but it didn't help.) Any suggestions would be helpful.
Here's the offending code (Java at the top; the Python/pseudo code is commented out at the bottom as reference):
for (int c = 0; c <= max_value; c++) {
    String temp_result = (s - c * data.get(i) + "," + i);
    if (results.contains(temp_result)) {
        String result_to_add = (s + "," + i+1);
        if (results.contains(result_to_add)) {
            System.out.println("contains result already");
        } else {
            results.add(result_to_add);
        }
    }
}
/*
# python:
# print len(T)
# Here's the basic pseudo code (I added a few control variables but here's a high level view):
for i = 1 to k:
    for z = 0 to sum:
        for c = 1 to z / x_i:
            if T[z - c * x_i][i - 1] is true:
                set T[z][i] to true
*/
In Java, s + "," + i+1 is a String concatenation: "10" + "," + 4 + 1 evaluates left to right and returns "10,41".
Use String result_to_add = s + "," + (i+1); instead.
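A standalone snippet showing the difference ( sample values for s and i assumed ):

public class ConcatDemo {
    public static void main(String[] args) {
        int s = 10, i = 4;
        System.out.println(s + "," + i + 1);   // prints "10,41": left-to-right string concatenation
        System.out.println(s + "," + (i + 1)); // prints "10,5": parentheses force integer addition first
    }
}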
I see you've solved it just now, but since I've written it already, here's my version:
This uses the trick of using a Point as a substitute for a 2-element Python list/tuple of int, which (coincidentally) bypasses your String concatenation issue.
import java.awt.Point;
import java.util.ArrayList;
import java.util.List;

public class Sums
{
    public static void main(String[] args)
    {
        List<Point> T = new ArrayList<>();
        T.add(new Point(0, 0));
        int target_sum = 100;
        int[] data = new int[] { 10, -2, 5, 50, 20, 25, 40 };
        float max_percent = 1;
        int R = (int) (target_sum * max_percent * data.length);
        for (int i = 0; i < data.length; i++)
        {
            for (int s = -R; s < R + 1; s++)
            {
                int max_value = (int) Math.abs((target_sum * max_percent) / data[i]);
                for (int c = 0; c < max_value + 1; c++)
                {
                    if (T.contains(new Point(s - c * data[i], i)))
                    {
                        Point p = new Point(s, i + 1);
                        if (!T.contains(p))
                        {
                            T.add(p);
                        }
                    }
                }
            }
        }
        System.out.println(T.size());
    }
}