How to calculate complexity of internal iterations - java

This is about identifying the time complexity of a Java program. If I have explicit iterations such as for or while loops, I can work out the complexity. But if I use a Java API method to do some task and it iterates internally, I think that should be counted as well. If so, how do I do that?
Example:
String someString = "some reasonably long text";
for (int i = 0; i < someLength; i++) {
    someString.contains("something"); // I think internal iteration happens here; how do I work that into the time complexity?
}
Thanks,
Aditya

Internal operations in the Java APIs have their own time complexity, determined by their implementation. For example, the contains method of String runs in linear time, where the dependency is on the length of your someString variable.
In short: you should check how the inner operations work and take them into consideration when calculating the overall complexity.
For your code in particular, the time complexity is roughly O(N*K), where N is the number of iterations of your loop (someLength) and K is the length of your someString variable.
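To make the hidden work visible, here is a hedged sketch of what a naive substring scan does internally; the real String.contains implementation differs in its details, but the nested scanning is why the call is roughly linear in the length of the haystack:

// Illustrative only: a naive substring scan, not the actual JDK code.
static boolean naiveContains(String haystack, String needle) {
    // Outer loop: every candidate start position in haystack -> up to K iterations.
    for (int start = 0; start <= haystack.length() - needle.length(); start++) {
        int matched = 0;
        // Inner loop: compare characters of needle -> up to needle.length() iterations.
        while (matched < needle.length()
                && haystack.charAt(start + matched) == needle.charAt(matched)) {
            matched++;
        }
        if (matched == needle.length()) {
            return true;
        }
    }
    return false;
}

Calling something like this inside your for loop multiplies the two costs, which is where the O(N*K) estimate comes from.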

You are correct in that the internal iterations will add to your complexity. However, except in a fairly small number of cases, the complexity of API methods is not well documented. Many collection operations come with an upper bound requirement for all implementations, but even in such cases there is no guarantee that the actual code doesn't have lower complexity than required. For cases like String.contains() an educated guess is almost certain to be correct, but again there is no guarantee.
Your best bet for a consistent metric is to look at the source code for the particular API implementation you are using and attempt to figure out the complexity from that. Another good approach would be to run benchmarks on the methods you care about with a wide range of input sizes and types and simply estimate the complexity from the shape of the resulting graph. The latter approach will probably yield better results for cases where the code is too complex to analyze directly.
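As a rough sketch of the benchmarking approach (a quick-and-dirty timer rather than a rigorous harness, so treat the numbers as indicative only), you could time the method over a range of input sizes and look at how the measurements grow:

import java.util.Arrays;

public class ContainsTiming {
    public static void main(String[] args) {
        for (int size = 1_000; size <= 1_000_000; size *= 10) {
            char[] chars = new char[size];
            Arrays.fill(chars, 'a');
            String haystack = new String(chars);   // worst case: the pattern never matches
            long start = System.nanoTime();
            for (int i = 0; i < 100; i++) {
                haystack.contains("ab");           // repeat to smooth out noise
            }
            long elapsed = System.nanoTime() - start;
            System.out.printf("length %,d -> %,d ns%n", size, elapsed);
        }
    }
}

If the time grows roughly tenfold for each tenfold increase in length, linear behaviour is a reasonable guess; keep in mind that JIT warm-up and garbage collection can distort a naive measurement like this.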

Related

How to optimize a function in Java to make it faster?

public static ArrayList<Integer> duplicates(int[] arr) {
    ArrayList<Integer> doubles = new ArrayList<Integer>();
    boolean isEmpty = true;
    for (int i = 0; i < arr.length; i++) {
        for (int j = i + 1; j < arr.length; j++) {
            if (arr[i] == arr[j] && !doubles.contains(arr[i])) {
                doubles.add(arr[i]);
                isEmpty = false;
                break;
            }
        }
    }
    if (isEmpty) doubles.add(-1);
    Collections.sort(doubles);
    return doubles;
}

public static void main(String[] args) {
    System.out.println(duplicates(new int[]{1, 2, 3, 4, 4, 4})); // Output: [4]
}
I made this function in Java which returns the duplicate values of an input int array, or -1 if the input array is empty or contains no duplicates.
It works, but there is probably a way to make it faster.
Are there any good practices for making functions more efficient and faster in general?
There are, in broad strokes, 2 completely unrelated performance improvements you can make:
Reduce algorithmic complexity. This is a highly mathematical concept.
Reduce actual performance characteristics - literally, just make it run faster and/or use less memory (often, 'use less memory' and 'goes faster' go hand in hand).
The first is easy enough, but can be misleading: You can write an algorithm that does the same job in an algorithmically less complex way which nevertheless actually runs slower.
The second is also tricky: your eyeballs and brain cannot do the job. The engineers who write the JVM itself are on record as stating that, in general, they have no idea how fast any given piece of code actually runs, because the JVM is far too complicated. It has many avenues for optimizing how fast stuff runs, complicated both in the code that powers them and in how they behave. For example, hotspot kicks in eventually and uses the characteristics of previous runs to decide how best to rewrite a given method into finely tuned machine code, and the hardware you run on also matters rather a lot.
This leads to the following easy conclusions:
Don't do anything unless there is an actual performance issue.
You really want a profiler report that actually indicates which code is 'relevant'. Generally, for any given java app, literally 1% of all of your lines of code is responsible for 99% of the load. There is just no point at all optimizing anything, except that 1%. A profiler report is useful in finding the 1% that requires the attention. Java ships with a profiler and there are commercial offerings as well.
If you want to micro-benchmark (time a specific slice of code against specific inputs), that's really difficult too, with many pitfalls. There's really only one way to do it right: Use the Java Microbenchmark Harness.
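For reference, a minimal JMH benchmark for the duplicates method might look roughly like this (a hedged sketch: it assumes the JMH dependency is on the classpath and that the original static duplicates method is visible from the benchmark class, e.g. copied into it):

import java.util.Random;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class DuplicatesBenchmark {
    int[] input;

    @Setup
    public void setUp() {
        // Hypothetical input shape; replace with data that matches your real workload.
        input = new Random(42).ints(100_000, 1, 1001).toArray();
    }

    @Benchmark
    public Object nestedLoops() {
        return duplicates(input);   // the original method under test
    }

    public static void main(String[] args) throws Exception {
        org.openjdk.jmh.Main.main(args);
    }
}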
Whilst you can decide to focus on algorithmic complexity, you may still want a profiler report or JMH run because algorithmic complexity is all about 'Eventually, i.e. with large enough inputs, the algorithmic complexity overcomes any other performance aspect'. The trick is: Are your inputs large enough to hit that 'eventually' space?
For this specific algorithm, given that I have no idea what reasonable inputs might be, you're going to have to do the work on setting up JMH and or profiler runs. However, as far as algorithmic complexity goes:
That doubles.contains call has O(N) algorithmic complexity: the time it takes grows linearly with the number of elements already in the doubles list.
You can get O(1) algorithmic complexity if you use a HashSet instead.
From the point of view of plain performance, an ArrayList<Integer> generally carries a considerably larger performance and memory overhead than a plain int[].
This gives 2 alternate obvious strategies to optimize this code:
Replace the ArrayList<Integer> with an int[].
Replace the ArrayList<Integer> with a HashSet<Integer> instead.
You can't really combine these two, not without spending a heck of a long time handrolling a primitive int array backed hashbucket implementation. Fortunately, someone did the work for you: Eclipse Collections has a primitive int hashset implementation.
Theoretically it's hard to imagine how replacing this with IntHashSet can be slower. However, I can't go on record and promise you that it'll be any faster: I can imagine if your input is an int array with a few million ints in there, IntHashSet is probably going to be many orders of magnitude faster. But you really need test data and a profiler report and/or a JMH run or we're all just guessing, which is a bad idea, given that the JVM is such a complex beast.
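For completeness, a hedged sketch of what that could look like (assuming the eclipse-collections dependency; the class is org.eclipse.collections.impl.set.mutable.primitive.IntHashSet, if memory serves). It applies both strategies at once: the duplicates container is a primitive int hash set, and the result comes back as an int[]:

import org.eclipse.collections.impl.set.mutable.primitive.IntHashSet;

public static int[] duplicatesIntSet(int[] arr) {
    IntHashSet doubles = new IntHashSet();              // O(1) contains/add, no boxing
    for (int i = 0; i < arr.length; i++) {
        for (int j = i + 1; j < arr.length; j++) {
            if (arr[i] == arr[j] && !doubles.contains(arr[i])) {
                doubles.add(arr[i]);
                break;
            }
        }
    }
    if (doubles.isEmpty()) {
        return new int[] { -1 };
    }
    return doubles.toSortedArray();                     // replaces Collections.sort
}

Whether that actually helps for your inputs is exactly what the JMH run is for.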
So, if you're serious about optimizing this:
Write a bunch of test cases.
Write a wrapper around this code so you can run those tests in a JMH setup.
Replace the code with IntHashSet and compare that vs. the above in your JMH harness.
If that really improves things and the performance now fits your needs, great. You're done.
If not, you may have to re-evaluate where and how you use this code, or if there's anything else you can do to optimize things.
It works, but there is probably a way to make it faster.
I think you will find this approach significantly faster. I omitted the sort from both methods for the timing comparison. This does not cover general optimizations, as rzwitserloot's excellent answer already does that.
The two main problems with your method are:
you are using a nested loop, which is inherently an O(N*N) approach,
and you use contains on a list, which must do a linear search each time to find the value.
A better way is to use a HashSet, which has close to O(1) lookup time (relatively speaking, and depending on the set's threshold values).
The idea is as follows.
Create two sets, one for the result and one for what has already been seen.
Iterate over the array.
Try to add each value to the seen set. If add returns true, the value was not in the seen set yet, so nothing more needs to happen.
If add returns false, the value was already in the seen set, so it is a duplicate and is added to the duplicates set.
Note the use of the bang ! to invert that condition in the code.
Once the loop is finished, return the duplicates in a list as required.
public static List<Integer> duplicatesSet(int[] arr) {
    Set<Integer> seen = new HashSet<>();
    Set<Integer> duplicates = new HashSet<>();
    for (int v : arr) {
        if (!seen.add(v)) {
            duplicates.add(v);
        }
    }
    return duplicates.isEmpty()
            ? new ArrayList<>(List.of(-1))
            : new ArrayList<>(duplicates);
}
The sort is easily added back in. That will take additional computing time but that was not the real problem.
To test this I generated a list of random values and put them in an array. The following generates an array of 1,000,000 ints between 1 and 1000 inclusive.
Random r = new Random();
int[] val = r.ints(1_000_000, 1, 1001).toArray();
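For what it's worth, a rough harness along those lines might look like this (a hedged sketch using naive nanoTime timing, so treat the numbers as indicative only; it assumes both duplicates and duplicatesSet are visible from this class). The nested-loop version may take a very long time at the full 1,000,000 elements, so a smaller array is used here:

import java.util.Random;

public class DuplicatesComparison {
    public static void main(String[] args) {
        // 50,000 ints between 1 and 1000 inclusive; small enough for the O(N*N) version to finish.
        int[] val = new Random(42).ints(50_000, 1, 1001).toArray();

        long t1 = System.nanoTime();
        duplicatesSet(val);                          // the HashSet-based version above
        long setTime = System.nanoTime() - t1;

        long t2 = System.nanoTime();
        duplicates(val);                             // the original nested-loop version
        long loopTime = System.nanoTime() - t2;

        System.out.printf("set: %,d ns, nested loops: %,d ns%n", setTime, loopTime);
    }
}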

Abstract Algorithm: String / Byte Comparison / Diff

This is a rather abstract question as I yet have no idea how to solve it and haven't found any suitable solutions.
Let's start with the current situation. You have a collection of byte[] arrays (e.g. an ArrayList<byte[]>) which behind the scenes are actually Strings, but in the current state the byte[] form is preferred. The arrays can be very long (1024+ bytes each, and the ArrayList may contain up to 1024 of them) and they can have different lengths. Furthermore, they share a lot of the same bytes at roughly the same locations ("roughly" because of insertions: for a = {0x41, 0x41, 0x61} and b = {0x41, 0x41, 0x42, 0x61}, the first 0x41 and the last 0x61 are the same).
I'm looking now for an algorithm that compares all those arrays with each other. The result should be the array that differs the most and how much they differ from each other (some kind of metric). Furthermore, the task should complete within a short time.
If possible, without using any third-party libraries (but I doubt that is feasible in a reasonable time without one).
Any suggestions are very welcome.
Edit:
Made some adjustments.
EDIT / SOLUTION:
I'm using the Levenshtein distance now. Furthermore, I've made some slight adjustments to improve the runtime / speed. This is very specific to the data I'm handling, since I know that all the Strings have a lot in common (and I know approximately where). Filtering that shared content out first improves the speed by a factor of 400 compared to running the Levenshtein distance algorithm directly on two unfiltered Strings (test data).
Thanks for your input / answers, they were a great assistance.
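For reference, a hedged sketch of the standard two-row dynamic-programming Levenshtein distance (not necessarily the exact variant used here, but the classic formulation), operating directly on the byte[] form; the filtering of the known common content described above would happen before calling it:

// Classic Levenshtein distance: O(m*n) time, O(n) extra space.
public static int levenshtein(byte[] a, byte[] b) {
    int[] prev = new int[b.length + 1];
    int[] curr = new int[b.length + 1];
    for (int j = 0; j <= b.length; j++) {
        prev[j] = j;                                      // distance from the empty prefix of a
    }
    for (int i = 1; i <= a.length; i++) {
        curr[0] = i;
        for (int j = 1; j <= b.length; j++) {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                        prev[j] + 1),     // deletion
                               prev[j - 1] + cost);       // substitution or match
        }
        int[] tmp = prev; prev = curr; curr = tmp;        // reuse the two rows
    }
    return prev[b.length];
}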
The result should be the array that differs the most and how much they differ from each other (some kind of metric). Furthermore, the task should complete within a short time.
You will not find a solution in which your metric and the running time are independent; they go hand in hand.
For example: if your metric is like the example from your post, i.e. d(str1,str2) = d(str1.first,str2.first) + d(str1.last,str2.last), then the solution is very easy: sort your array by first and last character (maybe separately), and then take the first and last element of the sorted array. This gives you O(n log n) for the sort.
But if your metric is something like "two sentences are close if they contain many equal words", then this does not work at all, and you end up with O(n²). Or you may be able to come up with a nifty way to re-order your words within the sentences before sorting the sentences etc. etc.
So unless you have a known metric, it's O(n²) with the trivial (naive) implementation of comparing everything while keeping track of the maximum delta.
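The trivial implementation mentioned above looks something like this (a hedged sketch; distance stands in for whatever metric you settle on, and "differs the most" is interpreted here as the largest summed distance to all the others):

import java.util.List;

// O(n^2) metric evaluations over the whole collection.
public static byte[] mostDifferent(List<byte[]> arrays) {
    byte[] outlier = null;
    long maxTotal = -1;
    for (int i = 0; i < arrays.size(); i++) {
        long total = 0;
        for (int j = 0; j < arrays.size(); j++) {
            if (i != j) {
                total += distance(arrays.get(i), arrays.get(j));  // e.g. the Levenshtein method above
            }
        }
        if (total > maxTotal) {
            maxTotal = total;
            outlier = arrays.get(i);
        }
    }
    return outlier;
}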

Big O notation of just a return statement?

Am I right in saying that the time complexity in big O notation would just be O(1)?
public boolean size() {
    return (size == 0);
}
Am I right in saying that the time complexity in big O notation would just be O(1)?
No.
This is so common a misconception among students/pupils that I can only constantly repeat this:
Big-O notation is meant to describe how some measured quantity of an algorithm or program grows as a function of some other quantity:
For example, saying:
"The algorithm for in-place FFT has a space requirement of O(n), with n being the number of FFT bins"
says something about how much the FFT will need in memory, observed for different lengths of the FFT.
So, you don't specify:
1. What is the thing you're actually observing? Is it the time between calling and returning from your method? Is it the comparison alone? Is "time" measured in Java bytecode instructions, or real machine cycles?
2. What do you vary? The number of calls to your method? The variable size?
3. What is it that you actually want to know?
I'd like to stress point 3: computer science students often think that they know how something will behave if they just know the theoretical time complexity of an algorithm. In reality, these numbers tend to mean very little on their own. A single fetch of a variable that is not in the CPU cache can take as long as 100-10000 additions in the CPU. Calling a method just to see whether something is 0 will take a few dozen instructions if directly compiled, and might take a lot more in something that is (semi-)interpreted like Java; however, in Java, the next time you call that same method, it might already be sitting there as precompiled machine code...
Then, if your compiler is very smart, it might not only inline the function, eliminating the stack save/restore and call/return instructions, but possibly even merging the result into whatever instructions you were conditioning on that return value, which in essence means that this function, in an extreme case, might not take a single cycle to execute.
So, no matter how you put it, you cannot state the big-O time complexity of a language-specific feature without saying what you vary and exactly what your platform is.

What is the principle behind calculating the complexity of methods?

According to the Sonar metrics complexity page, the following method has a complexity of 5.
public void process(Car myCar) {                                    <- +1
    if (myCar.isNotMine()) {                                        <- +1
        return;                                                     <- +1
    }
    myCar.paint("red");
    myCar.changeWheel();
    while (myCar.hasGazol() && myCar.getDriver().isNotStressed()) { <- +2
        myCar.drive();
    }
    return;
}
This is how the tool calculates complexity:
Keywords incrementing the complexity: if, for, while, case, catch, throw, return (that is not the last statement of a method), &&, ||, ?
Why do case statements, if blocks and while blocks increase the complexity of the method? What is the intuition behind this way of calculating the complexity of methods?
It's because they have conditions in them which increase the number of tests needed to ensure that the code is correct.
Note also that an if probably adds less complexity than a loop (while, for). Also read up on cyclomatic complexity, which is closely related to this.
Read this blog post; it describes the reality of not being able to test everything, and the sheer number of tests you would need in order to do so.
It is probably based on McCabe's cyclomatic complexity (at least it looks like it).
This metric is widely used in the Software Engineering field.
Take a look at this: http://en.wikipedia.org/wiki/Cyclomatic_complexity
Sonar measures cyclomatic complexity, which represents the number of linearly independent paths through the source code.
The key to answering your question comes from a research paper of Thomas McCabe, published in December of 1976:
It can be shown that the cyclomatic complexity of any structured program with only one entrance point and one exit point is equal to the number of decision points (i.e., 'if' statements or conditional loops) contained in that program plus one.
This is precisely what Sonar does: it finds the decision points, which come from loops, conditional statements, and multipart boolean expressions, and counts their number.
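As a quick illustration of "decision points plus one", here is another small (made-up) method with the same style of counting:

public int countSpecial(int n) {                     // +1: method entry (the "plus one")
    if (n < 0) {                                     // +1: if
        return 0;                                    // +1: return that is not the last statement
    }
    int count = 0;
    for (int i = 0; i < n; i++) {                    // +1: for
        if (i % 2 == 0 && i % 3 == 0) {              // +2: if and &&
            count++;
        }
    }
    return count;                                    // last statement: not counted
}                                                    // cyclomatic complexity: 6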

Algorithm to reduce satisfiability java

Is there any algorithm to reduce a SAT problem?
Satisfiability is the problem of determining if the variables of a given Boolean formula can be assigned in such a way as to make the formula evaluate to TRUE. Equally important is to determine whether no such assignments exist, which would imply that the function expressed by the formula is identically FALSE for all possible variable assignments. In this latter case, we would say that the function is unsatisfiable; otherwise it is satisfiable. To emphasize the binary nature of this problem, it is frequently referred to as Boolean or propositional satisfiability. The shorthand "SAT" is also commonly used to denote it, with the implicit understanding that the function and its variables are all binary-valued.
I have used genetic algorithms to solve this, but would it be easier if the formula were reduced first?
Take a look at Reduced Order Binary Decision Diagrams (ROBDD). They provide a way of compressing boolean expressions into a reduced canonical form. There is plenty of software around for performing the BDD reduction; the Wikipedia link above for ROBDD contains a nice list of external links to other relevant packages at the bottom of the article.
You could probably do a depth-first search over the formula's expression tree to identify "paths" - i.e., for (ICanEat && (IHaveSandwich || IHaveBanana)), if ICanEat is false, the values in the brackets don't matter and can be ignored. Right there you can discard some edges and nodes.
And if, while you're doing this depth-first search, the current node resolves to true, you've found your solution.
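A hedged sketch of that idea (using records, so Java 16+): evaluating an expression tree with Java's own short-circuit operators, so that irrelevant sub-trees are never visited at all:

import java.util.Map;

public class ShortCircuitEval {
    // Minimal expression tree; variable names map to boolean values in the assignment.
    interface Expr { boolean eval(Map<String, Boolean> assignment); }

    record Var(String name) implements Expr {
        public boolean eval(Map<String, Boolean> a) { return a.getOrDefault(name, false); }
    }
    record And(Expr left, Expr right) implements Expr {
        // && short-circuits: if left is false, right is never visited.
        public boolean eval(Map<String, Boolean> a) { return left.eval(a) && right.eval(a); }
    }
    record Or(Expr left, Expr right) implements Expr {
        public boolean eval(Map<String, Boolean> a) { return left.eval(a) || right.eval(a); }
    }

    public static void main(String[] args) {
        // (ICanEat && (IHaveSandwich || IHaveBanana)) from the example above
        Expr formula = new And(new Var("ICanEat"),
                new Or(new Var("IHaveSandwich"), new Var("IHaveBanana")));
        // With ICanEat = false, the whole Or sub-tree is skipped.
        System.out.println(formula.eval(Map.of("ICanEat", false)));   // prints false
    }
}

A full SAT search would additionally branch on unassigned variables, but the pruning principle is the same.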
What do you mean by "reduced", exactly? I'm going to assume you mean some sort of preprocessing beforehand, to maybe eliminate or simplify some variables or clauses first.
It all depends on how much work you want to do. Certainly you should do unit propagation until it completes. There are other, more expensive things you can do. See the pre-processing section of the march_dl page for some examples.
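Unit propagation itself is simple enough to hand-roll. A hedged sketch over a CNF formula represented as int-literal clauses (positive = variable is true, negative = negated, the usual DIMACS-style convention):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UnitPropagation {
    // Repeatedly assigns unit clauses and simplifies. Returns the reduced clause list,
    // or null if an empty clause (conflict) is derived, meaning the formula is
    // unsatisfiable under the forced assignments collected in 'assigned'.
    public static List<List<Integer>> propagate(List<List<Integer>> clauses, Set<Integer> assigned) {
        List<List<Integer>> current = new ArrayList<>(clauses);
        while (true) {
            Integer unit = null;
            for (List<Integer> clause : current) {
                if (clause.isEmpty()) return null;            // conflict
                if (clause.size() == 1) { unit = clause.get(0); break; }
            }
            if (unit == null) return current;                 // nothing left to propagate
            if (assigned.contains(-unit)) return null;        // contradicts an earlier assignment
            assigned.add(unit);
            List<List<Integer>> next = new ArrayList<>();
            for (List<Integer> clause : current) {
                if (clause.contains(unit)) continue;          // clause satisfied, drop it
                List<Integer> reduced = new ArrayList<>(clause);
                reduced.remove(Integer.valueOf(-unit));       // the negated literal can no longer help
                next.add(reduced);
            }
            current = next;
        }
    }

    public static void main(String[] args) {
        // (x1) AND (-x1 OR x2) AND (-x2 OR x3 OR x4)  ->  forces x1, then x2
        List<List<Integer>> cnf = List.of(List.of(1), List.of(-1, 2), List.of(-2, 3, 4));
        Set<Integer> assigned = new HashSet<>();
        System.out.println(propagate(cnf, assigned));         // [[3, 4]]
        System.out.println(assigned);                         // forced assignments: 1 and 2
    }
}

The reduced formula (plus the forced assignments) can then be handed to the genetic algorithm, which now has fewer variables and clauses to search over.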
