I am trying to find a way to calculate and print the ASCII distance between a string from user input
Scanner scan = new Scanner(System.in);
System.out.print("Please enter a string of 5 uppercase characters:");
String userString = scan.nextLine();
and a randomly generated string
int leftLimit = 65; // Upper-case 'A'
int rightLimit = 90; // Upper-case 'Z'
int stringLength = 5;
Random random = new Random();
String randString = random.ints(leftLimit, rightLimit + 1)
.filter(i -> (i <= 57 || i >= 65) && (i <= 90 || i >= 97))
.limit(stringLength)
.collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
.toString();
Is there a way to calculate the distance without having to separate each individual character from the two strings, comparing them and adding them back together?
Use Edit distance (Levenshtein distance)
You can:
implement your own edit distance based on the algorithm on Wikipedia,
use existing source code, for example from Rosetta Code,
use an existing library such as Apache Commons Text's LevenshteinDistance.
You can also check
Levenshtein Distance on Stack Overflow
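For reference, here is a minimal sketch of the standard dynamic-programming edit distance in Java. The names EditDistance and levenshtein are made up for this example. Note that edit distance counts insertions, deletions, and substitutions, so it measures something different from summing per-character code differences.

```java
final class EditDistance {
    // Two-row dynamic programme: prev[j] is the distance between the first
    // i-1 characters of a and the first j characters of b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building b[0..j) from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing a[0..i) to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
    }
}
```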
Streams are, well, as the name says, streams. They don't work very well unless you can define an operation strictly on the basis of one input: One element from a stream, without knowing its index or referring to the entire collection.
Here, that is a problem; after all, to operate on, say, the 'H' in your input, you need the matching character from your random code.
I'm not sure why you find 'separate each individual character, compare them, and add them back together' so distasteful. Isn't that a pretty clean mapping from the problem description to instructions for your computer to run?
The alternative is more convoluted: You could attempt to create a mixed object that contains both the letter as well as its index, stream over this, and use the index to look up the character in the second string. Alternatively, you could attempt to create a mixed object containing both characters (so, for inputs ABCDE and HELLO, an object containing both A and H), but you'd be writing far more code to get that set up than the simple, no-streams way.
So, let's start with the simple way:
int difference = 0;
for (int i = 0; i < stringLength; i++) {
char a = inString.charAt(i);
char b = randomString.charAt(i);
difference += difference(a, b);
}
You'd have to write the difference method yourself - but it'd be a very very simple one-liner.
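As a sketch of what that one-liner might be: the version below assumes "ASCII distance" means the absolute difference of the two character codes, which the question does not spell out. The class name AsciiDistance and the length check are additions for the example.

```java
final class AsciiDistance {
    // Assumed definition: absolute difference of the two character codes.
    static int difference(char a, char b) {
        return Math.abs(a - b);
    }

    // Sums the per-character differences of two equally long strings.
    static int distance(String inString, String randomString) {
        if (inString.length() != randomString.length()) {
            throw new IllegalArgumentException("strings must have the same length");
        }
        int total = 0;
        for (int i = 0; i < inString.length(); i++) {
            total += difference(inString.charAt(i), randomString.charAt(i));
        }
        return total;
    }

    public static void main(String[] args) {
        // 7 + 3 + 9 + 8 + 10
        System.out.println(distance("HELLO", "ABCDE")); // prints 37
    }
}
```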
Trying to take two collections of some sort, and from them create a single stream where each element in the stream is matching elements from each collection (so, a stream of ["HA", "EB", "LC", "LD", "OE"]) is generally called 'zipping' (no relation to the popular file compression algorithm and product), and Java doesn't really support it (yet?). There are some third party libraries that can do it, but given that the above is so simple I don't think zipping is what you're looking for here.
If you absolutely must, I guess it'd look something like:
// a stream of 0,1,2,3,4
IntStream.range(0, stringLength)
// map 0 to "HA", 1 to "EB", etcetera
.mapToObj(idx -> "" + inString.charAt(idx) + randomString.charAt(idx))
// map "HA" to the difference score
.mapToInt(x -> difference(x))
// and sum it.
.sum();
public int difference(String a) {
// exercise for the reader
}
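A slightly shorter variation skips the intermediate two-character String and maps each index straight to its score. This is only a sketch: it again assumes the score is the absolute difference of the character codes, and the class name StreamDistance is made up.

```java
import java.util.stream.IntStream;

final class StreamDistance {
    // Streams over the indices and looks up both characters by index,
    // stopping at the end of the shorter string.
    static int distance(String inString, String randomString) {
        int length = Math.min(inString.length(), randomString.length());
        return IntStream.range(0, length)
                .map(i -> Math.abs(inString.charAt(i) - randomString.charAt(i)))
                .sum();
    }

    public static void main(String[] args) {
        System.out.println(distance("HELLO", "ABCDE")); // prints 37
    }
}
```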
Create a 2D array and fill it with the distances between every pair of characters; you can then index directly into the array to pull out the distance between two characters.
The total is then a single expression that sums up a set of array accesses.
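A rough sketch of the lookup-table idea, assuming both strings contain only the upper-case letters A to Z, have the same length, and that the distance is the absolute difference of the codes. DistanceTable and its members are hypothetical names.

```java
final class DistanceTable {
    private static final int ALPHABET = 26;
    // TABLE[a][b] holds the distance between the a-th and b-th letters.
    private static final int[][] TABLE = new int[ALPHABET][ALPHABET];

    static {
        for (int a = 0; a < ALPHABET; a++) {
            for (int b = 0; b < ALPHABET; b++) {
                TABLE[a][b] = Math.abs(a - b);
            }
        }
    }

    // Assumes equal lengths and characters in 'A'..'Z' only.
    static int distance(String x, String y) {
        int total = 0;
        for (int i = 0; i < x.length(); i++) {
            total += TABLE[x.charAt(i) - 'A'][y.charAt(i) - 'A'];
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(distance("HELLO", "ABCDE")); // prints 37
    }
}
```

For a plain absolute difference the table saves nothing over Math.abs; it only pays off if the per-pair distance is expensive or irregular.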
Here is my code for this (ASCII distance) in MATLAB
function z = asciidistance(input0)
if nargin ~= 1
error('please enter a string');
end
size0 = size(input0);
if size0(1) ~= 1
error ('please enter a string');
end
length0 = size0(2);
rng('shuffle');
a = 32;
b = 127;
string0 = (b-a).*rand(1,length0) + a; % row vector, so input0 - x below is element-wise
x = char(floor(string0));
z = (input0 - x);
ascii0 = sum(abs(z),'all');
ascii1 = abs(sum(z,'all'));
disp(ascii0);
disp(ascii1);
disp(ascii0/ascii1/length0);
end
This script also differentiates between the absolute ASCII distance on a per-character basis vs that on a per-string basis, thus resulting in two integers returned for the ASCII distance.
I have also included the limit of these two values, the value of which approaches the inverse of the length of strings being compared. This actually approximates the entropy, E, of every random string generation event when run.
After standard error checking, the script first finds the length of the input string. The rng function seeds the random number generator. The a and b variables bound the printable part of the ASCII table, which ends at 126 inclusive; 127 is used as an exclusive upper bound so that the next line can generate one random value in that range per input character. The following line converts those values into the corresponding ASCII characters. The line after that subtracts the two strings element-wise and stores the result. The next two lines sum the ASCII distances in the two ways mentioned in the first paragraph. Finally, the values are printed, along with the entropy estimate, E, of the random string generation event.
Input:
There is a long string S and an array of integers A which denotes prefixes of S: A[i] denotes the prefix S[0..A[i]]
Output:
Return an array Output[] of the same size as A where Output[i] is the length of the longest matching suffix of S[0..A[i]] and S
Sample Input:
S = "ababa"
A[]=[0, 1, 2, 3, 4]
Sample Output:
Output[]=[1,0,3,0,5]
The most naive algorithm which I have is for every A[i] just match the number of characters between S[0..A[i]] and S from the end of both strings. But this algorithm is O(n^2) where n is the length of the original String S.
Question:
Is there a better algorithm which preprocesses the string S and can then quickly return the longest matching suffix length for the entire input array?
This is just a Z-function of the reversed string. The slight difference is that the first element of the Z-function is chosen to be equal to the length of S. There is an algorithm to calculate the Z-function of a string in O(n)
And the algorithm for this problem is as follows:
S' := reversed S
Z := Z-function of S'
for each i, Output[i] := Z[Len(S) - A[i] - 1]
For example:
S = "baabaaa"
A[] = [0,1,2,3,4,5,6]
Output[] should be [0,1,2,0,1,2,7]
S' = "aaabaab"
Z = Z-function(S') = [7,2,1,0,2,1,0] (with the first element chosen to be Len(S))
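The three steps can be sketched in Java as follows. The Z-function here is the usual linear-time window algorithm, with z[0] set to the full length as described above; the class and method names are made up for the example.

```java
import java.util.Arrays;

final class LongestSuffixMatch {
    // Linear-time Z-function; z[0] is set to the string length by convention.
    static int[] zFunction(String s) {
        int n = s.length();
        int[] z = new int[n];
        if (n == 0) {
            return z;
        }
        z[0] = n;
        int left = 0, right = 0; // [left, right) is the rightmost matched window
        for (int i = 1; i < n; i++) {
            if (i < right) {
                z[i] = Math.min(right - i, z[i - left]); // reuse earlier work
            }
            while (i + z[i] < n && s.charAt(z[i]) == s.charAt(i + z[i])) {
                z[i]++; // extend the match character by character
            }
            if (i + z[i] > right) {
                left = i;
                right = i + z[i];
            }
        }
        return z;
    }

    // Output[i] = Z[Len(S) - A[i] - 1], where Z is computed on the reversed S.
    static int[] solve(String s, int[] a) {
        int[] z = zFunction(new StringBuilder(s).reverse().toString());
        int[] output = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            output[i] = z[s.length() - a[i] - 1];
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(solve("ababa", new int[] {0, 1, 2, 3, 4})));
        // prints [1, 0, 3, 0, 5]
    }
}
```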
The algorithm / data structure you are looking for is called a suffix tree; building one takes O(n log n) time in the worst case (linear-time constructions exist for constant-size alphabets).
In computer science, a suffix tree (also called PAT tree or, in an
earlier form, position tree) is a compressed trie containing all the
suffixes of the given text as their keys and positions in the text as
their values. Suffix trees allow particularly fast implementations of
many important string operations. (wiki)
Here you can find some slides which explain the functionality and implementation in detail
This is a question regarding a piece of coursework, so I would rather you didn't fully answer the question but instead gave tips to improve the run time complexity of my current algorithm.
I have been given the following information:
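A function g(n) is given by g(n) = f(n,n) where f may be defined recursively by
f(0,0) = 0; f(i,0) = f(0,j) = 1 for i, j > 0; f(i,j) = (f(i-1,j) + f(i-1,j-1) + f(i,j-1)) / 3 otherwise.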
I have implemented this algorithm recursively with the following code:
public static double f(int i, int j)
{
if (i == 0 && j == 0) {
return 0;
}
if (i ==0 || j == 0) {
return 1;
}
return ((f(i-1, j)) + (f(i-1, j-1)) + (f(i, j-1)))/3;
}
This algorithm gives the results I am looking for, but it is extremely inefficient and I am now tasked to improve the run time complexity.
I wrote an algorithm that creates an n*n matrix, computes every element up to the [n][n] element, and then returns the [n][n] element. For example, f(1,1) returns 0.6 recurring, because it is the result of (1+0+1)/3.
I have also created a spreadsheet of the result from f(0,0) to f(7,7) which can be seen below:
Now although this is much faster than my recursive algorithm, it has the huge overhead of creating an n*n matrix.
Any suggestions on how I can improve this algorithm will be greatly appreciated!
I can now see that it is possible to make the algorithm O(n) complexity, but is it possible to work out the result without creating a [n][n] 2D array?
I have created a solution in Java that runs in O(n) time and O(n) space and will post the solution after I have handed in my coursework to stop any plagiarism.
This is another one of those questions where it's better to examine it, before diving in and writing code.
The first thing I'd say you should do is look at a grid of the numbers, represented as fractions rather than decimals.
The first thing that should be obvious is that the power of 3 in each denominator is just a measure of the distance from the origin, i + j.
If you look at a grid in this way, you can get all of the denominators:
Note that the first row and column are not all 1s - they've been chosen to follow the pattern, and the general formula which works for all of the other squares.
The numerators are a little bit more tricky, but still doable. As with most problems like this, the answer is related to combinations, factorials, and then some more complicated things. Typical entries here include Catalan numbers, Stirling's numbers, Pascal's triangle, and you will nearly always see Hypergeometric functions used.
Unless you do a lot of maths, it's unlikely you're familiar with all of these, and there is a hell of a lot of literature. So I have an easier way to find out the relations you need, which nearly always works. It goes like this:
1. Write a naive, inefficient algorithm to get the sequence you want.
2. Copy a reasonably large amount of the numbers into Google.
3. Hope that a result from the Online Encyclopedia of Integer Sequences pops up.
3.b. If one doesn't, then look at some differences in your sequence, or some other sequence related to your data.
4. Use the information you find to implement said sequence.
So, following this logic, here are the numerators:
Now, unfortunately, googling those yielded nothing. However, there are a few things you can notice about them, the main one being that the first row/column are just powers of 3, and that the second row/column are one less than powers of three. This kind of boundary is exactly the same as in Pascal's triangle and a lot of related sequences.
Here is the matrix of differences between the numerators and denominators:
Where we've decided that the f(0,0) element shall just follow the same pattern. These numbers already look much simpler. Also note, rather interestingly, that these numbers follow the same rules as the initial numbers, except that the first number is one (and they are offset by a column and a row). T(i,j) = T(i-1,j) + T(i,j-1) + 3*T(i-1,j-1):
1
1 1
1 5 1
1 9 9 1
1 13 33 13 1
1 17 73 73 17 1
1 21 129 245 129 21 1
1 25 201 593 593 201 25 1
This looks more like the sequences you see a lot in combinatorics.
If you google numbers from this matrix, you do get a hit.
And then if you cut off the link to the raw data, you get sequence A081578, which is described as a "Pascal-(1,3,1) array", which exactly makes sense - if you rotate the matrix, so that the 0,0 element is at the top, and the elements form a triangle, then you take 1* the left element, 3* the above element, and 1* the right element.
The question now is implementing the formulae used to generate the numbers.
Unfortunately, this is often easier said than done. For example, the formula given on the page:
T(n,k)=sum{j=0..n, C(k,j-k)*C(n+k-j,k)*3^(j-k)}
is wrong, and it takes a fair bit of reading the paper (linked on the page) to work out the correct formula. The sections you want are proposition 26, corollary 28. The sequence is mentioned in Table 2 after proposition 13. Note that r=4
The correct formula is given in proposition 26, but there is also a typo there :/. The k=0 in the sum should be a j=0:
T(n,k) = sum{j=0..n-k, C(k,j)*C(n-k,j)*4^j}
Where T is the triangular matrix containing the coefficients.
The OEIS page does give a couple of implementations to calculate the numbers, but neither of them is in Java, and neither of them can be easily transcribed to Java:
There is a mathematica example:
Table[ Hypergeometric2F1[-k, k-n, 1, 4], {n, 0, 10}, {k, 0, n}] // Flatten
which, as always, is ridiculously succinct. And there is also a Haskell version, which is equally terse:
a081578 n k = a081578_tabl !! n !! k
a081578_row n = a081578_tabl !! n
a081578_tabl = map fst $ iterate
(\(us, vs) -> (vs, zipWith (+) (map (* 3) ([0] ++ us ++ [0])) $
zipWith (+) ([0] ++ vs) (vs ++ [0]))) ([1], [1, 1])
I know you're doing this in Java, but I could not be bothered to transcribe my answer to Java (sorry). Here's a Python implementation:
from __future__ import division
import math
#
# Helper functions
#
def cache(function):
    cachedResults = {}
    def wrapper(*args):
        if args in cachedResults:
            return cachedResults[args]
        else:
            result = function(*args)
            cachedResults[args] = result
            return result
    return wrapper
#cache
def fact(n):
    return math.factorial(n)
#cache
def binomial(n,k):
    if n < k: return 0
    # integer division keeps the coefficient exact for large n
    return fact(n) // ( fact(k) * fact(n-k) )
def numerator(i,j):
    """
    Naive way to calculate numerator
    """
    if i == j == 0:
        return 0
    elif i == 0 or j == 0:
        return 3**(max(i,j)-1)
    else:
        return numerator(i-1,j) + numerator(i,j-1) + 3*numerator(i-1,j-1)
def denominator(i,j):
    return 3**(i+j-1)
def A081578(n,k):
    """
    http://oeis.org/A081578
    """
    total = 0
    for j in range(n-k+1):
        total += binomial(k, j) * binomial(n-k, j) * 4**(j)
    return int(total)
def diff(i,j):
    """
    Difference between the numerator, and the denominator.
    Answer will then be 1-diff/denom.
    """
    if i == j == 0:
        return 1/3
    elif i==0 or j==0:
        return 0
    else:
        return A081578(j+i-2,i-1)
def answer(i,j):
    return 1 - diff(i,j) / denominator(i,j)
# And a little bit at the end to demonstrate it works.
N, M = 10,10
for i in range(N):
    row = "%10.5f"*M % tuple([numerator(i,j)/denominator(i,j) for j in range(M)])
    print(row)
print("")
for i in range(N):
    row = "%10.5f"*M % tuple([answer(i,j) for j in range(M)])
    print(row)
So, for a closed form (for i, j >= 1):
f(i,j) = 1 - 3^(-(i+j-1)) * sum{k=0..j-1, C(i-1,k)*C(j-1,k)*4^k}
Where the C(n,k) are just binomial coefficients.
Here's the result:
One final addition: if you are looking to do this for large numbers, then you're going to need to compute the binomial coefficients a different way, as you'll overflow the integers. Your answers are all floating point though, and since you're apparently interested in large g(n) = f(n,n), I guess you could use Stirling's approximation or something.
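In Java, one way around the overflow (instead of an approximation) is to keep the sum exact with BigInteger and only convert to floating point at the very end. The sketch below mirrors answer(i, j) and A081578 from the Python code for i, j >= 1; ClosedForm and its method names are made up, and the 16-digit MathContext is an arbitrary choice.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.MathContext;

final class ClosedForm {
    // Exact binomial coefficient C(n, k) via the multiplicative formula.
    static BigInteger binomial(int n, int k) {
        if (k < 0 || k > n) {
            return BigInteger.ZERO;
        }
        BigInteger result = BigInteger.ONE;
        for (int i = 1; i <= k; i++) {
            // After this step result equals C(n - k + i, i), so the division is exact.
            result = result.multiply(BigInteger.valueOf(n - k + i))
                           .divide(BigInteger.valueOf(i));
        }
        return result;
    }

    // f(i, j) = 1 - diff / 3^(i+j-1) for i, j >= 1.
    static double f(int i, int j) {
        BigInteger diff = BigInteger.ZERO;
        for (int k = 0; k <= Math.min(i, j) - 1; k++) {
            diff = diff.add(binomial(i - 1, k)
                    .multiply(binomial(j - 1, k))
                    .multiply(BigInteger.valueOf(4).pow(k)));
        }
        BigInteger denominator = BigInteger.valueOf(3).pow(i + j - 1);
        BigDecimal ratio = new BigDecimal(diff)
                .divide(new BigDecimal(denominator), MathContext.DECIMAL64);
        return 1.0 - ratio.doubleValue();
    }

    public static void main(String[] args) {
        System.out.println(f(2, 2)); // 22/27, roughly 0.8148
    }
}
```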
Well for starters here are some things to keep in mind:
This condition can only occur once, yet you test it every time through every loop.
if (x == 0 && y == 0) {
matrix[x][y] = 0;
}
You should instead set matrix[0][0] = 0; and fill the rest of row 0 with 1s right before you enter your first loop, and start x at 1. Since you know x will never be 0 you can remove the first part of your second condition, x == 0:
matrix[0][0] = 0;
for (int y = 1; y <= j; y++) {
    matrix[0][y] = 1; // top edge: f(0, y) = 1
}
for (int x = 1; x <= i; x++)
{
    for (int y = 0; y <= j; y++)
    {
        if (y == 0) {
            matrix[x][y] = 1; // left edge: f(x, 0) = 1
        }
        else
            matrix[x][y] = (matrix[x-1][y] + matrix[x-1][y-1] + matrix[x][y-1])/3;
    }
}
No point in declaring row and column since you only use it once. double[][] matrix = new double[i+1][j+1];
This algorithm has a minimum complexity of Ω(n) because you just need to multiply the values in the first column and row of the matrix with some factors and then add them up. The factors stem from unwinding the recursion n times.
However, you then need to unwind the recursion, which itself has a complexity of O(n^2). But by balancing the unwinding and the evaluation of the recursion, you should be able to reduce the complexity to O(n^x) where 1 <= x <= 2. This is somewhat similar to algorithms for matrix-matrix multiplication, where the naive case has a complexity of O(n^3) but Strassen's algorithm, for example, is O(n^2.807).
Another point is that the original formula uses a factor of 1/3. Since this is not exactly representable by fixed-point numbers or IEEE 754 floating-point numbers, the error increases as the recursion is evaluated step by step. Therefore unwinding the recursion could give you higher accuracy as a nice side effect.
For example, when you unwind the recursion sqrt(n) times, you have complexity O((sqrt(n))^2 + (n/sqrt(n))^2). The first part is for unwinding and the second part is for evaluating a new matrix of size n/sqrt(n). That new complexity can actually be simplified to O(n).
To describe time complexity we usually use big O notation. It is important to remember that it only describes how the running time grows with the input. O(n) is linear time complexity, but it doesn't say how quickly (or slowly) the time grows when we increase the input. For example:
n=3 -> 30 seconds
n=4 -> 40 seconds
n=5 -> 50 seconds
This is O(n), we can clearly see that every increase of n increases the time by 10 seconds.
n=3 -> 60 seconds
n=4 -> 80 seconds
n=5 -> 100 seconds
This is also O(n): even though for every n we need twice as much time, and the rise is 20 seconds for every increase of n, the time still grows linearly.
So if you have O(n*n) time complexity and you halve the number of operations you perform, you will get O(0.5*n*n), which is equal to O(n*n), i.e. your time complexity won't change.
This is theory, in practice the number of operations sometimes makes a difference. Because you have a grid n by n, you need to fill n*n cells, so the best time complexity you can achieve is O(n*n), but there are a few optimizations you can do:
Cells on the edges of the grid could be filled in separate loops. Currently, in the majority of cases you evaluate two unnecessary conditions for i and j equal to 0.
Your grid has a line of symmetry; you could use it to calculate only half of the grid and then copy the results onto the other half. For every i and j, grid[i][j] = grid[j][i]
On a final note, the clarity and readability of the code is much more important than performance: if you can read and understand the code, you can change it, but if the code is so ugly that you cannot understand it, you cannot optimize it. That's why I would do only the first optimization (it also increases readability), but wouldn't do the second one, as it would make the code much more difficult to understand.
As a rule of thumb, don't optimize the code, unless the performance is really causing problems. As William Wulf said:
More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason - including blind stupidity.
EDIT:
I think it may be possible to implement this function with O(1) complexity. Although it gives no benefits when you need to fill entire grid, with O(1) time complexity you can instantly get any value without having a grid at all.
A few observations:
denominator is equal to 3 ^ (i + j - 1)
if i = 2 or j = 2, numerator is one less than denominator
EDIT 2:
The difference between the denominator and the numerator (for i, j >= 1 in f's own indices) can be expressed with the following function:
public static int n(int i, int j) {
if (i == 1 || j == 1) {
return 1;
} else {
return 3 * n(i - 1, j - 1) + n(i - 1, j) + n(i, j - 1);
}
}
Very similar to original problem, but no division and all numbers are integers.
If the question is about how to output all values of the function for 0<=i<N, 0<=j<N, here is a solution in time O(N²) and space O(N). The time behavior is optimal.
Use a temporary array T of N numbers and set it to all ones, except for the first element, which is 0.
Then row by row,
use a temporary element TT and set it to 1,
then column by column (I = 1 to N-1), assign simultaneously T[I-1], TT = TT, (TT + T[I-1] + T[I])/3, and at the end of the row store TT in T[N-1].
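A minimal Java sketch of this one-row scheme, returning only f(n, n). The names RollingRow, g, row, and left are made up; row plays the role of T and left the role of TT. It writes the last value of each row back into the array before moving on, which the next row needs.

```java
import java.util.Arrays;

final class RollingRow {
    // Computes f(n, n) in O(n^2) time while keeping a single row in memory.
    static double g(int n) {
        double[] row = new double[n + 1];
        Arrays.fill(row, 1.0); // f(0, j) = 1 for j > 0
        row[0] = 0.0;          // f(0, 0) = 0
        for (int i = 1; i <= n; i++) {
            double left = 1.0; // f(i, 0) = 1
            for (int j = 1; j <= n; j++) {
                // row[j - 1] and row[j] still hold the previous row here.
                double value = (left + row[j - 1] + row[j]) / 3.0;
                row[j - 1] = left; // publish the current row's previous cell
                left = value;
            }
            row[n] = left; // last cell of the current row
        }
        return row[n];
    }

    public static void main(String[] args) {
        System.out.println(g(2)); // 22/27, roughly 0.8148
    }
}
```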
Thanks to will's (first) answer, I had this idea:
Consider that any positive solution comes only from the 1's along the x and y axes. Each of the recursive calls to f divides each component of the solution by 3, which means we can sum, combinatorially, how many ways each 1 features as a component of the solution, and consider its "distance" (measured as how many calls of f it is from the target) as a negative power of 3.
JavaScript code:
function f(n){
var result = 0;
for (var d=n; d<2*n; d++){
var temp = 0;
for (var NE=0; NE<2*n-d; NE++){
temp += choose(n,NE);
}
result += choose(d - 1,d - n) * temp / Math.pow(3,d);
}
return 2 * result;
}
function choose(n,k){
if (k == 0 || n == k){
return 1;
}
var product = n;
for (var i=2; i<=k; i++){
product *= (n + 1 - i) / i
}
return product;
}
Output:
for (var i=1; i<8; i++){
console.log("F(" + i + "," + i + ") = " + f(i));
}
F(1,1) = 0.6666666666666666
F(2,2) = 0.8148148148148148
F(3,3) = 0.8641975308641975
F(4,4) = 0.8879743941472337
F(5,5) = 0.9024030889600163
F(6,6) = 0.9123609205913732
F(7,7) = 0.9197747256986194
While googling for a way to find all subsets of a String, I read on Wikipedia that
.....For the whole power set of S we get:
{ } = 000 (Binary) = 0 (Decimal)
{x} = 100 = 4
{y} = 010 = 2
{z} = 001 = 1
{x, y} = 110 = 6
{x, z} = 101 = 5
{y, z} = 011 = 3
{x, y, z} = 111 = 7
Is there a way to implement this in a program and avoid the recursive algorithm that uses the string length?
What I understand so far is that, for a String of length n, we can run from 0 to 2^n - 1 and print the characters corresponding to the on bits.
What I couldn't work out is how to map those on bits to the corresponding characters in the most efficient manner.
PS: I checked this thread but couldn't understand it, and it is in C++: Power set generated by bits
The idea is that a power set of a set of size n has exactly 2^n elements, exactly the same number as there are different binary numbers of length at most n.
Now all you have to do is create a mapping between the two, and you don't need a recursive algorithm. Fortunately, binary numbers give you an intuitive and natural mapping: add the character at position j in the string to a subset if your loop variable has bit j set. You can easily test that with the getBit() function in the code below (you could inline it, but I made it a separate function for readability).
P.S. As requested, more detailed explanation on the mapping:
If you have a recursive algorithm, your flow is given by how you traverse your data structure in the recursive calls. It is as such a very intuitive and natural way of solving many problems.
If you want to solve such a problem without recursion for whatever reason, for instance to use less time and memory, you have the difficult task of making this traversal explicit.
As we use a loop with a loop variable which assumes a certain set of values, we need to make sure to map each value of the loop variable, e.g. 42, to one element, in our case a subset of s, in a way that we have a bijective mapping, that is, we map to each subset exactly once. Because we have a set the order does not matter, so we just need whatever mapping that satisfies these requirements.
Now we look at a binary number, e.g. 42 = 32+8+2 and as such in binary with the position above:
543210
101010
We can thus map 42 to a subset as follows using the positions:
order the elements of the set s in any way you like but consistently (always the same in one program execution), we can in our case use the order in the string
add an element e_j if and only if the bit at position j is set (equal to 1).
As each number has at least one digit different from any other, we always get different subsets, and thus our mapping is injective (different input -> different output).
Our mapping is also valid, as the binary numbers we chose have at most the length equal to the size of our set, so the bit positions can always be assigned to an element in the set. Combined with the fact that our set of inputs is chosen to have the same size (2^n) as the power set, it follows that the mapping is in fact bijective.
import java.util.HashSet;
import java.util.Set;
public class PowerSet
{
static boolean getBit(int i, int pos) {return (i & (1 << pos)) != 0;}
static Set<Set<Character>> powerSet(String s)
{
Set<Set<Character>> pow = new HashSet<>();
for(int i=0;i<(1<<s.length());i++)
{
Set<Character> subSet = new HashSet<>();
for(int j=0;j<s.length();j++)
{
if(getBit(i,j)) {subSet.add(s.charAt(j));}
}
pow.add(subSet);
}
return pow;
}
public static void main(String[] args)
{System.out.println(powerSet("xyz"));}
}
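Here is an easy way to do it (a Java fragment, assuming string holds the input and n = string.length()):
for (int i = 0; i < (1 << n); i++) {
    StringBuilder subset = new StringBuilder();
    int k = i; // remaining bits of the counter
    int c = 0; // index of the character the lowest remaining bit refers to
    while (k > 0) {
        if (k % 2 == 1) {
            subset.append(string.charAt(c));
        }
        k = k / 2;
        c++;
    }
    System.out.println(subset);
}
Explanation: the code repeatedly divides the number by 2 and takes the remainder, which yields its binary digits from least significant to most significant. The character at index c is selected only when bit c is 1.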
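To make the counter-to-subset mapping visible, here is a small self-contained sketch that returns the subsets in counter order, so entry i of the list is the subset built from the bits of i. The class name BitmaskSubsets is made up, and the code assumes the string is shorter than 31 characters so that 1 << n fits in an int.

```java
import java.util.ArrayList;
import java.util.List;

final class BitmaskSubsets {
    // Entry i of the result is the subset selected by the set bits of i,
    // where bit j refers to the character at index j.
    static List<String> subsets(String s) {
        int n = s.length(); // assumed to be below 31
        List<String> result = new ArrayList<>();
        for (int mask = 0; mask < (1 << n); mask++) {
            StringBuilder subset = new StringBuilder();
            for (int j = 0; j < n; j++) {
                if ((mask & (1 << j)) != 0) {
                    subset.append(s.charAt(j));
                }
            }
            result.add(subset.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(subsets("xyz"));
        // prints [, x, y, xy, z, xz, yz, xyz]
    }
}
```

Here bit 0 is tied to the first character, whereas the Wikipedia table in the question ties the leftmost bit to x. Either choice gives a valid one-to-one mapping, as long as it is used consistently.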
I realize permutations in programming language is a very frequently asked question, however I feel like my question is sort of unique.
I receive an integer N of some length as input and count its digits in an array, where the entry at index d stores the number of times digit d occurs in N.
Now I want to test whether some condition holds for every permutation of N's digits that keeps N's original length and has no leading zeroes. Example:
int[] digits = new int[10];
String n = "12345675533789025";
for (char c : n.toCharArray())
digits[c-'0']++;
for (Long f : allPermutationsOf(digits))
if (someCondition(f))
System.out.println(f);
A precondition of this code is that N must not exceed 2^63-1 (long's maximum value).
The question is, how would I take all permutations of the digits array and return a Long[] or long[] without using some kind of String concatenation? Is there a way to return a long[] with all permutations of digits[] in the "Integer scope of things" or rather using only integer arithmetic?
To elaborate on one of the above comments, putting a digit d in a given place in the resulting long is easy: d*1 puts it in the 1s place, d*1000 puts it in the thousands place, and in general d * (10^k) puts d into the k+1th digit. You have N total digits to fill, so you need to do permutations on the powers of 10 from 1 to 10^(N-1).
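As a small illustration of the place-value arithmetic, here are two equivalent ways to assemble a long from digits without any strings. The names DigitPlacement, fromDigits, and place are made up for the example, and neither method checks for overflow.

```java
final class DigitPlacement {
    // Builds a long from digits given most significant first,
    // using only multiplication and addition.
    static long fromDigits(int[] digits) {
        long value = 0;
        for (int d : digits) {
            value = value * 10 + d;
        }
        return value;
    }

    // Contribution of digit d at position k, where k = 0 is the ones place:
    // d * 10^k.
    static long place(int d, int k) {
        long power = 1;
        for (int i = 0; i < k; i++) {
            power *= 10;
        }
        return d * power;
    }

    public static void main(String[] args) {
        System.out.println(fromDigits(new int[] {1, 2, 3, 4, 5}));   // prints 12345
        System.out.println(place(1, 0) + place(2, 1) + place(3, 2)); // prints 321
    }
}
```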
If you are expecting the permutations to be Longs anyway, instead of representing n as an array of counts, it might be easier to represent it as a Long too.
Here are a couple of ways you can generate the permutations.
Think of generating permutations as finding the next largest number with the same set of digits, starting from the number consisting of the sorted digits of n. In this case, the answers to this Stack Overflow question are helpful. You can use arithmetic operations and modding instead of string concatenation to implement the algorithm there (I can provide more details if you like). A benefit of this is that the permutations you generate will automatically be in order.
If you don't care about the order of the permutations and you expect the number of digit duplicates to be small, you can use the Steinhaus-Johnson-Trotter algorithm, which (according to Robert Sedgewick) is the fastest algorithm for generating permutations of unique elements. To make sure duplicate permutations are not generated, you would have to distinguish every duplicate digit and only emit the permutations where they appear in order (i.e., if 2 appears three times, then create the elements 2_1, 2_2, 2_3 and make sure those three elements always appear in that order in an emitted permutation).
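A sketch of the first approach in Java, using the classic next-lexicographic-permutation step on an array of digits and only % and / to split and rebuild the numbers. The class and method names are made up. It assumes n > 0 and does not check whether a rearrangement of the digits overflows a long.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DigitPermutations {
    // Rearranges d into the next larger arrangement; returns false if d is the last one.
    static boolean nextPermutation(int[] d) {
        int i = d.length - 2;
        while (i >= 0 && d[i] >= d[i + 1]) i--; // rightmost ascent
        if (i < 0) return false;
        int j = d.length - 1;
        while (d[j] <= d[i]) j--;               // rightmost element larger than d[i]
        int t = d[i]; d[i] = d[j]; d[j] = t;
        for (int lo = i + 1, hi = d.length - 1; lo < hi; lo++, hi--) {
            t = d[lo]; d[lo] = d[hi]; d[hi] = t; // reverse the tail
        }
        return true;
    }

    // All distinct digit permutations of n without leading zeroes, in increasing order.
    static List<Long> permutations(long n) {
        if (n <= 0) throw new IllegalArgumentException("n must be positive");
        int length = 0;
        for (long m = n; m > 0; m /= 10) length++;
        int[] d = new int[length];
        for (int k = length - 1; k >= 0; k--) {
            d[k] = (int) (n % 10);
            n /= 10;
        }
        Arrays.sort(d); // start from the smallest arrangement
        List<Long> result = new ArrayList<>();
        do {
            if (d[0] != 0) { // skip arrangements with a leading zero
                long value = 0;
                for (int digit : d) value = value * 10 + digit;
                result.add(value);
            }
        } while (nextPermutation(d));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(permutations(112)); // prints [112, 121, 211]
    }
}
```

Because the step always moves to the next strictly larger arrangement, repeated digits never produce duplicate results.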
For the requirement, assuming that the length of N is n, we can generate all permutations by going from digit to digit, starting at position 0 and ending at position n - 1, where position 0 is the leading digit.
For each position, we go through each possible digit (0 to 9) only once, which avoids duplicate permutations.
From position x to position x + 1, we can easily build the current value by passing along a number called current.
For example: after position 3 we have current = 1234, so if we choose 5 for position 4, current becomes 1234*10 + 5 = 12345
Sample code in Java:
// Requires: import java.util.ArrayList;
public void generate(int index, int length, int[] digits, long current, ArrayList<Long> result) {
// All permutations are stored in the result ArrayList
for (int i = 0; i < 10; i++) {
if (digits[i] > 0 && (i != 0 || index != 0)) {
digits[i]--;
if (index + 1 == length) {//If this is the last digit, add its value into result
result.add(current * 10 + i);
} else {//else, go to next digit
generate(index + 1, length, digits, current * 10 + i, result);
}
digits[i]++;
}
}
}