I'm having a hard time with dynamic programming. I'm new to it, so I'd appreciate any help you can offer. The problem is this:
As the Communications Officer of the IKS B'Moth Klingon battle cruiser, your duty is to manage communications in the most efficient way. Assume you need to transmit a message S = s1...sm, given as a string of m symbols. For this purpose, you have r different codes. Let b(i,j) be the number of bits needed to encode the i-th symbol of your message in the j-th code. Initially, the bridge transmitter is set to code #1, but you can freely change the code at any point within the message and as many times as you want. To do so, you need to send a control code of C(i,j) bits to switch from the current code i to any other code j. Your goal is to determine how to send the message in the most efficient way (using the least number of bits).
A) Prove the problem exhibits optimal substructure.
B) Find a recurrence for the optimal number of bits required.
C) Build a bottom-up dynamic programming algorithm to solve the problem and indicate its complexity.
You can use a three-dimensional array indexed by previousCode, newCode, and ithSymbol. The array stores the least number of bits needed to send the message up to ithSymbol, given that the code was switched from previousCode to newCode for that symbol.
The recursive formula will be:
dp(ithSymbol, previousCode, newCode) = min over i = 1..r of dp(ithSymbol - 1, i, previousCode) + C(previousCode, newCode) + b(ithSymbol, newCode)
(assuming C(i,i) = 0 for all i)
Now you can write the code for yourself.
N.B. This is the naive approach. You can make it more efficient by using a 2-D array, since only the values for ithSymbol-1 are needed at any step.
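In case it helps to get started, here is a minimal sketch of the bottom-up version. It keeps only one row of the table dp[i][j] = fewest bits to send the first i symbols with the transmitter currently in code j (the names b, c, and minBits, and the 0-based indexing, are my own choices, not part of the problem statement):

import java.util.Arrays;

public class Transmission {
    // b[i][j] = bits to encode symbol i in code j; c[i][j] = bits of the control code
    // to switch from code i to code j (c[i][i] = 0, i.e. staying in a code is free).
    static int minBits(int[][] b, int[][] c) {
        int m = b.length;                  // number of symbols
        int r = c.length;                  // number of codes
        final int INF = Integer.MAX_VALUE / 2;

        int[] prev = new int[r];
        Arrays.fill(prev, INF);
        prev[0] = 0;                       // transmitter starts in code #1 (index 0)

        for (int i = 0; i < m; i++) {
            int[] cur = new int[r];
            Arrays.fill(cur, INF);
            for (int j = 0; j < r; j++) {          // code used for symbol i
                for (int k = 0; k < r; k++) {      // code used before symbol i
                    cur[j] = Math.min(cur[j], prev[k] + c[k][j] + b[i][j]);
                }
            }
            prev = cur;                    // only the previous row is needed (the 2-D trick)
        }

        int best = INF;
        for (int j = 0; j < r; j++) best = Math.min(best, prev[j]);
        return best;
    }
}

This runs in O(m * r^2) time and O(r) extra space.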
I have a large dataset (>500,000 elements) that contains the stress values (σ_xx, σ_yy, σ_zz, τ_xy, τ_yz, τ_xz) of FEM elements. These stress values are given in the global xyz coordinate space of the model. I want to calculate the principal (main axis) stress values and directions from them. If you're not that familiar with the physics behind it, this means taking the symmetric matrix
| σ_xx τ_xy τ_xz |
| τ_xy σ_yy τ_yz |
| τ_xz τ_yz σ_zz |
and calculating its eigenvalues and eigenvectors. Calculating each set of eigenvalues and -vectors on its own is too slow. I'm looking for a library, an algorithm or something in Java that would allow me to do this as array calculations. As an example, in python/numpy I could just take all my 3x3-matrices, stack them along a third dimension to get a nx3x3-array, and pass that to np.linalg.eig(arr), and it automatically gives me an nx3-array for the three eigenvalues and an nx3x3-array for the three eigenvectors.
Things I tried:
nd4j has an Eigen-module for calculating eigenvalues and -vectors, but only supports a single square array at a time.
Calculate the characteristic polynomial and use Cardano's formula to get the roots/eigenvalues - possible to do for the whole array at once, but I'm stuck now on how to get the corresponding eigenvectors. Is there maybe a general, simple algorithm to get from those to the eigenvectors?
Looking for an analytical form of the eigenvalues and -vectors that can be calculated directly: It does exist, but just no.
You'll need to write a little code.
I'd create or use a Matrix class as a dependency and find methods to give you eigenvalues and eigenvectors. The ones you found in nd4j sound like great candidates. You might also consider the Linear Algebra For Java (LA4J) dependency.
Load the dataset into a List<Matrix>.
Use Java's functional methods to map each stress matrix to its eigenvalues (a vector per stress matrix) and its eigenvectors (a matrix per stress matrix).
You can speed this calculation up by applying the map to a parallel stream; Java will parallelize the work under the covers to make use of the available cores (see the sketch below).
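A sketch of that pipeline, using Apache Commons Math's EigenDecomposition as a stand-in for the nd4j/LA4J solver (the library choice is mine; any per-matrix eigen routine plugs into the same map):

import java.util.List;
import java.util.stream.Collectors;
import org.apache.commons.math3.linear.EigenDecomposition;
import org.apache.commons.math3.linear.MatrixUtils;
import org.apache.commons.math3.linear.RealMatrix;

public class PrincipalStresses {

    // One result per element: three eigenvalues and the corresponding eigenvectors.
    record Result(double[] eigenvalues, double[][] eigenvectors) {}

    static List<Result> decomposeAll(List<double[][]> stressTensors) {
        return stressTensors.parallelStream()          // parallelStream() spreads the work over the cores
                .map(t -> {
                    RealMatrix m = MatrixUtils.createRealMatrix(t);
                    EigenDecomposition eig = new EigenDecomposition(m);
                    double[] values = eig.getRealEigenvalues();
                    double[][] vectors = new double[3][];
                    for (int i = 0; i < 3; i++) {
                        vectors[i] = eig.getEigenvector(i).toArray();
                    }
                    return new Result(values, vectors);
                })
                .collect(Collectors.toList());
    }
}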
Follow-up: This is the way that worked best for me, as I can do all operations without iterating over every element. As stated above, I'm using Nd4j, which seems to be limited in its possibilities compared to numpy (or maybe I just didn't read the documentation thoroughly enough). The following method uses only basic array operations:
From the given stress values, calculate the eigenvalues using Cardano's formula. Only element-wise instructions are needed to do that (add, sub, mul, div, pow). The result should be three vectors of size n, each containing one eigenvalue for all elements (a scalar sketch of this step is shown after these steps).
Use the formula given here to calculate the matrix S for each eigenvalue. As in step 1, this can also be done using only element-wise operations on the stress-value and eigenvalue vectors, which avoids having to specify complicated instructions about which array to multiply along which axis while keeping which other axis.
Take one column from S and normalize it to get a normalized eigenvector for the given eigenvalue.
Note that this method only works if you have a real symmetric matrix. You also should make sure to properly deal with cases where the same eigenvalue appears multiple times.
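For reference, this is the scalar form of step 1 (the trigonometric variant of Cardano's formula for a real symmetric 3x3 matrix). In the vectorized nd4j version every local variable below becomes a length-n vector and every operation an element-wise array operation; the method name and argument layout are my own:

public class CardanoEigenvalues {
    // Returns the three principal stresses (eigenvalues) of the symmetric stress tensor.
    static double[] principalStresses(double sxx, double syy, double szz,
                                      double txy, double tyz, double txz) {
        double p1 = txy * txy + tyz * tyz + txz * txz;
        if (p1 == 0.0) {
            return new double[] { sxx, syy, szz };     // already diagonal
        }
        double q  = (sxx + syy + szz) / 3.0;           // mean stress (trace / 3)
        double p2 = (sxx - q) * (sxx - q) + (syy - q) * (syy - q)
                  + (szz - q) * (szz - q) + 2.0 * p1;
        double p  = Math.sqrt(p2 / 6.0);

        // r = det((A - q*I) / p) / 2, written out for the symmetric 3x3 case
        double bxx = (sxx - q) / p, byy = (syy - q) / p, bzz = (szz - q) / p;
        double bxy = txy / p, byz = tyz / p, bxz = txz / p;
        double r = (bxx * (byy * bzz - byz * byz)
                  - bxy * (bxy * bzz - byz * bxz)
                  + bxz * (bxy * byz - byy * bxz)) / 2.0;

        r = Math.max(-1.0, Math.min(1.0, r));          // guard against rounding before acos
        double phi = Math.acos(r) / 3.0;

        double e1 = q + 2.0 * p * Math.cos(phi);                        // largest
        double e3 = q + 2.0 * p * Math.cos(phi + 2.0 * Math.PI / 3.0);  // smallest
        double e2 = 3.0 * q - e1 - e3;                                  // trace is invariant
        return new double[] { e1, e2, e3 };
    }
}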
I need to enumerate all bases corresponding to all extreme points of an LP with the CPLEX API in Java. Unfortunately I did not find any way to do this with CPLEX. Is there a solution?
If not, I will do it myself, but I will need to work with bases. Is there any simple way with CPLEX to enumerate all bases and check whether a basis is a feasible solution?
The short answer: no.
There is no easy way to do this. One possible, but somewhat cumbersome, approach is to encode the basis using binary variables, e.g.:
xb[i] = 1 for basic variables, 0 for non-basic variables
We need to add constraints on non-basic variables: they will be at bound. I.e. for a non-negative variable x[i] we have
xb[i]=0 => x[i]=0
(this is an indicator constraint). Furthermore we know that
sum(i,xb[i]) = m
(the number of basic variables is equal to the number of rows in the model).
Then use CPLEX's solution pool to enumerate all possible feasible bases. An illustration of this approach is shown in this link. (That particular example enumerates all optimal bases, but it is not difficult to tell CPLEX to enumerate all feasible bases.)
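A rough sketch of the encoding with the CPLEX Java API. The big-M link below stands in for the indicator constraint, the pool parameters are only one reasonable choice, and the sizes are placeholders; this is not the model from the linked example:

import ilog.concert.IloException;
import ilog.concert.IloIntVar;
import ilog.concert.IloNumVar;
import ilog.cplex.IloCplex;

public class BasisEnumeration {
    public static void main(String[] args) throws IloException {
        int n = 5, m = 3;                  // n variables, m rows (toy sizes)
        double bigM = 1e4;                 // must dominate any feasible value of x[i]

        IloCplex cplex = new IloCplex();
        IloNumVar[] x  = cplex.numVarArray(n, 0, Double.MAX_VALUE);
        IloIntVar[] xb = cplex.boolVarArray(n);    // xb[i] = 1 iff x[i] is basic

        // ... add the original LP constraints on x here ...

        // Non-basic variables stay at their lower bound: x[i] <= bigM * xb[i]
        for (int i = 0; i < n; i++) {
            cplex.addLe(x[i], cplex.prod(bigM, xb[i]));
        }
        // The number of basic variables equals the number of rows
        cplex.addEq(cplex.sum(xb), m);

        // Let the solution pool collect as many distinct xb assignments as possible
        cplex.setParam(IloCplex.Param.MIP.Pool.Intensity, 4);
        cplex.setParam(IloCplex.Param.MIP.Limits.Populate, 1000);
        cplex.populate();

        for (int s = 0; s < cplex.getSolnPoolNsolns(); s++) {
            double[] basis = cplex.getValues(xb, s);   // one candidate set of basic variables
            // inspect or store 'basis' here
        }
        cplex.end();
    }
}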
This is a rather abstract question, as I have no idea yet how to solve it and haven't found any suitable solutions.
Let's start with the current situation. You'll have a collection of byte[] arrays (e.g. ArrayList<byte[]>) which behind the scenes are actually Strings, but at the current state the byte[] form is preferred. They can be very long (1024+ bytes per byte[] array, and the ArrayList may contain up to 1024 of them) and might have different lengths. Furthermore, they share a lot of the same bytes at the "same" locations (this is relative; for a = {0x41, 0x41, 0x61} and b = {0x41, 0x41, 0x42, 0x61}, the first 0x41 and the last 0x61 are the same).
I'm looking now for an algorithm that compares all those arrays with each other. The result should be the array that differs the most and how much they differ from each other (some kind of metric). Furthermore, the task should complete within a short time.
If possible, without using any third-party libraries (but I doubt that it is feasible in a reasonable time without one).
Any suggestions are very welcome.
EDIT / SOLUTION:
I'm using the Levenshtein distance now. Furthermore, I've made some slight adjustments to improve the runtime/speed. This is very specific to the data I'm handling, as I know that all Strings have a lot in common (and I know approximately where). Filtering that shared content improves the speed by a factor of 400 compared to running the Levenshtein distance algorithm directly on two unfiltered Strings (test data).
Thanks for your input / answers, they were a great assistance.
The result should be the array that differs the most and how much they differ from each other (some kind of metric). Furthermore, the task should complete within a short time.
You will not be able to find a solution where the metric and the running time are independent; they go hand in hand.
For example: if your metric is like the example from your post, that is d(str1,str2) = d(str1.first,str2.first) + d(str1.last,str2.last), then the solution is very easy: sort your array by first and last character (maybe separately), and then take the first and last element of the sorted array. This gives you O(n log n) for the sort.
But if your metric is something like "two sentences are close if they contain many equal words", then this does not work at all, and you end up with O(n²). Or you may be able to come up with a nifty way to re-order your words within the sentences before sorting the sentences etc. etc.
So unless you have a known metric, it's O(n²) with the trivial (naive) implementation of comparing everything while keeping track of the maximum delta.
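For completeness, here is a minimal sketch of that naive O(n²) pass, using Levenshtein distance as the metric and scoring "differs the most" as the largest sum of pairwise distances (that scoring rule is my assumption):

import java.util.List;

public class MostDifferent {
    // Two-row Levenshtein distance directly on the byte arrays (no String conversion needed).
    static int levenshtein(byte[] a, byte[] b) {
        int[] prev = new int[b.length + 1];
        int[] cur  = new int[b.length + 1];
        for (int j = 0; j <= b.length; j++) prev[j] = j;
        for (int i = 1; i <= a.length; i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length; j++) {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length];
    }

    // Returns the index of the array with the largest total distance to all other arrays.
    static int mostDifferent(List<byte[]> arrays) {
        long[] totals = new long[arrays.size()];
        for (int i = 0; i < arrays.size(); i++) {
            for (int j = i + 1; j < arrays.size(); j++) {
                int d = levenshtein(arrays.get(i), arrays.get(j));
                totals[i] += d;
                totals[j] += d;
            }
        }
        int worst = 0;
        for (int i = 1; i < totals.length; i++) {
            if (totals[i] > totals[worst]) worst = i;
        }
        return worst;
    }
}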
Every example I've seen for Encog neural nets has involved XOR or something very simple. I have around 10,000 sentences and each word in the sentence has some type of tag. The input layer needs to take 2 inputs, the previous word and the current word. If there is no previous word, then the 1st input is not activated at all. I need to go through each sentence like this. Each word is contingent on the previous word, so I can't just have an array that looks similar to the XOR example. Furthermore, I don't really want to load all the words from 10,000+ sentences into an array, I'd rather scan one sentence at a time and once I reach EOF, start back at the beginning.
How should I go about doing this? I'm not super comfortable with Encog because all the examples I've seen have either been XOR or extremely complicated.
There are 2 inputs, and each input consists of 30 neurons. The probability of the word being a certain tag is used as the input value, so most of the neurons get 0 and the others get probabilities like .5, .3, and .2. When I say 'aren't activated' I just mean that all the neurons are set to 0. The output layer represents all the possible tags, so it has 30 neurons; whichever output neuron has the highest value is the tag that is chosen.
I'm not sure how, in the Encog 'demos' that I've seen, to go through all 10,000 sentences and look up each word in each sentence (to build the inputs and activate them).
It seems that the networks are trained with a single array holding all training data, and that is looped through until the network is trained. I would like to train the network with many different arrays (an array per sentence) and then look through them all again.
This format is clearly not going to work for what I'm doing:
do {
    train.iteration();
    System.out.println("Epoch #" + epoch + " Error:" + train.getError());
    epoch++;
} while (train.getError() > 0.01);
So, I'm not sure how to tell you this, but that's not how a neural net works. You can't just use a word as an input, and you can't just "not activate" an input either. At a very basic level, this is what you need to run a neural network on a problem:
A fixed-length input vector (whatever you are feeding in, it must be represented numerically with a fixed length. Each entry in the vector is a single number)
A set of labels (each input vector must correspond to a single, fixed-length output vector)
Once you have those two, the neural net classifies an example, then edits itself to get as close as possible to the labels.
If you're looking to work with words and a deep learning framework, you should map your words to an existing vector representation (I would highly recommend GloVe, but word2vec is decent as well) and then learn on top of that representation.
After having a deeper understanding of what you're attempting here, I think the issue is that you're dealing with 60 inputs, not one. These inputs are the concatenation of the existing predictions for both words (in the case with no first word, the first 30 entries are 0). You should take care of the mapping yourself (it should be very straightforward), and then just treat it as trying to predict 30 numbers from 60 numbers.
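That mapping could look something like this sketch: two 30-entry tag-probability vectors concatenated into one fixed 60-entry input, with the previous-word half left at 0 when there is no previous word (the method name and vector sizes are just illustrative):

// Builds one fixed-length input vector from the tag probabilities of the previous and current word.
// prevTagProbs may be null for the first word of a sentence; its half of the vector stays 0.
static double[] buildInput(double[] prevTagProbs, double[] currTagProbs) {
    int tags = 30;
    double[] input = new double[2 * tags];
    if (prevTagProbs != null) {
        System.arraycopy(prevTagProbs, 0, input, 0, tags);
    }
    System.arraycopy(currTagProbs, 0, input, tags, tags);
    return input;
}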
I feel obliged to tell you that, the way you've framed the problem, you will see awful performance. When dealing with a sparse (mostly zeros) vector and such a small dataset, deep learning techniques will show VERY poor performance compared to other methods. You are better off using GloVe + an SVM or a random forest model on your existing data.
You can use other implementations of MLDataSet besides BasicMLDataSet.
I ran into a similar problem with windows of DNA sequences. Building an array of all the windows would not have been scalable.
Instead, I implemented my own VersatileDataSource, and wrapped it in a VersatileMLDataSet.
VersatileDataSource has just a few methods to implement:
public interface VersatileDataSource {
    String[] readLine();
    void rewind();
    int columnIndex(String name);
}
For each readLine(), you could return the inputs for the previous/current word, and advance the position to the next word.
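A sketch of what such an implementation could look like for the sentence case. The package name is from memory, the tag lookup is left as a placeholder, and I'm assuming readLine() may return null to signal the end of the data (Encog then calls rewind() for the next epoch):

import java.util.List;
import org.encog.ml.data.versatile.sources.VersatileDataSource;

// Streams one (previousWord, currentWord, tag) row at a time instead of
// materialising all 10,000+ sentences into a single training array.
public class SentenceDataSource implements VersatileDataSource {

    private final List<String[]> sentences;   // each sentence is an array of words
    private int sentenceIndex = 0;
    private int wordIndex = 0;

    public SentenceDataSource(List<String[]> sentences) {
        this.sentences = sentences;
    }

    @Override
    public String[] readLine() {
        if (sentenceIndex >= sentences.size()) {
            return null;                       // end of data for this pass
        }
        String[] sentence = sentences.get(sentenceIndex);
        String previous = (wordIndex == 0) ? "" : sentence[wordIndex - 1];
        String current  = sentence[wordIndex];

        // advance: next word, or the first word of the next sentence
        wordIndex++;
        if (wordIndex >= sentence.length) {
            wordIndex = 0;
            sentenceIndex++;
        }
        // columns: previous word, current word, expected tag
        return new String[] { previous, current, lookupTag(current) };
    }

    @Override
    public void rewind() {
        sentenceIndex = 0;
        wordIndex = 0;
    }

    @Override
    public int columnIndex(String name) {
        return -1;                             // no named columns in this source
    }

    private String lookupTag(String word) {
        return "";                             // placeholder: supply the known tag of the training word
    }
}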
Given an arbitrary string s, I would like a method to quickly retrieve all strings S ⊆ M from a large set of strings M (where |M| > 1 million), where every string in S has edit distance < t (some threshold) from s.
At worst, S may be empty if no strings in M match this criterion, and at best, S = {s} (an exact match). For any case in between, I fully expect that S may be quite large.
In general, I expect to have the maximum edit distance threshold fixed (e.g., 2), and need to perform this operation very many times over arbitrary strings s, thus the need for an efficient method, as naively iterating and testing all strings would be too expensive.
While I have used edit distance as an example metric, I would like to use other metrics as well, such as the Jaccard index.
Can anyone make a suggestion about an existing Java implementation which can achieve this, or point me to the right algorithms and data structures for solving this problem?
UPDATE #1
I have since learned that metric trees are precisely the kind of structure I am after: they exploit the distance metric to organise subsets of the strings in M based on their distance from each other. Vantage-point trees, BK-trees, and other similar metric tree data structures and algorithms seem ideal for this kind of problem. Now, to find easy-to-use implementations in Java...
UPDATE #2
Using a combination of this bk-tree and this Levenshtein distance implementation, I'm successfully able to retrieve subsets against arbitrary strings from a set (M) of one million strings with retrieval times of around 10ms.
BK-trees are designed for exactly such a case. They work with any metric distance, such as the Levenshtein or Jaccard distance.
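To make the idea concrete, here is a minimal BK-tree sketch keyed on Levenshtein distance (query returns every string within the threshold; the distance function can be swapped for any other metric):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal BK-tree: the triangle inequality lets a query prune whole subtrees
// whose edge distance cannot contain a match within the threshold.
public class BkTree {

    private static final class Node {
        final String word;
        final Map<Integer, Node> children = new HashMap<>();
        Node(String word) { this.word = word; }
    }

    private Node root;

    public void add(String word) {
        if (root == null) { root = new Node(word); return; }
        Node node = root;
        while (true) {
            int d = distance(word, node.word);
            if (d == 0) return;                                  // already present
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(word)); return; }
            node = child;
        }
    }

    public List<String> query(String word, int threshold) {
        List<String> result = new ArrayList<>();
        if (root != null) search(root, word, threshold, result);
        return result;
    }

    private void search(Node node, String word, int threshold, List<String> result) {
        int d = distance(word, node.word);
        if (d <= threshold) result.add(node.word);
        // only children whose edge distance lies in [d - threshold, d + threshold] can match
        for (int i = Math.max(0, d - threshold); i <= d + threshold; i++) {
            Node child = node.children.get(i);
            if (child != null) search(child, word, threshold, result);
        }
    }

    private static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] t = prev; prev = cur; cur = t;
        }
        return prev[b.length()];
    }
}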
Although I never tried it myself, it might be worth looking at a Levenshtein Automaton. I once bookmarked this article, which looks rather elaborate and provides several code snippets:
Damn Cool Algorithms: Levenshtein Automata
As already mentioned by H W, you will not be able to avoid checking each word in your dictionary. However, the automaton will speed up calculating the distance. Combine this with an efficient data structure for your dictionary (e.g. a trie, as mentioned in the Wikipedia article), and you might be able to accelerate your current approach.