I have to perform record matching on 70K records in Java. Each record is about 200 bytes. Since the record-matching process compares all records against all records, my question is: how can I iterate and perform the comparisons efficiently?
First of all, you don't need to compare everything with everything else. Since comparing A with B gives the same result as comparing B with A, you only need to compare each record with its successors. For example, given { A, B, C, D }, you compare A with B, C and D; B with C and D; and C with D. This cuts the number of comparisons from n^2 down to n(n-1)/2.
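A minimal sketch of that triangular loop; String stands in here for your own 200-byte record type, and the equals call for your real matching rule:

    import java.util.List;

    public class PairwiseMatcher {

        // Each pair is visited exactly once: n*(n-1)/2 comparisons, not n^2.
        static void findMatches(List<String> records) {
            for (int i = 0; i < records.size(); i++) {
                for (int j = i + 1; j < records.size(); j++) {
                    if (records.get(i).equals(records.get(j))) {
                        System.out.println("Match: record " + i + " and record " + j);
                    }
                }
            }
        }
    }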
You can optimize the algorithm further by building search blocks. Put everyone with the same first and last name in one block, everyone with the same email in another block, and so on. Then process each block, comparing its records as described above. Depending on how your records are distributed, this can reduce processing time dramatically.
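A minimal sketch of such blocking, using a hypothetical Person class with name and last name as the blocking key:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class Blocker {

        // Hypothetical record type for the sketch.
        static class Person {
            final String name, lastName;
            Person(String name, String lastName) { this.name = name; this.lastName = lastName; }
        }

        // Group records by a blocking key; only records inside the same
        // block are then compared pairwise, as shown above.
        static Map<String, List<Person>> buildBlocks(List<Person> people) {
            Map<String, List<Person>> blocks = new HashMap<>();
            for (Person p : people) {
                String key = p.name + "|" + p.lastName;
                blocks.computeIfAbsent(key, k -> new ArrayList<>()).add(p);
            }
            return blocks;
        }
    }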
Use Duke [https://github.com/larsga/Duke].
Not perfect, but it's free and Java.
We have a .NET version that is better and faster, but it's an in-house thing, not OSS yet.
Related
I'm not sure "permutation" is exactly the right word, but the scenario is that I have a List of ~40 Objects. Each different Object has a different value and cost.
Say my objects contain a value between 1 and 5. I am trying to find a combination of objects whose total value exceeds some given targetValue, with the lowest total cost, and return that combination. This combination could potentially contain many duplicates of one of the Objects in the List.
For example, if my list of objects were { a, b, c, d }, the output could be { a, a, a, a, a, a, a, a, a, a }. However, note that order also matters: { a, a, b } may have a different total value than { a, b, a }.
Currently, I've been trying to brute-force the solution. However, with 40! combinations, I run out of memory while keeping track of all the different "permutations".
I would still prefer to run through every combination for accuracy; the increased calculation time is not a problem, but as I said before, the biggest problem is memory.
Current code: (incompleteList starts with a beginning Object)
    while (incompleteList.size() > 0)
    {
        Container container = incompleteList.get(0);
        for (MyObject o : objectList)
        {
            // copies the list of objects into a new container
            Container newAdditionContainer = new Container(container);
            newAdditionContainer.addMyObject(o);
            if (newAdditionContainer.getTotalValue() < targetValue)
            {
                incompleteList.add(newAdditionContainer);
            } else {
                completeList.add(newAdditionContainer);
            }
        }
        incompleteList.remove(container);
    }
    // code then loops through completeList and grabs the container with the
    // cheapest cost, but in actuality that code hasn't been able to run yet.
I'm pretty sure the above could work if it were able to complete (but it can't due to memory). How can I change the algorithm to get the lowest cost while staying within memory limits?
Build a PermIterator which is initialized with your List of Objects and a desired permutation length. Iterate this Iterator beginning with length 1, until a complete iteration at some length produces only permutations that exceed the desired value. Always store only the current permutation and the current best permutation, i.e. the one that exceeds the desired value with the lowest cost, independent of permutation length.
This way you avoid storing all the permutations in Lists and running out of memory. Obviously with 40 Objects this can still take quite long.
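A rough sketch of that idea. MyObject is a hypothetical stand-in for the asker's type, and for the sketch value and cost are assumed to accumulate additively (adapt the accumulation if it is order-dependent); the caller would run search for length 1, 2, 3, ...:

    import java.util.ArrayList;
    import java.util.List;

    public class PermSearch {

        // Hypothetical stand-in for the asker's object type.
        interface MyObject {
            double getValue();
            double getCost();
        }

        List<MyObject> best;              // cheapest qualifying permutation so far
        double bestCost = Double.MAX_VALUE;

        // Enumerates all sequences of exactly 'length' objects (repetition
        // allowed) without ever storing them: only the current path and the
        // best result live in memory.
        void search(List<MyObject> objects, List<MyObject> current,
                    double value, double cost, int length, double target) {
            if (current.size() == length) {
                if (value >= target && cost < bestCost) {
                    best = new ArrayList<>(current);
                    bestCost = cost;
                }
                return;
            }
            if (cost >= bestCost) return; // prune: already costlier than the best
            for (MyObject o : objects) {
                current.add(o);
                search(objects, current, value + o.getValue(),
                       cost + o.getCost(), length, target);
                current.remove(current.size() - 1);
            }
        }
    }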
It is pretty common to run out of memory if you attempt to load all the possible data before doing anything with it. This is also a problem when reading large files.
A simple solution is not to store all the values when there are many of them, but rather to process them as you create them.
I suggest having a callback or lambda you call each time you create a new permutation. This way you don't need to store them.
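A minimal sketch of that callback pattern, generic over the element type:

    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;
    import java.util.function.Consumer;

    public class OnTheFly {

        // Invokes the callback once per generated sequence of the given
        // length instead of collecting them all; memory use is bounded
        // by the recursion depth.
        static <T> void generate(List<T> objects, int depth,
                                 Deque<T> current, Consumer<List<T>> callback) {
            if (depth == 0) {
                callback.accept(new ArrayList<>(current));
                return;
            }
            for (T o : objects) {
                current.addLast(o);
                generate(objects, depth - 1, current, callback);
                current.removeLast();
            }
        }
    }

Called as generate(objectList, length, new ArrayDeque<>(), perm -> { /* evaluate perm, keep the cheapest qualifying one */ }) for increasing lengths, nothing but the current sequence and your running best is ever stored.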
Note that with 40! permutations you are likely to run out of time as well. At one per microsecond it would take 2.5e34 years, longer than the planet has left.
I want to implement a genetic algorithm (I'm not sure about the language/framework yet, maybe Watchmaker) to optimize the mixing ratio of some fluids.
Each mix consists of up to 5 ingredients a, b, c, d, e, which I would model as genes with changing values. As the chromosome represents a mixing ratio, there are (at least) two additional conditions:
(1) a + b + c + d + e = 1
(2) a, b, c, d, e >= 0
I'm still in the planning stage of my project, so I can give no sample code; however, I want to know if and how these conditions can be implemented in a genetic algorithm with a framework like Watchmaker.
[edit]
As this doesn't seem to be straightforward, some clarification:
The problem is condition (1): if each gene a, b, c, d, e is chosen randomly and independently, the probability of the sum being exactly 1 is approximately 0. I would therefore need to implement the mutation in a way where a, b, c, d, e are chosen depending on each other (see Random numbers that add to 100: Matlab as an example).
However, I don't know if this is possible, and whether it would be in accordance with evolutionary algorithms in general.
The first condition (a+b+c+d+e=1) can be satisfied by having shorter chromosomes, with only a,b,c,d. The e value can then be represented (in the fitness function or for later use) by e:=1-a-b-c-d.
EDIT:
Another way to satisfy the first condition would be to normalize the values:
    sum := a + b + c + d + e
    a := a / sum
    b := b / sum
    c := c / sum
    d := d / sum
    e := e / sum
The new sum will then be 1.
For the second condition (a, b, c, d, e >= 0), you can add an approval phase for new offspring chromosomes (generated by mutation and/or crossover) before throwing them into the gene pool (and allowing them to breed), and reject those that don't satisfy the condition.
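A minimal sketch combining both ideas on a plain double[] chromosome (the Watchmaker operator hooks are omitted):

    public class MixConstraints {

        // Returns false (reject the offspring) if condition (2) fails or
        // the chromosome is degenerate; otherwise rescales in place so
        // the genes sum to 1, satisfying condition (1).
        static boolean repair(double[] genes) {
            double sum = 0;
            for (double g : genes) {
                if (g < 0) return false; // negative gene: reject
                sum += g;
            }
            if (sum == 0) return false;  // all-zero chromosome: reject
            for (int i = 0; i < genes.length; i++) {
                genes[i] /= sum;         // normalize: new sum is 1
            }
            return true;
        }
    }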
I wonder (and am nearly desperate) whether there is any worked-out DSL for streams/iterators over ordered series of objects?
The sources are ordered streams of id, time, key, value instances, and the requirement is to join and analyse those streams. This has to be done by collecting combinations of keys and applying metrics to values within certain (definable) time constraints (count distinct keys or sum values within a day, within the same second, ...). There are some DSLs that work on time series (ESP), but they mostly use relatively simple time windows and do not seem able to handle the order/join by id and time (and, in consequence, the computation of combinations by id).
What I have to do is something like "compute the combinations of A and (B or C), count distinct D within the same second, sum E with the same id".
The results should contain all available combinations of A and (B or C), with the count of distinct values for key D that fall in the same second as A and (B or C) for each distinct id, and the sum of the values for key E for each id (which is the sum over all values of E for ids having A and (B or C)).
Not an easy question. I'm just looking for possibly helpful, already-thought-out DSLs for such problems. I do not think SQL will manage it.
Thanks a lot!
I think you can't find such methods because streams and iterators are not intended to contain ordered data (although they can). As a result, if you can't rely on the data inside being sorted, there is no need for such methods, because you would have to read all the data from the stream/iterator, and they would lose their main purpose as a data structure. So why not use a list?
I have been working on the following problem: I have a CSV file with two columns; we can say the field names are "Friends". Both columns contain letters from A to Z.
e.g.
A B
B C
A E
D F
E F
Each row has two different letters (no duplication within a row). A is a friend of B, C is a friend of D, etc. If person A talks to person B and person B talks to person C, then A and C become acquaintances. Acquaintances are those who share a common friend. I need to find out who has the most acquaintances.
I have been trying two different methods: one using different data structures like HashMap, ArrayList, Stack, etc., and another using graph theory (the JGraphT library).
But I am stuck on the logic if I use data structures, and I am stuck on traversing the graph if I use graph theory.
I have the following questions:
1. Which is the better approach: data structures or a graph? Or is there any other better approach/logic/algorithm for this?
2. Does anyone know how to traverse a graph in the JGraphT library? I am not able to do this; they have very limited documentation for the library.
Please, any help would really be appreciated.
Generally, HashMaps are among the most rapid and easiest to use. I would recommend you use them rather than any custom libraries, unless you are sure that some library will easily do something you need that would otherwise take you a long time to code.
In your case, you can just use each person as a key and the list of their friends as the value it points to. Parsing your .csv file and filling the HashMap accordingly will solve your issue, as a previous comment pointed out.
You can have a hash table first that maps every letter to the set of its friends, e.g. A maps to { B }, B maps to { C }, and if Z has two friends Y and W then Z maps to { Y, W }. This is a hash map from letters to sets-of-letters.
To calculate the acquaintances, iterate through the hash map. When you are at entry A -> { B, C, F } then iterate in an inner loop through the set B, C, F. Collect their friends (three sets) into a single set (just insert the elements into a temp set) and remove A from that set if A was found. The size of that set is then the number of acquaintances for A. Rinse and repeat.
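A minimal sketch of that, treating friendship as symmetric, so each CSV row is inserted in both directions:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class AcquaintanceCounter {

        final Map<Character, Set<Character>> friends = new HashMap<>();

        // One call per CSV row, e.g. addRow('A', 'B').
        void addRow(char x, char y) {
            friends.computeIfAbsent(x, k -> new HashSet<>()).add(y);
            friends.computeIfAbsent(y, k -> new HashSet<>()).add(x);
        }

        // Friends of p's friends, collected into one set, minus p itself.
        int countAcquaintances(char p) {
            Set<Character> acquaintances = new HashSet<>();
            for (char f : friends.getOrDefault(p, Collections.emptySet())) {
                acquaintances.addAll(friends.getOrDefault(f, Collections.emptySet()));
            }
            acquaintances.remove(p);
            return acquaintances.size();
        }
    }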
I have responses from users to multiple choice questions, e.g. (roughly):
Married/Single
Male/Female
American/Latin American/European/Asian/African
What I want is to estimate similarity by aggregating all responses into a single field which can be compared across users in the database - rather than running queries against each column.
So, for example, some responses might look like:
Married-Female-American
Single-Female-European
But I don't want to store a massive text object to represent all of the possible concatenated responses since there are maybe 50 of them.
So, is there some way to represent a set of responses more concisely, using a Java library method of some kind?
In other words, this method would take Married-Female-American and generate a code, say of abc while Single-Female-European would generate a code of, say, def?
This way if I want to find out if two users are Married-Female-Americans I can simply query a single column for the code abc.
Well, if it was a multiple-choice question, you have the choices enumerated. That is, numbered. Why not use 1-1-2 and 23-1-75 then? Even if you have 50 answers, it's still manageable.
Now, if you happen to need similarity, aggregating is the last thing you want. What you want is a simple array of the ids of the answers given, and a function defining a distance between two answer arrays. Do not use Strings, do not aggregate. Leave nice clean vectors, and all the ML libraries will be at your service.
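A minimal sketch of such a distance, counting the questions answered differently (a Hamming-style distance over the id arrays):

    public class AnswerDistance {

        // Each user's responses as an array of answer ids, one per question.
        static int distance(int[] a, int[] b) {
            int differing = 0;
            for (int i = 0; i < a.length; i++) {
                if (a[i] != b[i]) differing++; // question i answered differently
            }
            return differing;
        }

        public static void main(String[] args) {
            int[] user1 = {1, 1, 2};
            int[] user2 = {23, 1, 75};
            System.out.println(distance(user1, user2)); // prints 2
        }
    }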
To name a Java ML library, try http://www.cs.waikato.ac.nz/~ml/weka/
Update: One more thing you may want to try is locality sensitive hashing. I don't think it's a good idea in your case, but your question looks like a request for it. Give it a try.
Do you have a finite number of options (multiple-choice seems to imply this)?
It is a common technique for performance to go from strings to a numerical data set, by essentially indexing the available strings. As long as you only need identity, this is perfect. Comparing an integer is much faster than comparing a string, and they usually take less memory, too.
A character is essentially an integer in 0-255, so you can of course use this.
So just define an alphabet:
a Married
b Single
c Male
d Female
e American
f Latin American
g European
h Asian
i African
You can in fact use this even when you have more than 256 words in total, as long as the codes are positional (and no single question has more than 256 choices). You would then use:
a Q1: Married
b Q1: Single
a Q2: Male
b Q2: Female
a Q3: American
b Q3: Latin American
c Q3: European
d Q3: Asian
e Q3: African
Your examples would then be encoded as either (variant 1) ade and bdg or (variant 2) aba and bbc. The string would then have a fixed length of 50 (if you have 50 questions) and can be stored very efficiently.
For comparing answers, just access the nth character of the string. Maybe your database allows indexed substring queries, too. As you can see in the above example, both strings agree only on the second character, just as the answers agreed.
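A minimal sketch of variant 2, mapping each question's 0-based choice index to a character:

    public class AnswerCodes {

        // Choice 0 of every question becomes 'a', choice 1 becomes 'b', ...
        // so 50 questions yield a fixed-length 50-character string.
        static String encode(int[] choiceIndexes) {
            StringBuilder sb = new StringBuilder(choiceIndexes.length);
            for (int idx : choiceIndexes) {
                sb.append((char) ('a' + idx));
            }
            return sb.toString();
        }

        // Two users agree on question i exactly when the i-th characters match.
        static boolean sameAnswer(String code1, String code2, int question) {
            return code1.charAt(question) == code2.charAt(question);
        }

        public static void main(String[] args) {
            // Married, Female, American -> "aba"; Single, Female, European -> "bbc"
            System.out.println(encode(new int[] {0, 1, 0}));
            System.out.println(encode(new int[] {1, 1, 2}));
        }
    }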