Comparing sets of randomly assigned codes in Java to assign a name - java

Good day,
I honestly do not know how to phrase the problem in the title, hence the generic description. I have a set of ~150 codes, which are combined to produce a single string like "a_b_c_d". Valid combinations contain 1-4 codes, with the '-' character standing in for any unassigned position, and each code may only be used once ("a_a..." is not considered valid). Each set of codes is then assigned to a unique name. Not all combinations are logical, but if a combination is valid then the order of the codes does not matter (if "f_g_e_-" is valid, then "e_g_f_-" and "e_f_-_g" are also valid, and they all have the same name). I have taken the time to assign each valid combination to its unique name and tried to create a single parser to read these values and produce the name.
I think the problem is apparent. Since the order does not matter, I have to check for every possible ordering. The codes cannot be strictly ordered, since some codes are meaningful in any position, so this is impossible to accomplish with a simple parser. Is there an optimal way to do this, or will I have to force the user to use some kind of order against the standard?

Try using a TreeMap to store each code (String) and its count (int), incrementing the count for a code every time it is encountered in the string.
After processing the whole string, if the count for any code is > 1 then the string has repeated codes and is invalid; otherwise it is valid.
A TreeMap is traversed in sorted key order, so traversing it also gives you the code sequence in a canonical, sorted form.
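A minimal sketch of that approach (the example codes and the name table are made up for illustration): count the codes in a TreeMap, reject repeats, then join the sorted keys into a canonical key that a lookup table maps to the name.

import java.util.Map;
import java.util.TreeMap;

public class CodeNameLookup {

    // Hypothetical table mapping canonical (sorted) combinations to names.
    private static final Map<String, String> NAMES = Map.of("-_e_f_g", "some name");

    static String nameFor(String input) {
        // Count each code; a TreeMap keeps its keys sorted.
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String code : input.split("_")) {
            if (counts.merge(code, 1, Integer::sum) > 1 && !code.equals("-")) {
                throw new IllegalArgumentException("Repeated code: " + code);
            }
        }
        // Join the sorted codes into a canonical key, so the original order no longer matters.
        String canonical = String.join("_", counts.keySet());
        return NAMES.get(canonical);
    }

    public static void main(String[] args) {
        System.out.println(nameFor("f_g_e_-")); // same name as below
        System.out.println(nameFor("e_g_f_-"));
    }
}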

Related

Create mapping of different unique string values into unique integer values with specified range

I am facing an issue in which I want to map strings from one application to unique integers, but within a specified range (like 0 to 99999).
Example:
"Input_str_1" should (for example) mapped to 5423 each time
"Input_str_2" should (for example) mapped to 4829 each time
The important consideration is that for the same input string I should get the same number from the given range each time. I will have no more than 100,000 distinct input strings, which is why I specified this range.
I am unable to get a starting pointer on how to approach this problem. If any of you can help me in this direction I would be grateful.
Both of my applications are in Java.
Is your goal to produce a unique number, or simply a random-looking number? If the latter, any hash function will suffice. Otherwise, if there are N possible inputs, and all other inputs are invalid, look at perfect hash functions.
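If a random-looking (not guaranteed-unique) number is enough, a deterministic hash reduced into the range will do. This is a sketch under that assumption, using SHA-256 folded into 0..99999 so both applications get the same value for the same string (collisions are still possible):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RangedHash {

    // Deterministically maps a string into [0, 100000); not collision-free.
    static int mapToRange(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            // Interpret the digest as a non-negative integer and reduce it modulo the range size.
            return new BigInteger(1, digest).mod(BigInteger.valueOf(100_000)).intValue();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(mapToRange("Input_str_1")); // same value on every run, in both applications
        System.out.println(mapToRange("Input_str_2"));
    }
}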

Hash items in a 2d array, but only on one index

So, I have a 2d array (really, a List of Lists) that I need to squish down and remove any duplicates, but only for a specific field.
The basic layout is a list of Matches, with each Match having an ID number and a date. I need to remove all duplicates such that each ID only appears once. If an ID appears multiple times in the List of Matches, then I want to take the Match with the most recent date.
My current solution has me taking the List of Matches, adding it to a HashSet, and then converting that back to an ArrayList. However all that does is remove any exact Match duplicates, which still leaves me with the same ID appearing multiple times if they have different dates.
Set<Match> deDupedMatches = new HashSet<Match>();
deDupedMatches.addAll(originalListOfMatches);
List<Match> finalList = new ArrayList<Match>(deDupedMatches);
If my original data coming in is
{(1, 1-1-1999),(1, 2-2-1999),(1, 1-1-1999),(2, 3-3-2000)}
then what I get back is
{(1, 1-1-1999),(1, 2-2-1999),(2, 3-3-2000)}
But what I am really looking for is a solution that would give me
{(1, 2-2-1999),(2, 3-3-2000)}
I had some vague idea of hashing the original list in the same basic way, but only using the IDs. Basically I would end up with "buckets" based on the ID that I could iterate over, and any bucket that had more than one Match in it I could choose the correct one for. The thing that is hanging me up is the actual hashing. I am just not sure how or if I can get the Matches broken up in the way that I am thinking of.
If I understand your question correctly, you want to keep each distinct ID from the list once, with the latest date at which it occurs.
Because your Match is a class, instances are not easy to compare with each other: a Set does not look at the individual fields (unless you override equals and hashCode).
What I would do to get around this problem is use a HashMap which allows distinct keys and values to be linked.
Keys cannot be repeated, values can.
I would do something like this while looping through:
if (map.putIfAbsent(match.getID(), match) != null
        && map.get(match.getID()).getDate().compareTo(match.getDate()) < 0) {
    map.replace(match.getID(), match);
}
So what that does is it loops through your matches.
Put the current Match with its ID in if that ID doesn't exist yet.
.putIfAbsent returns the old value which is null if it did not exist.
You then check if there was an item in the map at that ID using the putIfAbsent (2 birds with one stone).
After that it is safe to compare the two dates (the one already in the map and the one from the current iteration); compareTo stands in here for whatever comparison your date type supports.
If the new one is later, replace the current Match.
And finally, in order to get your list, you use .values().
This will remove duplicate IDs and leave only the latest ones.
Apologies for typos or code errors, this was done on a phone. Please notify me of any errors in the comments.
Java 7's Map does not have the .putIfAbsent and .replace methods, but you can get the same effect with .containsKey, .get, and .put.
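Here is a self-contained version of that idea; the Match accessors getID() and getDate() are assumed from the question, LocalDate is used only for illustration, and Map.merge collapses the putIfAbsent/replace pair into a single call:

import java.time.LocalDate;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Match {
    private final int id;
    private final LocalDate date;

    Match(int id, LocalDate date) { this.id = id; this.date = date; }
    int getID() { return id; }
    LocalDate getDate() { return date; }
    @Override public String toString() { return "(" + id + ", " + date + ")"; }
}

public class DedupeMatches {
    public static void main(String[] args) {
        List<Match> original = List.of(
                new Match(1, LocalDate.of(1999, 1, 1)),
                new Match(1, LocalDate.of(1999, 2, 2)),
                new Match(1, LocalDate.of(1999, 1, 1)),
                new Match(2, LocalDate.of(2000, 3, 3)));

        Map<Integer, Match> byId = new HashMap<>();
        for (Match match : original) {
            // Keep only the Match with the latest date for each ID.
            byId.merge(match.getID(), match,
                    (existing, candidate) ->
                            candidate.getDate().isAfter(existing.getDate()) ? candidate : existing);
        }

        List<Match> finalList = new ArrayList<>(byId.values());
        System.out.println(finalList); // [(1, 1999-02-02), (2, 2000-03-03)]
    }
}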

Fuzzy Matching Duplicates in Java

I have a List<String[]> of customer records in Java (from a database). I know from manually eyeballing the data that 25%+ are duplicates.
The duplicates are far from exact though. Sometimes they have different zips, but the same name and address. Other times the address is missing completely, etc...
After a day of research, I'm still really stumped as to how to even begin to attack this problem.
What are the "terms" that I should be googling for that describe this area (from a solve-this-in-Java perspective)? And I don't suppose there is a fuzzymatch.jar out there that makes it all just too easy?
I've done similar systems before for matching place information and people information. These are complex objects with many features and figuring out whether two different objects describe the same place or person is tricky. The way to do it is to break it down to the essentials.
Here's a few things that you can do:
0) If this is a one-off, load the data into OpenRefine and fix things interactively. At best this solves your problem outright; at worst it will show you where your possible matches are.
1) There are several ways you can compare strings. Basically they differ in how prone they are to false positives and false negatives. A false positive is a pair that matches when it shouldn't have; a false negative is a pair that should match but doesn't. Plain String equality will not produce false positives but will miss a lot of potential matches due to slight variations. Levenshtein distance with a small threshold is slightly better. N-grams produce a lot of matches, but many of them will be false. There are a few more algorithms; take a look at e.g. the OpenRefine code to find various ways of comparing and clustering strings. Lucene implements a lot of this stuff in its analyzer framework but is a bit of a beast to work with if you are not very familiar with its design.
2) Separate the process of comparing fields from the process of deciding whether you have a match. What I did in the past was qualify my comparisons using a simple numeric score, e.g. this field matched exactly (100), that field was a partial match (75), and that field did not match at all (0). The resulting vector of qualified comparisons, e.g. (100, 75, 0, 25), can be compared to a reference vector that defines your perfect or partial match criteria. For example, if first name, last name, and street match, the two records are the same regardless of the rest of the fields. Or if phone numbers and last names match, that's a valid match too. You can encode such perfect matches as vectors and then simply compare them with your comparison vectors to determine whether a pair is a match, not a match, or a partial match. This is a manual version of what machine learning does, which is to extract feature vectors and then build up a probability model of which vectors mean what from reference data. Done manually, it can work for simple problems (see the sketch after this list).
3) Build up a reference data set with test cases that you know should or should not match, and evaluate your algorithm against that reference set. That way you will know whether you are improving things or making them worse when you tweak e.g. the threshold that goes into Levenshtein, or whatever.
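A minimal sketch of point 2), with a hand-rolled Levenshtein distance; the field values, the 0-100 scoring, and the "all three fields must score at least 75" rule are illustrative assumptions, not a recommendation:

public class FuzzyRecordMatch {

    // Classic dynamic-programming Levenshtein edit distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Score a single field: 100 for an exact match, scaled down by edit distance, 0 for missing data.
    static int fieldScore(String a, String b) {
        if (a == null || b == null || a.isEmpty() || b.isEmpty()) return 0;
        int dist = levenshtein(a.toLowerCase(), b.toLowerCase());
        int maxLen = Math.max(a.length(), b.length());
        return Math.max(0, 100 - (100 * dist) / maxLen);
    }

    public static void main(String[] args) {
        // Comparison vector for (firstName, lastName, street, zip) of two records.
        int[] scores = {
                fieldScore("Jon", "John"),
                fieldScore("Smith", "Smith"),
                fieldScore("12 Main St", "12 Main St."),
                fieldScore("90210", "")
        };
        // Reference rule: first name, last name, and street must all score at least 75.
        boolean match = scores[0] >= 75 && scores[1] >= 75 && scores[2] >= 75;
        System.out.println(java.util.Arrays.toString(scores) + " -> match=" + match);
    }
}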
Jilles' answer is great and comes from experience. I've also had to work on cleaning up large messy tables and sadly didn't know much about my options at that time (I ended up using Excel and a lot of autofilters). Wish I'd known about OpenRefine.
But if you get to the point where you have to write custom code to do this, I want to make a suggestion as to how: The columns are always the same, right? For instance, the first String is always the key, the second is the First name, the sixth is the ZIP code, tenth is the fax number, etc.?
Assuming there's not an unreasonable number of fields, I would start with a custom Record type which has each DB field as a member rather than a position in an array. Something like
class CustomerRow {
    public final String id;
    public final String firstName;
    // ...

    public CustomerRow(String[] data) {
        id = data[0];
        // ...
    }
}
You could also include some validation code in the constructor, if you knew there to be garbage values you always want to filter out.
(Note that you're basically doing what an ORM would do automatically, but getting started with one would probably be more work than just writing the Record type.)
Then you'd implement some Comparator<CustomerRow>s which only look at particular fields, or define equality in fuzzy terms (that's where the edit distance algorithms would come in handy), or do special sorts.
Java uses a stable sort for objects, so to sort by e.g. name, then address, then key, you would just do each sort, but choose your comparators in the reverse order.
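A small illustration of both the multi-pass stable sort and the equivalent comparator chaining; the Row record is a stand-in for CustomerRow, assuming it also carries name and address fields:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CustomerSorts {

    // Illustrative stand-in for CustomerRow with the fields used below.
    record Row(String name, String address, String id) {}

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>(List.of(
                new Row("Smith", "12 Main St", "3"),
                new Row("Jones", "5 Oak Ave", "1"),
                new Row("Smith", "5 Oak Ave", "2")));

        // Multi-pass approach from the text: because the object sort is stable,
        // sorting by the least-significant key first and the most-significant key
        // last leaves the list ordered by name, then address, then id.
        rows.sort(Comparator.comparing(Row::id));
        rows.sort(Comparator.comparing(Row::address));
        rows.sort(Comparator.comparing(Row::name));

        // Equivalent single pass using comparator chaining.
        rows.sort(Comparator.comparing(Row::name)
                .thenComparing(Row::address)
                .thenComparing(Row::id));

        rows.forEach(System.out::println);
    }
}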
Also if you have access to the actual database, and it's a real relational database, I'd recommend doing some of your searches as queries where possible. And if you need to go back and forth between your Java objects and the DB, then using an ORM may end up being a good option.

Unique alphanumeric String with a fixed length

How can I generate a unique alphanumeric String with a fixed length of 8 characters? I want to base it on an Id + the current time.
I tried MD5 but it makes a string that is too long.
Thanks!
The problem is that 8 alphanumeric characters is most likely too few to guarantee uniqueness ... using that approach.
You just need to do some arithmetic. Multiply the number of ids that your application could generate per second by the number of seconds that your application is expected to "live". Now figure out how many alphanumeric characters you need to encode that number ... and that gives you how large the "timestamp" part of your id would need to be. Then add the characters for the "id" part of your string.
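As a worked example (the rate and lifetime are assumptions): at 1,000 ids per second over 10 years you need to encode roughly 3.2 x 10^11 values, and with 62 alphanumeric symbols per position that takes 7 characters, leaving only one of your 8 characters for the "id" part.

public class IdLength {
    public static void main(String[] args) {
        // Assumed workload: 1,000 ids per second for 10 years.
        double totalIds = 1_000.0 * 60 * 60 * 24 * 365 * 10;
        // Characters needed when each position holds one of 62 alphanumerics (a-z, A-Z, 0-9).
        int chars = (int) Math.ceil(Math.log(totalIds) / Math.log(62));
        System.out.println(chars + " characters needed for ~" + (long) totalIds + " ids"); // 7 characters
    }
}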
IMO, the best approach (if you have to use short strings) is to generate partially or fully random strings, and then check them against a (big) table of all previously issued id strings. If you get a collision, generate another string, and repeat.
If you also want your ids to be hard to predict (per your comment), then the "random number" approach is best. Make sure that you use a cryptographic-quality RNG or PRNG. The problem with a timestamp-based approach is that the resulting ids will be much easier to predict ... or guess.
Use java.util.UUID.
UUID uuid = UUID.randomUUID();
String id = uuid.toString().substring(0, 8);
Strings can't be unique: uniqueness refers to an item in the context of a collection without duplicates, called a set. Given a set of symbols (you said alphanumeric in your question) and a string length (in your example 8), there's a known number of possible combinations, which may or may not be enough for your needs.
Your requirements can't be satisfied (at least, not with the information you provided). If you really want the token to be unique and the given input (id, timestamp) is guaranteed to be a key (i.e. for each given ID you'll never have two or more identical timestamps), just put the ID and the timestamp side by side.
The size of the resulting string will be the maximum size of the ID (the username) plus the fixed size of the timestamp.

Good collection for keeping track of characters

Just a general question (and I'm sort of new to Java), but what would be a good collection that I could add objects to while keeping track of how many of each I've added? For example, if I added the alphabet a character at a time, it would have 26 different characters, each with an associated value of 1. Likewise, adding 'z' 10 times would have 'z' with an associated 10. Suggestions? The name "hashtable" had sounded promising, but I don't think I want to use that...
First thing that comes to mind is a Dictionary. The key would be the ASCII value of the character, and the value would be the number of times it is used. Not necessarily the most efficient way to do it, but it is one of the easiest.
You could also do it with a single array, using the character's ASCII value (offset so that index 0 corresponds to the first character you care about) as the index.
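A minimal sketch of the array approach, assuming you only care about lowercase ASCII letters:

public class CharCounter {
    public static void main(String[] args) {
        int[] counts = new int[26]; // one slot per lowercase letter; 'a' maps to index 0
        String input = "zzzzzzzzzzabc";
        for (char c : input.toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                counts[c - 'a']++;
            }
        }
        System.out.println("z occurs " + counts['z' - 'a'] + " times"); // 10
    }
}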
If you want an extremely fast implementation, a HashMap is actually a very good idea.
For concurrency, you can use a ConcurrentHashMap.
There's no need to use a special data structure, as simply using a HashMap works well. When adding a char myChar, you call get(myChar); if the result is null, put a new entry into the map for that Character with an Integer value of 1. If the map returns an Integer, simply add one to it and put it back into the map.
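That pattern looks roughly like this; on Java 8+ the whole if/else can be collapsed into counts.merge(myChar, 1, Integer::sum):

import java.util.HashMap;
import java.util.Map;

public class HashMapCounter {
    public static void main(String[] args) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char myChar : "hello world".toCharArray()) {
            Integer current = counts.get(myChar);
            if (current == null) {
                counts.put(myChar, 1);           // first time this character is seen
            } else {
                counts.put(myChar, current + 1); // increment the existing count
            }
        }
        System.out.println(counts);
    }
}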
Multiset is the data structure for this purpose. Guava has an implementation of it.
Multiset<Character> charFrequency = HashMultiset.create();
charFrequency.add(char1);
charFrequency.add(char1);
charFrequency.count(char1); // returns 2
