I have a ResultSet with a list of stock exchanges and the countries in which they reside. However, in my database not every exchange has a country_id, so when creating Exchange objects, a bunch of them have null country_id and country_title values. To optimize memory, I planned to intern all duplicate Strings (countries, currencies, etc.), but noticed that I get a NullPointerException, which is logical. Is there some workaround to avoid duplicate strings with intern without getting an NPE? Thank you.
Some options are:
Given there are fewer than 200 countries, and fewer exchanges than that (there are only 60 major exchanges globally), it would be trivial to provide the missing data to your exchanges.
Provide a default value programmatically, either in Java or via your query, e.g. assign 0 to the country_id and "" to country_title when they are null in the database.
Don't bother interning - with so few Strings, such a micro optimisation would have no measurable effect.
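If you do decide to intern anyway, the NPE itself is easy to avoid with a null check before the call. A minimal sketch, assuming a helper named internOrNull (the class and method names are made up for illustration, they are not from any library):

```java
final class NullSafeIntern {
    /** Returns the pooled instance of s, or null if s is null. */
    static String internOrNull(String s) {
        return s == null ? null : s.intern();
    }

    public static void main(String[] args) {
        // new String(...) forces distinct objects, much like values read from a ResultSet.
        String a = internOrNull(new String("Germany"));
        String b = internOrNull(new String("Germany"));
        String missing = internOrNull(null);

        System.out.println(a == b);          // both refer to the same pooled instance
        System.out.println(missing == null); // null passes through without an NPE
    }
}
```

You would then wrap each nullable column with it when building the Exchange objects, e.g. internOrNull(rs.getString("country_title")).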
Thank you guys, there are many more strings used in the app; countries and exchanges were just an example. In total there are around 500k Strings, out of which 50k are unique, i.e. around 30 MB wasted. Not a big deal indeed.
After some research, I will not intern strings, given that the app should run on well-equipped PCs :)
Related
I have a List<String[]> of customer records in Java (from a database). I know from manually eyeballing the data that 25%+ are duplicates.
The duplicates are far from exact though. Sometimes they have different zips, but the same name and address. Other times the address is missing completely, etc...
After a day of research, I'm still really stumped as to how to even begin to attack this problem.
What are the "terms" that I should be googling for that describe this area (from a solve-this-in-Java perspective)? And I don't suppose there is a fuzzymatch.jar out there that makes it all just too easy?
I've done similar systems before for matching place information and people information. These are complex objects with many features and figuring out whether two different objects describe the same place or person is tricky. The way to do it is to break it down to the essentials.
Here's a few things that you can do:
0) If this is a one-off, load the data into OpenRefine and fix things interactively. At best this solves your problem; at worst it will show you where your possible matches are.
1) There are several ways you can compare strings. Basically they differ in how prone they are to producing false positives and false negatives. A false positive is when it matches when it shouldn't have. A false negative is when it should match but doesn't. String equals will not produce false positives but will miss a lot of potential matches due to slight variations. Levenshtein with a small factor is slightly better. N-grams produce a lot of matches, but many of them will be false. There are a few more algorithms; take a look at e.g. the OpenRefine code to find various ways of comparing and clustering strings. Lucene implements a lot of this stuff in its analyzer framework but is a bit of a beast to work with if you are not very familiar with its design.
2) Separate the process of comparing stuff from the process of deciding whether you have a match. What I did in the past was qualify my comparisons using a simple numeric score, e.g. this field matched exactly (100) but that field was a partial match (75) and that field did not match at all (0). The resulting vector of qualified comparisons, e.g. (100, 75, 0, 25), can be compared to a reference vector that defines your perfect or partial match criteria. For example, if first name, last name, and street match, the two records are the same regardless of the rest of the fields. Or if phone numbers and last names match, that's a valid match too. You can encode such perfect matches as a vector and then simply compare it with your comparison vectors to determine whether it was a match, not a match, or a partial match. This is sort of a manual version of what machine learning does, which is to extract vectors of features and then build up a probability model of which vectors mean what from reference data. Doing it manually can work for simple problems.
3) Build up a reference data set with test cases that you know to match or not match and evaluate your algorithm against that reference set. That way you will know when you are improving things or making things worse when you tweak e.g. the factor that goes into Levenshtein or whatever.
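To make points 1) and 2) a bit more concrete, here is a rough sketch of a per-field score based on Levenshtein distance, plus a check of the resulting score vector against a rule vector. The class name, the 0-100 scaling and the column layout in main are my own assumptions for illustration, not something prescribed above:

```java
final class FuzzyFieldMatch {
    /** Classic dynamic-programming Levenshtein edit distance, keeping two rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Scores one field pair from 0 (no match) to 100 (exact); missing values score 0. */
    static int score(String a, String b) {
        if (a == null || b == null) return 0;
        String x = a.trim().toLowerCase(), y = b.trim().toLowerCase();
        if (x.isEmpty() || y.isEmpty()) return 0;
        int longest = Math.max(x.length(), y.length());
        return 100 - (100 * levenshtein(x, y)) / longest;
    }

    /** Builds the comparison vector for two records with the same column layout. */
    static int[] compare(String[] r1, String[] r2) {
        int[] scores = new int[r1.length];
        for (int i = 0; i < r1.length; i++) scores[i] = score(r1[i], r2[i]);
        return scores;
    }

    /** A rule is a vector of minimum scores; 0 means "this field does not matter". */
    static boolean satisfies(int[] scores, int[] rule) {
        for (int i = 0; i < rule.length; i++) {
            if (scores[i] < rule[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical column layout: first name, last name, street, zip.
        String[] a = {"John", "Doe", "12 Main Street", "90210"};
        String[] b = {"Jhon", "Doe", "12 Main Street", null};
        int[] nameAndStreet = {50, 100, 90, 0}; // zip is ignored by this rule
        System.out.println(satisfies(compare(a, b), nameAndStreet));
    }
}
```

The thresholds in the rule vector are exactly the kind of thing you would tune against the reference set from point 3).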
Jilles' answer is great and comes from experience. I've also had to work on cleaning up large messy tables and sadly didn't know much about my options at that time (I ended up using Excel and a lot of autofilters). Wish I'd known about OpenRefine.
But if you get to the point where you have to write custom code to do this, I want to make a suggestion as to how: The columns are always the same, right? For instance, the first String is always the key, the second is the First name, the sixth is the ZIP code, tenth is the fax number, etc.?
Assuming there's not an unreasonable number of fields, I would start with a custom Record type which has each DB field as member rather than a position in an array. Something like
class CustomerRow {
    public final String id;
    public final String firstName;
    // ...
    public CustomerRow(String[] data) {
        id = data[0];
        // ...
    }
}
You could also include some validation code in the constructor, if you knew there to be garbage values you always want to filter out.
(Note that you're basically doing what an ORM would do automatically, but getting started with one would probably be more work than just writing the Record type.)
Then you'd implement some Comparator<CustomerRow>s which only look at particular fields, or define equality in fuzzy terms (there's where the edit distance algorithms would come in handy), or do special sorts.
Java uses a stable sort for objects, so to sort by e.g. name, then address, then key, you would just do each sort, but choose your comparators in the reverse order.
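A small sketch of that stable-sort trick, next to the equivalent chained comparator. The Row class here is a cut-down stand-in for the record type, and all names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class StableSortDemo {
    /** Minimal stand-in for the record type; the fields are illustrative. */
    static final class Row {
        final String id, name, address;
        Row(String id, String name, String address) {
            this.id = id; this.name = name; this.address = address;
        }
    }

    static final Comparator<Row> BY_NAME = Comparator.comparing((Row r) -> r.name);
    static final Comparator<Row> BY_ADDRESS = Comparator.comparing((Row r) -> r.address);
    static final Comparator<Row> BY_ID = Comparator.comparing((Row r) -> r.id);

    /** Sorts by name, then address, then id, using successive stable sorts in reverse order. */
    static List<Row> sortInPasses(List<Row> rows) {
        List<Row> out = new ArrayList<>(rows);
        out.sort(BY_ID);      // least significant key first
        out.sort(BY_ADDRESS);
        out.sort(BY_NAME);    // most significant key last
        return out;
    }

    /** The same ordering expressed as a single chained comparator. */
    static List<Row> sortChained(List<Row> rows) {
        List<Row> out = new ArrayList<>(rows);
        out.sort(BY_NAME.thenComparing(BY_ADDRESS).thenComparing(BY_ID));
        return out;
    }

    /** Joins the ids of the rows in order, for easy printing and comparison. */
    static String ids(List<Row> rows) {
        StringBuilder sb = new StringBuilder();
        for (Row r : rows) {
            if (sb.length() > 0) sb.append(',');
            sb.append(r.id);
        }
        return sb.toString();
    }
}
```

Both methods should give the same order; the chained version does it in one pass, while the multi-pass version is handy when the individual sorts are triggered separately.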
Also if you have access to the actual database, and it's a real relational database, I'd recommend doing some of your searches as queries where possible. And if you need to go back and forth between your Java objects and the DB, then using an ORM may end up being a good option.
I'm running some experiments over a large dataset and would like to optimize a particular part. Currently, I have 5-6 Models each of which stores a mapping from Topics to List of Strings. The set of Topics is large and the same between each Model, so there must be a better way. Ultimately the query I need to perform is: what is the String in position x of the List for some Model-Topic combination.
One of the problems with using the mapping method is that if there are say 500k-5M topics, each has a list of 20 strings. Then my Map<Model, Map<Topic, List<String>>> is going to be massive.
Have you tried SortedSet / Maps? Sounds like you need to optimize your search; lookups in sorted collections (like TreeMap) are O(log n), while searching a regular list is O(n). Of course, this kind of thing is something at which databases excel...
Not clear where/how you want to achieve "memory efficiency". First one needs to look at the particulars of your detailed data to see how much storage that consumes, then examine various ways of organizing it and analyze their efficiency in terms of % overhead vs your "real" data.
A brief glance shows that a HashMap, when you consider the associated tables, has about 80 bytes of overhead per entry. An ArrayList looks to average out around 10-12. Without looking, I would guess that a TreeMap would be more than a HashMap -- maybe 100.
Generally speaking, links within your own objects will be "cheaper", both in storage and speed to access, than links using these aggregating objects. But the aggregating objects are convenient to use, and have been "optimized" to a degree.
(But looking at your update, you probably should be looking at a DB application, rather than holding everything in heap.)
You could use Topic and Model to construct a composite key in a single Map, e.g.
map.put(topic1_id + model1_id, list1_1);
map.put(topic1_id + model2_id, list1_2);
...
map.get(topic_id + model_id)
where the IDs are Strings (or a similar scheme could be used with numeric identifiers).
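One thing to watch with String IDs: plain concatenation can make different pairs collide ("ab" + "c" equals "a" + "bc"), so a separator helps. A minimal sketch, assuming the IDs never contain the '|' character (the class name and the separator are my own choices for illustration):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CompositeKeyMap {
    // Assumption: this character never occurs inside a topic or model ID.
    private static final char SEPARATOR = '|';

    private final Map<String, List<String>> map = new HashMap<>();

    private static String key(String topicId, String modelId) {
        return topicId + SEPARATOR + modelId;
    }

    void put(String topicId, String modelId, List<String> strings) {
        map.put(key(topicId, modelId), strings);
    }

    /** Returns the string at position x, or null if the combination is unknown. */
    String query(String topicId, String modelId, int x) {
        List<String> strings = map.get(key(topicId, modelId));
        return strings == null ? null : strings.get(x);
    }
}
```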
A similar approach is to assign each topic and model a unique number, then store the lists of strings in arrays, so looking up the list for a given combination is a matter of looking up two indexes, then accessing a given location in a 2D array. (however, this is easier when you know the number of topics and models in advance of constructing the data structure)
For memory efficiency, also consider the small details. In general, you want to minimise the number of Objects - each Object carries an overhead. ArrayLists can have a lot of wasted space as they grow dynamically (in OpenJDK the capacity increases by roughly 50% each time it is exceeded). If you can pre-size them to the required capacity (or use an array instead) then you can save a lot of memory. The same applies when using large numbers of small HashMaps.
One possible data structure is a hierarchy of maps, leading to an array of Strings. E.g.:
HashMap<Model, HashMap<Topic, String[]>> map;
A query function would then look like:
public String query(Model model, Topic topic, int x) {
    HashMap<Topic, String[]> childMap = map.get(model);
    if (childMap == null) {
        return null;
    }
    String[] list = childMap.get(topic);
    if (list == null) {
        return null;
    }
    return list[x];
}
Presuming your Model and Topic structures implement hashCode() and equals() reasonably, the query performance should be quite good.
One potential weakness: I'm assuming you need to index a large number of Model/Topic combinations, and related lists of Strings (if not, you presumably wouldn't be asking about optimization). My guess is that the child String[] arrays will consume a large amount of memory. Each array is a Java object (about 20 bytes) + a pointer at each array location.
2 suggestions there:
1) If many Model/Topic combinations share the same set of Strings, you could gain quite a lot by sharing those String[] instances.
2) If you're using a 64-bit VM, be sure to use compressed ordinary object pointers (-XX:+UseCompressedOops). That will at least keep most of the pointers to 4 bytes instead of 8. Compressed OOPs is the default since 1.6.0_23, so a relatively recent VM will save you some memory here.
One other possibility not mentioned is to store the strings using String[][][] and the models and topics in a List such as ArrayList, and then at query time:
public String query(Model model, Topic topic, int x) {
    return strings[models.indexOf(model)][topics.indexOf(topic)][x];
}
Note that indexOf returns -1 for an unknown model or topic, which would throw here. It could be further improved for speed if the topics and models were sorted; then binary search rather than indexOf could be used.
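A rough sketch of the binary-search variant, assuming models and topics are identified by String IDs held in sorted arrays (the class and field names are hypothetical):

```java
import java.util.Arrays;

final class ArrayLookup {
    private final String[] modelIds;    // must be sorted
    private final String[] topicIds;    // must be sorted
    private final String[][][] strings; // indexed as [model][topic][position]

    ArrayLookup(String[] sortedModelIds, String[] sortedTopicIds, String[][][] strings) {
        this.modelIds = sortedModelIds;
        this.topicIds = sortedTopicIds;
        this.strings = strings;
    }

    /** Returns the string at position x, or null if the model or topic is unknown. */
    String query(String modelId, String topicId, int x) {
        int m = Arrays.binarySearch(modelIds, modelId);
        int t = Arrays.binarySearch(topicIds, topicId);
        if (m < 0 || t < 0) {
            return null; // a negative result means "not found"
        }
        return strings[m][t][x];
    }
}
```

Each lookup is then two O(log n) searches plus three array accesses, with no per-entry map overhead.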
I'm looking for the most effective way of creating hashcodes for a very specific case of strings.
I have strings that can be converted to integer, they vary from 1 to 10,000, and they are very concentrated on the 1-600 range.
My question is: what is the most effective way, in terms of retrieval performance from a collection, to implement the hashcode for them?
What I'm thinking is:
I can convert the strings to integers and use a direct access table (an array of 10,000 rows). This will be very fast for retrieval but not very smart in terms of memory allocation;
I can use the strings as strings and get a hashcode for them (I won't have to convert them to integers, but I don't know how effective the hashcode for the strings will be in terms of collisions).
Any other ideas are greatly appreciated.
thanks a lot
Thanks everyone for your prompt replies...
There is another piece of information that I forgot to add. I think it will make this clearer if I let you know my final goal with this. I might not even need a hash table!
I just want to validate a stream against a dictionary that is immutable. I want to check whether a given tag is present in my message or not.
I will receive a string with several pairs tag=value. I want to verify if the tag must or must not be treated by my app.
You might want to consider a trie (http://en.wikipedia.org/wiki/Trie) or radix tree (http://en.wikipedia.org/wiki/Radix_tree). No need to parse the string into an integer, or compute a hash code. You're walking a tree as you walk the string.
Edit:
Both computing a hash code on a string and parsing an integer out of a string involve walking the entire string, and THEN using that value as a look-up into a specific data structure. Other techniques might involve simultaneously inspecting the string WHILE traversing a data structure. This MIGHT be of value to the poster who asked for "other ideas".
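For what it's worth, a minimal sketch of that idea for numeric tags: a trie with one child slot per decimal digit, walked directly over the incoming message up to the '=' sign. The class name, the method names and the tag=value layout handling are my assumptions for illustration:

```java
final class TagTrie {
    private static final class Node {
        final Node[] next = new Node[10]; // one slot per decimal digit
        boolean terminal;                 // true if a known tag ends here
    }

    private final Node root = new Node();

    /** Registers a numeric tag such as "35". */
    void add(String tag) {
        Node n = root;
        for (int i = 0; i < tag.length(); i++) {
            int d = tag.charAt(i) - '0';
            if (d < 0 || d > 9) {
                throw new IllegalArgumentException("not a numeric tag: " + tag);
            }
            if (n.next[d] == null) {
                n.next[d] = new Node();
            }
            n = n.next[d];
        }
        n.terminal = true;
    }

    /** Checks the tag that starts at 'from' and ends before the next '=' (or the end of input). */
    boolean containsTagAt(CharSequence message, int from) {
        Node n = root;
        int i = from;
        for (; i < message.length() && message.charAt(i) != '='; i++) {
            int d = message.charAt(i) - '0';
            if (d < 0 || d > 9) {
                return false;
            }
            n = n.next[d];
            if (n == null) {
                return false; // walked off the trie: unknown tag
            }
        }
        return i > from && n.terminal;
    }
}
```

No substring, hash code or integer is created per tag; the lookup finishes by the time the '=' is reached.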
Many collections (e.g. HashMap) already apply a supplemental "rehash" method to help with poor hashcode algorithms; e.g. browse the source code for HashMap.hash(). And Strings are very common keys, so you can be sure that String.hashCode() is highly optimized. So, unless you notice a lot of collisions between your hashCodes, I'd go with the standard code.
I tried putting the Strings for 0..600 into a HashSet to see what happened, but it's then pretty tedious to see how many entries had collisions. Look for yourself! If you really really care, copy the source code from HashMap into your own class, edit it so you can get access to the entries (in the Java 6 source code I'm looking at, that would be transient Entry[] table, YMMV), and add methods to count collisions.
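A less invasive way to get a feel for it is to count collisions yourself without touching HashMap. The sketch below counts duplicate String.hashCode() values for "1".."max", and separately how many keys would land in an already occupied bucket of a power-of-two table. The spreading step mirrors the one used by HashMap in Java 8 and later, which is an assumption about your JDK (the Java 6 source uses a different mix), and the class name is made up:

```java
import java.util.HashSet;
import java.util.Set;

final class HashCollisionCheck {
    /** Counts how many of the strings "1".."max" share a hashCode() with an earlier one. */
    static int duplicateHashCodes(int max) {
        Set<Integer> seen = new HashSet<>();
        int duplicates = 0;
        for (int i = 1; i <= max; i++) {
            if (!seen.add(String.valueOf(i).hashCode())) {
                duplicates++;
            }
        }
        return duplicates;
    }

    /** Counts keys that fall into an occupied bucket; 'buckets' must be a power of two. */
    static int bucketCollisions(int max, int buckets) {
        boolean[] used = new boolean[buckets];
        int collisions = 0;
        for (int i = 1; i <= max; i++) {
            int h = String.valueOf(i).hashCode();
            h ^= (h >>> 16);               // spreading step as in Java 8+ HashMap
            int index = h & (buckets - 1); // mask instead of modulo
            if (used[index]) {
                collisions++;
            } else {
                used[index] = true;
            }
        }
        return collisions;
    }

    public static void main(String[] args) {
        System.out.println("duplicate hash codes: " + duplicateHashCodes(10_000));
        System.out.println("bucket collisions (600 keys, 1024 buckets): " + bucketCollisions(600, 1024));
    }
}
```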
If there is only a limited range of valid values, why not represent the collection as an int[10000] as you suggested? The value at array[x] is the number of times that x occurs.
If your strings are represented as decimal integers, then parsing them to integers is a 5-iteration loop (up to 5 digits) and a couple of additions and subtractions. That is, it is incredibly fast. Inserting the elements is effectively O(1), retrieval is O(1). Memory required is around 40 KB (4 bytes per int).
One problem is that insertion order is not preserved. Maybe you don't care.
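Since the stated goal is only to check whether a tag is known, a boolean table is enough. A minimal sketch with the hand-rolled digit loop, assuming tags run from 1 to 10000 (hence 10001 slots, with index 0 unused); the class and method names are hypothetical:

```java
final class DirectTagTable {
    private final boolean[] known = new boolean[10_001]; // index 0 unused; tags run 1..10000

    void add(int tag) {
        known[tag] = true;
    }

    /** Parses a decimal tag by hand and looks it up; malformed or out-of-range input gives false. */
    boolean contains(CharSequence tag) {
        if (tag.length() == 0 || tag.length() > 5) {
            return false;
        }
        int value = 0;
        for (int i = 0; i < tag.length(); i++) {
            int d = tag.charAt(i) - '0';
            if (d < 0 || d > 9) {
                return false;
            }
            value = value * 10 + d;
        }
        return value < known.length && known[value];
    }
}
```

Swapping boolean[] for int[] gives the counting variant described above.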
Maybe you could think about caching the hashcode and only updating it if your collection has changed since the last time hashcode() was called. See Caching hashes in Java collections?
«Insert disclaimer about only doing this when it's a hot spot in your application and you can prove it»
Well the integer value itself will be a perfect hash function, you will not get any collisions. However there are two problems with this approach:
HashMap doesn't allow you to specify a custom hash function. So either you'll have to implement your own HashMap or you use a wrapper object.
HashMap uses a bitwise and instead of a modulo operation to find the bucket. This obviously throws bits away since it's just a mask. java.util.HashMap.hash(int) tries to compensate for this but I have seen claims that this is not very successful. Again we're back to implementing your own HashMap.
Now, at this point, since you're using the integer value as a hash function, why not use the integer value as a key in the HashMap instead of the string? If you really want to optimize this, you can write a hash map that uses int instead of Integer keys or use TIntObjectHashMap from Trove.
If you're really interested in finding good hash functions I can recommend Hashing in Smalltalk, just ignore the half dozen pages where the author rants about Java (disclaimer: I know the author).
I have ~25,000 distinct names in an SQL database, and would like to perform edit-distance comparison on all of these in order to normalize e.g. John Doe & Jhon Doe.
When the db was only around 1000 names I used to store all distinct names in an array. Then I would use two for-loops on that array, thereby comparing each element in the array to each of the others. When the edit-distance gave a match of say >0.9 I would execute an SQL-query substituting one value for the other in all records.
With my much larger database this is not possible anymore. What would you guys do?
ps: I'm also curious about any multithreaded solutions to this because the process is taking ages now.
pps: I'm coding in Java
What about computing the soundex of each of your names and possibly storing it in the database? You can even do that on DB side, for instance there's a MySQL SOUNDEX function.
After computing the soundex of each name, all you have to do is group the rows by identical soundex.
EDIT:
If soundex is too coarse for your application, you can first select candidates by comparing their soundexes, and use your usual metric on each set of candidates.
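If you would rather do the grouping on the Java side, here is a rough sketch: a hand-written American Soundex (Apache Commons Codec also ships one, if you prefer a library) and a grouping step that buckets names by code, so the expensive edit-distance loop only runs inside each bucket. Class and method names are made up for illustration, and non-ASCII letters are simply skipped, which may or may not suit your data:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SoundexBlocking {
    // Soundex digit for each letter A..Z; '0' marks letters that are not coded.
    private static final String CODES = "01230120022455012623010202";

    /** Returns the 4-character Soundex code, or "" if the name contains no ASCII letters. */
    static String soundex(String name) {
        StringBuilder out = new StringBuilder(4);
        char last = '0';
        for (int i = 0; i < name.length() && out.length() < 4; i++) {
            char c = Character.toUpperCase(name.charAt(i));
            if (c < 'A' || c > 'Z') {
                continue; // skip spaces, punctuation and non-ASCII characters
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept as-is
            } else if (code != '0' && code != last) {
                out.append(code);
            }
            if (c != 'H' && c != 'W') {
                last = code; // H and W do not separate letters with equal codes
            }
        }
        while (out.length() > 0 && out.length() < 4) {
            out.append('0');
        }
        return out.toString();
    }

    /** Buckets names by Soundex code; only names within one bucket need pairwise comparison. */
    static Map<String, List<String>> groupBySoundex(List<String> names) {
        Map<String, List<String>> blocks = new TreeMap<>();
        for (String n : names) {
            blocks.computeIfAbsent(soundex(n), k -> new ArrayList<>()).add(n);
        }
        return blocks;
    }
}
```

The buckets are independent of each other, so they are also a natural unit of work if you want to spread the comparison over several threads.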
There is no way around pairwise matching: that is as efficient as it gets.
If you need to do your record linkage faster, try using a string distance metric that requires fewer computations than the edit distance (Bonacci distance, Jaro–Winkler distance, etc.)
You could also use another metric as a preprocessing step, and then compute edit distance to confirm or deny the match.
I've seen many primitive examples describing how String intern()'ing works, but I have yet to see a real-life use-case that would benefit from it.
The only situation that I can dream up is having a web service that receives a considerable amount of requests, each being very similar in nature due to a rigid schema. By intern()'ing the request field names in this case, memory consumption can be significantly reduced.
Can anyone provide an example of using intern() in a production environment with great success? Maybe an example of it in a popular open source offering?
Edit: I am referring to manual interning, not the guaranteed interning of String literals, etc.
Interning can be very beneficial if you have N strings that can take only K different values, where N far exceeds K. Now, instead of storing N strings in memory, you will only be storing up to K.
For example, you may have an ID type which consists of 5 digits. Thus, there can only be 10^5 different values. Suppose you're now parsing a large document that has many references/cross references to ID values. Let's say this document has 10^9 references in total (obviously some references are repeated in other parts of the document).
So N = 10^9 and K = 10^5 in this case. If you are not interning the strings, you will be storing 10^9 strings in memory, where lots of those strings are equal (by the pigeonhole principle). If you intern() the ID string you get when you're parsing the document, and you don't keep any reference to the uninterned strings you read from the document (so they can be garbage collected), then you will never need to store more than 10^5 strings in memory.
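A scaled-down sketch of that N versus K effect, counting distinct String objects by identity rather than by equals(). The class name and the sizes are arbitrary choices for illustration:

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

final class InternDemo {
    /** Counts distinct String objects (by identity) among n parsed IDs drawn from k values. */
    static int distinctObjects(int n, int k, boolean intern) {
        Set<String> identities = Collections.newSetFromMap(new IdentityHashMap<String, Boolean>());
        for (int i = 0; i < n; i++) {
            // new String(...) guarantees a fresh object each time, as a parser would produce.
            String id = new String(String.format("%05d", i % k));
            identities.add(intern ? id.intern() : id);
        }
        return identities.size();
    }

    public static void main(String[] args) {
        System.out.println("without intern: " + distinctObjects(100_000, 100, false));
        System.out.println("with intern:    " + distinctObjects(100_000, 100, true));
    }
}
```

Without interning, every parsed ID stays alive as its own object; with interning, only one object per distinct value is retained and the temporary copies become garbage.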
We had a production system that processes literally millions of pieces of data at a time, many of which have string fields. We should have been interning strings, but there was a bug which meant we were not. By fixing the bug we avoided having to do a very costly (at least 6 figures, possibly 7) server upgrade.
Examples where interning will be beneficial involve a large number of strings where:
the strings are likely to survive multiple GC cycles, and
there are likely to be multiple copies of a large percentage of the Strings.
Typical examples involve splitting / parsing a text into symbols (words, identifiers, URIs) and then attaching those symbols to long-lived data structures. XML processing, programming language compilation and RDF / OWL triple stores spring to mind as applications where interning is likely to be beneficial.
But interning is not without its problems, especially if it turns out that the assumptions above are not correct:
the pool data structure used to hold the interned strings takes extra space,
interning takes time, and
interning doesn't prevent the creation of the duplicate string in the first place.
Finally, interning potentially increases GC overheads by increasing the number of objects that need to be traced and copied, and by increasing the number of weak references that need to be dealt with. This increase in overheads has to be balanced against the decrease in GC overheads that results from effective interning.
Not a complete answer but additional food for thought (found here):
Therefore, the primary benefit in this case is that using the == operator for internalized strings is a lot faster than use the equals() method [for not internalized Strings]. So, use the intern() method if you're going to be comparing strings more than a time or three.
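Never, ever, use intern on user-supplied data, as that can cause denial of service attacks (on older JVMs the intern pool lived in the fixed-size permanent generation, and on any JVM interned strings stay in the pool for as long as something references them). You can do validation on the user-supplied strings, but then again you've done most of the work needed for intern().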