JPQL number to string conversion disagreeing with Java

Due to some of the restrictions of JPQL (no ordering in subqueries, can't check array equality), I'm having to do some workarounds. Namely, I'm concatenating some numbers (and commas) into a string and checking if they're in an array parameter. (I wish to know which item has exactly a particular set of associated entries in a cross-reference table: "For each item, is the set of other_id/value pairs referencing item equal to this set I'm passing in as a parameter?") However, the number-to-string conversion tends to go a little weird at high decimal places; I think JPQL (or possibly Postgres) is converting the numbers into strings slightly differently than Java does. For instance, JPQL vs Java might give me
1.20991849899292 vs
1.2099184989929199
or maybe the other way around, and new BigDecimal(value).toString() gives me something again slightly different.
How can I get them to agree? Consider that there's string A, returned from a query, and I wish it to match string B, calculated in Java from the same number. One way I thought of was to reduce the precision of the number before storing it, so it just wouldn't run into the issue. This worked, but only once there were only, say, 10 bits of mantissa left, which was more than I wanted to remove. Another way I thought of was to obtain string B by running something like
entityManager.createQuery("SELECT CAST(:val as text)", String.class)
.setParameter("val", 0.14520927387582)
.getSingleResult();
...but JPQL refuses to do queries that don't have a FROM clause. I could make a junk table, always containing exactly one row, but that's really hacky.
Is there a way to tell JPQL how to format a number when it turns it into a string (without knowing which underlying database you're using)? Is there a way to get the SELECT CAST method to work without being more than a little hacky? Any other, better ideas?
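To illustrate the precision-reduction workaround: I round through BigDecimal at a fixed decimal scale before the value is ever stored, and build string B the same way, so both sides derive from identical digits. A rough sketch (the scale of 6 here is an arbitrary choice, not something JPQL or the database prescribes):

import java.math.BigDecimal;
import java.math.RoundingMode;

public class CanonicalNumbers {
    // Round to a fixed decimal scale and render without scientific notation,
    // so the stored value and the Java-side comparison string share digits.
    // The scale (6) is an arbitrary choice for illustration.
    public static String canonical(double value) {
        return BigDecimal.valueOf(value)
                .setScale(6, RoundingMode.HALF_UP)
                .toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(canonical(1.2099184989929199)); // 1.209918
    }
}

If the column stores Double.parseDouble(canonical(v)) rather than the raw double, the database never sees the digits that render differently - but as noted above, this costs more precision than I'd like.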

Related

Comparing sets of randomly assigned codes in Java to assign a name

Good day,
I honestly do not know how to phrase the problem in the title, thus the generic description. Actually I have a set of ~150 codes, which are combined to produce a single string, like this: "a_b_c_d". Valid combinations contain 1-4 codes plus the '-' character if no value is assigned, and each code is only used once ("a_a..." is not considered valid). These sets of codes are then assigned to a unique name. Not all combinations are logical, but if a combination is valid then the order of the codes does not matter (if "f_g_e_-" is valid, then "e_g_f_-" and "e_f_-_g" are valid, and they all have the same name). I have taken the time to assign each valid combination to its unique name and tried to create a single parser to read these values and produce the name.
I think the problem is apparent. Since the order does not matter, I have to check for every possible combination. The codes cannot be strictly ordered, since there are some codes that have meaning in any position. So this is impossible to accomplish with a simple parser. Is there an optimal way to do this, or will I have to force the user to use some kind of order against the standard?
Try using a TreeMap to store each code (String) and its count (int). Increment the count for a code every time it is encountered in the string.
After processing the whole string, if the count for any code is > 1 then the string has repeated codes and is invalid; otherwise it is valid.
Traversal of a TreeMap is sorted by key, so you can traverse it to generate a sorted, canonical code sequence.
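A minimal sketch of that approach, assuming codes are separated by '_' and that the '-' placeholder may legitimately repeat; the name table you would key with the canonical string is not shown:

import java.util.Map;
import java.util.TreeMap;

public class CodeNormalizer {
    // Returns the canonical (sorted) form of a combination such as "f_g_e_-",
    // or null if any real code appears more than once.
    public static String canonicalize(String combination) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String code : combination.split("_")) {
            counts.merge(code, 1, Integer::sum);
        }
        StringBuilder canonical = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1 && !e.getKey().equals("-")) {
                return null; // repeated code: invalid combination
            }
            if (canonical.length() > 0) {
                canonical.append('_');
            }
            canonical.append(e.getKey());
        }
        return canonical.toString();
    }
}

Every valid reordering ("f_g_e_-", "e_g_f_-", ...) canonicalizes to the same string, so a plain Map<String, String> from canonical form to name replaces the parser.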

Fuzzy Matching Duplicates in Java

I have a List<String[]> of customer records in Java (from a database). I know from manually eyeballing the data that 25%+ are duplicates.
The duplicates are far from exact though. Sometimes they have different zips, but the same name and address. Other times the address is missing completely, etc...
After a day of research, I'm still really stumped as to how to even begin to attack this problem.
What are the "terms" that I should be googling for that describe this area (from a solve-this-in-Java perspective)? And I don't suppose there is a fuzzymatch.jar out there that makes it all just too easy?
I've done similar systems before for matching place information and people information. These are complex objects with many features and figuring out whether two different objects describe the same place or person is tricky. The way to do it is to break it down to the essentials.
Here are a few things that you can do:
0) If this is a one-off, load the data into OpenRefine and fix things interactively. At best this solves your problem; at worst it will show you where your possible matches are.
1) There are several ways you can compare strings. Basically they differ in how prone they are to false positives and false negatives. A false positive is when strings match that shouldn't have; a false negative is when strings that should match don't. String equality will not produce false positives but will miss a lot of potential matches due to slight variations. Levenshtein distance with a small threshold is slightly better. N-grams produce a lot of matches, but many of them will be false positives. There are a few more algorithms; take a look at e.g. the OpenRefine code to find various ways of comparing and clustering strings. Lucene implements a lot of this in its analyzer framework, but it is a bit of a beast to work with if you are not very familiar with its design.
2) Separate the process of comparing stuff from the process of deciding whether you have a match. What I did in the past was qualify my comparisons using a simple numeric score, e.g. this field matched exactly (100), that field was a partial match (75), and that field did not match at all (0). The resulting vector of qualified comparisons, e.g. (100, 75, 0, 25), can be compared to a reference vector that defines your perfect or partial match criteria. For example, if first name, last name, and street match, the two records are the same regardless of the rest of the fields. Or if phone numbers and last names match, that's a valid match too. You can encode such perfect matches as a vector and then simply compare it with your comparison vectors to determine whether something is a match, not a match, or a partial match (see the sketch after this list). This is sort of a manual version of what machine learning does, which is to extract vectors of features and then build up a probability model of which vectors mean what from reference data. Doing it manually can work for simple problems.
3) Build up a reference data set with test cases that you know to match or not match, and evaluate your algorithm against that reference set. That way you will know whether you are improving things or making them worse when you tweak e.g. the threshold that goes into Levenshtein, or whatever.
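Here is the sketch promised in point 2. Everything concrete in it is hypothetical: the fields, the 100/75/0 scores, the Levenshtein threshold, and the two reference vectors are placeholders for whatever your data calls for.

import java.util.Arrays;

public class MatchScorer {
    // Classic Levenshtein edit distance (two-row version).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // 100 = exact, 75 = close by edit distance, 0 = different or missing.
    static int scoreField(String a, String b) {
        if (a == null || b == null || a.isEmpty() || b.isEmpty()) return 0;
        if (a.equalsIgnoreCase(b)) return 100;
        int maxLen = Math.max(a.length(), b.length());
        return levenshtein(a.toLowerCase(), b.toLowerCase()) <= maxLen / 4 ? 75 : 0;
    }

    // A score vector satisfies a reference vector if every score meets it.
    static boolean meets(int[] scores, int[] reference) {
        for (int i = 0; i < reference.length; i++) {
            if (scores[i] < reference[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Fields: first name, last name, street, phone.
        int[] scores = {
            scoreField("Jon", "John"),
            scoreField("Smith", "Smith"),
            scoreField("12 Main St.", "12 Main St"),
            scoreField(null, "555-0100"),
        };
        // Match if name + street agree, OR last name + phone agree.
        int[][] references = { { 75, 100, 75, 0 }, { 0, 100, 0, 100 } };
        boolean match = Arrays.stream(references).anyMatch(r -> meets(scores, r));
        System.out.println("duplicate? " + match); // duplicate? true
    }
}

This prints "duplicate? true" because the first reference vector (name + street) is satisfied; in a real system you would tune the thresholds against the reference data set from point 3.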
Jilles' answer is great and comes from experience. I've also had to work on cleaning up large messy tables and sadly didn't know much about my options at that time (I ended up using Excel and a lot of autofilters). Wish I'd known about OpenRefine.
But if you get to the point where you have to write custom code to do this, I want to make a suggestion as to how: The columns are always the same, right? For instance, the first String is always the key, the second is the First name, the sixth is the ZIP code, tenth is the fax number, etc.?
Assuming there's not an unreasonable number of fields, I would start with a custom Record type which has each DB field as a member rather than a position in an array. Something like
class CustomerRow {
    public final String id;
    public final String firstName;
    // ...

    public CustomerRow(String[] data) {
        id = data[0];
        // ...
    }
}
You could also include some validation code in the constructor, if you knew there to be garbage values you always want to filter out.
(Note that you're basically doing what an ORM would do automatically, but getting started with one would probably be more work than just writing the Record type.)
Then you'd implement some Comparator<CustomerRow>s which only look at particular fields, or define equality in fuzzy terms (that's where the edit distance algorithms would come in handy), or do special sorts.
Java uses a stable sort for objects, so to sort by e.g. name, then address, then key, you would just do each sort in turn, applying your comparators in the reverse order (least significant key first).
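A small sketch of both sort strategies, reusing the CustomerRow sketch above (only id and firstName appear there, so those are the keys used; Comparator.comparing and thenComparing are Java 8 additions):

import java.util.Comparator;
import java.util.List;

public class RowSorting {
    // Option 1: repeated stable sorts, least significant key first.
    static void sortInPasses(List<CustomerRow> rows) {
        rows.sort(Comparator.comparing((CustomerRow r) -> r.id));
        rows.sort(Comparator.comparing((CustomerRow r) -> r.firstName));
    }

    // Option 2, equivalent: one pass with a composed comparator.
    static void sortInOnePass(List<CustomerRow> rows) {
        rows.sort(Comparator.comparing((CustomerRow r) -> r.firstName)
                            .thenComparing(r -> r.id));
    }
}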
Also if you have access to the actual database, and it's a real relational database, I'd recommend doing some of your searches as queries where possible. And if you need to go back and forth between your Java objects and the DB, then using an ORM may end up being a good option.

Hashcode for strings that can be converted to integer

I'm looking for the most effective way of creating hashcodes for a very specific case of strings.
I have strings that can be converted to integers; they vary from 1 to 10,000, and they are heavily concentrated in the 1-600 range.
My question is: what is the most effective way, in terms of performance when retrieving the items from a collection, to implement the hashcode for them?
What I'm thinking is:
I can have the strings converted to integers and use a direct access table (an array of 10,000 rows) - this will be very fast for retrieval but not very smart in terms of memory allocation;
I can use the strings as strings and get a hashcode for them (I won't have to convert them to integers, but I don't know how effective the string hashcodes will be in terms of collisions)
Any other ideas are greatly appreciated.
thanks a lot
Thanks everyone for your prompt replies...
There is another piece of information that I forgot to add. I think it will make this clearer if I let you know my final goal - I might not even need a hash table!
I just want to validate a stream against a dictionary that is immutable. I want to check if a given tag might or might not be present in my message.
I will receive a string with several tag=value pairs. I want to verify whether each tag must or must not be handled by my app.
You might want to consider a trie (http://en.wikipedia.org/wiki/Trie) or radix tree (http://en.wikipedia.org/wiki/Radix_tree). No need to parse the string into an integer, or compute a hash code. You're walking a tree as you walk the string.
Edit:
Both computing a hash code on a string and parsing an integer out of a string involve walking the entire string, and THEN using that value as a look-up into a specific data structure. Other techniques might involve simultaneously inspecting the string WHILE traversing a data structure. This MIGHT be of value to the poster who asked for "other ideas".
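For illustration, a minimal digit trie along those lines. The 10-way child array is just one possible node layout, and add() assumes its input contains only decimal digits:

public class DigitTrie {
    private static class Node {
        final Node[] children = new Node[10]; // one slot per decimal digit
        boolean present; // true if a key ends at this node
    }

    private final Node root = new Node();

    public void add(String key) { // key must be decimal digits only
        Node n = root;
        for (int i = 0; i < key.length(); i++) {
            int d = key.charAt(i) - '0';
            if (n.children[d] == null) n.children[d] = new Node();
            n = n.children[d];
        }
        n.present = true;
    }

    public boolean contains(String key) {
        Node n = root;
        for (int i = 0; i < key.length(); i++) {
            int d = key.charAt(i) - '0';
            if (d < 0 || d > 9 || n.children[d] == null) return false;
            n = n.children[d];
        }
        return n.present;
    }
}

Look-up touches one node per character and never allocates, which is the "inspect the string WHILE traversing" property described above.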
Many collections (e.g. HashMap) already apply a supplemental "rehash" method to help with poor hashcode algorithms, e.g. browse the source code for HashMap.hash(). And Strings are very common keys, so you can be sure that String.hashCode() is highly optimized. So, unless you notice a lot of collisions between your hashCodes, I'd go with the standard code.
I tried putting the Strings for 0..600 into a HashSet to see what happened, but it's then pretty tedious to see how many entries had collisions. Look for yourself! If you really really care, copy the source code from HashMap into your own class, edit it so you can get access to the entries (in the Java 6 source code I'm looking at, that would be transient Entry[] table, YMMV), and add methods to count collisions.
If there is only a limited valid range of values, why not represent the collection as an int[10000], as you suggested? The value at array[x] is the number of times that x occurs.
If your strings are represented as decimal integers, then parsing them is a 5-iteration loop (up to 5 digits) and a couple of additions and subtractions. That is, it is incredibly fast. Inserting the elements is effectively O(1), retrieval is O(1). Memory required is around 40kb (4 bytes per int).
One problem is that insertion order is not preserved. Maybe you don't care.
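A sketch of that table, assuming every key is a decimal string in [1, 10000] (no input validation shown):

public class DirectAccessCounter {
    private final int[] counts = new int[10001]; // index == parsed value

    // Hand-rolled parse: at most 5 digits, no object allocation.
    private static int parse(String s) {
        int value = 0;
        for (int i = 0; i < s.length(); i++) {
            value = value * 10 + (s.charAt(i) - '0');
        }
        return value;
    }

    public void add(String key) {
        counts[parse(key)]++;
    }

    public boolean contains(String key) {
        return counts[parse(key)] > 0;
    }
}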
Maybe you could think about caching the hashcode and only updating it if your collection has changed since the last time hashcode() was called. See Caching hashes in Java collections?
«Insert disclaimer about only doing this when it's a hot spot in your application and you can prove it»
Well, the integer value itself will be a perfect hash function; you will not get any collisions. However, there are two problems with this approach:
HashMap doesn't allow you to specify a custom hash function. So either you'll have to implement your own HashMap or you use a wrapper object.
HashMap uses a bitwise AND instead of a modulo operation to find the bucket. This obviously throws bits away, since it's just a mask. java.util.HashMap.hash(int) tries to compensate for this, but I have seen claims that this is not very successful. Again we're back to implementing your own HashMap.
At this point, since you're using the integer value as a hash function, why not use the integer value as the key in the HashMap instead of the string? If you really want to optimize this, you can write a hash map that uses int instead of Integer keys, or use TIntObjectHashMap from Trove.
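That last suggestion is a one-line change at the call site; a tiny sketch, with the value type left as String purely for illustration:

import java.util.HashMap;
import java.util.Map;

public class IntKeyedLookup {
    public static void main(String[] args) {
        // Key by the parsed int (boxed to Integer) instead of the raw string.
        Map<Integer, String> byTag = new HashMap<>();
        byTag.put(Integer.parseInt("600"), "handle this tag");
        System.out.println(byTag.containsKey(Integer.parseInt("600"))); // true
    }
}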
If you're really interested in finding good hash functions I can recommend Hashing in Smalltalk, just ignore the half dozen pages where the author rants about Java (disclaimer: I know the author).

most efficient Java data structure for searching triples of strings

Suppose I have a large list (around 10,000 entries) of string triples as such:
car noun yes
dog noun no
effect noun yes
effect verb no
Suppose I am presented with a string double - for example, (effect, verb) - and I need to quickly look in the list to see if the pair appears and, if it does, whether its value is yes or no. (For this example the double does appear and the value is "no".)
What is the best data structure in Java to store the list and the most efficient way to perform the search? I am running hundreds of thousands of these searches so speed is of the essence.
Thanks!
You might consider using a HashMap<YourDouble, String>. Searches will be O(1).
You could either create an object, YourDouble which holds the first two values, or else append one to the other -- if values will still be unique -- and use HashMap<String, String>.
I would create a HashMultimap for each type of search you want, e.g. "all three", "each pair", and "each single field". When you build the list, populate all the different maps, then you can fetch from whichever map is appropriate for your query.
(The downside is that you'll need a type for at least each arity, e.g. use just String for the "single field" maps, but a Pair for the two-field maps, and a Triple for the three-field map.)
You could use a HashMap where the key is the concatenation of the first two strings, the ones which you'll use for lookups, and the value is a Boolean, representing the yes and no strings.
Alternatively, it seems the words in the second column would be fewer, since they represent categories. You could have a HashMap<String, HashMap<String, Boolean>> where you first index by e.g. "noun", "verb" etc. and then you index by e.g. "car", "dog", "effect", to get to your boolean. This would probably be more space-efficient.
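A sketch of that nested-map layout, with the example data from the question hard-coded:

import java.util.HashMap;
import java.util.Map;

public class TripleLookup {
    // category -> word -> yes/no, e.g. lookup("effect", "verb") == false
    private final Map<String, Map<String, Boolean>> index = new HashMap<>();

    public void add(String word, String category, boolean value) {
        index.computeIfAbsent(category, k -> new HashMap<>()).put(word, value);
    }

    // Returns null if the (word, category) pair is absent from the list.
    public Boolean lookup(String word, String category) {
        Map<String, Boolean> byWord = index.get(category);
        return byWord == null ? null : byWord.get(word);
    }

    public static void main(String[] args) {
        TripleLookup t = new TripleLookup();
        t.add("effect", "noun", true);
        t.add("effect", "verb", false);
        System.out.println(t.lookup("effect", "verb")); // false
    }
}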
10k doesn't seem that large to me. Have you tried a DB?
The place to look for information like this is the Semantic Web. A number of projects work on Triple Stores of just this type. There's a list at the bottom of the Triple Store page of implementations.
As far as Java is concerned, your algorithms are almost certainly going to be language independent, and if you find a good algorithm implemented in C, its Java port will also be fast.
Also, what does your data set look like? Are there a lot of two-field matches, such that subject and verb are often the same? How many matches are you expecting to get? MapReduce will work well for finding one match in 10k, but won't work as well for a query that returns 8k of the 10k rows, where the query can't be easily partitioned.
There's a query language made just for this problem too: SPARQL. The bigdata blog has some good insights, though again 10k doesn't seem that large.

Find a large collection of strings within a larger collection of strings

I have a collection of strings that I want to filter. They'll be in this pattern:
xxx_xxx_xxx_xxx
so always a sequence of letters or numbers separated by three underscores. The max length of each string will be 60 characters. I might have a few million of these in my collection.
What data structure could I use to efficiently do something like this:
Get all strings that start with: "abc_123_456"
Get all strings that start with: "def_999_888"
etc..
for example, I could do this:
List<String> matched = new ArrayList<String>();
for (String it : strings) {
    if (it.startsWith(match)) {
        matched.add(it);
    }
}
but that would take a long time if my collection is on the order of millions of strings, and worse yet if the number of matched strings is also high.
The high-level problem is that I want to answer the following question for an app I'm writing: "which of my friends have recommended product A for product B?". I could store this information in a sql table and run the following statement:
select recommender from recs where username='me' and prodIdA='a' and prodIdB='b';
I'm curious if something custom in java/C/C++ could run faster, using encoded flat strings like I have above:
myusername_prodIdA_prodIdB_recommenderusername
The idea being that you could do a starts-with operation on the whole collection of encoded strings to get your answer.
I know trying to implement a custom solution like this is most likely not usable in a production environment, so some SQL DB would be better; just curious though.
Thanks
To do that in Java, you can use a Trie structure.
That being said, I don't think it's a good idea. Dumping "a few million" records into memory won't always work.
That's what databases are for; with the right design and proper indexing you can have very good performance with the DB alone.
I think you are looking for a SortedMap.
"headMap(K toKey)
Returns a view of the portion of this map whose keys are strictly less than toKey."
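To turn that into a prefix query, you bound the view on both sides. A sketch with TreeSet.subSet, using Character.MAX_VALUE as a sentinel upper bound (this assumes no key actually contains that character):

import java.util.SortedSet;
import java.util.TreeSet;

public class PrefixQuery {
    // All keys >= prefix and < prefix + '\uffff' start with the prefix.
    public static SortedSet<String> startingWith(TreeSet<String> strings, String prefix) {
        return strings.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        TreeSet<String> strings = new TreeSet<>();
        strings.add("abc_123_456_xyz");
        strings.add("abc_123_456_uvw");
        strings.add("def_999_888_qqq");
        System.out.println(startingWith(strings, "abc_123_456"));
        // prints [abc_123_456_uvw, abc_123_456_xyz]
    }
}

Locating the range is O(log n), and the returned set is a live view, so no copying happens until you iterate.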
I know trying to implement a custom solution like this is most likely not usable in a production environment, so some sql db would be better, just curious though
If only for the sake of curiosity, you can put all existing distinct "myusername_prodIdA_prodIdB" combinations in a hashtable, and for each combination store a list of relevant results.
So the structure would look like Map<String, List<String>> and be used like hash.get("def_999_888"). Lookup is constant time (O(1)).
You can get rid of the inner list and optimize it in many ways, but this is the idea.
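A sketch of building that index from the encoded strings; splitting on the last underscore assumes recommender usernames contain no underscores:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RecIndex {
    // "user_prodA_prodB_recommender" -> map "user_prodA_prodB" to recommenders
    static Map<String, List<String>> build(List<String> encoded) {
        Map<String, List<String>> index = new HashMap<>();
        for (String s : encoded) {
            int cut = s.lastIndexOf('_'); // recommender starts after this
            String key = s.substring(0, cut);
            index.computeIfAbsent(key, k -> new ArrayList<>())
                 .add(s.substring(cut + 1));
        }
        return index;
    }
}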
The first thing that comes to mind for me is pre-processing the strings into some sort of data structure so that they could be searched for efficiently. If you're going to be calling the search function many times, I think it'd be good for you to put all of the strings into a hash table for a constant-time look up. It'd take more processing power to construct your array of strings, but it'd trivialize the task of searching for them.
