Hash join in Java

I have two relations A and B, both with all-integer attributes (A{a1,a2,a3,...}, B{b1,b2,b3,...}). How would I hash-join these two in Java? The user will pick the two joining attributes. Do I make two hashtables and then proceed to join them?

Well, what form do your relations have? Are they in a relational database? If so, just use an SQL JOIN - the DBMS will probably do a hash join, but you don't have to care about that.
If they're not in a relational database, why not?
If some weird constraint prevents you from using the best tool for the job, then yes, doing a hash join is as simple as putting each tuple into a hashtable keyed on the join attribute and then iterating over the entries of one and looking up matches in the other. If all your data fits into main memory, that is.

Here is a way to do a hash join in Java.
The best approach is to build a HashMap over one table, say A, keyed on its join attribute.
HashMap<Integer, Object[]> hash = new HashMap<>();
for (Object[] a : as) {
    hash.put((Integer) a[0], a);   // a[0] holds the join attribute a1
}
Then loop over B, probing the hash and collecting the matches.
List<Object[]> joined = new ArrayList<>();
for (Object[] b : bs) {
    Object[] a = hash.get((Integer) b[0]);   // b[0] holds the join attribute b1
    if (a != null) {
        joined.add(new Object[]{a, b});
    }
}
This will only work if each element of table A has a unique a1.
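If a1 is not unique in A, one hedged variant is to bucket the rows of A by the join value and emit one output pair per match; the column indexes used below (0 for a1 and 0 for b1) are placeholders, since the user picks the join attributes.
Map<Integer, List<Object[]>> buckets = new HashMap<>();
for (Object[] a : as) {
    buckets.computeIfAbsent((Integer) a[0], k -> new ArrayList<>()).add(a);   // group A rows by a1
}
List<Object[]> joined = new ArrayList<>();
for (Object[] b : bs) {
    List<Object[]> matches = buckets.get((Integer) b[0]);                     // probe with b1
    if (matches != null) {
        for (Object[] a : matches) {
            joined.add(new Object[]{a, b});                                    // one result per matching A row
        }
    }
}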

Related

Joining postgres tables from different servers

I want to extract data by joining tables from two different PostgreSQL databases hosted on different servers, using Java.
ResultSet resA = statement_A.executeQuery("select issue_id from Server_A.table_name");
ResultSet resB = statement_B.executeQuery("select issue_id from Server_B.table_name");
How can I get a join query executed to get the result set in this case? Any pointers would be highly appreciated.
You can't do it in any automatic/magical way. What you can do is define a class that will have the union of properties of the two tables like:
public class JoinedResult {
    private int id;
    private String name;
    // all other properties common to both tables
    ...
    // properties exclusive to the first table
    ...
    // properties exclusive to the second table
    ...
}
and construct a list of these objects that will contain the joined result of both tables.
To make the actual construction you have a few options:
The first one and the easiest one (but not efficient) is to iterate both results with nested loops, and once the ids (or whatever key is used) match you should construct a JoinedResult.
The second one is a bit more complex but also more efficient:
Iterate first result set and construct a map that will map the id to the object.
Iterate second result set and construct a map that will map the id to the object.
Run a loop over the keys of one of the maps you constructed, use each key to access the matching values in both maps, and finally construct the joined object, as in the sketch below.
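For reference, a rough sketch of the second, map-based option; the JoinedResult constructor and the non-key column names are hypothetical (the question only selects issue_id):
// Index the rows of the first result set by issue_id.
Map<Integer, Object[]> rowsA = new HashMap<>();
while (resA.next()) {
    rowsA.put(resA.getInt("issue_id"),
              new Object[]{ resA.getString("summary") });   // hypothetical Server_A columns
}
// Index the rows of the second result set the same way.
Map<Integer, Object[]> rowsB = new HashMap<>();
while (resB.next()) {
    rowsB.put(resB.getInt("issue_id"),
              new Object[]{ resB.getString("status") });    // hypothetical Server_B columns
}
// Walk the keys of one map and look up the matching rows in the other.
List<JoinedResult> joined = new ArrayList<>();
for (Map.Entry<Integer, Object[]> entry : rowsA.entrySet()) {
    Object[] rowB = rowsB.get(entry.getKey());
    if (rowB != null) {
        joined.add(new JoinedResult(entry.getKey(), entry.getValue(), rowB));  // hypothetical constructor
    }
}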

Fetching a record from an arraylist that is made up of arrays of objects

I have a Java ArrayList that is made like this:
{[{},{}], [{},{}], [{},{}], [{},{}]} with around four thousand records.
I have a particular key that I want to search for in one of the objects in this list, and I want to fetch the array where that record matches. The search key is a string.
Is there a solution to this that does not traverse the entire list?
It is basically a list that is constructed like this:
List<Object[]> list = new ArrayList<>();
I am using this to fetch the data from two tables using a join. Individual records of each table map to these objects.
Say table1: {a:1,b:2,c:3} and table2: {x:1,y:2,z:3}
the data returned would be
{[{a:1,b:2,c:3}, {x:1,y:2,z:3}],[{a:2,b:3,c:4}, {x:2,y:3,z:4}]}
How would I search for, say, which array in the list has a=2?
Thanks
If you do not want to be a victim of the linear search, you should consider using another type of data structure than List.
The use case you described seems like a good match for a Map in general. If you want constant time key lookup, consider using HashMap instead.
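For reference, a minimal sketch of that idea, assuming each list element holds the two joined records as Object arrays and that the value of a sits at a known position (both are assumptions here):
// Build the index once, keyed on the string form of a.
Map<String, Object[]> index = new HashMap<>();
for (Object[] pair : list) {
    Object[] table1Record = (Object[]) pair[0];           // record from table1
    index.put(String.valueOf(table1Record[0]), pair);     // key: the value of a
}
// Constant-time lookup instead of scanning all ~4000 entries:
Object[] match = index.get("2");                          // the pair whose table1 record has a=2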

Non-serializing in-memory database

I have the following problem:
There is a Set<C> s of objects of class C. C is defined as follows:
class C {
    A a;
    B b;
    ...
}
Given A e, B f, ..., I want to find from s all objects o such that o.a = e, o.b = f, ....
Simplest solution: stream over s, filter, collect, return. But that takes a long time.
Half-assed solution: create a Map<A, Set<C>> indexA, which splits the set by a's value. Stream over indexA.get(e), filter for the other conditions, collect, return.
More-assed solution: create index maps for all fields, select for all criteria from the maps, stream over the shortest list, filter for other criteria, collect, return.
You see where this is going: we're accidentally building a database. The thing is that I don't want to serialize my objects. Sure I could grab H2 or HSQLDB and stick my objects in there, but I don't want to persist them. I basically just want indices on my regular old on-the-heap Java objects.
Surely there must be something out there that I can reuse.
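For concreteness, a minimal sketch of the index-map approach described above, assuming A and B implement equals/hashCode and ignoring concurrency and index maintenance on updates:
// Build one index per field (only a and b shown).
Map<A, Set<C>> indexA = new HashMap<>();
Map<B, Set<C>> indexB = new HashMap<>();
for (C c : s) {
    indexA.computeIfAbsent(c.a, k -> new HashSet<>()).add(c);
    indexB.computeIfAbsent(c.b, k -> new HashSet<>()).add(c);
}
// Query: start from the smaller candidate set, then filter for the remaining criteria.
Set<C> byA = indexA.getOrDefault(e, Collections.emptySet());
Set<C> byB = indexB.getOrDefault(f, Collections.emptySet());
Set<C> smaller = byA.size() <= byB.size() ? byA : byB;
Set<C> result = smaller.stream()
        .filter(c -> c.a.equals(e) && c.b.equals(f))
        .collect(Collectors.toSet());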
Eventually, I found a couple of projects which tackle this problem including CQEngine, which seems like the most complete and mature library for this purpose.
HSQLDB provides the option of storing Java objects directly in an in-memory database without serializing them.
The property sql.live_object=true is used as a property on the connection URL to a mem: database, for example jdbc:hsqldb:mem:test;sql.live_object=true. A table is created with a column of type OTHER to store the object. Extra columns in this table duplicate any fields in the object that need indexing.
For example:
CREATE TABLE OBJECTLIST (ID INTEGER IDENTITY, OBJ OTHER, TS_FIELD TIMESTAMP, INT_FIELD INTEGER)
CREATE INDEX IDX1 ON OBJECTLIST(TS_FIELD)
CREATE INDEX IDX2 ON OBJECTLIST(INT_FIELD)
The object is stored in the OBJ column, and the timestamp and integer values for the fields that are indexed are stored in the extra columns. SQL queries such as SELECT * FROM OBJECTLIST WHERE INT_FIELD = 1234 return the rows containing the relevant objects.
http://hsqldb.org/doc/2.0/guide/dbproperties-chapt.html#dpc_sql_conformance
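A rough JDBC sketch against that schema; the MyClass type, the sample values, and the default SA account are assumptions for illustration:
// Open an in-memory database with live objects enabled.
Connection conn = DriverManager.getConnection(
        "jdbc:hsqldb:mem:test;sql.live_object=true", "SA", "");
// Store the object plus copies of the fields that need indexing.
PreparedStatement insert = conn.prepareStatement(
        "INSERT INTO OBJECTLIST (OBJ, TS_FIELD, INT_FIELD) VALUES (?, ?, ?)");
insert.setObject(1, myObject);                                     // stored without serialization
insert.setTimestamp(2, new Timestamp(System.currentTimeMillis()));
insert.setInt(3, 1234);
insert.executeUpdate();
// Query by an indexed column and get the object back.
PreparedStatement select = conn.prepareStatement(
        "SELECT OBJ FROM OBJECTLIST WHERE INT_FIELD = ?");
select.setInt(1, 1234);
ResultSet rs = select.executeQuery();
while (rs.next()) {
    MyClass o = (MyClass) rs.getObject("OBJ");                     // returned as a live object, not a copy
}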

Storing and Searching an Array of Strings in Google App Engine DataStore (Java)

I am trying to implement a many to one relationship. I plan to store an array of keys (datastore entity key) of one model in the other model's entity as List<String>.
e.g. Say 4 entities of Model A (a1, a2, a3, a4) have datastore keys key1, key2, key3 and key4 respectively. Now I store an entity of Model B which has a property called "ids" as List<String>. "ids" has these Strings as its elements: key1, key2, key3 and key4.
It's all fine till now.
But how do I query the model B for each of these ids now?
What I want to do is something like this:
query.setFilter(FilterOperator.EQUAL.of(ids,"key1")).
Clearly this can not be done right now.
What I am doing now is fetching the ids property of each B entity, manually deserializing it into a list of strings, and then checking whether the key is present or not.
As you can see, this is highly inefficient. How should I approach this? Should I store these mappings in a separate Model? I don't want to handle joins, but I will have to if I can't find anything better than the present solution.
I am not using JPA or JDO and I plan not to use them.
Any help would be appreciated.
The query with EQUAL filter works fine for lists of values. Make sure you pass correct value when executing this query.
For example, you can store List<Key>, if you only use this entity on the server side.
If you need this list on the client side, and you always store keys for entities of the same kind, you can store a list of ids (List<Long>) or names (List<String>) used to create these keys. This will take much less space.
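If you are on the low-level Datastore API, here is a hedged sketch of such a query; the kind name "B" and the property name "ids" follow the question, the rest is illustrative:
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
Query query = new Query("B")
        .setFilter(new Query.FilterPredicate("ids", Query.FilterOperator.EQUAL, "key1"));
// For a list property, EQUAL matches entities whose list contains the given value.
for (Entity b : datastore.prepare(query).asIterable()) {
    // ... each b here has "key1" somewhere in its ids list
}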

storing huge HashMap in database

I have something like:
import java.util.HashMap;
import java.util.List;
public class A {
    HashMap<Long, List<B>> hashMap = new HashMap<Long, List<B>>();
}
class B {
    int a;
    int b;
    int c;
}
And I want to store this in a database, because it will be huge.
I will have more than 250,000,000 keys in the HashMap, and each key represents a huge list of data (say a list size of around 1000).
How can I do this with the best performance when retrieving the list of B objects for a given Long key from the database?
Any other suggestions?
Thanks in advance.
To me, this looks like a classical One-To-Many or Many-To-Many association between two tables.
If each B belongs to only one A, then you would have a table A and a table B containing a foreign key to A.
If a given B can belong to multiple As, then you would have a table A, a table B, and a join table between the two tables.
Make sure to index the foreign keys.
As you have a very large data set of up to 1/4 bn keys * 1 k entries * ~20 bytes each, or about 5 TB, the main problem is that it can't be stored in memory and is too large to store on SSD, so you will have to access disk efficiently; otherwise you are looking at a latency of about 8 ms per key. This should be your primary concern, as otherwise it will take days just to access every key randomly once.
Unless you have a good understanding of how to implement this with memory-mapped files, you will need to use a database, preferably one designed to handle large numbers of records. You will also need a disk sub-system, not only for capacity but to give you more spindles, so you can increase the number of requests you can perform concurrently.
Using Infinispan, you could just work with your huge map and have parts of it (the ones not recently accessed) stored to disk to save RAM. This is easier than writing a whole DB layer yourself and (I think) faster, and it uses less memory at runtime, since the entire map is never in memory all at once.
You can map this directly as a one-to-many relationship. You need two tables: one to hold the keys (let's call it KeyTable), and another one to keep the B objects (BTable). BTable needs a foreign key to KeyTable. Then you can query something like this to get the objects with key 1234:
SELECT * FROM BTABLE WHERE key=1234;
For performance, you probably should code this using JDBC instead of something like Hibernate, to have better control of memory usage.
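For example, a hedged JDBC sketch of fetching all B rows for one key; the BTable layout follows the answer above, and the column names a, b, c and the accessibility of B's fields are assumptions:
// Fetch the list of B objects for key 1234.
List<B> result = new ArrayList<>();
String sql = "SELECT a, b, c FROM BTABLE WHERE key = ?";
try (PreparedStatement ps = conn.prepareStatement(sql)) {
    ps.setLong(1, 1234L);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            B item = new B();
            item.a = rs.getInt("a");
            item.b = rs.getInt("b");
            item.c = rs.getInt("c");
            result.add(item);
        }
    }
}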
