Spring data / Neo4j path length with large data sets

I have been running the following query to find relatives within a certain "distance" of a given person:
@Query("start person=node({0}), relatives=node:__types__(className='Person') match p=person-[:PARTNER|CHILD*]-relatives where LENGTH(p) <= 2*{1} return distinct relatives")
Set<Person> getRelatives(Person person, int distance);
The 2*{1} comes from one conceptual "hop" between people being represented as two nodes - one Person and one Partnership.
This has been fine so far on test populations. Now I'm moving on to actual data, with populations of 1-10 million, and this is taking forever (also when run from the data browser in the web interface).
Assuming the cost came from loading everything into relatives, I rewrote the query as a test in the data browser:
start person=node(385716) match p=person-[:PARTNER|CHILD*1..10]-relatives where relatives.__type__! = 'Person' return distinct relatives
And that works fine, in fractions of a second on the same data store. But when I want to put it back into Java:
@Query("start person=node({0}) match p=person-[:PARTNER|CHILD*1..{1}]-relatives where relatives.__type__! = 'Person' return relatives")
Set<Person> getRelatives(Person person, int distance);
That won't work:
[...]
Nested exception is Properties on pattern elements are not allowed in MATCH.
"start person=node({0}) match p=person-[:PARTNER|CHILD*1..{1}]-relatives where relatives.__type__! = 'Neo4jPerson' return relatives"
^
Is there a better way of putting a path length restriction in there? I would prefer not to use a where as that would involve loading ALL the paths, potentially loading millions of nodes where I need only go to a depth of 10. This would presumably leave me no better off.
Any ideas would be greatly appreciated!
Michael to the rescue!
My solution:
public Set<Person> getRelatives(final Person person, final int distance) {
    final String query = "start person=node(" + person.getId() + ") "
        + "match p=person-[:PARTNER|CHILD*1.." + 2 * distance + "]-relatives "
        + "where relatives.__type__! = '" + Person.class.getSimpleName() + "' "
        + "return distinct relatives";
    return this.query(query);
    // Where I would previously instead have called
    // return personRepository.getRelatives(person, distance);
}
public Set<Person> query(final String q) {
    final EndResult<Person> result = this.template.query(q, MapUtil.map()).to(Neo4jPerson.class);
    final Set<Person> people = new HashSet<Person>();
    for (final Person p : result) {
        people.add(p);
    }
    return people;
}
Which runs very quickly!

You're almost there :)
Your first query is a full graph scan, which effectively loads the whole database into memory and pulls all nodes through this pattern match multiple times.
So it won't be fast, and it would also return huge datasets; I don't know if that's what you want.
The second query looks good. The only thing is that you cannot parametrize the min-max values of variable length relationships, because of the effects on query optimization / caching.
So for right now you'd have to go with template.query() or different query methods in your repo for different max-values.
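A minimal sketch of the string-building route, under some assumptions: the names RelativesQuery, build and params are hypothetical, the Cypher text follows the legacy syntax used in the question, and the start node is passed as a named parameter {id} while only the upper bound (which cannot be a parameter) is inlined.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: builds the Cypher text with the hop limit inlined. */
final class RelativesQuery {

    private RelativesQuery() {}

    /** One conceptual hop is two relationships (Person - Partnership - Person). */
    static String build(int distance) {
        if (distance < 1) {
            throw new IllegalArgumentException("distance must be >= 1, got " + distance);
        }
        // multiplyExact fails loudly instead of overflowing for absurd distances.
        int maxHops = Math.multiplyExact(2, distance);
        return "start person=node({id}) "
             + "match p=person-[:PARTNER|CHILD*1.." + maxHops + "]-relatives "
             + "where relatives.__type__! = 'Person' "
             + "return distinct relatives";
    }

    /** Values that can stay real parameters, such as the start node id. */
    static Map<String, Object> params(long personId) {
        Map<String, Object> params = new HashMap<>();
        params.put("id", personId);
        return params;
    }
}
```

The two results could then be handed to template.query(...) in the same way as in the solution above. Because the inlined value is an int that has been range-checked, nothing user-supplied ends up concatenated into the query text.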

Related

Ektorp CouchDb: Query for pattern with multiple contains

I want to query multiple candidates for a search string which could look like "My sear foo".
Now I want to look for documents which have a field that contains one (or more) of the entered strings (the input split on whitespace).
I found some code which allows me to do a search by pattern:
@View(name = "find_by_serial_pattern", map = "function(doc) { var i; if(doc.serialNumber) { for(i=0; i < doc.serialNumber.length; i+=1) { emit(doc.serialNumber.slice(i), doc);}}}")
public List<DeviceEntityCouch> findBySerialPattern(String serialNumber) {
    String trim = serialNumber.trim();
    if (StringUtils.isEmpty(trim)) {
        return new ArrayList<>();
    }
    ViewQuery viewQuery = createQuery("find_by_serial_pattern").startKey(trim).endKey(trim + "\u9999");
    return db.queryView(viewQuery, DeviceEntityCouch.class);
}
which works quite nicely when looking for just one pattern. But how do I have to modify my code to get a multiple contains on doc.serialNumber?
EDIT:
This is the current workaround, but there must be a better way, I guess.
Also, it only gives OR logic: an entry is in the list if it matches term1 or term2.
@View(name = "find_by_serial_pattern", map = "function(doc) { var i; if(doc.serialNumber) { for(i=0; i < doc.serialNumber.length; i+=1) { emit(doc.serialNumber.slice(i), doc);}}}")
public List<DeviceEntityCouch> findBySerialPattern(String serialNumber) {
    String trim = serialNumber.trim();
    if (StringUtils.isEmpty(trim)) {
        return new ArrayList<>();
    }
    String[] split = trim.split(" ");
    List<DeviceEntityCouch> list = new ArrayList<>();
    for (String s : split) {
        ViewQuery viewQuery = createQuery("find_by_serial_pattern").startKey(s).endKey(s + "\u9999");
        list.addAll(db.queryView(viewQuery, DeviceEntityCouch.class));
    }
    return list;
}

It looks like you are implementing a full-text search here. That's not going to be very efficient in CouchDB (I guess the same applies to other databases).
Correct me if I am wrong, but from your code it looks like you are trying to search a list of serial numbers for a pattern. CouchDB (or any other database) is quite efficient if you can somehow index the data you will be searching for.
Otherwise you must fetch every single record and perform a string comparison on it.
The only way I can think of to optimize this in CouchDB would be something like the following (with assumptions):
Your serial numbers are not very long (say 20 chars?)
You force the search to be always 5 characters
Generate view that emits every single 5 char long substring from your serial number - more or less this (could be optimized and not sure if I got the in):
...
// "<=" so that the last 5-char substring is emitted too,
// giving (length - 5 + 1) keys per document.
for (var i = 0; i <= doc.serialNo.length - 5; i++) {
    emit([doc.serialNo.substring(i, i + 5), doc._id]);
}
...
Use _count reduce function
Now the following url:
http://localhost:5984/test/_design/serial/_view/complex-key?startkey=["01234"]&endkey=["01234",{}]&group=true
Will return a list of documents with a hit count for a key of 01234.
If you don't group and set the reduce option to be false, you will get a list of all matches, including duplicates if a single doc has multiple hits.
Refer to http://ryankirkman.com/2011/03/30/advanced-filtering-with-couchdb-views.html for the information about complex keys lookups.
I am not sure how efficient CouchDB is at updating that view. It depends on how many records you will have and how many new entries appear between queries of the view (as I understand it, CouchDB rebuilds the view's B-tree on demand).
I have generated a view like that, splitting doc ids into 5-char-long keys. Out of over 1K docs it generated over 30K results, the ids being 32 chars long. Simple maths really: (serialNo.length - searchablekey.length + 1) * docscount.
Generating the view took a while but the lookups were fast.
You could generate keys of multiple lengths, etc. It all comes down to your record count vs. speed of lookups.
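To make the key-count arithmetic concrete, here is a small sketch in Java of the substring generation the view performs. It is an illustration only: the class and method names (SerialNgrams, ngrams) are made up, and the real work would still happen in the JavaScript map function.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper mirroring what the map function emits for one document. */
final class SerialNgrams {

    private SerialNgrams() {}

    /** Returns every substring of length k; there are (length - k + 1) of them. */
    static List<String> ngrams(String serial, int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be >= 1, got " + k);
        }
        List<String> keys = new ArrayList<>();
        // i + k <= length keeps the final window inside the string.
        for (int i = 0; i + k <= serial.length(); i++) {
            keys.add(serial.substring(i, i + k));
        }
        return keys;
    }
}
```

For a 32-character id and k = 5 this gives 28 keys per document, which is where the "over 1K docs, over 30K results" figure comes from. A string shorter than k produces no keys at all.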

Java: showing the output of a separate for loop as an extra column shown in a console

Bit of context...
In my project I have a nested for loop that outputs, for each category, its items and, for each item, its properties, so the output is 3 columns of data in the console (headings: Category/Item/Property). The for loop that shows this data looks like this (variables are set earlier in the method):
for... (picks up each category)
    for...(picks up items in category)
        for (String propertyName : item.getPropertyNames()) {
            out.println(category.getName() + "\t"
                + itemDesc.getName() + "\tProperty:"
                + propertyName);
        }
    }
}
The purpose of the project is to provide more dynamic documentation of the properties of set components in the system. (The \t makes it possible to separate them into individual columns on a console, or even in a file such as an Excel spreadsheet, should I choose to set a file on the PrintStream, which also happens at the start of this method.)
The Problem
Now for the problem: after the for loops specified above I have another for loop, separate from the data, that shows the list of all the functions and operators involved in the components:
//Outside the previous for loops
for (Function function : Functions.allFunctions) {
out.println(function.getSignature());
}
What I want is to set this list as the 4th column, but the positioning of the for loop and the way it is set leaves it fixed in the first column with the categories. I can't add it after the property names, as the functions are more generic to everything in the lists and there may be repetitions of the functions, which I am trying to avoid. Is there a way to set it as the fourth column? I'm having trouble finding anything that covers what I am looking for here. Hope this makes sense.

One solution, if the total amount of output is small enough to fit in memory, is to simply save all the data into an ArrayList of String, and output it all at the very end.
List<String> myList = new ArrayList<String>();
for... (picks up each category)
    for...(picks up items in category)
        for (String propertyName : item.getPropertyNames()) {
            myList.add(category.getName() + "\t"
                + itemDesc.getName() + "\tProperty:"
                + propertyName);
        }
    }
}
int i = 0;
// Here we assume that the total lines output by the previous set of loops is
// equal to the total output by this loop.
for (Function function : Functions.allFunctions) {
    out.println(myList.get(i) + "\t" + function.getSignature());
    i++;
}
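The code above relies on both loops producing the same number of lines. If that cannot be guaranteed, the shorter side can be padded with empty cells. The following is a sketch of that idea only; ColumnMerger and merge are hypothetical names, and the function signatures are assumed to have been collected into a list of strings first.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: appends one extra column to already tab-separated rows. */
final class ColumnMerger {

    private ColumnMerger() {}

    /**
     * @param rows        existing lines, each holding leftColumns tab-separated cells
     * @param extra       values for the additional last column
     * @param leftColumns number of cells in each existing line (3 in the question)
     */
    static List<String> merge(List<String> rows, List<String> extra, int leftColumns) {
        // A row with only empty left cells still needs leftColumns - 1 separators.
        StringBuilder blank = new StringBuilder();
        for (int c = 1; c < leftColumns; c++) {
            blank.append('\t');
        }
        int total = Math.max(rows.size(), extra.size());
        List<String> out = new ArrayList<>(total);
        for (int i = 0; i < total; i++) {
            String left = i < rows.size() ? rows.get(i) : blank.toString();
            String right = i < extra.size() ? extra.get(i) : "";
            out.add(left + "\t" + right);
        }
        return out;
    }
}
```

Each returned line can then be printed with out.println, and the extra column stays in fourth position whichever list is longer.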

How to get the size of a java.sql.ResultSet?

I want to get the size of the ResultSet inside the while loop.
I tried the code below, and I got the results that I want. But it seems to interfere with result.next(): the while loop only loops once if I do this.
What's the proper way of doing this?
result.first();
while (result.next()){
System.out.println(result.getString(2));
System.out.println("A. " + result.getString(5) + "\n" + "B. " + result.getString(6) + "\n" + "C. " + result.getString(7) + "\n" + "D. " + result.getString(8));
System.out.println("Answer: ");
answer = inputquiz.next();
result.last();
if (answer.equals(result.getString(10))) {
score++;
System.out.println(score + "/" + result.getRow());
} else {
System.out.println(score + "/" + result.getRow());
}
}

What's the proper way of doing this?
Map it to a List<Entity>. Since your code is far from self-documenting (you're using indexes instead of column names), I can't give a well suited example. So I'll take a Person as example.
First create a javabean class representing whatever a single row contains.
public class Person {
private Long id;
private String firstName;
private String lastName;
private Date dateOfBirth;
// Add/generate c'tors/getters/setters/equals/hashcode and other boilerplate.
}
(any decent IDE like Eclipse can autogenerate them)
Then let JDBC do the following job.
List<Person> persons = new ArrayList<Person>();
while (resultSet.next()) {
    Person person = new Person();
    person.setId(resultSet.getLong("id"));
    person.setFirstName(resultSet.getString("firstName"));
    person.setLastName(resultSet.getString("lastName"));
    person.setDateOfBirth(resultSet.getDate("dateOfBirth"));
    persons.add(person);
}
// Close resultSet/statement/connection in finally block.
return persons;
Then you can just do
int size = persons.size();
And then to substitute your code example
for (int i = 0; i < persons.size(); i++) {
Person person = persons.get(i);
System.out.println(person.getFirstName());
int size = persons.size(); // Do with it whatever you want.
}
See also:
How to check if there is zero-or-one result or one-or-more results and their size

You could call result.last() and then result.getRow() (which retrieves the current row number) to get the count. But it will have to load all the rows, and if it's a big result set, that might not be very efficient. The best way to go about it is to run a SELECT COUNT(*) on your query beforehand and get the count, as demonstrated in this post.

This is a tricky question.
Normally, result.last() scrolls to the end of the ResultSet, and you can't go back.
If you created the statement using one of the createStatement or prepareStatement methods with a "resultSetType" parameter, and you've set the parameter to ResultSet.TYPE_SCROLL_INSENSITIVE or ResultSet.TYPE_SCROLL_SENSITIVE, then you can scroll the ResultSet using first() or relative() or some other methods.
However, I'm not sure if all databases / JDBC drivers support scrollable result sets, and there are likely to be performance implications in doing this. (A scrollable result set implies that either the database or the JVM needs to buffer the entire resultset somewhere ... or recalculate it ... and that's expensive for a large resultset.)

A way of getting the size of a ResultSet without using an ArrayList etc.:
int size = 0;
if (rs != null)
{
    rs.beforeFirst();
    rs.last();
    size = rs.getRow();
}
Now you will have the size. If you want to print the ResultSet afterwards, call the following line before printing:
rs.beforeFirst();
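As noted in the earlier answer, this only works on a scrollable result set. A small sketch that makes the requirement explicit (the class and method names ResultSetSize and countRows are hypothetical, and whether a scrollable cursor is supported or efficient depends on the JDBC driver):

```java
import java.sql.ResultSet;
import java.sql.SQLException;

/** Hypothetical helper: counts rows by scrolling, then rewinds the cursor. */
final class ResultSetSize {

    private ResultSetSize() {}

    static int countRows(ResultSet rs) throws SQLException {
        if (rs.getType() == ResultSet.TYPE_FORWARD_ONLY) {
            throw new SQLException(
                "result set is forward-only; create the statement with a scrollable type");
        }
        // last() returns false when the result set has no rows.
        int size = rs.last() ? rs.getRow() : 0;
        // Rewind so a following while (rs.next()) loop sees every row.
        rs.beforeFirst();
        return size;
    }
}
```

The statement would have to be created with something like connection.createStatement(ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY). Note that the helper always rewinds to before the first row, so it should be called before iterating, not in the middle of a loop.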

There is also another way to get the count from the DB.
Note:
This column gets updated when the DBAs gather real-time statistics
select num_rows from all_Tables where table_name ='<TABLE_NAME>';

mapping ipaddress range to country codes (data-structure hashmaps or trees?)

I'm trying to solve a puzzle which I found here:
http://zcasper.blogspot.com/2005/10/google-phone-interview.html
The goal is to represent an IP address range to country code look-up table in memory, and use this data structure to process a zillion rows of IP addresses to identify the country code.
So I started with a shoot-from-the-hip thought of using a hash table.
A hash table works great if we have a country code to range look-up, as we have fewer country names that map to IP address ranges.
But I'm not sure how to go from IP address to country code. Any thoughts?
Or can I use a tree data structure?

The input file provides a range of IP Addresses (not 1:1 mapping) so you need some sort of ordered map structure.
import java.util.Map;
import java.util.TreeMap;

// Assuming IPv4, and the inputs are valid (start before end)
// and no overlapping ranges.
public class CountryCodeToIPMap {
    // Keyed by the start address of each range.
    private final TreeMap<Long, CountryCodeEntry> ipMap =
        new TreeMap<Long, CountryCodeEntry>();
    public void addIpRange(long startIp, long endIp, String countryCode) {
        ipMap.put(startIp, new CountryCodeEntry(endIp, countryCode));
    }
    public String getCountryCode(long ip) {
        // Closest range starting at or before ip, if any.
        Map.Entry<Long, CountryCodeEntry> entry = ipMap.floorEntry(ip);
        if (entry != null && ip <= entry.getValue().endIpAddress) {
            return entry.getValue().countryCode;
        } else {
            return null;
        }
    }
}
public class CountryCodeEntry {
    public final long endIpAddress;
    public final String countryCode;
    public CountryCodeEntry(long endIpAddress, String countryCode) {
        this.endIpAddress = endIpAddress;
        this.countryCode = countryCode;
    }
}
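The map above works on addresses held as long values. A minimal sketch of turning a dotted IPv4 string into such a value follows; the class and method names (Ipv4, toLong) are made up for illustration, and IPv6 is not handled.

```java
/** Hypothetical helper: converts "a.b.c.d" into an unsigned 32-bit value held in a long. */
final class Ipv4 {

    private Ipv4() {}

    static long toLong(String dotted) {
        String[] parts = dotted.trim().split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + dotted);
        }
        long value = 0;
        for (String part : parts) {
            // parseInt throws NumberFormatException (an IllegalArgumentException) on junk.
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range in: " + dotted);
            }
            // Shift the previous octets up and append the new one.
            value = (value << 8) | octet;
        }
        return value;
    }
}
```

With this, a lookup might read getCountryCode(Ipv4.toLong("192.168.0.1")), where the argument evaluates to 3232235521.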

You have no chance of storing all IP addresses.
What you can do is store the start-end intervals of the IP address ranges.
There is a specialized data structure, called an interval tree, which allows you to query this.

This is if you are considering an SQL solution:
If you can add some constraints to your data set, you might get away with very simple SQL, where you can even use simple indexes. That is the case when you use the GeoCityLite dataset.
If your IP blocks are non-overlapping, you can just insert them into a database as unsigned 32-bit numbers in a "blocks" table and query them like this with Hibernate:
(GeoipBlocks) getSession()
    .createQuery("select gb" +
        " from GeoipBlocks gb" +
        " where gb.startIpNum <= :ipnumeric " +
        " order by gb.startIpNum desc")
    .setMaxResults(1)
    .setParameter("ipnumeric", ipInLongValue)
    .uniqueResult()
I wrote it down in HQL syntax, because not all databases use the same syntax for offset + limit.
That issues a query for the best match, assuming all blocks are non-overlapping. You don't even need the end IP for that; it is determined automatically by the successor.
Avoid querying it this way:
select * from blocks where ipstart <= ip and ipend >= ip
My database was not able to fully utilize its indexes, and did a lot of table scanning.

Due to the way internet routing works, your algorithm needs to handle Longest Prefix Matching, and you want to store CIDR blocks instead of addresses.
I developed an algorithm to handle this but can't post it here. The closest thing in Open Source is the routing table handling code in Linux.
You can also check out Patricia Trie or Radix Tree algorithms. Those can be used to solve this problem.
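As an illustration of longest prefix matching, here is a sketch of a plain binary trie over IPv4 bits. It is not the algorithm the answer refers to, and it is not a path-compressed Patricia trie; the names CidrTrie, add and lookup are hypothetical, and addresses are assumed to be 32-bit values held in a long.

```java
/** Hypothetical sketch: uncompressed binary trie for longest prefix matching on IPv4. */
final class CidrTrie {

    private static final class Node {
        final Node[] child = new Node[2];
        String countryCode; // null when no CIDR block ends at this node
    }

    private final Node root = new Node();

    /** Registers a block such as 10.0.0.0/8 (network = 0x0A000000L, prefixLength = 8). */
    void add(long network, int prefixLength, String countryCode) {
        if (prefixLength < 0 || prefixLength > 32) {
            throw new IllegalArgumentException("bad prefix length: " + prefixLength);
        }
        Node node = root;
        // Walk the first prefixLength bits, most significant bit first.
        for (int i = 0; i < prefixLength; i++) {
            int bit = (int) ((network >>> (31 - i)) & 1L);
            if (node.child[bit] == null) {
                node.child[bit] = new Node();
            }
            node = node.child[bit];
        }
        node.countryCode = countryCode;
    }

    /** Returns the code of the most specific block containing ip, or null if none. */
    String lookup(long ip) {
        Node node = root;
        String best = root.countryCode;
        for (int i = 0; i < 32 && node != null; i++) {
            int bit = (int) ((ip >>> (31 - i)) & 1L);
            node = node.child[bit];
            // Remember the deepest block seen so far; deeper means longer prefix.
            if (node != null && node.countryCode != null) {
                best = node.countryCode;
            }
        }
        return best;
    }
}
```

A lookup costs at most 32 steps regardless of how many blocks are stored. Unlike the TreeMap approach, overlapping blocks are fine here: the more specific one wins.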

I can't understand this programming code for pseudorandom number generator for hashing

First of all, I have just begun learning Java, and I can say it is more challenging than C or Python. I'm not very keen on programming either, so I have a hard time understanding how some code works. This one in particular:
import static javax.swing.JOptionPane.*;

public class Pseudo
{
    final int a = 2;
    final int c = 3;
    int address;
    String list[][] = new String [100][6];
    public void AddRecord(String ID, String Name, String Course, String Address, String Email, String Contact)
    {
        address = (a * Integer.parseInt(ID) + c) % list.length;
        if((Integer.parseInt(ID)<100000||Integer.parseInt(ID)>999999)||ID.length()==0 || Name.length()==0 || Course.length()==0 || Address.length()==0)
        {
            showMessageDialog(null,"The ID number should be in six digit and the particular field should not be empty","",ERROR_MESSAGE);
        }
        else{
            if(list[address][0]!=null){
                showMessageDialog(null,"Collison is occur, the same address is get. Recalculating...............","",WARNING_MESSAGE);
                while(list[address][0]!=null)
                {
                    address = (a * address + c) % list.length;
                }
            }
            list[address][0] = ID;
            list[address][1] = Name;
            list[address][2] = Course;
            list[address][3] = Address;
            list[address][4] = Email;
            list[address][5] = Contact;
            showMessageDialog(null,"Student Information " + ID + " will be saved in address: " + address,"",INFORMATION_MESSAGE);
        }
    }
}
The confusion comes with:
address = (a * Integer.parseInt(ID) + c) % list.length;
if((Integer.parseInt(ID)<100000||Integer.parseInt(ID)>999999)||ID.length()==0 || Name.length()==0 || Course.length()==0 || Address.length()==0)
What does it mean? From what I understand from this code, inside an IF statement you can have more than 1 condition. I'm not very sure, since this is my first time seeing such code.
The second is this
if(list[address][0]!=null){
showMessageDialog(null,"Collison is occur, the same address is get. Recalculating...............","",WARNING_MESSAGE);
while(list[address][0]!=null)
{
address = (a * address + c) % list.length;
}
}
list[address][0] = ID;
list[address][1] = Name;
list[address][2] = Course;
list[address][3] = Address;
list[address][4] = Email;
list[address][5] = Contact;
showMessageDialog(null,"Student Information " + ID + " will be saved in address: " + address,"",INFORMATION_MESSAGE);
If a collision occurs, the address at which the record is stored should be altered using the pseudorandom number generator again, but what I can't grasp is
list[address][0]!=null. I am just baffled by this line. I know its job is to change the address if a collision happens, but I don't know the exact mechanics of how this part is executed.

From what I understand from this code, inside an IF statement you can have more than 1 condition.
Well, yes and no. You can construct complex conditions based on many smaller conditions, but ultimately the whole thing has to resolve to a single boolean true/false result.
Consider the condition in this case:
(Integer.parseInt(ID)<100000||Integer.parseInt(ID)>999999)||ID.length()==0 || Name.length()==0 || Course.length()==0 || Address.length()==0
Let's break that down into its components:
(
Integer.parseInt(ID)<100000 ||
Integer.parseInt(ID)>999999
) ||
ID.length()==0 ||
Name.length()==0 ||
Course.length()==0 ||
Address.length()==0
It's really just chaining together a bunch of comparisons into one big true/false statement. You can essentially read something like this as:
If (something) or (something else) or (another thing) then...
And each something can itself contain small somethings, etc. You can build as complex a logical condition as you want, grouping sub-conditions with parentheses, as long as the whole thing resolves to a single true/false result.
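One way to see that the whole thing is a single true/false value is to give the sub-conditions names. A sketch, with a hypothetical class and method (RecordValidation, isValid) and one behavioural assumption that differs from the original: a non-numeric ID is treated as invalid instead of throwing.

```java
/** Hypothetical helper: the question's compound condition, split into named parts. */
final class RecordValidation {

    private RecordValidation() {}

    static boolean isValid(String id, String name, String course, String address) {
        int number;
        try {
            // Assumption: surrounding whitespace is tolerated, junk is simply invalid.
            number = Integer.parseInt(id.trim());
        } catch (NumberFormatException e) {
            return false;
        }
        // Each line is itself one boolean built from smaller comparisons with ||.
        boolean idOutOfRange = number < 100000 || number > 999999;
        boolean anyFieldEmpty = name.isEmpty() || course.isEmpty() || address.isEmpty();
        boolean invalid = idOutOfRange || anyFieldEmpty;
        // The same rule written with && would be: !idOutOfRange && !anyFieldEmpty
        return !invalid;
    }
}
```

The last comment also shows that && combines conditions into a single boolean just as || does; which operator to use depends on whether any one or all of the sub-conditions must hold.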
what I can't grasp is list[address][0]!=null
That is just checking if a particular value is null. That value is part of a nested (jagged) array. So you have a variable called list. That variable is an array. Each element in that array is, itself, also an array. So you end up with a kind of 2-dimensional array (but a jagged one, where any given sub-array doesn't have to be the same length as any other).
That specific piece of code looks into the list array, at the address index, and looks at the 0 index of that sub-array, and checks if that value is null.

First of all, understanding any code is much easier if it's properly formatted. All good IDEs have such a function, e.g. for Eclipse the shortcut is Ctrl+Shift+F, for IntelliJ IDEA Ctrl+Alt+L.
The most important part, which might resolve your first confusion: || is the logical OR in Java, meaning the ID must be a number between 100000 and 999999 and the attributes must not be empty. Or literally, if the ID is smaller than 100000 or larger than 999999 or any of the values are empty, there will be an error message and nothing will be done.
For the second part: null means that a variable is not set, so to prevent overwriting an entry you can check if it's already set, i.e. not equal to null. So the code changes the address variable until an address is found for which no data is set yet and then uses it to store the given data.
There are several potential problems in this code, among which:
several calls to the relatively slow Integer.parseInt(String) where it could be called once and stored into a variable
potential NumberFormatException if ID isn't a number (or is empty, or has some excess white spaces)
potential infinite loop if the array is full
But as it looks like some CS homework it shouldn't matter.
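For illustration, the points above could be addressed along these lines. This is a sketch only, not the original author's code: BoundedProbe and add are hypothetical names, the ID is assumed to be already parsed into a six-digit int, and only two of the six fields are stored to keep it short.

```java
/** Hypothetical sketch: the same probing scheme with a bounded number of attempts. */
final class BoundedProbe {

    private static final int A = 2;
    private static final int C = 3;

    private final String[][] list = new String[100][6];

    /** Stores the record and returns its slot, or -1 if no free slot was found. */
    int add(int id, String name) {
        int address = (A * id + C) % list.length;
        // Give up after list.length probes instead of looping forever.
        for (int attempts = 0; attempts < list.length; attempts++) {
            if (list[address][0] == null) {
                // null means nothing has been stored here yet, so the slot is free.
                list[address][0] = String.valueOf(id);
                list[address][1] = name;
                return address;
            }
            // Slot taken: derive the next candidate from the current address.
            address = (A * address + C) % list.length;
        }
        return -1;
    }
}
```

With a = 2 and c = 3 the recalculated address is always odd, so probing never reaches an even slot; the method can therefore return -1 while the array still has free entries, which is exactly the case where the original while loop would spin forever.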

Thank you so much, Mr David. I understand the first part: if you have conditions, you can stack them on each other, and from what I can understand it only works with the || (OR) operator, since using this will guarantee either a true or false ending.
while(list[address][0]!=null)
But I'm still a little confused about part 2 of my problem. That line checks whether the array entry is null, meaning it has no value, right? This is my understanding of the situation: that particular part of the code is supposed to resolve any collision if the user enters the same ID number, right? So shouldn't it be checking the value that's causing the collision? But what the line seems to be doing is running the corresponding procedure as long as a null value is detected.
