Mapping IP address ranges to country codes (data structure: hash maps or trees?) - Java

I am trying to solve a puzzle that I found here:
http://zcasper.blogspot.com/2005/10/google-phone-interview.html
The goal is to represent an IP-address-range to country-code lookup table in memory, and to use that data structure to process a zillion rows of IP addresses and identify the country code for each.
My shoot-from-the-hip thought was to use a Hashtable.
A hash table works great for a country-code to range lookup, since there are far fewer country codes than IP address ranges.
But I am not sure how to go from an IP address to a country code. Any thoughts?
Or can I use a tree data structure?

The input file provides a range of IP Addresses (not 1:1 mapping) so you need some sort of ordered map structure.
// Assuming IPv4, valid inputs (start before end)
// and no overlapping ranges.
import java.util.Map;
import java.util.TreeMap;

public class CountryCodeToIPMap {
    // Keyed by the start of each range; the value holds the range end and the code.
    private final TreeMap<Long, CountryCodeEntry> ipMap =
            new TreeMap<Long, CountryCodeEntry>();

    public void addIpRange(long startIp, long endIp, String countryCode) {
        ipMap.put(startIp, new CountryCodeEntry(endIp, countryCode));
    }

    public String getCountryCode(long ip) {
        // floorEntry returns the range with the greatest start <= ip, if any.
        Map.Entry<Long, CountryCodeEntry> entry = ipMap.floorEntry(ip);
        if (entry != null && ip <= entry.getValue().endIpAddress) {
            return entry.getValue().countryCode;
        } else {
            return null;
        }
    }
}

// Package-private so that both classes can live in one source file.
class CountryCodeEntry {
    public final long endIpAddress;
    public final String countryCode;

    public CountryCodeEntry(long endIpAddress, String countryCode) {
        this.endIpAddress = endIpAddress;
        this.countryCode = countryCode;
    }
}
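The map above takes addresses as long values, while input rows usually carry dotted-quad text. A minimal sketch of that conversion, assuming IPv4 only; the class name Ipv4 and method toLong are hypothetical helpers, not part of the answer:

```java
// Hypothetical helper: turns dotted-quad IPv4 text into the unsigned
// 32-bit value (held in a long) that getCountryCode(long) expects.
class Ipv4 {
    static long toLong(String dotted) {
        String[] parts = dotted.trim().split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + dotted);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part); // throws NumberFormatException on junk
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Octet out of range: " + part);
            }
            // Shift the accumulated value left by one byte and append this octet.
            value = (value << 8) | octet;
        }
        return value;
    }
}
```

With this, a lookup would read something like getCountryCode(Ipv4.toLong("192.168.0.1")).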

You have no chance of storing all IP addresses individually.
What you can do is store the start and end of each interval that an IP address range covers.
There is a specialized data structure, the interval tree, which lets you query which interval contains a given address.

This applies if you are considering an SQL solution:
If you can add some constraints to your data set, you might get away with very simple SQL that can even use simple indexes. That is the case when you use the GeoCityLite dataset.
If your IP blocks are non-overlapping, you can just insert them into a database as unsigned 32-bit numbers in a "blocks" table and query them like this with Hibernate:
(GeoipBlocks) getSession()
.createQuery("select gb" +
" from GeoipBlocks gb" +
" where gb.startIpNum <= :ipnumeric " +
" order by gb.startIpNum desc")
.setMaxResults(1)
.setParameter("ipnumeric", ipInLongValue)
.uniqueResult()
I wrote it down in HQL syntax because not all databases use the same syntax for offset + limit.
That issues a query for the best match, assuming all blocks are non-overlapping. You don't even need the end IP for that, since it is determined automatically by the successor block (if the blocks leave gaps, compare against the end IP of the returned block afterwards to rule out a false match).
Avoid querying it this way:
select * from blocks where ipstart <= ip and ipend >= ip
My database was not able to fully utilize its indexes with this form and did a lot of table scanning.

Due to the way Internet routing works, your algorithm needs to handle longest prefix matching, and you want to store CIDR blocks instead of individual addresses.
I developed an algorithm to handle this but can't post it here. The closest thing in open source is the routing table handling code in Linux.
You can also check out the Patricia trie or radix tree algorithms. Those can be used to solve this problem.
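To illustrate longest prefix matching, here is a minimal sketch of a plain (uncompressed) binary trie over CIDR blocks. It is not the answerer's algorithm and not a Patricia trie, which would additionally collapse single-child chains; the class and method names are hypothetical.

```java
// Hypothetical sketch: one trie level per address bit, most significant first.
class CidrTrie {
    private static final class Node {
        Node zero, one;
        String countryCode; // non-null only where a registered prefix ends
    }

    private final Node root = new Node();

    /** Registers a block, e.g. add(0x0A000000L, 8, "AA") for 10.0.0.0/8. */
    void add(long network, int prefixLength, String countryCode) {
        if (prefixLength < 0 || prefixLength > 32) {
            throw new IllegalArgumentException("Bad prefix length: " + prefixLength);
        }
        Node node = root;
        for (int i = 0; i < prefixLength; i++) {
            boolean bit = ((network >> (31 - i)) & 1L) == 1L;
            if (bit) {
                if (node.one == null) node.one = new Node();
                node = node.one;
            } else {
                if (node.zero == null) node.zero = new Node();
                node = node.zero;
            }
        }
        node.countryCode = countryCode;
    }

    /** Returns the code of the longest registered prefix covering ip, or null. */
    String lookup(long ip) {
        Node node = root;
        String best = root.countryCode;
        for (int i = 0; i < 32 && node != null; i++) {
            boolean bit = ((ip >> (31 - i)) & 1L) == 1L;
            node = bit ? node.one : node.zero;
            // Remember the deepest prefix seen so far on the way down.
            if (node != null && node.countryCode != null) best = node.countryCode;
        }
        return best;
    }
}
```

A lookup walks at most 32 levels regardless of how many blocks are stored, and a more specific block (such as a /16 inside a /8) wins over the enclosing one.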

Related

Java: how to hash a string with low collision probability and specify the characters allowed in the output

Is there any way to hash a string and specify the characters allowed in the output, or is there a better approach to avoiding collisions when producing a hash 8 characters in length?
I am running into a situation where I am seeing a collision with my current hashing method (see the example implementation below).
I am currently using crc32 from https://guava.dev/releases/20.0/api/docs/com/google/common/hash/Hashing.html
The hashes produced are alphanumeric and 8 characters in length.
I need to keep the 8-character length (I am not storing passwords). Is there a way to specify an "alphabet" of allowed output characters for a hashing function?
E.g. to allow (a-z, 0-9) and a set of extra characters, e.g. (_, $, -).
The characters added will need to be URI friendly.
This would allow me to decrease the probability of collisions occurring.
The hash output will be stored in a cache for a maximum of 60 days, so collisions occurring after that period will have no effect.
Current approach, example code:
import com.google.common.hash.HashFunction;
import com.google.common.hash.Hasher;
import com.google.common.hash.Hashing;
public class Test {
private static final String SALT = "4767c3a6-73bc-11ec-90d6-0242ac120003";
public static void main( String[] args )
{
// actual strings causing collisions removed as have to redact some data
String string1 = "myStringOne";
String string2 = "myStringTwo";
System.out.println( "string1:" + string1);
System.out.println( "string1 hashed:" + doHash(string1, SALT));
System.out.println( "string2:" + string2);
System.out.println( "string2 hash:" + doHash(string2, SALT));
}
private static String doHash(String keyValue, String salt){
HashFunction func = Hashing.crc32();
Hasher hasher = func.newHasher();
hasher.putUnencodedChars(keyValue);
hasher.putUnencodedChars(salt);
return hasher.hash().toString();
}
}
Functionality of the code / problem statement:
Using a key-store db.
A user requests a resource.
A hash is made of (user details & requested resource).
If the resulting id is already present, return that item from the db.
Otherwise, perform processing on the resource and store it in the db, with the hash result as the id.
The cache is purged periodically.
Questions:
Is there a way to specify the alphabet the hash is allowed to use in its output?
I checked the docs but do not see an approach: https://guava.dev/releases/20.0/api/docs/com/google/common/hash/Hashing.html
Or is there an alternative approach that would be recommended?
E.g. generating a longer hash and taking a subset.
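One way to try the "longer hash, then a subset" idea is sketched below using only the JDK: compute SHA-256 and re-encode part of it in a chosen alphabet. The class name, method name and the 38-symbol alphabet are assumptions made for illustration; this is not a Guava feature.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: derive a fixed-length id over a custom alphabet.
class ShortHash {
    // Assumed alphabet: a-z, 0-9, '_' and '-' (38 URI-safe symbols).
    static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789_-";

    static String hash(String keyValue, String salt, int length) {
        byte[] digest;
        try {
            digest = MessageDigest.getInstance("SHA-256")
                    .digest((keyValue + salt).getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
        BigInteger value = new BigInteger(1, digest); // treat digest as unsigned
        BigInteger base = BigInteger.valueOf(ALPHABET.length());
        StringBuilder out = new StringBuilder(length);
        // Peel off one base-38 digit per output character.
        for (int i = 0; i < length; i++) {
            BigInteger[] quotientAndRemainder = value.divideAndRemainder(base);
            out.append(ALPHABET.charAt(quotientAndRemainder[1].intValue()));
            value = quotientAndRemainder[0];
        }
        return out.toString();
    }
}
```

Eight symbols from a 38-character alphabet give 38^8 (about 4.3 * 10^12) possible ids, against 2^32 (about 4.3 * 10^9) for CRC32. By the birthday bound, collisions become likely on the order of the square root of that space: roughly 65,000 keys for CRC32 versus roughly 2 million here. Collisions therefore get rarer but remain possible, so the store would still need to detect them.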

Ektorp CouchDb: Query for pattern with multiple contains

I want to query multiple candidates for a search string which could look like "My sear foo".
Now I want to look for documents which have a field that contains one (or more) of the entered strings (the search string split on whitespace).
I found some code which allows me to do a search by pattern:
@View(name = "find_by_serial_pattern", map = "function(doc) { var i; if(doc.serialNumber) { for(i=0; i < doc.serialNumber.length; i+=1) { emit(doc.serialNumber.slice(i), doc);}}}")
public List<DeviceEntityCouch> findBySerialPattern(String serialNumber) {
String trim = serialNumber.trim();
if (StringUtils.isEmpty(trim)) {
return new ArrayList<>();
}
ViewQuery viewQuery = createQuery("find_by_serial_pattern").startKey(trim).endKey(trim + "\u9999");
return db.queryView(viewQuery, DeviceEntityCouch.class);
}
This works quite nicely when looking for just one pattern. But how do I have to modify my code to get a multiple-term "contains" match on doc.serialNumber?
EDIT:
This is the current workaround, but I guess there must be a better way.
It also implements only OR logic: an entry ends up in the list if it matches term1 or term2.
@View(name = "find_by_serial_pattern", map = "function(doc) { var i; if(doc.serialNumber) { for(i=0; i < doc.serialNumber.length; i+=1) { emit(doc.serialNumber.slice(i), doc);}}}")
public List<DeviceEntityCouch> findBySerialPattern(String serialNumber) {
String trim = serialNumber.trim();
if (StringUtils.isEmpty(trim)) {
return new ArrayList<>();
}
String[] split = trim.split(" ");
List<DeviceEntityCouch> list = new ArrayList<>();
for (String s : split) {
ViewQuery viewQuery = createQuery("find_by_serial_pattern").startKey(s).endKey(s + "\u9999");
list.addAll(db.queryView(viewQuery, DeviceEntityCouch.class));
}
return list;
}
It looks like you are implementing a full-text search here. That is not going to be very efficient in CouchDB (I guess the same applies to other databases).
Correct me if I am wrong, but from your code it looks like you are trying to search a list of serial numbers for a pattern. CouchDB (or any other database) is quite efficient if you can somehow index the data you will be searching.
Otherwise you must fetch every single record and perform a string comparison on it.
The only way I can think of to optimize this in CouchDB would be something like the following (with assumptions):
Your serial numbers are not very long (say 20 chars?)
You force the search term to always be 5 characters long
Generate a view that emits every 5-character substring of your serial number, more or less like this (could be optimized):
...
for (var i = 0; i <= doc.serialNo.length - 5; i++) {
emit([doc.serialNo.substring(i, i + 5), doc._id]);
}
...
Use the _count reduce function
Now the following url:
http://localhost:5984/test/_design/serial/_view/complex-key?startkey=["01234"]&endkey=["01234",{}]&group=true
Will return a list of documents with a hit count for a key of 01234.
If you don't group and set the reduce option to be false, you will get a list of all matches, including duplicates if a single doc has multiple hits.
Refer to http://ryankirkman.com/2011/03/30/advanced-filtering-with-couchdb-views.html for the information about complex keys lookups.
I am not sure how efficient CouchDB is at updating that view. It depends on how many records you have and how many new entries appear between queries of the view (as I understand it, CouchDB updates the view's B-tree on demand, when the view is queried).
I have generated a view like that which splits doc ids into 5-character keys. From just over 1K docs it generated over 30K results, the ids being 32 characters long. Simple maths really: (serialNo.length - searchablekey.length + 1) * docscount.
Generating the view took a while, but the lookups were fast.
You could generate keys of multiple lengths, etc. It all comes down to your record count versus the speed of lookups.
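The key-count arithmetic can be checked outside CouchDB. Below is a small sketch that lists the fixed-length substrings a view of this kind would emit for one value; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the fixed-length substring keys for one serial number.
class SubstringKeys {
    static List<String> keys(String serial, int keyLength) {
        List<String> result = new ArrayList<>();
        // The last valid start index is serial.length() - keyLength, which
        // yields (length - keyLength + 1) keys; shorter inputs yield none.
        for (int i = 0; i + keyLength <= serial.length(); i++) {
            result.add(serial.substring(i, i + keyLength));
        }
        return result;
    }
}
```

For a 32-character id and 5-character keys this gives 32 - 5 + 1 = 28 keys per document, which matches the roughly 30K rows reported for a little over 1K documents.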

How to generate a random string with no duplicates in Java

I read some answers; they usually use a set or some other data structure to ensure there are no duplicates. In my situation, however, I have already stored a lot of random strings in a database, and I have to make sure that a newly generated random string does not already exist in the database.
I don't think retrieving all the random strings from the database into a set and then generating the random string is a good idea...
I found that System.currentTimeMillis() will generate a "random" number, but how to translate that number into a random string is a question... I need a string of length 8.
Any suggestions will be appreciated.
You can use the Apache Commons Lang library for this: RandomStringUtils
RandomStringUtils.randomAlphanumeric(8).toUpperCase() // for alphanumeric
RandomStringUtils.randomAlphabetic(8).toUpperCase() // for pure alphabets
randomAlphabetic(int count)
Creates a random string whose length is the number of characters specified, chosen from the set of alphabetic characters.
randomAlphanumeric(int count)
Creates a random string whose length is the number of characters specified, chosen from the set of alphanumeric characters.
So there are two issues here - creating the random string, and making sure there's no duplicate already in the db.
If you are not bound to 8 characters, you can use a UUID as the commenter above suggested. The UUID class returns a string that is statistically highly unlikely to duplicate a previously generated UUID, so you can use it for this precise purpose without checking whether it is already in your database.
UUID.randomUUID().toString();
This produces a string that looks something like this:
5e0013fd-3ed4-41b4-b05d-0cdf4324bb19
Or, if you don't care what the unique id is as long as it is unique, you could use an identity or auto-increment field, which pretty much all DBs support. If you do that, though, you have to read the record back after you commit it to get the identity assigned by the db.
If you have to have an 8-character string as your unique id and you don't want to import the Apache library, you can generate a random 8-character string like this:
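// requires: import java.util.Random;
final String alpha = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
final Random rand = new Random();

public String myUID() {
    // Append 8 letters, each drawn uniformly from the alphabet.
    StringBuilder uid = new StringBuilder(8);
    for (int i = 0; i < 8; i++) {
        uid.append(alpha.charAt(rand.nextInt(alpha.length())));
    }
    return uid.toString();
}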
To make sure it is not a duplicate, you should add a unique index to the column in the db that contains it.
You can either query the db first to make sure that no row has that id before you insert the row, or catch the exception and retry if you've generated a duplicate.
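The generate-and-retry idea can be sketched as follows. An in-memory Set stands in for the database column with the unique index, so the class, its methods and the attempt limit are all hypothetical; with a real database the failed add would instead surface as a unique-constraint exception.

```java
import java.security.SecureRandom;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: retry until a generated id is not already stored.
class UniqueIdGenerator {
    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private final SecureRandom random = new SecureRandom();
    // Stand-in for the unique-indexed column in the database.
    private final Set<String> stored = new HashSet<>();

    String randomId(int length) {
        StringBuilder id = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            id.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }
        return id.toString();
    }

    /** Stores and returns a fresh id, giving up after maxAttempts duplicates. */
    String createAndStore(int length, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String id = randomId(length);
            // Set.add returns false for a duplicate, much as an insert
            // against a unique index would be rejected.
            if (stored.add(id)) {
                return id;
            }
        }
        throw new IllegalStateException("No free id after " + maxAttempts + " attempts");
    }
}
```

Letting the insert itself decide (rather than querying first) avoids the race in which two writers both see the id as free and then both insert it.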
The method currentTimeMillis() returns the current time in milliseconds as a long. Convert that long to a string, and s.substring(5, s.length()) gives you the last 8 digits of the (currently 13-digit) millisecond value. These change from one millisecond to the next, but two calls within the same millisecond produce the same string, and the digits repeat roughly every 27.8 hours (10^8 ms).
public static void main(String[] args) {
String s = String.valueOf(System.currentTimeMillis());
System.out.println(s.substring(5, s.length()));
}
You therefore have to check each time whether this string already exists in your database.

Spring data / Neo4j path length with large data sets

I have been running the following query to find relatives within a certain "distance" of a given person:
@Query("start person=node({0}), relatives=node:__types__(className='Person') match p=person-[:PARTNER|CHILD*]-relatives where LENGTH(p) <= 2*{1} return distinct relatives")
Set<Person> getRelatives(Person person, int distance);
The 2*{1} comes from one conceptual "hop" between people being represented as two nodes - one Person and one Partnership.
This has been fine so far on test populations. Now I'm moving on to actual data, with populations of 1-10 million, and the query takes forever (also when run from the data browser in the web interface).
Assuming the cost came from loading everything into relatives, I rewrote the query as a test in the data browser:
start person=node(385716) match p=person-[:PARTNER|CHILD*1..10]-relatives where relatives.__type__! = 'Person' return distinct relatives
And that works fine, in fractions of a second on the same data store. But when I want to put it back into Java:
@Query("start person=node({0}) match p=person-[:PARTNER|CHILD*1..{1}]-relatives where relatives.__type__! = 'Person' return relatives")
Set<Person> getRelatives(Person person, int distance);
That won't work:
[...]
Nested exception is Properties on pattern elements are not allowed in MATCH.
"start person=node({0}) match p=person-[:PARTNER|CHILD*1..{1}]-relatives where relatives.__type__! = 'Neo4jPerson' return relatives"
^
Is there a better way of putting a path length restriction in there? I would prefer not to use a where as that would involve loading ALL the paths, potentially loading millions of nodes where I need only go to a depth of 10. This would presumably leave me no better off.
Any ideas would be greatly appreciated!
Michael to the rescue!
My solution:
public Set<Person> getRelatives(final Person person, final int distance) {
final String query = "start person=node(" + person.getId() + ") "
+ "match p=person-[:PARTNER|CHILD*1.." + 2 * distance + "]-relatives "
+ "where relatives.__type__! = '" + Person.class.getSimpleName() + "' "
+ "return distinct relatives";
return this.query(query);
// Where I would previously instead have called
// return personRepository.getRelatives(person, distance);
}
public Set<Person> query(final String q) {
final EndResult<Person> result = this.template.query(q, MapUtil.map()).to(Neo4jPerson.class);
final Set<Person> people = new HashSet<Person>();
for (final Person p : result) {
people.add(p);
}
return people;
}
Which runs very quickly!
You're almost there :)
Your first query is a full graph scan, which effectively loads the whole database into memory and pulls all nodes through this pattern match multiple times.
So it won't be fast, and it would also return huge datasets; I don't know if that's what you want.
The second query looks good. The only catch is that you cannot parameterize the min and max values of variable-length relationships, because of the effect that would have on query optimization and caching.
So for now you would have to go with template.query() or separate query methods in your repository for different max values.
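Because the bound has to be written into the query text, it may be worth validating whatever gets concatenated. The sketch below builds the same legacy Cypher string with only the depth and type name inlined; the class name, the {personId} named parameter and the validation rules are assumptions, not part of the accepted solution.

```java
// Hypothetical sketch: build the query text with an inlined, validated depth.
class RelativesQuery {
    static String build(int distance, String typeName) {
        if (distance < 1) {
            throw new IllegalArgumentException("distance must be at least 1");
        }
        // Only plain identifiers are inlined, to keep quotes out of the query.
        if (!typeName.matches("[A-Za-z0-9_]+")) {
            throw new IllegalArgumentException("Unexpected type name: " + typeName);
        }
        // One conceptual hop between people spans two relationships.
        int maxHops = 2 * distance;
        return "start person=node({personId}) "
                + "match p=person-[:PARTNER|CHILD*1.." + maxHops + "]-relatives "
                + "where relatives.__type__! = '" + typeName + "' "
                + "return distinct relatives";
    }
}
```

The resulting string would then be passed to template.query() together with a parameter map that supplies personId.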

I can't understand this code for a pseudorandom number generator used for hashing

First of all, I have just begun learning Java and I can say it is more challenging than C or Python. I'm not very keen on programming either, so I have a hard time understanding how some code works. This one in particular:
public class Pseudo
{
final int a = 2;
final int c = 3;
int address;
String list[][] = new String [100][6];
public void AddRecord(String ID, String Name, String Course, String Address, String Email, String Contact)
{
address = (a * Integer.parseInt(ID) + c) % list.length;
if((Integer.parseInt(ID)<100000||Integer.parseInt(ID)>999999)||ID.length()==0 || Name.length()==0 || Course.length()==0 || Address.length()==0)
{
showMessageDialog(null,"The ID number should be in six digit and the particular field should not be empty","",ERROR_MESSAGE);
}
else{
if(list[address][0]!=null){
showMessageDialog(null,"Collison is occur, the same address is get. Recalculating...............","",WARNING_MESSAGE);
while(list[address][0]!=null)
{
address = (a * address + c) % list.length;
}
}
list[address][0] = ID;
list[address][1] = Name;
list[address][2] = Course;
list[address][3] = Address;
list[address][4] = Email;
list[address][5] = Contact;
showMessageDialog(null,"Student Information " + ID + " will be saved in address: " + address,"",INFORMATION_MESSAGE);
}
}
The confusion starts here:
address = (a * Integer.parseInt(ID) + c) % list.length;
if((Integer.parseInt(ID)<100000||Integer.parseInt(ID)>999999)||ID.length()==0 || Name.length()==0 || Course.length()==0 || Address.length()==0)
What does it mean? From what I understand from this code, inside an IF statement you can have more than 1 condition. I'm not very sure, since this is my first time seeing such code.
The second is this
if(list[address][0]!=null){
showMessageDialog(null,"Collison is occur, the same address is get. Recalculating...............","",WARNING_MESSAGE);
while(list[address][0]!=null)
{
address = (a * address + c) % list.length;
}
}
list[address][0] = ID;
list[address][1] = Name;
list[address][2] = Course;
list[address][3] = Address;
list[address][4] = Email;
list[address][5] = Contact;
showMessageDialog(null,"Student Information " + ID + " will be saved in address: " + address,"",INFORMATION_MESSAGE);
If a collision occurs, the address at which the record is stored should be altered using the pseudorandom number generator again, but what I can't grasp is
list[address][0]!=null. I am just baffled by this line. I know its job is to change the address if a collision happens, but I don't know the exact mechanics of how this part is executed.
"From what I understand from this code, inside an IF statement you can have more than 1 condition."
Well, yes and no. You can construct complex conditions based on many smaller conditions, but ultimately the whole thing has to resolve to a single boolean true/false result.
Consider the condition in this case:
(Integer.parseInt(ID)<100000||Integer.parseInt(ID)>999999)||ID.length()==0 || Name.length()==0 || Course.length()==0 || Address.length()==0
Let's break that down into its components:
(
Integer.parseInt(ID)<100000 ||
Integer.parseInt(ID)>999999
) ||
ID.length()==0 ||
Name.length()==0 ||
Course.length()==0 ||
Address.length()==0
It's really just chaining together a bunch of comparisons into one big true/false statement. You can essentially read something like this as:
If (something) or (something else) or (another thing) then...
And each something can itself contain small somethings, etc. You can build as complex a logical condition as you want, grouping sub-conditions with parentheses, as long as the whole thing resolves to a single true/false result.
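As an illustration of that grouping, the same check can be written with each sub-condition given a name. This is only a sketch: the class and method names are hypothetical, and it takes the ID as an already-parsed int.

```java
// Hypothetical sketch: the question's condition split into named parts.
class Validation {
    static boolean isInvalid(int id, String name, String course, String address) {
        // First group: the ID must have exactly six digits.
        boolean idOutOfRange = id < 100000 || id > 999999;
        // Second group: none of the required fields may be empty.
        boolean anyFieldEmpty = name.isEmpty() || course.isEmpty() || address.isEmpty();
        // The whole condition is still a single true/false value.
        return idOutOfRange || anyFieldEmpty;
    }
}
```

The if statement then only sees one boolean, however many comparisons went into it.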
"what I can't grasp is list[address][0]!=null"
That is just checking whether a particular value is null. That value is part of a nested array. You have a variable called list, which is an array, and each element of that array is itself an array of strings. So you end up with a kind of 2-dimensional array (Java allows such arrays to be jagged, with sub-arrays of different lengths, although here every sub-array has length 6).
That specific piece of code looks into the list array, at the address index, and looks at the 0 index of that sub-array, and checks if that value is null.
First of all, understanding any code is much easier if it's properly formatted. All good IDEs have such a function, e.g. for Eclipse the shortcut is Ctrl+Shift+F, for IntelliJ IDEA Ctrl+Alt+L.
The most important part, which might resolve your first confusion: || is the logical OR in Java, meaning the ID must be a number between 100000 and 999999 and the attributes must not be empty. Or literally, if the ID is smaller than 100000 or larger than 999999 or any of the values are empty, there will be an error message and nothing will be done.
For the second part: null means that a variable is not set, so to prevent overwriting an entry you can check if it's already set, i.e. not equal to null. So the code changes the address variable until an address is found for which no data is set yet and then uses it to store the given data.
There are several potential problems in this code, among which:
several calls to the relatively slow Integer.parseInt(String) where it could be called once and stored into a variable
potential NumberFormatException if ID isn't a number (or is empty, or has some excess white spaces)
potential infinite loop if the array is full
But as it looks like some CS homework it shouldn't matter.
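For reference, here is a sketch of how the insertion could look with those three problems addressed: the ID is parsed once, bad input is rejected instead of throwing, and probing is bounded. The class name, method signature and the -1 return value are assumptions. Note that with a = 2, c = 3 and 100 slots, (2 * 97 + 3) % 100 is 97 again, so the original while loop can spin forever on slot 97 even when the table is far from full.

```java
// Hypothetical sketch of the same probing scheme with bounded retries.
class ProbingTable {
    private static final int A = 2;
    private static final int C = 3;
    private final String[][] slots = new String[100][];

    /** Returns the slot index used, or -1 if the ID is invalid or no free slot was found. */
    int add(String id, String... otherFields) {
        if (id == null) return -1;
        final int key;
        try {
            key = Integer.parseInt(id.trim()); // parsed once, whitespace tolerated
        } catch (NumberFormatException e) {
            return -1;
        }
        if (key < 100000 || key > 999999) return -1;

        int address = (A * key + C) % slots.length;
        // Give up after as many probes as there are slots instead of looping forever.
        for (int attempts = 0; attempts < slots.length; attempts++) {
            if (slots[address] == null) {
                String[] record = new String[otherFields.length + 1];
                record[0] = id.trim();
                System.arraycopy(otherFields, 0, record, 1, otherFields.length);
                slots[address] = record;
                return address;
            }
            // Slot taken: derive the next candidate from the current address.
            address = (A * address + C) % slots.length;
        }
        return -1;
    }
}
```

A linear probe such as (address + 1) % slots.length would visit every slot, which this multiplicative step does not guarantee.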
Thank you so much, Mr David. I understand the first part: if you have a condition you can stack others onto it, and from what I can understand it only works with the || (OR) operator, since using this will guarantee either a true or a false outcome.
while(list[address][0]!=null)
But I'm still a little confused about part 2 of my problem. That line checks whether the array entry is null, meaning it has no value, right? This is my understanding of the situation: that part of the code is supposed to resolve any collision if the user enters the same ID number, so shouldn't it be checking the value that is causing the collision? Instead, what the line seems to do is run the corresponding procedure for as long as a null value is detected.
