Assign same numbers in different formats to contact

How do I assign same numbers in different formats to one contact?
In the stock Samsung phone app, +1 8542569, 8542569 and 18542569 are all assigned to one contact, "Example USA", when called,
even though "Example USA" only has +1 854-256-9 listed in the phone book.
This goes on for every country, not just the USA. Here's a New Zealand example:
"New Zealand Example" has 91234567 listed.
I can call 91234567, 00 64 9-123 4567 or 6491234567 and they will all get assigned to the "New Zealand Example" contact.
My question is: how can I do the same thing in Java, for every country just like the Samsung stock app?
Say I have 3 strings: 91234567, 00 64 9-123 4567 and 6491234567.
How will my app recognize that they belong to the same contact and that they're basically the same number?
I'm sure it can be done because Samsung did it :)
Again, I'd like the code to work for every country.

Part 1: Remove leading zeros. They can carry meaning (for example the 00 in 00 64 9-123 4567), but the comparison is easier without them.
This Quora link will be a bit useful.
Now, countries have different numbers of digits in their phone numbers. India, for example, has 10, which doesn't match New Zealand or the USA.
Assume that each country has a specific number of digits in its numbers once region-specific codes are removed.
Part 2: Create a lookup table that maps each country code to the number of digits that follow it. Match the leading country code first, then verify the number of digits.
This method may produce conflicts between countries, but I lack knowledge about this.
Example of a conflict:
Country A: country_code = +1, digits = 6
Country B: country_code = +12, digits =5
+1 234567
+12 34567
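One rough way to treat the three strings from the question as the same number, without any per-country table, is to compare only the trailing digits. The sketch below is a heuristic, not what the Samsung app is known to do: the class name `LooseNumberMatch` and the 7-digit threshold are assumptions chosen for illustration, and two different numbers that happen to share their last seven digits would be reported as equal.

```java
class LooseNumberMatch {
    // Assumption: the last 7 digits are enough to identify a subscriber.
    static final int MIN_MATCH = 7;

    /** Drops everything except the digits 0-9 (spaces, '+', '-', brackets). */
    static String digitsOnly(String raw) {
        return raw.replaceAll("[^0-9]", "");
    }

    /** Returns true when both inputs end in the same MIN_MATCH digits. */
    static boolean sameNumber(String a, String b) {
        String da = digitsOnly(a);
        String db = digitsOnly(b);
        if (da.length() < MIN_MATCH || db.length() < MIN_MATCH) {
            // Too short for suffix matching: fall back to exact equality.
            return !da.isEmpty() && da.equals(db);
        }
        return da.substring(da.length() - MIN_MATCH)
                 .equals(db.substring(db.length() - MIN_MATCH));
    }

    public static void main(String[] args) {
        System.out.println(sameNumber("91234567", "00 64 9-123 4567"));
        System.out.println(sameNumber("6491234567", "91234567"));
        System.out.println(sameNumber("+1 8542569", "18542569"));
    }
}
```

Because the prefixes are ignored entirely, this avoids the country-code conflict shown above, at the cost of occasional false matches.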

Related

Display frequency of n-letter digits

Hi I'm trying to display the number of all three and five letter words from a text file called Article.txt but the output I get is 4 for both. I am a beginner and I will appreciate any kind of help. Thank you!
import java.util.*;
import java.io.*;
class test
{
    public static void main(String[] args) throws Exception
    {
        FileReader fr = new FileReader("E:\\test\\Article.txt");
        Scanner s = new Scanner(fr);
        String str = s.nextLine();
        String[] words = str.split(" ");
        int countThree = 0, countFive = 0;
        for (String word : words)
        {
            if (word.length() == 3)
            {
                countThree++;
            }
            else if (word.length() == 5)
            {
                countFive++;
            }
        }
        System.out.println("Number of three letter words: " + countThree);
        System.out.println("Number of five letter words: " + countFive);
    }
}
Here is the article:
There was a time when Pete Sampras tally of 14 Grand Slam singles titles the last of which came at the US Open in 2002 seemed like the acme of sporting
achievement in men tennis. Little did anybody expect that in the next 16 years across 64 Majors not one or two but three players would stand shoulder to
shoulder with the American great. On Sunday Novak Djokovic became that third man defeating Argentine Juan Martin del Potro for his third US Open title at
Flushing Meadows. The 31 year old Serb has never been considered a once in a generation talent as have Roger Federer and Rafael Nadal the ones above him in
the trophy count. But nobody represents the modern day game as well as Djokovic. He is the ultimate practitioner of the attrition-based baseline tennis and at
his best with his supremely efficient patrolling of the court is near invincible. Over two weeks in New York he hit this high many times over. In fact
the 95-minute second set in the final was a microcosm of Djokovic last two years. It was long and weary as fortunes swung back and forth. But adversity
energised him and he found a level which his opponent could not match. Coming after his triumphant return at Wimbledon in July the latest success is evidence
enough that technically, tactically and physically Djokovic is back to his best.
If it was about the restoration of the old order on the mens side it was the continuation of the new in the women section. There has been a first time
winner in four of the past six Grand Slam tournaments and 20 year old Naomi Osaka added to the eclectic mix by becoming the first Japanese to win a Major.
In Serena Williams the winner of 23 singles Slams, the most by any player in the Open Era Osaka faced the ultimate challenge. It was also an inter generational
battle like none other. The 16 year age gap between Williams and Osaka was the second biggest in the Open Era for a womens final next only to Monica Seles
versus Martina Navratilova at the 1991 US Open. To her immense credit Osaka was not awed by the stage. While growing up she had revered Williams. After all
this is someone who chose Williams as her subject for a school essay in third grade. On Saturday she played like she knew the 36 year olds game like the back of
her hand absorbing everything the American threw at her and redirecting them with much more panache. The magnitude of her achievement was nearly drowned out by
the chaos in the aftermath of Williams tirade against the chair umpire. Yet the manner in which Osaka at an impressionable young age closed out the match with
a cold relentlessness showed she is here to stay.

I assume you would like to process your file by line.
At present you are only evaluating the first line by executing
String str = s.nextLine();
You are counting the words for this line only.
You also have to count them for all the other lines.
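A minimal sketch of that fix, assuming the goal is simply to loop over every line: the class name `WordLengthCounter` and the method `countWordsOfLength` are made up for this example, and it reads from any `Reader` so it can be tried without the original file.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Scanner;

class WordLengthCounter {
    /** Counts the words of the given length across every line of the input. */
    static int countWordsOfLength(Reader source, int length) {
        int count = 0;
        try (Scanner s = new Scanner(source)) {
            // Keep reading until the input is exhausted, not just the first line.
            while (s.hasNextLine()) {
                for (String word : s.nextLine().split("\\s+")) {
                    if (word.length() == length) {
                        count++;
                    }
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "the quick brown fox\njumps over the lazy dog";
        System.out.println("Number of three letter words: "
                + countWordsOfLength(new StringReader(text), 3));
        System.out.println("Number of five letter words: "
                + countWordsOfLength(new StringReader(text), 5));
    }
}
```

To run it against the file from the question, pass `new FileReader("E:\\test\\Article.txt")` instead of the `StringReader`.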

Not able to find pattern in data sample

I'm seeking help with the logic, not the technology, to solve this problem. I was writing a program in Java that uses categorized data (temperature and blood pressure mapped to a state of Infected/NotInfected/Unknown) to classify a given set of travelers as “Infected”, “NotInfected” or “Unknown” accordingly.
Input:
The input consists of a string containing two parts separated by ‘#’. The first part contains
categorized data for any number of individuals, separated by commas. The data for each individual
contains three space-separated values as follows:
Temperature bloodpressure category
The second part contains the space-separated temperature and blood pressure of multiple travelers,
separated by commas.
Output:
Categorization of travelers separated by comma.
Sample Input & Output
90 120 Infected,90 150 NotInfected,100 140 Infected,80 130 NotInfected#95 125,95 145,75 160 | Output: Infected,Unknown,NotInfected
80 120 Infected,70 145 Infected,90 100 Infected,80 150 NotInfected,80 80 NotInfected,100 120 NotInfected#120 148,75 148,60 90 | Output: NotInfected,Unknown,Unknown
I went on to solve this by splitting the provided string into two substrings, one containing the categorized data and the other containing the input data set.
public static void main(String[] args) {
    String s = "90 120 Infected, 90 150 NotInfected, 100 140 NotInfected, 80 130 NotInfected#95 125, 95 145, 75 160";
    // Split once: part 0 is the categorized data, part 1 the travelers to classify
    String[] parts = s.split("#");
    String categories = parts[0];
    String inputs = parts[1];
    System.out.println(categories + "\n" + inputs);
    for (String input : inputs.split(",")) {
        // iterate through categories and match against input
    }
}
But I realized that I was not able to find any pattern that could help me get the desired output as mentioned in the "Sample above". Which type of temperature-BP leads to Infected category?

So, your problem is to learn a classifier from your sample (training data) and then be able to classify new cases (described by explanatory variables, temperature and blood pressure) into one of three classes.
There are numerous ways to learn classifiers, but first you should find out whether your explanatory variables actually explain the classes (i.e. whether there is a pattern). For this purpose, I would suggest a simple check: plot your training data in two dimensions (the explanatory variables) and give each of the three classes a different symbol (e.g. the letters N, I, U). You will see whether the classes are randomly mixed or whether the same symbols tend to cluster together. Are you able to draw lines that separate the different classes sufficiently well? You don't need to be able to separate the classes perfectly (some classification errors just belong to life), but you should be able to see some tendency.
If there is a clear class division, then you should just select a classifier to learn. Learning algorithms are widely available, so you don't need to code one yourself. You could try e.g. classification trees (the classical C4.5 learning algorithm). Or, if your training set is sufficiently large, you could use a K-nearest neighbour classifier, which doesn't require any learning phase: you simply classify a new case according to its K nearest neighbours in the training data (calculate the Euclidean distances between points in the temperature and blood pressure space, select the K points with the shortest distances from your new query point, and pick the most common class among those neighbours).
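The K-nearest neighbour idea can be sketched in a few lines. This is only an illustration under stated assumptions: the names `KnnClassifier`, `Sample` and `classify` are invented here, ties are broken in favour of the class seen first among the nearest neighbours, and nothing in it produces the "Unknown" class, so it is not claimed to reproduce the expected outputs from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class KnnClassifier {
    /** One labelled training point. */
    static final class Sample {
        final double temperature;
        final double bloodPressure;
        final String category;

        Sample(double temperature, double bloodPressure, String category) {
            this.temperature = temperature;
            this.bloodPressure = bloodPressure;
            this.category = category;
        }
    }

    /** Squared Euclidean distance; it orders points the same way as the true distance. */
    static double squaredDistance(Sample s, double temperature, double bloodPressure) {
        double dt = s.temperature - temperature;
        double db = s.bloodPressure - bloodPressure;
        return dt * dt + db * db;
    }

    /** Returns the most common category among the k nearest samples, or null if there are none. */
    static String classify(List<Sample> training, double temperature, double bloodPressure, int k) {
        List<Sample> sorted = new ArrayList<>(training);
        sorted.sort(Comparator.comparingDouble(
                (Sample s) -> squaredDistance(s, temperature, bloodPressure)));
        // Count votes in nearest-first order so that ties favour the closer class.
        Map<String, Integer> votes = new LinkedHashMap<>();
        for (Sample s : sorted.subList(0, Math.min(k, sorted.size()))) {
            votes.merge(s.category, 1, Integer::sum);
        }
        String best = null;
        int bestVotes = 0;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestVotes) {
                best = e.getKey();
                bestVotes = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Training data taken from the first sample input in the question.
        List<Sample> training = Arrays.asList(
                new Sample(90, 120, "Infected"),
                new Sample(90, 150, "NotInfected"),
                new Sample(100, 140, "Infected"),
                new Sample(80, 130, "NotInfected"));
        System.out.println(classify(training, 95, 125, 1));
        System.out.println(classify(training, 75, 160, 3));
    }
}
```

With only four training points the result depends heavily on K; with a realistic data set you would normally rescale the two variables so that neither dominates the distance.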

How to calculate similarity between Chamber of Commerce numbers?

I am working on an engine that does OCR post-processing, and currently I have a set of organizations in the database, including Chamber of Commerce Numbers.
Also from the OCR output I have a list of possible Chamber of Commerce (COC) numbers.
What would be the best way to search for the most similar one? Currently I am using Levenshtein distance, but the range of results is simply too big, and on big databases I really doubt its feasibility. It's currently implemented in Java, and the database is a MySQL database.
Side note: A Chamber of Commerce number in the Netherlands is defined to be an 8-digit number for every company. An earlier version of this system used another 4 digits (0000, 0001, etc.) to indicate an establishment of an organization; nowadays entirely new COC numbers are given out for those.
Example of COCNumbers:
30209227
02045251
04087614
01155720
20081288
020179310000
09053023
09103292
30039925
13041611
01133910
09063023
34182B01
27124701
List of possible COCNumbers determined by post-processing:
102537177
000450093333
465111338098
NL90223l30416l
NLﬂ0737D447B01
12juni2013
IBANNL32ABNA0242244777
lncassantNL90223l30416l10000
KvK13041611
BtwNLﬂ0737D447B01
A few extra notes:
The post-processing picks up words and word groups from the invoice, and those word groups are concatenated into one string. (A word group is, as it says, a group of words, usually delimited by a space between them.)
The condition the post-processing uses to treat a string as a possible COC number is the following: its length should be 8 or more, half of its characters should be digits, and it should be alphanumeric.
The number of possible COCNumbers determined by post-processing is relatively small.
The database itself can grow very big, up to tens of thousands of records.
How would I proceed to find the best match in general? (In this case (13041611, KvK13041611) is the best (and moreover correct) match)

Doing this matching exclusively in MySQL is probably a bad idea for a simple reason: MySQL versions before 8.0 have no native way to modify a string with a regular expression (REGEXP_REPLACE only arrived in 8.0).
You're going to need to use some sort of scoring algorithm to get this right, in my experience (which comes from ISBNs and other book-identifying data).
This is procedural -- you probably need to do it in Java (or some other procedural programming language).
1. Is the candidate string found in the table exactly? If yes, score 1.0.
2. Is the candidate string "kvk" (case-insensitive) prepended to a number that's found in the table exactly? If so, score 1.0.
3. Is the candidate string the correct length, and does it match after changing lower case L into 1 and upper case O into 0? If so, score 0.9.
4. Is the candidate string the correct length after trimming all alphabetic characters from either the beginning or the end, and does it match? If so, score 0.8.
5. Do both steps 3 and 4, and if you get a match score 0.7.
6. Trim alpha characters from both the beginning and the end, and if you get a match score 0.6.
7. Do steps 3 and 6, and if you get a match score 0.55.
The highest scoring match wins.
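The scoring steps above can be sketched roughly as follows. The class and method names (`CocMatcher`, `score`, `bestCandidate`) are invented for this example, the set of known numbers stands in for the database table, "correct length" is assumed to mean 8 characters, and in the combined steps the OCR substitution is applied before trimming (the answer does not say which order to use).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CocMatcher {
    // Assumption: only current 8-character numbers count as "the correct length".
    static final int COC_LENGTH = 8;

    /** Step 3: undo two common OCR confusions (l -> 1, O -> 0). */
    static String fixOcr(String s) {
        return s.replace('l', '1').replace('O', '0');
    }

    static String trimAlphaStart(String s) {
        return s.replaceFirst("^[A-Za-z]+", "");
    }

    static String trimAlphaEnd(String s) {
        return s.replaceFirst("[A-Za-z]+$", "");
    }

    static boolean matches(String s, Set<String> known) {
        return s.length() == COC_LENGTH && known.contains(s);
    }

    /** Returns the score of the first step that matches, or 0.0 if none does. */
    static double score(String candidate, Set<String> known) {
        if (known.contains(candidate)) return 1.0;                                // step 1
        if (candidate.regionMatches(true, 0, "kvk", 0, 3)
                && known.contains(candidate.substring(3))) return 1.0;            // step 2
        String fixed = fixOcr(candidate);
        if (matches(fixed, known)) return 0.9;                                    // step 3
        if (matches(trimAlphaStart(candidate), known)
                || matches(trimAlphaEnd(candidate), known)) return 0.8;           // step 4
        if (matches(trimAlphaStart(fixed), known)
                || matches(trimAlphaEnd(fixed), known)) return 0.7;               // step 5
        if (matches(trimAlphaEnd(trimAlphaStart(candidate)), known)) return 0.6;  // step 6
        if (matches(trimAlphaEnd(trimAlphaStart(fixed)), known)) return 0.55;     // step 7
        return 0.0;
    }

    /** The highest scoring candidate wins; null when nothing scores above zero. */
    static String bestCandidate(List<String> candidates, Set<String> known) {
        String best = null;
        double bestScore = 0.0;
        for (String c : candidates) {
            double s = score(c, known);
            if (s > bestScore) {
                best = c;
                bestScore = s;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Set<String> known = new HashSet<>(Arrays.asList("30209227", "09053023", "13041611"));
        List<String> candidates = Arrays.asList("102537177", "12juni2013", "KvK13041611");
        System.out.println(bestCandidate(candidates, known));
    }
}
```

In a real system the `known.contains` calls would become indexed lookups against the database, which stays cheap even with tens of thousands of records because each candidate triggers only a handful of exact-match queries.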
Take a visual look at the ones that don't match after this set of steps and see if you can discern another pattern of OCR junk or concatenated junk. Perhaps your OCR is seeing "g" where the input is "8", or other possible issues.
You may be able to try using Levenshtein's distance to process these remaining items if you match substrings of equal length. They may also be few enough in number that you can correct your data manually and proceed.
Another possibility: you may be able to use Amazon Mechanical Turk to purchase crowdsourced labor to resolve some difficult cases.

Find country code of a mobile number in Java

Is there any way to find the country code of a phone number in Java?
Say I give 9710334544; I should receive the country code 91 (if it's India).
Any suggestions please.

If the phone number does not include the country code number as prefix, there is no way to find out from which region this phone number originates.

The idea behind the country code is to distinguish the country first, and then parse the number. The reason for this is to avoid clashes between identical national numbers.
If my U.S. number is 1234567890, what distinguishes it from my U.K. number, which is also 1234567890? The answer is the country prefix. Unfortunately, because the prefix exists precisely to distinguish numbers that are otherwise identical, you can't use the number itself to figure it out.
Now, if you already have the full number including the country code, you can find the country code by simply dividing (integer division):
// Example value: country code 123 followed by the 10-digit number 1234567890
long inputPhoneNumber = 1231234567890L;
long countryCode = inputPhoneNumber / 10000000000L;
// Will give 123 (this only works when the national number has exactly 10 digits)
After you have the answer you can match it up with the country; several sites on the internet list the codes with their countries:
Country Code
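When the number is written in international format, an alternative to dividing is a prefix lookup. Country calling codes are one to three digits long and, to the best of my knowledge, no assigned code is the prefix of another, so the first prefix found in the table is the answer. The sketch below is illustrative only: `CountryCodeLookup` and its five-entry table are assumptions, and a real table would list every assigned code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

class CountryCodeLookup {
    // Tiny illustrative table; a real one would contain every assigned code.
    static final Map<String, String> REGIONS = new HashMap<>();
    static {
        REGIONS.put("1", "USA/Canada (NANP)");
        REGIONS.put("41", "Switzerland");
        REGIONS.put("44", "United Kingdom");
        REGIONS.put("64", "New Zealand");
        REGIONS.put("91", "India");
    }

    /** Returns the country code of a number written with a leading '+', if it is in the table. */
    static OptionalInt countryCode(String international) {
        if (!international.trim().startsWith("+")) {
            // Without the prefix there is nothing to look up.
            return OptionalInt.empty();
        }
        String digits = international.replaceAll("[^0-9]", "");
        for (int len = 1; len <= 3 && len <= digits.length(); len++) {
            String prefix = digits.substring(0, len);
            if (REGIONS.containsKey(prefix)) {
                return OptionalInt.of(Integer.parseInt(prefix));
            }
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(countryCode("+91 9710334544"));
        System.out.println(countryCode("9710334544"));
    }
}
```

The second call returns an empty result, which is the point made in the answers above: a number without its prefix carries no country information.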

What is the best way for converting phone numbers into international format (E.164) using Java?

Given a 'phone number' and a country id (let's say an ISO country code), I would like to convert it into a standard E.164 international format phone number.
I am sure I can do it by hand quite easily - but I would not be sure it would work correctly in all situations.
Which Java framework/library/utility would you recommend to accomplish this?
P.S. The 'phone number' could be anything identifiable by the general public - such as
* (510) 786-0404
* 1-800-GOT-MILK
* +44-(0)800-7310658
That last one is my favourite: it is how some people write their number in the UK, and it means that you should use either the +44 or the 0.
The E.164 format number should be all numeric, and use the full international country code (e.g. +44).

Google provides a library for working with phone numbers, the same one they use for Android:
http://code.google.com/p/libphonenumber/
import com.google.i18n.phonenumbers.NumberParseException;
import com.google.i18n.phonenumbers.PhoneNumberUtil;
import com.google.i18n.phonenumbers.PhoneNumberUtil.PhoneNumberFormat;
import com.google.i18n.phonenumbers.Phonenumber.PhoneNumber;
String swissNumberStr = "044 668 18 00";
PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
try {
    // Parse relative to Switzerland ("CH"); format inside the try so the result is in scope
    PhoneNumber swissNumberProto = phoneUtil.parse(swissNumberStr, "CH");
    // Produces "+41 44 668 18 00"
    System.out.println(phoneUtil.format(swissNumberProto, PhoneNumberFormat.INTERNATIONAL));
    // Produces "044 668 18 00"
    System.out.println(phoneUtil.format(swissNumberProto, PhoneNumberFormat.NATIONAL));
    // Produces "+41446681800"
    System.out.println(phoneUtil.format(swissNumberProto, PhoneNumberFormat.E164));
} catch (NumberParseException e) {
    System.err.println("NumberParseException was thrown: " + e.toString());
}

Speaking from experience at writing this kind of thing, it's really difficult to do with 100% reliability. I've written some Java code to do this that is reasonably good at processing the data we have but won't be applicable in every country. Questions you need to ask are:
Are the character to number mappings consistent between countries? The US uses a lot of this (e.g. 1800-GOT-MILK) but in Australia, as one example, it's pretty rare. You'd need to ensure that you were doing the correct mapping for the country in question if it varies (it might not). I don't know what countries that use different alphabets (e.g. Cyrillic in Russia and the former Eastern Bloc countries) do;
You have to accept that your solution will not be 100% and you should not expect it to be. You need to take a "best guess" approach. For example, there's no real way of knowing that 132345 is a valid phone number in Australia, as is 1300 123 456, that these are the only two patterns for 13xx numbers, and that they're not callable from overseas;
You also have to ask if you want to validate regions (area codes). I believe the US uses a system where the second digit of the area code is a 1 or a 0. This may have once been the case but I'm not sure if it still applies. Whatever the case, many other countries will have other rules. In Australia, the valid area codes for landlines and mobile (cell) phones are two digits (the first is 0). 08, 03 and 04 are all valid. 01 isn't. How do you cater for that? Do you want to?
Countries use different conventions no matter how many digits they're writing. You have to decide if you want to accept something other than the "norm". These are all common in Australia:
(02) 1234 5678
02 1234 5678
0411 123 123 (but I've never seen 04 1112 3456)
131 123
13 1123
131 123
1 300 123 123
1300 123 123
02-1234-5678
1300-234-234
+44 78 1234 1234
+44 (0)78 1234 1234
+44-78-1234-1234
+44-(0)78-1234-1234
0011 44 78 1234 1234 (0011 is the standard international dialling code)
(44) 078 1234 1234 (not common)
And that's just off the top of my head. For one country. In France, for example, it's common to write the phone number in number pairs (12 34 56 78) and they pronounce it that way too. Instead of:
un (one), deux (two), trois (three), ...
it's
douze (twelve), trente-quatre (thirty four), ...
Do you want to cater for that level of cultural difference? I would assume not but the question is worth considering just in case you make your rules too strict.
Also some people may append extension numbers on phone numbers, possibly with "ext" or similar abbreviation. Do you want to cater for that?
Sorry, no code here. Just a list of questions to ask yourself and issues to consider. As others have said, a series of regular expressions can do much of the above but ultimately phone number fields are (mostly) free form text at the end of the day.

This was my solution:
public static String FixPhoneNumber(Context ctx, String rawNumber)
{
    String fixedNumber = "";
    // get current location iso code
    TelephonyManager telMgr = (TelephonyManager) ctx.getSystemService(Context.TELEPHONY_SERVICE);
    String curLocale = telMgr.getNetworkCountryIso().toUpperCase();
    PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
    Phonenumber.PhoneNumber phoneNumberProto;
    // gets the international dialling code for our current location
    String curDCode = String.format("%d", phoneUtil.getCountryCodeForRegion(curLocale));
    String ourDCode = "";
    if (rawNumber.indexOf("+") == 0)
    {
        // the dialling code runs from the '+' up to the first bracket, hyphen or space
        int bIndex = rawNumber.indexOf("(");
        int hIndex = rawNumber.indexOf("-");
        int eIndex = rawNumber.indexOf(" ");
        if (bIndex != -1)
        {
            ourDCode = rawNumber.substring(1, bIndex);
        }
        else if (hIndex != -1)
        {
            ourDCode = rawNumber.substring(1, hIndex);
        }
        else if (eIndex != -1)
        {
            ourDCode = rawNumber.substring(1, eIndex);
        }
        else
        {
            ourDCode = curDCode;
        }
    }
    else
    {
        ourDCode = curDCode;
    }
    try
    {
        phoneNumberProto = phoneUtil.parse(rawNumber, curLocale);
    }
    catch (NumberParseException e)
    {
        return rawNumber;
    }
    // national format for numbers in our own country, international format otherwise
    if (curDCode.compareTo(ourDCode) == 0)
        fixedNumber = phoneUtil.format(phoneNumberProto, PhoneNumberFormat.NATIONAL);
    else
        fixedNumber = phoneUtil.format(phoneNumberProto, PhoneNumberFormat.INTERNATIONAL);
    return fixedNumber.replace(" ", "");
}
I hope this helps someone with the same problem.
Enjoy and use freely.

Thanks for the answers. As stated in the original question, I am much more interested in the formatting of the number into the standard format than I am in determining if it is a valid (as in genuine) phone number.
I have some hand crafted code currently that takes a phone number String (as entered by the user) and a source country context and target country context (the country from where the number is being dialed, and the country to where the number is being dialed - this is known to the system) and then does the following conversion in steps
1. Strip all whitespace from the number
2. Translate all alpha into digits - using a lookup table of letter to digit (e.g. A-->2, B-->2, C-->2, D-->3) etc. for the keypad (I was not aware that some keypads distribute these differently)
3. Strip all punctuation - keeping a preceding '+' intact if it exists (in case the number is already in some sort of international format).
4. Determine if the number has an international dialling prefix for the country context - e.g. if source context is the UK, I would see if it starts with a '00' - and replace it with a '+'. I do not currently check whether the digits following the '00' are followed by the international dialing code for the target country. I look up the international dialing prefix for the source country in a lookup table (e.g. GB-->'00', US-->'011' etc.)
5. Determine if the number has a local dialing prefix for the country context - e.g. if the source context is the UK, I would look to see if it starts with a '0' - and replace it with a '+' followed by the international dialing code for the target country. I look up the local dialing prefix for the source country in a lookup table (e.g. GB-->'0', US-->'1' etc.), and the international dialing code for the target country in another lookup table (e.g.'GB'='44', US='1')
It seems to work for everything I have thrown at it so far - except for the +44(0)1234-567-890 situation - I will add a special case check for that one.
Writing it was not hard - and I can add special cases for each strange exception I come across. But I would really like to know if there is a standard solution.
The phone companies seem to deal with this thing every day. I never get inconsistent results when dialing numbers using the PSTN. For example, in the US (where mobile phones have the same area codes as landlines), I could dial +1-123-456-7890, or 011-1-123-456-7890 (where 011 is the international dialing prefix in the US and 1 is the international dialing code for the US), 1-123-456-7890 (where 1 is the local dialing prefix in the US) or even 456-7890 (assuming I was in the 123 area code at the time) and get the same results each time. I assume that internally these dialed numbers get converted to the same E.164 standard format, and that the conversion is all done in software.
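The five conversion steps described above can be sketched roughly as follows. This is not the poster's code: `E164Normalizer` and `toE164` are hypothetical names, the prefixes are passed in as parameters instead of being read from lookup tables, the +44(0) special case is not handled, and treating a number with no recognised prefix as a national number is an extra assumption.

```java
class E164Normalizer {
    // Keypad digits for the letters A to Z (ABC=2, DEF=3, ..., PQRS=7, TUV=8, WXYZ=9).
    private static final String KEYPAD = "22233344455566677778889999";

    /**
     * @param raw         the number as typed by the user
     * @param intlPrefix  international dialling prefix of the source country, e.g. "00" or "011"
     * @param trunkPrefix local dialling prefix of the source country, e.g. "0" or "1"
     * @param countryCode international dialling code of the target country, e.g. "44"
     */
    static String toE164(String raw, String intlPrefix, String trunkPrefix, String countryCode) {
        // Steps 1 to 3: drop whitespace and punctuation, map letters, remember a leading '+'.
        boolean hasPlus = raw.trim().startsWith("+");
        StringBuilder digits = new StringBuilder();
        for (char ch : raw.toCharArray()) {
            char c = Character.toUpperCase(ch);
            if (c >= '0' && c <= '9') {
                digits.append(c);
            } else if (c >= 'A' && c <= 'Z') {
                digits.append(KEYPAD.charAt(c - 'A'));
            }
        }
        String number = digits.toString();
        if (hasPlus) {
            return "+" + number;                      // already international
        }
        // Step 4: the international prefix must be checked before the shorter local prefix.
        if (number.startsWith(intlPrefix)) {
            return "+" + number.substring(intlPrefix.length());
        }
        // Step 5: replace the local prefix with '+' and the country code.
        if (number.startsWith(trunkPrefix)) {
            return "+" + countryCode + number.substring(trunkPrefix.length());
        }
        // Assumption (not in the steps above): no prefix at all means a national number.
        return "+" + countryCode + number;
    }

    public static void main(String[] args) {
        System.out.println(toE164("(510) 786-0404", "011", "1", "1"));
        System.out.println(toE164("1-800-GOT-MILK", "011", "1", "1"));
        System.out.println(toE164("0044 20 7946 0958", "00", "0", "44"));
    }
}
```

Checking the international prefix first matters because in the UK the local prefix '0' is itself the start of '00'.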

To be honest, it sounds like you've got most of the bases covered already.
The +44(0)800 format sometimes (incorrectly) used in the UK is annoying and isn't strictly valid according to E.123, which is the ITU-T recommendation for how numbers should be displayed. If you haven't got a copy of E.123 it's worth a look.
For what it's worth, the telephone network itself doesn't always use E.164. Often there'll be a flag in the ISDN signalling generated by the PBX (or in the network if you're on a steam phone) which tells the network whether the number being dialled is local, national or international.

In some countries you can validate 112 as a valid phone number, but if you stick a country code in front of it it won't be valid any more. In other countries you can't validate 112 but you can validate 911 as a valid phone number.
I've seen some phones that put Q on the 7 key and Z on the 9 key. I've seen some phones that put Q and Z on the 0 key, and some that put Q and Z on the 1 key.
An area code that existed yesterday might not exist today, and vice-versa.
In half of North America (country code 1), the second digit rule used to be 0 or 1 for area codes, but that rule went away 10 years ago.

I'm not aware of a standard library or framework available for formatting telephone numbers into E.164.
The solution used for our product, which requires formatting PBX provided caller-id into E.164, is to deploy a file (database table) containing the E.164 format information for all countries applicable.
This has the advantage that the application can be updated (to handle all the strange corner cases in various PSTN networks) without requiring changes to the production code base.
The table contains a row for each country code and information regarding area code length and subscriber length. There may be multiple entries for a country depending on what variations are possible with area code and subscriber number lengths.
Using the New Zealand PSTN (partial) dial plan as an example of the table:
CC   AREA_CODE   AREA_CODE_LENGTH   SUBSCRIBER   SUBSCRIBER_LENGTH
64               1                               7
64   21          2                               7
64   275         3                               6
We do something similar to what you have described, i.e. strip the provided telephone number of any non-digit characters and then format based on various rules regarding overall number plan length, outside access code, and long distance/international access codes.
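A table like the one above might drive the formatting along these lines. This is a guess at how such a table could be used, not the product's implementation: `DialPlan` and `Rule` are invented names, an empty AREA_CODE is assumed to mean "any area code of that length", the SUBSCRIBER column is ignored, and the input is assumed to be the national number with access codes already stripped.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class DialPlan {
    /** One row of the table: country code, area code prefix and the two lengths. */
    static final class Rule {
        final String countryCode;
        final String areaCode;        // empty string = any area code (assumption)
        final int areaCodeLength;
        final int subscriberLength;

        Rule(String countryCode, String areaCode, int areaCodeLength, int subscriberLength) {
            this.countryCode = countryCode;
            this.areaCode = areaCode;
            this.areaCodeLength = areaCodeLength;
            this.subscriberLength = subscriberLength;
        }
    }

    // The three New Zealand rows, most specific area code first.
    static final List<Rule> RULES = Arrays.asList(
            new Rule("64", "275", 3, 6),
            new Rule("64", "21", 2, 7),
            new Rule("64", "", 1, 7));

    /** Formats a digits-only national number as E.164 if some row accepts it. */
    static Optional<String> toE164(String nationalDigits) {
        for (Rule r : RULES) {
            boolean prefixOk = nationalDigits.startsWith(r.areaCode);
            boolean lengthOk = nationalDigits.length() == r.areaCodeLength + r.subscriberLength;
            if (prefixOk && lengthOk) {
                return Optional.of("+" + r.countryCode + nationalDigits);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(toE164("91234567"));
        System.out.println(toE164("211234567"));
        System.out.println(toE164("12345"));
    }
}
```

Loading the rows from a file or database table instead of the hard-coded list gives the benefit described above: new corner cases become data changes, not code changes.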
