I have a flat file of e-mail header data that I'm trying to parse for analysis. The file will always have fields in order as follows: Record Number, 1 or 2 bytes, "From:" followed by the sender's name and "Sent:" followed by the date sent.
1 From: Person.Name Sent: April 12, 2010
2 From:<tab>Person.Name Sent: April 30, 2011
10 From: Person.Name Sent: June 29, 2012
11 From:<tab>Person.Name Sent: July 8, 2012
Using BufferedReader, I am reading the file line by line and defining a substring of the name based on all characters between the indices of "From:" and "Sent:".
String sender = inputLine.substring((inputLine.indexOf("From:")+6),(inputLine.indexOf("Sent:")-1));
In this case, I'm grabbing everything AFTER "From: " (sixth byte excludes the word, the colon, and the space/single byte after the colon) through one LESS than the position of "Sent: " (the space before the S).
However, I'm getting unexpected output when I run the job. Some of my input data appears to have a tab after "From:" and some lines do not. When a tab is present, my output includes the last two or three bytes of "From: " (when the record number is a single digit, I get m:<tab>; for double-digit record numbers, it's om:<tab>).
Person.Name
m:<tab>Person.Name <-- single digit record number
Person.Name
om:<tab>Person.Name <-- double digit record number
EDIT: When I amend my substring to
String sender = inputLine.substring((inputLine.indexOf("From:\t")+6),(inputLine.indexOf("Sent:")-1));
ONLY the records with a space (and not a tab) prepend the end of the From: to the output.
Person.Name <-- records with From:<tab>
om: Person.Name <-- records with From:<space>
I'm now wondering if I understand substring correctly. My statement above is based on an understanding of substring(x,y) where x is the start and y is the end of the string. Is that correct?
Since indexOf("From:") is intended to represent an integer value of 2 or 3 (depending on a 1 or 2 byte record number, e.g., 1 From: or 10 From:) I would think that adding a value of 6 would give me an index value that falls AFTER the : in index 8 or 9 from the front of the line. So why does it appear to be viewing this as an index of 5--regardless?
          111111111122222222223
0123456789012345678901234567890   <- index values
1 From: Person.Name Sent: June
10 From: Person.Name Sent: July
The tab is the only difference in the records, and while I understand that a tab character may need to be counted differently than an ASCII space character, SUBTRACTING from the index seems a little strange.
Even more interesting, if I remove the "adjustments" from the statement,
String sender = inputLine.substring((inputLine.indexOf("From:")),(inputLine.indexOf("Sent:")));
I get a -1 out of range exception.
Can someone please explain what's happening here? I am baffled, and can't find answers this specific in Oracle's Java documentation.
I ended up creating new input fields that replaced \t with a space. Then everything worked fine. What it was about the tab character that threw things off is still a mystery.
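For reference, here is a minimal sketch of a way to extract the name without hard-coded offsets. It does not explain the tab behaviour described above; it only avoids depending on it. The class and method names are made up for illustration. It relies on two documented facts: substring(begin, end) includes begin and excludes end, and indexOf returns -1 when the text is not found, which is why passing its result straight to substring can throw an out-of-range exception.

```java
class SenderParser {
    /** Returns the text between "From:" and "Sent:", or null if either marker is missing. */
    static String extractSender(String line) {
        final String fromTag = "From:";
        final String sentTag = "Sent:";
        int from = line.indexOf(fromTag);
        // Only look for "Sent:" after the end of "From:" (or from the start if "From:" is absent).
        int sent = line.indexOf(sentTag, from < 0 ? 0 : from + fromTag.length());
        if (from < 0 || sent < 0) {
            return null; // indexOf returned -1: a marker is missing, so do not call substring
        }
        // begin index is inclusive, end index is exclusive;
        // trim() removes leading/trailing spaces and tabs alike
        return line.substring(from + fromTag.length(), sent).trim();
    }

    public static void main(String[] args) {
        System.out.println(extractSender("1 From: Person.Name Sent: April 12, 2010"));
        System.out.println(extractSender("11 From:\tPerson.Name Sent: July 8, 2012"));
    }
}
```

Because the start index is computed from the length of the marker and the whitespace is trimmed afterwards, a space and a tab after the colon are handled the same way.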
Making use of an ASCII .DAT file that contains multiple fixed-length records, I would like to read each record and generate an output based on certain portions of its contents.
So far my program does exactly this, but I was alerted to the fact that the first field in each .DAT file holds the record length and the number of records. The only issue I am having is reading this first field and extracting the data in a usable form, because the data is stored as ASCII chars and not decimal numbers.
Below is a code snippet in BASIC that reads the same file and extracts the initial data required:
CLS
INPUT "Survey System Data File? : ", survey$
survey$ = "f:\apps\survey\" + survey$
reclen = 3004
OPEN survey$ + ".dat" FOR RANDOM AS 1 LEN = reclen
FIELD #1, 3 AS RL$, 9 AS n$
GET #1, 1
RL = CVI(RL$): n = CVI(n$)
PRINT "Record Length = "; RL
reclen = RL
PRINT "Number of Records = "; n
CLOSE #1
Is there a way of doing something similar in Java?
The initial record and second record are seen below. The second record starts from 0001511
#Å Õ 000151115 2 351228 6 8131720 1121211 12111121121111111112112111 Treat people fairly. Motivated people who go the extra mile should be recognised. Trust employees to make decisions and find out what is best for the business. Examine the workload and the performance and timing of the work. 11 6 5 6 5 2003/10/007:12 21 111 2 1154 1 1 113 1 1 1 1 1 4000100 0 0 0 400 0 0 0 400 4100.0000.0000.0 0 0 10 24 12111none 9 1346
As you can see the initial characters are ASCII chars and not decimals that I'm looking for.
Many thanks in advance for the help.
I have found a way around this issue as the initial record of the file is basically a blank indicator record and so using this initial record length I was able to find the recurring record length of the others.
Hi all and thank you for the help in advance.
I have scoured the webs and have not really turned up with anything concrete as to my initial question.
I have a program I am developing in Java whose primary purpose is to read a .DAT file, extract certain values from it, and then calculate an output based on the extracted values, which it then writes back to the file.
The file is made up of records that are all the same length and format, so it should be fairly straightforward to access. Currently I am using a loop and an if statement to find the first occurrence of a record, and then use user input to determine the length of each record so that I can loop through them.
HOWEVER! The first record of this file is a blank (or so I thought). As it turns out, this first record is the key to the rest of the file, in that the first few chars are ASCII and reference the record length and the number of records contained within the file, respectively.
Below is a list of the ASCII values themselves as found in the files (disregard the " "; the ASCII is contained within them):
"#¼ ä "
"#g â "
"ÇG # "
"lj ‰ "
"Çò È "
"=¼ "
A friend of mine who used to code in BASIC many years ago reckons the first 3 chars refer to the record length and the following 9 refer to the number of records.
Basically, what I need to do is convert this initial string of ASCII chars to two decimal numbers in order to work out the length of each record and the number of records.
Any assistance will be greatly appreciated.
Edit...
Please find below the Basic code used to access the file in the past, perhaps this will help?
CLS
INPUT "Survey System Data File? : ", survey$
survey$ = "f:\apps\survey\" + survey$
reclen = 3004
OPEN survey$ + ".dat" FOR RANDOM AS 1 LEN = reclen
FIELD #1, 3 AS RL$, 9 AS n$
GET #1, 1
RL = CVI(RL$): n = CVI(n$)
PRINT "Record Length = "; RL
reclen = RL
PRINT "Number of Records = "; n
CLOSE #1
Basically what I am looking for is something similar but in java.
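A rough Java counterpart of the CVI calls might look like the sketch below. It assumes that CVI interprets the first two bytes of its argument as a signed 16-bit little-endian integer, and that the header layout matches the FIELD statement (3 bytes for the record length, then 9 bytes for the record count). The class and method names are hypothetical, and the sample bytes in main are made up, not taken from the file shown above. The file has to be read as raw bytes (for example with Files.readAllBytes or RandomAccessFile), not through a Reader, or the values will be altered by character decoding.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class HeaderDecoder {
    /**
     * Decodes the two bytes starting at offset as a signed 16-bit
     * little-endian integer (assumed to be what CVI does).
     */
    static int cvi(byte[] data, int offset) {
        return ByteBuffer.wrap(data, offset, 2)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getShort();
    }

    public static void main(String[] args) {
        // Made-up 12-byte header: 3 bytes for RL$, 9 bytes for n$ (per the FIELD statement).
        byte[] header = new byte[12];
        header[0] = (byte) 0xBC; // low byte of 0x0BBC
        header[1] = 0x0B;        // high byte of 0x0BBC
        header[3] = 0x2A;        // low byte of the record count
        System.out.println("Record Length = " + cvi(header, 0));     // 0x0BBC = 3004
        System.out.println("Number of Records = " + cvi(header, 3)); // 0x002A = 42
    }
}
```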
ASCII is a way of mapping the bit pattern in a byte to a character, which gives each character a numerical value; for the letter 'A', this is 65.
In Java, you can get that numerical value by converting the char to an int. (Strictly, this gives you the Unicode value, but for the ASCII characters the Unicode value is the same as the ASCII value, so this does not matter.)
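As a small illustration of that conversion (the class and method names are invented for the example): a char widens to its numeric code directly, while a raw byte read from a file is signed in Java and needs masking to give a value in the range 0 to 255.

```java
class CharCodes {
    /** Numeric (Unicode) value of a char; identical to ASCII for codes 0..127. */
    static int codeOf(char c) {
        return (int) c;
    }

    /** Unsigned value (0..255) of a raw byte, since Java bytes are signed. */
    static int unsignedValue(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        System.out.println(codeOf('A'));                // 65
        System.out.println(unsignedValue((byte) 0xC5)); // 197
    }
}
```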
But now you need to know how the length is calculated: do you have to add the values? Or multiply them? Or append them? Or multiply them with 128^p where p is the position, and add the result? And, in the latter case, is the first byte on position 0 or position 3?
Same for the number of records, of course.
Another possible interpretation of the data is that the bytes are BCD encoded numbers. In that case, each nibble (4bit set) represents a number from 0 to 9. In that case, you have to do some bit manipulation to extract the numbers and concatenate them, from left (highest) to right (lowest). At least you do not have to struggle with the sequence and further interpretation here …
But as BCD would require all 8 bits of each byte, this would not be the right interpretation if the file really contains ASCII, as ASCII is 7-bit.
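If the BCD interpretation were the right one, the bit manipulation could be sketched roughly as follows. This assumes packed BCD with the most significant digits first; nothing in the question confirms that the file uses this encoding, and the names are hypothetical.

```java
class BcdDecoder {
    /** Decodes packed BCD (two decimal digits per byte, most significant byte first). */
    static long decodeBcd(byte[] data) {
        long value = 0;
        for (byte b : data) {
            int hi = (b >> 4) & 0x0F; // upper nibble
            int lo = b & 0x0F;        // lower nibble
            if (hi > 9 || lo > 9) {
                throw new IllegalArgumentException("not a BCD byte: " + (b & 0xFF));
            }
            // Append the two digits to the right of what has been decoded so far.
            value = value * 100 + hi * 10 + lo;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(decodeBcd(new byte[] {0x30, 0x04})); // digits 3,0,0,4 -> 3004
    }
}
```

A byte whose nibbles fall outside 0 to 9 makes the method throw, which is a quick way to test whether a given header can be BCD at all.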
Background to my problem
Hi, I am attempting to complete an exercise on Project Euler which states that I must read all names from a ".txt" file and add up the character codes for each character within each name. As I was doing the exercise, I realized that the wrong character codes are being displayed.
This is the full details for my problem from project Euler
Using names.txt (right click and 'Save Link/Target As...'), a 46K text
file containing over five-thousand first names, begin by sorting it
into alphabetical order. Then working out the alphabetical value for
each name, multiply this value by its alphabetical position in the
list to obtain a name score.
For example, when the list is sorted into alphabetical order, COLIN,
which is worth 3 + 15 + 12 + 9 + 14 = 53, is the 938th name in the
list. So, COLIN would obtain a score of 938 × 53 = 49714.
What is the total of all the name scores in the file?
My Question
Why is my code displaying the value "67" for the character "C" when the actual character code value for "C" is 3? Thanks in advance.
private static int NameValue(string name)
{
string StrimName = name.Substring(1, name.Length-2); // name ---> COLIN
Console.WriteLine(StrimName[0] + 0); // should print 3 because character code for "C" Is 3 but result is 67...
return 0;
}
It prints a number from an ASCII table: http://www.asciitable.com/
You should replace it with:
Console.WriteLine((StrimName[0]-64) + 0);
to get what you want. You want to count 'A' as one, and its number in the ASCII table is 65, so I subtract 64.
Every character has a number in the ASCII code.
The ASCII code for 'C' is 67. This is why you see 67.
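The same arithmetic applied to a whole name can be sketched like this. It is written in Java rather than the C# of the question, and the class and method names are made up; subtracting 'A' and adding 1 is equivalent to subtracting 64.

```java
class NameScore {
    /** Sum of letter positions (A=1 ... Z=26); characters outside A-Z are ignored. */
    static int alphabeticalValue(String name) {
        int sum = 0;
        for (char c : name.toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                sum += c - 'A' + 1; // 'C' is 67, 'A' is 65, so 'C' counts as 3
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(alphabeticalValue("COLIN")); // 3 + 15 + 12 + 9 + 14 = 53
    }
}
```

Ignoring characters outside A-Z means surrounding quotes do not need to be stripped before scoring.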
You can see here a table for ascii code
I am using AWS SNS to push notifications to Apple devices. However, I am facing a lot of issues regarding the length of the message that can be passed to SNS. E.g.:
If I'm using the following message, it gets delivered:
{
"default":"This is the default message",
"APNS":"{
\"aps\":{
\"badge\":9,
\"alert\":\"The ninth season of supernatural, an American paranormal drama television series created by Eric Kripke, premiered on October 8, 2013, concluded on May 20, 2014, and contained 23 episodes. On February 14, 201\",
\"sound\":\"default\"
}
}"
}
with alert's value(which is the actual message) : 208 characters
Total characters : 319 characters
But if I add 1 more character in the message(alert's value), it doesn't work.
Again, if I use the following JSON with a reduced message length (by 25 characters) and 1 extra parameter alongside aps, the working lengths are as follows:
{
"default":"This is the default message",
"APNS":"{
\"aps\":{
\"badge\":9,
\"alert\":\"The ninth season of supernatural, an American paranormal drama television series created by Eric Kripke, premiered on October 8, 2013, concluded on May 20, 2014, and contained 23 epis\",
\"sound\":\"default\"
},
\"sound\":\"newMessage.aif\"
}"
}
with alert's value(which is the actual message) : 183 characters
Total characters : 324 characters
However, if I add 1 more character to message(alert's value), it doesn't work.
I can't seem to figure out how much trimming I need to do before sending the messages so that they don't fail. Has anybody got any idea?
The payload of your message is :
{
"aps":{
"badge":9,
"alert":"The ninth season of supernatural, an American paranormal drama television series created by Eric Kripke, premiered on October 8, 2013, concluded on May 20, 2014, and contained 23 epis",
"sound":"default"
},
"sound":"newMessage.aif"
}
The total length of all the characters you see above, including the quotes and the brackets, should be <= 256 bytes (not just the content of the alert property). You should avoid any spaces and new-lines that are not part of the alert message, because those are also counted toward the 256 bytes limit.
Note that your second example contains an additional parameter "sound":"newMessage.aif". That's why you have less remaining space for your alert.
BTW, I don't understand why you send the sound parameter twice. Is it a mistake? It should only appear inside the aps dictionary.
Relevant quotes from the APNS guide :
Each push notification includes a payload. The payload contains
information about how the system should alert the user as well as any
custom data you provide. The maximum size allowed for a notification
payload is 256 bytes; Apple Push Notification Service refuses any
notification that exceeds this limit.
The examples are formatted with whitespace and line breaks for
readability. In practice, omit whitespace and line breaks to reduce
the size of the payload, improving network performance.
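A quick way to check a payload before sending it is to measure its encoded size rather than counting characters. The sketch below assumes the payload is sent as UTF-8 and uses the 256-byte figure quoted above; the class and method names are hypothetical, and this is not part of any AWS or Apple API.

```java
import java.nio.charset.StandardCharsets;

class ApnsPayloadSize {
    /** Limit taken from the APNS guide quoted above. */
    static final int LIMIT_BYTES = 256;

    /** Size of the payload in bytes when encoded as UTF-8 (assumed encoding). */
    static int sizeInBytes(String payloadJson) {
        return payloadJson.getBytes(StandardCharsets.UTF_8).length;
    }

    /** True if the whole payload, not just the alert text, is within the limit. */
    static boolean fits(String payloadJson) {
        return sizeInBytes(payloadJson) <= LIMIT_BYTES;
    }

    public static void main(String[] args) {
        String compact = "{\"aps\":{\"alert\":\"hi\"}}";
        System.out.println(sizeInBytes(compact) + " bytes, fits: " + fits(compact));
    }
}
```

Measuring bytes instead of characters matters once the alert contains non-ASCII text, because a single character can then take more than one byte.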
What is the best way for converting phone numbers into international format (E.164) using Java?
Given a 'phone number' and a country id (let's say an ISO country code), I would like to convert it into a standard E.164 international format phone number.
I am sure I can do it by hand quite easily - but I would not be sure it would work correctly in all situations.
Which Java framework/library/utility would you recommend to accomplish this?
P.S. The 'phone number' could be anything identifiable by the general public - such as
* (510) 786-0404
* 1-800-GOT-MILK
* +44-(0)800-7310658
That last one is my favourite: it is how some people write their number in the UK, and it means that you should either use the +44 or you should use the 0.
The E.164 format number should be all numeric, and use the full international country code (e.g. +44).
Google provides a library for working with phone numbers, the same one they use for Android:
http://code.google.com/p/libphonenumber/
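import com.google.i18n.phonenumbers.NumberParseException;
import com.google.i18n.phonenumbers.PhoneNumberUtil;
import com.google.i18n.phonenumbers.PhoneNumberUtil.PhoneNumberFormat;
import com.google.i18n.phonenumbers.Phonenumber.PhoneNumber;

String swissNumberStr = "044 668 18 00";
PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
try {
    // "CH" is the region the number is assumed to be dialled from
    PhoneNumber swissNumberProto = phoneUtil.parse(swissNumberStr, "CH");
    // The parsed number is only in scope inside the try block, so format it here.
    // Produces "+41 44 668 18 00"
    System.out.println(phoneUtil.format(swissNumberProto, PhoneNumberFormat.INTERNATIONAL));
    // Produces "044 668 18 00"
    System.out.println(phoneUtil.format(swissNumberProto, PhoneNumberFormat.NATIONAL));
    // Produces "+41446681800"
    System.out.println(phoneUtil.format(swissNumberProto, PhoneNumberFormat.E164));
} catch (NumberParseException e) {
    System.err.println("NumberParseException was thrown: " + e.toString());
}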
Speaking from experience at writing this kind of thing, it's really difficult to do with 100% reliability. I've written some Java code to do this that is reasonably good at processing the data we have but won't be applicable in every country. Questions you need to ask are:
Are the character-to-number mappings consistent between countries? The US uses a lot of this (e.g. 1800-GOT-MILK) but in Australia, as one example, it's pretty rare. What you'd need to do is ensure that you were doing the correct mapping for the country in question if it varies (it might not). I don't know what countries that use different alphabets (e.g. Cyrillic in Russia and the former Eastern Bloc countries) do;
You have to accept that your solution will not be 100% and you should not expect it to be. You need to take a "best guess" approach. For example, there's no real way of knowing that 132345 is a valid phone number in Australia, as is 1300 123 456, that these are the only two patterns for 13xx numbers, and that they're not callable from overseas;
You also have to ask if you want to validate regions (area codes). I believe the US uses a system where the second digit of the area code is a 1 or a 0. This may have once been the case but I'm not sure if it still applies. Whatever the case, many other countries will have other rules. In Australia, the valid area codes for landlines and mobile (cell) phones are two digits (the first is 0). 08, 03 and 04 are all valid. 01 isn't. How do you cater for that? Do you want to?
Countries use different conventions no matter how many digits they're writing. You have to decide if you want to accept something other than the "norm". These are all common in Australia:
(02) 1234 5678
02 1234 5678
0411 123 123 (but I've never seen 04 1112 3456)
131 123
13 1123
1 300 123 123
1300 123 123
02-1234-5678
1300-234-234
+44 78 1234 1234
+44 (0)78 1234 1234
+44-78-1234-1234
+44-(0)78-1234-1234
0011 44 78 1234 1234 (0011 is the standard international dialling code)
(44) 078 1234 1234 (not common)
And that's just off the top of my head. For one country. In France, for example, it's common to write the phone number in number pairs (12 34 56 78) and they pronounce it that way too. Instead of:
un (one), deux (two), trois (three), ...
it's
douze (twelve), trente-quatre (thirty four), ...
Do you want to cater for that level of cultural difference? I would assume not but the question is worth considering just in case you make your rules too strict.
Also some people may append extension numbers on phone numbers, possibly with "ext" or similar abbreviation. Do you want to cater for that?
Sorry, no code here. Just a list of questions to ask yourself and issues to consider. As others have said, a series of regular expressions can do much of the above but ultimately phone number fields are (mostly) free form text at the end of the day.
This was my solution:
public static String FixPhoneNumber(Context ctx, String rawNumber)
{
    String fixedNumber = "";
    // get current location iso code
    TelephonyManager telMgr = (TelephonyManager) ctx.getSystemService(Context.TELEPHONY_SERVICE);
    String curLocale = telMgr.getNetworkCountryIso().toUpperCase();
    PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
    Phonenumber.PhoneNumber phoneNumberProto;
    // gets the international dialling code for our current location
    String curDCode = String.format("%d", phoneUtil.getCountryCodeForRegion(curLocale));
    String ourDCode = "";
    if(rawNumber.indexOf("+") == 0)
    {
        int bIndex = rawNumber.indexOf("(");
        int hIndex = rawNumber.indexOf("-");
        int eIndex = rawNumber.indexOf(" ");
        if(bIndex != -1)
        {
            ourDCode = rawNumber.substring(1, bIndex);
        }
        else if(hIndex != -1)
        {
            ourDCode = rawNumber.substring(1, hIndex);
        }
        else if(eIndex != -1)
        {
            ourDCode = rawNumber.substring(1, eIndex);
        }
        else
        {
            ourDCode = curDCode;
        }
    }
    else
    {
        ourDCode = curDCode;
    }
    try
    {
        phoneNumberProto = phoneUtil.parse(rawNumber, curLocale);
    }
    catch (NumberParseException e)
    {
        return rawNumber;
    }
    if(curDCode.compareTo(ourDCode) == 0)
        fixedNumber = phoneUtil.format(phoneNumberProto, PhoneNumberFormat.NATIONAL);
    else
        fixedNumber = phoneUtil.format(phoneNumberProto, PhoneNumberFormat.INTERNATIONAL);
    return fixedNumber.replace(" ", "");
}
I hope this helps someone with the same problem.
Enjoy and use freely.
Thanks for the answers. As stated in the original question, I am much more interested in the formatting of the number into the standard format than I am in determining if it is a valid (as in genuine) phone number.
I have some hand-crafted code currently that takes a phone number String (as entered by the user), a source country context and a target country context (the country from where the number is being dialed, and the country to where the number is being dialed; both are known to the system) and then does the following conversion in steps:
Strip all whitespace from the number
Translate all alpha into digits - using a lookup table of letter to digit (e.g. A-->2, B-->2, C-->2, D-->3) etc. for the keypad (I was not aware that some keypads distribute these differently)
Strip all punctuation - keeping a preceding '+' intact if it exists (in case the number is already in some sort of international format).
Determine if the number has an international dialling prefix for the country context - e.g. if source context is the UK, I would see if it starts with a '00' - and replace it with a '+'. I do not currently check whether the digits following the '00' are followed by the international dialing code for the target country. I look up the international dialing prefix for the source country in a lookup table (e.g. GB-->'00', US-->'011' etc.)
Determine if the number has a local dialing prefix for the country context - e.g. if the source context is the UK, I would look to see if it starts with a '0' - and replace it with a '+' followed by the international dialing code for the target country. I look up the local dialing prefix for the source country in a lookup table (e.g. GB-->'0', US-->'1' etc.), and the international dialing code for the target country in another lookup table (e.g.'GB'='44', US='1')
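The steps above can be sketched in Java roughly as follows. This is an illustration, not the actual hand-crafted code: the class name, the method name and the three lookup tables are hypothetical and only cover the GB and US values mentioned in the steps. Two behaviours are assumptions of the sketch: a number with no recognised prefix simply gets the target country code prepended, and the +44(0) case is not handled.

```java
import java.util.Locale;
import java.util.Map;

class E164Sketch {
    // Hypothetical lookup tables, limited to the two countries used as examples.
    static final Map<String, String> INTL_PREFIX = Map.of("GB", "00", "US", "011");
    static final Map<String, String> LOCAL_PREFIX = Map.of("GB", "0", "US", "1");
    static final Map<String, String> COUNTRY_CODE = Map.of("GB", "44", "US", "1");
    // Keypad digit for each letter A..Z (ABC=2, DEF=3, ..., PQRS=7, TUV=8, WXYZ=9).
    static final String KEYPAD = "22233344455566677778889999";

    /** source/target must be keys of the tables above, otherwise this sketch fails. */
    static String toE164(String raw, String source, String target) {
        StringBuilder sb = new StringBuilder();
        for (char c : raw.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c >= '0' && c <= '9') {
                sb.append(c);
            } else if (c >= 'A' && c <= 'Z') {
                sb.append(KEYPAD.charAt(c - 'A')); // translate alpha into digits
            } else if (c == '+' && sb.length() == 0) {
                sb.append(c);                      // keep only a leading '+'
            }
            // whitespace and all other punctuation are dropped
        }
        String n = sb.toString();
        if (n.startsWith("+")) {
            return n; // already in some international form
        }
        String intl = INTL_PREFIX.get(source);
        if (n.startsWith(intl)) {
            return "+" + n.substring(intl.length());
        }
        String local = LOCAL_PREFIX.get(source);
        if (n.startsWith(local)) {
            n = n.substring(local.length());
        }
        return "+" + COUNTRY_CODE.get(target) + n;
    }

    public static void main(String[] args) {
        System.out.println(toE164("1-800-GOT-MILK", "US", "US"));
        System.out.println(toE164("00 44 800 7310658", "GB", "GB"));
    }
}
```

The international prefix is checked before the local one so that a UK number starting with 00 is not mistaken for one starting with the local prefix 0.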
It seems to work for everything I have thrown at it so far - except for the +44(0)1234-567-890 situation - I will add a special case check for that one.
Writing it was not hard - and I can add special cases for each strange exception I come across. But I would really like to know if there is a standard solution.
The phone companies seem to deal with this thing every day. I never get inconsistent results when dialing numbers using the PSTN. For example, in the US (where mobile phones have the same area codes as landlines), I could dial +1-123-456-7890, or 011-1-123-456-7890 (where 011 is the international dialing prefix in the US and 1 is the international dialing code for the US), 1-123-456-7890 (where 1 is the local dialing prefix in the US) or even 456-7890 (assuming I was in the 123 area code at the time) and get the same results each time. I assume that internally these dialed numbers get converted to the same E.164 standard format, and that the conversion is all done in software.
To be honest, it sounds like you've got most of the bases covered already.
The +44(0)800 format sometimes (incorrectly) used in the UK is annoying and isn't strictly valid according to E.123, which is the ITU-T recommendation for how numbers should be displayed. If you haven't got a copy of E.123 it's worth a look.
For what it's worth, the telephone network itself doesn't always use E.164. Often there'll be a flag in the ISDN signalling generated by the PBX (or in the network if you're on a steam phone) which tells the network whether the number being dialled is local, national or international.
In some countries you can validate 112 as a valid phone number, but if you stick a country code in front of it it won't be valid any more. In other countries you can't validate 112 but you can validate 911 as a valid phone number.
I've seen some phones that put Q on the 7 key and Z on the 9 key. I've seen some phones that put Q and Z on the 0 key, and some that put Q and Z on the 1 key.
An area code that existed yesterday might not exist today, and vice-versa.
In half of North America (country code 1), the second digit rule used to be 0 or 1 for area codes, but that rule went away 10 years ago.
I'm not aware of a standard library or framework available for formatting telephone numbers into E.164.
The solution used for our product, which requires formatting PBX provided caller-id into E.164, is to deploy a file (database table) containing the E.164 format information for all countries applicable.
This has the advantage that the application can be updated (to handle all the strange corner cases in various PSTN networks) without requiring changes to the production code base.
The table contains a row for each country code and information regarding area code length and subscriber length. There may be multiple entries for a country depending on what variations are possible with area code and subscriber number lengths.
Using the New Zealand PSTN (partial) dial plan as an example of the table:
CC   AREA_CODE   AREA_CODE_LENGTH   SUBSCRIBER   SUBSCRIBER_LENGTH
64               1                               7
64   21          2                               7
64   275         3                               6
We do something similar to what you have described, i.e. strip the provided telephone number of any non-digit characters and then format based on various rules regarding overall number plan length, outside access code, and long distance/international access codes.