I am trying to normalise UK telephone numbers to international format.
The following strings should resolve to: +447834012345
07834012345
+447834012345
+4407834012345
+44 (0) 7834 012345
+44 0 7834 012345
004407834012345
0044 (0) 7834012345
00 44 0 7834012345
So far, I have got this:
"+44" + mobile.replaceAll("[^0-9]0*(44)?0*", "")
This doesn't quite cut it, as I am having problems with leading 0's etc; see table below. I'd like to try and refrain from using the global flag if possible.
Mobile              | Normalised         | Result
--------------------+--------------------+-------
07834012345         | +4407834012345     | FAIL
+447834012345       | +447834012345      | PASS
+4407834012345      | +447834012345      | PASS
+44 (0) 7834 012345 | +44783412345       | FAIL
+44 0 7834 012345   | +44783412345       | FAIL
004407834012345     | +44004407834012345 | FAIL
0044 (0) 7834012345 | +4400447834012345  | FAIL
00 44 0 7834012345  | +44007834012345    | FAIL
+4407834004445      | +447834004445      | PASS
Thanks
If you still want the regex, I was able to get it working like this:
System.out.println("+44" + mobile.replaceAll("[^0-9]", "")
    .replaceAll("^0{0,2}(44){0,2}0{0,1}(\\d{10})", "$2"));
EDIT: Changed the code to reflect failed tests. Removed non-numeric characters before running the regex.
EDIT: Update code based on comments.
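As a sanity check, the two-step approach (strip the non-digits, then drop the prefix) can be wrapped in a small method and run against the strings from the question. This is only a sketch: the class name UkNumberNormaliser and the method normalise are made up for illustration, and it assumes every input really does contain a 10-digit UK national number.

```java
final class UkNumberNormaliser {

    // Hypothetical helper: strips everything except digits, then removes an
    // optional 00 prefix, an optional 44 country code and an optional trunk 0,
    // keeping the 10-digit national number.
    static String normalise(String mobile) {
        String digits = mobile.replaceAll("[^0-9]", "");
        return "+44" + digits.replaceAll("^0{0,2}(44){0,2}0{0,1}(\\d{10})", "$2");
    }

    public static void main(String[] args) {
        String[] inputs = {
            "07834012345", "+447834012345", "+4407834012345",
            "+44 (0) 7834 012345", "+44 0 7834 012345", "004407834012345",
            "0044 (0) 7834012345", "00 44 0 7834012345", "+4407834004445"
        };
        for (String input : inputs) {
            System.out.println(input + " -> " + normalise(input));
        }
    }
}
```

Inputs that do not contain ten national digits are not handled here, which is one reason a dedicated library is the safer choice for anything beyond these cases.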
Like my answer here, I would also suggest looking at the Google libphonenumber library. I know it is not regex but it does exactly what you want.
An example of how to do it in Java (it is available in other languages) would be the following from the documentation:
Let's say you have a string representing a phone number from
Switzerland. This is how you parse/normalize it into a PhoneNumber
object:
String swissNumberStr = "044 668 18 00";
PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
try {
PhoneNumber swissNumberProto = phoneUtil.parse(swissNumberStr, "CH");
} catch (NumberParseException e) {
System.err.println("NumberParseException was thrown: " + e.toString());
}
At this point, swissNumberProto contains:
{
"country_code": 41,
"national_number": 446681800
}
PhoneNumber is a class that is auto-generated from the
phonenumber.proto with necessary modifications for efficiency. For
details on the meaning of each field, refer to
https://github.com/googlei18n/libphonenumber/blob/master/resources/phonenumber.proto
Now let us validate whether the number is valid:
boolean isValid = phoneUtil.isValidNumber(swissNumberProto); // returns true
There are a few formats supported by the formatting method, as
illustrated below:
// Produces "+41 44 668 18 00"
System.out.println(phoneUtil.format(swissNumberProto, PhoneNumberFormat.INTERNATIONAL));
// Produces "044 668 18 00"
System.out.println(phoneUtil.format(swissNumberProto, PhoneNumberFormat.NATIONAL));
// Produces "+41446681800"
System.out.println(phoneUtil.format(swissNumberProto, PhoneNumberFormat.E164));
I have a Dataset<Row> in java. I need to read value of 1 column which is a JSON string, parse it, and set the value of a few other columns based on the parsed JSON value.
My dataset looks like this:
|json | name| age |
========================================
| "{'a':'john', 'b': 23}" | null| null |
----------------------------------------
| "{'a':'joe', 'b': 25}" | null| null |
----------------------------------------
| "{'a':'zack'}" | null| null |
----------------------------------------
And I need to make it like this:
|json | name | age |
========================================
| "{'a':'john', 'b': 23}" | 'john'| 23 |
----------------------------------------
| "{'a':'joe', 'b': 25}" | 'joe' | 25 |
----------------------------------------
| "{'a':'zack'}" | 'zack'|null|
----------------------------------------
I am unable to figure out a way to do it. Please help with the code.
Spark has a function called get_json_object.
Assuming you have a data frame named df, you can solve your problem this way:
df.selectExpr("get_json_object(json, '$.a') as name", "get_json_object(json, '$.b') as age" )
But first and foremost, be sure that your json attribute has double quotes instead of single ones.
Note: there is a full list of Spark SQL functions. I use it heavily. Consider bookmarking it and referring to it from time to time.
You could use UDFs
import org.apache.spark.sql.functions.udf

def parseName(json: String): String = ??? // parse json
val parseNameUDF = udf[String, String](parseName)
def parseAge(json: String): Int = ??? // parse json
val parseAgeUDF = udf[Int, String](parseAge)
dataFrame
.withColumn("name", parseNameUDF(dataFrame("json")))
.withColumn("age", parseAgeUDF(dataFrame("json")))
I have this string to parse and extract all elements between <>:
String text = "test user #myhashtag <#C5712|user_name_toto> <#U433|user_hola>";
I tried with this pattern, but it doesn't work (no result):
String pattern = "<#[C,U][0-9]+\\|[.]+>";
So in this example I want to extract:
<#C5712|user_name_toto>
<#U433|user_hola>
Then for each, I want to extract:
C or U element
ID (ie: 5712 or 433)
user name (ie: user_name_toto)
Thank you very much guys
The main problem I can see with your pattern is that it doesn't contain groups, hence retrieving parts of it will be impossible without further parsing.
You define numbered groups within parentheses: (partOfThePattern).
From Java 7 onwards, you can also define named groups as follows: (?<theName>partOfThePattern).
The second problem is that [.] corresponds to a literal dot, not an "any character" wildcard.
The third problem is your last quantifier, which is greedy, therefore it would consume the whole rest of the string starting from the first username.
Here's a self-contained example fixing all that:
String text = "test user #myhashtag <#C5712|user_name_toto> <#U433|user_hola>";
//                           | starting <#
//                           | | group 1: any 1 char
//                           | |  | group 2: 1+ digits
//                           | |  |     | escaped "|"
//                           | |  |     |  | group 3: 1+ non-">" chars, greedy
//                           | |  |     |  |      | closing >
//                           | |  |     |  |      |
Pattern p = Pattern.compile("<#(.)(\\d+)\\|([^>]+)>");
Matcher m = p.matcher(text);
while (m.find()) {
System.out.printf(
"C or U? %s%nUser ID: %s%nUsername: %s%n",
m.group(1), m.group(2), m.group(3)
);
}
Output
C or U? C
User ID: 5712
Username: user_name_toto
C or U? U
User ID: 433
Username: user_hola
Note
I'm not validating C vs U here (gives you another . example).
You can easily replace the initial (.) with (C|U) if you only have either. You can also have the same with ([CU]).
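To illustrate the named-group syntax together with the ([CU]) variant, here is a minimal sketch. The class name MentionParser and the group names type, id and name are my own choices, not something from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MentionParser {

    // Same structure as the numbered-group pattern, but the first group only
    // accepts C or U and every group has a name (Java 7+).
    private static final Pattern MENTION =
        Pattern.compile("<#(?<type>[CU])(?<id>\\d+)\\|(?<name>[^>]+)>");

    // Returns one {type, id, name} triple per <#...|...> element found.
    static List<String[]> parse(String text) {
        List<String[]> result = new ArrayList<>();
        Matcher m = MENTION.matcher(text);
        while (m.find()) {
            result.add(new String[] { m.group("type"), m.group("id"), m.group("name") });
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "test user #myhashtag <#C5712|user_name_toto> <#U433|user_hola>";
        for (String[] mention : parse(text)) {
            System.out.printf("%s %s %s%n", mention[0], mention[1], mention[2]);
        }
    }
}
```

Named groups make the extraction code a little easier to read, at the cost of a longer pattern.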
<#([CU])(\d+)\|(\w+)>
Where:
$1 --> C/U
$2 --> 5712/433
$3 --> user_name_toto/user_hola
Background: I'm using Talend to do something (I guess) that is pretty common: generating multiple rows from one. For example:
ID | Name | DateFrom | DateTo
01 | Marco| 01/01/2014 | 04/01/2014
...could be split into:
new_ID | ID | Name | DateFrom | DateTo
01 | 01 | Marco | 01/01/2014 | 02/01/2014
02 | 01 | Marco | 02/01/2014 | 03/01/2014
03 | 01 | Marco | 03/01/2014 | 04/01/2014
The number of output rows is dynamic, depending on the date range in the original row.
Question: how can I do this? Maybe using tSplitRow? I am going to check those periods with tJavaRow. Any suggestions?
Expanding on the answer given by Balazs Gunics:
The first part is to calculate the number of rows that one input row will become, which is easy enough with a date diff function on the to and from dates.
Part 2 is to pass that value to a tFlowToIterate, and pick it up with a tJavaFlex that will use it in its start code to control a for loop:
tJavaFlex start:
int currentId = (Integer)globalMap.get("out1.id");
String currentName = (String)globalMap.get("out1.name");
Long iterations = (Long)globalMap.get("out1.iterations");
Date dateFrom = (java.util.Date)globalMap.get("out1.dateFrom");
for (int i = 0; i < iterations; i++) {
Main
row2.id = currentId;
row2.name = currentName;
row2.dateFrom = TalendDate.addDate(dateFrom, i, "dd");
row2.dateTo = TalendDate.addDate(dateFrom, i+1, "dd");
End
}
and sample output:
1|Marco|01-01-2014|02-01-2014
1|Marco|02-01-2014|03-01-2014
1|Marco|03-01-2014|04-01-2014
2|Polo|01-01-2014|02-01-2014
2|Polo|02-01-2014|03-01-2014
2|Polo|03-01-2014|04-01-2014
2|Polo|04-01-2014|05-01-2014
2|Polo|05-01-2014|06-01-2014
2|Polo|06-01-2014|07-01-2014
2|Polo|07-01-2014|08-01-2014
2|Polo|08-01-2014|09-01-2014
2|Polo|09-01-2014|10-01-2014
2|Polo|10-01-2014|11-01-2014
2|Polo|11-01-2014|12-01-2014
2|Polo|12-01-2014|13-01-2014
2|Polo|13-01-2014|14-01-2014
2|Polo|14-01-2014|15-01-2014
2|Polo|15-01-2014|16-01-2014
2|Polo|16-01-2014|17-01-2014
2|Polo|17-01-2014|18-01-2014
2|Polo|18-01-2014|19-01-2014
2|Polo|19-01-2014|20-01-2014
2|Polo|20-01-2014|21-01-2014
2|Polo|21-01-2014|22-01-2014
2|Polo|22-01-2014|23-01-2014
2|Polo|23-01-2014|24-01-2014
2|Polo|24-01-2014|25-01-2014
2|Polo|25-01-2014|26-01-2014
2|Polo|26-01-2014|27-01-2014
2|Polo|27-01-2014|28-01-2014
2|Polo|28-01-2014|29-01-2014
2|Polo|29-01-2014|30-01-2014
2|Polo|30-01-2014|31-01-2014
2|Polo|31-01-2014|01-02-2014
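For anyone who wants to check the loop logic outside Talend, roughly the same thing can be sketched in plain Java with java.time. The class and method names are hypothetical, and this assumes each output row covers exactly one day, as in the sample output.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

final class DateRangeSplitter {

    // Splits [from, to) into consecutive one-day rows of {dateFrom, dateTo}.
    static List<LocalDate[]> splitIntoDays(LocalDate from, LocalDate to) {
        // Equivalent of the "iterations" value computed with the date diff.
        long iterations = ChronoUnit.DAYS.between(from, to);
        List<LocalDate[]> rows = new ArrayList<>();
        for (long i = 0; i < iterations; i++) {
            rows.add(new LocalDate[] { from.plusDays(i), from.plusDays(i + 1) });
        }
        return rows;
    }

    public static void main(String[] args) {
        List<LocalDate[]> rows =
            splitIntoDays(LocalDate.of(2014, 1, 1), LocalDate.of(2014, 1, 4));
        for (LocalDate[] row : rows) {
            System.out.println("1|Marco|" + row[0] + "|" + row[1]);
        }
    }
}
```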
You can use tJavaFlex to do this.
If you have a small number of columns, then the tFlowToIterate -> tJavaFlex option could be fine.
In the begin part you can start to iterate, and in the main part you assign values to the output schema. If your output is named row6, then:
row6.id = (String)globalMap.get("id");
and so on.
I came here as I wanted to add all context parameters into an Excel data sheet. So the solution below works when you are taking 0 input lines, but it can be adapted to generate several lines for each input line.
The design is actually straightforward:
tJava –trigger-on-OK→ tFileInputDelimited → tDoSomethingOnRowSet
  ↓                          ↑
[write into a CSV]    [read the CSV]
And here is the kind of code structure usable in the tJava.
try {
StringBuffer wad = new StringBuffer();
wad.append("Key;Nub"); // Header
context.stringPropertyNames().forEach(
key -> wad.
append(System.getProperty("line.separator")).
append(key + ";" + context.getProperty(key) )
);
// Here context.metadata contains the path to the CSV file
FileWriter output = new FileWriter(context.metadata);
output.write(wad.toString());
output.close();
} catch (IOException mess) {
System.out.println("An error occurred.");
mess.printStackTrace();
}
Of course if you have a set of rows as input, you can adapt the process to use a tJavaRow instead of a tJava.
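If it helps, the string-building part can be pulled out into a plain Java method so that it can be tried outside Talend. This is a rough sketch, assuming the context behaves like a java.util.Properties (the tJava code relies on stringPropertyNames() and getProperty()). PropertiesCsv, its methods and the "Key;Value" header are hypothetical choices of mine, and the keys are sorted only to make the output predictable.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;
import java.util.TreeSet;

final class PropertiesCsv {

    // Builds a semicolon-separated "key;value" listing with a header row.
    static String toCsv(Properties props) {
        StringBuilder out = new StringBuilder("Key;Value");
        for (String key : new TreeSet<>(props.stringPropertyNames())) {
            out.append(System.lineSeparator())
               .append(key).append(';').append(props.getProperty(key));
        }
        return out.toString();
    }

    // try-with-resources closes the writer even if write() throws.
    static void write(Properties props, Path target) throws IOException {
        try (Writer writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(toCsv(props));
        }
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.setProperty("host", "localhost");
        props.setProperty("port", "8080");
        Path target = Files.createTempFile("context", ".csv");
        write(props, target);
        System.out.println(toCsv(props));
    }
}
```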
You might prefer to use an Excel file as an on-disk buffer, but dealing with this file format takes more work, at least the first time, when you don't have the Java libraries already configured in Talend. Apache POI might help you if you nonetheless choose to go this way.
So far I have this:
File dir = new File("C:\\Users\\User\\Desktop\\dir\\dir1\\dir2");
dir.mkdirs();
File file = new File(dir, "filename.txt");
FileWriter archivo = new FileWriter(file);
archivo.write(String.format("%20s %20s", "column 1", "column 2 \r\n"));
archivo.write(String.format("%20s %20s", "data 1", "data 2"));
archivo.flush();
archivo.close();
However, the file output looks like this:
Which I do not like at all.
How can I make a better table format for the output of a text file?
Would appreciate any assistance.
Thanks in advance!
EDIT: Fixed!
Also, instead of looking like
            column 1             column 2
              data 1               data 2
How can I make it look like this:
column 1             column 2
data 1               data 2
Would prefer it that way.
The \r\n is being evaluated as part of the second parameter, so the padding is basically calculated as something like 20 - "column 2".length() - " \r\n".length(). Since the second line doesn't include it, the two lines are padded differently and look misaligned...
Try adding the \r\n as part of the base format instead, for example...
String.format("%20s %20s \r\n", "column 1", "column 2")
This generates something like...
            column 1             column 2
              data 1               data 2
In my tests...
I think you are trying to get data in tabular format. I've developed a Java library that can build much more complex tables with more customization. You can get the source code here. Following are some of the basic table views that my library can create. Hope this is useful enough!
COLUMN WISE GRID(DEFAULT)
+--------------+--------------+--------------+--------------+-------------+
|NAME |GENDER |MARRIED | AGE| SALARY($)|
+--------------+--------------+--------------+--------------+-------------+
|Eddy |Male |No | 23| 1200.27|
|Libby |Male |No | 17| 800.50|
|Rea |Female |No | 30| 10000.00|
|Deandre |Female |No | 19| 18000.50|
|Alice |Male |Yes | 29| 580.40|
|Alyse |Female |No | 26| 7000.89|
|Venessa |Female |No | 22| 100700.50|
+--------------+--------------+--------------+--------------+-------------+
FULL GRID
+------------------------+-------------+------+-------------+-------------+
|NAME |GENDER |MARRIE| AGE| SALARY($)|
+------------------------+-------------+------+-------------+-------------+
|Eddy |Male |No | 23| 1200.27|
+------------------------+-------------+------+-------------+-------------+
|Libby |Male |No | 17| 800.50|
+------------------------+-------------+------+-------------+-------------+
|Rea |Female |No | 30| 10000.00|
+------------------------+-------------+------+-------------+-------------+
|Deandre |Female |No | 19| 18000.50|
+------------------------+-------------+------+-------------+-------------+
|Alice |Male |Yes | 29| 580.40|
+------------------------+-------------+------+-------------+-------------+
|Alyse |Female |No | 26| 7000.89|
+------------------------+-------------+------+-------------+-------------+
|Venessa |Female |No | 22| 100700.50|
+------------------------+-------------+------+-------------+-------------+
NO GRID
NAME GENDER MARRIE AGE SALARY($)
Alice Male Yes 29 580.40
Alyse Female No 26 7000.89
Eddy Male No 23 1200.27
Rea Female No 30 10000.00
Deandre Female No 19 18000.50
Venessa Female No 22 100700.50
Libby Male No 17 800.50
Eddy Male No 23 1200.27
Libby Male No 17 800.50
Rea Female No 30 10000.00
Deandre Female No 19 18000.50
Alice Male Yes 29 580.40
Alyse Female No 26 7000.89
Venessa Female No 22 100700.50
You're currently including " \r\n" within your right-aligned second argument. I suspect you don't want the space at all, and you don't want the \r\n to be part of the count of 20 characters.
To left-align instead of right-aligning, use the - flag, i.e. %-20s instead of %20s. See the Formatter documentation for more information.
Additionally, you can make the code work in a more cross-platform way using %n to represent the current platform's line terminator (unless you specifically want a Windows file).
I'd recommend the use of Files.newBufferedWriter as well, as that allows you to specify the character encoding (and will use UTF-8 otherwise, which is better than using the platform default)... and use a try-with-resources statement to close the writer even in the face of an exception:
try (Writer writer = Files.newBufferedWriter(file.toPath())) {
writer.write(String.format("%-20s %-20s%n", "column 1", "column 2"));
writer.write(String.format("%-20s %-20s%n", "data 1", "data 2"));
}
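Putting those points together, a complete version might look like the sketch below. TableWriter and row are names I made up for the example, and the temporary file is only there so that it runs anywhere.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class TableWriter {

    // Left-aligns both values in 20-character columns; %n is the
    // platform's line terminator and is not counted in the column width.
    static String row(String first, String second) {
        return String.format("%-20s %-20s%n", first, second);
    }

    static void write(Path file) throws IOException {
        // The writer is closed automatically, even if write() throws.
        try (Writer writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            writer.write(row("column 1", "column 2"));
            writer.write(row("data 1", "data 2"));
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("table", ".txt");
        write(file);
        System.out.print(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
    }
}
```

Because the line terminator lives in the format string rather than in an argument, every row gets the same padding regardless of its content.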
try {
PrintWriter outputStream = new PrintWriter("myObjects.txt");
outputStream.println(String.format("%-20s %-20s %-20s", "Name", "Age", "Gender"));
outputStream.println(String.format("%-20s %-20s %-20s", "John", "30", "Male"));
outputStream.close();
} catch (IOException e) {
e.printStackTrace();
}
It also works with printf, in case you want to output variables instead of hard-coding the values:
try {
    PrintWriter resultData = new PrintWriter("Result.txt");
    resultData.println("THE RESULTS OF THE OPERATIONS\n");
    // finalScores and midSemScores are int arrays defined elsewhere
    for (int i = 0; i < 15; i++) {
        resultData.printf("%-20d%-20d%n", finalScores[i], midSemScores[i]);
    }
    resultData.close();
} catch (IOException e) {
    System.out.println("An error occurred");
    e.printStackTrace();
}
I tried to remove a list of words from a given string with the following code.
String Sample = " he saw a cat running of that pat's mat ";
String regex = "'s | he | of | to | a | and | in | that";
Sample = Sample.replaceAll(regex, " ");
The output is
[ saw cat running that pat mat ]
// minus the []
It still has the last word "that". Is there any way to modify the regex so that it also handles the last word?
Try:
String Sample = " he saw a cat running of that pat's mat remove 's";
String resultString = Sample.replaceAll("\\b( ?'s|he|of|to|a|and|in|that)\\b", "");
System.out.print(resultString);
saw cat running pat mat remove
DEMO
http://ideone.com/Yitobz
The problem is that you have consecutive words that you are trying to replace.
For example, consider the substring
[ of that ]
while the replaceAll is running, the [ of ] matches
[ of that ]
^ ^
and that will be replaced with a (space). The next character to match is t, not a space expected by
... | that | ...
What I think you can do to fix this is add word boundaries instead of spaces.
String regex = "'s\\b|\\bhe\\b|\\bof\\b|\\bto\\b|\\ba\\b|\\band\\b|\\bin\\b|\\bthat\\b";
or the shorter version as shown in Tuga's answer.
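To see the difference, here is a small self-contained comparison of the space-delimited pattern and a word-boundary pattern. StopWordRemover and its method names are hypothetical, and the final whitespace clean-up is an extra step I added that is not part of the question.

```java
final class StopWordRemover {

    // Pattern from the question: each word must be surrounded by spaces,
    // so two adjacent words compete for the space between them.
    static final String SPACED = "'s | he | of | to | a | and | in | that";

    // Word boundaries are zero-width, so adjacent words no longer compete.
    static final String BOUNDED = "'s\\b|\\b(?:he|of|to|a|and|in|that)\\b";

    static String removeSpaced(String s) {
        return s.replaceAll(SPACED, " ");
    }

    static String removeBounded(String s) {
        // Remove the words, then collapse the leftover runs of whitespace.
        return s.replaceAll(BOUNDED, "").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String sample = " he saw a cat running of that pat's mat ";
        System.out.println("[" + removeSpaced(sample) + "]");   // still contains "that"
        System.out.println("[" + removeBounded(sample) + "]");  // "that" is gone
    }
}
```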
It doesn't work because you replace the " of " part first, and after that there is no space left before the word "that".
You can change the regex in one of two ways:
String regex = "'s | he | of| to | a | and | in | that";
or
String regex = "'s | he | of | to | a | and | in |that ";
Or you can simply call Sample = Sample.replaceAll(regex, " "); again.