How to get initial of last name in java using regex - java

Input : "Chris Gayle"
Required output : "Chris G"
I am currently using :
String inputStr = "Chris Gayle";
String[] strArr = inputStr.split(" ");
String output = strArr[0] + " " + strArr[1].charAt(0);
However, I was hoping for a shorter implementation that uses the 'replaceAll' function of the String class with a regular expression.

I wouldn't actually use regex for this.
Example
String input = "Chris Gayle";
System.out.println(input.substring(0, input.lastIndexOf(" ") + 2));
Output
Chris G
The advantage here is that you can have names with multiple items, e.g. "Chris Foo Gayle" --> "Chris Foo G".
Note
This assumes each item is separated by a space, or at least that the last name is. It would return unexpected results with something like "Chris J.Gayle".
It gets worse if your input does not contain any space (i.e. a single name): lastIndexOf returns -1, so the expression silently returns only the first character.
If that is a possible case, you should check that input.lastIndexOf(" ") != -1 prior to invoking substring.
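The guard described above can be folded into a small helper. This is only a minimal sketch: the names LastNameInitial and shorten are made up for illustration, and returning the input unchanged when there is nothing to shorten is an assumption, not something the question specifies.

```java
class LastNameInitial {
    // Keeps everything up to and including the first character after the last space.
    static String shorten(String name) {
        int lastSpace = name.lastIndexOf(' ');
        // No space at all, or the space is the final character: nothing to shorten
        // (assumed behavior, the question does not say what should happen here).
        if (lastSpace == -1 || lastSpace == name.length() - 1) {
            return name;
        }
        return name.substring(0, lastSpace + 2);
    }

    public static void main(String[] args) {
        System.out.println(shorten("Chris Gayle"));     // Chris G
        System.out.println(shorten("Chris Foo Gayle")); // Chris Foo G
        System.out.println(shorten("Chris"));           // Chris
    }
}
```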

Through replaceAll function.
string.replaceAll("(?<=\\s.).*", "");
The above regex would match all the characters which are preceded by a space and a single character.
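One thing worth checking with this pattern is what happens when the name has more than two parts: the lookbehind is satisfied right after the first space, so everything after the first initial is removed. The sketch below compares it with a variant that only trims the last word; that second pattern and the names RegexInitial, cutAfterFirst and cutLastWord are assumptions added for illustration, not part of the original answer.

```java
class RegexInitial {
    // Pattern from the answer: removes everything after the first "whitespace + one character".
    static String cutAfterFirst(String name) {
        return name.replaceAll("(?<=\\s.).*", "");
    }

    // Assumed variant: removes only the tail of the last whitespace-separated word.
    static String cutLastWord(String name) {
        return name.replaceAll("(?<=\\s\\S)\\S*$", "");
    }

    public static void main(String[] args) {
        System.out.println(cutAfterFirst("Chris Gayle"));     // Chris G
        System.out.println(cutAfterFirst("Chris Foo Gayle")); // Chris F
        System.out.println(cutLastWord("Chris Foo Gayle"));   // Chris Foo G
    }
}
```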

One line answer:
System.out.println("Chris Gayle".replaceAll("([a-z]*)$", ""));
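Note: this only removes the trailing run of lowercase letters, so the last name must start with a capital letter and contain only lowercase letters after it.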

Related

Java - Regular Expressions Split on character after and before certain words

I'm having trouble figuring out how to grab a certain part of a string using regular expressions in JAVA. Here's my input string:
application.APPLICATION NAME.123456789.status
I need to grab the portion of the string called "APPLICATION NAME". I can't simply split on the period character because APPLICATION NAME may itself include a period. The first word, "application", will always remain the same and the characters after "APPLICATION NAME" will always be numbers.
I've been able to split on period and grab the 1st index but, as I mentioned, APPLICATION NAME may itself include periods, so this is no good. I've also been able to grab the first and second-to-last index of a period, but that seems inefficient and I would like to future-proof by using a regex.
I've googled around for hours and haven't been able to find much guidance. Thanks!
You can use ^application\.(.*)\.\d with find(), or application\.(.*)\.\d.* with matches().
Sample code using find():
private static void test(String input) {
String regex = "^application\\.(.*)\\.\\d";
Matcher m = Pattern.compile(regex).matcher(input);
if (m.find())
System.out.println(input + ": Found \"" + m.group(1) + "\"");
else
System.out.println(input + ": **NOT FOUND**");
}
public static void main(String[] args) {
test("application.APPLICATION NAME.123456789.status");
test("application.Other.App.Name.123456789.status");
test("application.App 55 name.123456789.status");
test("application.App.55.name.123456789.status");
test("bad input");
}
Output
application.APPLICATION NAME.123456789.status: Found "APPLICATION NAME"
application.Other.App.Name.123456789.status: Found "Other.App.Name"
application.App 55 name.123456789.status: Found "App 55 name"
application.App.55.name.123456789.status: Found "App.55.name"
bad input: **NOT FOUND**
The above will work as long as "status" doesn't start with a digit.
With split(), you could store key.split("\\.") in a String[] s and then, as a second step, join the elements from s[1] to s[s.length-3].
With regexes you can do:
String appName = key.replaceAll("application\\.(.*)\\.\\d+\\.\\w+", "$1");
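The split-and-join variant mentioned above could look roughly like the following. It is a sketch under the assumption that the input always has at least four dot-separated parts; the class name AppNameBySplit, the method name and the exception for short input are illustrative choices. Unlike the regex, it does not verify that the second-to-last part is numeric.

```java
import java.util.Arrays;

class AppNameBySplit {
    static String appName(String key) {
        String[] s = key.split("\\.");
        if (s.length < 4) {
            // Assumed handling for malformed input.
            throw new IllegalArgumentException("Unexpected format: " + key);
        }
        // copyOfRange's upper bound is exclusive, so this takes s[1] .. s[s.length - 3].
        return String.join(".", Arrays.copyOfRange(s, 1, s.length - 2));
    }

    public static void main(String[] args) {
        System.out.println(appName("application.APPLICATION NAME.123456789.status")); // APPLICATION NAME
        System.out.println(appName("application.Other.App.Name.123456789.status"));   // Other.App.Name
    }
}
```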
Why split? Just:
String appName = input.replaceAll(".*?\\.(.*)\\.\\d+\\..*", "$1");
This also correctly handles a dot then digits within the application name, but only works correctly if you know the input is in the expected format.
To handle "bad" input by returning blank if the pattern is not matched, be more strict and add an alternation whose second branch always matches (and so replaces) the entire input:
String appName = input.replaceAll("^application\\.(.*)\\.\\d+\\.\\w+$|.*", "$1");
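A quick way to convince yourself of the behavior is to run the pattern against a good and a bad input. The wrapper below is only a sketch (AppNameOrBlank and appName are made-up names). It relies on Java replacing a reference to a group that did not participate in the match with an empty string.

```java
class AppNameOrBlank {
    static String appName(String input) {
        // First branch captures the name; second branch swallows anything else.
        return input.replaceAll("^application\\.(.*)\\.\\d+\\.\\w+$|.*", "$1");
    }

    public static void main(String[] args) {
        System.out.println("[" + appName("application.App.55.name.123456789.status") + "]"); // [App.55.name]
        System.out.println("[" + appName("bad input") + "]");                                // []
    }
}
```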

regex - How to match elements while ignoring others between quotation marks?

I can't seem to find the regex that suits my needs.
I have a .txt file of this form:
Abc "test" aBC : "Abc aBC"
Brooking "ABC" sadxzc : "I am sad"
asd : "lorem"
a22 : "tactius"
testsa2 : "bruchia"
test : "Abc aBC"
b2 : "Ast2"
From this .txt file I wish to extract everything matching this regex "([a-zA-Z]\w+)", except the ones between the quotation marks.
I want to rename every word (except the words in quotation marks), so I should have for example the following output:
A "test " B : "Abc aBC"
Z "ABC" X : "I am sad"
Test : "lorem"
F : "tactius"
H : "bruchia"
Game : "Abc aBC"
S: "Ast2"
Is this even achievable using a regex? Are there alternatives without using regex?
If quotes are balanced and there is no escaping in the input like \" then you can use this regex to match words outside double quotes:
(?=(?:(?:[^"]*"){2})*[^"]*$)(\b[a-zA-Z]\w+\b)
RegEx Demo
In java it will be:
Pattern p = Pattern.compile("(?=(?:(?:[^\"]*\"){2})*[^\"]*$)(\\b[a-zA-Z]\\w+\\b)");
This regex matches words that are outside double quotes by using a lookahead to make sure there is an even number of quotes after each matched word.
A simple approach might be to split the string by ", then do the replace using your regex on every odd part (on parts 1, 3, ..., if you start the numbering from 1), and join everything back.
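The split, replace and join idea can be sketched in Java as follows. The class name RenameBySplit and the rename callback are hypothetical, the upper-casing in main is just a stand-in for whatever renaming is needed, and Matcher.replaceAll with a function argument assumes Java 9 or later. Parts at even 0-based indices are the ones outside quotes, provided the quotes are balanced.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RenameBySplit {
    // Word pattern taken from the question.
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]\\w+");

    static String renameOutsideQuotes(String line, UnaryOperator<String> rename) {
        // Limit -1 keeps trailing empty parts, so the join restores every quote.
        String[] parts = line.split("\"", -1);
        for (int i = 0; i < parts.length; i += 2) {
            parts[i] = WORD.matcher(parts[i])
                    .replaceAll(m -> Matcher.quoteReplacement(rename.apply(m.group())));
        }
        return String.join("\"", parts);
    }

    public static void main(String[] args) {
        String line = "Abc \"test\" aBC : \"Abc aBC\"";
        // Prints: ABC "test" ABC : "Abc aBC"
        System.out.println(renameOutsideQuotes(line, String::toUpperCase));
    }
}
```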
UPD
However, it is also simple to implement manually. Just go along the line and track whether you are inside quotes or not.
insideQuotes = false
result = ""
currentPart = ""
input = input + '"' // so that we do not need to process the last part separately
for ch in input
    if ch == '"'
        if not insideQuotes
            currentPart = replace(currentPart)
        result = result + currentPart + '"'
        currentPart = ""
        insideQuotes = not insideQuotes
    else
        currentPart = currentPart + ch
drop the last symbol of result (it is that quote mark that we have added)
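A Java rendering of this scan might look like the sketch below. The names RenameByScan, renameOutsideQuotes and flush are illustrative, and replace stands for whatever transformation should be applied to unquoted text. Instead of appending a sentinel quote and dropping it afterwards, this version flushes the last part after the loop, which is a small deviation from the pseudocode.

```java
import java.util.function.UnaryOperator;

class RenameByScan {
    static String renameOutsideQuotes(String input, UnaryOperator<String> replace) {
        StringBuilder result = new StringBuilder();
        StringBuilder current = new StringBuilder();
        boolean insideQuotes = false;
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (ch == '"') {
                // A quote ends the current part: emit it, then toggle the state.
                result.append(flush(current, insideQuotes, replace)).append('"');
                insideQuotes = !insideQuotes;
            } else {
                current.append(ch);
            }
        }
        // Emit whatever follows the last quote.
        return result.append(flush(current, insideQuotes, replace)).toString();
    }

    // Returns the buffered part (transformed only when outside quotes) and clears the buffer.
    private static String flush(StringBuilder part, boolean insideQuotes, UnaryOperator<String> replace) {
        String text = part.toString();
        part.setLength(0);
        return insideQuotes ? text : replace.apply(text);
    }

    public static void main(String[] args) {
        // Prints: B2 : "Ast2"
        System.out.println(renameOutsideQuotes("b2 : \"Ast2\"", String::toUpperCase));
    }
}
```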
However, also consider whether you will need some more advanced syntax, for example quote escaping like
word "inside quote \" still inside" outside again
If so, you will need a more advanced parser, or you might consider using some special format.
You can’t formulate a “within quotes” condition the way you might think. But you can easily search for unquoted words or quoted strings and take action only for the unquoted words:
Pattern p = Pattern.compile("\"[^\"]*\"|([a-zA-Z]\\w+)");
for(String s: lines) {
Matcher m=p.matcher(s);
while(m.find()) {
if(m.group(1)!=null) {
System.out.println("take action with "+m.group(1));
}
}
}
This utilizes the fact that each search for the next match starts at the end of the previous. So if you find a quoted string ("[^"]*") you don’t take any action and continue searching for other matches. Only if there is no match for a quoted string, the pattern looks for a word (([a-zA-Z]\w+)) and if one is found, the group 1 captures the word (will be non null).

Splitting a string with a certain pattern in Java

I am writing a parser for a file containing the following string pattern:
Key : value
Key : value
Key : value
etc...
I am able to retrieve those lines one by one into a list. What I would like to do is to separate the key from the value for each one of those strings. I know there is the split() method that can take a Regex and do this for me, but I am very unfamiliar with them so I don't know what Regex to give as a parameter to the split() function.
Also, while not in the specifications of the file I am parsing, I would like for that Regex to be able to recognize the following patterns as well (if possible):
Key: value
Key :value
Key:value
etc...
So basically, whether there's a space or not after/before/after AND before the : character, I would like for that Regex to be able to detect it. What is the Regex that can achieve this?
In other words split method should look for : and zero or more whitespaces before or after it.
Key: value
^^
Key :value
^^
Key:value
^
Key : value
^^^
In that case split("\\s*:\\s*") should do the trick.
Explanation:
\\s represents any whitespace character
* means zero or more occurrences of the element described before it
so \\s* means zero or more whitespace characters.
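To see the pattern handle all four spacing variants, a small check like the following can help. It is a sketch only: KeyValueSplit and splitPair are made-up names, and passing a limit of 2 is an addition here (so that a colon inside the value is left alone), not something the answer above requires.

```java
import java.util.Arrays;

class KeyValueSplit {
    static String[] splitPair(String line) {
        // Limit 2 (an assumption): split only at the first colon.
        return line.split("\\s*:\\s*", 2);
    }

    public static void main(String[] args) {
        for (String line : new String[] {"Key : value", "Key: value", "Key :value", "Key:value"}) {
            System.out.println(Arrays.toString(splitPair(line))); // [Key, value] each time
        }
    }
}
```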
On the other hand you may want also to find entire key:value pair and place parts matching key and value in separate groups (you can even name groups as you like using (?<groupName>regex)). In that case you may use
Pattern p = Pattern.compile("(?<key>\\w+)\\s*:\\s*(?<value>\\w+)");
Matcher m = p.matcher(yourData);
while(m.find()){
System.out.println("key = " + m.group("key"));
System.out.println("value = " + m.group("value"));
System.out.println("--------");
}
If you want to use String.split(), you could use this:
String input = "key : value";
String[] s = input.split("\\s*:\\s*");
String key = s[0];
String value = s[1];
This will split the String at the ":" and treat any whitespace directly before or after the ":" as part of the delimiter, so that you receive the key and value without that surrounding whitespace.
Explanation:
\\s* will match zero or more whitespace characters; by default \\s is equal to [ \\t\\n\\x0B\\f\\r]
The : in between the two \\s* means that your : needs to be there
Note that this solution will cause an ArrayIndexOutOfBoundsException if your input line does not contain the key-value-format as you defined it.
If you are not sure whether the line really contains a key-value pair, for example because your file ends with an empty line as files normally do, you could do it like this:
String input = "key : value";
Matcher m = Pattern.compile("(\\S+)\\s*:\\s*(.+)").matcher(input);
if (m.matches())
{
String key = m.group(1); // note that the count starts by 1 here
String value = m.group(2);
}
Explanation:
\\S+ matches one or more non-whitespace characters, so the key itself cannot contain whitespace; any whitespace after it is consumed by the following \\s*. Note that the () around it form a capturing group, so that you can get its value via m.group().
\\s* will match zero or more whitespace characters; by default \\s is equal to [ \\t\\n\\x0B\\f\\r]
The : in between the two \\s* means that your : needs to be there
The last group, .+, will match any non-empty string, including whitespace and so on.
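Applied to a whole list of lines, the matches() approach could be wrapped as in the sketch below. KeyValueParser and parse are hypothetical names; trimming each line first and silently skipping lines that do not match (such as a trailing empty line) are assumptions about the desired behavior.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyValueParser {
    // Same pattern as above: key without whitespace, colon, non-empty value.
    private static final Pattern PAIR = Pattern.compile("(\\S+)\\s*:\\s*(.+)");

    static Map<String, String> parse(List<String> lines) {
        Map<String, String> result = new LinkedHashMap<>(); // keeps file order
        for (String line : lines) {
            Matcher m = PAIR.matcher(line.trim());
            if (m.matches()) {
                result.put(m.group(1), m.group(2));
            }
            // Lines that do not match (e.g. empty lines) are skipped.
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse(java.util.Arrays.asList("Key : value", "", "A:b c")));
    }
}
```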
You can use the split method and pass ":" as the delimiter.
This splits the string when it sees ':', then you can trim the values to get the key and value.
String s = " keys : value ";
String keyValuePairs[] = s.split(":");
String key = keyValuePairs[0].trim();
String value = keyValuePairs[1].trim();
You can also make use of regex to simplify it.
String keyValuePairs[] = s.trim().split("[ ]*:[ ]*");
s.trim() will remove the spaces before and after the string (if you have any in your case), so the string will become "keys : value" and
[ ]*:[ ]*
splits the string with a regular expression that uses spaces (zero or more), a colon, then spaces (zero or more) as the delimiter.
For a pure regex solution, you can use the following pattern (note the space at the beginning; it is shown in quotes here so the space stays visible):
" ?: ?"
See http://regexr.com/39evh
String[] tokensVal = str.split(":");
String key = tokensVal[0].trim();
String value = tokensVal[1].trim();

I want to perform a split() on a string using a regex in Java, but would like to keep the delimited tokens in the array [duplicate]

This question already exists:
Is there a way to split strings with String.split() and include the delimiters? [duplicate]
Closed 8 years ago.
How can I format my regex to allow this?
Here's the regular expression:
"\\b[(\\w'\\-)&&[^0-9]]{4,}\\b"
It's looking for any word that is at least 4 characters long.
If I want to split, say, an article, I want an array that includes all the delimited values, plus all the values between them, all in the order that they originally appeared in. So, for example, if I want to split the following sentence: "I need to purchase a new vehicle. I would prefer a BMW.", my desired result from the split would be the following, where the italicized values are the delimiters.
"I ", "need", " to ", "purchase", " a new ", "vehicle", ". I ", "would", " ", "prefer", "a BMW."
So, all words with 4 or more characters are one token each, while everything in between each delimited value is also a single token (even if it is multiple words with whitespace). I will only be modifying the delimited values and would like to keep everything else the same, including whitespace, new lines, etc.
I read in a different thread that I could use a lookaround to get this to work, but I can't seem to format it correctly. Is it even possible to get this to work the way I'd like?
I am not sure what you are trying to do, but in case you want to modify words that have at least four letters, you can use something like this (it changes words with >= 4 letters to their upper-case version):
String data = "I need to purchase a new vehicle. I would prefer a BMW.";
Pattern pattern = Pattern.compile("(?<![a-z\\-_'])[a-z\\-_']{4,}(?![a-z\\-_'])",
        Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(data);
StringBuffer sb = new StringBuffer(); // holds the new version of our data
while (matcher.find()) { // find every word
    // and replace it with its upper-case version
    matcher.appendReplacement(sb, matcher.group().toUpperCase());
}
matcher.appendTail(sb); // don't forget the part after the last match
System.out.println(sb);
Output:
I NEED to PURCHASE a new VEHICLE. I WOULD PREFER a BMW.
OR if you change replacing code to something like
matcher.appendReplacement(sb, "["+matcher.group()+"]");
you will get
I [need] to [purchase] a new [vehicle]. I [would] [prefer] a BMW.
Now you can just split such string on every [ and ] to get your desired array.
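If the text itself may contain [ or ], the marker-and-split step becomes fragile. An alternative sketch, using the same word pattern, is to collect the tokens directly from the match positions. The names SplitKeepingWords and tokenize are illustrative, and this is one possible way to get the array rather than what the answer above does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitKeepingWords {
    // Same word pattern as in the answer above.
    private static final Pattern WORD = Pattern.compile(
            "(?<![a-z\\-_'])[a-z\\-_']{4,}(?![a-z\\-_'])", Pattern.CASE_INSENSITIVE);

    static List<String> tokenize(String data) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(data);
        int last = 0; // end of the previous match
        while (m.find()) {
            if (m.start() > last) {
                tokens.add(data.substring(last, m.start())); // text between words
            }
            tokens.add(m.group()); // the word itself
            last = m.end();
        }
        if (last < data.length()) {
            tokens.add(data.substring(last)); // text after the last word
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I need to purchase a new vehicle. I would prefer a BMW."));
    }
}
```

Joining the returned tokens in order gives back the original string, so whitespace and punctuation are preserved.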
Assuming that a "word" is defined as a run of [A-Za-z] characters, you can use this regex:
(?<=(\\b[A-Za-z]{4,50}\\b))|(?=(\\b[A-Za-z]{4,50}\\b))
Full code:
class RegexSplit{
public static void main(String[] args){
String str = "I need to purchase a new vehicle. I would prefer a BMW.";
String[] tokens = str.split("(?<=(\\b[A-Za-z]{4,50}\\b))|(?=(\\b[A-Za-z]{4,50}\\b))");
for(String token: tokens){
System.out.print("["+token+"]");
}
System.out.println();
}
}
to get this output:
[I ][need][ to ][purchase][ a new ][vehicle][. I ][would][ ][prefer][ a BMW.]

Error when splitting a string in java

I am trying to split a string according to a certain set of delimiters.
My delimiters are: ,"():;.!? single spaces or multiple spaces.
This is the code I'm currently using,
String[] arrayOfWords= inputString.split("[\\s{2,}\\,\"\\(\\)\\:\\;\\.\\!\\?-]+");
which works fine for most cases, but I have a problem when the first word is surrounded by quotation marks. For example
String inputString = "\"Word\" some more text.";
is giving me this output
arrayOfWords[0] = ""
arrayOfWords[1] = "Word"
arrayOfWords[2] = "some"
arrayOfWords[3] = "more"
arrayOfWords[4] = "text"
I want the output to give me an array with
arrayOfWords[0] = "Word"
arrayOfWords[1] = "some"
arrayOfWords[2] = "more"
arrayOfWords[3] = "text"
This code has been working fine when quotation marks are used in the middle of the sentence, I'm not sure what the trouble is when it's at the beginning.
EDIT: I just realized I have same problem when any of the delimiters are used as the first character of the string
Unfortunately you won't be able to remove this empty first element using only split. You should probably first remove any leading characters that match your delimiters and split afterwards. Also, your regex seems to be incorrect because
by adding {2,} inside [...] you are making the characters {, 2, the comma and } delimiters,
you don't need to escape the rest of your delimiters (note that - does not need escaping here only because it is at the end of the character class [], so it cannot be read as a range operator).
Try it this way:
String regexDelimiters = "[\\s,\"():;.!?\\-]+";
String inputString = "\"Word\" some more text.";
String[] arrayOfWords = inputString.replaceAll(
"^" + regexDelimiters,"").split(regexDelimiters);
for (String s : arrayOfWords)
System.out.println("'" + s + "'");
output:
'Word'
'some'
'more'
'text'
A delimiter is interpreted as separating the strings on either side of it, thus the empty string on its left is added to the result as well as the string to its right ("Word"). To prevent this, you should first strip any leading delimiters, as described here:
How to prevent java.lang.String.split() from creating a leading empty string?
So in short form you would have:
String delim = "[\\s,\"():;.!?\\-]+";
String[] arrayOfWords = inputString.replaceFirst("^" + delim, "").split(delim);
Edit: Looking at Pshemo's answer, I realize he is correct regarding your regex. Inside the brackets it's unnecessary to specify the number of space characters, as they will be caught by the + operator.
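Wrapped into a helper, the strip-then-split approach could look like the sketch below. WordSplitter and splitWords are made-up names. The extra check for an empty remainder is an addition: in Java, splitting an empty string returns an array containing one empty string, which would otherwise show up when the input consists only of delimiters.

```java
import java.util.Arrays;

class WordSplitter {
    private static final String DELIM = "[\\s,\"():;.!?\\-]+";

    static String[] splitWords(String input) {
        // Remove a leading run of delimiters so split() has no empty first element.
        String stripped = input.replaceFirst("^" + DELIM, "");
        if (stripped.isEmpty()) {
            return new String[0]; // input was empty or only delimiters (assumed handling)
        }
        return stripped.split(DELIM);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitWords("\"Word\" some more text."))); // [Word, some, more, text]
    }
}
```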
