How to Split text by Numbers and Group of words

Assuming I have a string containing
- some comma separated string
- and text
my_string = "2 Marine Cargo 14,642 10,528 16,016 more text 8,609 argA 2,106 argB"
I would like to extract them into an array that is split by "Numbers" and "group of words"
resultArray = {"2", "Marine Cargo", "14,642", "10,528", "16,016",
"more text", "8,609", "argA", "2,106", "argB"};
Note 0: there might be multiple spaces between entries; these should be ignored.
Note 1: "Marine Cargo" and "more text" are not split into separate strings, because each is a group of words with no number between them,
while argA and argB are separated because there is a number between them.

You can try splitting the text using this regex:
([\d,]+|[a-zA-Z]+ *[a-zA-Z]*) // note the space between + and *
[0-9,]+ // matches one or more digits and commas
[a-zA-Z]+ *[a-zA-Z]* // matches a word, followed by optional spaces, followed by an optional second word
String regEx = "[0-9,]+|[a-zA-Z]+ *[a-zA-Z]*";
You can use it like this:
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void main(String[] args) {
    String input = "2 Marine Cargo 14,642 10,528 16,016 more text 8,609 argA 2,106 argB";
    System.out.println("Return Value :");
    Pattern pattern = Pattern.compile("[0-9,]+|[a-zA-Z]+ *[a-zA-Z]*");
    ArrayList<String> result = new ArrayList<String>();
    Matcher m = pattern.matcher(input);
    while (m.find()) {
        // print each match between > and < so that stray whitespace is visible
        System.out.println(">" + m.group(0) + "<");
        result.add(m.group(0));
    }
}
The following is a detailed explanation of the regex, autogenerated by https://regex101.com
1st Alternative [0-9,]+
Match a single character present in the list below [0-9,]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
0-9 a single character in the range between 0 (index 48) and 9 (index 57) (case sensitive)
, matches the character , literally (case sensitive)
2nd Alternative [a-zA-Z]+ *[a-zA-Z]*
Match a single character present in the list below [a-zA-Z]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
a-z a single character in the range between a (index 97) and z (index 122) (case sensitive)
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)
(space) matches the space character literally (case sensitive)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
Match a single character present in the list below [a-zA-Z]*
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
a-z a single character in the range between a (index 97) and z (index 122) (case sensitive)
A-Z a single character in the range between A (index 65) and Z (index 90) (case sensitive)

If spaces are your problem: String#split takes a regex as its parameter, so you could do this:
my_list = Arrays.asList(my_string.split("\\s+"));
But this won't solve all the problems, such as those mentioned in the comments.

You could do something like so:
List<String> strings = new ArrayList<>();
String prev = null; // word group collected so far, if any
for (String w : my_string.split("\\s+")) {
    if (w.matches("\\d+(?:,\\d+)?")) {
        // a number ends the current word group
        if (prev != null) {
            strings.add(prev);
            prev = null;
        }
        strings.add(w);
    } else if (prev == null) {
        prev = w;
    } else {
        prev += " " + w;
    }
}
if (prev != null) {
    strings.add(prev);
}

I like Angel Koh's solution and want to add to it. His pattern only matches a word group that consists of one or two words.
If you also want to capture word groups of three or more words, you have to alter the regex a bit: ([\d,]+|[a-zA-Z]+(?: *[a-zA-Z])*)
The non-capturing group (?: *[a-zA-Z]) repeats as many times as needed, so a word group of any length is matched as a single token. The purely numeric parts are still matched by the first alternative.
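To see the difference between the two patterns side by side, here is a minimal sketch. The class name GroupSplitDemo, the method tokenize and the three-word sample input are made up for illustration; only the two regexes come from the answers above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupSplitDemo {
    // Pattern from the first answer: a word group of at most two words.
    static final Pattern TWO_WORDS = Pattern.compile("[0-9,]+|[a-zA-Z]+ *[a-zA-Z]*");
    // Adjusted pattern: a word group of any number of words.
    static final Pattern MANY_WORDS = Pattern.compile("[\\d,]+|[a-zA-Z]+(?: *[a-zA-Z])*");

    // Collects every match of the pattern, in order of appearance.
    static List<String> tokenize(Pattern pattern, String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "2 Marine Cargo Ship 14,642"; // hypothetical input with a three-word group
        System.out.println(tokenize(TWO_WORDS, input));
        System.out.println(tokenize(MANY_WORDS, input));
    }
}
```

With the first pattern, "Ship" ends up as its own token, and it keeps the trailing space because " *" still consumes it when no second word follows. The adjusted pattern only consumes spaces that are followed by another letter, so its tokens need no trimming.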

Related

Regular expression for a phrase that contains letters and numbers, but is not purely numeric, with a fixed length range

I want a regular expression that checks that the input consists only of the characters a-z and 0-9, but I do not want to allow input that is purely numeric (it must contain at least one alphabetic character).
For example:
413123123123131
is not allowed, but if there is even one alphabetic character anywhere in the phrase, it is OK.
I tried to define the correct regex for this and ended up with
[0-9]*[a-z].*
but now I am confused about how to define the {x,y} length of the phrase. I want the length to be {9,31}, but I cannot put a length block after the last *. I also tried defining a group, but that did not work.
Tested at https://www.debuggex.com/
How can I add it?

What you seek is
String regex = "(?=.{9,31}$)\\p{Alnum}*\\p{Alpha}\\p{Alnum}*";
Use it with String#matches() / Pattern#matches() method to require a full string match:
if (s.matches(regex)) {
return true;
}
Details
^ - implicit in matches() - matches the start of string
(?=.{9,31}$) - a positive lookahead that requires 9 to 31 any chars other than line break chars from the start to end of the string
\\p{Alnum}* - 0 or more alphanumeric chars
\\p{Alpha} - an ASCII letter
\\p{Alnum}* - 0 or more alphanumeric chars
Java demo:
String lines[] = {"413123123123131", "4131231231231a"};
Pattern p = Pattern.compile("(?=.{9,31}$)\\p{Alnum}*\\p{Alpha}\\p{Alnum}*");
for(String line : lines)
{
Matcher m = p.matcher(line);
if(m.matches()) {
System.out.println(line + ": MATCH");
} else {
System.out.println(line + ": NO MATCH");
}
}
Output:
413123123123131: NO MATCH
4131231231231a: MATCH

This might be what you are looking for.
[0-9a-zA-Z]*[a-zA-Z][0-9a-zA-Z]*
To help explain it, think of the middle term as your one required character and the outer terms as any number of alpha numeric characters.
Edit: to restrict the length of the string as a whole, you may have to check that manually after matching, i.e.
if (str.length() >= 9 && str.length() <= 31)
Wiktor provides a solution that does this inside the regex; please look at his answer for a better pattern.

Try this Regex:
^(?:(?=[a-z])[a-z0-9]{9,31}|(?=\d.*[a-z])[a-z0-9]{9,31})$
OR a bit shorter form:
^(?:(?=[a-z])|(?=\d.*[a-z]))[a-z0-9]{9,31}$
Explanation(for the 1st regex):
^ - position before the start of the string
(?=[a-z])[a-z0-9]{9,31} means: if the string starts with a letter, then match letters and digits, minimum 9 and maximum 31
| - OR
(?=\d.*[a-z])[a-z0-9]{9,31} means: if the string starts with a digit followed by a letter somewhere in the string, then match letters and digits, minimum 9 and maximum 31. This also ensures that if the string starts with a digit and there is no letter anywhere in the string, there won't be any match
$ - position after the last character of the string
OUTPUT:
413123123123131 NO MATCH(no alphabets)
kjkhsjkf989089054835werewrew65 MATCH
kdfgfd4374985794379857984379857weorjijuiower NO MATCH(length more than 31)
9087erkjfg9080980984590p465467 MATCH
4131231231231a MATCH
kjdfg34 NO MATCH(Length less than 9)
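The sample results above can be reproduced with a small check like the one below. This is only a sketch: the class name AlnumLengthCheck and the method isValid are invented for illustration, and the pattern is the shorter form from this answer, so it accepts lowercase letters only.

```java
import java.util.regex.Pattern;

class AlnumLengthCheck {
    // Shorter pattern from the answer: 9 to 31 lowercase letters or digits,
    // with at least one letter somewhere in the string.
    static final Pattern PATTERN =
            Pattern.compile("^(?:(?=[a-z])|(?=\\d.*[a-z]))[a-z0-9]{9,31}$");

    static boolean isValid(String s) {
        return PATTERN.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "413123123123131",
            "kjkhsjkf989089054835werewrew65",
            "kdfgfd4374985794379857984379857weorjijuiower",
            "9087erkjfg9080980984590p465467",
            "4131231231231a",
            "kjdfg34"
        };
        for (String s : samples) {
            System.out.println(s + " " + (isValid(s) ? "MATCH" : "NO MATCH"));
        }
    }
}
```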

Here's the regex:
[a-zA-Z\d]*[a-zA-Z][a-zA-Z\d]*
The trick here is to have something that is not optional. The leading and trailing [a-zA-Z\d] has a * quantifier, so they are optional. But the [a-zA-Z] in the middle there is not optional. The string must have a character that matches [a-zA-Z] in order to be matched.
However, you need to check the length of the string with length() afterwards rather than with the regex; I can't think of a way to do this in regex.
Actually, I think you can do this regexless pretty easily:
private static boolean matches(String input) {
for (int i = 0 ; i < input.length() ; i++) {
if (Character.isLetter(input.charAt(i))) {
return input.length() >= 9 && input.length() <= 31;
}
}
return false;
}

Lexing a String in Java with Regex

For some reason the while loop is only going through one time, picking up a NUMBER and then exiting. Does anyone have any idea why it isn't lexing the rest of the String? All I had was an input of 1 + 2. Any help is much appreciated!!
public Lexer(String input) throws TokenMismatchException {
tokens = new ArrayList<Token>();
// Lexing logic begins here
StringBuffer tokenPatternsBuffer = new StringBuffer();
for (Type type : Type.values())
tokenPatternsBuffer.append(String.format("|(?<%s>%s)", type.name(), type.pattern));
Pattern tokenPatterns = Pattern.compile(new String(tokenPatternsBuffer.substring(1)));
// Begin matching tokens
Matcher matcher = tokenPatterns.matcher(input.replaceAll(" ", ""));
while (matcher.find()) {
if (matcher.group(Type.NUMBER.name()) != null) {
tokens.add(new Token(Type.NUMBER, matcher.group(Type.NUMBER.name())));
continue;
} else if (matcher.group(Type.OPERATOR.name()) != null) {
tokens.add(new Token(Type.OPERATOR, matcher.group(Type.OPERATOR.name())));
continue;
} else if (matcher.group(Type.UNIT.name()) != null) {
tokens.add(new Token(Type.UNIT, matcher.group(Type.UNIT.name())));
continue;
} else if (matcher.group(Type.PARENTHESES.name()) != null) {
tokens.add(new Token(Type.PARENTHESES, matcher.group(Type.PARENTHESES.name())));
continue;
} else {
throw new TokenMismatchException();
}
}
}
enum Type {
NUMBER("[0-9]+.*[0-9]*"), OPERATOR("[*|/|+|-]"), UNIT("[in|pt]"), PARENTHESES("[(|)]");
public final String pattern;
private Type(String pattern) {
this.pattern = pattern;
}
}

This pattern:
"[0-9]+.*[0-9]*"
matches one or more digits, followed by zero or more of any character, followed by zero or more digits. The dot is a special character in regexes that means "any character". If you're trying to match a decimal point, you need to put a backslash before the dot:
"[0-9]+\\.*[0-9]*"
(The backslash is doubled because it's in a Java string literal.) It appears to work on "1 + 2" if that one fix is made. However, some of your other patterns show some misunderstanding of what [] does in a regex. This is a "character class" that matches any of the characters you list in between the brackets, except that - can be used for a range of characters (like 0-9). So
"[*|/|+|-]"
matches any of the characters *, |, /, +, - (the | does not mean "or" inside square brackets). - isn't treated as a range operator here since it's last, but it's probably best to get in the habit of using \ in front of it anyway, so you want
"[*/+\\-]"
Similarly,
"[in|pt]"
matches one of the five characters i, n, |, p, t--certainly not what you want. You probably want
"(in|pt)"
which matches either "in" or "pt"; the parentheses may not be necessary in your case, but in a different case, they may be necessary to prevent some other characters from being included in one of the alternatives when the pattern is included in a larger string.
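Putting those fixes together, a minimal sketch of the lexer could look like the following. The class name LexerSketch, the lex method and the string-based token output are placeholders rather than the asker's Token class, and the NUMBER pattern is tightened here to an optional fractional part, which is my assumption about the intent rather than something stated in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LexerSketch {
    enum Type {
        NUMBER("[0-9]+(?:\\.[0-9]+)?"), // digits with an optional fractional part (assumed intent)
        OPERATOR("[*/+\\-]"),           // character class: one of * / + -
        UNIT("in|pt"),                  // alternation: "in" or "pt"
        PARENTHESES("[()]");            // character class: ( or )

        final String pattern;

        Type(String pattern) {
            this.pattern = pattern;
        }
    }

    // Builds one alternation of named groups: (?<NUMBER>...)|(?<OPERATOR>...)|...
    static Pattern buildPattern() {
        StringBuilder sb = new StringBuilder();
        for (Type type : Type.values()) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append("(?<").append(type.name()).append('>').append(type.pattern).append(')');
        }
        return Pattern.compile(sb.toString());
    }

    // Returns tokens rendered as "TYPE:text"; a real lexer would build Token objects instead.
    static List<String> lex(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = buildPattern().matcher(input);
        while (matcher.find()) {
            for (Type type : Type.values()) {
                String text = matcher.group(type.name());
                if (text != null) {
                    tokens.add(type + ":" + text);
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(lex("1 + 2"));
        System.out.println(lex("(1.5in + 2pt) * 3"));
    }
}
```

Because find() skips over anything that no alternative matches, whitespace does not have to be stripped first. The flip side is that unknown characters are silently ignored, so a stricter lexer would need to compare each match's start position with the previous match's end.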

Splitting on comma outside quotes

My program reads a line from a file. This line contains comma-separated text like:
123,test,444,"don't split, this",more test,1
I would like the result of a split to be this:
123
test
444
"don't split, this"
more test
1
If I use the String.split(","), I would get this:
123
test
444
"don't split
this"
more test
1
In other words: The comma in the substring "don't split, this" is not a separator. How to deal with this?

You can try out this regex:
str.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
This splits the string on , that is followed by an even number of double quotes. In other words, it splits on comma outside the double quotes. This will work provided you have balanced quotes in your string.
Explanation:
, // Split on comma
(?= // Followed by
(?: // Start a non-capture group
[^"]* // 0 or more non-quote characters
" // 1 quote
[^"]* // 0 or more non-quote characters
" // 1 quote
)* // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
[^"]* // Finally 0 or more non-quotes
$ // Till the end (This is necessary, else every comma will satisfy the condition)
)
You can even write it like this in your code, using the (?x) modifier with your regex. The modifier ignores any whitespace in your regex, so it becomes easier to read a regex broken into multiple lines, like so:
String[] arr = str.split("(?x) " +
", " + // Split on comma
"(?= " + // Followed by
" (?: " + // Start a non-capture group
" [^\"]* " + // 0 or more non-quote characters
" \" " + // 1 quote
" [^\"]* " + // 0 or more non-quote characters
" \" " + // 1 quote
" )* " + // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
" [^\"]* " + // Finally 0 or more non-quotes
" $ " + // Till the end (This is necessary, else every comma will satisfy the condition)
") " // End look-ahead
);

Why Split when you can Match?
Resurrecting this question because for some reason, the easy solution wasn't mentioned. Here is our beautifully compact regex:
"[^"]*"|[^,]+
This will match all the desired fragments.
Explanation
With "[^"]*", we match complete "double-quoted strings",
or (|)
we match [^,]+, any run of characters that are not a comma.
A possible refinement is to improve the string side of the alternation to allow the quoted strings to include escaped quotes.
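One way to sketch that refinement, assuming that quotes inside a quoted field are escaped with a backslash. The question does not say how escaping works, so this is an assumption (CSV files often double the quote instead). The class name QuotedFieldMatcher and the method fields are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedFieldMatcher {
    // First alternative: a quoted field whose body is made of non-quote,
    // non-backslash characters or backslash-escaped characters.
    // Second alternative: a run of characters that are not a comma.
    static final Pattern FIELD = Pattern.compile("\"(?:[^\"\\\\]|\\\\.)*\"|[^,]+");

    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample line from the question.
        System.out.println(fields("123,test,444,\"don't split, this\",more test,1"));
        // Hypothetical line with backslash-escaped quotes inside a quoted field.
        System.out.println(fields("a,\"say \\\"hi\\\", ok\",b"));
    }
}
```

As with the original pattern, empty fields (two commas in a row) produce no match at all, because [^,]+ needs at least one character.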

Building upon @zx81's answer, because the matching idea is really nice, I've added the Java 9 results() call, which returns a Stream. Since the OP wanted to use split, I've collected the matches into a String[], as split does.
Be careful if you have spaces after your comma separators (a, b, "c,d"); in that case you need to change the pattern.
Jshell demo
$ jshell
-> String so = "123,test,444,\"don't split, this\",more test,1";
| Added variable so of type String with initial value "123,test,444,"don't split, this",more test,1"
-> Pattern.compile("\"[^\"]*\"|[^,]+").matcher(so).results();
| Expression value is: java.util.stream.ReferencePipeline$Head@2038ae61
| assigned to temporary variable $68 of type java.util.stream.Stream<MatchResult>
-> $68.map(MatchResult::group).toArray(String[]::new);
| Expression value is: [Ljava.lang.String;@6b09bb57
| assigned to temporary variable $69 of type String[]
-> Arrays.stream($69).forEach(System.out::println);
123
test
444
"don't split, this"
more test
1
Code
String so = "123,test,444,\"don't split, this\",more test,1";
Pattern.compile("\"[^\"]*\"|[^,]+")
.matcher(so)
.results()
.map(MatchResult::group)
.toArray(String[]::new);
Explanation
Regex "[^"]" matches: a quote, any one character but a quote, a quote.
Regex "[^"]*" matches: a quote, anything but a quote 0 (or more) times, a quote.
That regex needs to go first to "win", otherwise matching anything but a comma 1 or more times - that is: [^,]+ - would "win".
results() requires Java 9 or higher.
It returns Stream<MatchResult>, which I map using group() call and collect to array of Strings. Parameterless toArray() call would return Object[].

You can do this very easily without a complex regular expression:
Split on the character ". You get a list of strings.
Process each string in the list: split every string that is at an even position in the list (indexing from zero) on "," (you get a list inside a list), and leave every odd-positioned string alone (putting it directly into a list inside the list).
Join the list of lists, so you get a single list.
If you want to handle quoting of '"', you have to adapt the algorithm a little (joining some parts that you have incorrectly split off, or changing the splitting to a simple regexp), but the basic structure stays the same.
So basically it is something like this:
public class SplitTest {
public static void main(String[] args) {
final String splitMe="123,test,444,\"don't split, this\",more test,1";
final String[] splitByQuote=splitMe.split("\"");
final String[][] splitByComma=new String[splitByQuote.length][];
for(int i=0;i<splitByQuote.length;i++) {
String part=splitByQuote[i];
if (i % 2 == 0){
splitByComma[i]=part.split(",");
}else{
splitByComma[i]=new String[1];
splitByComma[i][0]=part;
}
}
for (String parts[] : splitByComma) {
for (String part : parts) {
System.out.println(part);
}
}
}
}
This will be much cleaner with lambdas, promised!
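For what it's worth, here is one way the same steps might look with streams and lambdas. It is a sketch, not the answerer's code: the class name QuoteAwareSplit and the helper piece are invented, and unlike the loop above it drops the empty strings that appear when a comma sits directly next to a quote (which also means genuinely empty fields are dropped).

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class QuoteAwareSplit {
    // Even-indexed pieces lie outside quotes and are split on commas;
    // odd-indexed pieces were inside quotes and are kept whole.
    static Stream<String> piece(String[] parts, int i) {
        if (i % 2 == 0) {
            return Arrays.stream(parts[i].split(",")).filter(s -> !s.isEmpty());
        }
        return Stream.of(parts[i]);
    }

    static List<String> split(String line) {
        String[] parts = line.split("\"");
        return IntStream.range(0, parts.length)
                .boxed()
                .flatMap(i -> piece(parts, i)) // flattens the "list of lists" into one stream
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        split("123,test,444,\"don't split, this\",more test,1").forEach(System.out::println);
    }
}
```

Like the loop version, this strips the surrounding quotes from the quoted field and assumes the quotes are balanced.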

Please see the code snippet below. This code only considers the happy path. Change it according to your requirements.
public static String[] splitWithEscape(final String str, char split,
char escapeCharacter) {
final List<String> list = new LinkedList<String>();
char[] cArr = str.toCharArray();
boolean isEscape = false;
StringBuilder sb = new StringBuilder();
for (char c : cArr) {
if (isEscape && c != escapeCharacter) {
sb.append(c);
} else if (c != split && c != escapeCharacter) {
sb.append(c);
} else if (c == escapeCharacter) {
if (!isEscape) {
isEscape = true;
if (sb.length() > 0) {
list.add(sb.toString());
sb = new StringBuilder();
}
} else {
isEscape = false;
}
} else if (c == split) {
list.add(sb.toString());
sb = new StringBuilder();
}
}
if (sb.length() > 0) {
list.add(sb.toString());
}
String[] strArr = new String[list.size()];
return list.toArray(strArr);
}

How to retain all occurrences of X when using the greedy quantifier X* in a java regexp?

I have a regular expression that I use to find matches of a list of comma-separated words between <> inside a string, like "Hello <a1> sqjsjqk <b1,b2> dsjkfjkdsf <c1,c2,c3> ffsd" in the example.
I want to use capturing groups to retain each word between the angle brackets.
Here is my expression: < (\w+) (?: ,(\w+) )* > (spaces are added for readability but are not part of the pattern)
Parentheses create capturing groups, and (?: ) creates a non-capturing group, because I don't want to retain the comma.
Here is my test code:
@Test
public void test() {
String patternString = "<(\\w+)(?:,(\\w+))*>";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher("Hello <a1> sqjsjqk <b1,b2> dsjkfjkdsf <c1,c2,c3> ffsd");
while(matcher.find()) {
System.out.println("== Match ==");
MatchResult matchResult = matcher.toMatchResult();
for(int i = 0; i < matchResult.groupCount(); i++) {
System.out.println(" " + matchResult.group(i + 1));
}
}
}
This is the output produced:
== Match ==
a1
null
== Match ==
b1
b2
== Match ==
c1
c3
And here is what I wanted:
== Match ==
a1
== Match ==
b1
b2
== Match ==
c1
c2
c3
From this I understand that there are exactly as many groups as there are capturing groups in my expression, but this is not what I want, because I need every substring that was matched by \w+.
Is there any chance of getting what I want with a single regex, or should I finish the job with split(","), trim(), etc.?

As far as I know .NET has the only regex engine out there, that can return multiple captures for a single capturing group. So what you are asking for is not possible in Java (at least not the way you asked for).
In your case this problem can however be solved to a certain extent. If you can be sure that there will never be an unmatched closing >, you can make the stuff you want to capture the full match, and require the correct position through a lookahead:
"\\w+(?=(?:,\\w+)*>)"
This can never match "words" outside of <...>, because they cannot get past the opening < to match the closing >. Of course that makes it hard to distinguish between elements from different sets of <...>.
Alternatively (and I suppose that is even better, because it's safer, and more readable), go for a two-step algorithm. First match
"<([\\w,]*)>"
Then split every result's first capture at ,.
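The two-step approach could be sketched like this. The class name AngleGroups and the method extract are hypothetical; the pattern is the one given above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AngleGroups {
    // Step 1: capture everything between < and > that consists of word characters and commas.
    static final Pattern GROUP = Pattern.compile("<([\\w,]*)>");

    static List<List<String>> extract(String input) {
        List<List<String>> result = new ArrayList<>();
        Matcher m = GROUP.matcher(input);
        while (m.find()) {
            // Step 2: split the captured list at the commas.
            result.add(Arrays.asList(m.group(1).split(",")));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "Hello <a1> sqjsjqk <b1,b2> dsjkfjkdsf <c1,c2,c3> ffsd";
        for (List<String> words : extract(input)) {
            System.out.println("== Match ==");
            for (String word : words) {
                System.out.println(" " + word);
            }
        }
    }
}
```

Note that an empty pair <> yields a list holding one empty string, since splitting an empty string returns it unchanged; filter that out if it matters.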

How can I find repeated characters with a regex in Java?

Can anyone give me a Java regex to identify repeated characters in a string? I am only looking for characters that are repeated immediately and they can be letters or digits.
Example:
abccde <- looking for this (immediately repeating c's)
abcdce <- not this (c's separated by another character)

Try "(\\w)\\1+"
The \\w matches any word character (letter, digit, or underscore) and the \\1+ matches whatever was in the first set of parentheses, one or more times. So you wind up matching any occurrence of a word character, followed immediately by one or more of the same word character again.
(Note that I gave the regex as a Java string, i.e. with the backslashes already doubled for you)

String stringToMatch = "abccdef";
Pattern p = Pattern.compile("(\\w)\\1+");
Matcher m = p.matcher(stringToMatch);
if (m.find())
{
System.out.println("Duplicate character " + m.group(1));
}

Regular expressions can be relatively expensive. You would probably be better off just storing the last character and checking whether the next one is the same.
Something along the lines of:
String s = "abccde";
char c1 = s.charAt(0); // previous character
for (int i = 1; i < s.length(); i++) {
    char c2 = s.charAt(i); // current character
    if (c1 == c2) {
        System.out.println("Duplicate character " + c2);
    }
    c1 = c2;
}
