Regex to split when uppercase after alphabetic lowercase char - java

So I'm trying to split a string with a regex and the split function in Java.
The regex should split the string wherever a capital letter follows a lowercase letter, like this:
hHere // -> should split to ["h", "Here"]
I'm trying to split a string like this
String str = "1. Test split hHere and not .Here and /Here";
String[] splitString = str.split("(?=\\w+)((?=[^\\s])(?=\\p{Upper}))");
/* print splitString */
// -> should split to ["1. Test split h", "Here and not .Here and /Here"]
for (String s : splitString) {
    System.out.println(s);
}
output I get
1.
Test split h
Here and not .
Here and /
Here
output I want
1. Test split h
Here and not .Here and /Here
Just can't figure out the regex to do this

You may use an easier pattern: (?<=\p{Ll})(?=\p{Lu})
(?<= ) is a positive lookbehind: it asserts that the given subpattern matches immediately before the current position.
(?= ) is a positive lookahead: it asserts that the given subpattern matches at the current position.
Neither consumes any characters, which is very important here: split() drops whatever the regex matches, so a zero-width match keeps both letters.
str.split("(?<=[a-z])(?=[A-Z])"); was my earlier version; it only handles ASCII letters and does not work for other alphabets.
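As a minimal sketch of the lookaround split described above, applied to the string from the question (the class and method names are made up for illustration):

```java
import java.util.Arrays;

class CaseBoundarySplit {
    // Zero-width position: a lowercase letter behind, an uppercase letter ahead.
    static final String BOUNDARY = "(?<=\\p{Ll})(?=\\p{Lu})";

    // Hypothetical helper: splits at every lowercase-to-uppercase boundary.
    static String[] splitAtCaseBoundary(String input) {
        return input.split(BOUNDARY);
    }

    public static void main(String[] args) {
        String str = "1. Test split hHere and not .Here and /Here";
        // Only "hH" qualifies; " T", ".H" and "/H" have no lowercase letter behind.
        System.out.println(Arrays.toString(splitAtCaseBoundary(str)));
        // prints [1. Test split h, Here and not .Here and /Here]
    }
}
```

Because the match is zero-width, no characters are lost at the split point.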

As per my original comment.
Code
Option 1
This option works with ASCII characters (it will not work for Unicode characters). Basically, this works with English text.
See regex in use here
(?<=[a-z])(?=[A-Z])
Option 2
This option works with Unicode characters. This works with any language.
See regex in use here
(?<=\p{Ll})(?=\p{Lu})
Explanation
Option 1
(?<=[a-z]) Positive lookbehind ensuring what precedes is a character in the set a-z (lowercase ASCII character)
(?=[A-Z]) Positive lookahead ensuring what follows is a character in the set A-Z (uppercase ASCII character)
Option 2
(?<=\p{Ll}) Positive lookbehind ensuring what precedes is a character in the set \p{Ll} (Unicode general category "lowercase letter")
(?=\p{Lu}) Positive lookahead ensuring what follows is a character in the set \p{Lu} (Unicode general category "uppercase letter")

Related

Regex to include all spanish characters and number

I have a Java app where I need a regex that removes everything except letters and numbers (including Spanish characters such as stressed vowels and ñ/Ñ). It also needs to keep some specific special characters.
I created the following regex, but it also removes the stressed vowels, which is not the idea
string.replaceAll("[^-_/.,a-zA-Z0-9 ]+","")
I just wanna accept those characters.. not others like æ, å or others..
You may use \p{L} instead of a-zA-Z:
string = string.replaceAll("[^-_/.,\\p{L}0-9 ]+","");
The \p{L} matches all Unicode letters regardless of modifiers passed to the regex compile.
See a Java test:
List<String> strs = Arrays.asList("!##Łąka$%^", "Word123-)(=+");
for (String str : strs)
    System.out.println("\"" + str.replaceAll("[^-_/.,\\p{L}0-9 ]+","") + "\"");
Output:
"Łąka"
"Word123-"
Pattern details: the [^-_/.,\\p{L}0-9 ]+ pattern matches one or more chars other than -, _, /, ., ,, a Unicode letter, an ASCII digit and a space.
Note that with this solution, you will still remove Unicode digits, like ٠١٢٣٤٥٦٧٨٩.
You may use Mena's suggested \p{Alnum} but with the (?U) embedded flag option to really match all Unicode letters and digits:
string = string.replaceAll("(?U)[^-_/.,\\p{Alnum} ]+","");
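A small sketch of why the (?U) flag matters: in Java, \p{Alnum} is a POSIX class that covers only US-ASCII unless UNICODE_CHARACTER_CLASS is enabled. The class and method names below are hypothetical, and the character class is reduced to \p{Alnum} alone to isolate the effect:

```java
class AlnumFlagDemo {
    // Without (?U): \p{Alnum} is ASCII-only, so accented letters are removed.
    static String asciiAlnumOnly(String s) {
        return s.replaceAll("[^\\p{Alnum}]+", "");
    }

    // With (?U): \p{Alnum} follows the Unicode definitions of letters and digits.
    static String unicodeAlnum(String s) {
        return s.replaceAll("(?U)[^\\p{Alnum}]+", "");
    }

    public static void main(String[] args) {
        String input = "a\u00F1o 2024"; // "a", n-tilde, "o", space, "2024"
        System.out.println(asciiAlnumOnly(input)); // n-tilde and space are dropped
        System.out.println(unicodeAlnum(input));   // only the space is dropped
    }
}
```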
To only remove Unicode letters other than common European letters, just add À-ÿ and subtract two non-letters, ×÷, from this range:
string = string.replaceAll("(?U)[^-_/.,A-Za-zÀ-ÿ &&[^×÷]]+","");
You could try to include spanish special characters in a character class [ ... ], there are only 7 after all.
I needed only lowercase characters, so instead of [a-z], I used [a-zñáéíóúü] and that worked for me.
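For the explicit-list approach, a sketch might look like the following. It assumes the allowed set is the question's punctuation plus ASCII letters, digits, space, and the Spanish letters in both cases; the class and method names are made up, and the letters are written as Unicode escapes only to keep the source file encoding-independent:

```java
class SpanishFilter {
    // Everything NOT in this class gets removed. Added to the question's set:
    // n-tilde, the five acute vowels and u-diaeresis, lowercase and uppercase.
    static final String NOT_ALLOWED =
        "[^-_/.,a-zA-Z0-9 "
        + "\u00F1\u00E1\u00E9\u00ED\u00F3\u00FA\u00FC"   // lowercase
        + "\u00D1\u00C1\u00C9\u00CD\u00D3\u00DA\u00DC"   // uppercase
        + "]+";

    // Hypothetical helper: keeps only the characters listed above.
    static String keepSpanish(String s) {
        return s.replaceAll(NOT_ALLOWED, "");
    }

    public static void main(String[] args) {
        // The ae ligature and '!' are removed; n-tilde and e-acute are kept.
        System.out.println(keepSpanish("ni\u00F1o \u00E6 caf\u00E9!"));
    }
}
```

Unlike \p{L}, this rejects letters such as æ or å, which is what the question asks for.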
You can use the POSIX character class \p{Alnum} to cover alphabetic characters and digits; note that in Java it matches accented characters only when the UNICODE_CHARACTER_CLASS flag, (?U), is enabled:
"[^-_/.,\\p{Alnum} ]+"
See docs:
\p{Alnum} An alphanumeric character:[\p{Alpha}\p{Digit}]
Note that your replacement currently impacts all alphabetic characters, etc.
If you want to actually negate that custom class (thus replacing everything that's not defined in there), use:
"[^[-_/.,\\p{Alnum} ]]+"
(note the additional square brackets after the ^, otherwise it would be interpreted as literal ^).
Edit
You can narrow this down further to a subset of Latin character blocks by using:
String s = "a1᣹";
System.out.println(
s.replaceAll("[^[-_/.,\\p{InBASIC_LATIN}\\p{InLATIN_1_SUPPLEMENT}0-9]]+","")
);
Output
Łą
Note that you will still have some non-Spanish characters in the Latin 1 supplement, see here.
If you want to restrict your requirements further, you will likely need to define your own (lengthy) character class with specific Spanish characters.

Java Regexp to match words only (', -, space)

What is the Java regular expression to match all words containing only:
Letters from a to z and A to Z
The ', - and space characters, but they must not be at the beginning or the end.
Examples
test'test match
test' doesn't match
'test doesn't match
-test doesn't match
test- doesn't match
test-test match
You can use the following pattern: ^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$
Below are the examples:
String s1 = "abc";
String s2 = " abc";
String s3 = "abc ";
System.out.println(s1.matches("^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$"));
System.out.println(s2.matches("^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$"));
System.out.println(s3.matches("^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$"));
If you mean the space character, the class is: [a-zA-Z ]
It checks that your string contains only a-z (lowercase) and A-Z (uppercase) chars and spaces. If it contains anything else, the test will fail
Here's my solution:
/(\w{2,}(-|'|\s)\w{2,})/g
You can take it for a spin on Regexr.
It is first checking for a word with \w, then any of the three qualifiers with "or" logic using |, and then another word. The brackets {} are making sure the words on either end are at least 2 characters long so contractions like don't aren't captured. You could set that to any value to prevent longer words from being captured or omit them entirely.
Caveat: \w also looks for _ underscores. If you don't want that you could replace it with [a-zA-Z] like so:
/([a-zA-Z]{2,}(-|'|\s)[a-zA-Z]{2,})/g
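For whole-string validation in Java, one possible sketch (not taken from the answers above) anchors letters at both ends and allows a single ', - or space only between runs of letters. It assumes consecutive separators such as "a--b" should be rejected; the class and method names are hypothetical:

```java
class WordMatch {
    // One or more letters, then any number of (separator + letters) groups.
    // String.matches() must match the entire input, so no ^ or $ is needed.
    static final String WORD = "[a-zA-Z]+(?:[' -][a-zA-Z]+)*";

    static boolean isWord(String s) {
        return s.matches(WORD);
    }

    public static void main(String[] args) {
        String[] samples = {"test'test", "test'", "'test", "-test", "test-", "test-test"};
        for (String s : samples) {
            System.out.println(s + " -> " + isWord(s));
        }
    }
}
```

Since every separator must be followed by at least one letter, a separator can never appear at the start or the end.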

How to write this Java regex?

I need to break the string into words by a hyphen. For example:
"WorkInProgress" is converted to "Work-In-Progress"
"NotComplete" is converted to "Not-Complete"
In most cases each word starts with a capital letter and ends with a lowercase letter.
But there is one exception, "CIInProgress" should be converted to "CI-In-Progress".
I wrote the code below: wherever lowercase letters or "CI" are followed by a capital letter, a "-" should be inserted in between. But it still doesn't work for "CIInProgress". Can anyone tell me how to correct it?
String str;
String pattern = "([a-z|CI]+)([A-Z])";
str= str.replaceAll(pattern, "$1\\-$2");
You could use a negative lookbehind,
Regex:
(?<!^)([A-Z][a-z])
Replacement string:
-$1
DEMO
Explanation:
(?<!^) is a negative lookbehind asserting that the match does not begin at the start of the string, so the first word is left alone. [A-Z][a-z] then matches an uppercase letter followed by a lowercase letter, i.e. the beginning of a new word. The parentheses () form a capturing group; the captured text is stored in group 1 and can be referenced later by its group index, here as $1 in the replacement string.
Code:
System.out.println("WorkInProgress".replaceAll("(?<!^)([A-Z][a-z])", "-$1"));
System.out.println("NotComplete".replaceAll("(?<!^)([A-Z][a-z])", "-$1"));
System.out.println("CIInProgress".replaceAll("(?<!^)([A-Z][a-z])", "-$1"));
Output:
Work-In-Progress
Not-Complete
CI-In-Progress
You can't have | in a character class; it will just get interpreted as a literal vertical bar character. Try:
String pattern = "([a-z]+|CI)([A-Z])";
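A quick sketch of that corrected pattern applied to the three inputs from the question (the class and method names are made up for illustration):

```java
class HyphenateWords {
    // Group 1: a run of lowercase letters, or the literal "CI".
    // Group 2: the capital letter that starts the next word.
    static String hyphenate(String s) {
        return s.replaceAll("([a-z]+|CI)([A-Z])", "$1-$2");
    }

    public static void main(String[] args) {
        System.out.println(hyphenate("WorkInProgress")); // Work-In-Progress
        System.out.println(hyphenate("NotComplete"));    // Not-Complete
        System.out.println(hyphenate("CIInProgress"));   // CI-In-Progress
    }
}
```

The alternation lives outside the character class, so | acts as "or" rather than as a literal bar.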
Try this:
str= str.replaceAll("(?<=\\p{javaLowerCase})(?=\\p{javaUpperCase})", "-");

Splitting words from text using regex

I need to filter the given text to get all words, including apostrophes (can't is considered a single word).
Para = "'hello' world '"
I am splitting the text using
String[] splits = Para.split("[^a-zA-Z']");
Expected output:
hello world
But it is giving:
'hello' world '
I get everything right, except a single apostrophe (') and 'hello' are not getting filtered by the above regex.
How can I filter these two things?
As far as I can tell, you're looking for a ' where either the next or previous character is not a letter.
The regex I came up with to do this, contained in some test code:
String str = "bob can't do 'well'";
String[] splits = str.split("(?:(?<=^|[^a-zA-Z])'|'(?=[^a-zA-Z]|$)|[^a-zA-Z'])+");
System.out.println(Arrays.toString(splits));
Explanation:
(?<=^|[^a-zA-Z])' - matches a ' where the previous character is not a letter, or we're at the start of the string.
'(?=[^a-zA-Z]|$) - matches a ' where the next character is not a letter, or we're at the end of the string.
[^a-zA-Z'] - not a letter or '.
(?:...)+ - one or more of any of the above (the ?: is just to make it a non-capturing group).
See this for more on regex lookaround ((?<=...) and (?=...)).
Simplification:
The regex can be simplified to the below by using negative lookaround:
"(?:(?<![a-zA-Z])'|'(?![a-zA-Z])|[^a-zA-Z'])+"
A Unicode version, without lookarounds:
String TestInput = "This voilà München is the test' 'sentence' that I'm willing to split";
String[] splits = TestInput.split("'?[^\\p{L}']+'?");
for (String t : splits) {
    System.out.println(t);
}
\p{L} matches any character with the Unicode property "Letter".
This splits on a sequence of characters that are neither letters nor ', and also swallows a ' directly before or after that sequence.
Output:
This
voilà
München
is
the
test
sentence
that
I'm
willing
to
split
To handle leading and trailing ', just add them as alternatives
TestInput.split("'?[^\\p{L}']+'?|^'|'$")
If you define a word as a sequence that:
Must start and end with an English letter (a-zA-Z)
May contain apostrophes (') in between.
Then you can use the following regex in Matcher.find() loop to extract matches:
[a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?
Sample code:
Pattern p = Pattern.compile("[a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?");
Matcher m = p.matcher(inputString);
while (m.find()) {
    System.out.println(m.group());
}
Demo (the demo uses the PCRE regex flavor, but the result should not differ from Java for this regex)
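The Matcher.find() loop can be wrapped into a self-contained sketch like this one, run against the question's input; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordExtractor {
    // A letter, optionally followed by letters/apostrophes that end in a letter.
    private static final Pattern WORD =
        Pattern.compile("[a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?");

    // Collects every match instead of splitting on the gaps between words.
    static List<String> words(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("'hello' world '"));     // [hello, world]
        System.out.println(words("bob can't do 'well'")); // [bob, can't, do, well]
    }
}
```

Extracting matches avoids the empty leading element that split() can produce when the input starts with a delimiter.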

Regex deleted special character

I'm having the following problem with regex: I've written a program that reads words from some text (txt) files and writes them into another file, one word per line.
Everything works fine, except when a word contains special characters such as ľščťžýáíé. The regex deletes the char and splits the word where the special char was.
For Example :
Input:
I am Jožo.
Output:
I
am
Jo
o
Here's a snippet of the code:
while ((line = br.readLine()) != null) {
    Pattern p = Pattern.compile("[\\w']+");
    Matcher m = p.matcher(line);
}
Instead of this regex:
Pattern.compile("[\\w']+")
Use Unicode based:
Pattern.compile("[\\p{L}']+")
This is because by default \\w in Java matches only ASCII letters, the digits 0-9 and the underscore.
Another option is to use the modifier
Pattern.UNICODE_CHARACTER_CLASS
Like this:
Pattern.compile("[\\w']+", Pattern.UNICODE_CHARACTER_CLASS)
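To see the difference between the three patterns on the question's input, here is a minimal sketch; the class and helper names are made up, and ž is written as the escape \u017E only to keep the source file encoding-independent:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWordDemo {
    // Hypothetical helper: returns every match of the pattern in the line.
    static List<String> findAll(Pattern p, String line) {
        List<String> words = new ArrayList<>();
        Matcher m = p.matcher(line);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "I am Jo\u017Eo."; // "I am Jozo." with a caron on the z
        Pattern ascii = Pattern.compile("[\\w']+");
        Pattern letters = Pattern.compile("[\\p{L}']+");
        Pattern flagged = Pattern.compile("[\\w']+", Pattern.UNICODE_CHARACTER_CLASS);

        System.out.println(findAll(ascii, line));   // name is split into "Jo" and "o"
        System.out.println(findAll(letters, line)); // name stays in one piece
        System.out.println(findAll(flagged, line)); // name stays in one piece
    }
}
```

Note that [\p{L}']+ no longer matches digits or underscores, while the flagged \w still does.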
By default, \\w matches only a-z, A-Z, 0-9 and the underscore (English alphabet plus digits and _).
If you want to accept any character except whitespace as part of a word, use \\S
