Generate new attribute using regex in RapidMiner - java

I am working with an Excel file that contains several sentences. I would like to generate a new attribute (I use the "Generate Attribute" operator) that returns true or false depending on whether the sentence contains numbers with white space between them (e.g. 234 45 56). I have used the function "match nominal regex" (matches(sentences,"\d+\s+\d")) to do this. However, RapidMiner does not recognize the escape (\) character. How do I change my regex to make it work?
Some additional comments/examples:
My input sentences:
word word 123 345 6665 23456 54 word word word
word word word 12.3 34.5 6665 23.456 5.4 word word word
word word word 12,3 34,5 6665 23,456 5.4 word word word
word word word 12,3% 34,5% 6665% 23,456% 5.4% word word word
My output should be a new variable that is true or false depending on whether the sentence contains such a chain of numbers.
I first thought of using the following regex to capture the numbers: \d+[.,]?\d*\s+\d+[.,]?\d*

You may express \d as [0-9] and \s as a space. Also, it seems matches needs the pattern to match the full line, so add .* on both sides:
matches(sentences,".*[0-9] +[0-9].*")
This matches any 0+ chars other than a newline (as many as possible), followed by a digit, 1+ spaces and a digit, and then again any 0+ chars other than a newline.
Also, try doubling the \ to match \d or \s (since the regex is Java flavor):
matches(sentences,".*\\d+\\s+\\d.*")
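The two expressions above can be tried outside RapidMiner in plain Java. This is only a sketch: it assumes RapidMiner's matches behaves like Java's whole-string Matcher.matches(), and the class and helper names are made up for illustration.

```java
import java.util.regex.Pattern;

class NumberChainCheck {
    // In a Java string literal every regex backslash is doubled.
    static final Pattern SHORTHAND = Pattern.compile(".*\\d+\\s+\\d.*");
    // Same idea written without any backslashes.
    static final Pattern NO_ESCAPES = Pattern.compile(".*[0-9] +[0-9].*");

    // Hypothetical helper: true if two digits are separated only by whitespace.
    static boolean hasNumberChain(String sentence) {
        return SHORTHAND.matcher(sentence).matches();
    }

    public static void main(String[] args) {
        String[] sentences = {
            "word word 123 345 6665 23456 54 word word word",
            "word word word 12.3 34.5 6665 23.456 5.4 word word word",
            "word word word 12,3% 34,5% 6665% 23,456% 5.4% word word word",
            "word word word"
        };
        for (String s : sentences) {
            System.out.println(hasNumberChain(s) + " / "
                    + NO_ESCAPES.matcher(s).matches() + " <- " + s);
        }
    }
}
```

Note that the sentence with percent signs comes out false with both patterns, because every number there is followed by % before the next space.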

Related

Regex to identify consecutive and non-consecutive duplicate words in multiline text

I'm writing a syntax checker (in Java) for a file that has the keywords and comma (separation)/semicolon (EOL) separated values. The amount of spaces between two complete constructions is unspecified.
What is required:
Find any duplicate words (consecutive and non-consecutive) in the multiline file.
// Example_1 (duplicate 'test'):
item1 , test, item3 ;
item4,item5;
test , item6;
// Example_2 (duplicate 'test'):
item1 , test, test ;
item2,item3;
I've tried to apply the (\w+)(s*\W\s*\w*)*\1 pattern, which doesn't catch the duplicates properly.
You may use this regex with mode DOTALL (single line):
(?s)(\b\w+\b)(?=.*\b\1\b)
RegEx Details:
(?s): Enable DOTALL mode
(\b\w+\b): Match a complete word and capture it in group #1
(?=.*\b\1\b): Lookahead to assert that we have back-reference \1 present somewhere ahead. \b is used to make sure we match exact same word again.
Additionally:
Based on the earlier comments, if the intent is not to match consecutive word repeats such as item1 item1, then the following regex may be used:
(?s)(\b\w+\b)(?!\W+\1\b)(?=.*\b\1\b)
There is one extra negative lookahead assertion here to make sure we don't match consecutive repeats.
(?!\W+\1\b): Negative lookahead to fail the match for consecutive repeats.
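A minimal Java sketch of both patterns from this answer, run against the two examples in the question. The class name, the helper method and the use of a LinkedHashSet to collect the captured words are illustrative choices, not part of the original answer.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DuplicateWords {
    // Any word that occurs again later; (?s) lets . cross line breaks.
    static final Pattern ANY_REPEAT =
            Pattern.compile("(?s)(\\b\\w+\\b)(?=.*\\b\\1\\b)");
    // Same, but a word is skipped when the very next word is identical.
    static final Pattern NON_CONSECUTIVE =
            Pattern.compile("(?s)(\\b\\w+\\b)(?!\\W+\\1\\b)(?=.*\\b\\1\\b)");

    // Collects every captured word once, in order of first appearance.
    static Set<String> duplicates(Pattern pattern, String text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String example1 = "item1 , test, item3 ;\nitem4,item5;\ntest , item6;";
        String example2 = "item1 , test, test ;\nitem2,item3;";
        System.out.println(duplicates(ANY_REPEAT, example1));      // [test]
        System.out.println(duplicates(ANY_REPEAT, example2));      // [test]
        System.out.println(duplicates(NON_CONSECUTIVE, example1)); // [test]
        System.out.println(duplicates(NON_CONSECUTIVE, example2)); // []
    }
}
```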
You may use
\b(\w+)\b(?:\s*[^\w\s]\s*\w+)+\s*[^\w\s]\s*\b\1\b
Details
\b(\w+)\b - Group 1: one or more word chars as a whole word
(?:\s*[^\w\s]\s*\w+)+ - 1 or more occurrences of:
    \s* - 0+ whitespaces
    [^\w\s] - 1 char other than a word and whitespace char
    \s* - 0+ whitespaces
    \w+ - 1+ word chars
\s* - 0+ whitespaces
[^\w\s] - 1 char other than a word and whitespace char
\s* - 0+ whitespaces
\b\1\b - the same value as in Group 1 as a whole word.
To match only the word, wrap the second part of the regex in a positive lookahead, (?=...):
\b(\w+)\b(?=(?:\s*[^\w\s]\s*\w+)+\s*[^\w\s]\s*\b\1\b)
Java regex variable declaration:
String regex = "\\b(\\w+)\\b(?:\\s*[^\\w\\s]\\s*\\w+)+\\s*[^\\w\\s]\\s*\\b\\1\\b";
To make it fully Unicode aware add (?U):
String regex = "(?U)\\b(\\w+)\\b(?:\\s*[^\\w\\s]\\s*\\w+)+\\s*[^\\w\\s]\\s*\\b\\1\\b";
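As a rough usage sketch, the lookahead variant of this regex can be applied with Matcher.find(). The class and helper names below are hypothetical. Note that the pattern, as written, requires at least one other word between the two occurrences, so in this sketch the adjacent pair in the question's second example is not reported.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SeparatedDuplicate {
    // Lookahead variant: only the first occurrence of the word is consumed.
    static final Pattern REGEX = Pattern.compile(
            "\\b(\\w+)\\b(?=(?:\\s*[^\\w\\s]\\s*\\w+)+\\s*[^\\w\\s]\\s*\\b\\1\\b)");

    // Hypothetical helper: first repeated word, or null if the pattern finds none.
    static String firstDuplicate(String text) {
        Matcher m = REGEX.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String example1 = "item1 , test, item3 ;\nitem4,item5;\ntest , item6;";
        String example2 = "item1 , test, test ;\nitem2,item3;";
        System.out.println(firstDuplicate(example1)); // test
        System.out.println(firstDuplicate(example2)); // null
    }
}
```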

Regex that matches certain trailings of words with certain length?

How can I write a regex that matches the trailing "e" characters of every word, except those of words with 2 or 3 letters?
Example:
abcdeeee: Full match for "eeee"
more: Full match for "e"
pie: No match
me: No match
Use a lookaround assertion:
e+(?<=\w{4})\b
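This matches a run of e characters only if the four characters immediately before the end of that run (the e's themselves included) are all word characters, i.e. the word is at least four characters long. The \b makes sure the run ends at a word boundary.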
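A small Java check of this pattern against the four examples from the question. The class and helper names are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingE {
    // Run of e's, then a lookbehind requiring four word characters before this point.
    static final Pattern TRAILING_E = Pattern.compile("e+(?<=\\w{4})\\b");

    // Hypothetical helper: the matched trailing e's, or null when nothing matches.
    static String trailingE(String word) {
        Matcher m = TRAILING_E.matcher(word);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(trailingE("abcdeeee")); // eeee
        System.out.println(trailingE("more"));     // e
        System.out.println(trailingE("pie"));      // null
        System.out.println(trailingE("me"));       // null
    }
}
```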

Regex match numbers with spaces but not without spaces

Trying to match a string of numbers with spaces in between while ignoring other strings of numbers without spaces in between them. I'd like to match 16 characters.
eg. Would like to match 12345 67890 1234 but NOT 1234567890123456
I have tried this:
[0-9 ]{16}
But this matches both sets of strings.
I used and corrected @Wiktor Stribiżew's regex, because the original regex would also match a space at the beginning and the end of the number.
Regex: \b(?![0-9]{16})\d[0-9 ]{14}\d\b
Details:
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
(?!) Negative Lookahead
[] Match a single character present in the list 0-9
{n} Matches exactly n times
\d matches a digit (equal to [0-9])
You can use this regex to enforce at least one space between the numbers:
\d+(?:\h+\d+)+
\d+: Match 1+ digits
(?:\h+\d+)+: Match 1+ groups of 1+ horizontal whitespace characters followed by 1+ digits
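Both answers can be compared side by side in Java, which supports \h for horizontal whitespace. The sketch below uses hypothetical class and helper names and the two sample strings from the question.

```java
import java.util.regex.Pattern;

class SpacedDigits {
    // Exactly 16 characters of digits and spaces, digit at both ends, not 16 plain digits.
    static final Pattern SIXTEEN_WITH_SPACES =
            Pattern.compile("\\b(?![0-9]{16})\\d[0-9 ]{14}\\d\\b");
    // Any length: digit groups with at least one horizontal-whitespace gap.
    static final Pattern DIGIT_GROUPS =
            Pattern.compile("\\d+(?:\\h+\\d+)+");

    // Hypothetical helper: true if the pattern occurs anywhere in the input.
    static boolean found(Pattern pattern, String input) {
        return pattern.matcher(input).find();
    }

    public static void main(String[] args) {
        String spaced = "12345 67890 1234";
        String plain = "1234567890123456";
        System.out.println(found(SIXTEEN_WITH_SPACES, spaced)); // true
        System.out.println(found(SIXTEEN_WITH_SPACES, plain));  // false
        System.out.println(found(DIGIT_GROUPS, spaced));        // true
        System.out.println(found(DIGIT_GROUPS, plain));         // false
    }
}
```

The first pattern keeps the 16-character requirement from the question, while the second one accepts digit groups of any length.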

Regex to exclude special characters Java

I want to write a regex to include: Letters, Digits, and Spaces but I want to exclude special characters like !'^+%&/()=?_-*£#$, etc.
I thought I could use [a-zA-Z] for letters, [0-9] for digits and \s for space characters.
[a-zA-Z0-9\s]
but the string I am trying to clear might have letters like é,ü,ğ,i,ç and so on.
I do not want these letters to be removed.
Is it possible to write such regex?
Yes, it is possible.
\p{L} matches any Unicode letter: a-z as well as letters like é,ü,ğ,i,ç
\d matches a digit (equal to [0-9])
\s matches a space, tab, carriage return, new line, vertical tab or form feed character
[\p{L}\d\s]+ should match one or more characters present in the list
Here you can see an example:
https://regex101.com/r/uQmu7a/1
If you want to do it without a regex, you can use Apache Commons Lang's StringUtils.isAlphanumericSpace(String str)
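Since the question is about clearing a string, one way to apply the class above in Java is to negate it and remove whatever it matches. This is a sketch under that assumption: the negated class, the class name and the helper names are not from the original answer, and the accented letters are written as Unicode escapes only to keep the source ASCII.

```java
import java.util.regex.Pattern;

class KeepLettersDigitsSpaces {
    // Negation of [\p{L}\d\s]: anything that is not a letter, digit or whitespace.
    static final Pattern UNWANTED = Pattern.compile("[^\\p{L}\\d\\s]+");

    // Hypothetical helper: removes every run of unwanted characters.
    static String clean(String input) {
        return UNWANTED.matcher(input).replaceAll("");
    }

    // True if the whole string consists only of letters, digits and whitespace.
    static boolean isClean(String input) {
        return input.matches("[\\p{L}\\d\\s]+");
    }

    public static void main(String[] args) {
        // "cafe" with e-acute, "gunes" with u-diaeresis and s-cedilla, plus special characters.
        String input = "caf\u00e9 & g\u00fcne\u015f #1!";
        System.out.println(isClean(input));        // false
        System.out.println(clean(input));          // accented letters survive, & # ! are gone
        System.out.println(isClean(clean(input))); // true
    }
}
```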
You could go a different way.
Note: these two regexes have to be run with the Unicode character class flag option.
There are two ways to go.
Using alnum and staying within the ASCII and extended-ASCII range. Note that U+011F ğ LATIN SMALL LETTER G WITH BREVE is outside the 0 - FF range in the regex below, so it won't get matched.
(?:\p{Alnum}(?<=[\x{00}-\x{FF}])|\s)+
Explained
(?:
    \p{Alnum}                  # Any alphanumeric Unicode character
    (?<= [\x{00}-\x{FF}] )     # In the U+0 - U+0FF codepoint range
  |                            # or,
    \s                         # Whitespace
)+
Or, you can go the Latin classes route, using the Latin blocks/script and staying within the alnum range.
(?:[\p{Block=Latin_1_Supplement}\p{Block=Latin_Extended_A}\p{Block=Latin_Extended_Additional}\p{Block=Latin_Extended_B}\p{Block=Latin_Extended_C}\p{Block=Latin_Extended_D}\p{Block=Basic_Latin}\p{Script=Latin}](?<=\p{Alnum})|\s)+
Expanded
(?:
[\p{Block=Latin_1_Supplement}\p{Block=Latin_Extended_A}\p{Block=Latin_Extended_Additional}\p{Block=Latin_Extended_B}\p{Block=Latin_Extended_C}\p{Block=Latin_Extended_D}\p{Block=Basic_Latin}\p{Script=Latin}]
(?<= \p{Alnum} )
|
\s
)+

Extract string without last char if vowel

With regular expressions, how can I extract the whole word except the last character if it is a vowel?
Inputs:
ansia
bello
ansid
Expected output for each:
ansi
bell
ansid
This is what I tried, but it only works if I have a single vowel at the end:
^(.*[^aeiou])
Similar to what @Sotirios Delimanolis wrote in his comment, but using word boundaries so it will work if you have multiple words in a line.
\b(\w+?)[aeiou]?\b
This works in the following way :
1) \b matches the start of a word. This will work for the first word on a line or a word preceded by a non-word character (a word character is any alpha-numeric character).
2) (\w+?) matches and captures the part of the word you care about.
2a) \w matches any word character.
2b) + makes the \w be matched one or more times
2c) ? makes the + match as few characters as possible. This is important because if there is a vowel at the end of the word we do not want to match it in the capturing group but instead let (3) take care of it.
3) [aeiou]? matches but does not capture a vowel character if one is present
3a) [aeiou] matches a vowel
3b) ? makes the [aeiou] be matched zero or one times
4) \b matches the end of the word. This will work for a word at the end of a line or a word followed by a non-word character.
You said that the tool you are using uses the Java regex implementation and ansid isn't working for you with my regex. I have tested it with pure Java and it seems to be working for me:
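import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The lazy \w+? leaves a trailing vowel, if there is one, to the optional [aeiou]?
Pattern pattern = Pattern.compile("\\b(\\w+?)[aeiou]?\\b");
Matcher matcher = pattern.matcher("ansia ansid cake cat dog");
while (matcher.find()) {
    System.out.println(matcher.group(1));
}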
prints
ansi
ansid
cak
cat
dog
Try the regex (\b[a-zA-Z]+?(?=[aeiou]\b))|(\b[a-zA-Z]+?[^aeiou]\b). This captures either a word ending in a consonant OR a word ending in a vowel, and omits the vowel at the end.
This pattern worked for me:
^(.*?)(?=[aeiou]$|$)
If the input can contain several words on one line, as pointed out below, use this pattern instead:
\b([a-z]+?)(?=[aeiou]\b|\b)
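Both patterns from this answer can be tried in Java on the inputs from the question. This is a sketch with hypothetical class and helper names. The first pattern expects a single word as the whole input, while the second scans a line of lowercase words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DropFinalVowel {
    // Whole input is one word: lazily take everything up to an optional final vowel.
    static final Pattern WHOLE_INPUT = Pattern.compile("^(.*?)(?=[aeiou]$|$)");
    // Several lowercase words on a line, each handled via word boundaries.
    static final Pattern PER_WORD = Pattern.compile("\\b([a-z]+?)(?=[aeiou]\\b|\\b)");

    // Hypothetical helper for single-word input.
    static String stripSingle(String word) {
        Matcher m = WHOLE_INPUT.matcher(word);
        return m.find() ? m.group(1) : word;
    }

    // Hypothetical helper for a line containing several words.
    static List<String> stripAll(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = PER_WORD.matcher(line);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(stripSingle("ansia"));          // ansi
        System.out.println(stripSingle("ansid"));          // ansid
        System.out.println(stripAll("ansia bello ansid")); // [ansi, bell, ansid]
    }
}
```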
