Is it possible to improve the following with a regular expression? The code is functional, but I want to know if there is a way to match every dash character that exists in Unicode, so that I can standardize my dashes.
Words:
48553−FS002
48553-FS002
48553 FS002
48553-FS002-ESD12
Java
String reference = "48553−FS002";
// Reduce the first "word, separator, word" run to just its separator character
String separator = reference.replaceFirst("\\w+(\\W)?\\w+", "$1");
if (!separator.equals(" ")) {
    reference = reference.replaceAll(separator, "-");
}
Or I could search by Unicode code point. I was reading the following: dash, and Java Regex Unicode, but I haven't managed to make it work.
If you need to match any non-word but space, you may use
reference = reference.replaceAll("[^\\w ]", "-");
Or, with character class subtraction:
reference = reference.replaceAll("[\\W&&[^ ]]", "-");
You can use the following pattern to match your hyphen or dash like patterns:
[\p{Pd}\u00AD\u2212]
Here,
\p{Pd} - matches any Punctuation, Dash symbols
\u00AD - matches a soft hyphen
\u2212 - matches a minus symbol.
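A minimal sketch of how this character class might be applied to the sample references from the question. The class name DashNormalizer and the method normalizeDashes are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

class DashNormalizer {
    // Any Unicode dash punctuation, plus the soft hyphen and the minus sign
    private static final Pattern DASHES = Pattern.compile("[\\p{Pd}\\u00AD\\u2212]");

    // Hypothetical helper: replaces every dash-like character with a plain '-'
    static String normalizeDashes(String reference) {
        return DASHES.matcher(reference).replaceAll("-");
    }

    public static void main(String[] args) {
        System.out.println(normalizeDashes("48553\u2212FS002")); // minus sign
        System.out.println(normalizeDashes("48553\u2013FS002")); // en dash
        System.out.println(normalizeDashes("48553 FS002"));      // the space is left alone
    }
}
```

Compiling the pattern once is only a convenience here; calling replaceAll on the string with the same pattern should behave the same way.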
If you know your strings only contain word characters and separators, as seems to be the case, then you can just use
reference = reference.replaceAll("[^ \\w]", "-");
Related
I want to replace one string inside a larger string, but I guess my regular expression is not correct, so it's not working.
Main string is
Some sql part which is to be replaced
cond = emp.EMAIL_ID = 'xx#xx.com' AND
emp.PERMANENT_ADDR LIKE('%98n%')
AND hemp.EMPLOYEE_NAME = 'xxx' and is_active='Y'
String to find and replace is
Based on some condition sql part to be replaced
hemp.EMPLOYEE_NAME = 'xxx'
I have tried this using the Pattern and Matcher classes:
Pattern pat1 = Pattern.compile("/^hemp.EMPLOYEE_NAME\\s=\\s\'\\w\'\\s[and|or]*/$", Pattern.CASE_INSENSITIVE);
Matcher mat = pat1.matcher(cond);
while (mat.find()) {
System.out.println("Match: " + mat.group());
cond = mat.replaceFirst("xx "+mat.group()+"x");
mat = pat1.matcher(cond);
}
It's not working, not entering the loop at all. Any help is appreciated.
Obviously not - your regexp pattern doesn't make any sense.
The opening /: In some languages, regexps aren't strings and start with an opening slash. Java is not one of those languages, and it has nothing to do with regexps itself. So, this looks for a literal slash in that SQL, which isn't there, thus, failure.
^ is regexpese for 'start of string'. Your string does not start with hemp.EMPLOYEE_NAME, so that also doesn't work. Get rid of both / and ^ here.
\\s is one whitespace character (there are many whitespace characters; this matches any one of them, but exactly one). Your string happens to have a single space on each side of the =, but SQL like this often has none or several. Your intent, surely, was \\s*, which matches 0 to many of them, i.e. \\s* means "whitespace is allowed here", while \\s means "there must be exactly one whitespace character here". Make every \\s in your regexp an \\s*.
\\w is exactly one 'word' character (a letter, digit, or underscore). The value 'xxx' has three of them, so you obviously wanted \\w*.
[and|or] this is regexpese for: "An a, or an n, or a d, or an o, or an r, or a pipe symbol". Clearly you were looking for (and|or) which is regexpese for: Either the sequence "and", or the sequence "or".
* - so you want 0 to many 'and' or 'or', which makes no sense.
closing slash: You don't want this.
closing $: You don't want this - it means 'end of string'. Your string didn't end here.
The code itself:
replaceFirst, itself, also does regexps. You don't want to double apply this stuff. That's not how you replace a found result.
This is what you wanted:
Matcher mat = pat1.matcher(cond);
cond = mat.replaceFirst("replacement goes here");
where the replacement can include references to groups in the match (such as $0 or $1) if you want to take parts of what you matched (i.e. don't use mat.group(), use those references). Note that replaceFirst returns the new string, so assign the result.
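Putting the corrections above together, a sketch could look like this. The class name ReplaceCondition and the helper wrapFirst are hypothetical, the replacement text mirrors the one in the question, and escaping the dot is an extra assumption not listed above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceCondition {
    // No slashes, no anchors, \s* instead of \s, \w* instead of \w, (and|or) instead of [and|or]
    private static final Pattern NAME_COND = Pattern.compile(
            "hemp\\.EMPLOYEE_NAME\\s*=\\s*'\\w*'\\s*(and|or)",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: wraps the first matched condition; $0 refers to the whole match
    static String wrapFirst(String cond) {
        Matcher mat = NAME_COND.matcher(cond);
        return mat.replaceFirst("xx $0x");
    }

    public static void main(String[] args) {
        String cond = "emp.EMAIL_ID = 'xx#xx.com' AND\n"
                + "emp.PERMANENT_ADDR LIKE('%98n%')\n"
                + "AND hemp.EMPLOYEE_NAME = 'xxx' and is_active='Y'";
        System.out.println(wrapFirst(cond));
    }
}
```

No loop is needed: a single replaceFirst call handles the first occurrence, and replaceAll would handle every occurrence.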
More generally did you read any regexp tutorial, did any testing, or did any reading of the javadoc of Pattern and Matcher?
I've been developing for a few years. It's just personal experience, perhaps, but, reading is pretty fundamental.
Instead of the anchors ^ and $, you can use word boundaries \b to prevent a partial match.
If you want to match spaces on the same line, you can use \h to match horizontal whitespace char, as \s can also match a newline.
You can use replaceFirst on the string using $0 to get the full match, and an inline modifier (?i) for a case insensitive match.
Note that [and|or] is a character class that matches any one of the listed characters, so use a group such as (?:and|or) instead. Also escape the dot to match it literally; otherwise . matches any char except a newline.
(?i)\bhemp\.EMPLOYEE_NAME\h*=\h*'\w+'\h+(?:and|or)\b
See a regex demo or a Java demo
For example
String regex = "(?i)\\bhemp\\.EMPLOYEE_NAME\\h*=\\h*'\\w+'\\h+(?:and|or)\\b";
String string = "cond = emp.EMAIL_ID = 'xx#xx.com' AND\n"
+ "emp.PERMANENT_ADDR LIKE('%98n%') \n"
+ "AND hemp.EMPLOYEE_NAME = 'xxx' and is_active='Y'";
System.out.println(string.replaceFirst(regex, "xx$0x"));
Output
cond = emp.EMAIL_ID = 'xx#xx.com' AND
emp.PERMANENT_ADDR LIKE('%98n%')
AND xxhemp.EMPLOYEE_NAME = 'xxx' andx is_active='Y'
I have a string which needs to be split on a delimiter (:). This delimiter can be escaped by a character (say '?'), and the delimiter can be preceded by any number of escape characters. Consider the example string below:
a:b?:c??:d???????:e
Here, after the split, it should give the below list of string:
a
b?:c??
d???????:e
Basically, if the delimiter (:) is preceded by even number of escape characters, it should split. If it is preceded by odd number of escape characters, it should not split. Is there a solution to this with regex?
Any help would be greatly appreciated.
A similar question has been asked earlier here, but the answers do not work for this use case.
Update:
The solution with the regex (?:\?.|[^:?])* correctly splits the string. However, it also yields a few empty strings. If + is used instead of *, the genuinely empty tokens are ignored as well (e.g. a::b gives only a, b).
Scenario 1: No empty matches
You may use
(?:\?.|[^:?])+
Or, following the pattern in the linked answer
(?:\?.|[^:?]++)+
See this regex demo
Details
(?: - start of a non-capturing group
\?. - a ? (the escape char) followed by any char
| - or
[^:?] - any char but the : (your delimiter char) and ? (the escape char)
)+ - 1 or more repetitions.
In Java:
String regex = "(?:\\?.|[^:?]++)+";
In case the input contains line breaks, prepend the pattern with (?s) (like (?s)(?:\\?.|[^:?])+) or compile the pattern with Pattern.DOTALL flag.
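As a rough sketch, the pattern can be driven with Matcher.find to collect the tokens. The class name EscapedSplit and the method tokens are hypothetical names, and the (?s) flag is included on the assumption that the input may contain line breaks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedSplit {
    // A token is a run of escaped pairs ("?" plus any char) and plain chars
    private static final Pattern TOKEN = Pattern.compile("(?s)(?:\\?.|[^:?]++)+");

    // Hypothetical helper: collects every non-empty token in order
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [a, b?:c??, d???????:e]
        System.out.println(tokens("a:b?:c??:d???????:e"));
    }
}
```

Because escape characters are always consumed in pairs, a : is only reached as a delimiter when an even number of ? precede it.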
Scenario 2: Empty matches included
You may add (?<=:)(?=:) alternative to the above pattern to match empty strings between : chars, see this regex demo:
String s = "::a:b?:c??::d???????:e::";
Pattern pattern = Pattern.compile("(?>\\?.|[^:?])+|(?<=:)(?=:)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
System.out.println("'" + matcher.group() + "'");
}
Output of the Java demo:
''
'a'
'b?:c??'
''
'd???????:e'
''
Note that if you want to also match empty strings at the start/end of the string, use (?<![^:])(?![^:]) rather than (?<=:)(?=:).
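A small sketch of the start/end variant, assuming the same token pattern as above. The class name EmptyTokenSplit and the method tokens are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyTokenSplit {
    // Non-empty token, or an empty spot with no non-colon char on either side
    private static final Pattern TOKEN =
            Pattern.compile("(?>\\?.|[^:?])+|(?<![^:])(?![^:])");

    // Hypothetical helper: collects tokens, including empty ones at the edges
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("a::b")); // an empty token between the two colons
        System.out.println(tokens(":a:"));  // empty tokens at the start and at the end
    }
}
```

One caveat worth checking against your own data: the lookbehind cannot tell an escaped : from a real one, so an escaped delimiter directly before a real delimiter (as in b?::c) appears to produce an extra empty token.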
I am trying to learn regular expressions and am trying to replace punctuation in a string with whitespace, using regular expressions, before feeding it into a tokenizer. The string might contain many punctuation characters. However, I do not want to replace an apostrophe or hyphen that sits inside a word.
For example,
six-pack => six-pack
He's => He's
This,that => This that
I tried to replace all the punctuations with whitespace initially but that would not work.
I tried to replace only those punctuations by specifying the wordboundaries as in
\B[^\p{L}\p{N}\s]+\B|\b[^\p{L}\p{N}\s]+\B|\B[^\p{L}\p{N}\s]+\b
But, I am not able to exclude the hyphen and apostrophe from them.
My guess is that the above regex is also very cumbersome and there should be a better way. Is there any?
So, all I am trying to do is:
Replace all punctuations with whitespace
Do not do the above if they are hyphen/apostrophe
Do replace if the hyphen/apostrophe does occur at start/end of a word.
Any help is appreciated.
You can probably work out a set of punctuation characters that are ok between words, and another set that isn't, then define your regular expression based on that.
For instance:
String[] input = {
    "six-pack",  // => six-pack
    "He's",      // => He's
    "This,that"  // => This that
};
for (String s: input) {
System.out.println(s.replaceAll("(?<=\\w)[\\p{Punct}&&[^'-]](?=\\w)", " "));
}
Output
six-pack
He's
This that
Note
Here I'm defining the pattern as a character class containing all POSIX punctuation characters except ' and - (removed through the nested negated class), which must be preceded and followed by a word character.
You can use this lookahead based regex:
(?!((?!^)['-].))\\p{Punct}
RegEx Demo
You could use negative lookahead assertion like below,
String s = "six-pack\n"
+ "He's\n"
+ "This,that";
System.out.println(s.replaceAll("(?m)^['-]|['-]$|(?!['-])\\p{Punct}", " "));
Output:
six-pack
He's
This that
Explanation:
(?m) Multiline Mode
^['-] Matches ' or - which are at the start.
| OR
['-]$ Matches ' or - which are at the end of the line.
| OR
(?!['-])\\p{Punct} Matches all the punctuations except these two ' or - . It won't touch the matched [-'] symbols (ie, at the start and end).
RegEx Demo
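The answers above treat a hyphen or apostrophe as an edge case only at the start or end of the string or line. As a hedged sketch of the third requirement in the question (replace them at the start or end of a word as well), lookarounds on letters and digits could be used. The class name PunctuationCleaner, the method clean, and the choice of \p{L} and \p{N} as "word" characters are assumptions made for illustration.

```java
class PunctuationCleaner {
    // ' or - not preceded by a letter/digit, or not followed by one,
    // or any other ASCII punctuation character
    private static final String EDGE_OR_OTHER_PUNCT =
            "(?<![\\p{L}\\p{N}])['-]|['-](?![\\p{L}\\p{N}])|[\\p{Punct}&&[^'-]]";

    // Hypothetical helper: replaces each matched character with a single space
    static String clean(String text) {
        return text.replaceAll(EDGE_OR_OTHER_PUNCT, " ");
    }

    public static void main(String[] args) {
        System.out.println(clean("six-pack"));
        System.out.println(clean("He's"));
        System.out.println(clean("This,that"));
        System.out.println(clean("'quoted' well-"));
    }
}
```

A hyphen or apostrophe survives only when it has a letter or digit on both sides; everything else matched by \p{Punct} becomes a space.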
I need to find two types of instances when there is a "[" character using regular expressions:
When the "[" character is followed by a number.
When the "[" character is followed by letters.
In Java I have tried:
Pattern firstinstance = Pattern.compile("\\[abcdefgABCDEFG");
Pattern secondinstance = Pattern.compile("\\[[0-9]");
These however, don't really seem to work. Do you guys have any possible suggestions?
The first instance is when the "[" character is followed by a number.
Any decimal digit in any script:
"\\[\\p{Nd}"
Any digit in 0-9 only:
"\\[\\d"
"\\[[0-9]"
The second instance is when the "[" character is followed by letters.
Any letter in any script:
"\\[\\p{L}"
Only letters in A-Z or a-z:
"\\[[A-Za-z]"
Pattern firstinstance = Pattern.compile("\\[[a-zA-Z]+");
Pattern secondinstance = Pattern.compile("\\[[0-9]+");
Pattern first = Pattern.compile("[[][0-9]");
Pattern second = Pattern.compile("[[][A-Za-z]+");
Regular expressions are very simple to understand. Have a look at Basic Concepts
In Java, you need to escape your escape characters (a consequence of the pattern being defined as a string literal). So you would use the code
Pattern firstinstance = Pattern.compile("\\[[0-9]");
Pattern secondinstance = Pattern.compile("\\[[a-zA-Z]");
Those strings are read as
\[[0-9]
and
\[[a-zA-Z]
which are the regular expression you want.
Note, to get a literal backslash in the regex you need to use 4 backslashes \\\\.
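To tie the answers in this thread together, here is a minimal sketch that checks a string for each of the two cases with Matcher.find. The class name BracketCheck and both method names are hypothetical. It uses the Unicode-aware classes \p{Nd} and \p{L}; swap in [0-9] and [A-Za-z] if only ASCII should count.

```java
import java.util.regex.Pattern;

class BracketCheck {
    // "[" followed by a decimal digit in any script
    private static final Pattern BRACKET_DIGIT = Pattern.compile("\\[\\p{Nd}");
    // "[" followed by a letter in any script
    private static final Pattern BRACKET_LETTER = Pattern.compile("\\[\\p{L}");

    // find() looks for the pattern anywhere in the input
    static boolean hasBracketDigit(String s) {
        return BRACKET_DIGIT.matcher(s).find();
    }

    static boolean hasBracketLetter(String s) {
        return BRACKET_LETTER.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasBracketDigit("values[0]"));    // true
        System.out.println(hasBracketLetter("values[0]"));   // false
        System.out.println(hasBracketLetter("values[key]")); // true
    }
}
```

Note that matches() would require the whole string to fit the pattern, which is why find() is used here.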
I'm new to regular expressions in Java and I need to validate if a string has alphanumeric chars, commas, apostrophes and full stops (periods) only. Anything else should equate to false.
Can anyone give any pointers?
I have this at the moment which I believe does alphanumerics for each char in the string:
Pattern p = Pattern.compile("^[a-zA-Z0-9_\\s]{1," + s.length() + "}");
Thanks
Mr Albany Caxton
I'm new to regular expressions in Java and I need to validate if a string has alphanumeric chars, commas, apostrophes and full stops (periods) only.
I suggest you use the \p{Alnum} class to match alpha-numeric characters:
Pattern p = Pattern.compile("[\\p{Alnum},.']*");
(I noticed that you included \s in your current pattern. If you want to allow white-space too, just add \s in the character class.)
From documentation of Pattern:
[...]
\p{Alnum} An alphanumeric character:[\p{Alpha}\p{Digit}]
[...]
You don't need to include ^ and {1, ...}. Just use methods like Matcher.matches or String.matches to match the full pattern.
Also, note that you don't need to escape . within a character class ([...]).
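A short sketch of the validation using Matcher.matches, which succeeds only if the entire input fits the pattern. The class name AllowedCharsValidator and the method isValid are hypothetical names.

```java
import java.util.regex.Pattern;

class AllowedCharsValidator {
    // ASCII letters and digits, plus comma, full stop and apostrophe
    private static final Pattern ALLOWED = Pattern.compile("[\\p{Alnum},.']*");

    // matches() must consume the whole string, so no ^ or $ is needed
    static boolean isValid(String s) {
        return ALLOWED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("O'Neil,Jr."));  // true
        System.out.println(isValid("two words"));   // false, space is not allowed
        System.out.println(isValid("semi;colon"));  // false
    }
}
```

Because the quantifier is *, the empty string is accepted too; use + instead if at least one character should be required.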
Pattern p = Pattern.compile("^[a-zA-Z0-9_\\s\\.,']{1," + s.length() + "}$");
Keep it simple:
String x = "some string";
boolean matches = x.matches("^[\\w.,']*$");