Java regex to match all words in a string - java

I am looking for a regex to match following pattern
(abc|def|ghi|abc+def+ghi|def+ghi)
essentially everything that's separated by | is an OR search
and everything joined with + all words must be present.
I have to construct the regex dynamically based on an input string in the above format.
I tried following for AND searches:
(?=.*?\babc\b)(?=.*?\bdef\b)(?=(.*?\bghi\b)
following for OR searches
.*(abc|def).*
Is there a single regex possible? any examples would help

(abc|def|ghi)
That will match every string that contains the words you're looking for.

AND searches
You list the following:
(?=.*?\babc\b)(?=.*?\bdef\b)(?=(.*?\bghi\b)
My version:
(?=.*?\babc\b)(?=.*?\bdef\b)(?=.*?\bghi\b).
Note that your version appears to have an extra ( before the ghi test.
Also note that I include a . at the end (match any single character) so that the regular expression actually consumes something; otherwise it consists only of lookaheads and can only produce a zero-length match.
OR searches
For a search for "abc" OR "def" I would use the following regular expression:
\babc\b|\bdef\b
OR
\b(?:abc|def)\b
Combined
So for your example of (abc|def|ghi|abc+def+ghi|def+ghi) the actual regular expression might look like this:
\babc\b|\bdef\b|\bghi\b|(?=.*?\babc\b)(?=.*?\bdef\b)(?=.*?\bghi\b).|(?=.*?\bdef\b)(?=.*?\bghi\b).
It's kind of a bad example, because it would match abc on its own through the first OR case, regardless of the requirement specified by the AND case in the middle.
Remember to specify the case sensitivity for the regular expression too.
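The AND and OR patterns described above can be sketched as two small builders. This is only an illustration: the class name AndOrRegexDemo and the helpers allOf and anyOf are hypothetical, and Pattern.quote is added on the assumption that the words should be treated as literals. Because find() accepts a zero-length match, this sketch leaves out the trailing dot.

```java
import java.util.regex.Pattern;

final class AndOrRegexDemo {
    // Builds "(?=.*?\bw1\b)(?=.*?\bw2\b)...": every word must occur somewhere in the input.
    static Pattern allOf(String... words) {
        StringBuilder sb = new StringBuilder();
        for (String w : words) {
            sb.append("(?=.*?\\b").append(Pattern.quote(w)).append("\\b)");
        }
        return Pattern.compile(sb.toString(), Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    }

    // Builds "\b(?:w1|w2|...)\b": at least one word must occur as a whole word.
    static Pattern anyOf(String... words) {
        StringBuilder sb = new StringBuilder("\\b(?:");
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                sb.append('|');
            }
            sb.append(Pattern.quote(words[i]));
        }
        sb.append(")\\b");
        return Pattern.compile(sb.toString(), Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        System.out.println(allOf("abc", "def", "ghi").matcher("ghi xx def yy abc").find()); // true
        System.out.println(allOf("abc", "def", "ghi").matcher("abc def").find());           // false
        System.out.println(anyOf("abc", "def").matcher("xx def").find());                   // true
        System.out.println(anyOf("abc", "def").matcher("abcdef").find());                   // false
    }
}
```

Each '+' group of the filter would map to allOf and each plain term to anyOf; the results can then be combined with a logical OR in Java rather than in one large regex.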

I wrote this sample method, match(String input, String searchFilter):
public static void main(String[] args) {
String input = " dsfsdf Invalid Locatio sdfsdff Invalid c Test1 xx Test2";
String searchFilter = "Invalid Pref Code|Invalid Location+Invalid company|Test|Test1+Test2";
System.out.println(match(input, searchFilter));
}
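/**
* @param input the text to search
* @param searchFilter the filter, where | separates OR alternatives and + joins words that must all be present
*/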
private static boolean match(String input, String searchFilter) {
List<String> searchParts = Arrays.asList(searchFilter.split("\\|"));
ArrayList<String> ms = new ArrayList<String>();
ArrayList<String> ps = new ArrayList<String>();
for (String pls : searchParts) {
if (pls.indexOf("+") > 0) {
ms.add(pls);
} else {
ps.add(pls);
}
}
ArrayList<String> patterns = new ArrayList<>();
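for (String msb : ms) {
StringBuilder sb = new StringBuilder();
for (String msbp : msb.trim().split("\\+")) {
// Pattern.quote keeps regex metacharacters in the search term literal
sb.append("(?=.*?\\b").append(Pattern.quote(msbp.trim())).append("\\b)");
}
// a single trailing dot, so that all lookaheads are evaluated from the same position
sb.append(".");
patterns.add(sb.toString());
}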
Pattern p = Pattern
.compile("\\b(?:" + StringUtils.join(ps, "|") + ")\\b|"+ StringUtils.join(patterns, "|"),
Pattern.CASE_INSENSITIVE);
return p.matcher(input).find();
}

assertTrue(Pattern.matches("\\((\\w+(\\||\\+))+\\w+\\)", "(abc|def|ghi|abc+def+ghi|def+ghi)"));

Related

Splitting a String that has a particular structure

I have a string that goes something like this
"330 Daniel T92435"
Now I need to obtain the name "Daniel", and I could simply just type
string.substring(4,11);
But the position where a name ("Daniel") is placed could vary.
And I don't want to use the split() method.
I was thinking if there was a way to make the substring method read data until a whitespace is found.
If the input string always has the structure "someSymbols Name someSymbols", you can use the following regular expression to extract the name:
"[^\\s]+\\s+(\\p{Alpha}+)\\s+[^\\s]+"
\\p{Alpha} - alphabetic character;
\\s - white space;
[^\\s] - any symbol apart from the white space.
In the code below, Pattern is an object representing the regular expression. Matcher, in turn, is an object that navigates over the given string and finds the parts of it that match the pattern.
public static String findName(String source) {
Pattern pattern = Pattern.compile("[^\\s]+\\s+(\\p{Alpha}+)\\s+[^\\s]+");
Matcher matcher = pattern.matcher(source);
String result = "no match was found";
if (matcher.find()) {
result = matcher.group(1); // group 1 corresponds to the first element enclosed in parentheses (\\p{Alpha}+)
}
return result;
}
main()
public static void main(String[] args) {
System.out.println(findName("330 Daniel T92435"));
}
Output
Daniel
You can use the str.indexOf(" ") function.
int start = string.indexOf(" ")+1;
string.substring(start,start + 7);
Edit: You can use
int start = string.indexOf(" ")+1;
int end = string.indexOf(" ", start+1);
string.substring(start,end >= 0 ? end : string.length());
if you want to select the first word and don't know how long it will be.
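The indexOf approach above can be wrapped into a small self-contained method. This is a minimal sketch; the class name SecondTokenDemo and the method tokenAfterFirstSpace are hypothetical, and returning an empty string when there is no space at all is an assumption about the desired behaviour.

```java
final class SecondTokenDemo {
    // Returns the text between the first space and the next space (or the end of the string).
    static String tokenAfterFirstSpace(String s) {
        int firstSpace = s.indexOf(' ');
        if (firstSpace < 0) {
            return ""; // assumption: no space means there is no second token
        }
        int start = firstSpace + 1;
        int end = s.indexOf(' ', start);
        return s.substring(start, end >= 0 ? end : s.length());
    }

    public static void main(String[] args) {
        System.out.println(tokenAfterFirstSpace("330 Daniel T92435")); // Daniel
        System.out.println(tokenAfterFirstSpace("330 Daniel"));        // Daniel
    }
}
```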

Regex pattern for String with multiple leading and trailing ones and zeroes

I have a search String which contains the format below:
Search String
111651311
111651303
4111650024
4360280062
20167400
It needs to be matched with sequence of numbers below
001111651311000
001111651303000
054111650024000
054360280062000
201674000000000
Please note that additional digits have been added on either side of the search strings.
I have tried the regex below in java to match the search strings but it only works for some.
Pattern pattern = Pattern.compile("([0-9])\1*"+c4MIDVal+"([0-9])\1*");
Any advice ?
Update
I have added the code I used below; it might provide some clarity on what I am trying to do.
Code Snippet
public void compare(String fileNameAdded, String fileNameToBeAdded){
List<String> midListAdded = readMID.readMIDAdded(fileNameAdded);
HashMap<String, String> midPairsToBeAdded = readMID.readMIDToBeAdded(fileNameToBeAdded);
List <String []> midCaptured = new ArrayList<String[]>();
for (Map.Entry<String, String> entry: midPairsToBeAdded.entrySet()){
String c4StoreKey = entry.getKey();
String c4MIDVal = entry.getValue();
Pattern pattern = Pattern.compile("([0-9]?)\\1*"+c4MIDVal+"([0-9]?)\\2*");
for (String mid : midListAdded){
Matcher match = pattern.matcher(mid);
// logger.info("Match Configured MID :: "+ mid+ " with Pattern "+"\\*"+match.toString()+"\\*");
if (match.find()){
midCaptured.add(new String []{ c4StoreKey +"-"+c4MIDVal, mid});
}
}
}
logger.info(midCaptured.size()+ " List of Configured MIDs ");
for (String [] entry: midCaptured){
logger.info(entry[0]+ "- "+entry[1] );
}
}
You need to refer to the second capturing group in the second part, and you also need to make the pattern inside each capturing group optional.
Pattern pattern = Pattern.compile("([0-9]?)\\1*"+c4MIDVal+"([0-9]?)\\2*");
What is the problem with using the String.contains() method?
"001111651311000".contains("111651311"); // true
"201674000000000".contains("111651311"); // false
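Applied to the data from the question, the contains() idea might look like the sketch below. The class name ContainsMatchDemo and the method matches are hypothetical, and the lists are typed in by hand here rather than read from files as in the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ContainsMatchDemo {
    // Pairs every search string with each padded number that contains it.
    static List<String> matches(List<String> searchStrings, List<String> paddedNumbers) {
        List<String> result = new ArrayList<>();
        for (String search : searchStrings) {
            for (String padded : paddedNumbers) {
                if (padded.contains(search)) {
                    result.add(search + " -> " + padded);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> search = Arrays.asList(
                "111651311", "111651303", "4111650024", "4360280062", "20167400");
        List<String> padded = Arrays.asList(
                "001111651311000", "001111651303000", "054111650024000",
                "054360280062000", "201674000000000");
        for (String line : matches(search, padded)) {
            System.out.println(line);
        }
    }
}
```

Since the regex in the question is used with find() and is not anchored, a plain substring test accepts the same inputs for these digit strings.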

Replace word with special characters from string in Java

I am writing a method which should replace all words that match ones from a list with '****'
characters. So far I have code which works, but all special characters are ignored.
I have tried "\\W" in my expression, but it looks like I didn't use it correctly, so I could use some help.
Here's code I have so far:
for(int i = 0; i < badWords.size(); i++) {
if (StringUtils.containsIgnoreCase(stringToCheck, badWords.get(i))) {
stringToCheck = stringToCheck.replaceAll("(?i)\\b" + badWords.get(i) + "\\b", "****");
}
}
E.g. I have list of words ['bad', '#$$'].
If I have a string: "This is bad string with #$$" I am expecting this method to return "This is **** string with ****"
Note that the method should be case-insensitive, e.g. TesT and test should be handled the same.
I'm not sure why you use StringUtils; you can just directly replace the words that match the bad words. This code works for me:
public static void main(String[] args) {
ArrayList<String> badWords = new ArrayList<String>();
badWords.add("test");
badWords.add("BadTest");
badWords.add("\\$\\$");
String test = "This is a TeSt and a $$ with Badtest.";
for(int i = 0; i < badWords.size(); i++) {
test = test.replaceAll("(?i)" + badWords.get(i), "****");
}
test = test.replaceAll("\\w*\\*{4}", "****");
System.out.println(test);
}
Output:
This is a **** and a **** with ****.
The problem is that these special characters e.g. $ are regex control characters and not literal characters. You'll need to escape any occurrence of the following characters in the bad word using two backslashes:
{}()\[].+*?^$|
My guess is that your list of bad words contains special characters that have particular meanings when interpreted in a regular expression (which is what the replaceAll method does). $, for example, typically matches the end of the string/line. So I'd recommend a combination of things:
Don't use containsIgnoreCase to identify whether a replacement needs to be done. Just let the replaceAll run each time - if there is no match against the bad word list, nothing will be done to the string.
Characters like $ that have special meanings in regular expressions should be escaped when they are added to the bad word list. For example, badWords.add("#\\$\\$");
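Instead of escaping each bad word by hand, the escaping can be delegated to Pattern.quote. The sketch below is one possible way to do it, not the asker's method: the class name BadWordFilterDemo and the method censor are hypothetical, and the lookarounds (?<!\w) and (?!\w) are an assumption standing in for \b, which does not behave as a word boundary next to characters like # or $.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class BadWordFilterDemo {
    // Replaces every whole-word, case-insensitive occurrence of a bad word with "****".
    static String censor(String text, List<String> badWords) {
        String result = text;
        for (String word : badWords) {
            // Pattern.quote makes the word a literal, so $, #, ! and friends need no manual escaping.
            Pattern p = Pattern.compile(
                    "(?<!\\w)" + Pattern.quote(word) + "(?!\\w)",
                    Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
            result = p.matcher(result).replaceAll("****");
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> badWords = Arrays.asList("bad", "#$$");
        System.out.println(censor("This is bad string with #$$", badWords));
        // This is **** string with ****
    }
}
```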
Try something like this:
String stringToCheck = "This is b!d string with #$$";
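List<String> badWords = Arrays.asList("b!d","#$$");
for(int i = 0; i < badWords.size(); i++) {
if (StringUtils.containsIgnoreCase(stringToCheck,badWords.get(i))) {
// Pattern.quote treats the word as a literal; a character class such as "[b!d]+" would match any run of those characters
stringToCheck = stringToCheck.replaceAll("(?i)" + Pattern.quote(badWords.get(i)),"****");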
}
}
System.out.println(stringToCheck);
Another solution: bad words matched with word boundaries (and case insensitive).
Pattern badWords = Pattern.compile("\\b(a|b|ĉĉĉ|dddd)\\b",
Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE);
String text = "adfsa a dfs bb addfdsaf ĉĉĉ adsfs dddd asdfaf a";
Matcher m = badWords.matcher(text);
StringBuffer sb = new StringBuffer(text.length());
while (m.find()) {
m.appendReplacement(sb, stars(m.group(1)));
}
m.appendTail(sb);
String cleanText = sb.toString();
System.out.println(text);
System.out.println(cleanText);
}
private static String stars(String s) {
return s.replaceAll("(?su).", "*");
/*
int cpLength = s.codePointCount(0, s.length());
final String stars = "******************************";
return cpLength >= stars.length() ? stars : stars.substring(0, cpLength);
*/
}
The commented-out code produces the stars with the correct count: one star per Unicode code point, even when that code point is encoded as a surrogate pair (two UTF-16 chars).

How to retrieve a value from a list of keys/values separated by pipes?

I have a string like:
String s = "a=xxx|b = yyy|c= zzz";
I am trying to write a function that returns the value corresponding to a given key but it does not work as expected (it returns an empty string):
static String getValueFromKey(String s, String key) {
return s.replaceAll(key + "\\s*=\\s*(.*?)(\\|)?.*", "$1");
}
Test:
static void test() {
String s = "a=xxx|b = yyy|c= zzz";
assertEquals(getValueFromKey(s, "a"), "xxx");
assertEquals(getValueFromKey(s, "b"), "yyy");
assertEquals(getValueFromKey(s, "c"), "zzz");
}
What regex do I need to pass the tests?
Using replaceAll here seems like overkill, because the method has to iterate over the entire String. Instead you could use Matcher and its find method, which stops after matching the searched regex (in our case a key=value pair).
So maybe use something like:
static String getValueFromKey(String s, String key) {
Matcher m = Pattern.compile(
"(?<=^|\\|)\\s*" + Pattern.quote(key) + "\\b\\s*=\\s*(?<value>[^|]*)")
.matcher(s);
if (m.find())
return m.group("value");
else
return null;// or maybe return empty String "" but that may be misleading
// for values which are really empty Strings
}
You can use this regex for matching:
\W*(\w+)\W*=\W*([^|]*)
Code:
static void test() {
String s = "a=xxx|b = yyy|c= zzz";
Pattern pattern = Pattern.compile("\\W*(\\w+)\\W*=\\W*([^|]*)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
System.out.println(matcher.group(1) + " :: " + matcher.group(2));
}
}
Output:
a :: xxx
b :: yyy
c :: zzz
If your strings look like that, you can use a simple regex like this (X stands for the key):
X\s*=\s*(\w*)
As Michelle pointed out in a comment, if you have both apple and pineapple as keys, you can use word boundaries to avoid trouble:
\bapple\b\s*=\s*(\w*)
According to the question, it seems that you only want to change the regex and keep the same code structure. So you have to make sure that your regex matches the whole string, otherwise the result will still contain the other data. So even if there are better ways to accomplish this, the answer is:
s.replaceAll(".*?\\b" + key + "\\s*=\\s*(.*?)($|\\|.*)", "$1");
This is pretty easy to do without regular expressions:
static String getValueFromKey(String s, String key) {
// split takes a regex, so the '|' metacharacter must be escaped
String[] pairs = s.split("\\|");
for (String p : pairs) {
String[] halves = p.split("=", 2);
// trim so that "b = yyy" yields key "b" and value "yyy"
if (halves.length == 2 && halves[0].trim().equals(key)) {
return halves[1].trim();
}
}
return "";
}
This code works by splitting the String s on the '|' character, creating an array of Strings of the form "key=value". It then splits each of these key-value pairs on the '=' character until it finds key, and returns the associated value.
Is there a specific reason why you need to use a regular expression rather than an approach such as this?
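If several keys are looked up in the same string, the split-based idea can be taken one step further by parsing everything into a map once. This is a minimal sketch under that assumption; the class name KeyValueParserDemo and the method parse are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class KeyValueParserDemo {
    // Parses "a=xxx|b = yyy|c= zzz" into an ordered map of trimmed keys and values.
    static Map<String, String> parse(String s) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String pair : s.split("\\|")) {        // '|' is a regex metacharacter, hence the escape
            String[] halves = pair.split("=", 2);   // limit 2 keeps any '=' inside the value
            if (halves.length == 2) {
                map.put(halves[0].trim(), halves[1].trim());
            }
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, String> map = parse("a=xxx|b = yyy|c= zzz");
        System.out.println(map.get("a")); // xxx
        System.out.println(map.get("b")); // yyy
        System.out.println(map.get("c")); // zzz
    }
}
```

A missing key yields null from Map.get, which avoids confusing an absent key with a value that really is an empty string.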

How to identify string pattern within a string but ignore if the match falls inside of identified pattern

I want to search a string for occurrences of a string that matches a specific pattern.
Then I will write that unique list of found strings separated by commas.
The pattern is to look for "$FOR_something" as long as that pattern does not fall inside of "#LOOKING( )" or "/* */" and the _something part does not have any other special characters.
For example, if I have this string,
"Not #LOOKING( $FOR_one $FOR_two) /* $FOR_three */ not $$$FOR_four or $FOR_four_b, but $FOR_five; and $FOR_six and not $FOR-seven or $FOR_five again"
The resulting list of found patterns I'm looking for from the above quoted string would be:
$FOR_five, $FOR_six
I started with this example:
import java.lang.StringBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class testIt {
public static void main(String args[]) {
String myWords = "Not #LOOKING( $FOR_one $FOR_two) /* $FOR_three */ not $$$FOR_four or $FOR_four_b, but $FOR_five; and $FOR_six and not $FOR-seven or $FOR_five again";
StringBuffer sb = new StringBuffer(0);
if ( myWords.toUpperCase().contains("$FOR") )
{
Pattern p = Pattern.compile("\\$FOR[\\_][a-zA-Z_0-9]+[\\s]*", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(myWords);
String myFors = "";
while (m.find())
{
myFors = myWords.substring( m.start() , m.end() ).trim();
if ( sb.length() == 0 ) sb = sb.append(myFors);
else
{
if ( !(sb.toString().contains(myFors))) sb = sb.append(", " + myFors );
}
}
}
System.out.println(sb);
}
}
But it is not giving me what I want. What I want is:
$FOR_five, $FOR_six
Instead, I get all of the $FOR_somethings. I don't know how to ignore the occurrences inside of the /**/ or the #LOOKING().
Any suggestions?
This problem goes beyond what a single regular expression handles comfortably, I would say. The $$$ case can be fixed with a negative lookbehind; the others are not as easy.
What I would recommend is to first use tokenizing or manual string parsing to discard the unwanted regions, such as /* ... */ or #LOOKING( ... ). These could also be removed with further regexes such as:
myWords = myWords.replaceAll("/\\*[^*/]+\\*/", ""); // removes /* ... */
myWords = myWords.replaceAll("#LOOKING\\([^)]+\\)", ""); // removes #LOOKING( ... )
Once the context-dependent content has been stripped, you can use, e.g., the following regex:
(?<!\\$)\\$FOR_\\p{Alnum}+(?=[\\s;])
Explanation:
(?<!\\$) // Match iff not prefixed with $
\\$FOR_ // Matches $FOR_
\\p{Alnum}+ // Matches one or more alphanumericals [a-zA-Z0-9]
(?=[\\s;]) // Match iff followed by space or ';'
Note that the (?<!...) and (?=...) constructs used here are lookbehind and lookahead expressions, which are not included in the match itself. They act only as prefix/suffix conditions in the above sample.
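Putting the two steps together, a sketch of the whole approach could look like this. The class name ForTokenDemo and the method extract are hypothetical, and the comment-stripping regex uses a reluctant (?s)/\*.*?\*/ instead of the character class above, on the assumption that comments may contain * or / characters.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ForTokenDemo {
    // $FOR_ followed by alphanumerics, not preceded by '$', and followed by whitespace or ';'
    private static final Pattern FOR_TOKEN =
            Pattern.compile("(?<!\\$)\\$FOR_\\p{Alnum}+(?=[\\s;])");

    // Returns the unique $FOR_ tokens outside /* */ and #LOOKING( ), joined with ", ".
    static String extract(String text) {
        String stripped = text
                .replaceAll("(?s)/\\*.*?\\*/", "")       // drop /* ... */ blocks
                .replaceAll("#LOOKING\\([^)]*\\)", "");  // drop #LOOKING( ... ) blocks
        Set<String> found = new LinkedHashSet<>();       // keeps first-seen order, removes duplicates
        Matcher m = FOR_TOKEN.matcher(stripped);
        while (m.find()) {
            found.add(m.group());
        }
        return String.join(", ", found);
    }

    public static void main(String[] args) {
        String myWords = "Not #LOOKING( $FOR_one $FOR_two) /* $FOR_three */ not $$$FOR_four "
                + "or $FOR_four_b, but $FOR_five; and $FOR_six and not $FOR-seven or $FOR_five again";
        System.out.println(extract(myWords)); // $FOR_five, $FOR_six
    }
}
```

Note that a token at the very end of the input would not be found with this lookahead, since it requires a following whitespace character or ';'.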
