How to remove a specific special character pattern from a string - java

I have a string named s:
String s = "<NOUN>Sam</NOUN> , a student of the University of oxford , won the Ethugalpura International Rating Chess Tournament which concluded on Dec.22 at the Blue Olympiad Hotel";
I want to remove all <NOUN> and </NOUN> tags from the string. I used this to remove tags,
s.replaceAll("[<NOUN>,</NOUN>]","");
It does remove the tags, but it also removes the letters 'U' and 'O' from the rest of the string, which gives me the following output:
Sam , a student of the niversity of oxford , won the Ethugalpura International Rating Chess Tournament which concluded on Dec.22 at the Blue lympiad Hotel
Can anyone please tell me how to do this correctly?

Try:
s = s.replaceAll("<NOUN>|</NOUN>", "");
In a regex, the syntax [...] is a character class: it matches any single character listed inside the brackets, regardless of the order they appear in. Therefore, in your example, every occurrence of "<", "N", "O", "U" etc. is removed. Instead, use the pipe (|) to match either "<NOUN>" or "</NOUN>". Note that replaceAll returns a new string, so the result has to be assigned.
The following should also work (and could be considered more DRY and elegant), since it matches the tag both with and without the forward slash:
s = s.replaceAll("</?NOUN>", "");
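To make the difference concrete, here is a minimal sketch comparing the character class from the question with the optional-slash pattern. The class and method names (NounTagDemo, stripWithCharClass, stripNounTags) are made up for illustration, and the input is a shortened version of the question's string.

```java
final class NounTagDemo {
    // The character class from the question: removes every single
    // '<', 'N', 'O', 'U', '>', ',' and '/' wherever it occurs.
    static String stripWithCharClass(String s) {
        return s.replaceAll("[<NOUN>,</NOUN>]", "");
    }

    // Matches the literal tag text, with an optional '/' after '<'.
    static String stripNounTags(String s) {
        return s.replaceAll("</?NOUN>", "");
    }

    public static void main(String[] args) {
        String s = "<NOUN>Sam</NOUN> of the University";
        System.out.println(stripWithCharClass(s)); // Sam of the niversity
        System.out.println(stripNounTags(s));      // Sam of the University
    }
}
```

The lowercase "o" in "of" survives the character class because matching is case-sensitive by default.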

String.replaceAll() takes a regular expression as its first argument. The regexp:
"[<NOUN>,</NOUN>]"
defines within the brackets the set of characters to be matched and thus removed. So you're asking to remove the characters <, >, /, N, O, U and the comma.
Perhaps the simplest method to do what you want is to do:
s.replaceAll("<NOUN>","").replaceAll("</NOUN>","");
which is explicit in what it's removing. More complex regular expressions are obviously possible.
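Since both tags are fixed text, a regex is not strictly needed here. As a sketch of that alternative, String.replace(CharSequence, CharSequence) replaces every occurrence of a literal and does not interpret its argument as a pattern. The class and method names below are hypothetical.

```java
final class LiteralReplaceDemo {
    // String.replace treats its arguments as plain text, so no
    // character has a special meaning and nothing needs escaping.
    static String stripNounTags(String s) {
        return s.replace("<NOUN>", "").replace("</NOUN>", "");
    }

    public static void main(String[] args) {
        System.out.println(stripNounTags("<NOUN>Sam</NOUN> won")); // Sam won
    }
}
```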

You can use one regular expression for this: "<[/]*NOUN>"
so
s.replaceAll("<[/]*NOUN>","");
should do the trick. The "[/]*" matches zero or more "/" after the "<".

Try this, which removes every tag of the form <...>, not only the NOUN tags:
String result = originValue.replaceAll("\\<.*?>", "");
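A small sketch of what that generic pattern does, assuming the input may contain tags other than NOUN (the <b> tag and the names AllTagsDemo and stripAllTags are invented for the example):

```java
final class AllTagsDemo {
    // "<.*?>" lazily matches from a '<' to the nearest following '>',
    // so any tag name is removed, not just NOUN.
    static String stripAllTags(String s) {
        return s.replaceAll("\\<.*?>", "");
    }

    public static void main(String[] args) {
        System.out.println(stripAllTags("<NOUN>Sam</NOUN> is <b>here</b>")); // Sam is here
    }
}
```

If only the NOUN tags should go, the narrower "</?NOUN>" pattern is the safer choice.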

Related

Regex to remove initials from full name

I have names like "D John Livingston" and "S. Jennifer Adstan", and I want only the initials to be removed from the names: "D" in the first name and "S." in the second. How can I do this using a Java regex?
The following code snippet seems to be working well:
String input = "John O'Connel";
input = input.replaceAll("\\b[A-Z]+(?:\\.|\\s+|$)", "").trim();
System.out.println(input);
John O'Connel
Your question is chock full of edge cases, since an initial could be, for example, more than one letter, and could appear at the start, middle, or end of the name. I replaced using the pattern \b[A-Z]+(?:\.|\s+|$), which seems to at least cover your examples. I also call String#trim() to clean up whitespace left behind by initials at the very beginning or end.
For this I would consider using String replaceAll().
So how do we design the regex?
Basically there are three cases you need to consider:
A. a single letter at the beginning of the name (optional period), followed by one space
B. a single letter at the end of the name (optional period), preceded by one space
C. a single letter in the middle of the name (optional period), surrounded by two spaces
For the first two cases, you need to leave no spaces. So you would match one space and replace it with zero spaces.
For the last case, you need to leave one space. However, rather than handling this case explicitly, you may treat it as either A or B, since those will replace only one of the two spaces, leaving you with the desired number of spaces: 1.
So how do we combine case A and case B? With the alternation symbol |.
To prevent grabbing a single letter from a longer run of letters, you can use the word boundary marker \b on the side which is not delimited by a space character. (Normally for cases A and B, I would have used ^ and $ to explicitly match the beginning and end of the string. However, since we also need to handle case C in the middle of the string, we should use the word boundary marker instead.)
And how do we represent the optional period? Since the period is a special character, it must be escaped: \. Then it is marked as optional with a question mark: \.? However, there's still the problem that the A. in the middle of a name might be matched as just A, since the position before the period also counts as a word boundary. To prevent this, we make the optional period possessive: \.?+
Putting all of this together, our regex would be: (\b[A-Z]\.?+ )|( [A-Z]\.?+\b)
However, in a Java string literal each backslash must itself be escaped, so every \ appears as \\
Example code:
String pattern = "(\\b[A-Z]\\.?+ )|( [A-Z]\\.?+\\b)";
String input1 = "MC Hammer I Smash U";
String input2 = "S. Jennifer A. Adstan JR.";
System.out.println(input1.replaceAll(pattern, ""));
System.out.println(input2.replaceAll(pattern, ""));
Output:
MC Hammer Smash
Jennifer Adstan JR.
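If the pattern is applied to many names, it could be compiled once and reused. The following is only a sketch of that idea around the same regex. The class name InitialsRemover and the method removeInitials are hypothetical.

```java
import java.util.regex.Pattern;

final class InitialsRemover {
    // Case A: boundary, one capital, optional possessive period, one space.
    // Case B: one space, one capital, optional possessive period, boundary.
    private static final Pattern INITIAL =
            Pattern.compile("(\\b[A-Z]\\.?+ )|( [A-Z]\\.?+\\b)");

    static String removeInitials(String name) {
        return INITIAL.matcher(name).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(removeInitials("D John Livingston"));  // John Livingston
        System.out.println(removeInitials("S. Jennifer Adstan")); // Jennifer Adstan
        System.out.println(removeInitials("John O'Connel"));      // John'Connel
    }
}
```

The last line shows a limitation: an apostrophe also creates a word boundary, so the "O" in "O'Connel" is treated as an initial. Names like that would need extra handling.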

Find and replace characters in brackets

I have a string kind of:
String text = "(plum) some other words, [apple], another words {pear}.";
I have to find and replace the words inside the brackets without replacing the brackets themselves.
If I write:
text = text.replaceAll("(\\(|\\[|\\{).*?(\\)|\\]|\\})", "fruit");
I get:
fruit some other words, fruit, another words fruit.
So the brackets went away with the fruits, but I need to keep them.
Desired output:
(fruit) some other words, [fruit], another words {fruit}.
Here is your regex:
(?<=[({\[\(])[A-Za-z]*(?=[}\]\)])
Test it here:
https://regex101.com/
To use it in a Java string literal, remember to double each backslash:
(?<=[({\\[\\(])[A-Za-z]*(?=[}\\]\\)])
It matches 0 or more letters (uppercase or lowercase) preceded by either of these [,{,( and followed by either of these ],},).
If you want to have at least 1 letter between brackets just replace '*' with '+' like this:
(?<=[({\[\(])[A-Za-z]+(?=[}\]\)])
GCP showed how to use look aheads and look behinds to exclude the brackets from the matched part. But you can also match them, and refer to them in your replacement string with capturing groups:
text.replaceAll("([\\(\\[\\{]).*?([\\)\\]\\}])", "$1fruit$2");
Also note that you can replace the | alternations with a character class [].
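Both approaches can be tried side by side. This is a minimal sketch. The class and method names are made up, and the lookaround pattern is a slightly simplified form of the one above, using + so that empty brackets are left alone.

```java
final class BracketReplaceDemo {
    // Lookbehind/lookahead: the brackets are checked but not consumed,
    // so only the letters between them are replaced.
    static String withLookarounds(String text) {
        return text.replaceAll("(?<=[(\\[{])[A-Za-z]+(?=[)\\]}])", "fruit");
    }

    // Capturing groups: the brackets are consumed, then written back
    // through $1 and $2 in the replacement string.
    static String withGroups(String text) {
        return text.replaceAll("([(\\[{]).*?([)\\]}])", "$1fruit$2");
    }

    public static void main(String[] args) {
        String text = "(plum) some other words, [apple], another words {pear}.";
        System.out.println(withLookarounds(text));
        System.out.println(withGroups(text));
        // both print: (fruit) some other words, [fruit], another words {fruit}.
    }
}
```

Neither pattern checks that the closing bracket is the same kind as the opening one, so "(plum]" would be accepted as well.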

String replace with condition not be a subpart of word Java

String.replace changes more than I want.
For example
String input = "The blue house Theatres";
input = input.replace("The", "AAA");
The output will be:
"AAA blue house AAAatres"
I don't want the replacement to happen when the match is part of a longer word.
First you should try to use replaceAll(regex, replacement) instead of replace(literal, replacement) since the latter works on literals only, i.e. you can't use expressions, while the former uses regular expressions to find matches.
Next your regular expression should use word boundaries, e.g. \bthe\b where \b marks a word boundary.
Finally, if you want to do a case-insensitive replacement, you'll need to either handle the possible cases in the expression (e.g. \b[tT]he\b) or switch the expression to case-insensitive mode by prepending it with (?i), i.e. (?i)\bthe\b. Note that the expression [tT]he would not match THE while the case-insensitive expression would, so depending on your requirements you'd need to choose one or the other.
Using all that you'd get input = input.replaceAll("(?i)\\bthe\\b", "AAA");.
Edit:
According to your comment on the question you don't want to use word boundaries but only look for characters before and after. You can achieve that with negative look-around expressions, e.g. (?i)(?<![a-z])the(?![a-z]). Note that I used the quite simple character class [a-z] here, if you need to exclude more characters you'd need to expand it.
The above expression would match the word in "!The", "the", "THE?" etc., but not in "Theatre" or "aether", since it requires the match to be neither preceded by a letter ((?<![a-z])) nor followed by one ((?![a-z])).
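The two variants differ on characters that \b counts as part of a word but [a-z] does not, such as the underscore. Here is a short sketch of that difference. The class and method names and the second sample string are invented for the example.

```java
final class WholeWordReplaceDemo {
    // \b treats letters, digits and '_' as word characters.
    static String withWordBoundaries(String input) {
        return input.replaceAll("(?i)\\bthe\\b", "AAA");
    }

    // The lookarounds only refuse a match that touches a letter.
    static String withLookarounds(String input) {
        return input.replaceAll("(?i)(?<![a-z])the(?![a-z])", "AAA");
    }

    public static void main(String[] args) {
        System.out.println(withWordBoundaries("The blue house Theatres")); // AAA blue house Theatres
        System.out.println(withWordBoundaries("the_end The theatre"));     // the_end AAA theatre
        System.out.println(withLookarounds("the_end The theatre"));        // AAA_end AAA theatre
    }
}
```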
Use a regex with word boundaries \b:
String input = "The blue house Theatres";
input = input.replaceAll("\\bThe\\b", "AAA");

Regular expression to match a character only once before any whitespace

In Java, what regular expression would I use to match a string that has exactly one colon and makes sure that the colon appears before any whitespace?
For example, it should match these strings:
label: print "Enter input"
But: I still had the money.
ghjkdhfjkgjhalergfyujhrageyjdfghbg:
area:54
But not
label: print "Enter input:"
There was one more thing: I still had the money.
ghfdsjhgakjsdhfkjdsagfjkhadsjkhflgadsjklfglsd
area::54
If you use it with matches (which requires to match the entire string), you could use
[^\\s:]*:[^:]*
Which means: arbitrarily many non-whitespace, non-: characters, then a :, then more arbitrarily many non-: characters.
I've really only used two regex concepts: (negated) character classes and repetition.
If you want to require at least one character before or after :, replace the corresponding * with + (as jlordo pointed out in a comment).
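As a quick check of that pattern against the examples from the question, here is a minimal sketch using String.matches, which requires the whole string to match. The class name ColonLabelDemo and the method hasSingleLeadingColon are hypothetical.

```java
final class ColonLabelDemo {
    // No whitespace and no colon before the colon, no further colon after it.
    static boolean hasSingleLeadingColon(String s) {
        return s.matches("[^\\s:]*:[^:]*");
    }

    public static void main(String[] args) {
        System.out.println(hasSingleLeadingColon("label: print \"Enter input\""));   // true
        System.out.println(hasSingleLeadingColon("area:54"));                        // true
        System.out.println(hasSingleLeadingColon("label: print \"Enter input:\""));  // false
        System.out.println(hasSingleLeadingColon("There was one more thing: I still had the money.")); // false
        System.out.println(hasSingleLeadingColon("area::54"));                       // false
    }
}
```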
The following should work:
^[^\s:]*:(?!.*:)
If your strings can contain line breaks, use the DOTALL flag or change the regex to the following:
(?s)^[^\s:]*:(?!.*:)
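Unlike the matches-based approach, these anchored patterns are meant to be used with Matcher.find(). The sketch below shows why the (?s) flag matters for multi-line input. The class and method names and the two-line sample string are assumptions made for the example.

```java
import java.util.regex.Pattern;

final class ColonFindDemo {
    private static final Pattern SINGLE_LINE = Pattern.compile("^[^\\s:]*:(?!.*:)");
    // (?s) lets '.' match line terminators, so the lookahead scans the whole input.
    private static final Pattern DOTALL = Pattern.compile("(?s)^[^\\s:]*:(?!.*:)");

    static boolean findsSingleLine(String s) {
        return SINGLE_LINE.matcher(s).find();
    }

    static boolean findsDotAll(String s) {
        return DOTALL.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(findsSingleLine("area:54"));  // true
        System.out.println(findsSingleLine("area::54")); // false
        // The second colon sits on the next line, where '.' cannot reach without (?s).
        System.out.println(findsSingleLine("a:\nb:"));   // true
        System.out.println(findsDotAll("a:\nb:"));       // false
    }
}
```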
It depends on what we call whitespace. It could be:
[^\\p{Space}:]*:[^:]*
The following should get you started:
Matcher MatchedPattern = Pattern.compile("^(\\w+\\:{1}[\"\\w\\s\\.]*)$").matcher("yourstring");

Java Quotation Matching

I'm not sure if this is a regex question, but I need to be able to grab what's inside single quotes and surround it with something. For example:
this is a 'test' of 'something' and 'I' 'really' want 'to' 'solve this'
would turn into
this is a ('test') of ('something') and ('I') ('really') want ('to') ('solve this')
Any help you could provide would be great!
Thanks!
String str = "this is a 'test' of 'something'";
String rep = str.replaceAll("'[^']*'", "($0)"); // stand back, I know regex
What I did here is use the replaceAll() method, which searches for all matches of the regex "'[^']*'" and replaces them with the replacement string "($0)".
The pattern "'[^']*'" matches all substrings that start and end with a single quote ('), with any characters between them except another single quote ([^']), repeated any number of times (*). Replacing those with "($0)" means taking every match ($0) and wrapping it in parentheses.
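Here is the same replacement run on the full sentence from the question, as a small self-contained sketch. The class name QuoteWrapDemo and the method wrapQuoted are hypothetical.

```java
final class QuoteWrapDemo {
    // $0 in the replacement string stands for the whole match,
    // including the surrounding single quotes.
    static String wrapQuoted(String s) {
        return s.replaceAll("'[^']*'", "($0)");
    }

    public static void main(String[] args) {
        String s = "this is a 'test' of 'something' and 'I' 'really' want 'to' 'solve this'";
        System.out.println(wrapQuoted(s));
        // this is a ('test') of ('something') and ('I') ('really') want ('to') ('solve this')
    }
}
```

An unpaired apostrophe, as in "don't", would throw the pairing off, so this assumes quotes always come in pairs.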
One easy way (though not always valid) is the following.
If every opening quote is always preceded by a space and every closing quote is always followed by one, you can do this:
myString = myString.replace(" '", " ('"); // replaces every <space_apostrophe> with <space_bracket_apostrophe>
Do the same for the closing bracket :)
One more thing: why do you have words surrounded by apostrophes in the first place? Do they have to be like that? If you created them that way yourself, it may be simpler to change that step than to patch the result afterwards.
If you can ignore single, unpaired apostrophes, you could do it like this (C# code, should be easy to translate):
string input = "this is a 'test' of 'something' and ...";
Console.WriteLine(Regex.Replace(input, "'([^']*)'", "('$1')"));
