Extract words (multiple whitespace) starting with # by regular expression - java

I have a problem with my regular expression:
String regex = "(?<=[\\s])#\\w+\\s";
I want a regex that extracts the tags from a string like this:
"This is a Text #tag1 #tag2 #tag3"
With this regular expression I get the last two values as the result but not tag1, because there is more than one whitespace character before it. But I want all 3 of them!
I tried some variations, but nothing worked.

Use this regular expression:
(?<=(^|\\S)\\s)#\\w+(?=\\s|$)
Here's a demo.

It's a bit unclear from your question what you're really after, so I've put up some simple alternatives:
To capture all the tags in the string, we can use a lookbehind:
((?<=\\s|^)#\\w+)
To capture all the tags at the end of the string, we can use a lookahead:
(#\\w+(?=\\s#)|#\\w+$)
If there are always three tags at the end, there's no need for a lookaround:
(#\\w+)\\s(#\\w+)\\s(#\\w+)$
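As a minimal sketch of the first alternative above, the lookbehind pattern can be run through Pattern and Matcher to collect every tag. The class name TagExtractor and the method extractTags are hypothetical names chosen for illustration, and the sample string deliberately has several spaces before the first tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagExtractor {
    // A '#' preceded by whitespace or the start of input, followed by word characters.
    private static final Pattern TAG = Pattern.compile("(?<=\\s|^)#\\w+");

    static List<String> extractTags(String text) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // Several spaces before #tag1 on purpose.
        System.out.println(extractTags("This is a Text   #tag1 #tag2 #tag3"));
        // prints [#tag1, #tag2, #tag3]
    }
}
```

Because the lookbehind does not consume the whitespace, the number of spaces before each tag does not matter.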

Related

Remove repeating set of characters in a string

I want to remove the sequence "-~-~-" if it repeats in a string, but only if the repeats are adjacent.
I have tried to create a regex based on the removing of multiple white spaces regex:
test.replaceAll("\\s+", " ");
Unfortunately I was unsuccessful. Can someone please help me write the correct regex? thanks.
Example:
String test = "hello-~-~--~-~--~-~-";
output:
hello-~-~-
Another example
String test = "-~-~--~-~--~-~-hello-~-~--~-~--~-~-";
output:
-~-~-hello-~-~-
The regex is:
test.replaceAll("(-~-~-){2,}", "-~-~-")
replaceAll replaces all occurrences matched by the regex (the first parameter) with the second parameter.
The () groups the expression -~-~- together; {2,} means two or more occurrences.
EDIT
Like @anubhava said, instead of using -~-~- for the replacement string, you could also use $1, which backreferences the first capturing group (i.e. the expression in the regex surrounded by ()).
test.replaceAll("(-~-~-)+", "-~-~-");
This is the regex you need:
(-~-~-){2,}
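The $1 variant mentioned in the answers above can be sketched as follows. RepeatCollapser and collapse are hypothetical names used only for this illustration; the inputs are the two examples from the question.

```java
class RepeatCollapser {
    static String collapse(String input) {
        // (-~-~-){2,} matches two or more adjacent copies of the sequence;
        // $1 puts back a single copy (the text captured by the group).
        return input.replaceAll("(-~-~-){2,}", "$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("hello-~-~--~-~--~-~-"));
        // prints hello-~-~-
        System.out.println(collapse("-~-~--~-~--~-~-hello-~-~--~-~--~-~-"));
        // prints -~-~-hello-~-~-
    }
}
```

A single, non-repeated occurrence is left untouched because the quantifier requires at least two copies.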

Parsing HTML with Regular Expressions?

I've been trying to gather information using regular expressions:
Pattern hp = Pattern.compile("<small>.....</small>");
Matcher mp = hp.matcher(code);
while (mp.find()) {
    String grupoHORARIO = mp.group();
    System.out.println(grupoHORARIO);
}
When I run the program, instead of showing me:
RESULT1
RESULT2
RESULT3
It shows this:
<small>RESULT1</small>
<small>RESULT2</small>
As you see, it shows the opening and closing "small" tags before and after the word I am looking for.
What I need is just the word, without the "small" tags around it.
USING REGEX TO PARSE HTML IS BAD.
Again, using RegEx to parse HTML is bad.
That being said... In answer to your question, the problem is how you're using the regular expression. The only code of yours I would change is what is inside the Pattern.compile() call. The way you're currently doing it, you will only match when there is <small>, then exactly 5 characters, then </small>. This match includes the start and end tags.
If what you want is to match only the middle part, then you can try using regex lookaround. The way I did it is: (?<=<small>).*(?=</small>). Broken into parts:
.* - Any number of characters.
.*(?=</small>) - Any number of characters that are followed by </small>.
(?<=<small>).*(?=</small>) - Any number of characters that are preceded by <small> and followed by </small>.
If you don't want it to match any character, then replace the .* with whatever you do want to find (for example, ..... or .{5} will match 5 characters).
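The lookaround approach described above can be tried with a sketch like the one below. One deliberate change, which is an assumption added here and not part of the answer: the quantifier is the lazy .*? so that two tags on the same line are matched separately. SmallTagText and innerTexts are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SmallTagText {
    // The lookbehind and lookahead keep the tags out of the match itself.
    // The lazy .*? stops at the first closing tag (a change from the greedy .* above).
    private static final Pattern INNER = Pattern.compile("(?<=<small>).*?(?=</small>)");

    static List<String> innerTexts(String html) {
        List<String> results = new ArrayList<>();
        Matcher m = INNER.matcher(html);
        while (m.find()) {
            results.add(m.group());
        }
        return results;
    }

    public static void main(String[] args) {
        String code = "<small>RESULT1</small><small>RESULT2</small>";
        System.out.println(innerTexts(code));
        // prints [RESULT1, RESULT2]
    }
}
```

With the greedy .* the same input would produce one match running from RESULT1 to RESULT2, tags in between included.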

Remove everything from a string up to a certain character and optionally a string if it follows too

I am looking to write a regex that removes all characters up to the first &emsp and, if a (new section) follows the &emsp, removes that as well. But the following regex doesn't seem to work. Why? How do I correct this?
String removeEmsp =" “[<centd>[</centd>]§ 431:10A–126 (new section)[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.";
Pattern removeEmspPattern1 = Pattern.compile("(.*( (\\(new section\\)))?)(.*)", Pattern.MULTILINE);
System.out.println(removeEmspPattern1.matcher(removeEmsp).replaceAll("$2"));
Have you tried String.split? It creates an array of strings from a string, based on a delimiter.
Once you have the string split, just select the elements of the array that you need for the print statement.
Your regex is very long and I do not want to debug it. However, the tip is that some characters have special meaning in regular expressions. For example, + means "one or more", and square brackets define character classes. Such characters must be escaped if you want them to be interpreted as plain characters rather than regex syntax. To escape a special character, write \ in front of it. But \ is also the escape character in Java string literals, so it must be doubled.
For example, to replace an ampersand with the letter A you can write str.replaceAll("\\&", "A"). The ampersand is not special on its own (only && inside a character class is), so the escape is harmless here but not required.
Now you have all information you need. Try to start from simpler regex and then expand it to what you need. Good luck.
EDIT
BTW, parsing XML and/or HTML using regular expressions is possible but strongly discouraged. Use a dedicated parser for such formats.
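The escaping advice in this answer can be sketched as follows. EscapeDemo and its two methods are hypothetical names; Pattern.quote and Matcher.quoteReplacement are standard java.util.regex helpers that treat the pattern and the replacement as literal text, which avoids hand-written backslashes altogether.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeDemo {
    // "\\[" in Java source is the regex \[ , i.e. a literal opening bracket.
    static String replaceBracket(String s) {
        return s.replaceAll("\\[", "A");
    }

    // Quote both sides so neither regex metacharacters nor '$' are interpreted.
    static String replaceLiteral(String s, String literal, String replacement) {
        return s.replaceAll(Pattern.quote(literal), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceBracket("a[b"));              // prints aAb
        System.out.println(replaceLiteral("1+1=2", "1+1", "$")); // prints $=2
    }
}
```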
Try this:
String removeEmsp =" “[<centd>[</centd>]§ 431:10A–126 (new section)[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.";
System.out.println(removeEmsp.replaceFirst("^.*?\\ (\\(new\\ssection\\))?", ""));
System.out.println(removeEmsp.replaceAll("^.*?\\ (\\(new\\ssection\\))?", ""));
Output:
[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.
[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.
It removes everything up to and including the first &emsp and, if present, the "(new section)" text that follows it.
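A self-contained sketch of the same idea is shown below. It assumes the input contains the literal entity text &emsp; (with the semicolon), and it uses a shortened ASCII sample string in place of the one from the question. EmspRemover and stripLeadIn are hypothetical names.

```java
class EmspRemover {
    static String stripLeadIn(String s) {
        // ^.*?&emsp;           lazily consume everything up to and including the first "&emsp;"
        // (\(new\ssection\))?  optionally consume "(new section)" directly after it
        return s.replaceFirst("^.*?&emsp;(\\(new\\ssection\\))?", "");
    }

    public static void main(String[] args) {
        System.out.println(stripLeadIn("Sec. 431:10A-126&emsp;(new section)Chemotherapy services."));
        // prints Chemotherapy services.
        System.out.println(stripLeadIn("Intro&emsp;Body"));
        // prints Body
    }
}
```

Since the pattern is anchored with ^, replaceFirst is enough; a string without the entity is returned unchanged.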

Regex to exactly match against a given set of keywords in between delimiters?

I'm trying to write a regex that exactly matches a given set of keywords in between delimiters.
For example:
Keywords: keyone, keytwo, keythree
Start delimiter: ;
End delimiter: ;
Text under test: some text ;keyone; other text ;keytwo; some text ;keythreeeee;
Regex I tried: ;([keyonekeytwokeythree]+);
The problem with this regex is that it also matches keythreeeee. My expectation is that it should not match keythreeeee because this is not an exact match.
You should read up on regular expression syntax.
([keyonekeytwokeythree]+)
The square bracket syntax tells the regexp matcher to match 'one or more characters from the set keyonekeytwokeythree'. It will thus also match yekenoeerth.
You're looking for something like:
;(keyone|keytwo|keythree);
You should use a regex like this:
;(keyone|keytwo|keythree);
I would first take all the text inside the delimiters:
(delimiterStart)(.)*(delimiterEnd)
and then search the selected text for your word:
(key1|key2|keyn)+
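The alternation suggested in the answers above can be checked against the text from the question with a sketch like this one. KeywordMatcher and findKeywords are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordMatcher {
    // Group 1 holds the keyword; the delimiters are required on both sides.
    private static final Pattern KEY = Pattern.compile(";(keyone|keytwo|keythree);");

    static List<String> findKeywords(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = KEY.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "some text ;keyone; other text ;keytwo; some text ;keythreeeee;";
        System.out.println(findKeywords(text));
        // prints [keyone, keytwo]
    }
}
```

Note that each match consumes both delimiters, so two keywords sharing a single ; between them (for example ;keyone;keytwo;) would only yield the first; lookarounds for the delimiters would be one way around that.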

How do I write a regular expression to find the following pattern?

I am trying to write a regular expression to do a find and replace operation. Assume Java regex syntax. Below are examples of what I am trying to find:
12341+1
12241+1R1
100001+1R2
So, I am searching for a string beginning with one or more digits, followed by a "1+1" substring, followed by 0 or more characters. I have the following regex:
^(\d+)(1\\+1).*
This regex will successfully find the examples above, however, my goal is to replace the strings with everything before "1+1". So, 12341+1 would become 1234, and 12241+1R1 would become 1224. If I use the first grouped expression $1 to replace the pattern, I get the wrong result as follows:
12341+1 becomes 12341
12241+1R1 becomes 12241
100001+1R2 becomes 100001
Any ideas?
Your existing regex works fine, just that you are missing a \ before \d
String str = "100001+1R2";
str = str.replaceAll("^(\\d+)(1\\+1).*","$1");
IMHO, the regex is correct.
Perhaps you wrote it wrong in the code. If you want to code the regex ^(\d+)(1\+1).* in a string, you have to write something like String regex = "^(\\d+)(1\\+1).*".
Your output looks like the result of a replacement with ^(\d+)(1+1).*, which suggests that a backslash is missing in the string (e.g. "^(\\d+)(1\+1).*").
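To confirm that the correctly escaped pattern behaves as the answers above describe, here is a small sketch run against the three examples from the question. OnePlusOneRegex and prefixBefore are hypothetical names.

```java
class OnePlusOneRegex {
    static String prefixBefore(String s) {
        // \\d+ is greedy, but it backtracks by one digit so that "1+1" can match;
        // $1 therefore holds only the digits in front of "1+1".
        return s.replaceAll("^(\\d+)(1\\+1).*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(prefixBefore("12341+1"));    // prints 1234
        System.out.println(prefixBefore("12241+1R1"));  // prints 1224
        System.out.println(prefixBefore("100001+1R2")); // prints 10000
    }
}
```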
Your regex looks fine to me - I don't have access to java but in JavaScript the code..
"12341+1".replace(/(\d+)(1\+1)/g, "$1");
Returns 1234 as you'd expect. This works on a string with many 'codes' in too e.g.
"12341+1 54321+1".replace(/(\d+)(1\+1)/g, "$1");
gives 1234 5432.
Personally, I wouldn't use a Regex at all (it'd be like using a hammer on a thumbtack), I'd just create a substring from (Pseudocode)
stringName.substring(0, stringName.indexOf("1+1"))
But it looks like other posters have already mentioned the non-greedy operator.
In most regex syntaxes you can add a '?' after a '+' or '*' to indicate that you want it to match as little as possible before moving on in the pattern. Thus ^(\d+?)(1\+1) matches as few digits as it can before it finds "1+1", and the "1+1" itself stays out of the first group. The greedy \d+ in your original reaches the same result by backtracking, so the lazy quantifier is optional here.
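The substring suggestion from this answer can be sketched in Java as below. The guard for a missing "1+1" is an addition made here, since indexOf returns -1 in that case and substring would then throw. OnePlusOneSubstring and beforeMarker are hypothetical names.

```java
class OnePlusOneSubstring {
    static String beforeMarker(String s) {
        int idx = s.indexOf("1+1");
        // Leave the string unchanged when the marker is absent.
        return idx < 0 ? s : s.substring(0, idx);
    }

    public static void main(String[] args) {
        System.out.println(beforeMarker("12341+1"));    // prints 1234
        System.out.println(beforeMarker("100001+1R2")); // prints 10000
    }
}
```

Unlike the anchored regex, this version does not check that the text before the marker consists of digits.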
