How to use regex to remove punctuation from a sentence


I am trying to extract all the valid words from a file. Valid words consist of normal letters and may contain an apostrophe, like so:
don't won't can't
I have to ignore commas, periods and exclamation points.
I have an expression that gets plain letters, but it won't get words like don't, can't or won't.
This is the expression I am using: "[^A-Za-z]+". I have also tried "\'[^A-Za-z]+", but this breaks and allows all characters. Does anyone have any idea what I can use to get normal words, including words such as don't, won't and can't?
Thank you very much

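[^A-Za-z] means anything NOT in those character ranges! To match the characters that make up a word, including the apostrophe, try this:
[A-Za-z']
In Java the single quote does not need escaping inside a character class. If you escape it anyway, you also have to escape the backslash in the string literal:
[A-Za-z\\']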
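
To make the suggestion above concrete, here is a minimal sketch that collects words with a Matcher instead of splitting on separators. The class and method names (WordExtractor, words) are made up for this example, and the + quantifier is added so that each match is a whole word rather than a single character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordExtractor {
    // One or more ASCII letters or apostrophes; the apostrophe needs no escaping.
    private static final Pattern WORD = Pattern.compile("[A-Za-z']+");

    // Returns every run of letters/apostrophes in the order it appears.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Commas, spaces and the exclamation point are skipped.
        System.out.println(words("Don't stop, we won't quit!"));
    }
}
```

Note that a stray apostrophe on its own (for example a quoting mark) would also come back as a "word" with this pattern, so some post-filtering may be needed depending on the input.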

Another way (using shorthand character classes) is: \b[\w']+
Note that \w also matches digits and the underscore, not just letters.

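This will match letters from any language and exclude numbers. Note that it also keeps ! and ? as part of a word; drop them from the character class if you only want letters and apostrophes.
\b[\p{L}\!\'\?]+
Here is a very good resource for regular expressions.
http://www.regular-expressions.info/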
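
As a rough illustration of the \p{L} approach, the sketch below keeps only letters (from any script) and apostrophes, since the question asks to ignore exclamation points. The names UnicodeWords and words are invented for the example, and the non-ASCII letters in the sample text are written as Unicode escapes so the code does not depend on the source file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWords {
    // Any Unicode letter or an apostrophe, one or more times.
    private static final Pattern WORD = Pattern.compile("[\\p{L}']+");

    // Collects the matches in order; digits and punctuation are skipped.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample text with accented letters, a number and punctuation.
        System.out.println(words("L'\u00e9t\u00e9 2024, na\u00efve caf\u00e9!"));
    }
}
```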

Related

Java regex not matching German "Umlaut" OR underscore

I'm trying to "play around" with some REST APIs and Java code.
As I mainly use German, I already managed to get the Apache HTTP Client to work with UTF-8 encoding to make sure umlauts are handled the right way.
Still, I can't get my regex to match my words correctly.
I try to find words/word combinations like "Büro_Licht" in strings like ..."type":"Büro_Licht"....
Using the regex ".*?type\":\"(\\w+).*?" returns "B" for me, as it doesn't recognize the "ü" as a word character. That makes sense, as \w is said to be [a-zA-Z_0-9]. For strings with no special characters, such as "Office_Light", I get the full word.
So I tried another hint mentioned here in nearly the same question (which I could not comment on, because I lack the reputation points).
Using the regex ".*?type\":\"(\\p{L}+).*?" returns "Büro" for me. But here again it cuts off at the underscore for a reason I don't understand.
Is there a nice way to combine both expressions to get the "full" word including underscores and special characters?

If you have to keep using regex, which is not a great tool for parsing JSON, try the character class [\p{L}_]. In your case it would be:
String regex = ".*?type\":\"([\\p{L}_]+)\"";
With on-line example: https://regex101.com/r/57oFD5/2
\p{L} matches any kind of letter from any language
_ matches the character _ literally
This will get hectic if you need to support other languages, whitespace and various other Unicode code points. For example, do you need to support an arbitrary amount of whitespace around the colon? Take a look at this answer on removing emojis; there are many corner cases.
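
Here is a small sketch of how the [\p{L}_] class could be used to pull the value out, assuming the input looks like the snippet in the question. TypeFieldExtractor and typeOf are hypothetical names, the capturing group is there so the value can be read back with group(1), and the ü is written as a Unicode escape to keep the source ASCII-only.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TypeFieldExtractor {
    // Group 1 captures letters from any script plus underscores.
    private static final Pattern TYPE =
        Pattern.compile("\"type\":\"([\\p{L}_]+)\"");

    // Returns the first "type" value, or empty if the field is absent.
    static Optional<String> typeOf(String json) {
        Matcher m = TYPE.matcher(json);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String json = "{\"id\":7,\"type\":\"B\u00fcro_Licht\",\"state\":\"ON\"}";
        System.out.println(typeOf(json).orElse("<none>"));
    }
}
```

Another option, if digits should be accepted as well, is to keep \w and compile the pattern with the Pattern.UNICODE_CHARACTER_CLASS flag, which makes \w Unicode-aware. For anything beyond a quick experiment, a real JSON parser is still the safer route.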

Can I give Java a regular expression for when it should not split a string?

Can I give the String.split method a parameter which tells it when it must not split the given string? In my particular case, I have text documents with lots of text and symbols. But in every file there are many different symbols. This is what I want to achieve:
string.split(not(A-Z,ß,ä,ö,ü));
So basically, I want String.split to only split whenever it finds a character that is not part of the German set of characters.
I hope you can help me.

There are three tokens in regular expressions that allow you to do exactly what you want to achieve:
[] creates a character class which contains all characters that are listed inside. In your particular case, you'd want this to be [a-zßäöü] as this character group contains all characters a through z, ß, ä, ö and ü.
^ negates the contents of a character class. So, using the character class from above, you'd use [^a-zßäöü] if you wanted to match any character that is not part of the character group.
Additionally, adding (?i) in front of your regular expression makes it case insensitive, so it matches the uppercase letters as well without you having to add them to the expression. In Java, (?i) on its own only folds the case of ASCII letters; add the u flag, giving (?iu), so that Ä, Ö and Ü are covered too.
So, adding those tokens together, you get the regular expression (?iu)[^a-zßäöü]. Now the only thing left is to put it into your String.split call and you're done:
string.split("(?iu)[^a-zßäöü]");
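
A minimal sketch of that split, under a few assumptions of mine: the class name GermanSplitter is invented, the u flag is added next to i so that case folding also applies to the umlauts, and a + is appended so that a run of several separators produces a single split instead of empty strings. The German letters are written as Unicode escapes so the example does not depend on the source file encoding.

```java
import java.util.Arrays;
import java.util.List;

class GermanSplitter {
    // Negated class: anything that is not a-z, sharp s, or a/o/u umlaut.
    // (?iu) = case-insensitive matching with Unicode case folding.
    private static final String NON_GERMAN =
        "(?iu)[^a-z\u00df\u00e4\u00f6\u00fc]+";

    // Splits the text on every run of characters outside the German set.
    static List<String> split(String text) {
        return Arrays.asList(text.split(NON_GERMAN));
    }

    public static void main(String[] args) {
        // Sample input: three German words separated by punctuation.
        System.out.println(split("Stra\u00dfe, \u00c4pfel & \u00d6l!"));
    }
}
```

If the text starts with a separator, String.split returns an empty string as the first element, so callers may want to filter that out.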

Mr.Human,
If I'm understanding your question correctly, you want to split a string on non-German characters?
So,
abcdöyüp
becomes
a, b, c, dö, yü, p
If that is the case, then unfortunately you need to specify the set of characters that are non-German, e.g. [A-Z] to split on. If you are trying to accomplish something other than this, please clarify and/or provide an example.

Remove everything from a string up to a certain character and optionally a string if it follows too

I am looking to write a regex that removes all characters up to the first &emsp; and, if there is a (new section) following the &emsp;, removes that as well. But the following regex doesn't seem to work. Why? How do I correct this?
String removeEmsp =" “[<centd>[</centd>]§ 431:10A–126&emsp;(new section)[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.";
Pattern removeEmspPattern1 = Pattern.compile("(.*(&emsp;(\\(new section\\)))?)(.*)", Pattern.MULTILINE);
System.out.println(removeEmspPattern1.matcher(removeEmsp).replaceAll("$2"));

Have you tried String.split? It creates an array of strings from a string, based on a delimiter.
Once you have split the string, just select the elements of the array that you need for the print statement.

Your regex is very long and I do not want to debug it. However, the tip is that some characters have special meaning in regular expressions. For example, parentheses create groups, square brackets define character classes, and so on. Such characters must be escaped if you want them to be interpreted as literal characters and not as regex commands. To escape a special character you write \ in front of it. But \ is the escape character in Java string literals too, so it has to be doubled. The ampersand is not special outside a character class, but escaping it is harmless.
For example, to replace an ampersand with the letter A you can write str.replaceAll("\\&", "A")
Now you have all the information you need. Start from a simpler regex and then expand it to what you need. Good luck.
EDIT
BTW, parsing XML and/or HTML using regular expressions is possible but strongly discouraged. Use a dedicated parser for such formats.

Try this:
String removeEmsp =" “[<centd>[</centd>]§ 431:10A–126&emsp;(new section)[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.";
System.out.println(removeEmsp.replaceFirst("^.*?\\&emsp;(\\(new\\ssection\\))?", ""));
System.out.println(removeEmsp.replaceAll("^.*?\\&emsp;(\\(new\\ssection\\))?", ""));
Output:
[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.
[<centd>]Chemotherapy services.</centd>] <centa>Cancer treatment.</centa>test snl.
It will remove everything up to "&emsp;" and, optionally, the following "(new section)" text if any.
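
The same idea as a self-contained sketch, assuming the marker appears in the text as the literal characters &emsp; (as the question describes). EmspStripper and strip are hypothetical names and the sample inputs are shortened, ASCII-only stand-ins for the string in the question.

```java
import java.util.regex.Pattern;

class EmspStripper {
    // Lazily consume everything up to the first "&emsp;", then an
    // optional "(new section)" directly after it.
    private static final Pattern PREFIX =
        Pattern.compile("^.*?&emsp;(\\(new\\ssection\\))?");

    // Removes the prefix if the marker is present; otherwise returns
    // the text unchanged.
    static String strip(String text) {
        return PREFIX.matcher(text).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(strip("Sec. 431:10A-126&emsp;(new section)Chemotherapy services."));
        System.out.println(strip("Title&emsp;Cancer treatment."));
    }
}
```

The lazy quantifier .*? is what stops the match at the first marker; a greedy .* would run to the last one on the line.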

Blank spaces in regular expression

I use this regular expression to validate many of the input fields of my Java web app:
"^[a-zA-Z0-9]+$"
But I need to modify it, because I have a couple of fields that need to allow blank spaces (for example: Address).
How can I modify it to allow blank spaces (if possible, not at the start)?
I think I need to use some escape character like \
I tried a few different combinations but none of them worked. Can somebody help me with this regex?

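I'd suggest using this:
^[a-zA-Z0-9][a-zA-Z0-9 ]+$
It does two things: first, the leading character class guarantees there is no space at the beginning while still allowing the characters you need. After that, the letters a-z and A-Z are allowed, as well as all digits and spaces (there's a space at the end of the second character class). Note that this requires at least two characters; use * instead of + to accept a single character.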

If you want to allow only the space character, you can do:
^[a-zA-Z0-9 ]+$
If you want to include tabs (\t) and newline characters (\n, \r\n) as well, you can do:
^[a-zA-Z0-9\s]+$
Also, as you asked, if you don't want the whitespace to be at the beginning:
^[a-zA-Z0-9][a-zA-Z0-9 ]+$

Use this: ^[a-zA-Z0-9]+[a-zA-Z0-9 ]+$. This should work. The first atom ensures that there is at least one non-space character at the beginning.

Try it like this: ^[a-zA-Z0-9 ]+$ (that is, add a space to the character class).

This regex doesn't allow spaces at the end of the string. One downside is that it also accepts the underscore character. The alternation is grouped so that both anchors apply to both branches.
^(?:(\w+ )+\w+|\w+)$

Try this one: I assume that any input with a length of at least one character is valid. The previously mentioned answers do not take that into account.
"^[a-zA-Z0-9][a-zA-Z0-9 ]*$"
If you want to allow all whitespace characters, replace the space with "\s"
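
A quick sketch of how this last pattern behaves as a validator. FieldValidator and isValid are names invented for the example; the pattern itself is the one from the answer above.

```java
import java.util.regex.Pattern;

class FieldValidator {
    // First character: letter or digit. Rest: letters, digits or spaces.
    private static final Pattern ALNUM_WITH_SPACES =
        Pattern.compile("^[a-zA-Z0-9][a-zA-Z0-9 ]*$");

    // True when the whole input is alphanumeric text with optional
    // inner/trailing spaces and no leading space.
    static boolean isValid(String input) {
        return ALNUM_WITH_SPACES.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Main Street 42")); // accepted
        System.out.println(isValid(" Main Street"));   // rejected: leading space
    }
}
```

This version still accepts trailing spaces; trimming the input before validating is one simple way to deal with that.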

Regex that excludes matches within quotes

I'm working on this pretty big refactoring project and I'm using IntelliJ's find/replace with regexp to help me out.
This is the regexp I'm using:
\b(?<!\.)Units(?![_\w(.])\b
I find that most matches that are not useful for my purpose are the matches that occur with strings within quotes, for example: "units"
I'd like to find a way to have the above expression not match when it finds a matching string that's between quotes...
Thx in advance, this place rocks!

Assuming the quotes are always paired on a given line, you could create matches before and after for an even number of quotes, and make sure the whole line is matched:
^([^"]*("[^"]*")*[^"]*)*\b(?<!\.)Units(?![_\w(.])\b([^"]*("[^"]*")*[^"]*)*$
This works because the fragment
([^"]*("[^"]*")*[^"]*)*
will only match paired quotes. By adding the begin and end line anchors, it forces the quotes on the left and right side of your regex to be an even count.
This won't handle embedded escaped quotes properly, and multiline quoted strings will be trouble.

IntelliJ uses Java regexes, doesn't it? Try this:
(?m)(?<![\w.])Units(?![\w(.])(?=(?:[^\r\n"\\]++|\\.|"(?:[^\r\n"\\]++|\\.)*+")*+$)
The first part is your regex after a little cosmetic surgery:
(?<![\w.])Units(?![\w(.])
The \b at the beginning and end were effectively the same as a negative lookbehind and a negative lookahead (respectively) for \w, so I folded them into your existing lookarounds. The new lookahead matches the rest of the line only if it consists of unquoted text, escaped characters and complete quoted strings, that is, if it contains an even number (including zero) of unescaped quotation marks:
(?=(?:[^\r\n"\\]++|\\.|"(?:[^\r\n"\\]++|\\.)*+")*+$)
That handles pathological cases like the one Welbog pointed out, and unlike Michael's regex it will find multiple occurrences of the text on the same line. But it doesn't take comments into account. Is IntelliJ's find/replace feature intelligent enough to disregard text in comments? Come to think of it, doesn't it have some kind of refactoring support built in?
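
To try the lookahead idea outside the IDE, here is a sketch in plain Java. The pattern is my own spelling of the approach (unquoted text, escaped characters, or complete quoted strings up to the end of the line), and UnquotedMatches and findOffsets are hypothetical names. It does not attempt to handle comments or multi-line strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnquotedMatches {
    // "Units" as a standalone identifier, followed by a lookahead that
    // only succeeds when every quote in the rest of the line is paired.
    private static final Pattern UNITS = Pattern.compile(
        "(?m)(?<![\\w.])Units(?![\\w(.])"
        + "(?=(?:[^\\r\\n\"\\\\]++|\\\\.|\"(?:[^\\r\\n\"\\\\]++|\\\\.)*+\")*+$)");

    // Returns the start offset of every match that is outside quotes.
    static List<Integer> findOffsets(String source) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = UNITS.matcher(source);
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(findOffsets("int a = Units;"));
        System.out.println(findOffsets("String s = \"Units\";"));
    }
}
```

The possessive quantifiers (++ and *+) keep the lookahead from backtracking, which matters on long lines with many quotes.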
