Eliminating spaces and words starting with particular chars from JAVA string

With the following code spaces between string words are eliminated:
String str1 = "This is symbel for snow and silk. Grapes are very dear";
String str2=str1.replaceAll(" ","");
System.out.println(str2);
It gives this output:
Thisissymbelforsnowandsilk.Grapesareverydear
But I want to eliminate all the words in str1 starting with the char 's' (symbel, snow, silk) or the char 'd' (dear) to get the following output:
Thisisforand.Grapesarevery
How can this be achieved by amending this code?

The best solution is to use a Regular Expression also known as a Regex.
These are designed specifically for complex search and replace functionality in strings.
This one:
"([sd]\\w+)|\\s+"
matches a group (indicated by the parentheses ()) consisting of 's' or 'd' followed by one or more "word" characters (\\w = any alphanumeric character or underscore), OR one or more whitespace characters (\\s = whitespace). Nothing in the pattern anchors [sd] to the start of a word, so it would also strip the tail of a word such as "besides"; putting \\b (a word boundary) in front of [sd] prevents that. When used as an argument to the String replaceAll function like so:
s.replaceAll("([sd]\\w+)|\\s+", "");
every occurrence that matches either of these two patterns is replaced with the empty string.
There is comprehensive information on regexes in Oracle's Java documentation here:
http://docs.oracle.com/javase/tutorial/essential/regex/
Although they seem cryptic at first, learning them can greatly simplify your code. Regexes are available in almost all modern languages so any knowledge you gain about regexes is useful and transferable.
Furthermore, the web is littered with handy sites where you can test your regexes out before committing them to code.
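As a minimal sketch of the word-boundary variant (the class and method names here are made up for illustration, and the pattern is case-sensitive, so a word starting with a capital 'S' or 'D' would be kept):

```java
public class WordFilter {
    // Hypothetical helper: drops whole words that start with 's' or 'd',
    // then removes all remaining whitespace.
    static String stripWords(String input) {
        // \b anchors [sd] to the start of a word, so "besides" is left intact.
        return input.replaceAll("\\b[sd]\\w+|\\s+", "");
    }

    public static void main(String[] args) {
        String str1 = "This is symbel for snow and silk. Grapes are very dear";
        System.out.println(stripWords(str1));           // Thisisforand.Grapesarevery
        System.out.println(stripWords("besides dear")); // besides
    }
}
```

Without the \\b, the second call would print "be", because "sides" would be matched in the middle of the word.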

Do it like this:
String str1= "This is symbel for snow and silk. Grapes are very dear";
System.out.print(str1.replaceAll("[sd][a-z]+|[ ]+",""));
Explanation

Try this:
s = s.replaceAll("([sd]\\w+)|\\s+", "");

Related

Replace hashtags in a single pass with regex

I want to replace all hashtags in a string with their equivalent tag in Java. Examples:
This is a #foo_bar #document about #nothing_but_tags!
will result in:
This is a foo bar document about nothing but tags!
Is this possible in a one-pass regex replace? A hashtag may contain many words.

Here is a way to do it with a little hack:
String str = "#This is a #foo_bar #document about #nothing_but_tags!";
String res = str.replaceAll(" ?#|(?<=#\\w{0,100})_", " ").trim();
It would break with hashtags longer than 100 characters, and it would insert a space in place of hash in the tag if it happens to be the first thing in a string (hence a call to trim()).
The 100-character limitation comes from the {0,100} portion of the lookbehind. This is a limitation of the Java regex engine: unlike some other regex engines, it requires look-behinds to have an explicit upper bound on their length (look-aheads have no such restriction).
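Here is the same replacement wrapped in a small runnable class, as a sketch only (the class and method names are invented for the example):

```java
public class HashtagDemo {
    // Hypothetical helper around the regex from this answer.
    static String expandHashtags(String input) {
        // " ?#"                 : a hash plus an optional leading space becomes one space
        // "(?<=#\\w{0,100})_"   : an underscore that sits inside a hashtag becomes a space
        return input.replaceAll(" ?#|(?<=#\\w{0,100})_", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(expandHashtags("This is a #foo_bar #document about #nothing_but_tags!"));
        // This is a foo bar document about nothing but tags!
        System.out.println(expandHashtags("snake_case #a_b"));
        // snake_case a b
    }
}
```

The second call shows that an underscore outside a hashtag is left alone, because the lookbehind never finds a preceding #.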

What does this regex syntax actually mean in Java?

I wrote a program to detect palindromes. It works with what I have, but I stumbled upon another bit of syntax, and I would like to know what it means exactly?
This is the line of code I'm using:
userString = userString.toLowerCase().replaceAll("[^a-zA-Z]", "");
I understand that the replaceAll code snippet means to "match characters ([...]) that are not (^) in the range a-z and A-Z (a-zA-Z)."
However, this worked as well:
replaceAll("[^(\\p{L}')]", "");
I just don't understand how to translate that into English. I am completely new to regular expressions, and I find them quite fascinating. Thanks to anyone who can tell me what it means.

You should check this website:
https://regex101.com
It helped me a lot when I was writing/testing/debugging some regexes ;)
It gives the following explanation:
[^(\p{L}')] match a single character not present in the list below:
( the literal character (
\p{L} matches any kind of letter from any language
') a single character in the list ') literally

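The two regexes are not the same:
[^a-zA-Z] matches any char that is not an English letter
[^(\p{L}')] matches any char that is not a letter, a single quote or a parenthesis
i.e. the 2nd one leaves quotes and parentheses (and letters from other alphabets) in the string, while the 1st one removes them.
\p{L} is the Unicode category for "any letter" (it is not a POSIX class; the POSIX-style one in Java is \p{Alpha}, which is ASCII only). For text that contains only English letters, these two match the same characters:
[a-zA-Z]
\p{L}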
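A quick way to see the difference is to run both on the same input. This is only a sketch; the class name, method names and sample string are made up:

```java
public class LetterClassDemo {
    // Keeps ASCII letters only.
    static String asciiLettersOnly(String s) {
        return s.replaceAll("[^a-zA-Z]", "");
    }

    // Keeps any Unicode letter plus the literal characters ( ' and ).
    static String lettersQuotesParens(String s) {
        return s.replaceAll("[^(\\p{L}')]", "");
    }

    public static void main(String[] args) {
        String input = "L'\u00e9t\u00e9 (ok) 42!"; // \u00e9 is e with an acute accent
        System.out.println(asciiLettersOnly(input));    // Ltok
        System.out.println(lettersQuotesParens(input)); // the accented letters, quote and parentheses survive
    }
}
```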

Java Regex Remove Text Between and Including Parenthesis from String

I am programming in Java, and I have a few Strings that look similar to this:
"Avg. Price ($/lb)"
"Average Price ($/kg)"
I want to remove the ($/lb) and ($/kg) from both Strings and be left with
"Avg. Price"
"Average Price".
My code checks whether a String str variable matches one of the strings above, and if it does, replaces the text inside including the parentheses with an empty string:
if(str.matches(".*\\(.+?\\)")){
str = str.replaceFirst("\\(.+?\\)", "");
}
When I change str.matches to str.contains("$/lb"); as a test, the wanted substring is removed which leads me to believe there is something wrong with the if statement. Any help as to what I am doing wrong? Thank you.
Update
I changed the if statement to:
if(str.contains("(") && str.contains (")"))
Maybe not an elegant solution but it seems to work.

str.matches has always been problematic for me. It behaves as if the regex you pass it were wrapped in '^' and '$', so the whole string has to match.
Since you just care about replacing any occurrence of the string in question - try the following:
str = str.replaceAll("\\s+\\(\\$\\/(lb|kg)\\)", "");
There is an online regex testing tool that you can also try out to see how your expression works out.
EDIT With regard to your comment, the expression could be altered to just:
str = str.replaceAll("\\s+\\([^)]+\\)$", "");
This would mean, find any section of content starting with one or more white-space characters, followed by a literal '(', then look for any sequence of non-')' characters, followed by a literal ')' at the end of the line.
Is that more in-line with your expectation?
Additionally, heed the comment about matches() vs find(); that difference is very likely what is affecting you here.
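To make the matches() vs find() difference concrete, here is a small sketch (the class and method names are hypothetical; the regexes are the ones discussed in this thread):

```java
import java.util.regex.Pattern;

public class MatchesVsFind {
    private static final Pattern PARENS = Pattern.compile("\\(.+?\\)");

    // matches(): true only if the ENTIRE string fits the pattern.
    static boolean wholeStringMatches(String s) {
        return PARENS.matcher(s).matches();
    }

    // find(): true if the pattern occurs anywhere in the string.
    static boolean containsMatch(String s) {
        return PARENS.matcher(s).find();
    }

    // Removes a trailing parenthesized group together with the whitespace before it.
    static String stripTrailingParens(String s) {
        return s.replaceAll("\\s+\\([^)]+\\)$", "");
    }

    public static void main(String[] args) {
        String str = "Avg. Price ($/lb)";
        System.out.println(wholeStringMatches(str));  // false
        System.out.println(containsMatch(str));       // true
        System.out.println(stripTrailingParens(str)); // Avg. Price
    }
}
```

Since replaceAll simply returns the string unchanged when nothing matches, the surrounding if check is not strictly needed.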

Unlike most other popular application languages, the matches() method in java only returns true if the regex matches the whole string (not part of the string like in perl, ruby, php, javascript etc).
The regex to match bracketed input, including any leading spaces, is:
" *\\(.*?\\)"
and the code to use this to remove matches is:
str = str.replaceAll(" *\\(.+?\\)", "");
Here's some test code:
String str = "foo (stuff) bar(whatever)";
str = str.replaceAll(" *\\(.+?\\)", "");
System.out.println(str);
Output:
"foo bar"

This code works fine:
String str = "Avg. Price ($/lb) Average Price ($/kg)";
if (str.matches(".*\\(.+?\\)")) {
str = str.replaceAll("\\(.+?\\)", "");
}
System.out.println("str: " + str);
This prints both labels, Avg. Price and Average Price, with the parenthesized units removed (the spaces around them are left in), which is what you need.
Note: I changed replaceFirst to replaceAll here.

String first = "^(\\w+\\.\\s\\w+)";
This would match Avg. Price
String second = "(\\w+\\s\\w+)";
This would match Average Price
Hope this simple answer helps.

Java Regex replaceAll() with lookahead

I am fairly new to using regex with java. My motive is to escape all occurrences of '*' with a back slash.
This was the statement that I tried:
String replacementStr= str.replaceAll("(?=\\[*])", "\\\\");
This does not seem to work. After some tinkering, I found out that this works:
String replacementStr= str.replaceAll("(?=[]\\[*])", "\\\\");
Based on what I know of regular expressions, I thought '[]' represents an empty character class. Am I missing something here? Can someone please help me understand this?
Note: The motive of my trial was to learn to use the lookahead feature of regex. While the purpose stated in the question does not warrant the use of lookahead, I am just trying to use it for educational purposes. Sorry for not making that clear!

Some metacharacters do not need to be escaped when they appear inside brackets (a character class).
That said, I am not sure whether you want to escape * as \*. If so, try the following:
String newStr = str.replace("*", "\\*");
EDIT: There is something curious in your regular expressions.
(?=\[*]) Look ahead for the character [ (0 or more times), followed by ]
(?=[]\[*]) Look ahead for one of the next characters: [, ], *
Perhaps the regex that you are looking for is the following:
(?=\*)
In Java, "(?=\\*)"
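Both approaches can be compared side by side. This is a minimal sketch with made-up class and method names:

```java
public class EscapeStar {
    // Zero-width lookahead: matches the empty position just before each '*'
    // and inserts a backslash there. In a replaceAll replacement string a
    // literal backslash is written as "\\\\".
    static String escapeWithLookahead(String s) {
        return s.replaceAll("(?=\\*)", "\\\\");
    }

    // Plain literal replacement, no regex involved.
    static String escapeWithReplace(String s) {
        return s.replace("*", "\\*");
    }

    public static void main(String[] args) {
        System.out.println(escapeWithLookahead("abc*123*")); // abc\*123\*
        System.out.println(escapeWithReplace("abc*123*"));   // abc\*123\*
    }
}
```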

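Instead of your replaceAll("(?=\\[*])", "\\\\"); simply use
String newStr = str.replace("*", "\\");
Don't bother with regex here.
For example
String str = "abc*123*";
String newStr = str.replace("*", "\\");
System.out.println(newStr);
shows the output
abc\123\
Note that this swaps each * for a backslash; to keep the asterisk and put a backslash in front of it, use "\\*" as the replacement.
See the documentation for String replace.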

The code below will work:
String strTest = "jhgfg*gfb*gfhh";
strTest = strTest.replaceAll("\\*", "\\\\"); // strTest = strTest.replace("*", "\\");
System.out.println("String is : "+strTest);
OUTPUT
String is : jhgfg\gfb\gfhh

If the regex engine finds [], it treats the ] as a literal ]. This is never a problem because an empty character class is useless anyway, and it means you can avoid some character escaping.
There are a few rules for characters you don't have to escape in character classes:
in [] (or [^]), the ] is literal
in [-.....] or [^-.....] or [.....-] or [^.....-], the - is literal
^ is literal unless it is at the start of the character class
So you'll never need to escape ], - or ^ if you don't want to.
This is down to the Perl origins of the regex syntax. It's a very Perl-style way of doing things.
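These rules can be checked with java.util.regex directly. The following is a sketch with hypothetical names; each pattern relies on one of the "no escape needed" positions described above:

```java
public class CharClassLiterals {
    // ']' directly after '[' is taken literally.
    static String dropClosingBrackets(String s) {
        return s.replaceAll("[]]", "");
    }

    // '-' at the start of the class is taken literally.
    static String dropHyphens(String s) {
        return s.replaceAll("[-x]", "");
    }

    // '^' anywhere except the start of the class is taken literally.
    static String dropCarets(String s) {
        return s.replaceAll("[x^]", "");
    }

    public static void main(String[] args) {
        System.out.println(dropClosingBrackets("a]b")); // ab
        System.out.println(dropHyphens("a-b"));         // ab
        System.out.println(dropCarets("a^b"));          // ab
    }
}
```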

Java - Regex to Split Tokens With Minimum Size and Delimiters

I know, I know, there are many similar questions, and I have read all of them. But I am not good at regex and I couldn't figure out the regular expression that I need.
I want to split a String in Java, and I have 4 constraints:
The delimiters are [.?!] (end of the sentence)
Decimal numbers shouldn't be tokenized
The delimiters shouldn't be removed.
The minimum size of each token should be 5
For example, for input:
"Hello World! This answer worth $1.45 in U.S. dollar. Thank you."
The output will be:
[Hello World!, This answer worth $1.45 in U.S. dollar., Thank you.]
Up to now I got the answer for three first constraints by this regex:
text.split("(?<=[.!?])(?<!\\d)(?!\\d)");
And I know I should use {5,} somewhere in my regex, but any combination that I tried doesn't work.
For cases like "I love U.S. How about you?" it doesn't matter if it gives me one or two sentences, as long as it doesn't tokenize S. as a separate sentence.
Finally, a pointer to a good regex tutorial would be appreciated.
UPDATE: As Chris mentioned in the comments, it is almost impossible to solve questions like this (covering all the cases that occur in natural languages) with regex. However, I found HamZa's answer the closest and the most useful one.
So, be careful! The accepted answer will not cover all possible use cases!

Basing my answer from a previously made regex.
The regex was basically (?<=[.?!])\s+(?=[a-z]) which means match any whitespace one or more times preceded with either ., ? or ! and followed by [a-z] (not forgetting the i modifier).
Now let's modify it to the needs of this question:
We'll first convert it to a JAVA regex: (?<=[.?!])\\s+(?=[a-z])
We'll add the i modifier to match case insensitive (?i)(?<=[.?!])\\s+(?=[a-z])
We'll put the expression in a positive lookahead to prevent the "eating" of the characters (delimiters in this case) : (?=(?i)(?<=[.?!])\\s+(?=[a-z]))
We'll add a negative lookbehind to check that there is no abbreviation in the format LETTER DOT LETTER DOT (with the dots escaped for a Java string as well): (?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z])
So our final regex looks like: (?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z])
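Plugged into String splitting, the final regex can be tried out like this. It is a sketch only: the class and method names are invented, and it does not enforce the minimum token size of 5 from the question.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SentenceSplitter {
    // Split on whitespace that follows . ? or !, unless the text just before
    // it looks like LETTER DOT LETTER DOT (e.g. "U.S."), and only when a
    // letter follows.
    private static final Pattern BOUNDARY =
        Pattern.compile("(?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z])");

    static String[] split(String text) {
        return BOUNDARY.split(text);
    }

    public static void main(String[] args) {
        String input = "Hello World! This answer worth $1.45 in U.S. dollar. Thank you.";
        System.out.println(Arrays.toString(split(input)));
        // [Hello World!, This answer worth $1.45 in U.S. dollar., Thank you.]
    }
}
```

The delimiters stay attached to their sentences because they are only inspected by the lookbehind; the whitespace is what actually gets consumed by the split.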
Some links:
Online tester, jump to JAVA
Explain tool (Not JAVA based)
THE regex tutorial
Java specific regex tutorial
SO regex chatroom
Some advanced nice regex-fu on SO
How does this regex find triangular numbers?
How can we match a^n b^n?
How does this Java regex detect palindromes?
How to determine if a number is a prime with regex?
"vertical" regex matching in an ASCII "image"
Can the for loop be eliminated from this piece of PHP code? ^-- See regex solution, although not sure if applicable in JAVA

What about the following regular expression?
(?<=[.!?])(?!\w{1,5})(?<!\d)(?!\d)
e.g.
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitDemo {
    private static final Pattern REGEX_PATTERN =
        Pattern.compile("(?<=[.!?])(?!\\w{1,5})(?<!\\d)(?!\\d)");

    public static void main(String[] args) {
        String input = "Hello World! This answer worth $1.45 in U.S. dollar. Thank you.";
        // The split is zero-width, so every token after the first keeps its leading space,
        // and "U.S." still ends a token because a space follows it.
        System.out.println(Arrays.toString(REGEX_PATTERN.split(input)));
        // prints "[Hello World!,  This answer worth $1.45 in U.S.,  dollar.,  Thank you.]"
    }
}
