I'm trying to write a java regex to catch some groups of words from a String using a Matcher.
Say I have this string: "Hello, we are #happy# to see you today".
I would like to get two groups of matches, one containing
Hello, we are
to see you today
and the other
happy
So far, I was only able to match the word between the #s using this Pattern:
Pattern p = Pattern.compile("#(.+?)#");
I've read about negative lookahead and lookaround, played a bit with it but without success.
I assume I should do some sort of negation of the regex so far, but I couldn't come up with anything.
Any help would be really appreciated, thank you.
From comment:
I may run into a string that has more than one word wrapped in #, such as "#Hello# kind #stranger#"
From comment:
I need to apply some different style format to both the text inside and outside.
Since you need to apply different styling, the code needs to process each block of text separately, and it needs to know whether the text is inside or outside a #..# section.
Note, in the following code, it will silently skip the last #, if there is an odd number of them.
String input = ...
for (Matcher m = Pattern.compile("([^#]+)|#([^#]+)#").matcher(input); m.find(); ) {
if (m.start(1) != -1) {
String outsideText = m.group(1);
System.out.println("Outside: \"" + outsideText + "\"");
} else {
String insideText = m.group(2);
System.out.println("Inside: \"" + insideText + "\"");
}
}
Output for input = "Hello, we are #happy# to see you today"
Outside: "Hello, we are "
Inside: "happy"
Outside: " to see you today"
Output for input = "#Hello# kind #stranger#"
Inside: "Hello"
Outside: " kind "
Inside: "stranger"
Output for input = "This #text# has unpaired # characters"
Outside: "This "
Inside: "text"
Outside: " has unpaired "
Outside: " characters"
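If the goal is to restyle the text rather than print it, the same loop can build a new string. The following is only a sketch: the class name HashMarkup, the method render and the [b]...[/b] markers are made up for illustration and stand in for whatever styling you actually apply.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashMarkup {
    // Same alternation as above: group 1 = text outside #..#, group 2 = text inside.
    private static final Pattern SEGMENT = Pattern.compile("([^#]+)|#([^#]+)#");

    // Copies outside segments unchanged and wraps inside segments in [b]...[/b].
    // The [b] markers are placeholders (an assumption), not a real styling API.
    static String render(String input) {
        StringBuilder out = new StringBuilder();
        Matcher m = SEGMENT.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                out.append(m.group(1));
            } else {
                out.append("[b]").append(m.group(2)).append("[/b]");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("Hello, we are #happy# to see you today"));
        System.out.println(render("#Hello# kind #stranger#"));
    }
}
```

As with the loop above, an unpaired # is silently dropped.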
The best I could do is to split the string into groups and then merge groups 1 and 4:
(^.*)(\#(.+?)\#)(.*)
EDIT: Taking remarks from the comments :
(^[^\#]*)(?:\#(.+?)\#)([^\#]*)
Thanks to @Lino, we no longer capture the useless group that includes the # characters, and the outer groups (1 and 3) now capture anything except #, rather than any character.
Is this solution fine?
Pattern pattern =
Pattern.compile("([^#]+)|#([^#]*)#");
Matcher matcher =
pattern.matcher("Hello, we are #happy# to see you today");
List<String> notBetween = new ArrayList<>(); // not surrounded by #
List<String> between = new ArrayList<>(); // surrounded by #
while (matcher.find()) {
if (Objects.nonNull(matcher.group(1))) notBetween.add(matcher.group(1));
if (Objects.nonNull(matcher.group(2))) between.add(matcher.group(2));
}
System.out.println("Printing group 1");
for (String string :
notBetween) {
System.out.println(string);
}
System.out.println("Printing group 2");
for (String string :
between) {
System.out.println(string);
}
Possible Duplicate:
Query about the trim() method in Java
I am parsing a site's usernames and other information, and each one is followed by a bunch of spaces (and also has spaces in between the words).
For example: "Bob the Builder " or "Sam the welder ". The number of spaces varies from name to name. I figured I'd just use .trim(), since I've used this before.
However, it's giving me trouble. My code looks like this:
for (int i = 0; i < splitSource3.size(); i++) {
splitSource3.set(i, splitSource3.get(i).trim());
}
The result is just the same; no spaces are removed at the end.
Thank you in advance for your excellent answers!
UPDATE:
The full code is a bit more complicated, since there are HTML tags that are parsed out first. It goes exactly like this:
for (String s : splitSource2) {
if (s.length() > "<td class=\"dddefault\">".length() && s.substring(0, "<td class=\"dddefault\">".length()).equals("<td class=\"dddefault\">")) {
splitSource3.add(s.substring("<td class=\"dddefault\">".length()));
}
}
System.out.println("\n");
for (int i = 0; i < splitSource3.size(); i++) {
splitSource3.set(i, splitSource3.get(i).substring(0, splitSource3.get(i).length() - 5));
splitSource3.set(i, splitSource3.get(i).trim());
System.out.println(i + ": " + splitSource3.get(i));
}
}
UPDATE:
Calm down. I never said the fault lay with Java, and I never said it was a bug or broken or anything. I simply said I was having trouble with it and posted my code for you to collaborate on and help solve my issue. Note the phrase "my issue" and not "java's issue". I have actually had the code printing out
System.out.println(i + ": " + splitSource3.get(i) + "*");
in a for each loop afterward.
This is how I knew I had a problem.
By the way, the problem has still not been fixed.
UPDATE:
Sample output (minus single quotes):
'0: Olin D. Kirkland '
'1: Sophomore '
'2: Someplace, Virginia 12345<br />VA SomeCity<br />'
'3: Undergraduate '
EDIT the OP rephrased his question at Query about the trim() method in Java, where the issue was found to be Unicode whitespace characters which are not matched by String.trim().
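To illustrate that diagnosis: String.trim() only strips characters up to U+0020, so a no-break space (U+00A0) at the end of a scraped value survives it. Below is a minimal sketch of a regex-based alternative. It assumes the unwanted characters fall under \s or the Unicode separator category \p{Z}; the class and method names are made up for the example.

```java
class UnicodeTrim {
    // Removes leading and trailing characters that are regex whitespace (\s)
    // or Unicode separators (\p{Z}), a category that includes U+00A0.
    static String trimUnicode(String s) {
        return s.replaceAll("^[\\s\\p{Z}]+|[\\s\\p{Z}]+$", "");
    }

    public static void main(String[] args) {
        // Hypothetical scraped value ending in two no-break spaces and a space.
        String scraped = "Bob the Builder\u00A0\u00A0 ";
        System.out.println("[" + scraped.trim() + "]");       // the no-break spaces remain
        System.out.println("[" + trimUnicode(scraped) + "]"); // [Bob the Builder]
    }
}
```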
It just occurred to me that I used to have this sort of issue when I worked on a screen-scraping project. The key is that the downloaded HTML sources sometimes contain non-printable characters which are not whitespace characters either. These are very difficult to copy and paste from a browser. I assume that this could have happened to you.
If my assumption is correct then you've got two choices:
Use a binary reader and figure out what those characters are - and delete them with String.replace(); E.g.:
private static String cutCharacters(String fromHtml) {
String result = fromHtml;
char[] problematicCharacters = {'\000', '\001', '\003'}; //this could be a private static final constant too
for (char ch : problematicCharacters) {
result = result.replace(String.valueOf(ch), ""); //there is no replace(char, String) overload, so the char is converted to a String first
}
return result;
}
If you find some sort of reoccurring pattern in the HTML to be parsed then you can use regexes and substrings to cut the unwanted parts. E.g.:
private String getImportantParts(String fromHtml) {
Pattern p = Pattern.compile("(\\w*\\s*)"); //this could be a private static final constant as well.
Matcher m = p.matcher(fromHtml);
StringBuilder buff = new StringBuilder();
while (m.find()) {
buff.append(m.group(1));
}
return buff.toString().trim();
}
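Before choosing either option it helps to see which characters are really there. Instead of a binary reader, one way is to print the code points of a suspicious value. This is just a sketch, with made-up class and method names:

```java
class CodePointDump {
    // Lists every code point of s as U+XXXX so that invisible characters show up.
    static String describe(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // "Hi" followed by a no-break space
        System.out.println(describe("Hi\u00A0")); // U+0048 U+0069 U+00A0
    }
}
```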
Works without a problem for me.
Here is your code, refactored a bit and (maybe) more readable:
final String openingTag = "<td class=\"dddefault\">";
final String closingTag = "</td>";
List<String> splitSource2 = new ArrayList<String>();
splitSource2.add(openingTag + "Bob the Builder " + closingTag);
splitSource2.add(openingTag + "Sam the welder " + closingTag);
for (String string : splitSource2) {
System.out.println("|" + string + "|");
}
List<String> splitSource3 = new ArrayList<String>();
for (String s : splitSource2) {
if (s.length() > openingTag.length() && s.startsWith(openingTag)) {
String nameWithoutOpeningTag = s.substring(openingTag.length());
splitSource3.add(nameWithoutOpeningTag);
}
}
System.out.println("\n");
for (int i = 0; i < splitSource3.size(); i++) {
String name = splitSource3.get(i);
int closingTagBegin = splitSource3.get(i).length() - closingTag.length();
String nameWithoutClosingTag = name.substring(0, closingTagBegin);
String nameTrimmed = nameWithoutClosingTag.trim();
splitSource3.set(i, nameTrimmed);
System.out.println("|" + splitSource3.get(i) + "|");
}
I know that's not a real answer, but I cannot post comments and this code wouldn't fit in a comment anyway, so I made it an answer so that Olin Kirkland can check his code.
I have code like this:
String str = " " ;
while( cond ) {
str = str + "\n" ;
}
Now, I don't know why the newline character does not show up when I print the string. However, when I append any other character (for example str = str + "c"), it prints properly. Can anybody explain why this is happening and how to solve it?
The newline character is considered a control character, which doesn't print a special character to the screen by default.
As an example, try this:
String str = "Hi";
while (cond) {
str += "\n"; // Equivalent to str = str + "\n" in your code
}
str += "Bye";
System.out.println(str);
It looks like you are running the above code on Windows. The line separator differs between Windows ("\r\n") and Unix flavors ("\n").
So instead of hard-coding "\n" as the newline, try getting it from the system:
String newLine = System.getProperty("line.separator");
String str = " " ;
while( cond ) {
str = str + newLine ;
}
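Since Java 7 the same value is also available from System.lineSeparator(). A small sketch of the loop using it together with a StringBuilder follows; the class name, the method name and the count parameter (which replaces the unspecified cond) are assumptions for the example.

```java
class LineSeparatorDemo {
    // Builds "Hi", then count platform line separators, then "Bye".
    static String withBlankLines(int count) {
        StringBuilder sb = new StringBuilder("Hi");
        for (int i = 0; i < count; i++) {
            sb.append(System.lineSeparator());
        }
        return sb.append("Bye").toString();
    }

    public static void main(String[] args) {
        // "Hi" and "Bye" end up on separate lines with one empty line in between.
        System.out.println(withBlankLines(2));
    }
}
```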
If you really want \n to be printed literally, escape the backslash like this.
String first = "C:/Mine/Java" + "\\n";
System.out.println(first);
OUTPUT is as follows:
C:/Mine/Java\n
For a good reference on why this happens, see the Java Tutorials.
As stated in that tutorial: A character preceded by a backslash is an escape sequence and has a special meaning to the compiler. When an escape sequence is encountered in a print statement, the compiler interprets it accordingly.
Hope this might help.
Regards
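To check whether a string really contains newline characters, it can help to make them visible while debugging. The sketch below (made-up class and method names) replaces each real newline with the two characters backslash and n:

```java
class EscapeDemo {
    // Replaces each newline character with a visible backslash followed by 'n'.
    static String showNewlines(String s) {
        return s.replace("\n", "\\n");
    }

    public static void main(String[] args) {
        String str = "Hi" + "\n" + "\n" + "Bye";
        System.out.println(showNewlines(str)); // Hi\n\nBye
        System.out.println("\n".length());     // 1: a single newline character
        System.out.println("\\n".length());    // 2: a backslash and the letter n
    }
}
```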
Based on your sample, the only reason it would not show a new line character is that cond is never true and thus the while loop never runs...
Looking for quick, simple way in Java to change this string
" hello there "
to something that looks like this
"hello there"
where I replace all those multiple spaces with a single space, except I also want the one or more spaces at the beginning of string to be gone.
Something like this gets me partly there
String mytext = " hello there ";
mytext = mytext.replaceAll("( )+", " ");
but not quite.
Try this:
String after = before.trim().replaceAll(" +", " ");
See also
String.trim()
Returns a copy of the string, with leading and trailing whitespace omitted.
regular-expressions.info/Repetition
No trim() regex
It's also possible to do this with just one replaceAll, but this is much less readable than the trim() solution. Nonetheless, it's provided here just to show what regex can do:
String[] tests = {
" x ", // [x]
" 1 2 3 ", // [1 2 3]
"", // []
" ", // []
};
for (String test : tests) {
System.out.format("[%s]%n",
test.replaceAll("^ +| +$|( )+", "$1")
);
}
There are 3 alternates:
^_+ : any sequence of spaces at the beginning of the string
Match and replace with $1, which captures the empty string
_+$ : any sequence of spaces at the end of the string
Match and replace with $1, which captures the empty string
(_)+ : any sequence of spaces that matches none of the above, meaning it's in the middle
Match and replace with $1, which captures a single space
See also
regular-expressions.info/Anchors
You just need a:
replaceAll("\\s{2,}", " ").trim();
where you match two or more whitespace characters and replace them with a single space and then trim whitespace at the beginning and end (you could actually invert this by first trimming and then matching, to make the regex quicker, as someone pointed out).
To test this out quickly try:
System.out.println(new String(" hello there ").trim().replaceAll("\\s{2,}", " "));
and it will return:
"hello there"
Use the Apache commons StringUtils.normalizeSpace(String str) method. See docs here
This worked perfectly for me : sValue = sValue.trim().replaceAll("\\s+", " ");
The trim() method removes the leading and trailing whitespace, and replaceAll("regex", "string to replace") with the regex "\\s+" matches one or more whitespace characters and replaces each run with a single space.
myText = myText.trim().replaceAll("\\s+"," ");
The following code will compact any whitespace between words and remove any at the string's beginning and end
String input = "\n\n\n a string with many spaces, \n"+
" a \t tab and a newline\n\n";
String output = input.trim().replaceAll("\\s+", " ");
System.out.println(output);
This will output a string with many spaces, a tab and a newline
Note that all whitespace characters, including spaces, tabs and newlines, will be compacted or removed
For more information see the respective documentation:
String#trim() method
String#replaceAll(String regex, String replacement) method
For information about Java's regular expression implementation see the documentation of the Pattern class
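The answers in this thread use two slightly different patterns: " +" only touches the space character, while "\\s+" also collapses tabs and newlines. A short side-by-side sketch (class and method names are made up for the example):

```java
class WhitespaceCollapse {
    // Collapses runs of the space character only; tabs and newlines are kept.
    static String collapseSpaces(String s) {
        return s.trim().replaceAll(" +", " ");
    }

    // Collapses runs of any whitespace (space, tab, newline, ...) into one space.
    static String collapseWhitespace(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String input = "  a \t  b  ";
        System.out.println("[" + collapseSpaces(input) + "]");     // the tab is still there
        System.out.println("[" + collapseWhitespace(input) + "]"); // [a b]
    }
}
```

Which one is right depends on whether tabs and line breaks in the input are meaningful.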
"[ ]{2,}"
This will match two or more consecutive spaces.
String mytext = " hello there ";
//without trim -> " hello there"
//with trim -> "hello there"
mytext = mytext.trim().replaceAll("[ ]{2,}", " ");
System.out.println(mytext);
OUTPUT:
hello there
To eliminate spaces at the beginning and at the end of the String, use String#trim() method. And then use your mytext.replaceAll("( )+", " ").
You can first use String.trim(), and then apply the regex replace command on the result.
Try this one.
Sample Code
String str = " hello there ";
System.out.println(str.replaceAll("( +)"," ").trim());
OUTPUT
hello there
First it replaces every run of spaces with a single space. A run at the start or end of the string is also reduced to a single space rather than removed, so the result then has to be trimmed. After that you get your desired String.
String blogName = "how to do in java . com";
String nameWithProperSpacing = blogName.replaceAll("\\s+", " ");
trim()
Removes only the leading & trailing spaces.
From Java Doc,
"Returns a string whose value is this string, with any leading and trailing whitespace removed."
System.out.println(" D ev Dum my ".trim());
"D ev Dum my"
replace(), replaceAll()
Removes all the spaces in the string,
System.out.println(" D ev Dum my ".replace(" ",""));
System.out.println(" D ev Dum my ".replaceAll(" ",""));
System.out.println(" D ev Dum my ".replaceAll("\\s+",""));
Output:
"DevDummy"
"DevDummy"
"DevDummy"
Note: "\\s+" is a regular expression that matches one or more whitespace characters.
Reference : https://www.codedjava.com/2018/06/replace-all-spaces-in-string-trim.html
In Kotlin it would look like this
val input = "\n\n\n a string with many spaces, \n"
val cleanedInput = input.trim().replace(Regex("(\\s)+"), " ")
A lot of correct answers have been provided so far, and I see a lot of upvotes. The approaches mentioned will work, but they are not really optimized or readable.
I recently came across a solution that every developer will like (StringUtils is from Apache Commons Lang).
String nameWithProperSpacing = StringUtils.normalizeSpace( stringWithLotOfSpaces );
You are done.
This is a readable solution.
You could use lookarounds also.
test.replaceAll("^ +| +$|(?<= ) ", "");
OR
test.replaceAll("^ +| +$| (?= )", "")
<space>(?= ) matches a space character which is followed by another space character. So in consecutive spaces, it would match all the spaces except the last because it isn't followed by a space character. This leaving you a single space for consecutive spaces after the removal operation.
Example:
String[] tests = {
" x ", // [x]
" 1 2 3 ", // [1 2 3]
"", // []
" ", // []
};
for (String test : tests) {
System.out.format("[%s]%n",
test.replaceAll("^ +| +$| (?= )", "")
);
}
See String.replaceAll.
Use the regex "\\s+" and replace with " ".
Then use String.trim.
String str = " hello world";
reduce spaces first
str = str.trim().replaceAll(" +", " ");
capitalize the first letter and lowercase everything else
str = str.substring(0,1).toUpperCase() +str.substring(1,str.length()).toLowerCase();
you should do it like this
String mytext = " hello there ";
mytext = mytext.replaceAll("( +)", " ");
put + inside round brackets.
String str = " this is string ";
str = str.replaceAll("\\s+", " ").trim();
This worked for me
scan = filter(scan, " [\\s]+", " ");
scan = scan.trim();
where filter is following function and scan is the input string:
public String filter(String scan, String regex, String replace) {
StringBuffer sb = new StringBuffer();
Pattern pt = Pattern.compile(regex);
Matcher m = pt.matcher(scan);
while (m.find()) {
m.appendReplacement(sb, replace);
}
m.appendTail(sb);
return sb.toString();
}
A simple method for collapsing repeated spaces inside the string and removing them from both ends.
public String removeWhiteSpaces(String returnString){
returnString = returnString.trim().replaceAll("^ +| +$|( )+", " ");
return returnString;
}
check this...
public static void main(String[] args) {
String s = "A B C D E F G\tH I\rJ\nK\tL";
System.out.println("Current : "+s);
System.out.println("Single Space : "+singleSpace(s));
System.out.println("Space count : "+spaceCount(s));
System.out.format("Replace all = %s", s.replaceAll("\\s+", ""));
// Example where it uses the most.
String s1 = "My name is yashwanth . M"; // renamed: s is already declared above
String s2 = "My nameis yashwanth.M";
System.out.println("Normal : "+s1.equals(s2));
System.out.println("Replace : "+s1.replaceAll("\\s+", "").equals(s2.replaceAll("\\s+", "")));
}
If the String contains only single spaces, replaceAll() will not replace anything.
If there is a run of more than one space, replaceAll() removes that run.
public static String singleSpace(String str){
return str.replaceAll(" {2,}|\t|\r|\n",""); // removes runs of two or more spaces and any tab, CR or LF
}
To count the number of spaces in a String.
public static String spaceCount(String str){
int i = 0;
while(str.indexOf(" ") > -1){
//str = str.replaceFirst(" ", ""+(i++));
str = str.replaceFirst(Pattern.quote(" "), ""+(i++));
}
return str;
}
Pattern.quote("?") returns literal pattern String.
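As a small illustration of that remark, here is a sketch of a literal (non-regex) replaceFirst. The class and method names are made up; Pattern.quote and Matcher.quoteReplacement are the standard java.util.regex helpers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteDemo {
    // Replaces the first occurrence of literal in s, treating both the search
    // text and the replacement as plain text rather than regex syntax.
    static String replaceFirstLiteral(String s, String literal, String replacement) {
        return s.replaceFirst(Pattern.quote(literal), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(Pattern.quote("?"));                     // \Q?\E
        System.out.println(replaceFirstLiteral("a?b?c", "?", "-")); // a-b?c
        // Without quoting, "a?b?c".replaceFirst("?", "-") throws a
        // PatternSyntaxException, because a lone ? is not a valid regex.
    }
}
```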
This was my method before I found the second answer, which uses regex and is a better solution. Maybe someone needs this code.
private String replaceMultipleSpacesFromString(String s){
if(s.length() == 0 ) return "";
int timesSpace = 0;
String res = "";
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if(c == ' '){
timesSpace++;
if(timesSpace < 2)
res += c;
}else{
res += c;
timesSpace = 0;
}
}
return res.trim();
}
Stream version, filters spaces and tabs.
Stream.of(str.split("[ \\t]")).filter(s -> s.length() > 0).collect(Collectors.joining(" "))
I know replaceAll method is much easier but I wanted to post this as well.
public static String removeExtraSpace(String input) {
input= input.trim();
ArrayList <String> x= new ArrayList<>(Arrays.asList(input.split("")));
for(int i=0; i<x.size()-1;i++) {
if(x.get(i).equals(" ") && x.get(i+1).equals(" ")) {
x.remove(i);
i--;
}
}
String word="";
for(String each: x)
word+=each;
return word;
}
String myText = " Hello World ";
myText = myText.trim().replaceAll(" +(?= )", "");
// Output: "Hello World"
string.replaceAll("\\s+", " ");
If you already use Guava (v. 19+) in your project you may want to use this:
CharMatcher.whitespace().trimAndCollapseFrom(input, ' ');
or, if you need to remove only the SPACE character (U+0020), use:
CharMatcher.anyOf(" ").trimAndCollapseFrom(input, ' ');
public class RemoveExtraSpacesEfficient {
public static void main(String[] args) {
String s = "my name is mr space ";
char[] charArray = s.toCharArray();
char prev = s.charAt(0);
for (int i = 0; i < charArray.length; i++) {
char cur = charArray[i];
if (cur == ' ' && prev == ' ') {
} else {
System.out.print(cur);
}
prev = cur;
}
}
}
The solution above is an O(n) algorithm that uses no regex or library helper beyond toCharArray() and charAt().
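The class above prints directly and keeps one trailing space. If you want the result as a String, with leading and trailing spaces dropped too, the same single-pass idea can be sketched with a StringBuilder. This is only one possible variant, and the names are made up:

```java
class SinglePassCollapse {
    // One pass over the input: a space is only written once a later
    // non-space character shows that it sits between two words.
    static String collapse(String s) {
        StringBuilder out = new StringBuilder(s.length());
        boolean pendingSpace = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == ' ') {
                // Remember the space, but only if some text precedes it.
                pendingSpace = out.length() > 0;
            } else {
                if (pendingSpace) {
                    out.append(' ');
                    pendingSpace = false;
                }
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + collapse("  my  name   is mr space  ") + "]"); // [my name is mr space]
    }
}
```

Unlike the version above, this also handles the empty string, because it never calls charAt(0) up front.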
Please use below code
package com.myjava.string;
import java.util.StringTokenizer;
public class MyStrRemoveMultSpaces {
public static void main(String a[]){
String str = "String With Multiple Spaces";
StringTokenizer st = new StringTokenizer(str, " ");
StringBuffer sb = new StringBuffer();
while(st.hasMoreElements()){
sb.append(st.nextElement()).append(" ");
}
System.out.println(sb.toString().trim());
}
}