RegEx to match lines consisting of whitespace only - java

I've gone through multiple expressions for hours but can't quite get one to match what I need exactly.
If I have the following input:
Hi
This
Is
A
Test
I am trimming it to:
Hi
This
Is
A
Test
All is good when the blank lines have length 0 (an empty String). However, some inputs contain a few spaces (" ") within those blank lines, so I would like to check whether a string consists of zero or more whitespace characters and nothing else (simply a blank line).
ArrayList<Integer> listOfBlanks = new ArrayList<>();
for(int i = 0; i < arrayList.size(); i++) {
if(arrayList.get(i).isEmpty()) {
if(arrayList.get(i+1).isEmpty())
listOfBlanks.add(i+1);
}
}
String#isEmpty is only good when there are no whitespaces

Use string.trim().isEmpty() to check for length 0 after trimming leading and trailing whitespaces.

The regular expression to do so would be s.matches("\\s*"). \\s* matches zero or more whitespace characters.
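As a minimal sketch of the two checks above (the class and method names are hypothetical, chosen only for illustration), both approaches treat empty and whitespace-only lines as blank:

```java
final class BlankLineCheck {
    // Blank if nothing is left after trimming leading and trailing whitespace.
    static boolean isBlankTrim(String s) {
        return s.trim().isEmpty();
    }

    // Blank if the whole string is zero or more whitespace characters.
    static boolean isBlankRegex(String s) {
        return s.matches("\\s*");
    }

    public static void main(String[] args) {
        String[] samples = {"", "   ", "\t ", " Hi "};
        for (String s : samples) {
            System.out.println("[" + s + "] trim=" + isBlankTrim(s) + " regex=" + isBlankRegex(s));
        }
    }
}
```

In the loop from the question, either helper could stand in for the isEmpty() calls.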

Related

.split() and [\\W] creates an additional empty string?

I'm creating a small program to split a string into tokens (consecutive English alphabet characters), then output the number of tokens as well as the actual tokens. The problem is that an extra empty string element is created wherever there is a comma followed by a space.
I've researched into regular expressions and understand that \W is anything that is not a word character.
String str = sc.nextLine();
// creating an array of tokens
String tokens[] = str.split("[\\W]");
int len = tokens.length;
System.out.println(len);
for (int i = 0; i < len; i++) {
System.out.println(tokens[i]);
}
Input:
Hello, World.
Expected output:
2
Hello
World
Actual output:
3
Hello
World
Note: this is my first stack overflow post, if I've done anything wrong please let me know, thanks
Try str.split("\\W+")
It means one or more non-word characters.
\W matches only one character, so it splits at the comma and then splits again at the space.
That's why it gives you back an extra empty string.
\W+ will match ", " as a single delimiter, so it splits only once and you get back only the tokens. (It works on more than two tokens, too: "hello, world, again" will give you [hello, world, again].)
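A small sketch comparing the two patterns on the input from the question (the class and method names are made up for this illustration):

```java
import java.util.Arrays;

final class NonWordSplitDemo {
    // Splits on every single non-word character.
    static String[] splitSingle(String s) {
        return s.split("\\W");
    }

    // Splits on runs of one or more non-word characters.
    static String[] splitRuns(String s) {
        return s.split("\\W+");
    }

    public static void main(String[] args) {
        // The comma and the space each act as a delimiter, leaving "" between them.
        System.out.println(Arrays.toString(splitSingle("Hello, World."))); // [Hello, , World]
        // ", " is consumed as one delimiter.
        System.out.println(Arrays.toString(splitRuns("Hello, World.")));   // [Hello, World]
        System.out.println(Arrays.toString(splitRuns("hello, world, again")));
    }
}
```

The trailing "." does not add an element in either case because split() drops trailing empty strings by default.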
If you use .split("\\W") you will get empty items if:
non-word char(s) appear(s) at the start of the string
non-word chars appear in succession, one after another: \W matches one non-word char and breaks the string, and then the next non-word char breaks it again, producing empty strings.
There are two ways out.
Either remove all non-word chars at the start and then split with \W+:
String tokens[] = str.replaceFirst("^\\W+", "").split("\\W+");
Or, match the chunks of word chars with \w+ pattern:
Pattern p = Pattern.compile("\\w+");
Matcher m = p.matcher(" abc=-=123");
List<String> tokens = new ArrayList<>();
while(m.find()) {
tokens.add(m.group());
}
System.out.println(tokens);
See the online demo.
Try this
Scanner inputter = new Scanner(System.in);
System.out.print("Please enter your thoughts : ");
final String words = inputter.nextLine();
final String[] tokens = words.split("\\W+");
Arrays.stream(tokens).forEach(System.out::println);

Replace leading zeros till decimal point with dash

If a string is a = 000102.45600, I need to convert it to a = ---102.45600.
Any help in Java using either regex or a String formatter?
I tried the following:
a = a.replaceFirst("^0+(?!$)","-");
but I am getting only a = -102.45600, not 3 dashes.
Rules: Any leading zeros before decimal in string should be replaced by that many dashes.
000023.45677 to ----23.45677
002345.56776 to --2345.56776
00000.45678 to -----.45678
Hopefully I am clear on what my need is?
String subjectString = "000102.45600";
String resultString = subjectString.replaceAll("\\G0", "-");
System.out.println(resultString); // prints ---102.45600
\G acts like \A (the start-of-string anchor) on the first iteration of replaceAll(), but on subsequent passes it anchors the match to the spot where the previous match ended. That prevents it from matching zeroes anywhere else in the string, like after the decimal point.
See: reference SO answer.
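To illustrate the \G approach on the other inputs from the question, here is a short sketch (the wrapper class and method name are hypothetical):

```java
final class LeadingZeroRegex {
    // Replaces each zero of the leading run with a dash; \G keeps every match
    // contiguous with the previous one, so later zeroes are left alone.
    static String dashLeadingZeros(String s) {
        return s.replaceAll("\\G0", "-");
    }

    public static void main(String[] args) {
        System.out.println(dashLeadingZeros("000023.45677")); // ----23.45677
        System.out.println(dashLeadingZeros("002345.56776")); // --2345.56776
        System.out.println(dashLeadingZeros("00000.45678"));  // -----.45678
    }
}
```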
This should do it:
String number = "000102.45600"; // assign a value here
for (int i = number.length(); i > 0; i--) {
    // Check whether the first i characters are all zeroes
    if (number.substring(0, i).matches("0+")) {
        // Replace only that leading run, leaving later zeroes untouched
        System.out.println(number.substring(0, i).replace('0', '-') + number.substring(i));
        break;
    }
}
This searches for the longest substring of number which starts at index 0 and consists entirely of zeroes, starting by checking the entire String, then shortening it until it finds the longest substring of leading zeroes. Once it finds this substring, it replaces each zero in that substring with a dash, leaves the rest of the string unchanged, and breaks out of the loop.
Why not convert the part of the string before the "." to an integer, convert it back to a string, and then compare the lengths? 000102 has length 6 and 102 has length 3, so the difference gives you the leading zero count.
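One possible sketch of that length-comparison idea, assuming the input contains only digits and at most one "." (the class and method names are hypothetical). BigInteger is used here so long digit runs do not overflow, and an all-zero integer part is treated as consisting entirely of leading zeros so that 00000.45678 still yields five dashes:

```java
import java.math.BigInteger;

final class LeadingZeroCount {
    static String dashLeadingZeros(String s) {
        int dot = s.indexOf('.');
        String intPart = dot < 0 ? s : s.substring(0, dot);
        String rest = dot < 0 ? "" : s.substring(dot);
        if (intPart.isEmpty()) {
            return s; // nothing before the decimal point
        }
        BigInteger value = new BigInteger(intPart);
        // Assumption: when the integer part is all zeros, every digit counts as leading.
        String stripped = value.signum() == 0 ? "" : value.toString();
        int zeros = intPart.length() - stripped.length();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < zeros; i++) {
            out.append('-');
        }
        return out.append(stripped).append(rest).toString();
    }

    public static void main(String[] args) {
        System.out.println(dashLeadingZeros("000102.45600")); // ---102.45600
        System.out.println(dashLeadingZeros("00000.45678"));  // -----.45678
    }
}
```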

why split() produces extra , after sets limit -1

I want to split Area Code and preceding number from Telephone number without brackets so i did this.
String pattern = "[\\(?=\\)]";
String b = "(079)25894029".trim();
String c[] = b.split(pattern,-1);
for (int a = 0; a < c.length; a++)
System.out.println("c[" + a + "]::->" + c[a] + "\nLength::->"+ c[a].length());
Output:
c[0]::-> Length::->0
c[1]::->079 Length::->3
c[2]::->25894029 Length::->8
Expected Output:
c[0]::->079 Length::->3
c[1]::->25894029 Length::->8
So my question is: why does split() produce an extra blank at the start, e.g.
[, 079, 25894029]? Is this its normal behavior, or did I do something wrong here?
How can I get my expected outcome?
First you have unnecessary escaping inside your character class. Your regex is same as:
String pattern = "[(?=)]";
Now, you are getting an empty result because ( is the very first character in the string, and a split at position 0 does indeed produce an empty leading string.
To avoid that result use this code:
String str = "(079)25894029";
String[] toks = (Character.isDigit(str.charAt(0)) ? str : str.substring(1)).split("[(?=)]");
for (String tok: toks)
System.out.printf("<<%s>>%n", tok);
Output:
<<079>>
<<25894029>>
From the Java8 Oracle docs:
When there is a positive-width match at the beginning of this string
then an empty leading substring is included at the beginning of the
resulting array. A zero-width match at the beginning however never
produces such empty leading substring.
You can check whether the first element of the resulting array is an empty string and, if it is, drop that element.
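A minimal sketch of that check, assuming a simplified character class "[()]" and a hypothetical helper name; it splits first and then removes a leading empty element if one is present:

```java
import java.util.Arrays;

final class LeadingEmptyDemo {
    // Splits s on regex and drops the first element when it is an empty string.
    static String[] splitDroppingLeadingEmpty(String s, String regex) {
        String[] parts = s.split(regex);
        if (parts.length > 0 && parts[0].isEmpty()) {
            return Arrays.copyOfRange(parts, 1, parts.length);
        }
        return parts;
    }

    public static void main(String[] args) {
        String b = "(079)25894029";
        // The "(" at index 0 is a positive-width match, so "" comes first.
        System.out.println(Arrays.toString(b.split("[()]")));                      // [, 079, 25894029]
        System.out.println(Arrays.toString(splitDroppingLeadingEmpty(b, "[()]"))); // [079, 25894029]
    }
}
```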
Your regex has problems, as does your approach - you can't solve it using your approach with any regex. The magic one-liner you seek is:
String[] c = b.replaceAll("^\\D+|\\D+$", "").split("\\D+");
This removes all leading/trailing non-digits, then splits on non-digits. This will handle many different formats and separators (try a few yourself).
See live demo of this:
String b = "(079)25894029".trim();
String[] c = b.replaceAll("^\\D+|\\D+$", "").split("\\D+");
System.out.println(Arrays.toString(c));
Producing this:
[079, 25894029]

String.split() returning a "" unexpectedly

I have a simple method splitting a string into an array. It splits it where there are non-letter characters. The line I am using right now is as follows:
String[] words = str.split("[^a-zA-Z]");
So this should split the string wherever there is a non-alphabetical character. But the problem is that the split works for some inputs, but not all. For example:
String str = "!!day--yaz!!";
String[] words = str.split("[^a-zA-Z]");
String result = "";
for (int i = 0; i < words.length; i++) {
result += words[i] + "1 ";
}
return result;
I added the 1 in there to see where the split takes place, because I was getting errors on null values. Anyway, when I run this code I get an output of:
1 1 day1 1 yaz1
Why is it splitting between the first two !'s and after one of the -'s, but not after the last two !'s? Why is it even splitting there at all? Any help on this would be great!
It doesn't split before or after the matches; it splits ON the matches, so you get an empty String between the dashes and between the bangs.
This doesn't apply to the trailing bangs, because trailing empty Strings are omitted, as described in the javadoc:
Trailing empty strings are therefore not included in the resulting
array.
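The effect of dropping trailing empty strings can be seen by comparing the default split with a negative limit, which keeps them. A short sketch using the string from the question (the class and method names are illustrative only):

```java
import java.util.Arrays;

final class TrailingEmptyDemo {
    // Default behavior: trailing empty strings are removed.
    static String[] splitDefault(String s) {
        return s.split("[^a-zA-Z]");
    }

    // A negative limit keeps trailing empty strings.
    static String[] splitKeepTrailing(String s) {
        return s.split("[^a-zA-Z]", -1);
    }

    public static void main(String[] args) {
        String str = "!!day--yaz!!";
        System.out.println(Arrays.toString(splitDefault(str)));      // [, , day, , yaz]
        System.out.println(Arrays.toString(splitKeepTrailing(str))); // [, , day, , yaz, , ]
    }
}
```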
This happens because it indeed uses every non-letter character as a delimiter. It means that the string "!" will be split into 2 empty strings, one to the left and one to the right of the exclamation mark.
Your problem can be solved in 2 steps.
Use "[^a-zA-Z]+" instead of "[^a-zA-Z]". The + will help you avoid the empty string between 2 dashes.
Remove leading and trailing non-letter characters before splitting. This will remove leading and trailing empty strings: str.replaceFirst("^[^a-zA-Z]+", "").replaceFirst("[^a-zA-Z]+$", "")
Finally your split will look like:
String[] words = str.replaceFirst("^[^a-zA-Z]+", "").replaceFirst("[^a-zA-Z]+$", "").split("[^a-zA-Z]+");
If you want to get rid of some of the extra splits, use split("[^a-zA-Z]+") instead of split("[^a-zA-Z]"). This will match a continuous part of the String that matches the pattern.

Why are my character and word counts off?

Given the following string:
String text = "The woods are\nlovely,\t\tdark and deep.";
I want all whitespace treated as a single character. So for instance, the \n is 1 char. The \t\t should also be 1 char. With that logic, I count 36 characters and 7 words. But when I run this through the following code:
String text = "The woods are\nlovely,\t\tdark and deep.";
int numNewCharacters = 0;
for(int i=0; i < text.length(); i++)
if(!Character.isWhitespace(text.charAt(i)))
numNewCharacters++;
int numNewWords = text.split("\\s").length;
// Prints "30"
System.out.println("Chars:" + numNewCharacters);
// Prints "8"
System.out.println("Words:" + numNewWords);
It's telling me that there are 30 characters and 8 words. Any ideas as to why? Thanks in advance.
You are matching on individual whitespaces. Instead you could match on one or more:
text.split("\\s+")
You are counting only non white space characters in the first loop - so not counting space etc at all. Then 30 is the right answer. As for the second - I suspect split is treating consecutive white spaces as distinct, so there is a "null" word between the two tabs.
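To make both observations concrete, here is a small sketch run against the string from the question (the class and method names are hypothetical). The single-character pattern yields an empty string between the two tabs, while the loop only counts non-whitespace characters:

```java
import java.util.Arrays;

final class WordCountDemo {
    static final String TEXT = "The woods are\nlovely,\t\tdark and deep.";

    // Counts characters that are not whitespace, as the loop in the question does.
    static int countNonWhitespace(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Each tab is its own delimiter, so "" appears between "lovely," and "dark".
        System.out.println(Arrays.toString(TEXT.split("\\s")));
        System.out.println(TEXT.split("\\s").length);  // 8
        System.out.println(TEXT.split("\\s+").length); // 7
        System.out.println(countNonWhitespace(TEXT));  // 30
    }
}
```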
Reimueus has already solved your word count problem:
text.split("\\s+")
And your character count is correct. Newlines \n and tabs \t are considered whitespace. If you don't want them to be, you can implement your own isWhitespace function.
Here is the complete solution to counting words and characters:
System.out.println("Characters: " + text.replaceAll("\\s+", " ").length());
Matcher m = Pattern.compile("[^\\s]+", Pattern.MULTILINE).matcher(text);
int wordCount = 0;
while (m.find()) {
wordCount ++;
}
System.out.println("Words: "+ wordCount);
The character count is obtained by replacing each group of whitespace characters with a single space and taking the resulting string's length.
For the word count we create a pattern that matches any group of characters that contains no whitespace. You could use the \\w+ pattern here, but it will match only alphanumeric characters and underscores. Note that Pattern.MULTILINE only changes how ^ and $ match, so it has no effect on this pattern and can be omitted.
