.split() and [\\W] creates an additional empty string? - java

I'm creating a small program to split a string into tokens (consecutive English alphabet characters), then output the number of tokens as well as the actual tokens. The problem is that an extra empty string element is created wherever there is a comma followed by a space.
I've read up on regular expressions and understand that \W is anything that is not a word character.
String str = sc.nextLine();
// creating an array of tokens
String tokens[] = str.split("[\\W]");
int len = tokens.length;
System.out.println(len);
for (int i = 0; i < len; i++) {
    System.out.println(tokens[i]);
}
Input:
Hello, World.
Expected output:
2
Hello
World
Actual output (the second of the three tokens is an empty string):
3
Hello
World
Note: this is my first Stack Overflow post. If I've done anything wrong, please let me know. Thanks.

Try str.split("\\W+").
\W+ means one or more non-word characters.
\W matches only one character, so the string is split at the comma and then split again at the space.
That's why you get back an extra empty string.
\W+ matches ", " as a single delimiter, so the string is split only once there and you get back only the tokens. (It works for any number of tokens, not just two: "hello, world, again" gives you [hello, world, again].)
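A minimal sketch of the difference, using the input from the question. The class and method names (SplitDemo, splitSingle, splitRun) are only illustrative:

```java
import java.util.Arrays;

class SplitDemo {
    // Splits on every single non-word character.
    static String[] splitSingle(String s) {
        return s.split("\\W");
    }

    // Splits on runs of one or more non-word characters.
    static String[] splitRun(String s) {
        return s.split("\\W+");
    }

    public static void main(String[] args) {
        String input = "Hello, World.";
        System.out.println(Arrays.toString(splitSingle(input))); // [Hello, , World]
        System.out.println(Arrays.toString(splitRun(input)));    // [Hello, World]
        System.out.println(Arrays.toString(splitRun("hello, world, again"))); // [hello, world, again]
    }
}
```

The trailing "." does not add an empty token in either case, because split without a limit argument drops trailing empty strings.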

If you use .split("\\W") you will get empty items if:
non-word chars appear at the start of the string
non-word chars appear in succession, one after another: \W matches one non-word char and breaks the string, then the next non-word char breaks it again, producing empty strings.
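Both cases can be seen in a quick sketch. The input string is borrowed from the Matcher example in this answer, and the class and method names are hypothetical:

```java
import java.util.Arrays;

class LeadingEmptyDemo {
    // Thin wrapper so the two patterns can be compared side by side.
    static String[] splitOn(String input, String regex) {
        return input.split(regex);
    }

    public static void main(String[] args) {
        String input = " abc=-=123";
        // Leading space and each of '=', '-', '=' is a separate delimiter.
        System.out.println(Arrays.toString(splitOn(input, "\\W")));  // [, abc, , , 123]
        // Runs are merged, but the leading delimiter still yields one empty item.
        System.out.println(Arrays.toString(splitOn(input, "\\W+"))); // [, abc, 123]
    }
}
```

Note that switching to \W+ alone removes the empty items in the middle but not the one at the start.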
There are two ways out.
Either remove all non-word chars at the start and then split with \W+:
String tokens[] = str.replaceFirst("^\\W+", "").split("\\W+");
Or, match the chunks of word chars with \w+ pattern:
Pattern p = Pattern.compile("\\w+");
Matcher m = p.matcher(" abc=-=123");
List<String> tokens = new ArrayList<>();
while (m.find()) {
    tokens.add(m.group());
}
System.out.println(tokens);

Try this
Scanner inputter = new Scanner(System.in);
System.out.print("Please enter your thoughts : ");
final String words = inputter.nextLine();
final String[] tokens = words.split("\\W+");
Arrays.stream(tokens).forEach(System.out::println);

Related

Replacing consecutive repeated characters in java

I am working on Twitter data normalization. Twitter users frequently use terms like "I looooooove it" in order to emphasize the word love. I want to reduce such repeated characters to a proper English word by removing repeated characters until I get a meaningful word (I am aware that I cannot differentiate between good and god by this mechanism).
My strategy would be:
1. Identify the existence of such repeated strings. I would look for more than two identical characters in a row, as there is probably no English word with more than two repeated characters.
String[] strings = { "stoooooopppppppppppppppppp","looooooove", "good","OK", "boolean", "mee", "claaap" };
String regex = "([a-z])\\1{2,}";
Pattern pattern = Pattern.compile(regex);
for (String string : strings) {
    Matcher matcher = pattern.matcher(string);
    if (matcher.find()) {
        System.out.println(string+" TRUE ");
    }
}
2. Search for such words in a lexicon like WordNet.
3. Replace all but two of the repeated characters and check the lexicon.
4. If the result is not in the lexicon, remove one more repeated character (otherwise treat it as a misspelling).
Due to my poor Java knowledge I am unable to manage steps 3 and 4. The problem is that I cannot replace all but two of the consecutive repeated characters.
The following code snippet replaces all but one of the repeated characters:
System.out.println(data.replaceAll("([a-zA-Z])\\1{2,}", "$1"));
Help is required to find out
A. How to replace all but 2 consecutive repeat characters
B. How to remove one more consecutive character from the output of A
[I think B can be managed by the following code snippet]
System.out.println(data.replaceAll("([a-zA-Z])\\1{1,}", "$1"));
Edit: The solution provided by Wiktor Stribiżew works perfectly in Java. I was wondering what changes are required to get the same result in Python.
Python uses re.sub.
Your regex ([a-z])\\1{2,} matches and captures a lowercase ASCII letter into Group 1 and then matches 2 or more occurrences of this value. So, all you need is to replace the match with a backreference, $1, that holds the captured value. If you use a single $1, aaaaa will be replaced with a single a, and if you use $1$1, it will be replaced with aa.
String twoConsecutivesOnly = data.replaceAll(regex, "$1$1");
String noTwoConsecutives = data.replaceAll(regex, "$1");
If you need to make your regex case insensitive, use "(?i)([a-z])\\1{2,}" or even "(\\p{Alpha})\\1{2,}". If any Unicode letters must be handled, use "(\\p{L})\\1{2,}".
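As a rough sketch, the two replacements applied to the sample words from the question look like this. The class and method names (RepeatCollapseDemo, keepTwo, keepOne) are made up for the illustration:

```java
class RepeatCollapseDemo {
    // A lowercase letter followed by 2 or more copies of itself (3+ in a row).
    private static final String REGEX = "([a-z])\\1{2,}";

    // Shrinks every run of 3+ identical letters down to two letters.
    static String keepTwo(String s) {
        return s.replaceAll(REGEX, "$1$1");
    }

    // Shrinks every run of 3+ identical letters down to one letter.
    static String keepOne(String s) {
        return s.replaceAll(REGEX, "$1");
    }

    public static void main(String[] args) {
        String[] words = { "stoooooopppppppppppppppppp", "looooooove", "good", "claaap" };
        for (String w : words) {
            System.out.println(w + " -> " + keepTwo(w) + " / " + keepOne(w));
        }
        // looooooove -> loove / love
        // good -> good / good (only two o's, so the regex does not match)
    }
}
```

Words that contain at most two identical letters in a row, such as good or boolean, are left untouched by both variants.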
BONUS: In a general case, to replace any amount of any repeated consecutive chars use
text = text.replaceAll("(?s)(.)\\1+", "$1"); // any chars
text = text.replaceAll("(.)\\1+", "$1"); // any chars but line breaks
text = text.replaceAll("(\\p{L})\\1+", "$1"); // any letters
text = text.replaceAll("(\\w)\\1+", "$1"); // any ASCII alnum + _ chars
/* This code prints each character that occurs at least 3 times in a row (once per run).
   To check for 4 in a row, change RUN_LENGTH to 4;
   to check for 2 in a row, change RUN_LENGTH to 2. */
public class Test1 {
    private static final int RUN_LENGTH = 3;

    public static void main(String[] args) {
        String str = "aabbbbccc";
        int run = 1; // length of the current run of identical characters
        for (int i = 1; i < str.length(); i++) {
            if (str.charAt(i) == str.charAt(i - 1)) {
                run++;
                // report the run exactly once, when it first reaches RUN_LENGTH
                if (run == RUN_LENGTH) {
                    System.out.println(str.charAt(i));
                }
            } else {
                run = 1; // a different character starts a new run
            }
        }
    }
}

Why does split() produce an extra empty element when the limit is set to -1?

I want to split the area code and the remaining number out of a telephone number, without the brackets, so I did this:
String pattern = "[\\(?=\\)]";
String b = "(079)25894029".trim();
String c[] = b.split(pattern,-1);
for (int a = 0; a < c.length; a++)
System.out.println("c[" + a + "]::->" + c[a] + "\nLength::->"+ c[a].length());
Output:
c[0]::-> Length::->0
c[1]::->079 Length::->3
c[2]::->25894029 Length::->8
Expected Output:
c[0]::->079 Length::->3
c[1]::->25894029 Length::->8
So my question is: why does split() produce an extra blank element at the start, e.g. [, 079, 25894029]? Is this its normal behavior, or did I do something wrong here?
How can I get my expected outcome?
First you have unnecessary escaping inside your character class. Your regex is same as:
String pattern = "[(?=)]";
Now, you are getting an empty result because ( is the very first character in the string, and a match at position 0 leaves an empty string in front of it.
To avoid that result, use this code:
String str = "(079)25894029";
// skip the first character unless it is already a digit, then split on the remaining bracket
String[] toks = (Character.isDigit(str.charAt(0)) ? str : str.substring(1)).split("[(?=)]");
for (String tok : toks)
    System.out.printf("<<%s>>%n", tok);
Output:
<<079>>
<<25894029>>
From the Oracle docs for Java 8:
When there is a positive-width match at the beginning of this string
then an empty leading substring is included at the beginning of the
resulting array. A zero-width match at the beginning however never
produces such empty leading substring.
You can check whether the first element of the result is an empty string and, if so, drop it.
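One way to do that, as a sketch: the helper name splitPhone is hypothetical, and the character class is simplified to "[()]" (just the two brackets) rather than the pattern from the question.

```java
import java.util.Arrays;

class DropLeadingEmpty {
    // Splits on either bracket and removes a leading empty element, if any.
    static String[] splitPhone(String number) {
        String[] parts = number.split("[()]", -1);
        if (parts.length > 0 && parts[0].isEmpty()) {
            // copy everything except the empty first element
            return Arrays.copyOfRange(parts, 1, parts.length);
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitPhone("(079)25894029"))); // [079, 25894029]
        System.out.println(Arrays.toString(splitPhone("079)25894029")));  // [079, 25894029]
    }
}
```

Because the limit is -1, trailing empty strings are kept, so an input that ends with a bracket would still produce an empty last element with this sketch.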
Your regex has problems, as does your approach - you can't solve it using your approach with any regex. The magic one-liner you seek is:
String[] c = b.replaceAll("^\\D+|\\D+$", "").split("\\D+");
This removes all leading/trailing non-digits, then splits on non-digits. This will handle many different formats and separators (try a few yourself).
See live demo of this:
String b = "(079)25894029".trim();
String[] c = b.replaceAll("^\\D+|\\D+$", "").split("\\D+");
System.out.println(Arrays.toString(c));
Producing this:
[079, 25894029]

Splitting a sentence when a hyphen is encountered in java

I have following code in my program. It splits a line when a hyphen is encountered and stores each word in the String Array 'tokens'. But I want the hyphen also to be stored in the String Array 'tokens' when it is encountered in a sentence.
String[] tokens = line.split("-");
The above code splits the sentence but also totally ignores the hyphen in the resulting array.
What can I do to store hyphen also in the resulting array?
Edit : -
It seems you want to split on both whitespace and hyphens but keep only the hyphens in the array (as I infer from your line "stores each word in the String Array"). You can use this: -
String[] tokens = "abc this is-a hyphen def".split("((?<=-)|(?=-))|\\s+");
System.out.println(Arrays.toString(tokens));
Output: -
[abc, this, is, -, a, hyphen, def]
For handling spaces before and after hyphen, you can first trim those spaces using replaceAll method, and then do split: -
"abc this is - a hyphen def".replaceAll("[ ]*-[ ]*", "-")
.split("((?<=-)|(?=-))|\\s+");
Previous answer : -
You can use this: -
String[] tokens = "abc-efg".split("((?<=-)|(?=-))");
System.out.println(Arrays.toString(tokens));
OUTPUT : -
[abc, -, efg]
It splits at the zero-width positions just before and just after the hyphen (-).
I suggest using a regular expression in combination with the Java Pattern and Matcher classes. Example:
String line = "a-b-c-d-e-f-";
Pattern p = Pattern.compile("[^-]+|-");
Matcher m = p.matcher(line);
while (m.find())
{
    String match = m.group();
    System.out.println("match:" + match);
}
To test your regular expression, you could use an online regex tester.

Split a String based on regex

I have a string that needs to be split based on the occurrence of a ","(comma), but need to ignore any occurrence of it that comes within a pair of parentheses.
For example, B2B,(A2C,AMM),(BNC,1NF),(106,A01),AAA,AX3
Should be split into
B2B,
(A2C,AMM),
(BNC,1NF),
(106,A01),
AAA,
AX3
FOR NON NESTED
,(?![^\(]*\))
FOR NESTED(parenthesis inside parenthesis)
(?<!\([^\)]*),(?![^\(]*\))
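For the non-nested pattern, a Java version might look like the sketch below. The backslashes are doubled for the Java string literal, the "(" inside the character class is left unescaped, and the class name is illustrative only:

```java
import java.util.Arrays;

class SplitOutsideParens {
    // A comma is a delimiter only if the text after it does not reach a ')'
    // before the next '(' (that is, the comma is not inside a non-nested pair).
    private static final String COMMA_OUTSIDE_PARENS = ",(?![^(]*\\))";

    static String[] split(String input) {
        return input.split(COMMA_OUTSIDE_PARENS);
    }

    public static void main(String[] args) {
        String input = "B2B,(A2C,AMM),(BNC,1NF),(106,A01),AAA,AX3";
        System.out.println(Arrays.toString(split(input)));
        // [B2B, (A2C,AMM), (BNC,1NF), (106,A01), AAA, AX3]
    }
}
```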
Try below:
var str = 'B2B,(A2C,AMM),(BNC,1NF),(106,A01),AAA,AX3';
console.log(str.match(/\([^)]*\)|[A-Z\d]+/g));
// gives you ["B2B", "(A2C,AMM)", "(BNC,1NF)", "(106,A01)", "AAA", "AX3"]
Java edition:
String str = "B2B,(A2C,AMM),(BNC,1NF),(106,A01),AAA,AX3";
Pattern p = Pattern.compile("\\([^)]*\\)|[A-Z\\d]+");
Matcher m = p.matcher(str);
List<String> matches = new ArrayList<String>();
while (m.find()) {
    matches.add(m.group());
}
for (String val : matches) {
    System.out.println(val);
}
A simple iteration will probably be a better option than any regex, especially if your data can have parentheses inside parentheses. For example:
String data = "Some,(data,(that),needs),to (be, splited) by, comma";
StringBuilder buffer = new StringBuilder();
int parenthesesCounter = 0;
for (char c : data.toCharArray()) {
    if (c == '(') parenthesesCounter++;
    if (c == ')') parenthesesCounter--;
    if (c == ',' && parenthesesCounter == 0) {
        // let's do something with the token inside the buffer
        System.out.println(buffer);
        // now we need to clear the buffer
        buffer.delete(0, buffer.length());
    }
    else
        buffer.append(c);
}
// let's not forget about the part after the last comma
System.out.println(buffer);
output
Some
(data,(that),needs)
to (be, splited) by
comma
Try this
\w{3}(?=,)|(?<=,)\(\w{3},\w{3}\)(?=,)|(?<=,)\w{3}
Explanation: there are three alternatives separated by OR (|):
\w{3}(?=,) - matches any 3 word characters (letters, digits, or underscore), with a positive lookahead for a comma
(?<=,)\(\w{3},\w{3}\)(?=,) - matches a parenthesized pair such as (ABC,E4R), with a positive lookbehind and a positive lookahead for a comma
(?<=,)\w{3} - matches any 3 word characters (letters, digits, or underscore), with a positive lookbehind for a comma

How can I find repeated characters with a regex in Java?

Can anyone give me a Java regex to identify repeated characters in a string? I am only looking for characters that are repeated immediately and they can be letters or digits.
Example:
abccde <- looking for this (immediately repeating c's)
abcdce <- not this (c's separated by another character)
Try "(\\w)\\1+"
The \\w matches any word character (letter, digit, or underscore) and the \\1+ matches whatever was in the first set of parentheses, one or more times. So you wind up matching any occurrence of a word character, followed immediately by one or more of the same word character again.
(Note that I gave the regex as a Java string, i.e. with the backslashes already doubled for you)
String stringToMatch = "abccdef";
Pattern p = Pattern.compile("(\\w)\\1+");
Matcher m = p.matcher(stringToMatch);
if (m.find())
{
    System.out.println("Duplicate character " + m.group(1));
}
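The snippet above reports only the first repeat. If every run of repeated characters is wanted, one possible sketch is to loop over m.find() with the same pattern (the class and method names here are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedRuns {
    private static final Pattern REPEAT = Pattern.compile("(\\w)\\1+");

    // Returns every run of an immediately repeated word character, in order.
    static List<String> repeatedRuns(String s) {
        List<String> runs = new ArrayList<>();
        Matcher m = REPEAT.matcher(s);
        while (m.find()) {
            runs.add(m.group()); // the whole run, e.g. "cc"
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(repeatedRuns("abccdeef")); // [cc, ee]
        System.out.println(repeatedRuns("abcdce"));   // []
    }
}
```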
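Regular expressions can be expensive for a check this simple. You would probably be better off just storing the last character and checking whether the next one is the same.
Something along the lines of:
String s = "abccde"; // example input; assumed to be non-empty
char c1 = s.charAt(0);
for (int i = 1; i < s.length(); i++) {
    char c2 = s.charAt(i);
    if (c1 == c2) {
        // c2 immediately repeats the previous character
        System.out.println("Duplicate character " + c2);
    }
    c1 = c2;
}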
