Java Split regex - java

Given a string S, find the number of words in that string. For this problem a word is defined by a string of one or more English letters.
Note: Space or any of the special characters like ![,?.\_'#+] will act as a delimiter.
Input Format: The string will only contain lower case English letters, upper case English letters, spaces, and these special characters: ![,?._'#+].
Output Format: On the first line, print the number of words in the string. The words don't need to be unique. Then, print each word in a separate line.
My code:
Scanner sc = new Scanner(System.in);
String str = sc.nextLine();
String regex = "( |!|[|,|?|.|_|'|#|+|]|\\\\)+";
String[] arr = str.split(regex);
System.out.println(arr.length);
for(int i = 0; i < arr.length; i++)
System.out.println(arr[i]);
When I submit the code, it works for just over half of the test cases. I do not know what the test cases are. I'm asking for help in the spirit of Murphy's law: in which situations will the regex I implemented fail?

You don't escape some special characters in your regex. Let's start with []. Since you don't escape them, the part [|,|?|.|_|'|#|+|] is treated like a set of characters |,?._'#+. This means that your regex doesn't split on [ and ].
For example, x..]y+[z is split into x, ]y and [z.
You can fix that by escaping those characters. That will force you to escape more of them and you end up with a proper definition:
String regex = "( |!|\\[|,|\\?|\\.|_|'|#|\\+|\\])+";
Note that instead of defining alternatives, you could use a set which will make your regex easier to read:
String regex = "[ !\\[,?._'#+\\]]+";
In this case you only need to escape [ and ].
UPDATE:
There's also a problem with a leading special character (as in your example ".Hi?there[broski.]#####"). You need to split on it, but doing so produces an empty string as the first element of the result. I don't think there's a way to make split skip it, but you can avoid it by removing a leading run of delimiters before splitting. Anchor the regex with ^ so that only a match at the very start of the string is removed:
String[] arr = str.replaceFirst("^" + regex, "").split(regex);
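Putting the pieces together, here is a minimal sketch that assumes the delimiter set from the problem statement. The class name WordSplitter and the method name words are hypothetical. It also guards against input that contains no letters at all, because "".split(...) returns a one-element array holding the empty string, which would otherwise be counted as a word.

```java
public class WordSplitter {
    // Delimiters from the problem statement: space and ![,?._'#+]
    private static final String DELIMS = "[ !\\[,?._'#+\\]]+";

    // Returns the words of s; input without any letters yields an empty array.
    public static String[] words(String s) {
        // Strip a leading run of delimiters so split() produces no empty first element.
        String stripped = s.replaceFirst("^" + DELIMS, "");
        if (stripped.isEmpty()) {
            return new String[0];
        }
        // Trailing empty strings are dropped by split() itself.
        return stripped.split(DELIMS);
    }

    public static void main(String[] args) {
        String[] arr = words(".Hi?there[broski.]#####");
        System.out.println(arr.length); // 3
        for (String w : arr) {
            System.out.println(w);      // Hi, there, broski (one per line)
        }
    }
}
```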

Related

Java String.split : Trouble using \\W as non word delimiter

Going by the suggestion provided here
I tried using \\W as a delimiter for non word character in string.split function of java.
String str = "id-INT, name-STRING,";
This looks like a really simple string. I wanted to extract just the words from it. The length of the array that I get is 5, whereas it should be 4. There is an empty string at the position right after INT. I don't understand why the space there is not being treated as a non-word character.
The , and the space are being treated as separate delimiters, so an empty string appears between them. Try using \\W+ instead:
String str = "id-INT, name-STRING,";
String[] parts = str.split("\\W+");
System.out.println(parts.length);
System.out.println(Arrays.toString(parts));
Which outputs
4
[id, INT, name, STRING]
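To see the difference side by side, the following small sketch runs both patterns on the same input. The class name NonWordSplitDemo and the helper tokens are hypothetical.

```java
import java.util.Arrays;

public class NonWordSplitDemo {
    // Thin wrapper around String.split so both patterns can be compared.
    public static String[] tokens(String input, String delimiterRegex) {
        return input.split(delimiterRegex);
    }

    public static void main(String[] args) {
        String str = "id-INT, name-STRING,";

        // "\\W" splits at every single non-word character, so ", " (comma
        // followed by space) leaves an empty string between the two delimiters.
        String[] single = tokens(str, "\\W");
        System.out.println(single.length + " " + Arrays.toString(single));
        // 5 [id, INT, , name, STRING]

        // "\\W+" treats a whole run of non-word characters as one delimiter.
        String[] run = tokens(str, "\\W+");
        System.out.println(run.length + " " + Arrays.toString(run));
        // 4 [id, INT, name, STRING]
    }
}
```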

Java String delete tokens contains numbers

I have a string like this and I would like to eliminate all the tokens that contain a number:
String[] s="In the 1980s".split(" ");
Is there a way to remove the tokens that contain numbers - in this case 1980s, but also, for example 784th or s787?
Use a \w*\d\w* regex for that. It matches every word with at least one digit in it. Although I generally despise regexes, they are particularly well suited to your problem.
String[] s = input.replaceAll("\\w*\\d\\w* *", "").split(" +");
See Java lib docs for Pattern/Matcher (RegEx) for more reference how to work with regexes in general.
Test code:
http://ideone.com/LrHDsT
Remove the unwanted words first, then split:
String[] s = str.replaceAll("\\w*\\d\\w*", "").trim().split(" +");
Some test code:
String str = "666 In the 1980s 784th s787 foo BAR";
String[] s = str.replaceAll("\\w*\\d\\w*", "").trim().split(" +");
System.out.println(Arrays.toString(s));
Output:
[In, the, foo, BAR]
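You could use a regex as suggested by @vaxquis. Alternatively, after splitting the string on the delimiter, you could inspect each token and remove those that contain a digit.
Note that NumberUtils.isNumber (Apache Commons Lang) checks whether the whole token is a valid number, so it would catch 666 but not 1980s or s787; those need a per-character digit check.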
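The split-then-filter idea can be sketched in plain Java without any library. The class name DigitTokenFilter and both method names are hypothetical, and the sketch assumes tokens are separated by whitespace.

```java
import java.util.ArrayList;
import java.util.List;

public class DigitTokenFilter {
    // True if the token contains at least one digit anywhere.
    public static boolean containsDigit(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (Character.isDigit(token.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // Splits on whitespace and keeps only the tokens without digits.
    public static List<String> withoutDigitTokens(String input) {
        List<String> kept = new ArrayList<>();
        for (String token : input.trim().split("\\s+")) {
            // An empty input yields one empty token; skip it.
            if (!token.isEmpty() && !containsDigit(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(withoutDigitTokens("666 In the 1980s 784th s787 foo BAR"));
        // [In, the, foo, BAR]
    }
}
```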
split alone doesn't seem to be what you are looking for. Even if you remove the words that contain a digit, as in
"1foo f2oo bar whatever baz2"
you will end up with
" bar whatever "
and if you split on spaces now you will end up with ["", "bar", "whatever"].
To solve this problem, you may also want to remove the spaces after each word you removed, so that
"1foo f2oo bar whatever baz2"
would become
"bar whatever "
so it can be split correctly (space at the end is not the problem since split by default removes trailing empty strings in result array).
But instead of doing two passes (removing words and then splitting the string), you can achieve the same thing in a single pass. All you need to do is take the opposite approach: instead of focusing on removing the wrong elements, let's try to find the correct ones.
Correct tokens seem to be words that contain non-space characters but no digits. You can represent such words with the regex \b[\S&&\D]+\b where:
\b represents word boundaries,
\S any non whitespace character
\D any non digit character
[\S&&\D] intersection of non-whitespaces and non-digits, in other words non-whitespace characters which are also non-digits
Demo:
String input = "1foo f2oo bar whatever baz2";
Pattern p = Pattern.compile("\\b[\\S&&\\D]+\\b");
Matcher m = p.matcher(input);
while(m.find())
System.out.println(m.group());
Output:
bar
whatever
BTW, to avoid an empty element at the start of the results, you can use Scanner, which doesn't return an empty element if a delimiter is found at the start of the string. We can simply set the delimiter to a series of spaces or words that contain a digit, so your code can also look like this:
Scanner sc = new Scanner(input);
sc.useDelimiter("(\\s|\\w*\\d\\w*)+");
while (sc.hasNext())
System.out.println(sc.next());
sc.close();

Java Split on Spaces and Special Characters

I am trying to split a string on spaces and some specific special characters.
Given the string "john - & + $ ? . # boy"
I want to get the array:
array[0]="john";
array[1]="boy";
I've tried several regular expressions and gotten nowhere. Here is my current stab:
String[] terms = uglString.split("\\s+|[\\-\\+\\$\\?\\.#&].*");
Which preserves "john" but not "boy". Can anyone get me the rest of this?
Just use:
String[] terms = input.split("[\\s#&.?$+-]+");
You can put a shorthand character class inside a character class (note the \s), and most metacharacters lose their special meaning inside a character class, except for [, ], -, &, ^ and \. However, & is meaningful only when it comes in a pair (&&), ^ only when it is the first character, and - is treated as a literal character if it is placed at the beginning or the end of the character class.
Other languages may have different rules for parsing the pattern, but the rule about - applies for most of the engines.
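A short sketch of these character-class rules in Java follows. The class name CharClassDemo and the helper splitTerms are hypothetical; the pattern is the one suggested above.

```java
import java.util.Arrays;

public class CharClassDemo {
    public static String[] splitTerms(String input) {
        // Inside [...], '.', '?', '$', '+' and a single '&' are literals;
        // '-' is literal here because it is the last character of the class.
        return input.split("[\\s#&.?$+-]+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitTerms("john - & + $ ? . # boy")));
        // [john, boy]

        // A '-' between two characters forms a range instead of a literal.
        System.out.println("b".matches("[a-c]")); // true:  b lies in the range a..c
        System.out.println("-".matches("[a-c]")); // false: '-' is not part of that range
        System.out.println("-".matches("[ac-]")); // true:  a trailing '-' is a literal
    }
}
```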
As @Sean Patrick Floyd mentioned in his answer, the important thing boils down to defining what constitutes a word. \w in Java is equivalent to [a-zA-Z0-9_] (English letters in upper and lower case, digits and underscore), and therefore \W consists of all other characters. If you want to consider Unicode letters and digits, you may want to look at Unicode character classes.
You could make your code much simpler by replacing your pattern with "\\W+" (one or more occurrences of a non-word character). This way you are whitelisting characters instead of blacklisting them, which is usually a good idea.
And of course things could be made more efficient by using Guava's Splitter class.
Try this (replace takes a literal string, so use replaceAll with a character class instead):
input.replaceAll("[-&+$?.#]", " ").trim().split("\\s+");
Breaking it down step by step:
For your case, you replace non-word chars (as pointed out). Now you might want to preserve the spaces for an easy String split.
String ugly = "john - & + $ ? . # boy";
String words = ugly.replaceAll("[^\\w\\s]", "");
There are a lot of spaces in the resulting String which you might want to generally trim to just 1 space:
String formatted = words.trim().replaceAll(" +", " ");
Now you can easily split the String into the words to a String Array:
String[] terms = formatted.split("\\s");
System.out.println(terms[0]);
To add to what has been said about Splitter, you can do something like this:
String str = "john - & + $ ? . # boy";
Iterable<String> ttt = Splitter.on(Pattern.compile("\\W")).trimResults().omitEmptyStrings().split(str);
Use this format:
String s = "john - & + $ ? . # boy";
String reg = "[-&+$?.# ]+";
String[] res = s.split(reg);
Include every character that you want to split on inside the [ ] brackets; the trailing + treats a run of delimiters as a single one.
You can use something like the following:
arrayOfStringType = string.split(" |'|,|\\.|\\+|_");
'|' works as an OR operator here. The . and + characters are regex metacharacters, so they must be escaped.
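Why escaping matters when splitting on alternatives can be shown with a small sketch. The class name EscapeDemo and the helper splitLiteral are hypothetical; Pattern.quote is the standard java.util.regex way to turn a string into a literal pattern.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class EscapeDemo {
    // Splits on a literal delimiter, whatever metacharacters it contains.
    public static String[] splitLiteral(String s, String literal) {
        return s.split(Pattern.quote(literal));
    }

    public static void main(String[] args) {
        String s = "a.b+c";

        // An unescaped '.' matches every character, so every piece is empty
        // and split() drops them all as trailing empty strings.
        System.out.println(s.split(".").length); // 0

        // Escaped alternatives split on the literal characters.
        System.out.println(Arrays.toString(s.split("\\.|\\+"))); // [a, b, c]

        // Pattern.quote avoids manual escaping altogether.
        System.out.println(Arrays.toString(splitLiteral(s, "."))); // [a, b+c]
    }
}
```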

Split string by punctuation marks in Java

I'm trying to do the following:
String[] Res = Text.split("[\\p{Punct}\\s]+");
But, I always get a few words with space before them.
How can I parse the sentence without getting spaces and other punctuation marks as a part of the word itself?
Since you didn't provide a sample input that reproduces the problem, I can only guess. I can't see why the regex you provided would ever leave spaces in the result unless you are using non-ASCII whitespace or punctuation characters. The reason is that \\p{Punct} (a POSIX character class) and \\s are both limited to ASCII by default, e.g. \\s will not match \u00a0. Use [\\p{IsPunctuation}\\p{IsWhite_Space}]+ if non-ASCII punctuation and whitespace characters are your problem.
Example
String text="Some\u00a0words stick together⁈";
String[] res1 = text.split("[\\p{Punct}\\s]+");
System.out.println(Arrays.toString(res1));
String[] res2 = text.split("[\\p{IsPunctuation}\\p{IsWhite_Space}]+");
System.out.println(Arrays.toString(res2));
will produce:
[Some words, stick, together⁈]
[Some, words, stick, together]
You need to trim() all the Strings in the array before using them. This will eliminate all the leading and trailing white spaces.
str = str.trim();
In your case
for(String str : Res) {
str = str.trim();
// use str now, without any white spaces
}
If you also need to keep the punctuation, use StringTokenizer, whose constructor takes a boolean flag that controls whether the delimiters are returned as tokens.
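A minimal sketch of that StringTokenizer usage follows. The class name KeepDelimiters and the helper tokens are hypothetical. Note that the second constructor argument is a set of delimiter characters, not a regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class KeepDelimiters {
    // Returns words and delimiter characters in their original order.
    public static List<String> tokens(String text, String delimiterChars) {
        List<String> out = new ArrayList<>();
        // Third argument true: each delimiter character is returned as its own token.
        StringTokenizer st = new StringTokenizer(text, delimiterChars, true);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Yields the five tokens "Hi", ",", " ", "there", "!"
        for (String t : tokens("Hi, there!", " ,!")) {
            System.out.println("\"" + t + "\"");
        }
    }
}
```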
To remove leading or trailing spaces, use:
String str=" java ";
str = str.trim();

Java (Regex) - Get all words in a sentence

I need to split a java string into an array of words. Let's say the string is:
"Hi!! I need to split this string, into a serie's of words?!"
At the moment I've tried using String[] strs = str.split("(?!\\w)"); however, it keeps symbols such as ! in the array, and it also keeps strings like "Hi!" in the array. The string I am splitting will always be lowercase. What I would like is for an array to be produced that looks like:
{"hi", "i", "need", "to", "split", "this", "string", "into", "a", "serie's", "of", "words"} - Note the apostrophe is kept.
How could I change my regex to not include the symbols in the array?
Apologies, I would define a word as a sequence of alphanumeric characters only but with the ' character inclusive if it is in the above context such as "it's", not if it is used to a quote a word such as "'its'". Also, in this context "hi," or "hi-person" are not words but "hi" and "person" are. I hope that clarifies the question.
You can remove all !, ? and , characters and then split into words:
str = str.replaceAll("[!?,]", "");
String[] words = str.split("\\s+");
Result:
Hi, I, need, to, split, this, string, into, a, serie's, of, words
Should work for what you want.
String line = "Hi!! I need to split this string, into a serie's of words?! but not '' or ''' word";
String regex = "([^a-zA-Z']+)'*\\1*";
String[] split = line.split(regex);
System.out.println(Arrays.asList(split));
Gives
[Hi, I, need, to, split, this, string, into, a, serie's, of, words, but, not, or, word]
If you define a word as a sequence of non-whitespace characters (whitespace character as defined by \s), then you can split along space characters:
str.split("\\s+")
Note that ";.';.##$>?>#4", "very,bad,punctuation", and "'goodbye'" are words under the definition above.
Then the other approach is to define a word as a sequence of characters from a set of allowed characters. If you want to allow a-z, A-Z, and ' as part of a word, you can split along everything else:
str.split("[^a-zA-Z']+")
This will still allow "''''''" to be defined as a word, though.
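One way around the stray-apostrophe problem is to match words instead of splitting on delimiters. The sketch below assumes the word definition given in the question (alphanumeric characters, with ' allowed only between them); the class name WordExtractor, the method name words and the pattern itself are illustrative, not taken from the answers above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordExtractor {
    // One or more alphanumerics, optionally followed by groups of
    // "apostrophe + alphanumerics", so an apostrophe only counts inside a word.
    private static final Pattern WORD =
            Pattern.compile("[a-zA-Z0-9]+(?:'[a-zA-Z0-9]+)*");

    public static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hi!! I need to split this string, into a serie's of words?!"));
        // [Hi, I, need, to, split, this, string, into, a, serie's, of, words]
    }
}
```

With this pattern, "'its'" yields its, "hi-person" yields hi and person, and a run such as "''''''" yields nothing.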
So what you want is to split on anything that is not a word character [a-zA-Z] and is not a '.
This regex will do that: "[^a-zA-Z']+"
There will be a problem if the string contains a quotation enclosed in ' characters.
I usually use this page for testing my regexes:
http://www.regexplanet.com/advanced/java/index.html
I would use str.split("[\\s,?!]+"). You can add whatever character you want to split with inside the brackets [].
You could filter out the characters you deem as "non-word" characters:
String[] strs = str.split("[,!? ]+");
myString.replaceAll("[^a-zA-Z'\\s]","").toLowerCase().split("\\s+");
The replaceAll("[^a-zA-Z'\\s]","") call removes every character that is not a-z, A-Z, ' or whitespace. toLowerCase then converts the result to lower case, and finally split breaks the string on runs of whitespace. A more readable version:
myString = myString.replaceAll("[^a-zA-Z'\\s]","");
myString = myString.toLowerCase();
String[] strArr = myString.split("\\s+");
