Matching strings and storing into an array with regex in java - java

I'm making a program that takes a file and finds identifiers. So far I have removed any words in quotes, any words that start with a number, and all the non-word characters.
Is there a way to find words that don't match words in an array and store those words in another array using regex? I can't figure it out. I was trying to use the split method, but it's not working right when I try to split by spaces. This is what I did to split it:
String[] SplitString = newLine.split("[\\s]");

Use
String[] SplitString = newLine.split("\\s");
if you don't want to combine multiple spaces/tabs, etc., but use
String[] SplitString = newLine.split("\\s+");
if you do. For example, if your string is (note the two spaces between a and b):
"a  b c"
the first will give you four tokens: "a", "", "b", and "c", and the second will give you three: "a", "b", and "c".
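The difference between the two patterns can be checked with a small sketch. The class and method names here are made up for illustration, and the input has two spaces between a and b:

```java
import java.util.Arrays;

class WhitespaceSplitDemo {
    // Every single whitespace character is a delimiter,
    // so two adjacent spaces produce an empty token between them.
    static String[] splitEach(String line) {
        return line.split("\\s");
    }

    // A whole run of whitespace counts as one delimiter.
    static String[] splitRuns(String line) {
        return line.split("\\s+");
    }

    public static void main(String[] args) {
        String line = "a  b c"; // two spaces between a and b
        System.out.println(Arrays.toString(splitEach(line))); // [a, , b, c]
        System.out.println(Arrays.toString(splitRuns(line))); // [a, b, c]
    }
}
```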

You can do it simply in one line by first removing the known words, then splitting:
String[] unknownWords = newLine.replaceAll("\\b(apple|orange|banana)\\b", "").split("\\s+");
Notes:
Your regex [\s] is equivalent to \s, so I simplified it
You should probably split on any number of spaces: \s+
\b means "word boundary" - this means the removal regex won't match applejack
The regex (A|B|C|etc) is the syntax for "OR" logic
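One caveat with the one-liner: if the line starts with a known word, the removal leaves a leading space and split returns an empty first element. As an alternative sketch (not the answer's method), you could split first and then filter against a set of known words. The helper name unknownWords and the sample words are assumptions for illustration only:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class UnknownWordsDemo {
    // Hypothetical helper: returns the tokens of `line` that are not in `known`.
    static String[] unknownWords(String line, Set<String> known) {
        return Arrays.stream(line.trim().split("\\s+"))
                .filter(w -> !w.isEmpty())        // "".split(...) yields one empty token
                .filter(w -> !known.contains(w))  // whole-token comparison, so applejack survives
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        Set<String> known = new HashSet<>(Arrays.asList("apple", "orange", "banana"));
        String line = "apple pie orange applejack banana";
        System.out.println(Arrays.toString(unknownWords(line, known))); // [pie, applejack]
    }
}
```

Because the comparison is done per token, no word-boundary handling is needed, and the known words can come from an existing array via Arrays.asList.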

Related

Java regex - split string with leading special characters

I am trying to split a string that contains whitespaces and special characters. The string starts with special characters.
When I run the code, the first array element is an empty string.
String s = ",hm ..To?day,.. is not T,uesday.";
String[] sArr = s.split("[^a-zA-Z]+\\s*");
Expected result is ["hm", "To", "day", "is", "not", "T", "uesday"]
Can someone explain how this is happening?
Actual result is ["", "hm", "To", "day", "is", "not", "T", "uesday"]
Split is behaving as expected by splitting off a zero-length string at the start before the first comma.
To fix, first remove all splitting chars from the start:
String[] sArr = s.replaceAll("^([^a-zA-Z]*\\s*)*", "").split("[^a-zA-Z]+\\s*");
Note that I’ve altered the removal regex to trim any sequence of spaces and non-letters from the front.
You don’t need to remove from the tail because split discards empty trailing elements from the result.
I would simplify it by making it a two-step process rather than trying to achieve a pure regex split() operation:
s.replaceAll("[^a-zA-Z]+", " ").trim().split(" ")
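A minimal sketch of the two-step approach on the question's input follows. The class and method names are hypothetical, and the empty-input guard is an addition that the answers above do not mention:

```java
import java.util.Arrays;

class LetterRunsDemo {
    // Replace every run of non-letters with one space, trim, then split on that space.
    static String[] letterRuns(String s) {
        String cleaned = s.replaceAll("[^a-zA-Z]+", " ").trim();
        // "".split(" ") would return one empty element, so handle that case explicitly.
        return cleaned.isEmpty() ? new String[0] : cleaned.split(" ");
    }

    public static void main(String[] args) {
        String s = ",hm ..To?day,.. is not T,uesday.";
        // A plain split keeps the empty string in front of the leading comma.
        System.out.println(Arrays.toString(s.split("[^a-zA-Z]+\\s*")));
        // [, hm, To, day, is, not, T, uesday]
        System.out.println(Arrays.toString(letterRuns(s)));
        // [hm, To, day, is, not, T, uesday]
    }
}
```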

Splitting string in java produces empty element first

I'm trying to split a string on multiple or single occurrences of "O"; all other characters will be dots. I'm wondering why this produces an empty string first.
String row = ".....O.O.O";
String[] arr = row.split("\\.+");
This produces:
["", "O", "O", "O"]
You just need to make sure that any trailing or leading dots are removed.
So one solution is:
row.replaceAll("^\\.+|\\.+$", "").split("\\.+");
For this pattern you can use replaceFirst() to strip the leading dots and then split on dots:
String[] arr = row.replaceFirst("^\\.+", "").split("\\.+");
Output will be
["O","O","O"]
The "+" quantifier collapses each run of separators into a single delimiter, so your split is essentially splitting the following string on ".":
.O.O.O
This, of course, means that your first field is empty. Hence the result you get.
To avoid this, strip all leading separators from the string before splitting it. Rather than type some examples on how to do this, here's a thread with a few suggestions.
Java - Trim leading or trailing characters from a string?
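As a rough sketch of stripping the leading separators before splitting (the method name splitOnDots is made up here):

```java
import java.util.Arrays;

class DotSplitDemo {
    // Remove the leading run of dots (anchored with ^), then split on runs of dots.
    // Trailing dots need no special handling: split drops trailing empty strings.
    static String[] splitOnDots(String row) {
        return row.replaceFirst("^\\.+", "").split("\\.+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(".....O.O.O".split("\\.+"))); // [, O, O, O]
        System.out.println(Arrays.toString(splitOnDots(".....O.O.O")));  // [O, O, O]
        System.out.println(Arrays.toString(splitOnDots("O.O.O....")));   // [O, O, O]
    }
}
```

Note that a row consisting only of dots becomes the empty string, and splitting an empty string returns an array with one empty element, so that case may need a separate check.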

How to use split() to remove all delimiters from a sentence in Java?

String text = "Good morning. Have a good class. " +
"Have a good visit. Have fun!";
String[] words = text.split("[ \n\t\r.,;:!?(){");
This split call is given in a textbook and is meant to remove all the delimiters in the sentence, as well as whitespace characters, but it clearly does not work: it throws a regex exception. What could we do here to make it work? The requirement is that, after the split, every element of String[] words is just an English word, with no delimiters or whitespace characters attached. Thanks a lot!
You are missing closing ] in your character class:
String[] words = text.split("[ \n\t\r.,;:!?(){]");
By the way, you can simply do this instead (and it is the better option):
String[] words = text.split("\\W+");
to split on any non-word character.
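The following sketch shows both points on the question's text: the unclosed character class fails to compile, and \W+ yields only the words. The class and method names are illustrative assumptions:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class DelimiterSplitDemo {
    // Splits on runs of non-word characters (anything outside [a-zA-Z_0-9]).
    static String[] words(String text) {
        return text.split("\\W+");
    }

    // Reports whether a regex compiles; an unclosed '[' raises PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "Good morning. Have a good class. " +
                "Have a good visit. Have fun!";
        System.out.println(compiles("[ \n\t\r.,;:!?(){"));  // false
        System.out.println(compiles("[ \n\t\r.,;:!?(){]")); // true
        System.out.println(words(text).length);              // 12
    }
}
```

If the text could start with punctuation, the first element would be an empty string, as with any split on a leading delimiter.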
String.split() is NOT for removing characters. It is used to divide the String into smaller substrings.
Example:
String s = "This is a string!";
String[] tokens = s.split(" ");
Split will have used the String " " (one space character) as a delimiter to, well, split the string. As a result, the array tokens will look something like
{"This", "is", "a", "string!"}
If you want to remove characters, try taking a look at String.replaceAll()

Java String delete tokens contains numbers

I have a string like this and I would like to eliminate all the tokens that contain a number:
String[] s="In the 1980s".split(" ");
Is there a way to remove the tokens that contain numbers - in this case 1980s, but also, for example 784th or s787?
Use a \w*\d\w* regex matcher for that. It will match all words with at least one digit in them. Although I generally despise regexes, they are particularly well suited to your problem.
String[] s = input.replaceAll("\\w*\\d\\w* *", "").split(" +");
See Java lib docs for Pattern/Matcher (RegEx) for more reference how to work with regexes in general.
Test code:
http://ideone.com/LrHDsT
Remove the unwanted words first, then split:
String[] s = str.replaceAll("\\w*\\d\\w*", "").trim().split(" +");
Some test code:
String str = "666 In the 1980s 784th s787 foo BAR";
String[] s = str.replaceAll("\\w*\\d\\w*", "").trim().split(" +");
System.out.println(Arrays.toString(s));
Output:
[In, the, foo, BAR]
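You could use a regex as suggested by @vaxquis, or alternatively filter after splitting the string on the delimiter.
You could go through the tokens, check whether each one contains a digit, and remove those that do. Note that NumberUtils.isNumber (Apache Commons Lang) only accepts tokens that are valid numbers as a whole, so it would not catch 1980s or s787; a per-token check such as token.matches(".*\\d.*") does.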
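A minimal sketch of the split-then-filter idea, using only the standard library. The class and method names are hypothetical:

```java
import java.util.Arrays;

class DigitTokenFilterDemo {
    // Splits on whitespace and keeps only tokens that contain no digit at all.
    static String[] withoutDigitTokens(String input) {
        return Arrays.stream(input.trim().split("\\s+"))
                .filter(t -> !t.isEmpty())
                .filter(t -> t.chars().noneMatch(Character::isDigit))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String str = "666 In the 1980s 784th s787 foo BAR";
        System.out.println(Arrays.toString(withoutDigitTokens(str))); // [In, the, foo, BAR]
    }
}
```

Filtering after the split avoids the leftover spaces that removing words from the string first can create.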
split alone doesn't seem to be what you are looking for. Even if you remove the words which contain a digit, as in the case of
"1foo f2oo bar whatever baz2"
you will end up with (note the two leading spaces)
"  bar whatever "
and if you now split on runs of spaces you will end up with ["", "bar", "whatever"].
To solve this problem you may also want to remove the spaces after each word you removed, so that
"1foo f2oo bar whatever baz2"
would become
"bar whatever "
so it can be split correctly (the space at the end is not a problem, since split by default removes trailing empty strings from the result array).
But instead of doing two passes (removing words, then splitting the string) you can achieve the same thing in a single pass. All you need is the opposite approach: instead of focusing on removing the wrong elements, try to find the correct ones.
Correct tokens seem to be words which contain any non-space characters but no digits. You can represent such words with the regex \b[\S&&\D]+\b where:
\b represents word boundaries,
\S any non whitespace character
\D any non digit character
[\S&&\D] is the intersection of non-whitespace and non-digit characters, in other words non-whitespace characters which are also non-digits
Demo:
String input = "1foo f2oo bar whatever baz2";
Pattern p = Pattern.compile("\\b[\\S&&\\D]+\\b");
Matcher m = p.matcher(input);
while (m.find()) {
    System.out.println(m.group());
}
Output:
bar
whatever
BTW, to avoid a possible empty element at the start of the results you can use Scanner, which doesn't return an empty element when a delimiter is found at the start of the string. We can simply set the delimiter to a series of spaces or words which contain a digit, so your code can also look like:
Scanner sc = new Scanner(input);
sc.useDelimiter("(\\s|\\w*\\d\\w*)+");
while (sc.hasNext()) {
    System.out.println(sc.next());
}
sc.close();

Java (Regex) - Get all words in a sentence

I need to split a java string into an array of words. Let's say the string is:
"Hi!! I need to split this string, into a serie's of words?!"
At the moment I have tried String[] strs = str.split("(?!\\w)"); however, it keeps symbols such as ! in the array, and it also keeps strings like "Hi!" in the array. The string I am splitting will always be lowercase. What I would like is for an array to be produced that looks like:
{"hi", "i", "need", "to", "split", "this", "string", "into", "a", "serie's", "of", "words"} - Note the apostrophe is kept.
How could I change my regex to not include the symbols in the array?
Apologies, I would define a word as a sequence of alphanumeric characters only but with the ' character inclusive if it is in the above context such as "it's", not if it is used to a quote a word such as "'its'". Also, in this context "hi," or "hi-person" are not words but "hi" and "person" are. I hope that clarifies the question.
You can remove the !, ? and , symbols and then split into words:
str = str.replaceAll("[!?,]", "");
String[] words = str.split("\\s+");
Result:
Hi, I, need, to, split, this, string, into, a, serie's, of, words
Should work for what you want.
String line = "Hi!! I need to split this string, into a serie's of words?! but not '' or ''' word";
String regex = "([^a-zA-Z']+)'*\\1*";
String[] split = line.split(regex);
System.out.println(Arrays.asList(split));
Gives
[Hi, I, need, to, split, this, string, into, a, serie's, of, words, but, not, or, word]
If you define a word as a sequence of non-whitespace characters (whitespace character as defined by \s), then you can split along space characters:
str.split("\\s+")
Note that ";.';.##$>?>#4", "very,bad,punctuation", and "'goodbye'" are words under the definition above.
Then the other approach is to define a word as a sequence of characters from a set of allowed characters. If you want to allow a-z, A-Z, and ' as part of a word, you can split along everything else:
str.split("[^a-zA-Z']+")
This will still allow "''''''" to be defined as a word, though.
So what you want is to split on anything that is neither a letter [a-zA-Z] nor a '
This regex will do that: "[^a-zA-Z']+"
There will be a problem if the string contains a word that is itself quoted with '
I usually use this page for testing my regexes:
http://www.regexplanet.com/advanced/java/index.html
I would use str.split("[\\s,?!]+"). You can add whatever character you want to split with inside the brackets [].
You could filter out the characters you deem as "non-word" characters:
String[] strs = str.split("[,!? ]+");
myString.replaceAll("[^a-zA-Z'\\s]","").toLowerCase().split("\\s+");
replaceAll("[^a-zA-Z'\\s]","") replaces every character that is not a-z, A-Z, ' or whitespace with nothing (""). toLowerCase then makes the result lower case, and finally the string is split on whitespace. A more readable version:
myString = myString.replaceAll("[^a-zA-Z'\\s]","");
myString = myString.toLowerCase();
String[] strArr = myString.split("\\s+");
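The answers above all split on delimiters. An alternative, shown here only as a sketch, is to match the words directly, which also handles the asker's rule that an apostrophe counts only inside a word ("it's") and not as a quote ("'its'"). The pattern, class and method names are my own assumptions, not taken from any of the answers:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordExtractorDemo {
    // Alphanumeric runs, optionally joined by single apostrophes between them.
    private static final Pattern WORD = Pattern.compile("[a-z0-9]+(?:'[a-z0-9]+)*");

    // Lower-cases the input and collects every match of WORD in order.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(text.toLowerCase(Locale.ROOT));
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hi!! I need to split this string, into a serie's of words?!"));
        // [hi, i, need, to, split, this, string, into, a, serie's, of, words]
        System.out.println(words("'its' hi-person")); // [its, hi, person]
    }
}
```

Matching instead of splitting also means there is never an empty first element to clean up.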
