Split String To Get Word Separators

I want to find and store all the separators between the words in a sentence, which could be spaces or newlines.
Say I have the following String:
String text = "hello, darkness my old friend.\nI've come to you again\r\nasd\n 123123";
String[] separators = text.split("\\S+");
Output: [, , , , ,
, , , , ,
,
]
So when I split on anything but whitespace, it returns an empty separator first and the rest are good. Why the empty string at the start, though?
Also, I would like to treat periods and commas as separators, but I don't know how to do that so that ".\n" counts as one separator.
Wanted Output for the above String:
separators = {", ", " ", " ", " ", ".\n", " ", " ", " ", " ", "\r\n", "\n "}
or
separators = {",", " ", " ", " ", " ", ".", "\n", " ", " ", " ", " ", "\r\n", "\n "}

Try this:
String[] separators = text.split("[\\w']+");
This defines non-separators as "word chars" and/or apostrophes.
This leaves a leading empty string in the result array because the text starts with a word. split() cannot suppress it, but you can remove the leading word first:
String[] separators = text.replaceAll("^[\\w']+", "").split("[\\w']+");
You may consider adding the hyphen to the character class if you treat hyphenated words as one word, i.e.
String[] separators = text.split("[\\w'-]+");
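As a minimal sketch of the approach above, the following wraps the "strip the leading word, then split" idea in a small class. The class and method names (SeparatorSplitDemo, separators) are made up for illustration; only the two regexes come from the answer.

```java
public class SeparatorSplitDemo {
    // Returns the runs of characters between words, where a "word" is
    // one or more word characters or apostrophes.
    static String[] separators(String text) {
        // Strip the leading word first so that split() yields no leading empty string.
        return text.replaceAll("^[\\w']+", "").split("[\\w']+");
    }

    public static void main(String[] args) {
        String text = "hello, darkness my old friend.\nI've come to you again\r\nasd\n 123123";
        for (String sep : separators(text)) {
            // Make line breaks visible when printing each separator.
            System.out.println("[" + sep.replace("\r", "\\r").replace("\n", "\\n") + "]");
        }
    }
}
```

For the sample text this should yield the eleven separators from the first wanted output, with ".\n" kept as a single separator.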

I think this can also work:
String[] separators = text.split("\\w+");
Note that \w does not match the apostrophe, so "I've" is split in two and "'" shows up as a separator.

I think it's easier to use the find() method to obtain the desired result:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String text = "hello, darkness my old friend.\nI've come to you again\r\nasd\n 123123";
String pat = "[\\s,.]+"; // add every separator character you need to the character class
Matcher m = Pattern.compile(pat).matcher(text);
List<String> list = new ArrayList<>();
while (m.find()) {
    list.add(m.group());
}
// the result is already stored in "list" but if you
// absolutely want to store the result in an array, just do:
String[] result = list.toArray(new String[0]);
This way you avoid the empty string problem at the beginning.

Related

Digits are getting deleted when splitting a string

I have a string from which I need to remove all of the listed punctuation characters and spaces. My code looks as follows:
String s = "s[film] fever(normal) curse;";
String[] spart = s.split("[,/?:;\\[\\]\"{}()\\-_+*=|<>!`~@#$%^&\\s+]");
System.out.println("spart[0]: " + spart[0]);
System.out.println("spart[1]: " + spart[1]);
System.out.println("spart[2]: " + spart[2]);
System.out.println("spart[3]: " + spart[3]);
But, I am getting some elements which are blank. The output is:
spart[0]: s
spart[1]: film
spart[2]:
spart[3]: normal

- is a special character in regex character classes (in Java just as in PHP). For instance, [a-z] matches all characters from a to z inclusive. Note that you've got )-_ in your regex.

- defines a range inside a character class in the regular expression passed to String.split, so it needs to be escaped:
String[] part = line.toLowerCase().split("[,/?:;\"{}()\\-_+*=|<>!`~@#$%^&]");

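Move the + outside the character class so that a run of consecutive delimiters (such as "] ") counts as one separator:
String[] spart = s.split("[,/?:;\\[\\]\"{}()\\-_+*=|<>!`~@#$%^&\\s]+");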
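To see why the position of the + matters, here is a small sketch comparing the two forms. It uses a shortened delimiter class containing only the characters that occur in the sample string, which is an assumption made to keep the example readable; the class name DelimiterRunDemo is hypothetical.

```java
import java.util.Arrays;

public class DelimiterRunDemo {
    // Shortened delimiter class for illustration: only the delimiters
    // that actually occur in the sample string.
    static final String SINGLE = "[\\[\\]();\\s]";  // each delimiter splits on its own
    static final String RUN = "[\\[\\]();\\s]+";    // a run of delimiters splits once

    public static void main(String[] args) {
        String s = "s[film] fever(normal) curse;";
        // "] " and ") " are two delimiters in a row, so SINGLE leaves empty strings.
        System.out.println(Arrays.toString(s.split(SINGLE)));
        // RUN treats each pair as one separator, so no empty strings remain.
        System.out.println(Arrays.toString(s.split(RUN)));
    }
}
```

With the quantifier outside the class, the sample should split into s, film, fever, normal and curse. Trailing empty strings are dropped by split() in both cases.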

Java Split a String with Regex expression

I don't know much about regex. So can you please tell me how to split the below string to get the desired output?
String ruleString= "/Rule/Account/Attribute[N='accountCategory' and V>=1]"+
" and /Rule/Account/Attribute[N='accountType' and V>=34]"+
" and /Rule/Account/Attribute[N='acctSegId' and V>=341]"+
" and /Rule/Account/Attribute[N='is1sa' and V>=1]"+
" and /Rule/Account/Attribute[N='isActivated' and V>=0]"+
" and /Rule/Account/Attribute[N='mogId' and V>=3]"+
" and /Rule/Account/Attribute[N='regulatoryId' and V>=4]"+
" and /Rule/Account/Attribute[N='vipCode' and V>=5]"+
" and /Rule/Subscriber/Attribute[N='agentId' and V='346']";
Desired output:
a[0] = /Rule/Account/Attribute[N='accountCategory' and V>=1]
a[1] = /Rule/Account/Attribute[N='accountType' and V>=34]
.
.
.
a[n] = /Rule/Subscriber/Attribute[N='agentId' and V='346']
We cannot simply split on " and ", because it occurs in two roles in the string: between the rules (where we want to split) and inside the brackets (where we don't).
I want to split it with something like this:
String[] splitArray= ruleString.split("] and ");
But this won't work, as it will remove the end bracket ] from each of the splits.

Split your input according to the below regex.
String[] splitArray= ruleString.split("\\s+and\\s+(?=/)");
This splits the input on each "and" that appears immediately before a forward slash.
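A short sketch of the lookahead split, using a shortened two-rule string rather than the full one from the question. The class and method names are made up for illustration.

```java
public class LookaheadSplitDemo {
    // Splits on "and" (with surrounding whitespace) only when a '/' follows.
    // The lookahead (?=/) checks for the slash without consuming it.
    static String[] splitRules(String ruleString) {
        return ruleString.split("\\s+and\\s+(?=/)");
    }

    public static void main(String[] args) {
        String rules = "/Rule/Account/Attribute[N='accountType' and V>=34]"
                + " and /Rule/Subscriber/Attribute[N='agentId' and V='346']";
        for (String rule : splitRules(rules)) {
            System.out.println(rule);
        }
    }
}
```

The " and " inside the brackets is followed by V, not by a slash, so it is left alone and each rule keeps its closing bracket.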

You can also use a look-behind here:
String[] splitArray= ruleString.split("(?<=\\])\\s*and\\s*");
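The look-behind variant can be sketched the same way, again on a shortened rule string and with hypothetical class and method names. It anchors on the closing bracket before the "and" instead of the slash after it.

```java
public class LookbehindSplitDemo {
    // Splits on "and" only when it is preceded by ']' (ignoring whitespace
    // after the bracket). The look-behind (?<=\\]) does not consume the bracket.
    static String[] splitRules(String ruleString) {
        return ruleString.split("(?<=\\])\\s*and\\s*");
    }

    public static void main(String[] args) {
        String rules = "/Rule/Account/Attribute[N='is1sa' and V>=1]"
                + " and /Rule/Account/Attribute[N='mogId' and V>=3]";
        for (String rule : splitRules(rules)) {
            System.out.println(rule);
        }
    }
}
```

Because the bracket is matched by a zero-width assertion, it stays at the end of each piece, which avoids the problem with splitting on "] and ".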

More efficient way to make a string in a string of just words

I am making an application where I will be fetching tweets and storing them in a database. I will have a column for the complete text of the tweet and another where only the words of the tweet will remain (I need the words to calculate which words were most used later).
Currently I do it with six different replaceAll() calls, some of which may run more than once. For example, I have a for loop that removes every hashtag using replaceAll().
The problem is that I will be editing thousands of tweets that I fetch every few minutes, and I don't think the way I am doing it is efficient enough.
My requirements, in this order (also written in the comments below):
Delete all usernames mentioned
Delete all RT (retweets flags)
Delete all hashtags mentioned
Replace all break lines with spaces
Replace all double spaces with single spaces
Delete all special characters except spaces
Here is a Short and Compilable Example:
public class StringTest {
public static void main(String args[]) {
String text = "RT @AshStewart09: Vote for Lady Gaga for \"Best Fans\""
+ " at iHeart Awards\n"
+ "\n"
+ "RT!!\n"
+ "\n"
+ "My vote for #FanArmy goes to #LittleMonsters #iHeartAwards"
+ " htt…";
String[] hashtags = {"#FanArmy", "#LittleMonsters", "#iHeartAwards"};
System.out.println("Before: " + text + "\n");
// Delete all usernames mentioned (may run multiple times)
text = text.replaceAll("@AshStewart09", "");
System.out.println("First Phase: " + text + "\n");
// Delete all RT (retweets flags)
text = text.replaceAll("RT", "");
System.out.println("Second Phase: " + text + "\n");
// Delete all hashtags mentioned
for (String hashtag : hashtags) {
text = text.replaceAll(hashtag, "");
}
System.out.println("Third Phase: " + text + "\n");
// Replace all break lines with spaces
text = text.replaceAll("\n", " ");
System.out.println("Fourth Phase: " + text + "\n");
// Replace all double spaces with single spaces
text = text.replaceAll(" +", " ");
System.out.println("Fifth Phase: " + text + "\n");
// Delete all special characters except spaces
text = text.replaceAll("[^a-zA-Z0-9 ]+", "").trim();
System.out.println("Finally: " + text);
}
}

Relying on replaceAll is probably the biggest performance killer as it compiles the regex again and again. The use of regexes for everything is probably the second most significant problem.
Assuming all usernames start with @, I'd replace
// Delete all usernames mentioned (may run multiple times)
text = text.replaceAll("@AshStewart09", "");
with a loop that copies everything until it finds an @, then checks whether the following characters match any of the listed usernames and, if so, skips them. For this lookup you could use a trie. A simpler method would be a replaceAll-like loop for the regex @\w+ together with a HashMap lookup.
// Delete all RT (retweets flags)
text = text.replaceAll("RT", "");
Here,
private static final Pattern RT_PATTERN = Pattern.compile("RT");
is a sure win. All the following parts could be handled similarly. Instead of
// Delete all special characters except spaces
text = text.replaceAll("[^a-zA-Z0-9 ]+", "").trim();
you could use Guava's CharMatcher. The method removeFrom does exactly what you did, but collapseFrom or trimAndCollapseFrom might be better.
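A minimal sketch of the precompiled-pattern idea for the RT step, using only java.util.regex. The class and method names are hypothetical; RT_PATTERN is the constant suggested above.

```java
import java.util.regex.Pattern;

public class PrecompiledPatternDemo {
    // Compiled once when the class is loaded, then reused for every tweet,
    // instead of being recompiled on each String.replaceAll call.
    private static final Pattern RT_PATTERN = Pattern.compile("RT");

    static String removeRetweetFlags(String text) {
        return RT_PATTERN.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(removeRetweetFlags("RT hello RT!!"));
    }
}
```

Like the original replaceAll("RT", ""), this removes every occurrence of the two letters, including inside longer words. Whether that is acceptable depends on the data.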

According to the now-closed question, it all boils down to
tweet = tweet.replaceAll("@\\w+|#\\w+|\\bRT\\b", "")
.replaceAll("\n", " ")
.replaceAll("[^\\p{L}\\p{N} ]+", " ")
.replaceAll(" +", " ")
.trim();
The second line seems to be redundant, as the third one removes \n too. Changing the first line's replacement to " " doesn't change the outcome and allows the replacements to be merged:
tweet = tweet.replaceAll("@\\w*|#\\w*|\\bRT\\b|[^@#\\p{L}\\p{N} ]+", " ")
.replaceAll(" +", " ")
.trim();
I've changed the usernames and hashtags part to also consume a lone @ or #, so that it doesn't need to be consumed by the special-characters part. This is necessary for correct processing of strings like !@AshStewart09.
For maximum performance, you need a precompiled pattern. I'd also suggest again using Guava's CharMatcher for the second part. Guava is big (about 2 MB, I guess), but you will surely find more useful things in it. So in the end you can get
private static final Pattern PATTERN =
Pattern.compile("@\\w*|#\\w*|\\bRT\\b|[^@#\\p{L}\\p{N} ]+");
private static final CharMatcher CHAR_MATCHER = CharMatcher.is(' ');
tweet = PATTERN.matcher(tweet).replaceAll(" ");
tweet = CHAR_MATCHER.trimAndCollapseFrom(tweet, ' ');
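If you would rather not add Guava, the same pipeline can be sketched with two precompiled patterns from the standard library. This swaps CharMatcher for a second Pattern plus trim(), assumes usernames start with @ and hashtags with #, and uses made-up names (TweetCleaner, clean, NOISE, SPACES).

```java
import java.util.regex.Pattern;

public class TweetCleaner {
    // Usernames, hashtags, the RT flag, and any run of characters that is not
    // a letter, a digit, a space, '@' or '#'. Assumes usernames start with '@'.
    private static final Pattern NOISE =
            Pattern.compile("@\\w*|#\\w*|\\bRT\\b|[^@#\\p{L}\\p{N} ]+");
    // Stand-in for Guava's trimAndCollapseFrom: collapse runs of spaces.
    private static final Pattern SPACES = Pattern.compile(" +");

    static String clean(String tweet) {
        String spaced = NOISE.matcher(tweet).replaceAll(" ");
        return SPACES.matcher(spaced).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("RT @user: Vote now!\n#tag done"));
    }
}
```

Both patterns are compiled once, so each tweet costs two matcher passes and no regex compilation.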

You can merge everything that is replaced with nothing into one replaceAll call, and everything that is replaced with a space into another, like so (also using a regex to find the hashtags and usernames, as this seems easier):
text = text.replaceAll("@\\w+|#\\w+|RT", "");
text = text.replaceAll("\n| +", " ");
text = text.replaceAll("[^a-zA-Z0-9 ]+", "").trim();

How to catch the regex in java

I have a string like the one below:
"This is the code: cd001, cd002, cd003 "
(I can already capture cd001, cd002, cd003 from it.)
But the regex must ignore cd001, cd002, cd003 in the string below:
"This is the code: cd001,cd002, cd003,xxxx "
I have this regex: [^|\\s|>]*([a-z]{2}[0-9]+\\.?)\\b
(A code begins at the start of the string or after a space, consists of two lowercase letters followed by digits, and is followed by one of: . , # or a space.)

// parse inputString into String[] of codes
// if there are no codes in the string, codes[0] is ""
String[] codes =
// delete beginning of the line till ":" inclusive
inputString.replaceFirst("^.*: ", "").
// delete two codes that are separated by "," and
// followed by 0 or 1 "," and 1 " "
replaceAll("[a-z0-9]+,[a-z0-9]+,? ", "").
// delete trailing spaces
replaceFirst(" +$", "").
// split codes
split(", ");
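Wrapped in a method, the chain above can be tried against both sample strings from the question. This is only a sketch: the class and method names are invented, and the regexes are copied unchanged from the answer.

```java
import java.util.Arrays;

public class CodeListParser {
    // Applies the replace/split chain from the answer to one input line.
    static String[] parseCodes(String inputString) {
        return inputString
                .replaceFirst("^.*: ", "")                  // drop everything up to ": "
                .replaceAll("[a-z0-9]+,[a-z0-9]+,? ", "")   // drop pairs joined by a bare ","
                .replaceFirst(" +$", "")                    // drop trailing spaces
                .split(", ");                               // split the remaining codes
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseCodes("This is the code: cd001, cd002, cd003 ")));
        System.out.println(Arrays.toString(parseCodes("This is the code: cd001,cd002, cd003,xxxx ")));
    }
}
```

For the first string this should give the three codes. For the second, every code is removed and the result is a one-element array holding an empty string, as the comment in the answer says.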

Using a JTextField to get a regular expression from a user. How do I make it see \t as a tab instead of a \ followed by a t

JTextField reSource; //contains the regex expression the user wants to search for
String re=reSource.getText();
Pattern p=Pattern.compile(re,myflags); //myflags defined elsewhere in code
Matcher m=p.matcher(src); //src is the text to search and comes from a JTextArea
while (m.find()==true) {
If the user enters \t it finds \t not tab.
If the user enters \\\t it finds \\\t not tab.
If the user enters [\t] or [\\\t] it finds t not tab.
I want it such that if the user enters \t it finds tab. Of course it also needs to work with \n, \r etc...
If re="\t"; is used instead of re=reSource.getText(); with \t in the JTextField then it finds tabs. How do I get it to work with the contents of the JTextField?

Example:
String src = "This\tis\ta\ttest";
System.out.println("src=\"" + src + '"'); // --> prints "This is a test"
String re="\\t";
System.out.println("re=\"" + re + '"'); // --> prints "\t" - as when you use reSource.getText();
Pattern p = Pattern.compile(re);
Matcher m = p.matcher(src);
while (m.find()) {
System.out.println('"' + m.group() + '"');
}
Output:
src="This is a test"
re="\t"
" "
" "
" "
Try this:
re=re.replace("\\t", "\t");
OR
re=re.replace("\\t", "\\\\t");
I think the problem lies in understanding that when you type:
String str = "\t";
it is actually the same as a string containing a single tab character:
String str = " ";
But if you type:
String str = "\\t";
then System.out.print(str) prints \t (a backslash followed by a t).

Matching \t should work, however, your flags might have a problem.
Here's what works for me:
String src = "A\tBC\tD";
Pattern p=Pattern.compile("\\w\\t\\w"); //simulates the user entering \w\t\w
Matcher m=p.matcher(src);
while (m.find())
{
System.out.println("Match: \"" + m.group(0) + "\"");
}
Output is:
Match: "A B"
Match: "C D"

My experience is that Java Swing JTextField and JTable GUI controls escape user-entered backslashes by prefixing a backslash.
User types two-character sequence "backslash t", control's getText() method returns a String containing the three-character sequence "backslash backslash t". The SO formatter does its own thing with backslashes in text so here it is as code:
Single backslash: input is 2 char sequence \t and return value is 3 char \\t
For three-character input sequence "backsl backsl t", getText() returns the five-character sequence "backsl backsl backsl backsl t". As code:
Double backslash: input is 3 char sequence \\t and return value is 5 char \\\\t
This basically prevents the backslash from modifying the t to yield a character sequence that becomes a tab when interpreted by something like System.out.println.
Conveniently, and surprisingly to me, the regex processor accepts it either way. A two-character sequence "\t" matches a tab character, as does a three-character sequence "\\t". Please see the demo code below. The System.out calls show which sequences and patterns contain tabs; on JDK 1.7 both matches yield true.
package my.text;
/**
* Demonstrate use of tab character in regexes
*/
public class RegexForSo {
public static void main(String [] argv) {
final String sequenceTab="x\ty\tz";
final String patternBsTab = "x\t.*";
final String patternBsBsTab = "x\\t.*";
System.out.println("sequence is >" + sequenceTab + "<");
System.out.println("pattern BsTab is >" + patternBsTab + "<");
System.out.println("pattern BsBsTab is >" + patternBsBsTab + "<");
System.out.println("matched BsTab = " + sequenceTab.matches(patternBsTab));
System.out.println("matched BsBsTab = " + sequenceTab.matches(patternBsBsTab));
}
}
Output on my JDK1.7 system is below, tabs in output might not survive SO formatter :)
sequence is >x y z<
pattern BsTab is >x .*<
pattern BsBsTab is >x\t.*<
matched BsTab = true
matched BsBsTab = true
HTH
