Breaking string into sentences in Java (after symbol of specified group occurs)

So I wrote the following code:
String text = "This is a string. I want to break it into sentences";
String[] sentences = text.split("\\.");
for (int i = 0; i < sentences.length; i++)
System.out.println(sentences[i]);
The output of this code is:
This is a string
I want to break it into sentences
How do I change this code so that
Each new sentence will be created not only after ".", but also after "!" or "?".
There won't be any spaces in the beginning of sentence.
For example, if we have the following string
String text = "This is a string! Is this a string? I want to break it into sentences";
then the output should be:
This is a string
Is this a string
I want to break it into sentences

Put the delimiters inside a character class and append \\s* to it, so that the split also consumes any whitespace that follows the delimiter.
String[] sentences = text.split("[?!.]\\s*");
Example:
String text = "This is a string! Is this a string? I want to break it into sentences";
String[] parts = text.split("[?!.]\\s*");
for(String i: parts)
{
System.out.println(i);
}
Output:
This is a string
Is this a string
I want to break it into sentences

You can use a character class to split around either one of the dot (.), ? or ! characters. To remove the space at the beginning (and possibly at the end) of the sentence, you can simply trim the resulting string:
String[] sentences = text.split("[.!?]");
for (int i = 0; i < sentences.length; i++) {
System.out.println(sentences[i].trim());
}
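The two answers above can be combined into one small helper. The following is only a sketch: the class and method names are made up for illustration, and the `+` after the character class is an added assumption so that runs such as "?!" or "..." count as a single delimiter.

```java
import java.util.Arrays;

class SentenceSplitDemo {
    // Splits after '.', '!' or '?' and swallows any whitespace that follows,
    // so no sentence starts with a space. A run of delimiters counts as one.
    static String[] splitSentences(String text) {
        return text.trim().split("[.!?]+\\s*");
    }

    public static void main(String[] args) {
        String text = "This is a string! Is this a string? I want to break it into sentences";
        for (String sentence : splitSentences(text)) {
            System.out.println(sentence);
        }
        // Runs of delimiters are treated as a single split point.
        System.out.println(Arrays.toString(splitSentences("Wait... what?! Yes.")));
    }
}
```

Because split drops trailing empty strings, a delimiter at the very end of the text does not produce an extra empty sentence.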

Related

String Splitting wrong output

I wrote this simple program which splits a given input at every Non-Digit Character.
public class Fileread {
public static void main(String[] args) throws IOException {
//Declarations
String[] temp;
String current;
//Execution
BufferedReader br = new BufferedReader(new FileReader("input.txt"));
while ((current = br.readLine()) != null) {
temp = current.split("\\D"); //Splitting at Non Digits
for (int i = 0; i < temp.length; i++) {
System.out.println(temp[i]);
}
}
}
}
This is the input.txt :
hello1world2
world3
end4of5world6
Output :
1
2
3
4
5
6
Why do so many extra spaces appear? I need to print each number on a separate line, without the spaces in between. How can I fix this?
It is splitting at EACH and EVERY non-digit.
To treat strings of non-digits as one delimiter, specify
temp = current.split("\\D+");
instead. Adding the plus-sign makes the pattern match one or more consecutive non-digit characters.
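To see the difference concretely, here is a minimal sketch that runs both patterns on the first input line. The class and method names are illustrative only.

```java
import java.util.Arrays;

class SplitComparison {
    // One split point per non-digit character: adjacent letters leave empty tokens.
    static String[] splitEach(String s) {
        return s.split("\\D");
    }

    // One split point per run of non-digit characters.
    static String[] splitRuns(String s) {
        return s.split("\\D+");
    }

    public static void main(String[] args) {
        String line = "hello1world2";
        System.out.println(splitEach(line).length);               // 11 tokens, 9 of them empty
        System.out.println(Arrays.toString(splitRuns(line)));     // [, 1, 2]
    }
}
```

Note that the second result still starts with one empty string, because the line begins with non-digits.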
//Declarations
String[] temp;
String current;
//Execution
BufferedReader br = new BufferedReader(new FileReader("d://input.txt"));
while ((current = br.readLine()) != null) {
temp = current.split("\\D+"); //Splitting at Non Digits
for (int i = 0; i < temp.length; i++) {
if (!temp[i].equalsIgnoreCase("")) {
System.out.println(temp[i]);
}
}
}
In short, use
.replaceFirst("^\\D+","").split("\\D+")
Splitting the string with \D (a non-digit char matching pattern) means you match a single non-digit char at a time, and break the string at that char. When you need to split on a chunk of characters, you need to match multiple consecutive characters, and in your case, you just need to add a + quantifier after \\D.
However, you will still get an empty element at index 0 if the string begins with one or more non-digits. The workaround is to strip those leading non-digits first, using the same pattern anchored at the start of the string (^\\D+).
The final solution is
List<String> strs = Arrays.asList("hello1world2", "world3", "end4of5world6");
for (String str : strs) {
System.out.println("---- Next string ----");
String[] temp = str.replaceFirst("^\\D+","").split("\\D+");
for (String s: temp) {
System.out.println(s);
}
}
Java's String#split method returns a token for every gap between two delimiters, even when that gap is empty. Consider the following example:
String s = "a,b,c,,,f";
Because the delimiter , appears consecutively with nothing in between, s.split(",") produces the following output:
{"a", "b", "c", "", "", "f"}
You'll notice there are two blank strings in this array; a blank is inserted to represent the token that would've appeared between each pair of consecutive commas. Basically, the string is treated as a,b,c,(blank),(blank),f.
The solution for this is to have consecutive delimiters be treated as a single delimiter. Now, it's important to remember that your argument to split is actually a regular expression. So you can include the + greedy regex quantifier to tell the engine to match one or more consecutive delimiters, and treat them as a single split-point:
s.split(",+")
For the example above, this now yields the following (sans blank strings):
{"a", "b", "c", "f"}
You can apply a similar technique to your regex, using \\D+.
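A different technique, not taken from the answers above, is to match the digit runs directly instead of splitting on what lies between them. That sidesteps empty tokens altogether. The sketch below assumes you only want the runs of digits; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitRuns {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Collects every maximal run of digits in the line, in order of appearance.
    static List<String> extract(String line) {
        List<String> runs = new ArrayList<>();
        Matcher m = DIGITS.matcher(line);
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }

    public static void main(String[] args) {
        for (String line : new String[] {"hello1world2", "world3", "end4of5world6"}) {
            for (String number : extract(line)) {
                System.out.println(number);
            }
        }
    }
}
```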

Java adding modified tokens into string

I currently have a program that individually converts tokens of a string into their piglatin counterparts. However, the program needs to insert them back into the string they were taken with, with ALL of the original characters in it.
Hasta la vista baby. - the Terminator.
Hasta
astaHay
la
alay
vista
istavay
baby
abybay
the
ethay
Terminator
erminatorTay
These are all of the words and their conversions. I tried a method directly placing them back in, however accounting for missing characters and different length made it hard for me to do that. I tried to insert characters based on the length of each token added up, but that ran into complications when there were more than 1 whitespace character. How would I insert these words back into the string so it looks like this:
Astahay alay istavay abybay. - ethay Erminatortay
PigOrig = key.readLine();
String[] PigSplit = PigOrig.split("\\W+");
for(int i = 0; i < PigSplit.length; i++)
{
if(PigSplit[i] != null)
{
FinalStr += Piggy.vowelOut(PigSplit[i]); // VowelOut returns the converted word only, no trailing whitespace or punctuation
lengthtot += PigSplit[i].length();
FinalStr += PigOrig.charAt(lengthtot); // attempt at adding up the words and inserting the original punctuation that was in the string PigOrig
lengthtot ++;
}
}
If I understand your question, it is 'how do I replace each word with its translation in a string?' The simplest way is to use String.replace.
So if you have created a translate method then you could do something like:
String line = key.readLine();
for (String word: line.split("\\W+"))
line = line.replace(word, translate(word));
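The advantage of this approach is that you are replacing the words in the original string, not putting the words back together again. Be aware that String.replace substitutes every occurrence of the text, including occurrences inside longer words, so it can misfire when one word is contained in another.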
Also note that it might be easier to translate just using pattern matching. For example:
private String translate(String word) {
Matcher match = Pattern.compile("(\\w*?)([aeiou]\\w*)").matcher(word); // reluctant \\w*? stops at the first vowel
if (match.matches())
return match.group(2) + match.group(1) + "ay";
else
return word;
}
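To keep every original space and punctuation mark, one option is to let a Matcher walk over the words and substitute each one where it stands. The following is a sketch under a few assumptions: words are runs of ASCII letters, the names PigLatinInPlace, translate and translateLine are illustrative, and capitalization is not adjusted (so the result matches the word list in the question, not the recapitalized target sentence).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PigLatinInPlace {
    // Assumption: a "word" is a run of ASCII letters.
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");
    // Leading consonants (possibly none), then the rest starting at the first vowel.
    private static final Pattern PARTS = Pattern.compile("([^aeiouAEIOU]*)([aeiouAEIOU].*)");

    static String translate(String word) {
        Matcher m = PARTS.matcher(word);
        // Words without a vowel are returned unchanged.
        return m.matches() ? m.group(2) + m.group(1) + "ay" : word;
    }

    static String translateLine(String line) {
        Matcher m = WORD.matcher(line);
        StringBuffer out = new StringBuffer(line.length() + 16);
        while (m.find()) {
            // appendReplacement copies the untouched text before the match,
            // then the translated word; quoteReplacement keeps it literal.
            m.appendReplacement(out, Matcher.quoteReplacement(translate(m.group())));
        }
        m.appendTail(out); // copy whatever follows the last word
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(translateLine("Hasta la vista baby. - the Terminator."));
        // astaHay alay istavay abybay. - ethay erminatorTay.
    }
}
```

Since each word is replaced at its own position, repeated words and words contained in other words are handled independently.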
If I understand correctly that you want to translate all the words in the input, my taste would be for building the new string from scratch:
String pigOrig = key.readLine();
String[] pigSplit = pigOrig.split("\\W+");
StringBuilder buf = new StringBuilder(pigOrig.length());
buf.append(translateWord(pigSplit[0]));
for(int i = 1; i < pigSplit.length; i++) {
buf.append(' ');
buf.append(translateWord(pigSplit[i]));
}
String result = buf.toString();

How to find the last word in a String in Java?

How can I find the last word of a string? I am not trying to find a fixed word, in other words, I would not know what the last word is, however I want to retrieve it.
Here is my code:
myString = myString.trim();
String[] wordList = myString.split("\\s+");
System.out.println(wordList[wordList.length-1]);
If you treat words as being delimited by whitespace and punctuation (commas, spaces, new lines, brackets, and so on), so that punctuation may follow the last word, and you want words to include non-ASCII characters, then the following finds the last word in a string without its trailing punctuation:
static String lastWord(String sentence) {
Pattern p = Pattern.compile("([\\p{Alpha}]+)(?=\\p{Punct}*$)", Pattern.UNICODE_CHARACTER_CLASS);
Matcher m = p.matcher(sentence);
if (m.find()) {
return m.group();
}
return ""; // or null
}
The regular expression uses look-ahead to find zero-or-more punctuations at the end of the string and matches the alphabetical word before it.
If you want to also allow numbers in the word, change {Alpha} to {Alnum}.
Read the String API for various methods you might use.
For example you could:
Use the lastIndexOf(...) method to find where the start of the word is
Then use the substring(...) method to get the word
Use the StringTokenizer for this
StringTokenizer st = new StringTokenizer("this is a test"); // take any String
int count = st.countTokens(); // counts the tokens in that String
String[] myStringArray = new String[count];
for (int i = 0; i < count; i++) {
myStringArray[i] = st.nextToken(); // insert the words/tokens into the string array
}
System.out.println("Last Word is--" + myStringArray[myStringArray.length - 1]); // get the last word of the given String
System.out.println("enter the string");
Scanner input = new Scanner(System.in);
String instr = input.nextLine();
instr = instr.trim();
int index = instr.lastIndexOf(" ");
int l = instr.length();
System.out.println(l);
String lastStr = instr.substring(index+1,l);
System.out.println("last string .."+lastStr);

Replace word with special characters from string in Java

I am writing a method which should replace all words which match ones from the list with '****' characters. So far I have code which works, but all special characters are ignored.
I have tried with "\\W" in my expression but looks like I didn't use it well so I could use some help.
Here's code I have so far:
for(int i = 0; i < badWords.size(); i++) {
if (StringUtils.containsIgnoreCase(stringToCheck, badWords.get(i))) {
stringToCheck = stringToCheck.replaceAll("(?i)\\b" + badWords.get(i) + "\\b", "****");
}
}
E.g. I have list of words ['bad', '#$$'].
If I have a string: "This is bad string with #$$" I am expecting this method to return "This is **** string with ****"
Note that the method should be case-insensitive, e.g. TesT and test should be handled the same.
I'm not sure why you use StringUtils; you can directly replace the words that match the bad words. This code works for me:
public static void main(String[] args) {
ArrayList<String> badWords = new ArrayList<String>();
badWords.add("test");
badWords.add("BadTest");
badWords.add("\\$\\$");
String test = "This is a TeSt and a $$ with Badtest.";
for(int i = 0; i < badWords.size(); i++) {
test = test.replaceAll("(?i)" + badWords.get(i), "****");
}
test = test.replaceAll("\\w*\\*{4}", "****");
System.out.println(test);
}
Output:
This is a **** and a **** with ****.
The problem is that these special characters e.g. $ are regex control characters and not literal characters. You'll need to escape any occurrence of the following characters in the bad word using two backslashes:
{}()\[].+*?^$|
My guess is that your list of bad words contains special characters that have particular meanings when interpreted in a regular expression (which is what the replaceAll method does). $, for example, typically matches the end of the string/line. So I'd recommend a combination of things:
Don't use containsIgnoreCase to identify whether a replacement needs to be done. Just let the replaceAll run each time - if there is no match against the bad word list, nothing will be done to the string.
The characters like $ that have special meanings in regular expressions should be escaped when they are added into the bad word list. For example, badwords.add("#\\$\\$");
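Rather than escaping each entry by hand, the standard Pattern.quote method can make a whole word literal. The sketch below is one possible way to put that together. The class and method names are hypothetical, and the lookarounds are an added assumption: they replace \b, which cannot match next to non-word characters such as '#'.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class BadWordFilter {
    static String censor(String text, List<String> badWords) {
        String result = text;
        for (String word : badWords) {
            // Pattern.quote turns every character of the word into a literal.
            // (?<!\w) and (?!\w) require that no word character touches the
            // match on either side, so "bad" is not replaced inside "badge".
            String regex = "(?i)(?<!\\w)" + Pattern.quote(word) + "(?!\\w)";
            result = result.replaceAll(regex, "****");
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> badWords = Arrays.asList("bad", "#$$");
        System.out.println(censor("This is bad string with #$$", badWords));
        // This is **** string with ****
    }
}
```

With this approach the bad word list can hold the words exactly as written, without any backslashes.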
Try something like this:
String stringToCheck = "This is b!d string with #$$";
List<String> badWords = asList("b!d","#$$");
for(int i = 0; i < badWords.size(); i++) {
if (StringUtils.containsIgnoreCase(stringToCheck,badWords.get(i))) {
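stringToCheck = stringToCheck.replaceAll("(?i)" + Pattern.quote(badWords.get(i)), "****"); // quote the word so it is matched literally; a [...]+ class would match any run of its characters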
}
}
System.out.println(stringToCheck);
Another solution: bad words matched with word boundaries (and case insensitive).
Pattern badWords = Pattern.compile("\\b(a|b|ĉĉĉ|dddd)\\b",
Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE);
String text = "adfsa a dfs bb addfdsaf ĉĉĉ adsfs dddd asdfaf a";
Matcher m = badWords.matcher(text);
StringBuffer sb = new StringBuffer(text.length());
while (m.find()) {
m.appendReplacement(sb, stars(m.group(1)));
}
m.appendTail(sb);
String cleanText = sb.toString();
System.out.println(text);
System.out.println(cleanText);
}
private static String stars(String s) {
return s.replaceAll("(?su).", "*");
/*
int cpLength = s.codePointCount(0, s.length());
final String stars = "******************************";
return cpLength >= stars.length() ? stars : stars.substring(0, cpLength);
*/
}
The commented-out alternative also produces the correct number of stars: one star per Unicode code point, even when that code point is encoded as a surrogate pair (two UTF-16 chars).

Add brackets to sequence of chars in string

I need to put a sequence of characters in a String in brackets in such way that it would choose the longest substring as the optimal to put in brackets. To make it clear because it is too complicated to explain with words:
If my input is:
'these are some chars *£&$'
'these are some chars *£&$^%(((£'
the output in both inputs respectively should be:
'these are some chars (*£&$)'
'these are some chars (*£&$^%)(((£'
so I would like to put in brackets the sequence *£&$^% IF it exists otherwise put in brackets just *£&$
I hope it makes sense!
In the general case, this method works. It surrounds the longest substring of a keyword that occurs in a given String:
public String bracketize() {
String chars = ...; // you can put whatever input (such as 'these are some chars *£&$')
String keyword = ...; // you can put whatever keyword (such as *£&$^%)
String longest = "";
for(int i=0;i<keyword.length();i++) { // include the last start index so the final single character is checked too
for(int j=keyword.length(); j>i; j--) {
String tempString = keyword.substring(i,j);
if(chars.indexOf(tempString) != -1 && tempString.length()>longest.length()) {
longest = tempString;
}
}
}
if(longest.length() == 0)
return chars; // no possible substring of keyword exists in chars, so just return chars
String bracketized = chars.substring(0,chars.indexOf(longest))+"("+longest+")"+chars.substring(chars.indexOf(longest)+longest.length());
return bracketized;
}
The nested for loops check every possible substring of keyword and select the longest one that is contained in the bigger String, chars. For example, if the keyword is Dog, it will check the substrings "Dog", "Do", "D", "og", "o", and "g". It stores this longest possible substring in longest (which is initialized to the empty String). If the length of longest is still 0 after checking every substring, then no such substring of keyword can be found in chars, so the original String, chars, is returned. Otherwise, a new string is returned which is chars with the substring longest surrounded by brackets (parentheses).
Hope this helps, let me know if it works.
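The same search can be written as a method that takes both strings as parameters, which makes it easy to try against the inputs from the question. This is a sketch only: the class name and the parameter list are assumptions, not part of the answer above.

```java
class Bracketizer {
    // Wraps the longest substring of keyword that occurs in chars in parentheses.
    // Only the first occurrence in chars is wrapped.
    static String bracketize(String chars, String keyword) {
        String longest = "";
        for (int i = 0; i < keyword.length(); i++) {
            // Try the longest candidate starting at i first, then shorter ones.
            for (int j = keyword.length(); j > i; j--) {
                String candidate = keyword.substring(i, j);
                if (candidate.length() <= longest.length()) {
                    break; // nothing starting at i can beat the current best
                }
                if (chars.contains(candidate)) {
                    longest = candidate;
                    break; // shorter candidates from this start are not needed
                }
            }
        }
        if (longest.isEmpty()) {
            return chars; // no part of keyword occurs in chars
        }
        int at = chars.indexOf(longest);
        return chars.substring(0, at) + "(" + longest + ")"
                + chars.substring(at + longest.length());
    }
}
```

Called with the keyword *£&$^% from the question, the first input yields 'these are some chars (*£&$)' and the second yields 'these are some chars (*£&$^%)(((£'.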
Try something like this (assuming target string only occurs once).
String input = "these are some chars *£&$";
String output = "";
String[] split;
if(input.indexOf("*£&$^%")!=(-1)){
split = input.split(Pattern.quote("*£&$^%")); // split takes a regex, so quote the target to match it literally
output = split[0]+"(*£&$^%)";
if(split.length>1){
output = output+split[1];
}
}else if(input.indexOf("*£&$")!=(-1)){
split = input.split(Pattern.quote("*£&$")); // split takes a regex, so quote the target to match it literally
output = split[0]+"(*£&$)";
if(split.length>1){
output = output+split[1];
}
}else{
System.out.println("does not contain either string");
}
