Char Sequence vs Regex - java

txt.replaceAll("a","b");
Is "a" a Char Sequence or a Regex (or more specific Literal Search)?
And is my code correct?
I’m coding the Exercise "Normalize Text".
Task:
Only one space between words.
Only one space after comma (,), dot (.) and colon (:). First
character of word after dot is in Uppercase and other words are in
lower case.
Please correct me if I am wrong, including my English.
public class NormalizeText {
static String spacesBetweenWords(String txt){
txt = txt.replaceAll(" +", " ");
return txt;
}
/**
* - There are no spaces between comma or dot and word in front of it.
* - Only one space after comma (,), dot (.) and colon (:).
*/
static String spacesCommaDotColon(String txt) {
txt = txt.replaceAll(" +\\.", ".");
txt = txt.replaceAll(" +,", ",");
txt = txt.replaceAll(" +[:]", ":");
txt = txt.replaceAll("[.]( *)", ". ");
txt = txt.replaceAll("[,]( *)", ", ");
txt = txt.replaceAll("[:]( *)", ": ");
//txt.replaceAll("a","b");
return txt;
}
public static void main(String[] args) {
// TODO code application logic here
String txt = "\" \\\" i want to f\\\"ly\" . B.ut : I , Cant\\";
System.out.println(txt);
txt = spacesBetweenWords(txt);
System.out.println(spacesBetweenWords(txt));
System.out.println(spacesCommaDotColon(txt));
}
}
My teacher said my code is not using regex, but rather a Char Sequence.
I am very confused.

For starters, since you are learning regex, an amazing site for learning how to use it is this.
Now, the first argument of replaceAll is always treated as a regex. Just the letter "a" is a regex that matches only the character "a" in the text. So what your teacher probably meant is that you should use a more complex regex (something that matches multiple cases at once).
As this is an exercise, I prefer not to give a solution, so that you can try to figure it out yourself. The tip is: try to use replaceAll only once, or as close to once as you can get.
As for whether your code is correct: it looks good, but you are missing the condition about uppercase after dots.
Also, when I say to try to use only one replaceAll, the uppercase part doesn't count, as it requires another approach.
I hope this helps and that you find a solution to the exercise. Again, sorry for not providing an answer, but in my opinion you need to try to figure it out on your own. You are already on a good road!
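To make the regex versus literal distinction concrete, here is a small sketch (the class and method names are made up for illustration). It contrasts String.replaceAll, whose first argument is compiled as a regex, with String.replace, whose first argument is a CharSequence matched literally. Pattern.quote is shown as one way to get literal behaviour out of replaceAll.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original exercise.
class RegexVsLiteral {
    // replaceAll: the search argument is compiled as a regular expression
    static String withReplaceAll(String text, String search, String replacement) {
        return text.replaceAll(search, replacement);
    }

    // replace: the search argument is a CharSequence and is matched literally
    static String withReplace(String text, String search, String replacement) {
        return text.replace(search, replacement);
    }

    // replaceAll again, but with the search text quoted so it is taken literally
    static String withQuotedReplaceAll(String text, String search, String replacement) {
        return text.replaceAll(Pattern.quote(search), replacement);
    }

    public static void main(String[] args) {
        System.out.println(withReplaceAll("banana", "a", "b"));     // bbnbnb
        System.out.println(withReplaceAll("a.b", ".", "x"));        // xxx  (. matches any character)
        System.out.println(withReplace("a.b", ".", "x"));           // axb  (literal dot only)
        System.out.println(withQuotedReplaceAll("a.b", ".", "x"));  // axb
    }
}
```

For a plain letter such as "a" the two methods give the same result, which is probably why the difference is easy to miss.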

With regards to replaceAll, the docs say:
Replaces each substring of this string that matches the given regular expression with the given replacement.
An invocation of this method of the form str.replaceAll(regex, repl) yields exactly the same result as the expression
       
Pattern.compile(regex).matcher(str).replaceAll(repl)
Therefore, replaceAll will always use regular expressions for its first parameter. With regards to simplifying your code,
static String spacesCommaDotColon(String txt) {
txt = txt.replaceAll(" +\\.", ".");
txt = txt.replaceAll(" +,", ",");
txt = txt.replaceAll(" +[:]", ":");
txt = txt.replaceAll("[.]( *)", ". ");
txt = txt.replaceAll("[,]( *)", ", ");
txt = txt.replaceAll("[:]( *)", ": ");
//txt.replaceAll("a","b");
return txt;
}
can be simplified to:
static String spacesCommaDotColon(String txt) {
return txt.replaceAll(" *([:,.]) *","$1 ");
}
and
static String spacesBetweenWords(String txt){
txt = txt.replaceAll(" +", " ");
return txt;
}
can be simplified to:
static String spacesBetweenWords(String txt){
return txt.replaceAll(" +", " ");
}

Your code is correct. Also, you could perform dot, comma and colon formatting with one call using capturing groups:
static String spacesCommaDotColon(String txt) {
return txt.replaceAll("\\s*([.,:])\\s*", "$1 ");
}
Explanation:
"\\s*([.,:])\\s*": look for a comma, dot or colon with any surrounding whitespace; capture that character (the parentheses capture the matched text)
"$1 ": replace the matched text with the captured character (referred to as $1 since it was caught by the first and only capturing group) followed by one space
Another solution given by TEXHIK, using look-behind:
txt.replaceAll("(?<=[,.:])\\s{2,}", " ");
This looks for any run of at least two whitespace characters preceded by a comma, a dot or a colon and replaces it with a single space. Maybe not something to look at before understanding regex basics.
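For completeness, here is one possible sketch of the whole exercise, including the uppercase-after-dot rule that a single replaceAll cannot express. The class name, method name and sample input are my own assumptions, and it uses Matcher.appendReplacement so that each matched letter can be transformed in code. Treat it as one way to do it, not the intended solution.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the "Normalize Text" exercise.
class NormalizeSketch {
    static String normalize(String txt) {
        // 1. only one space between words
        String s = txt.trim().replaceAll(" +", " ");
        // 2. no space before, exactly one space after . , :
        s = s.replaceAll("\\s*([.,:])\\s*", "$1 ").trim();
        // 3. everything in lower case ...
        s = s.toLowerCase();
        // 4. ... except the first letter after "dot + space"
        Matcher m = Pattern.compile("(?<=\\. )[a-z]").matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, Matcher.quoteReplacement(m.group().toUpperCase()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: hello world. This is, a test: ok
        System.out.println(normalize("hello   world .this is , a TEST :ok"));
    }
}
```

Note that this sketch only capitalizes after a dot, as the task states; whether the very first word of the text should also be capitalized is left open.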

Related

Write a regular expression to count sentences

I have a String :
"Hello world... I am here. Please respond."
and I would like to count the number of sentences within the String. I had an idea to use a Scanner as well as the useDelimiter method to split any String into sentences.
Scanner in = new Scanner(file);
in.useDelimiter("insert here");
I'd like to create a regular expression which can go through the String I have shown above and identify it to have two sentences. I initially tried using the delimiter:
[^?.]
It gets hung up on the ellipses.
You could use a regular expression that checks for a non end of sentence, followed by an end of sentence like:
[^?!.][?!.]
Although, as @Gabe Sechan points out, a regular expression may not be accurate when the sentence includes abbreviated words such as Dr., Rd., St., etc.
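A minimal sketch of counting with that pattern, using Matcher.find rather than a Scanner delimiter (the class and method names are assumptions). On the sample string it finds three sentence ends (after "world", "here" and "respond"), and the second call shows the abbreviation problem mentioned above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the [^?!.][?!.] pattern.
class SentenceCountSketch {
    // a non-terminator character followed by a terminator character
    private static final Pattern SENTENCE_END = Pattern.compile("[^?!.][?!.]");

    static int countSentences(String text) {
        Matcher m = SENTENCE_END.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // the ellipsis counts once, because only its first dot follows a non-terminator
        System.out.println(countSentences("Hello world... I am here. Please respond.")); // 3
        // abbreviations are miscounted: one sentence, but three matches
        System.out.println(countSentences("Dr. Smith lives on Main St. in town."));      // 3
    }
}
```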
This could help:
public int getNumSentences()
{
List<String> tokens = getTokens( "[^!?.]+" );
return tokens.size();
}
You can also treat a line break (the Enter key) as a separator, and make it independent of your OS, with the following line of code:
String pattern = System.getProperty("line.separator") + " ";
Actually, you can find more about Enter here: Java regex: newline + white space
and hence finally the method becomes:
public int getNumSentences()
{
List<String> tokens = getTokens( "[^!?.]+" + pattern + "+" );
return tokens.size();
}
Hope this helps :)
A regular expression probably isn't the right tool for this. English is not a regular language, so regular expressions get hung up a lot. For one thing, you can't even be sure a period in the middle of the text is an end of sentence: abbreviations (like Mr.), acronyms with periods, and initials will trip you up as well. It's not the right tool.
For your sentence: "Hello world... I am here. Please respond."
the code will be:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class JavaRegex {
public static void main(String[] args) {
int count=0;
String sentence = "Hello world... I am here. Please respond.";
Pattern pattern = Pattern.compile("\\..");
Matcher matcher = pattern.matcher(sentence);
while(matcher.find()) {
count++;
}
System.out.println("No. of sentence = "+count);
}
}

JAVA: Replacing words in string

I want to replace words in a string, but I am having some difficulty. Here is what I want to do. I have this string:
String a = "I want to replace some words in this string";
It should work like a kind of translator. I am doing this with String.replaceAll(), but it doesn't work completely, for the following reason. Let's say I am translating from English to German; then this should be the output (Ich means I in German).
String toTranslate = "I";
String translated = "Ich";
a = a.replaceAll(toTranslate.toLowerCase(), translated.toLowerCase());
Now the output of the String a will be this:
"ich want to replace some words ichn thichs strichng"
How can I replace just whole words, not substrings inside other words?
replaceAll uses regex, so you can add word boundaries or look-around mechanisms to check that there are no non-space characters surrounding the word you want to replace.
String toTranslate = "I";
String translated = "Ich";
a = a.replaceAll("(?<!\\S)"+toTranslate.toLowerCase()+"(?!\\S)", translated.toLowerCase());
You can also use the quotation mechanism to escape any regex metacharacters like + * ( inside the word you want to replace. BTW, you don't need to change your string to lower case; simply add the case-insensitive flag (?i) to the regex.
a = a.replaceAll("(?i)(?<!\\S)"+Pattern.quote(toTranslate)+"(?!\\S)", translated.toLowerCase());
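Wrapped up as a runnable sketch (the class and method names are hypothetical), the look-around version might look like this. It also passes the replacement through Matcher.quoteReplacement, which is my addition, so that a $ or \ in the translated word is not interpreted as a group reference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built around the look-around regex from this answer.
class WholeWordReplaceSketch {
    static String replaceWord(String text, String word, String replacement) {
        // (?i)      case-insensitive matching
        // (?<!\S)   no non-space character directly before the word
        // (?!\S)    no non-space character directly after the word
        String regex = "(?i)(?<!\\S)" + Pattern.quote(word) + "(?!\\S)";
        return text.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String a = "I want to replace some words in this string";
        // prints: ich want to replace some words in this string
        System.out.println(replaceWord(a, "I", "ich"));
        // metacharacters in the word are taken literally; prints: x and a+b
        System.out.println(replaceWord("a+ and a+b", "a+", "x"));
    }
}
```

Because the look-arounds test for whitespace rather than word characters, a word followed directly by punctuation (for example "I," with a comma) is not matched by this version.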
Use split(" ") to get each word in the sentence, then compare each whole word with the word to translate and swap it when it matches.
String a = "I want to replace some words in this string";
String toTranslate = "I";
String translated = "Ich";
String[] words = a.split(" ");
for (int i = 0; i < words.length; i++) {
// equalsIgnoreCase compares the whole word, so "in" or "this" is left alone
if (words[i].equalsIgnoreCase(toTranslate)) {
words[i] = translated.toLowerCase();
}
}
System.out.println("after translation = " + String.join(" ", words));
String toTranslate = "I ";
String translated = "Ich ";
a = a.replaceAll(toTranslate.toLowerCase(), translated.toLowerCase());
If you add a space after the "I", it will only be replaced when it is followed by a space, giving the word "Ich"; but if another word ends in "i", that is another problem.
If you assume that I will always be capitalized in English, as it should be, then
a = a.replaceAll(toTranslate, translated);
will work; otherwise you need to replace both cases:
a = a.replaceAll(toTranslate, translated);
a = a.replaceAll("([^a-zA-Z])("+toTranslate.toLowerCase()+")([^a-zA-Z])", "$1"+translated.toLowerCase()+"$3");
Here is a working example
Yes, word boundaries are the solution. I just did this in the regex:
text.replaceAll("\\b" + parts1[i] + "\\b", map.element.value);
Don't be confused by the second argument; it is just a string (from a hash table).
You can use the regex word boundary, which is \b:
String toTranslate = "\\bI\\b";
String translated = "Ich";
a = a.replaceAll(toTranslate.toLowerCase(), translated.toLowerCase());
This should ensure that I is matched only when it stands as its own word.
Edit: I misread the question and realized you want whole words. See above, as I have accounted for that

Replace whole tokens that may contain regular expression

I want to do a startStr.replaceAll(searchStr, replaceStr) and I have two requirements.
The searchStr must be a whole word, meaning it must have a space, beginning of string or end of string character around it.
e.g.
startStr = "ON cONfirmation, put ON your hat"
searchStr = "ON"
replaceStr = ""
expected = " cONfirmation, put your hat"
The searchStr may contain a regex pattern
e.g.
startStr = "remove this * thing"
searchStr = "*"
replaceStr = ""
expected = "remove this thing"
For requirement 1, I've found that this works:
startStr.replaceAll("\\b"+searchStr+"\\b",replaceStr)
For requirement 2, I've found that this works:
startStr.replaceAll(Pattern.quote(searchStr), replaceStr)
But I can't get them to work together:
startStr.replaceAll("\\b"+Pattern.quote(searchStr)+"\\b", replaceStr)
Here is the simple test case that's failing
startStr = "remove this * thing but not this*"
searchStr = "*"
replaceStr = ""
expected = "remove this thing but not this*"
actual = "remove this * thing but not this*"
What am I missing?
Thanks in advance
First off, the \b, or word boundary, is not going to work for you with the asterisk. The reason is that \b only detects boundaries of word characters. A regex parser won't treat * as a word character, so a search word that begins or ends with a regex metacharacter such as * won't be surrounded by valid word boundaries.
Reference pages:
http://www.regular-expressions.info/wordboundaries.html
http://docs.oracle.com/javase/tutorial/essential/regex/bounds.html
An option you might like is to supply wildcard permutations in a regex:
(?<=\s|^)(ON|\*N|O\*|\*)(?=\s|$)
Here's a Java example:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
class RegExTest
{
public static void main(String[] args){
String sourcestring = "ON cONfirmation, put * your hat";
sourcestring = sourcestring.replaceAll("(?<=\\s|^)(ON|\\*N|O\\*|\\*)(?=\\s|$)","").replaceAll(" +"," ").trim();
System.out.println("sourcestring=["+sourcestring+"]");
}
}
You can write a little function to generate the wildcard permutations automatically. I admit I cheated a little with the spaces, but I don't think that was a requirement anyway.
Play with it online here: http://ideone.com/7uGfIS
The pattern "\\b" matches a word boundary, with a word character on one side and a non-word character on the other. * is not a word character, so \\b\\*\\b won't work. Look-behind and look-ahead match but do not consume patterns. You can specify that the beginning of the string or whitespace must come before your pattern and that whitespace or the end of the string must follow:
startStr.replaceAll("(?<=^|\\s)"+Pattern.quote(searchStr)+"(?=\\s|$)", replaceStr)
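Here is that one-liner as a small runnable sketch, next to the \b version from the question for comparison (the class and method names are mine). Note that only the token itself is removed, so the space on each side of it stays behind.

```java
import java.util.regex.Pattern;

// Hypothetical demo of whole-token removal with look-behind and look-ahead.
class TokenRemovalSketch {
    // token must be preceded by start-of-string or whitespace,
    // and followed by whitespace or end-of-string
    static String removeToken(String text, String token) {
        String regex = "(?<=^|\\s)" + Pattern.quote(token) + "(?=\\s|$)";
        return text.replaceAll(regex, "");
    }

    // the attempt from the question, for comparison
    static String removeWithWordBoundary(String text, String token) {
        return text.replaceAll("\\b" + Pattern.quote(token) + "\\b", "");
    }

    public static void main(String[] args) {
        String s = "remove this * thing but not this*";
        // the look-around version removes only the standalone *
        System.out.println("[" + removeToken(s, "*") + "]");
        // \b finds no boundary around a standalone *, so nothing changes
        System.out.println("[" + removeWithWordBoundary(s, "*") + "]");
        // ON is removed as a whole token, cONfirmation is untouched
        System.out.println("[" + removeToken("ON cONfirmation, put ON your hat", "ON") + "]");
    }
}
```

If the doubled spaces matter, one option is to follow up with replaceAll(" +", " "), at the cost of also collapsing spaces elsewhere in the string.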
Try this,
For removing "ON"
StringBuilder stringBuilder = new StringBuilder();
String[] splittedValue = startStr.split(" ");
for (String value : splittedValue)
{
if (!value.equalsIgnoreCase("ON"))
{
stringBuilder.append(value);
stringBuilder.append(" ");
}
}
System.out.println(stringBuilder.toString().trim());
For removing "*"
String startStr1 = "remove this * thing";
System.out.println(startStr1.replaceAll("\\*[\\s]", ""));
You can use (^| )\*( |$) instead of using \\b
Try this: startStr.replaceAll("(^| )yourSearchString( |$)", replaceStr);

Getting scanner to read text file

I am trying to use a scanner to read a text file pulled with JFileChooser. The wordCount is working correctly, so I know it is reading. However, I cannot get it to search for instances of the user inputted word.
public static void main(String[] args) throws FileNotFoundException {
String input = JOptionPane.showInputDialog("Enter a word");
JFileChooser fileChooser = new JFileChooser();
fileChooser.showOpenDialog(null);
File fileSelection = fileChooser.getSelectedFile();
int wordCount = 0;
int inputCount = 0;
Scanner s = new Scanner (fileSelection);
while (s.hasNext()) {
String word = s.next();
if (word.equals(input)) {
inputCount++;
}
wordCount++;
}
You'll have to look for
, ; . ! ? etc.
in each word. The next() method grabs an entire token until it hits whitespace.
It will treat "hi, how are you?" as the following tokens: "hi,", "how", "are", "you?".
You can use the method indexOf(String) to find these characters. You can also use replaceAll(String regex, String replacement) to remove them. You can remove each character individually with its own call, or you can use a single regex for all of them, although regexes are usually harder to understand.
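// each call removes one punctuation character by replacing it with an empty string
// the dot must be escaped, because an unescaped . is a regex that matches any character
word = word.replaceAll("\\.","");
word = word.replaceAll(",","");
word = word.replaceAll("!","");
//etc.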
Read more about this method:
http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#replaceAll%28java.lang.String,%20java.lang.String%29
Here's a Regex example:
//NOTE: This example will not work for you. It's just a simple example for seeing a Regex.
//Removes whitespace between a word character and . or ,
String pattern = "(\\w)(\\s+)([\\.,])";
word = word.replaceAll(pattern, "$1$3");
Source:
http://www.vogella.com/articles/JavaRegularExpressions/article.html
Here is a good Regex example that may help you:
Regex for special characters in java
Parse and remove special characters in java regex
Remove all non-"word characters" from a String in Java, leaving accented characters?
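Putting the pieces together, a sketch of the counting loop might look like the following. To keep it self-contained it reads from a String instead of the file chosen with JFileChooser, and the class name, method name and the character class used for stripping are my own assumptions.

```java
import java.util.Scanner;

// Hypothetical sketch: count occurrences of a word, ignoring attached punctuation.
class WordSearchSketch {
    static int countWord(String text, String target) {
        int count = 0;
        try (Scanner s = new Scanner(text)) {
            while (s.hasNext()) {
                // strip everything that is not a letter or a digit, e.g. "hi," becomes "hi"
                String word = s.next().replaceAll("[^\\p{L}\\p{Nd}]", "");
                if (word.equalsIgnoreCase(target)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWord("hi, how are you? Hi! Oh hi.", "hi")); // 3
    }
}
```

One caveat: this also strips punctuation inside a token, so "don't" becomes "dont". If that matters, strip only leading and trailing punctuation instead.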
If the text entered by the user differs in case, then you should try using equalsIgnoreCase().
In addition to blackpanther's answer, you should also use trim() to account for whitespace, as
"abc" is not equal to "abc "
You should take a look at matches().
equals will not help you, since next() doesn't return the file word by word,
but rather token by token, separated by whitespace (not by comma, semicolon, etc.), as others mentioned.
Here is the Javadoc: String#matches(java.lang.String)
...and a little example.
input = ".*" + input + ".*";
...
boolean foundWord = word.matches(input);
. is the regex wildcard and stands for any character. .* stands for zero or more arbitrary characters. So you get a match if input occurs anywhere in word.

String Matches, Java

I have a sort of a problem with this code:
String[] paragraph;
if(paragraph[searchKeyword_counter].matches("(.*)(\\b)"+"is"+"(\\b)(.*)")){
If I am not mistaken, to use .matches() to search for a particular substring in a string I need .* around it, but what I want is to search for the keyword without matching it inside another word.
For example, "is" is the keyword I am going to search for. I do not want it to match words that merely contain "is", like ship, his, this. So I used \b for the boundary, but the code above is not working for me.
Example:
String[] content = {"is,","his","fish","ish","its","is"};
String keyword = "is";
for(int i=0;i<Content.length;i++){
if(content[i].matches("(.*)(\\b)"+keyword+"(\\b)(.*)")){
System.out.println("There are "+i+" is.");
}
}
What I want to happen here is that it will only match "is" and "is,", but not "his" or "fish". That is, I want it to match even when the keyword is next to a non-alphanumeric character or a space.
What is the problem with the code above?
What if one of the entries has uppercase characters, for example IS, and it is compared with is? It will not match. Correct me if I am wrong. How can I match lowercase against uppercase without changing the content of the source?
String string = "...";
String word = "is";
Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
Matcher m = p.matcher(string);
if (m.find()) {
...
}
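Filled in with the array from the question, and with Pattern.CASE_INSENSITIVE added to cover the uppercase follow-up, a runnable sketch could look like this (the class and method names are mine, and the extra "IS" entry is added just for the demo).

```java
import java.util.regex.Pattern;

// Hypothetical sketch: whole-word, case-insensitive keyword matching with find().
class KeywordMatchSketch {
    static boolean containsWord(String text, String word) {
        // \b on both sides: the keyword must not be glued to other word characters
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                                    Pattern.CASE_INSENSITIVE);
        // find() looks for the pattern anywhere in the text, so no .* is needed
        return p.matcher(text).find();
    }

    static int countMatching(String[] content, String word) {
        int count = 0;
        for (String entry : content) {
            if (containsWord(entry, word)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] content = {"is,", "his", "fish", "ish", "its", "is", "IS"};
        // matches "is,", "is" and "IS"; prints 3
        System.out.println(countMatching(content, "is"));
    }
}
```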
Just add spaces like this:
Suppose message is your content string and pattern is your keyword:
if ((message).matches(".* " + pattern + " .*")||(message).matches("^" + pattern + " .*")
||(message).matches(".* " + pattern + "$")) {
