Remove Punctuation issue - java

I'm trying to find a word in a string. However, because of a trailing period, one occurrence of the word is not recognized. I'm trying to remove the punctuation, but it seems to have no effect. Am I missing something here? This is the line of code I am using: s.replaceAll("([a-z] +) [?:!.,;]*","$1");
String test = "This is a line about testing tests. Tests are used to examine stuff";
String key = "tests";
int counter = 0;
String[] testArray = test.toLowerCase().split(" ");
for(String s : testArray)
{
s.replaceAll("([a-z] +) [?:!.,;]*","$1");
System.out.println(s);
if(s.equals(key))
{
System.out.println(key + " FOUND");
counter++;
}
}
System.out.println(key + " has been found " + counter + " times.");
}
I managed to find a solution (though it may not be ideal) by using s = s.replaceAll("\\W",""); Thanks for everyone's guidance on how to solve this problem.

You could also take advantage of the regex in the split operation. Try this:
String[] testArray = test.toLowerCase().split("\\W+");
This will split on apostrophe, so you may need to tweak it a bit with a specific list of characters.

Strings are immutable. You need to assign the result of replaceAll back to the variable:
s = s.replaceAll("([a-z] +)*[?:!.,;]*", "$1");
^
Also, your regex requires a space between the word and the punctuation. In the case of tests., there is none. You can adjust your regex with an optional (zero or more) quantifier to account for this.
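Putting the two fixes together (reassign the result, and do not require a space before the punctuation), the counting loop from the question could be sketched roughly as follows. The class and method names (WordCounter, countWord) are made up for illustration, and stripping every non-word character with \\W is an assumption that also removes apostrophes and hyphens.

```java
import java.util.Locale;

public class WordCounter {
    // Counts how often `key` occurs as a whole word, ignoring case and punctuation.
    static int countWord(String text, String key) {
        int counter = 0;
        for (String s : text.toLowerCase(Locale.ROOT).split(" ")) {
            // Strings are immutable: replaceAll returns a new String, so reassign it.
            s = s.replaceAll("\\W", "");
            if (s.equals(key)) {
                counter++;
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        String test = "This is a line about testing tests. Tests are used to examine stuff";
        System.out.println("tests has been found " + countWord(test, "tests") + " times.");
    }
}
```

With the sample sentence from the question, both "tests." and "Tests" are counted, while "testing" is not.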

Your regex doesn't seem to work as you want.
If you want to find something which has period after that then this will work
([a-z]*) [?(:!.,;)*]
it returns "tests." when it's run on your given string.
Also
[?(:!.,;)*]
just matches the punctuation, which can then be replaced.
However, I am not sure why you are not using the substring() function.

Related

Java - Regular Expressions Split on character after and before certain words

I'm having trouble figuring out how to grab a certain part of a string using regular expressions in JAVA. Here's my input string:
application.APPLICATION NAME.123456789.status
I need to grab the portion of the string called "APPLICATION NAME". I can't simply split on the period character because APPLICATION NAME may itself include a period. The first word, "application", will always remain the same and the characters after "APPLICATION NAME" will always be numbers.
I've been able to split on the period and grab index 1, but as I mentioned, APPLICATION NAME may itself include periods, so this is no good. I've also been able to grab the indexes of the first and second-to-last periods, but that seems inefficient and I would like to future-proof this by using a regex.
I've googled around for hours and haven't been able to find much guidance. Thanks!

You can use ^application\.(.*)\.\d with find(), or application\.(.*)\.\d.* with matches().
Sample code using find():
private static void test(String input) {
    String regex = "^application\\.(.*)\\.\\d";
    Matcher m = Pattern.compile(regex).matcher(input);
    if (m.find())
        System.out.println(input + ": Found \"" + m.group(1) + "\"");
    else
        System.out.println(input + ": **NOT FOUND**");
}
public static void main(String[] args) {
    test("application.APPLICATION NAME.123456789.status");
    test("application.Other.App.Name.123456789.status");
    test("application.App 55 name.123456789.status");
    test("application.App.55.name.123456789.status");
    test("bad input");
}
Output
application.APPLICATION NAME.123456789.status: Found "APPLICATION NAME"
application.Other.App.Name.123456789.status: Found "Other.App.Name"
application.App 55 name.123456789.status: Found "App 55 name"
application.App.55.name.123456789.status: Found "App.55.name"
bad input: **NOT FOUND**
The above will work as long as "status" doesn't start with a digit.

With split(), you could save key.split("\\.") in a String[] s and then join the elements from s[1] to s[s.length-3].
With regexes you can do:
String appName = key.replaceAll("application\\.(.*)\\.\\d+\\.\\w+", "$1");

Why split? Just:
String appName = input.replaceAll(".*?\\.(.*)\\.\\d+\\..*", "$1");
This also correctly handles a dot then digits within the application name, but only works correctly if you know the input is in the expected format.
To handle "bad" input by returning a blank string when the pattern is not matched, be stricter and add an alternative (|.*) that always matches, so the entire input is replaced either way:
String appName = input.replaceAll("^application\\.(.*)\\.\\d+\\.\\w+$|.*", "$1");
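As a quick check of the stricter pattern, here is a small sketch that wraps it in a helper and runs it against well-formed and malformed input. The class and method names (AppNameExtractor, extractName) are hypothetical. It relies on Java substituting nothing for a group that did not take part in the match, which is why malformed input comes back as an empty string.

```java
public class AppNameExtractor {
    // Pattern from the answer above: the first branch captures the name,
    // the second branch (.*) swallows any input that is not in the expected format.
    private static final String REGEX = "^application\\.(.*)\\.\\d+\\.\\w+$|.*";

    static String extractName(String input) {
        // $1 is empty whenever the second branch matched, so bad input yields "".
        return input.replaceAll(REGEX, "$1");
    }

    public static void main(String[] args) {
        System.out.println(extractName("application.APPLICATION NAME.123456789.status"));
        System.out.println(extractName("application.Other.App.Name.123456789.status"));
        System.out.println("[" + extractName("bad input") + "]");
    }
}
```

If an empty string could also be a legitimate name, using Matcher.matches() and checking its boolean result would distinguish the two cases more clearly.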

How to delete duplicated characters in a string?

Okay, I'm a huge newbie in the world of Java and I can't seem to get this program right. I am supposed to delete the duplicated characters in a two-word string and print the non-duplicated characters.
For example: if I input the words "computer program", the output should be "cute" because these are the only chars that are not repeated.
I made it until here:
public static void main(String[] args) {
System.out.print("Input two words: ");
String str1 = Keyboard.readString();
String words[] = str1.split(" ");
String str2 = words[0] + " ";
String str3 = words[words.length - 1] ;
}
but I don't know how to output the characters. Could someone help me?
I don't know if I should use if, switch, for, do, or do-while... I'm confused.

What you need is to build up the logic for your problem. First break the problem statement down and then find a solution for each part. Here are the steps:
Read every character from the string.
Add it to a collection, but before adding it, check whether it already exists there.
If it exists, remove it and continue reading characters.
Once you are done reading the characters, print the contents of the collection to the console using System.out.println.
I recommend books like "Think Like a Programmer". They will help you get started with building logic.

Just a hint: use a hash map (http://docs.oracle.com/javase/6/docs/api/java/util/HashMap.html).

Adding following code after last line of your main program will resolve your issue.
char[] strChars = str2.toCharArray();
String newStr="";
for (char c : strChars) {
String charStr = ""+c;
if(!str3.contains(charStr.toLowerCase()) && !str3.contains(charStr.toUpperCase())){
newStr+=c;
}
}
System.out.println(newStr);
This code loops through all the characters of the first word and checks whether the second word contains each character (in either lower or upper case). If it does not, the character is added to the output string, which is printed at the end.
I hope this works in your case.

How about doing it in just 1 line?
str = str.replaceAll("(.)(?=.*\\1)", "");
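To see what the one-liner actually does, here is a small sketch that runs it on the sample input next to a plain loop that keeps only the characters of the first word that are missing from the second. The class and method names (DuplicateChars, dropEarlierDuplicates, notInSecond) are made up for illustration.

```java
public class DuplicateChars {
    // Removes every character that occurs again later in the string,
    // so only the last occurrence of each character survives.
    static String dropEarlierDuplicates(String str) {
        return str.replaceAll("(.)(?=.*\\1)", "");
    }

    // Keeps the characters of `first` that do not occur in `second` (case-insensitive).
    static String notInSecond(String first, String second) {
        StringBuilder out = new StringBuilder();
        String lowerSecond = second.toLowerCase();
        for (char c : first.toCharArray()) {
            if (lowerSecond.indexOf(Character.toLowerCase(c)) < 0) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dropEarlierDuplicates("computer program")); // cute pogram
        System.out.println(notInSecond("computer", "program"));        // cute
    }
}
```

The regex keeps the last occurrence of every character across the whole string, so it prints "cute pogram" rather than "cute". Whether that is acceptable depends on how the assignment defines "duplicated".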

Replace new line/return with space using regex

Pretty basic question for someone who knows.
Instead of getting from
"This is my text.
And here is a new line"
To:
"This is my text. And here is a new line"
I get:
"This is my text.And here is a new line"
Any idea why?
L.replaceAll("[\\\t|\\\n|\\\r]","\\\s");
I think I found the culprit.
On the next line I do the following:
L.replaceAll( "[^a-zA-Z0-9|^!|^?|^.|^\\s]", "");
And this seems to be causing my issue.
Any idea why?
I am obviously trying to do the following: remove all non-chars, and remove all new lines.

\s is a shortcut for whitespace characters in regex. It has no meaning in a string. ==> You can't use it in your replacement string. There you need to put exactly the character(s) that you want to insert. If this is a space just use " " as replacement.
The other thing is: Why do you use 3 backslashes as escape sequence? Two are enough in Java. And you don't need a | (alternation operator) in a character class.
L.replaceAll("[\\t\\n\\r]+"," ");
Remark
L is not changed. If you want to have a result you need to do
String result = L.replaceAll("[\\t\\n\\r]+"," ");
Test code:
String in = "This is my text.\n\nAnd here is a new line";
System.out.println(in);
String out = in.replaceAll("[\\t\\n\\r]+"," ");
System.out.println(out);

The new line separator is different for different OS-es - '\r\n' for Windows and '\n' for Linux.
To be safe, you can use regex pattern \R - the linebreak matcher introduced with Java 8:
String inlinedText = text.replaceAll("\\R", " ");
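A minimal sketch of the \R approach, assuming Java 8 or later. The class and method names (LineBreaks, inline) are hypothetical, and the + quantifier is an addition so that a blank line between paragraphs collapses to one space instead of two.

```java
public class LineBreaks {
    // Collapses each run of line breaks (\n, \r\n, \r, ...) into a single space.
    // \R is only available from Java 8 onwards.
    static String inline(String text) {
        return text.replaceAll("\\R+", " ");
    }

    public static void main(String[] args) {
        String in = "This is my text.\r\n\r\nAnd here is a new line";
        System.out.println(inline(in));
    }
}
```

Because replaceAll returns a new String, the result has to be assigned or returned, as the helper does here.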

Try
L.replaceAll("(\\t|\\r?\\n)+", " ");
Depending on the system, a line break is either \r\n or just \n.

I found this.
String newString = string.replaceAll("\n", " ");
Although, as you have a double line, you will get a double space. I guess you could then do another replace all to replace double spaces with a single one.
If that doesn't work try doing:
string.replaceAll(System.getProperty("line.separator"), " ");
If I create lines in "string" by using "\n" I had to use "\n" in the regex. If I used System.getProperty() I had to use that.

Your regex is good, although I would replace the match with the empty string:
String resultString = subjectString.replaceAll("[\t\n\r]", "");
You expect a space between "text." and "And" right?
I get that space when I try the regex by copying your sample
"This is my text. "
So all is well here. Maybe if you just replace it with the empty string it will work. I don't know why you replace it with \s. And the alternation | is not necessary in a character class.

You may first split the string and then rejoin it using a space.
That will work for sure.
String[] Larray = L.split("[\\n]+");
L = "";
for (int i = 0; i < Larray.length; i++) {
    L = L + " " + Larray[i];
}

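This should take care of spaces, tabs and newlines (note the +; with * the pattern also matches the empty string, so a space would be inserted between every character):
data = data.replaceAll("[ \t\n\r]+", " ");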

Help building a regex

I need to build a regular expression that finds the word "int" only if it's not part of some string.
I want to find whether int is used in the code. (not in some string, only in regular code)
Example:
int i; // the regex should find this one.
String example = "int i"; // the regex should ignore this line.
logger.i("int"); // the regex should ignore this line.
logger.i("int") + int.toString(); // the regex should find this one (because of the second int)
thanks!

It's not going to be bullet-proof, but this works for all your test cases:
(?<=^([^"]*|[^"]*"[^"]*"[^"]*))\bint\b(?=([^"]*|[^"]*"[^"]*"[^"]*)$)
It does a look behind and look ahead to assert that there's either none or two preceding/following quotes "
Here's the code in java with the output:
String regex = "(?<=^([^\"]*|[^\"]*\"[^\"]*\"[^\"]*))\\bint\\b(?=([^\"]*|[^\"]*\"[^\"]*\"[^\"]*)$)";
System.out.println(regex);
String[] tests = new String[] {
"int i;",
"String example = \"int i\";",
"logger.i(\"int\");",
"logger.i(\"int\") + int.toString();" };
for (String test : tests) {
System.out.println(test.matches("^.*" + regex + ".*$") + ": " + test);
}
Output (included regex so you can read it without all those \ escapes):
(?<=^([^"]*|[^"]*"[^"]*"[^"]*))\bint\b(?=([^"]*|[^"]*"[^"]*"[^"]*)$)
true: int i;
false: String example = "int i";
false: logger.i("int");
true: logger.i("int") + int.toString();
Using a regex is never going to be 100% accurate - you need a language parser. Consider escaped quotes in Strings "foo\"bar", in-line comments /* foo " bar */, etc.
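One way to sidestep the quote-counting lookarounds is to blank out string literals first and then search for the word. This is a different technique from the regex above, shown only as a rough sketch: the class and method names (IntOutsideStrings, usesInt) are invented, it copes with escaped quotes, and it still ignores comments, char literals such as '"', and text blocks.

```java
import java.util.regex.Pattern;

public class IntOutsideStrings {
    // A double-quoted literal: an escaped character (such as \" or \\)
    // or any character that is neither a quote nor a backslash.
    private static final Pattern STRING_LITERAL =
            Pattern.compile("\"(?:\\\\.|[^\"\\\\])*\"");
    private static final Pattern INT_WORD = Pattern.compile("\\bint\\b");

    static boolean usesInt(String line) {
        // Replace every string literal with an empty literal, then look for the word.
        String withoutStrings = STRING_LITERAL.matcher(line).replaceAll("\"\"");
        return INT_WORD.matcher(withoutStrings).find();
    }

    public static void main(String[] args) {
        System.out.println(usesInt("int i;"));
        System.out.println(usesInt("String example = \"int i\";"));
        System.out.println(usesInt("logger.i(\"int\");"));
        System.out.println(usesInt("logger.i(\"int\") + int.toString();"));
    }
}
```

For the four lines from the question this prints true, false, false, true. Anything beyond such simple cases still calls for a real parser.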

Not exactly sure what your complete requirements are but
$\s*\bint\b
perhaps

Assuming input will be each line,
^int\s[\$_a-zA-Z\;]*$
it follows basic variable naming rules :)

If you want to parse code and search for an isolated int word, this works:
(^int|[\(\ \;,]int)
You can use it to find int that in code can be only preceded by space, comma, ";" and left parenthesis or be the first word of line.
You can try it here and enhance it http://www.regextester.com/
PS: this works in all your test cases.

$[^"]*\bint\b
should work. I can't think of a situation where you can use a valid int identifier after the character '"'.
Of course this only applies if the code is limited to one statement per line.

How to find a whole word in a String in Java?

I have a String that I have to parse for different keywords.
For example, I have the String:
"I will come and meet you at the 123woods"
And my keywords are
'123woods'
'woods'
I should report whenever I have a match and where. Multiple occurrences should also be accounted for.
However, for this one, I should get a match only on '123woods', not on 'woods'. This eliminates using String.contains() method. Also, I should be able to have a list/set of keywords and check at the same time for their occurrence. In this example, if I have '123woods' and 'come', I should get two occurrences. Method execution should be somewhat fast on large texts.
My idea is to use StringTokenizer but I am unsure if it will perform well. Any suggestions?

The example below is based on your comments. It uses a List of keywords, which will be searched in a given String using word boundaries. It uses StringUtils from Apache Commons Lang to build the regular expression and print the matched groups.
String text = "I will come and meet you at the woods 123woods and all the woods";
List<String> tokens = new ArrayList<String>();
tokens.add("123woods");
tokens.add("woods");
String patternString = "\\b(" + StringUtils.join(tokens, "|") + ")\\b";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
If you are looking for more performance, you could have a look at StringSearch: high-performance pattern matching algorithms in Java.
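Since the question also asks where each match occurs, here is a dependency-free sketch of the same idea that records the start index of every hit. The class and method names (KeywordFinder, findKeywords) and the "word@index" output format are made up for illustration. Pattern.quote is added as a precaution in case a keyword contains regex metacharacters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class KeywordFinder {
    // Returns every whole-word keyword hit as "keyword@startIndex", in text order.
    static List<String> findKeywords(String text, List<String> keywords) {
        // Quote each keyword so characters like '.' or '+' are matched literally.
        String alternation = keywords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Pattern pattern = Pattern.compile("\\b(?:" + alternation + ")\\b");
        Matcher matcher = pattern.matcher(text);
        List<String> hits = new ArrayList<>();
        while (matcher.find()) {
            hits.add(matcher.group() + "@" + matcher.start());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "I will come and meet you at the 123woods";
        System.out.println(findKeywords(text, Arrays.asList("123woods", "woods", "come")));
    }
}
```

For the sample sentence this reports "come" and "123woods" but not "woods", because there is no word boundary between "3" and "w".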

Use regex + word boundaries as others answered.
"I will come and meet you at the 123woods".matches(".*\\b123woods\\b.*");
will be true.
"I will come and meet you at the 123woods".matches(".*\\bwoods\\b.*");
will be false.

Hope this works for you:
String string = "I will come and meet you at the 123woods";
String keyword = "123woods";
Boolean found = Arrays.asList(string.split(" ")).contains(keyword);
if(found){
System.out.println("Keyword matched the string");
}
http://codigounico.blogspot.com/

How about something like Arrays.asList(String.split(" ")).contains("xx")?
See String.split() and How can I test if an array contains a certain value.

Got a way to match Exact word from String in Android:
String full = "Hello World. How are you ?";
String one = "Hell";
String two = "Hello";
String three = "are";
String four = "ar";
boolean is1 = isContainExactWord(full, one);
boolean is2 = isContainExactWord(full, two);
boolean is3 = isContainExactWord(full, three);
boolean is4 = isContainExactWord(full, four);
Log.i("Contains Result", is1+"-"+is2+"-"+is3+"-"+is4);
Result: false-true-true-false
Function for match word:
private boolean isContainExactWord(String fullString, String partWord){
String pattern = "\\b"+partWord+"\\b";
Pattern p=Pattern.compile(pattern);
Matcher m=p.matcher(fullString);
return m.find();
}
Done

public class FindTextInLine {
String match = "123woods";
String text = "I will come and meet you at the 123woods";
public void findText () {
if (text.contains(match)) {
System.out.println("Keyword matched the string" );
}
}
}

Try to match using regular expressions. Match for "\b123woods\b"; \b is a word boundary.

The solution seems to be long accepted, but the solution could be improved, so if someone has a similar problem:
This is a classical application for multi-pattern-search-algorithms.
Java pattern search (with Matcher.find) is not well suited for this. Searching for exactly one keyword is optimized in Java, but searching for an or-expression uses the regex nondeterministic automaton, which backtracks on mismatches. In the worst case each character of the text will be processed l times (where l is the sum of the pattern lengths).
Single-pattern search is better, but not well suited either. One has to restart the whole search for every keyword pattern. In the worst case each character of the text will be processed p times, where p is the number of patterns.
Multi-pattern search processes each character of the text exactly once. Algorithms suitable for such a search include Aho-Corasick, Wu-Manber, and Set Backwards Oracle Matching. These can be found in libraries like Stringsearchalgorithms or byteseek.
// example with StringSearchAlgorithms
AhoCorasick stringSearch = new AhoCorasick(asList("123woods", "woods"));
CharProvider text = new StringCharProvider("I will come and meet you at the woods 123woods and all the woods", 0);
StringFinder finder = stringSearch.createFinder(text);
List<StringMatch> all = finder.findAll();

A much simpler way to do this is to use split():
String match = "123woods";
String text = "I will come and meet you at the 123woods";
String[] sentence = text.split(" ");
for (String word : sentence)
{
    if (word.equals(match))
        return true;
}
return false;
This is a simpler, less elegant way to do the same thing without using tokens, etc.

You can use regular expressions.
Use Matcher and Pattern methods to get the desired output

You can also use regex matching with the \b anchor (word boundary).

To match "123woods" instead of "woods", use atomic grouping in the regular expression.
One thing to note is that, when matching "123woods" in a string, it matches the first "123woods" and stops instead of searching the rest of the string.
\b(?>123woods|woods)\b
It tries 123woods first; once that has matched, it stops searching.

Looking back at the original question, we need to find some given keywords in a given sentence, count the number of occurrences and know something about where. I don't quite understand what "where" means (is it an index in the sentence?), so I'll pass that one... I'm still learning java, one step at a time, so I'll see to that one in due time :-)
Note that ordinary sentences (such as the one in the original question) can contain repeated keywords, so the search cannot just ask whether a given keyword "exists or not" and count it as 1 if it does. There can be more than one occurrence of the same keyword. For example:
// Base sentence (added punctuation, to make it more interesting):
String sentence = "Say that 123 of us will come by and meet you, "
+ "say, at the woods of 123woods.";
// Split it (punctuation taken in consideration, as well):
java.util.List<String> strings =
java.util.Arrays.asList(sentence.split(" |,|\\."));
// My keywords:
java.util.ArrayList<String> keywords = new java.util.ArrayList<>();
keywords.add("123woods");
keywords.add("come");
keywords.add("you");
keywords.add("say");
By looking at it, the expected result would be 5 for "Say" + "come" + "you" + "say" + "123woods", counting "say" twice if we go lowercase. If we don't, then the count should be 4, "Say" being excluded and "say" included. Fine. My suggestion is:
// Set... ready...?
int counter = 0;
// Go!
for(String s : strings)
{
// Asking if the word from the sentence exists in the keywords, not the other
// way around, to find repeated keywords in the sentence.
Boolean found = keywords.contains(s.toLowerCase());
if(found)
{
counter ++;
System.out.println("Found: " + s);
}
}
// Statistics:
if (counter > 0)
{
System.out.println("In sentence: " + sentence + "\n"
+ "Count: " + counter);
}
And the results are:
Found: Say
Found: come
Found: you
Found: say
Found: 123woods
In sentence: Say that 123 of us will come by and meet you, say, at the woods of 123woods.
Count: 5
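If a per-keyword breakdown is wanted instead of a single total, the same loop can feed a map. The following is only a sketch: the class and method names (KeywordTally, tally) are hypothetical, and splitting on \\W+ is an assumption that treats any run of non-word characters as a separator.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class KeywordTally {
    // Counts, per keyword, how many times it appears as a whole word (case-insensitive).
    static Map<String, Integer> tally(String sentence, Set<String> keywords) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys for stable output
        for (String word : sentence.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (keywords.contains(word)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String sentence = "Say that 123 of us will come by and meet you, "
                + "say, at the woods of 123woods.";
        Set<String> keywords = new HashSet<>(Arrays.asList("123woods", "come", "you", "say"));
        System.out.println(tally(sentence, keywords)); // {123woods=1, come=1, say=2, you=1}
    }
}
```

The keyword set is expected to be lowercase here, matching the lowercase comparison used in the loop above.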

If you want to identify a whole word in a string and change the content of that word, you can do it this way. The final string stays the same, except for the word you treated. In this case "not" becomes "'not'" in the final string.
StringBuilder sb = new StringBuilder();
String[] splited = value.split("\\s+");
if(ArrayUtils.isNotEmpty(splited)) {
for(String valor : splited) {
sb.append(" ");
if("not".equals(valor.toLowerCase())) {
sb.append("'").append(valor).append("'");
} else {
sb.append(valor);
}
}
}
return sb.toString();
