Censoring bad words in a string - java

I am trying to create a function to replace all the bad words in a string with an asterisk in the middle, and here is what I came up with.
public class Censor {
public static String AsteriskCensor(String text){
String[] word_list = text.split("\\s+");
String result = "";
ArrayList<String> BadWords = new ArrayList<String>();
BadWords.add("word1");
BadWords.add("word2");
BadWords.add("word3");
BadWords.add("word4");
BadWords.add("word5");
BadWords.add("word6");
ArrayList<String> WordFix = new ArrayList<String>();
WordFix.add("w*rd1");
WordFix.add("w*rd2");
WordFix.add("w*rd3");
WordFix.add("w*rd4");
WordFix.add("w*rd5");
WordFix.add("w*rd6");
int index = 0;
for (String i : word_list)
{
if (BadWords.contains(i)){
word_list[index] = WordFix.get(BadWords.indexOf(i));
index++;
}
}
for (String i : word_list)
result += i + ' ';
return result;
}
My idea was to break the text into single words and replace each bad word as I encounter it, but so far it is not doing anything. Can someone tell me where I went wrong? I am quite new to the language.

If you move the index++ out of the if statement, so that it runs on every iteration, your code works fine.
However, it won't work properly if a punctuation mark immediately follows a word to be censored. For example, in the sentence "We have word1 to word6, and they are censored", only "word1" will be censored, because the comma stays attached to "word6" after splitting on whitespace.
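For reference, here is a minimal sketch of the corrected loop, assuming an indexed for loop in place of the separate counter. The class name, the shortened word lists, and the use of String.join are illustrative choices, not the asker's exact code. It also reproduces the punctuation caveat described above:

```java
import java.util.Arrays;
import java.util.List;

public class IndexFixDemo {
    // Hypothetical, shortened word lists; the question uses six entries each.
    private static final List<String> BAD_WORDS = Arrays.asList("word1", "word2");
    private static final List<String> WORD_FIX = Arrays.asList("w*rd1", "w*rd2");

    public static String asteriskCensor(String text) {
        String[] words = text.split("\\s+");
        // The index advances on every iteration, not only when a bad word is found.
        for (int index = 0; index < words.length; index++) {
            int pos = BAD_WORDS.indexOf(words[index]);
            if (pos >= 0) {
                words[index] = WORD_FIX.get(pos);
            }
        }
        // Unlike the original loop, String.join leaves no trailing space.
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // "word2," keeps its comma after the split, so it is not recognized.
        System.out.println(asteriskCensor("We have word1 to word2, and more"));
        // prints: We have w*rd1 to word2, and more
    }
}
```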
I personally would approach this differently. Instead of maintaining two lists, you could also create a Map which maps the bad words to their censored counterparts:
static String censor(String text) {
Map<String, String> filters = Map.of(
"hello", "h*llo",
"world", "w*rld",
"apple", "*****"
);
for (var filter : filters.entrySet()) {
text = text.replace(filter.getKey(), filter.getValue());
}
return text;
}
Of course, this code is still a little naive, because it will also filter words like 'applet', since the word 'applet' contains 'apple'. That's probably not what you want.
Instead, we need to tweak the code a little, so that the found words must be whole words, that is, not part of another word. You can fix this by replacing the body of the for loop with this:
String pattern = "\\b" + Pattern.quote(filter.getKey()) + "\\b";
text = text.replaceAll(pattern, filter.getValue());
It replaces text using a regular expression. The \b is a word-boundary anchor: it matches only at a position where a word starts or ends. This way, words like 'dapple' and 'applet' are no longer matched.
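Put together, the whole-word version might look like the sketch below. The class name and the static FILTERS map are assumptions made to keep the example self-contained, and Matcher.quoteReplacement is an extra precaution (not in the snippet above) in case a replacement ever contains $ or a backslash:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WholeWordCensor {
    // Same sample filters as above, kept in insertion order.
    private static final Map<String, String> FILTERS = new LinkedHashMap<>();
    static {
        FILTERS.put("hello", "h*llo");
        FILTERS.put("world", "w*rld");
        FILTERS.put("apple", "*****");
    }

    public static String censor(String text) {
        for (Map.Entry<String, String> filter : FILTERS.entrySet()) {
            // \b on both sides: the key must be a whole word, not part of another word.
            String pattern = "\\b" + Pattern.quote(filter.getKey()) + "\\b";
            // quoteReplacement treats the replacement as literal text.
            text = text.replaceAll(pattern, Matcher.quoteReplacement(filter.getValue()));
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(censor("hello, applet and apple!"));
        // prints: h*llo, applet and *****!
    }
}
```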

Related

How I can use regex to implement contains functionality?

P.S.: If anything I describe below is unclear, please ask me.
I have a dictionary with a list of words.
And I have a String consisting of one word with multiple characters.
E.g. the dictionary:
String[] = {"Manager","age","range", "east".....} // list of words in dictionary
Now I have one string, tageranm.
I have to find all the words in the dictionary which can be made using the letters of this string. I have been able to find a solution by generating all strings using permutations and verifying whether each string is present in the dictionary.
But I have another solution in mind and don't know how to do it in Java using regex.
Algorithm:
1. Sort tageranm:
char[] c = "tageranm".toCharArray();
Arrays.sort(c);
String letters = String.valueOf(c); // letters = "aaegmnrt"
2. Sort all words in the dictionary:
Example: "range" => "aegnr" // after sorting
Now "aaegmnrt".contains("aegnr") returns false, because 'm' comes in between.
Is there a way to use regex to ignore the character m and get all the words in the dictionary using the above approach?
Thanks in advance.
Here is a possible solution, using the regex type suggested by @MattTimmermans in the comments. It's not very fast though, so there are probably loads of ways to improve it. I'm also pretty sure there are libraries for this kind of search, which will (hopefully) use performance-optimized algorithms.
java.util.List<String> test(String[] words, String input){
java.util.List<String> result = new java.util.ArrayList<>();
// Sort the characters in the input-String:
byte[] inputArray = input.getBytes();
java.util.Arrays.sort(inputArray);
String sortedInput = new String(inputArray);
for(String word : words){
// Sort the characters of the word:
byte[] wordArray = word.getBytes();
java.util.Arrays.sort(wordArray);
String sortedWord = new String(wordArray);
// Create a regex to match from this word:
String wordRegex = ".*" + sortedWord.replaceAll(".", "$0.*");
// If the input matches this regex:
if(sortedInput.matches(wordRegex))
// Add the word to the result-List:
result.add(word);
}
return result;
}
For your inputs {"Manager","age","range", "east"} and "tageranm" it will return ["age", "range"].
EDIT: It doesn't match Manager because the M is in uppercase. If you want case-insensitive matching, the easiest way is to convert both the input and the words to the same case before checking:
input.getBytes() becomes input.toLowerCase().getBytes()
word.getBytes() becomes word.toLowerCase().getBytes()
This now results in ["Manager", "age", "range"].
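As one possible improvement, the same check can be done without regex by comparing letter counts. This is a different technique from the sorted-string regex above, sketched under the assumption that only the letters a to z matter (anything else is ignored); the class and method names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class LetterSubset {
    // Counts occurrences of 'a'..'z' after lower-casing.
    // Assumption: any other character is simply ignored.
    static int[] counts(String s) {
        int[] c = new int[26];
        for (char ch : s.toLowerCase(Locale.ROOT).toCharArray()) {
            if (ch >= 'a' && ch <= 'z') {
                c[ch - 'a']++;
            }
        }
        return c;
    }

    public static List<String> findWords(String[] words, String input) {
        int[] available = counts(input);
        List<String> result = new ArrayList<>();
        for (String word : words) {
            int[] needed = counts(word);
            boolean fits = true;
            // The word fits if it needs no letter more often than the input offers it.
            for (int i = 0; i < 26 && fits; i++) {
                fits = needed[i] <= available[i];
            }
            if (fits) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] words = {"Manager", "age", "range", "east"};
        System.out.println(findWords(words, "tageranm"));
        // prints: [Manager, age, range]
    }
}
```

Each word is checked in time proportional to its length, with no regex compilation per word.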

Extracting Twitter username from a given text (JAVA, Regex)

I believe the code is OK, the problem is the regex.
Basically I want to find a username mention (it starts with #), and then I want to extract the allowed username part from the given word.
For example if the text contains "#FOO!!" I want to extract only "foo", but I believe the problem is with my "split("[a-z0-9-_]+")[0]" part.
Btw, allowed symbols are numbers, letters, - and _
public static Set<String> getMentionedUsers(List<Tweet> tweets) {
Set<String> mentioned = new HashSet<>();
for (Tweet tweet : tweets) {
String tweetToAnal = null;
if (tweet.getText().contains("#")) tweetToAnal = tweet.getText();
if (tweetToAnal == null) continue;
String[] splited = tweetToAnal.split("\\s+");
for (String elem : splited) {
String newElem = "";
if (elem.startsWith("#")) {
newElem = elem.substring(1).toLowerCase().split("[a-z0-9-_]+")[0];
}
if (newElem.length() > 0) mentioned.add(newElem);
}
}
return mentioned;
}
The problem is not in your regex but in your logic.
You are using the lines below to analyze usernames:
if (elem.startsWith("#")) {
newElem = elem.substring(1).toLowerCase().split("[a-z0-9-_]+")[0];
}
If you debug your code step by step, you will notice that you first drop the # (with substring(1)) and then split using a regex that matches the allowed username characters. split treats whatever the regex matches as a delimiter and discards it, so the username itself is consumed. You don't want to consume those characters; you want to capture them.
So, you can keep using split if you negate the character class:
split("[^a-z0-9-_]+")
The ^ right after the opening [ negates the character class.
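To see the difference between the two patterns, here is a small sketch. The helper name usernameOf is hypothetical, and the length check is an added guard, since split drops trailing empty strings and can return an empty array:

```java
import java.util.Arrays;

public class SplitDemo {
    // Hypothetical helper: extracts the leading run of allowed characters from one token.
    public static String usernameOf(String elem) {
        String[] parts = elem.substring(1).toLowerCase().split("[^a-z0-9-_]+");
        // A token such as "#!!" yields an empty array, so guard the [0] access.
        return parts.length > 0 ? parts[0] : "";
    }

    public static void main(String[] args) {
        // Original pattern: the username is the delimiter, so it is thrown away.
        System.out.println(Arrays.toString("foo!!".split("[a-z0-9-_]+")));
        // prints: [, !!]

        // Negated pattern: the punctuation is the delimiter, the username survives.
        System.out.println(Arrays.toString("foo!!".split("[^a-z0-9-_]+")));
        // prints: [foo]

        System.out.println(usernameOf("#FOO!!"));
        // prints: foo
    }
}
```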
On the other hand, instead of splitting the whole text into multiple tokens to be analyzed further, you can use a regex with a capturing group and then grab the username you want. So, instead of having this code:
String[] splited = tweetToAnal.split("\\s+");
for (String elem : splited) {
String newElem = "";
if (elem.startsWith("#")) {
newElem = elem.substring(1).toLowerCase().split("[a-z0-9-_]+")[0];
}
if (newElem.length() > 0) mentioned.add(newElem);
You can use much simpler code like this:
Matcher m = Pattern.compile("(?<=#)([\\w-]+)").matcher(tweetToAnal); // Analyze text with a regex that will capture usernames preceded by #
while (m.find()) { // Stores all username (without #)
mentioned.add(m.group(1));
}
Btw, I didn't test the code, so I may have a typo but you can understand the idea. Anyway the code is pretty simple to understand.
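A runnable variant of that idea, as a sketch only: it takes a plain String instead of the Tweet class (which is not shown in the question), and it lower-cases each match because the question asks for "foo" from "#FOO!!". The class and method names are assumptions:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MentionExtractor {
    // Lookbehind for '#', then one or more word characters or hyphens.
    private static final Pattern MENTION = Pattern.compile("(?<=#)([\\w-]+)");

    public static Set<String> mentionsIn(String text) {
        Set<String> mentioned = new LinkedHashSet<>();
        Matcher m = MENTION.matcher(text);
        while (m.find()) {
            // group(1) is the username without the leading '#'.
            mentioned.add(m.group(1).toLowerCase());
        }
        return mentioned;
    }

    public static void main(String[] args) {
        System.out.println(mentionsIn("Thanks #FOO!! and #bar_baz-1, bye"));
        // prints: [foo, bar_baz-1]
    }
}
```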
I'm not a Java-Person, but you can easily match twitter-usernames without the "#" using the following regex:
(?<=#)[\w-]+
which can be seen here. Of course you would need to escape special characters properly, but since I have no clue of Java, you would have to do that and the actual matching by yourself.

Java - Changing multiple words in a string at once?

I'm trying to create a program that can abbreviate certain words in a string given by the user.
This is how I've laid it out so far:
Create a hashmap from a .txt file such as the following:
thanks,thx
your,yr
probably,prob
people,ppl
Take a string from the user
Split the string into words
Check the hashmap to see if that word exists as a key
Use hashmap.get() to return the key value
Replace the word with the key value returned
Return an updated string
It all works perfectly fine until I try to update the string:
public String shortenMessage( String inMessage ) {
String updatedstring = "";
String rawstring = inMessage;
String[] words = rawstring.replaceAll("[^a-zA-Z ]", "").toLowerCase().split("\\s+");
for (String word : words) {
System.out.println(word);
if (map.containsKey(word) == true) {
String x = map.get(word);
updatedstring = rawstring.replace(word, x);
}
}
System.out.println(updatedstring);
return updatedstring;
}
Input:
thanks, your, probably, people
Output:
thanks, your, probably, ppl
Does anyone know how I can update all the words in the string?
Thanks in advance
updatedstring = rawstring.replace(word, x);
This keeps overwriting updatedstring with rawstring plus a single replacement, so only the last replacement survives.
You need to do something like
updatedstring = rawstring;
...
updatedstring = updatedstring.replace(word, x);
Edit:
That is the solution to the problem you are seeing but there are a few other problems with your code:
Your replacement won't work for words that had to be lowercased or stripped of characters. You build the words array from an altered version of rawstring, then go back and try to replace those altered words in the original rawstring, where they may not exist. This will not find the words you think you are replacing.
If you are doing global replacements, you could just create a set of words instead of an array since once the word is replaced, it shouldn't come up again.
You might want to replace the words one at a time, because your global replacement could cause weird bugs where a word in the replacement map is a substring of another word. Instead of using String.replace, make an array/list of words, iterate over the words, replace each element in the list if needed, and join them. In Java 8:
String.join(" ", elements);
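One way to replace words one at a time while keeping punctuation and spacing intact is sketched below. Note that it uses Matcher.appendReplacement rather than the split-and-join described above, and that the hard-coded map stands in for the .txt file from the question; both are assumptions for the sake of a self-contained example:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Abbreviator {
    // Hypothetical in-memory map; the question loads these pairs from a .txt file.
    private static final Map<String, String> MAP = new HashMap<>();
    static {
        MAP.put("thanks", "thx");
        MAP.put("your", "yr");
        MAP.put("probably", "prob");
        MAP.put("people", "ppl");
    }

    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    public static String shortenMessage(String inMessage) {
        Matcher m = WORD.matcher(inMessage);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Look up the lower-cased word; keep the original word if there is no entry.
            String replacement = MAP.get(m.group().toLowerCase(Locale.ROOT));
            String out = replacement != null ? replacement : m.group();
            m.appendReplacement(sb, Matcher.quoteReplacement(out));
        }
        // Copy whatever follows the last word (punctuation, spaces).
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(shortenMessage("thanks, your, probably, people"));
        // prints: thx, yr, prob, ppl
    }
}
```

Because each lookup uses a complete word, "yourself" is left alone even though it contains "your".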

Replace word with special characters from string in Java

I am writing a method which should replace all words that match ones from the list with '****' characters. So far I have code which works, but all special characters are ignored.
I have tried with "\\W" in my expression, but it looks like I didn't use it correctly, so I could use some help.
Here's code I have so far:
for(int i = 0; i < badWords.size(); i++) {
if (StringUtils.containsIgnoreCase(stringToCheck, badWords.get(i))) {
stringToCheck = stringToCheck.replaceAll("(?i)\\b" + badWords.get(i) + "\\b", "****");
}
}
E.g. I have list of words ['bad', '#$$'].
If I have a string: "This is bad string with #$$" I am expecting this method to return "This is **** string with ****"
Note that the method should be case-insensitive, e.g. TesT and test should be handled the same.
I'm not sure why you use StringUtils; you can just directly replace words that match the bad words. This code works for me:
public static void main(String[] args) {
ArrayList<String> badWords = new ArrayList<String>();
badWords.add("test");
badWords.add("BadTest");
badWords.add("\\$\\$");
String test = "This is a TeSt and a $$ with Badtest.";
for(int i = 0; i < badWords.size(); i++) {
test = test.replaceAll("(?i)" + badWords.get(i), "****");
}
test = test.replaceAll("\\w*\\*{4}", "****");
System.out.println(test);
}
Output:
This is a **** and a **** with ****.
The problem is that these special characters e.g. $ are regex control characters and not literal characters. You'll need to escape any occurrence of the following characters in the bad word using two backslashes:
{}()\[].+*?^$|
My guess is that your list of bad words contains special characters that have particular meanings when interpreted in a regular expression (which is what the replaceAll method does). $, for example, typically matches the end of the string/line. So I'd recommend a combination of things:
Don't use containsIgnoreCase to identify whether a replacement needs to be done. Just let the replaceAll run each time - if there is no match against the bad word list, nothing will be done to the string.
Characters like $ that have special meanings in regular expressions should be escaped when they are added to the bad word list. For example, badWords.add("#\\$\\$");
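Instead of escaping each bad word by hand, Pattern.quote can do it when the regex is built. The sketch below also swaps \b for the lookarounds (?<!\w) and (?!\w), which is my own assumption about the intended behavior: \b never matches between a space and #, so a word like #$$ could not be matched with \b on both sides. The class and method names are illustrative:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class SpecialCharCensor {
    public static String censor(String text, List<String> badWords) {
        for (String bad : badWords) {
            // (?i)      case-insensitive matching
            // (?<!\w)   no word character directly before the match
            // (?!\w)    no word character directly after the match
            // Pattern.quote makes characters such as $ literal.
            String regex = "(?i)(?<!\\w)" + Pattern.quote(bad) + "(?!\\w)";
            text = text.replaceAll(regex, "****");
        }
        return text;
    }

    public static void main(String[] args) {
        List<String> badWords = Arrays.asList("bad", "#$$");
        System.out.println(censor("This is bad string with #$$", badWords));
        // prints: This is **** string with ****
    }
}
```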
Try something like this (note that the character class matches any run of the listed characters, not only the exact word):
String stringToCheck = "This is b!d string with #$$";
List<String> badWords = asList("b!d","#$$");
for(int i = 0; i < badWords.size(); i++) {
if (StringUtils.containsIgnoreCase(stringToCheck,badWords.get(i))) {
stringToCheck = stringToCheck.replaceAll("["+badWords.get(i)+"]+","****");
}
}
System.out.println(stringToCheck);
Another solution: bad words matched with word boundaries (and case insensitive).
public static void main(String[] args) {
Pattern badWords = Pattern.compile("\\b(a|b|ĉĉĉ|dddd)\\b",
Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE);
String text = "adfsa a dfs bb addfdsaf ĉĉĉ adsfs dddd asdfaf a";
Matcher m = badWords.matcher(text);
StringBuffer sb = new StringBuffer(text.length());
while (m.find()) {
m.appendReplacement(sb, stars(m.group(1)));
}
m.appendTail(sb);
String cleanText = sb.toString();
System.out.println(text);
System.out.println(cleanText);
}
private static String stars(String s) {
return s.replaceAll("(?su).", "*");
/*
int cpLength = s.codePointCount(0, s.length());
final String stars = "******************************";
return cpLength >= stars.length() ? stars : stars.substring(0, cpLength);
*/
}
The commented-out variant in stars computes the star count from code points: a Unicode code point encoded as a surrogate pair (two UTF-16 chars) still yields a single star.
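A small sketch of the code-point counting idea, assuming a simple loop instead of the fixed 30-star string from the comment (so there is no upper limit here); the class name is made up:

```java
public class CodePointStars {
    public static String stars(String s) {
        // Counts code points, not UTF-16 chars, so a surrogate pair counts once.
        int cpLength = s.codePointCount(0, s.length());
        StringBuilder sb = new StringBuilder(cpLength);
        for (int i = 0; i < cpLength; i++) {
            sb.append('*');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, which needs a surrogate pair in UTF-16.
        String smiley = "a\uD83D\uDE00";
        System.out.println(smiley.length());
        // prints: 3
        System.out.println(stars(smiley));
        // prints: **
    }
}
```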

java create variable from regex findings

I'm pretty new to Java, but I am looking to create a String variable from a regex finding. But I am not too sure how.
Basically I need: previous_identifier = (all the text in the next line up to the third comma);
Something maybe like this?
previous_identifier = line.split("^(.+?),(.+?),(.+?),");
Or:
line = reader.readLine();
Pattern courseColumnPattern = Pattern.compile("^(.+?),(.+?),(.+?),");
previous_identifier = (courseColumnPattern.matcher(line).find());
But I know that won't work. What should I do differently?
You can use split to return an array of Strings, then use a StringBuilder to build your return string. An advantage of this approach is being able to easily return the first four strings, two strings, ten strings, etc.
int limit = 3, current = 0;
StringBuilder sb = new StringBuilder();
// Used as an example of input
String str = "test,west,best,zest,jest";
String[] strings = str.split(",");
for(String s : strings) {
if(++current > limit) {
// We've reached the limit; bail
break;
}
if(current > 1) {
// Add a comma if it's not the first element. Alternative is to
// append a comma each time after appending s and remove the last
// character
sb.append(",");
}
sb.append(s);
}
System.out.println(sb.toString()); // Prints "test,west,best"
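The same result can be had more compactly with the limit argument of split plus String.join. This is just a sketch: the class and helper names are hypothetical, and it assumes limit is at least 1:

```java
import java.util.Arrays;

public class FirstFields {
    // Returns the first `limit` comma-separated fields of `line`, joined by commas.
    // Assumption: limit >= 1.
    public static String firstFields(String line, int limit) {
        // Split into at most limit + 1 pieces; the last piece holds the unsplit remainder.
        String[] parts = line.split(",", limit + 1);
        int n = Math.min(limit, parts.length);
        return String.join(",", Arrays.copyOfRange(parts, 0, n));
    }

    public static void main(String[] args) {
        System.out.println(firstFields("test,west,best,zest,jest", 3));
        // prints: test,west,best
    }
}
```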
If you don't need to use the three elements separately (you truly want just the first three elements in a chunk), you can use a Matcher with the following regular expression:
String str = "test, west, best, zest, jest";
// Matches against "non-commas", then a comma, then "non-commas", then
// a comma, then "non-commas". This way, you still don't have a trailing
// comma at the end.
Matcher match = Pattern.compile("^([^,]*,[^,]*,[^,]*)").matcher(str);
if(match.find())
{
// Print out the output!
System.out.println(match.group(1));
}
else
{
// We didn't have a match. Handle it here.
}
Your regex will work, but could be expressed more briefly. This is how you can "extract" it:
String head = str.replaceAll("((.+?,){3}).*", "$1");
This matches the whole string, while capturing the target, with the replacement being the captured input using a back reference to group 1.
Despite the downvote, here's proof the code works!
String str = "foo,bar,baz,other,stuff";
String head = str.replaceAll("((.+?,){3}).*", "$1");
System.out.println(head);
Output:
foo,bar,baz,
Try an online regex tester to work out the regex. I think you need fewer brackets to get the entire text; I'd guess something like:
([^,]+,[^,]+,[^,]+)
This says: find everything except a comma, then a comma, then everything but a comma, then a comma, then everything else that isn't a comma. I suspect this can be improved dramatically; I am not a regex expert.
Then your Java code just needs to compile it and match it against your string:
line = reader.readLine();
Pattern courseColumnPattern = Pattern.compile("([^,]+,[^,]+,[^,]+)");
Matcher matcher = courseColumnPattern.matcher(line);
if (matcher.find()) {
previous_identifier = matcher.group(1);
}
