How I can use regex to implement contains functionality? - java

P.S.: If anything I describe below is unclear, please ask.
I have a dictionary containing a list of words,
and I have a string made up of several characters.
E.g., the dictionary:
String[] = {"Manager","age","range", "east".....} // list of words in dictionary
Now I have one string, tageranm.
I have to find all the words in the dictionary that can be made from the characters of this string. I already have a working solution that generates every permutation of the string and checks whether each one is present in the dictionary.
But I have another idea, and I don't know how to do it in Java using regex.
Algorithm:
// 1. Sort `tageranm`.
char[] c = "tageranm".toCharArray();
Arrays.sort(c);
String letters = String.valueOf(c); // letters = "aaegmnrt"
// 2. Sort the characters of every word in the dictionary.
// Example: "range" => "aegnr" after sorting
Now "aaegmnrt".contains("aegnr") returns false, because the 'm' comes in between.
Is there a way to use regex to ignore the character m and get all the matching words in the dictionary with the above approach?
Thanks in advance.

Here is a possible solution, using the type of regex suggested by @MattTimmermans in the comments. It's not very fast, though, so there are probably plenty of ways to improve it. I'm also fairly sure there are libraries for this kind of search, which will (hopefully) use better-performing algorithms.
java.util.List<String> test(String[] words, String input){
    java.util.List<String> result = new java.util.ArrayList<>();
    // Sort the characters in the input-String:
    byte[] inputArray = input.getBytes();
    java.util.Arrays.sort(inputArray);
    String sortedInput = new String(inputArray);
    for(String word : words){
        // Sort the characters of the word:
        byte[] wordArray = word.getBytes();
        java.util.Arrays.sort(wordArray);
        String sortedWord = new String(wordArray);
        // Create a regex to match from this word:
        String wordRegex = ".*" + sortedWord.replaceAll(".", "$0.*");
        // If the input matches this regex:
        if(sortedInput.matches(wordRegex))
            // Add the word to the result-List:
            result.add(word);
    }
    return result;
}
For your inputs {"Manager","age","range", "east"} and "tageranm" it will return ["age", "range"].
EDIT: It doesn't match Manager because the M is uppercase. If you want case-insensitive matching, the easiest way is to convert both the input and the words to the same case before checking:
input.getBytes() becomes input.toLowerCase().getBytes()
word.getBytes() becomes word.toLowerCase().getBytes()
With this change, the result is ["Manager", "age", "range"].
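The answer above notes that the regex approach is not very fast. One non-regex alternative is to compare letter counts instead of matching a pattern. The following is only a sketch of that swapped-in technique, not the answerer's method: the class and method names are made up for illustration, and it assumes the words contain only the ASCII letters a to z (any word with other characters is skipped).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class LetterCountFilter {
    // Counts occurrences of 'a'..'z' in s after lower-casing.
    // Returns null if s contains any other character (assumption: ASCII letters only).
    static int[] countLetters(String s) {
        int[] counts = new int[26];
        for (char c : s.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c < 'a' || c > 'z') {
                return null;
            }
            counts[c - 'a']++;
        }
        return counts;
    }

    // Returns the dictionary words that can be spelled with the letters of input.
    static List<String> wordsFrom(String[] words, String input) {
        List<String> result = new ArrayList<>();
        int[] available = countLetters(input);
        if (available == null) {
            return result;
        }
        for (String word : words) {
            int[] needed = countLetters(word);
            if (needed == null) {
                continue;
            }
            boolean fits = true;
            for (int i = 0; i < 26; i++) {
                // The word needs more copies of a letter than the input offers.
                if (needed[i] > available[i]) {
                    fits = false;
                    break;
                }
            }
            if (fits) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] dictionary = {"Manager", "age", "range", "east"};
        System.out.println(wordsFrom(dictionary, "tageranm")); // prints [Manager, age, range]
    }
}
```

Each word is checked in time proportional to its length, and no sorting or pattern compilation is needed per word.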

Related

Censoring bad words in a string

I am trying to create a function that replaces every bad word in a string with a version that has an asterisk in the middle, and here is what I came up with.
public class Censor {
    public static String AsteriskCensor(String text){
        String[] word_list = text.split("\\s+");
        String result = "";
        ArrayList<String> BadWords = new ArrayList<String>();
        BadWords.add("word1");
        BadWords.add("word2");
        BadWords.add("word3");
        BadWords.add("word4");
        BadWords.add("word5");
        BadWords.add("word6");
        ArrayList<String> WordFix = new ArrayList<String>();
        WordFix.add("w*rd1");
        WordFix.add("w*rd2");
        WordFix.add("w*rd3");
        WordFix.add("w*rd4");
        WordFix.add("w*rd5");
        WordFix.add("w*rd6");
        int index = 0;
        for (String i : word_list)
        {
            if (BadWords.contains(i)){
                word_list[index] = WordFix.get(BadWords.indexOf(i));
                index++;
            }
        }
        for (String i : word_list)
            result += i + ' ';
        return result;
    }
My idea was to break the text down into single words and then replace each word that is a bad word, but so far it is not doing anything. Can someone tell me where I went wrong? I am quite new to the language.
If you move the index++ out of the if statement, your code works fine.
However, it won't work properly if a punctuation mark immediately follows a word to be censored. For example, in the sentence "We have word1 to word6, and they are censored", only "word1" will be censored, because of the comma immediately following "word6".
I personally would approach this differently. Instead of maintaining two lists, you could create a Map that maps the bad words to their censored counterparts:
static String censor(String text) {
    Map<String, String> filters = Map.of(
        "hello", "h*llo",
        "world", "w*rld",
        "apple", "*****"
    );
    for (var filter : filters.entrySet()) {
        text = text.replace(filter.getKey(), filter.getValue());
    }
    return text;
}
Of course, this code is still a little naive, because it also filters words like 'applet': the word 'applet' contains 'apple'. That's probably not what you want.
Instead, we need to tweak the code a little so that a match must be a whole word, that is, not part of another word. You can do this by replacing the body of the for loop with this:
String pattern = "\\b" + Pattern.quote(filter.getKey()) + "\\b";
text = text.replaceAll(pattern, filter.getValue());
It replaces text using a regular expression. The \b is a word-boundary anchor: it matches the position at the start or end of a word without consuming any characters. This way, words like 'dapple' and 'applet' are no longer matched.
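Putting the pieces above together, a complete version might look like the sketch below. The class name and the sample sentence are made up for illustration, and the Matcher.quoteReplacement call is an addition that is not in the original answer: it guards against replacement strings that contain `$` or `\`.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordCensor {
    // Replaces each key of filters with its value, but only where the key
    // appears as a whole word (delimited by \b word boundaries).
    static String censor(String text, Map<String, String> filters) {
        for (Map.Entry<String, String> filter : filters.entrySet()) {
            String pattern = "\\b" + Pattern.quote(filter.getKey()) + "\\b";
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            text = text.replaceAll(pattern, Matcher.quoteReplacement(filter.getValue()));
        }
        return text;
    }

    public static void main(String[] args) {
        Map<String, String> filters = Map.of(
            "hello", "h*llo",
            "world", "w*rld",
            "apple", "*****"
        );
        // prints: An *****, an applet and a dapple; h*llo w*rld!
        System.out.println(censor("An apple, an applet and a dapple; hello world!", filters));
    }
}
```

Because \b also matches between a letter and punctuation, the comma after "apple" no longer prevents the match.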

Java Regex expression to match and store any integers

Right now, using Java, I just want to tokenize the integers found in any string into an array:
input = 1dsa23f hj23nma9123
array = 1,23,23,9123;
I have tried a few different approaches, such as string.matches("") followed by tokenising once the input is in the right format, but that is too limiting for the user.
It looks like you are looking for something like
String[] nums = text.split("\\D+");
The regex \D is the negation of \d (like [^\d]), so \D+ matches one or more non-digits.
The only problem with this solution is that if your text starts with non-digits, the resulting array will start with an empty string.
If you still want to use split, you can simply remove the leading non-digits from your text first:
String[] nums = text.replaceFirst("^\\D+","").split("\\D+");
An alternative to split, which focuses on finding delimiters, is to focus on finding the parts we are interested in. So instead of searching for non-digits, let's find the digits.
We can do that in a few ways, such as Pattern/Matcher#find or Scanner. The catch is that these approaches don't return an array but single elements, which you need to store in a resizable structure such as a List.
A solution using Pattern and Matcher could look like this:
List<String> numbers = new ArrayList<>();
Matcher m = Pattern.compile("\\d+").matcher(yourText);
while(m.find()){
    numbers.add(m.group());
}
The solution using Scanner is similar: we just need to set a proper delimiter (non-digits) and read everything that is not a delimiter. Delimiters at the start of the text are ignored, which should prevent empty strings from being returned.
List<String> nums = new ArrayList<>();
Scanner sc = new Scanner(yourText);
sc.useDelimiter("\\D+");
while(sc.hasNext()){
    nums.add(sc.next());
}
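final String input = "1dsa23f hj23nma9123";
final String[] parts = input.split("[^0-9]+");
for (final String s : parts) {
    // split() yields a leading empty string when the input starts with a non-digit
    if (s.isEmpty()) continue;
    final int i = Integer.parseInt(s);
}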
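If an int[] is wanted directly, the Pattern/Matcher idea above can also be written with Matcher.results(), which is available from Java 9. This is only a sketch: the class and method names are made up, and it assumes every run of digits fits into an int (Integer.parseInt throws a NumberFormatException otherwise).

```java
import java.util.Arrays;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

class DigitRuns {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Collects every run of digits in text and parses it as an int.
    static int[] extractInts(String text) {
        return DIGITS.matcher(text)
                .results()                    // Stream<MatchResult>, Java 9+
                .map(MatchResult::group)      // the matched digit run
                .mapToInt(Integer::parseInt)
                .toArray();
    }

    public static void main(String[] args) {
        // prints [1, 23, 23, 9123]
        System.out.println(Arrays.toString(extractInts("1dsa23f hj23nma9123")));
    }
}
```

Since this searches for digits rather than splitting on non-digits, leading non-digits never produce an empty element.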

Java - Changing multiple words in a string at once?

I'm trying to create a program that can abbreviate certain words in a string given by the user.
This is how I've laid it out so far:
Create a hashmap from a .txt file such as the following:
thanks,thx
your,yr
probably,prob
people,ppl
Take a string from the user
Split the string into words
Check the hashmap to see if that word exists as a key
Use hashmap.get() to return the key value
Replace the word with the key value returned
Return an updated string
It all works perfectly fine until I try to update the string:
public String shortenMessage( String inMessage ) {
    String updatedstring = "";
    String rawstring = inMessage;
    String[] words = rawstring.replaceAll("[^a-zA-Z ]", "").toLowerCase().split("\\s+");
    for (String word : words) {
        System.out.println(word);
        if (map.containsKey(word) == true) {
            String x = map.get(word);
            updatedstring = rawstring.replace(word, x);
        }
    }
    System.out.println(updatedstring);
    return updatedstring;
}
Input:
thanks, your, probably, people
Output:
thanks, your, probably, ppl
Does anyone know how I can update all the words in the string?
Thanks in advance
updatedstring = rawstring.replace(word, x);
This keeps overwriting updatedstring with rawstring plus a single replacement, so only the last replacement survives.
You need to do something like
updatedstring = rawstring;
...
updatedstring = updatedstring.replace(word, x);
Edit:
That fixes the problem you are seeing, but there are a few other problems with your code:
Your replacement won't work for words that had to be lowercased or have characters removed. You build the words array from an altered version of rawstring, then go back and try to replace those altered words in the original rawstring, where they may not exist. This will not find the words you think you are replacing.
If you are doing global replacements, you could use a set of words instead of an array, since once a word has been replaced it shouldn't come up again.
You might want to replace the words one at a time, because a global replacement can cause odd bugs when a word in the replacement map is a substring of another word. Instead of using String.replace, make an array or list of words, iterate over it, replace each element if needed, and join the elements. In Java 8:
String.join(" ", elements);
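A minimal sketch of that word-by-word approach follows. The class and method names are made up, the abbreviation map is hard-coded here rather than loaded from the .txt file, and tokens are assumed to be separated by whitespace. Punctuation attached to a word (such as the comma in "thanks,") is kept by replacing only the letters of the token.

```java
import java.util.Locale;
import java.util.Map;

class Abbreviator {
    // Replaces each whitespace-separated token whose letters form a key
    // of abbreviations, then joins the tokens with single spaces.
    static String shorten(String message, Map<String, String> abbreviations) {
        String[] tokens = message.split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            // Look up only the letters of the token, lower-cased.
            String letters = tokens[i].replaceAll("[^a-zA-Z]", "");
            String shortForm = abbreviations.get(letters.toLowerCase(Locale.ROOT));
            if (shortForm != null) {
                // Swap the letters for the abbreviation, keeping attached punctuation.
                tokens[i] = tokens[i].replace(letters, shortForm);
            }
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        Map<String, String> abbreviations = Map.of(
            "thanks", "thx",
            "your", "yr",
            "probably", "prob",
            "people", "ppl"
        );
        // prints: thx, yr, prob, ppl
        System.out.println(shorten("thanks, your, probably, people", abbreviations));
    }
}
```

Because each token is handled on its own, a key that happens to be a substring of another word is never replaced by accident. Note that runs of whitespace collapse to a single space when the tokens are joined.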

Create String[] containing only certain characters

I am trying to create a String[] that contains only words consisting of certain characters. For example, I have a dictionary containing a number of words like so:
arm
army
art
as
at
attack
attempt
attention
attraction
authority
automatic
awake
baby
back
bad
bag
balance
I want to narrow the list down so that it only contains words with the characters a, b and g. Therefore the list should only contain the word 'bag' in this example.
Currently I am trying to do this using regexes but having never used them before I can't seem to get it to work.
Here is my code:
public class LetterJugglingMain {
    public static void main(String[] args) {
        String dictFile = "/Users/simonrhillary/Desktop/Dictionary(3).txt";
        fileReader fr = new fileReader();
        fr.openFile(dictFile);
        String[] dictionary = fr.fileToArray();
        String regx = "able";
        String[] newDict = createListOfValidWords(dictionary, regx);
        printArray(newDict);
    }
    public static String[] createListOfValidWords(String[] d, String regex){
        List<String> narrowed = new ArrayList<String>();
        for(int i = 0; i<d.length; i++){
            if(d[i].matches(regex)){
                narrowed.add(d[i]);
                System.out.println("added " + d[i]);
            }
        }
        String[] narrowArray = narrowed.toArray(new String[0]);
        return narrowArray;
    }
However, the array returned is always empty unless the String regex is the exact word! Any ideas? I can post more code if needed... I think I must be initialising the regex wrong.
The narrowed down list must contain ONLY the characters from the regex.
Frankly, I'm not an expert in regexes, but I don't think it's the best tool to do what you want. I would use a method like the following:
public boolean containsAll(String s, Set<Character> chars) {
    Set<Character> copy = new HashSet<Character>();
    for (int i = 0; i < s.length() && copy.size() < chars.size(); i++) {
        char c = s.charAt(i);
        if (chars.contains(c)) {
            copy.add(c);
        }
    }
    return copy.size() == chars.size();
}
The regex able matches only the string "able". If you want a regular expression that matches any one of the characters a, b, l or e, the regex you're looking for is [able] (in brackets). If you want words made up of several such characters, add a + to repeat the pattern: [able]+.
The OP wants words that contain every character. Not just one of them.
And other characters are not a problem.
If this is the case, I think the simplest way would be to loop through the entire string, character by character, and check whether it contains all of the characters you want. Keep a flag per character to record whether it has been found.
If this isn't the case.... :
Try using the regex:
^[able]+$
Here's what it does:
^ matches the beginning of the string and $ matches the end of the string. This makes sure that you're not getting a partial match.
[able] matches the characters you want the string to consist of, in this case a, b, l, and e. + Makes sure that there are 1 or more of these characters in the string.
Note: This regex matches any string that consists only of these 4 letters. For example, it will match:
able, albe, aeble, aaaabbblllleeee
and will not match
qable, treatable, and abled.
Here is a sample regex that keeps only words containing at least one occurrence of every character in a set. It matches any English word (case-insensitive) that contains at least one occurrence of each of the characters a, b, g:
(?i)(?=.*a)(?=.*b)(?=.*g)[a-z]+
Example of strings that match would be bag, baggy, grab.
Example of strings that don't match would be big, argument, nothing.
The (?i) turns on the case-insensitive flag.
You need to append as many (?=.*<character>) as the number of characters in the set, for each of the characters.
I assume a word only contains English alphabet, so I specify [a-z]. Specify more if you need space, hyphen, etc.
I assume you are using the matches(String regex) method of the String class, which matches against the whole string, so I omitted the ^ and $.
The performance may be bad, since in the worst case (the characters are found at the end of the words), I think that the regex engine may go through the string for around n times where n is the number of characters in the set. It may not be an actual concern at all, since the words are very short, but if it turns out that this is a bottleneck, you may consider doing simple looping.
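The lookahead regex described above can be assembled from the required characters instead of being written by hand. The sketch below is one way to do that under the same assumption that words consist of English letters only. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ContainsAllLetters {
    // Builds (?i)(?=.*a)(?=.*b)...[a-z]+ with one lookahead per required character.
    static Pattern buildPattern(String required) {
        StringBuilder regex = new StringBuilder("(?i)");
        for (char c : required.toCharArray()) {
            // Pattern.quote keeps the character literal even if it is a metacharacter.
            regex.append("(?=.*").append(Pattern.quote(String.valueOf(c))).append(")");
        }
        regex.append("[a-z]+");
        return Pattern.compile(regex.toString());
    }

    // Keeps the words that contain every character of required at least once.
    static List<String> filter(List<String> words, String required) {
        Pattern pattern = buildPattern(required);
        List<String> narrowed = new ArrayList<>();
        for (String word : words) {
            // matches() tests the whole word, so ^ and $ are not needed.
            if (pattern.matcher(word).matches()) {
                narrowed.add(word);
            }
        }
        return narrowed;
    }

    public static void main(String[] args) {
        List<String> words = List.of("arm", "baby", "back", "bad", "bag", "balance", "baggy", "grab", "big");
        System.out.println(filter(words, "abg")); // prints [bag, baggy, grab]
    }
}
```

Compiling the pattern once and reusing it for every word avoids recompiling the regex on each call, which String.matches would do.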

how to replace parts of string using regular expressions

I am not a beginner with regular expressions, but their use in Perl seems a bit different from Java.
Anyway, I have a dictionary of shorthand words and their definitions. I want to iterate over the words in the dictionary and replace them with their meanings. What is the best way to do this in Java?
I have seen String.replaceAll(), String.replace(), as well as the Pattern/Matcher classes. I wish to do a case insensitive replacement along the lines of:
word =~ s/\s?\Q$short_word\E\s?/ \Q$short_def\E /sig
While I am at it, do you think that it is best to extract all the words from the string and then apply my dictionary or just apply the dictionary to the string? I know that I need to be careful, because the shorthand words could match parts of other shorthand meanings.
Hopefully this all makes sense.
Thanks.
Clarification:
Dictionary is something like:
lol:laugh out loud, rofl:rolling on the floor laughing, ll:like lemons
string is:
lol, i am rofl
replaced text:
laugh out loud, i am rolling on the floor laughing
Notice how the ll wasn't replaced anywhere.
The danger is false positives inside of normal words. "fell" != "felike lemons"
One way is to split the input into words on whitespace (do multiple spaces need to be preserved?) and then loop over the list, applying the 'if contains() { replace } else { output original }' idea above.
My output class would be a StringBuffer:
StringBuffer outputBuffer = new StringBuffer();
for (String s : split(inputText)) {
    // Map has containsKey(), not contains()
    outputBuffer.append(dictionary.containsKey(s) ? dictionary.get(s) : s);
}
Make your split method smart enough to return word delimiters also:
split("now is the time") -> now,<space>,is,<space>,the,<space><space>,time
Then you don't have to worry about conserving white space - the loop above will just append anything that isn't a dictionary word to the StringBuffer.
Here's a recent SO thread on retaining delimiters when regexing.
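One way to get such a delimiter-retaining split is a zero-width regex that cuts the text wherever whitespace meets non-whitespace. The sketch below is an assumption about how the split method described above could be written, not the answerer's actual code, and the class and method names are made up.

```java
import java.util.Map;

class DelimiterKeepingExpander {
    // Splits at every boundary between whitespace and non-whitespace, so
    // words and runs of whitespace both come back as tokens.
    static String[] split(String text) {
        return text.split("(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)");
    }

    // Replaces tokens found in the dictionary and copies everything else,
    // including whitespace tokens, unchanged.
    static String expand(String input, Map<String, String> dictionary) {
        StringBuilder out = new StringBuilder();
        for (String token : split(input)) {
            out.append(dictionary.getOrDefault(token, token));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> dictionary = Map.of(
            "lol", "laugh out loud",
            "rofl", "rolling on the floor laughing",
            "ll", "like lemons"
        );
        // prints: laugh out loud  i am rolling on the floor laughing
        System.out.println(expand("lol  i am rofl", dictionary));
    }
}
```

Punctuation attached to a word (for example "lol,") stays part of the token in this sketch, so such tokens would need extra handling before the lookup.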
If you insist on using regex, this would work (taking Zoltan Balazs' dictionary map approach):
Map<String, String> substitutions = loadDictionaryFromSomewhere();
int lengthOfShortestKeyInMap = 3; //Calculate
int lengthOfLongestKeyInMap = 3; //Calculate
StringBuffer output = new StringBuffer(input.length());
Pattern pattern = Pattern.compile("\\b(\\w{" + lengthOfShortestKeyInMap + "," + lengthOfLongestKeyInMap + "})\\b");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
    String candidate = matcher.group(1);
    String substitute = substitutions.get(candidate);
    if (substitute == null)
        substitute = candidate; // no match, use original
    matcher.appendReplacement(output, Matcher.quoteReplacement(substitute));
}
matcher.appendTail(output);
// output now contains the text with substituted words
If you plan to process many inputs, pre-compiling the pattern is more efficient than using String.split(), which compiles a new Pattern each call.
(edit) Compiling all of the keys into a single pattern yields a more efficient approach, like so:
Pattern pattern = Pattern.compile("\\b(lol|rtfm|rofl|wtf)\\b");
// rest of the method unchanged, don't need the shortest/longest key stuff
This allows the regex engine to skip over any words that happen to be short enough but aren't in the list, saving you a lot of map accesses.
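Rather than typing the alternation by hand, the pattern can be generated from the map's keys. The following sketch does that and also adds the case-insensitive matching the question asked for. It assumes the map keys are stored in lower case, and the class and method names are made up for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class ShorthandExpander {
    // Replaces every whole-word occurrence of a map key with its value.
    // Assumption: all keys of substitutions are lower case.
    static String expand(String input, Map<String, String> substitutions) {
        if (substitutions.isEmpty()) {
            return input;
        }
        // Quote each key so that metacharacters in a key are matched literally.
        String alternation = substitutions.keySet().stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Pattern pattern = Pattern.compile("\\b(?:" + alternation + ")\\b", Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(input);
        StringBuffer output = new StringBuffer(input.length());
        while (matcher.find()) {
            String candidate = matcher.group();
            String substitute = substitutions.get(candidate.toLowerCase(Locale.ROOT));
            if (substitute == null) {
                substitute = candidate; // key was not lower case, keep the original text
            }
            matcher.appendReplacement(output, Matcher.quoteReplacement(substitute));
        }
        matcher.appendTail(output);
        return output.toString();
    }

    public static void main(String[] args) {
        Map<String, String> substitutions = Map.of(
            "lol", "laugh out loud",
            "rofl", "rolling on the floor laughing",
            "ll", "like lemons"
        );
        // prints: laugh out loud, i am rolling on the floor laughing
        System.out.println(expand("lol, i am rofl", substitutions));
    }
}
```

If many inputs are processed with the same dictionary, the compiled Pattern could be built once and kept in a field instead of being rebuilt on every call.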
The first thing that comes to mind is this:
...
// eg: lol -> laugh out loud
Map<String, String> dictionary;
ArrayList<String> originalText;
ArrayList<String> replacedText;
for (String string : originalText) {
    if (dictionary.containsKey(string)) {
        replacedText.add(dictionary.get(string));
    } else {
        replacedText.add(string);
    }
}
...
Or you could use a StringBuffer instead of the replacedText.
