I am trying to create a String[] which contains only words that comprise of certain characters. For example I have a dictionary containing a number of words like so:
arm
army
art
as
at
attack
attempt
attention
attraction
authority
automatic
awake
baby
back
bad
bag
balance
I want to narrow the list down so that it only contains words with the characters a, b and g. Therefore the list should only contain the word 'bag' in this example.
Currently I am trying to do this using regexes but having never used them before I can't seem to get it to work.
Here is my code:
public class LetterJugglingMain {
public static void main(String[] args) {
String dictFile = "/Users/simonrhillary/Desktop/Dictionary(3).txt";
fileReader fr = new fileReader();
fr.openFile(dictFile);
String[] dictionary = fr.fileToArray();
String regx = "able";
String[] newDict = createListOfValidWords(dictionary, regx);
printArray(newDict);
}
public static String[] createListOfValidWords(String[] d, String regex){
List<String> narrowed = new ArrayList<String>();
for(int i = 0; i<d.length; i++){
if(d[i].matches(regex)){
narrowed.add(d[i]);
System.out.println("added " + d[i]);
}
}
String[] narrowArray = narrowed.toArray(new String[0]);
return narrowArray;
}
however the array returned is always empty unless the String regex is the exact word! Any ideas? I can post more code if needed...I think I must be trying to initialise the regex wrong.
The narrowed down list must contain ONLY the characters from the regex.
Frankly, I'm not an expert in regexes, but I don't think it's the best tool to do what you want. I would use a method like the following:
public boolean containsAll(String s, Set<Character> chars) {
Set<Character> copy = new HashSet<Character>();
for (int i = 0; i < s.length() && copy.size() < chars.size(); i++) {
char c = s.charAt(i);
if (chars.contains(c)) {
copy.add(c);
}
}
return copy.size() == chars.size();
}
The regex able will match only the string "able". However, if you want a regular expression to match either character of a, b, l or e, the regex you're looking for is [able] (in brackets). If you want words containing several such characters, add a + for repeating the pattern: [able]+.
The OP wants words that contain every character. Not just one of them.
And other characters are not a problem.
If this is the case, I think the simiplest way would be to loop through the entire string, character by character, and check to see if it contains all of the characters you want. Keep flags to check and see if every character has been found.
If this isn't the case.... :
Try using the regex:
^[able]+$
Here's what it does:
^ matches the beginning of the string and $ matches the end of the string. This makes sure that you're not getting a partial match.
[able] matches the characters you want the string to consist of, in this case a, b, l, and e. + Makes sure that there are 1 or more of these characters in the string.
Note: This regex will match a string that contains these 4 letters. For example, it will match:
able, albe, aeble, aaaabbblllleeee
and will not match
qable, treatable, and abled.
A sample regex that filters out words that contains at least one occurrence of all characters in a set. This will match any English word (case-insensitive) that contains at least one occurrence of all the characters a, b, g:
(?i)(?=.*a)(?=.*b)(?=.*g)[a-z]+
Example of strings that match would be bag, baggy, grab.
Example of strings that don't match would be big, argument, nothing.
The (?i) means turns on case-insensitive flag.
You need to append as many (?=.*<character>) as the number of characters in the set, for each of the characters.
I assume a word only contains English alphabet, so I specify [a-z]. Specify more if you need space, hyphen, etc.
I assume matches(String regex) method in String class, so I omitted the ^ and $.
The performance may be bad, since in the worst case (the characters are found at the end of the words), I think that the regex engine may go through the string for around n times where n is the number of characters in the set. It may not be an actual concern at all, since the words are very short, but if it turns out that this is a bottleneck, you may consider doing simple looping.
Related
I am working on twitter data normalization. Twitter users frequently uses terms like ts I looooooove it in order to emphasize the word love. I want to such repeated characters to a proper English word by replacing repeat characters till I get a proper meaningful word (I am aware that I can not differentiate between good and god by this mechanism).
My strategy would be
identify existence of such repeated strings. I would look for more than 2 same characters, as probably there is no English word with more than two repeat characters.
String[] strings = { "stoooooopppppppppppppppppp","looooooove", "good","OK", "boolean", "mee", "claaap" };
String regex = "([a-z])\\1{2,}";
Pattern pattern = Pattern.compile(regex);
for (String string : strings) {
Matcher matcher = pattern.matcher(string);
if (matcher.find()) {
System.out.println(string+" TRUE ");
}
}
Search for such words in a Lexicon like Wordnet
Replace all but two such repeat characters and check in Lexicon
If not there in the Lexicon remove one more repeat character (Otherwise treat it as misspelling).
Due to my poor Java knowledge I am unable to manage 3 and 4. Problem is I can not replace all but two repeated consecutive characters.
Following code snippet replace all but one repeated characters System.out.println(data.replaceAll("([a-zA-Z])\\1{2,}", "$1"));
Help is required to find out
A. How to replace all but 2 consecutive repeat characters
B. How to remove one more consecutive character from the output of A
[I think B can be managed by the following code snippet]
System.out.println(data.replaceAll("([a-zA-Z])\\1{1,}", "$1"));
Edit: Solution provided by Wiktor Stribiżew works perfectly in Java. I was wondering what changes are required to get the same result in python.
Python uses re.sub.
Your regex ([a-z])\\1{2,} matches and captures an ASCII letter into Group 1 and then matches 2 or more occurrences of this value. So, all you need to replace with a backreference, $1, that holds the value captured. If you use one $1, the aaaaa will be replaced with a single a and if you use $1$1, it will be replaced with aa.
String twoConsecutivesOnly = data.replaceAll(regex, "$1$1");
String noTwoConsecutives = data.replaceAll(regex, "$1");
See the Java demo.
If you need to make your regex case insensitive, use "(?i)([a-z])\\1{2,}" or even "(\\p{Alpha})\\1{2,}". If any Unicode letters must be handled, use "(\\p{L})\\1{2,}".
BONUS: In a general case, to replace any amount of any repeated consecutive chars use
text = text.replaceAll("(?s)(.)\\1+", "$1"); // any chars
text = text.replaceAll("(.)\\1+", "$1"); // any chars but line breaks
text = text.replaceAll("(\\p{L})\\1+", "$1"); // any letters
text = text.replaceAll("(\\w)\\1+", "$1"); // any ASCII alnum + _ chars
/*This code checks a character in a given string repeated consecutively 3 times
if you want to check for 4 consecutive times change count==2--->count==3 OR
if you want to check for 2 consecutive times change count==2--->count==1*/
public class Test1 {
static char ch;
public static void main(String[] args) {
String str="aabbbbccc";
char[] charArray = str.toCharArray();
int count=0;
for(int i=0;i<charArray.length;i++){
if(i!=0 ){
if(charArray[i]==ch)continue;//ddddee
if(charArray[i]==charArray[i-1]) {
count++;
if(count==2){
System.out.println(charArray[i]);
count=0;
ch=charArray[i];
}
}
else{
count=0;//aabb
}
}
}
}
}
I have a method that converts all the first letters of the words in a sentence into uppercase.
public static String toTitleCase(String s)
{
String result = "";
String[] words = s.split(" ");
for (int i = 0; i < words.length; i++)
{
result += words[i].replace(words[i].charAt(0)+"", Character.toUpperCase(words[i].charAt(0))+"") + " ";
}
return result;
}
The problem is that the method converts each other letter in a word that is the same letter as the first to uppercase. For example, the string title comes out as TiTle
For the input this is a title this becomes the output This Is A TiTle
I've tried lots of things. A nested loop that checks every letter in each word, and if there is a recurrence, the second is ignored. I used counters, booleans, etc. Nothing works and I keep getting the same result.
What can I do? I only want the first letter in upper case.
Instead of using the replace() method, try replaceFirst().
result += words[i].replaceFirst(words[i].charAt(0)+"", Character.toUpperCase(words[i].charAt(0))+"") + " ";
Will output:
This Is A Title
The problem is that you are using replace method which replaces all occurrences of described character. To solve this problem you can either
use replaceFirst instead
take first letter,
create its uppercase version
concatenate it with rest of string which can be created with a little help of substring method.
since you are using replace(String, String) which uses regex you can add ^ before character you want to replace like replace("^a","A"). ^ means start of input so it will only replace a that is placed after start of input.
I would probably use second approach.
Also currently in each loop your code creates new StringBuilder with data stored in result, append new word to it, and reassigns result of output from toString().
This is infective approach. Instead you should create StringBuilder before loop that will represent your result and append new words created inside loop to it and after loop ends you can get its String version with toString() method.
Doing some Regex-Magic can simplify your task:
public static void main(String[] args) {
final String test = "this is a Test";
final StringBuffer buffer = new StringBuffer(test);
final Pattern patter = Pattern.compile("\\b(\\p{javaLowerCase})");
final Matcher matcher = patter.matcher(buffer);
while (matcher.find()) {
buffer.replace(matcher.start(), matcher.end(), matcher.group().toUpperCase());
}
System.out.println(buffer);
}
The expression \\b(\\p{javaLowerCase}) matches "The beginning of a word followed by a lower-case letter", while matcher.group() is equal to whats inside the () in the part that matches. Example: Applying on "test" matches on "t", so start is 0, end is 1 and group is "t". This can easily run through even a huge amount of text and replace all those letters that need replacement.
In addition: it is always a good idea to use a StringBuffer (or similar) for String manipulation, because each String in Java is unique. That is if you do something like result += stringPart you actually create a new String (equal to result + stringPart) each time this is called. So if you do this with like 10 parts, you will in the end have at least 10 different Strings in memory, while you only need one, which is the final one.
StringBuffer instead uses something like char[] to ensure that if you change only a single character no extra memory needs to be allocated.
Note that a patter only need to be compiled once, so you can keep that as a class variable somewhere.
I am using this code
Matcher m2 = Pattern.compile("\\b[ABE]+\\b").matcher(key);
to only get keys from a HashMap that contain the letters A, B or E
I am not though interested in words such as AAAAAA or EEEEE I need words with at least two different letters (in the best case, three).
Is there a way to modify the regex ? Can anyone offer insight on this?
Replace everything except your letters, make a Set of the result, test the Set for size.
public static void main (String args[])
{
String alphabet = "ABC";
String totest = "BBA";
if (args.length == 2)
{
alphabet = args[0];
totest = args[1];
}
String cleared = totest.replaceAll ("[^" + alphabet + "]", "");
char[] ca = cleared.toCharArray ();
Set <Character> unique = new HashSet <Character> ();
for (char c: ca)
unique.add (c);
System.out.println ("Result: " + (unique.size () > 1));
}
Example implementation
You could use a more complicated regex to do it e.g.
(.*A.*[BE].*|.*[BE].*A.*)|(.*B.*[AE].*|.*[AE].*B.*)|(.*E.*[BA].*|.*[BA].*E.*)
But it's probably going to be more easy to understand to do some kind of replacement, for instance make a loop that replaces one letter at a time with '', and check the size of the new string each time - if it changes the size of the string twice, then you've got two of your desired characters. EDIT: actually, if you know the set of desired characters at runtime before you do the check, NullUserException had it right in his comment - indexOf or contains will be more efficient and probably more readable than this.
Note that if your set of desired characters is unknown at compile time (or at least pre-string-checking at runtime), the second option is preferable - if you're looking for any characters, just replace all occurrences of the first character in a while(str.length > 0) loop - the number of times it goes through the loop is the number of different characters you've got.
Mark explicitly the repetition of desired letters,
It would look like this :
\b[ABE]{1,3}\b
It matches AAE, EEE, AEE but not AAAA, AAEE
I have a set of elements of size about 100-200. Let a sample element be X.
Each of the elements is a set of strings (number of strings in such a set is between 1 and 4). X = {s1, s2, s3}
For a given input string (about 100 characters), say P, I want to test whether any of the X is present in the string.
X is present in P iff for all s belong to X, s is a substring of P.
The set of elements is available for pre-processing.
I want this to be as fast as possible within Java. Possible approaches which do not fit my requirements:
Checking whether all the strings s are substring of P seems like a costly operation
Because s can be any substring of P (not necessarily a word), I cannot use a hash of words
I cannot directly use regex as s1, s2, s3 can be present in any order and all of the strings need to be present as substring
Right now my approach is to construct a huge regex out of each X with all possible permutations of the order of strings. Because number of elements in X <= 4, this is still feasible. It would be great if somebody can point me to a better (faster/more elegant) approach for the same.
Please note that the set of elements is available for pre-processing and I want the solution in java.
You can use regex directly:
Pattern regex = Pattern.compile(
"^ # Anchor search to start of string\n" +
"(?=.*s1) # Check if string contains s1\n" +
"(?=.*s2) # Check if string contains s2\n" +
"(?=.*s3) # Check if string contains s3",
Pattern.DOTALL | Pattern.COMMENTS);
Matcher regexMatcher = regex.matcher(subjectString);
foundMatch = regexMatcher.find();
foundMatch is true if all three substrings are present in the string.
Note that you might need to escape your "needle strings" if they could contain regex metacharacters.
It sounds like you're prematurely optimising your code before you've actually discovered a particular approach is actually too slow.
The nice property about your set of strings is that the string must contain all elements of X as a substring -- meaning we can fail fast if we find one element of X that is not contained within P. This might turn out a better time saving approach than others, especially if the elements of X are typically longer than a few characters and contain no or only a few repeating characters. For instance, a regex engine need only check 20 characters in 100 length string when checking for the presence of a 5 length string with non-repeating characters (eg. coast). And since X has 100-200 elements you really, really want to fail fast if you can.
My suggestion would be to sort the strings in order of length and check for each string in turn, stopping early if one string is not found.
Looks like a perfect case for the Rabin–Karp algorithm:
Rabin–Karp is inferior for single pattern searching to Knuth–Morris–Pratt algorithm, Boyer–Moore string search algorithm and other faster single pattern string searching algorithms because of its slow worst case behavior. However, Rabin–Karp is an algorithm of choice for multiple pattern search.
When the preprocessing time doesn't matter, you could create a hash table which maps every one-letter, two-letter, three-letter etc. combination which occurs in at least one string to a list of strings in which it occurs.
The algorithm to index a string would look like that (untested):
HashMap<String, Set<String>> indexes = new HashMap<String, Set<String>>();
for (int pos = 0; pos < string.length(); pos++) {
for (int sublen=0; sublen < string.length-pos; sublen++) {
String substring = string.substr(pos, sublen);
Set<String> stringsForThisKey = indexes.get(substring);
if (stringsForThisKey == null) {
stringsForThisKey = new HashSet<String>();
indexes.put(substring, stringsForThisKey);
}
stringsForThisKey.add(string);
}
}
Indexing each string that way would be quadratic to the length of the string, but it only needs to be done once for each string.
But the result would be constant-speed access to the list of strings in which a specific string occurs.
You are probably looking for Aho-Corasick algorithm, which constructs an automata (trie-like) from the set of strings (dictionary), and try to match the input string to the dictionary using this automata.
You might want to consider using a "Suffix Tree" as well. I haven't used this code, but there is one described here
I have used proprietary implementations (that I no longer even have access to) and they are very fast.
One way is to generate every possible substring and add this to a set. This is pretty inefficient.
Instead you can create all the strings from any point to the end into a NavigableSet and search for the closest match. If the closest match starts with the string you are looking for, you have a substring match.
static class SubstringMatcher {
final NavigableSet<String> set = new TreeSet<String>();
SubstringMatcher(Set<String> strings) {
for (String string : strings) {
for (int i = 0; i < string.length(); i++)
set.add(string.substring(i));
}
// remove duplicates.
String last = "";
for (String string : set.toArray(new String[set.size()])) {
if (string.startsWith(last))
set.remove(last);
last = string;
}
}
public boolean findIn(String s) {
String s1 = set.ceiling(s);
return s1 != null && s1.startsWith(s);
}
}
public static void main(String... args) {
Set<String> strings = new HashSet<String>();
strings.add("hello");
strings.add("there");
strings.add("old");
strings.add("world");
SubstringMatcher sm = new SubstringMatcher(strings);
System.out.println(sm.set);
for (String s : "ell,he,ow,lol".split(","))
System.out.println(s + ": " + sm.findIn(s));
}
prints
[d, ello, ere, hello, here, ld, llo, lo, old, orld, re, rld, there, world]
ell: true
he: true
ow: false
lol: false
Alright so here is my problem. Basically I have a string with 4 words in it, with each word seperated by a #. What I need to do is use the substring method to extract each word and print it out. I am having trouble figuring out the parameters for it though. I can always get the first one right, but the following ones generally have problems.
Here is the first piece of the code:
word = format.substring( 0 , format.indexOf('#') );
Now from what I understand this basically means start at the beginning of the string, and end right before the #. So using the same logic, I tried to extract the second word like so:
wordTwo = format.substring ( wordlength + 1 , format.indexOf('#') );
//The plus one so I don't start at the #.
But with this I continually get errors saying it doesn't exist. I figured that the compiler was trying to read the first # before the second word, so I rewrote it like so:
wordTwo = format.substring (wordlength + 1, 1 + wordLength + format.indexOf('#') );
And with this it just completely screws it up, either not printing the second word or not stopping in the right place. If I could get any help on the formatting of this, it would be greatly appreciated. Since this is for a class, I am limited to using very basic methods such as indexOf, length, substring etc. so if you could refrain from using anything to complex that would be amazing!
If you have to use substring then you need to use the variant of indexOf that takes a start. This means you can start look for the second # by starting the search after the first one. I.e.
wordTwo = format.substring ( wordlength + 1 , format.indexOf('#', wordlength + 1 ) );
There are however much better ways of splitting a string on a delimiter like this. You can use a StringTokenizer. This is designed for splitting strings like this. Basically:
StringTokenizer tok = new StringTokenizer(format, "#");
String word = tok.nextToken();
String word2 = tok.nextToken();
String word3 = tok.nextToken();
Or you can use the String.split method which is designed for splitting strings. e.g.
String[] parts = String.split("#");
String word = parts[0];
String word2 = parts[1];
String word3 = parts[2];
You can go with split() for this kind of formatting strings.
For instance if you have string like,
String text = "Word1#Word2#Word3#Word4";
You can use delimiter as,
String delimiter = "#";
Then create an string array like,
String[] temp;
For splitting string,
temp = text.split(delimiter);
You can get words like this,
temp[0] = "Word1";
temp[1] = "Word2";
temp[2] = "Word3";
temp[3] = "Word4";
Use split() method to do this with "#" as the delimiter
String s = "hi#vivek#is#good";
String temp = new String();
String[] arr = s.split("#");
for(String x : arr){
temp = temp + x;
}
Or if you want to exact each word... you have it already in arr
arr[0] ---> First Word
arr[1] ---> Second Word
arr[2] ---> Third Word
I suggest that you've a look at the Javadoc for String before you proceed further.
Since this is your homework, I'll give you a couple of hints and maybe you can solve it yourself:
The format for subString is public void subString(int beginIndex, int endIndex). As per the javadoc for this method:
Returns a new string that is a substring of this string. The substring
begins at the specified beginIndex and extends to the character at
index endIndex - 1. Thus the length of the substring is
endIndex-beginIndex.
Note that if you've to use this method, understand that you'll have to shift your beginIndex and endIndex each time because in your situation, you'll have multiple words that are separated by #.
However if you look closely, there's another method in String class that might be helpful to you. That's the public String[] split(String regex) method. The javadoc for this one states:
Splits this string around matches of the given regular expression.
This method works as if by invoking the two-argument split method with
the given expression and a limit argument of zero. Trailing empty
strings are therefore not included in the resulting array.
The split() method looks pretty interesting for your case. You can split your String with the delimiter that you have as the parameter to this method, get the String array and work with that.
Hope this helps you to understand your problem and get started towards a solution :)
Since this is a home work, it may be better to have try to write it your self. But I will give a clue.
Clue:
The indexOf method has another overload: int indexOf(int chr,
int fromIndex) which find the first character chr in the string
from the fromIndex.
http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/String.html
From this clue, the program will look something like this:
Find the index of the first '#' from the start of the string.
Extract the word from 0th character to that index.
Find the index of the first '#' from the character AFTER the first '#'.
Extract the word from the first '#' that index.
... Just do it until you get 4 words or the string ends.
Hope this helps.
I don't know why you're forced to use String#substring, but as others have mentioned, it seems like the wrong method for the kind of functionality you need.
String#split(String regex) is what you would use for such a problem, or, if your input sequence is something you don't control, I would suggest you look at the overloaded method String#split(String regex, int limit); this way you can impose a limit on the amount of matches you make, controlling your resulting array.