String search with "?" as a joker? - java

I have an ArrayList of strings and I want to perform a search method on them.
So far I only have this:
public void searchNote(String searchCertainNote)
{
for (String note: notes)
{
if (note.contains(searchCertainNote))
{
System.out.println(note);
}
}
}
I would like to improve this search method a bit by allowing the user to search for a string like:
String searchCertainString = "a?c"
... Possible search results: "abcd"; "a4c"; "Abc", "a22c", "a$c", etc...
The question mark "?" should represent all characters that are out there.
I did some googling and found out that I could implement this by using regex and String.matches() ...
But I still need your help on this!
Thanks =)

If the user will be entering a '?' as the wildcard (or joker, as you called it), then, if you use regex, you'll need to convert the '?' to '.' in the pattern. In regex, the period matches any single character.
So, you'd need to change the user's input of "a?b" to "a.b". Note that String.matches() only succeeds when the whole string matches the pattern, so to find the pattern anywhere inside a note (as contains() does), use Pattern.compile(...).matcher(note).find() instead.
If the '?' is meant to match at least one character, but possibly more than one, then use '.+' instead. The '+' is a quantifier that means 'one or more of the preceding thing'.
So, you'd need to change the user's input of "a?b" to "a.+b". This pattern will match "axb" and "axyb", but not "ab".
Here's a good beginner's guide for Java regular expression handling:
http://www.vogella.com/tutorials/JavaRegularExpressions/article.html
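The conversion described above can be sketched as follows. This is only a minimal sketch: the class name WildcardSearch and the helpers toPattern and searchNotes are made up for illustration, '?' is assumed to stand for exactly one character, and the search is assumed to be case-insensitive because the question lists "Abc" as a possible hit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class WildcardSearch {
    // Translate user input into a regex: '?' becomes '.', every other
    // character is quoted so that it is matched literally.
    static Pattern toPattern(String query) {
        StringBuilder regex = new StringBuilder();
        for (char ch : query.toCharArray()) {
            if (ch == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(ch)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    // Return every note that contains a match anywhere in it (find, not matches).
    static List<String> searchNotes(List<String> notes, String query) {
        Pattern pattern = toPattern(query);
        List<String> hits = new ArrayList<>();
        for (String note : notes) {
            if (pattern.matcher(note).find()) {
                hits.add(note);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> notes = Arrays.asList("abcd", "a4c", "Abc", "a22c", "a$c", "ac");
        System.out.println(searchNotes(notes, "a?c")); // [abcd, a4c, Abc, a$c]
    }
}
```

With a single-character '?', "a22c" is not a hit; appending ".+" instead of "." in toPattern would include it.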

Do this:
if (searchCertainNote.length() <= 0) {
// String is empty, do something
return;
}
for (int i = 0; i < searchCertainNote.length(); i++) {
if (searchCertainNote.charAt(i) == 'a') {
// Make sure index i + 2 exists before reading it
if (searchCertainNote.length() > i + 2) {
if (searchCertainNote.charAt(i + 2) == 'c') {
// substring avoids adding the chars together as numbers
System.out.println(searchCertainNote.substring(i, i + 3));
return;
}
} else {
// Not found, so do something
return;
}
}
}
This code prints the matching three characters (an 'a', any character, then a 'c') the first time they occur in the string you enter and returns; otherwise it does whatever you want and returns. Note that the pattern a?c is hard-coded here. It also detects empty and too-short strings and avoids a StringIndexOutOfBoundsException. Good luck :D.

Related

Generate new word from wildcard [duplicate]

This question already has answers here:
Returning a list of wildcard matches from a HashMap in java
(3 answers)
Closed 7 years ago.
I'm trying to generate a word from a pattern with wildcards and check whether that word is stored in the dictionary database. For example, "appl*" should return apply or apple. However, the problem comes in when I have two wildcards: "app**" will make words like appaa, appbb ... appzz instead of apple. The second if condition is just for a regular string that contains no wildcards ("*").
public static boolean printWords(String s) {
String tempString, tempChar;
if (s.contains("*")) {
for (char c = 'a'; c <= 'z'; c++) {
tempChar = Character.toString(c);
tempString = s.replace("*", tempChar);
if (myDictionary.containsKey(tempString) == true) {
System.out.println(tempString);
}
}
}
if (myDictionary.containsKey(s) == true) {
System.out.println(s);
return true;
} else {
return false;
}
}
You're only using a single for loop over characters, and replacing all instances of * with that character. See the API for String.replace here. So it's no surprise that you're getting strings like appaa, appbb, etc.
If you want to actually use regular expressions, then you shouldn't be doing any String.replace or contains, etc. See Anubian's answer for how to handle your problem.
If you're treating this as a String exercise and don't want to use regular expressions, the easiest way to do what you're actually trying to do (try all combinations of letters for each wildcard) is to do it recursively. If there are no wild cards left in the string, check if it is a word and if so print. If there are wild cards, try each replacement of that wildcard with a character, and recursively call the function on the created string.
public static void printWords(String s){
int firstAsterisk = s.indexOf("*");
if(firstAsterisk == -1){ // doesn't contain asterisk
if (myDictionary.containsKey(s))
System.out.println(s);
return;
}
for (char c = 'a'; c <= 'z'; c++) {
String s2 = s.substring(0, firstAsterisk) + c + s.substring(firstAsterisk + 1);
printWords(s2);
}
}
The base case relies on the indexOf function: when indexOf returns -1, it means that the given substring (in our case "*") does not occur in the string, so there are no more wildcards to replace.
The substring part basically recreates the original string with the first asterisk replaced with a character. So supposing that s = "abcd**ef" and c = 'z', we know that firstAsterisk = 4 (Strings are 0-indexed, index 4 has the first "*"). Thus,
String s2 = s.substring(0, firstAsterisk) + c + s.substring(firstAsterisk + 1);
= "abcd" + 'z' + "*ef"
= "abcdz*ef"
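The * character is a quantifier in regex, not a single-character wildcard, so the input string cannot be used as a pattern directly. Replace each * with . (which matches any one character) and then match the dictionary keys against the result:
String regex = s.replace('*', '.');
for (String word : myDictionary.keySet()) {
if (word.matches(regex)) {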
System.out.println(word);
}
}
Let the libraries do the heavy lifting for you ;)
With your approach you have to check all possible combinations.
A better way would be to make a regex out of your input string by replacing every * with a . (dot).
Then you can loop over myDictionary and check whether each entry matches the regex.
Something like this:
Set<String> dict = new HashSet<String>();
dict.add("apple");
String word = "app**";
Pattern pattern = Pattern.compile(word.replace('*', '.'));
for (String entry : dict) {
if (pattern.matcher(entry).matches()) {
System.out.println("matches: " + entry);
}
}
Take care if your input string already contains a . character: then you have to escape it with a \ (the same goes for other special regex characters).
See also
http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html and
http://docs.oracle.com/javase/6/docs/api/java/util/regex/Matcher.html
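One way to handle the escaping mentioned above is to quote every literal piece of the input and leave only the wildcards as regex. The sketch below assumes the dictionary is a Set<String>; the names EscapedWildcard, wildcardToPattern and lookup are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class EscapedWildcard {
    // Split on '*', quote each non-empty literal piece, and rejoin with '.'
    // so that only the wildcards keep a special meaning in the final regex.
    static Pattern wildcardToPattern(String word) {
        String regex = Arrays.stream(word.split("\\*", -1))
                .map(piece -> piece.isEmpty() ? "" : Pattern.quote(piece))
                .collect(Collectors.joining("."));
        return Pattern.compile(regex);
    }

    // Return all dictionary entries that match the wildcard word as a whole.
    static List<String> lookup(Set<String> dict, String word) {
        Pattern pattern = wildcardToPattern(word);
        List<String> hits = new ArrayList<>();
        for (String entry : dict) {
            if (pattern.matcher(entry).matches()) {
                hits.add(entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("apple", "apply", "apps", "a.ple"));
        System.out.println(lookup(dict, "app**")); // apple and apply, in unspecified order
        System.out.println(lookup(dict, "a.p**")); // [a.ple]
    }
}
```

The limit of -1 in split keeps trailing empty pieces, so wildcards at the end of the word are not lost.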

Efficient Regular Expression for big data, if a String contains a word

I have code that works but is extremely slow. It determines whether a string contains a keyword. It needs to be efficient for hundreds of keywords that I will search for in thousands of documents.
What can I do to find the keywords efficiently (without falsely matching a longer word that merely contains the keyword)?
For example:
String keyword="ac";
String document = "..."; // a file a few pages long
If I use:
if(document.contains(keyword) ){
//do something
}
It will also return true if document contains a word like "account";
so I tried to use regular expression as follows:
String pattern = "(.*)([^A-Za-z]"+ keyword +"[^A-Za-z])(.*)";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(document);
if(m.find()){
//do something
}
Summary:
Hopefully this will be useful to someone else:
My regular expression would work in principle but is extremely impractical with big data (it didn't terminate).
@anubhava perfected the regular expression. It was easy to understand and implement, and it managed to terminate, which is a big thing, but it was still a bit slow (roughly 240 seconds).
@Tomalak's solution is a bit more complex to implement and understand, but it was the fastest (18 seconds), so hats off, mate.
So @Tomalak's solution was roughly 13 times faster than @anubhava's.
I don't think you need .* in your regex.
Try this regex:
String pattern = "\\b"+ Pattern.quote(keyword) + "\\b";
Here \\b is used for word boundary. If the keyword can contain special characters, make sure they are not at the start or end of the word, or the word boundaries will fail to match.
Also, you must use Pattern.quote if your keyword contains special regex characters.
EDIT: You might use this regex if your keywords are separated by whitespace.
String pattern = "(?<=\\s|^)"+ Pattern.quote(keyword) + "(?=\\s|$)";
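Since the question mentions hundreds of keywords, one option is to combine them into a single word-boundary pattern that is compiled once and reused for every document. This is only a sketch with hypothetical names (KeywordScan, compileKeywords, findKeywords), and whether it is actually faster than one pattern per keyword is an assumption that has not been measured here.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class KeywordScan {
    // Combine all keywords into one pattern of the form \b(?:kw1|kw2|...)\b
    static Pattern compileKeywords(List<String> keywords) {
        String alternation = keywords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b");
    }

    // Collect the distinct keywords that occur as whole words in the document.
    static Set<String> findKeywords(Pattern pattern, String document) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = pattern.matcher(document);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        Pattern pattern = compileKeywords(Arrays.asList("ac", "count"));
        String document = "An ac unit on my account; count it.";
        System.out.println(findKeywords(pattern, document)); // [ac, count]
    }
}
```

The word "account" is skipped because neither keyword is surrounded by word boundaries inside it.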
The fastest-possible way to find substrings in Java is to use String.indexOf().
To achieve "entire-word-only" matches, you would need to add a little bit of logic to check the characters before and after a possible match to make sure they are non-word characters:
public class IndexOfWordSample {
public static void main(String[] args) {
String input = "There are longer strings than this not very long one.";
String search = "long";
int index = indexOfWord(input, search);
if (index > -1) {
System.out.println("Hit for \"" + search + "\" at position " + index + ".");
} else {
System.out.println("No hit for \"" + search + "\".");
}
}
public static int indexOfWord(String input, String word) {
String nonWord = "^\\W?$", before, after;
int index, before_i, after_i = 0;
while (true) {
index = input.indexOf(word, after_i);
if (index == -1 || word.isEmpty()) break;
before_i = index - 1;
after_i = index + word.length();
before = "" + (before_i > -1 ? input.charAt(before_i) : "");
after = "" + (after_i < input.length() ? input.charAt(after_i) : "");
if (before.matches(nonWord) && after.matches(nonWord)) {
return index;
}
}
return -1;
}
}
This would print:
Hit for "long" at position 44.
This should perform better than a pure regular expressions approach.
Consider whether ^\W?$ already matches your expectation of a "non-word" character. The regular expression is a compromise here and may cost performance if your input string contains many "almost"-matches.
For extra speed, ditch the regex and work with the Character class, checking a combination of the many properties it provides (like isAlphabetic, etc.) for before and after.
I've created a Gist with an alternative implementation that does that.
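For readers without access to the Gist, here is a rough sketch of what a regex-free variant could look like. It is not the Gist's code: the class IndexOfWordChars and the helper isWordChar are hypothetical, and the definition of a "word character" is an assumption you may want to adjust.

```java
class IndexOfWordChars {
    // A character counts as part of a word if it is a letter, a digit, or '_'.
    // Unlike the default \w, this also accepts non-ASCII letters and digits.
    static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // Return the index of the first whole-word occurrence of word, or -1.
    static int indexOfWord(String input, String word) {
        if (word.isEmpty()) {
            return -1;
        }
        int from = 0;
        while (true) {
            int index = input.indexOf(word, from);
            if (index == -1) {
                return -1;
            }
            int end = index + word.length();
            boolean startOk = index == 0 || !isWordChar(input.charAt(index - 1));
            boolean endOk = end == input.length() || !isWordChar(input.charAt(end));
            if (startOk && endOk) {
                return index;
            }
            from = index + 1; // continue searching after this false hit
        }
    }

    public static void main(String[] args) {
        String input = "There are longer strings than this not very long one.";
        System.out.println(indexOfWord(input, "long")); // 44
    }
}
```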

Split comma separated string with quotes and commas within quotes and escaped quotes within quotes

I searched as far as page 3 on Google for this problem, but there seems to be no proper solution.
The following string
"zhg,wimö,'astor wohnideen','multistore 2002',yonza,'asdf, saflk','marc o\'polo'"
should be split by comma in Java. The quotes can be double or single quotes. I tried the following regex
,(?=([^\"']*[\"'][^\"']*[\"'])*[^\"']*$)
but because of the escaped quote at 'marc o\'polo' it fails...
Can somebody help me out?
Code for tryout:
String checkString = "zhg,wimö,'astor wohnideen','multistore 2002',yonza,'asdf, saflk','marc o\\'polo'";
Pattern COMMA_PATTERN = Pattern.compile(",(?=([^\"']*[\"'][^\"']*[\"'])*[^\"']*$)");
String[] splits = COMMA_PATTERN.split(checkString);
for (String split : splits) {
System.out.println(split);
}
You can do it like this:
List<String> result = new ArrayList<String>();
Pattern p = Pattern.compile("(?>[^,'\"]++|(['\"])(?>[^\"'\\\\]++|\\\\.|(?!\\1)[\"'])*\\1|(?<=,|^)\\s*(?=,|$))+", Pattern.DOTALL);
Matcher m = p.matcher(checkString);
while(m.find()) {
result.add(m.group());
}
Splitting CSV with regex is not the right solution... which is probably why you are struggling to find one with split/csv/regex search terms.
Using a dedicated library with a state machine is typically the best solution. There are a number of them:
This closed question seems relevant: https://stackoverflow.com/questions/12410538/which-is-the-best-csv-parser-in-java
I have used opencsv in the past, and I believe the Apache CSV tool is good too. I am sure there are others. I am specifically not linking any library because you should do your own research on what to use.
I have been involved in a number of commercial projects where the CSV parser was custom-built, but I see no reason why that should still be done.
What I can say is that regex and CSV get very, very complicated relatively quickly (as you have discovered), and that for performance reasons alone, a 'raw' parser is better.
If you are parsing CSV (or something very similar), then using one of the established frameworks is normally a good idea, as they cover most corner cases and are tested by a wider audience through usage in different projects.
If however libraries are no option you could go with e.g. this:
import java.util.ArrayList;
import java.util.List;

public class Curios {
public static void main(String[] args) {
String checkString = "zhg,wimö,'astor wohnideen','multistore 2002',yonza,'asdf, saflk','marc o\\'polo'";
List<String> result = splitValues(checkString);
System.out.println(result);
System.out.println(splitValues("zhg\\,wi\\'mö,'astor wohnideen','multistore 2002',\"yo\\\"nza\",'asdf, saflk\\\\','marc o\\'polo',"));
}
public static List<String> splitValues(String checkString) {
List<String> result = new ArrayList<String>();
// Used for reporting errors and detecting quotes
int startOfValue = 0;
// Used to mark the next character as being escaped
boolean charEscaped = false;
// Is the current value quoted?
boolean quoted = false;
// Quote-character in use (only valid when quoted == true)
char quote = '\0';
// All characters read from current value
final StringBuilder currentValue = new StringBuilder();
for (int i = 0; i < checkString.length(); i++) {
final char charAt = checkString.charAt(i);
if (i == startOfValue && !quoted) {
// We have not yet decided if this is a quoted value, but we are right at the beginning of the next value
if (charAt == '\'' || charAt == '"') {
// This will be a quoted String
quote = charAt;
quoted = true;
startOfValue++;
continue;
}
}
if (!charEscaped) {
if (charAt == '\\') {
charEscaped = true;
} else if (quoted && charAt == quote) {
if (i + 1 == checkString.length()) {
// So we don't throw an exception
quoted = false;
// Last value will be added to result outside loop
break;
} else if (checkString.charAt(i + 1) == ',') {
// Ensure we don't parse , again
i++;
// Add the value to the result
result.add(currentValue.toString());
// Prepare for next value
currentValue.setLength(0);
startOfValue = i + 1;
quoted = false;
} else {
throw new IllegalStateException(String.format(
"Value was quoted with %s but prematurely terminated at position %d " +
"maybe a \\ is missing before this %s or a , after? " +
"Value up to this point: \"%s\"",
quote, i, quote, checkString.substring(startOfValue, i + 1)));
}
} else if (!quoted && charAt == ',') {
// Add the value to the result
result.add(currentValue.toString());
// Prepare for next value
currentValue.setLength(0);
startOfValue = i + 1;
} else {
// a boring character
currentValue.append(charAt);
}
} else {
// So we don't forget to reset for next char...
charEscaped = false;
// Here we can do interpolations
switch (charAt) {
case 'n':
currentValue.append('\n');
break;
case 'r':
currentValue.append('\r');
break;
case 't':
currentValue.append('\t');
break;
default:
currentValue.append(charAt);
}
}
}
if(charEscaped) {
throw new IllegalStateException("Input ended with a stray \\");
} else if (quoted) {
throw new IllegalStateException("Last value was quoted with "+quote+" but there is no terminating quote.");
}
// Add the last value to the result
result.add(currentValue.toString());
return result;
}
}
Why not simply a regular expression?
Regular expressions don't understand nesting very well. While the regular expression by Casimir certainly does a good job, differences between quoted and unquoted values are easier to model in some form of state machine. You can see how difficult it was to ensure you don't accidentally match an escaped or quoted comma. Also, since you are already evaluating every character, it is easy to interpret escape sequences like \n.
What to watch out for?
My function was not written for white-space around values (this can be changed)
My function will interpret the escape-sequences \n, \r, \t, \\ like most C-style language interpreters while reading \x as x (this can easily be changed)
My function accepts quotes and escapes inside unquoted values (this can easily be changed)
I did only a few tests and tried my best to exhibit a good memory-management and timing, but you will need to see if it fits your needs.

contains quotation mark java

How do I add an argument to check whether a string contains exactly ONE quotation mark? I tried to escape the character, but it doesn't work.
words[i].contains()
EDIT: my bad, I had some unclosed brackets; it works fine now.
words[i].matches("[^\"]*\"[^\"]*")
That is: any non-quotes, a quote, any non-quotes.
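You could use something like this:
words[i].split("\"", -1).length - 1
That would give you the number of "s in your string. The limit of -1 keeps trailing empty strings, so a quote at the very end of the string is still counted. Therefore, just use:
if (words[i].split("\"", -1).length == 2) {
//do stuff
}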
You can check if the first quotation mark exists, and then check if the second one doesn't. It's much faster than using matches or split.
int index = words[i].indexOf('\"');
if (index != -1 && words[i].indexOf('\"', index + 1) == -1){
// do stuff
}
To check the number of quotes, you can also compare the length of the string before and after removing every ".
int quotesNumber = words[i].length() - words[i].replace("\"", "").length();
if (quotesNumber == 1){
//do stuff
}
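If you would rather not create intermediate strings or arrays at all, a plain counting loop does the same job. This is a minimal sketch; QuoteCount, countChar and hasExactlyOneQuote are hypothetical names.

```java
class QuoteCount {
    // Count how often target occurs in s with a single pass over the characters.
    static int countChar(String s, char target) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == target) {
                count++;
            }
        }
        return count;
    }

    static boolean hasExactlyOneQuote(String s) {
        return countChar(s, '"') == 1;
    }

    public static void main(String[] args) {
        System.out.println(hasExactlyOneQuote("say \"hi"));   // true
        System.out.println(hasExactlyOneQuote("say \"hi\"")); // false
    }
}
```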

Using "Predefined character classes" on a substring

I want to check the last character of the String for characters that are not a non-word character using '\W' and allow certain symbols like ". , ! etc" from the top of my head I thought of using a code similar to this.
Boolean notCompleted = true;
int deduct = 1;
while(notCompleted){
if(string.charAt(string.length() -deduct) == '\W'){ // '\W' <-- doesn't work since it accepts anything other than "escape sequences".
if(string.charAt(string.length() -deduct) == '.'||string.charAt(string.length() -deduct) == ','||string.charAt(string.length() -deduct) == '!'){
//Do nothing and move on to the while loop
}else{
//Replace the non word character with ' '.
}
}
deduct++;
if(deduct >= string.length()){
notCompleted = false;
}
}
This doesn't work because the char literal compared against string.charAt only accepts escape sequences.
My question: is there another way to pull this off, rather than doing:
string.replaceAll("\\W", "");
All suggestions are greatly appreciated. Thank you.
Thanks to the tip npinti gave me I built this code. However, I am getting an error on this line:
fakeNewString = sb.toString(); // NullPointerException
The desired result for fakeNewString is "! asdsdefwef.,a,,sda.sd".
public static void test5(){
Boolean notCompleted = true;
String fakeNewString = "!##$%^&*( asdsdefwef.,a,,sda.sd";
int start = 0, end = 1;
StringBuilder sb = null;
try{
while(notCompleted){
start++;
String tempString = fakeNewString.substring(start, end);
if(Pattern.matches("\\W$", tempString)){
if(Pattern.matches("!", tempString)||Pattern.matches(".", tempString)||Pattern.matches(",", tempString)||Pattern.matches("\"", tempString)){
//do nothing
sb.append(tempString);
}else{
//Change it to spaces.
tempString = " ";
sb.append(tempString);
}
}
end++;
if(end >= fakeNewString.length()){
notCompleted = false;
fakeNewString = sb.toString();
System.out.println(fakeNewString);
}
}
}catch (Exception e) {
// TODO: handle exception
e.printStackTrace();
}
}
You can do something like so:
Pattern pattern = Pattern.compile("\\W$");
Matcher matcher = pattern.matcher(string);
if (matcher.find())
{
//do something when the string ends with a non word character
}
Take a look at this tutorial for more information on regular expressions.
You can use String.replaceAll in a slightly different way to do this. It achieves the same effect as the code you're trying to write, which seems like a complex solution for a simple problem. Try this code:
string = string.replaceAll("[^\\w!,.]", " ");
All the invalid characters are now replaced by a space, and multiple sequential occurrences of them are replaced by multiple spaces. Strings are immutable, so the result has to be assigned back.
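If the goal is the single space shown in the asker's desired result, a + quantifier after the character class collapses each run of disallowed characters into one space. A small sketch, with the hypothetical names CleanNonWord and clean, run on the input from the question:

```java
class CleanNonWord {
    static String clean(String input) {
        // The trailing + collapses each run of disallowed characters into one space.
        return input.replaceAll("[^\\w!,.]+", " ");
    }

    public static void main(String[] args) {
        String fakeNewString = "!##$%^&*( asdsdefwef.,a,,sda.sd";
        System.out.println(clean(fakeNewString)); // ! asdsdefwef.,a,,sda.sd
    }
}
```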
Let's try to break down the question (desire) and answer it:
I want to check the last character of the String for characters that are not a non-word character using '\W' and allow certain symbols like ". , ! etc"
First we have:
I want to check the last character of the String
Expression for character X at end of string:
X$
Then:
for characters that are not a non-word character
Expression:
[^\W] i.e. \w
And also:
allow certain symbols like ". , ! etc"
Added to the expression above:
[\w.,!]
And the combined final result is:
[\w.,!]$
Ta-da! (Altho I'm guessing OP is looking for something else, I did it for teh lulz.)
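In Java source the backslash has to be doubled, and find() is needed because the expression only describes the end of the string. A minimal sketch, assuming the hypothetical names LastCharCheck and endsWithAllowedChar:

```java
import java.util.regex.Pattern;

class LastCharCheck {
    // [\w.,!]$ written as a Java string literal
    private static final Pattern ALLOWED_END = Pattern.compile("[\\w.,!]$");

    // True if the last character is a word character or one of . , !
    static boolean endsWithAllowedChar(String s) {
        return ALLOWED_END.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(endsWithAllowedChar("done!")); // true
        System.out.println(endsWithAllowedChar("done#")); // false
    }
}
```

Keep in mind that $ in Java also matches just before a final line terminator; use \z instead if the very last character must be checked.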
