java startsWith() method with custom rules

I am implementing a typing trainer and would like to create my own String startsWith() method with specific rules.
For example, the '-' character should match any long hyphen ('‒', etc.). I will also add rules for accented characters (e matches é, but é does not match e).
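import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
public class TestCustomStartsWith {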
private static Map<Character, List<Character>> identityMap = new HashMap<>();
static { // different hyphens: ‒, –, —, ―
List<Character> list = new LinkedList<>();
list.add('‒');
list.add('–'); // etc
identityMap.put('-', list);
}
public static void main(String[] args) {
System.out.println(startsWith("‒d--", "-"));
}
public static boolean startsWith(String s, String prefix) {
if (s.startsWith(prefix)) return true;
if (prefix.length() > s.length()) return false;
int i = prefix.length();
while (--i >= 0) {
if (prefix.charAt(i) != s.charAt(i)) {
List<Character> list = identityMap.get(prefix.charAt(i));
if ((list == null) || (!list.contains(s.charAt(i)))) return false;
}
}
return true;
}
}
I could just replace every kind of long hyphen with the '-' character, but if more rules are added, I'm afraid the replacing will be too slow.
How can I improve this algorithm?

I don't know all of your custom rules, but would a regular expression work?
The user is passing in a String. Create a method to convert that String to a regex, e.g.
replace a short hyphen with short or long ([-‒]),
same for your accents, e becomes [eé]
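Anchor the match at the start of the string (a leading \b would not help here, because a hyphen is not a word character),
Then compile this to a regex and give it a go.
Note that the list of replacements could be kept in a Map as suggested by Tobbias. Your code could be something like
public boolean myStartsWith(String testString, String startsWith) {
// fancyTransformMap is a Map<String, String>, e.g. "-" -> "[-‒]" and "e" -> "[eé]"
for (Map.Entry<String, String> me : fancyTransformMap.entrySet()) {
startsWith = startsWith.replaceAll(me.getKey(), me.getValue());
}
// lookingAt() matches from the start of testString without requiring the whole string to match
return Pattern.compile(startsWith).matcher(testString).lookingAt();
}
P.S. I'm not a regex super-guru, so there may be possible improvements.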

I'd think something like a HashMap that maps the undesirable characters to what you want them to be interpreted as might be the way to go if you are worried about performance;
Map<Character, Character> fastMap = new HashMap<>();
// read it as '<long hyphen>' can be interpreted as '<regular hyphen>'
fastMap.put('–', '-');
fastMap.put('é', 'e');
fastMap.put('è', 'e');
fastMap.put('？', '?');
...
// and so on
That way you could ask for the value of the key: value = map.get(key).
However, this only works as long as every key maps to exactly one value. The caveat is that é can't also be interpreted as è with this method, because each key can appear only once. If you are worried about performance, though, this is an exceedingly fast way of doing it, since the lookup time for a HashMap is pretty close to O(1). But as others on this page have written, premature optimization is often a bad idea: try implementing something that works first, and if at the end of it you find it is too slow, then optimize.
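As a rough sketch of how such a map could drive the prefix check directly (without rewriting the whole input first), something like the following might work. The class name `NormalizingPrefix`, the map name `CANONICAL` and its three entries are made up for illustration; the lookup is done on the character from the tested string, so that é is accepted where e is expected but not the other way round.

```java
import java.util.HashMap;
import java.util.Map;

class NormalizingPrefix {
    // Hypothetical table: typed variant -> the plain character it may stand in for.
    private static final Map<Character, Character> CANONICAL = new HashMap<>();
    static {
        CANONICAL.put('\u2012', '-'); // figure dash
        CANONICAL.put('\u2013', '-'); // en dash
        CANONICAL.put('\u00e9', 'e'); // e with acute accent
    }

    // True if s starts with prefix, treating mapped variants in s as their plain form.
    static boolean startsWith(String s, String prefix) {
        if (prefix.length() > s.length()) return false;
        for (int i = 0; i < prefix.length(); i++) {
            char expected = prefix.charAt(i);
            char actual = s.charAt(i);
            if (expected == actual) continue;
            Character canonical = CANONICAL.get(actual);
            if (canonical == null || canonical != expected) return false;
        }
        return true;
    }
}
```

Each character costs at most one map lookup, and nothing is copied, so the cost stays proportional to the prefix length no matter how many rules are added.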

Related

Getting intermediate results from stream to be used later in stream

I was trying to write some functional programming code (using lambdas and streams from Java 8) to test if a string has unique characters in it (if it does, return true, if it does not, return false). A common way to do this using vanilla Java is with a data structure like a set, i.e.:
public static boolean oldSchoolMethod(String str) {
Set<String> set = new HashSet<>();
for(int i=0; i<str.length(); i++) {
if(!set.add(str.charAt(i) + "")) return false;
}
return true;
}
The set returns true if the character/object can be added to the set (because it did not exist there previously). It returns false if it cannot (it exists in the set already, duplicated value, and cannot be added). This makes it easy to break out the loop and detect if you have a duplicate, without needing to iterate through all length N characters of the string.
I know that in Java 8 streams you cannot break out of a stream. Is there any way to capture the return value of an intermediate stream operation, like the return value of adding to the set (true or false), and send that value to the next stage of the pipeline (another intermediate operation or a terminal stream operation)? i.e.
Arrays.stream(myInputString.split(""))
.forEach( i -> {
set.add(i) // need to capture whether this returns "true" or "false" and use that value later in
// the pipeline or is this bad/not possible?
});
One of the other ways I thought of solving this problem is to just use distinct() and collect the results into a new string. If it has the same length as the original string, then you know the characters are unique; if the lengths differ, some characters got filtered out for not being distinct, so you know the string is not unique. The only issue I see here is that you have to iterate through all N chars of the string, whereas the "old school" method could, in the best case, finish in almost constant time O(1), since it breaks out of the loop and returns as soon as it finds one duplicated character:
public static boolean java8StreamMethod(String str) {
String result = Arrays.stream(str.split(""))
.distinct()
.collect(Collectors.joining());
return result.length() == str.length();
}

Your solutions are all performing unnecessary string operations.
E.g. instead of using a Set<String>, you can use a Set<Character>:
public static boolean betterOldSchoolMethod(String str) {
Set<Character> set = new HashSet<>();
for(int i=0; i<str.length(); i++) {
if(!set.add(str.charAt(i))) return false;
}
return true;
}
But even the boxing from char to Character is avoidable.
public static boolean evenBetterOldSchoolMethod(String str) {
BitSet set = new BitSet();
for(int i=0; i<str.length(); i++) {
if(set.get(str.charAt(i))) return false;
set.set(str.charAt(i));
}
return true;
}
Likewise, for the Stream variant, you can use str.chars() instead of Arrays.stream(str.split("")). Further, you can use count() instead of collecting all elements to a string via collect(Collectors.joining()), just to call length() on it.
Fixing both issues yields the solution:
public static boolean newMethod(String str) {
return str.chars().distinct().count() == str.length();
}
This is simple, but lacks short-circuiting. Further, the performance characteristics of distinct() are implementation-dependent. In OpenJDK, it uses an ordinary HashSet under the hood, rather than a BitSet or the like.
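If short-circuiting matters, one possible sketch is to combine the BitSet idea with allMatch(), which stops at the first element that fails the predicate. Note that this relies on a stateful lambda, so it is only an assumption-laden illustration for sequential streams and not something to run in parallel; the names `UniqueChars` and `allUnique` are invented here.

```java
import java.util.BitSet;

class UniqueChars {
    // Returns true if no char occurs twice; stops at the first duplicate.
    static boolean allUnique(String str) {
        BitSet seen = new BitSet();
        // Stateful predicate: only safe because the stream is sequential.
        return str.chars().allMatch(c -> {
            if (seen.get(c)) return false; // duplicate found, allMatch ends here
            seen.set(c);
            return true;
        });
    }
}
```

Whether this is nicer than the plain loop is debatable, since the side effect in the lambda goes against the usual stream guidelines.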

This code might work for you:
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
public class Test {
public static void main(String[] args) {
String myInputString = "hellowrd";
HashSet<String> set = new HashSet<>();
Optional<String> duplicateChar =Arrays.stream(myInputString.split("")).
filter(num-> !set.add(num)).findFirst();
if(duplicateChar.isPresent()){
System.out.println("Not unique");
}else{
System.out.println("Unique");
}
}
}
Here, findFirst() returns the first duplicate element, so we don't need to continue iterating over the rest of the characters.

What about just mapping to a boolean?
Arrays.stream(myInputString.split(""))
.map(set::add)
.<...>
That would solve your concrete issue, I guess, but it's not a very nice solution because the closures in stream chains should not have side-effects (that is exactly the point of functional programming...).
Sometimes the classic for-loop is still the better choice for certain problems ;-)

Better way to invert cases of characters in a string in Java

As a novice Java programmer who has barely gotten started, I am totally exhausted from trying to find a solution to this issue. A course that I am currently taking gave homework that asks me to create a Java class with a sort of “reverse” method that returns a new version of the current string in which the capitalization is reversed (i.e., lowercase to uppercase and uppercase to lowercase) for the alphabetical characters specified in a given condition. Say, if I were to reverse “abc, XYZ; 123.” using reverse("bcdxyz#3210."), it must return "aBC, xyz; 123.". (P.S.: the class ignores numbers and special characters, and the variable "myString" holds the "abc, XYZ; 123." value.) So far, I've only managed to return "aBC, XYZ; 123." with the code below. Am I missing something here?
public String reverse(String arg) {
// TODO Implement method
String arg_no_sym = arg.replaceAll("[^a-zA-Z0-9]","");
String arg_perfect = arg_no_sym.replaceAll("\\d","");
if (myString != null) {
char[] arrayOfReplaceChars = arg_perfect.toCharArray();
char[] arrayOfmyString = myString.toCharArray();
for (int i = 0; i < arg_perfect.length(); i++) {
myString = myString.replace(String.valueOf((arrayOfReplaceChars[i])), String.valueOf((arrayOfReplaceChars[i])).toUpperCase());
}
return myString;
}
else {
return "";
}
}

How about using the methods isUpperCase() and isLowerCase() to check the case of the letters and then use toUpperCase() and toLowerCase() to change the case of them?
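A minimal sketch of that idea could look like the following. It assumes, based on the expected output in the question, that the letters in the argument are matched case-insensitively (so "xyz" selects "XYZ"); the class name `CaseInverter` and its constructor are invented for the example, only `myString` and `reverse` come from the question.

```java
import java.util.HashSet;
import java.util.Set;

class CaseInverter {
    private final String myString;

    CaseInverter(String myString) {
        this.myString = myString;
    }

    // Flips the case of every letter in myString that also occurs in arg (ignoring case).
    String reverse(String arg) {
        if (myString == null) return "";
        Set<Character> selected = new HashSet<>();
        for (char c : arg.toCharArray()) {
            if (Character.isLetter(c)) selected.add(Character.toLowerCase(c)); // digits and symbols are ignored
        }
        StringBuilder sb = new StringBuilder(myString.length());
        for (char c : myString.toCharArray()) {
            if (selected.contains(Character.toLowerCase(c))) {
                sb.append(Character.isUpperCase(c) ? Character.toLowerCase(c) : Character.toUpperCase(c));
            } else {
                sb.append(c); // not selected: copy unchanged
            }
        }
        return sb.toString();
    }
}
```

With myString set to "abc, XYZ; 123.", calling reverse("bcdxyz#3210.") walks the string once and flips b, c, X, Y and Z, leaving everything else alone.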

Simple Logic or Regex Expression for a string

Hi, I have a string like the following -
name,number,address(line1,city),status,contact(id,phone(number,type),email(id),type),closedate
I need to output the following -
name,number,address.line1,address.city,status,contact.id,contact.phone.number,contact.phone.type,contact.email.id,contact.type,closedate
Is it possible to do this using a regex in Java? The logic I have thought of uses string manipulation (substring, recursion, etc.). Is there a simpler way of achieving this? I would prefer a regular expression that works in Java, but other suggestions are also welcome.
To give you some context:
The string above comes in as a query parameter, and I have to work out which columns I need to select based on it. Each individual item in the output will have a corresponding column name in a property file.
Thanks
Pal

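import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
public class Main {
public static void main(String[] args) {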
String input ="name,number,address(line1,test(city)),status,contact(id,phone(number,type),email(id),type),closedate";
List<String> list = new ArrayList<String>(Arrays.asList(input.split(","))); // We need a list for the iterator (or ArrayIterator)
List<String> result = new Main().parse(list);
System.out.println(String.join(",", result));
}
private List<String> parse(List<String> inputString){
Iterator<String> it = inputString.iterator();
ArrayList<String> result = new ArrayList<>();
while(it.hasNext()){
String word = it.next();
if(! word.contains("(")){
result.add(word);
} else { // if we come across a "(", start the recursion and parse it till we find the matching ")"
result.addAll(buildDistributedString(it, word,""));
}
}
return result;
}
/*
* Recursively parse the string.
* @param startword The first word of it (containing the new prefix, the "(" and the first word of this prefix)
* @param prefix Concatenation of previous prefixes in the recursion
*/
private List<String> buildDistributedString(Iterator<String> it, String startword,String prefix){
ArrayList<String> result = new ArrayList<>();
String[] splitted = startword.split("\\(");
prefix += splitted[0]+".";
if(splitted[1].contains(")")){ //if the '(' is immediately matched, return only this one item
result.add(prefix+splitted[1].substring(0,splitted[1].length()-1));
return result;
} else {
result.add(prefix+splitted[1]);
}
while(it.hasNext()){
String word = it.next();
if( word.contains("(")){ // go deeper in the recursion
List<String> stringList = buildDistributedString(it, word, prefix);
if(stringList.get(stringList.size()-1).contains(")")){
// if multiple ")"'s were found in the same word, go up multiple recursion levels
String lastString = stringList.remove(stringList.size()-1);
stringList.add(lastString.substring(0,lastString.length() -1));
result.addAll(stringList);
break;
}
result.addAll(stringList);
} else if(word.contains(")")) { // end this recursion level
result.add(prefix + word.substring(0,word.length()-1)); // ")" is always the last char
break;
} else {
result.add(prefix+word);
}
}
return result;
}
}
I wrote a quick parser for this. There probably are some improvements possible, but this should give you an idea. It was just meant to get a working version asap.

Since nested parentheses appear in your string, regular expressions can't do the job. The explanation why is complicated and requires knowledge of context-free grammars. See Can regular expressions be used to match nested patterns?
I've heard this kind of parsing can be done through callbacks, but I believe it doesn't exist in Java.
Parser generators like JavaCC would do the job, but that's huge overkill for the task you are describing.
I recommend that you look into java.util.Scanner, and recursively call the parse method whenever you see a left paren.
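For what it's worth, the recursive idea can also be sketched without Scanner, by walking the string with an index and recursing on every left paren. The class `FieldFlattener` and its method names are invented for this illustration, and the sketch assumes well-formed input (balanced parentheses, no empty names).

```java
import java.util.ArrayList;
import java.util.List;

class FieldFlattener {
    private final String input;
    private int pos = 0; // current read position in input

    private FieldFlattener(String input) {
        this.input = input;
    }

    static String flatten(String input) {
        FieldFlattener f = new FieldFlattener(input);
        List<String> out = new ArrayList<>();
        f.parseList("", out);
        return String.join(",", out);
    }

    // list := item (',' item)*
    private void parseList(String prefix, List<String> out) {
        while (pos < input.length()) {
            parseItem(prefix, out);
            if (pos < input.length() && input.charAt(pos) == ',') {
                pos++; // consume ',' and read the next item
            } else {
                break; // ')' or end of input: this list is finished
            }
        }
    }

    // item := name [ '(' list ')' ]
    private void parseItem(String prefix, List<String> out) {
        int start = pos;
        while (pos < input.length() && "(),".indexOf(input.charAt(pos)) < 0) pos++;
        String name = input.substring(start, pos);
        if (pos < input.length() && input.charAt(pos) == '(') {
            pos++; // consume '('
            parseList(prefix + name + ".", out); // children get the dotted prefix
            if (pos < input.length() && input.charAt(pos) == ')') pos++; // consume ')'
        } else {
            out.add(prefix + name); // leaf: emit the full dotted path
        }
    }
}
```

Calling FieldFlattener.flatten(...) on the string from the question should yield the dotted list the question asks for; malformed input is not diagnosed in this sketch.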

Modelling a regular expression parser with polymorphism

So, I'm doing a regular expression parser for school that creates a hierarchy of objects in charge of the matching. I decided to do it object oriented because it's easier for me to imagine an implementation of the grammar that way. So, these are my classes making up the regular expressions. It's all in Java, but I think you can follow along if you're proficient in any object oriented language.
The only operators we're required to implement are Union (+), Kleene-Star (*), Concatenation of expressions (ab or maybe (a+b)c) and of course the Parenthesis, as illustrated in the example of Concatenation. This is what I've implemented right now, and I've got it to work like a charm with a bit of overhead in the main.
The parent class, Regexp.java
public abstract class Regexp {
//Print out the regular expression it's holding
//Used for debugging purposes
abstract public void print();
//Checks if the string matches the expression it's holding
abstract public Boolean match(String text);
//Adds a regular expression to be operated upon by the operators
abstract public void add(Regexp regexp);
/*
*To help the main with the overhead to help it decide which regexp will
*hold the other
*/
abstract public Boolean isEmpty();
}
There's the most simple regexp, Base.java, which holds a char and returns true if the string matches the char.
public class Base extends Regexp{
//Boxed so that "no char yet" can be represented as null
Character c;
public Base(char c){
this.c = c;
}
public Base(){
c = null;
}
@Override
public void print() {
System.out.println(c);
}
//If the string is the char, return true
@Override
public Boolean match(String text) {
if(text.length() > 1) return false;
return text.startsWith(""+c);
}
//Not utilized, since base is only contained and cannot contain
@Override
public void add(Regexp regexp) {
}
@Override
public Boolean isEmpty() {
return c == null;
}
}
A parenthesis, Paren.java, to hold a regexp inside it. Nothing really fancy here, but illustrates how matching works.
public class Paren extends Regexp{
//Member variables: What it's holding and if it's holding something
private Regexp regexp;
Boolean empty;
//Parenthesis starts out empty
public Paren(){
empty = true;
}
//Unless you create it with something to hold
public Paren(Regexp regexp){
this.regexp = regexp;
empty = false;
}
//Print out what it's holding
@Override
public void print() {
regexp.print();
}
//Real simple; either what you're holding matches the string or it doesn't
@Override
public Boolean match(String text) {
return regexp.match(text);
}
//Pass something for it to hold, then it's not empty
@Override
public void add(Regexp regexp) {
this.regexp = regexp;
empty = false;
}
//Return if it's holding something
@Override
public Boolean isEmpty() {
return empty;
}
}
A Union.java, which is two regexps that can be matched. If one of them is matched, the whole Union is a match.
public class Union extends Regexp{
//Members
Regexp lhs;
Regexp rhs;
//Indicating if there's room to push more stuff in
private Boolean lhsEmpty;
private Boolean rhsEmpty;
public Union(){
lhsEmpty = true;
rhsEmpty = true;
}
//Can start out with something on the left side
public Union(Regexp lhs){
this.lhs = lhs;
lhsEmpty = false;
rhsEmpty = true;
}
//Or with both members set
public Union(Regexp lhs, Regexp rhs) {
this.lhs = lhs;
this.rhs = rhs;
lhsEmpty = false;
rhsEmpty = false;
}
//Some stuff to help me see the unions format when I'm debugging
@Override
public void print() {
System.out.println("(");
lhs.print();
System.out.println("union");
rhs.print();
System.out.println(")");
}
//If the string matches the left side or right side, it's a match
@Override
public Boolean match(String text) {
if(lhs.match(text) || rhs.match(text)) return true;
return false;
}
/*
*If the left side is not set, add the member there first
*If not, and right side is empty, add the member there
*If they're both full, merge it with the right side
*(This is a consequence of left-to-right parsing)
*/
@Override
public void add(Regexp regexp) {
if(lhsEmpty){
lhs = regexp;
lhsEmpty = false;
}else if(rhsEmpty){
rhs = regexp;
rhsEmpty = false;
}else{
rhs.add(regexp);
}
}
//If it's not full, it's empty
@Override
public Boolean isEmpty() {
return (lhsEmpty || rhsEmpty);
}
}
A concatenation, Concat.java, which is basically a list of regexps chained together. This one is complicated.
public class Concat extends Regexp{
/*
*The list of regexps is called product and the
*regexps inside called factors
*/
List<Regexp> product;
public Concat(){
product = new ArrayList<Regexp>();
}
public Concat(Regexp regexp){
product = new ArrayList<Regexp>();
pushRegexp(regexp);
}
public Concat(List<Regexp> product) {
this.product = product;
}
//Adding a new regexp pushes it into the list
public void pushRegexp(Regexp regexp){
product.add(regexp);
}
//Loops over and prints them
@Override
public void print() {
for(Regexp factor: product){
factor.print();
}
}
/*
*Builds up a substring approaching the input string.
*When it matches, it builds another substring from where it
*stopped. If the entire string has been pushed, it checks if
*there's an equal amount of matches and factors.
*/
@Override
public Boolean match(String text) {
ArrayList<Boolean> bools = new ArrayList<Boolean>();
int start = 0;
ListIterator<Regexp> itr = product.listIterator();
Regexp factor = itr.next();
for(int i = 0; i <= text.length(); i++){
String test = text.substring(start, i);
if(factor.match(test)){
start = i;
bools.add(true);
if(itr.hasNext())
factor = itr.next();
}
}
return (allTrue(bools) && (start == text.length()));
}
private Boolean allTrue(List<Boolean> bools){
return product.size() == bools.size();
}
@Override
public void add(Regexp regexp) {
pushRegexp(regexp);
}
@Override
public Boolean isEmpty() {
return product.isEmpty();
}
}
Again, I've gotten these to work to my satisfaction with my overhead, tokenization and all that good stuff. Now I want to introduce the Kleene-star operation. It matches on any number, even 0, of occurrences in the text. So, ba* would match b, ba, baa, baaa and so on while (ba)* would match on ba, baba, bababa and so on. Does it even look possible to extend my Regexp to this or do you see another way of solving this?
PS: There are getters, setters and all kinds of other support functions that I didn't write out, but this is mainly for you to quickly get the point of how these classes work.

You seem to be trying to use a fallback algorithm to do the parsing. That can work -- although it is easier to do with higher-order functions -- but it is far from the best way to parse regular expressions (by which I mean the things which are mathematically regular expressions, as opposed to the panoply of parsing languages implemented by "regular expression" libraries in various languages).
It's not the best way because the parsing time is not linear in the size of the string to be matched; in fact, it can be exponential. But to understand that, it's important to understand why your current implementation has a problem.
Consider the fairly simple regular expression (ab+a)(bb+a). That can match exactly four strings: abbb, aba, abb, aa. All of those strings start with a, so your concatenation algorithm will match the first concatenand ((ab+a)) at position 1, and proceed to try the second concatenand (bb+a). That will successfully match abb and aa, but it will fail on aba and abbb.
Now, suppose you modified the concatenation function to select the longest matching substring rather than the shortest one. In that case, the first subexpression would match ab in three of the possible strings (all but aa), and the match would fail in the case of abb.
In short, when you are matching a concatenation R·S, you need to do something like this:
Find some initial string which matches R
See if S matches the rest of the text
If not, repeat with another initial string which matches R
In the case of full regular expression matches, it doesn't matter which order we list matches for R, but usually we're trying to find the longest substring which matches a regular expression, so it is convenient to enumerate the possible matches from longest to shortest.
Doing that means that we need to be able to restart a match after a downstream failure, to find the "next match". That's not terribly complicated, but it definitely complicates the interface, because all of the compound regular expression operators need to "pass through" the failure to their children in order to find the next alternative. That is, the operator R+S might first find something which matches R. If asked for the next possibility, it first has to ask R if there is another string which it could match, before moving on to S. (And that's passing over the question of how to get + to list the matches in order by length.)
With such an implementation, it's easy to see how to implement the Kleene star (R*), and it is also easy to see why it can take exponential time. One possible implementation:
First, match as many R as possible.
If asked for another match: ask the last R for another match
If there are no more possibilities, drop the last R from the list, and ask what is now the last R for another match
If none of that worked, propose the empty string as a match
Fail
(This can be simplified with recursion: Match an R, then match an R*. For the next match, first try the next R*; failing that try the next R and the first following R*; when all else fails, try the empty string.)
Implementing that is an interesting programming exercise, so I encourage you to continue. But be aware that there are better algorithms. You might want to read Russ Cox's interesting essays on regular expression matching.
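To make the recursive formulation a bit more concrete, here is one possible sketch using a continuation: each node is told where to start and is handed a callback that says whether the rest of the match succeeds from a given end position. This is not the class design from the question; the names `Node`, `Lit`, `Alt`, `Seq` and `Star` are invented for the illustration, and the matcher can still take exponential time on unlucky inputs.

```java
import java.util.function.IntPredicate;

interface Node {
    // Tries every way this node can match at pos; succeeds as soon as
    // next accepts one of the resulting end positions.
    boolean match(String text, int pos, IntPredicate next);

    static boolean fullMatch(Node n, String text) {
        return n.match(text, 0, end -> end == text.length());
    }
}

final class Lit implements Node {
    private final char c;
    Lit(char c) { this.c = c; }
    public boolean match(String text, int pos, IntPredicate next) {
        return pos < text.length() && text.charAt(pos) == c && next.test(pos + 1);
    }
}

final class Alt implements Node { // union, written R+S in the question
    private final Node left, right;
    Alt(Node left, Node right) { this.left = left; this.right = right; }
    public boolean match(String text, int pos, IntPredicate next) {
        // If the left branch leads to a downstream failure, fall back to the right one.
        return left.match(text, pos, next) || right.match(text, pos, next);
    }
}

final class Seq implements Node { // concatenation R·S
    private final Node first, second;
    Seq(Node first, Node second) { this.first = first; this.second = second; }
    public boolean match(String text, int pos, IntPredicate next) {
        // Every end position of first is offered to second before giving up.
        return first.match(text, pos, mid -> second.match(text, mid, next));
    }
}

final class Star implements Node { // Kleene star R*
    private final Node inner;
    Star(Node inner) { this.inner = inner; }
    public boolean match(String text, int pos, IntPredicate next) {
        // Match an R, then an R*; the mid > pos check stops empty matches from looping forever.
        // When all else fails, try the empty string.
        return inner.match(text, pos, mid -> mid > pos && match(text, mid, next))
                || next.test(pos);
    }
}
```

For example, new Seq(new Lit('b'), new Star(new Lit('a'))) models ba*, and wrapping a Seq in a Star models (ba)*. The "ask for the next match" protocol described above is implicit here: returning false from the callback is what triggers the next alternative.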

Highest performance for finding substrings

I have an array of strings (keywords), and I need to check how many of those strings exist within a larger string (text read from a file). I need the check to be case insensitive.
At this moment what I do is this:
private void findKeywords() {
String body = email.getMessage();
for (String word : keywords) {
if (body.toLowerCase().contains(word.toLowerCase())) {
//some actions
}
if (email.getSubject().contains(word)) {
//some actions
}
}
}
From reading questions in here another solution came up:
private void findKeywords() {
String body = email.getMessage();
for (String word : keywords) {
boolean body_match = Pattern.compile(Pattern.quote(word), Pattern.CASE_INSENSITIVE).matcher(body).find();
boolean subject_match = Pattern.compile(Pattern.quote(word), Pattern.CASE_INSENSITIVE).matcher(email.getSubject()).find();
if (body_match) {
rating++;
}
if (subject_match) {
rating++;
}
}
}
Which of these solutions is more efficient? Also, is there another way to do this that is better? Any accepted solution must be simple to implement (on par with the above) and preferably without external libraries, as this is not a very important issue in this case.

Both of the solutions seem viable to me. One improvement I would suggest is moving functions out of the loop. In your current code you are repeatedly doing actions such as toLowerCase() and Pattern.compile which you only need to do once.
Obviously there are much faster methods to solve this problem, but they require much more complex code than these 5-liners.
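As a small illustration of moving the repeated work out of the loop, the body could be lower-cased once before iterating. The class and method names here (`KeywordCounter`, `countKeywords`) are invented for the sketch, and Locale.ROOT is my own choice to keep the case conversion independent of the default locale.

```java
import java.util.List;
import java.util.Locale;

class KeywordCounter {
    // Counts how many keywords occur in body, ignoring case.
    static int countKeywords(String body, List<String> keywords) {
        String lowerBody = body.toLowerCase(Locale.ROOT); // done once, not once per keyword
        int hits = 0;
        for (String word : keywords) {
            if (lowerBody.contains(word.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return hits;
    }
}
```

If the keyword list is fixed, the lower-cased keywords could likewise be prepared once up front instead of on every call.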

Better: build a single pattern with all keywords. Then search on that pattern. Assuming your keywords do not contain meta-characters (characters with special meanings in patterns), then use:
StringBuilder keywordRegex = new StringBuilder();
for (String w : keywords) {
keywordRegex.append("|"+w);
}
Pattern p = Pattern.compile(keywordRegex.substring(1));
Matcher m = p.matcher(textToMatch);
while (m.find()) {
// match is at m.start(); word is m.group(0);
}
Much more efficient than iterating through all keywords: pattern compilation (once) will have generated an automaton that looks for all keywords at once.
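If the keywords might contain meta-characters, or the search has to be case insensitive as in the question, a variant along these lines could be used. This is only a sketch under the assumption that the keyword list is non-empty; the names `CombinedKeywordPattern` and `findKeywords` are made up.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class CombinedKeywordPattern {
    // Returns the distinct keywords (lower-cased) that occur somewhere in text.
    static Set<String> findKeywords(String text, List<String> keywords) {
        // Pattern.quote makes each keyword literal, so '.' or '+' lose their special meaning.
        String regex = keywords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        Set<String> found = new HashSet<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            found.add(m.group().toLowerCase(Locale.ROOT));
        }
        return found;
    }
}
```

One caveat: find() reports non-overlapping matches, so a keyword that only occurs inside or overlapping another keyword's match can be missed.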

I think the explicit regex solution you mentioned would be more efficient since it doesn't have the toLowerCase operation, which would copy the input string in memory and make chars lowercase.
Both solutions should be practical and your question is mostly academic, but I think the regexes provide cleaner code.

If your email bodies are very large, writing a specialized case-insensitive contains may be justified, because you can avoid calling toUpperCase() on big strings:
static boolean containsIgnoreCase(String big, String small) {
if (small == null || big == null || small.length() > big.length()) {
return false;
}
String smallLC = small.toLowerCase();
String smallUC = small.toUpperCase();
for (int i = 0; i < big.length(); ++i) {
if (matchesAt(big, i, smallLC, smallUC)) {
return true;
}
}
return false;
}
private static boolean matchesAt(String big, int index, String lc, String uc) {
if (index + lc.length() > big.length()) {
return false;
}
for (int i = 0; i < lc.length(); ++i) {
char c = big.charAt(i + index);
if ((c != lc.charAt(i)) && (c != uc.charAt(i))) {
return false;
}
}
return true;
}
