RegEx to find URLs in HTML takes 25 seconds in Java/Android

In Android/Java, given a website's HTML source code, I would like to extract all XML and CSV file paths.
What I am doing (with RegEx) is this:
final HashSet<String> urls = new HashSet<String>();
final Pattern urlRegex = Pattern.compile(
"[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|].(xml|csv)");
final Matcher url = urlRegex.matcher(htmlString);
while (url.find()) {
urls.add(makeAbsoluteURL(url.group(0)));
}
public String makeAbsoluteURL(String url) {
if (url.startsWith("http://") || url.startsWith("https://")) {
return url;
}
else if (url.startsWith("/")) {
return mRootURL+url.substring(1);
}
else {
return mBaseURL+url;
}
}
Unfortunately, this runs for about 25 seconds for an average website with normal length. What is going wrong? Is my RegEx just bad? Or is RegEx just so slow?
Can I find the URLs faster without RegEx?
Edit:
The source for the valid characters was (roughly) this answer. However, I think the two character classes (square brackets) must be swapped so that you have a more limited character set for the first char of the URL and a broader character class for all remaining chars. This was the intention.

Your regex is written in a way that makes it slow for long inputs.
The * operator is greedy.
For instance for input:
http://stackoverflow.com/questions/19019504/regex-to-find-urls-in-html-takes-25-seconds-in-java-android.xml
The [-a-zA-Z0-9+&@#/%?=~_|!:,.;]* part of the regex will consume the whole string. It will then try to match the next character class, which fails (since the whole string is consumed). It then backtracks the first part by one character and tries the second character class again, which now matches. Next it tries to match the dot and fails because the whole string is consumed again. Another backtrack, and so on...
In essence, your regex forces a lot of backtracking to match anything. It also wastes a lot of time on matches that have no way of succeeding.
For a word like "forest" it first consumes the whole word in the first part of the expression and then repeatedly backtracks after failing to match the rest of the expression. A huge waste of time.
Also:
the . in the regex is unescaped, so it will match ANY character.
url.group(0) is redundant; url.group() has the same meaning.
In order to speed up the regex you need to figure out a way to reduce the amount of backtracking, and it would also help to have a less general start of the match. Right now every single word causes matching to start and, generally, to fail. For instance, in HTML the links typically sit inside a pair of double quotes. If that's the case, you can start your matching at " which will speed it up tremendously. Try to find a better start of the expression.
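One way to read the advice above, as a rough sketch: anchor the match on the opening double quote of an attribute value, so that ordinary words in the page never start a match attempt. The class name QuotedUrlFinder and the assumption that every link sits inside double quotes are illustrative and not part of the question's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedUrlFinder {
    // Assumption: links appear as double-quoted attribute values.
    // The match can only start at a quote, and the character class
    // stops at the next quote, whitespace or angle bracket.
    private static final Pattern QUOTED_FILE =
            Pattern.compile("\"([^\"\\s<>]*\\.(?:xml|csv))\"");

    static List<String> find(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = QUOTED_FILE.matcher(html);
        while (m.find()) {
            urls.add(m.group(1)); // group 1 is the path without the quotes
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/data/feed.xml\">forest</a> "
                + "<img src=\"http://example.com/a.csv\">";
        System.out.println(find(html));
    }
}
```

Backtracking still happens inside each quoted value, but it is bounded by the length of that value rather than by every run of URL-like characters in the page.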

I have nothing to add to the theoretical overview that U Mad gave; he highlighted everything I'd noticed.
What I would like to suggest, considering what you are looking for with the RE, is to change the point of view of your RE :)
You are looking for xml and csv files, so why not reverse the HTML string, for example using:
new StringBuilder("bla bla bla foo letme/find.xml bla bla").reverse().toString()
after that you could look for the pattern:
final Pattern urlRegex = Pattern.compile(
"(vsc|lmx)\\.[-a-zA-Z0-9+&@#/%=~_|][-a-zA-Z0-9+&@#/%?=~_|!:,.;]*");
The urlRegex pattern could be refined as U Mad has already suggested, but this way you reduce the number of failed matches.

I had my doubts whether a String could really be long enough to take 25 seconds to parse. So I tried it, and I must admit that with about 27MB of text, it takes around 25 seconds to parse with the given regular expression.
Being curious, I changed the little test program to use @FabioDch's approach (so please vote for him, if you want to vote anywhere :-)
The result is quite impressive: instead of 25 seconds, @FabioDch's approach needed less than 1 second (100ms to 800ms) + 70ms to 85ms for reversing!
Here's the code I used. It reads text from the largest text file I've found and copies it 10 times to get 27MB of text. Then it runs the regex against it and prints out the results.
@Test
public final void test() throws IOException {
final Pattern urlRegex = Pattern.compile("(lmx|vsc)\\.[-a-zA-Z0-9+&@#/%=~_|][-a-zA-Z0-9+&@#/%?=~_|!:,.;]*");
printTimePassed("initialized");
List<String> lines = Files.readAllLines(Paths.get("testdata", "Aster_Express_User_Guide_0500.txt"), Charset.defaultCharset());
StringBuilder sb = new StringBuilder();
for(int i=0; i<10; i++) { // Copy 10 times to get more useful data
for(String line : lines) {
sb.append(line);
sb.append('\n');
}
}
printTimePassed("loaded: " + lines.size() + " lines, in " + sb.length() + " chars");
String html = sb.reverse().toString();
printTimePassed("reversed");
int i = 0;
final Matcher url = urlRegex.matcher(html);
while (url.find()) {
System.out.println(i++ + ": FOUND: " + new StringBuilder(url.group()).reverse() + ", " + url.start() + ", " + url.end());
}
printTimePassed("ready");
}
private long ms = System.currentTimeMillis(); // timestamp of the previous checkpoint
private void printTimePassed(String msg) {
long current = System.currentTimeMillis();
System.out.printf("%s: took %d ms\n", msg, (current - ms));
ms = current;
}

I would suggest only using the regex to find the file extensions (.xml or .csv). This should be a lot faster, and when one is found, you can look backwards, examining each preceding character and stopping when you reach one that couldn't be in a URL. See below:
final HashSet<String> urls = new HashSet<String>();
final Pattern fileExtRegex = Pattern.compile("\\.(xml|csv)");
final Matcher fileExtMatcher = fileExtRegex.matcher(htmlString);
// Find next occurrence of ".xml" or ".csv" in htmlString
while (fileExtMatcher.find()) {
// Go backwards from the character just before the file extension
int dotPos = fileExtMatcher.start() - 1;
int charPos = dotPos;
while (charPos >= 0) {
// Break if current character is not a valid URL character
char chr = htmlString.charAt(charPos);
if (!((chr >= 'a' && chr <= 'z') ||
(chr >= 'A' && chr <= 'Z') ||
(chr >= '0' && chr <= '9') ||
chr == '-' || chr == '+' || chr == '&' || chr == '@' ||
chr == '#' || chr == '/' || chr == '%' || chr == '?' ||
chr == '=' || chr == '~' || chr == '_' || chr == '|' ||
chr == '!' || chr == ':' || chr == ',' || chr == '.' ||
chr == ';')) {
break;
}
charPos--;
}
// Extract/add URL if there are valid URL characters before file extension
if ((dotPos > 0) && (charPos < dotPos)) {
String url = htmlString.substring(charPos + 1, fileExtMatcher.end());
urls.add(makeAbsoluteURL(url));
}
}
Small disclaimer: I used part of your original regex for the valid URL characters: [-a-zA-Z0-9+&@#/%?=~_|!:,.;]. I haven't verified whether this is comprehensive, and there are perhaps further improvements that could be made, e.g. it would currently find local file paths (e.g. C:\TEMP\myfile.xml) as well as URLs. I wanted to keep the code above simple to demonstrate the technique, so I haven't tackled this.
EDIT: Following the comment about efficiency, I've modified the code to no longer use a regex for checking valid URL characters. Instead, it compares the character against the valid ranges manually. Uglier code, but it should be faster...

I know people love to use regex to parse html, but have you considered using jsoup?

For sake of clarity I created a separate answer for this regex:
Edited to escape the dot and remove the reluctant quantifier.
(?<![-a-zA-Z0-9+&@#/%=~_|])[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]\\.(xml|csv)
Please try this one and tell me how it goes.
Also here's a class which will enable you to search a reversed string without actually reversing it:
public class ReversedString implements CharSequence {
public ReversedString(String input) {
this.s = input;
this.len = s.length();
}
private final String s;
private final int len;
@Override
public CharSequence subSequence(final int start, final int end) {
return new CharSequence() {
@Override
public CharSequence subSequence(int start, int end) {
throw new UnsupportedOperationException();
}
@Override
public int length() {
return end-start;
}
@Override
public char charAt(int index) {
return s.charAt(len-start-index-1);
}
@Override
public String toString() {
StringBuilder buf = new StringBuilder(end-start);
for(int i = start;i < end;i++) {
buf.append(s.charAt(len-i-1));
}
return buf.toString();
}
};
}
@Override
public int length() {
return len;
}
@Override
public char charAt(int index) {
return s.charAt(len-1-index);
}
}
You can use this class as such:
pattern.matcher(new ReversedString(inputString));

Related

Split a string by commas except when the comma is part of the sentence [duplicate]

I have a string vaguely like this:
foo,bar,c;qual="baz,blurb",d;junk="quux,syzygy"
that I want to split by commas -- but I need to ignore commas in quotes. How can I do this? Seems like a regexp approach fails; I suppose I can manually scan and enter a different mode when I see a quote, but it would be nice to use preexisting libraries. (edit: I guess I meant libraries that are already part of the JDK or already part of a commonly-used libraries like Apache Commons.)
the above string should split into:
foo
bar
c;qual="baz,blurb"
d;junk="quux,syzygy"
note: this is NOT a CSV file, it's a single string contained in a file with a larger overall structure

Try:
public class Main {
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}
Output:
> foo
> bar
> c;qual="baz,blurb"
> d;junk="quux,syzygy"
In other words: split on the comma only if that comma has zero, or an even number of quotes ahead of it.
Or, a bit friendlier for the eyes:
public class Main {
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String otherThanQuote = " [^\"] ";
String quotedString = String.format(" \" %s* \" ", otherThanQuote);
String regex = String.format("(?x) "+ // enable comments, ignore white spaces
", "+ // match a comma
"(?= "+ // start positive look ahead
" (?: "+ // start non-capturing group 1
" %s* "+ // match 'otherThanQuote' zero or more times
" %s "+ // match 'quotedString'
" )* "+ // end group 1 and repeat it zero or more times
" %s* "+ // match 'otherThanQuote'
" $ "+ // match the end of the string
") ", // stop positive look ahead
otherThanQuote, quotedString, otherThanQuote);
String[] tokens = line.split(regex, -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}
which produces the same as the first example.
EDIT
As mentioned by @MikeFHay in the comments:
I prefer using Guava's Splitter, as it has saner defaults (see the discussion above about empty matches being trimmed by String#split()), so I did:
Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))

While I do like regular expressions in general, for this kind of state-dependent tokenization I believe a simple parser (which in this case is much simpler than that word might make it sound) is probably a cleaner solution, in particular with regards to maintainability, e.g.:
String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
List<String> result = new ArrayList<String>();
int start = 0;
boolean inQuotes = false;
for (int current = 0; current < input.length(); current++) {
if (input.charAt(current) == '\"') inQuotes = !inQuotes; // toggle state
else if (input.charAt(current) == ',' && !inQuotes) {
result.add(input.substring(start, current));
start = current + 1;
}
}
result.add(input.substring(start));
If you don't care about preserving the commas inside the quotes you could simplify this approach (no handling of start index, no last character special case) by replacing your commas in quotes by something else and then split at commas:
String input = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
StringBuilder builder = new StringBuilder(input);
boolean inQuotes = false;
for (int currentIndex = 0; currentIndex < builder.length(); currentIndex++) {
char currentChar = builder.charAt(currentIndex);
if (currentChar == '\"') inQuotes = !inQuotes; // toggle state
if (currentChar == ',' && inQuotes) {
builder.setCharAt(currentIndex, ';'); // or '♡', and replace later
}
}
List<String> result = Arrays.asList(builder.toString().split(","));

http://sourceforge.net/projects/javacsv/
https://github.com/pupi1985/JavaCSV-Reloaded
(fork of the previous library that will allow the generated output to have Windows line terminators \r\n when not running Windows)
http://opencsv.sourceforge.net/
CSV API for Java
Can you recommend a Java library for reading (and possibly writing) CSV files?
Java lib or app to convert CSV to XML file?

I would not advise the regex answer from Bart; I find a parsing solution better in this particular case (as Fabian proposed). I tried both the regex solution and my own parsing implementation and found that:
Parsing is much faster than splitting with a lookahead regex: ~20 times faster for short strings, ~40 times faster for long strings.
The regex split, as called below without a limit argument, fails to find the empty string after the last comma. That was not in the original question though; it was my requirement.
My solution and test below.
String tested = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\",";
long start = System.nanoTime();
String[] tokens = tested.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
long timeWithSplitting = System.nanoTime() - start;
start = System.nanoTime();
List<String> tokensList = new ArrayList<String>();
boolean inQuotes = false;
StringBuilder b = new StringBuilder();
for (char c : tested.toCharArray()) {
switch (c) {
case ',':
if (inQuotes) {
b.append(c);
} else {
tokensList.add(b.toString());
b = new StringBuilder();
}
break;
case '\"':
inQuotes = !inQuotes;
default:
b.append(c);
break;
}
}
tokensList.add(b.toString());
long timeWithParsing = System.nanoTime() - start;
System.out.println(Arrays.toString(tokens));
System.out.println(tokensList.toString());
System.out.printf("Time with splitting:\t%10d\n",timeWithSplitting);
System.out.printf("Time with parsing:\t%10d\n",timeWithParsing);
Of course you are free to change the switch to else-ifs in this snippet if you feel uncomfortable with its ugliness. Note the lack of a break after the quote case: it deliberately falls through to default so the quote itself is appended. StringBuilder was chosen over StringBuffer by design to increase speed, since thread safety is irrelevant here.
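The missing empty string after the last comma comes from how String.split treats trailing empty strings when no limit is passed, rather than from the regex itself. A small sketch of the difference, with illustrative class and method names:

```java
class TrailingEmptyDemo {
    // Without a limit, trailing empty strings are removed from the result.
    static int countDefault(String s) {
        return s.split(",").length;
    }

    // A negative limit keeps trailing empty strings.
    static int countKeepEmpty(String s) {
        return s.split(",", -1).length;
    }

    public static void main(String[] args) {
        System.out.println(countDefault("foo,bar,"));   // 2
        System.out.println(countKeepEmpty("foo,bar,")); // 3
    }
}
```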

You're in that annoying boundary area where regexps almost won't do (as has been pointed out by Bart, escaping the quotes would make life hard), and yet a full-blown parser seems like overkill.
If you are likely to need greater complexity any time soon I would go looking for a parser library. For example this one

I was impatient and chose not to wait for answers... for reference it doesn't look that hard to do something like this (which works for my application, I don't need to worry about escaped quotes, as the stuff in quotes is limited to a few constrained forms):
final static private Pattern splitSearchPattern = Pattern.compile("[\",]");
private List<String> splitByCommasNotInQuotes(String s) {
if (s == null)
return Collections.emptyList();
List<String> list = new ArrayList<String>();
Matcher m = splitSearchPattern.matcher(s);
int pos = 0;
boolean quoteMode = false;
while (m.find())
{
String sep = m.group();
if ("\"".equals(sep))
{
quoteMode = !quoteMode;
}
else if (!quoteMode && ",".equals(sep))
{
int toPos = m.start();
list.add(s.substring(pos, toPos));
pos = m.end();
}
}
if (pos < s.length())
list.add(s.substring(pos));
return list;
}
(exercise for the reader: extend to handling escaped quotes by looking for backslashes also.)

Try a lookaround like (?!\"),(?!\"). This should match , that are not surrounded by ".
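It is worth checking this suggestion against the question's sample string before relying on it. In the sketch below (class name illustrative), the first lookahead sits in front of the comma and therefore always succeeds, and the second only rejects commas that are directly followed by a quote, so the commas inside the quoted parts are still split on.

```java
class LookaroundCheck {
    static String[] split(String line) {
        return line.split("(?!\"),(?!\")");
    }

    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
        for (String t : split(line)) {
            System.out.println("> " + t);
        }
        // Produces six tokens instead of the four the question asks for.
    }
}
```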

The simplest approach is not to match delimiters, i.e. commas, with a complex additional logic to match what is actually intended (the data which might be quoted strings), just to exclude false delimiters, but rather match the intended data in the first place.
The pattern consists of two alternatives, a quoted string ("[^"]*" or ".*?") or everything up to the next comma ([^,]+). To support empty cells, we have to allow the unquoted item to be empty and to consume the next comma, if any, and use the \\G anchor:
Pattern p = Pattern.compile("\\G\"(.*?)\",?|([^,]*),?");
The pattern also contains two capturing groups to get either, the quoted string’s content or the plain content.
Then, with Java 9, we can get an array as
String[] a = p.matcher(input).results()
.map(m -> m.group(m.start(1)<0? 2: 1))
.toArray(String[]::new);
whereas older Java versions need a loop like
for(Matcher m = p.matcher(input); m.find(); ) {
String token = m.group(m.start(1)<0? 2: 1);
System.out.println("found: "+token);
}
Adding the items to a List or an array is left as an exercise to the reader.
For Java 8, you can use the results() implementation of this answer, to do it like the Java 9 solution.
For mixed content with embedded strings, like in the question, you can simply use
Pattern p = Pattern.compile("\\G((\"(.*?)\"|[^,])*),?");
But then, the strings are kept in their quoted form.

what about a one-liner using String.split()?
String s = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] split = s.split( "(?<!\".{0,255}[^\"]),|,(?![^\"].*\")" );

A regular expression is not capable of handling escaped characters. For my application, I needed the ability to escape quotes and spaces (my separator is spaces, but the code is the same).
Here is my solution in Kotlin (the language from this particular application), based on the one from Fabian Steeg:
fun parseString(input: String): List<String> {
val result = mutableListOf<String>()
var inQuotes = false
var inEscape = false
val current = StringBuilder()
for (i in input.indices) {
// If this character is escaped, add it without looking
if (inEscape) {
inEscape = false
current.append(input[i])
continue
}
when (val c = input[i]) {
'\\' -> inEscape = true // escape the next character, \ isn't added to result
',' -> if (inQuotes) {
current.append(c)
} else {
result += current.toString()
current.clear()
}
'"' -> inQuotes = !inQuotes
else -> current.append(c)
}
}
if (current.isNotEmpty()) {
result += current.toString()
}
return result
}
I think this is not a place to use regular expressions. Contrary to other opinions, I don't think a parser is overkill. It's about 20 lines and fairly easy to test.

Rather than use lookahead and other crazy regex, just pull out the quotes first. That is, for every quote grouping, replace that grouping with __IDENTIFIER_1 or some other indicator, and map that grouping to a map of string,string.
After you split on comma, replace all mapped identifiers with the original string values.
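A minimal sketch of that placeholder idea, assuming quotes are never escaped and that the input does not already contain text of the form __IDENTIFIER_n__. The class name PlaceholderSplitter and the exact placeholder format are illustrative choices.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderSplitter {
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"");

    static List<String> split(String input) {
        // Step 1: swap every quoted group for a unique placeholder.
        Map<String, String> saved = new LinkedHashMap<>();
        Matcher m = QUOTED.matcher(input);
        StringBuffer masked = new StringBuffer();
        int n = 0;
        while (m.find()) {
            String key = "__IDENTIFIER_" + (n++) + "__";
            saved.put(key, m.group());
            m.appendReplacement(masked, Matcher.quoteReplacement(key));
        }
        m.appendTail(masked);

        // Step 2: split on commas, then put the quoted groups back.
        List<String> result = new ArrayList<>();
        for (String part : masked.toString().split(",", -1)) {
            for (Map.Entry<String, String> e : saved.entrySet()) {
                part = part.replace(e.getKey(), e.getValue());
            }
            result.add(part);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(split("foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\""));
    }
}
```

The trailing underscores in the placeholder keep __IDENTIFIER_1__ from being a prefix of __IDENTIFIER_10__ when the values are restored.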

I would do something like this:
boolean foundQuote = false;
if(currentString.charAt(currentStringIndex) == '"')
{
foundQuote = true;
}
if(foundQuote == true)
{
//do nothing
}
else
{
String[] split = currentString.split(",");
}

Deleting all regex instances starting with char '[' and ending with char ']' from a String

I need to take a String and delete every substring in it that starts with the character '[' and ends with the character ']'.
I don't know how to tackle this problem. I tried converting the String to a character array, overwriting every character from each opening '[' to its closing ']' with a blank, and then converting it back to a String using the toString() method.
MyCode:
char[] lyricsArray = lyricsParagraphElements.get(1).text().toCharArray();
for (int i = 0;i < lyricsArray.length;i++)
{
if (lyricsArray[i] == '[')
{
lyricsArray[i] = ' ';
for (int j = i + 1;j < lyricsArray.length;j++)
{
if (lyricsArray[j] == ']')
{
lyricsArray[j] = ' ';
i = j + 1;
break;
}
lyricsArray[j] = ' ';
}
}
}
String songLyrics = lyricsArray.toString();
System.out.println(songLyrics);
But in the print line of songLyrics i get weird stuff like
[C@71bc1ae4
[C@6ed3ef1
[C@2437c6dc
[C@1f89ab83
[C@e73f9ac
[C@61064425
[C@7b1d7fff
[C@299a06ac
[C@383534aa
[C@6bc168e5
I guess there is a simple method for it. Any help will be very appreciated.
For clarification:
converting "abcd[dsadsadsa]efg[adf%#1]d" Into "abcdefgd"

Or simply use a regular expression to replace all occurrences of \\[.*?\\] with nothing:
String songLyrics = text.replaceAll("\\[.*?\\]", "");
Where text is of course:
String text = lyricsParagraphElements.get(1).text();
What does \\[.*?\\] mean?
The first parameter of replaceAll is a string describing a regular expression. A regular expression defines a pattern to match in a string.
So let's split it up:
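\\[ matches exactly the character [. Since [ has a special meaning within a regular expression, it needs to be escaped with a backslash, and that backslash must itself be escaped inside a Java string literal (hence the double backslash).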
. matches any character, combine this with the (lazy) zero-or-more operator *?, and it will match any character until it finally finds:
\\], which matches the character ]. Note the escaping again.
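The lazy quantifier matters as soon as the text contains more than one bracketed group. A short comparison using the example string from the question (the class and method names are illustrative):

```java
class BracketStripper {
    // Greedy: .* runs from the first '[' to the LAST ']'.
    static String stripGreedy(String text) {
        return text.replaceAll("\\[.*\\]", "");
    }

    // Lazy: .*? stops at the first ']' after each '['.
    static String stripLazy(String text) {
        return text.replaceAll("\\[.*?\\]", "");
    }

    public static void main(String[] args) {
        String text = "abcd[dsadsadsa]efg[adf%#1]d";
        System.out.println(stripGreedy(text)); // abcdd
        System.out.println(stripLazy(text));   // abcdefgd
    }
}
```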

Your code below calls toString() on the char array, which returns the array's type and hash code rather than its contents, and you are then printing that value.
String songLyrics = lyricsArray.toString();
System.out.println(songLyrics);
Replace above two lines with
String songLyrics = new String(lyricsArray);
System.out.println(songLyrics);
Another way, without converting to a char array and back to a String:
String lyricsParagraphElements = "asdasd[asd]";
String songLyrics = lyricsParagraphElements.replaceAll("\\[.*?\\]", ""); // lazy, so each [...] group is removed separately
System.out.println(songLyrics);

You're printing a char[] and Java char[] does not override toString(). And, a Java String is immutable, but Java does have StringBuilder which is mutable (and StringBuilder.delete(int, int) can remove arbitrary substrings). You could use it like,
String songLyrics = lyricsParagraphElements.get(1).text();
StringBuilder sb = new StringBuilder(songLyrics);
int p = 0;
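while ((p = sb.indexOf("[", p)) >= 0) {
int e = sb.indexOf("]", p + 1);
if (e < 0) {
break; // no closing bracket left, nothing more to delete
}
sb.delete(p, e + 1);
// do not advance p: the text after the deleted group has shifted into position p
}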
System.out.println(sb);

You are getting "weird stuff" because you are printing the string representation of the array, not converting the array to a String.
Instead of lyricsArray.toString(), use
new String(lyricsArray);
But if you do this, you will find that you are not actually removing characters from the string, just replacing them with spaces.
Instead, you can shift all of the characters left in the array, and construct the new String only up to the right number of characters:
int src = 0, dst = 0;
while (src < lyricsArray.length) {
while (src < lyricsArray.length && lyricsArray[src] != '[') {
lyricsArray[dst++] = lyricsArray[src++];
}
if (src < lyricsArray.length) {
++src;
while (src - 1 < lyricsArray.length && lyricsArray[src - 1] != ']') {
src++;
}
}
}
String lyricsString = new String(lyricsArray, 0, dst);

This is the exact regex string for your case:
\\[([\\w\\%\\#]+)\\]
It gets harder when your plain string contains special symbols. I can't find a shorter regex that avoids listing each special symbol explicitly.
reference: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#cg
================
I read your new case: if the string contains the symbol "-" or anything else in
!"#$%&'()*+,-./:;<=>?@\^_`{|}~
add them (with the prefix "\\") after \\# in my regex string.

Generate new word from wildcard [duplicate]

This question already has answers here:
Returning a list of wildcard matches from a HashMap in java
(3 answers)
Closed 7 years ago.
I'm trying to generate a word from a wildcard pattern and check whether this word is stored in the dictionary database. For example, "appl*" should return apply or apple. However, the problem comes in when I have 2 wildcards: "app**" will make words like appaa, appbb..appzz... instead of apple. The second if condition is just for a regular string that contains no wildcard "*".
public static boolean printWords(String s) {
String tempString, tempChar;
if (s.contains("*")) {
for (char c = 'a'; c <= 'z'; c++) {
tempChar = Character.toString(c);
tempString = s.replace("*", tempChar);
if (myDictionary.containsKey(tempString) == true) {
System.out.println(tempString);
}
}
}
if (myDictionary.containsKey(s) == true) {
System.out.println(s);
return true;
} else {
return false;
}
}

You're only using a single for loop over characters, and replacing all instances of * with that character. See the API for String.replace here. So it's no surprise that you're getting strings like Appaa, Appbb, etc.
If you want to actually use Regex expressions, then you shouldn't be doing any String.replace or contains, etc. etc. See Anubian's answer for how to handle your problem.
If you're treating this as a String exercise and don't want to use regular expressions, the easiest way to do what you're actually trying to do (try all combinations of letters for each wildcard) is to do it recursively. If there are no wild cards left in the string, check if it is a word and if so print. If there are wild cards, try each replacement of that wildcard with a character, and recursively call the function on the created string.
public static void printWords(String s){
int firstAsterisk = s.indexOf("*");
if(firstAsterisk == -1){ // doesn't contain asterisk
if (myDictionary.containsKey(s))
System.out.println(s);
return;
}
for(char c = 'a'; c <= 'z'; c++){
String s2 = s.substring(0, firstAsterisk) + c + s.substring(firstAsterisk + 1);
printWords(s2);
}
}
The base case relies on the indexOf function: when indexOf returns -1, it means that the given substring (in our case "*") does not occur in the string, so there are no more wildcards to replace.
The substring part basically recreates the original string with the first asterisk replaced with a character. So supposing that s = "abcd**ef" and c='z', we know that firstAsterisk = 4 (Strings are 0-indexed, index 4 has the first "*"). Thus,
String s2 = s.substring(0, firstAsterisk) + c + s.substring(firstAsterisk + 1);
= "abcd" + 'z' + "*ef"
= "abcdz*ef"

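The * character is a quantifier in regex rather than a single-character wildcard, so translate each * to . (which matches any one character) and then treat the input string as a regular expression:
for (String word : myDictionary.keySet()) {
if (word.matches(s.replace("*", "."))) {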
System.out.println(word);
}
}
Let the libraries do the heavy lifting for you ;)

With your approach you have to check all possible combinations.
A better way would be to make a regex out of your input string, i.e. replace every * with a . (dot).
Then you can loop over your myDictionary and check for every entry whether it matches the regex.
Something like this:
Set<String> dict = new HashSet<String>();
dict.add("apple");
String word = "app**";
Pattern pattern = Pattern.compile(word.replace('*', '.'));
for (String entry : dict) {
if (pattern.matcher(entry).matches()) {
System.out.println("matches: " + entry);
}
}
You have to take care if your input string already contains a .: then you have to escape it with a \. (The same goes for other special regex characters.)
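One way to handle that escaping, as a sketch: quote the literal runs with Pattern.quote and emit a bare . only for each wildcard. The class name WildcardPattern is illustrative; the rest uses only java.util.regex.

```java
import java.util.regex.Pattern;

class WildcardPattern {
    // Builds a regex in which every '*' matches exactly one character
    // and everything else is matched literally.
    static Pattern compile(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append('.');
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString());
    }

    public static void main(String[] args) {
        Pattern p = compile("app**");
        System.out.println(p.matcher("apple").matches()); // true
        System.out.println(p.matcher("app").matches());   // false
    }
}
```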
See also
http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html and
http://docs.oracle.com/javase/6/docs/api/java/util/regex/Matcher.html

Efficiently removing specific characters (some punctuation) from Strings in Java?

In Java, what is the most efficient way of removing given characters from a String? Currently, I have this code:
private static String processWord(String x) {
String tmp;
tmp = x.toLowerCase();
tmp = tmp.replace(",", "");
tmp = tmp.replace(".", "");
tmp = tmp.replace(";", "");
tmp = tmp.replace("!", "");
tmp = tmp.replace("?", "");
tmp = tmp.replace("(", "");
tmp = tmp.replace(")", "");
tmp = tmp.replace("{", "");
tmp = tmp.replace("}", "");
tmp = tmp.replace("[", "");
tmp = tmp.replace("]", "");
tmp = tmp.replace("<", "");
tmp = tmp.replace(">", "");
tmp = tmp.replace("%", "");
return tmp;
}
Would it be faster if I used some sort of StringBuilder, or a regex, or maybe something else? Yes, I know: profile it and see, but I hope someone can provide an answer off the top of their head, as this is a common task.

Although \\p{Punct} will specify a wider range of characters than in the question, it does allow for a shorter replacement expression:
tmp = tmp.replaceAll("\\p{Punct}+", "");

Here's a late answer, just for fun.
In cases like this, I would suggest aiming for readability over speed. Of course you can be super-readable but too slow, as in this super-concise version:
private static String processWord(String x) {
return x.replaceAll("[\\]\\[(){},.;!?<>%]", ""); // brackets escaped so the character class is unambiguous in Java
}
This is slow because every time you call this method, the regex will be compiled. So you can pre-compile the regex.
private static final Pattern UNDESIRABLES = Pattern.compile("[\\]\\[(){},.;!?<>%]");
private static String processWord(String x) {
return UNDESIRABLES.matcher(x).replaceAll("");
}
This should be fast enough for most purposes, assuming the JVM's regex engine optimizes the character class lookup. This is the solution I would use, personally.
Now without profiling, I wouldn't know whether you could do better by making your own character (actually codepoint) lookup table:
private static final boolean[] CHARS_TO_KEEP = new boolean[128]; // the array size was missing; 128 (ASCII) is an assumption
Fill this once and then iterate, making your resulting string. I'll leave the code to you. :)
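For reference, a minimal sketch of what such a lookup table could look like. It flags the characters to remove rather than the ones to keep, covers ASCII only (anything above index 127 is kept), and does not lowercase the input; those choices and the name PunctuationStripper are assumptions made for illustration.

```java
class PunctuationStripper {
    // One flag per ASCII character; true means "drop this character".
    private static final boolean[] REMOVE = new boolean[128];

    static {
        for (char c : ",.;!?(){}[]<>%".toCharArray()) {
            REMOVE[c] = true;
        }
    }

    static String processWord(String x) {
        StringBuilder sb = new StringBuilder(x.length());
        for (int i = 0; i < x.length(); i++) {
            char c = x.charAt(i);
            // Characters outside the table are always kept.
            if (c >= REMOVE.length || !REMOVE[c]) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(processWord("f,i.l;t!e?r(e)d {s}t[r]i<n>g%")); // filtered string
    }
}
```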
Again, I wouldn't dive into this kind of optimization. The code has become too hard to read. Is performance that much of a concern? Also remember that modern languages are JITted and after warming up they will perform better, so use a good profiler.
One thing that should be mentioned is that the example in the original question is highly non-performant because you are creating a whole bunch of temporary strings! Unless a compiler optimizes all that away, that particular solution will perform the worst.

You could do something like this:
static String RemovePunct(String input)
{
char[] output = new char[input.length()];
int i = 0;
for (char ch : input.toCharArray())
{
if (Character.isLetterOrDigit(ch) || Character.isWhitespace(ch))
{
output[i++] = ch;
}
}
return new String(output, 0, i);
}
// ...
String s = RemovePunct("This is (a) test string.");
This will likely perform better than using regular expressions, if you find them too slow for your needs.
However, it could get messy fast if you have a long, distinct list of special characters you'd like to remove. In this case regular expressions are easier to handle.
http://ideone.com/mS8Irl

Strings are immutable, so it's not a good idea to modify them repeatedly; try using StringBuilder instead of String and take advantage of its methods. Alternatively, if you know what you are trying to match, work out the regex for it, which may serve you better.

Use String#replaceAll(String regex, String replacement) as
tmp = tmp.replaceAll("[,.;!?(){}\\[\\]<>%]", "");
System.out.println(
"f,i.l;t!e?r(e)d {s}t[r]i<n>g%".replaceAll(
"[,.;!?(){}\\[\\]<>%]", "")); // prints "filtered string"

Right now your code will iterate over all characters of tmp and compare them with all possible characters that you want to remove, so it will use
(number of tmp characters) x (number of characters you want to remove) comparisons.
To optimize your code you could use the short-circuit OR (||) and do something like
StringBuilder sb = new StringBuilder();
for (char c : tmp.toCharArray()) {
if (!(c == ',' || c == '.' || c == ';' || c == '!' || c == '?'
|| c == '(' || c == ')' || c == '{' || c == '}' || c == '['
|| c == ']' || c == '<' || c == '>' || c == '%'))
sb.append(c);
}
tmp = sb.toString();
or like this
StringBuilder sb = new StringBuilder();
char[] badChars = ",.;!?(){}[]<>%".toCharArray();
outer:
for (char strChar : tmp.toCharArray()) {
for (char badChar : badChars) {
if (badChar == strChar)
continue outer;// we skip `strChar` since it is bad character
}
sb.append(strChar);
}
tmp = sb.toString();
This way you still iterate over every character of tmp, but the number of comparisons per character can decrease when it is not % (the last comparison); if the character is a comma, for example, the program gets its result after one comparison.
If I am not mistaken, this is also how a character class ([...]) works, so maybe try it this way
Pattern p = Pattern.compile("[,.;!?(){}\\[\\]<>%]"); //store it somewhere so
//you wont need to compile it again
tmp = p.matcher(tmp).replaceAll("");

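You can do this:
tmp = tmp.replaceAll("\\W", "");
to remove punctuation. The result has to be assigned back, because replaceAll returns a new String. Note that \W matches every character that is not a letter, digit or underscore, so whitespace is removed as well.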
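For comparison, here is a small sketch of what \W strips versus the predefined class \p{Punct} from java.util.regex (the class and method names are invented for the example):

```java
// Hypothetical demo class comparing two predefined regex classes.
final class PunctuationDemo {
    /** Removes everything that is not a letter, digit or underscore (spaces go too). */
    static String stripNonWord(String s) {
        return s.replaceAll("\\W", "");
    }

    /** Removes ASCII punctuation only; whitespace stays, but '_' counts as punctuation. */
    static String stripPunct(String s) {
        return s.replaceAll("\\p{Punct}", "");
    }

    public static void main(String[] args) {
        String tmp = "Hello, world! (test_1)";
        System.out.println(stripNonWord(tmp)); // prints "Helloworldtest_1"
        System.out.println(stripPunct(tmp));   // prints "Hello world test1"
    }
}
```

Which of the two fits depends on whether spaces and underscores should survive.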

Find whole words without regex

I need to find whole words in a sentence, but without using regular expressions. So if I wanted to find the word "the" in this sentence: "The quick brown fox jumps over the lazy dog", I'm currently using:
String text = "the, quick brown fox jumps over the lazy dog";
String keyword = "the";
Matcher matcher = Pattern.compile("\\b"+keyword+"\\b").matcher(text);
Boolean contains = matcher.find();
but if I used:
Boolean contains = text.contains(keyword);
and padded the keyword with spaces, it wouldn't find the first "the" in the sentence, because that occurrence is not surrounded by whitespace: it sits at the start of the string and is followed by punctuation.
To be clear, I'm building an Android app and I'm getting memory leaks, possibly because I'm using a regular expression in a ListView, so a regular-expression match is performed once for every item in the ListView.

If you needed to check for multiple words and do it without regular expressions you could use StringTokenizer with a space as the delimiter.
You could then build a custom search method. Otherwise, the other solutions using String.contains() or String.indexOf() qualify.
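A custom search method along those lines might look like the sketch below. The names TokenSearch and containsWord are hypothetical, and adding punctuation to the delimiter set is an assumption made here so that "the," still yields the token "the":

```java
import java.util.StringTokenizer;

// Hypothetical helper built on StringTokenizer.
final class TokenSearch {
    /** Returns true if keyword appears as a whole token of text. */
    static boolean containsWord(String text, String keyword) {
        // Assumption: besides the space, common punctuation also separates words.
        StringTokenizer tokens = new StringTokenizer(text, " ,.;!?");
        while (tokens.hasMoreTokens()) {
            if (tokens.nextToken().equals(keyword)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "the, quick brown fox jumps over the lazy dog";
        System.out.println(containsWord(text, "the")); // prints "true"
        System.out.println(containsWord(text, "he"));  // prints "false"
    }
}
```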

What you do is search for "the". Then for each match you test to see if the surrounding characters are white space (or punctuation), or if the match is at the beginning / end of the string respectively.
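A minimal sketch of that idea, assuming a "word character" means a letter, digit or underscore (roughly what \b uses); WholeWordFinder and indexOfWholeWord are names invented for the example:

```java
// Hypothetical helper: whole-word search with indexOf plus boundary checks.
final class WholeWordFinder {
    /** Returns the index of the first whole-word occurrence of word in text, or -1. */
    static int indexOfWholeWord(String text, String word) {
        if (word.isEmpty()) {
            return -1;
        }
        int from = 0;
        while (true) {
            int start = text.indexOf(word, from);
            if (start < 0) {
                return -1; // no further candidates
            }
            int end = start + word.length(); // index just past the candidate
            boolean leftOk = start == 0 || !isWordChar(text.charAt(start - 1));
            boolean rightOk = end == text.length() || !isWordChar(text.charAt(end));
            if (leftOk && rightOk) {
                return start;
            }
            from = start + 1; // keep looking after this candidate
        }
    }

    // Assumption: letters, digits and '_' belong to words; everything else is a boundary.
    private static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    public static void main(String[] args) {
        System.out.println(indexOfWholeWord("the, quick brown fox", "the")); // prints "0"
        System.out.println(indexOfWholeWord("other then the", "the"));       // prints "11"
        System.out.println(indexOfWholeWord("bathe", "the"));                // prints "-1"
    }
}
```

No Pattern or Matcher objects are created, which may matter when the check runs once per list item.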

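public int findWholeWord(final String text, final String searchString) {
    // Pad both strings so that a word at the very start or end is also surrounded by spaces.
    return (" " + text + " ").indexOf(" " + searchString + " ");
}
This will give you the index of the first occurrence of the word "the" that is delimited by spaces (or by the start or end of the string), or -1 if there is none. A "the" followed by punctuation, as in "the,", is not matched.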

Split the string on space, and then see if the resulting array contains your word.
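In code this could be as short as the sketch below (SplitSearch and containsWord are made-up names). Keep in mind that String.split formally takes a regular expression as its argument, and that tokens keep any punctuation attached to them:

```java
import java.util.Arrays;

// Hypothetical helper: split on single spaces, then look for an exact token.
final class SplitSearch {
    static boolean containsWord(String text, String keyword) {
        return Arrays.asList(text.split(" ")).contains(keyword);
    }

    public static void main(String[] args) {
        System.out.println(containsWord("quick brown fox jumps over the lazy dog", "the")); // prints "true"
        System.out.println(containsWord("the, quick brown fox", "the")); // prints "false" (token is "the,")
    }
}
```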

Simply iterate over the characters and keep storing them in a char buffer. Every time you see a whitespace, empty the buffer into a list of words and go on till you reach the end.
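One way to write that loop, as a sketch with hypothetical names (ManualTokenizer, words); it splits on whitespace only, exactly as described, so punctuation stays glued to the words:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects whitespace-separated words by hand.
final class ManualTokenizer {
    static List<String> words(String text) {
        List<String> words = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                if (buffer.length() > 0) {   // end of a word: empty the buffer into the list
                    words.add(buffer.toString());
                    buffer.setLength(0);
                }
            } else {
                buffer.append(c);
            }
        }
        if (buffer.length() > 0) {           // last word has no trailing whitespace
            words.add(buffer.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("the quick  brown fox")); // prints "[the, quick, brown, fox]"
    }
}
```

Checking for a whole word is then a call to contains on the returned list.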

In the comments of the StringTokenizer.class:
StringTokenizer is a legacy class that is retained for
compatibility reasons although its use is discouraged in new code. It is
recommended that anyone seeking this functionality use the split
method of String or the java.util.regex package instead.
The following example illustrates how the String.split
method can be used to break up a string into its basic tokens:
String[] result = "this is a test".split("\\s");
for (int x=0; x<result.length; x++)
System.out.println(result[x]);
prints the following output:
this
is
a
test
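Iterate through the resulting string array, test each token for equality with your word, and keep a count.
int count = 0;
for (String s : result)
{
    if (s.equals(keyword)) // keyword is the word you are looking for
        count++;
}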
If this is a homework assignment, tell your lecturer to read up on Java, times have changed. I remember having the exact same stupid questions during school and it does nothing to prepare you for the real world.

I have a project that requires whole-word matching, but I can't use regular expressions (because regular expressions escape some keywords). So I wrote my own code that simulates \bxxx\b without regular expressions. I only know C#, and it works fine there:
using System;
using System.Text.RegularExpressions;

public static class Finder
{
public static bool Find(string? input, string? pattern, bool isMatchCase = false, bool isMatchWholeWord = false, bool isMatchRegex = false)
{
if (String.IsNullOrWhiteSpace(input) || String.IsNullOrWhiteSpace(pattern))
{
return false;
}
if (!isMatchCase && !isMatchRegex)
{
input = input.ToLower();
pattern = pattern.ToLower();
}
if (isMatchWholeWord && !isMatchRegex)
{
int len = pattern.Length;
int suffix = 0;
while (true)
{
int start = input.IndexOf(pattern, suffix);
if (start == -1)
{
return false;
}
int end = start + len - 1;
int prefix = start - 1;
suffix = end + 1;
bool isPrefixMatched, isSuffixMatched;
if (start == 0)
{
isPrefixMatched = true;
}
else
{
isPrefixMatched = IsWord(input[prefix]) != IsWord(input[start]);
}
if (end == input.Length - 1)
{
isSuffixMatched = true;
}
else
{
isSuffixMatched = IsWord(input[suffix]) != IsWord(input[end]);
}
if (isPrefixMatched && isSuffixMatched)
{
return true;
}
}
}
if (isMatchRegex)
{
if (isMatchWholeWord)
{
if (!pattern.StartsWith(@"\b"))
{
pattern = $@"\b{pattern}";
}
if (!pattern.EndsWith(@"\b"))
{
pattern = $@"{pattern}\b";
}
}
}
return Regex.IsMatch(input, pattern, isMatchCase ? RegexOptions.None : RegexOptions.IgnoreCase);
}
return input.Contains(pattern);
}
private static bool IsWord(char ch)
{
return Char.IsLetterOrDigit(ch) || ch == '_';
}
}
