Efficiently removing specific characters (some punctuation) from Strings in Java?


In Java, what is the most efficient way of removing given characters from a String? Currently, I have this code:
private static String processWord(String x) {
String tmp;
tmp = x.toLowerCase();
tmp = tmp.replace(",", "");
tmp = tmp.replace(".", "");
tmp = tmp.replace(";", "");
tmp = tmp.replace("!", "");
tmp = tmp.replace("?", "");
tmp = tmp.replace("(", "");
tmp = tmp.replace(")", "");
tmp = tmp.replace("{", "");
tmp = tmp.replace("}", "");
tmp = tmp.replace("[", "");
tmp = tmp.replace("]", "");
tmp = tmp.replace("<", "");
tmp = tmp.replace(">", "");
tmp = tmp.replace("%", "");
return tmp;
}
Would it be faster if I used some sort of StringBuilder, or a regex, or maybe something else? Yes, I know: profile it and see, but I hope someone can provide an answer off the top of their head, as this is a common task.

Although \\p{Punct} will specify a wider range of characters than in the question, it does allow for a shorter replacement expression:
tmp = tmp.replaceAll("\\p{Punct}+", "");

Here's a late answer, just for fun.
In cases like this, I would suggest aiming for readability over speed. Of course you can be super-readable but too slow, as in this super-concise version:
private static String processWord(String x) {
return x.replaceAll("[\\[\\](){},.;!?<>%]", ""); // brackets escaped: Java treats a bare [ inside a class as a nested class
}
This is slow because every time you call this method, the regex will be compiled. So you can pre-compile the regex.
private static final Pattern UNDESIRABLES = Pattern.compile("[\\[\\](){},.;!?<>%]");
private static String processWord(String x) {
return UNDESIRABLES.matcher(x).replaceAll("");
}
This should be fast enough for most purposes, assuming the JVM's regex engine optimizes the character class lookup. This is the solution I would use, personally.
Now without profiling, I wouldn't know whether you could do better by making your own character (actually codepoint) lookup table:
private static final boolean[] CHARS_TO_KEEP = new boolean[128]; // size is an assumption: covers ASCII only
Fill this once and then iterate, making your resulting string. I'll leave the code to you. :)
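A minimal sketch of that lookup-table idea, for illustration only. The class name, the table size of 128 (ASCII only), and the rule that characters outside the table are kept are all assumptions, not part of the answer above:

```java
import java.util.Arrays;

class LookupTableSketch {
    // Assumption: the table covers ASCII only; anything outside it is kept.
    private static final boolean[] CHARS_TO_KEEP = new boolean[128];

    static {
        // Fill the table once: keep everything, then mark the unwanted characters.
        Arrays.fill(CHARS_TO_KEEP, true);
        for (char c : ",.;!?(){}[]<>%".toCharArray()) {
            CHARS_TO_KEEP[c] = false;
        }
    }

    static String processWord(String x) {
        StringBuilder sb = new StringBuilder(x.length());
        for (int i = 0; i < x.length(); i++) {
            char c = x.charAt(i);
            // One array lookup per character instead of one comparison per unwanted character.
            if (c >= CHARS_TO_KEEP.length || CHARS_TO_KEEP[c]) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(processWord("f,i.l;t!e?r(e)d {s}t[r]i<n>g%")); // prints "filtered string"
    }
}
```

Whether this actually beats the pre-compiled Pattern would still have to be measured.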
Again, I wouldn't dive into this kind of optimization. The code has become too hard to read. Is performance that much of a concern? Also remember that modern languages are JITted and after warming up they will perform better, so use a good profiler.
One thing that should be mentioned is that the example in the original question is highly non-performant because you are creating a whole bunch of temporary strings! Unless a compiler optimizes all that away, that particular solution will perform the worst.

You could do something like this:
static String RemovePunct(String input)
{
char[] output = new char[input.length()];
int i = 0;
for (char ch : input.toCharArray())
{
if (Character.isLetterOrDigit(ch) || Character.isWhitespace(ch))
{
output[i++] = ch;
}
}
return new String(output, 0, i);
}
// ...
String s = RemovePunct("This is (a) test string.");
This will likely perform better than using regular expressions, if you find them too slow for your needs.
However, it could get messy fast if you have a long, distinct list of special characters you'd like to remove. In this case regular expressions are easier to handle.
http://ideone.com/mS8Irl

Strings are immutable, so every replace call creates a new String. For heavy manipulation, try a StringBuilder instead of a String and use its methods. And yes, if you know exactly what you are trying to remove, work out the regex for it; that will usually serve you better than a chain of replace calls.

Use String#replaceAll(String regex, String replacement) as
tmp = tmp.replaceAll("[,.;!?(){}\\[\\]<>%]", "");
System.out.println(
"f,i.l;t!e?r(e)d {s}t[r]i<n>g%".replaceAll(
"[,.;!?(){}\\[\\]<>%]", "")); // prints "filtered string"

Right now your code iterates over all characters of tmp once for every character you want to remove, so it performs roughly
(number of characters in tmp) x (number of characters you want to remove) comparisons.
To optimize your code you could use short circuit OR || and do something like
StringBuilder sb = new StringBuilder();
for (char c : tmp.toCharArray()) {
if (!(c == ',' || c == '.' || c == ';' || c == '!' || c == '?'
|| c == '(' || c == ')' || c == '{' || c == '}' || c == '['
|| c == ']' || c == '<' || c == '>' || c == '%'))
sb.append(c);
}
tmp = sb.toString();
or like this
StringBuilder sb = new StringBuilder();
char[] badChars = ",.;!?(){}[]<>%".toCharArray();
outer:
for (char strChar : tmp.toCharArray()) {
for (char badChar : badChars) {
if (badChar == strChar)
continue outer;// we skip `strChar` since it is bad character
}
sb.append(strChar);
}
tmp = sb.toString();
This way you still iterate over every character of tmp, but the number of comparisons for a given character can be lower: a ',' is resolved after a single comparison, while '%' (the last one checked) or a character that is kept needs all of them.
If I am not mistaken this approach is used with character class ([...]) so maybe try it this way
// store it somewhere so you won't need to compile it again
Pattern p = Pattern.compile("[,.;!?(){}\\[\\]<>%]");
tmp = p.matcher(tmp).replaceAll("");

You can do this:
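tmp = tmp.replaceAll("\\W", "");
to remove punctuation. Note that \\W matches every character other than letters, digits, and underscore, so it also removes whitespace and keeps '_'.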
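To see concretely what each predefined class removes, here is a small comparison of \\W and \\p{Punct} on the same input. The class and method names are made up for this illustration:

```java
class NonWordVersusPunct {
    // \W is [^a-zA-Z0-9_]: it removes spaces too, but leaves underscores alone.
    static String stripNonWord(String s) {
        return s.replaceAll("\\W", "");
    }

    // \p{Punct} is ASCII punctuation only: spaces survive, underscores do not.
    static String stripPunct(String s) {
        return s.replaceAll("\\p{Punct}", "");
    }

    public static void main(String[] args) {
        String s = "Hello, big_world (again)!";
        System.out.println(stripNonWord(s)); // prints "Hellobig_worldagain"
        System.out.println(stripPunct(s));   // prints "Hello bigworld again"
    }
}
```

Neither class matches the exact character list from the question, so an explicit character class is the safer choice if only those characters should go.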

Related

String manipulation of function names

For this Kata, I am given random function names in the PEP8 format and I am to convert them to camelCase.
(input)get_speed == (output)getSpeed ....
(input)set_distance == (output)setDistance
I have an understanding of one way of doing this, written in pseudo-code:
loop through the word,
if the letter is an underscore
then delete the underscore
then get the next letter and change to a uppercase
endIf
endLoop
return the resultant word
But I'm unsure of the best way of doing this. Would it be more efficient to create a char array and loop through the elements, and, on finding an underscore, delete that element and change the next one to uppercase?
Or would it be better to use recursion:
function camelCase takes a string
if the length of the string is 0,
then return the string
endIf
if the character is a underscore
then change to nothing,
then find next character and change to uppercase
return the string taking away the character
endIf
finally return the function taking the first character away
Any thoughts, please? I'm looking for a good, efficient way of handling this problem. Thanks :)
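For comparison, the recursive pseudo-code above could be sketched in Java roughly as follows. The class name and the handling of a trailing underscore (it is simply dropped) are assumptions made for this illustration:

```java
class RecursiveCamelCase {
    static String camelCase(String s) {
        if (s.isEmpty()) {
            return s; // base case: nothing left to process
        }
        if (s.charAt(0) == '_') {
            if (s.length() == 1) {
                return ""; // assumption: a trailing underscore is dropped
            }
            // Drop the underscore, uppercase the next character, recurse on the rest.
            return Character.toUpperCase(s.charAt(1)) + camelCase(s.substring(2));
        }
        // Keep the first character and recurse on the remainder.
        return s.charAt(0) + camelCase(s.substring(1));
    }

    public static void main(String[] args) {
        System.out.println(camelCase("get_speed"));    // prints "getSpeed"
        System.out.println(camelCase("set_distance")); // prints "setDistance"
    }
}
```

Each call creates a new substring and a new stack frame, so this version does more work than a single loop over the characters.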

I would go with this:
divide given String by underscore to array
from second word until end take first letter and convert it to uppercase
join to one word
This will work in O(n) (it goes through the name 3 times). For the first step, use this function:
str.split("_");
For the uppercase step use this:
String newName = str.substring(0, 1).toUpperCase() + str.substring(1);
But make sure you check the length of the string first...
Edited - added implementation
It would look like this:
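public String camelCase(String str) {
if (str == null || str.trim().isEmpty()) return str;
String[] split = str.split("_");
if (split.length == 0) return ""; // input consisted of underscores only
StringBuilder newStr = new StringBuilder(split[0]);
for (int i = 1; i < split.length; i++) {
if (split[i].isEmpty()) continue; // consecutive underscores produce empty parts
newStr.append(Character.toUpperCase(split[i].charAt(0))).append(split[i].substring(1));
}
return newStr.toString();
}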
for inputs:
"test"
"test_me"
"test_me_twice"
it returns:
"test"
"testMe"
"testMeTwice"

It would be simpler to iterate over the string instead of recursing.
String pep8 = "do_it_again";
StringBuilder camelCase = new StringBuilder();
for(int i = 0, l = pep8.length(); i < l; ++i) {
if(pep8.charAt(i) == '_' && (i + 1) < l) {
camelCase.append(Character.toUpperCase(pep8.charAt(++i)));
} else {
camelCase.append(pep8.charAt(i));
}
}
System.out.println(camelCase.toString()); // prints doItAgain

The question you pose is whether to use an iterative or a recursive approach. For this case I'd go for the iterative approach because it's straightforward, easy to understand, and doesn't require many resources (only one array, no new stack frames, etc.), though that doesn't really matter for this example.
Recursion is good for divide-and-conquer problems, but I don't see that fitting the case well, although it's possible.
An iterative implementation of the algorithm you described could look like the following:
StringBuilder buf = new StringBuilder(input);
for(int i = 0; i < buf.length(); i++){
if(buf.charAt(i) == '_'){
buf.deleteCharAt(i);
if(i != buf.length()){ // check for end of string
buf.setCharAt(i, Character.toUpperCase(buf.charAt(i)));
}
}
}
return buf.toString();
The check for the end of the string is not part of the given algorithm and could be omitted if the input string never ends with '_'.

RegEx to find URLs in HTML takes 25 seconds in Java/Android

In Android/Java, given a website's HTML source code, I would like to extract all XML and CSV file paths.
What I am doing (with RegEx) is this:
final HashSet<String> urls = new HashSet<String>();
final Pattern urlRegex = Pattern.compile(
"[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|].(xml|csv)");
final Matcher url = urlRegex.matcher(htmlString);
while (url.find()) {
urls.add(makeAbsoluteURL(url.group(0)));
}
public String makeAbsoluteURL(String url) {
if (url.startsWith("http://") || url.startsWith("https://")) {
return url;
}
else if (url.startsWith("/")) {
return mRootURL+url.substring(1);
}
else {
return mBaseURL+url;
}
}
Unfortunately, this runs for about 25 seconds for an average website with normal length. What is going wrong? Is my RegEx just bad? Or is RegEx just so slow?
Can I find the URLs faster without RegEx?
Edit:
The source for the valid characters was (roughly) this answer. However, I think the two character classes (square brackets) must be swapped so that you have a more limited character set for the first char of the URL and a broader character class for all remaining chars. This was the intention.

Your regex is written in a way that makes it slow for long inputs.
The * operator is greedy.
For instance for input:
http://stackoverflow.com/questions/19019504/regex-to-find-urls-in-html-takes-25-seconds-in-java-android.xml
The [-a-zA-Z0-9+&@#/%?=~_|!:,.;]* part of the regex will consume the whole string. It will then try to match the next character group, which will fail (since the whole string is consumed). It will then backtrack in the match of the first part of the regex by one character and try to match the second character group again. It will match. Then it will try to match the dot and fail because the whole string is consumed. Another backtrack, etc...
In essence your regex is forcing a lot of backtracking to match anything. It will also waste a lot of time on matches that have no way of succeeding.
For the word "forest", it will first consume the whole word in the first part of the expression and then repeatedly backtrack after failing to match the rest of the expression. A huge waste of time.
Also:
the . in regex is unescaped and it will match ANY character.
url.group(0) is redundant. url.group() has same meaning
To speed up the regex, you need to reduce the amount of backtracking, and it would also help to have a less general start of the match. Right now every single word causes matching to start and then, in most cases, fail. For instance, in HTML the links are typically enclosed in double quotes ("). If that is the case, you can start your match at " which will speed it up tremendously. Try to find a better start for the expression.
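As a rough sketch of that advice, the pattern below starts at a double quote, escapes the dot, and forbids quotes and whitespace inside the match so that backtracking stays bounded by the attribute value. The class name, the pattern itself, and the assumption that every wanted path sits inside double quotes are illustrative only; single-quoted or unquoted attributes would be missed:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedLinkFinder {
    // Opening quote, then characters that are neither quote nor whitespace,
    // then a literal dot and the extension, then the closing quote.
    private static final Pattern QUOTED_FILE =
            Pattern.compile("\"([^\"\\s]*\\.(?:xml|csv))\"");

    static List<String> findFiles(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = QUOTED_FILE.matcher(html);
        while (m.find()) {
            found.add(m.group(1)); // group 1 is the path without the quotes
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<a href=\"data/feed.xml\">x</a> <a href=\"/files/report.csv\">y</a> forest.xml";
        System.out.println(findFiles(html)); // prints [data/feed.xml, /files/report.csv]
    }
}
```

The unquoted forest.xml in the sample is deliberately not matched, which is the trade-off of anchoring on the quote.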

I have nothing to add to the theoretical overview by U Mad; he highlighted everything I had noticed.
What I would suggest, considering what you are looking for with the regex, is to change its point of view :)
You are looking for xml and csv files, so why don't you reverse the html string, for example using:
new StringBuilder("bla bla bla foo letme/find.xml bla bla").reverse().toString()
after that you could look for the pattern:
final Pattern urlRegex = Pattern.compile(
"(vsc|lmx)\\.[-a-zA-Z0-9+&@#/%=~_|][-a-zA-Z0-9+&@#/%?=~_|!:,.;]*");
urlRegex pattern could be refined as U Mad has already suggested. But in this way you could reduce the number of failed matches.
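Put together on a small input, the reversed search might look like the sketch below. The class and method names are hypothetical, and the character classes follow the pattern in the question. Note that the first hit in the reversed text is the last file in the original text:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReversedSearchDemo {
    // Reversed pattern: extension first, then the escaped dot, then the path characters.
    private static final Pattern REVERSED_URL = Pattern.compile(
            "(vsc|lmx)\\.[-a-zA-Z0-9+&@#/%=~_|][-a-zA-Z0-9+&@#/%?=~_|!:,.;]*");

    // Returns the last .xml/.csv path in the text, or null if there is none.
    static String lastFile(String text) {
        String reversed = new StringBuilder(text).reverse().toString();
        Matcher m = REVERSED_URL.matcher(reversed);
        if (!m.find()) {
            return null;
        }
        // Reverse the match again to restore the original character order.
        return new StringBuilder(m.group()).reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(lastFile("bla bla bla foo letme/find.xml bla bla")); // prints "letme/find.xml"
    }
}
```

The matcher only starts real work at the literal "lmx" or "vsc", which is why far fewer attempts fail midway.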

I had my doubts whether there could be a String long enough to take 25 seconds to parse. So I tried it, and I must admit that with about 27MB of text, it takes around 25 seconds to parse it with the given regular expression.
Being curious, I changed the little test program to use @FabioDch's approach (so, please vote for him, if you want to vote anywhere :-)
The result is quite impressive: instead of 25 seconds, @FabioDch's approach needed less than 1 second (100ms to 800ms) + 70ms to 85ms for reversing!
Here's the code I used. It reads text from the largest text file I've found and copies it 10 times to get 27MB of text. Then it runs the regex against it and prints out the results.
@Test
public final void test() throws IOException {
final Pattern urlRegex = Pattern.compile("(lmx|vsc)\\.[-a-zA-Z0-9+&@#/%=~_|][-a-zA-Z0-9+&@#/%?=~_|!:,.;]*");
printTimePassed("initialized");
List<String> lines = Files.readAllLines(Paths.get("testdata", "Aster_Express_User_Guide_0500.txt"), Charset.defaultCharset());
StringBuilder sb = new StringBuilder();
for(int i=0; i<10; i++) { // Copy 10 times to get more useful data
for(String line : lines) {
sb.append(line);
sb.append('\n');
}
}
printTimePassed("loaded: " + lines.size() + " lines, in " + sb.length() + " chars");
String html = sb.reverse().toString();
printTimePassed("reversed");
int i = 0;
final Matcher url = urlRegex.matcher(html);
while (url.find()) {
System.out.println(i++ + ": FOUND: " + new StringBuilder(url.group()).reverse() + ", " + url.start() + ", " + url.end());
}
printTimePassed("ready");
}
private long ms = System.currentTimeMillis(); // timestamp of the previous checkpoint
private void printTimePassed(String msg) {
long current = System.currentTimeMillis();
System.out.printf("%s: took %d ms\n", msg, (current - ms));
ms = current;
}

Would suggest only using the regex to find file extensions (.xml or .csv). This should be a lot faster and when found, you can look backwards, examining each character before and stop when you reach one that couldn't be in a URL - see below:
final HashSet<String> urls = new HashSet<String>();
final Pattern fileExtRegex = Pattern.compile("\\.(xml|csv)");
final Matcher fileExtMatcher = fileExtRegex.matcher(htmlString);
// Find next occurrence of ".xml" or ".csv" in htmlString
while (fileExtMatcher.find()) {
// Go backwards from the character just before the file extension
int dotPos = fileExtMatcher.start() - 1;
int charPos = dotPos;
while (charPos >= 0) {
// Break if current character is not a valid URL character
char chr = htmlString.charAt(charPos);
if (!((chr >= 'a' && chr <= 'z') ||
(chr >= 'A' && chr <= 'Z') ||
(chr >= '0' && chr <= '9') ||
chr == '-' || chr == '+' || chr == '&' || chr == '@' ||
chr == '#' || chr == '/' || chr == '%' || chr == '?' ||
chr == '=' || chr == '~' || chr == '_' || chr == '|' ||
chr == '!' || chr == ':' || chr == ',' || chr == '.' ||
chr == ';')) {
break;
}
charPos--;
}
// Extract/add URL if there are valid URL characters before file extension
if ((dotPos > 0) && (charPos < dotPos)) {
String url = htmlString.substring(charPos + 1, fileExtMatcher.end());
urls.add(makeAbsoluteURL(url));
}
}
Small disclaimer: I used part of your original regex for valid URL characters: [-a-zA-Z0-9+&@#/%?=~_|!:,.;]. I haven't verified whether this is comprehensive, and there are perhaps further improvements that could be made, e.g. it would currently find local file paths (e.g. C:\TEMP\myfile.xml) as well as URLs. I wanted to keep the code above simple to demonstrate the technique, so I haven't tackled this.
EDIT Following the comment about efficiency, I've modified the code to no longer use a regex for checking valid URL characters. Instead, it compares the character against valid ranges manually. Uglier code but should be faster...

I know people love to use regex to parse html, but have you considered using jsoup?

For sake of clarity I created a separate answer for this regex:
Edited to escape the dot and remove the reluctant quantifier.
(?<![-a-zA-Z0-9+&@#/%=~_|])[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]\\.(xml|csv)
Please try this one and tell me how it goes.
Also here's a class which will enable you to search a reversed string without actually reversing it:
public class ReversedString implements CharSequence {
public ReversedString(String input) {
this.s = input;
this.len = s.length();
}
private final String s;
private final int len;
@Override
public CharSequence subSequence(final int start, final int end) {
return new CharSequence() {
@Override
public CharSequence subSequence(int start, int end) {
throw new UnsupportedOperationException();
}
@Override
public int length() {
return end-start;
}
@Override
public char charAt(int index) {
return s.charAt(len-start-index-1);
}
@Override
public String toString() {
StringBuilder buf = new StringBuilder(end-start);
for(int i = start;i < end;i++) {
buf.append(s.charAt(len-i-1));
}
return buf.toString();
}
};
}
@Override
public int length() {
return len;
}
@Override
public char charAt(int index) {
return s.charAt(len-1-index);
}
}
You can use this class as such:
pattern.matcher(new ReversedString(inputString));

How to split this "Tree-like" string in Java regex?

This is the string:
String str = "(S(B1)(B2(B21)(B22)(B23))(B3)())";
The content of a child () may be empty, a plain value like those in str, or another pattern of the same form, recursively; each nested () is therefore a subtree.
Expected result:
str1 is "(S(B1))"
str2 is "(B2(B21)(B22)(B23))" //don't expand sons of a son
str3 is "(B3)"
str4 is "()"
str1-4 are e.g. elements in an Array
How to split the string?
I have a similar question: How to split this string in Java regex? But its answer is not good enough for this one.

Regexes do not have sufficient power to parse balanced/nested brackets. This is essentially the same problem as parsing markup languages such as HTML where the consistent advice is to use special parsers, not regexes.
You should parse this as a tree. In overall terms:
Create a stack.
when you hit a "(" push the next chunk onto the stack.
when you hit a ")" pop the stack.
This takes a few minutes to write and will check that your input is well-formed.
This will save you time almost immediately. Trying to manage regexes for this will become more and more complex and will almost inevitably break down.
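The stack idea can be sketched as below. This is one possible reading, not the definitive parser: it stores the index of each open bracket, reports the subtrees directly under the root, and rejects unbalanced input. The class and method names are made up for the example:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class SubTreeSplitter {
    // Returns the bracketed groups that sit directly inside the outermost pair.
    static List<String> directChildren(String s) {
        List<String> children = new ArrayList<>();
        Deque<Integer> openPositions = new ArrayDeque<>(); // indexes of unclosed '('
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                openPositions.push(i);
            } else if (c == ')') {
                if (openPositions.isEmpty()) {
                    throw new IllegalArgumentException("Unmatched ')' at index " + i);
                }
                int open = openPositions.pop();
                // Exactly one bracket still open: we just closed a direct child of the root.
                if (openPositions.size() == 1) {
                    children.add(s.substring(open, i + 1));
                }
            }
        }
        if (!openPositions.isEmpty()) {
            throw new IllegalArgumentException("Unmatched '(' at index " + openPositions.peek());
        }
        return children;
    }

    public static void main(String[] args) {
        System.out.println(directChildren("(S(B1)(B2(B21)(B22)(B23))(B3)())"));
        // prints [(B1), (B2(B21)(B22)(B23)), (B3), ()]
    }
}
```

Grandchildren such as (B21) stay inside their parent's string, matching the "don't expand sons of a son" requirement.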
UPDATE: If you are only concerned with one level then it can be simpler (NOT debugged):
List<String> subTreeList = new ArrayList<String>();
String s = getMyString();
int level = 0;
int lastOpenBracket = -1;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c == '(') {
level++;
if (level == 1) {
lastOpenBracket = i;
}
} else if (c == ')') {
if (level == 1) {
subTreeList.add(s.substring(lastOpenBracket, i + 1)); // i + 1 so the closing bracket is included
}
level--;
}
}
I haven't checked it works, and you should debug it. You should also put checks to make sure you
don't have hanging brackets at the end or strange characters at level == 1;

How to extract specific substring from a bigger string java

I have the following string:
String n = "(.........)(......)(.......)(......) etc"
I want to write a method which will fill a List<String> with every substring of n which is between ( and ) . Thank you in advance!

It can be done in one line:
String[] parts = input.replaceAll("^[^(]*\\(|\\)[^)]*$", "").split("\\)\\(");
The call to replaceAll() strips off the leading and trailing brackets (plus any other junk characters before the first bracket or after the last one), then you just split() on bracket pairs.

I'm not very familiar with the String methods, so I'm sure there's a way that it could be done without having to code it yourself, and just using some fancy method, but here you go:
Tested, works 100% perfect :)
String string = "(stack)(over)(flow)";
ArrayList<String> subStrings = new ArrayList<String>();
for(int c = 0; c < string.length(); c++) {
if(string.charAt(c) == '(') {
c++;
String newString = "";
for(;c < string.length() && string.charAt(c) != ')'; c++) {
newString += string.charAt(c);
}
subStrings.add(newString);
}
}

If the (...) pairs aren't nested, you can use a regular expression in Java. Take a look at the java.util.regex.Pattern class.

I made this regex version, but it's kind of lengthy. I'm sure it could be improved upon. (note: "n" is your input string)
Pattern p = Pattern.compile("\\((.*?)\\)");
Matcher matcher = p.matcher(n);
List<String> list = new ArrayList<String>();
while (matcher.find())
{
list.add(matcher.group(1)); // 1 == stuff between the ()'s
}

This should work:
String in = "(bla)(die)(foo)";
in = in.substring(1, in.length() - 1);
String[] out = in.split(Pattern.quote(")("));

How do I find out if first character of a string is a number?

In Java is there a way to find out if first character of a string is a number?
One way is
string.startsWith("1")
and do the above all the way till 9, but that seems very inefficient.

Character.isDigit(string.charAt(0))
Note that this will allow any Unicode digit, not just 0-9. You might prefer:
char c = string.charAt(0);
boolean isDigit = (c >= '0' && c <= '9');
Or the slower regex solutions:
s.substring(0, 1).matches("\\d")
// or the equivalent
s.substring(0, 1).matches("[0-9]")
However, with any of these methods, you must first be sure that the string isn't empty. If it is, charAt(0) and substring(0, 1) will throw a StringIndexOutOfBoundsException. startsWith does not have this problem.
To make the entire condition one line and avoid length checks, you can alter the regexes to the following:
s.matches("\\d.*")
// or the equivalent
s.matches("[0-9].*")
If the condition does not appear in a tight loop in your program, the small performance hit for using regular expressions is not likely to be noticeable.
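A small helper that folds the empty-string check into the non-regex variants might look like this. The class and method names are hypothetical, and treating null as "no leading digit" is an assumption:

```java
class LeadingDigit {
    // ASCII digits 0-9 only; null and empty strings count as "no leading digit".
    static boolean startsWithAsciiDigit(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        char c = s.charAt(0);
        return c >= '0' && c <= '9';
    }

    // Accepts any Unicode decimal digit, as Character.isDigit does.
    static boolean startsWithUnicodeDigit(String s) {
        return s != null && !s.isEmpty() && Character.isDigit(s.charAt(0));
    }

    public static void main(String[] args) {
        System.out.println(startsWithAsciiDigit("9 lives")); // prints "true"
        System.out.println(startsWithAsciiDigit(""));        // prints "false"
    }
}
```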

Regular expressions are a very powerful but expensive tool. It is valid to use them for checking whether the first character is a digit, but it is not so elegant :) I prefer this way:
public boolean isLeadingDigit(final String value){
final char c = value.charAt(0);
return (c >= '0' && c <= '9');
}

IN KOTLIN:
Suppose that you have a String like this:
private val phoneNumber = "9121111111"
First, get the first character (firstOrNull() returns null for an empty string):
val firstChar = phoneNumber.firstOrNull()
Then check whether it is a digit, which yields a Boolean:
val startsWithDigit = firstChar?.isDigit() == true

A regular expression for "starts with a digit" is ^[0-9]:
Pattern pattern = Pattern.compile("^[0-9]");
Matcher matcher = pattern.matcher(input); // input is the String to test
if (matcher.find()) {
System.out.println("true");
}

I just came across this question and thought I'd contribute a solution that does not use regex.
In my case I use a helper method:
public boolean notNumber(String input){
boolean notNumber = false;
try {
// must not start with a number
@SuppressWarnings("unused")
double checker = Double.valueOf(input.substring(0,1));
}
catch (Exception e) {
notNumber = true;
}
return notNumber;
}
Probably an overkill, but I try to avoid regex whenever I can.

To check whether only the first character is a digit or a letter:
For a digit:
Character.isDigit(str.charAt(0)) // returns true if the first character is a digit
For a letter:
Character.isLetter(str.charAt(0)) // returns true if the first character is a letter
