Regex not working in Java program as expected - java

I have been working on a program which makes use of Regular Expressions. It searches for some text in the files to give me a database based on the scores of different players.
Here is the sample of the text within which it searches.
ISLAMABAD UNITED 1st innings
Player Status Runs Blls 4s 6s S/R
David Warner lbw b. Hassan 19 16 4 0 118.8%
Joe Burns b. Morkel 73 149 16 0 49.0%
Kane Wiliiamson b. Tahir 135 166 28 2 81.3%
Asad Shafiq c. Rahane b. Morkel 22 38 5 0 57.9%
Kraigg Braithwaite c. Khan b. Boult 24 36 5 0 66.7%
Corey Anderson b. Tahir 18 47 3 0 38.3%
Sarfaraz Ahmed b. Morkel 0 6 0 0 0.0%
Tim Southee c. Hales b. Morkel 0 6 0 0 0.0%
Kyle Abbbott c. Rahane b. Morkel 26 35 4 0 74.3%
Steven Finn c. Hales b. Hassan 10 45 1 0 22.2%
Yasir Shah not out 1 12 0 0 8.3%
Total: 338/10 Overs: 92.1 Run Rate: 3.67 Extras: 10
Day 2 10:11 AM
-X-
I am using the following regex to get the different fields..
((?:\/)?(?:[A-Za-z']+)?\s?(?:[A-Za-z']+)?\s?(?:[A-Za-z']+)?\s?)\s+(?:lbw)?(?:not\sout)?(?:run\sout)?\s?(?:\(((?:[A-Za-z']+)?\s?(?:['A-Za-z]+)?)\))?(?:(?:st\s)?\s?(?:((?:['A-Za-z]+)\s(?:['A-Za-z]+)?)))?(?:c(?:\.)?\s((?:(?:['A-Za-z]+)?\s(?:[A-Za-z']+)?)?(?:&)?))?\s+(?:b\.)?\s+((?:[A-Za-z']+)\s(?:[A-Za-z']+)?)?\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)
Batsman Name - Group 1
Person Affecting Stumping (if any) - Group 2
Person Affecting RunOut (if any) - Group 3
Person Taking Catch (if any) - Group 4
Person Taking the wicket (if any) - Group 5
Runs Scored - Group 6
Balls Faced - Group 7
Fours Hit - Group 8
Sixes Hit - Group 9
Here is an example of the text I need to extract...
Group 0 contains David Warner lbw b. Hassan 19 16 4 0 118.8%
Group 1 contains 'David Warner'
Group 2 does not exist in this example
Group 3 does not exist in this example
Group 4 does not exist in this example
Group 5 contains 'Hassan'
Group 6 contains '19'
Group 7 contains '16'
Group 8 contains '4'
Group 9 contains '0'
When I try this on Regexr or Regex101, Group 1 is 'David Warner'. But in my Java program, Group 1 is 'David'. The same happens for all results, and I don't know why.
Here's the code of my program:
Matcher bat = Pattern.compile("((?:\\/)?(?:[A-Za-z']+)?\\s?(?:[A-Za-z']+)?\\s?(?:[A-Za-z']+)?\\s?)\\s+(?:lbw)?(?:not\\sout)?(?:run\\sout)?\\s?(?:\\(((?:[A-Za-z']+)?\\s?(?:['A-Za-z]+)?)\\))?(?:(?:st\\s)?\\s?(?:((?:['A-Za-z]+)\\s(?:['A-Za-z]+)?)))?(?:c(?:\\.)?\\s((?:(?:['A-Za-z]+)?\\s(?:[A-Za-z']+)?)?(?:&)?))?\\s+(?:b\\.)?\\s+((?:[A-Za-z']+)\\s(?:[A-Za-z']+)?)?\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)").matcher(batting.group(1));
while (bat.find()) {
batPos++;
Batsman a = new Batsman(bat.group(1).replace("\n", "").replace("\r", "").replace("S/R", "").replace("/R", "").trim(), batting.group(2));
if (bat.group(0).contains("not out")) {
a.bat(Integer.parseInt(bat.group(6)), Integer.parseInt(bat.group(7)), Integer.parseInt(bat.group(8)), Integer.parseInt(bat.group(9)), batting.group(2), false);
} else {
a.bat(Integer.parseInt(bat.group(6)), Integer.parseInt(bat.group(7)), Integer.parseInt(bat.group(8)), Integer.parseInt(bat.group(9)), batting.group(2), true);
}
if (!teams.contains(batting.group(2))) {
teams.add(batting.group(2));
}
boolean f = true;
Batsman clone = null;
for (Batsman b1 : batted) {
if (b1.eq(a)) {
clone = b1;
f = false;
break;
}
}
if (!f) {
if (bat.group(0).contains("not out")) {
clone.batUpdate(a.getRunScored(), a.getBallFaced(), a.getFour(), a.getSix(), false, true);
} else {
clone.batUpdate(a.getRunScored(), a.getBallFaced(), a.getFour(), a.getSix(), true, true);
}
} else {
batted.add(a);
}
}

Your regex is far too complicated for such a simple task. To simplify it (or eliminate it altogether), operate on a single line rather than on the whole block of text.
To do this, first split the text into lines:
String array[] = str.split("\\n");
Once you have each individual line, split it on runs of two or more whitespace characters:
String parts[] = array[1].split("\\s\\s+");
Then you can access each part separately. For example, the Status is:
System.out.println("Status - " + parts[1]);
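A minimal sketch of this line-by-line approach. It assumes the real file separates columns with two or more spaces (the sample as pasted in the question shows single spaces, so that layout is an assumption), and the class name ScorecardSplit is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

class ScorecardSplit {
    // Splits each non-empty line into columns separated by two or more whitespace characters.
    static List<String[]> parse(String text) {
        List<String[]> rows = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            rows.add(line.trim().split("\\s{2,}"));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Assumed layout: at least two spaces between columns.
        String text = "David Warner  lbw b. Hassan  19  16  4  0  118.8%\n"
                    + "Yasir Shah  not out  1  12  0  0  8.3%";
        for (String[] parts : parse(text)) {
            System.out.println("Player - " + parts[0] + ", Status - " + parts[1]);
        }
    }
}
```

If the columns really are separated by single spaces, this split cannot tell a two-word name from two columns, and a regex (or fixed column positions) is still needed.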

The commenters are right, of course: this might not be a typical problem to solve with a regex. But to answer your question (why is there a difference between Java and regex101?), let's first pull out some of the problems that make your regex too complex. The next step would be to track down whether and why it behaves differently in Java.
I tried to understand your regex (and cricket at the same time!) and came up with a proposal that might help you show us what your regex should look like.
The first attempt reads until the number columns are reached. My guess is that you should be looking at alternation instead of introducing a lot of groups. Take a look at this: example 1
Explanation:
( # group 1 start
\/? # not sure why there should be /?
[A-Z][a-z]+ # first name
(?:\s(?:[A-Z]['a-z]+)+) # last name
)
(?:\ # spaces
( # group 2 start
lbw # lbw or
|not\sout # not out or
|(c\.|st|run\sout) # group 3: c., st or run out
\s # space
\(? # optional (
(\w+) # group 4: name
\)? # optional )
))? # group 2 end
(?:\s+ # spaces
( # group 5 start
(?:b\.\s)(\w+) # b. name
))? # group 5 end
\s+ # spaces
EDIT 1: Actually, there is a 'stumped' option missing in your regex as well. Added that in mine.
EDIT 2: Stumped doesn't have a dot.
EDIT 3: The complete example can be found at example 2
Some java code to test it:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class Foo {
public static void main(String[] args) {
String[] examples = {
"David Warner lbw b. Hassan 19 16 4 0 118.8%",
"Joe Burns b. Morkel 73 149 16 0 49.0%",
"Asad Shafiq c. Rahane b. Morkel 22 38 5 0 57.9%",
"Yasir Shah not out 1 12 0 0 8.3%",
"Yasir Shah st Rahane 1 12 0 0 8.3%",
"Morne Morkel run out (Shah) 11 17 1 1 64.7%"
};
Pattern pattern = Pattern.compile("(\\/?[A-Z][a-z]+(?:\\s(?:[A-Z]['a-z]+)+))(?:\\s+(lbw|not\\sout|(c\\.|st|run\\sout)\\s\\(?(\\w+)\\)?))?(?:\\s+((?:b\\.\\s)(\\w+)))?\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+\\.\\d%)");
for (String text : examples) {
System.out.println("TEXT: " + text);
Matcher matcher = pattern.matcher(text);
if (matcher.matches()) {
System.out.println("batsman: " + matcher.group(1));
if (matcher.group(2) != null) System.out.println(matcher.group(2));
if (matcher.group(5) != null && matcher.group(5).matches("^b.*"))
System.out.println("bowler: " + matcher.group(6));
StringBuilder sb = new StringBuilder("numbers are: ");
int[] groups = {7, 8, 9, 10, 11};
for (int i : groups) {
sb.append(" " + matcher.group(i));
}
System.out.println(sb.toString());
System.out.println();
}
}
}
}

Related

HBASE filter by multiple values

I am having problems using filters to search data in hbase.
First I am reading some data from one table and storing in a vector or arrayList:
for (Result r : rs) {
for (KeyValue kv : r.raw()) {
if (new String(kv.getFamily()).equals("mpnum")) {
temp = new String(kv.getValue());
x.addElement(temp);
}
}
}
Then, I want to search a different table based on the values of this vector. I used filters to do this: (I tried BinaryPrefixComparator and BinaryComparator as well)
FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ONE);
for (int c = 0; c < x.size(); c++) {
System.out.println(x.get(c).toString());
filterList.addFilter(new SingleColumnValueFilter(Bytes.toBytes("mpnum"), null, CompareOp.EQUAL, new SubstringComparator( x.get(c).toString() )));
}
I should get 3 results back, however I only get one result back, the first entry in the database.
What doesn't make sense is that when I hardcode the value that I am looking for into my code, I will get all 3 results back.
I thought there might be some issue with converting the bytes to a String and then back to bytes, but that would not explain how it was able to bring back the first result. For some reason, it stops at the first match and doesn't continue to find the other 2 rows that contain matching data. If I hardcode the value, I get the results:
x.addElement("abc123");
filterList.addFilter(new SingleColumnValueFilter(Bytes.toBytes("mpnum"), null, CompareOp.EQUAL, new SubstringComparator( x.get(0).toString() )));
Does anyone know what the problem is or what I need to do to resolve my issue? Your help is much appreciated.
Thank You
edit: Here is the contents of the tables:
TABLE1:
ROW COLUMN+CELL
0 column=gpnum:, timestamp=1481300288449, value=def123
0 column=mpnum:, timestamp=1481300273355, value=abc123
0 column=price:, timestamp=1481300255337, value=85.0
1 column=gpnum:, timestamp=1481301599999, value=def2244
1 column=mpnum:, timestamp=1481301582336, value=011511607
1 column=price:, timestamp=1481301673886, value=0.76
TABLE2
ROW COLUMN+CELL
0 column=brand:, timestamp=1481300227283, value=x
0 column=mpnum:, timestamp=1481300212289, value=abc123
0 column=price:, timestamp=1481300110950, value=50.0
1 column=mpnum:, timestamp=1481301806687, value=011511607
1 column=price:, timestamp=1481301777345, value=1.81
13 column=webtype:, timestamp=1483507543878, value=US
3 column=avail:, timestamp=1481306538360, value=avail
3 column=brand:, timestamp=1481306538360, value=brand
3 column=descr:, timestamp=1481306538360, value=description
3 column=dist:, timestamp=1481306538360, value=distributor
3 column=mpnum:, timestamp=1481306538360, value=pnum
3 column=price:, timestamp=1481306538360, value=price
3 column=url:, timestamp=1481306538360, value=url
3 column=webtype:, timestamp=1481306538360, value=webtype
4 column=avail:, timestamp=1481306538374, value=4
4 column=brand:, timestamp=1481306538374, value=x
4 column=descr:, timestamp=1481306538374, value=description
4 column=dist:, timestamp=1481306538374, value=x
4 column=mpnum:, timestamp=1482117383212, value=011511607
4 column=price:, timestamp=1481306538374, value=34.51
4 column=url:, timestamp=1481306538374, value=x
4 column=webtype:, timestamp=1481306538374, value=US
5 column=avail:, timestamp=1481306538378, value=
5 column=brand:, timestamp=1481306538378, value=name
5 column=descr:, timestamp=1481306538378, value=x
5 column=dist:, timestamp=1481306538378, value=x
5 column=mpnum:, timestamp=1482117392043, value=011511607
5 column=price:, timestamp=1481306538378, value=321.412
5 column=url:, timestamp=1481306538378, value=x.com
THIRD TABLE (to store result matches)
0 column=brand:, timestamp=1481301813849, value=name
0 column=cprice:, timestamp=1481301813849, value=1.81
0 column=gpnum:, timestamp=1481301813849, value=def2244
0 column=gprice:, timestamp=1481301813849, value=0.76
0 column=mpnum:, timestamp=1481301813849, value=011511607
There should be three matches, but only one match is brought back.
If anyone is willing to help for a fee, send me an email at tt224416#gmail.com

Division of teams on the basis of skill point

I am trying to put n players with different skill points (ranging from 100 to 3000) into r teams such that the overall skill of each team is as close as possible to that of every other team.
I first sorted the players in descending order of skill points and put the top r players into separate teams, one each. After that, the team with the lowest total skill (found by iterating and summing) gets the best remaining player.
For example:
A 600
B 550
C 400
D 250
E 220
F 200
G 150
H 140
For 2 teams, result will be:
Team A{600,250,220,150}= 1220
Team B{550,400,200,140}= 1290
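The first approach described above can be sketched as follows. This is only an illustration of that greedy rule, not an optimal algorithm; the class and method names are hypothetical, and ties are assumed to go to the team with the lower index:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GreedyTeams {
    // Sorts skills in descending order and always gives the next player
    // to the team whose current total is lowest (ties: lower team index).
    static List<List<Integer>> assign(int[] skills, int teamCount) {
        int[] sorted = skills.clone();
        Arrays.sort(sorted); // ascending; walked from the end below
        int[] totals = new int[teamCount];
        List<List<Integer>> teams = new ArrayList<>();
        for (int t = 0; t < teamCount; t++) {
            teams.add(new ArrayList<>());
        }
        for (int i = sorted.length - 1; i >= 0; i--) {
            int weakest = 0;
            for (int t = 1; t < teamCount; t++) {
                if (totals[t] < totals[weakest]) {
                    weakest = t;
                }
            }
            teams.get(weakest).add(sorted[i]);
            totals[weakest] += sorted[i];
        }
        return teams;
    }

    public static void main(String[] args) {
        int[] skills = {600, 550, 400, 250, 220, 200, 150, 140};
        System.out.println(assign(skills, 2));
    }
}
```

Because every team starts at zero, the first r players automatically land in r different teams, which matches the "top r players" step in the description.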
In another approach each team gets a player from top and a player from bottom.
Team A{600,140,400,200}=1340
Team B{550,150,250,220}=1170
So here the 1st approach was better, but for different data sets sometimes approach 2 is optimal and sometimes approach 1 is.
Is there any specific algorithm to do this? I tried to read Microsoft's TrueSkill algorithm, but it was way too complex.
Looks like you want to score each combination of players. I'm going to cheat and use Python here:
from itertools import combinations
players = [600, 550, 400, 250, 220, 150, 140]
scores = {}
# try every candidate team of size 1 up to (but not including) half the players
for i in range(1, len(players) // 2):
    for c in combinations(players, i):
        # score = skill difference between team c and everyone else
        scores[c] = abs(sum(c) - sum(p for p in players if p not in c))
print(sorted(scores.items(), key=lambda x: x[1])[0])
prints: ((600, 550), 10)
Edit: Didn't recognize this as a hard problem right away.
Like mcdowella mentioned in a comment, this problem, as stated, is np-hard. A classic approach would be Integer Programming.
The following implementation uses the Julia programming language with the JuMP library as the modelling tool. The Integer/Mixed-Integer Programming solver used is Cbc, but a commercial solver such as Gurobi may be used too (only 2 lines of code need to change!). Apart from Gurobi, all the mentioned tools are open-source!
Code
using JuMP
using Cbc
# PARAMS
N_PLAYERS = 15
N_TEAMS = 3
LOWER_SKILL = 100
UPPER_SKILL = 3000
# RANDOM INSTANCE
SKILL = Int[]
for p = 1:N_PLAYERS
push!(SKILL, rand(LOWER_SKILL:UPPER_SKILL))
end
# MODEL
m = Model(solver=CbcSolver())
bigM = sum(SKILL)^2 # more tight bound possible
# VARS
@defVar(m, x[1:N_PLAYERS, 1:N_TEAMS], Bin) # player-team assignment vars
@defVar(m, 0 <= tsum_pos[1:N_TEAMS,1:N_TEAMS] <= bigM) # abs-linearization: pos-part
@defVar(m, 0 <= tsum_neg[1:N_TEAMS,1:N_TEAMS] <= bigM) # abs-linearization: neg-part
# CONSTRAINTS
# each player is assigned to exactly one team
for p = 1:N_PLAYERS
@addConstraint(m, sum{x[p,t], t=1:N_TEAMS} == 1)
end
# temporary team sum expressions
team_sums = AffExpr[]
for t = 1:N_TEAMS
@defExpr(y, sum{SKILL[p] * x[p,t], p=1:N_PLAYERS})
push!(team_sums, y)
end
# errors <-> split abs-vars equality
for t1 = 1:N_TEAMS
for t2 = 1:N_TEAMS
if t1 != t2
@addConstraint(m, (team_sums[t1] - team_sums[t2]) == (tsum_pos[t1,t2] - tsum_neg[t1,t2]))
end
end
end
# objective
@setObjective(m, Min, sum{tsum_pos[i,j] + tsum_neg[i,j], i=1:N_TEAMS, j=1:N_TEAMS}) # symmetry could be used
# SOLVE
tic()
status = solve(m)
toc()
# OUTPUT
println("Objective is: ", getObjectiveValue(m))
println("Solution: ")
println("Player skills: ", SKILL)
for p = 1:N_PLAYERS
for t = 1:N_TEAMS
if getValue(x[p,t]) > 0.5
println("player ", p, " in team ", t)
end
end
end
for t=1:N_TEAMS
sum_ = 0
for p=1:N_PLAYERS
if getValue(x[p,t]) > 0.5
sum_ += SKILL[p]
end
end
println("team: ", t, " -> ", sum_)
end
println(sum(SKILL))
This model uses a linearization trick to express absolute values, as needed for an L1-norm-based error like the one you described in your post!
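The trick appears to be the usual split of a difference into a non-negative positive part and a non-negative negative part. Written out in shorthand introduced here (S_t for team_sums[t], p and n for tsum_pos and tsum_neg, s_i for SKILL[i]), the model in the code reads roughly as:

```latex
\begin{aligned}
\min \quad & \sum_{t_1 \neq t_2} \left( p_{t_1 t_2} + n_{t_1 t_2} \right) \\
\text{s.t.} \quad & S_{t_1} - S_{t_2} = p_{t_1 t_2} - n_{t_1 t_2}, \qquad p_{t_1 t_2} \ge 0,\; n_{t_1 t_2} \ge 0, \\
& S_t = \sum_{i} s_i \, x_{i t}, \qquad \sum_{t} x_{i t} = 1, \qquad x_{i t} \in \{0, 1\}
\end{aligned}
```

Since both parts are non-negative and their sum is minimized, at most one of them is nonzero at an optimum, so p + n equals |S_{t1} - S_{t2}|. Every pair of teams is counted once per order, which is consistent with the "error is doubled" remark in the output: the team totals 7523, 7528 and 7521 differ pairwise by 5, 2 and 7, and 2 * (5 + 2 + 7) = 28.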
Output
elapsed time: 9.785739578 seconds
Objective is: 28.00000000000063 # REMARK: error is doubled because of symmetries which could be changed
Solution:
Player skills: [2919,1859,1183,1128,495,1436,2215,2045,651,540,2924,2367,1176,334,1300]
player 1 in team 3
player 2 in team 1
player 3 in team 3
player 4 in team 1
player 5 in team 3
player 6 in team 2
player 7 in team 2
player 8 in team 1
player 9 in team 1
player 10 in team 1
player 11 in team 3
player 12 in team 2
player 13 in team 2
player 14 in team 2
player 15 in team 1
team: 1 -> 7523
team: 2 -> 7528
team: 3 -> 7521
22572

Hint on how to extract text from a List<String>

In my Java application I have a List<String> sbuff_Test = new ArrayList<>(); structure that I fill during execution. When sbuff_Test is ready, I put every string from it into a jTextArea. My output looks something like this:
Please choose Node:
2 : Low s23_t0
1 : High s23_t0 (Id = 0)
* TESTPAD MAIN MENU (v10r0p0) *
----------------------------------
LowPT PAD s23_t0 on node 2
1 = Initialize PAD
2 = PRODE MENU
3 = TTC MENU
4 = CM MENU
5 = FPGA MENU
6 = LINK MENU
7 = CAN MENU
8 = ELMB MENU
9 = SPLITTER MENU
10 = Change CURRENT PAD
11 = Reset full PAD
12 = Warm Initialize PAD
13 = Change Pad configuration
14 = Change CM latencies
15 = Phase measurement
16 = Power ON/OFF
17 = Print PAD Status
18 = Measurement loop
19 = Read CM trigger frequencies
20 = fast check of locks
21 = TRIGGER MENU
22 = Read CM BC ids
23 = Test CM BC ids with prode
24 = Test CM BC ids with TTC
25 = Test init low-high
0 = Quit
TESTPAD: ELMB MENU
(1) ELMB reset
(2) power OFF/ON ELMB on Node 1
(3) ELMB firm/hard version
(4) set CAN-debug ON/OFF
(5) set the communication rate
(6) download XPG file into FLASH for localInit
(0) exit
2
Firmware Version SV22
Hardware Version pad8
TESTPAD: ELMB MENU
(1) ELMB reset
(2) power OFF/ON ELMB on Node 1
(3) ELMB firm/hard version
(4) set CAN-debug ON/OFF
(5) set the communication rate
(6) download XPG file into FLASH for localInit
(0) exit
Now, I want a hint on how to extract only the text that I need; for example, for the text above:
LowPT PAD s23_t0 on node 2
Firmware Version SV22
Hardware Version pad8
The trouble is that the part of the text that I must delete is variable, and I can't find an approach to this problem. What do you suggest for a problem like this? Thanks for the hint.
EDIT:
To delete the unwanted phrases, you just need to use the matcher.replaceAll("") method. In this example I will use the old patterns:
String text = jTextAreaName.getText();
//This is the list of the wanted groups
String[] patterns = new String[]{"(LowPT .+)[\\r\\n]", "(Firmware Version .+)[\\r\\n]", "(Hardware Version .+)[\\r\\n]"};
//Then delete the three matched groups like this
for(int i=0; i<patterns.length; i++) {
Pattern pattern = Pattern.compile(patterns[i]);
Matcher matcher = pattern.matcher(text);
while(matcher.find()) {
text = matcher.replaceAll("");
}
}
Here's the Updated DEMO.
In that case you need to use a Regex Matcher and matching groups to only extract the wanted parts from it:
String text = jTextAreaName.getText();
//This is the list of the wanted groups
String[] patterns = new String[]{"(LowPT .+)[\\r\\n]", "(Firmware Version .+)[\\r\\n]", "(Hardware Version .+)[\\r\\n]"};
//Then extract the three matched groups like this
String myResult="";
for(int i=0; i<patterns.length; i++) {
//compile each matching group and find matches.
Pattern pattern = Pattern.compile(patterns[i]);
Matcher matcher = pattern.matcher(text);
while(matcher.find()) {
myResult += matcher.group(1);
myResult += "\n";
}
}
This is a live DEMO where you can test it, giving the following result:
LowPT PAD s23_t0 on node 2
Firmware Version SV22
Hardware Version pad8
Explanation:
(LowPT .+)[\\r\\n] is a matching group for the line LowPT PAD s23_t0 on node 2.
(Firmware Version .+)[\\r\\n] is a matching group for the line Firmware Version SV22.
(Hardware Version .+)[\\r\\n] is a matching group for the line Hardware Version pad8.
If you only need the LowPT, Firmware Version and Hardware Version lines, read your input line by line.
If a line contains one of the above keywords, print it and continue to the next one.
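A minimal sketch of this line-by-line idea, working directly on a List<String> like the sbuff_Test list from the question. The class name, method name and exact keyword strings are assumptions chosen for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class KeywordLineFilter {
    // Keeps only the lines that contain at least one of the given keywords.
    static List<String> keep(List<String> lines, List<String> keywords) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            for (String keyword : keywords) {
                if (line.contains(keyword)) {
                    kept.add(line.trim());
                    break; // one matching keyword is enough
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Please choose Node:",
                "LowPT PAD s23_t0 on node 2",
                "(3) ELMB firm/hard version",
                "Firmware Version SV22",
                "Hardware Version pad8",
                "(0) exit");
        List<String> keywords = Arrays.asList("LowPT", "Firmware Version", "Hardware Version");
        for (String line : keep(lines, keywords)) {
            System.out.println(line);
        }
    }
}
```

The keywords need to be specific enough not to hit menu entries; "Firmware Version" is used rather than "firm" so that the "(3) ELMB firm/hard version" line is not kept.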

Processing an array of data with a pattern in it in Java?

So I've got this String with book information:
String data = "Harry Potter 1 | J.K. Rowling| 350 | Fantasy | Hunger Games | Suzanne Collins | 500 | Fantasy | The KingKiller Chronicles | Patrick Rothfuss | 400 | Heroic Fantasy";
Then I split the String:
String[] splitData = data.split("\\|");
This will cause Harry Potter 1 to be in position 0, J.K. Rowling to be in position 1, 350 to be in position 2, etc.
You might see a pattern in here, which is the fact that at position 0 is a title of a book, at position 1 is the author, at position 2 is the amount of pages and at position 3 is the genre. Then it starts again at position 4, which is again the title of a book, position 5 being the Author of the book, etc etc. I assume that you understand where I'm going.
Now let's say that I want to display all those elements separately, like printing all the titles apart, all the authors, all the amount of pages, etc. How would I accomplish this?
This should be possible to do since the titles are in 0, 4, 8. The authors are in 1, 5, 9, etc.
String data = "Harry Potter 1 | J.K. Rowling| 350 | Fantasy | Hunger Games | Suzanne Collins | 500 | Fantasy | The KingKiller Chronicles | Patrick Rothfuss | 400 | Heroic Fantasy";
String[] splitData = data.split("\\|");
for(int i=0; i<splitData.length;i++) {
if(i % 4 == 0) System.out.println("Title: "+splitData[i]);
else if(i % 4 == 1) System.out.println("Author: "+splitData[i]);
else if(i % 4 == 2) System.out.println("Pages: "+splitData[i]);
else if(i % 4 == 3) System.out.println("Genre: "+splitData[i]);
}
Difficult, isn't it?
Recall that a for loop lets you perform any update in its last expression, not only i++. In this case you can use i += 4. Then in each iteration the title will be at splitData[i], the author at splitData[i+1], the number of pages at splitData[i+2], and the genre at splitData[i+3].
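A short sketch of the i += 4 idea, collecting one kind of field at a time. The class name BookColumns and the helper column are hypothetical; the data string is the one from the question:

```java
import java.util.ArrayList;
import java.util.List;

class BookColumns {
    static final int FIELDS_PER_BOOK = 4; // title, author, pages, genre

    // Collects the field at the given offset (0..3) from every 4-field record.
    static List<String> column(String[] splitData, int offset) {
        List<String> values = new ArrayList<>();
        // Step one whole record at a time; incomplete trailing records are skipped.
        for (int i = 0; i + FIELDS_PER_BOOK <= splitData.length; i += FIELDS_PER_BOOK) {
            values.add(splitData[i + offset].trim());
        }
        return values;
    }

    public static void main(String[] args) {
        String data = "Harry Potter 1 | J.K. Rowling| 350 | Fantasy | Hunger Games | Suzanne Collins | 500 | Fantasy | The KingKiller Chronicles | Patrick Rothfuss | 400 | Heroic Fantasy";
        String[] splitData = data.split("\\|");
        System.out.println("Titles:  " + column(splitData, 0));
        System.out.println("Authors: " + column(splitData, 1));
        System.out.println("Pages:   " + column(splitData, 2));
        System.out.println("Genres:  " + column(splitData, 3));
    }
}
```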

Fastest way to strip all non-printable characters from a Java String

What is the fastest way to strip all non-printable characters from a String in Java?
So far I've tried and measured on 138-byte, 131-character String:
String's replaceAll() - slowest method
517009 results / sec
Precompile a Pattern, then use Matcher's replaceAll()
637836 results / sec
Use StringBuffer, get code points using codePointAt() one-by-one and append to StringBuffer
711946 results / sec
Use StringBuffer, get chars using charAt() one-by-one and append to StringBuffer
1052964 results / sec
Preallocate a char[] buffer, get chars using charAt() one-by-one and fill this buffer, then convert back to String
2022653 results / sec
Preallocate 2 char[] buffers - old and new, get all chars for existing String at once using getChars(), iterate over old buffer one-by-one and fill new buffer, then convert new buffer to String - my own fastest version
2502502 results / sec
Same stuff with 2 buffers - only using byte[], getBytes() and specifying encoding as "utf-8"
857485 results / sec
Same stuff with 2 byte[] buffers, but specifying encoding as a constant Charset.forName("utf-8")
791076 results / sec
Same stuff with 2 byte[] buffers, but specifying encoding as 1-byte local encoding (barely a sane thing to do)
370164 results / sec
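For orientation, two of the simpler variants from the list above might look roughly like this. It is a sketch, not the benchmarked code: it assumes "non-printable" means any char below ' ' (the same test used in the code further down), it uses StringBuilder where the list says StringBuffer, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class SimpleStrippers {
    // Compiled once and reused; matches the control characters U+0000..U+001F.
    private static final Pattern CONTROL = Pattern.compile("[\\x00-\\x1F]");

    // Variant: precompiled Pattern plus Matcher.replaceAll().
    static String withPattern(String s) {
        return CONTROL.matcher(s).replaceAll("");
    }

    // Variant: read chars one-by-one with charAt() and append the printable ones.
    static String withStringBuilder(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch >= ' ') {
                sb.append(ch);
            }
        }
        return sb.toString();
    }
}
```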
My best try was the following:
char[] oldChars = new char[s.length()];
s.getChars(0, s.length(), oldChars, 0);
char[] newChars = new char[s.length()];
int newLen = 0;
for (int j = 0; j < s.length(); j++) {
char ch = oldChars[j];
if (ch >= ' ') {
newChars[newLen] = ch;
newLen++;
}
}
s = new String(newChars, 0, newLen);
Any thoughts on how to make it even faster?
Bonus points for answering a very strange question: why using "utf-8" charset name directly yields better performance than using pre-allocated static const Charset.forName("utf-8")?
Update
Suggestion from ratchet freak yields impressive 3105590 results / sec performance, a +24% improvement!
Suggestion from Ed Staub yields yet another improvement - 3471017 results / sec, a +12% over previous best.
Update 2
I've tried my best to collect all the proposed solutions and their cross-mutations and published them as a small benchmarking framework at github. Currently it sports 17 algorithms. One of them is "special": the Voo1 algorithm (provided by SO user Voo) employs intricate reflection tricks, thus achieving stellar speeds, but it messes up the JVM strings' state, so it's benchmarked separately.
You're welcome to check it out and run it to determine results on your box. Here's a summary of the results I've got on mine. Its specs:
Debian sid
Linux 2.6.39-2-amd64 (x86_64)
Java installed from a package sun-java6-jdk-6.24-1, JVM identifies itself as
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
Different algorithms show quite different results given different sets of input data. I've run the benchmark in 3 modes:
Same single string
This mode works on a same single string provided by StringSource class as a constant. The showdown is:
Ops / s │ Algorithm
──────────┼──────────────────────────────
6 535 947 │ Voo1
──────────┼──────────────────────────────
5 350 454 │ RatchetFreak2EdStaub1GreyCat1
5 249 343 │ EdStaub1
5 002 501 │ EdStaub1GreyCat1
4 859 086 │ ArrayOfCharFromStringCharAt
4 295 532 │ RatchetFreak1
4 045 307 │ ArrayOfCharFromArrayOfChar
2 790 178 │ RatchetFreak2EdStaub1GreyCat2
2 583 311 │ RatchetFreak2
1 274 859 │ StringBuilderChar
1 138 174 │ StringBuilderCodePoint
994 727 │ ArrayOfByteUTF8String
918 611 │ ArrayOfByteUTF8Const
756 086 │ MatcherReplace
598 945 │ StringReplaceAll
460 045 │ ArrayOfByteWindows1251
Multiple strings, 100% of strings contain control characters
Source string provider pre-generated lots of random strings using (0..127) character set - thus almost all strings contained at least one control character. Algorithms received strings from this pre-generated array in round-robin fashion.
Ops / s │ Algorithm
──────────┼──────────────────────────────
2 123 142 │ Voo1
──────────┼──────────────────────────────
1 782 214 │ EdStaub1
1 776 199 │ EdStaub1GreyCat1
1 694 628 │ ArrayOfCharFromStringCharAt
1 481 481 │ ArrayOfCharFromArrayOfChar
1 460 067 │ RatchetFreak2EdStaub1GreyCat1
1 438 435 │ RatchetFreak2EdStaub1GreyCat2
1 366 494 │ RatchetFreak2
1 349 710 │ RatchetFreak1
893 176 │ ArrayOfByteUTF8String
817 127 │ ArrayOfByteUTF8Const
778 089 │ StringBuilderChar
734 754 │ StringBuilderCodePoint
377 829 │ ArrayOfByteWindows1251
224 140 │ MatcherReplace
211 104 │ StringReplaceAll
Multiple strings, 1% of strings contain control characters
Same as the previous mode, but only 1% of the strings were generated with control characters; the other 99% were generated using the [32..127] character set, so they couldn't contain control characters at all. This synthetic load comes closest to the real-world application of this algorithm at my place.
Ops / s │ Algorithm
──────────┼──────────────────────────────
3 711 952 │ Voo1
──────────┼──────────────────────────────
2 851 440 │ EdStaub1GreyCat1
2 455 796 │ EdStaub1
2 426 007 │ ArrayOfCharFromStringCharAt
2 347 969 │ RatchetFreak2EdStaub1GreyCat2
2 242 152 │ RatchetFreak1
2 171 553 │ ArrayOfCharFromArrayOfChar
1 922 707 │ RatchetFreak2EdStaub1GreyCat1
1 857 010 │ RatchetFreak2
1 023 751 │ ArrayOfByteUTF8String
939 055 │ StringBuilderChar
907 194 │ ArrayOfByteUTF8Const
841 963 │ StringBuilderCodePoint
606 465 │ MatcherReplace
501 555 │ StringReplaceAll
381 185 │ ArrayOfByteWindows1251
It's very hard for me to decide on who provided the best answer, but given the real-world application best solution was given/inspired by Ed Staub, I guess it would be fair to mark his answer. Thanks for all who took part in this, your input was very helpful and invaluable. Feel free to run the test suite on your box and propose even better solutions (working JNI solution, anyone?).
References
GitHub repository with a benchmarking suite
using 1 char array could work a bit better
int length = s.length();
char[] oldChars = new char[length];
s.getChars(0, length, oldChars, 0);
int newLen = 0;
for (int j = 0; j < length; j++) {
char ch = oldChars[j];
if (ch >= ' ') {
oldChars[newLen] = ch;
newLen++;
}
}
s = new String(oldChars, 0, newLen);
and I avoided repeated calls to s.length();
another micro-optimization that might work is
int length = s.length();
char[] oldChars = new char[length+1];
s.getChars(0, length, oldChars, 0);
oldChars[length]='\0';//avoiding explicit bound check in while
int newLen=-1;
while(oldChars[++newLen]>=' ');//find first non-printable,
// if there are none it ends on the null char I appended
for (int j = newLen; j < length; j++) {
char ch = oldChars[j];
if (ch >= ' ') {
oldChars[newLen] = ch;//the while avoids repeated overwriting here when newLen==j
newLen++;
}
}
s = new String(oldChars, 0, newLen);
If it is reasonable to embed this method in a class which is not shared across threads, then you can reuse the buffer:
char [] oldChars = new char[5];
String stripControlChars(String s)
{
final int inputLen = s.length();
if ( oldChars.length < inputLen )
{
oldChars = new char[inputLen];
}
s.getChars(0, inputLen, oldChars, 0);
etc...
This is a big win - 20% or so, as I understand the current best case.
If this is to be used on potentially large strings and the memory "leak" is a concern, a weak reference can be used.
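One way the reusable buffer and the weak-reference idea could fit together is sketched below. The class name and the plain filtering loop are assumptions added for illustration; the snippet above elides the rest of the method, and this does not claim to reproduce it.

```java
import java.lang.ref.WeakReference;

// Not thread safe: intended for use from a single thread, as stated above.
class WeakBufferStripper {
    // The buffer is only weakly reachable between calls, so the GC may reclaim it.
    private WeakReference<char[]> bufferRef = new WeakReference<char[]>(null);

    String stripControlChars(String s) {
        final int inputLen = s.length();
        char[] oldChars = bufferRef.get();
        if (oldChars == null || oldChars.length < inputLen) {
            // Buffer was collected or is too small: allocate a new one.
            oldChars = new char[inputLen];
            bufferRef = new WeakReference<char[]>(oldChars);
        }
        s.getChars(0, inputLen, oldChars, 0);
        int newLen = 0;
        for (int j = 0; j < inputLen; j++) {
            char ch = oldChars[j];
            if (ch >= ' ') {
                oldChars[newLen++] = ch; // compact printable chars in place
            }
        }
        return new String(oldChars, 0, newLen);
    }
}
```

The trade-off is that a weakly referenced buffer can disappear at any garbage collection, so under memory pressure this degrades to allocating per call.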
Well I've beaten the current best method (freak's solution with the preallocated array) by about 30% according to my measures. How? By selling my soul.
As I'm sure everyone who has followed the discussion so far knows, this violates pretty much every basic programming principle, but oh well. Anyway, the following only works if the string's underlying character array isn't shared with other strings; if it is, whoever has to debug this will have every right to want to kill you. (Without calls to substring(), and as long as this isn't used on literal strings, it should work, as I don't see why the JVM would intern unique strings read from an outside source.) Don't forget to make sure the benchmark code doesn't share arrays either: that's extremely likely and would obviously help the reflection solution.
Anyways here we go:
// Has to be done only once - so cache those! Prohibitively expensive otherwise
private Field value;
private Field offset;
private Field count;
private Field hash;
{
try {
value = String.class.getDeclaredField("value");
value.setAccessible(true);
offset = String.class.getDeclaredField("offset");
offset.setAccessible(true);
count = String.class.getDeclaredField("count");
count.setAccessible(true);
hash = String.class.getDeclaredField("hash");
hash.setAccessible(true);
}
catch (NoSuchFieldException e) {
throw new RuntimeException();
}
}
@Override
public String strip(final String old) {
final int length = old.length();
char[] chars = null;
int off = 0;
try {
chars = (char[]) value.get(old);
off = offset.getInt(old);
}
catch(IllegalArgumentException e) {
throw new RuntimeException(e);
}
catch(IllegalAccessException e) {
throw new RuntimeException(e);
}
int newLen = off;
for(int j = off; j < off + length; j++) {
final char ch = chars[j];
if (ch >= ' ') {
chars[newLen] = ch;
newLen++;
}
}
if (newLen - off != length) {
// We changed the internal state of the string, so at least
// be friendly enough to correct it.
try {
count.setInt(old, newLen - off);
// Have to recompute hash later on
hash.setInt(old, 0);
}
catch(IllegalArgumentException e) {
e.printStackTrace();
}
catch(IllegalAccessException e) {
e.printStackTrace();
}
}
// Well we have to return something
return old;
}
For my teststring that gets 3477148.18ops/s vs. 2616120.89ops/s for the old variant. I'm quite sure the only way to beat that could be to write it in C (probably not though) or some completely different approach nobody has thought about so far. Though I'm absolutely not sure if the timing is stable across different platforms - produces reliable results on my box (Java7, Win7 x64) at least.
You could split the task into several parallel subtasks, depending on the number of processors.
I was so free and wrote a small benchmark for different algorithms. It's not perfect, but I take the minimum of 1000 runs of a given algorithm 10000 times over a random string (with about 32/200% non printables by default). That should take care of stuff like GC, initialization and so on - there's not so much overhead that any algorithm shouldn't have at least one run without much hindrance.
Not especially well documented, but oh well. Here we go - I included both of ratchet freak's algorithms and the basic version. At the moment I randomly initialize a 200 chars long string with uniformly distributed chars in the range [0, 200).
IANA low-level java performance junkie, but have you tried unrolling your main loop? It appears that it could allow some CPU's to perform checks in parallel.
Also, this has some fun ideas for optimizations.
It can go even faster. Much faster*. How? By leveraging System.arraycopy, which is a native method. So to recap:
Return the same String if it's "clean".
Avoid allocating a new char[] on every iteration
Use System.arraycopy for moving the elements x positions back
public class SteliosAdamantidis implements StripAlgorithm {
private char[] copy = new char[128];
@Override
public String strip(String s) throws Exception {
int length = s.length();
if (length > copy.length) {
int newLength = copy.length * 2;
while (length > newLength) newLength *= 2;
copy = new char[newLength];
}
s.getChars(0, length, copy, 0);
int start = 0; //where to start copying from
int offset = 0; //number of non printable characters or how far
//behind the characters should be copied to
int index = 0;
//fast forward to the first non printable character
for (; index < length; ++index) {
if (copy[index] < ' ') {
start = index;
break;
}
}
//string is already clean
if (index == length) return s;
for (; index < length; ++index) {
if (copy[index] < ' ') {
if (start != index) {
System.arraycopy(copy, start, copy, start - offset, index - start);
}
++offset;
start = index + 1; //handling subsequent non printable characters
}
}
if (length != start) {
//copy the residue -if any
System.arraycopy(copy, start, copy, start - offset, length - start);
}
return new String(copy, 0, length - offset);
}
}
This class is not thread safe, but I guess that anyone who wants to process a gazillion strings on separate threads can afford 4-8 instances of the StripAlgorithm implementation inside a ThreadLocal<>.
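A minimal sketch of that ThreadLocal idea, assuming Java 8 for ThreadLocal.withInitial. To stay self-contained it uses a simplified, hypothetical BufferedStrip class instead of the implementation from this answer; the point is only that each thread gets its own instance and therefore its own reusable buffer.

```java
class ThreadLocalStripSketch {

    // Simplified stand-in for a stateful, non-thread-safe strip implementation.
    static final class BufferedStrip {
        private char[] copy = new char[128];

        String strip(String s) {
            int length = s.length();
            if (length > copy.length) {
                copy = new char[Math.max(length, copy.length * 2)];
            }
            s.getChars(0, length, copy, 0);
            int out = 0;
            for (int i = 0; i < length; i++) {
                char c = copy[i];
                if (c >= ' ') copy[out++] = c;
            }
            return out == length ? s : new String(copy, 0, out);
        }
    }

    // One instance per thread, created lazily on first use by that thread.
    private static final ThreadLocal<BufferedStrip> STRIPPER =
            ThreadLocal.withInitial(BufferedStrip::new);

    // Safe to call from any thread: the shared buffer is never shared across threads.
    static String strip(String s) {
        return STRIPPER.get().strip(s);
    }
}
```

In a thread pool this costs one buffer per worker thread, which matches the "4-8 instances" estimate for a typical pool size.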
Trivia
I used the RatchetFreak2EdStaub1GreyCat2 solution as a reference. I was surprised that it didn't perform well on my machine. I then wrongly assumed that the "bailout" mechanism didn't work and moved it to the end, which made performance skyrocket. Then I thought "wait a minute" and realized that the condition always works; it is just faster at the end. I don't know why.
...
6. RatchetFreak2EdStaub1GreyCatEarlyBail 3508771.93 3.54x +3.9%
...
2. RatchetFreak2EdStaub1GreyCatLateBail 6060606.06 6.12x +13.9%
The test is not 100% accurate. At first I was selfish and put my test second in the array of algorithms. It had some lousy results on the first run, so I moved it to the end (let the others warm up the JVM for me :) ) and then it came first.
Results
Oh, and of course the results: Windows 7, jdk1.8.0_111 on a relatively old machine, so expect different results on newer hardware and/or OS.
Rankings: (1.000.000 strings)
17. StringReplaceAll 990099.01 1.00x +0.0%
16. ArrayOfByteWindows1251 1642036.12 1.66x +65.8%
15. StringBuilderCodePoint 1724137.93 1.74x +5.0%
14. ArrayOfByteUTF8Const 2487562.19 2.51x +44.3%
13. StringBuilderChar 2531645.57 2.56x +1.8%
12. ArrayOfByteUTF8String 2551020.41 2.58x +0.8%
11. ArrayOfCharFromArrayOfChar 2824858.76 2.85x +10.7%
10. RatchetFreak2 2923976.61 2.95x +3.5%
9. RatchetFreak1 3076923.08 3.11x +5.2%
8. ArrayOfCharFromStringCharAt 3322259.14 3.36x +8.0%
7. EdStaub1 3378378.38 3.41x +1.7%
6. RatchetFreak2EdStaub1GreyCatEarlyBail 3508771.93 3.54x +3.9%
5. EdStaub1GreyCat1 3787878.79 3.83x +8.0%
4. MatcherReplace 4716981.13 4.76x +24.5%
3. RatchetFreak2EdStaub1GreyCat1 5319148.94 5.37x +12.8%
2. RatchetFreak2EdStaub1GreyCatLateBail 6060606.06 6.12x +13.9%
1. SteliosAdamantidis 9615384.62 9.71x +58.7%
Rankings: (10.000.000 strings)
17. ArrayOfByteWindows1251 1647175.09 1.00x +0.0%
16. StringBuilderCodePoint 1728907.33 1.05x +5.0%
15. StringBuilderChar 2480158.73 1.51x +43.5%
14. ArrayOfByteUTF8Const 2498126.41 1.52x +0.7%
13. ArrayOfByteUTF8String 2591344.91 1.57x +3.7%
12. StringReplaceAll 2626740.22 1.59x +1.4%
11. ArrayOfCharFromArrayOfChar 2810567.73 1.71x +7.0%
10. RatchetFreak2 2948113.21 1.79x +4.9%
9. RatchetFreak1 3120124.80 1.89x +5.8%
8. ArrayOfCharFromStringCharAt 3306878.31 2.01x +6.0%
7. EdStaub1 3399048.27 2.06x +2.8%
6. RatchetFreak2EdStaub1GreyCatEarlyBail 3494060.10 2.12x +2.8%
5. EdStaub1GreyCat1 3818251.24 2.32x +9.3%
4. MatcherReplace 4899559.04 2.97x +28.3%
3. RatchetFreak2EdStaub1GreyCat1 5302226.94 3.22x +8.2%
2. RatchetFreak2EdStaub1GreyCatLateBail 5924170.62 3.60x +11.7%
1. SteliosAdamantidis 9680542.11 5.88x +63.4%
* Reflection - Voo's answer
I've put an asterisk on the "Much faster" statement. I don't think anything can go faster than reflection in this case. It mutates the String's internal state and avoids new String allocations. I don't think one can beat that.
I tried to uncomment and run Voo's algorithm and got an error that the offset field doesn't exist. IntelliJ complains that it can't resolve count either. Also (if I'm not mistaken) a security manager might block reflective access to private fields, in which case this solution won't work. That's why this algorithm doesn't appear in my test run. I was curious to see it for myself, although I believe that a non-reflective solution can't be faster.
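As far as I know, the offset and count fields were removed from java.lang.String around Java 7 update 6, which would explain the error on jdk1.8. A minimal probe to check this on a given JDK, assuming no security manager is installed; StringFieldProbe and hasField are hypothetical names:

```java
class StringFieldProbe {

    // Returns true if java.lang.String declares a field with this name
    // on the running JDK. Only looks the field up; does not read or modify it.
    static boolean hasField(String name) {
        try {
            String.class.getDeclaredField(name);
            return true;
        } catch (NoSuchFieldException e) {
            return false;
        }
    }
}
```

Calling hasField("offset") and hasField("count") before attempting the reflective approach would at least turn the failure into an explicit check instead of an exception at run time.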
Why does using the "utf-8" charset name directly yield better performance than using a pre-allocated static const Charset.forName("utf-8")?
If you mean String#getBytes("utf-8") etc.: This shouldn't be faster - except for some better caching - since Charset.forName("utf-8") is used internally, if the charset is not cached.
One thing might be that you're using different charsets (or maybe some of your code does transparently) but the charset cached in StringCoding doesn't change.
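For reference, the two variants being compared look roughly like this. It is only a sketch of the two call shapes, not a benchmark; CharsetLookupSketch and its method names are hypothetical. Both overloads of String#getBytes should produce identical bytes, so any difference is in lookup and caching cost only.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetLookupSketch {

    // Pre-allocated constant, equivalent to Charset.forName("utf-8").
    static final Charset UTF8 = StandardCharsets.UTF_8;

    // Variant 1: pass the charset by name; the lookup happens inside the JDK.
    static byte[] byName(String s) {
        try {
            return s.getBytes("utf-8");
        } catch (UnsupportedEncodingException e) {
            // Cannot happen: every Java platform must support UTF-8.
            throw new AssertionError(e);
        }
    }

    // Variant 2: pass the pre-allocated Charset object.
    static byte[] byCharset(String s) {
        return s.getBytes(UTF8);
    }
}
```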
