Find relevant parts in a collection of strings

I've got a set of path strings:
/content/example-site/global/library/about/contact/thank-you.html
/content/example-site/global/corporate/about/contact/thank-you.html
/content/example-site/countries/uk/about/contact/thank-you.html
/content/example-site/countries/de/about/contact/thank-you.html
/content/example-site/others/about/contact/thank-you.html
...
(Often the paths are much longer than this)
As you can see it is difficult to notice the differences immediately. That's why I would like to highlight the relevant parts in the strings.
To find the differences I currently calculate the common prefix and suffix of all strings:
String prefix = getCommonPrefix(paths);
String suffix = getCommonSuffix(paths);
for (String path : paths) {
String relevantPath = path.substring(prefix.length(), path.length() - suffix.length());
// OUTPUT: prefix + "<b>" + relevantPath + "</b>" + suffix
}
For the prefix I'm using StringUtils.getCommonPrefix from Commons Lang.
For the suffix I couldn't find a utility (neither in Commons nor in Guava; the latter only has one for exactly two strings). So I had to write my own, similar to the one from Commons Lang.
I'm now wondering whether I missed some function in one of the libraries, or whether there is an easy way to do this with Java 8 streams.

Here is a little hack. I am not claiming it is optimal, but it could be worth pursuing if no other option is available:
String[] reversedPaths = new String[paths.length];
for (int i = 0; i < paths.length; i++) {
reversedPaths[i] = StringUtils.reverse(paths[i]);
}
String suffix = StringUtils.reverse(StringUtils.getCommonPrefix(reversedPaths));

You could reverse each path, find the common prefix of the reversed strings, and reverse that prefix to get the common suffix.
Like this:
String[] reversed = paths.stream()
        .map(path -> new StringBuilder(path).reverse().toString())
        .toArray(String[]::new); // StringUtils.getCommonPrefix takes a String array (varargs)
String commonSuffix = new StringBuilder(StringUtils.getCommonPrefix(reversed)).reverse().toString();
I personally do not like this solution much, because it creates a new StringBuilder for every path in the list. That is sometimes how Java works, but it is ugly at best and may hurt performance at worst. You could write your own function
public static String invert(String s) { // invert s using char[] }
if you want.
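Reversing can be avoided altogether by comparing characters from the end of each string. The following is only a minimal sketch of that idea, not a library function: the names CommonSuffix and commonSuffix are made up for illustration, and taking a List<String> is an assumption about how the paths are stored.

```java
import java.util.Arrays;
import java.util.List;

class CommonSuffix {
    // Returns the longest suffix shared by all strings, or "" for an empty list.
    static String commonSuffix(List<String> strings) {
        if (strings.isEmpty()) {
            return "";
        }
        String first = strings.get(0);
        int len = first.length(); // length of the common suffix found so far
        for (String s : strings) {
            int max = Math.min(len, s.length());
            int k = 0;
            // Walk backwards from the end while the characters agree
            while (k < max
                    && s.charAt(s.length() - 1 - k) == first.charAt(first.length() - 1 - k)) {
                k++;
            }
            len = k;
        }
        return first.substring(first.length() - len);
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "/content/example-site/countries/uk/about/contact/thank-you.html",
                "/content/example-site/countries/de/about/contact/thank-you.html",
                "/content/example-site/others/about/contact/thank-you.html");
        System.out.println(commonSuffix(paths));
    }
}
```

No intermediate strings are created here; the only allocation is the final substring.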

Related

Combining strings without using plus '+'

I am writing unit tests now and I need to create a specific string.
I have now defined something like this:
private final String at = "#:";
private String error, effect, one, two, three, four;
in setUp (#Before):
error = RandomStringUtils.randomAlphabetic (3);
one = RandomStringUtils.randomAlphabetic (6);
two = RandomStringUtils.randomAlphabetic (8);
three = RandomStringUtils.randomAlphabetic (2);
four = RandomStringUtils.randomAlphabetic (6);
effect = (error + at + one + at + two + at + three + at + four);
Concatenating the strings with plus signs looks terribly ugly and amateurish. Is it possible to do this more cleanly some other way, for example with a pattern? I don't know. Thanks for the help :)

For simplicity, you can also do:
effect = String.join(at, error, one, two, three, four);

You can use Java's built-in StringBuilder:
StringBuilder sb = new StringBuilder();
sb.append(error);
sb.append(at);
sb.append(one);
...
effect = sb.toString();

If the "#:" is a consistent separator, and you're using Java 8+, you might find that String.join is your friend. This would look something like:
effect = String.join("#:", error, one, two, three, four);
Guessing a little bit from your variable names, but as a little background and just in case it's helpful, if you want/need to use a Stream you can also use Collectors.joining, which could look something like:
List<String> stringParts = ...
stringParts.stream()
.collect(Collectors.joining("#:"));
This will join everything in the list with "#:" as a delimiter. You can also add a constant prefix and/or suffix which might be relevant for your error variable, like:
String error = RandomStringUtils.randomAlphabetic(3);
List<String> stringParts = ...
stringParts.stream()
.collect(Collectors.joining("#:", error, ""));
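One detail worth checking with the prefix variant: Collectors.joining places the delimiter only between the stream elements, not between the prefix and the first element. The sketch below (class and method names are hypothetical, and the literal values stand in for the random strings from the question) compares the two approaches and appends the separator to the prefix so both produce the same result.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class JoinDemo {
    static final String SEP = "#:";

    // Joins all parts with the separator, as String.join does.
    static String joinAll(String... parts) {
        return String.join(SEP, parts);
    }

    // Same result via a stream: the prefix must carry its own separator,
    // because Collectors.joining puts the delimiter only between elements.
    static String joinWithPrefix(String error, List<String> parts) {
        return parts.stream().collect(Collectors.joining(SEP, error + SEP, ""));
    }

    public static void main(String[] args) {
        System.out.println(joinAll("ERR", "one", "two"));
        System.out.println(joinWithPrefix("ERR", Arrays.asList("one", "two")));
    }
}
```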

Alternative to successive String.replace

I want to replace some strings in a String input :
string=string.replace("<h1>","<big><big><big><b>");
string=string.replace("</h1>","</b></big></big></big>");
string=string.replace("<h2>","<big><big>");
string=string.replace("</h2>","</big></big>");
string=string.replace("<h3>","<big>");
string=string.replace("</h3>","</big>");
string=string.replace("<h4>","<b>");
string=string.replace("</h4>","</b>");
string=string.replace("<h5>","<small><b>");
string=string.replace("</h5>","</b></small>");
string=string.replace("<h6>","<small>");
string=string.replace("</h6>","</small>");
As you can see, this approach is not ideal: each call has to search the whole input for the portion to replace, and since Strings are immutable, every call creates a new String. The input is also large, so performance has to be considered.
Is there a better approach that reduces the complexity of this code?

Although StringBuilder.replace() is a huge improvement compared to String.replace(), it is still very far from being optimal.
The problem with StringBuilder.replace() is that if the replacement has a different length than the replaced part (which applies to our case), a bigger internal char array might have to be allocated and the content copied before the replacement itself occurs (which also involves copying).
Imagine this: you have a text with 10,000 characters. If you want to replace the "XY" substring found at position 1 (the 2nd character) with "ABC", the implementation may have to reallocate a char buffer which is larger by at least 1, copy the old content to the new array, shift 9,997 characters (starting at position 3) to the right by 1 to fit "ABC" into the place of "XY", and finally copy the characters of "ABC" to the starting position 1. This has to be done for every replacement! This is slow.
Faster Solution: Building Output On-The-Fly
We can build the output on-the-fly: parts that don't contain replaceable texts can simply be appended to the output, and if we find a replaceable fragment, we append the replacement instead of it. Theoretically it's enough to loop over the input only once to generate the output. Sounds simple, and it's not that hard to implement it.
Implementation:
We will use a Map preloaded with mappings of the replaceable-replacement strings:
Map<String, String> map = new HashMap<>();
map.put("<h1>", "<big><big><big><b>");
map.put("</h1>", "</b></big></big></big>");
map.put("<h2>", "<big><big>");
map.put("</h2>", "</big></big>");
map.put("<h3>", "<big>");
map.put("</h3>", "</big>");
map.put("<h4>", "<b>");
map.put("</h4>", "</b>");
map.put("<h5>", "<small><b>");
map.put("</h5>", "</b></small>");
map.put("<h6>", "<small>");
map.put("</h6>", "</small>");
And using this, here is the replacer code: (more explanation after the code)
public static String replaceTags(String src, Map<String, String> map) {
    StringBuilder sb = new StringBuilder(src.length() + src.length() / 2);
    for (int pos = 0;;) {
        int ltIdx = src.indexOf('<', pos);
        if (ltIdx < 0) {
            // No more '<', we're done:
            sb.append(src, pos, src.length());
            return sb.toString();
        }
        sb.append(src, pos, ltIdx); // Copy chars before '<'
        // Check if our hit is replaceable:
        boolean mismatch = true;
        for (Map.Entry<String, String> e : map.entrySet()) {
            String key = e.getKey();
            if (src.regionMatches(ltIdx, key, 0, key.length())) {
                // Match, append the replacement:
                sb.append(e.getValue());
                pos = ltIdx + key.length();
                mismatch = false;
                break;
            }
        }
        if (mismatch) {
            sb.append('<');
            pos = ltIdx + 1;
        }
    }
}
Testing it:
String in = "Yo<h1>TITLE</h1><h3>Hi!</h3>Nice day.<h6>Hi back!</h6>End";
System.out.println(in);
System.out.println(replaceTags(in, map));
Output: (wrapped to avoid scroll bar)
Yo<h1>TITLE</h1><h3>Hi!</h3>Nice day.<h6>Hi back!</h6>End
Yo<big><big><big><b>TITLE</b></big></big></big><big>Hi!</big>Nice day.
<small>Hi back!</small>End
This solution is faster than using regular expressions, which involve considerable overhead, such as compiling a Pattern and creating a Matcher, because regexp is much more general. Regular expressions also create many temporary objects under the hood which are thrown away after the replace. Here I only use a StringBuilder (plus the char array under its hood), and the code iterates over the input String only once. This solution is also much faster than using StringBuilder.replace(), as detailed at the top of this answer.
Notes and Explanation
I initialized the StringBuilder in the replaceTags() method like this:
StringBuilder sb = new StringBuilder(src.length() + src.length() / 2);
So basically I created it with an initial capacity of 150% of the length of the original String. This is because our replacements are longer than the replaceable texts, so if replacing occurs, the output will be longer than the input. Giving a larger initial capacity to StringBuilder will in most cases avoid internal char[] reallocation entirely (of course the required initial capacity depends on the replaceable-replacement pairs and their frequency in the input, but this +50% is a reasonable estimate).
I also utilized the fact that all replaceable strings start with a '<' character, so finding the next potential replaceable position becomes blazing-fast:
int ltIdx = src.indexOf('<', pos);
It's just a simple loop and char comparisons inside String, and since it always starts searching from pos (and not from the start of the input), overall the code iterates over the input String only once.
And finally, to tell whether a replaceable String occurs at the potential position, we use the String.regionMatches() method to check the replaceable strings. This is also blazing-fast, as all it does is compare char values in a loop and return at the very first mismatching character.
And a PLUS:
The question doesn't mention it, but our input is an HTML document. HTML tags are case-insensitive which means the input might contain <H1> instead of <h1>.
To this algorithm this is not a problem. The regionMatches() in the String class has an overload which supports case-insensitive comparison:
boolean regionMatches(boolean ignoreCase, int toffset, String other,
int ooffset, int len);
So if we want to modify our algorithm to also find and replace input tags which are the same but are written using different letter case, all we have to modify is this one line:
if (src.regionMatches(true, ltIdx, key, 0, key.length())) {
Using this modified code, replaceable tags become case-insensitive:
Yo<H1>TITLE</H1><h3>Hi!</h3>Nice day.<H6>Hi back!</H6>End
Yo<big><big><big><b>TITLE</b></big></big></big><big>Hi!</big>Nice day.
<small>Hi back!</small>End

For performance - use StringBuilder.
For convenience you can use Map to store values and replacements.
Map<String, String> map = new HashMap<>();
map.put("<h1>","<big><big><big><b>");
map.put("</h1>","</b></big></big></big>");
map.put("<h2>","<big><big>");
...
StringBuilder builder = new StringBuilder(yourString);
for (String key : map.keySet()) {
replaceAll(builder, key, map.get(key));
}
To replace all occurrences in a StringBuilder, see "Replace all occurrences of a String using StringBuilder?":
public static void replaceAll(StringBuilder builder, String from, String to)
{
int index = builder.indexOf(from);
while (index != -1)
{
builder.replace(index, index + from.length(), to);
index += to.length(); // Move to the end of the replacement
index = builder.indexOf(from, index);
}
}

Unfortunately StringBuilder doesn't provide a replace(string,string) method, so you might want to consider using Pattern and Matcher in conjunction with StringBuffer:
String input = ...;
StringBuffer sb = new StringBuffer();
Pattern p = Pattern.compile("</?(h1|h2|...)>");
Matcher m = p.matcher( input );
while( m.find() )
{
String match = m.group();
String replacement = ...; //get replacement for match, e.g. by lookup in a map
m.appendReplacement( sb, replacement );
}
m.appendTail( sb );
You could do something similar with StringBuilder but in that case you'd have to implement appendReplacement etc. yourself.
As for the expression, you could also just match any HTML tag (although that might cause problems, since regex and arbitrary HTML don't fit together very well); when the lookup doesn't return a result, you replace the match with itself.
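To make the outline above concrete, here is a minimal self-contained sketch with the elided parts filled in under stated assumptions: the pattern is restricted to h1 through h6 tags, the replacement is looked up in a Map, and a tag without a mapping is replaced with itself. The class name RegexTagReplacer is hypothetical. Matcher.quoteReplacement is used so that a '$' or '\' in a replacement is not interpreted as a group reference.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTagReplacer {
    // Assumption: only opening and closing h1..h6 tags are of interest
    private static final Pattern TAG = Pattern.compile("</?h[1-6]>");

    static String replaceTags(String input, Map<String, String> replacements) {
        Matcher m = TAG.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String match = m.group();
            // Fall back to the matched tag itself when there is no mapping
            String replacement = replacements.getOrDefault(match, match);
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb); // copy whatever follows the last match
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("<h3>", "<big>");
        map.put("</h3>", "</big>");
        System.out.println(replaceTags("a<h3>Hi</h3><h2>x</h2>", map));
    }
}
```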

The particular example you provide seems to be HTML or XHTML. Trying to edit HTML or XML using regular expressions is fraught with problems. For the kind of editing you seem to be interested in, you should look at using XSLT. Another possibility is to use SAX, the streaming XML parser, and have your back-end write the edited output on the fly. If the text is actually HTML, you might be better off using a tolerant HTML parser, such as JSoup, to build a parsed representation of the document (like the DOM), and manipulate that before outputting it.

StringBuilder is backed by a char array. So, unlike String instances, it is mutable. Thus, you can call indexOf() and replace() on the StringBuilder.

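I would do something like this:
StringBuilder sb = new StringBuilder();
for (int i = 0; i < str.length(); i++) {
    if (tagEquals(str, i, "h1")) {
        sb.append("<big><big><big><b>");
        // Assuming tagEquals matched "<h1>" starting at i: skip the rest of
        // the tag; the loop's i++ consumes its final character
        i += 3;
    } else if (tagEquals(str, i, "/h1")) {
        ...
    } else {
        sb.append(str.charAt(i));
    }
}
tagEquals is a helper function that checks whether the tag with the given name starts at position i.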

Use Apache Commons StringUtils.replaceEach.
String[] searches = new String[]{"<h1>", "</h1>", "<h2>", ...};
String[] replacements = new String[]{"<big><big><big><b>", "</b></big></big></big>", "<big><big>", ...};
string = StringUtils.replaceEach(string, searches, replacements);

Unused variable in a loop

I need to build a pattern string according to an argument list. If the arguments are "foo", "bar", "data", then pattern should be: "?, ?, ?"
My code is:
List<String> args;
...
for(String s : args) {
pattern += "?,";
}
pattern = pattern.substring(0, pattern.length()-1);
It works fine. My only concern is that s is never used, which makes the code look a little dirty.
Any improvements for this?
I hope something like:
for(args.size()) {
...
}
But apparently there is no such thing.

You could use the classic for loop with a condition:
for (int i = 0; i < args.size(); i++)
In this case, i is used as a counting variable.
Other than that, there aren't any improvements to be made, and there isn't really a need for any.

I usually do that in Haskell / Python style - naming it with "_". That way it's sort of obvious that variable is intentionally unused:
int n = 0;
for (final Object _ : iterable) { ++n; }
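IntelliJ still complains, though :) Note that Java 9 made _ a reserved keyword, so this no longer compiles there; Java 22 allows _ again as an unnamed variable.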

Another option is to use the Java Stream API. It's pretty neat.
String output = args
.stream()
.map( string -> "?" ) // transform each string into a ?
.collect( Collectors.joining( "," ) ); // collect and join on ,

Why not use
for (int i = 0; i < args.size(); i++) {
...
}
You use the for each block if you want to make use of the contents of whatever you iterate on. For example, you use for (String s : args) if you know you're going to use each String value present in args. And it looks like here, you don't need the actual Strings.

If you have Guava around, you could try combining Joiner with Collections.nCopies:
Joiner.on(", ").join(Collections.nCopies(args.size(), "?"));
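The same idea also works without Guava on Java 8 or later, since String.join accepts any Iterable of CharSequence. A small sketch, with the class and method names (Placeholders, placeholders) chosen only for illustration:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class Placeholders {
    // Builds one "?" per argument, separated by ", ".
    // The argument values themselves are never read, only their count.
    static String placeholders(List<?> args) {
        return String.join(", ", Collections.nCopies(args.size(), "?"));
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("foo", "bar", "data");
        System.out.println(placeholders(values)); // prints ?, ?, ?
    }
}
```

An empty list yields an empty string, so no special case or trailing-comma cleanup is needed.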

What you're looking for is a concept known as a "join". In stronger languages, like Groovy, it's available in the standard library, and you could write, for instance args.join(',') to get what you want. With Java, you can get a similar effect with StringUtils.join(args, ",") from Commons Lang (a library that should be included in every Java project everywhere).
Update: I obviously missed an important part in my original answer. The list of strings needs to be turned into question marks first. Commons Collections, another library that should always be included in Java apps, lets you do that with CollectionUtils.collect(args, new ConstantTransformer<String, String>("?")), which returns a new collection (unlike CollectionUtils.transform, which modifies the collection in place and returns void). Then pass the result to the join that I originally mentioned. Of course, this is getting a bit unwieldy in Java, and a more imperative approach might be more appropriate.
For the sake of comparison, the entire thing can be solved in Groovy and many other languages with something like args.collect{'?'}.join(','). In Java, with the utilities I mentioned, that goes more like:
StringUtils.join(
    CollectionUtils.collect(args,
        new ConstantTransformer<String, String>("?")),
    ",");
Quite a bit less readable...

I would suggest using StringBuilder along with the classic for loop here.
String pattern = "";
if (args.size() > 0) {
StringBuilder sb = new StringBuilder("?");
for(int i = 1; i < args.size(); i++) {
sb.append(", ?");
}
pattern = sb.toString();
}
If you don't want to use a for loop (which you said is not concise enough), use a while loop instead:
int count;
String pattern = "";
if ((count = args.size()) > 0) {
StringBuilder sb = new StringBuilder("?");
while (count-- > 1) {
sb.append(", ?");
}
pattern = sb.toString();
}
Also, see When to use StringBuilder?
At the point where you're concatenating in a loop - that's usually when the compiler can't substitute StringBuilder by itself.

String parsing based on mask

I have several strings that follow multiple masks. I would like to know whether there is a better way of parsing strings against a mask than calling String.split, looping over the tokens, and identifying each one by its position. That code also gets clumsy, because a lot of token logic has to be coded.
Sample masks can be:
PROD-LOC-STATE-CITY
PROD-DEST-STATE-ZIP
PROD-OZIP-DZIP-VER-INS
Sample Strings:
CoolDuo-GROUND-NYC-10082
Sample code:
String[] arr = input.split("-");
int pos = 0;
for(String k:arr){
if(pos == 0) {
//-- k is of PROD
...
...
}
..
...
pos++;
}
This kind of code is repeated for every mask type.

You can use regex groups to get the target strings by group name: http://docs.oracle.com/javase/tutorial/essential/regex/groups.html. See also "Regex Named Groups in Java".
If you can't use named groups, you can do it this way (if you are absolutely sure about the structure of your strings):
final static int PROD_POS = 1;
final static int STATE_POS = 3;
...
Pattern pattern = Pattern.compile("(some_regexp)-(some_regexp)-(some_regexp)");
Matcher matcher = pattern.matcher(input);
if ( matcher.matches() ) {
String state = matcher.group(STATE_POS);
}
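For the named-group variant, the regex can be generated from the mask itself, so no per-mask parsing code is needed. The following is a sketch under several assumptions: fields are separated by '-', no field value contains '-', and every field name in the mask is a valid group name (letters and digits only). MaskParser and parse are hypothetical names, and pairing the sample string with the PROD-DEST-STATE-ZIP mask is a guess made only for the demonstration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MaskParser {
    // Parses input against a mask such as "PROD-DEST-STATE-ZIP".
    // Returns field name -> value in mask order, or an empty map if
    // the input does not have the shape the mask describes.
    static Map<String, String> parse(String mask, String input) {
        String[] fields = mask.split("-");
        StringBuilder regex = new StringBuilder();
        for (String field : fields) {
            if (regex.length() > 0) {
                regex.append('-');
            }
            // One named group per mask field, matching anything but '-'
            regex.append("(?<").append(field).append(">[^-]+)");
        }
        Matcher m = Pattern.compile(regex.toString()).matcher(input);
        Map<String, String> result = new LinkedHashMap<>();
        if (m.matches()) {
            for (String field : fields) {
                result.put(field, m.group(field));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("PROD-DEST-STATE-ZIP", "CoolDuo-GROUND-NYC-10082"));
    }
}
```

In real code the compiled Pattern for each mask would be cached rather than rebuilt on every call.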

If you really want to delve deep into this problem, for when your masks get too big to manage, you can use one of the lexical analysis packages available for Java.
If you want a basic idea of what that means, look here (http://en.wikipedia.org/wiki/Lexical_analysis).
A popular package for Java is JFlex (http://jflex.de/), but there are many others; a web search will turn them up.
Best of luck

Efficient way to search for a set of strings in a string in Java

I have a set of elements of size about 100-200. Let a sample element be X.
Each of the elements is a set of strings (number of strings in such a set is between 1 and 4). X = {s1, s2, s3}
For a given input string (about 100 characters), say P, I want to test whether any of the X is present in the string.
X is present in P iff for all s in X, s is a substring of P.
The set of elements is available for pre-processing.
I want this to be as fast as possible within Java. Possible approaches which do not fit my requirements:
Checking whether all the strings s are substring of P seems like a costly operation
Because s can be any substring of P (not necessarily a word), I cannot use a hash of words
I cannot directly use regex as s1, s2, s3 can be present in any order and all of the strings need to be present as substring
Right now my approach is to construct a huge regex out of each X with all possible permutations of the order of strings. Because number of elements in X <= 4, this is still feasible. It would be great if somebody can point me to a better (faster/more elegant) approach for the same.
Please note that the set of elements is available for pre-processing and I want the solution in java.

You can use regex directly:
Pattern regex = Pattern.compile(
"^ # Anchor search to start of string\n" +
"(?=.*s1) # Check if string contains s1\n" +
"(?=.*s2) # Check if string contains s2\n" +
"(?=.*s3) # Check if string contains s3",
Pattern.DOTALL | Pattern.COMMENTS);
Matcher regexMatcher = regex.matcher(subjectString);
foundMatch = regexMatcher.find();
foundMatch is true if all three substrings are present in the string.
Note that you might need to escape your "needle strings" if they could contain regex metacharacters.
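Since the set of elements is available for pre-processing, one such lookahead pattern can be compiled per element up front. A minimal sketch, assuming each element is given as a list of literal strings (LookaheadBuilder and allOf are hypothetical names); Pattern.quote takes care of the escaping mentioned above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class LookaheadBuilder {
    // Builds ^(?=.*s1)(?=.*s2)... with every needle quoted as a literal.
    static Pattern allOf(List<String> needles) {
        StringBuilder regex = new StringBuilder("^");
        for (String needle : needles) {
            regex.append("(?=.*").append(Pattern.quote(needle)).append(")");
        }
        // DOTALL lets '.' cross line breaks in the input
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    public static void main(String[] args) {
        Pattern p = allOf(Arrays.asList("a.b", "cd"));
        System.out.println(p.matcher("xxcdyya.b").find()); // both present, any order
        System.out.println(p.matcher("xxcdyyaXb").find()); // '.' is literal, so no match
    }
}
```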

It sounds like you're prematurely optimising your code before you've discovered that a particular approach is actually too slow.
The nice property about your set of strings is that the string must contain all elements of X as a substring -- meaning we can fail fast if we find one element of X that is not contained within P. This might turn out a better time saving approach than others, especially if the elements of X are typically longer than a few characters and contain no or only a few repeating characters. For instance, a regex engine need only check 20 characters in 100 length string when checking for the presence of a 5 length string with non-repeating characters (eg. coast). And since X has 100-200 elements you really, really want to fail fast if you can.
My suggestion would be to sort the strings in order of length and check for each string in turn, stopping early if one string is not found.
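A plain-Java sketch of this fail-fast check, using String.contains and no regex, could look as follows. The names are hypothetical, and ordering each element's strings longest first is an assumption (longer strings are presumed less likely to occur, so a miss shows up sooner); the answer above only says to sort by length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class FailFastMatcher {
    // Pre-processing step: sort each element's strings longest first.
    static List<List<String>> preprocess(List<List<String>> elements) {
        List<List<String>> sorted = new ArrayList<>();
        for (List<String> element : elements) {
            List<String> copy = new ArrayList<>(element);
            copy.sort(Comparator.<String>comparingInt(String::length).reversed());
            sorted.add(copy);
        }
        return sorted;
    }

    // An element is present only if every one of its strings occurs in p;
    // the loop stops at the first string that is missing.
    static boolean isPresent(List<String> element, String p) {
        for (String s : element) {
            if (!p.contains(s)) {
                return false;
            }
        }
        return true;
    }

    // True as soon as any element is fully present in p.
    static boolean anyPresent(List<List<String>> elements, String p) {
        for (List<String> element : elements) {
            if (isPresent(element, p)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<List<String>> elements = preprocess(Arrays.asList(
                Arrays.asList("xy", "coast"),
                Arrays.asList("foo")));
        System.out.println(anyPresent(elements, "the coast is xy clear"));
    }
}
```

With 100-200 elements of at most four strings each and an input of about 100 characters, this is a useful baseline to measure before reaching for anything more elaborate.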

Looks like a perfect case for the Rabin–Karp algorithm:
Rabin–Karp is inferior for single pattern searching to Knuth–Morris–Pratt algorithm, Boyer–Moore string search algorithm and other faster single pattern string searching algorithms because of its slow worst case behavior. However, Rabin–Karp is an algorithm of choice for multiple pattern search.

When the preprocessing time doesn't matter, you could create a hash table which maps every one-letter, two-letter, three-letter etc. combination which occurs in at least one string to a list of strings in which it occurs.
The algorithm to index a string would look like this (untested):
HashMap<String, Set<String>> indexes = new HashMap<String, Set<String>>();
for (int pos = 0; pos < string.length(); pos++) {
    // Start at length 1 so that the empty substring is not indexed
    for (int sublen = 1; sublen <= string.length() - pos; sublen++) {
        String substring = string.substring(pos, pos + sublen);
        Set<String> stringsForThisKey = indexes.get(substring);
        if (stringsForThisKey == null) {
            stringsForThisKey = new HashSet<String>();
            indexes.put(substring, stringsForThisKey);
        }
        stringsForThisKey.add(string);
    }
}
Indexing each string that way is quadratic in the length of the string, but it only needs to be done once for each string.
The result, however, is constant-time access to the list of strings in which a specific string occurs.

You are probably looking for the Aho-Corasick algorithm, which constructs an automaton (trie-like) from the set of strings (the dictionary) and tries to match the input string against the dictionary using this automaton.

You might want to consider using a "Suffix Tree" as well. I haven't used this code, but there is one described here
I have used proprietary implementations (that I no longer even have access to) and they are very fast.

One way is to generate every possible substring and add this to a set. This is pretty inefficient.
Instead you can create all the strings from any point to the end into a NavigableSet and search for the closest match. If the closest match starts with the string you are looking for, you have a substring match.
static class SubstringMatcher {
final NavigableSet<String> set = new TreeSet<String>();
SubstringMatcher(Set<String> strings) {
for (String string : strings) {
for (int i = 0; i < string.length(); i++)
set.add(string.substring(i));
}
// remove duplicates.
String last = "";
for (String string : set.toArray(new String[set.size()])) {
if (string.startsWith(last))
set.remove(last);
last = string;
}
}
public boolean findIn(String s) {
String s1 = set.ceiling(s);
return s1 != null && s1.startsWith(s);
}
}
public static void main(String... args) {
Set<String> strings = new HashSet<String>();
strings.add("hello");
strings.add("there");
strings.add("old");
strings.add("world");
SubstringMatcher sm = new SubstringMatcher(strings);
System.out.println(sm.set);
for (String s : "ell,he,ow,lol".split(","))
System.out.println(s + ": " + sm.findIn(s));
}
prints
[d, ello, ere, hello, here, ld, llo, lo, old, orld, re, rld, there, world]
ell: true
he: true
ow: false
lol: false
