Best way to search for a special string in a text

I have a piece of text of around 3000 characters, and I want to search it for strings with certain characteristics, for example strings like [*].
That is, I want to get [a] and [bc] from
sjfhshdkfjhskdhfksdf[a]sfdsgfsdf[bc]
I know there is an algorithm called KMP that guarantees linear-time searching through a text, but here I don't have a fixed string to find; maybe I have to use a regular expression somewhere.
How can I do this better than O(n^2)? Are there any lightweight libraries for this if I'm using Java?

No libraries needed, you've effectively described a use case for regex! They are highly optimized for searching, and in this case will be O(n).
String str = "sjfhshdkfjhskdhfksdf[a]sfdsgfsdf[bc]";
List<String> allMatches = new ArrayList<>();
Matcher m = Pattern.compile("\\[[^\\]]*]").matcher(str);
while (m.find()) {
allMatches.add(m.group());
}
If you have any doubts though and really want some O(n) that you can see, here's an algorithm:
String str = "sjfhshdkfjhskdhfksdf[a]sfdsgfsdf[bc]";
List<String> allMatches = new ArrayList<>();
for (int i = str.indexOf('['), j; i != -1; i = str.indexOf('[', j + 1)) {
    j = str.indexOf(']', i + 1);
    // if `j` is -1, the brackets are unbalanced: stop here (or perhaps throw an Exception?)
    if (j == -1) break;
    allMatches.add(str.substring(i, j + 1));
}

Here's how to do it in one line:
String[] hits = str.replaceAll("^.*?\\[|][^\\]]*$", "").split("].*?\\[");
This works by stripping off leading and trailing chars up to and including the first/last opening/closing square bracket, then splits on a close bracket to the next opening bracket (inclusive).
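To make the difference between the approaches concrete, here is a minimal sketch that runs the Matcher loop and the one-liner side by side. The class and method names (BracketSearchDemo, withBrackets, contentsOnly) are made up for this illustration and do not come from the answers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: names are assumptions for illustration only.
final class BracketSearchDemo {
    // Compiled once; matches '[', then any run of non-']' characters, then ']'.
    private static final Pattern BRACKETED = Pattern.compile("\\[[^\\]]*]");

    // Matcher-based search: each hit keeps its surrounding brackets.
    static List<String> withBrackets(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = BRACKETED.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    // One-line variant: strips the outer text, then splits on "]...[",
    // so each hit is the content between the brackets, without the brackets.
    static List<String> contentsOnly(String text) {
        return Arrays.asList(
                text.replaceAll("^.*?\\[|][^\\]]*$", "").split("].*?\\["));
    }

    public static void main(String[] args) {
        String str = "sjfhshdkfjhskdhfksdf[a]sfdsgfsdf[bc]";
        System.out.println(withBrackets(str));
        System.out.println(contentsOnly(str));
    }
}
```

With the sample string this should print [[a], [bc]] and then [a, bc]. One caveat of the one-liner: for a text that contains no brackets at all, nothing is stripped or split, so it returns the whole text as a single element rather than an empty result.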

Related

Split a java string among < > brackets, including the brackets, but only if no space between brackets

I need to be able to turn a string, for instance "This and <those> are.", into a string array of the form ["This and ", "<those>", " are."]. I have been trying to use the String.split() command, and I've gotten this regex:
"(?=[<>])"
However, this just gets me ["This and ", "<those", "> are."]. I can't figure out a good regex to get the brackets all on the same element, and I also can't have spaces between those brackets. So for instance, "This and <hey there> are." Should be simply split to ["This and <hey there> are."]. Ideally I'd like to just rely solely on the split command for this operation. Can anyone point me in the right direction?

This is not really possible with split alone. Because the 'separator' needs to match 0 characters, it has to consist entirely of lookahead/lookbehind, and Java's lookbehind requires a bounded length. To decide whether to split after a '>', you would need to look back arbitrarily far into the string to know whether a space occurs inside the brackets. So what you want is effectively impossible.
Just write a regexp that FINDS the construct you want; that's a lot simpler. Simply use Pattern.compile("<\\w+>") (taking a few liberties on what you intend a thing-in-brackets to look like; if it truly can be ANYTHING except spaces and the closing bracket, "<[^ >]+>" is what you want).
Then, just loop through, finding as you go:
private static final Pattern TOKEN_FINDER = Pattern.compile("<\\w+>");
List<String> parse(String in) {
Matcher m = TOKEN_FINDER.matcher(in);
if (!m.find()) return List.of(in);
var out = new ArrayList<String>();
int pos = 0;
do {
int s = m.start();
if (s > pos) out.add(in.substring(pos, s));
out.add(m.group());
pos = m.end();
} while (m.find());
if (pos < in.length()) out.add(in.substring(pos));
return out;
}
Let's try it:
System.out.println(parse("This and <those> are."));
System.out.println(parse("This and <hey there> are."));
System.out.println(parse("<edgecase>"));
System.out.println(parse("<edgecase>2"));
System.out.println(parse("3<edgecase>"));
prints:
[This and , <those>,  are.]
[This and <hey there> are.]
[<edgecase>]
[<edgecase>, 2]
[3, <edgecase>]
This seems to be what you wanted.

How to split a string by a newline and a fixed number of tabs like "\n\t" in Java?

My input string is the following:
String input = "dir\n\tsubdir1\n\tsubdir2\n\t\tfile.ext";
My intended result is
dir,
subdir1,
subdir2\n\t\tfile.ext
The requirement is to split the input by "\n\t" but not "\n\t\t".
A simple try of
String[] answers = input.split("\n\t");
also splits "\tfile.ext" from the last entry. Is there a simple regular expression to solve the problem? Thanks!

You can split on a newline and tab, and assert not a tab after it to the right.
\n\t(?!\t)
See a regex demo.
String input = "dir\n\tsubdir1\n\tsubdir2\n\t\tfile.ext";
String[] answers = input.split("\\n\\t(?!\\t)");
System.out.println(Arrays.toString(answers));
Output
[dir, subdir1, subdir2
file.ext]

If you are looking for a generic approach, it depends heavily on what format the input will generally have. If the format is the same for all possible inputs (dir\n\tdir2\n\tdir3\n\t\tfile.something), one way to do it is the following:
String input = "dir\n\tsubdir1\n\tsubdir2\n\t\tfile.ext";
String[] answers = input.split("\n\t");
for (int i = 1; i < answers.length; i++)
if (answers[i].contains("\t"))
answers[i-1] = answers[i-1] + "\n\t" + answers[i];
String[] answersFinal = Arrays.copyOf(answers, answers.length-1);
for (int i = 0; i < answersFinal.length; i++)
answersFinal[i] = answers[i];
for (String s : answersFinal)
System.out.println(s);
However this is not a good solution and I would suggest reformatting your input to include a special sequence of characters that you can use to split the input, for example:
String input = "dir\n\tsubdir1\n\tsubdir2\n\t\tfile.ext";
input = input.replaceAll("\n\t", "%%%").replaceAll("%%%\t", "\n\t\t");
And then split the input with '%%%', you will get your desired output.
But again, this highly depends on how generic you want it to be, the best solution is to use an overall different approach to achieve what you want, but I cannot provide it since I don't have enough information on what you are developing.

You can simply do:
String input = "dir\n\tsubdir1\n\tsubdir2\n\t\tfile.ext";
String[] modifiedInput = input.replaceAll("\n\t\t", "####").replaceAll("\n\t", "§§§§").replaceAll("####", "\n\t\t").split("§§§§");
Replace each \n\t\t (which itself contains \n\t) with a placeholder.
Replace each remaining \n\t with a second placeholder.
Change the first placeholder back to \n\t\t, as you seemingly want to preserve it.
Split on the second placeholder.
This is not very efficient, but it is still fast enough as long as you don't use it on large volumes of data.
The following approach is more efficient, as it uses only two splits, but it only works if there is exactly one element prefixed with \n\t\t, at the end. Array access is cheap (O(1), constant time), so this is more code but fewer full passes over the string (replaceAll, split).
final String input = "dir\n\tsubdir1\n\tsubdir2\n\t\tfile.ext";
final String[] s1 = input.split("\n\t\t");
final String last = s1[s1.length - 1];
final String[] modifiedInput = s1[0].split("\n\t");
modifiedInput[modifiedInput.length -1] = modifiedInput[modifiedInput.length -1] + "\n\t\t" + last;

How to split a string and save the 2 characters that I split with?

I am trying to split a given string using the Java split method. The string should be divided by two different characters (+ and -), and I want to keep those characters in the array as well, at the same index as the substring they precede.
for example :
input : String s = "4x^2+3x-2"
output :
arr[0] = 4x^2
arr[1] = +3x
arr[2] = -2
I know how to get the + or - characters in a different index between the numbers but it is not helping me,
any suggestions please?

You can approach this problem in many ways. I'm sure there are clever and fancy ways to split this expression. I will show you the simplest problem-solving process that can help you.
State the problem you need to solve, with its input and output
Problem: Split a math expression into subexpressions at + and - signs
Input: 4x^2+3x-2
Output: 4x^2,+3x,-2
Write pseudocode with some logic you think might work
Given an expression string
Create an empty list of expressions
Create a subExpression string
For each character in the expression
    If the character is + or - then
        add the subExpression to the list and start a new subExpression with that character
    otherwise, append the character to the subExpression
At the end, add the remaining subExpression to the list
Implement the pseudocode in the programming language of your choice
String expression = "4x^2+3x-2";
List<String> expressions = new ArrayList<>();
StringBuilder subExpression = new StringBuilder();
for (int i = 0; i < expression.length(); i++) {
    char character = expression.charAt(i);
    if (character == '-' || character == '+') {
        // a sign ends the current term and starts the next one
        expressions.add(subExpression.toString());
        subExpression = new StringBuilder(String.valueOf(character));
    } else {
        subExpression.append(character);
    }
}
// add the last term, which is not followed by a sign
expressions.add(subExpression.toString());
System.out.println(expressions);
Output
[4x^2, +3x, -2]
You will end with one algorithm that works for your problem. You can start to improve it.

Try this code:
String s = "4x^2+3x-2";
s = s.replace("+", "#+");
s = s.replace("-", "#-");
String[] ss = s.split("#");
for (int i = 0; i < ss.length; i++) {
Log.e("XOP",ss[i]);
}
This code replaces + and - with #+ and #- respectively and then splits the string with #. That way the + and - operators are not lost in the result.
If you require # as input character then you can use any other Unicode character instead of #.

Try this one:
String s = "4x^2+3x-2";
String[] arr = s.split("[\\+-]");
for(int i=0;i<arr.length;i++){
System.out.println(arr[i]);
}
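Note that splitting on the character class [\\+-] consumes the operators, so the snippet above yields 4x^2, 3x and 2 without their signs. A minimal sketch of a variant that keeps them, assuming a zero-width lookahead is acceptable (the class and method names are made up for this illustration):

```java
import java.util.Arrays;

// Hypothetical helper: names are assumptions for illustration only.
final class SignedTermSplit {
    // (?=[+-]) matches the empty position just before a '+' or '-',
    // so the sign is not consumed and stays attached to the term that follows.
    static String[] splitKeepingSigns(String expression) {
        return expression.split("(?=[+-])");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKeepingSigns("4x^2+3x-2")));
    }
}
```

For the input from the question this should print [4x^2, +3x, -2].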

Personally I like it better to have positive matches of patterns, especially if the split pattern itself is empty.
So for instance you could use a Pattern and Matcher like this:
Pattern p = Pattern.compile("(^|[+-])([^+-]*)");
Matcher m = p.matcher("4x^2+3x-2");
while (m.find()) {
System.out.printf("%s or %s %s%n", m.group(), m.group(1), m.group(2));
}
This matches the start of the string or a plus or minus, ^|[+-], followed by any number of characters that are not a plus or minus, [^+-]*.
Do note that ^ first matches the start of the string, and is then used to negate a character class when it appears inside brackets. Regular expressions are tricky like that.
Bonus: you can also use the two groups (within the parentheses in the pattern) to get the operator, if any, and the term separately.
All this is presuming that you want to use/test regular expressions; generally things like this require a parser rather than a regular expression.
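A one-liner (Java 10 or later, for var and Matcher.results()) for persons thinking that this is too complex. The alternation is wrapped in a non-capturing group; without it, ^ on its own would match as an empty string at the start and the first term would be lost:
var expressions = Pattern.compile("(?:^|[+-])[^+-]*")
        .matcher("4x^2+3x-2")
        .results()
        .map(r -> r.group())
        .collect(Collectors.toList());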

Splitting string in between two characters in Java

I am currently attempting to interpret some code I wrote for something. The information I would like to split looks something like this:
{hey=yes}TEST
What I am trying to accomplish, is splitting above string in between '}' and 'T' (T, which could be any letter). The result I am after is (in pseudocode):
["{hey=yes}", "TEST"]
How would one go about doing so? I know basic regex, but have never gotten into using it to split strings in between letters before.
Update:
In order to split the string I am using the String.split method. Do tell if there is a better way to go about doing this.

You can use String's split method, as follows:
String str = "{hey=foo}TEST";
String[] split = str.split("(?<=})");
System.out.println(split[0] + ", " + split[1]);
It splits the string and prints this:
{hey=foo}, TEST
(?<=}) is a lookbehind: it splits at the position right after the character } and keeps the character. By default, if you just split on a character, that character is removed by the split.
This other answer provides a complete explanation of all options when using the split method:
how-to-split-string-with-some-separator-but-without-removing-that-separator-in-j

Usage of regexp for such a small piece of code can be really slow, if it is repeated thousands of times (e.g. like analysing Alfresco metadata for lot of documents).
Look at this snippet:
String s = "{key=value}SOMETEXT";
String[] e = null;
long now = 0L;
now = new Date().getTime();
for (int i = 0; i < 3000000; i++) {
e = s.split("(?<=})");
}
System.out.println("Regexp: " + (new Date().getTime() - now));
now = new Date().getTime();
for (int i = 0; i < 3000000; i++) {
int idx = s.indexOf('}') + 1;
e = new String[] { s.substring(0, idx), s.substring(idx) };
}
System.out.println("IndexOf:" + (new Date().getTime() - now));
result is
Regexp: 2544
IndexOf:113
This means that the regexp is roughly 22 times slower than the (simpler) substring approach in this run. Keep it in mind: it can make the difference between efficient code and elegant (!) code.
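Part of the regex cost in a loop like this is that String.split with a pattern such as "(?<=})" compiles the regular expression on every call. A minimal sketch of the usual mitigation, compiling the Pattern once and reusing it, next to the indexOf variant; the class and method names are made up for this illustration, and no timings are claimed here, so measure on your own data:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper: names are assumptions for illustration only.
final class SplitAfterBrace {
    // Compiled once and reused for every call instead of once per split.
    private static final Pattern AFTER_BRACE = Pattern.compile("(?<=})");

    static String[] withPrecompiledPattern(String s) {
        return AFTER_BRACE.split(s);
    }

    // Plain string search: cuts once, right after the first '}'.
    static String[] withIndexOf(String s) {
        int idx = s.indexOf('}') + 1;
        return new String[] { s.substring(0, idx), s.substring(idx) };
    }

    public static void main(String[] args) {
        String s = "{key=value}SOMETEXT";
        System.out.println(Arrays.toString(withPrecompiledPattern(s)));
        System.out.println(Arrays.toString(withIndexOf(s)));
    }
}
```

Both calls should print [{key=value}, SOMETEXT] for this input. The two methods only agree when the string contains exactly one }: the regex variant splits after every closing brace, while the indexOf variant cuts only after the first one.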

If you're looking for a regex approach and also want some validation that input follows the expected syntax you probably want something like this:
public List<String> splitWithRegexp(String string)
{
Matcher matcher = Pattern.compile("(\\{.*\\})(.*)").matcher(string);
if (matcher.find())
return Arrays.asList(matcher.group(1), matcher.group(2));
else
throw new IllegalArgumentException("Input didn't match!");
}
The parentheses in the regexp are capturing groups, which you can access with matcher.group(n) calls. Group 0 is the whole match.

Efficient way to search for a set of strings in a string in Java

I have a set of elements of size about 100-200. Let a sample element be X.
Each of the elements is a set of strings (number of strings in such a set is between 1 and 4). X = {s1, s2, s3}
For a given input string (about 100 characters), say P, I want to test whether any of the X is present in the string.
X is present in P iff for all s belong to X, s is a substring of P.
The set of elements is available for pre-processing.
I want this to be as fast as possible within Java. Possible approaches which do not fit my requirements:
Checking whether all the strings s are substring of P seems like a costly operation
Because s can be any substring of P (not necessarily a word), I cannot use a hash of words
I cannot directly use regex as s1, s2, s3 can be present in any order and all of the strings need to be present as substring
Right now my approach is to construct a huge regex out of each X with all possible permutations of the order of strings. Because number of elements in X <= 4, this is still feasible. It would be great if somebody can point me to a better (faster/more elegant) approach for the same.
Please note that the set of elements is available for pre-processing and I want the solution in java.

You can use regex directly:
Pattern regex = Pattern.compile(
"^ # Anchor search to start of string\n" +
"(?=.*s1) # Check if string contains s1\n" +
"(?=.*s2) # Check if string contains s2\n" +
"(?=.*s3) # Check if string contains s3",
Pattern.DOTALL | Pattern.COMMENTS);
Matcher regexMatcher = regex.matcher(subjectString);
foundMatch = regexMatcher.find();
foundMatch is true if all three substrings are present in the string.
Note that you might need to escape your "needle strings" if they could contain regex metacharacters.
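As a sketch of how the escaping could be handled, the lookahead regex can be assembled from a list of needles with Pattern.quote, which makes each needle match literally. The class and method names (AllSubstringsRegex, containsAll) are made up for this illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: names are assumptions for illustration only.
final class AllSubstringsRegex {
    // Builds ^(?=.*needle1)(?=.*needle2)... with every needle quoted,
    // so metacharacters such as '.' or '(' are treated as plain text.
    static Pattern containsAll(List<String> needles) {
        StringBuilder regex = new StringBuilder("^");
        for (String needle : needles) {
            regex.append("(?=.*").append(Pattern.quote(needle)).append(')');
        }
        // DOTALL lets '.' cross line breaks, as in the answer above.
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    public static void main(String[] args) {
        Pattern p = containsAll(List.of("a.b", "(x"));
        System.out.println(p.matcher("zz(x..a.b").find());
        System.out.println(p.matcher("aXb(x").find());
    }
}
```

This should print true and then false: the second input contains "(x" but not the literal text "a.b". Since the set of elements is available for pre-processing, one such Pattern per X could be compiled up front.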

It sounds like you're prematurely optimising your code before you've actually discovered a particular approach is actually too slow.
The nice property about your set of strings is that the string must contain all elements of X as a substring -- meaning we can fail fast if we find one element of X that is not contained within P. This might turn out to be a better time-saving approach than others, especially if the elements of X are typically longer than a few characters and contain no or only a few repeating characters. For instance, a regex engine need only check 20 characters in a 100-character string when checking for the presence of a 5-character string with non-repeating characters (e.g. coast). And since X has 100-200 elements you really, really want to fail fast if you can.
My suggestion would be to sort the strings in order of length and check for each string in turn, stopping early if one string is not found.
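A minimal sketch of this fail-fast idea, using plain String.contains. The names (FailFastMatcher, preprocess, anyPresent) are made up for this illustration, and checking the longest string first is an assumption: the suggestion does not say which order to sort in, and longer strings are merely a guess at which ones are most likely to be absent.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: names and the longest-first order are assumptions.
final class FailFastMatcher {
    // Pre-processing step: sort the strings of each element, longest first.
    static List<List<String>> preprocess(List<List<String>> elements) {
        List<List<String>> sorted = new ArrayList<>();
        for (List<String> element : elements) {
            List<String> copy = new ArrayList<>(element);
            copy.sort(Comparator.comparingInt(String::length).reversed());
            sorted.add(copy);
        }
        return sorted;
    }

    // X is present in p only if every string of X is a substring of p;
    // return as soon as one string is missing.
    static boolean containsAll(String p, List<String> element) {
        for (String s : element) {
            if (!p.contains(s)) {
                return false;
            }
        }
        return true;
    }

    // True if at least one element X is fully present in p.
    static boolean anyPresent(String p, List<List<String>> elements) {
        for (List<String> element : elements) {
            if (containsAll(p, element)) {
                return true;
            }
        }
        return false;
    }
}
```

A call would look like anyPresent(P, preprocess(allElements)), with preprocess run once up front. With 100-200 elements of at most 4 strings each and a P of about 100 characters, this is at most a few hundred short substring searches per input, so it may be worth measuring before reaching for anything more elaborate.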

Looks like a perfect case for the Rabin–Karp algorithm:
Rabin–Karp is inferior for single pattern searching to Knuth–Morris–Pratt algorithm, Boyer–Moore string search algorithm and other faster single pattern string searching algorithms because of its slow worst case behavior. However, Rabin–Karp is an algorithm of choice for multiple pattern search.

When the preprocessing time doesn't matter, you could create a hash table which maps every one-letter, two-letter, three-letter etc. combination which occurs in at least one string to a list of strings in which it occurs.
The algorithm to index a string would look like this (untested):
Map<String, Set<String>> indexes = new HashMap<>();
for (int pos = 0; pos < string.length(); pos++) {
    // every non-empty substring that starts at pos
    for (int end = pos + 1; end <= string.length(); end++) {
        String substring = string.substring(pos, end);
        // create the set on first use, then record that `string` contains this substring
        indexes.computeIfAbsent(substring, k -> new HashSet<>()).add(string);
    }
}
Indexing each string that way is quadratic in the length of the string, but it only needs to be done once for each string.
But the result is constant-time access to the list of strings in which a specific string occurs.

You are probably looking for the Aho-Corasick algorithm, which constructs a (trie-like) automaton from the set of strings (the dictionary) and matches the input string against the dictionary using this automaton.

You might want to consider using a "Suffix Tree" as well. I haven't used this code, but there is one described here
I have used proprietary implementations (that I no longer even have access to) and they are very fast.

One way is to generate every possible substring and add this to a set. This is pretty inefficient.
Instead you can create all the strings from any point to the end into a NavigableSet and search for the closest match. If the closest match starts with the string you are looking for, you have a substring match.
static class SubstringMatcher {
final NavigableSet<String> set = new TreeSet<String>();
SubstringMatcher(Set<String> strings) {
for (String string : strings) {
for (int i = 0; i < string.length(); i++)
set.add(string.substring(i));
}
// remove duplicates.
String last = "";
for (String string : set.toArray(new String[set.size()])) {
if (string.startsWith(last))
set.remove(last);
last = string;
}
}
public boolean findIn(String s) {
String s1 = set.ceiling(s);
return s1 != null && s1.startsWith(s);
}
}
public static void main(String... args) {
Set<String> strings = new HashSet<String>();
strings.add("hello");
strings.add("there");
strings.add("old");
strings.add("world");
SubstringMatcher sm = new SubstringMatcher(strings);
System.out.println(sm.set);
for (String s : "ell,he,ow,lol".split(","))
System.out.println(s + ": " + sm.findIn(s));
}
prints
[d, ello, ere, hello, here, ld, llo, lo, old, orld, re, rld, there, world]
ell: true
he: true
ow: false
lol: false
