Taking and skipping groups of strings? - java

I worked with strings in a couple of languages and then something bothered me about how we can select characters or slices (substrings) from the strings. Like we can get substrings from a string or a character from a particular position, but I was not able to find any method or operator which returns certain slices of a particular length skipping particular characters. Below is the explanation.
So suppose I have the following string: I am an example string. From this string, I want to be able to get groups of string of let's say length 2 and skip certain characters, let's say 3. Now to make things more interesting let's say I can start at any index, which for this example we'll take 5. So the string which I should get from the above conditions should be the following: anam sng. Illustration below (* for take, ! for skip).
                 **   **   **   **
            I am an example string.
                 | !!!  !!!  !!!  !
Start Position --+
I know you could implement that using counting variables which keep track of each character whether to take or not using if condition. But I was thinking of a mathematical way or maybe even an inbuilt method or operator in some languages that could do the job.
I also searched whether Regex could do the job. But couldn't come up with anything.

Generic solution: skip the first start characters, then replace all occurrences of the regex (.{0,n}).{0,m} by the first group.
Python:
import re
input = 'I am an example string.'
n = 2
m = 3
start = 5
print(re.sub('(.{0,%d}).{0,%d}' % (n, m), "\\1", input[start:]))
Java:
final String input = "I am an example string.";
final int n = 2;
final int m = 3;
final int start = 5;
final String regex = String.format("(.{0,%d}).{0,%d}", n, m);
System.out.println(input.substring(start).replaceAll(regex, "$1"));
C++11:
string input = "I am an example string.";
int n = 2;
int m = 3;
int start = 5;
stringstream s;
s << "(.{0," << n << "}).{0," << m << "}";
regex r(s.str());
cout << regex_replace(input.substr(start), r, "$1");

Regex can do it. You only need to try a little harder :)
public static void main(String[] args) {
String s = "I am an example stringpppqq";
Pattern p = Pattern.compile("(.{1,2})(?:.{3}|.{0,2}$)");
int index = 5;
Matcher m = p.matcher(s);
StringBuilder sb = new StringBuilder();
while (index < s.length() && m.find(index)) {
System.out.println(m.group(1));
sb.append(m.group(1));
index = index + 5;
System.out.println(index);
}
System.out.println(sb);
}
O/P :
anam sngqq

Python doesn't have this kind of slicing built in, so you must use a loop. But you can do it with a list comprehension:
text = 'I am a sample string'
s = 5 # start position
l = 2 # slice length
d = 3 # distance between slices
chunks = [text[p:p + l] for p in range(s, len(text), l + d)]
result = ''.join(chunks)
With a RegEx you can capture a two-character string in a group and then match (without capturing) the three characters that follow it.
import re
regex = r'(..)...'
found = re.findall(regex, text[s:]) # list of strings, since the pattern has a single group
result = ''.join(found)
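For comparison, the stepping idea from the comprehension above can be sketched in Java without a regex. This is only an illustration: the class name TakeSkip and the method takeSkip are made up here, and the sketch assumes start >= 0 and take + skip > 0.

```java
final class TakeSkip {
    // Takes `take` characters, then skips `skip` characters, beginning at `start`.
    // Assumes start >= 0 and take + skip > 0 (otherwise the loop would not advance).
    static String takeSkip(String text, int start, int take, int skip) {
        StringBuilder sb = new StringBuilder();
        for (int p = start; p < text.length(); p += take + skip) {
            // Clamp the end so a short final chunk is still copied.
            sb.append(text, p, Math.min(p + take, text.length()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(takeSkip("I am an example string.", 5, 2, 3)); // prints "anam sng"
    }
}
```

Unlike the (..)... pattern, this version also keeps a final chunk that is shorter than the slice length.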

Related

Find out recursive pattern in a string java

In one of my interviews I was asked to write a program on Java strings, and I was unable to answer it. I don't know whether it is a simple program or a complex one. I have searched the internet but could not find an exact solution. My question is as follows.
Suppose I have a string which contains a recursive pattern, like
String str1 = "abcabcabc";
In the above string the recursive pattern is "abc", because the string consists only of "abc" repeated.
If I pass this string to a function/method as a parameter, it should return "This string has a recursive pattern." If the string doesn't have any recursive pattern, it should simply return "This string doesn't contain the recursive pattern."
The following are the possibilities:
String str1 = "abcabcabc";
//This string contains recursive pattern 'abc'
String str2 = "abcabcabcabdac";
//This string doesn't contain a recursive pattern
String str3 = "abcddabcddabcddddabc";
//This string contains recursive pattern 'abc' & 'dd'
Can anybody suggest a solution/algorithm for this? I am struggling with it. What is the best way to handle the different possibilities?
From LeetCode
public boolean repeatedSubstringPattern(String str) {
int l = str.length();
for(int i=l/2;i>=1;i--) {
if(l%i==0) {
int m = l/i;
String subS = str.substring(0,i);
StringBuilder sb = new StringBuilder();
for(int j=0;j<m;j++) {
sb.append(subS);
}
if(sb.toString().equals(str)) return true;
}
}
return false;
}
The length of the repeating substring must be a divisor of the length of the input string.
Search all possible divisors of str.length(), starting from length/2.
If i is a divisor of the length, repeat the substring from 0 to i as many times as i fits into str.length().
If the repeated substring equals the input str, return true.
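A different shortcut for the same check, not part of the LeetCode answer above: a string is a repetition of a shorter block exactly when it occurs inside its own doubling at some offset strictly between 0 and its length. A minimal Java sketch, with the class and method names chosen here purely for illustration:

```java
final class DoubledStringCheck {
    // True when s consists of two or more copies of some shorter block.
    // The search starts at offset 1, so the trivial match at offset 0 is ignored;
    // the match at offset s.length() always exists, hence the strict comparison.
    static boolean isRepeated(String s) {
        return !s.isEmpty() && (s + s).indexOf(s, 1) < s.length();
    }

    public static void main(String[] args) {
        System.out.println(isRepeated("abcabcabc"));      // prints "true"
        System.out.println(isRepeated("abcabcabcabdac")); // prints "false"
    }
}
```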
This solution is not in Java. However, the problem looked interesting, so I attempted to solve it in Python. Apologies!
I wrote some logic in Python which worked (it could be written much better, but I thought the logic would help you).
The script is:
def check(lst):
    return all(x in lst[-1] for x in lst)

s = raw_input("Enter string:: ")  # Python 2; use input() on Python 3
if check(sorted(s.split(s[0])[1:])):
    print("String, {} is recursive".format(s))
else:
    print("String, {} is NOT recursive".format(s))
Output of the script:
[mac] kgowda#blr-mp6xx:~/Desktop/my_work/play$ python dup.py
Enter string:: abcabcabcabdac
String, abcabcabcabdac is NOT recursive
[mac] kgowda#blr-mp6xx:~/Desktop/my_work/play$ python dup.py
Enter string:: abcabcabc
String, abcabcabc is recursive
[mac] kgowda#blr-mp6xx:~/Desktop/my_work/play$ python dup.py
Enter string:: abcddabcddabcddddabc
String, abcddabcddabcddddabc is recursive
This can also be solved using a part of the Knuth–Morris–Pratt Algorithm.
The idea is to build a one-dimensional array with one entry per character of the word. For each index i we record the length of the longest proper prefix of the word that is also a suffix of the substring from 0 to i. The reason is that when a prefix and a suffix coincide, the search can resume from the character right after that prefix, and the array stores that position.
For s = "abcabcabc", the array will be
Index : 0 1 2 3 4 5 6 7 8
String: a b c a b c a b c
KMP   : 0 0 0 1 2 3 4 5 6
For Index = 2, no suffix of the string abc (i.e. up to Index = 2) is also a prefix of it, so KMP[2] = 0.
For Index = 4, the suffix ab (Index = 3, 4) is the same as the prefix ab (Index = 0, 1), so we set KMP[4] = 2, which is the index of the pattern from which we have to resume searching.
Thus KMP[i] holds the length of the longest prefix of s that matches a suffix of s[0..i], which is also the index from which to resume. This means that the substring s[0..i] repeats with a period of i + 1 - KMP[i] characters. Using this information we can find out whether all the substrings of that length are the same.
For Index = 8, we know KMP[8] = 6, which means there is a block (s[3] to s[5]) of length 9 - 6 = 3 which is equal to the suffix (s[6] to s[8]). If this is the only repetitive pattern, the whole string consists of repetitions of that block.
For a clearer explanation of this algorithm please check this video lecture.
This table can be built in linear time:
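vector<int> buildKMPtable(string word)
{
vector<int> kmp(word.size());
int j=0;
for(int i=1; i < word.size(); ++i)
{
// Fall back to shorter prefixes until the characters match or none is left
while(j > 0 && word[j] != word[i])
{
j = kmp[j-1];
}
if(word[j] == word[i])
{
++j;
}
kmp[i] = j; // length of the longest prefix that is also a suffix of word[0..i]
}
return kmp;
}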
bool repeatedSubstringPattern(string s) {
auto kmp = buildKMPtable(s);
if(kmp[s.size() -1] == 0) // Occurs when the string has no prefix with suffix ending at the last character of the string
{
return false;
}
int diff = s.size() - kmp[s.size() -1]; //Length of the repetitive pattern
if(s.size() % diff != 0) //Length of the repetitive pattern must divide the size of the string
{
return false;
}
// Check if that repetitive pattern is the only repetitive pattern.
string word = s.substr(0, diff);
int w_size = word.size();
for(int i=0; i < w_size; ++i)
{
int j = i;
while(j < s.size())
{
if(word[i] == s[j])
{
j += w_size;
}
else
{
return false;
}
}
}
return true;
}
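Since the question was asked for Java, here is a rough Java sketch of the same prefix-table idea. It is not a line-by-line port of the C++ above: the names KmpRepeat and buildTable are made up for this example, and it relies on the known property that a non-zero last table entry together with a period that divides the length is already enough, so the final character-by-character verification is left out.

```java
final class KmpRepeat {
    // table[i] = length of the longest proper prefix of word
    // that is also a suffix of word[0..i].
    static int[] buildTable(String word) {
        int[] table = new int[word.length()];
        int j = 0;
        for (int i = 1; i < word.length(); i++) {
            // Fall back to shorter prefixes until a match or nothing is left.
            while (j > 0 && word.charAt(j) != word.charAt(i)) {
                j = table[j - 1];
            }
            if (word.charAt(j) == word.charAt(i)) {
                j++;
            }
            table[i] = j;
        }
        return table;
    }

    static boolean repeatedSubstringPattern(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int longest = buildTable(s)[s.length() - 1];
        int period = s.length() - longest; // candidate block length
        return longest > 0 && s.length() % period == 0;
    }

    public static void main(String[] args) {
        System.out.println(repeatedSubstringPattern("abcabcabc"));      // prints "true"
        System.out.println(repeatedSubstringPattern("abcabcabcabdac")); // prints "false"
    }
}
```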
If you know the 'parts' in advance, then the answer could be Recursive regular expressions, it seems.
So for abcabcabc we need an expression like abc(?R)* where:
abc matches the literal characters
(?R) recurses the pattern
A * to match between zero and unlimited number of times
The third one is a little trickier. See this regex101 link but it looks like:
((abc)|(dd))(?R)*
where we have either 'abc' or 'dd' and there are any number of these.
Otherwise, I don't see how you could determine from just a string that it has some undefined recursive structure like this.
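Note that java.util.regex does not support the (?R) recursion construct, so in Java the same idea can be sketched with a repeated alternation instead. This assumes the parts are known in advance and are non-empty; KnownParts and builtFrom are hypothetical names used only for this example.

```java
import java.util.regex.Pattern;

final class KnownParts {
    // True when s is made up entirely of the given parts, in any order and number.
    // Each part is quoted so that regex metacharacters in it are taken literally.
    static boolean builtFrom(String s, String... parts) {
        StringBuilder alternation = new StringBuilder();
        for (String part : parts) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(part));
        }
        return Pattern.matches("(?:" + alternation + ")+", s);
    }

    public static void main(String[] args) {
        System.out.println(builtFrom("abcabcabc", "abc"));                  // prints "true"
        System.out.println(builtFrom("abcddabcddabcddddabc", "abc", "dd")); // prints "true"
        System.out.println(builtFrom("abcabcabcabdac", "abc"));             // prints "false"
    }
}
```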

Java regex - one liner for counting matches

Is there a one liner to replace the while loop?
String formatSpecifier = "%(\\d+\\$)?([-#+ 0,(\\<]*)?(\\d+)?(\\.\\d+)?([tT])?([a-zA-Z%])";
Pattern pattern = Pattern.compile(formatSpecifier);
Matcher matcher = pattern.matcher("hello %s my name is %s");
// CAN BELOW BE REPLACED WITH A ONE LINER?
int counter = 0;
while (matcher.find()) {
counter++;
}
Personally I don't see any reason to aim for a one-liner given the original code is already easy to understand. Anyway, here are several ways if you insist:
1. Make a helper method
make something like this
static int countMatches(Matcher matcher) {
int counter = 0;
while (matcher.find())
counter++;
return counter;
}
so your code becomes
int counter = countMatches(matcher);
2. Java 9
Matcher in Java 9 provides results(), which returns a Stream<MatchResult>. Since count() returns a long, your code becomes
long counter = matcher.results().count();
3. String Replace
Similar to what the other answer suggests.
Here I am replacing each match with the null character (which is almost never used in a normal string), and do the counting with split.
Your code becomes:
int counter = matcher.replaceAll("\0").split("\0", -1).length - 1;
Yes: replace any occurrence by a char that can be neither in the pattern nor in the string to match, then count the number of occurrences of this char.
Here I choose X, for the example to be simple. You should use a char not so often used (see UTF-8 special chars for instance).
final int counter = pattern.matcher("hello %s my name is %s").replaceAll("X").replaceAll("[^X]", "").length();
Value computed for counter is 2 with your example.
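To make the results() variant mentioned above concrete, here is a small self-contained sketch. It assumes Java 9 or later; the class name MatchCounter and the helper count are invented for this example.

```java
import java.util.regex.Pattern;

final class MatchCounter {
    // Counts non-overlapping matches of pattern in input (requires Java 9+).
    static long count(Pattern pattern, CharSequence input) {
        return pattern.matcher(input).results().count();
    }

    public static void main(String[] args) {
        String formatSpecifier =
                "%(\\d+\\$)?([-#+ 0,(\\<]*)?(\\d+)?(\\.\\d+)?([tT])?([a-zA-Z%])";
        Pattern pattern = Pattern.compile(formatSpecifier);
        System.out.println(count(pattern, "hello %s my name is %s")); // prints "2"
    }
}
```

Unlike the replacement-based tricks, this does not depend on picking a character that never occurs in the input.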

Getting match of Group with Asterisk?

How can I get the content for a group with an asterisk?
For example, I'd like to parse a comma-separated list, e.g. 1,2,3,4,5.
private static final String LIST_REGEX = "^(\\d+)(,\\d+)*$";
private static final Pattern LIST_PATTERN = Pattern.compile(LIST_REGEX);
public static void main(String[] args) {
final String list = "1,2,3,4,5";
final Matcher matcher = LIST_PATTERN.matcher(list);
System.out.println(matcher.matches());
for (int i = 0, n = matcher.groupCount(); i < n; i++) {
System.out.println(i + "\t" + matcher.group(i));
}
}
And the output is
true
0 1,2,3,4,5
1 1
How can I get every single entry, i. e. 1, 2, 3, ...?
I am searching for a common solution. This is only a demonstrative example.
Please imagine a more complicated regex like ^\\[(\\d+)(,\\d+)*\\]$ to match a list like [1,2,3,4,5]
You can use String.split().
for (String segment : "1,2,3,4,5".split(","))
System.out.println(segment);
Or you can repeatedly match with find(), capturing one entry at a time:
Pattern pattern = Pattern.compile("(\\d+),?");
for (Matcher m = pattern.matcher("1,2,3,4,5"); m.find();)
System.out.println(m.group(1));
For the second example you added, you can do a similar match.
for (String segment : "!!!!![1,2,3,4,5] //"
.replaceFirst("^\\D*(\\d+(?:,\\d+)*)\\D*$", "$1")
.split(","))
System.out.println(segment);
I made an online code demo. I hope this is what you wanted.
How can I get all the matches (zero, one or more) for an arbitrary group with an asterisk (xyz)*? [The group is repeated and I would like to get every repeated capture.]
No, you cannot. Regex Capture Groups and Back-References tells why:
The Returned Value for a Given Group is the Last One Captured
Since a capture group with a quantifier holds on to its number, what value does the engine return when you inspect the group? All engines return the last value captured. For instance, if you match the string A_B_C_D_ with ([A-Z]_)+, when you inspect the match, Group 1 will be D_. With the exception of the .NET engine, all intermediate values are lost. In essence, Group 1 gets overwritten each time its pattern is matched.
I assume you may be looking for something like the following, this will handle both of your examples.
private static final String LIST_REGEX = "^\\[?(\\d+(?:,\\d+)*)\\]?$";
private static final Pattern LIST_PATTERN = Pattern.compile(LIST_REGEX);
public static void main(String[] args) {
final String list = "[1,2,3,4,5]";
final Matcher matcher = LIST_PATTERN.matcher(list);
matcher.find();
int i = 0;
String[] vals = matcher.group(1).split(",");
System.out.println(matcher.matches());
System.out.println(i + "\t" + matcher.group(1));
for (String x : vals) {
i++;
System.out.println(i + "\t" + x);
}
}
Output
true
0 1,2,3,4,5
1 1
2 2
3 3
4 4
5 5
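Because Java keeps only the last capture of a repeated group, another way to sketch the approach above is to validate the whole list with one pattern and then walk the captured body with find(). The names RepeatedGroup, LIST, ITEM and items are made up for this example; the list pattern is the one from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepeatedGroup {
    private static final Pattern LIST = Pattern.compile("^\\[?(\\d+(?:,\\d+)*)\\]?$");
    private static final Pattern ITEM = Pattern.compile("\\d+");

    // Returns every entry of a valid list, or an empty list if the input is malformed.
    static List<String> items(String input) {
        List<String> entries = new ArrayList<>();
        Matcher list = LIST.matcher(input);
        if (!list.matches()) {
            return entries;
        }
        // Group 1 holds the whole comma-separated body; find() visits each entry in turn.
        Matcher item = ITEM.matcher(list.group(1));
        while (item.find()) {
            entries.add(item.group());
        }
        return entries;
    }

    public static void main(String[] args) {
        System.out.println(items("[1,2,3,4,5]")); // prints "[1, 2, 3, 4, 5]"
    }
}
```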

Removing duplicate same characters in a row

I am trying to create a method which will either remove all duplicates from a string or only keep the same 2 characters in a row based on a parameter.
For example:
helllllllo -> helo
or
helllllllo -> hello - This keeps double letters
Currently I remove duplicates by doing:
private String removeDuplicates(String word) {
StringBuffer buffer = new StringBuffer();
for (int i = 0; i < word.length(); i++) {
char letter = word.charAt(i);
if (buffer.length() == 0 || letter != buffer.charAt(buffer.length() - 1)) {
buffer.append(letter);
}
}
return buffer.toString();
}
If I want to keep double letters I was thinking of having a method like private String removeDuplicates(String word, boolean doubleLetter)
When doubleLetter is true it will return hello not helo
I'm not sure of the most efficient way to do this without duplicating a lot of code.
why not just use a regex?
public class RemoveDuplicates {
public static void main(String[] args) {
System.out.println(new RemoveDuplicates().result("hellllo", false)); //helo
System.out.println(new RemoveDuplicates().result("hellllo", true)); //hello
}
public String result(String input, boolean doubleLetter){
String pattern = null;
if(doubleLetter) pattern = "(.)(?=\\1{2})";
else pattern = "(.)(?=\\1)";
return input.replaceAll(pattern, "");
}
}
(.) --> matches any character and puts in group 1.
?= --> this is called a positive lookahead.
?=\\1 --> positive lookahead for the first group
So overall, this regex looks for any character that is followed (positive lookahead) by itself. For example aa or bb, etc. It is important to note that only the first character is part of the match actually, so in the word 'hello', only the first l is matched (the part (?=\1) is NOT PART of the match). So the first l is replaced by an empty String and we are left with helo, which does not match the regex
The second pattern is the same thing, but this time we look ahead for TWO occurrences of the first group, for example helllo. On the other hand 'hello' will not be matched.
Look here for a lot more: Regex
P.S. Feel free to accept the answer if it helped.
try
String s = "helllllllo";
System.out.println(s.replaceAll("(\\w)\\1+", "$1"));
output
helo
Taking this previous SO example as a starting point, I came up with this:
String str1= "Heelllllllllllooooooooooo";
String removedRepeated = str1.replaceAll("(\\w)\\1+", "$1");
System.out.println(removedRepeated);
String keepDouble = str1.replaceAll("(\\w)\\1{2,}", "$1");
System.out.println(keepDouble);
It yields:
Helo
Heelo
What it does:
(\\w)\\1+ will match any word character and place it in a regex capture group. This group is later accessed through the \\1+, meaning that it will match one or more repetitions of the previous character.
(\\w)\\1{2,} is the same as above, the only difference being that it only looks for characters which appear three or more times in a row. This leaves the double characters untouched.
EDIT:
Re-read the question and it seems that you want to replace multiple characters by doubles. To do that, simply use this line:
String keepDouble = str1.replaceAll("(\\w)\\1+", "$1$1");
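Putting the two replacements together gives the two-mode method the question asks for without duplicating code. This is only a sketch: it uses (.) instead of (\\w) so that any repeated character is collapsed, which is an assumption about the intended behaviour, and the class name Dedup is made up.

```java
final class Dedup {
    // Collapses every run of a repeated character to one copy,
    // or to two copies when doubleLetter is true.
    static String removeDuplicates(String word, boolean doubleLetter) {
        return word.replaceAll("(.)\\1+", doubleLetter ? "$1$1" : "$1");
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicates("helllllllo", false)); // prints "helo"
        System.out.println(removeDuplicates("helllllllo", true));  // prints "hello"
    }
}
```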
Try this; it should be the most efficient way [edited after comment]:
public static String removeDuplicates(String str) {
int checker = 0;
StringBuffer buffer = new StringBuffer();
for (int i = 0; i < str.length(); ++i) {
int val = str.charAt(i) - 'a';
if ((checker & (1 << val)) == 0)
buffer.append(str.charAt(i));
checker |= (1 << val);
}
return buffer.toString();
}
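I am using bits to identify uniqueness. Note that this assumes lowercase letters a-z only, and it drops every later occurrence of a character, not just consecutive repeats.
EDIT:
The whole logic is that once a character has been seen, its corresponding bit is set; the next time that character comes up it is not added to the StringBuffer because the corresponding bit is already set.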

Splitting up input using regular expressions in Java

I am making a program that lets a user input a chemical formula, for example C9H11N02. When they enter that, I want to split it up into pieces such as C9, H11, N, 02. Once I have the pieces I want to make changes so that it becomes C10H12N203, and then put it back together. This is what I have done so far. Using the regular expression below I can extract the integer values, but how would I go about getting C10, H11, etc.?
System.out.println("Enter Data");
Scanner k = new Scanner( System.in );
String input = k.nextLine();
String reg = "\\s\\s\\s";
String [] data;
data = input.split( reg );
int m = Integer.parseInt( data[0] );
int n = Integer.parseInt( data[1] );
It can be done using look arounds:
String[] parts = input.split("(?<=.)(?=[A-Z])");
Look arounds are zero-width, non-consuming assertions.
This regex splits the input where the two look arounds match:
(?<=.) means "there is a preceding character" (ie not at the start of input)
(?=[A-Z]) means "the next character is a capital letter" (All elements start with A-Z)
Here's a test, including a double-character symbol for some edge cases:
public static void main(String[] args) {
String input = "C9KrBr2H11NO2";
String[] parts = input.split("(?<=.)(?=[A-Z])");
System.out.println(Arrays.toString(parts));
}
Output:
[C9, Kr, Br2, H11, N, O2]
If you then wanted to split up the individual components, use a nested call to split():
public static void main(String[] args) {
String input = "C9KrBr2H11NO2";
for (String component : input.split("(?<=.)(?=[A-Z])")) {
// split on non-digit/digit boundary
String[] symbolAndNumber = component.split("(?<!\\d)(?=\\d)");
String element = symbolAndNumber[0];
// elements without numbers won't be split
String count = symbolAndNumber.length == 1 ? "1" : symbolAndNumber[1];
System.out.println(element + " x " + count);
}
}
Output:
C x 9
Kr x 1
Br x 2
H x 11
N x 1
O x 2
Did you accidentally put zeroes into some of those formula where the letter "O" (oxygen) was supposed to be? If so:
"C10H12N2O3".split("(?<=[0-9A-Za-z])(?=[A-Z])");
[C10, H12, N2, O3]
"CH2BrCl".split("(?<=[0-9A-Za-z])(?=[A-Z])");
[C, H2, Br, Cl]
I believe the following code should allow you to extract the various elements and their associated count. Of course, brackets make things more complicated, but you didn't ask about them!
Pattern pattern = Pattern.compile("([A-Z][a-z]*)([0-9]*)");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
String element = matcher.group(1);
int count = 1;
// group(2) is empty when the element has no explicit count
if (!matcher.group(2).isEmpty()) {
count = Integer.parseInt(matcher.group(2));
}
// Do stuff with this component
}
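Since the goal in the question is to change the counts and then put the formula back together, the matching loop above could be extended roughly as follows. This is a sketch under a few assumptions: the formula has no brackets, the letter O is used for oxygen (not the digit 0), and repeated elements are merged into one entry. The names Formula, parse and format are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Formula {
    private static final Pattern COMPONENT = Pattern.compile("([A-Z][a-z]*)(\\d*)");

    // Parses e.g. "C9H11NO2" into {C=9, H=11, N=1, O=2}, keeping the input order.
    static Map<String, Integer> parse(String formula) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = COMPONENT.matcher(formula);
        while (m.find()) {
            // A missing number means a count of one.
            int count = m.group(2).isEmpty() ? 1 : Integer.parseInt(m.group(2));
            counts.merge(m.group(1), count, Integer::sum);
        }
        return counts;
    }

    // Rebuilds the formula string, omitting counts of one.
    static String format(Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            sb.append(e.getKey());
            if (e.getValue() != 1) {
                sb.append(e.getValue());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = parse("C9H11NO2");
        counts.replaceAll((element, count) -> count + 1); // bump every count by one
        System.out.println(format(counts)); // prints "C10H12N2O3"
    }
}
```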
