Not getting desired results with multiple regex matching in same string


I have an unusual problem where I have to apply a regex to an input string in groups of three characters. For example, if my input is ABCDEFGHI, a pattern search for BCD should return false, since I treat the input as ABC+DEF+GHI and compare my regex pattern against these triplets only.
Similarly, the regex pattern DEF will return true since it matches one of the triplets. Given this, assume that my input is QWEABCPOIUYTREWXYZASDFGHJKLABCMNBVCXZASXYZFGH and I want all output strings that start with the triplet ABC and end with the triplet XYZ. So, for the input above, my output should be two strings: ABCPOIUYTREWXYZ and ABCMNBVCXZASXYZ.
Also, I have to store these strings in an ArrayList. Below is my function:
public static void newFindMatches (String text, String startRegex, String endRegex, List<String> output) {
    int startPos = 0;
    int endPos = 0;
    int i = 0;
    // Making sure that substrings are always valid
    while ( i < text.length()-2) {
        // Substring for comparing triplets
        String subText = text.substring(i, i+3);
        Pattern startP = Pattern.compile(startRegex);
        Pattern endP = Pattern.compile(endRegex);
        Matcher startM = startP.matcher(subText);
        if (startM.find()) {
            // If a match is found, set the start position
            startPos = i;
            for (int j = i; j < text.length()-2; j+=3) {
                String subText2 = text.substring(j, j+3);
                Matcher endM = endP.matcher(subText2);
                if (endM.find()) {
                    // If match for end pattern is found, set the end position
                    endPos = j+3;
                    // Add the string between start and end positions to ArrayList
                    output.add(text.substring(startPos, endPos));
                    i = j;
                }
            }
        }
        i = i+3;
    }
}
Upon running this function in main as follows:
String input = "QWEABCPOIUYTREWXYZASDFGHJKLABCMNBVCXZASXYZFGH";
String start = "ABC";
String end = "XYZ";
List<String> results = new ArrayList<String> ();
newFindMatches(input, start, end, results);
for (int x = 0; x < results.size(); x++) {
System.out.println("Output String number "+(x+1)+" is: "+results.get(x));
}
I get the following output:
Output String number 1 is: ABCPOIUYTREWXYZ
Output String number 2 is: ABCPOIUYTREWXYZASDFGHJKLABCMNBVCXZASXYZ
Notice that the first string is correct. For the second string, however, the program is again reading from the first start match. Instead, I want the program to continue after the last end pattern, i.e. skip the first result and the unwanted characters such as ASDFGHJKL, and print only the second string: ABCMNBVCXZASXYZ
Thanks for your responses

The problem here is that when you find your end match (the if statement within the for loop), you don't stop the for loop. It keeps looking for more end matches until it hits the loop condition j < text.length()-2, and every later end match is added with the same startPos. When you find your match and process it, you should end the loop using "break;". Place "break;" after the i=j line.
Note that technically the second answer your current program gave you is also valid: it is a substring that begins with ABC and ends with XYZ. You might want to rethink what the correct output for your program is. If you do want such results, leave out the break and also remove the i=j assignment, so that i is only ever advanced by the i=i+3 line and every start triplet is checked against every later end triplet.
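As a rough sketch of the first suggestion (stop at the first end match), the function could look like the version below. The class name TripletMatcher, the returned list, and starting the inner scan one triplet after the start triplet are assumptions made for illustration; they are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class TripletMatcher {
    // Returns every substring that runs from a start triplet to the nearest
    // following end triplet, looking only at aligned groups of three characters.
    static List<String> findMatches(String text, String startRegex, String endRegex) {
        // Compile once instead of on every loop iteration
        Pattern startP = Pattern.compile(startRegex);
        Pattern endP = Pattern.compile(endRegex);
        List<String> output = new ArrayList<>();
        int i = 0;
        while (i + 3 <= text.length()) {
            if (startP.matcher(text.substring(i, i + 3)).find()) {
                // Assumption: the end triplet must come after the start triplet
                for (int j = i + 3; j + 3 <= text.length(); j += 3) {
                    if (endP.matcher(text.substring(j, j + 3)).find()) {
                        output.add(text.substring(i, j + 3));
                        i = j;   // resume the outer scan after this end triplet
                        break;   // stop at the first end match
                    }
                }
            }
            i += 3;
        }
        return output;
    }
}
```

Calling TripletMatcher.findMatches(input, "ABC", "XYZ") with the input from the question should give the two strings the question asks for.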

Related

How to find the first occurrence of whitespace (tab, space, etc.) in Java?

So I have something like this
System.out.println(some_string.indexOf("\\s+"));
this gives me -1
but when I do with specific value like \t or space
System.out.println(some_string.indexOf("\t"));
I get the correct index.
Is there any way I can get the index of the first occurrence of whitespace without using split? My string is very long.
PS - if it helps, here is my requirement: I want the first number in the string, which is separated from the rest of the string by a tab or space, and I am trying to avoid split("\\s+")[0]. The string starts with that number and has a space or tab right after the number ends.

The point is: indexOf() takes a char, or a string; but not a regular expression.
Thus:
String input = "a\tb";
System.out.println(input);
System.out.println(input.indexOf('\t'));
prints 1 because there is a TAB char at index 1.
System.out.println(input.indexOf("\\s+"));
prints -1 because there is no substring \\s+ in your input value.
In other words: if you want to use the power of regular expressions, you can't use indexOf(). You could look at String.matches(), for example. But of course, that gives a boolean result, not an index.
If you intend to find the index of the first whitespace, you have to iterate the chars manually, like:
for (int index = 0; index < input.length(); index++) {
if (Character.isWhitespace(input.charAt(index))) {
return index;
}
}
return -1;

Something of this sort might help? Though there are better ways to do this.
class Sample{
    public static void main(String[] args) {
        String s = "1110 001";
        int index = -1;
        for(int i = 0; i < s.length(); i++ ){
            if(Character.isWhitespace(s.charAt(i))){
                index = i;
                break;
            }
        }
        System.out.println("Required Index : " + index);
    }
}

Well, to find with a regular expression, you'll need to use the regular expression classes.
Pattern pat = Pattern.compile("\\s");
Matcher m = pat.matcher(s);
if ( m.find() ) {
System.out.println( "Found \\s at " + m.start());
}
The find method of the Matcher class locates the pattern in the string for which the matcher was created. If it succeeds, the start() method gives you the index of the first character of the match.
Note that you can compile the pattern only once (even create a constant). You just have to create a Matcher for every string.
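To make the "compile once" remark concrete, here is a minimal sketch that keeps the pattern in a constant and also covers the requirement from the question (the leading number before the first tab or space). The class and method names are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstWhitespace {
    // Compiled once and reused for every input string
    private static final Pattern WHITESPACE = Pattern.compile("\\s");

    // Index of the first whitespace character, or -1 if there is none
    static int indexOfWhitespace(String s) {
        Matcher m = WHITESPACE.matcher(s);
        return m.find() ? m.start() : -1;
    }

    // Everything before the first whitespace; the whole string if there is none
    static String leadingToken(String s) {
        int end = indexOfWhitespace(s);
        return end < 0 ? s : s.substring(0, end);
    }
}
```

Because find() stops at the first match, the rest of a very long string is never scanned, which is the main advantage over split("\\s+")[0].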

Simplify & condense multiple editorial operations on an array. Java

I have some raw output that I want to clean up and make presentable, but right now I go about it in a very ugly and cumbersome way. I wonder if anyone knows a clean and elegant way to perform the same operation.
int size = charOutput.size();
for (int i = size - 1; i >= 1; i--)
{
    if(charOutput.get(i).compareTo(charOutput.get(i - 1)) == 0)
    {
        charOutput.remove(i);
    }
}
for(int x = 0; x < charOutput.size(); x++)
{
    if(charOutput.get(x) == '?')
    {
        charOutput.remove(x);
    }
}
String firstOne = Arrays.toString(charOutput.toArray());
String secondOne = firstOne.replaceAll(",","");
String thirdOne = secondOne.substring(1, secondOne.length() - 1);
String output = thirdOne.replaceAll(" ","");
return output;

ZouZou has the right code for fixing the final few calls in your code. I have some suggestions for the for loops. I hope I got them right...
These work after you get the String represented by charOutput, using a method such as the one suggested by ZouZou.
Your first block appears to remove all repeated letters. You can use a regular expression for that:
Pattern removeRepeats = Pattern.compile("(.)\\1{1,}");
// "(.)" matches any single character and captures it as group 1
// "\\1" in Java source becomes the regex "\1", a back-reference to group 1, i.e. the character that "(.)" matched
// "{1,}" means "one or more" (the same as "+")
// So the overall effect is "a character followed by one or more copies of itself"
To use:
removeRepeats.matcher(s).replaceAll("$1");
// This creates a Matcher that applies removeRepeats to the contents of s and replaces every part of s that matches with "$1", a reference to the first captured group (i.e. "(.)", the first character of the run)
To remove the question mark, just do
Pattern removeQuestionMarks = Pattern.compile("\\?");
// Because "?" is a special symbol in regex, you have to escape it with a backslash
// But since backslashes are also a special symbol, you have to escape the backslash too.
And then to use, do the same thing as was done above except with replaceAll("");
And you're done!
If you really wanted to, you can combine a lot of regex into two super-regex expressions (and one normal regex expression):
Pattern p0 = Pattern.compile("(\\[|\\]|\\,| )"); // removes brackets, commas, and spaces
Pattern p1 = Pattern.compile("(.)\\1{1,}"); // Removes duplicate characters
Pattern p2 = Pattern.compile("\\?");
String removeArrayCharacters = p0.matcher(charOutput.toString()).replaceAll("");
String removeDuplicates = p1.matcher(removeArrayCharacters).replaceAll("$1");
return p2.matcher(removeDuplicates).replaceAll("");
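Put together, the regex approach might look like the sketch below. It assumes charOutput is a List<Character> (the question does not show its declaration) and builds the string with a StringBuilder instead of stripping brackets and commas from toString(); the class and method names are invented for the example.

```java
import java.util.List;
import java.util.regex.Pattern;

final class RegexCleanup {
    // A character followed by one or more copies of itself
    private static final Pattern REPEATS = Pattern.compile("(.)\\1+");
    private static final Pattern QUESTION_MARKS = Pattern.compile("\\?");

    static String clean(List<Character> charOutput) {
        // Join the characters directly, so no brackets or commas need removing
        StringBuilder sb = new StringBuilder(charOutput.size());
        for (char c : charOutput) {
            sb.append(c);
        }
        // Same order as the question: collapse repeats first, then drop '?'
        String collapsed = REPEATS.matcher(sb).replaceAll("$1");
        return QUESTION_MARKS.matcher(collapsed).replaceAll("");
    }
}
```

Note that "." does not match line terminators by default, so runs of newline characters would be left alone unless the pattern is compiled with Pattern.DOTALL.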

Use a StringBuilder and append each character you want, at the end just return myBuilder.toString();
Instead of this:
String firstOne = Arrays.toString(charOutput.toArray());
String secondOne = firstOne.replaceAll(",","");
String thirdOne = secondOne.substring(1, secondOne.length() - 1);
String output = thirdOne.replaceAll(" ","");
return output;
Simply do:
StringBuilder sb = new StringBuilder();
for(Character c : charOutput){
sb.append(c);
}
return sb.toString();
Note that you are doing a lot of unnecessary work (by iterating through the list and removing some elements). What you can actually do is iterate just once, and if a character meets your requirements (it differs from the character before it and is not a question mark), append it to the StringBuilder directly.
This task could also be a job for a regular expression.
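A possible single-pass version of that idea is sketched here, again assuming charOutput is a List<Character> with no null elements; the names are hypothetical. It compares each element with the previous list element (not the previous appended one), so it gives the same result as the two loops in the question.

```java
import java.util.List;

final class SinglePassCleanup {
    static String clean(List<Character> charOutput) {
        StringBuilder sb = new StringBuilder(charOutput.size());
        Character previous = null; // previous list element, null before the first one
        for (Character c : charOutput) {
            boolean repeat = c.equals(previous);
            previous = c;
            // Keep the first character of each run, unless it is a question mark
            if (!repeat && c != '?') {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

This leaves the original list untouched, which also avoids removing elements while iterating over it.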

If you don't want to use Regex try this version to remove consecutive characters and '?':
int size = charOutput.size();
if (size == 0) return "";
StringBuilder sb = new StringBuilder();
for (int i = 0; i < size - 1; i++) {
    Character temp = (Character)charOutput.get(i);
    // keep only the last character of each run, and never a '?'
    if (!temp.equals(charOutput.get(i+1)) && !temp.equals('?'))
        sb.append(temp);
}
// the last element always ends a run, so keep it unless it is a '?'
// (comparing it with the element before it would drop a trailing run such as "aa" entirely)
if (!charOutput.get(size-1).equals('?'))
    sb.append(charOutput.get(size-1));
return sb.toString();

Finding the longest substring between a "start" string and one of 3 possible "end" strings

So my question is substring-related.
How do you find the longest possible substring between a starting string and one of three ending strings? I also need to find the index of the original string that the largest substring starts at.
So:
Start string:
"ATG"
3 possible end strings:
"TAG"
"TAA"
"TGA"
An example original string might be:
"SDAFKJDAFKATGDFSDFAKJDNKSJFNSDTGASDFKJSDNKFJSNDJFATGDSDFKJNSDFTAGSDFSDATGFF"
So the result of that should give me:
- Longest substring length: 23 (from the substring ATGDFSDFAKJDNKSJFNSDTGA)
- Index of longest substring: 10
I cannot use Regex.
Thanks for any help!

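This is arguably the easiest way, and it's just one line:
String target = str.replaceAll(".*?ATG(.*?)(TAG|TAA|TGA).*", "$1");
The quantifiers are reluctant (.*? rather than .*) so that the match starts at the first ATG and stops at the nearest end string; a greedy leading .* would skip ahead to a later ATG. This gives the text between the first start string and its end string, which is not necessarily the longest candidate in the input.
To find the index of that text:
int index = str.indexOf("ATG") + 3;
Note: I have interpreted your remark "I cannot use Regex" to mean "I am unskilled at regex", because if it's a Java question, regex is available.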

Well, this looks like a fun one.
It seems the most straightforward way to do this would be to build your own mini finite state machine. You would have to parse each character in the string and keep track of all possible character sequences that would terminate the sequence.
If you hit a 'T', you need to jump ahead and look at the next character. If it's an 'A' or a 'G' you need to jump ahead again, otherwise, add those tokens to your string. Continue the pattern until you get to the end of the original string, or match one of your terminal patterns.
So, maybe something that looks like this (simplified example):
String longestSequence(String original) {
StringBuilder sb = new StringBuilder();
char[] tokens = original.toCharArray();
for (int i = 0; i < tokens.length; ++i) {
// read each token, and compare / look ahead to see if you should keep going or terminate.
}
return sb.toString();
}

Match your string against this regex:
ATG[A-Z]+?(TAG|TAA|TGA)
The +? is reluctant, so each match stops at the nearest end string instead of running on to the last one in the input. If there are multiple matches, iterate over them and keep the longest one.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
Pattern pattern = Pattern.compile("ATG[A-Z]+?(TAG|TAA|TGA)");
Matcher matcher = pattern.matcher( yourInputStringHere );
while (matcher.find()) {
    System.out.println("Found the text \"" + matcher.group()
        + "\" starting at " + matcher.start()
        + " and ending at index " + matcher.end());
}
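The "keep the longest one" step is not shown in the loop, so here is one way it might be done. This sketch uses a reluctant variant of the pattern from this answer, and the class name and the {start, length} return convention are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LongestRegexMatch {
    // Reluctant +? so that each match ends at the nearest end string
    private static final Pattern GENE = Pattern.compile("ATG[A-Z]+?(TAG|TAA|TGA)");

    // Returns {start index, length} of the longest match, or {-1, 0} if none
    static int[] longest(String input) {
        int bestStart = -1;
        int bestLength = 0;
        Matcher m = GENE.matcher(input);
        while (m.find()) {
            int length = m.end() - m.start();
            if (length > bestLength) {
                bestStart = m.start();
                bestLength = length;
            }
        }
        return new int[] {bestStart, bestLength};
    }
}
```

For the example string in the question this should report index 10 and length 23. Note that find() resumes after the end of the previous match, so a start string that lies inside an earlier match is not considered.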

There are already some beautiful and elegant solutions to your problem (Bohemian and inquisitive). If you still, as originally stated, can't use regex, here's an alternative. This code is not especially elegant, and as pointed out, there are better ways to do it, but it should at least clearly show you the logic behind the solution to your problem.
How do you find the longest possible substring between a starting string
and one of three ending strings?
First, find the index of the starting string, then find the index of each ending string, take the substring for each ending, and measure its length. Remember that indexOf returns -1 if a string is not found.
String originalString = "SDAFKJDAFKATGDFSDFAKJDNKSJFNSDTGASDFKJSDNKFJSNDJFATGDSDFKJNSDFTAGSDFSDATGFF";
String STARTING_STRING = "ATG";
String END1 = "TAG";
String END2 = "TAA";
String END3 = "TGA";
//let's find the index of STARTING_STRING
int posOfStartingString = originalString.indexOf(STARTING_STRING);
//if found
if (posOfStartingString != -1) {
int tagPos[] = new int[3];
//let's find the index of each ending strings in the original string
tagPos[0] = originalString.indexOf(END1, posOfStartingString+3);
tagPos[1] = originalString.indexOf(END2, posOfStartingString+3);
tagPos[2] = originalString.indexOf(END3, posOfStartingString+3);
int lengths[] = new int[3];
//we can now use the following methods:
//public String substring(int beginIndex, int endIndex)
//where beginIndex is our posOfStartingString
//and endIndex is position of each ending string (if found)
//
//and finally, String.length() to get the length of each substring
if (tagPos[0] != -1) {
lengths[0] = originalString.substring(posOfStartingString, tagPos[0]).length();
}
if (tagPos[1] != -1) {
lengths[1] = originalString.substring(posOfStartingString, tagPos[1]).length();
}
if (tagPos[2] != -1) {
lengths[2] = originalString.substring(posOfStartingString, tagPos[2]).length();
}
} else {
//no starting string in original string
}
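The lengths[] array now contains the lengths of the strings that start with STARTING_STRING and run up to each of the 3 respective endings. Each length excludes the ending itself, because substring stops at the index where the ending begins. Then just find which one is the longest and you will have your answer.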
I also need to find the index of the original string that the largest substring starts at.
This will be the index of where starting string starts, in this case 10.
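For completeness, a regex-free sketch that checks every occurrence of the start string is given below. It assumes that "longest" means the longest span from any start string to its nearest following end string, with the end string included, which is what the expected result in the question (length 23 at index 10) suggests. The class and method names are invented.

```java
final class LongestBetween {
    // Returns {start index, length} of the longest span, or {-1, 0} if none
    static int[] longest(String text, String start, String... ends) {
        int bestStart = -1;
        int bestLength = 0;
        // Visit every occurrence of the start string, not only the first one
        for (int from = text.indexOf(start); from != -1; from = text.indexOf(start, from + 1)) {
            int nearestEnd = -1; // index just past the nearest end string
            for (String end : ends) {
                int pos = text.indexOf(end, from + start.length());
                if (pos != -1 && (nearestEnd == -1 || pos + end.length() < nearestEnd)) {
                    nearestEnd = pos + end.length();
                }
            }
            if (nearestEnd != -1 && nearestEnd - from > bestLength) {
                bestStart = from;
                bestLength = nearestEnd - from;
            }
        }
        return new int[] {bestStart, bestLength};
    }
}
```

A call such as LongestBetween.longest(originalString, "ATG", "TAG", "TAA", "TGA") should return index 10 and length 23 for the example input.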

Removing duplicate same characters in a row

I am trying to create a method which will either remove all duplicates from a string or only keep the same 2 characters in a row based on a parameter.
For example:
helllllllo -> helo
or
helllllllo -> hello - This keeps double letters
Currently I remove duplicates by doing:
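private String removeDuplicates(String word) {
    StringBuffer buffer = new StringBuffer();
    for (int i = 0; i < word.length(); i++) {
        char letter = word.charAt(i);
        // append when the buffer is empty or the letter differs from the last one appended
        // (with && here, charAt(-1) would throw on the first letter)
        if (buffer.length() == 0 || letter != buffer.charAt(buffer.length() - 1)) {
            buffer.append(letter);
        }
    }
    return buffer.toString();
}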
If I want to keep double letters I was thinking of having a method like private String removeDuplicates(String word, boolean doubleLetter)
When doubleLetter is true it will return hello not helo
I'm not sure of the most efficient way to do this without duplicating a lot of code.

why not just use a regex?
public class RemoveDuplicates {
    public static void main(String[] args) {
        System.out.println(new RemoveDuplicates().result("hellllo", false)); //helo
        System.out.println(new RemoveDuplicates().result("hellllo", true)); //hello
    }
    public String result(String input, boolean doubleLetter){
        String pattern = null;
        if(doubleLetter) pattern = "(.)(?=\\1{2})";
        else pattern = "(.)(?=\\1)";
        return input.replaceAll(pattern, "");
    }
}
(.) --> matches any character and puts it in group 1.
?= --> this is called a positive lookahead.
?=\\1 --> positive lookahead for the first group
So overall, this regex looks for any character that is followed (positive lookahead) by itself, for example aa or bb. It is important to note that only the first character is actually part of the match, so in the word 'hello', only the first l is matched (the part (?=\1) is NOT part of the match). So the first l is replaced by an empty String and we are left with helo, which no longer matches the regex.
The second pattern is the same thing, but this time we look ahead for TWO occurrences of the first group, as in helllo. On the other hand, 'hello' will not be matched.
Look here for a lot more: Regex
P.S. Feel free to accept the answer if it helped.

try
String s = "helllllllo";
System.out.println(s.replaceAll("(\\w)\\1+", "$1"));
output
helo

Taking this previous SO example as a starting point, I came up with this:
String str1= "Heelllllllllllooooooooooo";
String removedRepeated = str1.replaceAll("(\\w)\\1+", "$1");
System.out.println(removedRepeated);
String keepDouble = str1.replaceAll("(\\w)\\1{2,}", "$1");
System.out.println(keepDouble);
It yields:
Helo
Heelo
What it does:
(\\w)\\1+ will match any word character (letter, digit or underscore) and place it in a regex capture group. This group is then referenced through \\1+, meaning that it will match one or more repetitions of the captured character.
(\\w)\\1{2,} is the same as above, the only difference being that it only matches characters which are followed by at least two more copies of themselves, i.e. runs of three or more. This leaves the double characters untouched.
EDIT:
Re-read the question and it seems that you want to replace multiple characters by doubles. To do that, simply use this line:
String keepDouble = str1.replaceAll("(\\w)\\1+", "$1$1");

Try this, this will be most efficient way[Edited after comment]:
public static String removeDuplicates(String str) {
    int checker = 0;
    StringBuffer buffer = new StringBuffer();
    for (int i = 0; i < str.length(); ++i) {
        int val = str.charAt(i) - 'a';
        if ((checker & (1 << val)) == 0)
            buffer.append(str.charAt(i));
        checker |= (1 << val);
    }
    return buffer.toString();
}
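I am using the bits of an int to track which characters have already been seen.
EDIT:
The whole logic is that once a character has been processed, its corresponding bit is set; the next time that character comes up, it is not added to the StringBuffer because the corresponding bit is already set. Note that this assumes the input contains only lowercase a-z (one bit per letter in a 32-bit int), and that it drops every later occurrence of a character, not only consecutive repeats: helllllllo becomes helo, but banana becomes ban.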
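If only consecutive repeats should be removed and regex is not wanted, a single pass that caps the length of each run is one option. This is a sketch; the helper name limitRuns and its maxRun parameter are hypothetical and not from any of the answers.

```java
final class RunLimiter {
    // Keeps at most maxRun consecutive copies of each character
    static String limitRuns(String word, int maxRun) {
        StringBuilder sb = new StringBuilder(word.length());
        int run = 0; // length of the current run of identical characters
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            run = (i > 0 && c == word.charAt(i - 1)) ? run + 1 : 1;
            if (run <= maxRun) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Same signature as proposed in the question
    static String removeDuplicates(String word, boolean doubleLetter) {
        return limitRuns(word, doubleLetter ? 2 : 1);
    }
}
```

Both modes share one loop, and non-adjacent repeats such as the letters in banana are left alone.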

Java String Break by space after position

I have a lengthy string and want to break it up into a number of sub-strings so I can display it in a menu as a paragraph rather than a single long line. But I don't want to break it up in the middle of a word (so a break every n characters won't work).
So I want to break the string up by the first occurrence of any of the characters in a String after a certain point (in my case, the characters would be a space and a semi-colon, but they could be anything).
Something like:
String result[] = breakString(baseString, // String
lineLength, // int
breakChars) // String

Consider splitting by the break chars first and then summing the lengths of the segments that result from that split until you reach your line length.

Here is one way. I took "by the first occurrence of any of the characters in a String after a certain point" to mean that the next instance of breakChars after a certain lineLength should be the end of a line. So, breakString("aaabc", 2, "b") would return {"aaab", "c"}.
// needs java.util.LinkedList, java.util.List, java.util.regex.Matcher and java.util.regex.Pattern
static String[] breakString(String baseString, int lineLength, String breakChars) {
    // find `lineLength` or more characters of the String, until the `breakChars` string
    Pattern p = Pattern.compile(".{" + lineLength + ",}?" + Pattern.quote(breakChars));
    Matcher m = p.matcher(baseString);
    List<String> list = new LinkedList<>();
    int index = 0;
    while (m.find(index)) {
        String s = m.group();
        list.add(s);
        // find another match starting at the end of the last one
        index = m.end();
    }
    // add whatever is left over (comparing with length() - 1 would drop a one-character remainder such as "c")
    if (index < baseString.length()) {
        list.add(baseString.substring(index));
    }
    return list.toArray(new String[list.size()]);
}
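The code above treats breakChars as one literal string. If, as the question describes, any single character of breakChars should count as a break point, a plain scan might look like the sketch below. The class name is invented, and it assumes lineLength is not negative and that the break character stays at the end of its line.

```java
import java.util.ArrayList;
import java.util.List;

final class LineBreaker {
    static String[] breakString(String baseString, int lineLength, String breakChars) {
        List<String> lines = new ArrayList<>();
        int start = 0;
        while (start < baseString.length()) {
            int cut = -1;
            // Look for the first break character at or after start + lineLength
            for (int i = start + lineLength; i < baseString.length(); i++) {
                if (breakChars.indexOf(baseString.charAt(i)) >= 0) {
                    cut = i + 1; // keep the break character on this line
                    break;
                }
            }
            if (cut == -1) {
                cut = baseString.length(); // no break character left: take the rest
            }
            lines.add(baseString.substring(start, cut));
            start = cut;
        }
        return lines.toArray(new String[0]);
    }
}
```

With breakChars set to " ;", both the space and the semi-colon from the question act as break points.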
