Java- How to find non-alphabetical letters in a string? (Quick way)

In Java, given a string like "abc#df", where '#' could be ANY non-letter character such as '%', '^' or '&', what would be the most efficient way to find the index of that character? I know that a for loop would be fairly quick (depending on the string length), but are there any quicker methods? Ideally a method that finds the indexes of all non-alphabetical characters, or the one closest to a given index (like indexOf(string, startingIdx)).
Thanks!

Use a for loop: the Character class lets you test whether each character is a letter (or some other type). See: https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isAlphabetic(int)
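A minimal sketch of the loop approach, in the spirit of indexOf(string, startingIdx) from the question. The class and method names are made up for illustration, and Character.isLetter is assumed to be the right definition of "letter" here (it also accepts non-ASCII letters):

```java
class NonLetterFinder {
    /**
     * Returns the index of the first non-letter character at or after
     * fromIndex, or -1 if every remaining character is a letter.
     * (Hypothetical helper, not part of the JDK.)
     */
    static int indexOfNonLetter(String s, int fromIndex) {
        for (int i = Math.max(fromIndex, 0); i < s.length(); i++) {
            if (!Character.isLetter(s.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOfNonLetter("abc#df", 0)); // 3
        System.out.println(indexOfNonLetter("abc#df", 4)); // -1
    }
}
```

To collect every index, call it again with the previous result plus one until it returns -1.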

You should probably use a regular expression:
Pattern patt = Pattern.compile("[^A-Za-z]");
Matcher mat = patt.matcher("avc#dgh");
boolean found = mat.find();
System.out.println(found ? mat.start() : -1);
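If you need all the indexes, or the first one after a given position, the same pattern can be reused. This is only a sketch: the class and method names are invented for the example, and the pattern treats only ASCII A-Z and a-z as letters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonLetterIndexes {
    // Compiled once and reused; matches any single character outside A-Z / a-z.
    private static final Pattern NON_LETTER = Pattern.compile("[^A-Za-z]");

    /** Collects the index of every non-letter character in s. */
    static List<Integer> allIndexes(String s) {
        List<Integer> indexes = new ArrayList<>();
        Matcher m = NON_LETTER.matcher(s);
        while (m.find()) {
            indexes.add(m.start());
        }
        return indexes;
    }

    /** Index of the first non-letter at or after fromIndex, or -1 if none. */
    static int indexFrom(String s, int fromIndex) {
        Matcher m = NON_LETTER.matcher(s);
        // find(int) resets the matcher and starts searching at fromIndex;
        // it throws IndexOutOfBoundsException if fromIndex is out of range.
        return m.find(fromIndex) ? m.start() : -1;
    }
}
```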

You could use a regex to split the string on anything that is not alphabetic:
String str = "abc#df";
String[] split = str.split("[^A-Za-z]");
Then you can use the lengths of the strings in that array to find the indexes of the non-alphabetic chars:
int firstIndex = split[0].length();
And so on, adding 1 each time to step over the previous delimiter (this only points at a non-alphabetic char if another delimiter actually follows split[1]):
int secondIndex = firstIndex + 1 + split[1].length();

Related

why split() produces extra , after sets limit -1

I want to split the area code and the rest of the number out of a telephone number, without the brackets, so I did this:
String pattern = "[\\(?=\\)]";
String b = "(079)25894029".trim();
String c[] = b.split(pattern,-1);
for (int a = 0; a < c.length; a++)
System.out.println("c[" + a + "]::->" + c[a] + "\nLength::->"+ c[a].length());
Output:
c[0]::-> Length::->0
c[1]::->079 Length::->3
c[2]::->25894029 Length::->8
Expected Output:
c[0]::->079 Length::->3
c[1]::->25894029 Length::->8
So my question is: why does split() produce an extra blank at the start, e.g.
[, 079, 25894029]? Is this its normal behavior, or did I do something wrong here?
How can I get my expected outcome?

First, you have unnecessary escaping inside your character class. Your regex is the same as:
String pattern = "[(?=)]";
Now, you are getting an empty result because ( is the very first character in the string, and a split at position 0 does produce an empty leading string.
To avoid that result, use this code:
String str = "(079)25894029";
String[] toks = (Character.isDigit(str.charAt(0)) ? str : str.substring(1)).split("[(?=)]");
for (String tok: toks)
System.out.printf("<<%s>>%n", tok);
Output:
<<079>>
<<25894029>>

From the Java 8 Oracle docs:
When there is a positive-width match at the beginning of this string
then an empty leading substring is included at the beginning of the
resulting array. A zero-width match at the beginning however never
produces such empty leading substring.
You can check whether the first element of the resulting array is an empty string and, if it is, drop it.
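One possible way to do that check, as a sketch only. The helper name is hypothetical, and it uses the one-argument split (limit 0), so trailing empty strings are dropped as well:

```java
import java.util.Arrays;

class SplitWithoutLeadingEmpty {
    /**
     * Splits input on regex and removes the empty first element that
     * appears when the input starts with a delimiter.
     * (Hypothetical helper for illustration.)
     */
    static String[] splitDroppingLeadingEmpty(String input, String regex) {
        String[] parts = input.split(regex);
        if (parts.length > 0 && parts[0].isEmpty()) {
            return Arrays.copyOfRange(parts, 1, parts.length);
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] c = splitDroppingLeadingEmpty("(079)25894029", "[(?=)]");
        System.out.println(Arrays.toString(c)); // [079, 25894029]
    }
}
```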

Your regex has problems, as does your approach - you can't solve it using your approach with any regex. The magic one-liner you seek is:
String[] c = b.replaceAll("^\\D+|\\D+$", "").split("\\D+");
This removes all leading/trailing non-digits, then splits on non-digits. This will handle many different formats and separators (try a few yourself).
See live demo of this:
String b = "(079)25894029".trim();
String[] c = b.replaceAll("^\\D+|\\D+$", "").split("\\D+");
System.out.println(Arrays.toString(c));
Producing this:
[079, 25894029]

Regular expression for counting words in a sentence

public static int getWordCount(String sentence) {
return sentence.split("(([a-zA-Z0-9]([-][_])*[a-zA-Z0-9])+)", -1).length
+ sentence.replaceAll("([[a-z][A-Z][0-9][\\W][-][_]]*)", "").length() - 1;
}
My intention is to count the number of words in a sentence. The input to this function is a lengthy sentence. It may have 255 words.
A word may contain hyphens or underscores in between.
The function should count only valid words, meaning runs of special characters should not be counted, e.g. &&&& or #### should not count as a word.
The above regular expression works fine, but when a hyphen or underscore comes in between a word, e.g. co-operation, the count is returned as 2 when it should be 1. Can anyone please help?

Instead of using .split and .replaceAll which are quite expensive operations, please use an approach with constant memory usage.
Based on your specifications, you seem to look for the following regex:
[\w-]+
Next you can use this approach to count the number of matches:
public static int getWordCount(String sentence) {
Pattern pattern = Pattern.compile("[\\w-]+");
Matcher matcher = pattern.matcher(sentence);
int count = 0;
while (matcher.find())
count++;
return count;
}
This approach works in (more) constant memory: when splitting, the program constructs an array, which is basically useless, since you never inspect the content of the array.
If you don't want words to start or end with hyphens, you can use the following regex:
\w+([-]\w+)*

This part ([-][_])* is wrong. The notation [xyz] means "any single one of the characters inside the brackets" (see http://www.regular-expressions.info/charclass.html). So effectively, you allow exactly the character - and exactly the character _, in that order.
Fixing your group makes it work:
[a-zA-Z0-9]+([-_][a-zA-Z0-9]+)*
and it can be further simplified using \w to
\w+(-\w+)*
because \w matches 0..9, A..Z, a..z and _ (http://www.regular-expressions.info/shorthand.html) and so you only need to add -.
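A quick way to convince yourself that the simplified pattern behaves as described is to count its matches. The class and method names below are made up for this sketch:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordCounter {
    // Word characters, optionally followed by hyphen-joined groups of word characters.
    private static final Pattern WORD = Pattern.compile("\\w+(-\\w+)*");

    /** Counts non-overlapping matches of WORD in the sentence. */
    static int countWords(String sentence) {
        Matcher m = WORD.matcher(sentence);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("co-operation is key")); // 3
        System.out.println(countWords("&&&& #### ok"));        // 1
    }
}
```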

If you can use Java 8:
long wordCount = Arrays.stream(sentence.split(" ")) //split the sentence into words
.filter(s -> s.matches("[\\w-]+")) //filter only matching words
.count();

With Java 9 or later (Matcher.results() was added in Java 9):
public static int getColumnCount(String row) {
return (int) Pattern.compile("[\\w-]+")
.matcher(row)
.results()
.count();
}

Java Regex from beginning to first char

How can I find any word from beginning of string to first char "~" using java?
Example:
Worddjjfdskfjsdkfjdsj ~ Word ~ Word
I want it to capture
Worddjjfdskfjsdkfjdsj

You can also do it without regex in a very simple way.
First, use the indexOf() method of String to find the index of the "~" character. Then use the substring() method to extract the string you are looking for.
Here is an example:
String stringToProcess = "hello~world";
int charIndex = stringToProcess.indexOf('~');
String finalString = stringToProcess.substring(0, charIndex);

You can use this regex to capture all character from start of string ^ to first occurrence of ~:
^[^~]*
[^~]* is negation based regex that matches 0 or more of anything but ~
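Here is roughly how that pattern could be applied with Matcher. Treat it as a sketch: the names are invented, and the trim() call is my own assumption, added because the example input has a space before the first ~.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixBeforeTilde {
    private static final Pattern PREFIX = Pattern.compile("^[^~]*");

    /** Returns everything before the first '~' (or the whole string if there is none). */
    static String prefix(String s) {
        Matcher m = PREFIX.matcher(s);
        // The pattern can match the empty string, so find() succeeds for any input.
        return m.find() ? m.group().trim() : "";
    }

    public static void main(String[] args) {
        System.out.println(prefix("Worddjjfdskfjsdkfjdsj ~ Word ~ Word")); // Worddjjfdskfjsdkfjdsj
    }
}
```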

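It can be solved without writing a pattern yourself.
Simply split your string on ~ (split() does take a regex, but ~ has no special meaning in one). Note that str[0] keeps the space before the ~, so trim() it if needed.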
String str[] = "Worddjjfdskfjsdkfjdsj ~ Word ~ Word".split("~");
System.out.println(str[0]);

Here is a regular expression that you can use: ^(.*?)~
However, in your simple case you do not need regular expressions at all. Use indexOf() and substring():
int tilde = str.indexOf('~');
if (tilde >= 0) {
word = str.substring(0, tilde);
}

Finding the longest substring between a "start" string and one of 3 possible "end" strings

So my question is substring-related.
How do you find the longest possible substring between a starting string and one of three ending strings? I also need to find the index of the original string that the largest substring starts at.
So:
Start string:
"ATG"
3 possible end strings:
"TAG"
"TAA"
"TGA"
An example original string might be:
"SDAFKJDAFKATGDFSDFAKJDNKSJFNSDTGASDFKJSDNKFJSNDJFATGDSDFKJNSDFTAGSDFSDATGFF"
So the result of that should give me:
- Longest substring length: 23 (from the substring ATGDFSDFAKJDNKSJFNSDTGA)
- Index of longest substring: 10
I cannot use Regex.
Thanks for any help!

This is arguably the easiest way, and it's just one line:
String target = str.replaceAll(".*ATG(.*)(TAG|TAA|TGA).*", "$1");
To find the index:
int index = str.indexOf("ATG") + 3;
Note: I have interpreted your remark "I cannot use regex" to mean "I am unskilled at regex", because if it's a java question, regex is available.

Well, this looks like a fun one.
It seems the most straightforward way to do this would be to build your own mini finite state machine. You would have to parse each character in the string and keep track of all possible character sequences that would terminate the sequence.
If you hit a 'T', you need to jump ahead and look at the next character. If it's an 'A' or a 'G' you need to jump ahead again, otherwise, add those tokens to your string. Continue the pattern until you get to the end of the original string, or match one of your terminal patterns.
So, maybe something that looks like this (simplified example):
String longestSequence(String original) {
StringBuilder sb = new StringBuilder();
char[] tokens = original.toCharArray();
for (int i = 0; i < tokens.length; ++i) {
// read each token, and compare / look ahead to see if you should keep going or terminate.
}
return sb.toString();
}

Match your string against this regex:
ATG[A-Z]+(TAG|TAA|TGA)
If multiple matches occur, iterate over them and keep the longest one.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// compile the pattern once, then scan the input for matches
Pattern pattern = Pattern.compile("ATG[A-Z]+(TAG|TAA|TGA)");
Matcher matcher = pattern.matcher( yourInputStringHere );
while (matcher.find()) {
System.out.println("Found the text \"" + matcher.group()
+ "\" starting at " + matcher.start()
+ " and ending at index " + matcher.end());
}

There are already some beautiful and elegant solutions to your problem (Bohemian and inquisitive). If you still, as originally stated, can't use regex, here's an alternative. This code is not especially elegant, and as pointed out, there are better ways to do it, but it should at least clearly show you the logic behind the solution to your problem.
How do you find the longest possible substring between a starting string
and one of three ending strings?
First, find the index of the starting string, then find the index of each ending string, get the substring for each ending, and then their lengths. Remember that if a string is not found, its index will be -1.
String originalString = "SDAFKJDAFKATGDFSDFAKJDNKSJFNSDTGASDFKJSDNKFJSNDJFATGDSDFKJNSDFTAGSDFSDATGFF";
String STARTING_STRING = "ATG";
String END1 = "TAG";
String END2 = "TAA";
String END3 = "TGA";
//let's find the index of STARTING_STRING
int posOfStartingString = originalString.indexOf(STARTING_STRING);
//if found
if (posOfStartingString != -1) {
int tagPos[] = new int[3];
//let's find the index of each ending strings in the original string
tagPos[0] = originalString.indexOf(END1, posOfStartingString+3);
tagPos[1] = originalString.indexOf(END2, posOfStartingString+3);
tagPos[2] = originalString.indexOf(END3, posOfStartingString+3);
int lengths[] = new int[3];
//we can now use the following methods:
//public String substring(int beginIndex, int endIndex)
//where beginIndex is our posOfStartingString
//and endIndex is position of each ending string (if found)
//
//and finally, String.length() to get the length of each substring
if (tagPos[0] != -1) {
lengths[0] = originalString.substring(posOfStartingString, tagPos[0]).length();
}
if (tagPos[1] != -1) {
lengths[1] = originalString.substring(posOfStartingString, tagPos[1]).length();
}
if (tagPos[2] != -1) {
lengths[2] = originalString.substring(posOfStartingString, tagPos[2]).length();
}
} else {
//no starting string in original string
}
The lengths[] array now contains the length of each substring that starts at STARTING_STRING and runs up to (but does not include) the respective ending. Then just find which one is the longest and you will have your answer.
I also need to find the index of the original string that the largest substring starts at.
This will be the index where the starting string starts, in this case 10.
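For completeness, here is a regex-free sketch that reproduces the numbers from the question (length 23, index 10). It rests on an assumption about what "longest" means: each occurrence of the start string is paired with the nearest ending that follows it, the ending is included in the length, and the longest such pair wins. The class and method names are hypothetical.

```java
class LongestBetween {
    /**
     * Returns {startIndex, length} of the longest segment that begins with
     * start and stops at the nearest following end string (inclusive),
     * or {-1, 0} if no such segment exists.
     */
    static int[] longest(String s, String start, String... ends) {
        int bestStart = -1;
        int bestLength = 0;
        int from = s.indexOf(start);
        while (from != -1) {
            // Find the nearest end string that begins after the start string.
            int nearestEnd = -1;
            int endLength = 0;
            for (String end : ends) {
                int pos = s.indexOf(end, from + start.length());
                if (pos != -1 && (nearestEnd == -1 || pos < nearestEnd)) {
                    nearestEnd = pos;
                    endLength = end.length();
                }
            }
            if (nearestEnd != -1) {
                int length = nearestEnd + endLength - from;
                if (length > bestLength) {
                    bestLength = length;
                    bestStart = from;
                }
            }
            // Move on to the next occurrence of the start string.
            from = s.indexOf(start, from + 1);
        }
        return new int[] {bestStart, bestLength};
    }

    public static void main(String[] args) {
        String original =
            "SDAFKJDAFKATGDFSDFAKJDNKSJFNSDTGASDFKJSDNKFJSNDJFATGDSDFKJNSDFTAGSDFSDATGFF";
        int[] result = longest(original, "ATG", "TAG", "TAA", "TGA");
        System.out.println("index " + result[0] + ", length " + result[1]); // index 10, length 23
    }
}
```

If your definition of "longest" is different (for example, pairing the first start with the farthest ending), only the comparison inside the inner loop needs to change.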

Finding multiple substrings using boundaries in Java

Alright, so here is my problem. Basically I have a string with 4 words in it, with each word separated by a #. What I need to do is use the substring method to extract each word and print it out. I am having trouble figuring out the parameters for it though. I can always get the first one right, but the following ones generally have problems.
Here is the first piece of the code:
word = format.substring( 0 , format.indexOf('#') );
Now from what I understand this basically means start at the beginning of the string, and end right before the #. So using the same logic, I tried to extract the second word like so:
wordTwo = format.substring ( wordlength + 1 , format.indexOf('#') );
//The plus one so I don't start at the #.
But with this I continually get errors saying it doesn't exist. I figured that the compiler was trying to read the first # before the second word, so I rewrote it like so:
wordTwo = format.substring (wordlength + 1, 1 + wordLength + format.indexOf('#') );
And with this it just completely screws it up, either not printing the second word or not stopping in the right place. If I could get any help on the formatting of this, it would be greatly appreciated. Since this is for a class, I am limited to using very basic methods such as indexOf, length, substring etc., so if you could refrain from using anything too complex that would be amazing!

If you have to use substring, then you need to use the variant of indexOf that takes a start index. This means you can look for the second # by starting the search after the first one, i.e.
wordTwo = format.substring ( wordlength + 1 , format.indexOf('#', wordlength + 1 ) );
There are however much better ways of splitting a string on a delimiter like this. You can use a StringTokenizer. This is designed for splitting strings like this. Basically:
StringTokenizer tok = new StringTokenizer(format, "#");
String word = tok.nextToken();
String word2 = tok.nextToken();
String word3 = tok.nextToken();
Or you can use the String.split method, which is designed for splitting strings, e.g.
String[] parts = format.split("#");
String word = parts[0];
String word2 = parts[1];
String word3 = parts[2];

You can use split() for strings formatted like this.
For instance, if you have a string like
String text = "Word1#Word2#Word3#Word4";
you can define the delimiter as
String delimiter = "#";
then create a string array like
String[] temp;
and split the string with
temp = text.split(delimiter);
The array then holds the words:
temp[0] is "Word1"
temp[1] is "Word2"
temp[2] is "Word3"
temp[3] is "Word4"

Use split() method to do this with "#" as the delimiter
String s = "hi#vivek#is#good";
String temp = new String();
String[] arr = s.split("#");
for(String x : arr){
temp = temp + x;
}
Or if you want to extract each word... you have it already in arr
arr[0] ---> First Word
arr[1] ---> Second Word
arr[2] ---> Third Word

I suggest that you have a look at the Javadoc for String before you proceed further.
Since this is your homework, I'll give you a couple of hints and maybe you can solve it yourself:
The signature of substring is public String substring(int beginIndex, int endIndex). As per the javadoc for this method:
Returns a new string that is a substring of this string. The substring
begins at the specified beginIndex and extends to the character at
index endIndex - 1. Thus the length of the substring is
endIndex-beginIndex.
Note that if you have to use this method, you'll need to shift your beginIndex and endIndex each time because, in your situation, you have multiple words that are separated by #.
However if you look closely, there's another method in String class that might be helpful to you. That's the public String[] split(String regex) method. The javadoc for this one states:
Splits this string around matches of the given regular expression.
This method works as if by invoking the two-argument split method with
the given expression and a limit argument of zero. Trailing empty
strings are therefore not included in the resulting array.
The split() method looks pretty interesting for your case. You can split your String with the delimiter that you have as the parameter to this method, get the String array and work with that.
Hope this helps you to understand your problem and get started towards a solution :)

Since this is homework, it may be better to try to write it yourself. But I will give a clue.
Clue:
The indexOf method has another overload: int indexOf(int ch, int fromIndex), which finds the first occurrence of the character ch in the string, starting the search at fromIndex.
http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/String.html
From this clue, the program will look something like this:
Find the index of the first '#' from the start of the string.
Extract the word from the 0th character to that index.
Find the index of the next '#', searching from the character AFTER the first '#'.
Extract the word from the character after the first '#' to that index.
... Just do it until you get 4 words or the string ends.
Hope this helps.
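If you get stuck, the steps above could be sketched like this, using only indexOf and substring for the string work. The class and method names are invented, and collecting the words in a List is just a convenience for the example (printing each word as you find it works too):

```java
import java.util.ArrayList;
import java.util.List;

class HashSeparatedWords {
    /** Extracts the '#'-separated words of format using indexOf and substring. */
    static List<String> words(String format) {
        List<String> result = new ArrayList<>();
        int start = 0;                          // where the current word begins
        int hash = format.indexOf('#', start);  // next '#' at or after start
        while (hash != -1) {
            result.add(format.substring(start, hash));
            start = hash + 1;                   // skip over the '#'
            hash = format.indexOf('#', start);
        }
        result.add(format.substring(start));    // last word has no trailing '#'
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Word1#Word2#Word3#Word4")); // [Word1, Word2, Word3, Word4]
    }
}
```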

I don't know why you're forced to use String#substring, but as others have mentioned, it seems like the wrong method for the kind of functionality you need.
String#split(String regex) is what you would use for such a problem, or, if your input sequence is something you don't control, I would suggest you look at the overloaded method String#split(String regex, int limit); this way you can impose a limit on the number of pieces produced, controlling your resulting array.
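To illustrate what the limit argument does, here is a small sketch (the class and method names are made up). A positive limit caps the number of array elements, and a negative limit keeps trailing empty strings that the one-argument version would discard:

```java
import java.util.Arrays;

class SplitLimitDemo {
    /** Splits off the first word and leaves the rest of the string untouched. */
    static String[] firstAndRest(String text) {
        return text.split("#", 2);
    }

    public static void main(String[] args) {
        // At most 2 elements: the first word and everything after the first '#'.
        System.out.println(Arrays.toString(firstAndRest("Word1#Word2#Word3#Word4")));
        // prints [Word1, Word2#Word3#Word4]

        // Limit 0 (the default) drops trailing empty strings; -1 keeps them.
        System.out.println("a#b##".split("#").length);     // 2
        System.out.println("a#b##".split("#", -1).length); // 4
    }
}
```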
