How to mimic transliterate in Java?

In Perl, I usually use the transliteration operator (y///, also written tr///) to count the number of characters in a string that match a set of possible characters. For example:
$c1=($a =~ y[\x{0410}-\x{042F}\x{0430}-\x{044F}]
[\x{0410}-\x{042F}\x{0430}-\x{044F}]);
This counts the number of Cyrillic characters in $a. That example has two classes (or two ranges, if you prefer); I have others with more classes:
$c4=($a =~ y[\x{AC00}-\x{D7AF}\x{1100}-\x{11FF}\x{3130}-\x{318F}\x{A960}-\x{A97F}\x{D7B0}-\x{D7FF}]
[\x{AC00}-\x{D7AF}\x{1100}-\x{11FF}\x{3130}-\x{318F}\x{A960}-\x{A97F}\x{D7B0}-\x{D7FF}]);
Now I need to do something similar in Java. Is there a comparable construct in Java, or do I need to iterate over all the characters and check whether each one falls within the limits of each class?
Thank you

I haven't seen anything like tr/// in Java.
You could use something like this to count all the matches, though:
Pattern p = Pattern.compile("[\\x{0410}-\\x{042F}\\x{0430}-\\x{044F}]",
        Pattern.CANON_EQ);
Matcher m = p.matcher(string);
int count = 0;
while (m.find())
    count++;
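
For reference, a self-contained version of the snippet above might look like the following sketch. The class name RangeCount, the helper count and the sample string are made up for illustration, and the CANON_EQ flag is left out as a simplification.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RangeCount {
    // Same two Cyrillic ranges as the Perl y/// example in the question.
    static final Pattern CYRILLIC =
            Pattern.compile("[\\x{0410}-\\x{042F}\\x{0430}-\\x{044F}]");

    // Counts how many characters of s are matched by the character class in p.
    static int count(Pattern p, String s) {
        Matcher m = p.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Six Cyrillic letters followed by ASCII text.
        String sample = "\u041F\u0440\u0438\u0432\u0435\u0442, world";
        System.out.println(count(CYRILLIC, sample)); // prints 6
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex for every string that is counted.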

For completeness, here is a version that uses Java's built-in Unicode support:
import java.lang.Character.UnicodeScript;

int countCyrillic(String s) {
    int n = 0;
    for (int i = 0; i < s.length(); ) {
        int codePoint = s.codePointAt(i);
        i += Character.charCount(codePoint); // advance by 1 or 2 chars
        if (UnicodeScript.of(codePoint) == UnicodeScript.CYRILLIC) {
            ++n;
        }
    }
    return n;
}
This handles the full Unicode range, where two 16-bit chars (a surrogate pair) may represent a single Unicode code point.
In Java, the class Character.UnicodeScript already has everything you need.
Or:
int n = s.replaceAll("\\P{IsCyrillic}", "").length();
Here \\P{IsCyrillic} is the negation of \\p{IsCyrillic}, the Cyrillic script class, so every non-Cyrillic character is removed before the length is taken.
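
If the explicit ranges from the question matter more than the script property, the same code-point loop can test them directly. The following is only a sketch: the class name RangeCounter, the method countInRanges and the {low, high} array layout are assumptions, and the ranges are copied from the question's second y/// example.

```java
class RangeCounter {
    // {low, high} pairs copied from the question's second example.
    static final int[][] HANGUL = {
        {0xAC00, 0xD7AF}, {0x1100, 0x11FF}, {0x3130, 0x318F},
        {0xA960, 0xA97F}, {0xD7B0, 0xD7FF}
    };

    // Counts the code points of s that fall inside any of the given ranges.
    static int countInRanges(String s, int[][] ranges) {
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp); // step over surrogate pairs correctly
            for (int[] r : ranges) {
                if (cp >= r[0] && cp <= r[1]) {
                    n++;
                    break; // count each code point at most once
                }
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // Two Hangul syllables followed by ASCII text.
        System.out.println(countInRanges("\uD55C\uAE00 abc", HANGUL)); // prints 2
    }
}
```

This keeps the range list as data, so another set of ranges can be counted by passing a different array.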

You can try to play with something like this:
s.replaceAll("[^\\x{0410}-\\x{042F}\\x{0430}-\\x{044F}]*([\\x{0410}-\\x{042F}\\x{0430}-\\x{044F}])?", "$1").length()
The idea was borrowed from here: Simple way to count character occurrences in a string

Related

What is the best way to find the common prefix for String(s)?

I need to find the common prefix of one or more strings. For example, the strings can have either 2 or 4 characters appended, like so:
hello12
hello1234
Sometimes though, the strings can also end up like so:
hello11
hello1122
But the common prefix is now hello11 and it should only be hello. I also need to be able to get hello from the string when there is only one string.
I have written the following code, and it works when all strings are unique and do not share any appended values.
String prefix = "";
if (listStrings.size() > 0) {
    prefix = listStrings.get(0);
    for (int i = 1; i < listStrings.size(); i++) {
        String nextString = listStrings.get(i);
        int j;
        for (j = 0; j < Math.min(prefix.length(), listStrings.get(i).length()); j++) {
            if (prefix.charAt(j) != nextString.charAt(j)) {
                break;
            }
        }
        prefix = listStrings.get(i).substring(0, j);
    }
}
This code produces hello for the following inputs:
hello1234
hello5678
hellothere
It doesn't give me hello for the following inputs:
hello1122
hello1124
hello1124
hello1134
I expect the output to be "hello" no matter what is input into the algorithm.

Your problem is that just looking at all kinds of characters is the wrong approach.
You basically stated: I want my prefix to NOT contain digits.
In other words: when collecting your max prefix, you should stop looking at any string as soon as you encounter the first character that doesn't match your criteria (which seems to be: the character represents a digit). You could rely on the Character class and its isDigit() method for that.
But the real point here: you need to clarify your requirements. If you are unable to clearly express the exact definition of your "prefixes", then writing Java code is the wrong priority. In other words: are we talking about letters versus digits here? What about whitespace? Pure ASCII or, beware, arbitrary Unicode?
Thus the real answer is: step back, and make up your mind conceptually what constitutes a valid prefix, and what exactly tells you "the prefix just ended here".

If you need the prefix without digits, you can do the following:
for (j = 0; j < Math.min(prefix.length(), nextString.length()); j++) {
    char c = prefix.charAt(j);
    // Stop at the first mismatch or at the first character that is not an ASCII letter
    boolean asciiLetter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    if (c != nextString.charAt(j) || !asciiLetter) {
        break;
    }
}
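
Putting the two answers together, a complete method could look like the sketch below. It assumes that "prefix" means the leading run of letters shared by all strings, which is only one possible reading of the requirements; the class name LetterPrefix and the method commonLetterPrefix are made up, and Character.isLetter is used instead of the ASCII check.

```java
import java.util.Arrays;
import java.util.List;

class LetterPrefix {
    // Returns the longest common prefix that consists only of letters.
    static String commonLetterPrefix(List<String> strings) {
        if (strings.isEmpty()) {
            return "";
        }
        // Start with the leading run of letters of the first string,
        // so that a single input such as "hello12" still yields "hello".
        String prefix = strings.get(0);
        int end = 0;
        while (end < prefix.length() && Character.isLetter(prefix.charAt(end))) {
            end++;
        }
        prefix = prefix.substring(0, end);

        // Shrink the prefix until it matches the start of every string.
        for (String s : strings) {
            int limit = Math.min(prefix.length(), s.length());
            int j = 0;
            while (j < limit && prefix.charAt(j) == s.charAt(j)) {
                j++;
            }
            prefix = prefix.substring(0, j);
        }
        return prefix;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("hello1122", "hello1124", "hello1124", "hello1134");
        System.out.println(commonLetterPrefix(input)); // prints hello
    }
}
```

Trimming the first string before the comparison loop is what makes the shared digits in hello1122 and hello1124 irrelevant.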

indexOf() vs regex for identifying special characters like $ and {

I would like to check whether a special character like { or $ is present in a string. I used a regex, but during code review I was asked to use indexOf() instead of a regex (as it is costlier). I would like to understand how indexOf() can be used to identify special characters. (I am familiar with using it to find the index of a substring.)
String photoRoot = "http://someurl/${TOKEN1}/${TOKEN2}";
Pattern p = Pattern.compile("\\$\\{(.*?)\\}");
Matcher m = p.matcher(photoRoot);
if (m.find()) {
// logic to be performed
}

There is more than one indexOf(...) method, but all of them treat all characters the same; there is no need to escape any characters when using these methods.
Here is how you can get the two tokens by using some of the indexOf(...) methods:
String photoRoot = "http://someurl/${TOKEN1}/${TOKEN2}";
String startDelimiter = "${";
char endDelimiter = '}';
int start = -1, end = -1;
while (true) {
    start = photoRoot.indexOf(startDelimiter, end);
    end = photoRoot.indexOf(endDelimiter, start + startDelimiter.length());
    if (start != -1 && end != -1) {
        System.out.println(photoRoot.substring(start + startDelimiter.length(), end));
    } else {
        break;
    }
}

If you're only looking to find a couple of different special characters you'd just use indexOf("$") or indexOf("}"). You will need to specify each special character you want to find separately.
There is no way though to have it find the index of every special character in one statement: https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#indexOf(int)
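
A small helper can hide the repeated calls. This is only a sketch: the class name SpecialChars and the method containsAny are made up, and it relies solely on String.indexOf(int).

```java
class SpecialChars {
    // Returns true if s contains at least one of the given characters.
    static boolean containsAny(String s, char... specials) {
        for (char c : specials) {
            if (s.indexOf(c) >= 0) { // one indexOf call per special character
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String photoRoot = "http://someurl/${TOKEN1}/${TOKEN2}";
        System.out.println(containsAny(photoRoot, '$', '{')); // prints true
    }
}
```

Note that this only tells you that some special character is present; it does not check that the characters form a ${...} token.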

If you just need to check for the 2 characters in your question, the answer will be
var found = photoRoot.indexOf("$") >= 0 || photoRoot.indexOf("{") >= 0;

It's always difficult to guess while there is contradicting information. The code does not look for special characters, it searches for a pattern - and indexOf will not help you there.
Titus' answer is good for avoiding pattern matching if you need to find the pattern ${...} (as opposed to "identifying special characters")
If (as the reviewer appears to think) you just need to look for any of a set of special characters you can apply indexOf( on_special_char ) repeatedly, but you can also do
for( int i = 0; i < photoRoot.length(); ++i ){
    if( "${}".indexOf( photoRoot.charAt(i) ) >= 0 ){
        // one of the special characters is at pos i
    }
}
I am not sure where the performance break-even point lies between multiple indexOf calls on the target string and the iteration above, which calls indexOf on the (short) string containing the specials. But the iteration may be easier to maintain, and it permits dynamic adaptation of the set of specials.
Of course, the simple
photoRoot.matches( ".*[" + Pattern.quote( specials ) + "].*" );
is also dynamically adaptable. (The square brackets turn the quoted specials into a character class, so that any one of them matches rather than the whole sequence.)

Algorithm for IP Address-like String extraction? [Java]

Given strings like:
S5.4.2
3.2
SD45.G.94L5456.294.1004.8888.0.23QWZZZZ
5.44444444444444444444444444444.5GXV
You would need to return:
5.4.2
3.2
5456.294.1004.8888.0.23
5.44444444444444444444444444444.5
Is there a smart way to write a method that extracts only the IP-address-like number from each string? (I say "IP-address-like" because I'm not sure whether IP addresses always have a set amount of numbers.) One of my friends suggested that there might be a regex for what I'm looking for, and so I found this. The problem with that solution is that I think it is limited to 4 integers in total, and it also won't accept integers with 4+ digits between the dots.
I would need it to recognize the pattern:
NUMS DOT (only one) NUMS DOT (only one) NUMS
Meaning:
234..54.89 FAIL
.54.6.10 FAIL
256 FAIL
234.7885.924.120.2.0.4 PASS
Any ideas? Thanks everyone. I've been at this for a few hours and can't figure this out.

Here is an approach using Regex:
private static String getNumbers(String toTest){
    String IPADDRESS_PATTERN =
            "(\\d+\\.)+\\d+";
    Pattern pattern = Pattern.compile(IPADDRESS_PATTERN);
    Matcher matcher = pattern.matcher(toTest);
    if (matcher.find()) {
        return matcher.group();
    }
    else{
        return "did not match anything..";
    }
}
This will match the number-dot-number-... pattern with any number of dot-separated groups of digits.
I modified this Answer a bit.
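
One thing to keep in mind is that find() extracts the first matching run, whereas the PASS/FAIL list in the question reads more like a validation of the whole string, which is what matches() does. The sketch below shows both; the class name DottedNumbers and the methods extract and isValid are made up for illustration, and returning null when nothing matches is a choice made here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DottedNumbers {
    // One or more "digits followed by a dot" groups, then a final group of digits.
    private static final Pattern DOTTED = Pattern.compile("(?:\\d+\\.)+\\d+");

    // Returns the first dotted-number run found in input, or null if there is none.
    static String extract(String input) {
        Matcher m = DOTTED.matcher(input);
        return m.find() ? m.group() : null;
    }

    // Returns true only if the entire input is a dotted-number sequence.
    static boolean isValid(String input) {
        return DOTTED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(extract("SD45.G.94L5456.294.1004.8888.0.23QWZZZZ")); // prints 5456.294.1004.8888.0.23
        System.out.println(isValid("234..54.89")); // prints false
        System.out.println(isValid("234.7885.924.120.2.0.4")); // prints true
    }
}
```

With find(), an input such as 234..54.89 still yields 54.89, so use the whole-string check when such inputs must be rejected.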

There are many ways to do it. This is the best way, I guess.
public static String NonDigits(final CharSequence input)
{
    final StringBuilder sb = new StringBuilder(input.length());
    for(int i = 0; i < input.length(); i++){
        final char c = input.charAt(i);
        if(c > 47 && c < 58){
            sb.append(c);
        }
    }
    return sb.toString();
}
A CharSequence is a readable sequence of characters. This interface
provides uniform, read-only access to many different kinds of
character sequences.
Look at this ASCII table. The ASCII values 48 to 57 are the digits 0 to 9, which is why the code tests c > 47 && c < 58. So this is a simple, direct way to extract the digits.
The final keyword means that a variable cannot be reassigned once it has been set. Declaring such variables as final is good practice.
Here is the same code without the final modifiers, to make it easier to follow:
public static String NonDigits(String input)
{
    StringBuilder sb = new StringBuilder(input.length());
    for(int i = 0; i < input.length(); i++)
    {
        char c = input.charAt(i);
        if(c > 47 && c < 58)
        {
            sb.append(c);
        }
    }
    return sb.toString();
}

String.replaceAll for multiple characters

I have a line with ^||^ as my delimiter, and I am using:
int charCount = line.replaceAll("[^" + fileSeperator + "]", "").length();
if (fileSeperator.length() > 1)
{
    charCount = charCount / fileSeperator.length();
    System.out.println(charCount + "char count between");
}
This does not work if I have a line that has a stray | or ^, as it counts these as well. How can I modify the regex, or are there any other suggestions?

If I understand correctly, what you're really trying to do is count the number of times that ^||^ appears in your String.
If that's the case, you can use:
Matcher m = Pattern.compile(Pattern.quote("^||^")).matcher(line);
int count = 0;
while ( m.find() )
    count++;
System.out.println(count + "char count between");
But you really don't need the regex engine for this.
int startIndex = 0;
int count = 0;
while ( true ) {
    int newIndex = line.indexOf(fileDelimiter, startIndex);
    if ( newIndex == -1 ) {
        break;
    } else {
        startIndex = newIndex + 1;
        count++;
    }
}

Certain characters have special meanings in a regular expression, such as ^ and |. These must be escaped with a backslash in order for them to be treated as normal characters and not as special characters. For example, the following regular expression matches all caret (^) and pipe (|) characters (note the backslashes): [\^\|]
The Pattern.quote() method can be used to escape all of the special characters in a given String.
String quoted = Pattern.quote("^||^"); // returns "\Q^||^\E", which matches the literal text ^||^
Also note that a character class only matches one character. Thus, the regex [^\^\|\|\^] will match all characters except ^ and |, not all characters except the sequence ^||^. If your intention is to count the number of delimiters (^||^) in a String, then a better approach might be to use the String.indexOf(String, int) method.
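
A minimal sketch of the indexOf-based count follows. The class name DelimiterCount and the method countOccurrences are made up for illustration, and the sketch assumes that occurrences should not overlap, so the search resumes after the end of each match.

```java
class DelimiterCount {
    // Counts non-overlapping occurrences of delimiter in line.
    static int countOccurrences(String line, String delimiter) {
        if (delimiter.isEmpty()) {
            return 0; // an empty delimiter would otherwise loop forever
        }
        int count = 0;
        int from = 0;
        int idx;
        while ((idx = line.indexOf(delimiter, from)) != -1) {
            count++;
            from = idx + delimiter.length(); // skip past this occurrence
        }
        return count;
    }

    public static void main(String[] args) {
        // Stray | and ^ characters are not counted, only the full delimiter.
        System.out.println(countOccurrences("a|b^c^||^d", "^||^")); // prints 1
    }
}
```

Because indexOf treats the delimiter as plain text, no escaping is needed here.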

Mark Peters's answer seems better. I edited so my answer won't cause any confusion.

You should replace it like this, with proper escaping, since your delimiter consists entirely of regex special characters:
line.replaceAll("\\^\\|\\|\\^", "");
OR else don't use regex at all and call replace method like this:
line.replace("^||^", "");

Lazy solutions.
Depending on the end goal (the println statement is a little confusing):
int numberOfDelimiters = (line.length() - line.replace(fileSeparator,"").length())
        / fileSeparator.length();
int numberOfNonDelimiterChars = line.replace(fileSeparator,"").length();

Java: Finding the number of word matches in a given string

I am trying to find the number of word matches for a given string and keyword combination, like this:
public int matches(String keyword, String text){
// ...
}
Example:
Given the following calls:
System.out.println(matches("t", "Today is really great, isn't that GREAT?"));
System.out.println(matches("great", "Today is really great, isn't that GREAT?"));
The result should be:
0
2
So far I found this: Find a complete word in a string java
This only returns if the given keyword exists but not how many occurrences. Also, I am not sure if it ignores case sensitivity (which is important for me).
Remember that substrings should be ignored! I only want full words to be found.
UPDATE
I forgot to mention that I also want keywords that are separated via whitespace to match.
E.g.
matches("today is", "Today is really great, isn't that GREAT?")
should return 1

Use a regular expression with word boundaries. It's by far the easiest choice.
int matches = 0;
Matcher matcher = Pattern.compile("\\bgreat\\b", Pattern.CASE_INSENSITIVE).matcher(text);
while (matcher.find()) matches++;
Your mileage may vary with some non-English languages, though.
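
One caveat with \b: an apostrophe counts as a word boundary, so \bt\b finds the t in isn't and returns 1 where the question expects 0. The sketch below works around that with lookarounds. It assumes that letters, digits and apostrophes all count as part of a word, which is a choice made here rather than something the question specifies; the class name WordMatches is made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordMatches {
    // Counts case-insensitive occurrences of keyword that are not directly
    // preceded or followed by a letter, a digit or an apostrophe.
    static int matches(String keyword, String text) {
        String wordChar = "[\\p{L}\\p{N}']";
        String regex = "(?<!" + wordChar + ")" + Pattern.quote(keyword) + "(?!" + wordChar + ")";
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
                .matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "Today is really great, isn't that GREAT?";
        System.out.println(matches("t", text));        // prints 0
        System.out.println(matches("great", text));    // prints 2
        System.out.println(matches("today is", text)); // prints 1
    }
}
```

Pattern.quote keeps any regex metacharacters in the keyword literal, and a keyword containing a space matches only that exact spacing.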

How about taking advantage of indexOf? (Here s1 is the text and s2 is the keyword.)
s1 = s1.toLowerCase(Locale.US);
s2 = s2.toLowerCase(Locale.US);
int count = 0;
int x;
int y = s2.length();
while ((x = s1.indexOf(s2)) != -1) {
    count++;
    s1 = s1.substring(x + y); // continue searching after this match
}
return count;
Efficient version
int count = 0;
int y = s2.length();
for (int i = 0; i <= s1.length() - y; i++) {
    // Compare s2 against s1 starting at position i
    int j = 0;
    while (j < y && s1.charAt(i + j) == s2.charAt(j)) {
        j++;
    }
    if (j == y) count++;
}
return count;
For a more efficient solution, you would have to adapt the KMP (Knuth-Morris-Pratt) algorithm a little. It is well documented and fairly simple.

Well, you can use split to separate the words and then check whether any word matches exactly.
Hope that helps!

One option would be a regex. Basically, it sounds like you are looking to match a word with any punctuation on the left or right, so:
" great."
" great!"
" great "
" great,"
"Great"
would all match, but
"greatest"
wouldn't
