How to remove surrogate characters in Java?


I am facing a situation where I get surrogate characters in text that I am saving to MySQL 5.1. As UTF-16 is not supported there, I want to remove these surrogate pairs manually with a Java method before saving the text to the database.
I have written the following method for now and I am curious to know if there is a direct and optimal way to handle this.
Thanks in advance for your help.
public static String removeSurrogates(String query) {
StringBuffer sb = new StringBuffer();
for (int i = 0; i < query.length() - 1; i++) {
char firstChar = query.charAt(i);
char nextChar = query.charAt(i+1);
if (Character.isSurrogatePair(firstChar, nextChar) == false) {
sb.append(firstChar);
} else {
i++;
}
}
if (Character.isHighSurrogate(query.charAt(query.length() - 1)) == false
&& Character.isLowSurrogate(query.charAt(query.length() - 1)) == false) {
sb.append(query.charAt(query.length() - 1));
}
return sb.toString();
}

Here's a couple things:
Character.isSurrogate(char c):
A char value is a surrogate code unit if and only if it is either a low-surrogate code unit or a high-surrogate code unit.
Checking for pairs seems pointless, why not just remove all surrogates?
x == false is equivalent to !x
StringBuilder is better in cases where you don't need synchronization (like a variable that never leaves local scope).
I suggest this:
public static String removeSurrogates(String query) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < query.length(); i++) {
char c = query.charAt(i);
// !isSurrogate(c) in Java 7
if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
sb.append(c);
}
}
return sb.toString();
}
Breaking down the if statement
You asked about this statement:
if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
sb.append(c);
}
One way to understand it is to break each operation into its own function, so you can see that the combination does what you'd expect:
static boolean isSurrogate(char c) {
return Character.isHighSurrogate(c) || Character.isLowSurrogate(c);
}
static boolean isNotSurrogate(char c) {
return !isSurrogate(c);
}
...
if (isNotSurrogate(c)) {
sb.append(c);
}

Java strings are stored as sequences of 16-bit chars, but what they represent is sequences of Unicode characters. In Unicode terminology, they are stored as code units, but model code points. Thus, it's somewhat meaningless to talk about removing surrogates, which don't exist in the character / code point representation (unless you have rogue single surrogates, in which case you have other problems).
Rather, what you want to do is to remove any characters which will require surrogates when encoded. That means any character which lies beyond the basic multilingual plane. You can do that with a simple regular expression:
return query.replaceAll("[^\u0000-\uffff]", "");
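The same filtering can also be written as an explicit loop over code points, which may be easier to adapt later. This is only a minimal sketch and not part of the original answer; the names BmpFilter and removeSupplementary are made up for illustration. An unpaired surrogate is read by codePointAt as a single BMP value, so this loop keeps it.

```java
class BmpFilter {
    // Keeps only code points inside the Basic Multilingual Plane (U+0000..U+FFFF).
    static String removeSupplementary(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);          // reads a full surrogate pair as one value
            if (!Character.isSupplementaryCodePoint(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);          // advance by 1 or 2 chars
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1F600 is stored as the pair D83D DE00
        System.out.println(removeSupplementary("a\uD83D\uDE00b")); // prints "ab"
    }
}
```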

Why not simply:
StringBuilder sb = new StringBuilder();
for (int i = 0; i < query.length(); i++) {
char c = query.charAt(i);
if (!Character.isHighSurrogate(c) && !Character.isLowSurrogate(c)) {
sb.append(c);
}
}
You should probably replace them with "?" instead of erasing them outright.

Just curious. If char is high surrogate is there a need to check the next one? It is supposed to be low surrogate. The modified version would be:
public static String removeSurrogates(String query) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < query.length(); i++) {
char ch = query.charAt(i);
if (Character.isHighSurrogate(ch))
i++; // skip the next char, as it's supposed to be the low surrogate
else
sb.append(ch);
}
return sb.toString();
}

If you want to remove them, all of these solutions are useful.
But if you want to replace them, the code below is better:
StringBuffer sb = new StringBuffer();
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if(Character.isHighSurrogate(c)){
sb.append('*');
}else if(!Character.isLowSurrogate(c)){
sb.append(c);
}
}
return sb.toString();
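A variant of the replacement idea that also covers unpaired surrogates might look like the sketch below. It is an illustration rather than any poster's code: the class name SurrogateReplacer, the method name and the choice of placeholder are assumptions. Each well-formed pair becomes one placeholder, and a lone high or lone low surrogate also becomes one placeholder instead of being silently dropped.

```java
class SurrogateReplacer {
    // Replaces every surrogate pair, and every unpaired surrogate, with one placeholder char.
    static String replaceSurrogates(String text, char placeholder) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c)) {
                sb.append(placeholder);
                // Skip the low half only if it really follows the high half
                if (i + 1 < text.length() && Character.isLowSurrogate(text.charAt(i + 1))) {
                    i++;
                }
            } else if (Character.isLowSurrogate(c)) {
                sb.append(placeholder);            // unpaired low surrogate
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceSurrogates("a\uD83D\uDE00b", '?')); // prints "a?b"
    }
}
```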

Related

Substring alternative

So I'm creating a program that will output the first character of a string and then the first character of another string. Then the second character of the first string and the second character of the second string, and so on.
I created what is below, I was just wondering if there is an alternative to this using a loop or something rather than substring
public class Whatever
{
public static void main(String[] args)
{
System.out.println (interleave ("abcdefg", "1234"));
}
public static String interleave(String you, String me)
{
if (you.length() == 0) return me;
else if (me.length() == 0) return you;
return you.substring(0,1) + interleave(me, you.substring(1));
}
}
OUTPUT: a1b2c3d4efg

Well, if you really don't want to use substrings, you can use String's toCharArray() method, then you can use a StringBuilder to append the chars. With this you can loop through each of the array's indices.
Doing so, this would be the outcome:
public static String interleave(String you, String me) {
char[] a = you.toCharArray();
char[] b = me.toCharArray();
StringBuilder out = new StringBuilder();
int maxLength = Math.max(a.length, b.length);
for( int i = 0; i < maxLength; i++ ) {
if( i < a.length ) out.append(a[i]);
if( i < b.length ) out.append(b[i]);
}
return out.toString();
}
Your code is efficient enough as it is, though. This can be an alternative, if you really want to avoid substrings.

This is a loop implementation (not handling null value, just to show the logic):
public static String interleave(String you, String me) {
StringBuilder result = new StringBuilder();
for (int i = 0 ; i < Math.max(you.length(), me.length()) ; i++) {
if (i < you.length()) {
result.append(you.charAt(i)); }
if (i < me.length()) {
result.append(me.charAt(i));
}
}
return result.toString();
}

The solution I am proposing is based on the expected output. In your particular case, consider using the split method of String, since you are interleaving one character at a time.
So do something like this,
String[] xs = "abcdefg".split("");
String[] ys = "1234".split("");
Now loop over the larger array and interleave the elements, making sure that you check the length of the smaller array before accessing it.
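The split-based idea above could be completed roughly as follows. This is a sketch under the assumption of Java 8 or later, where split("") yields one element per char with no leading empty string; the class name SplitInterleave is made up for illustration.

```java
class SplitInterleave {
    static String interleave(String you, String me) {
        String[] xs = you.split("");   // one element per char (Java 8+ behaviour assumed)
        String[] ys = me.split("");
        StringBuilder out = new StringBuilder(you.length() + me.length());
        int max = Math.max(xs.length, ys.length);
        for (int i = 0; i < max; i++) {
            // Length checks guard the shorter array
            if (i < xs.length) out.append(xs[i]);
            if (i < ys.length) out.append(ys[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(interleave("abcdefg", "1234")); // prints "a1b2c3d4efg"
    }
}
```

Compared with charAt or toCharArray, this allocates one small String per character, so it is more of a readability choice than a performance one.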

To implement this as a loop, you would have to keep track of the current position and keep adding characters until one string runs out, then tack the rest of the other one on. Any larger strings should use a StringBuilder. Something like this (untested):
int i = 0;
String result = "";
// Stop as soon as either string is exhausted
while (i < you.length() && i < me.length())
{
// Append separately: char + char would be added as numbers
result += you.charAt(i);
result += me.charAt(i);
i++;
}
// Append whatever is left of the longer string
if (i == you.length())
result += me.substring(i);
else
result += you.substring(i);

An improved (in some sense) version of @BenjaminBoutier's answer.
StringBuilder is an efficient way to concatenate strings in a loop.
public static String interleave(String you, String me) {
StringBuilder result = new StringBuilder();
int min = Math.min(you.length(), me.length());
String longest = you.length() > me.length() ? you : me;
int i = 0;
while (i < min) { // mix characters
result.append(you.charAt(i));
result.append(me.charAt(i));
i++;
}
while (i < longest.length()) { // append the remaining characters of the longer string
result.append(longest.charAt(i));
i++;
}
return result.toString();
}

Making code to clean string of unwanted characters

I have already written all the code for it, but I have some issues. Not all the invalid characters are getting removed, and I was unable to spot a pattern. I've been trying for a long time to figure out what is causing this, and I finally decided to ask you guys to see if someone can figure it out.
Here is the char array of valid characters (All other characters will be removed from string):
static char[] validCharsUsername ={'Q','q','W','w','E','e','R','r','T','t','Y','y','U','u','I','i','O','o','P','p','A','a','S','s','D','d','F','f','G','g','H','h','J','j','K','k','L','l','Z','z','X','x','C','c','V','v','B','b','N','n','M','m','1','2','3','4','5','6','7','8','9','0','_','-'};
Here is the code (this.validChars refers to the array):
public String cleanString(String text){
StringBuilder sb = new StringBuilder(text);
for(int i = 0;i < sb.length() - 1;i++){
char character = sb.charAt(i);
int index = 0;
char indexChar = this.validChars[0];
boolean valid = false;
while(index < this.validChars.length - 1){
index++;
indexChar = this.validChars[index];
if(character == indexChar){
valid = true;
index = this.validChars.length;
}
}
if(!valid){
if(character == ' '){
sb.deleteCharAt(i);
sb.insert(i, '_');
}else{
sb.deleteCharAt(i);
}
i = 0;
}
}
return sb.toString();
}

Maybe consider using regular expressions. A regex that matches all letters a-z (in either case) and all digits 0-9 can look like [a-zA-Z0-9]. A regex that matches every character except those can look like [^a-zA-Z0-9], so your code could look like
public String cleanString(String text){
return text.replaceAll("[^a-zA-Z0-9]","");
}
If you also want to keep whitespace or any other characters, add them to this character class and change the return statement to text.replaceAll("[^a-zA-Z0-9\\s]",""); (\\s represents whitespace).
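Applied to the question's own character set, the regex approach might look like the sketch below. It assumes the intent visible in the asker's code: letters, digits, '_' and '-' are valid, spaces become underscores, and everything else is dropped. The class name UsernameCleaner is hypothetical.

```java
import java.util.regex.Pattern;

class UsernameCleaner {
    // Anything that is not a letter, digit, underscore or hyphen.
    // The hyphen sits last in the class so it is taken literally.
    private static final Pattern INVALID = Pattern.compile("[^A-Za-z0-9_-]");

    static String cleanString(String text) {
        String withUnderscores = text.replace(' ', '_');   // spaces become underscores first
        return INVALID.matcher(withUnderscores).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(cleanString("Hello World!")); // prints "Hello_World"
    }
}
```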

Try this code:
public static String cleanString(String text){
StringBuilder sb = new StringBuilder("");
for(int i = 0;i < text.length();i++){
for (int j = 0; j < validCharsUsername.length; j++) {
if (validCharsUsername[j] == text.charAt(i)) {
sb.append(text.charAt(i));
break;
}
}
}
return sb.toString();
}
UPDATE
At first I thought this was C# and wrote C# code, but I have now changed it to Java.

The fastest method of determining if a string is a palindrome

I need an algorithm that checks, with the fastest possible execution time, whether a string is a palindrome (the string can be a sentence with uppercase or lowercase letters, spaces, etc.). All of this in Java. I have a sample:
boolean isPalindrome(String s) {
int n = s.length();
s = s.toLowerCase();
for (int i = 0; i < (n / 2) + 1; ++i) {
if (s.charAt(i) != s.charAt(n - i - 1)) {
return false;
}
}
return true;
}
I converted the string to lowercase using the .toLowerCase() function, but I don't know how much it affects the execution time.
I also don't know how to handle punctuation and spaces between words in an efficient way.

I think you can just compare the string with its reverse, no?
StringBuilder sb = new StringBuilder(str);
return str.equals(sb.reverse().toString());
Or, for versions earlier than JDK 1.5:
StringBuffer sb = new StringBuffer(str);
return str.equals(sb.reverse().toString());

This avoids any copying. The functions isBlank and toLowerCase are rather unspecified in your question, so define them the way you want. Just an example:
boolean isBlank(char c) {
return c == ' ' || c == ',';
}
char toLowerCase(char c) {
return Character.toLowerCase(c);
}
Don't worry about the costs of method calls, that's what the JVM excels at.
for (int i = 0, j = s.length() - 1; i < j; ++i, --j) {
while (isBlank(s.charAt(i))) {
i++;
if (i >= j) return true;
}
while (isBlank(s.charAt(j))) {
j--;
if (i >= j) return true;
}
if (toLowerCase(s.charAt(i)) != toLowerCase(s.charAt(j))) return false;
}
return true;
Try to benchmark this... I'm hoping my solution could be the fastest, but without measuring you never know.

Your solution seems just fine when it comes to effectiveness.
As for your second problem, you can just remove all spaces and dots etc before you start testing:
String stripped = s.toLowerCase().replaceAll("[\\s.,]", "");
int n = stripped.length();
for (int i = 0; i < (n / 2) + 1; ++i) {
if (stripped.charAt(i) != stripped.charAt(n - i - 1)) {
...

Effective is not the same as efficient.
Your answer is effective as long as you account for spaces, special characters and so on. Even accents could be problematic.
As for efficiency, toLowerCase is O(n), and any regexp processing will be O(n) as well. If you are concerned about that, converting and comparing char by char should be the best option.
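A char-by-char comparison that never builds a lowercased or stripped copy could be sketched as follows. The class name PalindromeCheck is hypothetical, and treating only letters and digits as significant is an assumption about what should count; adjust the test if punctuation matters.

```java
class PalindromeCheck {
    // Two-pointer scan: ignores everything except letters and digits, compares case-insensitively.
    static boolean isPalindrome(String s) {
        int i = 0;
        int j = s.length() - 1;
        while (i < j) {
            char left = s.charAt(i);
            char right = s.charAt(j);
            if (!Character.isLetterOrDigit(left)) {
                i++;                                   // skip ignorable char on the left
            } else if (!Character.isLetterOrDigit(right)) {
                j--;                                   // skip ignorable char on the right
            } else {
                if (Character.toLowerCase(left) != Character.toLowerCase(right)) {
                    return false;
                }
                i++;
                j--;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPalindrome("A man, a plan, a canal: Panama")); // prints "true"
        System.out.println(isPalindrome("hello"));                          // prints "false"
    }
}
```

This works on chars, so accented letters are compared as-is and characters outside the BMP are not handled specially.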

Here is my try:
public static boolean isPalindrome(String s)
{
int index1 = 0;
int index2 = s.length() -1;
while (index1 < index2)
{
if(s.charAt(index1) != s.charAt(index2))
{
return false;
}
index1 ++;
index2 --;
}
return true;
}

Here's some insight into my way of detecting a palindrome using Java. Feel free to ask questions :) I hope this helps in some way.
import java.util.Scanner;
public class Palindrome {
public static void main(String[]args){
if(isReverse()){System.out.println("This is a palindrome.");}
else{System.out.print("This is not a palindrome");}
}
public static boolean isReverse(){
Scanner keyboard = new Scanner(System.in);
System.out.print("Please type something: ");
String line = ((keyboard.nextLine()).toLowerCase()).replaceAll("\\W","");
return (line.equals(new StringBuffer(line).reverse().toString()));
}
}

To ignore case:
StringBuilder sb = new StringBuilder(myString);
String newString=sb.reverse().toString();
return myString.equalsIgnoreCase(newString);
For a case-sensitive check, use:
StringBuilder sb = new StringBuilder(myString);
String newString=sb.reverse().toString();
return myString.equals(newString);

Java Stringbuilder.replace

Consider the following inputs:
String[] input = {"a9", "aa9", "a9a9", "99a99a"};
What would be the most efficient way, using a StringBuilder, to replace any letter directly before a nine with the next letter after it in the alphabet?
After processing these inputs the output should be:
String[] output = {"b9", "ab9", "b9b9", "99b99a"};
I've been scratching my head for a while and the StringBuilder.setCharAt was the best method I could think of.
Any advice or suggestions would be appreciated.

Since you have to look at every character, you'll never perform better than linear in the size of the buffer. So you can just do something like
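for (int i = 1; i < buffer.length(); ++i) { // Note this starts at 1
char prev = buffer.charAt(i - 1);
if (buffer.charAt(i) == '9' && Character.isLetter(prev)) // shift letters only; no wrap-around at 'z'
buffer.setCharAt(i - 1, (char) (prev + 1));
}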

You can use the following code:
String[] input = {"a9", "aa9", "a9a9", "99a99a", "z9", "aZ9"};
String[] output = new String[input.length];
Pattern pt = Pattern.compile("([a-z])(?=9)", Pattern.CASE_INSENSITIVE);
for (int i=0; i<input.length; i++) {
Matcher mt = pt.matcher(input[i]);
StringBuffer sb = new StringBuffer();
while (mt.find()) {
char ch = mt.group(1).charAt(0);
if (ch == 'z') ch = 'a';
else if (ch == 'Z') ch = 'A';
else ch++;
mt.appendReplacement(sb, String.valueOf(ch));
}
mt.appendTail(sb);
output[i] = sb.toString();
}
System.out.println(Arrays.toString(output));
OUTPUT:
[b9, ab9, b9b9, 99b99a, a9, aA9]

You want to use a very simple state machine. For each character you're looping through in the input string, keep track of a boolean. If the character is a 9, set the boolean to true. If the character is a letter add one to the letter and set the boolean to false. Then add the character to the output stringbuilder.
For input you use a Reader. For output use a StringBuilder.
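One way to turn the boolean-flag idea into code is sketched below. It departs from the description in two labeled ways: it walks a StringBuilder from the end rather than reading from a Reader, so that the flag already knows whether the character to the right was a 9, and it wraps 'z' to 'a' (an assumption borrowed from the regex answer). The names NineShifter, shiftBeforeNines and nextLetter are made up for illustration.

```java
class NineShifter {
    // Next letter in the alphabet, wrapping z -> a and Z -> A (wrapping is an assumption).
    static char nextLetter(char c) {
        if (c == 'z') return 'a';
        if (c == 'Z') return 'A';
        return (char) (c + 1);
    }

    static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static String shiftBeforeNines(String input) {
        StringBuilder sb = new StringBuilder(input);
        boolean nextIsNine = false;                 // state: was the char to the right a '9'?
        for (int i = sb.length() - 1; i >= 0; i--) {
            char c = sb.charAt(i);
            if (nextIsNine && isAsciiLetter(c)) {
                sb.setCharAt(i, nextLetter(c));
            }
            nextIsNine = (c == '9');                // update state from the original char
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(shiftBeforeNines("99a99a")); // prints "99b99a"
    }
}
```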

Use a one-token lookahead parser technique. Here is some pseudo-ish code:
for (int index = 0; index < buffer.length(); ++index)
{
if (index < buffer.length() - 1)
{
if (buffer.charAt(index + 1) == '9')
{
char current = (char) (buffer.charAt(index) + 1); // this is probably not the best technique for this.
buffer.setCharAt(index, current);
}
}
}

Another solution is to use, for example,
StringUtils.indexOf(String str, char searchChar, int startPos)
in the way Ernest Friedman-Hill pointed out. Take this as an experimental example, not the most performant one.

What is an efficient way to replace many characters in a string?

String handling in Java is something I'm trying to learn to do well. Currently I want to take in a string and replace any characters I find.
Here is my current inefficient (and kinda silly IMO) function. It was written to just work.
public String convertWord(String word)
{
return word.toLowerCase().replace('á', 'a')
.replace('é', 'e')
.replace('í', 'i')
.replace('ú', 'u')
.replace('ý', 'y')
.replace('ð', 'd')
.replace('ó', 'o')
.replace('ö', 'o')
.replaceAll("[-]", "")
.replaceAll("[.]", "")
.replaceAll("[/]", "")
.replaceAll("[æ]", "ae")
.replaceAll("[þ]", "th");
}
I ran 1.000.000 runs of it and it took 8182ms. So how should I proceed in changing this function to make it more efficient?
Solution found:
Converting the function to this
public String convertWord(String word)
{
StringBuilder sb = new StringBuilder();
char[] charArr = word.toLowerCase().toCharArray();
for(int i = 0; i < charArr.length; i++)
{
// Single character case
if(charArr[i] == 'á')
{
sb.append('a');
}
// Char to two characters
else if(charArr[i] == 'þ')
{
sb.append("th");
}
// Remove
else if(charArr[i] == '-')
{
}
// Base case
else
{
sb.append(charArr[i]); // use the lowercased char, not the original one
}
}
return sb.toString();
}
Running this function 1.000.000 times takes 518ms. So I think that is efficient enough. Thanks for the help guys :)

You could create a String[] table that is Character.MAX_VALUE + 1 in length (with the mapping to lower case built in).
As the replacements got more complex, the time to perform them would remain the same.
private static final String[] REPLACEMENT = new String[Character.MAX_VALUE+1];
static {
for(int i=Character.MIN_VALUE;i<=Character.MAX_VALUE;i++)
REPLACEMENT[i] = Character.toString(Character.toLowerCase((char) i));
// substitute
REPLACEMENT['á'] = "a";
// remove
REPLACEMENT['-'] = "";
// expand
REPLACEMENT['æ'] = "ae";
}
public String convertWord(String word) {
StringBuilder sb = new StringBuilder(word.length());
for(int i=0;i<word.length();i++)
sb.append(REPLACEMENT[word.charAt(i)]);
return sb.toString();
}

My suggestion would be:
Convert the String to a char[] array
Run through the array, testing each character one by one (e.g. with a switch statement) and replacing it if needed
Convert the char[] array back to a String
I think this is probably the fastest performance you will get in pure Java.
EDIT: I notice you are doing some changes that change the length of the string. In this case, the same principle applies, however you need to keep two arrays and increment both a source index and a destination index separately. You might also need to resize the destination array if you run out of target space (i.e. reallocate a larger array and arraycopy the existing destination array into it)
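The two-index variant described in the EDIT could look roughly like this. It is a sketch with only a few of the question's mappings filled in, and it sidesteps resizing by allocating the destination at twice the source length, which is enough here because no replacement is longer than two chars. The class name TwoIndexReplace and the use of Locale.ROOT are assumptions.

```java
import java.util.Locale;

class TwoIndexReplace {
    static String convertWord(String word) {
        char[] src = word.toLowerCase(Locale.ROOT).toCharArray();
        char[] dst = new char[src.length * 2];   // worst case: every char expands to two
        int d = 0;                               // destination index, advanced independently
        for (int s = 0; s < src.length; s++) {   // source index
            char c = src[s];
            switch (c) {
                case '\u00e1': dst[d++] = 'a'; break;                   // a with acute
                case '\u00e9': dst[d++] = 'e'; break;                   // e with acute
                case '\u00e6': dst[d++] = 'a'; dst[d++] = 'e'; break;   // ae ligature
                case '\u00fe': dst[d++] = 't'; dst[d++] = 'h'; break;   // thorn
                case '-': case '.': case '/': break;                    // removed
                default: dst[d++] = c;
            }
        }
        return new String(dst, 0, d);            // copy only the filled part
    }

    public static void main(String[] args) {
        System.out.println(convertWord("\u00deor")); // prints "thor"
    }
}
```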

My implementation is based on a lookup table.
public static String convertWord(String str) {
char[] words = str.toCharArray();
char[] find = {'á','é','ú','ý','ð','ó','ö','æ','þ','-','.',
'/'};
String[] replace = {"a","e","u","y","d","o","o","ae","th"};
StringBuilder out = new StringBuilder(str.length());
for (int i = 0; i < words.length; i++) {
boolean matchFailed = true;
for(int w = 0; w < find.length; w++) {
if(words[i] == find[w]) {
if(w < replace.length) {
out.append(replace[w]);
}
matchFailed = false;
break;
}
}
if(matchFailed) out.append(words[i]);
}
return out.toString();
}

My first choice would be to use a StringBuilder, because you need to remove some chars from the string.
My second choice would be to iterate through the array of chars and add each processed char to another array with the initial size of the string. Then you would need to copy the array to trim any unused positions.
After that, I would run some performance tests to see which one is better.

I doubt that you can really speed up the character replacement at all. As for the regular expression replacements, you may compile the regexes beforehand.
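Compiling the patterns once might look like the sketch below. String.replaceAll compiles its regex on every call, whereas a static Pattern is compiled a single time and reused. The class and method names are hypothetical, and only the regex steps of the question's function are shown; whether this is measurably faster is something to benchmark.

```java
import java.util.regex.Pattern;

class PrecompiledPatterns {
    // Compiled once, reused for every call
    private static final Pattern REMOVE = Pattern.compile("[-./]");
    private static final Pattern AE = Pattern.compile("\u00e6");     // ae ligature
    private static final Pattern THORN = Pattern.compile("\u00fe");  // thorn

    static String applyRegexSteps(String word) {
        String result = REMOVE.matcher(word).replaceAll("");
        result = AE.matcher(result).replaceAll("ae");
        return THORN.matcher(result).replaceAll("th");
    }

    public static void main(String[] args) {
        System.out.println(applyRegexSteps("a-b.c/d")); // prints "abcd"
    }
}
```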

Use the function String.replaceAll.
Nice article similar with what you want: link

Any time we have problems like this, we use regular expressions, as they are by far the fastest way to deal with what you are trying to do.
Have you already tried regular expressions?

What I see being inefficient is that you are going to check characters again that have already been replaced, which is useless.
I would get the charArray of the String instance, iterate over it, and for each character spam a series of if-else like this:
char[] array = word.toCharArray();
for(int i=0; i<array.length; ++i){
char currentChar = array[i];
if(currentChar == 'é')
array[i] = 'e';
else if(currentChar == 'ö')
array[i] = 'o';
else if(//...
}

I just implemented this utility class that replaces a char or a group of chars of a String. It is equivalent to bash tr and perl tr///, aka, transliterate. I hope it helps someone!
package your.package.name;
/**
* Utility class that replaces chars of a String, aka, transliterate.
*
* It's equivalent to bash 'tr' and perl 'tr///'.
*
*/
public class ReplaceChars {
public static String replace(String string, String from, String to) {
return new String(replace(string.toCharArray(), from.toCharArray(), to.toCharArray()));
}
public static char[] replace(char[] chars, char[] from, char[] to) {
char[] output = chars.clone();
for (int i = 0; i < output.length; i++) {
for (int j = 0; j < from.length; j++) {
if (output[i] == from[j]) {
output[i] = to[j];
break;
}
}
}
return output;
}
/**
* For tests!
*/
public static void main(String[] args) {
// Example from: https://en.wikipedia.org/wiki/Caesar_cipher
String string = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG";
String from = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
String to = "XYZABCDEFGHIJKLMNOPQRSTUVW";
System.out.println();
System.out.println("Cesar cypher: " + string);
System.out.println("Result: " + ReplaceChars.replace(string, from, to));
}
}
This is the output:
Cesar cypher: THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG
Result: QEB NRFZH YOLTK CLU GRJMP LSBO QEB IXWV ALD
