Making code to clean string of unwanted characters - java

I have already written all the code for it, but I have some issues. Not all of the invalid characters are getting removed, and I was unable to pick up a pattern. I've been trying for a long time to figure out what is causing this, and I finally decided to ask here to see if someone can figure it out.
Here is the char array of valid characters (All other characters will be removed from string):
static char[] validCharsUsername ={'Q','q','W','w','E','e','R','r','T','t','Y','y','U','u','I','i','O','o','P','p','A','a','S','s','D','d','F','f','G','g','H','h','J','j','K','k','L','l','Z','z','X','x','C','c','V','v','B','b','N','n','M','m','1','2','3','4','5','6','7','8','9','0','_','-'};
Here is the code (this.validChars refers to the array above):
public String cleanString(String text){
StringBuilder sb = new StringBuilder(text);
for(int i = 0;i < sb.length() - 1;i++){
char character = sb.charAt(i);
int index = 0;
char indexChar = this.validChars[0];
boolean valid = false;
while(index < this.validChars.length - 1){
index++;
indexChar = this.validChars[index];
if(character == indexChar){
valid = true;
index = this.validChars.length;
}
}
if(!valid){
if(character == ' '){
sb.deleteCharAt(i);
sb.insert(i, '_');
}else{
sb.deleteCharAt(i);
}
i = 0;
}
}
return sb.toString();
}

Maybe consider using regular expressions. A regex that matches all letters in the ranges a-z and A-Z and all digits 0-9 can look like [a-zA-Z0-9]. A regex that matches every character except those can look like [^a-zA-Z0-9], so your code could look like
public String cleanString(String text){
return text.replaceAll("[^a-zA-Z0-9]","");
}
In case you also want to let spaces or other characters stay, you can add them to this character class, for example by changing the return statement to text.replaceAll("[^a-zA-Z0-9\\s]",""); (\\s represents whitespace).
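As a sketch of how the character class could be matched to the validCharsUsername array from the question (letters, digits, '_' and '-'), with spaces turned into underscores as the original loop intended. The class name UsernameCleaner is made up for illustration.

```java
final class UsernameCleaner {
    // Turn each space into '_' first, then drop every character that is
    // not a letter, a digit, '_' or '-'. A '-' at the end of a character
    // class is taken literally.
    static String cleanString(String text) {
        return text.replace(' ', '_').replaceAll("[^a-zA-Z0-9_-]", "");
    }
}
```

For example, UsernameCleaner.cleanString("John Doe!#42-x") gives John_Doe42-x.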

Try this code:
public static String cleanString(String text){
StringBuilder sb = new StringBuilder("");
for(int i = 0;i < text.length();i++){
for (int j = 0; j < validCharsUsername.length; j++) {
if (validCharsUsername[j] == text.charAt(i)) {
sb.append(text.charAt(i));
break;
}
}
}
return sb.toString();
}
UPDATE
At first I thought this was C# and wrote C# code, but I have now changed it to Java.
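For reference, the loop in the question has three index pitfalls: the outer loop stops at sb.length() - 1 and so never visits the last character, the inner while loop increments index before comparing and so never tests validChars[0], and resetting i to 0 after a deletion means the loop increment moves straight to index 1, skipping whatever is now at index 0. Below is a minimal sketch of an in-place version that avoids these, assuming the same whitelist. The names CleanStringSketch and VALID are made up for illustration.

```java
final class CleanStringSketch {
    // Same whitelist as validCharsUsername in the question, kept as a String
    // so that indexOf can do the membership test.
    private static final String VALID =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-";

    static String cleanString(String text) {
        StringBuilder sb = new StringBuilder(text);
        int i = 0;
        while (i < sb.length()) {          // visits the last character too
            char c = sb.charAt(i);
            if (VALID.indexOf(c) >= 0) {
                i++;                       // keep the character
            } else if (c == ' ') {
                sb.setCharAt(i, '_');      // space becomes underscore
                i++;
            } else {
                sb.deleteCharAt(i);        // do not advance: the next char slid into i
            }
        }
        return sb.toString();
    }
}
```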

Related

Pig it method that I am trying to make trouble checking punctuation at the end java

I am trying to answer this question.
Move the first letter of each word to the end of it, then add "ay" to the end of the word. Leave punctuation marks untouched.
This is what I did so far:
public static String pigIt(String str) {
//Populating the String argument into the String Array after splitting them by spaces
String[] strArray = str.split(" ");
System.out.println("\nPrinting strArray: " + Arrays.toString(strArray));
String toReturn = "";
for (int i = 0; i < strArray.length; i++) {
String word = strArray[i];
for (int j = 1; j < word.length(); j++) {
toReturn += Character.toString(word.charAt(j));
}
//Outside of inner for loop
if (!(word.contains("',.!?:;")) && (i != strArray.length - 1)) {
toReturn += Character.toString(word.charAt(0)) + "ay" + " ";
} else if (word.contains("',.!?:;")) {
toReturn += Character.toString(word.charAt(0)) + "ay" + " " + strArray[strArray.length - 1];
}
}
return toReturn;
}
It is supposed to return the punctuation mark without adding "ay" + "". I think I am overthinking this, but please help.

One of the problems here is that your else if branch is never executed. The .contains method does not work with multiple characters like that unless you are trying to match them all in sequence. Your conditions essentially ask whether the word contains the entire string "',.!?:;". If you keep only the exclamation point in there, the branch will be taken for words containing "!". With contains you would need one condition per character, like word.contains("!") || word.contains(",") || word.contains("'"), etc. You can also use regex for this problem.
Alternatively, you can use something like,
char ch = yourString.charAt(i);
if (!Character.isAlphabetic(ch)) {
to determine if a character is not an alphabetical one, and is a symbol or punctuation.
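A minimal sketch of the "any of these characters" check discussed above, using indexOf on the punctuation string rather than contains. The class and method names are made up for illustration.

```java
final class PunctuationCheck {
    // The same punctuation set the question passes to contains().
    private static final String PUNCTUATION = "',.!?:;";

    // True if the word contains at least one character from PUNCTUATION.
    static boolean containsPunctuation(String word) {
        for (int i = 0; i < word.length(); i++) {
            if (PUNCTUATION.indexOf(word.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }
}
```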

I think the best way is not to rely on str.split("\\s++"), because you could have punctuation in any place. It is better to scan the string and find every character that is not a letter or digit. From those you can determine the word borders and translate each word.
public static String pigIt(String str) {
StringBuilder buf = new StringBuilder();
for (int i = 0, j = 0; j <= str.length(); j++) {
char ch = j < str.length() ? str.charAt(j) : '\0';
if (Character.isLetterOrDigit(ch))
continue;
if (i < j) {
buf.append(str.substring(i + 1, j));
buf.append(str.charAt(i));
buf.append("ay");
}
if (ch != '\0')
buf.append(ch);
i = j + 1;
}
return buf.toString();
}
Output:
System.out.println(pigIt(",Hello, !World")); // ,elloHay, !orldWay

Regex may be difficult to start with but is very powerful:
public static String pigIt(String str) {
return str.replaceAll("([a-zA-Z])([a-zA-Z]*)", "$2$1ay");
}
The () specify groups. So I have one group with the first alphabet character and a second group with the remaining alphabet characters.
In the replace parameter you can refer to these groups ($1, $2).
String.replaceAll will search all matching string parts and apply the replacement. Non matching characters like the punctuations are left untouched.
public static void main(String[] args) {
System.out.println("Hello, World, ! -->"+ pigIt("Hello, World, !"));
System.out.println("Hello?, Wo$, F, ! -->"+ pigIt("Hello?, Wo$, F, !"));
}
The output of this method is:
Hello, World, ! -->elloHay, orldWay, !
Hello?, Wo$, F, ! -->elloHay?, oWay$, Fay, !

How to check if a String can be formed from the characters of another String in Java?

A string is good if it can be formed by characters from chars. I want to return the sum of lengths of all good strings in words.
Input: words = ["cat","bt","hat","tree"], chars = "atach"
Output: 6
Explanation:
The strings that can be formed are "cat" and "hat" so the answer is 3 + 3 = 6.
Below is the code that I have written.
class Solution
{
public int countCharacters(String[] words, String chars)
{
int k = 0, count = 0;
Set<Character> set = new HashSet<>();
for(int i = 0; i < chars.length(); i++)
{
set.add(chars.charAt(i));
}
StringBuilder chrs = new StringBuilder();
for(Character ch : set)
{
chrs.append(ch);
}
for(int i = 0; i < words.length; i++)
{
char[] ch = words[i].toCharArray();
for(int j = 0; j < ch.length; j++)
{
if(chrs.contains("" + ch[j]))
{
k++;
}
}
if(k == words[i].length())
{
count+= k;
}
}
return count;
}
}
Output:
Line 24: error: cannot find symbol
if(chrs.contains("" + ch[j]))
Can someone help me? What am I doing wrong in accessing the character?

The issue I noticed is that you are calling contains() on chrs, which is a StringBuilder, and StringBuilder has no contains() method. contains() is a String method that checks whether the string contains a given substring.
So you can solve this by calling it on the String chars and converting the character to a string.
Ex 1:
if(chars.contains(Character.toString(ch[j]))){
k++;
} else {
}
Ex 2:
if(chars.contains(""+ch[j]))
{
k++;
} else {
}
Otherwise, you can check whether the string contains a char by using indexOf(). If the string does not contain the char, it returns -1. Please refer to the example below.
Ex:
if(chars.indexOf(ch[j])!=-1){
k++;
} else {
}

contains tells you if a string is contained in another string. But in your case ch[j] is not a string but a char, so you can't use contains.
Instead, use indexOf, it returns -1 if the char is not present in the string.

The simplest way is
chars.contains("" + ch[j]);

Here, ch[j] is not a string but a char, so you can't use contains as you've done. Instead, make the following change.
chars.contains(String.valueOf(ch[j]));
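Note that fixing the compile error alone does not give the expected result: k in the question is never reset between words, and a plain contains check ignores how many times each character is available. Below is a minimal sketch that counts characters instead, assuming each character in chars may be used at most once per word (the usual reading of this exercise, but an assumption, since the question does not state it). The class name GoodStrings is made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class GoodStrings {
    static int countCharacters(String[] words, String chars) {
        // How many times each character is available in chars.
        Map<Character, Integer> available = new HashMap<>();
        for (char c : chars.toCharArray()) {
            available.merge(c, 1, Integer::sum);
        }

        int total = 0;
        for (String word : words) {
            // Fresh copy of the counts for every word.
            Map<Character, Integer> remaining = new HashMap<>(available);
            boolean good = true;
            for (char c : word.toCharArray()) {
                int left = remaining.getOrDefault(c, 0);
                if (left == 0) {           // character missing or used up
                    good = false;
                    break;
                }
                remaining.put(c, left - 1);
            }
            if (good) {
                total += word.length();
            }
        }
        return total;
    }
}
```

With words = ["cat","bt","hat","tree"] and chars = "atach" this returns 6, matching the example in the question.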

Filter bad words | java 'replace'

In an attempt to filter the bad words, I found the 'replace' function in java is not as handy as intended.
Please find the code below.
E.g., consider the word 'abcde', which I want to filter to 'a***e'.
String test = "abcde";
for (int i = 1; i < test.length() - 1; i++) {
test= test.replace(test.charAt(i), '*');
}
System.out.print(test);
Output : a***e
But if the String is String test = "bbcde";, the output is ****e. It seems that if the word has repeated letters (as here), the replace function replaces the repeated letters too.
Why is that? I want to mask the word except for its first and last letters.

That is because String.replace(char, char) replaces all occurrences of the first character (according to its Javadoc).
What you want is probably more like this:
char[] word = test.toCharArray();
for (int i = 1; i < word.length - 1; i++) { // make sure to start at second char, and end at one-but-last char
word[i] = '*';
}
System.out.println(String.copyValueOf(word));

Since String.replace(char, char) replaces all occurrences of the specified char, this would be a better approach for your requirement (sdf is assumed to hold the bad word to be masked; it is not declared in this snippet):
String test = "abcde";
String replacement = "";
for (int i = 0; i < sdf.length(); i++) {
replacement += "*";
}
test= test.replace(sdf, replacement );
System.out.print(test);

It seems, if the word has repetitive letters(as in here), the replace function replaces the repetitive letters too. Why is it so?
Why? Because that's just how it works, exactly as the API documentation of String.replace(char oldChar, char newChar) says:
Returns a new string resulting from replacing all occurrences of oldChar in this string with newChar.
If you just want to replace the content of the string by the first letter, some asterisks and the last letter, then you don't need to use replace at all.
String test = "abcde";
if (test.length() >= 2) { // a one-letter word would otherwise be doubled
StringBuilder result = new StringBuilder();
result.append(test.charAt(0));
for (int i = 0; i < test.length() - 2; ++i) {
result.append('*');
}
result.append(test.charAt(test.length() - 1));
test = result.toString();
}
System.out.println(test);

public static void main(String[] args) {
String test = "bbcde";
String output = String.valueOf(test.charAt(0));
for (int i = 1; i < test.length() - 1; i++) {
output = output + "*";
}
output = output + String.valueOf(test.charAt(test.length() - 1));
System.out.print(output);
}

You should use the replaceAll function.
With it you can replace every occurrence of a given substring (e.g. "abcde") in a string with another string (e.g. "a***e"). Note that replaceAll treats its first argument as a regular expression.
String test = "abcde";
String replacement = "";
for (int i = 0; i < test.length(); i++) {
if (i==0 || i==test.length()-1){
replacement += test.charAt(i);
} else {
replacement += "*";
}
}
// sdf is assumed to hold the text being filtered (it is not declared in this snippet)
sdf = sdf.replaceAll(test, replacement);
System.out.print(sdf);

How to remove surrogate characters in Java?

I am facing a situation where I get surrogate characters in text that I am saving to MySQL 5.1. As UTF-16 is not supported there, I want to remove these surrogate pairs manually with a Java method before saving the text to the database.
I have written the following method for now and I am curious to know if there is a direct and optimal way to handle this.
Thanks in advance for your help.
public static String removeSurrogates(String query) {
StringBuffer sb = new StringBuffer();
for (int i = 0; i < query.length() - 1; i++) {
char firstChar = query.charAt(i);
char nextChar = query.charAt(i+1);
if (Character.isSurrogatePair(firstChar, nextChar) == false) {
sb.append(firstChar);
} else {
i++;
}
}
if (Character.isHighSurrogate(query.charAt(query.length() - 1)) == false
&& Character.isLowSurrogate(query.charAt(query.length() - 1)) == false) {
sb.append(query.charAt(query.length() - 1));
}
return sb.toString();
}

Here are a couple of things:
Character.isSurrogate(char c):
A char value is a surrogate code unit if and only if it is either a low-surrogate code unit or a high-surrogate code unit.
Checking for pairs seems pointless, why not just remove all surrogates?
x == false is equivalent to !x
StringBuilder is better in cases where you don't need synchronization (like a variable that never leaves local scope).
I suggest this:
public static String removeSurrogates(String query) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < query.length(); i++) {
char c = query.charAt(i);
// !isSurrogate(c) in Java 7
if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
sb.append(c);
}
}
return sb.toString();
}
Breaking down the if statement
You asked about this statement:
if (!(Character.isHighSurrogate(c) || Character.isLowSurrogate(c))) {
sb.append(c);
}
One way to understand it is to break each operation into its own function, so you can see that the combination does what you'd expect:
static boolean isSurrogate(char c) {
return Character.isHighSurrogate(c) || Character.isLowSurrogate(c);
}
static boolean isNotSurrogate(char c) {
return !isSurrogate(c);
}
...
if (isNotSurrogate(c)) {
sb.append(c);
}

Java strings are stored as sequences of 16-bit chars, but what they represent is sequences of unicode characters. In unicode terminology, they are stored as code units, but model code points. Thus, it's somewhat meaningless to talk about removing surrogates, which don't exist in the character / code point representation (unless you have rogue single surrogates, in which case you have other problems).
Rather, what you want to do is to remove any characters which will require surrogates when encoded. That means any character which lies beyond the basic multilingual plane. You can do that with a simple regular expression:
return query.replaceAll("[^\u0000-\uffff]", "");
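The same idea can be sketched without a regex by walking the string code point by code point and keeping only those in the Basic Multilingual Plane. This assumes Java 8 or later for String.codePoints(); the class and method names are made up for illustration.

```java
final class BmpOnly {
    // Keep only code points that fit in a single char (U+0000..U+FFFF).
    // Characters that need a surrogate pair are dropped as a whole.
    static String removeSupplementary(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
         .filter(Character::isBmpCodePoint)
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }
}
```

Note that an unpaired surrogate is itself a value in the BMP range, so this sketch leaves such rogue code units in place. If they can occur in the input, a Character.isSurrogate check would be needed as well.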

Why not simply
for (int i = 0; i < query.length(); i++) {
char c = query.charAt(i);
if (!Character.isHighSurrogate(c) && !Character.isLowSurrogate(c))
sb.append(c);
}
You should probably replace them with "?" instead of erasing them outright.

Just curious. If char is high surrogate is there a need to check the next one? It is supposed to be low surrogate. The modified version would be:
public static String removeSurrogates(String query) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < query.length(); i++) {
char ch = query.charAt(i);
if (Character.isHighSurrogate(ch))
i++; // skip the next char, as it's supposed to be the low surrogate
else
sb.append(ch);
}
return sb.toString();
}

If you want to remove them, all of these solutions are useful.
But if you want to replace them, the code below is better:
StringBuffer sb = new StringBuffer();
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if(Character.isHighSurrogate(c)){
sb.append('*');
}else if(!Character.isLowSurrogate(c)){
sb.append(c);
}
}
return sb.toString();

What is an efficient way to replace many characters in a string?

String handling in Java is something I'm trying to learn to do well. Currently I want to take in a string and replace any characters I find.
Here is my current inefficient (and kinda silly IMO) function. It was written to just work.
public String convertWord(String word)
{
return word.toLowerCase().replace('á', 'a')
.replace('é', 'e')
.replace('í', 'i')
.replace('ú', 'u')
.replace('ý', 'y')
.replace('ð', 'd')
.replace('ó', 'o')
.replace('ö', 'o')
.replaceAll("[-]", "")
.replaceAll("[.]", "")
.replaceAll("[/]", "")
.replaceAll("[æ]", "ae")
.replaceAll("[þ]", "th");
}
I ran 1.000.000 runs of it and it took 8182ms. So how should I proceed in changing this function to make it more efficient?
Solution found:
Converting the function to this
public String convertWord(String word)
{
StringBuilder sb = new StringBuilder();
char[] charArr = word.toLowerCase().toCharArray();
for(int i = 0; i < charArr.length; i++)
{
// Single character case
if(charArr[i] == 'á')
{
sb.append('a');
}
// Char to two characters
else if(charArr[i] == 'þ')
{
sb.append("th");
}
// Remove
else if(charArr[i] == '-')
{
}
// Base case
else
{
sb.append(charArr[i]);
}
}
return sb.toString();
}
Running this function 1.000.000 times takes 518ms. So I think that is efficient enough. Thanks for the help guys :)

You could create a table of String[] which is Character.MAX_VALUE + 1 in length (including the mapping to lower case).
As the replacements got more complex, the time to perform them would remain the same.
private static final String[] REPLACEMENT = new String[Character.MAX_VALUE+1];
static {
for(int i=Character.MIN_VALUE;i<=Character.MAX_VALUE;i++)
REPLACEMENT[i] = Character.toString(Character.toLowerCase((char) i));
// substitute
REPLACEMENT['á'] = "a";
// remove
REPLACEMENT['-'] = "";
// expand
REPLACEMENT['æ'] = "ae";
}
public String convertWord(String word) {
StringBuilder sb = new StringBuilder(word.length());
for(int i=0;i<word.length();i++)
sb.append(REPLACEMENT[word.charAt(i)]);
return sb.toString();
}

My suggestion would be:
Convert the String to a char[] array
Run through the array, testing each character one by one (e.g. with a switch statement) and replacing it if needed
Convert the char[] array back to a String
I think this is probably the fastest performance you will get in pure Java.
EDIT: I notice you are doing some changes that change the length of the string. In this case, the same principle applies, however you need to keep two arrays and increment both a source index and a destination index separately. You might also need to resize the destination array if you run out of target space (i.e. reallocate a larger array and arraycopy the existing destination array into it)
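The two-index approach described above can be sketched as follows, using the replacements from the question. Since no replacement is longer than two characters, the destination array is allocated at twice the source length up front, which sidesteps the resizing step (a simplification, not part of the answer). Non-ASCII characters are written as Unicode escapes so the source does not depend on file encoding. The class name ConvertWordSketch is made up for illustration.

```java
import java.util.Locale;

final class ConvertWordSketch {
    static String convertWord(String word) {
        char[] src = word.toLowerCase(Locale.ROOT).toCharArray();
        char[] dst = new char[src.length * 2]; // worst case: every char expands to two
        int d = 0;                             // destination index
        for (int s = 0; s < src.length; s++) { // source index
            switch (src[s]) {
                case '\u00e1': dst[d++] = 'a'; break; // a with acute
                case '\u00e9': dst[d++] = 'e'; break; // e with acute
                case '\u00ed': dst[d++] = 'i'; break; // i with acute
                case '\u00fa': dst[d++] = 'u'; break; // u with acute
                case '\u00fd': dst[d++] = 'y'; break; // y with acute
                case '\u00f0': dst[d++] = 'd'; break; // eth
                case '\u00f3':                        // o with acute
                case '\u00f6': dst[d++] = 'o'; break; // o with diaeresis
                case '\u00e6': dst[d++] = 'a'; dst[d++] = 'e'; break; // ae ligature
                case '\u00fe': dst[d++] = 't'; dst[d++] = 'h'; break; // thorn
                case '-':
                case '.':
                case '/': break;                      // removed
                default: dst[d++] = src[s];
            }
        }
        return new String(dst, 0, d);          // trim the unused tail
    }
}
```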

My implementation is based on look up table.
public static String convertWord(String str) {
char[] words = str.toCharArray();
char[] find = {'á','é','ú','ý','ð','ó','ö','æ','þ','-','.','/'};
String[] replace = {"a","e","u","y","d","o","o","ae","th"};
StringBuilder out = new StringBuilder(str.length());
for (int i = 0; i < words.length; i++) {
boolean matchFailed = true;
for(int w = 0; w < find.length; w++) {
if(words[i] == find[w]) {
if(w < replace.length) {
out.append(replace[w]);
}
matchFailed = false;
break;
}
}
if(matchFailed) out.append(words[i]);
}
return out.toString();
}

My first choice would be to use a StringBuilder, because you need to remove some chars from the string.
My second choice would be to iterate through the array of chars and add each treated char to another array with the initial size of the string. Then you would need to copy the array to trim the possibly unused positions.
After that, I would run some performance tests to see which one is better.

I doubt that you can really speed up the character replacement at all. As for the regular-expression replacements, you can compile the regexes beforehand.
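A minimal sketch of compiling a regex beforehand, here merging the three removal patterns from the question ([-], [.] and [/]) into one character class. Whether this is measurably faster was not tested here. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

final class PrecompiledRemoval {
    // Compiled once when the class is loaded, instead of on every
    // String.replaceAll call. Inside a character class, a leading '-'
    // and '.' are both taken literally.
    private static final Pattern REMOVE = Pattern.compile("[-./]");

    static String strip(String word) {
        return REMOVE.matcher(word).replaceAll("");
    }
}
```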

Use the function String.replaceAll.

Any time we have problems like this we use regular expressions, as they are by far the fastest way to deal with what you are trying to do.
Have you already tried regular expressions?

What I see as inefficient is that you check characters again that have already been replaced, which is useless.
I would get the char array of the String instance, iterate over it, and for each character run a series of if-else checks like this:
char[] array = word.toCharArray();
for(int i=0; i<array.length; ++i){
char currentChar = array[i];
// char is a primitive, so compare with == rather than equals()
if(currentChar == 'é')
array[i] = 'e';
else if(currentChar == 'ö')
array[i] = 'o';
// ... remaining single-character replacements
}

I just implemented this utility class that replaces a char or a group of chars of a String. It is equivalent to bash tr and perl tr///, aka, transliterate. I hope it helps someone!
package your.package.name;
/**
* Utility class that replaces chars of a String, aka, transliterate.
*
* It's equivalent to bash 'tr' and perl 'tr///'.
*
*/
public class ReplaceChars {
public static String replace(String string, String from, String to) {
return new String(replace(string.toCharArray(), from.toCharArray(), to.toCharArray()));
}
public static char[] replace(char[] chars, char[] from, char[] to) {
char[] output = chars.clone();
for (int i = 0; i < output.length; i++) {
for (int j = 0; j < from.length; j++) {
if (output[i] == from[j]) {
output[i] = to[j];
break;
}
}
}
return output;
}
/**
* For tests!
*/
public static void main(String[] args) {
// Example from: https://en.wikipedia.org/wiki/Caesar_cipher
String string = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG";
String from = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
String to = "XYZABCDEFGHIJKLMNOPQRSTUVW";
System.out.println();
System.out.println("Cesar cypher: " + string);
System.out.println("Result: " + ReplaceChars.replace(string, from, to));
}
}
This is the output:
Cesar cypher: THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG
Result: QEB NRFZH YOLTK CLU GRJMP LSBO QEB IXWV ALD
