I'd like to normalize any extended ASCII characters, but exclude umlauts.
If I wanted to include umlauts, I would use:
Normalizer.normalize(value, Normalizer.Form.NFKD)
.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
But how can I exclude German umlauts?
As a result I would like to get:
source: üöäâÇæôøñÁ
desired result: üöäaCaeoonA or similar
From here I see two solutions; the first is quite dirty, and the second is, I guess, quite tedious to implement.
1. Remove the characters with umlauts from the string you want to normalize, then put them back after normalization.
2. Don't use the pre-built pattern \p{InCombiningDiacriticalMarks}. Instead, build your own pattern that excludes the umlaut.
Take a look at :
Regex: what is InCombiningDiacriticalMarks?
Unicode blocks
Combining Diacritical Marks and Combining Diacritical Marks for Symbols
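The second option could be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are invented for this example, the input is decomposed with NFD rather than NFKD, and a diaeresis (U+0308) is kept only when it directly follows a, o or u. Letters that have no canonical decomposition, such as æ and ø, pass through unchanged and would still need a lookup table.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
class UmlautKeepingNormalizer {

    // Matches a combining diaeresis that does NOT directly follow a/o/u,
    // or any other character from the Combining Diacritical Marks block.
    private static final Pattern MARKS_EXCEPT_UMLAUT = Pattern.compile(
            "(?<![aouAOU])\\u0308|[\\p{InCombiningDiacriticalMarks}&&[^\\u0308]]");

    static String stripMarksKeepUmlauts(String value) {
        // Split letters into base letter + combining marks.
        String decomposed = Normalizer.normalize(value, Normalizer.Form.NFD);
        // Drop every mark except the diaeresis on a, o and u.
        String stripped = MARKS_EXCEPT_UMLAUT.matcher(decomposed).replaceAll("");
        // Recompose the surviving base letter + diaeresis pairs.
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // The sample string from the question, written with Unicode escapes.
        String source = "\u00fc\u00f6\u00e4\u00e2\u00c7\u00e6\u00f4\u00f8\u00f1\u00c1";
        System.out.println(stripMarksKeepUmlauts(source));
    }
}
```

For the sample string from the question this yields üöäaCæoønA: the umlauts survive and the other accents are stripped, but æ and ø are left alone because they do not decompose.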
import java.text.Normalizer;
import java.util.HashMap;

// Latin to ASCII - mostly
private static final String TAB_00C0 = "" +
"AAAAÄAACEEEEIIII" +
"DNOOOOÖ×OUUUÜYTß" +
"aaaaäaaceeeeiiii" +
"dnooooö÷ouuuüyty" +
"AaAaAaCcCcCcCcDd" +
"DdEeEeEeEeEeGgGg" +
"GgGgHhHhIiIiIiIi" +
"IiJjJjKkkLlLlLlL" +
"lLlNnNnNnnNnOoOo" +
"OoOoRrRrRrSsSsSs" +
"SsTtTtTtUuUuUuUu" +
"UuUuWwYyYZzZzZzs";
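private static final HashMap<Character, String> LIGATURES = new HashMap<Character, String>() {{
// keys must be char literals, because the map is keyed by Character
put('æ', "ae");
put('œ', "oe");
put('þ', "th");
put('\u0133', "ij"); // the single-character ij ligature (U+0133)
put('ð', "dh");
put('Æ', "AE");
put('Œ', "OE");
put('Þ', "TH");
put('Ð', "DH");
put('\u0132', "IJ"); // the single-character IJ ligature (U+0132)
//TODO
}};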
public static String removeAllButUmlauts(String value) {
value = Normalizer.normalize(value, Normalizer.Form.NFC);
StringBuilder sb = new StringBuilder();
for (int i = 0; i < value.length(); i++) {
char c = value.charAt(i);
String l = LIGATURES.get(c);
if (l != null){
sb.append(l);
} else if (c < 0xc0) {
sb.append(c); // ASCII, C1 control codes and Latin-1 symbols are kept as-is
} else if (c >= 0xc0 && c <= 0x17f) {
c = TAB_00C0.charAt(c - 0xc0); // common single latin letters
sb.append(c);
} else {
// anything else, including Vietnamese and rare diacritics
l = Normalizer.normalize(Character.toString(c), Normalizer.Form.NFKD)
.replaceAll("[\\p{InCombiningDiacriticalMarks}]+", "");
sb.append(l);
}
}
return sb.toString();
}
and then
String value = "üöäâÇæôøñÁ";
String after = removeAllButUmlauts(value);
System.out.println(after);
gives:
üöäaCaeoonA
public class Main {
public static void main(String[] args) {
String name = "the-stealth-warrior";
for (int i = 0; i < name.length();i++){
if (name.charAt(i) == '-'){
char newName = Character.toUpperCase(name.charAt(i+1));
newName += name.charAt(i + 1);
i++;
}
}
}
}
I am trying to loop over every char and, whenever the current char is '-', convert the next letter to uppercase and append it to a new String.
We can try using a split approach with the help of a stream:
String name = "the-stealth-warrior";
String parts = name.replaceAll("^.*?-", "");
String output = Arrays.stream(parts.split("-"))
.map(x -> x.substring(0, 1).toUpperCase() + x.substring(1))
.collect(Collectors.joining(""));
output = name.split("-", 2)[0] + output;
System.out.println(output); // theStealthWarrior
I think the most concise way to do this would be with regexes:
String newName = Pattern.compile("-+(.)?").matcher(name).replaceAll(mr -> mr.group(1).toUpperCase());
Note that Pattern.compile(...) can be stored rather than re-evaluating it each time.
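One edge case is worth knowing about: if name ends with a dash, group(1) is null and the lambda in the one-liner throws a NullPointerException. Here is a minimal sketch that stores the compiled pattern and guards against that case. The class, constant and method names are made up for this example, and it needs Java 9+ for Matcher.replaceAll with a function.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class for illustration.
class CamelCaseRegex {

    // Compiled once and reused, as suggested in the note above.
    private static final Pattern DASHES = Pattern.compile("-+(.)?");

    static String toCamelCase(String name) {
        return DASHES.matcher(name).replaceAll(mr -> {
            String next = mr.group(1);
            if (next == null) {
                return ""; // trailing dash(es): nothing left to uppercase
            }
            // quoteReplacement keeps '$' and '\' from being read as
            // group references in the replacement string.
            return Matcher.quoteReplacement(next.toUpperCase(Locale.ROOT));
        });
    }

    public static void main(String[] args) {
        System.out.println(toCamelCase("the-stealth-warrior")); // theStealthWarrior
    }
}
```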
A more verbose (but probably more efficient) way to do it would be to build the string using a StringBuilder:
StringBuilder sb = new StringBuilder(name.length());
boolean uc = false; // Flag to know whether to uppercase the char.
int c;
// Advance by the number of chars the current code point occupies (1 or 2).
for (int i = 0; i < name.length(); i += Character.charCount(c)) {
c = name.codePointAt(i);
if (c == '-') {
// Don't append the codepoint, but flag to uppercase the next codepoint
// that isn't a '-'.
uc = true;
} else {
if (uc) {
c = Character.toUpperCase(c);
uc = false;
}
sb.appendCodePoint(c);
}
}
String newName = sb.toString();
Note that you can't reliably uppercase single code points: some characters uppercase to more than one character, e.g. ß becomes SS, which Character.toUpperCase cannot express.
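To illustrate that caveat, here is a sketch of the same loop that uppercases through String.toUpperCase(Locale) instead of Character.toUpperCase, so a code point may expand to several characters. The class and method names and the Locale parameter are assumptions made for this example.

```java
import java.util.Locale;

// Hypothetical variant of the StringBuilder loop, for illustration only.
class CamelCaseCodePoints {

    static String toCamelCase(String name, Locale locale) {
        StringBuilder sb = new StringBuilder(name.length());
        boolean uc = false; // uppercase the next code point that is not '-'
        int c;
        for (int i = 0; i < name.length(); i += Character.charCount(c)) {
            c = name.codePointAt(i);
            if (c == '-') {
                uc = true;
            } else if (uc) {
                // Going through String lets one code point become several chars.
                sb.append(new String(Character.toChars(c)).toUpperCase(locale));
                uc = false;
            } else {
                sb.appendCodePoint(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Character.toUpperCase leaves the sharp s (U+00DF) unchanged ...
        System.out.println(Character.toUpperCase('\u00df') == '\u00df'); // true
        // ... while the String-based version expands it to "SS".
        System.out.println(toCamelCase("gro-\u00dfe", Locale.GERMAN)); // groSSe
    }
}
```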
I have a string that is this
Temperature: 98.6°F (37.0°C)
Ultimately, I would like to convert it to look like this
98.6\u00b0F (37.0\u00b0C)
With every solution I have tried, the degree sign ends up as a ? or some other char; what I want is to put the Unicode escape sequence there as a string.
None of the solutions that I have come across or tried seem to work.
Thanks in advance.
Just loop through the characters of the string and replace non-ASCII characters with the Unicode escape:
String s = "Temperature: 98.6°F (37.0°C)";
StringBuilder buf = new StringBuilder();
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (c >= 0x20 && c <= 0x7E) // visible ASCII character
buf.append(c);
else
buf.append(String.format("\\u%04x", (int) c));
}
String t = buf.toString();
System.out.println(t);
Output
Temperature: 98.6\u00b0F (37.0\u00b0C)
In Java 9+, it's even simpler:
String s = "Temperature: 98.6°F (37.0°C)";
String t = Pattern.compile("[^ -~]").matcher(s)
.replaceAll(r -> String.format("\\\\u%04x", (int) r.group().charAt(0)));
System.out.println(t);
There is a way to split a string into runs of repeating characters using a regex, but I want to do it without one.
For example, given a string like "EE B", my output should be an array of strings, e.g.
{"EE", " ", "B"}
My approach is:
Given a string, I first find the number of unique characters in it, so that I know the size of the array. Then I convert the string to an array of characters and check whether the next character is the same as the current one. If it is, I append them together; if not, I begin a new string.
My code so far:
String myinput = "EE B";
char[] cinput = new char[myinput.length()];
cinput = myinput.toCharArray(); //turn string to array of characters
int uniquecha = myinput.length();
for (int i = 0; i < cinput.length; i++) {
if (i != myinput.indexOf(cinput[i])) {
uniquecha--;
} //this should give me the number of unique characters
String[] returninput = new String[uniquecha];
Arrays.fill(returninput, "");
for (int i = 0; i < uniquecha; i++) {
returninput[i] = "" + myinput.charAt(i);
for (int j = 0; j < myinput.length - 1; j++) {
if (myinput.charAt(j) == myinput.charAt(j + 1)) {
returninput[j] += myinput.charAt(j + 1);
} else {
break;
}
}
} return returninput;
But there is something wrong with the second part: I can't figure out why it does not begin a new string when the character changes.
Your question says that you don't want to use regex, but I see no reason for that requirement, other than that this is maybe homework. If you are open to using regex here, then there is a one-line solution which splits your input string on the following pattern:
(?<=\S)(?=\s)|(?<=\s)(?=\S)
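This pattern uses lookarounds to split whenever what precedes is a non-whitespace character and what follows is a whitespace character, or vice versa. Note that it only splits at boundaries between whitespace and non-whitespace, so a run of different letters such as "EEB" would not be split.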
String input = "EE B";
String[] parts = input.split("(?<=\\S)(?=\\s)|(?<=\\s)(?=\\S)");
System.out.println(Arrays.toString(parts));
[EE, , B]
^^ a single space character in the middle
If I understood correctly, you want to split a string so that identical consecutive characters stay together. If that's the case, here is how I would do it:
public static ArrayList<String> splitString(String str)
{
ArrayList<String> output = new ArrayList<>();
String combo = "";
//iterates through all the characters in the input
for(char c: str.toCharArray()) {
//check if the current char is equal to the last added char
if(combo.length() > 0 && c != combo.charAt(combo.length() - 1)) {
output.add(combo);
combo = "";
}
combo += c;
}
output.add(combo); //adds the last run of characters
return output;
}
Note that instead of using an array (has a fixed size) to store the output, I used an ArrayList, which has a variable size. Also, instead of checking the next character for equality with the current one, I preferred to use the last character for that. The variable combo is used to temporarily store the characters before they go to output.
Now, here is one way to print the result following your guidelines:
public static void main(String[] args)
{
String input = "EEEE BCD DdA";
ArrayList<String> output = splitString(input);
System.out.print("[");
for(int i = 0; i < output.size(); i++) {
System.out.print("\"" + output.get(i) + "\"");
if(i != output.size()-1)
System.out.print(", ");
}
System.out.println("]");
}
The output when running the above code will be:
["EEEE", " ", "B", "C", "D", " ", "D", "d", "A"]
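If a plain array is required, as in the question, the array size has to be the number of runs, not the number of unique characters: "ABA" has two unique characters but three runs. A minimal sketch of that two-pass idea follows; the class and method names are invented for this example.

```java
import java.util.Arrays;

// Hypothetical array-based variant, for illustration only.
class RunSplitter {

    static String[] splitIntoRuns(String input) {
        if (input.isEmpty()) {
            return new String[0];
        }
        // Pass 1: a new run starts wherever a char differs from its predecessor.
        int runs = 1;
        for (int i = 1; i < input.length(); i++) {
            if (input.charAt(i) != input.charAt(i - 1)) {
                runs++;
            }
        }
        // Pass 2: cut the string at the same positions.
        String[] result = new String[runs];
        int start = 0;
        int next = 0;
        for (int i = 1; i <= input.length(); i++) {
            if (i == input.length() || input.charAt(i) != input.charAt(i - 1)) {
                result[next++] = input.substring(start, i);
                start = i;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitIntoRuns("EE B")));
    }
}
```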
How to capitalize the first and last letters of every word in a string
I have done it this way:
String cap = "";
for (int i = 0; i < sent.length() - 1; i++)
{
if (sent.charAt(i + 1) == ' ')
{
cap += Character.toUpperCase(sent.charAt(i)) + " " + Character.toUpperCase(sent.charAt(i + 2));
i += 2;
}
else
cap += sent.charAt(i);
}
cap += Character.toUpperCase(sent.charAt(sent.length() - 1));
System.out.print (cap);
It does not work when the first word has more than one character.
Please use simple functions, as I am a beginner.
Using the Apache Commons Lang library, this becomes very easy to do:
String testString = "this string is needed to be 1st and 2nd letter-uppercased for each word";
testString = WordUtils.capitalize(testString);
testString = StringUtils.reverse(testString);
testString = WordUtils.capitalize(testString);
testString = StringUtils.reverse(testString);
System.out.println(testString);
ThiS StrinG IS NeedeD TO BE 1sT AnD 2nD Letter-uppercaseD FoR EacH WorD
You should rather split your String on whitespace, then for each token apply toUpperCase() to the first and the last character and build a new String as the result.
A very simple sample:
String cap = "";
String sent = "hello world. again.";
String[] token = sent.split("\\s+|\\.$");
for (String currentToken : token){
String firstChar = String.valueOf(Character.toUpperCase(currentToken.charAt(0)));
String between = currentToken.substring(1, currentToken.length()-1);
String LastChar = String.valueOf(Character.toUpperCase(currentToken.charAt(currentToken.length()-1)));
if (!cap.equals("")){
cap += " ";
}
cap += firstChar+between+LastChar;
}
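Of course, you should favor the use of StringBuilder over String, as you perform many concatenations. Note also that this sample assumes every token has at least two characters; for a single-character token, substring(1, 0) would throw a StringIndexOutOfBoundsException.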
Output result : HellO World. AgaiN
Your code is missing out the first letter of the first word. I would treat this as a special case, i.e.
cap = ""+Character.toUpperCase(sent.charAt(0));
for (int i = 1; i < sent.length() - 1; i++)
{
.....
Of course, there are much easier ways to do what you are doing.
Basically you just need to iterate over all characters and replace them if one of the following conditions is true:
it's the first character
it's the last character
the previous character was a whitespace (or whatever you want, e.g. punctuation - see below)
the next character is a whitespace (or whatever you want, e.g. punctuation - see below)
If you use a StringBuilder for performance and memory reasons (don't create a String in every iteration which += would do) it could look like this:
StringBuilder sb = new StringBuilder( "some words in a list even with longer whitespace in between" );
for( int i = 0; i < sb.length(); i++ ) {
if( i == 0 || //rule 1
i == (sb.length() - 1 ) || //rule 2
Character.isWhitespace( sb.charAt( i - 1 ) ) || //rule 3
Character.isWhitespace( sb.charAt( i + 1 ) ) ) { //rule 4
sb.setCharAt( i, Character.toUpperCase( sb.charAt( i ) ) );
}
}
Result: SomE WordS IN A LisT EveN WitH LongeR WhitespacE IN BetweeN
If you want to check for other rules as well (e.g. punctuation etc.) you could create a method that you call for the previous and next character and which checks for the required properties.
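Such a helper could look something like this. It is a sketch only: isBoundary and capitalizeEnds are names invented for this example, and treating every character that is neither a letter nor a digit as a boundary is an assumption you may want to change.

```java
// Hypothetical example class, for illustration only.
class CapitalizeEnds {

    // Assumption: anything that is not a letter or digit separates words
    // (this covers whitespace as well as punctuation).
    static boolean isBoundary(char c) {
        return !Character.isLetterOrDigit(c);
    }

    static String capitalizeEnds(String text) {
        StringBuilder sb = new StringBuilder(text);
        int last = sb.length() - 1;
        for (int i = 0; i <= last; i++) {
            boolean startsWord = i == 0 || isBoundary(sb.charAt(i - 1));
            boolean endsWord = i == last || isBoundary(sb.charAt(i + 1));
            if (startsWord || endsWord) {
                // Uppercasing a space or punctuation mark leaves it unchanged.
                sb.setCharAt(i, Character.toUpperCase(sb.charAt(i)));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(capitalizeEnds("hello world. again.")); // HellO WorlD. AgaiN.
    }
}
```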
String stringToSearch = "this string is needed to be first and last letter uppercased for each word";
// First letter upper case using regex
Pattern firstLetterPtn = Pattern.compile("(\\b[a-z]{1})+");
Matcher m = firstLetterPtn.matcher(stringToSearch);
StringBuffer sb = new StringBuffer();
while(m.find()){
m.appendReplacement(sb,m.group().toUpperCase());
}
m.appendTail(sb);
stringToSearch = sb.toString();
sb.setLength(0);
// Last letter upper case using regex
Pattern LastLetterPtn = Pattern.compile("([a-z]{1}\\b)+");
m = LastLetterPtn.matcher(stringToSearch);
while(m.find()){
m.appendReplacement(sb,m.group().toUpperCase());
}
m.appendTail(sb);
System.out.println(sb.toString());
output:
ThiS StrinG IS NeedeD TO BE FirsT AnD LasT LetteR UppercaseD FoR EacH WorD
public static String basicEncrypt(String s) {
String toReturn = "";
for (int j = 0; j < s.length(); j++) {
toReturn += (int)s.charAt(j);
}
//System.out.println("Encrypt: " + toReturn);
return toReturn;
}
Is there any way to reverse this to find the original string? Much appreciated.
Under the assumption that you only use characters with codes 32-255, the algorithm is simple:
1. Look at the first digit of the input.
2. If it's 1 or 2, cut off the first three digits (that digit and the next two) and convert them to a character.
3. If it's any other digit, cut off the first two digits and convert them to a character.
4. Go to 1 if there is any input left.
Here is a quick'n'dirty Scala implementation:
def decrypt(s: String): String = s.headOption match {
case None => ""
case Some('1') | Some('2') => s.substring(0, 3).toInt.toChar + decrypt(s.substring(3))
case Some(_) => s.substring(0, 2).toInt.toChar + decrypt(s.substring(2))
}
Yes, provided that your original string consists only of characters between space (charcode 32) and Unicode charcode 299, inclusive; see http://www.asciitable.com/
Pseudocode
ret=<empty string>
while not end of string
n=next number from string
if n<3 charcode= n + next 2 numbers
else
charcode=n + next number
ret=ret + character(charcode)
end while
Charcodes below space (such as newlines and carriage returns) and above 299 will thwart this algorithm. This algorithm can be fixed to include characters up to charcode 319.
private static String basicDecrypt(String s) {
String result = "";
String buffer = "";
for (int i = 0; i < s.length(); i++) {
buffer += s.charAt(i);
if ((buffer.charAt(0) == '1' && buffer.length() == 3) || (buffer.charAt(0) != '1' && buffer.length() == 2)) {
result += (char) Integer.parseInt(buffer);
buffer = "";
}
}
return result;
}
This is a very basic decryption method. It only works for character codes 20 to 199, which covers [A-Za-z0-9] and the rest of printable US-ASCII.
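A quick way to check that range is a round trip through both methods. The sketch below restates basicEncrypt from the question with a StringBuilder and uses an index-based decoder built on the same first-digit rule. The class name is invented, and the decoder assumes well-formed input with codes between 20 and 199.

```java
// Hypothetical round-trip check, for illustration only.
class BasicCipherRoundTrip {

    // Same behaviour as basicEncrypt in the question.
    static String basicEncrypt(String s) {
        StringBuilder sb = new StringBuilder();
        for (int j = 0; j < s.length(); j++) {
            sb.append((int) s.charAt(j));
        }
        return sb.toString();
    }

    // A leading '1' means a three-digit code (100-199); anything else
    // means a two-digit code (20-99).
    static String basicDecrypt(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int width = s.charAt(i) == '1' ? 3 : 2;
            out.append((char) Integer.parseInt(s.substring(i, i + width)));
            i += width;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String original = "Hello, World! 123 ~";
        String encoded = basicEncrypt(original);
        System.out.println(basicDecrypt(encoded).equals(original)); // true
    }
}
```

A newline (code 10) breaks the round trip: it encodes as "10", which the decoder reads as the start of a three-digit code.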
Just for the fun of it, another couple of versions; Java, US-ASCII only, chars 0x14-0xc7;
public static String basicDecrypt(String input)
{
StringBuffer output = new StringBuffer();
Matcher matcher = Pattern.compile("(1..|[2-9].)").matcher(input);
while(matcher.find())
output.append((char)Integer.parseInt(matcher.group()));
return output.toString();
}
For 0x1e-0xff, replace the regex with "([12]..|[3-9].)"
...and a somewhat briefer Linq'y C# version.
private static string BasicDecrypt(string input)
{
return new string(Regex.Matches(input, "(1..|[2-9].)").Cast<Match>()
.Select(x => (char) Int32.Parse(x.Value)).ToArray());
}