How to convert accented letters to regular char in Java

How do I convert Æ and á into regular English characters in Java? What I have is something like this: Local TV from Paraná. How do I convert it to [Parana]?

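Look at icu4j or the JDK 1.6 Normalizer (java.text.Normalizer):
import java.text.Normalizer;
// Decompose each accented letter into base letter + combining mark (NFD),
// then delete the combining marks.
public String removeAccents(String text) {
    return Normalizer.normalize(text, Normalizer.Form.NFD)
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}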
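A minimal, self-contained sketch of the Normalizer approach, for anyone who wants to run it. The class and method names are made up for illustration. One caveat it tries to cover: Æ has no Unicode decomposition, so NFD leaves it untouched; the explicit mapping to "AE"/"ae" below is an assumption about what "regular char" should mean for ligatures.

```java
import java.text.Normalizer;

final class AccentStripper {
    // Hypothetical helper: NFD-decompose, then drop all combining marks (\p{M}).
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String withoutMarks = decomposed.replaceAll("\\p{M}+", "");
        // Ligatures such as U+00C6 (AE ligature) do not decompose, so they
        // need an explicit mapping; "AE"/"ae" is an assumed choice.
        return withoutMarks.replace("\u00C6", "AE").replace("\u00E6", "ae");
    }

    public static void main(String[] args) {
        // "\u00E1" is a-acute, written as an escape to avoid source-encoding issues.
        System.out.println(stripAccents("Local TV from Paran\u00E1"));
        System.out.println(stripAccents("\u00C6sop"));
    }
}
```

With the question's input this should print "Local TV from Parana"; other characters that do not decompose (ø, ß, and so on) would need their own entries.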

As far as I know, there's no way to do this automatically -- you'd have to substitute manually using String.replace:
String str = "Paraná";
str = str.replace("á", "a");
str = str.replace("Æ", "AE");

Try something similar to the following code snippet, which uses Apache Commons Lang:
import org.apache.commons.lang3.StringUtils;
public class Test {
    public static void main(String[] args) {
        String original = "Ramesh Öhrman";
        // stripAccents removes the diacritics from the string
        System.out.println(StringUtils.stripAccents(original));
    }
}
Output: Ramesh Ohrman

Related

Splitting string on spaces unless in double quotes but double quotes can have a preceding string attached

I need to split a string in Java: first remove the whitespace between quotes, then split on whitespace.
"abc test=\"x y z\" magic=\" hello \" hola"
becomes:
firstly:
"abc test=\"xyz\" magic=\"hello\" hola"
and then:
abc
test="xyz"
magic="hello"
hola
Scenario:
I am getting a string like the one above as input and I want to break it into the parts shown. One approach is to first remove the spaces between quotes and then split on spaces; the string attached before the quotes complicates this. Another is to split on spaces except inside quotes, and then remove the spaces from each piece. I tried capturing the quoted parts with "\"([^\"]+)\"", but I can't capture just the spaces inside the quotes. I tried a few other things, with no luck.

We can do this using a formal pattern matcher. The secret sauce of the answer below is the seldom-used Matcher#appendReplacement method. We pause at each match and append a custom replacement for whatever appears between a pair of quotes. The custom method removeSpaces() strips all whitespace from each quoted term.
public static String removeSpaces(String input) {
    return input.replaceAll("\\s+", "");
}
String input = "abc test=\"x y z\" magic=\" hello \" hola";
Pattern p = Pattern.compile("\"(.*?)\"");
Matcher m = p.matcher(input);
StringBuffer sb = new StringBuffer();
while (m.find()) {
    // quoteReplacement keeps any '$' or '\' in the quoted term literal
    String quoted = "\"" + removeSpaces(m.group(1)) + "\"";
    m.appendReplacement(sb, Matcher.quoteReplacement(quoted));
}
m.appendTail(sb);
String[] parts = sb.toString().split("\\s+");
for (String part : parts) {
    System.out.println(part);
}
Output:
abc
test="xyz"
magic="hello"
hola
The big caveat here, as the comments above hinted, is that we are really using a regex engine as a rudimentary parser. To see where my solution fails, just remove one of the quotes from a quoted term. But if you are sure your input is well formed, as in the example you have shown us, this answer might work for you.

I wanted to mention Java 9's Matcher.replaceAll overload, which takes a lambda:
// Find quoted strings and remove their whitespace:
s = Pattern.compile("\"[^\"]*\"").matcher(s)
        .replaceAll(mr -> mr.group().replaceAll("\\s", ""));
// Turn each remaining run of whitespace into a comma and wrap everything in braces.
s = '{' + s.trim().replaceAll("\\s+", ", ") + '}';
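For comparison with the output the question asks for, here is a sketch of the same Java 9+ technique that ends with a split instead of braces. The class and method names are hypothetical, and wrapping the lambda result in Matcher.quoteReplacement is an added precaution (the returned string is otherwise interpreted as a replacement pattern, so a '$' inside quotes would be a problem).

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotedSplit {
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"");

    // Hypothetical helper: strip whitespace inside each quoted run,
    // then split the result on the whitespace that remains.
    static List<String> tokens(String s) {
        String compact = QUOTED.matcher(s).replaceAll(
                mr -> Matcher.quoteReplacement(mr.group().replaceAll("\\s", "")));
        return Arrays.asList(compact.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // Prints one token per line.
        tokens("abc test=\"x y z\" magic=\" hello \" hola")
                .forEach(System.out::println);
    }
}
```

Like the other regex answers, this assumes the quotes are balanced; an unmatched quote is simply left as it is.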

The other answer is probably better, but I had already written this, so I will post it here ;) It takes a different approach:
public static void main(String[] args) {
    String test = "abc test=\"x y z\" magic=\" hello \" hola";
    Pattern pattern = Pattern.compile("([^\\\"]+=\\\"[^\\\"]+\\\" )");
    Matcher matcher = pattern.matcher(test);
    int lastIndex = 0;
    while (matcher.find()) {
        String[] parts = matcher.group(0).trim().split("=");
        boolean newLine = false;
        for (String string : parts[0].split("\\s+")) {
            if (newLine)
                System.out.println();
            newLine = true;
            System.out.print(string);
        }
        System.out.println("=" + parts[1].replaceAll("\\s", ""));
        lastIndex = matcher.end();
    }
    System.out.println(test.substring(lastIndex).trim());
}
Result is
abc
test="xyz"
magic="hello"
hola

It sounds like you want to write a basic parser/tokenizer. My bet is that once you have something that can pretty-print this structure, you will soon want to validate that there aren't any mismatched quotes.
In essence, this particular problem has a few stages, and Java has a built-in tokenizer that can prove useful.
import java.util.LinkedList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.stream.Collectors;
public class Q50151376 {
    private static class Whitespace {
        Whitespace() { }
        @Override
        public String toString() {
            return "\n";
        }
    }
    private static class QuotedString {
        public final String string;
        QuotedString(String string) {
            this.string = "\"" + string.trim() + "\"";
        }
        @Override
        public String toString() {
            return string;
        }
    }
    public static void main(String[] args) {
        String test = "abc test=\"x y z\" magic=\" hello \" hola";
        StringTokenizer tokenizer = new StringTokenizer(test, "\"");
        boolean inQuotes = false;
        List<Object> out = new LinkedList<>();
        while (tokenizer.hasMoreTokens()) {
            final String token = tokenizer.nextToken();
            if (inQuotes) {
                out.add(new QuotedString(token));
            } else {
                out.addAll(TokenizeWhitespace(token));
            }
            inQuotes = !inQuotes;
        }
        System.out.println(joinAsStrings(out));
    }
    private static String joinAsStrings(List<Object> out) {
        return out.stream()
                .map(Object::toString)
                .collect(Collectors.joining());
    }
    public static List<Object> TokenizeWhitespace(String in) {
        List<Object> out = new LinkedList<>();
        StringTokenizer tokenizer = new StringTokenizer(in, " ", true);
        boolean ignoreWhitespace = false;
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            boolean whitespace = token.equals(" ");
            if (!whitespace) {
                out.add(token);
                ignoreWhitespace = false;
            } else if (!ignoreWhitespace) {
                out.add(new Whitespace());
                ignoreWhitespace = true;
            }
        }
        return out;
    }
}

Remove a character from java string using hex code

I would like to remove a character from a Java string using its hex code.
I am trying the following code, but it does not seem to be correct, as the character ÿ isn't replaced:
String str ="test ÿ";
str.replaceAll("\\x{9F}","")
Is there anything wrong with the syntax of the hex code? Thanks.

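Could you please try this (note that it removes every non-ASCII character, not just ÿ):
public class AsciiHexCode {
    public static void main(String[] args) {
        String str = "test ÿ";
        // Keep only characters in the ASCII range 0x00-0x7F
        String result = str.replaceAll("[^\\x00-\\x7F]", "");
        System.out.println("result : " + result);
    }
}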

To match ÿ you need \u00ff instead, as Jon mentioned. In your case:
String replaced = str.replace("\u00ff", "");
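If the regex form from the question is preferred, a sketch along these lines should also work. The class and method names are invented for the example. Two things differ from the question's attempt: ÿ is code point U+00FF (U+009F is a C1 control character), and the result has to be assigned or returned because Java strings are immutable.

```java
final class HexRemove {
    // Hypothetical helper: delete every U+00FF (y with diaeresis) from the input.
    static String removeYDiaeresis(String s) {
        // \x{FF} is the regex escape for code point U+00FF (Java 7 and later).
        // replaceAll returns a new string; the original is never modified.
        return s.replaceAll("\\x{FF}", "");
    }

    public static void main(String[] args) {
        String str = "test \u00FF";
        // Brackets make the remaining trailing space visible.
        System.out.println("[" + removeYDiaeresis(str) + "]");
    }
}
```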

regex does not like out#

I wrote the following code to remove all hashtag words from my text:
public static void main(String[] args) {
    System.out
            .println(removeHashtag("Got an infection in my eye. Pharmacist thinks something bitten me. This wouldn't have happened under Simeone. Wenger a#sarcasm #wengerin"));
}
public static String removeHashtag(String commentstr) {
    String arrWord[] = commentstr.split(" ");
    String sentenceWithoutHash = commentstr;
    System.out.println(sentenceWithoutHash);
    for (int i = 0; i < arrWord.length; i++) {
        if (arrWord[i].contains("#")) {
            String regex = "\\s*\\" + arrWord[i] + "\\b\\s*";
            sentenceWithoutHash = sentenceWithoutHash.replaceAll(regex, "");
        }
    }
    return sentenceWithoutHash;
}
But this code does not work with this text:
Got an infection in my eye. Pharmacist thinks something bitten me. This wouldn't have happened under Simeone. Wenger out#sarcasm #wengerin
It seems that the regex does not like out#.
Can anyone help?

You can use this regex to remove any word containing #:
String rep = str.replaceAll("\\s*\\w*#\\w*\\s*", "");

This will work for your case:
((?:[^\s]+)?#[^\s]+)
String x = str.replaceAll("((?:[^\\s]+)?#[^\\s]+)", "");
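A likely explanation for why the original loop trips on out#: it builds the pattern by putting a backslash in front of the whole word. For #wengerin that gives \#wengerin, which is harmless, but a#sarcasm becomes \a#sarcasm (\a is the bell character, so nothing matches) and out#sarcasm becomes \out#sarcasm (\o is not a valid escape, so Pattern throws). A minimal sketch that avoids building patterns from the input at all; the class and method names are made up for illustration:

```java
final class HashtagRemover {
    // Hypothetical helper: drop every whitespace-delimited word that contains
    // '#', together with the whitespace in front of it.
    static String removeHashtagWords(String text) {
        return text.replaceAll("\\s*\\S*#\\S*", "").trim();
    }

    public static void main(String[] args) {
        System.out.println(removeHashtagWords(
                "This wouldn't have happened under Simeone. Wenger out#sarcasm #wengerin"));
    }
}
```

If a pattern really must be built from a word taken from the input, wrapping the word in Pattern.quote(...) keeps it literal.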

java tokenizer for strings

I have a text file and want to tokenize its lines -- but only the sentences with the # character.
For example, given...
Buah... Molt bon concert!! #Postconcert #gintonic
...I want to print only #Postconcert #gintonic.
I have already tried this code with some changes...
public class MyTokenizer {
    /**
     * @param args
     */
    public static void main(String[] args) {
        tokenize("Europe3.txt", "allo.txt");
    }
    public static void tokenize(String sFile, String sFileOut) {
        String sLine = "", sToken = "";
        MyBufferedReaderWriter f = new MyBufferedReaderWriter();
        f.openRFile(sFile);
        MyBufferedReaderWriter fOut = new MyBufferedReaderWriter();
        fOut.openWFile(sFileOut);
        while ((sLine = f.readLine()) != null) {
            //StringTokenizer st = new StringTokenizer(sLine, "#");
            String[] tokens = sLine.split("\\#");
            for (String token : tokens)
            {
                fOut.writeLine(token);
                //System.out.println(token);
            }
            /*while (st.hasMoreTokens()) {
                sToken = st.nextToken();
                System.out.println(sToken);
            }*/
        }
        f.closeRFile();
    }
}
Can anyone help?

You can try something like this with a regex:
package com.stackoverflow.answers;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class HashExtractor {
    public static void main(String[] args) {
        String strInput = "Buah... Molt bon concert!! #Postconcert #gintonic";
        // A hashtag must start the input or follow whitespace.
        // '-' is placed last in the character class so that it is literal.
        String strPattern = "(?:\\s|\\A)#+([A-Za-z0-9_-]+)";
        Pattern pattern = Pattern.compile(strPattern);
        Matcher matcher = pattern.matcher(strInput);
        while (matcher.find()) {
            // group(1) is the tag text without the leading whitespace and '#'
            System.out.println("#" + matcher.group(1));
        }
    }
}

For the given example, split() would store the values like this:
tokens[0]=Buah... Molt bon concert!!
tokens[1]=Postconcert
tokens[2]=gintonic
So you just need to skip the first value and prepend '#' (if you need it) to the other values.
Hope this helps.
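The split-based idea above could look roughly like this. It is only a sketch: the class and method names are hypothetical, and trimming each piece is an addition, since split leaves the space before the next '#' attached to the preceding token.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitHashtags {
    // Hypothetical helper: split on '#', skip the text before the first '#',
    // and put the '#' back in front of every remaining piece.
    static List<String> hashtags(String line) {
        String[] tokens = line.split("#");
        List<String> tags = new ArrayList<>();
        for (int i = 1; i < tokens.length; i++) {
            String tag = tokens[i].trim();
            if (!tag.isEmpty()) {
                tags.add("#" + tag);
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println(hashtags("Buah... Molt bon concert!! #Postconcert #gintonic"));
    }
}
```

This only behaves well when the hashtags come at the end of the line; any ordinary words that follow a tag stay attached to it, which is where the regex answers are more robust.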

You have not specifically asked for this, but I assume you are trying to extract all the #hashtags from your text file.
To do this, regex is your friend:
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String text = "Buah... Molt bon concert!! #Postconcert #gintonic";
System.out.println(getHashTags(text));
public Collection<String> getHashTags(String text) {
    Pattern pattern = Pattern.compile("(#\\w+)");
    Matcher matcher = pattern.matcher(text);
    Set<String> htags = new HashSet<>();
    while (matcher.find()) {
        htags.add(matcher.group(1));
    }
    return htags;
}
Compile a pattern like #\w+: a # followed by one or more (+) word characters (\w).
Then we have to escape the \ for Java by writing \\.
Finally, put the expression in a group by surrounding it with parentheses, (#\w+), to get access to the matched text.
For every match, add the first matched group to the set htags; in the end we get a set with all the hashtags in it (a HashSet does not guarantee any particular order):
[#gintonic, #Postconcert]

How do I split a string on a fixed character sequence?

Suppose I have the following string:
String asd = "this is test ass this is test";
and I want to split the string on the character sequence "ass".
I used:
asd.split("ass");
It doesn't work. What do I need to do?

It seems to work fine for me:
public class Test
{
    public static void main(String[] args) {
        String asd = "this is test ass this is test";
        String[] bits = asd.split("ass");
        for (String bit : bits) {
            System.out.println("'" + bit + "'");
        }
    }
}
Result:
'this is test '
' this is test'
Is your real delimiter different perhaps? Don't forget that split uses its parameter as a regular expression...
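To illustrate the point about regular expressions: if the real delimiter contains a regex metacharacter such as '.' or '|', a plain split behaves very differently from what one might expect. A small sketch, with a made-up class and method name, that treats the delimiter literally via Pattern.quote:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class LiteralSplit {
    // Hypothetical helper: split on the delimiter taken literally, not as a regex.
    static String[] splitLiteral(String text, String delimiter) {
        return text.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        // As a regex, "." matches every character, so nothing is left over.
        System.out.println("1.2.3".split(".").length);
        // Quoted, "." only matches a literal dot.
        System.out.println(Arrays.toString(splitLiteral("1.2.3", ".")));
    }
}
```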

String asd = "this is test foo this is test";
String[] parts = asd.split("foo");
Try this; it will work.

public class Splitter {
    public static void main(final String[] args) {
        final String asd = "this is test ass this is test";
        final String[] parts = asd.split("ass");
        for (final String part : parts) {
            System.out.println(part);
        }
    }
}
Prints:
this is test
this is test
Under Java 6. What output were you expecting?
