Convert a text with special unicode to normal text (java) - java

I have a text which includes numerous Unicode (?) characters, like the following:
passaic$002c new jersey
Which should be : passaic, new jersey
Albert_W$002E_Barney
Which should be : albert w. barney
Roosevelt_High_School_$0028Yonkers$002C_New_York$0029
which should be: Roosevelt_High_School_(Yonkers,_New_York)
I searched the web and there is a big list of these characters: http://colemak.com/pub/mac/wordherd_source.txt
Do you know any fast method that I can replace these characters with their original characters? Note that I don't want to replace each of these characters one by one (like using replaceAll.) Instead I want to use a function that has already implemented this (maybe an external library)

Try the native2ascii tool that ships with the JDK. See http://docs.oracle.com/javase/7/docs/technotes/tools/solaris/native2ascii.html

Assuming those are UTF-16BE encoded values, you can just parse the values and cast to char:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public static String parse(CharSequence csq) {
    StringBuilder out = new StringBuilder();
    // "$" followed by exactly four hex digits
    Matcher matcher = Pattern.compile("\\$(\\p{XDigit}{4}+)").matcher(csq);
    int last = 0;
    while (matcher.find()) {
        // copy the text between the previous match and this one
        out.append(csq.subSequence(last, matcher.start()));
        String hex = matcher.group(1);
        char ch = (char) Integer.parseInt(hex, 16);
        out.append(ch);
        last = matcher.end();
    }
    // copy whatever follows the last match
    out.append(csq.subSequence(last, csq.length()));
    return out.toString();
}
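As a rough sketch of the same idea, the decoding can also be written with Matcher.replaceAll and a lambda. This assumes Java 9 or later, and the class name DollarHexDecoder is made up for illustration. The replacement is passed through Matcher.quoteReplacement so that a decoded "$" or "\" is not interpreted as a group reference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the original answer.
final class DollarHexDecoder {
    // "$" followed by exactly four hex digits, e.g. "$002c"
    private static final Pattern ESCAPE = Pattern.compile("\\$(\\p{XDigit}{4})");

    static String decode(String s) {
        // Matcher.replaceAll(Function) requires Java 9 or later.
        return ESCAPE.matcher(s).replaceAll(m ->
                Matcher.quoteReplacement(
                        String.valueOf((char) Integer.parseInt(m.group(1), 16))));
    }

    public static void main(String[] args) {
        System.out.println(decode("passaic$002c new jersey"));
        System.out.println(decode("Roosevelt_High_School_$0028Yonkers$002C_New_York$0029"));
    }
}
```

Turning underscores into spaces or lower-casing the result, as in the second example of the question, would be a separate step.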

Related

Convert ASCII representation of unicode to unicode

I have an application that gets some Strings via JSON.
The problem is that I think they are being sent as ASCII, while the text really should be in Unicode.
For example, parts of the string contain "\u00f6", which is the Swedish letter "ö".
For example, the Swedish word for "buy" is "köpa", and the string I get is "k\u00f6pa".
Is there an easy way, after I have received this String in Java, to convert it to the correct representation?
That is, I want to convert strings like "k\u00f6pa" to "köpa".
Thanks for all the help!
Well, that is easy enough: just use a JSON library. With Jackson, for instance, you would write:
final ObjectMapper mapper = new ObjectMapper();
final JsonNode node = mapper.readTree(your, source, here);
The JsonNode will in fact be a TextNode; you can just retrieve the text as:
node.textValue()
Note that this IS NOT an "ASCII representation" of a String; it just happens that JSON strings can contain UTF-16 code unit character escapes like this one.
(you will lose the quotes around the value, though, but that is probably what you expect anyway)
The hex code is just a 2-byte integer, which an int can handle just fine, so you can use Integer.parseInt(s, 16), where s is the string without the "\u" prefix. Then you narrow that int to a char, which is guaranteed to fit.
Throw in some regex (to validate the string and also extract the hex code), and you're all done.
Pattern p = Pattern.compile("\\\\u([0-9a-fA-F]{4})");
Matcher m = p.matcher(arg);
// matches() succeeds only if the whole of arg is a single escape sequence
if (m.matches()) {
    String code = m.group(1);
    int i = Integer.parseInt(code, 16);
    char c = (char) i;
    System.out.println(c);
}
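To convert a whole string that mixes plain text and escapes, the same pattern can be applied with find() in a loop instead of matches(). The following is only a sketch; the class name UnicodeEscapeDecoder is an assumption made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the original answer.
final class UnicodeEscapeDecoder {
    // A literal backslash, the letter u, then exactly four hex digits
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static String decode(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(s, last, m.start());                       // text before the escape
            out.append((char) Integer.parseInt(m.group(1), 16));  // the decoded UTF-16 code unit
            last = m.end();
        }
        out.append(s, last, s.length());                          // text after the last escape
        return out.toString();
    }

    public static void main(String[] args) {
        // The string literal below holds a backslash followed by u00f6, not the letter itself.
        System.out.println(decode("k\\u00f6pa"));
    }
}
```

Because each escape is one UTF-16 code unit, a surrogate pair written as two consecutive escapes is appended unit by unit and ends up as the correct pair in the result.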

Replace multiple characters in a string in Java

I have some strings with equations in the following format ((a+b)/(c+(d*e))).
I also have a text file that contains the names of each variable, e.g.:
a velocity
b distance
c time
etc...
What would be the best way for me to write code so that it plugs in velocity everywhere a occurs, and distance for b, and so on?
Don't use String#replaceAll in this case if there is even a slight chance that a replacement will contain a substring you want to replace later. For example, "distance" contains a, so if you later replace a with "velocity" you will end up with "disvelocityance".
It is the same problem as wanting to replace A with B and B with A. For this kind of text manipulation you can use appendReplacement and appendTail from the Matcher class. Here is an example:
String input = "((a+b)/(c+(d*e)))";
Map<String, String> replacementsMap = new HashMap<>();
replacementsMap.put("a", "velocity");
replacementsMap.put("b", "distance");
replacementsMap.put("c", "time");
StringBuffer sb = new StringBuffer();
Pattern p = Pattern.compile("\\b(a|b|c)\\b");
Matcher m = p.matcher(input);
while (m.find()) {
    m.appendReplacement(sb, replacementsMap.get(m.group()));
}
m.appendTail(sb);
System.out.println(sb);
Output:
((velocity+distance)/(time+(d*e)))
This code finds each occurrence of a, b, or c that isn't part of a longer word (no word character directly before or after it, which is what \b, the word boundary, checks). appendReplacement appends to the StringBuffer the text since the last match (or from the beginning for the first match), but with the match replaced by the new word (taken from the Map). appendTail appends to the StringBuffer the text after the last match.
Also, to make this code more dynamic, the regex should be generated automatically from the keys in the Map. You can use this code to do it:
StringBuilder regexBuilder = new StringBuilder("\\b(");
for (String word:replacementsMap.keySet())
regexBuilder.append(Pattern.quote(word)).append('|');
regexBuilder.deleteCharAt(regexBuilder.length()-1);//lets remove last "|"
regexBuilder.append(")\\b");
String regex = regexBuilder.toString();
I'd make a HashMap mapping the variable names to the descriptions, then iterate through all the characters in the string and replace each occurrence of a recognised key with its mapping.
I would use a StringBuilder to build up the new string.
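A minimal sketch of that approach, assuming variable names consist of letters and digits only (the class and method names here are invented for the example): the string is scanned once, runs of letters and digits are collected into a word, and each complete word is looked up in the map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; names are illustrative only.
final class VariableSubstituter {
    static String substitute(String expr, Map<String, String> names) {
        StringBuilder out = new StringBuilder();
        StringBuilder word = new StringBuilder();   // current run of letters/digits
        for (int i = 0; i < expr.length(); i++) {
            char c = expr.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                word.append(c);
            } else {
                flush(word, names, out);
                out.append(c);                      // operators and parentheses pass through
            }
        }
        flush(word, names, out);                    // handle a trailing word
        return out.toString();
    }

    // Appends the mapped name for a complete word, or the word itself if unmapped.
    private static void flush(StringBuilder word, Map<String, String> names, StringBuilder out) {
        if (word.length() > 0) {
            String key = word.toString();
            out.append(names.getOrDefault(key, key));
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        Map<String, String> names = new HashMap<>();
        names.put("a", "velocity");
        names.put("b", "distance");
        names.put("c", "time");
        System.out.println(substitute("((a+b)/(c+(d*e)))", names));
    }
}
```

Since every word is looked up exactly once and the output is never rescanned, an inserted "distance" cannot be hit by a later replacement of a.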
Using a hashmap and iterating over the string as A Boschman suggested is one good solution.
Another solution would be to do what others have suggested and do a .replaceAll(); however, you would want to use a regular expression to specify that only the words matching the whole variable name and not a substring are replaced. A regex using word boundary '\b' matching will provide this solution.
String variable = "a";
String newVariable = "velocity";
str = str.replaceAll("\\b" + variable + "\\b", newVariable); // Strings are immutable, so keep the result
See http://docs.oracle.com/javase/tutorial/essential/regex/bounds.html
For string str, use the replaceAll() function:
str = str.toUpperCase(); //Prevent substitutions of characters in the middle of a word
str = str.replaceAll("A", "velocity");
str = str.replaceAll("B", "distance");
//etc.

How to convert a string into its "html" ascii code using Java?

e.g.
&#66; is uppercase B.
So if I have a string like "BOY", I want it converted to &#66;&#79;&#89;
I'm hoping there's already a library I can use. I've searched the net but I didn't see it.
thanks
Those codes are nothing but the concatenation of &# and ; around the Unicode code point of each character. You can iterate over each character in the string, and do:
output.append("&#")
.append((int)ch)
.append(";");
Where, output refers to a StringBuilder instance.
You could try writing your own utility:
String input = "BOY";
char[] chars = input.toCharArray();
StringBuilder output = new StringBuilder();
for (char c : chars)
{
output.append("&#").append((int) c).append(";");
}
Content of output after execution:
&#66;&#79;&#89;
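Both snippets above work on char values, which are UTF-16 code units. For characters outside the Basic Multilingual Plane that would emit two surrogate values instead of one code point. A possible variant, sketched here with an invented class name and assuming Java 8 or later, iterates over code points instead:

```java
// Hypothetical helper class; the name is not from the original answers.
final class HtmlNumericEntities {
    static String encode(String input) {
        StringBuilder out = new StringBuilder();
        // codePoints() yields one int per Unicode code point, combining surrogate pairs
        input.codePoints().forEach(cp -> out.append("&#").append(cp).append(';'));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("BOY"));
    }
}
```

For plain ASCII input such as "BOY" the result is the same as with the char-based loop.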

Splitting strings based on a delimiter

I am trying to break apart a very simple collection of strings that come in the forms of
0|0
10|15
30|55
etc. Essentially numbers that are separated by pipes.
When I use Java's String split function with .split("|"), I get somewhat unpredictable results: white space in the first slot, and sometimes the number itself isn't where I thought it should be.
Can anybody please help and give me advice on how I can use a reg exp to keep ONLY the integers?
I was asked to give the code trying to do the actual split. So allow me to do that in hopes to clarify further my problem :)
String temp = "0|0";
String[] splitString = temp.split("|");
results
\n
0
|
0
I am trying to get
0
0
only. Forever grateful for any help ahead of time :)
I still suggest using split(); it drops trailing empty tokens by default. If you want to get rid of the non-numeric characters and keep only pipes and numbers, you can then easily use split() to get what you want. Alternatively, you can pass multiple delimiters to split (as a regex), and this should work:
String[] splited = yourString.split("[\\|\\s]+");
and the regex:
import java.util.regex.*;
Pattern pattern = Pattern.compile("\\d+(?=([\\|\\s\\r\\n]))");
Matcher matcher = pattern.matcher(yourString);
while (matcher.find()) {
System.out.println(matcher.group());
}
The pipe symbol is special in a regexp (it marks alternatives), you need to escape it. Depending on the java version you are using this could well explain your unpredictable results.
class t {
    public static void main(String[] args)
    {
        String temp = "0|0";
        // "\\|" in Java source is the regex \| which matches a literal pipe
        String[] splitString = temp.split("\\|");
        for (int i=0; i<splitString.length; i++)
            System.out.println("splitString["+i+"] is " + splitString[i]);
    }
}
outputs
splitString[0] is 0
splitString[1] is 0
Note that one backslash is the regexp escape character, but because a backslash is also the escape character in java source you need two of them to push the backslash into the regexp.
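If counting backslashes feels error-prone, Pattern.quote can build the literal pattern instead. The sketch below (class and method names are made up for illustration) splits one line on a literal pipe and parses both sides as integers.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the original answers.
final class PipeSplit {
    static int[] parse(String line) {
        // Pattern.quote wraps the delimiter so the regex engine treats it literally
        String[] parts = line.split(Pattern.quote("|"));
        int[] values = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Integer.parseInt(parts[i].trim());
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("10|15")));
    }
}
```

Integer.parseInt throws NumberFormatException on anything that is not a number, so malformed lines surface immediately rather than producing odd tokens.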
You can replace the white space with pipes and then split it.
String test = "0|0 10|15 30|55";
test = test.replace(" ", "|");
String[] result = test.split("\\|"); // the pipe must be escaped, since split takes a regex
Hope this helps.
You can use StringTokenizer.
String test = "0|0";
StringTokenizer st = new StringTokenizer(test, "|"); // the default delimiters are whitespace only
int firstNumber = Integer.parseInt(st.nextToken()); //will parse out the first number
int secondNumber = Integer.parseInt(st.nextToken()); //will parse out the second number
Of course you can always nest this inside of a while loop if you have multiple strings.
Also, you need to import java.util.StringTokenizer (or java.util.*) for this to work.
The pipe ('|') is a special character in regular expressions. It needs to be "escaped" with a '\' character if you want to use it as a regular character. Unfortunately, '\' is also a special character in Java string literals, so you need a kind of double escape, e.g.:
String temp = "0|0";
String[] splitStrings = temp.split("\\|");
The Guava library has a nice class, Splitter, which is a much more convenient alternative to String.split(). The advantages are that you can choose to split the string on specific characters (like '|'), on specific strings, or with regexps, and you can choose what to do with the resulting parts (trim them, throw away empty parts, etc.).
For example, you can call
Iterable<String> parts = Splitter.on('|').trimResults().omitEmptyStrings().split("0|0");
This should work for you:
([0-9]+)
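One way that pattern might be applied, as a sketch with an invented class name, is to call find() repeatedly and collect every run of digits, ignoring pipes, spaces, and line breaks alike.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is not from the original answer.
final class IntegerExtractor {
    private static final Pattern DIGITS = Pattern.compile("([0-9]+)");

    static List<Integer> extract(String text) {
        List<Integer> found = new ArrayList<>();
        Matcher m = DIGITS.matcher(text);
        while (m.find()) {
            found.add(Integer.parseInt(m.group(1)));   // each match is one run of digits
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("0|0 10|15 30|55"));
    }
}
```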
Consider a scenario where we have read a line from a csv or xls file as a string and need to separate the columns into an array of strings based on delimiters.
Below is a code snippet to achieve this.
{ ...
....
String line = new BufferedReader(new FileReader("your file")).readLine();
String[] splittedString = StringSplitToArray(line,"\"");
...
....
}
public static String[] StringSplitToArray(String stringToSplit, String delimiter)
{
    StringBuffer token = new StringBuffer();
    Vector tokens = new Vector();
    char[] chars = stringToSplit.toCharArray();
    for (int i=0; i < chars.length; i++) {
        if (delimiter.indexOf(chars[i]) != -1) {
            // delimiter character: close the current token, if any
            if (token.length() > 0) {
                tokens.addElement(token.toString());
                token.setLength(0);
                i++; // also skips the character right after the delimiter
            }
        } else {
            token.append(chars[i]);
        }
    }
    if (token.length() > 0) {
        tokens.addElement(token.toString());
    }
    // convert the vector into an array
    String[] preparedArray = new String[tokens.size()];
    for (int i=0; i < preparedArray.length; i++) {
        preparedArray[i] = (String)tokens.elementAt(i);
    }
    return preparedArray;
}
The snippet above calls StringSplitToArray, which converts the line into a string array by splitting it on the delimiter passed to the method. The delimiter can be a comma (,) or a double quote (").
For more on this, follow this link : http://scrapillars.blogspot.in

How to replace characters using Regex

I received a string from an IBM mainframe like the one below (2-byte graphic fonts):
"　;Ａ;Ｂ;Ｃ;Ｄ;Ｅ;Ｆ;Ｇ;Ｈ;Ｉ;Ｊ;Ｋ;Ｌ;Ｍ;Ｎ;Ｏ;Ｐ;Ｑ;Ｒ;Ｓ;Ｔ;Ｕ;Ｖ;Ｗ;Ｘ;Ｙ;Ｚ;ａ;ｂ;ｃ;ｄ;ｅ;ｆ;ｇ;ｈ;ｉ;ｊ;ｋ;ｌ;ｍ;ｎ;ｏ;ｐ;ｑ;ｒ;ｓ;ｔ;ｕ;ｖ;ｗ;ｘ;ｙ;ｚ;０;１;２;３;４;５;６;７;８;９;｀;－;＝;￦;～;！;＠;＃;＄;％;＾;＆;＊;（;）;＿;＋;｜;［;］;｛;｝;：;＂;＇;，;．;／;＜;＞;？;";
and I want to change these characters to 1-byte ASCII codes.
How can I replace them using java.util.regex.Matcher or String.replaceAll() in Java?
target characters :
;A;B;C;D;E;F;G;H;I;J;K;L;M;N;O;P;Q;R;S;T;U;V;W;X;Y;Z;a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;0;1;2;3;4;5;6;7;8;9;`;-;=;\;~;!;#;#;$;%;^;&;*;(;);_;+;|;[;];{;};:;";';,;.;/;<;>;?;";
This is not (as other responders are saying) a character-encoding issue, but regexes are still the wrong tool. If Java had an equivalent of Perl's tr/// operator, that would be the right tool, but you can hand-code it easily enough:
public static String convert(String oldString)
{
    // 2-byte (fullwidth) source characters, starting with U+3000
    String oldChars = "　ＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ０１２３４５６７８９｀－＝￦～！＠＃＄％＾＆＊（）＿＋｜［］｛｝：＂＇，．／＜＞？";
    // 1-byte targets, position by position
    String newChars = " ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789`-=\\~!##$%^&*()_+|[]{}:\"',./<>?";
    StringBuilder sb = new StringBuilder();
    int len = oldString.length();
    for (int i = 0; i < len; i++)
    {
        char ch = oldString.charAt(i);
        int pos = oldChars.indexOf(ch);
        // characters that are not in the table are copied unchanged
        sb.append(pos < 0 ? ch : newChars.charAt(pos));
    }
    return sb.toString();
}
I'm assuming each character in the first string corresponds to the character at the same position in the second string, and that the first character (U+3000, 'IDEOGRAPHIC SPACE') should be converted to an ASCII space (U+0020).
Be sure to save the source file as UTF-8, and include the -encoding UTF-8 option when you compile it (or tell your IDE to do so).
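If the input characters are the Unicode fullwidth forms (an assumption, based on the U+3000 remark above), most of the table can be replaced by arithmetic: U+FF01 through U+FF5E sit exactly 0xFEE0 above their ASCII counterparts. The sketch below uses an invented class name and writes the characters as Unicode escapes, so the source file encoding does not matter. It does not cover special cases such as the won sign, which still need an explicit mapping like the table above.

```java
// Hypothetical helper class; the name is not from the original answer.
final class FullwidthToAscii {
    static String convert(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch == '\u3000') {
                sb.append(' ');                       // ideographic space becomes an ASCII space
            } else if (ch >= '\uFF01' && ch <= '\uFF5E') {
                sb.append((char) (ch - 0xFEE0));      // fullwidth form to its ASCII counterpart
            } else {
                sb.append(ch);                        // everything else is copied unchanged
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Fullwidth A, B, C, an ideographic space, then fullwidth 1, 2, 3
        System.out.println(convert("\uFF21\uFF22\uFF23\u3000\uFF11\uFF12\uFF13"));
    }
}
```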
Don't think this one's about regex, it's about encoding. Should be possible to read into a String with 2-byte and then write it with any other encoding.
Look here for supported encodings.
