How do I properly decode the following string in Java
http%3A//www.google.ru/search%3Fhl%3Dru%26q%3Dla+mer+powder%26btnG%3D%u0420%A0%u0421%u045F%u0420%A0%u0421%u2022%u0420%A0%u0421%u2018%u0420%u040E%u0420%u0453%u0420%A0%u0421%u201D+%u0420%A0%u0420%u2020+Google%26lr%3D%26rlz%3D1I7SKPT_ru
When I use URLDecoder.decode() I get the following error:
java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "u0"
Thanks,
Dave
According to Wikipedia, "there exist a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a Unicode value".
Continuing: "This behavior is not specified by any RFC and has been rejected by the W3C".
Your URL contains such tokens, and the Java URLDecoder implementation doesn't support those.
The %uXXXX encoding is non-standard and was actually rejected by the W3C, so it is natural that URLDecoder does not understand it.
You can write a small function that fixes it by replacing each occurrence of %uXXYY with %XX%YY in your encoded string. Then you can proceed and decode the fixed string normally.
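A variation on this idea can be sketched as follows. Instead of rewriting %uXXYY into %XX%YY, this hypothetical helper (the class name PercentUDecoder and the method decode are made up for illustration) decodes each %uXXXX token directly into a UTF-16 code unit and treats runs of ordinary %XX escapes as UTF-8 bytes. That last part is an assumption about how the rest of the URL was encoded. Malformed or chopped escapes are left in the output untouched.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK.
final class PercentUDecoder {
    // Alternative 1: non-standard %uXXXX; alternative 2: a run of standard %XX escapes; alternative 3: '+'.
    private static final Pattern ESCAPE =
            Pattern.compile("%u([0-9A-Fa-f]{4})|((?:%[0-9A-Fa-f]{2})+)|\\+");

    static String decode(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());   // copy the literal text before the escape
            if (m.group(1) != null) {
                // %uXXXX is a single UTF-16 code unit
                out.append((char) Integer.parseInt(m.group(1), 16));
            } else if (m.group(2) != null) {
                // Collect the whole run of %XX bytes and decode it as UTF-8 (assumption)
                String run = m.group(2);
                byte[] bytes = new byte[run.length() / 3];
                for (int i = 0; i < bytes.length; i++) {
                    bytes[i] = (byte) Integer.parseInt(run.substring(i * 3 + 1, i * 3 + 3), 16);
                }
                out.append(new String(bytes, StandardCharsets.UTF_8));
            } else {
                out.append(' ');                  // '+' means space in form encoding
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }
}
```

Because the input is scanned in a single pass, a decoded '%' (for example from %u0025) is never interpreted as the start of another escape.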
We started with Vartec's solution but found additional issues. This solution works for UTF-16, but it can be changed to return UTF-8. The replaceAll calls are left in for clarity, and you can read more at http://www.cogniteam.com/wiki/index.php?title=DecodeEncodeJavaScript
static public String unescape(String escaped) throws UnsupportedEncodingException
{
// This code is needed so that the UTF-16 won't be malformed
String str = escaped.replaceAll("%0", "%u000");
str = str.replaceAll("%1", "%u001");
str = str.replaceAll("%2", "%u002");
str = str.replaceAll("%3", "%u003");
str = str.replaceAll("%4", "%u004");
str = str.replaceAll("%5", "%u005");
str = str.replaceAll("%6", "%u006");
str = str.replaceAll("%7", "%u007");
str = str.replaceAll("%8", "%u008");
str = str.replaceAll("%9", "%u009");
str = str.replaceAll("%A", "%u00A");
str = str.replaceAll("%B", "%u00B");
str = str.replaceAll("%C", "%u00C");
str = str.replaceAll("%D", "%u00D");
str = str.replaceAll("%E", "%u00E");
str = str.replaceAll("%F", "%u00F");
// Here we split the 4 byte to 2 byte, so that decode won't fail
String [] arr = str.split("%u");
Vector<String> vec = new Vector<String>();
if(!arr[0].isEmpty())
{
vec.add(arr[0]);
}
for (int i = 1 ; i < arr.length ; i++) {
if(!arr[i].isEmpty())
{
vec.add("%"+arr[i].substring(0, 2));
vec.add("%"+arr[i].substring(2));
}
}
str = "";
for (String string : vec) {
str += string;
}
// Here we return the decoded string
return URLDecoder.decode(str,"UTF-16");
}
After having had a good look at the solution presented by #ariy, I created a Java-based solution that is also resilient against encoded characters that have been chopped into two parts (i.e. half of the encoded character is missing). This happens in my use case, where I need to decode long URLs that are sometimes chopped at a length of 2000 chars. See What is the maximum length of a URL in different browsers?
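import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Pattern;

public class Utils {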
private static Pattern validStandard = Pattern.compile("%([0-9A-Fa-f]{2})");
private static Pattern choppedStandard = Pattern.compile("%[0-9A-Fa-f]{0,1}$");
private static Pattern validNonStandard = Pattern.compile("%u([0-9A-Fa-f][0-9A-Fa-f])([0-9A-Fa-f][0-9A-Fa-f])");
private static Pattern choppedNonStandard = Pattern.compile("%u[0-9A-Fa-f]{0,3}$");
public static String resilientUrlDecode(String input) {
String cookedInput = input;
if (cookedInput.indexOf('%') > -1) {
// Transform all existing UTF-8 standard into UTF-16 standard.
cookedInput = validStandard.matcher(cookedInput).replaceAll("%00%$1");
// Discard chopped encoded char at the end of the line (there is no way to know what it was)
cookedInput = choppedStandard.matcher(cookedInput).replaceAll("");
// Handle non standard (rejected by W3C) encoding that is used anyway by some
// See: https://stackoverflow.com/a/5408655/114196
if (cookedInput.contains("%u")) {
// Transform all existing non standard into UTF-16 standard.
cookedInput = validNonStandard.matcher(cookedInput).replaceAll("%$1%$2");
// Discard chopped encoded char at the end of the line
cookedInput = choppedNonStandard.matcher(cookedInput).replaceAll("");
}
}
try {
return URLDecoder.decode(cookedInput,"UTF-16");
} catch (UnsupportedEncodingException e) {
// Will never happen because the encoding is hardcoded
return null;
}
}
}
Just like the picture, I'd like to convert between the encoded UTF-8 String and Native String in Java.
Would anyone have some suggestions? Thanks a lot!
ps.
For example,
String a = "&#x8FD9;&#x662F;&#x4E00;&#x4E2A;&#x4F8B;&#x5B50;,this is a example";
String b = null;
// block A: processing a, and let b = "这是一个例子,this is a example"
How to implement the "block A"?
Apache Commons Lang StringEscapeUtils.unescapeXml(...) is what you want. Depending on where your original string came from, one of the HTML variants may be more appropriate.
Use like so:
String a = "&#x8FD9;&#x662F;&#x4E00;&#x4E2A;&#x4F8B;&#x5B50;,this is a example";
String b = StringEscapeUtils.unescapeXml(a);
// block A: processing a, and let b = "这是一个例子,this is a example"
System.out.println(a);
System.out.println(b);
Output:
&#x8FD9;&#x662F;&#x4E00;&#x4E2A;&#x4F8B;&#x5B50;,this is a example
这是一个例子,this is a example
There are methods for converting the other way also.
You can use Charset. See the documentation here
Charset.forName("UTF-8").encode(text)
Or
you can also use getBytes() method of 'java.lang.String' Class
text.getBytes(Charset.forName("UTF-8"));
documentation:
public byte[] getBytes(Charset charset)
Encodes this String into a sequence of bytes using the given charset,
storing the result into a
new byte array.
This method always replaces malformed-input and unmappable-character
sequences with this charset's default replacement byte array. The
CharsetEncoder class should be used when more control over the
encoding process is required.
Parameters: charset - The Charset to be used to encode the String
Returns: The resultant byte array
Since:
1.6
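As a small illustration of the two calls above, here is a round-trip sketch. The class name Utf8RoundTrip and its method names are made up for this example. It uses StandardCharsets.UTF_8 (available since Java 7), which avoids the charset lookup by name.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: String -> UTF-8 bytes -> String.
final class Utf8RoundTrip {
    // Encode a String into UTF-8 bytes.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes back into a String.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u8FD9\u662F,this is a example";
        byte[] encoded = toUtf8(text);
        // Each of the two CJK characters takes three bytes in UTF-8, the ASCII part one byte each.
        System.out.println(encoded.length);
        System.out.println(fromUtf8(encoded).equals(text));
    }
}
```

Note that this converts between a String and bytes. It does not produce or parse the &#x...; entity notation from the question.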
To the right are hexadecimal numeric HTML entities.
The Apache Commons library has a StringEscapeUtils class that can convert from that to a String, but the reverse is not obvious (it should be tried, but it might give named entities).
public static void main(String[] args) throws InterruptedException {
String a = "&#x8FD9;&#x662F;&#x4E00;&#x4E2A;&#x4F8B;&#x5B50;,this is a example";
String b = fromHtmlEntities(a);
System.out.println(b);
String a2 = toHtmlEntities(b);
System.out.println(a2.equals(a));
System.out.println(a);
System.out.println(a2);
}
public static String fromHtmlEntities(String s) {
Pattern numericEntityPattern = Pattern.compile("\\&#[Xx]([0-9A-Fa-f]{1,6});");
Matcher m = numericEntityPattern.matcher(s);
StringBuffer sb = new StringBuffer();
while (m.find()) {
int codePoint = Integer.parseInt(m.group(1), 16);
String replacement = new String(new int[] { codePoint }, 0, 1);
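// quoteReplacement keeps a decoded '$' or '\' from being read as a group reference
m.appendReplacement(sb, Matcher.quoteReplacement(replacement));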
}
m.appendTail(sb);
return sb.toString();
}
// Uses java 8
public static String toHtmlEntities(String s) {
int[] codePoints = s.codePoints().flatMap(
(cp) -> cp < 128 // ASCII?
? IntStream.of(cp)
: String.format("&#x%X;", cp).codePoints())
.toArray();
return new String(codePoints, 0, codePoints.length);
}
I am reading data from XML. When I checked in the Eclipse console, I found that I am getting the data with some extra square boxes. For example, if there is 123 in the Excel sheet, I get 123 with some square boxes. I used trim() to remove them, but without success, because trim() only trims whitespace. I found that those characters have byte values of -17, -20 and so on. I don't want to trim only whitespace; I want to trim those square boxes as well.
So I have used the following method to trim those characters and I got success.
What is the more appropriate way of trimming a string?
Trimming a string
String trimData(String accessNum){
StringBuffer sb = new StringBuffer();
try{
if((accessNum != null) && (accessNum.length()>0)){
// Log.i("Settings", accessNum+"Access Number length....."+accessNum.length());
accessNum = accessNum.trim();
byte[] b = accessNum.getBytes();
for(int i=0; i<b.length; i++){
System.out.println(i+"....."+b[i]);
if(b[i]>0){
sb.append((char)(b[i]));
}
}
// Log.i("Settigs", accessNum+"Trimming....");
}}catch(Exception ex){
}
return sb.toString();
}
Edited
use Normalizer (since java 6)
public static final Pattern DIACRITICS_AND_FRIENDS
= Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");
private static String stripDiacritics(String str) {
str = Normalizer.normalize(str, Normalizer.Form.NFD);
str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
return str;
}
And here and here are complete solutions.
And if you only want to remove everything outside the printable ASCII range from a string, use
rawString.replaceAll("[^\\x20-\\x7e]", "")
Ref : replace special characters in string in java and How to remove high-ASCII characters from string like ®, ©, ™ in Java
Try this:
str = (str == null) ? null :
str.replaceAll("[^\\p{Print}\\p{Space}]", "").trim();
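If the goal is to drop only invisible characters while keeping legitimate non-ASCII letters, a narrower filter is possible. The sketch below assumes the square boxes are control or format characters such as a byte order mark (U+FEFF). That is a guess based on the byte value -17 (0xEF) mentioned in the question. The class name InvisibleTrim and the method clean are made up for illustration.

```java
// Hypothetical helper: removes control (Cc) and format (Cf) characters, but keeps ordinary whitespace.
final class InvisibleTrim {
    static String clean(String s) {
        if (s == null) {
            return null;
        }
        // \p{Cc} = control characters, \p{Cf} = format characters (includes U+FEFF);
        // "&&[^\\s]" excludes tab, newline and the other whitespace matched by \s from the removal.
        return s.replaceAll("[\\p{Cc}\\p{Cf}&&[^\\s]]", "").trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("\uFEFF123\u0000 "));
    }
}
```

Unlike the ASCII-only patterns above, this leaves accented letters and other printable Unicode characters in place.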
I have a String with binary data in it (1110100) I want to get the text out so I can print it (1110100 would print "t"). I tried this, it is similar to what I used to transform my text to binary but it's not working at all:
public static String toText(String info)throws UnsupportedEncodingException{
byte[] encoded = info.getBytes();
String text = new String(encoded, "UTF-8");
System.out.println("print: "+text);
return text;
}
Any corrections or suggestions would be much appreciated.
Thanks!
You can use Integer.parseInt with a radix of 2 (binary) to convert the binary string to an integer:
int charCode = Integer.parseInt(info, 2);
Then if you want the corresponding character as a string:
String str = Character.toString((char) charCode);
This is my one (Working fine on Java 8):
String input = "01110100"; // Binary input as String
StringBuilder sb = new StringBuilder(); // Some place to store the chars
Arrays.stream( // Create a Stream
input.split("(?<=\\G.{8})") // Splits the input string into 8-char sections (each encoded character is assumed to take 8 bits = 1 byte)
).forEach(s -> // Go through each 8-char-section...
sb.append((char) Integer.parseInt(s, 2)) // ...and turn it into an int and then to a char
);
String output = sb.toString(); // Output text (t)
and the compressed method printing to console:
Arrays.stream(input.split("(?<=\\G.{8})")).forEach(s -> System.out.print((char) Integer.parseInt(s, 2)));
System.out.print('\n');
I am sure there are "better" ways to do this but this is the smallest one you can probably get.
I know the OP stated that their binary was in a String format but for the sake of completeness I thought I would add a solution to convert directly from a byte[] to an alphabetic String representation.
As casablanca stated you basically need to obtain the numerical representation of the alphabetic character. If you are trying to convert anything longer than a single character it will probably come as a byte[] and instead of converting that to a string and then using a for loop to append the characters of each byte you can use ByteBuffer and CharBuffer to do the lifting for you:
public static String bytesToAlphabeticString(byte[] bytes) {
CharBuffer cb = ByteBuffer.wrap(bytes).asCharBuffer();
return cb.toString();
}
N.B. asCharBuffer() reads two bytes per char (big-endian by default), so the bytes are interpreted as UTF-16BE
Alternatively using the String constructor:
String text = new String(bytes, 0, bytes.length, "ASCII");
public static String binaryToText(String binary) {
return Arrays.stream(binary.split("(?<=\\G.{8})"))/* regex to split the bits array by 8*/
.parallel()
.map(eightBits -> (char)Integer.parseInt(eightBits, 2))
.collect(
StringBuilder::new,
StringBuilder::append,
StringBuilder::append
).toString();
}
Here is the answer.
private String[] splitByNumber(String s, int size) {
return s.split("(?<=\\G.{"+size+"})");
}
The other way around (Where "info" is the input text and "s" the binary version of it)
byte[] bytes = info.getBytes();
BigInteger bi = new BigInteger(bytes);
String s = bi.toString(2);
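One caveat with the BigInteger approach is that leading zero bits are dropped ("t" becomes 1110100, seven digits), so the result cannot be split into 8-bit groups reliably. A sketch that pads every byte to eight digits could look like this. The class name TextToBinary is made up, and the input is assumed to be ASCII.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: text -> binary string with exactly 8 digits per byte.
final class TextToBinary {
    static String toBinary(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.US_ASCII)) {
            // Mask to 0..255, then left-pad the binary digits with zeros to width 8.
            String bits = Integer.toBinaryString(b & 0xFF);
            sb.append(String.format("%8s", bits).replace(' ', '0'));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBinary("t")); // prints 01110100
    }
}
```

The output of this method can be fed directly into the 8-bit splitting solutions shown earlier.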
Look at the parseInt function. You may also need a cast and the Character.toString function.
Also you can use alternative solution without streams and regular expressions (based on casablanca's answer):
public static String binaryToText(String binaryString) {
StringBuilder stringBuilder = new StringBuilder();
int charCode;
for (int i = 0; i < binaryString.length(); i += 8) {
charCode = Integer.parseInt(binaryString.substring(i, i + 8), 2);
String returnChar = Character.toString((char) charCode);
stringBuilder.append(returnChar);
}
return stringBuilder.toString();
}
you just need to append the specified character as a string to character sequence.
We are accepting all sorts of national characters in a UTF-8 string on the input, and we need to convert them to an ASCII string on the output for some legacy use. (We don't accept Chinese and Japanese chars, only European languages.)
We have a small utility to get rid of all the diacritics:
public static final String toBaseCharacters(final String sText) {
if (sText == null || sText.length() == 0)
return sText;
final char[] chars = sText.toCharArray();
final int iSize = chars.length;
final StringBuilder sb = new StringBuilder(iSize);
for (int i = 0; i < iSize; i++) {
String sLetter = new String(new char[] { chars[i] });
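// NFD decomposes an accented letter into base letter + combining marks, so the first byte below is the base letter
sLetter = Normalizer.normalize(sLetter, Normalizer.Form.NFD);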
try {
byte[] bLetter = sLetter.getBytes("UTF-8");
sb.append((char) bLetter[0]);
} catch (UnsupportedEncodingException e) {
}
}
return sb.toString();
}
The question is how to replace the German sharp s (ß) and other characters such as Đ and đ that get through the above normalization method with their substitutes (in the case of ß, the substitute would probably be "ss", and in the case of Đ it would be either "D" or "Dj").
Is there some simple way to do it, without a million .replaceAll() calls?
So for example: Đonardan = Djonardan, Blaß = Blass and so on.
We can replace all "problematic" chars with empty space, but would like to avoid this to make the output as similar to the input as possible.
Thank you for your answers,
Bozo
You want to use ICU4J. It includes the com.ibm.icu.text.Transliterator class, which apparently can do what you are looking for.
Here's my converter which uses lucene...
private final KeywordTokenizer keywordTokenizer = new KeywordTokenizer(new StringReader(""));
private final ASCIIFoldingFilter asciiFoldingFilter = new ASCIIFoldingFilter(keywordTokenizer);
private final TermAttribute termAttribute = (TermAttribute) asciiFoldingFilter.getAttribute(TermAttribute.class);
public String process(String line)
{
if (line != null)
{
try
{
keywordTokenizer.reset(new StringReader(line));
if (asciiFoldingFilter.incrementToken())
{
return termAttribute.term();
}
}
catch (IOException e)
{
logger.warn("Failed to parse: " + line, e);
}
}
return null;
}
I'm using something like this:
Transliterator transliterator = Transliterator.getInstance("Any-Latin; Upper; Lower; NFD; [:Nonspacing Mark:] Remove; NFC", Transliterator.FORWARD);
Is there some simple way to do it, without a million .replaceAll() calls?
If you just support European, Latin-based languages, around 100 should be enough; that's definitely doable: Grab the Unicode charts for Latin-1 Supplement and Latin Extended-A and get the String.replace party started. :-)
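A JDK-only sketch of that approach, using a lookup table instead of chained replace calls: characters with a canonical decomposition are handled by Normalizer, and the few that have none (ß, Đ, đ and similar) come from a small map. The class name AsciiFolder, the method fold, the choice of substitutes and the '?' fallback are all assumptions for illustration; a real table would need the roughly 100 entries mentioned above.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: folds European text to ASCII with a substitution table.
final class AsciiFolder {
    private static final Map<Character, String> SPECIAL = new HashMap<>();
    static {
        SPECIAL.put('\u00DF', "ss"); // ß
        SPECIAL.put('\u0110', "Dj"); // Đ
        SPECIAL.put('\u0111', "dj"); // đ
        SPECIAL.put('\u00C6', "AE"); // Æ
        SPECIAL.put('\u00E6', "ae"); // æ
        SPECIAL.put('\u00D8', "O");  // Ø
        SPECIAL.put('\u00F8', "o");  // ø
        SPECIAL.put('\u0141', "L");  // Ł
        SPECIAL.put('\u0142', "l");  // ł
    }

    static String fold(String text) {
        // NFD splits e.g. é into 'e' followed by a combining acute accent.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            String substitute = SPECIAL.get(c);
            if (substitute != null) {
                sb.append(substitute);
            } else if (Character.getType(c) == Character.NON_SPACING_MARK) {
                // drop the combining diacritic
            } else if (c < 128) {
                sb.append(c);
            } else {
                sb.append('?'); // no mapping known (fallback chosen for this sketch)
            }
        }
        return sb.toString();
    }
}
```

A single pass over the string with a map lookup keeps the cost linear, regardless of how many substitutions the table holds.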
I am reading an XML document (UTF-8) and ultimately displaying the content on a Web page using ISO-8859-1. As expected, there are a few characters that are not displayed correctly, such as “, – and ’ (they display as ?).
Is it possible to convert these characters from UTF-8 to ISO-8859-1?
Here is a snippet of code I have written to attempt this:
BufferedReader br = new BufferedReader(new InputStreamReader(urlConnection.getInputStream(), "UTF-8"));
StringBuilder sb = new StringBuilder();
String line = null;
while ((line = br.readLine()) != null) {
sb.append(line);
}
br.close();
byte[] latin1 = sb.toString().getBytes("ISO-8859-1");
return new String(latin1);
I'm not quite sure what's going awry, but I believe it's readLine() that's causing the grief (since the strings would be Java/UTF-16 encoded?). Another variation I tried was to replace latin1 with
byte[] latin1 = new String(sb.toString().getBytes("UTF-8")).getBytes("ISO-8859-1");
I have read previous posts on the subject and I'm learning as I go. Thanks in advance for your help.
I'm not sure if there is a normalization routine in the standard library that will do this. I do not think conversion of "smart" quotes is handled by the standard Unicode normalizer routines - but don't quote me.
The smart thing to do is to dump ISO-8859-1 and start using UTF-8. That said, it is possible to encode any normally allowed Unicode code point into an HTML page encoded as ISO-8859-1. You can encode them using escape sequences as shown here:
public final class HtmlEncoder {
private HtmlEncoder() {}
public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
T out) throws java.io.IOException {
for (int i = 0; i < sequence.length(); i++) {
char ch = sequence.charAt(i);
if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
out.append(ch);
} else {
int codepoint = Character.codePointAt(sequence, i);
// handle supplementary range chars
i += Character.charCount(codepoint) - 1;
// emit entity
out.append("&#x");
out.append(Integer.toHexString(codepoint));
out.append(";");
}
}
return out;
}
}
Example usage:
String foo = "This is Cyrillic Ya: \u044F\n"
+ "This is fraktur G: \uD835\uDD0A\n" + "This is a smart quote: \u201C";
StringBuilder sb = HtmlEncoder.escapeNonLatin(foo, new StringBuilder());
System.out.println(sb.toString());
Above, the character LEFT DOUBLE QUOTATION MARK ( U+201C “ ) is encoded as &#x201c;. A couple of other arbitrary code points are likewise encoded.
Care needs to be taken with this approach. If your text needs to be escaped for HTML, that needs to be done before the above code or the ampersands end up being escaped.
Depending on your default encoding, the following lines could cause problems:
byte[] latin1 = sb.toString().getBytes("ISO-8859-1");
return new String(latin1);
In Java, String/char is always UTF-16. A different encoding is only involved when you convert characters to bytes or bytes to characters. Say your default encoding is UTF-8: new String(latin1) then treats the latin1 buffer as UTF-8, some Latin-1 byte sequences form invalid UTF-8, and those bytes are not decoded correctly.
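A quick way to see the effect described above is the following sketch (the class name Latin1Demo is made up). A character that exists in ISO-8859-1 survives a round trip when the same charset is named on both sides, while a character that does not exist in ISO-8859-1, such as the smart quote U+201C, is replaced by '?' at encoding time.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: encode to ISO-8859-1 and decode with the same charset.
final class Latin1Demo {
    static String roundTrip(String s) {
        // getBytes(Charset) replaces unmappable characters with the charset's replacement byte ('?')
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        // Name the charset explicitly instead of relying on the platform default
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("caf\u00E9").equals("caf\u00E9")); // true: é exists in Latin-1
        System.out.println(roundTrip("\u201C"));                        // ?: the smart quote does not
    }
}
```

This is why the smart quotes in the question can only reach an ISO-8859-1 page as escape sequences; no byte conversion can represent them directly.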
With Java 8, McDowell's answer can be simplified like this (while preserving correct handling of surrogate pairs):
public final class HtmlEncoder {
private HtmlEncoder() {
}
public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
T out) throws java.io.IOException {
for (PrimitiveIterator.OfInt iterator = sequence.codePoints().iterator(); iterator.hasNext(); ) {
int codePoint = iterator.nextInt();
if (Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.BASIC_LATIN) {
out.append((char) codePoint);
} else {
out.append("&#x");
out.append(Integer.toHexString(codePoint));
out.append(";");
}
}
return out;
}
}
When you instantiate your String object, you need to indicate which encoding to use.
So replace:
return new String(latin1);
by
return new String(latin1, "ISO-8859-1");