I have a string in format AB123. I want to split it between the AB and 123 so AB123 becomes AB 123. The contents of the string can differ but the format stays the same. Is there a way to do this?
Following up with the latest information you provided (2 letters then 3 numbers):
myString.substring(0, 2) + " " + myString.substring(2)
What this does: you split your input string myString at the 2nd character and append a space at this position.
Explanation: \D represents a non-digit and \d represents a digit in a regular expression. The regex combines a lookbehind, (?<=\D), with a lookahead, (?=\d), so the split happens at the zero-width position between the letters and the number.
String string = "AB123";
String[] split = string.split("(?<=\\D)(?=\\d)");
System.out.println(split[0]+" "+split[1]);
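If only the spaced string is needed, the same lookaround boundary can be passed to replaceAll, which avoids indexing into the split array. A minimal sketch; the class and method names are made up for illustration:

```java
final class DigitBoundary {
    // Zero-width position with a non-digit behind it and a digit ahead of it.
    private static final String BOUNDARY = "(?<=\\D)(?=\\d)";

    // Inserts a space at every position where a non-digit is followed by a digit.
    static String insertSpace(String input) {
        return input.replaceAll(BOUNDARY, " ");
    }

    public static void main(String[] args) {
        System.out.println(insertSpace("AB123"));
        System.out.println(insertSpace("abcd1234"));
    }
}
```

Because the boundary is zero-width, no characters are consumed; strings without a letter-to-digit transition come back unchanged.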
Try
String a = "abcd1234";
int i;
for(i = 0; i < a.length(); i++){
char c = a.charAt(i);
if( '0' <= c && c <= '9' )
break;
}
String alphaPart = a.substring(0, i);
String numberPart = a.substring(i);
Hope this helps
Although I would personally use the method provided in @RakeshMothukur's answer, since it also works when the letter or digit counts increase/decrease later on, I wanted to provide an additional method to insert the space between the two letters and three digits:
String str = "AB123";
StringBuilder sb = new StringBuilder(str);
sb.insert(2, " "); // Insert a space at 0-based index 2; a.k.a. after the first 2 characters
String result = sb.toString(); // Convert the StringBuilder back to a String
Here you go. I wrote it in a very simple way to make things clear.
What it does: after it takes the user input, it converts the string into a char array and checks each character to see whether it is an INT or a non-INT.
In each iteration it compares that type with the previous character's type and prints accordingly.
Alternate Solutions
1) Using ASCII range (difficulty = easy)
2) Override a method and check 2 variables at a time. (difficulty = Intermediate)
import java.io.BufferedReader;
import java.io.InputStreamReader;
public class Main {
public static void main(String[] args) throws Exception {
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
char[] s = br.readLine().toCharArray();
int prevflag, flag = 0;
for (int i = 0; i < s.length; i++) {
int a = Character.getNumericValue(s[i]);
String b = String.valueOf(s[i]);
prevflag = flag;
flag = checktype(a, b);
if ((prevflag == flag) || (i == 0))
System.out.print(s[i]);
else
System.out.print(" " + s[i]);
}
}
public static int checktype(int x, String y) {
int flag = 0;
if (String.valueOf(x).equals(y))
flag = 1; // INT
else
flag = 2; // non INT
return flag;
}
}
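The same idea (emit a space whenever the character type changes) can be sketched more compactly with Character.isDigit and a StringBuilder. This is an illustrative variant, not the code above, and the names are hypothetical:

```java
final class TypeChangeSplitter {
    // Inserts a space wherever a digit follows a non-digit or a non-digit follows a digit.
    static String split(String input) {
        StringBuilder out = new StringBuilder(input.length() + 4);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean typeChanged = i > 0
                    && Character.isDigit(c) != Character.isDigit(input.charAt(i - 1));
            if (typeChanged) {
                out.append(' ');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(split("AB123"));
    }
}
```

Returning the result instead of printing inside the loop makes the method easy to reuse and to test.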
I was waiting for a compile to finish before heading out, so threw together a slightly over-engineered example with basic error checking and a test.
import java.text.ParseException;
import java.util.LinkedList;
public class Main {
static public class ParsedData {
public final String prefix;
public final Integer number;
public ParsedData(String _prefix, Integer _number) {
prefix = _prefix;
number = _number;
}
@Override
public String toString() {
return prefix + "\t" + number.toString();
}
}
static final String TEST_DATA[] = {"AB123", "JX7272", "FX402", "ADF123", "JD3Q2", "QB778"};
public static void main(String[] args) {
parseDataArray(TEST_DATA);
}
public static ParsedData[] parseDataArray(String[] inputs) {
LinkedList<ParsedData> results = new LinkedList<ParsedData>();
for (String s : inputs) {
try {
System.out.println("Parsing: " + s);
if (s.length() != 5) throw new ParseException("Input Length incorrect: " + s.length(), 0);
String _prefix = s.substring(0, 2);
Integer _num = Integer.parseInt(s.substring(2));
results.add(new ParsedData(_prefix, _num));
} catch (ParseException | NumberFormatException e) {
System.out.printf("\"%s\", %s\n", s, e.toString());
}
}
return results.toArray(new ParsedData[results.size()]);
}
}
Main
public class Main
{
public static void main(String[] args)
{
System.out.println(Dupe.Eliminate("Testing UppeR and loweR"));
System.out.println(Dupe.Eliminate("UppeR is BetteR"));
}
}
Class
public class Dupe
{
public static String Eliminate(String input)
{
char[] chrArray = input.toCharArray();
String letter ="";
for (char value:chrArray){
if (letter.indexOf(value) == -1){
letter += value;
}
}
return letter;
}
}
I am trying to eliminate duplicate letters e.g. Hello would be Helo. Which I have achieved, however, what I want to implement is that it won't matter if it's uppercase or lowercase, it will still be classed as a duplicate so Hehe would be He, not Heh. Should I .equals... each individual letter or is there an efficient way? sorry for asking if it's simple question for you guys.
This is how I would approach this. This might not be the most efficient way to do it, but you can try this.
public class Main
{
public static void main(String[] args)
{
System.out.println(Dupe.Eliminate("Testing UppeR and loweR"));
}
}
class Dupe
{
public static String Eliminate(String input)
{
char[] chrArray = input.toCharArray();
String letter ="";
for(int index = 0; index < chrArray.length; index++)
{
int j = 0;
boolean flag = true;
//this while loop is used to check if the next character is already existed in the string (ignoring the uppercase or lowercase)
while(j < letter.length())
{
if(Character.toLowerCase(chrArray[index]) == Character.toLowerCase(letter.charAt(j))) // compare both characters in lower case so the check works in either direction
{
flag = false;
break;
}
else
j++;
}
if(flag == true)
{
letter += chrArray[index];
}
}
return letter;
}
}
you can have 2 checks in place with upper case and lower case characters:
public static String Eliminate(String input)
{
char[] chrArray = input.toCharArray();
String letter ="";
for (char value:chrArray){
if (letter.indexOf(Character.toLowerCase(value)) == -1 && letter.indexOf(Character.toUpperCase(value)) == -1){
letter += value;
}
}
return letter;
}
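Another way to get the same effect is to remember the lower-cased form of every character already kept. A minimal sketch, assuming the first occurrence should keep its original case; the class name is hypothetical:

```java
import java.util.HashSet;
import java.util.Set;

final class CaseInsensitiveDupe {
    // Keeps the first occurrence of each character, ignoring case for the comparison.
    static String eliminate(String input) {
        Set<Character> seen = new HashSet<>();
        StringBuilder kept = new StringBuilder();
        for (char value : input.toCharArray()) {
            // Set.add returns false when the lower-cased character was seen before.
            if (seen.add(Character.toLowerCase(value))) {
                kept.append(value);
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        System.out.println(eliminate("Hehe"));
    }
}
```

The set lookup avoids rescanning the result string for every character, and the StringBuilder avoids repeated string concatenation.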
Here you go, this will replace all duplicate characters no matter how many in the sequence.
public static void main(String[] args)
{
String duped = "aaabbccddeeffgg";
final Pattern p = Pattern.compile("(\\w)\\1+");
final Matcher m = p.matcher(duped);
while (m.find())
System.out.println("Duplicate character " + (duped = duped.replaceAll(m.group(), m.group(1))));
}
If you are looking for duplicates like: abacd to replace both a's, try this as the regex given in Pattern.compile(".*([0-9A-Za-z])\\1+.*")
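If runs of repeated characters should also be collapsed regardless of case, the inline flag (?i) makes the back-reference match case-insensitively. A small sketch under that assumption; it only handles consecutive repeats, and the names are illustrative:

```java
final class RunCollapser {
    // (?i) turns on case-insensitive matching, which also applies to the \1 back-reference.
    static String collapse(String input) {
        return input.replaceAll("(?i)(\\w)\\1+", "$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("aaabbccddeeffgg"));
        System.out.println(collapse("aaBbcc"));
    }
}
```

A single replaceAll call handles every run in one pass, so no explicit Matcher loop is needed.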
Here's another (stateful) way to do it:-
String s = "Hehe";
Set<String> found = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
String result = s.chars()
.mapToObj(c -> "" + (char) c)
.filter(found::add)
.collect(Collectors.joining());
System.out.println(result);
Output: He
For records:
X means anything
Y means year
M means month
N means numeric
A means alphabet
For example:
my input mask from database is like this:
XXXYMXXXXXA
and my input is:
39JY412345O
I want to check whether this input is valid, but I can't check it against the mask directly. I want to replace the mask with a regular expression like this one:
/^.{3}Y[0-9]{1}.{5}[a-zA-Z]{1}$/
I don't have a regular expression, only the input mask. My input validation uses regular expressions to decide whether an input is valid, so I need to turn each input mask (there are about 200 kinds) into its regular expression and use that for validation.
I need to write a method that translates from an input mask (such as "XXXYMXXXXXA") to a regex in the java.util.regex.Pattern format (such as ".{3}Y[0-9]{1}.{5}[a-zA-Z]{1}")
This is my method code (but I am looking for best practice for this solution):
private String replaceAll(String pattern, String value, String replaceValue) {
String str = value;
str = str.replaceAll(pattern, replaceValue.concat("{").concat("1").concat("}"));
return str;
}
and method calls:
String anything = "[Xx]";
String alphabet = "[Aa]";
String number = "[Nn]";
String word = getName();
word = replaceAll(anything, word, ".");
word = replaceAll(alphabet, word, "[A-Za-z]");
word = replaceAll(number, word, "[0-9]");
Assuming a general approach, there is a mapping between one char in the mask (e.g. 'X') to one part of a regular expression (e.g. '.'), and recurrent mask chars result in a numeric quantifier (like {3}).
So I've put together a helper class, and a simple test method, so maybe this is a point to start from.
Helper class:
import java.util.HashMap;
import java.util.Map;
public class PatternBuilder {
protected Map<Character, String> mappings = new HashMap<Character, String>();
protected boolean caseSensitive = false;
public PatternBuilder() {
}
public PatternBuilder(boolean caseSensitive) {
this.caseSensitive = caseSensitive;
}
public PatternBuilder addDefinition(char input, String mapping) {
if (this.caseSensitive) {
this.mappings.put(input, mapping);
} else {
this.mappings.put(Character.toLowerCase(input), mapping);
}
return this;
}
public String buildRegexPattern(String mask) {
if ((mask == null) || (mask.length() == 0)) {
return "";
}
StringBuilder patternBuffer = new StringBuilder();
char lastChar = 0;
int count = 0;
for (int i = 0; i < mask.length(); i++) {
char c = mask.charAt(i);
if (this.caseSensitive == false) {
c = Character.toLowerCase(c);
}
if (c != lastChar) {
if (count > 0) {
String mapped = mappings.get(lastChar);
if (mapped == null) {
// mapping for char not defined
return "";
}
patternBuffer.append(mapped);
patternBuffer.append("{").append(count).append("}");
}
lastChar = c;
count = 1;
} else {
count++;
}
}
if (count > 0) {
String mapped = mappings.get(lastChar);
if (mapped == null) {
mapped = ".";
}
patternBuffer.append(mapped);
patternBuffer.append("{").append(count).append("}");
}
return patternBuffer.toString();
}
}
Usage:
PatternBuilder patternBuilder = new PatternBuilder()
.addDefinition('X', ".")
.addDefinition('Y', "Y")
.addDefinition('M', "[0-9]")
.addDefinition('N', "\\d")
.addDefinition('A', "[a-zA-Z]");
String rePattern = patternBuilder.buildRegexPattern("XxxYMXXXXXA"); // case insensitive, x == X
System.out.println("Pattern: '" + rePattern + "'");
Pattern p = Pattern.compile(rePattern);
String[] tests = new String[]{
"39JY412345O", // Original, match
"39JY41234FO", // replaced 5 with F, still matching
"39JY4123457", // replaced O with 7, no match
"A9JY4123457" // replaced 3 with A, no match
};
for (String s : tests) {
Matcher m = p.matcher(s);
System.out.println("Test '" + s + "': " + m.matches());
}
My output:
Pattern: '.{3}Y{1}[0-9]{1}.{5}[a-zA-Z]{1}'
Test '39JY412345O': true
Test '39JY41234FO': true
Test '39JY4123457': false
Test 'A9JY4123457': false
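For a fixed set of mask letters, the same run-length idea fits in one small class. A compact sketch, assuming the mask alphabet from the question (X, Y, M, N, A), the same letter-to-fragment mapping as the usage example, and that an unknown letter should be rejected; class and method names are hypothetical:

```java
final class MaskToRegex {
    // Maps one mask letter to its regex fragment (assumed alphabet: X, Y, M, N, A).
    private static String fragment(char maskChar) {
        switch (maskChar) {
            case 'X': return ".";
            case 'Y': return "Y";
            case 'M':
            case 'N': return "[0-9]";
            case 'A': return "[a-zA-Z]";
            default:
                throw new IllegalArgumentException("Unknown mask character: " + maskChar);
        }
    }

    // Collapses each run of identical mask letters into fragment{runLength}.
    static String toRegex(String mask) {
        StringBuilder regex = new StringBuilder();
        int i = 0;
        while (i < mask.length()) {
            char current = Character.toUpperCase(mask.charAt(i));
            int runStart = i;
            while (i < mask.length() && Character.toUpperCase(mask.charAt(i)) == current) {
                i++;
            }
            regex.append(fragment(current)).append('{').append(i - runStart).append('}');
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        String regex = toRegex("XXXYMXXXXXA");
        System.out.println(regex);
        System.out.println("39JY412345O".matches(regex));
    }
}
```

Throwing on an unknown letter is a design choice for this sketch; it surfaces a bad mask early instead of silently producing an empty pattern.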
I need to convert Arabic/Persian Numbers to its English equal (for example convert "۲" to "2")
How can I do this?
I suggest you have a ten digit lookup String and replace all the digits one at a time.
public static void main(String... args) {
System.out.println(arabicToDecimal("۴۲"));
}
//used in Persian apps
private static final String extendedArabic = "\u06f0\u06f1\u06f2\u06f3\u06f4\u06f5\u06f6\u06f7\u06f8\u06f9";
//used in Arabic apps
private static final String arabic = "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669";
private static String arabicToDecimal(String number) {
char[] chars = new char[number.length()];
for(int i=0;i<number.length();i++) {
char ch = number.charAt(i);
if (ch >= 0x0660 && ch <= 0x0669)
ch -= 0x0660 - '0';
else if (ch >= 0x06f0 && ch <= 0x06F9)
ch -= 0x06f0 - '0';
chars[i] = ch;
}
return new String(chars);
}
prints
42
The reason for using the strings as a lookup is that other characters such as . - , would be left as is. In fact a decimal number would be unchanged.
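The lookup-string idea can be sketched as follows: the index of a character inside either ten-digit string is its numeric value, and anything not found is copied unchanged. The class and method names are illustrative:

```java
final class DigitLookup {
    // Extended Arabic-Indic digits (used for Persian), in order 0..9.
    private static final String EXTENDED_ARABIC =
            "\u06f0\u06f1\u06f2\u06f3\u06f4\u06f5\u06f6\u06f7\u06f8\u06f9";
    // Arabic-Indic digits, in order 0..9.
    private static final String ARABIC =
            "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669";

    static String toAsciiDigits(String number) {
        StringBuilder out = new StringBuilder(number.length());
        for (int i = 0; i < number.length(); i++) {
            char ch = number.charAt(i);
            int value = EXTENDED_ARABIC.indexOf(ch);
            if (value < 0) {
                value = ARABIC.indexOf(ch);
            }
            // A found index is the digit's value; other characters pass through untouched.
            out.append(value >= 0 ? (char) ('0' + value) : ch);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAsciiDigits("\u06f4\u06f2")); // Persian digits four, two
    }
}
```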
I achieved this with the java.math.BigDecimal class. Below is the code snippet:
String arabicNumerals = "۴۲۴۲.۴۲";
String englishNumerals = new BigDecimal(arabicNumerals).toString();
System.out.println("Number In Arabic : "+arabicNumerals);
System.out.println("Number In English : "+englishNumerals);
Result
Number In Arabic : ۴۲۴۲.۴۲
Number In English : 4242.42
NB: The above code will not work if there are any characters other than numeric digits in arabicNumerals; for example, ۴,۲۴۲.۴۲ will result in a java.lang.NumberFormatException. You can strip such characters first, for example with a check based on Character.isDigit(char ch), and then use the code above. All normal cases work.
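One way to do that clean-up is to keep only digits, the decimal point and the minus sign before handing the text to BigDecimal. A minimal sketch, assuming every other character is a grouping mark that can be dropped; the names are hypothetical:

```java
import java.math.BigDecimal;

final class LenientDecimal {
    static String toEnglish(String numerals) {
        StringBuilder cleaned = new StringBuilder();
        for (int i = 0; i < numerals.length(); i++) {
            char ch = numerals.charAt(i);
            // Keep Unicode digits, '.' and '-'; drop everything else (assumed grouping marks).
            if (Character.isDigit(ch) || ch == '.' || ch == '-') {
                cleaned.append(ch);
            }
        }
        // BigDecimal accepts any character for which Character.isDigit is true.
        return new BigDecimal(cleaned.toString()).toString();
    }

    public static void main(String[] args) {
        // Persian "4,242.42" with an ASCII comma and point
        System.out.println(toEnglish("\u06f4,\u06f2\u06f4\u06f2.\u06f4\u06f2"));
    }
}
```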
I found a simpler and faster way which includes the two arabic code pages too.
public static String convertToEnglishDigits(String value)
{
String newValue = value.replace("١", "1").replace("٢", "2").replace("٣", "3").replace("٤", "4").replace("٥", "5")
.replace("٦", "6").replace("٧", "7").replace("٨", "8").replace("٩", "9").replace("٠", "0")
.replace("۱", "1").replace("۲", "2").replace("۳", "3").replace("۴", "4").replace("۵", "5")
.replace("۶", "6").replace("۷", "7").replace("۸", "8").replace("۹", "9").replace("۰", "0");
return newValue;
}
It will return the numbers in English format, or vice versa if you swap the arguments of each replace, e.g. from ("۰", "0") to ("0", "۰").
Try this guys:
/**
* Utility class to detect arabic languages and convert numbers into arabic digits.
*
* @author Ahmed Shakil
* @date 09-24-2012
*/
public final class ArabicUtil {
private static final char[] DIGITS = {'\u0660','\u0661','\u0662','\u0663','\u0664','\u0665','\u0666','\u0667','\u0668','\u0669'};
/**
* Returns <code>true</code> if the provided language code uses arabic characters; otherwise <code>false</code>.
* @param lang ISO language code.
* @return <code>true</code> if the provided language code uses arabic characters; otherwise <code>false</code>
*/
public static boolean isArabic (String lang) {
return "ar".equals(lang) || "fa".equals(lang) || "ur".equals(lang);
}
/**
* Convert digits in the specified string to arabic digits.
*/
public static String convertDigits (String str) {
if (str == null || str.length() == 0) return str;
char[] s = new char[str.length()];
for(int i =0;i<s.length;i++)
s[i] = toDigit( str.charAt( i ) );
return new String(s);
}
/**
* Convert single digit in the specified string to arabic digit.
*/
public static char toDigit (char ch) {
int n = Character.getNumericValue( (int)ch );
return n >=0 && n < 10 ? DIGITS[n] : ch;
}
/**
* Convert an int into arabic string.
*/
public static String toString (int num) {
return convertDigits( Integer.toString( num ) );
}
}
BTW there is a difference between arabic digits vs. urdu/farsi:
Arabic:
private static final char[] ARABIC = {'\u0660', '\u0661', '\u0662', '\u0663', '\u0664', '\u0665', '\u0666', '\u0667', '\u0668', '\u0669'};
Urdu or Farsi:
private static final char[] URDU_FARSI = {'\u06f0', '\u06f1', '\u06f2', '\u06f3', '\u06f4', '\u06f5', '\u06f6', '\u06f7', '\u06f8', '\u06f9'};
First make it work, then make it look nice ;-)
public static char persianDigitToEnglish(char persianDigit) {
return (char) (((int)persianDigit) - ((int)'۲' - (int)'2'));
}
Works for 2, unfortunately I don't know other Persian digits, could You give it a try?
assertThat(persianDigitToEnglish('۲')).isEqualTo('2');
EDIT: (based on Peter Lawrey String version, but uses StringBuilder)
public static String persianDigitToEnglish(String persianNumber) {
StringBuilder chars = new StringBuilder(persianNumber.length());
for (int i = 0; i < persianNumber.length(); i++)
chars.append(persianDigitToEnglish(persianNumber.charAt(i)));
return chars.toString();
}
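private static char persianDigitToEnglish(char persianDigit) {
// Shift only Persian digits (U+06F0..U+06F9); leave every other character unchanged
if (persianDigit < '\u06F0' || persianDigit > '\u06F9') return persianDigit;
return (char) (((int)persianDigit) - ((int)'۲' - (int)'2'));
}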
so trivial answer:
public static String convertNumbersToPersian(String str)
{
String answer = str;
answer = answer.replace("1","١");
answer = answer.replace("2","٢");
answer = answer.replace("3","٣");
answer = answer.replace("4","٤");
answer = answer.replace("5","٥");
answer = answer.replace("6","٦");
answer = answer.replace("7","٧");
answer = answer.replace("8","٨");
answer = answer.replace("9","٩");
answer = answer.replace("0","٠");
return answer;
}
and
public static String convertNumbersToEnglish(String str) {
String answer = str;
answer = answer.replace("١", "1");
answer = answer.replace("٢", "2");
answer = answer.replace("٣", "3");
answer = answer.replace("٤", "4");
answer = answer.replace("٥", "5");
answer = answer.replace("٦", "6");
answer = answer.replace("٧", "7");
answer = answer.replace("٨", "8");
answer = answer.replace("٩", "9");
answer = answer.replace("٠", "0");
return answer;
}
Character.getNumericValue(ch) saved my life, generic solution for any locale.
static String replaceNonstandardDigits(String input) {
if (input == null || input.isEmpty()) {
return input;
}
StringBuilder builder = new StringBuilder();
for (int i = 0; i < input.length(); i++) {
char ch = input.charAt(i);
if (Character.isDigit(ch) && !(ch >= '0' && ch <= '9')) {
int numericValue = Character.getNumericValue(ch);
if (numericValue >= 0) {
builder.append(numericValue);
}
} else {
builder.append(ch);
}
}
return builder.toString();
}
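I think the best way is to change the Locale to what you want, for example,
for a double number:
NumberFormat fmt = NumberFormat.getNumberInstance(Locale.US);
double d = fmt.parse(s).doubleValue(); // parse(String) throws ParseException on unparsable input
for a String:
String formatted = NumberFormat.getNumberInstance(Locale.US).format(d); // format expects a number, not a String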
or DecimalFormat:
double num;
DecimalFormat df = new DecimalFormat("###.###");
df.setDecimalFormatSymbols(new DecimalFormatSymbols(Locale.US));
String s = df.format(num);
While I was looking for the most performant solution I mixed Kishath and Sileria answers and came up with a clean and fast result:
public class StringLocalizer {
private static final char[] ENGLISH_NUMBERS = {'\u0030', '\u0031', '\u0032', '\u0033', '\u0034', '\u0035', '\u0036', '\u0037', '\u0038', '\u0039'};
private static final char[] PERSIAN_NUMBERS = {'\u06f0', '\u06f1', '\u06f2', '\u06f3', '\u06f4', '\u06f5', '\u06f6', '\u06f7', '\u06f8', '\u06f9'};
private static final char[] ARABIC_NUMBERS = {'\u0660', '\u0661', '\u0662', '\u0663', '\u0664', '\u0665', '\u0666', '\u0667', '\u0668', '\u0669'};
public static String on(String input) {
String lang = Locale.getDefault().getLanguage();
boolean isPersian = "fa".equals(lang) || "ur".equals(lang);
boolean isArabic = "ar".equals(lang);
if (isPersian) return input
.replace(ENGLISH_NUMBERS[0], PERSIAN_NUMBERS[0])
.replace(ENGLISH_NUMBERS[1], PERSIAN_NUMBERS[1])
.replace(ENGLISH_NUMBERS[2], PERSIAN_NUMBERS[2])
.replace(ENGLISH_NUMBERS[3], PERSIAN_NUMBERS[3])
.replace(ENGLISH_NUMBERS[4], PERSIAN_NUMBERS[4])
.replace(ENGLISH_NUMBERS[5], PERSIAN_NUMBERS[5])
.replace(ENGLISH_NUMBERS[6], PERSIAN_NUMBERS[6])
.replace(ENGLISH_NUMBERS[7], PERSIAN_NUMBERS[7])
.replace(ENGLISH_NUMBERS[8], PERSIAN_NUMBERS[8])
.replace(ENGLISH_NUMBERS[9], PERSIAN_NUMBERS[9]);
else if (isArabic) return input
.replace(ENGLISH_NUMBERS[0], ARABIC_NUMBERS[0])
.replace(ENGLISH_NUMBERS[1], ARABIC_NUMBERS[1])
.replace(ENGLISH_NUMBERS[2], ARABIC_NUMBERS[2])
.replace(ENGLISH_NUMBERS[3], ARABIC_NUMBERS[3])
.replace(ENGLISH_NUMBERS[4], ARABIC_NUMBERS[4])
.replace(ENGLISH_NUMBERS[5], ARABIC_NUMBERS[5])
.replace(ENGLISH_NUMBERS[6], ARABIC_NUMBERS[6])
.replace(ENGLISH_NUMBERS[7], ARABIC_NUMBERS[7])
.replace(ENGLISH_NUMBERS[8], ARABIC_NUMBERS[8])
.replace(ENGLISH_NUMBERS[9], ARABIC_NUMBERS[9]);
else return input
.replace(PERSIAN_NUMBERS[0], ENGLISH_NUMBERS[0])
.replace(PERSIAN_NUMBERS[1], ENGLISH_NUMBERS[1])
.replace(PERSIAN_NUMBERS[2], ENGLISH_NUMBERS[2])
.replace(PERSIAN_NUMBERS[3], ENGLISH_NUMBERS[3])
.replace(PERSIAN_NUMBERS[4], ENGLISH_NUMBERS[4])
.replace(PERSIAN_NUMBERS[5], ENGLISH_NUMBERS[5])
.replace(PERSIAN_NUMBERS[6], ENGLISH_NUMBERS[6])
.replace(PERSIAN_NUMBERS[7], ENGLISH_NUMBERS[7])
.replace(PERSIAN_NUMBERS[8], ENGLISH_NUMBERS[8])
.replace(PERSIAN_NUMBERS[9], ENGLISH_NUMBERS[9])
.replace(ARABIC_NUMBERS[0], ENGLISH_NUMBERS[0])
.replace(ARABIC_NUMBERS[1], ENGLISH_NUMBERS[1])
.replace(ARABIC_NUMBERS[2], ENGLISH_NUMBERS[2])
.replace(ARABIC_NUMBERS[3], ENGLISH_NUMBERS[3])
.replace(ARABIC_NUMBERS[4], ENGLISH_NUMBERS[4])
.replace(ARABIC_NUMBERS[5], ENGLISH_NUMBERS[5])
.replace(ARABIC_NUMBERS[6], ENGLISH_NUMBERS[6])
.replace(ARABIC_NUMBERS[7], ENGLISH_NUMBERS[7])
.replace(ARABIC_NUMBERS[8], ENGLISH_NUMBERS[8])
.replace(ARABIC_NUMBERS[9], ENGLISH_NUMBERS[9]);
}
}
Note that here we assumed localizing is done between English and Persian or Arabic, so if you also need to include another language in replacing criteria just add the missing replace clauses.
This code will work with decimal points also:
public class mainsupport {
public static void main(String args[]){
// String Numtoconvert="15.3201" ;
// String Numtoconvert="458" ;
String Numtoconvert="٨٧٫٥٩٨" ; // decimal value 87.598
System.out.println(getUSNumber(Numtoconvert));
}
private static String getUSNumber(String Numtoconvert){
NumberFormat formatter = NumberFormat.getInstance(Locale.US);
try {
if(Numtoconvert.contains("٫"))
Numtoconvert=formatter.parse(Numtoconvert.split("٫")[0].trim())+"."+formatter.parse(Numtoconvert.split("٫")[1].trim());
else
Numtoconvert=formatter.parse(Numtoconvert).toString();
} catch (ParseException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return Numtoconvert;
}
}
This prints 87.598.
The following seems to me to be the simple and obvious solution. I don’t know why it hasn’t been posted before.
Locale persian = Locale.forLanguageTag("fa");
NumberFormat nf = NumberFormat.getIntegerInstance(persian);
String persianIntegerString = "۲۱";
int parsedInteger = nf.parse(persianIntegerString).intValue();
System.out.println(parsedInteger);
Output is:
21
If we’ve got a string with a decimal point in it (or just one that may have that), use getInstance instead of getIntegerInstance. At the same time I am taking an Arabic string this time to demonstrate that this works too.
Locale arabic = Locale.forLanguageTag("ar");
NumberFormat nf = NumberFormat.getInstance(arabic);
String arabicDecimalString = "٣٤٫٥٦";
double parsedDouble = nf.parse(arabicDecimalString).doubleValue();
System.out.println(parsedDouble);
34.56
In many cases the number formats can also parse numbers in other locales, but I doubt that it is always the case, so I would not want to rely on it.
Use Locale class to convert numbers.
Locale locale = new Locale("ar");
String formattedArabic = String.format(locale, "%d", value);
Try this for converting Persian/Arabic numbers to English:
public static String convertToEnglish(String arabicNumber) {
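for (int i = 0; i <= 9; i++) {
arabicNumber = arabicNumber
.replace((char) (1776 + i), (char) (48 + i)) // Persian digits U+06F0..U+06F9
.replace((char) (1632 + i), (char) (48 + i)); // Arabic digits U+0660..U+0669
}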
return arabicNumber;
}
I think instead of replacing the digits one by one (which would only work for decimal numbers), you should parse your number with a Persian NumberFormat to a number, and then (if necessary) use an English NumberFormat to format it again.
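A minimal sketch of that parse-then-format round trip, assuming integer input written with Persian digits and Locale.US for the English side; the class and method names are hypothetical:

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

final class LocaleDigits {
    // Parses text written with Persian digits and formats the value with English digits.
    static String persianToEnglish(String persianNumber) {
        try {
            NumberFormat persian = NumberFormat.getInstance(Locale.forLanguageTag("fa"));
            Number value = persian.parse(persianNumber);
            NumberFormat english = NumberFormat.getInstance(Locale.US);
            english.setGroupingUsed(false); // "1234" rather than "1,234"
            return english.format(value);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number: " + persianNumber, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(persianToEnglish("\u06f2\u06f1")); // Persian digits two, one
    }
}
```

Decimal and grouping separators depend on the locale data shipped with the JDK, so input with separators is worth testing on the target runtime.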
I've been experimenting with various bits of Java code trying to come up with something that will encode a string containing quotes, spaces and "exotic" Unicode characters and produce output that's identical to JavaScript's encodeURIComponent function.
My torture test string is: "A" B ± "
If I enter the following JavaScript statement in Firebug:
encodeURIComponent('"A" B ± "');
—Then I get:
"%22A%22%20B%20%C2%B1%20%22"
Here's my little test Java program:
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
public class EncodingTest
{
public static void main(String[] args) throws UnsupportedEncodingException
{
String s = "\"A\" B ± \"";
System.out.println("URLEncoder.encode returns "
+ URLEncoder.encode(s, "UTF-8"));
System.out.println("getBytes returns "
+ new String(s.getBytes("UTF-8"), "ISO-8859-1"));
}
}
—This program outputs:
URLEncoder.encode returns %22A%22+B+%C2%B1+%22
getBytes returns "A" B ± "
Close, but no cigar! What is the best way of encoding a UTF-8 string using Java so that it produces the same output as JavaScript's encodeURIComponent?
EDIT: I'm using Java 1.4 moving to Java 5 shortly.
This is the class I came up with in the end:
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
/**
* Utility class for JavaScript compatible UTF-8 encoding and decoding.
*
* @see http://stackoverflow.com/questions/607176/java-equivalent-to-javascripts-encodeuricomponent-that-produces-identical-output
* @author John Topley
*/
public class EncodingUtil
{
/**
* Decodes the passed UTF-8 String using an algorithm that's compatible with
* JavaScript's <code>decodeURIComponent</code> function. Returns
* <code>null</code> if the String is <code>null</code>.
*
* @param s The UTF-8 encoded String to be decoded
* @return the decoded String
*/
public static String decodeURIComponent(String s)
{
if (s == null)
{
return null;
}
String result = null;
try
{
result = URLDecoder.decode(s, "UTF-8");
}
// This exception should never occur.
catch (UnsupportedEncodingException e)
{
result = s;
}
return result;
}
/**
* Encodes the passed String as UTF-8 using an algorithm that's compatible
* with JavaScript's <code>encodeURIComponent</code> function. Returns
* <code>null</code> if the String is <code>null</code>.
*
* @param s The String to be encoded
* @return the encoded String
*/
public static String encodeURIComponent(String s)
{
if (s == null)
{
return null;
}
String result = null;
try
{
result = URLEncoder.encode(s, "UTF-8")
.replaceAll("\\+", "%20")
.replaceAll("\\%21", "!")
.replaceAll("\\%27", "'")
.replaceAll("\\%28", "(")
.replaceAll("\\%29", ")")
.replaceAll("\\%7E", "~");
}
// This exception should never occur.
catch (UnsupportedEncodingException e)
{
result = s;
}
return result;
}
/**
* Private constructor to prevent this class from being instantiated.
*/
private EncodingUtil()
{
super();
}
}
Looking at the implementation differences, I see that:
MDC on encodeURIComponent():
literal characters (regex representation): [-a-zA-Z0-9._*~'()!]
Java 1.5.0 documentation on URLEncoder:
literal characters (regex representation): [-a-zA-Z0-9._*]
the space character " " is converted into a plus sign "+".
So basically, to get the desired result, use URLEncoder.encode(s, "UTF-8") and then do some post-processing:
replace all occurrences of "+" with "%20"
replace all occurrences of "%xx" representing any of [~'()!] back to their literal counter-parts
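The two post-processing steps above can be sketched as a single method. This is a minimal illustration using only java.net.URLEncoder; the class name is hypothetical:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class UriComponentSketch {
    static String encodeURIComponent(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8")
                    .replace("+", "%20")   // URLEncoder writes a space as '+'
                    .replace("%21", "!")   // restore the characters that
                    .replace("%27", "'")   // encodeURIComponent leaves literal
                    .replace("%28", "(")
                    .replace("%29", ")")
                    .replace("%7E", "~");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeURIComponent("\"A\" B \u00b1 \""));
    }
}
```

String.replace is used instead of replaceAll because the targets are literal text, so no regex escaping is needed.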
Using the javascript engine that is shipped with Java 6:
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
public class Wow
{
public static void main(String[] args) throws Exception
{
ScriptEngineManager factory = new ScriptEngineManager();
ScriptEngine engine = factory.getEngineByName("JavaScript");
engine.eval("print(encodeURIComponent('\"A\" B ± \"'))");
}
}
Output: %22A%22%20B%20%c2%b1%20%22
The case is different but it's closer to what you want.
I use java.net.URI#getRawPath(), e.g.
String s = "a+b c.html";
String fixed = new URI(null, null, s, null).getRawPath();
The value of fixed will be a+b%20c.html, which is what you want.
Post-processing the output of URLEncoder.encode() does not preserve pluses that are supposed to stay literal in the URI, because URLEncoder.encode() escapes them as %2B. For example
URLEncoder.encode("a+b c.html", "UTF-8").replaceAll("\\+", "%20");
will give you a%2Bb%20c.html rather than a+b%20c.html.
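To see how each approach treats a literal plus, the two calls can be compared side by side. A small sketch with hypothetical helper names; note that URLEncoder.encode escapes a literal + as %2B before any post-processing happens:

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

final class PlusHandling {
    // Path-style escaping: '+' is legal in a path and stays as it is.
    static String viaRawPath(String s) {
        try {
            return new URI(null, null, s, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Form-style escaping followed by the '+' to "%20" post-processing step.
    static String viaUrlEncoder(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(viaRawPath("a+b c.html"));
        System.out.println(viaUrlEncoder("a+b c.html"));
    }
}
```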
I came up with my own version of encodeURIComponent, because the posted solution has one problem: if a + that should be encoded is present in the String, it will be converted to a space.
So here is my class:
import java.io.UnsupportedEncodingException;
import java.util.BitSet;
public final class EscapeUtils
{
/** used for the encodeURIComponent function */
private static final BitSet dontNeedEncoding;
static
{
dontNeedEncoding = new BitSet(256);
// a-z
for (int i = 97; i <= 122; ++i)
{
dontNeedEncoding.set(i);
}
// A-Z
for (int i = 65; i <= 90; ++i)
{
dontNeedEncoding.set(i);
}
// 0-9
for (int i = 48; i <= 57; ++i)
{
dontNeedEncoding.set(i);
}
// '()*
for (int i = 39; i <= 42; ++i)
{
dontNeedEncoding.set(i);
}
dontNeedEncoding.set(33); // !
dontNeedEncoding.set(45); // -
dontNeedEncoding.set(46); // .
dontNeedEncoding.set(95); // _
dontNeedEncoding.set(126); // ~
}
/**
* A Utility class should not be instantiated.
*/
private EscapeUtils()
{
}
/**
* Escapes all characters except the following: alphabetic, decimal digits, - _ . ! ~ * ' ( )
*
* @param input
* A component of a URI
* @return the escaped URI component
*/
public static String encodeURIComponent(String input)
{
if (input == null)
{
return input;
}
StringBuilder filtered = new StringBuilder(input.length());
char c;
for (int i = 0; i < input.length(); ++i)
{
c = input.charAt(i);
if (dontNeedEncoding.get(c))
{
filtered.append(c);
}
else
{
final byte[] b = charToBytesUTF(c);
for (int j = 0; j < b.length; ++j)
{
filtered.append('%');
filtered.append("0123456789ABCDEF".charAt(b[j] >> 4 & 0xF));
filtered.append("0123456789ABCDEF".charAt(b[j] & 0xF));
}
}
}
return filtered.toString();
}
private static byte[] charToBytesUTF(char c)
{
try
{
return new String(new char[] { c }).getBytes("UTF-8");
}
catch (UnsupportedEncodingException e)
{
return new byte[] { (byte) c };
}
}
}
I came up with another implementation documented at, http://blog.sangupta.com/2010/05/encodeuricomponent-and.html. The implementation can also handle Unicode bytes.
This is a straightforward example of Ravi Wallau's solution:
public String buildSafeURL(String partialURL, String documentName)
throws ScriptException {
ScriptEngineManager scriptEngineManager = new ScriptEngineManager();
ScriptEngine scriptEngine = scriptEngineManager
.getEngineByName("JavaScript");
String urlSafeDocumentName = String.valueOf(scriptEngine
.eval("encodeURIComponent('" + documentName + "')"));
String safeURL = partialURL + urlSafeDocumentName;
return safeURL;
}
public static void main(String[] args) {
EncodeURIComponentDemo demo = new EncodeURIComponentDemo();
String partialURL = "https://www.website.com/document/";
String documentName = "Tom & Jerry Manuscript.pdf";
try {
System.out.println(demo.buildSafeURL(partialURL, documentName));
} catch (ScriptException se) {
se.printStackTrace();
}
}
Output:
https://www.website.com/document/Tom%20%26%20Jerry%20Manuscript.pdf
It also answers the hanging question in the comments by Loren Shqipognja on how to pass a String variable to encodeURIComponent(). The method scriptEngine.eval() returns an Object, so it can be converted to a String via String.valueOf(), among other methods.
I have found PercentEscaper class from google-http-java-client library, that can be used to implement encodeURIComponent quite easily.
PercentEscaper from google-http-java-client javadoc
google-http-java-client home
I have successfully used the java.net.URI class like so:
public static String uriEncode(String string) {
String result = string;
if (null != string) {
try {
String scheme = null;
String ssp = string;
int es = string.indexOf(':');
if (es > 0) {
scheme = string.substring(0, es);
ssp = string.substring(es + 1);
}
result = (new URI(scheme, ssp, null)).toString();
} catch (URISyntaxException usex) {
// ignore and use string that has syntax error
}
}
return result;
}
for me this worked:
import org.apache.http.client.utils.URIBuilder;
String encodedString = new URIBuilder()
.setParameter("i", stringToEncode)
.build()
.getRawQuery() // output: i=encodedString
.substring(2);
or with a different UriBuilder
import javax.ws.rs.core.UriBuilder;
String encodedString = UriBuilder.fromPath("")
.queryParam("i", stringToEncode)
.toString() // output: ?i=encodedString
.substring(3);
In my opinion, using a standard library is a better idea than post-processing manually. Also, @Chris's answer looked good, but it doesn't work for URLs like "http://a+b c.html".
Guava library has PercentEscaper:
Escaper percentEscaper = new PercentEscaper("-_.*", false);
"-_.*" are safe characters
false says PercentEscaper to escape space with '%20', not '+'
This is what I'm using:
private static final String HEX = "0123456789ABCDEF";
public static String encodeURIComponent(String str) {
if (str == null) return null;
byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
StringBuilder builder = new StringBuilder(bytes.length);
for (byte c : bytes) {
if (c >= 'a' ? c <= 'z' || c == '~' :
c >= 'A' ? c <= 'Z' || c == '_' :
c >= '0' ? c <= '9' : c == '-' || c == '.')
builder.append((char)c);
else
builder.append('%')
.append(HEX.charAt(c >> 4 & 0xf))
.append(HEX.charAt(c & 0xf));
}
return builder.toString();
}
It goes beyond Javascript's by percent-encoding every character that is not an unreserved character according to RFC 3986.
This is the opposite conversion:
public static String decodeURIComponent(String str) {
if (str == null) return null;
int length = str.length();
byte[] bytes = new byte[length / 3];
StringBuilder builder = new StringBuilder(length);
for (int i = 0; i < length; ) {
char c = str.charAt(i);
if (c != '%') {
builder.append(c);
i += 1;
} else {
int j = 0;
do {
char h = str.charAt(i + 1);
char l = str.charAt(i + 2);
i += 3;
h -= '0';
if (h >= 10) {
h |= ' ';
h -= 'a' - '0';
if (h >= 6) throw new IllegalArgumentException();
h += 10;
}
l -= '0';
if (l >= 10) {
l |= ' ';
l -= 'a' - '0';
if (l >= 6) throw new IllegalArgumentException();
l += 10;
}
bytes[j++] = (byte)(h << 4 | l);
if (i >= length) break;
c = str.charAt(i);
} while (c == '%');
builder.append(new String(bytes, 0, j, StandardCharsets.UTF_8));
}
}
return builder.toString();
}
I used
String encodedUrl = new URI(null, url, null).toASCIIString();
to encode urls.
To add parameters after the existing ones in the url I use UriComponentsBuilder