JIS X 0208 conversion: how to handle unified (merged) codepoints - java

I'm trying to convert Java characters to the JIS X 0208 "x-JIS0208" encoding (or any compatible encoding, like EUC-JP, but not Shift-JIS), but I want unified (merged) codepoints to be handled correctly.
For example, 高 is assigned to row 25 column 66 in this JIS X 0208 chart, and the look-alike character 髙, while classified as an unassigned codepoint, is merged with the former. I quote from Wikipedia: "both the form [ ] (高) and the less common form with a ladder-like construction (髙) are subsumed into the same code point".
I tried the code below, and whatever encoding I try, I always get either an exception or the unassigned-character placeholder ? (either ASCII or full-width).
Is there a way, perhaps a different encoding or an entirely different way of converting, so both these characters return the same codepoint? Alternatively, is there an API to find such characters so I can merge them before converting?
static Charset charset1 = Charset.forName("x-JIS0208");
static Charset charset2 = Charset.forName("EUC-JP");
static Charset[] charsets = {charset1, charset2};
static CharBuffer in = CharBuffer.allocate(1);

public static void main(String[] args) throws Exception
{
    CharsetEncoder[] encoders = new CharsetEncoder[charsets.length];
    for (int i = 0; i < charsets.length; i++)
        encoders[i] = charsets[i].newEncoder();
    char[] testChars = {'　', 'Ａ', '？', '亜', '唖', '蔭', '高', '髙'};
    for (char ch : testChars)
    {
        System.out.print("'" + ch + "'\t(" + Integer.toHexString(ch) + ")\t=");
        for (int i = 0; i < charsets.length; i++)
        {
            System.out.print("\t" + interpret(encode1(encoders[i], ch)));
            System.out.print("\t" + interpret(encode2(charsets[i], ch)));
        }
        System.out.println();
    }
}

private static String interpret(int i)
{
    if (i == -1)
        return "excepti";
    if (i < 0x80)
        return "'" + (char)i + "'";
    return Integer.toHexString(i);
}

private static int encode1(CharsetEncoder encoder, char ch)
{
    in.rewind();
    in.put(ch);
    in.rewind();
    try
    {
        ByteBuffer out = encoder.encode(in);
        if (out.limit() == 1)
            return out.get(0) & 0xFF;
        return out.get(1) & 0xFF | (out.get(0) & 0xFF) << 8;
    }
    catch (CharacterCodingException e)
    {
        return -1;
    }
}

private static int encode2(Charset charset, char ch)
{
    in.rewind();
    in.put(ch);
    in.rewind();
    ByteBuffer out = charset.encode(in);
    if (out.limit() == 1)
        return out.get(0) & 0xFF;
    return out.get(1) & 0xFF | (out.get(0) & 0xFF) << 8;
}
The output:
' ' (3000) = 2121 2121 a1a1 a1a1
'A' (ff21) = 2341 2341 a3c1 a3c1
'?' (ff1f) = 2129 2129 a1a9 a1a9
'亜' (4e9c) = 3021 3021 b0a1 b0a1
'唖' (5516) = 3022 3022 b0a2 b0a2
'蔭' (852d) = 307e 307e b0fe b0fe
'高' (9ad8) = 3962 3962 b9e2 b9e2
'髙' (9ad9) = excepti 2129 excepti '?'
Note: I'm only interested in converting single characters, lots of them, not strings or streams, so I actually prefer a different method (if exists) that doesn't allocate a ByteBuffer every conversion.
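For what it's worth, the per-call ByteBuffer allocation can be avoided with the CharsetEncoder.encode(CharBuffer, ByteBuffer, boolean) overload, which writes into a caller-supplied buffer. A minimal sketch of that buffer-reuse pattern (encodeChar is a hypothetical helper, not part of the original code; needs java.nio.charset.CoderResult):
static final CharsetEncoder ENCODER = Charset.forName("EUC-JP").newEncoder();
static final CharBuffer IN = CharBuffer.allocate(1);
static final ByteBuffer OUT = ByteBuffer.allocate(4);

// Encodes a single char, reusing the same buffers on every call.
// Returns the encoded bytes packed into an int, or -1 on failure.
static int encodeChar(char ch)
{
    IN.clear();
    IN.put(ch);
    IN.flip();
    OUT.clear();
    ENCODER.reset();
    CoderResult result = ENCODER.encode(IN, OUT, true);
    if (result.isError())
        return -1;
    ENCODER.flush(OUT);
    OUT.flip();
    int value = 0;
    while (OUT.hasRemaining())
        value = (value << 8) | (OUT.get() & 0xFF);
    return value;
}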

髙 is not contained in JIS X 0208, but it is contained in Microsoft Windows code page 932 (MS932). This is a variant of the Shift JIS encoding and a superset of the JIS X 0208 charset.
You should use the name "Windows-31j" for MS932:
Charset.forName("Windows-31j");
rather than Charset.forName("x-JIS0208");.
EDIT
A mapping table for some characters such as 𨦇 and 鋏 (scissors) is distributed by the Japanese government, e.g. by the National Tax Agency (see JIS縮退マップ (Ver.1.0.0), the JIS degeneracy map).
But these mapping tables don't contain the character 髙. I think this is because 髙 is contained in neither JIS X 0208 nor JIS X 0213.
So I think you will have to replace 髙 with 高 manually (with String#replaceAll()), or write your own custom Charset with a CharsetProvider.
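If you go the manual-replacement route, a minimal sketch (the variant map below contains only the single pair from the question; a real one would have to be built from a source such as the mapping tables mentioned above):
import java.nio.charset.Charset;
import java.util.Map;

public class VariantFolding
{
    // Hypothetical variant map: fold glyph variants onto their JIS X 0208 form.
    private static final Map<Character, Character> VARIANTS = Map.of('髙', '高');

    public static void main(String[] args)
    {
        char ch = '髙';
        char folded = VARIANTS.getOrDefault(ch, ch);
        byte[] bytes = String.valueOf(folded).getBytes(Charset.forName("EUC-JP"));
        System.out.printf("%02x%02x%n", bytes[0], bytes[1]); // b9e2, same as 高
    }
}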

I only know that in the spec "ARIB STD-B24" (for ISDB-T 1seg in Japan), this character is encoded with DRCS pattern data, from DRCS-1 to DRCS-15, where each set consists of 94 characters.

Related

How to convert a binary in surrogate pairs to unicode?

Can anyone help figure out a surrogate-pair problem?
The source binary is #{EDA0BDEDB883} (encoded by Hessian/Java); how do I decode it to 😃 or "^(1F603)"?
I have checked the UTF-16 wiki, but it only told me half the story.
My problem is how to convert #{EDA0BDEDB883} to \ud83d and \ude03.
My aim is to rewrite the Python 3 program in Rebol or Red-lang, just parsing binary data, without any libs.
This is how Python 3 does it:
def _decode_surrogate_pair(c1, c2):
    """
    Python 3 no longer decodes surrogate pairs for us; we have to do it
    ourselves.
    """
    # print(c1.encode('utf-8'))  # \ud83d
    # print(c2.encode('utf-8'))  # \ude03
    if not ('\uD800' <= c1 <= '\uDBFF') or not ('\uDC00' <= c2 <= '\uDFFF'):
        raise Exception("Invalid UTF-16 surrogate pair")
    code = 0x10000
    code += (ord(c1) & 0x03FF) << 10
    code += (ord(c2) & 0x03FF)
    return chr(code)

def _decode_byte_array(bytes):
    s = ''
    while len(bytes):
        b, bytes = bytes[0], bytes[1:]
        c = b.decode('utf-8', 'surrogatepass')
        if '\uD800' <= c <= '\uDBFF':
            b, bytes = bytes[0], bytes[1:]
            c2 = b.decode('utf-8', 'surrogatepass')
            c = _decode_surrogate_pair(c, c2)
        s += c
    return s

bytes = [b'\xed\xa0\xbd', b'\xed\xb8\x83']
print(_decode_byte_array(bytes))
public static void main(String[] args) throws Exception {
    // "😃"
    // "\uD83D\uDE03"
    final byte[] bytes1 = "😃".getBytes(StandardCharsets.UTF_16);
    // [-2, -1, -40, 61, -34, 3]
    // #{FFFFFFFE} #{FFFFFFFF} #{FFFFFFD8} #{0000003D} #{FFFFFFDE} #{00000003}
    System.out.println(Arrays.toString(bytes1));
}
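The bytes #{EDA0BDEDB883} are CESU-8: each UTF-16 surrogate is encoded separately as a 3-byte, UTF-8-style sequence. A minimal Java sketch of decoding them by hand (cesu8Unit is a hypothetical helper, not a library call):
public class Cesu8Demo {
    // Decode one 3-byte CESU-8 group (1110xxxx 10xxxxxx 10xxxxxx) into a UTF-16 code unit.
    static char cesu8Unit(byte b0, byte b1, byte b2) {
        return (char) (((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F));
    }

    public static void main(String[] args) {
        byte[] in = {(byte) 0xED, (byte) 0xA0, (byte) 0xBD,
                     (byte) 0xED, (byte) 0xB8, (byte) 0x83};
        char hi = cesu8Unit(in[0], in[1], in[2]); // \uD83D
        char lo = cesu8Unit(in[3], in[4], in[5]); // \uDE03
        int codePoint = Character.toCodePoint(hi, lo); // 0x1F603
        System.out.printf("\\u%04X \\u%04X -> U+%X %s%n",
                (int) hi, (int) lo, codePoint, new String(Character.toChars(codePoint)));
    }
}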

Unable to call a .dll function correctly

I am calling a shared .dll function using JNA from Java. From the documentation, the function can be invoked to receive parameters using Visual C++ as below:
PMSifEncodeKcdLcl(PCHAR ff, PCHAR Dta, BOOL Dbg, PCHAR szOpId, PCHAR szOpFirst, PCHAR szOpLast);
From the doc:
ff - A single ASCII character.
Dta - Points to a null-terminated string.
Dbg - a boolean flag
szOpId - points to a null-terminated string
szOpFirst - points to a null-terminated string
szOpLast - points to a null-terminated string
The string is built from a number of Data Fields. The format for each Data Field within the string is as follows:
RS FI data
RS = Record Separator.
Indicates the start of the Data Field. A single ASCII Record Separator [RS] character (hex 1E)
FI = Field Identifier - Indicates the type of data in the field. A single ASCII character.
data = the actual data. A number of ASCII characters, dependent on the Field Identifier. Sometimes the data is variable in length. The Record Separator of the following field indicates the end of a Data Field (or for the last field, the NULL character at the end of the string).
An Answer Code is returned in field ff. Answer Data (if any) is returned in field Dta
I have cross-checked the JNA documentation to confirm the field mappings, but still no success. After trying for days, I came up with the code below.
My Java Code:
/* JNA interface class */
public class JNALocksInterface {
    public interface LockLibrary extends StdCallLibrary {
        LockLibrary INSTANCE = (LockLibrary) Native.loadLibrary("path_to_dll", LockLibrary.class);

        public void PMSifEncodeKcdLcl(byte[] ff, byte[] dta, boolean debug, String szOpid, String szOpFirst, String szOpLast);
    }
}
/*My Calling Class Code*/
JNALocksInterface.LockLibrary INSTANCE = JNALocksInterface.LockLibrary.INSTANCE;
String dta = "*R101*L101*TSingle Room*NMatu*FZachary*URegular Guest*D201805021347*O201805030111";
String ff = "A";
byte[] dataBytes = new byte[dta.length() + 1];
System.arraycopy(dta.getBytes("UTF-8"), 0, dataBytes, 0, dta.length());
dataBytes[dta.length()] = 0;
byte[] dtaByteArray = new byte[dta.length() + 1];
byte[] ffByteArray = ff.getBytes("UTF-8");
for (int i = 0; i < dataBytes.length; i++) {
String s1 = String.format("%8s", Integer.toBinaryString(dataBytes[i] & 0xFF)).replace(' ', '0');
// System.out.println(s1);
if((char)dataBytes[i] == '*')
{
dtaByteArray[i] = 30;
}
else{
int val = Integer.parseInt(s1, 2);
byte b = (byte) val;
dtaByteArray[i] = b;
}
}
byte[] commandCodeFinal = new byte[1];
for (int i = 0; i < ffByteArray.length; i++) {
String s2 = String.format("%8s", Integer.toBinaryString(ffByteArray[i] & 0xFF)).replace(' ', '0');
System.out.println(s2);
int val = Integer.parseInt(s2, 2);
byte b = (byte) val;
commandCodeFinal[i] = b;
}
String userNameBytes = "test";
String userFirstNameBytes = "test";
String userLastNameBytes = "test";
INSTANCE.PMSifEncodeKcdLcl(commandCodeFinal, dtaByteArray, false, userNameBytes, userFirstNameBytes, userLastNameBytes);
I am getting a wrong response on field ff and dta as shown below.
FF Response >> :
DTA Response >> 0101IR101L101TSingle RoomNMatuFZacharyURegular GuestD201805021347O2018050
I am replacing "*" with the ASCII record separator.
Can someone show me how to correctly call the function using JNA? I've searched all over but still no success.
Solved it! I used the Unicode field separator and a JNA Memory object, and it worked.
I was also using Windows 10 64-bit; changing to Windows 7 32-bit also helped.
Replaced the code with the snippet below:
String fieldSeparator = "\u001e";
String dataTest = fieldSeparator + "R101" + fieldSeparator + "TSingle Room" + fieldSeparator + "FShujaa" + fieldSeparator + "NMatoke"
        + fieldSeparator + "URegular Guest" + fieldSeparator + "D201805040842" + fieldSeparator + "O201805051245";
String dataTestPadded = org.apache.commons.lang.StringUtils.rightPad(dataTest, 30, '0');
System.out.println("Padded string >> " + dataTestPadded);
String data = dataTest;
// getPayloadToSend(payLoadSample) + (char) 00;
String commandCode = "A";
Memory commandCodeMemory = new Memory(commandCode.length() + 1);
commandCodeMemory.setString(0, commandCode);
Memory dataMemory = new Memory(data.length() + 1);
dataMemory.setString(0, data);
// dataMemory.setString(1, "0");
System.out.println("Registering >> " + INSTANCE.PMSifRegister("42860149", "BatchClient"));
INSTANCE.PMSifEncodeKcdLcl(commandCodeMemory, dataMemory, false, "ZKMATU", "zACHARY", "tESTING");
System.out.println("FF Response >> " + commandCodeMemory.getString(0));
System.out.println("DTA Response >> " + dataMemory.getString(0));
INSTANCE.PMSifUnregister();
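For completeness, the interface declaration has to change to match the Memory-based call. A sketch, with the PMSifRegister/PMSifUnregister signatures guessed from the snippet above (they are not shown in the original interface):
public interface LockLibrary extends StdCallLibrary {
    LockLibrary INSTANCE = (LockLibrary) Native.loadLibrary("path_to_dll", LockLibrary.class);

    // Pointer (Memory) parameters, so the DLL can write back into ff and dta.
    void PMSifEncodeKcdLcl(Pointer ff, Pointer dta, boolean debug,
                           String szOpId, String szOpFirst, String szOpLast);

    // Assumed signatures, inferred from the calls in the snippet above.
    int PMSifRegister(String serialNumber, String clientName);
    void PMSifUnregister();
}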

Xor a string that is uint16 or uint32

I am trying to recreate in Swift the following logic I wrote in Java:
public String xorMessage(String message, String key) {
    try {
        if (message == null || key == null) return null;
        char[] keys = key.toCharArray();
        char[] mesg = message.toCharArray();
        int ml = mesg.length;
        int kl = keys.length;
        char[] newmsg = new char[ml];
        for (int i = 0; i < ml; i++) {
            newmsg[i] = (char) (mesg[i] ^ keys[i % kl]);
        } // for i
        return new String(newmsg);
    } catch (Exception e) {
        return null;
    }
}
I have got this far coding in Swift 3:
import UIKit
import Foundation

let t = "22-Jun-2017 12:30 pm"
let m = "message"

let a: [UInt8] = Array(t.utf8)
let v = m.characters.map { String($0) }
print(v)

func encodeWithXorByte(key: UInt8, input: String) -> String {
    return String(bytes: input.utf8.map { $0 ^ key }, encoding: String.Encoding.utf8) ?? ""
}

var ml: Int = m.characters.count
var kl: Int = t.characters.count
var f = [String]()
for i in 0..<ml {
    let key = a[i % kl]
    let input = v[i]
    f.append(String(bytes: input.utf8.map { $0 ^ key }, encoding: String.Encoding.utf8)!)
}
I am trying to obtain a string (the message) XOR'ed with a key by the function above. But my Swift code is not working: it returns a character array, and I want a string. If I cast the character array to a string, it does not show escapes like \u{0001} in the string.
Suppose I get the following output:
["_", "W", "^", "9", "\u{14}", "\t", "H"]
and then I try to convert it to a string, I get this:
_W^9 H
I want:
_W^9\u{14}\tH
Please help.
There are different problems. First, if your intention is to print "unprintable" characters in a string \u{} escaped, then you can use the .debugDescription property. Example:
let s = "a\u{04}\u{08}b"
print(s)                  // ab
print(s.debugDescription) // "a\u{04}\u{08}b"
Next, your Swift code converts the string to UTF-8, xor's the bytes, and then converts the result back to a String. That can easily fail if the xor'ed byte sequence is not valid UTF-8.
The Java code operates on UTF-16 code units, so the equivalent Swift code would be:
func xorMessage(message: String, key: String) -> String {
    let keyChars = Array(key.utf16)
    let keyLen = keyChars.count
    let newMsg = message.utf16.enumerated().map { $1 ^ keyChars[$0 % keyLen] }
    return String(utf16CodeUnits: newMsg, count: newMsg.count)
}
Example:
let t = "22-Jun-2017 12:30 pm"
let m = "message"
let encrypted = xorMessage(message: m, key: t)
print(encrypted.debugDescription) // "_W^9\u{14}\tH"
Finally, even that can produce unexpected results unless you restrict the input (key and message) to ASCII characters. Example:
let m = "😀"
print(Array(m.utf16).map { String($0, radix: 16)} ) // ["d83d", "de00"]
let t = "a€"
print(Array(t.utf16).map { String($0, radix: 16)} ) // ["61", "20ac"]
let e = xorMessage(message: m, key: t)
print(Array(e.utf16).map { String($0, radix: 16)} ) // ["fffd", "feac"]
let d = xorMessage(message: e, key: t)
print(Array(d.utf16).map { String($0, radix: 16)} ) // ["ff9c", "fffd"]
print(d) // ワ�
print(d == m) // false
The problem is that the xor'ing produces an invalid UTF-16 sequence (an unpaired surrogate), which is then replaced by the "replacement character" U+FFFD.
I don't know how Java handles this, but Swift strings cannot contain invalid Unicode scalar values, so the only solution would be to represent the result as a [UInt16] array instead of a String.
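To fill in the "I don't know how Java handles this" part: a small sketch (not from the original answer) showing that Java tolerates unpaired surrogates inside a String but substitutes U+FFFD the moment you convert to bytes:
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class UnpairedSurrogate {
    public static void main(String[] args) {
        // A Java String is just a sequence of UTF-16 code units, so this is legal:
        String s = new String(new char[] {'\uD800', 'A'});
        // ...but encoding replaces the unpaired surrogate with U+FFFD (bytes EF BF BD).
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(utf8)); // [-17, -65, -67, 65]
    }
}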

Replace all extended ASCII characters by their "original" character [duplicate]

This question already has answers here:
Remove diacritical marks (ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ) from Unicode chars
(12 answers)
Is there a better way of getting rid of accents and making those letters regular, apart from using the String.replaceAll() method and replacing letters one by one?
Example:
Input: orčpžsíáýd
Output: orcpzsiayd
It doesn't need to cover all accented letters, or alphabets like Russian or Chinese.
Start with java.text.Normalizer.
string = Normalizer.normalize(string, Normalizer.Form.NFD);
// or Normalizer.Form.NFKD for a more "compatible" deconstruction
This will separate all of the accent marks from most characters. Then you just need to test each character for being a letter and throw out the ones that aren't.
string = string.replaceAll("[^\\p{ASCII}]", "");
If your text is in Unicode, you should use this instead:
string = string.replaceAll("\\p{M}", "");
For Unicode, \\P{M} matches the base glyph and \\p{M} (lowercase) matches each accent.
Thanks to GarretWilson for the pointer and regular-expressions.info for the great Unicode guide.
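Putting the two steps together, a minimal self-contained sketch using the question's sample input:
import java.text.Normalizer;

public class StripAccents {
    public static void main(String[] args) {
        String input = "orčpžsíáýd";
        // Decompose into base letters + combining marks, then drop the marks.
        String output = Normalizer.normalize(input, Normalizer.Form.NFD)
                                  .replaceAll("\\p{M}", "");
        System.out.println(output); // orcpzsiayd
    }
}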
It is important to note that Normalizer by itself is insufficient to remove diacritics. For example, the following will not replace the accented é with the unaccented e:
import static java.text.Normalizer.normalize;
import static java.text.Normalizer.Form.*;

public class T {
    public static void main(final String[] args) {
        final var text = "Brévis";
        System.out.println(
            normalize(text, NFD) + " " +
            normalize(text, NFC) + " " +
            normalize(text, NFKD) + " " +
            normalize(text, NFKC)
        );
    }
}
As of 2011 you can use Apache Commons StringUtils.stripAccents(input) (since 3.0):
String input = StringUtils.stripAccents("Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ");
System.out.println(input);
// Prints "This is a funky String"
Note:
The accepted answer (Erick Robertson's) doesn't work for Ø or Ł. Apache Commons 3.5 doesn't work for Ø either, but it does work for Ł. After reading the Wikipedia article for Ø, I'm not sure it should be replaced with "O": it's a separate letter in Norwegian and Danish, alphabetized after "z". It's a good example of the limitations of the "strip accents" approach.
The solution by #virgo47 is very fast, but approximate. The accepted answer uses Normalizer and a regular expression. I wondered what part of the time was taken by Normalizer versus the regular expression, since removing all the non-ASCII characters can be done without a regex:
import java.text.Normalizer;

public class Strip {
    public static String flattenToAscii(String string) {
        StringBuilder sb = new StringBuilder(string.length());
        string = Normalizer.normalize(string, Normalizer.Form.NFD);
        for (char c : string.toCharArray()) {
            if (c <= '\u007F') sb.append(c);
        }
        return sb.toString();
    }
}
Small additional speed-ups can be obtained by writing into a char[] and not calling toCharArray(), although I'm not sure that the decrease in code clarity merits it:
public static String flattenToAscii(String string) {
    char[] out = new char[string.length()];
    string = Normalizer.normalize(string, Normalizer.Form.NFD);
    int j = 0;
    for (int i = 0, n = string.length(); i < n; ++i) {
        char c = string.charAt(i);
        if (c <= '\u007F') out[j++] = c;
    }
    // Only the first j chars are valid; don't include the trailing NULs.
    return new String(out, 0, j);
}
This variation has the advantage of the correctness of the one using Normalizer and some of the speed of the one using a table. On my machine, this one is about 4x faster than the accepted answer, and 6.6x to 7x slower than #virgo47's (the accepted answer is about 26x slower than #virgo47's on my machine).
EDIT: If you're not stuck on Java <6 and speed is not critical and/or the translation table is too limiting, use the answer by David. The point is to use Normalizer (introduced in Java 6) instead of a translation table inside the loop.
While this is not a "perfect" solution, it works well when you know the range (in our case Latin-1/2), worked before Java 6 (not a real issue though) and is much faster than the most suggested version (which may or may not be an issue):
/**
 * Mirror of the Unicode table from 00c0 to 017f without diacritics.
 */
private static final String tab00c0 = "AAAAAAACEEEEIIII" +
    "DNOOOOO\u00d7\u00d8UUUUYI\u00df" +
    "aaaaaaaceeeeiiii" +
    "\u00f0nooooo\u00f7\u00f8uuuuy\u00fey" +
    "AaAaAaCcCcCcCcDd" +
    "DdEeEeEeEeEeGgGg" +
    "GgGgHhHhIiIiIiIi" +
    "IiJjJjKkkLlLlLlL" +
    "lLlNnNnNnnNnOoOo" +
    "OoOoRrRrRrSsSsSs" +
    "SsTtTtTtUuUuUuUu" +
    "UuUuWwYyYZzZzZzF";

/**
 * Returns string without diacritics - 7 bit approximation.
 *
 * @param source string to convert
 * @return corresponding string without diacritics
 */
public static String removeDiacritic(String source) {
    char[] vysl = new char[source.length()];
    char one;
    for (int i = 0; i < source.length(); i++) {
        one = source.charAt(i);
        if (one >= '\u00c0' && one <= '\u017f') {
            one = tab00c0.charAt((int) one - '\u00c0');
        }
        vysl[i] = one;
    }
    return new String(vysl);
}
Tests on my HW with a 32-bit JDK show that this performs the conversion from àèéľšťč89FDČ to aeelstc89FDC 1 million times in ~100 ms, while the Normalizer way takes 3.7 s (37x slower). If your needs are around performance and you know the input range, this may be for you.
Enjoy :-)
System.out.println(Normalizer.normalize("àèé", Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", ""));
worked for me. The output of the snippet above gives "aee" which is what I wanted, but
System.out.println(Normalizer.normalize("àèé", Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", ""));
didn't do any substitution.
Depending on the language, those might not be considered accents (which change the sound of the letter), but diacritical marks
https://en.wikipedia.org/wiki/Diacritic#Languages_with_letters_containing_diacritics
"Bosnian and Croatian have the symbols č, ć, đ, š and ž, which are considered separate letters and are listed as such in dictionaries and other contexts in which words are listed according to alphabetical order."
Removing them might be inherently changing the meaning of the word, or changing the letters into completely different ones.
I faced the same issue with a String equality check: one of the compared strings had characters with ASCII codes 128-255,
i.e., a non-breaking space [Hex - A0] instead of a space [Hex - 20].
To show the non-breaking space over HTML, I used the following spacing entities. Their characters and bytes are like: &emsp; is a very wide space [ ] {-30, -128, -125}; &ensp; is a somewhat wide space [ ] {-30, -128, -126}; &thinsp; is a narrow space [ ] {32}; non-HTML space {}.
String s1 = "My Sample Space Data", s2 = "My Sample Space Data";
System.out.format("S1: %s\n", java.util.Arrays.toString(s1.getBytes()));
System.out.format("S2: %s\n", java.util.Arrays.toString(s2.getBytes()));
Output in Bytes:
S1: [77, 121, 32, 83, 97, 109, 112, 108, 101, 32, 83, 112, 97, 99, 101, 32, 68, 97, 116, 97]
S2: [77, 121, -30, -128, -125, 83, 97, 109, 112, 108, 101, -30, -128, -125, 83, 112, 97, 99, 101, -30, -128, -125, 68, 97, 116, 97]
Use the code below for different spaces and their byte codes (see the wiki page List_of_Unicode_characters):
String spacing_entities = "very wide space,narrow space,regular space,invisible separator";
System.out.println("Space String :" + spacing_entities);
byte[] byteArray =
    // spacing_entities.getBytes( Charset.forName("UTF-8") );
    // Charset.forName("UTF-8").encode( s2 ).array();
    {-30, -128, -125, 44, -30, -128, -126, 44, 32, 44, -62, -96};
System.out.println("Bytes:" + Arrays.toString(byteArray));
try {
    System.out.format("Bytes to String[%S] \n ", new String(byteArray, "UTF-8"));
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
➩ ASCII transliterations of Unicode string for Java. unidecode
String initials = Unidecode.decode( s2 );
➩ using Guava: Google Core Libraries for Java.
String replaceFrom = CharMatcher.WHITESPACE.replaceFrom( s2, " " );
For URL-encoding the space, use the Guava library:
String encodedString = UrlEscapers.urlFragmentEscaper().escape(inputString);
➩ To overcome this problem, I used String.replaceAll() with some regular expressions.
// \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
s2 = s2.replaceAll("\\p{Zs}", " ");
s2 = s2.replaceAll("[^\\p{ASCII}]", " ");
s2 = s2.replaceAll(" ", " ");
➩ Using java.text.Normalizer.Form.
This enum provides constants of the four Unicode normalization forms that are described in Unicode Standard Annex #15 — Unicode Normalization Forms and two methods to access them.
s2 = Normalizer.normalize(s2, Normalizer.Form.NFKC);
Testing strings and outputs of the different approaches (➩ Unidecode, Normalizer, StringUtils):
String strUni = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ Æ,Ø,Ð,ß";
// This is a funky String AE,O,D,ss
String initials = Unidecode.decode( strUni );
// The following produce this o/p: Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ Æ,Ø,Ð,ß
String temp = Normalizer.normalize(strUni, Normalizer.Form.NFD);
Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
temp = pattern.matcher(temp).replaceAll("");
String input = org.apache.commons.lang3.StringUtils.stripAccents( strUni );
Using Unidecode is the best choice; my final code is shown below.
public static void main(String[] args) {
    String s1 = "My Sample Space Data", s2 = "My Sample Space Data";
    String initials = Unidecode.decode(s2);
    if (s1.equals(s2)) { // [ , ] %A0 - %2C - %20 « http://www.ascii-code.com/
        System.out.println("Equal Unicode Strings");
    } else if (s1.equals(initials)) {
        System.out.println("Equal Non Unicode Strings");
    } else {
        System.out.println("Not Equal");
    }
}
I suggest Junidecode. It will handle not only 'Ł' and 'Ø', but it also works well for transcribing from other alphabets, such as Chinese, into the Latin alphabet.
One of the best ways, using regex and Normalizer, if you have no library is:
public String flattenToAscii(String s) {
    if (s == null || s.trim().length() == 0)
        return "";
    return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("[\u0300-\u036F]", "");
}
This is more efficient than replaceAll("[^\\p{ASCII}]", "") if you don't need diacritics (as in your example).
Otherwise, you have to use the \p{ASCII} pattern.
Regards.
This solution is already available in StringUtils.stripAccents() from the Maven repository, and it works for Ł, as mentioned by @DavidS.
But I needed this to work for both Ø and Ł, so I modified it as below. It may be helpful for others too.
Update
This is a modified version of StringUtils.stripAccents(String obj) that keeps the old functionality while also handling the Ø and Ł characters.
public static String stripAccents(final String input) {
    if (input == null) {
        return null;
    }
    final StringBuilder decomposed = new StringBuilder(Normalizer.normalize(input, Normalizer.Form.NFD));
    for (int i = 0; i < decomposed.length(); i++) {
        if (decomposed.charAt(i) == '\u0141') {
            decomposed.setCharAt(i, 'L');
        } else if (decomposed.charAt(i) == '\u0142') {
            decomposed.setCharAt(i, 'l');
        } else if (decomposed.charAt(i) == '\u00D8') {
            decomposed.setCharAt(i, 'O');
        } else if (decomposed.charAt(i) == '\u00F8') {
            decomposed.setCharAt(i, 'o');
        }
    }
    // Note that this doesn't correctly remove ligatures...
    return Pattern.compile("\\p{InCombiningDiacriticalMarks}+").matcher(decomposed).replaceAll("");
}
Input string: Ł Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ Ø ø
Output string: L This is a funky String O o
@David Conrad's solution is the fastest I tried using Normalizer, but it does have a bug: it strips characters which are not accents; for example, Chinese characters and other letters like æ are all stripped.
The characters that we want to strip are non-spacing marks, characters which don't take up extra width in the final string. These zero-width characters basically end up combined into some other character. If you can see one isolated as a character, for example like this `, my guess is that it's combined with the space character.
public static String flattenToAscii(String string) {
    char[] out = new char[string.length()];
    String norm = Normalizer.normalize(string, Normalizer.Form.NFD);
    int j = 0;
    for (int i = 0, n = norm.length(); i < n; ++i) {
        char c = norm.charAt(i);
        int type = Character.getType(c);
        // Log.d(TAG, "" + c);
        // by Ricardo, modified the character check for accents, ref: http://stackoverflow.com/a/5697575/689223
        if (type != Character.NON_SPACING_MARK) {
            out[j] = c;
            j++;
        }
    }
    // Log.d(TAG, "normalized string:" + norm + "/" + new String(out));
    // Only the first j chars are valid; don't include the trailing NULs.
    return new String(out, 0, j);
}
I think the best solution is converting each char to hex and replacing it with the corresponding hex, because there are two ways of typing Unicode:
Composite Unicode
Precomposed Unicode
For example, "Ồ" written in Composite Unicode is different from "Ồ" written in Precomposed Unicode. You can copy my sample chars and convert them to see the difference.
In Composite Unicode, "Ồ" is combined from 2 chars: Ô (U+00d4) and ̀ (U+0300)
In Precomposed Unicode, "Ồ" is a single char (U+1ED2)
I have developed this feature for some banks to convert the info before sending it to the core bank (which usually doesn't support Unicode), and I faced this issue when the end users use multiple Unicode typings to input the data. So I think converting to hex and replacing it is the most reliable way.
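A small sketch (not from the original answer) showing the two forms, and that Normalizer can also be used to reconcile them:
import java.text.Normalizer;

public class CompositeVsPrecomposed {
    public static void main(String[] args) {
        String precomposed = "\u1ED2";     // Ồ as a single code point
        String composite = "\u00D4\u0300"; // Ô followed by a combining grave accent
        System.out.println(precomposed.equals(composite)); // false
        // NFC composes the pair into the precomposed form.
        String recomposed = Normalizer.normalize(composite, Normalizer.Form.NFC);
        System.out.println(recomposed.equals(precomposed)); // true
    }
}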
A fast and safer way
public static String removeDiacritics(String str) {
    if (str == null)
        return null;
    if (str.isEmpty())
        return "";

    int len = str.length();
    StringBuilder sb = new StringBuilder(len);

    // Iterate over the string's codepoints.
    for (int i = 0; i < len; ) {
        int codePoint = str.codePointAt(i);
        int charCount = Character.charCount(codePoint);
        if (charCount > 1) {
            // Supplementary character (surrogate pair): copy as-is.
            for (int j = 0; j < charCount; j++)
                sb.append(str.charAt(i + j));
            i += charCount;
            continue;
        } else if (codePoint <= 127) {
            // Plain ASCII: copy as-is.
            sb.append((char) codePoint);
            i++;
            continue;
        }
        // Decompose the single character and keep only the base character.
        sb.append(
            java.text.Normalizer
                .normalize(Character.toString((char) codePoint), java.text.Normalizer.Form.NFD)
                .charAt(0));
        i++;
    }
    return sb.toString();
}
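For instance, feeding it the test string used in the benchmark answer above:
System.out.println(removeDiacritics("àèéľšťč89FDČ")); // aeelstc89FDC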
Faced the same issue; here's a solution using a Kotlin extension:
val String.stripAccents: String
    get() = Regex("\\p{InCombiningDiacriticalMarks}+")
        .replace(
            Normalizer.normalize(this, Normalizer.Form.NFD),
            ""
        )
Usage:
val textWithoutAccents = "some accented string".stripAccents
In case anyone is struggling to do this in Kotlin, this code works like a charm. To avoid inconsistencies I also use .toUpperCase() and .trim(). Then I call this function:
fun stripAccents(s: String): String {
    val chars: CharArray = s.toCharArray()
    val sb = StringBuilder(s)
    var cont = 0
    while (chars.size > cont) {
        var c: Char = chars[cont]
        var c2: String = c.toString()
        // These are my needs; if you need to convert other accents, just add new entries here.
        c2 = c2.replace("Ã", "A")
        c2 = c2.replace("Õ", "O")
        c2 = c2.replace("Ç", "C")
        c2 = c2.replace("Á", "A")
        c2 = c2.replace("Ó", "O")
        c2 = c2.replace("Ê", "E")
        c2 = c2.replace("É", "E")
        c2 = c2.replace("Ú", "U")
        c = c2.single()
        sb.setCharAt(cont, c)
        cont++
    }
    return sb.toString()
}
To use this function, call it like this:
var str: String
str = editText.text.toString() // get the text from the EditText
str = str.toUpperCase().trim()
str = stripAccents(str) // call the function

Java encryption and Force.com Apex encryption

I need to convert this Java code to Force.com Apex. I tried to use the Crypto class to get the same encryption, but I can't figure out how to get the same value for the variable "fingerprintHash" at the end in Apex. Can anyone help me with this technical issue?
Random generator = new Random();
sequence = Long.parseLong(sequence + "" + generator.nextInt(1000));
timeStamp = System.currentTimeMillis() / 1000;
try {
    SecretKey key = new SecretKeySpec(transactionKey.getBytes(), "HmacMD5");
    Mac mac = Mac.getInstance("HmacMD5");
    mac.init(key);
    String inputstring = loginID + "^" + sequence + "^" + timeStamp + "^" + amount + "^";
    byte[] result = mac.doFinal(inputstring.getBytes());
    StringBuffer strbuf = new StringBuffer(result.length * 2);
    for (int i = 0; i < result.length; i++) {
        if (((int) result[i] & 0xff) < 0x10) {
            strbuf.append("0");
        }
        strbuf.append(Long.toString((int) result[i] & 0xff, 16));
    }
    fingerprintHash = strbuf.toString(); // need this result for variable x_fp_hash
The Apex code I was trying is:
String API_Login_Id = '6########';
String TXn_Key = '6###############';
String amount = '55';
sequence = '300';
long timeStamp = System.currentTimeMillis() / 1000;
String inputStr = API_Login_Id + '^' + sequence + '^' + timeStamp + '^' + amount + '^';
String algorithmName = 'hmacMD5';
Blob mac = Crypto.generateMac(algorithmName, Blob.valueOf(inputStr), Blob.valueOf(TXn_Key));
String macUrl = EncodingUtil.urlEncode(EncodingUtil.base64Encode(mac), 'UTF-8');
The problem would seem to be that you are hex-encoding the output on the Java side, but base64-encoding it on the Apex side. Try using EncodingUtil.convertToHex instead of EncodingUtil.base64Encode.
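To see why the two encodings can never agree, a small Java-side illustration (the sample bytes are made up):
import java.util.Base64;

public class HexVsBase64 {
    public static void main(String[] args) {
        byte[] mac = {(byte) 0x9a, 0x3f, (byte) 0xc1};
        StringBuilder hex = new StringBuilder();
        for (byte b : mac)
            hex.append(String.format("%02x", b)); // two hex digits per byte
        System.out.println(hex);                                     // 9a3fc1
        System.out.println(Base64.getEncoder().encodeToString(mac)); // mj/B
    }
}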
You look like you're heading along the right lines with regard to the encryption. However, you're using a timestamp as part of your input string, so unless you're astronomically lucky you're always encoding different strings. While you're working on porting the code, remove the timestamp so that you can be sure your input strings are the same; if they're not the same, you'll never get the same result.
Once you've established that your encryption is working as desired, then you can put the timestamp back into the code safe in the knowledge that it'll be functioning the same way as the original java code.
