//PROBLEM SOLVED
I wrote a program to convert an EBCDIC string to hex.
I have a problem with some characters.
I read the string into bytes, then convert each byte to hex (two hex digits per byte).
The problem is that Java converts the character with decimal value 136 (according to https://shop.alterlinks.com/ascii-table/ascii-ebcdic-us.php) to decimal 63, a '?'.
This is wrong, because it's the only character that is converted incorrectly.
The byte shows up as EBCDIC 88 in UltraEdit.
//edit -added code
int[] bytes = toByte(0, 8);
String bitmap = hexToBin(toHex(bytes));

// ebytes[] - the EBCDIC string, held as a byte array
public int[] toByte(int from, int to) {
    int[] newBytes = new int[to - from + 1];
    int k = 0;
    for (int i = from; i <= to; i++) {
        newBytes[k] = ebytes[i] & 0xff;  // mask to get the unsigned byte value
        k++;
    }
    return newBytes;
}
public String toHex(int[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (int b : bytes) {
        sb.append(String.format("%02x", b));  // always two hex digits per byte
    }
    return sb.toString();
}
// requires java.math.BigInteger
public String hexToBin(String hex) {
    String toReturn = new BigInteger(hex, 16).toString(2);
    // left-pad with zeros to 4 bits per hex digit
    return String.format("%" + (hex.length() * 4) + "s", toReturn).replace(' ', '0');
}
//edit2
Changing the encoding in Eclipse to ISO-8859-1 helped, but I lose some characters while reading text from a file.
//edit3
Problem solved by changing the way the file is read.
Now I read it byte by byte and convert each byte to a char.
Before, it was read line by line.
There is no ASCII value 136, since ASCII is only 7-bit - everything beyond 127 comes from some extended codepage (the linked table seems to use a Windows codepage, e.g. Cp1252). Since a ? is being printed, it seems you are using a codepage that doesn't have a printable character assigned to the value 136 - e.g. some flavour of ISO-8859.
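The codepage dependence is easy to demonstrate: decode the single byte 0x88 (decimal 136) with different charsets and compare the results. windows-1252 assigns it a printable character; ISO-8859-1 maps it to an unprintable control code:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Byte136Demo {
    public static void main(String[] args) {
        byte[] data = { (byte) 136 };  // 0x88

        // windows-1252 assigns 0x88 to U+02C6 (modifier letter circumflex)
        String cp1252 = new String(data, Charset.forName("windows-1252"));
        System.out.println((int) cp1252.charAt(0));  // 710 (0x02C6)

        // ISO-8859-1 maps every byte 1:1 to U+0000-U+00FF, so 0x88 becomes a control character
        String latin1 = new String(data, StandardCharsets.ISO_8859_1);
        System.out.println((int) latin1.charAt(0));  // 136
    }
}
```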
Solution:
Change the encoding in Eclipse to a more suitable one (in my case ISO-8859-1).
Change the way the file is read:
I now read it byte by byte and convert each byte to a char.
Before, it was read line by line, and that is how I lost some characters.
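The byte-by-byte reading described above can be sketched as follows (the helper name `readRaw` is my own). Reading raw bytes through an InputStream avoids the charset translation that a line-oriented Reader would apply:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class RawReader {
    // Read a file byte by byte, with no charset decoding applied.
    static byte[] readRaw(Path file) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = Files.newInputStream(file)) {
            int b;
            while ((b = in.read()) != -1) {  // -1 signals end of file
                out.write(b);
            }
        }
        return out.toByteArray();
    }
}
```

Each returned byte can then be masked with `& 0xff` before converting to a char, exactly as the asker's `toByte` method does.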
Related
I have an array of bytes that contains a sentence. I need to convert the lowercase letters in this sentence into uppercase letters. Here is the function I wrote:
public void CharUpperBuffAJava(byte[] word) {
    for (int i = 0; i < word.length; i++) {
        if (!Character.isUpperCase(word[i]) && Character.isLetter(word[i])) {
            word[i] -= 32;
        }
    }
}
It works fine with sentences like "a glass of water". The problem is it must work with all ANSI characters, including "ç,á,é,í,ó,ú" and so on. Character.isLetter doesn't recognize these letters, and therefore they are not converted into uppercase.
Do you know how can I identify these ANSI characters as a letter in Java?
EDIT
If someone wants to know, I rewrote the method after the answers, and now it looks like this:
public static int CharUpperBuffAJava(byte[] lpsz, int cchLength) {
    try {
        String value = new String(lpsz, 0, cchLength, "Windows-1252");
        String upperCase = value.toUpperCase();
        byte[] bytes = upperCase.getBytes("Windows-1252");  // encode back with the same charset
        for (int i = 0; i < cchLength; i++) {
            lpsz[i] = bytes[i];
        }
        return cchLength;
    } catch (UnsupportedEncodingException e) {
        return 0;
    }
}
You need to "decode" the byte[] into a character string. There are several APIs to do this, but you must specify the character encoding that was used for the bytes. The overloaded versions that don't take an encoding will give different results on different machines, because they use the platform default.
For example, if you determine that the bytes were encoded with Windows-1252 (sometimes referred to as ANSI):
String s = new String(bytes, "Windows-1252");
String upper = s.toUpperCase();
Convert the byte array into a string, specifying the encoding. Then call toUpperCase(). Then, if you need the result back as a byte array, call getBytes() on the string, again specifying the encoding.
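Putting the round trip together (decode, uppercase, re-encode with the same charset) might look like this sketch; Windows-1252 is assumed here because the question mentions ANSI, and the helper name `upper` is my own:

```java
import java.nio.charset.Charset;

public class UpperAnsi {
    // Uppercase a byte array that holds Windows-1252 ("ANSI") encoded text.
    static byte[] upper(byte[] ansiBytes) {
        Charset cp1252 = Charset.forName("windows-1252");
        String s = new String(ansiBytes, cp1252);   // decode bytes -> characters
        return s.toUpperCase().getBytes(cp1252);    // uppercase, then encode back
    }

    public static void main(String[] args) {
        byte[] in = { (byte) 0xE7 };  // 'ç' in Windows-1252
        byte[] out = upper(in);
        System.out.println(Integer.toHexString(out[0] & 0xff));  // c7, i.e. 'Ç'
    }
}
```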
Can't you simply use:
String s = new String(bytes, "cp1252");
String upper = s.toUpperCase(someLocale);
Wouldn't changing the character set before the conversion do the trick? Java's internal conversion logic might work fine. Something like http://www.exampledepot.com/egs/java.nio.charset/ConvertChar.html, but use ASCII as the target character set.
I am looking at this table:
http://slayeroffice.com/tools/ascii/
Anything > 227 appears to be a letter, but to make it upper case you would subtract 27 from the ASCII value.
I received a string like the one below from an IBM mainframe (2-byte full-width characters):
" ;A;B;C;D;E;F;G;H;I;J;K;L;M;N;O;P;Q;R;S;T;U;V;W;X;Y;Z;a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;0;1;2;3;4;5;6;7;8;9;`;-;=;₩;~;!;@;#;$;%;^;&;*;(;);_;+;|;[;];{;};:;";';,;.;/;<;>;?;";
and I want to change these characters to 1-byte ASCII codes.
How can I replace them using java.util.regex.Matcher / String.replaceAll() in Java?
target characters :
;A;B;C;D;E;F;G;H;I;J;K;L;M;N;O;P;Q;R;S;T;U;V;W;X;Y;Z;a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;0;1;2;3;4;5;6;7;8;9;`;-;=;\;~;!;#;#;$;%;^;&;*;(;);_;+;|;[;];{;};:;";';,;.;/;<;>;?;";
This is not (as other responders are saying) a character-encoding issue, but regexes are still the wrong tool. If Java had an equivalent of Perl's tr/// operator, that would be the right tool, but you can hand-code it easily enough:
public static String convert(String oldString)
{
    // oldChars holds the full-width forms (U+3000, then U+FF01..U+FF5E range, plus U+FFE6);
    // each maps to the character at the same position in newChars
    String oldChars = "　ＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ０１２３４５６７８９｀－＝￦～！＠＃＄％＾＆＊（）＿＋｜［］｛｝：＂＇，．／＜＞？";
    String newChars = " ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789`-=\\~!##$%^&*()_+|[]{}:\"',./<>?";
    StringBuilder sb = new StringBuilder();
    int len = oldString.length();
    for (int i = 0; i < len; i++)
    {
        char ch = oldString.charAt(i);
        int pos = oldChars.indexOf(ch);
        sb.append(pos < 0 ? ch : newChars.charAt(pos));
    }
    return sb.toString();
}
I'm assuming each character in the first string corresponds to the character at the same position in the second string, and that the first character (U+3000, 'IDEOGRAPHIC SPACE') should be converted to an ASCII space (U+0020).
Be sure to save the source file as UTF-8, and include the -encoding UTF-8 option when you compile it (or tell your IDE to do so).
I don't think this one is about regex; it's about encoding. It should be possible to read the input into a String using the 2-byte encoding and then write it out with any other encoding.
Look here for supported encodings.
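If the 2-byte characters really are the Unicode full-width forms, java.text.Normalizer with NFKC normalization folds them to their ASCII equivalents in one call. Note this applies the standard Unicode compatibility mapping only, so it will not reproduce the asker's custom remapping of ＠ to #:

```java
import java.text.Normalizer;

public class FullWidthFold {
    public static void main(String[] args) {
        String fullWidth = "ＡＢＣ１２３！？";  // full-width forms (U+FF21.., U+FF11.., etc.)
        // NFKC compatibility normalization replaces each full-width form
        // with its ordinary ASCII counterpart
        String folded = Normalizer.normalize(fullWidth, Normalizer.Form.NFKC);
        System.out.println(folded);  // ABC123!?
    }
}
```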
How can I convert ASCII values to hexadecimal and binary values (not their string representation in ASCII)? For example, how can I convert the decimal value 26 to 0x1A?
So far, I've tried converting using the following steps (see below for actual code):
Converting value to bytes
Converting each byte to int
Converting each int to hex, via Integer.toString(intValue, radix)
Note: I did ask a related question about writing hex values to a file.
Clojure code:
(apply str
  (for [byte (.getBytes value)]
    (.replace (format "%2s" (Integer/toString (.intValue byte) 16)) " " "0")))
Java code:
byte[] bytes = "26".getBytes();
for (byte data : bytes) {
    System.out.print(String.format("%2s", Integer.toString(data & 0xff, 16)).replace(" ", "0"));
}
System.out.print("\n");
Hexadecimal, decimal, and binary integers are not different things -- there's only one underlying representation of an integer. The one thing you said you're trying to avoid -- "the ASCII string representation" -- is the only thing that's different. The variable is always the same, it's just how you print it that's different.
Now, it's not 100% clear to me what you're trying to do. But given the above, the path is clear: if you've got a String, convert it to an int by parsing (i.e., using Integer.parseInt()). Then if you want it printed in some format, it's easy to print that int as whatever you want using, for example, printf format specifiers.
If you actually want hexadecimal strings, then (format "%02X" n) is much simpler than the hoops you jump through in your first try. If you don't, then just write the integer values to a file directly without trying to convert them to a string.
Something like (.write out (read-string string-representing-a-number)) should be sufficient.
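The advice above boils down to a couple of lines of Java: parse the string into an int once, then choose the presentation at print time:

```java
public class IntFormats {
    public static void main(String[] args) {
        int n = Integer.parseInt("26");  // one underlying int value

        System.out.println(String.format("0x%02X", n));  // 0x1A  (hex)
        System.out.println(Integer.toBinaryString(n));   // 11010 (binary)
        System.out.println(n);                           // 26    (decimal)
    }
}
```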
Here are your three steps rolled up into one line of clojure:
(apply str (map #(format "0x%02X " %) (.getBytes (str 42))))
convert to bytes (.getBytes (str 42))
no actual need for step 2
convert each byte to a string of characters representing it in hex
or you can make it look more like your steps with the "thread last" macro
(->> (str 42) ; start with your value
(.getBytes) ; put it in an array of bytes
(map #(format "0x%02X " %)) ; make hex string representation
(apply str)) ; optionally wrap it up in a string
static String decimalToHex(String decimal, int minLength) {
    long n = Long.parseLong(decimal, 10);
    // Long.toHexString formats assuming its argument is unsigned,
    // so handle the sign separately.
    String hex = Long.toHexString(Math.abs(n));
    StringBuilder sb = new StringBuilder(minLength);
    if (n < 0) { sb.append('-'); }
    int padding = minLength - hex.length() - sb.length();
    while (--padding >= 0) { sb.append('0'); }
    return sb.append(hex).toString();
}
// get the Unicode value of a char
char theChar = 'a';

// use this to go from int to Unicode, hex, or ASCII
int theValue = 26;
String hex = Integer.toHexString(theValue);
while (hex.length() < 4) {
    hex = "0" + hex;  // pad to four hex digits
}
String unicode = "\\u" + hex;
System.out.println(hex);
How can I get the UTF-8 code of a char in Java?
I have the char 'a' and I want the value 97
I have the char 'é' and I want the value 233
here is a table for more values
I tried Character.getNumericValue(a), but for 'a' it gives me 10 and not 97 - any idea why?
This seems very basic but any help would be appreciated!
char is actually a numeric type containing the unicode value (UTF-16, to be exact - you need two chars to represent characters outside the BMP) of the character. You can do everything with it that you can do with an int.
Character.getNumericValue() tries to interpret the character as a digit.
You can use the codePointAt(int index) method of java.lang.String for that. Here's an example:
"a".codePointAt(0) --> 97
"é".codePointAt(0) --> 233
If you want to avoid creating strings unnecessarily, the following works as well and can be used for char arrays:
Character.codePointAt(new char[] {'a'},0)
Those "UTF-8" codes are no such thing. They're actually just Unicode values, as per the Unicode code charts.
So an 'é' is actually U+00E9 - in UTF-8 it would be represented by two bytes { 0xc3, 0xa9 }.
Now to get the Unicode value - or to be more precise the UTF-16 value, as that's what Java uses internally - you just need to convert the value to an integer:
char c = '\u00e9'; // c is now e-acute
int i = c; // i is now 233
This produces good result:
int a = 'a';
System.out.println(a); // outputs 97
Likewise:
System.out.println((int)'é');
prints out 233.
Note that the first example only works for characters included in the standard and extended ASCII character sets. The second works with all Unicode characters. You can achieve the same result by multiplying the char by 1.
System.out.println( 1 * 'é');
Your question is unclear. Do you want the Unicode codepoint for a particular character (which is the example you gave), or do you want to translate a Unicode codepoint into a UTF-8 byte sequence?
If the former, then I recommend the code charts at http://www.unicode.org/
If the latter, then the following program will do it:
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;

public class Foo
{
    public static void main(String[] argv)
    throws Exception
    {
        char c = '\u00E9';
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        OutputStreamWriter out = new OutputStreamWriter(bos, "UTF-8");
        out.write(c);
        out.flush();
        byte[] bytes = bos.toByteArray();
        for (int ii = 0 ; ii < bytes.length ; ii++)
            System.out.println(bytes[ii] & 0xFF);
    }
}
(there's also an online Unicode to UTF8 page, but I don't have the URL on this machine)
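Since Java 7, the same encoding can be done in a single call with String.getBytes and StandardCharsets, avoiding the stream plumbing:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {
    public static void main(String[] args) {
        // "\u00e9" is 'é'; in UTF-8 it encodes to the two bytes 0xC3 0xA9
        byte[] bytes = "\u00e9".getBytes(StandardCharsets.UTF_8);
        for (byte b : bytes) {
            System.out.println(b & 0xFF);  // prints 195, then 169
        }
    }
}
```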
My method to do it is something like this:
char c = 'c';
int i = Character.codePointAt(String.valueOf(c), 0);
// testing
System.out.println(String.format("%c -> %d", c, i)); // c -> 99
You can create a simple loop to list a range of Unicode characters like this:
public class UTF8Characters {
    public static void main(String[] args) {
        for (int i = 12; i <= 999; i++) {
            System.out.println(i + " - " + (char) i);
        }
    }
}
There is an open source library MgntUtils that has a utility class StringUnicodeEncoderDecoder. That class provides static methods that convert any String into a Unicode sequence and vice versa. Very simple and useful. To convert a String you just do:
String codes = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(myString);
For example, the String "Hello World" will be converted into
"\u0048\u0065\u006c\u006c\u006f\u0020
\u0057\u006f\u0072\u006c\u0064"
It works with any language. Here is a link to an article that explains all the details about the library: MgntUtils. Look for the subtitle "String Unicode converter". The article links to Maven Central, where you can get the artifacts, and to GitHub, where you can get the project itself. The library comes with well-written javadoc and source code.
Having ignored it all this time, I am currently forcing myself to learn more about Unicode in Java. There is an exercise I need to do about converting a UTF-16 string to 8-bit ASCII. Can someone please enlighten me on how to do this in Java? I understand that you can't represent all possible Unicode values in ASCII, so in this case I want any code that exceeds 0xFF to simply be added anyway (bad data should also just be added silently).
Thanks!
You can use java.nio for an easy solution:
// first encode the utf-16 string as a ByteBuffer
ByteBuffer bb = Charset.forName("utf-16").encode(CharBuffer.wrap(utf16str));
// then decode those bytes as US-ASCII
CharBuffer ascii = Charset.forName("US-ASCII").decode(bb);
How about this:
String input = ... // my UTF-16 string
StringBuilder sb = new StringBuilder(input.length());
for (int i = 0; i < input.length(); i++) {
    char ch = input.charAt(i);
    if (ch <= 0xFF) {
        sb.append(ch);
    }
}
byte[] ascii = sb.toString().getBytes("ISO-8859-1"); // aka LATIN-1
This is probably not the most efficient way to do this conversion for large strings since we copy the characters twice. However, it has the advantage of being straightforward.
BTW, strictly speaking there is no such character set as 8-bit ASCII. ASCII is a 7-bit character set. LATIN-1 is the nearest thing there is to an "8-bit ASCII" character set (and block 0 of Unicode is equivalent to LATIN-1) so I'll assume that's what you mean.
EDIT: in the light of the update to the question, the solution is even simpler:
String input = ... // my UTF-16 string
byte[] ascii = new byte[input.length()];
for (int i = 0; i < input.length(); i++) {
    ascii[i] = (byte) input.charAt(i);
}
This solution is more efficient. Since we now know how many bytes to expect, we can preallocate the byte array and copy the (truncated) characters without using a StringBuilder as an intermediate buffer.
However, I'm not convinced that dealing with bad data in this way is sensible.
EDIT 2: there is one more obscure "gotcha" with this. Unicode actually defines code points (characters) to be "roughly 21-bit" values ... 0x000000 to 0x10FFFF ... and uses surrogate pairs to represent codes > 0x00FFFF. In other words, a Unicode codepoint > 0x00FFFF is actually represented in UTF-16 as two "characters". Neither my answer nor any of the others takes account of this (admittedly esoteric) point. In fact, dealing with codepoints > 0x00FFFF in Java is rather tricky in general. This stems from the fact that 'char' is a 16-bit type and String is defined in terms of 'char'.
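The surrogate issue is easy to observe with any character outside the BMP; U+1F600 is used here purely as an example. It occupies two chars in a Java String, and the codePoint APIs are needed to see it as a single character:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // 'a' followed by U+1F600, which UTF-16 stores as the surrogate pair D83D DE00
        String s = "a\uD83D\uDE00";

        System.out.println(s.length());                       // 3 - counts chars, not characters
        System.out.println(s.codePointCount(0, s.length()));  // 2 - actual code points
        System.out.println(Integer.toHexString(s.codePointAt(1)));  // 1f600 - the full code point
    }
}
```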
EDIT 3: maybe a more sensible solution for dealing with unexpected characters that don't convert to ASCII is to replace them with the standard replacement character:
String input = ... // my UTF-16 string
byte[] ascii = new byte[input.length()];
for (int i = 0; i < input.length(); i++) {
    char ch = input.charAt(i);
    ascii[i] = (ch <= 0xFF) ? (byte) ch : (byte) '?';
}
Java internally represents strings in UTF-16. If a String object is what you are starting with, you can encode using String.getBytes(Charset c), where you might specify US-ASCII (which can map code points 0x00-0x7f) or ISO-8859-1 (which can map code points 0x00-0xff, and may be what you mean by "8-bit ASCII").
As for adding "bad data"... ASCII or ISO-8859-1 strings simply can't represent values outside of a certain range. getBytes will not drop characters it can't represent in the destination character set; it replaces each of them with the charset's default replacement byte (typically '?').
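The replacement behaviour is easy to check: encoding a string containing 'é' to US-ASCII yields a '?' byte (63) in its place rather than a shorter array:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiReplacement {
    public static void main(String[] args) {
        // 'a', 'é' (U+00E9, unmappable in ASCII), 'b'
        byte[] out = "a\u00e9b".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.toString(out));  // [97, 63, 98] - 'é' became '?'
    }
}
```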
Since this is an exercise, it sounds like you need to implement this manually. You can think of an encoding (e.g. UTF-16 or ASCII) as a lookup table that matches a sequence of bytes to a logical character (a codepoint).
Java uses UTF-16 strings, which means that any given codepoint can be represented in one or two char variables. Whether you want to handle the two-char surrogate pairs depends on how likely you think your application is to encounter them (see the Character class for detecting them). ASCII only uses the first 7 bits of an octet (byte), so the valid range of values is 0 to 127. UTF-16 uses identical values for this range (they're just wider). This can be confirmed with this code:
Charset ascii = Charset.forName("US-ASCII");
byte[] buffer = new byte[1];
char[] cbuf = new char[1];
for (int i = 0; i <= 127; i++) {
    buffer[0] = (byte) i;
    cbuf[0] = (char) i;
    String decoded = new String(buffer, ascii);
    String utf16String = new String(cbuf);
    if (!utf16String.equals(decoded)) {
        throw new IllegalStateException();
    }
    System.out.print(utf16String);
}
System.out.println("\nOK");
Therefore, you can convert UTF-16 to ASCII by casting a char to a byte.
You can read more about Java character encoding here.
Just to optimize on the accepted answer and pay no penalty if the string is already all ASCII characters, here is the optimized version. Thanks @stephen-c
// requires java.util.Arrays and java.nio.charset.StandardCharsets
public static String toAscii(String input) {
    final int length = input.length();
    int ignoredChars = 0;
    byte[] ascii = null;
    for (int i = 0; i < length; i++) {
        char ch = input.charAt(i);
        if (ch > 0xFF) {
            //-- ignore this non-ascii character
            ignoredChars++;
            if (ascii == null) {
                //-- first non-ascii character: create the ascii array and copy every character before it
                ascii = new byte[length - 1]; //-- we know the result will be shorter by at least 1
                for (int j = 0; j < i; j++) {
                    ascii[j] = (byte) input.charAt(j);
                }
            }
        } else if (ascii != null) {
            ascii[i - ignoredChars] = (byte) ch;
        }
    }
    //-- (ignoredChars == 0) is the same as (ascii == null), i.e. no non-ascii characters were found
    //-- decode with ISO-8859-1 so bytes 0x80-0xFF survive regardless of the platform default charset
    return ignoredChars == 0 ? input
            : new String(Arrays.copyOf(ascii, length - ignoredChars), StandardCharsets.ISO_8859_1);
}