Is this ASCII?
And how can I print it in a readable form, like chars?
I get this response from my COM port.
Here is how I read it:
boolean ok = false;
int read = 0;
System.out.println("In Read :");
while (!ok) {
    int availableBytes = 0;
    try {
        availableBytes = inputStream.available();
        if (availableBytes > 0) {
            read = read + availableBytes;
            int raw = inputStream.read(readBuffer, read - availableBytes, availableBytes);
            System.out.println("Inputstream = " + raw);
            traduction = new String(readBuffer, read - availableBytes, availableBytes);
            System.out.println("2=>" + traduction);
            Response = new String(readBuffer, "UTF-8"); // bytes -> String
        }
    } catch (IOException e) {
    }
    if (availableBytes == 0 && (read == 19 || read == 8)) {
        ok = true;
    }
}
As I read your comments, I am under the impression that you're a little confused as to what a character and ASCII are.
Characters are numbers. Plain dumb numbers. It just so happens that people created standard mappings between numbers and letters. For instance, according to the ASCII character map, 97 is a. The implications of this are that when display software sees 97, it knows that it has to find the glyph for the character a in a given font, and draw it to the screen.
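A quick way to see this mapping from Java (a minimal sketch; the casts are all there is to it):
import java.io.IOException;

public class AsciiDemo {
    public static void main(String[] args) throws IOException {
        char letter = 'a';
        System.out.println((int) letter); // 97: the number behind 'a'
        System.out.println((char) 97);    // a: the glyph behind 97
    }
}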
Integer values 0 through 31, when interpreted with the ASCII character map, are so-called control characters and as such have no visual glyph associated with them. They tell software how to behave rather than what to display. For instance, character #0 is the NUL character: it is used to signal the end of a string by the C string library and has little to no practical use in most other languages. Character #10 is LF, for "line feed" (the new-line character), and it tells the rendering software to move the drawing cursor to the next line rather than to render a glyph; its companion #13 is CR, the carriage return.
Most ASCII control characters are outdated and are not meant to be sent to text rendering software. As such, implementations decide how they deal with them if they don't know what to do. Many of them do nothing, some print question marks, and some print completely unrelated characters.
ASCII only maps the integers 0 through 127 to glyphs or control characters, which leaves the other 128 possible values of a byte undefined. Integers above 127 have no associated glyph in the ASCII standard, and only those can be called "not ASCII". So what you should really be asking is "is this text?" rather than "is this ASCII?", because any sequence of integers between 0 and 127 is necessarily ASCII, which says nothing about whether or not it is human-readable.
And the obvious answer to that question is "no, it's not text". Asking what it is if it's not text is asking us to be psychics, since there's no "universal bug" that maims text. It could be almost anything.
However, since you state that you're reading from a serial link, I'd advise you to check the baud rate and the other link settings, because there is no built-in mechanism to detect a mismatch between one end and the other, and such a mismatch can mangle data exactly the way it is mangled here.
Use the raw value instead of availableBytes:
traduction = new String(readBuffer, read-availableBytes, raw);
The raw value indicates how many bytes were actually read, as opposed to how many you requested. If you ask for 10 bytes and it only reads 5, the remaining 5 will be unknown garbage.
UPDATE
The response is obviously wrong too and for the same reason:
Response = new String(readBuffer, "UTF-8");
You are telling it to convert the entire buffer even though you may have read only 1 byte. If you're a bit unlucky you'll end up with replacement characters or garbage, because not every byte sequence in the untouched part of the buffer is valid UTF-8.
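Putting both fixes together, the loop might look roughly like this (a sketch only, keeping your readBuffer, traduction and Response variables, and converting only the bytes that have actually been read):
int read = 0;
boolean ok = false;
while (!ok) {
    int availableBytes = 0;
    try {
        availableBytes = inputStream.available();
        if (availableBytes > 0) {
            // read() may return fewer bytes than requested
            int raw = inputStream.read(readBuffer, read, availableBytes);
            if (raw > 0) {
                traduction = new String(readBuffer, read, raw, "UTF-8");
                read += raw;
                // convert only the part of the buffer filled so far
                Response = new String(readBuffer, 0, read, "UTF-8");
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    if (availableBytes == 0 && (read == 19 || read == 8)) {
        ok = true;
    }
}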
Hi all and thank you for the help in advance.
I have scoured the web and have not really turned up anything concrete on my question.
I have a program I am developing in Java whose primary purpose is to read a .DAT file, extract certain values from it, calculate an output based on those values, and then write that output back to the file.
The file is made up of records that are all the same length and format, so it should be fairly straightforward to access. Currently I am using a loop and an if statement to find the first occurrence of a record, then determine the length of each record through user input, and then loop through each record.
HOWEVER! The first record of this file is blank (or so I thought). As it turns out, this first record is the key to the rest of the file: its first few chars are ASCII and encode the record length and the number of records contained in the file, respectively.
Below is a list of the ASCII values themselves as found in the files (disregard the " "; the data is contained within them):
"#¼ ä "
"#g â "
"ÇG # "
"lj ‰ "
"Çò È "
"=¼ "
A friend of mine who coded in BASIC many years ago reckons the first 3 chars refer to the record length and the following 9 refer to the number of records.
Basically what I need to do is convert this initial string of ASCII chars into two decimal numbers in order to work out the length of each record and the number of records.
Any assistance will be greatly appreciated.
Edit...
Please find below the BASIC code used to access the file in the past; perhaps this will help?
CLS
INPUT "Survey System Data File? : ", survey$
survey$ = "f:\apps\survey\" + survey$
reclen = 3004
OPEN survey$ + ".dat" FOR RANDOM AS 1 LEN = reclen
FIELD #1, 3 AS RL$, 9 AS n$
GET #1, 1
RL = CVI(RL$): n = CVI(n$)
PRINT "Record Length = "; RL
reclen = RL
PRINT "Number of Records = "; n
CLOSE #1
Basically, what I am looking for is something similar, but in Java.
ASCII is a particular way to translate a bit pattern in a byte to a character, and that gives each character a numerical value; for the letter 'A' this is 65.
In Java, you can get that numerical value by converting the char to an int (strictly speaking this gives you the Unicode value, but for ASCII characters the Unicode value is the same as the ASCII value, so it does not matter).
But now you need to know how the length is calculated: do you have to add the values? Or multiply them? Or append them? Or multiply them with 128^p where p is the position, and add the result? And, in the latter case, is the first byte on position 0 or position 3?
Same for the number of records, of course.
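For instance, if your friend's recollection is right and the header uses the same two-byte, low-byte-first layout that BASIC's CVI expects, a sketch of the decoding could look like the following (this interpretation and the file name are assumptions, not something the file format guarantees):
import java.io.IOException;
import java.io.RandomAccessFile;

public class DatHeader {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("survey.dat", "r")) {
            byte[] header = new byte[12];   // 3 chars + 9 chars, as in the FIELD statement
            raf.readFully(header);
            // CVI-style: 16-bit integer, low byte first
            int recordLength = (header[0] & 0xFF) | ((header[1] & 0xFF) << 8);
            int recordCount  = (header[3] & 0xFF) | ((header[4] & 0xFF) << 8);
            System.out.println("Record Length = " + recordLength);
            System.out.println("Number of Records = " + recordCount);
        }
    }
}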
Another possible interpretation of the data is that the bytes are BCD-encoded numbers. In that case, each nibble (4-bit half of a byte) represents a digit from 0 to 9, and you have to do some bit manipulation to extract the digits and concatenate them, from left (highest) to right (lowest). At least you do not have to struggle with the byte order and further interpretation there …
But as BCD would require all 8 bits, this would not be the right interpretation if the file really contains ASCII, since ASCII is 7-bit.
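Should the BCD interpretation turn out to apply (again, an assumption), extracting the two digits from a byte is a small bit-manipulation exercise:
// Split one BCD-encoded byte into its two decimal digits.
int b = 0x42;                          // example byte holding the digits 4 and 2
int highDigit = (b >> 4) & 0x0F;       // upper nibble: 4
int lowDigit  = b & 0x0F;              // lower nibble: 2
int value = highDigit * 10 + lowDigit; // 42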
I am writing a Huffman Compression/Decompression program. I have started writing my compression method and I am stuck. I am trying to read all bytes in the file and then put all of the bytes into a byte array. After putting all bytes into the byte array I create an int[] array that will store all the frequencies of each byte (with the index being the ASCII code).
It does include the extended ASCII table, since the size of the int array is 256. However, I encounter issues as soon as I read a special character in my file (i.e. a character with a value higher than 127). I understand that a byte is signed and wraps around to a negative value as soon as it crosses the 127 limit (and an array index obviously can't be negative), so I tried to counter this by turning it into an unsigned value when I specify the array index (array[myByte & 0xFF]).
This kind of worked, but it gave me the wrong value (for example, if the correct value for the character is 134, I instead got 191 or something). The even more annoying part is that I noticed that special characters are split into 2 separate bytes, which I feel will cause problems later (for example when I try to decompress).
How do I make my program compatible with every single type of character (this program is supposed to be able to compress/decompress pictures, MP3s, etc.)?
Maybe I am taking the wrong approach to this, but I don't know what the right approach is. Please give me some tips for structuring this.
Tree:
package CompPck;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.PriorityQueue;
import java.util.TreeMap;
abstract class Tree implements Comparable<Tree> {
    public final int frequency; // the frequency of this tree

    public Tree(int freq) { frequency = freq; }

    // compares on the frequency
    public int compareTo(Tree tree) {
        return frequency - tree.frequency;
    }
}

class Leaf extends Tree {
    public final int value; // the character this leaf represents

    public Leaf(int freq, int val) {
        super(freq);
        value = val;
    }
}

class Node extends Tree {
    public final Tree left, right; // subtrees

    public Node(Tree l, Tree r) {
        super(l.frequency + r.frequency);
        left = l;
        right = r;
    }
}
Build tree method:
public static Tree buildTree(int[] charFreqs) {
    PriorityQueue<Tree> trees = new PriorityQueue<Tree>();
    for (int i = 0; i < charFreqs.length; i++) {
        if (charFreqs[i] > 0) {
            trees.offer(new Leaf(charFreqs[i], i));
        }
    }
    //assert trees.size() > 0;
    while (trees.size() > 1) {
        Tree a = trees.poll();
        Tree b = trees.poll();
        trees.offer(new Node(a, b));
    }
    return trees.poll();
}
Compression method:
public static void compress(File file) {
    try {
        Path path = Paths.get(file.getAbsolutePath());
        byte[] content = Files.readAllBytes(path);
        TreeMap<Integer, String> treeMap = new TreeMap<Integer, String>();
        File nF = new File(file.getName() + "_comp");
        nF.createNewFile();
        BitFileWriter bfw = new BitFileWriter(nF);

        int[] charFreqs = new int[256];

        // read each byte and record the frequencies
        for (byte b : content) {
            charFreqs[b & 0xFF]++;
            System.out.println(b & 0xFF);
        }

        // build tree
        Tree tree = buildTree(charFreqs);

        // build TreeMap
        fillEncodeMap(tree, new StringBuffer(), treeMap);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Encodings matter
If I take the character "ö" and read it in my file it will now be
represented by 2 different values (191 and 182 or something like that)
when its actual ASCII table value is 148.
That really depends on which kind of encoding was used to create your text file. Encodings determine how text is stored as bytes.
In UTF-8 the ö is stored as hex [0xc3, 0xb6] or [195, 182]
In ISO/IEC 8859-1 (= "Latin-1") it would be stored as hex [0xf6], or [246]
In Mac OS Central European, it would be hex [0x9a] or [154]
Please note, that the basic ASCII table itself doesn't really describe anything for that kind of character. ASCII only uses 7 bits, and by doing so only maps 128 codes.
Part of the problem is that, in layman's terms, "ASCII" is sometimes used to describe extensions of ASCII as well (e.g. Latin-1).
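You can see those differences from Java itself; here is a minimal sketch using the two charsets the JDK is guaranteed to support (the byte values are printed as Java's signed bytes, so 0xC3 shows up as -61):
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {
    public static void main(String[] args) {
        String s = "ö";
        // UTF-8 needs two bytes for this character: 0xC3 0xB6
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));      // [-61, -74]
        // Latin-1 fits it into a single byte: 0xF6
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.ISO_8859_1))); // [-10]
    }
}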
History
There's actually a bit of history behind that. Originally ASCII was a very limited set of characters. When those weren't enough, each region started using the 8th bit to add its own language-specific characters, leading to all kinds of compatibility issues.
Then a consortium made an inventory of all characters in all possible languages (and beyond). That set is called "Unicode". It contains not just 128 or 256 characters, but thousands of them.
From that point on you needed more advanced encodings to cover them. UTF-8 is one of those encodings; it covers the entire Unicode set, and it does so while being kind of backwards compatible with ASCII.
Each ASCII character is still mapped in the same way, but when one byte isn't enough, the 8th bit is used to indicate that further bytes follow, which is the case for the ö character.
Tools
If you're using a more advanced text editor like Notepad++, then you can select your encoding from the drop-down menu.
In programming
Having said that, your current Java source reads bytes; it is not reading characters. And I would argue that working at the byte level is a plus here, because then it can support all encodings. Maybe you don't need to work at the character level at all.
However, it may matter for your specific algorithm. Let's say you've written an algorithm that is only supposed to handle Latin-1-encoded text; then it really is going to work at the "character level" and not at the "byte level". In that case, consider reading directly into a String or char[].
Java can do the heavy lifting for you in that case. There are readers in Java that will let you read a text file directly into Strings/char[]. However, in those cases you should of course specify an encoding when you use them. Internally, a single Java char holds 16 bits (2 bytes) of data.
Trying to convert bytes to characters manually is a tricky business, unless you're working with plain old ASCII, of course. The moment you see a value above 0x7F (127), which is represented by a negative value in a byte, you're no longer working with simple ASCII. Then consider using something like new String(bytes, StandardCharsets.UTF_8). There's no need to write a decoding algorithm from scratch.
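As a sketch of that character-level route, assuming the file really is UTF-8 text (the file name is made up):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadAsText {
    public static void main(String[] args) throws IOException {
        // Decode the whole file with an explicit charset instead of the platform default.
        String text = new String(Files.readAllBytes(Paths.get("input.txt")), StandardCharsets.UTF_8);
        for (char c : text.toCharArray()) {
            // 'ö' shows up as a single char (value 246) here,
            // even though it occupies two bytes on disk in UTF-8.
            System.out.println((int) c);
        }
    }
}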
I am trying to send data from a PHP TCP server to a Java TCP client.
I am comparing my results by comparing the hex values of the data.
The PHP script reads STDIN, sends it through the socket one byte at a time, and Java reads it using DataInputStream.read(), converts it to hex and displays it.
If I manually type data into the script, it works OK.
If I use a file with data, it works OK.
But when I feed it /dev/urandom (even a few bytes), the data on the Java side comes out corrupted. There is always a hex value of efbfbd in random places instead of the correct data.
Please help me with this issue.
PHP code:
$f = fopen( 'php://stdin', 'rb' );
while ($line = fread($f, 1)) {
    $length = 1;
    echo bin2hex($line)."\n";
    echo socket_write($client, $line, 1)."\n";
    $sent = socket_write($client, $line, $length);
    if ($sent === false) {
        break;
    }
    // Check if the entire message has been sent
    if ($sent < $length) {
        // If the entire message was not sent,
        // keep the part of the message that has not yet been sent as the message
        $line = substr($line, $sent);
        // and keep the length of the part that was not sent
        $length -= $sent;
    }
}
Java code:
in = new DataInputStream(clientSocket.getInputStream());
byte[] data = new byte[1];
int count = 0;
while (in.available() > 0) {
    //System.out.println(in.available());
    in.read(data);
    String message = new String(data);
    System.out.println(message);
    //System.out.flush();
    System.out.println(toHex(message));
    //in.flush();
    message = "";
}
You're stumbling upon encoding. By calling new String(data), the byte array is converted to a string using your default encoding, whatever that encoding may be (you can set it to UTF-8 with java -Dfile.encoding=UTF-8, for example).
The Java code you want would most likely look like the following:
in = new DataInputStream(clientSocket.getInputStream());
byte[] data = new byte[1];
int count = 0;
while (in.available() > 0) {
    // System.out.println(in.available());
    in.read(data);
    String hexMessage = Integer.toHexString(data[0] & 0xFF);
    String stringMessage = new String(data, "UTF-8"); // US-ASCII, ISO-8859-1, ...
    System.out.println(hexMessage);
}
Update: I missed the 32-bit issue. The 8-bit byte, which is signed in Java, is sign-extended to a 32-bit int. To undo this sign extension, mask the byte with 0xFF.
There are two main issues with your Java program.
First, the use of in.available(). It does not tell you how many bytes there are still in the message. It merely says how many bytes are currently available in the stream for reading without blocking. For example, if the server sends two packets of data over the socket, one has arrived, but one is still being sent over the Internet, and each packet has 200 bytes (this is just an example), then on the first call you'll get the answer 200. If you read 200 bytes, you're sure not to be blocked. But if the second packet has not arrived yet, your next check of in.available() will return 0. If you stop at that point, you only have half the data, which is not what you wanted.
Typically you either read until you reach end-of-stream (InputStream.read() returns -1), after which you can't use the same stream anymore and you close the socket, or you have a specific protocol that tells you how many bytes to expect and you read exactly that number of bytes.
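A sketch of the read-until-end-of-stream approach, continuing from the in variable in your code (ByteArrayOutputStream comes from java.io; this assumes the server closes the connection when it has finished sending):
// Collect everything the server sends until it closes the stream.
ByteArrayOutputStream received = new ByteArrayOutputStream();
byte[] buffer = new byte[4096];
int n;
while ((n = in.read(buffer)) != -1) {
    received.write(buffer, 0, n); // only the n bytes actually read
}
byte[] allBytes = received.toByteArray();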
But that's not the reason for the strange values you see in the output of your program. The reason is that Java and PHP represent strings completely differently. In PHP, a string can contain any bytes at all, and the interpretation of them as characters is up to the programmer.
This basically means that a PHP string is the equivalent of a byte[] in Java.
But Java Strings are completely different. It consists internally of an array of char, and char is always two bytes in UTF-16 encoding. When you convert bytes you read into a Java String, it's always done by encoding the bytes using some character encoding so that the appropriate characters are stored in the string.
For example, if your bytes are 44 4F 4C 4C, and the character encoding is ISO-8859-1, this will be interpreted as the characters \u0044, \u004F, \u004C, \u004C. It will be a string of four characters - "DOLL". But if your character encoding is UTF-16, the bytes will be interpreted as \u444F and \u4C4C. A string of only two characters, "䑏䱌".
When you were reading from the console or from a file, the data was probably in the encoding that Java expects by default. This is usually the case when the file is written in pure English, with just English letters, spaces and punctuation. These are all 7-bit characters which are the same in ISO-8859-1 and UTF-8, which are the common defaults. But in /dev/urandom you'd have some bytes in the range 80 through FF, which may be treated differently when interpreted into a UTF-16 Java string.
Furthermore, you didn't show your toHex() method in Java. It probably reads bytes back from the string again, but using which encoding? If you read the bytes into the String using ISO-8859-1, and got them out in UTF-8, you'd get completely different bytes.
If you want to see exactly what PHP sent you, don't put the bytes in a String. Write a toHex method that works on byte arrays, and use the byte[] you read directly.
Also, always remember to check the number of bytes returned by read() and only interpret that number of bytes! read() does not always fill the entire array. So in your new toHex() method, you need to also pass the number of bytes read as a parameter, so that it doesn't display the parts of the array after them. In your case you just have a one-byte array - which is not recommended - but even in this case, read() can return 0, and it's a perfectly legal value indicating that in this particular call to read() there were no bytes available although there may be some available in the next read().
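For example, a toHex variant that works on the raw bytes and respects the count returned by read() could look like this (the name and signature are mine, not from your code):
// Hex-dump only the first 'count' bytes of the array.
static String toHex(byte[] bytes, int count) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < count; i++) {
        sb.append(String.format("%02x", bytes[i] & 0xFF));
    }
    return sb.toString();
}
You would then call it as int n = in.read(data); if (n > 0) System.out.println(toHex(data, n));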
As the comment above says, you might be having trouble with the string representation of the bytes (String message = new String(data);). To be certain, take the raw bytes and encode them in Base64, for example. You can use a library such as Apache Commons, or java.util.Base64 in Java 8, to do that. You should be able to do something similar in PHP to compare.
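A sketch with the Java 8 built-in encoder, reusing the data array from your snippet and encoding only the bytes actually read (base64_encode on the PHP side would give you something to compare against):
import java.util.Arrays;
import java.util.Base64;

// ...
int n = in.read(data);
if (n > 0) {
    String b64 = Base64.getEncoder().encodeToString(Arrays.copyOf(data, n));
    System.out.println(b64);
}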
How can I start reading from a specific byte? I have the following code:
try {
    while ((len = f.read(buffer)) > 0) { }
} catch (IOException e) {
    e.printStackTrace();
}
For example, I want to start to read at byte 50.
You should use the skip method.
http://developer.android.com/reference/java/io/InputStream.html#skip(long)
You can skip the number of bytes you want.
long nbToSkip = 50;
while (nbToSkip > 0) {
    long nbSkipped = f.skip(nbToSkip);
    if (nbSkipped <= 0) {
        break; // nothing could be skipped (e.g. end of stream)
    }
    nbToSkip -= nbSkipped;
}
I'm not sure what the type of 'f' is, however I will assume it's some kind of stream.
Also, do you mean you want to read in blocks of 50 bytes, or start from the 50th byte in the stream? I'll assume the latter.
Not sure what language you are using, but here goes anyway:
Maybe there's a seek kind of function with which you can seek to a specific position in the stream, like so:
f.seek(50, SEEK_START);
otherwise you can do it the poor man's way by just reading 50 bytes, or 50 times 1 byte from the stream.
Hard to give you a good answer without any more reference.
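For the Java case specifically, if the data comes from a file rather than a pure stream, RandomAccessFile gives you an actual seek; a minimal sketch (the file name is made up):
import java.io.IOException;
import java.io.RandomAccessFile;

public class SeekDemo {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("data.bin", "r")) {
            raf.seek(50);                        // jump straight to byte 50
            int firstByteAfterSeek = raf.read(); // -1 if the file is shorter than that
            System.out.println(firstByteAfterSeek);
        }
    }
}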
Basically I'm trying to use a BufferedWriter to write to a file using Java. The problem is that I'm actually doing some compression, so I generate ints between 0 and 255, and I want to write the character whose ASCII value is equal to that int. When I try writing to the file, it writes many ? characters, so when I read the file back in, it reads those as 63, which is clearly not what I want. Any ideas how I can fix this?
Example code:
int a = generateCode(character); //a now has an int between 0 and 255
bw.write((char) a);
a is always between 0 and 255, but it sometimes writes '?'
You are really trying to write / read bytes to / from a file.
When you are processing byte-oriented data (as distinct from character-oriented data), you should be using InputStream and OutputStream classes and not Reader and Writer classes.
In this case, you should use FileInputStream / FileOutputStream, and wrap with a BufferedInputStream / BufferedOutputStream if you are doing byte-at-a-time reads and writes.
Those pesky '?' characters are due to issues in the encoding/decoding process that happens when Java converts between characters and the default text encoding for your platform. The conversion from bytes to characters and back is often "lossy", depending on the encoding scheme used. You can avoid this by using the byte-oriented stream classes.
(And the answers that point out that ASCII is a 7-bit not 8-bit character set are 100% correct. You are really trying to read / write binary octets, not characters.)
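A sketch of the byte-oriented approach for values between 0 and 255 (the file name is made up, and generateCode stands in for your own method):
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class WriteBytes {
    public static void main(String[] args) throws IOException {
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream("codes.bin"))) {
            int a = 200;  // any value between 0 and 255, e.g. from generateCode(...)
            out.write(a); // writes the low 8 bits of the int; no charset is involved
        }
        // Reading it back with FileInputStream.read() returns 200 again, not 63.
    }
}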
You need to make up your mind what are you really doing. Are you trying to write some bytes to a file, or are you trying to write encoded text? Because these are different concepts in Java; byte I/O is handled by subclasses of InputStream and OutputStream, while character I/O is handled by subclasses of Reader and Writer. If what you really want to write is bytes to a file (which I'm guessing from your mention of compression), use an OutputStream, not a Writer.
Then there's another confusion you have, which is evident from your mention of "ASCII characters from 0-255." There are no ASCII characters above 127. Please take 15 minutes to read this: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" (by Joel Spolsky). Pay particular attention to the parts where he explains the difference between a character set and an encoding, because it's critical for understanding Java I/O. (To review whether you understood, here's what you need to learn: Java Writers are classes that translate character output to byte output by applying a client-specified encoding to the text, and sending the bytes to an OutputStream.)
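If you did want to stay with a Writer, the character-set point above suggests one workable combination: ISO-8859-1 maps the characters 0 through 255 one-to-one onto the bytes 0 through 255, so an OutputStreamWriter configured with that encoding would round-trip your values. A sketch under that assumption (the file name is made up):
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriteLatin1 {
    public static void main(String[] args) throws IOException {
        try (Writer w = new OutputStreamWriter(new FileOutputStream("codes.bin"),
                                               StandardCharsets.ISO_8859_1)) {
            int a = 200;        // any value between 0 and 255
            w.write((char) a);  // ISO-8859-1 turns char 200 back into byte 200
        }
    }
}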
Java strings are based on 16-bit characters, and Java tries to perform conversions around that assumption when there is no explicit specification.
The following sample code writes and reads data directly as bytes, meaning 8-bit numbers which have an ASCII meaning associated with them.
import java.io.*;

public class RWBytes {
    public static void main(String[] args) throws IOException {
        String filename = "MiTestFile.txt";
        byte[] bArray1 = new byte[5];
        byte[] bArray2 = new byte[5];
        bArray1[0] = 65; // A
        bArray1[1] = 66; // B
        bArray1[2] = 67; // C
        bArray1[3] = 68; // D
        bArray1[4] = 69; // E

        FileOutputStream fos = new FileOutputStream(filename);
        fos.write(bArray1);
        fos.close();

        FileInputStream fis = new FileInputStream(filename);
        fis.read(bArray2);
        fis.close();

        for (int i = 0; i < bArray2.length; i++) {
            System.out.println("As the byte value: " + bArray2[i]); // the numeric byte value
            System.out.println("Converted to char for printing: " + String.valueOf((char) bArray2[i]));
        }
    }
}
A fixed subset of the 7-bit ASCII code is printable; A = 65, for example, while 10 corresponds to the "new line" character, which steps down one line on screen when it is "printed". Many other codes exist to manipulate a character-oriented screen; these are invisible and affect the screen representation, like tabs and spaces. There are also other control characters whose purpose was, for example, to ring a bell.
The upper range above 127 is defined as whatever the implementer wanted; only the lower half has standard meanings associated with it.
For general binary byte handling there are no such qualms: the bytes are just numbers that represent the data. Only when you try to print them to the screen do they become meaningful in all kinds of ways.