Scan Char from file in Java

The Java project I am working on requires me to write the Java equivalent of this C code:
void read_hex_char(char *filename, unsigned char *image)
{
    int i;
    FILE *ff;
    short tmp_short;
    ff = fopen(filename, "r");
    for (i = 0; i < 100; i++)
    {
        fscanf(ff, "%hx", &tmp_short);
        image[i] = tmp_short;
    }
    fclose(ff);
}
I have written this Java code.
void read_hex_char(String filename, char[] image) throws IOException
{
    Scanner s = new Scanner(new BufferedReader(new FileReader(filename)));
    for (int i = 0; i < 100; i++)
    {
        image[i] = s.nextShort();
    }
    s.close();
}
Is this code correct? If it's not, what corrections should be made?

I would go with a FileInputStream and read byte by byte (a short is just two bytes; char is more "complex" than a short, see http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html). Simple byte extraction code from my project:
public static byte[] readFile(File file) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    // try-with-resources closes the stream even if read() throws
    try (FileInputStream in = new FileInputStream(file)) {
        int ch;
        while ((ch = in.read()) != -1)
            bos.write(ch);
    }
    return bos.toByteArray();
}
For your example, the simplest check is to take a few sample files: run the C function on them, then the Java one, and compare the results. That should tell you whether the two behave the same.

Keep in mind that the Java char type is pretty smart (smarter than just a byte): it represents a Unicode character, and reader classes process the incoming byte stream into Unicode characters, possibly decoding or modifying bytes according to the current locale and charset settings. From your source I guess that what you actually want in the memory buffer is the raw bytes. Here is a really good explanation of how to do this:
Convert InputStream to byte array in Java

For parsing hex values, you can use Short.parseShort(String s, int radix) with radix 16. In Java, char is a bit different from short, so if you intend to perform bitmap operations, short is probably the better type to use. Note, however, that Java has no unsigned types, which can make the bitwise operations common in image processing tricky.
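As a rough sketch of the original method along these lines (assuming, as fscanf's "%hx" does, that the file contains whitespace-separated hex tokens; the short[] parameter replaces char[] per the advice above):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Scanner;

void read_hex_char(String filename, short[] image) throws IOException {
    try (Scanner s = new Scanner(new BufferedReader(new FileReader(filename)))) {
        for (int i = 0; i < 100 && s.hasNext(); i++) {
            // parse each whitespace-separated token as base 16,
            // mirroring fscanf's "%hx" conversion
            image[i] = Short.parseShort(s.next(), 16);
        }
    }
}

One caveat: Short.parseShort rejects tokens above 7FFF because Java shorts are signed; if such values can occur, parse with Integer.parseInt(token, 16) and cast the result.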

Related

What is the fastest way to load the MD5 of a file?

I want to load the MD5 of many different files. I am following this answer to do that, but the main problem is that the time taken to load the MD5 of the files (there may be hundreds) is a lot.
Is there any way to find the MD5 of a file without consuming much time?
Note: the size of a file may be large (may go up to 300MB).
This is the code I am using:
import java.io.*;
import java.security.MessageDigest;

public class MD5Checksum {

    public static byte[] createChecksum(String filename) throws Exception {
        InputStream fis = new FileInputStream(filename);
        byte[] buffer = new byte[1024];
        MessageDigest complete = MessageDigest.getInstance("MD5");
        int numRead;
        do {
            numRead = fis.read(buffer);
            if (numRead > 0) {
                complete.update(buffer, 0, numRead);
            }
        } while (numRead != -1);
        fis.close();
        return complete.digest();
    }

    // see this How-to for a faster way to convert
    // a byte array to a HEX string
    public static String getMD5Checksum(String filename) throws Exception {
        byte[] b = createChecksum(filename);
        String result = "";
        for (int i = 0; i < b.length; i++) {
            result += Integer.toString((b[i] & 0xff) + 0x100, 16).substring(1);
        }
        return result;
    }

    public static void main(String args[]) {
        try {
            System.out.println(getMD5Checksum("apache-tomcat-5.5.17.exe"));
            // output :
            // 0bb2827c5eacf570b6064e24e0e6653b
            // ref :
            // http://www.apache.org/dist/
            // tomcat/tomcat-5/v5.5.17/bin
            // /apache-tomcat-5.5.17.exe.MD5
            // 0bb2827c5eacf570b6064e24e0e6653b *apache-tomcat-5.5.17.exe
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
You cannot use hashes to determine any similarity of content.
For instance, generating the MD5 of hellostackoverflow1 and hellostackoverflow2 yields two hashes where none of the characters of the string representation match (7c35[...]85fa vs b283[...]3d19). That's because a hash is calculated from the binary data of the file, so two different formats of the same thing - e.g. a .txt and a .docx of the same text - have different hashes.
But as already noted, some speed might be gained by using native code via the NDK. Additionally, if you still want to compare files for exact matches, first compare their sizes in bytes, and only then use a hashing algorithm with enough speed and a low risk of collisions. As stated, CRC32 is fine.
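For completeness, a minimal CRC32 sketch using the JDK's java.util.zip.CRC32 (suitable as a fast equality pre-check, not as a cryptographic hash; the method name and buffer size here are my own choices):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

public static long crc32Of(String filename) throws IOException {
    CRC32 crc = new CRC32();
    try (InputStream in = new FileInputStream(filename)) {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            crc.update(buffer, 0, n); // running checksum, no full-file buffering
        }
    }
    return crc.getValue();
}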
Hash/CRC calculation takes some time, as the file has to be read completely.
The createChecksum code you presented is nearly optimal. The only part worth tweaking is the read buffer size (I would use 2048 bytes or larger). However, this may get you at most a 1-2% speed improvement.
If this is still too slow, the only option left is to implement the hashing in C/C++ and call it as a native method. Besides that, there is nothing you can do.
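For reference, a sketch of the buffer-size tweak applied to createChecksum (the 8192-byte buffer is an arbitrary choice; the try-with-resources also closes the stream if an exception occurs, which the original does not):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public static byte[] createChecksum(String filename)
        throws IOException, NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("MD5");
    try (InputStream in = new FileInputStream(filename)) {
        byte[] buffer = new byte[8192]; // larger than the original 1024
        int n;
        while ((n = in.read(buffer)) != -1) {
            md.update(buffer, 0, n);
        }
    }
    return md.digest();
}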

Java Reading large files into byte array chunk by chunk

So I've been trying to make a small program that reads a file into a byte array, then turns that byte array into hex, then into binary. It will then play with the binary values (I haven't thought of what to do when I get to this stage) and then save the result as a custom file.
I studied a lot of internet code and I can turn a file into a byte array and into hex, but the problem is I can't turn huge files into byte arrays (out of memory).
This is the code that is not a complete failure:
public void rundis(Path pp) {
    byte bb[] = null;
    try {
        bb = Files.readAllBytes(pp); //Files.toByteArray(pathhold);
        System.out.println("byte array made");
    } catch (Exception e) {
        e.printStackTrace();
    }
    // check for null before dereferencing, and use && so both conditions must hold
    if (bb != null && bb.length != 0) {
        System.out.println("byte array filled");
        //send to method to turn into hex
    } else {
        System.out.println("byte array NOT filled");
    }
}
I know how the process should go, but I don't know how to code it properly.
The process, if you are interested:
Input the file using File
Read the file chunk by chunk into a byte array, e.g. each byte array holds 600 bytes
Send that chunk to be turned into a hex value --> Integer.toHexString
Send that hex value chunk to be made into a binary value --> Integer.toBinaryString
Mess around with the binary value
Save to a custom file line by line
Problem: I don't know how to turn a huge file into a byte array chunk by chunk to be processed.
Any and all help will be appreciated, thank you for reading :)
To chunk your input, use a FileInputStream:
Path pp = FileSystems.getDefault().getPath("logs", "access.log");
final int BUFFER_SIZE = 1024 * 1024; // buffer size in bytes
FileInputStream fis = new FileInputStream(pp.toFile());
byte[] buffer = new byte[BUFFER_SIZE];
int read = 0;
while ((read = fis.read(buffer)) > 0) {
    // call your other methods here...
}
fis.close();
To stream a file, you need to step away from Files.readAllBytes(). It's a nice utility for small files, but as you noticed not so much for large files.
In pseudocode it would look something like this:
while there are more bytes available
read some bytes
process those bytes
(write the result back to a file, if needed)
In Java, you can use a FileInputStream to read a file byte by byte or chunk by chunk. Let's say we want to write back our processed bytes. First we open the files:
FileInputStream is = new FileInputStream(new File("input.txt"));
FileOutputStream os = new FileOutputStream(new File("output.txt"));
We need the FileOutputStream to write back our results - we don't want to just drop our precious processed data, right? Next we need a buffer which holds a chunk of bytes:
byte[] buf = new byte[4096];
How many bytes is up to you; I kinda like chunks of 4096 bytes. Then we need to actually read some bytes:
int read = is.read(buf);
this will read up to buf.length bytes and store them in buf. It returns the number of bytes actually read (or -1 at end of file). Then we process the bytes:
//Assuming the processing function looks like this:
//byte[] process(byte[] data, int bytes);
byte[] ret = process(buf, read);
process() in the above example is your processing method. It takes a byte array and the number of bytes it should process, and returns the result as a byte array.
Last, we write the result back to a file:
os.write(ret);
We have to execute this in a loop until there are no bytes left in the file, so let's write a loop for it:
int read = 0;
while ((read = is.read(buf)) > 0) {
    byte[] ret = process(buf, read);
    os.write(ret);
}
and finally close the streams:
is.close();
os.close();
And that's it. We processed the file in 4096-byte chunks and wrote the result back to a file. It's up to you what to do with the result; you could also send it over TCP, drop it if it's not needed, or even read from TCP instead of a file - the basic logic is the same.
This still needs some proper error handling to deal with missing files or wrong permissions, but that's up to you to implement.
An example implementation of the process method:
//returns the hex representation of the bytes, encoded as ASCII bytes
public static byte[] process(byte[] bytes, int length) {
    final char[] hexchars = "0123456789ABCDEF".toCharArray();
    char[] ret = new char[length * 2];
    for (int i = 0; i < length; ++i) {
        int b = bytes[i] & 0xFF;
        ret[i * 2] = hexchars[b >>> 4];
        ret[i * 2 + 1] = hexchars[b & 0x0F];
    }
    // the method must return byte[] to match the os.write(ret) call above
    return new String(ret).getBytes(java.nio.charset.StandardCharsets.US_ASCII);
}

java: read large binary file

I need to read a given large file that contains 500000001 binary numbers and afterwards translate them into ASCII.
My problem occurs while trying to store the binaries in a large array. I get this warning at the definition of the array ioBuf:
"The literal 16000000032 of type int is out of range."
I have no clue how to store these numbers so I can work with them! Does somebody have an idea?
Here is my code:
public byte[] read(){
    try{
        BufferedInputStream in = new BufferedInputStream(new FileInputStream("data.dat"));
        ByteArrayOutputStream bs = new ByteArrayOutputStream();
        BufferedOutputStream out = new BufferedOutputStream(bs);
        byte[] ioBuf = new byte[16000000032];
        int bytesRead;
        while ((bytesRead = in.read(ioBuf)) != -1){
            out.write(ioBuf, 0, bytesRead);
        }
        out.close();
        in.close();
        return bs.toByteArray();
    }
The maximum index of an array is Integer.MAX_VALUE, and 16000000032 is greater than Integer.MAX_VALUE:
Integer.MAX_VALUE = 2^31 - 1 = 2147483647
2147483647 < 16000000032
You could overcome this by checking whether the array is full, creating another one, and continuing to read into that.
But I'm not quite sure your approach is the best way to perform this; byte[Integer.MAX_VALUE] is huge ;)
Maybe you can split the input file into smaller chunks and process those.
EDIT: This is how you could read a single int from your file. You can size the buffer to the amount of data you want to read per call; you, by contrast, tried to read the whole file at once.
//Allocate buffer with 4 bytes = 32 bits = Integer.SIZE
byte[] ioBuf = new byte[4];
int bytesRead;
while ((bytesRead = in.read(ioBuf)) != -1){
    //if bytesRead == 4 you read 1 int
    //do your stuff
}
If you need to declare a large constant, append an 'L' to it, which tells the compiler that it is a long constant. However, as mentioned in another answer, you can't declare arrays that large.
I suspect the purpose of the exercise is to learn how to use the java.nio.Buffer family of classes.
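If so, here is a sketch that reads the file through a FileChannel and pulls one 32-bit value at a time out of a ByteBuffer (the little-endian byte order is an assumption; adjust it to the actual file format):

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;

try (FileChannel ch = new FileInputStream("data.dat").getChannel()) {
    ByteBuffer buf = ByteBuffer.allocate(64 * 1024).order(ByteOrder.LITTLE_ENDIAN);
    while (ch.read(buf) != -1) {
        buf.flip();
        while (buf.remaining() >= Integer.BYTES) {
            int value = buf.getInt(); // consume one 32-bit value
            // process value here
        }
        buf.compact(); // keep any trailing partial int for the next read
    }
} catch (IOException e) {
    e.printStackTrace();
}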
I made some progress by starting from scratch! But I still have a problem.
My idea is to read the first 32 bytes and convert them to an int number, then the next 32 bytes, etc. Unfortunately I just get the first one and don't know how to proceed.
I discovered the following method for converting these numbers to int:
public static int byteArrayToInt(byte[] b){
    final ByteBuffer bb = ByteBuffer.wrap(b);
    bb.order(ByteOrder.LITTLE_ENDIAN);
    return bb.getInt();
}
so now I have:
BufferedInputStream in = null;
byte[] buf = new byte[32];
try {
    in = new BufferedInputStream(new FileInputStream("ndata.dat"));
    in.read(buf);
    System.out.println(byteArrayToInt(buf));
    in.close();
} catch (IOException e) {
    System.out.println("error while reading ndata.dat file");
}
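A sketch of how that loop could continue, one 32-byte record per iteration, reusing the byteArrayToInt helper above (note that read(buf) may in principle return fewer bytes than requested; DataInputStream.readFully is the stricter alternative):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

byte[] buf = new byte[32];
try (BufferedInputStream in = new BufferedInputStream(new FileInputStream("ndata.dat"))) {
    // keep going until fewer than 32 bytes remain
    while (in.read(buf) == buf.length) {
        System.out.println(byteArrayToInt(buf));
    }
} catch (IOException e) {
    System.out.println("error while reading ndata.dat file");
}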

How do I open a file in Java and then call a hex editor to check the file type?

We are really stuck on this topic. This is the only code we have, which converts a file into hex, but we need to open a file and then have the Java code read the hex and extract certain bytes (e.g. the first 4 bytes for the file extension):
import java.io.*;

public class FileInHexadecimal {
    public static void main(String[] args) throws Exception {
        FileInputStream fis = new FileInputStream("H://Sample_Word.docx");
        int i = 0;
        // read() returns -1 at end of file, so no extra check is needed inside the loop
        while ((i = fis.read()) != -1) {
            System.out.printf("%02X\n ", i);
        }
        fis.close();
    }
}
Do not confuse internal and external representation - when converting to hex, you only create a different representation of the same bytes.
There is no need to convert to hex if you just want to read some bytes from the file - just read them. For example, to read the first four bytes, you can use something like:
byte[] buffer = new byte[4];
FileInputStream fis = new FileInputStream("H://Sample_Word.docx");
int read = fis.read(buffer);
if (read != buffer.length) {
    System.out.println("Short file!");
}
If you need to read data from an arbitrary position within the file, you might want to check RandomAccessFile instead of using a stream. RandomAccessFile allows you to set the position at which to start reading.
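A minimal sketch of that (the offset 512 is just an example value):

import java.io.IOException;
import java.io.RandomAccessFile;

byte[] buffer = new byte[4];
try (RandomAccessFile raf = new RandomAccessFile("H://Sample_Word.docx", "r")) {
    raf.seek(512);         // jump to an arbitrary byte offset
    raf.readFully(buffer); // throws EOFException if fewer than 4 bytes remain
}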

Search for a string in a large file and save its position in Java

I'm searching for a way to parse large files (about 5-10 GB) and find the positions (in bytes) of some recurring strings, as fast as possible.
I've tried to use RandomAccessFile by doing something like below:
RandomAccessFile lecteurFichier = new RandomAccessFile(<MyFile>, "r");
while (currentPointeurPosition < lecteurFichier.length()) {
    char currentFileChar = (char) lecteurFichier.readByte();
    // Test each char for a match with my string (by appending chars until I find my string)
    // and keep track of every found string's position
}
The problem is that this code is too slow (maybe because I read byte by byte?).
I also tried the solution below, which is perfect in terms of speed, but I can't get my strings' positions.
FileInputStream is = new FileInputStream(fichier.getFile());
FileChannel f = is.getChannel();
ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);
Charset charset = Charset.forName("ISO-8859-1");
CharsetDecoder decoder = charset.newDecoder();
long len = 0;
while ((len = f.read(buf)) != -1) {
    buf.flip();
    String data = "";
    try {
        int old_position = buf.position();
        data = decoder.decode(buf).toString();
        // reset buffer's position to its original so it is not altered:
        buf.position(old_position);
    } catch (Exception e) {
        e.printStackTrace();
    }
    buf.clear();
}
f.close();
Does anyone have a better solution to propose?
Thank you in advance (and sorry for my spelling, I'm French).
Since your input data is encoded in an 8-bit encoding*, you can speed up the search by encoding the search string rather than decoding the file:
byte[] encoded = searchString.getBytes("ISO-8859-1");
BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file));
int b;
long pos = -1;
while ((b = bis.read()) != -1) {
    pos++;
    if (encoded[0] == b) {
        // see if rest of string matches
    }
}
A BufferedInputStream should be pretty fast. Using ByteBuffer might be faster, but it makes the search logic more complicated because of the possibility of a string match that spans a buffer boundary.
Then there are various clever ways to optimize string searches that could be adapted to this situation, where you are searching a stream of bytes/characters rather than an array of bytes/characters. The Wikipedia page on String Searching is a good place to start.
Note that since we are reading and matching byte-wise, the position is just the count of bytes read (or skipped), so there is no need for a random access file.
* In fact this trick will work with many multibyte encodings too.
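As a sketch of how the "see if rest of string matches" part could be filled in without re-reading input, here is a streaming Knuth-Morris-Pratt variant (findPositions and its parameters are my own names, not from the answer above); because it matches one byte at a time, it also sidesteps the buffer-boundary problem mentioned earlier:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

static List<Long> findPositions(String file, byte[] needle) throws IOException {
    // failure[i] = length of the longest proper prefix of needle[0..i]
    // that is also a suffix of it; lets us fall back without rewinding the stream
    int[] failure = new int[needle.length];
    for (int i = 1, k = 0; i < needle.length; i++) {
        while (k > 0 && needle[k] != needle[i]) k = failure[k - 1];
        if (needle[k] == needle[i]) k++;
        failure[i] = k;
    }
    List<Long> positions = new ArrayList<>();
    try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(file))) {
        int b;
        int matched = 0;
        long pos = -1;
        while ((b = in.read()) != -1) {
            pos++;
            while (matched > 0 && needle[matched] != (byte) b)
                matched = failure[matched - 1];
            if (needle[matched] == (byte) b) matched++;
            if (matched == needle.length) {
                positions.add(pos - needle.length + 1); // start offset of the match
                matched = failure[matched - 1];         // allow overlapping matches
            }
        }
    }
    return positions;
}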
Searching for a 'needle' in a 'haystack' is a well-studied problem. Here's a related link on StackOverflow itself. I am sure Java implementations of the algorithms discussed are available too. Why not try some of them to see if they fit the job?
