Cannot read binary file created in Java using Python

I have created a binary file using Java and memory mapping. It contains a list of integers from 1 to 10 million:
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
public class MemoryMapWriter {
    public static void main(String[] args) throws FileNotFoundException, IOException, InterruptedException {
        File f = new File("file.bin");
        f.delete();
        FileChannel fc = new RandomAccessFile(f, "rw").getChannel();
        long bufferSize = 64 * 1000;
        MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_WRITE, 0, bufferSize);
        int start = 0;
        long counter = 1;
        long HUNDREDK = 100000;
        long startT = System.currentTimeMillis();
        long noOfMessage = HUNDREDK * 10 * 10;
        for (;;) {
            if (!mem.hasRemaining()) {
                // Current window is full: map the next 64,000-byte window.
                start += mem.position();
                mem = fc.map(FileChannel.MapMode.READ_WRITE, start, bufferSize);
            }
            mem.putLong(counter);
            counter++;
            if (counter > noOfMessage)
                break;
        }
        long endT = System.currentTimeMillis();
        long tot = endT - startT;
        System.out.println(String.format("No Of Message %s , Time(ms) %s ", noOfMessage, tot));
    }
}
Then I tried to read it using Python and memory mapping:
import pandas as pd
import numpy as np
import os
import shutil
import re
import mmap
a=np.memmap("file.bin",mode='r',dtype='int64')
print(a[0:10])
But when I print the first ten elements, this is the result:
[ 72057594037927936, 144115188075855872, 216172782113783808,
288230376151711744, 360287970189639680, 432345564227567616,
504403158265495552, 576460752303423488, 648518346341351424,
720575940379279360]
What is wrong with my code?

You have a byte-order problem. 72057594037927936 in hexadecimal is 0x0100000000000000, 144115188075855872 is 0x0200000000000000, etc.
Java is writing longs to the buffer in big-endian order (most significant byte first) and Python is interpreting the resulting byte stream in little-endian order (least significant byte first).
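One simple fix is to change the Java buffer's ByteOrder attribute:
mem.order(ByteOrder.LITTLE_ENDIAN);
Every buffer returned by fc.map starts out big-endian, so this call has to be repeated after each remapping inside the loop, not only on the first buffer.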
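Or tell Python to use big-endian order. NumPy lets you specify the byte order in the dtype string, so the memmap call can read the file exactly as Java wrote it:
a = np.memmap("file.bin", mode='r', dtype='>i8')
Here '>' means big-endian and 'i8' means an 8-byte signed integer. If you read the file without NumPy, struct.unpack_from accepts the same '>' prefix in its format string.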
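The mismatch described above can be reproduced in Java alone. The following is a minimal sketch, not the asker's code: the class name ByteOrderDemo and the helper writeBigReadLittle are made up for illustration. It writes a long in the default big-endian order and reads the same eight bytes back as little-endian, which is effectively what the NumPy call in the question did.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteOrderDemo {
    // Hypothetical helper: writes a long in Java's default big-endian order,
    // then reads the same eight bytes back as little-endian.
    static long writeBigReadLittle(long value) {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES); // big-endian by default
        buf.putLong(0, value);                            // absolute write at index 0
        return buf.order(ByteOrder.LITTLE_ENDIAN).getLong(0);
    }

    public static void main(String[] args) {
        for (long i = 1; i <= 3; i++) {
            long misread = writeBigReadLittle(i);
            System.out.printf("%d -> %d (0x%016x)%n", i, misread, misread);
        }
    }
}
```

For an input of 1 this yields 72057594037927936, the first value in the question's output. The result is the same as Long.reverseBytes applied to the original value.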

Related

How do I write a character at codepoint 80 to a file in Windows-1252?

I am trying to write bytes to a file in the windows-1252 charset. The example below, writing the raw bytes of a float to a file, is similar to what I'm doing in my actual program.
In the example given, I am writing the raw hex of 1.0f to test.txt. As the raw hex of 1.0f is 3f 80 00 00 I expect to get ?€(NUL)(NUL), as from what I can see in the Windows 1252 Wikipedia article, 0x3f should correspond to '?', 0x80 should correspond to '€', and 0x00 is 'NUL'. Everything goes fine until I actually try to write to the file; at that point, I get a java.nio.charset.UnmappableCharacterException on the console, and after the program stops on that exception the file only has a single '?' in it. The full console output is below the code down below.
It looks like Java considers the codepoint 0x80 unmappable in the windows-1252 codepage. However, this doesn't seem right – all the codepoints should map to actual characters in that codepage. The problem is definitely with the codepoint 0x80, as if I try with 0.5f (3f 00 00 00) it is happy to write ?(NUL)(NUL)(NUL) into the file, and does not throw the exception. Experimenting with other codepages doesn't seem to work either; looking at key encodings supported by the Java language here, only the UTF series will not give me an exception, but due to their encoding they don't give me codepoint 0x80 in the actual file.
I'm going to try just using bytes instead so I don't have to worry about string encoding, but is anyone able to tell me why my code below gives me the exception it does?
Code:
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
public class CharsetTest {
public static void main(String[] args) {
float max = 1.0f;
System.out.println("Checking " + max);
String stringFloatFormatHex = String.format("%08x", Float.floatToRawIntBits(max));
System.out.println(stringFloatFormatHex);
byte[] bytesForFile = javax.xml.bind.DatatypeConverter.parseHexBinary(stringFloatFormatHex);
String stringForFile = new String(bytesForFile);
System.out.println(stringForFile);
String charset = "windows-1252";
try {
Writer output = Files.newBufferedWriter(Paths.get("test.txt"), Charset.forName(charset));
output.write(stringForFile);
output.close();
} catch (IOException e) {
System.err.println(e.getMessage());
e.printStackTrace();
}
}
}
Console output:
Checking 1.0
3f800000
?�
Input length = 1
java.nio.charset.UnmappableCharacterException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:282)
at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:285)
at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)
at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)
at java.io.BufferedWriter.flushBuffer(BufferedWriter.java:129)
at java.io.BufferedWriter.close(BufferedWriter.java:265)
at CharsetTest.main(CharsetTest.java:21)
Edit: The problem is the instruction String stringForFile = new String(bytesForFile);, below the DatatypeConverter call. As I was constructing a string without providing a charset, it used my default charset, which is UTF-8. A lone byte 0x80 is not valid UTF-8, so it is decoded to the replacement character U+FFFD, which windows-1252 cannot encode; that is why the exception only appears when writing to the file. This doesn't happen in the code below because my refactor (keeping in mind Johannes Kuhn's suggestion in the comments) doesn't use the String(byte[]) constructor without specifying a charset.
Johannes Kuhn's suggestion about the String(byte[]) constructor gave me some good clues. I've ended up with the following code, which looks like it works fine: even printing the € symbol to the console as well as writing it to test.txt. That suggests that codepoint 80 can be translated using the windows-1252 codepage.
If I were to guess at this point why this code works but the other didn't, I'd still be confused, but I would guess it was something around the conversion in javax.xml.bind.DatatypeConverter.parseHexBinary(stringFloatFormatHex);. That looks to be the main difference, although I'm not sure why it would matter.
Anyway, the code below works (and I don't even have to turn it into a string; I can write the bytes to a file with FileOutputStream fos = new FileOutputStream("test.txt"); fos.write(bytes); fos.close();), so I'm happy with this one.
Code:
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.Writer;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
public class BytesCharsetTest {
public static void main(String[] args) {
float max = 1.0f;
System.out.println("Checking " + max);
int convInt = Float.floatToRawIntBits(max);
byte[] bytes = ByteBuffer.allocate(4).putInt(convInt).array();
String charset = "windows-1252";
try {
String stringForFile = new String(bytes, Charset.forName(charset));
System.out.println(stringForFile);
Writer output = Files.newBufferedWriter(Paths.get("test.txt"), Charset.forName(charset));
output.write(stringForFile);
output.close();
} catch (IOException e) {
System.err.println(e.getMessage());
e.printStackTrace();
}
}
}
Console output:
Checking 1.0
?€
Process finished with exit code 0
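The difference between the two programs can be checked in isolation. This is a small sketch under the assumption that the failing run used UTF-8 as its default charset, as the edit above states; the class name DecodeDemo and its decode helper are illustrative only. It decodes the same four bytes with both charsets and asks a windows-1252 encoder whether it can write the resulting second character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Raw bytes of 1.0f, as in the question: 3f 80 00 00.
    static final byte[] FLOAT_ONE = {0x3f, (byte) 0x80, 0x00, 0x00};

    // Hypothetical helper: decodes the bytes with an explicit charset.
    static String decode(Charset cs) {
        return new String(FLOAT_ONE, cs);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        char viaUtf8 = decode(StandardCharsets.UTF_8).charAt(1);
        char viaCp1252 = decode(cp1252).charAt(1);

        // Byte 0x80 alone is malformed UTF-8, so it becomes the replacement character.
        System.out.printf("UTF-8 decode:        U+%04X%n", (int) viaUtf8);
        // windows-1252 maps byte 0x80 to the euro sign.
        System.out.printf("windows-1252 decode: U+%04X%n", (int) viaCp1252);

        // Only the second one can be written back out as windows-1252.
        System.out.println(cp1252.newEncoder().canEncode(viaUtf8));
        System.out.println(cp1252.newEncoder().canEncode(viaCp1252));
    }
}
```

The round trip only preserves the bytes when the same charset is used for decoding and encoding, which is why writing the byte array directly avoids the issue altogether.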

ByteBuffer Missing Data When decoded As string

I'm reading and writing to a ByteBuffer
import org.assertj.core.api.Assertions;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
public class Solution{
public static void main(String[] args) throws Exception{
final CharsetEncoder messageEncoder = Charset.forName("ISO-8859-1").newEncoder();
String message = "TRANSACTION IGNORED";
String carrierName= "CARR00AB";
int messageLength = message.length()+carrierName.length()+8;
System.out.println(" --------Fill data---------");
ByteBuffer messageBuffer = ByteBuffer.allocate(4096);
messageBuffer.order(ByteOrder.BIG_ENDIAN);
messageBuffer.putInt(messageLength);
messageBuffer.put(messageEncoder.encode(CharBuffer.wrap(carrierName)));
messageBuffer.put(messageEncoder.encode(CharBuffer.wrap(message)));
messageBuffer.put((byte) 0x2b);
messageBuffer.flip();
System.out.println("------------Extract Data Approach 1--------");
CharsetDecoder messageDecoder = Charset.forName("ISO-8859-1").newDecoder();
int lengthField = messageBuffer.getInt();
System.out.println("lengthField="+lengthField);
int responseLength = lengthField - 12;
System.out.println("responseLength="+responseLength);
String messageDecoded= messageDecoder.decode(messageBuffer).toString();
System.out.println("messageDecoded="+messageDecoded);
String decodedCarrier = messageDecoded.substring(0, carrierName.length());
System.out.println("decodedCarrier="+ decodedCarrier);
String decodedBody = messageDecoded.substring(carrierName.length(), messageDecoded.length() - 1);
System.out.println("decodedBody="+decodedBody);
Assertions.assertThat(messageLength).isEqualTo(lengthField);
Assertions.assertThat(decodedBody).isEqualTo(message);
Assertions.assertThat(decodedBody).isEqualTo(message);
ByteBuffer messageBuffer2 = ByteBuffer.allocate(4096);
messageBuffer2.order(ByteOrder.BIG_ENDIAN);
messageBuffer2.putInt(messageLength);
messageBuffer2.put(messageEncoder.encode(CharBuffer.wrap(carrierName)));
messageBuffer2.put(messageEncoder.encode(CharBuffer.wrap(message)));
messageBuffer2.put((byte) 0x2b);
messageBuffer2.flip();
System.out.println("---------Extract Data Approach 2--------");
byte [] data = new byte[messageBuffer2.limit()];
messageBuffer2.get(data);
String dataString =new String(data, "ISO-8859-1");
System.out.println(dataString);
}
}
It works fine, but then I decided to refactor it. Please see approach 2 in the code above:
byte [] data = new byte[messageBuffer.limit()];
messageBuffer.get(data);
String dataString =new String(data, "ISO-8859-1");
System.out.println(dataString);
Output= #CARR00ABTRANSACTION IGNORED+
Could you help me understand why the integer goes missing in the second approach while decoding?
Is there any way to extract the integer in the second approach?
You are reading an int from the buffer, which consumes 4 bytes and advances the buffer's position, and then trying to get the whole data from that advanced position.
What I have done is call messageBuffer2.clear(); after reading the int to resolve this issue. Here is the full code:
System.out.println(messageBuffer2.getInt());
byte[] data = new byte[messageBuffer2.limit()];
messageBuffer2.clear();
messageBuffer2.get(data);
String dataString = new String(data, StandardCharsets.ISO_8859_1);
System.out.println(dataString);
Output is:
35
#CARR0033TRANSACTION IGNORED+
Edit: Calling clear() does not erase the data. It resets the buffer's position to 0 (and its limit to the capacity), so the next get starts reading from the beginning again, and that is how it fixes the problem.
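The position bookkeeping behind this can be seen in a short sketch. It is an illustration rather than the answerer's code: the class BufferPositionDemo and its frame helper are hypothetical, and it uses rewind() instead of clear(), which also moves the position back to 0 but leaves the limit where flip() put it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BufferPositionDemo {
    // Hypothetical helper: builds a frame with a 4-byte length field followed by text.
    static ByteBuffer frame(String text) {
        byte[] body = text.getBytes(StandardCharsets.ISO_8859_1);
        ByteBuffer buf = ByteBuffer.allocate(4 + body.length);
        buf.putInt(4 + body.length);
        buf.put(body);
        buf.flip(); // limit = bytes written, position = 0
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer buf = frame("CARR00ABTRANSACTION IGNORED+");

        int length = buf.getInt(); // relative read: position moves from 0 to 4
        System.out.println("length=" + length + " position=" + buf.position());

        byte[] body = new byte[buf.remaining()]; // only the bytes after the int
        buf.get(body);
        System.out.println(new String(body, StandardCharsets.ISO_8859_1));

        buf.rewind(); // position back to 0, limit unchanged
        // An absolute read does not move the position at all.
        System.out.println("length=" + buf.getInt(0) + " position=" + buf.position());
    }
}
```

Note that in the question's approach 2 output the integer is not actually lost: 35 is 0x23, which ISO-8859-1 decodes as '#', preceded by three NUL characters from the high bytes. Reading the length with getInt() and decoding only the remaining bytes keeps the two fields separate.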

Serializing BitSet or Boolean array from Java to Base64 to Python

I need to get big Boolean arrays or BitSets from Java into Python via a text file. Ideally I want to go via a Base64 representation to stay compact, but still be able to embed the value in a CSV file. (So the boolean array will be one column in a CSV file.)
However, I am having trouble getting the byte alignment right. Where/how should I specify the correct byte order?
This is one example, working in the sense that it executes but not working in that my bits aren't where I want them.
Java:
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.util.Base64;
import java.util.Base64.Encoder;
import java.util.BitSet;
public class basictest {
public static void main(String[] args) throws Exception {
// TODO Auto-generated method stub
Encoder b64 = Base64.getEncoder();
String name = "name";
BitSet b = new BitSet();
b.set(444);
b.set(777);
b.set(555);
byte[] bBytes = b.toByteArray();
String fp_str = b64.encodeToString(bBytes);
BufferedWriter w = new BufferedWriter(new FileWriter("out.tsv"));
w.write(name + "\t" + fp_str + "\n");
w.close();
}
}
Python:
import numpy as np
import base64
from bitstring import BitArray, BitStream ,ConstBitStream
filename = "out.tsv"
with open(filename) as file:
    data = file.readline().split('\t')
b_b64 = data[1]
b_bytes = base64.b64decode(b_b64)
b_bits = BitArray(bytes=b_bytes)
b_bits[444] # False
b_bits[555] # False
b_bits[777] # False
# but
b_bits[556] # True
# it's not shifted:
b_bits[445] # False
I am now reversing the bits in every byte using https://stackoverflow.com/a/5333563/1259675:
numbits = 8
r_bytes = [
sum(1<<(numbits-1-i) for i in range(numbits) if b>>i&1)
for b in b_bytes]
b_bits = BitArray(bytes=bytes(r_bytes))
This works, but is there a method that doesn't involve myself fiddling with the bits?
If the highest bit you need to set is "sufficiently small" and the data you want to encode doesn't vary in size too much, then one approach can be to set sentinel bits at the lowest and highest positions in Java and ignore them in Python. It could (should!) then work without byte reversal or further transformation:
// assuming a 1024 bit word
public static final int LEFT_SIGN = 0;
public static final int RIGHT_SIGN = 1025; //choose a size, that fits your needs [0 .. Integer.MAX_VALUE - 1 (<-theoretically)]
public static void main(String[] args) throws Exception {
...
b.set(LEFT_SIGN);
b.set(444 + 1);
b.set(777 + 1);
b.set(555 + 1);
b.set(RIGHT_SIGN);
...
and then in python:
# as before ..
b_bits[0] # Ignore!
b_bits[445] # True
b_bits[556] # True
b_bits[778] # True
b_bits[1025] # Ignore!;)
The convention you then rely on (your encoding) is the fixed maximum "word length", with all its benefits and drawbacks.
We can use the bitarray package from Python for this particular use case. Creating the bitarray with endian='little' matches the bit order that BitSet.toByteArray() uses within each byte.
from bitarray import bitarray
import base64
with open(filename) as file:
    data = file.readline().strip().split('\t')
b_b64 = data[1]
b_bytes = base64.b64decode(b_b64)
bs = bitarray(endian='little')
bs.frombytes(b_bytes)
print(bs)
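The bit order that causes the mismatch can also be handled on the Java side. The sketch below is an assumption-labelled illustration, not part of either answer: the class BitSetLayoutDemo and the helper msbFirst are hypothetical. It shows where BitSet.toByteArray() places a bit (bit n lives in byte n/8 under mask 1 << (n % 8)) and how to reverse the bits within each byte before Base64 encoding, so that a reader which indexes from the most significant bit sees the original indices.

```java
import java.util.BitSet;

public class BitSetLayoutDemo {
    // Hypothetical helper: reverses the bit order inside every byte, turning
    // BitSet's least-significant-bit-first layout into a most-significant-bit-first one.
    static byte[] msbFirst(BitSet bits) {
        byte[] bytes = bits.toByteArray();
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) (Integer.reverse(bytes[i] & 0xff) >>> 24);
        }
        return bytes;
    }

    public static void main(String[] args) {
        BitSet b = new BitSet();
        b.set(444);

        byte[] raw = b.toByteArray();
        // 444 = 55 * 8 + 4, so the bit is stored in byte 55 under mask 1 << 4.
        System.out.println(Integer.toBinaryString(raw[55] & 0xff));

        // After reversal the same bit sits 4 places from the most significant end.
        System.out.println(Integer.toBinaryString(msbFirst(b)[55] & 0xff));
    }
}
```

This is the same per-byte reversal the question performs in Python, only moved to the writer. Whether to fix the order when writing or when reading is a design choice; doing it in one place and documenting it avoids applying it twice.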

How do I create a 1 byte file in JAVA to store binary values (eg:10100011)?

1 byte = 8 bits, how can I create and store 11001100 in those 8 bits
and the file should be 1 byte in size?
What should be the file format?
All this in Java.
To write bytes to a file, you can use FileOutputStream.
See: Lesson: Basic I/O in Oracle's Java Tutorials and see the API documentation.
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
public class Example {
    public static void main(String[] args) throws IOException {
        try (OutputStream out = new FileOutputStream("example.bin")) {
            // write(int) writes only the low-order 8 bits, so the file is exactly 1 byte long.
            out.write(0b11001100);
        }
    }
}
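To confirm that the result really is a single byte holding the intended bits, the file can be read back. This is a minimal sketch using java.nio.file.Files rather than the stream API above; the class OneByteRoundTrip, the helper toBits, and the temporary file name are all made up for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class OneByteRoundTrip {
    // Hypothetical helper: formats a byte as eight binary digits, keeping leading zeros.
    static String toBits(byte b) {
        String s = Integer.toBinaryString(b & 0xff);
        return "00000000".substring(s.length()) + s;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("example", ".bin"); // temporary file for the demo
        Files.write(file, new byte[] {(byte) 0b11001100});

        System.out.println(Files.size(file));                    // file length in bytes
        System.out.println(toBits(Files.readAllBytes(file)[0])); // the stored bit pattern

        Files.delete(file);
    }
}
```

The file has no format beyond the raw byte itself; the extension is only a naming convention.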

writing a file in Java without O_SYNC semantics

In C, when I call open() to open a file descriptor, I have to explicitly pass the O_SYNC flag to ensure that writes to this file will be persisted to disk by the time write() returns. If I want to, I can not supply O_SYNC to open(), and then my writes will return much more quickly because they only have to make it into a filesystem cache before returning. If I want to, later on I can force all outstanding writes to this file to be written to disk by calling fsync(), which blocks until that operation has finished. (More details are available on all this in the Linux man page.)
Is there any way to do this in Java? The most similar thing I could find was using a BufferedOutputStream and calling .flush() on it, but if I'm doing writes to randomized file offsets I believe this would mean the internal buffer for the output stream could end up consuming a lot of memory.
Use the NIO FileChannel#force method:
RandomAccessFile aFile = new RandomAccessFile("file.txt", "rw");
FileChannel channel = aFile.getChannel();
// ... write to the channel ...
// flushes all unwritten data from the channel to the disk
channel.force(true);
An important detail:
If the file does not reside on a local device then no such guarantee is made.
Based on Sergey Tachenov's comment, I found that you can use FileChannel for this. Here's some sample code that I think does the trick:
import java.nio.*;
import java.nio.channels.*;
import java.nio.file.*;
import java.nio.file.attribute.*;
import java.io.*;
import java.util.*;
import java.util.concurrent.*;
import static java.nio.file.StandardOpenOption.*;
public class Main {
    public static void main(String[] args) throws Exception {
        // Open the file as a FileChannel. With only WRITE, the file must already exist.
        Set<OpenOption> options = new HashSet<>();
        options.add(WRITE);
        // options.add(SYNC); <------- This would force O_SYNC semantics.
        try (FileChannel channel = FileChannel.open(Paths.get("./test.txt"), options)) {
            // Generate a bit of data to write.
            ByteBuffer buffer = ByteBuffer.allocate(4096);
            for (int i = 0; i < 10; i++) {
                buffer.put(i, (byte) i); // absolute put: the position stays at 0
            }
            // Choose a random offset between 0 and 1023 and write to it.
            long offset = ThreadLocalRandom.current().nextLong(0, 1024);
            channel.write(buffer, offset); // writes the buffer's remaining 4096 bytes
        }
    }
}
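Putting the two answers together, the pattern from the question (fast unsynchronized writes, then one explicit flush) might look like the sketch below. It is an illustration under assumptions: the class DeferredSyncDemo, the helper writeThenForce, and the temporary file are hypothetical, and force(true) is treated as roughly the counterpart of fsync(), subject to the local-device caveat quoted above.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DeferredSyncDemo {
    // Hypothetical helper: writes one byte at each offset without SYNC,
    // flushes once with force(), and returns the resulting file size.
    static long writeThenForce(Path file, long[] offsets) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            for (long offset : offsets) {
                ByteBuffer one = ByteBuffer.wrap(new byte[] {1});
                ch.write(one, offset); // positional write; may return before data reaches the disk
            }
            ch.force(true);            // blocks until content and metadata are flushed
            return ch.size();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("deferred-sync", ".bin"); // temporary file for the demo
        System.out.println(writeThenForce(file, new long[] {900, 10, 512}));
        Files.delete(file);
    }
}
```

Because each write goes straight to the channel at an explicit position, no user-space buffer grows with the number of offsets, which addresses the memory concern raised about BufferedOutputStream.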
