I need to get big Boolean arrays or BitSets from Java into Python via a text file. Ideally I want to go via a Base64 representation to stay compact, but still be able to embed the value in a CSV file. (So the boolean array will be one column in a CSV file.)
However, I am having trouble getting the byte alignment right. Where/how should I specify the correct byte order?
Here is one example; it works in the sense that it executes, but my bits don't end up where I want them.
Java:
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.util.Base64;
import java.util.Base64.Encoder;
import java.util.BitSet;

public class basictest {
    public static void main(String[] args) throws Exception {
        Encoder b64 = Base64.getEncoder();
        String name = "name";
        BitSet b = new BitSet();
        b.set(444);
        b.set(777);
        b.set(555);
        byte[] bBytes = b.toByteArray();
        String fp_str = b64.encodeToString(bBytes);
        BufferedWriter w = new BufferedWriter(new FileWriter("out.tsv"));
        w.write(name + "\t" + fp_str + "\n");
        w.close();
    }
}
Python:
import base64
from bitstring import BitArray

filename = "out.tsv"
with open(filename) as file:
    data = file.readline().split('\t')
    b_b64 = data[1]

b_bytes = base64.b64decode(b_b64)
b_bits = BitArray(bytes=b_bytes)

b_bits[444]  # False
b_bits[555]  # False
b_bits[777]  # False
# but
b_bits[556]  # True
# it's not shifted:
b_bits[445]  # False
I am now reversing the bits in every byte using https://stackoverflow.com/a/5333563/1259675:
numbits = 8
r_bytes = [
    sum(1 << (numbits - 1 - i) for i in range(numbits) if b >> i & 1)
    for b in b_bytes
]
b_bits = BitArray(bytes=bytes(r_bytes))
This works, but is there a method that doesn't involve me fiddling with the bits myself?
If:

the maximum bit to set is "sufficiently small",
and the size of the data you want to encode doesn't vary too much,

then one approach can be:

set a maximum (and minimum) significant bit in Java,
and ignore them in Python.

Then it could (should!) work without byte reversal or further transformation:
// assuming a 1024-bit word
public static final int LEFT_SIGN = 0;
public static final int RIGHT_SIGN = 1025; // choose a size that fits your needs [0 .. Integer.MAX_VALUE - 1 (theoretically)]

public static void main(String[] args) throws Exception {
    ...
    b.set(LEFT_SIGN);
    b.set(444 + 1);
    b.set(777 + 1);
    b.set(555 + 1);
    b.set(RIGHT_SIGN);
    ...
...
and then in python:
# as before ...
b_bits[0]     # Ignore!
b_bits[445]   # True
b_bits[556]   # True
b_bits[778]   # True
b_bits[1025]  # Ignore! ;)
The price of this convenience is that the encoding always occupies the (maximum) "word length", with all its benefits and drawbacks.
We can use Python's bitarray package for this particular use case.
from bitarray import bitarray
import base64

with open(filename) as file:
    data = file.readline().strip().split('\t')
    b_b64 = data[1]

b_bytes = base64.b64decode(b_b64)
bs = bitarray(endian='little')
bs.frombytes(b_bytes)
print(bs)
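For context (this explanation is mine, not part of the original answer): BitSet.toByteArray() is documented to return a little-endian representation, in which logical bit n lands in byte n/8 at position n%8 counting from the least significant bit. That is exactly the convention bitarray uses with endian='little', which is why no per-byte reversal is needed and bs[444], bs[555] and bs[777] all come back True. A small Java sketch of the layout:

import java.util.BitSet;

public class BitOrderDemo {
    public static void main(String[] args) {
        BitSet b = new BitSet();
        b.set(0); // logical bit 0
        // toByteArray() is little-endian: bit 0 is the LEAST significant
        // bit of byte 0, so the array is [0x01] and this prints 1.
        System.out.println(b.toByteArray()[0]);
        // An MSB-first reader (like bitstring's BitArray) sees that bit
        // at index 7 of the first byte, not index 0; hence the symptoms
        // described in the question.
    }
}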
Related
I am trying to write bytes to a file in the windows-1252 charset. The example below, writing the raw bytes of a float to a file, is similar to what I'm doing in my actual program.
In the example given, I am writing the raw hex of 1.0f to test.txt. As the raw hex of 1.0f is 3f 80 00 00 I expect to get ?€(NUL)(NUL), as from what I can see in the Windows 1252 Wikipedia article, 0x3f should correspond to '?', 0x80 should correspond to '€', and 0x00 is 'NUL'. Everything goes fine until I actually try to write to the file; at that point, I get a java.nio.charset.UnmappableCharacterException on the console, and after the program stops on that exception the file only has a single '?' in it. The full console output is below the code down below.
It looks like Java considers the codepoint 0x80 unmappable in the windows-1252 codepage. However, this doesn't seem right – all the codepoints should map to actual characters in that codepage. The problem is definitely with the codepoint 0x80, as if I try with 0.5f (3f 00 00 00) it is happy to write ?(NUL)(NUL)(NUL) into the file, and does not throw the exception. Experimenting with other codepages doesn't seem to work either; looking at key encodings supported by the Java language here, only the UTF series will not give me an exception, but due to their encoding they don't give me codepoint 0x80 in the actual file.
I'm going to try just using bytes instead so I don't have to worry about string encoding, but is anyone able to tell me why my code below gives me the exception it does?
Code:
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CharsetTest {
    public static void main(String[] args) {
        float max = 1.0f;
        System.out.println("Checking " + max);
        String stringFloatFormatHex = String.format("%08x", Float.floatToRawIntBits(max));
        System.out.println(stringFloatFormatHex);
        byte[] bytesForFile = javax.xml.bind.DatatypeConverter.parseHexBinary(stringFloatFormatHex);
        String stringForFile = new String(bytesForFile);
        System.out.println(stringForFile);
        String charset = "windows-1252";
        try {
            Writer output = Files.newBufferedWriter(Paths.get("test.txt"), Charset.forName(charset));
            output.write(stringForFile);
            output.close();
        } catch (IOException e) {
            System.err.println(e.getMessage());
            e.printStackTrace();
        }
    }
}
Console output:
Checking 1.0
3f800000
?�
Input length = 1
java.nio.charset.UnmappableCharacterException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:282)
at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:285)
at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)
at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)
at java.io.BufferedWriter.flushBuffer(BufferedWriter.java:129)
at java.io.BufferedWriter.close(BufferedWriter.java:265)
at CharsetTest.main(CharsetTest.java:21)
Edit: The problem is in the instruction String stringForFile = new String(bytesForFile);, just below the DatatypeConverter call. Because I constructed the string without providing a charset, it used my default charset, UTF-8, in which a lone 0x80 byte is invalid and decodes to the replacement character U+FFFD. The exception is then only thrown when writing to the file, because that is when the replacement character fails to encode as windows-1252. This doesn't happen in the code below, because my refactor (keeping in mind Johannes Kuhn's suggestion in the comments) doesn't use the String(byte[]) constructor without specifying a charset.
Johannes Kuhn's suggestion about the String(byte[]) constructor gave me some good clues. I've ended up with the following code, which looks like it works fine: even printing the € symbol to the console as well as writing it to test.txt. That suggests that codepoint 80 can be translated using the windows-1252 codepage.
If I were to guess at this point why this code works but the other didn't, I'd still be confused, but I would guess it was something around the conversion in javax.xml.bind.DatatypeConverter.parseHexBinary(stringFloatFormatHex);. That looks to be the main difference, although I'm not sure why it would matter.
Anyway, the code below works (and I don't even have to turn it into a string; I can write the bytes to a file with FileOutputStream fos = new FileOutputStream("test.txt"); fos.write(bytes); fos.close();), so I'm happy with this one.
Code:
import java.io.IOException;
import java.io.Writer;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BytesCharsetTest {
    public static void main(String[] args) {
        float max = 1.0f;
        System.out.println("Checking " + max);
        int convInt = Float.floatToRawIntBits(max);
        byte[] bytes = ByteBuffer.allocate(4).putInt(convInt).array();
        String charset = "windows-1252";
        try {
            String stringForFile = new String(bytes, Charset.forName(charset));
            System.out.println(stringForFile);
            Writer output = Files.newBufferedWriter(Paths.get("test.txt"), Charset.forName(charset));
            output.write(stringForFile);
            output.close();
        } catch (IOException e) {
            System.err.println(e.getMessage());
            e.printStackTrace();
        }
    }
}
Console output:
Checking 1.0
?€
Process finished with exit code 0
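To make the failure mode concrete (my addition, not from the original post): decoding the byte 0x80 with UTF-8 yields the replacement character U+FFFD, and U+FFFD has no mapping in windows-1252, which is exactly what the encoder later rejects. A minimal sketch, assuming a platform whose default charset is UTF-8:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    public static void main(String[] args) {
        byte[] bytes = {0x3f, (byte) 0x80, 0x00, 0x00};
        // 0x80 is not a valid single-byte UTF-8 sequence, so decoding
        // produces U+FFFD, the replacement character.
        String s = new String(bytes, StandardCharsets.UTF_8);
        System.out.println((int) s.charAt(1)); // 65533 (U+FFFD)
        // windows-1252 cannot encode U+FFFD, which is the source of the
        // UnmappableCharacterException when the writer flushes.
        System.out.println(
            Charset.forName("windows-1252").newEncoder().canEncode(s.charAt(1))); // false
    }
}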
I'm reading from and writing to a ByteBuffer:
import org.assertj.core.api.Assertions;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;

public class Solution {
    public static void main(String[] args) throws Exception {
        final CharsetEncoder messageEncoder = Charset.forName("ISO-8859-1").newEncoder();
        String message = "TRANSACTION IGNORED";
        String carrierName = "CARR00AB";
        int messageLength = message.length() + carrierName.length() + 8;
        System.out.println(" --------Fill data---------");
        ByteBuffer messageBuffer = ByteBuffer.allocate(4096);
        messageBuffer.order(ByteOrder.BIG_ENDIAN);
        messageBuffer.putInt(messageLength);
        messageBuffer.put(messageEncoder.encode(CharBuffer.wrap(carrierName)));
        messageBuffer.put(messageEncoder.encode(CharBuffer.wrap(message)));
        messageBuffer.put((byte) 0x2b);
        messageBuffer.flip();

        System.out.println("------------Extract Data Approach 1--------");
        CharsetDecoder messageDecoder = Charset.forName("ISO-8859-1").newDecoder();
        int lengthField = messageBuffer.getInt();
        System.out.println("lengthField=" + lengthField);
        int responseLength = lengthField - 12;
        System.out.println("responseLength=" + responseLength);
        String messageDecoded = messageDecoder.decode(messageBuffer).toString();
        System.out.println("messageDecoded=" + messageDecoded);
        String decodedCarrier = messageDecoded.substring(0, carrierName.length());
        System.out.println("decodedCarrier=" + decodedCarrier);
        String decodedBody = messageDecoded.substring(carrierName.length(), messageDecoded.length() - 1);
        System.out.println("decodedBody=" + decodedBody);
        Assertions.assertThat(messageLength).isEqualTo(lengthField);
        Assertions.assertThat(decodedBody).isEqualTo(message);

        ByteBuffer messageBuffer2 = ByteBuffer.allocate(4096);
        messageBuffer2.order(ByteOrder.BIG_ENDIAN);
        messageBuffer2.putInt(messageLength);
        messageBuffer2.put(messageEncoder.encode(CharBuffer.wrap(carrierName)));
        messageBuffer2.put(messageEncoder.encode(CharBuffer.wrap(message)));
        messageBuffer2.put((byte) 0x2b);
        messageBuffer2.flip();

        System.out.println("---------Extract Data Approach 2--------");
        byte[] data = new byte[messageBuffer2.limit()];
        messageBuffer2.get(data);
        String dataString = new String(data, "ISO-8859-1");
        System.out.println(dataString);
    }
}
It works fine, but then I thought to refactor it; please see approach 2 in the code above:
byte[] data = new byte[messageBuffer.limit()];
messageBuffer.get(data);
String dataString = new String(data, "ISO-8859-1");
System.out.println(dataString);
Output= #CARR00ABTRANSACTION IGNORED+
Could you help me understand why the integer goes missing in the second approach while decoding?
Is there any way to extract the integer in the second approach?
Okay, so you are trying to read an int from the buffer, which takes up 4 bytes, and then trying to get the whole of the data after having read those 4 bytes.
What I have done is call messageBuffer2.clear(); after reading the int to resolve this issue. Here is the full code:
System.out.println(messageBuffer2.getInt());
byte[] data = new byte[messageBuffer2.limit()];
messageBuffer2.clear();
messageBuffer2.get(data);
String dataString = new String(data, StandardCharsets.ISO_8859_1);
System.out.println(dataString);
Output is:
35
#CARR0033TRANSACTION IGNORED+
Edit: Basically, when you call clear() it resets the buffer's bookkeeping: the position goes back to 0 (and the limit to the capacity) without erasing the contents, so the subsequent get starts reading from the beginning again, and that's how it fixes the issue.
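As a sketch of the position bookkeeping (my illustration, not from the original answer): after flip() the position is 0 and the limit marks the end of the written data; getInt() advances the position by 4 bytes, so a later bulk get() only sees the remainder unless the position is reset. rewind() is an alternative to clear() that resets the position but keeps the limit:

import java.nio.ByteBuffer;

public class BufferPositionDemo {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putInt(35);          // four bytes: 00 00 00 23
        buf.put((byte) 'X');
        buf.flip();              // position = 0, limit = 5

        System.out.println(buf.getInt());    // 35; position is now 4
        System.out.println(buf.remaining()); // 1: only 'X' is left

        buf.rewind();            // position back to 0, limit stays at 5
        byte[] all = new byte[buf.remaining()];
        buf.get(all);            // reads all 5 bytes, including the int
        System.out.println(all.length);      // 5
    }
}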
I have created a binary file using Java and memory mapping. It contains a list of integers from 1 to 10 million:
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MemoryMapWriter {
    public static void main(String[] args) throws FileNotFoundException, IOException, InterruptedException {
        File f = new File("file.bin");
        f.delete();
        FileChannel fc = new RandomAccessFile(f, "rw").getChannel();
        long bufferSize = 64 * 1000;
        MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_WRITE, 0, bufferSize);
        int start = 0;
        long counter = 1;
        long HUNDREDK = 100000;
        long startT = System.currentTimeMillis();
        long noOfMessage = HUNDREDK * 10 * 10;
        for (;;) {
            if (!mem.hasRemaining()) {
                start += mem.position();
                mem = fc.map(FileChannel.MapMode.READ_WRITE, start, bufferSize);
            }
            mem.putLong(counter);
            counter++;
            if (counter > noOfMessage) {
                break;
            }
        }
        long endT = System.currentTimeMillis();
        long tot = endT - startT;
        System.out.println(String.format("No Of Message %s , Time(ms) %s ", noOfMessage, tot));
    }
}
then I have tried to read it using Python and memory mapping:
import numpy as np

a = np.memmap("file.bin", mode='r', dtype='int64')
print(a[0:9])
but when I print the first elements, this is the result:
[ 72057594037927936, 144115188075855872, 216172782113783808,
288230376151711744, 360287970189639680, 432345564227567616,
504403158265495552, 576460752303423488, 648518346341351424,
720575940379279360]
What is wrong with my code?
You have a byte-order problem. 72057594037927936 in hex is 0x0100000000000000, 144115188075855872 is 0x0200000000000000, etc.
Java is writing longs to the buffer in big-endian order (most significant byte first) and Python is interpreting the resulting byte stream in little-endian order (least significant byte first).
One simple fix is to change the Java buffer's ByteOrder attribute:
mem.order(ByteOrder.LITTLE_ENDIAN);
Or tell Python to use big-endian order. numpy dtypes can carry an explicit byte order, so reading with np.memmap("file.bin", mode='r', dtype='>i8') should interpret the existing big-endian data correctly; alternatively, struct.unpack_from lets you specify the byte order explicitly.
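For completeness, here is a sketch of the Java-side fix in context (my code, modeled on the question's writer): note that order() has to be set again on every buffer returned by map(), because each newly created buffer defaults to big-endian:

import java.io.RandomAccessFile;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class LittleEndianMapWriter {
    public static void main(String[] args) throws Exception {
        long bufferSize = 64 * 1000;
        try (RandomAccessFile raf = new RandomAccessFile("file.bin", "rw");
             FileChannel fc = raf.getChannel()) {
            MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_WRITE, 0, bufferSize);
            mem.order(ByteOrder.LITTLE_ENDIAN); // match numpy's int64 on little-endian hardware
            long start = 0;
            for (long counter = 1; counter <= 10_000_000L; counter++) {
                if (mem.remaining() < Long.BYTES) {
                    start += mem.position();
                    mem = fc.map(FileChannel.MapMode.READ_WRITE, start, bufferSize);
                    mem.order(ByteOrder.LITTLE_ENDIAN); // a fresh mapping is big-endian by default
                }
                mem.putLong(counter);
            }
        }
    }
}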
After a week of work I designed a binary file format, and made a Java reader for it. It's just an experiment, which works fine, unless I'm using the GZip compression function.
I called my binary type MBDF (Minimal Binary Database Format), and it can store 8 different types:
Integer (There is nothing like a byte, short, long or anything like that, since it is stored in flexible space (bigger numbers take more space))
Float-32 (32-bits floating point format, like java's float type)
Float-64 (64-bits floating point format, like java's double type)
String (A string in UTF-16 format)
Boolean
Null (Just specifies a null value)
Array (Something like java's ArrayList<Object>)
Compound (A String - Object map)
I used this data as test data:
COMPOUND {
float1: FLOAT_32 3.3
bool2: BOOLEAN true
float2: FLOAT_64 3.3
int1: INTEGER 3
compound1: COMPOUND {
xml: STRING "two length compound"
int: INTEGER 23
}
string1: STRING "Hello world!"
string2: STRING "3"
arr1: ARRAY [
STRING "Hello world!"
INTEGER 3
STRING "3"
FLOAT_32 3.29
FLOAT_64 249.2992
BOOLEAN true
COMPOUND {
str: STRING "one length compound"
}
BOOLEAN false
NULL null
]
bool1: BOOLEAN false
null1: NULL null
}
The xml key in a compound does matter!!
I made a file from it using this java code:
MBDFFile.writeMBDFToFile(
"/Users/<anonymous>/Documents/Java/MBDF/resources/file.mbdf",
b.makeMBDF(false)
);
Here, the variable b is a MBDFBinary object, containing all the data given above. With the makeMBDF function it generates the ISO 8859-1 encoded string and if the given boolean is true, it compresses the string using GZip. Then, when writing, an extra information character is added at the beginning of the file, containing information about how to read it back.
Then, after writing the file, I read it back into Java and parse it:
MBDF mbdf = MBDFFile.readMBDFFromFile("/Users/<anonymous>/Documents/Java/MBDF/resources/file.mbdf");
System.out.println(mbdf.getBinaryObject().parse());
This prints exactly the information mentioned above.
Then I try to use compression:
MBDFFile.writeMBDFToFile(
"/Users/<anonymous>/Documents/Java/MBDF/resources/file.mbdf",
b.makeMBDF(true)
);
I do exactly the same to read it back as I did with the uncompressed file, which should work. It prints this information:
COMPOUND {
float1: FLOAT_32 3.3
bool2: BOOLEAN true
float2: FLOAT_64 3.3
int1: INTEGER 3
compound1: COMPOUND {
xUT: STRING 'two length compound'
int: INTEGER 23
}
string1: STRING 'Hello world!'
string2: STRING '3'
arr1: ARRAY [
STRING 'Hello world!'
INTEGER 3
STRING '3'
FLOAT_32 3.29
FLOAT_64 249.2992
BOOLEAN true
COMPOUND {
str: STRING 'one length compound'
}
BOOLEAN false
NULL null
]
bool1: BOOLEAN false
null1: NULL null
}
Comparing it to the initial information, the name xml changed into xUT for some reason...
After some research, I found small differences in the binary data before compression and after decompression: patterns such as 110011 change into 101010.
When I make the name xml longer, like xmldm, it is parsed back as xmldm correctly. So far I have only seen the problem occur with three-character names.
Directly compressing and decompressing the generated string (without saving it to a file and reading that back) does work, so maybe the bug is caused by the file encoding.
As far as I know, the string output is in ISO 8859-1 format, but I couldn't get the file encoding right; when a file is read, every byte is read back as an ISO 8859-1 character.
I have some ideas about what could be the reason, but I don't know how to test them:
The GZip output has a different encoding than the uncompressed output, causing small differences when stored in a file.
The file is stored in UTF-8 format, ignoring my request for ISO 8859-1 encoding (I don't know how to explain it better :) ).
There is a small bug in the Java GZip libraries.
But which one is true, and if none of them is right, what is the real reason for this bug? I can't figure it out.
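One way to test these hypotheses (my suggestion, not from the original post): push all 256 byte values through the same write and read path and compare the result. Any charset transcoding corrupts bytes 0x80-0x9F or expands high bytes into multi-byte sequences immediately, so a failed round trip pinpoints the encoding step. A sketch, assuming a platform whose default charset is UTF-8:

import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class RoundTripTest {
    public static void main(String[] args) throws Exception {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String asLatin1 = new String(all, "ISO-8859-1");

        // Writing through a character Writer (platform encoding), as the
        // FileWriter-based attempts shown below did:
        try (Writer w = new FileWriter("roundtrip.bin")) {
            w.write(asLatin1);
        }
        byte[] back = Files.readAllBytes(Paths.get("roundtrip.bin"));
        // false on a UTF-8 platform: bytes >= 0x80 become two-byte sequences
        System.out.println(Arrays.equals(all, back));

        // Writing the raw bytes through an OutputStream instead:
        try (FileOutputStream out = new FileOutputStream("roundtrip.bin")) {
            out.write(all);
        }
        back = Files.readAllBytes(Paths.get("roundtrip.bin"));
        System.out.println(Arrays.equals(all, back)); // true
    }
}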
The MBDFFile class, reading and storing the files:
/* MBDFFile.java */
package com.redgalaxy.mbdf;

import java.io.*;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MBDFFile {
    public static MBDF readMBDFFromFile(String filename) throws IOException {
//        FileInputStream is = new FileInputStream(filename);
//        InputStreamReader isr = new InputStreamReader(is, "ISO-8859-1");
//        BufferedReader br = new BufferedReader(isr);
//
//        StringBuilder builder = new StringBuilder();
//
//        String currentLine;
//
//        while ((currentLine = br.readLine()) != null) {
//            builder.append(currentLine);
//            builder.append("\n");
//        }
//
//        builder.deleteCharAt(builder.length() - 1);
//
//        br.close();

        Path path = Paths.get(filename);
        byte[] data = Files.readAllBytes(path);
        return new MBDF(new String(data, "ISO-8859-1"));
    }

    private static void writeToFile(String filename, byte[] txt) throws IOException {
//        BufferedWriter writer = new BufferedWriter(new FileWriter(filename));
////        FileWriter writer = new FileWriter(filename);
//        writer.write(txt.getBytes("ISO-8859-1"));
//        writer.close();
//        PrintWriter pw = new PrintWriter(filename, "ISO-8859-1");

        FileOutputStream stream = new FileOutputStream(filename);
        stream.write(txt);
        stream.close();
    }

    public static void writeMBDFToFile(String filename, MBDF info) throws IOException {
        writeToFile(filename, info.pack().getBytes("ISO-8859-1"));
    }
}
The pack function generates the final string for the file, in ISO 8859-1 format.
For all the other code, see my MBDF Github repository.
The commented-out code shows the approaches I have already tried.
My workspace:
- Macbook Air '11 (High Sierra)
- IntelliJ Community 2017.3
- JDK 1.8
I hope this is enough information, this is actually the only way to make clear what I'm doing, and what exactly isn't working.
Edit: MBDF.java
/* MBDF.java */
package com.redgalaxy.mbdf;

import java.io.IOException;
import java.io.UnsupportedEncodingException;

public class MBDF {
    private String data;
    private InfoTag tag;

    public MBDF(String data) {
        this.tag = new InfoTag((byte) data.charAt(0));
        this.data = data.substring(1);
    }

    public MBDF(String data, InfoTag tag) {
        this.tag = tag;
        this.data = data;
    }

    public MBDFBinary getBinaryObject() throws IOException {
        String uncompressed = data;
        if (tag.isCompressed) {
            uncompressed = GZipUtils.decompress(data);
        }
        Binary binary = getBinaryFrom8Bit(uncompressed);
        return new MBDFBinary(binary.subBit(0, binary.getLen() - tag.trailing));
    }

    public static Binary getBinaryFrom8Bit(String s8bit) {
        try {
            byte[] bytes = s8bit.getBytes("ISO-8859-1");
            return new Binary(bytes, bytes.length * 8);
        } catch (UnsupportedEncodingException ignored) {
            // This is not going to happen, because encoding 'ISO-8859-1' is always supported.
            return new Binary(new byte[0], 0);
        }
    }

    public static String get8BitFromBinary(Binary binary) {
        try {
            return new String(binary.getByteArray(), "ISO-8859-1");
        } catch (UnsupportedEncodingException ignored) {
            // This is not going to happen, because encoding 'ISO-8859-1' is always supported.
            return "";
        }
    }

    /*
     * Adds leading zeroes to the binary string, so that the final amount of bits is 16 (or 8)
     */
    private static String addLeadingZeroes(String bin, boolean is16) {
        int len = bin.length();
        long amount = (long) (is16 ? 16 : 8) - len;
        // Create zeroes and prepend them to the binary string
        StringBuilder zeroes = new StringBuilder();
        for (int i = 0; i < amount; i++) {
            zeroes.append(0);
        }
        zeroes.append(bin);
        return zeroes.toString();
    }

    public String pack() {
        return tag.getFilePrefixChar() + data;
    }

    public String getData() {
        return data;
    }

    public InfoTag getTag() {
        return tag;
    }
}
This class contains the pack() method. data is already compressed here (if it should be).
For the other classes, please see the GitHub repository; I don't want to make my question too long.
Solved it by myself!
It turned out to be the reading and writing system. When I exported a file, I made a string using the ISO-8859-1 table to turn bytes into characters, and then wrote that string to a text file as UTF-8. The big problem was that I used FileWriter instances to write it, which are meant for text files.
Reading used the inverse system: the complete file was read into memory as a string (memory consuming!!) and then decoded.
I hadn't fully realized that a file is just binary data, and that text is only a particular interpretation of those bytes; ISO-8859-1 and UTF-8 are two such interpretations. I had problems with UTF-8 because it splits some characters into two bytes, which I couldn't manage...
My solution was to use streams. Java has FileInputStream and FileOutputStream, which can be used for reading and writing binary files. I hadn't used streams before because I thought there was no big difference ("files are text, so what's the problem?"), but there is... I implemented this (by writing a new, similar library), and I'm now able to pass any input stream to the decoder and any output stream to the encoder. For uncompressed files you pass a FileOutputStream; GZipped files can use a GZIPOutputStream wrapping a FileOutputStream. If someone wants the binary data in memory, a ByteArrayOutputStream can be used. The same rules apply to reading, using the InputStream variants of the streams mentioned.
No UTF-8 or ISO-8859-1 problems anymore, and it seemed to work, even with GZip!
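A minimal sketch of the stream-based design described above (illustrative names, not the actual MBDF API): the payload stays a byte[] end to end, GZip is layered on as just another stream, and no String/charset round trip is involved:

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class StreamSketch {
    static void write(byte[] payload, File file, boolean compress) throws IOException {
        OutputStream out = new FileOutputStream(file);
        if (compress) {
            out = new GZIPOutputStream(out);
        }
        try (OutputStream o = out) {
            o.write(payload); // raw bytes, no charset involved
        }
    }

    static byte[] read(File file, boolean compressed) throws IOException {
        InputStream in = new FileInputStream(file);
        if (compressed) {
            in = new GZIPInputStream(in);
        }
        try (InputStream i = in;
             ByteArrayOutputStream buf = new ByteArrayOutputStream()) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = i.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return buf.toByteArray();
        }
    }
}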
I have written code to split a .gz file into user-specified parts using a byte[] array, but the loop is not reading/writing the last part of the parent file, which is smaller than the array size. Can you please help me fix this?
package com.bitsighttech.collection.packaging;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.log4j.Logger;

public class FileSplitterBytewise {
    private static Logger logger = Logger.getLogger(FileSplitterBytewise.class);
    private static final long KB = 1024;
    private static final long MB = KB * KB;
    private FileInputStream fis;
    private FileOutputStream fos;
    private DataInputStream dis;
    private DataOutputStream dos;

    public boolean split(File inputFile, String splitSize) {
        int expectedNoOfFiles = 0;
        try {
            double parentFileSizeInB = inputFile.length();
            Pattern p = Pattern.compile("(\\d+)\\s([MmGgKk][Bb])");
            Matcher m = p.matcher(splitSize);
            m.matches();
            String FileSizeString = m.group(1);
            String unit = m.group(2);
            double FileSizeInMB = 0;
            try {
                if (unit.toLowerCase().equals("kb"))
                    FileSizeInMB = Double.parseDouble(FileSizeString) / KB;
                else if (unit.toLowerCase().equals("mb"))
                    FileSizeInMB = Double.parseDouble(FileSizeString);
                else if (unit.toLowerCase().equals("gb"))
                    FileSizeInMB = Double.parseDouble(FileSizeString) * KB;
            } catch (NumberFormatException e) {
                logger.error("invalid number [" + FileSizeInMB + "] for expected file size");
            }
            double fileSize = FileSizeInMB * MB;
            int fileSizeInByte = (int) Math.ceil(fileSize);
            double noOFFiles = parentFileSizeInB / fileSizeInByte;
            expectedNoOfFiles = (int) Math.ceil(noOFFiles);
            int splinterCount = 1;
            fis = new FileInputStream(inputFile);
            dis = new DataInputStream(new BufferedInputStream(fis));
            fos = new FileOutputStream("F:\\ff\\" + "_part_" + splinterCount + "_of_" + expectedNoOfFiles);
            dos = new DataOutputStream(new BufferedOutputStream(fos));
            byte[] data = new byte[(int) fileSizeInByte];
            while (splinterCount <= expectedNoOfFiles) {
                int i;
                for (i = 0; i < data.length - 1; i++) {
                    data[i] = dis.readByte();
                }
                dos.write(data);
                splinterCount++;
            }
        } catch (Exception e) {
            logger.error("Unable to split the file " + inputFile.getName() + " in to " + expectedNoOfFiles);
            return false;
        }
        logger.debug("Successfully split the file [" + inputFile.getName() + "] in to " + expectedNoOfFiles + " files");
        return true;
    }

    public static void main(String args[]) {
        String FilePath1 = "F:\\az.gz";
        File file = new File(FilePath1);
        FileSplitterBytewise fileSplitter = new FileSplitterBytewise();
        String splitlen = "1 MB";
        fileSplitter.split(file, splitlen);
    }
}
I'd suggest making more methods. You've got a complicated string-handling section of code in split(); it would be best to make a method that takes the human-friendly string as input and returns the number you're looking for. (It would also make it far easier for you to test this section of the routine; there's no way you can test it now.)
Once it is split off and you're writing test cases, you'll probably find that the error message you generate if the string doesn't contain kb, mb, or gb is extremely confusing -- it blames the number 0 for the mistake rather than pointing out the string does not have the expected units.
Using an int to store the file size means your program will never handle files larger than two gigabytes. You should stick with long or double. (double feels wrong for something that is actually confined to integer values but I can't quickly think why it would fail.)
byte[] data = new byte[(int) fileSizeInByte];
Allocating several gigabytes like this is going to destroy your performance -- that's a potentially huge memory allocation (and one that might be considered under control of an adversary; depending upon your security model, this might or might not be a big deal). Don't try to work with the entire file in one piece.
You appear to be reading and writing the files one byte at a time. That's a guarantee to very slow performance. Doing some performance testing for another question earlier today, I found that my machine could read (from a hot cache) 2000 times faster using 131kb blocks than two-byte blocks. One-byte blocks would be even worse. A cold cache would be significantly worse for such small sizes.
fos = new FileOutputStream("F:\\ff\\" + "_part_" + splinterCount + "_of_" + expectedNoOfFiles);
You only appear to ever open one file output stream. Your post probably should have said "only the first works", because it looks like you've not yet tried it on a file that creates three or more pieces.
catch(Exception e)
At this point, you've got the ability to discover errors in your program; you choose to ignore them completely. Sure, you log an error message, but you cannot actually debug your program with the data you log. You should log at a minimum the exception type, message, and maybe even full stack-trace. This combination of data is immensely useful when trying to solve problems, especially in a few months when you've forgotten the details of how it works.
Can you please help me fix this?
I would do the following (a sketch implementing these points follows the list):
drop the DataInput/OutputStreams; you don't need them.
use in.read(data) to read a whole block instead of one byte at a time. Reading one byte at a time is much slower!
read the whole of the data array; you are currently reading one byte less than its length.
stop when you reach the end of the file; it might not be a whole multiple of the block size.
only write as much as you have read: if your blocks are 1 MB and there are 100 KB left, you should only read/write 100 KB at the end.
close your files when you have finished, especially as you have buffered streams.
your "split" writes everything to the same file (so it's not actually splitting); you need to create, write to, and close the output files in a loop.
don't use fields where you could/should be using local variables.
I would use the length as a long in bytes.
the pattern ignores incorrect input, and your pattern doesn't match the tests you check for; e.g. your pattern allows 1 G or 1 k, but these will be treated as 1 MB.
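Here is a sketch of a split loop along the lines of this advice (a hypothetical helper with simplified error handling, not drop-in code): read at most the chunk size per part, write exactly what was read, open a fresh output file per part, and stop at end of input:

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class SplitSketch {
    // Splits 'input' into files of at most 'chunkSize' bytes; returns the part count.
    static int split(File input, File outDir, long chunkSize) throws IOException {
        byte[] buf = new byte[64 * 1024]; // modest I/O buffer instead of one whole chunk in memory
        int part = 0;
        try (InputStream in = new BufferedInputStream(new FileInputStream(input))) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, chunkSize));
            while (n != -1) {
                part++;
                File outFile = new File(outDir, input.getName() + "_part_" + part);
                try (OutputStream out = new BufferedOutputStream(new FileOutputStream(outFile))) {
                    long written = 0;
                    while (n != -1) {
                        out.write(buf, 0, n); // write only what was actually read
                        written += n;
                        if (written >= chunkSize) {
                            break; // this part is full
                        }
                        n = in.read(buf, 0, (int) Math.min(buf.length, chunkSize - written));
                    }
                } // try-with-resources closes each part file
                if (n != -1) {
                    n = in.read(buf, 0, (int) Math.min(buf.length, chunkSize));
                }
            }
        }
        return part;
    }
}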