Save to binary file - java

I'd like to save data to a binary file using Java. For example, I have the number 101, and in my program the output file ends up being 4 bytes. How can I save the number using only three bits (101) in the output file?
My program looks like this:
public static void main(String args[]) throws FileNotFoundException, IOException {
int i = 101;
DataOutputStream os = new DataOutputStream(new FileOutputStream("file"));
os.writeInt(i);
os.close();
}
I found something like that: http://www.developer.nokia.com/Community/Wiki/Bit_Input/Output_Stream_utility_classes_for_efficient_data_transfer

You can not write less than one byte to a file. If you want to write the binary number 101 then do int i = 5 and use os.write(i) instead. This will write one byte: 0000 0101.
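For illustration, a minimal sketch of that suggestion (the file name is arbitrary; binary literals such as 0b101 need Java 7+):
import java.io.FileOutputStream;
import java.io.IOException;

public class WriteOneByte {
    public static void main(String[] args) throws IOException {
        int i = 0b101; // decimal 5
        try (FileOutputStream os = new FileOutputStream("file")) {
            os.write(i); // writes exactly one byte: 0000 0101
        }
    }
}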

First off, you can't write just 3 bits to a file; memory is aligned to specific sizes (8, 16, 32, 64 or even 128 bits, depending on the compiler/platform). If you write smaller sizes than that, they will be expanded to match the alignment.
Secondly, the decimal number 101 written in binary is 0b01100101; the binary number 0b00000101 is decimal 5.
Thirdly, these numbers are only 1 byte (8 bits) long, so you can use a byte (or char) instead of an int.
And last but not least, to write a single byte instead of a 4-byte int, use os.write().
So to get to what you want, first decide whether you mean 0b01100101 or 0b00000101. Change the int to the appropriate value (you can write binary literals like 0b01100101 in Java) and use os.write().

A really naive implementation, I hope it helps you grasp the idea. Also untested, likely to contain off-by-one errors etc...
class BitArray {
    // public fields for example code; in real code, encapsulate
    public int bits = 0; // actual count of stored bits
    public byte[] buf = new byte[1];

    private void doubleBuf() {
        byte[] tmp = new byte[buf.length * 2];
        System.arraycopy(buf, 0, tmp, 0, buf.length);
        buf = tmp;
    }

    private int arrayIndex(int bitNum) {
        return bitNum / 8;
    }

    private int bitNumInArray(int bitNum) {
        return bitNum & 7; // change to change bit order in buf's bytes
    }

    // returns how many elements of buf are actually in use, for saving etc.
    // note that the last element usually contains unused bits.
    public int getUsedArrayElements() {
        return arrayIndex(this.bits - 1) + 1;
    }

    // bitValue is 0 for 0, non-0 for 1
    public void setBit(int bitValue, int bitNum) {
        if (bitNum >= this.bits || bitNum < 0) throw new IllegalArgumentException();
        if (bitValue == 0) this.buf[arrayIndex(bitNum)] &= ~(1 << bitNumInArray(bitNum));
        else this.buf[arrayIndex(bitNum)] |= 1 << bitNumInArray(bitNum);
    }

    public void addBit(int bitValue) {
        // this.bits is the old bit count, which is the same as the index of the new last bit
        if (this.buf.length <= arrayIndex(this.bits)) doubleBuf();
        ++this.bits;
        setBit(bitValue, this.bits - 1);
    }

    int readBit(int bitNum) { // returns 0 or 1
        if (bitNum >= this.bits || bitNum < 0) throw new IllegalArgumentException();
        int value = buf[arrayIndex(bitNum)] & (1 << bitNumInArray(bitNum));
        return (value == 0) ? 0 : 1;
    }

    void addBits(int bitCount, int bitValues) {
        for (int num = bitCount - 1; num >= 0; --num) {
            // change loop iteration order to change bit order of bitValues
            addBit(bitValues & (1 << num));
        }
    }
}
For an efficient solution, it should use an int or long array instead of a byte array, and include a more efficient method for adding multiple bits (add parts of bitValues an entire buf array element at a time, not bit-by-bit like above).
To save this, you need to save the right number of bytes from buf, calculated by getUsedArrayElements().
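For example, a minimal save sketch (assuming a BitArray instance named bitArray and an arbitrary file name; the bit count itself would also have to be stored somewhere if you need to reload the exact number of bits):
try (FileOutputStream out = new FileOutputStream("bits.bin")) {
    // write only the bytes that are actually in use
    out.write(bitArray.buf, 0, bitArray.getUsedArrayElements());
}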

Related

IO-Link CRC32 implementation

Currently I am trying to implement the CRC32 check algorithm from IO-Link for IODD files in Java [1]. In IO-Link, each sensor is described by an IODD XML file [2]. The file contains a stamp tag with a crc code. My problem is that the algorithm uses only unsigned integers and Java does not have such a type. Is there still a way to implement it?
I also used the JetBrains dotPeek program to decompile the IODDChecker (latest version). The CRC logic is in the file clsCRC.cs. At the end of the file I found the method "CRC32" which calculates the CRC. The problem is that it uses unsigned integers for the lookup table. An interesting fact is that the algorithm in the IODD checker (crc32_octetwise) is slightly different from the one described in the specification.
Hope you can help me :)
What I tested:
Java CRC32 from zip
Guava UnsignedInteger
BigInteger
long instead of int
Procedure:
Program read the IODD as String/byte array
Remove the CRC value from Stamp-Tag from the String/byte array
Create CRC32 from string/byte array
In the test IODD you can find the CRC 1195433981 which should be equal to the output of the algorithm
Algorithms:
There are two algorithms. The first one is bitwise and the second one is octetwise and uses a lookup table which you can find here [2].
uint32_t crc32_bitwise(char *data, size_t length, uint32_t previousCrc32 = 1)
{
uint32_t crc = ~previousCrc32;
int j;
const uint8_t* current = (const uint8_t*) data;
while (length-- > 0)
{
crc ^= *current++;
for (j = 0; j < 8; j++)
{
crc = (crc >> 1) ^ (-(int32_t)(crc & 1) & 0xEB31D82E);
}
}
return ~crc;
}
uint32_t crc32_octetwise(uint8_t const * current, uint32_t length, uint32_t previousCrc32 = 1, const uint32_t crc32Lookup[256])
{
uint32_t crc = ~previousCrc32;
while (length-- > 0)
crc = (crc >> 8) ^ crc32Lookup[(crc & 0xFF) ^ *current++];
return ~crc;
}
[1] https://io-link.com/share/Downloads/Profiles/IOL-Profile_Firmware-Update_V11_10082_Sep19.zip
[2] https://ioddfinder.io-link.com/productvariants/search/11765
[3] Yet another Java CRC32 implementation related issue
Update 1
I forgot to add that I currently receive 1120211745 as the CRC code, but 1195433981 is the correct code.
Update 2
The following code was implemented by me. The big arrays are the lookup tables. The first int array is the table from [1] and the second one is the array from the decompiled source. In the original, the int numbers were marked as unsigned (e.g. 1996959894U).
private static final int[] numArr1 = new int[] {
0x00000000, 0x9695C4CA, 0xFB4839C9, 0x6DDDFD03, 0x20F3C3CF, 0xB6660705, 0xDBBBFA06, 0x4D2E3ECC, 0x41E7879E, 0xD7724354, 0xBAAFBE57, 0x2C3A7A9D, 0x61144451, 0xF781809B, 0x9A5C7D98, 0x0CC9B952, 0x83CF0F3C, 0x155ACBF6, 0x788736F5, 0xEE12F23F, 0xA33CCCF3, 0x35A90839, 0x5874F53A, 0xCEE131F0, 0xC22888A2, 0x54BD4C68, 0x3960B16B, 0xAFF575A1, 0xE2DB4B6D, 0x744E8FA7, 0x199372A4, 0x8F06B66E, 0xD1FDAE25, 0x47686AEF, 0x2AB597EC, 0xBC205326, 0xF10E6DEA, 0x679BA920, 0x0A465423, 0x9CD390E9, 0x901A29BB, 0x068FED71, 0x6B521072, 0xFDC7D4B8, 0xB0E9EA74, 0x267C2EBE, 0x4BA1D3BD, 0xDD341777, 0x5232A119, 0xC4A765D3, 0xA97A98D0, 0x3FEF5C1A, 0x72C162D6, 0xE454A61C, 0x89895B1F, 0x1F1C9FD5, 0x13D52687, 0x8540E24D, 0xE89D1F4E, 0x7E08DB84, 0x3326E548, 0xA5B32182, 0xC86EDC81, 0x5EFB184B, 0x7598EC17, 0xE30D28DD, 0x8ED0D5DE, 0x18451114, 0x556B2FD8, 0xC3FEEB12, 0xAE231611, 0x38B6D2DB, 0x347F6B89, 0xA2EAAF43, 0xCF375240, 0x59A2968A, 0x148CA846, 0x82196C8C, 0xEFC4918F, 0x79515545, 0xF657E32B, 0x60C227E1, 0x0D1FDAE2, 0x9B8A1E28, 0xD6A420E4, 0x4031E42E, 0x2DEC192D, 0xBB79DDE7, 0xB7B064B5, 0x2125A07F, 0x4CF85D7C, 0xDA6D99B6, 0x9743A77A, 0x01D663B0, 0x6C0B9EB3, 0xFA9E5A79, 0xA4654232, 0x32F086F8, 0x5F2D7BFB, 0xC9B8BF31, 0x849681FD, 0x12034537, 0x7FDEB834, 0xE94B7CFE, 0xE582C5AC, 0x73170166, 0x1ECAFC65, 0x885F38AF, 0xC5710663, 0x53E4C2A9, 0x3E393FAA, 0xA8ACFB60, 0x27AA4D0E, 0xB13F89C4, 0xDCE274C7, 0x4A77B00D, 0x07598EC1, 0x91CC4A0B, 0xFC11B708, 0x6A8473C2, 0x664DCA90, 0xF0D80E5A, 0x9D05F359, 0x0B903793, 0x46BE095F, 0xD02BCD95, 0xBDF63096, 0x2B63F45C, 0xEB31D82E, 0x7DA41CE4, 0x1079E1E7, 0x86EC252D, 0xCBC21BE1, 0x5D57DF2B, 0x308A2228, 0xA61FE6E2, 0xAAD65FB0, 0x3C439B7A, 0x519E6679, 0xC70BA2B3, 0x8A259C7F, 0x1CB058B5, 0x716DA5B6, 0xE7F8617C, 0x68FED712, 0xFE6B13D8, 0x93B6EEDB, 0x05232A11, 0x480D14DD, 0xDE98D017, 0xB3452D14, 0x25D0E9DE, 0x2919508C, 0xBF8C9446, 0xD2516945, 0x44C4AD8F, 0x09EA9343, 0x9F7F5789, 0xF2A2AA8A, 0x64376E40, 0x3ACC760B, 0xAC59B2C1, 0xC1844FC2, 0x57118B08, 0x1A3FB5C4, 0x8CAA710E, 0xE1778C0D, 0x77E248C7, 0x7B2BF195, 0xEDBE355F, 0x8063C85C, 0x16F60C96, 0x5BD8325A, 0xCD4DF690, 0xA0900B93, 0x3605CF59, 0xB9037937, 0x2F96BDFD, 0x424B40FE, 0xD4DE8434, 0x99F0BAF8, 0x0F657E32, 0x62B88331, 0xF42D47FB, 0xF8E4FEA9, 0x6E713A63, 0x03ACC760, 0x953903AA, 0xD8173D66, 0x4E82F9AC, 0x235F04AF, 0xB5CAC065, 0x9EA93439, 0x083CF0F3, 0x65E10DF0, 0xF374C93A, 0xBE5AF7F6, 0x28CF333C, 0x4512CE3F, 0xD3870AF5, 0xDF4EB3A7, 0x49DB776D, 0x24068A6E, 0xB2934EA4, 0xFFBD7068, 0x6928B4A2, 0x04F549A1, 0x92608D6B, 0x1D663B05, 0x8BF3FFCF, 0xE62E02CC, 0x70BBC606, 0x3D95F8CA, 0xAB003C00, 0xC6DDC103, 0x504805C9, 0x5C81BC9B, 0xCA147851, 0xA7C98552, 0x315C4198, 0x7C727F54, 0xEAE7BB9E, 0x873A469D, 0x11AF8257, 0x4F549A1C, 0xD9C15ED6, 0xB41CA3D5, 0x2289671F, 0x6FA759D3, 0xF9329D19, 0x94EF601A, 0x027AA4D0, 0x0EB31D82, 0x9826D948, 0xF5FB244B, 0x636EE081, 0x2E40DE4D, 0xB8D51A87, 0xD508E784, 0x439D234E, 0xCC9B9520, 0x5A0E51EA, 0x37D3ACE9, 0xA1466823, 0xEC6856EF, 0x7AFD9225, 0x17206F26, 0x81B5ABEC, 0x8D7C12BE, 0x1BE9D674, 0x76342B77, 0xE0A1EFBD, 0xAD8FD171, 0x3B1A15BB, 0x56C7E8B8, 0xC0522C72
};
private static final long[] numArr2 = new long[] {
0L, 1996959894L, 3993919788L, 2567524794L, 124634137L, 1886057615L, 3915621685L, 2657392035L, 249268274L, 2044508324L, 3772115230L, 2547177864L, 162941995L, 2125561021L, 3887607047L, 2428444049L, 498536548L, 1789927666L, 4089016648L, 2227061214L, 450548861L, 1843258603L, 4107580753L, 2211677639L, 325883990L, 1684777152L, 4251122042L, 2321926636L, 335633487L, 1661365465L, 4195302755L, 2366115317L, 997073096L, 1281953886L, 3579855332L, 2724688242L, 1006888145L, 1258607687L, 3524101629L, 2768942443L, 901097722L, 1119000684L, 3686517206L, 2898065728L, 853044451L, 1172266101L, 3705015759L, 2882616665L, 651767980L, 1373503546L, 3369554304L, 3218104598L, 565507253L, 1454621731L, 3485111705L, 3099436303L, 671266974L, 1594198024L, 3322730930L, 2970347812L, 795835527L, 1483230225L, 3244367275L, 3060149565L, 1994146192L, 31158534L, 2563907772L, 4023717930L, 1907459465L, 112637215L, 2680153253L, 3904427059L, 2013776290L, 251722036L, 2517215374L, 3775830040L, 2137656763L, 141376813L, 2439277719L, 3865271297L, 1802195444L, 476864866L, 2238001368L, 4066508878L, 1812370925L, 453092731L, 2181625025L, 4111451223L, 1706088902L, 314042704L, 2344532202L, 4240017532L, 1658658271L, 366619977L, 2362670323L, 4224994405L, 1303535960L, 984961486L, 2747007092L, 3569037538L, 1256170817L, 1037604311L, 2765210733L, 3554079995L, 1131014506L, 879679996L, 2909243462L, 3663771856L, 1141124467L, 855842277L, 2852801631L, 3708648649L, 1342533948L, 654459306L, 3188396048L, 3373015174L, 1466479909L, 544179635L, 3110523913L, 3462522015L, 1591671054L, 702138776L, 2966460450L, 3352799412L, 1504918807L, 783551873L, 3082640443L, 3233442989L, 3988292384L, 2596254646L, 62317068L, 1957810842L, 3939845945L, 2647816111L, 81470997L, 1943803523L, 3814918930L, 2489596804L, 225274430L, 2053790376L, 3826175755L, 2466906013L, 167816743L, 2097651377L, 4027552580L, 2265490386L, 503444072L, 1762050814L, 4150417245L, 2154129355L, 426522225L, 1852507879L, 4275313526L, 2312317920L, 282753626L, 1742555852L, 4189708143L, 2394877945L, 397917763L, 1622183637L, 3604390888L, 2714866558L, 953729732L, 1340076626L, 3518719985L, 2797360999L, 1068828381L, 1219638859L, 3624741850L, 2936675148L, 906185462L, 1090812512L, 3747672003L, 2825379669L, 829329135L, 1181335161L, 3412177804L, 3160834842L, 628085408L, 1382605366L, 3423369109L, 3138078467L, 570562233L, 1426400815L, 3317316542L, 2998733608L, 733239954L, 1555261956L, 3268935591L, 3050360625L, 752459403L, 1541320221L, 2607071920L, 3965973030L, 1969922972L, 40735498L, 2617837225L, 3943577151L, 1913087877L, 83908371L, 2512341634L, 3803740692L, 2075208622L, 213261112L, 2463272603L, 3855990285L, 2094854071L, 198958881L, 2262029012L, 4057260610L, 1759359992L, 534414190L, 2176718541L, 4139329115L, 1873836001L, 414664567L, 2282248934L, 4279200368L, 1711684554L, 285281116L, 2405801727L, 4167216745L, 1634467795L, 376229701L, 2685067896L, 3608007406L, 1308918612L, 956543938L, 2808555105L, 3495958263L, 1231636301L, 1047427035L, 2932959818L, 3654703836L, 1088359270L, 936918000L, 2847714899L, 3736837829L, 1202900863L, 817233897L, 3183342108L, 3401237130L, 1404277552L, 615818150L, 3134207493L, 3453421203L, 1423857449L, 601450431L, 3009837614L, 3294710456L, 1567103746L, 711928724L, 3020668471L, 3272380065L, 1510334235L, 755167117L
};
Test with CRC32 from java.util.zip:
public static String crc32(byte[] data) {
Checksum c = new CRC32();
c.update(1);
c.update(data);
return String.valueOf(Long.parseLong(Long.toHexString(c.getValue()), 16)); // 1544046132
}
The solution from [3]
public static String crc32(byte[] data){
int crc = Integer.MAX_VALUE;
for (byte b : data)
crc = numArr[(crc & 0xff) ^ (b & 0xff) & 255] ^ (crc >>> 8);
crc = ~crc; // flip bit/sign
return String.valueOf(Long.parseLong(Integer.toHexString(crc), 16)); // 1120211745
}
This implementation uses Guava and was transformed directly by me. I also tried BigInteger, bit operations to enforce unsigned behaviour, and byte arrays instead of int or long.
public static String crc32(byte[] data) {
long num = UnsignedInteger.MAX_VALUE.longValue();
for (byte aByte : data)
num = numArr2[(Integer.parseUnsignedInt(Long.toUnsignedString(num)) ^ aByte) & 255] ^ (num >>> 8);
num ^= UnsignedInteger.MAX_VALUE.longValue();
return String.valueOf(num); // 2128031166
}
Update 3
I tried a new start with BigInteger and now the result is closer. The CRC is 1190225612 (expected: 1195433981).
public static String crc32(byte[] bytes) {
BigInteger num = BigInteger.valueOf(0xffffffffL);
for (byte aByte: bytes)
num = BigInteger.valueOf(numArr[num.xor(BigInteger.valueOf(aByte)).and(BigInteger.valueOf(numArr.length - 1)).intValue()]).xor((num.shiftRight(8)));
num = num.xor(BigInteger.valueOf(0xffffffffL));
System.out.println("1195433981");
return String.valueOf(~num.intValue());
}
Update 4
Updated the procedure description: not the whole line with the Stamp tag is removed; only the value of the CRC attribute is removed.
You have the initial CRC wrong. It is 1. From the document you linked:
The seed value shall be "1" (see previousCrc32 = 1 in Figure B.1 and
Figure B.2).
That value is then inverted before the loop, so at that point your "num" should be 0xfffffffe.
Your numArr1 is correct. I don't know where your numArr2 came from.
All you need is this, calling with the first argument being 1:
static private int crc32iodd(int crc, byte[] data){
crc = ~crc;
for (byte b : data)
crc = numArr1[(crc ^ b) & 0xff] ^ (crc >>> 8);
return ~crc;
}
Since Java doesn't have unsigned int, half the time your result will be negative. That's fine, since it's the exact same 32 bits as you would have as an unsigned result, just printed differently.
There is no sense of "closer" for the answer. The integer difference between what you want and what you're getting has no meaning in the world of CRCs, since they are not integer calculations.
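For illustration, a hedged sketch of calling the routine above with the seed of 1 and printing the result as an unsigned decimal (data stands for the IODD bytes with the crc value removed):
int crc = crc32iodd(1, data);
System.out.println(Integer.toUnsignedString(crc)); // unsigned rendering of the 32-bit result
// equivalently: long unsigned = crc & 0xFFFFFFFFL;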
Update:
Looking more closely at your links and the CRC reference code you put in your question, it appears you have referenced entirely the wrong documents for the CRC you actually need. You referenced the CRC for BLOBs used for firmware updates, which is not the CRC used for validation of the XML files. This document provides the specification for the CRC used for the XML files as the 32-bit CRC found in the ITU-T recommendation V.42. That is a different CRC using a different polynomial and different initial value. It is in fact the standard CRC-32 used by zip, and is available in Java as java.util.zip.CRC32. You don't need to write any CRC code. Just use that.
That document also notes but does not detail special processing of the XML input before computing the CRC. You will need to do more research to find out exactly what the CRC is calculated over in the XML file, and how it is pre-processed. This package has a description that notes that the associated language xml's have the CRC of main xml's appended to the end before each of their CRC calculations. However it is not made clear as to exactly how the CRC is appended.
This time please conduct adequate research and verification on your part, before coming here with questions about how to implement an incorrectly identified CRC.
For the sake of completeness.
You are right. I can use java.util.zip.CRC32. I think the problem was that at the beginning of my implementation tests I tried to use the whole XML file, and so I thought that the description in the PDF was wrong.
And here is the solution:
If you have an IODD XML file, then you have to remove the value of crc in the stamp tag. Subsequently, if not already done, convert the file content without the crc value to a byte array. Now you can use one of these two functions:
Implementation with java.util.zip.CRC32
public boolean isValid() {
CRC32 m = new CRC32();
m.update(file);
long x = 1195433981L;
return (m.getValue() == x);
}
Manual implementation (The table is according to the site https://rosettacode.org/wiki/CRC-32#Lingo or https://web.mit.edu/freebsd/head/sys/libkern/crc32.c)
public class Crc32 {
private static final long[] numArr2 = new long[] {
0L, 1996959894L, 3993919788L, 2567524794L, 124634137L, 1886057615L, 3915621685L, 2657392035L, 249268274L, 2044508324L, 3772115230L, 2547177864L, 162941995L, 2125561021L, 3887607047L, 2428444049L, 498536548L, 1789927666L, 4089016648L, 2227061214L, 450548861L, 1843258603L, 4107580753L, 2211677639L, 325883990L, 1684777152L, 4251122042L, 2321926636L, 335633487L, 1661365465L, 4195302755L, 2366115317L, 997073096L, 1281953886L, 3579855332L, 2724688242L, 1006888145L, 1258607687L, 3524101629L, 2768942443L, 901097722L, 1119000684L, 3686517206L, 2898065728L, 853044451L, 1172266101L, 3705015759L, 2882616665L, 651767980L, 1373503546L, 3369554304L, 3218104598L, 565507253L, 1454621731L, 3485111705L, 3099436303L, 671266974L, 1594198024L, 3322730930L, 2970347812L, 795835527L, 1483230225L, 3244367275L, 3060149565L, 1994146192L, 31158534L, 2563907772L, 4023717930L, 1907459465L, 112637215L, 2680153253L, 3904427059L, 2013776290L, 251722036L, 2517215374L, 3775830040L, 2137656763L, 141376813L, 2439277719L, 3865271297L, 1802195444L, 476864866L, 2238001368L, 4066508878L, 1812370925L, 453092731L, 2181625025L, 4111451223L, 1706088902L, 314042704L, 2344532202L, 4240017532L, 1658658271L, 366619977L, 2362670323L, 4224994405L, 1303535960L, 984961486L, 2747007092L, 3569037538L, 1256170817L, 1037604311L, 2765210733L, 3554079995L, 1131014506L, 879679996L, 2909243462L, 3663771856L, 1141124467L, 855842277L, 2852801631L, 3708648649L, 1342533948L, 654459306L, 3188396048L, 3373015174L, 1466479909L, 544179635L, 3110523913L, 3462522015L, 1591671054L, 702138776L, 2966460450L, 3352799412L, 1504918807L, 783551873L, 3082640443L, 3233442989L, 3988292384L, 2596254646L, 62317068L, 1957810842L, 3939845945L, 2647816111L, 81470997L, 1943803523L, 3814918930L, 2489596804L, 225274430L, 2053790376L, 3826175755L, 2466906013L, 167816743L, 2097651377L, 4027552580L, 2265490386L, 503444072L, 1762050814L, 4150417245L, 2154129355L, 426522225L, 1852507879L, 4275313526L, 2312317920L, 282753626L, 1742555852L, 4189708143L, 2394877945L, 397917763L, 1622183637L, 3604390888L, 2714866558L, 953729732L, 1340076626L, 3518719985L, 2797360999L, 1068828381L, 1219638859L, 3624741850L, 2936675148L, 906185462L, 1090812512L, 3747672003L, 2825379669L, 829329135L, 1181335161L, 3412177804L, 3160834842L, 628085408L, 1382605366L, 3423369109L, 3138078467L, 570562233L, 1426400815L, 3317316542L, 2998733608L, 733239954L, 1555261956L, 3268935591L, 3050360625L, 752459403L, 1541320221L, 2607071920L, 3965973030L, 1969922972L, 40735498L, 2617837225L, 3943577151L, 1913087877L, 83908371L, 2512341634L, 3803740692L, 2075208622L, 213261112L, 2463272603L, 3855990285L, 2094854071L, 198958881L, 2262029012L, 4057260610L, 1759359992L, 534414190L, 2176718541L, 4139329115L, 1873836001L, 414664567L, 2282248934L, 4279200368L, 1711684554L, 285281116L, 2405801727L, 4167216745L, 1634467795L, 376229701L, 2685067896L, 3608007406L, 1308918612L, 956543938L, 2808555105L, 3495958263L, 1231636301L, 1047427035L, 2932959818L, 3654703836L, 1088359270L, 936918000L, 2847714899L, 3736837829L, 1202900863L, 817233897L, 3183342108L, 3401237130L, 1404277552L, 615818150L, 3134207493L, 3453421203L, 1423857449L, 601450431L, 3009837614L, 3294710456L, 1567103746L, 711928724L, 3020668471L, 3272380065L, 1510334235L, 755167117L
};
private static long crc32(byte[] data){
long crc = 0xFFFFFFFFL;
for (byte b : data)
crc = numArr2[(int)(crc ^ b) & 0xff] ^ (crc >>> 8);
return crc ^ 0xFFFFFFFFL;
}
public static boolean isValid(byte[] data, long checkCrc) {
long crc = crc32(data);
return (checkCrc == crc);
}
}
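For illustration, a hedged usage sketch that ties the pieces together (the file name, the UTF-8 charset and the simple regex for blanking the crc attribute value are assumptions; only the attribute value is removed, every other byte stays untouched):
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class IoddCrcCheck {
    public static void main(String[] args) throws Exception {
        String xml = new String(Files.readAllBytes(Paths.get("device.xml")), StandardCharsets.UTF_8);
        // blank only the value of the crc attribute in the stamp tag
        String stripped = xml.replaceFirst("(crc=\")[0-9]+(\")", "$1$2");
        byte[] data = stripped.getBytes(StandardCharsets.UTF_8);
        System.out.println(Crc32.isValid(data, 1195433981L)); // expected CRC from the test IODD
    }
}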
I had the same problem, but I use Python. One comment first: CRCs are made for the detection of changes. If you change only one bit in the input stream, the CRC will be completely different. So you can't say that the calculated crc is close to the right value.
This is my code, which calculates the right crc
prev = 0
for eachLine in open(file,"rb"):
prev = zlib.crc32(eachLine, prev)
crc = prev & 0xFFFFFFFF
At first you clear the crc in your file to an empty string "". Then you have to use this file to calculate the crc. Because the file is processed as binary, don't change anything besides the crc (tabs, spaces, line endings). One space, automatically removed by the IODD_Checker tool, cost me nearly a complete day.
If you calculate a 32-bit crc, there is roughly a 1 in 4294967295 chance that 0 comes out as the valid crc. Because the checker uses this value to mark the crc as invalid, I'm not sure whether there is special handling defined.
if 0 == prev:
crc = 0xFFFFFFFF
else:
crc = prev & 0xFFFFFFFF
This issue I will have to check.

I need to decrypt a file efficiently [closed]

I am trying to decrypt an encrypted file with an unknown key - the only thing I know about it is that the key is an integer x, 0 <= x < 10^10 (i.e. a maximum of 10 decimal digits).
public static String enc(String msg, long key) {
String ans = "";
Random rand = new Random(key);
for (int i = 0; i < msg.length(); i = i + 1) {
char c = msg.charAt(i);
int s = c;
int rd = rand.nextInt() % (256 * 256);
int s2 = s ^ rd;
char c2 = (char) (s2);
ans += c2;
}
return ans;
}
private static String tryToDecode(String string) {
String returnedString = "";
long key;
String msg = reader(string);
for (long i = 0; i <= 999999999; i++) {
System.out.println("decoding message with key + " + i);
key = i;
System.out.println("decoding with key: " + i + "\n" + enc(msg, key));
}
return returnedString;
}
I expect to find the plain text
The program works very slowly, is there any way to make it more efficient?
You can use the parallel array operations added in Java 8, if you are using Java 8, to achieve this.
The best fit for you would be to use a Spliterator:
public void spliterate() {
System.out.println("\nSpliterate:");
int[] src = getData();
Spliterator<Integer> spliterator = Arrays.spliterator(src);
spliterator.forEachRemaining( n -> action(n) );
}
public void action(int value) {
System.out.println("value:"+value);
// Perform some real work on this data here...
}
I am still not clear about your situation. Here are some great tutorials and articles to figure out which of the Java 8 parallel array operations is going to help you:
http://www.drdobbs.com/jvm/parallel-array-operations-in-java-8/240166287
https://blog.rapid7.com/2015/10/28/java-8-introduction-to-parallelism-and-spliterator/
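As a hedged, untested sketch of applying Java 8 parallelism to this particular brute-force search (using LongStream.parallel() rather than a hand-rolled Spliterator; decodesToAscii is a placeholder for whatever decode-and-check routine you end up with):
import java.util.Random;
import java.util.stream.LongStream;

public class ParallelKeySearch {
    // Placeholder check: decode msg with the given key and test whether every character is plain ASCII.
    static boolean decodesToAscii(String msg, long key) {
        Random rand = new Random(key);
        for (int i = 0; i < msg.length(); i++) {
            char decoded = (char) (msg.charAt(i) ^ (rand.nextInt() % (256 * 256)));
            if (decoded > 127) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String msg = "..."; // the encrypted text
        LongStream.rangeClosed(0, 9_999_999_999L)
                  .parallel()
                  .filter(key -> decodesToAscii(msg, key))
                  .forEach(key -> System.out.println("candidate key: " + key));
    }
}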
First things first: You can't println billions of lines. This will take forever, and it's pointless - you won't be able to see the text as it scrolls by, and your buffer won't hold billions of lines, so you couldn't scroll back up later even if you wanted to. If you prefer (and don't mind it being 2-3% slower than it otherwise would be), you can output once every hundred million keys, just so you can verify your program is making progress.
You can optimize things by not concatenating Strings inside the loop. Strings are immutable, so the old code was creating a rather large number of Strings, especially in the enc method. Normally I'd use a StringBuilder, but in this case a simple character array will meet our needs.
And there's one more thing we need to do that your current code doesn't do: Detect when we have the answer. If we assume that the message will only contain characters from 0-127 with no Unicode or extended ASCII, then we know we have a possible answer when the entire message contains only characters in this range. And we can also use this to further optimize, as we can then immediately discard any message that has a character outside of this range. We don't even have to finish decoding it and can move on to the next key. (If the message is of any length, the odds are that only one key will produce a decoded message with characters in that range - but it's not guaranteed, which is why I do not stop when I get to a valid message. You could probably do that, though.)
Due to the way random numbers are generated in Java, anything in the seed above 32 bits is not used by the encoding/decoding algorithm. So you only need to go up to 4294967295 instead of 9999999999. (This also means the key that was originally used to encode the message might not be the key this program uses to decode it, since 2-3 keys in the 10 digit range will produce the same encoding/decoding.)
private static String tryToDecode4(String msg) {
String returnedString = "";
for (long i=0; i<=4294967295l; i++)
{
if (i % 100000000 == 0) // This part is just to see that it's making progress. Remove if desired for a small speed gain.
System.out.println("Trying " + i);
char[] decoded = enc4(msg, i);
if (decoded == null)
continue;
returnedString = String.valueOf(decoded);
System.out.println("decoding with key: " + i + " " + returnedString);
}
return returnedString;
}
private static char[] enc4(String msg, long key) {
char[] ansC = new char[msg.length()];
Random rand = new Random(key);
for(int i=0;i<msg.length();i=i+1)
{
char c = msg.charAt(i);
int s = c;
int rd = rand.nextInt()%(256*256);
int s2 = s^rd;
char c2 = (char)(s2);
if (c2 > 127)
return null;
ansC[i] = c2;
}
return ansC;
}
This code finished running in a little over 3 minutes on my machine, with a message of "Hello World".
This code will not work well for very short messages (3-4 characters or less.) It will not work if the message contains Unicode or extended ASCII, although it could easily be modified to do so if you know the range of characters that might be in the message.

Convert python's struct.unpack code to java

I am trying to integrate the Beuerer BF480 device with a Java program. I found Python code which converts the data received through the serial USB interface into the required format. Below is the code snippet which does the job in Python:
frmt = "!" + "H"*64
x = struct.unpack(frmt, byte_array)
Could someone help me in understanding these 2 lines of code? If anyone knows java equivalent of this, it would be great to know.
I hate to answer my own question.
But, as I got it working for now, I would like to share the solution to this problem. Basically the Python code converts the byte array into 64 unsigned 16-bit integers. In the expression "!" + "H"*64, the "!" means network byte order (big-endian, i.e. start reading the byte array from the left) and "H"*64 means 64 unsigned shorts of 2 bytes each.
I did not find the equivalent code in java which will do this job, but after struggling to decode the byte array I am able to get the intended results using below code.
public static int[] unpack(final byte[] byte_array) {
final int[] integerReadings = new int[byte_array.length / 2];
for(int counter = 0, integerCounter = 0; counter < byte_array.length;) {
integerReadings[integerCounter] = convertTwoBytesToInteger(byte_array[counter], byte_array[counter + 1]);
counter += 2;
integerCounter++;
}
return integerReadings;
}
private static int convertTwoBytesToInteger(final byte byte1, final byte byte2) {
final int unsignedInteger1 = getUnsignedInteger(byte1);
final int unsignedInteger2 = getUnsignedInteger(byte2);
return unsignedInteger1 * 256 + unsignedInteger2;
}
private static int getUnsignedInteger(final byte b) {
int unsignedInteger = b;
if(b < 0) {
unsignedInteger = b + 256;
}
return unsignedInteger;
}
This code does byte-level manipulation of the data stream. However, it solves my problem for now and I am getting the expected results. If there is a better solution to this, I would definitely like to implement it.
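A possibly tidier alternative (not part of the original post) is java.nio.ByteBuffer, which directly handles the big-endian, unsigned-16-bit interpretation that "!" + "H"*64 describes; a minimal sketch:
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public static int[] unpackUnsignedShorts(final byte[] byteArray) {
    final ByteBuffer buffer = ByteBuffer.wrap(byteArray).order(ByteOrder.BIG_ENDIAN);
    final int[] values = new int[byteArray.length / 2];
    for (int i = 0; i < values.length; i++) {
        values[i] = buffer.getShort() & 0xFFFF; // read 16 bits, mask to get the unsigned value
    }
    return values;
}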

Check whether a string is parsable into Long without try-catch?

Long.parseLong("string") throws an exception if the string is not parsable into a long.
Is there a way to validate the string faster than using try-catch?
Thanks
You can create a rather complex regular expression, but it isn't worth it. Using exceptions here is absolutely normal.
It's a natural exceptional situation: you assume that there is an integer in the string, but indeed there is something else. An exception should be thrown and handled properly.
If you look inside the parseLong code, you'll see that there are many different verifications and operations. If you want to do all that stuff before parsing, it will decrease performance (if we are talking about parsing millions of numbers, because otherwise it doesn't matter). So, the only thing you can do if you really need to improve performance by avoiding exceptions is: copy the parseLong implementation into your own function and return a sentinel value (long has no NaN, so for example null from a Long-returning method) instead of throwing exceptions in all corresponding cases.
From commons-lang StringUtils:
public static boolean isNumeric(String str) {
if (str == null) {
return false;
}
int sz = str.length();
for (int i = 0; i < sz; i++) {
if (Character.isDigit(str.charAt(i)) == false) {
return false;
}
}
return true;
}
You could do something like
if(s.matches("\\d*")){
}
Using a regular expression to check if String s consists only of digits.
But what do you stand to gain? Another if condition?
org.apache.commons.lang3.math.NumberUtils.isParsable(yourString) will determine if the string can be parsed by one of: Integer.parseInt(String), Long.parseLong(String), Float.parseFloat(String) or Double.parseDouble(String)
Since you are interested in longs, you could have a condition that checks isParsable and that the string doesn't contain a decimal point:
if (NumberUtils.isParsable(yourString) && !StringUtils.contains(yourString,".")){ ...
This is a valid question because there are times when you need to infer what type of data is being represented in a string. For example, you may need to import a large CSV into a database and represent the data types accurately. In such cases, calling Long.parseLong and catching an exception can be too slow.
The following code only handles ASCII decimal:
public class LongParser {
// Since tryParseLong represents the value as negative during processing, we
// counter-intuitively want to keep the sign if the result is negative and
// negate it if it is positive.
private static final int MULTIPLIER_FOR_NEGATIVE_RESULT = 1;
private static final int MULTIPLIER_FOR_POSITIVE_RESULT = -1;
private static final int FIRST_CHARACTER_POSITION = 0;
private static final int SECOND_CHARACTER_POSITION = 1;
private static final char NEGATIVE_SIGN_CHARACTER = '-';
private static final char POSITIVE_SIGN_CHARACTER = '+';
private static final int DIGIT_MAX_VALUE = 9;
private static final int DIGIT_MIN_VALUE = 0;
private static final char ZERO_CHARACTER = '0';
private static final int RADIX = 10;
/**
* Parses a string representation of a long significantly faster than
* <code>Long.ParseLong</code>, and avoids the noteworthy overhead of
* throwing an exception on failure. Based on the parseInt code from
* http://nadeausoftware.com/articles/2009/08/java_tip_how_parse_integers_quickly
*
* @param stringToParse
* The string to try to parse as a <code>long</code>.
*
* @return the boxed <code>long</code> value if the string was a valid
* representation of a long; otherwise <code>null</code>.
*/
public static Long tryParseLong(final String stringToParse) {
if (stringToParse == null || stringToParse.isEmpty()) {
return null;
}
final int inputStringLength = stringToParse.length();
long value = 0;
/*
* The absolute value of Long.MIN_VALUE is greater than the absolute
* value of Long.MAX_VALUE, so during processing we'll use a negative
* value, then we'll multiply it by signMultiplier before returning it.
* This allows us to avoid a conditional add/subtract inside the loop.
*/
int signMultiplier = MULTIPLIER_FOR_POSITIVE_RESULT;
// Get the first character.
char firstCharacter = stringToParse.charAt(FIRST_CHARACTER_POSITION);
if (firstCharacter == NEGATIVE_SIGN_CHARACTER) {
// The first character is a negative sign.
if (inputStringLength == 1) {
// There are no digits.
// The string is not a valid representation of a long value.
return null;
}
signMultiplier = MULTIPLIER_FOR_NEGATIVE_RESULT;
} else if (firstCharacter == POSITIVE_SIGN_CHARACTER) {
// The first character is a positive sign.
if (inputStringLength == 1) {
// There are no digits.
// The string is not a valid representation of a long value.
return null;
}
} else {
// Store the (negative) digit (although we aren't sure yet if it's
// actually a digit).
value = -(firstCharacter - ZERO_CHARACTER);
if (value > DIGIT_MIN_VALUE || value < -DIGIT_MAX_VALUE) {
// The first character is not a digit (or a negative sign).
// The string is not a valid representation of a long value.
return null;
}
}
// Establish the "maximum" value (actually minimum since we're working
// with negatives).
final long rangeLimit = (signMultiplier == MULTIPLIER_FOR_POSITIVE_RESULT)
? -Long.MAX_VALUE
: Long.MIN_VALUE;
// Capture the maximum value that we can multiply by the radix without
// overflowing.
final long maxLongNegatedPriorToMultiplyingByRadix = rangeLimit / RADIX;
for (int currentCharacterPosition = SECOND_CHARACTER_POSITION;
currentCharacterPosition < inputStringLength;
currentCharacterPosition++) {
// Get the current digit (although we aren't sure yet if it's
// actually a digit).
long digit = stringToParse.charAt(currentCharacterPosition)
- ZERO_CHARACTER;
if (digit < DIGIT_MIN_VALUE || digit > DIGIT_MAX_VALUE) {
// The current character is not a digit.
// The string is not a valid representation of a long value.
return null;
}
if (value < maxLongNegatedPriorToMultiplyingByRadix) {
// The value will be out of range if we multiply by the radix.
// The string is not a valid representation of a long value.
return null;
}
// Multiply by the radix to slide all the previously parsed digits.
value *= RADIX;
if (value < (rangeLimit + digit)) {
// The value would be out of range if we "added" the current
// digit.
return null;
}
// "Add" the digit to the value.
value -= digit;
}
// Return the value (adjusting the sign if needed).
return value * signMultiplier;
}
}
You can use java.util.Scanner
Scanner sc = new Scanner(s);
if (sc.hasNextLong()) {
long num = sc.nextLong();
}
This does range checking etc, too. Of course it will say that "99 bottles of beer" hasNextLong(), so if you want to make sure that it only has a long you'd have to do extra checks.
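For example, one way to do that extra check (a small sketch):
Scanner sc = new Scanner(s);
if (sc.hasNextLong()) {
    long num = sc.nextLong();
    if (!sc.hasNext()) {
        // s contained exactly one long and nothing else
    }
}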
This case is common for forms and programs where you have an input field and are not sure if the string is a valid number. So using try/catch with the Java function is the best thing to do, if you understand how try/catch works compared to writing the function yourself. Setting up the try-catch block in the .NET virtual machine costs zero instructions of overhead, and it is probably the same in Java. If there are instructions used at the try keyword, they will be minimal; the bulk of the instructions is used at the catch part, and that only happens in the rare case when the number is not valid.
So while it "seems" like you can write a faster function yourself, you would have to optimize it better than the Java compiler in order to beat the try/catch mechanism you already use, and the benefit of a more optimized function is going to be very minimal since number parsing is quite generic.
If you run timing tests with your compiler and the Java catch mechanism you already described, you will probably not notice anything beyond a marginal slowdown, and by marginal I mean it should be almost nothing.
Read the Java language specification to understand exceptions better, and you will see that using such a technique in this case is perfectly acceptable, since it wraps a fairly large and complex function. Adding those few extra instructions in the CPU for the try part is not going to be a big deal.
I think that's the only way of checking if a String is a valid long value, but you can implement such a method yourself, keeping the biggest long value in mind.
There are much faster ways to parse a long than Long.parseLong. If you want to see an example of a method that is not optimized then you should look at parseLong :)
Do you really need to take into account "digits" that are non-ASCII?
Do you really need to make several method calls passing around a radix even though you're probably parsing base 10?
:)
Using a regexp is not the way to go: it's hard to determine if your number is too big for a long. How do you use a regexp to determine that 9223372036854775807 can be parsed to a long but that 9223372036854775907 cannot?
That said, the answer to a really fast long-parsing method is a state machine, whether you want to test if the string is parseable or to parse it. Simply, it's not a generic state machine accepting complex regexps, but a hardcoded one.
I can write you both a method that parses a long and another one that determines if a long can be parsed, each totally outperforming Long.parseLong().
Now what do you want? A state-testing method? In that case, a state-testing method may not be desirable if you want to avoid computing the long twice.
Simply wrap your call in a try/catch.
And if you really want something faster than the default Long.parseLong, write one that is tailored to your problem: base 10 if you're base 10, not checking digits outside ASCII (because you're probably not interested in Japanese's itchi-ni-yon-go etc.).
Hope this helps with the positive values. I used this method once for validating database primary keys.
private static final int MAX_LONG_STR_LEN = Long.toString(Long.MAX_VALUE).length();
public static boolean validId(final CharSequence id)
{
//avoid null
if (id == null)
{
return false;
}
int len = id.length();
//avoid empty or oversize
if (len < 1 || len > MAX_LONG_STR_LEN)
{
return false;
}
long result = 0;
// ASCII '0' at position 48
int digit = id.charAt(0) - 48;
//first char cannot be '0' in my "id" case
if (digit < 1 || digit > 9)
{
return false;
}
else
{
result += digit;
}
//start from 1, we already did the 0.
for (int i = 1; i < len; i++)
{
// ASCII '0' at position 48
digit = id.charAt(i) - 48;
//only numbers
if (digit < 0 || digit > 9)
{
return false;
}
result *= 10;
result += digit;
//if we hit 0x7fffffffffffffff
// we are at 0x8000000000000000 + digit - 1
// so negative
if (result < 0)
{
//overflow
return false;
}
}
return true;
}
Try to use this regular expression:
^(-9223372036854775808|0)$|^((-?)((?!0)\d{1,18}|[1-8]\d{18}|9[0-1]\d{17}|92[0-1]\d{16}|922[0-2]\d{15}|9223[0-2]\d{14}|92233[0-6]\d{13}|922337[0-1]\d{12}|92233720[0-2]\d{10}|922337203[0-5]\d{9}|9223372036[0-7]\d{8}|92233720368[0-4]\d{7}|922337203685[0-3]\d{6}|9223372036854[0-6]\d{5}|92233720368547[0-6]\d{4}|922337203685477[0-4]\d{3}|9223372036854775[0-7]\d{2}|922337203685477580[0-7]))$
It checks all possible numbers for Long.
But as you know, in Java a long literal can contain additional symbols like +, L, _, etc., and this regexp doesn't validate such values. If this regexp is not enough for you, you can add additional restrictions to it.
Guava Longs.tryParse("string") returns null instead of throwing an exception if parsing fails. But this method is marked as Beta right now.
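For example (a minimal sketch):
Long value = Longs.tryParse(str); // com.google.common.primitives.Longs
if (value != null) {
    // str is a valid long
}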
You could try using a regular expression to check the form of the string before trying to parse it?
A simple implementation to validate an integer that fits in a long would be:
public static boolean isValidLong(String str) {
if( str==null ) return false;
int len = str.length();
if (str.charAt(0) == '+') {
return str.matches("\\+\\d+") && (len < 20 || len == 20 && str.compareTo("+9223372036854775807") <= 0);
} else if (str.charAt(0) == '-') {
return str.matches("-\\d+") && (len < 20 || len == 20 && str.compareTo("-9223372036854775808") <= 0);
} else {
return str.matches("\\d+") && (len < 19 || len == 19 && str.compareTo("9223372036854775807") <= 0);
}
}
It doesn't handle octal, 0x prefix or so but that is seldom a requirement.
For speed, the ".match" expressions are easy to code in a loop.

Efficient way to search a stream for a string

Let's suppose that I have a stream of text (or a Reader in Java) that I'd like to check for a particular string. The stream of text might be very large, so as soon as the search string is found I'd like to return true, and I'd also like to avoid storing the entire input in memory.
Naively, I might try to do something like this (in Java):
public boolean streamContainsString(Reader reader, String searchString) throws IOException {
char[] buffer = new char[1024];
int numCharsRead;
while((numCharsRead = reader.read(buffer)) > 0) {
if ((new String(buffer, 0, numCharsRead)).indexOf(searchString) >= 0)
return true;
}
return false;
}
Of course this fails to detect the given search string if it occurs on the boundary of the 1k buffer:
Search text: "stackoverflow"
Stream buffer 1: "abc.........stack"
Stream buffer 2: "overflow.......xyz"
How can I modify this code so that it correctly finds the given search string across the boundary of the buffer but without loading the entire stream into memory?
Edit: Note when searching a stream for a string, we're trying to minimise the number of reads from the stream (to avoid latency in a network/disk) and to keep memory usage constant regardless of the amount of data in the stream. Actual efficiency of the string matching algorithm is secondary but obviously, it would be nice to find a solution that used one of the more efficient of those algorithms.
There are three good solutions here:
If you want something that is easy and reasonably fast, go with no buffer, and instead implement a simple nondeterministic finite-state machine. Your state will be a list of indices into the string you are searching for, and your logic looks something like this (pseudocode):
String needle;
n = needle.length();
for every input character c do
add index 0 to the list
for every index i in the list do
if c == needle[i] then
if i + 1 == n then
return true
else
replace i in the list with i + 1
end
else
remove i from the list
end
end
end
This will find the string if it exists and you will never need a
buffer.
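For concreteness, a rough, untested Java rendering of that index-list idea (method and variable names are mine; imports from java.io and java.util assumed):
public static boolean streamContainsString(Reader reader, String needle) throws IOException {
    int n = needle.length();
    if (n == 0) return true;
    List<Integer> active = new LinkedList<Integer>(); // indices of live partial matches
    int ch;
    while ((ch = reader.read()) != -1) {              // wrap the Reader in a BufferedReader in practice
        char c = (char) ch;
        active.add(0);                                // this character might start a new match
        List<Integer> next = new LinkedList<Integer>();
        for (int i : active) {
            if (c == needle.charAt(i)) {
                if (i + 1 == n) return true;          // the whole needle matched
                next.add(i + 1);                      // the partial match survives, advance it
            }                                         // on mismatch the partial match is dropped
        }
        active = next;
    }
    return false;
}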
Slightly more work but also faster: do an NFA-to-DFA conversion that figures out in advance what lists of indices are possible, and assign each one to a small integer. (If you read about string search on Wikipedia, this is called the powerset construction.) Then you have a single state and you make a state-to-state transition on each incoming character. The NFA you want is just the DFA for the string preceded with a state that nondeterministically either drops a character or tries to consume the current character. You'll want an explicit error state as well.
If you want something faster, create a buffer whose size is at least twice n, and use Boyer-Moore to compile a state machine from the needle. You'll have a lot of extra hassle because Boyer-Moore is not trivial to implement (although you'll find code online) and because you'll have to arrange to slide the string through the buffer. You'll have to build or find a circular buffer that can 'slide' without copying; otherwise you're likely to give back any performance gains you might get from Boyer-Moore.
I made a few changes to the Knuth-Morris-Pratt algorithm for partial searches. Since the actual comparison position is always less than or equal to the next one, there is no need for extra memory. The code, with a Makefile, is also available on GitHub, and it is written in Haxe to target multiple programming languages at once, including Java.
I also wrote a related article: searching for substrings in streams: a slight modification of the Knuth-Morris-Pratt algorithm in Haxe. The article mentions the Jakarta RegExp, now retired and resting in the Apache Attic. The Jakarta Regexp library “match” method in the RE class uses a CharacterIterator as a parameter.
class StreamOrientedKnuthMorrisPratt {
var m: Int;
var i: Int;
var ss: String;
var table: Array<Int>;
public function new(ss: String) {
this.ss = ss;
this.buildTable(this.ss);
}
public function begin() : Void {
this.m = 0;
this.i = 0;
}
public function partialSearch(s: String) : Int {
var offset = this.m + this.i;
while(this.m + this.i - offset < s.length) {
if(this.ss.substr(this.i, 1) == s.substr(this.m + this.i - offset,1)) {
if(this.i == this.ss.length - 1) {
return this.m;
}
this.i += 1;
} else {
this.m += this.i - this.table[this.i];
if(this.table[this.i] > -1)
this.i = this.table[this.i];
else
this.i = 0;
}
}
return -1;
}
private function buildTable(ss: String) : Void {
var pos = 2;
var cnd = 0;
this.table = new Array<Int>();
if(ss.length > 2)
this.table.insert(ss.length, 0);
else
this.table.insert(2, 0);
this.table[0] = -1;
this.table[1] = 0;
while(pos < ss.length) {
if(ss.substr(pos-1,1) == ss.substr(cnd, 1))
{
cnd += 1;
this.table[pos] = cnd;
pos += 1;
} else if(cnd > 0) {
cnd = this.table[cnd];
} else {
this.table[pos] = 0;
pos += 1;
}
}
}
public static function main() {
var KMP = new StreamOrientedKnuthMorrisPratt("aa");
KMP.begin();
trace(KMP.partialSearch("ccaabb"));
KMP.begin();
trace(KMP.partialSearch("ccarbb"));
trace(KMP.partialSearch("fgaabb"));
}
}
The Knuth-Morris-Pratt search algorithm never backs up; this is just the property you want for your stream search. I've used it before for this problem, though there may be easier ways using available Java libraries. (When this came up for me I was working in C in the 90s.)
KMP in essence is a fast way to build a string-matching DFA, like Norman Ramsey's suggestion #2.
This answer applied to the initial version of the question where the key was to read the stream only as far as necessary to match on a String, if that String was present. This solution would not meet the requirement to guarantee fixed memory utilisation, but may be worth considering if you have found this question and are not bound by that constraint.
If you are bound by the constant memory usage constraint, Java stores arrays of any type on the heap, and as such nulling the reference does not deallocate memory in any way; I think any solution involving arrays in a loop will consume memory on the heap and require GC.
For a simple implementation, maybe Java 5's Scanner, which can accept an InputStream and use a java.util.regex.Pattern to search the input, might save you worrying about the implementation details.
Here's an example of a potential implementation:
public boolean streamContainsString(Reader reader, String searchString)
throws IOException {
Scanner streamScanner = new Scanner(reader);
if (streamScanner.findWithinHorizon(searchString, 0) != null) {
return true;
} else {
return false;
}
}
I'm thinking regex because it sounds like a job for a Finite State Automaton, something that starts in an initial state, changing state character by character until it either rejects the string (no match) or gets to an accept state.
I think this is probably the most efficient matching logic you could use, and how you organize the reading of the information can be divorced from the matching logic for performance tuning.
It's also how regexes work.
Instead of having your buffer be an array, use an abstraction that implements a circular buffer. Your index calculation will be buf[(next+i) % sizeof(buf)], and you'll have to be careful to fill the buffer one-half at a time. But as long as the search string fits in half the buffer, you'll find it.
I believe the best solution to this problem is to try to keep it simple. Remember, because I'm reading from a stream, I want to keep the number of reads from the stream to a minimum (as network or disk latency may be an issue) while keeping the amount of memory used constant (as the stream may be very large in size). Actual efficiency of the string matching is not the number one goal (as that has been studied to death already).
Based on AlbertoPL's suggestion, here's a simple solution that compares the buffer against the search string character by character. The key is that because the search is only done one character at a time, no back tracking is needed and therefore no circular buffers, or buffers of a particular size are needed.
Now, if someone can come up with a similar implementation based on Knuth-Morris-Pratt search algorithm then we'd have a nice efficient solution ;)
public boolean streamContainsString(Reader reader, String searchString) throws IOException {
    char[] buffer = new char[1024];
    int numCharsRead;
    int count = 0;
    while ((numCharsRead = reader.read(buffer)) > 0) {
        for (int c = 0; c < numCharsRead; c++) {
            if (buffer[c] == searchString.charAt(count)) {
                count++;
            } else {
                // on a mismatch, re-check the current character against the start of the search string
                count = (buffer[c] == searchString.charAt(0)) ? 1 : 0;
                // note: this still misses matches when the search string overlaps itself
                // (e.g. "aab" in "aaab"); the KMP variant sketched below handles that case
            }
            if (count == searchString.length()) return true;
        }
    }
    return false;
}
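Along those lines, here is a hedged, untested sketch of the KMP-based variant asked for above: the failure table lets the counter fall back instead of resetting to zero, so self-overlapping search strings (e.g. finding "aab" in "aaab") are handled while still reading the stream buffer by buffer:
public boolean streamContainsStringKMP(Reader reader, String searchString) throws IOException {
    if (searchString.isEmpty()) return true;
    int[] fail = buildFailureTable(searchString);
    char[] buffer = new char[1024];
    int numCharsRead;
    int count = 0; // length of the current partial match
    while ((numCharsRead = reader.read(buffer)) > 0) {
        for (int c = 0; c < numCharsRead; c++) {
            while (count > 0 && buffer[c] != searchString.charAt(count))
                count = fail[count - 1]; // fall back instead of restarting at 0
            if (buffer[c] == searchString.charAt(count))
                count++;
            if (count == searchString.length()) return true;
        }
    }
    return false;
}

// fail[i] = length of the longest proper prefix of the search string that is also a suffix of its first i+1 characters
private int[] buildFailureTable(String s) {
    int[] fail = new int[s.length()];
    int k = 0;
    for (int i = 1; i < s.length(); i++) {
        while (k > 0 && s.charAt(i) != s.charAt(k)) k = fail[k - 1];
        if (s.charAt(i) == s.charAt(k)) k++;
        fail[i] = k;
    }
    return fail;
}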
If you're not tied to using a Reader, then you can use Java's NIO API to efficiently load the file. For example (untested, but should be close to working):
public boolean streamContainsString(File input, String searchString) throws IOException {
Pattern pattern = Pattern.compile(Pattern.quote(searchString));
FileInputStream fis = new FileInputStream(input);
FileChannel fc = fis.getChannel();
int sz = (int) fc.size();
MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, sz);
CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
CharBuffer cb = decoder.decode(bb);
Matcher matcher = pattern.matcher(cb);
return matcher.find();
}
This basically mmap()'s the file to search and relies on the operating system to do the right thing regarding cache and memory usage. Note however that map() is more expensive than just reading the file into a large buffer for files less than around 10 KiB.
A very fast searching of a stream is implemented in the RingBuffer class from the Ujorm framework. See the sample:
Reader reader = RingBuffer.createReader("xxx ${abc} ${def} zzz");
String word1 = RingBuffer.findWord(reader, "${", "}");
assertEquals("abc", word1);
String word2 = RingBuffer.findWord(reader, "${", "}");
assertEquals("def", word2);
String word3 = RingBuffer.findWord(reader, "${", "}");
assertEquals("", word3);
The single-class implementation is available on SourceForge; for more information see the link.
Implement a sliding window. Have your buffer around, move all elements in the buffer one forward and enter a single new character in the buffer at the end. If the buffer is equal to your searched word, it is contained.
Of course, if you want to make this more efficient, you can look at a way to prevent moving all elements in the buffer around, for example by having a cyclic buffer and a representation of the strings which 'cycles' the same way the buffer does, so you only need to check for content-equality. This saves moving all elements in the buffer.
I think you need to buffer a small amount at the boundary between buffers.
For example, if your buffer size is 1024 and the length of the search string is 10, then as well as searching each 1024-byte buffer you also need to search each 18-byte transition between two buffers (9 bytes from the end of the previous buffer concatenated with 9 bytes from the start of the next buffer); a sketch follows below.
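A hedged, untested sketch of that boundary-window idea (it assumes the search string is shorter than the buffer, so a match can span at most two consecutive buffers):
public static boolean streamContainsStringOverlap(Reader reader, String s) throws IOException {
    char[] buffer = new char[1024];
    String tail = ""; // last s.length()-1 chars of the previous buffer
    int n;
    while ((n = reader.read(buffer)) > 0) {
        String chunk = new String(buffer, 0, n);
        if (chunk.indexOf(s) >= 0) return true; // match entirely inside this buffer
        String boundary = tail + chunk.substring(0, Math.min(n, s.length() - 1));
        if (boundary.indexOf(s) >= 0) return true; // match across the buffer boundary
        tail = chunk.substring(Math.max(0, n - (s.length() - 1)));
    }
    return false;
}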
I'd say switch to a character-by-character solution: scan for the first character of your target text, and when you find that character, increment a counter and look for the next character. Every time you don't find the next consecutive character, restart the counter. It would work like this:
public boolean streamContainsString(Reader reader, String searchString) throws IOException {
char[] buffer = new char[1024];
int numCharsRead;
int count = 0;
while((numCharsRead = reader.read(buffer)) > 0) {
if (buffer[numCharsRead - 1] == searchString.charAt(count)) // note: this only examines the last character of each read; see the corrected per-character loop in the answer above
count++;
else
count = 0;
if (count == searchString.length())
return true;
}
return false;
}
The only problem is when you're in the middle of looking through characters... in which case there needs to be a way of remembering your count variable. I don't see an easy way of doing so except as a private variable for the whole class. In which case you would not instantiate count inside this method.
You might be able to implement a very fast solution using Fast Fourier Transforms, which, if implemented properly, allow you to do string matching in time O(n log(m)), where n is the length of the longer string to be matched and m is the length of the shorter string. You could, for example, perform an FFT as soon as you receive a stream input of length m; if it matches, you can return, and if it doesn't match, you can throw away the first character in the stream input, wait for a new character to appear through the stream, and then perform the FFT again.
You can increase the speed of the search for very large strings by using a dedicated string-search algorithm.
If you're looking for a constant substring rather than a regex, I'd recommend Boyer-Moore. There's plenty of source code on the internet.
Also, use a circular buffer, to avoid thinking too hard about buffer boundaries.
Mike.
I also had a similar problem: skip bytes from the InputStream until a specified string (or byte array) is found. This is simple code based on a circular buffer. It is not very efficient, but it works for my needs:
private static boolean matches(int[] buffer, int offset, byte[] search) {
final int len = buffer.length;
for (int i = 0; i < len; ++i) {
if (search[i] != buffer[(offset + i) % len]) {
return false;
}
}
return true;
}
public static void skipBytes(InputStream stream, byte[] search) throws IOException {
final int[] buffer = new int[search.length];
for (int i = 0; i < search.length; ++i) {
buffer[i] = stream.read();
}
int offset = 0;
while (true) {
if (matches(buffer, offset, search)) {
break;
}
buffer[offset] = stream.read();
offset = (offset + 1) % buffer.length;
}
}
Here is my implementation:
static boolean containsKeywordInStream( Reader ir, String keyword, int bufferSize ) throws IOException{
SlidingContainsBuffer sb = new SlidingContainsBuffer( keyword );
char[] buffer = new char[ bufferSize ];
int read;
while( ( read = ir.read( buffer ) ) != -1 ){
if( sb.checkIfContains( buffer, read ) ){
return true;
}
}
return false;
}
SlidingContainsBuffer class:
class SlidingContainsBuffer{
private final char[] keyword;
private int keywordIndexToCheck = 0;
private boolean keywordFound = false;
SlidingContainsBuffer( String keyword ){
this.keyword = keyword.toCharArray();
}
boolean checkIfContains( char[] buffer, int read ){
for( int i = 0; i < read; i++ ){
if( keywordFound == false ){
if( keyword[ keywordIndexToCheck ] == buffer[ i ] ){
keywordIndexToCheck++;
if( keywordIndexToCheck == keyword.length ){
keywordFound = true;
}
} else {
keywordIndexToCheck = 0;
}
} else {
break;
}
}
return keywordFound;
}
}
This answer fully satisfies the task:
The implementation is able to find the searched keyword even if it is split between buffers
Minimum memory usage, defined by the buffer size
The number of reads is minimized by using a bigger buffer
