Currently I am trying to implement the CRC32 check algorithm from IO-Link for IODD files in Java [1]. In IO-Link each sensor is described by an IODD XML file [2]. The file contains a stamp tag with a crc code. My problem is that the algorithm uses only unsigned integers, and Java does not have such a type. Is there still a way to implement it?
I also used the JetBrains dotPeek program to decompile the IODDChecker (latest version). The CRC logic is in the file clsCRC.cs. At the end of that file I found the method "CRC32", which calculates the CRC. The problem is that it uses different integers (unsigned integers) for the lookup table. An interesting fact is that the algorithm in the IODD checker (crc32_octetwise) differs slightly from the one described in the specification.
Hope you can help me :)
What I tested:
Java CRC32 from zip
Guava UnsignedInteger
BigInteger
long instead of int
Procedure:
The program reads the IODD as a String/byte array
Remove the CRC value of the stamp tag from the String/byte array
Calculate the CRC32 of the String/byte array
In the test IODD you can find the CRC 1195433981, which should be equal to the output of the algorithm
Algorithms:
There are two algorithms. The first one is bitwise; the second one is octetwise and uses a lookup table, which you can find here [2].
uint32_t crc32_bitwise(char *data, size_t length, uint32_t previousCrc32 = 1)
{
uint32_t crc = ~previousCrc32;
int j;
const uint8_t* current = (const uint8_t*) data;
while (length-- > 0)
{
crc ^= *current++;
for (j = 0; j < 8; j++)
{
crc = (crc >> 1) ^ (-(int32_t)(crc & 1) & 0xEB31D82E);
}
}
return ~crc;
}
uint32_t crc32_octetwise(uint8_t const * current, uint32_t length, uint32_t previousCrc32 = 1, const uint32_t crc32Lookup[256])
{
uint32_t crc = ~previousCrc32;
while (length-- > 0)
crc = (crc >> 8) ^ crc32Lookup[(crc & 0xFF) ^ *current++];
return ~crc;
}
[1] https://io-link.com/share/Downloads/Profiles/IOL-Profile_Firmware-Update_V11_10082_Sep19.zip
[2] https://ioddfinder.io-link.com/productvariants/search/11765
[3] Yet another Java CRC32 implementation related issue
Update 1
I forgot to add that I currently receive 1120211745 as CRC Code but 1195433981 is the correct code.
Update 2
I implemented the following code. The big arrays are the lookup tables. The first int array is the table from [1] and the second one is the array from the decompiled source. In the original the numbers were marked as unsigned (e.g. 1996959894U).
private static final int[] numArr1 = new int[] {
0x00000000, 0x9695C4CA, 0xFB4839C9, 0x6DDDFD03, 0x20F3C3CF, 0xB6660705, 0xDBBBFA06, 0x4D2E3ECC, 0x41E7879E, 0xD7724354, 0xBAAFBE57, 0x2C3A7A9D, 0x61144451, 0xF781809B, 0x9A5C7D98, 0x0CC9B952, 0x83CF0F3C, 0x155ACBF6, 0x788736F5, 0xEE12F23F, 0xA33CCCF3, 0x35A90839, 0x5874F53A, 0xCEE131F0, 0xC22888A2, 0x54BD4C68, 0x3960B16B, 0xAFF575A1, 0xE2DB4B6D, 0x744E8FA7, 0x199372A4, 0x8F06B66E, 0xD1FDAE25, 0x47686AEF, 0x2AB597EC, 0xBC205326, 0xF10E6DEA, 0x679BA920, 0x0A465423, 0x9CD390E9, 0x901A29BB, 0x068FED71, 0x6B521072, 0xFDC7D4B8, 0xB0E9EA74, 0x267C2EBE, 0x4BA1D3BD, 0xDD341777, 0x5232A119, 0xC4A765D3, 0xA97A98D0, 0x3FEF5C1A, 0x72C162D6, 0xE454A61C, 0x89895B1F, 0x1F1C9FD5, 0x13D52687, 0x8540E24D, 0xE89D1F4E, 0x7E08DB84, 0x3326E548, 0xA5B32182, 0xC86EDC81, 0x5EFB184B, 0x7598EC17, 0xE30D28DD, 0x8ED0D5DE, 0x18451114, 0x556B2FD8, 0xC3FEEB12, 0xAE231611, 0x38B6D2DB, 0x347F6B89, 0xA2EAAF43, 0xCF375240, 0x59A2968A, 0x148CA846, 0x82196C8C, 0xEFC4918F, 0x79515545, 0xF657E32B, 0x60C227E1, 0x0D1FDAE2, 0x9B8A1E28, 0xD6A420E4, 0x4031E42E, 0x2DEC192D, 0xBB79DDE7, 0xB7B064B5, 0x2125A07F, 0x4CF85D7C, 0xDA6D99B6, 0x9743A77A, 0x01D663B0, 0x6C0B9EB3, 0xFA9E5A79, 0xA4654232, 0x32F086F8, 0x5F2D7BFB, 0xC9B8BF31, 0x849681FD, 0x12034537, 0x7FDEB834, 0xE94B7CFE, 0xE582C5AC, 0x73170166, 0x1ECAFC65, 0x885F38AF, 0xC5710663, 0x53E4C2A9, 0x3E393FAA, 0xA8ACFB60, 0x27AA4D0E, 0xB13F89C4, 0xDCE274C7, 0x4A77B00D, 0x07598EC1, 0x91CC4A0B, 0xFC11B708, 0x6A8473C2, 0x664DCA90, 0xF0D80E5A, 0x9D05F359, 0x0B903793, 0x46BE095F, 0xD02BCD95, 0xBDF63096, 0x2B63F45C, 0xEB31D82E, 0x7DA41CE4, 0x1079E1E7, 0x86EC252D, 0xCBC21BE1, 0x5D57DF2B, 0x308A2228, 0xA61FE6E2, 0xAAD65FB0, 0x3C439B7A, 0x519E6679, 0xC70BA2B3, 0x8A259C7F, 0x1CB058B5, 0x716DA5B6, 0xE7F8617C, 0x68FED712, 0xFE6B13D8, 0x93B6EEDB, 0x05232A11, 0x480D14DD, 0xDE98D017, 0xB3452D14, 0x25D0E9DE, 0x2919508C, 0xBF8C9446, 0xD2516945, 0x44C4AD8F, 0x09EA9343, 0x9F7F5789, 0xF2A2AA8A, 0x64376E40, 0x3ACC760B, 0xAC59B2C1, 0xC1844FC2, 0x57118B08, 0x1A3FB5C4, 0x8CAA710E, 0xE1778C0D, 0x77E248C7, 0x7B2BF195, 0xEDBE355F, 0x8063C85C, 0x16F60C96, 0x5BD8325A, 0xCD4DF690, 0xA0900B93, 0x3605CF59, 0xB9037937, 0x2F96BDFD, 0x424B40FE, 0xD4DE8434, 0x99F0BAF8, 0x0F657E32, 0x62B88331, 0xF42D47FB, 0xF8E4FEA9, 0x6E713A63, 0x03ACC760, 0x953903AA, 0xD8173D66, 0x4E82F9AC, 0x235F04AF, 0xB5CAC065, 0x9EA93439, 0x083CF0F3, 0x65E10DF0, 0xF374C93A, 0xBE5AF7F6, 0x28CF333C, 0x4512CE3F, 0xD3870AF5, 0xDF4EB3A7, 0x49DB776D, 0x24068A6E, 0xB2934EA4, 0xFFBD7068, 0x6928B4A2, 0x04F549A1, 0x92608D6B, 0x1D663B05, 0x8BF3FFCF, 0xE62E02CC, 0x70BBC606, 0x3D95F8CA, 0xAB003C00, 0xC6DDC103, 0x504805C9, 0x5C81BC9B, 0xCA147851, 0xA7C98552, 0x315C4198, 0x7C727F54, 0xEAE7BB9E, 0x873A469D, 0x11AF8257, 0x4F549A1C, 0xD9C15ED6, 0xB41CA3D5, 0x2289671F, 0x6FA759D3, 0xF9329D19, 0x94EF601A, 0x027AA4D0, 0x0EB31D82, 0x9826D948, 0xF5FB244B, 0x636EE081, 0x2E40DE4D, 0xB8D51A87, 0xD508E784, 0x439D234E, 0xCC9B9520, 0x5A0E51EA, 0x37D3ACE9, 0xA1466823, 0xEC6856EF, 0x7AFD9225, 0x17206F26, 0x81B5ABEC, 0x8D7C12BE, 0x1BE9D674, 0x76342B77, 0xE0A1EFBD, 0xAD8FD171, 0x3B1A15BB, 0x56C7E8B8, 0xC0522C72
};
private static final long[] numArr2 = new long[] {
0L, 1996959894L, 3993919788L, 2567524794L, 124634137L, 1886057615L, 3915621685L, 2657392035L, 249268274L, 2044508324L, 3772115230L, 2547177864L, 162941995L, 2125561021L, 3887607047L, 2428444049L, 498536548L, 1789927666L, 4089016648L, 2227061214L, 450548861L, 1843258603L, 4107580753L, 2211677639L, 325883990L, 1684777152L, 4251122042L, 2321926636L, 335633487L, 1661365465L, 4195302755L, 2366115317L, 997073096L, 1281953886L, 3579855332L, 2724688242L, 1006888145L, 1258607687L, 3524101629L, 2768942443L, 901097722L, 1119000684L, 3686517206L, 2898065728L, 853044451L, 1172266101L, 3705015759L, 2882616665L, 651767980L, 1373503546L, 3369554304L, 3218104598L, 565507253L, 1454621731L, 3485111705L, 3099436303L, 671266974L, 1594198024L, 3322730930L, 2970347812L, 795835527L, 1483230225L, 3244367275L, 3060149565L, 1994146192L, 31158534L, 2563907772L, 4023717930L, 1907459465L, 112637215L, 2680153253L, 3904427059L, 2013776290L, 251722036L, 2517215374L, 3775830040L, 2137656763L, 141376813L, 2439277719L, 3865271297L, 1802195444L, 476864866L, 2238001368L, 4066508878L, 1812370925L, 453092731L, 2181625025L, 4111451223L, 1706088902L, 314042704L, 2344532202L, 4240017532L, 1658658271L, 366619977L, 2362670323L, 4224994405L, 1303535960L, 984961486L, 2747007092L, 3569037538L, 1256170817L, 1037604311L, 2765210733L, 3554079995L, 1131014506L, 879679996L, 2909243462L, 3663771856L, 1141124467L, 855842277L, 2852801631L, 3708648649L, 1342533948L, 654459306L, 3188396048L, 3373015174L, 1466479909L, 544179635L, 3110523913L, 3462522015L, 1591671054L, 702138776L, 2966460450L, 3352799412L, 1504918807L, 783551873L, 3082640443L, 3233442989L, 3988292384L, 2596254646L, 62317068L, 1957810842L, 3939845945L, 2647816111L, 81470997L, 1943803523L, 3814918930L, 2489596804L, 225274430L, 2053790376L, 3826175755L, 2466906013L, 167816743L, 2097651377L, 4027552580L, 2265490386L, 503444072L, 1762050814L, 4150417245L, 2154129355L, 426522225L, 1852507879L, 4275313526L, 2312317920L, 282753626L, 1742555852L, 4189708143L, 2394877945L, 397917763L, 1622183637L, 3604390888L, 2714866558L, 953729732L, 1340076626L, 3518719985L, 2797360999L, 1068828381L, 1219638859L, 3624741850L, 2936675148L, 906185462L, 1090812512L, 3747672003L, 2825379669L, 829329135L, 1181335161L, 3412177804L, 3160834842L, 628085408L, 1382605366L, 3423369109L, 3138078467L, 570562233L, 1426400815L, 3317316542L, 2998733608L, 733239954L, 1555261956L, 3268935591L, 3050360625L, 752459403L, 1541320221L, 2607071920L, 3965973030L, 1969922972L, 40735498L, 2617837225L, 3943577151L, 1913087877L, 83908371L, 2512341634L, 3803740692L, 2075208622L, 213261112L, 2463272603L, 3855990285L, 2094854071L, 198958881L, 2262029012L, 4057260610L, 1759359992L, 534414190L, 2176718541L, 4139329115L, 1873836001L, 414664567L, 2282248934L, 4279200368L, 1711684554L, 285281116L, 2405801727L, 4167216745L, 1634467795L, 376229701L, 2685067896L, 3608007406L, 1308918612L, 956543938L, 2808555105L, 3495958263L, 1231636301L, 1047427035L, 2932959818L, 3654703836L, 1088359270L, 936918000L, 2847714899L, 3736837829L, 1202900863L, 817233897L, 3183342108L, 3401237130L, 1404277552L, 615818150L, 3134207493L, 3453421203L, 1423857449L, 601450431L, 3009837614L, 3294710456L, 1567103746L, 711928724L, 3020668471L, 3272380065L, 1510334235L, 755167117L
};
Test with CRC32 from java.util.zip:
public static String crc32(byte[] data) {
Checksum c = new CRC32();
c.update(1);
c.update(data);
return String.valueOf(Long.parseLong(Long.toHexString(c.getValue()), 16)); // 1544046132
}
The solution from [3]
public static String crc32(byte[] data){
int crc = Integer.MAX_VALUE;
for (byte b : data)
crc = numArr[(crc & 0xff) ^ (b & 0xff) & 255] ^ (crc >>> 8);
crc = ~crc; // flip bit/sign
return String.valueOf(Long.parseLong(Integer.toHexString(crc), 16)); // 1120211745
}
This implementation uses Guava and was translated directly by me. I also tried BigInteger, bit operations to enforce unsignedness, and byte arrays instead of int or long.
public static String crc32(byte[] data) {
long num = UnsignedInteger.MAX_VALUE.longValue();
for (byte aByte : data)
num = numArr2[(Integer.parseUnsignedInt(Long.toUnsignedString(num)) ^ aByte) & 255] ^ (num >>> 8);
num ^= UnsignedInteger.MAX_VALUE.longValue();
return String.valueOf(num); // 2128031166
}
Update 3
I tried a new start with BigInteger and now the result is closer. The CRC is 1190225612 (expected: 1195433981).
public static String crc32(byte[] bytes) {
BigInteger num = BigInteger.valueOf(0xffffffffL);
for (byte aByte: bytes)
num = BigInteger.valueOf(numArr[num.xor(BigInteger.valueOf(aByte)).and(BigInteger.valueOf(numArr.length - 1)).intValue()]).xor((num.shiftRight(8)));
num = num.xor(BigInteger.valueOf(0xffffffffL));
System.out.println("1195433981");
return String.valueOf(~num.intValue());
}
Update 4
Updated the procedure description: not the whole line containing the stamp tag is removed; only the value of the crc attribute is removed.
You have the initial CRC wrong. It is 1. From the document you linked:
The seed value shall be "1" (see previousCrc32 = 1 in Figure B.1 and
Figure B.2).
That value is then inverted before the loop, so at that point your "num" should be 0xfffffffe.
Your numArr1 is correct. I don't know where your numArr2 came from.
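For reference, such an octet-wise table can be generated from the reflected polynomial instead of being hard-coded; numArr2 in fact matches the table for the standard CRC-32 polynomial 0xEDB88320 (the one used by zip and java.util.zip.CRC32), not the IO-Link polynomial. A minimal sketch, with a helper name of my choosing:
// Builds a 256-entry octet-wise lookup table from a reflected polynomial.
// buildTable(0xEB31D82E) reproduces numArr1; buildTable(0xEDB88320) reproduces
// the standard CRC-32 table (numArr2, printed there as unsigned decimals).
private static int[] buildTable(int reflectedPoly) {
    int[] table = new int[256];
    for (int i = 0; i < 256; i++) {
        int crc = i;
        for (int j = 0; j < 8; j++) {
            crc = (crc >>> 1) ^ (-(crc & 1) & reflectedPoly);
        }
        table[i] = crc;
    }
    return table;
}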
All you need is this, calling with the first argument being 1:
static private int crc32iodd(int crc, byte[] data){
crc = ~crc;
for (byte b : data)
crc = numArr1[(crc ^ b) & 0xff] ^ (crc >>> 8);
return ~crc;
}
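A minimal usage sketch (ioddBytesWithoutCrc is an assumed variable holding the IODD content with the crc value removed):
// seed with 1, as the specification requires
int crc = crc32iodd(1, ioddBytesWithoutCrc);
// print the 32 bits as an unsigned decimal for comparison with the stamp value
System.out.println(Integer.toUnsignedLong(crc));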
Since Java doesn't have unsigned int, half the time your result will be negative. That's fine, since it's the exact same 32 bits as you would have as an unsigned result, just printed differently.
There is no sense of "closer" for the answer. The integer difference between what you want and what you're getting has no meaning in the world of CRCs, since they are not integer calculations.
Update:
Looking more closely at your links and the CRC reference code you put in your question, it appears you have referenced entirely the wrong documents for the CRC you actually need. You referenced the CRC for BLOBs used for firmware updates, which is not the CRC used for validation of the XML files. This document provides the specification for the CRC used for the XML files as the 32-bit CRC found in the ITU-T recommendation V.42. That is a different CRC using a different polynomial and different initial value. It is in fact the standard CRC-32 used by zip, and is available in Java as java.util.zip.CRC32. You don't need to write any CRC code. Just use that.
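For example, a minimal sketch of computing that standard CRC-32 over the prepared bytes (xmlBytes is an assumed variable holding the pre-processed XML content):
// java.util.zip.CRC32 implements exactly this standard CRC-32
java.util.zip.CRC32 crc = new java.util.zip.CRC32();
crc.update(xmlBytes);
long value = crc.getValue(); // the unsigned 32-bit result as a long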
That document also notes, but does not detail, special processing of the XML input before computing the CRC. You will need to do more research to find out exactly what the CRC is calculated over in the XML file and how it is pre-processed. This package has a description noting that the associated language XMLs have the CRC of the main XML appended to the end before their own CRC calculations. However, it is not made clear exactly how the CRC is appended.
This time please conduct adequate research and verification on your part, before coming here with questions about how to implement an incorrectly identified CRC.
For the sake of completeness.
You are right, I can use java.util.zip.CRC32. I think the problem was that at the beginning of my implementation tests I used the whole XML file, and so I thought that the description in the PDF was wrong.
And here is the solution:
If you have an IODD XML file, then you have to remove the value of the crc attribute in the stamp tag. Subsequently, if not already done, convert the file content without the crc value to a byte array. Now you can use one of these two functions:
Implementation with java.util.zip.CRC32
public boolean isValid() {
    CRC32 m = new CRC32();
    m.update(file); // file: the IODD content as a byte[] with the crc value removed
    long x = 1195433981L; // expected CRC taken from the stamp tag of the test IODD
    return (m.getValue() == x);
}
Manual implementation (the table corresponds to the one at https://rosettacode.org/wiki/CRC-32#Lingo or https://web.mit.edu/freebsd/head/sys/libkern/crc32.c):
public class Crc32 {
private static final long[] numArr2 = new long[] {
0L, 1996959894L, 3993919788L, 2567524794L, 124634137L, 1886057615L, 3915621685L, 2657392035L, 249268274L, 2044508324L, 3772115230L, 2547177864L, 162941995L, 2125561021L, 3887607047L, 2428444049L, 498536548L, 1789927666L, 4089016648L, 2227061214L, 450548861L, 1843258603L, 4107580753L, 2211677639L, 325883990L, 1684777152L, 4251122042L, 2321926636L, 335633487L, 1661365465L, 4195302755L, 2366115317L, 997073096L, 1281953886L, 3579855332L, 2724688242L, 1006888145L, 1258607687L, 3524101629L, 2768942443L, 901097722L, 1119000684L, 3686517206L, 2898065728L, 853044451L, 1172266101L, 3705015759L, 2882616665L, 651767980L, 1373503546L, 3369554304L, 3218104598L, 565507253L, 1454621731L, 3485111705L, 3099436303L, 671266974L, 1594198024L, 3322730930L, 2970347812L, 795835527L, 1483230225L, 3244367275L, 3060149565L, 1994146192L, 31158534L, 2563907772L, 4023717930L, 1907459465L, 112637215L, 2680153253L, 3904427059L, 2013776290L, 251722036L, 2517215374L, 3775830040L, 2137656763L, 141376813L, 2439277719L, 3865271297L, 1802195444L, 476864866L, 2238001368L, 4066508878L, 1812370925L, 453092731L, 2181625025L, 4111451223L, 1706088902L, 314042704L, 2344532202L, 4240017532L, 1658658271L, 366619977L, 2362670323L, 4224994405L, 1303535960L, 984961486L, 2747007092L, 3569037538L, 1256170817L, 1037604311L, 2765210733L, 3554079995L, 1131014506L, 879679996L, 2909243462L, 3663771856L, 1141124467L, 855842277L, 2852801631L, 3708648649L, 1342533948L, 654459306L, 3188396048L, 3373015174L, 1466479909L, 544179635L, 3110523913L, 3462522015L, 1591671054L, 702138776L, 2966460450L, 3352799412L, 1504918807L, 783551873L, 3082640443L, 3233442989L, 3988292384L, 2596254646L, 62317068L, 1957810842L, 3939845945L, 2647816111L, 81470997L, 1943803523L, 3814918930L, 2489596804L, 225274430L, 2053790376L, 3826175755L, 2466906013L, 167816743L, 2097651377L, 4027552580L, 2265490386L, 503444072L, 1762050814L, 4150417245L, 2154129355L, 426522225L, 1852507879L, 4275313526L, 2312317920L, 282753626L, 1742555852L, 4189708143L, 2394877945L, 397917763L, 1622183637L, 3604390888L, 2714866558L, 953729732L, 1340076626L, 3518719985L, 2797360999L, 1068828381L, 1219638859L, 3624741850L, 2936675148L, 906185462L, 1090812512L, 3747672003L, 2825379669L, 829329135L, 1181335161L, 3412177804L, 3160834842L, 628085408L, 1382605366L, 3423369109L, 3138078467L, 570562233L, 1426400815L, 3317316542L, 2998733608L, 733239954L, 1555261956L, 3268935591L, 3050360625L, 752459403L, 1541320221L, 2607071920L, 3965973030L, 1969922972L, 40735498L, 2617837225L, 3943577151L, 1913087877L, 83908371L, 2512341634L, 3803740692L, 2075208622L, 213261112L, 2463272603L, 3855990285L, 2094854071L, 198958881L, 2262029012L, 4057260610L, 1759359992L, 534414190L, 2176718541L, 4139329115L, 1873836001L, 414664567L, 2282248934L, 4279200368L, 1711684554L, 285281116L, 2405801727L, 4167216745L, 1634467795L, 376229701L, 2685067896L, 3608007406L, 1308918612L, 956543938L, 2808555105L, 3495958263L, 1231636301L, 1047427035L, 2932959818L, 3654703836L, 1088359270L, 936918000L, 2847714899L, 3736837829L, 1202900863L, 817233897L, 3183342108L, 3401237130L, 1404277552L, 615818150L, 3134207493L, 3453421203L, 1423857449L, 601450431L, 3009837614L, 3294710456L, 1567103746L, 711928724L, 3020668471L, 3272380065L, 1510334235L, 755167117L
};
private static long crc32(byte[] data){
long crc = 0xFFFFFFFFL;
for (byte b : data)
crc = numArr2[(int)(crc ^ b) & 0xff] ^ (crc >>> 8);
return crc ^ 0xFFFFFFFFL;
}
public static boolean isValid(byte[] data, long checkCrc) {
long crc = crc32(data);
return (checkCrc == crc);
}
}
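For completeness, a sketch of how the crc value in the stamp tag might be blanked out before calling either function; the attribute name and quoting in the regex are assumptions about the IODD format, and everything else in the file must stay byte-for-byte unchanged (see the note about whitespace in the next answer):
import java.nio.charset.StandardCharsets;

// Hypothetical helper: replaces only the numeric value of the crc attribute with an empty string.
static byte[] removeCrcValue(byte[] xml) {
    String text = new String(xml, StandardCharsets.UTF_8);
    return text.replaceFirst("(crc=\")[0-9]+(\")", "$1$2")
            .getBytes(StandardCharsets.UTF_8);
}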
I had the same problem, but I use Python. One comment first: CRCs are made for detecting changes. If you change only one bit in the input stream, the CRC will be completely different, so you cannot say that the calculated CRC is close to the right value.
This is my code, which calculates the right CRC:
import zlib

prev = 0
for eachLine in open(file, "rb"):
    prev = zlib.crc32(eachLine, prev)
crc = prev & 0xFFFFFFFF
First, set the crc value in your file to an empty string "". Then use that file to calculate the CRC. Because the file is processed as binary, don't change anything besides the crc (tabs, spaces, line endings). One space, automatically removed by the IODD_Checker tool, cost me nearly a complete day.
If you calculate a 32-bit CRC, there is a 1 in 4294967296 chance that 0 is the valid CRC. Because the checker uses this value to mark the crc as invalid, I'm not sure whether special handling is defined.
if 0 == prev:
crc = 0xFFFFFFFF
else:
crc = prev & 0xFFFFFFFF
I will still have to check this issue.
Let's suppose that I have a stream of text (or a Reader in Java) that I'd like to check for a particular string. The stream might be very large, so as soon as the search string is found I'd like to return true, and I'd also like to avoid storing the entire input in memory.
Naively, I might try to do something like this (in Java):
public boolean streamContainsString(Reader reader, String searchString) throws IOException {
char[] buffer = new char[1024];
int numCharsRead;
while((numCharsRead = reader.read(buffer)) > 0) {
if ((new String(buffer, 0, numCharsRead)).indexOf(searchString) >= 0)
return true;
}
return false;
}
Of course this fails to detect the given search string if it occurs on the boundary of the 1k buffer:
Search text: "stackoverflow"
Stream buffer 1: "abc.........stack"
Stream buffer 2: "overflow.......xyz"
How can I modify this code so that it correctly finds the given search string across the boundary of the buffer but without loading the entire stream into memory?
Edit: Note that when searching a stream for a string, we're trying to minimise the number of reads from the stream (to avoid latency on a network/disk) and to keep memory usage constant regardless of the amount of data in the stream. Actual efficiency of the string-matching algorithm is secondary, but obviously it would be nice to find a solution that uses one of the more efficient algorithms.
There are three good solutions here:
If you want something that is easy and reasonably fast, go with no buffer, and instead implement a simple nondeterministic finite-state machine. Your state will be a list of indices into the string you are searching for, and your logic looks something like this (pseudocode):
String needle;
n = needle.length();
for every input character c do
add index 0 to the list
for every index i in the list do
if c == needle[i] then
if i + 1 == n then
return true
else
replace i in the list with i + 1
end
else
remove i from the list
end
end
end
This will find the string if it exists and you will never need a buffer.
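A sketch of that idea in Java, reading from a Reader as in the question (names are illustrative; imports for Reader, List and LinkedList omitted like in the other snippets):
// Nondeterministic index-list matcher: each entry in 'states' is the number of
// characters of 'needle' matched by some suffix of the input read so far.
public static boolean streamContainsString(Reader reader, String needle) throws IOException {
    int n = needle.length();
    if (n == 0) return true;
    List<Integer> states = new LinkedList<>();
    int c;
    while ((c = reader.read()) != -1) {
        states.add(0); // a match could start at this character
        List<Integer> next = new LinkedList<>();
        for (int i : states) {
            if (needle.charAt(i) == c) {
                if (i + 1 == n) return true; // whole needle matched
                next.add(i + 1);             // keep this partial match alive
            }
            // otherwise this partial match simply dies
        }
        states = next;
    }
    return false;
}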
Slightly more work but also faster: do an NFA-to-DFA conversion that figures out in advance what lists of indices are possible, and assign each one to a small integer. (If you read about string search on Wikipedia, this is called the powerset construction.) Then you have a single state and you make a state-to-state transition on each incoming character. The NFA you want is just the DFA for the string preceded with a state that nondeterministically either drops a character or tries to consume the current character. You'll want an explicit error state as well.
If you want something faster, create a buffer whose size is at least twice n, and use Boyer-Moore to compile a state machine from needle. You'll have a lot of extra hassle because Boyer-Moore is not trivial to implement (although you'll find code online) and because you'll have to arrange to slide the string through the buffer. You'll have to build or find a circular buffer that can 'slide' without copying; otherwise you're likely to give back any performance gains you might get from Boyer-Moore.
I made a few changes to the Knuth-Morris-Pratt algorithm for partial searches. Since the actual comparison position is always less than or equal to the next one, there is no need for extra memory. The code, with a Makefile, is also available on GitHub, and it is written in Haxe to target multiple programming languages at once, including Java.
I also wrote a related article: searching for substrings in streams: a slight modification of the Knuth-Morris-Pratt algorithm in Haxe. The article mentions Jakarta Regexp, now retired and resting in the Apache Attic. The Jakarta Regexp library's match method in the RE class takes a CharacterIterator as a parameter.
class StreamOrientedKnuthMorrisPratt {
var m: Int;
var i: Int;
var ss: String;
var table: Array<Int>;
public function new(ss: String) {
this.ss = ss;
this.buildTable(this.ss);
}
public function begin() : Void {
this.m = 0;
this.i = 0;
}
public function partialSearch(s: String) : Int {
var offset = this.m + this.i;
while(this.m + this.i - offset < s.length) {
if(this.ss.substr(this.i, 1) == s.substr(this.m + this.i - offset,1)) {
if(this.i == this.ss.length - 1) {
return this.m;
}
this.i += 1;
} else {
this.m += this.i - this.table[this.i];
if(this.table[this.i] > -1)
this.i = this.table[this.i];
else
this.i = 0;
}
}
return -1;
}
private function buildTable(ss: String) : Void {
var pos = 2;
var cnd = 0;
this.table = new Array<Int>();
if(ss.length > 2)
this.table.insert(ss.length, 0);
else
this.table.insert(2, 0);
this.table[0] = -1;
this.table[1] = 0;
while(pos < ss.length) {
if(ss.substr(pos-1,1) == ss.substr(cnd, 1))
{
cnd += 1;
this.table[pos] = cnd;
pos += 1;
} else if(cnd > 0) {
cnd = this.table[cnd];
} else {
this.table[pos] = 0;
pos += 1;
}
}
}
public static function main() {
var KMP = new StreamOrientedKnuthMorrisPratt("aa");
KMP.begin();
trace(KMP.partialSearch("ccaabb"));
KMP.begin();
trace(KMP.partialSearch("ccarbb"));
trace(KMP.partialSearch("fgaabb"));
}
}
The Knuth-Morris-Pratt search algorithm never backs up; this is just the property you want for your stream search. I've used it before for this problem, though there may be easier ways using available Java libraries. (When this came up for me I was working in C in the 90s.)
KMP in essence is a fast way to build a string-matching DFA, like Norman Ramsey's suggestion #2.
This answer applied to the initial version of the question where the key was to read the stream only as far as necessary to match on a String, if that String was present. This solution would not meet the requirement to guarantee fixed memory utilisation, but may be worth considering if you have found this question and are not bound by that constraint.
Regarding the constant memory usage constraint: Java stores arrays of any type on the heap, and nulling a reference does not deallocate the memory immediately; I think any solution that allocates arrays in a loop will consume heap memory and require GC.
For a simple implementation, Java 5's Scanner, which can accept an InputStream and use a java.util.regex.Pattern to search the input, might save you worrying about the implementation details.
Here's an example of a potential implementation:
public boolean streamContainsString(Reader reader, String searchString)
        throws IOException {
    Scanner streamScanner = new Scanner(reader);
    // findWithinHorizon treats its argument as a regex, so quote it for a literal match;
    // a horizon of 0 means the search is not bounded
    return streamScanner.findWithinHorizon(Pattern.quote(searchString), 0) != null;
}
I'm thinking regex because it sounds like a job for a Finite State Automaton, something that starts in an initial state, changing state character by character until it either rejects the string (no match) or gets to an accept state.
I think this is probably the most efficient matching logic you could use, and how you organize the reading of the information can be divorced from the matching logic for performance tuning.
It's also how regexes work.
Instead of having your buffer be an array, use an abstraction that implements a circular buffer. Your index calculation will be buf[(next + i) % sizeof(buf)], and you'll have to be careful to fill the buffer one half at a time. But as long as the search string fits in half the buffer, you'll find it.
I believe the best solution to this problem is to try to keep it simple. Remember, because I'm reading from a stream, I want to keep the number of reads from the stream to a minimum (as network or disk latency may be an issue) while keeping the amount of memory used constant (as the stream may be very large). Actual efficiency of the string matching is not the number one goal (as that has been studied to death already).
Based on AlbertoPL's suggestion, here's a simple solution that compares the buffer against the search string character by character. The key is that because the search is only done one character at a time, no back tracking is needed and therefore no circular buffers, or buffers of a particular size are needed.
Now, if someone can come up with a similar implementation based on Knuth-Morris-Pratt search algorithm then we'd have a nice efficient solution ;)
// Caveat: resetting count to 0 on a mismatch can miss matches whose prefix repeats
// (e.g. "aab" is not found in "aaab"); the KMP-based sketch below avoids this.
public boolean streamContainsString(Reader reader, String searchString) throws IOException {
char[] buffer = new char[1024];
int numCharsRead;
int count = 0;
while((numCharsRead = reader.read(buffer)) > 0) {
for (int c = 0; c < numCharsRead; c++) {
if (buffer[c] == searchString.charAt(count))
count++;
else
count = 0;
if (count == searchString.length()) return true;
}
}
return false;
}
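Taking up the invitation above, here is a sketch of the same buffered loop driven by the standard Knuth-Morris-Pratt failure table, so partial matches survive buffer boundaries and repeated prefixes (untested against real data; the method name is mine):
public boolean streamContainsStringKMP(Reader reader, String searchString) throws IOException {
    int n = searchString.length();
    if (n == 0) return true;
    // failure[i]: length of the longest proper prefix of searchString that is also
    // a suffix of searchString.substring(0, i); failure[0] is a -1 sentinel
    int[] failure = new int[n + 1];
    failure[0] = -1;
    for (int i = 0, k = -1; i < n; ) {
        while (k >= 0 && searchString.charAt(i) != searchString.charAt(k)) k = failure[k];
        i++;
        k++;
        failure[i] = k;
    }
    char[] buffer = new char[1024];
    int numCharsRead;
    int matched = 0; // number of characters of searchString matched so far
    while ((numCharsRead = reader.read(buffer)) > 0) {
        for (int c = 0; c < numCharsRead; c++) {
            // fall back through the failure table instead of resetting to 0
            while (matched >= 0 && buffer[c] != searchString.charAt(matched)) matched = failure[matched];
            matched++;
            if (matched == n) return true;
        }
    }
    return false;
}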
If you're not tied to using a Reader, then you can use Java's NIO API to efficiently load the file. For example (untested, but should be close to working):
public boolean streamContainsString(File input, String searchString) throws IOException {
    Pattern pattern = Pattern.compile(Pattern.quote(searchString));
    try (FileInputStream fis = new FileInputStream(input);
         FileChannel fc = fis.getChannel()) {
        int sz = (int) fc.size();
        MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, sz);
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        CharBuffer cb = decoder.decode(bb);
        Matcher matcher = pattern.matcher(cb);
        // find() looks for the string anywhere in the file; matches() would require the whole file to match
        return matcher.find();
    }
}
This basically mmap()'s the file to search and relies on the operating system to do the right thing regarding cache and memory usage. Note, however, that map() is more expensive than just reading the file into a large buffer for files smaller than around 10 KiB.
A very fast search of a stream is implemented in the RingBuffer class from the Ujorm framework. See the sample:
Reader reader = RingBuffer.createReader("xxx ${abc} ${def} zzz");
String word1 = RingBuffer.findWord(reader, "${", "}");
assertEquals("abc", word1);
String word2 = RingBuffer.findWord(reader, "${", "}");
assertEquals("def", word2);
String word3 = RingBuffer.findWord(reader, "${", "}");
assertEquals("", word3);
The single-class implementation is available on SourceForge; for more information see the link.
Implement a sliding window. Keep a buffer around, move all elements in the buffer one position forward, and enter a single new character at the end. If the buffer is equal to your searched word, it is contained.
Of course, if you want to make this more efficient, you can look for a way to avoid moving all elements in the buffer around, for example by using a cyclic buffer and a representation of the string which 'cycles' the same way the buffer does, so you only need to check for content equality. This saves moving all the elements in the buffer.
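For illustration, a literal sketch of the shifting-window version described above (it does O(m) work per input character, where m is the length of the searched word):
public boolean streamContainsString(Reader reader, String searchString) throws IOException {
    if (searchString.isEmpty()) return true;
    char[] window = new char[searchString.length()];
    int filled = 0;
    int c;
    while ((c = reader.read()) != -1) {
        if (filled < window.length) {
            window[filled++] = (char) c;
        } else {
            // shift every element one position forward and append the new character
            System.arraycopy(window, 1, window, 0, window.length - 1);
            window[window.length - 1] = (char) c;
        }
        if (filled == window.length && new String(window).equals(searchString)) {
            return true;
        }
    }
    return false;
}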
I think you need to buffer a small amount at the boundary between buffers.
For example if your buffer size is 1024 and the length of the SearchString is 10, then as well as searching each 1024-byte buffer you also need to search each 18-byte transition between two buffers (9 bytes from the end of the previous buffer concatenated with 9 bytes from the start of the next buffer).
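A sketch of that overlap idea, carrying the last searchString.length() - 1 characters of each read over to the next one (the buffer size and names are illustrative):
public boolean streamContainsString(Reader reader, String searchString) throws IOException {
    if (searchString.isEmpty()) return true;
    int overlap = searchString.length() - 1;
    char[] buffer = new char[1024 + overlap];
    int kept = 0; // characters carried over from the previous read
    int numCharsRead;
    while ((numCharsRead = reader.read(buffer, kept, buffer.length - kept)) > 0) {
        int valid = kept + numCharsRead;
        if (new String(buffer, 0, valid).indexOf(searchString) >= 0) {
            return true;
        }
        kept = Math.min(overlap, valid);
        System.arraycopy(buffer, valid - kept, buffer, 0, kept); // keep the tail for the next round
    }
    return false;
}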
I'd say switch to a character-by-character solution: scan for the first character of your target text, and when you find it, increment a counter and look for the next character. Every time the next consecutive character is not found, restart the counter. It would work like this:
public boolean streamContainsString(Reader reader, String searchString) throws IOException {
char[] buffer = new char[1024];
int numCharsRead;
int count = 0;
while((numCharsRead = reader.read(buffer)) > 0) {
if (buffer[numCharsRead -1] == searchString.charAt(count))
count++;
else
count = 0;
if (count == searchString.length())
return true;
}
return false;
}
The only problem is when you're in the middle of looking through the characters, in which case there needs to be a way of remembering your count variable. I don't see an easy way of doing that except as a private variable of the whole class, in which case you would not initialize count inside this method.
You might be able to implement a very fast solution using the Fast Fourier Transform, which, if implemented properly, lets you do string matching in time O(n log m), where n is the length of the longer string and m is the length of the shorter string. You could, for example, perform an FFT as soon as you receive a stream input of length m; if it matches, you can return, and if it doesn't, you can throw away the first character of the stream input, wait for a new character to appear, and then perform the FFT again.
You can increase the speed of the search for very large strings by using a dedicated string search algorithm.
If you're looking for a constant substring rather than a regex, I'd recommend Boyer-Moore. There's plenty of source code on the internet.
Also, use a circular buffer to avoid thinking too hard about buffer boundaries.
Mike.
I also had a similar problem: skipping bytes from an InputStream until a specified string (or byte array) is found. This is simple code based on a circular buffer. It is not very efficient, but it works for my needs:
private static boolean matches(int[] buffer, int offset, byte[] search) {
final int len = buffer.length;
for (int i = 0; i < len; ++i) {
if (search[i] != buffer[(offset + i) % len]) {
return false;
}
}
return true;
}
public static void skipBytes(InputStream stream, byte[] search) throws IOException {
    final int[] buffer = new int[search.length];
    for (int i = 0; i < search.length; ++i) {
        buffer[i] = stream.read();
    }
    int offset = 0;
    while (true) {
        if (matches(buffer, offset, search)) {
            break;
        }
        int next = stream.read();
        if (next == -1) {
            // avoid looping forever if the pattern never appears
            throw new EOFException("pattern not found before end of stream");
        }
        buffer[offset] = next;
        offset = (offset + 1) % buffer.length;
    }
}
Here is my implementation:
static boolean containsKeywordInStream( Reader ir, String keyword, int bufferSize ) throws IOException{
SlidingContainsBuffer sb = new SlidingContainsBuffer( keyword );
char[] buffer = new char[ bufferSize ];
int read;
while( ( read = ir.read( buffer ) ) != -1 ){
if( sb.checkIfContains( buffer, read ) ){
return true;
}
}
return false;
}
SlidingContainsBuffer class:
class SlidingContainsBuffer{
private final char[] keyword;
private int keywordIndexToCheck = 0;
private boolean keywordFound = false;
SlidingContainsBuffer( String keyword ){
this.keyword = keyword.toCharArray();
}
boolean checkIfContains( char[] buffer, int read ){
for( int i = 0; i < read; i++ ){
if( keywordFound == false ){
if( keyword[ keywordIndexToCheck ] == buffer[ i ] ){
keywordIndexToCheck++;
if( keywordIndexToCheck == keyword.length ){
keywordFound = true;
}
} else {
keywordIndexToCheck = 0;
}
} else {
break;
}
}
return keywordFound;
}
}
This answer fully satisfies the task:
The implementation is able to find the searched keyword even if it is split between buffers
Memory usage is minimal and defined by the buffer size
The number of reads is minimized by using a bigger buffer