Search for a String (as a byte[]) in a binary stream - java

Hi Team, I am trying to find the String "Henry" in a binary file and change it to a different String. FYI, the file is the output of serialising an object. Original Question here
I am new to searching bytes and imagined this code would search for my byte[] and swap it. But it doesn't come close to working; it doesn't even find a match.
{
byte[] bytesHenry = new String("Henry").getBytes();
byte[] bytesSwap = new String("Zsswd").getBytes();
byte[] seekHenry = new byte[bytesHenry.length];
RandomAccessFile file = new RandomAccessFile(fileString,"rw");
long filePointer;
while (seekHenry != null) {
filePointer = file.getFilePointer();
file.readFully(seekHenry);
if (bytesHenry == seekHenry) {
file.seek(filePointer);
file.write(bytesSwap);
break;
}
}
}
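Okay, I see the bytesHenry == seekHenry problem (== compares the array references, not their contents) and will switch to Arrays.equals(bytesHenry, seekHenry).
I think I also need to move back by 4 byte positions each time I read 5 bytes, so that the search window advances one byte at a time.
Bingo, it finds it now: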
while (seekHenry != null) {
filePointer = file.getFilePointer();
file.readFully(seekHenry);
if (Arrays.equals(bytesHenry, seekHenry)) {
file.seek(filePointer);
file.write(bytesSwap);
break;
}
file.seek(filePointer);
file.read();
}
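The working loop above reads the file one window at a time, and readFully throws an EOFException once fewer than five bytes remain, so a file without a match ends in an exception. As a minimal sketch of an alternative, assuming the file is small enough to load into memory and the replacement has exactly the same byte length (Java serialization stores a String with a length prefix, so a different length would corrupt the stream), the search can run on a byte[] and only the matched bytes are overwritten. The class name ByteReplace and its methods are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ByteReplace {
    // Returns the index of the first occurrence of needle in haystack, or -1.
    static int indexOf(byte[] haystack, byte[] needle) {
        if (needle.length == 0) {
            return 0;
        }
        for (int i = 0; i <= haystack.length - needle.length; i++) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i;
            }
        }
        return -1;
    }

    // Overwrites the first occurrence of 'from' with 'to' in the file.
    // Assumption: both arrays have the same length, so no bytes shift.
    static boolean replaceFirst(Path path, byte[] from, byte[] to) throws IOException {
        if (from.length != to.length) {
            throw new IllegalArgumentException("replacement must have the same length");
        }
        byte[] content = Files.readAllBytes(path);
        int idx = indexOf(content, from);
        if (idx < 0) {
            return false;
        }
        try (RandomAccessFile file = new RandomAccessFile(path.toFile(), "rw")) {
            file.seek(idx);
            file.write(to);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Demo data only: two arbitrary bytes, the text "Henry", one trailing byte.
        Path tmp = Files.createTempFile("demo", ".bin");
        Files.write(tmp, new byte[] {0, 5, 'H', 'e', 'n', 'r', 'y', 1});
        boolean replaced = replaceFirst(tmp,
                "Henry".getBytes(StandardCharsets.US_ASCII),
                "Zsswd".getBytes(StandardCharsets.US_ASCII));
        String swapped = new String(Files.readAllBytes(tmp), 2, 5, StandardCharsets.US_ASCII);
        System.out.println(replaced + " " + swapped);
        Files.delete(tmp);
    }
}
```

An explicit charset is passed to getBytes so the bytes searched for do not depend on the platform default encoding.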

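The following could work for you. See the method search(byte[] input, byte[] searchedFor), which returns the index where the first match starts, or -1 if there is none.
import java.io.UnsupportedEncodingException;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
public class SearchBuffer {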
public static void main(String[] args) throws UnsupportedEncodingException {
String charset= "US-ASCII";
byte[] searchedFor = "ciao".getBytes(charset);
byte[] input = "aaaciaaaciaojjcia".getBytes(charset);
int idx = search(input, searchedFor);
System.out.println("index: "+idx); //should be 8
}
public static int search(byte[] input, byte[] searchedFor) {
//convert byte[] to Byte[]
Byte[] searchedForB = new Byte[searchedFor.length];
for(int x = 0; x<searchedFor.length; x++){
searchedForB[x] = searchedFor[x];
}
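int idx = -1;
//search: slide a window of searchedFor.length bytes over the input
Deque<Byte> q = new ArrayDeque<Byte>(searchedFor.length + 1);
for(int i=0; i<input.length; i++){
q.addLast(input[i]);
if(q.size() > searchedForB.length){
//drop the oldest byte so the window keeps its size
q.removeFirst();
}
if(q.size() == searchedForB.length){
//the window is full, so compare it (this also covers a match at the very end of the input)
Byte[] cur = q.toArray(new Byte[0]);
if(Arrays.equals(cur, searchedForB)){
//found! the window ends at i, so it starts here:
idx = i - searchedForB.length + 1;
break;
}
}
}
return idx;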
}
}

From Fastest way to find a string in a text file with java:
The best implementation I've found is in MIMEParser: https://github.com/samskivert/ikvm-openjdk/blob/master/build/linux-amd64/impsrc/com/sun/xml/internal/org/jvnet/mimepull/MIMEParser.java
/**
* Finds the boundary in the given buffer using Boyer-Moore algo.
* Copied from java.util.regex.Pattern.java
*
* @param mybuf boundary to be searched in this mybuf
* @param off start index in mybuf
* @param len number of bytes in mybuf
*
* @return -1 if there is no match or index where the match starts
*/
private int match(byte[] mybuf, int off, int len) {
Needed also:
private void compileBoundaryPattern();
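The MIMEParser source is not reproduced here, so as a rough illustration of the idea behind it, here is a minimal sketch of Boyer-Moore-Horspool, a simplified variant of Boyer-Moore that uses only a bad-character shift table. This is not the MIMEParser code; the class and method names are hypothetical.

```java
import java.util.Arrays;

public class HorspoolSearch {
    // Returns the index of the first occurrence of needle in haystack, or -1.
    static int indexOf(byte[] haystack, byte[] needle) {
        if (needle.length == 0) {
            return 0;
        }
        // For each byte value: how far the window may shift when that byte
        // is aligned with the last position of the needle.
        int[] shift = new int[256];
        Arrays.fill(shift, needle.length);
        for (int i = 0; i < needle.length - 1; i++) {
            shift[needle[i] & 0xff] = needle.length - 1 - i;
        }
        int pos = 0;
        while (pos <= haystack.length - needle.length) {
            // Compare from the end of the window backwards.
            int j = needle.length - 1;
            while (j >= 0 && haystack[pos + j] == needle[j]) {
                j--;
            }
            if (j < 0) {
                return pos;
            }
            pos += shift[haystack[pos + needle.length - 1] & 0xff];
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] input = "aaaciaaaciaojjcia".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        byte[] searchedFor = "ciao".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        System.out.println("index: " + indexOf(input, searchedFor)); // index: 8
    }
}
```

The shift table lets the search skip several bytes at once when the byte under the end of the window does not occur late in the pattern, which is where the speed-up over a byte-by-byte scan comes from.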

Related

Not able to get 4 leading zeros in sha256 hash proof of work- ever (Java)

I'm building a blockchain app.
When I run tests in main, no matter what I do, no matter how much time I give it, when I log different things out, I'm unable to get 4 leading zeroes and so complete a difficulty level of 4. I see the log of the binary hashes and many times they have repeating elements, 1111 for instance, but never 0000 until my time is hit and the difficulty decreases to three. I have no idea why.
I borrowed the hash algorithm from an online source and I checked its output against an online hasher and it checked out.
I know with each level of difficulty it increases exponentially but 2^4 is still only 16 and I see other repeating numbers (1111, 1010, any combination except 0000). Is there any reason why this might be the case?
I wanted to provide an abundance of code rather than a shortage. Logically, if all bit patterns were equally likely, it makes no sense that 0000* (e.g. 0000101011) wouldn't turn up at some point. Therefore four zeros must not be possible, but why? I waited 100 seconds multiple times and saw other numbers repeat themselves. Each time, it solved exactly on the dot at 4, 3 or 2 seconds, when the difficulty dropped to three. When I start at difficulty 5 (genesis block) it never solves, and I'm sure it wouldn't even if I left it running overnight. So what could be going on?
package privblock.gerald.ryan;
import java.nio.charset.StandardCharsets;
import java.security.NoSuchAlgorithmException;
import java.time.Instant;
import java.util.Arrays;
import java.util.Date; // gets time in ms.
import privblock.gerald.ryan.util.CryptoHash;
/**
*
* @author Gerald Ryan Block Class of blockchain app
*
* Description: The block hash is the result of the timestamp, the
* last_hash, the data, the difficulty and the nonce
*
*/
public class Block {
long timestamp;
String lastHash;
String hash;
String[] data;
int difficulty;
int nonce;
// Millisecond basis
static long MILLISECONDS = 1;
static long SECONDS = 1000 * MILLISECONDS;
static long MINE_RATE = 2 * SECONDS;
/**
* A block is a unit of storage for a blockchain that supports a cryptocurrency.
*
* @param timestamp
* @param lastHash
* @param hash
* @param data
* @param difficulty
* @param nonce
*/
public Block(long timestamp, String lastHash, String hash, String[] data, int difficulty, int nonce) {
super();
this.timestamp = timestamp;
this.lastHash = lastHash;
this.hash = hash;
this.data = data;
this.difficulty = difficulty;
this.nonce = nonce;
}
public String toString() {
return "\n-----------BLOCK--------\ntimestamp: " + this.timestamp + "\nlastHash: " + this.lastHash + "\nhash: "
+ this.hash + "\ndifficulty: " + this.getDifficulty() + "\nNonce: " + this.nonce
+ "\n-----------------------\n";
}
/**
* Mine a block based on given last block and data until a block hash is found
* that meets the leading 0's Proof of Work requirement.
*
* @param last_block
* @param data
* @return
* @throws NoSuchAlgorithmException
*/
public static Block mine_block(Block last_block, String[] data) throws NoSuchAlgorithmException {
long timestamp = new Date().getTime();
String last_hash = last_block.getHash();
int difficulty = Block.adjust_difficulty(last_block, timestamp);
int nonce = 0;
String hash = CryptoHash.getSHA256(timestamp, last_block.getHash(), data, difficulty, nonce);
String proof_of_work = CryptoHash.n_len_string('0', difficulty);
// System.out.println("Proof of work " + proof_of_work);
String binary_hash = CryptoHash.hex_to_binary(hash);
// System.out.println("binary hash " + binary_hash);
String binary_hash_work_end = binary_hash.substring(0, difficulty);
// System.out.println("binary_Hash_work_end " + binary_hash_work_end);
System.out.println("Difficulty: " + difficulty);
while (!proof_of_work.equalsIgnoreCase(binary_hash_work_end)) {
// System.out.println("Working");
nonce += 1;
timestamp = new Date().getTime();
difficulty = Block.adjust_difficulty(last_block, timestamp);
hash = CryptoHash.getSHA256(timestamp, last_block.getHash(), data, difficulty, nonce);
proof_of_work = CryptoHash.n_len_string('0', difficulty);
binary_hash = CryptoHash.hex_to_binary(hash);
binary_hash_work_end = binary_hash.substring(0, difficulty);
// System.out.println(binary_hash_work_end);
// System.out.println(binary_hash);
// System.out.println(proof_of_work);
}
System.out.println("Solved at Difficulty: " + difficulty);
// System.out.println("Proof of work requirement " + proof_of_work);
// System.out.println("binary_Hash_work_end " + binary_hash_work_end);
// System.out.println("binary hash " + binary_hash);
System.out.println("BLOCK MINED");
return new Block(timestamp, last_hash, hash, data, difficulty, nonce);
}
/**
* Generate Genesis block
*
* @return
*/
public static Block genesis_block() {
long timestamp = 1;
String last_hash = "genesis_last_hash";
String hash = "genesis_hash";
String[] data = { "buy", "privcoin" };
int difficulty = 4;
int nonce = 0;
return new Block(timestamp, last_hash, hash, data, difficulty, nonce);
}
/**
* Calculate the adjusted difficulty according to the MINE_RATE. Increase the
* difficulty for quickly mined blocks. Decrease the difficulty for slowly mined
* blocks.
*
* @param last_block
* @param new_timestamp
*/
public static int adjust_difficulty(Block last_block, long new_timestamp) {
long time_diff = new_timestamp - last_block.getTimestamp();
// System.out.println(time_diff);
if (time_diff < MINE_RATE) {
// System.out.println("Increasing difficulty");
return last_block.getDifficulty() + 1;
} else if (last_block.getDifficulty() - 1 > 0) {
// System.out.println("Decreasing difficulty");
return last_block.getDifficulty() - 1;
} else {
return 1;
}
}
/**
* Validate block by enforcing following rules: - Block must have the proper
* last_hash reference - Block must meet the proof of work requirements -
* difficulty must only adjust by one - block hash must be a valid combination
* of block fields
*
* @param last_block
* @param block
* @return
* @throws NoSuchAlgorithmException
*/
public static boolean is_valid_block(Block last_block, Block block) throws NoSuchAlgorithmException {
String binary_hash = CryptoHash.hex_to_binary(block.getHash());
char[] pow_array = CryptoHash.n_len_array('0', block.getDifficulty());
char[] binary_char_array = CryptoHash.string_to_charray(binary_hash);
if (!block.getLastHash().equalsIgnoreCase(last_block.getHash())) {
System.out.println("The last hash must be correct");
return false;
// Throw exception the last hash must be correct
}
if (!Arrays.equals(pow_array, Arrays.copyOfRange(binary_char_array, 0, block.getDifficulty()))) {
System.out.println("Proof of work requirement not met");
return false;
// throw exception - proof of work requirement not met
}
if (Math.abs(last_block.difficulty - block.difficulty) > 1) {
System.out.println("Block difficulty must adjust by one");
return false;
// throw exception: The block difficulty must only adjust by 1
}
String reconstructed_hash = CryptoHash.getSHA256(block.getTimestamp(), block.getLastHash(), block.getData(),
block.getDifficulty(), block.getNonce());
if (!block.getHash().equalsIgnoreCase(reconstructed_hash)) {
System.out.println("The block hash must be correct");
System.out.println(block.getHash());
System.out.println(reconstructed_hash);
return false;
// throw exception: the block hash must be correct
}
System.out.println("You have mined a valid block");
return true;
}
public int getDifficulty() {
return difficulty;
}
public long getTimestamp() {
return timestamp;
}
public String getHash() {
return hash;
}
public String getLastHash() {
return lastHash;
}
public String[] getData() {
return data;
}
public int getNonce() {
return nonce;
}
public static void main(String[] args) throws NoSuchAlgorithmException {
// String md = CryptoHash.getSHA256("foobar");
Block genesis = genesis_block();
System.out.println(genesis.toString());
// Block bad_block = Block.mine_block(genesis, new String[] { "watch", "AOT" });
// bad_block.lastHash = "evil data";
// System.out.println(bad_block.toString());
Block good_block = mine_block(genesis, new String[] { "foo", "bar" });
System.out.println(good_block.toString());
// System.out.println(mine_block(new_block, new String[] { "crypto", "is", "fun" }).toString());
// System.out.println(Block.is_valid_block(genesis, bad_block)); // returns false as expected
System.out.println(Block.is_valid_block(genesis, good_block));
System.out.println(CryptoHash.hex_to_binary(good_block.getHash()));
Block good_block2 = mine_block(good_block, new String[] { "bar", "foo" });
Block good_block3 = mine_block(good_block2, new String[] { "bar", "foo" });
Block good_block4 = mine_block(good_block3, new String[] { "bar", "foo" });
// Block good_block5 = mine_block(good_block4, new String[] {"bar", "foo"});
// Block good_block6 = mine_block(good_block5, new String[] {"bar", "foo"});
}
}
package privblock.gerald.ryan.util;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
public class CryptoHash {
static HashMap<Character, String> HEX_TO_BIN_TABLE;
static {
HEX_TO_BIN_TABLE = new HashMap<Character, String>();
HEX_TO_BIN_TABLE.put('0', "0000");
HEX_TO_BIN_TABLE.put('1', "0001");
HEX_TO_BIN_TABLE.put('2', "0010");
HEX_TO_BIN_TABLE.put('3', "0011");
HEX_TO_BIN_TABLE.put('4', "0100");
HEX_TO_BIN_TABLE.put('5', "0101");
HEX_TO_BIN_TABLE.put('6', "0110");
HEX_TO_BIN_TABLE.put('7', "0111");
HEX_TO_BIN_TABLE.put('8', "1000");
HEX_TO_BIN_TABLE.put('9', "1001");
HEX_TO_BIN_TABLE.put('a', "1010");
HEX_TO_BIN_TABLE.put('b', "1011");
HEX_TO_BIN_TABLE.put('c', "1100");
HEX_TO_BIN_TABLE.put('d', "1101");
HEX_TO_BIN_TABLE.put('e', "1110");
HEX_TO_BIN_TABLE.put('f', "1111");
}
public static String getSHA256(String... sarray) throws NoSuchAlgorithmException {
String s = concat(sarray);
// System.out.printf("Hashing \"%s\"\n", s);
MessageDigest md;
md = MessageDigest.getInstance("SHA-256");
byte[] b = md.digest(s.getBytes(StandardCharsets.UTF_8));
BigInteger number = new BigInteger(1, b);
StringBuilder hexString = new StringBuilder(number.toString(16));
while (hexString.length() < 32) {
hexString.insert(0, '0');
}
String mds = hexString.toString();
// System.out.printf("hash is:\n%s\n", mds);
return hexString.toString();
}
public static String getSHA256(long timestamp, String last_hash, String[] data, int difficulty, int nonce)
throws NoSuchAlgorithmException {
String s = "";
s += Long.toString(timestamp);
s += last_hash;
s += concat(data);
s += Integer.toString(difficulty);
s += Integer.toString(nonce);
// System.out.printf("Hashing \"%s\"\n", s);
MessageDigest md;
md = MessageDigest.getInstance("SHA-256");
byte[] b = md.digest(s.getBytes(StandardCharsets.UTF_8));
BigInteger number = new BigInteger(1, b);
StringBuilder hexString = new StringBuilder(number.toString(16));
// System.out.println(hexString);
while (hexString.length() < 32) {
hexString.insert(0, '0');
}
String messageDigestString = hexString.toString();
// System.out.printf("hash is:\n%s\n", messageDigestString);
return hexString.toString();
}
public static char[] n_len_array(char c, int n) {
char[] ch = new char[n];
for (int i = 0; i<n; i++) {
ch[i] = c;
}
return ch;
}
public static String n_len_string(char c, int n) {
String s = "";
for (int i = 0; i<n; i++) {
s += c;
}
return s;
}
public static String concat(String... args) {
String s = "";
for (String $ : args) {
s += $;
}
// System.out.println(s);
return s;
}
public static char[] string_to_charray(String str) {
char[] ch = new char[str.length()];
for (int i = 0; i < str.length(); i++) {
ch[i] = str.charAt(i);
}
return ch;
}
public static String string_to_hex(String arg) {
return String.format("%064x", new BigInteger(1, arg.getBytes(StandardCharsets.UTF_8)));
}
public static String hex_to_binary(String hex_string) {
String binary_string = "";
for (int i = 0; i < hex_string.length(); i++) {
binary_string += HEX_TO_BIN_TABLE.get(hex_string.charAt(i));
}
return binary_string;
}
public static String string_to_binary(String raw_string) {
String hex_string = string_to_hex(raw_string);
String bin_string = hex_to_binary(hex_string);
return bin_string;
}
}
P.S. Here's an example of a log I created. I created other, cleaner logs too, but this shows what we're working with. The first item is the time in milliseconds. The second is the first four digits of the hash, which is directly below it, followed by the required proof-of-work string (what the second item needs to be, with length n = difficulty level). The hash just never leads with four zeros, ever, so my hash function or my call to it must be broken in some way.
6479
1000
1000001010111011100110111010100100111010101001111110010101011101101101110000110100110110110000001010001000000010110001100111100111010100110001001001110111011010011100110000011111110100000100000100000010100001000110000111000101100010001111011000110011111101
0000
6479
0101
0101110111010100101010100000001011100011000001110001011011001101001111101011010011000111101101111111001001001010100110101101100111111011001011100101111000011100010001000000000011000111010000101101001000001010101010111001010000101001110011111101011011011000
0000
6479
1000
1000000001000101001110001110110000110111001101100001011000111010111110001011011010011111111101011001110011001001111011011110110010101010101100011011001001110001100010010101001011100001101011011101010000000100111100011011110100000101100111010100100110011101
0000
6479
I figured out the problem. The hash does indeed often have 4 leading zeroes, but the code as structured clips them off: the digest is converted to a BigInteger, which does not keep leading zeros, and the hex string is only padded back to 32 characters rather than 64. I noticed by logging that the binary string is not always a fixed 64 hex characters / 256 bits long. Here's the output:
256
1101111000010000100001110001010001010000001010111001100011010011110010001001010001010010100110111000110010000010001110110100100101000000001111111110011100000001010100000111001000111101010001010100110100000000111000100001000000010010010111011110110011110111
256
011001111101001000011111011001111110010110000011001011111010001011010110010100001011010011010010111101100010010111000010110010110111110001010101100000000101001000111110100111011100001110010010101011011000000101100001101110101101010001110000111111110000
252
0001100101110011101000000011000101011100111101110100111110100101110110011100010110001011000110010011110110011001100111010001100100011001011000001011100011011011011011101000111000011100100011011011011000101010011101000110101011000110011100111010000011000011
256
1100110001001001110001100111100010101100100010110111100111001010011011111111100010100110110000010000101000010111111010010101110001100010101010111111111111001011010111010100001010000010111100100100111000010101011000110000100000100111010001000011000000010000
256
So that's solved, or at least I understand the problem. It's amazing what sleep will do.
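A minimal sketch of one way to keep the leading zeros, assuming the digest is formatted as fixed-width hex (the same "%064x" style that string_to_hex above already uses) instead of being padded to 32 characters. The class and method names are hypothetical, not part of the code above. With a uniformly distributed hash, the first four bits are all zero in about 1 of every 16 attempts, so difficulty 4 should be reached almost immediately once the zeros survive.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PaddedSha256 {
    // Formats a digest as hex with two characters per byte, keeping leading zeros.
    static String toHex(byte[] digest) {
        return String.format("%0" + (digest.length * 2) + "x", new BigInteger(1, digest));
    }

    // SHA-256 of the UTF-8 bytes of input, always 64 hex characters long.
    static String sha256Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return toHex(md.digest(input.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // A made-up digest whose first bytes are zero, to show they are preserved.
        byte[] digest = new byte[32];
        digest[31] = 0x2a;
        System.out.println(toHex(digest).length()); // 64
        System.out.println(sha256Hex("abc").length()); // 64
    }
}
```

With a fixed 64-character hex string, hex_to_binary always yields 256 characters, so substring(0, difficulty) really looks at the leading bits of the hash.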

Parsing objects out of PDF, objects with byte streams are ignored for some reason?

My current assignment includes taking all of the objects out of the pdf file and then using the parsed out objects. But there is an issue that I have noticed where some of the stream objects are being flat out skipped over by my code.
I am completely confused and hoping someone can help indicate what is going wrong here.
Here is the main parsing code.
void parseRawPDFFile() {
//Transform the bytes obtained from the file into a byte character sequence. This byte character sequence
//object is what allows us to use it in regex.
ByteCharSequence byteCharSequence = new ByteCharSequence(bytesFromFile.toByteArray());
byteCharSequence.getStringFromData();
Pattern pattern = Pattern.compile(SINGLE_OBJECT_REGEX);
Matcher matcher = pattern.matcher(byteCharSequence);
//While we have a match (apparently only one match exists at a time) keep looping over the list.
//When a match is found, get the starting and ending indices and manually cut these out char by char
//and assemble them into a new "ByteArrayOutputStream".
int counterOfDoom = 1;
while (matcher.find() ) {
for (int i = 0; i < matcher.groupCount(); i++) {
ByteArrayOutputStream cutOutArray = cutOutByteArrayOutputStreamFromOriginal(matcher.start(), matcher.end());
System.out.println("----------------------------------------------------");
System.out.println(cutOutArray);
//At this point we have cut out the object and can now send it for processing.
createPDFObject(cutOutArray);
System.out.println(counterOfDoom);
System.out.println("----------------------------------------------------");
counterOfDoom++;
}
}
}
Here is the code for the ByteCharSequence
(Credits for the core of this code here: http://blog.sarah-happy.ca/2013/01/java-regular-expression-on-byte-array.html)
public class ByteCharSequence implements CharSequence {
private final byte[] data;
private final int length;
private final int offset;
public ByteCharSequence(byte[] data) {
this(data, 0, data.length);
}
public ByteCharSequence(byte[] data, int offset, int length) {
this.data = data;
this.offset = offset;
this.length = length;
}
@Override
public int length() {
return this.length;
}
@Override
public char charAt(int index) {
return (char) (data[offset + index] & 0xff);
}
@Override
public CharSequence subSequence(int start, int end) {
return new ByteCharSequence(data, offset + start, end - start);
}
/**
* Get the string from the ByteCharSequence data.
* @return
*/
public String getStringFromData() {
//Load it into the method I know works to convert it to a string... Optimized? Probably not at all.
//But it works...
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
for (byte individualByte: data
) {
byteArrayOutputStream.write(individualByte);
}
return byteArrayOutputStream.toString();
}
}
The pdf data that I am processing at present:
10 0 obj
<</Filter/FlateDecode/Length 1040>>stream
(Bunch of bytes)
endstream
endobj
12 0 obj
<</Filter/FlateDecode/Length 2574/N 3>>stream
(Bunch of bytes)
endstream
endobj
Some information that I was trying to look into.
1: From what I understand there should be no limitation on how much can be fit into the data structures, so size shouldn't be an issue?
Add the DOTALL flag to the Pattern.compile call so that . in your pattern also matches newline characters =)
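To illustrate the effect of the flag, here is a minimal sketch. The question does not show SINGLE_OBJECT_REGEX, so the pattern below is a hypothetical stand-in, and the sample text is made up to resemble the two objects listed above. Without Pattern.DOTALL the dot stops at line terminators, so an object whose body spans several lines is never matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotallDemo {
    // Hypothetical object pattern; the question's SINGLE_OBJECT_REGEX is not shown.
    static final String OBJECT_REGEX = "\\d+ \\d+ obj.*?endobj";

    // Made-up sample resembling two PDF stream objects.
    static final String SAMPLE =
            "10 0 obj\n<</Length 4>>stream\nab\ncd\nendstream\nendobj\n"
            + "12 0 obj\n<</N 3>>stream\nxy\nendstream\nendobj\n";

    // Counts how many times the pattern matches with the given flags.
    static int countMatches(String text, int flags) {
        Matcher m = Pattern.compile(OBJECT_REGEX, flags).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatches(SAMPLE, 0));              // 0
        System.out.println(countMatches(SAMPLE, Pattern.DOTALL)); // 2
    }
}
```

In the question's code the equivalent change would be passing Pattern.DOTALL as the second argument to Pattern.compile(SINGLE_OBJECT_REGEX, ...).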

How to remove duplicate letters from a string? Caesar Cipher [duplicate]

This question already has answers here:
What is a NullPointerException, and how do I fix it?
(12 answers)
Closed 6 years ago.
import java.io.InputStream;
import java.io.OutputStream;
import java.io.IOException;
/**
This class encrypts files using the Caesar cipher.
For decryption, use an encryptor whose key is the
negative of the encryption key.
*/
public class CaesarCipher
{
private String keyWordString;
private String removeDuplicates(String word){
for (int i = 0; i < word.length(); i++) {
if(!keyWordString.contains(String.valueOf(word.charAt(i)))) {
keyWordString += String.valueOf(word.charAt(i));
}
}
return(keyWordString);
}
/**
Constructs a cipher object with a given key word.
@param aKeyWord the encryption key word
*/
public CaesarCipher(String aKeyWord)
{
keyWordString = removeDuplicates(keyWordString);
keyWordString = aKeyWord;
//create the mapping string
//keyWordString = removeDuplicates(keyWordString);
int moreChar = 26 - keyWordString.length();
char ch = 'Z';
for(int j=0; j<moreChar; j++)
{
keyWordString += ch;
ch -= 1;
}
System.out.println("The mapping string is: " + keyWordString);
}
/**
Encrypts the contents of a stream.
@param in the input stream
@param out the output stream
*/
public void encryptStream(InputStream in, OutputStream out)
throws IOException
{
boolean done = false;
while (!done)
{
int next = in.read();
if (next == -1)
{
done = true;
}
else
{
//int encrypted = encrypt(next);
int encrypted = encryptWordKey(next);
System.out.println((char)next + " is encrypted to " + (char)encrypted);
out.write(encrypted);
}
}
}
/**
Encrypts a value.
@param b the value to encrypt (between 0 and 255)
@return the encrypted value
*/
public int encryptWordKey(int b)
{
int pos = b % 65;
return keyWordString.charAt(pos);
}
}
When I run this code, it gives me a runtime error that says:
Exception in thread "main" java.lang.NullPointerException
at CaesarCipher.removeDuplicates(CaesarCipher.java:15)
at CaesarCipher.<init>(CaesarCipher.java:29)
at CaesarEncryptor.main(CaesarEncryptor.java:29)
Let's say I enter JJJJJJJavvvvvaaaaaaa and I want it to yield Java and give me the encrypted code. To differentiate this from the other question that was already asked: I need the removeDuplicates method to execute and the output to print JavaZYWW.....etc. Any help or suggestions would be greatly appreciated.
You forgot to initialize your variable keyWordString.
Do it, for example, where the field is declared (or in the constructor):
private String keyWordString="";
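A minimal sketch of a variant that avoids the shared field altogether, assuming the goal is to keep the first occurrence of each character. The class name is hypothetical. Note that this contains-style check yields Jav rather than Java for the sample input, because the second a has already been seen; it also implies the constructor should call the helper on aKeyWord rather than on the not-yet-assigned field.

```java
public class KeywordDedup {
    // Keeps the first occurrence of each character, in order of appearance.
    static String removeDuplicates(String word) {
        StringBuilder unique = new StringBuilder();
        for (int i = 0; i < word.length(); i++) {
            String c = String.valueOf(word.charAt(i));
            if (unique.indexOf(c) < 0) {
                unique.append(c);
            }
        }
        return unique.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicates("JJJJJJJavvvvvaaaaaaa")); // Jav
    }
}
```

Building the result in a local StringBuilder means the method no longer depends on any field being initialized before it is called.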

How to perform a binary search of a text file

I have a big text file (5Mb) that I use in my Android application. I create the file as a list of pre-sorted Strings, and the file doesn't change once it is created. How can I perform a binary search on the contents of this file, without reading line-by-line to find the matching String?
Since the content of the file does not change, you can break the file into multiple pieces, say A-G, H-N, O-T and U-Z. Checking the first character then immediately cuts the candidate set to roughly a fourth of the original size. Now a linear search will not take as long, or reading a whole piece could be an option. This process could be extended if n/4 is still too large, but the idea is the same: build the search breakdowns into the file structure instead of trying to do it all in memory.
A 5MB file isn't that big - you should be able to read each line into a String[] array, which you can then use java.util.Arrays.binarySearch() to find the line you want. This is my recommended approach.
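A minimal sketch of the in-memory approach, assuming the file is UTF-8 text and already sorted in String.compareTo order (the order Arrays.binarySearch expects). The class and method names are hypothetical, and the demo in main uses a small inline array instead of a real file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class InMemoryLookup {
    // Loads every line of the file into an array.
    static String[] load(Path path) throws IOException {
        List<String> lines = Files.readAllLines(path);
        return lines.toArray(new String[0]);
    }

    // True if target is one of the sorted lines.
    static boolean contains(String[] sortedLines, String target) {
        return Arrays.binarySearch(sortedLines, target) >= 0;
    }

    public static void main(String[] args) {
        String[] words = {"aaah", "abasement", "able", "abnormal"};
        System.out.println(contains(words, "able"));  // true
        System.out.println(contains(words, "zebra")); // false
    }
}
```

If the array is not sorted in compareTo order, the result of Arrays.binarySearch is undefined, so the file has to be generated with the same ordering.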
If you don't want to read the whole file in to your app, then it gets more complicated. If each line of the file is the same length, and the file is already sorted, then you can open the file in RandomAccessFile and perform a binary search yourself by using seek() like this...
// open the file for reading
RandomAccessFile raf = new RandomAccessFile("myfile.txt","r");
String searchValue = "myline";
int lineSize = 50; // bytes per line, including the line terminator
int numberOfLines = (int) (raf.length() / lineSize); // length() returns a long
// perform the binary search...
byte[] lineBuffer = new byte[lineSize];
int bottom = 0;
int top = numberOfLines - 1; // index of the last line
int middle;
while (bottom <= top){
middle = (bottom+top)/2;
raf.seek((long) middle*lineSize); // jump to this line in the file
raf.readFully(lineBuffer); // read the whole line from the file
// convert the line to a String; trim() assumes lines are padded with whitespace
String line = new String(lineBuffer).trim();
int comparison = line.compareTo(searchValue);
if (comparison == 0){
// found it
break;
}
else if (comparison < 0){
// line comes before searchValue
bottom = middle + 1;
}
else {
// line comes after searchValue
top = middle - 1;
}
}
raf.close(); // close the file when you're finished
However, if the file doesn't have fixed-width lines, then you can't easily perform a binary search without loading it into memory first, as you can't quickly jump to a specific line in the file like you can with fixed-width lines.
Here's something I quickly put together. It uses two files, one with the words, the other with the offsets. The format of the offset file is this: the first 10 bits contain the word size, the last 22 bits contain the offset (the word position, for example, aaah would be 0, abasementable would be 4, etc.). It's encoded in big endian (the Java standard). Hope it helps somebody.
word.dat:
aaahabasementableabnormalabnormalityabortionistabortion-rightsabracadabra
wordx.dat:
00 80 00 00 01 20 00 04 00 80 00 0D 01 00 00 11 _____ __________
01 60 00 19 01 60 00 24 01 E0 00 2F 01 60 00 3E _`___`_$___/_`_>
I created these files in C#, but here's the code for it (it uses a txt file with words separated by crlfs)
static void Main(string[] args)
{
const string fIn = @"C:\projects\droid\WriteFiles\input\allwords.txt";
const string fwordxOut = @"C:\projects\droid\WriteFiles\output\wordx.dat";
const string fWordOut = @"C:\projects\droid\WriteFiles\output\word.dat";
int i = 0;
int offset = 0;
int j = 0;
var lines = File.ReadLines(fIn);
FileStream stream = new FileStream(fwordxOut, FileMode.Create, FileAccess.ReadWrite);
using (EndianBinaryWriter wwordxOut = new EndianBinaryWriter(EndianBitConverter.Big, stream))
{
using (StreamWriter wWordOut = new StreamWriter(File.Open(fWordOut, FileMode.Create)))
{
foreach (var line in lines)
{
wWordOut.Write(line);
i = offset | ((int)line.Length << 22); //first 10 bits to the left is the word size
offset = offset + (int)line.Length;
wwordxOut.Write(i);
//if (j == 7)
// break;
j++;
}
}
}
}
And this is the Java code for the binary file search:
public static void binarySearch() {
String TAG = "TEST";
String wordFilePath = Environment.getExternalStorageDirectory().getAbsolutePath() + "/word.dat";
String wordxFilePath = Environment.getExternalStorageDirectory().getAbsolutePath() + "/wordx.dat";
String target = "abracadabra";
boolean targetFound = false;
int searchCount = 0;
try {
RandomAccessFile raf = new RandomAccessFile(wordxFilePath, "r");
RandomAccessFile rafWord = new RandomAccessFile(wordFilePath, "r");
long low = 0;
long high = (raf.length() / 4) - 1;
int cur = 0;
long wordOffset = 0;
int len = 0;
while (high >= low) {
long mid = (low + high) / 2;
raf.seek(mid * 4);
cur = raf.readInt();
Log.v(TAG + "-cur", String.valueOf(cur));
len = cur >>> 22; //word length (unsigned shift, so a set top bit cannot make it negative)
cur = cur & 0x3FFFFF; //first 10 bits are 0
rafWord.seek(cur);
byte [] bytes = new byte[len];
wordOffset = rafWord.read(bytes, 0, len);
Log.v(TAG + "-wordOffset", String.valueOf(wordOffset));
searchCount++;
String str = new String(bytes);
Log.v(TAG, str);
if (target.compareTo(str) < 0) {
high = mid - 1;
} else if (target.compareTo(str) == 0) {
targetFound = true;
break;
} else {
low = mid + 1;
}
}
raf.close();
rafWord.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
if (targetFound == true) {
Log.v(TAG + "-found " , String.valueOf(searchCount));
} else {
Log.v(TAG + "-not found " , String.valueOf(searchCount));
}
}
In a text file with uniform-width characters you could seek to the middle of the interval in question character-wise, read characters until you hit your delimiter, then use the subsequent string as an approximation of the element-wise middle. The problem with doing this on Android, though, is that you apparently can't get random access to a resource (although I suppose you could just reopen it every time). Furthermore, this technique doesn't generalize to maps and sets of other types.
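A minimal sketch of that seek-to-the-middle idea, assuming a sorted, newline-delimited ASCII file opened with RandomAccessFile (so not an Android resource). The class, its methods and the demo word list are hypothetical. The search runs over byte offsets: for each midpoint it finds the first line starting at or after that offset and compares it with the target.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SortedFileSearch {
    // Start offset of the first line that begins at or after position p.
    static long lineStartAtOrAfter(RandomAccessFile raf, long p) throws IOException {
        if (p == 0) {
            return 0;
        }
        raf.seek(p - 1);
        raf.readLine(); // consume the rest of the line containing byte p - 1
        return raf.getFilePointer();
    }

    // Binary search over byte offsets of a sorted, newline-delimited file.
    static boolean contains(RandomAccessFile raf, String target) throws IOException {
        long lo = 0;
        long hi = raf.length();
        while (lo < hi) {
            long mid = (lo + hi) >>> 1;
            raf.seek(lineStartAtOrAfter(raf, mid));
            String line = raf.readLine(); // null at end of file
            if (line == null || line.compareTo(target) >= 0) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        // lo now points at or just before the first line that is >= target.
        raf.seek(lineStartAtOrAfter(raf, lo));
        return target.equals(raf.readLine());
    }

    // Demo helper: writes a small sorted word list to a temp file and searches it.
    static boolean demo(String target) {
        try {
            Path tmp = Files.createTempFile("words", ".txt");
            try {
                Files.write(tmp, "aaah\nabasement\nable\nabnormal\nabracadabra\n"
                        .getBytes(StandardCharsets.US_ASCII));
                try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
                    return contains(raf, target);
                }
            } finally {
                Files.delete(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("abnormal")); // true
        System.out.println(demo("zebra"));    // false
    }
}
```

RandomAccessFile.readLine is unbuffered and treats each byte as one character, so this sketch suits ASCII word lists; for other encodings the bytes of each line would need to be decoded explicitly.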
Another option would be to (using a RandomAccessFile) write an "array" of ints - one for each String - at the beginning of the file then go back and update them with the locations of their corresponding Strings. Again the search will require jumping around.
What I would do (and did do in my own app) is implement a hash set in a file. This one does separate chaining with trees.
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedList;
import java.util.Set;
class StringFileSet {
private static final double loadFactor = 0.75;
public static void makeFile(String fileName, String comment, Set<String> set) throws IOException {
new File(fileName).delete();
RandomAccessFile fout = new RandomAccessFile(fileName, "rw");
//Write comment
fout.writeUTF(comment);
//Make bucket array
int numBuckets = (int)(set.size()/loadFactor);
ArrayList<ArrayList<String>> bucketArray = new ArrayList<ArrayList<String>>(numBuckets);
for (int ii = 0; ii < numBuckets; ii++){
bucketArray.add(new ArrayList<String>());
}
for (String key : set){
bucketArray.get(Math.abs(key.hashCode()%numBuckets)).add(key);
}
//Sort key lists in preparation for creating trees
for (ArrayList<String> keyList : bucketArray){
Collections.sort(keyList);
}
//Make queues in preparation for creating trees
class NodeInfo{
public final int lower;
public final int upper;
public final long callingOffset;
public NodeInfo(int lower, int upper, long callingOffset){
this.lower = lower;
this.upper = upper;
this.callingOffset = callingOffset;
}
}
ArrayList<LinkedList<NodeInfo>> queueList = new ArrayList<LinkedList<NodeInfo>>(numBuckets);
for (int ii = 0; ii < numBuckets; ii++){
queueList.add(new LinkedList<NodeInfo>());
}
//Write bucket array
fout.writeInt(numBuckets);
for (int index = 0; index < numBuckets; index++){
queueList.get(index).add(new NodeInfo(0, bucketArray.get(index).size()-1, fout.getFilePointer()));
fout.writeInt(-1);
}
//Write trees
for (int bucketIndex = 0; bucketIndex < numBuckets; bucketIndex++){
while (queueList.get(bucketIndex).size() != 0){
NodeInfo nodeInfo = queueList.get(bucketIndex).poll();
if (nodeInfo.lower <= nodeInfo.upper){
//Set respective pointer in parent node
fout.seek(nodeInfo.callingOffset);
fout.writeInt((int)(fout.length() - (nodeInfo.callingOffset + 4))); //Distance instead of absolute position so that the get method can use a DataInputStream
fout.seek(fout.length());
int middle = (nodeInfo.lower + nodeInfo.upper)/2;
//Key
fout.writeUTF(bucketArray.get(bucketIndex).get(middle));
//Left child
queueList.get(bucketIndex).add(new NodeInfo(nodeInfo.lower, middle-1, fout.getFilePointer()));
fout.writeInt(-1);
//Right child
queueList.get(bucketIndex).add(new NodeInfo(middle+1, nodeInfo.upper, fout.getFilePointer()));
fout.writeInt(-1);
}
}
}
fout.close();
}
private final String fileName;
private final int numBuckets;
private final int bucketArrayOffset;
public StringFileSet(String fileName) throws IOException {
this.fileName = fileName;
DataInputStream fin = new DataInputStream(new BufferedInputStream(new FileInputStream(fileName)));
int numBytes = fin.readUnsignedShort(); //writeUTF stores the length as an unsigned 16-bit value
fin.skipBytes(numBytes);
this.numBuckets = fin.readInt();
this.bucketArrayOffset = numBytes + 6;
fin.close();
}
public boolean contains(String key) throws IOException {
boolean containsKey = false;
DataInputStream fin = new DataInputStream(new BufferedInputStream(new FileInputStream(this.fileName)));
fin.skipBytes(4*(Math.abs(key.hashCode()%this.numBuckets)) + this.bucketArrayOffset);
int distance = fin.readInt();
while (distance != -1){
fin.skipBytes(distance);
String candidate = fin.readUTF();
if (key.compareTo(candidate) < 0){
distance = fin.readInt();
}else if (key.compareTo(candidate) > 0){
fin.skipBytes(4);
distance = fin.readInt();
}else{
fin.skipBytes(8);
containsKey = true;
break;
}
}
fin.close();
return containsKey;
}
}
A test program
import java.io.File;
import java.io.IOException;
import java.util.HashSet;
class Test {
public static void main(String[] args) throws IOException {
HashSet<String> stringMemorySet = new HashSet<String>();
stringMemorySet.add("red");
stringMemorySet.add("yellow");
stringMemorySet.add("blue");
StringFileSet.makeFile("stringSet", "Provided under ... included in all copies and derivatives ...", stringMemorySet);
StringFileSet stringFileSet = new StringFileSet("stringSet");
System.out.println("orange -> " + stringFileSet.contains("orange"));
System.out.println("red -> " + stringFileSet.contains("red"));
System.out.println("yellow -> " + stringFileSet.contains("yellow"));
System.out.println("blue -> " + stringFileSet.contains("blue"));
new File("stringSet").delete();
System.out.println();
}
}
You'll also need to pass a Context to it, if and when you modify it for Android, so it can access the getResources() method.
You'll probably also want to stop the Android build tools from compressing the file, which can apparently only be done - if you're working with the GUI - by changing the file's extension to something such as jpg. This made the process about 100 to 300 times faster in my app.
You might also look into giving yourself more memory by using the NDK.
Though it might sound like overkill, don't store data you need to do this with as a flat file. Make a database and query the data in the database. This should be both effective and fast.
Here is a function that I believe works (I use it in practice). Lines can have any length. You have to supply a lambda called "nav" that does the actual line check, which keeps you flexible about the file's ordering (case-sensitive, case-insensitive, ordered by a certain field, etc.).
import java.io.File;
import java.io.RandomAccessFile;
class main {
// returns pair(character range in file, line) or null if not found
// if no exact match found, return line above
// nav takes a line and returns -1 (move up), 0 (found) or 1 (move down)
// The line supplied to nav is stripped of the trailing \n, but not the \r
// UTF-8 encoding is assumed
static Pair<LongRange, String> binarySearchForLineInTextFile(File file, IF1<String, Integer> nav) {
long length = l(file);
int bufSize = 1024;
RandomAccessFile raf = randomAccessFileForReading(file);
try {
long min = 0, max = length;
int direction = 0;
Pair<LongRange, String> possibleResult = null;
while (min < max) {
ping();
long middle = (min + max) / 2;
long lineStart = raf_findBeginningOfLine(raf, middle, bufSize);
long lineEnd = raf_findEndOfLine(raf, middle, bufSize);
String line = fromUtf8(raf_readFilePart(raf, lineStart, (int) (lineEnd - 1 - lineStart)));
direction = nav.get(line);
possibleResult = (Pair<LongRange, String>) new Pair(new LongRange(lineStart, lineEnd), line);
if (direction == 0) return possibleResult;
// asserts are to assure that loop terminates
if (direction < 0) max = assertLessThan(max, lineStart);
else min = assertBiggerThan(min, lineEnd);
}
if (direction >= 0) return possibleResult;
long lineStart = raf_findBeginningOfLine(raf, min - 1, bufSize);
String line = fromUtf8(raf_readFilePart(raf, lineStart, (int) (min - 1 - lineStart)));
return new Pair(new LongRange(lineStart, min), line);
} finally {
_close(raf);
}
}
static int l(byte[] a) {
return a == null ? 0 : a.length;
}
static long l(File f) {
return f == null ? 0 : f.length();
}
static RandomAccessFile randomAccessFileForReading(File path) {
try {
return new RandomAccessFile(path, "r");
} catch (Exception __e) {
throw rethrow(__e);
}
}
// you can change this function to allow interrupting long calculations from the outside. just throw a RuntimeException.
static boolean ping() {
return true;
}
static long raf_findBeginningOfLine(RandomAccessFile raf, long pos, int bufSize) {
try {
byte[] buf = new byte[bufSize];
while (pos > 0) {
long start = Math.max(pos - bufSize, 0);
raf.seek(start);
raf.readFully(buf, 0, (int) Math.min(pos - start, bufSize));
int idx = lastIndexOf_byteArray(buf, (byte) '\n');
if (idx >= 0) return start + idx + 1;
pos = start;
}
return 0;
} catch (Exception __e) {
throw rethrow(__e);
}
}
static long raf_findEndOfLine(RandomAccessFile raf, long pos, int bufSize) {
try {
byte[] buf = new byte[bufSize];
long length = raf.length();
while (pos < length) {
raf.seek(pos);
raf.readFully(buf, 0, (int) Math.min(length - pos, bufSize));
int idx = indexOf_byteArray(buf, (byte) '\n');
if (idx >= 0) return pos + idx + 1;
pos += bufSize;
}
return length;
} catch (Exception __e) {
throw rethrow(__e);
}
}
static String fromUtf8(byte[] bytes) {
try {
return bytes == null ? null : new String(bytes, "UTF-8");
} catch (Exception __e) {
throw rethrow(__e);
}
}
static byte[] raf_readFilePart(RandomAccessFile raf, long start, int l) {
try {
byte[] buf = new byte[l];
raf.seek(start);
raf.readFully(buf);
return buf;
} catch (Exception __e) {
throw rethrow(__e);
}
}
static <A> A assertLessThan(A a, A b) {
assertTrue(cmp(b, a) < 0);
return b;
}
static <A> A assertBiggerThan(A a, A b) {
assertTrue(cmp(b, a) > 0);
return b;
}
static void _close(AutoCloseable c) {
try {
if (c != null)
c.close();
} catch (Throwable e) {
throw rethrow(e);
}
}
static RuntimeException rethrow(Throwable t) {
throw t instanceof RuntimeException ? (RuntimeException) t : new RuntimeException(t);
}
static int lastIndexOf_byteArray(byte[] a, byte b) {
for (int i = l(a) - 1; i >= 0; i--)
if (a[i] == b)
return i;
return -1;
}
static int indexOf_byteArray(byte[] a, byte b) {
int n = l(a);
for (int i = 0; i < n; i++)
if (a[i] == b)
return i;
return -1;
}
static boolean assertTrue(boolean b) {
if (!b)
throw fail("oops");
return b;
}
static int cmp(Object a, Object b) {
if (a == null) return b == null ? 0 : -1;
if (b == null) return 1;
return ((Comparable) a).compareTo(b);
}
static RuntimeException fail(String msg) {
throw new RuntimeException(msg == null ? "" : msg);
}
final static class LongRange {
long start, end;
LongRange(long start, long end) {
this.end = end;
this.start = start;
}
public String toString() {
return "[" + start + ";" + end + "]";
}
}
interface IF1<A, B> {
B get(A a);
}
static class Pair<A, B> {
A a;
B b;
Pair(A a, B b) {
this.b = b;
this.a = a;
}
public String toString() {
return "<" + a + ", " + b + ">";
}
}
}
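As an illustration of what such a "nav" lambda might look like, here is a small sketch for a file sorted with String.compareTo. The class NavExample and the method exactMatch are hypothetical names, and the IF1 interface is repeated only so the snippet compiles on its own.

```java
// Hypothetical helper showing one possible "nav" for a file sorted with String.compareTo.
class NavExample {
    // Same shape as the IF1 interface in the listing above, repeated so this compiles alone.
    interface IF1<A, B> {
        B get(A a);
    }

    // Returns -1 (move up), 0 (found) or 1 (move down) for the line handed in.
    static IF1<String, Integer> exactMatch(String key) {
        return line -> {
            // The search strips '\n' but not '\r', so drop a trailing '\r' here.
            String clean = line.endsWith("\r") ? line.substring(0, line.length() - 1) : line;
            return Integer.signum(key.compareTo(clean));
        };
    }
}
```

For a file sorted case-insensitively, key.compareToIgnoreCase(clean) could be used in place of key.compareTo(clean). The lambda has to agree with the order the file was actually sorted in, otherwise the search will move in the wrong direction.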

Binary search in a sorted (memory-mapped ?) file in Java

I am struggling to port a Perl program to Java, and learning Java as I go. A central component of the original program is a Perl module that does string prefix lookups in a +500 GB sorted text file using binary search
(essentially, "seek" to a byte offset in the middle of the file, backtrack to nearest newline, compare line prefix with the search string, "seek" to half/double that byte offset, repeat until found...)
I have experimented with several database solutions but found that nothing beats this in sheer lookup speed with data sets of this size. Do you know of any existing Java library that implements such functionality? Failing that, could you point me to some idiomatic example code that does random access reads in text files?
Alternatively, I am not familiar with the new (?) Java I/O libraries but would it be an option to memory-map the 500 GB text file (I'm on a 64-bit machine with memory to spare) and do binary search on the memory-mapped byte array? I would be very interested to hear any experiences you have to share about this and similar problems.
I am a big fan of Java's MappedByteBuffers for situations like this. It is blazing fast. Below is a snippet I put together for you that maps a buffer to the file, seeks to the middle, and then searches backwards to a newline character. This should be enough to get you going?
I have similar code (seek, read, repeat until done) in my own application, benchmarked java.io streams against MappedByteBuffer in a production environment, and posted the results on my blog (Geekomatic posts tagged 'java.nio') with raw data, graphs and all.
Two second summary? My MappedByteBuffer-based implementation was about 275% faster. YMMV.
To work for files larger than ~2GB, which is a problem because of the cast and .position(int pos), I've crafted a paging algorithm backed by a list of MappedByteBuffers. You'll need to be working on a 64-bit system for this to work with files larger than 2-4GB because MBBs use the OS's virtual memory system to work their magic.
public class StusMagicLargeFileReader {
private static final long PAGE_SIZE = Integer.MAX_VALUE;
private List<MappedByteBuffer> buffers = new ArrayList<MappedByteBuffer>();
private final byte raw[] = new byte[1];
public static void main(String[] args) throws IOException {
File file = new File("/Users/stu/test.txt");
FileChannel fc = (new FileInputStream(file)).getChannel();
StusMagicLargeFileReader buffer = new StusMagicLargeFileReader(fc);
long position = file.length() / 2;
String candidate = buffer.getString(position--);
while (position >= 0 && !candidate.equals("\n")) // compare with a String; equals('\n') would compare against a Character and never match
candidate = buffer.getString(position--);
//have newline position or start of file...do other stuff
}
StusMagicLargeFileReader(FileChannel channel) throws IOException {
long start = 0, length = 0;
for (long index = 0; start + length < channel.size(); index++) {
if ((channel.size() / PAGE_SIZE) == index)
length = (channel.size() - index * PAGE_SIZE) ;
else
length = PAGE_SIZE;
start = index * PAGE_SIZE;
buffers.add((int) index, channel.map(FileChannel.MapMode.READ_ONLY, start, length)); // List.add needs an int index
}
}
public String getString(long bytePosition) {
int page = (int) (bytePosition / PAGE_SIZE);
int index = (int) (bytePosition % PAGE_SIZE);
raw[0] = buffers.get(page).get(index);
return new String(raw);
}
}
I have the same problem. I am trying to find all lines that start with some prefix in a sorted file.
Here is a method I cooked up which is largely a port of Python code found here: http://www.logarithmic.net/pfh/blog/01186620415
I have tested it but not thoroughly just yet. It does not use memory mapping, though.
public static List<String> binarySearch(String filename, String string) {
List<String> result = new ArrayList<String>();
try {
File file = new File(filename);
RandomAccessFile raf = new RandomAccessFile(file, "r");
long low = 0;
long high = file.length();
long p = -1;
while (low < high) {
long mid = (low + high) / 2;
p = mid;
while (p >= 0) {
raf.seek(p);
char c = (char) raf.readByte();
//System.out.println(p + "\t" + c);
if (c == '\n')
break;
p--;
}
if (p < 0)
raf.seek(0);
String line = raf.readLine();
//System.out.println("-- " + mid + " " + line);
if (line != null && line.compareTo(string) < 0) // readLine returns null after the final newline
low = mid + 1;
else
high = mid;
}
p = low;
while (p >= 0) {
raf.seek(p);
if (((char) raf.readByte()) == '\n')
break;
p--;
}
if (p < 0)
raf.seek(0);
while (true) {
String line = raf.readLine();
if (line == null || !line.startsWith(string))
break;
result.add(line);
}
raf.close();
} catch (IOException e) {
System.out.println("IOException:");
e.printStackTrace();
}
return result;
}
I am not aware of any library that has that functionality. However, correct code for an external binary search in Java should look similar to this:
class ExternalBinarySearch {
final RandomAccessFile file;
final Comparator<String> test; // tests the element given as search parameter with the line. Insert a PrefixComparator here
public ExternalBinarySearch(File f, Comparator<String> test) throws FileNotFoundException {
this.file = new RandomAccessFile(f, "r");
this.test = test;
}
public String search(String element) throws IOException {
long l = file.length();
return search(element, -1, l-1);
}
/**
* Searches the given element in the range [low,high]. The low value of -1 is a special case to denote the beginning of a file.
* In contrast to every other line, a line at the beginning of a file doesn't need a \n directly before the line
*/
private String search(String element, long low, long high) throws IOException {
if(high - low < 1024) {
// search directly
long p = low;
while(p < high) {
String line = nextLine(p);
int r = test.compare(line,element);
if(r > 0) {
return null;
} else if (r < 0) {
p += line.length();
} else {
return line;
}
}
return null;
} else {
long m = low + ((high - low) / 2);
String line = nextLine(m);
int r = test.compare(line, element);
if(r > 0) {
return search(element, low, m);
} else if (r < 0) {
return search(element, m, high);
} else {
return line;
}
}
}
private String nextLine(long low) throws IOException {
if(low == -1) { // Beginning of file
file.seek(0);
} else {
file.seek(low);
}
int bufferLength = 65 * 1024;
byte[] buffer = new byte[bufferLength];
int r = file.read(buffer);
int lineBeginIndex = -1;
// search beginning of line
if(low == -1) { //beginning of file
lineBeginIndex = 0;
} else {
//normal mode
for(int i = 0; i < 1024; i++) {
if(buffer[i] == '\n') {
lineBeginIndex = i + 1;
break;
}
}
}
if(lineBeginIndex == -1) {
// no line begins within next 1024 bytes
return null;
}
int start = lineBeginIndex;
for(int i = start; i < r; i++) {
if(buffer[i] == '\n') {
// Found end of line
return new String(buffer, lineBeginIndex, i - lineBeginIndex + 1);
}
}
throw new IllegalArgumentException("Line too long");
}
}
Please note: I made up this code ad hoc. Corner cases are not tested nearly well enough, the code assumes that no single line is larger than 64K, etc.
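The listing above asks for a PrefixComparator but does not define one. One possible sketch, assuming the first argument is always the line read from the file and the second is the prefix being searched for (which is how search() calls it):

```java
import java.util.Comparator;

// Hypothetical comparator; only the name comes from the comment in the listing above.
class PrefixComparator implements Comparator<String> {
    // First argument is a line from the file, second is the prefix searched for.
    @Override
    public int compare(String line, String prefix) {
        if (line.startsWith(prefix)) {
            return 0; // the line matches the prefix
        }
        return line.compareTo(prefix); // otherwise plain lexicographic order
    }
}
```

Note that this is deliberately asymmetric, so it does not satisfy the general Comparator contract and should not be handed to sorting routines. It is only meant for the line-versus-prefix test in the search.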
I also think that building an index of the offsets where lines start might be a good idea. For a 500 GB file, that index should be stored in an index file. You should gain a not-so-small constant factor with that index because then there is no need to search for the next line in each step.
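A rough sketch of such an index file, assuming one 8-byte offset per line written with DataOutputStream. The class name LineIndexFile and its methods are made up for illustration. RandomAccessFile.readLine is unbuffered and decodes each byte as one character, so for a file of this size the scan would need a buffered reader; the sketch favours brevity.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical on-disk index: one 8-byte offset per line of the data file.
class LineIndexFile {
    // Scans the data file once, writes the start offset of every line, returns the line count.
    static long build(String dataFile, String indexFile) throws IOException {
        long lines = 0;
        try (RandomAccessFile data = new RandomAccessFile(dataFile, "r");
             DataOutputStream index = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream(indexFile)))) {
            long pos = 0;
            while (data.readLine() != null) {
                index.writeLong(pos);
                pos = data.getFilePointer();
                lines++;
            }
        }
        return lines;
    }

    // Reads line number n (0-based) with two seeks and no newline scanning.
    static String line(String dataFile, String indexFile, long n) throws IOException {
        try (RandomAccessFile index = new RandomAccessFile(indexFile, "r");
             RandomAccessFile data = new RandomAccessFile(dataFile, "r")) {
            index.seek(8 * n);
            data.seek(index.readLong());
            return data.readLine();
        }
    }
}
```

With this in place the binary search can run over line numbers 0 to count-1 instead of byte offsets, so every probe lands exactly on a line start.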
I know that was not the question, but building a prefix tree data structure such as a (Patricia) trie (on disk/SSD) might be a good idea for the prefix search.
This is a simple example of what you want to achieve. I would probably first index the file, keeping track of the file position for each string. I'm assuming the strings are separated by newlines (or carriage returns):
RandomAccessFile file = new RandomAccessFile("filename.txt", "r");
List<Long> indexList = new ArrayList<>();
long pos = 0;
while (file.readLine() != null)
{
Long linePos = pos; // autoboxing; the Long(long) constructor is deprecated
indexList.add(linePos);
pos = file.getFilePointer();
}
int indexSize = indexList.size();
Long[] indexArray = new Long[indexSize];
indexList.toArray(indexArray);
The last step is to convert to an array for a slight speed improvement when doing lots of lookups. I would probably convert the Long[] to a long[] also, but I did not show that above. Finally the code to read the string from a given indexed position:
int i; // Initialize this appropriately for your algorithm.
file.seek(indexArray[i]);
String line = file.readLine();
// At this point, line contains the string #i.
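Putting the pieces together, a binary search over the line index could be sketched as follows. It assumes the lines are sorted by String.compareTo and contain only single-byte characters (readLine does not decode UTF-8). The class name IndexedLineSearch and its methods are hypothetical, and the index is held as a primitive long[] as suggested above.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper around the indexing and lookup steps shown above.
class IndexedLineSearch {
    // Builds a primitive long[] of line-start offsets.
    static long[] buildIndex(RandomAccessFile file) throws IOException {
        List<Long> offsets = new ArrayList<>();
        file.seek(0);
        long pos = 0;
        while (file.readLine() != null) {
            offsets.add(pos);
            pos = file.getFilePointer();
        }
        long[] index = new long[offsets.size()];
        for (int i = 0; i < index.length; i++) {
            index[i] = offsets.get(i);
        }
        return index;
    }

    // Returns the number of the line equal to key, or -1 if absent.
    static int find(RandomAccessFile file, long[] index, String key) throws IOException {
        int low = 0, high = index.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            file.seek(index[mid]);
            int cmp = key.compareTo(file.readLine());
            if (cmp == 0) return mid;
            if (cmp < 0) high = mid - 1; else low = mid + 1;
        }
        return -1;
    }
}
```

For a prefix search, the comparison would use startsWith on the line read at index[mid] before falling back to compareTo.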
If you are dealing with a 500GB file, then you might want to use a faster lookup method than binary search - namely a radix sort which is essentially a variant of hashing. The best method for doing this really depends on your data distributions and types of lookup, but if you are looking for string prefixes there should be a good way to do this.
I posted an example of a radix sort solution for integers, but you can use the same idea - basically to cut down the sort time by dividing the data into buckets, then using O(1) lookup to retrieve the bucket of data that is relevant.
Option Strict On
Option Explicit On
Module Module1
Private Const MAX_SIZE As Integer = 100000
Private m_input(MAX_SIZE) As Integer
Private m_table(MAX_SIZE) As List(Of Integer)
Private m_randomGen As New Random()
Private m_operations As Integer = 0
Private Sub generateData()
' fill with random numbers between 0 and MAX_SIZE - 1
For i = 0 To MAX_SIZE - 1
m_input(i) = m_randomGen.Next(0, MAX_SIZE - 1)
Next
End Sub
Private Sub sortData()
For i As Integer = 0 To MAX_SIZE - 1
Dim x = m_input(i)
If m_table(x) Is Nothing Then
m_table(x) = New List(Of Integer)
End If
m_table(x).Add(x)
' clearly this is simply going to be MAX_SIZE -1
m_operations = m_operations + 1
Next
End Sub
Private Sub printData(ByVal start As Integer, ByVal finish As Integer)
If start < 0 Or start > MAX_SIZE - 1 Then
Throw New Exception("printData - start out of range")
End If
If finish < 0 Or finish > MAX_SIZE - 1 Then
Throw New Exception("printData - finish out of range")
End If
For i As Integer = start To finish
If m_table(i) IsNot Nothing Then
For Each x In m_table(i)
Console.WriteLine(x)
Next
End If
Next
End Sub
' run the entire sort, but just print out the first 100 for verification purposes
Private Sub test()
m_operations = 0
generateData()
Console.WriteLine("Time started = " & Now.ToString())
sortData()
Console.WriteLine("Time finished = " & Now.ToString & " Number of operations = " & m_operations.ToString())
' print out a random 100 segment from the sorted array
Dim start As Integer = m_randomGen.Next(0, MAX_SIZE - 101)
printData(start, start + 100)
End Sub
Sub Main()
test()
Console.ReadLine()
End Sub
End Module
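The same bucket idea can be carried over to string prefixes. The following Java sketch is only an in-memory illustration, with a hypothetical class PrefixBuckets that buckets lines by their first character. For a 500 GB file the buckets would presumably hold file offsets rather than the strings themselves.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory illustration of the bucket idea for string prefixes.
class PrefixBuckets {
    private static final int BUCKETS = 256; // bucket chosen from the low 8 bits of the first char
    private final List<List<String>> table = new ArrayList<>(BUCKETS);

    PrefixBuckets(List<String> lines) {
        for (int i = 0; i < BUCKETS; i++) {
            table.add(new ArrayList<>());
        }
        for (String line : lines) {
            if (!line.isEmpty()) {
                table.get(line.charAt(0) & 0xFF).add(line);
            }
        }
    }

    // O(1) jump to the bucket, then a scan of that bucket only.
    List<String> withPrefix(String prefix) {
        List<String> result = new ArrayList<>();
        if (prefix.isEmpty()) {
            return result; // an empty prefix would match every bucket; not handled here
        }
        // Characters above 255 share buckets; the startsWith check keeps the result correct.
        for (String line : table.get(prefix.charAt(0) & 0xFF)) {
            if (line.startsWith(prefix)) {
                result.add(line);
            }
        }
        return result;
    }
}
```

Bucketing on more than one leading character, or binary-searching inside a sorted bucket, would cut the scan further. Which of these pays off depends on the data distribution, as noted above.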
I posted a gist, https://gist.github.com/mikee805/c6c2e6a35032a3ab74f643a1d0f8249c, that is a fairly complete example based on what I found on Stack Overflow and some blogs. Hopefully someone else can use it.
import static java.nio.file.Files.isWritable;
import static java.nio.file.StandardOpenOption.READ;
import static org.apache.commons.io.FileUtils.forceMkdir;
import static org.apache.commons.io.IOUtils.closeQuietly;
import static org.apache.commons.lang3.StringUtils.isBlank;
import static org.apache.commons.lang3.StringUtils.trimToNull;
import java.io.File;
import java.io.IOException;
import java.nio.Buffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
public class FileUtils {
private FileUtils() {
}
private static boolean found(final String candidate, final String prefix) {
return isBlank(candidate) || candidate.startsWith(prefix);
}
private static boolean before(final String candidate, final String prefix) {
return prefix.compareTo(candidate.substring(0, Math.min(candidate.length(), prefix.length()))) < 0; // guard against lines shorter than the prefix
}
public static MappedByteBuffer getMappedByteBuffer(final Path path) {
FileChannel fileChannel = null;
try {
fileChannel = FileChannel.open(path, READ);
return fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size()).load();
}
catch (Exception e) {
throw new RuntimeException(e);
}
finally {
closeQuietly(fileChannel);
}
}
public static String binarySearch(final String prefix, final MappedByteBuffer buffer) {
if (buffer == null) {
return null;
}
try {
long low = 0;
long high = buffer.limit();
while (low < high) {
int mid = (int) ((low + high) / 2);
final String candidate = getLine(mid, buffer);
if (found(candidate, prefix)) {
return trimToNull(candidate);
}
else if (before(candidate, prefix)) {
high = mid;
}
else {
low = mid + 1;
}
}
}
catch (Exception e) {
throw new RuntimeException(e);
}
return null;
}
private static String getLine(int position, final MappedByteBuffer buffer) {
// search backwards to find the preceding newline
// then search forwards again until the next new line
// return the string in between
final StringBuilder stringBuilder = new StringBuilder();
// walk it back
char candidate = (char)buffer.get(position);
while (position > 0 && candidate != '\n') {
candidate = (char)buffer.get(--position);
}
// we either are at the beginning of the file or a new line
if (position == 0) {
// we are at the beginning at the first char
candidate = (char)buffer.get(position);
stringBuilder.append(candidate);
}
// there is/are char(s) after new line / first char
if (isInBuffer(buffer, position)) {
//first char after new line
candidate = (char)buffer.get(++position);
stringBuilder.append(candidate);
//walk it forward
while (isInBuffer(buffer, position) && candidate != ('\n')) {
candidate = (char)buffer.get(++position);
stringBuilder.append(candidate);
}
}
return stringBuilder.toString();
}
private static boolean isInBuffer(final Buffer buffer, int position) {
return position + 1 < buffer.limit();
}
public static File getOrCreateDirectory(final String dirName) {
final File directory = new File(dirName);
try {
forceMkdir(directory);
isWritable(directory.toPath());
}
catch (IOException e) {
throw new RuntimeException(e);
}
return directory;
}
}
I had a similar problem, so I created a (Scala) library from the solutions provided in this thread:
https://github.com/avast/BigMap
It contains utilities for sorting a huge file and for binary search in the sorted file.
If you truly want to try memory mapping the file, I found a tutorial on how to use memory mapping in Java nio.
