Speeding up file read - Java

I have a 1.7 GB file with the following format:
String Long String Long String Long String Long ... etc
Essentially, String is a key and Long is a value in a HashMap I'm interested in initialising before running anything else in my application.
My current code is:
RandomAccessFile raf=new RandomAccessFile("/home/map.dat","r");
raf.seek(0);
while(raf.getFilePointer()!=raf.length()){
String name=raf.readUTF();
long offset=raf.readLong();
map.put(name,offset);
}
This takes about 12 minutes to complete, and I'm sure there are better ways of doing this, so I would appreciate any help or pointers.
thanks
Update, following EJP's suggestion:
EJP, thank you for your suggestion; I hope this is what you meant. Correct me if this is wrong:
DataInputStream dis=null;
try{
dis=new DataInputStream(new BufferedInputStream(new FileInputStream("/home/map.dat")));
while(true){
String name=dis.readUTF();
long offset=dis.readLong();
map.put(name, offset);
}
}catch (EOFException eofe){
try{
dis.close();
}catch (IOException ioe){
ioe.printStackTrace();
}
}

Use a DataInputStream wrapped around a BufferedInputStream wrapped around a FileInputStream.
Instead of at least four system calls per iteration (checking the length and the current file pointer, plus who knows how many reads to get the string and the long), just call readUTF() and readLong() until you get an EOFException.
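For reference, a minimal sketch of that approach using try-with-resources (the path and the map type are taken from the question; error handling is kept to the essentials):
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class MapLoader {
    public static Map<String, Long> load(String path) throws IOException {
        Map<String, Long> map = new HashMap<>();
        try (DataInputStream dis = new DataInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            while (true) {
                String name = dis.readUTF();   // key
                long offset = dis.readLong();  // value
                map.put(name, offset);
            }
        } catch (EOFException expected) {
            // normal termination: the whole file has been consumed
        }
        return map;
    }
}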

I would construct the file so it can be used in place, i.e. without loading it like this. As you have variable-length records, you can construct an array of the location of each record, then place the keys in order so you can perform a binary search for the data (or you can use a custom hash table). You can then wrap this with methods that hide the fact that the data is actually stored in a file rather than turned into data objects.
If you do all this, the "load" phase becomes redundant and you won't need to create so many objects.
This is a long example but hopefully shows what is possible.
import vanilla.java.chronicle.Chronicle;
import vanilla.java.chronicle.Excerpt;
import vanilla.java.chronicle.impl.IndexedChronicle;
import vanilla.java.chronicle.tools.ChronicleTest;
import java.io.IOException;
import java.util.*;
public class Main {
static final String TMP = System.getProperty("java.io.tmpdir");
public static void main(String... args) throws IOException {
String baseName = TMP + "/test";
String[] keys = generateAndSave(baseName, 100 * 1000 * 1000);
long start = System.nanoTime();
SavedSortedMap map = new SavedSortedMap(baseName);
for (int i = 0; i < keys.length / 100; i++) {
long l = map.lookup(keys[i]);
// System.out.println(keys[i] + ": " + l);
}
map.close();
long time = System.nanoTime() - start;
System.out.printf("Load of %,d records and lookup of %,d keys took %.3f seconds%n",
keys.length, keys.length / 100, time / 1e9);
}
static SortedMap<String, Long> generateMap(int keys) {
SortedMap<String, Long> ret = new TreeMap<>();
while (ret.size() < keys) {
long n = ret.size();
String key = Long.toString(n);
while (key.length() < 9)
key = '0' + key;
ret.put(key, n);
}
return ret;
}
static void saveData(SortedMap<String, Long> map, String baseName) throws IOException {
Chronicle chronicle = new IndexedChronicle(baseName);
Excerpt excerpt = chronicle.createExcerpt();
for (Map.Entry<String, Long> entry : map.entrySet()) {
excerpt.startExcerpt(2 + entry.getKey().length() + 8);
excerpt.writeUTF(entry.getKey());
excerpt.writeLong(entry.getValue());
excerpt.finish();
}
chronicle.close();
}
static class SavedSortedMap {
final Chronicle chronicle;
final Excerpt excerpt;
final String midKey;
final long size;
SavedSortedMap(String baseName) throws IOException {
chronicle = new IndexedChronicle(baseName);
excerpt = chronicle.createExcerpt();
size = chronicle.size();
excerpt.index(size / 2);
midKey = excerpt.readUTF();
}
// find exact match or take the value after.
public long lookup(CharSequence key) {
if (compareTo(key, midKey) < 0)
return lookup0(0, size / 2, key);
return lookup0(size / 2, size, key);
}
private final StringBuilder tmp = new StringBuilder();
private long lookup0(long from, long to, CharSequence key) {
long mid = (from + to) >>> 1;
excerpt.index(mid);
tmp.setLength(0);
excerpt.readUTF(tmp);
if (to - from <= 1)
return excerpt.readLong();
int cmp = compareTo(key, tmp);
if (cmp < 0)
return lookup0(from, mid, key);
if (cmp > 0)
return lookup0(mid, to, key);
return excerpt.readLong();
}
public static int compareTo(CharSequence a, CharSequence b) {
int lim = Math.min(a.length(), b.length());
for (int k = 0; k < lim; k++) {
char c1 = a.charAt(k);
char c2 = b.charAt(k);
if (c1 != c2)
return c1 - c2;
}
return a.length() - b.length();
}
public void close() {
chronicle.close();
}
}
private static String[] generateAndSave(String baseName, int keyCount) throws IOException {
SortedMap<String, Long> map = generateMap(keyCount);
saveData(map, baseName);
ChronicleTest.deleteOnExit(baseName);
String[] keys = map.keySet().toArray(new String[map.size()]);
Collections.shuffle(Arrays.asList(keys));
return keys;
}
}
This generates 2 GB of raw data and performs a million lookups. It's written in such a way that loading and lookup use very little heap (well under 1 MB).
ls -l /tmp/test*
-rw-rw---- 1 peter peter 2013265920 Dec 11 13:23 /tmp/test.data
-rw-rw---- 1 peter peter 805306368 Dec 11 13:23 /tmp/test.index
/tmp/test created.
/tmp/test, size=100000000
Load of 100,000,000 records and lookup of 1,000,000 keys took 10.945 seconds
Using a hash table lookup would be faster per lookup, as it is O(1) instead of O(log N), but more complex to implement.
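As a rough illustration of that trade-off, a fixed-slot hash index kept in its own mapped file could look something like this (a sketch only; the slot count, the linear-probing scheme and the keyAt accessor for reading a key back from the data file are assumptions, not part of the Chronicle example above):
import java.io.RandomAccessFile;
import java.nio.LongBuffer;
import java.nio.channels.FileChannel;
import java.util.function.LongFunction;

class FileHashIndex {
    private final LongBuffer slots;            // slot i holds recordIndex + 1, or 0 if the slot is empty
    private final LongFunction<String> keyAt;  // assumed accessor: reads the key stored at a record index

    FileHashIndex(String indexFile, int slotCount, LongFunction<String> keyAt) throws Exception {
        FileChannel ch = new RandomAccessFile(indexFile, "rw").getChannel();
        this.slots = ch.map(FileChannel.MapMode.READ_WRITE, 0, (long) slotCount * 8).asLongBuffer();
        this.keyAt = keyAt;
    }

    void put(String key, long recordIndex) {
        int slot = indexFor(key);
        while (slots.get(slot) != 0)               // linear probing on collision
            slot = (slot + 1) % slots.capacity();
        slots.put(slot, recordIndex + 1);
    }

    long lookup(String key) {
        long entry;
        for (int slot = indexFor(key); (entry = slots.get(slot)) != 0; slot = (slot + 1) % slots.capacity())
            if (keyAt.apply(entry - 1).equals(key))
                return entry - 1;                  // record index of the match
        return -1;                                 // not present
    }

    private int indexFor(String key) {
        return Math.floorMod(key.hashCode(), slots.capacity());
    }
}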

Related

Optimizing Recursive Function in Java

I'm having a problem optimizing a function, AltOper. On the set {a, b, c} there is a given multiplication (a binary operation) which does not follow the associative law. AltOper gets a string consisting of a, b, c, such as "abbac", and calculates all possible answers for the operation, such as ((ab)b)(ac) = c, (a(b(ba)))c = a. AltOper counts every operation (without duplication) which ends with a, b, c, and returns the counts as a triple.
Though this code runs well for small inputs, it takes too much time for slightly bulkier ones. I tried memoization for some small ones, but apparently it's not enough. After struggling for some hours, I finally figured out that its time complexity is simply too large, but I couldn't find any better algorithm for calculating this. Can anyone suggest an idea for enhancing (significantly) or rebuilding the code? It doesn't need to be specific; even a vague idea would be helpful.
public long[] AltOper(String str){
long[] triTuple = new long[3]; // result: {number-of-a, number-of-b, number-of-c}
if (str.length() == 1){ // Ending recursion condition
if (str.equals("a")) triTuple[0]++;
else if (str.equals("b")) triTuple[1]++;
else triTuple[2]++;
return triTuple;
}
String left = "";
String right = str;
while (right.length() > 1){
// splitting string into two, by one character each
left = left + right.substring(0, 1);
right = right.substring(1, right.length());
long[] ltemp = AltOper(left);
long[] rtemp = AltOper(right);
// calculating possible answers from left/right split strings
triTuple[0] += ((ltemp[0] + ltemp[1]) * rtemp[2] + ltemp[2] * rtemp[0]);
triTuple[1] += (ltemp[0] * rtemp[0] + (ltemp[0] + ltemp[1]) * rtemp[1]);
triTuple[2] += (ltemp[1] * rtemp[0] + ltemp[2] * (rtemp[1] + rtemp[2]));
}
return triTuple;
}
One comment ahead: I would modify the signature to allow for a binary string operation, so you can easily modify your "input operation":
public long[] AltOper(BiFunction<long[], long[], long[]> op, String str) {
I recommend using some sort of lookup table for subportions you have already answered. You hinted that you tried this already:
I tried memoization for some small ones, but apparently it's not enough
I wonder what went wrong, since this is a good idea, especially since your input consists of strings, which are both quickly hashable and comparable, so putting them in a map is cheap. You just need to ensure that the map does not eat up all the memory, by making sure that old, unused entries are dropped. Cache-like maps can do this; I leave it to you to find one that suits your preferences.
From there, I would run every recursion through the cache check, to find precalculated results in the map. Small substrings that would otherwise be calculated insanely often are then looked up quickly, which speeds up your algorithm drastically.
I rewrote your code a bit, to allow for various inputs (including different operations):
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiFunction;
import org.junit.jupiter.api.Test;
public class StringOpExplorer {
@Test
public void test() {
BiFunction<long[], long[], long[]> op = (left, right) -> {
long[] r = new long[3];
r[0] += ((left[0] + left[1]) * right[2] + left[2] * right[0]);
r[1] += (left[0] * right[0] + (left[0] + left[1]) * right[1]);
r[2] += (left[1] * right[0] + left[2] * (right[1] + right[2]));
return r;
};
long[] result = new StringOpExplorer().opExplore(op, "abcacbabc");
System.out.println(Arrays.toString(result));
}
@SuppressWarnings("serial")
final LinkedHashMap<String, long[]> cache = new LinkedHashMap<String, long[]>() {
@Override
protected boolean removeEldestEntry(final Map.Entry<String, long[]> eldest) {
return size() > 1_000_000;
}
};
public long[] opExplore(BiFunction<long[], long[], long[]> op, String input) {
// if the input is length 1, we return.
int length = input.length();
if (length == 1) {
long[] result = new long[3];
if (input.equals("a")) {
++result[0];
} else if (input.equals("b")) {
++result[1];
} else if (input.equals("c")) {
++result[2];
}
return result;
}
// This will check, if the result is already known.
long[] result = cache.get(input);
if (result == null) {
// This will calculate the result, if it is not yet known.
result = applyOp(op, input);
cache.put(input, result);
}
return result;
}
public long[] applyOp(BiFunction<long[], long[], long[]> op, String input) {
long[] result = new long[3];
int length = input.length();
for (int i = 1; i < length; ++i) {
// This might be easier to read...
String left = input.substring(0, i);
String right = input.substring(i, length);
// Subcalculation.
long[] leftResult = opExplore(op, left);
long[] rightResult = opExplore(op, right);
// apply operation and add result.
long[] operationResult = op.apply(leftResult, rightResult);
for (int d = 0; d < 3; ++d) {
result[d] += operationResult[d];
}
}
return result;
}
}
The idea of the rewrite was to introduce caching and to isolate the operation from the exploration. After all, your algorithm is in itself an operation, but not the 'operation under test'. So now you could (theoretically) test any operation, by changing the BiFunction parameter.
This result is extremely fast, though I really wonder about the applicability...

Given n files with their lengths, divide the total bytes among N threads such that any two threads differ by at most 1 byte

You are given a list of file names and their lengths in bytes.
Example:
File1: 200 File2: 500 File3: 800
You are given a number N. We want to launch N threads to read all the files in parallel, such that each thread reads approximately an equal number of bytes.
You should return N lists. Each list describes the work of one thread. For example, when N=2, there are two threads. In the above example, there is a total of 1500 bytes (200 + 500 + 800). A fair way to divide is for each thread to read 750 bytes. So you will return:
Two lists
List 1: File1: 0 - 199 File2: 0 - 499 File3: 0-49 ---------------- Total 750 bytes
List 2: File3: 50-799 -------------------- Total 750 bytes
Implement the following method
List<List<FileRange>> getSplits(List<File> files, int N)
class File {
String filename;
long length;
}
class FileRange {
String filename;
Long startOffset;
Long endOffset;
}
I tried with the code below but it's not working; any help would be highly appreciated.
List<List<FileRange>> getSplits(List<File> files, int n) {
List<List<FileRange>> al=new ArrayList<>();
long s=files.size();
long sum=0;
for(int i=0;i<s;i++){
long l=files.get(i).length;
sum+=(long)l;
}
long div=(long)sum/n; // no of bytes per thread
long mod=(long)sum%n;
long[] lo=new long[(long)n];
for(long i=0;i<n;i++)
lo[i]=div;
if(mod!=0){
long i=0;
while(mod>0){
lo[i]+=1;
mod--;
i++;
}
}
long inOffset=0;
for(long j=0;j<n;j++){
long val=lo[i];
for(long i=0;i<(long)files.size();i++){
String ss=files.get(i).filename;
long ll=files.get(i).length;
if(ll<val){
inOffset=0;
val-=ll;
}
else{
inOffset=ll-val;
ll=val;
}
al.add(new ArrayList<>(new File(ss,inOffset,ll-1)));
}
}
}
I'm having a problem getting startOffset and endOffset right for the corresponding file. I tried, but I was not able to extract the data from the List and build the required return type List<List<FileRange>>.
The essence of the problem is to simultaneously walk through two lists:
the input list, which is a list of files
the output list, which is a list of threads (where each thread has a list of ranges)
I find that the easiest approach to such problems is an infinite loop that looks something like this:
while (1)
{
move some information from the input to the output
decide whether to advance to the next input item
decide whether to advance to the next output item
if we've reached (the end of the input _OR_ the end of the output)
break
if we advanced to the next input item
prepare the next input item for processing
if we advanced to the next output item
prepare the next output item for processing
}
To keep track of the input, we need the following information
fileIndex the index into the list of files
fileOffset the offset of the first unassigned byte in the file, initially 0
fileRemain the number of bytes in the file that are unassigned, initially the file size
To keep track of the output, we need
threadIndex the index of the thread we're currently working on (which is the first index into the List<List<FileRange>> that the algorithm produces)
threadNeeds the number of bytes that the thread still needs, initially base or base+1
Side note: I'm using base as the minimum number of bytes assigned to each thread (sum/n), and extra as the number of threads that get an extra byte (sum%n).
So now we get to the heart of the algorithm: what information to move from input to output:
if fileRemain is less than threadNeeds then the rest of the file (which may be the entire file) gets assigned to the current thread, and we move to the next file
if fileRemain is greater than threadNeeds then a portion of the file is assigned to the current thread, and we move to the next thread
if fileRemain is equal to threadNeeds then the rest of the file is assigned to the thread, and we move to the next file, and the next thread
Those three cases are easily handled by comparing fileRemain and threadNeeds, and choosing a byteCount that is the minimum of the two.
With all that in mind, here's some pseudo-code to help get you started:
base = sum/n;
extra = sum%n;
// initialize the input control variables
fileIndex = 0
fileOffset = 0
fileRemain = length of file 0
// initialize the output control variables
threadIndex = 0
threadNeeds = base
if (threadIndex < extra)
threadNeeds++
while (1)
{
// decide how many bytes can be assigned, and generate some output
byteCount = min(fileRemain, threadNeeds)
add (file.name, fileOffset, fileOffset+byteCount-1) to the list of ranges
// decide whether to advance to the next input and output items
threadNeeds -= byteCount
fileRemain -= byteCount
fileOffset += byteCount
if (threadNeeds == 0)
threadIndex++
if (fileRemain == 0)
fileIndex++
// are we done yet?
if (threadIndex == n || fileIndex == files.size())
break
// if we've moved to the next input item, reinitialize the input control variables
if (fileRemain == 0)
{
fileOffset = 0
fileRemain = length of file
}
// if we've moved to the next output item, reinitialize the output control variables
if (threadNeeds == 0)
{
threadNeeds = base
if (threadIndex < extra)
threadNeeds++
}
}
Debugging tip: Reaching the end of the input, and the end of the output, should happen simultaneously. In other words, you should run out of files at exactly the same time as you run out of threads. So during development, I would check both conditions, and verify that they do, in fact, change at the same time.
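A fairly direct Java translation of the pseudo-code might look like this (a sketch that assumes the File and FileRange classes from the question, with a FileRange(filename, startOffset, endOffset) constructor, and at least one input file):
import java.util.ArrayList;
import java.util.List;

class SplitSketch {
    static List<List<FileRange>> getSplits(List<File> files, int n) {
        long sum = 0;
        for (File f : files) sum += f.length;
        long base = sum / n, extra = sum % n;

        List<List<FileRange>> result = new ArrayList<>();
        for (int t = 0; t < n; t++) result.add(new ArrayList<FileRange>());

        int fileIndex = 0, threadIndex = 0;
        long fileOffset = 0;
        long fileRemain = files.get(0).length;
        long threadNeeds = base + (threadIndex < extra ? 1 : 0);

        while (true) {
            // assign as many bytes as both the file and the thread can take
            long byteCount = Math.min(fileRemain, threadNeeds);
            result.get(threadIndex).add(new FileRange(
                    files.get(fileIndex).filename, fileOffset, fileOffset + byteCount - 1));
            threadNeeds -= byteCount;
            fileRemain -= byteCount;
            fileOffset += byteCount;
            if (threadNeeds == 0) threadIndex++;
            if (fileRemain == 0) fileIndex++;
            if (threadIndex == n || fileIndex == files.size()) break;
            if (fileRemain == 0) {                 // next input item
                fileOffset = 0;
                fileRemain = files.get(fileIndex).length;
            }
            if (threadNeeds == 0)                  // next output item
                threadNeeds = base + (threadIndex < extra ? 1 : 0);
        }
        return result;
    }
}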
Here's a code solution for your problem (in Java).
The custom classes 'File' and 'FileRange' are as follows:
public class File{
String filename;
long length;
public File(String filename, long length) {
this.filename = filename;
this.length = length;
}
public String getFilename() {
return filename;
}
public void setFilename(String filename) {
this.filename = filename;
}
public long getLength() {
return length;
}
public void setLength(long length) {
this.length = length;
}
}
public class FileRange {
String filename;
Long startOffset;
Long endOffset;
public FileRange(String filename, Long startOffset, Long endOffset) {
this.filename = filename;
this.startOffset = startOffset;
this.endOffset = endOffset;
}
public String getFilename() {
return filename;
}
public void setFilename(String filename) {
this.filename = filename;
}
public Long getStartOffset() {
return startOffset;
}
public void setStartOffset(Long startOffset) {
this.startOffset = startOffset;
}
public Long getEndOffset() {
return endOffset;
}
public void setEndOffset(Long endOffset) {
this.endOffset = endOffset;
}
}
The main class is as follows:
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.concurrent.atomic.AtomicInteger;
public class MainClass {
private static List<List<FileRange>> getSplits(List<File> files, int N) {
List<List<FileRange>> results = new ArrayList<>();
long sum = files.stream().mapToLong(File::getLength).sum(); // Total bytes in all the files
long div = sum/N;
long mod = sum%N;
// Storing how many bytes each thread gets to process
long thread_bytes[] = new long[N];
// At least 'div' number of bytes will be processed by each thread
for(int i=0;i<N;i++)
thread_bytes[i] = div;
// Left over bytes to be processed by each thread
for(int i=0;i<mod;i++)
thread_bytes[i] += 1;
int count = 0;
int len = files.size();
long processed_bytes[] = new long[len];
long temp = 0L;
int file_to_be_processed = 0;
while(count < N && sum > 0) {
temp = thread_bytes[count];
sum -= temp;
List<FileRange> internal = new ArrayList<>();
while (temp > 0) {
// Start from the file to be processed - Will be 0 in the first iteration
// Will be updated in the subsequent iterations
for(int j=file_to_be_processed;j<len && temp>0;j++){
File f = files.get(j);
if(f.getLength() - processed_bytes[j] <= temp){
internal.add(new FileRange(f.getFilename(), processed_bytes[j], f.getLength()- 1));
processed_bytes[j] = f.getLength() - processed_bytes[j];
temp -= processed_bytes[j];
file_to_be_processed++;
}
else{
internal.add(new FileRange(f.getFilename(), processed_bytes[j], processed_bytes[j] + temp - 1));
// In this case, we won't update the number for file to be processed
processed_bytes[j] += temp;
temp -= processed_bytes[j];
}
}
results.add(internal);
count++;
}
}
return results;
}
public static void main(String args[]){
Scanner scn = new Scanner(System.in);
int N = scn.nextInt();
// Inserting demo records in list
File f1 = new File("File 1",200);
File f2 = new File("File 2",500);
File f3 = new File("File 3",800);
List<File> files = new ArrayList<>();
files.add(f1);
files.add(f2);
files.add(f3);
List<List<FileRange>> results = getSplits(files, N);
final AtomicInteger result_count = new AtomicInteger();
// Displaying the results
results.forEach(result -> {
System.out.println("List "+result_count.incrementAndGet() + " : ");
result.forEach(res -> {
System.out.print(res.getFilename() + " : ");
System.out.print(res.getStartOffset() + " - ");
System.out.print(res.getEndOffset() + "\n");
});
System.out.println("---------------");
});
}
}
If some part is still unclear, consider a case and dry-run the program.
Say 999 bytes have to be processed by 100 threads.
Each of the 100 threads gets 9 bytes, and out of the remaining 99 bytes, each thread except the 100th gets 1 extra byte. This ensures that no two threads differ by more than 1 byte. Proceed with this idea and follow it through the code.

Custom HashMap Code Issue

I have the following code, where I used a HashMap (backed by two parallel arrays) for storing key-value pairs (a key can have multiple values). Now, I have to store and load it for future use, which is why I store and load it using a FileChannel. The issue with this code is: I can store roughly 120 million key-value pairs on my 8 GB server (in practice I can allocate about 5 GB of the 8 GB to my JVM, and those two parallel arrays take about 2.5 GB; the rest of the memory is used by other parts of my code). But I have to store roughly 600-700 million key-value pairs. Can anybody help me modify this code so that I can store that many? Any comments on this would be welcome.
Another point: I have to load and store the hashmap to/from memory, and that takes rather a long time using the FileChannel. Following various suggestions on Stack Overflow, I haven't found a faster way; I have also tried ObjectOutputStream and zipped output streams, but they were slower than the code below. Is there a way to store those two parallel arrays such that the loading time is much faster? I have included a test case in the code below; any comment on that would also be helpful.
import java.io.*;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.Arrays;
import java.util.Random;
import java.nio.*;
import java.nio.channels.FileChannel;
import java.io.RandomAccessFile;
public class Test {
public static void main(String args[]) {
try {
Random randomGenerator = new Random();
LongIntParallelHashMultimap lph = new LongIntParallelHashMultimap(220000000, "xx.dat", "yy.dat");
for (int i = 0; i < 110000000; i++) {
lph.put(i, randomGenerator.nextInt(200000000));
}
lph.save();
LongIntParallelHashMultimap lphN = new LongIntParallelHashMultimap(220000000, "xx.dat", "yy.dat");
lphN.load();
int tt[] = lphN.get(1);
System.out.println(tt[0]);
} catch (Exception e) {
e.printStackTrace();
}
}
}
class LongIntParallelHashMultimap {
private static final long NULL = -1L;
private final long[] keys;
private final int[] values;
private int size;
private int savenum = 0;
private String str1 = "";
private String str2 = "";
public LongIntParallelHashMultimap(int capacity, String st1, String st2) {
keys = new long[capacity];
values = new int[capacity];
Arrays.fill(keys, NULL);
savenum = capacity;
str1 = st1;
str2 = st2;
}
public void put(long key, int value) {
int index = indexFor(key);
while (keys[index] != NULL) {
index = successor(index);
}
keys[index] = key;
values[index] = value;
++size;
}
public int[] get(long key) {
int index = indexFor(key);
int count = countHits(key, index);
int[] hits = new int[count];
int hitIndex = 0;
while (keys[index] != NULL) {
if (keys[index] == key) {
hits[hitIndex] = values[index];
++hitIndex;
}
index = successor(index);
}
return hits;
}
private int countHits(long key, int index) {
int numHits = 0;
while (keys[index] != NULL) {
if (keys[index] == key) {
++numHits;
}
index = successor(index);
}
return numHits;
}
private int indexFor(long key) {
return Math.abs((int) ((key * 5700357409661598721L) % keys.length));
}
private int successor(int index) {
return (index + 1) % keys.length;
}
public int size() {
return size;
}
public void load() {
try {
FileChannel channel2 = new RandomAccessFile(str1, "r").getChannel();
MappedByteBuffer mbb2 = channel2.map(FileChannel.MapMode.READ_ONLY, 0, channel2.size());
mbb2.order(ByteOrder.nativeOrder());
assert mbb2.remaining() == savenum * 8;
for (int i = 0; i < savenum; i++) {
long l = mbb2.getLong();
keys[i] = l;
}
channel2.close();
FileChannel channel3 = new RandomAccessFile(str2, "r").getChannel();
MappedByteBuffer mbb3 = channel3.map(FileChannel.MapMode.READ_ONLY, 0, channel3.size());
mbb3.order(ByteOrder.nativeOrder());
assert mbb3.remaining() == savenum * 4;
for (int i = 0; i < savenum; i++) {
int l1 = mbb3.getInt();
values[i] = l1;
}
channel3.close();
} catch (Exception e) {
System.out.println(e);
}
}
public void save() {
try {
FileChannel channel = new RandomAccessFile(str1, "rw").getChannel();
MappedByteBuffer mbb = channel.map(FileChannel.MapMode.READ_WRITE, 0, savenum * 8);
mbb.order(ByteOrder.nativeOrder());
for (int i = 0; i < savenum; i++) {
mbb.putLong(keys[i]);
}
channel.close();
FileChannel channel1 = new RandomAccessFile(str2, "rw").getChannel();
MappedByteBuffer mbb1 = channel1.map(FileChannel.MapMode.READ_WRITE, 0, savenum * 4);
mbb1.order(ByteOrder.nativeOrder());
for (int i = 0; i < savenum; i++) {
mbb1.putInt(values[i]);
}
channel1.close();
} catch (Exception e) {
System.out.println("IOException : " + e);
}
}
}
I doubt this is possible, given the datatypes you have declared. Just multiply the sizes of the primitive types.
Each row requires 4 bytes to store an int and 8 bytes to store a long.
600 million rows * 12 bytes per row is about 7.2 GB. You say you can allocate 5 GB to the JVM, so even if it were all heap and stored only this custom HashMap, it would not fit. Consider shrinking the size of the datatypes involved or storing the data somewhere other than RAM.
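The same arithmetic, spelled out:
public class SizeEstimate {
    public static void main(String[] args) {
        long entries = 600_000_000L;                       // target number of key/value pairs
        long bytesPerEntry = Long.BYTES + Integer.BYTES;   // 8-byte key + 4-byte value = 12
        System.out.printf("%.1f GB%n", entries * bytesPerEntry / 1e9);   // prints 7.2 GB
    }
}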
Have the database on disk, and not in memory. Rewrite your operations so that they don't operate on arrays, but instead operate on buffers. Then you can open a sufficiently large file, and have the operations access the portion they need using a mapped buffer. Try whether your application performs better when you implement a cache of the few most recently mapped memory regions, so you won't have to map and unmap common regions too often, but instead can keep them mapped in.
This should give you the best of both worlds, disk and ram:
Random access to any portion of the data structure is easy to implement
Access to often used portions of the table will be cached
Seldom used portions of the table will not occupy any memory
As you can see, this depends a lot on locality: if some keys are more common than others, things will perform well, whereas nicely distributed keys will cause a new disk operation for each access. So while nice distributions are desirable for most in-memory hash maps, other structures which map often-used keys to similar locations will perform better here. Those will interfere with collision handling, though.
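A minimal sketch of what the get() path could look like when it works directly against mapped buffers instead of the in-memory arrays (the file layout and hash function follow the save() method above; note that a single mapping is limited to 2 GB, so for 600-700 million entries you would need several mapped regions, and the region caching described above is left out):
import java.io.RandomAccessFile;
import java.nio.ByteOrder;
import java.nio.IntBuffer;
import java.nio.LongBuffer;
import java.nio.channels.FileChannel;

class DiskBackedMultimap {
    private final LongBuffer keys;   // key file mapped as longs, -1 marks an empty slot
    private final IntBuffer values;  // value file mapped as ints
    private final int capacity;

    DiskBackedMultimap(String keyFile, String valueFile, int capacity) throws Exception {
        this.capacity = capacity;
        FileChannel kc = new RandomAccessFile(keyFile, "r").getChannel();
        FileChannel vc = new RandomAccessFile(valueFile, "r").getChannel();
        keys = kc.map(FileChannel.MapMode.READ_ONLY, 0, (long) capacity * 8)
                 .order(ByteOrder.nativeOrder()).asLongBuffer();
        values = vc.map(FileChannel.MapMode.READ_ONLY, 0, (long) capacity * 4)
                   .order(ByteOrder.nativeOrder()).asIntBuffer();
    }

    int[] get(long key) {
        // same probing scheme as the in-memory version, but every read goes to the mapping
        int index = indexFor(key), count = 0;
        for (int i = index; keys.get(i) != -1L; i = (i + 1) % capacity)
            if (keys.get(i) == key) count++;
        int[] hits = new int[count];
        int hit = 0;
        for (int i = index; keys.get(i) != -1L; i = (i + 1) % capacity)
            if (keys.get(i) == key) hits[hit++] = values.get(i);
        return hits;
    }

    private int indexFor(long key) {
        return Math.abs((int) ((key * 5700357409661598721L) % capacity));
    }
}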
It might be better to use an embedded database like SQLite, which will give good results.

How to perform a binary search of a text file

I have a big text file (5 MB) that I use in my Android application. I create the file as a list of pre-sorted Strings, and the file doesn't change once it is created. How can I perform a binary search on the contents of this file, without reading it line-by-line to find the matching String?
Since the content of the file does not change, you can break the file into multiple pieces, say A-G, H-N, O-T and U-Z. This allows you to check the first character and immediately cut the possible set to a fourth of the original size. Now a linear search will not take as long, or reading the whole piece could be an option. This process can be extended if n/4 is still too large, but the idea is the same: build the search breakdowns into the file structure instead of trying to do it all in memory.
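For example, picking the right piece could be as simple as this (the piece file names are made up for illustration; each piece is itself sorted):
class PiecePicker {
    // Hypothetical piece names; within a piece you can still binary search or scan.
    static String pieceFor(String word) {
        char c = Character.toUpperCase(word.charAt(0));
        if (c <= 'G') return "words_a_to_g.txt";
        if (c <= 'N') return "words_h_to_n.txt";
        if (c <= 'T') return "words_o_to_t.txt";
        return "words_u_to_z.txt";
    }
}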
A 5 MB file isn't that big - you should be able to read each line into a String[] array, on which you can then use java.util.Arrays.binarySearch() to find the line you want. This is my recommended approach.
If you don't want to read the whole file into your app, then it gets more complicated. If each line of the file is the same length, and the file is already sorted, then you can open the file with RandomAccessFile and perform a binary search yourself by using seek() like this...
// open the file for reading
RandomAccessFile raf = new RandomAccessFile("myfile.txt","r");
String searchValue = "myline";
int lineSize = 50;
int numberOfLines = (int) (raf.length() / lineSize);
// perform the binary search...
byte[] lineBuffer = new byte[lineSize];
int bottom = 0;
int top = numberOfLines - 1;
int middle;
while (bottom <= top){
middle = (bottom+top)/2;
raf.seek(middle*lineSize); // jump to this line in the file
raf.read(lineBuffer); // read the line from the file
String line = new String(lineBuffer); // convert the line to a String
int comparison = line.compareTo(searchValue);
if (comparison == 0){
// found it
break;
}
else if (comparison < 0){
// line comes before searchValue
bottom = middle + 1;
}
else {
// line comes after searchValue
top = middle - 1;
}
}
raf.close(); // close the file when you're finished
However, if the file doesn't have fixed-width lines, then you can't easily perform a binary search without loading it into memory first, as you can't quickly jump to a specific line in the file like you can with fixed-width lines.
Here's something I quickly put together. It uses two files, one with the words, the other with the offsets. The format of the offset file is this: the first 10 bits contain the word size, the last 22 bits contain the offset (the word's position in the word file; for example, aaah would be 0, abasement would be 4, etc.). It's encoded in big-endian (the Java standard). Hope it helps somebody.
word.dat:
aaahabasementableabnormalabnormalityabortionistabortion-rightsabracadabra
wordx.dat:
00 80 00 00 01 20 00 04 00 80 00 0D 01 00 00 11 _____ __________
01 60 00 19 01 60 00 24 01 E0 00 2F 01 60 00 3E _`___`_$___/_`_>
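In Java, that 10/22-bit packing can be expressed roughly as follows (a small sketch; the shift amount and mask match the reader code further down):
class WordIndexEntry {
    // Pack: word length in the top 10 bits, offset into word.dat in the lower 22 bits.
    static int pack(int length, int offset) {
        return (length << 22) | (offset & 0x3FFFFF);
    }
    static int lengthOf(int packed) {
        return packed >>> 22;       // top 10 bits
    }
    static int offsetOf(int packed) {
        return packed & 0x3FFFFF;   // lower 22 bits
    }
}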
I created these files in C#, but here's the code for it (it uses a .txt file with words separated by CRLFs):
static void Main(string[] args)
{
const string fIn = @"C:\projects\droid\WriteFiles\input\allwords.txt";
const string fwordxOut = @"C:\projects\droid\WriteFiles\output\wordx.dat";
const string fWordOut = @"C:\projects\droid\WriteFiles\output\word.dat";
int i = 0;
int offset = 0;
int j = 0;
var lines = File.ReadLines(fIn);
FileStream stream = new FileStream(fwordxOut, FileMode.Create, FileAccess.ReadWrite);
using (EndianBinaryWriter wwordxOut = new EndianBinaryWriter(EndianBitConverter.Big, stream))
{
using (StreamWriter wWordOut = new StreamWriter(File.Open(fWordOut, FileMode.Create)))
{
foreach (var line in lines)
{
wWordOut.Write(line);
i = offset | ((int)line.Length << 22); //first 10 bits to the left is the word size
offset = offset + (int)line.Length;
wwordxOut.Write(i);
//if (j == 7)
// break;
j++;
}
}
}
}
And this is the Java code for the binary file search:
public static void binarySearch() {
String TAG = "TEST";
String wordFilePath = Environment.getExternalStorageDirectory().getAbsolutePath() + "/word.dat";
String wordxFilePath = Environment.getExternalStorageDirectory().getAbsolutePath() + "/wordx.dat";
String target = "abracadabra";
boolean targetFound = false;
int searchCount = 0;
try {
RandomAccessFile raf = new RandomAccessFile(wordxFilePath, "r");
RandomAccessFile rafWord = new RandomAccessFile(wordFilePath, "r");
long low = 0;
long high = (raf.length() / 4) - 1;
int cur = 0;
long wordOffset = 0;
int len = 0;
while (high >= low) {
long mid = (low + high) / 2;
raf.seek(mid * 4);
cur = raf.readInt();
Log.v(TAG + "-cur", String.valueOf(cur));
len = cur >> 22; //word length
cur = cur & 0x3FFFFF; //first 10 bits are 0
rafWord.seek(cur);
byte [] bytes = new byte[len];
wordOffset = rafWord.read(bytes, 0, len);
Log.v(TAG + "-wordOffset", String.valueOf(wordOffset));
searchCount++;
String str = new String(bytes);
Log.v(TAG, str);
if (target.compareTo(str) < 0) {
high = mid - 1;
} else if (target.compareTo(str) == 0) {
targetFound = true;
break;
} else {
low = mid + 1;
}
}
raf.close();
rafWord.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
if (targetFound == true) {
Log.v(TAG + "-found " , String.valueOf(searchCount));
} else {
Log.v(TAG + "-not found " , String.valueOf(searchCount));
}
}
In a text file with uniform line length you could seek to the character-wise middle of the interval in question, read characters until you hit your delimiter, then use the following string as an approximation of the element-wise middle. The problem with doing this on Android, though, is that you apparently can't get random access to a resource (although I suppose you could just reopen it every time). Furthermore, this technique doesn't generalize to maps and sets of other types.
Another option would be to use a RandomAccessFile to write an "array" of ints - one for each String - at the beginning of the file, then go back and update them with the locations of their corresponding Strings (see the sketch after this paragraph). Again, the search will require jumping around.
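A rough sketch of that layout, assuming the strings are written with writeUTF so they can be read back with readUTF, and that int offsets are enough for a 5 MB file:
import java.io.RandomAccessFile;
import java.util.List;

class IndexedStringFile {
    // Layout: one int slot per string at the start of the file, then the strings
    // themselves (writeUTF). The slots are filled in with each string's position.
    static void write(String fileName, List<String> sortedStrings) throws Exception {
        try (RandomAccessFile out = new RandomAccessFile(fileName, "rw")) {
            int n = sortedStrings.size();
            long[] offsets = new long[n];
            for (int i = 0; i < n; i++) out.writeInt(0);          // placeholder slots
            for (int i = 0; i < n; i++) {
                offsets[i] = out.getFilePointer();                // where string i starts
                out.writeUTF(sortedStrings.get(i));
            }
            for (int i = 0; i < n; i++) {                         // back-fill the slots
                out.seek((long) i * 4);
                out.writeInt((int) offsets[i]);
            }
        }
    }

    // Lookup side: seek to slot i*4, read the offset, seek there, readUTF, compare.
    static String stringAt(RandomAccessFile in, int i) throws Exception {
        in.seek((long) i * 4);
        in.seek(in.readInt());
        return in.readUTF();
    }
}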
What I would do (and did do in my own app) is implement a hash set in a file. This one does separate chaining with trees.
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedList;
import java.util.Set;
class StringFileSet {
private static final double loadFactor = 0.75;
public static void makeFile(String fileName, String comment, Set<String> set) throws IOException {
new File(fileName).delete();
RandomAccessFile fout = new RandomAccessFile(fileName, "rw");
//Write comment
fout.writeUTF(comment);
//Make bucket array
int numBuckets = (int)(set.size()/loadFactor);
ArrayList<ArrayList<String>> bucketArray = new ArrayList<ArrayList<String>>(numBuckets);
for (int ii = 0; ii < numBuckets; ii++){
bucketArray.add(new ArrayList<String>());
}
for (String key : set){
bucketArray.get(Math.abs(key.hashCode()%numBuckets)).add(key);
}
//Sort key lists in preparation for creating trees
for (ArrayList<String> keyList : bucketArray){
Collections.sort(keyList);
}
//Make queues in preparation for creating trees
class NodeInfo{
public final int lower;
public final int upper;
public final long callingOffset;
public NodeInfo(int lower, int upper, long callingOffset){
this.lower = lower;
this.upper = upper;
this.callingOffset = callingOffset;
}
}
ArrayList<LinkedList<NodeInfo>> queueList = new ArrayList<LinkedList<NodeInfo>>(numBuckets);
for (int ii = 0; ii < numBuckets; ii++){
queueList.add(new LinkedList<NodeInfo>());
}
//Write bucket array
fout.writeInt(numBuckets);
for (int index = 0; index < numBuckets; index++){
queueList.get(index).add(new NodeInfo(0, bucketArray.get(index).size()-1, fout.getFilePointer()));
fout.writeInt(-1);
}
//Write trees
for (int bucketIndex = 0; bucketIndex < numBuckets; bucketIndex++){
while (queueList.get(bucketIndex).size() != 0){
NodeInfo nodeInfo = queueList.get(bucketIndex).poll();
if (nodeInfo.lower <= nodeInfo.upper){
//Set respective pointer in parent node
fout.seek(nodeInfo.callingOffset);
fout.writeInt((int)(fout.length() - (nodeInfo.callingOffset + 4))); //Distance instead of absolute position so that the get method can use a DataInputStream
fout.seek(fout.length());
int middle = (nodeInfo.lower + nodeInfo.upper)/2;
//Key
fout.writeUTF(bucketArray.get(bucketIndex).get(middle));
//Left child
queueList.get(bucketIndex).add(new NodeInfo(nodeInfo.lower, middle-1, fout.getFilePointer()));
fout.writeInt(-1);
//Right child
queueList.get(bucketIndex).add(new NodeInfo(middle+1, nodeInfo.upper, fout.getFilePointer()));
fout.writeInt(-1);
}
}
}
fout.close();
}
private final String fileName;
private final int numBuckets;
private final int bucketArrayOffset;
public StringFileSet(String fileName) throws IOException {
this.fileName = fileName;
DataInputStream fin = new DataInputStream(new BufferedInputStream(new FileInputStream(fileName)));
short numBytes = fin.readShort();
fin.skipBytes(numBytes);
this.numBuckets = fin.readInt();
this.bucketArrayOffset = numBytes + 6;
fin.close();
}
public boolean contains(String key) throws IOException {
boolean containsKey = false;
DataInputStream fin = new DataInputStream(new BufferedInputStream(new FileInputStream(this.fileName)));
fin.skipBytes(4*(Math.abs(key.hashCode()%this.numBuckets)) + this.bucketArrayOffset);
int distance = fin.readInt();
while (distance != -1){
fin.skipBytes(distance);
String candidate = fin.readUTF();
if (key.compareTo(candidate) < 0){
distance = fin.readInt();
}else if (key.compareTo(candidate) > 0){
fin.skipBytes(4);
distance = fin.readInt();
}else{
fin.skipBytes(8);
containsKey = true;
break;
}
}
fin.close();
return containsKey;
}
}
A test program
import java.io.File;
import java.io.IOException;
import java.util.HashSet;
class Test {
public static void main(String[] args) throws IOException {
HashSet<String> stringMemorySet = new HashSet<String>();
stringMemorySet.add("red");
stringMemorySet.add("yellow");
stringMemorySet.add("blue");
StringFileSet.makeFile("stringSet", "Provided under ... included in all copies and derivatives ...", stringMemorySet);
StringFileSet stringFileSet = new StringFileSet("stringSet");
System.out.println("orange -> " + stringFileSet.contains("orange"));
System.out.println("red -> " + stringFileSet.contains("red"));
System.out.println("yellow -> " + stringFileSet.contains("yellow"));
System.out.println("blue -> " + stringFileSet.contains("blue"));
new File("stringSet").delete();
System.out.println();
}
}
You'll also need to pass a Context to it, if and when you modify it for Android, so it can access the getResources() method.
You're also probably going to want to stop the Android build tools from compressing the file, which can apparently only be done - if you're working with the GUI - by changing the file's extension to something such as .jpg. This made the process about 100 to 300 times faster in my app.
You might also look into giving yourself more memory by using the NDK.
Though it might sound like overkill, don't store data you need to search like this in a flat file. Make a database and query the data from it; this should be both effective and fast.
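On Android that usually means shipping a pre-built SQLite database and querying it; a minimal sketch, assuming a table words(word TEXT PRIMARY KEY) and a database file already copied to dbPath:
import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;

class WordLookup {
    // Returns true if 'word' exists in the (indexed) words table.
    static boolean contains(String dbPath, String word) {
        SQLiteDatabase db = SQLiteDatabase.openDatabase(dbPath, null, SQLiteDatabase.OPEN_READONLY);
        Cursor c = db.rawQuery("SELECT 1 FROM words WHERE word = ?", new String[]{word});
        try {
            return c.moveToFirst();   // any row means the word is present
        } finally {
            c.close();
            db.close();
        }
    }
}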
Here is a function that I think works (I'm using it in practice). Lines can have any length. You have to supply a lambda called "nav" to do the actual line check, so you stay flexible about the file's ordering (case-sensitive, case-insensitive, ordered by a certain field, etc.).
import java.io.File;
import java.io.RandomAccessFile;
class main {
// returns pair(character range in file, line) or null if not found
// if no exact match found, return line above
// nav takes a line and returns -1 (move up), 0 (found) or 1 (move down)
// The line supplied to nav is stripped of the trailing \n, but not the \r
// UTF-8 encoding is assumed
static Pair<LongRange, String> binarySearchForLineInTextFile(File file, IF1<String, Integer> nav) {
long length = l(file);
int bufSize = 1024;
RandomAccessFile raf = randomAccessFileForReading(file);
try {
long min = 0, max = length;
int direction = 0;
Pair<LongRange, String> possibleResult = null;
while (min < max) {
ping();
long middle = (min + max) / 2;
long lineStart = raf_findBeginningOfLine(raf, middle, bufSize);
long lineEnd = raf_findEndOfLine(raf, middle, bufSize);
String line = fromUtf8(raf_readFilePart(raf, lineStart, (int) (lineEnd - 1 - lineStart)));
direction = nav.get(line);
possibleResult = (Pair<LongRange, String>) new Pair(new LongRange(lineStart, lineEnd), line);
if (direction == 0) return possibleResult;
// asserts are to assure that loop terminates
if (direction < 0) max = assertLessThan(max, lineStart);
else min = assertBiggerThan(min, lineEnd);
}
if (direction >= 0) return possibleResult;
long lineStart = raf_findBeginningOfLine(raf, min - 1, bufSize);
String line = fromUtf8(raf_readFilePart(raf, lineStart, (int) (min - 1 - lineStart)));
return new Pair(new LongRange(lineStart, min), line);
} finally {
_close(raf);
}
}
static int l(byte[] a) {
return a == null ? 0 : a.length;
}
static long l(File f) {
return f == null ? 0 : f.length();
}
static RandomAccessFile randomAccessFileForReading(File path) {
try {
return new RandomAccessFile(path, "r");
} catch (Exception __e) {
throw rethrow(__e);
}
}
// you can change this function to allow interrupting long calculations from the outside. just throw a RuntimeException.
static boolean ping() {
return true;
}
static long raf_findBeginningOfLine(RandomAccessFile raf, long pos, int bufSize) {
try {
byte[] buf = new byte[bufSize];
while (pos > 0) {
long start = Math.max(pos - bufSize, 0);
raf.seek(start);
raf.readFully(buf, 0, (int) Math.min(pos - start, bufSize));
int idx = lastIndexOf_byteArray(buf, (byte) '\n');
if (idx >= 0) return start + idx + 1;
pos = start;
}
return 0;
} catch (Exception __e) {
throw rethrow(__e);
}
}
static long raf_findEndOfLine(RandomAccessFile raf, long pos, int bufSize) {
try {
byte[] buf = new byte[bufSize];
long length = raf.length();
while (pos < length) {
raf.seek(pos);
raf.readFully(buf, 0, (int) Math.min(length - pos, bufSize));
int idx = indexOf_byteArray(buf, (byte) '\n');
if (idx >= 0) return pos + idx + 1;
pos += bufSize;
}
return length;
} catch (Exception __e) {
throw rethrow(__e);
}
}
static String fromUtf8(byte[] bytes) {
try {
return bytes == null ? null : new String(bytes, "UTF-8");
} catch (Exception __e) {
throw rethrow(__e);
}
}
static byte[] raf_readFilePart(RandomAccessFile raf, long start, int l) {
try {
byte[] buf = new byte[l];
raf.seek(start);
raf.readFully(buf);
return buf;
} catch (Exception __e) {
throw rethrow(__e);
}
}
static <A> A assertLessThan(A a, A b) {
assertTrue(cmp(b, a) < 0);
return b;
}
static <A> A assertBiggerThan(A a, A b) {
assertTrue(cmp(b, a) > 0);
return b;
}
static void _close(AutoCloseable c) {
try {
if (c != null)
c.close();
} catch (Throwable e) {
throw rethrow(e);
}
}
static RuntimeException rethrow(Throwable t) {
throw t instanceof RuntimeException ? (RuntimeException) t : new RuntimeException(t);
}
static int lastIndexOf_byteArray(byte[] a, byte b) {
for (int i = l(a) - 1; i >= 0; i--)
if (a[i] == b)
return i;
return -1;
}
static int indexOf_byteArray(byte[] a, byte b) {
int n = l(a);
for (int i = 0; i < n; i++)
if (a[i] == b)
return i;
return -1;
}
static boolean assertTrue(boolean b) {
if (!b)
throw fail("oops");
return b;
}
static int cmp(Object a, Object b) {
if (a == null) return b == null ? 0 : -1;
if (b == null) return 1;
return ((Comparable) a).compareTo(b);
}
static RuntimeException fail(String msg) {
throw new RuntimeException(msg == null ? "" : msg);
}
final static class LongRange {
long start, end;
LongRange(long start, long end) {
this.end = end;
this.start = start;
}
public String toString() {
return "[" + start + ";" + end + "]";
}
}
interface IF1<A, B> {
B get(A a);
}
static class Pair<A, B> {
A a;
B b;
Pair(A a, B b) {
this.b = b;
this.a = a;
}
public String toString() {
return "<" + a + ", " + b + ">";
}
}
}

Binary search in a sorted (memory-mapped ?) file in Java

I am struggling to port a Perl program to Java, and learning Java as I go. A central component of the original program is a Perl module that does string prefix lookups in a +500 GB sorted text file using binary search
(essentially, "seek" to a byte offset in the middle of the file, backtrack to nearest newline, compare line prefix with the search string, "seek" to half/double that byte offset, repeat until found...)
I have experimented with several database solutions but found that nothing beats this in sheer lookup speed with data sets of this size. Do you know of any existing Java library that implements such functionality? Failing that, could you point me to some idiomatic example code that does random access reads in text files?
Alternatively, I am not familiar with the new (?) Java I/O libraries but would it be an option to memory-map the 500 GB text file (I'm on a 64-bit machine with memory to spare) and do binary search on the memory-mapped byte array? I would be very interested to hear any experiences you have to share about this and similar problems.
I am a big fan of Java's MappedByteBuffers for situations like this. It is blazing fast. Below is a snippet I put together for you that maps a buffer to the file, seeks to the middle, and then searches backwards to a newline character. This should be enough to get you going?
I have similar code (seek, read, repeat until done) in my own application, benchmarked java.io streams against MappedByteBuffer in a production environment, and posted the results on my blog (Geekomatic posts tagged 'java.nio') with raw data, graphs and all.
Two-second summary? My MappedByteBuffer-based implementation was about 275% faster. YMMV.
To work with files larger than ~2 GB, which is a problem because of the cast and .position(int pos), I've crafted a paging algorithm backed by an array of MappedByteBuffers. You'll need to be working on a 64-bit system for this to work with files larger than 2-4 GB, because MBBs use the OS's virtual memory system to work their magic.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;
import static java.nio.channels.FileChannel.MapMode.READ_ONLY;
public class StusMagicLargeFileReader {
private static final long PAGE_SIZE = Integer.MAX_VALUE;
private List<MappedByteBuffer> buffers = new ArrayList<MappedByteBuffer>();
private final byte raw[] = new byte[1];
public static void main(String[] args) throws IOException {
File file = new File("/Users/stu/test.txt");
FileChannel fc = (new FileInputStream(file)).getChannel();
StusMagicLargeFileReader buffer = new StusMagicLargeFileReader(fc);
long position = file.length() / 2;
String candidate = buffer.getString(position--);
while (position >= 0 && !candidate.equals("\n"))
candidate = buffer.getString(position--);
//have newline position or start of file...do other stuff
}
StusMagicLargeFileReader(FileChannel channel) throws IOException {
long start = 0, length = 0;
for (int index = 0; start + length < channel.size(); index++) {
if ((channel.size() / PAGE_SIZE) == index)
length = (channel.size() - index * PAGE_SIZE) ;
else
length = PAGE_SIZE;
start = index * PAGE_SIZE;
buffers.add(index, channel.map(READ_ONLY, start, length));
}
}
public String getString(long bytePosition) {
int page = (int) (bytePosition / PAGE_SIZE);
int index = (int) (bytePosition % PAGE_SIZE);
raw[0] = buffers.get(page).get(index);
return new String(raw);
}
}
I have the same problem. I am trying to find all lines that start with some prefix in a sorted file.
Here is a method I cooked up which is largely a port of Python code found here: http://www.logarithmic.net/pfh/blog/01186620415
I have tested it but not thoroughly just yet. It does not use memory mapping, though.
public static List<String> binarySearch(String filename, String string) {
List<String> result = new ArrayList<String>();
try {
File file = new File(filename);
RandomAccessFile raf = new RandomAccessFile(file, "r");
long low = 0;
long high = file.length();
long p = -1;
while (low < high) {
long mid = (low + high) / 2;
p = mid;
while (p >= 0) {
raf.seek(p);
char c = (char) raf.readByte();
//System.out.println(p + "\t" + c);
if (c == '\n')
break;
p--;
}
if (p < 0)
raf.seek(0);
String line = raf.readLine();
//System.out.println("-- " + mid + " " + line);
if (line.compareTo(string) < 0)
low = mid + 1;
else
high = mid;
}
p = low;
while (p >= 0) {
raf.seek(p);
if (((char) raf.readByte()) == '\n')
break;
p--;
}
if (p < 0)
raf.seek(0);
while (true) {
String line = raf.readLine();
if (line == null || !line.startsWith(string))
break;
result.add(line);
}
raf.close();
} catch (IOException e) {
System.out.println("IOException:");
e.printStackTrace();
}
return result;
}
I am not aware of any library that has that functionality. However, correct code for an external binary search in Java should look similar to this:
class ExternalBinarySearch {
final RandomAccessFile file;
final Comparator<String> test; // tests the element given as search parameter with the line. Insert a PrefixComparator here
public ExternalBinarySearch(File f, Comparator<String> test) throws FileNotFoundException {
this.file = new RandomAccessFile(f, "r");
this.test = test;
}
public String search(String element) throws IOException {
long l = file.length();
return search(element, -1, l-1);
}
/**
* Searches the given element in the range [low,high]. The low value of -1 is a special case to denote the beginning of a file.
* In contrast to every other line, a line at the beginning of a file doesn't need a \n directly before the line
*/
private String search(String element, long low, long high) throws IOException {
if(high - low < 1024) {
// search directly
long p = low;
while(p < high) {
String line = nextLine(p);
int r = test.compare(line,element);
if(r > 0) {
return null;
} else if (r < 0) {
p += line.length();
} else {
return line;
}
}
return null;
} else {
long m = low + ((high - low) / 2);
String line = nextLine(m);
int r = test.compare(line, element);
if(r > 0) {
return search(element, low, m);
} else if (r < 0) {
return search(element, m, high);
} else {
return line;
}
}
}
private String nextLine(long low) throws IOException {
if(low == -1) { // Beginning of file
file.seek(0);
} else {
file.seek(low);
}
int bufferLength = 65 * 1024;
byte[] buffer = new byte[bufferLength];
int r = file.read(buffer);
int lineBeginIndex = -1;
// search beginning of line
if(low == -1) { //beginning of file
lineBeginIndex = 0;
} else {
//normal mode
for(int i = 0; i < 1024; i++) {
if(buffer[i] == '\n') {
lineBeginIndex = i + 1;
break;
}
}
}
if(lineBeginIndex == -1) {
// no line begins within next 1024 bytes
return null;
}
int start = lineBeginIndex;
for(int i = start; i < r; i++) {
if(buffer[i] == '\n') {
// Found end of line
return new String(buffer, lineBeginIndex, i - lineBeginIndex + 1);
}
}
throw new IllegalArgumentException("Line to long");
}
}
Please note: I made up this code ad hoc; corner cases are not tested nearly well enough, the code assumes that no single line is larger than 64K, etc.
I also think that building an index of the offsets where lines start might be a good idea. For a 500 GB file, that index should be stored in an index file. You should gain a not-so-small constant factor with that index, because then there is no need to search for the next line in each step.
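Building such a line-offset index is a single sequential pass; a sketch (each entry is a fixed-width long, so line i can then be reached with one seek into the index; if the file ends with a newline, the last entry simply points at end-of-file):
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

class LineOffsetIndexer {
    // Writes the byte offset of every line start in dataFile to indexFile as big-endian longs.
    static void buildIndex(String dataFile, String indexFile) throws IOException {
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(dataFile));
             DataOutputStream out = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream(indexFile)))) {
            long pos = 0;
            out.writeLong(0);                        // the first line starts at offset 0
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') out.writeLong(pos);   // the next line starts right after '\n'
            }
        }
    }
}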
I know that was not the question, but building a prefix-tree data structure such as a (Patricia) trie (on disk/SSD) might be a good idea for the prefix search.
This is a simple example of what you want to achieve. I would probably first index the file, keeping track of the file position for each string. I'm assuming the strings are separated by newlines (or carriage returns):
RandomAccessFile file = new RandomAccessFile("filename.txt", "r");
List<Long> indexList = new ArrayList<Long>();
long pos = 0;
while (file.readLine() != null)
{
Long linePos = new Long(pos);
indexList.add(linePos);
pos = file.getFilePointer();
}
int indexSize = indexList.size();
Long[] indexArray = new Long[indexSize];
indexList.toArray(indexArray);
The last step is to convert to an array for a slight speed improvement when doing lots of lookups. I would probably convert the Long[] to a long[] also, but I did not show that above. Finally the code to read the string from a given indexed position:
int i; // Initialize this appropriately for your algorithm.
file.seek(indexArray[i]);
String line = file.readLine();
// At this point, line contains the string #i.
If you are dealing with a 500 GB file, then you might want to use a faster lookup method than binary search - namely a radix sort, which is essentially a variant of hashing. The best method for doing this really depends on your data distribution and the types of lookup, but if you are looking for string prefixes there should be a good way to do this.
I posted an example of a radix sort solution for integers, but you can use the same idea - basically, cut down the sort time by dividing the data into buckets, then use O(1) lookup to retrieve the bucket of data that is relevant.
Option Strict On
Option Explicit On
Module Module1
Private Const MAX_SIZE As Integer = 100000
Private m_input(MAX_SIZE) As Integer
Private m_table(MAX_SIZE) As List(Of Integer)
Private m_randomGen As New Random()
Private m_operations As Integer = 0
Private Sub generateData()
' fill with random numbers between 0 and MAX_SIZE - 1
For i = 0 To MAX_SIZE - 1
m_input(i) = m_randomGen.Next(0, MAX_SIZE - 1)
Next
End Sub
Private Sub sortData()
For i As Integer = 0 To MAX_SIZE - 1
Dim x = m_input(i)
If m_table(x) Is Nothing Then
m_table(x) = New List(Of Integer)
End If
m_table(x).Add(x)
' clearly this is simply going to be MAX_SIZE -1
m_operations = m_operations + 1
Next
End Sub
Private Sub printData(ByVal start As Integer, ByVal finish As Integer)
If start < 0 Or start > MAX_SIZE - 1 Then
Throw New Exception("printData - start out of range")
End If
If finish < 0 Or finish > MAX_SIZE - 1 Then
Throw New Exception("printData - finish out of range")
End If
For i As Integer = start To finish
If m_table(i) IsNot Nothing Then
For Each x In m_table(i)
Console.WriteLine(x)
Next
End If
Next
End Sub
' run the entire sort, but just print out the first 100 for verification purposes
Private Sub test()
m_operations = 0
generateData()
Console.WriteLine("Time started = " & Now.ToString())
sortData()
Console.WriteLine("Time finished = " & Now.ToString & " Number of operations = " & m_operations.ToString())
' print out a random 100 segment from the sorted array
Dim start As Integer = m_randomGen.Next(0, MAX_SIZE - 101)
printData(start, start + 100)
End Sub
Sub Main()
test()
Console.ReadLine()
End Sub
End Module
I posted a gist https://gist.github.com/mikee805/c6c2e6a35032a3ab74f643a1d0f8249c
that is a rather complete example based on what I found on Stack Overflow and some blogs; hopefully someone else can use it.
import static java.nio.file.Files.isWritable;
import static java.nio.file.StandardOpenOption.READ;
import static org.apache.commons.io.FileUtils.forceMkdir;
import static org.apache.commons.io.IOUtils.closeQuietly;
import static org.apache.commons.lang3.StringUtils.isBlank;
import static org.apache.commons.lang3.StringUtils.trimToNull;
import java.io.File;
import java.io.IOException;
import java.nio.Buffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
public class FileUtils {
private FileUtils() {
}
private static boolean found(final String candidate, final String prefix) {
return isBlank(candidate) || candidate.startsWith(prefix);
}
private static boolean before(final String candidate, final String prefix) {
return prefix.compareTo(candidate.substring(0, prefix.length())) < 0;
}
public static MappedByteBuffer getMappedByteBuffer(final Path path) {
FileChannel fileChannel = null;
try {
fileChannel = FileChannel.open(path, READ);
return fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size()).load();
}
catch (Exception e) {
throw new RuntimeException(e);
}
finally {
closeQuietly(fileChannel);
}
}
public static String binarySearch(final String prefix, final MappedByteBuffer buffer) {
if (buffer == null) {
return null;
}
try {
long low = 0;
long high = buffer.limit();
while (low < high) {
int mid = (int) ((low + high) / 2);
final String candidate = getLine(mid, buffer);
if (found(candidate, prefix)) {
return trimToNull(candidate);
}
else if (before(candidate, prefix)) {
high = mid;
}
else {
low = mid + 1;
}
}
}
catch (Exception e) {
throw new RuntimeException(e);
}
return null;
}
private static String getLine(int position, final MappedByteBuffer buffer) {
// search backwards to the find the proceeding new line
// then search forwards again until the next new line
// return the string in between
final StringBuilder stringBuilder = new StringBuilder();
// walk it back
char candidate = (char)buffer.get(position);
while (position > 0 && candidate != '\n') {
candidate = (char)buffer.get(--position);
}
// we either are at the beginning of the file or a new line
if (position == 0) {
// we are at the beginning at the first char
candidate = (char)buffer.get(position);
stringBuilder.append(candidate);
}
// there is/are char(s) after new line / first char
if (isInBuffer(buffer, position)) {
//first char after new line
candidate = (char)buffer.get(++position);
stringBuilder.append(candidate);
//walk it forward
while (isInBuffer(buffer, position) && candidate != ('\n')) {
candidate = (char)buffer.get(++position);
stringBuilder.append(candidate);
}
}
return stringBuilder.toString();
}
private static boolean isInBuffer(final Buffer buffer, int position) {
return position + 1 < buffer.limit();
}
public static File getOrCreateDirectory(final String dirName) {
final File directory = new File(dirName);
try {
forceMkdir(directory);
isWritable(directory.toPath());
}
catch (IOException e) {
throw new RuntimeException(e);
}
return directory;
}
}
I had a similar problem, so I created a (Scala) library from solutions provided in this thread:
https://github.com/avast/BigMap
It contains a utility for sorting a huge file and doing binary search in the sorted file...
If you truly want to try memory-mapping the file, I found a tutorial on how to use memory mapping in Java NIO.
