Custom HashMap Code Issue - java

I have the following code, where I use a HashMap (built on two parallel arrays) for storing key-value pairs (a key can have multiple values). I have to store and load it for future use, so I save and load it using a FileChannel. The issue with this code: I can store nearly 120 million key-value pairs on my 8 GB server (I can allocate nearly 5 GB of the 8 GB to my JVM, and the two parallel arrays take nearly 2.5 GB; the rest of the memory is used by other parts of my code). But I have to store nearly 600-700 million key-value pairs. Can anybody help me modify this code so that I can store 600-700 million key-value pairs? Any comment on this would be helpful.

Another point: I have to load and store the hash map to and from memory, and this takes rather a long time using the FileChannel. Following various suggestions on Stack Overflow, I didn't find a faster approach. I have also used ObjectOutputStream and zipped output streams, but they were slower than the code below. Is there any way to store those two parallel arrays so that the loading time is much faster? I have included a test case in my code below. Any comment on that would also be helpful.
import java.io.*;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.Arrays;
import java.util.Random;
import java.nio.*;
import java.nio.channels.FileChannel;
import java.io.RandomAccessFile;
public class Test {
public static void main(String args[]) {
try {
Random randomGenerator = new Random();
LongIntParallelHashMultimap lph = new LongIntParallelHashMultimap(220000000, "xx.dat", "yy.dat");
for (int i = 0; i < 110000000; i++) {
lph.put(i, randomGenerator.nextInt(200000000));
}
lph.save();
LongIntParallelHashMultimap lphN = new LongIntParallelHashMultimap(220000000, "xx.dat", "yy.dat");
lphN.load();
int tt[] = lphN.get(1);
System.out.println(tt[0]);
} catch (Exception e) {
e.printStackTrace();
}
}
}
class LongIntParallelHashMultimap {
private static final long NULL = -1L;
private final long[] keys;
private final int[] values;
private int size;
private int savenum = 0;
private String str1 = "";
private String str2 = "";
public LongIntParallelHashMultimap(int capacity, String st1, String st2) {
keys = new long[capacity];
values = new int[capacity];
Arrays.fill(keys, NULL);
savenum = capacity;
str1 = st1;
str2 = st2;
}
public void put(long key, int value) {
int index = indexFor(key);
while (keys[index] != NULL) {
index = successor(index);
}
keys[index] = key;
values[index] = value;
++size;
}
public int[] get(long key) {
int index = indexFor(key);
int count = countHits(key, index);
int[] hits = new int[count];
int hitIndex = 0;
while (keys[index] != NULL) {
if (keys[index] == key) {
hits[hitIndex] = values[index];
++hitIndex;
}
index = successor(index);
}
return hits;
}
private int countHits(long key, int index) {
int numHits = 0;
while (keys[index] != NULL) {
if (keys[index] == key) {
++numHits;
}
index = successor(index);
}
return numHits;
}
private int indexFor(long key) {
return Math.abs((int) ((key * 5700357409661598721L) % keys.length));
}
private int successor(int index) {
return (index + 1) % keys.length;
}
public int size() {
return size;
}
public void load() {
try {
FileChannel channel2 = new RandomAccessFile(str1, "r").getChannel();
MappedByteBuffer mbb2 = channel2.map(FileChannel.MapMode.READ_ONLY, 0, channel2.size());
mbb2.order(ByteOrder.nativeOrder());
assert mbb2.remaining() == savenum * 8;
for (int i = 0; i < savenum; i++) {
long l = mbb2.getLong();
keys[i] = l;
}
channel2.close();
FileChannel channel3 = new RandomAccessFile(str2, "r").getChannel();
MappedByteBuffer mbb3 = channel3.map(FileChannel.MapMode.READ_ONLY, 0, channel3.size());
mbb3.order(ByteOrder.nativeOrder());
assert mbb3.remaining() == savenum * 4;
for (int i = 0; i < savenum; i++) {
int l1 = mbb3.getInt();
values[i] = l1;
}
channel3.close();
} catch (Exception e) {
System.out.println(e);
}
}
public void save() {
try {
FileChannel channel = new RandomAccessFile(str1, "rw").getChannel();
MappedByteBuffer mbb = channel.map(FileChannel.MapMode.READ_WRITE, 0, savenum * 8);
mbb.order(ByteOrder.nativeOrder());
for (int i = 0; i < savenum; i++) {
mbb.putLong(keys[i]);
}
channel.close();
FileChannel channel1 = new RandomAccessFile(str2, "rw").getChannel();
MappedByteBuffer mbb1 = channel1.map(FileChannel.MapMode.READ_WRITE, 0, savenum * 4);
mbb1.order(ByteOrder.nativeOrder());
for (int i = 0; i < savenum; i++) {
mbb1.putInt(values[i]);
}
channel1.close();
} catch (Exception e) {
System.out.println("IOException : " + e);
}
}
}

I doubt this is possible, given the datatypes you have declared. Just multiply the sizes of the primitive types.
Each row requires 4 bytes to store an int and 8 bytes to store a long.
600 million rows * 12 bytes per row = 7200 MB = 7.03 GB. You say you can allocate 5 GB to the JVM, so even if it were all heap and held only this custom HashMap, it would not fit. Consider shrinking the size of the datatypes involved or storing the data somewhere other than RAM.

Keep the database on disk, not in memory. Rewrite your operations so that they don't operate on arrays but on buffers. Then you can open a sufficiently large file and have each operation access the portion it needs through a mapped buffer. See whether your application performs better if you implement a cache of the few most recently mapped memory regions, so that common regions don't have to be mapped and unmapped over and over but can stay mapped in.
This should give you the best of both worlds, disk and RAM:
Random access to any portion of the data structure is easy to implement
Access to often used portions of the table will be cached
Seldom used portions of the table will not occupy any memory
As you can see, this depends a lot on locality: if some keys are more common than others, things will perform well, whereas nicely distributed keys will cause a new disk operation for each access. So while nice distributions are desirable for most in-memory hash maps, other structures which map often-used keys to similar locations will perform better here. Those will interfere with collision handling, though.
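To make that concrete, here is a minimal sketch (my own illustration, not the poster's code; the file name and the AutoCloseable wrapper are invented) of backing the key table with a mapped file instead of a long[]. The int[] value table would be handled the same way, and for 600-700 million entries the data would have to be split across several mappings of at most 2 GB each:
import java.io.RandomAccessFile;
import java.nio.ByteOrder;
import java.nio.LongBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch only: a file-backed replacement for the long[] key array.
// The OS pages data in and out, so the table is not limited by heap size.
class MappedLongTable implements AutoCloseable {
    private final FileChannel channel;
    private final MappedByteBuffer region;
    private final LongBuffer keys;          // long view over the mapped region

    MappedLongTable(String fileName, int capacity) throws Exception {
        channel = new RandomAccessFile(fileName, "rw").getChannel();
        region = channel.map(FileChannel.MapMode.READ_WRITE, 0, (long) capacity * 8);
        region.order(ByteOrder.nativeOrder());
        keys = region.asLongBuffer();
    }

    long get(int index)             { return keys.get(index); }
    void put(int index, long value) { keys.put(index, value); }

    @Override
    public void close() throws Exception {
        region.force();                     // flush dirty pages; the file itself is the saved state
        channel.close();
    }
}
With this layout the separate save() and load() phases largely disappear, because the mapping is the on-disk representation.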

It would be better to use an in-memory database such as SQLite, which should give good results.
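If you go that route, a rough JDBC sketch might look like the following (the table, column and file names are made up here, and it assumes the sqlite-jdbc driver is on the classpath):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Rough sketch of the SQLite route; names and volumes are illustrative only.
public class SqliteMultimapSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:pairs.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS kv (k INTEGER, v INTEGER)");
                st.execute("CREATE INDEX IF NOT EXISTS kv_k ON kv(k)");
            }
            conn.setAutoCommit(false);                       // batch the inserts in one transaction
            try (PreparedStatement ps = conn.prepareStatement("INSERT INTO kv VALUES (?, ?)")) {
                for (long key = 0; key < 1_000; key++) {     // tiny demo volume
                    ps.setLong(1, key);
                    ps.setInt(2, (int) (key % 7));
                    ps.addBatch();
                }
                ps.executeBatch();
            }
            conn.commit();
            try (PreparedStatement q = conn.prepareStatement("SELECT v FROM kv WHERE k = ?")) {
                q.setLong(1, 42);
                try (ResultSet rs = q.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getInt(1));    // all values stored for key 42
                    }
                }
            }
        }
    }
}
Batching the inserts inside a single transaction, as above, matters a great deal for insert throughput with SQLite.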

Related

Why won't it tear? Trying to write an example program showing tearing for long between two threads in java

According to the Java standard, writes to long values may be performed in two parts, and it is possible for one thread to read a number that was never written because it consists of the first part of one write and the second part of another (https://docs.oracle.com/javase/specs/jls/se8/html/jls-17.html#jls-17.7). I have tried to write a program that shows this happening, but it never happens. Do I misunderstand the standard, or is there an error in my example program?
In the program below if tearing happens, we should get an output
[main] INFO net.kasterma.basicjava.TearingTest2 - compare false
This has not happened in many runs.
package net.kasterma.basicjava;
import lombok.extern.slf4j.Slf4j;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;
@Slf4j
public class TearingTest2 extends Thread {
private final boolean read;
private static long i = 0;
private final static int ITERATIONS = 1_000_000;
private final static long[] read_i = new long[ITERATIONS];
private final static long[] write_i = new long[ITERATIONS];
private final static Random random = new Random();
TearingTest2(boolean read) {
this.read = read;
}
public void run() {
if (read) {
for (int iter = 0; iter < ITERATIONS; iter++) {
// log.info("read {}", iter);
read_i[iter] = i;
}
} else {
for (int iter = 0; iter < ITERATIONS; iter++) {
// log.info("write {}", iter);
i = random.nextLong();
write_i[iter] = i;
}
}
}
static boolean compare() {
Set<Long> writes = new HashSet<>();
writes.add(0L);
Arrays.stream(write_i).forEach(l -> writes.add(l));
for (int iter = 0; iter < ITERATIONS; iter++) {
if (!writes.contains(read_i[iter])) {
log.info("not found {}", iter);
return false; // <--- tearing has happened.
}
}
// compute some statistics for debugging of the program
Set<Long> reads = new HashSet<>();
Arrays.stream(read_i).forEach(l -> reads.add(l));
log.info("Number of read values {}", reads.size());
int ct = 0;
for (int iter = 0; iter < ITERATIONS; iter++) {
if (read_i[iter] == 0) {
ct++;
}
}
log.info("number zeros {}", ct);
return true;
}
public static void main(String[] args) throws InterruptedException {
Thread T1 = new TearingTest2(true);
Thread T2 = new TearingTest2(false);
T1.start();
T2.start();
T1.join();
T2.join();
log.info("compare {}", compare());
}
}
The output of a run is:
[main] INFO net.kasterma.basicjava.TearingTest2 - Number of read values 105328
[main] INFO net.kasterma.basicjava.TearingTest2 - number zeros 1575
[main] INFO net.kasterma.basicjava.TearingTest2 - compare true
From the same section of the standard that you've mentioned:
Some implementations may find it convenient to divide a single write action on a 64-bit long or double value into two write actions on adjacent 32-bit values. For efficiency's sake, this behavior is implementation-specific; an implementation of the Java Virtual Machine is free to perform writes to long and double values atomically or in two parts.
Implementations of the Java Virtual Machine are encouraged to avoid splitting 64-bit values where possible.
So I think that on some JVMs you may never see such a split, simply because of how that JVM is implemented.
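Worth adding (my own note, not from the question): the same JLS section guarantees that writes to and reads of volatile long and double values are always atomic, so a variation of the test with a volatile field should never observe a torn value. A minimal sketch:
// Sketch: with a volatile long, JLS 17.7 guarantees atomic 64-bit access,
// so only the two values actually written can ever be read.
public class NoTearing {
    private static volatile long shared = 0L;

    public static void main(String[] args) throws InterruptedException {
        Thread writer = new Thread(() -> {
            for (int n = 0; n < 1_000_000; n++) {
                shared = (n & 1) == 0 ? 0L : -1L;   // all-zero vs all-one bit patterns
            }
        });
        Thread reader = new Thread(() -> {
            for (int n = 0; n < 1_000_000; n++) {
                long v = shared;
                if (v != 0L && v != -1L) {
                    System.out.println("torn value: " + Long.toHexString(v)); // should never print
                }
            }
        });
        writer.start();
        reader.start();
        writer.join();
        reader.join();
    }
}
Note also that the original test keeps i non-volatile, which means the reader thread has no guarantee of ever seeing the writer's updates at all.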

Speeding up file read

I have a 1.7G file with the following format:
String Long String Long String Long String Long ... etc
Essentially, each String is a key and each Long is a value in a HashMap I'm interested in initialising before running anything else in my application.
My current code is:
RandomAccessFile raf=new RandomAccessFile("/home/map.dat","r");
raf.seek(0);
while(raf.getFilePointer()!=raf.length()){
String name=raf.readUTF();
long offset=raf.readLong();
map.put(name,offset);
}
This takes about 12 minutes to complete, and I'm sure there are better ways of doing this, so I would appreciate any help or pointers.
Thanks.
Update: is this what EJP suggested?
EJP, thank you for your suggestion; I hope this is what you meant. Correct me if this is wrong:
DataInputStream dis=null;
try{
dis=new DataInputStream(new BufferedInputStream(new FileInputStream("/home/map.dat")));
while(true){
String name=dis.readUTF();
long offset=dis.readLong();
map.put(name, offset);
}
}catch (EOFException eofe){
try{
dis.close();
}catch (IOException ioe){
ioe.printStackTrace();
}
}
Use a DataInputStream wrapped around a BufferedInputStream wrapped around a FileInputStream.
Instead of at least four system calls per iteration (checking the length and the current position, and performing who knows how many reads to get the string and the long), just call readUTF() and readLong() until you get an EOFException.
I would construct the file so it can be used in place, i.e. without loading it this way. As you have variable-length records, you can construct an array of the location of each record, then place the keys in order so you can perform a binary search for the data. (Or you can use a custom hash table.) You can then wrap this with a method which hides the fact that the data is actually stored in a file rather than turned into data objects.
If you do all this, the "load" phase becomes redundant and you won't need to create so many objects.
This is a long example but hopefully shows what is possible.
import vanilla.java.chronicle.Chronicle;
import vanilla.java.chronicle.Excerpt;
import vanilla.java.chronicle.impl.IndexedChronicle;
import vanilla.java.chronicle.tools.ChronicleTest;
import java.io.IOException;
import java.util.*;
public class Main {
static final String TMP = System.getProperty("java.io.tmpdir");
public static void main(String... args) throws IOException {
String baseName = TMP + "/test";
String[] keys = generateAndSave(baseName, 100 * 1000 * 1000);
long start = System.nanoTime();
SavedSortedMap map = new SavedSortedMap(baseName);
for (int i = 0; i < keys.length / 100; i++) {
long l = map.lookup(keys[i]);
// System.out.println(keys[i] + ": " + l);
}
map.close();
long time = System.nanoTime() - start;
System.out.printf("Load of %,d records and lookup of %,d keys took %.3f seconds%n",
keys.length, keys.length / 100, time / 1e9);
}
static SortedMap<String, Long> generateMap(int keys) {
SortedMap<String, Long> ret = new TreeMap<>();
while (ret.size() < keys) {
long n = ret.size();
String key = Long.toString(n);
while (key.length() < 9)
key = '0' + key;
ret.put(key, n);
}
return ret;
}
static void saveData(SortedMap<String, Long> map, String baseName) throws IOException {
Chronicle chronicle = new IndexedChronicle(baseName);
Excerpt excerpt = chronicle.createExcerpt();
for (Map.Entry<String, Long> entry : map.entrySet()) {
excerpt.startExcerpt(2 + entry.getKey().length() + 8);
excerpt.writeUTF(entry.getKey());
excerpt.writeLong(entry.getValue());
excerpt.finish();
}
chronicle.close();
}
static class SavedSortedMap {
final Chronicle chronicle;
final Excerpt excerpt;
final String midKey;
final long size;
SavedSortedMap(String baseName) throws IOException {
chronicle = new IndexedChronicle(baseName);
excerpt = chronicle.createExcerpt();
size = chronicle.size();
excerpt.index(size / 2);
midKey = excerpt.readUTF();
}
// find exact match or take the value after.
public long lookup(CharSequence key) {
if (compareTo(key, midKey) < 0)
return lookup0(0, size / 2, key);
return lookup0(size / 2, size, key);
}
private final StringBuilder tmp = new StringBuilder();
private long lookup0(long from, long to, CharSequence key) {
long mid = (from + to) >>> 1;
excerpt.index(mid);
tmp.setLength(0);
excerpt.readUTF(tmp);
if (to - from <= 1)
return excerpt.readLong();
int cmp = compareTo(key, tmp);
if (cmp < 0)
return lookup0(from, mid, key);
if (cmp > 0)
return lookup0(mid, to, key);
return excerpt.readLong();
}
public static int compareTo(CharSequence a, CharSequence b) {
int lim = Math.min(a.length(), b.length());
for (int k = 0; k < lim; k++) {
char c1 = a.charAt(k);
char c2 = b.charAt(k);
if (c1 != c2)
return c1 - c2;
}
return a.length() - b.length();
}
public void close() {
chronicle.close();
}
}
private static String[] generateAndSave(String baseName, int keyCount) throws IOException {
SortedMap<String, Long> map = generateMap(keyCount);
saveData(map, baseName);
ChronicleTest.deleteOnExit(baseName);
String[] keys = map.keySet().toArray(new String[map.size()]);
Collections.shuffle(Arrays.asList(keys));
return keys;
}
}
This generates 2 GB of raw data and performs a million lookups. It's written in such a way that loading and lookup use very little heap (<< 1 MB).
ls -l /tmp/test*
-rw-rw---- 1 peter peter 2013265920 Dec 11 13:23 /tmp/test.data
-rw-rw---- 1 peter peter 805306368 Dec 11 13:23 /tmp/test.index
/tmp/test created.
/tmp/test, size=100000000
Load of 100,000,000 records and lookup of 1,000,000 keys took 10.945 seconds
Using a hash table lookup would be faster per lookup as it is O(1) instead of O(log N), but more complex to implement.

what is wrong with this thread-safe byte sequence generator?

I need a byte generator that would generate values from Byte.MIN_VALUE to Byte.MAX_VALUE. When it reaches MAX_VALUE, it should start over again from MIN_VALUE.
I have written the code using AtomicInteger (see below); however, the code does not seem to behave properly when accessed concurrently and made artificially slow with Thread.sleep() (without the sleep it runs fine, but I suspect it is simply too fast for the concurrency problems to show up).
The code (with some added debug code):
public class ByteGenerator {
private static final int INITIAL_VALUE = Byte.MIN_VALUE-1;
private AtomicInteger counter = new AtomicInteger(INITIAL_VALUE);
private AtomicInteger resetCounter = new AtomicInteger(0);
private boolean isSlow = false;
private long startTime;
public byte nextValue() {
int next = counter.incrementAndGet();
//if (isSlow) slowDown(5);
if (next > Byte.MAX_VALUE) {
synchronized(counter) {
int i = counter.get();
//if value is still larger than max byte value, we reset it
if (i > Byte.MAX_VALUE) {
counter.set(INITIAL_VALUE);
resetCounter.incrementAndGet();
if (isSlow) slowDownAndLog(10, "resetting");
} else {
if (isSlow) slowDownAndLog(1, "missed");
}
next = counter.incrementAndGet();
}
}
return (byte) next;
}
private void slowDown(long millis) {
try {
Thread.sleep(millis);
} catch (InterruptedException e) {
}
}
private void slowDownAndLog(long millis, String msg) {
slowDown(millis);
System.out.println(resetCounter + " "
+ (System.currentTimeMillis()-startTime) + " "
+ Thread.currentThread().getName() + ": " + msg);
}
public void setSlow(boolean isSlow) {
this.isSlow = isSlow;
}
public void setStartTime(long startTime) {
this.startTime = startTime;
}
}
And, the test:
public class ByteGeneratorTest {
@Test
public void testGenerate() throws Exception {
ByteGenerator g = new ByteGenerator();
for (int n = 0; n < 10; n++) {
for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
assertEquals(i, g.nextValue());
}
}
}
@Test
public void testGenerateMultiThreaded() throws Exception {
final ByteGenerator g = new ByteGenerator();
g.setSlow(true);
final AtomicInteger[] counters = new AtomicInteger[Byte.MAX_VALUE-Byte.MIN_VALUE+1];
for (int i = 0; i < counters.length; i++) {
counters[i] = new AtomicInteger(0);
}
Thread[] threads = new Thread[100];
final CountDownLatch latch = new CountDownLatch(threads.length);
for (int i = 0; i < threads.length; i++) {
threads[i] = new Thread(new Runnable() {
public void run() {
try {
for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
byte value = g.nextValue();
counters[value-Byte.MIN_VALUE].incrementAndGet();
}
} finally {
latch.countDown();
}
}
}, "generator-client-" + i);
threads[i].setDaemon(true);
}
g.setStartTime(System.currentTimeMillis());
for (int i = 0; i < threads.length; i++) {
threads[i].start();
}
latch.await();
for (int i = 0; i < counters.length; i++) {
System.out.println("value #" + (i+Byte.MIN_VALUE) + ": " + counters[i].get());
}
//print out the number of hits for each value
for (int i = 0; i < counters.length; i++) {
assertEquals("value #" + (i+Byte.MIN_VALUE), threads.length, counters[i].get());
}
}
}
The result on my 2-core machine is that value #-128 gets 146 hits (every value should get exactly 100 hits, since we have 100 threads).
If anyone has any idea what's wrong with this code, I'm all ears/eyes.
UPDATE: for those who are in a hurry and do not want to scroll down, the correct (and shortest and most elegant) way to solve this in Java would be like this:
public byte nextValue() {
return (byte) counter.incrementAndGet();
}
Thanks, Heinz!
Initially, Java stored all fields as 4 or 8 byte values, even short and byte. Operations on the fields would simply do bit masking to shrink the bytes. Thus we could very easily do this:
public byte nextValue() {
return (byte) counter.incrementAndGet();
}
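To see why the bare cast is enough, here is a tiny demonstration (mine, not part of the original answer) of how the narrowing cast wraps consecutive int values onto the byte range:
// Each int is truncated to its low 8 bits, so the byte sequence runs
// ... 126, 127, -128, -127 ... while the int counter keeps climbing.
public class CastWrapDemo {
    public static void main(String[] args) {
        for (int i = 125; i <= 130; i++) {
            System.out.println(i + " -> " + (byte) i);
        }
        // prints: 125 -> 125, 126 -> 126, 127 -> 127, 128 -> -128, 129 -> -127, 130 -> -126
    }
}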
Fun little puzzle, thanks Neeme :-)
You make the decision to incrementAndGet() based on an old value of counter.get(). The value of the counter can reach MAX_VALUE again before you perform the incrementAndGet() operation on the counter.
if (next > Byte.MAX_VALUE) {
synchronized(counter) {
int i = counter.get(); //here You make sure the the counter is not over the MAX_VALUE
if (i > Byte.MAX_VALUE) {
counter.set(INITIAL_VALUE);
resetCounter.incrementAndGet();
if (isSlow) slowDownAndLog(10, "resetting");
} else {
if (isSlow) slowDownAndLog(1, "missed"); //the counter can reach MAX_VALUE again if you wait here long enough
}
next = counter.incrementAndGet(); //here you increment on return the counter that can reach >MAX_VALUE in the meantime
}
}
To make it work, one has to make sure that no decisions are made on stale info. Either reset the counter or return the old value.
public byte nextValue() {
int next = counter.incrementAndGet();
if (next > Byte.MAX_VALUE) {
synchronized(counter) {
next = counter.incrementAndGet();
//if value is still larger than max byte value, we reset it
if (next > Byte.MAX_VALUE) {
counter.set(INITIAL_VALUE + 1);
next = INITIAL_VALUE + 1;
resetCounter.incrementAndGet();
if (isSlow) slowDownAndLog(10, "resetting");
} else {
if (isSlow) slowDownAndLog(1, "missed");
}
}
}
return (byte) next;
}
Your synchronized block contains only the if body. It should wrap the whole method, including the if statement itself. Or just make your nextValue method synchronized; in that case you do not need atomic variables at all.
I hope this will work for you. Try to use atomic variables only if you really need the highest-performance code, i.e. if the synchronized statement bothers you. IMHO, in most cases it does not.
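For completeness, a plain synchronized version without any atomics could look like this (a sketch only, not run against the original test suite):
// Sketch: the whole check-and-increment is a single critical section,
// so no decision is ever made on a stale value and no AtomicInteger is needed.
public class SynchronizedByteGenerator {
    private int counter = Byte.MIN_VALUE - 1;

    public synchronized byte nextValue() {
        counter++;
        if (counter > Byte.MAX_VALUE) {
            counter = Byte.MIN_VALUE;   // wrap around
        }
        return (byte) counter;
    }
}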
If I understand you correctly, you care that the results of nextValue are in the range Byte.MIN_VALUE to Byte.MAX_VALUE and you don't care about the value stored in the counter.
Then you can map integers onto bytes such that your required enumeration behaviour is exposed:
private static final int VALUE_RANGE = Byte.MAX_VALUE - Byte.MIN_VALUE + 1;
private final AtomicInteger counter = new AtomicInteger(0);
public byte nextValue() {
return (byte) (counter.incrementAndGet() % VALUE_RANGE + Byte.MIN_VALUE - 1);
}
Beware, this is untested code. But the idea should work.
I coded up the following version of nextValue using compareAndSet which is designed to be used in a non-synchronized block. It passed your unit tests:
Oh, and I introduced new constants for MIN_VALUE and MAX_VALUE but you can ignore those if you prefer.
static final int LOWEST_VALUE = Byte.MIN_VALUE;
static final int HIGHEST_VALUE = Byte.MAX_VALUE;
private AtomicInteger counter = new AtomicInteger(LOWEST_VALUE - 1);
private AtomicInteger resetCounter = new AtomicInteger(0);
public byte nextValue() {
int oldValue;
int newValue;
do {
oldValue = counter.get();
if (oldValue >= HIGHEST_VALUE) {
newValue = LOWEST_VALUE;
resetCounter.incrementAndGet();
if (isSlow) slowDownAndLog(10, "resetting");
} else {
newValue = oldValue + 1;
if (isSlow) slowDownAndLog(1, "missed");
}
} while (!counter.compareAndSet(oldValue, newValue));
return (byte) newValue;
}
compareAndSet() works in conjunction with get() to manage concurrency.
At the start of your critical section, you perform a get() to retrieve the old value. You then perform some function dependent only on the old value to compute a new value. Then you use compareAndSet() to set the new value. If the AtomicInteger is no longer equal to the old value at the time compareAndSet() is executed (because of concurrent activity), it fails and you must start over.
If you have an extreme amount of concurrency and the computation time is long, it is conceivable that the compareAndSet() may fail many times before succeeding, and it may be worth gathering statistics on that if it concerns you.
I'm not suggesting that this is a better or worse approach than a simple synchronized block as others have suggested, but I personally would probably use a synchronized block for simplicity.
EDIT: I'll answer your actual question "Why doesn't mine work?"
Your code has:
int next = counter.incrementAndGet();
if (next > Byte.MAX_VALUE) {
As these two lines are not protected by a synchronized block, multiple threads can execute them concurrently and all obtain values of next > Byte.MAX_VALUE. All of them will then drop through into the synchronized block and set counter back to INITIAL_VALUE (one after another as they wait for each other).
Over the years, a huge amount has been written about the pitfalls of trying to get a performance tweak by not synchronizing when it doesn't seem necessary. For example, see Double-Checked Locking.
Notwithstanding that Heinz Kabutz's is the clean answer to the specific question, ye olde Java SE 8 [March 2014] added AtomicInteger.updateAndGet (and friends). This leads to a more general solution if circumstances require it:
public class ByteGenerator {
private static final int MIN = Byte.MIN_VALUE;
private static final int MAX = Byte.MAX_VALUE;
private final AtomicInteger counter = new AtomicInteger(MIN);
public byte nextValue() {
return (byte)counter.getAndUpdate(ByteGenerator::update);
}
private static int update(int old) {
return old==MAX ? MIN : old+1;
}
}

Encoding codes in Java

Over the past couple of weeks I've read through the book Error Control Coding: Fundamentals and Applications in order to learn about BCH (Bose, Chaudhuri, Hocquenghem) codes for a junior programming role at a telecoms company.
This book mostly covers the mathematics and theory behind the subject, but I'm struggling to implement some of the concepts, primarily getting the next n codewords. I have a GUI (implemented through NetBeans, so I won't post the code as the file is huge) that passes a code in order to get the next n numbers:
Generating these numbers is where I am having problems. If I could go through all of these within just the encoding method, instead of looping through using the GUI, my life would be ten times easier.
This has been driving me crazy for days now as it is easy enough to generate 0000000000 from the input, but I am lost as to where to go from there with my code. What do I then do to generate the next working number?
Any help with generating the above code would be appreciated.
(Big edit...) Playing with the code a bit more, this seems to work:
import java.util.ArrayList;
import java.util.List;
public class Main
{
public static void main(final String[] argv)
{
final int startValue;
final int iterations;
final List<String> list;
startValue = Integer.parseInt(argv[0]);
iterations = Integer.parseInt(argv[1]);
list = encodeAll(startValue, iterations);
System.out.println(list);
}
private static List<String> encodeAll(final int startValue, final int iterations)
{
final List<String> allEncodings;
allEncodings = new ArrayList<String>();
for(int i = 0; i < iterations; i++)
{
try
{
final int value;
final String str;
final String encoding;
value = i + startValue;
str = String.format("%06d", value);
encoding = encoding(str);
allEncodings.add(encoding);
}
catch(final BadNumberException ex)
{
// do nothing
}
}
return allEncodings;
}
public static String encoding(String str)
throws BadNumberException
{
final int[] digit;
final StringBuilder s;
digit = new int[10];
for(int i = 0; i < 6; i++)
{
digit[i] = Integer.parseInt(String.valueOf(str.charAt(i)));
}
digit[6] = ((4*digit[0])+(10*digit[1])+(9*digit[2])+(2*digit[3])+(digit[4])+(7*digit[5])) % 11;
digit[7] = ((7*digit[0])+(8*digit[1])+(7*digit[2])+(digit[3])+(9*digit[4])+(6*digit[5])) % 11;
digit[8] = ((9*digit[0])+(digit[1])+(7*digit[2])+(8*digit[3])+(7*digit[4])+(7*digit[5])) % 11;
digit[9] = ((digit[0])+(2*digit[1])+(9*digit[2])+(10*digit[3])+(4*digit[4])+(digit[5])) % 11;
// Insert Parity Checking method (Vandermonde Matrix)
s = new StringBuilder();
for(int i = 0; i < 9; i++)
{
s.append(Integer.toString(digit[i]));
}
if(digit[6] == 10 || digit[7] == 10 || digit[8] == 10 || digit[9] == 10)
{
throw new BadNumberException(str);
}
return (s.toString());
}
}
class BadNumberException
extends Exception
{
public BadNumberException(final String str)
{
super(str + " cannot be encoded");
}
}
I prefer throwing the exception rather than returning a special string. In this case I ignore the exception, which I would normally say is bad practice, but here I think it is what you want.
Hard to tell if I got your problem, but after reading your question several times, maybe this is what you're looking for:
public List<String> encodeAll() {
List<String> allEncodings = new ArrayList<String>();
for (int i = 0; i < 1000000 ; i++) {
String encoding = encoding(Integer.toString(i));
allEncodings.add(encoding);
}
return allEncodings;
}
There's one flaw in the solution: the results are not zero-padded. If that's what you want, I suggest using String.format("<something>", i) in the encoding call.
Update
To use it in your current call, replace the call to encoding(String str) with a call to this method. You'll receive an ordered List with all encodings.
I assumed you were only interested in octal values; my mistake. Now I think you just forgot the encoding for value 000009 in your example, so I removed the irritating octal stuff.
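If zero-padding to six digits is what you need, the loop above can format the counter before encoding it, the same way the first answer does (a sketch, assuming an encoding(String) method that throws BadNumberException as in that answer):
// Sketch: same enumeration as above, but each value is zero-padded to six
// digits before being encoded; values that cannot be encoded are skipped.
public List<String> encodeAllPadded() {
    List<String> allEncodings = new ArrayList<String>();
    for (int i = 0; i < 1000000; i++) {
        try {
            allEncodings.add(encoding(String.format("%06d", i)));   // 9 -> "000009"
        } catch (BadNumberException ex) {
            // skip numbers whose check digits come out as 10
        }
    }
    return allEncodings;
}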

Binary search in a sorted (memory-mapped ?) file in Java

I am struggling to port a Perl program to Java, and learning Java as I go. A central component of the original program is a Perl module that does string prefix lookups in a +500 GB sorted text file using binary search
(essentially, "seek" to a byte offset in the middle of the file, backtrack to nearest newline, compare line prefix with the search string, "seek" to half/double that byte offset, repeat until found...)
I have experimented with several database solutions but found that nothing beats this in sheer lookup speed with data sets of this size. Do you know of any existing Java library that implements such functionality? Failing that, could you point me to some idiomatic example code that does random access reads in text files?
Alternatively, I am not familiar with the new (?) Java I/O libraries but would it be an option to memory-map the 500 GB text file (I'm on a 64-bit machine with memory to spare) and do binary search on the memory-mapped byte array? I would be very interested to hear any experiences you have to share about this and similar problems.
I am a big fan of Java's MappedByteBuffers for situations like this. It is blazing fast. Below is a snippet I put together for you that maps a buffer to the file, seeks to the middle, and then searches backwards to a newline character. This should be enough to get you going?
I have similar code (seek, read, repeat until done) in my own application, benchmarked
java.io streams against MappedByteBuffer in a production environment and posted the results on my blog (Geekomatic posts tagged 'java.nio' ) with raw data, graphs and all.
Two second summary? My MappedByteBuffer-based implementation was about 275% faster. YMMV.
To work with files larger than ~2 GB, which is a problem because of the cast and .position(int pos), I've crafted a paging algorithm backed by an array of MappedByteBuffers. You'll need to be working on a 64-bit system for this to work with files larger than 2-4 GB, because MBBs use the OS's virtual memory system to work their magic.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;
import static java.nio.channels.FileChannel.MapMode.READ_ONLY;
public class StusMagicLargeFileReader {
private static final long PAGE_SIZE = Integer.MAX_VALUE;
private List<MappedByteBuffer> buffers = new ArrayList<MappedByteBuffer>();
private final byte raw[] = new byte[1];
public static void main(String[] args) throws IOException {
File file = new File("/Users/stu/test.txt");
FileChannel fc = (new FileInputStream(file)).getChannel();
StusMagicLargeFileReader buffer = new StusMagicLargeFileReader(fc);
long position = file.length() / 2;
String candidate = buffer.getString(position--);
while (position >= 0 && !candidate.equals("\n"))
candidate = buffer.getString(position--);
//have newline position or start of file...do other stuff
}
StusMagicLargeFileReader(FileChannel channel) throws IOException {
long start = 0, length = 0;
for (long index = 0; start + length < channel.size(); index++) {
if ((channel.size() / PAGE_SIZE) == index)
length = (channel.size() - index * PAGE_SIZE) ;
else
length = PAGE_SIZE;
start = index * PAGE_SIZE;
buffers.add(index, channel.map(READ_ONLY, start, length));
}
}
public String getString(long bytePosition) {
int page = (int) (bytePosition / PAGE_SIZE);
int index = (int) (bytePosition % PAGE_SIZE);
raw[0] = buffers.get(page).get(index);
return new String(raw);
}
}
I have the same problem. I am trying to find all lines that start with some prefix in a sorted file.
Here is a method I cooked up which is largely a port of Python code found here: http://www.logarithmic.net/pfh/blog/01186620415
I have tested it but not thoroughly just yet. It does not use memory mapping, though.
public static List<String> binarySearch(String filename, String string) {
List<String> result = new ArrayList<String>();
try {
File file = new File(filename);
RandomAccessFile raf = new RandomAccessFile(file, "r");
long low = 0;
long high = file.length();
long p = -1;
while (low < high) {
long mid = (low + high) / 2;
p = mid;
while (p >= 0) {
raf.seek(p);
char c = (char) raf.readByte();
//System.out.println(p + "\t" + c);
if (c == '\n')
break;
p--;
}
if (p < 0)
raf.seek(0);
String line = raf.readLine();
//System.out.println("-- " + mid + " " + line);
if (line.compareTo(string) < 0)
low = mid + 1;
else
high = mid;
}
p = low;
while (p >= 0) {
raf.seek(p);
if (((char) raf.readByte()) == '\n')
break;
p--;
}
if (p < 0)
raf.seek(0);
while (true) {
String line = raf.readLine();
if (line == null || !line.startsWith(string))
break;
result.add(line);
}
raf.close();
} catch (IOException e) {
System.out.println("IOException:");
e.printStackTrace();
}
return result;
}
I am not aware of any library that has that functionality. However, correct code for an external binary search in Java should be similar to this:
class ExternalBinarySearch {
final RandomAccessFile file;
final Comparator<String> test; // tests the element given as search parameter with the line. Insert a PrefixComparator here
public ExternalBinarySearch(File f, Comparator<String> test) throws FileNotFoundException {
this.file = new RandomAccessFile(f, "r");
this.test = test;
}
public String search(String element) throws IOException {
long l = file.length();
return search(element, -1, l-1);
}
/**
* Searches the given element in the range [low,high]. The low value of -1 is a special case to denote the beginning of a file.
* In contrast to every other line, a line at the beginning of a file doesn't need a \n directly before the line
*/
private String search(String element, long low, long high) throws IOException {
if(high - low < 1024) {
// search directly
long p = low;
while(p < high) {
String line = nextLine(p);
int r = test.compare(line,element);
if(r > 0) {
return null;
} else if (r < 0) {
p += line.length();
} else {
return line;
}
}
return null;
} else {
long m = low + ((high - low) / 2);
String line = nextLine(m);
int r = test.compare(line, element);
if(r > 0) {
return search(element, low, m);
} else if (r < 0) {
return search(element, m, high);
} else {
return line;
}
}
}
private String nextLine(long low) throws IOException {
if(low == -1) { // Beginning of file
file.seek(0);
} else {
file.seek(low);
}
int bufferLength = 65 * 1024;
byte[] buffer = new byte[bufferLength];
int r = file.read(buffer);
int lineBeginIndex = -1;
// search beginning of line
if(low == -1) { //beginning of file
lineBeginIndex = 0;
} else {
//normal mode
for(int i = 0; i < 1024; i++) {
if(buffer[i] == '\n') {
lineBeginIndex = i + 1;
break;
}
}
}
if(lineBeginIndex == -1) {
// no line begins within next 1024 bytes
return null;
}
int start = lineBeginIndex;
for(int i = start; i < r; i++) {
if(buffer[i] == '\n') {
// Found end of line
return new String(buffer, lineBeginIndex, i - lineBeginIndex + 1);
}
}
throw new IllegalArgumentException("Line too long");
}
}
Please note: I made up this code ad hoc. Corner cases are not tested nearly well enough, the code assumes that no single line is larger than 64K, etc.
I also think that building an index of the offsets where lines start might be a good idea. For a 500 GB file, that index should be stored in an index file. You should gain a not-so-small constant factor with that index, because then there is no need to search for the next line in each step.
I know that was not the question, but building a prefix tree data structure like a (Patricia) trie (on disk/SSD) might be a good idea for the prefix search.
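Returning to the offset-index idea: a sketch of building such an index file (class and file names invented here) could be as simple as recording, once and up front, the byte offset that follows every newline:
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;

// Sketch only: build a side file of line-start offsets for a sorted text file.
// With such an index a binary search can seek directly to the k-th line
// instead of backtracking to the previous newline on every probe.
public class LineIndexBuilder {
    public static void buildIndex(String dataFile, String indexFile) throws Exception {
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(dataFile));
             DataOutputStream index = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream(indexFile)))) {
            long offset = 0;
            index.writeLong(0L);                        // first line starts at offset 0
            int b;
            while ((b = in.read()) != -1) {
                offset++;
                if (b == '\n') index.writeLong(offset); // the byte after '\n' starts a line
            }
            // note: if the file ends with '\n' this records one trailing, empty line
        }
    }
}
Each probe of the binary search then becomes one readLong() from the index followed by one seek() into the data file, instead of a backwards scan for the previous newline.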
This is a simple example of what you want to achieve. I would probably first index the file, keeping track of the file position for each string. I'm assuming the strings are separated by newlines (or carriage returns):
RandomAccessFile file = new RandomAccessFile("filename.txt", "r");
List<Long> indexList = new ArrayList<Long>();
long pos = 0;
while (file.readLine() != null)
{
Long linePos = new Long(pos);
indexList.add(linePos);
pos = file.getFilePointer();
}
int indexSize = indexList.size();
Long[] indexArray = new Long[indexSize];
indexList.toArray(indexArray);
The last step is to convert to an array for a slight speed improvement when doing lots of lookups. I would probably convert the Long[] to a long[] also, but I did not show that above. Finally the code to read the string from a given indexed position:
int i; // Initialize this appropriately for your algorithm.
file.seek(indexArray[i]);
String line = file.readLine();
// At this point, line contains the string #i.
If you are dealing with a 500GB file, then you might want to use a faster lookup method than binary search - namely a radix sort which is essentially a variant of hashing. The best method for doing this really depends on your data distributions and types of lookup, but if you are looking for string prefixes there should be a good way to do this.
I posted an example of a radix sort solution for integers, but you can use the same idea - basically to cut down the sort time by dividing the data into buckets, then using O(1) lookup to retrieve the bucket of data that is relevant.
Option Strict On
Option Explicit On
Module Module1
Private Const MAX_SIZE As Integer = 100000
Private m_input(MAX_SIZE) As Integer
Private m_table(MAX_SIZE) As List(Of Integer)
Private m_randomGen As New Random()
Private m_operations As Integer = 0
Private Sub generateData()
' fill with random numbers between 0 and MAX_SIZE - 1
For i = 0 To MAX_SIZE - 1
m_input(i) = m_randomGen.Next(0, MAX_SIZE - 1)
Next
End Sub
Private Sub sortData()
For i As Integer = 0 To MAX_SIZE - 1
Dim x = m_input(i)
If m_table(x) Is Nothing Then
m_table(x) = New List(Of Integer)
End If
m_table(x).Add(x)
' clearly this is simply going to be MAX_SIZE -1
m_operations = m_operations + 1
Next
End Sub
Private Sub printData(ByVal start As Integer, ByVal finish As Integer)
If start < 0 Or start > MAX_SIZE - 1 Then
Throw New Exception("printData - start out of range")
End If
If finish < 0 Or finish > MAX_SIZE - 1 Then
Throw New Exception("printData - finish out of range")
End If
For i As Integer = start To finish
If m_table(i) IsNot Nothing Then
For Each x In m_table(i)
Console.WriteLine(x)
Next
End If
Next
End Sub
' run the entire sort, but just print out the first 100 for verification purposes
Private Sub test()
m_operations = 0
generateData()
Console.WriteLine("Time started = " & Now.ToString())
sortData()
Console.WriteLine("Time finished = " & Now.ToString & " Number of operations = " & m_operations.ToString())
' print out a random 100 segment from the sorted array
Dim start As Integer = m_randomGen.Next(0, MAX_SIZE - 101)
printData(start, start + 100)
End Sub
Sub Main()
test()
Console.ReadLine()
End Sub
End Module
I posted a gist, https://gist.github.com/mikee805/c6c2e6a35032a3ab74f643a1d0f8249c,
which is a rather complete example based on what I found on Stack Overflow and some blogs; hopefully someone else can use it.
import static java.nio.file.Files.isWritable;
import static java.nio.file.StandardOpenOption.READ;
import static org.apache.commons.io.FileUtils.forceMkdir;
import static org.apache.commons.io.IOUtils.closeQuietly;
import static org.apache.commons.lang3.StringUtils.isBlank;
import static org.apache.commons.lang3.StringUtils.trimToNull;
import java.io.File;
import java.io.IOException;
import java.nio.Buffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
public class FileUtils {
private FileUtils() {
}
private static boolean found(final String candidate, final String prefix) {
return isBlank(candidate) || candidate.startsWith(prefix);
}
private static boolean before(final String candidate, final String prefix) {
return prefix.compareTo(candidate.substring(0, prefix.length())) < 0;
}
public static MappedByteBuffer getMappedByteBuffer(final Path path) {
FileChannel fileChannel = null;
try {
fileChannel = FileChannel.open(path, READ);
return fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size()).load();
}
catch (Exception e) {
throw new RuntimeException(e);
}
finally {
closeQuietly(fileChannel);
}
}
public static String binarySearch(final String prefix, final MappedByteBuffer buffer) {
if (buffer == null) {
return null;
}
try {
long low = 0;
long high = buffer.limit();
while (low < high) {
int mid = (int) ((low + high) / 2);
final String candidate = getLine(mid, buffer);
if (found(candidate, prefix)) {
return trimToNull(candidate);
}
else if (before(candidate, prefix)) {
high = mid;
}
else {
low = mid + 1;
}
}
}
catch (Exception e) {
throw new RuntimeException(e);
}
return null;
}
private static String getLine(int position, final MappedByteBuffer buffer) {
// search backwards to find the preceding newline
// then search forwards again until the next new line
// return the string in between
final StringBuilder stringBuilder = new StringBuilder();
// walk it back
char candidate = (char)buffer.get(position);
while (position > 0 && candidate != '\n') {
candidate = (char)buffer.get(--position);
}
// we either are at the beginning of the file or a new line
if (position == 0) {
// we are at the beginning at the first char
candidate = (char)buffer.get(position);
stringBuilder.append(candidate);
}
// there is/are char(s) after new line / first char
if (isInBuffer(buffer, position)) {
//first char after new line
candidate = (char)buffer.get(++position);
stringBuilder.append(candidate);
//walk it forward
while (isInBuffer(buffer, position) && candidate != ('\n')) {
candidate = (char)buffer.get(++position);
stringBuilder.append(candidate);
}
}
return stringBuilder.toString();
}
private static boolean isInBuffer(final Buffer buffer, int position) {
return position + 1 < buffer.limit();
}
public static File getOrCreateDirectory(final String dirName) {
final File directory = new File(dirName);
try {
forceMkdir(directory);
isWritable(directory.toPath());
}
catch (IOException e) {
throw new RuntimeException(e);
}
return directory;
}
}
I had a similar problem, so I created a (Scala) library from the solutions provided in this thread:
https://github.com/avast/BigMap
It contains a utility for sorting a huge file and doing binary search in that sorted file...
If you truly want to try memory-mapping the file, I found a tutorial on how to use memory mapping in Java NIO.
