Hadoop MapReduce CustomSplit/CustomRecordReader - Java

I have a huge text file and I want to split it so that each chunk has 5 lines. I implemented my own GWASInputFormat and GWASRecordReader classes. However, my question is about the following code (which I copied from http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/). Inside the initialize() method I have the following lines:
FileSplit split = (FileSplit) genericSplit;
final Path file = split.getPath();
Configuration conf = context.getConfiguration();
My question is: is the file already split by the time the initialize() method is called in my GWASRecordReader class? I thought I was doing the split in the GWASRecordReader class. Let me know if my thought process is right here.
package com.test;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;
public class GWASRecordReader extends RecordReader<LongWritable, Text> {
private final int NLINESTOPROCESS = 5;
private LineReader in;
private LongWritable key;
private Text value = new Text();
private long start = 0;
private long pos = 0;
private long end = 0;
private int maxLineLength;
public void close() throws IOException {
if(in != null) {
in.close();
}
}
public LongWritable getCurrentKey() throws IOException, InterruptedException {
return key;
}
public Text getCurrentValue() throws IOException, InterruptedException {
return value;
}
public float getProgress() throws IOException, InterruptedException {
if(start == end) {
return 0.0f;
}
else {
return Math.min(1.0f, (pos - start)/(float) (end - start));
}
}
public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
FileSplit split = (FileSplit) genericSplit;
final Path file = split.getPath();
Configuration conf = context.getConfiguration();
this.maxLineLength = conf.getInt("mapred.linerecordreader.maxlength",Integer.MAX_VALUE);
FileSystem fs = file.getFileSystem(conf);
start = split.getStart();
end = start + split.getLength();
System.out.println("---------------SPLIT LENGTH---------------------" + split.getLength());
boolean skipFirstLine = false;
FSDataInputStream filein = fs.open(split.getPath());
if(start != 0) {
skipFirstLine = true;
--start;
filein.seek(start);
}
in = new LineReader(filein, conf);
if(skipFirstLine) {
start += in.readLine(new Text(),0,(int)Math.min((long)Integer.MAX_VALUE, end - start));
}
this.pos = start;
}
public boolean nextKeyValue() throws IOException, InterruptedException {
if (key == null) {
key = new LongWritable();
}
key.set(pos);
if (value == null) {
value = new Text();
}
value.clear();
final Text endline = new Text("\n");
int newSize = 0;
for(int i=0; i<NLINESTOPROCESS;i++) {
Text v = new Text();
while( pos < end) {
newSize = in.readLine(v ,maxLineLength, Math.max((int)Math.min(Integer.MAX_VALUE, end - pos), maxLineLength));
value.append(v.getBytes(), 0, v.getLength());
value.append(endline.getBytes(),0,endline.getLength());
if(newSize == 0) {
break;
}
pos += newSize;
if(newSize < maxLineLength) {
break;
}
}
}
if(newSize == 0) {
key = null;
value = null;
return false;
} else {
return true;
}
}
}

Yes, the input file will already be split. It basically goes like this:
your input file(s) -> InputSplit -> RecordReader -> Mapper...
The InputSplit breaks the input into chunks, and the RecordReader breaks those chunks into key/value pairs. Both are determined by the InputFormat you use. For example, TextInputFormat uses FileSplit to break apart the input, then LineRecordReader to process each individual line, with the byte position as the key and the line itself as the value.
So in your GWASInputFormat you'll need to look into what kind of FileSplit you use to see what it's passing to GWASRecordReader.
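For reference, here is a minimal sketch of how a GWASInputFormat along those lines might wire things up (this is my assumption of its shape, not your actual class). The key point is that the framework computes the FileSplits before it ever calls initialize() on the reader:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class GWASInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        // By the time this is called, getSplits() has already produced the FileSplits;
        // the framework then calls initialize(split, context) on the returned reader for each split.
        return new GWASRecordReader();
    }
}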
I would suggest looking into NLineInputFormat which "splits N lines of input as one split". It may be able to do exactly what you are trying to do yourself.
If you're trying to get 5 lines at a time as the value, with the line number of the first line as the key, you could do this with a customized NLineInputFormat and a custom LineRecordReader. You don't need to worry as much about the input split, since the input format can split the file into those 5-line chunks. Your RecordReader would be very similar to LineRecordReader, but instead of using the byte position of the start of the chunk, you would use the line number. So you could essentially copy NLineInputFormat and LineRecordReader, then have the input format use your record reader that tracks the line number; the code would be almost identical apart from that small change (see the sketch below).
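For example, with the new-API NLineInputFormat (assuming a Hadoop version that ships org.apache.hadoop.mapreduce.lib.input.NLineInputFormat), getting 5-line splits is just driver configuration:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

// in the driver, after creating the Job
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 5); // each split will contain 5 input lines

Note that with the stock NLineInputFormat each line in a split is still handed to the mapper as a separate (offset, line) record; getting all 5 lines as one value is where the customized record reader described above comes in.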

Related

ArrayIndexOutOfBoundsException with Hadoop MapReduce

I'm getting an ArrayIndexOutOfBoundsException next to String temp = word[5]; in my mapper.
I've researched this and I know where the error comes from: it happens when the input data is empty, or when a line has fewer fields than the index used in the code (my data has some empty cell values).
I've tried to catch the array index error using the following code, but it still gives me the error.
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class AvgMaxTempMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, DoubleWritable> {
public void map(LongWritable key, Text value, OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
if(line != null && !line.isEmpty() && line.matches(".*\\d+.*"));
String [] word = line.split(",");
String month = word[3];
String temp = word[5];
if (temp.length() > 1 && temp.length() < 5){
Double avgtemp = Double.parseDouble(temp);
output.collect(new Text(month), new DoubleWritable(avgtemp));
}
}
}
If you could please give me any hints or tips to whether the error is in this code or I should look somewhere else, that would save a lot of stress!
By throwing the exception in the method signature, you're basically causing the entire mapper to stop whenever it encounters a single "bad" line of data. What you actually want to do is have the mapper ignore that line of data but keep processing other lines.
You should check the length of word[] immediately after split(). If it's not long enough, stop processing that line. You'll also want to check that month and temp are valid after you've extracted them. How about:
String [] word = line.split(",");
if (word == null || word.length < 6) {
return; // stop processing this line, but keep the mapper running
}
String month = word[3];
if (month == null) {
return;
}
String temp = word[5];
if (temp != null && temp.length() > 1 && temp.length() < 5) {
Double avgtemp;
try {
avgtemp = Double.parseDouble(temp);
} catch (NumberFormatException ex) {
//Log that you've seen a dodgy temperature
return;
}
output.collect(new Text(month), new DoubleWritable(avgtemp));
}
It's very important to validate data in MapReduce jobs, as you can never guarantee what you'll get as input.
You might also want to look at the Apache Commons StringUtils and ArrayUtils classes - they provide methods such as StringUtils.isEmpty(temp) and ArrayUtils.isEmpty(word) that will neaten up the above.
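For example, the start of map() could be guarded like this (a rough sketch, assuming commons-lang is on the classpath):

import org.apache.commons.lang.ArrayUtils;   // or the org.apache.commons.lang3 equivalents
import org.apache.commons.lang.StringUtils;

// ...inside map():
String[] word = line.split(",");
if (ArrayUtils.isEmpty(word) || word.length < 6
        || StringUtils.isEmpty(word[3]) || StringUtils.isEmpty(word[5])) {
    return; // ignore this malformed line and keep processing the rest
}
String month = word[3];
String temp = word[5];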
I would recommend using a custom counter instead, which you will increase every time you find an empty cell. This will give you a picture of how many such lines exist in your data.
Along with some other efficiency modifications, my suggestion is the following:
import java.io.IOException; //do you still need this?
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class AvgMaxTempMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, DoubleWritable> {
public static enum STATS {MISSING_VALUE};
private Text outKey = new Text();
private DoubleWritable outValue = new DoubleWritable();
public void map(LongWritable key, Text value, OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
if(line.matches(".*\\d+.*")) { // no stray ';' here, or the check would do nothing
String [] word = line.split(",");
if (word.length < 6) { //or whatever else you consider expected
reporter.incrCounter(STATS.MISSING_VALUE,1); //you can also print/log an error message if you like
return;
}
String month = word[3];
String temp = word[5];
if (temp.length() > 1 && temp.length() < 5){
Double avgtemp = Double.parseDouble(temp);
outKey.set(month);
outValue.set(avgtemp);
output.collect(outKey, outValue);
} //you were missing this '}'
}
}
}

Use JLine to Complete Multiple Commands on One Line

I was wondering how I could implement an ArgumentCompleter such that if I complete a full and valid command, then it would begin tab completing for a new command.
I would have assumed it could be constructed by doing something like this:
final ConsoleReader consoleReader = new ConsoleReader();
final ArgumentCompleter cyclicalArgument = new ArgumentCompleter();
cyclicalArgument.getCompleters().addAll(Arrays.asList(
new StringsCompleter("foo"),
new StringsCompleter("bar"),
cyclicalArgument));
consoleReader.addCompleter(cyclicalArgument);
consoleReader.readLine();
However, right now this stops working after tab-completing the first foo bar.
Is anyone familiar enough with the library to tell me how I would go about implementing this? Or is there a known way to do this that I am missing? Also, this is using JLine2.
That was quite a task :-)
It is handled by the completer you are using. The complete() method of the completer has to use only what comes after the last blank for the search.
If you look, for example, at the FileNameCompleter of the library: this is not done at all, so you will find no completion, because the completer searches for <input1> <input2> and not just for <input2> :-)
You will have to write your own implementation of a completer that is able to find input2.
Additionally the CompletionHandler has to append what you found to what you already typed.
Here is a basic implementation changing the default FileNameCompleter:
protected int matchFiles(final String buffer, final String translated, final File[] files,
final List<CharSequence> candidates) {
// THIS IS NEW
String[] allWords = translated.split(" ");
String lastWord = allWords[allWords.length - 1];
// the lastWord is used when searching the files now
// ---
if (files == null) {
return -1;
}
int matches = 0;
// first pass: just count the matches
for (File file : files) {
if (file.getAbsolutePath().startsWith(lastWord)) {
matches++;
}
}
for (File file : files) {
if (file.getAbsolutePath().startsWith(lastWord)) {
CharSequence name = file.getName() + (matches == 1 && file.isDirectory() ? this.separator() : " ");
candidates.add(this.render(file, name).toString());
}
}
final int index = buffer.lastIndexOf(this.separator());
return index + this.separator().length();
}
And here the complete()-Method of the CompletionHandler changing the default CandidateListCompletionHandler:
@Override
public boolean complete(final ConsoleReader reader, final List<CharSequence> candidates, final int pos)
throws IOException {
CursorBuffer buf = reader.getCursorBuffer();
// THIS IS NEW
String[] allWords = buf.toString().split(" ");
String firstWords = "";
if (allWords.length > 1) {
for (int i = 0; i < allWords.length - 1; i++) {
firstWords += allWords[i] + " ";
}
}
//-----
// if there is only one completion, then fill in the buffer
if (candidates.size() == 1) {
String value = Ansi.stripAnsi(candidates.get(0).toString());
if (buf.cursor == buf.buffer.length() && this.printSpaceAfterFullCompletion && !value.endsWith(" ")) {
value += " ";
}
// fail if the only candidate is the same as the current buffer
if (value.equals(buf.toString())) {
return false;
}
CandidateListCompletionHandler.setBuffer(reader, firstWords + " " + value, pos);
return true;
} else if (candidates.size() > 1) {
String value = this.getUnambiguousCompletions(candidates);
CandidateListCompletionHandler.setBuffer(reader, value, pos);
}
CandidateListCompletionHandler.printCandidates(reader, candidates);
// redraw the current console buffer
reader.drawLine();
return true;
}

How to generate more than one key-value pairs for one input line in Hadoop Input Format?

Here is the background. I have the following input for my MapReduce job (example):
Apache Hadoop
Apache Lucene
StackOverflow
....
(Actually each line represents a user query. Not important here.) And I want my RecordReader class read one line and then pass several key-value pairs to mappers. For example, if RecordReader gets Apache Hadoop, then I want it to generate the following key-value pairs and pass it to mappers:
Apache Hadoop - 1
Apache Hadoop - 2
Apache Hadoop - 3
("-" is the separator here.) And I found that RecordReader passes key-value pairs in its next() method:
next(key, value);
Every time RecordReader.next() is called, only one key and one value can be passed as arguments. So how should I get this done?
I believe you can simply use this:
public static class MultiMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
for (int i = 1; i <= n; i++) {
context.write(value, new IntWritable(i));
}
}
}
Here n is the number of values you want to pass. For example for the key-value pairs you specified:
Apache Hadoop - 1
Apache Hadoop - 2
Apache Hadoop - 3
n would be 3.
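If you don't want to hard-code n in the mapper, one option (my sketch; the property name below is made up) is to read it from the job configuration in setup():

private int n;

@Override
protected void setup(Context context) {
    // "multimapper.repeat.count" is a made-up property name; set it in the driver with
    // job.getConfiguration().setInt("multimapper.repeat.count", 3);
    n = context.getConfiguration().getInt("multimapper.repeat.count", 3);
}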
I think if you want to send multiple values to the mapper with the same key, you must implement your own RecordReader; for example, you can write a MultiRecordReader that extends LineRecordReader, and there you must change the nextKeyValue() method.
This is the original code from LineRecordReader:
public boolean nextKeyValue() throws IOException {
if (key == null) {
key = new LongWritable();
}
key.set(pos);
if (value == null) {
value = new Text();
}
int newSize = 0;
// We always read one extra line, which lies outside the upper
// split limit i.e. (end - 1)
while (getFilePosition() <= end) {
newSize = in.readLine(value, maxLineLength,
Math.max(maxBytesToConsume(pos), maxLineLength));
pos += newSize;
if (newSize < maxLineLength) {
break;
}
// line too long. try again
LOG.info("Skipped line of size " + newSize + " at pos " +
(pos - newSize));
}
if (newSize == 0) {
key = null;
value = null;
return false;
} else {
return true;
}
}
and you can change it like this:
public boolean nextKeyValue() throws IOException {
if (key == null) {
key = new Text();
}
if (value == null) {
value = new Text();
}
int newSize = 0;
while (getFilePosition() <= end && n<=3) {
newSize = in.readLine(key, maxLineLength,
Math.max(maxBytesToConsume(pos), maxLineLength));//change value --> key
value = new Text(Integer.toString(n));
n++;
if (n == 3) // we don't advance pos until n is three
pos += newSize;
if (newSize < maxLineLength) {
break;
}
// line too long. try again
LOG.info("Skipped line of size " + newSize + " at pos " +
(pos - newSize));
}
if (newSize == 0) {
key = null;
value = null;
return false;
} else {
return true;
}
}
I think this can work for you.
Try not giving a key:
context.write(NullWritable.get(), new Text("Apache Hadoop - 1"));
context.write(NullWritable.get(), new Text("Apache Hadoop - 2"));
context.write(NullWritable.get(), new Text("Apache Hadoop - 3"));

Normalizing using MapReduce

There is this sample record,
100,1:2:3
Which I want to normalize as,
100,1
100,2
100,3
A colleague of mine wrote a Pig script to achieve this, and my MapReduce code took more time. I was using the default TextInputFormat before. To improve performance, I decided to write a custom input format class with a custom RecordReader. Taking the LineRecordReader class as reference, I tried to write the following code.
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;
import com.normalize.util.Splitter;
public class NormalRecordReader extends RecordReader<Text, Text> {
private long start;
private long pos;
private long end;
private LineReader in;
private int maxLineLength;
private Text key = null;
private Text value = null;
private Text line = null;
public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
FileSplit split = (FileSplit) genericSplit;
Configuration job = context.getConfiguration();
this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);
start = split.getStart();
end = start + split.getLength();
final Path file = split.getPath();
FileSystem fs = file.getFileSystem(job);
FSDataInputStream fileIn = fs.open(split.getPath());
in = new LineReader(fileIn, job);
this.pos = start;
}
public boolean nextKeyValue() throws IOException {
int newSize = 0;
if (line == null) {
line = new Text();
}
while (pos < end) {
newSize = in.readLine(line);
if (newSize == 0) {
break;
}
pos += newSize;
if (newSize < maxLineLength) {
break;
}
// line too long. try again
System.out.println("Skipped line of size " + newSize + " at pos " + (pos - newSize));
}
Splitter splitter = new Splitter(line.toString(), ",");
List<String> split = splitter.split();
if (key == null) {
key = new Text();
}
key.set(split.get(0));
if (value == null) {
value = new Text();
}
value.set(split.get(1));
if (newSize == 0) {
key = null;
value = null;
return false;
} else {
return true;
}
}
@Override
public Text getCurrentKey() {
return key;
}
@Override
public Text getCurrentValue() {
return value;
}
/**
* Get the progress within the split
*/
public float getProgress() {
if (start == end) {
return 0.0f;
} else {
return Math.min(1.0f, (pos - start) / (float)(end - start));
}
}
public synchronized void close() throws IOException {
if (in != null) {
in.close();
}
}
}
Though this works, I haven't seen any performance improvement. Here I am breaking the record at "," and setting 100 as the key and 1:2:3 as the value. I then only call the mapper, which does the following:
public void map(Text key, Text value, Context context)
throws IOException, InterruptedException {
try {
Splitter splitter = new Splitter(value.toString(), ":");
List<String> splits = splitter.split();
for (String split : splits) {
context.write(key, new Text(split));
}
} catch (IndexOutOfBoundsException ibe) {
System.err.println(value + " is malformed.");
}
}
The Splitter class is used to split the data, as I found String's split() to be slower. The split method is:
public List<String> split() {
List<String> splitData = new ArrayList<String>();
int beginIndex = 0, endIndex = 0;
while(true) {
endIndex = dataToSplit.indexOf(delim, beginIndex);
if(endIndex == -1) {
splitData.add(dataToSplit.substring(beginIndex));
break;
}
splitData.add(dataToSplit.substring(beginIndex, endIndex));
beginIndex = endIndex + delimLength;
}
return splitData;
}
Can the code be improved in any way?
Let me summarize here what I think you can improve instead of in the comments:
As explained, currently you are creating a Text object several times per record (number of times will be equal to your number of tokens). While it may not matter too much for small input, this can be a big deal for decently sized jobs. To fix that, do the following:
private final Text text = new Text();
public void map(Text key, Text value, Context context) {
....
for (String split : splits) {
text.set(split);
context.write(key, text);
}
}
For your splitting, what you're doing right now is for every record allocating a new array, populating this array, and then iterating over this array to write your output. Effectively you don't really need an array in this case since you're not maintaining any state. Using the implementation of the split method you provided, you only need to make one pass on the data:
public void map(Text key, Text value, Context context) {
String dataToSplit = value.toString();
String delim = ":";
int beginIndex = 0;
int endIndex = 0;
while(true) {
endIndex = dataToSplit.indexOf(delim, beginIndex);
if(endIndex == -1) {
text.set(dataToSplit.substring(beginIndex));
context.write(key, text);
break;
}
text.set(dataToSplit.substring(beginIndex, endIndex));
context.write(key, text);
beginIndex = endIndex + delim.length();
}
}
I don't really see why you wrote your own InputFormat; it seems that KeyValueTextInputFormat is exactly what you need, and it has probably already been optimized. Here is how you use it:
conf.set("key.value.separator.in.input.line", ",");
job.setInputFormatClass(KeyValueTextInputFormat.class);
Based on your example, the key for each record seems to be an integer. If that's always the case, then using a Text as your mapper input key is not optimal and it should be an IntWritable or maybe even a ByteWritable depending on what's in your data.
Similarly, you may want to use an IntWritable or ByteWritable as your mapper output key and output value.
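As a rough sketch of that change (my example, assuming the keys always parse as integers), the mapper would still receive Text/Text from KeyValueTextInputFormat but emit IntWritable for both the output key and the output value:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NormalizeMapper extends Mapper<Text, Text, IntWritable, IntWritable> {
    private final IntWritable outKey = new IntWritable();
    private final IntWritable outValue = new IntWritable();

    @Override
    protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
        outKey.set(Integer.parseInt(key.toString()));        // e.g. 100
        for (String token : value.toString().split(":")) {   // e.g. 1, 2, 3 (split(":") used for brevity;
            outValue.set(Integer.parseInt(token));           // see the single-pass split above to avoid it)
            context.write(outKey, outValue);
        }
    }
}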
Also, if you want a meaningful benchmark, you should test on a bigger dataset, like a few GBs if possible. One-minute tests are not really meaningful, especially in the context of distributed systems. One job may run quicker than another on a small input, but the trend may be reversed for bigger inputs.
That being said, you should also know that Pig does a lot of optimizations under the hood when translating to Map/Reduce, so I'm not too surprised that it runs faster than your Java Map/Reduce code; I've seen that in the past. Try the optimizations I suggested; if it's still not fast enough, here is a link on profiling your Map/Reduce jobs with a few more useful tricks (especially tip 7 on profiling is something I've found useful).

Binary search in a sorted (memory-mapped ?) file in Java

I am struggling to port a Perl program to Java, and learning Java as I go. A central component of the original program is a Perl module that does string prefix lookups in a +500 GB sorted text file using binary search
(essentially, "seek" to a byte offset in the middle of the file, backtrack to nearest newline, compare line prefix with the search string, "seek" to half/double that byte offset, repeat until found...)
I have experimented with several database solutions but found that nothing beats this in sheer lookup speed with data sets of this size. Do you know of any existing Java library that implements such functionality? Failing that, could you point me to some idiomatic example code that does random access reads in text files?
Alternatively, I am not familiar with the new (?) Java I/O libraries but would it be an option to memory-map the 500 GB text file (I'm on a 64-bit machine with memory to spare) and do binary search on the memory-mapped byte array? I would be very interested to hear any experiences you have to share about this and similar problems.
I am a big fan of Java's MappedByteBuffers for situations like this. They are blazing fast. Below is a snippet I put together for you that maps a buffer to the file, seeks to the middle, and then searches backwards to a newline character. This should be enough to get you going?
I have similar code (seek, read, repeat until done) in my own application, have benchmarked java.io streams against MappedByteBuffer in a production environment, and have posted the results on my blog (Geekomatic posts tagged 'java.nio') with raw data, graphs and all.
Two-second summary? My MappedByteBuffer-based implementation was about 275% faster. YMMV.
To work with files larger than ~2 GB, which is a problem because of the cast and .position(int pos), I've crafted a paging algorithm backed by an array of MappedByteBuffers. You'll need to be working on a 64-bit system for this to work with files larger than 2-4 GB, because MBBs use the OS's virtual memory system to work their magic.
public class StusMagicLargeFileReader {
private static final long PAGE_SIZE = Integer.MAX_VALUE;
private List<MappedByteBuffer> buffers = new ArrayList<MappedByteBuffer>();
private final byte raw[] = new byte[1];
public static void main(String[] args) throws IOException {
File file = new File("/Users/stu/test.txt");
FileChannel fc = (new FileInputStream(file)).getChannel();
StusMagicLargeFileReader buffer = new StusMagicLargeFileReader(fc);
long position = file.length() / 2;
String candidate = buffer.getString(position--);
while (position >= 0 && !candidate.equals("\n"))
candidate = buffer.getString(position--);
//have newline position or start of file...do other stuff
}
StusMagicLargeFileReader(FileChannel channel) throws IOException {
long start = 0, length = 0;
for (long index = 0; start + length < channel.size(); index++) {
if ((channel.size() / PAGE_SIZE) == index)
length = (channel.size() - index * PAGE_SIZE) ;
else
length = PAGE_SIZE;
start = index * PAGE_SIZE;
buffers.add(index, channel.map(READ_ONLY, start, length));
}
}
public String getString(long bytePosition) {
int page = (int) (bytePosition / PAGE_SIZE);
int index = (int) (bytePosition % PAGE_SIZE);
raw[0] = buffers.get(page).get(index);
return new String(raw);
}
}
I have the same problem. I am trying to find all lines that start with some prefix in a sorted file.
Here is a method I cooked up which is largely a port of Python code found here: http://www.logarithmic.net/pfh/blog/01186620415
I have tested it but not thoroughly just yet. It does not use memory mapping, though.
public static List<String> binarySearch(String filename, String string) {
List<String> result = new ArrayList<String>();
try {
File file = new File(filename);
RandomAccessFile raf = new RandomAccessFile(file, "r");
long low = 0;
long high = file.length();
long p = -1;
while (low < high) {
long mid = (low + high) / 2;
p = mid;
while (p >= 0) {
raf.seek(p);
char c = (char) raf.readByte();
//System.out.println(p + "\t" + c);
if (c == '\n')
break;
p--;
}
if (p < 0)
raf.seek(0);
String line = raf.readLine();
//System.out.println("-- " + mid + " " + line);
if (line.compareTo(string) < 0)
low = mid + 1;
else
high = mid;
}
p = low;
while (p >= 0) {
raf.seek(p);
if (((char) raf.readByte()) == '\n')
break;
p--;
}
if (p < 0)
raf.seek(0);
while (true) {
String line = raf.readLine();
if (line == null || !line.startsWith(string))
break;
result.add(line);
}
raf.close();
} catch (IOException e) {
System.out.println("IOException:");
e.printStackTrace();
}
return result;
}
I am not aware of any library that has that functionality. However, correct code for an external binary search in Java should be similar to this:
class ExternalBinarySearch {
final RandomAccessFile file;
final Comparator<String> test; // tests the element given as search parameter with the line. Insert a PrefixComparator here
public ExternalBinarySearch(File f, Comparator<String> test) throws FileNotFoundException {
this.file = new RandomAccessFile(f, "r");
this.test = test;
}
public String search(String element) throws IOException {
long l = file.length();
return search(element, -1, l-1);
}
/**
* Searches the given element in the range [low,high]. The low value of -1 is a special case to denote the beginning of a file.
* In contrast to every other line, a line at the beginning of a file doesn't need a \n directly before the line
*/
private String search(String element, long low, long high) throws IOException {
if(high - low < 1024) {
// search directly
long p = low;
while(p < high) {
String line = nextLine(p);
int r = test.compare(line,element);
if(r > 0) {
return null;
} else if (r < 0) {
p += line.length();
} else {
return line;
}
}
return null;
} else {
long m = low + ((high - low) / 2);
String line = nextLine(m);
int r = test.compare(line, element);
if(r > 0) {
return search(element, low, m);
} else if (r < 0) {
return search(element, m, high);
} else {
return line;
}
}
}
private String nextLine(long low) throws IOException {
if(low == -1) { // Beginning of file
file.seek(0);
} else {
file.seek(low);
}
int bufferLength = 65 * 1024;
byte[] buffer = new byte[bufferLength];
int r = file.read(buffer);
int lineBeginIndex = -1;
// search beginning of line
if(low == -1) { //beginning of file
lineBeginIndex = 0;
} else {
//normal mode
for(int i = 0; i < 1024; i++) {
if(buffer[i] == '\n') {
lineBeginIndex = i + 1;
break;
}
}
}
if(lineBeginIndex == -1) {
// no line begins within next 1024 bytes
return null;
}
int start = lineBeginIndex;
for(int i = start; i < r; i++) {
if(buffer[i] == '\n') {
// Found end of line
return new String(buffer, lineBeginIndex, i - lineBeginIndex + 1);
}
}
throw new IllegalArgumentException("Line too long");
}
}
Please note: I made up this code ad hoc; corner cases are not tested nearly well enough, the code assumes that no single line is larger than 64K, etc.
I also think that building an index of the offsets where lines start might be a good idea. For a 500 GB file, that index should be stored in an index file. You should gain a not-so-small constant factor with that index, because then there is no need to search for the next line in each step.
I know that was not the question, but building a prefix tree data structure like (Patricia) tries (on disk/SSD) might be a good idea for the prefix search.
This is a simple example of what you want to achieve. I would probably first index the file, keeping track of the file position for each string. I'm assuming the strings are separated by newlines (or carriage returns):
RandomAccessFile file = new RandomAccessFile("filename.txt", "r");
List<Long> indexList = new ArrayList();
long pos = 0;
while (file.readLine() != null)
{
Long linePos = new Long(pos);
indexList.add(linePos);
pos = file.getFilePointer();
}
int indexSize = indexList.size();
Long[] indexArray = new Long[indexSize];
indexList.toArray(indexArray);
The last step is to convert to an array for a slight speed improvement when doing lots of lookups. I would probably convert the Long[] to a long[] also, but I did not show that above. Finally the code to read the string from a given indexed position:
int i; // Initialize this appropriately for your algorithm.
file.seek(indexArray[i]);
String line = file.readLine();
// At this point, line contains the string #i.
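To tie this back to the prefix lookup in the question, here is a rough, untested sketch (mine, not part of the original answer) of a binary search over that index for the first line starting with a given prefix:

// Returns the position in indexArray of the first line that starts with prefix, or -1 if none.
static int findFirstWithPrefix(RandomAccessFile file, Long[] indexArray, String prefix) throws IOException {
    int low = 0, high = indexArray.length - 1, hit = -1;
    while (low <= high) {
        int mid = (low + high) >>> 1;
        file.seek(indexArray[mid]);
        String line = file.readLine();
        if (line != null && line.startsWith(prefix)) {
            hit = mid;           // remember this match, keep searching left for the first one
            high = mid - 1;
        } else if (line == null || line.compareTo(prefix) < 0) {
            low = mid + 1;       // line sorts before the prefix, go right
        } else {
            high = mid - 1;      // line sorts after the prefix, go left
        }
    }
    return hit;
}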
If you are dealing with a 500GB file, then you might want to use a faster lookup method than binary search - namely a radix sort which is essentially a variant of hashing. The best method for doing this really depends on your data distributions and types of lookup, but if you are looking for string prefixes there should be a good way to do this.
I posted an example of a radix sort solution for integers, but you can use the same idea - basically to cut down the sort time by dividing the data into buckets, then using O(1) lookup to retrieve the bucket of data that is relevant.
Option Strict On
Option Explicit On
Module Module1
Private Const MAX_SIZE As Integer = 100000
Private m_input(MAX_SIZE) As Integer
Private m_table(MAX_SIZE) As List(Of Integer)
Private m_randomGen As New Random()
Private m_operations As Integer = 0
Private Sub generateData()
' fill with random numbers between 0 and MAX_SIZE - 1
For i = 0 To MAX_SIZE - 1
m_input(i) = m_randomGen.Next(0, MAX_SIZE - 1)
Next
End Sub
Private Sub sortData()
For i As Integer = 0 To MAX_SIZE - 1
Dim x = m_input(i)
If m_table(x) Is Nothing Then
m_table(x) = New List(Of Integer)
End If
m_table(x).Add(x)
' clearly this is simply going to be MAX_SIZE -1
m_operations = m_operations + 1
Next
End Sub
Private Sub printData(ByVal start As Integer, ByVal finish As Integer)
If start < 0 Or start > MAX_SIZE - 1 Then
Throw New Exception("printData - start out of range")
End If
If finish < 0 Or finish > MAX_SIZE - 1 Then
Throw New Exception("printData - finish out of range")
End If
For i As Integer = start To finish
If m_table(i) IsNot Nothing Then
For Each x In m_table(i)
Console.WriteLine(x)
Next
End If
Next
End Sub
' run the entire sort, but just print out the first 100 for verification purposes
Private Sub test()
m_operations = 0
generateData()
Console.WriteLine("Time started = " & Now.ToString())
sortData()
Console.WriteLine("Time finished = " & Now.ToString & " Number of operations = " & m_operations.ToString())
' print out a random 100 segment from the sorted array
Dim start As Integer = m_randomGen.Next(0, MAX_SIZE - 101)
printData(start, start + 100)
End Sub
Sub Main()
test()
Console.ReadLine()
End Sub
End Module
I posted a gist https://gist.github.com/mikee805/c6c2e6a35032a3ab74f643a1d0f8249c
that is a rather complete example based on what I found on Stack Overflow and some blogs; hopefully someone else can use it.
import static java.nio.file.Files.isWritable;
import static java.nio.file.StandardOpenOption.READ;
import static org.apache.commons.io.FileUtils.forceMkdir;
import static org.apache.commons.io.IOUtils.closeQuietly;
import static org.apache.commons.lang3.StringUtils.isBlank;
import static org.apache.commons.lang3.StringUtils.trimToNull;
import java.io.File;
import java.io.IOException;
import java.nio.Buffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
public class FileUtils {
private FileUtils() {
}
private static boolean found(final String candidate, final String prefix) {
return isBlank(candidate) || candidate.startsWith(prefix);
}
private static boolean before(final String candidate, final String prefix) {
return prefix.compareTo(candidate.substring(0, prefix.length())) < 0;
}
public static MappedByteBuffer getMappedByteBuffer(final Path path) {
FileChannel fileChannel = null;
try {
fileChannel = FileChannel.open(path, READ);
return fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size()).load();
}
catch (Exception e) {
throw new RuntimeException(e);
}
finally {
closeQuietly(fileChannel);
}
}
public static String binarySearch(final String prefix, final MappedByteBuffer buffer) {
if (buffer == null) {
return null;
}
try {
long low = 0;
long high = buffer.limit();
while (low < high) {
int mid = (int) ((low + high) / 2);
final String candidate = getLine(mid, buffer);
if (found(candidate, prefix)) {
return trimToNull(candidate);
}
else if (before(candidate, prefix)) {
high = mid;
}
else {
low = mid + 1;
}
}
}
catch (Exception e) {
throw new RuntimeException(e);
}
return null;
}
private static String getLine(int position, final MappedByteBuffer buffer) {
// search backwards to find the preceding new line
// then search forwards again until the next new line
// return the string in between
final StringBuilder stringBuilder = new StringBuilder();
// walk it back
char candidate = (char)buffer.get(position);
while (position > 0 && candidate != '\n') {
candidate = (char)buffer.get(--position);
}
// we either are at the beginning of the file or a new line
if (position == 0) {
// we are at the beginning at the first char
candidate = (char)buffer.get(position);
stringBuilder.append(candidate);
}
// there is/are char(s) after new line / first char
if (isInBuffer(buffer, position)) {
//first char after new line
candidate = (char)buffer.get(++position);
stringBuilder.append(candidate);
//walk it forward
while (isInBuffer(buffer, position) && candidate != ('\n')) {
candidate = (char)buffer.get(++position);
stringBuilder.append(candidate);
}
}
return stringBuilder.toString();
}
private static boolean isInBuffer(final Buffer buffer, int position) {
return position + 1 < buffer.limit();
}
public static File getOrCreateDirectory(final String dirName) {
final File directory = new File(dirName);
try {
forceMkdir(directory);
isWritable(directory.toPath());
}
catch (IOException e) {
throw new RuntimeException(e);
}
return directory;
}
}
I had a similar problem, so I created a (Scala) library from the solutions provided in this thread:
https://github.com/avast/BigMap
It contains a utility for sorting a huge file and doing binary search in this sorted file...
If you truly want to try memory mapping the file, I found a tutorial on how to use memory mapping in Java nio.
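Not the tutorial itself, but a bare-bones sketch (my assumption of the minimal calls involved) of memory-mapping a file read-only with java.nio:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapSketch {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "r");
             FileChannel channel = raf.getChannel()) {
            // A single MappedByteBuffer is limited to Integer.MAX_VALUE bytes (~2 GB),
            // so a 500 GB file needs an array of mappings, as in the paging approach shown above.
            long size = Math.min(channel.size(), Integer.MAX_VALUE);
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            byte first = buffer.get(0); // random access by byte offset, no read() calls
            System.out.println("first byte: " + (char) first);
        }
    }
}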
