I'm new to Java and I need to read a binary file and display its contents converted as integers. The file has this structure:
{client#, position 1, size 32 |
category, position 33, size 10 |
type, position 43, size 10 |
creditlimit, position 53, size 20}
I just need a guide on which classes to use and a conversion example; a little snippet would be appreciated.
I assume that position 1 actually is 0; the first byte.
Also, it seems the file format is fixed-size records, probably with ASCII in the bytes.
To check the data, I start by taking the fields as Strings. Converting them to long/int straight away could lose information about the actual content.
The following reads the file as a sequential binary stream. A memory-mapped file would be faster, but this is acceptable and short.
Hold the client data:
class Client {
String clientno;
String category;
String type;
String creditlimit;
@Override
public String toString() {
return String.format("Client# %s, categ %s, type %s, creditlimit %s%n",
clientno, category, type, creditlimit);
}
}
Read the file:
// Field sizes:
final int CLIENT_NO = 32;
final int CATEGORY = 10;
final int TYPE = 10;
final int CREDIT_LIMIT = 20;
final int RECORD_SIZE = CLIENT_NO + CATEGORY + TYPE + CREDIT_LIMIT;
byte[] record = new byte[RECORD_SIZE];
try (BufferedInputStream in = new BufferedInputStream(
new FileInputStream(file))) {
for (;;) {
int nread = in.read(record);
if (nread < RECORD_SIZE) {
break;
}
Client client = new Client();
int offset = 0;
int offset2 = offset + CLIENT_NO;
client.clientno = recordField(record, offset, offset2 - offset);
offset = offset2;
offset2 = offset + CATEGORY;
client.category = recordField(record, offset, offset2 - offset);
offset = offset2;
offset2 = offset + TYPE;
client.type = recordField(record, offset, offset2 - offset);
offset = offset2;
offset2 = offset + CREDIT_LIMIT;
client.creditlimit = recordField(record, offset, offset2 - offset);
System.out.println(client);
}
} // Closes in.
with a field extraction method:
private static String recordField(byte[] record, int offset, int length) {
String field = new String(record, offset, length, StandardCharsets.ISO_8859_1);
// Use ASCII NUL as string terminator:
int pos = field.indexOf('\u0000');
if (pos != -1) {
field = field.substring(0, pos);
}
return field.trim(); // Trim also spaces for fixed fields.
}
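If you then really need the numbers the question mentions, the trimmed strings can be parsed afterwards. A sketch, assuming the fields actually hold plain decimal ASCII digits (otherwise Long.parseLong throws a NumberFormatException):
// Hypothetical follow-up: convert the extracted fields to numeric types.
long clientNo = Long.parseLong(client.clientno);
long creditLimit = Long.parseLong(client.creditlimit);
System.out.printf("Client %d has credit limit %d%n", clientNo, creditLimit);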
If I understand your question correctly, you should use the NIO package.
With asIntBuffer() from the ByteBuffer class you can get an IntBuffer view of a ByteBuffer, and by calling get(int[] dst) you can convert it to integers.
The initial ByteBuffer is available by using file channels.
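A minimal sketch of that, assuming the file simply contains big-endian 4-byte integers and fits into a single buffer (the file name is just a placeholder):
// Needs imports: java.nio.ByteBuffer, java.nio.IntBuffer, java.nio.channels.FileChannel,
// java.nio.file.Paths, java.nio.file.StandardOpenOption, java.util.Arrays.
try (FileChannel channel = FileChannel.open(Paths.get("data.bin"), StandardOpenOption.READ)) {
    ByteBuffer byteBuffer = ByteBuffer.allocate((int) channel.size()); // assumes the file is < 2 GB
    channel.read(byteBuffer);
    byteBuffer.flip();
    IntBuffer intBuffer = byteBuffer.asIntBuffer();
    int[] values = new int[intBuffer.remaining()];
    intBuffer.get(values);
    System.out.println(Arrays.toString(values));
}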
If you work with binary data, JBBP may be a comfortable way for you to parse and print the data structure; doing it with the framework is very easy (if I understood the task correctly and you work with byte fields). The example parses the whole input stream and then prints the parsed data to the console:
@Bin class Record {byte [] client; byte [] category; byte [] type; byte [] creditlimit;};
@Bin class Records {Record [] records;};
Records parsed = JBBPParser.prepare("records [_] {byte [32] client; byte [10] category; byte [10] type; byte [20] creditlimit;}").parse(THE_INPUT_STREAM).mapTo(Records.class);
System.out.println(new JBBPTextWriter().Bin(parsed).toString());
You are given a list of file names and their lengths in bytes.
Example:
File1: 200 File2: 500 File3: 800
You are given a number N. We want to launch N threads to read all the files in parallel, such that each thread reads approximately an equal number of bytes.
You should return N lists. Each list describes the work of one thread. Example: when N=2, there are two threads. In the above example, there is a total of 1500 bytes (200 + 500 + 800). A fair way to divide is for each thread to read 750 bytes. So you will return:
Two lists
List 1: File1: 0 - 199 File2: 0 - 499 File3: 0-49 ---------------- Total 750 bytes
List 2: File3: 50-799 -------------------- Total 750 bytes
Implement the following method
List<List<FileRange>> getSplits(List<File> files, int N)
class File {
String filename; long length; }
class FileRange {
String filename; Long startOffset; Long endOffset; }
I tried with this one, but it's not working; any help would be highly appreciated.
List<List<FileRange>> getSplits(List<File> files, int n) {
List<List<FileRange>> al=new ArrayList<>();
long s=files.size();
long sum=0;
for(int i=0;i<s;i++){
long l=files.get(i).length;
sum+=(long)l;
}
long div=(long)sum/n; // no of bytes per thread
long mod=(long)sum%n;
long[] lo=new long[(long)n];
for(long i=0;i<n;i++)
lo[i]=div;
if(mod!=0){
long i=0;
while(mod>0){
lo[i]+=1;
mod--;
i++;
}
}
long inOffset=0;
for(long j=0;j<n;j++){
long val=lo[i];
for(long i=0;i<(long)files.size();i++){
String ss=files.get(i).filename;
long ll=files.get(i).length;
if(ll<val){
inOffset=0;
val-=ll;
}
else{
inOffset=ll-val;
ll=val;
}
al.add(new ArrayList<>(new File(ss,inOffset,ll-1)));
}
}
}
I'm getting a problem with startOffset and endOffset for the corresponding file. I tried it, but I was not able to extract the data from the List and add it in the form of the required return type List<List<FileRange>>.
The essence of the problem is to simultaneously walk through two lists:
the input list, which is a list of files
the output list, which is a list of threads (where each thread has a list of ranges)
I find that the easiest approach to such problems is an infinite loop that looks something like this:
while (1)
{
move some information from the input to the output
decide whether to advance to the next input item
decide whether to advance to the next output item
if we've reached (the end of the input _OR_ the end of the output)
break
if we advanced to the next input item
prepare the next input item for processing
if we advanced to the next output item
prepare the next output item for processing
}
To keep track of the input, we need the following information
fileIndex the index into the list of files
fileOffset the offset of the first unassigned byte in the file, initially 0
fileRemain the number of bytes in the file that are unassigned, initially the file size
To keep track of the output, we need
threadIndex the index of the thread we're currently working on (which is the first index into the List<List<FileRange>> that the algorithm produces)
threadNeeds the number of bytes that the thread still needs, initially base or base+1
Side note: I'm using base as the minimum number of bytes assigned to each thread (sum/n), and extra as the number of threads that get an extra byte (sum%n).
So now we get to the heart of the algorithm: what information to move from input to output:
if fileRemain is less than threadNeeds then the rest of the file (which may be the entire file) gets assigned to the current thread, and we move to the next file
if fileRemain is greater than threadNeeds then a portion of the file is assigned to the current thread, and we move to the next thread
if fileRemain is equal to threadNeeds then the rest of the file is assigned to the thread, and we move to the next file, and the next thread
Those three cases are easily handled by comparing fileRemain and threadNeeds, and choosing a byteCount that is the minimum of the two.
With all that in mind, here's some pseudo-code to help get you started:
base = sum/n;
extra = sum%n;
// initialize the input control variables
fileIndex = 0
fileOffset = 0
fileRemain = length of file 0
// initialize the output control variables
threadIndex = 0
threadNeeds = base
if (threadIndex < extra)
threadNeeds++
while (1)
{
// decide how many bytes can be assigned, and generate some output
byteCount = min(fileRemain, threadNeeds)
add (file.name, fileOffset, fileOffset+byteCount-1) to the list of ranges
// decide whether to advance to the next input and output items
threadNeeds -= byteCount
fileRemain -= byteCount
fileOffset += byteCount
if (threadNeeds == 0)
threadIndex++
if (fileRemain == 0)
fileIndex++
// are we done yet?
if (threadIndex == n || fileIndex == files.size())
break
// if we've moved to the next input item, reinitialize the input control variables
if (fileRemain == 0)
{
fileOffset = 0
fileRemain = length of file
}
// if we've moved to the next output item, reinitialize the output control variables
if (threadNeeds == 0)
{
threadNeeds = base
if (threadIndex < extra)
threadNeeds++
}
}
Debugging tip: Reaching the end of the input, and the end of the output, should happen simultaneously. In other words, you should run out of files at exactly the same time as you run out of threads. So during development, I would check both conditions, and verify that they do, in fact, change at the same time.
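For reference, here is a rough Java translation of the pseudo-code, assuming the File and FileRange classes sketched in the question (a sketch only, not tested against corner cases such as N being larger than the total byte count):
// Rough translation of the pseudo-code above; FileRange is assumed to hold
// filename, startOffset and endOffset, and File to expose filename and length.
static List<List<FileRange>> getSplits(List<File> files, int n) {
    List<List<FileRange>> result = new ArrayList<>();
    long sum = 0;
    for (File f : files) sum += f.length;
    long base = sum / n;   // minimum bytes per thread
    long extra = sum % n;  // the first 'extra' threads get one more byte

    int fileIndex = 0;
    long fileOffset = 0;
    long fileRemain = files.get(0).length;

    int threadIndex = 0;
    long threadNeeds = base + (threadIndex < extra ? 1 : 0);
    List<FileRange> current = new ArrayList<>();

    while (true) {
        long byteCount = Math.min(fileRemain, threadNeeds);
        current.add(new FileRange(files.get(fileIndex).filename,
                fileOffset, fileOffset + byteCount - 1));
        fileOffset += byteCount;
        threadNeeds -= byteCount;
        fileRemain -= byteCount;

        if (threadNeeds == 0) {        // current thread is full
            result.add(current);
            current = new ArrayList<>();
            threadIndex++;
        }
        if (fileRemain == 0)           // current file is exhausted
            fileIndex++;

        if (threadIndex == n || fileIndex == files.size())
            break;

        if (fileRemain == 0) {
            fileOffset = 0;
            fileRemain = files.get(fileIndex).length;
        }
        if (threadNeeds == 0)
            threadNeeds = base + (threadIndex < extra ? 1 : 0);
    }
    return result;
}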
Here's the code solution for your problem (in Java):
The custom classes 'File' and 'FileRange' are as follows:
public class File{
String filename;
long length;
public File(String filename, long length) {
this.filename = filename;
this.length = length;
}
public String getFilename() {
return filename;
}
public void setFilename(String filename) {
this.filename = filename;
}
public long getLength() {
return length;
}
public void setLength(long length) {
this.length = length;
}
}
public class FileRange {
String filename;
Long startOffset;
Long endOffset;
public FileRange(String filename, Long startOffset, Long endOffset) {
this.filename = filename;
this.startOffset = startOffset;
this.endOffset = endOffset;
}
public String getFilename() {
return filename;
}
public void setFilename(String filename) {
this.filename = filename;
}
public Long getStartOffset() {
return startOffset;
}
public void setStartOffset(Long startOffset) {
this.startOffset = startOffset;
}
public Long getEndOffset() {
return endOffset;
}
public void setEndOffset(Long endOffset) {
this.endOffset = endOffset;
}
}
The main class will be as follows:
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.concurrent.atomic.AtomicInteger;
public class MainClass {
private static List<List<FileRange>> getSplits(List<File> files, int N) {
List<List<FileRange>> results = new ArrayList<>();
long sum = files.stream().mapToLong(File::getLength).sum(); // Total bytes in all the files
long div = sum/N;
long mod = sum%N;
// Storing how many bytes each thread gets to process
long thread_bytes[] = new long[N];
// At least 'div' number of bytes will be processed by each thread
for(int i=0;i<N;i++)
thread_bytes[i] = div;
// Left over bytes to be processed by each thread
for(int i=0;i<mod;i++)
thread_bytes[i] += 1;
int count = 0;
int len = files.size();
long processed_bytes[] = new long[len];
long temp = 0L;
int file_to_be_processed = 0;
while(count < N && sum > 0) {
temp = thread_bytes[count];
sum -= temp;
List<FileRange> internal = new ArrayList<>();
while (temp > 0) {
// Start from the file to be processed - Will be 0 in the first iteration
// Will be updated in the subsequent iterations
for(int j=file_to_be_processed;j<len && temp>0;j++){
File f = files.get(j);
if(f.getLength() - processed_bytes[j] <= temp){
internal.add(new FileRange(f.getFilename(), processed_bytes[j], f.getLength()- 1));
processed_bytes[j] = f.getLength() - processed_bytes[j];
temp -= processed_bytes[j];
file_to_be_processed++;
}
else{
internal.add(new FileRange(f.getFilename(), processed_bytes[j], processed_bytes[j] + temp - 1));
// In this case, we won't update the number for file to be processed
processed_bytes[j] += temp;
temp -= processed_bytes[j];
}
}
results.add(internal);
count++;
}
}
return results;
}
public static void main(String args[]){
Scanner scn = new Scanner(System.in);
int N = scn.nextInt();
// Inserting demo records in list
File f1 = new File("File 1",200);
File f2 = new File("File 2",500);
File f3 = new File("File 3",800);
List<File> files = new ArrayList<>();
files.add(f1);
files.add(f2);
files.add(f3);
List<List<FileRange>> results = getSplits(files, N);
final AtomicInteger result_count = new AtomicInteger();
// Displaying the results
results.forEach(result -> {
System.out.println("List "+result_count.incrementAndGet() + " : ");
result.forEach(res -> {
System.out.print(res.getFilename() + " : ");
System.out.print(res.getStartOffset() + " - ");
System.out.print(res.getEndOffset() + "\n");
});
System.out.println("---------------");
});
}
}
If some part is still unclear, consider a case and dry run the program.
Say 999 bytes have to be processed by 100 threads.
So each of the 100 threads gets 9 bytes and, of the remaining 99 bytes, every thread except the 100th gets 1 extra byte. By doing this, we make sure that no two threads differ by more than 1 byte. Proceed with this idea and follow the code.
My current assignment includes taking all of the objects out of a PDF file and then using the parsed-out objects. But there is an issue I have noticed where some of the stream objects are being flat-out skipped over by my code.
I am completely confused and hoping someone can help indicate what is going wrong here.
Here is the main parsing code.
void parseRawPDFFile() {
//Transform the bytes obtained from the file into a byte character sequence. This byte character sequence
//object is what allows us to use it in regex.
ByteCharSequence byteCharSequence = new ByteCharSequence(bytesFromFile.toByteArray());
byteCharSequence.getStringFromData();
Pattern pattern = Pattern.compile(SINGLE_OBJECT_REGEX);
Matcher matcher = pattern.matcher(byteCharSequence);
//While we have a match (apparently only one match exists at a time) keep looping over the list.
//When a match is found, get the starting and ending indices and manually cut these out char by char
//and assemble them into a new "ByteArrayOutputStream".
int counterOfDoom = 1;
while (matcher.find() ) {
for (int i = 0; i < matcher.groupCount(); i++) {
ByteArrayOutputStream cutOutArray = cutOutByteArrayOutputStreamFromOriginal(matcher.start(), matcher.end());
System.out.println("----------------------------------------------------");
System.out.println(cutOutArray);
//At this point we have cut out the object and can now send it for processing.
createPDFObject(cutOutArray);
System.out.println(counterOfDoom);
System.out.println("----------------------------------------------------");
counterOfDoom++;
}
}
}
Here is the code for the ByteCharSequence
(Credits for the core of this code here: http://blog.sarah-happy.ca/2013/01/java-regular-expression-on-byte-array.html)
public class ByteCharSequence implements CharSequence {
private final byte[] data;
private final int length;
private final int offset;
public ByteCharSequence(byte[] data) {
this(data, 0, data.length);
}
public ByteCharSequence(byte[] data, int offset, int length) {
this.data = data;
this.offset = offset;
this.length = length;
}
@Override
public int length() {
return this.length;
}
@Override
public char charAt(int index) {
return (char) (data[offset + index] & 0xff);
}
@Override
public CharSequence subSequence(int start, int end) {
return new ByteCharSequence(data, offset + start, end - start);
}
/**
* Get the string from the ByteCharSequence data.
* @return
*/
public String getStringFromData() {
//Load it into the method I know works to convert it to a string... Optimized? Probably not at all.
//But it works...
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
for (byte individualByte: data
) {
byteArrayOutputStream.write(individualByte);
}
return byteArrayOutputStream.toString();
}
}
The pdf data that I am processing at present:
10 0 obj
<</Filter/FlateDecode/Length 1040>>stream
(Bunch of bytes)
endstream
endobj
12 0 obj
<</Filter/FlateDecode/Length 2574/N 3>>stream
(Bunch of bytes)
endstream
endobj
Some information that I was trying to look into.
1: From what I understand, there should be no limitation on how much can fit into the data structures, so size shouldn't be an issue?
Add the DOTALL flag to the pattern compile call so that your pattern matches newline characters =)
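For example (SINGLE_OBJECT_REGEX is the constant from the question's code):
// DOTALL makes '.' match line terminators too, so multi-line stream objects are not skipped.
Pattern pattern = Pattern.compile(SINGLE_OBJECT_REGEX, Pattern.DOTALL);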
Hi team, I am trying to find the String "Henry" in a binary file and change it to a different string. FYI, the file is the output of serialising an object. Original Question here
I am new to searching bytes and imagined this code would search for my byte[] and exchange it. But it doesn't come close to working it doesn't even find a match.
{
byte[] bytesHenry = new String("Henry").getBytes();
byte[] bytesSwap = new String("Zsswd").getBytes();
byte[] seekHenry = new byte[bytesHenry.length];
RandomAccessFile file = new RandomAccessFile(fileString,"rw");
long filePointer;
while (seekHenry != null) {
filePointer = file.getFilePointer();
file.readFully(seekHenry);
if (bytesHenry == seekHenry) {
file.seek(filePointer);
file.write(bytesSwap);
break;
}
}
}
Okay, I see the bytesHenry == seekHenry problem and will swap to Arrays.equals(bytesHenry, seekHenry).
I think I need to move along by -4 byte positions each time I read 5 bytes.
Bingo it finds it now
while (seekHenry != null) {
filePointer = file.getFilePointer();
file.readFully(seekHenry);
if (Arrays.equals(bytesHenry,
seekHenry)) {
file.seek(filePointer);
file.write(bytesSwap);
break;
}
file.seek(filePointer);
file.read();
}
The following could work for you; see the method search(byte[] input, byte[] searchedFor), which returns the index where the first match starts, or -1.
public class SearchBuffer {
public static void main(String[] args) throws UnsupportedEncodingException {
String charset= "US-ASCII";
byte[] searchedFor = "ciao".getBytes(charset);
byte[] input = "aaaciaaaciaojjcia".getBytes(charset);
int idx = search(input, searchedFor);
System.out.println("index: "+idx); //should be 8
}
public static int search(byte[] input, byte[] searchedFor) {
//convert byte[] to Byte[]
Byte[] searchedForB = new Byte[searchedFor.length];
for(int x = 0; x<searchedFor.length; x++){
searchedForB[x] = searchedFor[x];
}
int idx = -1;
//search:
Deque<Byte> q = new ArrayDeque<Byte>(input.length);
for(int i=0; i<input.length; i++){
if(q.size() == searchedForB.length){
//here I can check
Byte[] cur = q.toArray(new Byte[]{});
if(Arrays.equals(cur, searchedForB)){
//found!
idx = i - searchedForB.length;
break;
} else {
//not found
q.pop();
q.addLast(input[i]);
}
} else {
q.addLast(input[i]);
}
}
return idx;
}
}
From Fastest way to find a string in a text file with java:
The best realization I've found in MIMEParser: https://github.com/samskivert/ikvm-openjdk/blob/master/build/linux-amd64/impsrc/com/sun/xml/internal/org/jvnet/mimepull/MIMEParser.java
/**
* Finds the boundary in the given buffer using Boyer-Moore algo.
* Copied from java.util.regex.Pattern.java
*
* @param mybuf boundary to be searched in this mybuf
* @param off start index in mybuf
* @param len number of bytes in mybuf
*
* @return -1 if there is no match or index where the match starts
*/
private int match(byte[] mybuf, int off, int len) {
Needed also:
private void compileBoundaryPattern();
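If you don't need Boyer-Moore speed, a plain nested-loop search over the byte array is usually enough; a minimal sketch (this is not the MIMEParser code):
// Returns the index of the first occurrence of pattern in data, or -1.
static int indexOf(byte[] data, byte[] pattern) {
    if (pattern.length == 0) return 0;
    outer:
    for (int i = 0; i <= data.length - pattern.length; i++) {
        for (int j = 0; j < pattern.length; j++) {
            if (data[i + j] != pattern[j]) continue outer;
        }
        return i;
    }
    return -1;
}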
I have a 1.7G file with the following format:
String Long String Long String Long String Long ... etc
Essentially, String is a key and Long is a value in a hashmap i'm interested in initialising before running anything else in my application.
My current code is:
RandomAccessFile raf=new RandomAccessFile("/home/map.dat","r");
raf.seek(0);
while(raf.getFilePointer()!=raf.length()){
String name=raf.readUTF();
long offset=raf.readLong();
map.put(name,offset);
}
This takes about 12 mins to complete and I'm sure there are better ways of doing this so I would appreciate any help or pointer.
thanks
Update, as per EJP's suggestion:
EJP, thank you for your suggestion; I hope this is what you meant. Correct me if this is wrong.
DataInputStream dis=null;
try{
dis=new DataInputStream(new BufferedInputStream(new FileInputStream("/home/map.dat")));
while(true){
String name=dis.readUTF();
long offset=dis.readLong();
map.put(name, offset);
}
}catch (EOFException eofe){
try{
dis.close();
}catch (IOException ioe){
ioe.printStackTrace();
}
}
Use a DataInputStream wrapped around a BufferedInputStream wrapped around a FileInputStream.
Instead of at least four system calls per iteration, checking the length, and the current size and performing who knows how many reads to get the string and the long, just call readUTF() and readLong() until you get an EOFException.
I would construct the file so it can be used in place, i.e. without loading it this way. As you have variable-length records, you can construct an array of the location of each record, then place the keys in order so you can perform a binary search for the data. (Or you can use a custom hash table.) You can then wrap this with a method which hides the fact that the data is actually stored in a file instead of being turned into data objects.
If you do all this, the "load" phase becomes redundant and you won't need to create so many objects.
This is a long example but hopefully shows what is possible.
import vanilla.java.chronicle.Chronicle;
import vanilla.java.chronicle.Excerpt;
import vanilla.java.chronicle.impl.IndexedChronicle;
import vanilla.java.chronicle.tools.ChronicleTest;
import java.io.IOException;
import java.util.*;
public class Main {
static final String TMP = System.getProperty("java.io.tmpdir");
public static void main(String... args) throws IOException {
String baseName = TMP + "/test";
String[] keys = generateAndSave(baseName, 100 * 1000 * 1000);
long start = System.nanoTime();
SavedSortedMap map = new SavedSortedMap(baseName);
for (int i = 0; i < keys.length / 100; i++) {
long l = map.lookup(keys[i]);
// System.out.println(keys[i] + ": " + l);
}
map.close();
long time = System.nanoTime() - start;
System.out.printf("Load of %,d records and lookup of %,d keys took %.3f seconds%n",
keys.length, keys.length / 100, time / 1e9);
}
static SortedMap<String, Long> generateMap(int keys) {
SortedMap<String, Long> ret = new TreeMap<>();
while (ret.size() < keys) {
long n = ret.size();
String key = Long.toString(n);
while (key.length() < 9)
key = '0' + key;
ret.put(key, n);
}
return ret;
}
static void saveData(SortedMap<String, Long> map, String baseName) throws IOException {
Chronicle chronicle = new IndexedChronicle(baseName);
Excerpt excerpt = chronicle.createExcerpt();
for (Map.Entry<String, Long> entry : map.entrySet()) {
excerpt.startExcerpt(2 + entry.getKey().length() + 8);
excerpt.writeUTF(entry.getKey());
excerpt.writeLong(entry.getValue());
excerpt.finish();
}
chronicle.close();
}
static class SavedSortedMap {
final Chronicle chronicle;
final Excerpt excerpt;
final String midKey;
final long size;
SavedSortedMap(String baseName) throws IOException {
chronicle = new IndexedChronicle(baseName);
excerpt = chronicle.createExcerpt();
size = chronicle.size();
excerpt.index(size / 2);
midKey = excerpt.readUTF();
}
// find exact match or take the value after.
public long lookup(CharSequence key) {
if (compareTo(key, midKey) < 0)
return lookup0(0, size / 2, key);
return lookup0(size / 2, size, key);
}
private final StringBuilder tmp = new StringBuilder();
private long lookup0(long from, long to, CharSequence key) {
long mid = (from + to) >>> 1;
excerpt.index(mid);
tmp.setLength(0);
excerpt.readUTF(tmp);
if (to - from <= 1)
return excerpt.readLong();
int cmp = compareTo(key, tmp);
if (cmp < 0)
return lookup0(from, mid, key);
if (cmp > 0)
return lookup0(mid, to, key);
return excerpt.readLong();
}
public static int compareTo(CharSequence a, CharSequence b) {
int lim = Math.min(a.length(), b.length());
for (int k = 0; k < lim; k++) {
char c1 = a.charAt(k);
char c2 = b.charAt(k);
if (c1 != c2)
return c1 - c2;
}
return a.length() - b.length();
}
public void close() {
chronicle.close();
}
}
private static String[] generateAndSave(String baseName, int keyCount) throws IOException {
SortedMap<String, Long> map = generateMap(keyCount);
saveData(map, baseName);
ChronicleTest.deleteOnExit(baseName);
String[] keys = map.keySet().toArray(new String[map.size()]);
Collections.shuffle(Arrays.asList(keys));
return keys;
}
}
This generates 2 GB of raw data and performs a million lookups. It's written in such a way that loading and lookup use very little heap (<< 1 MB).
ls -l /tmp/test*
-rw-rw---- 1 peter peter 2013265920 Dec 11 13:23 /tmp/test.data
-rw-rw---- 1 peter peter 805306368 Dec 11 13:23 /tmp/test.index
/tmp/test created.
/tmp/test, size=100000000
Load of 100,000,000 records and lookup of 1,000,000 keys took 10.945 seconds
Using a hash table lookup would be faster per lookup as it is O(1) instead of O(ln N), but more complex to implement.
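For illustration only, a sketch of that idea without Chronicle: a power-of-two array of record offsets indexed by the key's hash, with linear probing for collisions (the array itself could also be kept in a memory-mapped file to stay off-heap). readKeyAt is a hypothetical callback that reads the UTF key stored at a given offset in the data file.
// Sketch of a hash index over record offsets; not production code.
class OffsetHashIndex {
    private final long[] slots; // stores recordOffset + 1; 0 means empty

    OffsetHashIndex(int expectedKeys) {
        // power-of-two table roughly 4x the key count to keep probe chains short
        slots = new long[Integer.highestOneBit(Math.max(expectedKeys, 1)) * 4];
    }

    void put(String key, long recordOffset) {
        int i = key.hashCode() & (slots.length - 1);
        while (slots[i] != 0)                       // linear probing
            i = (i + 1) & (slots.length - 1);
        slots[i] = recordOffset + 1;
    }

    long lookup(String key, java.util.function.LongFunction<String> readKeyAt) {
        int i = key.hashCode() & (slots.length - 1);
        while (slots[i] != 0) {
            long off = slots[i] - 1;
            if (key.equals(readKeyAt.apply(off)))
                return off;                         // caller reads the long stored after the key
            i = (i + 1) & (slots.length - 1);
        }
        return -1;
    }
}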
I am struggling to port a Perl program to Java, and learning Java as I go. A central component of the original program is a Perl module that does string prefix lookups in a +500 GB sorted text file using binary search
(essentially, "seek" to a byte offset in the middle of the file, backtrack to nearest newline, compare line prefix with the search string, "seek" to half/double that byte offset, repeat until found...)
I have experimented with several database solutions but found that nothing beats this in sheer lookup speed with data sets of this size. Do you know of any existing Java library that implements such functionality? Failing that, could you point me to some idiomatic example code that does random access reads in text files?
Alternatively, I am not familiar with the new (?) Java I/O libraries but would it be an option to memory-map the 500 GB text file (I'm on a 64-bit machine with memory to spare) and do binary search on the memory-mapped byte array? I would be very interested to hear any experiences you have to share about this and similar problems.
I am a big fan of Java's MappedByteBuffers for situations like this. It is blazing fast. Below is a snippet I put together for you that maps a buffer to the file, seeks to the middle, and then searches backwards to a newline character. This should be enough to get you going?
I have similar code (seek, read, repeat until done) in my own application, benchmarked java.io streams against MappedByteBuffer in a production environment, and posted the results on my blog (Geekomatic posts tagged 'java.nio') with raw data, graphs and all.
Two second summary? My MappedByteBuffer-based implementation was about 275% faster. YMMV.
To work for files larger than ~2 GB, which is a problem because of the cast and .position(int pos), I've crafted a paging algorithm backed by an array of MappedByteBuffers. You'll need to be working on a 64-bit system for this to work with files larger than 2-4 GB, because MBBs use the OS's virtual memory system to work their magic.
public class StusMagicLargeFileReader {
private static final long PAGE_SIZE = Integer.MAX_VALUE;
private List<MappedByteBuffer> buffers = new ArrayList<MappedByteBuffer>();
private final byte raw[] = new byte[1];
public static void main(String[] args) throws IOException {
File file = new File("/Users/stu/test.txt");
FileChannel fc = (new FileInputStream(file)).getChannel();
StusMagicLargeFileReader buffer = new StusMagicLargeFileReader(fc);
long position = file.length() / 2;
String candidate = buffer.getString(position--);
while (position >= 0 && !candidate.equals("\n"))
candidate = buffer.getString(position--);
//have newline position or start of file...do other stuff
}
StusMagicLargeFileReader(FileChannel channel) throws IOException {
long start = 0, length = 0;
for (long index = 0; start + length < channel.size(); index++) {
if ((channel.size() / PAGE_SIZE) == index)
length = (channel.size() - index * PAGE_SIZE) ;
else
length = PAGE_SIZE;
start = index * PAGE_SIZE;
buffers.add((int) index, channel.map(FileChannel.MapMode.READ_ONLY, start, length));
}
}
public String getString(long bytePosition) {
int page = (int) (bytePosition / PAGE_SIZE);
int index = (int) (bytePosition % PAGE_SIZE);
raw[0] = buffers.get(page).get(index);
return new String(raw);
}
}
I have the same problem. I am trying to find all lines that start with some prefix in a sorted file.
Here is a method I cooked up which is largely a port of Python code found here: http://www.logarithmic.net/pfh/blog/01186620415
I have tested it but not thoroughly just yet. It does not use memory mapping, though.
public static List<String> binarySearch(String filename, String string) {
List<String> result = new ArrayList<String>();
try {
File file = new File(filename);
RandomAccessFile raf = new RandomAccessFile(file, "r");
long low = 0;
long high = file.length();
long p = -1;
while (low < high) {
long mid = (low + high) / 2;
p = mid;
while (p >= 0) {
raf.seek(p);
char c = (char) raf.readByte();
//System.out.println(p + "\t" + c);
if (c == '\n')
break;
p--;
}
if (p < 0)
raf.seek(0);
String line = raf.readLine();
//System.out.println("-- " + mid + " " + line);
if (line.compareTo(string) < 0)
low = mid + 1;
else
high = mid;
}
p = low;
while (p >= 0) {
raf.seek(p);
if (((char) raf.readByte()) == '\n')
break;
p--;
}
if (p < 0)
raf.seek(0);
while (true) {
String line = raf.readLine();
if (line == null || !line.startsWith(string))
break;
result.add(line);
}
raf.close();
} catch (IOException e) {
System.out.println("IOException:");
e.printStackTrace();
}
return result;
}
I am not aware of any library that has that functionality. However, correct code for an external binary search in Java should be similar to this:
class ExternalBinarySearch {
final RandomAccessFile file;
final Comparator<String> test; // tests the element given as search parameter with the line. Insert a PrefixComparator here
public ExternalBinarySearch(File f, Comparator<String> test) throws FileNotFoundException {
this.file = new RandomAccessFile(f, "r");
this.test = test;
}
public String search(String element) throws IOException {
long l = file.length();
return search(element, -1, l-1);
}
/**
* Searches the given element in the range [low,high]. The low value of -1 is a special case to denote the beginning of a file.
* In contrast to every other line, a line at the beginning of a file doesn't need a \n directly before the line
*/
private String search(String element, long low, long high) throws IOException {
if(high - low < 1024) {
// search directly
long p = low;
while(p < high) {
String line = nextLine(p);
int r = test.compare(line,element);
if(r > 0) {
return null;
} else if (r < 0) {
p += line.length();
} else {
return line;
}
}
return null;
} else {
long m = low + ((high - low) / 2);
String line = nextLine(m);
int r = test.compare(line, element);
if(r > 0) {
return search(element, low, m);
} else if (r < 0) {
return search(element, m, high);
} else {
return line;
}
}
}
private String nextLine(long low) throws IOException {
if(low == -1) { // Beginning of file
file.seek(0);
} else {
file.seek(low);
}
int bufferLength = 65 * 1024;
byte[] buffer = new byte[bufferLength];
int r = file.read(buffer);
int lineBeginIndex = -1;
// search beginning of line
if(low == -1) { //beginning of file
lineBeginIndex = 0;
} else {
//normal mode
for(int i = 0; i < 1024; i++) {
if(buffer[i] == '\n') {
lineBeginIndex = i + 1;
break;
}
}
}
if(lineBeginIndex == -1) {
// no line begins within next 1024 bytes
return null;
}
int start = lineBeginIndex;
for(int i = start; i < r; i++) {
if(buffer[i] == '\n') {
// Found end of line
return new String(buffer, lineBeginIndex, i - lineBeginIndex + 1);
}
}
throw new IllegalArgumentException("Line too long");
}
}
Please note: I made up this code ad hoc. Corner cases are not tested nearly well enough, the code assumes that no single line is larger than 64K, etc.
I also think that building an index of the offsets where lines start might be a good idea. For a 500 GB file, that index should be stored in an index file. You should gain a not-so-small constant factor with that index, because then there is no need to search for the next line in each step.
I know that was not the question, but building a prefix-tree data structure like a (Patricia) trie (on disk/SSD) might be a good idea for the prefix search.
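As a sketch of that index idea (file names are hypothetical): one pass over the text file records the byte offset of every line start as raw longs in a companion file, so the binary search can seek straight to a line instead of scanning for newlines.
// Needs java.io.*. A trailing newline at EOF produces one harmless extra entry.
try (InputStream in = new BufferedInputStream(new FileInputStream("data.txt"));
     DataOutputStream idx = new DataOutputStream(
             new BufferedOutputStream(new FileOutputStream("data.txt.idx")))) {
    long pos = 0;
    idx.writeLong(0);             // the first line starts at offset 0
    int b;
    while ((b = in.read()) != -1) {
        pos++;
        if (b == '\n')
            idx.writeLong(pos);   // the next line starts right after this newline
    }
}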
This is a simple example of what you want to achieve. I would probably first index the file, keeping track of the file position for each string. I'm assuming the strings are separated by newlines (or carriage returns):
RandomAccessFile file = new RandomAccessFile("filename.txt", "r");
List<Long> indexList = new ArrayList<>();
long pos = 0;
while (file.readLine() != null)
{
Long linePos = new Long(pos);
indexList.add(linePos);
pos = file.getFilePointer();
}
int indexSize = indexList.size();
Long[] indexArray = new Long[indexSize];
indexList.toArray(indexArray);
The last step is to convert to an array for a slight speed improvement when doing lots of lookups. I would probably convert the Long[] to a long[] also, but I did not show that above. Finally the code to read the string from a given indexed position:
int i; // Initialize this appropriately for your algorithm.
file.seek(indexArray[i]);
String line = file.readLine();
// At this point, line contains the string #i.
If you are dealing with a 500GB file, then you might want to use a faster lookup method than binary search - namely a radix sort which is essentially a variant of hashing. The best method for doing this really depends on your data distributions and types of lookup, but if you are looking for string prefixes there should be a good way to do this.
I posted an example of a radix sort solution for integers, but you can use the same idea - basically to cut down the sort time by dividing the data into buckets, then using O(1) lookup to retrieve the bucket of data that is relevant.
Option Strict On
Option Explicit On
Module Module1
Private Const MAX_SIZE As Integer = 100000
Private m_input(MAX_SIZE) As Integer
Private m_table(MAX_SIZE) As List(Of Integer)
Private m_randomGen As New Random()
Private m_operations As Integer = 0
Private Sub generateData()
' fill with random numbers between 0 and MAX_SIZE - 1
For i = 0 To MAX_SIZE - 1
m_input(i) = m_randomGen.Next(0, MAX_SIZE - 1)
Next
End Sub
Private Sub sortData()
For i As Integer = 0 To MAX_SIZE - 1
Dim x = m_input(i)
If m_table(x) Is Nothing Then
m_table(x) = New List(Of Integer)
End If
m_table(x).Add(x)
' clearly this is simply going to be MAX_SIZE -1
m_operations = m_operations + 1
Next
End Sub
Private Sub printData(ByVal start As Integer, ByVal finish As Integer)
If start < 0 Or start > MAX_SIZE - 1 Then
Throw New Exception("printData - start out of range")
End If
If finish < 0 Or finish > MAX_SIZE - 1 Then
Throw New Exception("printData - finish out of range")
End If
For i As Integer = start To finish
If m_table(i) IsNot Nothing Then
For Each x In m_table(i)
Console.WriteLine(x)
Next
End If
Next
End Sub
' run the entire sort, but just print out the first 100 for verification purposes
Private Sub test()
m_operations = 0
generateData()
Console.WriteLine("Time started = " & Now.ToString())
sortData()
Console.WriteLine("Time finished = " & Now.ToString & " Number of operations = " & m_operations.ToString())
' print out a random 100 segment from the sorted array
Dim start As Integer = m_randomGen.Next(0, MAX_SIZE - 101)
printData(start, start + 100)
End Sub
Sub Main()
test()
Console.ReadLine()
End Sub
End Module
I posted a gist at https://gist.github.com/mikee805/c6c2e6a35032a3ab74f643a1d0f8249c
that is a rather complete example based on what I found on Stack Overflow and some blogs; hopefully someone else can use it.
import static java.nio.file.Files.isWritable;
import static java.nio.file.StandardOpenOption.READ;
import static org.apache.commons.io.FileUtils.forceMkdir;
import static org.apache.commons.io.IOUtils.closeQuietly;
import static org.apache.commons.lang3.StringUtils.isBlank;
import static org.apache.commons.lang3.StringUtils.trimToNull;
import java.io.File;
import java.io.IOException;
import java.nio.Buffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
public class FileUtils {
private FileUtils() {
}
private static boolean found(final String candidate, final String prefix) {
return isBlank(candidate) || candidate.startsWith(prefix);
}
private static boolean before(final String candidate, final String prefix) {
return prefix.compareTo(candidate.substring(0, prefix.length())) < 0;
}
public static MappedByteBuffer getMappedByteBuffer(final Path path) {
FileChannel fileChannel = null;
try {
fileChannel = FileChannel.open(path, READ);
return fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size()).load();
}
catch (Exception e) {
throw new RuntimeException(e);
}
finally {
closeQuietly(fileChannel);
}
}
public static String binarySearch(final String prefix, final MappedByteBuffer buffer) {
if (buffer == null) {
return null;
}
try {
long low = 0;
long high = buffer.limit();
while (low < high) {
int mid = (int) ((low + high) / 2);
final String candidate = getLine(mid, buffer);
if (found(candidate, prefix)) {
return trimToNull(candidate);
}
else if (before(candidate, prefix)) {
high = mid;
}
else {
low = mid + 1;
}
}
}
catch (Exception e) {
throw new RuntimeException(e);
}
return null;
}
private static String getLine(int position, final MappedByteBuffer buffer) {
// search backwards to find the preceding newline
// then search forwards again until the next new line
// return the string in between
final StringBuilder stringBuilder = new StringBuilder();
// walk it back
char candidate = (char)buffer.get(position);
while (position > 0 && candidate != '\n') {
candidate = (char)buffer.get(--position);
}
// we either are at the beginning of the file or a new line
if (position == 0) {
// we are at the beginning at the first char
candidate = (char)buffer.get(position);
stringBuilder.append(candidate);
}
// there is/are char(s) after new line / first char
if (isInBuffer(buffer, position)) {
//first char after new line
candidate = (char)buffer.get(++position);
stringBuilder.append(candidate);
//walk it forward
while (isInBuffer(buffer, position) && candidate != ('\n')) {
candidate = (char)buffer.get(++position);
stringBuilder.append(candidate);
}
}
return stringBuilder.toString();
}
private static boolean isInBuffer(final Buffer buffer, int position) {
return position + 1 < buffer.limit();
}
public static File getOrCreateDirectory(final String dirName) {
final File directory = new File(dirName);
try {
forceMkdir(directory);
isWritable(directory.toPath());
}
catch (IOException e) {
throw new RuntimeException(e);
}
return directory;
}
}
I had a similar problem, so I created a (Scala) library from the solutions provided in this thread:
https://github.com/avast/BigMap
It contains a utility for sorting a huge file and doing binary search in this sorted file...
If you truly want to try memory mapping the file, I found a tutorial on how to use memory mapping in Java nio.