How to compare images for similarity using Java

Recently I got an opportunity to work with image processing technologies as part of one of my projects, and my task was to find matching images in an image store when a new image is given. I started by googling "How to compare images using Java" and found some good articles on measuring the similarity of two images. Almost all of them were based on four basic steps:
1. Locating the region of interest (where the objects appear in the given image),
2. Resizing the ROIs to a common size,
3. Subtracting the ROIs,
4. Calculating the black-and-white ratio of the resultant image after subtraction.
Though this sounds like a good algorithm for comparing images, after implementing it with JAI in my project it takes a considerable amount of time, so I have to find an alternative way of doing it.
Any suggestions?
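For context, here is a minimal, illustrative sketch of steps 3 and 4 of that approach (subtracting two ROIs and computing the ratio of "matching" pixels), assuming both ROIs have already been cropped and resized to the same dimensions; the class name and threshold are my own, not from the articles:

import java.awt.image.BufferedImage;

public class RoiDiff {
    // Fraction of pixels whose summed per-channel difference is below a threshold.
    public static double matchRatio(BufferedImage roiA, BufferedImage roiB, int threshold) {
        int width = roiA.getWidth(), height = roiA.getHeight();
        long matching = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int a = roiA.getRGB(x, y), b = roiB.getRGB(x, y);
                // per-channel absolute difference of the "subtracted" pixel
                int dr = Math.abs(((a >> 16) & 0xff) - ((b >> 16) & 0xff));
                int dg = Math.abs(((a >> 8) & 0xff) - ((b >> 8) & 0xff));
                int db = Math.abs((a & 0xff) - (b & 0xff));
                // treat small differences as "black" (i.e. matching) pixels
                if (dr + dg + db < threshold) {
                    matching++;
                }
            }
        }
        return (double) matching / ((long) width * height);
    }
}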

// This method compares two image files pixel by pixel;
// it returns true if both image files are equal, false otherwise.
// Required imports: java.io.File, java.awt.image.BufferedImage,
// java.awt.image.DataBuffer, javax.imageio.ImageIO
public static boolean compareImage(File fileA, File fileB) {
    try {
        // read the raster data of both image files
        BufferedImage biA = ImageIO.read(fileA);
        DataBuffer dbA = biA.getData().getDataBuffer();
        int sizeA = dbA.getSize();
        BufferedImage biB = ImageIO.read(fileB);
        DataBuffer dbB = biB.getData().getDataBuffer();
        int sizeB = dbB.getSize();
        // compare the data-buffer contents element by element
        if (sizeA == sizeB) {
            for (int i = 0; i < sizeA; i++) {
                if (dbA.getElem(i) != dbB.getElem(i)) {
                    return false;
                }
            }
            return true;
        } else {
            return false;
        }
    } catch (Exception e) {
        System.out.println("Failed to compare image files ...");
        return false;
    }
}

Depending on how different the images are, you could do something like this (pseudocode). It's very primitive, but should be pretty efficient. You could speed it up by taking random or patterned pixels instead of every one.
for x = 0 to image.width:
    for y = 0 to image.height:
        diff += abs(image1.get(x,y).red   - image2.get(x,y).red)
        diff += abs(image1.get(x,y).green - image2.get(x,y).green)
        diff += abs(image1.get(x,y).blue  - image2.get(x,y).blue)
    end
end
return diff / (image.width * image.height * 3)
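For reference, a minimal runnable Java version of that pseudocode, assuming both images have identical dimensions (it returns the mean per-channel difference: 0 for identical images, up to 255 for completely different ones):

// requires java.awt.image.BufferedImage
public static double averagePixelDifference(BufferedImage image1, BufferedImage image2) {
    int width = image1.getWidth(), height = image1.getHeight();
    long diff = 0;
    for (int x = 0; x < width; x++) {
        for (int y = 0; y < height; y++) {
            int rgb1 = image1.getRGB(x, y);
            int rgb2 = image2.getRGB(x, y);
            diff += Math.abs(((rgb1 >> 16) & 0xff) - ((rgb2 >> 16) & 0xff)); // red
            diff += Math.abs(((rgb1 >> 8) & 0xff) - ((rgb2 >> 8) & 0xff));   // green
            diff += Math.abs((rgb1 & 0xff) - (rgb2 & 0xff));                 // blue
        }
    }
    return (double) diff / ((long) width * height * 3);
}

As the answer notes, sampling only every n-th pixel instead of all of them trades accuracy for speed.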

This method compares two image files and returns the percentage of similarity:
public float compareImage(File fileA, File fileB) {
    float percentage = 0;
    try {
        // read the raster data of both image files
        BufferedImage biA = ImageIO.read(fileA);
        DataBuffer dbA = biA.getData().getDataBuffer();
        int sizeA = dbA.getSize();
        BufferedImage biB = ImageIO.read(fileB);
        DataBuffer dbB = biB.getData().getDataBuffer();
        int sizeB = dbB.getSize();
        int count = 0;
        // compare the data-buffer contents element by element
        if (sizeA == sizeB) {
            for (int i = 0; i < sizeA; i++) {
                if (dbA.getElem(i) == dbB.getElem(i)) {
                    count = count + 1;
                }
            }
            // use a float literal to avoid integer division truncating the result
            percentage = (count * 100.0f) / sizeA;
        } else {
            System.out.println("Both images are not of the same size");
        }
    } catch (Exception e) {
        System.out.println("Failed to compare image files ...");
    }
    return percentage;
}

Related

How to load all images from a directory and read them with OpenCV's imread?

I need some help. I created a function that reads a single image. It works, but I want to loop over all the images in a directory and use the imread method to get their pixel values. How can I do this? My code follows below.
public void cor() {
    String src = ("path_to_folder");
    Mat imgread;
    imgread = Imgcodecs.imread(src, IMREAD_COLOR);
    Mat rgbimage = null; // for conversion bgr2rgb
    int lin = imgread.rows(); // get the number of rows
    int col = imgread.cols(); // get the number of cols
    if (imgread.empty()) {
        Log.e("error", "is empty!");
    } else {
        rgbimage = new Mat(imgread.size(), imgread.type());
        Imgproc.cvtColor(imgread, rgbimage, Imgproc.COLOR_BGR2RGB);
    }
    for (int i = 0; i < lin; i++) {
        for (int j = 0; j < col; j++) {
            double[] rgb = rgbimage.get(i, j);
            pixels.add(rgb); // put data in arraylist
        }
    }
}
Using File you can get a list of all files in a directory. Then, you can loop through the list to get the absolute path of each file and do whatever you want with it.
public void cor() {
    File rootDir = new File("your/path/to/root_directory");
    File[] files = rootDir.listFiles();
    for (File file : files) {
        String src = file.getAbsolutePath();
        Mat imgread;
        imgread = Imgcodecs.imread(src, IMREAD_COLOR);
        /*
         * Do the other stuff in your method.
         */
    }
}
Note: I was not 100% sure what you were doing with pixels, so I just wrote what you need to loop through a directory.
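If the directory can also contain non-image files, a hedged variant that filters by file extension before calling imread might look like this (the extension list is just an example):

File rootDir = new File("your/path/to/root_directory");
// keep only files whose name ends with a known image extension
File[] imageFiles = rootDir.listFiles((dir, name) -> {
    String lower = name.toLowerCase();
    return lower.endsWith(".png") || lower.endsWith(".jpg") || lower.endsWith(".jpeg");
});
if (imageFiles != null) { // listFiles returns null if the path is not a directory
    for (File file : imageFiles) {
        Mat imgread = Imgcodecs.imread(file.getAbsolutePath(), Imgcodecs.IMREAD_COLOR);
        // ... process imgread as in the method above
    }
}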

Error with some Image using Raster.getPixel

I ran into a problem yesterday using the BufferedImage library. I get a
java.lang.ArrayIndexOutOfBoundsException: 3
but only for PNG pictures I get from the net; if I make my own in Paint it all works. I have tried looking up the problem but can't see where I'm wrong.
package grayandconvert;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.File;
import java.io.IOException;
import java.awt.image.BufferedImage;
import javax.imageio.ImageIO;
public class JavaCodeProject { // remain if needed Kim
    private final String PATH = "C:\\New folder\\";
    private final String graypath = PATH + "oZPX0bbg.png"; // filename for the grayscale pic
    private final String imgpath = PATH + "oZPX0bb.png";   // filename for the original pic
    private final String textpath = PATH + "filename.txt"; // filename for the output text file
    private final String imgtype = "png";                  // image file type for grayscale: "png", "jpg"

    public static void main(String[] args)
    {
        JavaCodeProject main = new JavaCodeProject(); // new instance to call the methods on
        main.grayscale();
        main.convert();
    }

    public void convert()
    {
        try
        {
            BufferedImage image = ImageIO.read(new File(graypath)); // read the gray pic into image
            int[] pixel; // int array named pixel
            System.out.print(image.getHeight());
            System.out.print(image.getWidth());
            for (int y = 0; y < image.getHeight(); y++) // outer for loop walks the Y axis
            {
                for (int x = 0; x < image.getWidth(); x++) // inner for loop walks the X axis
                {
                    pixel = image.getRaster().getPixel(x, y, new int[3]); // gets the RGB data from the raster
                    if (pixel[0] < 255 && pixel[1] < 255 && pixel[2] < 255)
                    {
                        System.out.print(" Y");
                        writefile("Y");
                    }
                    else
                    {
                        System.out.print(" N");
                        writefile("N");
                    }
                }
                System.out.print(" L");
                System.out.println("");
                writefile("L");
            }
            System.out.print("S");
            writefile("S");
        }
        catch (IOException e) // never used it, but it needs to be here
        {
        }
    }

    public void writefile(String value)
    {
        String array = value; // named it array, I know, right :P
        File file = new File(textpath); // path for the new file.txt
        try
        {
            if (!file.exists()) // if the file doesn't exist, this will create it
            {
                file.createNewFile();
            }
            FileWriter fw = new FileWriter(file.getAbsoluteFile(), true);
            try (BufferedWriter bw = new BufferedWriter(fw)) {
                bw.write(array, 0, array.length());
            }
        }
        catch (IOException e) // if an IOException happens, it is silently ignored
        {
        }
    }

    public void grayscale()
    {
        BufferedImage img = null;
        try
        {
            File f = new File(imgpath); // original pic
            img = ImageIO.read(f);
        }
        catch (IOException e)
        {
            System.out.println(e);
        }
        for (int y = 0; y < img.getHeight(); y++)
        {
            for (int x = 0; x < img.getWidth(); x++)
            {
                int p = img.getRGB(x, y);
                int a = (p >> 24) & 0xff;
                int r = (p >> 16) & 0xff;
                int g = (p >> 8) & 0xff;
                int b = p & 0xff;
                // calculate the average
                int avg = (r + g + b) / 3;
                // replace the RGB value with avg
                p = (a << 24) | (avg << 16) | (avg << 8) | avg;
                img.setRGB(x, y, p);
            }
        }
        try
        {
            File f = new File(graypath); // gray pic
            ImageIO.write(img, imgtype, f);
        }
        catch (IOException e)
        {
            System.out.println(e);
        }
    }
}
I get the error
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 3
at java.awt.image.ComponentSampleModel.getPixel(ComponentSampleModel.java:750)
at java.awt.image.Raster.getPixel(Raster.java:1519)
at grayandconvert.JavaCodeProject.convert(JavaCodeProject.java:41)
at grayandconvert.JavaCodeProject.main(JavaCodeProject.java:23)
128128
C:\Users\clipcomet\Desktop\JavaApplication10\nbproject\build-impl.xml:1051: The following error occurred while executing this line:
C:\Users\clipcomet\Desktop\JavaApplication10\nbproject\build-impl.xml:805: Java returned: 1
BUILD FAILED (total time: 1 second)
I just started programming a while ago, and I know I'm using a library I don't fully understand, but I needed to use BufferedImage. Ignoring all the bad code I have, can someone tell me why I only get that error on some pics?
The reason you get the exception in some cases, is that Raster.getPixel(x, y, pixel) tries to copy all the samples for the pixel at x, y into the pixel array. And you have no control over how many samples per pixel your raster has, if you download random pictures from the net, yet you hardcode the pixel array to 3 elements.
From the API doc (emphasis mine):
ArrayIndexOutOfBoundsException - if the coordinates are not in bounds, or if iArray is too small to hold the output.
Most likely, the images where you get the exception have 4 components and are RGBA (while the ones from Paint have 3 components, RGB). You will probably get rid of the exception by creating a larger array (i.e. new int[4]).
However, the best way to fix the problem is to not create the array at all yourself, and instead leave that to the getPixel method, like this:
int[] pixel = null;
for (y...) {
    for (x...) {
        pixel = raster.getPixel(x, y, pixel);
        ...
    }
}
This also ensures that the allocation happens only once, which is obviously good for performance.
That said, you still need to handle the fact that a random image may not have the expected number of samples per pixel. If your input is gray or uses a color map (IndexColorModel), it will only have one sample (and you'll get an ArrayIndexOutOfBoundsException for your pixel[1] and pixel[2] array accesses). And in the color map case, the sample values are unrelated to the RGB value you see on screen (they are only indices into a lookup table).
For these reasons, you may find it easier and more intuitive to just use the BufferedImage.getRGB(x, y) method, which always gives you the ARGB values of the pixel as a single packed int sample, in sRGB color space.
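A minimal sketch of that getRGB approach (the unpacking shifts are standard; what you do with the channels afterwards is up to you):

// getRGB always returns a packed ARGB int in the sRGB color space,
// regardless of the image's raster layout or color model
int argb = image.getRGB(x, y);
int alpha = (argb >> 24) & 0xff;
int red   = (argb >> 16) & 0xff;
int green = (argb >> 8) & 0xff;
int blue  = argb & 0xff;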

What is more effective, a huge condition, or a huge array?

In my app I have two Fragments, each of which has a three-dimensional array that stores 2160 values: 720 floats and 1440 integers.
I have two options:
1 - Continue with this huge three-dimensional array.
2 - Replace it with a huge condition.
My concern is about the application's performance on the user's mobile phone. Which would consume less time? Is the memory that this array would use high enough to affect fragment loading?
NOTES
In my app, all these values are constants.
The user answers a series of questions and, at the end, some of the values are displayed.
There are three questions:
One has 6 options
One has 15 options
One has 8 options
But I have 3 series of these questions; in one of them I display float values, and in the other two I display integer values.
Here is my code. RESULTS_ARRAY[][][] is the big three-dimensional array; I have only included part of the code, the part dealing with the 720 float values.
switch (rewardSelected) {
    case 0:
        int count = 0;
        while (count < 6) {
            if (typePack == count) {
                int count2 = 0;
                while (count2 < 15) {
                    if (spinnerSelected == count2) {
                        int count3 = 0;
                        while (count3 < 8) {
                            float percent = Float.parseFloat(editTextPercent.getText().toString());
                            float withPercentApplied = (RESULTS_ARRAY[count][count3][count2] * percent) / 130;
                            if (checkFivePercent.isChecked()) {
                                float resultFinal = ((withPercentApplied * 5) / 100) + withPercentApplied;
                                textViewResults.get(count3).setText(String.valueOf(resultFinal));
                            } else {
                                textViewResults.get(count3).setText(String.valueOf(withPercentApplied));
                            }
                            count3++;
                        }
                    }
                    count2++;
                }
            }
            count++;
        }
        break;
    case 1:
        break;
    case 2:
        break;
}
For completeness' sake, this is all you would need:
package so;
import java.io.FileNotFoundException;
import java.io.FileReader;
import com.google.gson.*;
public class SOCLass {
    JsonElement job;

    SOCLass() {
        try {
            job = new JsonParser().parse(new FileReader("JSON FILE"));
        } catch (JsonIOException | JsonSyntaxException | FileNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    boolean isValuePresent(final String v) {
        return job.getAsJsonObject().get(v) != null;
    }

    public static void main(String[] args) {
        SOCLass so = new SOCLass();
        System.out.println("Is present? " + so.isValuePresent("NO"));
    }
}
EDIT
Array contents in a JSON file:
{
    "key" : null,
    "key1" : null
}
If you only care about the constant values/tags, then you could associate a real value with its key, as sketched below.
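A hedged sketch of that idea, assuming the constants are moved into a JSON file keyed by the option combination; the file name, key scheme, and class are illustrative, not part of the original code:

// results.json (illustrative): { "0-3-7": 12.5, "0-3-6": 11.0, ... }
import java.io.FileReader;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

public class ResultsLookup {
    private final JsonObject results;

    public ResultsLookup(String path) throws Exception {
        // parse the whole constants file once
        results = new JsonParser().parse(new FileReader(path)).getAsJsonObject();
    }

    // Look up a constant directly instead of looping over every option.
    public float get(int typePack, int spinnerSelected, int row) {
        String key = typePack + "-" + spinnerSelected + "-" + row;
        return results.get(key).getAsFloat();
    }
}

With something like this, the nested while loops in the question collapse to a single lookup per displayed value.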

Is it possible to make this java code far more efficient?

The code below is taking 27 seconds to load 51 images, each about 22 KB (timed between the starting and ending alerts I inserted). Is it possible to make it a lot more efficient (I would like to get it under 3 seconds)?
At first I thought it was the database so I put the alerts in to make sure and found it was this code.
Regards,
Glyn
public void renderYMAwardsTable(List<YouthMemberAwards> ymAwardsList) {
    if (!ymAwardsList.isEmpty()) {
        flexTableLink.clear();
        int linkRow = 0;
        int linkCol = 0;
        flexTableLeadership.clear();
        int leadershipRow = 0;
        int leadershipCol = 0;
        flexTableBoomerang.clear();
        int boomerangRow = 0;
        int boomerangCol = 0;
        flexTableAchievement.clear();
        int achievementRow = 0;
        int achievementCol = 0;
        flexTableSpecialInterest.clear();
        int specialInterestRow = 0;
        int specialInterestCol = 0;
        totalAwards = 0;
        Window.alert("Start populating page.");
        for (final YouthMemberAwards ymAwards : ymAwardsList) {
            // Display awards and, if applicable, date awarded
            if (((ymAwards.getCaAwardedDate() != null)
                    || (ymAwards.getAwArchivedDate() == null)
                    || (ymAwards.getCaAwardStarted().equals("Y"))
                    ) && (ymAwards.getAwStartedDate().before(ymEndDate))) {
                // Display each award in the correct area with:
                //   the date awarded, if applicable, and
                //   the date box shaded if the award has been started but not awarded
                String imageDataString = ymAwards.getAwAwardPicture();
                Image image = new Image(imageDataString);
                image.setWidth("75px");
                image.setHeight("75px");
                image.setStyleName("gwt-Selectable");
                final DateBox awardedDate = new DateBox();
                awardedDate.setFormat(new DefaultFormat(DateTimeFormat.getFormat("dd/MM/yyyy")));
                awardedDate.setValue(ymAwards.getCaAwardedDate());
                awardedDate.setWidth("75px");
                awardedDate.setFireNullValues(true);
                // Check if the Youth Member has started the Award;
                // if they have, colour the date box green.
                if (ymAwards.getCaAwardStarted() != null) {
                    if ((ymAwards.getCaAwardedDate() == null)
                            && (ymAwards.getCaAwardStarted().equals("Y"))) {
                        awardedDate.setStyleName("gwt-Green-Background");
                    } else {
                        awardedDate.setStyleName("gwt-Label-Login");
                    }
                } else {
                    awardedDate.setStyleName("gwt-Label-Login");
                }
                // Tally the number of Awards the Youth Member has been awarded.
                if (ymAwards.getCaAwardedDate() != null) {
                    totalAwards = totalAwards + 1;
                }
                // Display each Award in the appropriate category.
                if (ymAwards.getAwAwardType().equals("Link")) {
                    flexTableLink.setWidget(linkRow, linkCol, image);
                    flexTableLink.setWidget(linkRow + 1, linkCol, awardedDate);
                    if (linkCol < 10) {
                        linkCol = linkCol + 1;
                    } else {
                        linkCol = 0;
                        linkRow = linkRow + 2;
                    }
                } else if (ymAwards.getAwAwardType().equals("Leadership")) {
                    flexTableLeadership.setWidget(leadershipRow, leadershipCol, image);
                    flexTableLeadership.setWidget(leadershipRow + 1, leadershipCol, awardedDate);
                    if (leadershipCol < 10) {
                        leadershipCol = leadershipCol + 1;
                    } else {
                        leadershipCol = 0;
                        leadershipRow = leadershipRow + 2;
                    }
                } else if (ymAwards.getAwAwardType().equals("Boomerang")) {
                    flexTableBoomerang.setWidget(boomerangRow, boomerangCol, image);
                    flexTableBoomerang.setWidget(boomerangRow + 1, boomerangCol, awardedDate);
                    if (boomerangCol < 10) {
                        boomerangCol = boomerangCol + 1;
                    } else {
                        boomerangCol = 0;
                        boomerangRow = boomerangRow + 2;
                    }
                } else if (ymAwards.getAwAwardType().equals("Achievement")) {
                    flexTableAchievement.setWidget(achievementRow, achievementCol, image);
                    flexTableAchievement.setWidget(achievementRow + 1, achievementCol, awardedDate);
                    if (achievementCol < 10) {
                        achievementCol = achievementCol + 1;
                    } else {
                        achievementCol = 0;
                        achievementRow = achievementRow + 2;
                    }
                } else if (ymAwards.getAwAwardType().equals("Special Interest")) {
                    flexTableSpecialInterest.setWidget(specialInterestRow, specialInterestCol, image);
                    flexTableSpecialInterest.setWidget(specialInterestRow + 1, specialInterestCol, awardedDate);
                    if (specialInterestCol < 10) {
                        specialInterestCol = specialInterestCol + 1;
                    } else {
                        specialInterestCol = 0;
                        specialInterestRow = specialInterestRow + 2;
                    }
                } else {
                    // If not found then default to Special Interest.
                    flexTableSpecialInterest.setWidget(specialInterestRow, specialInterestCol, image);
                    flexTableSpecialInterest.setWidget(specialInterestRow + 1, specialInterestCol, awardedDate);
                    if (specialInterestCol < 10) {
                        specialInterestCol = specialInterestCol + 1;
                    } else {
                        specialInterestCol = 0;
                        specialInterestRow = specialInterestRow + 2;
                    }
                }
                // Add a click handler to the image
                image.addClickHandler(new ClickHandler() {
                    public void onClick(ClickEvent event) {
                        // Store the data from this view for use in subsequent Views (ScoutAwardView).
                        AsyncCallback<ViewData> callback = new ViewDataStoreHandler<ViewData>();
                        rpc.setViewData(accountId, accountLevel, youthMemberID, ymAwards.getAwId(), "0", callback);
                        // If the Award has sub groups then display the Groups and allow one to
                        // be selected to display the details. Otherwise, display the details.
                        if (ymAwards.getAwGrouped().equals("Y")) {
                            // Go to the AwardGroupView
                            navHandler2.go("AwardGroup");
                        } else {
                            // Go to the ScoutAwardView
                            navHandler2.go("ScoutAward");
                        }
                    }
                });
                // Add a change handler for the awarded date.
                // Only a Leader or Administrator can update the date.
                if (accountLevel.equals("Leader") || accountLevel.equals("Administrator")) {
                    awardedDate.addValueChangeHandler(new ValueChangeHandler<java.util.Date>() {
                        public void onValueChange(ValueChangeEvent<java.util.Date> event) {
                            // Check for a null date and handle it for dateBoxArchived and dateBoxPackOut
                            java.sql.Date sqlDateAwarded = awardedDate.getValue() == null ? null : new java.sql.Date(awardedDate.getValue().getTime());
                            AsyncCallback<Void> callback = new YMAwardedDateHandler<Void>();
                            rpc.updateYMAwarded(youthMemberID, ymAwards.getAwId(), sqlDateAwarded, callback);
                            @SuppressWarnings("unused")
                            AdjustAwardStock adjustAwardStock = new AdjustAwardStock(sqlDateAwarded, ymAwards.getAwId());
                        }
                    });
                }
            }
        }
    }
    Window.alert("Finish populating page.");
    // Hide the "Loading please wait" popup.
    popup.hide();
    // Display the number of Awards earned.
    String totalAwardsString = Integer.toString(totalAwards);
    textBoxTotalAwards.setText(totalAwardsString);
}
First of all, an image of 75x75 pixels should not be 22 kB. Even a PNG-24 image of this size is about 3 kB, let alone GIF. Store your images in the correct sizes and the appropriate file format (use PNG or GIF). For 51 images that means the difference between 1 MB and 150 kB. This is the first big improvement.
Second, if you use a limited number of images, combine them into a single sprite. This will reduce the number of round-trip server calls from 51 (in your example) to 1. That's another huge improvement.
You don't need to make this sprite manually (unless you want to). You can use GWT's ImageResource ClientBundle. Note that this sprite can be cached, so a browser won't need to load it every time a user visits this page.
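A minimal sketch of what such a ClientBundle might look like (the interface name, file paths, and method names are illustrative):

import com.google.gwt.core.client.GWT;
import com.google.gwt.resources.client.ClientBundle;
import com.google.gwt.resources.client.ImageResource;

public interface AwardImages extends ClientBundle {
    AwardImages INSTANCE = GWT.create(AwardImages.class);

    @Source("images/link_award.png")
    ImageResource linkAward();

    @Source("images/leadership_award.png")
    ImageResource leadershipAward();
}

// Usage: GWT bundles the @Source images into a sprite at compile time, e.g.
// Image image = new Image(AwardImages.INSTANCE.linkAward());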
The other suggestions (like using switch statements) are good for code readability and maintenance, but they won't give you a significant performance boost because your Java code is compiled into JavaScript, and the compiler is pretty smart.
Besides combining the images and other JS files, you can always use the gzip functionality supported by browsers, so that the files are compressed on the server and decompressed by the browser. It will save a lot of network time, and the transferred data size will be much smaller.

Binary search in a sorted (memory-mapped ?) file in Java

I am struggling to port a Perl program to Java, and learning Java as I go. A central component of the original program is a Perl module that does string prefix lookups in a +500 GB sorted text file using binary search
(essentially, "seek" to a byte offset in the middle of the file, backtrack to nearest newline, compare line prefix with the search string, "seek" to half/double that byte offset, repeat until found...)
I have experimented with several database solutions but found that nothing beats this in sheer lookup speed with data sets of this size. Do you know of any existing Java library that implements such functionality? Failing that, could you point me to some idiomatic example code that does random access reads in text files?
Alternatively, I am not familiar with the new (?) Java I/O libraries but would it be an option to memory-map the 500 GB text file (I'm on a 64-bit machine with memory to spare) and do binary search on the memory-mapped byte array? I would be very interested to hear any experiences you have to share about this and similar problems.
I am a big fan of Java's MappedByteBuffers for situations like this. It is blazing fast. Below is a snippet I put together for you that maps a buffer to the file, seeks to the middle, and then searches backwards to a newline character. This should be enough to get you going?
I have similar code (seek, read, repeat until done) in my own application, benchmarked
java.io streams against MappedByteBuffer in a production environment and posted the results on my blog (Geekomatic posts tagged 'java.nio' ) with raw data, graphs and all.
Two second summary? My MappedByteBuffer-based implementation was about 275% faster. YMMV.
To work with files larger than ~2 GB, which is a problem because of the cast and .position(int pos), I've crafted a paging algorithm backed by an array of MappedByteBuffers. You'll need to be working on a 64-bit system for this to work with files larger than 2-4 GB, because MBBs use the OS's virtual memory system to work their magic.
public class StusMagicLargeFileReader {
    private static final long PAGE_SIZE = Integer.MAX_VALUE;
    private List<MappedByteBuffer> buffers = new ArrayList<MappedByteBuffer>();
    private final byte raw[] = new byte[1];

    public static void main(String[] args) throws IOException {
        File file = new File("/Users/stu/test.txt");
        FileChannel fc = (new FileInputStream(file)).getChannel();
        StusMagicLargeFileReader buffer = new StusMagicLargeFileReader(fc);
        long position = file.length() / 2;
        String candidate = buffer.getString(position--);
        // compare against the one-character string "\n", not the char literal '\n'
        while (position >= 0 && !candidate.equals("\n"))
            candidate = buffer.getString(position--);
        // have newline position or start of file...do other stuff
    }

    StusMagicLargeFileReader(FileChannel channel) throws IOException {
        long start = 0, length = 0;
        for (long index = 0; start + length < channel.size(); index++) {
            if ((channel.size() / PAGE_SIZE) == index)
                length = (channel.size() - index * PAGE_SIZE);
            else
                length = PAGE_SIZE;
            start = index * PAGE_SIZE;
            // List.add takes an int index, so the loop counter must be cast
            buffers.add((int) index, channel.map(READ_ONLY, start, length));
        }
    }

    public String getString(long bytePosition) {
        int page = (int) (bytePosition / PAGE_SIZE);
        int index = (int) (bytePosition % PAGE_SIZE);
        raw[0] = buffers.get(page).get(index);
        return new String(raw);
    }
}
I have the same problem. I am trying to find all lines that start with some prefix in a sorted file.
Here is a method I cooked up which is largely a port of Python code found here: http://www.logarithmic.net/pfh/blog/01186620415
I have tested it but not thoroughly just yet. It does not use memory mapping, though.
public static List<String> binarySearch(String filename, String string) {
    List<String> result = new ArrayList<String>();
    try {
        File file = new File(filename);
        RandomAccessFile raf = new RandomAccessFile(file, "r");
        long low = 0;
        long high = file.length();
        long p = -1;
        while (low < high) {
            long mid = (low + high) / 2;
            p = mid;
            while (p >= 0) {
                raf.seek(p);
                char c = (char) raf.readByte();
                //System.out.println(p + "\t" + c);
                if (c == '\n')
                    break;
                p--;
            }
            if (p < 0)
                raf.seek(0);
            String line = raf.readLine();
            //System.out.println("-- " + mid + " " + line);
            if (line.compareTo(string) < 0)
                low = mid + 1;
            else
                high = mid;
        }
        p = low;
        while (p >= 0) {
            raf.seek(p);
            if (((char) raf.readByte()) == '\n')
                break;
            p--;
        }
        if (p < 0)
            raf.seek(0);
        while (true) {
            String line = raf.readLine();
            if (line == null || !line.startsWith(string))
                break;
            result.add(line);
        }
        raf.close();
    } catch (IOException e) {
        System.out.println("IOException:");
        e.printStackTrace();
    }
    return result;
}
I am not aware of any library that has that functionality. However, correct code for an external binary search in Java should look similar to this:
class ExternalBinarySearch {
    final RandomAccessFile file;
    final Comparator<String> test; // tests the element given as search parameter against the line. Insert a PrefixComparator here.

    public ExternalBinarySearch(File f, Comparator<String> test) throws FileNotFoundException {
        this.file = new RandomAccessFile(f, "r");
        this.test = test;
    }

    public String search(String element) throws IOException {
        long l = file.length();
        return search(element, -1, l - 1);
    }

    /**
     * Searches the given element in the range [low, high]. The low value of -1 is a special case to denote the beginning of a file.
     * In contrast to every other line, a line at the beginning of a file doesn't need a \n directly before the line.
     */
    private String search(String element, long low, long high) throws IOException {
        if (high - low < 1024) {
            // search directly
            long p = low;
            while (p < high) {
                String line = nextLine(p);
                int r = test.compare(line, element);
                if (r > 0) {
                    return null;
                } else if (r < 0) {
                    p += line.length();
                } else {
                    return line;
                }
            }
            return null;
        } else {
            long m = low + ((high - low) / 2);
            String line = nextLine(m);
            int r = test.compare(line, element);
            if (r > 0) {
                return search(element, low, m);
            } else if (r < 0) {
                return search(element, m, high);
            } else {
                return line;
            }
        }
    }

    private String nextLine(long low) throws IOException {
        if (low == -1) { // beginning of file
            file.seek(0);
        } else {
            file.seek(low);
        }
        int bufferLength = 65 * 1024;
        byte[] buffer = new byte[bufferLength];
        int r = file.read(buffer);
        int lineBeginIndex = -1;
        // search for the beginning of the line
        if (low == -1) { // beginning of file
            lineBeginIndex = 0;
        } else {
            // normal mode
            for (int i = 0; i < 1024; i++) {
                if (buffer[i] == '\n') {
                    lineBeginIndex = i + 1;
                    break;
                }
            }
        }
        if (lineBeginIndex == -1) {
            // no line begins within the next 1024 bytes
            return null;
        }
        int start = lineBeginIndex;
        for (int i = start; i < r; i++) {
            if (buffer[i] == '\n') {
                // found the end of the line
                return new String(buffer, lineBeginIndex, i - lineBeginIndex + 1);
            }
        }
        throw new IllegalArgumentException("Line too long");
    }
}
Please note: I made up this code ad hoc: corner cases are not tested nearly well enough, and the code assumes that no single line is larger than 64K, etc.
I also think that building an index of the offsets where lines start might be a good idea. For a 500 GB file, that index should be stored in an index file. You should gain a not-so-small constant factor with that index, because then there is no need to search for the next line in each step. A sketch of such an index follows.
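A hedged sketch of building such a line-offset index once and persisting it to a companion file (the layout, one 8-byte offset per line, is my own choice); searches can then seek straight to line i instead of hunting for newlines:

import java.io.*;

public class LineOffsetIndex {
    // Scan the sorted text file once and write each line's byte offset to an index file.
    public static void build(File textFile, File indexFile) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(textFile, "r");
             DataOutputStream out = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream(indexFile)))) {
            long offset = 0;
            while (raf.readLine() != null) {
                out.writeLong(offset);          // offset of the line just read
                offset = raf.getFilePointer();  // start of the next line
            }
        }
    }

    // Read the i-th line by looking up its offset in the index file.
    public static String readLine(File textFile, File indexFile, long i) throws IOException {
        try (RandomAccessFile idx = new RandomAccessFile(indexFile, "r");
             RandomAccessFile raf = new RandomAccessFile(textFile, "r")) {
            idx.seek(i * 8);                    // each offset is an 8-byte long
            raf.seek(idx.readLong());
            return raf.readLine();
        }
    }
}

Building the index is a single linear pass over the file, so even for a 500 GB file it is a one-time cost.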
I know that was not the question, but building a prefix tree data structure like (Patricia) tries (on disk/SSD) might be a good idea for the prefix search.
This is a simple example of what you want to achieve. I would probably first index the file, keeping track of the file position for each string. I'm assuming the strings are separated by newlines (or carriage returns):
RandomAccessFile file = new RandomAccessFile("filename.txt", "r");
List<Long> indexList = new ArrayList<Long>();
long pos = 0;
while (file.readLine() != null)
{
    Long linePos = new Long(pos);
    indexList.add(linePos);
    pos = file.getFilePointer();
}
int indexSize = indexList.size();
Long[] indexArray = new Long[indexSize];
indexList.toArray(indexArray);
The last step is to convert to an array for a slight speed improvement when doing lots of lookups. I would probably convert the Long[] to a long[] also, but I did not show that above. Finally the code to read the string from a given indexed position:
int i; // Initialize this appropriately for your algorithm.
file.seek(indexArray[i]);
String line = file.readLine();
// At this point, line contains the string #i.
If you are dealing with a 500GB file, then you might want to use a faster lookup method than binary search - namely a radix sort which is essentially a variant of hashing. The best method for doing this really depends on your data distributions and types of lookup, but if you are looking for string prefixes there should be a good way to do this.
I posted an example of a radix sort solution for integers, but you can use the same idea - basically to cut down the sort time by dividing the data into buckets, then using O(1) lookup to retrieve the bucket of data that is relevant. (A Java sketch of the same bucketing idea for string prefixes follows the VB example below.)
Option Strict On
Option Explicit On

Module Module1
    Private Const MAX_SIZE As Integer = 100000
    Private m_input(MAX_SIZE) As Integer
    Private m_table(MAX_SIZE) As List(Of Integer)
    Private m_randomGen As New Random()
    Private m_operations As Integer = 0

    Private Sub generateData()
        ' fill with random numbers between 0 and MAX_SIZE - 1
        For i = 0 To MAX_SIZE - 1
            m_input(i) = m_randomGen.Next(0, MAX_SIZE - 1)
        Next
    End Sub

    Private Sub sortData()
        For i As Integer = 0 To MAX_SIZE - 1
            Dim x = m_input(i)
            If m_table(x) Is Nothing Then
                m_table(x) = New List(Of Integer)
            End If
            m_table(x).Add(x)
            ' clearly this is simply going to be MAX_SIZE - 1
            m_operations = m_operations + 1
        Next
    End Sub

    Private Sub printData(ByVal start As Integer, ByVal finish As Integer)
        If start < 0 Or start > MAX_SIZE - 1 Then
            Throw New Exception("printData - start out of range")
        End If
        If finish < 0 Or finish > MAX_SIZE - 1 Then
            Throw New Exception("printData - finish out of range")
        End If
        For i As Integer = start To finish
            If m_table(i) IsNot Nothing Then
                For Each x In m_table(i)
                    Console.WriteLine(x)
                Next
            End If
        Next
    End Sub

    ' run the entire sort, but just print out the first 100 for verification purposes
    Private Sub test()
        m_operations = 0
        generateData()
        Console.WriteLine("Time started = " & Now.ToString())
        sortData()
        Console.WriteLine("Time finished = " & Now.ToString & " Number of operations = " & m_operations.ToString())
        ' print out a random 100 segment from the sorted array
        Dim start As Integer = m_randomGen.Next(0, MAX_SIZE - 101)
        printData(start, start + 100)
    End Sub

    Sub Main()
        test()
        Console.ReadLine()
    End Sub
End Module
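To relate that VB example back to the Java setting of the question, here is a hedged sketch of the same bucketing idea applied to string prefixes. The bucket length of 2 is arbitrary, the lookup assumes the search prefix is at least that long, and since this version holds everything in memory, for a 500 GB input you would bucket file offsets (or separate bucket files) rather than the lines themselves:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PrefixBuckets {
    private static final int BUCKET_LEN = 2;
    private final Map<String, List<String>> buckets = new HashMap<String, List<String>>();

    // One linear pass: drop each line into the bucket for its first BUCKET_LEN characters.
    public PrefixBuckets(String filename) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String key = line.substring(0, Math.min(BUCKET_LEN, line.length()));
                buckets.computeIfAbsent(key, k -> new ArrayList<String>()).add(line);
            }
        }
    }

    // O(1) bucket lookup, then a scan of just that bucket for matching prefixes.
    // Assumes prefix.length() >= BUCKET_LEN.
    public List<String> linesWithPrefix(String prefix) {
        String key = prefix.substring(0, BUCKET_LEN);
        List<String> result = new ArrayList<String>();
        for (String line : buckets.getOrDefault(key, new ArrayList<String>())) {
            if (line.startsWith(prefix)) {
                result.add(line);
            }
        }
        return result;
    }
}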
I posted a gist https://gist.github.com/mikee805/c6c2e6a35032a3ab74f643a1d0f8249c
that is a rather complete example, based on what I found on Stack Overflow and some blogs; hopefully someone else can use it.
import static java.nio.file.Files.isWritable;
import static java.nio.file.StandardOpenOption.READ;
import static org.apache.commons.io.FileUtils.forceMkdir;
import static org.apache.commons.io.IOUtils.closeQuietly;
import static org.apache.commons.lang3.StringUtils.isBlank;
import static org.apache.commons.lang3.StringUtils.trimToNull;
import java.io.File;
import java.io.IOException;
import java.nio.Buffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;

public class FileUtils {
    private FileUtils() {
    }

    private static boolean found(final String candidate, final String prefix) {
        return isBlank(candidate) || candidate.startsWith(prefix);
    }

    private static boolean before(final String candidate, final String prefix) {
        return prefix.compareTo(candidate.substring(0, prefix.length())) < 0;
    }

    public static MappedByteBuffer getMappedByteBuffer(final Path path) {
        FileChannel fileChannel = null;
        try {
            fileChannel = FileChannel.open(path, READ);
            return fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size()).load();
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
        finally {
            closeQuietly(fileChannel);
        }
    }

    public static String binarySearch(final String prefix, final MappedByteBuffer buffer) {
        if (buffer == null) {
            return null;
        }
        try {
            long low = 0;
            long high = buffer.limit();
            while (low < high) {
                int mid = (int) ((low + high) / 2);
                final String candidate = getLine(mid, buffer);
                if (found(candidate, prefix)) {
                    return trimToNull(candidate);
                }
                else if (before(candidate, prefix)) {
                    high = mid;
                }
                else {
                    low = mid + 1;
                }
            }
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
        return null;
    }

    private static String getLine(int position, final MappedByteBuffer buffer) {
        // search backwards to find the preceding new line,
        // then search forwards again until the next new line,
        // and return the string in between
        final StringBuilder stringBuilder = new StringBuilder();
        // walk it back
        char candidate = (char) buffer.get(position);
        while (position > 0 && candidate != '\n') {
            candidate = (char) buffer.get(--position);
        }
        // we are either at the beginning of the file or at a new line
        if (position == 0) {
            // we are at the beginning, at the first char
            candidate = (char) buffer.get(position);
            stringBuilder.append(candidate);
        }
        // there is/are char(s) after the new line / first char
        if (isInBuffer(buffer, position)) {
            // first char after the new line
            candidate = (char) buffer.get(++position);
            stringBuilder.append(candidate);
            // walk it forward
            while (isInBuffer(buffer, position) && candidate != ('\n')) {
                candidate = (char) buffer.get(++position);
                stringBuilder.append(candidate);
            }
        }
        return stringBuilder.toString();
    }

    private static boolean isInBuffer(final Buffer buffer, int position) {
        return position + 1 < buffer.limit();
    }

    public static File getOrCreateDirectory(final String dirName) {
        final File directory = new File(dirName);
        try {
            forceMkdir(directory);
            isWritable(directory.toPath());
        }
        catch (IOException e) {
            throw new RuntimeException(e);
        }
        return directory;
    }
}
I had a similar problem, so I created a (Scala) library from the solutions provided in this thread:
https://github.com/avast/BigMap
It contains utilities for sorting a huge file and doing binary search in the sorted file...
If you truly want to try memory-mapping the file, I found a tutorial on how to use memory mapping in Java NIO.
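For reference, a minimal sketch of memory-mapping a file with java.nio (the file name is illustrative; note that a single MappedByteBuffer is limited to about 2 GB, so a 500 GB file needs several mapped regions, as in the paged reader shown earlier):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MapFileExample {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("sorted.txt", "r");
             FileChannel channel = raf.getChannel()) {
            // map at most the first 2 GB of the file, read-only
            long size = Math.min(channel.size(), Integer.MAX_VALUE);
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            // random access by byte offset, no explicit read() calls needed
            byte firstByte = buffer.get(0);
            System.out.println("first byte: " + (char) firstByte);
        }
    }
}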
