When to close DL4J INDArrays

I created a custom DataSetIterator. It works by randomly generating two INDArrays (one for input and one for output) in the next method and creating a DataSet out of it:
int[][] inputArray = new int[num][NUM_INPUTS];
int[][] expectedOutputArray = new int[num][];
for (int i = 0; i < num; i++) { //just fill the arrays with some data
    int sum = 0;
    int product = 1;
    for (int j = 0; j < inputArray[i].length; j++) {
        inputArray[i][j] = rand.nextInt();
        sum += inputArray[i][j];
        product *= inputArray[i][j];
    }
    expectedOutputArray[i] = new int[] { sum, product, sum / inputArray[i].length };
}
INDArray inputs = Nd4j.createFromArray(inputArray);//never closed
INDArray desiredOutputs = Nd4j.createFromArray(expectedOutputArray);//never closed
return new DataSet(inputs, desiredOutputs);
However, INDArray implements AutoClosable and the javadoc for close() states:
This method releases exclusive off-heap resources uses by this INDArray instance. If INDArray relies on shared resources, exception will be thrown instead PLEASE NOTE: This method is NOT safe by any means
Do I need to close the INDArrays?
If so, when do I need to close the INDArrays?
I have tried to use a try-with-resources but it threw an exception as the INDArray is closed when using it in the fit method.
The documentation of createFromArray(int[][]) does not seem to explain this.

You don't really need to close them; we take care of that automatically with JavaCPP. You can choose to close them, but AutoCloseable was implemented for people who want more control over the memory management of the ndarrays.
Edit: JavaCPP is the underlying native integration we use to connect to the native libraries we maintain (written in C++) and to other libraries. All of our calculations and data are based on native code and off-heap memory.
close() just forces us to deallocate those buffers sooner; JavaCPP already has automatic deallocation built in.
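The exception described in the question (fit() failing after a try-with-resources) can be reproduced without ND4J. The class below is a hypothetical stand-in for INDArray, used only to illustrate the AutoCloseable lifecycle: anything created inside the try block is closed as soon as the block exits, even if a reference escapes into a longer-lived object.

```java
public class CloseDemo {
    // Stand-in for INDArray (hypothetical, for illustration only).
    static class FakeArray implements AutoCloseable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    // Mimics building a DataSet inside try-with-resources: the array escapes
    // the block, but close() still runs when the block exits.
    static FakeArray buildDataSet() {
        try (FakeArray a = new FakeArray()) {
            return a; // caller receives an array that is about to be closed
        }
    }

    public static void main(String[] args) {
        FakeArray ds = buildDataSet();
        System.out.println(ds.closed); // true: a later fit() would see a closed array
    }
}
```

This is why explicit closing, if you do it at all, belongs after the last call that touches the arrays, not around the code that creates them.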


How to improve Java multi-threading performance? Time efficiency of saving/loading data from NoSQL database (like Redis) vs ArrayList?

I am evaluating an SDK and I need to cross compare ~15000 iris images stored in a gallery folder and generate the similarity scores as a 15000 x 15000 matrix.
So I pre-processed all the images and stored the processed blobs in an ArrayList. Then I'm using multiple threads, with two 'for' loops in the run method, to call the 'compare' method (from the SDK), passing ArrayList indices as parameters to compare the respective blobs and saving the integer return values to an Excel sheet using the Apache POI library. The performance is very poor: each comparison takes ~40 ms, and the whole task of 225,000,000 comparisons would take an estimated ~100 days with 8 cores running at 100%. Please help me understand this bottleneck.
Multithreading code
int processors = Runtime.getRuntime().availableProcessors();
ExecutorService executor = Executors.newFixedThreadPool(processors);
for (int i = 0; i < processors; i++) {
    //each thread compares 1875 images with 15000 images
    Runnable task = new Thread(bloblist, i * 1875, i * 1875 + 1874);
    executor.execute(task);
}
executor.shutdown();
Run Method
public void run() {
    for (int i = startIndex; i <= lastIndex; i++) {
        for (int j = 0; j < 15000; j++) {
            compare.compareIris(bloblist.get(i), bloblist.get(j));
            score = compare.getScore();
            //save result to Excel using Apache POI
            ...
            ...
        }
    }
}
Please suggest me a time-efficient architecture to accomplish this task. Shall I store the blobs in a NoSQL DB or is there any alternate way to do this?
I'd consider adding some simple profiling to your code as a first step. Profiling libraries are great, but can be a little intimidating. All you really need to get started is:
public void run() {
    long sumCompare = 0;
    long sumSave = 0;
    for (int i = startIndex; i <= lastIndex; i++) {
        for (int j = 0; j < 15000; j++) {
            final long compareStart = System.currentTimeMillis();
            compare.compareIris(bloblist.get(i), bloblist.get(j));
            score = compare.getScore();
            final long compareEnd = System.currentTimeMillis();
            sumCompare += (compareEnd - compareStart);
            //save result to Excel using Apache POI
            ...
            ...
            final long saveEnd = System.currentTimeMillis();
            sumSave += (saveEnd - compareEnd);
        }
    }
    System.out.println(String.format("Compare: %d; Save: %d", sumCompare, sumSave));
}
Maybe run this over a 100x100 grid instead to get a rough idea of where the bulk of your runtime is.
If it's the save step, I'd strongly recommend using a database as an intermediate step between computing the score and exporting it to a spreadsheet. A NoSQL database would work, although I'd also encourage you to look at something like SQLite, just for simplicity's sake. (Many NoSQL databases are designed to offer advantages across a cluster of database nodes while working with very large datasets; if you're storing write-only data on one node, SQL may be your best bet.)
If the bottleneck is the compute step, it will be more difficult to improve performance. If the blobs don't all fit comfortably in RAM along with whatever RAM the comparisons consume, you may be paying the price of swapping this data onto and off of disk. You may see a small improvement by having each thread take work "off of a queue", rather than starting with a pre-assigned block:
final int processors = Runtime.getRuntime().availableProcessors();
final ExecutorService executor = Executors.newFixedThreadPool(processors);
final AtomicLong nextCompare = new AtomicLong(0);
for (int i = 0; i < processors; i++) {
    Runnable task = new Thread(bloblist, nextCompare);
    executor.execute(task);
}
executor.shutdown();
public void run() {
    while (true) {
        final long taskNum = nextCompare.getAndIncrement();
        if (taskNum >= 15000L * 15000L) {
            return;
        }
        final int i = (int) (taskNum / 15000);
        final int j = (int) (taskNum % 15000);
        compare.compareIris(bloblist.get(i), bloblist.get(j));
        score = compare.getScore();
        // Save score, etc.
    }
}
This will result in all threads working on blobs stored relatively close together in memory. In this way, no thread is evicting data from the cache that another thread will require in the near future. You are, however, paying the price of contention on the AtomicLong; if memory thrashing wasn't your issue, this will likely be a bit slower.

Performance difference between assignment and conditional test

This question is specifically geared towards the Java language, but I would not mind feedback about this being a general concept if so. I would like to know which operation might be faster, or if there is no difference between assigning a variable a value and performing tests for values. For this issue we could have a large series of Boolean values that will have many requests for changes. I would like to know if testing for the need to change a value would be considered a waste when weighed against the speed of simply changing the value during every request.
public static void main(String[] args) {
    Boolean[] array = new Boolean[veryLargeValue];
    for (int i = 0; i < array.length; i++) {
        array[i] = randomTrueFalseAssignment;
    }
    for (int i = 400; i < array.length - 400; i++) {
        testAndChange(array, i);
    }
    for (int i = 400; i < array.length - 400; i++) {
        justChange(array, i);
    }
}
This could be the testAndChange method
public static void testAndChange(Boolean[] pArray, int ind) {
    if (pArray[ind])
        pArray[ind] = false;
}
This could be the justChange method
public static void justChange(Boolean[] pArray, int ind) {
    pArray[ind] = false;
}
If we were to end up with the very rare case that every value within the range supplied to the methods were false, would there be a point where one method would eventually become slower than the other? Is there a best practice for issues similar to this?
Edit: I wanted to add this to help clarify the question a bit more. I realize that the data type can be factored into the answer, as larger or more efficient data types can be utilized. I am more focused on the task itself: is the test "if(aConditionalTest)" slower than, faster than, or indeterminable (without additional information such as the data type) compared to the assignment "x = aValue"?
As @TrippKinetics points out, there is a semantic difference between the two methods. Because you use Boolean instead of boolean, it is possible that one of the values is a null reference. In that case the first method (with the if-statement) will throw an exception, while the second simply assigns values to all the elements in the array.
Assuming you use boolean[] instead of Boolean[]: optimization is an undecidable problem, and there are very rare cases where adding an if-statement results in better performance. For instance, most processors use caches, and an if-statement can mean the executed code fits on exactly two cache pages where, without the if, it would span more, resulting in cache misses. You might think you save an assignment instruction, but it comes at the cost of a fetch instruction and a conditional branch (which breaks the CPU pipeline), and assigning has more or less the same cost as fetching a value.
In general, however, the extra if-statement is useless and will nearly always result in slower code, so you can quite safely state that it will slow your code down.
More specifically on your question, there are faster ways to set a range to false. For instance using bitvectors like:
long[] data = new long[(veryLargeValue + 0x3f) >> 0x06]; //a long has 64 bits
//assign random values
int low = 400 >> 0x06;
int high = (veryLargeValue - 400) >> 0x06;
data[low] &= ~(0xffffffffffffffffL << (400 & 0x3f)); //keep only the bits below index 400
for (int i = low + 0x01; i < high; i++) {
    data[i] = 0x00L;
}
data[high] &= 0xffffffffffffffffL << ((veryLargeValue - 400) & 0x3f); //keep only the bits above the range
The advantage is that a processor can perform operations on 32- or 64-bits at once. Since a boolean is one bit, by storing bits into a long or int, operations are done in parallel.
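The standard library already packs booleans this way: java.util.BitSet stores its bits in a long[] and clears ranges a word at a time. A minimal sketch of the same range-clear (the 400-element margins are taken from the question):

```java
import java.util.BitSet;

public class RangeClear {
    // All-true bit set of size n with the range [400, n-400) cleared.
    static BitSet middleCleared(int n) {
        BitSet bits = new BitSet(n);
        bits.set(0, n);           // start with every flag true
        bits.clear(400, n - 400); // cleared 64 bits per long word internally
        return bits;
    }

    public static void main(String[] args) {
        BitSet bits = middleCleared(100_000);
        System.out.println(bits.get(399) + " " + bits.get(400)); // true false
    }
}
```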

Java wordcount: a mediocre implementation

I implemented a wordcount program with Java. Basically, the program takes a large file (in my tests, I used a 10 gb data file that contained numbers only), and counts the number of times each 'word' appears - in this case, a number (23723 for example might appear 243 times in the file).
Below is my implementation. I seek to improve it, with mainly performance in mind, but a few other things as well, and I am looking for some guidance. Here are a few of the issues I wish to correct:
Currently, the program is threaded and works properly. However, what I do is pass a chunk of memory (500MB/NUM_THREADS) to each thread, and each thread proceeds to wordcount. The problem here is that I have the main thread wait for ALL the threads to complete before passing more data to each thread. It isn't too much of a problem, but there is a period of time where a few threads will wait and do nothing for a while. I believe some sort of worker pool or executor service could solve this problem (I have not learned the syntax for this yet).
The program will only work for a file that contains integers. That's a problem. I struggled with this a lot, as I didn't know how to iterate through the data without creating loads of unused variables (using a String or even StringBuilder had awful performance). Currently, I use the fact that I know the input is an integer, and just store the temporary variables as an int, so no memory problems there. I want to be able to use some sort of delimiter, whether that delimiter be a space, or several characters.
I am using a global ConcurrentHashMap to store key-value pairs. For example, if a thread finds a number "24624", it searches for that number in the map. If it exists, it will increase the value of that key by one. The value of each key at the end represents the number of occurrences of that key. So is this the proper design? Would I gain in performance by giving each thread its own hashmap, and then merging them all at the end?
Is there any other way of seeking through a file with an offset without using the class RandomAccessFile? This class will only read into a byte array, which I then have to convert. I haven't timed this conversion, but maybe it could be faster to use something else.
I am open to other possibilities as well, this is just what comes to mind.
Note: Splitting the file is not an option I want to explore, as I might be deploying this on a server in which I should not be creating my own files, but if it would really be a performance boost, I might listen.
Other Note: I am new to java threading, as well as new to StackOverflow. Be gentle.
public class BigCount2 {
    public static void main(String[] args) throws IOException, InterruptedException {
        int num, counter;
        long i, j;
        String delimiterString = " ";
        ArrayList<Character> delim = new ArrayList<Character>();
        for (char c : delimiterString.toCharArray()) {
            delim.add(c);
        }
        int counter2 = 0;
        num = Integer.parseInt(args[0]);
        int bytesToRead = 1024 * 1024 * 1024 / 2; //512 MB, size of each outer pass
        int remainder = bytesToRead % num;
        int k = 0;
        bytesToRead = bytesToRead - remainder;
        int byr = bytesToRead / num;
        String filepath = "C:/Users/Daniel/Desktop/int-dataset-10g.dat";
        RandomAccessFile file = new RandomAccessFile(filepath, "r");
        Thread[] t = new Thread[num]; //array of threads
        ConcurrentMap<Integer, Integer> wordCountMap = new ConcurrentHashMap<Integer, Integer>(25000);
        byte[] byteArray = new byte[byr]; //each thread's slice of the current pass
        char[] newbyte;
        for (i = 0; i < file.length(); i += bytesToRead) {
            counter = 0;
            for (j = 0; j < bytesToRead; j += byr) {
                file.seek(i + j);
                file.read(byteArray, 0, byr);
                newbyte = new String(byteArray).toCharArray();
                t[counter] = new Thread(
                        new BigCountThread2(counter, newbyte, delim, wordCountMap)); //each thread gets its own slice
                t[counter].start();
                counter++;
                newbyte = null;
            }
            for (k = 0; k < num; k++) {
                t[k].join(); //main thread continues after ALL threads have finished
            }
            counter2++;
            System.gc();
        }
        file.close();
        System.exit(0);
    }
}
class BigCountThread2 implements Runnable {
    private final ConcurrentMap<Integer, Integer> wordCountMap;
    char[] newbyte;
    private ArrayList<Character> delim;
    private int threadId; //use for later
    BigCountThread2(int tid, char[] newbyte, ArrayList<Character> delim,
                    ConcurrentMap<Integer, Integer> wordCountMap) {
        this.delim = delim;
        threadId = tid;
        this.wordCountMap = wordCountMap;
        this.newbyte = newbyte;
    }
    public void run() {
        int intCheck = 0;
        int counter = 0;
        Integer check;
        int intbuilder = 0;
        for (int i = 0; i < newbyte.length; i++) {
            intCheck = Character.getNumericValue(newbyte[i]);
            if (newbyte[i] == ' ' || intCheck == -1) { //once a delimiter is found, add the current number to the map
                check = wordCountMap.putIfAbsent(intbuilder, 1);
                if (check != null) { //putIfAbsent returns null on the first occurrence
                    wordCountMap.put(intbuilder, wordCountMap.get(intbuilder) + 1);
                }
                intbuilder = 0;
            } else {
                intbuilder = (intbuilder * 10) + intCheck;
                counter++;
            }
        }
    }
}
Some thoughts on most of the points:
.. I believe some sort of worker pool or executor service could solve this problem (I have not learned the syntax for this yet).
If all the threads take about the same time to process the same amount of data, then there really isn't that much of a "problem" here.
However, one nice thing about a Thread Pool is it allows one to rather trivially adjust some basic parameters such as number of concurrent workers. Furthermore, using an executor service and Futures can provide an additional level of abstraction; in this case it could be especially handy if each thread returned a map as the result.
The program will only work for a file that contains integers. That's a problem. I struggled with this a lot, as I didn't know how to iterate through the data without creating loads of unused variables (using a String or even StringBuilder had awful performance) ..
This sounds like an implementation issue. While I would first try a StreamTokenizer (because it's already written), if doing it manually, I would check out the source - a good bit of that can be omitted when simplifying the notion of a "token". (It uses a temporary array to build the token.)
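As a sketch of the StreamTokenizer route (the whitespace-delimited integer input is taken from the question; the class and method names here are made up for illustration):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

public class TokenCount {
    // Counts how often each whitespace-delimited integer occurs in the input.
    static Map<Integer, Integer> count(Reader in) throws IOException {
        StreamTokenizer tok = new StreamTokenizer(new BufferedReader(in));
        tok.parseNumbers(); // numbers are delivered in tok.nval
        Map<Integer, Integer> counts = new HashMap<>();
        while (tok.nextToken() != StreamTokenizer.TT_EOF) {
            if (tok.ttype == StreamTokenizer.TT_NUMBER) {
                counts.merge((int) tok.nval, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        Map<Integer, Integer> counts = count(new StringReader("23723 5 23723"));
        System.out.println(counts.get(23723)); // 2
    }
}
```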
I am using a global ConcurrentHashMap to story key value pairs. .. So is this the proper design? Would I gain in performance by giving each thread it's own hashmap, and then merging them all at the end?
It would reduce locking, and may increase performance, to use a separate map per thread plus a merge strategy. Furthermore, the current implementation is broken: wordCountMap.put(intbuilder, wordCountMap.get(intbuilder) + 1) is not atomic, so the operation might undercount. I would use a separate map simply because reducing mutable shared state makes a threaded program much easier to reason about.
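A sketch of that per-thread-map design using an ExecutorService and Futures (the pre-split chunks and names are hypothetical; the point is the private-map-then-merge pattern):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MergeMaps {
    // Each worker counts its own chunk into a private map: no shared locking.
    static Map<Integer, Integer> countChunk(int[] chunk) {
        Map<Integer, Integer> local = new HashMap<>();
        for (int v : chunk) local.merge(v, 1, Integer::sum);
        return local;
    }

    public static void main(String[] args) throws Exception {
        int[][] chunks = { {1, 2, 2}, {2, 3} }; // hypothetical pre-split input
        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Future<Map<Integer, Integer>>> futures = new ArrayList<>();
        for (int[] c : chunks) {
            Callable<Map<Integer, Integer>> job = () -> countChunk(c);
            futures.add(pool.submit(job));
        }
        // Single-threaded merge at the end.
        Map<Integer, Integer> total = new HashMap<>();
        for (Future<Map<Integer, Integer>> f : futures)
            f.get().forEach((k, v) -> total.merge(k, v, Integer::sum));
        pool.shutdown();
        System.out.println(total.get(2)); // 3
    }
}
```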
Is there any other way of seeking through a file with an offset without using the class RandomAccessFile? This class will only read into a byte array, which I then have to convert. I haven't timed this conversion, but maybe it could be faster to use something else.
Consider using a FileReader (and BufferedReader) per thread on the same file. This will avoid having to first copy the file into the array and slice it out for individual threads which, while the same amount of total reading, avoids having to soak up so much memory. The reading done is actually not random access, but merely sequential (with a "skip") starting from different offsets - each thread still works on a mutually exclusive range.
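A sketch of the per-thread reader skipping to its own offset; here a StringReader stands in for the FileReader on the shared file, and the method name is made up:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class OffsetRead {
    // Each worker opens its own reader over the same data and skips to its
    // starting offset; from there the reads are purely sequential.
    static char[] readRange(Reader r, long offset, int len) throws IOException {
        BufferedReader br = new BufferedReader(r);
        long skipped = 0;
        while (skipped < offset) {
            long s = br.skip(offset - skipped);
            if (s == 0) break; // end of stream
            skipped += s;
        }
        char[] buf = new char[len];
        int read = 0;
        while (read < len) {
            int n = br.read(buf, read, len - read);
            if (n < 0) break; // end of stream
            read += n;
        }
        return buf;
    }

    public static void main(String[] args) throws IOException {
        // StringReader stands in for a FileReader on the shared file.
        System.out.println(new String(readRange(new StringReader("0123456789"), 4, 3))); // 456
    }
}
```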
Also, the original code with the slicing is broken if an integer value is "cut" in half, as each of two threads would read half the word. One workaround is to have each thread skip the first word if it is a continuation from the previous block (i.e., scan one byte earlier) and then read past the end of its range as required to complete the last word.
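That boundary fix-up can be sketched over a char array (space-delimited, as in the question; the class and method names are hypothetical):

```java
public class ChunkBounds {
    // Returns the adjusted [start, end) slice of `data` this worker should
    // parse: skip a token that began in the previous block (it owns that one),
    // then run past the nominal end until the current token finishes.
    static int[] adjust(char[] data, int start, int end) {
        int s = start;
        if (s > 0 && data[s - 1] != ' ') {          // token continues from previous block
            while (s < data.length && data[s] != ' ') s++;
            if (s < data.length && data[s] == ' ') s++; // step over the delimiter
        }
        int e = end;
        while (e < data.length && data[e] != ' ') e++;  // finish the cut token
        return new int[] { s, e };
    }

    public static void main(String[] args) {
        char[] data = "123 4567 89".toCharArray();
        int[] r = adjust(data, 6, 9); // nominal range cuts "4567" and "89" in half
        System.out.println(r[0] + "," + r[1]); // 9,11
    }
}
```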

How to create a FloatBuffer dynamically

I have need to create FloatBuffer's from a dynamic set of floats (that is, I don't know the length ahead of time). The only way I've found to do this is rather inelegant (below). I assume I'm missing something and there must be a cleaner/simpler method.
My solution:
Vector<Float> work = new Vector<Float>();
//add stuff to work
ByteBuffer bb = ByteBuffer.allocateDirect(work.size() * 4 /*sizeof(float)*/);
bb.order(ByteOrder.nativeOrder());
FloatBuffer floatBuf = bb.asFloatBuffer();
for (Float f : work)
    floatBuf.put(f);
floatBuf.position(0);
I am using my buffers for OpenGL commands thus I need to keep them around (that is, the resulting FloatBuffer is not just a temporary space).
If you're using the OpenGL API through Java, I assume you're using LWJGL as the go-between. If so, there's a simple solution for this, which is to use the BufferUtils class in the org.lwjgl package. The method BufferUtils.createFloatBuffer() allows you to put in floats from an array, which if you're using a Vector, is a simple conversion. Although it's not much better than your method, it does save the need for a byte buffer which is nasty enough, and allows for a few quick conversions. The code for this exists in the new LWJGL tutorials for OpenGL 3.2+ here.
Hope this helps.
I would use a plain ByteBuffer and write out the data whenever the buffer fills (or do whatever you planned to do with it).
e.g.
SocketChannel sc = ...
ByteBuffer bb = ByteBuffer.allocateDirect(32 * 1024).order(ByteOrder.LITTLE_ENDIAN);
for (int i = 0; i < 100000000; i++) {
    float f = i;
    // move to a checkFree(4) method.
    if (bb.remaining() < 4) {
        bb.flip();
        while (bb.remaining() > 0)
            sc.write(bb);
        bb.clear(); // reset for the next batch of floats
    }
    // end of method
    bb.putFloat(f);
}
Creating really large buffers can actually be slower than processing the data as you generate it.
Note: this creates almost no garbage. There is only one object which is the ByteBuffer.
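If you do need to keep the whole buffer around for OpenGL, another option is a small growable wrapper that doubles a direct FloatBuffer on overflow, much like ArrayList growth. This class is a hypothetical sketch, not an existing API:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class GrowableFloatBuffer {
    private FloatBuffer buf = newDirect(16);

    private static FloatBuffer newDirect(int capacity) {
        return ByteBuffer.allocateDirect(capacity * Float.BYTES)
                         .order(ByteOrder.nativeOrder())
                         .asFloatBuffer();
    }

    // Doubles the backing direct buffer when full, copying the old contents.
    public void add(float f) {
        if (!buf.hasRemaining()) {
            FloatBuffer bigger = newDirect(buf.capacity() * 2);
            buf.flip();        // prepare old buffer for reading
            bigger.put(buf);   // bulk copy into the larger buffer
            buf = bigger;
        }
        buf.put(f);
    }

    // Finished buffer, flipped so position 0..limit covers the added floats;
    // call once, then hand the result to OpenGL.
    public FloatBuffer done() {
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        GrowableFloatBuffer g = new GrowableFloatBuffer();
        for (int i = 0; i < 100; i++) g.add(i);
        FloatBuffer fb = g.done();
        System.out.println(fb.limit() + " " + fb.get(99)); // 100 99.0
    }
}
```

The amortized cost of the copies is linear in the number of floats, and you avoid boxing every value into a Float as the Vector approach does.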

How to pass array values to and from Android RenderScript using Allocations

I've been working with RenderScript recently with the intent of creating an API that a programmer can use with ease, similar to the way that Microsoft Accelerator works.
The trouble I'm stuck on at the moment is that I want to pass values to and from the RenderScript layer and have everything run in the most efficient way possible. This is an extract of my source code so far:
int[] A = new int[10];
int[] B = new int[10];
for (int i = 0; i < 10; i++) {
    A[i] = 2;
    B[i] = i;
}
intAdd(A, B);
This just creates two basic arrays and fills them with values and calls the functions that will send them to RenderScript.
private void intAdd(int[] A, int[] B) {
    RenderScript rs = RenderScript.create(this);
    ScriptC_rsintadd intaddscript = new ScriptC_rsintadd(rs, getResources(), R.raw.rsintadd);
    mScript = intaddscript;
    for (int i = 0; i < A.length; i++) {
        setNewValues(mScript, A[i], B[i]);
        intaddscript.invoke_intAdd();
        int C = getResult(mScript);
        notifyUser.append(" " + C);
    }
}
public void setNewValues(Script script, int A, int B) {
    mScript.set_numberA(A);
    mScript.set_numberB(B);
}
public int getResult(Script script) {
    int C = mScript.get_numberC();
    return C;
}
This will send a pair of values to the following RenderScript code:
int numberA;
int numberB;
int numberC;

void intAdd() {
    /*Add the two together*/
    numberC = numberA + numberB;
    /*Send their values to the logcat*/
    rsDebug("Current Value", numberC);
}
But there are two problems with this. The first is the asynchronous nature of RenderScript: when the Java layer requests the value, the script either hasn't done the operation yet, or it has already done it, destroyed the output value, and started on the next one. And thanks to the low debugging visibility of RenderScript, there's no way of telling which.
The other problem is that it's not very efficient: the code is constantly calling the RenderScript function just to add two numbers together. Ideally I'd want to pass the array to RenderScript, store it in a struct, and have the entire operation done in one script call rather than many. But to get the results back I reckon I'll need to use the rsSendToClient function, and I've not found any material on how to use it. Preferably I'd like to use the rsForEach strategy, but again information is scarce.
If anyone has any ideas I'd be very grateful. Thanks.
Will Scott-Jackson
I'm not sure if this will be of help to you at this point but since I know how much of a pain it can be to work through RenderScript, here is the help I can offer. In order to use the rsSendToClient function, you need to instruct the RenderScript instance you created where to send messages to. This is accomplished by something such as:
private void intAdd(int[] A, int[] B) {
    RenderScript rs = RenderScript.create(this);
    MySubclassedRSMessageHandler handler = new MySubclassedRSMessageHandler();
    rs.setMessageHandler(handler);
    ScriptC_rsintadd intaddscript = new ScriptC_rsintadd(rs, getResources(), R.raw.rsintadd);
    mScript = intaddscript;
    for (int i = 0; i < A.length; i++) {
        setNewValues(mScript, A[i], B[i]);
        intaddscript.invoke_intAdd();
        int C = getResult(mScript);
        notifyUser.append(" " + C);
    }
}
It will be necessary to subclass RenderScript.RSMessageHandler and override the run() method. See http://developer.android.com/reference/android/renderscript/RenderScript.RSMessageHandler.html if you haven't already. Basically there is no way to get around the asynchronous nature, which I find to be a double-edged sword.
As for the inefficiency, I would consider creating a RenderScript instance, leave it running (you can pause it when not needed, will stay in memory but stop the threads, thus not incurring the construction cost each time you call a function). From here you can have your structures and then use invoke_myFunction(some arguments here) from the reflected Java layer.
Hopefully this helps at least a little bit.
I had the same problem. The issue with your program is that it doesn't know when the add function in the .rs file should run. Try this; it should work:
public void setNewValues(Script script, int A, int B) {
    mScript.set_numberA(A);
    mScript.set_numberB(B);
    mScript.invoke_intAdd();
}
I had the same problem as you. I think the rsSendToClient function is not useful and creates many bugs. Instead, binding a pointer to allocated memory and using it to bring the result back is much easier.
I recommend solving your problem like this:
In rsintadd.rs use this snippet:
int32_t *a;
int32_t *b;
int32_t *c;

void intAdd() {
    for (int i = 0; i < 10; i++) {
        c[i] = a[i] + b[i];
    }
}
In your Java code use this snippet (rs is the RenderScript context and inv is the reflected script instance created earlier):
int[] B = new int[10];
int[] A = new int[10];
for (int i = 0; i < 10; i++) {
    A[i] = 2;
    B[i] = 1;
}
// provide memory for b using data in B
Allocation b = Allocation.createSized(rs, Element.I32(rs), B.length);
b.copyFrom(B);
inv.bind_b(b);
// provide memory for a using data in A
Allocation a = Allocation.createSized(rs, Element.I32(rs), A.length);
a.copyFrom(A);
inv.bind_a(a);
// create blank memory for c
inv.bind_c(Allocation.createSized(rs, Element.I32(rs), 10));
// call intAdd function
inv.invoke_intAdd();
// get result
int[] C = new int[10];
inv.get_c().copyTo(C);
for (int i = 0; i < C.length; i++) {
    System.out.println(C[i]);
}
The result then shows up on Logcat.
On your first question, about asynchrony: you can use a thread to wait for the result. In this example the function is fast enough that it writes the output to the C array immediately, so the result appears on Logcat right away. On your second question, about implementing intAdd() without invoking it repeatedly: the code above is the answer. You can access any part of the int array from Java once the method is done (different from the root() function).
Hope this can help someone :)
