Disk reading operation performing very slowly | Java Stream - java

I need to read images from a folder and generate a checksum for each. There are about 330,760 images. Here is the code:
package com.test;

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;

import org.apache.commons.codec.digest.DigestUtils;

public class FileTest2 {

    private void readFiles() throws IOException {
        try (Stream<Path> filePathStream = Files
                .walk(Paths.get("d:\\codebase\\images"))) {
            filePathStream.parallel().forEach(filePath -> {
                String checksumSHA256 = "";
                try {
                    checksumSHA256 = DigestUtils.sha384Hex(new FileInputStream(filePath.toString()));
                } catch (IOException e) {
                    e.printStackTrace();
                }
                if (Files.isRegularFile(filePath)) {
                    System.out.println(checksumSHA256);
                    System.out.println(filePath);
                    System.out.println("\n");
                }
            });
        }
    }

    public static void main(String[] args) throws IOException {
        long startTime = System.currentTimeMillis();
        FileTest2 fileTest = new FileTest2();
        fileTest.readFiles();
        long endTime = System.currentTimeMillis();
        System.out.println("Total Time took: " + (endTime - startTime) / 1000);
    }
}
It took about 36 minutes.
System configuration:
Cores: 8
Memory: 32 GB (15-17 GB free; the rest is used by another server)
36 minutes is far too long. Is there a way to improve performance?

As others pointed out, you do not terminate the executor. To see the actual times, run the following:
public static void main(String[] args) throws Exception {
    long startTime = System.currentTimeMillis();
    FileTest fileTest = new FileTest();
    fileTest.readFiles();
    long endTime = System.currentTimeMillis();
    System.out.println("Total Time took: " + (endTime - startTime) / 1000);
}
Note: at least from the bit of code you posted, there is no reason to use an ExecutorService.
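For completeness, a minimal sketch of how an ExecutorService is normally drained before reading the end timestamp (the pool size and task submission here are illustrative, not from the original post):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShutdownSketch {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(8);
        // submit the checksum tasks here ...
        executor.shutdown();                           // stop accepting new tasks
        executor.awaitTermination(1, TimeUnit.HOURS);  // block until queued tasks finish
        // only now does reading System.currentTimeMillis() reflect the real end time
    }
}

Without the shutdown/awaitTermination pair, main can reach its timing line while tasks are still running, so the printed total is misleading.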

Related

Pagination in getting files

I have a location where 3000 files are stored, but I want to get a list of 1000 files at a time, then another 1000 files on the next call, and so on.
Here is my code:
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FileSystem {

    public static void main(String[] args) throws Exception {
        FileSystem.createListFile();
        FileSystem.getFileInBatch();
    }

    private static void getFileInBatch() {
        int MAX_INDEX = 1000;
        try (Stream<Path> walk = Files.walk(Paths.get("C://FileTest"))) {
            List<String> result = walk.filter(p -> Files.isRegularFile(p) && p.getFileName().toString().endsWith(".txt"))
                    .sorted(Comparator.comparingInt(FileSystem::pathToInt))
                    .map(x -> x.toString()).limit(MAX_INDEX).collect(Collectors.toList());
            result.forEach(System.out::println);
            System.out.println(result.size());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static int pathToInt(final Path path) {
        return Integer.parseInt(path.getFileName()
                .toString()
                .replaceAll("Aamir(\\d+).txt", "$1"));
    }

    private static void createListFile() throws IOException {
        for (int i = 0; i < 3000; i++) {
            File file = new File("C://FileTest/Aamir" + i + ".txt");
            if (file.createNewFile()) {
                System.out.println(file.getName() + " is created!");
            }
        }
    }
}
I am able to get the first 1000 files (Aamir0.txt to Aamir999.txt) using limit on the stream.
Now how can I get the next 1000 files (Aamir1000.txt to Aamir1999.txt)?
You can use skip in your Stream. For example:
int toSkip = 1000; // define as method param/etc.
List<String> result = walk.filter(p -> Files.isRegularFile(p) && p.getFileName().toString().endsWith(".txt"))
        .sorted(Comparator.comparingInt(FileSystem::pathToInt))
        .map(x -> x.toString()).skip(toSkip).limit(MAX_INDEX).collect(Collectors.toList());
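To make the paging explicit, here is a sketch of a helper built from the question's own pipeline, intended to live inside the FileSystem class above (the name getPage is illustrative, not from the original answer):

private static List<String> getPage(int pageNumber, int pageSize) throws IOException {
    try (Stream<Path> walk = Files.walk(Paths.get("C://FileTest"))) {
        return walk.filter(p -> Files.isRegularFile(p) && p.getFileName().toString().endsWith(".txt"))
                .sorted(Comparator.comparingInt(FileSystem::pathToInt))
                .map(Path::toString)
                .skip((long) pageNumber * pageSize)  // skip the pages already consumed
                .limit(pageSize)
                .collect(Collectors.toList());
    }
}

getPage(0, 1000) returns Aamir0.txt to Aamir999.txt, and getPage(1, 1000) returns Aamir1000.txt to Aamir1999.txt. Note that each call re-walks and re-sorts the directory, so this is simple rather than fast.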

Spark Structured Streaming not running continuously?

Server code:
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.Charset;
import java.util.Random;

public class Socket_server {

    public static void main(String[] args) throws Exception {
        ServerSocket sc = new ServerSocket(9990);
        while (true) {
            Socket socket = sc.accept();
            java.io.OutputStream out = socket.getOutputStream();
            String message = getRandomIntegerBetweenRange(100, 120) + "";
            byte b[] = message.getBytes(Charset.defaultCharset());
            out.write(b);
            out.close();
            socket.close();
        }
    }

    private static double getRandomIntegerBetweenRange(double max, double min) {
        double x = (int) (Math.random() * ((max - min) + 1)) + min;
        return x;
    }
}
Spark code:
import java.util.Collections;

import org.apache.avro.ipc.specific.Person;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;

import scala.Function1;

public class App1 {

    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf();
        conf.setMaster("local[*]");
        conf.setAppName("app");
        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
        Dataset<Row> lines = spark.readStream().format("socket").option("host", "localhost").option("port", 9990)
                .load();
        StreamingQuery query = lines.writeStream().format("console").start();
        query.awaitTermination();
    }
}
I run the server code, which generates random values, and then run the Spark Structured Streaming code to read them and create a DataFrame from them. But once my Spark code starts, it reads only the first value from the server and then does not read any further values. When I use this same server with Spark Streaming, it reads values continuously. Can anyone help with what is wrong with the code?
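A likely cause (an assumption, not confirmed in the thread): the server closes each connection after a single write, while Spark's socket source opens one connection and keeps reading newline-delimited lines from it, so the stream ends after the first value. A sketch of a server that keeps the accepted connection open and newline-terminates each message (the class name KeepAliveSocketServer is illustrative):

import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class KeepAliveSocketServer {
    public static void main(String[] args) throws Exception {
        try (ServerSocket sc = new ServerSocket(9990);
             Socket socket = sc.accept()) {        // accept one client and keep it open
            OutputStream out = socket.getOutputStream();
            while (true) {
                // newline-terminated so the socket source can split the byte stream into rows
                String message = ((int) (Math.random() * 21) + 100) + "\n";
                out.write(message.getBytes(StandardCharsets.UTF_8));
                out.flush();
                Thread.sleep(1000);                // one value per second
            }
        }
    }
}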

Read different portions of a file with multiple threads in Java

I have a 10GB PDF file that I would like to break up into 10 files, each 1GB in size. I need to do this operation in parallel, which means spinning up 10 threads, each of which starts at a different position, reads up to 1GB of data, and writes it to a file. The final result should be 10 files that each contain a portion of the original 10GB file.
I looked at FileChannel, but the position is shared, so once I modify the position in one thread, it affects the other threads. I also looked at AsynchronousFileChannel in Java 7, but I'm not sure whether that's the way to go. I would appreciate any suggestions on this issue.
I wrote this simple program that reads a small text file to test the FileChannel idea; it doesn't seem to work for what I'm trying to achieve.
package org.cas.filesplit;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ConcurrentRead implements Runnable {

    private int myPosition = 0;

    public int getPosition() {
        return myPosition;
    }

    public void setPosition(int position) {
        this.myPosition = position;
    }

    static final String filePath = "C:\\Users\\temp.txt";

    @Override
    public void run() {
        try {
            readFile();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void readFile() throws IOException {
        Path path = Paths.get(filePath);
        FileChannel fileChannel = FileChannel.open(path);
        fileChannel.position(myPosition);
        ByteBuffer buffer = ByteBuffer.allocate(8);
        int noOfBytesRead = fileChannel.read(buffer);
        while (noOfBytesRead != -1) {
            buffer.flip();
            System.out.println("Thread - " + Thread.currentThread().getId());
            while (buffer.hasRemaining()) {
                System.out.print((char) buffer.get());
            }
            System.out.println(" ");
            buffer.clear();
            noOfBytesRead = fileChannel.read(buffer);
        }
        fileChannel.close();
    }
}
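One approach worth sketching (not from the thread; the class and names are illustrative): FileChannel.transferTo takes an explicit offset and never moves the channel's shared position, so each thread can open its own channels and copy its own slice independently:

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SliceCopier implements Runnable {

    private final Path source;
    private final Path target;
    private final long offset;  // where this thread's slice starts in the source file
    private final long length;  // slice size, e.g. 1GB

    SliceCopier(Path source, Path target, long offset, long length) {
        this.source = source;
        this.target = target;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public void run() {
        // each thread opens its own channels, so no position is shared between threads
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long copied = 0;
            while (copied < length) {
                long n = in.transferTo(offset + copied, length - copied, out);
                if (n <= 0) {
                    break;  // reached end of the source file
                }
                copied += n;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Starting ten threads, each with offset = i * length and its own target file, produces the ten parts. One caveat: on a single spinning disk, ten threads reading different regions tend to cause seek thrash and can easily be slower than one sequential pass; parallelism here mainly pays off on SSDs or striped storage.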

Comparing Fork/Join with a single-threaded program

I am trying to get started with the Fork/Join framework on a smaller task. As a starter example, I tried copying mp3 files:
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class DeepFileCopier extends RecursiveTask<String> {

    private static final long serialVersionUID = 1L;
    private static Path startingDir = Paths.get("D:\\larsen\\Music\\");
    private static List<Path> listOfPaths = new ArrayList<>();
    private int start, end;

    public static void main(String[] args) throws IOException {
        long startMillis = System.currentTimeMillis();
        Files.walkFileTree(startingDir, new CustomFileVisitor());
        final DeepFileCopier deepFileCopier = new DeepFileCopier(0, listOfPaths.size());
        final ForkJoinPool pool = new ForkJoinPool(Runtime.getRuntime().availableProcessors());
        pool.invoke(deepFileCopier);
        System.out.println("With Fork-Join " + (System.currentTimeMillis() - startMillis));
        long secondStartMillis = System.currentTimeMillis();
        deepFileCopier.start = 0;
        deepFileCopier.end = listOfPaths.size();
        deepFileCopier.computeDirectly();
        System.out.println("Without Fork-Join " + (System.currentTimeMillis() - secondStartMillis));
    }

    private static class CustomFileVisitor extends SimpleFileVisitor<Path> {
        @Override
        public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
            if (file.toString().endsWith(".mp3")) {
                listOfPaths.add(file);
            }
            return FileVisitResult.CONTINUE;
        }
    }

    @Override
    protected String compute() {
        int length = end - start;
        if (length < 4) {
            return computeDirectly();
        }
        int split = length / 2;
        final DeepFileCopier firstHalfCopier = new DeepFileCopier(start, start + split);
        firstHalfCopier.fork();
        final DeepFileCopier secondHalfCopier = new DeepFileCopier(start + split, end);
        secondHalfCopier.compute();
        firstHalfCopier.join();
        return null;
    }

    private String computeDirectly() {
        for (int index = start; index < end; index++) {
            Path currentFile = listOfPaths.get(index);
            System.out.println("Copying :: " + currentFile.getFileName());
            Path targetDir = Paths.get("D:\\Fork-Join Test\\" + currentFile.getFileName());
            try {
                Files.copy(currentFile, targetDir, StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return null;
    }

    private DeepFileCopier(int start, int end) {
        this.start = start;
        this.end = end;
    }
}
On comparing the performance I noticed:
With Fork-Join 149714
Without Fork-Join 146590
I am working on a dual-core machine. I was expecting a 50% reduction in run time, but the Fork/Join portion takes 3 seconds longer than the single-threaded approach. Please let me know if something is incorrect.
Your problem is not well suited to benefit from multithreading on normal systems. The execution time is spent copying all the files, and that is limited by your hard drive, which processes the files in sequence.
If you run a more CPU-intensive task, you should notice a difference. For test purposes you could try the following:
private String computeDirectly() {
    int nonsense = 0;  // must be initialized before use, or this will not compile
    for (int index = start; index < end; index++) {
        for (int j = 0; j < 1000000; j++) {
            nonsense += index * j;
        }
    }
    return Integer.toString(nonsense);
}
On my system (i5-2410M) this will print:
With Fork-Join 2628
Without Fork-Join 6421

Multithreaded test to measure response times of sites/web services

The code below tests the response time of reading www.google.com into a BufferedReader. I plan to use this code to test the response times of other sites and web services on our intranet. The test below runs for 20 seconds and opens 4 requests per second:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.Map.Entry;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.junit.Test;

public class ResponseTimeTest {

    private static final int NUMBER_REQUESTS_PER_SECOND = 4;
    private static final int TEST_EXECUTION_TIME = 20000;
    private static final ConcurrentHashMap<Long, Long> timingMap = new ConcurrentHashMap<Long, Long>();

    @Test
    public void testResponseTime() throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(10);
        scheduler.scheduleAtFixedRate(new RequestThreadCreator(), 0, 1, TimeUnit.SECONDS);
        Thread.sleep(TEST_EXECUTION_TIME);
        System.out.println("Start Time, End Time, Total Time");
        for (Entry<Long, Long> entry : timingMap.entrySet()) {
            System.out.println(entry.getKey() + "," + entry.getValue() + "," + (entry.getValue() - entry.getKey()));
        }
    }

    private final class RequestThreadCreator implements Runnable {
        public void run() {
            ExecutorService es = Executors.newCachedThreadPool();
            for (int i = 1; i <= NUMBER_REQUESTS_PER_SECOND; i++) {
                es.execute(new RequestThread());
            }
            es.shutdown();
        }
    }

    private final class RequestThread implements Runnable {
        public void run() {
            long startTime = System.currentTimeMillis();
            try {
                URL oracle = new URL("http://www.google.com/");
                URLConnection yc = oracle.openConnection();
                BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
                while ((in.readLine()) != null) {
                }
                in.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
            long endTime = System.currentTimeMillis();
            timingMap.put(startTime, endTime);
        }
    }
}
The output is :
Start Time, End Time, Total Time
1417692221531,1417692221956,425
1417692213530,1417692213869,339
1417692224530,1417692224983,453
1417692210534,1417692210899,365
1417692214530,1417692214957,427
1417692220530,1417692221041,511
1417692209530,1417692209949,419
1417692215532,1417692215950,418
1417692214533,1417692215075,542
1417692213531,1417692213897,366
1417692212530,1417692212924,394
1417692219530,1417692219897,367
1417692226532,1417692226876,344
1417692211530,1417692211955,425
1417692209529,1417692209987,458
1417692222531,1417692222967,436
1417692215533,1417692215904,371
1417692219531,1417692219954,423
1417692215530,1417692215870,340
1417692217531,1417692218035,504
1417692207547,1417692207882,335
1417692208535,1417692208898,363
1417692207544,1417692208095,551
1417692208537,1417692208958,421
1417692226533,1417692226899,366
1417692224531,1417692224951,420
1417692225529,1417692225957,428
1417692216530,1417692216963,433
1417692223541,1417692223884,343
1417692223546,1417692223959,413
1417692222530,1417692222954,424
1417692208532,1417692208871,339
1417692207536,1417692207988,452
1417692226538,1417692226955,417
1417692220531,1417692220992,461
1417692209531,1417692209953,422
1417692226531,1417692226959,428
1417692217532,1417692217944,412
1417692210533,1417692210964,431
1417692221530,1417692221870,340
1417692216531,1417692216959,428
1417692207535,1417692208021,486
1417692223548,1417692223957,409
1417692216532,1417692216904,372
1417692214535,1417692215071,536
1417692217530,1417692217835,305
1417692213529,1417692213954,425
1417692210531,1417692210964,433
1417692212529,1417692212993,464
1417692213532,1417692213954,422
1417692215531,1417692215957,426
1417692210529,1417692210868,339
1417692218531,1417692219102,571
1417692225530,1417692225907,377
1417692208536,1417692208966,430
1417692218533,1417692219168,635
Since System.out.println is synchronized, in order not to skew the results I add the timings to a ConcurrentHashMap and do not print them from within RequestThread itself. Are there other gotchas I should be aware of in the above code so as not to skew the results? Are there areas I should concentrate on to improve the accuracy, or is it accurate enough, meaning accurate to approximately 100 milliseconds?
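One gotcha visible in the posted code (an observation, not from an answer in the thread): the map is keyed by startTime, so two requests that begin in the same millisecond overwrite each other and samples are silently lost. A sketch that records every sample instead (the class name TimingLog is illustrative):

import java.util.concurrent.ConcurrentLinkedQueue;

public class TimingLog {
    // each entry is {startTime, endTime}; a queue never collides, unlike a map keyed by start time
    private static final ConcurrentLinkedQueue<long[]> timings = new ConcurrentLinkedQueue<>();

    static void record(long startTime, long endTime) {
        timings.add(new long[] {startTime, endTime});
    }

    static void dump() {
        System.out.println("Start Time, End Time, Total Time");
        for (long[] t : timings) {
            System.out.println(t[0] + "," + t[1] + "," + (t[1] - t[0]));
        }
    }
}

Separately, System.nanoTime() is the more reliable basis for measuring elapsed time, since System.currentTimeMillis() can shift with system clock adjustments.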
