How could I write code to delete exactly the duplicates that I found with the code below? Please be specific when answering, as I am new to Java and have only a very basic knowledge of it.
private static MessageDigest messageDigest;
static {
try {
messageDigest = MessageDigest.getInstance("SHA-512");
} catch (NoSuchAlgorithmException e) {
throw new RuntimeException("cannot initialize SHA-512 hash function", e);
}
}
public static void findDuplicatedFiles(Map<String, List<String>> lists, File directory) {
for (File child : directory.listFiles()) {
if (child.isDirectory()) {
findDuplicatedFiles(lists, child);
} else {
try {
FileInputStream fileInput = new FileInputStream(child);
byte fileData[] = new byte[(int) child.length()];
fileInput.read(fileData);
fileInput.close();
String uniqueFileHash = new BigInteger(1, messageDigest.digest(fileData)).toString(16);
List<String> list = lists.get(uniqueFileHash);
if (list == null) {
list = new LinkedList<String>();
lists.put(uniqueFileHash, list);
}
list.add(child.getAbsolutePath());
} catch (IOException e) {
throw new RuntimeException("cannot read file " + child.getAbsolutePath(), e);
}
}
}
}
Map<String, List<String>> lists = new HashMap<String, List<String>>();
FindDuplicates.findDuplicatedFiles(lists, dir);
for (List<String> list : lists.values()) {
if (list.size() > 1) {
System.out.println("\n");
for (String file : list) {
System.out.println(file);
}
}
}
System.out.println("\n");
Do not read the entire contents of the file into memory. The whole point of an InputStream is that you can read small, manageable chunks of data, so you don’t have to use a great deal of memory.
Imagine if you were trying to check a file that’s one gigabyte in size. By creating a byte array to hold the entire content, you have forced your program to use a gigabyte of RAM. (If the file were two gigabytes or larger, you wouldn’t be able to allocate the byte array at all, since an array may not have more than 2³¹-1 elements.)
The easiest way to compute the hash of a file’s contents is to copy the file to a DigestOutputStream, which is an OutputStream that makes use of an existing MessageDigest:
messageDigest.reset();
try (DigestOutputStream stream = new DigestOutputStream(
OutputStream.nullOutputStream(), messageDigest)) {
Files.copy(child.toPath(), stream);
}
String uniqueFileHash = new BigInteger(1, messageDigest.digest()).toString(16);
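For reference, the same chunked idea can be written by hand; this is only a sketch, reusing the child file and the messageDigest field from the question's code, and feeding the digest one small buffer at a time so memory use stays flat no matter how large the file is:
messageDigest.reset();
try (InputStream in = new FileInputStream(child)) {
    byte[] buffer = new byte[8192];   // fixed-size chunk, never the whole file
    int read;
    while ((read = in.read(buffer)) != -1) {
        messageDigest.update(buffer, 0, read);
    }
}
String uniqueFileHash = new BigInteger(1, messageDigest.digest()).toString(16);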
Scanning directories is easier with the NIO Path / Files classes, because they avoid the awkward recursion of the File API and are much quicker on deep directory trees.
Here is an example scanner that returns a Stream of duplicates, where each item in the stream is a List<Path>: a group of two or more identical files.
// Scans a directory and returns a Stream of List<Path>, where each list holds 2 or more duplicates
static Stream<List<Path>> findDuplicates(Path dir) throws IOException {
Map<Long, List<Path>> candidates = new HashMap<>();
BiPredicate<Path, BasicFileAttributes> biPredicate = (p,a)->a.isRegularFile()
&& candidates.computeIfAbsent(Long.valueOf(a.size())
, k -> new ArrayList<>()).add(p);
try(var stream = Files.find(dir, Integer.MAX_VALUE, biPredicate)) {
stream.count();
}
Predicate<? super List<Path>> twoOrMore = paths -> paths.size() > 1;
return candidates.values().stream()
.filter(twoOrMore)
.flatMap(Duplicate::duplicateChecker)
.filter(twoOrMore);
}
The above code starts by collating candidates of the same file size, then uses a flatMap operation to compare those candidates and keep only the exact matches, so the files in each resulting List<Path> are identical:
// Checks possible list of duplicates, and returns stream of definite duplicates
private static Stream<List<Path>> duplicateChecker(List<Path> sameLenPaths) {
List<List<Path>> groups = new ArrayList<>();
try {
for (Path p : sameLenPaths) {
List<Path> match = null;
for (List<Path> g : groups) {
Path prev = g.get(0);
if(Files.mismatch(prev, p) < 0) {
match = g;
break;
}
}
if (match == null)
groups.add(match = new ArrayList<>());
match.add(p);
}
} catch(IOException io) {
throw new UncheckedIOException(io);
}
return groups.stream();
}
Finally an example launcher:
public static void main(String[] args) throws IOException {
Path dir = Path.of(args[0]);
Stream<List<Path>> duplicates = findDuplicates(dir);
long count = duplicates.peek(System.out::println).count();
System.out.println("Found "+count+" groups of duplicate files in: "+dir);
}
You will need to process the lists of duplicate files using Files.delete. I have not added Files.delete at the end, so that you can check the results before deciding to delete anything.
// findDuplicates(dir).flatMap(List::stream).forEach(dup -> {
// try {
// Files.delete(dup);
// } catch(IOException io) {
// throw new UncheckedIOException(io);
// }
// });
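Note that the commented-out snippet above would delete every file in each group, including the copy you presumably want to keep. If the goal is to keep one copy and delete the rest, a sketch along these lines could be used (it assumes the findDuplicates method above and treats the first path of each group as the original):
static void deleteDuplicatesKeepingFirst(Path dir) throws IOException {
    findDuplicates(dir).forEach(group -> {
        // keep group.get(0), delete the remaining identical copies
        for (Path dup : group.subList(1, group.size())) {
            try {
                System.out.println("Deleting " + dup);
                Files.delete(dup);
            } catch (IOException io) {
                throw new UncheckedIOException(io);
            }
        }
    });
}
Run it only after reviewing the printed groups, since deletion cannot be undone.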
I want to use a Stream to parallelize processing of a heterogenous set of remotely stored JSON files of unknown number (the number of files is not known upfront). The files can vary widely in size, from 1 JSON record per file up to 100,000 records in some other files. A JSON record in this case means a self-contained JSON object represented as one line in the file.
I really want to use Streams for this and so I implemented this Spliterator:
public abstract class JsonStreamSpliterator<METADATA, RECORD> extends AbstractSpliterator<RECORD> {
abstract protected JsonStreamSupport<METADATA> openInputStream(String path);
abstract protected RECORD parse(METADATA metadata, Map<String, Object> json);
private static final int ADDITIONAL_CHARACTERISTICS = Spliterator.IMMUTABLE | Spliterator.DISTINCT | Spliterator.NONNULL;
private static final int MAX_BUFFER = 100;
private final Iterator<String> paths;
private JsonStreamSupport<METADATA> reader = null;
public JsonStreamSpliterator(Iterator<String> paths) {
this(Long.MAX_VALUE, ADDITIONAL_CHARACTERISTICS, paths);
}
private JsonStreamSpliterator(long est, int additionalCharacteristics, Iterator<String> paths) {
super(est, additionalCharacteristics);
this.paths = paths;
}
private JsonStreamSpliterator(long est, int additionalCharacteristics, Iterator<String> paths, String nextPath) {
this(est, additionalCharacteristics, paths);
open(nextPath);
}
@Override
public boolean tryAdvance(Consumer<? super RECORD> action) {
if(reader == null) {
String path = takeNextPath();
if(path != null) {
open(path);
}
else {
return false;
}
}
Map<String, Object> json = reader.readJsonLine();
if(json != null) {
RECORD item = parse(reader.getMetadata(), json);
action.accept(item);
return true;
}
else {
reader.close();
reader = null;
return tryAdvance(action);
}
}
private void open(String path) {
reader = openInputStream(path);
}
private String takeNextPath() {
synchronized(paths) {
if(paths.hasNext()) {
return paths.next();
}
}
return null;
}
@Override
public Spliterator<RECORD> trySplit() {
String nextPath = takeNextPath();
if(nextPath != null) {
return new JsonStreamSpliterator<METADATA,RECORD>(Long.MAX_VALUE, ADDITIONAL_CHARACTERISTICS, paths, nextPath) {
@Override
protected JsonStreamSupport<METADATA> openInputStream(String path) {
return JsonStreamSpliterator.this.openInputStream(path);
}
@Override
protected RECORD parse(METADATA metaData, Map<String,Object> json) {
return JsonStreamSpliterator.this.parse(metaData, json);
}
};
}
else {
List<RECORD> records = new ArrayList<RECORD>();
while(tryAdvance(records::add) && records.size() < MAX_BUFFER) {
// loop
}
if(records.size() != 0) {
return records.spliterator();
}
else {
return null;
}
}
}
}
The problem I'm having is that while the Stream parallelizes beautifully at first, eventually the largest file is left processing in a single thread. I believe the proximal cause is well documented: the spliterator is "unbalanced".
More concretely, it appears that the trySplit method is not called after a certain point in the Stream.forEach lifecycle, so the extra logic to distribute small batches at the end of trySplit is rarely executed.
Notice how all the spliterators returned from trySplit share the same paths iterator. I thought this was a really clever way to balance the work across all spliterators, but it hasn't been enough to achieve full parallelism.
I would like the parallel processing to proceed first across files, and then when few large files are still left spliterating, I want to parallelize across chunks of the remaining files. That was the intent of the else block at the end of trySplit.
Is there an easy / simple / canonical way around this problem?
Your trySplit should output splits of equal size, regardless of the size of the underlying files. You should treat all the files as a single unit and fill up the ArrayList-backed spliterator with the same number of JSON objects each time. The number of objects should be such that processing one split takes between 1 and 10 milliseconds: lower than 1 ms and you start approaching the costs of handing off the batch to a worker thread, higher than that and you start risking uneven CPU load due to tasks which are too coarse-grained.
The spliterator is not obliged to report a size estimate, and you are already doing this correctly: your estimate is Long.MAX_VALUE, which is a special value meaning "unbounded". However, if you have many files with a single JSON object, resulting in batches of size 1, this will hurt your performance in two ways: the overhead of opening-reading-closing the file may become a bottleneck and, if you manage to escape that, the cost of thread handoff may be significant compared to the cost of processing one item, again causing a bottleneck.
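As a rough illustration of the fixed-batch idea (a sketch only, with a plain Iterator standing in for your JsonStreamSupport reader), a spliterator whose trySplit always hands out same-sized, ArrayList-backed batches could look like this:
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;

// Sketch: every split is an ArrayList of at most batchSize records,
// so work units stay evenly sized regardless of the source files.
static <T> Spliterator<T> fixedBatchSpliterator(Iterator<T> records, int batchSize) {
    return new Spliterators.AbstractSpliterator<T>(
            Long.MAX_VALUE, Spliterator.IMMUTABLE | Spliterator.NONNULL) {
        @Override
        public boolean tryAdvance(Consumer<? super T> action) {
            synchronized (records) {
                if (!records.hasNext()) {
                    return false;
                }
                action.accept(records.next());
                return true;
            }
        }
        @Override
        public Spliterator<T> trySplit() {
            List<T> batch = new ArrayList<>(batchSize);
            synchronized (records) {
                for (int i = 0; i < batchSize && records.hasNext(); i++) {
                    batch.add(records.next());
                }
            }
            return batch.isEmpty() ? null : batch.spliterator();
        }
    };
}
Here batchSize would be tuned so that processing one batch takes roughly 1 to 10 milliseconds, per the advice above.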
Five years ago I was solving a similar problem, you can have a look at my solution.
After much experimentation, I was still not able to get any added parallelism by playing with the size estimates. Basically, any value other than Long.MAX_VALUE will tend to cause the spliterator to terminate too early (and without any splitting), while on the other hand a Long.MAX_VALUE estimate will cause trySplit to be called relentlessly until it returns null.
The solution I found is to internally share resources among the spliterators and let them rebalance amongst themselves.
Working code:
public class AwsS3LineSpliterator<LINE> extends AbstractSpliterator<AwsS3LineInput<LINE>> {
public final static class AwsS3LineInput<LINE> {
final public S3ObjectSummary s3ObjectSummary;
final public LINE lineItem;
public AwsS3LineInput(S3ObjectSummary s3ObjectSummary, LINE lineItem) {
this.s3ObjectSummary = s3ObjectSummary;
this.lineItem = lineItem;
}
}
private final class InputStreamHandler {
final S3ObjectSummary file;
final InputStream inputStream;
InputStreamHandler(S3ObjectSummary file, InputStream is) {
this.file = file;
this.inputStream = is;
}
}
private final Iterator<S3ObjectSummary> incomingFiles;
private final Function<S3ObjectSummary, InputStream> fileOpener;
private final Function<InputStream, LINE> lineReader;
private final Deque<S3ObjectSummary> unopenedFiles;
private final Deque<InputStreamHandler> openedFiles;
private final Deque<AwsS3LineInput<LINE>> sharedBuffer;
private final int maxBuffer;
private AwsS3LineSpliterator(Iterator<S3ObjectSummary> incomingFiles, Function<S3ObjectSummary, InputStream> fileOpener,
Function<InputStream, LINE> lineReader,
Deque<S3ObjectSummary> unopenedFiles, Deque<InputStreamHandler> openedFiles, Deque<AwsS3LineInput<LINE>> sharedBuffer,
int maxBuffer) {
super(Long.MAX_VALUE, 0);
this.incomingFiles = incomingFiles;
this.fileOpener = fileOpener;
this.lineReader = lineReader;
this.unopenedFiles = unopenedFiles;
this.openedFiles = openedFiles;
this.sharedBuffer = sharedBuffer;
this.maxBuffer = maxBuffer;
}
public AwsS3LineSpliterator(Iterator<S3ObjectSummary> incomingFiles, Function<S3ObjectSummary, InputStream> fileOpener, Function<InputStream, LINE> lineReader, int maxBuffer) {
this(incomingFiles, fileOpener, lineReader, new ConcurrentLinkedDeque<>(), new ConcurrentLinkedDeque<>(), new ArrayDeque<>(maxBuffer), maxBuffer);
}
@Override
public boolean tryAdvance(Consumer<? super AwsS3LineInput<LINE>> action) {
AwsS3LineInput<LINE> lineInput;
synchronized(sharedBuffer) {
lineInput=sharedBuffer.poll();
}
if(lineInput != null) {
action.accept(lineInput);
return true;
}
InputStreamHandler handle = openedFiles.poll();
if(handle == null) {
S3ObjectSummary unopenedFile = unopenedFiles.poll();
if(unopenedFile == null) {
return false;
}
handle = new InputStreamHandler(unopenedFile, fileOpener.apply(unopenedFile));
}
for(int i=0; i < maxBuffer; ++i) {
LINE line = lineReader.apply(handle.inputStream);
if(line != null) {
synchronized(sharedBuffer) {
sharedBuffer.add(new AwsS3LineInput<LINE>(handle.file, line));
}
}
else {
return tryAdvance(action);
}
}
openedFiles.addFirst(handle);
return tryAdvance(action);
}
@Override
public Spliterator<AwsS3LineInput<LINE>> trySplit() {
synchronized(incomingFiles) {
if (incomingFiles.hasNext()) {
unopenedFiles.add(incomingFiles.next());
return new AwsS3LineSpliterator<LINE>(incomingFiles, fileOpener, lineReader, unopenedFiles, openedFiles, sharedBuffer, maxBuffer);
} else {
return null;
}
}
}
}
This is not a direct answer to your question, but I think it is worth trying the Stream in the abacus-common library:
void test_58601518() throws Exception {
final File tempDir = new File("./temp/");
// Prepare the test files:
// if (!(tempDir.exists() && tempDir.isDirectory())) {
// tempDir.mkdirs();
// }
//
// final Random rand = new Random();
// final int fileCount = 1000;
//
// for (int i = 0; i < fileCount; i++) {
// List<String> lines = Stream.repeat(TestUtil.fill(Account.class), rand.nextInt(1000) * 100 + 1).map(it -> N.toJSON(it)).toList();
// IOUtil.writeLines(new File("./temp/_" + i + ".json"), lines);
// }
N.println("Xmx: " + IOUtil.MAX_MEMORY_IN_MB + " MB");
N.println("total file size: " + Stream.listFiles(tempDir).mapToLong(IOUtil::sizeOf).sum() / IOUtil.ONE_MB + " MB");
final AtomicLong counter = new AtomicLong();
final Consumer<Account> yourAction = it -> {
counter.incrementAndGet();
it.toString().replace("a", "bbb");
};
long startTime = System.currentTimeMillis();
Stream.listFiles(tempDir) // the file/data source could be local file system or remote file system.
.parallel(2) // thread number used to load the file/data and convert the lines to Java objects.
.flatMap(f -> Stream.lines(f).map(line -> N.fromJSON(Account.class, line))) // only certain lines (less 1024) will be loaded to memory.
.parallel(8) // thread number used to execute your action.
.forEach(yourAction);
N.println("Took: " + ((System.currentTimeMillis()) - startTime) + " ms" + " to process " + counter + " lines/objects");
// IOUtil.deleteAllIfExists(tempDir);
}
Until the end, the CPU usage on my laptop stays pretty high (about 70%), and it took about 70 seconds to process 51,899,100 lines/objects from 1000 files on an Intel(R) Core(TM) i5-8365U CPU with 256 MB of JVM memory (-Xmx256m). The total file size is about 4524 MB. If yourAction is not a heavy operation, a sequential stream could even be faster than a parallel stream.
FYI: I'm the developer of abacus-common.
List<String> list= jsc.wholeTextFiles(hdfsPath).keys().collect();
for (String string : list) {
System.out.println(string);
}
Here I am getting all the zip files. From here I am unable to work out how to extract each file and store it into an HDFS path under a folder with the same zip name.
You can use something like the code below, but the one thing we need to do is collect with zipFilesRdd.collect().forEach before writing the contents into HDFS. map and flatMap give a "task not serializable" error at this point.
public void readWriteZipContents(String zipLoc,String hdfsBasePath){
JavaSparkContext jsc = new JavaSparkContext(new SparkContext(new SparkConf()));
JavaPairRDD<String, PortableDataStream> zipFilesRdd = jsc.binaryFiles(zipLoc);
zipFilesRdd.collect().forEach(file -> {
ZipInputStream zipStream = new ZipInputStream(file._2.open());
ZipEntry zipEntry = null;
Scanner sc = new Scanner(zipStream);
try {
while ((zipEntry = zipStream.getNextEntry()) != null) {
String entryName = zipEntry.getName();
if (!zipEntry.isDirectory()) {
//create the path in hdfs and write its contents
Configuration configuration = new Configuration();
configuration.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
configuration.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:8020"), configuration);
FSDataOutputStream hdfsfile = fs.create(new Path(hdfsBasePath + "/" + entryName));
while(sc.hasNextLine()){
hdfsfile.writeBytes(sc.nextLine());
}
hdfsfile.flush();
hdfsfile.close();
}
zipStream.closeEntry();
}
} catch (IllegalArgumentException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
sc.close();
//return fileNames.iterator();
});
}
With gzip files, wholeTextFiles should gunzip everything automatically.
With zip files however, the only way I know is to use binaryFiles and to unzip the data by hand.
sc
.binaryFiles(hdfsDir)
.mapValues(x=> {
var result = scala.collection.mutable.ArrayBuffer.empty[String]
val zis = new ZipInputStream(x.open())
var entry : ZipEntry = null
while({entry = zis.getNextEntry();entry} != null) {
val scanner = new Scanner(zis)
while (scanner.hasNextLine()) {result+=scanner.nextLine()}
}
zis.close()
result
})
This gives you a (pair) RDD[String, ArrayBuffer[String]] where the key is the name of the file on hdfs and the value the unzipped content of the zip file (one line per element of the ArrayBuffer). If a given zip file contains more than one file, everything is aggregated. You may adapt the code to fit your exact needs. For instance, flatMapValues instead of mapValues would flatten everything (RDD[String, String]) to take advantage of spark's parallelism.
Note also that in the while condition, the block {entry = zis.getNextEntry(); entry} could be replaced by the assignment expression (entry = zis.getNextEntry()) in Java, because in Java an assignment evaluates to the assigned value. In Scala, however, the result of an assignment is Unit, so that would yield an infinite loop.
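For comparison, here is a minimal Java sketch of the same entry-reading loop, using the assignment expression directly in the while condition (the per-entry Scanner is deliberately not closed, since closing it would close the underlying ZipInputStream):
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

static List<String> readAllZipLines(InputStream in) throws IOException {
    List<String> lines = new ArrayList<>();
    try (ZipInputStream zis = new ZipInputStream(in)) {
        ZipEntry entry;
        while ((entry = zis.getNextEntry()) != null) {   // assignment evaluates to the entry
            if (!entry.isDirectory()) {
                Scanner scanner = new Scanner(zis);
                while (scanner.hasNextLine()) {
                    lines.add(scanner.nextLine());
                }
            }
        }
    }
    return lines;
}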
I came up with this solution, written in Scala.
Tested with spark2 (version 2.3.0.cloudera2), scala (version 2.11.8)
def extractHdfsZipFile(source_zip : String, target_folder : String,
sparksession : SparkSession) : Boolean = {
val hdfs_config = sparksession.sparkContext.hadoopConfiguration
val buffer = new Array[Byte](1024)
/*
.collect -> run on driver only, not able to serialize hdfs Configuration
*/
val zip_files = sparksession.sparkContext.binaryFiles(source_zip).collect.
foreach{ zip_file: (String, PortableDataStream) =>
// iterate over zip_files
val zip_stream : ZipInputStream = new ZipInputStream(zip_file._2.open)
var zip_entry: ZipEntry = null
try {
// iterate over all ZipEntry from ZipInputStream
while ({zip_entry = zip_stream.getNextEntry; zip_entry != null}) {
// skip directory
if (!zip_entry.isDirectory()) {
println(s"Extract File: ${zip_entry.getName()}, with Size: ${zip_entry.getSize()}")
// create new hdfs file
val fs : FileSystem = FileSystem.get(hdfs_config)
val hdfs_file : FSDataOutputStream = fs.create(new Path(target_folder + "/" + zip_entry.getName()))
var len : Int = 0
// copy the current entry until read() returns no more bytes
while({len = zip_stream.read(buffer); len > 0}) {
hdfs_file.write(buffer, 0, len)
}
// flush and close hdfs_file
hdfs_file.flush()
hdfs_file.close()
}
zip_stream.closeEntry()
}
zip_stream.close()
} catch {
case zip : ZipException => {
println(zip.printStackTrace)
println("Please verify that you do not use compresstype9.")
// for DEBUG throw exception
//false
throw zip
}
case e : Exception => {
println(e.printStackTrace)
// for DEBUG throw exception
//false
throw e
}
}
}
true
}
I have a (possibly long) list of binary files that I want to read lazily. There will be too many files to load into memory. I'm currently reading them as a MappedByteBuffer with FileChannel.map(), but that probably isn't required. I want the method readBinaryFiles(...) to return a Java 8 Stream so I can lazy load the list of files as I access them.
public List<FileDataMetaData> readBinaryFiles(
List<File> files,
int numDataPoints,
int dataPacketSize )
throws
IOException {
List<FileDataMetaData> fmdList = new ArrayList<FileDataMetaData>();
IOException lastException = null;
for (File f: files) {
try {
FileDataMetaData fmd = readRawFile(f, numDataPoints, dataPacketSize);
fmdList.add(fmd);
} catch (IOException e) {
logger.error("", e);
lastException = e;
}
}
if (null != lastException)
throw lastException;
return fmdList;
}
// The List<DataPacket> returned will be in the same order as in the file.
public FileDataMetaData readRawFile(File file, int numDataPoints, int dataPacketSize) throws IOException {
FileDataMetaData fmd;
FileChannel fileChannel = null;
try {
fileChannel = new RandomAccessFile(file, "r").getChannel();
long fileSz = fileChannel.size();
ByteBuffer bbRead = ByteBuffer.allocate((int) fileSz);
MappedByteBuffer buffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileSz);
buffer.get(bbRead.array());
List<DataPacket> dataPacketList = new ArrayList<DataPacket>();
while (bbRead.hasRemaining()) {
int channelId = bbRead.getInt();
long timestamp = bbRead.getLong();
int[] data = new int[numDataPoints];
for (int i=0; i<numDataPoints; i++)
data[i] = bbRead.getInt();
DataPacket dp = new DataPacket(channelId, timestamp, data);
dataPacketList.add(dp);
}
fmd = new FileDataMetaData(file.getCanonicalPath(), fileSz, dataPacketList);
} catch (IOException e) {
logger.error("", e);
throw e;
} finally {
if (null != fileChannel) {
try {
fileChannel.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
return fmd;
}
Returning fmdList.stream() from readBinaryFiles(...) won't accomplish this, because the file contents will already have been read into memory, which I won't be able to do.
The other approaches to reading the contents of multiple files as a Stream rely on using Files.lines(), but I need to read binary files.
I'm open to doing this in Scala or Go if those languages have better support for this use case than Java.
I'd appreciate any pointers on how to read the contents of multiple binary files lazily.
There is no laziness possible for the reading within a file, as you are reading the entire file to construct an instance of FileDataMetaData. You would need a substantial refactoring of that class to be able to construct an instance of FileDataMetaData without having to read the entire file.
However, there are several things to clean up in that code, things that apply even to Java 7, never mind Java 8: you don't need the RandomAccessFile detour to open a channel anymore, and there is try-with-resources to ensure proper closing. Note further that your usage of memory mapping makes no sense. When you copy the entire contents into a heap ByteBuffer after mapping the file, there is nothing lazy about it. It's exactly the same as what happens when you call read with a heap ByteBuffer on a channel, except that the JRE can reuse buffers in the read case.
In order to allow the system to manage the pages, you have to read from the mapped byte buffer. Depending on the system, this might still not be better than repeatedly reading small chunks into a heap byte buffer.
public FileDataMetaData readRawFile(
File file, int numDataPoints, int dataPacketSize) throws IOException {
try(FileChannel fileChannel=FileChannel.open(file.toPath(), StandardOpenOption.READ)) {
long fileSz = fileChannel.size();
MappedByteBuffer bbRead=fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileSz);
List<DataPacket> dataPacketList = new ArrayList<>();
while(bbRead.hasRemaining()) {
int channelId = bbRead.getInt();
long timestamp = bbRead.getLong();
int[] data = new int[numDataPoints];
for (int i=0; i<numDataPoints; i++)
data[i] = bbRead.getInt();
dataPacketList.add(new DataPacket(channelId, timestamp, data));
}
return new FileDataMetaData(file.getCanonicalPath(), fileSz, dataPacketList);
} catch (IOException e) {
logger.error("", e);
throw e;
}
}
Building a Stream based on this method is straightforward; only the checked exception has to be handled:
public Stream<FileDataMetaData> readBinaryFiles(
List<File> files, int numDataPoints, int dataPacketSize) throws IOException {
return files.stream().map(f -> {
try {
return readRawFile(f, numDataPoints, dataPacketSize);
} catch (IOException e) {
logger.error("", e);
throw new UncheckedIOException(e);
}
});
}
This should be sufficient:
return files.stream().map(f -> readRawFile(f, numDataPoints, dataPacketSize));
…if, that is, you are willing to remove throws IOException from the readRawFile method’s signature. You could have that method catch IOException internally and wrap it in an UncheckedIOException. (The problem with deferred execution is that the exceptions also need to be deferred.)
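A minimal sketch of that variant, where readRawFileChecked is a hypothetical rename of the original method holding its current body:
public FileDataMetaData readRawFile(File file, int numDataPoints, int dataPacketSize) {
    try {
        return readRawFileChecked(file, numDataPoints, dataPacketSize); // hypothetical: the original method, renamed
    } catch (IOException e) {
        logger.error("", e);
        throw new UncheckedIOException(e);
    }
}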
I don't know how performant this is, but you can use java.io.SequenceInputStream wrapped inside of DataInputStream. This will effectively concatenate your files together. If you create a BufferedInputStream from each file, then the whole thing should be properly buffered.
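A sketch of that idea (the file list and helper name are illustrative, not from the question): each file is opened only when SequenceInputStream asks for the next element, so nothing is loaded up front.
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;

static DataInputStream concatenated(List<File> files) {
    Iterator<File> it = files.iterator();
    Enumeration<InputStream> streams = new Enumeration<InputStream>() {
        @Override
        public boolean hasMoreElements() {
            return it.hasNext();
        }
        @Override
        public InputStream nextElement() {
            try {
                // opened lazily, one file at a time, and buffered
                return new BufferedInputStream(new FileInputStream(it.next()));
            } catch (FileNotFoundException e) {
                throw new UncheckedIOException(e);
            }
        }
    };
    return new DataInputStream(new SequenceInputStream(streams));
}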
Building on VGR's comment, I think his basic solution of:
return files.stream().map(f -> readRawFile(f, numDataPoints, dataPacketSize))
is correct, in that it will lazily process the files (and stop if a short-circuiting terminal operation is invoked on the result of the map() operation). I would also suggest a slightly different implementation of readRawFile that leverages try-with-resources and an InputStream, which will not load the whole file into memory:
public FileDataMetaData readRawFile(File file, int numDataPoints, int dataPacketSize)
    throws DataPacketReadException { // <- custom unchecked exception, nested in the class
  FileDataMetaData results = null;
  try (FileInputStream fileInput = new FileInputStream(file)) {
    String filePath = file.getCanonicalPath();
    long fileSize = fileInput.getChannel().size();
    DataInputStream dataInput = new DataInputStream(new BufferedInputStream(fileInput));
    results = new FileDataMetaData(
        filePath,
        fileSize,
        dataPacketsFrom(dataInput, numDataPoints, dataPacketSize, filePath));
  } catch (IOException e) {
    throw new DataPacketReadException("Unexpected I/O exception on file: " + file, e);
  }
  return results;
}
private List<DataPacket> dataPacketsFrom(DataInputStream dataInput, int numDataPoints, int dataPacketSize, String filePath)
throws DataPacketReadException {
List<DataPacket> packets = new ArrayList<>();
while (dataInput.available() > 0) {
try {
// Logic to assemble DataPacket
}
catch (EOFException e) {
throw new DataPacketReadException("Unexpected EOF on file: " + filePath, e);
}
catch (IOException e) {
throw new DataPacketReadException("Unexpected I/O exception on file: " + filePath, e);
}
}
return packets;
}
This should reduce the amount of code, and make sure that your files get closed on error.
How can I retrieve the size of a folder or file in Java?
java.io.File file = new java.io.File("myfile.txt");
file.length();
This returns the length of the file in bytes, or 0 if the file does not exist. There is no built-in way to get the size of a folder; you are going to have to walk the directory tree recursively (using the listFiles() method of a File object that represents a directory) and accumulate the directory size yourself:
public static long folderSize(File directory) {
long length = 0;
for (File file : directory.listFiles()) {
if (file.isFile())
length += file.length();
else
length += folderSize(file);
}
return length;
}
WARNING: This method is not sufficiently robust for production use. directory.listFiles() may return null and cause a NullPointerException. Also, it doesn't consider symlinks and possibly has other failure modes. Prefer one of the more robust approaches below.
Using the Java 7 NIO API, calculating the folder size can be done a lot quicker.
Here is a ready to run example that is robust and won't throw an exception. It will log directories it can't enter or had trouble traversing. Symlinks are ignored, and concurrent modification of the directory won't cause more trouble than necessary.
/**
* Attempts to calculate the size of a file or directory.
*
* <p>
* Since the operation is non-atomic, the returned value may be inaccurate.
* However, this method is quick and does its best.
*/
public static long size(Path path) {
final AtomicLong size = new AtomicLong(0);
try {
Files.walkFileTree(path, new SimpleFileVisitor<Path>() {
@Override
public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
size.addAndGet(attrs.size());
return FileVisitResult.CONTINUE;
}
@Override
public FileVisitResult visitFileFailed(Path file, IOException exc) {
System.out.println("skipped: " + file + " (" + exc + ")");
// Skip folders that can't be traversed
return FileVisitResult.CONTINUE;
}
@Override
public FileVisitResult postVisitDirectory(Path dir, IOException exc) {
if (exc != null)
System.out.println("had trouble traversing: " + dir + " (" + exc + ")");
// Ignore errors traversing a folder
return FileVisitResult.CONTINUE;
}
});
} catch (IOException e) {
throw new AssertionError("walkFileTree will not throw IOException if the FileVisitor does not");
}
return size.get();
}
You need FileUtils#sizeOfDirectory(File) from commons-io.
Note that you will need to manually check whether the file is a directory as the method throws an exception if a non-directory is passed to it.
WARNING: This method (as of commons-io 2.4) has a bug and may throw IllegalArgumentException if the directory is concurrently modified.
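An illustrative guard for that check might look like this (a sketch only; sizeOfDirectory throws IllegalArgumentException when handed a regular file):
import java.io.File;
import org.apache.commons.io.FileUtils;

static long sizeOfDirOrFile(File f) {
    return f.isDirectory() ? FileUtils.sizeOfDirectory(f) : f.length();
}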
In Java 8:
long size = Files.walk(path).mapToLong( p -> p.toFile().length() ).sum();
It would be nicer to use Files::size in the map step but it throws a checked exception.
UPDATE:
You should also be aware that this can throw an exception if some of the files/folders are not accessible. See this question and another solution using Guava.
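A sketch of the same walk using Files.size, treating entries that cannot be read as zero-length instead of failing (Files.walk itself can still throw an unchecked exception for directories it cannot enter):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

static long treeSize(Path root) throws IOException {
    try (Stream<Path> paths = Files.walk(root)) {
        return paths.filter(Files::isRegularFile)
                    .mapToLong(p -> {
                        try {
                            return Files.size(p);
                        } catch (IOException e) {
                            return 0L; // skip unreadable entries
                        }
                    })
                    .sum();
    }
}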
public static long getFolderSize(File dir) {
long size = 0;
for (File file : dir.listFiles()) {
if (file.isFile()) {
System.out.println(file.getName() + " " + file.length());
size += file.length();
}
else
size += getFolderSize(file);
}
return size;
}
For Java 8 this is one right way to do it:
Files.walk(new File("D:/temp").toPath())
.map(f -> f.toFile())
.filter(f -> f.isFile())
.mapToLong(f -> f.length()).sum()
It is important to filter out all directories, because the length method isn't guaranteed to be 0 for directories.
At least this code delivers the same size information like Windows Explorer itself does.
Here's the best way to get a general File's size (works for directory and non-directory):
public static long getSize(File file) {
long size;
if (file.isDirectory()) {
size = 0;
for (File child : file.listFiles()) {
size += getSize(child);
}
} else {
size = file.length();
}
return size;
}
Edit: Note that this is probably going to be a time-consuming operation. Don't run it on the UI thread.
Also, here (taken from https://stackoverflow.com/a/5599842/1696171) is a nice way to get a user-readable String from the long returned:
public static String getReadableSize(long size) {
if(size <= 0) return "0";
final String[] units = new String[] { "B", "KB", "MB", "GB", "TB" };
int digitGroups = (int) (Math.log10(size)/Math.log10(1024));
return new DecimalFormat("#,##0.#").format(size/Math.pow(1024, digitGroups))
+ " " + units[digitGroups];
}
File.length() (Javadoc).
Note that this doesn't work for directories, or is not guaranteed to work.
For a directory, what do you want? If it's the total size of all files underneath it, you can recursively walk children using File.list() and File.isDirectory() and sum their sizes.
The File object has a length method:
File f = new File("your/file/name");
f.length();
If you want to use Java 8 NIO API, the following program will print the size, in bytes, of the directory it is located in.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;
public class PathSize {
public static void main(String[] args) {
Path path = Paths.get(".");
long size = calculateSize(path);
System.out.println(size);
}
/**
* Returns the size, in bytes, of the specified <tt>path</tt>. If the given
* path is a regular file, trivially its size is returned. Else the path is
* a directory and its contents are recursively explored, returning the
* total sum of all files within the directory.
* <p>
* If an I/O exception occurs, it is suppressed within this method and
* <tt>0</tt> is returned as the size of the specified <tt>path</tt>.
*
* @param path path whose size is to be returned
* @return size of the specified path
*/
public static long calculateSize(Path path) {
try {
if (Files.isRegularFile(path)) {
return Files.size(path);
}
try (Stream<Path> entries = Files.list(path)) {
return entries.mapToLong(PathSize::calculateSize).sum();
}
} catch (IOException e) {
return 0L;
}
}
}
The calculateSize method is universal for Path objects, so it also works for files.
Note that if a file or directory is inaccessible, in this case the returned size of the path object will be 0.
Works for Android and Java
Works for both folders and files
Checks for null pointer everywhere where needed
Ignores symbolic link aka shortcuts
Production ready!
Source code:
public long fileSize(File root) {
if(root == null){
return 0;
}
if(root.isFile()){
return root.length();
}
try {
if(isSymlink(root)){
return 0;
}
} catch (IOException e) {
e.printStackTrace();
return 0;
}
long length = 0;
File[] files = root.listFiles();
if(files == null){
return 0;
}
for (File file : files) {
length += fileSize(file);
}
return length;
}
private static boolean isSymlink(File file) throws IOException {
File canon;
if (file.getParent() == null) {
canon = file;
} else {
File canonDir = file.getParentFile().getCanonicalFile();
canon = new File(canonDir, file.getName());
}
return !canon.getCanonicalFile().equals(canon.getAbsoluteFile());
}
I've tested du -c <folderpath> and it is 2x faster than nio.Files or recursion.
private static long getFolderSize(File folder){
if (folder != null && folder.exists() && folder.canRead()){
try {
Process p = new ProcessBuilder("du","-c",folder.getAbsolutePath()).start();
BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
String total = "";
for (String line; null != (line = r.readLine());)
total = line;
r.close();
p.waitFor();
if (total.length() > 0 && total.endsWith("total"))
return Long.parseLong(total.split("\\s+")[0]) * 1024;
} catch (Exception ex) {
ex.printStackTrace();
}
}
return -1;
}
For Windows, using java.io, this recursive function is useful.
public static long folderSize(File directory) {
long length = 0;
if (directory.isFile())
length += directory.length();
else{
for (File file : directory.listFiles()) {
if (file.isFile())
length += file.length();
else
length += folderSize(file);
}
}
return length;
}
This is tested and working properly on my end.
private static long getFolderSize(Path folder) {
try {
return Files.walk(folder)
.filter(p -> p.toFile().isFile())
.mapToLong(p -> p.toFile().length())
.sum();
} catch (IOException e) {
e.printStackTrace();
return 0L;
    }
}
public long folderSize (String directory)
{
    File curDir = new File(directory);
    long total = 0;
    for (File f : curDir.listFiles())
    {
        long length = 0;
        if (f.isDirectory())
        {
            for (File child : f.listFiles())
            {
                length = length + child.length();
            }
            System.out.println("Directory: " + f.getName() + " " + length + " bytes");
        }
        else
        {
            length = f.length();
            System.out.println("File: " + f.getName() + " " + length + " bytes");
        }
        total = total + length;
    }
    return total;
}
After a lot of research and looking into the different solutions proposed here at Stack Overflow, I finally decided to write my own solution. My purpose is to have a no-throw mechanism, because I don't want to crash if the API is unable to fetch the folder size. This method is not suitable for multi-threaded scenarios.
First of all I want to check for valid directories while traversing down the file system tree.
private static boolean isValidDir(File dir){
    return dir != null && dir.exists() && dir.isDirectory();
}
Second I do not want my recursive call to go into symlinks (softlinks) and include the size in total aggregate.
public static boolean isSymlink(File file) throws IOException {
File canon;
if (file.getParent() == null) {
canon = file;
} else {
canon = new File(file.getParentFile().getCanonicalFile(),
file.getName());
}
return !canon.getCanonicalFile().equals(canon.getAbsoluteFile());
}
Finally, my recursion-based implementation to fetch the size of the specified directory. Notice the null check on dir.listFiles(); according to the Javadoc, this method can return null.
public static long getDirSize(File dir){
if (!isValidDir(dir))
return 0L;
File[] files = dir.listFiles();
//Guard for null pointer exception on files
if (files == null){
return 0L;
}else{
long size = 0L;
for(File file : files){
if (file.isFile()){
size += file.length();
}else{
try{
if (!isSymlink(file)) size += getDirSize(file);
}catch (IOException ioe){
//digest exception
}
}
}
return size;
}
}
As some icing on the cake, here is an API to get the total size of a list of files (which might be all of the files and folders under a root).
public static long getDirSize(List<File> files){
long size = 0L;
for(File file : files){
if (file.isDirectory()){
size += getDirSize(file);
} else {
size += file.length();
}
}
return size;
}
In Linux, if you want to sort directories by size, use du -hs * | sort -h.
You can use Apache Commons IO to find the folder size easily.
If you are on maven, please add the following dependency in your pom.xml file.
<!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>2.6</version>
</dependency>
If not a fan of Maven, download the following jar and add it to the class path.
https://repo1.maven.org/maven2/commons-io/commons-io/2.6/commons-io-2.6.jar
public long getFolderSize() {
File folder = new File("src/test/resources");
long size = FileUtils.sizeOfDirectory(folder);
return size; // in bytes
}
To get file size via Commons IO,
File file = new File("ADD YOUR PATH TO FILE");
long fileSize = FileUtils.sizeOf(file);
System.out.println(fileSize); // bytes
It is also achievable via Google Guava
For Maven, add the following:
<!-- https://mvnrepository.com/artifact/com.google.guava/guava -->
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>28.1-jre</version>
</dependency>
If not using Maven, add the following to class path
https://repo1.maven.org/maven2/com/google/guava/guava/28.1-jre/guava-28.1-jre.jar
public long getFolderSizeViaGuava() {
File folder = new File("src/test/resources");
Iterable<File> files = Files.fileTreeTraverser()
.breadthFirstTraversal(folder);
long size = StreamSupport.stream(files.spliterator(), false)
.filter(f -> f.isFile())
.mapToLong(File::length).sum();
return size;
}
To get file size,
File file = new File("PATH TO YOUR FILE");
long s = file.length();
System.out.println(s);
fun getSize(context: Context, uri: Uri?): Float? {
var fileSize: String? = null
val cursor: Cursor? = context.contentResolver
.query(uri!!, null, null, null, null, null)
try {
if (cursor != null && cursor.moveToFirst()) {
// get file size
val sizeIndex: Int = cursor.getColumnIndex(OpenableColumns.SIZE)
if (!cursor.isNull(sizeIndex)) {
fileSize = cursor.getString(sizeIndex)
}
}
} finally {
cursor?.close()
}
return fileSize!!.toFloat() / (1024 * 1024)
}
I am currently writing a Java application to retrieve BLOB type data from the database and I use a query to get all the data and put them in a List of Map<String, Object> where the columns are stored. When I need to use the data I iterate the list to get the information.
However, I got an OutOfMemoryError when I tried to get the list of rows more than a couple of times. Do I need to release the memory in the code? My code is as follows:
ByteArrayInputStream binaryStream = null;
OutputStream out = null;
try {
List<Map<String, Object>> result =
jdbcOperations.query(
sql,
new Object[] {id},
new RowMapper(){
public Object mapRow(ResultSet rs, int i) throws SQLException {
DefaultLobHandler lobHandler = new DefaultLobHandler();
Map<String, Object> results = new HashMap<String, Object>();
String fileName = rs.getString(ORIGINAL_FILE_NAME);
if (!StringUtils.isBlank(fileName)) {
results.put(ORIGINAL_FILE_NAME, fileName);
}
byte[] blobBytes = lobHandler.getBlobAsBytes(rs, "AttachedFile");
results.put(BLOB, blobBytes);
int entityID = rs.getInt(ENTITY_ID);
results.put(ENTITY_ID, entityID);
return results;
}
}
);
int count = 0;
for (Iterator<Map<String, Object>> iterator = result.iterator();
iterator.hasNext();)
{
count++;
Map<String, Object> row = iterator.next();
byte[] attachment = (byte[])row.get(BLOB);
final int entityID = (Integer)row.get(ENTITY_ID);
if( attachment != null) {
final String originalFilename = (String)row.get(ORIGINAL_FILE_NAME);
String stripFilename;
if (originalFilename.contains(":\\")) {
stripFilename = StringUtils.substringAfter(originalFilename, ":\\");
}
else {
stripFilename = originalFilename;
}
String filename = pathName + entityID + "\\"+ stripFilename;
boolean exist = (new File(filename)).exists();
iterator.remove(); // release the resource
if (!exist) {
binaryStream = new ByteArrayInputStream(attachment);
InputStream extractedStream = null;
try {
extractedStream = decompress(binaryStream);
final byte[] buf = IOUtils.toByteArray(extractedStream);
out = FileUtils.openOutputStream(new File(filename));
IOUtils.write(buf, out);
}
finally {
IOUtils.closeQuietly(extractedStream);
}
}
else {
continue;
}
}
}
}
catch (FileNotFoundException e) {
e.printStackTrace();
}
catch (IOException e) {
e.printStackTrace();
}
finally {
IOUtils.closeQuietly(out);
IOUtils.closeQuietly(binaryStream);
}
Consider reorganizing your code so that you don't keep all the blobs in memory at once. Instead of putting them all in a results map, output each one as you retrieve it.
The advice about expanding your memory settings is good also.
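A rough sketch of that reorganization, assuming the same jdbcOperations, column constants, pathName and decompress(...) helper from the question (names are illustrative, not a drop-in replacement): each row's blob is written to disk inside the callback, so no list of blobs is ever built.
jdbcOperations.query(sql, new Object[] {id}, new RowCallbackHandler() {
    @Override
    public void processRow(ResultSet rs) throws SQLException {
        byte[] attachment = new DefaultLobHandler().getBlobAsBytes(rs, "AttachedFile");
        if (attachment == null) {
            return; // nothing to write for this row
        }
        String originalFilename = rs.getString(ORIGINAL_FILE_NAME);
        int entityID = rs.getInt(ENTITY_ID);
        String stripFilename = originalFilename.contains(":\\")
                ? StringUtils.substringAfter(originalFilename, ":\\")
                : originalFilename;
        File target = new File(pathName + entityID + "\\" + stripFilename);
        if (target.exists()) {
            return; // already extracted
        }
        try (InputStream extracted = decompress(new ByteArrayInputStream(attachment));
             OutputStream out = FileUtils.openOutputStream(target)) {
            IOUtils.copy(extracted, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
});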
There are also command-line parameters you can use for tuning memory, for example:
-Xms128m -Xmx1024m -XX:MaxPermSize=256m
Here's a good link on using JConsole to monitor a Java application:
http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html
Your Java Virtual Machine probably isn't using all the memory it could. You can configure it to get more from the OS (see How can I increase the JVM memory?). That would be a quick and easy fix. If you still run out of memory, look at your algorithm -- do you really need all those BLOBs in memory at once?