I am currently writing a Java application to retrieve BLOB type data from the database and I use a query to get all the data and put them in a List of Map<String, Object> where the columns are stored. When I need to use the data I iterate the list to get the information.
However I got an OutOfMemoryError when I tried to get the list of rows more than a couple times. Do I need to release the memory in the codes? My code is as follows:
ByteArrayInputStream binaryStream = null;
OutputStream out = null;
try {
List<Map<String, Object>> result =
jdbcOperations.query(
sql,
new Object[] {id},
new RowMapper(){
public Object mapRow(ResultSet rs, int i) throws SQLException {
DefaultLobHandler lobHandler = new DefaultLobHandler();
Map<String, Object> results = new HashMap<String, Object>();
String fileName = rs.getString(ORIGINAL_FILE_NAME);
if (!StringUtils.isBlank(fileName)) {
results.put(ORIGINAL_FILE_NAME, fileName);
}
byte[] blobBytes = lobHandler.getBlobAsBytes(rs, "AttachedFile");
results.put(BLOB, blobBytes);
int entityID = rs.getInt(ENTITY_ID);
results.put(ENTITY_ID, entityID);
return results;
}
}
);
int count = 0;
for (Iterator<Map<String, Object>> iterator = result.iterator();
iterator.hasNext();)
{
count++;
Map<String, Object> row = iterator.next();
byte[] attachment = (byte[])row.get(BLOB);
final int entityID = (Integer)row.get(ENTITY_ID);
if( attachment != null) {
final String originalFilename = (String)row.get(ORIGINAL_FILE_NAME);
String stripFilename;
if (originalFilename.contains(":\\")) {
stripFilename = StringUtils.substringAfter(originalFilename, ":\\");
}
else {
stripFilename = originalFilename;
}
String filename = pathName + entityID + "\\"+ stripFilename;
boolean exist = (new File(filename)).exists();
iterator.remove(); // release the resource
if (!exist) {
binaryStream = new ByteArrayInputStream(attachment);
InputStream extractedStream = null;
try {
extractedStream = decompress(binaryStream);
final byte[] buf = IOUtils.toByteArray(extractedStream);
out = FileUtils.openOutputStream(new File(filename));
IOUtils.write(buf, out);
}
finally {
IOUtils.closeQuietly(extractedStream);
}
}
else {
continue;
}
}
}
}
catch (FileNotFoundException e) {
e.printStackTrace();
}
catch (IOException e) {
e.printStackTrace();
}
finally {
IOUtils.closeQuietly(out);
IOUtils.closeQuietly(binaryStream);
}
Consider reorganizing your code so that you don't keep all the blobs in memory at once. Instead of putting them all in a results map, output each one as you retrieve it.
The advice about expanding your memory settings is good also.
You there are also command-line parameters you can use for tuning memory, for example:
-Xms128m -Xmx1024m -XX:MaxPermSize=256m
Here's a good link on using JConsole to monitor a Java application:
http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html
Your Java Virtual Machine probably isn't using all the memory it could. You can configure it to get more from the OS (see How can I increase the JVM memory?). That would be a quick and easy fix. If you still run out of memory, look at your algorithm -- do you really need all those BLOBs in memory at once?
Related
I have a program who read information from database. Sometimes, the message is bigger that expected. So, before send that message to my broker I zip it with this code:
public static byte[] zipBytes(byte[] input) throws IOException {
ByteArrayOutputStream bos = new ByteArrayOutputStream(input.length);
OutputStream ej = new DeflaterOutputStream(bos);
ej.write(input);
ej.close();
bos.close();
return bos.toByteArray();
}
Recently, i retrieved a 80M message from DB and when execute my code above just OutOfMemory throw on the "ByteArrayOutputStream" line. My java program just have 512 of memory to do all process and cant give it more memory.
How can I solve this?
This is no a duplicate question. I cant increase heap size memory.
EDIT:
This is flow of my java code:
rs = stmt.executeQuery("SELECT data FROM tbl_alotofinfo"); //rs is a valid resulset
while(rs.next()){
byte[] bt = rs.getBytes("data");
if(bt.length > 60) { //only rows with content > 60. I need to know the size of message, so if I use rs.getBinaryStream I cannot know the size, can I?
if(bt.length >= 10000000){
//here I need to zip the bytearray before send it, so
bt = zipBytes(bt); //this method is above
// then publish bt to my broker
} else {
//here publish byte array to my broker
}
}
}
EDIT
Ive tried with PipedInputStream and the memory that process consume is same as zipBytes(byte[] input) method.
private InputStream zipInputStream(InputStream in) throws IOException {
PipedInputStream zipped = new PipedInputStream();
PipedOutputStream pipe = new PipedOutputStream(zipped);
new Thread(
() -> {
try(OutputStream zipper = new DeflaterOutputStream(pipe)){
IOUtils.copy(in, zipper);
zipper.flush();
} catch (IOException e) {
IOUtils.closeQuietly(zipped); // close it on error case only
e.printStackTrace();
} finally {
IOUtils.closeQuietly(in);
IOUtils.closeQuietly(zipped);
IOUtils.closeQuietly(pipe);
}
}
).start();
return zipped;
}
How can I compress by Deflate my InputStream?
EDIT
That information is sent to JMS in Universal Messaging Server by Software AG. This use a Nirvana client documentation: https://documentation.softwareag.com/onlinehelp/Rohan/num10-2/10-2_UM_webhelp/um-webhelp/Doc/java/classcom_1_1pcbsys_1_1nirvana_1_1client_1_1n_consume_event.html and data is saved in nConsumeEvent objects and the documentation show us only 2 ways to send that information:
nConsumeEvent (String tag, byte[] data)
nConsumeEvent (String tag, Document adom)
https://documentation.softwareag.com/onlinehelp/Rohan/num10-5/10-5_UM_webhelp/index.html#page/um-webhelp%2Fco-publish_3.html%23
The code for connection is:
nSessionAttributes nsa = new nSessionAttributes("nsp://127.0.0.1:9000");
MyReconnectHandler rhandler = new MyReconnectHandler();
nSession mySession = nSessionFactory.create(nsa, rhandler);
if(!mySession.isConnected()){
mySession.init();
}
nChannelAttributes chaAtt = new nChannelAttributes();
chaAttr.setName("mychannel"); //This is a topic
nChannel myChannel = mySession.findChannel(chaAtt);
List<nConsumeEvent> messages = new ArrayList<nConsumeEvent>();
rs = stmt.executeQuery("SELECT data FROM tbl_alotofinfo");
while(rs.next){
byte[] bt = rs.getBytes("data");
if(bt.length > 60){
nEventProperties prop = new nEventProperties();
if(bt.length > 10000000){
bt = compressData(bt); //here a need compress data without ByteArrayInputStream
prop.put("isZip", "true");
nConsumeEvent ncon = new nconsumeEvent("1",bt);
ncon.setProperties(prop);
messages.add(ncon);
} else {
prop.put("isZip", "false");
nConsumeEvent ncon = new nconsumeEvent("1",bt);
ncon.setProperties(prop);
messages.add(ncon);
}
}
ntransactionAttributes tatrib = new nTransactionAttributes(myChannel);
nTransaction myTransaction = nTransactionFactory.create(tattrib);
Vector<nConsumeEvent> m = new Vector<nConsumeEvent>(messages);
myTransaction.publish(m);
myTransaction.commit();
}
Because API exection, to the end of the day I need send the information in byte array, but if this is the only one byte array in my code is OK. How can I compress the byte array or InputStream with rs.getBinaryStream() in this implementation?
EDIT
The database server used is PostgreSQL v11.6
EDIT
I've applied the first one solution of #VGR and works fine.
Only one thing, SELECT query is in a while(true) like:
while(true){
rs = stmt.executeQuery("SELECT data FROM tbl_alotofinfo"); //rs is a valid resulset
// all that implementation you know for this entire post
Thread.sleep(10000);
}
So, a SELECT is execute each 3 second.
But I've done a test with my program running and the memory just increase in each process. Why? If the information that database return is same in each request, should not the memory keep like first request? Or maybe I forgot close a stream?
while(true){
rs = stmt.executeQuery("SELECT data FROM tbl_alotofinfo"); //rs is a valid resulset
while(rs.next()) {
//byte[] bt = rs.getBytes("data");
byte[] bt;
try (BufferedInputStream source = new BufferedInputStream(
rs.getBinaryStream("data"), 10_000_001)) {
source.mark(this.zip+1);
boolean sendToBroker = true;
boolean needCompression = true;
for (int i = 0; i <= 10_000_000; i++) {
if (source.read() < 0) {
sendToBroker = (i > 60);
needCompression = (i >= this.zip);
break;
}
}
if (sendToBroker) {
nEventProperties prop = new nEventProperties();
// Rewind stream
source.reset();
if (needCompression) {
System.out.println("Tamaño del mensaje mayor al esperado. Comprimiendo mensaje");
ByteArrayOutputStream byteStream = new ByteArrayOutputStream();
try (OutputStream brokerStream = new DeflaterOutputStream(byteStream)) {
IOUtils.copy(source, brokerStream);
}
bt = byteStream.toByteArray();
prop.put("zip", "true");
} else {
bt = IOUtils.toByteArray(source);
}
System.out.println("size: "+bt.length);
prop.put("host", this.host);
nConsumeEvent ncon = new nConsumeEvent(""+rs.getInt("xid"), bt);
ncon.setProperties(prop);
messages.add(ncon);
}
}
}
}
For example, this is the heap memory in two times. First one memory use above 500MB and second one (with the same information of database) used above 1000MB
rs.getBytes("data") reads the entire 80 megabytes into memory at once. In general, if you are reading a potentially large amount of data, you don’t want to try to keep it all in memory.
The solution is to use getBinaryStream instead.
Since you need to know whether the total size is larger than 10,000,000 bytes, you have two choices:
Use a BufferedInputStream with a buffer of at least that size, which will allow you to use mark and reset in order to “rewind” the InputStream.
Read the data size as part of your query. You may be able to do this by using a Blob or using a function like LENGTH.
The first approach will use up 10 megabytes of program memory for the buffer, but that’s better than hundreds of megabytes:
while (rs.next()) {
try (BufferedInputStream source = new BufferedInputStream(
rs.getBinaryStream("data"), 10_000_001)) {
source.mark(10_000_001);
boolean sendToBroker = true;
boolean needCompression = true;
for (int i = 0; i <= 10_000_000; i++) {
if (source.read() < 0) {
sendToBroker = (i > 60);
needCompression = (i >= 10_000_000);
break;
}
}
if (sendToBroker) {
// Rewind stream
source.reset();
if (needCompression) {
try (OutputStream brokerStream =
new DeflaterOutputStream(broker.getOutputStream())) {
source.transferTo(brokerStream);
}
} else {
try (OutputStream brokerStream =
broker.getOutputStream())) {
source.transferTo(brokerStream);
}
}
}
}
}
Notice that no byte arrays and no ByteArrayOutputStreams are used. The actual data is not kept in memory, except for the 10 megabyte buffer.
The second approach is shorter, but I’m not sure how portable it is across databases:
while (rs.next()) {
Blob data = rs.getBlob("data");
long length = data.length();
if (length > 60) {
try (InputStream source = data.getBinaryStream()) {
if (length >= 10000000) {
try (OutputStream brokerStream =
new DeflaterOutputStream(broker.getOutputStream())) {
source.transferTo(brokerStream);
}
} else {
try (OutputStream brokerStream =
broker.getOutputStream())) {
source.transferTo(brokerStream);
}
}
}
}
}
Both approaches assume there is some API available for your “broker” which allows the data to be written to an OutputStream. I’ve assumed, for the sake of example, that it’s broker.getOutputStream().
Update
It appears you are required to create nConsumeEvent objects, and that class only allows its data to be specified as a byte array in its constructors.
That byte array is unavoidable, obviously. And since there is no way to know the exact number of bytes a compressed version will require, a ByteArrayOutputStream is also unavoidable. (It’s possible to avoid using that class, but the replacement would be essentially a reimplementation of ByteArrayOutputStream.)
But you can still read the data as an InputStream in order to reduce your memory usage. And when you aren’t compressing, you can still avoid ByteArrayOutputStream, thereby creating only one additional byte array.
So, instead of this, which is not possible for nConsumeEvent:
if (needCompression) {
try (OutputStream brokerStream =
new DeflaterOutputStream(broker.getOutputStream())) {
source.transferTo(brokerStream);
}
} else {
try (OutputStream brokerStream =
broker.getOutputStream())) {
source.transferTo(brokerStream);
}
}
It should be:
byte[] bt;
if (needCompression) {
ByteArrayOutputStream byteStream = new ByteArrayOutputStream();
try (OutputStream brokerStream = new DeflaterOutputStream(byteStream)) {
source.transferTo(byteStream);
}
bt = byteStream.toByteArray();
} else {
bt = source.readAllBytes();
}
prop.put("isZip", String.valueOf(needCompression));
nConsumeEvent ncon = new nconsumeEvent("1", bt);
ncon.setProperties(prop);
messages.add(ncon);
Similarly, the second example should replace this:
if (length >= 10000000) {
try (OutputStream brokerStream =
new DeflaterOutputStream(broker.getOutputStream())) {
source.transferTo(brokerStream);
}
} else {
try (OutputStream brokerStream =
broker.getOutputStream())) {
source.transferTo(brokerStream);
}
}
with this:
byte[] bt;
if (length >= 10000000) {
ByteArrayOutputStream byteStream = new ByteArrayOutputStream();
try (OutputStream brokerStream = new DeflaterOutputStream(byteStream)) {
source.transferTo(byteStream);
}
bt = byteStream.toByteArray();
prop.put("isZip", "true");
} else {
bt = source.readAllBytes();
prop.put("isZip", "false");
}
nConsumeEvent ncon = new nconsumeEvent("1", bt);
ncon.setProperties(prop);
messages.add(ncon);
How can I save this kind of map to file? (it should work for an android device too)
I tried:
Properties properties = new Properties();
for (Map.Entry<String, Object[]> entry : map.entrySet()) {
properties.put(entry.getKey(), entry.getValue());
}
try {
properties.store(new FileOutputStream(context.getFilesDir() + MainActivity.FileName), null);
} catch (IOException e) {
e.printStackTrace();
}
And I get:
class java.util.ArrayList cannot be cast to class java.lang.String (java.util.ArrayList and java.lang.String are in module java.base of loader 'bootstrap')
What should I do?
I was writing an answer based on String values serialization when I realized for your error that perhaps some value can be an ArrayList... I honestly don't fully understand the reasoning behind the error (of course, it is a cast, but I don't understand the java.util.ArrayList part)...
In any case, the problem is happening when you try storing your properties and it tries to convert your Object[] to String for saving.
In my original answer I suggested you to manually join your values when generating your file. It is straightforward with the join method of the String class:
Properties properties = new Properties();
for (String key : map.keySet()) {
Object[] values = map.get(key);
// Perform null checking
String value = String.join(",", values);
properties.put(key, value);
}
try {
properties.store(new FileOutputStream(context.getFilesDir() + MainActivity.FileName), null);
} catch (IOException e) {
e.printStackTrace();
}
For reading your values, you can use split:
Properties properties = new Properties();
Map<String, String> map = new HashMap<>();
InputStream in = null;
try {
in = new FileInputStream(context.getFilesDir() + MainActivity.FileName);
properties.load(in);
for (String key : properties.stringPropertyNames()) {
String value = properties.getProperty(k);
// Perform null checking
String[] values = value.split(",");
map.put(key, value);
}
} catch (Throwable t) {
t.printStackTrace();
} finally {
if (in != null) {
try {
in.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
But I think you have a much better approach: please, just use the Java builtin serialization mechanisms to save and restore your information.
For saving your map use ObjectOutputStream:
try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(context.getFilesDir() + MainActivity.FileName))){
oos.writeObject(map);
}
You can read your map back as follows:
Map<String, Object> map;
try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream(context.getFilesDir() + MainActivity.FileName))){
map = (Map)ois.readObject();
}
If all the objects stored in your map are Serializables this second approach is far more flexible and it is not limited to String values like the first one.
How could I write a code to be able to delete exactly the duplicates
that I get previously with this code.?? please be specific when
answering as I am new to java.I have very basic knowledge of java.
private static MessageDigest messageDigest;
static {
try {
messageDigest = MessageDigest.getInstance("SHA-512");
} catch (NoSuchAlgorithmException e) {
throw new RuntimeException("cannot initialize SHA-512 hash function", e);
}
}
public static void findDuplicatedFiles(Map<String, List<String>> lists, File directory) {
for (File child : directory.listFiles()) {
if (child.isDirectory()) {
findDuplicatedFiles(lists, child);
} else {
try {
FileInputStream fileInput = new FileInputStream(child);
byte fileData[] = new byte[(int) child.length()];
fileInput.read(data);
fileInput.close();
String uniqueFileHash = new BigInteger(1, md.digest(fileData)).toString(16);
List<String> list = lists.get(uniqueFileHash);
if (list == null) {
list = new LinkedList<String>();
lists.put(uniqueFileHash, list);
}
list.add(child.getAbsolutePath());
} catch (IOException e) {
throw new RuntimeException("cannot read file " + child.getAbsolutePath(), e);
}
}
}
}
Map<String, List<String>> lists = new HashMap<String, List<String>>();
FindDuplicates.findDuplicateFiles(lists, dir);
for (List<String> list : lists.values()) {
if (list.size() > 1) {
System.out.println("\n");
for (String file : list) {
System.out.println(file);
}
}
}
System.out.println("\n");
Do not read the entire contents of the file into memory. The whole point of an InputStream is that you can read small, manageable chunks of data, so you don’t have to use a great deal of memory.
Imagine if you were trying to check a file that’s one gigabyte in size. By creating a byte array to hold the entire content, you have forced your program to use a gigabyte of RAM. (If the file were two gigabytes or larger, you wouldn’t be able to allocate the byte array at all, since an array may not have more than 2³¹-1 elements.)
The easiest way to compute the hash of a file’s contents is to copy the file to a DigestOutputStream, which is an OutputStream that makes use of an existing MessageDigest:
messageDigest.reset();
try (DigestOutputStream stream = new DigestOutputStream(
OutputStream.nullOutputStream(), messageDigest)) {
Files.copy(child.toPath(), stream);
}
String uniqueFileHash = new BigInteger(1, messageDigest.digest());
Scanning directories is easier with NIO Path / Files class because it avoids awkward recursion of File class, and it is much quicker for deeper directory trees.
Here is an example scanner which returns a Stream of duplicates - that is where each item in the stream is a List<Path> - a group of TWO or more identical files.
// Scan a directory and returns Stream of List<Path> where each list has 2 or more duplicates
static Stream<List<Path>> findDuplicates(Path dir) throws IOException {
Map<Long, List<Path>> candidates = new HashMap<>();
BiPredicate<Path, BasicFileAttributes> biPredicate = (p,a)->a.isRegularFile()
&& candidates.computeIfAbsent(Long.valueOf(a.size())
, k -> new ArrayList<>()).add(p);
try(var stream = Files.find(dir, Integer.MAX_VALUE, biPredicate)) {
stream.count();
}
Predicate<? super List<Path>> twoOrMore = paths -> paths.size() > 1;
return candidates.values().stream()
.filter(twoOrMore)
.flatMap(Duplicate::duplicateChecker)
.filter(twoOrMore);
}
The above code starts by collating candidates of same file size, then uses a flatMap operation to compare all those candidates to get the exact matches where the files are identical in each List<Path>:
// Checks possible list of duplicates, and returns stream of definite duplicates
private static Stream<List<Path>> duplicateChecker(List<Path> sameLenPaths) {
List<List<Path>> groups = new ArrayList<>();
try {
for (Path p : sameLenPaths) {
List<Path> match = null;
for (List<Path> g : groups) {
Path prev = g.get(0);
if(Files.mismatch(prev, p) < 0) {
match = g;
break;
}
}
if (match == null)
groups.add(match = new ArrayList<>());
match.add(p);
}
} catch(IOException io) {
throw new UncheckedIOException(io);
}
return groups.stream();
}
Finally an example launcher:
public static void main(String[] args) throws IOException {
Path dir = Path.of(args[0]);
Stream<List<Path>> duplicates = findDuplicates(dir);
long count = duplicates.peek(System.out::println).count();
System.out.println("Found "+count+" groups of duplicate files in: "+dir);
}
You will need to process list of duplicate files using Files.delete - I've not added Files.delete at the end so that you can check the results before deciding to delete them.
// findDuplicates(dir).flatMap(List::stream).forEach(dup -> {
// try {
// Files.delete(dup);
// } catch(IOException io) {
// throw new UncheckedIOException(io);
// }
// });
I am getting a fortify finding for "Unreleased resource stream" on the code below.
Resource[] l_objResource = resourceLoader.getResources(configErrorCode);
Properties l_objProperty = null;
for (int i = 0; i < l_objResource.length; i++) {
l_objProperty = new Properties();
l_objProperty.load(l_objResource[i].getInputStream());
}
The function loadErrorCode() in BaseErrorParser.java sometimes fails to release a system resource allocated by getInputStream();
Can anyone explain the finding or help fix the issue?
From the comment below, but the context is not clear (JW):
ObjectInputStream l_objObjInputStream = null;
Map l_mapRet = null;
try {
l_objObjInputStream = new ObjectInputStream(new FileInputStream(p_objFilename));
Object l_objTemp = l_objObjInputStream.readObject();
l_mapRet = (Map) l_objTemp;
} finally {
if (l_objObjInputStream != null) {
l_objObjInputStream.close();
}
}
You are not closing the input stream which is opened by below line of code
l_objResource[i].getInputStream();
Usually fortify scanner reports Unreleased resource stream issue if there are any input or out streams which are opened but not closed after their usage. The ideal way to deal with these issues is to close all the opened streams in finally block so that even during exception scenarios they won't create any issues.
You can have a try - finally block around the code and close the stream as below.
Resource[] l_objResource = resourceLoader.getResources(configErrorCode);
Properties l_objProperty = null;
InputStream is = null;
for (int i = 0; i < l_objResource.length; i++) {
l_objProperty = new Properties();
try {
is = l_objResource[i].getInputStream();
l_objProperty.load(is);
} finally {
if(is!=null) {
is.close();
}
}
}
Please check if it works in your case.
You can use Try with resource here. This will automatically close your stream.
Map l_mapRet = null;
try (ObjectInputStream l_objObjInputStream = new ObjectInputStream(new FileInputStream(p_objFilename))){
Object l_objTemp = l_objObjInputStream.readObject();
l_mapRet = (Map) l_objTemp;
} Catch(IOException E){
// Handle exception
}
I have a (possibly long) list of binary files that I want to read lazily. There will be too many files to load into memory. I'm currently reading them as a MappedByteBuffer with FileChannel.map(), but that probably isn't required. I want the method readBinaryFiles(...) to return a Java 8 Stream so I can lazy load the list of files as I access them.
public List<FileDataMetaData> readBinaryFiles(
List<File> files,
int numDataPoints,
int dataPacketSize )
throws
IOException {
List<FileDataMetaData> fmdList = new ArrayList<FileDataMetaData>();
IOException lastException = null;
for (File f: files) {
try {
FileDataMetaData fmd = readRawFile(f, numDataPoints, dataPacketSize);
fmdList.add(fmd);
} catch (IOException e) {
logger.error("", e);
lastException = e;
}
}
if (null != lastException)
throw lastException;
return fmdList;
}
// The List<DataPacket> returned will be in the same order as in the file.
public FileDataMetaData readRawFile(File file, int numDataPoints, int dataPacketSize) throws IOException {
FileDataMetaData fmd;
FileChannel fileChannel = null;
try {
fileChannel = new RandomAccessFile(file, "r").getChannel();
long fileSz = fileChannel.size();
ByteBuffer bbRead = ByteBuffer.allocate((int) fileSz);
MappedByteBuffer buffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileSz);
buffer.get(bbRead.array());
List<DataPacket> dataPacketList = new ArrayList<DataPacket>();
while (bbRead.hasRemaining()) {
int channelId = bbRead.getInt();
long timestamp = bbRead.getLong();
int[] data = new int[numDataPoints];
for (int i=0; i<numDataPoints; i++)
data[i] = bbRead.getInt();
DataPacket dp = new DataPacket(channelId, timestamp, data);
dataPacketList.add(dp);
}
fmd = new FileDataMetaData(file.getCanonicalPath(), fileSz, dataPacketList);
} catch (IOException e) {
logger.error("", e);
throw e;
} finally {
if (null != fileChannel) {
try {
fileChannel.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
return fmd;
}
Returning fmdList.Stream() from readBinaryFiles(...) won't accomplish this because the file contents will already have been read into memory, which I won't be able to do.
The other approaches to reading the contents of multiple files as a Stream rely on using Files.lines(), but I need to read binary files.
I'm, open to doing this in Scala or golang if those languages have better support for this use case than Java.
I'd appreciate any pointers on how to read the contents of multiple binary files lazily.
There is no laziness possible for the reading within the a file as you are reading the entire file for constructing an instance of FileDataMetaData. You would need a substantial refactoring of that class to be able to construct an instance of FileDataMetaData without having to read the entire file.
However, there are several things to clean up in that code, even specific to Java 7 rather than Java 8, i.e you don’t need a RandomAccessFile detour to open a channel anymore and there is try-with-resources to ensure proper closing. Note further that you usage of memory mapping makes no sense. When copy the entire contents into a heap ByteBuffer after mapping the file, there is nothing lazy about it. It’s exactly the same what happens, when call read with a heap ByteBuffer on a channel, except that the JRE can reuse buffers in the read case.
In order to allow the system to manage the pages, you have to read from the mapped byte buffer. Depending on the system, this might still not be better than repeatedly reading small chunks into a heap byte buffer.
public FileDataMetaData readRawFile(
File file, int numDataPoints, int dataPacketSize) throws IOException {
try(FileChannel fileChannel=FileChannel.open(file.toPath(), StandardOpenOption.READ)) {
long fileSz = fileChannel.size();
MappedByteBuffer bbRead=fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileSz);
List<DataPacket> dataPacketList = new ArrayList<>();
while(bbRead.hasRemaining()) {
int channelId = bbRead.getInt();
long timestamp = bbRead.getLong();
int[] data = new int[numDataPoints];
for (int i=0; i<numDataPoints; i++)
data[i] = bbRead.getInt();
dataPacketList.add(new DataPacket(channelId, timestamp, data));
}
return new FileDataMetaData(file.getCanonicalPath(), fileSz, dataPacketList);
} catch (IOException e) {
logger.error("", e);
throw e;
}
}
Building a Stream based on this method is straight-forward, only the checked exception has to be handled:
public Stream<FileDataMetaData> readBinaryFiles(
List<File> files, int numDataPoints, int dataPacketSize) throws IOException {
return files.stream().map(f -> {
try {
return readRawFile(f, numDataPoints, dataPacketSize);
} catch (IOException e) {
logger.error("", e);
throw new UncheckedIOException(e);
}
});
}
This should be sufficient:
return files.stream().map(f -> readRawFile(f, numDataPoints, dataPacketSize));
…if, that is, you are willing to remove throws IOException from the readRawFile method’s signature. You could have that method catch IOException internally and wrap it in an UncheckedIOException. (The problem with deferred execution is that the exceptions also need to be deferred.)
I don't know how performant this is, but you can use java.io.SequenceInputStream wrapped inside of DataInputStream. This will effectively concatenate your files together. If you create a BufferedInputStream from each file, then the whole thing should be properly buffered.
Building on VGR's comment, I think his basic solution of:
return files.stream().map(f -> readRawFile(f, numDataPoints, dataPacketSize))
is correct, in that it will lazily process the files (and stop if a short-circuiting terminal action is invoked off the result of the map() operation. I would also suggest a slightly different to the implementation of readRawFile that leverages try with resources and InputStream, which will not load the whole file into memory:
public FileDataMetaData readRawFile(File file, int numDataPoints, int dataPacketSize)
throws DataPacketReadException { // <- Custom unchecked exception, nested for class
FileDataMetadata results = null;
try (FileInputStream fileInput = new FileInputStream(file)) {
String filePath = file.getCanonicalPath();
long fileSize = fileInput.getChannel().size()
DataInputStream dataInput = new DataInputStream(new BufferedInputStream(fileInput);
results = new FileDataMetadata(
filePath,
fileSize,
dataPacketsFrom(dataInput, numDataPoints, dataPacketSize, filePath);
}
return results;
}
private List<DataPacket> dataPacketsFrom(DataInputStream dataInput, int numDataPoints, int dataPacketSize, String filePath)
throws DataPacketReadException {
List<DataPacket> packets = new
while (dataInput.available() > 0) {
try {
// Logic to assemble DataPacket
}
catch (EOFException e) {
throw new DataPacketReadException("Unexpected EOF on file: " + filePath, e);
}
catch (IOException e) {
throw new DataPacketReadException("Unexpected I/O exception on file: " + filePath, e);
}
}
return packets;
}
This should reduce the amount of code, and make sure that your files get closed on error.