Broad discussion question.
Are there any libraries already which allow me to store the state of execution of my application in Java?
E.g. I have an application which processes files, but the application may be forced to shut down suddenly at some point. I want to store information on which files have been processed and which have not, and what stage the processing was at for the ongoing processes.
Are there already any libraries which abstract this functionality, or would I have to implement it from scratch?
It seems like what you are looking for is serialization, which can be performed with the Java Serialization API.
You can write even less code if you decide to use known libraries such as Apache Commons Lang and its SerializationUtils class, which itself is built on top of the Java Serialization API.
Using the latter, serializing/deserializing your application state to a file takes only a few lines.
The only thing you have to do is create a class holding your application state, let's call it... ApplicationState :-) It could look like this:
import java.io.Serializable;
import java.util.List;

// The class must implement Serializable for SerializationUtils to accept it
class ApplicationState implements Serializable {

    enum ProcessState {
        READ_DONE,
        PROCESSING_STARTED,
        PROCESSING_ENDED,
        ANOTHER_STATE;
    }

    private List<String> filesDone, filesToDo;
    private String currentlyProcessingFile;
    private ProcessState currentProcessState;
}
With such a structure, and using SerializationUtils, serializing is done the following way:
try {
    ApplicationState state = new ApplicationState();
    ...
    // File to serialize the object to
    String fileName = "applicationState.ser";
    // New file output stream for the file
    FileOutputStream fos = new FileOutputStream(fileName);
    // Serialize the application state
    SerializationUtils.serialize(state, fos);
    fos.close();
    // Open a FileInputStream to the file
    FileInputStream fis = new FileInputStream(fileName);
    // Deserialize and cast back into ApplicationState
    ApplicationState restored = (ApplicationState) SerializationUtils.deserialize(fis);
    System.out.println(restored);
    fis.close();
} catch (Exception e) {
    e.printStackTrace();
}
It sounds like the Java Preferences API might be a good option for you. This can store user/system settings with minimal effort on your part and you can update/retrieve at any time.
https://docs.oracle.com/javase/8/docs/technotes/guides/preferences/index.html
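For illustration, a minimal sketch of recording processing progress with the Preferences API could look like this (the FileProcessor class and the key names are made up for the example):

import java.util.prefs.Preferences;

// store the state under the user node for your own class
Preferences prefs = Preferences.userNodeForPackage(FileProcessor.class);
prefs.put("currentFile", "input-003.xml");
prefs.put("stage", "PROCESSING_STARTED");

// after a restart, read it back (the second argument is the default value)
String file = prefs.get("currentFile", null);
String stage = prefs.get("stage", "NOT_STARTED");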
It's pretty simple to make from scratch. You could follow this:
Have a DB (or just a file) that stores the information of processing progress. Something like:
Id|fileName|status|metadata
As soon as you start processing a file, make an entry in this table and mark its status as PROCESSING. Then you can store intermediate states, and finally, when you're done, set the status to DONE. This way, on restart, you would know which files have been processed, which were still in progress when the process shut down or crashed, and (obviously) where to start.
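A minimal JDBC sketch of that idea (the table layout, the H2 connection URL and the processFile method are illustrative assumptions) could look like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

try (Connection con = DriverManager.getConnection("jdbc:h2:./progress")) {
    // record that processing of this file has started
    try (PreparedStatement ps = con.prepareStatement(
            "INSERT INTO progress (fileName, status) VALUES (?, 'PROCESSING')")) {
        ps.setString(1, fileName);
        ps.executeUpdate();
    }

    processFile(fileName); // the actual work

    // mark the file as finished
    try (PreparedStatement ps = con.prepareStatement(
            "UPDATE progress SET status = 'DONE' WHERE fileName = ?")) {
        ps.setString(1, fileName);
        ps.executeUpdate();
    }
}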
In large enterprise environments where applications are loosely coupled (and there is no guarantee that an application will be available or won't crash), we use message queues to do much the same thing and ensure a reliable architecture.
There are almost too many ways to mention. I would choose the option you believe is simplest.
You can use:
a file to record what is done (and what is to be done)
a persistent queue on JMS (which supports multiple processes, even on different machines)
an embedded or remote database.
An approach I rave about is using memory mapped files. A nice feature is that information is not lost if the application dies or is killed (provided the OS doesn't crash) which means you don't have to flush it, nor worry about losing data if you don't.
This works because the data is partly managed by the OS which means it uses little heap (even for TB of data) and the OS deals with loading and flushing to disk making it much faster (and making sizes much larger than your main memory practical).
BTW: This approach works even with a kill -9 as the OS flushes the data to disk. To test this I use Unsafe.getByte(0) which crashes the application with a SEG fault immediately after making a change (as in the next machine code instruction) and it still writes the change to disk.
This won't survive pulling the power, although the window for losing data is very short. You can force the data to disk before continuing, but I don't know how you can test that this really works. ;)
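For illustration, a minimal sketch of writing through a memory mapped file with plain NIO (file name and mapping size chosen arbitrarily) could be:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

try (RandomAccessFile raf = new RandomAccessFile("state.dat", "rw");
     FileChannel channel = raf.getChannel()) {
    // map the first 4 KB of the file into memory
    MappedByteBuffer map = channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
    map.putInt(0, 42); // survives a kill -9, the OS still flushes the dirty page
    map.force();       // optionally force the change to disk before continuing
}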
I have a library which could make memory mapped files easier to use
https://github.com/peter-lawrey/Java-Chronicle
It's not a long read and you can use it as an example.
Apache Commons Configuration API: http://commons.apache.org/proper/commons-configuration/userguide/howto_filebased.html#File-based_Configurations
Recording allows you to save recorded data to a file using the dump() method (as shown in the example below, copied from the javadoc). Is there a way to get such recorded data in the form of a byte[]?
Preferably without external dependencies. I'm aware that I could implement a custom in-memory FileSystem.
Configuration c = Configuration.getConfiguration("default");
Recording r = new Recording(c);
r.start();
System.gc();
Thread.sleep(5000);
r.stop();
r.dump(Files.createTempFile("my-recording", ".jfr"));
Are you looking for getStream? That's the only API on Recording that might possibly suggest a way to do it. Other than that, you'll pretty much have to use dump to a Path, either to an in-memory filesystem or maybe even to an actual temporary file.
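For example, a sketch of the temporary-file workaround, continuing the Recording r from the question's snippet:

Path tmp = Files.createTempFile("my-recording", ".jfr");
try {
    r.dump(tmp);                           // write the recording to the temp file
    byte[] data = Files.readAllBytes(tmp); // load it back as a byte[]
    // ... use data ...
} finally {
    Files.deleteIfExists(tmp);             // clean up the temporary file
}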
I'm developing a small web app whose servlets periodically access a shared resource, which is a simple text file on the server side holding some lines of mutable data. Most of the time, servlets just read the file for the data, but some servlets may also update it, adding new lines to the file or removing and replacing existing lines. Although the file contents are not updated very often, there is still a small chance of data inconsistency and file corruption if two or more servlets decide to read and write the file at the same time.
The first goal is to make the file reading/writing safe. For this purpose, I've created a helper FileReaderWriter class providing some static methods for thread-safe file access. The read and write methods are coordinated by a ReentrantReadWriteLock. The rule is quite simple: multiple threads may read from the file at any time, as long as no other thread is writing to it at the same time.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FileReaderWriter {

    private static final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

    public static List<String> read(Path path) {
        List<String> list = new ArrayList<>();
        rwLock.readLock().lock();
        try {
            list = Files.readAllLines(path);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            rwLock.readLock().unlock();
        }
        return list;
    }

    public static void write(Path path, List<String> list) {
        rwLock.writeLock().lock();
        try {
            Files.write(path, list);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            rwLock.writeLock().unlock();
        }
    }
}
Then, every servlet may use the above method for file reading like this:
String dataDir = getServletContext().getInitParameter("data-directory");
Path filePath = Paths.get(dataDir, "test.txt");
List<String> list = FileReaderWriter.read(filePath);
Similarly, writing may be done with the FileReaderWriter.write(filePath, list) method. Note: if some data needs to be replaced or removed (which means fetching the data from the file, processing it and writing the updated data back to the file), then the whole code path for this operation should be guarded by rwLock.writeLock() for atomicity reasons.
Now that access to the shared file seems to be safe (at least, I hope so), the next step is to make it fast. From a scalability perspective, reading the file on every user request to a servlet doesn't sound reasonable. So what I thought of is to read the contents of the file into an ArrayList (or another collection) only once, during context initialization, and then share this ArrayList (not the file) as a context-scoped data-holder attribute. The context-scoped attribute can then be shared by servlets with the same locking mechanism as described above, and the contents of the updated ArrayList may be independently stored back to the file on some regular basis.
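A sketch of that idea with a ServletContextListener (the listener class and attribute name are made up) might look like this:

import java.nio.file.Paths;
import java.util.List;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import javax.servlet.annotation.WebListener;

@WebListener
public class SharedDataLoader implements ServletContextListener {

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        String dataDir = sce.getServletContext().getInitParameter("data-directory");
        // read the file once and share the list as a context attribute
        List<String> data = FileReaderWriter.read(Paths.get(dataDir, "test.txt"));
        sce.getServletContext().setAttribute("sharedData", data);
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) {
        // a natural place to persist the (possibly updated) list back to the file
    }
}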
Another solution (in order to avoid locking) would be to use a CopyOnWriteArrayList (or some other collection from the java.util.concurrent package) to hold the shared data, and designate a single-threaded ExecutorService to dump its contents into the file when needed. I have also heard of Java memory-mapped files for mapping an entire file into memory, but I'm not sure whether such an approach is appropriate for this particular situation.
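The CopyOnWriteArrayList variant could be sketched like this (the class name and flush interval are made up):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SharedData {

    private final CopyOnWriteArrayList<String> lines = new CopyOnWriteArrayList<>();
    private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

    public SharedData(Path file) {
        // servlets read and modify 'lines' without explicit locking,
        // while a single background thread dumps it to disk periodically
        flusher.scheduleAtFixedRate(() -> {
            try {
                Files.write(file, lines);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }, 30, 30, TimeUnit.SECONDS);
    }
}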
So, could anybody please guide me through the most effective ways (maybe suggesting some other alternatives) to solve this shared-file-access problem, given that writing to the file is quite infrequent and its contents are not expected to exceed a few dozen lines?
You don't explain your real problem, only your current attempt, so it is difficult to provide a good solution.
Your approach has two serious problems:
Problem 1: concurrency
a shared resource which is a simple text-file on the server side
holding some lines of mutable data
90% of the solution to a problem is a good data structure, and a mutable file is not one. Even popular database engines have important concurrency limitations (e.g. SQLite); don't try to reinvent the wheel.
Problem 2: horizontal scalability
Even if you solve your local concurrency problems (e.g. with synchronized methods), you won't be able to deploy multiple instances (nodes/servers) of your application.
Solution 1: use the right tool for the job
You don't explain exactly the nature of your (data management) problem but probably any NoSQL database will do you good (reading about MongoDB can be a good starting point).
(Bad) solution 2: use FileLock
If for some reason you insist on doing what you describe, use low-level file locks via FileLock. You will only have to deal with the file locks themselves (even partial-file locks are possible), and these work even when the application is distributed horizontally. You won't have to worry about synchronizing other resources either, as file-level locks will suffice.
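A minimal FileLock sketch (path handling simplified) could look like this:

import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

Path path = Paths.get("data", "test.txt");
try (FileChannel channel = FileChannel.open(path,
        StandardOpenOption.READ, StandardOpenOption.WRITE);
     FileLock lock = channel.lock()) { // blocks until an exclusive lock is acquired
    // read, modify and rewrite the file while holding the lock;
    // other processes locking the same file will wait here
}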
(Limited) solution 3: in memory structure
If you don't need horizontal scalability, you can use a shared in-memory structure like a ConcurrentHashMap, but then you lose horizontal scalability, and you could lose updates if the information is not persisted before the application stops.
Conclusion
Although there are more exotic distributed data models, using a database for even a single table may be the best and simplest solution.
So I'm not really sure what actually happens deep down.
Let's say we are in a web application and a user requests to download a dynamically generated file which can be several MB in size, possibly even 100 MB or more. The application does this:
String disposition = "attachment; filename=\"myFile.txt\"";
response.setHeader("Content-Disposition", disposition);
ServletOutputStream output = response.getOutputStream();
OutputStreamWriter writer = new OutputStreamWriter(output);
service.exportFile(ids, writer, properties);
Am I right that the whole file is never completely in memory, i.e. whatever data was generated is sent to the user and then discarded on the server (assuming everything went well, no packet loss)?
I ask this because I need to change the library that generates the files (3rd party), and the new one doesn't use the standard Java I/O classes, probably because it is just an API and the actual library is written in C. Anyway, to get a buffer's data the documentation says to call
String data = buffer.toString();
(the files are ASCII)
So is my assumption correct that memory consumption will be affected especially when multiple users download large files at the same time?
Yes, in your first code snippet the data is streamed directly to the client, assuming that the implementation of service.exportFile(ids, writer, properties) itself never holds the generated data in memory but really streams it directly to the writer.
With String data = buffer.toString(); you'll definitely place the entire data in heap space, at the latest when buffer.toString() is called, maybe earlier depending on the exact implementation.
In conclusion, you have in my opinion to be aware of two things:
- never assign the whole data to a variable in your code, but write it directly to the output stream
- ensure that the implementation also never holds the whole data in memory while generating it
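For instance, a minimal sketch of the first point, writing each generated chunk straight to the response (the chunks iterable is a hypothetical stand-in for whatever the library produces):

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

response.setHeader("Content-Disposition", "attachment; filename=\"myFile.txt\"");
try (Writer writer = new OutputStreamWriter(response.getOutputStream(), StandardCharsets.US_ASCII)) {
    for (String chunk : chunks) {
        writer.write(chunk); // only one chunk at a time is held in memory
    }
}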
The 3rd-party lib offers a second solution, which is writing to a file. However, that requires extra hassle to create unique file names, then read them in and delete them afterwards. But I implemented it anyway.
In
service.exportFile(ids, writer, properties)
ids is a collection of database identifiers of the records to be exported, so ids.size() gives a rough estimate of the resulting file's size. As long as it is small I use the buffer version; if it is large, the file one.
Is there a way to force the temporary files created in a Java program to stay in memory? Since I work with several large XML files, would I gain anything this way? I would prefer a transparent method that allows me not to upset the existing application.
UPDATE: I'm looking at the source code and I noticed that it uses libraries (which I cannot change) that require the path of those files ...
Thanks
The only way I can think of is to create a RAM disk and then point the system property java.io.tmpdir to that RAM disk.
XML is just a String, so why not just reference Strings in memory? I think the File interface is a distraction. Use StringBuilder if you need to manipulate the data, or StringBuffer if you need thread safety. Put them in a type-safe Map if you have a variable number of things that need to be looked up by a key.
If you absolutely have to keep the File interface, then create an InMemoryFileWriter that wraps ByteArrayOutputStream and ByteArrayInputStream to keep them in memory. But again, I think the whole file-in-memory thing is a bad decision if you just want to cache things in memory; that is a lot of overhead when a simple String would do.
Don't use files if you don't have to. Consider com.google.common.io.FileBackedOutputStream from Guava:
An OutputStream that starts buffering to a byte array, but switches to file buffering once the data reaches a configurable size.
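A small usage sketch (the 1 MB threshold and the xmlBytes variable are arbitrary assumptions):

import com.google.common.io.FileBackedOutputStream;

// buffer in memory up to 1 MB, then transparently spill to a temp file
FileBackedOutputStream out = new FileBackedOutputStream(1024 * 1024);
out.write(xmlBytes);                      // write as much as you like
byte[] data = out.asByteSource().read(); // read everything back
out.reset();                              // free the buffer or delete the backing file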
You probably can force the default behaviour of java.io.File with some reflection magic, but I'm sure you don't want to do that, as it can lead to unpredictable behaviour. You're better off providing a mechanism where it would be possible to switch between the usual and the in-memory behaviour, and routing all calls via this mechanism.
Look at this example; it shows how to use the file API to create in-memory files.
Assuming you have control over the streams that are being used to write to the file:
Do you absolutely want the in-memory behavior? If all you want to do is reduce the number of system calls that write to the disk, you can wrap the FileOutputStream in a BufferedOutputStream (with an appropriately big buffer size) and write to this BufferedOutputStream (or BufferedWriter) instead of writing directly to the original FileOutputStream.
(This does require a change in the existing application)
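A sketch of that change (buffer size, file name and the data variable chosen arbitrarily):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;

// a 64 KB buffer batches many small writes into few system calls
try (OutputStream out = new BufferedOutputStream(
        new FileOutputStream("large.xml"), 64 * 1024)) {
    out.write(data); // writes go to the buffer, not straight to disk
}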
Is there a generally-accepted way to return a large list of objects using Java EE?
For example, if you had a database ResultSet that had millions of objects how would you return those objects to a (remote) client application?
Another example -- that is closer to what I'm actually doing -- would be to aggregate data from hundreds of sources, normalize it, and incrementally transfer it to a client system as a single "list".
Since all the data cannot fit in memory, I was thinking that a combination of a stateful SessionBean and some sort of custom Iterator that called back to the server would do the trick.
So, in other words, if I have an API like Iterator<Data> getData() then what's a good way to implement getData() and Iterator<Data>?
How have you successfully solved this problem in the past?
Definitely don't duplicate the entire DB into Java's memory. This makes no sense and only makes things unnecessarily slow and memory-hogging. Rather, introduce pagination at the database level. You should query only the data you actually need to display on the current page, as Google does.
If you actually have a hard time in implementing this properly and/or figuring the SQL query for the specific database, then have a look at this answer. For JPA/Hibernate equivalent, have a look at this answer.
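As an illustration, a LIMIT/OFFSET style page query via plain JDBC (the exact syntax varies per database; connection, page and pageSize are assumed to exist) might look like:

import java.sql.PreparedStatement;
import java.sql.ResultSet;

PreparedStatement ps = connection.prepareStatement(
        "SELECT id, name FROM item ORDER BY id LIMIT ? OFFSET ?");
ps.setInt(1, pageSize);        // rows per page
ps.setInt(2, page * pageSize); // rows to skip
try (ResultSet rs = ps.executeQuery()) {
    while (rs.next()) {
        // map only the current page's rows
    }
}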
Update as per the comments (which actually changes the entire question subject...), here's a basic (pseudo) kickoff example:
List<Source> inputSources = createItSomehow();
Source outputSource = createItSomehow();

for (Source inputSource : inputSources) {
    while (inputSource.next()) {
        outputSource.write(inputSource.read());
    }
}
This way you effectively end up with a single entry in Java's memory instead of the entire collection as in the following (inefficient) example:
List<Source> inputSources = createItSomehow();
List<Entry> entries = new ArrayList<Entry>();

for (Source inputSource : inputSources) {
    while (inputSource.next()) {
        entries.add(inputSource.read());
    }
}

Source outputSource = createItSomehow();

for (Entry entry : entries) {
    outputSource.write(entry);
}
Pagination is a good solution when working with a web-based UI. Sometimes, however, it is much more efficient to stream everything in one call. The rmiio library was written explicitly for this purpose, and it is already known to work in a variety of app servers.
If your list is huge, you must assume that it won't fit in memory. Or at least that, if your server needs to handle many concurrent requests, you run a high risk of an OutOfMemoryError.
So basically, what you do is paging and batch reading: say you load a thousand objects from your database, send them to the client in the request's response, and loop until you have processed all objects. (See the response from BalusC.)
The problem is the same on the client side, and you'll likely need to stream the data to the file system to prevent OutOfMemory errors.
Please also note: it is okay to load millions of objects from a database as an administrative task, like performing a backup or an export in some 'exceptional' case. But you should not do it for a request any user could make. It will be slow and drain server resources.