I have an application that wants to keep open many files: periodically it receives a client request
saying "add some data to file X", and it would be ideal to have that file already opened, and the file's
header section already parsed, so that this write is fast. However, keeping open this many files is
not very nice to the operating system, and could become impossible if our data-storage needs grow.
So I would like a "give me this file handle, opening if it's not cached" function, and some process
for automatically closing files which have not been written to for, say, five minutes. For the
specific case of caching file handles which are written to in short spurts, this is probably enough,
but this seems a general enough problem that there ought to be functions like "give me the object named X,
from cache if possible" and "I'm now done with object X, so make it eligible for eviction five
minutes from now".
core.cache looks like it might be suitable for this purpose, but the documentation is quite lacking and the source provides no particular clues about how to use it. Its TTLCache looks promising, but besides being unclear to use, it relies on garbage collection to evict items, so I can't cleanly close a resource when I'm ready to expire it.
I could roll my own, of course, but there are a number of tricky spots and I'm sure I would get some
things subtly wrong, so I'm hoping someone out there can point me to an implementation of this
functionality. My code is in Clojure, but of course using a Java library would be perfectly fine if
that's where the best implementation can be found.
Check out Guava's cache implementation.
You can supply a Callable (or a CacheLoader) to the get method for "if handle is cached, return it, otherwise open, cache and return it" semantics
You can configure timed eviction such as expireAfterAccess
You can register a RemovalListener to close the handles on removal
Modifying the code examples from the linked Guava page slightly, using CacheLoader:
// Key, Handle and AnyException are placeholders from the Guava wiki example.
// The listener must be declared before the cache that references it.
RemovalListener<Key, Handle> removalListener =
    new RemovalListener<Key, Handle>() {
      public void onRemoval(RemovalNotification<Key, Handle> removal) {
        Handle h = removal.getValue();
        h.close(); // tear down properly
      }
    };

LoadingCache<Key, Handle> graphs = CacheBuilder.newBuilder()
    .maximumSize(100) // sensible value for open handles?
    .expireAfterAccess(5, TimeUnit.MINUTES)
    .removalListener(removalListener)
    .build(
        new CacheLoader<Key, Handle>() {
          public Handle load(Key key) throws AnyException {
            return openHandle(key);
          }
        });
*Disclaimer*: I have not used the cache myself this way; make sure you test this sensibly.
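One more caveat: per Guava's documentation, the cache performs eviction during normal read/write operations rather than on a dedicated background thread, so with bursty access an idle handle can stay open well past its five minutes until something touches the cache. A sketch of usage plus a periodic cleanUp() call (the scheduler is my own addition, not part of the Guava example):

// Look up a handle; on a miss the CacheLoader opens and caches it.
// (LoadingCache.get throws ExecutionException if the loader fails.)
Handle h = graphs.get(key);

// Guava only evicts expired entries as a side effect of cache activity,
// so schedule cleanUp() to make sure idle handles really get closed.
ScheduledExecutorService cleaner = Executors.newSingleThreadScheduledExecutor();
cleaner.scheduleAtFixedRate(graphs::cleanUp, 1, 1, TimeUnit.MINUTES);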
If you don't mind some Java, see http://ehcache.org/ and http://ehcache.org/apidocs/net/sf/ehcache/event/CacheEventListener.html.
I'm developing a small web app whose servlets periodically access a shared resource, which is a simple text file on the server side holding some lines of mutable data. Most of the time, servlets just read the file for the data, but some servlets may also update it, adding new lines or removing and replacing existing ones. Although the file contents are not updated very often, there is still a small chance of data inconsistency and file corruption if two or more servlets decide to read and write the file at the same time.
The first goal is to make the file reading/writing safe. For this purpose, I've created a helper FileReaderWriter class providing some static methods for thread-safe file access. The read and write methods are coordinated by a ReentrantReadWriteLock. The rule is quite simple: multiple threads may read from the file at any time, as long as no other thread is writing to it at the same time.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FileReaderWriter {
    private static final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

    public static List<String> read(Path path) {
        List<String> list = new ArrayList<>();
        rwLock.readLock().lock(); // many readers may hold this at once
        try {
            list = Files.readAllLines(path);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            rwLock.readLock().unlock();
        }
        return list;
    }

    public static void write(Path path, List<String> list) {
        rwLock.writeLock().lock(); // exclusive: blocks all readers and writers
        try {
            Files.write(path, list);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            rwLock.writeLock().unlock();
        }
    }
}
Then, every servlet may use the above method for file reading like this:
String dataDir = getServletContext().getInitParameter("data-directory");
Path filePath = Paths.get(dataDir, "test.txt");
List<String> list = FileReaderWriter.read(filePath);
Similarly, writing may be done with the FileReaderWriter.write(filePath, list) method. Note: if some data needs to be replaced or removed (which means fetching the data from the file, processing it, and writing the updated data back), then the whole code path for this operation should be locked by rwLock.writeLock() for atomicity.
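For illustration, a hypothetical update method on the same FileReaderWriter class might look like the following (UnaryOperator, from java.util.function, is my own choice here; it receives the current lines and returns the replacement list, and the write lock is held across the whole cycle):

public static void update(Path path, UnaryOperator<List<String>> change) {
    rwLock.writeLock().lock(); // held exclusively for the whole read-modify-write
    try {
        List<String> lines = Files.readAllLines(path);
        Files.write(path, change.apply(lines));
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        rwLock.writeLock().unlock();
    }
}

A servlet could then replace or remove lines with a single atomic call such as FileReaderWriter.update(filePath, lines -> processedCopyOf(lines)).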
Now that access to the shared file seems to be safe (at least, I hope so), the next step is to make it fast. From a scalability perspective, reading the file on every user request to the servlet doesn't sound reasonable. So what I thought of is to read the contents of the file into an ArrayList (or another collection) only once, during context initialization, and then share this ArrayList (not the file) as a context-scoped data-holder attribute. The context-scoped attribute can then be shared by servlets with the same locking mechanism as described above, and the contents of the updated ArrayList may be independently stored back to the file on a regular basis.
Another solution (in order to avoid locking) would be to use a CopyOnWriteArrayList (or some other collection from the java.util.concurrent package) to hold the shared data and designate a single-threaded ExecutorService to dump its contents to the file when needed. I have also heard of Java memory-mapped files for mapping an entire file into memory, but I'm not sure that approach is appropriate for this particular situation.
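Sketching roughly what I mean (names are illustrative; the initial load would happen once, e.g. in a ServletContextListener, with exception handling at load time omitted):

// Load the file once at startup and share the list as a context attribute.
CopyOnWriteArrayList<String> sharedData =
        new CopyOnWriteArrayList<>(Files.readAllLines(filePath));
ExecutorService fileDumper = Executors.newSingleThreadExecutor();

// After mutating the list, queue an asynchronous dump; the single thread
// guarantees that writes to the file never interleave.
sharedData.add("new line of data");
fileDumper.submit(() -> {
    try {
        Files.write(filePath, sharedData); // iterates over a consistent snapshot
    } catch (IOException e) {
        e.printStackTrace();
    }
});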
So, could anybody please guide me through the most effective ways to solve this shared-file-access problem (maybe suggesting some other alternatives), given that writes to the file are quite infrequent and its contents are not expected to exceed a few dozen lines?
You don't explain your real problem, only your current attempt, so it is difficult to provide a good solution.
Your approach has two serious problems:
Problem 1: concurrency
a shared resource which is a simple text-file on the server side
holding some lines of mutable data
90% of the solution to a problem is a good data structure. A mutable file is not one. Even popular database engines have important concurrency limitations (e.g. SQLite); don't try to reinvent the wheel.
Problem 2: horizontal scalability
Even if you solve your local concurrency problems (e.g. with synchronized methods), you won't be able to deploy multiple instances (nodes/servers) of your application.
Solution 1: use the right tool for the job
You don't explain exactly the nature of your (data management) problem, but probably any NoSQL database will do you good (reading about MongoDB can be a good starting point).
(Bad) solution 2: use FileLock
If for some reason you insist on doing what you describe, use low-level file locks via FileLock. You only have to deal with file locks, which can even be taken on partial regions of a file, and these work across horizontally distributed instances. You won't have to worry about synchronizing other resources either, as file-level locks will suffice.
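A minimal sketch of that approach (the path is illustrative; note that FileLock coordinates between processes, not between threads of the same JVM):

try (FileChannel channel = FileChannel.open(Paths.get("data/test.txt"),
        StandardOpenOption.READ, StandardOpenOption.WRITE)) {
    // Blocks until an exclusive lock on the whole file is acquired;
    // channel.lock(position, size, shared) would lock just a region.
    try (FileLock lock = channel.lock()) {
        // read/modify/write the file while holding the lock
    }
} catch (IOException e) {
    e.printStackTrace();
}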
(Limited) solution 3: in memory structure
If horizontal scalability is not a requirement, you can use a shared in-memory structure like a ConcurrentHashMap; but you give up horizontal scalability, and you could lose transactions if you do not persist the information before the application stops.
Conclusion
Although there are more exotic distributed data models, using a database for even a single table may be the best and simplest solution.
I'm new to Hazelcast, and I'm trying to use it to store data in a map that is too large to fit on a single machine.
One of the processes I need to implement goes over each of the values in the map and does something with them - no accumulation or aggregation is involved, and I don't need to see all the data at once, so there is no memory concern there.
My trivial implementation would be to use IMap.keySet() and then to iterate over all the keys to get each stored value in turn (and allow the value to be GCed after processing), but my concern is that there is going to be so much data in the system that even just getting the list of keys will be large enough to put undue stress on the system.
I was hoping that there was a streaming API that I can stream keys (or even full entries) in such a way that the local node will not have to cache the entire set locally - but failed to find anything that seemed relevant to me in the documentation.
I would appreciate any suggestions that you may come up with. Thanks.
Hazelcast Jet provides a distributed version of java.util.stream and adds streaming capabilities to IMap.
It allows execution of the Java Streams API on the Hazelcast cluster.
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.stream.IStreamMap;
import com.hazelcast.jet.stream.IStreamList;
import static com.hazelcast.jet.stream.DistributedCollectors.toIList;

// instance1 is a JetInstance (e.g. from Jet.newJetInstance())
final IStreamMap<String, Integer> streamMap = instance1.getMap("source");
// stream of entries; you can grab the keys from it
IStreamList<String> keys = streamMap.stream()
        .map(entry -> entry.getKey().toLowerCase())
        .filter(key -> key.length() >= 5)
        .sorted()
        // this stores the result on the cluster as well,
        // so there is no data movement between client and cluster
        .collect(toIList());
Please find more info about Jet and more examples in the Hazelcast Jet documentation.
Cheers,
Vik
While the Hazelcast Jet stream implementation looks impressive, I didn't have a lot of time to invest in looking at upgrading to Hazelcast Jet (in our pretty much bog-standard vert.x setup). Instead I used IMap.executeOnEntries, which seems to do about the same thing as detailed for Hazelcast Jet by @Vik Gamov, except with a more annoying syntax.
My example:
myMap.executeOnEntries(new EntryProcessor<String, MyEntity>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Object process(Entry<String, MyEntity> entry) {
        entry.getValue().fondle();
        return null;
    }

    @Override
    public EntryBackupProcessor<String, MyEntity> getBackupProcessor() {
        return null; // declare that we don't process backup copies
    }
});
As you can see, the syntax is quite annoying:
We need to create an actual object that can be serialized to the cluster - no fancy lambdas here (don't use my serial ID if you copy & paste this - it's broken by design).
One reason it cannot be a lambda is that the interface is not functional - you need another method to handle backup copies (or at least to declare that you don't want to handle them, as I do). While I acknowledge its importance, it's not important all of the time, and I would guess it matters only in rare cases.
Obviously you can't (or at least it's not trivial to) return data from the processor - which is not important in my case, but still.
Broad discussion question.
Are there any existing libraries which allow me to store the execution state of my application in Java?
E.g. I have an application which processes files; the application may be forced to shut down suddenly at some point. I want to store information about which files have been processed and which have not, and what stage the processing was at for the ongoing ones.
Are there already any libraries which abstract this functionality, or would I have to implement it from scratch?
It seems like what you are looking for is serialization, which can be performed with the Java Serialization API.
You can write even less code if you use a known library such as Apache Commons Lang, whose SerializationUtils class is built on top of the Java Serialization API.
Using the latter, serializing/deserializing your application state to a file takes only a few lines.
The only thing you have to do is create a class holding your application state, let's call it... ApplicationState :-) It can look like this:
import java.io.Serializable;
import java.util.List;

// Must implement Serializable for the Java Serialization API to accept it.
class ApplicationState implements Serializable {

    enum ProcessState {
        READ_DONE,
        PROCESSING_STARTED,
        PROCESSING_ENDED,
        ANOTHER_STATE;
    }

    private List<String> filesDone, filesToDo;
    private String currentlyProcessingFile;
    private ProcessState currentProcessState;
}
With such a structure, and using SerializationUtils, serializing is done the following way:
try {
    ApplicationState state = new ApplicationState();
    ...
    // File to serialize the object to
    String fileName = "applicationState.ser";
    // New file output stream for the file
    FileOutputStream fos = new FileOutputStream(fileName);
    // Serialize the application state
    SerializationUtils.serialize(state, fos);
    fos.close();
    // Open a FileInputStream to the file
    FileInputStream fis = new FileInputStream(fileName);
    // Deserialize and cast back into ApplicationState
    ApplicationState restored = (ApplicationState) SerializationUtils.deserialize(fis);
    System.out.println(restored);
    fis.close();
} catch (Exception e) {
    e.printStackTrace();
}
It sounds like the Java Preferences API might be a good option for you. It can store user/system settings with minimal effort on your part, and you can update/retrieve them at any time.
https://docs.oracle.com/javase/8/docs/technotes/guides/preferences/index.html
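A minimal sketch (the MyApp class, keys, and values are all made up for illustration):

Preferences prefs = Preferences.userNodeForPackage(MyApp.class); // java.util.prefs
prefs.put("lastProcessedFile", "input-042.csv");   // store a setting
String last = prefs.get("lastProcessedFile", "");  // read it back, with a default
prefs.putInt("filesDone", 42);                     // typed variants exist too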
It's pretty simple to make from scratch. You could follow this:
Have a DB (or just a file) that stores the information of processing progress. Something like:
Id|fileName|status|metadata
As soon as you start processing a file, make an entry in this table and mark the status as PROCESSING; then you can store intermediate states, and finally, when you're done, set the status to DONE. This way, on restart, you would know which files have been processed, which were in progress when the process shut down or crashed, and (obviously) where to start.
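As a rough illustration with plain JDBC (the progress table, its columns, and the conn variable are assumptions of this sketch):

// Record the file as picked up before processing starts.
try (PreparedStatement ps = conn.prepareStatement(
        "INSERT INTO progress (fileName, status) VALUES (?, 'PROCESSING')")) {
    ps.setString(1, fileName);
    ps.executeUpdate();
}
// ... process the file, recording intermediate states the same way ...
// Flip the status once the file is fully handled.
try (PreparedStatement ps = conn.prepareStatement(
        "UPDATE progress SET status = 'DONE' WHERE fileName = ?")) {
    ps.setString(1, fileName);
    ps.executeUpdate();
}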
In large enterprise environments where applications are loosely coupled (and there is no guarantee that an application will be available or won't crash), we use message queues to achieve much the same thing and ensure a reliable architecture.
There are almost too many ways to mention. I would choose the option you believe is simplest.
You can use:
a file to record what is done (and what is to be done)
a persistent queue on JMS (which supports multiple processes, even on different machines)
an embedded or remote database.
An approach I rave about is using memory-mapped files. A nice feature is that information is not lost if the application dies or is killed (provided the OS doesn't crash), which means you don't have to flush the data, nor worry about losing it if you don't.
This works because the data is partly managed by the OS, which means it uses little heap (even for TB of data), and the OS deals with loading and flushing to disk, making it much faster (and making sizes much larger than your main memory practical).
BTW: This approach works even with a kill -9, as the OS flushes the data to disk. To test this I use Unsafe.getByte(0), which crashes the application with a SEG fault immediately after making a change (as in, the very next machine-code instruction), and the change is still written to disk.
This won't work if you pull the power, though you'd have to be really quick. You can force memory-mapped data to disk before continuing, but I don't know how you can test that this really works. ;)
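For reference, a bare-bones version using only the JDK looks something like this (file name and mapping size are arbitrary; imports from java.nio and java.nio.channels assumed):

try (FileChannel ch = FileChannel.open(Paths.get("state.dat"),
        StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
    // Map 4 KB of the file into memory; writes go to the OS page cache,
    // which survives the process being killed (though not a power loss).
    MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
    buf.putLong(0, System.currentTimeMillis()); // e.g. record a progress marker
    // buf.force(); // optionally force the pages to disk before continuing
} catch (IOException e) {
    e.printStackTrace();
}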
I have a library which could make memory-mapped files easier to use:
https://github.com/peter-lawrey/Java-Chronicle
It's not a long read and you can use it as an example.
Apache Commons Configuration API: http://commons.apache.org/proper/commons-configuration/userguide/howto_filebased.html#File-based_Configurations
Is there a generally-accepted way to return a large list of objects using Java EE?
For example, if you had a database ResultSet that had millions of objects how would you return those objects to a (remote) client application?
Another example -- that is closer to what I'm actually doing -- would be to aggregate data from hundreds of sources, normalize it, and incrementally transfer it to a client system as a single "list".
Since all the data cannot fit in memory, I was thinking that a combination of a stateful SessionBean and some sort of custom Iterator that called back to the server would do the trick.
So, in other words, if I have an API like Iterator<Data> getData() then what's a good way to implement getData() and Iterator<Data>?
How have you successfully solved this problem in the past?
Definitely don't duplicate the entire DB into Java's memory. That makes no sense and only makes things unnecessarily slow and memory-hungry. Rather, introduce pagination at the database level: query only the data you actually need to display on the current page, as Google does.
If you actually have a hard time in implementing this properly and/or figuring the SQL query for the specific database, then have a look at this answer. For JPA/Hibernate equivalent, have a look at this answer.
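For example, database-level pagination with plain JDBC might look like this sketch (table and column names are made up; LIMIT/OFFSET works on e.g. MySQL and PostgreSQL, while other databases use ROWNUM or FETCH FIRST):

// Fetch only one page of rows; page and pageSize come from the request.
try (PreparedStatement ps = conn.prepareStatement(
        "SELECT id, title FROM item ORDER BY id LIMIT ? OFFSET ?")) {
    ps.setInt(1, pageSize);
    ps.setInt(2, page * pageSize);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            // map the row to an object and add it to the page result
        }
    }
}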
Update as per the comments (which actually changes the entire question subject...), here's a basic (pseudo) kickoff example:
List<Source> inputSources = createItSomehow();
Source outputSource = createItSomehow();
for (Source inputSource : inputSources) {
while (inputSource.next()) {
outputSource.write(inputSource.read());
}
}
This way you effectively end up with a single entry in Java's memory instead of the entire collection as in the following (inefficient) example:
List<Source> inputSources = createItSomehow();
List<Entry> entries = new ArrayList<Entry>();
for (Source inputSource : inputSources) {
while (inputSource.next()) {
entries.add(inputSource.read());
}
}
Source outputSource = createItSomehow();
for (Entry entry : entries) {
outputSource.write(entry);
}
Pagination is a good solution when working with a web-based UI. Sometimes, however, it is much more efficient to stream everything in one call. The rmiio library was written explicitly for this purpose, and it is already known to work in a variety of app servers.
If your list is huge, you must assume it can't fit in memory. Or at least that, if your server needs to handle many concurrent requests, you run a high risk of an OutOfMemoryError.
So basically, what you do is paging and batch reading: say you load a thousand objects from your database, send them in the response, and loop until you have processed all objects (see the response from BalusC).
The problem is the same on the client side, and you'll likely need to stream the data to the file system to prevent out-of-memory errors.
Please also note: it is okay to load millions of objects from a database as an administrative task, like performing a backup or an export in some 'exceptional' case. But you should not do it for a request any user could make; it will be slow and drain server resources.
I need to create a desktop application that will run third-party code, and I need to prevent the third-party code from exporting information from the application by any means (web, clipboard, file I/O).
Something like:
public class MyClass {
private String protectedData;
public void doThirdPartyTask() {
String unprotectedData = unprotect(protectedData);
ThirdPartyClass.doTask(unprotectedData);
}
private String unprotect(String data) {
// ...
}
}
class ThirdPartyClass {
public static void doTask(String unprotectedData) {
// Do task using unprotected data.
// Malicious code may try to externalize the data.
}
}
I'm reading about SecurityManager and AccessControler, but I'm still not sure what's the best approach to handle this.
What should I read about to do this implementation?
First of all, there is pretty much no way you can stop every information leak on a local computer. You can certainly restrict network access, and even a lot of file-system access, but there is nothing that would stop the GUI from popping up a dialog showing the information to the user on the screen, or any of 100 other ways you could "leak" data.
Secondly, you keep talking about the policy file being changeable by the user. Yes, it is. It sounds like you are basically trying to recreate DRM. I'd suggest reading up on DRM and the general futility of it. It essentially boils down to giving someone a locked box and the key to the box, and telling them not to open it. If someone has physical access to your program, there is almost nothing you can do to stop them from getting data out of it, in Java or pretty much any other programming language (at least, not on computers as they are built today).
A general approach would be to run your JVM with a security policy that grants java.security.AllPermission to your codebase (i.e. your jar) and no permissions whatsoever to the third-party codebase. The Java security guide documents how to run with a policy file and what to put in it.
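As a rough illustration (the jar paths are placeholders of mine), the policy file could look like the sketch below, with the application launched as java -Djava.security.manager -Djava.security.policy=app.policy -jar myapp.jar:

// app.policy - full permissions for your own code, none for anyone else
grant codeBase "file:/opt/myapp/myapp.jar" {
    permission java.security.AllPermission;
};
// No grant entry for the third-party jar, so its code gets no permissions
// at all: no network access and no file I/O.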