Java Program Architecture: IO Buffer - java

I am new to Java and wrote a program which is very difficult to update because business logic, model and repository data are combined in the same classes. I have since researched many tutorials, but I cannot find an answer on how to ensure my program is written in the most resource-efficient manner.
The program imports one line of a CSV via the BufferedReader class and creates a model object instance to reflect each column read. Then a second CSV sheet is read via BufferedReader to check whether any of the data from the first CSV matches, and the model object is updated with this new information.
Once the model object has been updated, it is added to an ArrayList.
Which architecture is more efficient, version 1 or version 2, and why?
VERSION 1
READ LINE
CREATE OBJECT INSTANCE
READ LINE
UPDATE OBJECT INSTANCE (& RUN LOGIC)
WRITE LINE (OBJECT) TO NEW FILE VIA BUFFEREDWRITER
REPEAT 3000 TIMES
VERSION 2
READ LINE
ADD TO ARRAYLIST A
READ LINE
ADD TO ARRAYLIST B
RUN LOGIC BY COMPARING ARRAYLIST A AND B
WRITE FINAL ARRAYLIST TO NEW FILE VIA BUFFEREDWRITER
Please note that although this currently uses CSV data with a limited number of lines (3,000), in future the CSVs will grow to over 50,000 lines of data. So is it better to add to ArrayLists and run the logic on the complete ArrayLists, or to run the logic on each object first and then add the completed objects to an ArrayList?

Version 2 is more efficient because you're batching operations and not repeating expensive tasks. A classic example is a database connection: establishing a database connection is generally expensive, so opening a connection and doing 100 updates through it is always more efficient (for the caller) than opening a connection, doing an update and closing the connection 100 times.
The trade-off with that example is that databases can only hold a finite number of connections, so doing 100 updates through one connection is more efficient for the caller but might block other callers to the database.
Setting up a buffer is another "expensive" operation; that's why version 1 is so much slower: you're setting it up 3,000 (eventually 50,000) times rather than once.
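To make the two shapes concrete, here is a minimal sketch of the Version 2 approach: read both CSVs once, merge via a map lookup, and write everything through a single BufferedWriter. The Row record, the file names and the two-column layout are assumptions for illustration, not part of the original question.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CsvMergeSketch {

    // Hypothetical model class; replace with your own fields.
    record Row(String id, String value) { }

    public static void main(String[] args) throws IOException {
        // Read the first CSV once into a list of model objects.
        List<Row> rows = new ArrayList<>();
        try (BufferedReader first = Files.newBufferedReader(Path.of("first.csv"))) {
            String line;
            while ((line = first.readLine()) != null) {
                String[] cols = line.split(",");   // assumes two simple columns
                rows.add(new Row(cols[0], cols[1]));
            }
        }

        // Read the second CSV once into a map keyed by id for cheap lookups.
        Map<String, String> updates = new HashMap<>();
        try (BufferedReader second = Files.newBufferedReader(Path.of("second.csv"))) {
            String line;
            while ((line = second.readLine()) != null) {
                String[] cols = line.split(",");
                updates.put(cols[0], cols[1]);
            }
        }

        // Run the logic over the complete collections and write the result
        // through a single BufferedWriter that is set up only once.
        try (BufferedWriter out = Files.newBufferedWriter(Path.of("result.csv"))) {
            for (Row row : rows) {
                String merged = updates.getOrDefault(row.id(), row.value());
                out.write(row.id() + "," + merged);
                out.newLine();
            }
        }
    }
}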

Related

How to distribute work correctly in JSR-352?

So I have been using Java Batch Processing for some time now. My jobs were either import/export jobs which chunked from a reader to a writer, or I would write Batchlets that would do some more complex processing. Since I am beginning to hit memory limits I need to rethink the architecture.
So I want to better leverage the chunked Reader/Processor/Writer pattern, but I am unsure how to distribute the work over the three components. Only during processing does it become clear whether to write zero, one or several other records.
The reader is quite clear: It reads the data to be processed from the DB. But I am unsure how to write the records back to the database. I see these options:
Make the processor store the variable amount of data in the DB itself.
Make the processor send variable amount of data to the writer that would then perform the writing.
Place the entire logic into the writer.
Which way would be the best for this kind of task?
Looking at https://www.ibm.com/support/pages/system/files/inline-files/WP102706_WLB_JSR352.002.pdf, especially the chapters Chunk/The Processor and Chunk/The Writer, it becomes obvious that it is up to me.
The processor can return an object, and the writer will have to understand and write this object. So for the above case where the processor has zero, one or many items to write per input record, it should simply return a list. This list can contain zero, one or several elements. The writer has to understand the list and write its elements to the database.
Since the logic is divided this way, the code is still pluggable and can easily be extended or maintained.
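To illustrate the list-returning processor, here is a minimal sketch against the JSR-352 chunk API (package javax.batch.api.chunk; newer Jakarta Batch releases use jakarta.batch.api.chunk instead). The InputRecord and OutputRecord types and the expansion rule are placeholders, not part of the question.

import java.util.ArrayList;
import java.util.List;

import javax.batch.api.chunk.ItemProcessor;

public class ExpandingItemProcessor implements ItemProcessor {

    // Hypothetical domain types standing in for whatever the reader produces
    // and the writer consumes.
    public record InputRecord(long id, int copies) { }
    public record OutputRecord(long sourceId, int sequence) { }

    @Override
    public Object processItem(Object item) throws Exception {
        InputRecord in = (InputRecord) item;

        // Only during processing does it become clear whether zero, one or
        // several records should be emitted for this input item.
        List<OutputRecord> out = new ArrayList<>();
        for (int i = 0; i < in.copies(); i++) {
            out.add(new OutputRecord(in.id(), i));
        }

        // Returning null tells the batch runtime to skip this item, so
        // nothing reaches the writer for it.
        return out.isEmpty() ? null : out;
    }
}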
Addendum: since both reader and writer connect to the same database in this case, I ran into the problem that the reader's connection was also invalidated on every chunk commit. The solution was to use a non-JTA datasource for the reader.
Typically, an item processor processes an input item passed from an item reader, and the processing result can be null or a domain object, so it's not ideally suited for cases where the processing result is split into multiple objects. I would assume that even in your case, getting multiple objects out of one processing iteration is not common. So I would suggest using a list (or any other collection type) as the processed object only when necessary. In the more common cases, the item processor should still return null (to skip the current item) or a plain domain object.
When the item writer iterates through the accumulated items, it can check whether an item is a collection and, if so, write out all contained elements. For a plain domain object, it just writes it as usual.
Using a non-JTA datasource for the reader is fine. You would want to keep the reader connection open from start to end in order to keep reading from the result set. In an item writer, the connection is typically acquired at the beginning of the write operation and closed when the chunk transaction commits or rolls back.
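A matching writer sketch under the same assumptions: it flattens collections and writes plain domain objects as-is. The writeOne method is a placeholder for the real persistence code (for example a JDBC batch insert).

import java.util.Collection;
import java.util.List;

import javax.batch.api.chunk.AbstractItemWriter;

public class FlatteningItemWriter extends AbstractItemWriter {

    @Override
    public void writeItems(List<Object> items) throws Exception {
        for (Object item : items) {
            if (item instanceof Collection<?> elements) {
                // The processor returned several records for one input item:
                // write each contained element.
                for (Object element : elements) {
                    writeOne(element);
                }
            } else {
                // Plain domain object: write it as usual.
                writeOne(item);
            }
        }
    }

    // Placeholder for the actual persistence logic.
    private void writeOne(Object element) {
        System.out.println("writing " + element);
    }
}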
Some resources that may be of help:
Jakarta Batch API,
jberet-support JdbcItemReader,
jberet-support JdbcItemWriter

Replicating a Looping File Backup in Java

I'm trying to implement a file storage system that is like a video recording system that loops over existing data. Say we have a maximum file size of 10MB and append an integer value every second. We can set up a FileChannel and keep appending the values. But what do we do once we've reached our 10MB? Now we want to append a new value but pop one off the head of the file. The problem is essentially a queue: it's easy to push and pop values on a queue in memory, but not so easy when using files.
I implemented a circular buffer with a FileChannel as the behind-the-scenes storage. It works, but the problem is that the first and last indices move through the file as data is added and removed. Ideally, I would always like the oldest data value to be at file index 0 and the most recent data at file index n-1, so that when the file is read it is read from start to end.
I saw that FileChannel supports the methods transferTo() and transferFrom() and also did an implementation using these methods, and again it works. The problem with this approach is continually having to transfer blocks of data from the current file to a temporary file and then replace the current file with the new one. It works, but it is not particularly efficient.
So I've tried a few things but haven't yet found the ideal solution for replicating a file-backed queue, and I was wondering whether anyone else has implemented the silver-bullet solution. Maybe a file version of a queue in which the data is shuffled along is simply not possible, but hopefully someone knows an answer. Thanks.
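For reference, here is a minimal sketch of the ring-buffer idea described above: fixed-size int slots in a FileChannel plus a small header holding the head index and count, so the logical oldest-first order can be reconstructed on read even though the physical start of the data moves. The header layout, the slot size and the lack of crash handling (the header is not re-read on reopen) are simplifications, not a definitive design.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Fixed-capacity ring buffer of ints backed by a FileChannel.
// The first 8 bytes are a header: head index and element count.
public class RingFile implements AutoCloseable {

    private static final int HEADER = 8;             // two ints: head, count
    private static final int SLOT = Integer.BYTES;

    private final FileChannel channel;
    private final int capacity;
    private int head;    // slot index of the oldest element
    private int count;   // number of elements currently stored

    public RingFile(Path path, int capacity) throws IOException {
        this.channel = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE);
        this.capacity = capacity;
    }

    public void append(int value) throws IOException {
        int slot = (head + count) % capacity;         // next free slot, or the oldest one
        ByteBuffer buf = ByteBuffer.allocate(SLOT);
        buf.putInt(value);
        buf.flip();
        channel.write(buf, HEADER + (long) slot * SLOT);

        if (count < capacity) {
            count++;
        } else {
            head = (head + 1) % capacity;             // buffer full: drop the oldest value
        }

        ByteBuffer header = ByteBuffer.allocate(HEADER);
        header.putInt(head);
        header.putInt(count);
        header.flip();
        channel.write(header, 0);
    }

    // Returns the stored values oldest-first, regardless of where head currently is.
    public int[] readAll() throws IOException {
        int[] values = new int[count];
        ByteBuffer buf = ByteBuffer.allocate(SLOT);
        for (int i = 0; i < count; i++) {
            int slot = (head + i) % capacity;
            buf.clear();
            channel.read(buf, HEADER + (long) slot * SLOT);
            buf.flip();
            values[i] = buf.getInt();
        }
        return values;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}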

Splitting a large log file into multiple files in Scala

I have a large log file with client-id as one of the fields in each log line. I would like to split this large log file in to several files grouped by client-id. So, if the original file has 10 lines with 10 unique client-ids, then at the end there will be 10 files with 1 line in each.
I am trying to do this in Scala and don't want to load the entire file into memory, so I load one line at a time using scala.io.Source.getLines(). That is working nicely. But I don't have a good way to write the lines out into separate files one at a time. I can think of two options:
Create a new PrintWriter backed by a BufferedWriter (Files.newBufferedWriter) for every line. This seems inefficient.
Create a new PrintWriter backed by a BufferedWriter for every output file, hold on to these PrintWriters, keep writing to them until we have read all lines in the original log file, and then close them. This doesn't seem like a very functional way to do it in Scala.
Being new to Scala, I am not sure if there is a better way to accomplish something like this. Any thoughts or ideas are much appreciated.
You can do the second option in pretty functional, idiomatic Scala. You can keep track of all of your PrintWriters, and fold over the lines of the file:
import java.io._
import scala.io._
Source.fromFile(new File("/tmp/log")).getLines.foldLeft(Map.empty[String, PrintWriter]) {
  case (printers, line) =>
    // Assume the client-id is the first whitespace-separated field of the line.
    val id = line.split(" ").head
    // Reuse the writer for this id if we already have one; otherwise open a new output file.
    val printer = printers.get(id).getOrElse(new PrintWriter(new File(s"/tmp/log_$id")))
    printer.println(line)
    // Thread the (possibly extended) map of writers through to the next line.
    printers.updated(id, printer)
}.values.foreach(_.close)
In a production-level version, you'd probably want to wrap the I/O operations in a try (or Try) and keep track of failures that way, while still closing all the PrintWriters at the end.

Special OutputStream to work into memory and file depending on the amount of input data

Currently I'm working with an SSH client API that provides me with stdout and stderr as InputStreams. I have to read all the data from these streams on the client side and provide an API for implementors to be able to work with the data the way they want (just drop it, write it to a DB, process it, etc.). At first I tried to keep the whole of the data read in byte arrays, but with huge amounts of data (which can sometimes happen) this can cause serious memory problems. But I don't want to write all the data of every call into files if that isn't really necessary.
Does anyone know of a solution that reads data into memory until it reaches a limit (like 1 MB), then writes the data from memory to a file and appends all the remaining data of the InputStream to the same file?
Commons IO has a workable solution: DeferredFileOutputStream.
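A minimal sketch of how DeferredFileOutputStream could be used here, assuming a 1 MB threshold and a temporary spill file; the class keeps the bytes in memory until the threshold is crossed and then writes everything to the file.

import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;
import org.apache.commons.io.output.DeferredFileOutputStream;

public class DeferredCapture {

    // Copies e.g. the SSH stdout stream into memory until ~1 MB, then spills
    // to a temp file. Threshold and file name are illustrative choices.
    public static DeferredFileOutputStream capture(InputStream stream) throws IOException {
        DeferredFileOutputStream out = new DeferredFileOutputStream(
                1024 * 1024, File.createTempFile("ssh-output", ".tmp"));
        try {
            IOUtils.copy(stream, out);
        } finally {
            out.close();   // must be closed before the captured data is inspected
        }

        if (out.isInMemory()) {
            byte[] data = out.getData();   // everything fit below the threshold
            System.out.println("kept " + data.length + " bytes in memory");
        } else {
            System.out.println("spilled to " + out.getFile());
        }
        return out;
    }
}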
Can you avoid reading the stream until you know what you are going to do with it?
If you use this approach you can drop the data, write portions of it to a database as you read it, or read and process the data as you go.
This way you would not need to read more than 1 MB (or less) at any one time.

Read/write to a large file in Java

I have a binary file with the following format:
[N bytes identifier & record length] [n1 bytes data]
[N bytes identifier & record length] [n2 bytes data]
[N bytes identifier & record length] [n3 bytes data]
As you can see, I have records with different lengths. In each record I have N fixed bytes which contain an id and the length of the data in the record.
This file is very big and can contain 3 million records.
I want to open this file in an application and let the user browse and edit the records
(insert / update / delete records).
My initial plan is to create an index file from the original file and, for each record, keep the next and previous record addresses so I can navigate forward and backward easily (some sort of linked list, but in a file, not in memory).
Is there a library (a Java library) that would help me implement this requirement?
Any recommendations or experience that you think would be useful?
----------------- EDIT ----------------------------------------------
Thanks for the guidance and suggestions.
Some more info:
The original file and its format are out of my control (it's a third-party file) and I can't change the file format. But I have to read it, let the user navigate over the records and edit some of them (insert a new record / update an existing record / delete a record), and at the end save it back in the original file format.
Do you still recommend a database instead of a normal index file?
----------------- SECOND EDIT ----------------------------------------------
Record size in update mode is fixed: an updated (edited) record has the same length as the original record, unless the user deletes the record and creates another record with a different format.
Many Thanks
Seriously, you should NOT be using a binary file for this. You should use a database.
The problems with trying to implement this as a regular file stem from the fact that operating systems do not allow you to insert extra bytes into the middle of an existing file. So if you need to insert a record (anywhere but the end), update a record (with a different size) or remove a record, you would need to:
rewrite other records (after the insertion/update/deletion point) to make or reclaim space, or
implement some kind of free space management within the file.
All of this is complicated and / or expensive.
Fortunately, there is a class of software that implements this kind of thing. It is called database software. There are a wide range of options, ranging from using a full-scale RDBMS to light-weight solutions like BerkeleyDB files.
In response to your 1st and 2nd edits, a database will still be simpler.
However, here's an alternative that might perform better for this use-case than using a DB... without doing complicated free-space management.
Read the file and build an in-memory index that maps ids to file locations.
Create a second file to hold new and updated records.
Perform the record adds/updates/deletes:
An addition is handled by writing the new record to the end of the second file, and adding an index entry for it.
An update is handled by writing the updated record to the end of the second file, and changing the existing index entry to point to it.
A delete is handled by deleting the index entry for the record's key.
Compact the file as follows:
Create a new file.
Read each record in the old file in order, and check the index for the record's key. If the entry still points to the location of the record, copy the record to the new file. Otherwise skip it.
Repeat step 4.2 for the second file.
If we completed all of the above successfully, delete the old file and second file.
Note this relies on being able to keep the index in memory. If that is not feasible, then the implementation is going to be more complicated ... and more like a database.
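A minimal sketch of the in-memory index this approach relies on; record ids are assumed to be longs, and the actual file I/O (the initial scan, appending to the second file, compaction) is left out.

import java.util.HashMap;
import java.util.Map;

// Sketch of the id -> location index described above.
public class RecordIndex {

    // Where a live record currently lives.
    public record Location(boolean inSecondFile, long offset, int length) { }

    private final Map<Long, Location> index = new HashMap<>();

    // Populated once by scanning the original file (step 1).
    public void put(long id, long offset, int length) {
        index.put(id, new Location(false, offset, length));
    }

    // Add or update: the record body has already been appended to the
    // second file at the given offset (steps 3.1 and 3.2).
    public void pointToSecondFile(long id, long offset, int length) {
        index.put(id, new Location(true, offset, length));
    }

    // Delete: just forget the record (step 3.3); compaction drops it later.
    public void delete(long id) {
        index.remove(id);
    }

    // Used during compaction (step 4): copy a record only if the index
    // still points at the location we are reading it from.
    public boolean isLive(long id, boolean fromSecondFile, long offset) {
        Location loc = index.get(id);
        return loc != null && loc.inSecondFile() == fromSecondFile && loc.offset() == offset;
    }
}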
Having a data file and an index file would be the general base idea for such an implementation, but you'd pretty much find yourself dealing with data fragmentation upon repeated data updates/deletion, etc. This kind of project, in itself, should be a separate project and should not be part of your main application. However, essentially, a database is what you need as it is specifically designed for such operations and use cases and will also allow you to search, sort, and extend (alter) your data structure without having to refactor an in-house (custom) solution.
May I suggest you download Apache Derby and create a local embedded database (Derby does this for you when you create a new embedded connection at run-time). It will not only be faster than anything you'll write yourself, but it will also make your application easier to maintain.
Apache Derby is a single jar file that you can simply include and distribute with your project (check the license in case any legal issues apply to your app). There is no need for a database server or third-party software; it's all pure Java.
The bottom line is that it all depends on how large your application is, whether you need to share the data across many clients, whether speed is a critical aspect of your app, and so on.
For a stand-alone, single-user project, I recommend Apache Derby. For an n-tier application, you might want to look into MySQL, PostgreSQL or (hrm) even Oracle. Using ready-made and tested solutions is not only smart, but will also cut down your development time (and maintenance effort).
Cheers.
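For example, a minimal sketch of opening an embedded Derby database; the database name and table are just examples, and derby.jar must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class EmbeddedDerbyExample {

    public static void main(String[] args) throws SQLException {
        // ";create=true" creates the database directory on first use.
        try (Connection conn = DriverManager.getConnection("jdbc:derby:recordsDB;create=true");
             Statement st = conn.createStatement()) {
            // Example schema; this fails if the table already exists,
            // so a real application would check for it first.
            st.executeUpdate("CREATE TABLE records (id BIGINT PRIMARY KEY, data BLOB)");
        }
    }
}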
Generally you are better off letting a library or database do the work for you.
You may not want to have an SQL database and there are plenty of simple databases which don't use SQL. http://nosql-database.org/ lists 122 of them.
At a minimum, if you are going to write this I suggest you read the source for one of these databases to see how they work.
Depending on the size of the records, 3 million isn't that much and I would suggest you keep as much in memory as possible.
The first problem you are likely to have is ensuring the data is consistent and recovering it when corruption occurs. The second problem is dealing with fragmentation efficiently (something the brightest minds working on the GC deal with). The third problem is likely to be maintaining the index in a transactional fashion with the source data to ensure there are no inconsistencies.
While this may appear simple at first, there are significant complexities in making sure the data is reliable, maintainable and can be accessed efficiently. This is why most developers use an existing database/datastore library and concentrate on the features which are unique to their application.
(Note: My answer is about the problem in general, not considering any Java libraries or - like the other answers also proposed - using a database (library), which might be better than reinventing the wheel)
The idea to create an index is good and will be very helpful performance-wise (although you wrote "index file", I think it should be kept in memory). Generating the index should be quite fast if you read the ID and record length for each entry and then just skip the data with a file seek.
You should also think about the edit functionality. Inserting and deleting in particular can be very slow on such a big file if you do it wrong (e.g. deleting and then moving all the following entries to close the gap).
The best option would be to only mark deleted entries as deleted. When inserting, you can overwrite one of those or append to the end of the file.
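A minimal sketch of generating such an in-memory index by reading only each record's header and seeking past the data; the header layout assumed here (an 8-byte id followed by a 4-byte big-endian length) is an illustration and must be adapted to the real third-party format.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

public class IndexBuilder {

    // Builds an id -> file-offset index without reading the record bodies.
    public static Map<Long, Long> buildIndex(String path) throws IOException {
        Map<Long, Long> index = new LinkedHashMap<>();
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long offset = 0;
            long fileLength = file.length();
            while (offset < fileLength) {
                file.seek(offset);
                long id = file.readLong();        // assumed 8-byte identifier
                int dataLength = file.readInt();  // assumed 4-byte record length
                index.put(id, offset);
                offset += 8 + 4 + dataLength;     // skip the data via the seek on the next loop
            }
        }
        return index;
    }
}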
Insert / Update / Delete records
Inserting (rather than merely appending) records into a file and deleting records from it are expensive, because you have to move all the following content of the file to create space for the new record or to reclaim the space it used. Updating is similarly expensive if the update changes the length of the record (you say they are variable length).
The file format you propose is fundamentally unsuitable for the kinds of operations you want to perform. Others have suggested using a data-base. If you don't want to go that far, adding an index file (as you suggest) is the way to go. I recommend making the index records all the same length.
As others have stated, a database seems a better solution. The following Java SQL DBs could be used: H2, Derby or HSQLDB.
If you want to use an index file, look at Berkeley DB or a NoSQL store.
If there is some reason for using a file, look at JRecord. It has:
Several classes for reading/writing files with variable-length binary records (they were written for Cobol VB files). Any of the Mainframe / Fujitsu / Open Cobol VB file structures should do the job.
An editor for editing JRecord files. The latest version of the editor can handle large files (it uses compression / a spill file). The editor suffers from having to read the whole file, and only one user can edit the file at a time.
The JRecord solution will only work if:
There is a limited number of users (preferably one), all located in one location.
There is fast infrastructure.
