Best way to read file and process content in java

Best way to read file and process content in java - java

I'm curious about the best way to read files and then process each line of the file. Assuming that the resource that needs to be read from can grow in size (e.g. a very large file) and the reading and processing of files can be swapped with a different implementation (e.g. reading an xml source instead of a file containing Strings). Please consider the following approaches:
Create 2 services. First service is used to extract data from the file and return a list. Second service takes the list and iterate thru it to process each item. Pros for this approach is that it adheres to SRP and makes it posible to switch services(e.g. get the data from a different source like XML JSON). The only con i could think about is that the performance hit of iterating thru the collection again (first time is when reading then putting in a collection) to do the processing.
Have only 1 service that will do both tasks. This way you can do the processing inside the initial loop. Trade offs would be the coupling between the reading and processing functionality which breaks the SRP.
If you have other suggestions on how to accomplish this all ideas are very welcome! TYIA!!!
Another thing i want to take away from this question is how great designers and developers (guys from this site ☺) come up with decisions for this case: why would you trade off one benefit for the other? best practices when comes to trade offs? Whats an acceptable trade off for you and why?Etc. Thanks again!

Related

Live queries implementation / aproaches in backend

I am working on think, which should be "live". I.e. use web-sockets or SSE to show current data in browser. Source of my data are two and they should be combined with a bit of business logic. Data can be retrieved using http get and they also come as web-hook notifications.
I am able to code needed thing in java + spring, but readability would suffer. I have discovered that using RethinkDB would make my task much more easier. But it seems that given project is not backed by live development.
I would like any java idiomatic approach / library / external SW (like database) to make easy (maintainable ~= less code) algorithm which would for example do thing like this:
2 inputs:
filesystem tree (git repo)
list of trees with some processing info in it. Each tree in list does contain:
root node with some irrelevant info
some number of children nodes
leaf nodes with filename from filesystem (with path), duration of action with file and status of file processing
Note: second input can contain for example 20 trees, where it have info about processing single file from filesystem tree 20 times. i.e. it is possible that for getting info about some particular file, we will need to crawl whole list of trees and there is no guarantee that file will have any matching processing info in second input. In this case, we will output "N/A" to resulting tree for given file.
I would like to transform these two inputs to another tree, which will have structure like first input and will contain info about last status (last from array) and sum of duration.
My current approach was not reactive. It have involved a lot of java stream api and using http GET to get actual data from two sources. It was working ok, it was fast enough, but not enough to introduce pooling and make user feel that it is real time.
To make this reactive, it would involve a lot of spaghetti code to keep current algorithms in place. so I have started another approach (from scratch).
I have started to make some nice OOP class, which will receive changes from both inputs and will produce "observable" changes as an output. This would be relatively nice, if my "query", which computes output would be immutable. It is not, due to design changes of business logic.
Can you point me to some approach making this problem implementation easy to maintain?
PS: I was considering using spring cache mechanism for receiving changes (caching methods which make http get calls for inputs and returns parsed, partly processed, input data). But this part of code is a bit small to mane any difference.

I want to preserve my data during service restart, but my data is not in simple variable name-value or table format. How should I go about this?

I want to preserve data during service restart, which uses a arraylist of {arraylist of integers} and some other variables.
Since it is about 40-60 MB, I don't want it be generated each time the service restarts(it takes a lot of time); I want to generate data once, and maybe copy it for next service restart.
How can it be done?
Please consider how will I go about putting a data structure similar to multidimensional array(3d or above) into file, before suggesting writing the data in a file; which when done, will likely take significant time to read too.

You can try writing your data after generation to a file. Then on next service restart, you can simply read that from the file.

If you need persistent data, then put it into database
https://developer.android.com/guide/topics/data/data-storage
or try some object database like http://objectbox.io/

So you're afraid reading from the file would take along time due to its size, the number and size of the rows (the inner arrays).
I think it might be worthy to stop for a minute and ask yourself whether you need all this data at once. Maybe you only need a portion of it at any given time and there are scenarios in which you don't use some (or maybe most) of the data? If this is likely, I would suggest that you'll compute the data on demand, when required, and only keep a memory based cache for future demand in the current session.
Otherwise, if you do need all the data at a given time, you have a trade-off here. Trade-off between size on disk and processing time. You can shrink the data using some algorithm, but it would be at the expense of the processing time. On the hand, you can just serialize your object of data and save it to disk as is. Less time, more disk space.
Another solution for your scenario, could be, to just use a DB and a cursor (room on top sqlite). I don't exactly know what it is that you're trying to do, but your arrays can easily be modeled into a DB. Model a single row as you'd like and add to that model the outer index of the array. Then save the models into the DB, potentially making the outer index field the primary key if the DB.
Regardless of the things I wrote, try to think if you really need this data persistent on your client, maybe you can store it at the server side? If so, there are other storage and access solutions which are not included at the Android client side.

Thank you all for answering this question.
This is what I have finally settled for:
Instead of using the structure as part of the app, I made this into a
tool, which will prepare data to be used with the main app. In doing
so, it also stopped the concern regarding service restart.
This tool will first read all the strings from input file(s).
Then put all of them into the structure one at a time.(This will be
the part which I was having doubts, and asked the question about.
Since all the data is into the structure here, as soon as program
terminates, this structured data is unusable.)
Now, I prepared another structure for putting this data into file,
and put all this data into file so that I do not need to read to all
input file again and again, but only few lines.
Then I thought, why spend time "read"ing files while I can hard code
it into my app. So, as final step of this preprocessing tool, I made
it into a class which has switch(input){case X: return Y}.
Now I will just have to put this class into the app I wanted to make.
I know this all sounds very abstract, even stretching the concept of abstract, if you want to know details, please let me know. I am also including link of my "tool". Please visit and let me know if there would have been some better way.
P.S. There could be errors in this tool yet, which if you find, let me know to fix them.
P.P.S.
link: Kompressor Tool

Use multi threading to read files/processing using Java?

So I haven't really done any serious multithreading before( with the exception of the typical for-loop textbook example) so I thought I might give it a try. The task that I am trying to accomplish is the following:
Read an identification code from a file called ids.txt
Search for that identification code in a separate file called sequence.txt
Once identification is found, extract the string that follows the id.
Create an object of type DataSequence (which encapsulates the identification code and the extracted sequence) and add it to an ArrayList.
Repeat for 3000+ ids.
I have tried this the "regular" way within a single thread but the process is way too slow.How can I approach this issue in a multi-threaded fashion ?

Without seeing profiling data, it's hard to know what to recommend. But as a blind guess, I'd say that repeatedly opening, searching, and closing sequence.txt is taking most of the time. If this is guess is accurate, then the biggest improvement (by far) would be to find a way to process sequence.txt only once. The easiest way to do that would be to read in the relevant information from the file into memory, building a hash map from id to the string that follows it. The entire file is only 53.3 MB, so this is an eminently reasonable approach. Then as you process ids.txt, you only need to look up the relevant string from the map—a very quick operation.
An alternative would be to use the java.nio classes to create a memory-mapped file for sequence.txt.
I'd be hesitant about looking to multithreading to improve what seems to be a disk-bound operation, particularly if the threads will all end up contending for access to the same file (even if it is only read access). This does not strike me as a good problem with which to learn multithreading techniques; the payoff is just not likely to be there.

Multi-threading could be an overkill here. try the following algorithmic approach.
1. Open the file ids.txt in read mode
2. Declare a HashMap for storing key-value pair
2. Loop till end of the file
2A. Read a line as a string
2B. Parse the line as id (key) and rest of the line (value) to store in the HashMap object
3. Now search using the HashMap as desired or do whatever you need with this.
Note: 2A and 2B can be put in two different tasks for two different threads in a producer-consumer framework of design.

Searching for regex patterns on a 30GB XML dataset. Making use of 16gb of memory

I currently have a Java SAX parser that is extracting some info from a 30GB XML file.
Presently it is:
reading each XML node
storing it into a string object,
running some regexex on the string
storing the results to the database
For several million elements. I'm running this on a computer with 16GB of memory, but the memory is not being fully utilized.
Is there a simple way to dynamically 'buffer' about 10gb worth of data from the input file?
I suspect I could manually take a 'producer' 'consumer' multithreaded version of this (loading the objects on one side, using them and discarding on the other), but damnit, XML is ancient now, are there no efficient libraries to crunch em?

Just to cover the bases, is Java able to use your 16GB? You (obviously) need to be on a 64-bit OS, and you need to run Java with -d64 -XMx10g (or however much memory you want to allocate to it).
It is highly unlikely memory is a limiting factor for what you're doing, so you really shouldn't see it fully utilized. You should be either IO or CPU bound. Most likely, it'll be IO. If it is, IO, make sure you're buffering your streams, and then you're pretty much done; the only thing you can do is buy a faster harddrive.
If you really are CPU-bound, it's possible that you're bottlenecking at regex rather than XML parsing.
See this (which references this)
If your bottleneck is at SAX, you can try other implementations. Off the top of my head, I can think of the following alternatives:
StAX (there are multiple implementations; Woodstox is one of the fastest)
Javolution
Roll your own using JFlex
Roll your own ad hoc, e.g. using regex
For the last two, the more constrained is your XML subset, the more efficient you can make it.
It's very hard to say, but as others mentioned, an XML-native database might be a good alternative for you. I have limited experience with those, but I know that at least Berkeley DB XML supports XPath-based indices.

First, try to find out what's slowing you down.
How much faster is the parser when you parse from memory?
Does using a BufferedInputStream with a large size help?
Is it easy to split up the XML file? In general, shuffling through 30 GiB of any kind of data will take some time, since you have to load it from the hard drive first, so you are always limited by the speed of this. Can you distribute the load to several machines, maybe by using something like Hadoop?

No Java experience, sorry, but maybe you should change the parser? SAX should work sequentially and there should be no need to buffer most of the file ...

SAX is, essentially, "event driven", so the only state you should be holding on to from element to element is state that relevant to that element, rather than the document as a whole. What other state are you maintaining, and why? As each "complete" node (or set of nodes) comes by, you should be discarding them.

I don't really understand what you're trying to do with this huge amount of XML, but I get the impression that
using XML was wrong for the data stored
you are buffering way beyond what you should do (and you are giving up all advantages of SAX parsing by doing so)
Apart from that: XML is not ancient and in massive and active use. What do you think all those interactive web sites are using for their interactive elements?

Are you being slowed down by multiple small commits to your db? Sounds like you would be writing to the db almost all the time from your program and making sure you don't commit too often could improve performance. Possibly also preparing your statements and other standard bulk processing tricks could help
Other than this early comment, we need more info - do you have a profiler handy that can scrape out what makes things run slowly

You can use the Jibx library, and bind your XML "nodes" to objects that represent them. You can even overload an ArrayList, then when x number of objects are added, perform the regexes all at once (presumably using the method on your object that performs this logic) and then save them to the database, before allowing the "add" method to finish once again.
Jibx is hosted on SourceForge: Jibx
To elaborate: you can bind your XML as a "collection" of these specialized String holders. Because you define this as a collection, you must choose what collection type to use. You can then specify your own ArrayList implementation.
Override the add method as follows (forgot the return type, assumed void for example):
public void add(Object o) {
super.add(o);
if(size() > YOUR_DEFINED_THRESHOLD) {
flushObjects();
}
}
YOUR_DEFINED_THRESHOLD
is how many objects you want to store in the arraylist until it has to be flushed out to the database. flushObjects(); is simply the method that will perform this logic. The method will block the addition of objects from the XML file until this process is complete. However, this is ok, the overhead of the database will probably be much greater than file reading and parsing anyways.

I would suggest to first import your massive XML file into a native XML database (such as eXist if you are looking for open source stuff, never tested it myself), and then perform iterative paged queries to process your data small chunks at a time.

You may want to try Stax instead of SAX, I hear it's better for that sort of thing (I haven't used it myself).

If the data in the XML is order independent, can you multi-thread the process to split the file up or run multiple processes starting in different locations in the file? If you're not I/O bound that should help speed it along.

Advice on handling large data volumes

So I have a "large" number of "very large" ASCII files of numerical data (gigabytes altogether), and my program will need to process the entirety of it sequentially at least once.
Any advice on storing/loading the data? I've thought of converting the files to binary to make them smaller and for faster loading.
Should I load everything into memory all at once?
If not, is opening what's a good way of loading the data partially?
What are some Java-relevant efficiency tips?

So then what if the processing requires jumping around in the data for multiple files and multiple buffers? Is constant opening and closing of binary files going to become expensive?
I'm a big fan of 'memory mapped i/o', aka 'direct byte buffers'. In Java they are called Mapped Byte Buffers are are part of java.nio. (Basically, this mechanism uses the OS's virtual memory paging system to 'map' your files and present them programmatically as byte buffers. The OS will manage moving the bytes to/from disk and memory auto-magically and very quickly.
I suggest this approach because a) it works for me, and b) it will let you focus on your algorithm and let the JVM, OS and hardware deal with the performance optimization. All to frequently, they know what is best more so than us lowly programmers. ;)
How would you use MBBs in your context? Just create an MBB for each of your files and read them as you see fit. You will only need to store your results. .
BTW: How much data are you dealing with, in GB? If it is more than 3-4GB, then this won't work for you on a 32-bit machine as the MBB implementation is defendant on the addressable memory space by the platform architecture. A 64-bit machine & OS will take you to 1TB or 128TB of mappable data.
If you are thinking about performance, then know Kirk Pepperdine (a somewhat famous Java performance guru.) He is involved with a website, www.JavaPerformanceTuning.com, that has some more MBB details: NIO Performance Tips and other Java performance related things.

You might want to have a look at the entries in the Wide Finder Project (do a google search for "wide finder" java).
The Wide finder involves reading over lots of lines in log files, so look at the Java implementations and see what worked and didn't work there.

You could convert to binary, but then you have 1+ something copies of the data, if you need to keep the original around.
It may be practical to build some kind of index on top of your original ascii data, so that if you need to go through the data again you can do it faster in subsequent times.
To answer your questions in order:
Should I load everything into memory all at once?
Not if don't have to. for some files, you may be able to, but if you're just processing sequentially, just do some kind of buffered read through the things one by one, storing whatever you need along the way.
If not, is opening what's a good way of loading the data partially?
BufferedReaders/etc is simplest, although you could look deeper into FileChannel/etc to use memorymapped I/O to go through windows of the data at a time.
What are some Java-relevant efficiency tips?
That really depends on what you're doing with the data itself!

Without any additional insight into what kind of processing is going on, here are some general thoughts from when I have done similar work.
Write a prototype of your application (maybe even "one to throw away") that performs some arbitrary operation on your data set. See how fast it goes. If the simplest, most naive thing you can think of is acceptably fast, no worries!
If the naive approach does not work, consider pre-processing the data so that subsequent runs will run in an acceptable length of time. You mention having to "jump around" in the data set quite a bit. Is there any way to pre-process that out? Or, one pre-processing step can be to generate even more data - index data - that provides byte-accurate location information about critical, necessary sections of your data set. Then, your main processing run can utilize this information to jump straight to the necessary data.
So, to summarize, my approach would be to try something simple right now and see what the performance looks like. Maybe it will be fine. Otherwise, look into processing the data in multiple steps, saving the most expensive operations for infrequent pre-processing.
Don't "load everything into memory". Just perform file accesses and let the operating system's disk page cache decide when you get to actually pull things directly out of memory.

This depends a lot on the data in the file. Big mainframes have been doing sequential data processing for a long time but they don't normally use random access for the data. They just pull it in a line at a time and process that much before continuing.
For random access it is often best to build objects with caching wrappers which know where in the file the data they need to construct is. When needed they read that data in and construct themselves. This way when memory is tight you can just start killing stuff off without worrying too much about not being able to get it back later.

You really haven't given us enough info to help you. Do you need to load each file in its entiretly in order to process it? Or can you process it line by line?
Loading an entire file at a time is likely to result in poor performance even for files that aren't terribly large. Your best bet is to define a buffer size that works for you and read/process the data a buffer at a time.

I've found Informatica to be an exceptionally useful data processing tool. The good news is that the more recent versions even allow Java transformations. If you're dealing with terabytes of data, it might be time to pony up for the best-of-breed ETL tools.
I'm assuming you want to do something with the results of the processing here, like store it somewhere.

If your numerical data is regularly sampled and you need to do random access consider to store them in a quadtree.

I recommend strongly leveraging Regular Expressions and looking into the "new" IO nio package for faster input. Then it should go as quickly as you can realistically expect Gigabytes of data to go.

If at all possible, get the data into a database. Then you can leverage all the indexing, caching, memory pinning, and other functionality available to you there.

If you need to access the data more than once, load it into a database. Most databases have some sort of bulk loading utility. If the data can all fit in memory, and you don't need to keep it around or access it that often, you can probably write something simple in Perl or your favorite scripting language.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.