Searching for a row in a file with Java Spring Batch

I have two files, file A and a history file. I have to iterate over each row in file A and look up its entry in the history file; if the entry is found, I need to write it to a new file, A_History. How can I achieve this in Java?
Time and performance are critical factors, and I would like to achieve this without a large memory footprint.
I searched Spring Batch but nothing seems to be available there. Any ideas?
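Outside of Spring Batch, one plain-Java sketch of the lookup (assuming the history rows, but not necessarily file A, fit in memory as a HashSet; the file names here are hypothetical):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Stream;

public class HistoryMatcher {

    // Keeps only the rows of file A that also appear in the history keys.
    static List<String> matchRows(List<String> rowsA, Set<String> historyKeys) {
        List<String> matched = new ArrayList<>();
        for (String row : rowsA) {
            if (historyKeys.contains(row)) {
                matched.add(row);
            }
        }
        return matched;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file names; adjust to the real inputs.
        Path fileA = Path.of("A.txt");
        Path history = Path.of("History.txt");
        Path out = Path.of("A_History.txt");
        if (!Files.exists(fileA) || !Files.exists(history)) {
            return; // sketch: nothing to do without the input files
        }

        // Load the history keys once; file A is then streamed line by line,
        // so memory use is bounded by the size of the history keys.
        Set<String> historyKeys = new HashSet<>();
        try (Stream<String> lines = Files.lines(history, StandardCharsets.UTF_8)) {
            lines.forEach(historyKeys::add);
        }
        try (Stream<String> rowsA = Files.lines(fileA, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (String row : (Iterable<String>) rowsA.filter(historyKeys::contains)::iterator) {
                writer.write(row);
                writer.newLine();
            }
        }
    }
}
```

If even the history keys do not fit in memory, the usual fallback is to sort both files externally and stream a merge join, which keeps memory constant at the cost of the sort.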

Related

Multiple readers in single job in single step using spring batch

I am a newbie in Spring Batch. I have a use case in which I have to read files from a specific folder and write those files into a DB.
For example, I have files in a folder like this:
company_group
|- my_company_group.json
|- my_company_group_alternate_id.json
|- sg_company_group.json
|- sg_company_group_alternate_id.json
Note: sg = Singapore, my = Malaysia
Now, I want to read these files in the following order:
SG files should be read before MY files.
For each country, the alternate file should come first.
For example,
sg_company_group_alternate_id.json
sg_company_group.json
And the same for the MY files.
Currently, I'm reading all files by writing a custom MultiResourcePartitioner and sorting my files in the order I mentioned above.
There will be one reader and one writer per file.
There will be one job.
Now, the problem is that I have a step with the custom partitioner mentioned above: it gets all the files and sorts them, but they all go through a single reader. I want a separate reader per file.
What I mean is: in one job I have a step which loads all files. In this step, one file should be read and written to the DB, then the same should repeat for the next file, all within the same step.
As per my understanding, Spring Batch does not allow multiple readers in one step.
Is there any workaround?
Thanks.
I would recommend creating a job instance per file, meaning you pass the file as an identifying job parameter. This has at least two major benefits:
Scaling: you can run multiple jobs in parallel, each job processing a different file
Fault tolerance: if a job fails, you restart only the failed job, without affecting the other files
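Launching a job instance per file could look roughly like this (a sketch only: it assumes a Spring context with a configured Job and JobLauncher bean, and the parameter name "input.file" is arbitrary):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class PerFileJobLauncher {

    private final Job importFileJob;   // assumed: a job reading one file into the DB
    private final JobLauncher jobLauncher;

    public PerFileJobLauncher(Job importFileJob, JobLauncher jobLauncher) {
        this.importFileJob = importFileJob;
        this.jobLauncher = jobLauncher;
    }

    public void launchAll(Path folder) throws Exception {
        try (Stream<Path> files = Files.list(folder)) {
            for (Path file : (Iterable<Path>) files.sorted()::iterator) {
                // The file path is an *identifying* parameter, so each file
                // gets its own job instance and can be restarted independently.
                JobParameters params = new JobParametersBuilder()
                        .addString("input.file", file.toString())
                        .toJobParameters();
                jobLauncher.run(importFileJob, params);
            }
        }
    }
}
```

The sorting shown is plain lexicographic; you would replace it with a comparator implementing the SG-before-MY, alternate-file-first ordering.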

Spring Batch: Create new steps at job runtime

Context: I am working on a Spring Batch pipeline which will upload data to a database. I have already figured out how to read flat .csv files and write items from them with JdbcBatchItemWriter. The pipeline I am working on must read data from a Zip archive which contains multiple .csv files of different types. I'd like to have archive downloading and inspecting as the first two steps of the job. I do not have enough disk space to unpack the whole downloaded archive. Instead of unpacking, I inspect the zip file's content to determine the paths of the .csv files inside the Zip file system and their types. Inspecting the zip file makes it easy to obtain an InputStream of the corresponding csv file. After that, reading and uploading (directly from the zip) all discovered .csv files will be executed in separate steps of the job.
Question: Is there any way to dynamically populate new Steps for each of discovered csv entries of zip at job runtime as result of inspect step?
Tried: I know that Spring Batch has conditional flow, but it seems that this technique allows configuring only a static number of steps that are well defined before job execution. In my case, the number of steps (csv files) and the reader types are discovered at the second step of the job.
Also, there is a MultiResourceItemReader which allows reading multiple resources sequentially. But I'm going to read different types of csv files with their appropriate readers. Moreover, I'd like to have "filewise" step encapsulation such that if one file's loading step fails, the others will still be executed.
The similar question How to create dynamic steps in Spring Batch does not have a suitable solution for me, as the answer supposes step creation before the job runs, while I need to add steps as a result of the second step of the job.
You could use partitioned steps:
Pass a variable containing the list of csv files as resources to the JobExecutionContext during your inspect step.
In the partition method, retrieve the list of csv files and create a partition for each one.
The step will then be executed once for each partition created.
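A minimal sketch of such a partitioner (assuming the inspect step made the csv entry names available as a List of strings; the key names "csvFiles" and "csvFile" are arbitrary):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class CsvEntryPartitioner implements Partitioner {

    // Assumed to be populated from the inspect step's output
    // (e.g. retrieved from the JobExecutionContext under "csvFiles").
    private final List<String> csvFiles;

    public CsvEntryPartitioner(List<String> csvFiles) {
        this.csvFiles = csvFiles;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int i = 0;
        for (String csv : csvFiles) {
            // One partition per csv entry; the worker step reads the entry
            // name back via step scope, e.g. #{stepExecutionContext['csvFile']}.
            ExecutionContext context = new ExecutionContext();
            context.putString("csvFile", csv);
            partitions.put("partition-" + i++, context);
        }
        return partitions;
    }
}
```

Each partition then runs as its own step execution, which also gives the "filewise" failure isolation asked for above: one failing partition does not prevent the others from running.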

Use the Checkstyle API without providing a java.io.File

Is there a way to use the Checkstyle API without providing a java.io.File?
Our app already has the file contents in memory (they aren't read from a local file, but from another source), so it seems inefficient to me to have to create a temporary file and write the in-memory contents to it just to throw it away.
I've looked into using in-memory file systems to circumvent this, but it seems java.io.File is always bound to the actual file system. Obviously I have no way of testing whether or not performance would be better; I just wanted to ask if Checkstyle supports such a use case.
There is no clean way to do this. I recommend creating an issue at Checkstyle, expanding more on your process and asking for a way to integrate it with Checkstyle.
Files are needed for our support of caching, as we skip reading and processing a file if it is in the cache and has not changed since the last run. The cache process is intertwined, which is why no non-file route exists. Even without a file, Checkstyle processes file contents through FileText, which again needs a File (just for the file name reference) along with the lines of the file in a List.

writing a file finder (java)

I am writing an application which will search a computer for files with a particular filename extension (JPG, for example). Input data: "D:", ".JPG". Output: a txt file with the results (file paths). I know one simple recursive algorithm, but maybe there is something better. So, can you suggest an efficient algorithm to traverse the file directory? Also, I want to use multithreading to improve performance. But how many threads should I use? Using one thread per directory would be stupid.
The recursive option you name is the only way to go, unless you want to get your hands dirty with the file system. I suspect you don't.
Regarding thread performance, your best choice is to make the number of threads configurable, create some sample directories, and measure performance for each setting.
By the way, most file-finders create an index of files. They scan the disc on a schedule, and update a file which contains the relevant information about the files and directories on disk. The file is in a format designed to facilitate searching. This index file is used to perform actual searches. If you're planning on repeatedly running this search against the same directory, you should do this.
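For the recursive traversal itself, a sketch using NIO's Files.walk, which performs the depth-first recursion for you (the case-insensitive extension match is an assumption about the requirements):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FileFinder {

    // Recursively collects all regular files under root whose name ends
    // with the given extension (compared case-insensitively).
    static List<Path> findByExtension(Path root, String extension) {
        String suffix = extension.toLowerCase(Locale.ROOT);
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(suffix))
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

On the threading question: a directory scan is usually disk-bound rather than CPU-bound, so adding threads often helps little on spinning disks; as the answer above says, make the thread count configurable and measure.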

Java content APIs for a large number of files

Does anyone know any Java libraries (open source) that provide features for handling a large number of files (write/read) on disk? I am talking about 2-4 million files (most of them are pdf and ms docs). It is not a good idea to store all files in a single directory. Instead of re-inventing the wheel, I am hoping that this has been done by many people already.
Features I am looking for:
1) Able to write/read files from disk
2) Able to create random directories/sub-directories for new files
3) Provide versioning/auditing (optional)
I was looking at the JCR API and it looks promising, but it starts with a workspace and I am not sure what the performance will be when there are many nodes.
Edit: JCR does look pretty good. I'd suggest trying it out to see how it actually performs for your use case.
If you're running your system on Windows and noticed a horrible n^2 performance hit at some point, you're probably running up against the performance hit incurred by automatic 8.3 filename generation. Of course, you can disable 8.3 filename generation, but as you pointed out, it would still not be a good idea to store large numbers of files in a single directory.
One common strategy I've seen for handling large numbers of files is to create directories for the first n letters of the filename. For example, document.pdf would be stored in d/o/c/u/m/document.pdf. I don't recall ever seeing a library to do this in Java, but it seems pretty straightforward. If necessary, you can create a database to store the lookup table (mapping keys to the uniformly-distributed random filenames), so you won't have to rebuild your index every time you start up. If you want to get the benefit of automatic deduplication, you could hash each file's content and use that checksum as the filename (but you would also want to add a check so you don't accidentally discard a file whose checksum matches an existing file even though the contents are actually different).
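The first-letters bucketing scheme described above is only a few lines of code (a sketch; the class name and the shard depth of 4 are arbitrary choices):

```java
import java.nio.file.Path;

public class ShardedPathResolver {

    // Maps a filename to a sharded path using its first `depth` characters
    // as nested directory names, e.g. "document.pdf" -> d/o/c/u/document.pdf.
    static Path shardedPath(Path baseDir, String filename, int depth) {
        Path dir = baseDir;
        int n = Math.min(depth, filename.length());
        for (int i = 0; i < n; i++) {
            dir = dir.resolve(String.valueOf(filename.charAt(i)));
        }
        return dir.resolve(filename);
    }
}
```

With real document names the first letters may cluster badly; sharding on a hash of the content (or of the name) instead gives uniformly distributed buckets, which is what the lookup-table suggestion above is about.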
Depending on the sizes of the files, you might also consider storing the files themselves in a database--if you do this, it would be trivial to add versioning, and you wouldn't necessarily have to create random filenames because you could reference them using an auto-generated primary key.
Combine the functionality in the java.io package with your own custom solution.
The java.io package can write and read files from disk and create arbitrary directories or sub-directories for new files. There is no external API required.
The versioning or auditing would have to be provided with your own custom solution. There are many ways to handle this, and you probably have a specific need that needs to be filled. Especially if you're concerned about the performance of an open-source API, it's likely that you will get the best result by simply coding a solution that specifically fits your needs.
It sounds like your module should scan all the files on startup and form an index of everything that's available. Based on the method used for sharing and indexing these files, it can rescan the files every so often or you can code it to receive a message from some central server when a new file or version is available. When someone requests a file or provides a new file, your module will know exactly how it is organized and exactly where to get or put the file within the directory tree.
It seems that it would be far easier to just engineer a solution specific to your needs.
