Read a CSV file in Java

I wrote a program that reads a CSV file and puts it into a TableModel. My problem is that I want to expand the program so that if the CSV file is changed from outside, my TableModel gets updated with the new values.
I could now write a scheduler so that a thread sleeps for about a minute and checks every minute whether the timestamp of the file has changed; if so, it would read the file again. But I don't know what happens to the whole program if I use a scheduler, because this little piece of software will be part of a much, much bigger application running on JDK 6. So I am looking for a solution that is performant, independent of the bigger software, and gets the changes into the TableModel.
Can someone help out?

The java.nio.file package now contains the Watch Service API, which, in short:
This API enables you to register a directory (or directories) with the
watch service. When registering, you tell the service which types of
events you are interested in: file creation, file deletion, or file
modification. When the service detects an event of interest, it is
forwarded to the registered process. The registered process has a
thread (or a pool of threads) dedicated to watching for any events it
has registered for. When an event comes in, it is handled as needed.
See reference here.
Oh! This API is only available from JDK 7 (onwards).
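As a minimal sketch of how the watch loop could look (the directory path and the data.csv file name are placeholders; refresh your TableModel where indicated):

import java.nio.file.*;

public class CsvWatcher {
    public static void main(String[] args) throws Exception {
        Path watchedDir = Paths.get("path/to/dir"); // directory that contains the CSV file
        WatchService watcher = FileSystems.getDefault().newWatchService();
        watchedDir.register(watcher, StandardWatchEventKinds.ENTRY_MODIFY);
        while (true) {
            WatchKey key = watcher.take(); // blocks until an event arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                Path changed = (Path) event.context(); // file name relative to watchedDir
                if (changed.toString().equals("data.csv")) {
                    // re-read the CSV file and update the TableModel here
                }
            }
            key.reset(); // re-arm the key so further events are delivered
        }
    }
}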

OpenCSV is a good way to read a CSV file in Java. If you are using Maven you can add its dependency; otherwise download its jar from the web.
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.List;

import com.opencsv.CSVReader;
import com.opencsv.bean.ColumnPositionMappingStrategy;
import com.opencsv.bean.CsvToBean;

@SuppressWarnings({"rawtypes", "unchecked"})
public void readCsvFile() {
    try {
        File fileEntry = new File("path of your file");
        CsvToBean csv = new CsvToBean();
        // separator ',', quote character '"', skip the first (header) line
        CSVReader csvReader = new CSVReader(new FileReader(fileEntry), ',', '"', 1);
        List list = csv.parse(setColumMapping(), csvReader);
        // list now holds LabReportSampleData instances
    } catch (IOException e) {
        e.printStackTrace();
    }
}

// Maps the CSV columns to the fields of your mapping class:
// column 0 maps to the "degree" field, column 1 to "radian", and so on.
@SuppressWarnings({"rawtypes", "unchecked"})
private static ColumnPositionMappingStrategy setColumMapping() {
    ColumnPositionMappingStrategy strategy = new ColumnPositionMappingStrategy();
    strategy.setType(LabReportSampleData.class);
    String[] columns =
        new String[] {"degree", "radian", "shearStress", "shearingStrain", "sourceUnit"};
    strategy.setColumnMapping(columns);
    return strategy;
}

Create n number of task and execute them in parallel in Spring Batch

I have a requirement to read CSV files from about 100 S3 folders. In a single execution only some of the folders may contain files, e.g. 60 folders have files. I need to process those 60 files and publish their data to a Kafka topic. The job needs to be scheduled every 4 hours, and the CSV data can range from a few small records to huge files of around 6 GB.
I have to develop this in Java and deploy it to AWS.
I am thinking of using Spring Batch, with steps like these:
1. Traverse all 100 S3 folders and identify which folders have files, e.g. 60 folders.
2. Create that many jobs/tasks, e.g. 60 jobs, and execute them in parallel.
Restriction: I should not use AWS EMR for this process.
Please suggest a good approach to handle this with the best performance and minimal failed data processing.
For your use case, if all files are of the same type (i.e. they can be processed one by one), you can use the option below.
Using ResourceLoader, we can read files in S3 in an ItemReader just like any other resource. This helps read the S3 files in chunks instead of loading an entire file into memory.
With the dependencies injected for the ResourceLoader and the AmazonS3 client, configure your reader as below.
Replace the values for sourceBucket and sourceObjectPrefix as needed.
@Autowired
private ResourceLoader resourceLoader;

@Autowired
private AmazonS3 amazonS3Client;

// READER
@Bean(destroyMethod = "")
@StepScope
public SynchronizedItemStreamReader<Employee> employeeDataReader() {
    SynchronizedItemStreamReader<Employee> synchronizedItemStreamReader = new SynchronizedItemStreamReader<>();
    List<Resource> resourceList = new ArrayList<>();
    String sourceBucket = yourBucketName;
    String sourceObjectPrefix = yourSourceObjectPrefix;
    log.info("sourceObjectPrefix::" + sourceObjectPrefix);
    ListObjectsRequest listObjectsRequest = new ListObjectsRequest()
            .withBucketName(sourceBucket)
            .withPrefix(sourceObjectPrefix);
    ObjectListing sourceObjectsListing;
    do {
        sourceObjectsListing = amazonS3Client.listObjects(listObjectsRequest);
        for (S3ObjectSummary sourceFile : sourceObjectsListing.getObjectSummaries()) {
            // DOT is a String constant holding "."
            if (!(sourceFile.getSize() > 0)
                    || (!sourceFile.getKey().endsWith(DOT.concat("csv")))) {
                // Skip if the file is empty or its extension is not "csv"
                continue;
            }
            log.info("Reading " + sourceFile.getKey());
            resourceList.add(resourceLoader.getResource("s3://".concat(sourceBucket).concat("/")
                    .concat(sourceFile.getKey())));
        }
        listObjectsRequest.setMarker(sourceObjectsListing.getNextMarker());
    } while (sourceObjectsListing.isTruncated());
    Resource[] resources = resourceList.toArray(new Resource[resourceList.size()]);
    MultiResourceItemReader<Employee> multiResourceItemReader = new MultiResourceItemReader<>();
    multiResourceItemReader.setName("employee-multiResource-Reader");
    multiResourceItemReader.setResources(resources);
    multiResourceItemReader.setDelegate(employeeFileItemReader());
    synchronizedItemStreamReader.setDelegate(multiResourceItemReader);
    return synchronizedItemStreamReader;
}

@Bean
@StepScope
public FlatFileItemReader<Employee> employeeFileItemReader() {
    FlatFileItemReader<Employee> reader = new FlatFileItemReader<Employee>();
    reader.setLinesToSkip(1);
    reader.setLineMapper(new DefaultLineMapper() {
        {
            setLineTokenizer(new DelimitedLineTokenizer() {
                {
                    setNames(Employee.fields());
                }
            });
            setFieldSetMapper(new BeanWrapperFieldSetMapper<Employee>() {
                {
                    setTargetType(Employee.class);
                }
            });
        }
    });
    return reader;
}
For each file/resource, the MultiResourceItemReader delegates to the FlatFileItemReader configured above.
For the itemProcessor part, you can also scale using the async processor/writer approach as needed, as sketched below.
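A hedged sketch of what that could look like with spring-batch-integration's AsyncItemProcessor and AsyncItemWriter (the employeeProcessor, employeeWriter and taskExecutor beans are assumed to exist in your configuration):

@Bean
public AsyncItemProcessor<Employee, Employee> asyncEmployeeProcessor(
        ItemProcessor<Employee, Employee> employeeProcessor, TaskExecutor taskExecutor) {
    AsyncItemProcessor<Employee, Employee> asyncProcessor = new AsyncItemProcessor<>();
    asyncProcessor.setDelegate(employeeProcessor); // your existing processor does the real work
    asyncProcessor.setTaskExecutor(taskExecutor);  // items are processed on this executor's threads
    return asyncProcessor;
}

@Bean
public AsyncItemWriter<Employee> asyncEmployeeWriter(ItemWriter<Employee> employeeWriter) {
    AsyncItemWriter<Employee> asyncWriter = new AsyncItemWriter<>();
    asyncWriter.setDelegate(employeeWriter);       // unwraps the Futures produced by the async processor
    return asyncWriter;
}

The step would then use these beans in place of the plain processor and writer.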
Here is one possible approach for you to think about. (FYI, I have done file processing with spring-batch and threading using the strategy I am outlining here, but that code belongs to my company and I cannot share it.)
I would suggest you read these articles to understand how to scale up using spring-batch.
First, spring-batch documentation
https://docs.spring.io/spring-batch/docs/current/reference/html/scalability.html
Next, a good post from stackoverflow itself.
Best Spring batch scaling strategy
After reading both and understanding all the different approaches, I would suggest you concentrate on partitioning:
https://docs.spring.io/spring-batch/docs/current/reference/html/scalability.html#partitioning
This is the technique I used as well. In your case, the partitioner can spawn a thread for each file.
You may need to maintain state, i.e. whether a file has been assigned to a thread or not; 'Processing' and 'Completed Processing' could also be states in your code.
This depends on your requirements. (I had a whole set of states maintained in a singleton, which all threads would update after picking up a file, finishing a file, and so on.)
You also need to think about finishing each file before the 4-hour window is over. You may be able to keep the file as is, or you may want to move it to a new location or rename it while processing; again, it depends on your requirements, but you need to think about this scenario. (In my case, I renamed the file by adding a unique suffix made of a timestamp in milliseconds, so it could not be overwritten by a new file.)
Finally, here is a sample from a blog that processes 5 CSV files through a partitioner.
You can start from this sample.
https://www.baeldung.com/spring-batch-partitioner
And search for more samples to see if this is the approach you want to take. Good luck.
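To make the partitioning idea above more concrete, here is a rough, hypothetical sketch of a Partitioner that creates one partition per discovered file; the csvFileKeys list and the "fileKey" context key are illustrative names, not part of any library (imports from org.springframework.batch.core.partition.support and org.springframework.batch.item are assumed):

@Bean
@StepScope
public Partitioner filePartitioner(List<String> csvFileKeys) {
    return gridSize -> {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int index = 0;
        for (String key : csvFileKeys) {
            ExecutionContext context = new ExecutionContext();
            context.putString("fileKey", key); // the worker step reads this to know which file it owns
            partitions.put("partition" + index++, context);
        }
        return partitions;
    };
}

Each worker step then reads its assigned file key from the step execution context.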

How to lock folder in java before writing file?

I have a use case where multiple threads can write files to a folder. At a given point in time I want to identify the latest file in that folder.
Since I cannot use the timestamp, as it can be the same for more than one file in the folder, I want to lock the folder, generate a sequence number by counting the number of files in the folder, write the new file using that sequence number, and then release the lock. Is this possible in Java?
Similarly, when reading, take the file with the largest sequence number.
The chance of concurrently writing files to the folder is low, so performance won't be an issue.
You can't use FileLock on a directory so you will have to handle locking in Java. You could do something like:
private final Object lock = new Object();

public void writeToNext(String dirPath) {
    synchronized (lock) {
        File dir = new File(dirPath);
        List<File> files = Arrays.asList(dir.listFiles(new FileFilter() {
            @Override
            public boolean accept(File pathname) {
                return !pathname.isDirectory();
            }
        }));
        int numFiles = files.size();
        String nextFile = dir.getAbsolutePath() + File.separator + (numFiles + 1) + ".txt"; // path for the new file
        System.out.println("Writing to " + nextFile);
        // TODO write to file
    }
}
Note
You could also implement your solution so that each write increments a counter somewhere and just use that to get the next value; only list and order the files to find the last one if the counter has not been initialized yet.
Using Java SE 7 or above:
The WatchService API allows tracking file operations (create, modify and delete) in a specified directory. In this scenario, create a watch service to track new files created in the specific folder. Each time a new file is created, a file-creation event is triggered and the process can perform some user-defined action.
The file already has a creation-time attribute (java.nio.file.attribute.BasicFileAttributes). It can be extracted as a java.nio.file.attribute.FileTime, which is in milliseconds or can be converted to a more specific java.util.concurrent.TimeUnit (allowing nanosecond precision). This gives a chance to be more specific about which file is the newest.
Also, there is an option to create a custom user-defined file attribute for any file. The attribute is defined as a key-value pair, and this unique attribute value can be associated with a file to identify whether it is the latest. The following APIs allow creating and reading a custom file attribute: java.nio.file.attribute.UserDefinedFileAttributeView and Files.getFileAttributeView().
I think using the above APIs and methods one can create an application to track the latest files created in a specified folder and perform a required action. Note there is no locking mechanism involved if one is using these APIs.
EDIT (included):
Using a collection to retrieve latest file:
A thread-safe collection can be used to store the filenames (or file path) and retrieve them LIFO (last-in-first-out). The watch service (or similar process) can store the filename of the (latest) file created in the folder to this collection. A read operation just gets the latest filename from this collection and work with it. One can consider java.util.concurrent.ConcurrentLinkedDeque or LinkedBlockingDeque based on requirement.
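A rough sketch of that combination, with illustrative names (LatestFileTracker, watchedDir), could look like this:

import java.nio.file.*;
import java.util.concurrent.ConcurrentLinkedDeque;

public class LatestFileTracker implements Runnable {
    private final ConcurrentLinkedDeque<Path> latestFiles = new ConcurrentLinkedDeque<>();
    private final Path watchedDir;

    public LatestFileTracker(Path watchedDir) {
        this.watchedDir = watchedDir;
    }

    @Override
    public void run() {
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            watchedDir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            while (!Thread.currentThread().isInterrupted()) {
                WatchKey key = watcher.take();
                for (WatchEvent<?> event : key.pollEvents()) {
                    // the newest file goes to the front of the deque
                    latestFiles.addFirst(watchedDir.resolve((Path) event.context()));
                }
                key.reset();
            }
        } catch (Exception e) {
            Thread.currentThread().interrupt();
        }
    }

    public Path latest() {
        return latestFiles.peekFirst(); // null if no file has been created yet
    }
}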
Use File.createNewFile() in a loop for writing, because it:
Atomically creates a new, empty file named by this abstract pathname if and only if a file with this name does not yet exist. The check for the existence of the file and the creation of the file if it does not exist are a single operation that is atomic with respect to all other filesystem activities that might affect the file.
Like this:
import java.io.*;
import java.util.*;

public class FileCreator {

    public static void main(String[] args) throws IOException {
        String creatorId = UUID.randomUUID().toString();
        File dir = new File("dir");
        for (int filesCreated = 0; filesCreated < 1000; filesCreated++) {
            File newFile;
            // Probe increasing indices until createNewFile() succeeds atomically.
            for (int fileIdx = dir.list().length; ; fileIdx++) {
                newFile = new File(dir, "file-" + fileIdx + ".txt");
                if (newFile.createNewFile()) {
                    break;
                }
            }
            try (PrintWriter pw = new PrintWriter(newFile)) {
                pw.println(creatorId);
            }
        }
    }
}
Another option would be Files.createFile(...). It throws an exception if the file already exists.
As for reading:
Similarly while reading take the file with largest sequence number.
What's the question here? Just take it.

Directory Listener with Commons IO - action can't be completed

I wrote an application that listens to a given folder, then logs events and writes to the database (action type, file name, content, date).
I based the application on the Producer Consumer pattern and used an ArrayBlockingQueue.
My problem is that when I add a file to this folder and later want to delete or modify it (this applies only to the first file created in that folder), an "action can't be completed" error pops up.
That is one problem; the other is that I would like to somehow skip the while loop in DbWriter, and I have no idea how to do it.
Thanks for all the answers
You cannot delete the file because in FileProcessor.getContent you create a FileReader and never close it, which causes the JVM to keep the file locked. To solve this, just close the FileReader after using it, like this:
public static String getContent(File file) throws IOException {
    FileReader fileReader = new FileReader(file);
    String content = IOUtils.toString(fileReader);
    fileReader.close();
    return content;
}
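The same method written with try-with-resources (Java 7+) closes the reader even if reading throws:

public static String getContent(File file) throws IOException {
    try (FileReader fileReader = new FileReader(file)) {
        return IOUtils.toString(fileReader);
    }
}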

Regarding stitching of multiple files into a single file

I work on query latencies and have a requirement where I have several files which contain data. I want to aggregate this data into a single file. I use a naive technique where I open each file and collect all the data in a global file. I do this for all the files, but it takes a long time. Is there a way to stitch the end of one file to the beginning of another and create a big file containing all the data? I think many people might have faced this problem before. Can anyone kindly help?
I suppose you are currently doing the opening and appending by hand; otherwise I do not know why it would take a long time to aggregate the data, especially since you describe the number of files using "multiple" and "several", which suggests it is not an enormous number.
Thus, I think you are just looking for a way to automate the opening and appending. In that case, you can use an approach similar to the one below. Note that this creates the output file, or overwrites it if it already exists, and then appends the contents of all specified files. If you want to call the method multiple times and append to the same file instead of overwriting it, an alternative is to use a FileWriter with true as the second constructor argument so it appends to an existing file.
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

void aggregateFiles(List<String> fileNames, String outputFile) {
    PrintWriter writer = null;
    try {
        writer = new PrintWriter(outputFile);
        for (String fileName : fileNames) {
            Path path = Paths.get(fileName);
            String fileContents = new String(Files.readAllBytes(path));
            writer.println(fileContents);
        }
    } catch (IOException e) {
        // Handle IOException
    } finally {
        if (writer != null) writer.close();
    }
}

List<String> files = new ArrayList<>();
files.add("f1.txt");
files.add("someDir/f2.txt");
files.add("f3.txt");
aggregateFiles(files, "output.txt");
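For the append-to-an-existing-file variant mentioned above, a sketch using java.nio.file could look like this (requires java.io.OutputStream, java.nio.file.Files, java.nio.file.Paths and java.nio.file.StandardOpenOption; the method name is illustrative):

void appendFiles(List<String> fileNames, String outputFile) throws IOException {
    try (OutputStream out = Files.newOutputStream(Paths.get(outputFile),
            StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
        for (String fileName : fileNames) {
            Files.copy(Paths.get(fileName), out); // streams each file without reading it fully into memory
            out.write(System.lineSeparator().getBytes());
        }
    }
}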

Can multiple threads write data into a file at the same time?

If you have ever used P2P downloading software, you know it can download a file with multiple threads while creating only one file, so I wonder how the threads write data into that file. Sequentially or in parallel?
Imagine that you want to dump a big database table to a file; how do you make this job faster?
You can use multiple threads writing to a file, e.g. a log file, but you have to coordinate your threads, as @Thilo points out. Either you need to synchronize file access and only write whole records/lines, or you need a strategy for allocating regions of the file to different threads, e.g. re-building a file with known offsets and sizes.
This is rarely done for performance reasons, because most disk subsystems perform best when written to sequentially and disk IO is the bottleneck. If the CPU work to create the record or line of text (or the network IO) is the bottleneck, it can help.
Imagine that you want to dump a big database table to a file; how do you make this job faster?
Writing it sequentially is likely to be the fastest.
Java nio package was designed to allow this. Take a look for example at http://docs.oracle.com/javase/1.5.0/docs/api/java/nio/channels/FileChannel.html .
You can map several regions of one file to different buffers, each buffer can be filled separately by a separate thread.
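A minimal sketch of the idea, using positioned writes on a FileChannel rather than memory mapping (the file name, offsets and sizes are arbitrary placeholders):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class RegionWriter {
    public static void main(String[] args) throws IOException, InterruptedException {
        try (FileChannel channel = FileChannel.open(Paths.get("out.bin"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            // each thread owns a distinct region of the file
            Thread first = new Thread(() -> writeAt(channel, 0, new byte[1024]));
            Thread second = new Thread(() -> writeAt(channel, 1024, new byte[1024]));
            first.start();
            second.start();
            first.join();
            second.join();
        }
    }

    private static void writeAt(FileChannel channel, long position, byte[] data) {
        try {
            // positioned write: does not move the channel's own position, safe to call from several threads
            channel.write(ByteBuffer.wrap(data), position);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}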
The synchronized keyword makes this possible. Try the code below, which I use in a similar context.
package hrblib;

import java.io.*;

public class FileOp {

    static int nStatsCount = 0;

    public static String getContents(String sFileName) {
        try {
            BufferedReader oReader = new BufferedReader(new FileReader(sFileName));
            String sLine, sContent = "";
            while ((sLine = oReader.readLine()) != null) {
                sContent += sContent.isEmpty() ? sLine : ("\r\n" + sLine);
            }
            oReader.close();
            return sContent;
        } catch (IOException oException) {
            throw new IllegalArgumentException("Invalid file path/File cannot be read: \n" + sFileName);
        }
    }

    public static void setContents(String sFileName, String sContent) {
        try {
            File oFile = new File(sFileName);
            if (!oFile.exists()) {
                oFile.createNewFile();
            }
            if (oFile.canWrite()) {
                BufferedWriter oWriter = new BufferedWriter(new FileWriter(sFileName));
                oWriter.write(sContent);
                oWriter.close();
            }
        } catch (IOException oException) {
            throw new IllegalArgumentException("Invalid folder path/File cannot be written: \n" + sFileName);
        }
    }

    // synchronized: only one thread at a time can append to the file through this method
    public static synchronized void appendContents(String sFileName, String sContent) {
        try {
            File oFile = new File(sFileName);
            if (!oFile.exists()) {
                oFile.createNewFile();
            }
            if (oFile.canWrite()) {
                BufferedWriter oWriter = new BufferedWriter(new FileWriter(sFileName, true));
                oWriter.write(sContent);
                oWriter.close();
            }
        } catch (IOException oException) {
            throw new IllegalArgumentException("Error appending/File cannot be written: \n" + sFileName);
        }
    }
}
You can have multiple threads write to the same file - but one at a time. All threads will need to enter a synchronized block before writing to the file.
In the P2P example, one way to implement it is to find the size of the file and create an empty file of that size. Each thread downloads a different section of the file; when it needs to write, it enters a synchronized block, moves the file pointer using seek, and writes the contents of its buffer, as in the sketch below.
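A sketch of that approach with RandomAccessFile (the class and method names are illustrative):

import java.io.IOException;
import java.io.RandomAccessFile;

public class ChunkWriter {
    private final RandomAccessFile file;

    public ChunkWriter(String path, long totalSize) throws IOException {
        file = new RandomAccessFile(path, "rw");
        file.setLength(totalSize); // create the full-size, empty file up front
    }

    public void writeChunk(long offset, byte[] chunk) throws IOException {
        synchronized (file) {      // one writer at a time
            file.seek(offset);     // jump to this chunk's position
            file.write(chunk);
        }
    }
}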
What kind of file is this? Why do you need to feed it with more threads? It depends on the characteristics (I don't know a better word for it) of the file usage.
Transferring a file from several places over network (short: Torrent-like)
If you are transferring an existing file, the program should:
as soon as it knows the size of the file, create it with empty content: this prevents a later out-of-disk error (if there is not enough space, that turns out at creation time, before anything has been downloaded) and it also helps performance;
if you organize the transfer well (and why not), each thread will be responsible for a distinct portion of the file, so the file writes will be distinct;
even if two threads somehow pick the same portion of the file, it causes no error, because they write the same data to the same file positions.
Appending data blocks to a file (short: logging)
If the threads just append fixed- or variable-length information to a file, you should use a common writer thread. It should use a relatively large write buffer, so it can serve the client threads quickly (just taking their strings) and flush it out with optimal scheduling and block sizes. It should use a dedicated disk or even a dedicated machine. A rough sketch of such a writer follows below.
Also, there can be several performance issues; that's why there are logging servers around, even expensive commercial ones.
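Here is an illustrative sketch of that common writer thread, with a blocking queue between the clients and the writer (the class and file names are assumptions):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncLogger {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    public AsyncLogger(String path) {
        Thread writerThread = new Thread(() -> {
            try (BufferedWriter out = new BufferedWriter(new FileWriter(path, true))) {
                while (true) {
                    out.write(queue.take()); // blocks until a line is available
                    out.newLine();
                    if (queue.isEmpty()) {
                        out.flush();         // flush only when there is nothing left to batch
                    }
                }
            } catch (IOException | InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writerThread.setDaemon(true);
        writerThread.start();
    }

    public void log(String line) {
        queue.offer(line); // cheap for callers; the actual disk write happens on the writer thread
    }
}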
Reading and writing at random times and random positions (short: database)
It requires a complex design, with mutexes etc. I have never done this kind of thing, but I can imagine. Ask Oracle for some tricks :)
