I have a Spring Batch application where I am processing multiple .txt files in parallel. My simple job configuration looks like below:
@Value("file:input/*.txt")
private Resource[] inputResources;
@Bean("partitioner")
@StepScope
public Partitioner partitioner() {
log.info("In Partitioner");
MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
partitioner.setResources(inputResources);
partitioner.partition(10);
return partitioner;
}
@Bean
@StepScope
@Qualifier("nodeItemReader")
@DependsOn("partitioner")
public FlatFileItemReader<FolderNodePojo> NodeItemReader(@Value("#{stepExecutionContext['fileName']}") String filename)
throws MalformedURLException {
return new FlatFileItemReaderBuilder<FolderNodePojo>().name("NodeItemReader").delimited().delimiter("<##>")
.names(new String[] { "id" }).fieldSetMapper(new BeanWrapperFieldSetMapper<FolderNodePojo>() {
{
setTargetType(FolderNodePojo.class);
}
}).linesToSkip(0).resource(new UrlResource(filename)).build();
}
There are thousands of .txt files, each with thousands of lines, being processed. Some files have corrupted data and the job fails. I need to generate and send a report about the file names that have corrupted data.
What should I do to log the names of the files that were processed successfully for all their lines, or, if possible, the names of the unsuccessful ones? That would help too, so that I can generate a report based on it and, when I start the job again, remove the successful ones from the input directory. Any pointers/solutions will be greatly appreciated.
As a strategy, you could do something like this:
once a file is in progress (say, filename "file1.txt"), step 1 is to rename that file, so the filename reflects that it's in progress; for example: "file1.txt.started"
when a file completes successfully, rename it to reflect that; for example: "file1.txt.complete"
if an error is encountered, and the code has a chance to rename the file, rename it to something like "file1.txt.error"
What's nice about this is you allow the filesystem to act like a database for you, capturing the state of everything.
Once your program is finished, you can just count up the totals by file extension:
anything named *.complete is good, those are the obvious successful runs
anything with *.started is broken, and specifically it means something went really wrong
anything named *.error is broken, but your code had a chance to notice
anything named *.txt was, for some reason, not picked up
Those last three scenarios – *.started, *.error, *.txt – are the ones to study further, understand what went wrong. If all goes well, you'll end up with *.complete only.
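For illustration, in this Spring Batch setup the renaming could live in a StepExecutionListener attached to the partitioned worker step. This is only a sketch, not the only way to do it: the class name is mine, and it relies on the "fileName" key that MultiResourcePartitioner puts into each partition's execution context (the same key the reader above uses).

import java.nio.file.*;
import org.springframework.batch.core.*;
import org.springframework.core.io.UrlResource;

public class FileStatusListener implements StepExecutionListener {

    @Override
    public void beforeStep(StepExecution stepExecution) {
        // Renaming to ".started" here would also require pointing the reader
        // at the renamed path, so that part is left out of this sketch.
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        String suffix = stepExecution.getStatus() == BatchStatus.COMPLETED ? ".complete" : ".error";
        try {
            // "fileName" is the key MultiResourcePartitioner stores in each partition's context
            UrlResource resource = new UrlResource(stepExecution.getExecutionContext().getString("fileName"));
            Path file = resource.getFile().toPath();
            Files.move(file, file.resolveSibling(file.getFileName() + suffix),
                    StandardCopyOption.REPLACE_EXISTING);
        } catch (Exception e) {
            // the rename is only bookkeeping; do not fail the step because of it
        }
        return stepExecution.getExitStatus();
    }
}

Register it on the worker step with .listener(...) so it runs once per partition; afterwards every *.complete file can be removed from the input directory before a restart, and the *.error files go into the report.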
I am new to Flink and facing some challenges solving the use case below.
Use Case description:
Every day I will receive a CSV file with a timestamp in its name in some folder, say input. The file name format is file_name_dd-mm-yy-hh-mm-ss.csv.
My Flink pipeline will read this CSV file row by row and write the rows to my Kafka topic.
Immediately after the data has been read, the file needs to be moved to another folder, the historic folder.
Why I need this: suppose the Ververica server stops, either abruptly or manually. If all the processed files are still lying in the same location, then after the restart Flink will re-read all the files it had already processed. To prevent this, already-read files need to be moved to another location immediately.
I googled a lot but did not find anything, so can you guide me on how to achieve this?
Let me know if anything else is required.
Out of the box, Flink provides the facility to monitor a directory for new files and read them via StreamExecutionEnvironment.readFile (see similar Stack Overflow threads for examples: How to read newly added file in a directory in Flink, Monitoring directory for new files with Flink for data streams, etc.).
Looking into the source code of the readFile function, it calls the createFileInput() method, which simply instantiates a ContinuousFileMonitoringFunction and a ContinuousFileReaderOperatorFactory and configures the source:
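For reference, a minimal usage sketch of that built-in monitoring (the directory path and the 10-second poll interval are placeholders):

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

// Watch the input directory and re-scan it every 10 seconds for new files.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
TextInputFormat format = new TextInputFormat(new Path("file:///data/input"));
DataStream<String> lines = env.readFile(
        format,
        "file:///data/input",
        FileProcessingMode.PROCESS_CONTINUOUSLY,
        10_000L);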
addSource(monitoringFunction, sourceName, null, boundedness)
.transform("Split Reader: " + sourceName, typeInfo, factory);
ContinuousFileMonitoringFunction is where most of the logic actually happens.
So, if I were to implement your requirement, I would extend the functionality of ContinuousFileMonitoringFunction with my own logic for moving the processed files into the history folder, and construct the source from this function.
Given that the run method performs the read and forwarding inside the checkpointLock:
synchronized (checkpointLock) {
monitorDirAndForwardSplits(fileSystem, context);
}
I would say it is safe, on checkpoint completion, to move to the historic folder the files whose modification time is older than globalModificationTime, which is updated in monitorDirAndForwardSplits when splits are collected.
That said, I would extend the ContinuousFileMonitoringFunction class and implement the CheckpointListener interface, and in notifyCheckpointComplete move the already processed files to the historic folder:
public class ArchivingContinuousFileMonitoringFunction<OUT> extends ContinuousFileMonitoringFunction<OUT> implements CheckpointListener {
...
@Override
public void notifyCheckpointComplete(long checkpointId) throws Exception {
Map<Path, FileStatus> eligibleFiles = listEligibleForArchiveFiles(fs, new Path(path));
// do move logic
}
/**
* Returns the paths of the files already processed.
*
* @param fileSystem The filesystem where the monitored directory resides.
*/
private Map<Path, FileStatus> listEligibleForArchiveFiles(FileSystem fileSystem, Path path) {
final FileStatus[] statuses;
try {
statuses = fileSystem.listStatus(path);
} catch (IOException e) {
// we may run into an IOException if files are moved while listing their status
// delay the check for eligible files in this case
return Collections.emptyMap();
}
if (statuses == null) {
LOG.warn("Path does not exist: {}", path);
return Collections.emptyMap();
} else {
Map<Path, FileStatus> files = new HashMap<>();
// handle the new files
for (FileStatus status : statuses) {
if (!status.isDir()) {
Path filePath = status.getPath();
long modificationTime = status.getModificationTime();
if (shouldIgnore(filePath, modificationTime)) {
files.put(filePath, status);
}
} else if (format.getNestedFileEnumeration() && format.acceptFile(status)) {
files.putAll(listEligibleForArchiveFiles(fileSystem, status.getPath()));
}
}
return files;
}
}
}
and then define the data stream manually with the custom function:
ContinuousFileMonitoringFunction<OUT> monitoringFunction =
new ArchivingContinuousFileMonitoringFunction<>(
inputFormat, monitoringMode, getParallelism(), interval);
ContinuousFileReaderOperatorFactory<OUT, TimestampedFileInputSplit> factory = new ContinuousFileReaderOperatorFactory<>(inputFormat);
final Boundedness boundedness = Boundedness.CONTINUOUS_UNBOUNDED;
env.addSource(monitoringFunction, sourceName, null, boundedness)
.transform("Split Reader: " + sourceName, typeInfo, factory);
Flink itself does not provide a solution for doing this. You might need to build something yourself, or find a workflow tool that can be configured to handle this.
You can ask about this on the flink user mailing list. I know others have written scripts to do this; perhaps someone can share a solution.
I have a CSV file containing customer info, one customer each row.
The CSV file has a size of about 170,000 lines.
The app first parsed the whole file line by line and saved each line as a Customer object into an ArrayList. It implied that the size of the list would also be in the order of 170k.
The code is like the below:
final class CustomerInfoLineProcessor implements LineProcessor<CustomerInfo> {
...
@Override
public boolean processLine(final String line) {
parseLine(line);
return true;
}
private void parseLine(final String line) {
try {
if (!line.trim().isEmpty()) {
//do job
}
} catch (final RuntimeException e) {
handleLineError(e.getClass().getName() + ": " + e.getMessage(), e, lineStatus);
}
}
...
}
It was found intermittently that the parsing process ended abnormally in the middle. No errors or runtime exceptions were thrown, and the whole process was not stopped; the app kept doing further jobs based on whatever was inside the ArrayList.
In the beginning, I thought there might be some invisible characters hidden somewhere in the file which caused the process to quit early. But that possibility was excluded after the same file was processed without any problem by the same app on my test machine.
The second guess was that the memory setting -Xmx256m was too small, so as a test I changed it to an even smaller one, -Xmx128m. The app immediately threw an OutOfMemoryError and was terminated automatically. This implied that memory usage under -Xmx256m was not the issue.
Any other reasons I have not yet thought about?
Here is the problem that was found:
* the client's app FTPs the CSV file to us into a specified folder every morning;
* then the file_sync app starts parsing the CSV file;
* sometimes the CSV file's FTP transfer was not yet complete when the file_sync app was kicked off. That caused the problem.
Thus the solution is to make sure the CSV file is no longer being written by another process before starting the file_sync app.
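One simple way to implement that check, assuming the parser can afford a short delay before starting (the helper name is mine), is to treat the file as complete only once its size has stopped changing between two polls:

import java.io.File;

// Illustrative helper: consider the transfer finished only if the size is
// non-zero and unchanged after waiting pollMillis milliseconds.
static boolean isTransferComplete(File file, long pollMillis) throws InterruptedException {
    long sizeBefore = file.length();
    Thread.sleep(pollMillis);
    return sizeBefore > 0 && sizeBefore == file.length();
}

A more robust alternative is to have the sender upload under a temporary name (e.g. customers.csv.part) and rename it only after the FTP transfer finishes, so the file_sync app never sees a half-written file.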
I want to create an application that shows a user how many times he has opened or used the software. For this I have created the code below, but it is not showing the correct output: the first time I run the application it shows 1, and the second time I run it, it also shows 1.
public Founder() {
initComponents();
int c=0;
c++;
jLabel1.setText(""+c);
return;
}
I’m unsure whether I’m helping you or giving you a load of new problems and unanswered questions. The following will store the count of times the class Founder has been constructed in a file called useCount.txt in the program’s working directory (probably the root binary directory, where your .class files are stored). Next time you run the program, it will read the count from the file, add 1 and write the new value back to the file.
static final Path counterFile = FileSystems.getDefault().getPath("useCount.txt");
public Founder() throws IOException {
initComponents();
// read use count from file
int useCount;
if (Files.exists(counterFile)) {
List<String> line = Files.readAllLines(counterFile);
if (line.size() == 1) { // one line in file as expected
useCount = Integer.parseInt(line.get(0));
} else { // not the right file, ignore lines from it
useCount = 0;
}
} else { // program has never run before
useCount = 0;
}
useCount++;
jLabel1.setText(String.valueOf(useCount));
// write new use count back to file
Files.write(counterFile, Arrays.asList(String.valueOf(useCount)));
}
It’s not the most elegant nor robust solution, but it may get you started. If you run the program on another computer, it will not find the file and will start counting over from 0.
When you are running your code the first time, the data related to it will be stored in your system's RAM. Then when you close your application, all the data related to it will be deleted from the RAM (for simplicity let's just assume it will be deleted, although in reality it is a little different).
Now when you are opening your application second time, new data will be stored in the RAM. This new data contains the starting state of your code. So the value of c is set to 0 (c=0).
If you want to remember the data, you have to store it in the permanent storage (your system hard drive for example). But I think you are a beginner. These concepts are pretty advanced. You should do some basic programming practice before trying such things.
Here you need to store it on a permanent basis.
Refer to the Properties class to store data permanently: https://docs.oracle.com/javase/7/docs/api/java/util/Properties.html
You can also use data files, e.g. *.txt, *.csv.
Serialization also provides a way for persistent storage.
You can create a class that implements Serializable with a field for each piece of data you want to store. Then you can write the entire class out to a file, and you can read it back in later. Learn about serialization here: https://www.tutorialspoint.com/java/java_serialization.htm
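For example, the use count from the question could be kept in a java.util.Properties file. This is only a sketch meant to go inside the constructor; the file name "useCount.properties" and the "useCount" key are arbitrary names for this example.

import java.io.*;
import java.util.Properties;

// Load the previous count (if any), increment it, show it, and write it back.
Properties props = new Properties();
File file = new File("useCount.properties");
if (file.exists()) {
    try (FileInputStream in = new FileInputStream(file)) {
        props.load(in);
    }
}
int useCount = Integer.parseInt(props.getProperty("useCount", "0")) + 1;
props.setProperty("useCount", String.valueOf(useCount));
try (FileOutputStream out = new FileOutputStream(file)) {
    props.store(out, "application use counter");
}
jLabel1.setText(String.valueOf(useCount));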
I am running into a peculiar issue (peculiar for me anyways) that seems to happen in a SwingWorker that I use for saving the result of another 'SwingWorker' thread as a tab-delimited file (just a spreadsheet of data).
Here is the worker, that initializes and declares an object which organizes the data and writes each table row to a file (using BufferedWriter):
// Some instance variables outside of the SwingWorker:
// model: holds a matrix of numerical data (double[][])
// view: the GUI class
class SaveWorker extends SwingWorker<Void, Void> {
/* The finished reordered matrix axes */
private String[] reorderedRows;
private String[] reorderedCols;
private String filePath; // the path of the file that will be generated
public SaveWorker(String[] reorderedRows, String[] reorderedCols) {
// variables have been checked for null outside of the worker
this.reorderedRows = reorderedRows;
this.reorderedCols = reorderedCols;
}
@Override
protected Void doInBackground() throws Exception {
if (!isCancelled()) {
LogBuffer.println("Initializing writer.");
final CDTGenerator cdtGen = new CDTGenerator(
model, view, reorderedRows, reorderedCols);
LogBuffer.println("Generating CDT.");
cdtGen.generateCDT();
LogBuffer.println("Setting file path.");
filePath = cdtGen.getFilePath(); // stops inside here, jumps to done()
LogBuffer.println("Path: " + filePath);
}
return null;
}
@Override
protected void done() {
if (!isCancelled()) {
view.setLoadText("Done!");
LogBuffer.println("Done saving. Opening file now.");
// need filePath here to load and then display generated file
visualizeData(filePath);
} else {
view.setReorderOngoing(false);
LogBuffer.println("Reordering has been cancelled.");
}
}
}
When I run the program from Eclipse, this all works perfectly fine. No issues whatsoever. Now I know there have been tons of question on here that are about Eclipse running fine while the runnable JAR fails. It's often due to not including dependencies or referring to them in the wrong way. But what's weird is that the JAR also works completely fine when it's being started from command line (Windows 8.1):
java -jar reorder.jar
Et voilà, everything as expected. The CDTGenerator will finish, write all the matrix rows to a file, and return the filePath. With the filePath I can subsequently open the new file and display the matrix.
In the case of double-clicking the JAR on my desktop, where I placed it when creating it from Eclipse, this is where the program lets me know that something goes wrong. I get the error message I created for the case of filePath == null, and using some logging I closed in on where the CDTGenerator object stops executing its method generateCDT() (the Eclipse debugger also won't reproduce the error and does everything as planned).
What the log shows made me think it's an issue with concurrency, but I am actually leaning against that because Eclipse and command line both run the code fine. The log just tells me that the code suddenly stops executing during a loop which transforms double values from a matrix row (double[]) to Strings to be stored in a String[] for later writing with BufferedWriter.
If I use more logging in that loop, the loop will stop at a different iteration (???).
Furthermore, the code does work for small matrices (130x130) but not for larger ones (1500x3500), though I haven't tested where the limit is. This makes it seem almost time-dependent, or memory-related.
I also used jVisualVM to look at potential memory issues, but even for the larger matrices I am on ~250MB which is nowhere near problematic regarding potential OutOfMemoryExceptions.
And finally, the last potential factor I can think of: generating the JAR 'fails' due to some classpath issues (clean & rebuild have no effect...), but this has never been an issue before, as I have run the code many, many times using the 'broken' JAR executed from the desktop.
I am a real newbie to programming, so please point in some direction if possible. I have tried to find logged exceptions, logged the values of variables, I am checking for null and IndexOutOfBound issues at the array where it stops executing... I am at a complete loss especially because this runs fine from command line.
It looks like the problem had to do with the Java versions installed on the OP's computer. They checked the file extensions and the programs associated with each one, in order to see whether the same Java version was used as from Eclipse and the command line.
Once they cleaned out the older Java versions, the JAR started to work by double-clicking it :)
Because I do not have enough points (I need 50 to directly answer your question), I need to ask this way:
If you double-click a JAR you won't see a console, which is often the problem because you can't see stack traces; they just get written to "nowhere". Maybe you get an NPE or something else.
Try to attach an exception handler like this, Thread.setDefaultUncaughtExceptionHandler(UncaughtExceptionHandler), and let this handler write a message to a file or such...
Just an idea.
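A minimal sketch of such a handler, installed early in main() (the log file name "error.log" is arbitrary):

import java.io.FileWriter;
import java.io.PrintWriter;

// Writes any uncaught exception's stack trace to a log file, so errors are
// visible even when the JAR is started by double-clicking it.
Thread.setDefaultUncaughtExceptionHandler((thread, throwable) -> {
    try (PrintWriter out = new PrintWriter(new FileWriter("error.log", true))) {
        out.println("Uncaught exception in thread " + thread.getName());
        throwable.printStackTrace(out);
    } catch (Exception ignored) {
        // nothing left to do if even logging fails
    }
});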
Background:
I have a requirement that messages displayed to the user must vary both by language and by company division. Thus, I can't use out of the box resource bundles, so I'm essentially writing my own version of resource bundles using PropertiesConfiguration files.
In addition, I have a requirement that messages must be modifiable dynamically in production w/o doing restarts.
I'm loading up three different iterations of property files:
-basename_division.properties
-basename_2CharLanguageCode.properties
-basename.properties
These files exist in the classpath. This code is going into a tag library to be used by multiple portlets in a Portal.
I construct the possible .properties files, and then try to load each of them via the following:
PropertiesConfiguration configurationProperties;
try {
configurationProperties = new PropertiesConfiguration(propertyFileName);
configurationProperties.setReloadingStrategy(new FileChangedReloadingStrategy());
} catch (ConfigurationException e) {
/* This is ok -- it just means that the specific configuration file doesn't
exist right now, which will often be true. */
return(null);
}
If it did successfully locate a file, it saves the created PropertiesConfiguration into a hashmap for reuse, and then tries to find the key. (Unlike regular resource bundles, if it doesn't find the key, it then tries to find the more general file to see if the key exists in that file -- so that only override exceptions need to be put into language/division specific property files.)
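For clarity, the lookup order described above looks roughly like the following; all names here (configCache, loadConfiguration, lookupMessage) are illustrative, not the actual tag library code.

// Illustrative only: try division-specific, then language-specific, then the base
// file, caching each PropertiesConfiguration once it has been loaded successfully.
private String lookupMessage(String key, String division, String languageCode) {
    String[] candidates = {
            "basename_" + division + ".properties",
            "basename_" + languageCode + ".properties",
            "basename.properties" };
    for (String fileName : candidates) {
        PropertiesConfiguration config = configCache.get(fileName);
        if (config == null) {
            config = loadConfiguration(fileName); // the try/catch shown above; may return null
            if (config != null) {
                configCache.put(fileName, config);
            }
        }
        if (config != null && config.containsKey(key)) {
            return config.getString(key);
        }
    }
    return null; // key not found in any of the files
}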
The Problem:
If a file did not exist the first time it was checked, it throws the expected exception. However, if at a later time a file is then later dropped into the classpath and this code is then re-run, the exception is still thrown. Restarting the portal obviously clears the problem, but that's not useful to me -- I need to be able to allow them to drop new messages in place for language/companyDivision overrides w/o a restart. And I'm not that interested in creating blank files for all possible divisions, since there are quite a few divisions.
I'm assuming this is a classLoader issue, in that it determines that the file did not exist in the classpath the first time, and caches that result when trying to reload the same file. I'm not interested in doing anything too fancy w/ the classLoader. (I'd be the only one who would be able to understand/maintain that code.) The specific environment is WebSphere Portal.
Any ways around this or am I stuck?
My guess is that Apache's FileChangedReloadingStrategy does not also report ENTRY_CREATE events on a file system directory.
If you're using Java 7, I propose trying the following: simply implement a new ReloadingStrategy using the Java 7 WatchService. In this way, every time a file is changed in your target directories or a new property file is placed there, you poll for the event and are able to add the properties to your application.
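A rough sketch of the WatchService part (the directory is a placeholder, exception handling is omitted, and wiring it into an actual ReloadingStrategy is left out):

import java.nio.file.*;

// Watch the properties directory and react when files are created or modified.
Path dir = Paths.get("/path/to/properties");
WatchService watcher = FileSystems.getDefault().newWatchService();
dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE, StandardWatchEventKinds.ENTRY_MODIFY);

while (true) {
    WatchKey key = watcher.take(); // blocks until an event arrives
    for (WatchEvent<?> event : key.pollEvents()) {
        Path changed = dir.resolve((Path) event.context());
        // e.g. create/refresh the PropertiesConfiguration for this file and cache it
        System.out.println(event.kind() + ": " + changed);
    }
    key.reset();
}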
If not on Java 7, maybe using a library such as JNotify would be a better solution to get the event of a new entry in a directory. But again, you need to implement the ReloadingStrategy.
UPDATE for Java 6:
PropertiesConfiguration configurationProperties;
try {
configurationProperties = new PropertiesConfiguration(propertyFileName);
configurationProperties.setReloadingStrategy(new FileChangedReloadingStrategy());
} catch (ConfigurationException e) {
JNotify.addWatch(propertyFileDirectory, JNotify.FILE_CREATED, false, new FileCreatedListener());
}
where
class FileCreatedListener implements JNotifyListener {
// other methods
public void fileCreated(int watchId, String rootPath, String fileName) {
configurationProperties = new PropertiesConfiguration(rootPath + "/" + fileName);
configurationProperties.setReloadingStrategy(new FileChangedReloadingStrategy());
// or any other business with configurationProperties
}
}