FTP using Talend, get only most recent file? - java

I have a Talend job that needs to pull down an XML file from an sFTP server, to then be processed into an Oracle database. The date of the XML extraction is in the file name, for example "FileNameHere_Outbound_201407092215.xml", which I believe is yyyyMMddhhmm formatting. The beginning portion, "FileNameHere", is the same for all the files. I need to be able to read the date from the end of the file name and only pull that one file down from the server to be processed.
I am not sure how to do this with FTP. I've previously used tFilelist to order the items by date descending, but that is not an option with FTP. I know it probably has some Java involved in how to pull the portion of the File Name out, but I'm not very Java-literate. I can manage though with a bit of assistance.
Does anyone have any insight on how to only download the most recent file from an FTP?

There's a tFTPFileList component on the palette. That should give you a list of all the files on the FTP location. From here you then want to parse out the time stamp, which could be done with a regular expression or alternatively by substringing it, depending on which you feel more comfortable with.
Then it's just a case of sorting by the extracted time stamp and then that gives you the newest file name so you can then go fetch that specific file.
Here's an outline of an overly laborious way to get this done but it works. You should be able to tweak this easily yourself too:
In the above job design I've gone for a tFileList rather than a tFTPFileList because I don't have an example FTP location to play with for testing here. The premise stays the same although this would be pointless with a real tFileList due to the ability to sort by modified date (among other options).
We start off by running the tFileList/tFTPFileList component to iterate through all the files (it's possible to file mask these too to limit what you return here) in the location. We then read this in iteratively to a tFixedFlowInput component which allows us to retrieve the values from the globalMap as the tFileList/tFTPFileList iterates through each file:
I've listed everything that the tFileList provides (you can see the options by pressing ctrl+space) but you only really need the file name and potentially the file path or file directory. From here we then throw everything into a buffer with a tBufferOutput component so that we can gather every iteration of the location.
Once the tFileList/tFTPFileList has iterated through every file in the directory it then triggers the next sub job with an OnSubjobOk link where we start by reading the completed buffer back in with a tBufferInput component. At this point I've started scattering tLogRow components throughout the flow so I can better visualise the data at each step.
After this we then use a tExtractRegexFields component to extract the date time stamp from the file name:
Here, I am using the following regex "^.+?_Outbound_([0-9]{12})\\.xml$" to capture the date time stamp. It relies on the file name being a combination of any characters, followed by the string literal _Outbound_, then followed by the date time stamp that we want to capture (which is represented by 12 numeric characters) and then finished with .xml.
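If you want to sanity-check the pattern outside of Talend, the same expression can be exercised in plain Java (a minimal sketch using the example file name from the question):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TimestampExtract {
        public static void main(String[] args) {
            // Same pattern used in the tExtractRegexFields component above
            Pattern p = Pattern.compile("^.+?_Outbound_([0-9]{12})\\.xml$");
            Matcher m = p.matcher("FileNameHere_Outbound_201407092215.xml");
            if (m.matches()) {
                System.out.println(m.group(1)); // prints 201407092215
            }
        }
    }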
We also add a column to our schema to accommodate the captured date time stamp like so:
As the extra column is a date time stamp of the form yyyyMMddHHmm (capital HH, since times like 2215 are on the 24-hour clock) we can specify this pattern directly here and use it as a date object from then on.
From here we simply sort by date descending on the extracted date time stamp column and then use a tSampleRow to take only the first row of the flow of data as per the guidelines on the component configuration.
To finish this job you would then output the target file path to the globalMap (either in a tJavaRow or using a tFlowToIterate that will automatically do this for you) and then use the globalMap stored file path in the tFTPFileGet's file mask setting:
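For example (a rough sketch; "filepath" is just a hypothetical column name, use whatever your schema calls the full path column), the tJavaRow could contain:

    // Stash the surviving row's path so later components can read it from the globalMap
    globalMap.put("latestFile", input_row.filepath);

and the tFTPFileGet filemask would then reference (String) globalMap.get("latestFile").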

Related

Logging JConsole output into a CSV file. What logic is behind the time value?

When I open a JConsole window, I am able to save the memory data by right-clicking on the graph and choosing "Save data as..." (.csv).
After that the data output is like
Time,Used,Committed,Max
43738.560828,877853920,4294947296,15032385536
How can I parse the Time value into something more readable?
My first idea was to parse the data with Excel into some readable value, but this was not successful.
Excel removed the decimal separator when you imported the CSV. At import time you have to make sure Excel treats the point (.) as the decimal separator and not the one of your region.
Once you do that you can format the cell to date and the result will be
30/09/2019 13:27:36
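If you'd rather do the conversion in code than in Excel, and assuming the value really is an Excel-style serial date (days since 1899-12-30), which is what the conversion above suggests, a minimal Java sketch would be:

    import java.time.Duration;
    import java.time.LocalDateTime;

    public class JConsoleTimeParser {
        // Convert an Excel-style serial date (days since 1899-12-30) to a LocalDateTime
        static LocalDateTime fromSerial(double serial) {
            LocalDateTime excelEpoch = LocalDateTime.of(1899, 12, 30, 0, 0);
            long seconds = Math.round(serial * 24 * 60 * 60);
            return excelEpoch.plus(Duration.ofSeconds(seconds));
        }

        public static void main(String[] args) {
            System.out.println(fromSerial(43738.560828)); // ~2019-09-30T13:27:36
        }
    }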
Hope I helped

Check if a file was run earlier the same day

I need to create a program that sets an int to 01 at the beginning of each day. Every time the file is run, the int increments until the next day. This int will be inserted into a file name, e.g. FileName(insertdatehere)01.txt, FileName(insertdatehere)02.txt, FileName(insertdatehere)03.txt, etc.
I was wondering if this is possible:
- Checking if the file already exists, and if it does, then the int value will increment. This will work since the file name has the date in it, so each day a new file name will be created anyway.
Am I going in the right direction, or should I completely rethink this question?
Sorry if this isn't clear; if you need me to clarify, I will.
Your ideas seem correct and doing it in this way would probably work well.
Something to watch out for would be if two of the same process exist and both try to create a file assuming it doesn't exist.
As long as you consider this case, and your process runs reliably throughout the day (and you don't fall into timezone traps), you should be good to go.
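A minimal sketch of that idea (the file name pattern and date format here are assumptions, adapt them to your naming scheme). File.createNewFile() returns false if the file already exists, so the check and the creation happen in one atomic step, which also helps with the two-processes case above:

    import java.io.File;
    import java.io.IOException;
    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;

    public class DailyFileCounter {
        public static void main(String[] args) throws IOException {
            String date = LocalDate.now().format(DateTimeFormatter.ofPattern("yyyyMMdd"));
            int counter = 1;
            File file;
            // Keep incrementing the counter until we find a name that isn't taken yet.
            // createNewFile() atomically creates the file and returns false if it exists.
            do {
                file = new File(String.format("FileName%s%02d.txt", date, counter++));
            } while (!file.createNewFile());
            System.out.println("Created " + file.getName());
        }
    }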
Did you try using the java.util.Date class for setting the time stamp, date, etc.?
You can store the date whenever the file is opened, either in some other file or at a specific place in the same file.
Then, whenever you open the file again, you can compare against the date that was stored earlier.
This would surely help you.
First try it yourself, and then if you are still unable to do it, post whichever issues you face.
Your idea is good.
But in the case where the file does not exist yet and two processes try to create it simultaneously with the same name, a problem will occur.
You can solve that problem by using synchronization in Java, so that the code block (which checks whether the file exists and creates the new file) cannot be accessed simultaneously.

Writing one file per group in Pig Latin

The Problem:
I have numerous files that contain Apache web server log entries. Those entries are not in date time order and are scattered across the files. I am trying to use Pig to read a day's worth of files, group and order the log entries by date time, then write them to files named for the day and hour of the entries it contains.
Setup:
Once I have imported my files, I am using Regex to get the date field, then I am truncating it to hour. This produces a set that has the record in one field, and the date truncated to hour in another. From here I am grouping on the date-hour field.
First Attempt:
My first thought was to use the STORE command while iterating through my groups using a FOREACH and quickly found out that is not cool with Pig.
Second Attempt:
My second try was to use the MultiStorage() method in the piggybank, which worked great until I looked at the file. The problem is that MultiStorage wants to write all fields to the file, including the field I used to group on. What I really want is just the original record written to the file.
The Question:
So...am I using Pig for something it is not intended for, or is there a better way for me to approach this problem using Pig? Now that I have this question out there, I will work on a simple code example to further explain my problem. Once I have it, I will post it here. Thanks in advance.
Out of the box, Pig doesn't have a lot of functionality. It does the basic stuff, but more often than not I find myself having to write custom UDFs or load/store funcs to get from 95% of the way there to 100% of the way there. I usually find it worth it, since just writing a small store function is a lot less Java than a whole MapReduce program.
Your second attempt is really close to what I would do. You should either copy/paste the source code for MultiStorage or use inheritance as a starting point. Then, modify the putNext method to strip out the group value, but still write to that file. Unfortunately, Tuple doesn't have a remove or delete method, so you'll have to rewrite the entire tuple. Or, if all you have is the original string, just pull that out and output that wrapped in a Tuple.
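As a rough, untested fragment of that approach (this assumes you have copied MultiStorage's source, where splitFieldIndex is the integer index of the grouping field and writer is the per-key record writer the original class sets up; it is not a drop-in subclass, since those members are private in the original), the putNext change might look something like this:

    @Override
    public void putNext(Tuple tuple) throws IOException {
        // The grouping value still decides which output file the record goes to...
        String key = String.valueOf(tuple.get(splitFieldIndex));

        // ...but it is stripped from what gets written. Tuple has no remove(),
        // so rebuild the tuple without the grouping field.
        Tuple stripped = TupleFactory.getInstance().newTuple();
        for (int i = 0; i < tuple.size(); i++) {
            if (i != splitFieldIndex) {
                stripped.append(tuple.get(i));
            }
        }

        try {
            writer.write(key, stripped); // 'writer' comes from the copied MultiStorage source
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }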
Some general documentation on writing Load/Store functions in case you need a bit more help: http://pig.apache.org/docs/r0.10.0/udf.html#load-store-functions

Best way to get file date created from Java on a UNIX system?

java.io.File doesn't provide a way to get a file's creation date. You can get file.lastModified, but not anything like dateCreated.
Java 7 adds the excellent java.nio.file package with access to date created, but it's not out yet.
My question: what's the best way to get a file's date created from Java on a UNIX/OSX system? I suppose it's executing a shell script, but my command line skills are pretty weak. So if shell scripting's the way to go, if you could provide a full example I'd be very grateful.
Thanks!
From the command line, the easiest way to get the creation date is with mdls. I think you'd want to do /usr/bin/mdls -name kMDItemFSCreationDate $filename (where $filename is the file you're asking about). You can specify multiple filenames, but that might make it harder to parse the output.
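Since the question asked for a concrete example of shelling out, here is a minimal Java sketch that runs mdls and prints its output (macOS only; the file name is just a placeholder):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class CreationDateMdls {
        public static void main(String[] args) throws Exception {
            String filename = args.length > 0 ? args[0] : "example.txt"; // placeholder file
            Process p = new ProcessBuilder("/usr/bin/mdls", "-name", "kMDItemFSCreationDate", filename)
                    .redirectErrorStream(true)
                    .start();
            BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line); // e.g. kMDItemFSCreationDate = 2010-06-01 09:30:00 +0000
            }
            r.close();
            p.waitFor();
        }
    }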
Unix doesn't have a "creation date". The only dates stored on files are the modification date, which stores the time the file contents were changed; the access date, which stores the last time a file was read; and the "change date", which stores the last time the file's metadata was changed. (The metadata contains things like permission bits, ownership, etc.)
If you review the structure supported by the stat(2) API, you can see the three timespecs.
You can't - it isn't stored anywhere
It actually depends on the file system, and most file systems don't provide such a feature as storing a file's "birth time". You need to check the filesystem.
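For completeness, once the java.nio.file package mentioned in the question is available (Java 7+), the lookup becomes straightforward; a minimal sketch (note that on file systems without a stored birth time, creationTime() may simply fall back to the last-modified time):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.attribute.BasicFileAttributes;

    public class CreationTimeNio {
        public static void main(String[] args) throws Exception {
            Path path = Paths.get(args.length > 0 ? args[0] : "example.txt"); // placeholder path
            BasicFileAttributes attrs = Files.readAttributes(path, BasicFileAttributes.class);
            System.out.println("created: " + attrs.creationTime());
        }
    }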

read/write to a large size file in java

I have a binary file with the following format:
[N bytes identifier & record length] [n1 bytes data]
[N bytes identifier & record length] [n2 bytes data]
[N bytes identifier & record length] [n3 bytes data]
As you can see, I have records with different lengths. In each record I have a fixed N bytes which contain an id and the length of the data in the record.
This file is very big and can contain 3 million records.
I want to open this file with an application and let the user browse and edit the records.
(Insert / Update / Delete records)
My initial plan is to create an index file from the original file and, for each record, keep the next and previous record addresses so I can navigate forward and backward easily (some sort of linked list, but in a file, not in memory).
Is there a library (a Java library) to help me implement this requirement?
Any recommendation or experience that you think is useful?
----------------- EDIT ----------------------------------------------
Thanks for the guidance and suggestions.
Some more info:
The original file and its format are out of my control (it's a third-party file) and I can't change the file format. But I have to read it, let the user navigate over the records and edit some of them (insert a new record / update an existing record / delete a record), and at the end save it back in the original file format.
Do you still recommend a database instead of a normal index file?
----------------- SECOND EDIT ----------------------------------------------
The record size in update mode is fixed. This means an updated (edited) record has the same length as the original record, unless the user deletes the record and creates another record with a different format.
Many Thanks
Seriously, you should NOT be using a binary file for this. You should use a database.
The problems with trying to implement this as a regular file stem from the fact that operating systems do not allow you to insert extra bytes into the middle of an existing file. So if you need to insert a record (anywhere but the end), update a record (with a different size) or remove a record, you would need to:
- rewrite other records (after the insertion/update/deletion point) to make or reclaim space, or
- implement some kind of free space management within the file.
All of this is complicated and / or expensive.
Fortunately, there is a class of software that implements this kind of thing. It is called database software. There are a wide range of options, ranging from using a full-scale RDBMS to light-weight solutions like BerkeleyDB files.
In response to your 1st and 2nd edits, a database will still be simpler.
However, here's an alternative that might perform better for this use-case than using a DB... without doing complicated free-space management.
1. Read the file and build an in-memory index that maps ids to file locations.
2. Create a second file to hold new and updated records.
3. Perform the record adds/updates/deletes:
   - An addition is handled by writing the new record to the end of the second file, and adding an index entry for it.
   - An update is handled by writing the updated record to the end of the second file, and changing the existing index entry to point to it.
   - A delete is handled by deleting the index entry for the record's key.
4. Compact the file as follows:
   4.1. Create a new file.
   4.2. Read each record in the old file in order, and check the index for the record's key. If the entry still points to the location of the record, copy the record to the new file. Otherwise skip it.
   4.3. Repeat step 4.2 for the second file.
5. If we completed all of the above successfully, delete the old file and the second file.
Note this relies on being able to keep the index in memory. If that is not feasible, then the implementation is going to be more complicated ... and more like a database.
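To make the in-memory index idea concrete, here is a minimal sketch. The question doesn't specify N, so the header layout assumed here (a 4-byte int id followed by a 4-byte int data length) is just an illustration to adapt:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.HashMap;
    import java.util.Map;

    public class RecordIndexer {
        // Scan the file once, remembering where each record starts.
        public static Map<Integer, Long> buildIndex(String path) throws IOException {
            Map<Integer, Long> index = new HashMap<Integer, Long>();
            RandomAccessFile raf = new RandomAccessFile(path, "r");
            try {
                long offset = 0;
                long fileLength = raf.length();
                while (offset < fileLength) {
                    raf.seek(offset);
                    int id = raf.readInt();          // assumed 4-byte identifier
                    int dataLength = raf.readInt();  // assumed 4-byte record length
                    index.put(id, offset);           // map id -> start of this record
                    offset += 8 + dataLength;        // skip the header (8 bytes) plus the data
                }
            } finally {
                raf.close();
            }
            return index;
        }
    }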
Having a data file and an index file would be the general idea for such an implementation, but you'd pretty much find yourself dealing with data fragmentation upon repeated updates, deletions, etc. This kind of project should, in itself, be a separate project and should not be part of your main application. However, essentially, a database is what you need, as it is specifically designed for such operations and use cases, and it will also allow you to search, sort, and extend (alter) your data structure without having to refactor an in-house (custom) solution.
May I suggest you download Apache Derby and create a local embedded database (Derby does this for you when you create a new embedded connection at run-time). It will not only be faster than anything you'll write yourself, but will make your application easier to maintain.
Apache Derby is a single jar file that you can simply include and distribute with your project (check the license in case any legal issue applies to your app). There is no need for a database server or third-party software; it's all pure Java.
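As a minimal sketch of what that looks like (the database and table names here are made up; derby.jar just needs to be on the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class DerbyExample {
        public static void main(String[] args) throws Exception {
            // Older Derby versions need the embedded driver loaded explicitly; harmless otherwise
            Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
            // ";create=true" tells embedded Derby to create the database if it doesn't exist yet
            Connection conn = DriverManager.getConnection("jdbc:derby:recordsDB;create=true");
            Statement st = conn.createStatement();
            st.executeUpdate("CREATE TABLE records (id INT PRIMARY KEY, data BLOB)");
            st.close();
            conn.close();
        }
    }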
The bottom line is that it all depends on how large your application is, whether you need to share the data across many clients, whether speed is a critical aspect of your app, etc.
For a stand-alone, single-user project, I recommend Apache Derby. For an n-tier application, you might want to look into MySQL, PostgreSQL or (hrm) even Oracle. Using ready-made and tested solutions is not only smart, but will cut down your development time (and maintenance effort).
Cheers.
Generally you are better off letting a library or database do the work for you.
You may not want to have an SQL database and there are plenty of simple databases which don't use SQL. http://nosql-database.org/ lists 122 of them.
At a minimum, if you are going to write this I suggest you read the source for one of these databases to see how they work.
Depending on the size of the records, 3 million isn't that much and I would suggest you keep as much in memory as possible.
The problem you are likely to have is ensuring the data is consistent and recovering the data when corruption occurs. The second problem is dealing with fragmentation efficiently (something the brightest minds working on the GC deal with). The third problem is likely to be maintaining the index in a transactional fashion with the source data to ensure there are no inconsistencies.
While this may appear simple at first, there are significant complexities in making sure the data is reliable, maintainable and can be accessed efficiently. This is why most developers use an existing database/datastore library and concentrate on the features which are unique to their application.
(Note: My answer is about the problem in general, not considering any Java libraries or - like the other answers also proposed - using a database (library), which might be better than reinventing the wheel)
The idea to create an index is good and will be very helpful performance-wise (although you wrote "index file", I think it should be kept in memory). Generating the index should be quite fast if you read the ID and record length for each entry and then just skip the data with a file seek.
You should also think about the edit functionality. Inserting and deleting, in particular, can be very slow on such a big file if you do it wrong (e.g. deleting and then moving all the following entries to close the gap).
The best option would be to only mark deleted entries as deleted. When inserting, you can overwrite one of those or append to the end of the file.
Insert / Update / Delete records
Inserting (rather than merely appending) and deleting records to a file is expensive because you have to move all the following content of the file to create space for the new record or to remove the space it used. Updating is similarly expensive if the update changes the length of the record (you say they are variable length).
The file format you propose is fundamentally unsuitable for the kinds of operations you want to perform. Others have suggested using a database. If you don't want to go that far, adding an index file (as you suggest) is the way to go. I recommend making the index records all the same length.
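A minimal sketch of why same-length index records help (the 4-byte id plus 8-byte offset layout is an assumption): because every entry is the same size, entry number k can be read directly at offset k * ENTRY_SIZE without scanning the index:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class FixedLengthIndex {
        private static final int ENTRY_SIZE = 12; // 4-byte id + 8-byte offset into the data file

        // Append one fixed-length index entry
        static void writeEntry(RandomAccessFile index, int id, long recordOffset) throws IOException {
            index.seek(index.length());
            index.writeInt(id);
            index.writeLong(recordOffset);
        }

        // Read entry number k directly; possible only because entries never vary in length
        static long readOffset(RandomAccessFile index, int k) throws IOException {
            index.seek((long) k * ENTRY_SIZE);
            index.readInt();          // skip the id
            return index.readLong();  // the record's position in the data file
        }
    }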
As others have stated, a database would seem a better solution. The following are Java SQL DBs that could be used: H2, Derby or HSQLDB.
If you want to use an index file, look at Berkeley DB or a NoSQL store.
If there is some reason for using a file, look at JRecord. It has:
- Several classes for reading/writing files with variable-length binary records (they were written for Cobol VB files). Any of the Mainframe / Fujitsu / Open Cobol VB file structures should do the job.
- An editor for editing JRecord files. The latest version of the editor can handle large files (it uses compression / a spill file). The editor suffers from having to download the whole file, and only one user can edit the file at a time.
The JRecord solution will only work if:
- there is a limited number of users (preferably one), all located in the one location
- there is fast infrastructure
