While processing multiple files in parallel with Spark, I'd like to know which file a particular record comes from. My goal is to assign a file ID (or at least the file name) to each record for internal audit purposes.
Is there any way to do this?
I'm using the Spark Java API.
Yes, you can use SparkContext.wholeTextFiles, which gives you the file name as the key and the whole file contents as the value. If sc is a SparkContext (a JavaSparkContext in your case), just call sc.wholeTextFiles("path/to/dir/").
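For illustration, here is a minimal sketch (assuming the Spark 2.x Java API; the directory path and the naive newline split are placeholders) that tags every line with the file it came from:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class FileAuditExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("file-audit").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Key = full path of each file, value = its entire contents
            JavaPairRDD<String, String> files = sc.wholeTextFiles("path/to/dir/");

            // Re-key to (filePath, line) pairs so every record carries its source file
            JavaPairRDD<String, String> taggedLines = files.flatMapToPair(kv ->
                    Arrays.stream(kv._2().split("\n"))
                          .map(line -> new Tuple2<>(kv._1(), line))
                          .iterator());

            taggedLines.take(10).forEach(t ->
                    System.out.println(t._1() + " -> " + t._2()));
        }
    }
}

Keep in mind that wholeTextFiles loads each file in its entirety, so it works best with many small files rather than a few huge ones.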
P.S.: I answered a similar question before and discovered that this function has issues reading compressed files (I only tested with gzip), so be aware of that.
Context: I am working on a Spring Batch pipeline that will upload data to a database. I have already figured out how to read flat .csv files and write their items with JdbcBatchItemWriter. The pipeline must read data from a Zip archive that contains multiple .csv files of different types. I'd like archive downloading and inspecting to be the first two steps of the job. I do not have enough disk space to unpack the whole downloaded archive. Instead of unpacking, I inspect the zip content to determine the .csv file paths inside the Zip file system and their types. Inspecting the zip also makes it easy to obtain an InputStream for each csv file. After that, reading and uploading (directly from the zip) all discovered .csv files will be executed in separate steps of the job.
Question: Is there any way to dynamically populate new Steps, one for each discovered csv entry of the zip, at job runtime as a result of the inspect step?
Tried: I know that Spring Batch has conditional flow, but that technique only seems to allow configuring a static number of steps that are well defined before job execution. In my case, the number of steps (csv files) and the reader types are only discovered at the second step of the job.
There is also a MultiResourceItemReader, which allows reading multiple resources sequentially, but I am going to read different types of csv files with different readers. Moreover, I'd like "filewise" step encapsulation, so that if the loading step for one file fails, the others are still executed.
The similar question How to create dynamic steps in Spring Batch does not have a suitable solution for me, as its answer assumes step creation before the job runs, while I need to add steps as a result of the second step of the job.
You could use a partitioned step:
1. During your inspect step, put a variable containing the list of csv files (as resources) into the JobExecutionContext.
2. In the partition method, retrieve the list of csv files and create a partition for each one.
3. The worker step will then be executed once for each partition created (see the sketch below).
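A minimal sketch of such a Partitioner (the class name and the "csvEntry" context key are hypothetical; the list of entries is assumed to have been read back from the JobExecutionContext the inspect step populated):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class CsvEntryPartitioner implements Partitioner {

    // Populated from the value the inspect step stored in the JobExecutionContext
    private final List<String> csvEntries;

    public CsvEntryPartitioner(List<String> csvEntries) {
        this.csvEntries = csvEntries;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int i = 0;
        for (String entry : csvEntries) {
            ExecutionContext ctx = new ExecutionContext();
            // The worker step's reader reads this key to open the right zip entry
            ctx.putString("csvEntry", entry);
            partitions.put("partition" + i++, ctx);
        }
        return partitions;
    }
}

Each partition runs as its own step execution, so a failure while loading one file should not stop the other partitions from being processed, which also gives you the "filewise" encapsulation you asked for.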
Is there a way to use the Checkstyle API without providing a java.io.File?
Our app already has the file contents in memory (they aren't read from a local file, but from another source), so it seems inefficient to me to have to create a temporary file and write the in-memory contents to it just to throw it away.
I've looked into using in-memory file systems to circumvent this, but it seems java.io.File is always bound to the actual file system. Obviously I have no way of testing whether or not performance would be better; I just wanted to ask whether Checkstyle supports such a use case.
There is no clean way to do this. I recommend creating an issue at Checkstyle, expanding more on your process and asking for a way to integrate it with Checkstyle.
Files are needed for our support of caching: we skip reading and processing a file if it is in the cache and has not changed since the last run. The cache process is intertwined with file handling, which is why no non-file route exists. Even without a physical read, Checkstyle processes file contents through FileText, which again needs a File, purely as a file name reference, alongside the lines of the file in a List.
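To illustrate that last point, here is a minimal sketch (API details may differ between Checkstyle versions) of building a FileText from in-memory lines; the File acts purely as a name and this constructor performs no I/O:

import java.io.File;
import java.util.Arrays;
import java.util.List;
import com.puppycrawl.tools.checkstyle.api.FileText;

public class InMemoryFileTextSketch {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "public class Hello {",
                "}");
        // The File supplies only a name reference; the contents come from the list
        FileText text = new FileText(new File("Hello.java"), lines);
        System.out.println(text.size() + " lines from " + text.getFile());
    }
}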
I want to move log files from a local directory to Elasticsearch using logstash. I want to remove the transferred logs (or alternatively rename them) in order to keep the log directory at a reasonable size.
I already understood that there's no built-in functionality in logstash for that, and I wondered whether I can use the sincedb file to determine whether a file has been completely processed and transferred, because I could then write code to handle the cleanup myself.
If that is not possible, I could also use a completely different solution instead of logstash.
To sum it up:
1. Is there a way to tell which files logstash has finished processing, using the sincedb file?
2. If the answer is no, is there another tool that could replace logstash in this case? I don't use any of logstash's parsing abilities, only reading from a local directory and passing the data to Elasticsearch.
The %{path} field will contain the name of the file the current event was read from, if that helps.
I'm trying to integrate Log4J2 with some custom logging functionality in a project of mine, and I'm running into issues getting Log4J2 to behave like my own implementation. I'd like to be able to change the log file being written to (create it if it doesn't exist, cease writing to the old file, write only to the new file) based on events that occur during execution.
I'll give an example that hopefully illustrates what I'm looking for. We start the application and write to some predetermined log file, but we're just recording preliminary logs because nothing of interest has happened yet. Once environmental conditions are right and we've received user input, the software is engaged and we'd like to begin logging both for debugging and for data capture. To separate the preliminary data from the interesting data, we'd like to switch to a new log file whose name contains more information about the state of the environment when the system was engaged, making it easier to sort out which log files we want to analyze and post-process.
I've seen posts on how to accomplish similar things, but they seem to either require the new filename to be known before Log4J2 is initialized (i.e. by setting a system property) or to use a RoutingAppender, which seems closer but appears to require knowing all possible values (to define the routes) that we might want to put in the file name. We'd have to define one route for each environment state (or worse yet, each state combination) we want to encode in the file name.
Any ideas?
You can use the RoutingAppender to dynamically create new files based on a ThreadContext key/value pair.
Here is an example of how to do this:
https://logging.apache.org/log4j/2.x/faq.html#separate_log_files
In the example, the value of ROUTINGKEY in the ThreadContext map determines the new file name.
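A minimal sketch of the Java side (it assumes a log4j2.xml Routing appender whose file pattern references ${ctx:ROUTINGKEY}, as in the linked FAQ entry; the key value below is made up):

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.apache.logging.log4j.ThreadContext;

public class EngagementLogging {
    private static final Logger LOG = LogManager.getLogger(EngagementLogging.class);

    public static void main(String[] args) {
        LOG.info("preliminary logging before engagement");

        // On engagement, set the route: the RoutingAppender lazily creates
        // a new log file for each distinct ROUTINGKEY value it sees.
        ThreadContext.put("ROUTINGKEY", "engaged-envState42");
        LOG.info("this event goes to the newly created file");
    }
}

Note that ThreadContext is per-thread, so every thread that logs after engagement needs the key set.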
I have multiple questions here that may sound annoying...
1) What is batch processing in Java? Is it related to .bat files, and how do you write batch files?
2) How do you read CSV files in Java? What are CSV files, and how do we know which value means what?
3) Can we include .js files in a jar? If yes, then how?
4) How do you compile a Java file from the command prompt and reference the jar(s) it uses?
1) What is batch processing in Java? Is it related to .bat files, and how do you write batch files?
Batch processing is not Java-specific. It is explained pretty well in this Wikipedia article:
Batch processing is execution of a series of programs ("jobs") on a computer without manual intervention.
Batch jobs are set up so they can be run to completion without manual intervention, so all input data is preselected through scripts or command-line parameters. This is in contrast to "online" or interactive programs which prompt the user for such input. A program takes a set of data files as input, processes the data, and produces a set of output data files. This operating environment is termed as "batch processing" because the input data are collected into batches of files and are processed in batches by the program.
There are different ways to implement batch processing in Java; I guess the most powerful library available is Spring Batch (though it has a steep learning curve). Batch processing is only marginally related to Windows .bat batch files.
2) How do you read CSV files in Java? What are CSV files, and how do we know which value means what?
When dealing with CSV (or other structured data, like XML, JSON, or database contents), you usually want to map the data to Java objects, so you need a library that does object mapping. For CSV, OpenCSV is such a library (see this section on Java bean mapping).
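A minimal sketch of that bean mapping (assuming OpenCSV 4+; the Person bean and "people.csv" are made up, and the csv is expected to have "name" and "age" header columns):

import java.io.FileReader;
import java.util.List;
import com.opencsv.bean.CsvBindByName;
import com.opencsv.bean.CsvToBeanBuilder;

public class OpenCsvBeanExample {
    // Hypothetical bean; each annotation binds a field to a csv header column
    public static class Person {
        @CsvBindByName(column = "name")
        private String name;
        @CsvBindByName(column = "age")
        private int age;

        @Override
        public String toString() {
            return name + " (" + age + ")";
        }
    }

    public static void main(String[] args) throws Exception {
        List<Person> people = new CsvToBeanBuilder<Person>(new FileReader("people.csv"))
                .withType(Person.class)
                .build()
                .parse();
        people.forEach(System.out::println);
    }
}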
3) Can we include .js files in a jar? If yes, then how?
See gdj's answer. You can put anything in a jar, but resources in a jar will not be available as File objects, only as an InputStream obtained via the Class.getResourceAsStream(name) or ClassLoader.getResourceAsStream(name) methods.
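A minimal sketch of reading a bundled .js resource that way (the classpath location "/scripts/app.js" is a placeholder for wherever the file sits inside your jar):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class JarResourceExample {
    public static void main(String[] args) throws IOException {
        // Leading slash = path resolved from the root of the classpath
        try (InputStream in = JarResourceExample.class.getResourceAsStream("/scripts/app.js")) {
            if (in == null) {
                System.err.println("resource not found on the classpath");
                return;
            }
            try (BufferedReader reader =
                         new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
                reader.lines().forEach(System.out::println);
            }
        }
    }
}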
Batch processing is not a Java-specific term. Whenever you perform an action on a group of objects/files, we can call it batch processing. ".bat" files are the Windows equivalent of shell scripts. They have no connection to Java or to batch processing in Java.
CSV means "Comma-Separated Values", i.e. each column in a line of the file is delimited by a comma. You can read CSV files using a normal FileReader and then use a StringTokenizer (or String.split()) to parse each line, as in the sketch below.
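A minimal sketch of that hand-rolled approach ("data.csv" is a placeholder; it assumes no quoted fields containing commas, for which you'd want a real CSV library):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class SimpleCsvReader {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("data.csv"))) {
            String header = reader.readLine(); // first line usually names the columns
            System.out.println("columns: " + header);
            String line;
            while ((line = reader.readLine()) != null) {
                String[] values = line.split(",", -1); // -1 keeps trailing empty fields
                System.out.println(String.join(" | ", values));
            }
        }
    }
}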
I guess you can include anything in a jar file; I don't see anything that would prevent it.
There is no direct relationship between Java and .bat files. Batch files are written in the Windows shell language. Sometimes we use .bat files to run our Java programs on Windows; in that case the batch file is typically used to build the java command line, like:
java -cp THE-CLASSPATH com.mycompany.Main arg1 arg2
You can read a CSV file as a regular text file and then split each line using the String.split() method. Alternatively, you can use one of the available open source CSV parsers, e.g. the one from Apache Commons (formerly Jakarta): http://commons.apache.org/sandbox/csv/apidocs/org/apache/commons/csv/CSVParser.html
A JAR file is just a ZIP file, so you can include anything in it, including .js files. How to do this depends on how you create the jar in the first place. If, for example, you are using an Ant script, just add *.js to the include pattern.
If you need a more specific answer, ask a more specific question.
1) Processing a lot of data at once.
2) CSV means comma-separated values, a file format. Try the OpenCSV library.
3) Yes, but you can only read them from Java code (you can't tell Apache to serve them directly over HTTP).