Java questions: CSV, batch processing, and .js files in a jar

I have multiple questions here that may sound annoying...
1) What is batch processing in Java? Is it related to .bat files, and how do you write batch files?
2) How do you read CSV files in Java? What are CSV files, and how do we know which value represents what?
3) Can we include .js files in a jar? If yes, then how?
4) How do you compile a Java file from the command prompt and specify the jar used by it?

1) What is batch processing in Java? Is it related to .bat files, and how do you write batch files?
Batch processing is not Java-specific. It is explained pretty well in this Wikipedia article:
Batch processing is execution of a series of programs ("jobs") on a computer without manual intervention.
Batch jobs are set up so they can be run to completion without manual intervention, so all input data is preselected through scripts or command-line parameters. This is in contrast to "online" or interactive programs which prompt the user for such input. A program takes a set of data files as input, processes the data, and produces a set of output data files. This operating environment is termed as "batch processing" because the input data are collected into batches of files and are processed in batches by the program.
There are different ways to implement batch processing in Java; the most powerful library available is probably Spring Batch (though it has a steep learning curve). Batch processing is only marginally related to Windows .bat batch files.
2) How do you read CSV files in Java? What are CSV files, and how do we know which value represents what?
When dealing with CSV (or other structured data, like XML, JSON or database contents), you usually want to map the data to Java objects, so you need a library that does object mapping. For CSV, OpenCSV is such a library (see the section on Java bean mapping in its documentation).
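For illustration, here is a minimal sketch of OpenCSV's bean mapping, assuming the OpenCSV 4+ CsvToBeanBuilder API; the Person class and people.csv file are hypothetical:
import java.io.FileReader;
import java.util.List;

import com.opencsv.bean.CsvBindByName;
import com.opencsv.bean.CsvToBeanBuilder;

public class CsvBeanExample {

    // Maps a CSV row under a "name,age" header row to a Java object
    public static class Person {
        @CsvBindByName(column = "name") public String name;
        @CsvBindByName(column = "age")  public int age;
    }

    public static void main(String[] args) throws Exception {
        List<Person> people = new CsvToBeanBuilder<Person>(new FileReader("people.csv"))
                .withType(Person.class)
                .build()
                .parse();
        System.out.println(people.size() + " rows mapped to Person objects");
    }
}
The header row is what tells the mapper which value depicts which thing, addressing the second half of the question.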
3) Can we include .js files in a jar? If yes, then how?
See gdj's answer. You can put anything in a jar, but resources in a jar will not be available as File objects, only as an InputStream obtained via the Class.getResourceAsStream(name) or ClassLoader.getResourceAsStream(name) methods.
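As a minimal sketch, assuming a script bundled at /scripts/app.js inside the jar (a hypothetical path):
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

public class ResourceExample {
    public static void main(String[] args) throws Exception {
        // Resolved against the classpath, so this works whether the resource
        // lives in a jar or in a plain directory
        InputStream in = ResourceExample.class.getResourceAsStream("/scripts/app.js");
        if (in == null) {
            throw new IllegalStateException("resource not found on classpath");
        }
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
    }
}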

Batch processing is not a Java-specific term. Whenever you perform an action on a group of objects/files, we can term it batch processing. ".bat" files are the Windows equivalent of shell scripts. They have no connection to Java or to batch processing in Java.
CSV stands for "Comma-Separated Values", i.e. each column in a line of the file is delimited by a comma. You can read CSV files using a normal FileReader and then use a StringTokenizer to parse each line.
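A minimal sketch of that approach (data.csv is a hypothetical file; note that StringTokenizer silently skips empty fields and does not handle quoted commas, so this only suits simple CSV):
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.StringTokenizer;

public class SimpleCsvReader {
    public static void main(String[] args) throws Exception {
        BufferedReader reader = new BufferedReader(new FileReader("data.csv"));
        String line;
        while ((line = reader.readLine()) != null) {
            // Break the line on commas; each token is one column value
            StringTokenizer tokenizer = new StringTokenizer(line, ",");
            while (tokenizer.hasMoreTokens()) {
                System.out.print(tokenizer.nextToken() + " | ");
            }
            System.out.println();
        }
        reader.close();
    }
}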
I guess you could include anything in a jar file; I don't see anything that would prevent it.

There is no direct relationship between Java and .bat files. Batch files are written in the Windows shell language. Sometimes we use .bat files to run our Java programs on Windows; in this case, the batch file is typically used to build the java command line, like:
java -cp THE-CLASSPATH com.mycompany.Main arg1 arg2
You can read a CSV file as a regular text file and then split each line using the String.split() method. Alternatively, you can use one of the available open source CSV parsers, e.g. the one from Apache Commons (formerly Jakarta): http://commons.apache.org/sandbox/csv/apidocs/org/apache/commons/csv/CSVParser.html
A JAR file is just a ZIP file, so you can include anything in it, including .js files. How to do this depends on how you create the jar file in the first place. If, for example, you are using an Ant script, just include *.js in the include pattern.
If you need a more specific answer, ask a more specific question.

1) Processing a lot of data at once.
2) CSV is comma-separated values, a file format. Try the OpenCSV library.
3) Yes, but you can only read them from Java code (you can't tell Apache to serve them directly over HTTP).

Related

Spring Batch: Create new steps at job runtime

Context: I am working on a Spring Batch pipeline which will upload data to a database. I have already figured out how to read flat .csv files and write items from them with JdbcBatchItemWriter. The pipeline I am working on must read data from a Zip archive which contains multiple .csv files of different types. I'd like to have archive downloading and inspection as the first two steps of the job. I do not have enough disk space to unpack the whole downloaded archive. Instead of unpacking, I inspect the zip file content to determine the .csv file paths inside the Zip file system and their types. Inspecting the zip file makes it easy to obtain an InputStream for the corresponding csv file. After that, reading and uploading (directly from the zip) all discovered .csv files will be executed in separate steps of the job.
Question: Is there any way to dynamically populate new Steps for each of the discovered csv entries of the zip at job runtime, as a result of the inspect step?
Tried: I know that Spring Batch has conditional flow, but it seems that this technique only allows configuring a static number of steps that are well defined before job execution. In my case, the number of steps (csv files) and the reader types are discovered at the second step of the job.
Also, there is a MultiResourceItemReader which allows reading multiple resources sequentially. But I'm going to read different types of csv files with appropriate readers. Moreover, I'd like to have "filewise" step encapsulation such that if one file-loading step fails, the others will be executed anyway.
The similar question How to create dynamic steps in Spring Batch does not have a suitable solution for me, as the answer supposes step creation before the job runs, while I need to add steps as a result of the second step of the job.
You could use partitioned steps:
1. Pass a variable containing the list of csv files as resources to the JobExecutionContext during your inspect step.
2. In the partition method, retrieve the list of csv files and create a partition for each one.
3. The step will then be executed for each of the partitions created, as sketched below.
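A minimal sketch of such a Partitioner, assuming the inspect step stored the discovered entry names in the job execution context and the worker step reads a (hypothetical) "csvEntry" key from its step execution context:
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class CsvEntryPartitioner implements Partitioner {

    // e.g. injected with @Value("#{jobExecutionContext['csvEntries']}")
    private final List<String> csvEntries;

    public CsvEntryPartitioner(List<String> csvEntries) {
        this.csvEntries = csvEntries;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<String, ExecutionContext>();
        int i = 0;
        for (String entry : csvEntries) {
            ExecutionContext context = new ExecutionContext();
            context.putString("csvEntry", entry); // read by the worker step
            partitions.put("partition" + i++, context);
        }
        return partitions;
    }
}
Each partition gets its own step execution, which also gives you the "filewise" failure isolation asked about: one failing partition does not prevent the others from running.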

Multiple files input to Stanford NER, preserving naming for each output

I have many files (the NYTimes corpus for '05, '06, and '07) and I want to run them all through the Stanford NER. "Easy," you might think, "just follow the commands in the README doc", but if you thought that just now, you would be mistaken, because my situation is a bit more complicated. I don't want them all output into some big jumbled mess; I want to preserve the naming structure of each file. For example, one file is named 1822873.xml and I processed it earlier using the following command:
java -mx600m -cp /home/matthias/Workbench/SUTD/nytimes_corpus/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -textFile /home/matthias/Workbench/SUTD/nytimes_corpus/1822873.xml -outputFormat inlineXML >> output.curtis
If I were to follow this question, i.e. many files all listed in the command one after the other, and then pipe that somewhere, wouldn't it just send them all to the same file? That sounds like a headache of the highest order.
Is there some way to send each file to a separate output file, so that, for instance, our old friend 1822873.xml would emerge from this process as, say, 1822873.output.xml, and likewise for each of the other thousand-some-odd files? Please keep in mind that I'm trying to achieve this expeditiously.
I guess this should be possible, but what is the best way to do it? With some kind of terminal command, or maybe a small script?
Maybe one among you has some experience with this type of thing.
Thank you for your consideration.
If you use the -filelist option and the -outputDirectory option, you can read in a list of files you wish to process, and the directory in which you would like to save the processed files. Example:
java -cp "*" -mx5g edu.stanford.nlp.pipeline.StanfordCoreNLP -prop annotators.prop -filelist list_of_files_to_process.txt -outputDirectory "my_output_directory"
For reference, here are the contents of list_of_files_to_process.txt:
C:/Users/dduhaime/Desktop/pq/analysis/data/washington_correspondence_data/collect_full_text/washington_full_text\02-09-02-0334.txt
C:/Users/dduhaime/Desktop/pq/analysis/data/washington_correspondence_data/collect_full_text/washington_full_text\02-09-02-0335.txt
C:/Users/dduhaime/Desktop/pq/analysis/data/washington_correspondence_data/collect_full_text/washington_full_text\02-09-02-0336.txt
C:/Users/dduhaime/Desktop/pq/analysis/data/washington_correspondence_data/collect_full_text/washington_full_text\02-09-02-0337.txt
Here are the contents of my annotators.prop file:
annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref, gender, sentiment, natlog, entitymentions, relation
The processed files will then appear in my_output_directory.
UPDATE: You can do it with a bash script like this.
@duhaime I tried that, but I had an issue with the classifier; also, is it possible to format the output from that as inline XML?
With respect to my original question, check out what I've found:
Unfortunately, there is no option to have multiple input files go to multiple output files. The best you can do in the current situation is to run the CRFClassifier once for each input file you have. If you have a ton of small files, loading the model will be an expensive part of this operation, and you might want to use the CRFClassifier server program and feed files one at a time through the client. However, I doubt that will be worth the effort except in the specific case of having very many small files.
We will try to add this as a feature for the next distribution (we have a general fix-it day coming up) but no promises.
John
My files are all numbered in ascending order; do you think it would be possible to write some kind of bash script with a loop to process each of them one at a time?
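For what it's worth, a minimal sketch of such a loop, reusing the command from the question (the jar and classifier paths are assumed to match the earlier invocation) and writing each result to <name>.output.xml:
for f in *.xml; do
  java -mx600m -cp stanford-ner-3.5.1.jar edu.stanford.nlp.ie.crf.CRFClassifier \
    -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz \
    -textFile "$f" -outputFormat inlineXML > "${f%.xml}.output.xml"
done
Note that the model is reloaded for every file, which is exactly the expensive part John's reply warns about.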

Best way to compress files

I am reading different chunks of data from a DB, writing each chunk into a CSV file, and adding that entry to a zip file. Here are my questions:
I am dealing with huge data. Is it advisable to open the zip stream at the beginning and close it at the end of the transaction? If I do so, will it hold all this data in RAM and cause memory issues?
Will there be any advantage if I keep these csv files on the hard drive and zip them at the end of the transaction? If so, what is the best way to do it in Java?
Note: We are using Java 1.6 for our application.
Have a look at the new file system support introduced with Java 7:
http://fahdshariff.blogspot.com/2011/08/java-7-working-with-zip-files.html
http://docs.oracle.com/javase/7/docs/technotes/guides/io/fsp/zipfilesystemprovider.html
This allows you to handle a zip file like a file system and just copy or write your data directly into files inside the zip file. However, the Path.toFile() method is not supported on a zip file system, so for any legacy code that requires a File object you need to create a temporary file and then copy it over.
For your application, you could just use something like Files.newBufferedWriter(...) to write the file directly into the zip archive without having to worry about the specifics.
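A minimal sketch of that approach (Java 7+, so it does not fit the asker's 1.6 constraint; /tmp/archive.zip and the entry contents are placeholders):
import java.io.BufferedWriter;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.Map;

public class ZipFsExample {
    public static void main(String[] args) throws Exception {
        Map<String, String> env = Collections.singletonMap("create", "true");
        URI uri = URI.create("jar:file:/tmp/archive.zip");
        // The zip archive is mounted as a FileSystem; entries behave like Paths
        FileSystem zipFs = FileSystems.newFileSystem(uri, env);
        try {
            Path entry = zipFs.getPath("chunk1.csv");
            BufferedWriter writer = Files.newBufferedWriter(entry, StandardCharsets.UTF_8);
            try {
                writer.write("id,name");
                writer.newLine();
                writer.write("1,example");
                writer.newLine();
            } finally {
                writer.close();
            }
        } finally {
            zipFs.close(); // the archive is finalized on close
        }
    }
}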
Make sure the ZipOutputStream is wrapped around an output stream that is not in memory (like a FileOutputStream). This will keep memory consumption to a minimum, and you can basically write until your file system is full.
There is no advantage to first creating a csv file and then zipping it; write the csv lines directly to the output stream. This can easily be done with Java 1.6.
The only limitation you might run into, if the archive gets really big, is that Java 1.6 does not support Zip64, so you are limited to 4 GB. At some point I backported the zip functionality of 1.7 to 1.6 to resolve this issue.
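A minimal Java 1.6-compatible sketch of that streaming approach (chunks.zip and the row contents are placeholders):
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class StreamingZipWriter {
    public static void main(String[] args) throws Exception {
        // Backed by a file, so memory use stays constant no matter how much is written
        ZipOutputStream zos = new ZipOutputStream(new FileOutputStream("chunks.zip"));
        Writer writer = new OutputStreamWriter(zos, "UTF-8");
        try {
            zos.putNextEntry(new ZipEntry("chunk1.csv"));
            // In the real application each row would come straight from the DB cursor
            writer.write("id,name\n");
            writer.write("1,example\n");
            writer.flush();   // flush buffered chars before closing the entry
            zos.closeEntry();
        } finally {
            writer.close();   // also closes the underlying ZipOutputStream
        }
    }
}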

Would it be faster to have an all-Java program or a Perl file that calls different jar files?

I have a Perl file that calls several JAR files one after the other. It uses one JAR file's output as the input of the next JAR file, and so on.
I wonder: if I made the whole thing a Java program and called the different classes from Java itself, would it be faster, and if so, why?
To explain more: one of the JAR files creates a lot of files in a directory, and the other JAR file reads those files as inputs.
And obviously, by faster I mean running time!
Probably. You'd then only pay the Java startup overhead cost once.
But the best way to find out is to benchmark it.
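As an illustration, a minimal sketch of running both tools in one JVM (the step class names are hypothetical placeholders for the main classes inside your jars; both jars must be on the classpath):
public class Pipeline {
    public static void main(String[] args) throws Exception {
        // One JVM startup instead of one per jar invocation
        com.mycompany.StepOne.main(new String[] { "input.txt", "intermediate.txt" });
        com.mycompany.StepTwo.main(new String[] { "intermediate.txt", "output.txt" });
    }
}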

Security for file uploads in Java

I have a web application using Java Servlets in which the user can upload files. What can I do to prevent malicious files and viruses from being uploaded?
The ClamAV antivirus team provides a very easy interface for integrating the clamd daemon into your own programs. It is socket-based instead of API-based, so you might need to write some convenience wrappers to make it look "natural" in your code, but the end result is that they do not need to maintain a dozen or more language bindings.
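A minimal sketch of one such wrapper using clamd's INSTREAM command (this assumes clamd is listening on localhost:3310, the conventional default, and that the whole reply fits in one read):
import java.io.DataOutputStream;
import java.io.InputStream;
import java.net.Socket;

public class ClamdScanner {
    public static String scan(byte[] fileBytes) throws Exception {
        Socket socket = new Socket("localhost", 3310);
        try {
            DataOutputStream out = new DataOutputStream(socket.getOutputStream());
            out.write("zINSTREAM\0".getBytes("US-ASCII"));
            out.writeInt(fileBytes.length); // 4-byte big-endian chunk length
            out.write(fileBytes);
            out.writeInt(0);                // zero-length chunk terminates the stream
            out.flush();

            InputStream in = socket.getInputStream();
            byte[] reply = new byte[256];
            int n = in.read(reply);
            return new String(reply, 0, n, "US-ASCII"); // e.g. "stream: OK"
        } finally {
            socket.close();
        }
    }
}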
Alternatively, if you have enough access to the machine in question, you could simply call a command-line application to do the scanning. There is plenty of information available on starting command-line applications, and most if not all locally installed virus scanners have a command-line option. This has the advantage that not every IP packet has to pass through the scanner (but you will have to read and parse the output of the virus scanner). It also ensures the information is available in your Java application so you can warn the user.
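A minimal sketch of that command-line route (assuming ClamAV's clamscan is installed; it exits with 0 when no virus is found):
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class CommandLineScan {
    public static boolean isClean(String path) throws Exception {
        ProcessBuilder pb = new ProcessBuilder("clamscan", "--no-summary", path);
        pb.redirectErrorStream(true); // merge stderr into stdout
        Process process = pb.start();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(process.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line); // e.g. "/tmp/upload: OK"
        }
        return process.waitFor() == 0;
    }
}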
You also need to protect against path traversal (making sure users cannot upload files to a place they do not belong, such as overwriting a JAR file in the classpath or a DLL in the path).
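A minimal sketch of such a check, resolving the submitted file name against the upload directory and rejecting anything that escapes it:
import java.io.File;
import java.io.IOException;

public class UploadPathGuard {
    public static File resolve(File uploadDir, String submittedName) throws IOException {
        File target = new File(uploadDir, submittedName);
        // Canonical paths collapse ".." segments, exposing traversal attempts
        String canonicalDir = uploadDir.getCanonicalPath() + File.separator;
        if (!target.getCanonicalPath().startsWith(canonicalDir)) {
            throw new IOException("Path traversal attempt: " + submittedName);
        }
        return target;
    }
}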
