Best way to compress files - java

I am reading different chunks of data from a DB, writing each chunk into a CSV file, and adding that entry to a zip file. Here are my questions:
I am dealing with huge amounts of data. Is it advisable to open the zip stream at the beginning and close it at the end of the transaction? If I do so, will it hold all of this data in RAM and cause memory issues?
Would there be any advantage to keeping these CSV files on the hard drive and zipping them at the end of the transaction? If so, what is the best way to do it in Java?
Note: We are using Java 1.6 for our application.

Have a look at the new zip file system introduced with Java 7:
http://fahdshariff.blogspot.com/2011/08/java-7-working-with-zip-files.html
http://docs.oracle.com/javase/7/docs/technotes/guides/io/fsp/zipfilesystemprovider.html
This allows you to handle a zip file like a file system and just copy or write your data directly into files inside the zip archive. However, the Path.toFile() method is not supported on a zip file system, so for any legacy code that requires a File object, you need to create a temporary file and then copy it over.
For your application you could just use something like Files.newBufferedWriter(...) to write the file directly into the zip archive without having to worry about the specifics.
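A minimal sketch of that approach, assuming Java 7 is available (the archive path and entry name here are hypothetical):

import java.io.BufferedWriter;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.Collections;

public class ZipFsWrite {
    public static void main(String[] args) throws Exception {
        // "jar:file:" URI targeting the archive; "create" builds it if missing
        URI uri = URI.create("jar:file:/tmp/data.zip");
        try (FileSystem zipFs = FileSystems.newFileSystem(
                uri, Collections.singletonMap("create", "true"))) {
            Path entry = zipFs.getPath("/chunk1.csv");
            try (BufferedWriter writer = Files.newBufferedWriter(
                    entry, StandardCharsets.UTF_8)) {
                writer.write("id,name\n1,example\n");
            }
        }
    }
}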

Make sure the ZipOutputStream is wrapped around an OutputStream that is not in memory (such as a FileOutputStream). This keeps memory consumption to a minimum, and you can basically write until your filesystem is full.
There is no advantage to first creating a CSV file and then zipping it; write the CSV lines directly to the output stream. This can easily be done with Java 1.6.
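A rough sketch of that streaming approach on Java 1.6 (file and column names are made up):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class CsvToZip {
    public static void main(String[] args) throws Exception {
        ZipOutputStream zos = new ZipOutputStream(
                new BufferedOutputStream(new FileOutputStream("export.zip")));
        try {
            zos.putNextEntry(new ZipEntry("chunk1.csv"));
            Writer writer = new OutputStreamWriter(zos, "UTF-8");
            // write each row as it arrives from the DB; only the
            // compressor's small buffer stays in memory
            writer.write("id,name\n");
            writer.write("1,example\n");
            writer.flush(); // flush the writer before closing the entry
            zos.closeEntry();
        } finally {
            zos.close();
        }
    }
}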
The only limitation you might run into, if the archive gets really big, is that Java 1.6 does not support Zip64, so you are limited to 4 GB. At some point I backported the zip functionality of 1.7 to 1.6 to resolve this issue.

Related

Use the Checkstyle API without providing a java.io.File

Is there a way to use the Checkstyle API without providing a java.io.File?
Our app already has the file contents in memory (they aren't read from a local file, but from another source), so it seems inefficient to me to have to create a temporary file and write the in-memory contents to it just to throw it away.
I've looked into using in-memory file systems to circumvent this, but it seems java.io.File is always bound to the actual file system. Obviously I have no way of testing whether or not performance would be better; I just wanted to ask if Checkstyle supports such a use case.
There is no clean way to do this. I recommend creating an issue at Checkstyle expanding more on your process and asking for a way to integrate it with Checkstyle.
Files are needed for our support of caching, as we skip over reading and processing a file if it is in the cache and it has not changed since the last run. The cache process is intertwined which is why no non-file route exists. Even without a file, Checkstyle processes the contents of files through FileText, which again needs a File for just a file name reference and lines of the file in a List.
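If the FileText constructor taking a File and a List of lines is accessible in your Checkstyle version (worth verifying against its javadoc), a sketch of feeding it in-memory contents could look like this; the helper class and method are hypothetical:

import com.puppycrawl.tools.checkstyle.api.FileText;
import java.io.File;
import java.util.Arrays;
import java.util.List;

public class InMemoryFileText {
    // build a FileText from in-memory contents; the File serves only
    // as a name reference and does not have to exist on disk
    public static FileText fromMemory(String name, String contents) {
        List<String> lines = Arrays.asList(contents.split("\n", -1));
        return new FileText(new File(name), lines);
    }
}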

How to store files in memory with java?

I am trying to implement a minimal FTP Server using java. On this server, I want all files to exist in memory only. Nothing should be written on disk.
Having said this, I have to create a virtual file system, comprised of a root directory, some sub-directories and files. A few of those will initially be loaded from the hard disk and then will only be handled in memory.
My question is: is there an efficient way to implement this in Java? Is there something that is already implemented? A class I should use? (I don't have access to all libraries, essentially only java.lang and java.io.)
Assuming there is not, I have created my own simple FileSystem, Directory and File classes. I have no idea, however, how I should store the actual data in memory. Knowing that a file can be an image, a text file or anything else that could plausibly be exchanged with an FTP server, how should I store it? Also, there are two transfer modes that I should be able to use: binary and ASCII. So in whatever format I store the data, I should be able to convert them to some kind of binary or ASCII format.
I know the question is a bit abstract, any sort of hints as to where I should look will be appreciated.
The data will stay in memory unless it is written to disk.
Assuming you have the relevant data stored in some variables, simply do not call any file writer, and the variables will remain on Java's stack/heap without being stored to the filesystem.
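As a sketch of how the raw contents could be held (byte arrays work for images, text, and anything else), here is a hypothetical minimal store; note it uses java.util.HashMap in addition to java.io, so swap in your own structure if that package is off limits:

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.util.HashMap;
import java.util.Map;

public class MemoryFileStore {
    // path -> raw bytes
    private final Map<String, byte[]> files = new HashMap<String, byte[]>();

    public void write(String path, byte[] data) {
        files.put(path, data);
    }

    // binary transfer mode: hand back the bytes untouched
    public InputStream openBinary(String path) {
        return new ByteArrayInputStream(files.get(path));
    }

    // ASCII transfer mode: normalize line endings to CRLF
    public InputStream openAscii(String path) throws UnsupportedEncodingException {
        String text = new String(files.get(path), "US-ASCII");
        byte[] ascii = text.replaceAll("\r?\n", "\r\n").getBytes("US-ASCII");
        return new ByteArrayInputStream(ascii);
    }
}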

How do I append files to a .tar archive in Java?

I would like to create a tar archive in Java. I have files which are constantly being created, and I'd like a worker thread to take references to those files from a queue and copy them into the archive.
I tried using the Apache Commons Compress library's TarArchiveOutputStream to do this, but I do not wish to keep the archive open for the entire duration of the program (unless I finalize the archive, it can become corrupted, so I'd rather append to it in batches), and I haven't found a good way to append to an existing tar archive with their library. (They do have the ChangeSetPerformer class, but it basically just creates a new tar and needs to copy the old one entirely, which isn't good for me performance-wise.)
I also need the library to not have a low limit on the size of the archive (i.e. 4 GB or so is not enough), and I'd rather avoid having to actually compress the archive.
Any suggestions would be greatly appreciated!
Thank you.
You are running into a limitation of the tar format: http://en.wikipedia.org/wiki/Tar_(file_format)#Random_access
Because of that, it is hard to add or remove single files without copying the whole archive. Appending at the end is the exception: a tar ends with a fixed terminator (two 512-byte zero blocks), so new entries can be written over that terminator, followed by a fresh one.
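A hedged sketch of that append-in-batches idea using Commons Compress; the 1024-byte truncation assumes the archive ends with exactly the standard two-block terminator, which holds for tars written by this same code but not for tools that pad further:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.RandomAccessFile;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
import org.apache.commons.compress.utils.IOUtils;

public class TarBatchAppend {
    // one file per call; the archive stays valid between calls because
    // every call ends by writing a new end-of-archive terminator
    public static void append(File tar, File toAdd) throws Exception {
        if (tar.exists() && tar.length() >= 1024) {
            RandomAccessFile raf = new RandomAccessFile(tar, "rw");
            raf.setLength(tar.length() - 1024); // drop the two zero blocks
            raf.close();
        }
        TarArchiveOutputStream tos = new TarArchiveOutputStream(
                new FileOutputStream(tar, true)); // append mode
        // POSIX big-number mode lifts the classic 8 GiB entry limit
        tos.setBigNumberMode(TarArchiveOutputStream.BIGNUMBER_POSIX);
        TarArchiveEntry entry = new TarArchiveEntry(toAdd, toAdd.getName());
        tos.putArchiveEntry(entry);
        FileInputStream in = new FileInputStream(toAdd);
        IOUtils.copy(in, tos);
        in.close();
        tos.closeArchiveEntry();
        tos.close(); // writes the fresh terminator
    }
}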
I use a library called Java Tar: http://www.trustice.com/java/tar/
It has worked for me. In that package, look for:
http://www.gjt.org/javadoc/com/ice/tar/TarArchive.html#writeEntry(com.ice.tar.TarEntry, boolean)
which lets you add an entry to the file without using a stream at the user level. I don't know about file size, but it would be a simple matter to test this aspect.
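Roughly, usage might look like the following; only writeEntry(TarEntry, boolean) is confirmed by the javadoc above, so treat the constructor and closeArchive() call as assumptions to check against the library:

import java.io.File;
import java.io.FileOutputStream;
import com.ice.tar.TarArchive;
import com.ice.tar.TarEntry;

public class JavaTarWrite {
    public static void main(String[] args) throws Exception {
        TarArchive archive = new TarArchive(new FileOutputStream("out.tar")); // assumed ctor
        TarEntry entry = new TarEntry(new File("report.csv"));
        archive.writeEntry(entry, false); // false = do not recurse into dirs
        archive.closeArchive(); // assumed close method
    }
}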

Java questions CSV/batch/js in jar

I have multiple questions here that may sound annoying...
What is batch processing in Java? Is it related to .bat files, and how do you write batch files?
How do you read CSV files in Java? What are CSV files, and how do we know which value depicts which thing?
Can we include js files in a jar? If yes, then how?
How do you compile a java file from the command prompt and specify the jar it uses?
1) What is batch processing in Java? Is it related to .bat files, and how do you write batch files?
Batch Processing is not Java specific. It is explained pretty well in this Wikipedia article
Batch processing is execution of a series of programs ("jobs") on a computer without manual intervention.
Batch jobs are set up so they can be run to completion without manual intervention, so all input data is preselected through scripts or command-line parameters. This is in contrast to "online" or interactive programs which prompt the user for such input. A program takes a set of data files as input, processes the data, and produces a set of output data files. This operating environment is termed as "batch processing" because the input data are collected into batches of files and are processed in batches by the program.
There are different ways to implement batch processing in Java, but I guess the most powerful library available is Spring Batch (but it has a steep learning curve). Batch processing is only marginally related to windows .bat batch files.
2) How do you read CSV files in Java? What are CSV files, and how do we know which value depicts which thing?
When dealing with CSV (or other structured data, like XML, JSON or database contents), you usually want to map the data to Java objects, so you need a library that does Object mapping. For CSV, OpenCSV is such a library (see this section on Java bean mapping).
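A minimal sketch with OpenCSV (the package name varies by version, e.g. au.com.bytecode.opencsv in older releases; "users.csv" and its columns are hypothetical):

import com.opencsv.CSVReader;
import java.io.FileReader;

public class CsvRead {
    public static void main(String[] args) throws Exception {
        CSVReader reader = new CSVReader(new FileReader("users.csv"));
        try {
            String[] header = reader.readNext(); // e.g. ["id", "name"]
            String[] row;
            while ((row = reader.readNext()) != null) {
                // the header row tells you which column depicts which value
                System.out.println(header[1] + " = " + row[1]);
            }
        } finally {
            reader.close();
        }
    }
}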
3) Can we include js files in a jar? If yes, then how?
see gdj's answer. You can put anything in a jar, but resources in a jar will not be available as File objects, only as an InputStream obtained via the Class.getResourceAsStream(name) or ClassLoader.getResourceAsStream(name) methods.
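For example, reading a bundled script via the class loader; the entry path "/scripts/app.js" is hypothetical:

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

public class JsFromJar {
    public static void main(String[] args) throws Exception {
        InputStream in = JsFromJar.class.getResourceAsStream("/scripts/app.js");
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
    }
}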
Batch processing is not a Java-specific term. Whenever you perform an action on a group of objects/files, we can term it batch processing. .bat files are the Windows equivalent of shell scripts. They do not have any connection to Java or to batch processing in Java.
CSV stands for "Comma-Separated Values", i.e. each column in a line of the file is delimited by a comma. You can read CSV files using a normal FileReader and then a StringTokenizer to parse each line.
I guess you could include anything in a jar file; I don't see anything that would prevent it.
There is no direct relationship between Java and .bat. Batch files are written in the Windows shell language. Sometimes we use .bat files to run our Java programs on Windows; in this case the batch file is typically used to build the java command line, like
java -cp THE-CLASSPATH com.mycompany.Main arg1 arg2
You can read a CSV file as a regular text file and then split each line using the String.split() method. Alternatively you can use one of the available open source CSV parsers, e.g. from Apache Commons: http://commons.apache.org/sandbox/csv/apidocs/org/apache/commons/csv/CSVParser.html
A JAR file is just a ZIP file. You can include anything in a ZIP, including js files. How to do this depends on how you create the jar file in the first place. If, for example, you are using an ant script, just add *.js to the include pattern.
If you need a more specific answer, ask a more specific question.
1) Processing a lot of data at once.
2) CSV is comma-separated values, a file format. Try the OpenCSV library.
3) Yes, but you can only read them from Java code (you can't tell Apache to serve them directly over HTTP).

How to safely update a file that has many readers and one writer?

I have a set of files. The set of files is read-only off an NTFS share, and thus can have many readers. Each file is updated occasionally by one writer that has write access.
How do I ensure that:
If the write fails, that the previous file is still readable
Readers cannot hold up the single writer
I am using Java, and my current solution is for the writer to write to a temporary file, then swap it with the existing file using File.renameTo(). The problem is that on NTFS, renameTo fails if the target file already exists, so you have to delete it yourself. But if the writer deletes the target file and then fails (computer crash), I don't have a readable file.
NIO's FileLock only works within the same JVM, so it is useless to me.
How do I safely update a file with many readers using Java?
According to the JavaDoc:
This file-locking API is intended to map directly to the native locking facility of the underlying operating system. Thus the locks held on a file should be visible to all programs that have access to the file, regardless of the language in which those programs are written.
I don't know if this is applicable, but if you are running in a pure Vista/Windows Server 2008 solution, I would use TxF (transactional NTFS) and then make sure you open the file handle and perform the file operations by calling the appropriate file APIs through JNI.
If that is not an option, then I think you need to have some sort of service that all clients access which is responsible to coordinate the reading/writing of the file.
On a Unix system, I'd remove the file and then open it for writing. Anybody who had it open for reading would still see the old one, and once they'd all closed it, it would vanish from the file system. I don't know if NTFS has similar semantics, although I've heard it's loosely based on BSD's file system, so maybe it does.
Something that should always work, no matter the OS, is changing your client software.
If this is an option, then you could have a file "settings1.ini", and when you want to change it, you create a file "settings2.ini.wait", write your stuff to it, rename it to "settings2.ini", and then delete "settings1.ini".
Your changed client software would simply always check for settings2.ini if it read settings1.ini last, and vice versa.
This way you always have a working copy.
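The writer side of that scheme might look like this sketch (file names are the hypothetical ones from the description above):

import java.io.File;
import java.io.FileWriter;

public class AlternatingSettingsWriter {
    public static void writeNewSettings(String contents) throws Exception {
        // write under a name readers never look for
        File wait = new File("settings2.ini.wait");
        FileWriter out = new FileWriter(wait);
        out.write(contents);
        out.close();
        // the finished file becomes visible only via the rename
        if (!wait.renameTo(new File("settings2.ini"))) {
            throw new IllegalStateException("rename failed");
        }
        // the old copy can go once the new one is in place
        new File("settings1.ini").delete();
    }
}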
There might be no need for locking. I am not too familiar with the FS API on Windows, but since NTFS supports both hard links and soft links, AFAIK you can try this if your setup allows it:
Use a hard or soft link to point to the actual file, and give the link a different name. Let everyone access the file through the link's name.
Write the new file under a different name, in the same folder.
Once it is finished, repoint the link to the new file. Ideally, Windows would allow you to create the new link, replacing the existing link, in one atomic operation; then the link would always identify a valid file, either the old or the new one. At worst, you'd have to delete the old link first, then create the link to the new file, in which case there'd be a short time span in which a program would not be able to locate the file. (Also, Mac OS X offers an ExchangeObjects function that allows you to swap two items atomically; maybe Windows offers something similar.)
This way, any program that has the old file already open will continue to access the old one, and you won't get in its way while creating the new one. Once an app notices the existence of the new version, it can close the current file and reopen it, thereby getting access to the new version.
I don't know, however, how to create links in Java. Maybe you have to use some native API for that.
I hope this helps anyway.
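For what it's worth, Java 7's NIO.2 can create hard links without native code. A sketch of the repointing step (names are hypothetical; note the delete-then-link pair is the non-atomic worst case described above):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LinkRepoint {
    public static void main(String[] args) throws Exception {
        Path newVersion = Paths.get("data-v2.bin");
        Path link = Paths.get("current-data");
        Files.deleteIfExists(link);         // not atomic with the next step
        Files.createLink(link, newVersion); // hard link to the new file
    }
}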
I have been dealing with something similar recently. If you are running Java 5, perhaps you could consider using NIO file locks in conjunction with a ReentrantReadWriteLock? Make sure all code referencing the FileChannel object ALSO references the ReentrantReadWriteLock. This way the NIO locks it at a per-VM level while the reentrant lock locks it at a per-thread level.
reentrantReadWriteLock.writeLock().lock(); // take the in-JVM lock first
try {
    FileLock fileLock = fileChannel.lock(position, size, shared);
    try {
        // do stuff
    } finally { fileLock.release(); }
} finally {
    reentrantReadWriteLock.writeLock().unlock();
}
The try/finally blocks ensure both locks are released even if an exception is thrown mid-operation; further exception handling around the channel operations would still be required.
