Spark Reading .7z files - java

I am trying to read .7z files in Spark using Scala or Java, but I can't find any appropriate method or functionality.
For zip files I am able to read them, because the ZipInputStream class takes an InputStream, but for 7z files the SevenZFile class doesn't take any input stream.
https://commons.apache.org/proper/commons-compress/javadocs/api-1.16/org/apache/commons/compress/archivers/sevenz/SevenZFile.html
Zip file code
spark.sparkContext.binaryFiles("fileName").flatMap{case (name: String, content: PortableDataStream) =>
val zis = new ZipInputStream(content.open)
Stream.continually(zis.getNextEntry)
.takeWhile(_ != null)
.flatMap { _ =>
val br = new BufferedReader(new InputStreamReader(zis))
Stream.continually(br.readLine()).takeWhile(_ != null)
}}
I am trying similar code for the 7z files, something like:
spark.sparkContext.binaryFiles(""filename"").flatMap{case (name: String, content: PortableDataStream) =>
val zis = new SevenZFile(content.open)
Stream.continually(zis.getNextEntry)
.takeWhile(_ != null)
.flatMap { _ =>
val br = new BufferedReader(new InputStreamReader(zis))
Stream.continually(br.readLine()).takeWhile(_ != null)
}}
But SevenZFile doesn't accept an input stream, so this doesn't work. Looking for ideas.
If the file is on the local filesystem, the following solution works, but my file is in HDFS.
Local filesystem code
public static void decompress(String in, File destination) throws IOException {
    SevenZFile sevenZFile = new SevenZFile(new File(in));
    SevenZArchiveEntry entry;
    while ((entry = sevenZFile.getNextEntry()) != null) {
        if (entry.isDirectory()) {
            continue;
        }
        File curfile = new File(destination, entry.getName());
        File parent = curfile.getParentFile();
        if (!parent.exists()) {
            parent.mkdirs();
        }
        FileOutputStream out = new FileOutputStream(curfile);
        byte[] content = new byte[(int) entry.getSize()];
        sevenZFile.read(content, 0, content.length);
        out.write(content);
        out.close();
    }
}
After all these years of Spark evolution, there should be an easy way to do it.

Instead of using the java.io.File-based approach, you could try the SeekableByteChannel method as shown in this alternative constructor.
You can use a SeekableInMemoryByteChannel to read a byte array. So as long as you can pick up the 7zip files from S3 or whatever and hand them off as byte arrays you should be alright.
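For instance, a minimal sketch of that idea (untested; it assumes the archive fits in memory, and that in the Spark job above you would obtain the byte[] via content.toArray() on the PortableDataStream):

import java.io.IOException;
import org.apache.commons.compress.archivers.sevenz.SevenZArchiveEntry;
import org.apache.commons.compress.archivers.sevenz.SevenZFile;
import org.apache.commons.compress.utils.SeekableInMemoryByteChannel;

public class SevenZFromBytes {
    // Reads every entry of a 7z archive held entirely in memory.
    public static void readAll(byte[] archiveBytes) throws IOException {
        try (SevenZFile sevenZFile = new SevenZFile(new SeekableInMemoryByteChannel(archiveBytes))) {
            SevenZArchiveEntry entry;
            while ((entry = sevenZFile.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                byte[] content = new byte[(int) entry.getSize()];
                int read = 0;
                int n;
                // read() may return fewer bytes than requested, so loop until the entry is consumed
                while (read < content.length
                        && (n = sevenZFile.read(content, read, content.length - read)) != -1) {
                    read += n;
                }
                // process content here, e.g. new String(content, StandardCharsets.UTF_8)
            }
        }
    }
}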
With all of that said, Spark is really not well-suited for processing things like zip and 7zip files. I can tell you from personal experience I've seen it fail badly once the files are too large for Spark's executors to handle.
Something like Apache NiFi will work much better for expanding archives and processing them. FWIW, I'm currently handling a large data dump that has me frequently dealing with 50GB tarballs that have several million files in them, and NiFi handles them very gracefully.

Related

How to access every entry of CompressedSource in Google Cloud Dataflow and get the byte[] of each subfile?

I have a compressed file, which is a gzipped tar file composed of multiple text files, on Google Storage. I need to access each subfile and do some operation on it, like applying a regular expression.
I can do the same thing on my local computer like this:
public static void untarFile(String filepath) throws IOException {
    try (FileInputStream fin = new FileInputStream(filepath);
         BufferedInputStream in = new BufferedInputStream(fin);
         GzipCompressorInputStream gzIn = new GzipCompressorInputStream(in);
         TarArchiveInputStream tarInput = new TarArchiveInputStream(gzIn)) {
        TarArchiveEntry entry;
        while ((entry = tarInput.getNextTarEntry()) != null) {
            byte[] fileContent = new byte[(int) entry.getSize()];
            tarInput.read(fileContent, 0, fileContent.length);
        }
    }
}
Therefore, I can do some other operation on fileContent, which is a byte[]. So I used CompressedSource on Google Cloud Dataflow, referring to its test code. It seems that I can only get every byte from the file instead of the whole byte[] of a subfile, so I am wondering if there is any solution to do this on Google Cloud Dataflow.
TextIO does not support this directly, but you can create a new subclass of FileBasedSource to do this. You'll want to override isSplittable() to always return false, and then have readNextRecord() just read the entire file.

Combining compressed Gzipped Text Files using Java

My question might not be entirely related to Java, but I'm currently seeking a method to combine several compressed (gzipped) text files without having to recompress them manually. Let's say I have 4 files, all text compressed with gzip, and want to combine these into one single *.gz file without de- and recompressing them. My current method is to open an InputStream, parse each file line by line, and write into a GZIPOutputStream, which works but isn't very fast... I could of course also call
zcat file1 file2 file3 | gzip -c > output_all_four.gz
This would work too, but isn't really fast either.
My idea is to copy the input stream and write it to the output stream directly without "parsing" the stream, as I don't actually need to manipulate anything. Is something like this possible?
Find below a simple solution in Java (it does the same as my zcat ... example). Any buffering of the input/output has been omitted to keep the code slim.
public class ConcatFiles {
    public static void main(String[] args) throws IOException {
        // concatenate the single gzip files to one gzip file
        try (InputStream isOne = new FileInputStream("file1.gz");
             InputStream isTwo = new FileInputStream("file2.gz");
             InputStream isThree = new FileInputStream("file3.gz");
             SequenceInputStream sis = new SequenceInputStream(new SequenceInputStream(isOne, isTwo), isThree);
             OutputStream bos = new FileOutputStream("output_all_three.gz")) {
            byte[] buffer = new byte[8192];
            int intsRead;
            while ((intsRead = sis.read(buffer)) != -1) {
                bos.write(buffer, 0, intsRead);
            }
            bos.flush();
        }

        // un-gzip the single gzip file; the output contains the
        // concatenated input of the single uncompressed files
        try (GZIPInputStream gzipis = new GZIPInputStream(new FileInputStream("output_all_three.gz"));
             OutputStream bos = new FileOutputStream("output_all_three")) {
            byte[] buffer = new byte[8192];
            int intsRead;
            while ((intsRead = gzipis.read(buffer)) != -1) {
                bos.write(buffer, 0, intsRead);
            }
            bos.flush();
        }
    }
}
The above method works if you just need to combine many gzipped files. In my case I had made a web servlet and my response was 20-30 KB, so I was sending the gzipped response.
I tried to gzip all my individual JS files at server start only and then append dynamically generated code at runtime using the above method. I could print the entire response in my log file, but Chrome was able to unzip the first file only; the rest of the output came through as raw bytes.
After some research I found out that this is not possible with Chrome, and they have closed the bug without solving it.
https://bugs.chromium.org/p/chromium/issues/detail?id=20884

Combining all text files in a folder into a single file

How can I combine all txt files in a folder into a single file? A folder usually contains hundreds to thousands of txt files.
If this program were only to be run on Windows machines I would just go with a batch file containing something like
copy /b *.txt merged.txt
But that is not the case, so I figured it might be easier to just write it in Java to complement everything else we have.
I have written something like this
// Retrieves a list of files from the specified folder with the filter applied
File[] files = Utils.filterFiles(downloadFolder + folder, ".*\\.txt");
try
{
    // savePath is the path of the output file
    FileOutputStream outFile = new FileOutputStream(savePath);
    for (File file : files)
    {
        FileInputStream inFile = new FileInputStream(file);
        Integer b = null;
        while ((b = inFile.read()) != -1)
            outFile.write(b);
        inFile.close();
    }
    outFile.close();
}
catch (Exception e)
{
    e.printStackTrace();
}
But it takes several minutes to combine thousands of files so it is not feasible.
Use NIO; it is much easier than using InputStreams/OutputStreams. Note: this uses Guava's Closer, which means all resources are safely closed; even better would be to use Java 7 and try-with-resources.
final Closer closer = Closer.create();
final RandomAccessFile outFile;
final FileChannel outChannel;
try {
    outFile = closer.register(new RandomAccessFile(dstFile, "rw"));
    outChannel = closer.register(outFile.getChannel());
    for (final File file : filesToCopy)
        doWrite(outChannel, file);
} finally {
    closer.close();
}

// doWrite method
private static void doWrite(final WritableByteChannel channel, final File file)
    throws IOException
{
    final Closer closer = Closer.create();
    final RandomAccessFile inFile;
    final FileChannel inChannel;
    try {
        inFile = closer.register(new RandomAccessFile(file, "r"));
        inChannel = closer.register(inFile.getChannel());
        inChannel.transferTo(0, inChannel.size(), channel);
    } finally {
        closer.close();
    }
}
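For reference, a Java 7 try-with-resources version of the same idea might look like this (a sketch; the class and method names are illustrative, with dstFile and filesToCopy as above):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

public class MergeFiles {
    public static void merge(File dstFile, Iterable<File> filesToCopy) throws IOException {
        try (RandomAccessFile outFile = new RandomAccessFile(dstFile, "rw");
             FileChannel outChannel = outFile.getChannel()) {
            for (File file : filesToCopy) {
                try (RandomAccessFile inFile = new RandomAccessFile(file, "r");
                     FileChannel inChannel = inFile.getChannel()) {
                    // let the OS copy the bytes directly between channels
                    inChannel.transferTo(0, inChannel.size(), outChannel);
                }
            }
        }
    }
}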
Because of this
Integer b = null;
while ((b = inFile.read()) != -1)
outFile.write(b);
Your OS is making a lot of IO calls. read() only reads one byte of data. Use the other read methods that accept a byte[]. You can then use that byte[] to write to your OutputStream. Similarly write(int) does an IO call writing a single byte. Change that too.
Of course, you can look into tools that do this for you, like Apache Commons IO or even the Java 7 NIO package.
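For example, the per-byte loop could be replaced with a buffered loop along these lines (a sketch; inFile and outFile are the streams from the question, and 8192 is an arbitrary buffer size):

// drop-in replacement for the per-byte loop in the question,
// assuming inFile and outFile are the streams already opened there
byte[] buffer = new byte[8192];
int bytesRead;
while ((bytesRead = inFile.read(buffer)) != -1) {
    outFile.write(buffer, 0, bytesRead);
}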
Try using BufferedReader and BufferedWriter instead of writing bytes one by one.
You can use IOUtils to merge files; the IOUtils.copy() method will help you merge files.
This link may be useful: merging files in java
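A minimal sketch of that approach (assuming Apache Commons IO is on the classpath; the class and file names here are illustrative):

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.commons.io.IOUtils;

public class IOUtilsMerge {
    public static void merge(File[] files, File merged) throws IOException {
        try (FileOutputStream out = new FileOutputStream(merged)) {
            for (File file : files) {
                try (FileInputStream in = new FileInputStream(file)) {
                    // copies the whole input stream into the output stream, buffered internally
                    IOUtils.copy(in, out);
                }
            }
        }
    }
}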
I would do it this way!
Check for the OS:
System.getProperty("os.name")
Then run the system-level command from Java.
If Windows:
copy /b *.txt merged.txt
If Unix:
cat *.txt > merged.txt
or whatever system-level command works best.
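A hedged sketch of how that could look from Java (shell quoting and error handling are simplified; merged.txt is a placeholder, and on the Unix branch note that the output file itself may match *.txt):

import java.io.File;
import java.io.IOException;

public class OsMerge {
    public static void mergeTxtFiles(String folder) throws IOException, InterruptedException {
        ProcessBuilder pb;
        if (System.getProperty("os.name").toLowerCase().contains("win")) {
            pb = new ProcessBuilder("cmd", "/c", "copy /b *.txt merged.txt");
        } else {
            pb = new ProcessBuilder("sh", "-c", "cat *.txt > merged.txt");
        }
        pb.directory(new File(folder));   // run the command inside the target folder
        pb.inheritIO();                   // forward the command's output to our console
        int exit = pb.start().waitFor();
        if (exit != 0) {
            throw new IOException("merge command failed with exit code " + exit);
        }
    }
}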

Extract a .tar.gz file in java (JSP)

I can't seem to import the packages I need, or find any online examples of how to extract a .tar.gz file in Java.
What makes it worse is that I'm using JSP pages and am having trouble importing packages into my project. I'm copying the .jars into WebContent/WEB-INF/lib/, then right-clicking on the project, selecting import external jar, and importing it. Sometimes the packages resolve, other times they don't. I can't seem to get GZIP to import either. The imports in Eclipse for JSP aren't intuitive like they are in normal Java code, where you can right-click a recognized package and select import.
I've tried the Apache Commons library, Ice, and another one called JTar. Ice has imported, but I can't find any examples of how to use it.
I guess I need to uncompress the gzipped part first, then open it with the tar stream?
Any help is greatly appreciated.
The accepted answer works fine, but I think it is redundant to have a write-to-file operation.
You could use something like
TarArchiveInputStream tarInput =
    new TarArchiveInputStream(new GZIPInputStream(new FileInputStream("Your file name")));
TarArchiveEntry currentEntry;
while ((currentEntry = tarInput.getNextTarEntry()) != null) {
    // currentEntry.getName() gives the entry's path inside the archive;
    // read its content from tarInput and write to a file as usual
}
Hope this helps.
Maven Repo
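One way to fill in that TODO (a sketch using Commons IO; the class name and output directory are illustrative):

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.io.IOUtils;

public class UntarGz {
    public static void extract(File tarGz, File outDir) throws IOException {
        try (TarArchiveInputStream tarInput =
                 new TarArchiveInputStream(new GZIPInputStream(new FileInputStream(tarGz)))) {
            TarArchiveEntry entry;
            while ((entry = tarInput.getNextTarEntry()) != null) {
                File out = new File(outDir, entry.getName());
                if (entry.isDirectory()) {
                    out.mkdirs();
                    continue;
                }
                out.getParentFile().mkdirs();
                try (FileOutputStream fos = new FileOutputStream(out)) {
                    // tarInput is positioned at the current entry's data
                    IOUtils.copy(tarInput, fos);
                }
            }
        }
    }
}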
OK, I finally figured this out; here is my code in case it helps anyone in the future.
It's written in Java, using the Apache Commons IO and Compress libraries.
File dir = new File("directory/of/.tar.gz/files/here");
File listDir[] = dir.listFiles();
if (listDir.length!=0){
for (File i:listDir){
/* Warning! this will try and extract all files in the directory
if other files exist, a for loop needs to go here to check that
the file (i) is an archive file before proceeding */
if (i.isDirectory()){
break;
}
String fileName = i.toString();
String tarFileName = fileName +".tar";
FileInputStream instream= new FileInputStream(fileName);
GZIPInputStream ginstream =new GZIPInputStream(instream);
FileOutputStream outstream = new FileOutputStream(tarFileName);
byte[] buf = new byte[1024];
int len;
while ((len = ginstream.read(buf)) > 0)
{
outstream.write(buf, 0, len);
}
ginstream.close();
outstream.close();
//There should now be tar files in the directory
//extract specific files from tar
TarArchiveInputStream myTarFile=new TarArchiveInputStream(new FileInputStream(tarFileName));
TarArchiveEntry entry = null;
int offset;
FileOutputStream outputFile=null;
//read every single entry in TAR file
while ((entry = myTarFile.getNextTarEntry()) != null) {
//the following two lines remove the .tar.gz extension for the folder name
String fileName = i.getName().substring(0, i.getName().lastIndexOf('.'));
fileName = fileName.substring(0, fileName.lastIndexOf('.'));
File outputDir = new File(i.getParent() + "/" + fileName + "/" + entry.getName());
if(! outputDir.getParentFile().exists()){
outputDir.getParentFile().mkdirs();
}
//if the entry in the tar is a directory, it needs to be created, only files can be extracted
if(entry.isDirectory){
outputDir.mkdirs();
}else{
byte[] content = new byte[(int) entry.getSize()];
offset=0;
myTarFile.read(content, offset, content.length - offset);
outputFile=new FileOutputStream(outputDir);
IOUtils.write(content,outputFile);
outputFile.close();
}
}
//close and delete the tar files, leaving the original .tar.gz and the extracted folders
myTarFile.close();
File tarFile = new File(tarFileName);
tarFile.delete();
}
}

java.util.zip.ZipException: too many entries in ZIP file

I am trying to write a Java class to extract a large zip file containing ~74,000 XML files. I get the following exception when attempting to unzip it using the java.util.zip library:
java.util.zip.ZipException: too many entries in ZIP file
Unfortunately, due to project requirements, I cannot have the zip broken down before it gets to me, and the unzipping process has to be automated (no manual steps). Is there any way to get around this limitation using java.util.zip or some third-party Java zip library?
Thanks.
Using ZipInputStream instead of ZipFile should probably do it.
Using Apache Commons IO's IOUtils:
FileInputStream fin = new FileInputStream(zip);
ZipInputStream zin = new ZipInputStream(fin);
ZipEntry ze = null;
while ((ze = zin.getNextEntry()) != null) {
    FileOutputStream fout = new FileOutputStream(new File(outputDirectory, ze.getName()));
    IOUtils.copy(zin, fout);
    IOUtils.closeQuietly(fout);
    zin.closeEntry();
}
IOUtils.closeQuietly(zin);
The original ZIP format stores the entry count in a 16-bit field, so it supports a maximum of 65,535 entries per archive.
Unless the Java library supports the ZIP64 extensions, it won't work properly if you are trying to read or write an archive with 74,000 entries.
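For example, Apache Commons Compress ships its own ZipFile class with ZIP64 support; a minimal sketch (the class name and paths here are illustrative):

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipFile;

public class Zip64Read {
    public static void listEntries(File archive) throws IOException {
        // Commons Compress's ZipFile understands ZIP64, so >65,535 entries are fine
        try (ZipFile zipFile = new ZipFile(archive)) {
            Enumeration<ZipArchiveEntry> entries = zipFile.getEntries();
            while (entries.hasMoreElements()) {
                ZipArchiveEntry entry = entries.nextElement();
                try (InputStream in = zipFile.getInputStream(entry)) {
                    // read the entry's content from 'in' as needed
                    System.out.println(entry.getName() + " (" + entry.getSize() + " bytes)");
                }
            }
        }
    }
}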
I reworked the method to deal with directory structures more conveniently and to zip a whole bunch of targets at once.
Plain files will be added to the root of the zip file; if you pass a directory, the underlying structure will be preserved.
def zip(String zipFile, String[] filesToZip) {
    def result = new ZipOutputStream(new FileOutputStream(zipFile))
    result.withStream { zipOutStream ->
        filesToZip.each { fileToZip ->
            def ftz = new File(fileToZip)
            if (ftz.isDirectory()) {
                def pathlength = new File(ftz.absolutePath).parentFile.absolutePath.size()
                ftz.eachFileRecurse { f ->
                    if (!f.isDirectory()) writeZipEntry(f, zipOutStream, f.absolutePath[pathlength..-1])
                }
            }
            else writeZipEntry(ftz, zipOutStream, '')
        }
    }
}

def writeZipEntry(File plainFile, ZipOutputStream zipOutStream, String path) {
    zipOutStream.putNextEntry(new ZipEntry(path + plainFile.name))
    new FileInputStream(plainFile).withStream { inStream ->
        def buffer = new byte[1024]
        def count
        while ((count = inStream.read(buffer, 0, 1024)) != -1)
            zipOutStream.write(buffer, 0, count)   // write only the bytes actually read
    }
    zipOutStream.closeEntry()
}
