Combining all text files in a folder into a single file - java

How can I combine all txt files in a folder into a single file? A folder usually contains hundreds to thousands of txt files.
If this program were only to be run on Windows machines, I would just go with a batch file containing something like
copy /b *.txt merged.txt
But that is not the case, so I figured it might be easier to just write it in Java to complement everything else we have.
I have written something like this
// Retrieves a list of files from the specified folder with the filter applied
File[] files = Utils.filterFiles(downloadFolder + folder, ".*\\.txt");
try
{
// savePath is the path of the output file
FileOutputStream outFile = new FileOutputStream(savePath);
for (File file : files)
{
FileInputStream inFile = new FileInputStream(file);
Integer b = null;
while ((b = inFile.read()) != -1)
outFile.write(b);
inFile.close();
}
outFile.close();
}
catch (Exception e)
{
e.printStackTrace();
}
But it takes several minutes to combine thousands of files so it is not feasible.

Use NIO; it is much easier than using input/output streams. Note: this uses Guava's Closer, which means all resources are safely closed; even better would be to use Java 7 and try-with-resources.
final Closer closer = Closer.create();
final RandomAccessFile outFile;
final FileChannel outChannel;
try {
outFile = closer.register(new RandomAccessFile(dstFile, "rw"));
outChannel = closer.register(outFile.getChannel());
for (final File file: filesToCopy)
doWrite(outChannel, file);
} finally {
closer.close();
}
// doWrite method
private static void doWrite(final WritableByteChannel channel, final File file)
throws IOException
{
final Closer closer = Closer.create();
final RandomAccessFile inFile;
final FileChannel inChannel;
try {
inFile = closer.register(new RandomAccessFile(file, "r"));
inChannel = closer.register(inFile.getChannel());
inChannel.transferTo(0, inChannel.size(), channel);
} finally {
closer.close();
}
}
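
For reference, here is a minimal sketch of the same idea with Java 7 try-with-resources instead of Guava's Closer. The class and method names (ChannelMerge, merge) are made up for illustration, and the loop around transferTo is an addition, since a single call is allowed to transfer fewer bytes than requested.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Hypothetical helper, not part of the answer above.
final class ChannelMerge {

    // Appends every source file, in list order, to dst.
    // dst is created if missing and truncated if it already exists.
    static void merge(List<Path> sources, Path dst) throws IOException {
        try (FileChannel out = FileChannel.open(dst,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE)) {
            for (Path src : sources) {
                try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ)) {
                    long size = in.size();
                    long position = 0;
                    // transferTo may move fewer bytes than asked for, so repeat until done
                    while (position < size) {
                        position += in.transferTo(position, size - position, out);
                    }
                }
            }
        }
    }
}
```

Both channels are closed automatically when their try blocks end, even if a transfer throws.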

Because of this
Integer b = null;
while ((b = inFile.read()) != -1)
outFile.write(b);
Your OS is making a lot of IO calls. read() only reads one byte of data. Use the other read methods that accept a byte[]. You can then use that byte[] to write to your OutputStream. Similarly write(int) does an IO call writing a single byte. Change that too.
Of course, you can look into tools that do this for you, like Apache Commons IO or even the Java 7 NIO package.
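
A rough sketch of what that change could look like, assuming plain java.io streams and an 8 KB buffer (the buffer size and the BufferedMerge name are arbitrary choices, not something from the question):

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper showing block reads and writes instead of single bytes.
final class BufferedMerge {

    // Writes the contents of each file, in array order, into target.
    static void merge(File[] files, File target) throws IOException {
        byte[] buffer = new byte[8192];
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(target))) {
            for (File file : files) {
                try (InputStream in = new FileInputStream(file)) {
                    int n;
                    // read(byte[]) returns how many bytes were read, or -1 at end of file
                    while ((n = in.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                }
            }
        }
    }
}
```

Each read call now asks for up to 8192 bytes at a time rather than one byte, which is where most of the speedup should come from.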

Try using BufferedReader and BufferedWriter instead of writing bytes one by one.

You can use IoUtils to merge files; the IoUtils.copy() method will help with the merging.
This link may be useful merging file in java

I would do it this way:
Check the OS with
System.getProperty("os.name")
and run the system-level command from Java.
On Windows:
copy /b *.txt merged.txt
On Unix:
cat *.txt > merged.txt
or whatever the best available system-level command is.
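
A hedged sketch of how that could be wired up with ProcessBuilder. The ShellMerge class and its method names are invented for illustration; the commands are the two from the answer, wrapped in cmd /c or sh -c because the wildcard and the redirect need a shell to interpret them.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; only the two commands themselves come from the answer.
final class ShellMerge {

    // Picks the merge command for the given os.name value.
    static List<String> mergeCommand(String osName) {
        if (osName.toLowerCase(Locale.ROOT).startsWith("windows")) {
            return Arrays.asList("cmd", "/c", "copy", "/b", "*.txt", "merged.txt");
        }
        // Assumes a POSIX-style sh is available on every non-Windows system
        return Arrays.asList("sh", "-c", "cat *.txt > merged.txt");
    }

    // Runs the command inside the given folder and returns the process exit code.
    static int run(File folder) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(mergeCommand(System.getProperty("os.name")));
        builder.directory(folder);
        builder.inheritIO();
        return builder.start().waitFor();
    }
}
```

Calling ShellMerge.run(new File("some/folder")) would then leave merged.txt in that folder; a non-zero return value means the command reported a failure.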

Related

Spark Reading .7z files

I am trying to read .7z files in Spark using Scala or Java, but I can't find any appropriate methods or functionality.
For zip files I am able to read them, because the ZipInputStream class takes an input stream, but for 7z files the SevenZFile class doesn't take an input stream.
https://commons.apache.org/proper/commons-compress/javadocs/api-1.16/org/apache/commons/compress/archivers/sevenz/SevenZFile.html
Zip file code
spark.sparkContext.binaryFiles("fileName").flatMap{case (name: String, content: PortableDataStream) =>
val zis = new ZipInputStream(content.open)
Stream.continually(zis.getNextEntry)
.takeWhile(_ != null)
.flatMap { _ =>
val br = new BufferedReader(new InputStreamReader(zis))
Stream.continually(br.readLine()).takeWhile(_ != null)
}}
I am trying similar code for the 7z files something like
spark.sparkContext.binaryFiles("filename").flatMap{case (name: String, content: PortableDataStream) =>
val zis = new SevenZFile(content.open)
Stream.continually(zis.getNextEntry)
.takeWhile(_ != null)
.flatMap { _ =>
val br = new BufferedReader(new InputStreamReader(zis))
Stream.continually(br.readLine()).takeWhile(_ != null)
}}
But SevenZFile doesn't accept these formats. Looking for ideas.
If the file is on the local filesystem the following solution works, but my file is in HDFS.
Local filesystem code
public static void decompress(String in, File destination) throws IOException {
SevenZFile sevenZFile = new SevenZFile(new File(in));
SevenZArchiveEntry entry;
while ((entry = sevenZFile.getNextEntry()) != null){
if (entry.isDirectory()){
continue;
}
File curfile = new File(destination, entry.getName());
File parent = curfile.getParentFile();
if (!parent.exists()) {
parent.mkdirs();
}
FileOutputStream out = new FileOutputStream(curfile);
byte[] content = new byte[(int) entry.getSize()];
sevenZFile.read(content, 0, content.length);
out.write(content);
out.close();
}
}
After all these years of Spark evolution there should be an easy way to do it.
Instead of using the java.io.File-based approach, you could try the SeekableByteChannel method as shown in this alternative constructor.
You can use a SeekableInMemoryByteChannel to read a byte array. So as long as you can pick up the 7zip files from S3 or whatever and hand them off as byte arrays you should be alright.
With all of that said, Spark is really not well-suited for processing things like zip and 7zip files. I can tell you from personal experience I've seen it fail badly once the files are too large for Spark's executors to handle.
Something like Apache NiFi will work much better for expanding archives and processing them. FWIW, I'm currently handling a large data dump that has me frequently dealing with 50GB tarballs that have several million files in them, and NiFi handles them very gracefully.

How to make a copy of a file containing images and text using java

I have some Word documents and Excel sheets which contain images along with the text content. I want to create a copy of such a file and keep it at a specific location. I tried the following method, which creates a file at the specified location, but the file is corrupted and cannot be read.
InputStream document = Thread.currentThread().getContextClassLoader().getResourceAsStream("upgradeworkbench/Resources/Upgrade_TD_Template.docx");
try {
OutputStream outStream = null;
Stage stage = new Stage();
stage.setTitle("Save");
byte[] buffer= new byte[document.available()];
document.read(buffer);
FileChooser fileChooser = new FileChooser();
fileChooser.setInitialFileName(initialFileName);
if (flag) {
fileChooser.getExtensionFilters().addAll(new FileChooser.ExtensionFilter("Microsoft Excel Worksheet", "*.xls"));
} else {
fileChooser.getExtensionFilters().addAll(new FileChooser.ExtensionFilter("Microsoft Word Document", "*.docx"));
}
fileChooser.setTitle("Save File");
File file = fileChooser.showSaveDialog(stage);
if (file != null) {
outStream = new FileOutputStream(file);
outStream.write(buffer);
// IOUtils.copy(document, outStream);
}
} catch (IOException ex) {
System.out.println(ex.getMessage());
}
Can anyone suggest a different way to get a proper file?
PS: I am reading the file using InputStream because it is inside the project jar.
PPS: I also tried Files.copy() but it didn't work.
I suggest you never rely on InputStream.available() to know the real size of the input, because it only returns an estimate of the number of bytes that can be read immediately without blocking. It might return a small number, but that doesn't mean the file is small, only that the buffer is temporarily partly filled.
The right algorithm to read an InputStream fully and write it over an OutputStream is this:
int n;
byte[] buffer=new byte[4096];
do
{
n=input.read(buffer);
if (n>0)
{
output.write(buffer, 0, n);
}
}
while (n>=0);
You can use the Files.copy() methods.
Copies all bytes from an input stream to a file. On return, the input stream will be at end of stream.
Use:
Files.copy(document, file.toPath(), StandardCopyOption.REPLACE_EXISTING);
As the class says, the second argument is a Path, not a File.
Generally, since this is 2015, use Path and drop File; if an API still uses File, make it so that it uses it at the last possible moment and use Path all the way.
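
To make that concrete, here is a small sketch that wraps the Files.copy() call for an arbitrary InputStream. The ResourceSaver name and the null check are additions for illustration; getResourceAsStream returns null when the resource is not found, which is worth ruling out before blaming the copy.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper around Files.copy(InputStream, Path, CopyOption...).
final class ResourceSaver {

    // Copies the whole stream to target, replacing any existing file,
    // and returns the number of bytes written. The stream is closed afterwards.
    static long save(InputStream source, Path target) throws IOException {
        if (source == null) {
            // e.g. getResourceAsStream(...) did not find the resource
            throw new FileNotFoundException("source stream is null");
        }
        try (InputStream in = source) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

In the question's code this would be called as ResourceSaver.save(document, file.toPath()) once the FileChooser has returned a non-null file.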

Combining compressed Gzipped Text Files using Java

My question might not be entirely related to Java, but I'm currently seeking a method to combine several compressed (gzipped) text files without having to recompress them manually. Let's say I have 4 files, all text compressed using gzip, and I want to combine these into one single *.gz file without decompressing and recompressing them. My current method is to open an InputStream and parse the file line by line, storing it in a GZIPOutputStream, which works but isn't very fast. I could of course also call
zcat file1 file2 file3 | gzip -c > output_all_four.gz
This would work too, but isn't really fast either.
My idea would be to copy the inputstream and write it to outputstream directly without "parsing" the stream, as I don't need to manipulate anything actually. Is something like this possible?
Find below a simple solution in Java (it does the same as my cat ... example). Any buffering of the input/output streams has been omitted to keep the code slim.
public class ConcatFiles {
public static void main(String[] args) throws IOException {
// concatenate the single gzip files to one gzip file
try (InputStream isOne = new FileInputStream("file1.gz");
InputStream isTwo = new FileInputStream("file2.gz");
InputStream isThree = new FileInputStream("file3.gz");
SequenceInputStream sis = new SequenceInputStream(new SequenceInputStream(isOne, isTwo), isThree);
OutputStream bos = new FileOutputStream("output_all_three.gz")) {
byte[] buffer = new byte[8192];
int intsRead;
while ((intsRead = sis.read(buffer)) != -1) {
bos.write(buffer, 0, intsRead);
}
bos.flush();
}
// gunzip the combined gzip file; the output contains the
// concatenated contents of the individual uncompressed files
try (GZIPInputStream gzipis = new GZIPInputStream(new FileInputStream("output_all_three.gz"));
OutputStream bos = new FileOutputStream("output_all_three")) {
byte[] buffer = new byte[8192];
int intsRead;
while ((intsRead = gzipis.read(buffer)) != -1) {
bos.write(buffer, 0, intsRead);
}
bos.flush();
}
}
}
The above method works if you just need to combine many zipped files. In my case I had written a web servlet and my response was 20-30 KB, so I was sending a zipped response.
I tried to zip all my individual JS files once at server start and then add dynamic code at runtime using the above method. I could print the entire response in my log file, but Chrome was able to unzip only the first file. The rest of the output came through as raw bytes.
After some research I found out that this is not possible with Chrome, and they have closed the bug without fixing it.
https://bugs.chromium.org/p/chromium/issues/detail?id=20884
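
On the Java side, a quick in-memory check of the concatenation idea looks like the sketch below. All names here (GzipConcat and its methods) are made up for the example; it only relies on GZIPInputStream in current JDKs reading past the end of one gzip member into the next.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo: gzip members are joined byte for byte, never recompressed.
final class GzipConcat {

    // Compresses a string into a single gzip member.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    // Joins already-compressed parts without touching their contents.
    static byte[] concat(byte[]... parts) {
        ByteArrayOutputStream joined = new ByteArrayOutputStream();
        for (byte[] part : parts) {
            joined.write(part, 0, part.length);
        }
        return joined.toByteArray();
    }

    // Decompresses everything GZIPInputStream can read from the data.
    static String gunzip(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = gz.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Decompressing concat(gzip("one\n"), gzip("two\n")) with gunzip gives back both lines, so the limitation described above is on the consuming side rather than in the combined file.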

How to create a copy of a file in the same directory in java?

I want all the features of file.renameTo in java but without the source file getting deleted.
For example:
Say I have a file report.doc and I want to create the file report.xml without report.doc getting deleted. Also, the contents of both the files should be the same. (A simple copy)
How do I go about doing this?
I know this might be trivial but some basic searching didn't help.
For filesystem operations Apache Commons IO provides useful shortcuts.
See:
FileUtils.html
FileUtils.html#copyFile
You can create a new File with the same content of the original.
Use Java NIO (Java 1.4 or later) for that:
private static void copy(File source, File destination) throws IOException {
long length = source.length();
FileChannel input = new FileInputStream(source).getChannel();
try {
FileChannel output = new FileOutputStream(destination).getChannel();
try {
for (long position = 0; position < length; ) {
position += input.transferTo(position, length-position, output);
}
} finally {
output.close();
}
} finally {
input.close();
}
}
see the answers for this question for more
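
On Java 7 or later the same thing can be done with java.nio.file.Files. A minimal sketch, assuming the copy should sit next to the original and silently replace an existing file of that name (SiblingCopy and copyAs are names invented for this example):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper built on Files.copy(Path, Path, CopyOption...).
final class SiblingCopy {

    // Copies source to a file called newName in the same directory
    // and returns the path of the copy. The source file is left untouched.
    static Path copyAs(Path source, String newName) throws IOException {
        Path target = source.resolveSibling(newName);
        return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

For the example in the question, SiblingCopy.copyAs(Paths.get("report.doc"), "report.xml") would create report.xml beside report.doc with identical bytes.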

Creating a file is causing problem, the File.getPath() doesn't seem to work

I am trying to create a back up file for an html file on a web server.
I want the backup to be in the same location as the existing file (it's a quick fix). I want to create the file using File file = new File(PathName);
public void backUpOldPage(String oldContent) throws IOException{
// this.uri is a class variable with the path of the file to be backed up
String fileName = new File(this.uri).getName();
String pathName = new File(this.uri).getPath();
System.out.println(pathName);
String bckPath = pathName+"\\"+bckName;
FileOutputStream fout;
try
{
// Open an output stream
fout = new FileOutputStream (bckFile);
fout.close();
}
// Catches any error conditions
catch (IOException e)
{
System.err.println ("Unable to write to file");
System.exit(-1);
}
}
But if instead I was to set bckPath like this, it will work.
String bckPath = "C://dev/server/tomcat6/webapps/sample-site/index_sdjf---sd.html";
I am working on Windows, not sure if that makes a difference.
The result of String bckPath = pathName+"\\"+bckName;
is bckPath = C:\dev\server\tomcat6\webapps\sample-site\filename.html - this doesn't result in a new file.
Use File.separator; that way you don't need to worry about which OS you are using. (File.pathSeparator is a different constant: it separates the entries in a list of paths, for example ; on Windows.)
Try to use File.getCanonicalPath() instead of plain getPath(). This helps if the original path is not fully specified.
Regarding slashes, / or \ or File.separator is not causing the problem, because they all work as separators in Java on Windows. (Also, you do not define bckFile in your code, only bckPath. Use getCanonicalPath() on the newly created bckPath as well.)
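
One more thing worth checking: File.getPath() returns the whole path including the file name, so appending another name to it points "inside" the existing file rather than next to it. Assuming this.uri refers to the HTML file itself, a sketch like the following builds the backup path from the parent directory instead and avoids hand-written separators altogether. BackupPath, backupFor and writeBackup are hypothetical names, and writing the content as UTF-8 is an assumption.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not the asker's class.
final class BackupPath {

    // Returns a File named backupName in the same directory as original.
    // Assumes original is a regular file, so its absolute form has a parent.
    static File backupFor(File original, String backupName) {
        File parent = original.getAbsoluteFile().getParentFile();
        // File(parent, child) inserts the right separator for the current OS
        return new File(parent, backupName);
    }

    // Writes oldContent into the backup file and returns that file.
    static File writeBackup(File original, String backupName, String oldContent) throws IOException {
        File backup = backupFor(original, backupName);
        try (Writer writer = new OutputStreamWriter(new FileOutputStream(backup), StandardCharsets.UTF_8)) {
            writer.write(oldContent);
        }
        return backup;
    }
}
```

With this.uri pointing at index.html, BackupPath.writeBackup(new File(this.uri), "index_backup.html", oldContent) would put the backup in the same folder.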
