Split HDFS files into multiple local files using Java

I have to copy HDFS files into the local file system using Java code and split them into multiple parts before writing to disk. The files are compressed using snappy/lzo. I have used BufferedReader and FileWriter to read and write the file, but this operation is very slow: 20 minutes for a 30 GB file. I can dump the file using hadoop fs -text in 2 minutes (but cannot split it). Is there anything else that I can do to speed up the operation?

Since I had to make two passes, one to get the line count and one to do the split, and hadoop fs -text was CPU-intensive, I took the approach below:
1) Use a line-count Java program, run as a MapReduce job, to get the number of lines in the file. Dividing that by the total number of output files I need gives the number of lines to write to each file.
2) Use the code mentioned in this link together with hadoop fs -text (a rough Java sketch of the splitting step follows the link):
https://superuser.com/a/485602/220236
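Once the lines-per-file figure from step 1 is known, the split itself can also be done in one pass of plain Java by letting Hadoop's codec classes do the decompression, which avoids piping everything through hadoop fs -text. This is only a rough sketch; the input path, output naming and the lines-per-file value are made-up placeholders (and the LZO codec has to be on the classpath for .lzo input):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class HdfsSplitter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path src = new Path("/user/hadoop/input/data.snappy"); // placeholder input path
        long linesPerFile = 1_000_000L;                        // from step 1: total lines / number of parts

        // Pick the codec from the file extension, the same way hadoop fs -text does.
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(src);
        InputStream in = (codec == null) ? fs.open(src) : codec.createInputStream(fs.open(src));

        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            int part = 0;
            long written = 0;
            BufferedWriter writer = new BufferedWriter(new FileWriter("part-" + part + ".txt"));
            String line;
            while ((line = reader.readLine()) != null) {
                if (written == linesPerFile) {                 // roll over to the next local part file
                    writer.close();
                    part++;
                    written = 0;
                    writer = new BufferedWriter(new FileWriter("part-" + part + ".txt"));
                }
                writer.write(line);
                writer.newLine();
                written++;
            }
            writer.close();
        }
    }
}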
Hope it helps someone else.

Related

If I have multiple Java Flight Recorder files (.jfr), how do I merge them so I can view them in Java Mission Control at one time?

I received a bunch of JFR files for analysis, around 170, and opening them one by one takes too long. Can I put them into one single file, given that I only have the files and cannot reconfigure the JVM to obtain the JFR recording again?
For example:
2020_03_30_20_37_01_2333_0.jfr
2020_03_30_21_37_01_2333_0.jfr
2020_03_30_22_37_01_2333_0.jfr
2020_03_30_23_37_01_2333_0.jfr
.
.
.
N
You can use the 'jfr' tool located in JDK_HOME/bin, available from JDK 11.0.6 or later:
$ jfr assemble <repository> <file>
where repository is the directory where the files are located and file is the name of the recording file (.jfr) to create.
A recording file is just a concatenation of chunk files, so you could do it in the shell as well. For example, using the copy /b command on Windows:
$ copy /b 1.jfr + 2.jfr + 3.jfr combined.jfr
Here is a Windows PowerShell script to concatenate the JFR files. Just keep an eye on the size of the final concatenated file; if it turns out too large, you may want to reduce the number of files you concatenate:
$Location = ""                       # not used below; the script works on the current directory
$outputConcatenatedJFR = "all_2333_1.jfr"
# Collect the chunk files in name (i.e. timestamp) order; adjust the filter to match your file names
$items = Get-ChildItem -Path . -Filter *2333_1.jfr | Sort-Object -Property Name
New-Item $outputConcatenatedJFR -ItemType file
ForEach ($item in $items) {
    Write-Host "Processing file - " $item
    # Append the next chunk to the combined recording using cmd's binary copy
    cmd /c copy /b $outputConcatenatedJFR+$item $outputConcatenatedJFR
}
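To sanity-check that the chunks were concatenated into a readable recording before opening it in Mission Control, the jdk.jfr consumer API (JDK 11+) can iterate over the events. A minimal sketch, assuming the combined file name produced by the script above:

import java.nio.file.Paths;
import jdk.jfr.consumer.RecordingFile;

public class CheckCombinedJfr {
    public static void main(String[] args) throws Exception {
        long count = 0;
        // Walk every event in the concatenated recording; if the chunks were joined
        // correctly this completes without errors.
        try (RecordingFile rf = new RecordingFile(Paths.get("all_2333_1.jfr"))) {
            while (rf.hasMoreEvents()) {
                rf.readEvent();
                count++;
            }
        }
        System.out.println("Events in combined recording: " + count);
    }
}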

Copying a file inside a same hdfs using FileUtil API is taking too much time

I have one HDFS cluster, and I'm executing my program from my local system to perform a copy within that same HDFS system.
Like: hadoop fs -cp /user/hadoop/SrcFile /user/hadoop/TgtFile
I'm using:
FileUtil.copy(FileSystem srcFS,
              FileStatus srcStatus,
              FileSystem dstFS,
              Path dst,
              boolean deleteSource,
              boolean overwrite,
              Configuration conf)
But something weird is happening: when I do the copy from the command line it takes just a moment, but when I do it programmatically it takes 10-15 minutes to copy a 190 MB file.
To me it looks like the data is being streamed via my local system instead of being copied directly, even though the destination is on the same filesystem as the source.
Correct me if I'm wrong, and please help me find the best solution.
You are right: with FileUtil.copy the stream is passed through your program (src --> your program --> dst). If Hadoop's filesystem shell (hadoop dfs -cp) is faster, you can invoke the same command from Java through Runtime.exec(cmd).
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java
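A minimal sketch of that suggestion, shelling out to the Hadoop CLI so the copy happens inside the cluster instead of streaming through the client JVM. It uses ProcessBuilder for easier output handling; the paths and the assumption that the hadoop binary is on the PATH are illustrative:

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class HdfsCpViaShell {
    public static void main(String[] args) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(
                "hadoop", "fs", "-cp", "/user/hadoop/SrcFile", "/user/hadoop/TgtFile");
        pb.redirectErrorStream(true);                     // merge stderr into stdout
        Process p = pb.start();
        try (BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                System.out.println(line);                 // surface the CLI output for debugging
            }
        }
        int exit = p.waitFor();
        if (exit != 0) {
            throw new IllegalStateException("hadoop fs -cp exited with code " + exit);
        }
    }
}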

Flume java.lang.IllegalStateException: File has changed size since being read

I have a Java application that gathers data from different sources and writes the output into files under a specific directory.
I also have a Flume agent configured with a spooldir source to read from that directory and write the output to Solr using MorphlineSolrSink.
The Flume agent throws the following exception:
java.lang.IllegalStateException: File has changed size since being read
Here is the configuration of the Flume agent:
agent02.sources = s1
agent02.sinks = solrSink
agent02.channels = ch1
agent02.channels.ch1.type = file
agent02.channels.ch1.checkpointDir = /home/flume/prod_solr_chkpoint/file-channel/checkpoint
agent02.channels.ch1.dataDirs = /home/flume/prod_solr_chkpoint/file-channel/data
agent02.sources.s1.type = spooldir
agent02.sources.s1.channels = ch1
agent02.sources.s1.spoolDir = /DataCollection/json_output/solr/
agent02.sources.s1.deserializer.maxLineLength = 100000
agent02.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent02.sinks.solrSink.channel = ch1
agent02.sinks.solrSink.batchSize = 10000
agent02.sinks.solrSink.batchDurationMillis = 10000
agent02.sinks.solrSink.morphlineFile = morphlines.conf
agent02.sinks.solrSink.morphlineId = morphline
What I understand from the exception is that the Flume agent starts working on a file before the Java application has finished writing it.
How can I fix this problem?
Edit
I don't know whether this information is valuable or not.
These configurations were working before without any problem. We had a hard disk failure on the machine we run Flume from, and after recovering from that failure Flume started throwing this exception.
As stated in the documentation regarding the Spooling Directory Source:
In exchange for this reliability, only immutable, uniquely-named files
must be dropped into the spooling directory. Flume tries to detect
these problem conditions and will fail loudly if they are violated:
If a file is written to after being placed into the spooling directory, Flume will print an error to its log file and stop
processing.
If a file name is reused at a later time, Flume will print an error to its log file and stop processing.
I suggest that your Java application dump buckets of data into temporary files, naming them with their creation timestamp. Once a bucket is full (i.e. a certain size is reached), move the file into the spooling directory.
Write the source files to another directory, then move (mv command) them into the spool source directory; that should work. Don't use the copy command.
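Both answers boil down to the same write-then-move pattern on the producer side. A minimal sketch of it; the staging directory and the file-naming scheme are placeholder assumptions you would adapt:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.List;

public class SpoolDirWriter {
    private static final Path STAGING = Paths.get("/DataCollection/json_output/staging"); // placeholder
    private static final Path SPOOL   = Paths.get("/DataCollection/json_output/solr");

    public static void writeBucket(List<String> lines) throws Exception {
        // Unique, timestamped name so Flume never sees the same file name twice.
        String name = "bucket-" + System.currentTimeMillis() + ".json";
        Path tmp = STAGING.resolve(name);

        // Write the whole bucket outside the spooling directory first.
        Files.write(tmp, lines, StandardCharsets.UTF_8);

        // Move (not copy) into the spooling directory; on the same filesystem this is a
        // rename, so Flume only ever sees a complete, immutable file.
        Files.move(tmp, SPOOL.resolve(name), StandardCopyOption.ATOMIC_MOVE);
    }
}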

File not getting created during second MR

I have a Hadoop implementation of an algorithm, and I am running it from Eclipse.
When I run it in Eclipse, my algorithm works fine and creates the necessary files and output.
Algorithm
|
|___creates a file0.txt file.
|
|___creates a file1.txt file.
|
|___creates a file3.txt file.
|
|___creates a file4.txt file.
|
|___creates a file5.txt file.
|
|___creates a file6.txt file.
|
|___creates a file7.txt file.
Completes the job.
When I tried my program on the Hadoop cluster, every file except file0.txt fails to get created in HDFS from the reducer phase.
Has anyone run into this issue?
Please help.
Source
Output from eclipse
Output from cluster
The output file is specified by the Driver code, irrespective of the MR job. Please check your Driver code or share it here.
Your question is slightly confusing. All I understand is that you have a 413-byte file and you are trying to run 7 MR jobs.
So, are you saying you have 7 pairs of Mapper and Reducer classes that you want to run on that 413-byte file?
You also mentioned that your algorithm runs different MR jobs depending upon the data sets, so I'm left to assume that a dataset is used by only one Mapper-Reducer pair. Did you verify that your dataset satisfies the conditions for Mapper-Reducer pairs 1, 3, 4, 5, 6 and 7?
Are all these Mapper-Reducer pairs using the same output folder? That might also be a big concern.
Please answer these points, and then possibly I can help.
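For reference, a bare-bones sketch of a driver that chains two jobs and points each at its own output directory, which is the usual way to avoid the shared-output-folder problem raised above. The job names and paths are placeholders, and the default identity mapper/reducer stand in for the real classes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path out1 = new Path(args[1], "phase1");   // each job gets its own, not-yet-existing dir
        Path out2 = new Path(args[1], "phase2");

        Job job1 = Job.getInstance(conf, "phase-1");
        job1.setJarByClass(ChainedDriver.class);
        // job1.setMapperClass(...); job1.setReducerClass(...);   <-- your real classes go here
        job1.setOutputKeyClass(LongWritable.class);
        job1.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job1, input);
        FileOutputFormat.setOutputPath(job1, out1);
        if (!job1.waitForCompletion(true)) System.exit(1);

        Job job2 = Job.getInstance(conf, "phase-2");
        job2.setJarByClass(ChainedDriver.class);
        // job2.setMapperClass(...); job2.setReducerClass(...);   <-- your real classes go here
        job2.setOutputKeyClass(LongWritable.class);
        job2.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job2, out1);  // phase 2 consumes phase 1's output
        FileOutputFormat.setOutputPath(job2, out2);
        if (!job2.waitForCompletion(true)) System.exit(1);
    }
}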

How to redirect stdout and stderr to a rolling file with Unix redirection

I have a Java application which I am running on Unix from the command prompt.
I am redirecting stdout and stderr to console.out and console.err files.
The file size keeps increasing because a lot of information is being logged.
I want to create a rolling file once the current file grows above a particular size,
e.g. console1.out should get created once console.out exceeds 500 KB.
Currently I am using:
java MyAppName > logs/Console.out 2> logs/Console.err &
How can I do this?
Pipe the result to split like this:
java MyAppName | split -b500k - Console.log
This will create a new file every time you go over 500k. See the man-page for split for more details and options.
Or you can use rotatelogs:
nohup java MyAppName 2>&1 | rotatelogs -l Console_%Y-%m-%d.log 86400 &
This will create a new file, named with that day's date, every day.
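If the rotation really needs to be size-based (the 500 KB requirement above), rotatelogs also accepts a file size in place of the time interval. A hedged example, assuming the Apache httpd rotatelogs is on the PATH:
nohup java MyAppName 2>&1 | rotatelogs -l logs/Console.log 500K &
This starts a new file each time the current one reaches 500 KB; since the name contains no strftime pattern, rotatelogs appends a timestamp suffix to distinguish the files.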
