Save a Spark RDD using mapPartitions with iterator - Java

I have some intermediate data that I need to store both in HDFS and locally. I'm using Spark 1.6. In HDFS, the intermediate data ends up in /output/testDummy/part-00000 and /output/testDummy/part-00001. I want to save these partitions locally using Java/Scala, either merged into one file as /users/home/indexes/index.nt or separately as /users/home/indexes/index-0000.nt and /home/indexes/index-0001.nt.
Note: testDummy holds the same data as test; the output has two partitions. I want to store them locally as index.nt, either separately or combined, and I would prefer to store them separately on two data nodes. I'm running on a cluster and submit the Spark job on YARN. The comments in the code show how many times each line printed and what data I saw. How can I do this? Any help is appreciated.
Here is my code:
val testDummy = outputFlatMapTuples.coalesce(Constants.INITIAL_PARTITIONS).saveAsTextFile(outputFilePathForHDFS+"/testDummy")
println("testDummy done") //1 time print

def savesData(iterator: Iterator[(String)]): Iterator[(String)] = {
  println("Inside savesData") // now 4 times when coalesce(Constants.INITIAL_PARTITIONS)=2
  println("iter size"+iterator.size) // 2 735 2 735 values
  val filenamesWithExtension = outputPath + "/index.nt"
  println("filenamesWithExtension "+filenamesWithExtension.length) //4 times
  var list = List[(String)]()
  val fileWritter = new FileWriter(filenamesWithExtension,true)
  val bufferWritter = new BufferedWriter(fileWritter)
  while (iterator.hasNext){ //iterator.hasNext is false
    println("inside iterator") //0 times
    val dat = iterator.next()
    println("datadata "+iterator.next())
    bufferWritter.write(dat + "\n")
    bufferWritter.flush()
    println("index files written")
    val dataElements = dat.split(" ")
    println("dataElements") //0
    list = list.::(dataElements(0))
    list = list.::(dataElements(1))
    list = list.::(dataElements(2))
  }
  bufferWritter.close() //closing
  println("savesData method end") //4 times when coal=2
  list.iterator
}

println("before saving data into local") //1
val test = outputFlatMapTuples.coalesce(Constants.INITIAL_PARTITIONS).mapPartitions(savesData)
println("testRDD partitions "+test.getNumPartitions) //2
println("testRDD size "+test.collect().length) //0
println("after saving data into local") //1
PS: I followed this and this, but neither is exactly what I'm looking for. I got partway there, but nothing is written to index.nt.

A couple of things:
Never call Iterator.size if you plan to use the data later. Iterators are TraversableOnce. The only way to compute the size of an Iterator is to traverse all its elements, and after that there is no more data to be read.
Don't use transformations like mapPartitions for side effects. If you want to perform some type of IO, use actions like foreach / foreachPartition. Using a transformation for side effects is bad practice and doesn't guarantee that a given piece of code will be executed only once.
A local path inside an action or transformation is a local path on the particular worker that runs the task. If you want to write directly on the client machine, you should fetch the data first with collect or toLocalIterator. It may be better, though, to write to distributed storage and fetch the data later.
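The first point can be reproduced without Spark. The following is a minimal plain-Java sketch (the class and method names are made up for illustration, and Java's Iterator stands in for Scala's): counting by traversal leaves nothing to write, while copying the elements into a list first keeps them available.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SinglePassIterator {

    /** Counts elements by walking the iterator; this exhausts it. */
    static int countByTraversal(Iterator<String> it) {
        int n = 0;
        while (it.hasNext()) {
            it.next();
            n++;
        }
        return n;
    }

    /** Copies the remaining elements into a list so they can be counted and reused. */
    static List<String> materialize(Iterator<String> it) {
        List<String> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        Iterator<String> it = Arrays.asList("s1 p1 o1", "s2 p2 o2").iterator();
        System.out.println("count: " + countByTraversal(it));
        // The iterator is now exhausted, so a write loop over it would never run.
        System.out.println("hasNext after counting: " + it.hasNext());

        // Materializing first keeps the rows around for both counting and writing.
        List<String> rows = materialize(Arrays.asList("s1 p1 o1", "s2 p2 o2").iterator());
        System.out.println(rows.size() + " rows still available: " + rows);
    }
}
```

Materializing a whole partition costs memory, so for large partitions it is usually preferable to drop the size check and write while iterating.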

Java 7 provides means to watch directories.
https://docs.oracle.com/javase/tutorial/essential/io/notification.html
The idea is to create a watch service and register the directory of interest with it, naming the events you care about (file creation, deletion, and so on). You then wait on the service, and whenever one of those events occurs you are notified and can take whatever action you want.
You will need to rely on the Java HDFS API wherever applicable.
Run the program in the background, since it waits for events forever. (You can add logic to quit once you have done what you need.)
Alternatively, a shell script can do the job.
Be aware of the coherency model of the HDFS file system when reading files.
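As a rough illustration of the watch-service idea, here is a minimal sketch using java.nio.file.WatchService. It watches a local temporary directory only, since the default java.nio file system does not cover HDFS paths; the class name, method names, and the file name index-0000.nt are chosen for the example.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.concurrent.TimeUnit;

public class DirectoryWatchSketch {

    /** Waits up to timeoutSeconds for a creation event; returns the file name, or null. */
    static String awaitCreatedFile(WatchService watcher, long timeoutSeconds)
            throws InterruptedException {
        WatchKey key = watcher.poll(timeoutSeconds, TimeUnit.SECONDS);
        if (key == null) {
            return null; // timed out without any event
        }
        String created = null;
        for (WatchEvent<?> event : key.pollEvents()) {
            if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                // For creation events the context is the path relative to the watched directory.
                created = event.context().toString();
            }
        }
        key.reset(); // re-arm the key so further events can be delivered
        return created;
    }

    /** Registers a temporary directory, creates one file in it, and reports what was seen. */
    static String watchTempDirOnce() {
        try {
            Path dir = Files.createTempDirectory("watch-demo");
            try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
                dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
                Files.createFile(dir.resolve("index-0000.nt"));
                return awaitCreatedFile(watcher, 30);
            }
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("created: " + watchTempDirOnce());
    }
}
```

A long-running watcher would call awaitCreatedFile in a loop instead of once. On some platforms the service is implemented by polling, so notifications can arrive with a delay of several seconds.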
Hope this helps with some idea.

Related

Streaming data into delta lake, reading filtered results

My goal is to continuously load incoming parquet files into Delta Lake, run queries, and expose the results through a REST API.
All files are in s3 buckets.
//listen for changes
val df = spark.readStream().parquet("s3a://myBucket/folder")
//write changes to delta lake
df.writeStream()
.format("delta")
.option("checkpointLocation", "s3a://myBucket-processed/checkpoint")
.start("s3a://myBucket-processed/")
.awaitTermination() //this call lives in another thread (because it's blocking)
//this is a bad example
val query = df.select(convertedColumnNames)
query.show()
//another bad example:
spark.readStream().format("delta").load("s3a://myBucket-processed/").select(convertedColumnNames).show()
//org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
How can I get the filtered data out from delta lake?
Have you tried foreachBatch?
It hands you each micro-batch as a regular DataFrame, so batch-style operations become available in streaming, and it also gives you some control over the number of files you write into Delta Lake.

Spark: show dataframe content in logging (Java)

How can I show dataframe content (rows) in Java?
I tried log.info(df.showString()), but it prints unreadable characters. I would like to use df.collectAsList(), but I have to apply a filter afterwards, so I can't do this.
Thank you.
There are several options to log the data:
Collecting the data to the driver
You can call collectAsList() and continue processing afterwards. Spark datasets are immutable, so collecting them to the driver will trigger the execution, but you can re-use the dataset afterwards for further processing steps:
Dataset<Data> ds = ... //1
List<Data> collectedDs = ds.collectAsList(); //2
doSomeLogging(collectedDs);
ds = ds.filter(<filter condition>); //3
ds.show();
The code above will collect the data in line //2, log it and then continue processing in line //3.
Depending on how complex the creation of the dataset in line //1 was, you would want to cache the dataset, so that the processing in line //1 is run only once.
Dataset<Data> ds = ... //1
ds = ds.cache();
List<Data> collectedDs = ds.collectAsList(); //2
....
Using map
Calling collectAsList() will send all your data to the driver. Usually you use Spark in order to distribute the data over several executor nodes, so your driver might not be large enough to hold all of the data at the same time. In this case, you can log the data in a map call:
Dataset<Data> ds = ... //1
ds = ds.map(d -> {
System.out.println(d); //2
return d; //3
}, Encoders.bean(Data.class));
ds = ds.filter(<filter condition>);
ds.show();
In this example, line //2 does the logging and line //3 simply returns the original object, so that the dataset remains unchanged. I assume that the Data class comes with a readable toString() implementation. Otherwise, line //2 needs some more logic. It may also be helpful to use a logging library (like log4j) instead of writing directly to standard out.
In this second approach, the logs will not be written on the driver but on each executor. You would have to collect the logs after the Spark job has finished and combine them into one file.
If you have an untyped dataframe instead of a dataset like above, the same code would work. You would only have to operate directly on a Row object, using the getXXX methods to build the logging output instead of relying on the Data class.
All logging operations will have an impact on the performance of your code.

How to process multiple files separately after SparkContext.wholeTextFiles?

I'm trying to use wholeTextFiles to read all the file names in a folder and process the files one by one, separately. (For example, I'm trying to get the SVD vector of each data set, and there are 100 sets in total.) The data are saved in .txt files, with values separated by spaces and arranged in lines (like a matrix).
The problem I ran into is that after calling wholeTextFiles("path with all the text files"), it is really difficult to read and parse the data, and I can't use the same method I used when reading only one file. That method works fine for a single file and gives me the correct output. Could someone please let me know how to fix it here? Thanks!
public static void main (String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("whole text files").setMaster("local[2]").set("spark.executor.memory","1g");
    JavaSparkContext jsc = new JavaSparkContext(sparkConf);
    JavaPairRDD<String, String> fileNameContentsRDD = jsc.wholeTextFiles("/Users/peng/FMRITest/regionOutput/");
    JavaRDD<String[]> lineCounts = fileNameContentsRDD.map(new Function<Tuple2<String, String>, String[]>() {
        @Override
        public String[] call(Tuple2<String, String> fileNameContent) throws Exception {
            String content = fileNameContent._2();
            String[] sarray = content .split(" ");
            double[] values = new double[sarray.length];
            for (int i = 0; i< sarray.length; i++){
                values[i] = Double.parseDouble(sarray[i]);
            }
            pd.cache();
            RowMatrix mat = new RowMatrix(pd.rdd());
            SingularValueDecomposition<RowMatrix, Matrix> svd = mat.computeSVD(84, true, 1.0E-9d);
            Vector s = svd.s();
        }});
Quoting the scaladoc of SparkContext.wholeTextFiles:
wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)] Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
In other words, wholeTextFiles may simply not be what you want.
Since by design "Small files are preferred" (see the scaladoc), you could use mapPartitions or collect (with filter) to grab a subset of the files and apply the parsing to them.
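Because each file arrives as one string, the content has to be split into lines before it is split into fields; splitting only on " " leaves tokens that span a line break, which Double.parseDouble would likely reject. A minimal plain-Java sketch of that parsing step follows (the class and method names are made up, and the layout is assumed to be the one described in the question: whitespace-separated numbers, one matrix row per line).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MatrixTextParser {

    /** Parses whitespace-separated numbers, one matrix row per line. */
    static double[][] parse(String content) {
        List<double[]> rows = new ArrayList<>();
        // \R matches any line break (\n, \r\n, ...); requires Java 8 or later.
        for (String line : content.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines, e.g. a trailing newline
            }
            String[] fields = trimmed.split("\\s+");
            double[] row = new double[fields.length];
            for (int i = 0; i < fields.length; i++) {
                row[i] = Double.parseDouble(fields[i]);
            }
            rows.add(row);
        }
        return rows.toArray(new double[0][]);
    }

    public static void main(String[] args) {
        double[][] m = parse("1 2 3\n4 5 6\n");
        System.out.println(Arrays.deepToString(m));
    }
}
```

Inside the Spark job, the value of each (path, content) pair would be passed to a function like parse, and the resulting rows turned into vectors for that file's matrix.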
Once you have the files for each partition in hand, you could use Scala's Parallel Collection API to schedule Spark jobs so that they execute in parallel:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
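The threading pattern described in the quoted passage can be sketched in plain Java with an ExecutorService. In this illustration runJob is a hypothetical stand-in for one Spark action on one data set (here it only returns the length of the name), and the class and method names are made up; the point is only that each job is submitted from its own pool thread and the results are gathered in submission order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelJobSubmission {

    /** Hypothetical stand-in for one Spark action (e.g. computing the SVD of one file). */
    static int runJob(String dataSetName) {
        return dataSetName.length();
    }

    /** Submits one job per data set from a thread pool and collects results in order. */
    static List<Integer> runAll(List<String> dataSets, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String name : dataSets) {
                Callable<Integer> task = () -> runJob(name);
                futures.add(pool.submit(task));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get()); // blocks until that job has finished
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(Arrays.asList("region-01.txt", "region-02.txt"), 2));
    }
}
```

With the default FIFO scheduler described above, jobs submitted this way still compete for the same executors, so the speed-up depends on whether a single job leaves cluster resources idle.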

Hbase CopyTable inside Java

I want to copy one HBase table to another location with good performance.
I would like to reuse the code from CopyTable.java on the hbase-server GitHub page.
I've been looking at the HBase documentation, but it didn't help me much: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/CopyTable.html
After reading this Stack Overflow post: Can a main() method of class be invoked in another class in java
I think I can call it directly through its main method.
Question: Is there a better way to do this copy than using CopyTable from hbase-server? Do you see any drawbacks to using CopyTable?
First, a snapshot is a better approach than CopyTable.
HBase Snapshots allow you to take a snapshot of a table without too much impact on Region Servers. Snapshot, Clone and restore operations don't involve data copying. Also, Exporting the snapshot to another cluster doesn't have impact on the Region Servers.
Prior to version 0.94.6, the only way to backup or to clone a table is to use CopyTable/ExportTable, or to copy all the hfiles in HDFS after disabling the table. The disadvantages of these methods are that you can degrade region server performance (Copy/Export Table) or you need to disable the table, that means no reads or writes; and this is usually unacceptable.
A snapshot is not just a rename. If you perform several operations and want to be able to restore the table to one particular point in time, this is the right tool to use:
A snapshot is a set of metadata information that allows an admin to get back to a previous state of the table. A snapshot is not a copy of the table; it’s just a list of file names and doesn’t copy the data. A full snapshot restore means that you get back to the previous “table schema” and you get back your previous data losing any changes made since the snapshot was taken.
Also, see "Snapshots and Repeatable reads for HBase Tables" and "Snapshot Internals".
Another MapReduce approach besides CopyTable:
You can implement something like the code below. It is written for a standalone program; in your case you would write a MapReduce job that inserts multiple Put records as a batch (perhaps 100000 at a time).
This improved performance for standalone inserts through the HBase client; you can try the same idea in a MapReduce job.
public void addMultipleRecordsAtaShot(final ArrayList<Put> puts, final String tableName) throws Exception {
    try {
        final HTable table = new HTable(HBaseConnection.getHBaseConfiguration(), getTable(tableName));
        // Send the whole batch in one call.
        table.put(puts);
        LOG.info("INSERT record[s] " + puts.size() + " to table " + tableName + " OK.");
    } catch (final Throwable e) {
        e.printStackTrace();
    } finally {
        // Check for null before touching the list, then release the batch.
        if (puts != null) {
            LOG.info("Processed ---> " + puts.size());
            puts.clear();
        }
    }
}
Along with that, you can also consider the following.
Set the write buffer to a larger value than the default:
1) table.setAutoFlush(false)
2) Set the buffer size
<property>
  <name>hbase.client.write.buffer</name>
  <!-- you can double this for better performance: 2 x 20971520 = 41943040 -->
  <value>20971520</value>
</property>
OR
void setWriteBufferSize(long writeBufferSize) throws IOException
The buffer is only ever flushed on two occasions:
Explicit flush
Use the flushCommits() call to send the data to the servers for permanent storage.
Implicit flush
This is triggered when you call put() or setWriteBufferSize().
Both calls compare the currently used buffer size with the configured limit and optionally invoke the flushCommits() method.
In case the entire buffer is disabled, setting setAutoFlush(true) will force the client to call the flush method for every invocation of put().
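The explicit and implicit flush behaviour described above can be pictured with a small toy model. This is not the HBase client, only an illustrative plain-Java class whose method names mirror the text; the exact comparison HBase performs against the configured limit is an assumption here (this model flushes once the buffered bytes exceed the limit).

```java
import java.util.ArrayList;
import java.util.List;

public class WriteBufferModel {
    private final long limitBytes;
    private final List<byte[]> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    private int flushCount = 0;

    WriteBufferModel(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Buffers one record; triggers an implicit flush once the limit is exceeded. */
    void put(byte[] record) {
        buffer.add(record);
        bufferedBytes += record.length;
        if (bufferedBytes > limitBytes) {
            flushCommits();
        }
    }

    /** Explicit flush: hands everything buffered so far to the (imaginary) server. */
    void flushCommits() {
        if (buffer.isEmpty()) {
            return; // nothing to send
        }
        buffer.clear();
        bufferedBytes = 0;
        flushCount++;
    }

    int flushCount() {
        return flushCount;
    }

    public static void main(String[] args) {
        WriteBufferModel table = new WriteBufferModel(10);
        table.put(new byte[4]);
        table.put(new byte[4]);
        table.put(new byte[4]); // 12 bytes buffered, over the limit: implicit flush
        table.put(new byte[1]);
        table.flushCommits();   // explicit flush of the remaining record
        System.out.println("flushes: " + table.flushCount());
    }
}
```

The model shows why a larger buffer means fewer round trips: the number of flushes drops as the limit grows, at the cost of more unsent data held on the client.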

Is it possible to execute a command on all workers within Apache Spark?

I have a situation where I want to execute a system process on each worker within Spark. I want this process to be run on each machine once. Specifically, this process starts a daemon which is required to be running before the rest of my program executes. Ideally this should execute before I've read any data in.
I'm on Spark 2.0.2 and using dynamic allocation.
You may be able to achieve this with a combination of a lazy val and a Spark broadcast. It would look something like the code below. (I have not compiled this code, so you may have to change a few things.)
object ProcessManager {
lazy val start = // start your process here.
}
You can broadcast this object at the start of your application before you do any transformations.
val pm = sc.broadcast(ProcessManager)
Now, you can access this object inside your transformation like you do with any other broadcast variables and invoke the lazy val.
rdd.mapPartitions(itr => {
  pm.value.start
  // Other stuff here.
})
An object with static initialization which invokes your system process should do the trick.
object SparkStandIn extends App {
  object invokeSystemProcess {
    import sys.process._
    val errorCode = "echo Whatever you put in this object should be executed once per jvm".!
    def doIt(): Unit = {
      // this object will construct once per jvm, but objects are lazy in
      // another way to make sure instantiation happens is to check that the errorCode does not represent an error
    }
  }
  invokeSystemProcess.doIt()
  invokeSystemProcess.doIt() // even if doIt is invoked multiple times, the static initialization happens once
}
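For Java users, roughly the same once-per-JVM behaviour can be sketched with a static holder class, whose initializer the JVM runs at most once, on first use. In this illustration startDaemon is a hypothetical stand-in that only bumps a counter instead of launching a real process, and all names are made up for the example.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class OncePerJvm {

    /** Counts how often the stand-in start routine actually ran. */
    static final AtomicInteger STARTS = new AtomicInteger();

    /** Holder class: its static initializer runs at most once per JVM, on first access. */
    private static final class DaemonHolder {
        static final int EXIT_CODE = startDaemon();
    }

    /** Hypothetical stand-in for launching the real system process. */
    private static int startDaemon() {
        STARTS.incrementAndGet();
        return 0;
    }

    /** Safe to call from every task; only the first call triggers the start. */
    static int ensureStarted() {
        return DaemonHolder.EXIT_CODE;
    }

    public static void main(String[] args) {
        ensureStarted();
        ensureStarted();
        System.out.println("start routine ran " + STARTS.get() + " time(s)");
    }
}
```

As with the Scala object, this is once per JVM rather than once per machine: a node that hosts several executors would run the start routine once in each of them.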
Here is a specific answer for a specific use case: I have a cluster with 50 nodes and I wanted to know which ones have the CET timezone set:
(1 until 100).toSeq.toDS.
mapPartitions(itr => {
sys.process.Process(
Seq("bash", "-c", "echo $(hostname && date)")
).
lines.
toIterator
}).
collect().
filter(_.contains(" CET ")).
distinct.
sorted.
foreach(println)
Note that I don't think it is 100% guaranteed that you will get a partition on every node, so the command might not run on every node, even when using a 100-element Dataset in a cluster with 50 nodes as in the example above.
