I want to know how to show DataFrame content (rows) in Java.
I tried using log.info(df.showString()), but it prints unreadable characters. I would like to use df.collectAsList(), but I have to apply a filter afterwards, so I can't do that.
Thank you.
There are several options to log the data:
Collecting the data to the driver
You can call collectAsList() and continue processing afterwards. Calling collectAsList() triggers execution and brings the data to the driver, but since Spark datasets are immutable, you can still re-use the dataset afterwards for further processing steps:
Dataset<Data> ds = ... //1
List<Data> collectedDs = ds.collectAsList(); //2
doSomeLogging(collectedDs);
ds = ds.filter(<filter condition>); //3
ds.show();
The code above will collect the data in line //2, log it and then continue processing in line //3.
Depending on how complex the creation of the dataset in line //1 is, you may want to cache the dataset so that the processing in line //1 runs only once.
Dataset<Data> ds = ... //1
ds = ds.cache();
List<Data> collectedDs = ds.collectAsList(); //2
....
Using map
Calling collectAsList() will send all your data to the driver. Usually you use Spark in order to distribute the data over several executor nodes, so your driver might not be large enough to hold all of the data at the same time. In this case, you can log the data in a map call:
Dataset<Data> ds = ... //1
ds = ds.map(d -> {
System.out.println(d); //2
return d; //3
}, Encoders.bean(Data.class));
ds = ds.filter(<filter condition>);
ds.show();
In this example, line //2 does the logging and line //3 simply returns the original object, so that the dataset remains unchanged. I assume that the Data class comes with a readable toString() implementation; otherwise, line //2 needs some more logic. It might also be helpful to use a logging library (like log4j) instead of writing directly to standard out.
In this second approach, the logs will not be written on the driver but on each executor. You would have to collect the logs after the Spark job has finished and combine them into one file.
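A minimal sketch of that idea (not from the original answer; it assumes log4j and org.apache.spark.api.java.function.MapFunction are on the classpath, and "data-logging" is just a made-up logger name):
ds = ds.map((MapFunction<Data, Data>) d -> {
    // obtaining the logger inside the function avoids serializing it with the closure
    org.apache.log4j.Logger.getLogger("data-logging").info(d);
    return d;
}, Encoders.bean(Data.class));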
If you have an untyped dataframe instead of a dataset like above, the same code works. You would only have to operate directly on the Row object, using its getXXX methods to create the logging output instead of the Data class.
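A hedged sketch of what that could look like (the columns "name" and "age" are made up for illustration; RowEncoder comes from org.apache.spark.sql.catalyst.encoders and keeps the original schema for the mapped dataset):
Dataset<Row> df = ... //untyped dataframe
df = df.map((MapFunction<Row, Row>) row -> {
    // use the getXXX accessors that match your actual schema
    System.out.println(row.getString(row.fieldIndex("name")) + ", " + row.getInt(row.fieldIndex("age")));
    return row;
}, RowEncoder.apply(df.schema()));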
All logging operations will have an impact on the performance of your code.
In my Java application, I have three DataStreams. For example, one stream's data is consumed from Kafka and another stream's data is consumed from Apache NiFi. The object types of these two streams are different: for example, the Stream-1 object type is Person and the Stream-2 object type is Address.
The third one is a broadcast stream (its data is consumed from Kafka).
Now I want to combine Stream-1 and Stream-2 in a Job class and split them in the task's processElement. How can I implement this?
Note:
Stream-1 is the main stream and Stream-2 is a side input. The main stream continuously fetches data from Kafka. For the side input, all table data is initially loaded from the DB while the application starts up, and new data is then read whenever the table data is updated (which is not frequent).
Sample structure:
DataStream<Person> stream-1 = env.addSource(read data from kafka)....
DataStream<Address> stream-2 = env.addSource(read data from nifi)....
BroadcastStream<String> BroadCastStream = stream-3.broadcast(read data from kafka);
I referred to the following links:
FLIP-17 Side Inputs for DataStream API
jira/browse/FLINK-6131
My use case is:
Join a stream with slowly evolving data: the side input that we use for enriching evolves over time (the data is read from a DB). This can be done by waiting for some initial data to be available before processing the main input and then continuously ingesting new data into the internal side input structure as it arrives.
Based on the latest response, the recommendation by @Arvid was in fact what was needed here.
Core of the answer:
You can easily join stream1 and stream2 even if they have different types. Then you can add the broadcast to the result.
Links to the doc and example, and a relevant snippet from the doc (the example is too long to be included here):
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
...
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...
orangeStream.join(greenStream)
.where(<KeySelector>)
.equalTo(<KeySelector>)
.window(TumblingEventTimeWindows.of(Time.milliseconds(2)))
.apply (new JoinFunction<Integer, Integer, String> (){
@Override
public String join(Integer first, Integer second) {
return first + "," + second;
}
});
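Applied to the question's types, a hedged sketch could look roughly like this (Person, Address, the key accessors and MyEnrichmentFunction are hypothetical placeholders in the same style as the doc snippet, not code from the original answer; key extraction from lambdas may need an explicit TypeInformation hint in some Flink versions):
DataStream<Person> personStream = ...       // stream-1, from Kafka
DataStream<Address> addressStream = ...     // stream-2, from NiFi
BroadcastStream<String> ruleBroadcast = ... // stream-3, already broadcast

DataStream<Tuple2<Person, Address>> joined = personStream
    .join(addressStream)
    .where(person -> person.getId())           // assumed key on Person
    .equalTo(address -> address.getPersonId()) // assumed key on Address
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .apply(new JoinFunction<Person, Address, Tuple2<Person, Address>>() {
        @Override
        public Tuple2<Person, Address> join(Person person, Address address) {
            return Tuple2.of(person, address);
        }
    });

// the joined stream can then be connected to the broadcast stream; the splitting
// logic goes into processElement of a BroadcastProcessFunction
joined.connect(ruleBroadcast)
      .process(new MyEnrichmentFunction()); // hypothetical BroadcastProcessFunction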
I'm trying to use wholeTextFiles to read all the file names in a folder and process them one by one separately (for example, I'm trying to get the SVD vector of each data set, and there are 100 sets in total). The data are saved in .txt files, separated by spaces and arranged in different lines (like a matrix).
The problem I came across is that after I use "wholeTextFiles("path with all the text files")", it's really difficult to read and parse the data, and I can't use the same method I used when reading only one file. The method works fine when I just read one file and it gives me the correct output. Could someone please let me know how to fix it here? Thanks!
public static void main (String[] args) {
SparkConf sparkConf = new SparkConf().setAppName("whole text files").setMaster("local[2]").set("spark.executor.memory","1g");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
JavaPairRDD<String, String> fileNameContentsRDD = jsc.wholeTextFiles("/Users/peng/FMRITest/regionOutput/");
JavaRDD<String[]> lineCounts = fileNameContentsRDD.map(new Function<Tuple2<String, String>, String[]>() {
@Override
public String[] call(Tuple2<String, String> fileNameContent) throws Exception {
String content = fileNameContent._2();
String[] sarray = content.split(" ");
double[] values = new double[sarray.length];
for (int i = 0; i< sarray.length; i++){
values[i] = Double.parseDouble(sarray[i]);
}
pd.cache();
RowMatrix mat = new RowMatrix(pd.rdd());
SingularValueDecomposition<RowMatrix, Matrix> svd = mat.computeSVD(84, true, 1.0E-9d);
Vector s = svd.s();
}});
Quoting the scaladoc of SparkContext.wholeTextFiles:
wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)] Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
In other words, wholeTextFiles might not simply be what you want.
Since by design "Small files are preferred" (see the scaladoc), you could mapPartitions or collect (with filter) to grab a subset of the files to apply the parsing to.
Once you have the files per partition in your hands, you could use Scala's Parallel Collection API and schedule Spark jobs to execute in parallel:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
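As a hedged sketch of that idea in Java (using Java 8 parallel streams on the driver in place of Scala's parallel collections, with the directory path and SVD parameters borrowed from the question), each file gets its own small RDD and its own SVD job:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.SingularValueDecomposition;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;
import java.util.Arrays;
import java.util.List;

public class PerFileSvd {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext("local[2]", "per-file svd");
        // collect only the file paths to the driver, not the file contents
        List<String> paths = jsc.wholeTextFiles("/Users/peng/FMRITest/regionOutput/").keys().collect();
        // one Spark job per file, submitted from parallel driver threads (the scheduler is thread-safe)
        paths.parallelStream().forEach(path -> {
            JavaRDD<Vector> rows = jsc.textFile(path).map(line ->
                Vectors.dense(Arrays.stream(line.trim().split("\\s+"))
                                    .mapToDouble(Double::parseDouble).toArray()));
            RowMatrix mat = new RowMatrix(rows.rdd());
            SingularValueDecomposition<RowMatrix, Matrix> svd = mat.computeSVD(84, true, 1.0E-9d);
            System.out.println(path + " singular values: " + svd.s());
        });
        jsc.stop();
    }
}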
I am trying to create a framework using Selenium and TestNG. As part of the framework I am trying to implement data parameterization, but I am confused about the optimal way to implement it. Here are the approaches I tried:
With data providers (reading from Excel and storing in an Object[][])
With testng.xml
Issues with data providers:
Let's say my test needs to handle large volumes of data, say 15 different values; then I need to pass 15 parameters to it. Alternatively, if I try to create a TestData class to hold and maintain these parameters, then every test will need a different data set, so my TestData class will end up with more than 40 different params.
E.g.: in an e-commerce web site there are many different params, such as accounts, cards, products, rewards, history, store locations, etc. For these we may need at least 40 different params declared in the test data, which I don't think is a sensible solution. Some tests may need 10 different test data values, some may need 12. Sometimes even within a single test, one iteration needs only 7 params and another needs 12.
How do i manage it effectively?
Issues with testng.xml:
Maintaining 20 different accounts, 40 different product details, cards, history, etc. all in a single XML file, together with the test suite configuration (parallel execution, selecting which classes to execute, and so on), will make a mess of the testng.xml file.
So can you please suggest an optimized way to handle data in a testing framework?
How are data parameterization and iterations with different test data handled in real projects?
Assuming that every test knows what sort of test data it is going to receive, here's what I would suggest you do:
Have your TestNG suite xml file pass to the data provider the name of the file from which data is to be read.
Build your data provider such that it receives that file name via TestNG parameters, builds a generic map per test data iteration (every test will receive its parameters as a key/value map), and then works with the passed-in map.
This way you will have just one data provider which can handle literally anything. You can make your data provider a bit more sophisticated by having it inspect the test method and provide the values accordingly.
Here's a skeleton implementation of what I am talking about.
import java.util.List;
import java.util.Map;

import org.testng.ITestContext;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;

import com.google.common.collect.Lists;
import com.google.common.collect.Maps;

public class DataProviderExample {
@Test (dataProvider = "dp")
public void testMethod(Map<String, String> testdata) {
System.err.println("****" + testdata);
}
@DataProvider (name = "dp")
public Object[][] getData(ITestContext ctx) {
//This line retrieves the value of <parameter name="fileName" value="..."/> from within the
//<test> tag of the suite xml file.
String fileName = ctx.getCurrentXmlTest().getParameter("fileName");
List<Map<String, String>> maps = extractDataFrom(fileName);
Object[][] testData = new Object[maps.size()][1];
for (int i = 0; i < maps.size(); i++) {
testData[i][0] = maps.get(i);
}
return testData;
}
private static List<Map<String, String>> extractDataFrom(String file) {
List<Map<String, String>> maps = Lists.newArrayList();
maps.add(Maps.newHashMap());
maps.add(Maps.newHashMap());
maps.add(Maps.newHashMap());
return maps;
}
}
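The extractDataFrom(...) above is only a stub. As a hedged sketch of one way to fill it in, assuming a plain CSV file whose first row holds the column names (the question mentions Excel, which would instead need a library such as Apache POI); it needs java.nio.file.Files, java.nio.file.Paths, java.io.IOException and java.util imports:
private static List<Map<String, String>> extractDataFrom(String file) {
    List<Map<String, String>> maps = new ArrayList<>();
    try {
        List<String> lines = Files.readAllLines(Paths.get(file));
        String[] headers = lines.get(0).split(",");
        // every data row becomes one key/value map, keyed by the header names
        for (String line : lines.subList(1, lines.size())) {
            String[] values = line.split(",");
            Map<String, String> row = new HashMap<>();
            for (int i = 0; i < headers.length && i < values.length; i++) {
                row.put(headers[i].trim(), values[i].trim());
            }
            maps.add(row);
        }
    } catch (IOException e) {
        throw new RuntimeException("Could not read test data file: " + file, e);
    }
    return maps;
}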
I'm actually currently trying to do the same (or similar) thing. I write automation to validate product data on several eComm sites.
My old method
The data comes in Excel sheet format that I process slightly to get in a format that I want. I run the automation that reads from Excel and executes the runs sequentially.
My new method (so far, WIP)
My company recently started using SauceLabs, so I started prototyping ways to take advantage of X # of VMs in parallel, and I ran into the same issues as you. This isn't a polished or even finished solution; it's something I'm currently working on, but I thought I would share some of what I'm doing to see if it helps you.
I started reading SauceLabs docs and ran across the sample code below which started me down the path.
https://github.com/saucelabs-sample-scripts/C-Sharp-Selenium/blob/master/SaucePNUnit_Test.cs
I'm using NUnit and I found in their docs a way to pass data into the test that allows parallel execution and allows me to store it all neatly in another class.
https://github.com/nunit/docs/wiki/TestFixtureSource-Attribute
This keeps me from having a bunch of [TestFixture] tags stacked on top of my script class (as in the demo code above). Right now I have,
[TestFixtureSource(typeof(Configs), "StandardBrowsers")]
[Parallelizable]
public class ProductSetupUnitTest
where the Configs class contains an object[] called StandardBrowsers like
public class Configs
{
static object[] StandardBrowsers =
{
new object[] { "chrome", "latest", "windows 10", "Product Name1", "Product ID1" },
new object[] { "chrome", "latest", "windows 10", "Product Name2", "Product ID2" },
new object[] { "chrome", "latest", "windows 10", "Product Name3", "Product ID3" },
new object[] { "chrome", "latest", "windows 10", "Product Name4", "Product ID4" },
    };
}
I actually got this working this morning so I know now the approach will work and I'm working on ways to further tweak and improve it.
So, in your case you would just load up the object[] with all the data you want to pass. You will probably have to declare a string for each of the possible fields you might want to pass; if you don't need a particular field in a given run, pass an empty string.
My next step is to load the object[] from Excel. The pain point for me is logging. I have a pretty mature logging system in my existing sequential-execution script, and it's going to be hard to give that up or settle for something with reduced functionality. Currently I write everything to a CSV, load that into Excel, and then I can quickly process failures using Excel filtering, etc. My current thought is to have each script write its own CSV and then pull them all together after all the runs are complete. That part is still theoretical right now, though.
Hope this helps. Feel free to ask me questions if something isn't clear. I'll answer what I can.
I have some intermediate data that I need to store in HDFS and locally as well. I'm using Spark 1.6. In HDFS, the intermediate data ends up as /output/testDummy/part-00000 and /output/testDummy/part-00001. I want to save these partitions locally using Java/Scala, either merged as /users/home/indexes/index.nt or separately as /users/home/indexes/index-0000.nt and /home/indexes/index-0001.nt.
Here is my code:
Note: testDummy is the same as test, and the output has two partitions. I want to store them separately or combined, but locally, in an index.nt file. I would prefer to store them separately on two data nodes. I'm using a cluster and submitting the Spark job on YARN. I also added some comments about how many times, and what data, I'm getting. How could I do this? Any help is appreciated.
val testDummy = outputFlatMapTuples.coalesce(Constants.INITIAL_PARTITIONS).saveAsTextFile(outputFilePathForHDFS+"/testDummy")
println("testDummy done") //1 time print
def savesData(iterator: Iterator[(String)]): Iterator[(String)] = {
println("Inside savesData") // now 4 times when coalesce(Constants.INITIAL_PARTITIONS)=2
println("iter size"+iterator.size) // 2 735 2 735 values
val filenamesWithExtension = outputPath + "/index.nt"
println("filenamesWithExtension "+filenamesWithExtension.length) //4 times
var list = List[(String)]()
val fileWritter = new FileWriter(filenamesWithExtension,true)
val bufferWritter = new BufferedWriter(fileWritter)
while (iterator.hasNext){ //iterator.hasNext is false
println("inside iterator") //0 times
val dat = iterator.next()
println("datadata "+iterator.next())
bufferWritter.write(dat + "\n")
bufferWritter.flush()
println("index files written")
val dataElements = dat.split(" ")
println("dataElements") //0
list = list.::(dataElements(0))
list = list.::(dataElements(1))
list = list.::(dataElements(2))
}
bufferWritter.close() //closing
println("savesData method end") //4 times when coal=2
list.iterator
}
println("before saving data into local") //1
val test = outputFlatMapTuples.coalesce(Constants.INITIAL_PARTITIONS).mapPartitions(savesData)
println("testRDD partitions "+test.getNumPartitions) //2
println("testRDD size "+test.collect().length) //0
println("after saving data into local") //1
PS: I followed this and this, but it's not exactly what I'm looking for; I did it somehow, but I'm not getting anything in index.nt.
A couple of things:
Never call Iterator.size if you plan to use the data later. Iterators are TraversableOnce: the only way to compute an iterator's size is to traverse all its elements, and after that there is no more data left to read.
Don't use transformations like mapPartitions for side effects. If you want to perform some type of IO, use actions like foreach / foreachPartition. It is bad practice, and Spark doesn't guarantee that a given piece of code will be executed only once.
A local path inside an action or transformation is a local path on the particular worker. If you want to write directly on the client machine, you should fetch the data first with collect or toLocalIterator. It could be better, though, to write to distributed storage and fetch the data later.
Java 7 provides means to watch directories.
https://docs.oracle.com/javase/tutorial/essential/io/notification.html
The idea is to create a watch service, register it with the directory of interest (specifying the events you care about, such as file creation, deletion, etc.), and then watch; you will be notified of those events and can take whatever action you want.
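A minimal sketch of that idea for a local directory (the path below is hypothetical; watching an HDFS directory would instead go through the Hadoop FileSystem API):
import java.nio.file.*;
import static java.nio.file.StandardWatchEventKinds.*;

public class DirWatcher {
    public static void main(String[] args) throws Exception {
        WatchService watcher = FileSystems.getDefault().newWatchService();
        Path dir = Paths.get("/users/home/indexes");   // hypothetical directory to watch
        dir.register(watcher, ENTRY_CREATE, ENTRY_MODIFY, ENTRY_DELETE);
        while (true) {                                  // runs until interrupted
            WatchKey key = watcher.take();              // blocks until an event arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                System.out.println(event.kind() + ": " + event.context());
                // react here, e.g. merge a new part-0000x file into index.nt
            }
            if (!key.reset()) break;                    // directory no longer accessible
        }
    }
}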
You will have to rely heavily on the Java HDFS API wherever applicable.
Run the program in background since it waits for events forever. (You can write logic to quit after you do whatever you want)
On the other hand, shell scripting will also help.
Be aware of the coherency model of the HDFS file system while reading files.
Hope this helps with some idea.
I have an RDD which I am trying to serialize and then reconstruct by deserializing. I am trying to see if this is possible in Apache Spark.
static JavaSparkContext sc = new JavaSparkContext(conf);
static SerializerInstance si = SparkEnv.get().closureSerializer().newInstance();
static ClassTag<JavaRDD<String>> tag = scala.reflect.ClassTag$.MODULE$.apply(JavaRDD.class);
..
..
JavaRDD<String> rdd = sc.textFile(logFile, 4);
System.out.println("Element 1 " + rdd.first());
ByteBuffer bb= si.serialize(rdd, tag);
JavaRDD<String> rdd2 = si.deserialize(bb, Thread.currentThread().getContextClassLoader(),tag);
System.out.println(rdd2.partitions().size());
System.out.println("Element 0 " + rdd2.first());
I get an exception on the last line when I perform an action on the newly created RDD. The way I am serializing is similar to how it is done internally in Spark.
Exception in thread "main" org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
at org.apache.spark.rdd.RDD.sc(RDD.scala:87)
at org.apache.spark.rdd.RDD.take(RDD.scala:1177)
at org.apache.spark.rdd.RDD.first(RDD.scala:1189)
at org.apache.spark.api.java.JavaRDDLike$class.first(JavaRDDLike.scala:477)
at org.apache.spark.api.java.JavaRDD.first(JavaRDD.scala:32)
at SimpleApp.sparkSend(SimpleApp.java:63)
at SimpleApp.main(SimpleApp.java:91)
The RDD is created and loaded within the same process, so I don't understand how this error happens.
I'm the author of this warning message.
Spark does not support performing actions and transformations on copies of RDDs that are created via deserialization. RDDs are serializable so that certain methods on them can be invoked in executors, but end users shouldn't try to manually perform RDD serialization.
When an RDD is serialized, it loses its reference to the SparkContext that created it, preventing jobs from being launched with it (see here). In earlier versions of Spark, your code would result in a NullPointerException when Spark tried to access the private, null RDD.sc field.
This error message was worded this way because users were frequently running into confusing NullPointerExceptions when trying to do things like rdd1.map { _ => rdd2.count() }, which caused actions to be invoked on deserialized RDDs on executor machines. I didn't anticipate that anyone would try to manually serialize / deserialize their RDDs on the driver, so I can see how this error message could be slightly misleading.