I'm new to Spark. I need to compare two JavaRDDs built from HBase clusters using Spark and export the differences to files. The data will run to billions of records. Is there any way I can compare two JavaRDDs using Java?
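In Spark itself the usual tool is `JavaRDD.subtract`, run in both directions, followed by `saveAsTextFile` for the export. As a plain-Java sketch of the semantics `subtract` gives you (the sample rows and set types are illustrative, not from the question):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

public class RddDiffSketch {
    // What rdd1.subtract(rdd2) computes element-wise: everything in 'a'
    // that has no equal counterpart in 'b'. Run it both ways to get the
    // full difference between the two clusters.
    static Set<String> onlyIn(Set<String> a, Set<String> b) {
        return a.stream().filter(x -> !b.contains(x)).collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        Set<String> cluster1 = new HashSet<>(Arrays.asList("row1", "row2", "row3"));
        Set<String> cluster2 = new HashSet<>(Arrays.asList("row2", "row4"));
        System.out.println(onlyIn(cluster1, cluster2)); // present only in cluster 1
        System.out.println(onlyIn(cluster2, cluster1)); // present only in cluster 2
    }
}
```

At billions of records the real work happens distributed: `subtract` shuffles by key, so the elements' `equals`/`hashCode` (e.g. HBase row key plus a hash of the cell values) largely determine performance.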
I am new to Spark and trying to implement a Spark program in Java. I want to read multiple files from a folder and combine them by pairing word#filename as the key and the count as the value. I don't know how to combine all the data together; I want the output to be pairs like
(word#filename,1)
e.g.:
(happy#file1,2)
(newyear#file1,1)
(newyear#file2,1)
Refer to the Spark Java documentation for input_file_name(): https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/functions.html#input_file_name()
and follow this answer: https://stackoverflow.com/a/36356253/8357778
With these you will be able to add a column with the filename to the DataFrame storing your data. After that, you just select and transform the rows as you want.
If you prefer working with an RDD, convert the DataFrame back and map over it.
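Concretely, in Spark 1.6 Java the DataFrame approach might look like the following sketch (untested; the path and intermediate column names are illustrative):

```java
import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.*;

// Read every file in the folder; each line becomes a row with column "value".
DataFrame lines = sqlContext.read().text("hdfs:///input/folder/*");

// Tag each row with the file it came from, split lines into words,
// build the word#filename key, and count.
DataFrame counts = lines
    .withColumn("file", input_file_name())
    .withColumn("word", explode(split(col("value"), "\\s+")))
    .withColumn("key", concat_ws("#", col("word"), col("file")))
    .groupBy("key")
    .count();  // rows like (happy#file1, 2)
```

If you would rather stay with RDDs, `sc.wholeTextFiles(folder)` gives (filename, contents) pairs that you can flatMap into the same word#filename keys and reduceByKey.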
We have an RDD of N = 10^6 elements. Comparing each element to every other element takes on the order of N^2 = 10^12 operations, so we decided to use Spark to do this on a cluster.
I don't believe we actually need to produce a Cartesian set; there's no reason a set like {(a,a),(a,b),(b,a),(b,b)} needs to stick around. If it isn't persisted, Spark would get rid of it once its useful life is done, but I'd rather not let it live in the first place :) The Cartesian product obviously takes a lot more memory, and we'd like to avoid that.
Is there a way in Spark to iterate the way we want without creating a Cartesian product of the RDD with itself?
There must be something; I have been looking at the per-partition functions.
Based on the chat session linked below, I am thinking of assigning an "artificial" key to subsets of RDD elements, evenly divided across workers and partitions, then comparing key by key, partition by partition, until everything has been compared.
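That blocked idea can be checked outside Spark: split the elements into keyed blocks and compare block i against block j for i <= j, and every unordered pair is compared exactly once, with no N x N set ever materialized. A plain-Java sketch (the round-robin block assignment is my assumption, any even split works):

```java
import java.util.*;

public class BlockedPairs {
    // Comparing block i with block j (i <= j) covers every unordered pair
    // exactly once. In Spark, each block would be a keyed subset of the RDD
    // and the inner loops would run inside a per-key task.
    static int countComparisons(List<String> elements, int numBlocks) {
        List<List<String>> blocks = new ArrayList<>();
        for (int b = 0; b < numBlocks; b++) blocks.add(new ArrayList<>());
        for (int i = 0; i < elements.size(); i++)
            blocks.get(i % numBlocks).add(elements.get(i)); // round-robin split
        int comparisons = 0;
        for (int i = 0; i < numBlocks; i++) {
            for (int j = i; j < numBlocks; j++) {
                List<String> a = blocks.get(i), b = blocks.get(j);
                for (int x = 0; x < a.size(); x++) {
                    int yStart = (i == j) ? x + 1 : 0; // skip duplicates within a block
                    for (int y = yStart; y < b.size(); y++) {
                        comparisons++; // compare a.get(x) with b.get(y) here
                    }
                }
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("a","b","c","d","e","f","g","h","i","j");
        System.out.println(countComparisons(items, 3)); // 45 = C(10,2): every pair once
    }
}
```

Only one block pair is live per task, so peak memory is about (N / numBlocks)^2 pairs instead of N^2.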
NOTES:
For what it's worth, we could use a JavaPairRDD with the index as the key for each DropResult, but the DropResults don't need to be compared in any particular order, as long as each one gets compared to all the others. Thanks.
(NOTE: I don't think using a DataFrame would work because these are custom classes; aren't DataFrames meant for pre-defined SQL-like datatypes?
And before anyone suggests it, our target cluster currently runs 1.4.1 and that's out of our control, so if Datasets would be useful I'd like to know, but I don't know when I could take advantage of them.)
I have looked at these other questions including a couple I asked but they don't cover this specific case:
How to compare every element in the RDD with every other element in the RDD ?
Spark - Nested RDD Operation
Comparing Subsets of an RDD
An interesting chat leads off this question:
https://chat.stackoverflow.com/rooms/99735/discussion-between-zero323-and-daniel-imberman
These I asked about different subjects, mostly how to control the creation of RDDs to a desired size:
https://stackoverflow.com/questions/34339300/nesting-parallelizations-in-spark-whats-the-right-approach
https://stackoverflow.com/questions/34279781/in-apache-spark-can-i-easily-repeat-nest-a-sparkcontext-parallelize
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-meet-nested-loop-on-pairRdd-td21121.html
I am creating a JavaRDD<Model> by reading a text file and mapping each line to the Model class properties.
Then I convert the JavaRDD<Model> to a DataFrame using sqlContext:
DataFrame fileDF = sqlContext.createDataFrame(javaRDD, Model.class);
Basically, we are trying to use the DataFrame API for better performance and easier code.
Is there any performance degradation, and will it create the Model objects again when converting the DataFrame back to a JavaRDD?
The reason I am doing this is that I don't see any method to read a text file directly using sqlContext.
Is there any alternate efficient way to do this?
Will it be slower?
There definitely will be some overhead, although I did not benchmark how much. Why? Because createDataFrame has to:
use reflection to get the schema for the DataFrame (once for the whole RDD)
map each entity in the RDD to a Row record (so it fits the DataFrame format) - N times, once per entity in the RDD
create the actual DataFrame object.
Will it matter?
I doubt it. The reflection will be really fast, as it's just one object and you probably have only a handful of fields there.
Will the transformation be slow? Again, probably not, as you have only a few fields per record to iterate through.
Alternatives
But if you are not using that RDD for anything else, you have a few options in the DataFrameReader class, which is accessed through SQLContext.read():
json (several overloaded methods)
parquet
text
The good thing about the first two is that you get an actual schema. With the last one you also pass the path to the file, but since the format is not specified, Spark has no information about the schema -> each line in the file is treated as a new row in the DataFrame with a single column value, which contains the whole line.
If you have a text file in a format that would allow building a schema, for instance CSV, you can try a third-party library such as spark-csv.
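For Spark 1.6 Java, the readers above and the spark-csv option look roughly like this (untested sketch; the paths and options are illustrative):

```java
// Schema inferred from the data itself:
DataFrame fromJson = sqlContext.read().json("data.json");
DataFrame fromParquet = sqlContext.read().parquet("data.parquet");

// No schema: one row per line, a single column named "value":
DataFrame fromText = sqlContext.read().text("data.txt");

// Third-party spark-csv (com.databricks:spark-csv) for CSV with a schema:
DataFrame fromCsv = sqlContext.read()
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("data.csv");
```

Note that `read().text(...)` was added in Spark 1.6; on older versions, `sc.textFile(...)` plus `createDataFrame` is the fallback.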
I am using Apache Spark to analyse data from Cassandra, and I will insert the data back into Cassandra by designing new tables as per our queries. I want to know whether it is possible for Spark to analyze in real time. If yes, then how? I have read many tutorials about this but found nothing.
I want to perform the analysis and insert into Cassandra whenever data arrives in my table, instantaneously.
This is possible with Spark Streaming; you should take a look at the demos and documentation packaged with the Spark Cassandra Connector.
https://github.com/datastax/spark-cassandra-connector
This includes support for streaming, as well as support for creating new tables on the fly.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/8_streaming.md
Spark Streaming extends the core API to allow high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources such as Akka, Kafka, Flume, Twitter, ZeroMQ, TCP sockets, etc. Results can be stored in Cassandra.
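Wiring a DStream into Cassandra from Java might look like the sketch below. This is untested: the socket source, keyspace/table names, and the WordCount POJO are placeholders, and the `CassandraStreamingJavaUtil`/`writerBuilder`/`mapToRow` helper names are from the connector's Java API (japi) as I recall them, so verify them against the connector version on your cluster.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import com.datastax.spark.connector.japi.CassandraJavaUtil;
import com.datastax.spark.connector.japi.CassandraStreamingJavaUtil;

public class StreamToCassandra {
    public static class WordCount implements java.io.Serializable {
        private String word;
        private long count;
        public WordCount() {}
        public WordCount(String word, long count) { this.word = word; this.count = count; }
        public String getWord() { return word; }
        public long getCount() { return count; }
    }

    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("stream-to-cassandra");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Placeholder source; Kafka, Flume, etc. plug in the same way.
        JavaDStream<WordCount> counts = ssc.socketTextStream("host", 9999)
            .map(line -> {
                String[] parts = line.split(",");
                return new WordCount(parts[0], Long.parseLong(parts[1]));
            });

        // Write each micro-batch to an existing table test.words.
        CassandraStreamingJavaUtil.javaFunctions(counts)
            .writerBuilder("test", "words", CassandraJavaUtil.mapToRow(WordCount.class))
            .saveToCassandra();

        ssc.start();
        ssc.awaitTermination();
    }
}
```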
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md#saving-rdds-as-new-tables
Use the saveAsCassandraTable method to automatically create a new table with the given name and save the RDD into it. The keyspace you're saving to must exist. The following code will create a new table words_new in keyspace test with columns word and count, where word becomes a primary key:
case class WordCount(word: String, count: Long)
val collection = sc.parallelize(Seq(WordCount("dog", 50), WordCount("cow", 60)))
collection.saveAsCassandraTable("test", "words_new", SomeColumns("word", "count"))
I have a bunch of data in a CSV file which I need to store in Cassandra through Spark.
I'm using the Spark Cassandra connector for this.
Normally, to store data in Cassandra, I create a POJO, parallelize it into an RDD, and then store it:
Employee emp = new Employee(1, "Mr", "X");
JavaRDD<Employee> empRdd = sc.parallelize(Arrays.asList(emp));
Finally, I write this to Cassandra:
CassandraJavaUtil.javaFunctions(empRdd, Employee.class).saveToCassandra("dev", "emp");
This is fine, but my data is stored in a CSV file, where every line represents a tuple in the Cassandra database.
I know I can read each line, split the columns, create an object from the column values, add it to a list, and finally parallelize the entire list. I was wondering if there is an easier, more direct way to do this?
Well, you could just use the SSTableLoader for bulk loading and avoid Spark altogether.
If you rely on Spark, then I think you're out of luck... although I am not sure how much easier than reading line by line and splitting the lines it could get...
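The "read, split, construct" step the question describes collapses to a single map over `sc.textFile(...)`, since the connector maps POJOs by reflection. A sketch of the per-line parser in plain Java (the Employee fields are assumed from the question's constructor):

```java
public class EmployeeCsv {
    static class Employee {
        final int id;
        final String title;
        final String name;
        Employee(int id, String title, String name) {
            this.id = id;
            this.title = title;
            this.name = name;
        }
    }

    // In Spark this is the function you would pass to the map:
    //   JavaRDD<Employee> empRdd = sc.textFile("employees.csv").map(EmployeeCsv::parse);
    //   CassandraJavaUtil.javaFunctions(empRdd, Employee.class).saveToCassandra("dev", "emp");
    static Employee parse(String line) {
        String[] cols = line.split(",");
        return new Employee(Integer.parseInt(cols[0].trim()), cols[1].trim(), cols[2].trim());
    }

    public static void main(String[] args) {
        Employee e = parse("1, Mr, X");
        System.out.println(e.id + " " + e.title + " " + e.name); // 1 Mr X
    }
}
```

No intermediate list is needed: the map runs lazily per partition, so lines stream straight from the file into Cassandra writes.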