CSV to RDD to Cassandra store in Apache Spark - java

I have a bunch of data in a CSV file which I need to store in Cassandra through Spark.
I'm using the Spark Cassandra connector for this.
Normally, to store data in Cassandra, I create a POJO, put it into an RDD, and then save it:
Employee emp = new Employee(1, "Mr", "X");
JavaRDD<Employee> empRdd = sparkContext.parallelize(Arrays.asList(emp));
Finally I write this to Cassandra as:
CassandraJavaUtil.javaFunctions(empRdd, Employee.class).saveToCassandra("dev", "emp");
This is fine, but my data is stored in a CSV file. Every line represents one row (tuple) in the Cassandra database.
I know I can read each line, split the columns, create an object from the column values, add it to a list, and finally serialize the entire list. Is there an easier, more direct way to do this?

Well, you could just use SSTableLoader for bulk loading and avoid Spark altogether.
If you rely on Spark, then I think you're out of luck... although I'm not sure how much easier it could get than reading line by line and splitting the lines.
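For what it's worth, the "read each line and split it" step can be kept small. Here is a minimal sketch, assuming a three-column file (id, title, name) with no quoted fields or embedded commas. The EmployeeCsv class, the getter names and the column order are my own assumptions for illustration, not something from the question:

```java
import java.io.Serializable;

public class EmployeeCsv {
    // Hypothetical POJO mirroring the question's Employee(1, "Mr", "X").
    public static class Employee implements Serializable {
        private final int id;
        private final String title;
        private final String name;

        public Employee(int id, String title, String name) {
            this.id = id;
            this.title = title;
            this.name = name;
        }

        public int getId() { return id; }
        public String getTitle() { return title; }
        public String getName() { return name; }
    }

    // Parses one line such as "1,Mr,X". Assumes plain comma-separated
    // values without quoting; a real CSV parser is safer for messy data.
    public static Employee parseLine(String line) {
        String[] cols = line.split(",", -1);
        if (cols.length != 3) {
            throw new IllegalArgumentException("Expected 3 columns: " + line);
        }
        return new Employee(Integer.parseInt(cols[0].trim()), cols[1].trim(), cols[2].trim());
    }

    public static void main(String[] args) {
        Employee e = parseLine("1,Mr,X");
        System.out.println(e.getId() + " " + e.getTitle() + " " + e.getName()); // prints "1 Mr X"
    }
}
```

With Spark you would not need to collect the objects into a list first: reading the file with textFile(...) on the JavaSparkContext and calling map(EmployeeCsv::parseLine) on the result should give you a JavaRDD<Employee> that can be saved the same way as in the question.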

Related

Spark read CSV file using Data Frame and query from PostgreSQL DB

I'm new to Spark. I'm loading a huge CSV file into a DataFrame using the code below:
Dataset<Row> df = sqlContext.read().format("com.databricks.spark.csv").schema(customSchema)
.option("delimiter", "|").option("header", true).load(inputDataPath);
After loading the CSV data into the DataFrame, I want to iterate through each row and, based on some columns, query a PostgreSQL DB (performing some geometry operations). Later I want to merge some fields returned from the DB with the DataFrame records. What's the best way to do this, considering the huge number of records?
Any help is appreciated. I'm using Java.
Like @mck also pointed out, the best way is to use a join.
With Spark you can read an external JDBC table using the DataFrame API,
for example:
val props = Map(....)
spark.read.format("jdbc").options(props).load()
See the DataFrameReader Scaladoc for more options and for the properties and values you need to set.
Then use a join to merge the fields.

Searching and storing values from CSV

I'm a Java beginner and want to learn how to read in files and store data in a way that makes it easy to manipulate.
I have a fairly big CSV file (18,000 rows). The data represents the assortment of beverages sold by a liquor store. It has about 16 columns with headers like "article number", "name", "producer", "amount of alcohol", etc. The columns are separated by "\t".
I now want to search this file to find things like how many products are produced in Sweden and which liquor is the most expensive per liter.
Since I really want to learn how to program and not just find the answer, I'm not looking for exact code here. I'm instead looking for the pseudo-code behind this, a good way of thinking when dealing with large sets of data, and which data structures are best suited for the task.
Let's take the "How many products are from Sweden" example.
Since the data consists of strings, ints and floats, I can't put everything in a single list. What is the best way of storing it so it can be manipulated later? Or can I find the answer as soon as the data is parsed, so that maybe I don't have to store it at all?
If you're new to Java and programming in general, I'd recommend a library to help you view and use your data without getting into databases and learning SQL. One that I've used in the past is Commons CSV.
https://commons.apache.org/proper/commons-csv/user-guide.html#Parsing_files
It lets you easily parse a whole CSV file into CSVRecord objects. For example:
Reader in = new FileReader("path/to/file.csv");
// Treat the first record as the header so columns can be read by name
Iterable<CSVRecord> records = CSVFormat.EXCEL.withFirstRecordAsHeader().parse(in);
for (CSVRecord record : records) {
    String lastName = record.get("Last Name");
    String firstName = record.get("First Name");
}
If your data is in a CSV file, you can load it into a database.
See this link for reading CSV in Java.
Use an ORM framework like Hibernate together with a Spring application. Use this link to create the application.
With this you can write queries to fetch data such as "How many products are from Sweden" and work with the results through the Collections framework. See this link for using HQL queries in the same application.
Create JSP pages to show the results in the UI.
Sorry for my english.
It seems you are looking for an in-memory SQL engine over your CSV file. I would suggest using CQEngine, which provides an indexed view on top of the Java collections framework with SQL-like queries.
You are basically treating a Java collection as a database table. Assuming that each CSV line maps to some POJO class like Beverage:
IndexedCollection<Beverage> table = new ConcurrentIndexedCollection<Beverage>();
table.addIndex(NavigableIndex.onAttribute(Beverage.BEVERAGE_ID));
table.add(new Beverage(...));
table.add(new Beverage(...));
table.add(new Beverage(...));
What you need to do now is read the CSV file, load it into the IndexedCollection, and build a proper index on the fields you query. After that, you can query the table much like a regular SQL database. At the end, serialize the collection to a new CSV file (if you made any modifications).
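On the "maybe I don't have to store it at all" part of the question: for a single count you can indeed answer it while parsing. Here is a minimal sketch in plain Java, assuming the first line is a tab-separated header. The column name "country" and the class name BeverageScan are made up for illustration, so substitute whatever your file actually uses:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class BeverageScan {
    // Counts data rows whose value in the named column equals `wanted`.
    // Only one line is held in memory at a time; nothing is stored.
    public static long countWhere(Iterator<String> lines, String column, String wanted) {
        if (!lines.hasNext()) {
            return 0;
        }
        // The first line is assumed to be the header row.
        List<String> header = Arrays.asList(lines.next().split("\t", -1));
        int idx = header.indexOf(column);
        if (idx < 0) {
            throw new IllegalArgumentException("No such column: " + column);
        }
        long count = 0;
        while (lines.hasNext()) {
            String[] cols = lines.next().split("\t", -1);
            if (idx < cols.length && wanted.equals(cols[idx])) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "name\tcountry", "A\tSweden", "B\tFrance", "C\tSweden");
        System.out.println(countWhere(sample.iterator(), "country", "Sweden")); // prints 2
    }
}
```

For the real file you could pass Files.lines(path).iterator() instead of the sample list. Once you want to run many different queries over the same data, parsing each row into an object and keeping the objects in a list (or one of the libraries above) starts to pay off.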

Interact with multiple files in Spark - Java code

I am new to Spark. I am trying to implement a Spark program in Java. I want to read multiple files from a folder and combine them by pairing word#filename as the key with the count as the value.
I don't know how to combine all the data together, and I want the output to be pairs like
(word#filename,1)
ex:
(happy#file1,2)
(newyear#file1,1)
(newyear#file2,1)
Refer to the Java Spark documentation: https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/functions.html#input_file_name()
and follow this answer: https://stackoverflow.com/a/36356253/8357778
That lets you add a column with the file name to the DataFrame holding your data. After that, you just have to select and transform your rows as you want.
If you prefer using an RDD, convert your DataFrame and map it.
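To make the key format concrete, here is a minimal plain-Java sketch of the per-file step, i.e. turning one file's name and content into word#filename counts. The class and method names are hypothetical, and splitting on non-word characters is just one possible tokenization:

```java
import java.util.Map;
import java.util.TreeMap;

public class WordFileKeys {
    // Returns counts keyed by "word#fileName" for a single file's content.
    public static Map<String, Integer> count(String fileName, String content) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : content.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) {
                continue; // split() yields an empty token if the text starts with a separator
            }
            counts.merge(word + "#" + fileName, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("file1", "happy happy newyear"));
        // prints {happy#file1=2, newyear#file1=1}
    }
}
```

In a Spark job the same idea would probably look like this: wholeTextFiles(dir) on the JavaSparkContext gives (path, content) pairs, a flatMapToPair emits (word#file, 1) for every word, and reduceByKey sums the ones per key.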

Copy data from one table to other in Cassandra using Java

I am trying to move all my data from one column family (table) to another. Since the two tables have different definitions, I have to pull all the data from table-1, create a new object for table-2, and then do a bulk async insert. My table-1 has millions of records so I cannot get all the data directly in my data structure and work that out. I am looking for a way to do this easily using Spring Data Cassandra with Java.
I initially planned to move all the data to a temp table first, then create some composite key relations, and then query back my master table. However, that doesn't seem favorable to me. Can anyone suggest a good strategy for this? Any leads would be appreciated. Thanks!
"My table-1 has millions of records so I cannot get all the data directly in my data structure and work that out."
With the DataStax Java driver you can fetch the data by token range and process each token range separately. For example:
Set<TokenRange> tokenRanges = session.getCluster().getMetadata().getTokenRanges();
for (TokenRange tr : tokenRanges) {
    List<Row> rows = new ArrayList<>();
    // A range that wraps around the ring has to be split before querying
    for (TokenRange sub : tr.unwrap()) {
        String query = "SELECT * FROM keyspace.table WHERE token(pk) > ? AND token(pk) <= ?";
        SimpleStatement st = new SimpleStatement(query, sub.getStart(), sub.getEnd());
        rows.addAll(session.execute(st).all());
    }
    transformAndWriteToNewTable(rows);
}
Each token range contains only a piece of the data and can be handled by one physical machine. You can handle each token range independently (in parallel or asynchronously) to get better performance.
You could use Apache Spark Streaming. Technically, you read data from the first table, transform it on the fly, and write it to the second table.
Note that I prefer the Spark Scala API, as it is more elegant and the streaming job code is more concise. But if you want to do it in pure Java, that's your choice.

Apache Spark - Converting JavaRDD to DataFrame and vice versa, any performance degradation?

I am creating a JavaRDD<Model> by reading a text file and mapping each line to the properties of the Model class.
Then I am converting the JavaRDD<Model> to a DataFrame using sqlContext.
DataFrame fileDF = sqlContext.createDataFrame(javaRDD, Model.class);
Basically, we are trying to use the DataFrame API to improve performance and make the code easier to write.
Is there any performance degradation, or will the Model objects be created again, when converting the DataFrame back to a JavaRDD?
The reason I am doing this is that I don't see any method to read a text file directly using sqlContext.
Is there an alternative, more efficient way to do this?
Will it be slower?
There will definitely be some overhead, although I did not benchmark how much. Why? Because createDataFrame has to:
use reflection to get the schema for the DataFrame (once for the whole RDD)
map each entity in the RDD to a row record (so it fits the DataFrame format), N times, once per entity in the RDD
create the actual DataFrame object.
Will it matter?
I doubt it. The reflection will be really fast, as it's done on just one class and you probably have only a handful of fields there.
Will the transformation be slow? Again, probably not, as you have only a few fields per record to iterate through.
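To give a feel for how small the one-time reflection step is, here is a minimal sketch that lists a bean's properties with the standard java.beans API. This is not Spark's code. It only assumes that the schema is derived from the bean's getters in a similar way, and the Model class with its two fields is made up for illustration:

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BeanSchemaSketch {
    // Hypothetical bean standing in for the question's Model class.
    public static class Model {
        private int id;
        private String name;

        public int getId() { return id; }
        public void setId(int id) { this.id = id; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    // Inspects the class once and returns "property:type" entries, sorted by name.
    public static List<String> propertyNames(Class<?> beanClass) {
        try {
            List<String> names = new ArrayList<>();
            // Stopping at Object.class leaves out the inherited "class" property.
            for (PropertyDescriptor pd
                    : Introspector.getBeanInfo(beanClass, Object.class).getPropertyDescriptors()) {
                names.add(pd.getName() + ":" + pd.getPropertyType().getSimpleName());
            }
            Collections.sort(names);
            return names;
        } catch (IntrospectionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(propertyNames(Model.class)); // prints [id:int, name:String]
    }
}
```

The inspection looks at the class, not at the records, so its cost does not grow with the size of the RDD. The per-record mapping is the part that scales with N.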
Alternatives
But if you are not using that RDD for anything else, you have a few options in the DataFrameReader class, which can be accessed through SQLContext.read():
1. json (several methods)
2. parquet
3. text
The good thing about the first two (json and parquet) is that you get an actual schema. For the last one, you pass the path to the file (as with the other two methods), but since the format is not specified, Spark has no information about the schema, so each line in the file becomes a row in the DataFrame with a single column, value, containing the whole line.
If you have a text file in a format that allows creating a schema, for instance CSV, you can try a third-party library such as Spark CSV.
