I'm new to Spark, and I'm loading a huge CSV file into a DataFrame using the code below:
Dataset<Row> df = sqlContext.read().format("com.databricks.spark.csv").schema(customSchema)
.option("delimiter", "|").option("header", true).load(inputDataPath);
After loading the CSV data into the DataFrame, I want to iterate through each row and, based on some columns, query a PostgreSQL DB (performing some geometry operation). Later I want to merge some of the fields returned from the DB with the DataFrame records. What's the best way to do this, considering the huge number of records?
Any help appreciated. I'm using Java.
As @mck also pointed out: the best way is to use a join.
With Spark you can read an external JDBC table using the DataFrame API,
for example:
val props = Map(....)
spark.read.format("jdbc").options(props).load()
See the DataFrameReader scaladoc for more options and for which properties and values you need to set.
Then use a join to merge the fields.
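Since you mentioned you're using Java, here's a minimal sketch of the same idea with the Java API; the connection details, table name, and join column ("id") are placeholders you'd replace with your own:

// Read the PostgreSQL table through JDBC (placeholder connection details;
// the PostgreSQL JDBC driver must be on the classpath).
Dataset<Row> pgDf = sqlContext.read()
        .format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/mydb")
        .option("dbtable", "public.geometry_table")
        .option("user", "dbuser")
        .option("password", "dbpassword")
        .load();

// Merge the fields returned from PostgreSQL into the CSV DataFrame with a
// join instead of querying the database row by row.
Dataset<Row> merged = df.join(pgDf, df.col("id").equalTo(pgDf.col("id")), "left");

This way Spark distributes the join itself instead of you issuing one query per row, which matters with a huge number of records.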
I'm a Java beginner and want to learn how to read in files and store data in a way that makes it easy to manipulate.
I have a pretty big CSV file (18,000 rows). The data represents the assortment of all the different beverages sold by a liqueur shop. It consists of 16 or so columns with headers like "article number", "name", "producer", "amount of alcohol", etc. The columns are separated by "\t".
I now want to do some searching in this file to find things like how many products are produced in Sweden and which is the most expensive liqueur per liter.
Since I really want to learn how to program and not just find the answer, I'm not looking for any exact code here. I'm instead looking for the pseudocode behind this, a good way of thinking when dealing with large sets of data, and what kinds of data structures are best suited for the task.
Let's take the "How many products are from Sweden" example.
Since the data consists of strings, ints, and floats, I can't put everything in a single list. What is the best way of storing it so it can be manipulated later? Or can I compute the answer as soon as it's parsed, so maybe I don't have to store it at all?
If you're new to Java and programming in general, I'd recommend a library to help you view and use your data without getting into databases and learning SQL. One that I've used in the past is Commons CSV.
https://commons.apache.org/proper/commons-csv/user-guide.html#Parsing_files
It lets you easily parse a whole CSV file into CSVRecord objects. For example:
import java.io.FileReader;
import java.io.Reader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

Reader in = new FileReader("path/to/file.csv");
// withFirstRecordAsHeader() lets you look up columns by their header names.
Iterable<CSVRecord> records = CSVFormat.EXCEL.withFirstRecordAsHeader().parse(in);
for (CSVRecord record : records) {
    String lastName = record.get("Last Name");
    String firstName = record.get("First Name");
}
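For the tab-separated file in the question, a rough sketch of the "How many products are from Sweden" count might look like this; the "Country" header name is a guess, so use whatever your file's header actually is:

// Tab-delimited format, using the first row as the header.
CSVFormat tsv = CSVFormat.DEFAULT.withDelimiter('\t').withFirstRecordAsHeader();

int swedishProducts = 0;
Reader in = new FileReader("path/to/products.csv");
for (CSVRecord record : tsv.parse(in)) {
    // "Country" is an assumed header name for the country of origin.
    if ("Sweden".equalsIgnoreCase(record.get("Country"))) {
        swedishProducts++;
    }
}
System.out.println("Products from Sweden: " + swedishProducts);

Since you only need a running count, you don't have to keep all the records in memory; you can process each record as it is parsed and throw it away.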
If you have a CSV file in particular, then you may use a database to store the data.
You can go through how to read a CSV in Java using the linked tutorial.
Make use of an ORM framework like Hibernate together with a Spring application; use the linked guide to create the application.
With this you can write queries to fetch the data, such as "How many products are from Sweden", and make use of the Collections framework. The last link shows how to use HQL queries in the same application.
Create JSP pages to show the results in the UI.
Sorry for my English.
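As a rough illustration of the HQL side of this suggestion, a count query could look like the sketch below; the Beverage entity, its country field, and the open Hibernate session are assumptions about how the CSV columns would be mapped:

// Assumes a Beverage entity mapped with Hibernate that has a String "country" field.
Long swedishCount = session
        .createQuery("select count(b) from Beverage b where b.country = :country", Long.class)
        .setParameter("country", "Sweden")
        .getSingleResult();
System.out.println("Products from Sweden: " + swedishCount);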
It seems you are looking for an in-memory SQL-like engine over your CSV file. I would suggest using CQEngine, which provides indexed views on top of the Java Collections framework with SQL-like queries.
You are basically treating a Java collection as a database table. Assuming that each CSV line maps to some POJO class like Beverage:
// A thread-safe collection that can be indexed and queried like a table.
IndexedCollection<Beverage> table = new ConcurrentIndexedCollection<Beverage>();
// Build an index on an attribute defined on the Beverage POJO.
table.addIndex(NavigableIndex.onAttribute(Beverage.BEVERAGE_ID));
table.add(new Beverage(...));
table.add(new Beverage(...));
table.add(new Beverage(...));
What you need to do now is read the CSV file, load it into the IndexedCollection, and build proper indexes on the fields you care about. After that, you can query the table much like a usual SQL database. At the end, serialize the collection back to a new CSV file (if you made any modifications).
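For example, the "How many products are from Sweden" question could then be answered with a query like the sketch below; COUNTRY is a hypothetical CQEngine attribute you would define on Beverage alongside BEVERAGE_ID:

import static com.googlecode.cqengine.query.QueryFactory.equal;
import com.googlecode.cqengine.resultset.ResultSet;

// Count the beverages whose (hypothetical) COUNTRY attribute equals "Sweden".
ResultSet<Beverage> swedish = table.retrieve(equal(Beverage.COUNTRY, "Sweden"));
System.out.println("Products from Sweden: " + swedish.size());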
How do I read from Hive without MapReduce? I am trying to read a column from a table created in Hive, but I don't want the overhead that comes with MapReduce. Basically, I want to retrieve the values from a table created in Hive without that overhead and get them the fastest way possible.
Instead of MapReduce, you can use Tez or Spark as your execution engine in Hive.
See hive.execution.engine in Hive Configuration Properties.
There are also quite a few SQL engines compatible with the Hive metastore, e.g. Presto, Spark SQL, and Impala.
Generally, if you do a SELECT * FROM on a table in Hive, MapReduce won't run.
In your case, where you are selecting just one column from a Hive table, MapReduce also won't run (simple queries like this are served by a fetch task instead of a MapReduce job).
Alternatively, you can create a sub-table from the main table with only the needed columns and rows, and just do a SELECT * on that table.
I've been upgrading a Java Spark project from using txt file input to reading from MongoDB. My question is: can we query just the data we need? For example, I have millions of records, and I want to get only the records from the beginning of this week and start processing them.
Looking at the MongoDB documentation, the examples all start like this:
// Create a JavaSparkContext using the SparkSession's SparkContext object
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
// Load data and infer schema, disregard toDF() name as it returns Dataset
Dataset<Row> implicitDS = MongoSpark.load(jsc).toDF();
Basically, MongoSpark loads the whole collection into the context and then transforms it into a DataFrame, which means that even if I only need 1,000 records from this week, the program still has to fetch the whole 1 million records before doing anything else.
I wonder if there is a way to pass the query directly to MongoSpark instead of doing this?
Thank you.
A DataFrame, or even an RDD, represents a lazy collection, so doing:
Dataset<Row> implicitDS = MongoSpark.load(jsc).toDF();
will not cause any computation to happen inside Spark, and no data will be requested from MongoDB.
Only when you perform an action will Spark request data to be processed. At that stage the Mongo Spark Connector will partition the data you have requested and return the partition information to the Spark driver. The Spark driver will allocate tasks to the Spark workers, and each worker will ask the Mongo Spark Connector for its relevant partition.
One of the nice features of DataFrames / Datasets is that when you use filters, the underlying Mongo Spark Connector code constructs an aggregation pipeline to filter the data in MongoDB before sending it to Spark. This means that not all the data is sent across the wire, just the data you need.
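As a rough Java sketch of that behaviour (the field name "createdAt" and the date literal are placeholders for however your documents record the timestamp):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;

// The filter is pushed down by the connector as an aggregation $match stage,
// so only this week's documents are transferred from MongoDB to Spark.
Dataset<Row> thisWeek = MongoSpark.load(jsc).toDF()
        .filter(col("createdAt").geq(lit(java.sql.Timestamp.valueOf("2017-05-01 00:00:00"))));

thisWeek.show();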
Things to be aware of: make sure you are using the latest Mongo Spark Connector. There is also a ticket to push the filters down into the partitioning logic as well, potentially reducing the number of empty partitions and providing further speed-ups.
I am trying to move all my data from one column family (table) to another. Since both tables have different descriptions, I would have to pull all the data from table 1, create a new object for table 2, and then do a bulk async insert. Table 1 has millions of records, so I cannot get all the data directly into my data structures and work it out there. I am looking for solutions to do this easily using Spring Data Cassandra with Java.
I initially planned to move all the data to a temp table first, followed by creating some composite key relations and then querying back into my master table. However, that doesn't seem favorable to me. Can anyone suggest a good strategy to do this? Any leads would be appreciated. Thanks!
Table 1 has millions of records, so I cannot get all the data directly into my data structures and work it out there.
With the DataStax Java driver you can get all the data by token ranges and work on each token range separately. For example:
Set<TokenRange> tokenRanges = session.getCluster().getMetadata().getTokenRanges();

for (TokenRange tr : tokenRanges) {
    List<Row> rows = new ArrayList<>();
    // A range that wraps around the ring is unwrapped into contiguous sub-ranges.
    for (TokenRange sub : tr.unwrap()) {
        String query = "SELECT * FROM keyspace.table WHERE token(pk) > ? AND token(pk) <= ?";
        SimpleStatement st = new SimpleStatement(query, sub.getStart(), sub.getEnd());
        rows.addAll(session.execute(st).all());
    }
    transformAndWriteToNewTable(rows);
}
Each token range contains only a piece of the data and can be handled by a single machine. You can process the token ranges independently (in parallel or asynchronously) to get more throughput.
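A rough sketch of the parallel variant, assuming the Session is shared between threads (it is thread-safe in the DataStax driver) and transformAndWriteToNewTable is the method from above:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

ExecutorService pool = Executors.newFixedThreadPool(8);
List<Future<?>> futures = new ArrayList<>();

for (TokenRange tr : session.getCluster().getMetadata().getTokenRanges()) {
    futures.add(pool.submit(() -> {
        List<Row> rows = new ArrayList<>();
        for (TokenRange sub : tr.unwrap()) {
            SimpleStatement st = new SimpleStatement(
                    "SELECT * FROM keyspace.table WHERE token(pk) > ? AND token(pk) <= ?",
                    sub.getStart(), sub.getEnd());
            rows.addAll(session.execute(st).all());
        }
        transformAndWriteToNewTable(rows);
    }));
}

for (Future<?> f : futures) {
    try {
        f.get(); // wait for each range; surfaces any failure from the task
    } catch (InterruptedException | ExecutionException e) {
        throw new RuntimeException(e);
    }
}
pool.shutdown();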
You could use Apache Spark Streaming. Technically, you would read the data from the first table, do an on-the-fly transformation, and write to the second table.
Note that I prefer the Spark Scala API, as it is more elegant and the streaming job code would be more concise. But if you want to do it in pure Java, that's your choice.
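Since the question is about Java, here is a rough batch-style sketch with the Spark Cassandra Connector (the same read-transform-write shape applies in a streaming job); the keyspace, table, and entity names, as well as the fromTable1 conversion method, are placeholders:

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapRowTo;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

import org.apache.spark.api.java.JavaRDD;

// Read table1 into POJOs, transform each row, and write the result to table2.
JavaRDD<Table2Entity> migrated = javaFunctions(jsc)
        .cassandraTable("my_keyspace", "table1", mapRowTo(Table1Entity.class))
        .map(t1 -> Table2Entity.fromTable1(t1));   // hypothetical conversion method

javaFunctions(migrated)
        .writerBuilder("my_keyspace", "table2", mapToRow(Table2Entity.class))
        .saveToCassandra();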
I have a bunch of data in a CSV file which I need to store in Cassandra through Spark.
I'm using the Spark Cassandra connector for this.
Normally, to store data in Cassandra, I create a POJO, parallelize it into an RDD, and then store it:
Employee emp = new Employee(1, "Mr", "X");
JavaRDD<Employee> empRdd = jsc.parallelize(Collections.singletonList(emp));
Finally I write this to Cassandra as:
CassandraJavaUtil.javaFunctions(empRdd, Employee.class).saveToCassandra("dev", "emp");
This is fine, but my data is stored in a CSV file, where every line represents a row in the Cassandra table.
I know I can read each line, split the columns, create an object from the column values, add it to a list, and then finally parallelize the entire list. I was wondering if there is an easier, more direct way to do this?
Well, you could just use the sstableloader for bulk loading and avoid Spark altogether.
If you rely on Spark, then I think you're out of luck... although I'm not sure how much easier it could get than reading the file line by line and splitting the lines...
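For reference, the read-split-save approach is only a few lines with the Java API; the column order and the Employee constructor are assumptions, and the save call mirrors the one in the question:

// Read the CSV, map each line to an Employee, and save the RDD to Cassandra.
JavaRDD<Employee> empRdd = jsc.textFile("path/to/employees.csv")
        .map(line -> {
            String[] cols = line.split(",");
            return new Employee(Integer.parseInt(cols[0]), cols[1], cols[2]);
        });

CassandraJavaUtil.javaFunctions(empRdd, Employee.class).saveToCassandra("dev", "emp");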