Interact with multiple files in Spark - Java code - java

I am new to Spark and am trying to implement a Spark program in Java. I just want to read multiple files from a folder and combine them, pairing word#filename as the key and its count as the value.
I don't know how to combine all the data together. I want the output to be pairs like
(word#filename,1)
ex:
(happy#file1,2)
(newyear#file1,1)
(newyear#file2,1)

Refer to the Spark Java documentation for input_file_name(): https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/functions.html#input_file_name()
and this answer: https://stackoverflow.com/a/36356253/8357778
With these you can add a column containing the filename to the DataFrame that stores your data. After that, you just select and transform the rows as you need.
If you prefer using an RDD, convert the DataFrame and map it.
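For illustration, here is a rough sketch of that approach with the Spark 1.6 Java API. The folder path, the whitespace tokenizer and the fact that input_file_name() returns the full path (not just "file1") are assumptions and caveats, not details from the question:
import static org.apache.spark.sql.functions.input_file_name;

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.sql.DataFrame;
import scala.Tuple2;

// One row per line of every file in the folder, plus a column with the source file.
DataFrame lines = sqlContext.read().text("path/to/folder")
        .withColumn("file", input_file_name());

// Emit (word#filename, 1) for every word, then sum the ones per key.
JavaPairRDD<String, Integer> counts = lines.javaRDD()
        .flatMapToPair(row -> {
            String file = row.getString(1);   // full path; strip the directory part if you only want "file1"
            List<Tuple2<String, Integer>> pairs = new ArrayList<>();
            for (String word : row.getString(0).split("\\s+")) {
                pairs.add(new Tuple2<>(word + "#" + file, 1));
            }
            return pairs;
        })
        .reduceByKey((a, b) -> a + b);
Collecting or saving counts then gives you pairs like (happy#file1,2).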

Related

How do you query a parquet file using parquet-mr?

I have a parquet file stored in AWS S3 that I want to query. I want to retrieve a certain row of data where a field equals a given value, almost like I would in SQL:
SELECT * FROM file.parquet WHERE id = '1234';
I am using parquet-mr to load it into memory directly from S3 and read it, and I have it set up with an AvroParquetReader to read the rows.
I've copied every row into a Map for easy querying for now; however, is there a better way to do this? The documentation for parquet-mr is not great, and most tutorials use deprecated methods.
Here is some example code of what I've got:
final ParquetReader<GenericRecord> reader = AvroParquetReader
        .<GenericRecord>builder(internalPath)
        .withConf(parquetConfiguration)
        .build();
You can use reader.read() to get the next row in the file (which is what I've used to put it into a HashMap), but I can't find any methods in parquet-mr that allow you to query a file without loading the entire file into memory.
The feature you are looking for is called predicate pushdown. You can read about it and find examples here.
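As a hedged sketch (not a tested recipe for your exact setup), pushing the id = '1234' predicate down with parquet-mr's filter2 API could look roughly like this; internalPath and parquetConfiguration are the variables from your snippet, the column name "id" comes from the SQL example, and everything else is an assumption:
import static org.apache.parquet.filter2.predicate.FilterApi.binaryColumn;
import static org.apache.parquet.filter2.predicate.FilterApi.eq;

import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.filter2.compat.FilterCompat;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.io.api.Binary;

// Build a predicate on the "id" column and hand it to the reader.
FilterCompat.Filter idFilter = FilterCompat.get(
        eq(binaryColumn("id"), Binary.fromString("1234")));

try (ParquetReader<GenericRecord> reader = AvroParquetReader
        .<GenericRecord>builder(internalPath)
        .withConf(parquetConfiguration)
        .withFilter(idFilter)   // row groups whose statistics rule out id == "1234" are skipped
        .build()) {
    GenericRecord record;
    while ((record = reader.read()) != null) {
        // only records matching the predicate reach this point
        System.out.println(record);
    }
}
This avoids materialising the whole file in a Map: parquet-mr skips whole row groups using the column statistics and filters the remaining records for you.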

Searching and storing values from CSV

I'm a Java beginner and want to learn how to read in files and store data in a way that makes it easy to manipulate.
I have a pretty big CSV file (18000 rows). The data represents the assortment of all the different beverages sold by a liquor shop. It consists of about 16 columns with headers like article number, name, producer, amount of alcohol, etc. The columns are separated by "\t".
I now want to do some searching in this file to find things like how many products are produced in Sweden, or the most expensive liqueur per liter.
Since I really want to learn how to program and not just find the answer, I'm not looking for any exact code here. I'm instead looking for the pseudo-code behind this, a good way of thinking when dealing with large sets of data, and the kinds of data structures that are best suited for the task.
Let's take the "How many products are from Sweden" example.
Since the data consists of strings, ints and floats, I can't put everything in a single list. What is the best way of storing it so it can later be manipulated? Or can I compute the answer as soon as the file is parsed, so that maybe I don't have to store it at all?
If you're new to Java and programming in general I'd recommend a library to help you view and use your data, without getting into databases and learning SQL. One that I've used in the past is Commons CSV.
https://commons.apache.org/proper/commons-csv/user-guide.html#Parsing_files
It lets you easily parse a whole CSV file into CSVRecord objects. For example:
import java.io.FileReader;
import org.apache.commons.csv.*;

// The first record is used as the header so columns can be fetched by name.
// For a tab-separated file, use CSVFormat.TDF or .withDelimiter('\t') instead of EXCEL.
FileReader in = new FileReader("path/to/file.csv");
Iterable<CSVRecord> records = CSVFormat.EXCEL.withFirstRecordAsHeader().parse(in);
for (CSVRecord record : records) {
    String lastName = record.get("Last Name");
    String firstName = record.get("First Name");
}
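As a rough sketch of the "How many products are from Sweden" example with Commons CSV, assuming the file is tab-separated, the first row holds the headers, and one of those headers is "Country" (the real header name in your file may differ):
import java.io.FileReader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

// TDF is the tab-delimited format; the first record is used as the header row.
int swedishProducts = 0;
try (FileReader in = new FileReader("path/to/beverages.csv")) {
    for (CSVRecord record : CSVFormat.TDF.withFirstRecordAsHeader().parse(in)) {
        if ("Sweden".equalsIgnoreCase(record.get("Country"))) {
            swedishProducts++;
        }
    }
}
System.out.println("Products from Sweden: " + swedishProducts);
The same loop shape works for the "most expensive per liter" question: keep a running maximum of price divided by volume instead of a counter.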
If you have a CSV file in particular, then you may use a database to store the data.
You can read about parsing CSV in Java using this link.
Make use of an ORM framework like Hibernate together with a Spring application; use this link to create the application.
With this you can write queries to fetch the data, like "How many products are from Sweden", and make use of the Collections framework. This link shows how to use HQL queries in the same application.
Create JSP pages to show the results in the UI.
It seems you are looking for an in-memory SQL engine over your CSV file. I would suggest using CQEngine, which provides indexed views on top of the Java collections framework with SQL-like queries.
You are basically treating a Java collection as a database table. Assuming that each CSV line maps to some POJO class like Beverage:
IndexedCollection<Beverage> table = new ConcurrentIndexedCollection<Beverage>();
table.addIndex(NavigableIndex.onAttribute(Beverage.BEVERAGE_ID));
table.add(new Beverage(...));
table.add(new Beverage(...));
table.add(new Beverage(...));
What you need to do now is read the CSV file, load it into the IndexedCollection, and then build proper indexes on the fields you query. After that, you can query the collection like a usual SQL database table. At the end, write the collection back out to a new CSV file (if you made any modifications).
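As a hedged sketch of the query step, assuming CQEngine 2.x and a getCountry() getter on the Beverage POJO (both are illustrative assumptions):
import static com.googlecode.cqengine.query.QueryFactory.equal;

import com.googlecode.cqengine.attribute.Attribute;
import com.googlecode.cqengine.attribute.SimpleAttribute;
import com.googlecode.cqengine.index.hash.HashIndex;
import com.googlecode.cqengine.query.option.QueryOptions;
import com.googlecode.cqengine.resultset.ResultSet;

// Hypothetical attribute exposing Beverage.getCountry() to CQEngine.
public static final Attribute<Beverage, String> COUNTRY =
        new SimpleAttribute<Beverage, String>("country") {
            public String getValue(Beverage beverage, QueryOptions queryOptions) {
                return beverage.getCountry();
            }
        };

// Index the attribute, then query it like a WHERE clause.
table.addIndex(HashIndex.onAttribute(COUNTRY));
ResultSet<Beverage> swedish = table.retrieve(equal(COUNTRY, "Sweden"));
System.out.println("Products from Sweden: " + swedish.size());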

Spark - Update the record (in parquet file) if already exists

I am writing a Spark job to read data from a JSON file and write it to a parquet file; below is the example code:
DataFrame dataFrame = sqlContext.read().json(textFile);
// note: "yyyy" (calendar year) and "HH" (24-hour clock), not "YYYY"/"hh"
dataFrame = dataFrame.withColumn("year", year(to_date(unix_timestamp(dataFrame.col("date"), "yyyy-MM-dd'T'HH:mm:ss.SSS").cast("timestamp"))));
dataFrame = dataFrame.withColumn("month", month(to_date(unix_timestamp(dataFrame.col("date"), "yyyy-MM-dd'T'HH:mm:ss.SSS").cast("timestamp"))));
dataFrame.write().mode(SaveMode.Append).partitionBy("year", "month").parquet("<some_path>");
The JSON file consists of many JSON records, and I want a record to be updated in the parquet output if it already exists. I have tried Append mode, but it seems to work at the file level rather than the record level (i.e. if the file already exists, it appends to it), so running this job for the same file duplicates the records.
Is there any way to designate a DataFrame row id as a unique key and ask Spark to update the record if it already exists? All the save modes seem to check the files, not the records.
Parquet is a file format rather than a database. In order to achieve an update by id, you will need to read the file, update the value in memory, then re-write the data to a new file (or overwrite the existing file).
You might be better served using a database if this is a use-case that will occur frequently.
You can have a look at the Apache ORC file format instead, see:
https://orc.apache.org/docs/acid.html
depending on your use case, or at HBase if you want to stay on top of HDFS.
But keep in mind that HDFS is a write-once file system; if that does not fit your need, go for something else (maybe Elasticsearch or MongoDB).
Otherwise, in HDFS you must create new files every time: set up an incremental process that builds a "delta" file and then merges OLD + DELTA = NEW_DATA, as in the sketch below.
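A hedged sketch of that merge with the 1.6 DataFrame API, assuming "id" is the unique key, the incoming batch already has the same columns in the same order as the existing data, and the result is written to a new path rather than the one being read:
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SaveMode;

DataFrame existing = sqlContext.read().parquet("<existing_path>");
DataFrame incoming = dataFrame;   // the new records prepared as in the question

// Keep only the existing rows whose id is not re-delivered in the new batch...
DataFrame newIds = incoming.select(incoming.col("id").as("new_id"));
DataFrame survivors = existing
        .join(newIds, existing.col("id").equalTo(newIds.col("new_id")), "left_outer")
        .where(newIds.col("new_id").isNull())
        .drop("new_id");

// ...then append the new batch and write everything out as the NEW data set.
DataFrame merged = survivors.unionAll(incoming);
merged.write().mode(SaveMode.Overwrite).partitionBy("year", "month").parquet("<new_path>");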

Apache Spark - Converting JavaRDD to DataFrame and vice versa, any performance degradation?

I am creating a JavaRDD<Model> by reading a text file and mapping each line to the Model class properties.
Then I am converting the JavaRDD<Model> to a DataFrame using sqlContext:
DataFrame fileDF = sqlContext.createDataFrame(javaRDD, Model.class);
Basically, we are trying to use the DataFrame API to improve performance and make the code easier to write.
Is there any performance degradation, or will it create the Model objects again when converting the DataFrame back to a JavaRDD?
The reason I am doing this is that I don't see any method to read a text file directly using sqlContext.
Is there any alternate, more efficient way to do this?
Will it be slower?
There will definitely be some overhead, although I did not benchmark how much. Why? Because createDataFrame has to:
use reflection to get the schema for the DataFrame (once for the whole RDD)
map each entity in the RDD to a row record (so it fits the DataFrame format) - N times, once per entity in the RDD
create the actual DataFrame object.
Will it matter?
I doubt it. The reflection will be really fast as it's just one object and you probably have only a handful of fields there.
Will the transformation be slow? Again, probably not, as you have only a few fields per record to iterate through.
Alternatives
But if you are not using that RDD for anything else, you have a few options in the DataFrameReader class, which can be accessed through SQLContext.read():
json: several methods here
parquet: here
text: here
The good thing about the first two is that you get an actual schema. With the last one, you pass the path to the file (as with the other two methods), but since the format is not specified, Spark has no information about the schema -> each line in the file is treated as a new row in the DataFrame, with a single column, value, which contains the whole line.
If you have a text file in a format that would allow creating a schema, for instance CSV, you can try using a third party library such as Spark CSV.
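For example, with the external spark-csv package (com.databricks:spark-csv) the read could look roughly like this; the options and path are illustrative:
import org.apache.spark.sql.DataFrame;

// Requires the com.databricks:spark-csv package on the classpath.
DataFrame df = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("header", "true")        // first line holds the column names
        .option("inferSchema", "true")   // sample the data to guess the column types
        .load("path/to/file.csv");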

CSV to RDD to Cassandra store in Apache Spark

I have a bunch of data in a CSV file which I need to store into Cassandra through Spark.
I'm using the Spark-to-Cassandra connector for this.
Normally, to store into Cassandra, I create a POJO, parallelize it into an RDD and then store it:
Employee emp = new Employee(1, "Mr", "X");
JavaRDD<Employee> empRdd = sparkContext.parallelize(Arrays.asList(emp));
Finally, I write this to Cassandra as:
CassandraJavaUtil.javaFunctions(empRdd, Employee.class).saveToCassandra("dev", "emp");
This is fine, but my data is stored in a CSV file where every line represents a row in the Cassandra table.
I know I can read each line, split the columns, create an object from the column values, add it to a list, and then finally parallelize the entire list. I was wondering if there is an easier, more direct way to do this?
Well, you could just use the sstableloader for bulk loading and avoid Spark altogether.
If you rely on Spark then I think you're out of luck... although I am not sure how much easier than reading line by line and splitting the lines it could get; a rough sketch of that approach is below.
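If you do stay with Spark, a hedged sketch of that line-by-line approach (the comma separator, the id/title/name column order and the variable names are assumptions) would be:
import org.apache.spark.api.java.JavaRDD;

// Parse each CSV line into an Employee, then save the whole RDD in one call.
JavaRDD<Employee> empRdd = sparkContext.textFile("path/to/employees.csv")
        .map(line -> {
            String[] cols = line.split(",");
            return new Employee(Integer.parseInt(cols[0]), cols[1], cols[2]);
        });

// Same CassandraJavaUtil call as in your snippet (the exact API depends on the connector version).
CassandraJavaUtil.javaFunctions(empRdd, Employee.class).saveToCassandra("dev", "emp");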
