I want to retrieve rows from Apache Hive via Apache Spark and put each row into an Aerospike cache.
Here is a simple case.
var dataset = session.sql("select * from employee");
final var aerospikeClient = aerospike; // to remove binding between lambda and the service class itself
dataset.foreach(row -> {
    var key = new Key("namespace", "set", randomUUID().toString());
    aerospikeClient.add(
        key,
        new Bin("json-repr", row.json())
    );
});
I get an error:
Caused by: java.io.NotSerializableException: com.aerospike.client.reactor.AerospikeReactorClient
Obviously I can't make AerospikeReactorClient serializable. I tried dataset.collectAsList() and that did work. But as far as I understand, this method loads all the content onto one node, and there might be an enormous amount of data, so it's not an option.
What are the best practices to deal with such problems?
You can write directly from a DataFrame; there is no need to loop through the dataset.
Launch the Spark shell and import the com.aerospike.spark.sql._ package:
$ spark-shell
scala> import com.aerospike.spark.sql._
import com.aerospike.spark.sql._
Example of writing data into Aerospike:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import scala.collection.mutable.ArrayBuffer

val TEST_COUNT = 100
val simpleSchema: StructType = new StructType(
  Array(
    StructField("one", IntegerType, nullable = false),
    StructField("two", StringType, nullable = false),
    StructField("three", DoubleType, nullable = false)
  ))
val simpleDF = {
  val inputBuf = new ArrayBuffer[Row]()
  for (i <- 1 to TEST_COUNT) {
    val one = i
    val two = "two:" + i
    val three = i.toDouble
    val r = Row(one, two, three)
    inputBuf.append(r)
  }
  val inputRDD = spark.sparkContext.parallelize(inputBuf.toSeq)
  spark.createDataFrame(inputRDD, simpleSchema)
}
//Write the Sample Data to Aerospike
simpleDF.write
  .format("aerospike")                          // Aerospike-specific format
  .option("aerospike.writeset", "spark-test")   // write to this set
  .option("aerospike.updateByKey", "one")       // column used to construct the primary key
  .option("aerospike.write.mode", "update")
  .save()
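Applied to the Hive table from the question, the same connector write looks roughly like this from the Java API (a sketch; the key column name "id" is an assumption about the employee table):
// session is the SparkSession already used in the question
Dataset<Row> employees = session.sql("select * from employee");
employees.write()
    .format("aerospike")
    .option("aerospike.writeset", "spark-test")
    .option("aerospike.updateByKey", "id")      // assumed key column in employee
    .option("aerospike.write.mode", "update")
    .save();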
I managed to overcome this issue by creating the AerospikeClient manually inside the foreach lambda:
var dataset = session.sql("select * from employee");
dataset.foreach(row -> {
    var key = new Key("namespace", "set", randomUUID().toString());
    newAerospikeClient(aerospikeProperties).add(
        key,
        new Bin("json-repr", row.json())
    );
});
Now I only have to declare AerospikeProperties as Serializable.
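Creating a client per row is expensive, so an alternative is to open one client per partition with foreachPartition; a minimal sketch, reusing the hypothetical newAerospikeClient(aerospikeProperties) factory from above:
// org.apache.spark.api.java.function.ForeachPartitionFunction
dataset.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
    // one client per partition instead of one per row
    var client = newAerospikeClient(aerospikeProperties);
    try {
        while (rows.hasNext()) {
            var row = rows.next();
            var key = new Key("namespace", "set", randomUUID().toString());
            client.add(key, new Bin("json-repr", row.json()));
        }
    } finally {
        client.close();
    }
});
This keeps the lambda serializable (only aerospikeProperties is captured) while amortizing the connection cost over a whole partition.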
I have a table with the following fields:
CREATE TABLE app_category_agg (
    category text,
    app_count int,
    sp_count int,
    subscriber_count int,
    window_revenue bigint,
    top_apps frozen <list<map<text,int>>>,
    PRIMARY KEY (category)
);
When I try to map it to a Kotlin model:
#Table("app_category_agg")
class AppCategoryAggData {
#PrimaryKeyColumn(name = "category", ordinal = 0, type = PrimaryKeyType.PARTITIONED)
lateinit var category: String
#Column("app_count")
var appCount: Int = 0
#Column("sp_count")
var spCount: Int = 0
#Column("subscriber_count")
var subscriberCount: Int = 0
#Column("window_revenue")
var windowRevenue: Long = 0
#Column("top_apps")
var topApps: List<Any> = arrayListOf()
}
interface AppCategoryAggRepository : CassandraRepository<AppCategoryAggData, String> {
    @Query(value = "SELECT * FROM analytics_info.app_category_agg")
    fun findAllAppCategoryAggData(): List<AppCategoryAggData>
}
I get this error:
Query; CQL [SELECT * FROM analytics_info.app_category_agg]; Codec not found for requested operation: [map<varchar, int> <-> java.util.Map]; nested exception is com.datastax.driver.core.exceptions.CodecNotFoundException: Codec not found for requested operation: [map<varchar, int> <-> java.util.Map]
How can I resolve it? I read about writing custom codecs, but it's not very clear to me.
I've created a table with your structure, and populated it with sample data:
insert into app_category_agg (category, app_count, sp_count, subscriber_count, window_revenue, top_apps)
values('test', 2, 1, 10, 100, [{'t1':1, 't2':2}]);
With the object mapper from Java driver 3.x, the following code works.
Class declaration:
import com.datastax.driver.mapping.MappingManager
import com.datastax.driver.mapping.annotations.Column
import com.datastax.driver.mapping.annotations.PartitionKey
import com.datastax.driver.mapping.annotations.Table
#Table(keyspace = "test", name = "app_category_agg")
class AppCategoryAggData {
#PartitionKey
lateinit var category: String
#Column(name = "app_count")
var appCount: Int = 0
#Column(name = "sp_count")
var spCount: Int = 0
#Column(name = "subscriber_count")
var subscriberCount: Int = 0
#Column(name = "window_revenue")
var windowRevenue: Long = 0
#Column(name = "top_apps")
var topApps: List<Map<String, Int>> = emptyList()
override fun toString(): String {
return "AppCategoryAggData(category='$category', appCount=$appCount, spCount=$spCount, subscriberCount=$subscriberCount, windowRevenue=$windowRevenue, topApps=$topApps)"
}
}
Main function: it first inserts data from Kotlin code, and then reads the row that I pre-inserted:
import com.datastax.driver.core.Cluster

object KtTestObjMapper {
    @JvmStatic
    fun main(args: Array<String>) {
        val cluster = Cluster.builder()
            .addContactPoint("10.101.34.176")
            .build()
        val session = cluster.connect()

        val manager = MappingManager(session)
        val mapper = manager.mapper(AppCategoryAggData::class.java)

        // insert a row from Kotlin
        val appObj = AppCategoryAggData()
        appObj.category = "kotlin"
        appObj.appCount = 5
        appObj.spCount = 10
        appObj.subscriberCount = 50
        appObj.windowRevenue = 10000
        appObj.topApps = listOf(mapOf("t2" to 2))
        mapper.save(appObj)

        // read back the pre-inserted row
        val obj2 = mapper.get("test")
        println("Object from =$obj2")

        session.close()
        cluster.close()
    }
}
When I run this code, I receive the following output:
Object from =AppCategoryAggData(category='test', appCount=2, spCount=1, subscriberCount=10, windowRevenue=100, topApps=[{t1=1, t2=2}])
and when I select the data using cqlsh, I can see that the row was inserted from Kotlin:
cqlsh:test> SELECT * from app_category_agg ;
category | app_count | sp_count | subscriber_count | top_apps | window_revenue
----------+-----------+----------+------------------+----------------------+----------------
test | 2 | 1 | 10 | [{'t1': 1, 't2': 2}] | 100
kotlin | 5 | 10 | 50 | [{'t2': 2}] | 10000
(2 rows)
The full code is in my repository. The one drawback of this solution is that it's based on Java driver 3.x, which is the previous major release of the driver. If you don't have a strict requirement for it, it's recommended to use the latest major release, 4.x, which works with both Cassandra and DSE and has a lot of new functionality.
The object mapper in the new version works differently: instead of runtime annotations, compile-time annotation processing is used, so the code looks different and the build needs to be configured to run the annotation processor. This can be more complicated than with driver 3.x, but the code itself can be simpler (full code is here).
We need to define a data class (entity):
@Entity
@CqlName("app_category_agg")
data class AppCategoryAggData(
    @PartitionKey var category: String,
    @CqlName("app_count") var appCount: Int? = null,
    @CqlName("sp_count") var spCount: Int? = null,
    @CqlName("subscriber_count") var subscriberCount: Int? = null,
    @CqlName("window_revenue") var windowRevenue: Long? = null,
    @CqlName("top_apps") var topApps: List<Map<String, Int>>? = null
) {
    constructor() : this("")
}
Define the DAO interface with 2 operations (insert and findByCategory):
@Dao
interface AppCategoryAggDao {
    @Insert
    fun insert(appCatAgg: AppCategoryAggData)

    @Select
    fun findByCategory(appCat: String): AppCategoryAggData?
}
Define the Mapper to obtain the DAO:
@Mapper
interface AppCategoryMapper {
    @DaoFactory
    fun appCategoryDao(@DaoKeyspace keyspace: CqlIdentifier?): AppCategoryAggDao?
}
And use it:
import com.datastax.oss.driver.api.core.CqlIdentifier
import com.datastax.oss.driver.api.core.CqlSession
import java.net.InetSocketAddress

object KtTestObjMapper {
    @JvmStatic
    fun main(args: Array<String>) {
        val session = CqlSession.builder()
            .addContactPoint(InetSocketAddress("10.101.34.176", 9042))
            .build()

        // get mapper - please note that we need to use AppCategoryMapperBuilder
        // that is generated by the annotation processor
        val mapper: AppCategoryMapper = AppCategoryMapperBuilder(session).build()
        val dao: AppCategoryAggDao? = mapper.appCategoryDao(CqlIdentifier.fromCql("test"))

        val appObj = AppCategoryAggData("kotlin2",
            10, 11, 12, 34,
            listOf(mapOf("t2" to 2)))
        dao?.insert(appObj)

        val obj2 = dao?.findByCategory("test")
        println("Object from =$obj2")

        session.close()
    }
}
The change compared to Java is that we need to use the generated class AppCategoryMapperBuilder to obtain the instance of AppCategoryMapper:
val mapper: AppCategoryMapper = AppCategoryMapperBuilder(session).build()
I have a spring-data-mongodb application in Java or Kotlin, and I need to create a full-text search request to MongoDB via the Spring template.
In the mongo shell it looks like this:
db.stores.find(
{ $text: { $search: "java coffee shop" } },
{ score: { $meta: "textScore" } }
).sort( { score: { $meta: "textScore" } } )
I have already tried something, but it is not exactly what I need:
override fun getSearchedFiles(searchQuery: String, pageNumber: Long, pageSize: Long, direction: Sort.Direction, sortColumn: String): MutableList<SystemFile> {
    val matching = TextCriteria.forDefaultLanguage().matching(searchQuery)
    val match = MatchOperation(matching)
    val sort = SortOperation(Sort(direction, sortColumn))
    val skip = SkipOperation((pageNumber * pageSize))
    val limit = LimitOperation(pageSize)

    val aggregation = Aggregation
        .newAggregation(match, skip, limit)
        .withOptions(Aggregation.newAggregationOptions().allowDiskUse(true).build())
    val mappedResults = template.aggregate(aggregation, "files", SystemFile::class.java).mappedResults
    return mappedResults
}
Maybe someone has already worked with text search on MongoDB from Java; please share your knowledge with us.
Set up text indexes
First you need to set up text indexes on the fields on which you want to perform your text query.
If you are using Spring Data Mongo to insert your documents into your database, you can use the @TextIndexed annotation and the indexes will be built while inserting your documents.
@Document
class MyObject {
    @TextIndexed(weight = 3) String title;
    @TextIndexed String description;
}
If your documents are already inserted in your database, you need to build your text indexes manually:
TextIndexDefinition textIndex = new TextIndexDefinitionBuilder()
    .onField("title", 3F)
    .onField("description")
    .build();
After building and configuring your mongoTemplate, you can create the text index:
template.indexOps(MyObject.class).ensureIndex(textIndex);
Building your text query
List<MyObject> getSearchedFiles(String textQuery) {
    TextQuery query = TextQuery.queryText(new TextCriteria().matchingAny(textQuery)).sortByScore();
    List<MyObject> result = mongoTemplate.find(query, MyObject.class, "myCollection");
    return result;
}
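If you also need the paging and sorting from the original method signature, TextQuery accepts a Pageable; a sketch, assuming Spring Data MongoDB 2.x and the SystemFile class and "files" collection from the question:
// imports: org.springframework.data.domain.PageRequest,
//          org.springframework.data.mongodb.core.query.TextCriteria,
//          org.springframework.data.mongodb.core.query.TextQuery
List<SystemFile> getSearchedFiles(String searchQuery, int pageNumber, int pageSize) {
    TextQuery query = TextQuery.queryText(TextCriteria.forDefaultLanguage().matching(searchQuery))
            .sortByScore();
    query.with(PageRequest.of(pageNumber, pageSize));   // skip/limit in one call
    return mongoTemplate.find(query, SystemFile.class, "files");
}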
Imagine a simple test:
@Test
public void testIfColumnHasMentionsInPrimaryKeys() {
    List<Row> data = Arrays.asList(
        RowFactory.create("ID, ID1"),
        RowFactory.create("ID,COLUMN_UNDERSCORE_1"),
        RowFactory.create("ID1, ID2")
    );
    StructType schema = new StructType(new StructField[]{
        new StructField("COLUMN", DataTypes.StringType, false, Metadata.empty())
    });
    Dataset<Row> rows = spark.createDataFrame(data, schema);

    Set<String> primaryKeys = new HashSet<>();
    primaryKeys.add("ID1");
    primaryKeys.add("ID2");

    RegexTokenizer regexTokenizer = new RegexTokenizer().setInputCol("COLUMN")
        .setOutputCol("COLUMN_AS_LIST").setToLowercase(false).setPattern("\\s*,\\s*");

    Dataset<Row> transformedRows = regexTokenizer.transform(rows);
    transformedRows.show();
}
The question is: what to do next in order to check whether at least one value in COLUMN_AS_LIST is mentioned in primaryKeys. Column.isin() works for a single value in the column, and Column.contains() for a single value as an argument, while I need to filter based on the intersection (if any). Any ideas, please?
UPDATE: There is an option like:
Dataset<Row> filter = transformedRows.filter(e -> e.getList(1).stream().anyMatch(primaryKeys::contains));
But it looks ugly, doesn't it?
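If your Spark version is 2.4 or newer, the built-in arrays_overlap function can express the same check without a Java lambda; a sketch against the transformedRows and primaryKeys from the test above:
// import org.apache.spark.sql.Column;
// import static org.apache.spark.sql.functions.*;
Column[] pkLiterals = primaryKeys.stream()
        .map(org.apache.spark.sql.functions::lit)
        .toArray(Column[]::new);
Dataset<Row> matching = transformedRows
        .filter(arrays_overlap(col("COLUMN_AS_LIST"), array(pkLiterals)));
matching.show();
This keeps the predicate as a Column expression, so Catalyst can optimize it instead of deserializing every row for the lambda.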
I've attempted to merge multiple JavaRDDs, but only 2 get merged; can someone kindly help? I've been struggling with this for a while, but overall I would like to be able to read multiple collections and use sqlContext to group and print out all the results.
Here is my code:
JavaRDD<AppLog> logs = mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.ppa_logs").union(
    mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.fav_logs").union(
        mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.pps_logs").union(
            mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.dd_logs").union(
                mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.ppt_logs")
            )
        )
    )
);
public JavaRDD<AppLog> mapCollection(JavaSparkContext sc, String uri) {
    Configuration mongodbConfig = new Configuration();
    mongodbConfig.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
    mongodbConfig.set("mongo.input.uri", uri);

    JavaPairRDD<Object, BSONObject> documents = sc.newAPIHadoopRDD(
        mongodbConfig,          // Configuration
        MongoInputFormat.class, // InputFormat: read from a live cluster
        Object.class,           // Key class
        BSONObject.class        // Value class
    );

    return documents.map(
        new Function<Tuple2<Object, BSONObject>, AppLog>() {
            public AppLog call(final Tuple2<Object, BSONObject> tuple) {
                AppLog log = new AppLog();
                BSONObject header = (BSONObject) tuple._2();

                log.setTarget((String) header.get("target"));
                log.setAction((String) header.get("action"));
                return log;
            }
        }
    );
}
// printing the collections
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
DataFrame logsSchema = sqlContext.createDataFrame(logs, AppLog.class);
logsSchema.registerTempTable("logs");
DataFrame groupedMessages = sqlContext.sql(
"select * from logs");
// "select target, action, Count(*) from logs group by target, action");
// "SELECT to, body FROM messages WHERE to = \"eric.bass#enron.com\"");
groupedMessages.show();
logsSchema.printSchema();
If you want to merge multiple JavaRDDs, simply use sc.union(rdd1, rdd2, ...) instead of rdd1.union(rdd2).union(rdd3).
Also check this: RDD.union vs SparkContext.union.
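For example, the nested calls from the question can be flattened into a single call (a sketch; the varargs union overload is available on JavaSparkContext in Spark 1.x/2.x):
JavaRDD<AppLog> logs = sc.union(
        mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.ppa_logs"),
        mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.fav_logs"),
        mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.pps_logs"),
        mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.dd_logs"),
        mapCollection(sc, "mongodb://hadoopUser:Pocup1ne9@localhost:27017/hbdata.ppt_logs")
);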
I'm using Spark SQL 1.6.2 (Java API) and I have to process the following DataFrame that has a list of values in 2 columns:
ID AttributeName AttributeValue
0 [an1,an2,an3] [av1,av2,av3]
1 [bn1,bn2] [bv1,bv2]
The desired table is:
ID AttributeName AttributeValue
0 an1 av1
0 an2 av2
0 an3 av3
1 bn1 bv1
1 bn2 bv2
I think I have to use a combination of the explode function and a custom UDF function.
I found the following resources:
Explode (transpose?) multiple columns in Spark SQL table
How do I call a UDF on a Spark DataFrame using JAVA?
and I can successfully run an example that reads the two columns and returns the concatenation of the first two strings in a column:
UDF2 combineUDF = new UDF2<Seq<String>, Seq<String>, String>() {
    public String call(final Seq<String> col1, final Seq<String> col2) throws Exception {
        return col1.apply(0) + col2.apply(0);
    }
};

context.udf().register("combineUDF", combineUDF, DataTypes.StringType);
The problem is how to write the signature of a UDF returning two columns (in Java).
As far as I understand, I must define a new StructType like the one shown below and set that as the return type, but so far I haven't managed to get the final code working:
StructType retSchema = new StructType(new StructField[]{
    new StructField("#AttName", DataTypes.StringType, true, Metadata.empty()),
    new StructField("#AttValue", DataTypes.StringType, true, Metadata.empty()),
});

context.udf().register("combineUDF", combineUDF, retSchema);
Any help will be really appreciated.
UPDATE: I'm trying to implement first the zip(AttributeName, AttributeValue), so that afterwards I will only need to apply the standard explode function in Spark SQL:
ID AttName_AttValue
0 [[an1,av1],[an1,av2],[an3,av3]]
1 [[bn1,bv1],[bn2,bv2]]
I built the following UDF:
UDF2 combineColumns = new UDF2<Seq<String>, Seq<String>, List<List<String>>>() {
    public List<List<String>> call(final Seq<String> col1, final Seq<String> col2) throws Exception {
        List<List<String>> zipped = new LinkedList<>();
        for (int i = 0, listSize = col1.size(); i < listSize; i++) {
            List<String> subRow = Arrays.asList(col1.apply(i), col2.apply(i));
            zipped.add(subRow);
        }
        return zipped;
    }
};
But when I run the code
myDF.select(callUDF("combineColumns", col("AttributeName"), col("AttributeValue"))).show(10);
I got the following error message:
scala.MatchError: [[an1,av1],[an1,av2],[an3,av3]] (of class java.util.LinkedList)
It looks like the combining is performed correctly, but the return type is not the one Scala expects.
Any help?
Finally I managed to get the result I was looking for, but probably not in the most efficient way.
Basically there are 2 steps:
Zip the two lists
Explode the list into rows
For the first step I defined the following UDF:
UDF2 concatItems = new UDF2<Seq<String>, Seq<String>, Seq<String>>() {
    public Seq<String> call(final Seq<String> col1, final Seq<String> col2) throws Exception {
        // join each name/value pair into a single ";"-separated string
        List<String> zipped = new ArrayList<>();
        for (int i = 0, listSize = col1.size(); i < listSize; i++) {
            String subRow = col1.apply(i) + ";" + col2.apply(i);
            zipped.add(subRow);
        }
        return scala.collection.JavaConversions.asScalaBuffer(zipped);
    }
};
Don't forget to register the function (the declared return type must be an array of strings, since the UDF returns a Seq):
sparkSession.udf().register("concatItems", concatItems, DataTypes.createArrayType(DataTypes.StringType));
and then I called it with the following code:
DataFrame df2 = df.select(col("ID"), callUDF("concatItems", col("AttributeName"), col("AttributeValue")).alias("AttName_AttValue"));
At this stage df2 looks like this:
ID AttName_AttValue
0 [[an1,av1],[an1,av2],[an3,av3]]
1 [[bn1,bv1],[bn2,bv2]]
Then I used the explode function to flatten the list into rows:
DataFrame df3 = df2.select(col("ID"),explode(col("AttName_AttValue")).alias("AttName_AttValue_row"));
At this stage df3 looks like this:
ID AttName_AttValue
0 [an1,av1]
0 [an1,av2]
0 [an3,av3]
1 [bn1,bv1]
1 [bn2,bv2]
Finally, to split the attribute name and value into two different columns, I converted the DataFrame into a JavaRDD in order to use the map function:
JavaRDD<Row> df3RDD = df3.toJavaRDD().map(
    (Function<Row, Row>) myRow -> {
        // the name/value pair was joined with ";" by concatItems
        String[] info = String.valueOf(myRow.get(1)).split(";");
        return RowFactory.create(myRow.get(0), info[0], info[1]);
    }).cache();
If anybody has a better solution feel free to comment.
I hope it helps.
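For what it's worth, the final RDD round trip could also be avoided by splitting the exploded string column with the built-in split function; a sketch against df3, assuming the ";" separator used by concatItems:
// import static org.apache.spark.sql.functions.*;
DataFrame df4 = df3
        .withColumn("AttributeName", split(col("AttName_AttValue_row"), ";").getItem(0))
        .withColumn("AttributeValue", split(col("AttName_AttValue_row"), ";").getItem(1))
        .drop("AttName_AttValue_row");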