Spark structured streaming: converting row to json - java

I'm trying to convert a DataFrame Row into a JSON string using only the Spark API.
From the input Row
+----------------+-----------+
|       someThing|       else|
+----------------+-----------+
|            life|         42|
+----------------+-----------+
with
myDataFrame
    .select(struct("*").as("col"))
    .select(to_json(col("col")))
    .writeStream()
    .foreach(new KafkaWriter())
    .start()
using a KafkaWriter that calls row.toString(), I got:
[{
"someThing":"life",
"else":42
}]
whereas I would like to get this instead:
{
"someThing":"life",
"else":42
}
(without the [])
Any idea?

I just found the solution: using Row.mkString instead of Row.toString solved my case.
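For reference, a minimal sketch of what the writer could look like (the KafkaWriter internals are not shown above, so the class shape below is an assumption based on Spark's ForeachWriter API; only the mkString/toString detail comes from the question):

import org.apache.spark.sql.ForeachWriter;
import org.apache.spark.sql.Row;

// Hypothetical shape of the KafkaWriter used above.
public class KafkaWriter extends ForeachWriter<Row> {
    @Override
    public boolean open(long partitionId, long epochId) {
        return true; // open the Kafka producer / connection here
    }

    @Override
    public void process(Row row) {
        // row.toString() wraps the single to_json column in brackets: [{...}]
        // row.mkString() returns just the column's value: {...}
        String json = row.mkString();
        // send `json` to Kafka here
    }

    @Override
    public void close(Throwable errorOrNull) {
        // flush / close the producer here
    }
}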

Related

how to concat all columns in a spark dataframe, using java?

This is how I do it for two specific columns:
dataSet.withColumn("colName", concat(dataSet.col("col1"), lit(","), dataSet.col("col2")));
but dataSet.columns() returns a String array, not a Column array.
How should I create a List<Column>?
Thanks!
Simple way - instead of df.columns, use concat_ws(",", "*"). Check the code below.
df.withColumn("colName",expr("concat_ws(',',*)")).show(false)
+---+--------+---+-------------+
|id |name    |age|colName      |
+---+--------+---+-------------+
|1  |Srinivas|29 |1,Srinivas,29|
|2  |Ravi    |30 |2,Ravi,30    |
+---+--------+---+-------------+
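For a Java caller, the same expression can be used directly (a sketch, assuming the usual imports; df as in the answer above):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.expr;

// Sketch: the same concat_ws(',', *) expression from Java
Dataset<Row> result = df.withColumn("colName", expr("concat_ws(',', *)"));
result.show(false);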
Java has a more verbose syntax. Try this:
df.withColumn("colName",
    concat_ws(",",
        toScalaSeq(Arrays.stream(df.columns())
            .map(functions::col)
            .collect(Collectors.toList()))));
Use the utility below to convert a Java List to a Scala Seq:
import scala.collection.JavaConversions;
import scala.collection.mutable.Buffer;

<T> Buffer<T> toScalaSeq(List<T> list) {
    return JavaConversions.asScalaBuffer(list);
}
If someone is looking for a way to concat all the columns of a DataFrame in Scala, this is what worked for me:
val df_new = df.withColumn(new_column_name, concat_ws("-", df.columns.map(col): _*))

ElasticSearch Java API: Enabling fielddata on text fields

I have created an ElasticSearch index using the ElasticSearch Java API. Now I would like to perform some aggregations on data stored in this index, but I get the following error:
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [item] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
As suggested at this link, to solve this issue I should enable fielddata on the "item" text field, but how can I do that using the ElasticSearch Java API?
An alternative might be mapping the "item" field as a keyword, but same question: how can I do that with the ElasticSearch Java API?
For a new index you can set the mappings at creation time by doing something like:
CreateIndexRequest createIndexRequest = new CreateIndexRequest(indexName);
String source = // probably read the mapping from a file
createIndexRequest.source(source, XContentType.JSON);
restHighLevelClient.indices().create(createIndexRequest);
The mapping has the same format as the request you would send against the REST endpoint, similar to this:
{
  "mappings": {
    "your_type": {
      "properties": {
        "your_property": {
          "type": "keyword"
        }
      }
    }
  }
}
Use XContentBuilder; it makes it easy to build the JSON for creating or updating a mapping. It looks like:
.startObject("field's name")
.field("type", "text")
.field("fielddata", true)
.endObject()
Afterwards, use a CreateIndexRequest to create a new index, or a PutMappingRequest to update an existing mapping.
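Putting that together, a sketch with the high-level REST client could look like the following (the index name is illustrative, and the exact request classes vary between Elasticsearch client versions):

import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.indices.PutMappingRequest;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;

// Sketch: enable fielddata on the "item" text field of an existing index
XContentBuilder mapping = XContentFactory.jsonBuilder()
    .startObject()
        .startObject("properties")
            .startObject("item")
                .field("type", "text")
                .field("fielddata", true)
            .endObject()
        .endObject()
    .endObject();

PutMappingRequest putMappingRequest = new PutMappingRequest("my_index") // index name assumed
    .source(mapping);
restHighLevelClient.indices().putMapping(putMappingRequest, RequestOptions.DEFAULT);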
I recently encountered this issue as well, in my case while performing sorting. I have posted a working solution for this problem in another similar question -> set field data = true on java elasticsearch

convert mongo shell query to spring data

I have a mongo shell query like the one below:
db.viewedProfile.aggregate(
    { $match : { "viewedMemberId" : "54d6dd15e4b0611ba5762e3d" } },
    { $group : { _id : null, total : { $sum : "$count" } } }
)
I am struggling to convert this to Spring Data MongoDB. I am using Spring Data version 1.4.3.RELEASE; the Aggregation constructor does not seem to recognize the match method.
This should do it:
import static org.springframework.data.mongodb.core.aggregation.Aggregation.*;
...
Aggregation aggregation = newAggregation(
ViewedProfile.class,
match(Criteria.where("viewedMemberId").is("54d6dd15e4b0611ba5762e3d")),
group().sum("count").as("total")
);
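To actually run it, pass the aggregation to a MongoTemplate along these lines (a sketch; the TotalResult holder class and its field name are illustrative):

import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.aggregation.AggregationResults;

// Sketch: executing the typed aggregation and reading the summed total
AggregationResults<TotalResult> results =
        mongoTemplate.aggregate(aggregation, TotalResult.class);
TotalResult result = results.getUniqueMappedResult();

// Hypothetical result holder matching the "total" field from the group stage
class TotalResult {
    int total;
}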

How to get an array of documents using MongoDB Java?

How can I get all the documents under an array in MongoDB Java? My database is as below. I want to retrieve all the data under the array 198_168_1_134.
Below is some of what I tried:
eventlist.find(new BasicDBObject("$match","192_168_10_17"))
eventlist.find(new BasicDBObject("$elemMatch","192_168_10_17"))
eventlist.find(null, new BasicDBObject("$192_168_10_17", 1))
You have two options:
using .find() and cherry-picking which fields you want fetched.
using the aggregation framework and projecting the documents.
Using .find(), you can do:
db.collection.find({}, { "192_168_10_17" : 1 })
Using the aggregation framework, you can do:
db.collection.aggregate( { $project : { "192_168_10_17" : 1 } } )
which will fetch only the 192_168_10_17 document data.
Of course, in order to get this working in Java, you have to translate these queries to a corresponding chain of BasicDBObject instances.
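For example, the $project pipeline above translates roughly to the following with the legacy driver (a sketch; the field name follows the examples above and eventlist is the collection from the question):

import java.util.Arrays;
import com.mongodb.AggregationOutput;
import com.mongodb.BasicDBObject;
import com.mongodb.DBObject;

// Sketch: $project expressed as a chain of BasicDBObject instances
DBObject project = new BasicDBObject("$project",
        new BasicDBObject("192_168_10_17", 1));
AggregationOutput output = eventlist.aggregate(Arrays.asList(project));
for (DBObject doc : output.results()) {
    System.out.println(doc);
}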
Using the MongoDB Java driver, you can do this with the following query:
eventlist.find(new BasicDBObject(), new BasicDBObject("198_168_1_134", 1))

Apache Spark DataFrame no RDD partitioning

According to the new Spark docs, using Spark's DataFrame should be preferred over using JdbcRDD.
The first touch was pretty enjoyable until I hit the first problem: DataFrame has no flatMapToPair() method. My first thought was to convert it into a JavaRDD, and I did.
Everything was fine; I wrote my code using this approach and then noticed that this code:
JavaRDD<Row> myRDD = dataFrame.toJavaRDD();
int amount = myRDD.partitions().size();
produces 1. All code after this transformation to JavaRDD is absolutely inefficient. Forcing a repartition of the RDD takes a good deal of time and creates more overhead than the code that works with a single partition.
How do I deal with this?
While using JdbcRDD we wrote specific "pager" SQL like WHERE id >= ? AND id <= ? that was used to create partitions. How can I make something like this using DataFrame?
val connectionString = "jdbc:oracle:thin:username/password@111.11.1.11:1521:ORDERS"
val ordersDF = sqlContext.load("jdbc",
  Map("url" -> connectionString,
      "dbtable" -> "(select * from CUSTOMER_ORDERS)",
      "partitionColumn" -> "ORDER_ID",
      "lowerBound" -> "1000",
      "upperBound" -> "40000",
      "numPartitions" -> "10"))
