I am currently working on a transformation project where I need to feed data from Oracle into Elasticsearch. My workflow goes like this:
1. Sqoop - import from Oracle
2. Java Spark - DataFrame joins, then saving the results into Elasticsearch indices
And my Elasticsearch document will look like:
{
  Field1: value,
  Field2: value,
  Field3: value,
  Field4: [              -- array of maps
    {
      Name: value,
      Age: value
    },
    {
      Name: value,
      Age: value
    }
  ],
  Field5: {              -- map
    Code: value,
    Key: value
  }
}
So I would like to know how to form a JavaRDD for the above structure.
I have coded up to the DataFrame join and got stuck there, unable to proceed.
I want my data in the normalized form shown above.
My Spark code so far:
DataFrame esDF = df.select(
    df.col("Field1"), df.col("Field2"), df.col("Field3"),
    df.col("Name"), df.col("Age"),
    df.col("Code"), df.col("Key")
);
Please help.
A few options:
1 - Use the saveToEs method on the DataFrame itself. (This may not be supported in older versions; it works with elasticsearch-spark-20_2.11-5.1.1.jar.)
import org.apache.spark.sql.SQLContext._
import org.apache.spark.sql.functions._
import org.elasticsearch.spark.sql._
dataFrame.saveToEs("<index>/<type>", Map("es.nodes" -> "<ip:port>"))
2 - Create a case class and save an RDD of it. (Works for older versions as well.)
import org.elasticsearch.spark._
case class ESDoc(...)
val rdd = df.map(row => ESDoc(...))
rdd.saveToEs("<index>/<type>", Map("es.nodes" -> "<ip:port>"))
3 - With older versions of Scala (< 2.11) you will be stuck with the 22-field limit on case classes. Note that you can use a Map instead of a case class:
import org.elasticsearch.spark._
val rdd = df.map(row => Map(<key> -> <value>, ...))
rdd.saveToEs("<index>/<type>", Map("es.nodes" -> "<ip:port>")) // saves RDD[Map[K, V]]
For all of the above methods, you may want to set es.batch.write.retry.count to an appropriate value, or to -1 (infinite retries) if you have another way of controlling the lifecycle of the EMR cluster (making sure it won't run forever).
val esOptions = Map("es.nodes" -> "<host>:<port>", "es.batch.write.retry.count" -> "-1")
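Since the original question is about Java and about the nested Field4/Field5 layout, here is a minimal sketch of how the joined rows could be grouped into nested maps and written out with JavaEsSpark from elasticsearch-hadoop. It assumes Field1 identifies one ES document and that Name/Age repeat once per joined row; adjust the grouping key to your real data:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;
import scala.Tuple2;

// Group the joined rows by Field1 (assumed to identify one ES document) and build nested maps.
JavaRDD<Map<String, Object>> esDocs = esDF.toJavaRDD()
    .mapToPair(row -> {
        Map<String, Object> doc = new HashMap<>();
        doc.put("Field1", row.getAs("Field1"));
        doc.put("Field2", row.getAs("Field2"));
        doc.put("Field3", row.getAs("Field3"));

        Map<String, Object> person = new HashMap<>();
        person.put("Name", row.getAs("Name"));
        person.put("Age", row.getAs("Age"));
        List<Map<String, Object>> people = new ArrayList<>();
        people.add(person);
        doc.put("Field4", people);                       // array of maps

        Map<String, Object> field5 = new HashMap<>();
        field5.put("Code", row.getAs("Code"));
        field5.put("Key", row.getAs("Key"));
        doc.put("Field5", field5);                       // map

        return new Tuple2<>((Object) row.getAs("Field1"), doc);
    })
    .reduceByKey((a, b) -> {                             // merge rows sharing the same Field1
        ((List<Map<String, Object>>) a.get("Field4"))
            .addAll((List<Map<String, Object>>) b.get("Field4"));
        return a;
    })
    .values();

// elasticsearch-hadoop serializes nested Java Maps/Lists as nested JSON objects/arrays,
// which produces the document shape shown in the question.
JavaEsSpark.saveToEs(esDocs, "<index>/<type>",
    Collections.singletonMap("es.nodes", "<ip:port>"));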
Related
I am using spark-sql 2.4.1 with Java 8.
I am trying to join two datasets as below:
computed_df.as('s).join(accumulated_results_df.as('f),$"s.company_id" === $"f.company_id","inner")
This works fine in Databricks notebooks, but when I try to implement the same in my Spark Java code in my IDE, it won't recognize the "$" function/operator even after including
import static org.apache.spark.sql.functions.*;
So what should be done to use it in my Spark Java code?
Thanks.
The answer is org.apache.spark.sql.Column. See the Column documentation:
public class Column
...
df("columnName") // On a specific DataFrame.
col("columnName") // A generic column no yet associcated with a DataFrame.
col("columnName.field") // Extracting a struct field
col("`a.column.with.dots`") // Escape `.` in column names.
$"columnName" // Scala short hand for a named column.
expr("a + 1") // A column that is constructed from a parsed SQL Expression.
lit("abc") // A column that produces a literal (constant) value.
Using the Java API, I'm trying to Put() the content of some files into HBase 1.1.x. To do so, I have created a WholeFileInput class (ref: Using WholeFileInputFormat with Hadoop MapReduce still results in Mapper processing 1 line at a time) to make MapReduce read the entire file instead of one line at a time. Unfortunately, I cannot figure out how to form my rowkey from the given filename.
Example:
Input:
file-123.txt
file-524.txt
file-9577.txt
...
file-"anotherNumber".txt
Result on my HBase table:
Row-----------------Value
123-----------------"content of 1st file"
524-----------------"content of 2nd file"
...etc
If anyone has already faced this situation, I would appreciate some help with it.
Thanks in advance.
Your rowkey can look like this:
rowkey = prefix + (filename part or full file name) + MurmurHash(fileContent)
where the prefix can be any value within whatever pre-splits you defined at table creation time.
For example:
create 'tableName', {NAME => 'colFam', VERSIONS => 2, COMPRESSION => 'SNAPPY'},
{SPLITS => ['0','1','2','3','4','5','6','7']}
The prefix can be any random id generated within the range of the pre-splits.
This kind of row key also avoids hot-spotting as data grows, and the data will be spread across the region servers.
According to the new Spark docs, using Spark's DataFrame should be preferred over JdbcRDD.
The first contact was quite pleasant until I hit the first problem: DataFrame has no flatMapToPair() method. My first thought was to convert it into a JavaRDD, which I did.
Everything was fine until I noticed that this code:
JavaRDD<Row> myRDD = dataFrame.toJavaRDD();
int amount = myRDD.partitions().size();
returns 1. Everything downstream of that conversion to JavaRDD is completely inefficient, and forcing a repartition of the RDD takes a good chunk of time and adds more overhead than just working with 1 partition.
How do I deal with this?
When using JdbcRDD we wrote specific "pager" SQL like WHERE id >= ? AND id <= ? that was used to create the partitions. How can I do something like this with a DataFrame?
val connectionString = "jdbc:oracle:thin:username/password@111.11.1.11:1521:ORDERS"
val ordersDF = sqlContext.load("jdbc",
  Map("url" -> connectionString,
      "dbtable" -> "(select * from CUSTOMER_ORDERS)",
      "partitionColumn" -> "ORDER_ID",
      "lowerBound" -> "1000",
      "upperBound" -> "40000",
      "numPartitions" -> "10"))
I have a MongoDB collection. Simply put, it has two fields: user and url, and it contains 39,274,590 documents. The key of this collection is {user, url}.
Using Java, I try to list the distinct urls:
MongoDBManager db = new MongoDBManager( "Website", "UserLog" );
return db.getDistinct("url");
But I receive an exception:
Exception in thread "main" com.mongodb.CommandResult$CommandFailure: command failed [distinct]:
{ "serverUsed" : "localhost/127.0.0.1:27017" , "errmsg" : "exception: distinct too big, 16mb cap" , "code" : 10044 , "ok" : 0.0}
How can I solve this problem? Is there any plan B that can avoid this problem?
In version 2.6 you can use the aggregation framework's $out stage to produce a separate collection:
http://docs.mongodb.org/manual/reference/operator/aggregation/out/
This will get around MongoDB's 16 MB limit for most queries. You can read more about using the aggregation framework on large datasets in MongoDB 2.6 here:
http://vladmihalcea.com/mongodb-2-6-is-out/
To do a 'distinct' query with the aggregation framework, group by the field.
db.userlog.aggregate([{$group: {_id: '$url'} }]);
Note: I don't know how this works for the Java driver, good luck.
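For the Java driver (3.x, using the com.mongodb.client API and the 3.7+ style MongoClients), a rough sketch of the same $group aggregation, materialized into a separate collection with $out, could look like this; the output collection name distinct_urls is just an assumption:

import static com.mongodb.client.model.Aggregates.group;
import static com.mongodb.client.model.Aggregates.out;
import static java.util.Arrays.asList;

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

MongoDatabase db = MongoClients.create("mongodb://localhost:27017").getDatabase("Website");
MongoCollection<Document> userLog = db.getCollection("UserLog");

// $group by url keeps one document per distinct url; $out writes the result to a
// separate collection, which sidesteps the 16 MB limit of an inline result.
userLog.aggregate(asList(
        group("$url"),
        out("distinct_urls")
)).toCollection();   // runs the pipeline (required because nothing is iterated)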
Take a look at this answer
1) The easiest way to do this is via the aggregation framework. This takes two "$group" commands: the first one groups by distinct values, the second one counts all of the distinct values
2) If you want to do this with Map/Reduce you can. This is also a two-phase process: in the first phase we build a new collection with a list of every distinct value for the key. In the second we do a count() on the new collection.
Note that you cannot return the result of the map/reduce inline, because that will potentially overrun the 16MB document size limit. You can save the calculation in a collection and then count() the size of the collection, or you can get the number of results from the return value of mapReduce().
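A quick sketch of option 1 with the Java driver, reusing the userLog collection handle from the sketch above; distinctCount is just an illustrative field name:

import static com.mongodb.client.model.Accumulators.sum;
import static com.mongodb.client.model.Aggregates.group;
import static java.util.Arrays.asList;

import org.bson.Document;

// First $group collapses to one document per distinct url,
// second $group counts those documents.
Document result = userLog.aggregate(asList(
        group("$url"),
        group(null, sum("distinctCount", 1))
)).allowDiskUse(true).first();

int distinctUrls = (result == null) ? 0 : result.getInteger("distinctCount");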
If you are using MongoDB 3.0 and above, you can use the DistinctIterable class with a batchSize:
MongoCollection<Document> coll = mongodb.getCollection("mycollection");
DistinctIterable<String> ids = coll.distinct("id", String.class).batchSize(100);
for (String id : ids) {
    System.out.println(id);
}
http://api.mongodb.com/java/current/com/mongodb/client/DistinctIterable.html
Driver version 3.x, in Groovy:
import com.mongodb.client.AggregateIterable
import com.mongodb.client.MongoCollection
import com.mongodb.client.MongoCursor
import com.mongodb.client.MongoDatabase
import static com.mongodb.client.model.Accumulators.sum
import static com.mongodb.client.model.Aggregates.group
import static java.util.Arrays.asList
import org.bson.Document
//other code
AggregateIterable<Document> iterable = collection.aggregate(
    asList(
        group("\$" + "url", sum("count", 1))
    )
).allowDiskUse(true)

MongoCursor cursor = iterable.iterator()
while (cursor.hasNext()) {
    Document doc = cursor.next()
    println(doc.toJson())
}
I have to generate a random number within a range (0-100,000) in a cluster environment (many stateless Java-based app servers + MongoDB), so that every user request gets a unique number and keeps it over the next few requests.
As I understand it, I have a few options:
1. Have some number persisted in Mongo and incrementAndGet it - but it's not atomic - bad choice.
2. Use Redis - it's atomic and supports counters.
3. Any other idea? Is it safe to use a UUID and restrict it to a range?
4. Hazelcast?
Any other thoughts?
Thanks
I would leverage the existing MongoDB infrastructure and use the MongoDB findAndModify command to do an atomic increment and get operation.
For the shell, the command would look like this:
var result = db.ids.findAndModify( {
query: { _id: "counter" },
sort: { rating: 1 },
new : true,
update: { $inc: { counter: 1 } },
upsert : true
} );
The 'new : true' returns the document after the update. Upsert creates the document if it is missing.
The 10gen supported driver and the Asynchronous Driver both contain helper methods/builders for the find and modify command.
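With the current Java driver, the equivalent atomic increment-and-get can be written with findOneAndUpdate. A minimal sketch, with the database name mydb assumed and the _id/counter fields taken from the shell example above:

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.inc;

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.FindOneAndUpdateOptions;
import com.mongodb.client.model.ReturnDocument;
import org.bson.Document;

MongoCollection<Document> ids = MongoClients.create("mongodb://localhost:27017")
        .getDatabase("mydb")            // assumed database name
        .getCollection("ids");

// Atomically increments the counter and returns the updated document;
// upsert(true) creates the counter document the first time this runs.
Document result = ids.findOneAndUpdate(
        eq("_id", "counter"),
        inc("counter", 1),
        new FindOneAndUpdateOptions()
                .upsert(true)
                .returnDocument(ReturnDocument.AFTER));

int nextId = result.getInteger("counter");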