I am quite new to Cassandra and Hector and I am trying to create a super column. I have already done a lot of research, but so far nothing works. While searching on Stack Overflow I found a question that seemed helpful, so I adapted its code for my example, but I'm getting an exception.
This is my code (it should be copy/pastable if you have Hector). Sorry if it is not perfectly readable; I did a lot of trial and error before asking here:
import java.util.Arrays;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.cassandra.service.ThriftKsDef;
import me.prettyprint.cassandra.service.template.SuperCfTemplate;
import me.prettyprint.cassandra.service.template.SuperCfUpdater;
import me.prettyprint.cassandra.service.template.ThriftSuperCfTemplate;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.ddl.ColumnFamilyDefinition;
import me.prettyprint.hector.api.ddl.ComparatorType;
import me.prettyprint.hector.api.ddl.KeyspaceDefinition;
import me.prettyprint.hector.api.factory.HFactory;
public class DatabaseDataImporter {
private Cluster myCluster;
private KeyspaceDefinition keyspaceDefinition;
private Keyspace keyspace;
private SuperCfTemplate<String, String, String> template;
final static StringSerializer ss = StringSerializer.get();
public DatabaseDataImporter() {
initializeCluster();
SuperCfTemplate<String, String, String> template = new ThriftSuperCfTemplate<String, String, String>(
keyspace, "Nodes", ss, ss, ss);
SuperCfUpdater<String, String, String> updater = template
.createUpdater("key", "newcf");
updater.setString("subname", "1");
template.update(updater);
}
private void initializeCluster() {
// get Cluster
myCluster = HFactory.getOrCreateCluster("Test Cluster",
"localhost:9160");
keyspaceDefinition = myCluster.describeKeyspace("Graphs");
// If keyspace does not exist, the CFs don't exist either. => create
// them.
if (keyspaceDefinition == null) {
createSchema();
}
keyspace = HFactory.createKeyspace("Graphs", myCluster);
}
private void createSchema() {
// get Cluster
Cluster myCluster = HFactory.getOrCreateCluster("Test Cluster",
"localhost:9160");
// add Schema
int replicationFactor = 1;
ColumnFamilyDefinition cfDef = HFactory.createColumnFamilyDefinition(
"Graphs", "Nodes", ComparatorType.BYTESTYPE);
KeyspaceDefinition newKeyspace = HFactory.createKeyspaceDefinition(
"Graphs", ThriftKsDef.DEF_STRATEGY_CLASS, replicationFactor,
Arrays.asList(cfDef));
// Add the schema to the cluster.
// "true" as the second param means that Hector will block until all
// nodes see the change.
myCluster.addKeyspace(newKeyspace, true);
}
public static void main(String[] args) {
new DatabaseDataImporter();
}
}
The exception I get is:
Exception in thread "main" me.prettyprint.hector.api.exceptions.HInvalidRequestException: InvalidRequestException(why:supercolumn parameter is invalid for standard CF Nodes)
at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:52)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:260)
at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:113)
at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243)
at me.prettyprint.cassandra.service.template.AbstractColumnFamilyTemplate.executeBatch(AbstractColumnFamilyTemplate.java:115)
at me.prettyprint.cassandra.service.template.AbstractColumnFamilyTemplate.executeIfNotBatched(AbstractColumnFamilyTemplate.java:159)
at me.prettyprint.cassandra.service.template.SuperCfTemplate.update(SuperCfTemplate.java:203)
at algorithms.DatabaseDataImporter.<init>(DatabaseDataImporter.java:43)
at algorithms.DatabaseDataImporter.main(DatabaseDataImporter.java:87)
Caused by: InvalidRequestException(why:supercolumn parameter is invalid for standard CF Nodes)
at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20833)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:964)
at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:950)
at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:246)
at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:243)
at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:104)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:253)
... 7 more
I can understand that I am doing something wrong: I'm attempting to insert a super column into a standard column family. (Source for this is here.) So maybe this is the code where everything goes wrong during creation:
ColumnFamilyDefinition cfDef = HFactory.createColumnFamilyDefinition(
"Graphs", "Nodes", ComparatorType.BYTESTYPE);
And this is the point where I don't know how to proceed. I tried to find a "SuperColumnFamilyDefinition" class, but I couldn't find one. Do you have any ideas or suggestions about what I need to change to fix my code? I would be very grateful.
Thanks a lot for every thought you're sharing with me.
I found the answer to my problem and would like to share it (maybe it helps someone in the future). As I suspected, the solution was quite simple.
ColumnFamilyDefinition cfDef = HFactory.createColumnFamilyDefinition(
"Graphs", "Nodes", ComparatorType.BYTESTYPE);
needed to be extended to
ColumnFamilyDefinition cfDef = HFactory.createColumnFamilyDefinition(
"Graphs", "Nodes", ComparatorType.BYTESTYPE);
// defines it as super column
((ThriftCfDef) cfDef).setColumnType(ColumnType.SUPER);
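For completeness, here is a minimal sketch of the corrected createSchema() with the two extra imports the cast needs (package names assume Hector's Thrift-based API, so treat this as a sketch rather than a drop-in):
import me.prettyprint.cassandra.service.ThriftCfDef;
import me.prettyprint.hector.api.ddl.ColumnType;

private void createSchema() {
    ColumnFamilyDefinition cfDef = HFactory.createColumnFamilyDefinition(
            "Graphs", "Nodes", ComparatorType.BYTESTYPE);
    // Mark the column family as a super column family before adding the keyspace.
    ((ThriftCfDef) cfDef).setColumnType(ColumnType.SUPER);
    KeyspaceDefinition newKeyspace = HFactory.createKeyspaceDefinition(
            "Graphs", ThriftKsDef.DEF_STRATEGY_CLASS, 1, Arrays.asList(cfDef));
    // "true" blocks until all nodes see the change.
    myCluster.addKeyspace(newKeyspace, true);
}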
I have read countless articles and code examples on MongoDB Change Streams, but I still can't manage to set it up properly. I'm trying to listen to a specific collection in my MongoDB, and whenever a document is inserted, updated, or deleted, I want to do something.
This is what I've tried:
@Data
@Document(collection = "teams")
public class Teams {
private @MongoId(FieldType.OBJECT_ID)
ObjectId id;
private Integer teamId;
private String name;
private String description;
}
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.changestream.FullDocument;
import com.mongodb.client.ChangeStreamIterable;
import org.bson.Document;
import org.bson.conversions.Bson;
import java.util.Arrays;
import java.util.List;
public class MongoDBChangeStream {
// connect to the local database server
MongoClient mongoClient = MongoClients.create("db uri goes here");
// Select the MongoDB database
MongoDatabase database = mongoClient.getDatabase("MyDatabase");
// Select the collection to query
MongoCollection<Document> collection = database.getCollection("teams");
// Create pipeline for operationType filter
List<Bson> pipeline = Arrays.asList(
Aggregates.match(
Filters.in("operationType",
Arrays.asList("insert", "update", "delete"))));
// Create the Change Stream
ChangeStreamIterable<Document> changeStream = collection.watch(pipeline)
.fullDocument(FullDocument.UPDATE_LOOKUP);
// Iterate over the Change Stream
for (Document changeEvent : changeStream) {
// Process the change event here
}
}
So this is what I have so far, and everything is fine until the for-loop, which gives three errors:
There is a red line under 'for (', which says unexpected token.
There is a red line under ' :', which says ';' expected.
There is a red line under 'changeStream)', which says unknown class: 'changeStream'.
First of all, you should put your code inside a class method, not the class body. Second, the iterator element of ChangeStreamIterable&lt;Document&gt; is ChangeStreamDocument&lt;Document&gt;, not Document.
Summing things up:
public class MongoDBChangeStream {
public void someMethod() {
// connect to the local database server
MongoClient mongoClient = MongoClients.create("db uri goes here");
// Select the MongoDB database
MongoDatabase database = mongoClient.getDatabase("MyDatabase");
// Select the collection to query
MongoCollection<Document> collection = database.getCollection("teams");
// Create pipeline for operationType filter
List<Bson> pipeline = Arrays.asList(
Aggregates.match(
Filters.in(
"operationType",
Arrays.asList("insert", "update", "delete")
)));
// Create the Change Stream
ChangeStreamIterable<Document> changeStream = collection.watch(pipeline)
.fullDocument(FullDocument.UPDATE_LOOKUP);
// Iterate over the Change Stream
for (ChangeStreamDocument<Document> changeEvent : changeStream) {
// Process the change event here
}
}
}
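Note that ChangeStreamDocument comes from the driver's changestream package; a minimal way to wire this up (reusing the class and method names from above) might be:
import com.mongodb.client.model.changestream.ChangeStreamDocument;

public class Main {
    public static void main(String[] args) {
        // Iterating the change stream blocks, so run it on a dedicated thread
        // if the application needs to do anything else at the same time.
        new MongoDBChangeStream().someMethod();
    }
}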
I am trying to create a simple application that consumes a Kafka message, does some CQL transformation, and publishes back to Kafka. Below is the code:
JAVA: 1.8
Flink: 1.13
Scala: 2.11
flink-siddhi: 2.11-0.2.2-SNAPSHOT
I am using library: https://github.com/haoch/flink-siddhi
input json to Kafka:
{
"awsS3":{
"ResourceType":"aws.S3",
"Details":{
"Name":"crossplane-test",
"CreationDate":"2020-08-17T11:28:05+00:00"
},
"AccessBlock":{
"PublicAccessBlockConfiguration":{
"BlockPublicAcls":true,
"IgnorePublicAcls":true,
"BlockPublicPolicy":true,
"RestrictPublicBuckets":true
}
},
"Location":{
"LocationConstraint":"us-west-2"
}
}
}
main class:
public class S3SidhiApp {
public static void main(String[] args) {
internalStreamSiddhiApp.start();
//kafkaStreamApp.start();
}
}
App class:
package flinksidhi.app;
import com.google.gson.JsonObject;
import flinksidhi.event.s3.source.S3EventSource;
import io.siddhi.core.SiddhiManager;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.siddhi.SiddhiCEP;
import org.json.JSONObject;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import static flinksidhi.app.connector.Consumers.createInputMessageConsumer;
import static flinksidhi.app.connector.Producer.*;
public class internalStreamSiddhiApp {
private static final String inputTopic = "EVENT_STREAM_INPUT";
private static final String outputTopic = "EVENT_STREAM_OUTPUT";
private static final String consumerGroup = "EVENT_STREAM1";
private static final String kafkaAddress = "localhost:9092";
private static final String zkAddress = "localhost:2181";
private static final String S3_CQL1 = "from inputStream select * insert into temp";
private static final String S3_CQL = "from inputStream select json:toObject(awsS3) as obj insert into temp;" +
"from temp select json:getString(obj,'$.awsS3.ResourceType') as affected_resource_type," +
"json:getString(obj,'$.awsS3.Details.Name') as affected_resource_name," +
"json:getString(obj,'$.awsS3.Encryption.ServerSideEncryptionConfiguration') as encryption," +
"json:getString(obj,'$.awsS3.Encryption.ServerSideEncryptionConfiguration.Rules[0].ApplyServerSideEncryptionByDefault.SSEAlgorithm') as algorithm insert into temp2; " +
"from temp2 select affected_resource_name,affected_resource_type, " +
"ifThenElse(encryption == ' ','Fail','Pass') as state," +
"ifThenElse(encryption != ' ' and algorithm == 'aws:kms','None','Critical') as severity insert into outputStream";
public static void start(){
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//DataStream<String> inputS = env.addSource(new S3EventSource());
//Flink kafka stream consumer
FlinkKafkaConsumer<String> flinkKafkaConsumer =
createInputMessageConsumer(inputTopic, kafkaAddress,zkAddress, consumerGroup);
//Add Data stream source -- flink consumer
DataStream<String> inputS = env.addSource(flinkKafkaConsumer);
SiddhiCEP cep = SiddhiCEP.getSiddhiEnvironment(env);
cep.registerExtension("json:toObject", io.siddhi.extension.execution.json.function.ToJSONObjectFunctionExtension.class);
cep.registerExtension( "json:getString", io.siddhi.extension.execution.json.function.GetStringJSONFunctionExtension.class);
cep.registerStream("inputStream", inputS, "awsS3");
inputS.print();
System.out.println(cep.getDataStreamSchemas());
//json needs extension jars to present during runtime.
DataStream<Map<String,Object>> output = cep
.from("inputStream")
.cql(S3_CQL1)
.returnAsMap("temp");
//Flink kafka stream Producer
FlinkKafkaProducer<Map<String, Object>> flinkKafkaProducer =
createMapProducer(env,outputTopic, kafkaAddress);
//Add Data stream sink -- flink producer
output.addSink(flinkKafkaProducer);
output.print();
try {
env.execute();
} catch (Exception e) {
e.printStackTrace();
}
}
}
Consumer class:
package flinksidhi.app.connector;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.json.JSONObject;
import java.util.Properties;
public class Consumers {
public static FlinkKafkaConsumer<String> createInputMessageConsumer(String topic, String kafkaAddress, String zookeeprAddr, String kafkaGroup ) {
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", kafkaAddress);
properties.setProperty("zookeeper.connect", zookeeprAddr);
properties.setProperty("group.id",kafkaGroup);
FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<String>(
topic,new SimpleStringSchema(),properties);
return consumer;
}
}
Producer class:
package flinksidhi.app.connector;
import flinksidhi.app.util.ConvertJavaMapToJson;
import org.apache.flink.api.common.serialization.SerializationSchema;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchema;
import org.json.JSONObject;
import java.util.Map;
public class Producer {
public static FlinkKafkaProducer<Tuple2> createStringProducer(StreamExecutionEnvironment env, String topic, String kafkaAddress) {
return new FlinkKafkaProducer<Tuple2>(kafkaAddress, topic, new AverageSerializer());
}
public static FlinkKafkaProducer<Map<String,Object>> createMapProducer(StreamExecutionEnvironment env, String topic, String kafkaAddress) {
return new FlinkKafkaProducer<Map<String,Object>>(kafkaAddress, topic, new SerializationSchema<Map<String, Object>>() {
@Override
public void open(InitializationContext context) throws Exception {
}
@Override
public byte[] serialize(Map<String, Object> stringObjectMap) {
String json = ConvertJavaMapToJson.convert(stringObjectMap);
return json.getBytes();
}
});
}
}
I have tried many things, but the code where the CQL is invoked is never called, and it doesn't even give any error, so I'm not sure where it is going wrong.
If I do the same thing with an internal stream source and use the same input JSON returned as a string, it works.
Initial guess: if you are using event time, are you sure you have defined watermarks correctly? As stated in the docs:
(...) an incoming element is initially put in a buffer where elements are sorted in ascending order based on their timestamp, and when a watermark arrives, all the elements in this buffer with timestamps smaller than that of the watermark are processed (...)
If this doesn't help, I would suggest decomposing/simplifying the job to a bare minimum, for example just a source operator and a naive sink that prints/logs elements, and if that works, start adding operators back one by one. You could also start by simplifying your CEP pattern as much as possible.
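A bare-minimum sketch of such a job (reusing the consumer setup and constants from the question) could be:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
FlinkKafkaConsumer<String> flinkKafkaConsumer =
        createInputMessageConsumer(inputTopic, kafkaAddress, zkAddress, consumerGroup);
// If nothing is printed here, the problem is upstream of Siddhi (source/connectivity).
env.addSource(flinkKafkaConsumer).print();
env.execute("source-only-debug");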
First of all, thanks a lot @Piotr Nowojski. No matter how many times I had pondered over event time, it had not come to my mind until your small pointer. So yes, while debugging the two cases:
With the internal data source, which was processing successfully, I identified while debugging the flow that a watermark was processed after the data was processed, but it did not strike me that it was somehow managing the event time of the data implicitly.
With Kafka as the data source, I could see very clearly while debugging that no watermark was being processed in the flow, but it did not occur to me that this was happening because event time and watermarks were not handled properly.
The fix was just adding a single line of code to the application, which I understood from the Flink code snippet below:
/**
 * @deprecated In Flink 1.12 the default stream time characteristic has been changed to {@link
 * TimeCharacteristic#EventTime}, thus you don't need to call this method for enabling
 * event-time support anymore. Explicitly using processing-time windows and timers works in
 * event-time mode. If you need to disable watermarks, please use {@link
 * ExecutionConfig#setAutoWatermarkInterval(long)}. If you are using {@link
 * TimeCharacteristic#IngestionTime}, please manually set an appropriate {@link
 * WatermarkStrategy}. If you are using generic "time window" operations (for example {@link
 * org.apache.flink.streaming.api.datastream.KeyedStream#timeWindow(org.apache.flink.streaming.api.windowing.time.Time)}
 * that change behaviour based on the time characteristic, please use equivalent operations
 * that explicitly specify processing time or event time.
 */
I learned that by default Flink assumes event time, which requires watermarks to be handled properly, which I had not done. So I added the line below to set the time characteristic of the Flink execution environment:
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
and kaboom... it started working. This call is deprecated and needs some other configuration nowadays, but thanks a lot, it was a great pointer; it helped me a lot and I solved the issue.
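For reference, a minimal sketch (assuming Flink 1.13) of the non-deprecated alternatives that the deprecation notice itself points to:
// Option 1: disable automatic watermark generation, as the deprecation note suggests.
env.getConfig().setAutoWatermarkInterval(0L);
// Option 2: state explicitly that the Kafka source emits no event-time watermarks.
DataStream<String> inputS = env.addSource(
        flinkKafkaConsumer.assignTimestampsAndWatermarks(
                org.apache.flink.api.common.eventtime.WatermarkStrategy.noWatermarks()));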
Thanks again @Piotr Nowojski.
I am fetching Neo4j data into a Spark dataframe using the neo4j-spark connector. I can fetch it successfully, as I am able to show the dataframe. Then I register the dataframe with the createOrReplaceTempView() method. But when I try running Spark SQL on it, it throws an exception saying:
org.apache.spark.sql.AnalysisException: Table or view not found: neo4jtable;
This is what my whole code looks like:
import java.text.ParseException;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.AnalysisException;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.neo4j.spark.Neo4JavaSparkContext;
import org.neo4j.spark.Neo4j;
import scala.collection.immutable.HashMap;
public class Neo4jDF {
private static Neo4JavaSparkContext neo4jjsc;
private static SparkConf sConf;
private static JavaSparkContext jsc;
private static SparkContext sc;
private static SparkSession ss;
private static Dataset<Row> neo4jdf;
static String neo4jip = "ll.mm.nn.oo";
public static void main(String[] args) throws AnalysisException, ParseException
{
setSparkConf();
setJavaSparkContext();
setNeo4jJavaSparkContext();
setSparkContext();
setSparkSession();
neo4jdf = loadNeo4jDataframe();
neo4jdf.createOrReplaceTempView("neo4jtable");
neo4jdf.show(false); //this prints correctly
Dataset<Row> neo4jdfsqled = ss.sql("SELECT * from neo4jtable");
neo4jdfsqled.show(false); //this throws exception
}
private static Dataset<Row> loadNeo4jDataframe(String pAutosysBoxCaption)
{
Neo4j neo4j = new Neo4j(jsc.sc());
HashMap<String, Object> a = new HashMap<String, Object>();
Dataset<Row> rdd = neo4j.cypher("cypher query deleted for irrelevance", a).loadDataFrame();
return rdd;
}
private static void setSparkConf()
{
sConf = new SparkConf().setAppName("GetNeo4jToRddDemo");
sConf.set("spark.neo4j.bolt.url", "bolt://" + neo4jip + ":7687");
sConf.set("spark.neo4j.bolt.user", "neo4j");
sConf.set("spark.neo4j.bolt.password", "admin");
sConf.setMaster("local");
sConf.set("spark.testing.memory", "471859200");
sConf.set("spark.sql.warehouse.dir", "file:///D:/Mahesh/workspaces/spark-warehouse");
}
private static void setJavaSparkContext()
{
jsc = new JavaSparkContext(sConf);
}
private static void setSparkContext()
{
sc = JavaSparkContext.toSparkContext(jsc);
}
private static void setSparkSession()
{
ss = new SparkSession(sc);
}
private static void setNeo4jJavaSparkContext()
{
neo4jjsc = Neo4JavaSparkContext.neo4jContext(jsc);
}
}
I feel the issue might be with how all spark environment variables are created.
I first created SparkConf sConf.
From sConf, I created JavaSparkContext jsc.
From jsc, I created SparkContext sc.
From sc, I created SparkSession ss.
From ss, I created Neo4JavaSparkContext neo4jjsc.
So visually:
sConf -> jsc -> sc -> ss
-> neo4jjsc
Also note that
Inside loadNeo4jDataframe(), I use sc to instantiate instance Neo4j neo4j, which is then used for fetching neo4j data.
Data is fetched using Neo4j instance.
neo4jjsc is never used, but I kept it as a possible hint for the issue.
Given all these points and observations, can you please tell me why I get the table-not-found exception? I must be missing something stupid. :\
Update
Tried setting ss (after data is fetched using SparkContext of neo4j) as follows:
private static void setSparkSession(SparkContext sc)
{
ss = new SparkSession(sc);
}
private static Dataset<Row> loadNeo4jDataframe(String pAutosysBoxCaption)
{
Neo4j neo4j = new Neo4j(sc);
Dataset<Row> rdd = neo4j.cypher("deleted cypher for irrelevance", a).loadDataFrame();
//initalizing ss after data is fetched using SparkContext of neo4j
setSparkSession(neo4j.sc());
return rdd;
}
Update 2
From the comments, I just realised that Neo4j creates its own Spark session using the SparkContext sc instance provided to it, and I don't have access to that Spark session. So how am I supposed to add/register an arbitrary dataframe (here, neo4jdf) that was created in some other Spark session (here, the Spark session created by neo4j.cypher) to my Spark session ss?
Based on the symptoms we can infer that both pieces of code use different SparkSession / SQLContext instances. Assuming there is nothing unusual going on in the Neo4j connector, you should be able to fix this by changing:
private static void setSparkSession()
{
ss = SparkSession.builder().getOrCreate();
}
or by initializing SparkSession before calling setNeo4jJavaSparkContext.
If these don't work, you can switch to using createGlobalTempView.
Important:
In general I would recommend initializing a single SparkSession using the builder pattern, and deriving other contexts (SparkContexts) from it when necessary, as in the sketch below.
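A minimal sketch of that setup (the Bolt options are copied from the question; assuming Spark 2.x, where JavaSparkContext.fromSparkContext is available):
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
....
SparkSession ss = SparkSession.builder()
        .appName("GetNeo4jToRddDemo")
        .master("local")
        .config("spark.neo4j.bolt.url", "bolt://ll.mm.nn.oo:7687")
        .config("spark.neo4j.bolt.user", "neo4j")
        .config("spark.neo4j.bolt.password", "admin")
        .getOrCreate();

// Derive the Java context from the single session, instead of building the session
// from a separately created context, so the Neo4j connector and ss.sql(...) share it.
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(ss.sparkContext());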
I want to save a Twitter stream into an HBase database. What I have now is the Spark application to receive and transform the data, but I don't know how to save my TwitterStream into HBase.
The only thing I found that could be useful is the PairRDD.saveAsNewAPIHadoopDataset(conf) method. But how do I use it, and which configuration do I have to set up to be able to save the RDD data to my HBase table?
The only other thing I found so far is the HBase client library, which can insert data into a table via Put objects. But that isn't a solution for use inside a Spark program, is it (it would be necessary to iterate over all items inside the RDD!)?
Can someone give an example in Java? My main problem seems to be the set-up of the org.apache.hadoop.conf.Configuration instance I have to submit to saveAsNewAPIHadoopDataset...
Here a code snippet:
JavaReceiverInputDStream<Status> statusDStream = TwitterUtils.createStream(streamingCtx);
JavaPairDStream<Long, String> statusPairDStream = statusDStream.mapToPair(new PairFunction<Status, Long, String>() {
public Tuple2<Long, String> call(Status status) throws Exception {
return new Tuple2<Long, String> (status.getId(), status.getText());
}
});
statusPairDStream.foreachRDD(new Function<JavaPairRDD<Long,String>, Void>() {
public Void call(JavaPairRDD<Long, String> status) throws Exception {
org.apache.hadoop.conf.Configuration conf = new Configuration();
status.saveAsNewAPIHadoopDataset(conf);
// HBase PUT here can't be correct!?
return null;
}
});
First of all, anonymous Function classes are discouraged if you are using Java 8; please use lambdas.
The code snippet below should address all your queries.
sample snippet:
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
....
public static void processYourMessages(final JavaRDD<YourMessage> rdd, final HiveContext sqlContext,
MyMessageUtil messageutil) throws Exception {
final JavaRDD<Row> yourrdd = rdd.filter(msg -> messageutil.filterType(.....)); // create a java rdd
final JavaPairRDD<ImmutableBytesWritable, Put> yourrddPuts = yourrdd.mapToPair(row -> messageutil.getPuts(row));
yourrddPuts.saveAsNewAPIHadoopDataset(conf);
}
where conf is set up like below:
private Configuration conf = HBaseConfiguration.create();
conf.set(ZOOKEEPER_QUORUM, "comma seperated list of zookeeper quorum");
conf.set("hbase.mapred.outputtable", "your table name");
conf.set("mapreduce.outputformat.class", "org.apache.hadoop.hbase.mapreduce.TableOutputFormat");
MyMessageUtil has a getPuts method, which looks like below:
public Tuple2<ImmutableBytesWritable, Put> getPuts(Row row) throws Exception {
Put put = ..// prepare your put with all the columns you have.
return new Tuple2<ImmutableBytesWritable, Put>(new ImmutableBytesWritable(), put);
}
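Applied to the Twitter pair stream from the question, a minimal sketch could look like this (assuming the HBase 1.x client and a Spark version whose foreachRDD accepts a lambda; the table name, column family/qualifier and ZooKeeper quorum are illustrative assumptions):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import scala.Tuple2;
....
statusPairDStream.foreachRDD(statusRdd -> {
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.zookeeper.quorum", "zkhost1,zkhost2,zkhost3");
    conf.set("hbase.mapred.outputtable", "tweets");
    conf.set("mapreduce.outputformat.class", "org.apache.hadoop.hbase.mapreduce.TableOutputFormat");
    statusRdd.mapToPair(idAndText -> {
        // Row key = tweet id, one column ("d:text") holding the tweet text.
        Put put = new Put(Bytes.toBytes(idAndText._1));
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("text"), Bytes.toBytes(idAndText._2));
        return new Tuple2<>(new ImmutableBytesWritable(), put);
    }).saveAsNewAPIHadoopDataset(conf);
});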
Hope this helps!
After many hours of tinkering and reading the whole internet several times I just can't figure out how to sign requests for use with the Product Advertising API.
So far I managed to generate a client from the provided WSDL file. I used a tutorial by Amazon for this. You can find it here:
Tutorial for generating the web service client
So far no problems. To test the client I wrote a small piece of code. The code is intended to simply get some information about a product. The product is specified by its ASIN.
The code:
package client;
import com.ECS.client.jax.AWSECommerceService;
import com.ECS.client.jax.AWSECommerceServicePortType;
import com.ECS.client.jax.ItemLookup;
import com.ECS.client.jax.ItemLookupResponse;
import com.ECS.client.jax.ItemLookupRequest;
public class Client {
public static void main(String[] args) {
System.out.println("API Test startet");
AWSECommerceService service = new AWSECommerceService();
AWSECommerceServicePortType port = service.getAWSECommerceServicePort();
ItemLookupRequest itemLookup = new ItemLookupRequest();
itemLookup.setIdType("ASIN");
itemLookup.getItemId().add("B000RE216U");
ItemLookup lookup = new ItemLookup();
lookup.setAWSAccessKeyId("<mykeyishere>");
lookup.getRequest().add(itemLookup);
ItemLookupResponse response = port.itemLookup(lookup);
String r = response.toString();
System.out.println("response: " + r);
System.out.println("API Test stopped");
}
}
As you can see there is no part where I sign the request. I have worked my way through a lot of the classes used and found no methods for signing the request.
So, how to sign a request?
I actually found something in the documentation: request authentication
But they don't use their own API there; the proposed solutions are more or less for manual use only. So I looked through the client classes to see whether I could get the request URL and add all the parts needed for request signing myself. But there are no such methods.
I hope someone can point out what I am doing wrong.
This is what I did to solve the problem. All the credit goes to Jon and the guys of the Amazon forums.
Before I outline what I did, here is a link to the post which helped me to solve the problem: Forum Post on Amazon forums.
I downloaded the awshandlerresolver.java which is linked in the post. Then I modified my own code so it looks like this:
package client;
import com.ECS.client.jax.AWSECommerceService;
import com.ECS.client.jax.AWSECommerceServicePortType;
import com.ECS.client.jax.ItemLookup;
import com.ECS.client.jax.ItemLookupResponse;
import com.ECS.client.jax.ItemLookupRequest;
public class Client {
public static void main(String[] args) {
System.out.println("API Test startet");
AWSECommerceService service = new AWSECommerceService();
service.setHandlerResolver(new AwsHandlerResolver("<Secret Key>")); // important
AWSECommerceServicePortType port = service.getAWSECommerceServicePort();
ItemLookupRequest itemLookup = new ItemLookupRequest();
itemLookup.setIdType("ASIN");
itemLookup.getItemId().add("B000RE216U");
ItemLookup lookup = new ItemLookup();
lookup.setAWSAccessKeyId("<Access Key>"); // important
lookup.getRequest().add(itemLookup);
ItemLookupResponse response = port.itemLookup(lookup);
String r = response.toString();
System.out.println("response: " + r);
System.out.println("API Test stopped");
}
}
The printlns at the end are more or less useless, but it works. I also used the WSDL Jon linked to generate a new web service client; I just changed the URLs in the tutorial I posted in my question.
Try this after you create the service:
service.setHandlerResolver(new AwsHandlerResolver(my_AWS_SECRET_KEY));
You'll need this class and this jar file to add as references to your project, as AwsHandlerResolver uses Base64 encoding.
You'll need to rename the AwsHandlerResolver file to match the name of the class, as the file name is all lower case.
I think the rest of the code you have is fine.
The WSDL is http://webservices.amazon.com/AWSECommerceService/AWSECommerceService.wsdl
This discussion and the related Amazon post helped me get the client working. That being said, I felt that the solution could be improved with regards to the following:
Setting web service handlers in code is discouraged. An XML configuration file and a corresponding @HandlerChain annotation are recommended.
A SOAPHandler is not required in this case; a LogicalHandler would do just fine. A SOAPHandler has more reach than a LogicalHandler, and when it comes to code, more access is not always good.
Stuffing the signature generation, the addition of a Node, and the printing of the request into one handler seems like a little too much. These could be separated out for separation of responsibility and ease of testing. One approach would be to add the Node using an XSLT transformation so that the handler could remain oblivious to the transformation logic. Another handler could then be chained which just prints the request.
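As a sketch of the declarative alternative from the first point (the file name and the signing handler class are assumptions, not part of the original code), the @HandlerChain annotation on the generated Service class would reference an XML file on the classpath:
@javax.jws.HandlerChain(file = "handler-chain.xml")
public class AWSECommerceService extends javax.xml.ws.Service {
    // ... generated client code unchanged ...
}

/* handler-chain.xml (client.AwsSigningHandler is a hypothetical handler performing the signing):
<handler-chains xmlns="http://java.sun.com/xml/ns/javaee">
  <handler-chain>
    <handler>
      <handler-class>client.AwsSigningHandler</handler-class>
    </handler>
  </handler-chain>
</handler-chains>
*/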
Example
I did this in Spring and it's working fine.
package com.bookbub.application;
import com.ECS.client.jax.*;
import com.ECS.client.jax.ItemSearch;
import javax.xml.ws.Holder;
import java.math.BigInteger;
import java.util.List;
public class TestClient {
private static final String AWS_ACCESS_KEY_ID = "AI*****2Y7Z****DIHQ";
private static final String AWS_SECRET_KEY = "lIm*****dJuiy***YA+g/vnj/Ix*****Oeu";
private static final String ASSOCIATE_TAG = "****-**";
public static void main(String[] args) {
TestClient ist = new TestClient();
ist.runSearch();
}
public void runSearch()
{
AWSECommerceService service = new AWSECommerceService();
service.setHandlerResolver(new AwsHandlerResolver(AWS_SECRET_KEY));
AWSECommerceServicePortType port = service.getAWSECommerceServicePort();
ItemSearchRequest request = new ItemSearchRequest();
request.setSearchIndex("Books");
request.setKeywords("java web services up and running oreilly");
ItemSearch search = new ItemSearch();
search.getRequest().add(request);
search.setAWSAccessKeyId(AWS_ACCESS_KEY_ID);
Holder<OperationRequest> operation_request =null;
Holder<List<Items>> items = new Holder<List<Items>>();
port.itemSearch(
search.getMarketplaceDomain(),
search.getAWSAccessKeyId(),
search.getAssociateTag(),
search.getXMLEscaping(),
search.getValidate(),
search.getShared(),
search.getRequest(),
operation_request,
items);
java.util.List<Items> result = items.value;
BigInteger totalPages = result.get(0).getTotalResults();
System.out.println(totalPages);
for (int i = 0; i < result.get(0).getItem().size(); ++i)
{ Item myItem = result.get(0).getItem().get(i);
System.out.print(myItem.getASIN());
System.out.print(", ");
System.out.println(myItem.getDetailPageURL());
System.out.print(", ");
System.out.println(myItem.getSmallImage() == null ? "" : myItem.getSmallImage().getURL());
}
}
}