I want to save a Twitter stream in an HBase database. What I have now is the Spark application that receives and transforms the data, but I don't know how to save my TwitterStream into HBase.
The only thing I found that could be useful is the PairRDD.saveAsNewAPIHadoopDataset(conf) method. But how should I use it, and which configuration do I have to set up to be able to save the RDD data to my HBase table?
The only other thing I have found so far is the HBase client library, which can insert data into a table via Put objects. But that isn't a solution for use inside a Spark program, is it (it would require iterating over all items inside the RDD)?
Can someone give an example in Java? My main problem seems to be setting up the org.apache.hadoop.conf.Configuration instance that I have to pass to saveAsNewAPIHadoopDataset.
Here is a code snippet:
JavaReceiverInputDStream<Status> statusDStream = TwitterUtils.createStream(streamingCtx);
JavaPairDStream<Long, String> statusPairDStream = statusDStream.mapToPair(new PairFunction<Status, Long, String>() {
public Tuple2<Long, String> call(Status status) throws Exception {
return new Tuple2<Long, String> (status.getId(), status.getText());
}
});
statusPairDStream.foreachRDD(new Function<JavaPairRDD<Long,String>, Void>() {
public Void call(JavaPairRDD<Long, String> status) throws Exception {
org.apache.hadoop.conf.Configuration conf = new Configuration();
status.saveAsNewAPIHadoopDataset(conf);
// HBase PUT here can't be correct!?
return null;
}
});
First of all, anonymous Function classes are discouraged if you are using Java 8; please use lambdas instead.
The code snippet below should address all your queries:
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
....
public static void processYourMessages(final JavaRDD<YourMessage> rdd, final HiveContext sqlContext,
final MyMessageUtil messageutil) throws Exception {
final JavaRDD<Row> yourrdd = rdd.filter(msg -> messageutil.filterType(.....)); // filter down to the messages you care about
final JavaPairRDD<ImmutableBytesWritable, Put> yourrddPuts = yourrdd.mapToPair(row -> messageutil.getPuts(row));
yourrddPuts.saveAsNewAPIHadoopDataset(conf);
}
where conf is created as follows:
private Configuration conf = HBaseConfiguration.create();
conf.set(ZOOKEEPER_QUORUM, "comma separated list of zookeeper quorum");
conf.set("hbase.mapred.outputtable", "your table name");
conf.set("mapreduce.outputformat.class", "org.apache.hadoop.hbase.mapreduce.TableOutputFormat");
MyMessageUtil has a getPuts method, which looks like this:
public Tuple2<ImmutableBytesWritable, Put> getPuts(Row row) throws Exception {
Put put = ..// prepare your put with all the columns you have.
return new Tuple2<ImmutableBytesWritable, Put>(new ImmutableBytesWritable(), put);
}
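Applied back to the Twitter stream in your question, a minimal sketch could look like the following. Assumptions on my side: Spark 2.x with Java 8 lambdas, an HBase 1.x client, an existing table named "tweets" with a column family "data" (both names are placeholders), and org.apache.hadoop.hbase.mapreduce.TableOutputFormat; Put, Bytes, ImmutableBytesWritable and HBaseConfiguration come from the hbase-client artifact.
statusPairDStream.foreachRDD(rdd -> {
    // Build the Hadoop configuration that saveAsNewAPIHadoopDataset expects;
    // a mapreduce Job instance is just a convenient way to assemble it.
    Configuration hbaseConf = HBaseConfiguration.create();
    hbaseConf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3"); // placeholder quorum
    Job job = Job.getInstance(hbaseConf);
    job.setOutputFormatClass(TableOutputFormat.class);
    job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "tweets");

    rdd.mapToPair(idAndText -> {
        Put put = new Put(Bytes.toBytes(idAndText._1()));            // row key = tweet id
        put.addColumn(Bytes.toBytes("data"), Bytes.toBytes("text"),
                Bytes.toBytes(idAndText._2()));                      // data:text = tweet text
        return new Tuple2<>(new ImmutableBytesWritable(), put);
    }).saveAsNewAPIHadoopDataset(job.getConfiguration());
});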
Hope this helps!
I am trying to create a simple application that consumes a Kafka message, does some CQL transformation, and publishes the result back to Kafka. Below is the code:
JAVA: 1.8
Flink: 1.13
Scala: 2.11
flink-siddhi: 2.11-0.2.2-SNAPSHOT
I am using library: https://github.com/haoch/flink-siddhi
input json to Kafka:
{
"awsS3":{
"ResourceType":"aws.S3",
"Details":{
"Name":"crossplane-test",
"CreationDate":"2020-08-17T11:28:05+00:00"
},
"AccessBlock":{
"PublicAccessBlockConfiguration":{
"BlockPublicAcls":true,
"IgnorePublicAcls":true,
"BlockPublicPolicy":true,
"RestrictPublicBuckets":true
}
},
"Location":{
"LocationConstraint":"us-west-2"
}
}
}
main class:
public class S3SidhiApp {
public static void main(String[] args) {
internalStreamSiddhiApp.start();
//kafkaStreamApp.start();
}
}
App class:
package flinksidhi.app;
import com.google.gson.JsonObject;
import flinksidhi.event.s3.source.S3EventSource;
import io.siddhi.core.SiddhiManager;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.siddhi.SiddhiCEP;
import org.json.JSONObject;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import static flinksidhi.app.connector.Consumers.createInputMessageConsumer;
import static flinksidhi.app.connector.Producer.*;
public class internalStreamSiddhiApp {
private static final String inputTopic = "EVENT_STREAM_INPUT";
private static final String outputTopic = "EVENT_STREAM_OUTPUT";
private static final String consumerGroup = "EVENT_STREAM1";
private static final String kafkaAddress = "localhost:9092";
private static final String zkAddress = "localhost:2181";
private static final String S3_CQL1 = "from inputStream select * insert into temp";
private static final String S3_CQL = "from inputStream select json:toObject(awsS3) as obj insert into temp;" +
"from temp select json:getString(obj,'$.awsS3.ResourceType') as affected_resource_type," +
"json:getString(obj,'$.awsS3.Details.Name') as affected_resource_name," +
"json:getString(obj,'$.awsS3.Encryption.ServerSideEncryptionConfiguration') as encryption," +
"json:getString(obj,'$.awsS3.Encryption.ServerSideEncryptionConfiguration.Rules[0].ApplyServerSideEncryptionByDefault.SSEAlgorithm') as algorithm insert into temp2; " +
"from temp2 select affected_resource_name,affected_resource_type, " +
"ifThenElse(encryption == ' ','Fail','Pass') as state," +
"ifThenElse(encryption != ' ' and algorithm == 'aws:kms','None','Critical') as severity insert into outputStream";
public static void start(){
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//DataStream<String> inputS = env.addSource(new S3EventSource());
//Flink kafka stream consumer
FlinkKafkaConsumer<String> flinkKafkaConsumer =
createInputMessageConsumer(inputTopic, kafkaAddress,zkAddress, consumerGroup);
//Add Data stream source -- flink consumer
DataStream<String> inputS = env.addSource(flinkKafkaConsumer);
SiddhiCEP cep = SiddhiCEP.getSiddhiEnvironment(env);
cep.registerExtension("json:toObject", io.siddhi.extension.execution.json.function.ToJSONObjectFunctionExtension.class);
cep.registerExtension( "json:getString", io.siddhi.extension.execution.json.function.GetStringJSONFunctionExtension.class);
cep.registerStream("inputStream", inputS, "awsS3");
inputS.print();
System.out.println(cep.getDataStreamSchemas());
//json needs extension jars to present during runtime.
DataStream<Map<String,Object>> output = cep
.from("inputStream")
.cql(S3_CQL1)
.returnAsMap("temp");
//Flink kafka stream Producer
FlinkKafkaProducer<Map<String, Object>> flinkKafkaProducer =
createMapProducer(env,outputTopic, kafkaAddress);
//Add Data stream sink -- flink producer
output.addSink(flinkKafkaProducer);
output.print();
try {
env.execute();
} catch (Exception e) {
e.printStackTrace();
}
}
}
Consumer class:
package flinksidhi.app.connector;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.json.JSONObject;
import java.util.Properties;
public class Consumers {
public static FlinkKafkaConsumer<String> createInputMessageConsumer(String topic, String kafkaAddress, String zookeeprAddr, String kafkaGroup ) {
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", kafkaAddress);
properties.setProperty("zookeeper.connect", zookeeprAddr);
properties.setProperty("group.id",kafkaGroup);
FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<String>(
topic,new SimpleStringSchema(),properties);
return consumer;
}
}
Producer class:
package flinksidhi.app.connector;
import flinksidhi.app.util.ConvertJavaMapToJson;
import org.apache.flink.api.common.serialization.SerializationSchema;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchema;
import org.json.JSONObject;
import java.util.Map;
public class Producer {
public static FlinkKafkaProducer<Tuple2> createStringProducer(StreamExecutionEnvironment env, String topic, String kafkaAddress) {
return new FlinkKafkaProducer<Tuple2>(kafkaAddress, topic, new AverageSerializer());
}
public static FlinkKafkaProducer<Map<String,Object>> createMapProducer(StreamExecutionEnvironment env, String topic, String kafkaAddress) {
return new FlinkKafkaProducer<Map<String,Object>>(kafkaAddress, topic, new SerializationSchema<Map<String, Object>>() {
@Override
public void open(InitializationContext context) throws Exception {
}
@Override
public byte[] serialize(Map<String, Object> stringObjectMap) {
String json = ConvertJavaMapToJson.convert(stringObjectMap);
return json.getBytes();
}
});
}
}
I have tried many things, but the code where the CQL is invoked is never called and doesn't even give any error, so I am not sure where it is going wrong.
If I do the same thing with an internal stream source and return the same input JSON as a string, it works.
Initial guess: if you are using event time, are you sure you have defined watermarks correctly? As stated in the docs:
(...) an incoming element is initially put in a buffer where elements are sorted in ascending order based on their timestamp, and when a watermark arrives, all the elements in this buffer with timestamps smaller than that of the watermark are processed (...)
If this doesn't help, I would suggest decomposing/simplifying the job to a bare minimum, for example just a source operator and some naive sink printing/logging elements, as sketched below. If that works, start adding back operators one by one. You could also start by simplifying your CEP pattern as much as possible.
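For example, a minimal sanity-check job could be as small as this, reusing the consumer factory from your question:
// Just source -> print, no Siddhi/CEP in between, to isolate the Kafka source.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
FlinkKafkaConsumer<String> consumer =
        createInputMessageConsumer(inputTopic, kafkaAddress, zkAddress, consumerGroup);
env.addSource(consumer).print();
env.execute("kafka-source-sanity-check");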
First of all, thanks a lot @Piotr Nowojski. It was your small pointer about event time that did it; no matter how many times I had pondered over the code, it had not come to my mind. So, while debugging the two cases:
With the internal data source, where processing succeeded, I saw while stepping through the flow that a watermark was processed after the data was processed, but I did not realize that it was implicitly managing the event time of the data.
With Kafka as the data source, while debugging I could very clearly see that no watermark was processed in the flow, but it did not occur to me that this was happening because the event time and watermarks were not handled properly.
I just had to add a single line of code to the application, which I understood from the following Flink Javadoc snippet:
@deprecated In Flink 1.12 the default stream time characteristic has been changed to {@link
* TimeCharacteristic#EventTime}, thus you don't need to call this method for enabling
* event-time support anymore. Explicitly using processing-time windows and timers works in
* event-time mode. If you need to disable watermarks, please use {@link
* ExecutionConfig#setAutoWatermarkInterval(long)}. If you are using {@link
* TimeCharacteristic#IngestionTime}, please manually set an appropriate {@link
* WatermarkStrategy}. If you are using generic "time window" operations (for example {@link
* org.apache.flink.streaming.api.datastream.KeyedStream#timeWindow(org.apache.flink.streaming.api.windowing.time.Time)}
* that change behaviour based on the time characteristic, please use equivalent operations
* that explicitly specify processing time or event time.
*/
From this I learned that by default Flink assumes event time, and for that watermarks need to be handled properly, which I hadn't done. So I added the line below to set the time characteristic of the Flink execution environment:
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
and kaboom... it started working. This call is deprecated and there is another way to configure it, but thanks a lot, it was a great pointer that helped me a lot, and I solved the issue.
Thanks again @Piotr Nowojski
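For future readers: since setStreamTimeCharacteristic is deprecated in Flink 1.12+, the Javadoc above hints at the non-deprecated route, which is to set an explicit WatermarkStrategy on the source. A minimal ingestion-time-like sketch (my assumption, reusing env and flinkKafkaConsumer from the app class above and importing org.apache.flink.api.common.eventtime.WatermarkStrategy):
// Make the Kafka source emit timestamps and watermarks so that downstream
// operators that buffer on watermarks can make progress. Here the element's
// arrival time is used as its event time, mimicking ingestion time.
WatermarkStrategy<String> ingestionLike = WatermarkStrategy
        .<String>forMonotonousTimestamps()
        .withTimestampAssigner((element, recordTimestamp) -> System.currentTimeMillis());
DataStream<String> inputS = env
        .addSource(flinkKafkaConsumer)
        .assignTimestampsAndWatermarks(ingestionLike);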
Hello, I am a newbie in Spark. I would like to build a small Spark project that collects and processes tweets from this social network with the help of the spark-streaming module (for a little university research project). But I have a small problem: I don't know how to get tweets only in English. Can anyone help me with this? I tried a filter operation on the already received data, but I get a java.lang.NullPointerException at this line: "if (status.getPlace().getCountryCode().equals("(us)"))". It is a bad solution anyway. Is it possible to filter the data before receiving it? Please help, I really don't know how to do this. I'll be happy to get your hints.
package TwitterAnalysis;
import org.apache.spark.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.*;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.twitter.*;
import twitter4j.GeoLocation;
import twitter4j.Status;
public class Twitter {
private static void setTwitterOAuth() {
System.setProperty("twitter4j.oauth.consumerKey", TwitterOAuthKey.consumerKey);
System.setProperty("twitter4j.oauth.consumerSecret", TwitterOAuthKey.consumerSecret);
System.setProperty("twitter4j.oauth.accessToken", TwitterOAuthKey.accessToken);
System.setProperty("twitter4j.oauth.accessTokenSecret", TwitterOAuthKey.accessTokenSecret);
}
public static void main(String [] args) {
setTwitterOAuth();
SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("SparkTwitter");
JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(1000));
JavaReceiverInputDStream<Status> twitterStream = TwitterUtils.createStream(jssc);
//filtering already received tweets
JavaDStream<Status> englishTweets=twitterStream.filter(
new Function <Status, Boolean>(){
public Boolean call (Status status){
if (status.getPlace().getCountryCode().equals("(us)")){
return true;
}else
{return false;}
}
}
);
//Without filter: Output text of all tweets
JavaDStream<String> statuses = englishTweets.map(
new Function<Status, String>() {
public String call(Status status) { return status.getText(); }
}
);
statuses.print();
jssc.start();
}
}
Here is the answer: I just created a new JavaDStream and used getLang() on the Status. The solution looks like this:
JavaDStream<Status> enTweetdDStream=twitterStream.filter((status) -> "en".equalsIgnoreCase(status.getLang()));
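For completeness, this is how it slots into the original streaming job (same jssc and twitterStream as above):
// Status.getLang() comes from twitter4j and avoids the NullPointerException
// that getPlace() can throw when a tweet has no place attached.
JavaDStream<Status> enTweetdDStream =
        twitterStream.filter(status -> "en".equalsIgnoreCase(status.getLang()));
JavaDStream<String> statuses = enTweetdDStream.map(Status::getText);
statuses.print();
jssc.start();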
I am fetching Neo4j data into a Spark dataframe using the neo4j-spark connector. I can fetch it successfully, since I am able to show the dataframe. I then register the dataframe with the createOrReplaceTempView() method. But when I try running Spark SQL on it, it throws an exception saying
org.apache.spark.sql.AnalysisException: Table or view not found: neo4jtable;
This is what my whole code looks like:
import java.text.ParseException;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.AnalysisException;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.neo4j.spark.Neo4JavaSparkContext;
import org.neo4j.spark.Neo4j;
import scala.collection.immutable.HashMap;
public class Neo4jDF {
private static Neo4JavaSparkContext neo4jjsc;
private static SparkConf sConf;
private static JavaSparkContext jsc;
private static SparkContext sc;
private static SparkSession ss;
private static Dataset<Row> neo4jdf;
static String neo4jip = "ll.mm.nn.oo";
public static void main(String[] args) throws AnalysisException, ParseException
{
setSparkConf();
setJavaSparkContext();
setNeo4jJavaSparkContext();
setSparkContext();
setSparkSession();
neo4jdf = loadNeo4jDataframe();
neo4jdf.createOrReplaceTempView("neo4jtable");
neo4jdf.show(false); //this prints correctly
Dataset<Row> neo4jdfsqled = ss.sql("SELECT * from neo4jtable");
neo4jdfsqled.show(false); //this throws exception
}
private static Dataset<Row> loadNeo4jDataframe(String pAutosysBoxCaption)
{
Neo4j neo4j = new Neo4j(jsc.sc());
HashMap<String, Object> a = new HashMap<String, Object>();
Dataset<Row> rdd = neo4j.cypher("cypher query deleted for irrelevance", a).loadDataFrame();
return rdd;
}
private static void setSparkConf()
{
sConf = new SparkConf().setAppName("GetNeo4jToRddDemo");
sConf.set("spark.neo4j.bolt.url", "bolt://" + neo4jip + ":7687");
sConf.set("spark.neo4j.bolt.user", "neo4j");
sConf.set("spark.neo4j.bolt.password", "admin");
sConf.setMaster("local");
sConf.set("spark.testing.memory", "471859200");
sConf.set("spark.sql.warehouse.dir", "file:///D:/Mahesh/workspaces/spark-warehouse");
}
private static void setJavaSparkContext()
{
jsc = new JavaSparkContext(sConf);
}
private static void setSparkContext()
{
sc = JavaSparkContext.toSparkContext(jsc);
}
private static void setSparkSession()
{
ss = new SparkSession(sc);
}
private static void setNeo4jJavaSparkContext()
{
neo4jjsc = Neo4JavaSparkContext.neo4jContext(jsc);
}
}
I feel the issue might be with how all the Spark environment objects are created.
I first created SparkConf sConf.
From sConf, I created JavaSparkContext jsc.
From jsc, I created SparkContext sc.
From sc, I created SparkSession ss.
From ss, I created Neo4JavaSparkContext neo4jjsc.
So visually:
sConf -> jsc -> sc -> ss
-> neo4jjsc
Also note that
Inside loadNeo4jDataframe(), I use sc to instantiate the Neo4j instance neo4j, which is then used for fetching the Neo4j data.
Data is fetched using the Neo4j instance.
neo4jjsc is never used, but I kept it as a possible hint for the issue.
Given all these points and observations, can you tell me why I get the "table not found" exception? I must be missing something stupid. :\
Update
I tried setting ss (after data is fetched using the SparkContext of neo4j) as follows:
private static void setSparkSession(SparkContext sc)
{
ss = new SparkSession(sc);
}
private static Dataset<Row> loadNeo4jDataframe(String pAutosysBoxCaption)
{
Neo4j neo4j = new Neo4j(sc);
Dataset<Row> rdd = neo4j.cypher("deleted cypher for irrelevance", a).loadDataFrame();
//initalizing ss after data is fetched using SparkContext of neo4j
setSparkSession(neo4j.sc());
return rdd;
}
Update 2
From the comments, I just realised that neo4j creates its own Spark session using the SparkContext instance sc provided to it, and I don't have access to that Spark session. So how am I supposed to add / register an arbitrary dataframe (here, neo4jdf) that was created in some other Spark session (here, the Spark session created by neo4j.cypher) to my Spark session ss?
Based on the symptoms, we can infer that the two pieces of code use different SparkSession / SQLContext instances. Assuming there is nothing unusual going on in the Neo4j connector, you should be able to fix this by changing:
private static void setSparkSession()
{
ss = SparkSession.builder().getOrCreate();
}
or by initializing SparkSession before calling setNeo4jJavaSparkContext.
If these don't work, you can switch to using createGlobalTempView, as sketched below.
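A minimal sketch of the createGlobalTempView route (assuming Spark 2.1+, where global temp views live in the global_temp database and are shared across all sessions of the application):
// Global temp views are visible to every SparkSession in the application,
// including the one the connector created internally.
neo4jdf.createGlobalTempView("neo4jtable");
Dataset<Row> neo4jdfsqled = ss.sql("SELECT * FROM global_temp.neo4jtable");
neo4jdfsqled.show(false);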
Important:
In general I would recommend initializing a single SparkSession using the builder pattern and deriving other contexts (SparkContexts) from it when necessary.
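A minimal sketch of that structure, reusing the connector factory method from your question (the bolt settings are the same placeholders used there):
// Build one SparkSession first, then derive the Java context the Neo4j
// connector needs from it, so everything shares the same session.
SparkSession ss = SparkSession.builder()
        .appName("GetNeo4jToRddDemo")
        .master("local")
        .config("spark.neo4j.bolt.url", "bolt://ll.mm.nn.oo:7687")
        .config("spark.neo4j.bolt.user", "neo4j")
        .config("spark.neo4j.bolt.password", "admin")
        .getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(ss.sparkContext());
Neo4JavaSparkContext neo4jjsc = Neo4JavaSparkContext.neo4jContext(jsc);
// A temp view registered on a dataframe produced through this context
// should now be visible to ss.sql(...).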
I am trying to create a custom Flume HTTPSourceHandler that handles the contents of a file that is sent in the POST body of an HTTP request, and the payload of that post will be gzipped.
I am new to Flume, and struggling to understand how to return the contents of this GZip file (or any data for that matter) as Flume events.
Here is some incomplete code I am working on. The main goal right now is simply to print the contents of the file to the console.
Any tips, examples, etc. would be very helpful.
import org.apache.flume.Event;
import org.apache.flume.source.http.HTTPSourceHandler;
import org.apache.http.HttpHeaders;
import javax.servlet.http.HttpServletRequest;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
public class HttpGzipHandler extends HTTPSourceHandler{
public HttpGzipHandler(){
}
public List<Event> getEvents(HttpServletRequest request) throws Exception {
boolean isGzipped = request.getHeader(HttpHeaders.CONTENT_ENCODING) != null
&& request.getHeader(HttpHeaders.CONTENT_ENCODING).contains("gzip");
GZIPInputStream gzipInputStream = new GZIPInputStream(request.getInputStream());
List<Event> eventList = new ArrayList<Event>(0);
//TODO: Return the Events
}
}
You may have a look at a custom HTTP handler I've developed for a tool named Cygnus as inspiration. I think the important part for you will be the code where the event is created and emitted:
// create the appropriate headers
Map<String, String> eventHeaders = new HashMap<String, String>();
eventHeaders.put(..., ...);
// create the event list containing only one event
ArrayList<Event> eventList = new ArrayList<Event>();
Event event = EventBuilder.withBody(data.getBytes(), eventHeaders);
eventList.add(event);
return eventList;
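Applied to the handler in your question, a minimal sketch of getEvents() could look like this (my assumptions: one Flume event per POST body, the stock org.apache.flume.event.EventBuilder, plus the obvious java.io / java.util imports; configure() and error handling are omitted):
// Decompress the POST body and wrap it in a single Flume event.
GZIPInputStream gzipInputStream = new GZIPInputStream(request.getInputStream());
ByteArrayOutputStream body = new ByteArrayOutputStream();
byte[] buffer = new byte[4096];
int read;
while ((read = gzipInputStream.read(buffer)) != -1) {
    body.write(buffer, 0, read);                      // collect the decompressed bytes
}
Map<String, String> eventHeaders = new HashMap<String, String>();  // add any headers your sink needs
List<Event> eventList = new ArrayList<Event>(1);
eventList.add(EventBuilder.withBody(body.toByteArray(), eventHeaders));
return eventList;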
I am quite new to Cassandra and Hector and am trying to create a super column. I have already done a lot of research, but somehow nothing has worked so far. During my research on Stack Overflow I found this question, which seemed helpful for me. So I tried to adapt the code for my example, but I'm getting an exception.
This is my code (it should be copy/pastable if you have Hector); sorry if it is not perfectly readable, I did a lot of trial and error before asking here:
import java.util.Arrays;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.cassandra.service.ThriftKsDef;
import me.prettyprint.cassandra.service.template.SuperCfTemplate;
import me.prettyprint.cassandra.service.template.SuperCfUpdater;
import me.prettyprint.cassandra.service.template.ThriftSuperCfTemplate;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.ddl.ColumnFamilyDefinition;
import me.prettyprint.hector.api.ddl.ComparatorType;
import me.prettyprint.hector.api.ddl.KeyspaceDefinition;
import me.prettyprint.hector.api.factory.HFactory;
public class DatabaseDataImporter {
private Cluster myCluster;
private KeyspaceDefinition keyspaceDefinition;
private Keyspace keyspace;
private SuperCfTemplate<String, String, String> template;
final static StringSerializer ss = StringSerializer.get();
public DatabaseDataImporter() {
initializeCluster();
SuperCfTemplate<String, String, String> template = new ThriftSuperCfTemplate<String, String, String>(
keyspace, "Nodes", ss, ss, ss);
SuperCfUpdater<String, String, String> updater = template
.createUpdater("key", "newcf");
updater.setString("subname", "1");
template.update(updater);
}
private void initializeCluster() {
// get Cluster
myCluster = HFactory.getOrCreateCluster("Test Cluster",
"localhost:9160");
keyspaceDefinition = myCluster.describeKeyspace("Graphs");
// If keyspace does not exist, the CFs don't exist either. => create
// them.
if (keyspaceDefinition == null) {
createSchema();
}
keyspace = HFactory.createKeyspace("Graphs", myCluster);
}
private void createSchema() {
// get Cluster
Cluster myCluster = HFactory.getOrCreateCluster("Test Cluster",
"localhost:9160");
// add Schema
int replicationFactor = 1;
ColumnFamilyDefinition cfDef = HFactory.createColumnFamilyDefinition(
"Graphs", "Nodes", ComparatorType.BYTESTYPE);
KeyspaceDefinition newKeyspace = HFactory.createKeyspaceDefinition(
"Graphs", ThriftKsDef.DEF_STRATEGY_CLASS, replicationFactor,
Arrays.asList(cfDef));
// Add the schema to the cluster.
// "true" as the second param means that Hector will block until all
// nodes see the change.
myCluster.addKeyspace(newKeyspace, true);
}
public static void main(String[] args) {
new DatabaseDataImporter();
}
}
The exception I get is:
Exception in thread "main" me.prettyprint.hector.api.exceptions.HInvalidRequestException: InvalidRequestException(why:supercolumn parameter is invalid for standard CF Nodes)
at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:52)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:260)
at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:113)
at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243)
at me.prettyprint.cassandra.service.template.AbstractColumnFamilyTemplate.executeBatch(AbstractColumnFamilyTemplate.java:115)
at me.prettyprint.cassandra.service.template.AbstractColumnFamilyTemplate.executeIfNotBatched(AbstractColumnFamilyTemplate.java:159)
at me.prettyprint.cassandra.service.template.SuperCfTemplate.update(SuperCfTemplate.java:203)
at algorithms.DatabaseDataImporter.<init>(DatabaseDataImporter.java:43)
at algorithms.DatabaseDataImporter.main(DatabaseDataImporter.java:87)
Caused by: InvalidRequestException(why:supercolumn parameter is invalid for standard CF Nodes)
at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:20833)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:964)
at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:950)
at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:246)
at me.prettyprint.cassandra.model.MutatorImpl$3.execute(MutatorImpl.java:243)
at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:104)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:253)
... 7 more
I can understand that I'm doing something wrong because I'm attempting to insert a super column into a standard column family. (Source for this is here.) So maybe this is the code where everything goes wrong during creation:
ColumnFamilyDefinition cfDef = HFactory.createColumnFamilyDefinition(
"Graphs", "Nodes", ComparatorType.BYTESTYPE);
And this is the point where I don't know how to proceed. I tried to find a "SuperColumnFamilyDefinition" class, but I couldn't find one. Do you have any ideas or suggestions about what I need to change to fix my code? I would be very happy.
Thanks a lot for every thought you're sharing with me.
I found the answer to my problem and would like to share it (maybe it will help someone in the future). As I thought, the solution was quite simple:
ColumnFamilyDefinition cfDef = HFactory.createColumnFamilyDefinition(
"Graphs", "Nodes", ComparatorType.BYTESTYPE);
needed to be extended to
ColumnFamilyDefinition cfDef = HFactory.createColumnFamilyDefinition(
"Graphs", "Nodes", ComparatorType.BYTESTYPE);
// defines it as super column
((ThriftCfDef) cfDef).setColumnType(ColumnType.SUPER);
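For completeness, the cast and the enum in that last snippet come from Hector's Thrift layer. As far as I can tell, the required imports are the following, but please double-check against your Hector version:
// Assumed import locations for the ThriftCfDef cast and ColumnType.SUPER.
import me.prettyprint.cassandra.service.ThriftCfDef;
import me.prettyprint.hector.api.ddl.ColumnType;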