I'm still fairly new to Spark, and I have a question.
Let's say I need to submit a Spark application to a 4-node cluster, and each node has a standalone storage backend (e.g. RocksDB) containing exactly the same key/value rows, from which I need to read the data to process. I can create an RDD by getting all the rows I need from the storage and calling parallelize on the dataset:
public JavaRDD<Value> parallelize(Map<Key, Value> data) {
    return sparkContext.parallelize(new ArrayList<>(data.values()));
}
However, I still need to load every row that I need to process into memory from disk on every node in the cluster, even though each node is only going to process part of it, since the data has to be in the Map structure before creating the RDD.
Is there another way to do this, or am I seeing this wrong? The database is not supported by Hadoop, and I can't use HDFS for this use case. It isn't supported via JDBC either.
Thank you in advance.
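For illustration only, here is a hedged sketch of one possible alternative: parallelize just the keys and have each partition read its own values from the node-local store inside mapPartitions, so the full value set never has to sit in a Map before the RDD exists. The store path, the key encoding and the Spark 2.x FlatMapFunction signature below are assumptions, not anything from the original setup.

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;

// Sketch: ship only the keys to the executors; each partition opens the
// node-local store and fetches just the values it is responsible for.
public JavaRDD<byte[]> parallelizeByKey(JavaSparkContext sparkContext, List<String> keys) {
    return sparkContext.parallelize(keys).mapPartitions((Iterator<String> it) -> {
        RocksDB.loadLibrary();
        List<byte[]> values = new ArrayList<>();
        try (Options options = new Options();
             RocksDB db = RocksDB.openReadOnly(options, "/data/local-store")) { // hypothetical path
            while (it.hasNext()) {
                values.add(db.get(it.next().getBytes(StandardCharsets.UTF_8)));
            }
        }
        return values.iterator();
    });
}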
I am using Apache Beam to read messages from PubSub and write them to BigQuery. What I'm trying to do is write to multiple tables according to the information in the input. To reduce the amount of writes, I am using windowing on the input from PubSub.
A small example:
messages
    .apply(new PubsubMessageToTableRow(options))
    .get(TRANSFORM_OUT)
    .apply(ParDo.of(new CreateKVFromRow()))
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(10L))))
    // group by key
    .apply(GroupByKey.create())
    // Are these two rows what I want?
    .apply(Values.create())
    .apply(Flatten.iterables())
    .apply(BigQueryIO.writeTableRows()
        .withoutValidation()
        .withCreateDisposition(CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND)
        .withExtendedErrorInfo()
        .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
        .to((SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>) input -> {
            // Simplified for readability
            Integer destination = (Integer) input.getValue().get("key");
            return new TableDestination(
                new TableReference()
                    .setProjectId(options.getProjectID())
                    .setDatasetId(options.getDatasetID())
                    .setTableId(destination + "_Table"),
                "Table Destination");
        }));
I couldn't find anything in the documentation, but I was wondering: how many writes are done per window? If the elements go to multiple tables, is it one write per table for all elements in the window? Or is it one write per element, since the destination table might be different for each element?
Since you're using PubSub as a source, your job appears to be a streaming job. Therefore, the default insertion method is STREAMING_INSERTS (see docs). I don't see any benefit or reason to reduce writes with this method, as billing is based on the size of the data. By the way, your example doesn't really reduce the number of writes effectively anyway.
Although it is a streaming job, the FILE_LOADS method has also been supported for a few versions now. If withMethod is set to FILE_LOADS, you can define withTriggeringFrequency on BigQueryIO. This setting defines the frequency at which the load job happens. The connector handles everything for you here, and you don't need to group by key or window the data. A load job will be started for each table.
Since it seems to be perfectly fine for you if it takes some time until your data is in BigQuery, I'd suggest using FILE_LOADS, as loading is free as opposed to streaming inserts. Just mind the quotas when defining the triggering frequency.
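For illustration, a minimal sketch of the FILE_LOADS variant, assuming the same options object and per-element table routing as in the question; the ten-minute frequency and single file shard are placeholder values:

messages
    .apply(new PubsubMessageToTableRow(options))
    .get(TRANSFORM_OUT)
    .apply(BigQueryIO.writeTableRows()
        .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
        // how often a load job is started; pick a value that stays within the load-job quotas
        .withTriggeringFrequency(Duration.standardMinutes(10L))
        // required when FILE_LOADS is combined with a triggering frequency
        .withNumFileShards(1)
        .withCreateDisposition(CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND)
        .to((SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>) input ->
            // same per-element routing as in the question, simplified to a table spec string
            new TableDestination(
                options.getProjectID() + ":" + options.getDatasetID() + "."
                    + input.getValue().get("key") + "_Table",
                "Table Destination")));

The CreateKVFromRow / GroupByKey / windowing steps from the question are omitted on purpose, since with FILE_LOADS the connector batches rows per destination table itself.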
I have a Hazelcast cluster that performs several calculations for a Java client, triggered from the command line. I need to persist parts of the calculated results on the client system while the nodes are still working. I am going to store parts of the data in Hazelcast maps. Now I am looking for a way to inform the client that a node has stored data in the map and that it can start using it. Is there a way to trigger client operations from any Hazelcast node?
Your question is not very clear, but it looks like you could use com.hazelcast.core.EntryListener to trigger a callback that notifies the client when a new entry is stored in the data map.
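For illustration, a minimal client-side sketch, assuming Hazelcast 3.x and a map named "partial-results" (both assumptions); the single-method EntryAddedListener variant of the listener keeps it short:

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.map.listener.EntryAddedListener;

public class ResultWatcher {

    public static void main(String[] args) {
        HazelcastInstance client = HazelcastClient.newHazelcastClient();
        IMap<String, byte[]> results = client.getMap("partial-results"); // hypothetical map name

        // Fires on the client whenever any member puts a new entry into the map.
        results.addEntryListener(
            (EntryAddedListener<String, byte[]>) event ->
                System.out.println("Partial result available: " + event.getKey()),
            true); // include the value in the event
    }
}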
Your member node can publish some intermediate results (or just a notification message) to a Hazelcast IQueue, ITopic or Ringbuffer.
A flow looks like this:
a client registers a listener for, say, a Ringbuffer or an ITopic.
the client submits the command to perform on the cluster.
a member persists intermediate results to IMaps or any other data structure.
the member sends a message to the topic about the availability of the partial results.
the client receives the message and accesses the data in the IMap.
the member sends a message when it's done with its task.
Something like that; a minimal sketch of the topic part follows.
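This is a hedged sketch only; the topic name and the payload (the key of the stored partial result) are assumptions:

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.ITopic;

public class ProgressTopic {

    // On a member: announce that a chunk of partial results has been stored in an IMap.
    public static void announce(HazelcastInstance member, String resultKey) {
        ITopic<String> topic = member.getTopic("partial-results-ready"); // hypothetical topic name
        topic.publish(resultKey);
    }

    // On the client: react to the announcement and fetch the data from the IMap.
    public static void listen(HazelcastInstance client) {
        ITopic<String> topic = client.getTopic("partial-results-ready");
        topic.addMessageListener(message ->
            System.out.println("Fetch partial result for key " + message.getMessageObject()));
    }
}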
You can find some examples here
Let me know if you have any questions about it.
Cheers,
Vik
There are several ways to solve the problem. The simplest one is to use a dedicated IMap or any other of Hazelcast's synchronized collections. One can simply write data into such a map and retrieve/remove it after it has been added. But this causes a huge overhead, because the data has to be synchronized throughout the cluster. If the data is quite big and the cluster is huge, with a few hundred nodes all over the world or at least the USA, the data gets synchronized across all nodes just to be deleted a few moments later, which also has to be synchronized. Not deleting is not an option, because the data can get several GB in size, which would make synchronizing it even more expensive. The question got answered, but the solution is not suited for every scenario.
Currently I have integrated Spark Streaming with Kafka in Java and am able to aggregate the stats. However, I cannot figure out a way to store the result in a Java object so I can pass this object around between different methods/classes without storing it in a database. I have spent quite some time searching for tutorials and examples online, but all of them end up using print() to display the result in the console. What I am trying to do instead is return these results as a JSON string when users call a REST API endpoint.
Is it possible to keep these results in memory and pass them around between different methods, or do I need to store them in a database first and fetch them from there as needed?
If I understood you correctly, you want to consume your results from Spark Streaming via a REST API.
Even if there are some ways to accomplish this directly (e.g. using the Spark SQL/Thrift server), I would separate these two tasks. Otherwise, if your Spark Streaming process fails, your service/REST API layer fails too.
So separating these two layers has its advantages. You are not forced to use a classical database. You could implement a service which implements/uses JCache and send the results of your Spark Streaming process to it.
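A hedged sketch of what that handover could look like (the cache name "stats" and the key "latest" are made up, a JCache provider is assumed to be on the classpath, and the per-batch aggregate is assumed small enough to collect to the driver):

import java.util.Map;

import javax.cache.Cache;
import javax.cache.CacheManager;
import javax.cache.Caching;
import javax.cache.configuration.MutableConfiguration;

import org.apache.spark.streaming.api.java.JavaPairDStream;

public class StatsPublisher {

    // Push each batch's aggregated stats into a JCache cache that a separate
    // REST layer can read from and serialize to JSON.
    public static void publish(JavaPairDStream<String, Long> aggregatedStats) {
        aggregatedStats.foreachRDD(rdd -> {
            Map<String, Long> snapshot = rdd.collectAsMap(); // small aggregated result
            statsCache().put("latest", snapshot);            // "latest" is a made-up key
        });
    }

    private static Cache<String, Map<String, Long>> statsCache() {
        CacheManager manager = Caching.getCachingProvider().getCacheManager();
        Cache<String, Map<String, Long>> cache = manager.getCache("stats"); // hypothetical cache name
        if (cache == null) {
            cache = manager.createCache("stats", new MutableConfiguration<String, Map<String, Long>>());
        }
        return cache;
    }
}

The REST endpoint would then just read "latest" from the same cache (or whatever shared store you choose) and return it as JSON, independently of whether the streaming job is currently running.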
I am using Apache Spark to analyse data from Cassandra and will insert the data back into Cassandra, designing new Cassandra tables according to our queries. I want to know whether it is possible for Spark to do the analysis in real time. If yes, then how? I have read many tutorials about this, but found nothing.
I want to perform the analysis and insert into Cassandra as soon as data arrives in my table.
This is possible with Spark Streaming; you should take a look at the demos and documentation that come packaged with the Spark Cassandra Connector.
https://github.com/datastax/spark-cassandra-connector
This includes support for streaming, as well as support for creating new tables on the fly.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/8_streaming.md
Spark Streaming extends the core API to allow high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources such as Akka, Kafka, Flume, Twitter, ZeroMQ, TCP sockets, etc. Results can be stored in Cassandra.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md#saving-rdds-as-new-tables
Use saveAsCassandraTable method to automatically create a new table with given name and save the RDD into it. The keyspace you're saving to must exist. The following code will create a new table words_new in keyspace test with columns word and count, where word becomes a primary key:
case class WordCount(word: String, count: Long)
val collection = sc.parallelize(Seq(WordCount("dog", 50), WordCount("cow", 60)))
collection.saveAsCassandraTable("test", "words_new", SomeColumns("word", "count"))
Is there an efficient way to create a copy of table structure+data in HBase, in the same cluster? Obviously the destination table would have a different name. What I've found so far:
The CopyTable job, which has been described as a tool for copying data between different HBase clusters. I think it would support intra-cluster operation, but I have no knowledge of whether it has been designed to handle that scenario efficiently.
Using the export + import jobs. That sounds like a hack, but since I'm new to HBase maybe it's a real solution?
Some of you might be asking why I'm trying to do this. My scenario is that I have millions of objects I need access to, in a "snapshot" state if you will. There is a batch process that runs daily which updates many of these objects. If any step in that batch process fails, I need to be able to "roll back" to the original state. Not only that, during the batch process I need to be able to serve requests to the original state.
Therefore the current flow is that I duplicate the original table to a working copy, continue to serve requests using the original table while I update the working copy. If the batch process completes successfully I notify all my services to use the new table, otherwise I just discard the new table.
This has worked fine using BDB but I'm in a whole new world of really large data now so I might be taking the wrong approach. If anyone has any suggestions of patterns I should be using instead, they are more than welcome. :-)
All data in HBase has a certain timestamp. You can do reads (Gets and Scans) with a parameter indicating that you want the latest version of the data as of a given timestamp. One thing you could do is serve your requests with reads that use this parameter, pointing to a time before the batch process begins. Once the batch completes, bump your read timestamp up to the current state.
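A hedged sketch of what such a read could look like with the HBase client API (the table name is made up):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class SnapshotRead {

    // Scan the table as it looked just before the batch process started.
    public static void scanAsOf(long batchStartTimestamp) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("objects"))) { // hypothetical table
            Scan scan = new Scan();
            // Only return cell versions written strictly before the batch started.
            scan.setTimeRange(0L, batchStartTimestamp);
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    // serve the request from this "snapshot" view of the row
                }
            }
        }
    }
}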
A couple things to be careful of, if you take this approach:
HBase tables are configured to store the most recent N versions of a given cell. If you overwrite the data in the cell with N newer values, then you will lose the older value during the next compaction. (You can also configure them with a TTL to expire cells, but that doesn't quite sound like it matches your case.)
Similarly, if you delete the data as part of your process, then you won't be able to read it after the next compaction.
So, if you don't issue deletes as part of your batch process, and you don't write more versions of the same data that already exists in your table than you've configured it to save, you can keep serving old requests out of the same table that you're updating. This effectively gives you a snapshot.