I have a Java function that runs on a single HBase row (a Result), it takes a Result as an input and outputs a byte[]. I would like to run this function on 10K-100K HBase rows and collect the results. I have a List<byte[]> which is the rows I'd like to run this function on, they are distributed evenly across all regions of the table. I would like to do so under these constraints:
Not ship all the rows from the server to the client
No long job init, the entire operations is expected to run in under a second
Utilize processing power of the Hadoop cluster and not the processing power of the client
Obviously, not depend upon the size of the HBase table which can be billions of rows
What's the best way to achieve this? I've thought of these options:
Spark - I'm not sure if this is a good option if my job runs on a tiny % of the number of rows in the table
Coprocessor - is there a way to run coprocessors in bulk on a List<byte[]> of rowkeys and collect the result? Will the work be processed in parallel by the cluster?
Implementing a custom HBase filter and then doing a bulk Get on the List<byte[]> with the custom filter - The Get will be processed by all region servers in parallel and can run custom logic, but this seems like a hack and I'm not sure a custom filter can return data that wasn't present in one of the columns of the row.
Related
I want to count the number of rows in a table three times depending on three filters/conditions. I want to know which one of the following two ways is better for performance and cost-efficiency. We are using AWS as our server, java spring to develop server-side API and MySQL for the database.
Use the count feature of MySQL to query three times in the database for three filtering criteria to get the three count result.
Fetch all the rows of the table from the database first using only one query. Then using java stream three times based on three filtering criteria to get the three count result.
It'll be better to go with option (1) in extreme cases. If it's slow to execute SELECT COUNT(*) FROM table then you should consider some tweak on SQL side. Not sure what you're using but I found this example for sql server
Assuming you go with Option (2) and you have hundreds of thousands of rows, I suspect that your application will run out of memory (especially under high load) before you have time to worry about slow response time from running SELECT count(*). Not to mention that you'll have lots of unnecessary rows and slow down transfer time between database and application
A basic argument against doing counts in the app is that hauling lots of data from the server to the client is time-consuming. (There are rare situations where it is worth the overhead.) Note that your client and AWS may be quite some distance apart, thereby exacerbating the cost of shoveling lots of data. I am skeptical of what you call "server-side API". But even if you can run Java on the server, there is still some cost of shoveling between MySQL and Java.
Sometimes this pattern lets you get 3 counts with one pass over the data:
SELECT
SUM(status='ready') AS ready_count,
SUM(status='complete') AS completed_count,
SUM(status='unk') AS unknown_count,
...
The trick here is that a Boolean expression has a value of 0 (for false) or 1 (for true). Hence the SUM() works like a 'conditional count'.
I am using Apache Beam to read messages from PubSub and write them to BigQuery. What I'm trying to do is write to multiple tables according to the information in the input. To reduce the amount of writes, I am using windowing on the input from PubSub.
A small example:
messages
.apply(new PubsubMessageToTableRow(options))
.get(TRANSFORM_OUT)
.apply(ParDo.of(new CreateKVFromRow())
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(10L))))
// group by key
.apply(GroupByKey.create())
// Are these two rows what I want?
.apply(Values.create())
.apply(Flatten.iterables())
.apply(BigQueryIO.writeTableRows()
.withoutValidation()
.withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withExtendedErrorInfo()
.withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
.to((SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>) input -> {
// Simplified for readability
Integer destination = (Integer) input.getValue().get("key");
return new TableDestination(
new TableReference()
.setProjectId(options.getProjectID())
.setDatasetId(options.getDatasetID())
.setTableId(destination + "_Table"),
"Table Destination");
}));
I couldn't find anything in the documentation, but I was wondering how many writes are done to each window? If these are multiple tables, is it one write for each table for all elements in the window? Or is it once for each element, as each table might by different for each element?
Since you're using PubSub as a source your job seems to be a streaming job. Therefore, the default insertion method is STREAMING_INSERTS(see docs). I don't see any benefit or reasons to reduce writes with this method as billig is based on the size of data. Btw, your example is more or less not really effectively reducing writes.
Although it is a streaming job, since a few versions the FILE_LOADS method is also supported. If withMethod is set to FILE_LOADS you can define withTriggeringFrequency on BigQueryIO. This setting defines the frequency in which the load job happens. Here the connector handles all for you and you don't need to group by key or window data. A load job will be started for each table.
Since it seems it is totally fine for you if it takes some time until your data is in BigQuery, I'd suggest to use FILE_LOADS as loading is free opposed to streaming inserts. Just mind the quotas when defining the triggering frequency.
I have around 6-8TB of data each in a sharded table with 5 partitions. This table is in HBase. I have built a Java based spark job that reads data from this table and performs some aggregations to get aggregates for a set of columns treated as key and then finally writes back the results into another table. Initially, i tried with spark map and foreach api, and performed aggregations in memory using data structures such HashMap. This was finally upserted into table using jdbc connection. However, the performance was really bad and the job never completed. Then, wrote a new job using DataFrames. I am pulling the data using HBaseRDD API and converting it into dataframe, then i perform groupBY and aggregations and finally saving the results using
" finalDF.save("org.apache.phoenix.spark", SaveMode.Overwrite, output_conf);"
This also was taking time,so i divided the task based on a key range and processed a range (say 1 million users) at a time with repartitioning the data by 2001 to ensure high compression.
DataFrame sessionDF = new PhoenixRDD(sqlContext.sparkContext(),inputTable,JavaConverters.asScalaBufferConverter(cols).asScala().toSeq(),Option.apply(filter),Option.apply(source),hconf).toDataFrame(sqlContext).repartition(partitions);
The spark job properties used are as below:
--spark.app.name test
--spark.master yarn
--spark.deploy.mode cluster
--spark.driver.cores 2
--spark.driver.memory 4G
--spark.executor.instances 8
--spark.executor.cores 2
--spark.executor.memory 16G
--spark.executor.heartbeatInterval 6000000
--spark.default.parallelism 2001
--spark.yarn.executor.memoryOverhead 4096
--spark.yarn.scheduler.heartbeat.interval-ms 6000000
--spark.network.timeout 6000000
--spark.serializer org.apache.spark.serializer.KryoSerializer
--spark.shuffle.io.retryWait 60s
--spark.shuffle.io.maxRetries 10
The problem is that this job takes around 8-10hrs to process just one million users which is close to 1TB of data and after that it usually start giving "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1" and takes another 5-6hrs to finish. I tried increasing executors and memory, but still end up with this issue somewhere during the run and its getting difficult to process this whole data.
Can someone please advise how can i improve the processing of this job?
Please let me know if you need any further information.
Here is the cut-down version of aggregation step:
finalDF
.select(col("OID"), col("CID")
, col(“P”))
.groupBy(col("OID")
, col("CID”))
.agg(sum(when(col("P").equalTo(lit("sd")).or(col("P").equalTo(lit("hd"))), lit(1)).otherwise(lit(0))).alias("P"));
There are many more fields and other aggregations as part of this statement.
I am new to Spring Batch and have just begun to conduct a POC to prove that Spring Batch is capable of processing 1m records in a hour. The architecture however demands that we demonstrate horizontally scalablity as well.
I have read through both the Partitoning and Remote Chunking strategies. Both make sense to me. The essential difference between the two is that Remote Chunking requires a durable message queue as the actual write out to the database or file happens from the master. In partioning a durable message queue is not needed as the write happens from the slave.
Where I am totally lost however is, how to ensure that the results of these 2 variants of parallel processing are written out in the correct sequence? .
Let's take partinioning for example. As far as I understand if a particular step dealing with 1000 records is partioned into 10 parallel step executions each having it's own Reader,Processor and Writer, one of the executions could easily complete before the other. The result is that the ItemWriter of one of the step executions could write the results of processing records 300-400 to a table before results of processing 200-300 are written out to the same table, as that particular step execution could be lagging behind.
What this means is that now I have a output table which does have all results of the processing but they are not in the correct sorted order. A further round of sequential processing may be required just bring them back to the correct sorted order of 1 through to 1000.
I am struggling to understand, how I can ensure correct sorted output and at the same time scale the system horizontally through the remote processing strategies described in Spring Batch.
I have read both these books. http://www.manning.com/templier/ as well as http://www.apress.com/9781430234524 but there is nothing in those books either that answer my question.
I think you can't do that because Table are naturally unsorted. If you need them to be ordered in some way add a order column managed by writer. First partition write 1-100, second partition 101-200 and so on. Next step reader will get items order by [order column]. Holes between order column due to missing write in previous partitioners are not an issue. My 2 cents
I'm currently writing java project against mysql in a cluster with ten nodes. The program simply pull some information from the database and do some calculation, then push some data back to the database. However, there are millions of rows in the table. Is there any way to split up the job and utilize the cluster architecture? How to do multi-threading on different node?
I watched an interesting presentation on using Gearman to do Map/Reduce style things on a mysql database. It might be what you are looking for: see here. There is a recording on the mysql webpage here (have to register for mysql.com though).
I'd think about doing that calculation in a stored procedure on the database server and pass on bringing millions of rows to the middle tier. You'll save yourself a lot of bytes on the wire. Depending on the nature of the calculation, your schema, indexing, etc. you might find that the database server is well equipped to do that calculation without having to resort to multi-threading.
I could be wrong, but it's worth a prototype to see.
Assume the table (A) you want to process has 10 million rows. Create a table B in the database to store the set of rows processed by a node. So you can write the Java program in such a way like it will first fetch the last row processed by other nodes and then it add an entry in the same table informing other nodes what range of rows it is going to process (you can decide this number). In our case, lets assume each node can process 1000 rows at a time. Node 1 fetches table B and finds it it empty. Then Node 1 inserts a row ('Node1', 1000) informing that it is processing till primary key of A is <=1000 ( Assuming primary key of table A is numeric and it is in ascending order). Node 2 comes and finds 1000 primary keys are processed by some other node. Hence it inserts a row ('Node2', 2000) informing others that it is processing rows between 1001 and 2000. Please note that access to table B should be synchronized, i.e. only one can work on it at a time.
Since you only have one mysql server, make sure you're using the innodb engine to reduce table locking on updates.
Also I'd try to keep your queries as simple as possible, even if you have to run more of them. This can increase chances of query cache hits, as well as reduce the over all workload on the backend, offloading some of the querying matching and work to the frontends (where you have more resources). It will also reduce the time a row lock is held therefore decreasing contention.
The proposed Gearman solution is probably the right tool for this job. As it will allow you to offload batch processing from mysql back to the cluster transparently.
You could set up sharding with a mysql on each machine but the set up time, maintenance and the changes to database access layer might be a lot of work compared to a gearman solution. You might also want to look at the experimental spider engine that could allow you to use multiple mysqls in unison.
Unless your calculation is very complex, most of the time will be spent retrieving data from MySql and sending the results back to MySQl.
As you have a single database no amount of parallelism or clustering on the application side will make much difference.
So your best options would be to do the update in pure SQL if that is at all possible, or, use a stored procedure so that all processing can take place within the MySql server and no data movement is required.
If this is not fast enough then you will need to split your database among several instances of MySql and come up with some schema to partition the data based on some application key.