Spark Context not Serializable? - java

So, I am getting the infamous Task Not Serializable error in Spark. Here's the related code block:
val labeledPoints: RDD[LabeledPoint] = events.map(event => {
  var eventsPerEntity = try {
    HBaseHelper.scan(...filter entity here...)(sc).map(newEvent => {
      Try(new Object(...))
    }).filter(_.isSuccess).map(_.get)
  } catch {
    case e: Exception => {
      logger.error(s"Failed to convert event ${event}." +
        s"Exception: ${e}.")
      throw e
    }
  }
})
Basically, what I am trying to do is access sc, my SparkContext object, inside map. At runtime I get the Task Not Serializable error.
Here is a potential solution I could think of:
Query HBase without sc, which I can do, but then I end up with a plain list. (If I try to parallelize it, I have to use sc again.) Having a list means I cannot use reduceByKey, which was advised in my other question, so I could not successfully achieve this either, as I don't know how I would do it without reduceByKey. Also, I would really like to keep using RDDs :)
So I am looking for another solution, and also asking whether I am doing something wrong. Thanks in advance!
Update
So basically, my question has become this:
I have an RDD named events. This is the whole HBase table. Note: Every event is performed by a performerId which is again a field in event, i.e. event.performerId.
For every event in events, I need to calculate the ratio of event.numericColumn to the average of numericColumn over the events (a subset of events) performed by the same performerId.
I was trying to do this when mapping events. Within map I was trying to filter events according to their performerId.
Basically, I am trying to convert every event to a LabeledPoint, and the ratio above is going to be one of the features in my Vector, i.e. for every event I am trying to get
// I am trying to calculate the average, but cannot use filter, because I am inside the map block.
LabeledPoint(
  event.someColumn,
  Vectors.dense(
    averageAbove,
    ...
  )
)
I would appreciate any help. Thanks!

One option, if applicable, is loading the entire HBase table (or all the elements that might match one of the events in the events RDD, if you have a way of isolating them without going over the RDD) into a DataFrame, and then using join.
To load data from an HBase table into a DataFrame, you can use the preview Spark-HBase Connector from Hortonworks. Then performing the right join operation between the two DataFrames should be easy.
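For the ratio described in the update, here is a rough Java sketch of that join-based approach with the DataFrame API; the table name, column names and the ratio column are illustrative, not taken from the question:

import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PerformerRatio {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("performer-ratio").getOrCreate();

        // One row per event, with at least performerId and numericColumn
        Dataset<Row> events = spark.table("events"); // or the DataFrame loaded from HBase

        // Average of numericColumn per performer
        Dataset<Row> averages = events
                .groupBy("performerId")
                .agg(avg("numericColumn").alias("avgNumeric"));

        // Join the average back onto every event and compute the ratio feature
        Dataset<Row> withRatio = events
                .join(averages, "performerId")
                .withColumn("ratio", col("numericColumn").divide(col("avgNumeric")));

        withRatio.show();
    }
}

The same idea works on plain RDDs: a reduceByKey over (performerId, (sum, count)) pairs gives the per-performer average, and a join brings it back onto each event, all without ever touching sc inside a closure.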

You can add the list as a new field on the event - by that getting a new RDD (event+list of entities). You can then use regular Spark commands to "explode" the list and thus get multiple event+list item records (it is easier to do this with DataFrames/DataSets than with RDDs though)
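A minimal sketch of that "explode" step with the DataFrame API (the entities array column and its name are assumptions made for illustration):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class ExplodeEntities {
    // eventsWithEntities: one row per event, plus an array column "entities"
    public static Dataset<Row> explodeEntities(Dataset<Row> eventsWithEntities) {
        return eventsWithEntities
                .withColumn("entity", explode(col("entities"))) // one output row per (event, entity) pair
                .drop("entities");
    }
}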

It's simple: you can't use the SparkContext inside an RDD closure, so you need to find another approach to handle this.

Related

Union of more than two streams in Apache Flink

I have an architecture question regarding the union of more than two streams in Apache Flink.
We have three, and sometimes more, streams that are a kind of code book with which we have to enrich the main stream.
The code book streams are compacted Kafka topics. Code books are things that don't change very often, e.g. currencies. The main stream is a fast event stream.
Our goal is to enrich the main stream with the code books.
There are three possible ways, as I see it, to do it:
Make a union of all the code books and then join it with the main stream, storing the enrichment data as managed, keyed state (so when the compacted events from Kafka expire, I still have the code books saved in state). This is the only way I have tried so far.
I deserialize the Kafka topic messages, which are in JSON, into POJOs, e.g. Currency, OrganizationUnit and so on.
I made one big wrapper class, CodebookData, containing all the code books, e.g.:
public class CodebookData {
    private Currency currency;
    private OrganizationUnit organizationUnit;
    ...
}
Next, I mapped the incoming stream of every Kafka topic to this wrapper class and then made a union:
DataStream<CodebookData> enrichedStream = mappedCurrency.union(mappedOrgUnit).union(mappedCustomer);
When I print CodebookData it is populated like this
CodebookData{
    Currency{populated with data},
    OrganizationUnit=null,
    Customer=null
}
CodebookData{
    Currency=null,
    OrganizationUnit={populated with data},
    Customer=null
}
...
Here I stopped, because I have a problem with how to connect this code book stream with the main stream and save the code book data in value state. I do not have a unique foreign key in my CodebookData, because every code book has its own foreign key that connects to the main stream, e.g. Currency has currencyId, OrganizationUnit has orgId, and so on.
E.g. I want to do something like this:
SingleOutputStreamOperator<CanonicalMessage> enrichedMainStream = mainStream
    .connect(enrichedStream)
    .keyBy(?????)
    .process(new MyKeyedCoProcessFunction());
and in MyKeyedCoProcessFunction I would create a ValueState of type CodebookData.
Is this totally wrong, or can I do something with it? If it is doable, what am I doing wrong?
The second approach is cascading a series of two-input CoProcessFunction operators, one for every Kafka source, but I read somewhere that this is not an optimal approach.
The third approach is broadcast state, which I am not very familiar with. For now I see a problem: if I am using RocksDB for checkpointing and savepointing, I am not sure I can also use broadcast state.
Should I use some approach other than approach no. 1, which I am currently struggling with?
In many cases where you need to do several independent enrichment joins like this, a better pattern to follow is to use a fan-in / fan-out approach, and perform all of the joins in parallel.
Something like this, where after making sure each event on the main stream has a unique ID, you create 3 or more copies of each event.
Then you can key each copy by whatever is appropriate -- the currency, the organization unit, and so on (or customer, IP address, and merchant in the example I took this figure from) -- then connect it to the appropriate code book stream, and compute each of the 2-way joins independently.
Then union together these parallel join result streams, keyBy the random nonce you added to each of the original events, and glue the results together.
Now in the case of three streams, this may be overly complex. In that case I might just do a series of three 2-way joins, one after another, using keyBy and connect each time. But at some point, as they get longer, pipelines built that way tend to run into performance / checkpointing problems.
There's an example implementing this fan-in/fan-out pattern in https://gist.github.com/alpinegizmo/5d5f24397a6db7d8fabc1b12a15eeca6.
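As a rough Java sketch of one of those parallel 2-way joins (the EnrichedMessage wrapper, the getCurrencyId() getters and the key type are assumptions for illustration, not taken from the question): key the main-stream copy and the Currency code book by currencyId, connect them, and keep the latest Currency in keyed ValueState.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

public class CurrencyEnrichmentJob {

    // Joins the main stream with the Currency code book on currencyId.
    public static DataStream<EnrichedMessage> enrich(
            DataStream<CanonicalMessage> mainStream,
            DataStream<Currency> currencyStream) {

        return mainStream
                .connect(currencyStream)
                .keyBy(CanonicalMessage::getCurrencyId, Currency::getCurrencyId)
                .process(new CurrencyEnrichment());
    }

    public static class CurrencyEnrichment
            extends KeyedCoProcessFunction<String, CanonicalMessage, Currency, EnrichedMessage> {

        private transient ValueState<Currency> currencyState;

        @Override
        public void open(Configuration parameters) {
            currencyState = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("currency", Currency.class));
        }

        @Override
        public void processElement1(CanonicalMessage msg, Context ctx,
                                    Collector<EnrichedMessage> out) throws Exception {
            // Enrich the main-stream event with the latest code book entry for this key.
            out.collect(new EnrichedMessage(msg, currencyState.value()));
        }

        @Override
        public void processElement2(Currency currency, Context ctx,
                                    Collector<EnrichedMessage> out) throws Exception {
            // Remember the most recent code book value for this currencyId.
            currencyState.update(currency);
        }
    }

    // Simple wrapper you would define yourself; shown here only to keep the sketch compilable.
    public static class EnrichedMessage {
        public final CanonicalMessage message;
        public final Currency currency; // may be null if no code book entry has arrived yet

        public EnrichedMessage(CanonicalMessage message, Currency currency) {
            this.message = message;
            this.currency = currency;
        }
    }
}

You would repeat the same pattern for OrganizationUnit, Customer, and so on, then union the partial results and keyBy the event's unique ID to glue them back together.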

Is it possible to update subscription message using previous state?

Suppose I have a subscription query like this:
queryGateway.subscriptionQuery(
    FetchListOfBookQuery,
    ResponseTypes.multipleInstancesOf(Book::class.java),
    ResponseTypes.multipleInstancesOf(Book::class.java)
)
So it will subscribe to the list of Books in the database, and if I want to add a new book I would have something like this in my projection:
fun on(event: BookAddedEvent) {
    var book = repo.save(Book(event.bookId)).block()
    queryUpdateEmitter.emit(
        FetchListOfBookQuery::class.java,
        { it.bookId == book.bookId },
        book
    )
}
The problem is that I only get the one new Book that has been added, while in order to update the subscription query I need the previous list of Books as well. Is there a way to get the previous update state of the subscription query, compare the changes, and finally update it?
The subscription query logic provided by Axon Framework allows you to retrieve an initial response and updates. In code, this translates to first hitting an @QueryHandler annotated method and then emitting the updates through the QueryUpdateEmitter.
What is being emitted is completely up to you. So if you decide to send the newly added Book in combination with all the previous Books, that is perfectly fine. As you have likely noticed, though, the QueryUpdateEmitter does not store the updates itself, and neither does the SubscriptionQueryResult on the query dispatching end.
Thus if you need logic to filter out what has been sent with a previous update, you will have to build this yourself. To that end you could take the route of building a dedicated piece of logic, a service maybe, which does the job. Or you could create your own QueryUpdateEmitter which enhances the behaviour to simplify the update being sent.
I'd argue the latter would be the cleanest approach, for which I'd recommend wrapping the SimpleQueryUpdateEmitter. However, this could be quite some custom code, so I'd first check whether there is a different way around this requirement you are stating:
... but in order to update to the subscription query I need to have previous list of the books.
If you do end up on that route through bare necessity, I would be interested to see the outcome, or potentially help out with suggestions on the matter.
That's my two cents, hope this helps you out @Patrick!
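For completeness, a minimal Java sketch of the "dedicated piece of logic" route, where the projection keeps the current list itself and emits the full list on every change. Book, BookAddedEvent and FetchListOfBookQuery are taken from the question, the getBookId() getter and the in-memory list are assumptions; you could equally re-read the repository instead:

import java.util.ArrayList;
import java.util.List;

import org.axonframework.eventhandling.EventHandler;
import org.axonframework.queryhandling.QueryHandler;
import org.axonframework.queryhandling.QueryUpdateEmitter;

public class BookListProjection {

    private final QueryUpdateEmitter queryUpdateEmitter;
    private final List<Book> books = new ArrayList<>(); // last known state of the list

    public BookListProjection(QueryUpdateEmitter queryUpdateEmitter) {
        this.queryUpdateEmitter = queryUpdateEmitter;
    }

    @EventHandler
    public void on(BookAddedEvent event) {
        books.add(new Book(event.getBookId()));
        // Emit the whole, updated list instead of just the new Book.
        queryUpdateEmitter.emit(FetchListOfBookQuery.class,
                                query -> true,
                                new ArrayList<>(books));
    }

    @QueryHandler
    public List<Book> handle(FetchListOfBookQuery query) {
        // Initial result of the subscription query.
        return new ArrayList<>(books);
    }
}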

Apache Beam how many writes when using multiple tables

I am using Apache Beam to read messages from PubSub and write them to BigQuery. What I'm trying to do is write to multiple tables according to the information in the input. To reduce the amount of writes, I am using windowing on the input from PubSub.
A small example:
messages
    .apply(new PubsubMessageToTableRow(options))
    .get(TRANSFORM_OUT)
    .apply(ParDo.of(new CreateKVFromRow()))
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(10L))))
    // group by key
    .apply(GroupByKey.create())
    // Are these two rows what I want?
    .apply(Values.create())
    .apply(Flatten.iterables())
    .apply(BigQueryIO.writeTableRows()
        .withoutValidation()
        .withCreateDisposition(CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND)
        .withExtendedErrorInfo()
        .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
        .to((SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>) input -> {
            // Simplified for readability
            Integer destination = (Integer) input.getValue().get("key");
            return new TableDestination(
                new TableReference()
                    .setProjectId(options.getProjectID())
                    .setDatasetId(options.getDatasetID())
                    .setTableId(destination + "_Table"),
                "Table Destination");
        }));
I couldn't find anything in the documentation, but I was wondering: how many writes are done per window? If these are multiple tables, is it one write per table for all elements in the window? Or is it one write per element, as the table might be different for each element?
Since you're using Pub/Sub as a source, your job seems to be a streaming job. Therefore, the default insertion method is STREAMING_INSERTS (see the docs). I don't see any benefit or reason to reduce writes with this method, as billing is based on the size of the data. By the way, your example is not really reducing the writes effectively.
Although it is a streaming job, the FILE_LOADS method has also been supported for a few versions now. If withMethod is set to FILE_LOADS, you can define withTriggeringFrequency on BigQueryIO. This setting defines the frequency with which the load jobs happen. Here the connector handles everything for you and you don't need to group by key or window the data. A load job will be started for each table.
Since it seems to be totally fine for you if it takes some time until your data is in BigQuery, I'd suggest using FILE_LOADS, as loading is free as opposed to streaming inserts. Just mind the quotas when defining the triggering frequency.
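A rough sketch of how the write could look with FILE_LOADS; the triggering frequency and shard count are placeholder values, and the dynamic destination function is the same one used in the question:

import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.ValueInSingleWindow;
import com.google.api.services.bigquery.model.TableRow;
import org.joda.time.Duration;

class FileLoadsWrite {
    // Writes TableRows to dynamically chosen tables using periodic load jobs.
    static void write(PCollection<TableRow> tableRows,
                      SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination> toTable) {
        tableRows.apply(BigQueryIO.writeTableRows()
            .withoutValidation()
            .withCreateDisposition(CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND)
            .withMethod(Method.FILE_LOADS)                          // load jobs instead of streaming inserts
            .withTriggeringFrequency(Duration.standardMinutes(10L)) // how often load jobs are started
            .withNumFileShards(1)                                   // required when a triggering frequency is set
            .to(toTable));                                          // same dynamic destination function as above
    }
}

Note that withExtendedErrorInfo and withFailedInsertRetryPolicy only apply to streaming inserts, so they are dropped here.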

Multithreading in Grails - Passing domain objects into each thread causes some fields to randomly be null

I am trying to speed up a process in a Grails application by introducing parallel programming. This particular process requires sifting through thousands of documents, gathering the necessary data from them and exporting it to an excel file.
After many hours of trying to track down why this process was going so slowly, I've determined that the process has to do a lot of work gathering specific parts of data from each domain object. (Example: The domain object has lists of data inside it, and this process takes each index in these lists and appends it to a string with commas to make a nice looking, sorted list in a cell of the excel sheet. There are more examples but those shouldn't be important.)
So anything that wasn't a simple data access (document.id, document.name, etc...) was causing this process to take a long time.
My idea was to use a thread per document to gather all of this data asynchronously; when each thread finishes gathering the data, the result comes back to the main thread and is placed into the Excel sheet, now with only simple data access, because the thread has already gathered everything.
This seems to be working, however I have a bug with the domain objects and the threads. Each thread is passed its corresponding document domain object, but for whatever reason the document domain objects will randomly have parts of their data changed to null.
For example: before the document is passed into the thread, one part of the domain object will have a list that looks like [US, England, Wales]; randomly, at any point, that list will look like [US, null, Wales] inside the thread. And this happens to any random part of the domain object, at any random time.
Generating the threads:
def docThreadPool = Executors.newFixedThreadPool(1)
def docThreadsResults = new Future<Map>[filteredDocs.size()]
def docCount = 0

filteredDocs.each {
    def final document = it
    def future = docThreadPool.submit(new DocumentExportCallable(document))
    docThreadsResults[docCount] = future
    docCount++
}
Getting the data back from the threads:
filteredDocs.eachWithIndex { doc, i ->
    def data = docThreadsResults[i].get()
    // build excel spreadsheet...
}
DocumentExportCallable class:
class DocumentExportCallable implements Callable {
    def final document

    DocumentExportCallable(document) {
        this.document = document
    }

    Map call() {
        def data = [:]
        // code to get all the data...
        return data
    }
}
EDIT:
As seen below, it would be useful if I could show you the domain object, but I am not able to do this. However, the fact that you asked about the domain object got me thinking that it just might be where the problem lies. It turns out that every part of the domain object that randomly gets messed up in the threads is a variable whose data is fetched via SQL joins through the domain object's mapping. I've just been made aware of lazy vs. eager fetching in Grails, and I'm wondering if this might be where the problem lies: fetching is lazy by default, so this constant access to the database by each thread might be where things go wrong. I believe finding a way to change this to eager fetching might solve the problem.
I have the answer to why these null values were appearing randomly. Everything seems to be working now, and my implementation performs much faster than the previous one!
It turns out I was unaware that Grails domain objects with 1-m relationships make separate SQL calls when you access those fields, even after you have fetched the object itself. This must have caused the threads to make non-thread-safe SQL calls, which created the random null values. Setting these 1-m properties to be eagerly fetched in this specific case fixed the issue.
For anyone reading later on, you'll want to read up on lazy vs eager fetching to get a better understanding.
As for the code:
These are the 1-m variables that were the issue in my domain object:
static hasMany = [propertyOne : OtherDomainObject, propertyTwo : OtherDomainObject, propertyThree : OtherDomainObject]
I added a flag to my database call which would enable this code for this specific case, as I didn't want these properties to always be eagerly fetched throughout the app:
if (isEager) {
    fetchMode 'propertyOne', FetchMode.JOIN
    fetchMode 'propertyTwo', FetchMode.JOIN
    fetchMode 'propertyThree', FetchMode.JOIN

    setResultTransformer Criteria.DISTINCT_ROOT_ENTITY
}
My apologies, but at the moment I do not remember why I had to put setResultTransformer in the code above; without it there were issues. (Most likely the joins produce duplicate root rows, one per joined collection element, and DISTINCT_ROOT_ENTITY collapses them back to one result per document.) Maybe someone can explain this later on; otherwise I'm sure a Google search will.
What is happening is that your Grails domain objects were becoming detached from the Hibernate session, so your threads hit a LazyInitializationException when they attempted to load lazy properties.
It's good that switching to eager fetching worked for you, but it may not be an option for everyone. What you could also have done is use the Grails async task framework instead, as it has built-in session handling. See https://async.grails.org/latest/guide/index.html
However, even with Grails async tasks, passing an object between threads seems to detach it, as the new thread will have a newly bound session. The solutions I have found were to either .attach() or .merge() the object on the new thread to bind it to that thread's session.
I believe the optimal solution would be to have Hibernate load the object on the new thread, meaning that in your code snippet you would pass a document id and call Document.get(id) on the session-supported thread.
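A rough, Hibernate-flavoured sketch of that last suggestion, written in plain Java for illustration; in Grails you would more typically wrap the body in Document.withNewSession { ... } or use the async framework, and the SessionFactory wiring here is an assumption:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;

import org.hibernate.Session;
import org.hibernate.SessionFactory;

class DocumentExportCallable implements Callable<Map<String, Object>> {

    private final SessionFactory sessionFactory;
    private final Long documentId;

    DocumentExportCallable(SessionFactory sessionFactory, Long documentId) {
        this.sessionFactory = sessionFactory;
        this.documentId = documentId; // pass the id across threads, not the detached object
    }

    @Override
    public Map<String, Object> call() {
        // Open a session bound to this worker thread and load the document here,
        // so lazy associations are resolved against a live session.
        Session session = sessionFactory.openSession();
        try {
            Document document = (Document) session.get(Document.class, documentId);
            return extractData(document);
        } finally {
            session.close();
        }
    }

    private Map<String, Object> extractData(Document document) {
        Map<String, Object> data = new HashMap<>();
        // ... gather the fields and collections needed for the spreadsheet
        return data;
    }
}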

Handling very large amount of data in MyBatis

My goal is actually to dump all the data of a database to an XML file. The database is not terribly big; it's about 300 MB. The problem is that I have a memory limit of only 256 MB (in the JVM). So obviously I cannot just read everything into memory.
I managed to solve this problem using iBatis (yes, I mean iBatis, not MyBatis) by calling its getList(... int skip, int max) multiple times with an incremented skip. That does solve my memory problem, but I'm not impressed with the speed. The variable names suggest that what the method does under the hood is read the entire result set and then skip past the specified number of records. That sounds quite redundant to me (I'm not saying that's what the method is doing; I'm just guessing based on the variable names).
Now I have switched to MyBatis 3 for the next version of my application. My question is: is there any better way to handle a large amount of data chunk by chunk in MyBatis? Is there any way to make MyBatis process the first N records, return them to the caller, and keep the result set connection open, so that the next time the user calls getList(...) it starts reading from record N+1 without doing any "skipping"?
myBatis CAN stream results. What you need is a custom result handler. With this you can take each row separately and write it to your XML file. The overall scheme looks like this:
session.select(
    "mappedStatementThatFindsYourObjects",
    parametersForStatement,
    resultHandler);
Where resultHandler is an instance of a class implementing the ResultHandler interface. This interface has just one method handleResult. This method provides you with a ResultContext object. From this context you can retrieve the row currently being read and do something with it.
public void handleResult(ResultContext context) {
    Object result = context.getResultObject();
    doSomething(result);
}
No, mybatis does not have full capability to stream results yet.
EDIT 1:
If you don't need nested result mappings, then you can implement a custom result handler to stream results on currently released versions of MyBatis (3.1.1). The current limitation is when you need to do complex result mapping: the NestedResultSetHandler does not allow custom result handlers. A fix is available, and it looks like it is currently targeted for 3.2. See Issue 577.
In summary, to stream large result sets using MyBatis you'll need to:
Implement your own ResultSetHandler.
Increase the fetch size (as noted below by Guillaume Perrot); see the sketch after this list.
For nested result maps, use the fix discussed in Issue 577. This fix also resolves some memory issues with large result sets.
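As a rough sketch of the first two points together (the Resource POJO, mapper interface, statement and XML writer are made-up names; Integer.MIN_VALUE as the fetch size is the MySQL-specific hint for row-by-row streaming, other drivers take an ordinary positive value):

import org.apache.ibatis.annotations.Options;
import org.apache.ibatis.annotations.ResultType;
import org.apache.ibatis.annotations.Select;
import org.apache.ibatis.session.ResultHandler;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;

public class StreamingDump {

    // Placeholder POJO mapped to the table being dumped.
    public static class Resource { /* fields matching the table columns */ }

    public interface ResourceMapper {
        // fetchSize hints the JDBC driver to stream rows instead of buffering them all.
        @Select("SELECT * FROM resource")
        @Options(fetchSize = Integer.MIN_VALUE)
        @ResultType(Resource.class)
        void streamAll(ResultHandler<Resource> handler);
    }

    public static void dump(SqlSessionFactory factory) {
        try (SqlSession session = factory.openSession()) {
            session.getMapper(ResourceMapper.class).streamAll(context -> {
                Resource row = context.getResultObject();
                writeToXml(row); // append one element to the XML file, then let the row be garbage collected
            });
        }
    }

    private static void writeToXml(Resource row) {
        // ... serialize a single row
    }
}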
I have successfully used MyBatis streaming with the Cursor. The Cursor has been implemented in MyBatis in this PR.
From the documentation it is described as:
A Cursor offers the same results as a List, except it fetches data lazily using an Iterator.
Besides, the code documentation says:
Cursors are a perfect fit to handle millions of items queries that would not normally fit in memory.
Here is an example of an implementation I have done, which I was able to use successfully:
import org.mybatis.spring.SqlSessionFactoryBean;
// You have your SqlSessionFactory somehow, if using Spring you can use
SqlSessionFactoryBean sqlSessionFactory = new SqlSessionFactoryBean();
Then you define your mapper, e.g. UserMapper, with the SQL query returning a Cursor of your target object rather than a List. The whole idea is to not store all the elements in memory:
import org.apache.ibatis.annotations.Select;
import org.apache.ibatis.cursor.Cursor;

public interface UserMapper {
    @Select("SELECT * FROM users")
    Cursor<User> getAll();
}
Then you write the code that will use an open SQL session from the factory and query using your mapper:
try (SqlSession sqlSession = sqlSessionFactory.openSession()) {
    Iterator<User> iterator = sqlSession.getMapper(UserMapper.class)
            .getAll()
            .iterator();
    while (iterator.hasNext()) {
        doSomethingWithUser(iterator.next());
    }
}
handleResult receives as many records as the query fetches; there is no pausing.
When there are too many records to process I used sqlSessionFactory.openSession().getConnection().
Then, as with normal JDBC, get a Statement, get the ResultSet, and process the records one by one. Don't forget to close the session.
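For reference, a bare-JDBC sketch of that fallback; the statement settings are the usual ones for streaming, and Integer.MIN_VALUE as the fetch size is specific to the MySQL driver, other databases just take a positive fetch size:

import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;

class PlainJdbcDump {
    static void dump(SqlSessionFactory sqlSessionFactory) throws Exception {
        try (SqlSession session = sqlSessionFactory.openSession();
             Statement stmt = session.getConnection().createStatement(
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            stmt.setFetchSize(Integer.MIN_VALUE); // MySQL: stream rows instead of buffering them
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM resource")) {
                while (rs.next()) {
                    writeRowToXml(rs); // write one row to the XML file, then move on
                }
            }
        }
    }

    static void writeRowToXml(ResultSet rs) throws Exception {
        // ... serialize the current row
    }
}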
If you are just dumping all the data from the tables without any ordering requirement, why not do the pagination directly in SQL? Set a limit on the query statement, specifying a different record id as the offset each time, to split the whole table into chunks, each of which can be read directly into memory if the row limit is a reasonable number.
The sql could be something like:
SELECT * FROM resource
WHERE "ID" >= continuation_id
ORDER BY "ID"
LIMIT 300;
I think this could be viewed as an alternative way to dump all the data in chunks, sidestepping the feature limitations of MyBatis, or of whatever persistence layer you use.
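A small sketch of the driving loop for that keyset-style pagination (plain JDBC here; the table and column names match the SQL above, the chunk size is arbitrary, and a numeric, non-negative ID column is assumed):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

class ChunkedDump {
    static void dump(Connection connection) throws Exception {
        final int chunkSize = 300;
        long continuationId = 0;  // smallest possible ID to start from
        boolean more = true;

        String sql = "SELECT * FROM resource WHERE \"ID\" >= ? ORDER BY \"ID\" LIMIT " + chunkSize;
        try (PreparedStatement stmt = connection.prepareStatement(sql)) {
            while (more) {
                stmt.setLong(1, continuationId);
                int rows = 0;
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        rows++;
                        writeRowToXml(rs);                      // dump the current row
                        continuationId = rs.getLong("ID") + 1;  // next chunk starts after this row
                    }
                }
                more = rows == chunkSize;  // a short chunk means we reached the end of the table
            }
        }
    }

    static void writeRowToXml(ResultSet rs) throws Exception {
        // ... serialize the current row
    }
}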
