Spark - UnsupportedOperationException when calling Java method from Scala code

I've implemented code in Scala that is using a method written in Java.
In the code below processSale() is a Java method that takes util.List<Sale> as a parameter.
I've converted Scala Iterable[Sale] to Seq[Sale] and then to util.List<Sale> with the help of scala.collection.JavaConverters._
val parseSales: RDD[(String, Sale)] = rawSales
.map(sale => sale.Id -> sale)
.groupByKey()
.mapValues(a => SaleParser.processSale(a.toSeq.asJava))
However, when the code runs as part of a Spark job, the job fails because a task throws UnsupportedOperationException. I've looked through the logs, and the exception appears to come from the call to Collections.sort inside the Java processSale method:
Collections.sort(sales, new Comparator<Sale>() {
    @Override
    public int compare(Sale sale1, Sale sale2) {
        return Long.compare(sale1.timestamp, sale2.timestamp);
    }
});
I'm stuck at this point because I'm passing the required util.List<Sale>. Why could Collections.sort be an unsupported operation in this case?

From this documentation:
Because Java does not distinguish between mutable and immutable
collections in their type, a conversion from, say,
scala.immutable.List will yield a java.util.List, where all
mutation operations throw an UnsupportedOperationException
toSeq in your code returns an immutable Seq, which is why you get the exception.
You can convert the collection to a mutable data structure such as ListBuffer before calling asJava:
list.to[scala.collection.mutable.ListBuffer].asJava
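The same failure can be reproduced in plain Java, and it can also be avoided on the Java side by sorting a defensive copy. The sketch below is only an illustration: Collections.unmodifiableList stands in for the read-only wrapper that asJava produces for an immutable Scala collection, and the Sale and SortDemo classes are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the Sale class from the question.
class Sale {
    final long timestamp;

    Sale(long timestamp) {
        this.timestamp = timestamp;
    }
}

class SortDemo {
    // Sorts a defensive copy, so read-only input lists are safe to pass in.
    static List<Sale> sortedByTimestamp(List<Sale> sales) {
        List<Sale> copy = new ArrayList<>(sales);
        copy.sort(Comparator.comparingLong((Sale s) -> s.timestamp));
        return copy;
    }

    // Returns true if sorting the given list in place is rejected.
    static boolean inPlaceSortFails(List<Sale> sales) {
        try {
            Collections.sort(sales, Comparator.comparingLong((Sale s) -> s.timestamp));
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Read-only list, standing in for the wrapper around an immutable Scala Seq.
        List<Sale> readOnly = Collections.unmodifiableList(
                Arrays.asList(new Sale(30L), new Sale(10L), new Sale(20L)));
        System.out.println(inPlaceSortFails(readOnly));                   // prints true
        System.out.println(sortedByTimestamp(readOnly).get(0).timestamp); // prints 10
    }
}
```

Copying inside processSale costs one extra list per call, but it makes the method independent of whether the caller passes a mutable list.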

Add null check for rawSales util.List<Sale>.
val parseSales: RDD[(String, Sale)] = if (rawSales.nonEmpty)
//rawSales specific stream operations
else
//None or any code as per requirement

Related

How to pass complex Java Class Object as parameter to Scala UDF in Spark?

I have a Java client class (used as a dependency Jar with spark-shell) that responds to an API call - let's call the class SomeAPIRequester.
In plain Java, it returns the desired result with the sample code below:
SomeAPIRequester requester = SomeAPIRequester.builder().name("abc").build(); // build the class
System.out.println(requester.getSomeItem("id123")); // result: {"id123": "item123"}
I want to call this API in a distributed manner for each ID in my RDD, which is stored in a Spark DataFrame (in Scala):
val inputIdRdd = sc.parallelize(List("id1", "id2", "id3"...)) // sample RDD of IDs I want to call the API for
and I define my UDF as -
val test: UserDefinedFunction = udf((id: String, requester: SomeAPIRequester) => {
requester.getSomeItem(id)
})
and call this UDF as -
inputIdRdd.toDF("ids").withColumn("apiResult", test(col("ids"), requester)) // requester as built with SomeAPIRequester.builder()....
// or directly with RDD ? udf, or a plain scala function ..
inputIdRdd.foreach{ id => test(id, requester) }
When I run a .show() or .take() on the result, I get NullPointerException on the requester java class.
I also tried sending in literals (lit), and I read about typedLit in scala, but I could not convert the Java Requester class into any allowed typedLit types in scala.
Is there a way to call this Java class object through UDFs and get the result from the API?
Edit:
I also tried to initialize the requester class in the RDD's foreach block -
inputIdRdd.foreach(x =>{
val apiRequester = SomeAPIRequester.builder()...(argPool).build()
try {
apiRequester.getSomeItem(x)
} catch {
case ex: Exception => println(ex.printStackTrace()); ""
}
})
But this returns no response - cannot initialize class etc.
Thanks!
Working with custom classes in Spark requires some knowledge of how Spark works under the hood. Don't pass your instance as a parameter to the UDF. UDF parameters are extracted from the rows of the DataFrame, so the NullPointerException is understandable in this case. You can try the following options:
First put the instance in the scope of the udf:
val requester: SomeAPIRequester = ???
val test: UserDefinedFunction = udf((id: String) => {
requester.getSomeItem(id)
})
At this point you will need to mark your class as Serializable if possible; otherwise you will get a NotSerializableException.
If your class is not Serializable because it comes from a third party, you can mark your instance as a @transient lazy val, as described in https://mengdong.github.io/2016/08/16/spark-serialization-memo/ or https://medium.com/@swapnesh.chaubal/writing-to-logentries-from-apache-spark-35831282f53d.
If you work in the RDD world, you can use mapPartitions to create just one instance per partition.
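The idea behind the transient lazy instance can be sketched in plain Java, without Spark: only the configuration is serialized, and the non-serializable client is rebuilt on first use after deserialization. ApiClient, LazyClient and LazyClientDemo are hypothetical names, and the byte-array round trip merely imitates shipping the object to an executor.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical third-party client that is NOT Serializable.
class ApiClient {
    private final String name;

    ApiClient(String name) {
        this.name = name;
    }

    String getSomeItem(String id) {
        return name + ":" + id;
    }
}

// Serializable wrapper: ships only the configuration and rebuilds the client lazily.
class LazyClient implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String name;
    // transient: skipped by serialization, so it is null after deserialization.
    private transient ApiClient client;

    LazyClient(String name) {
        this.name = name;
    }

    String getSomeItem(String id) {
        if (client == null) {
            client = new ApiClient(name); // created on first use in the receiving JVM
        }
        return client.getSomeItem(id);
    }
}

class LazyClientDemo {
    // Serializes and deserializes the wrapper, imitating shipment to another JVM.
    static LazyClient roundTrip(LazyClient original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (LazyClient) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        LazyClient copy = roundTrip(new LazyClient("abc"));
        System.out.println(copy.getSomeItem("id123")); // prints abc:id123
    }
}
```

This sketch is not thread-safe; since several tasks can share one executor JVM, a real wrapper would probably need synchronization around the lazy initialization.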

Writing unit tests for Java 8 streams

I have a list and I'm streaming this list to get some filtered data as:
List<Future<Accommodation>> submittedRequestList =
list.stream().filter(Objects::nonNull)
.map(config -> taskExecutorService.submit(() -> requestHandler
.handle(jobId, config))).collect(Collectors.toList());
When I wrote tests, I tried to return some data using a when():
List<Future<Accommodation>> submittedRequestList = mock(LinkedList.class);
when(list.stream().filter(Objects::nonNull)
.map(config -> executorService.submit(() -> requestHandler
.handle(JOB_ID, config))).collect(Collectors.toList())).thenReturn(submittedRequestList);
I'm getting org.mockito.exceptions.misusing.WrongTypeOfReturnValue:
LinkedList$$EnhancerByMockitoWithCGLIB$$716dd84d cannot be returned by submit() error. How may I resolve this error by using a correct when()?
You can only mock single method calls, not entire fluent interface cascades.
E.g., you could do
Stream<Future> fs = mock(Stream.class);
when(requestList.stream()).thenReturn(fs);
Stream<Future> filtered = mock(Stream.class);
when(fs.filter(Objects::nonNull)).thenReturn(filtered);
and so on.
IMO it's really not worth mocking the whole thing, just verify that all filters were called and check the contents of the result list.
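One way to check the contents of the result list without mocking the stream is to run the pipeline against a real list and a real executor. The following is a minimal sketch under assumptions: submitAll mirrors the pipeline from the question, the handler function stands in for requestHandler.handle(jobId, config), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.stream.Collectors;

class SubmitDemo {
    // Same shape as the pipeline in the question: drop nulls, submit one task per config.
    static <C, R> List<Future<R>> submitAll(List<C> configs,
                                            ExecutorService executor,
                                            Function<C, R> handler) {
        return configs.stream()
                .filter(Objects::nonNull)
                .map(config -> {
                    Callable<R> task = () -> handler.apply(config);
                    return executor.submit(task);
                })
                .collect(Collectors.toList());
    }

    // Runs the pipeline on real data and returns the results of all futures.
    static List<String> runExample() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            List<String> configs = Arrays.asList("a", null, "b");
            List<Future<String>> futures =
                    submitAll(configs, executor, config -> "handled-" + config);
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runExample()); // prints [handled-a, handled-b]
    }
}
```

In a real test, the handler could still be a mock, so that its invocations can be verified while the stream itself stays real.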

Mapping the values of a RDD to their dictionary values

I've this piece of code:
List tmp = colRDD.collect();
int ctr = 0;
for(Object o : tmp){
if (!dictionary.containsKey(o)) {
dictionary.put(o, ctr++);
}
}
revDictionary = dictionary.entrySet().stream()
.collect(Collectors.toMap(Entry::getValue, c -> c.getKey()));
colRDD = colRDD.map(x -> {return dictionary.get(x);});
At the start, I materialize the RDD and put each value in a hash table where the RDD values are keys.
Then, I simply want to map each value in the RDD to its dictionary value.
However, I get a Task not serializable error. Why is that?
This will be caused by trying to access a variable scoped to the driver, from within code that is evaluated by an executor.
Given your sample code, the most likely culprit is dictionary in this line of code:
colRDD = colRDD.map(x -> {return dictionary.get(x);});
However the issue could also be coming from further up in your code than you have supplied here, so you might need to check that too.
The reason is that dictionary resides in the memory of your driver, which is likely running in a separate JVM from your executors. The lambda you pass to colRDD.map is evaluated by an executor, not the driver. The function is serialised as part of the task and sent to an executor to be run. The Spark engine is unable to serialise the task because of the closure around dictionary, hence the exception.
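The effect of a closure on serialisation can be reproduced with plain Java serialisation, without Spark. The sketch below assumes one common cause that the question does not confirm: dictionary is a field of a class that is not Serializable, so the lambda captures the whole enclosing object. SerFunction, Encoder and ClosureDemo are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// A serializable function type; lambdas targeting it can be written to a stream.
interface SerFunction<T, R> extends Function<T, R>, Serializable {}

// Hypothetical driver-side class that is NOT Serializable and holds the dictionary in a field.
class Encoder {
    private final Map<Object, Integer> dictionary = new HashMap<>();

    Encoder() {
        dictionary.put("a", 0);
        dictionary.put("b", 1);
    }

    // Reading the field inside the lambda captures `this`, i.e. the whole Encoder.
    SerFunction<Object, Integer> capturingThis() {
        return x -> dictionary.get(x);
    }

    // Copying to a local variable first captures only the serializable HashMap.
    SerFunction<Object, Integer> capturingLocalCopy() {
        final HashMap<Object, Integer> local = new HashMap<>(dictionary);
        return x -> local.get(x);
    }
}

class ClosureDemo {
    // Returns true if Java serialization accepts the object.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Encoder encoder = new Encoder();
        System.out.println(canSerialize(encoder.capturingThis()));      // prints false
        System.out.println(canSerialize(encoder.capturingLocalCopy())); // prints true
    }
}
```

In Spark, the local copy could also be wrapped in a broadcast variable when the dictionary is large, so it is shipped to each executor once instead of with every task.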

How to use switchIfEmpty RxJava

The logic here is that if the ratings in the database are empty, then I want to get them from the API. I have the following code:
Observable.from(settingsRatingRepository.getRatingsFromDB())
.toList()
.switchIfEmpty(settingsRatingRepository.getSettingsRatingModulesFromAPI())
.compose(schedulerProvider.getSchedulers())
.subscribe(ratingsList -> {
view.loadRatingLevels(ratingsList, hideLocks);
}, this::handleError);
The getRatingsFromDB() call returns List<SettingRating>, but the API call returns Observable<List<SettingRating>>.
However, when I unit test this and return an empty list from the database call, the API call is not executed. Can someone please help me with this? This is my unit test code:
when(mockSettingsRatingsRepository.getRatingsFromDB()).thenReturn(Collections.emptyList());
List<SettingsRating> settingsRatings = MockContentHelper.letRepositoryReturnSettingsRatingsFromApi(mockSettingsRatingsRepository);
settingsParentalPresenter.onViewLoad(false);
verify(mockView).loadRatingLevels(settingsRatings, false);
As @Kiskae mentioned, I was confusing an empty list with an empty Observable. I have therefore used the following, which does what I want:
public void onViewLoad(boolean hideLocks) {
Observable.just(settingsRatingRepository.getRatingsFromDB())
.flatMap(settingsRatings -> {
if (settingsRatings.isEmpty()) {
return settingsRatingRepository.getSettingsRatingModules();
} else {
return Observable.just(settingsRatings);
}
})
.compose(schedulerProvider.getSchedulers())
.subscribe(ratingsList -> {
view.loadRatingLevels(ratingsList, hideLocks);
}, this::handleError);
}
Observable#toList() returns a single element. If the observable from which it gets its elements is empty, it will emit an empty list. So by definition the observable will never be empty after calling toList().
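The difference between an empty source and a source that holds one empty list can be illustrated with java.util.stream. This is only an analogy and not RxJava: Stream.of(collected) plays the role of the Observable after toList(), and the class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class EmptinessDemo {
    // Number of elements the source itself produces: zero for an empty list.
    static long elementsBeforeCollecting(List<String> source) {
        return source.stream().count();
    }

    // After collecting, the result is a single list, so a stream wrapping it has one element.
    static long elementsAfterCollecting(List<String> source) {
        List<String> collected = source.stream().collect(Collectors.toList());
        return Stream.of(collected).count();
    }

    public static void main(String[] args) {
        List<String> empty = Collections.emptyList();
        System.out.println(elementsBeforeCollecting(empty)); // prints 0
        System.out.println(elementsAfterCollecting(empty));  // prints 1
    }
}
```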
switchIfEmpty only switches to the alternative when the source Observable completes without emitting any items.
Because you call toList(), it always emits a list object. That's why your switchIfEmpty alternative is never used.
If you want to get data from the cache and fall back to your API if the cache is empty, use concat along with the first or takeFirst operator.
For example:
Observable.concat(getDataFromCache(), getDataFromApi())
.first(dataList -> !dataList.isEmpty());
Building on the answer by @kiskae, your use of toList() emits the elements aggregated as a single List.
There is an alternative to the use of Observable.just() + a flatMap here.
Observable.from will iterate over the list returned by your service and emit each individual item, giving an Observable<Rating>. If that list is empty, it naturally produces an empty Observable. Your API call also produces an Observable<Rating>, and in both cases you want to reaggregate that back into a List<Rating> for further processing.
So just move the toList() from your original code down one line, after the switchIfEmpty calling the API:
Observable.from(settingsRatingRepository.getRatingsFromDB())
.switchIfEmpty(settingsRatingRepository.getSettingsRatingModulesFromAPI())
.toList()
.compose(schedulerProvider.getSchedulers())
.subscribe(ratingsList -> {
view.loadRatingLevels(ratingsList, hideLocks);
}, this::handleError);
Granted, that solution may produce a bit more garbage (as the db's List is turned into an Observable just to be turned into a new List later on).

RxJava operators

I'm learning RxJava from this article: http://blog.danlew.net/2014/09/22/grokking-rxjava-part-2/
and can't reproduce the first example from it.
I did the following:
Observable<List<String>> query(String text); //Gradle: error: missing method body, or declare abstract
@Override
protected void onCreate(Bundle savedInstanceState) {
super.onCreate(savedInstanceState);
setContentView(R.layout.activity_main);
query.subscribe(urls -> { //Gradle: error: cannot find symbol variable query
for (String url : urls) {
System.out.println(url);
}
});
}
But I get the errors that I added as comments.
What did I do wrong?
Java 8 lambdas aside, most people here are missing the fact that your code won't compile regardless of Retrolambda or any other nifty tool you find to work around the missing lambda support on Android...
So, take a close look at your code; you even added comments to the snippet that explain why you are getting compilation errors:
1. You have a method without a body:
Observable<List<String>> query(String text);
So, add a method body to it and problem solved. What do you want to do? You don't know yet? Then add a dummy or empty body and work that out later:
Observable<List<String>> query(String text) {
return Observable.just(Arrays.asList("url1", "url2"));
}
2. There is no query variable at all in your code. What you've got is a query method, and the syntax for calling a method requires parentheses:
query("whatever").subscribe(urls -> {
for (String url : urls) {
System.out.println(url);
}
});
Now add Retrolambda or use anonymous classes and you are done. Bear in mind that none of this adds much functionality to your code; it only fixes those compilation errors. Now ask yourself what you want to do in your query method and carry on.
Note: An Observable object is a stream of data, which basically means that you might get zero elements, one element, or many, all of them instances of the specified type. Your code seems to expect a stream of lists of strings. If what you really want is a stream of strings, replace Observable<List<String>> with Observable<String>.
By Gradle: error: do you mean a compilation error? You should probably put parentheses between query and .subscribe(urls -> {, as query is not a variable or class field but a method, so you need to call it to get an Observable to subscribe to.
Well, also you need to implement query method to return Observable, for example like this:
private Observable<String> query() {
return Observable.just("one", "two", "three");
}
You'll get another build error because of Java 8, but as already mentioned in the comments, you can easily use Retrolambda with Gradle to fix the problem. Otherwise, you can use Android Studio quick fixes to convert Java 8 lambdas into Java 6 anonymous classes.
