I have a use case where i am writing to a Kafka topic in batches using spark job (no streaming).Initially i pump-in suppose 10 records to Kafka topic and run the spark job which does some processing and finally write to another Kafka topic.
Next time when i push another 5 records and run the spark job, my requirement is to start processing these 5 records only not from starting offset. I need to maintain the committed offset so that spark job should run on next offset position and do the processing.
Here is code from kafka side to fetch the offset:
private static List<TopicPartition> getPartitions(KafkaConsumer consumer, String topic) {
List<PartitionInfo> partitionInfoList = consumer.partitionsFor(topic);
return partitionInfoList.stream().map(x -> new TopicPartition(topic, x.partition())).collect(Collectors.toList());
}
public static void getOffSet(KafkaConsumer consumer) {
List<TopicPartition> topicPartitions = getPartitions(consumer, topic);
consumer.assign(topicPartitions);
consumer.seekToBeginning(topicPartitions);
topicPartitions.forEach(x -> {
System.out.println("Partition-> " + x + " startingOffSet-> " + consumer.position(x));
});
consumer.assign(topicPartitions);
consumer.seekToEnd(topicPartitions);
topicPartitions.forEach(x -> {
System.out.println("Partition-> " + x + " endingOffSet-> " + consumer.position(x));
});
topicPartitions.forEach(x -> {
consumer.poll(1000) ;
OffsetAndMetadata offsetAndMetadata = consumer.committed(x);
long position = consumer.position(x);
System.out.printf("Committed: %s, current position %s%n", offsetAndMetadata == null ? null : offsetAndMetadata
.offset(), position);
});
}
Below code is for spark to load the messages from topic which is not working :
Dataset<Row> kafkaDataset = session.read().format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", topic)
.option("group.id", "test-consumer-group")
.option("startingOffsets","{\"Topic1\":{\"0\":2}}")
.option("endingOffsets", "{\"Topic1\":{\"0\":3}}")
.option("enable.auto.commit","true")
.load();
After above code executes i am again trying to get the offset by calling
getoffset(consumer)
from the topic which always reads from 0 offset and committed offset fetched initially keeps on increasing. I am new to kafka and still figuring out how to handle such scenarion.Please help here.
Initially i had 10 records in my topic, i published another 2 records and here is the o/p:
Output post getoffset method executes :
Partition-> Topic00-0 startingOffSet-> 0 Partition->
Topic00-0 endingOffSet-> 12 Committed: 12, current position
12
Output post spark code executes for loading messages.
Partition-> Topic00-0 startingOffSet-> 0 Partition->
Topic00-0 endingOffSet-> 12 Committed: 12, current position
12
I see no diff and . Please take a look and suggest resolution for this sceanario.
I am trying to understand CompletableFutures and chaining of calls that that return completed futures and I have created the bellow example which kind of simulates two calls to a database.
The first method is supposed to be giving a completable future with a list of userIds and then I need to make a call to another method providing a userId to get the user (a string in this case).
to summarise:
1. fetch the ids
2. fetch a list of the users coresponding to those ids.
I created simple methods to simulate the responses with sleap threads.
Please check the code bellow
public class PipelineOfTasksExample {
private Map<Long, String> db = new HashMap<>();
PipelineOfTasksExample() {
db.put(1L, "user1");
db.put(2L, "user2");
db.put(3L, "user3");
db.put(4L, "user4");
}
private CompletableFuture<List<Long>> returnUserIdsFromDb() {
try {
Thread.sleep(500);
} catch (InterruptedException e) {
e.printStackTrace();
}
System.out.println("building the list of Ids" + " - thread: " + Thread.currentThread().getName());
return CompletableFuture.supplyAsync(() -> Arrays.asList(1L, 2L, 3L, 4L));
}
private CompletableFuture<String> fetchById(Long id) {
CompletableFuture<String> cfId = CompletableFuture.supplyAsync(() -> db.get(id));
try {
Thread.sleep(500);
} catch (InterruptedException e) {
e.printStackTrace();
}
System.out.println("fetching id: " + id + " -> " + db.get(id) + " thread: " + Thread.currentThread().getName());
return cfId;
}
public static void main(String[] args) {
PipelineOfTasksExample example = new PipelineOfTasksExample();
CompletableFuture<List<String>> result = example.returnUserIdsFromDb()
.thenCompose(listOfIds ->
CompletableFuture.supplyAsync(
() -> listOfIds.parallelStream()
.map(id -> example.fetchById(id).join())
.collect(Collectors.toList()
)
)
);
System.out.println(result.join());
}
}
My question is, does the join call (example.fetchById(id).join()) ruins the nonblocking nature of process. If the answer is positive how can I solve this problem?
Thank you in advance
Your example is a bit odd as you are slowing down the main thread in returnUserIdsFromDb(), before any operation even starts and likewise, fetchById slows down the caller rather than the asynchronous operation, which defeats the entire purpose of asynchronous operations.
Further, instead of .thenCompose(listOfIds -> CompletableFuture.supplyAsync(() -> …)) you can simply use .thenApplyAsync(listOfIds -> …).
So a better example might be
public class PipelineOfTasksExample {
private final Map<Long, String> db = LongStream.rangeClosed(1, 4).boxed()
.collect(Collectors.toMap(id -> id, id -> "user"+id));
PipelineOfTasksExample() {}
private static <T> T slowDown(String op, T result) {
LockSupport.parkNanos(TimeUnit.MILLISECONDS.toNanos(500));
System.out.println(op + " -> " + result + " thread: "
+ Thread.currentThread().getName()+ ", "
+ POOL.getPoolSize() + " threads");
return result;
}
private CompletableFuture<List<Long>> returnUserIdsFromDb() {
System.out.println("trigger building the list of Ids - thread: "
+ Thread.currentThread().getName());
return CompletableFuture.supplyAsync(
() -> slowDown("building the list of Ids", Arrays.asList(1L, 2L, 3L, 4L)),
POOL);
}
private CompletableFuture<String> fetchById(Long id) {
System.out.println("trigger fetching id: " + id + " thread: "
+ Thread.currentThread().getName());
return CompletableFuture.supplyAsync(
() -> slowDown("fetching id: " + id , db.get(id)), POOL);
}
static ForkJoinPool POOL = new ForkJoinPool(2);
public static void main(String[] args) {
PipelineOfTasksExample example = new PipelineOfTasksExample();
CompletableFuture<List<String>> result = example.returnUserIdsFromDb()
.thenApplyAsync(listOfIds ->
listOfIds.parallelStream()
.map(id -> example.fetchById(id).join())
.collect(Collectors.toList()
),
POOL
);
System.out.println(result.join());
}
}
which prints something like
trigger building the list of Ids - thread: main
building the list of Ids -> [1, 2, 3, 4] thread: ForkJoinPool-1-worker-1, 1 threads
trigger fetching id: 2 thread: ForkJoinPool-1-worker-0
trigger fetching id: 3 thread: ForkJoinPool-1-worker-1
trigger fetching id: 4 thread: ForkJoinPool-1-worker-2
fetching id: 4 -> user4 thread: ForkJoinPool-1-worker-3, 4 threads
fetching id: 2 -> user2 thread: ForkJoinPool-1-worker-3, 4 threads
fetching id: 3 -> user3 thread: ForkJoinPool-1-worker-2, 4 threads
trigger fetching id: 1 thread: ForkJoinPool-1-worker-3
fetching id: 1 -> user1 thread: ForkJoinPool-1-worker-2, 4 threads
[user1, user2, user3, user4]
which might be a surprising number of threads on the first glance.
The answer is that join() may block the thread, but if this happens inside a worker thread of a Fork/Join pool, this situation will be detected and a new compensation thread will be started, to ensure the configured target parallelism.
As a special case, when the default Fork/Join pool is used, the implementation may pick up new pending tasks within the join() method, to ensure progress within the same thread.
So the code will always make progress and there’s nothing wrong with calling join() occasionally, if the alternatives are much more complicated, but there’s some danger of too much resource consumption, if used excessively. After all, the reason to use thread pools, is to limit the number of threads.
The alternative is to use chained dependent operations where possible.
public class PipelineOfTasksExample {
private final Map<Long, String> db = LongStream.rangeClosed(1, 4).boxed()
.collect(Collectors.toMap(id -> id, id -> "user"+id));
PipelineOfTasksExample() {}
private static <T> T slowDown(String op, T result) {
LockSupport.parkNanos(TimeUnit.MILLISECONDS.toNanos(500));
System.out.println(op + " -> " + result + " thread: "
+ Thread.currentThread().getName()+ ", "
+ POOL.getPoolSize() + " threads");
return result;
}
private CompletableFuture<List<Long>> returnUserIdsFromDb() {
System.out.println("trigger building the list of Ids - thread: "
+ Thread.currentThread().getName());
return CompletableFuture.supplyAsync(
() -> slowDown("building the list of Ids", Arrays.asList(1L, 2L, 3L, 4L)),
POOL);
}
private CompletableFuture<String> fetchById(Long id) {
System.out.println("trigger fetching id: " + id + " thread: "
+ Thread.currentThread().getName());
return CompletableFuture.supplyAsync(
() -> slowDown("fetching id: " + id , db.get(id)), POOL);
}
static ForkJoinPool POOL = new ForkJoinPool(2);
public static void main(String[] args) {
PipelineOfTasksExample example = new PipelineOfTasksExample();
CompletableFuture<List<String>> result = example.returnUserIdsFromDb()
.thenComposeAsync(listOfIds -> {
List<CompletableFuture<String>> jobs = listOfIds.parallelStream()
.map(id -> example.fetchById(id))
.collect(Collectors.toList());
return CompletableFuture.allOf(jobs.toArray(new CompletableFuture<?>[0]))
.thenApply(_void -> jobs.stream()
.map(CompletableFuture::join).collect(Collectors.toList()));
},
POOL
);
System.out.println(result.join());
System.out.println(ForkJoinPool.commonPool().getPoolSize());
}
}
The difference is that first, all asynchronous jobs are submitted, then, a dependent action calling join on them is scheduled, to be executed only when all jobs have completed, so these join invocations will never block. Only the final join call at the end of the main method may block the main thread.
So this prints something like
trigger building the list of Ids - thread: main
building the list of Ids -> [1, 2, 3, 4] thread: ForkJoinPool-1-worker-1, 1 threads
trigger fetching id: 3 thread: ForkJoinPool-1-worker-1
trigger fetching id: 2 thread: ForkJoinPool-1-worker-0
trigger fetching id: 4 thread: ForkJoinPool-1-worker-1
trigger fetching id: 1 thread: ForkJoinPool-1-worker-0
fetching id: 4 -> user4 thread: ForkJoinPool-1-worker-1, 2 threads
fetching id: 3 -> user3 thread: ForkJoinPool-1-worker-0, 2 threads
fetching id: 2 -> user2 thread: ForkJoinPool-1-worker-1, 2 threads
fetching id: 1 -> user1 thread: ForkJoinPool-1-worker-0, 2 threads
[user1, user2, user3, user4]
showing that no compensation threads had to be created, so the number of threads matches the configured target parallelism.
Note that if the actual work is done in a background thread rather than within the fetchById method itself, you now don’t need a parallel stream anymore, as there is no blocking join() call. For such scenarios, just using stream() will usually result in a higher performance.
I have an Observable emitting items, and I would like to merge into it special items acting as "time ticks" since the last item has been received.
I was trying to play around with timeout+onErrorXXX or intervals, but could not get it to work as expected.
import io.reactivex.Observable;
import io.reactivex.functions.Function;
import org.apache.log4j.Logger;
import org.junit.Test;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
public class RXTest {
private static final Logger log = Logger.getLogger(RXTest.class);
#Test
public void rxTest() throws InterruptedException {
log.info("Starting");
Observable.range(0, 26)
.concatMap(
item -> Observable.just(item)
.delay(item, TimeUnit.SECONDS)
)
.timeout(100, TimeUnit.MILLISECONDS)
// .retry()
.onErrorResumeNext((Function) throwable -> {
if (throwable instanceof TimeoutException) {
return Observable.just(-1);
}
throw new RuntimeException((Throwable)throwable);
})
.subscribe(
item -> log.info("Received " + item),
throwable -> log.error("Thrown" + throwable),
() -> log.info("Completed")
);
Thread.sleep(30000);
}
}
I would expect it to output something like:
00:00.000 Received 0
00:00.100 Received -1
00:00.200 Received -1
... (more Received -1 every 100 millis)
00:01.000 Received 1
00:01.100 Received -1
00:01.200 Received -1
...
00:03.000 Received 2
00:03.100 Received -1
00:03.200 Received -1
...
But instead, it receives -1 only once and then completes.
EDIT
Hopefully this marbles diagram would make it easier to understand:
I don't clearly understand what you want, but the expected output will be if you change onResumeNext() to return
Observable.interval(100, TimeUnit.MILLISECONDS)
.take(2)
.map(__ -> -1)
I am using RetryExecutor from : https://github.com/nurkiewicz/async-retry
Below id my code :
ScheduledExecutorService executorService = Executors.newScheduledThreadPool(10);
RetryExecutor retryExecutor = new AsyncRetryExecutor(executorService)
.retryOn(IOException.class)
.withExponentialBackoff(500, 2)
.withMaxDelay(5_000) // 5 seconds
.withUniformJitter()
.withMaxRetries(5);
I have submitted a few tasks to retryExecutor.
retryExecutor.getWithRetry(ctx -> {
if(ctx.getRetryCount()==0)
System.out.println("Starting download from : " + url);
else
System.out.println("Retrying ("+ctx.getRetryCount()+") dowload from : "+url);
return downloadFile(url);
}
).whenComplete((result, error) -> {
if(result!=null && result){
System.out.println("Successfully downloaded!");
}else{
System.out.println("Download failed. Error : "+error);
}
});
Now, how do I wait for all submitted tasks to finish?
I want to wait until all retries are finished (if any).
don't think it will be as simple as executorService.shutdown();
CompletableFuture<DownloadResult> downloadPromise =
retryExecutor.getWithRetry(...)
.whenComplete(...);
DownloadResult downloadResult = downloadPromise.get()
Need help making an observable start on the main thread, and then move on to a pool of threads allowing the source to continue emitting new items (regardless if they are still being processed in the pool of threads).
This is my example:
public static void main(String[] args) {
Observable<Integer> source = Observable.range(1,10);
source.map(i -> sleep(i, 10))
.doOnNext(i -> System.out.println("Emitting " + i + " on thread " + Thread.currentThread().getName()))
.observeOn(Schedulers.computation())
.map(i -> sleep(i * 10, 300))
.subscribe( i -> System.out.println("Received " + i + " on thread " + Thread.currentThread().getName()));
sleep(-1, 30000);
}
private static int sleep(int i, int time) {
try {
Thread.sleep(time);
} catch (InterruptedException e) {
e.printStackTrace();
}
return i;
}
which always prints:
Emitting 1 on thread main
Emitting 2 on thread main
Emitting 3 on thread main
Received 10 on thread RxComputationScheduler-1
Emitting 4 on thread main
Emitting 5 on thread main
Emitting 6 on thread main
Received 20 on thread RxComputationScheduler-1
Emitting 7 on thread main
Emitting 8 on thread main
Emitting 9 on thread main
Received 30 on thread RxComputationScheduler-1
Emitting 10 on thread main
Received 40 on thread RxComputationScheduler-1
Received 50 on thread RxComputationScheduler-1
Received 60 on thread RxComputationScheduler-1
Received 70 on thread RxComputationScheduler-1
Received 80 on thread RxComputationScheduler-1
Received 90 on thread RxComputationScheduler-1
Received 100 on thread RxComputationScheduler-1
Although items are emitted on the main thread as supposed, I want them to move on to the computation/IO thread-pool afterwards.
Should be something like this:
I don't think you were slowing down the source emissions enough, and they were emitting so quickly that all items were emitted before the observeOn() had a chance to schedule them.
Try sleeping to 500ms instead of 10ms. You will then see interleaving like you would expect.
public class JavaLauncher {
public static void main(String[] args) {
Observable<Integer> source = Observable.range(1,10);
source.map(i -> sleep(i, 500))
.doOnNext(i -> System.out.println("Emitting " + i + " on thread " + Thread.currentThread().getName()))
.observeOn(Schedulers.computation())
.map(i -> sleep(i * 10, 250))
.subscribe( i -> System.out.println("Received " + i + " on thread " + Thread.currentThread().getName()));
sleep(-1, 30000);
}
private static int sleep(int i, int time) {
try {
Thread.sleep(time);
} catch (InterruptedException e) {
e.printStackTrace();
}
return i;
}
}
OUTPUT
Emitting 1 on thread main
Emitting 2 on thread main
Emitting 3 on thread main
Received 10 on thread RxComputationThreadPool-3
Emitting 4 on thread main
Received 20 on thread RxComputationThreadPool-3
Emitting 5 on thread main
Emitting 6 on thread main
Received 30 on thread RxComputationThreadPool-3
Emitting 7 on thread main
Emitting 8 on thread main
Received 40 on thread RxComputationThreadPool-3
Emitting 9 on thread main
Emitting 10 on thread main
Received 50 on thread RxComputationThreadPool-3
Received 60 on thread RxComputationThreadPool-3
Received 70 on thread RxComputationThreadPool-3
Received 80 on thread RxComputationThreadPool-3
Received 90 on thread RxComputationThreadPool-3
Received 100 on thread RxComputationThreadPool-3
UPDATE - Parallelized Version
public class JavaLauncher {
public static void main(String[] args) {
Observable<Integer> source = Observable.range(1,10);
source.map(i -> sleep(i, 250))
.doOnNext(i -> System.out.println("Emitting " + i + " on thread " + Thread.currentThread().getName()))
.flatMap(i ->
Observable.just(i)
.subscribeOn(Schedulers.computation())
.map(i2 -> sleep(i2 * 10, 500))
)
.subscribe( i -> System.out.println("Received " + i + " on thread " + Thread.currentThread().getName()));
sleep(-1, 30000);
}
private static int sleep(int i, int time) {
try {
Thread.sleep(time);
} catch (InterruptedException e) {
e.printStackTrace();
}
return i;
}
}
OUTPUT
Emitting 1 on thread main
Emitting 2 on thread main
Emitting 3 on thread main
Received 10 on thread RxComputationThreadPool-3
Emitting 4 on thread main
Received 20 on thread RxComputationThreadPool-4
Received 30 on thread RxComputationThreadPool-1
Emitting 5 on thread main
Received 40 on thread RxComputationThreadPool-2
Emitting 6 on thread main
Received 50 on thread RxComputationThreadPool-3
Emitting 7 on thread main
Received 60 on thread RxComputationThreadPool-4
Emitting 8 on thread main
Received 70 on thread RxComputationThreadPool-1
Emitting 9 on thread main
Received 80 on thread RxComputationThreadPool-2
Emitting 10 on thread main
Received 90 on thread RxComputationThreadPool-3
Received 100 on thread RxComputationThreadPool-4