I'm pretty new to Apache Spark. I would like some guidance on whether this is bad practice for an Apache Spark job.
The goal is to make requests to an external REST API and join in the responses while processing data. This needs to handle thousands of requests. I am trying to make async HTTP requests and return the HTTP responses as an RDD.
Here is an example of what I am trying to do:
public final class AsyncSparkJob implements Serializable {
// Java-friendly version of SparkContext
// Used to return JavaRDDs and works with Java Collections.
private static JavaSparkContext sc;
// AsyncSparkJob - constructor
public AsyncSparkJob(JavaSparkContext sc) {
// initialize the spark context
this.sc = sc;
}
// run - execute the spark transformations and actions
public void run(String filePath ) {
System.out.println("Starting spark job");
JavaRDD<String> inputFile = this.sc.textFile(filePath);
// Send a partition of http requests to each executor
Long results = inputFile.mapPartitions(new FlatMapFunction<Iterator<String>, HttpResponse>(){
// call - FlatMapFunction call implementation
public Iterator<HttpResponse> call(Iterator<String> stringIterator) throws Exception {
RequestConfig requestConfig = RequestConfig.custom()
.setSocketTimeout(300000)
.setConnectTimeout(300000).build();
CloseableHttpAsyncClient httpClient = HttpAsyncClients.custom()
.setDefaultRequestConfig(requestConfig).setMaxConnTotal(500).setMaxConnPerRoute(500)
.build();
httpClient.start();
List<HttpResponse> httpResponseList = new LinkedList<HttpResponse>();
try {
List<Future<HttpResponse>> futureResponseList = new LinkedList<Future<HttpResponse>>();
// As long as we have values in the Iterator keep looping
while (stringIterator.hasNext()) {
String uri = stringIterator.next();
HttpGet request = new HttpGet(uri);
Future<HttpResponse> futureResponse = httpClient.execute(request, new FutureCallback<HttpResponse>() {
public void completed(HttpResponse httpResponse) {
System.out.println("Completed request");
}
public void failed(Exception e) {
System.out.println("failed" + e);
}
public void cancelled() {
System.out.println("cancelled");
}
});
futureResponseList.add(futureResponse);
}
// Now that we have submitted all of the requests we can start
// looping through the futures and trying to read the responses.
for (Future<HttpResponse> futureResponse : futureResponseList) {
/* This call blocks. However, we have already submitted
all of our requests, so after the first wait we should expect to
block less often when reading from the remaining futures.
*/
httpResponseList.add(futureResponse.get());
}
} catch ( Exception e ) {
System.out.println("Caught " + e);
}finally {
httpClient.close();
}
return httpResponseList.iterator();
}
}).count();
System.out.println("Final result count : " + results);
}
public static void main( String[] args ) {
// Init the spark context
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("AsyncSparkJob"));
// Create the spark job
AsyncSparkJob asj = new AsyncSparkJob(sc);
asj.run(args[0]);
System.out.println("Done");
}
}
Is this a valid use case?
I have a Spring Boot application where I created a POST method that sends data in a streaming fashion to the caller. Code below:
@RequestMapping(value = "/mapmatchstreaming", method = RequestMethod.POST)
public ResponseEntity<StreamingResponseBody> handleRequest(@RequestParam(value = "data", required = true) String data, @RequestParam(value = "mnr", required = true) Boolean mnr) {
logger.info("/mapmatchstreaming endpoint");
try {
Semaphore semaphore = new Semaphore(1);
ObjectMapper mapper = new ObjectMapper();
StreamingResponseBody responseBody = new StreamingResponseBody() {
@Override
public void writeTo (OutputStream outputStream) throws IOException {
// For each map
DataReader dataReader = new DataReader(data, "2020.06.011");
for(String mapRoot: dataReader.getMapsFolders()) {
dataReader = new DataReader(data, "2020.06.011");
DistributedMapMatcherStreaming distributedMapMatcher = new DistributedMapMatcherStreaming(dataReader.getTraces(), mapRoot, dataReader.getBoundingBox());
distributedMapMatcher.mapMatchBatch(new DistributedMapMatcherResult() {
@Override
public void onCorrectlyMapMatched(MapMatchedTrajectory mapMatchedTrajectory) {
try {
semaphore.acquire();
outputStream.write(mapper.writeValueAsString(mapMatchedTrajectory).getBytes());
outputStream.flush();
}
catch (Exception e) {
e.printStackTrace();
logger.error(String.format("Writing to output stream error: %s", e.getMessage()));
} finally{
semaphore.release();
}
}
});
}
}
};
return new ResponseEntity<StreamingResponseBody>(responseBody, HttpStatus.OK);
}
catch (Exception e) {
logger.error(String.format("Map-matching result ERROR: %s", ExceptionUtils.getStackTrace(e)));
return new ResponseEntity<StreamingResponseBody>(HttpStatus.BAD_REQUEST);
}
}
It works nicely, but the problem is that if multiple calls arrive at this method, all of them run in parallel even though I have set server.tomcat.threads.max=1. In the non-streaming version, each call waits for the current one to complete.
Is it possible to have blocking streaming calls in Spring? Thanks.
EDIT: I temporarily solved this by using a global semaphore with only 1 permit, but I don't think this is the ideal solution.
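For what it's worth, the one-permit semaphore from the EDIT can be sketched outside Spring as below. This is only an illustration under assumptions: `SerializedStreaming`, `runExclusively`, and `simulate` are made-up names, and the sleeping task stands in for the map-matching and streaming work. One plausible reason the Tomcat thread limit does not serialize the calls is that a StreamingResponseBody is written on a separate task executor rather than on the request thread, so the body of writeTo would be the place to hold the permit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the streaming handler: only one caller at a time
// may run the guarded section; the others wait in arrival order.
final class SerializedStreaming {
    // Fair semaphore with a single permit: waiting callers are served FIFO.
    private static final Semaphore GATE = new Semaphore(1, true);
    private static final AtomicInteger active = new AtomicInteger();
    private static final AtomicInteger maxActive = new AtomicInteger();

    // Runs the given work while holding the single permit.
    static void runExclusively(Runnable work) {
        GATE.acquireUninterruptibly();
        try {
            int now = active.incrementAndGet();
            maxActive.accumulateAndGet(now, Math::max);
            work.run();
        } finally {
            active.decrementAndGet();
            GATE.release(); // always give the permit back
        }
    }

    // Starts `callers` threads that compete for the gate and returns the
    // highest number of callers ever observed inside the guarded section.
    static int simulate(int callers) {
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < callers; i++) {
            Thread t = new Thread(() -> runExclusively(() -> {
                try {
                    Thread.sleep(5); // pretend to stream some data
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return maxActive.get();
    }
}
```

Taking the permit before the work and releasing it in a finally block, once per request rather than once per write, keeps a failed request from blocking every later one.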
I'm writing an Android chat application with socket.io-client-java. I want to first check whether the client user exists, so I need to send a command like "user/exist" to the server URL and get the response from the server. I need to wait for the server response before going on to the next step, but socket.io uses asynchronous callbacks. The only ways I know to get a response synchronously are Future and Callable, so I tried the code below:
//this is request method using socket.io
public JSONObject request(final String method,final String url,final JSONObject data){
final JSONObject responseObj = new JSONObject();
if (mSocket.connected()) {
mSocket.emit(method, reqObj, new Ack() {
@Override
public void call(Object... objects) {
System.out.println("get Ack");
try {
responseObj.put("body", (JSONObject) objects[0]);
}catch (JSONException e){
e.printStackTrace();
}
}
});
}
// Returns immediately, before the Ack callback above has run.
return responseObj;
}
//this is Callable call implement
@Override
public JSONObject call(){
return request("get","https://my-chat-server/user/exist",new JSONObject());
}
//this is call method in activity
ExecutorService executor = Executors.newCachedThreadPool();
Future<JSONObject> response = executor.submit(mApiSocket);
executor.shutdown();
JSONObject respObj = new JSONObject();
JSONObject respBody = new JSONObject();
try {
respObj = response.get();
respBody = respObj.getJSONObject("body");
}catch (ExecutionException e){
}catch(InterruptedException e1){
}catch(JSONException e2){
}
But it does not work. The respObj is null.
How can I get the response synchronously?
I am new to Java, so please forgive my poor English.
Any help would be appreciated!
I know that in JS I can use a Promise and await, like below:
//request method
static request(method, url, data) {
return new Promise((resolve, reject) => {
this.socket.emit(method,
{
url: url,
method,
data,
},
async (res) => {
if (res.statusCode == 100) {
resolve(res.body, res);
} else {
reject(new Error(`${res.statusCode} error: ${res.body}`));
}
}
)
})
}
//call method
response = await mSocket.request('get','https://my-chat-server/user/exist', {
first_name: 'xu',
last_name: 'zhitong',
});
I'm not sure this is the best way but we can wait for the callback as follows:
@Nullable
Object[] emitAndWaitForAck(@NotNull String event, @Nullable Object[] args,
long timeoutMillis) {
Object[][] response = new Object[1][1];
Semaphore lock = new Semaphore(0);
socketClient.emit(event, args, ackArgs -> {
response[0] = ackArgs;
lock.release();
});
try {
boolean acquired = lock.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
if (acquired) {
return response[0];
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt(); // keep the interrupt visible to callers
}
return null;
}
Assuming your socket.io server returns one argument containing the body (or null) you would call it something like this:
String method = "get";
String url = "https://my-chat-server/user/exist";
long timeoutMillis = 5000;
Object[] args = emitAndWaitForAck(method, new String[]{url}, timeoutMillis);
JSONObject response = (JSONObject) args[0];
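The same wait-for-the-callback idea can also be written with a CompletableFuture instead of a Semaphore and an array. The following is a rough sketch, not socket.io code: `AckBridge`, `AsyncEmitter`, and `emitAndWait` are hypothetical names, and the interface only imitates the shape of an emit call that takes an ack callback. CompletableFuture needs Java 8 (API level 24 on Android, as far as I know).

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

final class AckBridge {
    // Hypothetical callback-style API standing in for emit(event, args, ack).
    interface AsyncEmitter {
        void emit(String event, Object[] args, Consumer<Object[]> ack);
    }

    // Emits the event and blocks until the ack arrives or the timeout expires.
    // Returns the ack arguments, or null on timeout, failure, or interruption.
    static Object[] emitAndWait(AsyncEmitter emitter, String event,
                                Object[] args, long timeoutMillis) {
        CompletableFuture<Object[]> future = new CompletableFuture<>();
        // The ack callback completes the future from whatever thread it runs on.
        emitter.emit(event, args, future::complete);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException | TimeoutException e) {
            return null;
        }
    }
}
```

Either way the calling thread blocks for up to the timeout, so on Android this should not run on the main thread, and the caller should check for null before indexing into the returned array.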
I'm currently on a project where I have to make multiple concurrent HTTP requests to a REST service that returns a JSON response. This is a batch operation, and the number of requests at any time could range from several hundred to several thousand.
That's why I thought it would be a good idea to use an async HTTP client, so that I could have concurrent requests, which could dramatically speed up the process. I first tried ning's async-http-client. Maybe I was doing something wrong, because it was kind of slow for me: about 10 seconds for 1000 requests.
After that I tried Apache's implementation, which was much faster at about 4 seconds for 1000 requests. But I can't seem to get the requests to be stable. Most of the time I get a List with 1000 responses (as I expect), but sometimes I am missing a few responses, like 1 or 2.
This is currently my code:
public class AsyncServiceTest {
public AsyncServiceTest(String serviceURI) {
this.httpClient = HttpAsyncClients.custom().setMaxConnPerRoute(100).setMaxConnTotal(20)
.setDefaultRequestConfig(RequestConfig.custom().build()).build();
this.objectMapper = new ObjectMapper();
this.serviceURI = serviceURI;
}
private List<Object> getResults(List<String> queryStrings) throws Exception {
try {
httpClient.start();
final List<HttpGet> requests = new ArrayList<>(queryStrings.size());
for (String str : queryStrings) {
requests.add(new HttpGet(buildUri(str))); // In this method we build the absolute request uri.
}
final CountDownLatch latch = new CountDownLatch(requests.size());
final List<Object> responses = new ArrayList<>(requests.size());
final List<String> stringResponses = new ArrayList<>(requests.size());
for (final HttpGet request : requests) {
httpClient.execute(request, new FutureCallback<HttpResponse>() {
@Override
public void completed(HttpResponse response) {
try {
stringResponses.add(IOUtils.toString(response.getEntity().getContent(), "UTF-8"));
latch.countDown();
} catch (IOException e) {
e.printStackTrace();
}
}
@Override
public void failed(Exception e) {
latch.countDown();
}
@Override
public void cancelled() {
latch.countDown();
}
});
}
latch.await();
for (String r : stringResponses) {
responses.add(mapToLocation(r)); // Mapping some Strings to JSON in this method.
}
return responses;
} finally {
httpClient.close();
}
}
}
So, in essence, I am wondering whether there is something wrong with my code (probably), or whether this is just how the library works, because the CountDownLatch is at zero every time. Or does anyone have a pointer in the right direction (maybe to another library)?
It seemed to be a concurrency problem (thanks to @vanOekel) in my code. The answer is to replace the ArrayList<E> with a Vector<E>, which is in fact thread-safe. Example code:
public class AsyncServiceTest {
public AsyncServiceTest(String serviceURI) {
this.httpClient = HttpAsyncClients.custom().setMaxConnPerRoute(100).setMaxConnTotal(20)
.setDefaultRequestConfig(RequestConfig.custom().build()).build();
this.objectMapper = new ObjectMapper();
this.serviceURI = serviceURI;
}
private List<Object> getResults(List<String> queryStrings) throws Exception {
try {
httpClient.start();
final CountDownLatch latch = new CountDownLatch(queryStrings.size());
final Vector<Object> responses = new Vector<>(queryStrings.size());
for (String str : queryStrings) {
// buildUri: In this method we build the absolute request uri.
httpClient.execute(new HttpGet(buildUri(str)), new FutureCallback<HttpResponse>() {
@Override
public void completed(HttpResponse response) {
try {
// mapToLocation: Mapping some Strings to JSON in this method.
responses.add(mapToLocation(IOUtils.toString(response.getEntity().getContent(), "UTF-8")));
latch.countDown();
} catch (IOException e) {
failed(e);
}
}
@Override
public void failed(Exception e) {
logger.error(e.getLocalizedMessage(), e);
latch.countDown();
}
@Override
public void cancelled() {
logger.error("Request cancelled.");
latch.countDown();
}
});
}
latch.await();
return responses;
} finally {
httpClient.close();
}
}
}
Thanks for all the helpful responses. If anyone has any suggestions for optimizing the above code, I will be glad to hear them.
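As one possible variation on the Vector fix, here is a small self-contained sketch of the same pattern: results are added from several threads into a synchronized list, and the latch is counted down in a finally block so that an exception while handling one response cannot leave the caller waiting forever. The names `SafeCollect` and `collect` are invented for the example, and a thread pool stands in for the HTTP client's callback threads.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class SafeCollect {
    // Runs one task per input on a pool and gathers the results that the
    // worker threads produce. Result order is not guaranteed.
    static List<String> collect(List<String> inputs) {
        List<String> results =
                Collections.synchronizedList(new ArrayList<>(inputs.size()));
        CountDownLatch latch = new CountDownLatch(inputs.size());
        ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            for (String input : inputs) {
                pool.execute(() -> {
                    try {
                        // Pretend response handling; add() is safe across threads.
                        results.add(input.toUpperCase());
                    } finally {
                        latch.countDown(); // runs even if handling throws
                    }
                });
            }
            latch.await(); // wait until every task has reported in
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return new ArrayList<>(results);
    }
}
```

With a plain ArrayList in place of the synchronized list, concurrent add() calls can silently lose elements, which matches the "missing 1 or 2 responses" symptom described in the question.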
I have the following XML-RPC implementation working which I copied and slightly modified from the apache website.
public class DemoServer {
public static void main (String [] args) {
try {
WebServer webServer = new WebServer(8080);
XmlRpcServer xmlRpcServer = webServer.getXmlRpcServer();
PropertyHandlerMapping phm = new PropertyHandlerMapping();
phm.addHandler("sample", RequestHandler.class);
xmlRpcServer.setHandlerMapping(phm);
XmlRpcServerConfigImpl serverConfig =
(XmlRpcServerConfigImpl) xmlRpcServer.getConfig();
serverConfig.setEnabledForExtensions(true);
serverConfig.setContentLengthOptional(false);
webServer.start();
} catch (Exception e) {
e.printStackTrace();
}
}
}
With a client:
public class DemoClient {
public static void main (String[] args) {
try {
XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
config.setServerURL(new URL("http://127.0.0.1:8080/xmlrpc"));
config.setEnabledForExtensions(true);
config.setConnectionTimeout(60 * 1000);
config.setReplyTimeout(60 * 1000);
XmlRpcClient client = new XmlRpcClient();
// set configuration
client.setConfig(config);
// make a regular call
Object[] params = new Object[] { new Integer(2), new Integer(3) };
//!CRITICAL LINE!
Integer result = (Integer) client.execute("sample.sum", params);
System.out.println("2 + 3 = " + result);
} catch (Exception e) {
e.printStackTrace();
}
}
}
I run DemoServer first, and then I run DemoClient, and it prints "2 + 3 = 5".
However if I change
Integer result = (Integer) client.execute("sample.sum", params);
to
client.executeAsync("sample.sum", params, new ClientCallback());
then I get the following:
In error
java.lang.ExceptionInInitializerError
at java.lang.Runtime.addShutdownHook(Runtime.java:192)
at java.util.logging.LogManager.<init>(LogManager.java:237)
at java.util.logging.LogManager$1.run(LogManager.java:177)
at java.security.AccessController.doPrivileged(Native Method)
at java.util.logging.LogManager.<clinit>(LogManager.java:158)
at java.util.logging.Logger.getLogger(Logger.java:273)
at sun.net.www.protocol.http.HttpURLConnection.<clinit>(HttpURLConnection.java:62)
at sun.net.www.protocol.http.Handler.openConnection(Handler.java:44)
at sun.net.www.protocol.http.Handler.openConnection(Handler.java:39)
at java.net.URL.openConnection(URL.java:945)
at org.apache.xmlrpc.client.XmlRpcSun15HttpTransport.newURLConnection(XmlRpcSun15HttpTransport.java:62)
at org.apache.xmlrpc.client.XmlRpcSunHttpTransport.sendRequest(XmlRpcSunHttpTransport.java:62)
at org.apache.xmlrpc.client.XmlRpcClientWorker$1.run(XmlRpcClientWorker.java:80)
at java.lang.Thread.run(Thread.java:680)
Caused by: java.lang.IllegalStateException: Shutdown in progress
at java.lang.Shutdown.add(Shutdown.java:62)
at java.lang.ApplicationShutdownHooks.<clinit>(ApplicationShutdownHooks.java:21)
... 14 more
My ClientCallback class:
public class ClientCallback implements AsyncCallback {
@Override
public void handleError(XmlRpcRequest request, Throwable t) {
System.out.println("In error");
t.printStackTrace();
}
@Override
public void handleResult(XmlRpcRequest request, Object result) {
System.out.println("In result");
System.out.println(request.getMethodName() + ": " + result);
}
}
What is going wrong here? I am working with Apache XML-RPC version 3.1.2, and unfortunately the example code I have found is for version 2.x and doesn't apply anymore. I have also omitted the import statements from the beginning of my classes (there are no syntax errors for sure). Any help would be much appreciated.
Your main program is running off the end because executeAsync returns immediately without waiting for the request to be sent or the response to come back.
What are you trying to accomplish by using executeAsync?
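If the async call is still wanted, one way to keep main from running off the end is to wait on a latch that the callback releases. The sketch below does not use the XML-RPC classes: `WaitForCallback`, `Callback`, `sumAsync`, and `callAndWait` are hypothetical stand-ins, with `sumAsync` imitating a call that returns at once and reports back on a daemon thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

final class WaitForCallback {
    // Hypothetical two-method callback, shaped like the ClientCallback above.
    interface Callback {
        void handleResult(Object result);
        void handleError(Throwable t);
    }

    // Hypothetical async call: does the work on another thread and returns at once.
    static void sumAsync(int a, int b, Callback callback) {
        Thread worker = new Thread(() -> callback.handleResult(a + b));
        worker.setDaemon(true); // a daemon thread does not keep the JVM alive
        worker.start();
    }

    // Starts the async call, then blocks until a callback fires or the
    // timeout passes. Returns null if no result arrived in time.
    static Object callAndWait(int a, int b, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        AtomicReference<Object> result = new AtomicReference<>();
        sumAsync(a, b, new Callback() {
            @Override
            public void handleResult(Object value) {
                result.set(value);
                done.countDown();
            }

            @Override
            public void handleError(Throwable t) {
                done.countDown();
            }
        });
        try {
            done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result.get();
    }

    public static void main(String[] args) {
        System.out.println("2 + 3 = " + callAndWait(2, 3, 5000));
    }
}
```

Without the await, main could return while the worker is still running, and the JVM would begin shutting down underneath it, which is consistent with the "Shutdown in progress" message in the stack trace.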
I'm trying to make x number of HTTP requests asynchronously. I looked at the questions "Asynchronous IO in Java?" and "How do you create an asynchronous HTTP request in JAVA?". I found a good library, "Asynchronous Http and WebSocket Client library for Java", but I don't understand how I can safely combine multiple results into one result. For example, if I have the following code:
AsyncHttpClient c = new AsyncHttpClient();
List<String> urls = getUrls();
List<MyResultObject> results = new ArrayList<>();
for(String url : urls)
{
// Create asynchronous request
Future<MyResultObject> f = c.prepareGet(url).execute(handler);
// How can I add completed responses to my results list ???
}
How can I safely combine those results into a List and continue when all requests have finished?
I found this tutorial for using futures. You could just do the following:
AsyncHttpClient c = new AsyncHttpClient();
List<String> urls = getUrls();
List<Future<MyResultObject>> futures = new ArrayList<>(); // keep track of your futures
List<MyResultObject> results = new ArrayList<>();
for(String url : urls)
{
// Create asynchronous request
Future<MyResultObject> f = c.prepareGet(url).execute(handler);
futures.add(f);
}
// Now retrieve the result
for (Future<MyResultObject> future : futures) {
try {
results.add(future.get());
} catch (InterruptedException e) {
e.printStackTrace();
} catch (ExecutionException e) {
e.printStackTrace();
}
}
// continue with your result list
You can call the get() method of the Future class to obtain the result. Note that the call may block until the result is available.
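A variation on the collect-the-futures loop, in case Java 8 is available: CompletableFuture.allOf waits for every request in one step, after which each result can be read without blocking. This is a sketch with invented names (`CombineResults`, `fetchAll`), and the `fetch` function is a placeholder for whatever actually performs the HTTP call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

final class CombineResults {
    // Applies `fetch` to every URL on a small pool and returns the results
    // in the same order as the input list.
    static List<String> fetchAll(List<String> urls, Function<String, String> fetch) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<CompletableFuture<String>> futures = new ArrayList<>();
            for (String url : urls) {
                futures.add(CompletableFuture.supplyAsync(() -> fetch.apply(url), pool));
            }
            // Completes once every request has completed; join() rethrows a
            // failure as an unchecked CompletionException.
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
            List<String> results = new ArrayList<>();
            for (CompletableFuture<String> future : futures) {
                results.add(future.join()); // already done, so this does not block
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Only the calling thread touches the results list here, so a plain ArrayList is safe without extra synchronization.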
If you want to combine several HTTP requests and get all of the results, you can look at the code below.
package ParallelTasks;
import org.apache.commons.lang3.tuple.MutablePair;
import org.apache.commons.lang3.tuple.Pair;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
public class ParallelHttpRequest {
//thread pool to execute http request task.
static final ExecutorService businessRequestExecutor = Executors.newCachedThreadPool();
public static void main(String[] args) throws InterruptedException, ExecutionException {
List<String> urlList = new ArrayList<String>(); // fill in the URLs to request before the latch is created
final CountDownLatch latch = new CountDownLatch(urlList.size());
List<Future<Pair<String, String>>> list = new ArrayList<Future<Pair<String, String>>>();
for (final String url : urlList) {
Future<Pair<String, String>> future = businessRequestExecutor.submit(new Callable<Pair<String, String>>() {
public Pair<String, String> call() throws Exception {
try {
//do post or get http request here.
//SoaHttpUtil.post(config.getUrl(), buReqJson);
String result = "";
return new MutablePair<String, String>(url, result);
} catch (Exception ex) {
System.out.println(ex);
return new MutablePair<String, String>(url, null);
} finally {
latch.countDown();
}
}
});
list.add(future);
}
//wait no more than 5 seconds.
latch.await(5000, TimeUnit.MILLISECONDS);
//print finished request's result.
for (Future<Pair<String, String>> future : list) {
if (future.isDone()) {
System.out.println(future.get().getValue());
}
}
}
}
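The latch plus isDone() combination above can also be expressed with ExecutorService.invokeAll and a timeout, which waits for all tasks up to a deadline and cancels the ones that are still running. A minimal sketch, with `InvokeAllWithTimeout` and `runAll` as made-up names and null used as the marker for a task that timed out or failed:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class InvokeAllWithTimeout {
    // Runs all tasks and waits at most timeoutMillis in total. Tasks that did
    // not finish in time are cancelled and contribute null to the result list.
    static List<String> runAll(List<Callable<String>> tasks, long timeoutMillis) {
        ExecutorService pool = Executors.newCachedThreadPool();
        List<String> results = new ArrayList<>();
        try {
            List<Future<String>> futures =
                    pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);
            for (Future<String> future : futures) {
                if (future.isCancelled()) {
                    results.add(null); // did not finish before the deadline
                } else {
                    try {
                        results.add(future.get());
                    } catch (ExecutionException e) {
                        results.add(null); // the task itself threw
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow(); // release the pool threads so the JVM can exit
        }
        return results;
    }
}
```

The returned futures are in the same order as the submitted tasks, so each result can be matched to its URL by index instead of carrying a Pair around.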