What is the difference between Spark Serialization and Java Serialization?

I'm using Spark + Yarn and I have a service that I want to call on distributed nodes.
When I serialize this service object "by hand" in a JUnit test using Java serialization, all inner collections of the service are correctly serialized and deserialized:
@Test
public void testSerialization() {
try (
ConfigurableApplicationContext contextBusiness = new ClassPathXmlApplicationContext("spring-context.xml");
FileOutputStream fileOutputStream = new FileOutputStream("myService.ser");
ObjectOutputStream objectOutputStream = new ObjectOutputStream(fileOutputStream);
) {
final MyService service = (MyService) contextBusiness.getBean("myServiceImpl");
objectOutputStream.writeObject(service);
objectOutputStream.flush();
} catch (final java.io.IOException e) {
logger.error(e.getMessage(), e);
}
}
@Test
public void testDeSerialization() throws ClassNotFoundException {
try (
FileInputStream fileInputStream = new FileInputStream("myService.ser");
ObjectInputStream objectInputStream = new ObjectInputStream(fileInputStream);
) {
final MyService myService = (MyService) objectInputStream.readObject();
// HERE a functional test that proves the service has been fully serialized and deserialized.
} catch (final java.io.IOException e) {
logger.error(e.getMessage(), e);
}
}
But when I try to call this service via my Spark launcher, whether I broadcast the service object or not, some inner collection (a HashMap) disappears (is not serialized), as if it were tagged as "transient" (but it is neither transient nor static):
JavaRDD<InputOjbect> listeInputsRDD = sprkCtx.parallelize(listeInputs, 10);
JavaRDD<OutputObject> listeOutputsRDD = listeInputsRDD.map(new Function<InputOjbect, OutputObject>() {
private static final long serialVersionUID = 1L;
public OutputObject call(InputOjbect input) throws TarificationXmlException { // Exception
MyOutput output = service.evaluate(input);
return (new OutputObject(output));
}
});
The same result occurs if I broadcast the service:
final Broadcast<MyService> broadcastedService = sprkCtx.broadcast(service);
JavaRDD<InputOjbect> listeInputsRDD = sprkCtx.parallelize(listeInputs, 10);
JavaRDD<OutputObject> listeOutputsRDD = listeInputsRDD.map(new Function<InputOjbect, OutputObject>() {
private static final long serialVersionUID = 1L;
public OutputObject call(InputOjbect input) throws TarificationXmlException { // Exception
MyOutput output = broadcastedService.getValue().evaluate(input);
return (new OutputObject(output));
}
});
If I launch this same Spark code in local mode instead of yarn cluster mode, it works perfectly.
So my question is: what is the difference between Spark serialization and Java serialization? (I'm not using Kryo or any customized serialization.)
EDIT: when I try the Kryo serializer (without explicitly registering any class), I have the same problem.

OK, I've figured it out thanks to one of our experienced data analysts.
So, what was this mystery about ?
It was NOT about serialization (Java or Kryo).
It was NOT about some pre-treatment or post-treatment Spark would do before/after serialization.
It was NOT about the HashMap field, which is fully serializable (this one is obvious if you read the first example I gave, but not for everyone ;)
So...
The whole problem was about this :
"if I launch this same Spark code in local mode instead of yarn cluster
mode, it works perfectly."
In "yarn cluster" mode the collection was unable to be initialized, cause it was launched on a random node and couldn't access to, the initial reference datas on disk. In local mode, there was a clear exception when the initial datas where not found on disk, but in cluster mode it was fully silent and it looked like the problem was about serialization.
Using "yarn client" mode solved this for us.

Related

ClassCastException depending on which gradle module holds the code?

I have a gradle project with two modules, library and app. library is a java-library that can be published as a Maven artifact. It contains some util methods that are used by app, hence app depends on library.
Now, I'd like to add two util methods for serialization as shown below. The behavior differs depending on where the util methods are placed and whether or not the serialized string of MyClass (which is also located in app) is written to a csv file in between (serialize -> export -> import -> deserialize).
Case 1 / No Problem: As long as my util methods are part of app, I can serialize and deserialize MyClass without any problems, even if I export the serialized string to a csv-file in between.
Case 2 / No Problem: When my util methods are part of library, serialization and deserialization of MyClass still work, as long as the serialized string is not written to a csv-file in between.
Case 3 / PROBLEM: If my utils are placed in library and the serialized string is written to a csv-file in between, then I get a java.lang.ClassCastException: Cannot cast my.package.MyClass to my.package.MyClass at java.base/java.lang.Class.cast(Class.java:3605).
When I run the debugger up until ois.readObject(), I can even see the properly deserialized object. However, it is not possible to cast it correctly.
Could anyone help me to find the reason for the situation described above? Any ideas why this error could occur? Any ideas why it might not be allowed to have my util methods in library?
@Nonnull
public static <T> String serialize(T object) throws IllegalStateException {
try (
final ByteArrayOutputStream bos = new ByteArrayOutputStream();
final ObjectOutputStream oos = new ObjectOutputStream(bos)
) {
oos.writeObject(object);
return Base64.getEncoder().encodeToString(bos.toByteArray());
} catch (final IOException e) {
throw new IllegalStateException("serialization failed", e);
}
}
@Nonnull
public static <T> T deserialize(String objectAsString, Class<T> tClass) throws IllegalStateException {
final byte[] data = Base64.getDecoder().decode(objectAsString);
try (
final ByteArrayInputStream bis = new ByteArrayInputStream(data);
final ObjectInputStream ois = new ObjectInputStream(bis)
) {
return tClass.cast(ois.readObject());
} catch (Exception e) {
throw new IllegalStateException("deserialization failed", e);
}
}
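For illustration, usage of these helpers might look like the following sketch; SerializationUtils is only an assumed name for the class holding the two methods, and MyClass stands for any Serializable type:
// Hypothetical usage; class and variable names are illustrative.
MyClass original = new MyClass();
String encoded = SerializationUtils.serialize(original);
MyClass restored = SerializationUtils.deserialize(encoded, MyClass.class);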

Apache Flink Kafka Stream Processing based on conditions

I am building a wrapper library using apache-flink in which I am listening to (consuming from) multiple topics, and I have a set of applications that want to process the messages from those topics.
Example :
I have 10 applications app1, app2, app3 ... app10 (each of them is a Java library that is part of the same on-prem project, i.e., all 10 jars are part of the same .war file),
out of which only 5 are supposed to consume the messages coming to the consumer group. I am able to do the filtering for those 5 apps with the help of the filter function.
The challenge is in the strStream.process(executionServiceInterface) function, where app1 provides an implementation class for ExceucionServiceInterface as ExecutionServiceApp1Impl and similarly app2 provides ExecutionServiceApp2Impl.
When there are multiple implementations available, Spring wants us to provide a @Qualifier annotation, or @Primary has to be marked on one of the implementations (ExecutionServiceApp1Impl, ExecutionServiceApp2Impl).
But I don't really want to do this, as I am building a generic wrapper library that should support any number of such applications (app1, app2, etc.), and all of them should be able to provide their own implementation logic (ExecutionServiceApp1Impl, ExecutionServiceApp2Impl).
Can someone help me here? How do I solve this?
Below is the code for reference.
@Autowired
private ExceucionServiceInterface executionServiceInterface;
public void init(){
StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
FlinkKafkaConsumer011<String> consumer = createStringConsumer(topicList, kafkaAddress, kafkaGroup);
if (consumer != null) {
DataStream<String> strStream = environment.addSource(consumer);
strStream.filter(filterFunctionInterface).process(executionServiceInterface);
}
}
public FlinkKafkaConsumer011<String> createStringConsumer(List<String> listOfTopics, String kafkaAddress, String kafkaGroup) throws Exception {
FlinkKafkaConsumer011<String> myConsumer = null;
try {
Properties props = new Properties();
props.setProperty("bootstrap.servers", kafkaAddress);
props.setProperty("group.id", kafkaGroup);
myConsumer = new FlinkKafkaConsumer011<>(listOfTopics, new SimpleStringSchema(), props);
} catch(Exception e) {
throw e;
}
return myConsumer;
}
Many thanks in advance!!
I solved this problem by using reflection; below is the code that solved the issue.
Note: this requires me to know the list of fully qualified class names and method names along with their parameters.
@Component
public class SampleJobExecutor extends ProcessFunction<String, String> {
@Autowired
MyAppProperties myAppProperties;
@Override
public void processElement(String inputMessage, ProcessFunction<String, String>.Context context,
Collector<String> collector) throws Exception {
String className = null;
String methodName = null;
try {
Map<String, List<String>> map = myAppProperties.getMapOfImplementors();
JSONObject json = new JSONObject(inputMessage);
if (json != null && json.has("appName")) {
className = map.get(json.getString("appName")).get(0);
methodName = map.get(json.getString("appName")).get(1);
}
Class<?> forName = Class.forName(className);
Object job = forName.newInstance();
Method method = forName.getDeclaredMethod(methodName, String.class);
method.invoke(job , inputMessage);
} catch (Exception e) {
e.printStackTrace();
}
}
}
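For reference, this assumes getMapOfImplementors() maps each appName to a two-element list: the fully qualified class name first, then the method name. A hypothetical mapping (package and method names are made up for illustration) could look like:
// Illustrative shape of the map returned by myAppProperties.getMapOfImplementors():
// key = the "appName" field of the incoming JSON message,
// value = [fully qualified class name, method name]
Map<String, List<String>> mapOfImplementors = new HashMap<>();
mapOfImplementors.put("app1", Arrays.asList("com.example.app1.ExecutionServiceApp1Impl", "execute"));
mapOfImplementors.put("app2", Arrays.asList("com.example.app2.ExecutionServiceApp2Impl", "execute"));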

How can I recreate a chained OutputStream with only filename modified

I have an OutputStream which can be initialized as a chain of OutputStreams. There could be any level of chaining. The only thing guaranteed is that at the end of the chain is a FileOutputStream.
I need to recreate this chained OutputStream with a modified filename in the FileOutputStream. This would have been possible if the out variable (which stores the underlying chained OutputStream) were accessible, as shown below.
public OutputStream recreateChainedOutputStream(OutputStream os) throws IOException {
if(os instanceof FileOutputStream) {
return new FileOutputStream("somemodified.filename");
} else if (os instanceof FilterOutputStream) {
return recreateChainedOutputStream(os.out);
}
}
Is there any other way of achieving the same?
You can use reflection to access the os.out field of the FilterOutputStream; this has, however, some drawbacks:
If the other OutputStream is also a kind of RolloverOutputStream, you can have a hard time reconstructing it,
If the other OutputStream has custom settings, like a GZip compression parameter, you cannot reliably read these,
If there is a
A quick and dirty implementation of recreateChainedOutputStream() might be:
// Grab the protected "out" field of FilterOutputStream once, reflectively.
private static final Field OUT_FIELD;
static {
try {
OUT_FIELD = FilterOutputStream.class.getDeclaredField("out");
OUT_FIELD.setAccessible(true);
} catch (Exception e) {
throw new RuntimeException(e);
}
}
public OutputStream recreateChainedOutputStream(OutputStream out) throws IOException {
if (out instanceof FilterOutputStream) {
try {
// Rebuild the same wrapper type around the recreated inner stream.
Class<?> c = out.getClass();
Constructor<?> con = c.getConstructor(OutputStream.class);
return (OutputStream) con.newInstance(recreateChainedOutputStream((OutputStream) OUT_FIELD.get(out)));
} catch (ReflectiveOperationException e) {
throw new IOException(e);
}
} else {
// Other output streams, e.g. the FileOutputStream at the end of the chain...
return new FileOutputStream("somemodified.filename");
}
}
While this may be OK in your current application, it is a big no-no in the production world because of the large number of different kinds of OutputStreams your application may receive.
A better way to solve this would be a kind of Function<String, OutputStream> that works as a factory to create OutputStreams for the named file. This way the external API keeps its control over the OutputStreams while your API can address multiple file names. An example of this would be:
public class MyApi {
private final Function<String, OutputStream> fileProvider;
private OutputStream current;
public MyApi(Function<String, OutputStream> fileProvider, String defaultFile) throws IOException {
this.fileProvider = fileProvider;
selectNewOutputFile(defaultFile);
}
public void selectNewOutputFile(String name) throws IOException {
OutputStream previous = this.current;
this.current = fileProvider.apply(name);
// Close the previous stream only after the new one has been created.
if (previous != null) previous.close();
}
}
This can then be used in other applications as:
MyApi api = new MyApi(name->new FileOutputStream(name));
for simple FileOutputStreams, or as:
MyApi api = new MyApi(name->
new GZIPOutputStream(
new CipherOutputStream(
new CheckedOutputStream(
new FileOutputStream(name),
new CRC32()),
chipper),
1024,
true)
);
for a file stream that is checksummed with new CRC32(), encrypted with chipper, and gzipped with a 1024-byte buffer in sync-flush mode.

disable akka.jvm-exit-on-fatal-error for actorsystem in java

I am using the Akka actor system for multithreading. It works fine in normal use cases. However, Akka closes the JVM on a fatal error. Please let me know how I can configure Akka to disable "akka.jvm-exit-on-fatal-error" in Java. Below is the code.
public class QueueListener implements MessageListener {
private String _queueName=null;
public static boolean isActorinit=false;
public static ActorSystem system=null;
private ActorRef myActor;
public QueueListener(String actorId, String qName){
this._queueName = qName;
if(!isActorinit){
system=ActorSystem.create(actorId);
isActorinit=true;
}
myActor=system.actorOf( Props.create(MessageExecutor.class, qName),qName+"id");
}
/*
* (non-Javadoc)
* @see javax.jms.MessageListener#onMessage(javax.jms.Message)
*/
@Override
public void onMessage(Message msg) {
executeRequest(msg);
}
/** This method will process the message fetch by the listener.
*
* #param msg - javax.jms.Messages parameter get queue message
*/
private void executeRequest(Message msg){
String requestData=null;
try {
if(msg instanceof TextMessage){
TextMessage textMessage= (TextMessage) msg;
requestData = textMessage.getText().toString();
}else if(msg instanceof ObjectMessage){
ObjectMessage objMsg = (ObjectMessage) msg;
requestData = objMsg.getObject().toString();
}
myActor.tell(requestData, ActorRef.noSender());
} catch (JMSException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
}
}
Create an application.conf file in your project (src/main/resources for example) and add the following content:
akka {
jvm-exit-on-fatal-error = false
}
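With this file on the classpath, the default config loading that Akka performs when no explicit Config is passed picks it up automatically, so creating the ActorSystem needs no extra code; a minimal sketch (the system name is just an example):
// application.conf on the classpath is merged in by Akka's default ConfigFactory.load().
ActorSystem system = ActorSystem.create("myActorSystem");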
No need to create a new config file if you already have one, of course; in that case just add the new entry:
jvm-exit-on-fatal-error = false
Be careful. Letting the JVM run after fatal errors like OutOfMemory is normally not a good idea and leads to serious problems.
See here for the configuration details - you can provide a separate config file, but for the small number of changes I was making to the akka config (and also given that I was already using several Spring config files) I found it easier to construct and load the configuration programmatically. Your config would look something like:
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;
StringBuilder configBuilder = new StringBuilder();
configBuilder.append("{\"akka\" : { \"jvm-exit-on-fatal-error\" : \"off\"}}");
Config mergedConfig = ConfigFactory.load(ConfigFactory.parseString(configBuilder.toString()).withFallback(ConfigFactory.load()));
system = ActorSystem.create(actorId, mergedConfig);
This is loading the default Config, overriding its jvm-exit-on-fatal-error entry, and using this new Config as the config for the ActorSystem. I haven't tested this particular config, so there is a 50% chance that you'll get some sort of JSON parsing error when you try to use it; for comparison, the actual config I use which DOES parse correctly (but which doesn't override jvm-exit-on-fatal-error) is
private ActorSystem createActorSystem(int batchManagerCount) {
int maxActorCount = batchManagerCount * 5 + 1;
StringBuilder configBuilder = new StringBuilder();
configBuilder.append("{\"akka\" : { \"actor\" : { \"default-dispatcher\" : {");
configBuilder.append("\"type\" : \"Dispatcher\",");
configBuilder.append("\"executor\" : \"default-executor\",");
configBuilder.append("\"throughput\" : \"1\",");
configBuilder.append("\"default-executor\" : { \"fallback\" : \"thread-pool-executor\" },");
StringBuilder executorConfigBuilder = new StringBuilder();
executorConfigBuilder.append("\"thread-pool-executor\" : {");
executorConfigBuilder.append("\"keep-alive-time\" : \"60s\",");
executorConfigBuilder.append(String.format("\"core-pool-size-min\" : \"%d\",", maxActorCount));
executorConfigBuilder.append(String.format("\"core-pool-size-max\" : \"%d\",", maxActorCount));
executorConfigBuilder.append(String.format("\"max-pool-size-min\" : \"%d\",", maxActorCount));
executorConfigBuilder.append(String.format("\"max-pool-size-max\" : \"%d\",", maxActorCount));
executorConfigBuilder.append("\"task-queue-size\" : \"-1\",");
executorConfigBuilder.append("\"task-queue-type\" : \"linked\",");
executorConfigBuilder.append("\"allow-core-timeout\" : \"on\"");
executorConfigBuilder.append("}");
configBuilder.append(executorConfigBuilder.toString());
configBuilder.append("}}}}");
Config mergedConfig = ConfigFactory.load(ConfigFactory.parseString(configBuilder.toString()).withFallback(ConfigFactory.load()));
return ActorSystem.create(String.format("PerformanceAsync%s", systemId), mergedConfig);
}
As you can see I was primarily interested in tweaking the dispatcher.

Updating Dropwizard config at runtime

Is it possible to have my app update the config settings at runtime? I can easily expose the settings I want in my UI, but is there a way to allow the user to update settings and make them permanent, i.e. save them to the config.yaml file? The only way I can see is to update the file by hand and then restart the server, which seems a bit limiting.
Yes. It is possible to reload the service classes at runtime.
Dropwizard by itself does not have a way to reload the app, but Jersey does.
Jersey uses a container object internally to maintain the running application. Dropwizard uses the ServletContainer class of Jersey to run the application.
How to reload the app without restarting it -
Get a handle to the container used internally by Jersey.
You can do this by registering an AbstractContainerLifecycleListener in the Dropwizard Environment before starting the app and implementing its onStartup method as below -
In your main method where you start the app -
//getting the container instance
environment.jersey().register(new AbstractContainerLifecycleListener() {
@Override
public void onStartup(Container container) {
//initializing container - which will be used to reload the app
_container = container;
}
});
Add a method to your app to reload the app. It will take in a list of strings which are the names of the service classes you want to reload. This method will call the reload method of the container with a new custom DropwizardResourceConfig instance.
In your Application class
public static synchronized void reloadApp(List<String> reloadClasses) {
DropwizardResourceConfig dropwizardResourceConfig = new DropwizardResourceConfig();
for (String className : reloadClasses) {
try {
Class<?> serviceClass = Class.forName(className);
dropwizardResourceConfig.registerClasses(serviceClass);
System.out.printf(" + loaded class %s.\n", className);
} catch (ClassNotFoundException ex) {
System.out.printf(" ! class %s not found.\n", className);
}
}
_container.reload(dropwizardResourceConfig);
}
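As a hedged usage sketch, the reload can then be triggered from anywhere in the application; the class name below is only illustrative:
// Hypothetical call; pass the fully qualified names of the service classes to reload (uses java.util.Arrays).
reloadApp(Arrays.asList("com.example.resources.MyUpdatedResource"));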
For more details see the example documentation of jersey - jersey example for reload
Consider going through the code and documentation of the following files in Dropwizard/Jersey for a better understanding -
Container.java
ContainerLifeCycleListener.java
ServletContainer.java
AbstractContainerLifeCycleListener.java
DropWizardResourceConfig.java
ResourceConfig.java
No.
The YAML file is parsed at startup and given to the application as a Configuration object once and for all. I believe you can change the file after that, but it wouldn't affect your application until you restart it.
Possible follow up question: Can one restart the service programmatically?
AFAIK, no. I've researched and read the code somewhat for that but couldn't find a way to do that yet. If there is, I'd love to hear that :).
I made a task that reloads the main yaml file (it would be useful if something in the file changes). However, it does not reload the environment. From my research, Dropwizard uses a lot of final variables and it's quite hard to reload these on the fly without restarting the app.
class ReloadYAMLTask extends Task {
private String yamlFileName;
ReloadYAMLTask(String yamlFileName) {
super("reloadYaml");
this.yamlFileName = yamlFileName;
}
@Override
public void execute(ImmutableMultimap<String, String> parameters, PrintWriter output) throws Exception {
if (yamlFileName != null) {
ConfigurationFactoryFactory configurationFactoryFactory = new DefaultConfigurationFactoryFactory<ReportingServiceConfiguration>();
ValidatorFactory validatorFactory = Validation.buildDefaultValidatorFactory();
Validator validator = validatorFactory.getValidator();
ObjectMapper objectMapper = Jackson.newObjectMapper();
final ConfigurationFactory<ServiceConfiguration> configurationFactory = configurationFactoryFactory.create(ServiceConfiguration.class, validator, objectMapper, "dw");
File confFile = new File(yamlFileName);
configurationFactory.build(new File(confFile.toURI()));
}
}
}
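To make the task callable, it has to be registered with the admin environment; it can then be triggered through the admin port, just like the task shown further below. A minimal sketch (the yaml path and port are assumptions):
// In the Application's run() method:
environment.admin().addTask(new ReloadYAMLTask("config.yml"));
// Then trigger it against the admin connector (8081 by default):
// curl -X POST 'http://localhost:8081/tasks/reloadYaml'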
You can change the configuration in the YAML and read it while your application is running. This will not however restart the server or change any server configurations. You will be able to read any changed custom configurations and use them. For example, you can change the logging level at runtime or reload other custom settings.
My solution -
Define a custom server command. You should use this command to start your application instead of the "server" command.
ArgsServerCommand.java
public class ArgsServerCommand<WC extends WebConfiguration> extends EnvironmentCommand<WC> {
private static final Logger LOGGER = LoggerFactory.getLogger(ArgsServerCommand.class);
private final Class<WC> configurationClass;
private Namespace _namespace;
public static String COMMAND_NAME = "args-server";
public ArgsServerCommand(Application<WC> application) {
super(application, "args-server", "Runs the Dropwizard application as an HTTP server specific to my settings");
this.configurationClass = application.getConfigurationClass();
}
/*
* Since we don't subclass ServerCommand, we need a concrete reference to the configuration
* class.
*/
@Override
protected Class<WC> getConfigurationClass() {
return configurationClass;
}
public Namespace getNamespace() {
return _namespace;
}
@Override
protected void run(Environment environment, Namespace namespace, WC configuration) throws Exception {
_namespace = namespace;
final Server server = configuration.getServerFactory().build(environment);
try {
server.addLifeCycleListener(new LifeCycleListener());
cleanupAsynchronously();
server.start();
} catch (Exception e) {
LOGGER.error("Unable to start server, shutting down", e);
server.stop();
cleanup();
throw e;
}
}
private class LifeCycleListener extends AbstractLifeCycle.AbstractLifeCycleListener {
@Override
public void lifeCycleStopped(LifeCycle event) {
cleanup();
}
}
}
Method to reload in your Application -
private static String _ymlFilePath = null; // class variable
public static boolean reloadConfiguration() throws IOException, ConfigurationException {
boolean reloaded = false;
if (_ymlFilePath == null) {
List<Command> commands = _configurationBootstrap.getCommands();
for (Command command : commands) {
String commandName = command.getName();
if (commandName.equals(ArgsServerCommand.COMMAND_NAME)) {
Namespace namespace = ((ArgsServerCommand) command).getNamespace();
if (namespace != null) {
_ymlFilePath = namespace.getString("file");
}
}
}
}
ConfigurationFactoryFactory configurationFactoryFactory = _configurationBootstrap.getConfigurationFactoryFactory();
ValidatorFactory validatorFactory = _configurationBootstrap.getValidatorFactory();
Validator validator = validatorFactory.getValidator();
ObjectMapper objectMapper = _configurationBootstrap.getObjectMapper();
ConfigurationSourceProvider provider = _configurationBootstrap.getConfigurationSourceProvider();
final ConfigurationFactory<CustomWebConfiguration> configurationFactory = configurationFactoryFactory.create(CustomWebConfiguration.class, validator, objectMapper, "dw");
if (_ymlFilePath != null) {
// Refresh logging level.
CustomWebConfiguration webConfiguration = configurationFactory.build(provider, _ymlFilePath);
LoggingFactory loggingFactory = webConfiguration.getLoggingFactory();
loggingFactory.configure(_configurationBootstrap.getMetricRegistry(), _configurationBootstrap.getApplication().getName());
// Get my defined custom settings
CustomSettings customSettings = webConfiguration.getCustomSettings();
reloaded = true;
}
return reloaded;
}
Although this feature isn't supported out of the box by dropwizard, you're able to accomplish this fairly easily with the tools they give you.
Before I get started, note that this isn't a complete solution for the question asked as it doesn't persist the updated config values to the config.yml. However, this would be easy enough to implement yourself simply by writing to the config file from the application. If anyone would like to write this implementation feel free to open a PR on the example project I've linked below.
Code
Start off with a minimal config:
config.yml
myConfigValue: "hello"
And its corresponding configuration file:
ExampleConfiguration.java
public class ExampleConfiguration extends Configuration {
private String myConfigValue;
public String getMyConfigValue() {
return myConfigValue;
}
public void setMyConfigValue(String value) {
myConfigValue = value;
}
}
Then create a task which updates the config:
UpdateConfigTask.java
public class UpdateConfigTask extends Task {
ExampleConfiguration config;
public UpdateConfigTask(ExampleConfiguration config) {
super("updateconfig");
this.config = config;
}
@Override
public void execute(Map<String, List<String>> parameters, PrintWriter output) {
config.setMyConfigValue("goodbye");
}
}
Also for demonstration purposes, create a resource which allows you to get the config value:
ConfigResource.java
#Path("/config")
public class ConfigResource {
private final ExampleConfiguration config;
public ConfigResource(ExampleConfiguration config) {
this.config = config;
}
@GET
public Response handleGet() {
return Response.ok().entity(config.getMyConfigValue()).build();
}
}
Finally wire everything up in your application:
ExampleApplication.java (excerpt)
environment.jersey().register(new ConfigResource(configuration));
environment.admin().addTask(new UpdateConfigTask(configuration));
Usage
Start up the application then run:
$ curl 'http://localhost:8080/config'
hello
$ curl -X POST 'http://localhost:8081/tasks/updateconfig'
$ curl 'http://localhost:8080/config'
goodbye
How it works
This works simply by passing the same reference to the constructor of ConfigResource.java and UpdateConfigTask.java. If you aren't familiar with the concept see here:
Is Java "pass-by-reference" or "pass-by-value"?
The linked classes above are to a project I've created which demonstrates this as a complete solution. Here's a link to the project:
scottg489/dropwizard-runtime-config-example
Footnote: I haven't verified this works with the built-in configuration. However, the Dropwizard Configuration class which you need to extend for your own configuration does have various "setters" for internal configuration, but it may not be safe to update those outside of run().
Disclaimer: The project I've linked here was created by me.
