I'm trying to run some external shared library functions with Cloud Dataflow, similar to what is described here: Running external libraries with Cloud Dataflow for grid-computing workloads.
I have a couple of questions regarding this approach.
There is the following passage in the article mentioned earlier:
In the case of making a call to an external library, you need to do this step manually for that library. The approach is to:
Store the code (along with versioning information) in Cloud Storage, this removes any concerns about throughput if running 10,000s of cores in the flow.
In the @beginBundle [sic] method, create a synchronized block to check if the file is available on the local resource. If not, use the Cloud Storage client library to pull the file across.
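As I read it, that second step would amount to something like the following (my own sketch; the bucket, object name and local path are made up):

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.io.File;
import java.nio.file.Paths;
import org.apache.beam.sdk.transforms.DoFn;

class GcsBackedLookupFn extends DoFn<String, String> {

  private static final Object LOCK = new Object();
  private static final String LOCAL_PATH = "/tmp/liblookup.so";

  @StartBundle
  public void startBundle() {
    synchronized (LOCK) {
      // only download the library once per worker
      if (!new File(LOCAL_PATH).exists()) {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        Blob blob = storage.get(BlobId.of("my-bucket", "libs/v1/liblookup.so"));
        blob.downloadTo(Paths.get(LOCAL_PATH));
      }
    }
  }
}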
However, with my Java package, I simply put the library .so file into the src/main/resource/linux-x86-64 directory and call the library functions the following way (stripped to a bare minimum for brevity):
import com.sun.jna.Library;
import com.sun.jna.Native;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
public class HostLookupPipeline {

  public interface LookupLibrary extends Library {
    String Lookup(String domain);
  }

  static class LookupFn extends DoFn<String, KV<String, String>> {

    private static LookupLibrary lookup;

    @StartBundle
    public void startBundle() {
      // src/main/resource/linux-x86-64/liblookup.so
      lookup = Native.loadLibrary("lookup", LookupLibrary.class);
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
      String domain = c.element();
      String results = lookup.Lookup(domain);
      if (results != null) {
        c.output(KV.of(domain, results));
      }
    }
  }
}
Is such an approach considered acceptable, or does extracting the .so file from the JAR perform poorly compared to downloading it from GCS? If not, where should I put the file after downloading it to make it accessible to the Cloud Dataflow worker?
I've noticed that the transformation calling the external library function works rather slowly (about 90 elements/s) while utilizing 15 Cloud Dataflow workers (autoscaling, default max workers). If my rough calculations are correct, it should be twice as fast. I suppose that's because I call the external library function for every element.
Are there any best practices for improving external library performance when running with Java?
The guidance in that blog post is slightly incorrect - a much better place to put the initialization code is the @Setup method, not @StartBundle.
@Setup is called to initialize an instance of your DoFn in every thread on every worker that will be executing it. It is the intended place for heavy setup code. Its counterpart is @Teardown.
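Applied to the code in the question, this just means moving the library load from @StartBundle to @Setup; a minimal sketch:

static class LookupFn extends DoFn<String, KV<String, String>> {

  private static LookupLibrary lookup;

  @Setup
  public void setup() {
    // loaded once per DoFn instance instead of once per bundle
    lookup = Native.loadLibrary("lookup", LookupLibrary.class);
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    String domain = c.element();
    String results = lookup.Lookup(domain);
    if (results != null) {
      c.output(KV.of(domain, results));
    }
  }
}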
@StartBundle and @FinishBundle are of much finer granularity: per bundle, which is quite a low-level concept. I believe the only common legitimate use for them is writing batches of elements to an external service: typically you would initialize the next batch in @StartBundle and flush it in @FinishBundle.
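For illustration, such a batching DoFn might look roughly like this (a sketch; the external-service client and its writeAll method are hypothetical):

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;

class BatchingWriteFn extends DoFn<String, Void> {

  private transient List<String> batch;

  @StartBundle
  public void startBundle() {
    batch = new ArrayList<>(); // start a fresh batch for this bundle
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    batch.add(c.element());
  }

  @FinishBundle
  public void finishBundle() {
    // flush the accumulated batch to the external service (hypothetical client)
    // externalServiceClient.writeAll(batch);
    batch = null;
  }
}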
Generally, to debug the performance, try adding logging to your DoFn's methods and see how many milliseconds the calls take and how that compares against your expectations. If you get stuck, include a Dataflow job ID in the question and an engineer will take a look at it.
Related
I am developing a web application using Java and Spring Boot on the AWS Lambda service.
I am designing it to have one database-service. This will be a collection of Entity (table) and JPA repository classes, so if I need to make any database schema change, I only have to make it in this one service.
The other services, which will be exposed through an API Gateway, will use this database-service as a Lambda Layer.
parent-project
|
|---database-service
|
|---API-service1
|
|---API-service2
...
The problem is that I need to create the tables before any of the Lambda services are deployed, so that these API services can use them. One way to solve this is to deploy the database-service as a Lambda function and invoke it, which will call a method like the one below to create all the tables.
@SpringBootApplication
public class DatabaseServiceApplication implements CommandLineRunner {

  private DynamoDBMapper dynamoDBMapper;
  private final AmazonDynamoDB amazonDynamoDB;

  public DatabaseServiceApplication(AmazonDynamoDB amazonDynamoDB) {
    this.amazonDynamoDB = amazonDynamoDB;
  }

  public static void main(String[] args) {
    SpringApplication.run(DatabaseServiceApplication.class, args);
  }

  @Override
  public void run(String... strings) {
    dynamoDBMapper = new DynamoDBMapper(amazonDynamoDB);
    CreateTableRequest tableRequest = dynamoDBMapper
        .generateCreateTableRequest(Association.class);
    tableRequest.setProvisionedThroughput(
        new ProvisionedThroughput(1L, 1L));
    TableUtils.createTableIfNotExists(amazonDynamoDB, tableRequest);
  }
}
Or use a script to create the tables. I am not sure which is the better option, or whether there is a better option altogether.
Can anyone who has faced this problem before suggest how they fixed it?
To me the best way to do this is on Lambda cold start. Your code needs to be smart enough to not care if the DB is already correct. Based on the code you're showing I would do something on the order of:
public class LambdaExample implements RequestStreamHandler {

  // fields not shown in the original snippet; defaultClient() is one possible way to obtain the client
  private final AmazonDynamoDB amazonDynamoDB = AmazonDynamoDBClientBuilder.defaultClient();
  private final DynamoDBMapper dynamoDBMapper;

  // only called on cold start
  public LambdaExample() {
    dynamoDBMapper = new DynamoDBMapper(amazonDynamoDB);
    CreateTableRequest tableRequest = dynamoDBMapper
        .generateCreateTableRequest(Association.class);
    tableRequest.setProvisionedThroughput(
        new ProvisionedThroughput(1L, 1L));
    TableUtils.createTableIfNotExists(amazonDynamoDB, tableRequest);
  }

  public void handleRequest(InputStream inputStream, OutputStream outputStream, Context context) {
    // handle request. this lambda type requires reading the inputStream
    // yourself but use whatever you normally have here.
  }
}
If you're using a traditional relational database, you could use Flyway instead. It too knows if a DB has already been updated.
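For example, the same cold-start trick with Flyway might look roughly like this (a sketch assuming Flyway 5.2+ and placeholder connection details; migration scripts would sit under db/migration on the classpath by default):

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestStreamHandler;
import java.io.InputStream;
import java.io.OutputStream;
import org.flywaydb.core.Flyway;

public class RelationalLambdaExample implements RequestStreamHandler {

  // only called on cold start; Flyway skips migrations that have already run
  public RelationalLambdaExample() {
    Flyway flyway = Flyway.configure()
        .dataSource("jdbc:postgresql://example-host:5432/mydb", "user", "password")
        .load();
    flyway.migrate();
  }

  public void handleRequest(InputStream inputStream, OutputStream outputStream, Context context) {
    // normal request handling
  }
}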
Note that if you have thousands of Lambdas they will all call this, slowing the cold start of every single one of them. That is why @MarkB is suggesting a process to externalize the DB creation, as really only the very first Lambda kicked off does anything useful. After that you're wasting a bit of time/money with every new Lambda.
Since you are deploying via Terraform, the correct way to do this is to have Terraform create the DynamoDB tables as well. You would configure your aws_lambda_function resources in Terraform with a depends_on property referencing the aws_dynamodb_table resource, so that Terraform ensures the table is created before the Lambda functions.
Can you please answer the questions below?
1) Are you deploying your Spring Boot application in Lambda?
If yes, that doesn't sound like a good use of Spring Boot; a Spring Boot application should be hosted on an EC2/ECS instance to be up and running (24/7).
Think of Lambda as a function that runs to handle a simple task. To achieve that, you can write a simple Java application and deploy the jar to the Lambda function.
2) CloudFormation, Terraform and other tools are used to create the infrastructure; you usually run the infrastructure job first and the deployment after it.
Here's a link of a terraform structure I built for a personal project.
https://github.com/saifmasadeh/terraform-project-structure
We have developed some Lambda functions and deployed them on AWS, and they are working fine.
However, the client is now planning for Azure.
They may even switch back to AWS or any other vendor in the future.
We have a separate Maven project for the AWS-related stuff,
so our business logic and classes remain the same.
What I have done is create a Maven project and add the individual Lambda functions to this project as dependencies.
Then I made a factory class which gets the implementation based on a property, AZURE or AWS (using Class.forName and reflection).
So I can switch to Azure by just removing the AWS Maven dependency and adding the Azure dependency (a stripped-down sketch of the factory follows below).
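For illustration, a stripped-down version of such a factory might look roughly like this (the interface, package and property names here are simplified placeholders, not the real ones):

// hypothetical common interface implemented by both the AWS and Azure modules
interface BusinessProcessor {
  String process(String payload);
}

public final class CloudFactory {

  // "AWS" or "AZURE", e.g. supplied as a system property or read from configuration
  private static final String PROVIDER = System.getProperty("cloud.provider", "AWS");

  public static BusinessProcessor getProcessor() throws Exception {
    // the implementation class is only present when the matching Maven dependency is on the classpath
    String implClass = "AWS".equals(PROVIDER)
        ? "com.example.aws.AwsBusinessProcessor"
        : "com.example.azure.AzureBusinessProcessor";
    return (BusinessProcessor) Class.forName(implClass)
        .getDeclaredConstructor()
        .newInstance();
  }
}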
According to the picture, my plan was to create new AzureUtils and AzureWrapper projects and directly use the Azure cloud by switching the cloud in the CloudFactory, which is present in the generic utils. That would hopefully even work (not tested); AWS is working that way anyhow.
Now the problem is that the client does not want everything packed into one jar, i.e. no to all lambdas in a single jar. They want some layer where the switching takes place.
Now, which design pattern would be useful, and what would be the approach?
Currently my Lambda function looks like the one below:
public class Hello implements RequestHandler<S3Event, String> {

  public String handleRequest(S3Event s3event, Context context) {
    // .................
    // call to business processor as in the diagram
  }
}
And the Azure function looks somewhat like a simple class with annotations:
public class Function {

  @FunctionName("hello")
  public HttpResponseMessage run(
      @HttpTrigger(name = "req", methods = { HttpMethod.GET, HttpMethod.POST }, authLevel = AuthorizationLevel.ANONYMOUS) HttpRequestMessage<Optional<String>> request,
      final ExecutionContext context) {
    context.getLogger().info("Java HTTP trigger processed a request.");
    // Parse query parameter
    String query = request.getQueryParameters().get("name");
    String name = request.getBody().orElse(query);
    if (name != null) {
      // call to business processor as in the diagram
    }
  }
}
After all this I have only two questions.
I would like to know, first, whether the design in the diagram is the right thing to do.
Second, my client is asking for a wrapper, something magical, which should handle both types of cloud implementation. Is this even possible?
If it is possible, please guide me in the right direction.
Any help is greatly appreciated.
About your second question, how to handle both types of cloud: please check the third-party solution serverless.com. It's a company that has created its own serverless wrapper, so that you can be free of vendor lock-in.
I have essentially the same question as here but am hoping to get a less vague, more informative answer.
I'm looking for a way to configure DropWizard programmatically, or at the very least, to be able to tweak configs at runtime. Specifically I have a use case where I'd like to configure metrics in the YAML file to be published with a frequency of, say, 2 minutes. This would be the "normal" default. However, under certain circumstances, I may want to speed that up to, say, every 10 seconds, and then throttle it back to the normal/default.
How can I do this, and not just for the metrics.frequency property, but for any config that might be present inside the YAML config file?
Dropwizard reads the YAML config file and configures all the components only once, on startup. Neither the YAML file nor the Configuration object is ever used again. That means there is no direct way to reconfigure at runtime.
It also doesn't provide special interfaces/delegates where you can manipulate the components. However, you can usually access the components' objects (if not, you can always send a pull request) and configure them manually as you see fit. You may need to read the source code a bit, but it's usually easy to navigate.
In the case of metrics.frequency, you can see that the MetricsFactory class creates ScheduledReporterManager objects per metric type using the frequency setting, and it doesn't look like you can change them at runtime. But you can probably work around it somehow, or even better, modify the code and send a pull request to the Dropwizard community.
Although this feature isn't supported out of the box by Dropwizard, you're able to accomplish this fairly easily with the tools it gives you. Note that the solution below definitely works on config values you've provided yourself, but it may not work for built-in configuration values.
Also note that this doesn't persist the updated config values to the config.yml. However, that would be easy enough to implement yourself simply by writing to the config file from the application. If anyone would like to write this implementation, feel free to open a PR on the example project I've linked below.
Code
Start off with a minimal config:
config.yml
myConfigValue: "hello"
And its corresponding configuration class:
ExampleConfiguration.java
public class ExampleConfiguration extends Configuration {

  private String myConfigValue;

  public String getMyConfigValue() {
    return myConfigValue;
  }

  public void setMyConfigValue(String value) {
    myConfigValue = value;
  }
}
Then create a task which updates the config:
UpdateConfigTask.java
public class UpdateConfigTask extends Task {

  ExampleConfiguration config;

  public UpdateConfigTask(ExampleConfiguration config) {
    super("updateconfig");
    this.config = config;
  }

  @Override
  public void execute(Map<String, List<String>> parameters, PrintWriter output) {
    config.setMyConfigValue("goodbye");
  }
}
Also for demonstration purposes, create a resource which allows you to get the config value:
ConfigResource.java
#Path("/config")
public class ConfigResource {
private final ExampleConfiguration config;
public ConfigResource(ExampleConfiguration config) {
this.config = config;
}
#GET
public Response handleGet() {
return Response.ok().entity(config.getMyConfigValue()).build();
}
}
Finally wire everything up in your application:
ExampleApplication.java (excerpt)
environment.jersey().register(new ConfigResource(configuration));
environment.admin().addTask(new UpdateConfigTask(configuration));
Usage
Start up the application then run:
$ curl 'http://localhost:8080/config'
hello
$ curl -X POST 'http://localhost:8081/tasks/updateconfig'
$ curl 'http://localhost:8080/config'
goodbye
How it works
This works simply by passing the same reference to the constructor of ConfigResource.java and UpdateConfigTask.java. If you aren't familiar with the concept see here:
Is Java "pass-by-reference" or "pass-by-value"?
The linked classes above are to a project I've created which demonstrates this as a complete solution. Here's a link to the project:
scottg489/dropwizard-runtime-config-example
Footnote: I haven't verified that this works with the built-in configuration. However, the Dropwizard Configuration class which you need to extend for your own configuration does have various "setters" for internal configuration, but it may not be safe to update those outside of run().
Disclaimer: The project I've linked here was created by me.
I solved this with bytecode manipulation via Javassist.
In my case, I wanted to change the "influx" reporter,
and modifyInfluxDbReporterFactory should be run BEFORE Dropwizard starts:
private static void modifyInfluxDbReporterFactory() throws Exception {
  ClassPool cp = ClassPool.getDefault();
  // do NOT use InfluxDbReporterFactory.class.getName() as this will force the class into the classloader
  CtClass cc = cp.get("com.izettle.metrics.dw.InfluxDbReporterFactory");
  CtMethod m = cc.getDeclaredMethod("setTags");
  m.insertAfter(
      "if (tags.get(\"cloud\") != null) tags.put(\"cloud_host\", tags.get(\"cloud\") + \"_\" + host);"
          + "tags.put(\"app\", \"sam\");");
  cc.toClass();
}
I am using the PostContextCreate part of the life cycle in an e4 RCP application to create the back-end "business logic" part of my application. I then inject it into the context using an IEclipseContext. I now have a requirement to persist some business logic configuration options between executions of my application. I have some questions:
It looks like properties (e.g. accessible from MContext) would be really useful here; a straightforward Map<String,String> sounds ideal for my simple requirements, but how can I get them in PostContextCreate?
Will my properties persist if my application is being run with clearPersistedState set to true? (I'm guessing not).
If I turn clearPersistedState off then will it try and persist the other stuff that I injected into the context?
Or am I going about this all wrong? Any suggestions would be welcome. I may just give up and read/write my own properties file.
I think the Map returned by MApplicationElement.getPersistedState() is intended to be used for persistent data. This will be cleared by -clearPersistedState.
The PostContextCreate method of the life cycle is run quite early in the startup and not everything is available at this point. So you might have to wait for the app startup complete event (UIEvents.UILifeCycle.APP_STARTUP_COMPLETE) before accessing the persisted state data.
You can always use the traditional Platform.getStateLocation(bundle) to get a location in the workspace .metadata to store arbitrary data. This is not touched by clearPersistedState.
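A rough sketch of that last option, storing a plain properties file in the bundle's state location (the file name and keys are illustrative):

import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;
import org.eclipse.core.runtime.IPath;
import org.eclipse.core.runtime.Platform;
import org.osgi.framework.Bundle;
import org.osgi.framework.FrameworkUtil;

public class SettingsStore {

  private Path settingsFile() {
    Bundle bundle = FrameworkUtil.getBundle(getClass());
    IPath stateLocation = Platform.getStateLocation(bundle); // under .metadata/.plugins/<bundle id>
    return stateLocation.append("settings.properties").toFile().toPath();
  }

  public Properties load() throws Exception {
    Properties props = new Properties();
    Path file = settingsFile();
    if (Files.exists(file)) {
      try (InputStream in = Files.newInputStream(file)) {
        props.load(in);
      }
    }
    return props;
  }

  public void save(Properties props) throws Exception {
    try (OutputStream out = Files.newOutputStream(settingsFile())) {
      props.store(out, "business logic configuration");
    }
  }
}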
Update:
To subscribe to the app startup complete:
@PostContextCreate
public void postContextCreate(IEventBroker eventBroker)
{
  eventBroker.subscribe(UIEvents.UILifeCycle.APP_STARTUP_COMPLETE, new AppStartupCompleteEventHandler());
}

private static final class AppStartupCompleteEventHandler implements EventHandler
{
  @Override
  public void handleEvent(final Event event)
  {
    // ... your code here
  }
}
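If you go the persisted-state route instead, the handler could read and write the map from MApplication, roughly like this (a sketch; it assumes you inject MApplication in postContextCreate and pass it to the handler's constructor):

private static final class AppStartupCompleteEventHandler implements EventHandler
{
  private final MApplication application;

  AppStartupCompleteEventHandler(final MApplication application)
  {
    this.application = application;
  }

  @Override
  public void handleEvent(final Event event)
  {
    Map<String, String> state = application.getPersistedState();
    String lastValue = state.get("my.option"); // value from the previous run, or null
    state.put("my.option", "new-value");       // kept across restarts unless -clearPersistedState is used
  }
}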
Can anyone suggest a design pattern to dynamically differentiate between memcached instances in Java code?
Previously in my application there was only one memcached instance, configured this way:
Step-1:
dev.memcached.location=33.10.77.88:11211
dev.memcached.poolsize=5
Step-2:
Then I access that memcached instance in code as follows:
private MemcachedInterface() throws IOException {
  String location = stringParam("memcached.location", "33.10.77.88:11211");
  MemcachedClientBuilder builder = new XMemcachedClientBuilder(AddrUtil.getAddresses(location));
}
Then I invoke that memcached client in code as follows, using the MemcachedInterface() above:
Step-3:
MemcachedInterface.getSoleInstance();
And then I use that MemcachedInterface() to get/set data as follows:
MemcachedInterface.set(MEMCACHED_CUSTS, "{}");
resp = MemcachedInterface.gets(MEMCACHED_CUSTS);
My question is: if I introduce a new memcached instance into our architecture, its configuration is done as follows:
Step-1:
dev.memcached.location=33.10.77.89:11211
dev.memcached.poolsize=5
So, the first memcached instance is at 33.10.77.88:11211 and the second memcached instance is at 33.10.77.89:11211.
Up to this point it's OK, but...
how do I handle Step-2 and Step-3 in this case, to get the MemcachedInterface dynamically?
1) Should I use one more interface called MemcachedInterface2() in Step-2?
Now the actual problem comes in:
I have 4 web servers in my application. Previously all of them wrote to MemcachedInterface(), but now, as I introduce one more memcached instance, e.g. MemcachedInterface2(), WS1 and WS2 should write to MemcachedInterface() while WS3 and WS4 should write to MemcachedInterface2().
So, if I use one more interface called MemcachedInterface2() as mentioned above,
that is a code burden, as I would have to change all the classes used by WS3 and WS4 to MemcachedInterface2().
Can anyone suggest an approach with limited code changes?
xmemcached supports consistent hashing, which will allow your client to choose the right memcached server instance from the pool. You can refer to this answer for a bit more detail: Do client need to worry about multiple memcache servers?
So, if I understood correctly, you'll have to:
use only one memcached client in all your webapps
since you have your own wrapper class around the memcached client (MemcachedInterface), add a method to this interface that enables adding/removing a server to/from an existing client (a sketch follows below); see the user guide (scroll down a little): https://code.google.com/p/xmemcached/wiki/User_Guide#JMX_Support
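Something along these lines (the add/remove method names come from the xmemcached user guide, and the sketch assumes your MemcachedInterface keeps the built MemcachedClient in a field called client; verify both against the client version you use):

// sketch: expose add/remove on the wrapper so callers never touch the client directly
public static void addMemcachedServer(String host, int port) throws IOException {
  getSoleInstance().client.addServer(host, port);   // e.g. "33.10.77.89", 11211
}

public static void removeMemcachedServer(String hostList) {
  getSoleInstance().client.removeServer(hostList);  // e.g. "33.10.77.89:11211"
}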
As far as I can see, you have duplicate code running on different machines as parallel web services. Thus, I recommend the following to differentiate them:
Use a singleton facade service wrapping your memcached client. (I think you are already doing this.)
Use encapsulation. Encapsulate your memcached client to decouple it from your code: interface L2Cache.
For each server, give it a name in a global variable. Assign those values via the JVM or your own configuration files, e.g. JVM: -Dcom.projectname.servername=server-1
Use this global variable as a parameter to configure your service's getInstance method.
public static L2Cache getCache() {
  if (System.getProperty("com.projectname.servername").equals("server-1"))
    return new L2CacheImpl(SERVER_1_L2_REACHIBILITY_ADDRESSES, POOL_SIZE);
  // ... return the implementation for the other servers here
  return null;
}
good luck with your design!
You should list all memcached server instances, space separated, in your config.
e.g.
33.10.77.88:11211 33.10.77.89:11211
So, in your code (Step-2):
private MemcachedInterface() throws IOException
{
  String location = stringParam("memcached.location", "33.10.77.88:11211 33.10.77.89:11211");
  MemcachedClientBuilder builder = new XMemcachedClientBuilder(AddrUtil.getAddresses(location));
}
Then in Step-3 you don't need to change anything, e.g. MemcachedInterface.getSoleInstance(); still works.
You can read more in these memcached tutorial articles:
Use Memcached for Java enterprise performance, Part 1: Architecture and setup
http://www.javaworld.com/javaworld/jw-04-2012/120418-memcached-for-java-enterprise-performance.html
Use Memcached for Java enterprise performance, Part 2: Database-driven web apps
http://www.javaworld.com/javaworld/jw-05-2012/120515-memcached-for-java-enterprise-performance-2.html