Memory usage in Google App Engine - Java

I am a bit confused. I wrote a stand-alone Java app, and now I want to use GAE to
deploy it on the web and, along the way, also learn about GAE.
In my application, I read data from a file, store it in memory, process it, and then store the results in memory or in a file.
I understand that I now need to store the results in GAE's datastore, which is fine. I could run my program independently on my computer, write the results to a file, and then use GAE to upload all the results to the datastore, where users can query them. However, is there a way to move the entire process into the GAE application? That is, the application reads data from a file, does the processing (using memory on the application server rather than on my computer; it needs at least 4 GB of RAM), and when it's done (which might take 1-2 hours), writes everything to the GAE datastore? It would be an internal "offline" process with no users involved.
I'm a bit confused, since Google doesn't mention anything about a memory quota.
Thanks!

You will not be able to do your offline processing the way you are envisioning. There is a limit on how much memory your app can use, but that is not the main problem. All processing in App Engine is done in request handlers. In other words, any action you want your app to perform is written as if it were handling a web request. Each of these handlers is limited to 30 seconds of running time; if your process tries to run longer, it will get shut down. App Engine is optimized for serving web requests, not for doing heavy computations.
All that being said, you may be able to break your computational tasks into 30-second chunks and store intermediate results in the datastore or memcache. In that case you could use a cron job or the task queue (both described in the App Engine docs) to keep calling your processing handlers until the data crunching is done; a sketch of that chaining pattern follows.
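As a rough illustration, a handler on the Java runtime could re-enqueue itself through the default task queue until the work is done. This is a minimal sketch, not a definitive implementation: the /crunch URL mapping and the processChunk method are hypothetical.

```java
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

// Hypothetical handler mapped to /crunch; each invocation processes one
// chunk that fits inside the request deadline, then re-enqueues itself.
public class CrunchServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        int offset = Integer.parseInt(req.getParameter("offset"));
        int next = processChunk(offset); // store intermediate results in the datastore
        if (next >= 0) { // more work left: chain the next task
            Queue queue = QueueFactory.getDefaultQueue();
            queue.add(TaskOptions.Builder
                    .withUrl("/crunch")
                    .param("offset", Integer.toString(next)));
        }
    }

    // Placeholder: process one chunk, return the next offset, or -1 when done.
    private int processChunk(int offset) {
        return -1;
    }
}
```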
In summary, yes, it may be possible to do what you want, but it might not be worth the trouble. Look into other cloud solutions like Amazon's EC2 or Hadoop if you want to do computationally intensive things.

Related

Direct logging on Elasticsearch vs using Logstash and Filebeat

I'm using a Spring Boot back end to provide a RESTful API, and I need to log all of my request-response pairs to Elasticsearch.
Which of the following two methods has better performance?
Use Spring Boot's ResponseBodyAdvice to log every request and response sent to the client directly to Elasticsearch.
Log every request and response into a log file and use Filebeat and/or Logstash to send them to Elasticsearch.
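For concreteness, the first option would look roughly like this in Spring. This is a minimal sketch; the EsLogClient interface and its index method are placeholders for whatever Elasticsearch client you actually use, not a real API.

```java
import org.springframework.core.MethodParameter;
import org.springframework.http.MediaType;
import org.springframework.http.converter.HttpMessageConverter;
import org.springframework.http.server.ServerHttpRequest;
import org.springframework.http.server.ServerHttpResponse;
import org.springframework.web.bind.annotation.ControllerAdvice;
import org.springframework.web.servlet.mvc.method.annotation.ResponseBodyAdvice;

// Placeholder abstraction for an Elasticsearch client wrapper (an assumption).
interface EsLogClient {
    void index(String path, Object body);
}

// Intercepts every response body just before it is written to the client.
@ControllerAdvice
public class EsLoggingAdvice implements ResponseBodyAdvice<Object> {

    private final EsLogClient esClient;

    public EsLoggingAdvice(EsLogClient esClient) {
        this.esClient = esClient;
    }

    @Override
    public boolean supports(MethodParameter returnType,
                            Class<? extends HttpMessageConverter<?>> converterType) {
        return true; // apply to all controllers
    }

    @Override
    public Object beforeBodyWrite(Object body, MethodParameter returnType,
                                  MediaType selectedContentType,
                                  Class<? extends HttpMessageConverter<?>> selectedConverterType,
                                  ServerHttpRequest request, ServerHttpResponse response) {
        // One ES write per response; the answers below discuss why this can hurt.
        esClient.index(request.getURI().getPath(), body);
        return body; // pass the body through unchanged
    }
}
```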
First off, I assume that you have a distributed application; otherwise, just write your stuff to a log file and that's it.
I also assume that you have quite a lot of logs to manage; otherwise, if you're planning to log only a couple of messages an hour, it doesn't really matter which way you go, as both will do the job.
Technically both ways can be implemented, although for the first option I would suggest a different approach. At least, I did something similar about 5 years ago in one of my projects:
Create a custom log appender that throws everything into a queue (for async processing), and from there use the Apache Flume project, which can write to the DB of your choice in a transactional manner, with batch support, "all-or-nothing" semantics, etc.
This approach solves issues that might appear with the "first" option you've presented, while some other issues are left unsolved; a sketch of the appender side follows.
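A minimal sketch of that appender idea with Logback is below. The queue consumer and the Flume/ES sink are omitted, and the queue capacity is an arbitrary assumption.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.AppenderBase;

// Hands every log event to an in-memory queue; a separate consumer thread
// (not shown) drains the queue in batches and forwards events downstream.
public class QueueAppender extends AppenderBase<ILoggingEvent> {

    private static final BlockingQueue<ILoggingEvent> QUEUE =
            new ArrayBlockingQueue<>(10_000); // assumed capacity

    @Override
    protected void append(ILoggingEvent event) {
        // offer() never blocks the business thread; events are dropped on overflow
        QUEUE.offer(event);
    }

    public static BlockingQueue<ILoggingEvent> queue() {
        return QUEUE;
    }
}
```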
Comparing the first and second options you've presented,
I think you are better off with Filebeat / Logstash, or even both, writing to ES. Here is why:
When you log in the advice, you "eat" the resources of your JVM: memory, plus CPU to maintain the ES connection pool and a thread pool for doing the actual logging (otherwise the business flow might slow down because of logging the requests to ES).
In addition, you won't be able to write to Elasticsearch in batches without custom code; instead you will have to create an "insert" per log message, which can be wasteful.
One more technicality: what happens if the application gets restarted for some reason? Will you be able to write all the logs produced prior to the restart if everything is logged in the advice?
Yet another issue: what happens if you want to "rotate" the indexes in ES, namely create an index with a TTL and produce a new index every day?
Filebeat/Logstash can potentially solve all these issues, but they might require a more complicated setup.
Besides, you'll obviously have more services to deploy and maintain:
Logstash is way heavier than Filebeat from the resource-consumption standpoint, and you usually have to parse the log message (usually with a grok filter) in Logstash.
Filebeat is much more "humble" when it comes to resource consumption. If you have many instances to log from (really distributed logging, which I've assumed you have anyway), consider putting a Filebeat service (a DaemonSet if you have k8s) on each node from which you'll gather the logs, so that a single Filebeat process can handle different instances. Then deploy a cluster of Logstash instances on a separate machine so that they do the heavy log crunching all the time and stream the data to ES.
How does Logstash/Filebeat help?
Off the top of my head:
It runs at its own pace, so even if the process goes down, the messages produced by that process will still be written to ES eventually.
It can even survive short outages of ES itself, I think (I'd have to check that).
It can handle processes written in different technologies. What if tomorrow you want to gather logs from the database server, for example, which doesn't run Spring and isn't written in Java at all?
It can handle index rotation and batch writing internally, so you end up with efficient ES management that you would otherwise have to write yourself.
What are the drawbacks of the Logstash/Filebeat approach?
Again, off the top of my head, not a complete list:
All in all, much more data will go through the network.
If you log the structured "LogEvent" directly, you don't need to parse a string back out of a file, so with the file approach that serialize-and-parse conversion is redundant.
As for performance implications: it basically depends on what you measure, what exactly your application looks like, and what hardware you have, so I'm afraid I can't give you a clear answer on that. You should measure your concrete case and find the approach that works better for you.
Not sure you can expect a clear answer to that. It really depends on your infrastructure and hardware.
And by performance, do you mean the performance of your Spring Boot back-end application, or performance in terms of how long it takes for your logs to arrive at Elasticsearch?
I'll just assume the first one.
When sending the logs directly to Elasticsearch, your bottleneck will be the network, while when logging requests and responses into a log file first, your bottleneck will probably be the hard disk and its maximum possible I/O operations.
Normally I would say that sending the logs directly to Elasticsearch over the network should be the faster option when you are operating inside your company's network, because writing to a disk is always quite slow in comparison. But if you are using fast SSDs, the effect should be negligible. And if you need to send your network packets to a different location or country, that can also change things quickly.
So in summary:
If you have a fast network connection to your Elasticsearch and HDDs or slower SSDs, performance might be better using the network.
If your Elasticsearch is not at your location and you can use fast SSDs, writing the logs into a file first might be the faster option.
But in the end you may have to try out both approaches, add some timers, and check for yourself.
We are using both solutions. The first approach has less complexity.
We choose the second approach when we don't want to touch the code and have too many instances of the app.
About performance: writing directly to Elasticsearch performs better because you are not occupying disk I/O. But suppose the connection between your app and the Elasticsearch server drops: you would lose logs after some retry attempts.
Using rsyslog and Logstash is more reliable for big clusters.

Control C application from Java web application

I have C applications that will run on multiple machines at different sites.
Now I want to control and monitor these C applications. For that I am considering a Java web application using Servlet/JSP.
The C applications will connect to the Java web application over TCP. In the web application, I plan to implement a manager that communicates with the C applications over TCP. I will start the manager as a separate thread when the web application starts, and the manager will communicate with servlet requests via the Context and Session. So whenever the user does something in the browser, I want to use the functionality of my manager on the server, with ServletContext and Session as the interface.
So this is what I am thinking. I want to know if there is a better approach, or whether I am doing anything wrong. Can anyone please suggest a better solution?
EDIT
Current workflow: whenever I need to start/stop a C application, I have to SSH into the remote machine from a PuTTY terminal, type long commands, and start/stop it. Whenever there is an issue, I have to scroll through long log files. There are a couple of other things, like the live status of what the application is doing/processing each second, that I can't always write to a log file.
I find this workflow difficult, and things like live status I can't monitor at all.
Now I want to put a web application interface in front of it. I can modify my C applications and implement the web application from scratch.
New workflow to implement: I want to start/stop the C applications from a web page. I want to view logs and live status reports/live graphs on a web page (monitoring what a C application is doing). I also want to monitor machine status on a web page.
I am thinking of building the web interface in Java using JSP/servlets.
So I will modify my C applications so they can communicate with the web application.
Question:
I just need guidelines / best practices for the new workflow.
EDIT 2
Sorry for the confusion between "controller" and "manager"; they are the same thing.
My thoughts:
The system will consist of the C applications running at different sites, a Java controller and a Java web app running side by side in a Tomcat server, and a DB.
1) The C applications will connect to the controller over TCP, so the controller is the server here and the C applications are clients.
2) The C applications will be multithreaded: they will receive tasks from the controller and spawn a new thread to perform each task. When the controller tells them to stop a task, the C application stops that task's thread. Additionally, the C applications will send work progress (logs) to the controller every second.
3) The controller receives task commands from the web application (both run side by side in the Tomcat server, in the same JVM instance), and the web application receives commands from the user over HTTP.
4) The controller will insert the work progress (logs) received every second from the C applications into the DB for later analysis (I need to consider whether it is wise to insert logs into a MySQL RDBMS; it may require a lot of inserts, maybe 100 or 1,000 every second, forever). The web application may also request the most recent 5 minutes of logs from the controller and send them to the user over HTTP. If the user is monitoring logs, the web application will have to retrieve logs from the controller every second and send them to the user over HTTP.
5) A user monitoring C application tasks will see progress in a graph, updated every second, plus text lines of logs for info/error events that may occasionally happen in the C applications.
6) There will be one C application per machine, which will execute any task the user sends from the web browser. The C applications will run as a service on the machine: they will start on machine startup, connect to the server, and stay connected forever, sitting idle if there are no tasks to perform.
It is a valid approach. I believe sockets are how most distributed systems communicate, and more often than not even different services on the same box communicate that way. I also believe what you are suggesting for the Java web service is very typical and will work well (it will probably grow in complexity beyond what you are currently planning, but the architecture you describe is a good start).
If your C services are also made to run independently of the management system, then you might want to reverse it and have the management system connect to the services (unless your firewall prevents it).
You will certainly want a small, well-defined protocol. If you are sending lots of fields, you could even make everything you send JSON or XML, since those already have parsers to validate the format.
Be careful about security! On the C side, ensure that you can't get any buffer overflows, and if you parse the information yourself, be strict about throwing away (and logging!) data that doesn't look right. On the Java side buffer overruns aren't as much of a problem, but be sure to log packets that don't fit your protocol exactly, to detect both bugs and intrusions.
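On the Java side, strict parsing with a library like Jackson makes it easy to reject and log malformed packets. This is a sketch only; the Command fields are invented to illustrate the shape of such a protocol.

```java
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ProtocolParser {

    // Hypothetical message shape for the C-to-Java protocol.
    public static class Command {
        public String type;     // e.g. "START_TASK", "STOP_TASK", "STATUS"
        public String machine;  // which C client sent it
        public long timestamp;
    }

    private static final ObjectMapper MAPPER = new ObjectMapper()
            // reject messages containing fields the protocol doesn't define
            .enable(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES);

    /** Returns the parsed command, or null after logging the bad packet. */
    public static Command parse(String json) {
        try {
            return MAPPER.readValue(json, Command.class);
        } catch (Exception e) {
            System.err.println("Dropping malformed packet: " + json);
            return null;
        }
    }
}
```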
Another solution you might consider: your systems all share a database already, so you could send commands and responses through the DB (assuming the commands/responses are not happening too often). We don't do this exactly, but we share a variable table in which we place name/value pairs indicating different aspects of our systems' performance and configuration (it's two-way). This is probably not optimal, but it has been amazingly flexible, since it allows us to reconfigure our system at runtime (the values are cached locally in each service and re-read/updated every 30 seconds).
I might be able to give you more info if I knew more specifics about what you expect to do; for instance, how often will your browser update its fields, what kind of command signals or data requests will be sent, and what kind of data do you expect back? Although you certainly don't have to post that here, you must consider it. I suggest mocking up your browser page to start.
edits based on comments:
Sounds good, just a couple comments:
2) Any good database should be able to handle that volume of data for logging but you may want to use a good cache on top of your DB.
5) You will probably want a web framework to render the graph and manage updates. There are a lot of them, and most can do what you describe pretty easily, but trying to do it all yourself without a framework of some sort might be tough. I only mention this because you didn't.
6) Be sure you can handle dropped connections and reconnecting. When you are testing, pull the plug on your server (at least the network cable) and leave it out for 10 minutes, then make sure that when you plug it back in you get the results you expect. (Should the client automatically reconnect? Should it hold onto the logs or throw them away? How long should it hold onto logs?)
You may want to build in a way to "reboot" your C services. Since they were started as a service, simply sending a command that tells them to terminate/exit will generally work, since the system will restart them. You may also want a little monitoring loop that restarts them under certain criteria (for example, when they haven't gotten a command from the server for n minutes). This can come in handy when you're in California at 10 am trying to work with a C service in Australia at 2 am.
Also, consider that an attacker can insert himself between your client and server. If you are using an SSL socket you should be okay, but if it's a raw socket you must be VERY careful.
Correction:
You may have problems putting that many records into a MySQL database. If it is not indexed and you minimize queries against it, you may be okay. You can achieve this by keeping the last 5 minutes of all your logs in memory, so you don't have to index your database, and by grouping inserts or having a very well-tuned cache.
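Grouping inserts can be done with plain JDBC batching, flushing every second or every few hundred buffered rows. A sketch, assuming a simple schema; the table and column names are made up.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.List;

public class LogBatchWriter {

    // Write a buffered batch of log lines in a single round trip.
    public static void flush(Connection conn, List<String> buffered)
            throws SQLException {
        String sql = "INSERT INTO app_log (logged_at, message) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (String line : buffered) {
                ps.setTimestamp(1, new Timestamp(System.currentTimeMillis()));
                ps.setString(2, line);
                ps.addBatch();
            }
            ps.executeBatch(); // one network round trip instead of N
        }
    }
}
```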
A better approach might be to forgo the database and just use flat log files, pre-filtered to what a single user might want to see. If the user asks for the last 5 minutes of "WARN" and "DEBUG" messages from a machine, you could just read that machine's log file into memory, skipping all but warn/debug messages, and display those. This has its own problems but should be more scalable than an indexed database. It would also let you zip up older data (that a user won't want to query against anymore) for a 70-90% savings in disk space.
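Pre-filtering a flat log file on demand is only a few lines of Java. A sketch, assuming each line starts with its level ("WARN ...", "DEBUG ..."), which is an assumption about the log format.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LogFilter {

    // Return only the lines at the levels a user asked for.
    public static List<String> filter(Path logFile, List<String> levels)
            throws IOException {
        try (Stream<String> lines = Files.lines(logFile)) {
            return lines
                    .filter(line -> levels.stream().anyMatch(line::startsWith))
                    .collect(Collectors.toList());
        }
    }
}
```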
Here are my recommendations on your current design, given that you haven't defined a specific scope for this project:
Define a protocol for communication between your C apps and your monitor app. You probably don't need the same info from all the C apps in the same format, and some metrics matter more for some C apps than for others. I recommend using plain JSON and defining a minimum schema to fulfill, so the C side can produce the data and the Java side can consume and validate it.
Use a database to store the results of monitoring your C apps. The generic option is an RDBMS, probably an open-source one like MySQL or PostgreSQL, or, if you (or your company) can get the licenses, SQL Server or Oracle or another one. This applies if you need to maintain a history of the results; you can clear out the data periodically.
You probably want/need the latest monitoring results available in a sort of cache (because here performance is critical), so you may use an in-memory database like Hazelcast or Redis, or just a simple cache like Ehcache or Infinispan. Storing the data in an external component is better than storing it in the plain ServletContext, because these technologies are aware of multithreading and support ACID, which is not the primary use case for ServletContext but seems necessary for the monitor.
Separate the monitor that receives the data from the C apps from the web app. If the monitor fails or takes too long to perform some operation, the web application will still be available and won't carry the overhead of receiving and managing the data from the C apps. On the other hand, if the web app becomes slow (due to problems in its implementation, or something that should be diagnosed with a profiler), you can restart it, and the monitor will keep gathering the data from the C apps and storing it in your data source.
For the threads in the monitor app, since it will apparently be based on Java, use an ExecutorService rather than creating and managing the threads manually; see the sketch below.
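A minimal sketch of such a monitor accepting C-app connections on a thread pool. The port, pool size, and handleClient body are placeholders, not prescriptions.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MonitorServer {

    public static void main(String[] args) throws IOException {
        // Fixed pool instead of hand-managed threads; the size is an assumption.
        ExecutorService pool = Executors.newFixedThreadPool(16);
        try (ServerSocket server = new ServerSocket(9090)) { // assumed port
            while (true) {
                Socket client = server.accept(); // one connection per C app
                pool.submit(() -> handleClient(client));
            }
        }
    }

    // Placeholder: read protocol messages from the C app and store them.
    private static void handleClient(Socket client) {
        // parse with the shared JSON protocol, write to DB/cache, etc.
    }
}
```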
For this part:
User monitoring C application tasks, will see progress in graph, updated every second. Additionally text lines of logs of info/error events that may happen occasionally in C applications
You may use RxJava or another reactive programming model like the Play Framework to read the data continuously from the database (and the cache, if you use one) and push updates straight to the view (JSP, Facelets, plain HTML, or whatever you will use) for the users of the web app. If you don't want to use this programming model, then at least use a push technology like Comet or WebSockets. If this part is not that important, use a simple refresh timer as explained here: How to reload page every 5 second?
For this part:
C applications will be per machine, which will execute any task user sends from web browser
You could reuse the JSON protocol the C apps already use to communicate with the monitor, plus another thread in each C app to translate the action and execute it.

Using multithreading in a server application (written for deployment to Google App Engine)

I'm developing an enrollment application. The client side is an Android application that lets the client enter their information, which is stored using the data storage service of the Google cloud; the images entered are stored using the blob storage service.
The server side is a J2EE application that extracts the previously entered data and blobs and runs some tests such as face recognition, alphanumeric matching, etc. These tests run asynchronously and continuously. I thought of using multithreading for these server-side processes.
So, is that recommended in such a case? Is there another solution?
There are several limitations in GAE that somewhat limit its multiprocessing abilities:
Each request can create up to 50 threads, but threads cannot outlive the request, which itself has a 60-second limit. Also, threads must be created via GAE's own ThreadManager, which limits the use of most external processing libraries; see the sketch below.
Background threads independent of the current request are available and can be long-lived, but there is a limit of 10 background threads per instance.
For async processing you should look into Task Queues: they have all the above limitations, but a task can run for 10 minutes. You can start periodic processing via cron jobs.
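For completeness, request-scoped threads would be created roughly like this, using the documented ThreadManager factory. A sketch only: the pool size and the two task runnables are placeholders.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import com.google.appengine.api.ThreadManager;

public class RequestWork {

    public static void runChecks(Runnable faceCheck, Runnable textCheck) {
        // Threads must come from GAE's factory and die before the request ends.
        ThreadFactory factory = ThreadManager.currentRequestThreadFactory();
        ExecutorService pool = Executors.newFixedThreadPool(4, factory); // <= 50 threads
        pool.submit(faceCheck);
        pool.submit(textCheck);
        pool.shutdown(); // work must finish within the request deadline
    }
}
```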
Note that GAE instances are quite limited (the default is a single 600 MHz core with 128 MB of RAM). They are also quite expensive given how low-powered they are. If you need more processing power, you should look into Compute Engine (powerful, stand-alone, unmanaged, no access to GAE services, fairly priced for the power) or, in your case preferably, Managed VMs (powerful, managed, limited GAE-service access, same price as CE).
So if you have light processing, use Task Queues; if you need more power, use Managed VMs (currently in preview).

Google App Engine Java Program concurrency

In order to improve the execution speed of a Java program running in Google App Engine, can I create additional Java threads during the runtime to make use of idle machines in the data center?
I've found conflicting data thus far.
If your primary concern is to improve execution speed, take a look at Memcache and Tasks. They can be used to reduce or avoid the latency of reading from or writing to the Datastore or other storage options, fetching URLs, sending emails, etc. If you do a lot of heavy computations that can run in parallel, look at the MapReduce API.
Once you remove all the delays from your program, there will be no reason to use multiple threads within a single request.
Note that App Engine instances can use multithreading to execute multiple requests at the same time, so they tend to use their allocated resources efficiently. To enable it, set threadsafe to true in appengine-web.xml, as described at:
https://developers.google.com/appengine/docs/java/config/appconfig#Java_appengine_web_xml_Using_concurrent_requests
If you have a problem that calls for a multithreaded solution, you can use threads (as described on the link that you included in your question).
However, based on your reasoning ("to make use of idle machines in the datacenter"), it seems like you're misguided. You should not use threads for that reason. You use the machine hours that you pay for and no more. The only time you will have an idle machine is if you tell App Engine to keep an extra idle instance around so that it doesn't have to start up an extra machine when your app gets a big usage spike.
Most of the time, unless you are truly doing parallel computation, you won't need to use multiple threads in App Engine. For instance, the datastore has an asynchronous API, so you can run multiple datastore operations in parallel without having to deal with threads yourself; see the sketch below.
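For example, with the low-level async datastore API, two reads can overlap without any explicit threads. A minimal sketch; the two keys are placeholders.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import com.google.appengine.api.datastore.AsyncDatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;

public class ParallelReads {

    public static void fetchBoth(Key keyA, Key keyB)
            throws ExecutionException, InterruptedException {
        AsyncDatastoreService ds = DatastoreServiceFactory.getAsyncDatastoreService();
        // Both RPCs are in flight at the same time; no threads needed.
        Future<Entity> a = ds.get(keyA);
        Future<Entity> b = ds.get(keyB);
        Entity first = a.get();  // block only when the results are needed
        Entity second = b.get();
    }
}
```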
Does that make sense?

Is there a Java local queue library I can use that keeps memory usage low by dumping to the hard drive?

This may not be possible, but I thought I might just give it a try. I have a job that processes some data and makes one of three decisions for each item it processes: keep, discard, or modify/reprocess (because it is unsure whether to keep or discard it). This generates a very large amount of data, because reprocessing may break an item into many different parts.
My initial approach was to send the items to my ExecutorService for processing, but because the number of items to process was so large, I ran out of memory very quickly. Then I decided to offload the queue to a messaging server (RabbitMQ), which works fine, but now I'm bound by network I/O. What I like about RabbitMQ is that it keeps messages in memory up to a certain level and then dumps old messages to the local drive, so if I have 8 GB of memory on my server I can still have a 100 GB message queue.
So my question is: is there any library with a similar feature in Java? Something I can use as a non-blocking queue that keeps only X items in memory (either by number of items or by size) and writes the rest to the local drive.
Note: right now I'm only asking for this to be used on one server. In the future I might add more servers, but because each server is self-generating data, I would try to take messages from one queue and push them to another if one server's queue is empty. The library would not need network access, but I would need to access the queue from another Java process. I know this is a long shot, but I thought that if anyone knew, it would be SO.
Not sure if it is the approach you are looking for, but why not use a lightweight database like HSQLDB and a persistence layer like Hibernate? You can keep your messages in memory, commit them to the DB to save them on disk, and later query them with a convenient SQL query.
Actually, as Cuevas wrote, HSQLDB could be a solution. If you use the "cached tables" it provides, you can specify the maximum amount of memory used, and data exceeding it will be sent to the hard drive; a sketch follows.
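A minimal sketch of that idea used as a disk-backed work queue. The file path, table name, and column sizes are assumptions; check the HSQLDB docs for the exact cache-size tuning properties.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class DiskBackedQueue {

    public static void main(String[] args) throws Exception {
        // File-mode HSQLDB; CACHED tables keep only part of their rows in memory.
        Connection conn = DriverManager.getConnection(
                "jdbc:hsqldb:file:/tmp/workqueue", "SA", "");
        try (Statement st = conn.createStatement()) {
            st.execute("CREATE CACHED TABLE work_queue ("
                    + "id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY, "
                    + "payload VARBINARY(1024))");
        }
        // Enqueue: rows beyond the memory cache spill to disk automatically.
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO work_queue (payload) VALUES (?)")) {
            ps.setBytes(1, new byte[]{1, 2, 3}); // placeholder payload
            ps.executeUpdate();
        }
        conn.close();
    }
}
```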
Use the filesystem. It's old-school, yet so many engineers get bitten by libraries because they are lazy. True, HSQLDB provides lots of value-added features, but in the context of staying lightweight...
