Using Ehcache to handle file processing - Java

I am new to Ehcache and how it is used. In my application I load many files using java.io (say 100 at a time, possibly more) and process them using multiple threads.
From a performance perspective I want to implement a caching mechanism for this. Can anyone tell me how I should do this and what the best practice is?
PS - the file processing steps:
1. read the file
2. create java file object.
3. process the file.
4. move the file to a different location.
(I am using Spring in my application.)
Thank you all in advance.
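The four steps above can be sketched with a fixed thread pool as follows. This is only a minimal illustration under assumptions: the class name FileBatchProcessor, the directory parameters, and the line-counting "process" step are hypothetical placeholders, not part of the original application. Because each file is read once and then moved away, a cache has little repeated data to serve in this flow, so it may be worth measuring before adding one.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class FileBatchProcessor {

    // Hypothetical processing step (steps 1-3): here it only counts the lines of the file.
    static long process(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.count();
        }
    }

    // Processes every regular file in inputDir on a fixed thread pool and moves
    // each processed file to doneDir (step 4). Returns the number of files handled.
    static int processAll(Path inputDir, Path doneDir, int threads) throws Exception {
        Files.createDirectories(doneDir);
        List<Path> files;
        try (Stream<Path> listing = Files.list(inputDir)) {
            files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (Path file : files) {
                results.add(pool.submit(() -> {
                    long lineCount = process(file);
                    Files.move(file, doneDir.resolve(file.getFileName()),
                            StandardCopyOption.REPLACE_EXISTING);
                    return lineCount;
                }));
            }
            for (Future<Long> result : results) {
                result.get(); // rethrows any failure from a worker thread
            }
        } finally {
            pool.shutdown();
        }
        return files.size();
    }
}
```

Calling FileBatchProcessor.processAll(inputDir, doneDir, 8) would handle the current contents of inputDir with eight worker threads; the pool size is something to tune against the disk, not a recommendation.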

We operate a high-traffic portal with about 95M page impressions (PIs) per month.
We use proxy servers and Varnish (https://www.varnish-cache.org/) to cache static content.
That way caching is moved off the application servers, which leaves them more free memory to work with. I think it would be the right solution in your case, too.

Direct logging to Elasticsearch vs using Logstash and Filebeat

I'm using a Spring Boot back end to provide a RESTful API and need to log all of my request-response pairs to Elasticsearch.
Which of the following two methods has better performance?
1. Using Spring Boot ResponseBodyAdvice to log every request and response that is sent to the client directly to Elasticsearch.
2. Logging every request and response to a log file and using Filebeat and/or Logstash to send them to Elasticsearch.
First off, I assume that you have a distributed application; otherwise just write your stuff to a log file and that's it.
I also assume that you have quite a lot of logs to manage; if you're planning to log a couple of messages an hour, it doesn't really matter which way you go - both will do the job.
Technically both ways can be implemented, although for the first path I would suggest a different approach. I did something similar about 5 years ago in one of my projects:
Create a custom log appender that throws everything into a queue (for async processing), and from there use Apache Flume, which can write to the DB of your choice in a transactional manner with batch support, "all-or-nothing" semantics, etc.
This approach solves some issues that might appear in the "first" option that you've presented, while some other issues will be left unsolved.
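The "appender that throws everything into a queue" idea can be sketched with the JDK's own java.util.logging.Handler. This is an assumption-laden illustration, not the setup described above: the class name QueueingLogHandler and the sink callback are hypothetical, and the sink stands in for whatever actually ships a batch (Flume, an Elasticsearch bulk request, and so on). A full queue drops messages here rather than blocking the caller, which is one possible trade-off among several.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.logging.Handler;
import java.util.logging.LogRecord;

class QueueingLogHandler extends Handler {
    private final BlockingQueue<String> queue;
    private final Consumer<List<String>> sink; // hypothetical batch shipper
    private final int batchSize;
    private final Thread worker;
    private volatile boolean running = true;

    QueueingLogHandler(int capacity, int batchSize, Consumer<List<String>> sink) {
        this.queue = new LinkedBlockingQueue<>(capacity);
        this.batchSize = batchSize;
        this.sink = sink;
        this.worker = new Thread(this::drainLoop, "log-shipper");
        this.worker.setDaemon(true);
        this.worker.start();
    }

    @Override
    public void publish(LogRecord record) {
        if (!isLoggable(record)) {
            return;
        }
        // offer() never blocks the business thread; if the queue is full the message is dropped.
        queue.offer(record.getLevel() + " " + record.getMessage());
    }

    // Background loop: waits for one message, then drains up to batchSize and ships them together.
    private void drainLoop() {
        List<String> batch = new ArrayList<>();
        while (running || !queue.isEmpty()) {
            try {
                String first = queue.poll(100, TimeUnit.MILLISECONDS);
                if (first == null) {
                    continue;
                }
                batch.add(first);
                queue.drainTo(batch, batchSize - 1);
                sink.accept(new ArrayList<>(batch));
                batch.clear();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    @Override
    public void flush() {
        // Nothing is buffered outside the queue in this sketch.
    }

    @Override
    public void close() {
        running = false;
        try {
            worker.join(2000); // give the worker a moment to ship what is left
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Note that this only moves the work off the request thread; messages still sitting in the queue are lost if the JVM dies, which is one of the issues this approach leaves unsolved.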
If I compare the first and the second option that you've presented, I think you're better off with Filebeat / Logstash, or even both, to write to ES. Here is why:
When you log in the advice, you will "eat" the resources of your JVM: memory and CPU to maintain an ES connection pool, and a thread pool for doing the actual logging (otherwise the business flow might slow down because of logging the requests to ES).
In addition, you won't be able to write "in batch" to Elasticsearch without custom code, and will instead have to issue an "insert" per log message, which might be wasteful.
One more "technicality": what happens if the application gets restarted for some reason? Will you be able to write all the logs from before the restart if everything gets logged in the advice?
Yet another issue: what happens if you want to "rotate" the indices in ES, namely create an index with a TTL and produce a new index every day?
Filebeat/Logstash can potentially solve all these issues, however they might require a more complicated setup.
Besides, you'll obviously have more services to deploy and maintain:
Logstash is way heavier than Filebeat from the resource consumption standpoint, and usually you parse the log message in Logstash (typically with a grok filter).
Filebeat is much more "humble" when it comes to resource consumption. If you have many instances to log (really distributed logging, which I've assumed you have anyway), consider putting a Filebeat service (a DaemonSet if you have k8s) on each node from which you'll gather logs, so that a single Filebeat process can handle several instances. Then deploy a cluster of Logstash instances on a separate machine so that they do the heavy log-crunching all the time and stream the data to ES.
How does Logstash/Filebeat help?
Off the top of my head:
It runs at its own pace, so even if the application process goes down, the messages it produced will still be written to ES.
I think it can even survive short outages of ES itself (should check that).
It can handle different processes written in different technologies. What if tomorrow you want to gather logs from, for example, a database server that doesn't use Spring or isn't written in Java at all?
It can handle index rotation and batch writing internally, so you end up with effective ES management that you would otherwise have to write yourself.
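To give a feel for what "a new index every day" means if you had to write it yourself, here is a minimal sketch of deriving a per-day index name. The class name DailyIndexNames, the prefix, and the prefix-yyyy.MM.dd layout are assumptions chosen for illustration; use whatever naming your own setup expects.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class DailyIndexNames {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy.MM.dd");

    // Builds the index name for a given day, e.g. prefix "app-logs" plus the date.
    static String indexFor(String prefix, LocalDate day) {
        return prefix + "-" + DAY.format(day);
    }
}
```

With this, DailyIndexNames.indexFor("app-logs", LocalDate.of(2024, 1, 5)) yields "app-logs-2024.01.05". Since each day lands in its own index, expiring old logs becomes a matter of deleting whole indices rather than individual documents.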
What are the drawbacks of the Logstash/Filebeat approach?
Again, off the top of my head, not a full list:
Much more data will go through the network overall.
If you use "LogEvent" you don't need to parse the string, so this conversion is redundant.
As for performance implications, it basically depends on what you measure, what exactly your application looks like, and what hardware you have, so I'm afraid I can't give you a clear answer on that. You should measure in your concrete case and pick whichever way works better for you.
Not sure if you can expect a clear answer to that. It really depends on your infrastructure and the hardware used.
And by performance, do you mean the performance of your Spring Boot back-end application, or how long it takes for your logs to arrive at Elasticsearch?
I'll assume the first one.
When sending the logs directly to Elasticsearch, your bottleneck will be the network; when logging requests and responses to a log file first, your bottleneck will probably be the hard disk and its maximum I/O operations.
Normally I would say that sending the logs directly to Elasticsearch over the network should be the faster option when you are operating inside your own company network, because writing to a disk is usually quite slow in comparison. But if you are using fast SSDs the effect should be negligible. And if you need to send your network packets to a different location/country, this can also change quickly.
So in summary:
If you have a fast network connection to your Elasticsearch and HDDs or slower SSDs, the performance might be better using the network.
If your Elasticsearch is not at your location and you can use fast SSDs, writing the logs to a file first might be the faster option.
But in the end you may have to try out both approaches, implement some timers and check for yourself.
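A very small timer for such a comparison could look like the sketch below. The class name LogTimer and its parameters are hypothetical; the idea is to wrap each logging variant in a Runnable and compare the averages on your own hardware. It is a rough measure only: for serious benchmarking a dedicated harness such as JMH accounts for JIT warm-up far better.

```java
class LogTimer {

    // Runs the action `warmup` times untimed, then `runs` times timed,
    // and returns the average wall-clock nanoseconds per timed run.
    static long averageNanos(Runnable action, int warmup, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        for (int i = 0; i < warmup; i++) {
            action.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            action.run();
        }
        return (System.nanoTime() - start) / runs;
    }
}
```

For example, one could pass a Runnable that writes one log line to a file and another that sends it over the network, and compare the two results under realistic load.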
We are using both solutions. The first approach has less complexity.
We choose the second approach when we don't want to touch the code and have too many instances of the app.
About performance: writing directly to Elasticsearch gives you better performance because you are not occupying disk I/O. But consider what happens when the connection between your app and the Elasticsearch server is dropped: you will lose logs after some retry attempts.
Using rsyslog and Logstash is more reliable for big clusters.

Adding new node to a scalable system with zero downtime

I am working as a developer on a batch processing solution. It works by splitting a big file and processing it across JVMs. We have 4 processor JVMs, each of which takes a chunk of the file and processes it, and 1 gateway JVM. The gateway's job is to split the file into as many chunks as there are processor JVMs (i.e. 4) and send a REST request that is consumed by the processor JVMs. The REST request carries all the details: the location to pick the file up from, plus some other details.
Now, if I want to add another processor JVM without any downtime, is there any way to do it? Currently we maintain the URLs for the 4 JVMs in a property file. Is there a better way, one that would let me add more JVMs without restarting any component?
You can consider setting up a load balancer and putting your JVM(s) behind it. The load balancer would be responsible for distributing the incoming requests to the JVMs.
This way you can scale your JVMs up or down depending on the workload. Also, if one of the JVMs is not working, the other parts of your system no longer need to care about it.
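What a load balancer does here can be illustrated in miniature: keep a list of processor endpoints that can change at runtime, and hand out the next one in round-robin order. The sketch below is an assumption for illustration only (the class ProcessorRegistry and the example URLs are hypothetical); a real load balancer would also do health checks and connection handling.

```java
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicLong;

class ProcessorRegistry {
    private final CopyOnWriteArrayList<String> urls = new CopyOnWriteArrayList<>();
    private final AtomicLong counter = new AtomicLong();

    // Adds a processor endpoint at runtime; duplicates are ignored.
    void register(String url) {
        urls.addIfAbsent(url);
    }

    // Removes a processor endpoint at runtime.
    void unregister(String url) {
        urls.remove(url);
    }

    // Returns the next endpoint in round-robin order.
    String next() {
        // Work on a snapshot so a concurrent register/unregister cannot change the size mid-call.
        Object[] snapshot = urls.toArray();
        if (snapshot.length == 0) {
            throw new IllegalStateException("no processors registered");
        }
        int index = (int) Math.floorMod(counter.getAndIncrement(), (long) snapshot.length);
        return (String) snapshot[index];
    }
}
```

Because the list is consulted on every call, a fifth processor registered while the gateway is running starts receiving requests without any restart.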
Not sure what your use case and tech stack are, but it seems that you need a distributed system with auto-scaling and dynamic provisioning capabilities. Have you considered Hadoop or Spark clusters, or Akka?
If you cannot use any of them, then a solution is to maintain the list of JVMs in some datastore (let's say in a table); it is dynamic data, meaning one can add/remove/update JVMs. Then you need a resource manager that can decide whether to spin up a new JVM based on load or any other conditional logic. This resource manager needs to monitor the entire system. Also, whenever you create a task, chunk, or slice of data, distribute it using a message queue such as Apache ActiveMQ. You can also consider Kafka for complex use cases. Nowadays, application servers such as WebSphere (Liberty profile) and WebLogic also provide auto-scaling capability, so if you are already using such an application server you can think about making use of that capability. I hope this helps.

What is the fastest way to connect two Java processes on the same physical machine?

I have a large in-memory cache inside my Java application, which is filled after the application starts. This makes redeployments extremely expensive and slows down the development process.
To solve the problem I'd like to outsource the cache to a separate Java process. What is the fastest way to connect two Java processes on Linux?
As the fastest solution I'd recommend Hazelcast. It supports distributed maps. You can define a simple 2-node cluster, so that when both of your processes are up the data is shared; when one of them goes down the data is still in the memory of the dedicated process; and when the main process is up again the data is shared again.
The only thing that you have to change in your code is the line where you create the instance of your map. You have to use the Hazelcast API instead of new HashMap<>().
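The "change only one line" point works best when the rest of the code depends on the Map interface rather than on HashMap. The sketch below is an assumption for illustration: CacheHolder and createCache are hypothetical names, and a ConcurrentHashMap is used as a local stand-in so the example runs without Hazelcast on the classpath.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class CacheHolder {

    // The single place where the map implementation is chosen.
    // With Hazelcast on the classpath, this line could instead be something like:
    //   return Hazelcast.newHazelcastInstance().getMap("cache");
    // (the map name "cache" is a made-up example).
    static <K, V> Map<K, V> createCache() {
        return new ConcurrentHashMap<>();
    }
}
```

All callers keep using put and get on the returned Map, so swapping the implementation does not ripple through the code base.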

Sharing JVM session

I have a Java application that uses the JVM as session storage. Recently, when the number of users exceeds a certain level, the application goes down because the JVM runs out of memory.
I want to add a new application server and also use a load balancer, but since the session is tied to the JVM, I cannot share it with the other application server.
It would be great if I could dedicate one JVM instance to the session storage and access it from multiple application servers. How can I do that?
I am using Java Spring in the project. Is my plan ok to accommodate lot of users requests?
Thanks in advance.
There is a third-party product called Terracotta. I tried it and it works fine for Spring applications.
You can find the configuration details at the link below.
http://www.terracotta.org/documentation/4.1/terracotta-server-array/introduction
Leave a comment if you need any help.
First make sure you know what the cause of the out-of-memory condition is. If it is really related to having many sessions, you may want to change the way sessions are managed. Instead of keeping sessions in memory, you could save them to a database. With that approach you would reduce memory usage, and after adding other machines the sessions wouldn't be tied to any one of them.
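Saving a session to a database implies turning its attributes into bytes and back. The following is a minimal sketch of that step only, using plain Java serialization; the class name SessionCodec and the choice of a HashMap of strings as the session shape are assumptions for illustration, and the actual database write is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

class SessionCodec {

    // Serializes session attributes into a byte array that could be stored in a BLOB column.
    static byte[] toBytes(HashMap<String, String> attributes) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(attributes);
        }
        return buffer.toByteArray();
    }

    // Restores session attributes from bytes previously produced by toBytes.
    @SuppressWarnings("unchecked")
    static HashMap<String, String> fromBytes(byte[] bytes)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (HashMap<String, String>) in.readObject();
        }
    }
}
```

Any application server can then load the bytes by session id and rebuild the session, which is what removes the tie to a single JVM. Keeping the stored attributes small also helps, since every request pays the load and save cost.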
It sounds like you are holding (large amounts of) session data in memory ... for performance reasons.
Is my plan ok to accommodate lot of users requests?
Ultimately you will run out of:
physical memory to hold all of the session data in one JVM, or
CPU and I/O bandwidth to satisfy the requests for session data from the other application servers, and/or
CPU resources for simply managing the data. (Hint: the time taken to do a full GC is proportional to the total amount of reachable data.)
If your architecture uses a single JVM for all session data, you will eventually hit a wall. That suggests that you should make it possible to replicate that part of your system. However, it is not possible to suggest the best way to do that ... without a much deeper analysis of your application ... and its real need for scalability.
Bottom line: there are no simple one-size-fits-all solutions for scalability.

Does the size of a war file in some way affect application and/or application server performance?

We've been struggling here at work with somebody's suggestion that we should decrease the size of our war file, specifically the size of the WEB-INF/lib directory, in order to improve the performance of our production JBoss instance. It's something I'm still suspicious about.
We have around 15 web apps deployed in our application server, each about 15 to 20 MB in size.
I know there are a lot of variables involved in this, but has any of you actually dealt with this situation? Does the size of .war files actually have a significant impact on web containers in general?
What advice can you offer?
Thank you.
There are many things to be suspicious of here:
What about the application is not performing to the level you would like?
Have you measured the application to find out which components are causing the lack of performance?
What are the bottlenecks in the application/system?
The size of the application alone has nothing to do with any sort of runtime performance. The number of classes loaded during the lifetime of the application has an impact on memory usage of the application, but an incredibly negligible one.
When dealing with "performance issues", the solution always follows the same general steps:
What does it mean when we say "bad performance"?
What specifically is not performing? Measure, measure, measure.
Can we improve the specific component not performing to the level we want?
If so, implement the ideas, measure again to find out if performance has truly improved.
Need you to tell us the operating system.
Do you have antivirus live protection?
A war/jar file is actually a zip file - i.e., if you renamed a .war to a .zip, you can use a zip utility to view/unzip it.
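Since a war is a zip archive, the JDK's java.util.zip classes can read it directly. As a small illustration, the sketch below sums the uncompressed size of everything under WEB-INF/lib, which is one way to put a number on the directory under discussion. The class name WarInspector is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

class WarInspector {

    // Returns the total uncompressed size, in bytes, of the files under WEB-INF/lib/ in the given war.
    static long libBytes(Path war) throws IOException {
        long total = 0;
        try (ZipFile zip = new ZipFile(war.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (!entry.isDirectory() && entry.getName().startsWith("WEB-INF/lib/")) {
                    long size = entry.getSize(); // -1 when the size is not recorded
                    if (size > 0) {
                        total += size;
                    }
                }
            }
        }
        return total;
    }
}
```

This measures size on disk after extraction, not runtime cost; it is only meant as a starting point for the measuring suggested earlier.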
During deployment, the war file is unzipped once into a designated folder. If you have live-protection, the antivirus utility might take some time to scan the new branch of directories created and slow down any access to them.
Many web app frameworks, like JSPs, create temporary files and your live-protection would get into action to scan them.
If this is your situation, you have to decide whether you wish to exclude your web-app from antivirus live-scanning.
Are you running Linux but accessing your web directory through ntfs-3g? If so, check whether the NTFS directory is compressed. ntfs-3g has problems accessing compressed NTFS files, especially when multiple files are manipulated/created/uncompressed simultaneously. In the first place, unless there are some extremely valid reasons (I can't see any), a web app directory should be on a local partition in a format native to Linux.
Use wireshark to monitor the network activity. Find out if web apps are causing accesses to remote file systems. See if there are too many retransmits whenever the web apps are active. Excessive retransmits or requests for retransmits means the network pipeline has integrity problems. I am still trying to understand this issue myself - some network cards have buffering problems (as though buffer overflow) operating in Linux but not in Windows.
Wireshark is not difficult to use as long as you have an understanding of ip addresses, and you might wish to write awk, perl or python scripts to analyze the traffic. Personally, I would use SAS.
