I have 2 Java applications. First I may edit as much as I wish, but I will compile it to machine code later on. Second one I am not able to edit, but I may write an addon for it. I need to make that addon be able to talk with first application. Generally simply send strings to each other. Input and Output streams of a process is not an option for me. I am thinking of using a tcp socket client/server or a file which will act as a buffer. But both ways seem a liitle bit ugly to me, could anyone propose me a better idea?
It depends on what kind of data you wish to transfer.
If it is only Strings, then:
if number of process = 2 and if you are sure of it, then stdin &8 stdout is the best way forward. You can create a Process using ProcessBuilder and then get the streams to communicate. The other process can just to System.out to transfer message. This is preferred to Socket, because you dont have to handle graceful closing of socket etc. (In case it fails and the port is not un-binded successfully, it can be a big trouble)
if number of process > 2 and less than say 10, you can probably use Sockets and communicate through Socket. This should work well, though extra effort goes in gracefully managing sockets.
if number of process is Large, then JMS should be used. It does a lot of things which you dont need to handle. Too big a task if the number of processes are less.
So in your case, process is the best way forward.
If the data you wish to transfer, can even be Objects. RMI can be used given the number of processes are less. If more, use JMS again.
Edit: Now for all the above, there is a lot of dirty work involved. For a change, if you are looking at something new & exciting, I would advice akka. It is a actor based model which communicate with each other using Messages.
The beauty is, the actors can be on same JVM or another (very little config) and akka takes care the rest for you. I haven't seen a more cleaner way than doing this :)
What about to use JMS ?
You can use according to your needs, either the Publish/Sunbscribe or Point-to-Point Models.
Another approach is having DB table to store your data, one process can insert and other process can read it when ever required. When you are using JMS, there is likeliness of loosing data, But storing in db would be failsafe and future proof.
Related
I'm using a Spring Boot back-end to provide some restful API and need to log all of my request-response logs into ElasticSearch.
Which of the following two methods has better performance?
Using Spring Boot ResponseBodyAdvice to log every request and response that is sent to the client directly to ElasticSearch.
Log every request and response into a log file and using filebeat and/or logstash to send them to ElasticSearch.
First off, I assume, that you have a distributed application, otherwise just write your stuff in a log file and that's it
I also assume that you have quite a log of logs to manage, otherwise, if you're planning to log like a couple of messages in a hour, then it doesn't really matter which way you go - both will do the job.
Technically both ways can be implemented, although for the first path I would suggest a different approach, at least I did something similar ~ 5 years ago in one of my projects:
Create a custom log appender that throws everything into some queue (for async processing) and from that took an Apache Flume project that can write stuff to the DB of your choice in a transaction manner with batch support, "all-or-nothing" semantics, etc.
This approach solves issues that might appear in the "first" option that you've presented, while some other issues will be left unsolved.
If I compare the first and the second option that you've presented,
I think you better off with filebeat / logstash or even both to write to ES, here is why:
When you log in the advice - you will "eat" the resources of your JVM - memory, CPU to maintain ES connections pool, thread pool for doing an actual log (otherwise the business flow might slow down because of logging the requests to ES).
In addition you won't be able to write "in batch" into the elasticsearch without the custom code and instead will have to create an "insert" per log message that might be wasty.
One more "technicality" - what happens if the application gets restarted for some reason, will you be able to write all the logs prior to the restart if everything gets logged in the advice?
Yet another issue - what happens if you want to "rotate" the indexes in the ES, namely create an index with TTL and produce a new index every day.
filebeat/logstash potentially can solve all these issues, however they might require a more complicated setup.
Besides, obviously you'll have more services to deploy and maintain:
logstash is way heavier than filebeat from the resource consumption standpoint, and usually you should parse the log message (usually with grok filter) in logstash.
filebeat is much more "humble" when it comes to the resource consumption, and if you have like many instances to log (really distributed logging, that I've assumed you have anyway) consider putting a service of filebeat (deamon set if you have k8s) on each node from which you'll gather the logs, so that a single filebeat process could handle different instances, and then deploy a cluster of instances of logstash on a separate machine so that they'll do a heavy log-crunching all the time and stream the data to the ES.
How does logstash/filebeat help?
Out of my head:
It will run in its own pace, so even if process goes down, the messages produced by this process will be written to the ES after all
It even can survive short outages of the ES itself I think (should check that)
It can handle different processes written in different technologies, what if tomorrow you'll want to gather logs from the database server, for example, that doesn't have spring/not written java at all
It can handle indices rotation, batch writing internally so you'll end up with effective ES management that otherwise you had to write by yourself.
What are the drawbacks of the logstash/filebeat approach?
Again, out of my head, not a full list or something:
Well, much more data will go through the network all-in-all
If you use "LogEvent" you don't need to parse the string, so this conversion is redundant.
As for performance implications - it basically depends on what do you measure how exactly does your application look like, what hardware do you have, so I'm afraid I won't be able to give you a clear answer on that - you should measure in your concrete case and come up with a way that works for you better.
Not sure if you can expect a clear answer to that. It really depends on your infrastructure and used hardware.
And do you mean by performance the performance of your spring boot backend application or performance in terms of how long it takes for your logs to arrive at ElasticSearch?
I just assume the first one.
When sending the logs directly to ElasticSearch your bottleneck will be the used network and while logging request and responses into a log file first, your bottleneck will probably be the used harddisk and possible max I/O operations.
Normally I would say that sending the logs directly to ElasticSearch via network should be the faster option when you are operating inside your company/network because writing to a disk is always quite slow in comparison. But if you are using fast SSDs the effect should be neglectable. And if you need to send your network packages to a different location/country this can also change fast.
So in summary:
If you have a fast network connection to your ElasticSearch and HDDs/slower SSDs the performance might be better using the network.
If your ElasticSearch is not at your location and you can use fast SSD, writing the logs into a file first might be the faster option.
But in the end you maybe have to try out both approaches, implement some timers and check for yourself.
we are using both solution. first approach have less complexity.
we choose second approach when we dont want to touch the code and have too many instance of app.
about performance. with writing directly on elasticsearch you have better performance because you are not occupying disk I/O. but assume that when the connection between your app and elasticsearch server is dropped. you would have lost log after some retrying attempts.
using rsyslog and logstash is more reliable for big clusters.
I have to make a file synchronizer: an application that essentially synchronizes H24 a large amount of data files from many systems outside to my local system using essentially FTP, SFTP and NFS.
The streams are more than twenty, for each of them the logic is slightly different and it must be configurable.
One of the requirements is that if one of the streams for some reason falls down it must be possible to retrieve it on without restarting the entire system.
Another requirement is that the transfer rate is balanced. In other words, there must not be a stream or a part of them synchronized and another stream 10 hours late
I have some perplexity about architecture to be realized: if I realize a single multithread system I would have a very high thread count (more than 100 I would say) and make it complicated by fulfilling the two requirements outlined above.
I was thinking of realizing several processes or different instances of the same process even if It seems a little "ugly" .. so in this way some load balancing would be done by the operating system and it would be simpler to kill or to start a flow ..Perhaps even performance might be better as several processes could use much more ram Someone has any tips/advice? Thanks a lot and sorry for my poor english. Gian
As #kayaman said, 100 threads is not a lot. If that means 100 threads per unit of work and you will have many units of work which would imply many magnitudes increase in threads, I would suggest having a look at Fibers
As long as you don't block the fibers, you can have 100000+ fibers running over a couple (typically number of CPU cores) of threads. Each fiber would then just wait for a callback from the process before continuing.
To access your endpoints and handle them in similar ways, have a look at Apache Camel - it will allow you to stream the FTP, SFTP, etc and handle each as just another endpoint (in theory you should be able to plug email in as well and stream packets that are emailed to the endpoint)
Regarding balancing the streams, this is business logic you need to implement. If one stream is receiving packets faster than another stream, you should be able to limit the rate by not requesting more packets under certain conditions. Need some more information on how you retrieve the packages and which libraries you are using in order to be of better assistance here.
I'm trying to figure out the best way to go about writing data transfer code for a client/server system that handles multiple clients at once.
I'm already keeping a List of clients who connect (I'm using NIO non-blocking framework btw).
Isn't it costly on performance to iterate through every client with each read/write pass and write the buffer data to each channel? Is there a better/more efficient way of doing it?
I've been thinking about dividing up the buffer size based on the number of clients. Is that a viable solution?
Using selectors (as you seem to be doing) really pays when you're handling a really large number of clients (and why optimize for the case in which you don't have a large number of clients ;)
The bottleneck in such system is rarely the CPU which does the iteration, but the I/O anyway, so I wouldn't worry if I where you.
I have a situation here where I need to distribute work over to multiple JAVA processes running in different JVMs, probably different machines.
Lets say I have a table with records 1 to 1000. I am looking for work to be collected and distributed is sets of 10. Lets say records 1-10 to workerOne. Then records 11-20 to workerThree. And so on and so forth. Needless to say workerOne never does the work of workerTwo unless and until workerTwo couldnt do it.
This example was purely based on database but could be extended to any system, I believe be it File processing, email processing and so forth.
I have a small feeling that the immediate response would be to go for a Master/Worker approach. However here we are talking about different JVMs. Even if one JVM were to come down the other JVM should just keep doing its work.
Now the million dollar question would be: Are there any good frameworks(production ready) that would give me facility to do this. Even if there are concrete implementations of specific needs like Database records, File processing, Email processing and their likes.
I have seen the Java Parallel Execution Framework, but am not sure if it can be used for different JVMs and if one were to come down would the other keep going.I believe Workers could be on multiple JVMs, but what about the Master?
More Info 1: Hadoop would be a problem because of the JDK 1.6 requirement. Thats bit too much.
Thanks,
Franklin
Might want to look into MapReduce and Hadoop
You could also use message queues. Have one process that generates the list of work and packages it in nice little chunks. It then plops those chunks on a queue. Each one of the workers just keeps waiting on the queue for something to show up. When it does, the worker pulls a chunk off the queue and processes it. If one process goes down, some other process will pick up the slack. Simple and people have been doing it that way for a long time so there's a lot information about it on the net.
Check out Hadoop
I believe Terracotta can do this. If you are dealing with web pages, JBoss can be clustered.
If you want to do this yourself you will need a work manager which keeps track of jobs to do, jobs in progress and jobs never done which needs to be rescheduled. The workers then ask for something to do, do it, and send the result back, asking for more.
You may want to elaborate on what kind of work you want to do.
The problem you've described is definitely best solved using the master/worker pattern.
You should have a look into JavaSpaces (part of the Jini framework), it's really well suited to this kind of thing. Basically you just want to encapsulate each task to be carried out inside a Command object, subclassing as necesssary. Dump these into the JavaSpace, let your workers grab and process one at a time, then reassemble when done.
Of course your performance gains will totally depend on how long it takes you to process each set of records, but JavaSpaces won't cause any problems if distributed across several machines.
If you work on records in a single database, consider performing the work within the database itself using stored procedures. The gain for processing the records on different machine might be negated by the cost of retrieving and transmitting the work between the database and the computing nodes.
For file processing it could be a similar case. Working on files in (shared) filesystem might introduce large I/O pressure for OS.
And the cost for maintaining multiple JVM's on multiple machines might be an overkill too.
And for the question: I used the JADE (Java Agent Development Environment) for some distributed simulation once. Its multi-machine suppord and message passing nature might help you.
I would consider using Jgroups for that. You can cluster your jvms and one of your nodes can be selected as master and then can distribute the work to the other nodes by sending message over network. Or you can already partition your work items and then manage in master node the distribution of the partitions like partion-1 one goes to JVM-4 , partion-2 goes to JVM-3, partion-3 goes to JVM-2 and so on. And if JVM-4 goes down it will be realized by the master node and then master node will tell to one of the other nodes to start pick up partition-1 as well.
One other alternative which is easier to use is redis pub sub support. http://redis.io/topics/pubsub . But then you will have to maintain redis servers which i dont like.
I am developing a client-server based application for financial alerts, where the client can set a value as the alert for a chosen financial instrument , and when this value will be reached the monitoring server will somehow alert the client (email, sms ... not important) .The server will monitor updates that come from a data generator program. Now, the server has to be very efficient as it has to handle many clients (possible over 50-100.000 alerts ,with updates coming at 1,2 seconds) .I've written servers before , but never with such imposed performances and I'm simply afraid that a basic approach(like before) will just not do it . So how should I design the server ?, what kind of data structures are best suited ?..what about multithreading ?....in general what should I do (and what I should not do) to squeeze every drop of performance out of it ?
Thanks.
I've worked on servers like this before. They were all written in C (or fairly simple C++). But they were even higher performance -- handling 20K updates per second (all updates from most major stock exchanges).
We would focus on not copying memory around. We were very careful in what STL classes we used. As far as updates, each financial instrument would be an object, and any clients that wanted to hear about that instrument would subscribe to it (ie get added to a list).
The server was multi-threaded, but not heavily so -- maybe a thread handing incoming updates, one handling outgoing client updates, one handling client subscribe/release notifications (don't remember that part -- just remember it had fewer threads than I would have expected, but not just one).
EDIT: Oh, and before I forget, the number of financial transactions happening is growing at an exponential rate. That 20K/sec server was just barely keeping up and the architects were getting stressed about what to do next year. I hear all major financial firms are facing similar problems.
You might want to look into using a proven message queue system, as it sounds like this is basically what you are doing in your application.
Projects like Apache's ActiveMQ or RabbitMQ are already widely used and highly tuned, and should be able to support the type of load you are talking about outside of the box.
I would think that squeezing every drop of performance out of it is not what you want to do, as you really never want that server to be under load significant enough to take it out of a real-time response scenario.
Instead, I would use a separate machine to handle messaging clients, and let that main, critical server focus directly on processing input data in "real time" to watch for alert criteria.
Best advice is to design your server so that it scales horizontally.
This means distributing your input events to one or more servers (on the same or different machines), that individually decide whether they need to handle a particular message.
Will you be supporting 50,000 clients on day 1? Then that should be your focus: how easily can you define a single client's needs, and how many clients can you support on a single server?
Second-best advice is not to artificially constrain yourself. If you say "we can't afford to have more than one machine," then you've already set yourself up for failure.
Beware of any architecture that needs clustered application servers to get a reasonable degree of performance. London Stock Exchange had just such a problem recently when they pulled an existing Tandem-based system and replaced it with clustered .Net servers.
You will have a lot of trouble getting this type of performance from a single Java or .Net server - really you need to consider C or C++. A clustered architecture is much more error prone to build and deploy and harder to guarantee uptime from.
For really high volumes you need to think in terms of using asynchronous I/O for networking (i.e. poll(), select() and asynchronous writes or their Windows equivalents), possibly with a pool of worker threads. Read up about the C10K problem for some more insight into this.
There is a very mature C++ framework called ACE (Adaptive Communications Environment) which was designed for high volume server applications in telecommunications. It may be a good foundation for your product - it has support for quite a variety of concurrency models and deals with most of the nuts and bolts of synchronisation within the framework. You might find that the time spent learning how to drive this framework pays you back in less development and easier implementation and testing.
One Thread for the receiving of instrument updates which will process the update and put it in a BlockingQueue.
One Thread to take the update from the BlockingQueue and hand it off to the process that handles that instrument, or set of instruments. This process will need to serialize the events to an instrument so the customer will not receive notices out-of-order.
This process (Thread) will need to iterated through the list of customers registered to receive notification and create a list of customers who should be notified based on their criteria. The process should then hand off the list to another process that will notify the customer of the change.
The notification process should iterate through the list and send each notification event to another process that handles how the customer wants to be notified (email, etc.).
One of the problems will be that with 100,000 customers synchronizing access to the list of customers and their criteria to be monitored.
You should try to find a way to organize the alerts as a tree and be able to quickly decide what alerts can be triggered by an update.
For example let's assume that the alert is the level of a certain indicator. Said indicator can have a range of 0, n. I would groups the clients who want to be notified of the level of the said indicator in a sort of a binary tree. That way you can scale it properly (you can actually implement a subtree as a process on a different machine) and the number of matches required to find the proper subset of clients will always be logarithmic.
Probably the Apache Mina network application framework as well as Apache Camel for messages routing are the good start point. Also Kilim message-passing framework looks very promising.