This is a problem I have been working on for several years, and I still don't have a good solution.
My application has two parts:
The first part runs on a server called the "ROOT server". It receives real-time stock data from HKEx (the securities and futures exchange in Hong Kong) and broadcasts it to 5 other "child" servers, appending a timestamp to each data item as it broadcasts.
The second part runs on the child servers. They receive the stock data from the ROOT server, parse each item, and extract the important information. Finally, they send it in a new text format to the clients. There may be hundreds to thousands of clients; each can register for certain stocks and receive real-time information about them.
Performance is the most important thing. Over the past several years I have tried every solution I know of to make it faster. "Faster" here means: the first part receives and forwards the data to the child servers as fast as it can, and the child servers receive, parse, and send the data to the clients as fast as they can.
Right now, with a data rate of 200K from HKEx and 5 child servers, the first application has about 10 ms of latency per data item on average. The second part is not easy to measure; it depends on the number of clients.
What I'm using:
OpenSUSE 10
Sun Java 5.0
Mina 2.0
The server hardware:
4-core CPU (I don't know the type)
4 GB RAM
I'm considering how to improve the performance. Should I:
use a concurrency framework such as Akka?
try another language, e.g. Scala or C++?
use a real-time Java system?
Any other advice is welcome.
I need your help!
Update:
The applications log some important information for analysis, but I haven't found any bottlenecks. HKEx will provide even more data next year, and I don't think my application will be fast enough.
One of my customers has tested our application against another company's, and ours had no advantage in speed. I just want to find a way to make it faster.
How the first application runs
The first application receives the stock data from HKEx and broadcasts it to several other servers. The steps are:
It connects to HKEx
logs in
reads the data. The data is in a binary format: each item has a header, a 2-byte integer giving the length of the body, then the body, then the next item (a rough parsing sketch follows these steps).
puts each item into an in-memory hash map. The key is the item's sequence number; the value is the byte array.
logs the sequence number of each received item to disk, using log4j's buffered appender.
a daemon thread reads the data from the hash map and inserts it into PostgreSQL every minute (this is only used to back up the data).
when clients connect to this server, it accepts them and tries to send all the data from the in-memory hash map. I use Mina's thread pool; the acceptor and the senders run in different threads.
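This isn't my real code, but the reading logic is roughly like this (assuming the 2-byte length header is an unsigned big-endian integer; the sequence numbering here is just arrival order, which may not match the real HKEx feed):

    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Read length-prefixed items from the exchange stream and keep them by sequence number.
    public class FeedReader {
        private final Map<Long, byte[]> itemsBySequence = new ConcurrentHashMap<>();
        private long sequence = 0;

        public void readLoop(InputStream in) throws IOException {
            DataInputStream data = new DataInputStream(in);
            while (true) {
                int bodyLength;
                try {
                    bodyLength = data.readUnsignedShort();   // 2-byte header = length of the body
                } catch (EOFException endOfStream) {
                    break;                                   // clean end of stream
                }
                byte[] body = new byte[bodyLength];
                data.readFully(body);                        // read exactly one item body
                itemsBySequence.put(sequence++, body);       // step 4: keep it in memory
            }
        }
    }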
I think the logic is very simple. With 5 clients, I measured a transfer speed of only 1.5M/s at most. I wrote the simplest possible socket program in Java and found it could reach 10M/s.
Actually, I've spent more than a year trying all kinds of solutions on this application just to make it faster. That's why I feel desperate. Do I need to try a language other than Java?
About the 10 ms latency
When the application receives a data item from HKEx, it records a timestamp for it. When the root server broadcasts the data to the child servers, it appends the timestamp to the data.
When a child server gets the data, it sends a message to the root server to get the current timestamp, then compares the two.
So the 10 ms latency consists of:
root server received the data ---> child server received the data
child server sent a request for the root server's timestamp ---> root server received it
But the second interval is so small that we can ignore it.
The first thing to do to find performance bottlenecks is to find out where most of the time is spent. A way to determine this is to use a profiler.
There are open-source profilers available, such as Eclipse TPTP (http://www.eclipse.org/tptp/), and commercial profilers such as YourKit Java Profiler.
One easy thing to do would be to upgrade the JVM to Java SE 6 or Java 7. General JVM performance improved a lot in version 6; see the Java SE 6 Performance White Paper for more details.
If you have checked everything and found no obvious performance optimizations, you may need to change the architecture to get better performance. This would obviously be most fruitful if you could at least identify where your application is spending its time - it sounds like there are several major components:
The HKEx server (out of your control)
The network between the Exchange and your system
The "root" server
The network between the "root" and the "child" servers
The "child" servers
The network between "child" servers and the client
The clients
To know where to spend your time, money and energy, I'd at least want to see an analysis of those components, how long each component takes (min, max, avg), and what the specification is of each resource.
Easiest thing to change is hardware - bigger servers, more memory etc., or better bandwidth. Can you see if any of those resources are constrained?
Next thing to look at is changing the communication protocol to be more efficient - how do clients receive the stocks? Can you reduce the data size? 1.5M/s for only 5 clients sounds like a lot...
Next, you might look at some kind of quality of service solution - provide dedicated hardware for "premium" customers, with reduced resource contention, more servers, more bandwidth - this will probably require changes to the architecture.
Next, you could consider changing the architecture - right now, your clients "pull" data from the child servers. You could, instead, "push" data out - that way, you shave off the polling interval on the client end.
At the very end of the list, I'd consider a different technology stack; Java is a fine programming language, but if absolute performance is a key priority, C/C++ is still faster. Clearly, that's a huge change, and a well-written Java app will be faster than a poorly written C/C++ app (and far more stable).
To trace the source of the delay, I would add timing data to your end-to-end process. You can do this using an external log, or by adding metadata to your messages.
What you want is a timestamp at key stages in your application; 3-5 stages are enough to start with. Normally I would use System.nanoTime() because I am looking for microsecond delays, but in your case System.currentTimeMillis() is likely to be enough, especially if you average over many samples (you will still get about 0.1 ms accuracy on an average, at least on Ubuntu).
Compare the timestamps for the same message as it passes through your system and look for the highest average delay. Once you have found it, try breaking that interval into more stages to zoom in on the problem.
In your situation I would analyse any stage which has an average delay over 1 ms; a sketch of the idea follows.
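For example, a minimal stage-timer sketch (the class and method names here are just illustrative):

    import java.util.ArrayList;
    import java.util.List;

    // Each hop records a timestamp; the last hop prints per-stage deltas.
    // System.nanoTime() is only comparable within one JVM; across machines,
    // use System.currentTimeMillis() and average over many messages.
    public class StageTimer {
        private final List<String> stageNames = new ArrayList<>();
        private final List<Long> stamps = new ArrayList<>();

        public void mark(String stageName) {
            stageNames.add(stageName);
            stamps.add(System.nanoTime());
        }

        public void report() {
            for (int i = 1; i < stamps.size(); i++) {
                long deltaMicros = (stamps.get(i) - stamps.get(i - 1)) / 1000;
                System.out.printf("%s -> %s: %d us%n",
                        stageNames.get(i - 1), stageNames.get(i), deltaMicros);
            }
        }
    }

Call mark("received"), mark("parsed"), mark("queued"), mark("sent") at the key stages, then report() and average the deltas over many messages to find the slowest stage.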
If clients are only updating every minute there might not be a good technical reason to do all this, but you don't want to be seen as slow, or your traders to feel they are at a disadvantage, even if in reality it won't make a difference.
Related
I have this assignment in college where they ask us to run a Java app as a socket server with multiple clients. The client sends a string; the server returns the string in upper case along with a request counter. Quite simple.
Each request made by any client is counted on the server side and stored in a static variable shared by the client connection threads, so each client request increments the counter globally on the server. That's working well.
Now they ask us to run "backup" instances of that server on different machines on the network, so that if the primary stops responding, the client connects to one of the backups. I got that working. But the counter is obviously reset, since it's a different server.
The challenge is to keep the request counter the same on the primary and the secondaries, so that if the primary responds to 10 requests and goes down, and the client switches to a backup and makes a request, the backup server answers with 11.
Here is what I considered:
1. If we were on the same PC, I'd use threads, but we're going over the network, so I believe that will not work.
2. The server sends the counter to the client with the response; the client returns it to the server on the next request, and so forth. Not very "clean" IMO, but it could work.
3. The servers talk to each other to sync this counter. However, sockets don't seem very efficient for this, if it's even possible. RMI seems to be an alternative here, but I'd like confirmation before I start learning it.
Any leads or suggestions? I'm not posting code because I don't need a full solution here, but if necessary I can give access to the GitHub repo.
EDIT: There are no latency, reliability, or similar constraints for this project. There are X clients and Y servers (single master, multiple failovers). Additional third-party infrastructure like a DB isn't really an option, but third-party Java libraries are welcome. Basically, I just run it in Eclipse on multiple PCs. This is an introductory assignment on distributed systems, expected to be done in 2 weeks, so I believe "keep it simple" is the key here!
EDIT 2: The number and addresses of the backup servers will be passed as arguments to the application, so broadcast/discovery isn't necessary. We'll likely cover all those points in a later lab assignment this semester :)
EDIT 3: From all your great suggestions, I'll try an implementation of some variation of #3 and let you know how it works. I think the issue here is making sure all the servers are aware of each other. But as I mentioned, they don't need to discover each other, so I'll hard-code it for now and revisit it in the next assignment! I'll probably opt for some kind of elected master... :)
If option #2 is allowed, it is the easiest; however, I am not sure how it would work with multiple clients (so it depends on the requirements here).
Is it possible to back the servers with a shared DB running on another computer? Ideally one clustered across multiple machines? Or can you use an event bus or third-party libraries/code (a shared cache, JMS, or even EJBs)?
If not, then having the servers talk to each other is your best bet. Sockets can work, as could UDP multicast (be careful there, though: there is no way to know whether a message was missed, which is why TCP/sockets are safer). If the nodes are going to talk to each other, there are a few generally accepted ways to handle the setup:
Master / slaves: The current node is the master and all writes go to it. Slaves connect to the master and receive updates. When the master goes down, a new master needs to be elected (see leader election). MongoDB works like this.
Everyone to everyone: Every node connects to every other known node. This can get complicated and might not scale well to many nodes.
Daisy chain: one node connects to the next node, which connects to the next, and so on. I don't believe this is widely used.
Ring network: Each node connects to two others in order to form a ring. This is generally superior to daisy chain, but a little bit more complicated to implement.
See here for more examples: https://en.wikipedia.org/wiki/Network_topology
If this were the real world (i.e. not school), you would use either a shared cache (e.g. Ehcache), local caches backed by an event bus (JMS of some sort), or a shared clustered DB.
EDIT:
After re-reading your question, it seems you only have a single backup server to worry about, and my guess about the course requirements is that they simply want your backup server to connect to your primary server and receive the counter updates. This is completely fine to implement with sockets (it isn't inefficient for a single backup server), and is perhaps the solution they are expecting you to use.
E.g. the backup server connects to the primary server and either polls for updates across the held connection or simply listens for updates pushed by the primary server (a minimal sketch follows the notes below).
Key notes:
- You might need keep-alives to ensure the connection does not get killed.
- Make sure to implement reconnection logic in case the connection from backup to primary dies.
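A minimal sketch of the listen-for-updates variant (the port, message format, and class names here are made up; the primary would write its counter to every connected backup after each client request):

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.net.Socket;
    import java.util.concurrent.atomic.AtomicLong;

    // Runs on the backup: keeps the latest counter value pushed by the primary.
    public class BackupCounterListener implements Runnable {
        private final String primaryHost;
        private final int primaryPort;
        private final AtomicLong counter;   // the backup's copy of the request counter

        public BackupCounterListener(String primaryHost, int primaryPort, AtomicLong counter) {
            this.primaryHost = primaryHost;
            this.primaryPort = primaryPort;
            this.counter = counter;
        }

        @Override
        public void run() {
            while (true) {
                try (Socket socket = new Socket(primaryHost, primaryPort);
                     DataInputStream in = new DataInputStream(socket.getInputStream())) {
                    while (true) {
                        counter.set(in.readLong());   // primary pushes each new counter value
                    }
                } catch (IOException e) {
                    // primary is down or unreachable: keep the last value and retry (reconnection logic)
                    try { Thread.sleep(1000); } catch (InterruptedException ie) { return; }
                }
            }
        }
    }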
If this is for a networking course they may be expecting UDP multicast, but that may depend a little bit on the server / network environment.
This is a classic distributed systems problem. The right solution is some variation of your option #3, where the different servers communicate with each other.
Where it gets complicated is when you start to introduce latency, downtime, and/or network partitioning between the various servers. Eventually you'll need to arrive at some kind of consensus algorithm. Paxos is a well-known approach to this problem, but there are others; Raft is popular these days as well.
In my opinion, the best solution is to keep a vector of counters, one counter per server. Each server increments its own counter and broadcasts the vector to all the other servers. This data structure is a conflict-free replicated data type (a grow-only counter).
The total number of requests is calculated as the sum of all the elements of the vector (see the sketch below).
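A minimal sketch of such a counter vector (the class and method names are illustrative; the important part is the element-wise maximum used when merging a vector received from another server):

    import java.util.concurrent.atomic.AtomicLongArray;

    public class GCounter {
        private final int selfIndex;            // this server's slot in the vector
        private final AtomicLongArray counts;   // one slot per server

        public GCounter(int selfIndex, int serverCount) {
            this.selfIndex = selfIndex;
            this.counts = new AtomicLongArray(serverCount);
        }

        public void increment() {               // called on every client request
            counts.incrementAndGet(selfIndex);
        }

        public long total() {                   // the request count reported to clients
            long sum = 0;
            for (int i = 0; i < counts.length(); i++) sum += counts.get(i);
            return sum;
        }

        public void merge(long[] remote) {      // apply a vector broadcast by another server
            for (int i = 0; i < remote.length; i++) {
                long current;
                while (remote[i] > (current = counts.get(i))) {
                    counts.compareAndSet(i, current, remote[i]);   // element-wise max
                }
            }
        }
    }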
About consistency: if you need a strictly growing number on all servers, you have to synchronously replicate your new value before answering the client.
The penalty here is performance and availability.
About broadcasting: you can choose any broadcasting algorithm you want. If the number of servers is not too large, you can use a full-mesh topology. If the number of servers grows large, you can use a ring or star topology to replicate the data.
The most real-life approach would be option #3; it happens all the time. Nodes talk to one another on a separate port and self-discover by broadcast (UDP). Each server broadcasts its maximum on a UDP port. The other nodes listen and bump their own value up to that maximum if their current value is lower; otherwise they ignore it and broadcast their own, bigger value instead (a rough sketch follows).
This will work best when there is a 200-300 ms gap between client calls. It also assumes that any server could be the primary (as decided by a load balancer).
UDP is stable within a LAN and widely used for this.
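A rough sketch of that gossip (the port and broadcast address are placeholders, and primary election is not handled here):

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.nio.ByteBuffer;
    import java.util.concurrent.atomic.AtomicLong;

    public class CounterGossip {
        private static final int PORT = 9876;        // assumed port
        private final AtomicLong counter;

        public CounterGossip(AtomicLong counter) { this.counter = counter; }

        // Broadcast this server's current counter to the LAN.
        public void broadcastOnce() throws Exception {
            byte[] payload = ByteBuffer.allocate(Long.BYTES).putLong(counter.get()).array();
            try (DatagramSocket socket = new DatagramSocket()) {
                socket.setBroadcast(true);
                socket.send(new DatagramPacket(payload, payload.length,
                        InetAddress.getByName("255.255.255.255"), PORT));
            }
        }

        // Listen for other servers' counters and adopt any larger value heard.
        public void listenLoop() throws Exception {
            try (DatagramSocket socket = new DatagramSocket(PORT)) {
                byte[] buf = new byte[Long.BYTES];
                while (true) {
                    DatagramPacket packet = new DatagramPacket(buf, buf.length);
                    socket.receive(packet);
                    long heard = ByteBuffer.wrap(packet.getData(), 0, Long.BYTES).getLong();
                    counter.updateAndGet(current -> Math.max(current, heard));
                }
            }
        }
    }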
Solutions to this problem trade off speed against consistency.
If you value consistency over speed you could try a synchronous approach (assuming servers A, B and C):
A receives initial request
A opens connection to B and C to request current counts from each
A calculates max count (based on its own value and the values from B and C), adds one and sends new count to B and C
A closes connections to B and C
A replies to original request, including new max count
At this point, all servers are in sync with the new max count, ready for a new request to any server (a rough sketch of this exchange follows).
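For illustration, assuming each peer exposes a tiny line-based protocol ("GET" returns its current count, "SET <n>" stores a new count and echoes it back; this protocol and the names below are made up), the handler on A could look roughly like this:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.Socket;
    import java.util.List;

    public class SyncCounter {
        private long count = 0;
        private final List<String> peers;   // e.g. ["hostB:5000", "hostC:5000"]

        public SyncCounter(List<String> peers) { this.peers = peers; }

        // Called while handling a client request: agree on max+1 before replying.
        public synchronized long nextCount() throws Exception {
            long max = count;
            for (String peer : peers) {
                max = Math.max(max, ask(peer, "GET"));   // step 2: collect peer counts
            }
            count = max + 1;                             // step 3: compute the new max count
            for (String peer : peers) {
                ask(peer, "SET " + count);               // step 3: push it to the peers
            }
            return count;                                // step 5: reply to the original request
        }

        private long ask(String peer, String command) throws Exception {
            String[] hostPort = peer.split(":");
            try (Socket socket = new Socket(hostPort[0], Integer.parseInt(hostPort[1]));
                 PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()))) {
                out.println(command);
                return Long.parseLong(in.readLine().trim());   // peer replies with its count
            }
        }
    }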
Edit: Of course, if you are able to use a shared database, the problem becomes much simpler.
I have a legacy product in the financial domain, running on Tomcat 6. We get millions of requests overall, about 10k requests an hour. At a high level, I am wondering:
should I go for a distributed application where my MVC component is on one system and the service/DAO layer on another box (I could use Spring Remoting/EJB)?
The reason I am considering this direction is to distribute the load and get better performance; it would also make the system more scalable.
I only see the positive side of it, but I can't figure out what the negative aspects might be.
If some expert can help: what criteria should I consider before going for a distributed model, and what are its pros and cons? I also tried googling for stats on how much load a given web server (Tomcat in my case) can handle efficiently on given hardware (16 GB RAM, Windows 7, a given processor).
Yes, I am going to do a POC where I will measure performance with and without the distributed model, but some high-level input would be highly appreciated.
It is impossible to answer this question without more details: how long does it take to reply to one request on the current server? How many resources are allocated per request?
Having 10k requests per hour means ~3 requests per second. If performing the necessary operations and replying to a request takes ~300 ms on one CPU, a single simple machine is totally fine. This is simple math and doesn't always hold; I'd guess you still have peaks within those 10k requests per hour and they aren't evenly distributed.
If we assume one reply can take up to 1 second, then you can handle as many replies per second as your system has CPUs (given that the CPU is the bottleneck). If the CPU isn't the bottleneck for your application server, there's probably something wrong. You should set up the database(s) on a different machine and only perform computation on the application server machine.
Especially in the financial sector with legacy software, I wouldn't try splitting a running product. How old is the current server? I believe a new server should be cheaper than rewriting the application. Unless you expect 50-100k requests per hour very soon, I don't think splitting off such small parts makes sense.
Instead, run it on up-to-date server hardware, split the application server from the data storage, and you should be fine.
I am wondering at a high level if I should go for a distributed application where my MVC component is on one system and the service/DAO layer on another box (using Spring Remoting/EJB).
I'm not sure what you mean by "system" in this context, but if it means that you are planning to run your application on two servers,
one dedicated to the presentation layer and the other to the business layer, keep in mind that a simpler approach (and probably a more suitable one for your app)
is to build a co-located architecture.
Basically, the idea is to replicate your app across several servers (at least two) and put a load balancer in front of them that routes incoming requests among the available servers.
All servers share the same database instance. This will give you horizontal scalability and will also improve the availability of your system.
I only see the positive side of it, but I can't figure out what the negative aspects might be.
Distributing your business logic will probably require refactoring your application code, and even if the system is working well today, you will certainly introduce some bugs.
The necessary remote calls will add latency, and the fact that you execute your business logic on several servers doesn't solve performance problems in the presentation tier.
In Expert One-on-One J2EE Development Without EJB (p. 65), you can find a good discussion of why not to distribute your business logic.
I am designing a (near) real-time Netty server to distribute a large number of very small messages to a large number of clients across the internet. In internal, go-as-fast-as-you-can testing, I found that I could handle 10k clients with no sweat, but now that we are trying to go across the internet, where latency, bandwidth, etc. vary pretty wildly, we are running into the dreaded OutOfMemory issues, even with 2 GB of RAM.
I have tried various workarounds (setting smaller socket stack sizes, setting high and low water marks, cancelling things that are too old), and they help a little, but only a little. What would be some good ways to optimize Netty for sending large numbers of small messages without significant delays? Also, the bulk of the traffic consists of one kind of message that I don't particularly mind losing. I would use UDP, but since we don't control the clients, that's not really a possibility. Is it possible to set a separate timeout solely for this kind of message without affecting the other messages?
Any insight you could offer would be greatly appreciated.
Usually, if you see OutOfMemory, you can take a heap or thread dump, or use something like jvisualvm and jconsole to find out which classes aren't getting GCed and keep eating your memory.
2 GB is not big for 64-bit machines nowadays. Try bumping it up to about 3 or 4 GB and see whether you still hit OOM.
If you find you can handle 10k connections easily on a LAN, try adding a small delay in your Netty handler and check what happens.
You might want to look into a load-balancing approach. It is used to distribute the workload across a distributed system using both hardware and software. The details of what is suitable for your system depend on several factors, including possible hardware upgrades. Certainly, 2 GB of RAM is fairly small for serving 10k users, and you will need to raise that limit.
You don't say whether the subscription stream is constant or bursty. You also don't say whether there is a minimum number of messages / second the client must support.
Given that I don't know anything about Redis, are any of the following practical?
For the messages you don't care about: if channel.isWritable() == false, discard them immediately (see the sketch at the end of this answer). Unfortunately, I don't know of a way to cancel messages that are already in Netty's send buffer. You wouldn't be able to cancel messages that have been passed to the TCP send buffer anyway, so it's not really something to rely on.
Slow the reception from the subscription down to the rate of the slowest client.
Determine which clients can't keep up (maybe use the write timeout handler) and move them to a separate subscription which can be slowed down. Duplicate the published messages to both subscriptions.
Can you split the messages sent to the clients across different subscriptions? If a client can't keep up, unsubscribe it from the unimportant messages.
If your average send rate is higher than the client can support over time, then there isn't really a solution other than negotiating a change in requirements to reduce the maximum allowable throughput.
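A minimal sketch of the first idea, written against the Netty 4 API (DroppableMessage is a made-up marker type for the messages you don't mind losing):

    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelOutboundHandlerAdapter;
    import io.netty.channel.ChannelPromise;
    import io.netty.util.ReferenceCountUtil;

    // Marker for messages it is OK to lose.
    interface DroppableMessage {}

    public class DropWhenBackloggedHandler extends ChannelOutboundHandlerAdapter {

        @Override
        public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) throws Exception {
            if (msg instanceof DroppableMessage && !ctx.channel().isWritable()) {
                // Outbound buffer is full: drop this low-importance message instead of
                // queueing it and growing the heap for a slow client.
                ReferenceCountUtil.release(msg);   // avoid leaking pooled buffers
                promise.setSuccess();
                return;
            }
            super.write(ctx, msg, promise);        // important messages always go through
        }
    }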
This may not be possible, but I thought I'd give it a try. I have a job that processes data and makes one of 3 decisions for each item it processes: keep, discard, or modify/reprocess (when it is unsure whether to keep or discard). This generates a very large amount of data, because reprocessing may break an item into many different parts.
My initial approach was to send everything to the ExecutorService that processes the data, but because the number of items was so large, I ran out of memory very quickly. Then I decided to offload the queue to a messaging server (RabbitMQ), which works fine, but now I'm bound by network IO. What I like about RabbitMQ is that it keeps messages in memory up to a certain level and then dumps old messages to the local drive, so if I have 8 GB of memory on my server I can still have a 100 GB message queue.
So my question is: is there a Java library with a similar feature? Something I can use as a non-blocking queue that keeps only X items in memory (either by item count or by size) and writes the rest to the local drive.
Note: right now I only need this on one server. In the future I might add more servers, but because each server generates its own data, I would take messages from one queue and push them to another if one server's queue is empty. The library would not need network access, but I would need to access the queue from another Java process. I know this is a long shot, but I thought that if anyone knew, it would be SO.
Not sure if this is the approach you are looking for, but why not use a lightweight database like HSQLDB and a persistence layer like Hibernate? You can keep your messages in memory, commit them to the DB to save them on disk, and later retrieve them with a convenient SQL query.
Actually, as Cuevas wrote, HSQLDB could be a solution. If you use the "cached table" type it provides, you can cap the amount of memory used; data beyond that is kept on the hard drive (a small sketch follows).
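A small sketch of the idea (the file path, table layout, and cache property are illustrative; check the HSQLDB documentation for the exact tuning properties of your version):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class DiskBackedQueueDemo {
        public static void main(String[] args) throws Exception {
            // hsqldb.cache_rows limits how many rows of CACHED tables stay in memory
            Connection conn = DriverManager.getConnection(
                    "jdbc:hsqldb:file:./queuedb;hsqldb.cache_rows=10000", "SA", "");

            try (PreparedStatement create = conn.prepareStatement(
                    "CREATE CACHED TABLE messages (id BIGINT IDENTITY, body VARBINARY(65536))")) {
                create.execute();
            }

            // Enqueue: rows beyond the cache limit live on disk, not on the Java heap.
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO messages (body) VALUES (?)")) {
                insert.setBytes(1, "some work item".getBytes());
                insert.executeUpdate();
            }
            conn.close();
        }
    }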
Use the filesystem. It's old-school, yet so many engineers get bitten by libraries because they are lazy. True, HSQLDB provides lots of value-add features, but in the context of staying lightweight...
I am developing a client-server application for financial alerts, where the client can set a value as an alert threshold for a chosen financial instrument, and when this value is reached the monitoring server will somehow alert the client (email, SMS... not important). The server will monitor updates that come from a data generator program. Now, the server has to be very efficient, as it has to handle many clients (possibly over 50,000-100,000 alerts, with updates coming every 1-2 seconds). I've written servers before, but never with such performance requirements, and I'm simply afraid that a basic approach (like before) just won't do. So how should I design the server? What kind of data structures are best suited? What about multithreading? In general, what should I do (and what should I not do) to squeeze every drop of performance out of it?
Thanks.
I've worked on servers like this before. They were all written in C (or fairly simple C++). But they were even higher performance -- handling 20K updates per second (all updates from most major stock exchanges).
We focused on not copying memory around, and we were very careful about which STL classes we used. As for updates, each financial instrument was an object, and any client that wanted to hear about an instrument would subscribe to it (i.e. get added to a list).
The server was multi-threaded, but not heavily so -- maybe a thread handling incoming updates, one handling outgoing client updates, one handling client subscribe/release notifications (I don't remember that part exactly -- just that it had fewer threads than I would have expected, but not just one).
EDIT: Oh, and before I forget, the number of financial transactions is growing at an exponential rate. That 20K/sec server was only just keeping up, and the architects were getting stressed about what to do the next year. I hear all major financial firms are facing similar problems.
You might want to look into using a proven message queue system, as it sounds like this is basically what you are doing in your application.
Projects like Apache ActiveMQ or RabbitMQ are already widely used and highly tuned, and should be able to support the kind of load you are talking about out of the box.
I would argue that squeezing every drop of performance out of it is not what you want to do, as you really never want that server to be under enough load to take it out of a real-time response scenario.
Instead, I would use a separate machine to handle messaging clients, and let that main, critical server focus directly on processing input data in "real time" to watch for alert criteria.
Best advice is to design your server so that it scales horizontally.
This means distributing your input events to one or more servers (on the same or different machines), that individually decide whether they need to handle a particular message.
Will you be supporting 50,000 clients on day 1? Then that should be your focus: how easily can you define a single client's needs, and how many clients can you support on a single server?
Second-best advice is not to artificially constrain yourself. If you say "we can't afford to have more than one machine," then you've already set yourself up for failure.
Beware of any architecture that needs clustered application servers to get a reasonable degree of performance. The London Stock Exchange had just such a problem recently when they pulled an existing Tandem-based system and replaced it with clustered .NET servers.
You will have a lot of trouble getting this kind of performance from a single Java or .NET server - you really need to consider C or C++. A clustered architecture is much more error-prone to build and deploy, and it is harder to guarantee uptime for.
For really high volumes you need to think in terms of asynchronous I/O for networking (i.e. poll(), select() and asynchronous writes, or their Windows equivalents), possibly with a pool of worker threads. Read up on the C10K problem for more insight into this.
There is a very mature C++ framework called ACE (Adaptive Communication Environment) which was designed for high-volume server applications in telecommunications. It may be a good foundation for your product - it supports quite a variety of concurrency models and handles most of the nuts and bolts of synchronisation within the framework. You might find that the time spent learning how to drive this framework pays you back in reduced development time and easier implementation and testing.
One thread receives the instrument updates, processes each update, and puts it into a BlockingQueue.
One thread takes updates from the BlockingQueue and hands each one off to the process that handles that instrument, or set of instruments. This process needs to serialize the events for an instrument so the customer does not receive notices out of order.
This process (thread) then needs to iterate through the list of customers registered for notification and build the list of customers who should be notified based on their criteria. It should then hand that list off to another process that notifies the customers of the change.
The notification process should iterate through the list and send each notification event to another process that handles how the customer wants to be notified (email, etc.). A rough sketch of the first two stages follows below.
One of the problems will be that, with 100,000 customers, synchronizing access to the list of customers and their monitored criteria becomes expensive.
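A rough sketch of the receive/dispatch stages (class names and the Update type are illustrative, not a full design):

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.concurrent.LinkedBlockingQueue;

    public class AlertPipeline {
        public static class Update {
            final String instrument;
            final double value;
            Update(String instrument, double value) { this.instrument = instrument; this.value = value; }
        }

        public interface Subscriber { void onUpdate(Update update); }  // e.g. queues an email/SMS job

        private final BlockingQueue<Update> updates = new LinkedBlockingQueue<>();
        private final Map<String, List<Subscriber>> subscribers = new ConcurrentHashMap<>();

        // Stage 1: the feed-receiving thread calls this for every incoming update.
        public void onFeedUpdate(Update update) {
            updates.offer(update);
        }

        public void subscribe(String instrument, Subscriber subscriber) {
            subscribers.computeIfAbsent(instrument, k -> new CopyOnWriteArrayList<>()).add(subscriber);
        }

        // Stage 2: a single dispatcher thread drains the queue, preserving per-instrument order.
        public void dispatchLoop() throws InterruptedException {
            while (true) {
                Update update = updates.take();
                List<Subscriber> subs = subscribers.get(update.instrument);
                if (subs != null) {
                    for (Subscriber s : subs) {
                        s.onUpdate(update);   // hand off to the notification stage
                    }
                }
            }
        }
    }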
You should try to find a way to organize the alerts as a tree, so you can quickly decide which alerts can be triggered by a given update.
For example, let's assume the alert is on the level of a certain indicator, and the indicator has a range of 0..n. I would group the clients who want to be notified about that indicator's level into a sort of binary tree. That way you can scale it properly (you can actually implement a subtree as a process on a different machine), and the number of comparisons required to find the proper subset of clients will always be logarithmic (see the sketch below).
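An illustrative sketch using a sorted map keyed by threshold (the names here are made up): when a new value arrives, only the alerts whose thresholds were just crossed are found, in O(log n) plus the number of matches:

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class ThresholdIndex {
        // threshold -> clients who want an alert when the value reaches that threshold
        private final TreeMap<Double, List<String>> alertsAbove = new TreeMap<>();

        public void addAlert(double threshold, String clientId) {
            alertsAbove.computeIfAbsent(threshold, k -> new ArrayList<>()).add(clientId);
        }

        // Called on each update: returns every client whose threshold lies in (previousValue, currentValue].
        public List<String> triggeredBy(double previousValue, double currentValue) {
            List<String> triggered = new ArrayList<>();
            if (currentValue > previousValue) {
                Map<Double, List<String>> crossed =
                        alertsAbove.subMap(previousValue, false, currentValue, true);
                for (Collection<String> clients : crossed.values()) {
                    triggered.addAll(clients);
                }
            }
            return triggered;
        }
    }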
The Apache MINA network application framework, together with Apache Camel for message routing, is probably a good starting point. The Kilim message-passing framework also looks very promising.