Java/Scala streaming timeseries data

I've got an underlying time-series database based on HBase (OpenTSDB); on top of it run my applications, which load data from HBase. I need to stream the fetched time series to application clients. What is the best solution for this? If a client pauses processing of already-received data, the server should also pause; if a client dies, the server should stop sending data.
I would like a pure Java solution. I know about ZeroMQ, but I didn't have a very good experience with it. Maybe I should take a look at Netty?
P.S. The amount of data is large: gigabytes to tens of gigabytes per request to the server.
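For reference, the pause/resume requirement maps fairly directly onto Netty's channel writability. A minimal sketch of the idea, assuming Netty 4.x (TimeSeriesSource is a made-up chunk supplier for illustration, not a real API):

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Hypothetical supplier of serialized time-series chunks (e.g. from OpenTSDB).
interface TimeSeriesSource {
    ByteBuf next(); // null when the series is exhausted
}

public class TimeSeriesStreamHandler extends ChannelInboundHandlerAdapter {
    private final TimeSeriesSource source;

    public TimeSeriesStreamHandler(TimeSeriesSource source) {
        this.source = source;
    }

    @Override
    public void channelActive(ChannelHandlerContext ctx) {
        pump(ctx); // start streaming once the client connects
    }

    @Override
    public void channelWritabilityChanged(ChannelHandlerContext ctx) {
        if (ctx.channel().isWritable()) {
            pump(ctx); // the client drained its buffer -- resume
        }
    }

    private void pump(ChannelHandlerContext ctx) {
        // Write until the outbound buffer fills; Netty fires
        // channelWritabilityChanged() when it is safe to continue.
        while (ctx.channel().isActive() && ctx.channel().isWritable()) {
            ByteBuf chunk = source.next();
            if (chunk == null) {
                ctx.close(); // end of series
                return;
            }
            ctx.writeAndFlush(chunk);
        }
    }
}
```

A slow client stops reading, its TCP window fills, the channel becomes unwritable, and the server pauses automatically; a dead client closes the channel, which ends the stream.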

Related

Caching and uploading to a server at a periodic interval

I need to upload data at a periodic interval to a server, where it gets stored in SQL (after some processing in the business layer).
Let's say every 15 minutes I have 20,000 JSON objects of 1 KB each. What's the best way to implement this?
I thought of writing them to a text file, then zipping and uploading it to the server. But there might be better technologies now, like Ehcache. How should I decide? Is it better to use one of these open-source caching tools?
There might be tens to hundreds of clients, each sending messages to the server as described above.
MongoDB sounds suitable for your task, because you are talking about caching JSON documents.
You can cache the documents in a collection as they come in, then sync with a persistent store later on.
Compared to other persistent caches like Ehcache, it offers greater scalability and reliability outside your typical application container, yet still lets you dump records in a quick and easy way.
http://www.tutorialspoint.com/mongodb/mongodb_advantages.htm
http://www.mongodb.com/blog/post/why-mongodb-popular
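To make that concrete, here is a minimal sketch with the MongoDB Java driver (the database, collection, and field names are made up for illustration; assumes a mongod running locally):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import java.util.ArrayList;
import java.util.List;

public class JsonStagingSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> staging =
                    client.getDatabase("staging").getCollection("incoming");

            // Cache incoming JSON documents as they arrive.
            staging.insertOne(Document.parse("{\"clientId\": 42, \"payload\": \"...\"}"));

            // Periodic job: drain the staging collection into the SQL store.
            List<Document> batch = staging.find().into(new ArrayList<>());
            // ...process batch in the business layer, write to SQL...
            staging.deleteMany(new Document()); // clear what was drained
        }
    }
}
```

Note the drain-then-delete step is not atomic as written; in production you would mark or batch documents so nothing inserted between the read and the delete gets lost.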

How is a Java HTTP server scalable, or how can I make it scalable?

Hello, I am a student just learning to use Netty and MySQL.
I am building a server for my Android and iOS application. I built my server based on the Netty 4.0.6 example HttpUploadServer.
The server's primary task is to send/receive and save images and audio files (about 1 MB in total). About 10,000 requests will be sent daily.
One of my advisors said that two things deserve the most thought when developing a server:
Scaling up and out
High availability
However, as I am just learning server programming, I have no idea how to achieve them. The only thing I can think of to increase scalability and availability is something like Amazon's Elastic Load Balancer.
I know this is a very broad question, but please point me in the right direction.
How can I increase scalability and availability using Java (especially Netty)?
Scaling up can be achieved through many techniques:
Having multiple instances: aka Elastic Load Balancers
Sharding: server 1 handles requests for users A-M, server 2 handles requests for users N-Z
Add caching: Are you servicing the same request multiple times? Throw some memory at the problem and keep serving the same answer (see the sketch below)
Simplify your workload!
The really important question you need to answer is: what is limiting your ability to serve N+1 clients? Are you running out of sockets, memory, CPU time, DB transactions?
Like any profiling problem, work out what your dominant problem is and solve it.
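To make the caching point concrete, here is a minimal sketch of memoizing repeated responses with a ConcurrentHashMap (the names and the placeholder loader are illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ResponseCache {
    private final ConcurrentMap<String, byte[]> cache = new ConcurrentHashMap<>();

    // Returns a cached response, computing it only on the first request.
    public byte[] respond(String requestKey) {
        return cache.computeIfAbsent(requestKey, this::loadExpensively);
    }

    // Stand-in for whatever work the server repeats: a DB query,
    // reading an image from disk, transcoding audio, etc.
    private byte[] loadExpensively(String key) {
        return ("payload for " + key).getBytes();
    }
}
```

A real cache also needs an eviction policy (by size or TTL), for which a purpose-built cache such as Guava's or Ehcache is a better fit than a bare map.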

Is there a Java local queue library I can use that keeps memory usage low by dumping to the hard drive?

This may not be possible, but I thought I'd give it a try. I have a job that processes data and makes one of three decisions for each item it processes: keep, discard, or modify/reprocess (when it is unsure whether to keep or discard). This generates a very large amount of data, because reprocessing may break an item into many different parts.
My initial approach was to send the items to the ExecutorService that was processing the data, but because the number of items was large, I would run out of memory very quickly. Then I decided to offload the queue to a messaging server (RabbitMQ), which works fine, but now I'm bound by network IO. What I like about RabbitMQ is that it keeps messages in memory up to a certain level and then dumps the older ones to the local drive, so with 8 GB of memory on my server I can still have a 100 GB message queue.
So my question is: is there any Java library with a similar feature? Something I can use as a non-blocking queue that keeps only X items in memory (either by item count or by size) and writes the rest to the local drive.
Note: right now I'm only asking for this to be used on one server. In the future I might add more servers, but because each server generates its own data, I would take messages from one queue and push them to another if one server's queue were empty. The library would not need network access, but I would need to access the queue from another Java process. I know this is a long shot, but I figured if anyone knew, it would be SO.
Not sure if it is the approach you are looking for, but why not use a lightweight database like HSQLDB and a persistence layer like Hibernate? You can keep your messages in memory, commit them to the database to save them on disk, and later query them with a convenient SQL query.
Actually, as Cuevas wrote, HSQLDB could be a solution. If you use the "cached table" type it provides, you can specify the maximum amount of memory used; data exceeding that limit is written to the hard drive.
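A minimal sketch of that idea as a disk-backed queue (the file path, table name, and cache size are illustrative; assumes the HSQLDB 2.x JDBC driver on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class DiskBackedQueueSketch {
    public static void main(String[] args) throws Exception {
        // File-mode HSQLDB; CACHED tables keep only part of their rows in
        // memory and spill the rest to disk. hsqldb.cache_rows caps the
        // number of rows held in memory.
        try (Connection c = DriverManager.getConnection(
                "jdbc:hsqldb:file:/tmp/workqueue;hsqldb.cache_rows=50000",
                "SA", "")) {

            try (Statement s = c.createStatement()) {
                s.execute("CREATE CACHED TABLE queue ("
                        + "id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY, "
                        + "payload VARBINARY(65536))");
            }

            // Enqueue an item.
            try (PreparedStatement ps = c.prepareStatement(
                    "INSERT INTO queue (payload) VALUES (?)")) {
                ps.setBytes(1, new byte[]{1, 2, 3});
                ps.executeUpdate();
            }

            // Dequeue the oldest item.
            long id = -1;
            byte[] payload = null;
            try (Statement s = c.createStatement();
                 ResultSet rs = s.executeQuery(
                     "SELECT id, payload FROM queue ORDER BY id LIMIT 1")) {
                if (rs.next()) {
                    id = rs.getLong("id");
                    payload = rs.getBytes("payload");
                }
            }
            if (id >= 0) {
                try (PreparedStatement ps = c.prepareStatement(
                        "DELETE FROM queue WHERE id = ?")) {
                    ps.setLong(1, id);
                    ps.executeUpdate();
                }
                // ...process payload...
            }
        }
    }
}
```

Accessing the queue from a second Java process would require running HSQLDB in server mode rather than file mode, since a file-mode database can be opened by only one process at a time.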
Use the filesystem. It's old-school, yet so many engineers get bitten by libraries because they are lazy. True, HSQLDB provides lots of value-added features, but in the context of staying lightweight...

How to improve the performance of a stock data transfer application?

This is a question I have worked on for several years, but I still don't have a good solution.
My application has two parts:
The first runs on a server called the "ROOT server". It receives the real-time stock data from HKEx (the securities and futures exchange in Hong Kong) and broadcasts it to 5 child servers, appending a timestamp to each data item as it broadcasts.
The second part runs on the "child" servers. They receive the stock data from the ROOT server, parse each item, and extract the important information. Finally, they send it in a new text format to the clients. There may be hundreds to thousands of clients; each can register for certain stocks and receive real-time information about them.
Performance is the most important thing. Over the past several years I have tried every solution I know of to make it faster. "Faster" here means: the first part receives and forwards the data to the child servers as fast as it can, and the child servers receive, parse, and send the data to the clients as fast as they can.
For now, when the data speed from HKEx is 200K and there are 5 child servers, the first application has an average latency of 10 ms per data item. The second part is not easy to test; it depends on the client count.
What I'm using:
OpenSUSE 10
Sun Java 5.0
Mina 2.0
The server hardware:
4-core CPU (I don't know the type)
4 GB RAM
I'm considering how to improve the performance.
Do I need to use a concurrency framework such as Akka?
Should I try another language, e.g. Scala or C++?
Should I use a real-time Java system?
Your advice...
Need your help!
Update:
The applications log some important information for analysis, but I haven't found any bottlenecks. HKEx will provide more data next year, and I don't think my application will be fast enough.
One of my customers tested our application against another company's, and ours had no speed advantage. I just want to find a way to make it faster.
How the first application runs
The first application receives the stock data from HKEx and broadcasts it to several other servers. The steps are:
It connects to HKEx
logs in
reads the data. The data is in a binary format: each item has a header, a 2-byte integer giving the length of the body, followed by the body, then the next item (a plain-Java decoding sketch follows after these steps).
puts each item into an in-memory HashMap. The key is the item's sequence number; the value is the byte array.
logs the sequence number of each received item to disk, using log4j's buffered appender.
a daemon thread reads the data from the HashMap and inserts it into PostgreSQL every minute (this is just used to back up the data).
when clients connect to this server, it accepts them and tries to send all the data from the in-memory HashMap. I used a thread pool in MINA; the acceptor and the senders run in different threads.
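For reference, reading that length-prefixed format in plain Java is straightforward. A minimal sketch, assuming the 2-byte length is big-endian and unsigned (the question doesn't specify):

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class FrameReader {
    private final DataInputStream in;

    public FrameReader(InputStream raw) {
        this.in = new DataInputStream(raw);
    }

    // Reads the next [2-byte length][body] frame, or returns null
    // at a clean end of stream.
    public byte[] nextFrame() throws IOException {
        int length;
        try {
            length = in.readUnsignedShort(); // big-endian, unsigned
        } catch (EOFException eof) {
            return null;
        }
        byte[] body = new byte[length];
        in.readFully(body); // blocks until the whole body has arrived
        return body;
    }
}
```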
I think the logic is very simple, yet with 5 clients I measured a transfer speed of only 1.5 MB/s at most. I wrote the simplest possible socket program in Java and found it could reach 10 MB/s.
Actually, I've spent more than a year trying all kinds of solutions on this application just to make it faster. That's why I feel desperate. Do I need to try a language other than Java?
About the 10 ms latency
When the application receives a data item from HKEx, it records a timestamp for it. When the root server broadcasts the data to the child servers, it appends that timestamp to the data.
When a child server gets the data, it sends a message to the root server asking for the current timestamp, then compares the two.
So the 10 ms latency consists of:
root server got the data ---> the child server got the data
child server sends a request for the root server's timestamp ---> root server receives it
But the second interval is so small that we can ignore it.
The first thing to do when hunting performance bottlenecks is to find out where most of the time is spent. One way to determine this is to use a profiler.
There are open-source profilers available, such as Eclipse TPTP (http://www.eclipse.org/tptp/), and commercial profilers such as YourKit Java Profiler.
One easy win could be to upgrade the JVM to Java SE 6 or Java 7. General JVM performance improved a lot in version 6; see the Java SE 6 Performance White Paper for more details.
If you have checked everything and found no obvious performance optimizations, you may need to change the architecture to get better performance. This will obviously be most fruitful if you can at least identify where your application is spending its time. It sounds like there are several major components:
The HKEx server (out of your control)
The network between the Exchange and your system
The "root" server
The network between the "root" and the "child" servers
The "child" servers
The network between "child" servers and the client
The clients
To know where to spend your time, money, and energy, I'd at least want to see an analysis of those components: how long each component takes (min, max, avg) and what the specification of each resource is.
The easiest thing to change is hardware: bigger servers, more memory, or more bandwidth. Can you see whether any of those resources are constrained?
The next thing to look at is making the communication protocol more efficient. How do clients receive the stock data? Can you reduce the data size? 1.5 MB/s for only 5 clients sounds like a lot...
Next, you might look at some kind of quality-of-service solution: provide dedicated hardware for "premium" customers, with reduced resource contention, more servers, and more bandwidth. This will probably require changes to the architecture.
Next, you could consider changing the architecture. Right now your clients "pull" data from the child servers; you could instead "push" data out, shaving off the polling interval on the client end.
At the very end of the list, I'd consider a different technology stack. Java is a fine programming language, but if absolute performance is the key priority, C/C++ is still faster. Clearly, that's a huge change, and a well-written Java app will be faster than a poorly written C/C++ app (and far more stable).
To trace the source of the delay, I would add timing data to your end-to-end process. You can do this using an external log or by adding metadata to your messages.
What you want is a timestamp at key stages in your application; 3-5 stages are enough to start with. Normally I would use System.nanoTime() because I am looking for microsecond delays, but in your case System.currentTimeMillis() is likely to be enough, especially if you average over many samples (you will still get about 0.1 ms accuracy on an average, e.g. on Ubuntu).
Compare the timestamps for the same message as it passes through your system and look for the highest average delay. Once you have found it, try breaking that interval into more stages to zoom in on the problem.
In your situation, I would analyse any stage with an average delay of over 1 ms.
If clients update every minute, there might not be a good technical reason to chase this further, but you don't want to be seen as slow, or your traders to be at a disadvantage, even if in reality it makes no difference.
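A minimal sketch of that staging approach (the stage names are made up; in a real system the stamps would travel with the message or be logged against its sequence number):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StageTimer {
    // Insertion-ordered map: stage name -> nanosecond timestamp.
    private final Map<String, Long> stamps = new LinkedHashMap<>();

    public void stamp(String stage) {
        stamps.put(stage, System.nanoTime());
    }

    // Prints the delay between each pair of consecutive stages.
    public void report() {
        String prevStage = null;
        long prevTime = 0;
        for (Map.Entry<String, Long> e : stamps.entrySet()) {
            if (prevStage != null) {
                System.out.printf("%s -> %s: %.3f ms%n",
                        prevStage, e.getKey(), (e.getValue() - prevTime) / 1e6);
            }
            prevStage = e.getKey();
            prevTime = e.getValue();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        StageTimer t = new StageTimer();
        t.stamp("received-from-hkex");
        Thread.sleep(2); // stand-in for parsing work
        t.stamp("parsed");
        Thread.sleep(3); // stand-in for the broadcast to a child server
        t.stamp("sent-to-child");
        t.report(); // the largest interval is where to zoom in
    }
}
```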

What is the most efficient way to store analytics beacons?

Similar to how Google Analytics sends beacons from JavaScript to track events, what is the most efficient way to collect that beacon data and respond to the client in the shortest time?
For example, if I have a server-to-server beacon call, I want to make that call as fast as possible from the client's server.
PHP to flat files?
PHP to a local queue?
A Java server that logs to a queue and maintains a connection to the remote queue the whole time?
A custom C++ server?
This would be on the order of 1,000 requests per second.
There are two aspects to this.
1) The client's beacon call should complete as quickly as possible. This means the incoming HTTP request should respond 200 OK and exit as soon as possible, so it probably shouldn't do the actual data writing itself. It should hand that off to another process in the background, either via a background shell execution or via a queue/job mechanism like Gearman.
2) The data writing itself, done in the background away from the client's attention, has a little more time luxury. 1,000 writes per second should be fine for a well-tuned database with row locking on modern hardware, as long as it isn't being SELECTed from too heavily at the same instant. This could also be a good scenario for a key-value store as the immediate data store; a separate analysis/reporting process could then query the key-value store offline, process the data, and eventually copy it into a database.
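Taking the Java-server option from the question, here is a minimal sketch of aspect 1 using the JDK's built-in HttpServer (the path, port, and queue handling are illustrative): acknowledge immediately and let a background thread do the writing.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BeaconServer {
    private static final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws Exception {
        // Background writer: drains beacons and persists them at its own pace
        // (flat file, key-value store, or database -- whatever sits behind it).
        Thread writer = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    String beacon = queue.take();
                    // ...write beacon to storage here...
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        writer.setDaemon(true);
        writer.start();

        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/beacon", exchange -> {
            // Capture the payload, enqueue it, and answer 200 immediately.
            byte[] body = exchange.getRequestBody().readAllBytes();
            queue.offer(new String(body, StandardCharsets.UTF_8));
            byte[] ok = "OK".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, ok.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(ok);
            }
        });
        server.start();
    }
}
```

The unbounded queue trades memory for latency; at a sustained 1,000 requests per second you would want a bounded queue plus a policy for what to do when the writer falls behind.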
