Q1) Does anyone familiar with the Java Disruptor pattern know the size of the messages they benchmarked their results against? I am writing a similar system (out of pure interest), and when I read the description of their testing there is no mention of the size of the messages sent.
http://code.google.com/p/disruptor/wiki/PerformanceResults
Q2) Is the Disruptor for computer-to-computer communication, or inter-process? I originally had the impression it was for computer-to-computer, but their work is labelled an "inter-thread" messaging library.
The Disruptor is not just within the same machine; it is within a single process. When they say "inter-thread", they mean that it is for sending messages between threads of one process.
The message size is actually almost irrelevant because the messages don't get copied. All the event objects are pre-allocated up front when the ring buffer is created and are reused, so it doesn't really matter how big they are.
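A minimal sketch of what that pre-allocation looks like, assuming the LMAX Disruptor 3.x DSL (the API has shifted between versions, so treat the exact names as illustrative): the factory fills every ring slot with an event up front, and a publisher only claims a sequence and mutates the existing object.

    import com.lmax.disruptor.RingBuffer;
    import com.lmax.disruptor.dsl.Disruptor;
    import java.util.concurrent.ThreadFactory;

    public class PreallocationDemo {
        // Mutable holder; one instance per ring slot, created exactly once.
        static final class ValueEvent {
            long value;
        }

        public static void main(String[] args) {
            Disruptor<ValueEvent> disruptor = new Disruptor<>(
                    ValueEvent::new,           // factory fills every slot up front
                    1024,                      // ring size, must be a power of two
                    (ThreadFactory) Thread::new);
            disruptor.handleEventsWith(
                    (event, seq, endOfBatch) -> System.out.println(event.value));
            RingBuffer<ValueEvent> ring = disruptor.start();

            long seq = ring.next();            // claim the next slot
            try {
                ring.get(seq).value = 42L;     // mutate the pre-allocated event, no copy
            } finally {
                ring.publish(seq);             // make it visible to consumers
            }
        }
    }

Whether the event holds a single long or a much larger struct-like object, the publish path is the same: no allocation, no copy.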
Although I'm not entirely familiar with it and am just exploring it...
1) Judging from the perf-test folder in the source, they are using the ValueEvent class, which just holds a long. There are also some other xxxEvent classes used in other perf tests that are slightly bigger, but from what I can gather so far only a long is used within the ring buffer.
2) I would assume it is strictly for same-machine, inter-thread comms. The latency and uncertainty of communication across machines would make it extremely slow (relatively speaking), and the project would then also need to deal with socket comms, which I haven't seen in this library.
1) The Disruptor does not care about the size of the message, but throughput should scale down roughly linearly with message size (as the per-message workload increases, the rate decreases).
Indeed, it does not care about the message itself.
The KEY of the library is the ID of the buffer slot; pointer, position, cursor, indicator all mean the same thing here.
The Disruptor itself calls it a "sequence".
Once you have claimed that ID, the slot is owned by you alone :) so there is ONLY one writer per slot. That is the real key point. :)
2) Not computer-to-computer, nor process-to-process :), just thread-to-thread. Peter Lawrey has a great library, Java-Chronicle, which can be used in the inter-process case. There is a new article on Java DZone: http://java.dzone.com/articles/ultra-fast-reliable-messaging
3) The core pattern should be capable of being cloned for cross-boundary use cases; everything is an ID. As for the message itself, that is yours to customize.
4) Another important point is the caching behaviour of volatile fields (false sharing); there is a great example on GitHub.
5) JDK 8 introduces a new annotation, @Contended, which looks attractive for padding out contended fields; see the details about @Contended.
This is a general programming question. Let's say I have a thread running a specific simulation, where speed is quite important. At every iteration I want to extract data from it and write it to a file.
Is it better practice to hand the data over to a different thread and let the simulation thread focus on its job, or, since speed is very important, to have the simulation thread do the data recording itself without any copying of data? (In my case it is 3-5 deques of integers with a size of 1000-10000.)
Firstly it surely depends on how much data we are copying, but what else can it depend on? Can the cost of synchronization and copying ever be worth it? Is it good practice to create small runnables at each iteration to handle the recording task when there are 50 or more iterations per second?
If you truly want low latency on this stat capturing, and you want it during the simulation itself then two techniques come to mind. They can be used together very effectively. Please note that these two approaches are fairly far from the standard Java trodden path, so measure first and confirm that you need these techniques before abusing them; they can be difficult to implement correctly.
The fastest way to write the data to a file during a simulation, without slowing down the simulation, is to hand the work off to another thread. However, care has to be taken in how the hand-off occurs, as a memory barrier in the simulation thread will slow the simulation. Given that the writer only cares that the values will arrive eventually, I would consider using the memory barrier that sits behind AtomicLong.lazySet: it requests a thread-safe write to a memory address without blocking until the write actually becomes visible to the other thread. Unfortunately, direct access to this memory barrier is currently only available via lazySet or via the class sun.misc.Unsafe, which obviously is not part of the public Java API. However, that should not be too large a hurdle, as it is present on all current JVM implementations and Doug Lea has talked about moving parts of it into the mainstream.
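For illustration, a minimal sketch of that hand-off using AtomicLong.lazySet (the class and method names here are mine, not from any particular library):

    import java.util.concurrent.atomic.AtomicLong;

    final class StatCell {
        private final AtomicLong value = new AtomicLong();

        // Simulation thread: an ordered store, cheaper than a full volatile
        // write because it does not wait for the store to become globally
        // visible before continuing.
        void record(long v) {
            value.lazySet(v);
        }

        // Writer thread: a volatile read; the latest value shows up
        // "eventually", which is fine for stat capture.
        long read() {
            return value.get();
        }
    }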
To avoid the slow, blocking file I/O that Java normally uses, make use of a memory-mapped file. This lets the OS perform asynchronous I/O on your behalf, and is very efficient. It also supports the use of the same memory barrier mentioned above.
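A sketch of the memory-mapped approach using the standard java.nio API (the file name and region size are arbitrary):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    final class MappedStatsFile {
        private final MappedByteBuffer buf;

        MappedStatsFile(String path, int size) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile(path, "rw");
                 FileChannel ch = raf.getChannel()) {
                // Map the region up front; the mapping stays valid after the
                // channel is closed, and the OS flushes dirty pages to disk
                // asynchronously.
                buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
            }
        }

        // Writes go to the page cache; no blocking file I/O on this path.
        void append(long stat) {
            buf.putLong(stat);
        }
    }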
For examples of both techniques, I strongly recommend reading the source code to HFT Chronicle by Peter Lawrey. In fact, HFT Chronicle may be just the library for you to use here. It offers a highly efficient and simple to use disk backed queue that can sustain a million or so messages per second.
In my work on a stress-testing HTTP client, I stored the stats in an array and, when the array was ready to be sent to the GUI, I would create a new array for the tester client and hand the full array off to the network layer. This means that you don't need to pay for any copying, just for the allocation of a fresh array (an ultra-fast operation on the JVM, involving hand-coded assembler macros to utilize the best SIMD instructions available for the task).
I would also suggest not throwing yourself head-on into the realm of optimal memory-barrier usage; the difference between a plain volatile write and an AtomicReference.lazySet() is only measurable if your thread does almost nothing else but exercise the memory barrier (at least millions of writes per second). Depending on your target I/O throughput, you may not even need NIO to meet the goal. Better to try first with simple, easily maintainable code than to dig elbows-deep into highly specialized APIs without a confirmed need for that.
Let's say I am reading a single incoming stream with millions of transactions per ms; it is so fast that I can't afford a GC pause or the entire system will hang.
The functionality is very simple: it is merely to record every single packet that goes past the NIC (hypothetically).
Is it even possible?
Are there design patterns for such an implementation? I only know the flyweight and resource-pool design patterns.
Do I really need to code this in C so that I can manage it?
1) I can have a reasonable amount of RAM, but nothing ridiculous like 100 GB (maybe 16 GB).
2) CPU processing is not an issue.
FAQ:
Must it be Java? No; please recommend another language that supports most platforms (Linux, AIX, Windows).
If you really want to handle everything passing through your network card, Java is the wrong language; look into C, possibly C++, or assembler.
As you have been told, a million transactions per millisecond seems unrealistic, and is only achievable if you can split the work between multiple (read: many, many, many) computers.
There are many garbage collectors out there; do some searching to see whether any of them suits your needs.
If you really don't want the garbage collector to kick in, I think your only option is: don't create garbage. Initialize an array of bytes as your memory to work in, and only use primitives, no objects. It will be cumbersome, but it might be fast, and I have been told this is the kind of thing people working on real-time systems do.
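As a hedged illustration of that style (all names here are mine): a single-writer ring of fixed-size packet slots backed by one pre-allocated byte array. Nothing is allocated on the hot path, so there is no garbage to collect.

    final class PacketRing {
        private static final int SLOT = 2048;    // max bytes kept per packet
        private final byte[] slots;              // one big pre-allocated block
        private final int mask;
        private long writeSeq;                   // single writer thread only

        PacketRing(int slotCountPow2) {
            slots = new byte[slotCountPow2 * SLOT];
            mask = slotCountPow2 - 1;
        }

        // Copy a packet into the next slot; src is a reused capture buffer.
        // Older packets are overwritten once the ring wraps around.
        void write(byte[] src, int len) {
            int base = (int) (writeSeq & mask) * SLOT;
            System.arraycopy(src, 0, slots, base, Math.min(len, SLOT));
            writeSeq++;
        }
    }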
Assuming you meant millions of transactions per second, not per ms, you can use Chronicle, which promises up to 5-20 million transactions per second, persisted.
I hope that millions of transactions per millisecond is a joke or hyperbole. No single computer can handle that much, particularly if a mere 100 GB counts as a ridiculous amount of RAM.
But ignoring the actual number of expected transactions, what is needed for this type of task is real-time Java, provided that you want to stick with Java.
Either way you'll probably need a real-time OS first, because your application isn't going to be the only thing running on the computer and if the OS decides not to give your application control when it needs it, there's nothing you can do.
Update: If you only want to capture network traffic, don't reinvent the wheel; use libpcap. AFAIK there's even a Java wrapper for it.
Okay, for my application I'm trying to decide on an architecture that's as DDoS-resistant as possible. Obviously it will never be perfect, but I'd like protection against simple attacks.
There's a few that I've thought of so far:
1) Single thread per connection.
This method seems to have unbelievable scalability problems: with a tonne of connections, that many threads would be a scheduling nightmare for the OS.
2) Two threads: the first thread accepts connections and appends them to a list; the second thread loops through the list (with the proper synchronization) and checks whether there is anything in each InputStream. Upon finding something, it reads a line. Any of the actual work, including the reply, is done in a new event thread, which is just passed the line that was read.
This method seems to have even bigger problems. It appears as though a simple cat /dev/urandom | telnet server port would lock it down.
3) This is similar to #2, but it only reads a single byte from each connection at each iteration, processing the accumulated bytes as a string when a newline byte arrives.
This seems like my best option so far, but it means that if the attacker initiates a lot of connections and sends input on all of them, it could slow the loop down considerably.
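A hypothetical sketch of that loop (the Conn type and the dispatch step are illustrative, not part of the question): one byte per connection per pass, with a cap on the line buffer so a flooding client cannot exhaust memory.

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.Socket;
    import java.util.List;

    final class BytePoller {
        // Per-connection state: the socket plus a partial-line buffer.
        static final class Conn {
            final Socket socket;
            final StringBuilder line = new StringBuilder();
            Conn(Socket s) { socket = s; }
        }

        // One pass over all connections, reading at most one byte from each,
        // so no single connection can monopolise the loop.
        static void poll(List<Conn> conns) throws IOException {
            for (Conn c : conns) {
                InputStream in = c.socket.getInputStream();
                if (in.available() == 0) continue;    // nothing buffered, move on
                int b = in.read();
                if (b == '\n') {
                    String message = c.line.toString();
                    c.line.setLength(0);
                    // hand 'message' off to a worker thread here
                } else if (b >= 0 && c.line.length() < 8192) {
                    c.line.append((char) b);          // cap bounds memory per conn
                }
            }
        }
    }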
Are there any other potential architectures that might be better suited for the job?
Large companies have entire teams that spend all day, every day, working to combat various DoS attacks. It's too much to discuss here. After trivial mitigation techniques (like SYN cookies, etc.), your best bet is simply to have sufficient capacity to "eat it".
I would recommend writing your code for efficiency and then running it on a hosted service like Google's or Amazon's, and letting them deal with fending off DoS attacks and scaling your service to handle such spikes.
I want to implement a CoreLocal map, which works just like ThreadLocal, only it returns a value that is specific to the core the current thread is running on.
The reason for this is that I want to write code that takes a job from a queue, giving priority to jobs whose associated data is likely already in the L1 cache of the core the picking thread is running on. So, instead of one job queue for the entire program, I want a queue per core, and only when its own queue is empty will a worker thread go looking at the queues of other cores.
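Roughly, the usage I have in mind looks like this hypothetical sketch (all names are mine; how currentCpu is implemented is exactly what I'm asking about):

    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.function.IntSupplier;

    final class CoreLocalQueues<T> {
        private final ConcurrentLinkedQueue<T>[] queues;
        private final IntSupplier currentCpu;   // the missing piece

        @SuppressWarnings("unchecked")
        CoreLocalQueues(int cores, IntSupplier currentCpu) {
            queues = new ConcurrentLinkedQueue[cores];
            for (int i = 0; i < cores; i++) {
                queues[i] = new ConcurrentLinkedQueue<>();
            }
            this.currentCpu = currentCpu;
        }

        void submit(T job) {
            queues[currentCpu.getAsInt() % queues.length].add(job);
        }

        T take() {
            int home = currentCpu.getAsInt() % queues.length;
            T job = queues[home].poll();          // cache-warm jobs first
            for (int i = 1; job == null && i < queues.length; i++) {
                job = queues[(home + i) % queues.length].poll();  // then steal
            }
            return job;
        }
    }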
I don't think there is any call to get the current CPU currently exposed in the JDK, although it certainly has been previously discussed [1] and proposed as a JDK enhancement.
I think until something like that gets implemented your best bet is to use something like JNA (easiest) or JNI (fast) to wrap a native system call like getcpu on Linux or GetCurrentProcessorNumber on Windows.
At least on Linux, getcpu is implemented in the VDSO without a kernel transition, so it should only take a few nanoseconds, plus a few more nanoseconds for the JNI call. JNA is slower.
If you really need speed, you could always add the function as an intrinsic to a bespoke JVM (since OpenJDK is open source). That would shave off several more nanoseconds.
Keep in mind that this information can be out of date as soon as you get it, so you should never rely on it for correctness, only for performance. Since you already need to handle getting the "wrong" value, another possible approach is to store a cached value of the CPU ID in a ThreadLocal and only update it periodically; see the sketch below. This makes slow approaches such as parsing the /proc filesystem viable, since you do them only infrequently. For maximum speed, you can invalidate the thread-local periodically from a timer thread rather than checking the invalidation condition on each call.
[1] Both the discussion and the enhancement request are highly recommended reading.
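A sketch of the ThreadLocal-cached approach, assuming JNA and glibc's sched_getcpu (Linux only; the refresh interval is an arbitrary staleness budget):

    import com.sun.jna.Library;
    import com.sun.jna.Native;

    final class CoreId {
        // glibc wrapper around getcpu(2), resolved through JNA; Linux only.
        private interface CLib extends Library {
            CLib INSTANCE = Native.load("c", CLib.class);
            int sched_getcpu();
        }

        private static final long REFRESH_NANOS = 1_000_000; // ~1 ms staleness budget

        private static final class Cached {
            int cpu;
            long stamp;
        }

        private static final ThreadLocal<Cached> CACHE =
                ThreadLocal.withInitial(Cached::new);

        // Returns a possibly stale CPU id; treat it strictly as a hint.
        static int currentCpu() {
            Cached c = CACHE.get();
            long now = System.nanoTime();
            if (now - c.stamp > REFRESH_NANOS) {
                c.cpu = CLib.INSTANCE.sched_getcpu();
                c.stamp = now;
            }
            return c.cpu;
        }
    }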
There's a related Linux question with no satisfactory answer (parsing top output doesn't count, and the accepted answer doesn't work anymore). I thought that
/proc/<pid>/task/<tid>/sched
might give this information in a line like
current_node=0, numa_group_id=0
but on my i5-2400 running a 4.4.0-92-generic kernel, this line is always the same for all threads. I guess "node" means a whole CPU (socket), and I have only one.
I could find no documentation on this, or perhaps missed it in this document.
However, I'm afraid that obtaining this information is unlikely to help you:
Reading from the proc filesystem may be too costly at the scale you're working on.
Unlike ThreadLocal, your CoreLocal is not thread-safe: migrating a thread to another core could spoil even trivial non-atomic operations like someCoreLocalField++. Suspending the thread would do so, too. So you'd need some atomics or thread-locals to get it working correctly, which again may be far too slow for what you want.
You could probably check /proc/[pid]/status.
These fields may be helpful:
Cpus_allowed: Mask of CPUs on which this process may run
Cpus_allowed_list: Same as previous, but in "list format"
Our company is running a Java application (on a single-CPU Windows server) that reads data from a TCP/IP socket, checks it against specific criteria (using regular expressions) and, if a match is found, stores the data in a MySQL database. The data is huge and is read at a rate of 800 records/second, and about 70% of the records will match, so there are a lot of database writes involved. The program uses a LinkedBlockingQueue to handle the data: a producer class just reads each record and puts it into the queue, and a consumer class removes records from the queue and processes them.
So the question is: will it help if I use multiple consumer threads instead of a single thread? Is threading really helpful in the above scenario (since I am using a single CPU)? I am looking for suggestions on how to speed things up (without changing hardware).
Any suggestions would be really appreciated. Thanks.
Simple: Try it and see.
This is one of those questions where you can argue several points on either side. But it sounds like you already have most of the infrastructure set up. Just create another consumer thread and see if that helps.
But the first questions you need to ask yourself are:
What is better?
How do you measure better?
Answer those two questions, then try it.
Can the single thread keep up with the incoming data? Can the database keep up with the outgoing data?
In other words, where is the bottleneck? If you need to go multi-threaded, then look into the Executor concept in the concurrency utilities (there are plenty to choose from in the Executors helper class), as it will handle all the tedious threading details that you are not particularly interested in doing yourself.
My personal gut feeling is that the bottleneck is the database. Here indexing and RAM help a lot, but that is a different question.
It is very likely multi-threading will help, but it is easy to test. Make it a configurable parameter. Find out how many you can do per second with 1 thread, 2 threads, 4 threads, 8 threads, etc.
First of all:
It is wise to build your application using the Java 5 concurrency API.
If your application is created around an ExecutorService, it is fairly easy to change the number of threads used. For example, you could create a thread pool where the number of threads is specified by configuration, so if you ever want to change the number of threads you only have to change a property.
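A minimal sketch of that setup (the property name and the processing step are illustrative): the pool size comes from configuration, and each worker blocks on the shared queue.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;

    final class ConsumerPool {
        static final BlockingQueue<String> QUEUE = new LinkedBlockingQueue<>();

        public static void main(String[] args) {
            // Thread count comes from a property, so tuning needs no code change.
            int threads = Integer.getInteger("consumer.threads", 1);
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int i = 0; i < threads; i++) {
                pool.execute(() -> {
                    try {
                        while (!Thread.currentThread().isInterrupted()) {
                            String record = QUEUE.take(); // blocks until data arrives
                            process(record);              // regex match + DB insert
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt(); // restore flag and exit
                    }
                });
            }
        }

        static void process(String record) {
            // match against the criteria and store in the database
        }
    }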
About your question:
- About the reading of your socket: as far as I know it is not useful (if possible at all) to have two threads read data from one socket. Just use one thread to read the socket, but do as little as possible in that thread (for example: read socket, put data in queue, read socket, and so on).
- About the consuming of the queue: it is wise to construct this part as pointed out above; that way it is easy to change the number of consuming threads.
- Note: you cannot really predict what is better; there might be another part that is the bottleneck, et cetera. Only monitoring/profiling gives you a real view of your situation. But if your application is constructed as above, it is really easy to test with different numbers of threads.
So in short:
- Producer part: one thread that only reads from the socket and puts the data in the queue
- Consumer part: built around the ExecutorService, so it is easy to adapt the number of consuming threads
Then use profiling to find the bottlenecks, and use A/B testing to determine the optimal number of consuming threads for your system.
As an update on my earlier question:
We ran some comparison tests between a single consumer thread and multiple threads (adding 5, 10, 15 and so on), monitoring the queue size of yet-to-be-processed records. The difference was minimal and, what's more, the queue size grew slightly once the number of threads crossed 25 (compared to running 5 threads). This leads me to the conclusion that the overhead of maintaining the threads outweighed the processing benefits. Maybe this is particular to our scenario, but I am just mentioning my observations.
And of course (as pointed out by others) the bottleneck is the database. That was handled by using multi-row INSERT statements in MySQL instead of single inserts. If we had not had that to start with, we could not have handled this load.
End result: I am still not convinced that multi-threading will improve processing time. Maybe it has other benefits, but I am looking only at the processing-time factor. If any of you have experience to the contrary, do let us hear about it.
And again, thanks for all your input.
In your scenario, where (a) the processing is minimal, (b) there is only one CPU, and (c) data goes straight into the database, it is not very likely that adding more threads will help. In other words, the front-end and back-end threads are I/O-bound, with minimal processing in the middle. That's why you don't see much improvement.
What you can do is try a three-stage design: the first stage is a single thread pulling data from the socket; the second is a thread pool that does the processing; the third is a single thread that serves the DB output. This may produce better CPU utilization if the input rate varies, at the expense of temporary growth of the output queue. If not, the throughput will be limited by how fast you can write to the database, no matter how many threads you have, and then you can get away with just a single read-process-write thread.
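An illustrative sketch of that three-stage shape (the pattern, queue sizes and thread count are placeholders); bounded queues provide back-pressure when the database falls behind.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;

    final class Pipeline {
        private static final String PATTERN = ".*ERROR.*"; // placeholder regex
        private static final int WORKERS = 4;               // tune by measuring

        public static void main(String[] args) {
            // Bounded queues block the earlier stages when the DB stage lags,
            // instead of letting memory grow without limit.
            BlockingQueue<String> raw = new LinkedBlockingQueue<>(10_000);
            BlockingQueue<String> matched = new LinkedBlockingQueue<>(10_000);

            // Stage 1: a single thread reads the socket and enqueues records.
            new Thread(() -> { /* read socket, raw.put(record) */ }).start();

            // Stage 2: a pool applies the regex and forwards the matches.
            ExecutorService workers = Executors.newFixedThreadPool(WORKERS);
            for (int i = 0; i < WORKERS; i++) {
                workers.execute(() -> {
                    try {
                        while (true) {
                            String rec = raw.take();
                            if (rec.matches(PATTERN)) {
                                matched.put(rec);
                            }
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }

            // Stage 3: a single thread drains 'matched' and batches
            // multi-row INSERTs, keeping DB access single-threaded.
            new Thread(() -> { /* drain matched, batch INSERT */ }).start();
        }
    }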