Netstat for a single connection? - java

On Linux, is there any way to programmatically get stats for a single TCP connection? The stats I am looking for are the sort that are printed out by netstat -s, but for a single connection rather than in the aggregate across all connections. To give some examples: bytes in/out, retransmits, lost packets and so on.
I can run the code within the process that owns the socket, and it can be given the socket file descriptor. The code that sends/receives data is out of reach though, so for example there's no way to wrap recv()/send() to count bytes in/out.
I'll accept answers in any language, but C or Java are particularly relevant hence the tags.

The information nos refers to is available from C with:
#include <linux/tcp.h>
...
struct tcp_info info;
socklen_t optlen = sizeof(info); /* must hold the buffer size before the call */
getsockopt(sd, IPPROTO_TCP, TCP_INFO, &info, &optlen);
Unfortunately, as this is Linux specific, it is not exposed through the Java Socket API. If there is a way to obtain the raw file descriptor from the socket, you might be able to implement this as a native method.
I do not see a way to get to the descriptor. However, it might be possible with your own SocketImplFactory and SocketImpl.
It's probably worth noting that the TCP(7) manual page says this re TCP_INFO:
This option should not be used in code intended to be portable.

Most of the statistics you see with netstat -s are not tracked on a per-connection basis; only overall counters exist.
What you can do is pull the information out of /proc/net/tcp.
First, call readlink() on the socket's entry in /proc/self/fd (i.e. /proc/self/fd/<fd>). Parse the inode number from that symlink and match it against the line with the same inode number in /proc/net/tcp, which will contain some rudimentary info about that socket/connection. That file is not very well documented, though, so expect to spend some time on Google and reading the Linux kernel source code to interpret the fields.
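The inode-matching steps above can be sketched in Java roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes the numeric file descriptor is already known (the Java Socket API does not expose it), and it assumes the usual /proc/net/tcp layout in which the inode is the tenth whitespace-separated field. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.stream.Stream;

public class ProcNetTcp {
    /** Extracts the inode from a link target such as "socket:[12345]"; returns -1 if it is not a socket. */
    static long inodeFromLink(String target) {
        if (target.startsWith("socket:[") && target.endsWith("]")) {
            return Long.parseLong(target.substring("socket:[".length(), target.length() - 1));
        }
        return -1;
    }

    /** Returns the inode column of one /proc/net/tcp data row, or -1 for the header line. */
    static long inodeOfLine(String line) {
        String[] f = line.trim().split("\\s+");
        // Assumed columns: sl local rem st tx:rx tr:tm->when retrnsmt uid timeout inode ...
        if (f.length < 10 || !f[0].endsWith(":")) {
            return -1;
        }
        return Long.parseLong(f[9]);
    }

    /** Looks up the /proc/net/tcp row that belongs to the given descriptor of this process. */
    static Optional<String> findLine(int fd) throws IOException {
        String target = Files.readSymbolicLink(Paths.get("/proc/self/fd/" + fd)).toString();
        long inode = inodeFromLink(target);
        if (inode < 0) {
            return Optional.empty();
        }
        try (Stream<String> lines = Files.lines(Paths.get("/proc/net/tcp"))) {
            return lines.filter(l -> inodeOfLine(l) == inode).findFirst();
        }
    }
}
```

Note that IPv6 (including dual-stack) sockets are listed in /proc/net/tcp6 rather than /proc/net/tcp, so a real lookup would probably need to check both files.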

Related

Ways to buffer REST response

There's a REST endpoint, which serves large (tens of gigabytes) chunks of data to my application.
The application processes the data at its own pace, and as incoming data volumes grow, I'm starting to hit the REST endpoint timeout.
Meaning, processing speed is lower than network throughput.
Unfortunately, there's no way to raise processing speed enough, as there's no "enough" - incoming data volumes may grow indefinitely.
I'm thinking of a way to store incoming data locally before processing, in order to release REST endpoint connection before timeout occurs.
What I've come up with so far is downloading incoming data to a temporary file and reading (processing) said file simultaneously, using OutputStream/InputStream.
Sort of buffering, using a file.
This brings its own problems:
what if processing speed becomes faster than downloading speed for some time and I get EOF?
the file parser operates with ObjectInputStream, and it behaves weirdly in cases of empty file/EOF
and so on
Are there conventional ways to do such a thing?
Are there alternative solutions?
Please provide some guidance.
Upd:
I'd like to point out: http server is out of my control.
Consider it to be a vendor data provider. They have many consumers and refuse to alter anything for just one.
Looks like we're the only ones to use all of their data, as our client app processing speed is far greater than their sample client performance metrics. Still, we cannot match our app performance with network throughput.
Server does not support http range requests or pagination.
There's no way to divide data in chunks to load, as there's no filtering attribute to guarantee that every chunk will be small enough.
In short: we can download all the data in a given time before the timeout occurs, but cannot process it.
Having an adapter between the InputStream and the OutputStream that acts as a blocking queue will help a ton.
You're using something like new ObjectInputStream(new FileInputStream(...)), and the solution for EOF could be to wrap the FileInputStream first in a WriterAwareStream, which would block when hitting EOF for as long as the writer is still writing.
Anyway, in case latency doesn't matter much, I would not bother to start processing before the download has finished. Oftentimes, there isn't much you can do with an incomplete list of objects.
Maybe some memory-mapped-file-based queue like Chronicle-Queue may help you. It's faster than dealing with files directly and may be even simpler to use.
You could also implement a HugeBufferingInputStream internally using a queue, which reads from its input stream, and, in case it has a lot of data, it spits them out to disk. This may be a nice abstraction, completely hiding the buffering.
There's also FileBackedOutputStream in Guava, automatically switching from using memory to using a file when getting big, but I'm afraid, it's optimized for small sizes (with tens of gigabytes expected, there's no point of trying to use memory).
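The WriterAwareStream idea above could look roughly like the following. This is only a sketch under stated assumptions: the class name, the writerDone callback and the polling interval are all hypothetical, and it relies on the fact that a FileInputStream that has hit EOF will return more data on a later read if the file has grown in the meantime.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InterruptedIOException;
import java.util.function.BooleanSupplier;

/** Treats EOF as "no data yet" for as long as the writer has not reported completion. */
public class WriterAwareInputStream extends FilterInputStream {
    private final BooleanSupplier writerDone; // hypothetical hook: true once the download has finished
    private final long pollMillis;            // how long to wait before retrying after a premature EOF

    public WriterAwareInputStream(InputStream in, BooleanSupplier writerDone, long pollMillis) {
        super(in);
        this.writerDone = writerDone;
        this.pollMillis = pollMillis;
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        return read(one, 0, 1) == -1 ? -1 : one[0] & 0xFF;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            // Sample the flag BEFORE reading, so bytes written just before completion are not lost.
            boolean doneBeforeRead = writerDone.getAsBoolean();
            int n = in.read(b, off, len);
            if (n > 0) {
                return n;
            }
            if (n == -1 && doneBeforeRead) {
                return -1; // genuine end of data
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new InterruptedIOException("interrupted while waiting for the writer");
            }
        }
    }
}
```

The downloading thread would set a flag (for example an AtomicBoolean, so the change is visible across threads) after closing the file, and the parser would wrap its FileInputStream in this class before handing it to ObjectInputStream.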
Are there alternative solutions?
If your consumer (the http client) is having trouble keeping up with the stream of data, you might want to look at a design where the client manages its own work in progress, pulling data from the server on demand.
RFC 7233 describes the Range Requests
devices with limited local storage might benefit from being able to request only a subset of a larger representation, such as a single page of a very large document, or the dimensions of an embedded image
HTTP Range requests on the MDN Web Docs site might be a more approachable introduction.
This is the sort of thing that queueing servers are made for. RabbitMQ, Kafka, Kinesis, any of those. Perhaps KStream would work. With everything you get from the HTTP server (given your constraint that it cannot be broken up into units of work), you could partition it into chunks of bytes of some reasonable size, maybe 1024kB. Your application would push/publish those records/messages to the topic/queue. They would all share some common series ID so you know which chunks match up, and each would need to carry an ordinal so they can be put back together in the right order; with a single Kafka partition you could probably rely upon offsets. You might publish a final record for that series with a "done" flag that would act as an EOF for whatever is consuming it. Of course, you'd send an HTTP response as soon as all the data is queued, though it may not necessarily be processed yet.
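The chunking scheme described above (a shared series ID, an ordinal per chunk, and a final "done" record) might be sketched like this. It is a broker-agnostic illustration, not Kafka or RabbitMQ client code: the Chunk class, the publish method and the Consumer used as a sink are all hypothetical stand-ins for whatever producer API is actually used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.function.Consumer;

public class StreamChunker {
    /** One message: the series it belongs to, its position, its bytes, and an EOF marker. */
    static final class Chunk {
        final String seriesId;
        final long ordinal;
        final byte[] payload;
        final boolean done;

        Chunk(String seriesId, long ordinal, byte[] payload, boolean done) {
            this.seriesId = seriesId;
            this.ordinal = ordinal;
            this.payload = payload;
            this.done = done;
        }
    }

    /** Splits the stream into chunks, hands each to the sink, and returns the number of data chunks. */
    static long publish(InputStream in, String seriesId, int chunkSize, Consumer<Chunk> sink)
            throws IOException {
        byte[] buf = new byte[chunkSize];
        long ordinal = 0;
        int filled;
        while ((filled = fill(in, buf)) > 0) {
            sink.accept(new Chunk(seriesId, ordinal++, Arrays.copyOf(buf, filled), false));
        }
        // Final empty record acting as the EOF marker for consumers.
        sink.accept(new Chunk(seriesId, ordinal, new byte[0], true));
        return ordinal;
    }

    /** Reads until the buffer is full or the stream ends; returns the number of bytes stored. */
    private static int fill(InputStream in, byte[] buf) throws IOException {
        int filled = 0;
        while (filled < buf.length) {
            int n = in.read(buf, filled, buf.length - filled);
            if (n == -1) {
                break;
            }
            filled += n;
        }
        return filled;
    }
}
```

In a real setup the sink would wrap the broker's producer, and the consumer side would reassemble chunks by ordinal until it sees the record with done set.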
not sure if this would help in your case because you haven't mentioned what structure & format the data are coming to you in, however, i'll assume a beautifully normalised, deeply nested hierarchical xml (ie. pretty much the worst case for streaming, right? ... pega bix?)
i propose a partial solution that could allow you to sidestep the limitation of your not being able to control how your client interacts with the http data server -
deploy your own webserver, in whatever contemporary tech you please (which you do control) - your local server will sit in front of your locally cached copy of the data
periodically download the output of the webservice using a built-in http querying library, a command-line util such as aria2c, curl, wget et al., an etl (or whatever you please) directly onto a local device-backed .xml file - this happens as often as it needs to
point your rest client to your own-hosted 127.0.0.1/modern_gigabyte_large/get... 'smart' server, instead of the old api.vendor.com/last_tested_on_megabytes/get... server
some thoughts:
you might need to refactor your data model to indicate that the xml webservice data that you and your clients are consuming was dated at the last successful run^ (ie. update this date when the next ingest process completes)
it would be theoretically possible for you to transform the underlying xml on the way through to better yield records in a streaming fashion to your webservice client (if you're not already doing this) but this would take effort - i could discuss this more if a sample of the data structure was provided
all of this work can run in parallel to your existing application, which continues on your last version of the successfully processed 'old data' until the next version 'new data' are available
^
in trade you will now need to manage a 'sliding window' of data files, where each 'result' is a specific instance of your app downloading the webservice data and storing it on disc, then successfully ingesting it into your model:
last (two?) good result(s) compressed (in my experience, gigabytes of xml packs down a helluva lot)
next pending/ provisional result while you're streaming to disc/ doing an integrity check/ ingesting data - (this becomes the current 'good' result, and the last 'good' result becomes the 'previous good' result)
if we assume that you're ingesting into a relational db, the current (and maybe previous) tables with the webservice data loaded into your app, and the next pending table
switching these around becomes a metadata operation, but now your database must store at least webservice data x2 (or x3 - whatever fits in your limitations)
... yes you don't need to do this, but you'll wish you did after something goes wrong :)
Looks like we're the only ones to use all of their data
this implies that there is some way for you to partition or limit the webservice feed - how are the other clients discriminating so as not to receive the full monty?
You can use in-memory caching techniques OR you can use Java 8 streams. Please see the following link for more info:
https://www.conductor.com/nightlight/using-java-8-streams-to-process-large-amounts-of-data/
Camel could maybe help you regulate the network load between the REST producer and the consumer?
You might for instance introduce a Camel endpoint acting as a proxy in front of the real REST endpoint, apply some throttling policy, before forwarding to the real endpoint:
from("http://localhost:8081/mywebserviceproxy")
.throttle(...)
.to("http://myserver.com:8080/myrealwebservice");
http://camel.apache.org/throttler.html
http://camel.apache.org/route-throttling-example.html
My 2 cents,
Bernard.
If you have enough memory, maybe you can use an in-memory data store like Redis.
When you get data from your Rest endpoint you can save your data into Redis list (or any other data structure which is appropriate for you).
Your consumer will consume data from the list.

Using Java to perform Zero Copy data transfers between two or more sockets

Does any one know of any good java libraries/API packages that performs zero copy data transfers between two or more sockets? I know that Java's NIO API can perform zero copy data transfers from disk to socket and vice versa using java.nio.channels.FileChannel.transferTo and java.nio.channels.FileChannel.transferFrom methods respectively. However, there doesn't appear to be support for java socket to socket zero copy transfers. In addition any java libraries/API that can perform the system call splice (which can transfer data from a file descriptor to a pipe and vice versa) would be a plus, preferably on a linux platform.
Thank you for response.
In addition, I have read most of the previous blogs about zero copy as well as other informative sites such as http://www.ibm.com/developerworks/library/j-zerocopy/; However, it appears the above issue has not been addressed.
I don't know much about SocketChannel, but I think ByteBuffer.allocateDirect() is a good choice for you.
Just read socket A's data into ByteBuffer.allocateDirect() and let socket B read from it, simple and easy.
Here are the differences:
1. Old Way
SocketA --> BufferA(Kernel Space) --> BufferB(User Space)
BufferB(User Space) --> BufferC(Kernel Space) --> SocketB
2. Zero Copy Way
SocketA --> DirectBuffer(Could be accessed from Kernel and User Space)
DirectBuffer --> SocketB
Note
IMHO, I don't think we can make it go directly SocketA -> SocketB; the OS needs to load the data into physical memory before sending it out.
========= EDIT =========
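According to the article you referred to, FileChannel's transferTo does it this way:
With ByteBuffer.allocateDirect() the data stays outside the Java heap, so the JVM does not need an extra copy into a temporary native buffer; reading and sending use the same block of memory. Strictly speaking, the data is still copied between the kernel's socket buffers and that block, so this reduces the number of copies rather than eliminating them.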
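The direct-buffer relay suggested above could be sketched as follows. This is a minimal illustration that assumes blocking channels (for example two SocketChannel instances in blocking mode); the class and method names are invented, and it is not a true zero-copy path like splice().

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

public class DirectRelay {
    /** Copies everything from one channel to another through a single direct buffer. */
    static long relay(ReadableByteChannel from, WritableByteChannel to, int bufferSize)
            throws IOException {
        ByteBuffer buf = ByteBuffer.allocateDirect(bufferSize); // allocated outside the Java heap
        long total = 0;
        while (from.read(buf) != -1) {
            buf.flip();                  // switch the buffer from filling to draining
            while (buf.hasRemaining()) {
                total += to.write(buf);  // a single write may be partial
            }
            buf.clear();                 // ready for the next read
        }
        return total;
    }
}
```

With sockets this would be called as relay(socketChannelA, socketChannelB, 64 * 1024); the buffer size is an arbitrary choice that would need benchmarking.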

How to avoid loading a large file repeatedly?

I'm trying to call a Java program (Stanford Chinese Word Segmenter) from within python. The Java program needs to load a large (100M) dictionary file (word list to assist segmentation) which takes 12+ seconds. I was wondering if it is possible to speed up the loading process, and more importantly, how to avoid loading it repeatedly when I need to call the python script multiple times?
Here's the relevant part of the code:
op = subprocess.Popen(['java',
'-mx2g',
'-cp',
'seg.jar',
'edu.stanford.nlp.ie.crf.CRFClassifier',
'-sighanCorporaDict',
'data',
'-testFile',
filename,
'-inputEncoding',
'utf-8',
'-sighanPostProcessing',
'true',
'ctb',
'-loadClassifier',
'./data/ctb.gz',
'-serDictionary',
'./data/dict-chris6.ser.gz',
'0'],
stdout = subprocess.PIPE,
stdin = subprocess.PIPE,
stderr = subprocess.STDOUT,
)
In the above code, './data/ctb.gz' is the place where the large word list file is loaded. I think this might be related to process, but I don't know much about it.
You might be able to use an OS specific solution here. Most modern Operating Systems have the ability to have a partition in memory. For example, in Linux you could do
mkfs -q /dev/ram1 8192
mkdir -p /ramcache
mount /dev/ram1 /ramcache
Moving the file to that directory would greatly speed up I/O.
There might be many ways to speed up the loading of the word list, but it depends on the details. If IO (disk read speed) is the bottleneck, then a simple way might be to zip the file and use a ZipInputStream to read it - but you would need to benchmark this.
To avoid multiple loading, you probably need to keep the Java process running, and communicate with it from Python via files or sockets, to send it commands, rather than actually launching the Java process each time from Python.
However, both of these require modifying the Java code.
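On the Java side, "keep the process running and talk to it over a socket" might look roughly like this. It is a sketch only: loadSegmenter() is a hypothetical placeholder for the expensive dictionary/classifier loading (it does not call the Stanford segmenter API), and the line-based protocol is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.function.UnaryOperator;

public class SegmenterServer {
    /** Hypothetical stand-in: the real version would load the model and dictionary here, once. */
    static UnaryOperator<String> loadSegmenter() {
        return text -> text;
    }

    /** Answers each request line with one response line until the client closes the connection. */
    static void serve(BufferedReader in, PrintWriter out, UnaryOperator<String> segmenter)
            throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            out.print(segmenter.apply(line));
            out.print('\n');
            out.flush();
        }
    }

    /** Loads the segmenter once, then serves clients one at a time on the loopback interface. */
    static void listen(int port) throws IOException {
        UnaryOperator<String> segmenter = loadSegmenter();
        try (ServerSocket server = new ServerSocket(port, 50, InetAddress.getLoopbackAddress())) {
            while (true) {
                try (Socket client = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(client.getInputStream(), StandardCharsets.UTF_8));
                     PrintWriter out = new PrintWriter(
                             new OutputStreamWriter(client.getOutputStream(), StandardCharsets.UTF_8))) {
                    serve(in, out, segmenter);
                }
            }
        }
    }
}
```

The Python script would then connect to the chosen port, send one line of text per request and read one line back, so the 12+ second load happens only when the server starts.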
If the Java program produces output as soon as it receives input from the named pipe filename, and you can't change the Java program, then you could keep your Python script running instead and communicate with it via files/sockets, as @DNA suggested for the Java process (the same idea, but the Python program keeps running).
import os
from subprocess import Popen, PIPE
# ...
os.mkfifo(filename)
p = Popen([..., filename, ...], stdout=PIPE)
with open(filename, 'w') as f:
    while True:
        indata = read_input()  # read text to segment from files/sockets, etc
        f.write(indata)
        f.flush()  # push the text through the pipe instead of leaving it buffered
        # read response from java process
        outdata = p.stdout.readline()  # you need to figure out when to stop reading
        write_output(outdata)  # write response via files/sockets, etc
You can run a single instance of the JVM and use named pipes to allow the python script to communicate with the JVM. This will work assuming that the program executed by the JVM is stateless and responds on its stdout (and stderr perhaps) to requests arriving via its stdin.
Why not track whether the file has already been read on the python side? I'm not a python whiz, but I'm sure you could have some list or map/dictionary of all the files that have been opened so far.

URL.openStream() is very slow when ran on school's unix server

I am using URL.openStream() to download many HTML pages for a crawler that I am writing. The method runs great locally on my Mac; however, on my school's Unix server it is extremely slow, but only when downloading the first page.
Here is the method that downloads the page:
public static String download(URL url) throws IOException {
    Long start = System.currentTimeMillis();
    InputStream is = url.openStream();
    System.out.println("\t\tCreated 'is' in "+((System.currentTimeMillis()-start)/(1000.0*60))+"minutes");
    ...
}
And the main method that invokes it:
LinkedList<URL> ll = new LinkedList<URL>();
ll.add(new URL("http://sheldonbrown.org/bicycle.html"));
ll.add(new URL("http://www.trentobike.org/nongeo/index.html"));
ll.add(new URL("http://www.trentobike.org/byauthor/index.html"));
ll.add(new URL("http://www.myra-simon.com/bike/travel/index.html"));
for (URL tmp : ll) {
    System.out.println();
    System.out.println(tmp);
    CrawlerTools.download(tmp);
}
Output locally (Note: all are fast):
http://sheldonbrown.org/bicycle.html
Created 'is' in 0.00475minutes
http://www.trentobike.org/nongeo/index.html
Created 'is' in 0.005083333333333333minutes
http://www.trentobike.org/byauthor/index.html
Created 'is' in 0.0023833333333333332minutes
http://www.myra-simon.com/bike/travel/index.html
Created 'is' in 0.00405minutes
Output on School Machine Server (Note: All are fast except the first one. The first one is slow regardless of what the first site is):
http://sheldonbrown.org/bicycle.html
Created 'is' in 3.2330666666666668minutes
http://www.trentobike.org/nongeo/index.html
Created 'is' in 0.016416666666666666minutes
http://www.trentobike.org/byauthor/index.html
Created 'is' in 0.0022166666666666667minutes
http://www.myra-simon.com/bike/travel/index.html
Created 'is' in 0.009533333333333333minutes
I am not sure if this is a Java issue (*A problem in my Java code) or a server issue. What are my options?
When run on the server this is the output of the time command:
real 3m11.385s
user 0m0.277s
sys 0m0.113s
I am not sure if this is relevant... What should I do to try and isolate my problem..?
You've answered your own question. It's not a Java issue, it has to do with your school's network or server.
I'd recommend that you report your timings in milliseconds and see if they're repeatable. Run that test in a loop - 1,000 or 10,000 times - and keep track of all the values you get. Import them into a spreadsheet and calculate some statistics. Look at the distribution of values. You don't know if the one data point that you have is an outlier or the mean value. I'd recommend that you do this for both networks in exactly the same way.
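The measurement advice above (time in milliseconds, repeat, then look at the distribution) could be sketched as below. The class and method names are made up, and the Runnable is a placeholder for whatever is being timed, such as one url.openStream() call.

```java
import java.util.Arrays;

public class TimingStats {
    /** Runs the task once and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Arithmetic mean of the samples, or NaN if there are none. */
    static double mean(long[] samples) {
        return Arrays.stream(samples).average().orElse(Double.NaN);
    }

    /** Median of the samples, or NaN if there are none; the input array is left untouched. */
    static double median(long[] samples) {
        if (samples.length == 0) {
            return Double.NaN;
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }
}
```

Comparing the median with the mean is a quick way to see whether a single slow first request is an outlier or typical, before exporting the raw values to a spreadsheet.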
I'd also recommend using Fiddler or some other tool to watch network traffic as you download. You can get better insight into what's going on and perhaps ferret out the root cause.
But it's not Java. It's your code, your network. If this was a bug in the JDK it would have been fixed a long time ago. Suspect yourself first, last, and always.
UPDATE:
My network admin assured me that this was a bad java implementation Not a network problem. What do you think?
"Assured" you? What evidence did s/he produce to support this conclusion? What data? What measurements were taken? Sounds like laziness and ignorance to me.
It certainly doesn't explain why all the other requests behave just fine. What changed in Java between the first and subsequent calls? Did the JVM suddenly rewrite itself?
You can accept it if you want, but I'd say shame on your network admin for not being more curious. It would have been more honorable to be honest and say they didn't know, didn't have time, and weren't interested.
By default, Java prefers to use IPv6. My school's firewall drops all IPv6 traffic (with no warning). After 3 minutes, 15 seconds, Java falls back to IPv4. It seems strange to me that it takes so long to fall back to IPv4.
duffymo's answer, essentially "Go talk to your network admin", helped me to solve the problem; however, I think that this is a problem caused by a strange Java implementation and a strange network configuration.
My network admin assured me that this was a bad java implementation Not a network problem. What do you think?
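If the IPv6 fallback delay described above really is the cause, one workaround to try is the standard java.net.preferIPv4Stack system property. The sketch below is an assumption about how it might be applied in this crawler, not something from the thread; the class and method names are invented.

```java
public class PreferIPv4 {
    /**
     * Asks the JVM to use IPv4 sockets only. This has to run before any networking
     * classes are initialized, so call it at the very start of main().
     */
    static void preferIPv4Stack() {
        System.setProperty("java.net.preferIPv4Stack", "true");
    }
}
```

Passing -Djava.net.preferIPv4Stack=true on the java command line is generally the more reliable option, because it takes effect before any application code runs.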

Suggested Approaches to programmatically make and record a VOIP call

I want to write a program that will be able to call into my company's bi-weekly conference calls, and record the call, so it can then be made into a podcast.
I am thinking of using Gizmo's SIP interface (and the fact that it allows you to make toll-free calls for free), but I am having trouble finding any example code (preferably in Java) that will be able to make an audio call, and get hold of the audio stream.
I have seen plenty of SIP programming tutorials that deal with establishing a session, and then they seem to just do some hand waving, and say "here is where you can establish the audio connection" without actually doing it.
I am experienced in Java, so I would prefer to use it, but other language suggestions are welcome as well.
I have never written a VOIP application, so I'm not really sure where to start. Can anyone suggest a good library or other resource that would help me get started?
Thanks!
Look for a VOIP softphone written in Java, then modify it to save the final audio stream instead of sending it to be played.
Side note: In many states you would be violating the law unless you do one of several things, varying by state: Notify the participants they're being recorded, insert BEEPs every N seconds, both, etc. Probably you only have to comply with the laws of the state you're calling from. Even worse, you may need to allow the users to decline recording (requires you to be there before recording starts). If you control the conference server, you may be able to get it to play a canned announcement that the call is being recorded.
You could do this with Twilio with almost no programming whatsoever. It will cost you 3¢ per minute, so if your company's weekly call is 45 minutes long, you're looking at $1.35 per week, about as close to free as possible. Here are the steps:
Sign up for Twilio and make note of your Account ID and token
Create a publicly accessible file on your web server that does nothing but output the following XML (see the documentation for explanation of the record parameters):
<Response>
    <Record timeout="30" finishOnKey="#" />
</Response>
When it's time to start the recording, perform a POST to this URL (documented here) with your browser or set up an automated process or script to do it for you:
POST http://api.twilio.com/2008-08-01/Accounts/ACCOUNT SID HERE/Calls HTTP/1.1
Called=CONFERENCE NUMBER HERE
&Url=WEB PAGE HERE
&Method=GET
&SendDigits=PIN CODE HERE
If you want to get really creative, you can actually write code to handle the result of the recording verb and email the link to the MP3 or WAV file that Twilio hosts for you. But, if this is a one off, you can skip it because you can access all your recordings in the control panel for your account anyway.
Try peers with the mediaDebug option set to true in peers.xml. This option records all outgoing and incoming media streams in a media/ folder, with a date pattern for the file name. Nevertheless, this file will probably not be usable as is: it contains raw uncompressed linear PCM samples. You can use Audacity, sox or ffmpeg to convert it to whatever you want.
https://voip.dev.java.net/
They have some sample code there.
