Am I correct to suppose that, within the bounds of the same process, having 2 threads reading/writing to a named pipe does not block the reader/writer at all? So with the wrong timing it's possible to miss some data?
And in the case of several processes: the reader will wait until some data is available, and the writer will be blocked until the reader has read all the data supplied by the writer?
I am planning to use a named pipe to pass several (tens, hundreds of) files from an external process and consume them in my Java application. Writing simple unit tests that use one thread for writing to the pipe and another one for reading from the pipe resulted in sporadic test failures because of missing data chunks.
I think this is because of the threading and the shared process, so my test is not correct in general. Is this assumption correct?
Here is an example which illustrates the case:
import java.io.{FileOutputStream, FileInputStream, File}
import java.util.concurrent.Executors
import org.apache.commons.io.IOUtils
import org.junit.runner.RunWith
import org.scalatest.FlatSpec
import org.scalatest.junit.JUnitRunner

@RunWith(classOf[JUnitRunner])
class PipeTest extends FlatSpec {

  def md5sum(data: Array[Byte]) = {
    import java.security.MessageDigest
    MessageDigest.getInstance("MD5").digest(data).map("%02x".format(_)).mkString
  }

  "Pipe" should "block here" in {
    val pipe = new File("/tmp/mypipe")
    val srcData = new File("/tmp/random.10m")
    val md5 = "8e0a24d1d47264919f9d47f5223c913e"
    val executor = Executors.newSingleThreadExecutor()
    executor.execute(new Runnable {
      def run() {
        (1 to 10).foreach { id =>
          val fis = new FileInputStream(pipe)
          assert(md5 === md5sum(IOUtils.toByteArray(fis)))
          fis.close()
        }
      }
    })
    (1 to 10).foreach { id =>
      val is = new FileInputStream(srcData)
      val os = new FileOutputStream(pipe)
      IOUtils.copyLarge(is, os)
      os.flush()
      os.close()
      is.close()
      Thread.sleep(200)
    }
  }
}
Without the Thread.sleep(200), the test fails for one of two reasons:
a broken pipe exception
an incorrect MD5 sum
With this delay set, it works just fine. I am using a file with 10 megabytes of random data.
This is a very simple race condition in your code: you're writing fixed-size messages to the pipe, and assuming that you can read the same messages back. However, you have no idea how much data is available in the pipe for any given read.
If you prefix your writes with the number of bytes written, and ensure that each read only reads that number of bytes, you'll see that pipes work exactly as advertised.
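For illustration, here is a minimal Java sketch of that length-prefix framing (the class and method names are mine, not from the question); to the JVM a named pipe is just a pair of plain file streams, so the same approach applies directly:

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class FramedPipe {
    // Writer side: announce the message length before the payload.
    static void writeMessage(DataOutputStream out, byte[] msg) throws IOException {
        out.writeInt(msg.length); // 4-byte length prefix
        out.write(msg);
        out.flush();
    }

    // Reader side: read exactly the announced number of bytes, no more.
    static byte[] readMessage(DataInputStream in) throws IOException {
        int len = in.readInt();
        byte[] msg = new byte[len];
        in.readFully(msg); // loops internally until len bytes have arrived
        return msg;
    }
}

With this framing in place, a reader never "steals" bytes that belong to the next message, no matter how the writes and reads interleave.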
If you have a situation with multiple writers and/or multiple readers, I recommend using an actual message queue. Actually, I recommend using a message queue in any case, as it solves the issue of message boundary demarcation; there's little point in reinventing that particular wheel.
Am I correct to suppose that within the bounds of the same process, having 2 threads reading/writing to a named pipe does not block the reader/writer at all?
Not unless you are using non-blocking I/O, which you aren't.
So with the wrong timing it's possible to miss some data?
Not unless you are using non-blocking I/O, which you aren't.
Disclaimer: I work on a non-traditional project, so don't be shocked if some assumptions seem absurd.
Context
I wish to create a stream reader for integers, strings, and the other common types in Scala, but to start with I focus only on integers. Also note that I'm not interested in handling exceptions at the moment -- I'll deal with them in due time, and this will be reflected in the API; in the meantime I can make the huge assumption that failures won't occur.
The API should be relatively simple, but due to the nature of the project I'm working on, I can't rely on some features of Scala, and the API needs to look something like this (slightly simplified for the purpose of this question):
object FileInputStream {
  def open(filename: String): FileInputStream =
    new FileInputStream(
      try {
        // Check whether the stream can be opened or not
        val out = new java.io.FileReader(filename)
        out.close()
        Some[String](filename)
      } catch {
        case _: Throwable => None[String]
      }
    )
}

case class FileInputStream(var filename: Option[String]) {
  def close: Boolean = {
    filename = None[String]
    true // This implementation never fails
  }

  def isOpen: Boolean = filename.isDefined

  def readInt: Int = nativeReadInt

  private def nativeReadInt: Int = {
    ??? // TODO
  }
}

object StdIn {
  def readInt: Int = nativeReadInt

  private def nativeReadInt: Int = {
    ??? // TODO
  }
}
Please also note that I cannot rely on additional fields in this class, with the exception of Int variables. This (probably) implies that the stream has to be opened and closed for every operation. Hence, it goes without saying that the implementation will not be efficient, but this is not an issue.
The Question
My goal is to implement the two nativeReadInt methods such that exactly one integer is consumed from the input stream, if one is available straight away. However, if the input doesn't start (w.r.t. the last read operation) with an integer, then nothing should be read and a fixed value can be returned, say -1.
I've explored several high-level Java and Scala standard APIs, but none seemed to offer a trivial way to re-open a stream at a given position. My hope is to avoid implementing low-level parsing based solely on java.io.InputStream and its read() and skip(n) methods.
Additionally, to let the user read from the standard input stream, I need to avoid the scala.io.StdIn.readInt() method, because it reads "an entire line of the default input", thereby discarding some potentially useful data.
Are you aware of a Java or Scala API that could do the trick here?
Thank you
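One possible direction, sketched here in Java (an assumption on my part, not from the post) since the class above wraps java.io anyway: java.io.PushbackInputStream lets you peek at the next byte and push it back if it isn't part of an integer, so nothing past the integer is consumed.

import java.io.IOException;
import java.io.PushbackInputStream;

public class IntScanner {
    // Reads one non-negative integer if the stream starts with a digit;
    // otherwise consumes nothing and returns the fixed value -1.
    static int readIntOrMinusOne(PushbackInputStream in) throws IOException {
        int first = in.read();
        if (first < '0' || first > '9') {
            if (first != -1) {
                in.unread(first); // push the non-digit back, untouched
            }
            return -1;
        }
        int value = first - '0';
        int c;
        while ((c = in.read()) >= '0' && c <= '9') {
            value = value * 10 + (c - '0');
        }
        if (c != -1) {
            in.unread(c); // leave the first non-digit for the next caller
        }
        return value;
    }
}

Wrapping standard input as new PushbackInputStream(System.in) would cover the StdIn case without discarding the rest of the line; mark/reset on a BufferedInputStream is an alternative when the underlying stream supports it.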
I have some intermediate data that I need to store in HDFS and locally as well. I'm using Spark 1.6. In HDFS, as an intermediate form, I'm getting data in /output/testDummy/part-00000 and /output/testDummy/part-00001. I want to save these partitions locally using Java/Scala, so that I could save them as /users/home/indexes/index.nt (by merging both locally) or as /users/home/indexes/index-0000.nt and /home/indexes/index-0001.nt separately.
Here is my code:
Note: testDummy is the same as test; the output has two partitions. I want to store them separately or combined, but locally, in an index.nt file. I would prefer to store them separately on two data nodes. I'm using a cluster and submit the Spark job on YARN. I also added some comments about how many times things print and what data I'm getting. How can I do this? Any help is appreciated.
val testDummy = outputFlatMapTuples.coalesce(Constants.INITIAL_PARTITIONS).saveAsTextFile(outputFilePathForHDFS+"/testDummy")
println("testDummy done") //1 time print

def savesData(iterator: Iterator[(String)]): Iterator[(String)] = {
  println("Inside savesData") // now 4 times when coalesce(Constants.INITIAL_PARTITIONS)=2
  println("iter size"+iterator.size) // 2 735 2 735 values
  val filenamesWithExtension = outputPath + "/index.nt"
  println("filenamesWithExtension "+filenamesWithExtension.length) //4 times
  var list = List[(String)]()

  val fileWritter = new FileWriter(filenamesWithExtension,true)
  val bufferWritter = new BufferedWriter(fileWritter)

  while (iterator.hasNext){ //iterator.hasNext is false
    println("inside iterator") //0 times
    val dat = iterator.next()
    println("datadata "+iterator.next())
    bufferWritter.write(dat + "\n")
    bufferWritter.flush()
    println("index files written")

    val dataElements = dat.split(" ")
    println("dataElements") //0
    list = list.::(dataElements(0))
    list = list.::(dataElements(1))
    list = list.::(dataElements(2))
  }
  bufferWritter.close() //closing
  println("savesData method end") //4 times when coal=2
  list.iterator
}

println("before saving data into local") //1
val test = outputFlatMapTuples.coalesce(Constants.INITIAL_PARTITIONS).mapPartitions(savesData)
println("testRDD partitions "+test.getNumPartitions) //2
println("testRDD size "+test.collect().length) //0
println("after saving data into local") //1
PS: I followed this and this, but they are not exactly what I'm looking for. I got partway, but I'm not getting anything in index.nt.
A couple of things:
Never call Iterator.size if you plan to use the data later. Iterators are TraversableOnce: the only way to compute an Iterator's size is to traverse all its elements, and after that there is no more data to be read.
Don't use transformations like mapPartitions for side effects. If you want to perform some kind of IO, use actions like foreach / foreachPartition. Using a transformation is bad practice, and it doesn't guarantee that a given piece of code will be executed only once.
A local path inside an action or a transformation is a local path on the particular worker. If you want to write directly on the client machine, you should fetch the data first with collect or toLocalIterator. It could be better, though, to write to distributed storage and fetch the data later.
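To illustrate both routes, here is a rough Java sketch (the JavaRDD, the paths, and the class/method names are illustrative assumptions, not from the question):

import org.apache.spark.TaskContext;
import org.apache.spark.api.java.JavaRDD;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.util.Iterator;

public class LocalSave {
    // Variant 1: fetch everything to the driver, then write one merged file.
    // Only safe when the collected data fits in driver memory.
    static void saveMergedOnDriver(JavaRDD<String> lines, String path) throws Exception {
        try (BufferedWriter out = new BufferedWriter(new FileWriter(path))) {
            for (String line : lines.collect()) {
                out.write(line);
                out.newLine();
            }
        }
    }

    // Variant 2: an action (not a transformation) that writes one file per
    // partition, on whichever worker happens to run that partition.
    static void savePerPartitionOnWorkers(JavaRDD<String> lines, final String dir) {
        lines.foreachPartition((Iterator<String> it) -> {
            int id = TaskContext.get().partitionId();
            try (BufferedWriter out = new BufferedWriter(
                    new FileWriter(dir + "/index-" + id + ".nt"))) {
                while (it.hasNext()) {
                    out.write(it.next());
                    out.newLine();
                }
            }
        });
    }
}

Note that in variant 2 the files end up on the workers' local disks, which matches the "two data nodes" preference but means you have to know which machines ran the partitions.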
Java 7 provides means to watch directories.
https://docs.oracle.com/javase/tutorial/essential/io/notification.html
The idea is to create a watch service, register it with the directory of interest (mentioning the events you are interested in, such as file creation, deletion, etc.), and then watch; you will be notified of any such events and can take whatever action you want, as in the sketch below.
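For example, a minimal sketch of that idea against a local directory (the path is illustrative; for HDFS paths you would need the HDFS APIs mentioned below):

import java.nio.file.*;

public class DirWatcher {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("/tmp/watched"); // illustrative path
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_DELETE,
                StandardWatchEventKinds.ENTRY_MODIFY);
        while (true) { // waits for events forever, as noted below
            WatchKey key = watcher.take(); // blocks until an event arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                System.out.println(event.kind() + ": " + event.context());
                // take whatever action you want here
            }
            if (!key.reset()) break; // directory no longer accessible
        }
    }
}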
You will have to depend on the Java HDFS API heavily wherever applicable.
Run the program in the background, since it waits for events forever. (You can write logic to quit after you have done whatever you want.)
On the other hand, shell scripting will also help.
Be aware of the coherency model of the HDFS file system while reading files.
Hope this helps with some ideas.
I am using Java to create an application for network management. In this application I establish communication with network devices using the SNMP4j library (for the SNMP protocol). So I'm supposed to scan certain values of the network devices using this protocol and put the results into a file for caching. At some point I decided to make my application multi-threaded and assign a device to a thread. I created a class that implements the Runnable interface and then scans for the values that I want for each device.
When I run this class alone, it works fine. But when I run multiple threads at the same time, the output gets messed up: it prints additional or out-of-order output into the files. Now I wonder if this problem is due to the I/O or due to the communication.
Here I'll put some of the code so that you can see what I'm doing and help me figure out what's wrong.
public class DeviceScanner implements Runnable {
    private final String device;
    private final SNMPCommunicator comm;
    private String[] oids; // the OIDs to walk
    private OutputStreamWriter out;

    public DeviceScanner(String ip, OutputStream output) throws IOException {
        this.device = ip;
        this.comm = new SNMPV1Communicator(device);
        oids = MIB2.ifTableHeaders;
        out = new OutputStreamWriter(output);
    }

    @Override
    public void run() {
        // Here I use the communicator to request the desired data; it goes something like...
        try {
            String read = "";
            for (int j = 0; j < num; j++) {
                read = comm.snmpGetNext(oids);
                out.write(read);
                this.updateHeaders(read);
            }
            out.flush();
        } catch (IOException e) {
            e.printStackTrace();
        }
        //...
    }
}
Some of the expected output would be something like:
1.3.6.1.2.1.1.1.0 = SmartSTACK ELS100-S24TX2M
1.3.6.1.2.1.1.2.0 = 1.3.6.1.4.1.52.3.9.1.10.7
1.3.6.1.2.1.1.3.0 = 26 days, 22:35:02.31
1.3.6.1.2.1.1.4.0 = admin
1.3.6.1.2.1.1.5.0 = els
1.3.6.1.2.1.1.6.0 = Computer Room
but instead I get something like this (it varies):
1.3.6.1.2.1.1.1.0 = SmartSTACK ELS100-S24TX2M
1.3.6.1.2.1.1.2.0 = 1.3.6.1.4.1.52.3.9.1.10.7
1.3.6.1.2.1.1.4.0 = admin
1.3.6.1.2.1.1.5.0 = els
1.3.6.1.2.1.1.3.0 = 26 days, 22:35:02.31
1.3.6.1.2.1.1.6.0 = Computer Room
1.3.6.1.2.1.1.1.0 = SmartSTACK ELS100-S24TX2M
1.3.6.1.2.1.1.2.0 = 1.3.6.1.4.1.52.3.9.1.10.7
Currently I have one file per device scanner, as desired.
I get the devices from a list of IPs; it looks like this. I'm also using a little thread pool to keep a limited number of threads running at the same time.
for (String s : ips) {
    output = new FileOutputStream(new File(path + s));
    threadpool.add(new DeviceScanner(s, output));
}
I suspect SNMPV1Communicator(device) is not thread-safe. As far as I can see, it's not part of the SNMP4j library.
Taking a wild guess at what's going on here, try putting everything inside a synchronized() block, like this:
synchronized (DeviceScanner.class)
{
    for (int j = 0; j < num; j++) {
        read = comm.snmpGetNext(oids);
        out.write(read);
        this.updateHeaders(read);
    }
    out.flush();
}
If this works, my guess is right, and the reason for the problems you're seeing is that you have many OutputStreamWriters (one on each thread), all writing to a single OutputStream. Each OutputStreamWriter has its own buffer. When this buffer is full, it passes the data to the OutputStream. It's essentially random when each OutputStreamWriter's buffer is full - it might well be in the middle of a line.
The synchronized block above means that only one thread at a time can be writing to that thread's OutputStreamWriter. The flush() at the end means that before leaving the synchronized block, the OutputStreamWriter's buffer should have been flushed to the underlying OutputStream.
Note that synchronizing in this way on the class object isn't what I'd consider best practice. You should probably be looking at using a single instance of some other kind of stream class - or something like a LinkedBlockingQueue, with all of the SNMP threads passing their data over to a single file-writing thread. I've added the synchronized as above because it was the only thing available to synchronize on within your pasted example code.
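For illustration, a sketch of that queue-based approach (class and method names are mine): each scanner thread hands complete lines to a LinkedBlockingQueue, and one dedicated writer thread owns the file, so lines can never interleave mid-write.

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SingleWriter {
    private static final String POISON = "\u0000STOP"; // sentinel that ends the writer
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Scanner threads call this instead of touching the file themselves.
    public void submit(String line) throws InterruptedException {
        queue.put(line);
    }

    // One dedicated thread drains the queue, so output is never interleaved.
    public Thread startWriter(final String path) {
        Thread t = new Thread(() -> {
            try (PrintWriter out = new PrintWriter(new FileWriter(path))) {
                while (true) {
                    String line = queue.take();
                    if (POISON.equals(line)) {
                        break;
                    }
                    out.println(line);
                }
            } catch (IOException | InterruptedException e) {
                e.printStackTrace();
            }
        });
        t.start();
        return t;
    }

    public void shutdown() throws InterruptedException {
        queue.put(POISON);
    }
}

The scanners then need no synchronization at all around their writes; only submit() is called from multiple threads, and BlockingQueue is safe for that by design.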
You've got multiple threads all using buffered output, and to the same file.
There are no guarantees as to when those threads will be scheduled to run; the output will be in a fairly random order, dictated by the thread scheduling.
I have tried to do this with simple threads and succeeded, but I believe that using a thread pool I could do the same thing more efficiently :)?
simple threads:
public static class getLogFile implements Runnable {
    private String file;

    public void setFilename(String namefile) {
        file = namefile;
    }

    public int run1(String Filenamet) {
        connectToServer(XXX, Filenamet, XXX, XXX, XXX, XXX); // creates a file and downloads it
        return 0;
    }

    public void run() {
        run1(file);
    }
}
in main:
for (x = 0; x < 36; x++) {
    String Filename1 = Filename + x;
    getLogFile n = new getLogFile();
    n.setFilename(Filename1);
    (new Thread(n)).start();
}
The program connects to the server and executes 36 commands (using a thread pool / simple threads?!) at the same time, and either downloads 36 result files and then merges them, or maybe it could just write to one file on the server and then download that?
How do I transform this code to use a thread pool?
How do I write data to one file from 36 threads?
I can only offer you directions.
In order to use a thread pool, look at how ExecutorService works. Any example from Google will give you enough information. As an example, look at:
http://www.deitel.com/articles/java_tutorials/20051126/JavaMultithreading_Tutorial_Part4.html
Concerning whether the 36 threads should each write to their own file or all into one file: I cannot say anything about writing by several threads into the same file, but you may use a CyclicBarrier to wait for the event that all threads have finished writing. An example of its use can be found here:
http://download.oracle.com/javase/1.5.0/docs/api/java/util/concurrent/CyclicBarrier.html
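Putting those directions together, a minimal sketch (the class name, pool size, and file-naming scheme are my choices; the connectToServer call from the question would go where indicated). Waiting on each task's Future achieves the same "wait until all have finished" effect as a CyclicBarrier here:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PooledDownloads {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8); // bounded pool
        List<Future<?>> pending = new ArrayList<>();
        for (int x = 0; x < 36; x++) {
            final String filename = "log" + x; // hypothetical naming scheme
            pending.add(pool.submit(new Runnable() {
                public void run() {
                    // connectToServer(..., filename, ...) from the question goes here
                    System.out.println("downloading " + filename);
                }
            }));
        }
        for (Future<?> f : pending) {
            f.get(); // waits for that download; rethrows any task failure
        }
        pool.shutdown();
        // all 36 files now exist locally and can be merged by a single thread
    }
}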
It's not clear what you want to do. My thoughts are that creating 36 separate connections to the server will be a sizable load that it could well do without.
Can the server assemble these 36 files and look after the threading itself? That seems a more logical partition of duties. The server knows how parallelisable this work is, and servicing multiple connections has a substantial impact on the server (including potentially blocking out other clients).
A simple way to do it using Java task executors is as follows:
ExecutorService executor = Executors.newFixedThreadPool(100);
for (int i = 0; i < 100; i++) {
    final int taskId = i;
    executor.execute(new Runnable() {
        public void run() {
            // task number taskId: download one file here
        }
    });
}
executor.shutdown();
You can also use Spring task executors, which will be easier. However, I would also suggest using a single connection, as mentioned above.
I posted the same question here a few days ago (Java reading standard output from an external program using inputstream), and I found some excellent advice on dealing with a block while reading (while (is.read() != -1)), but I still cannot resolve the problem.
After reading the answers to this similar question,
Java InputStream blocking read
(especially the answer posted by Guss),
I am beginning to believe that looping over an input stream with an is.read() != -1 condition doesn't work if the program is interactive (that is, it takes multiple inputs from the user, presents additional output upon subsequent inputs, and exits only when an explicit exit command is given). I admit that I don't know much about multi-threading, but I think what I need is a mechanism to promptly pause the input stream threads (one each for stdout and stderr) when user input is needed, and to resume them once the input is provided, in order to prevent a block. The following is my current code, which blocks on the line indicated:
EGMProcess egm = new EGMProcess(new String[]{directory + "/egm", "-o",
        "CasinoA", "-v", "VendorA", "-s", "localhost:8080/gls/MessageRobot.action ",
        "-E", "glss_env_cert.pem", "-S", "glss_sig_cert.pem", "-C", "glsc_sig_cert.pem",
        "-d", "config", "-L", "config/log.txt", "-H", "GLSA-SampleHost"}, new String[]{"PATH=${PATH}"}, directory);
egm.execute();
BufferedReader stdout = new BufferedReader(new InputStreamReader(egm.getInputStream()));
BufferedReader stderr = new BufferedReader(new InputStreamReader(egm.getErrorStream()));
EGMStreamGobbler stdoutprocessor = new EGMStreamGobbler(stdout, egm);
EGMStreamGobbler stderrprocessor = new EGMStreamGobbler(stderr, egm);
BufferedWriter stdin = new BufferedWriter(new OutputStreamWriter(egm.getOutputStream()));

stderrprocessor.run(); //<-- the block occurs here!
stdoutprocessor.run();

//EGM/Agent test cases
//check bootstrap menu
if (!checkSimpleResult("******** EGM Bootstrap Menu **********", egm)) {
    String stdoutdump = egm.getStdOut();
    egm.cleanup();
    throw new Exception("can't find '******** EGM Bootstrap Menu **********'" +
            "in the stdout" + "\nStandard Output Dump:\n" + stdoutdump);
}

//select bootstrap
stdin.write("1".toCharArray());
stdin.flush();

if (!checkSimpleResult("Enter port to receive msgs pushed from server ('0' for no push support)", egm)) {
    String stdoutdump = egm.getStdOut();
    egm.cleanup();
    throw new Exception("can't find 'Enter port to receive msgs pushed from server ('0' for no push support)'" +
            "in the stdout" + "\nStandard Output Dump:\n" + stdoutdump);
}
...
public class EGMStreamGobbler implements Runnable {
    private BufferedReader instream;
    private EGMProcess egm;

    public EGMStreamGobbler(BufferedReader isr, EGMProcess aEGM) {
        instream = isr;
        egm = aEGM;
    }

    public void run() {
        try {
            int c;
            while ((c = instream.read()) != -1) {
                egm.processStdOutStream((char) c);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
I apologize for the length of the code, but my questions are:
1) Is there any way to control the process of taking in the input streams (stdout, stderr) without using read()? Or am I just implementing this badly?
2) Is multi-threading the right strategy for developing the process of taking in input streams and writing output?
PS: if anyone can provide a similar problem with a solution, it will help me a lot!
instead of
stderrprocessor.run(); //<-- the block occurs here!
stdoutprocessor.run();
You need to start threads:
Thread errThread = new Thread(stderrprocessor);
errThread.setDaemon( true );
errThread.start();
Thread outThread = new Thread(stdoutprocessor);
outThread.setDaemon( true );
outThread.start();
run() is just a method specified in Runnable. Thread.start() calls run() on the Runnable in a new Thread.
If you just call #run() on a Runnable, it will not be executed in parallel. To run it in parallel, you have to spawn a java.lang.Thread that executes the #run() of your Runnable.
Whether a stream blocks depends on both sides of the stream. If either the sender does not send any data or the receiver does not consume it, you have a blocking situation. If the processor has to do something while the stream is blocked, you need to spawn a(nother) thread within the processor to wait for new data and to interrupt the other processing when new data arrives.
First, you need to read up on Thread and Runnable. You do not call Runnable.run() directly, you set up Threads to do that, and start the threads.
But more importantly, the presence of three independent threads implies the need for some careful design. Why 3 threads? The two you just started, and the main one.
I assume that the general idea of your app is to wait for some output to arrive, interpret it, and as a result send a command to the application you are controlling?
So your main thread needs to wait around for one of the reader threads to say "Aha! that's interesting, better ask the user what he wants to do."
In other words you need some communication mechanism between your readers and your writer.
This might be implemented using Java's event mechanism. Yet more reading, I'm afraid.
Isn't this why NIO was created?
I don't know much about the Channels in NIO, but this answer may be helpful: it shows how to read a file using NIO.