I am working on a data format transformation job.
There is a large file (around 10 GB). My current solution reads the file line by line, transforms each line, and writes the result to an output file. I found that the transform step is the bottleneck, so I am trying to do it concurrently.
Each line is a complete unit and is independent of the other lines. Some lines may be discarded because a specific value in the line does not meet the requirements.
I have two plans:
1. One thread reads the input file line by line and puts each line into a queue. Several worker threads take lines from the queue, transform them, and put the results into an output queue. Finally, an output thread reads lines from the output queue and writes them to the output file.
2. Several threads concurrently read from different parts of the input file, process the lines, and write to the output file through an output queue or a file lock.
Could you please give me some advice? I would really appreciate it.
thanks in advance!
I would go for the first option ... reading data from a file in small pieces normally is slower than reading the whole file at once (depending on file caches/buffering/read ahead etc).
You also might need to think about a way to create the output file (acquiring all lines from the different processes, possibly in the correct order if needed).
Solution 1 makes sense.
This would also map nicely and simply to Java's Executor framework. Your main thread reads lines and submits each line to an Executor or ExecutorService.
It gets more complicated if you must keep order intact, though.
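A minimal sketch of the first plan, written in Python (the language of the code elsewhere on this page) rather than with Java's ExecutorService. The `transform` function, the queue sizes and the worker count are hypothetical placeholders, and output order is not preserved. Note that in CPython, threads will not speed up a CPU-bound transform because of the global interpreter lock; the same shape with worker processes would be needed there.

```python
import queue
import threading

SENTINEL = None  # marks end of stream on both queues


def transform(line):
    """Hypothetical per-line transform; returns None to discard the line."""
    line = line.rstrip("\n")
    if not line:
        return None
    return line.upper() + "\n"


def run_pipeline(in_path, out_path, n_workers=4):
    # Bounded queues keep the reader from running far ahead of the workers.
    in_q = queue.Queue(maxsize=1000)
    out_q = queue.Queue(maxsize=1000)

    def reader():
        with open(in_path) as src:
            for line in src:
                in_q.put(line)
        for _ in range(n_workers):  # one sentinel per worker
            in_q.put(SENTINEL)

    def worker():
        while True:
            line = in_q.get()
            if line is SENTINEL:
                out_q.put(SENTINEL)  # tell the writer this worker is done
                return
            result = transform(line)
            if result is not None:
                out_q.put(result)

    def writer():
        finished = 0
        with open(out_path, "w") as dst:
            while finished < n_workers:
                item = out_q.get()
                if item is SENTINEL:
                    finished += 1
                else:
                    dst.write(item)

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    threads += [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

If order must be kept, one option is to tag each line with its sequence number in the reader and have the writer hold back results until the next expected number arrives.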
Related
I have one file that contains 100 messages, one message per line. I have 10 threads, and each thread should pick one message from the file and send it to a given address. No message should be sent more than once. Since there are 10 threads, each thread should be responsible for sending 10 messages.
Normally people use CSV Data Set Config for this form of parameterization.
Add CSV Data Set Config to your Test Plan
Configure it as follows:
That's it. Now you can refer to the line from the CSV as ${message} wherever required. Each user will read its own line with no duplicates, and the test will end when all lines have been read.
Another option is to use __StringFromFile() function, however in this case the test will not stop, you will have to worry about setting the number of iterations yourself. Also __StringFromFile() function keeps the whole file in memory so it is not suitable for large data sets.
I have a large file. Each line in that file maps to a database record, so I need to read the file line by line and persist each record into the database. Suppose I use multiple threads to read that file.
Is there a way in Java for one thread to read lines 1 to 50 while another thread reads lines 51 to 100? In the same way, I would have multiple threads reading from that single file.
I am monitoring a Minecraft server and I am making a setup file in Python. I need to be able to run two threads: one running minecraft_server.jar in the console window, while a second thread constantly checks the output of minecraft_server. Also, how would I send input to the console from Python after starting the Java process?
Example:
thread1 = threading.Thread(target=listener)
thread2 = minecraft_server.jar

def listener():
    if minecraft_server.jarOutput == "Server can't keep up!":
        sendToTheJavaProccessAsUserInputSomeCommandsToRestartTheServer
It's pretty hard to tell here, but I think what you're asking is how to:
Launch a program in the background.
Send it input, as if it came from a user on the console.
Read its output that it tries to display to a user on the console.
At the same time, run another thread that does other stuff.
The last one is pretty easy; in fact, you've mostly written it, you just need to add a thread1.start() somewhere.
The subprocess module lets you launch a program and control its input and output. It's easiest if you want to just feed in all the input at once, wait until it's done, then process all the output, but obviously that's not your case here, so it's a bit more involved:
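import subprocess
# The jar path and the trailing arguments are placeholders; java needs -jar to run a jar file.
minecraft = subprocess.Popen(['java', '-jar', 'path/to/minecraft_server.jar', '-other', 'args'],
                             stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE, stderr=subprocess.STDOUT)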
I'm merging stdout and stderr together into one pipe; if you want to read them separately, or send stderr to /dev/null, or whatever, see the docs; it's all pretty simple. While we're making assumptions here, I'm going to assume that minecraft_server uses a simple line-based protocol, where every command, every response, and every info message is exactly one line (that is, under 1K of text ending in a \n).
Now, to send it input, you just do this:
minecraft.stdin.write('Make me a sandwich\n')
Or, in Python 3.x:
minecraft.stdin.write(b'Make me a sandwich\n')
To read its output, you do this:
response = minecraft.stdout.readline()
That works just like a regular file. But note that it works like a binary file. In Python 2.x, the only difference is that newlines don't get automatically converted, but in Python 3.x, it means you can only write bytes (and compatible objects), not strs, and you will receive bytes back. There are good reasons for that, but if you want to get pipes that act like text files instead, see the universal_newlines (and possibly bufsize) arguments under Frequently Used Arguments and Popen Constructor.
Also, it works like a blocking file. With a regular file, this rarely matters, but with a pipe, it's quite possible that there will be data later, but there isn't data yet (because the server hasn't written it yet). So, if there is no output yet (or not a complete line's worth, since I used readline()), your thread just blocks, waiting until there is.
If you don't want that, you probably want to create another thread to service stdout. And its function can actually look pretty similar to what you've got:
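def listener():
    # Iterating over the pipe blocks until the server writes a complete line.
    for line in minecraft.stdout:
        # On Python 3 the pipe yields bytes, so compare and write bytes literals (b"...") there.
        if line.strip() == "Server can't keep up!":
            minecraft.stdin.write("Restart Universe\n")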
Now that thread can block all day and there's no problem, because your other threads are still going.
Well, not quite no problem.
First it's going to be hard to cleanly shut down your program.
More seriously, the pipes between processes have a fixed size; if you don't service stdout fast enough, or the child doesn't service stdin fast enough, the pipe can block. And, the way I've written things, if the stdin pipe blocks, we'll be blocked forever in that stdin.write and won't get to the next read off stdout, so that can block too, and suddenly we're both waiting on each other forever.
You can solve this by having another thread to service stdin, so that a blocked write never stops the thread that reads stdout. The subprocess module itself includes an example, in the Popen._communicate function used by all the higher-level functions. (Make sure to look at Python 3.3 or later, because earlier versions had bugs.)
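One way to avoid that deadlock is to give each pipe its own thread and talk to them through queues. The sketch below is not the code from the subprocess module; it is a small illustration for Python 3. The `CHILD` command is a stand-in process that echoes each line in upper case, and the names `start`, `to_child` and `from_child` are made up for the example.

```python
import queue
import subprocess
import sys
import threading

# Stand-in child process (an assumption for the demo): echoes each line in upper case.
CHILD = [sys.executable, "-u", "-c",
         "import sys\nfor line in sys.stdin:\n    print(line.strip().upper(), flush=True)"]


def start(cmd):
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    to_child = queue.Queue()    # lines waiting to be written to the child's stdin
    from_child = queue.Queue()  # lines read from the child's stdout

    def pump_stdin():
        # Only this thread writes to stdin, so a full pipe never stalls the reader.
        while True:
            data = to_child.get()
            if data is None:  # None means "no more input"
                proc.stdin.close()
                return
            proc.stdin.write(data)
            proc.stdin.flush()

    def pump_stdout():
        # Only this thread reads stdout; the loop ends when the child closes the pipe.
        for line in proc.stdout:
            from_child.put(line)
        from_child.put(None)  # signal end of output

    for target in (pump_stdin, pump_stdout):
        threading.Thread(target=target, daemon=True).start()
    return proc, to_child, from_child
```

The main thread then only calls `to_child.put(b"some command\n")` and `from_child.get()`, neither of which can block on the pipes themselves.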
If you're in Python 3.4+ (or 3.3 with a backport off PyPI), you can instead use asyncio to rewrite your program around an event loop and handle the input and output the same way you'd write a reactor-based network server. That's what all the cool kids are doing in 2017, but back in late 2014 many people still thought it looked new and scary.
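For comparison, here is a rough sketch of the asyncio approach. It uses the current API (`asyncio.run`, available from Python 3.7), not the 3.4-era one, and it assumes the one-line-in, one-line-out protocol described earlier. `CHILD` is again a stand-in echo process and `talk` is a made-up helper name.

```python
import asyncio
import sys

# Stand-in child process (an assumption for the demo): echoes each line in upper case.
CHILD = [sys.executable, "-u", "-c",
         "import sys\nfor line in sys.stdin:\n    print(line.strip().upper(), flush=True)"]


async def talk(cmd, lines):
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
    )
    replies = []
    for line in lines:
        proc.stdin.write(line)
        await proc.stdin.drain()                       # wait until the pipe accepts the data
        replies.append(await proc.stdout.readline())   # yields to the event loop while waiting
    proc.stdin.close()   # end of input lets the child exit
    await proc.wait()
    return replies


if __name__ == "__main__":
    print(asyncio.run(talk(CHILD, [b"ping\n", b"pong\n"])))
```

Because every wait is an `await`, other coroutines can run on the same thread while the child is slow to answer.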
If all of this is sounding like a lot more work than you signed on for, you may want to consider using pexpect, which wraps up a lot of the tedious details, and makes some simplifying assumptions that are probably true in your case.
I'd like my program to get a file, and then create 4 files based on its byte content.
Working with only the main thread, I just create one DataInputStream and do my thing sequentially.
Now, I'm interested in making my program concurrent. Maybe I can have four threads - one for each file to be created.
I don't want to read the file's bytes into memory all at once, so my threads will need to query the DataInputStream constantly to stream the bytes using read().
What is not clear to me is, should my 4 threads call read() on the same DataInputStream, or should each one have their own separate stream to read from?
I don't think this is a good idea. See http://download.java.net/jdk7/archive/b123/docs/api/java/io/DataInputStream.html
DataInputStream is not necessarily safe for multithreaded access. Thread safety is optional and is the responsibility of users of methods in this class.
Assuming you want all of the data in each of your four new files, each thread should create its own DataInputStream.
If the threads share a single DataInputStream, at best each thread will get some random quarter of the data. At worst, you'll get a crash or data corruption due to multithreaded access to code that is not thread safe.
If you want to read data from 1 file into 4 separate ones you will not share DataInputStream. You can however wrap that stream and add functionality that would make it thread safe.
For example you may want to read in a chunk of data from your DataInputStream and cache that small chunk. When all 4 threads have read the chunk you can dispose of it and continue reading. You would never have to load the complete file into memory. You would only have to load a small amount.
According to its documentation, DataInputStream is a FilterInputStream, which means read operations are delegated to the underlying InputStream. Supposing the underlying stream is a FileInputStream, most platforms support concurrent reads of the same file.
So in your case, you should create four separate FileInputStreams, wrap each one in its own DataInputStream, and use one per thread. The reads will not interfere with one another.
Short answer is no.
Longer answer: have a single thread read the DataInputStream, and put the data into one of four Queues, one per output file. Decide which Queue based upon the byte content.
Have four threads, each one reading from a Queue, that write to the output files.
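The single-reader, one-queue-per-output layout can be sketched as follows. This is in Python rather than Java, and the routing rule (byte value modulo the number of outputs) is purely a placeholder, since the question does not say how bytes map to files. `split_file` and its parameters are hypothetical names.

```python
import queue
import threading


def split_file(in_path, out_paths, chunk_size=64 * 1024):
    # One bounded queue per output file, so memory use stays small.
    queues = [queue.Queue(maxsize=16) for _ in out_paths]

    def writer(q, path):
        with open(path, "wb") as dst:
            while True:
                data = q.get()
                if data is None:  # None marks the end of the stream
                    return
                dst.write(data)

    writers = [threading.Thread(target=writer, args=(q, p))
               for q, p in zip(queues, out_paths)]
    for t in writers:
        t.start()

    n = len(queues)
    with open(in_path, "rb") as src:  # the only thread that touches the input
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            # Placeholder routing rule: byte value modulo the number of outputs.
            buckets = [bytearray() for _ in range(n)]
            for b in chunk:
                buckets[b % n].append(b)
            for q, bucket in zip(queues, buckets):
                if bucket:
                    q.put(bytes(bucket))

    for q in queues:
        q.put(None)
    for t in writers:
        t.join()
```

Only one thread ever reads the input, so the thread-safety question about the shared stream does not arise.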
Please Note: I am not "looking for teh codez" - just ideas for algorithms to solve this problem.
This IS a homework assignment. I thought I was in the home stretch, about to finish it out, but the last part has absolutely stumped me. Never have I been stuck like this. It has to do with threading in Java.
The Driver class reads a file, the first line indicates the number of threads, second line is a space delimited list of file names for each thread to read from. Each thread is numbered (0 - N), N being the total number of files. Each thread reads the file specified, and outputs to a file named t#_out.txt where # is the threads index.
After all of this is done the Driver thread must:
After all threads finish execution, the program Driver.java opens all
output files t#_out.txt, reads a line from each file, and writes the
line to an output file out.txt.
Example of the out.txt:
MyThread[0]: Line[1]: Something there is that doesn't love a wall,
MyThread[1]: Line[1]: HOG Butcher for the World,
MyThread[2]: Line[1]: I think that I shall never see
MyThread[0]: Line[2]: That sends the frozen-ground-swell under it,
MyThread[1]: Line[2]: Tool Maker, Stacker of Wheat,
MyThread[2]: Line[2]: A poem lovely as a tree.
MyThread[0]: Line[3]: And spills the upper boulders in the sun,
MyThread[1]: Line[3]: Player with Railroads and the Nation's Freight Handler;
MyThread[2]: Line[3]: A tree whose hungry mouth is prest
My problem is: What kind of loop structure could I setup to do this? Read a line from t1_out.txt, write to out.txt, read line from t2_out.txt, write to out.txt, read line from tN_out.txt, write to out.txt? How do I know when one file has reached the end?
Ideas:
Use a while(!done) loop to continue looping until each scanner is done. Keep track of an array of booleans indicating whether or not the Scanner is done reading its file. The Scanners would be in an array as well. The problem with this is how do I tell when ALL are done, to finish my infinite loop? In each iteration see if booleans[i] is done and if not then done = false? No good.
Just read every file's lines into its own String[] array. Then figure out a loop to alternate the writing to the out.txt. Problem with this is what happens when I hit array index out of bounds? Also this is not in the specs, it says to read a line, and write a line.
EDIT: The solution was to create an allFilesReachedEOF() method which has an initial boolean of true. It then loops through each file, and if ANY have another line to read, sets the return value to false. This was my while loop's condition: while (!allFilesReachedEOF()).
My problem was that I was trying to control the loop from within the loop. So if a file had another line it would continue, but if ANY file EOF'd, the loop would stop.
Thanks for the help!
You could do a while with a condition that not all the files have reached EOF. Then you iterate through all the files, and for those that haven't reached EOF, you read the next line and write it to your output file. As you go, you update your condition variable for the "while" loop.
Is this what you're looking to do?
It sounds like you could use a Queue to achieve this. Add each t#_out.txt's input to the Queue then implement a loop in which you read a line from the polled input and write it to your output. As long as the read line isn't EOF, re-add the input to the Queue. When the Queue is empty, break out from the loop.
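A small sketch of the queue idea, in Python rather than Java. A `deque` plays the role of the Queue, and `merge_round_robin` is a made-up name: each open file is taken from the front, one line is copied, and the file goes back to the end unless it has reached EOF.

```python
from collections import deque


def merge_round_robin(in_paths, out_path):
    readers = deque(open(p) for p in in_paths)
    try:
        with open(out_path, "w") as out:
            while readers:
                reader = readers.popleft()
                line = reader.readline()  # "" means this file has reached EOF
                if line:
                    # Guard against a final line that lacks a trailing newline.
                    out.write(line if line.endswith("\n") else line + "\n")
                    readers.append(reader)  # still has data: back of the queue
                else:
                    reader.close()  # exhausted: drop it from the rotation
    finally:
        for reader in readers:  # only non-empty if an error interrupted the loop
            reader.close()
```

The loop ends naturally when the queue is empty, so no separate "all done" flag is needed.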
Also I recommend a BufferedWriter for the output, which you flush() at the end so that the actual writing only occurs once.
Here are the main points:
Create a class that implements Runnable whose run() method does what you need one thread to do. It'll likely need fields for threadNumber and filename. The run method should make sure to close() the output streams of the output files
For each thread you need to create, instantiate one of your class (giving it the filename (and other) data it needs) and pass it into the constructor of Thread. Keep a reference to the Thread objects
Call the start() method on all the threads
Call the join() method on all the threads (join waits for the thread to finish)
Do your final Driver task of opening the output files
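The steps above can be sketched roughly like this, in Python rather than Java: a plain function stands in for the Runnable's run() method, and `copy_with_prefix`, `run_all` and `out_dir` are hypothetical names. The output line format follows the example in the question.

```python
import os
import threading


def copy_with_prefix(index, in_path, out_dir):
    """Work done by one thread: copy its file to t<index>_out.txt with a prefix."""
    out_path = os.path.join(out_dir, "t%d_out.txt" % index)
    # The with-block closes both files before the thread finishes.
    with open(in_path) as src, open(out_path, "w") as dst:
        for n, line in enumerate(src, start=1):
            dst.write("MyThread[%d]: Line[%d]: %s\n" % (index, n, line.rstrip("\n")))


def run_all(in_paths, out_dir):
    threads = [threading.Thread(target=copy_with_prefix, args=(i, path, out_dir))
               for i, path in enumerate(in_paths)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every thread before the driver reads the outputs
    return [os.path.join(out_dir, "t%d_out.txt" % i) for i in range(len(in_paths))]
```

After `run_all` returns, every output file is complete and closed, so the driver can safely open them for the final merge.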
This can be done by using a do - while "exit condition" loop and an inner for loop for iterating through the output files. Set the exit condition to true before the start of the for loop, and reset it within the for loop if you get at least a line from any of the files.
Files that have reached eof will continue to be read, but will not return any lines. You can choose to print blank lines for these or just skip them.
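A sketch of that loop in Python (which has no do-while, so `while True` with a `break` stands in for it). `interleave` is a made-up name, and this version skips files that have reached EOF rather than printing blank lines for them.

```python
def interleave(in_paths, out_path):
    files = [open(p) for p in in_paths]
    try:
        with open(out_path, "w") as out:
            while True:
                all_done = True  # assume finished until some file yields a line
                for f in files:
                    line = f.readline()  # returns "" at EOF, and on every later call
                    if line:
                        all_done = False
                        # Guard against a final line that lacks a trailing newline.
                        out.write(line if line.endswith("\n") else line + "\n")
                if all_done:  # a full pass produced nothing: every file is at EOF
                    break
    finally:
        for f in files:
            f.close()
```

The flag is reset at the top of each pass and checked only after the inner loop, so one short file cannot end the merge early.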