I have been experimenting with multithreading lately, and I coded an application that downloads a file using HttpURLConnection's Range request property. I first get the length of the file from the header, then split it into X equal parts, and if there is a remainder, I assign one more thread to take up the slack. Each part then goes into an object inside a queue, and multiple threads take the tasks from the queue and execute them, downloading each part concurrently into a separate file.
The way that I join the files is the problem. Whether I use Linux's cat or Windows' copy /B or type, the resulting file always comes out invalid in some way.
With AVI files the index is broken, but once it is rebuilt the AVI plays correctly. With .rar files, WinRAR displays "unexpected end of archive", although the files extract normally. What could be causing this? I made sure that no bytes overlapped when I split the file up amongst the threads.
You're using the Range request parameter incorrectly. The end index is the last byte to be read, inclusive, while your algorithm passes the index of the first byte that you don't want transferred. Subtract 1 from the value you pass as the end argument to DownloadPart, and you should be fine:
list.add(new DownloadPart(pos, pos + pieceLen - 1, savePath, url, String.valueOf(ch)));
You've also got some unnecessary code duplication that you should probably clean up: your first full block doesn't need to be treated any differently from any other full block, and removing that special case would simplify your code.
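For reference, here is a minimal sketch of requesting one inclusive range and saving it to a part file (everything except HttpURLConnection itself is made up for illustration; your DownloadPart presumably does something similar):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.*;

// Download bytes [start, end] of the resource into partFile.
// Both bounds of an HTTP Range are inclusive, so end is the last byte wanted.
static void downloadRange(URL url, long start, long end, Path partFile) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Range", "bytes=" + start + "-" + end);
    try (InputStream in = conn.getInputStream()) {
        Files.copy(in, partFile, StandardCopyOption.REPLACE_EXISTING);
    } finally {
        conn.disconnect();
    }
}

With pieceLen bytes per part, part i then covers [i * pieceLen, min((i + 1) * pieceLen, totalLen) - 1].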
I'm trying to implement a file storage system that works like a video recording system that loops over existing data. Say we have a maximum file size of 10MB and append an integer value every second. We can set up a FileChannel and keep appending the values, but what do we do once we've reached our 10MB? Now we want to append a new value but pop one off the head of the file. The problem is essentially a queue, and it's easy to put and pop values from a queue in memory, but not so easy when using files.
I implemented a circular buffer with a FileChannel as the behind-the-scenes storage. It works, but the problem is that the first and last indices move through the file as data is added and removed. Ideally, I would always like the oldest data value to be at file index 0 and the most recent data at file index n-1, so that a file is read from start to end.
I saw that FileChannel supports transferTo() and transferFrom(), and I also did an implementation using these methods, which again works. The problem with this approach is that I continually have to transfer blocks of data from the current file to a temporary file and then replace the current file with the new one. It works, but it is not particularly efficient.
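Roughly, that transfer step looks like this (simplified; RECORD_SIZE and the paths are placeholders, and it assumes fixed-size records so exactly one record can be dropped from the head):

import java.nio.channels.FileChannel;
import java.nio.file.*;

Path file = Paths.get("queue.dat");
Path temp = Paths.get("queue.tmp");
final int RECORD_SIZE = 4;   // one int per record

// Drop the oldest record by copying everything after it into a temp file,
// then swap the temp file in for the original.
try (FileChannel src = FileChannel.open(file, StandardOpenOption.READ);
     FileChannel dst = FileChannel.open(temp, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
    long pos = RECORD_SIZE;                            // skip the oldest record
    long remaining = src.size() - RECORD_SIZE;
    while (remaining > 0) {
        long n = src.transferTo(pos, remaining, dst);  // may transfer fewer bytes than asked
        pos += n;
        remaining -= n;
    }
}
Files.move(temp, file, StandardCopyOption.REPLACE_EXISTING);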
Thus, I've tried a few things but haven't found the ideal way of replicating a file-backed queue yet, and I was wondering whether anyone else has implemented the silver-bullet solution. Maybe a file version of a queue in which the data is shuffled along is simply not possible, but hopefully someone knows an answer. Thanks.
My task is to sort a file which is too large to fit in memory. The file contains text lines.
What I did:
read the original file in parts (of the allowed size)
sorted each part
saved each sorted part to a temp file
As I understand it, the next thing I should do is:
read the first line of each temp file
sort these lines between each other (using a local variable to temporarily store them, though I am not sure it will stay below the size restriction)
write the first line (the result of the sorting) to the final file
now I need to remove the line I just wrote from its temporary file
repeat steps 1-4 until all lines are sorted and "transferred" from the temp files to the final file
I am most unsure about step 4 - is there a class that can look for a value and then erase the line containing it (at that point I won't even know which file the line came from)? I suspect this is not the proper way to reach my goal at all, but I need to remove the lines that have already been written out, and I can't hold the files' data in memory.
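To make steps 1-5 concrete, here is roughly the merge loop I have in mind; instead of physically erasing lines I keep one reader open per temp file and just advance it, but I am not sure this is the right approach or that it stays within the memory limit:

import java.io.*;
import java.util.*;

List<File> tempFiles = Arrays.asList(new File("part0.txt"), new File("part1.txt")); // placeholder names
List<BufferedReader> readers = new ArrayList<>();
for (File f : tempFiles) readers.add(new BufferedReader(new FileReader(f)));

String[] current = new String[readers.size()];
for (int i = 0; i < readers.size(); i++) current[i] = readers.get(i).readLine();

try (PrintWriter out = new PrintWriter(new FileWriter("result.txt"))) {
    while (true) {
        int min = -1;
        for (int i = 0; i < current.length; i++)        // find the smallest current line
            if (current[i] != null && (min < 0 || current[i].compareTo(current[min]) < 0))
                min = i;
        if (min < 0) break;                             // all temp files exhausted
        out.println(current[min]);                      // write it to the final file
        current[min] = readers.get(min).readLine();     // advance only that temp file
    }
}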
Do you need to do this in Java (I'm assuming so from the tag)? Memory-wise it isn't going to be an efficient approach. The simplest option in my opinion would be to use sort and just sort the file directly at the OS level.
This article will give you a guide on how to use sort: https://www.geeksforgeeks.org/sort-command-linuxunix-examples/
Sort is available on Windows as well as Unix/Linux and can handle huge files.
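If the job has to be driven from Java, one option is simply to shell out to it; a minimal sketch (file names are placeholders, and this shows the GNU sort switches - Windows' sort uses different ones):

// sort does its own on-disk merge sort, so the file size doesn't matter here.
Process p = new ProcessBuilder("sort", "-o", "sorted.txt", "huge-input.txt")
        .inheritIO()
        .start();
int exit = p.waitFor();   // 0 means the sort succeeded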
I have a large (~100GB) text file structured like this:
A,foobar
A,barfoo
A,foobar
B,barfoo
B,barfoo
C,foobar
Each line is a comma-separated pair of values. The file is sorted by the first value in the pair. The lines are variable length. Define a group as being all lines with a common first value; i.e., with the example quoted above, all lines starting with "A," would be one group and all lines starting with "B," would be another group.
The entire file is too large to fit into memory, but the lines from any individual group will always fit into memory.
I have a routine for processing a single such group of lines and writing to a text file. My problem is that I don't know how best to read the file a group at a time. All the groups are of arbitrary, unknown size. I have considered two ways:
1) Scan the file using a BufferedReader, accumulating the lines from a group in a String or array. Whenever a line is encountered that belongs to a new group, hold that line in a temporary variable, process the previous group. Clear the accumulator, add the temporary and then continue reading the new group starting from the second line.
2) Scan the file using a BufferedReader, whenever a line is encountered that belongs to a new group, somehow reset the cursor so that when readLine() is next invoked it starts from the first line of the group instead of the second. I have looked into mark() and reset() but these require knowing the byte-position of the start of the line.
I'm going to go with (1) at the moment, but I would be very grateful if someone could suggest a method that smells less.
I think a PushbackReader would work (construct it with a pushback buffer at least as large as your longest line, since the default buffer holds only a single character):
if (lineBelongsToNewGroup) {
    // Push the line terminator back first, then the line itself, so the next
    // read sees the whole line (followed by its newline) again.
    reader.unread('\n');
    reader.unread(lastLine.toCharArray());
}
I think option 1 is the simplest. I would parse the text yourself rather than use a BufferedReader, as it will take a long time to parse 100 GB.
The only option which is likely to be faster is to use a binary search, accessing the file with a RandomAccessFile. You can memory-map 100 GB on a 64-bit JVM. This avoids the need to parse every line, which is pretty expensive. An advantage of this approach is that you can use multiple threads. It is far, far more complicated to implement, but should be much faster. Once you have found each boundary, you can copy the raw data in bulk without having to parse all the lines.
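For reference, a minimal sketch of option (1) with a plain BufferedReader (slower than hand-rolled parsing, but it shows the shape of the loop; processGroup stands in for your existing per-group routine):

import java.io.*;
import java.util.*;

try (BufferedReader reader = new BufferedReader(new FileReader("input.txt"))) {
    List<String> group = new ArrayList<>();
    String currentKey = null;
    String line;
    while ((line = reader.readLine()) != null) {
        String key = line.substring(0, line.indexOf(','));    // first value of the pair
        if (currentKey != null && !key.equals(currentKey)) {
            processGroup(currentKey, group);                   // handle the finished group
            group.clear();
        }
        currentKey = key;
        group.add(line);
    }
    if (!group.isEmpty()) processGroup(currentKey, group);     // don't forget the last group
}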
I am trying to download a file from a server in a user-specified number of parts (n). So there is a file of x bytes divided into n parts, with each part downloading a piece of the whole file at the same time. I am using threads to implement this, but I have not worked with HTTP before and do not really understand how downloading a file really works. I have read up on it and it seems "Range" needs to be used, but I do not know how to download the different parts and combine them without corrupting the data.
(Since it's a homework assignment I will only give you a hint)
Appending to a single file will not help you at all, since this will mess up the data. You have two alternatives:
Download from each thread to a separate temporary file and then merge the temporary files in the right order to create the final file. This is probably easier to conceive, but a rather ugly and inefficient approach.
Do not stick to the usual stream-style semantics - use random access (1, 2) to write data from each thread straight to the right location within the output file.
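To illustrate the second alternative without giving the whole solution away, the writing side looks roughly like this (startOffset, buffer and length are placeholders for whatever your Range computation and download produced):

import java.io.RandomAccessFile;

// Each thread seeks to its own offset in the shared output file and writes there.
try (RandomAccessFile out = new RandomAccessFile("output.bin", "rw")) {
    out.seek(startOffset);           // first byte of this thread's range
    out.write(buffer, 0, length);    // the bytes received for this range
}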
I have a relatively strange question.
I have a file that is 6 gigabytes long. What I need to do is scan the entire file, line by line, and determine all rows that match an id number of any other row in the file. Essentially, it's like analyzing a web log file where there are many session ids that are organized by the time of each click rather than by userID.
I tried to do the simple (dumb) thing, which was to create two file readers: one that scans the file line by line getting the userID, and a second that 1) verifies the userID has not been processed already and 2) if it hasn't, reads every line in the file that begins with that userID and stores (some value X, related to the rows).
Any advice or tips on how I can make this process work more efficiently?
Import file into SQL database
Use SQL
Performance!
Seriously, that's it. Databases are optimized for exactly this kind of thing. Alternatively, if you have a machine with enough RAM, just put all the data into a HashMap for easy lookup.
Easiest: create a data model, import the file into a database, and take advantage of the power of JDBC and SQL. If necessary (e.g. when the file format is pretty specific), you can write some Java which imports it line by line with the help of, among others, BufferedReader#readLine() and PreparedStatement#addBatch().
Hardest: write your Java code so that it doesn't unnecessarily keep large amounts of data in memory. You're then basically reinventing what the average database already does.
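A minimal sketch of that import loop (the table, the columns, the connection details and extractUserId() are all made up for illustration):

import java.io.*;
import java.sql.*;

try (BufferedReader reader = new BufferedReader(new FileReader("access.log"));
     Connection con = DriverManager.getConnection(url, user, password);
     PreparedStatement ps = con.prepareStatement(
             "INSERT INTO log_line (user_id, line) VALUES (?, ?)")) {
    String line;
    int count = 0;
    while ((line = reader.readLine()) != null) {
        ps.setString(1, extractUserId(line));        // however you parse the ID out
        ps.setString(2, line);
        ps.addBatch();
        if (++count % 1000 == 0) ps.executeBatch();  // flush a batch every 1000 rows
    }
    ps.executeBatch();                               // flush the remaining rows
}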
For each row R in the file {
    Let N be the number that you need to extract from R.
    Check if there is a file called N. If not, create it.
    Append R to the file called N.
}
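A rough Java equivalent (assuming the number is the first comma-separated field; the file names are illustrative):

import java.io.*;
import java.nio.file.*;

try (BufferedReader in = new BufferedReader(new FileReader("big.log"))) {
    String row;
    while ((row = in.readLine()) != null) {
        String n = row.substring(0, row.indexOf(','));      // the number extracted from R
        // CREATE + APPEND creates the file called N if needed, then appends R to it.
        Files.write(Paths.get(n + ".txt"),
                (row + System.lineSeparator()).getBytes(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}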
How much data are you storing about each line, compared with the size of the line? Do you have enough memory to maintain the state for each distinct ID (e.g. number of log lines seen, number of exceptions or whatever)? That's what I'd do if possible.
Otherwise, you'll either need to break the log file into separate chunks (e.g. split it based on the first character of the ID) and then parse each file separately, or perhaps have some way of pretending you have enough memory to maintain the state for each distinct ID: have an in-memory cache which dumps values to disk (or reads them back) only when it has to.
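If the per-ID state does fit in memory, the whole job is a single streaming pass over the file; a sketch (assuming the ID is the first comma-separated field and the state is just a line count):

import java.io.*;
import java.util.*;

Map<String, Integer> linesPerId = new HashMap<>();
try (BufferedReader in = new BufferedReader(new FileReader("big.log"))) {
    String line;
    while ((line = in.readLine()) != null) {
        String id = line.substring(0, line.indexOf(','));   // assumed: ID is the first field
        linesPerId.merge(id, 1, Integer::sum);              // per-ID state, here just a count
    }
}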
You don't mention whether this is a regular, ongoing thing or an occasional check.
Have you considered pre-processing the data? Not practical for dynamic data, but if you can sort it based on the field you're interested in, it makes solving the problem much easier. Extracting only the fields you care about may reduce the data volume to a more manageable size as well.
A lot of the other advice here is good, but it assumes that you'll be able to load what you need into memory without running out. If you can do that, it would be better than the 'worst case' solution I'm describing.
If you have large files you may end up needing to sort them first. In the past I've dealt with multiple large files where I needed to match them up based on a key (sometimes the matches were in all files, sometimes only in a couple, etc.). If this is the case, the first thing you need to do is sort your files. Hopefully you're on a box where you can easily do this (for example, there are many good Unix scripts for this). After you've sorted each file, read through the files until you get matching IDs, then process them.
I'd suggest:
1. Open both files and read the first record
2. See if you have matching IDs and process accordingly
3. Read on in the file(s) for the key just processed and do step 2 again, until EOF.
For example, if you had the keys 1, 2, 5, 8 in FILE1 and 2, 3, 5, 9 in FILE2, you'd:
1. Open and read both files (FILE1 has ID 1, FILE2 has ID 2).
2. Process 1.
3. Read FILE1 (FILE1 has ID 2)
4. Process 2.
5. Read FILE1 (ID 5) and FILE2 (ID 3)
6. Process 3.
7. Read FILE2 (ID 5)
8. Process 5.
9. Read FILE1 (ID 8) and FILE2 (ID 9).
10. Process 8.
11. Read FILE1 (EOF....no more FILE1 processing).
12. Process 9.
13. Read FILE2 (EOF....no more FILE2 processing).
Make sense?
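In code, that walk-through is a standard sorted-merge loop; a rough sketch (idOf() and process() stand in for your own parsing and per-ID handling, and the IDs are compared as strings here, so parse them numerically if they're numbers):

import java.io.*;

try (BufferedReader r1 = new BufferedReader(new FileReader("FILE1"));
     BufferedReader r2 = new BufferedReader(new FileReader("FILE2"))) {
    String a = r1.readLine(), b = r2.readLine();
    while (a != null || b != null) {
        String idA = (a == null) ? null : idOf(a);
        String idB = (b == null) ? null : idOf(b);
        if (idB == null || (idA != null && idA.compareTo(idB) < 0)) {
            process(idA, a, null);            // key present only in FILE1
            a = r1.readLine();
        } else if (idA == null || idA.compareTo(idB) > 0) {
            process(idB, null, b);            // key present only in FILE2
            b = r2.readLine();
        } else {
            process(idA, a, b);               // matching IDs in both files
            a = r1.readLine();
            b = r2.readLine();
        }
    }
}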