any quick sorting for a huge csv file - java

I am looking for a Java implementation of a sorting algorithm. The file could be HUGE, say 20,000 x 600 = 12,000,000 lines of records. Each line is comma-delimited with 37 fields, and we use 5 of those fields as keys. Is it possible to sort it quickly, say within 30 minutes?
If you have an approach other than Java, it is welcome as long as it can be easily integrated into a Java system. For example, a Unix utility.
Thanks.
Edit: The lines that need to be sorted are spread across 600 files, with 20,000 lines each, about 4 MB per file. In the end I would like them to form one big sorted file.
I am trying to time a Unix sort and will update this afterwards.
Edit:
I appended all the files into one big file and tried the Unix sort function; it is pretty good. Sorting a 2 GB file takes 12-13 minutes. Appending the 600 files took 4 minutes.
sort -t ',' -k 1,1 -k 4,7 -k 23,23 -k 2,2r big.txt -o sorted.txt

How does the data get in the CSV format? Does it come from a relational database? You can make it such that whatever process creates the file writes its entries in the right order so you don't have to solve this problem down the line.
If you are doing a simple lexicographic sort you can try the Unix sort, but I am not sure how it will perform on a file of that size.

Calling the Unix sort program should be efficient. It does multiple passes to ensure it is not a memory hog. You can fork a process with Java's Runtime, but the output of the process is redirected, so you have to do some juggling to get the redirection to work right:
public static void sortInUnix(File fileIn, File sortedFile)
        throws IOException, InterruptedException {
    String[] cmd = {
            "cmd", "/c",
            // above should be changed to "sh", "-c" if on a Unix system
            "sort " + fileIn.getAbsolutePath() + " > "
                    + sortedFile.getAbsolutePath() };
    Process sortProcess = Runtime.getRuntime().exec(cmd);
    // capture error messages (if any)
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(sortProcess.getErrorStream()));
    String outputS = reader.readLine();
    while (outputS != null) {
        System.err.println(outputS);
        outputS = reader.readLine();
    }
    sortProcess.waitFor();
}

Use the Java library big-sorter, which is published to Maven Central and has an optional dependency on commons-csv for CSV processing. It handles files of any size by splitting into intermediate files, then sorting and merging the intermediate files repeatedly until only one remains. Note also that the max group size for a merge is configurable (useful when there are a large number of input files).
Here's an example:
Given the CSV file below, we will sort on the second column (the "number" column):
name,number,cost
WIPER BLADE,35,12.55
ALLEN KEY 5MM,27,3.80
Serializer<CSVRecord> serializer = Serializer.csv(
        CSVFormat.DEFAULT
                .withFirstRecordAsHeader()
                .withRecordSeparator("\n"),
        StandardCharsets.UTF_8);

Comparator<CSVRecord> comparator = (x, y) -> {
    int a = Integer.parseInt(x.get("number"));
    int b = Integer.parseInt(y.get("number"));
    return Integer.compare(a, b);
};

Sorter.serializer(serializer)
      .comparator(comparator)
      .input(inputFile)
      .output(outputFile)
      .sort();
The result is:
name,number,cost
ALLEN KEY 5MM,27,3.80
WIPER BLADE,35,12.55
I created a CSV file with 12 million rows and 37 columns and filled the grid with random integers between 0 and 100,000. I then sorted the 2.7 GB file on the 11th column using big-sorter, and it took 8 minutes single-threaded on an i7 with an SSD and the max heap set to 512 MB (-Xmx512m).
See the project README for more details.

Java Lists can be sorted; you could start there.
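For data that actually fits in memory, that could look like the sketch below, which compares comma-delimited lines on hypothetical key fields (field positions and sample data are made up):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class InMemoryCsvSort {
    public static void main(String[] args) {
        List<String> lines = new ArrayList<>(List.of(
                "beta,2,x", "alpha,10,y", "alpha,2,z"));
        // Order by field 0 as text, then by field 1 numerically
        Comparator<String> byKeys = Comparator
                .comparing((String s) -> s.split(",")[0])
                .thenComparingInt(s -> Integer.parseInt(s.split(",")[1]));
        lines.sort(byKeys);
        System.out.println(lines); // [alpha,2,z, alpha,10,y, beta,2,x]
    }
}
```

At 12 million rows this only works if the JVM heap can hold the whole file, so it is more a starting point than a full solution.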

Python on a big server.
import csv

def sort_key( aRow ):
    return aRow['this'], aRow['that'], aRow['the other']

with open('some_file.csv', 'r', newline='') as source:
    rdr = csv.DictReader( source )
    data = [ row for row in rdr ]
    fields = rdr.fieldnames

data.sort( key=sort_key )

with open('some_file_sorted.csv', 'w', newline='') as target:
    wtr = csv.DictWriter( target, fields )
    wtr.writeheader()
    wtr.writerows( data )
This should be reasonably quick. And it's very flexible.
On a small machine, break this into three passes: decorate, sort, undecorate
Decorate:
import csv

def sort_key( aRow ):
    return aRow['this'], aRow['that'], aRow['the other']

with open('some_file.csv', 'r', newline='') as source:
    rdr = csv.DictReader( source )
    with open('temp.txt', 'w') as target:
        for row in rdr:
            line = ','.join( row[f] for f in rdr.fieldnames )
            target.write( '|'.join( map(str, sort_key(row)) ) + '|' + line + '\n' )
Part 2 is the operating-system sort, using "|" as the field separator.
Undecorate:
with open('sorted_temp.txt', 'r') as source:
    with open('sorted.csv', 'w') as target:
        for row in source:
            keys, _, data = row.rpartition('|')
            target.write( data )

You don't mention the platform, so it is hard to judge the time specified. 12x10^6 records isn't that many, but sorting is a pretty intensive task. With 37 fields at, say, 100 bytes/field, that would be about 45 GB, which is a bit much for most machines. But if the records average 10 bytes/field, your server should be able to fit the entire file in RAM, which would be ideal.
My suggestion: Break the file into chunks that are 1/2 the available RAM, sort each chunk, then merge-sort the resulting sorted chunks. This lets you do all of your sorting in memory rather than hitting swap, which is what I suspect is causing any slow-down.
Say (1G chunks, in a directory you can play around in):
split --line-bytes=1000000000 original_file chunk

for each in chunk*
do
    sort $each > $each.sorted
done

sort -m chunk*.sorted > original_file.sorted

Since your data set is huge, as you mentioned, sorting it all in one go will be time-consuming depending on your machine (if you try quicksort).
But since you would like it done within 30 minutes, I suggest you have a look at MapReduce, using Apache Hadoop as your application server.
Keep in mind it's not an easy approach, but in the longer run you can easily scale up as your data size grows.
I am also pointing you to an excellent link on Hadoop setup: work your way through the single-node setup and then move to a Hadoop cluster.
I would be glad to help if you get stuck anywhere.

You really do need to make sure you have the right tools for the job. (Today I am hoping to get a 3.8 GHz PC with 24 GB of memory for home use. It's been a while since I bought myself a new toy. ;)
However, if you want to sort these lines and you don't have enough hardware, you don't need to break up the data, because it's in 600 files already.
Sort each file individually, then do a 600-way merge sort (you only need to keep 600 lines in memory at once). It's not as simple as sorting them all at once, but you could probably do it on a mobile phone. ;)
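That 600-way merge can be sketched with a PriorityQueue keyed on each source's current head line. This is a minimal version using plain lexicographic comparison; extracting the real five-field key is left out:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class KWayMerge {

    // Merge many already-sorted sources, keeping only one line per
    // source in memory (600 files -> 600 resident lines).
    public static void merge(List<BufferedReader> readers, Writer out)
            throws IOException {
        PriorityQueue<Map.Entry<String, BufferedReader>> heads =
                new PriorityQueue<>(Map.Entry.comparingByKey());
        for (BufferedReader r : readers) {
            String line = r.readLine();
            if (line != null) heads.add(Map.entry(line, r));
        }
        while (!heads.isEmpty()) {
            Map.Entry<String, BufferedReader> head = heads.poll();
            out.write(head.getKey());
            out.write('\n');
            String next = head.getValue().readLine();
            if (next != null) heads.add(Map.entry(next, head.getValue()));
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter merged = new StringWriter();
        merge(List.of(new BufferedReader(new StringReader("a\nd")),
                      new BufferedReader(new StringReader("b\nc"))), merged);
        System.out.print(merged); // a, b, c, d - one per line
    }
}
```

Each poll/offer on the queue costs O(log 600), so the merge is a single sequential pass over all the data.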

Since you have 600 smaller files, it could be faster to sort all of them concurrently. This will eat up 100% of the CPU. That's the point, correct?
waitlist=
for f in ${SOURCE}/*
do
    sort -t ',' -k 1,1 -k 4,7 -k 23,23 -k 2,2r -o ${f}.srt ${f} &
    waitlist="$waitlist $!"
done
wait $waitlist

LIST=`echo $SOURCE/*.srt`
sort --merge -t ',' -k 1,1 -k 4,7 -k 23,23 -k 2,2r -o sorted.txt ${LIST}
This will sort 600 small files all at the same time and then merge the sorted files. It may be faster than trying to sort a single large file.

Use Hadoop MapReduce to do the sorting. For Java, I recommend Spring Data Hadoop.

Well, since you're talking about HUGE datasets, you'll need some external sorting algorithm anyway. There are implementations for Java and pretty much any other language out there; since the result has to be stored on disk anyway, which language you use is fairly uninteresting.

Related

GC overhead limit exceeded when reading large pipe-separated values file

I am trying to parse a large file (6.5 million rows) but am getting the mentioned out-of-memory error. I use this same method to read other files of around 50K rows, and it works fairly quickly. Here it runs extremely slowly, then fails with the error. I originally had 2 GB dedicated to IntelliJ, which I changed to 4 GB (-Xmx4000m), then 6 GB (-Xmx6000m), and still finish with the same error. My computer only has 8 GB of RAM so I can't go any higher. Any suggestions?
Thanks!
public static List<UmlsEntry> umlsEntries(Resource resource) throws IOException {
    return CharStreams.readLines(new InputStreamReader(resource.getInputStream()))
            .stream().distinct()
            .map(UmlsParser::toUmlsEntry)
            .collect(Collectors.toList());
}

private static UmlsEntry toUmlsEntry(String line) {
    String[] umlsEntry = line.split("\\|"); // split() takes a regex, so '|' must be escaped
    return new UmlsEntry(umlsEntry[UNIQUE_IDENTIFIER_FOR_CONCEPT_COLUMN_INDEX],
            umlsEntry[LANGUAGE_OF_TERM_COLUMN_INDEX], umlsEntry[TERM_STATUS_COLUMN_INDEX],
            umlsEntry[UNIQUE_IDENTIFIER_FOR_TERM_COLUMN_INDEX], umlsEntry[STRING_TYPE_COLUMN_INDEX],
            umlsEntry[UNIQUE_IDENTIFIER_FOR_STRING_COLUMN_INDEX],
            umlsEntry[IS_PREFERRED_STRING_WITHIN_THIS_CONCEPT_COLUMN_INDEX],
            umlsEntry[UNIQUE_IDENTIFIER_FOR_ATOM_COLUMN_INDEX],
            umlsEntry[SOURCE_ASSERTED_ATOM_INDENTIFIER_COLUMN_INDEX],
            umlsEntry[SOURCE_ASSERTED_CONCEPT_IDENTIFIER_COLUMN_INDEX],
            umlsEntry[SOURCE_ASSERTED_DESCRIPTOR_IDENTIFIER_COLUMN_INDEX],
            umlsEntry[ABBREVIATED_SOURCE_NAME_COLUMN_IDENTIFIER_COLUMN_INDEX],
            umlsEntry[ABBREVIATION_FOR_TERM_TYPE_IN_SOURCE_VOCABULARY_COLUMN_INDEX],
            umlsEntry[MOST_USEFUL_SOURCE_ASSERTED_IDENTIFIER_COLUMN_INDEX],
            umlsEntry[STRING_COLUMN_INDEX],
            umlsEntry[SOURCE_RESTRICTION_LEVEL_COLUMN_INDEX],
            umlsEntry[SUPPRESSIBLE_FLAG_COLUMN_INDEX],
            umlsEntry[CONTENT_VIEW_FLAG_COLUMN_INDEX]);
}
You need to process the lines a few at a time to avoid using up all available memory, since the file doesn't fit in memory. CharStreams.readLines, confusingly, isn't streaming: it reads all lines at once and returns you a list. That won't work here; try Files.lines instead. I suspect that you will get into trouble with distinct as well. It needs to keep track of the hashes of all lines, and if this balloons too far you might have to change that tactic as well. Oh, and collect won't work either if you don't have enough memory to hold the result; then you might want to write to a new file or a database instead.
Here is an example of how you can stream lines from a file, compute distinct entries and print the md5 of each line:
Files.lines(FileSystems.getDefault().getPath("/my/file"))
        .distinct()
        .map(DigestUtils::md5)
        .forEach(System.out::println);
If you run into trouble detecting distinct rows, sort the file in place first and then filter out identical adjacent rows only.
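A sketch of that sort-then-dedupe step: once the file is externally sorted, duplicates sit next to each other, so only the previous line needs to stay in memory (class and method names here are made up):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class AdjacentDedup {

    // Copy a sorted file, skipping lines equal to their predecessor.
    // Returns the number of lines kept.
    public static long dedupe(Path sortedIn, Path out) throws IOException {
        long kept = 0;
        String prev = null;
        try (Stream<String> lines = Files.lines(sortedIn);
             BufferedWriter w = Files.newBufferedWriter(out)) {
            for (String line : (Iterable<String>) lines::iterator) {
                if (!line.equals(prev)) {
                    w.write(line);
                    w.newLine();
                    kept++;
                }
                prev = line;
            }
        }
        return kept;
    }
}
```

Unlike distinct(), this holds one line at a time regardless of file size, at the cost of an external sort beforehand.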

HBase - Java Client Limit Scan Results Explanation?

Is there a way to limit the total number of rows of sequential data when performing a scan of the data?
Notes:
I'm working with 500,000 total rows
I've tried both setMaxResultSize and setMaxResultsPerColumnFamily. These prove ineffectual (there does seem to be some behavior when both are set to low numbers or when setMaxResultSize is larger; what is the relationship between these two functions?)
I've worked with setting a PageFilter (size 10), and the behavior displays 5 different sequence data sets of 10.
I actually got it pseudo-working while typing this out by setting the PageFilter size and setMaxResultSize equal. When I change either, it conforms to the PageFilter. It will also jump to another large subset of PageFilter size if I make setMaxResultSize significantly larger.
HBase version is 1.1.1
Can someone better explain what is happening here and how to get the results I want?
You can use either the HBase shell or the HBase Java client.
1- HBase shell: use this command, pipe the results to a file, and do "wc -l ..." on it:
count 'table_name',1
2- Java HBase client API:
long count = 0;
String row = "";
for (Result res : scanner) {
    for (Cell cell : res.listCells()) {
        row = new String(CellUtil.cloneRow(cell));
        if (!row.equals(""))
            count++;
    }
}

Construct document-term matrix via Java and MapReduce

Background:
I’m trying to make a “document-term” matrix in Java on Hadoop using MapReduce. A document-term matrix is like a huge table where each row represents a document and each column represents a possible word/term.
Problem Statement:
Assuming that I already have a term index list (so that I know which term is associated with which column number), what is the best way to look up the index for each term in each document so that I can build the matrix row-by-row (i.e., document-by-document)?
So far I can think of two approaches:
Approach #1:
Store the term index list on the Hadoop distributed file system. Each time a mapper reads a new document for indexing, spawn a new MapReduce job -- one job for each unique word in the document, where each job queries the distributed terms list for its term. This approach sounds like overkill, since I am guessing there is some overhead associated with starting up a new job, and since this approach might call for tens of millions of jobs. Also, I’m not sure if it’s possible to call a MapReduce job within another MapReduce job.
Approach #2:
Append the term index list to each document so that each mapper ends up with a local copy of the term index list. This approach is pretty wasteful with storage (there will be as many copies of the term index list as there are documents). Also, I’m not sure how to merge the term index list with each document -- would I merge them in a mapper or in a reducer?
Question Update 1
Input File Format:
The input file will be a CSV (comma separated value) file containing all of the documents (product reviews). There is no column header in the file, but the values for each review appear in the following order: product_id, review_id, review, stars. Below is a fake example:
“Product A”, “1”,“Product A is very, very expensive.”,”2”
“Product G”, ”2”, “Awesome product!!”, “5”
Term Index File Format:
Each line in the term index file consists of the following: an index number, a tab, and then a word. Each possible word is listed only once in the index file, such that the term index file is analogous to what could be a list of primary keys (the words) for an SQL table. For each word in a particular document, my tentative plan is to iterate through each line of the term index file until I find the word. The column number for that word is then defined as the column/term index associated with that word. Below is an example of the term index file, which was constructed using the two example product reviews mentioned earlier.
1 awesome
2 product
3 a
4 is
5 very
6 expensive
Output File Format:
I would like the output to be in the “Matrix Market” (MM) format, which is the industry standard for compressing matrices with many zeros. This is the ideal format because most reviews will contain only a small proportion of all possible words, so for a particular document it is only necessary to specify the non-zero columns.
The first row in the MM format has three tab separated values: the total number of documents, the total number of word columns, and the total number of lines in the MM file excluding the header. After the header, each additional row contains the matrix coordinates associated with a particular entry, and the value of the entry, in this order: reviewID, wordColumnID, entry (how many times this word appears in the review). For more details on the Matrix Market format, see this link: http://math.nist.gov/MatrixMarket/formats.html.
Each review’s ID will equal its row index in the document-term matrix. This way I can preserve the review’s ID in the Matrix Market format so that I can still associate each review with its star rating. My ultimate goal -- which is beyond the scope of this question -- is to build a natural language processing algorithm to predict the number of stars in a new review based on its text.
Using the example above, the final output file would look like this (I can't get Stackoverflow to show tabs instead of spaces):
2 6 7
1 2 1
1 3 1
1 4 1
1 5 2
1 6 1
2 1 1
2 2 1
Well, you can use something analogous to an inverted index concept.
I'm suggesting this because I'm assuming both files are big, so comparing them one-to-one would be a real performance bottleneck.
Here's a way that can be used:
You can feed both the Input File Format csv file(s) (say, datafile1, datafile2) and the term index file (say, term_index_file) as input to your job.
Then in each mapper, you filter the source file name, something like this -
Pseudo code for mapper -
map(key, row, context) {
    String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
    if (filename.startsWith("datafile")) {
        // split the review_id and words from the row
        ....
        context.write(new Text("word"), new Text("-1 | review_id"));
    } else if (filename.startsWith("term_index_file")) {
        // split index and word
        ....
        context.write(new Text("word"), new Text("index | 0"));
    }
}
e.g. output from different mappers
Key Value source
product -1|1 datafile
very 5|0 term_index_file
very -1|1 datafile
product -1|2 datafile
very -1|1 datafile
product 2|0 term_index_file
...
...
Explanation (the example):
As it clearly shows, the key will be your word, and the value consists of two parts separated by the delimiter "|":
If the source is a datafile, you emit key=product and value=-1|1, where -1 is a dummy element and 1 is a review_id.
If the source is the term_index_file, you emit key=product and value=2|0, where 2 is the index of the word 'product' and 0 is a dummy review_id, which we will use for sorting (explained later).
Definitely, no duplicate index will be processed by two different mappers, since we are providing the term_index_file as a normal input file to the job.
So 'product', 'very', or any other indexed word in the term_index_file will only be available to one mapper. Note this is only valid for the term_index_file, not for the datafile.
Next step:
Hadoop mapreduce framework, as you might well know, will group by keys
So, you will have something like this going to different reducers,
reduce-1: key=product, value=<-1|1, -1|2, 2|0>
reduce-2: key=very, value=<5|0, -1|1, -1|1>
But we have a problem in the above case: we want the values sorted on the part after '|', i.e. in reduce-1 -> <2|0, -1|1, -1|2> and in reduce-2 -> <5|0, -1|1, -1|1>.
To achieve that you can use a secondary sort, implemented with a sort comparator. Please google this; here's a link that might help. Explaining it here would get really lengthy.
In reduce-1, since the values are sorted as above, the first iteration gives us the '0' entry and with it index_id=2, which we can then use in the subsequent iterations. In the next two iterations we get review ids 1 and 2 consecutively. We use a counter to keep track of repeated review ids; a repeated review id means the word appeared more than once in the same review_id row. We reset the counter only when we find a different review_id, and at that point emit the previous review_id's details for the particular index_id, something like this:
previous_review_id + "\t" + index_id + "\t" + count
When the loop ends, we'll be left with a single previous_review_id, which we finally emit in the same fashion.
Pseudo code for reducer -
reduce(key, Iterable values, context) {
    String index_id = null;
    int count = 1;
    String previousReview_id = null;
    for (value : values) {
        String[] split = value.split("\\|");
        ....
        // when consecutive review_ids are the same, we increment count;
        // as soon as the review_id differs, we emit the previously seen
        // review_id and reset the counter.
        if (split[0].equals("-1") && split[1].equals(previousReview_id)) {
            count++;
        } else if (split[0].equals("-1") && !split[1].equals(previousReview_id)) {
            context.write(previousReview_id + "\t" + index_id + "\t" + count);
            previousReview_id = split[1]; // restarting with the new review_id
            count = 1;                    // resetting count for the new review_id
        } else {
            index_id = split[0];
        }
    }
    // the last previousReview_id is left over,
    // so we write it after the loop completes
    context.write(previousReview_id + "\t" + index_id + "\t" + count);
}
This job is done with multiple reducers in order to leverage Hadoop for what it is best known for, performance; as a result, the final output will be scattered, something like the following, deviating from your desired output.
1 4 1
2 1 1
1 5 2
1 2 1
1 3 1
1 6 1
2 2 1
But if you want everything sorted by review_id (your desired output), you can write one more job that does that for you, using a single reducer and the output of the previous job as input, and at the same time calculate the "2 6 7" header and put it at the front of the output.
This is just an approach (or an idea) that I think might help you. You will definitely want to modify it, plug in a better algorithm, and use it in the way that benefits you.
You can also use composite keys for better clarity than a delimiter such as "|".
I am open to any clarification; please ask if you think it might be useful to you.
Thank you!
You can load the term index list in Hadoop distributed cache so that it is available to mappers and reducers. For instance, in Hadoop streaming, you can run your job as follows:
$ hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-*.jar \
-input myInputDirs \
-output myOutputDir \
-mapper myMapper.py \
-reducer myReducer.py \
-file myMapper.py \
-file myReducer.py \
-file myTermIndexList.txt
Now in myMapper.py you can load the file myTermIndexList.txt and use it to your purpose. If you give a more detailed description of your input and desired output I can give you more details.
Approach #1 is not good, but it is very common when you don't have much Hadoop experience. Starting jobs is very expensive. What you want instead is 2-3 jobs that feed into each other to get the desired result. A common solution to similar problems is to have the mapper tokenize the input and emit pairs, group them in the reducer while executing some kind of calculation, and then feed that into job 2. In job 2's mapper you invert the data in some way, and in its reducer you do some other calculation.
I would highly recommend learning more about Hadoop through a training course. Interestingly, Cloudera's dev course has a very similar problem to the one you are trying to address. Alternatively, or perhaps in addition to a course, I would look at "Data-Intensive Text Processing with MapReduce", specifically the sections on "Computing Relative Frequencies" and "Inverted Indexing for Text Retrieval":
http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf

Util method to get Line by Line#

Is there any Util method to get the line contents by Line# from given file?
The simplest approach is to read all the lines into a list and look up the line by number in this list. You can use
List<String> lines = FileUtils.readLines(file);
My file is 3GB and I don't want to store all the lines in my java memory
I would make sure you have plenty of memory. You can buy 32 GB for less than $200.
However, assuming that is not an option, you can index the file by reading it once and storing the offset of each line in another file. The offsets could be 32-bit, but it is simpler and more scalable to use 64-bit offsets.
You can then look up the offset of a line and of the next one to determine where to read each line. I would expect each lookup to take about 10 microseconds if implemented efficiently.
BTW: If you had it loaded in Java memory it would be about 100x faster.
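A sketch of the offset-index idea (here the index is kept in memory as a list of longs rather than in a second file, and the file is assumed to be plain ASCII/UTF-8 text; the class name is made up):

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class LineIndex {
    private final List<Long> offsets = new ArrayList<>();
    private final RandomAccessFile raf;

    // One sequential pass records where every line starts;
    // afterwards any line is reachable with a single seek.
    public LineIndex(File file) throws IOException {
        raf = new RandomAccessFile(file, "r");
        offsets.add(0L);
        long pos = 0;
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') offsets.add(pos);
            }
        }
    }

    // n is a 0-based line number
    public String line(int n) throws IOException {
        raf.seek(offsets.get(n));
        return raf.readLine();
    }
}
```

For a 3 GB file the index itself stays small: one long per line, so roughly 8 bytes x line count.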

Huge String Table in Java

I've got a question about storing a huge amount of Strings in application memory. I need to load from a file and store about 5 million lines, each of them at most 255 chars (URLs), but mostly around 50. From time to time I'll need to search for one of them. Is it possible to make this app runnable on ~1 GB of RAM?
Will
ArrayList <String> list = new ArrayList<String>();
work?
As far as I know, Strings in Java are encoded in a way that gives me huge memory use. Is it possible to make such an array with Strings coded in ANSI?
This is console application run with parameters:
java -Xmx1024M -Xms1024M -jar "PServer.jar" nogui
The latest JVMs support -XX:+UseCompressedStrings by default, which stores strings that contain only ASCII as a byte[] internally.
Having several GB of text in a List isn't a problem, but it can take a while to load from disk (many seconds)
If the average URL is 50 ASCII chars, with 32 bytes of overhead per String, 5 M entries would use about 400 MB, which isn't much for a modern PC or server.
A Java String is a full-blown object. This means that apart from the characters of the string themselves, there is other information to store in it (a pointer to the object's class, a counter with the number of pointers pointing to it, and some other infrastructure data). So an empty String already takes 45 bytes in memory (as you can see here).
Now you just have to add the maximum length of your strings and make some easy calculations to get the maximum memory use of that list.
Anyway, I would suggest you load the strings as byte[] if you have memory issues. That way you can control the encoding and you can still do searches.
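A sketch of that byte[] approach, assuming the URLs are ASCII: each entry costs one byte per character plus array overhead, and a sorted byte[][] can still be binary-searched (class name and sample URLs are made up):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteUrlTable {
    private final byte[][] urls;

    // Store each URL as US-ASCII bytes (1 byte/char) and keep the
    // array sorted so lookups can use binary search.
    public ByteUrlTable(String[] source) {
        urls = new byte[source.length][];
        for (int i = 0; i < source.length; i++) {
            urls[i] = source[i].getBytes(StandardCharsets.US_ASCII);
        }
        Arrays.sort(urls, Arrays::compare); // lexicographic byte order
    }

    public boolean contains(String url) {
        byte[] key = url.getBytes(StandardCharsets.US_ASCII);
        return Arrays.binarySearch(urls, key, Arrays::compare) >= 0;
    }
}
```

Lookups are O(log n) with no per-entry String objects; the trade-off is rebuilding (or re-sorting) the array when entries change.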
Is there some reason you need to restrict it to 1 GB? If you want to search through them, you definitely don't want to swap to disk, but if the machine has more memory it makes sense to go higher than 1 GB.
If you have to search, use a SortedSet, not an ArrayList
