Preprocessing CSV data efficiently before or while parallel streaming - java

I am looking for an efficient way to preprocess CSV data before (or while) dumping it to a java stream.
Under normal circumstances I would do something like this to process the file:
File input = new File("helloworld.csv");
InputStream is = new FileInputStream(input);
BufferedReader br = new BufferedReader(new InputStreamReader(is));
br.lines().parallel().forEach(line -> {
    System.out.println(line);
});
However, in this case I need to preprocess the records before or while streaming them, and each item in my collection can depend on the previous one. Here is a simple example CSV file to demonstrate the issue:
species, breed, name
dog, lab, molly
, greyhound, stella
, beagle, stanley
cat, siamese, toby
, persian, fluffy
In my example CSV the species column is only populated when it changes from record to record. I know the simple answer would be to fix my CSV output but in this case that is not possible.
I am looking for a reasonably efficient way to process the records from CSV, copying the species value from the prior record if blank, and then passing them to a parallel stream after preprocessing.
Downstream processing can take a long time so I ultimately need to process in parallel once preprocessing is complete. My CSV files can also be large so I would like to avoid loading the entire file into an object in memory first.
I was hoping there was some way to do something like the following (warning bad pseudocode):
parallelStream.startProcessing
while read line {
if (line.doesntHaveSpecies) {
line.setSpecies
}
parallelStream.add(line)
}
My current solution is to process the entire file and "fix it" then stream it. Since the file can be large, it would be nice to start processing records immediately after they have been "fixed" and before the entire file has been processed.

You have to encapsulate the state into a Spliterator.
private static Stream<String> getStream(BufferedReader br) {
    return StreamSupport.stream(
        new Spliterators.AbstractSpliterator<String>(
                100, Spliterator.ORDERED|Spliterator.NONNULL) {
            String prev;
            public boolean tryAdvance(Consumer<? super String> action) {
                try {
                    String next = br.readLine();
                    if(next==null) return false;
                    final int ix = next.indexOf(',');
                    if(ix==0) {
                        if(prev==null)
                            throw new IllegalStateException("first line without value");
                        next = prev+next;
                    }
                    else prev=ix<0? next: next.substring(0, ix);
                    action.accept(next);
                    return true;
                } catch (IOException ex) {
                    throw new UncheckedIOException(ex);
                }
            }
        }, false);
}
which can be used as
try(Reader r = new FileReader(input);
    BufferedReader br = new BufferedReader(r)) {
    getStream(br).forEach(System.out::println);
}
The Spliterator will always be traversed sequentially. If parallel processing is turned on, the Stream implementation will try to get new Spliterator instances for other threads by calling trySplit. Since we can’t offer an efficient strategy for that operation, we inherit the default from AbstractSpliterator, which does some array-based buffering. This will always work correctly, but it only pays off if you have heavy computations in the subsequent stream pipeline. Otherwise, you may simply stay with sequential execution.
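For example, if the per-record work is expensive, the pipeline below can still benefit from parallel() even though the lines are read sequentially. This is only a sketch, and expensiveTransform is a hypothetical stand-in for the real downstream work:
File input = new File("helloworld.csv");
try (Reader r = new FileReader(input);
     BufferedReader br = new BufferedReader(r)) {
    getStream(br)
        .parallel()                             // reading stays sequential; the heavy map runs on worker threads
        .map(line -> expensiveTransform(line))  // hypothetical expensive per-record step
        .forEach(System.out::println);
}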

You can't start with a parallel stream, because the lines have to be processed sequentially to get the species from the previous line. So we could introduce a side-effecting mapper:
final String[] species = new String[1];
final Function<String, String> speciesAppending = l -> {
    if (l.startsWith(",")) {
        return species[0] + l;
    } else {
        species[0] = l.substring(0, l.indexOf(','));
        return l;
    }
};
try (Stream<String> stream = Files.lines(new File("helloworld.csv").toPath())) {
    stream.map(speciesAppending).parallel()... // TODO
}

Related

Correct use of ForkJoinPool submit and join in Java

I've recently worked on an implementation of an external merge sort algorithm (External Sorting), and my implementation needed to use a multi-threaded approach.
I tried to use ForkJoinPool instead of the older constructs in Java such as Thread and ExecutorService. The first step of the algorithm requires reading a file and, every x lines, collecting them and sending them off to be sorted and written to a file. This action (sort and save) can be done in a separate thread while the main thread reads the next batch. I've written a method to do just that (see below).
My concern is that the actual parallel work is not started when I use ForkJoinPool.commonPool().submit(()->SortAndWriteToFile(lines, fileName)) but instead only when I call task.join() after the loop has finished. That would mean that on a large enough loop I'd be collating the tasks to be run but not gaining any time by running them. When I used invoke instead of submit, it seemed like I could not control where the join happens and could not guarantee all the work was done before moving on.
Is there a more correct way to implement this?
My code is below. The method and two utility methods are listed. I hope this is not too long.
protected int generateSortedFiles(String originalFileName, String destinationFilePrefix) {
    //Number of accumulated sorted blocks of size blockSize
    int blockCount = 0;
    //hold bufferSize number of lines from the file
    List<String> bufferLines = new ArrayList<String>();
    List<ForkJoinTask<?>> taskList = new ArrayList<ForkJoinTask<?>>();
    //Open file to read
    try (Stream<String> fileStream = Files.lines(Paths.get(originalFileName))) {
        //Iterate over bufferSize lines to add them to list.
        Iterator<String> lineItr = fileStream.iterator();
        while (lineItr.hasNext()) {
            //Add bufferSize lines to List
            for (int i = 0; i < bufferSize; i++) {
                if (lineItr.hasNext()) {
                    bufferLines.add(lineItr.next());
                }
            }
            //submit the task to sort and write to file in a separate thread
            String fileName = destinationFilePrefix + blockCount + ".csv";
            List<String> lines = Collections.unmodifiableList(bufferLines);
            taskList.add(ForkJoinPool.commonPool().submit(
                () -> SortAndWriteToFile(lines, fileName)));
            blockCount++;
            bufferLines = new ArrayList<String>();
        }
    } catch (IOException e) {
        System.out.println("read from file " + originalFileName + " has failed due to " + e);
    } catch (ArrayIndexOutOfBoundsException e) {
        System.out.println("the index provided was not available in the file "
            + originalFileName + " and the error is " + e);
    }
    flushParallelTaskList(taskList);
    return blockCount;
}
/**
 * This method takes lines, sorts them and writes them to file
 * @param lines the lines to be sorted
 * @param fileName the filename to write them to
 */
private void SortAndWriteToFile(List<String> lines, String fileName) {
    //Sort lines
    lines = lines.stream()
        .parallel()
        .sorted((e1, e2) -> e1.split(",")[indexOfKey].compareTo(e2.split(",")[indexOfKey]))
        .collect(Collectors.toList());
    //write the sorted block of lines to the destination file.
    writeBuffer(lines, fileName);
}
/**
 * Wait until all the threads finish, clear the list
 * @param writeList
 */
private void flushParallelTaskList(List<ForkJoinTask<?>> writeList) {
    for (ForkJoinTask<?> task : writeList) {
        task.join();
    }
    writeList.clear();
}
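For what it's worth, a small stand-alone sketch (not part of the question's code) illustrates the semantics being asked about: submit() schedules the task for asynchronous execution right away, and join() only blocks until it has finished.
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;

public class SubmitDemo {
    public static void main(String[] args) {
        ForkJoinTask<?> task = ForkJoinPool.commonPool().submit(() ->
            System.out.println("task running on " + Thread.currentThread().getName()));
        System.out.println("main thread continues while the task runs");
        task.join(); // waits for completion; the task was already scheduled by submit()
    }
}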

Java 8's Files.lines(): Performance concern for very long line

Java 8's stream API is convenient and has gained popularity. For file I/O, I found that two APIs are provided to generate stream output: Files.lines(path) and bufferedReader.lines().
I did not find a stream API which provides a Stream of fixed-size buffers for reading files, though.
My concern is: in the case of a file with a very long line, e.g. a 4GB file with only a single line, aren't these line-based APIs very inefficient?
The line-based reader will need at least 4GB of memory to hold that line.
Compare that to a fixed-size buffer reader (fileInputStream.read(byte[] b, int off, int len)), which takes at most the buffer size of memory.
If the above concern is true, are there any Stream APIs for file I/O which are more efficient?
If you have a 4GB text file with a single line, and you're processing it "line by line", then you've made a serious error in your programming by not understanding the data you're working with.
They're convenience methods for when you need to do simple work with data like CSV or other such formats, where the line sizes are manageable.
A real life example of a 4GB text file with a single line would be an XML file without line breaks. You would use a streaming XML parser to read that, not roll your own solution that reads line by line.
Which method of delivery is appropriate depends on how you want to process the data. If your processing requires handling the data line by line, there is no way around doing it that way.
If you really want fixed-size chunks of character data, you can use the following method(s):
public static Stream<String> chunks(Path path, int chunkSize) throws IOException {
    return chunks(path, chunkSize, StandardCharsets.UTF_8);
}
public static Stream<String> chunks(Path path, int chunkSize, Charset cs)
        throws IOException {
    Objects.requireNonNull(path);
    Objects.requireNonNull(cs);
    if(chunkSize<=0) throw new IllegalArgumentException();
    CharBuffer cb = CharBuffer.allocate(chunkSize);
    BufferedReader r = Files.newBufferedReader(path, cs);
    return StreamSupport.stream(
        new Spliterators.AbstractSpliterator<String>(
                Files.size(path)/chunkSize, Spliterator.ORDERED|Spliterator.NONNULL) {
            @Override public boolean tryAdvance(Consumer<? super String> action) {
                try { do {} while(cb.hasRemaining() && r.read(cb)>0); }
                catch (IOException ex) { throw new UncheckedIOException(ex); }
                if(cb.position()==0) return false;
                action.accept(cb.flip().toString());
                return true;
            }
        }, false).onClose(() -> {
            try { r.close(); } catch(IOException ex) { throw new UncheckedIOException(ex); }
        });
}
but I wouldn’t be surprised if your next question is “how can I merge adjacent stream elements”, as these fixed-size chunks are rarely the natural data unit for your actual task.
More often than not, the subsequent step is to perform pattern matching within the contents, and in that case it’s better to use Scanner in the first place, which is capable of performing pattern matching while streaming the data. This can be done efficiently, as the regex engine tells whether buffering more data could change the outcome of a match operation (see hitEnd() and requireEnd()). Unfortunately, generating a stream of matches from a Scanner has only been added in Java 9, but see this answer for a back-port of that feature to Java 8.
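For reference, the Java 9+ feature mentioned above looks roughly like this; the method name is mine, the pattern is supplied by the caller, and onClose ties the Scanner's lifetime to the returned stream:
// Java 9+ sketch: stream regex matches from a file without materializing whole lines.
static Stream<String> streamMatches(Path path, Pattern pattern) throws IOException {
    Scanner scanner = new Scanner(path);
    return scanner.findAll(pattern)         // Stream<MatchResult>, produced while scanning
                  .map(MatchResult::group)
                  .onClose(scanner::close); // close the Scanner when the stream is closed
}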

Spark java.lang.StackOverflowError

I'm using Spark to calculate the PageRank of user reviews, but I keep getting java.lang.StackOverflowError when I run my code on a big dataset (40k entries). When running the code on a small number of entries it works fine, though.
Entry Example :
product/productId: B00004CK40 review/userId: A39IIHQF18YGZA review/profileName: C. A. M. Salas review/helpfulness: 0/0 review/score: 4.0 review/time: 1175817600 review/summary: Reliable comedy review/text: Nice script, well acted comedy, and a young Nicolette Sheridan. Cusak is in top form.
The Code:
public void calculatePageRank() {
    sc.clearCallSite();
    sc.clearJobGroup();
    JavaRDD<String> rddFileData = sc.textFile(inputFileName).cache();
    sc.setCheckpointDir("pagerankCheckpoint/");
    JavaRDD<String> rddMovieData = rddFileData.map(new Function<String, String>() {
        @Override
        public String call(String arg0) throws Exception {
            String[] data = arg0.split("\t");
            String movieId = data[0].split(":")[1].trim();
            String userId = data[1].split(":")[1].trim();
            return movieId + "\t" + userId;
        }
    });
    JavaPairRDD<String, Iterable<String>> rddPairReviewData = rddMovieData.mapToPair(new PairFunction<String, String, String>() {
        @Override
        public Tuple2<String, String> call(String arg0) throws Exception {
            String[] data = arg0.split("\t");
            return new Tuple2<String, String>(data[0], data[1]);
        }
    }).groupByKey().cache();
    JavaRDD<Iterable<String>> cartUsers = rddPairReviewData.map(f -> f._2());
    List<Iterable<String>> cartUsersList = cartUsers.collect();
    JavaPairRDD<String, String> finalCartesian = null;
    int iterCounter = 0;
    for (Iterable<String> out : cartUsersList) {
        JavaRDD<String> currentUsersRDD = sc.parallelize(Lists.newArrayList(out));
        if (finalCartesian == null) {
            finalCartesian = currentUsersRDD.cartesian(currentUsersRDD);
        }
        else {
            finalCartesian = currentUsersRDD.cartesian(currentUsersRDD).union(finalCartesian);
            if (iterCounter % 20 == 0) {
                finalCartesian.checkpoint();
            }
        }
    }
    JavaRDD<Tuple2<String, String>> finalCartesianToTuple = finalCartesian.map(m -> new Tuple2<String, String>(m._1(), m._2()));
    finalCartesianToTuple = finalCartesianToTuple.filter(x -> x._1().compareTo(x._2()) != 0);
    JavaPairRDD<String, String> userIdPairs = finalCartesianToTuple.mapToPair(m -> new Tuple2<String, String>(m._1(), m._2()));
    JavaRDD<String> userIdPairsString = userIdPairs.map(new Function<Tuple2<String, String>, String>() {
        //Tuple2<Tuple2<MovieId, userId>, Tuple2<movieId, userId>>
        @Override
        public String call(Tuple2<String, String> t) throws Exception {
            return t._1 + " " + t._2;
        }
    });
    try {
        //calculate pagerank using this https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaPageRank.java
        JavaPageRank.calculatePageRank(userIdPairsString, 100);
    } catch (Exception e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    sc.close();
}
I have multiple suggestions which will help you to greatly improve the performance of the code in your question.
Caching: Caching should be used on datasets you need to refer to again and again for the same or different operations (iterative algorithms).
An example is RDD.count — to tell you the number of lines in the file, the file needs to be read. So if you write RDD.count, at this point the file will be read, the lines will be counted, and the count will be returned.
What if you call RDD.count again? The same thing: the file will be read and counted again. So what does RDD.cache do? Now, if you run RDD.count the first time, the file will be loaded, cached, and counted. If you call RDD.count a second time, the operation will use the cache. It will just take the data from the cache and count the lines, no recomputing.
Read more about caching here.
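A tiny illustration of that behavior (the file name here is just a placeholder):
JavaRDD<String> lines = sc.textFile("reviews.txt").cache(); // placeholder path
long first = lines.count();   // reads the file and populates the cache
long second = lines.count();  // answered from the cache, no re-read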
In your code sample you are not reusing anything that you've cached. So you may remove the .cache from there.
Parallelization: In the code sample, you've parallelized every individual element of your RDD, which is already a distributed collection. I suggest merging the rddFileData, rddMovieData and rddPairReviewData steps so that it all happens in one go, as sketched below.
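A sketch of what merging those steps could look like, reusing the parsing logic from the question (the lambda assumes the same tab- and colon-separated entry format):
// One pass from raw entries to grouped (movieId -> userIds), with no intermediate RDDs.
JavaPairRDD<String, Iterable<String>> rddPairReviewData = sc.textFile(inputFileName)
    .mapToPair(line -> {
        String[] data = line.split("\t");
        String movieId = data[0].split(":")[1].trim();
        String userId = data[1].split(":")[1].trim();
        return new Tuple2<>(movieId, userId);
    })
    .groupByKey();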
Get rid of .collect, since that brings the results back to the driver and may be the actual reason for your error.
This problem occurs when your DAG grows big and too many levels of transformations happen in your code. The JVM can no longer handle the deeply nested chain of operations it has to walk through when an action is finally performed on the lazily built-up lineage.
Checkpointing is one option. I would suggest implementing this kind of aggregation with Spark SQL: if your data is structured, try loading it into DataFrames and performing the grouping and other SQL functions to achieve this.
When your for loop grows really large, Spark can no longer keep track of the lineage. Enable checkpointing in your for loop to checkpoint your RDD every 10 iterations or so (see the sketch below the link). Checkpointing will fix the problem. Don't forget to clean up the checkpoint directory afterwards.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
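A rough sketch of what that could look like inside the question's loop; the interval of 10 is arbitrary, and the count() is only there because an action is needed before the checkpoint actually materializes:
sc.setCheckpointDir("pagerankCheckpoint/");
int iterCounter = 0;
for (Iterable<String> out : cartUsersList) {
    // ... build up finalCartesian as in the question ...
    iterCounter++;
    if (iterCounter % 10 == 0 && finalCartesian != null) {
        finalCartesian.checkpoint(); // truncates the lineage built up so far
        finalCartesian.count();      // force a job so the checkpoint is written
    }
}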
The following fixed the StackOverflowError; as others pointed out, it's caused by the lineage that Spark keeps building, especially when you have a loop/iteration in your code.
Set checkpoint directory
spark.sparkContext.setCheckpointDir("./checkpoint")
Checkpoint the DataFrame/RDD you are modifying/operating on in the iteration
modifyingDf.checkpoint()
Cache DataFrames which are reused in each iteration
reusedDf.cache()

Getting an InputStream to read more than once, regardless of markSupported()

I need to be able to re-use a java.io.InputStream multiple times, and I figured the following code would work, but it only works the first time.
Code
public class Clazz
{
    private java.io.InputStream dbInputStream, firstDBInputStream;
    private ArrayTable db;
    public Clazz(java.io.InputStream defDB)
    {
        this.firstDBInputStream = defDB;
        this.dbInputStream = defDB;
        if (db == null)
            throw new java.io.FileNotFoundException("Could not find the database at " + db);
        if (dbInputStream.markSupported())
            dbInputStream.mark(Integer.MAX_VALUE);
        loadDatabaseToArrayTable();
    }
    public final void loadDatabaseToArrayTable() throws java.io.IOException
    {
        this.dbInputStream = firstDBInputStream;
        if (dbInputStream.markSupported())
            dbInputStream.reset();
        java.util.Scanner fileScanner = new java.util.Scanner(dbInputStream);
        String CSV = "";
        for (int i = 0; fileScanner.hasNextLine(); i++)
            CSV += fileScanner.nextLine() + "\n";
        db = ArrayTable.createArrayTableFromCSV(CSV);
    }
    public void reloadDatabase()//A method called by the UI
    {
        try
        {
            loadDatabaseToArrayTable();
        }
        catch (Throwable t)
        {
            //Alert the user that an error has occurred
        }
    }
}
Note that ArrayTable is a class of mine, which uses arrays to give an interface for working with tables.
Question
In this program, the database is shown directly to the user immediately after the reloadDatabase() method is called, and so any solution involving saving the initial read to an object in memory is useless, as that will NOT refresh the data (think of it like a browser; when you press "Refresh", you want it to fetch the information again, not just display the information it fetched the first time). How can I read a java.io.InputStream more than once?
You can't necessarily read an InputStream more than once. Some implementations support it, some don't. What you are doing is checking the markSupported method, which is indeed an indicator of whether you can read the same stream twice, but then you are ignoring the result. You have to call that method to see if you can read the stream twice, and if you can't, make other arrangements.
Edit (in response to comment): When I wrote my answer, my "other arrangements" meant getting a fresh InputStream. However, after reading the comments on your question about what you want to do, I'm not sure it is possible. For the basics of the operation, you probably want RandomAccessFile (at least that would be my first guess, and if it worked, it would be the easiest) - however, you will have file access issues. With one application actively writing to a file and another reading that file, you will have problems - exactly which problems will depend on the OS, so any solution would require more testing. I suggest a separate question on SO that hits on that point, so someone who has tried it can perhaps give you more insight.
You never mark the stream to be reset:
public Clazz(java.io.InputStream defDB)
{
    firstDBInputStream = defDB.markSupported() ? defDB : new BufferedInputStream(defDB);
    //BufferedInputStream supports marking
    firstDBInputStream.mark(500000);//avoid IOException on first reset
}
public final void loadDatabaseToArrayTable() throws java.io.IOException
{
    this.dbInputStream = firstDBInputStream;
    dbInputStream.reset();
    dbInputStream.mark(500000);//or however long the data is
    java.util.Scanner fileScanner = new java.util.Scanner(dbInputStream);
    StringBuilder CSV = new StringBuilder();//StringBuilder is more efficient in a loop
    while (fileScanner.hasNextLine())
        CSV.append(fileScanner.nextLine()).append("\n");
    db = ArrayTable.createArrayTableFromCSV(CSV.toString());
}
However, you could instead keep a copy of the original ArrayTable and copy that when you need to (or even the created string to rebuild it).
This code creates the string and caches it, so you can safely discard the input streams and just use readCSV to build the ArrayTable:
private String readCSV = null;
public final void loadDatabaseToArrayTable() throws java.io.IOException
{
    if (readCSV == null) {
        this.dbInputStream = firstDBInputStream;
        java.util.Scanner fileScanner = new java.util.Scanner(dbInputStream);
        StringBuilder CSV = new StringBuilder();//StringBuilder is more efficient in a loop
        while (fileScanner.hasNextLine())
            CSV.append(fileScanner.nextLine()).append("\n");
        readCSV = CSV.toString();
        fileScanner.close();
    }
    db = ArrayTable.createArrayTableFromCSV(readCSV);
}
However, if you want new information you'll need to create a new stream to read from again.
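One way to express "create a new stream each time" is to hand the class a Supplier<InputStream> instead of a single stream. This is only a sketch (ArrayTable is the asker's own class), and each call to get() must return a fresh stream over the data:
private java.util.function.Supplier<java.io.InputStream> dbSource;
private ArrayTable db;
public final void loadDatabaseToArrayTable() throws java.io.IOException
{
    try (java.util.Scanner fileScanner = new java.util.Scanner(dbSource.get()))
    {
        StringBuilder CSV = new StringBuilder();
        while (fileScanner.hasNextLine())
            CSV.append(fileScanner.nextLine()).append("\n");
        db = ArrayTable.createArrayTableFromCSV(CSV.toString());
    }
}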

How to parse logs written by multiple threads?

I have an interesting problem and would appreciate your thoughts for the best solution.
I need to parse a set of logs. The logs are produced by a multi-threaded program and a single process cycle produces several lines of logs.
When parsing these logs I need to pull out specific pieces of information from each process - naturally this information is spread across multiple lines (I want to compress these pieces of data into a single line). Because the application is multi-threaded, the block of lines belonging to a process can be fragmented, as other processes write to the same log file at the same time.
Fortunately, each line gives a process ID so I'm able to distinguish what logs belong to what process.
Now, there are already several parsers which all extend the same class but were designed to read logs from a single threaded application (no fragmentation - from original system) and use a readLine() method in the super class. These parsers will keep reading lines until all regular expressions have been matched for a block of lines (i.e. lines written in a single process cycle).
So, what can I do with the super class so that it can manage the fragmented logs, and ensure change to the existing implemented parsers is minimal?
It sounds like there are some existing parser classes already in use that you wish to leverage. In this scenario, I would write a decorator for the parser which strips out lines not associated with the process you are monitoring.
It sounds like your classes might look like this:
abstract class Parser {
    public abstract void parse( ... );
    protected String readLine() { ... }
}
class SpecialPurposeParser extends Parser {
    public void parse( ... ) {
        // ... special stuff
        readLine();
        // ... more stuff
    }
}
And I would write something like:
class SingleProcessReadingDecorator extends Parser {
    private Parser parser;
    private String processId;
    public SingleProcessReadingDecorator( Parser parser, String processId ) {
        this.parser = parser;
        this.processId = processId;
    }
    public void parse( ... ) { parser.parse( ... ); }
    public String readLine() {
        String text = super.readLine();
        if( /* text is for processId */ ) {
            return text;
        }
        else {
            //keep readLine'ing until you find the next line and then return it
            return this.readLine();
        }
    }
}
Then any occurrence you want to modify would be used like this:
//old way
Parser parser = new SpecialPurposeParser();
//changes to
Parser parser = new SingleProcessReadingDecorator( new SpecialPurposeParser(), "process1234" );
This code snippet is simple and incomplete, but gives you the idea of how the decorator pattern could work here.
I would write a simple distributor that reads the log file line by line and stores them in different VirtualLog objects in memory -- a VirtualLog being a kind of virtual file, actually just a String or something that the existing parsers can be applied to. The VirtualLogs are stored in a Map with the process ID (PID) as the key. When you read a line from the log, check if the PID is already there. If so, add the line to the PID's respective VirtualLog. If not, create a new VirtualLog object and add it to the Map. Parsers run as separate Threads, one on every VirtualLog. Every VirtualLog object is destroyed as soon as it has been completely parsed.
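A bare-bones sketch of that distributor idea; getProcessId is a placeholder for however the PID is extracted from a line, and the file name is made up:
Map<String, StringBuilder> virtualLogs = new HashMap<>();
try (BufferedReader br = new BufferedReader(new FileReader("app.log"))) {
    String line;
    while ((line = br.readLine()) != null) {
        String pid = getProcessId(line); // placeholder: extract the process ID from the line
        virtualLogs.computeIfAbsent(pid, k -> new StringBuilder())
                   .append(line).append('\n');
    }
}
// each buffered "virtual log" can now be handed to an existing parser, possibly on its own thread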
You need to store lines temporarily in a queue where a single thread consumes them and passes them on once each set has been completed. If you have no way of knowing whether a set is complete or not, by either the number of lines or the content of the lines, you could consider using a sliding-window technique where you don't collect the individual sets until a certain amount of time has passed.
One simple solution could be to read the file line by line and write several files, one for each process id. The list of process id's can be kept in a hash-map in memory to determine if a new file is needed or in which already created file the lines for a certain process id will go. Once all the (temporary) files are written, the existing parsers can do the job on each one.
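A sketch of that temporary-file variant, keeping one open writer per process ID (again, getProcessId and the file names are placeholders):
Map<String, PrintWriter> writers = new HashMap<>();
try (BufferedReader br = new BufferedReader(new FileReader("app.log"))) {
    String line;
    while ((line = br.readLine()) != null) {
        String pid = getProcessId(line); // placeholder
        writers.computeIfAbsent(pid, id -> {
            try {
                return new PrintWriter(new FileWriter("log-" + id + ".tmp"));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }).println(line);
    }
} finally {
    writers.values().forEach(PrintWriter::close);
}
// the existing single-threaded parsers can then be run over each temporary file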
Would something like this do it? It runs a new Thread for each Process ID in the log file.
class Parser {
    String currentLine;
    Parser() {
        //Construct parser
    }
    synchronized String readLine(String processId) {
        if (currentLine == null)
            currentLine = readLinefromLog();
        while (currentLine != null && !getProcessIdFromLine(currentLine).equals(processId)) {
            try { wait(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        String line = currentLine;
        currentLine = readLinefromLog();
        notify();
        return line;
    }
}
class ProcessParser extends Parser implements Runnable {
    String processId;
    ProcessParser(String processId) {
        super();
        this.processId = processId;
    }
    void startParser() {
        new Thread(this).start();
    }
    public void run() {
        String line = null;
        while ((line = readLine()) != null) {
            // process log line here
        }
    }
    String readLine() {
        String line = super.readLine(processId);
        return line;
    }
}
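For completeness, usage might then look something like this, assuming the set of process IDs is known up front:
// hypothetical usage: one ProcessParser per process ID, each reading on its own thread
for (String pid : Arrays.asList("process1234", "process5678")) {
    new ProcessParser(pid).startParser();
}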
