Fast CSV parsing - java

I have a Java server app that downloads a CSV file and parses it. The parsing can take from 5 to 45 minutes, and happens every hour. This method is a bottleneck of the app, so it's not premature optimization. The code so far:
client.executeMethod(method);
InputStream in = method.getResponseBodyAsStream(); // this is http stream
String line;
String[] record;
reader = new BufferedReader(new InputStreamReader(in), 65536);
try {
    // read the header line
    line = reader.readLine();
    // some code
    while ((line = reader.readLine()) != null) {
        // more code
        line = line.replaceAll("\"\"", "\"NULL\"");
        // Now remove all of the quotes
        line = line.replaceAll("\"", "");
        if (!line.startsWith("ERROR")) {
            //bla bla
            continue;
        }
        record = line.split(",");
        //more error handling
        // build the object and put it in HashMap
    }
    //exceptions handling, closing connection and reader
Is there any existing library that would help me speed things up? Can I improve the existing code?

Apache Commons CSV
Have you seen Apache Commons CSV?
Caveat On Using split
Bear in mind that split only returns a view of the data, meaning that the original line object is not eligible for garbage collection while there is a reference to any of its views. Perhaps making a defensive copy will help? (Java bug report; note this was changed in Java 7u6, where substring makes a copy. See the sketch below.)
It is also not reliable at grouping escaped CSV columns that contain commas.
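If you do stick with split, the defensive copy is a one-liner. A minimal sketch, reusing the question's line variable; this mainly matters on Java 6 and early Java 7:
String[] parts = line.split(",");
// Copy the slice so the (possibly huge) original line can be garbage
// collected on pre-7u6 JVMs, where substrings share the parent's char[].
String firstColumn = new String(parts[0]);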

opencsv
Take a look at opencsv.
This blog post, opencsv is an easy CSV parser, has example usage.
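For reference, typical opencsv usage looks roughly like the sketch below (this assumes the modern com.opencsv artifact; older releases lived under au.com.bytecode.opencsv and declared slightly different exceptions):
import com.opencsv.CSVReader;
import java.io.FileReader;

public class OpenCsvDemo {
    public static void main(String[] args) throws Exception {
        try (CSVReader reader = new CSVReader(new FileReader("data.csv"))) {
            String[] record;
            while ((record = reader.readNext()) != null) {
                // Quotes and embedded commas are already handled per field.
                System.out.println(record[0]);
            }
        }
    }
}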

The problem with your code is that it uses replaceAll and split, which are very costly operations. You should definitely consider using a CSV parser/reader that does one-pass parsing.
There is a benchmark on GitHub, https://github.com/uniVocity/csv-parsers-comparison, that unfortunately was run under Java 6. The numbers are slightly different under Java 7 and 8. I'm trying to get more detailed data for different file sizes, but it's a work in progress; see https://github.com/arnaudroger/csv-parsers-comparison
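To illustrate what one-pass parsing looks like, here is a minimal sketch using uniVocity (the library behind the first benchmark). Quoting, embedded commas, and the header row are handled by the parser in a single pass, replacing the question's replaceAll/split combination:
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import java.io.InputStream;
import java.io.InputStreamReader;

public class OnePassDemo {
    static void parse(InputStream in) {
        CsvParserSettings settings = new CsvParserSettings();
        settings.setHeaderExtractionEnabled(true); // consumes the header row
        CsvParser parser = new CsvParser(settings);
        parser.beginParsing(new InputStreamReader(in));
        String[] record;
        while ((record = parser.parseNext()) != null) {
            // build the object and put it in the HashMap, as in the question
        }
    }
}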

Apart from the suggestions made above, I think you can try improving your code by using threading and concurrency.
Following is a brief analysis and a suggested solution.
From the code it seems that you are reading the data over the network (most likely with the Apache Commons HttpClient library).
You need to make sure that the bottleneck you describe is not in the data transfer over the network.
One way to check is to just dump the data into a file (without parsing) and see how long that takes. This will give you an idea of how much time is actually spent in parsing (compared to your current observation).
Now have a look at how the java.util.concurrent package is used. Some of the links that you can use are (1, 2).
What you can do is execute the tasks in your for loop on separate threads, as in the sketch below.
Using a thread pool and concurrency will greatly improve your performance.
Though the solution involves some effort, it will surely help you in the end.
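As a rough sketch of that idea (assuming the per-line work is CPU-bound and rows are independent; the 65536 buffer and header skip come from the question, and the split stands in for your real record-building logic):
import java.io.BufferedReader;
import java.io.Reader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelCsvLoad {
    static Map<String, String[]> load(Reader source) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        ConcurrentMap<String, String[]> byId = new ConcurrentHashMap<>();
        try (BufferedReader reader = new BufferedReader(source, 65536)) {
            reader.readLine(); // skip the header line
            String line;
            while ((line = reader.readLine()) != null) {
                final String l = line;
                pool.submit(() -> {
                    String[] record = l.split(","); // your per-line parsing goes here
                    byId.put(record[0], record);
                });
            }
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        return byId;
    }
}
Whether this helps depends on where the time actually goes: if the network transfer dominates, extra threads won't help, which is why the dump-to-file measurement above is worth doing first.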

opencsv
You should have a look at opencsv. I would expect that it has performance optimizations.

A little late here, but there are now a few benchmarking projects for CSV parsers. Your selection will depend on the exact use case (i.e. raw data vs. data binding, etc.).
SimpleFlatMapper
uniVocity
sesseltjonna-csv (disclaimer: I wrote this parser)

Quirk-CSV
The new kid on the block. It uses Java annotations and is built on Apache Commons CSV, which is one of the faster libraries out there for CSV parsing.
This library is also thread-safe, so if you want to re-use the CSVProcessor, you can and should.
Example:
Pojo
@CSVReadComponent(type = CSVType.NAMED)
@CSVWriteComponent(type = CSVType.ORDER)
public class Pojo {
    @CSVWriteBinding(order = 0)
    private String name;

    @CSVWriteBinding(order = 1)
    @CSVReadBinding(header = "age")
    private Integer age;

    @CSVWriteBinding(order = 2)
    @CSVReadBinding(header = "money")
    private Double money;

    @CSVReadBinding(header = "name")
    public void setA(String name) {
        this.name = name;
    }

    @Override
    public String toString() {
        return "Name: " + name + System.lineSeparator() + "\tAge: " + age + System.lineSeparator() + "\tMoney: "
                + money;
    }
}
Main
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.*;
public class SimpleMain {
    public static void main(String[] args) {
        String csv = "name,age,money" + System.lineSeparator() + "Michael Williams,34,39332.15";
        CSVProcessor processor = new CSVProcessor(Pojo.class);
        List<Pojo> list = new ArrayList<>();
        try {
            list.addAll(processor.parse(new StringReader(csv)));
            list.forEach(System.out::println);
            System.out.println();
            StringWriter sw = new StringWriter();
            processor.write(list, sw);
            System.out.println(sw.toString());
        } catch (IOException e) {
            e.printStackTrace(); // don't silently swallow parse/write errors
        }
    }
}
Since this is built on top of Apache Commons CSV, you can use the powerful CSVFormat tool. Let's say the delimiter for the CSV is a pipe (|) instead of a comma (,); you could, for example:
CSVFormat csvFormat = CSVFormat.DEFAULT.withDelimiter('|');
List<Pojo> list = processor.parse(new StringReader(csv), csvFormat);
Another benefit is that inheritance is also taken into consideration.
For other examples of handling reading/writing non-primitive data, see the project's documentation.

For speed you do not want to use replaceAll, and you don't want to use regex either. What you basically always want to do in critical cases like this is write a state-machine parser that works character by character. I've done that, rolling the whole thing into an Iterable. It takes in the stream and parses it without saving it out or caching it, so if you can abort early, that will likely work fine as well. It should also be short enough and well coded enough to make it obvious how it works.
public static Iterable<String[]> parseCSV(final InputStream stream) throws IOException {
    return new Iterable<String[]>() {
        @Override
        public Iterator<String[]> iterator() {
            return new Iterator<String[]>() {
                static final int UNCALCULATED = 0;
                static final int READY = 1;
                static final int FINISHED = 2;
                int state = UNCALCULATED;
                ArrayList<String> value_list = new ArrayList<>();
                StringBuilder sb = new StringBuilder();
                String[] return_value;

                public void end() {
                    end_part();
                    return_value = new String[value_list.size()];
                    value_list.toArray(return_value);
                    value_list.clear();
                }

                public void end_part() {
                    value_list.add(sb.toString());
                    sb.setLength(0);
                }

                public void append(int ch) {
                    sb.append((char) ch);
                }

                public void calculate() throws IOException {
                    boolean inquote = false;
                    while (true) {
                        int ch = stream.read();
                        switch (ch) {
                            default: //regular character.
                                append(ch);
                                break;
                            case -1: //read has reached the end.
                                if ((sb.length() == 0) && (value_list.isEmpty())) {
                                    state = FINISHED;
                                } else {
                                    end();
                                    state = READY;
                                }
                                return;
                            case '\r':
                            case '\n': //end of line.
                                if (inquote) {
                                    append(ch);
                                } else {
                                    end();
                                    state = READY;
                                    return;
                                }
                                break;
                            case ',': //comma
                                if (inquote) {
                                    append(ch);
                                } else {
                                    end_part();
                                }
                                break;
                            case '"': //quote.
                                inquote = !inquote;
                                break;
                        }
                    }
                }

                @Override
                public boolean hasNext() {
                    if (state == UNCALCULATED) {
                        try {
                            calculate();
                        } catch (IOException ex) {
                        }
                    }
                    return state == READY;
                }

                @Override
                public String[] next() {
                    if (state == UNCALCULATED) {
                        try {
                            calculate();
                        } catch (IOException ex) {
                        }
                    }
                    state = UNCALCULATED;
                    return return_value;
                }
            };
        }
    };
}
You would typically process this quite conveniently like:
for (String[] csv : parseCSV(stream)) {
//<deal with parsed csv data>
}
The beauty of that API is worth the rather cryptic-looking function.

Apache Commons CSV ➙ 12 seconds for million rows
Is there any existing library that would help me to speed up things?
Yes, the Apache Commons CSV project works very well in my experience.
Here is an example app that uses the Apache Commons CSV library to write and read rows of 24 columns: an integer sequential number, an Instant, and the rest random UUID objects.
For 10,000 rows, the writing and the reading each take about half a second. The reading includes reconstituting the Integer, Instant, and UUID objects.
My example code lets you toggle on or off the reconstituting of objects. I ran both with a million rows. This creates a file of 850 megs. I am using Java 12 on a MacBook Pro (Retina, 15-inch, Late 2013), 2.3 GHz Intel Core i7, 16 GB 1600 MHz DDR3, Apple built-in SSD.
For a million rows, ten seconds for reading plus two seconds for parsing:
Writing: PT25.994816S
Reading only: PT10.353912S
Reading & parsing: PT12.219364S
The source code is a single .java file. It has a write method and a read method, both called from a main method.
I opened a BufferedReader by calling Files.newBufferedReader.
package work.basil.example;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.CSVRecord;
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;
import java.util.UUID;
public class CsvReadingWritingDemo
{
public static void main ( String[] args )
{
CsvReadingWritingDemo app = new CsvReadingWritingDemo();
app.write();
app.read();
}
private void write ()
{
Instant start = Instant.now();
int limit = 1_000_000; // 10_000 100_000 1_000_000
Path path = Paths.get( "/Users/basilbourque/IdeaProjects/Demo/csv.txt" );
try (
Writer writer = Files.newBufferedWriter( path, StandardCharsets.UTF_8 );
CSVPrinter printer = new CSVPrinter( writer , CSVFormat.RFC4180 );
)
{
printer.printRecord( "id" , "instant" , "uuid_01" , "uuid_02" , "uuid_03" , "uuid_04" , "uuid_05" , "uuid_06" , "uuid_07" , "uuid_08" , "uuid_09" , "uuid_10" , "uuid_11" , "uuid_12" , "uuid_13" , "uuid_14" , "uuid_15" , "uuid_16" , "uuid_17" , "uuid_18" , "uuid_19" , "uuid_20" , "uuid_21" , "uuid_22" );
for ( int i = 1 ; i <= limit ; i++ )
{
printer.printRecord( i , Instant.now() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() );
}
} catch ( IOException ex )
{
ex.printStackTrace();
}
Instant stop = Instant.now();
Duration d = Duration.between( start , stop );
System.out.println( "Wrote CSV for limit: " + limit );
System.out.println( "Elapsed: " + d );
}
private void read ()
{
Instant start = Instant.now();
int count = 0;
Path path = Paths.get( "/Users/basilbourque/IdeaProjects/Demo/csv.txt" );
try (
Reader reader = Files.newBufferedReader( path , StandardCharsets.UTF_8) ;
)
{
CSVFormat format = CSVFormat.RFC4180.withFirstRecordAsHeader();
CSVParser parser = CSVParser.parse( reader , format );
for ( CSVRecord csvRecord : parser )
{
if ( true ) // Toggle parsing of the string data into objects. Turn off (`false`) to see strictly the time taken by Apache Commons CSV to read & parse the lines. Turn on (`true`) to get a feel for real-world load.
{
Integer id = Integer.valueOf( csvRecord.get( 0 ) ); // Annoying zero-based index counting.
Instant instant = Instant.parse( csvRecord.get( 1 ) );
for ( int i = 3 - 1 ; i <= 24 - 1 ; i++ ) // Columns 3 through 24 hold the UUIDs. Subtract one for annoying zero-based index counting.
{
UUID uuid = UUID.fromString( csvRecord.get( i ) );
}
}
count++;
if ( count % 1_000 == 0 ) // Every so often, report progress.
{
//System.out.println( "# " + count );
}
}
} catch ( IOException e )
{
e.printStackTrace();
}
Instant stop = Instant.now();
Duration d = Duration.between( start , stop );
System.out.println( "Read CSV for count: " + count );
System.out.println( "Elapsed: " + d );
}
}

Related

Java CSV Writing

I'm currently trying to write data into Excel for a report. I can write data to the CSV file; however, it's not coming out in Excel in the order I want. I need the data to print under the Best and Worst Fitness columns instead of it all printing under Average. Here is the relevant code; any help would be appreciated:
String [] Fitness = "Average fitness#Worst fitness #Best Fitness".split("#");
writer.writeNext(Fitness);
//takes data from average fitness and stores as an int
int aFit = myPop.individuals[25].getFitness();
//converts int to string
String aFit1 = Integer.toString(aFit);
//converts string to string array
String aFit2 [] = aFit1.split(" ");
//writes to csv
writer.writeNext(aFit2);
//String [] nextCol = "#".split("#");
int wFit = myPop.individuals[49].getFitness();
String wFit1 = Integer.toString(wFit);
String wFit2 [] = wFit1.split(" ");
writer.writeNext(wFit2);
int bFit = myPop.individuals[1].getFitness();
String bFit1 = Integer.toString(bFit);
String bFit2 [] = bFit1.split(" ");
writer.writeNext(bFit2);
I think you should call your "writeNext" method once per line of data:
String [] Fitness = "Average fitness#Worst fitness #Best Fitness".split("#");
writer.writeNext(Fitness);
int aFit = myPop.individuals[25].getFitness();
String aFit1 = Integer.toString(aFit);
int wFit = myPop.individuals[49].getFitness();
String wFit1 = Integer.toString(wFit);
int bFit = myPop.individuals[1].getFitness();
String bFit1 = Integer.toString(bFit);
writer.writeNext(new String[]{aFit1, wFit1, bFit1});
From the docs at
CSVWriter.html#writeNext(java.lang.String[])
public void writeNext(String[] nextLine)
- Writes the next line to the file.
The String array to provide is
A string array with each comma-separated element as a separate entry.
You are writing 3 separate lines instead of 1, and each line you write contains an array with a single entry.
writer.writeNext(aFit2);
writer.writeNext(wFit2);
writer.writeNext(bFit2);
Solution:
Create a single Array with all 3 entries (column values) and write that once on a single line.
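Using the variables from the question, that single write is one line (the same fix shown in the previous answer):
writer.writeNext(new String[] { aFit1, wFit1, bFit1 }); // one row, three columns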
I am assuming you are using CSVWriter to write to a CSV file. Please make sure to mention as many details as possible in a question; it makes it much more readable to others.
As you can see from the documentation of CSVWriter:
void writeNext(String[] nextLine)
Writes the next line to the file.
The writeNext method actually writes the array to an individual line of the file. From your code:
writer.writeNext(aFit2);
writer.writeNext(wFit2);
writer.writeNext(bFit2);
So, instead of doing this: String aFit2[] = aFit1.split(" ");
Create an array of the values and then pass that array to writeNext
As an example, you can consider your own example of passing the array of column names, which gets written on a single line:
writer.writeNext(Fitness);
Apache Commons CSV
Here is the same kind of solution, but using the Apache Commons CSV library. This library specifically supports the Microsoft Excel variant of CSV format, so you may find it particularly useful.
CSVFormat.Predefined.EXCEL
Your data, both read and written in this example.
The Commons CSV library can read the first row as header names.
Here is a complete example app in a single .java file. First the app reads from an existing WorstBest.csv data file:
Average,Worst,Best
10,5,15
11,5,16
10,6,16
11,6,15
10,5,16
10,5,16
10,4,16
Each row is represented as a List of three String objects, a List< String >. We add each row to a collection, a list of lists, a List< List< String > >.
Then we write out that imported data to another file. Each written file is named WorstBest_xxx.csv, where xxx is the current moment in UTC.
package com.basilbourque.example;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.CSVRecord;
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
public class WorstBest {
public static void main ( String[] args ) {
WorstBest app = new WorstBest();
List < List < String > > data = app.read();
app.write( data );
}
private List < List < String > > read ( ) {
List < List < String > > listOfListsOfStrings = List.of();
try {
// Locate file to import and parse.
Path path = Paths.get( "/Users/basilbourque/WorstBest.csv" );
if ( Files.notExists( path ) ) {
System.out.println( "ERROR - no file found for path: " + path + ". Message # 3cf416de-c33b-4c39-8507-5fbb72e113f2." );
}
// Hold data read from file.
int initialCapacity = ( int ) Files.lines( path ).count();
listOfListsOfStrings = new ArrayList <>( initialCapacity );
// Read CSV file.
BufferedReader reader = Files.newBufferedReader( path );
Iterable < CSVRecord > records = CSVFormat.RFC4180.withFirstRecordAsHeader().parse( reader );
for ( CSVRecord record : records ) {
// Average,Worst,Best
// 10,5,15
// 11,5,16
String average = record.get( "Average" ); // Access columns by header name, thanks to withFirstRecordAsHeader.
String worst = record.get( "Worst" );
String best = record.get( "Best" );
// Collect
listOfListsOfStrings.add( List.of( average , worst , best ) ); // For real work, you would define a class to hold these values.
}
} catch ( IOException e ) {
e.printStackTrace();
}
return listOfListsOfStrings;
}
private void write ( List < List < String > > listOfListsOfStrings ) {
Objects.requireNonNull( listOfListsOfStrings );
// Determine file in which to write data.
String when = Instant.now().truncatedTo( ChronoUnit.SECONDS ).toString().replace( ":" , "•" ); // Colons are forbidden in names by some file systems such as HFS+.
Path path = Paths.get( "/Users/basilbourque/WorstBest_" + when + ".csv" );
// Loop collection of data (a list of lists of strings).
try ( final CSVPrinter printer = CSVFormat.EXCEL.withHeader( "Average" , "Worst" , "Best" ).print( path , StandardCharsets.UTF_8 ) ; ) {
for ( List < String > list : listOfListsOfStrings ) {
printer.printRecord( list.get( 1 - 1 ) , list.get( 2 - 1 ) , list.get( 3 - 1 ) ); // Annoying zero-based index counting.
}
} catch ( IOException e ) {
e.printStackTrace();
}
}
}

Java stored procedure returns nothing in Oracle Database

I have a fairly simple stored Java procedure in an Oracle database. The intended purpose is to read the contents of a folder which resides on the Oracle server. If it encounters a folder, it will step into the folder, write the names of the contents into a global temp table, and move on to the next folder. The Java procedure compiles fine and submits into the database with no issues. When it's called by a stored Oracle procedure it runs successfully as well, but produces no results in the global temp table. I am using Toad and I'm not sure how to set a breakpoint or view the variables during run time, so I'm kind of flying blind. And I'm admittedly not great at Java.
CREATE OR REPLACE AND RESOLVE JAVA SOURCE NAMED BALT_CHECK."WebDirList" AS
import java.io.*;
import java.sql.*;
import java.util.Date;
import java.text.SimpleDateFormat;
public class WebDirList
{
public static void getList(String rootdirectory) throws SQLException
{
File path = new File( rootdirectory );
String[] rootDirList = path.list();
String element;
for( int x = 0; x < rootDirList.length; x++)
{
element = rootDirList[x];
String newPath = rootdirectory + "/" + rootDirList[x] ;
File f = new File(newPath);
if (f.isFile()){
/* Do Nothing */
} else {
/*if it is a folder then load the subDirPath variable with the newPath variable */
File subDirPath = new File( newPath+"/");
String[] subDirList = subDirPath.list();
String efileName;
for(int i = 0; i < subDirList.length; i++)
{
efileName = subDirList[i];
String fpath = subDirPath + "/" + subDirList[i];
File nf = new File(fpath);
long len;
Date date;
String ftype;
String sqlDate;
SimpleDateFormat df = new SimpleDateFormat( "yyyy-MM-dd hh:mm:ss");
if (f.isFile()) {
len = f.length();
date = new Date(f.lastModified());
sqlDate = df.format(date);
#sql { INSERT INTO WEB_DIRLIST (FILENAME, LENGTH, CREATEDATE)
VALUES (:efileName, :len, to_date(:sqlDate, 'YYYY-MM-DD HH24:MI:SS')) };
}else{
/* Do nothing */
}
}
}
}
}
}
/
Procedure is created as
CREATE OR REPLACE procedure BALT_CHECK.get_webdir_list( p_directory in varchar2)
as language java
name 'WebDirList.getList( java.lang.String )';
/
Procedure is called as
exec get_webdir_list( '/transfer_edi/hs122/');
In the folder /transfer_edi/hs122/ there are 10 subdirectories, each of which has between 1 and 100 items in it at any given time.
I'm not sure how you check the results (same session or not). Do you perform a commit somewhere? There are some specifics with global temp tables (there is an option controlling whether data is purged after commit or not). You may wish to initially try with a permanent table until you sort out the problem.
It may be useful to add some logging (e.g. to another table). For example, rootDirList.length may be a good indicator to check.
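A minimal SQLJ logging sketch in the style of your existing insert (DEBUG_LOG is a hypothetical table with a single MSG column that you would need to create first):
String msg = "rootDirList.length = " + ((rootDirList == null) ? -1 : rootDirList.length);
#sql { INSERT INTO DEBUG_LOG (MSG) VALUES (:msg) };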
Some other remarks:
The /* Do nothing */ branches in your if statements add noise. It would be good to remove them.
Perhaps it would be better to use .isDirectory() if you want to check whether the path is a directory (instead of isFile).
There were a few errors in this code that prevented it from writing to the database. Based on Yavor's suggestion of writing String variables to a temp table, I was able to find that I had duplicated "/" in the file path, e.g. (/transfer_edi/hs122//Acctg). I also found I had an incorrect data type on one of the columns in the data table that I was writing to. I also switched to a regular table instead of a global temp table, which was deleting rows after commit. Again, thanks Yavor. Regardless of all that, I ended up re-writing the entire thing; I realized that I needed to traverse down the directory structure to get all the files. Here is the final code that worked for me. Again, I'm not a Java guy, so I'm sure this could be done better.
This link helped me quite a bit
http://rosettacode.org/wiki/Walk_a_directory/Recursively#Java
CREATE OR REPLACE AND RESOLVE JAVA SOURCE NAMED BALT_CHECK."WebDocs" AS
import java.io.*;
import java.sql.*;
import java.util.Date;
import java.text.SimpleDateFormat;
import java.lang.String;
public class WebDocs
{
public static long fileID;
public static void GetDocs(String rootdirectory) throws SQLException
{
stepinto(rootdirectory);
}
public static void stepinto(String rootdirectory) throws SQLException
{
File path = new File( rootdirectory );
String[] DirList = path.list();
for( int x = 0; x < DirList.length; x++)
{
String newPath = rootdirectory + DirList[x];
if (newPath != null) {
File f = new File(newPath);
if (f.isDirectory()) {
GetDocs(newPath +"/");
}
if (f.isFile()){
WriteFile(f);
}else{
}
}
}
}
public static void WriteFile(File file) throws SQLException
{
String fileName;
String filePath;
String elementID;
long len;
Date date;
String sqlDate;
SimpleDateFormat df = new SimpleDateFormat( "yyyy-MM-dd hh:mm:ss");
fileID = fileID + 1;
elementID = String.valueOf(fileID);
fileName = file.getName();
filePath = file.getPath();
len = file.length();
date = new Date(file.lastModified());
sqlDate = df.format(date);
#sql { INSERT INTO WEB_STATICDOCS (ID, FILE_NAME, FILE_SIZE, CREATE_DATE, FILE_PATH)
VALUES (:elementID, :fileName, :len, to_date(:sqlDate, 'YYYY-MM-DD HH24:MI:SS'), :filePath) };
}
}
/
Oracle Stored Procedure
CREATE OR REPLACE procedure BALT_CHECK.getWebDocs( p_directory in varchar2)
as language java
name 'WebDocs.GetDocs( java.lang.String )';
/
Calling the stored Procedure
exec getWebDocs( '/transfer_edi/hs122/');

Parsing XML with StAX with non-unique tag paths, design suggestions

I need to parse a large XML file (probably going to use StAX in Java) and output it into a delimited text file and I have a couple of design questions. First here is an example of the XML
<demographic>
<value>001</value>
<question>Name?</question>
<value>Bob</value>
<question>Last Name?</question>
<value>Smith</value>
<followUpQuestions>
<question>Middle Init.</question>
<value>J</value>
</followUpQuestions>
</demographic>
this would need to be outputted (in the delimited output file) as
001~Bob~Smith~J
so here are my questions:
How can I distinguish between all the different "value" tags, since the tag names are not unique? Currently I try to resolve this by having 'state' variables that turn on once they pass question text such as "Name?"; however, this approach doesn't really work for the first value, since I have to check that the 'name' and 'lastName' states are off to ensure I'm getting the first value.
Every time the client changes the text of the questions (which happens), I have to change the code and recompile it. Is there any way to avoid this? Maybe save the question text in a text file that the program reads in?
Can this be scalable? I need to extract over 100 values, and the XML files are usually about 2 GB.
Thank you, in advance, for your help (from a Java and XML newbie)!!
UPDATE: here is my attempt to code the solution. Can someone please help streamline it? There has to be a less messy way to do this:
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.*;
class TestJavaForStackOverflow{
boolean nameState = false,
lastNameState = false,
middleInitState = false;
String name = "",
lastName = "",
middleInit = "",
value = "";
public void parse() throws IOException, XMLStreamException{
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader streamReader = factory.createXMLStreamReader(
new FileReader("/n04/data/revmgmt/anthony/scripts/Java_Programs/TestJavaForStackOverflow.xml"));
while(streamReader.hasNext()){
streamReader.next();
if(streamReader.getEventType() == XMLStreamReader.START_ELEMENT){
if("demographic".equals(streamReader.getLocalName())){
parseDemographicInformation(streamReader);
}
}
}
System.out.println(value + "~" + name + "~" + lastName + "~" + middleInit);
}
public void parseDemographicInformation(XMLStreamReader streamReader) throws XMLStreamException {
while(streamReader.hasNext()){
streamReader.next();
if(streamReader.getEventType() == XMLStreamReader.END_ELEMENT){
if("demographic".equals(streamReader.getLocalName())){
return;
}
}
else if(streamReader.getEventType() == XMLStreamReader.START_ELEMENT){
if("question".equals(streamReader.getLocalName())){
streamReader.next();
if("Name?".equals(streamReader.getText())){
nameState = true;
}
else if("Last Name?".equals(streamReader.getText())){
lastNameState = true;
}
else if("Middle Init.".equals(streamReader.getText())){
middleInitState = true;
}
}
else if("value".equals(streamReader.getLocalName())){
streamReader.next();
if(nameState){
name = streamReader.getText();
nameState = false;
}
else if (lastNameState){
lastName = streamReader.getText();
lastNameState = false;
}
else if (middleInitState){
middleInit = streamReader.getText();
middleInitState = false;
}
else {
value = streamReader.getText();
}
}
}
}
}
public static void main(String[] args){
TestJavaForStackOverflow t = new TestJavaForStackOverflow();
try{t.parse();}
catch(IOException e1){}
catch(XMLStreamException e2){}
}
}
I think the flags are not very scalable if you have a lot of different questions to parse, and neither are the global variables that hold the results: if you have 100 questions then you'll need 100 variables, and when they change over time it will be a bear to keep them up to date. I would use a map structure to hold the result, and another one to hold the correspondence between each question text and the corresponding field you are trying to capture (this is not actual Java, just an approximation):
public Map parseDemographicInformation(XmlStream xml, Map questionMap) {
Map record = new Map();
String field = "id";
while((elem = xml.getNextElement())) {
if(elem.tagName == "question") {
field = questionMap[elem.value];
} else if(elem.tagName == "value") {
record[field] = elem.value;
}
}
return record;
}
Then you have something like this to output the result:
String[] fieldsToOutput = { "id", "firstName", "lastName" }; // ideally read this from a file too so it can be changed dynamically
// ...
for(int i=0; i < fieldsToOutput.length; i++){
if(i > 0)
System.out.print("~");
System.out.print(record[fieldsToOutput[i]]);
}
System.out.println();
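As a compilable version of that approximation (a sketch: questionMap maps question text, ideally loaded from a config file, to field names such as firstName, so the client can change question wording without a recompile):
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class DemographicParser {
    static Map<String, String> parseDemographic(Reader in, Map<String, String> questionMap)
            throws XMLStreamException {
        Map<String, String> record = new HashMap<>();
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
        String field = "id"; // the first <value> has no preceding <question>
        while (r.hasNext()) {
            if (r.next() == XMLStreamReader.START_ELEMENT) {
                if ("question".equals(r.getLocalName())) {
                    field = questionMap.getOrDefault(r.getElementText(), "unknown");
                } else if ("value".equals(r.getLocalName())) {
                    record.put(field, r.getElementText());
                }
            }
        }
        return record;
    }
}
With questionMap holding {"Name?" -> "firstName", "Last Name?" -> "lastName", "Middle Init." -> "middleInit"}, the returned record contains id, firstName, lastName, and middleInit, and the output loop above can print the fields in any order.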

Validation of java object using javascript

I have the following task for my project:
I need to validate a java object, based on rules in a script (for example javascript).
Why in javascript?
Because in JavaScript I can be flexible and create rules as I want (e.g. validating a combination of fields like validate(tax, recipient)).
Now here is what I have:
1) I have validation rules defined in a javascript file, Rules.js
function checkPrice(price){
if(price < 0){
return false;
}
}
2) I have a plain Java Object (Invoice). And it must stay plain!
public class Invoice implements Serializable {
private String details;
private String tax;
private String recipient;
private double price;
//getter and setter
}
3) And I have a ValidatorObject. This can be a java or javascript object. Depending on your suggestion.
This ValidatorObject has a method validate, which has the Javascript Rules File (see Point 1) and the Java Object, Invoice, (see Point 2) as parameters.
validate(Rules.js, Invoice i){
//here it must take the Rules.js and use the rules inside to validate the Invoice i
}
So my question would be:
Are there any frameworks that I can use to validate a Java Object based on rules defined in a javascript file? Or any tutorials, videos or suggestions?
Or how can I read a javascript file into a java object? Are there any getters or setters for javascript?
Anything would be nice!
Regards,
Dave
Here is how to embed a ScriptEngine into a Java application:
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import javax.script.Bindings;
import javax.script.ScriptContext;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
public static void main( String[] args )
throws
ScriptException, IOException
{
final String run;
if( args.length > 0 )
{
run = args[0];
if( run.contains( "=" ))
{
usage();
}
}
else
{
run = "run.js";
}
File script = new File( run );
if( script.canRead())
{
ScriptEngine engine = new ScriptEngineManager().getEngineByMimeType( "text/javascript" );
Bindings bindings = engine.getBindings( ScriptContext.GLOBAL_SCOPE );
bindings.put( "controller", new Controller());
for( int i = 1; i < args.length; ++i )
{
String[] varVal = args[i].split( "=" );
if( varVal.length == 2 )
{
bindings.put( varVal[0], varVal[1] );
}
else
{
usage();
}
}
info( "Loading and executing: " + script );
engine.put( ScriptEngine.FILENAME, script.toString());
engine.eval( new FileReader( script ));
}
else
{
System.err.println( "Can't read automation script file: " + script );
usage();
}
}
In this case the Controller Java class exposes some public methods to JavaScript, combining the flexibility of interpreted code with the robustness of compiled Java code.
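For context, a minimal hypothetical Controller could look like this (the class name comes from the snippet above, but the method is illustrative, not from the original post):
public class Controller {
    // Any public method here becomes callable from the script,
    // e.g. controller.log('hello') in run.js.
    public void log(String message) {
        System.out.println("[script] " + message);
    }
}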
I solved it using Rhino, where I can read and manipulate my JavaScript file:
https://developer.mozilla.org/en-US/docs/Rhino
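For the validate(Rules.js, Invoice) method from the question, the JDK's built-in javax.script API (Rhino-based in Java 6/7, Nashorn in Java 8) is enough. A minimal sketch, assuming the Invoice getters from the question:
import java.io.FileReader;
import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class Validator {
    public static boolean validate(String rulesFile, Invoice invoice) throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("javascript");
        engine.eval(new FileReader(rulesFile)); // load Rules.js
        Invocable inv = (Invocable) engine;
        // checkPrice is defined in Rules.js; as written there it only
        // returns false explicitly, so treat anything but false as valid.
        Object result = inv.invokeFunction("checkPrice", invoice.getPrice());
        return !Boolean.FALSE.equals(result);
    }
}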

Scala actors inefficiency issue

Let me start out by saying that I'm new to Scala; however, I find the Actor based concurrency model interesting, and I tried to give it a shot for a relatively simple application. The issue that I'm running into is that, although I'm able to get the application to work, the result is far less efficient (in terms of real time, CPU time, and memory usage) than an equivalent Java based solution that uses threads that pull messages off an ArrayBlockingQueue. I'd like to understand why. I suspect that it's likely my lack of Scala knowledge, and that I'm causing all the inefficiency, but after several attempts to rework the application without success, I decided to reach out to the community for help.
My problem is this:
I have a gzipped file with many lines in the format of:
SomeID comma_separated_list_of_values
For example:
1234 12,45,82
I'd like to parse each line and get an overall count of the number of occurrences of each value in the comma separated list.
This file may be pretty large (several GB compressed), but the number of unique values per file is pretty small (at most 500). I figured this would be a pretty good opportunity to try to write an Actor-based concurrent Scala application. My solution involves a main driver that creates a pool of parser Actors. The main driver then reads lines from stdin and passes each line off to an Actor that parses the line and keeps a local count of the values. When the main driver has read the last line, it passes a message to each actor indicating that all lines have been read. When the actors receive the 'done' message, they pass their counts to an aggregator that sums the counts from all actors. Once the counts from all parsers have been aggregated, the main driver prints out the statistics.
The problem:
The main issue that I'm encountering is the incredible amount of inefficiency of this application. It uses far more CPU and far more memory than an "equivalent" Java application that uses threads and an ArrayBlockingQueue. To put this in perspective, here are some stats that I gathered for a 10 million line test input file:
Scala 1 Actor (parser):
real 9m22.297s
user 235m31.070s
sys 21m51.420s
Java 1 Thread (parser):
real 1m48.275s
user 1m58.630s
sys 0m33.540s
Scala 5 Actors:
real 2m25.267s
user 63m0.730s
sys 3m17.950s
Java 5 Threads:
real 0m24.961s
user 1m52.650s
sys 0m20.920s
In addition, top reports that the Scala application has about 10x the resident memory size. So we're talking about orders of magnitude more CPU and memory here for orders of magnitude worse performance, and I just can't figure out what is causing this. Is it a GC issue, or am I somehow creating far more copies of objects than I realize?
Additional details that may or may not be of importance:
The Scala application is wrapped by a Java class so that I could deliver a self-contained executable JAR file (I don't have the Scala jars on every machine that I might want to run this app).
The application is being invoked as follows: gunzip -c gzFilename | java -jar StatParser.jar
Here is the code:
Main Driver:
import scala.actors.Actor._
import scala.collection.{ immutable, mutable }
import scala.io.Source
class StatCollector (numParsers : Int ) {
private val parsers = new mutable.ArrayBuffer[StatParser]()
private val aggregator = new StatAggregator()
def generateParsers {
for ( i <- 1 to numParsers ) {
val parser = new StatParser( i, aggregator )
parser.start
parsers += parser
}
}
def readStdin {
var nextParserIdx = 0
var lineNo = 1
for ( line <- Source.stdin.getLines() ) {
parsers( nextParserIdx ) ! line
nextParserIdx += 1
if ( nextParserIdx >= numParsers ) {
nextParserIdx = 0
}
lineNo += 1
}
}
def informParsers {
for ( parser <- parsers ) {
parser ! true
}
}
def printCounts {
val countMap = aggregator.getCounts()
println( "ID,Count" )
/*
for ( key <- countMap.keySet ) {
println( key + "," + countMap.getOrElse( key, 0 ) )
//println( "Campaign '" + key + "': " + countMap.getOrElse( key, 0 ) )
}
*/
countMap.toList.sorted foreach {
case (key, value) =>
println( key + "," + value )
}
}
def processFromStdIn {
aggregator.start
generateParsers
readStdin
process
}
def process {
informParsers
var completedParserCount = aggregator.getNumParsersAggregated
while ( completedParserCount < numParsers ) {
Thread.sleep( 250 )
completedParserCount = aggregator.getNumParsersAggregated
}
printCounts
}
}
The Parser Actor:
import scala.actors.Actor
import collection.mutable.HashMap
import scala.util.matching
class StatParser( val id: Int, val aggregator: StatAggregator ) extends Actor {
private var countMap = new HashMap[String, Int]()
private val sep1 = "\t"
private val sep2 = ","
def getCounts(): HashMap[String, Int] = {
return countMap
}
def act() {
loop {
react {
case line: String =>
{
val idx = line.indexOf( sep1 )
var currentCount = 0
if ( idx > 0 ) {
val tokens = line.substring( idx + 1 ).split( sep2 )
for ( token <- tokens ) {
if ( !token.equals( "" ) ) {
currentCount = countMap.getOrElse( token, 0 )
countMap( token ) = ( 1 + currentCount )
}
}
}
}
case doneProcessing: Boolean =>
{
if ( doneProcessing ) {
// Send my stats to Aggregator
aggregator ! this
}
}
}
}
}
}
The Aggregator Actor:
import scala.actors.Actor
import collection.mutable.HashMap
class StatAggregator extends Actor {
private var countMap = new HashMap[String, Int]()
private var parsersAggregated = 0
def act() {
loop {
react {
case parser: StatParser =>
{
val cm = parser.getCounts()
for ( key <- cm.keySet ) {
val currentCount = countMap.getOrElse( key, 0 )
val incAmt = cm.getOrElse( key, 0 )
countMap( key ) = ( currentCount + incAmt )
}
parsersAggregated += 1
}
}
}
}
def getNumParsersAggregated: Int = {
return parsersAggregated
}
def getCounts(): HashMap[String, Int] = {
return countMap
}
}
Any help that could be offered in understanding what is going on here would be greatly appreciated.
Thanks in advance!
---- Edit ---
Since many people responded and asked for the Java code, here is the simple Java app that I created for comparison purposes. I realize that this is not great Java code, but when I saw the performance of the Scala application, I just whipped up something quick to see how a Java thread-based implementation would perform as a baseline:
Parsing Thread:
import java.util.Hashtable;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;
public class JStatParser extends Thread
{
private ArrayBlockingQueue<String> queue;
private Map<String, Integer> countMap;
private boolean done;
public JStatParser( ArrayBlockingQueue<String> q )
{
super( );
queue = q;
countMap = new Hashtable<String, Integer>( );
done = false;
}
public Map<String, Integer> getCountMap( )
{
return countMap;
}
public void alldone( )
{
done = true;
}
@Override
public void run( )
{
String line = null;
while( !done || queue.size( ) > 0 )
{
try
{
// line = queue.take( );
line = queue.poll( 100, TimeUnit.MILLISECONDS );
if( line != null )
{
int idx = line.indexOf( "\t" ) + 1;
for( String token : line.substring( idx ).split( "," ) )
{
if( !token.equals( "" ) )
{
if( countMap.containsKey( token ) )
{
Integer currentCount = countMap.get( token );
currentCount++;
countMap.put( token, currentCount );
}
else
{
countMap.put( token, new Integer( 1 ) );
}
}
}
}
}
catch( InterruptedException e )
{
// TODO Auto-generated catch block
System.err.println( "Failed to get something off the queue: "
+ e.getMessage( ) );
e.printStackTrace( );
}
}
}
}
Driver:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Hashtable;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.concurrent.ArrayBlockingQueue;
public class JPS
{
public static void main( String[] args )
{
if( args.length <= 0 || args.length > 2 || args[0].equals( "-?" ) )
{
System.err.println( "Usage: JPS [filename]" );
System.exit( -1 );
}
int numParsers = Integer.parseInt( args[0] );
ArrayBlockingQueue<String> q = new ArrayBlockingQueue<String>( 1000 );
List<JStatParser> parsers = new ArrayList<JStatParser>( );
BufferedReader reader = null;
try
{
if( args.length == 2 )
{
reader = new BufferedReader( new FileReader( args[1] ) );
}
else
{
reader = new BufferedReader( new InputStreamReader( System.in ) );
}
for( int i = 0; i < numParsers; i++ )
{
JStatParser parser = new JStatParser( q );
parser.start( );
parsers.add( parser );
}
String line = null;
while( (line = reader.readLine( )) != null )
{
try
{
q.put( line );
}
catch( InterruptedException e )
{
// TODO Auto-generated catch block
System.err.println( "Failed to add line to q: "
+ e.getMessage( ) );
e.printStackTrace( );
}
}
// At this point, we've put everything on the queue, now we just
// need to wait for it to be processed.
while( q.size( ) > 0 )
{
try
{
Thread.sleep( 250 );
}
catch( InterruptedException e )
{
}
}
Map<String,Integer> countMap = new Hashtable<String,Integer>( );
for( JStatParser jsp : parsers )
{
jsp.alldone( );
Map<String,Integer> cm = jsp.getCountMap( );
for( String key : cm.keySet( ) )
{
if( countMap.containsKey( key ))
{
Integer currentCount = countMap.get( key );
currentCount += cm.get( key );
countMap.put( key, currentCount );
}
else
{
countMap.put( key, cm.get( key ) );
}
}
}
System.out.println( "ID,Count" );
for( String key : new TreeSet<String>(countMap.keySet( )) )
{
System.out.println( key + "," + countMap.get( key ) );
}
for( JStatParser parser : parsers )
{
try
{
parser.join( 100 );
}
catch( InterruptedException e )
{
// TODO Auto-generated catch block
e.printStackTrace();
}
}
System.exit( 0 );
}
catch( IOException e )
{
System.err.println( "Caught exception: " + e.getMessage( ) );
e.printStackTrace( );
}
}
}
I'm not sure this is a good test case for actors. For one thing, there's almost no interaction between actors. This is a simple map/reduce, which calls for parallelism, not concurrency.
The overhead on the actors is also pretty heavy, and I don't know how many actual threads are being allocated. Depending on how many processors you have, you might have fewer threads than in the Java program -- which seems to be the case, given that the speed-up is 4x instead of 5x.
And the way you wrote the actors is optimized for idle actors, the kind of situation where you have hundreds or thousands of actors, but only a few of them doing actual work at any time. If you wrote the actors with while/receive instead of loop/react, they'd perform better.
Now, actors would make it easy to distribute the application over many computers, except that you violated one of the tenets of actors: you are calling methods on the actor object. You should never do that with actors and, in fact, Akka prevents you from doing so. A more actor-ish way of doing this would be for the aggregator to ask each actor for their key sets, compute their union, and then, for each key, ask all actors to send their count for that key.
I'm not sure, however, that the actor overhead is what you are seeing. You provided no information about the Java implementation, but I daresay you use mutable maps, and maybe even a single concurrent mutable map -- a very different implementation than what you are doing in Scala.
There's also no information on how the file is read (such a big file might have buffering issues), or how it is parsed in Java. Since most of the work is reading and parsing the file, not counting the tokens, differences in implementation there can easily overcome any other issue.
Finally, about resident memory size, Scala has a 9 MB library (in addition to what JVM brings), which might be what you are seeing. Of course, if you are using a single concurrent map in Java vs 6 immutable maps in Scala, that will certainly make a big difference in memory usage patterns.
Scala actors have been giving way to Akka actors lately... and more is coming; Viktor is hAkking further to make the latest the best: https://twitter.com/viktorklang/status/229694698397257728
BTW: Open Source is great power! This day should be a holiday for the whole JVM-based community:
http://www.marketwire.com/press-release/azul-systems-announces-new-initiative-support-open-source-community-with-free-zing-jvm-1684899.htm
