I am writing an application which needs to load a large CSV file that is pure data and doesn't contain any headers.
I am using the FastCSV library to parse the file, but the data needs to be stored and specific fields need to be retrieved. Since not all of the data is necessary, I am skipping every third line.
Is there a way to set the headers after the file has been parsed and save the data in a structure such as an ArrayList?
Here is the function which loads the file:
public void fastCsv(String filePath) {
    File file = new File(filePath);
    CsvReader csvReader = new CsvReader();
    int linecounter = 1;
    long startTime = System.currentTimeMillis();
    try (CsvParser csvParser = csvReader.parse(file, StandardCharsets.UTF_8)) {
        CsvRow row;
        while ((row = csvParser.nextRow()) != null) {
            if ((linecounter % 3) > 0) {
                // System.out.println("Read line: " + row);
                // System.out.println("First column of line: " + row.getField(0));
                System.out.println(row);
            }
            linecounter++;
        }
        long elapsedTime = System.currentTimeMillis() - startTime;
        System.out.println("Execution Time in ms: " + elapsedTime);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Any insight would be greatly appreciated.
univocity-parsers supports field selection and can do this very easily. It's also faster than the library you are using.
Here's how you can use it to select columns of interest:
Input
String input = "X, X2, Symbol, Date, Open, High, Low, Close, Volume\n" +
" 5, 9, AAPL, 01-Jan-2015, 110.38, 110.38, 110.38, 110.38, 0\n" +
" 2710, 289, AAPL, 01-Jan-2015, 110.38, 110.38, 110.38, 110.38, 0\n" +
" 5415, 6500, AAPL, 02-Jan-2015, 111.39, 111.44, 107.35, 109.33, 53204600";
Configure
CsvParserSettings settings = new CsvParserSettings(); //many options here, check the tutorial
settings.setHeaderExtractionEnabled(true); //tells the parser to use the first row as the header row
settings.selectFields("X", "X2"); //selects the fields
Parse and print results
CsvParser parser = new CsvParser(settings);
for (String[] row : parser.iterate(new StringReader(input))) {
    System.out.println(Arrays.toString(row));
}
Output
[5, 9]
[2710, 289]
[5415, 6500]
With field selection you can use any sequence of fields and have rows with different column counts; the parser will handle this just fine, so there is no need to write complex logic for that.
To process the File in your code, change the example above to this:
for(String[] row : parser.iterate(new File(filePath))){
... //your logic goes here.
}
If you want a more usable record (with typed values), use this instead:
for(Record record : parser.iterateRecords(new File(filePath))){
... //your logic goes here.
}
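For example, as a small sketch (the column names come from the sample input above; getInt and getString are the Record accessors for typed and string values):

for (Record record : parser.iterateRecords(new File(filePath))) {
    Integer x = record.getInt("X"); //typed access by column name
    String symbol = record.getString("Symbol");
    //your logic goes here
}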
Speeding up
The fastest way of processing the file is through a RowProcessor. That's a callback that receives the rows parsed from the input:
settings.setProcessor(new AbstractRowProcessor() {
    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        System.out.println(Arrays.toString(row));
        context.skipLines(3); //use the context object to control the parser
    }
});
CsvParser parser = new CsvParser(settings);
//`parse` doesn't return anything. Rows go to the `rowProcessed` method.
parser.parse(new StringReader(input));
You should be able to parse very large files pretty quickly. If things are slowing down, look at your own code: avoid adding values to lists or collections in memory, or at least pre-allocate the collections to a good size, and give the JVM a large amount of memory to work with using the -Xms and -Xmx flags.
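A rough sketch of the pre-allocation idea (the row count below is an assumption; substitute an estimate for your own file):

int expectedRows = 1_000_000; //assumption: roughly how many rows you expect to keep
List<String[]> rows = new ArrayList<>(expectedRows); //pre-sizing avoids repeated internal resizing
//and start the JVM with room to work, e.g.: java -Xms2g -Xmx4g YourApp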
Right now this parser is the fastest you can find. I made this performance comparison a while ago, which you can use for reference.
Hope this helps
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license)
Do you know which fields/columns you want to keep, and what you'd like the "header" values to be? I.e., do you want the first and third columns, and do you want them called "first" and "third"? If so, you could build a HashMap of String/Object (or another appropriate type, depending on your actual data and needs) and add each HashMap to an ArrayList. This should get you going; just change the HashMap types as needed:
ArrayList<HashMap<String, String>> arr = new ArrayList<>();
while ((row = csvParser.nextRow()) != null) {
    if ((linecounter % 3) > 0) {
        // System.out.println("Read line: " + row);
        // System.out.println("First column of line: " + row.getField(0));
        // keep col1 and col3; create a fresh map per row so entries don't overwrite each other
        HashMap<String, String> hm = new HashMap<>();
        hm.put("first", row.getField(0));
        hm.put("third", row.getField(2));
        arr.add(hm);
    }
    linecounter++;
}
If you want to capture all columns, you can use a similar technique, but I'd build a mapping data structure so that you can match field indexes to column header names in a loop, adding each column to the HashMap that is then stored in the ArrayList - see the sketch below.
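Here is a minimal sketch of that idea, reusing the loop and the getField(int) call from the code above; the names in the headers array are placeholders, so substitute whatever you want the columns to be called:

String[] headers = {"first", "second", "third"}; //one name per column, in file order
ArrayList<HashMap<String, String>> arr = new ArrayList<>();
while ((row = csvParser.nextRow()) != null) {
    if ((linecounter % 3) > 0) {
        HashMap<String, String> hm = new HashMap<>(); //fresh map per row
        for (int i = 0; i < headers.length; i++) {
            hm.put(headers[i], row.getField(i)); //map column index -> header name
        }
        arr.add(hm);
    }
    linecounter++;
}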
I am trying to export repeat grid data to Excel. To do this, I have provided a button which runs the "MyCustomActivity" activity when clicked. The button is placed above the grid in the same layout. It is also worth pointing out that I am using an article as a guide for the configuration. According to the guide, my "MyCustomActivity" activity contains two steps:
Method: Property-Set, Method Parameters: Param.exportmode = "excel"
Method: Call pzRDExportWrapper. I pass the current parameters (there is only one, from the 1st step).
But after I ran into an issue, I changed the 2nd step to Call Rule-Obj-Report-Definition.pzRDExportWrapper.
But as you have already understood, the solution doesn't work. I checked the log files and found an interesting error:
2017-04-11 21:08:27,992 [ WebContainer : 4] [OpenPortal] [ ] [ MyFW:01.01.02] (ctionWrapper._baseclass.Action) ERROR as1|172.22.254.110 bar - Activity 'MyCustomActivity' failed to execute; Failed to find a 'RULE-OBJ-ACTIVITY' with the name 'PZRESOLVECOPYFILTERS' that applies to 'COM-FW-MyFW-Work'. There were 3 rules with this name in the rulebase, but none matched this request. The 3 rules named 'PZRESOLVECOPYFILTERS' defined in the rulebase are:
2017-04-11 21:08:42,807 [ WebContainer : 4] [TABTHREAD1] [ ] [ MyFW:01.01.02] (fileSetup.Code_Security.Action) ERROR as1|172.22.254.110 bar - External authentication failed:
If someone has any suggestions to share, I would appreciate it.
Thank you.
I wanted to provide functionality for exporting retrieved work items to a CSV file. The functionality should let the user choose which fields to retrieve, all results should be in Ukrainian, and it should work with any SearchFilter page and Report Definition rule.
On the User Portal I have two sections: the first contains text fields and a Search button, and the second contains a Repeat Grid to display results. The text fields are used to filter the results, and they use the page Org-Div-Work-SearchFilter.
I made a custom CSV exporter: I created two activities and wrote some Java code. I should mention that I took some code from pzRDExportWrapper.
The activities are:
ExportToCSV - takes parameters from the user, gets the data, and invokes ConvertResultsToCSV;
ConvertResultsToCSV - converts the retrieved data to a .CSV file.
Configurations of the ExportToCSV activity:
The Pages And Classes tab:
ReportDefinition is an object of a certain Report Definition.
SearchFilter is a Page with values entered by the user.
ReportDefinitionResults is a list of retrieved work items to export.
ReportDefinitionResults.pxResults denotes the type of a certain work item.
The Parameters tab:
FileName is the name of the generated file.
ColumnsNames is a comma-separated list of column names. If the parameter is empty, CSVProperties is exported instead.
CSVProperties is a comma-separated list of properties to display in the spreadsheet.
SearchPageName is the name of the page used to filter results.
ReportDefinitionName is the name of the RD used to retrieve results.
ReportDefinitionClass is the class of the report definition used.
The Steps tab:
Let's look through the steps:
1. Get the SearchFilter page (its name comes from a parameter) with populated fields:
2. If SearchFilter is not empty, call a Data Transform to convert SearchFilter's properties to parameter properties:
A fragment of the Data Transform:
3. Get an object of the Report Definition:
4. Set parameters for the Report Definition:
5. Invoke the Report Definition and save the results to ReportDefinitionResults:
6. Invoke the ConvertResultsToCSV activity:
7. Delete the result page:
An overview of the ConvertResultsToCSV activity.
The Parameters tab of the ConvertResultsToCSV activity:
CSVProperties are the properties to retrieve and export.
ColumnsNames are the names of the columns to display.
PageListProperty is the name of the property to be read from the primary page.
FileName is the name of the generated file; it can be empty.
AppendTimeStampToFileName - if true, the time of file generation is appended to the file name.
CSVString is the string of generated CSV content to be saved to a file.
FileName is the name of the file.
listSeperator is always a semicolon, used to separate fields.
Let's walk through all the steps in the activity:
Get the localization from the user settings (commented out):
In theory, this can support localization in many languages.
Always set the "uk" (Ukrainian) localization.
Get the separator according to the localization. It is always a semicolon in Ukrainian, English and Russian; other languages would need to be checked.
The step contains Java code, which forms the CSV string:
StringBuffer csvContent = new StringBuffer(); // a content of buffer
String pageListProp = tools.getParamValue("PageListProperty");
ClipboardProperty resultsProp = myStepPage.getProperty(pageListProp);
// fill the properties names list
java.util.List<String> propertiesNames = new java.util.LinkedList<String>(); // names of properties which values display in csv
String csvProps = tools.getParamValue("CSVProperties");
propertiesNames = java.util.Arrays.asList(csvProps.split(","));
// get user's colums names
java.util.List<String> columnsNames = new java.util.LinkedList<String>();
String CSVDisplayProps = tools.getParamValue("ColumnsNames");
if (!CSVDisplayProps.isEmpty()) {
columnsNames = java.util.Arrays.asList(CSVDisplayProps.split(","));
} else {
columnsNames.addAll(propertiesNames);
}
// add columns to csv file
Iterator columnsIter = columnsNames.iterator();
while (columnsIter.hasNext()) {
csvContent.append(columnsIter.next().toString());
if (columnsIter.hasNext()){
csvContent.append(listSeperator); // listSeperator - local variable
}
}
csvContent.append("\r");
for (int i = 1; i <= resultsProp.size(); i++) {
ClipboardPage propPage = resultsProp.getPageValue(i);
Iterator iterator = propertiesNames.iterator();
int propTypeIndex = 0;
while (iterator.hasNext()) {
ClipboardProperty clipProp = propPage.getIfPresent((iterator.next()).toString());
String propValue = "";
if(clipProp != null && !clipProp.isEmpty()) {
char propType = clipProp.getType();
propValue = clipProp.getStringValue();
if (propType == ImmutablePropertyInfo.TYPE_DATE) {
DateTimeUtils dtu = ThreadContainer.get().getDateTimeUtils();
long mills = dtu.parseDateString(propValue);
java.util.Date date = new Date(mills);
String sdate = dtu.formatDateTimeStamp(date);
propValue = dtu.formatDateTime(sdate, "dd.MM.yyyy", "", "");
}
else if (propType == ImmutablePropertyInfo.TYPE_DATETIME) {
DateTimeUtils dtu = ThreadContainer.get().getDateTimeUtils();
propValue = dtu.formatDateTime(propValue, "dd.MM.yyyy HH:mm", "", "");
}
else if ((propType == ImmutablePropertyInfo.TYPE_DECIMAL)) {
propValue = PRNumberFormat.format(localeCode,PRNumberFormat.DEFAULT_DECIMAL, false, null, new BigDecimal(propValue));
}
else if (propType == ImmutablePropertyInfo.TYPE_DOUBLE) {
propValue = PRNumberFormat.format(localeCode,PRNumberFormat.DEFAULT_DECIMAL, false, null, Double.parseDouble(propValue));
}
else if (propType == ImmutablePropertyInfo.TYPE_TEXT) {
propValue = clipProp.getLocalizedText();
}
else if (propType == ImmutablePropertyInfo.TYPE_INTEGER) {
Integer intPropValue = Integer.parseInt(propValue);
if (intPropValue < 0) {
propValue = new String();
}
}
}
if(propValue.contains(listSeperator)){
csvContent.append("\""+propValue+"\"");
} else {
csvContent.append(propValue);
}
if(iterator.hasNext()){
csvContent.append(listSeperator);
}
propTypeIndex++;
}
csvContent.append("\r");
}
CSVString = csvContent.toString();
5. This step forms and saves a file in the server's directory tree:
char sep = PRFile.separatorChar;
String exportPath= tools.getProperty("pxProcess.pxServiceExportPath").getStringValue();
DateTimeUtils dtu = ThreadContainer.get().getDateTimeUtils();
String fileNameParam = tools.getParamValue("FileName");
if(fileNameParam.equals("")){
fileNameParam = "RecordsToCSV";
}
//append a time stamp
Boolean appendTimeStamp = tools.getParamAsBoolean(ImmutablePropertyInfo.TYPE_TRUEFALSE,"AppendTimeStampToFileName");
FileName += fileNameParam;
if(appendTimeStamp) {
FileName += "_";
String currentDateTime = dtu.getCurrentTimeStamp();
currentDateTime = dtu.formatDateTime(currentDateTime, "HH-mm-ss_dd.MM.yyyy", "", "");
FileName += currentDateTime;
}
//append a file format
FileName += ".csv";
String strSQLfullPath = exportPath + sep + FileName;
PRFile f = new PRFile(strSQLfullPath);
PROutputStream stream = null;
PRWriter out = null;
try {
// Create file
stream = new PROutputStream(f);
out = new PRWriter(stream, "UTF-8");
// Bug with Excel reading a file starting with 'ID' as SYLK file. If CSV starts with ID, prepend an empty space.
if(CSVString.startsWith("ID")){
CSVString=" "+CSVString;
}
out.write(CSVString);
} catch (Exception e) {
oLog.error("Error writing csv file: " + e.getMessage());
} finally {
    try {
        // Close the writer (and its underlying stream); guard against the case where creation failed
        if (out != null) {
            out.close();
        } else if (stream != null) {
            stream.close();
        }
    } catch (Exception e) {
        oLog.error("Error closing the file stream: " + e.getMessage());
    }
}
The last step calls @baseclass.DownloadFile to download the file:
Finally, we can put a button on a section (or somewhere else) and set up its Actions tab like this:
It also works fine inside "Refresh Section" action.
A possible result could be
Thanks for reading.
I am evaluating Riak KV v2.1.1 on a local desktop using the Java client and a slightly customised version of the sample code.
My concern is that I found it to be taking almost 920 bytes per key/value pair.
That's too steep. The data directory was 93 MB for 100k KVs and kept increasing linearly thereafter for every 100k store ops.
Is that expected?
RiakCluster cluster = setUpCluster();
RiakClient client = new RiakClient(cluster);
System.out.println("Client object successfully created");
Namespace quotesBucket = new Namespace("quotes2");
long start = System.currentTimeMillis();
for(int i=0; i< 100000; i++){
RiakObject quoteObject = new RiakObject().setContentType("text/plain").setValue(BinaryValue.create("You're dangerous, Maverick"));
Location quoteObjectLocation = new Location(quotesBucket, ("Ice"+i));
StoreValue storeOp = new StoreValue.Builder(quoteObject).withLocation(quoteObjectLocation).build();
StoreValue.Response storeOpResp = client.execute(storeOp);
}
There was a thread on the riak users mailing list a while back that discussed the overhead of the riak object, estimating it at ~400 bytes per object. However, that was before the new object format was introduced, so it is outdated. Here is a fresh look.
First, we need a local client:
(node1@127.0.0.1)1> {ok,C}=riak:local_client().
{ok,{riak_client,['node1@127.0.0.1',undefined]}}
Create a new riak object with a 0-byte value:
(node1@127.0.0.1)2> Obj = riak_object:new(<<"size">>,<<"key">>,<<>>).
#r_object{bucket = <<"size">>,key = <<"key">>,
contents = [#r_content{metadata = {dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[],[],...}}},
value = <<>>}],
vclock = [],
updatemetadata = {dict,1,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],...}}},
updatevalue = undefined}
The object is actually stored in a reduced binary format:
(node1@127.0.0.1)3> byte_size(riak_object:to_binary(v1,Obj)).
36
That is 36 bytes overhead for just the object, but that doesn't include the metadata like last updated time or the version vector, so store it in Riak and check again.
(node1@127.0.0.1)4> C:put(Obj).
ok
(node1@127.0.0.1)5> {ok,Obj1} = C:get(<<"size">>,<<"key">>).
{ok, #r_object{bucket = <<"size">>,key = <<"key">>,
contents = [#r_content{metadata = {dict,3,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[[...]],[...],...}}},
value = <<>>}],
vclock = [{<<204,153,66,25,119,94,124,200,0,0,156,65>>,
{3,63654324108}}],
updatemetadata = {dict,1,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],...}}},
updatevalue = undefined}}
(node1@127.0.0.1)6> byte_size(riak_object:to_binary(v1,Obj1)).
110
Now it is 110 bytes overhead for an empty object with a single entry in the version vector. If a subsequent put of the object is coordinated by a different vnode, it will add another entry. I've selected the bucket and key names so that the local node is not a member of the preflist, so the second put has a fair probability of being coordinated by a different node.
(node1@127.0.0.1)7> C:put(Obj1).
ok
(node1@127.0.0.1)8> {ok,Obj2} = C:get(<<"size">>,<<"key">>).
{ok, #r_object{bucket = <<"size">>,key = <<"key">>,
contents = [#r_content{metadata = {dict,3,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[[...]],[...],...}}},
value = <<>>}],
vclock = [{<<204,153,66,25,119,94,124,200,0,0,156,65>>,
{3,63654324108}},
{<<85,123,36,24,254,22,162,159,0,0,78,33>>,{1,63654324651}}],
updatemetadata = {dict,1,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],...},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],...}}},
updatevalue = undefined}}
(node1@127.0.0.1)9> byte_size(riak_object:to_binary(v1,Obj2)).
141
That is another 31 bytes added for an additional entry in the version vector.
These numbers don't include storing the actual bucket and key names with the value, or Bitcask storing them again in a hint file, so the actual space on disk would be roughly 2 x (bucket name size + key name size) + value overhead + file structure overhead + checksum/hash size.
If you're using Bitcask, there is a calculator in the documentation that will help you estimate disk and memory requirements: http://docs.basho.com/riak/kv/2.2.0/setup/planning/bitcask-capacity-calc/
If you use eLevelDB, you have the option of Snappy compression, which could reduce the size on disk.
I'm writing a program in Java that parses a BibTeX library file. Each entry should be parsed into fields and values. This is an example of a single BibTeX entry from a library:
@INPROCEEDINGS{conf/icsm/Ceccato07,
author = {Mariano Ceccato},
title = {Migrating Object Oriented code to Aspect Oriented Programming},
booktitle = {ICSM},
year = {2007},
pages = {497--498},
publisher = {IEEE},
bibdate = {2008-11-18},
bibsource = {DBLP, http://dblp.uni-trier.de/db/conf/icsm/icsm2007.html#Ceccato07},
crossref = {conf/icsm/2007},
owner = {Administrator},
timestamp = {2009.04.30},
url = {http://dx.doi.org/10.1109/ICSM.2007.4362668}
}
In this case, I just read the line and split it using the split method. For example, the first field (author) is parsed like this:
Scanner in = new Scanner(new File("library.bib"));
in.nextLine(); //skip the header
String input = in.nextLine(); //read (author = {Mariano Ceccato},)
String field = input.split("=")[0].trim(); //field = "author"
String value = input.split("=")[1]; //value = "{Mariano Ceccato},"
value = value.split("\\}")[0]; //value = "{Mariano Ceccato"
value = value.split("\\{")[1]; //value = "Mariano Ceccato"
value = value.trim(); //remove any white spaces (if any)
Up to now everything is good. However, there are BibTeX entries in the library that have multi-line values:
@ARTICLE{Aksit94AbstractingCF,
author = {Mehmet Aksit and Ken Wakita and Jan Bosch and Lodewijk Bergmans and
Akinori Yonezawa },
title = {{Abstracting Object Interactions Using Composition Filters}},
journal = {Lecture Notes in Computer Science},
year = {1994},
volume = {791},
pages = {152--??},
acknowledgement = {Nelson H. F. Beebe, Center for Scientific Computing, University of
Utah, Department of Mathematics, 110 LCB, 155 S 1400 E RM 233, Salt
Lake City, UT 84112-0090, USA, Tel: +1 801 581 5254, FAX: +1 801
581 4148, e-mail: \path|beebe@math.utah.edu|, \path|beebe@acm.org|,
\path|beebe@computer.org|, \path|beebe@ieee.org| (Internet), URL:
\path|http://www.math.utah.edu/~beebe/|},
bibdate = {Mon May 13 11:52:14 MDT 1996},
coden = {LNCSD9},
issn = {0302-9743},
owner = {aljasser},
timestamp = {2009.01.08}
}
As you see, the acknowledgement field spans more than one line, so I can't read it using nextLine(). My parsing function works fine with it if I pass the whole field to it as a single String. So what is the best way to read this entry and other multi-line entries, while still being able to read single-line entries?
The form of these entries is:
@<type>{<Id>,
<name>={<value>},
....
<name>={<value>}
}
Note that the last name-value pair is not followed by a comma.
If a value is split over several lines, that simply means a particular line does not yet contain the closing brace. In that case, read the next line and append it to the string you are about to split, and keep doing this until the last characters in the string are "}," or "}" (the latter happens when, as with 'acknowledgement', the field is the last name-value pair in the record).
For extra safety, count that the number of closing braces matches the number of opening braces, and keep appending lines to your string until it does (a sketch of this follows the example below). This covers situations where a long title happens to break at an unfortunate place, such as
title = {{Abstracting Object Interactions Using Composition Filters, and other stuff}
},
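Here is a minimal sketch of the brace-counting approach in Java (parseEntry is a hypothetical placeholder for your existing single-line field/value splitting logic):

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

static void readBibEntry(String path) throws FileNotFoundException {
    Scanner in = new Scanner(new File(path));
    in.nextLine(); //skip the entry header, e.g. @ARTICLE{Aksit94AbstractingCF,
    while (in.hasNextLine()) {
        String line = in.nextLine().trim();
        if (line.equals("}")) {
            break; //closing brace of the whole record
        }
        StringBuilder entry = new StringBuilder(line);
        //keep appending lines until opening and closing braces balance
        while (count(entry, '{') > count(entry, '}')) {
            entry.append(' ').append(in.nextLine().trim());
        }
        parseEntry(entry.toString()); //your existing field = {value} splitting, unchanged
    }
    in.close();
}

static int count(CharSequence s, char c) {
    int n = 0;
    for (int i = 0; i < s.length(); i++) {
        if (s.charAt(i) == c) n++;
    }
    return n;
}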
For these kinds of issues, it is always better to use a dedicated parser.
I googled for a BibTeX parser and found this.
If you'd like to keep your own approach as you are doing, one solution to this problem is to check whether the line ends with }, and if not, append the next line to the current one.
Having said that, there might be other issues, which is why I suggested using a parser.