I'm trying to create a simple class to read a csv file and store the contents in an
ArrayList<ArrayList<T>>.
I'm creating a generic class CsvReader so that I can handle data of different types: int, double, String. If I had, say, a csv file of doubles, I was imagining I would use my class like this:
//possible method 1
CsvReader<Double> reader = new CsvReader<Double>();
ArrayList<ArrayList<Double>> contents = reader.getContents();
//possible method 2
CsvReader reader = new CsvReader(Double.class);
ArrayList<ArrayList<Double>> contents = reader.getContents();
But method 1 won't work since type erasure prevents you from writing code like
rowArrayList.add(new T(columnStringValue));
But I can't even make the passing in a Double.class solution work. The problem is that what's really going on is that I need my class "parameterized" (in the general sense of that word, not the technical java generics sense) on a type with the following property: it has a ctor accepting a single String argument. That is, to create the row ArrayLists on, say, a Double csv file, I'd need to write:
StringTokenizer st = new StringTokenizer(line,",");
ArrayList<Double> curRow = new ArrayList<Double>();
while (st.hasMoreTokens()) {
curRow.add(new Double(st.nextToken()));
}
Having passed in Double.class, I could get its String ctor using
Constructor ctor = c.getConstructor(new Class[] {String.class});
but this has two problems. Most importantly, this is a general constructor that will return a type Object, which I cannot then downcast into a Double. Second, I would be missing "type" checking on the fact that I am requiring my passed in class to have a String arg constructor.
My question is: How can I properly implement this general purpose CsvReader?
Thanks,
Jonah
I'm not sure a generic CSV reader would be this simple to use (and to create, by the way).
The first question that comes to my mind is: What if the CSV contains three columns: first an integer, then a string and finally a date? How would you use your generic CSV reader?
Anyway, let's suppose you want to create a CSV reader where all columns are of the same type. As you said, you can't parameterize a class on a type "that accepts a String as constructor". Java just doesn't allow that. The solution using reflection is a good start. But what if your class doesn't take a String as a parameter in any of its constructors?
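For reference, the reflective route can be made type-safe, which addresses the downcast worry in the question: Class<T>.getConstructor returns a Constructor<T>, and its newInstance returns T. The sketch below is only an illustration; the class name ReflectiveRowParser and the naive split on "," are assumptions, not part of the original code. The check for a String constructor still happens at run time rather than at compile time.

```java
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds row values through a public (String) constructor.
final class ReflectiveRowParser<T> {
    private final Constructor<T> ctor;

    // Fails fast if the class has no public constructor taking one String.
    ReflectiveRowParser(Class<T> type) {
        try {
            this.ctor = type.getConstructor(String.class);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(
                type.getName() + " has no public (String) constructor", e);
        }
    }

    // Naive split on commas (no quoting support), one object per cell.
    List<T> parseRow(String line) {
        List<T> row = new ArrayList<>();
        for (String cell : line.split(",")) {
            try {
                row.add(ctor.newInstance(cell.trim())); // typed as T, no cast
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot parse cell: " + cell, e);
            }
        }
        return row;
    }
}
```

For example, new ReflectiveRowParser<>(java.math.BigDecimal.class).parseRow("1.5, 2") yields a List<BigDecimal> with two elements.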
Here you can come up with an alternative: a parser that takes your String and returns an object of the correct type. Create a generic interface, and write an implementation for each type you want to parse:
public interface Parser<T> {
T parse(String value);
}
And then, implement:
public class StringParser implements Parser<String> {
public String parse(String value) {
return value;
}
}
Then your CSV reader can take a Parser as one of its parameters and use it to convert each String into a Java object.
With this solution, you get rid of the not-so-pretty reflection you were using, and you can convert to any type: you just have to implement a Parser.
Your reader will look like this:
public class CSVReader<T> {
Parser<T> parser;
List<T> getValues() {
// ...
}
}
Now, back to the problem where a CSV file can have multiple types: just improve your reader a little. All you need is a list of parsers (one per column) instead of a single parser for all columns.
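A minimal sketch of the one-parser-per-column idea, assuming the row has already been split into cells. The class name ColumnParsingReader and the method parseRow are made up for illustration; Parser is repeated here only so the block compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Same shape as the Parser interface shown earlier in this answer.
interface Parser<T> {
    T parse(String value);
}

// Hypothetical reader core: cell i is converted by parser i.
final class ColumnParsingReader {
    private final List<Parser<?>> columnParsers;

    ColumnParsingReader(List<Parser<?>> columnParsers) {
        this.columnParsers = columnParsers;
    }

    // Converts one already-split row; rejects rows with the wrong cell count.
    List<Object> parseRow(String[] cells) {
        if (cells.length != columnParsers.size()) {
            throw new IllegalArgumentException(
                "expected " + columnParsers.size() + " cells, got " + cells.length);
        }
        List<Object> row = new ArrayList<>();
        for (int i = 0; i < cells.length; i++) {
            row.add(columnParsers.get(i).parse(cells[i]));
        }
        return row;
    }
}
```

The price of mixed column types is that a row can only be typed as List<Object>; callers have to know which column holds which type.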
Hope that helps :-)
Creating a correct CSV reader might be more difficult than you think. For example, the code in your question will not handle the following line correctly:
"Microsoft, Inc",1,2,3
Instead of 4 fields, you will get 5 fields, because of
StringTokenizer st = new StringTokenizer(line,",");
My suggestion is to use a third-party library implementation. For example:
http://opencsv.sourceforge.net/
I use it in one of my applications, which has been running for 3 years. So far so good.
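To make the quoting problem concrete, here is a small hand-written splitter that treats commas inside double quotes as data. It is only a sketch to show why StringTokenizer is not enough; the class name QuotedCsvLine is made up, it handles a single line only (no embedded newlines), and it is not a substitute for a tested library.

```java
import java.util.ArrayList;
import java.util.List;

final class QuotedCsvLine {
    // Splits one line on commas, keeping commas inside double quotes as data.
    // A doubled quote ("") inside a quoted field stands for one literal quote.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // escaped quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field
        return fields;
    }
}
```

With the line above as input, split returns four fields, the first being Microsoft, Inc without the quotes, whereas the StringTokenizer version reports five tokens.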
If you are trying to do real work, I suggest you forget that and use Scanner.
If you are experimenting: I would make CsvReader an abstract class:
public abstract class CsvReader<T> {
...
// This is what you use in the rest of CsvReader
// to create your objects from the strings in the CSV
protected abstract T parse(String s);
...
}
And it would be used as:
CsvReader<Double> reader = new CsvReader<Double>() {
@Override protected Double parse(String s) {
return Double.valueOf(s);
}
};
...
Not perfect, but reasonable.
EDIT: It turns out that you can have it your way, though it looks a bit hackish. See Super Type Tokens. It would basically involve including the logic shown in the Super Type Tokens link in CsvReader, so that the Class object corresponding to your element class is available.
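A rough sketch of the super type token trick, assuming the reader is always created through a direct subclass (typically an anonymous one) that names a concrete type argument. TypeAwareReader and elementType are illustrative names, not part of the code above.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

// Hypothetical base class that recovers T from its direct subclass.
abstract class TypeAwareReader<T> {
    private final Class<T> elementType;

    @SuppressWarnings("unchecked")
    protected TypeAwareReader() {
        // The generic superclass of the subclass still records the type argument.
        Type superType = getClass().getGenericSuperclass();
        if (!(superType instanceof ParameterizedType)) {
            throw new IllegalStateException("subclass must supply a type argument");
        }
        Type arg = ((ParameterizedType) superType).getActualTypeArguments()[0];
        if (!(arg instanceof Class)) {
            throw new IllegalStateException("type argument must be a plain class: " + arg);
        }
        this.elementType = (Class<T>) arg;
    }

    Class<T> elementType() {
        return elementType;
    }
}
```

Writing new TypeAwareReader<Double>() {} (note the braces, which create an anonymous subclass) makes elementType() return Double.class. Without the braces the class is abstract and cannot be instantiated, which is what forces callers through a subclass.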
I had a need to read a simple list of strings stored in the cells of a CSV file, and started searching for a Java solution. I found most open source CSV readers to be unnecessarily complicated for my purpose. (See https://agiletribe.purplehillsbooks.com/2012/11/23/the-only-class-you-need-for-csv-files/ for a comprehensive review).
Finally I found MKYong's code very effective. I had to adapt it for my purpose to read the whole CSV or TSV file and return it as a list of lists. Each element in the inner list represents one cell of the CSV. The code, along with credit to MKYong, can be found at:
https://github.com/ramanraja/CsvReader
I thought that it's such a common question, but I can't find anything which helps me. I'm relatively new to java and exercising for a job application. For that I've started writing a tool to transform data. (eg. read CSV, translate some columns and write it as SQL inserts to a file)
If you're interested, you can find it here; I'll copy some code for my question: https://github.com/fvosberg/datatransformer
I've started with a class which should read the CSV (and which will get more complex, for example by supporting enclosed fields that contain the separator). My IDE (IntelliJ IDEA) suggested using access modifiers that are as strict as possible for my methods. Why should I hide these methods (with private) from subclasses?
package de.frederikvosberg.datatransformer;
import java.io.BufferedReader;
import java.io.Reader;
import java.util.*;
class CSVInput {
private final BufferedReader _reader;
private String separator = ",";
public CSVInput(Reader reader) {
_reader = new BufferedReader(reader);
}
public List<SortedMap<String, String>> readAll() throws java.io.IOException {
List<SortedMap<String, String>> result = new LinkedList<>();
List<String> headers = readHeaders();
String line;
while ((line = _reader.readLine()) != null) {
result.add(
colsFromLine(headers, line)
);
}
_reader.close();
return result;
}
private List<String> readHeaders() throws java.io.IOException {
String line = _reader.readLine();
if (line == null) {
throw new RuntimeException("There is no first line for the headers in the CSV");
}
return valuesFromLine(line);
}
public void setSeparator(String separator) {
this.separator = separator;
}
/**
* creates a list of values from a CSV line
* it uses the separator field
*
* @param line a line with values separated by this.separator
* @return a list of values
*/
private List<String> valuesFromLine(String line) {
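return Arrays.asList(
// split() takes a regex, so quote the separator to match it literally
line.split(java.util.regex.Pattern.quote(this.separator))
);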
}
private SortedMap<String, String> colsFromLine(List<String> headers, String line) {
SortedMap<String, String> cols = new TreeMap<>();
List<String> values = valuesFromLine(line);
Iterator<String> headersIterator = headers.iterator();
Iterator<String> valuesIterator = values.iterator();
while (headersIterator.hasNext() && valuesIterator.hasNext()) {
cols.put(headersIterator.next(), valuesIterator.next());
}
if (headersIterator.hasNext() || valuesIterator.hasNext()) {
throw new RuntimeException("The size of a row doesn't fit with the size of the headers");
}
return cols;
}
}
Another downside is the unit tests. I would like to write separate tests for my methods, especially the CSVInput::valuesFromLine method, which will get more complex. My unit test for this class is testing so much, and I really don't want to have too many things in my head when developing.
Any suggestions from experienced Java programmers?
Thanks in advance
Replies to comments
Thank you for your comments. Let me answer to the comments here, for the sake of clarity.
"Why should I hide these methods (with private) from subclasses?" Why
do you keep your car keys away from your front door?
For security purposes, but how does it affect security when I change the access modifier of the colsFromLine method? This method accepts the headers as a parameter, so it neither relies on internal state nor changes it.
The other advantage of strict access modifiers I can think of is that they show other developers which methods they should use and where the logic belongs.
Don't write your test to depend on the internal implementation of the
functionality, just write a test to verify the functionality.
I don't. It depends on what you mean by internal implementation. I don't check any internal state or variables. I just want to test, in steps, the algorithm that is going to parse the CSV.
"My Unit test for this class is testing so much" - If too many tests
on a class, you should rethink your design. It's very likely that your
class literally is doing too much, and should be broken up.
I don't have many tests on my class, but if I continue the way I started, I'm going to write many tests for the same method (parsing the CSV) because it has many edge cases. The tests will grow in size because of the different boilerplate each one needs. That is the concern behind my question.
To answer your direct question: you always strive to hide as much as possible, both from client code and from subclasses.
The point is: you want (theoretically) to be able to change parts/all of your implementation without affecting other elements in your system. When client/subclass code knows about such implementation details ... sooner or later, such code starts relying on them. To avoid that, you keep them out of sight. The golden rule is: good OO design is about behavior (or "contracts") of objects and methods. You absolutely do not care how some method does its job; you only care about what it does. The how part should be invisible!
Having said that, sometimes it does make sense to give "package protected" visibility to some of your methods, in order to make them available within your unit tests.
And beyond that: I don't see much point in extending your CsvInput (prefer camel case, even for class names!) class anyway. As usual: prefer composition over inheritance!
In any case, "assignments" of this kind are excellent material for practicing TDD. You write a test (that checks one aspect), and then you write the code to pass that test. Then you write another test checking another condition, and so on.
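One way to combine these points (composition, package-level visibility, small tests) is to move the line splitting into its own small class that CSVInput could delegate to. This is only a sketch; CsvLineSplitter is a hypothetical name, and the choice to keep trailing empty cells is an assumption about the desired behavior.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; package-private so tests in the same package can call it.
final class CsvLineSplitter {
    private final String separator;

    CsvLineSplitter(String separator) {
        this.separator = separator;
    }

    List<String> valuesFromLine(String line) {
        // Pattern.quote matches the separator literally (e.g. "|" or ".");
        // the limit -1 keeps trailing empty cells instead of dropping them.
        return Arrays.asList(line.split(Pattern.quote(separator), -1));
    }
}
```

Each edge case of the parsing then becomes a short test against valuesFromLine alone, for example that "a|b|" with separator "|" gives three values, the last one empty, without any reader or header setup.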
I was using a library called Mallet. It is by far the most complicated Java Library I have ever used. They provide tutorials and code template and I was trying to understand it. However, I came across this line of code:
TransducerEvaluator evaluator = new MultiSegmentationEvaluator(
new InstanceList[]{trainingData, testingData},
new String[]{"train", "test"}, labels, labels) {
@Override
public boolean precondition(TransducerTrainer tt) {
// evaluate model every 5 training iterations
return tt.getIteration() % 5 == 0;
}
};
Please don't pay too much attention to the term "transducer". What is passed into this function? Two classes? What is this new String[]{}? I am just very, very confused by this syntax, as I have never seen it before.
This is the code for this method:
public MultiSegmentationEvaluator (InstanceList[] instanceLists, String[] instanceListDescriptions,
Object[] segmentStartTags, Object[] segmentContinueTags)
Can someone tell me what this weird construct is?
This construct does several things:
Creates a subclass of MultiSegmentationEvaluator without giving it a name
Provides an override of the precondition(TransducerTrainer tt) method
Instantiates the newly defined anonymous class by passing an InstanceList array, a String array, and labels (twice) to the constructor that takes four parameters.
Assigns the newly created instance to the evaluator variable.
The code uses the anonymous class feature of Java - a very handy tool for situations when you have to subclass or implement an interface, but the class that you define is used in only one spot in your program.
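The same construct in a stripped-down form, without Mallet. Evaluator here is a made-up stand-in for MultiSegmentationEvaluator with a simplified signature, so none of this reflects Mallet's actual API.

```java
// Made-up stand-in for a library class with an overridable hook.
class Evaluator {
    private final String[] descriptions;

    Evaluator(String[] descriptions) {
        this.descriptions = descriptions;
    }

    // Default behavior: always evaluate. Subclasses may override.
    boolean precondition(int iteration) {
        return true;
    }

    int descriptionCount() {
        return descriptions.length;
    }
}

final class AnonymousClassDemo {
    static Evaluator everyFifthIteration() {
        // "new Evaluator(args) { ... }" declares an unnamed subclass,
        // passes args to the superclass constructor, and instantiates it.
        return new Evaluator(new String[]{"train", "test"}) {
            @Override
            boolean precondition(int iteration) {
                return iteration % 5 == 0;
            }
        };
    }
}
```

The object returned by everyFifthIteration() is an Evaluator, but its runtime class is the unnamed subclass, so precondition(10) is true and precondition(3) is false.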
Consider this code:
String[] stringArr = new String[]{"train", "test"};
Does it make any sense now? It is a String array! =) Here's even more stupid code to prove my point:
new String[]{"train", "test"}.getClass() == String[].class
InstanceList[]
means that you need to pass an array of objects of the type InstanceList; the same goes for String[].
As for these:
Object[]
means that an array of any objects (anything that is a subclass of Object) can be passed as the argument for each of the last two parameters.
The code at the top does exactly this: it creates new arrays for InstanceList[] and String[] inline, and then passes labels for both of the last two parameters.
Pretty new to Java
I would like to be able to use a method in following sort of way;
class PairedData {
String label;
Object val;
}
public void myMethod(String tablename, PairedData ... pD) {
/*
insert a record into a table -tablename with the various fields being
populated according to the information provided by the list of
PairedData objects
*/
}
myMethod("firststring",{"field1",Date1},{"field2",12},{"field3","aString"});
I realise the syntax is not valid but I hope it gives the gist of what I would like to do.
What I am trying to do is to directly pass the data rather than populate the instances of the class and then pass those. Is that possible or am I just trying to break a whole lot of OOPs rules?
No, what you're trying to do really isn't possible. It looks to me like it would be much better to pass instances of your class to the method as opposed to doing something convoluted with arrays like that. Another answer suggested using an Object[] varargs parameter - that's probably the closest you'll get to achieving something like what you show in your example. Another alternative (and I think a better one) would be
public void myMethod(String tablename, String[] labels, Object[] vals) {
You could instantiate your class for each labels[i] and vals[i] (pairing them up) in those arrays. In other words, in your method you could have something like
PairedData[] pD = new PairedData[labels.length];
for (int i = 0; i < labels.length; i++) {
    // assumes you added a PairedData(String, Object) constructor
    pD[i] = new PairedData(labels[i], vals[i]);
}
The method call example that you included would then be converted to
myMethod("firststring", new String[]{"field1", "field2", "field3"},
new Object[]{date1, 12, "aString"});
You can do this by using arrays of Object:
public void myMethod(String tableName, Object[] ...pairs)
and invoke this method in a such style:
myMethod("someTable", new Object[] {"field1", date1}, new Object[] {"field2", date2});
Usually you would make a class that has a field for each of the parameters, build an instance of that class, and populate its values. You can then pass that instance around.
If you want a whole bunch of them, put them in a collection (a Map, a List, etc.) and pass that.
Seems to be a good case for a future language extension if you ask me. But by slightly changing the way you call your method we should be able to get close ...
myMethod("someTable",
new PairedData("field1", date1),
new PairedData("field2", date2)
);
It's more typing, but it is probably the safest option: it is type-safe, and there is no risk of mismatched pairs.
You would also need to write the constructor PairedData(String label, Object val); I advise writing multiple overloaded versions, one for each type of val you plan to store.
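Put together, this could look like the sketch below. The static factory pair and the method describe are assumptions added to keep call sites short and to have something observable to return; a real myMethod would perform the insert instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PairedData {
    final String label;
    final Object val;

    PairedData(String label, Object val) {
        this.label = label;
        this.val = val;
    }
}

final class PairedDataDemo {
    // Short factory so call sites read almost like the literal syntax in the question.
    static PairedData pair(String label, Object val) {
        return new PairedData(label, val);
    }

    // Hypothetical stand-in for myMethod: gathers the pairs in call order
    // and returns a description instead of touching a database.
    static String describe(String tableName, PairedData... pairs) {
        Map<String, Object> columns = new LinkedHashMap<>();
        for (PairedData p : pairs) {
            columns.put(p.label, p.val);
        }
        return tableName + " " + columns;
    }
}
```

With a static import of pair, a call becomes describe("firststring", pair("field2", 12), pair("field3", "aString")), which is about as close to the desired syntax as Java gets.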
This question isn't specifically about performing tokenization with regular expressions, but more so about how an appropriate type of object (or appropriate constructor of an object) can be matched to handle the tokens output from a tokenizer.
To explain a bit more, my objective is to parse a text file containing lines of tokens into appropriate objects that describe the data. My parser is in fact already complete, but at present is a mess of switch...case statements and the focus of my question is how I can refactor this using a nice OO approach.
First, here's an example to illustrate what I'm doing overall. Imagine a text file that contains many entries like the following two:
cat 50 100 "abc"
dog 40 "foo" "bar" 90
When parsing those two particular lines of the file, I need to create instances of classes Cat and Dog respectively. In reality there are quite a large number of different object types being described, and sometimes different variations in the number of arguments, with defaults often being assumed if the values aren't explicitly stated (which means it's usually appropriate to use the builder pattern when creating the objects, or some classes have several constructors).
The initial tokenization of each line is being done using a Tokenizer class I created that uses groups of regular expressions that match each type of possible token (integer, string, and a few other special token types relevant to this application) along with Pattern and Matcher. The end result from this tokenizer class is that, for each line it parses, it provides back a list of Token objects, where each Token has a .type property (specifying integer, string, etc.) along with primitive value properties.
For each line parsed, I have to:
switch...case on the object type (first token);
switch on the number of arguments and choose an appropriate constructor
for that number of arguments;
Check that each token type is appropriate for the types of arguments needed to construct the object;
Log an error if the quantity or combination of argument types aren't appropriate for the type of object being called for.
The parser I have at the moment has a lot of switch/case or if/else all over the place to handle this and although it works, with a fairly large number of object types it's getting a bit unwieldy.
Can someone suggest an alternative, cleaner and more 'OO' way of pattern matching a list of tokens to an appropriate method call?
The answer was in the question; you want a Strategy, basically a Map where the key would be, e.g., "cat" and the value an instance of:
final class CatCreator implements Creator {
final Argument<Integer> length = intArgument("length");
final Argument<Integer> width = intArgument("width");
final Argument<String> name = stringArgument("name");
public List<Argument<?>> arguments() {
return asList(length, width, name);
}
public Cat create(Map<Argument<?>, String> arguments) {
return new Cat(length.get(arguments), width.get(arguments), name.get(arguments));
}
}
Supporting code that you would reuse between your various object types:
abstract class Argument<T> {
abstract T get(Map<Argument<?>, String> arguments);
private Argument() {
}
static Argument<Integer> intArgument(String name) {
return new Argument<Integer>() {
Integer get(Map<Argument<?>, String> arguments) {
return Integer.parseInt(arguments.get(this));
}
};
}
static Argument<String> stringArgument(String name) {
return new Argument<String>() {
String get(Map<Argument<?>, String> arguments) {
return arguments.get(this);
}
};
}
}
I'm sure someone will post a version that needs less code but uses reflection. Choose either but do bear in mind the extra possibilities for programming mistakes making it past compilation with reflection.
I have done something similar, where I have decoupled my parser from code emitter, which I consider anything else but the parsing itself. What I did, is introduce an interface which the parser uses to invoke methods on whenever it believes it has found a statement or a similar program element. In your case these may well be individual lines you have shown in the example in your question. So whenever you have a line parsed you invoke a method on the interface, an implementation of which will take care of the rest. That way you isolate the program generation from parsing, and both can do well on their own (well, at least the parser, as the program generation will implement an interface the parser will use). Some code to illustrate my line of thinking:
interface CodeGenerator
{
void onParseCat(int a, int b, String c); ///As per your line starting with "cat..."
void onParseDog(int a, String b, String c, int d); /// In same manner
}
class Parser
{
final CodeGenerator cg;
Parser(CodeGenerator cg)
{
this.cg = cg;
}
void parseCat() /// When you already know that the sequence of tokens matches a "cat" line
{
/// ...
cg.onParseCat(/* variable values you have obtained during parsing/tokenizing */);
}
}
This gives you several advantages. One is that you do not need complicated switch logic, since you have already determined the type of statement/expression/element and can invoke the correct method. You can even use a single method name such as onParse in the CodeGenerator interface, relying on Java method overloading, if you always want to use the same name. Remember also that you can look up methods at runtime with reflection, which can help further in removing switch logic.
getClass().getMethod("onParse", Integer.class, Integer.class, String.class).invoke(this, a, b, c);
Note that the above uses the Integer class instead of the primitive type int, and that your methods must be overloaded on parameter type and count. If two distinct statements use the same parameter sequence, this approach fails, because two methods in one class cannot have the same signature. This is of course a limitation of method overloading in Java (and many other languages).
In any case, you have several methods to achieve what you want. The key to avoid switch is to implement some form of virtual method call, rely on built-in virtual method call facility, or invoke particular methods for particular program element types using static binding.
Of course, you will need at least one switch statement where you determine which method to actually call based on what string your line starts with. It's either that or introducing a Map<String,Method> which gives you a runtime switch facility, where the map will map a string to a proper method you can call invoke (part of Java) on. I prefer to keep switch where there is not substantial amount of cases, and reserve Java Maps for more complicated run-time scenarios.
But since you talk about "fairly large amount of object types", may I suggest you introduce a runtime map and use the Map class indeed. It depends on how complicated your language is, and whether the string that starts your line is a keyword, or a string in a far larger set.
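A sketch of such a runtime map, using lambdas instead of reflective Method objects so that mistakes are caught at compile time. LineDispatcher, register, and dispatch are illustrative names, and tokens are simplified to plain strings rather than the Token objects from the question.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical dispatcher: the first token selects the creator,
// the remaining tokens are handed to it as arguments.
final class LineDispatcher {
    private final Map<String, Function<List<String>, Object>> creators = new HashMap<>();

    void register(String keyword, Function<List<String>, Object> creator) {
        creators.put(keyword, creator);
    }

    Object dispatch(List<String> tokens) {
        if (tokens.isEmpty()) {
            throw new IllegalArgumentException("empty line");
        }
        Function<List<String>, Object> creator = creators.get(tokens.get(0));
        if (creator == null) {
            throw new IllegalArgumentException("unknown keyword: " + tokens.get(0));
        }
        return creator.apply(tokens.subList(1, tokens.size()));
    }
}
```

Each object type registers once, for example register("cat", rest -> ...), and the argument-count and type checks move into the individual creators, so adding a type no longer touches a central switch.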
Assume a class (for instance URI) that is convertible to and from a String using the constructor and toString() method.
I have an ArrayList<URI> and I want to copy it to an ArrayList<String>, or the other way around.
Is there a utility function in the Java standard library that will do it? Something like:
java.util.collections.copy(urlArray,stringArray);
I know there are utility libraries that provide that function, but I don't want to add an unnecessary library.
I also know how to write such a function, but it's annoying to read code and find that someone has written functions that already exist in the standard library.
I know you don't want to add additional libraries, but for anyone who finds this from a search engine, in google-collections you might use:
List<String> strings = Lists.transform(uris, Functions.toStringFunction());
one way, and
List<URI> uris = Lists.transform(strings, new Function<String, URI>() {
    public URI apply(String from) {
        try {
            return new URI(from);
        } catch (URISyntaxException e) {
            // whatever you need to do here; rethrowing unchecked is one option
            throw new IllegalArgumentException(e);
        }
    }
});
the other.
No, there is no standard JDK shortcut to do this.
Have a look at commons-collections:
http://commons.apache.org/collections/
I believe that CollectionUtils has a transform method.
Since two types of collections may not be compatible, there is no built-in method for converting one typed collection to another typed collection.
Try:
public ArrayList<String> convert(ArrayList<URI> source) {
ArrayList<String> dest=new ArrayList<String>();
for(URI uri : source)
dest.add(uri.toString());
return dest;
}
Seriously, would a built-in API offer a lot to that?
Also, this is not very OO. The URI array should probably be wrapped in a class, and that class might have an .asStrings() method.
Furthermore you'll probably find that you don't even need (or even want) the String collection version if you write your URI class correctly. You may just want a getAsString(int index) method, or a getStringIterator() method on your URI class, then you can pass your URI class in to whatever method you were going to pass your string collection to.
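For readers on Java 8 or later (an assumption, since the answers above predate it), the standard library's streams come close to the one-liner asked for. UriLists is just an illustrative holder class.

```java
import java.net.URI;
import java.util.List;
import java.util.stream.Collectors;

final class UriLists {
    // URI -> String via toString().
    static List<String> toStrings(List<URI> uris) {
        return uris.stream().map(URI::toString).collect(Collectors.toList());
    }

    // String -> URI; URI.create throws IllegalArgumentException on malformed input.
    static List<URI> toUris(List<String> strings) {
        return strings.stream().map(URI::create).collect(Collectors.toList());
    }
}
```

URI.create is used instead of the constructor because the constructor's checked URISyntaxException does not fit inside a lambda without extra wrapping.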