Reading large sets of data in Java

I am using Java to read and process some datasets from the UCI Machine Learning Repository.
I started out by making a class for each dataset and working with that particular class. Every attribute in the dataset was represented by a corresponding data member of the appropriate type. This approach worked fine as long as the number of attributes was below 10-15. I just added or removed data members and changed their types to model new datasets, and made the required changes to the functions.
The problem:
I have to work with much larger datasets now. Ones with more than 20-30 attributes are very tedious to work with in this manner. I don't need to query the data. My data discretization algorithm just needs 4 scans of the data to discretize it, and my work ends right after the discretization. What would be an effective strategy here?
I hope I have been able to state my problem clearly.

Some options:
Write a code generator to read the meta-data of the file and generate the equivalent class file.
Don't bother with classes; keep the data in arrays of Object or String and cast them as needed.
Create a class that contains a collection of DataElements and subclass DataElements for all the types you need and use the meta-data to create the right class at runtime.

Create a simple DataSet class that contains a member like the following:
public class DataSet {
    private List<Column> columns = new ArrayList<Column>();
    private List<Row> rows = new ArrayList<Row>();

    public void parse( File file ) {
        // routines to read CSV data into this class
    }
}

public class Row {
    private Object[] data;

    // 'line' is one raw CSV line; the converted values end up in 'data'
    public void parse( String line, List<Column> columns ) {
        String[] fields = line.split(",");
        data = new Object[fields.length];
        int i = 0;
        for( Column column : columns ) {
            data[i] = column.convert(fields[i]);
            i++;
        }
    }
}

public class Column {
    private String name;
    private int index;
    private DataType type;

    public Object convert( String data ) {
        if( type == DataType.NUMERIC ) {
            return Double.parseDouble( data );
        } else {
            return data;
        }
    }
}

public enum DataType {
    NUMERIC, CATEGORICAL
}
That will handle any data set you wish to use. The only issue is that the user must describe the dataset by defining the columns and their respective data types for the DataSet. You can do that in code or read it from a file, whichever you think is easier. You might be able to default much of the configuration (say, to CATEGORICAL), or attempt to parse each field as a number: if parsing fails the column must be CATEGORICAL, otherwise it is NUMERIC. Normally the file contains a header you can parse to find the column names; then you just need to figure out the data type by looking at the data in that column. A simple algorithm to guess the data type goes a long way. Essentially this is the same data structure that other packages use for data like this (e.g. R, Weka).
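The type-guessing idea can be sketched roughly as follows. This is only an illustration, not code from any package: the class name TypeGuesser and the method guessType are made up, the nested DataType enum simply mirrors the NUMERIC/CATEGORICAL split described above, and treating empty strings and "?" as missing values is an assumption about the input files.

```java
import java.util.List;

// Hypothetical helper, not part of any library.
class TypeGuesser {
    enum DataType { NUMERIC, CATEGORICAL }

    // Guess a column's type from sample values: NUMERIC only if every
    // non-missing value parses as a double, otherwise CATEGORICAL.
    static DataType guessType(List<String> values) {
        boolean sawValue = false;
        for (String raw : values) {
            String v = raw.trim();
            if (v.isEmpty() || v.equals("?")) {
                continue; // assumed missing-value markers; skip them
            }
            sawValue = true;
            try {
                Double.parseDouble(v);
            } catch (NumberFormatException e) {
                return DataType.CATEGORICAL;
            }
        }
        // A column with no usable values defaults to CATEGORICAL.
        return sawValue ? DataType.NUMERIC : DataType.CATEGORICAL;
    }
}
```

You would call guessType once per column on the first few hundred rows and store the result in the Column. Integer-coded categories will be reported as NUMERIC, so an override in the column definitions is still useful.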

I did something like that in one of my projects; lots of variable data, and in my case I obtained the data from the Internet. Since I needed to query, sort, etc., I spent some time designing a database to accommodate all the variations of the data (not all entries had the same number of properties). It did take a while but in the end I used the same code to get the data for any entry (using JPA in my case). My IDE (NetBeans) generated most of the code directly from the database schema.
From your question, it is not clear how you plan to use the data, so I'm answering based on personal experience.


Finding min/max values in a KafkaStream (KStream) object

I have a Kafka Streams application with Avro schemas for each of the topics and also for the key. The key schema is the same for all topics.
Now, there is a KStream object with the known key object as the key and a value object (derived from the Avro schema) which extends org.apache.avro.specific.SpecificRecordBase, but it could be any of my Avro schemas for the topic content.
KStream<CustomKey, ? extends SpecificRecordBase> myStream = ...
What I want to achieve is to run min and max functions on this stream. The problem is that I don't know what the ? type is, and as there are 30+ topics (and more to come), I don't want to write a switch-case. So I have the following:
public KStream<CustomKey, ? extends SpecificRecordBase> max(
        final KStream<CustomKey, ? extends SpecificRecordBase> myStream,
        final String attributeName) {
    SpecificRecordBase maxValue = ...;
    myStream.foreach((key, value) -> {
        value.get(attributeName) // I want to find the max value for this attribute,
                                 // but at this point we don't know its type
                                 // and we can't assign maxValue = value, because this is a lambda
                                 // function.
    });
    // find and return the max value
}
My question is, how can I calculate the max value for the myStream on the attributeName attribute?
"it could be any of my avro schemas for the topic content"
Then your value classes need to extend something like ClassWithMinMaxFields. Otherwise, you will be unable to extract those fields from a generic SpecificRecordBase object.
Also, your method returns a stream. You cannot return the min/max. If that is your objective, you need a plain consumer to scan the whole topic, beginning to (eventual) end.
To do this (correctly) with the Streams API, you would either:
build a KTable for every value, grouped by key, then do a table scan for the min/max as you need them; or
create a new topic using the aggregate DSL function, initialized with {"min": +Inf, "max": -Inf}. On each new record you compare the old and new values; if there is a new min and/or max, you set it and return the updated record. You still need an external consumer to fetch the most recent min/max events.
If you had a consistent Avro type, you could use ksqlDB functions.
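The update rule behind the aggregate option can be sketched in plain Java, without any Kafka dependency. MinMax and its methods are hypothetical names; in a real topology this logic would sit inside the aggregator you pass to the aggregate function, and the sketch assumes the attribute has already been extracted from the record as a double.

```java
// Hypothetical accumulator; plain Java, no Kafka classes involved.
final class MinMax {
    final double min;
    final double max;

    MinMax(double min, double max) {
        this.min = min;
        this.max = max;
    }

    // Initial state {min: +Inf, max: -Inf}: the first real value replaces both bounds.
    static MinMax initial() {
        return new MinMax(Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY);
    }

    // Fold one new value into the running min/max and return the new state.
    MinMax update(double value) {
        return new MinMax(Math.min(min, value), Math.max(max, value));
    }
}
```

Keeping the state immutable means each update yields a fresh record, which matches the "return the new record" step described above.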

Parse a single POJO from multiple YAML documents representing different classes

I want to use a single YAML file which contains several different objects - for different applications. I need to fetch one object to get an instance of MyClass1, ignoring the rest of docs for MyClass2, MyClass3, etc. Some sort of selective de-serializing: now this class, then that one... The structure of MyClass2, MyClass3 is totally unknown to the application working with MyClass1. The file is always a valid YAML, of course.
The YAML may be of any structure we need to implement such a multi-class container. The preferred parsing tool is snakeyaml.
Is it sensible? How can I ignore all but one object?
UPD: replaced all "document" with "object". I think we have to speak about a single YAML document containing several objects of different structure. Moreover, the parser knows exactly one structure and wants to ignore the rest.
UPD2: I think it is impossible with snakeyaml. We have to read all objects anyway and select the needed one later. But maybe I'm wrong.
UPD2: sample config file
attachmentFieldName: "name"
baseSftpInboxPath: /home/user/somedir/
somebool: false
days: 9999
- ABC w/o quotes
- "Cat ABC"
- "Some string"
dateFormat: yyyy-MMdd-HHmm
user: someuser
k1: v1
- v21
- v22
This is definitely possible with SnakeYAML, albeit not trivial. Here's a general rundown of what you need to do:
First, let's have a look at what loading with SnakeYAML does. Here's the important part of the Yaml class:
private Object loadFromReader(StreamReader sreader, Class<?> type) {
    Composer composer = new Composer(new ParserImpl(sreader), resolver, loadingConfig);
    constructor.setComposer(composer);
    return constructor.getSingleData(type);
}
The composer parses YAML input into Nodes. To do that, it doesn't need any knowledge about the structure of your classes, since every node is either a ScalarNode, a SequenceNode or a MappingNode and they just represent the YAML structure.
The constructor takes a root node generated by the composer and generates native POJOs from it. So what you want to do is to throw away parts of the node graph before they reach the constructor.
The easiest way to do that is probably to derive from Composer and override two methods like this:
public class MyComposer extends Composer {
    private final int objIndex;

    public MyComposer(Parser parser, Resolver resolver, int objIndex) {
        super(parser, resolver);
        this.objIndex = objIndex;
    }

    public MyComposer(Parser parser, Resolver resolver, LoaderOptions loadingConfig, int objIndex) {
        super(parser, resolver, loadingConfig);
        this.objIndex = objIndex;
    }

    @Override
    public Node getNode() {
        return strip(super.getNode());
    }

    private Node strip(Node input) {
        return ((SequenceNode)input).getValue().get(objIndex);
    }
}
The strip implementation is just an example. In this case, I assumed your YAML looks like this (object content is arbitrary):
- {first: obj}
- {second: obj}
- {third: obj}
And you simply select the object you actually want to deserialize by its index in the sequence. But you can also have something more complex like a searching algorithm.
Now that you have your own composer, you can do
Constructor constructor = new Constructor();
// assuming we want to get the object at index 1 (i.e. second object)
Composer composer = new MyComposer(new ParserImpl(sreader), new Resolver(), 1);
constructor.setComposer(composer);
MyObject result = (MyObject)constructor.getSingleData(MyObject.class);
The answer by @flyx was very helpful to me: it showed how to work around the limitations of the library (snakeyaml, in our case) by overriding some methods. Thanks a lot! It may well contain a complete solution, but I have not got there yet. Besides, the simple solution below is robust and worth considering even if we do find a complete solution that reaches into the library.
I've decided to solve the task by double distilling, sorry, double processing of the configuration file. Imagine the file consisting of several parts, each marked by a unique token-delimiter. To keep it YAML-like, it may look like this:
---
#this is a unique key for the configuration A
<some YAML document>
---
#this is another key for the configuration B
<some YAML document>
The first pass is pre-processing. Given String fileString and String key (and DELIMITER = "\n---\n", for example), we select the substring holding the configuration identified by the key:
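int begIndex;
do {
    begIndex = fileString.indexOf(DELIMITER);
    if (begIndex == -1) {
        // no section with this key; the original handling is not shown, throwing is an assumption
        throw new IllegalArgumentException("No configuration found for key: " + key);
    }
    if (fileString.startsWith(DELIMITER + key, begIndex)) {
        // keep everything after the delimiter and the key
        fileString = fileString.substring(begIndex + DELIMITER.length() + key.length());
        break;
    }
    // spoil alien delimiter and repeat search
    fileString = fileString.replaceFirst(DELIMITER, " ");
} while (true);
// cut off everything from the next delimiter onwards
int endIndex = fileString.indexOf(DELIMITER);
if (endIndex != -1) {
    fileString = fileString.substring(0, endIndex);
}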
Now we feed fileString to ordinary YAML parsing:
ExportConfiguration configuration = new Yaml(new Constructor(ExportConfiguration.class))
.loadAs(fileString, ExportConfiguration.class);
This time we have a single document that must correspond to the ExportConfiguration class.
Note 1: The structure, and even the content, of the rest of the configuration file plays absolutely no role. This was the main idea: to keep independent configurations in a single file.
Note 2: The rest of the configurations may be JSON or XML or whatever. We have a preprocessor method that returns a String configuration, and the next processor parses it properly.
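For reference, the pre-processing pass can also be written as a small standalone method. This is a simplified sketch rather than the exact loop from above: it searches for the delimiter and the key together instead of spoiling foreign delimiters one by one, and the names ConfigSplitter and extract are made up for the example.

```java
// Hypothetical helper; the names are illustrative only.
class ConfigSplitter {
    static final String DELIMITER = "\n---\n";

    // Returns the text between DELIMITER + key and the next DELIMITER
    // (or the end of the input). Throws if no section starts with the key.
    static String extract(String fileString, String key) {
        String marker = DELIMITER + key;
        int begin = fileString.indexOf(marker);
        if (begin == -1) {
            throw new IllegalArgumentException("No section for key: " + key);
        }
        int bodyStart = begin + marker.length();
        int end = fileString.indexOf(DELIMITER, bodyStart);
        return end == -1
                ? fileString.substring(bodyStart)
                : fileString.substring(bodyStart, end);
    }
}
```

The returned String is what gets handed to the YAML parser (or to any other parser, per Note 2). Like the loop above, it expects a newline before the first "---", so a file that starts directly with "---" needs a leading newline.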

How to get private vendor attribute tag in C_FIND from pixelmed?

I'm trying to read a private vendor tag from a DICOM server.
The only tags I'm able to read successfully are the standard DICOM tagFromNames.
The tag is (2001,100b), and the files in my example set definitely have that entry in their headers.
Here is the code for calling the C-FIND request:
SpecificCharacterSet specificCharacterSet = new SpecificCharacterSet((String[])null);
AttributeList identifier = new AttributeList();
//specify attributes to retrieve and pass in any search criteria
//query root of "study" to retrieve studies
AttributeTag at = new com.pixelmed.dicom.AttributeTag("0x2001,0x100b");
IdentifierHandler ih = new IdentifierHandler(){
    public void doSomethingWithIdentifier(AttributeList id) throws DicomException {
        studies.add(new Study(id, reportfolder));
        //Attempt to read private dicom tag from received identifier
    }
};
new FindSOPClassSCU(serv.getAddress(),serv.getPort(), serv.getAetitle(), "ISPReporter",SOPClass.StudyRootQueryRetrieveInformationModelFind,identifier,ih);
However, the query returns 7 identifiers that match the date, but when I try to read the (2001,100b) tag from them, I get this error:
DicomException: No such data element as (0x2001,0x100b) in dictionary
If I use this line instead:
identifier.put(new com.pixelmed.dicom.TextAttribute(at) {
    public int getMaximumLengthOfEntireValue() { return 20; }
});
Then I get:
(null for each identifier returned)
Two things (second one moot because this won't work anyway because of the first):
C-FIND SCPs query against a database of data elements that were previously extracted from the DICOM image header and indexed. Only a (small) subset of the data elements present in the images is actually indexed: the standard requires very few in the Query Information Models, and the IHE Scheduled Workflow (SWF) profile requires a few more (Query Images Transaction, Table 4.14-1). Implementers could index every data element (or at least every standard data element), but this is rarely done. PixelMed doesn't, though I have considered doing it adaptively as data elements are encountered, now that hsqldb supports adding columns; NoSQL-based implementations might find this easier.
When you encode a private data element, whether it be in a query identifier/response, or in an image header, you need to include its creator; i.e., for (2001,100b), you need to include (2001,0010); otherwise the creator of the private data element is not specified.
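The link between a private data element and its creator is plain arithmetic on the element number: for a private element (gggg,xxyy), the creator that reserves block xx sits at (gggg,00xx). A small sketch in plain Java, independent of PixelMed (PrivateTags and its methods are hypothetical names, not toolkit API):

```java
// Hypothetical helper; plain Java, independent of any DICOM toolkit.
class PrivateTags {
    // For a private data element (gggg,xxyy), the private creator that
    // reserves block xx is stored at (gggg,00xx).
    static int creatorElement(int element) {
        return (element >> 8) & 0xFF;
    }

    // Formats a tag as (gggg,eeee) in lowercase hex.
    static String format(int group, int element) {
        return String.format("(%04x,%04x)", group, element);
    }
}
```

For (2001,100b), creatorElement(0x100b) gives 0x0010, i.e. the creator element (2001,0010) mentioned above.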

How to create PCollection<Row> from PCollection<String> for performing Beam SQL Transforms

I am trying to implement a data pipeline which joins multiple unbounded sources from Kafka topics. I am able to connect to a topic and get the data as PCollection<String>, and I need to convert it into PCollection<Row>. I am splitting the comma-delimited string into an array and using a schema to convert it to a Row. But how can I build the schema and bind values to it dynamically?
Even if I create a separate class for schema building, is there a way to bind the string array directly to the schema?
Below is my current working code, which is static and needs to be rewritten every time I build a pipeline; it also grows with the number of fields.
final Schema sch1 =
PCollection<KafkaRecord<Long, String>> kafkaDataIn1 = pipeline
KafkaIO.<Long, String>read()
ImmutableMap.of("", (Object)"test1")));
PCollection<Row> Input1 = kafkaDataIn1.apply(
ParDo.of(new DoFn<KafkaRecord<Long, String>, Row>() {
public void processElement(
ProcessContext processContext,
final OutputReceiver<Row> emitter) {
KafkaRecord<Long, String> record = processContext.element();
final String input = record.getKV().getValue();
final String[] parts = input.split(",");
My expectation is to achieve the same thing as the code above in a dynamic, reusable way.
The schema is set on a PCollection, so it is not dynamic. If you want to build it lazily, you need to use a format/coder that supports that; Java serialization or JSON are examples.
That said, to benefit from the SQL features you can also use a static schema that holds the fields you query plus a field for everything else. That way the static part lets you run your SQL and you don't lose the additional data.

How to use an array value as field in Java? a1.section[2] = 1;

New to Java, and can't figure out what I hope to be a simple thing.
I keep "sections" in an array:
public static final String[] TOP = {
"Top News",
I'd like to do something like this:
Article a1 = new Article();
a1.["s_" + section[2]] = 1; //should resolve to a1.s_top = 1;
But it won't let me, as it doesn't know what "section" is. (I'm sure seasoned Java people will cringe at this attempt... but my searches have come up empty on how to do this)
My article mysqlite table has fields for the "section" of the article:
When doing my import from an XML file, I'd like to set that field to 1 if the article is in that category. I could have a switch statement:
//whatever the Java version of this is
switch(section[2]) {
    case "top": a1.s_top = 1; break;
    case "sports": a1.s_sports = 1; break;
}
But I thought it'd be a lot easier to just write it as a single line:
a1["s_"+section[2]] = 1;
In Java, it's a pain to do what you want to do in the way that you're trying to do it.
If you don't want to use the switch/case statement, you could use reflection to pull up the member attribute you're trying to set:
// getField only finds public fields and throws NoSuchFieldException for an unknown name
Class<?> articleClass = a1.getClass();
Field field = articleClass.getField("s_top");
field.set(a1, 1);
It'll work, but it may be slow and it's an atypical approach to this problem.
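To make the reflection route concrete, here is a minimal self-contained sketch. ReflectArticle is a hypothetical stand-in for the Article class from the question, and it assumes the s_ fields are public ints; with private fields you would need getDeclaredField and setAccessible instead.

```java
import java.lang.reflect.Field;

// Hypothetical stand-in for the Article class from the question.
class ReflectArticle {
    public int s_top;
    public int s_sports;

    // Sets the public int field named "s_" + section to 1.
    static void flag(ReflectArticle article, String section)
            throws NoSuchFieldException, IllegalAccessException {
        Field field = ReflectArticle.class.getField("s_" + section);
        field.setInt(article, 1);
    }
}
```

A misspelled section name only fails at runtime with NoSuchFieldException, which is one reason this approach is usually avoided.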
Alternatively, you could store either a Set<String> or a Map<String,Boolean> inside your Article class, and have a public method in Article called putSection(String section). As you iterate, you put the section strings (or string/value mappings) into that collection for each Article. So, instead of statically defining which sections may exist and giving each Article a yes or no, you allow the list of possible sections to be dynamic and based on your XML import.
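A minimal sketch of the collection-based variant, using a Set<String>. SectionedArticle and inSection are made-up names for the example; putSection is the method suggested above.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical variant of Article that stores its sections as data.
class SectionedArticle {
    private final Set<String> sections = new HashSet<>();

    // Records that this article belongs to the given section.
    public void putSection(String section) {
        sections.add(section);
    }

    // True if putSection was called earlier with this section name.
    public boolean inSection(String section) {
        return sections.contains(section);
    }
}
```

During the import you would call a1.putSection(section[2]) and later ask a1.inSection("top"), so no field name ever has to be built from a string.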
Java variables are not "dynamic", unlike ActionScript for example. You cannot read or assign a variable without knowing it at compile time (well, with reflection you could, but it's far too complex).
So yes, the solution is to use a switch-case (switching on strings requires Java 7 or later), or a HashMap or equivalent.
Or, if it's about importing XML, maybe you should take a look at JAXB.
If you are trying to get an attribute from an object, you need to make sure that you have "getters" and "setters" in your object. You also have to make sure you define section in your Article class.
Something like:
class Article {
    String section;

    public Article() {
    }

    //set section
    public void setSection(String section) {
        this.section = section;
    }

    //get section
    public String getSection() {
        return this.section;
    }
}

