I used to use R for data processing, and my new project requires Java, so I apologise if I am asking naive questions. I want to achieve something like the filter verb in R's dplyr. I now have two CSV files:
One contains person attributes, where the first element in each row is a unique personID:
1,4CC0D97F9ECC6B1A,MUTSAARD,7,m,7-8,0,0,ACT
2,F6B73020FC552E32,PORTE TERVUEREN,3,m,3-4,3,0,EMP
4,4072878C4683C96F,ALTITUDE 100,4,f,1-2,5,1,EMP
The other CSV contains person activities, where the first element in each row is also the personID:
1,0,0,0,9,34,home,34200,150101.5,176176
1,1,10,34,13,34,leisure,48600,249319.227415549,64034.2890971927
1,2,14,34,14,35,home,600,249319.227415549,64034.2890971927
1,3,15,49,16,19,shopping,58800,281683.200856897,118126.130836235
1,4,18,4,25,0,home,90000,281683.200856897,118126.130836235
2,0,0,0,15,38,home,56400,152056.679999997,170502.339842428
2,1,15,48,24,1,work,86400,153720.999999996,167515.000842442
2,2,24,18,25,0,home,90000,156685.535763012,169194.702448164
4,0,0,0,9,58,home,36000,147758.618000003,167097.459842441
4,1,10,29,14,58,work,54000,147251.000000004,174872.000842412
4,2,15,28,16,28,shopping,59400,144431.419000006,166735.039842444
4,3,16,38,18,38,leisure,67200,146053.041238428,169647.999589575
4,4,18,58,25,0,home,90000,149907.09447342,170229.096090939
What I want to do is first loop over the person attributes and do some processing there; afterwards, I would like to filter the rows with the same personID in the activities CSV and loop over those rows to do some processing on them.
So the code I have for now is:
BufferedReader attributeReader = new BufferedReader(new FileReader(attributesFile));
String agent = null;
while ((agent = attributeReader.readLine()) != null) {
    String[] attributeSplit = agent.split(",");
    int attributeAgentID = Integer.parseInt(attributeSplit[0]);
    // Set attributes for agents
    Person person = populationFactory.createPerson(Id.createPersonId(attributeAgentID));
    // Question: what should I do here to find the activities with the same personID?
    population.addPerson(person);
}
My question is marked in the code: I am stuck there and unsure what I should do to find the activities with the same personID.
From your code, I assume you can read the CSV file and convert each line to a data object without problems. So you will have two classes, for example Person and PersonActivity. If memory is not a problem for now, I suggest the simple approach of reading the person activities and converting them to a Map:
// key is attributeAgentID, value is List<PersonActivity>,
// i.e. each person ID maps to that person's list of activities
Map<Integer, List<PersonActivity>> getPersonActivityFromCsv(String file);
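A minimal sketch of how that method could look (the PersonActivity constructor taking the split fields is hypothetical, and you need the usual java.io and java.util imports):

Map<Integer, List<PersonActivity>> getPersonActivityFromCsv(String file) throws IOException {
    Map<Integer, List<PersonActivity>> result = new HashMap<>();
    try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = line.split(",");
            int personId = Integer.parseInt(fields[0]);
            // computeIfAbsent creates the list the first time an ID is seen
            result.computeIfAbsent(personId, id -> new ArrayList<>()).add(new PersonActivity(fields));
        }
    }
    return result;
}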
Then you only need to change a little code when processing the list of persons:
Map<Integer,List<PersonActivity>> map = getPersonActivityFromCsv("activities.csv");
...
while (...) {
    int attributeAgentID = Integer.parseInt(attributeSplit[0]);
    // Set attributes for agents
    Person person = populationFactory.createPerson(Id.createPersonId(attributeAgentID));
    List<PersonActivity> activities = map.get(attributeAgentID);
    // then do something with the list of activities here
}
I have this piece of data (this is just one part of one line of the whole file):
000000055555444444******4444 YY
I implemented this CSV config file to be able to read each part of the data and parse it:
128-12,140-22,YY
The first pair (128-12) represents the position in the line at which to start reading and the number of characters to read; that first pair is for the account number. The second pair is for the card number, and the third parameter is for the registry name.
Anyway, what I do is String.split(","), and then assign array[0] as the account number, and so on.
But I want to change that CSV config file to a Properties file, and I'm not sure how to implement that solution. If I use a Properties file, I'd have to add a bunch of if/then statements in order to properly map my values. Here's what I'm thinking of doing:
Properties cfg = new Properties();
cfg.setProperty("fieldName", "accountNumber");
cfg.setProperty("startPosition", "128");
cfg.setProperty("length", "12");
But then I'd have to write if ("fieldName".equals("accountNumber")) in order to assign accountNumber. Is there a way to implement this so that I can avoid all these decisions? Right now, with my solution, I don't have to use ifs; I only say accountNumber = array[0] and that's it. I don't think that's a good solution, though, and I think that using Properties would be more elegant and efficient.
EDIT:
This probably needs some more clarification. The data I'm showing is part of a parsing program that I'm currently writing for a client; the data holds information for many of their customers, and I have to parse a huge mess of data that I receive from them into something more readable, in order to convert it to a PDF file. So far the program is in production, but I'm trying to refactor it a little bit. All the customer's information is saved into different Registry classes, each class having its own set of fields with unique information. Let's say this is what RegistryYY would look like:
class RegistryYY extends Registry {
    String name;
    String lastName;
    PhysicalAddress address;

    public RegistryYY(String dataFromFile) {
    }
}
I want to implement the Properties solution because, that way, I could make the Properties for parsing the file (i.e. for interpreting the data correctly) be owned by each Registry class. I mean, a Registry should know what data it needs from the file, right? If I do it that way, I could make each Registry an Observer: each one would decide whether the current line read from the file belongs to it by checking the registry name stored in that line, and then it would return an initialized Registry to the calling object, which only cares about receiving and storing a Registry.
EDIT 2:
I created this function to return the value stored at a given position in each line:
public static String getField(String fieldParams, String rawData) {
    // split the field definition
    String[] fields = fieldParams.split("-");
    int fieldStart = Integer.parseInt(fields[0]); // initial position of the field
    int fieldLen = Integer.parseInt(fields[1]);   // length of the field
    // get the field value
    String fieldValue = FieldParser.getStringValue(rawData, fieldStart, fieldLen);
    return fieldValue;
}
Which works with the CSV file, I'd like to change the implementation to work with the Property file instead.
Is there any reason why you need to have the record layout exposed to the outside world? Does it need to be configurable?
I think your proposed approach of using a Properties file is better than your current approach of using the CSV file, since it is more descriptive and meaningful. I would also add a "type" attribute to your property definitions to drive the conversion, i.e. Numeric/String/Date/Boolean.
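For illustration, the entries might look something like this (the key layout is just an assumption):

# hypothetical layout: position, length and type per field
accountNumber.startPosition = 128
accountNumber.length = 12
accountNumber.type = Numeric
cardNumber.startPosition = 140
cardNumber.length = 22
cardNumber.type = String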
I wouldn't use an "if" statement to process your property file. You can load all the properties into an array at the beginning and then, for each line of your data file, iterate over that array and process each section accordingly, something like the pseudocode below:
for each line of data-file {
    SomeClass myClass = myClassBuilder(data-file-line)
}

SomeClass myClassBuilder(String data-file-line) {
    Map<column, value> result = new HashMap<>()
    for each attribute of property-file-list {
        switch attribute_type {
            Integer:
                result.put(fieldname, makeInteger(data-file-line, property_attribute))
            Date:
                result.put(fieldname, makeDate(data-file-line, property_attribute))
            Boolean:
                result.put(fieldname, makeBoolean(data-file-line, property_attribute))
            String:
                result.put(fieldname, makeString(data-file-line, property_attribute))
            // ... etc.
        }
    }
    return new SomeClass(result)
}
If your record layout doesn't need to be configurable, then you could do all the conversion inside your Java application and not even use a Properties file.
If you could get your data in XML format, then you could use the JAXB framework and simply keep your data definition in an XML file.
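A minimal JAXB sketch of such a data definition; the class name, field names and XML element name here are assumptions, not an existing API:

import java.io.File;
import javax.xml.bind.JAXB;
import javax.xml.bind.annotation.XmlRootElement;

// one fixed-width field definition, unmarshalled from a layout XML file
@XmlRootElement(name = "field")
public class FieldDef {
    public String fieldName;
    public int startPosition;
    public int length;
}

// usage: FieldDef def = JAXB.unmarshal(new File("layout.xml"), FieldDef.class);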
First of all, thanks to the guys who helped me: @robbie70, @RC. and @VinceEmigh.
I used YAML to parse a file called "test.yml" with the following information in it:
statement:
  - fieldName: accountNumber
    startPosition: 128
    length: 12
  - fieldName: cardNumber
    startPosition: 140
    length: 22
  - fieldName: registryName
    startPosition: 162
    length: 2
This is what I made:
// Start of main
String fileValue = "0222000000002222F 00000000000111110001000000099999444444******4444 YY";
YamlReader reader = new YamlReader(new FileReader("test.yml"));
Object object = reader.read();
System.out.println(object);
Map map = (Map) object;
List list = (List) map.get("statement");
for (int i = 0; i < list.size(); i++) {
    Map map2 = (Map) list.get(i);
    System.out.println("Value: " + foo(map2, fileValue));
}
}
// End of main
public static String foo(Map map, String source) {
    int startPos = Integer.parseInt((String) map.get("startPosition"));
    int length = Integer.parseInt((String) map.get("length"));
    return getField(startPos, length, source);
}

public static String getField(int start, int length, String source) {
    return source.substring(start, start + length);
}
It correctly displays the output:
Value: 000000099999
Value: 444444******4444
Value: YY
I know that the config file may contain some lists and other unnecessary values and whatnot, and that the program needs a little improvement, but I think I can take it from here and implement what I had in mind.
EDIT:
I made another version using Apache Commons Configuration; this is what I have in the configuration properties file:
#properties defining the statement file
#properties for account number
statement.accountNumber.startPosition = 128
statement.accountNumber.length = 12
statement.account.rules = ${statement.accountNumber.startPosition} ${statement.accountNumber.length}
#properties for card number
statement.cardNumber.startPosition = 140
statement.cardNumber.length = 22
statement.card.rules = ${statement.cardNumber.startPosition} ${statement.cardNumber.length}
#properties for registry name
statement.registryName.startPosition = 162
statement.registryName.length = 2
statement.registry.rules = ${statement.registryName.startPosition} ${statement.registryName.length}
And this is how I read it:
// Inside main
String valorLeido = "0713000000007451D 00000000000111110001000000099999444444******4444 YY";
Parameters params = new Parameters();
FileBasedConfigurationBuilder<FileBasedConfiguration> builder =
        new FileBasedConfigurationBuilder<FileBasedConfiguration>(PropertiesConfiguration.class)
                .configure(params.properties().setFileName("config.properties"));
try {
    Configuration config = builder.getConfiguration();
    Iterator<String> keys = config.getKeys();
    String account = getValue(getRules(config, "statement.account.rules"), valorLeido);
    String cardNumber = getValue(getRules(config, "statement.card.rules"), valorLeido);
    String registryName = getValue(getRules(config, "statement.registry.rules"), valorLeido);
} catch (org.apache.commons.configuration2.ex.ConfigurationException e) {
    e.printStackTrace();
}
// End of main
// End of Main
public static String getRules(Configuration config, String rules) {
    return config.getString(rules);
}

public static String getValue(String rules, String source) {
    String[] tokens = rules.split(" ");
    int startPos = Integer.parseInt(tokens[0]);
    int length = Integer.parseInt(tokens[1]);
    return getField(startPos, length, source);
}
I'm not entirely sure yet. The YAML file looks simpler, but I really like the control I get with Apache Commons Configuration, since I can pass the Configuration object around to each registry, and each registry knows which "rules" it wants to read; say the Registry class only cares about "statement.registry.rules", and that's it. With the YAML option I'm not entirely sure how to do that yet; maybe I'll need to experiment with both options a little more, but I like where this is going. A rough sketch of the idea is shown below.
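This sketch reuses the getRules/getValue helpers above; the constructor signature is just an assumption:

class RegistryYY extends Registry {
    private final String registryName;

    public RegistryYY(Configuration config, String line) {
        // the registry asks only for the rules it cares about
        this.registryName = getValue(getRules(config, "statement.registry.rules"), line);
        // ... extract the remaining fields the same way
    }
}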
PS:
That weird value I used in fileValue is what I'm dealing with; now add nearly 1,000 characters to the length of the line and you'll understand why I want a config file for parsing it (don't ask why... clients be crazy).
I have an ArrayList containing fields like FirstName, LastName, Employee ID, Employee Job, etc. When the list is displayed, if an employee has the same first and last name as another, his name (first + last name) should be appended with the employee ID and job, separated by hyphens.
I am able to detect the duplicates using a HashSet via the add operation of Set, but the ID is appended to only one of the duplicates. I should be able to differentiate employees with the same names based on the appended ID and job.
Thanks :-)
Here is my code:
List<AgentInfoVO> agentQueuesInfo = new ArrayList<AgentInfoVO>();
List<AgentInfo> agentQueues = null;
Set<AgentInfo> uniqueSet = null;
StringBuffer lastName = null;
if (agentQueuesInfo != null) {
    agentQueues = new ArrayList<AgentInfo>();
    uniqueSet = new HashSet<AgentInfo>();
    for (AgentInfoVO agentInfoVO : agentQueuesInfo) {
        AgentInfo agentInfo = new AgentInfo();
        agentInfo.setFirstName(agentInfoVO.getFirstName());
        agentInfo.setLastName(agentInfoVO.getLastName());
        // check if duplicate names exist and append ID and job to the duplicate
        if (!uniqueSet.add(agentInfo)) {
            lastName = new StringBuffer(agentInfoVO.getLastName());
            if (agentInfoVO.getAgentEmpID() != null) {
                lastName.append("-" + agentInfoVO.getAgentEmpID());
            }
            if (agentInfoVO.getEmpJob() != null) {
                lastName.append("-" + agentInfoVO.getEmpJob());
            }
            agentInfo.setLastName(lastName.toString());
        }
        agentInfo.setAgentEmpID(agentInfoVO.getAgentEmpID());
        agentInfo.setEmpJob(agentInfoVO.getEmpJob());
        agentQueues.add(agentInfo);
    }
}
Sorry, I don't have a computer close to me, so I can't give you code, but I can share my ideas.
Detecting duplicates with the Set add method will only see the second entry as a duplicate; that's why you miss the first one.
If you are able to modify AgentInfoVO, and firstName and lastName together uniquely identify an AgentInfoVO, a better way is to override the hashCode and equals methods using both fields (see the sketch after the example below), and then use Collections.frequency to detect more than one occurrence and do your appends:
https://docs.oracle.com/javase/6/docs/api/java/util/Collections.html#frequency(java.util.Collection, java.lang.Object).
something like:
Set<AgentInfoVO> duplicateAgentInfoVOs = new HashSet<AgentInfoVO>();
for (AgentInfoVO agentInfoVO : agentQueuesInfo) {
    if (Collections.frequency(agentQueuesInfo, agentInfoVO) > 1) {
        duplicateAgentInfoVOs.add(agentInfoVO);
    }
}
for (AgentInfoVO agentInfoVO : agentQueuesInfo) {
    if (duplicateAgentInfoVOs.contains(agentInfoVO)) {
        // you can do your appends to agentInfoVO here
    }
}
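For the override mentioned above, a minimal sketch, assuming getFirstName and getLastName return non-null Strings:

@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof AgentInfoVO)) return false;
    AgentInfoVO other = (AgentInfoVO) o;
    // two entries are equal when both name parts match
    return getFirstName().equals(other.getFirstName())
            && getLastName().equals(other.getLastName());
}

@Override
public int hashCode() {
    return java.util.Objects.hash(getFirstName(), getLastName());
}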
I am using JHDF5 to log a collection of values to an HDF5 file. I am currently using two ArrayLists to do this: one with the values and one with the names of the values.
ArrayList<String> valueList = new ArrayList<String>();
ArrayList<String> nameList = new ArrayList<String>();
valueList.add("Value1");
valueList.add("Value2");
nameList.add("Name1");
nameList.add("Name2");
IHDF5Writer writer = HDF5Factory.configure("My_Log").keepDataSetsIfTheyExist().writer();
HDF5CompoundType<List<?>> type = writer.compound().getInferredType("", nameList, valueList);
writer.compound().write("log1", type, valueList);
writer.close();
This logs the values correctly to the file My_Log, in the dataset "log1". However, this example always overwrites the previous log of the values in dataset "log1". I want to be able to log to the same dataset every time, adding the latest log at the next line/index of the dataset. For example, if I were to change the value of "Name2" to "Value3" and log the values, and then change "Name1" to "Value4" and "Name2" to "Value5" and log the values, the dataset should look like this:
Value1 Value2
Value1 Value3
Value4 Value5
I thought the keepDataSetsIfTheyExist() option would prevent the dataset from being overwritten, but apparently it doesn't work that way.
Something similar to what I want can be achieved in some cases with writer.compound().writeArrayBlock(), specifying the index at which the array block shall be written. However, that solution doesn't seem compatible with my current code, where I have to use lists to handle my data.
Is there some option to achieve this that I have overlooked, or can't this be done with JHDF5?
I don't think that will work. It is not quite clear to me, but I believe the getInferredType() you are using creates a data set with two name-to-value entries, so it is effectively creating a single object inside the HDF5 file. The best solution I could come up with was to read the previous values and add them to the valueList before writing:
ArrayList<String> valueList = new ArrayList<>();
valueList.add("Value1");
valueList.add("Value2");

try (IHDF5Reader reader = HDF5Factory.configure("My_Log.h5").reader()) {
    String[] previous = reader.string().readArray("log1");
    for (int i = 0; i < previous.length; i++) {
        valueList.add(i, previous[i]);
    }
} catch (HDF5FileNotFoundException ex) {
    // Nothing to do here.
}

MDArray<String> values = new MDArray<>(String.class, new long[]{valueList.size()});
for (int i = 0; i < valueList.size(); i++) {
    values.set(valueList.get(i), i);
}

try (IHDF5Writer writer = HDF5Factory.configure("My_Log.h5").writer()) {
    writer.string().writeMDArray("log1", values);
}
If you call this code a second time with "Value3" and "Value4" instead, you will get four values. This sort of solution might become unpleasant if you start to have hierarchies of datasets, however.
To solve your issue, you need to define the dataset log1 as extendible so that it can store an unknown number of log entries (generated over time) and write these using a point or hyperslab selection (otherwise, the dataset will be overwritten).
If you are not bound to a specific technology for handling HDF5 files, you may wish to take a look at HDFql, which is a high-level language to manage HDF5 files easily. A possible solution for your use case using HDFql in Java is:
public class Example
{
    static class Log
    {
        String name1;
        String name2;
    }

    public static boolean doSomething(Log log)
    {
        log.name1 = "Value1";
        log.name2 = "Value2";
        return true;
    }

    public static void main(String args[])
    {
        // declare variables
        Log log = new Log();
        int variableNumber;

        // create an HDF5 file named 'My_Log.h5' and use (i.e. open) it
        HDFql.execute("CREATE AND USE FILE My_Log.h5");

        // create an extendible HDF5 dataset named 'log1' of data type compound
        HDFql.execute("CREATE DATASET log1 AS COMPOUND(name1 AS VARCHAR, name2 AS VARCHAR)(0 TO UNLIMITED)");

        // register variable 'log' for subsequent usage (by HDFql)
        variableNumber = HDFql.variableRegister(log);

        // call function 'doSomething' that does something and populates variable 'log' with an entry
        while (doSomething(log))
        {
            // alter (i.e. extend) dataset 'log1' by +1 (i.e. add a new row)
            HDFql.execute("ALTER DIMENSION log1 TO +1");

            // insert (i.e. write) data stored in variable 'log' into dataset 'log1' using a point selection
            HDFql.execute("INSERT INTO log1(-1) VALUES FROM MEMORY " + variableNumber);
        }
    }
}
I'm trying to build a text classifier using Weka, but the probabilities returned by distributionForInstance are 1.0 for one class and 0.0 for all the others, so classifyInstance always returns the same class as the prediction. Something in the training doesn't work correctly.
ARFF training
@relation test1
@attribute tweetmsg String
@attribute classValues {politica,sport,musicatvcinema,infogeneriche,fattidelgiorno,statopersonale,checkin,conversazione}
@data
"Renzi Berlusconi Salvini Bersani",politica
"Allegri insulta la terna arbitrale",sport
"Bravo Garcia",sport
Training methods
public void trainClassifier(final String INPUT_FILENAME) throws Exception
{
    getTrainingDataset(INPUT_FILENAME);
    // trainingInstances consists of the feature vector of every input
    for (Instance currentInstance : inputDataset)
    {
        Instance currentFeatureVector = extractFeature(currentInstance);
        currentFeatureVector.setDataset(trainingInstances);
        trainingInstances.add(currentFeatureVector);
    }
    classifier = new NaiveBayes();
    try {
        // classifier training code
        classifier.buildClassifier(trainingInstances);
        // store the trained classifier in a file for future use
        weka.core.SerializationHelper.write("NaiveBayes.model", classifier);
    } catch (Exception ex) {
        System.out.println("Exception in training the classifier." + ex);
    }
}
private Instance extractFeature(Instance inputInstance) throws Exception
{
    String tweet = inputInstance.stringValue(0);
    StringTokenizer defaultTokenizer = new StringTokenizer(tweet);
    List<String> tokens = new ArrayList<String>();
    while (defaultTokenizer.hasMoreTokens())
    {
        String t = defaultTokenizer.nextToken();
        tokens.add(t);
    }
    Iterator<String> a = tokens.iterator();
    while (a.hasNext())
    {
        String token = (String) a.next();
        String word = token.replaceAll("#", "");
        if (featureWords.contains(word))
        {
            double cont = featureMap.get(featureWords.indexOf(word)) + 1;
            featureMap.put(featureWords.indexOf(word), cont);
        }
        else {
            featureWords.add(word);
            featureMap.put(featureWords.indexOf(word), 1.0);
        }
    }
    attributeList.clear();
    for (String featureWord : featureWords)
    {
        attributeList.add(new Attribute(featureWord));
    }
    attributeList.add(new Attribute("Class", classValues));
    int indices[] = new int[featureMap.size() + 1];
    double values[] = new double[featureMap.size() + 1];
    int i = 0;
    for (Map.Entry<Integer, Double> entry : featureMap.entrySet())
    {
        indices[i] = entry.getKey();
        values[i] = entry.getValue();
        i++;
    }
    indices[i] = featureWords.size();
    values[i] = (double) classValues.indexOf(inputInstance.stringValue(1));
    trainingInstances = createInstances("TRAINING_INSTANCES");
    return new SparseInstance(1.0, values, indices, 1000000);
}
private void getTrainingDataset(final String INPUT_FILENAME)
{
    try {
        ArffLoader trainingLoader = new ArffLoader();
        trainingLoader.setSource(new File(INPUT_FILENAME));
        inputDataset = trainingLoader.getDataSet();
    } catch (IOException ex)
    {
        System.out.println("Exception in getTrainingDataset method");
    }
    System.out.println("dataset " + inputDataset.numAttributes());
}

private Instances createInstances(final String INSTANCES_NAME)
{
    // create an Instances object with initial capacity zero
    Instances instances = new Instances(INSTANCES_NAME, attributeList, 0);
    // set the class index to the last attribute
    instances.setClassIndex(instances.numAttributes() - 1);
    return instances;
}
public static void main(String[] args) throws Exception
{
    Classificatore wekaTutorial = new Classificatore();
    wekaTutorial.trainClassifier("training_set_prova_tent.arff");
    wekaTutorial.testClassifier("testing.arff");
}

public Classificatore()
{
    attributeList = new ArrayList<Attribute>();
    initialize();
}

private void initialize()
{
    featureWords = new ArrayList<String>();
    featureMap = new TreeMap<>();
    classValues = new ArrayList<String>();
    classValues.add("politica");
    classValues.add("sport");
    classValues.add("musicatvcinema");
    classValues.add("infogeneriche");
    classValues.add("fattidelgiorno");
    classValues.add("statopersonale");
    classValues.add("checkin");
    classValues.add("conversazione");
}
Testing methods
public void testClassifier(final String INPUT_FILENAME) throws Exception
{
    getTrainingDataset(INPUT_FILENAME);
    // testingInstances consists of the feature vector of every input
    Instances testingInstances = createInstances("TESTING_INSTANCES");
    for (Instance currentInstance : inputDataset)
    {
        // extractFeature returns the feature vector for the current input
        Instance currentFeatureVector = extractFeature(currentInstance);
        // make currentFeatureVector belong to testingInstances
        currentFeatureVector.setDataset(testingInstances);
        testingInstances.add(currentFeatureVector);
    }
    try {
        // classifier deserialization
        classifier = (Classifier) weka.core.SerializationHelper.read("NaiveBayes.model");
        // classifier testing code
        for (Instance testInstance : testingInstances)
        {
            double score = classifier.classifyInstance(testInstance);
            double[] vv = classifier.distributionForInstance(testInstance);
            for (int k = 0; k < vv.length; k++) {
                // these are the class probabilities; as a result I get 1.0 for one class and 0.0 for all the others
                System.out.println("distribution " + vv[k]);
            }
            System.out.println(testingInstances.attribute("Class").value((int) score));
        }
    } catch (Exception ex) {
        System.out.println("Exception in testing the classifier." + ex);
    }
}
I want to create a text classifier for short messages; this code is based on this tutorial: http://preciselyconcise.com/apis_and_installations/training_a_weka_classifier_in_java.php . The problem is that the classifier predicts the wrong class for almost every message in testing.arff, because the class probabilities are not correct. training_set_prova_tent.arff has the same number of messages per class.
The example I'm following uses a featureWords.dat file and associates 1.0 with a word if it is present in a message; instead, I want to build my own dictionary with the words present in training_set_prova_tent plus the words present in testing, and associate each word with its number of occurrences.
P.S.
I know that this is exactly what I could do with the StringToWordVector filter, but I haven't found any example that explains how to use this filter with two files: one for the training set and one for the test set. So it seemed easier to adapt the code I found.
Thank you very much
It seems like you changed the code from the website you referenced at some crucial points, but not in a good way. I'll try to draft what you're trying to do and what mistakes I've found.
What you (probably) wanted to do in extractFeature is
Split each tweet into words (tokenize)
Count the number of occurrences of these words
Create a feature vector representing these word counts plus the class
What you've overlooked in that method is
You never reset your featureMap. The line
Map<Integer,Double> featureMap = new TreeMap<>();
originally was at the beginning of extractFeature, but you moved it to initialize. That means you keep adding up the word counts and never reset them. For each new tweet, your word count also includes the word counts of all previous tweets. I'm sure that is not what you wanted.
You don't initialize featureWords with the words you want as features. Yes, you create an empty list, but you fill it incrementally with each tweet. The original code initialized it once in the initialize method, and it never changed after that. There are two problems with this:
With each new tweet, new features (words) get added, so your feature vector grows with each tweet. That wouldn't be such a big problem (SparseInstance), but that means that
Your class attribute is always in a different place. These two lines work for the original code, because there featureWords.size() is effectively a constant, but in your code the class label will be at index 5, then 8, then 12, and so on, whereas it must be the same for every instance.
indices[i] = featureWords.size();
values[i] = (double) classValues.indexOf(inputInstance.stringValue(1));
This also manifests itself in the fact that you build a new attributeList for each tweet, instead of only once in initialize, which is bad for the reasons already explained.
There may be more problems, but as it stands, your code is rather unfixable. What you want is much closer to the tutorial source code you modified than to your current version.
Also, you should look into StringToWordVector, because it seems like this is exactly what you want to do:
Converts String attributes into a set of attributes representing word occurrence (depending on the tokenizer) information from the text contained in the strings. The set of words (attributes) is determined by the first batch filtered (typically training data).
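Following the documentation quoted above, here is a minimal sketch of that batch-filtering idiom with two ARFF files (file names taken from the question); setOutputWordCounts(true) switches from 0/1 presence flags to the word counts you want:

import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TrainAndTest {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("training_set_prova_tent.arff");
        Instances test = DataSource.read("testing.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        StringToWordVector filter = new StringToWordVector();
        filter.setOutputWordCounts(true); // word counts instead of 0/1 presence
        filter.setInputFormat(train);     // the dictionary is built from the first batch (training data)
        Instances trainVec = Filter.useFilter(train, filter);
        Instances testVec = Filter.useFilter(test, filter); // reuses the training dictionary

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(trainVec);
        // class distribution for the first test message
        System.out.println(java.util.Arrays.toString(nb.distributionForInstance(testVec.instance(0))));
    }
}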