We are using Spark for file processing. The files are fairly big, around 30 GB each, with about 40-50 million lines. The files are formatted, and we load them into a DataFrame. The initial requirement was to identify records matching certain criteria and load them into MySQL. We were able to do that.
The requirement changed recently: records not meeting the criteria are now to be stored in an alternate DB. This is causing an issue because the collected data is too big. We are trying to collect each partition independently and merge it into a list, as suggested here:
https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/dont_collect_large_rdds.html
We are not familiar with Scala, so we are having trouble converting this to Java. How can we iterate over the partitions one by one and collect them?
Thanks
Please use df.foreachPartition to execute your logic for each partition independently; it does not return anything to the driver. You can save the matching results into the DB at the executor level. If you want to collect the results on the driver, use mapPartitions, which is not recommended for your case.
Please refer to the link below:
Spark - Java - foreachPartition
dataset.foreachPartition(new ForeachPartitionFunction<Row>() {
    @Override
    public void call(Iterator<Row> r) throws Exception {
        while (r.hasNext()) {
            Row row = r.next();
            System.out.println(row.getString(1));
        }
        // do your business logic and load into MySQL here
    }
});
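For the "load into MySQL" step in the block above, here is a minimal sketch of one way it could look, assuming a plain JDBC connection opened once per partition; the JDBC URL, table name, and column mapping are placeholders, not from the original post:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Iterator;
import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Row;

// Hypothetical example: one JDBC connection per partition, batched inserts.
dataset.foreachPartition(new ForeachPartitionFunction<Row>() {
    @Override
    public void call(Iterator<Row> rows) throws Exception {
        // Placeholder connection details; replace with your own.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://db-host:3306/mydb", "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO matched_records (col1) VALUES (?)")) {
            int count = 0;
            while (rows.hasNext()) {
                Row row = rows.next();
                ps.setString(1, row.getString(1));   // map your columns here
                ps.addBatch();
                if (++count % 1000 == 0) {
                    ps.executeBatch();               // flush every 1000 rows
                }
            }
            ps.executeBatch();                       // flush the remainder
        }
    }
});
Opening one connection per partition (rather than per row) and batching the inserts keeps the JDBC overhead manageable at this data volume.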
For mapPartitions:
// You could reuse Row here, but for clarity I am defining this class.
public class ResultEntry implements Serializable {
    // define your DataFrame fields here ...
}
Dataset<ResultEntry> mappedData = data.mapPartitions(new MapPartitionsFunction<Row, ResultEntry>() {
    @Override
    public Iterator<ResultEntry> call(Iterator<Row> it) {
        List<ResultEntry> filteredResult = new ArrayList<ResultEntry>();
        while (it.hasNext()) {
            Row row = it.next();
            if (someCondition) {
                filteredResult.add(convertToResultEntry(row));
            }
        }
        return filteredResult.iterator();
    }
}, Encoders.javaSerialization(ResultEntry.class));
Hope this helps.
Ravi
I want to use Spark with Java to read an RCFile. From searching on Google, I found that I should use the hadoopFile function. I wrote this:
JavaSparkContext ctx = new JavaSparkContext("local", "Accumulate",System.getenv("SPARK_HOME"), JavaSparkContext.jarOfClass(spark.wendy.RCFileAccumulate.class));
JavaPairRDD<String,Array> idListData = ctx.hadoopFile("/rcfinancetest",RCFileInputFormat.class,String.class, String.class);
idListData.saveAsTextFile("/financeout");
Then I opened the result file; the result is as follows:
(0,org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@4442459f)
(1,org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@4442459f)
(2,org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@4442459f)
(3,org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@4442459f)
(4,org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable@4442459f)
The second field of every record is the same. How can I deal with that value? How do I get at the column data in the RCFile? By the way, I created the RCFile via Hive; I think that has nothing to do with the Spark reading program. Could you help me?
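For reference, a minimal sketch of one way the values might be decoded, assuming the old-API RCFileInputFormat whose keys are LongWritable and whose values are BytesRefArrayWritable; the path, column delimiter, and type parameters are illustrative only and may need adjusting for your Spark/Hive versions:
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.hive.ql.io.RCFileInputFormat;
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable;
import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

// RCFileInputFormat (old mapred API) yields LongWritable keys and
// BytesRefArrayWritable values, one value per row.
JavaPairRDD<LongWritable, BytesRefArrayWritable> rows =
        ctx.hadoopFile("/rcfinancetest",
                RCFileInputFormat.class,
                LongWritable.class,
                BytesRefArrayWritable.class);

// Saving the raw Writable only prints its default toString(), which is what
// the output above shows; decode each column into a plain string instead.
JavaRDD<String> decoded = rows.values().map(
        new Function<BytesRefArrayWritable, String>() {
            @Override
            public String call(BytesRefArrayWritable row) throws Exception {
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < row.size(); i++) {
                    BytesRefWritable col = row.get(i);
                    if (i > 0) {
                        sb.append('|');   // illustrative column delimiter
                    }
                    sb.append(new String(col.getData(), col.getStart(),
                            col.getLength(), StandardCharsets.UTF_8));
                }
                return sb.toString();
            }
        });

decoded.saveAsTextFile("/financeout");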
I'm new to topic modeling and I'm trying to use the Mallet library, but I have a question.
I'm using the simple parallel threaded implementation of LDA to find topics for some instances. My question is: what does the estimate function in ParallelTopicModel do?
I have searched the API docs, but there is no description. I have also read this tutorial.
Can someone explain what this function does?
EDIT
This is an example of my code:
public void runModel(String[] str) {
    ParallelTopicModel model = new ParallelTopicModel(numTopics);
    ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
    // Pipes: lowercase, tokenize, remove stopwords, map to features
    pipeList.add(new CharSequenceLowercase());
    pipeList.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")));
    pipeList.add(new TokenSequence2FeatureSequence());
    InstanceList instances = new InstanceList(new SerialPipes(pipeList));
    instances.addThruPipe(new StringArrayIterator(str));
    model.addInstances(instances);
    model.setNumThreads(THREADS);
    model.setOptimizeInterval(optimizeation);
    model.setBurninPeriod(burninInterval);
    model.setNumIterations(numIterations);
    // model.estimate();
}
estimate() runs LDA, attempting to estimate the topic model given the data and settings you've already set up.
Have a look at the main() function of the ParallelTopicModel source for inspiration about what's needed to estimate a model.
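Concretely, with the instances and settings from the question's runModel(...) already in place, uncommenting the last line is all that is needed to fit the model; a small sketch (the displayTopWords call is assumed from the usual MALLET ParallelTopicModel API and is only illustrative):
// Runs the threaded Gibbs sampler for setNumIterations() sweeps; this is the
// step that actually fits the topic model to the instances added above.
model.estimate();

// Afterwards the fitted model can be inspected, e.g. the top words per topic
// (assuming ParallelTopicModel's displayTopWords helper is available).
System.out.println(model.displayTopWords(10, true));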
We are currently evaluating Elasticsearch as our solution for Analytics. The main driver is the fact that once the data is populated into Elasticsearch, the reporting comes for free with Kibana.
Before adopting it, I have been tasked with doing a performance analysis of the tool.
The main requirement is supporting a PUT rate of 500 evt/sec.
I am currently starting with a small setup, as follows, just to get a sense of the API before I move it to a more serious lab.
My strategy is basically to go over CSVs of analytics that match the format I need and put them into Elasticsearch. I am not using the bulk API because in reality the events will not arrive in bulk.
Following is the main code that does this:
// Created once, used for creating JSON from a bean
ObjectMapper mapper = new ObjectMapper();

// Measurement for checking the count of sent events vs. ES-stored events
AnalyticsMetrics metrics = new AnalyticsMetrics();
metrics.startRecording();

File dir = new File(mFolder);
for (File file : dir.listFiles()) {
    CSVReader reader = new CSVReader(new FileReader(file.getAbsolutePath()), '|');
    String[] nextLine;
    while ((nextLine = reader.readNext()) != null) {
        AnalyticRecord record = new AnalyticRecord();
        record.serializeLine(nextLine);

        // Generate JSON
        String json = mapper.writeValueAsString(record);

        IndexResponse response = mClient.getClient().prepareIndex("sdk_sync_log", "sdk_sync")
                .setSource(json)
                .execute()
                .actionGet();

        // Record metrics
        metrics.sent();
    }
    reader.close();
}
metrics.stopRecording();
return metrics;
I have the following questions:
How do I know through the API when all the requests are completed and the data is saved into Elasticsearch? I could query Elasticsearch for the object count in my particular index, but doing that would be a new performance factor by itself, so I am ruling that option out.
Is the above the fastest way to insert objects into Elasticsearch, or are there other optimizations I could make? Keep in mind that the bulk API is not an option for now.
Thx in advance.
P.S: the Elasticsearch version I am using on both client and server is 1.0.0.
The Elasticsearch index response has an isCreated() method that returns true if the document is new and false if an existing document was updated; it can be used to check that each document was successfully inserted or updated.
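For example, in the indexing loop from the question this could look roughly as follows (the created/updated counters are only illustrative):
IndexResponse response = mClient.getClient().prepareIndex("sdk_sync_log", "sdk_sync")
        .setSource(json)
        .execute()
        .actionGet();

// actionGet() blocks until this request is acknowledged, so once the loop
// finishes every document has been handed to Elasticsearch. isCreated()
// distinguishes newly created documents from updates of existing ones.
if (response.isCreated()) {
    created++;   // hypothetical counter for newly indexed documents
} else {
    updated++;   // hypothetical counter for overwritten documents
}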
If bulk indexing is not an option, there are other areas that could be tweaked to improve performance (see the sketch after this list), such as:
increasing the index refresh interval via index.refresh_interval
disabling replicas by setting index.number_of_replicas to 0
disabling the _source and _all fields if they are not needed
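A hedged sketch of how the index-level settings could be applied with the 1.x Java client used in the question; the index name matches the question's code and the interval value is just an example:
import org.elasticsearch.common.settings.ImmutableSettings;

// Relax the refresh interval and drop replicas on the test index while the
// load test runs (re-enable both afterwards).
mClient.getClient().admin().indices()
        .prepareUpdateSettings("sdk_sync_log")
        .setSettings(ImmutableSettings.settingsBuilder()
                .put("index.refresh_interval", "30s")   // default is 1s
                .put("index.number_of_replicas", 0)
                .build())
        .execute()
        .actionGet();

// _source and _all are mapping-level options, so they have to be disabled in
// the type mapping when the index/type is created, e.g. with
// "_source": {"enabled": false} and "_all": {"enabled": false}.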
Overview
I know that one can get the percentages of each prediction in a trained WEKA model through the GUI and command line options as conveniently explained and demonstrated in the documentation article "Making predictions".
Predictions
I know that there are three ways documented to get these predictions:
command line
GUI
Java code/using the WEKA API, which I was able to do in the answer to "Get risk predictions in WEKA using own Java code"
A fourth way, the one I need, requires a generated WEKA .MODEL file.
I have a trained .MODEL file and now I want to classify new instances with it, together with the prediction percentages, similar to the output below (from the GUI Explorer, in CSV format):
inst#,actual,predicted,error,distribution,
1,1:0,2:1,+,0.399409,*0.7811
2,1:0,2:1,+,0.3932409,*0.8191
3,1:0,2:1,+,0.399409,*0.600591
4,1:0,2:1,+,0.139409,*0.64
5,1:0,2:1,+,0.399409,*0.600593
6,1:0,2:1,+,0.3993209,*0.600594
7,1:0,2:1,+,0.500129,*0.600594
8,1:0,2:1,+,0.399409,*0.90011
9,1:0,2:1,+,0.211409,*0.60182
10,1:0,2:1,+,0.21909,*0.11101
The predicted column is what I want to get from a .MODEL file.
What I know
Based on my experience with the WEKA API approach, one can get these predictions using the following code (a PlainText object inserted into an Evaluation object), BUT I do not want to do the k-fold cross-validation that the Evaluation object provides.
StringBuffer predictionSB = new StringBuffer();
Range attributesToShow = null;
Boolean outputDistributions = new Boolean(true);
PlainText predictionOutput = new PlainText();
predictionOutput.setBuffer(predictionSB);
predictionOutput.setOutputDistribution(true);
Evaluation evaluation = new Evaluation(data);
evaluation.crossValidateModel(j48Model, data, numberOfFolds,
randomNumber, predictionOutput, attributesToShow,
outputDistributions);
System.out.println(predictionOutput.getBuffer());
From the WEKA documentation
Note that how a .MODEL file classifies data from an .ARFF or related input is discussed in "Use Weka in your Java code" and in "Serialization", a.k.a. "How to use a .MODEL file in your own Java code to classify new instances" (why the vague title, smh).
Using own Java code to classify
Loading a .MODEL file is done through "Deserialization"; the following is for versions > 3.5.5:
// deserialize model
Classifier cls = (Classifier) weka.core.SerializationHelper.read("/some/where/j48.model");
An Instance object holds the data, and it is fed to classifyInstance. An output is returned (its form depends on the data type of the outcome attribute):
// classify an Instance object (testData)
cls.classifyInstance(testData.instance(0));
The question "How to reuse saved classifier created from explorer(in weka) in eclipse java" has a great answer too!
Javadocs
I have already checked the Javadocs for Classifier (the trained model) and Evaluation (just in case), but neither directly and explicitly addresses this issue.
The closest thing to what I want is the classifyInstance method of Classifier:
Classifies the given test instance. The instance has to belong to a dataset when it's being classified. Note that a classifier MUST implement either this or distributionForInstance().
How can I simultaneously use a WEKA .MODEL file to classify and get predictions of a new instance using my own Java code (aka using the WEKA API)?
This answer simply updates my answer from How to reuse saved classifier created from explorer(in weka) in eclipse java.
I will show how to obtain the predicted instance value and the prediction percentage (or distribution). The example model is a J48 decision tree created and saved in the Weka Explorer. It was built from the nominal weather data provided with Weka. It is called "tree.model".
import weka.classifiers.Classifier;
import weka.core.Instances;

public class Main {
    public static void main(String[] args) throws Exception {
        String rootPath = "/some/where/";

        // load the saved model
        Classifier cls = (Classifier) weka.core.SerializationHelper.read(rootPath + "tree.model");

        // load or create the Instances to predict (the class attribute must be set)
        Instances originalTrain = // instances here

        // index of the instance whose class value we want to predict
        int s1 = 0;

        // perform the prediction
        double value = cls.classifyInstance(originalTrain.instance(s1));

        // get the prediction percentage or distribution
        double[] percentage = cls.distributionForInstance(originalTrain.instance(s1));

        // get the name of the predicted class value
        String prediction = originalTrain.classAttribute().value((int) value);

        System.out.println("The predicted value of instance " +
                Integer.toString(s1) + ": " + prediction);

        // format the distribution, marking the predicted class with '*'
        String distribution = "";
        for (int i = 0; i < percentage.length; i = i + 1) {
            if (i == value) {
                distribution = distribution + "*" + Double.toString(percentage[i]) + ",";
            } else {
                distribution = distribution + Double.toString(percentage[i]) + ",";
            }
        }
        distribution = distribution.substring(0, distribution.length() - 1);
        System.out.println("Distribution:" + distribution);
    }
}
The output from this is:
The predicted value of instance 0: no
Distribution: *1, 0