How to map a log file in Java using Spark? - java

I have to monitor a log file that records the utilization history of an app. The log file is formatted like this:
<AppId,date,cpuUsage,memoryUsage>
<AppId,date,cpuUsage,memoryUsage>
<AppId,date,cpuUsage,memoryUsage>
<AppId,date,cpuUsage,memoryUsage>
<AppId,date,cpuUsage,memoryUsage>
... about 800000 rows
AppId is always the same, because it refers to a single app. The date is expressed in the format dd/mm/yyyy hh/mm, and cpuUsage and memoryUsage are expressed in %, so for example:
<3ghffh3t482age20304,230720142245,0.2,3.5>
To be specific, I have to monitor the CPU and memory usage percentages of this application using Spark and the MapReduce approach.
The output should print an alert when CPU or memory usage reaches 100%.
How can I start?

The idea is to declare a case class and map each line into a Scala object.
Let's declare the case class as follows:
case class App(name: String, date: String, cpuUsage: Double, memoryusage: Double)
Then initialize the SparkContext and create an RDD from the text file containing the data:
val sc = new SparkContext(sparkConf)
val inFile = sc.textFile("log.txt")
Then parse each line and map it to an App object, keyed by AppId, so that the range check becomes straightforward:
val mappedLines = inFile.map(x => (x.split(",")(0), parse(x)))
where the parse(x) method is defined as follows:
def parse(x: String): App = {
  val splitArr = x.split(",")
  App(splitArr(0),
      splitArr(1),
      splitArr(2).toDouble,
      splitArr(3).toDouble)
}
Note that I have assumed the input looks like the following (this is just to give you the idea, not the entire program):
ffh3t482age20304,230720142245,0.2,100.5
Then apply a filter transformation that performs the check and reports the anomaly conditions:
val anomalyLines = mappedLines.filter(doCheckCPUAndMemoryUtilization)
anomalyLines.count()
where the doCheckCPUAndMemoryUtilization function is defined as follows:
def doCheckCPUAndMemoryUtilization(x: (String, App)): Boolean = {
  if (x._2.cpuUsage >= 100.0 || x._2.memoryusage >= 100.0) {
    System.out.println("App name -> " + x._2.name + " exceeds the limit")
    true
  } else {
    false
  }
}
Note: this is batch processing only, not real-time processing.
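Since the question asks for Java, here is a minimal Java sketch of the same filter-and-alert idea (the file name log.txt, the local master, and the comma-separated field layout are assumptions for illustration, not part of the original answer):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LogMonitor {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LogMonitor").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Each line is "AppId,date,cpuUsage,memoryUsage"
        JavaRDD<String> lines = sc.textFile("log.txt");

        // Keep only the lines where CPU or memory usage reaches 100%
        JavaRDD<String> alerts = lines.filter(line -> {
            String[] fields = line.split(",");
            double cpu = Double.parseDouble(fields[2]);
            double mem = Double.parseDouble(fields[3]);
            return cpu >= 100.0 || mem >= 100.0;
        });

        // Collect and print the alerts (fine as long as the alert set is small)
        alerts.collect().forEach(l -> System.out.println("ALERT: " + l));

        sc.close();
    }
}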

Related

Using a Java Stream instead of a for-loop skips code execution

Switching a bunch of for-loop code to use a parallel stream is apparently causing a certain part of the code to be ignored.
I'm using MOA and Weka with Java 11 to run a simple recommendation engine example, taking cues from the source code of moa.tasks.EvaluateOnlineRecomender, which uses MOA's internal task setup to test the accuracy of the Biased Regularized Incremental Simultaneous Matrix Factorization (BRISMF) implementation provided by MOA. Instead of using MOA's prepared MovielensDataset class, I switched over to Weka's Instances for prospects of applying Weka's ML tools.
The time it took to process about a million instances (I'm using the Movielens 1M dataset) was about 13-14 minutes. In a bid to see improvements, I wanted to run it on a parallel stream, and became suspicious when the task finished in about 40 seconds. I found that BRISMFPredictor.predictRating was always producing 0 within the parallel stream's body. Here's the code for either case:
Code for initialisation:
import com.github.javacliparser.FileOption;
import com.github.javacliparser.IntOption;
import moa.options.ClassOption;
import moa.recommender.predictor.BRISMFPredictor;
import moa.recommender.predictor.RatingPredictor;
import moa.recommender.rc.data.RecommenderData;
import weka.core.converters.CSVLoader;
...
private static ClassOption datasetOption;
private static ClassOption ratingPredictorOption;
private static IntOption sampleFrequencyOption;
private static FileOption defaultFileOption;
static {
ratingPredictorOption = new ClassOption("ratingPredictor",
's', "Rating Predictor to evaluate on.", RatingPredictor.class,
"moa.recommender.predictor.BRISMFPredictor");
sampleFrequencyOption = new IntOption("sampleFrequency",
'f', "How many instances between samples of the learning performance.", 100, 0, 2147483647);
defaultFileOption = new FileOption("file",
'f', "File to load.",
"C:\\Users\\shiva\\Documents\\Java-ML\\mlapp\\data\\ml-1m\\ratings.dat", "dat", false);
}
... and inside main() (a quirk with Weka's CSVLoader required that I replace the default :: delimiter with +)
var csvLoader = new CSVLoader();
csvLoader.setSource(defaultFileOption.getFile());
csvLoader.setFieldSeparator("+");
var dataset = csvLoader.getDataSet();
System.out.println(dataset.toSummaryString());
var predictor = new BRISMFPredictor();
predictor.prepareForUse();
RecommenderData data = predictor.getData();
data.clear();
data.disableUpdates(false);
Now, alternating between the following snippets:
for (var instance : dataset) {
var user = (int) instance.value(0);
var item = (int) instance.value(1);
var rating = instance.value(2);
double predictedRating = predictor.predictRating(user, item);
System.out.printf("User %d | Movie %d | Actual Rating %d | Predicted Rating %f%n",
user, item, Math.round(rating), predictedRating);
}
(Now being a noob in everything concurrent):
dataset.parallelStream().forEach(instance -> {
var user = (int) instance.value(0);
var item = (int) instance.value(1);
var rating = instance.value(2);
double predictedRating = predictor.predictRating(user, item);
System.out.printf("User %d | Movie %d | Actual Rating %d | Predicted Rating %f%n",
user, item, Math.round(rating), predictedRating);
});
Now I decide that, heck, maybe this operation can't be done in parallel, and I switch it to use stream(). Even then, the segment seems to be completely ignored, since the output is again 0.0 each time:
dataset.stream().forEach(instance -> {
var user = (int) instance.value(0);
var item = (int) instance.value(1);
var rating = instance.value(2);
double predictedRating = predictor.predictRating(user, item);
System.out.printf("User %d | Movie %d | Actual Rating %d | Predicted Rating %f%n",
user, item, Math.round(rating), predictedRating);
});
I have tried removing the print statement from the run, but to no avail.
Obviously, I get the expected output lines consisting of actual and predicted rating within about 13 minutes in the first case, but find that the predicted rating is 0.0 in the second case with suspiciously low execution time. Is there something I'm missing out on?
EDIT: using dataset.forEach() does the same thing. Perhaps a quirk of lambdas?

Apache Flink: The execution environment and multiple sink

My question might cause some confusion, so please see the Description first; it might help to identify my problem. I will add my code at the end of the question (any suggestions regarding my code structure/implementation are also welcome).
Thank you for any help in advance!
My question:
How do I define multiple sinks in Flink batch processing without having it read data from one source repeatedly?
What is the difference between createCollectionEnvironment() and getExecutionEnvironment()? Which one should I use in a local environment?
What is the use of env.execute()? My code outputs the result without this statement; if I add it, it throws an exception:
Exception in thread "main" java.lang.RuntimeException: No new data sinks have been defined since the last execution. The last execution refers to the latest call to 'execute()', 'count()', 'collect()', or 'print()'.
at org.apache.flink.api.java.ExecutionEnvironment.createProgramPlan(ExecutionEnvironment.java:940)
at org.apache.flink.api.java.ExecutionEnvironment.createProgramPlan(ExecutionEnvironment.java:922)
at org.apache.flink.api.java.CollectionEnvironment.execute(CollectionEnvironment.java:34)
at org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:816)
at MainClass.main(MainClass.java:114)
Description:
I'm new to programming. Recently I needed to process some data (grouping data, calculating standard deviation, etc.) using Flink batch processing.
However, I came to a point where I need to output two DataSets.
The structure was something like this
From Source (Database) -> DataSet 1 (add index using zipWithIndex()) -> DataSet 2 (do some calculation while keeping index) -> DataSet 3
First I output DataSet 2; the index runs, e.g., from 1 to 10000.
Then I output DataSet 3, and the index becomes 10001 to 20000, although I did not change the values in any function.
My guess is that when outputting DataSet 3, instead of reusing the previously calculated DataSet 2, it starts by getting data from the database again and then performs the calculation.
With the use of the zipWithIndex() function, this not only gives the wrong index numbers but also increases the number of connections to the database.
I guess this is related to the execution environment: when I use
ExecutionEnvironment env = ExecutionEnvironment.createCollectionsEnvironment();
it gives the "wrong" index numbers (10001-20000),
and
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
gives the correct index numbers (1-10000).
The time taken and the number of database connections also differ, and the print order is reversed.
OS, DB, other environment details and versions:
IntelliJ IDEA 2017.3.5 (Community Edition)
Build #IC-173.4674.33, built on March 6, 2018
JRE: 1.8.0_152-release-1024-b15 amd64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
Windows 10 10.0
My test code (Java):
public static void main(String[] args) throws Exception {
ExecutionEnvironment env = ExecutionEnvironment.createCollectionsEnvironment();
//Table is used to calculate the standard deviation as I figured that there is no such calculation in DataSet.
BatchTableEnvironment tableEnvironment = TableEnvironment.getTableEnvironment(env);
//Get Data from a mySql database
DataSet<Row> dbData =
env.createInput(
JDBCInputFormat.buildJDBCInputFormat()
.setDrivername("com.mysql.cj.jdbc.Driver")
.setDBUrl($database_url)
.setQuery("select value from $table_name where id =33")
.setUsername("username")
.setPassword("password")
.setRowTypeInfo(new RowTypeInfo(BasicTypeInfo.DOUBLE_TYPE_INFO))
.finish()
);
// Add index for assigning group (group capacity is 5)
DataSet<Tuple2<Long, Row>> indexedData = DataSetUtils.zipWithIndex(dbData);
// Replace index(long) with group number(int), and convert Row to double at the same time
DataSet<Tuple2<Integer, Double>> rawData = indexedData.flatMap(new GroupAssigner());
//Using groupBy() to combine individual data of each group into a list, while calculating the mean and range in each group
//put them into a POJO named GroupDataClass
DataSet<GroupDataClass> groupDS = rawData.groupBy("f0").combineGroup(new GroupCombineFunction<Tuple2<Integer, Double>, GroupDataClass>() {
@Override
public void combine(Iterable<Tuple2<Integer, Double>> iterable, Collector<GroupDataClass> collector) {
Iterator<Tuple2<Integer, Double>> it = iterable.iterator();
Tuple2<Integer, Double> var1 = it.next();
int groupNum = var1.f0;
// Using max and min to calculate range, using i and sum to calculate mean
double max = var1.f1;
double min = max;
double sum = 0;
int i = 1;
// The list is to store individual value
List<Double> list = new ArrayList<>();
list.add(max);
while (it.hasNext())
{
double next = it.next().f1;
sum += next;
i++;
max = next > max ? next : max;
min = next < min ? next : min;
list.add(next);
}
//Store group number, mean, range, and 5 individual values within the group
collector.collect(new GroupDataClass(groupNum, sum / i, max - min, list));
}
});
//print because if no sink is created, Flink will not even perform the calculation.
groupDS.print();
// Get the max group number and range in each group to calculate average range
// if group number start with 1 then the maximum of group number equals to the number of group
// However, because this is the second sink, data will flow from source again, which will double the group number
DataSet<Tuple2<Integer, Double>> rangeDS = groupDS.map(new MapFunction<GroupDataClass, Tuple2<Integer, Double>>() {
@Override
public Tuple2<Integer, Double> map(GroupDataClass in) {
return new Tuple2<>(in.groupNum, in.range);
}
}).max(0).andSum(1);
// collect and print as if no sink is created, Flink will not even perform the calculation.
Tuple2<Integer, Double> rangeTuple = rangeDS.collect().get(0);
double range = rangeTuple.f1/ rangeTuple.f0;
System.out.println("range = " + range);
}
public static class GroupAssigner implements FlatMapFunction<Tuple2<Long, Row>, Tuple2<Integer, Double>> {
@Override
public void flatMap(Tuple2<Long, Row> input, Collector<Tuple2<Integer, Double>> out) {
// index 1-5 will be assigned to group 1, index 6-10 will be assigned to group 2, etc.
int n = new Long(input.f0 / 5).intValue() + 1;
out.collect(new Tuple2<>(n, (Double) input.f1.getField(0)));
}
}
It's fine to connect a source to multiple sinks; the source gets executed only once and the records get broadcast to the multiple sinks. See this question: Can Flink write results into multiple files (like Hadoop's MultipleOutputFormat)?
getExecutionEnvironment is the right way to get the environment when you want to run your job. createCollectionEnvironment is a good way to play around and test. See the documentation.
The exception message is very clear: if you call print or collect, your data flow gets executed. So you have two choices:
Either you call print/collect at the end of your data flow, and it gets executed and printed. That's good for testing. Bear in mind that you can only call collect/print once per data flow, otherwise it gets executed multiple times before it is completely defined.
Or you add a sink at the end of your data flow and call env.execute(). That's what you want to do once your flow is in a more mature shape.
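To illustrate that second option, here is a minimal sketch (not the program from the question) of a single data flow with two sinks and one execute() call; the output paths, the sample data, and the job name are placeholders:
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class TwoSinksJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // The source is part of one plan; both sinks hang off the same data flow
        DataSet<Tuple2<Integer, Double>> data = env.fromElements(
                new Tuple2<>(1, 1.5), new Tuple2<>(2, 2.5));

        // First sink: write the raw data
        data.writeAsCsv("/tmp/raw-output");

        // Second sink: write an aggregate of the same data
        data.sum(1).writeAsCsv("/tmp/sum-output");

        // One execute() runs the whole plan; the source is not re-read per sink
        env.execute("two-sinks-example");
    }
}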

How to bucket outputs in Scalding

I'm trying to output a pipe into different directories such that the output of each directory is bucketed based on some ids.
In plain MapReduce code I would use the MultipleOutputs class, and I would do something like this in the reducer:
protected void reduce(final SomeKey key,
final Iterable<SomeValue> values,
final Context context) {
...
for (SomeValue value: values) {
String bucketId = computeBucketIdFrom(...);
multipleOutputs.write(key, value, folderName + "/" + bucketId);
...
So I guess one could do it like this in Scalding:
...
val somePipe = Csv(in, separator = "\t",
fields = someSchema,
skipHeader = true)
.read
for (i <- 1 until numberOfBuckets) {
somePipe
.filter('someId) {id: String => (id.hashCode % numberOfBuckets) == i}
.write(Csv(out + "/bucket" + i ,
writeHeader = true,
separator = "\t"))
}
But I feel that you would end up reading the same pipe many times, and it would affect the overall performance.
Are there any other alternatives?
Thanks
Yes, of course there is a better way using TemplatedTsv.
So your code above can be written as follows,
val somePipe = Tsv(in, fields = someSchema, skipHeader = true)
.read
.write(TemplatedTsv(out, "%s", 'some_id, writeHeader = true))
This will put all records into separate folders under out, one folder per value of 'some_id.
However, you can also create integer buckets. Just change the last lines,
.map('some_id -> 'bucket) { id: String => id.hashCode % numberOfBuckets }
.write(TemplatedTsv(out, "%02d", 'bucket, writeHeader = true, fields = ('all except 'bucket)))
This will create two-digit folder names under out (e.g. out/01/). You can also check the TemplatedTsv API here.
There might be a small problem using TemplatedTsv: reducers can generate lots of small files, which can be bad for the next job using your results. Therefore, it is better to sort on the template fields before writing to disk. I wrote a blog post about it here.

Creating a large number of multiple outputs

I am creating a large number of output files, for example 500. I am getting an AlreadyBeingCreatedException, as shown below. The program recovers by itself when the number of output files is small. For example, with 50 files, although this exception occurs, the program starts running successfully after printing the exception several times.
But for many files, it eventually fails with an IOException.
I have pasted the error and then the code below:
12/10/29 15:47:27 INFO mapred.JobClient: Task Id : attempt_201210231820_0235_r_000004_3, Status : FAILED
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /home/users/mlakshm/preopa406/data-r-00004 for DFSClient_attempt_201210231820_0235_r_000004_3 on client 10.0.1.100, because this file is already being created by DFSClient_attempt_201210231820_0235_r_000004_2 on 10.0.1.130
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1406)
I have pasted the code:
In the reduce method, I have the below logic to generate outputs:
int data_hash = (int)data_str.hashCode();
int data_int1 = 0;
int k = 500;
int check1 = 0;
for (int l = 10; l>0; l++)
{
if((data_hash%l==0)&&(check1 == 0))
{
check1 = 1;
int range = (int) k/10;
String check = "true";
while(range > 0 && check.equals("true"))
{
if(data_hash % range-1 == 0)
{
check = "false";
data_int1 = range*10;
}
}
}
}
mos.getCollector("/home/users/mlakshm/preopa407/cdata"+data_int1, reporter).collect(new Text(t+" "+alsort.get(0)+" "+alsort.get(1)), new Text(intersection));
Please help!
The problem is that all the reducers are trying to write files with the same naming scheme.
The reason it's doing this is that
mos.getCollector("/home/users/mlakshm/preopa407/cdata"+data_int1, reporter).collect(new Text(t+" "+alsort.get(0)+" "+alsort.get(1)), new Text(intersection));
sets the file name based on a characteristic of the data, not the identity of the reducer.
You have a couple of choices:
Rework your map job so that the emitted key matches up with the hash you're calculating in this job. That would make sure that each reducer gets a span of values.
Include in the file name an identifier that is unique to each reducer. This would leave you with a set of part files for each reducer.
Could you perhaps explain why you're using multiple outputs here? I don't think you need to.
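For the second option, here is a minimal sketch of a reducer using the newer org.apache.hadoop.mapreduce MultipleOutputs API. The Text key/value types, the cdata prefix, the bucket count of 500, and the bucket formula are placeholders loosely mirroring the question; it also assumes the job's output format is a FileOutputFormat such as TextOutputFormat:
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class BucketReducer extends Reducer<Text, Text, Text, Text> {

    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Identifier that is unique to this reduce task
        int reducerId = context.getTaskAttemptID().getTaskID().getId();
        for (Text value : values) {
            // Bucket derived from the data, roughly as in the original code
            int bucket = Math.abs(value.toString().hashCode()) % 500;
            // The path contains the bucket (a data property) plus the reducer id,
            // so no two reducers ever write to the same file
            mos.write(key, value, "cdata" + bucket + "/part-" + reducerId);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}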

diverging results from weka training and java training

I'm trying to create an "automated training" using Weka's Java API, but I guess I'm doing something wrong. Whenever I test my ARFF file via Weka's interface using MultilayerPerceptron with 10-fold cross-validation or a 66% percentage split, I get satisfactory results (around 90%), but when I try to test the same file via Weka's API, every test returns basically a 0% match (every row returns false).
Here's the output from Weka's GUI:
=== Evaluation on test split ===
=== Summary ===
Correctly Classified Instances 78 91.7647 %
Incorrectly Classified Instances 7 8.2353 %
Kappa statistic 0.8081
Mean absolute error 0.0817
Root mean squared error 0.24
Relative absolute error 17.742 %
Root relative squared error 51.0603 %
Total Number of Instances 85
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.885 0.068 0.852 0.885 0.868 0.958 1
0.932 0.115 0.948 0.932 0.94 0.958 0
Weighted Avg. 0.918 0.101 0.919 0.918 0.918 0.958
=== Confusion Matrix ===
a b <-- classified as
23 3 | a = 1
4 55 | b = 0
And here's the code I'm using in Java (actually it's on .NET using IKVM):
var classifier = new weka.classifiers.functions.MultilayerPerceptron();
classifier.setOptions(weka.core.Utils.splitOptions("-L 0.7 -M 0.3 -N 75 -V 0 -S 0 -E 20 -H a")); //these are the same options (the default options) when the test is run under weka gui
string trainingFile = Properties.Settings.Default.WekaTrainingFile; //the path to the same file I use to test on weka explorer
weka.core.Instances data = null;
data = new weka.core.Instances(new java.io.BufferedReader(new java.io.FileReader(trainingFile))); //loads the file
data.setClassIndex(data.numAttributes() - 1); //set the last column as the class attribute
classifier.buildClassifier(data);
var tmp = System.IO.Path.GetTempFileName(); //creates a temp file to create an arff file with a single row with the instance I want to test taken from the arff file loaded previously
using (var f = System.IO.File.CreateText(tmp))
{
//long code to read data from db and regenerate the line, simulating data coming from the source I really want to test
}
var dataToTest = new weka.core.Instances(new java.io.BufferedReader(new java.io.FileReader(tmp)));
dataToTest.setClassIndex(dataToTest.numAttributes() - 1);
double prediction = 0;
for (int i = 0; i < dataToTest.numInstances(); i++)
{
weka.core.Instance curr = dataToTest.instance(i);
weka.core.Instance inst = new weka.core.Instance(data.numAttributes());
inst.setDataset(data);
for (int n = 0; n < data.numAttributes(); n++)
{
weka.core.Attribute att = dataToTest.attribute(data.attribute(n).name());
if (att != null)
{
if (att.isNominal())
{
if ((data.attribute(n).numValues() > 0) && (att.numValues() > 0))
{
String label = curr.stringValue(att);
int index = data.attribute(n).indexOfValue(label);
if (index != -1)
inst.setValue(n, index);
}
}
else if (att.isNumeric())
{
inst.setValue(n, curr.value(att));
}
else
{
throw new InvalidOperationException("Unhandled attribute type!");
}
}
}
prediction += classifier.classifyInstance(inst);
}
//prediction is always 0 here, my ARFF file has two classes: 0 and 1, 92 zeroes and 159 ones
It's funny because if I change the classifier to, let's say, NaiveBayes, the results match the test made via Weka's GUI.
You are using a deprecated way of reading in ARFF files. See this documentation. Try this instead:
import weka.core.converters.ConverterUtils.DataSource;
...
DataSource source = new DataSource("/some/where/data.arff");
Instances data = source.getDataSet();
Note that the documentation also shows how to connect to a database directly and bypass the creation of temporary ARFF files. You could also read from the database and manually create instances to populate the Instances object.
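As a rough illustration of loading directly from a database, here is a sketch using Weka's InstanceQuery; the JDBC URL, credentials, table, and columns are placeholders, and it assumes a suitable JDBC driver is configured for Weka:
import weka.core.Instances;
import weka.experiment.InstanceQuery;

public class DbLoad {
    public static void main(String[] args) throws Exception {
        // Pull instances straight from the database, no temporary ARFF file
        InstanceQuery query = new InstanceQuery();
        query.setDatabaseURL("jdbc:mysql://localhost/mydb");   // placeholder URL
        query.setUsername("user");
        query.setPassword("pass");
        query.setQuery("SELECT feature1, feature2, label FROM samples");
        Instances data = query.retrieveInstances();
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println(data.toSummaryString());
    }
}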
Finally, if simply changing the classifier type at the top of the code to NaiveBayes solved the problem, then check the options in your weka gui for MultilayerPerceptron, to see if they are different from the defaults (different settings can cause the same classifier type to produce different results).
Update: it looks like you're using different test data in your code than in your weka GUI (from a database vs a fold of the original training file); it might also be the case that the particular data in your database actually does look like class 0 to the MLP classifier. To verify whether this is the case, you can use the weka interface to split your training arff into train/test sets, and then repeat the original experiment in your code. If the results are the same as the gui, there's a problem with your data. If the results are different, then we need to look more closely at the code. The function you would call is this (from the Doc):
public Instances trainCV(int numFolds, int numFold)
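For example, a minimal sketch of that verification step: build a single train/test split with trainCV/testCV and score it with Weka's Evaluation class. The file path is a placeholder, and three folds are used only to approximate a 66%/33% split:
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SplitCheck {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file the non-deprecated way
        Instances data = new DataSource("/some/where/data.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Shuffle, then take fold 0 as the test set and the rest as training
        int folds = 3;                      // roughly a 66% / 33% split
        data.randomize(new Random(1));
        Instances train = data.trainCV(folds, 0);
        Instances test = data.testCV(folds, 0);

        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.buildClassifier(train);

        // Evaluate on the held-out fold and print the summary
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(mlp, test);
        System.out.println(eval.toSummaryString());
    }
}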
I had the same problem.
Weka gave me different results in the Explorer compared to a cross-validation in Java.
Something that helped:
Instances dataSet = ...;
dataSet.stratify(numOfFolds); // use this before splitting the dataset into train and test sets!
