I have a little problem joining two Datasets in Spark. I have this:
SparkConf conf = new SparkConf()
.setAppName("MyFunnyApp")
.setMaster("local[*]");
SparkSession spark = SparkSession
.builder()
.config(conf)
.config("spark.debug.maxToStringFields", 150)
.getOrCreate();
//...
//Do stuff
//...
Encoder<MyOwnObject1> encoderObject1 = Encoders.bean(MyOwnObject1.class);
Encoder<MyOwnObject2> encoderObject2 = Encoders.bean(MyOwnObject2.class);
Dataset<MyOwnObject1> object1DS = spark.read()
.option("header","true")
.option("delimiter",";")
.option("inferSchema","true")
.csv(pathToFile1)
.as(encoderObject1);
Dataset<MyOwnObject2> object2DS = spark.read()
.option("header","true")
.option("delimiter",";")
.option("inferSchema","true")
.csv(pathToFile2)
.as(encoderObject2);
I can print the schema and show it correctly.
//Here start the problem
Dataset<Tuple2<MyOwnObject1, MyOwnObject2>> joinObjectDS =
object1DS.join(object2DS, object1DS.col("column01")
.equalTo(object2DS.col("column01")))
.as(Encoders.tuple(encoderObject1, encoderObject2));
The last line can't make the join and gives me this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Try to map struct<"LIST WITH ALL VARS FROM TWO OBJECT"> to Tuple2, but failed as the number of fields does not line up.;
That's true, because the Tuple2 doesn't have all the vars of the flattened join result...
Then I tried this:
Dataset<Tuple2<MyOwnObject1, MyOwnObject2>> joinObjectDS = object1DS
.joinWith(object2DS, object1DS
.col("column01")
.equalTo(object2DS.col("column01")));
And it works fine! But I need a new Dataset without the tuple. I have an object3 that has some vars from object1 and object2, and then I have this problem:
Encoder<MyOwnObject3> encoderObject3 = Encoders.bean(MyOwnObject3.class);
Dataset<MyOwnObject3> object3DS = joinObjectDS.map(tupleObject1Object2 -> {
MyOwnObject1 myOwnObject1 = tupleObject1Object2._1();
MyOwnObject2 myOwnObject2 = tupleObject1Object2._2();
MyOwnObject3 myOwnObject3 = new MyOwnObject3(); //Sets all vars with start values
//...
//Sets data from object 1 and 2 to 3.
//...
return myOwnObject3;
}, encoderObject3);
Fails!... here is the error:
17/05/10 12:17:43 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 593, Column 72: A method named "toString" is not declared in any enclosing class nor any supertype, nor through a static import
and over thousands error lines...
What can I do? I have already tried:
making my objects only with String, int (or Integer) and double (or Double) fields (nothing more)
using different encoders like kryo or javaSerialization
using JavaRDD (works, but very slowly) and using DataFrames with Rows (works, but I would need to change many objects)
All my Java objects are serializable.
I have tried Spark 2.1.0 and 2.1.1; right now I have 2.1.1 in my pom.xml.
I want to use Datasets, to get the speed of DataFrames and the object syntax of JavaRDD...
Help?
Thanks
Finally I found a solution.
I had a problem with the inferSchema option when my code was creating the Dataset. I have a String column that inferSchema turns into an Integer column because all its values are "numeric", but I need to use them as Strings (like "0001", "0002"...). I needed to provide a schema, but I have many vars, so I wrote this using all my classes:
List<StructField> fieldsObject1 = new ArrayList<>();
for (Field field : MyOwnObject1.class.getDeclaredFields()) {
fieldsObject1.add(DataTypes.createStructField(
field.getName(),
CatalystSqlParser.parseDataType(field.getType().getSimpleName()),
true)
);
}
StructType schemaObject1 = DataTypes.createStructType(fieldsObject1);
Dataset<MyOwnObject1> object1DS = spark.read()
.option("header","true")
.option("delimiter",";")
.schema(schemaObject1)
.csv(pathToFile1)
.as(encoderObject1);
Works fine.
The "best" solution would be this:
Dataset<MyOwnObject1> object1DS = spark.read()
.option("header","true")
.option("delimiter",";")
.schema(encoderObject1.schema())
.csv(pathToFile1)
.as(encoderObject1);
but encoderObject1.schema() returns a schema with the vars in alphabetical order, not in the original order, so this option fails when I read the CSV. Maybe Encoders should return a schema with the vars in the original order instead of alphabetical order.
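A possible workaround (just a sketch, assuming the CSV header names match the bean field names and that a csvHeader array holds them in file order) would be to rebuild the encoder schema in the column order of the file:
//Sketch: put the encoder's alphabetically ordered fields back into CSV column order
//before passing the schema to the reader. csvHeader is a hypothetical array with the
//column names in the order they appear in the file.
String[] csvHeader = {"column01", "column02" /* , ... */};
StructType encoderSchema = encoderObject1.schema();
List<StructField> orderedFields = new ArrayList<>();
for (String name : csvHeader) {
    orderedFields.add(encoderSchema.apply(name));
}
StructType csvSchemaObject1 = DataTypes.createStructType(orderedFields);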
Using Apache Spark, we need to process a bunch of files and track which files have certain keywords in them.
I'm attempting to create a dataframe with two columns:
line from a file
file that contained the line
Here's what I have so far:
String[] sourceLogPaths = Files.walk(Paths.get(getLogSourceDirectory()))
        .filter(Files::isRegularFile)
        .map(path -> path.toString())
        .collect(Collectors.toList())
        .toArray(new String[0]);
SparkSession spark = SparkSession.builder().appName("LogSearcher").master("local").getOrCreate();
// sourceLogPaths is an array of different file names
JavaRDD<String> textFile = spark.read().textFile(sourceLogPaths).javaRDD();
JavaRDD<Row> rowRDD = textFile.map(RowFactory::create);
// How to add a field that shows the associated filename for each row?
List<StructField> fields = Arrays.asList(DataTypes.createStructField("line", DataTypes.StringType, true));
StructType schema = DataTypes.createStructType(fields);
SQLContext sqlContext = spark.sqlContext();
Dataset<Row> df = sqlContext.createDataFrame(rowRDD, schema);
df.show();
Which prints out:
+--------------------+
| line|
+--------------------+
|1331901000.000000...|
|1331901000.000000...|
|1331901000.000000...|
...
Can anyone help me understand how to get the original file's name added as a second column?
Searching has led to advice like this, but I'm uncertain how to translate it to this scenario.
Thanks in advance, I'm new to Spark and any advice is appreciated.
I am not a Java guy, but in Python using Spark you can provide the whole folder or a pattern of files and use something like the code below. If it is a local file system, use file: at the front. input_file_name will add the filename to the data.
df = spark.read.text('/datafolder/foldername/*')
df = df.withColumn("filename", input_file_name())
Thanks to @Rafa, I think this is the answer:
String[] sourceLogPaths = Files.walk(Paths.get(getLogSourceDirectory()))
        .filter(Files::isRegularFile)
        .map(path -> path.toString())
        .collect(Collectors.toList())
        .toArray(new String[0]);
SparkSession spark = SparkSession.builder().appName("LogSearcher").master("local").getOrCreate();
// sourceLogPaths is an array of different file names
JavaRDD<String> textFile = spark.read().textFile(sourceLogPaths).javaRDD();
JavaRDD<Row> rowRDD = textFile.map(RowFactory::create);
// How to add a field that shows the associated filename for each row?
List<StructField> fields = Arrays.asList(DataTypes.createStructField("line", DataTypes.StringType, true));
StructType schema = DataTypes.createStructType(fields);
SQLContext sqlContext = spark.sqlContext();
// Below line has the additional column added
// input_file_name() comes from a static import of org.apache.spark.sql.functions.input_file_name
Dataset<Row> df = sqlContext.createDataFrame(rowRDD, schema).withColumn("file_name", input_file_name());
df.show();
Which prints:
+--------------------+--------------------+
| line| file_name|
+--------------------+--------------------+
|1331901000.000000...|file:///Users/acu...|
|1331901000.000000...|file:///Users/acu...|
|1331901000.000000...|file:///Users/acu...|
|1331901000.010000...|file:///Users/acu...|
During deployment of my Java Spark program to a cluster, it runs into an ArrayIndexOutOfBoundsException: 11.
From my understanding there is nothing syntactically wrong with how my code is written, and an indexing error doesn't clarify what's going wrong. My program is simply supposed to take 12 columns, separated by spaces, then take ONE column (the command column) and do an aggregation to see how many times each command occurs, i.e.
column1  column2  command  column3  etc.
dggd     gdegdg   cmd#1    533      etc.
dggd     gdegdg   cmd#1    533      etc.
dggd     gdegdg   cmd#2    534      etc.
dggd     gdegdg   cmd#5    5353     etc.
dggd     gdegdg   cmd#2    533      etc.
will look something like:
command    count
command#1  5
command#2  15
command#5  514
I'm running
spark 2.1
HDP 2.6
Here's the code I have so far:
public class Main {
public static void main(String[] args) {
//functions fu = new functions();
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("appName").setMaster("local[*]"));
SparkSession spark = SparkSession
.builder()
.appName("Log File Reader")
.getOrCreate();
JavaRDD<String> logsRDD = spark.sparkContext()
.textFile(args[0],1)
.toJavaRDD();
String schemaString = "date time command service responseCode bytes ip dash1 dash2 dash3 num dash4";
List<StructField> fields = new ArrayList<>();
String[] fieldName = schemaString.split(" ");
for (String field : fieldName){
fields.add(DataTypes.createStructField(field, DataTypes.StringType, true));
}
StructType schema = DataTypes.createStructType(fields);
JavaRDD<Row> rowRDD = logsRDD.map((Function<String, Row>) record -> {
String[] attributes = record.split(" ");
return RowFactory.create(attributes[0],attributes[1],attributes[2],attributes[3],attributes[4],attributes[5],
attributes[6],attributes[7],attributes[8],attributes[9],
attributes[10],attributes[11]);
});
Dataset<Row> dataDF = spark.createDataFrame(rowRDD, schema);
dataDF.createOrReplaceTempView("data");
//shows the top 20 rows from the dataframe including all columns
Dataset<Row> showDF = spark.sql("select * from data");
//shows the top 20 rows from the same dataframe, but only displays
//the command column
Dataset<Row> commandDF = spark.sql("select command from data");
showDF.show();
commandDF.show();
}
}
This code works fine; however, when I try to compute the final result using code like the following, it runs into the indexing error.
logsDF.groupBy(col("command")).count().show();
Dataset<Row> ans = spark.sql("select command, count(*) from logs group by command").show();
And finally the spark-submit code
spark-submit --class com.ect.java.Main /path/application.jar hdfs:///path/textfile.txt
In my mind I'm stuck on the idea that it's an environment issue, but I can't find any documentation relating to a problem like this.
The problem is not with the aggregate function. The problem is with your log file. You are getting the error:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 11
which means one of the lines in your log file has only 11 entries instead of the 12 required by your program. You can verify this by creating a sample log.txt file and keeping just two rows in that file.
Your group-by code should be like below (it looks like a typo): in your sample application you have dataDF, not logsDF, and the temp table name is data, not logs.
dataDF.groupBy(col("command")).count().show();
Dataset<Row> ans = spark.sql("select command, count(*) from data group by command");
ans.show();
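If some lines in the log really can have fewer than 12 fields, a defensive variant (just a sketch, not part of the fix above) is to drop short rows before building the Rows:
//Sketch: split with limit -1 so trailing empty fields are kept, drop any row that
//does not have at least the 12 expected fields, then build Rows from exactly 12 values.
JavaRDD<Row> rowRDD = logsRDD
    .map(record -> record.split(" ", -1))
    .filter(attributes -> attributes.length >= 12)
    .map(attributes -> RowFactory.create((Object[]) java.util.Arrays.copyOf(attributes, 12)));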
So I'm new to Apache Spark and I have a file that looks like this:
Name Size Records
File1 1,000 104,370
File2 950 91,780
File3 1,500 109,123
File4 2,170 113,888
File5 2,000 111,974
File6 1,820 110,666
File7 1,200 106,771
File8 1,500 108,991
File9 1,000 104,007
File10 1,300 107,037
File11 1,900 111,109
File12 1,430 108,051
File13 1,780 110,006
File14 2,010 114,449
File15 2,017 114,889
This is my sample/test data. I'm working on an anomaly detection program: I have to test other files with the same format but different values and detect which ones have anomalies in the size and records values (for example, if the size/records of another file differ a lot from the standard ones, or if size and records are not proportional to each other). I decided to start trying different ML algorithms, beginning with the k-means approach. I tried passing this file to the following line:
KMeansModel model = kmeans.fit(file)
file is already parsed into a Dataset variable. However, I get an error and I'm pretty sure it has to do with the structure/schema of the file. Is there a way to work with structured/labeled/organized data when trying to fit it to a model?
I get the following error: Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
And this is the code:
public class practice {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("Anomaly Detection").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
SparkSession spark = SparkSession
.builder()
.appName("Anomaly Detection")
.getOrCreate();
String day1 = "C:\\Users\\ZK0GJXO\\Documents\\day1.txt";
Dataset<Row> df = spark.read().
option("header", "true").
option("delimiter", "\t").
csv(day1);
df.show();
KMeans kmeans = new KMeans().setK(2).setSeed(1L);
KMeansModel model = kmeans.fit(df);
}
}
Thanks
By default, all Spark ML models train on a column called "features". One can specify a different input column name via the setFeaturesCol method: http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/clustering/KMeans.html#setFeaturesCol(java.lang.String)
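For example (a quick sketch; "myFeatures" is just a hypothetical column that must already hold a Vector):
// Sketch: point KMeans at a differently named features column instead of the default "features".
KMeans kmeans = new KMeans()
        .setK(2)
        .setSeed(1L)
        .setFeaturesCol("myFeatures"); // hypothetical Vector column
KMeansModel model = kmeans.fit(df);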
Update:
One can combine multiple columns into a single feature vector using VectorAssembler:
VectorAssembler assembler = new VectorAssembler()
.setInputCols(new String[]{"size", "records"})
.setOutputCol("features");
Dataset<Row> vectorized_df = assembler.transform(df);
KMeans kmeans = new KMeans().setK(2).setSeed(1L);
KMeansModel model = kmeans.fit(vectorized_df);
One can further streamline and chain these feature transformations with the pipeline API https://spark.apache.org/docs/latest/ml-pipeline.html#example-pipeline
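As a rough sketch (reusing the assembler and kmeans objects defined above), the two stages could be chained like this:
// Sketch: wrap the VectorAssembler and KMeans stages in a single Pipeline, so one
// fit() call runs the feature assembly and the clustering together.
Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[]{assembler, kmeans});
PipelineModel pipelineModel = pipeline.fit(df);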
I have the code below as my Spark driver; when I execute the program it works properly, saving the required data as a Parquet file.
String indexFile = "index.txt";
JavaRDD<String> indexData = sc.textFile(indexFile).cache();
JavaRDD<String> jsonStringRDD = indexData.map(new Function<String, String>() {
@Override
public String call(String patientId) throws Exception {
return "json array as string";
}
});
//1. Read json string array into a Dataframe (execution 1)
DataFrame dataSchemaDF = sqlContext.read().json(jsonStringRDD);
//2. Save dataframe as parquet file (execution 2)
dataSchemaDF.write().parquet("md.parquet");
But I observed that my mapper function on the RDD indexData is getting executed twice:
first, when I read jsonStringRDD as a DataFrame using SQLContext;
second, when I write dataSchemaDF to the Parquet file.
Can you guide me on how to avoid this repeated execution? Is there a better way of converting a JSON string into a DataFrame?
I believe that the reason is the lack of a schema for the JSON reader. When you execute:
sqlContext.read().json(jsonStringRDD);
Spark has to infer the schema for the newly created DataFrame. To do that it has to scan the input RDD, and this step is performed eagerly.
If you want to avoid that, you have to create a StructType which describes the shape of the JSON documents:
StructType schema;
...
and use it when you create DataFrame:
DataFrame dataSchemaDF = sqlContext.read().schema(schema).json(jsonStringRDD);
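For completeness, a hand-built schema might look like this (the field names are placeholders; use the actual fields of your JSON documents):
// Sketch: describe the JSON shape up front so the reader does not have to scan the
// RDD to infer it. Field names below are hypothetical.
StructType schema = DataTypes.createStructType(new StructField[]{
        DataTypes.createStructField("patientId", DataTypes.StringType, true),
        DataTypes.createStructField("value", DataTypes.DoubleType, true)
});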
I am using Spark 1.5.0.
I have a set of files on S3 containing JSON data in sequence file format, around 60 GB in total. I have to fire around 40 queries on this dataset and store the results back to S3.
All queries are select statements with a condition on the same field, e.g. select a,b,c from t where event_type='alpha', select x,y,z from t where event_type='beta', etc.
I am using an AWS EMR 5-node cluster with 2 core nodes and 2 task nodes.
There could be some fields missing in the input, e.g. a could be missing, so the first query, which selects a, would fail. To avoid this I have defined a schema for each event_type. So, for event_type alpha, the schema would look like {"a": "", "b": "", "c": "", "event_type": ""}.
Based on the schemas defined for each event, I'm creating a DataFrame from the input RDD for each event, with the corresponding schema.
I'm using the following code:
JavaPairRDD<LongWritable,BytesWritable> inputRDD = jsc.sequenceFile(bucket, LongWritable.class, BytesWritable.class);
JavaRDD<String> events = inputRDD.map(
new Function<Tuple2<LongWritable,BytesWritable>, String>() {
public String call(Tuple2<LongWritable,BytesWritable> tuple) throws JSONException, UnsupportedEncodingException {
String valueAsString = new String(tuple._2.getBytes(), "UTF-8");
JSONObject data = new JSONObject(valueAsString);
JSONObject payload = new JSONObject(data.getString("payload"));
return payload.toString();
}
}
);
events.cache();
for (String event_type: events_list) {
String query = //read query from another s3 file event_type.query
String jsonSchemaString = //read schema from another s3 file event_type.json
List<String> jsonSchema = Arrays.asList(jsonSchemaString);
JavaRDD<String> jsonSchemaRDD = jsc.parallelize(jsonSchema);
DataFrame df_schema = sqlContext.read().option("header", "true").json(jsonSchemaRDD);
StructType schema = df_schema.schema();
DataFrame df_query = sqlContext.read().schema(schema).option("header", "true").json(events);
df_query.registerTempTable(tableName);
DataFrame df_results = sqlContext.sql(query);
df_results.write().format("com.databricks.spark.csv").save("s3n://some_location");
}
This code is very inefficient; it takes around 6-8 hours to run. How can I optimize it?
Should I try using HiveContext?
I think the current code is taking multiple passes over the data, though I'm not sure, since I have cached the RDD. How can I do it in a single pass, if that is the case?
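One direction to explore (just a sketch, assuming all event payloads can be described by a single inferred schema) would be to parse the JSON strings once into a cached DataFrame and run every per-event query against that one table:
// Sketch: a single schema-inference pass over the parsed payload strings, then the
// cached DataFrame is reused for each query instead of re-reading the JSON per event.
DataFrame allEvents = sqlContext.read().json(events);
allEvents.cache();
allEvents.registerTempTable("t");
for (String event_type : events_list) {
    String query = readQueryFromS3(event_type); // hypothetical helper, as in the original loop
    sqlContext.sql(query)
        .write()
        .format("com.databricks.spark.csv")
        .save("s3n://some_location/" + event_type);
}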