I am new to Spark and working my way through the Java and Scala APIs, and I wanted to put together two equivalent examples to compare the languages in terms of conciseness and readability.
Here is my Scala version:
import java.io.StringReader
import au.com.bytecode.opencsv.CSVReader
import org.apache.spark.{SparkConf, SparkContext}
object LoadCSVScalaExample {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("MyLoadCSVScalaExampleApplication").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val input = sc.textFile("D:\\MOCK_DATA_spark.csv")
    val result = input.map { line =>
      val reader = new CSVReader(new StringReader(line))
      reader.readNext()
    }

    print("This is the total count " + result.count())
  }
}
Whereas this is the Java counterpart:
import au.com.bytecode.opencsv.CSVReader;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import java.io.StringReader;
public class LoadCSVJavaExample implements Function<String, String[]> {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("MyLoadCSVJavaExampleApp").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> csvFile = sc.textFile("D:\\MOCK_DATA_spark.csv");
        JavaRDD<String[]> csvData = csvFile.map(new LoadCSVJavaExample());

        System.out.println("This prints the total count " + csvData.count());
    }

    public String[] call(String line) throws Exception {
        CSVReader reader = new CSVReader(new StringReader(line));
        return reader.readNext();
    }
}
I am not sure, however, whether the Java example is actually correct. Can I pick your brain on it? I know I could use the Databricks spark-csv library instead, but I was wondering whether the current example is correct and how it could be further improved.
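One simplification I have been considering, assuming Java 8 is available, is replacing the named Function class with a lambda in the map call. A rough sketch of that variant (the class and app names here are just placeholders of mine):

import au.com.bytecode.opencsv.CSVReader;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.StringReader;

public class LoadCSVJavaLambdaExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("MyLoadCSVJavaLambdaApp").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> csvFile = sc.textFile("D:\\MOCK_DATA_spark.csv");
        // Parse each line with opencsv inside a lambda instead of a named Function class
        JavaRDD<String[]> csvData = csvFile.map(line -> new CSVReader(new StringReader(line)).readNext());

        System.out.println("This prints the total count " + csvData.count());
        sc.stop();
    }
}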
Thank you for your help,
I.
Related
Does anyone know of a repo that shows what simple HelloWorld Java or Scala code would look like for building a jar that can be executed with the AWS SageMaker SparkJarProcessing class?
Readthedocs (https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_processing/spark_distributed_data_processing/sagemaker-spark-processing.html) mentions:
"In the next example, you’ll take a Spark application jar (located in ./code/spark-test-app.jar)..."
My question is: what does the source code for this jar (spark-test-app.jar) look like?
I tried building a simple Java project jar, with src > com.test > HW.java:
public class HW {
    public static void main(String[] args) {
        System.out.printf("hello world!");
    }
}
and running it inside a SageMaker notebook (conda_python3 kernel) using:
from sagemaker.spark.processing import SparkJarProcessor
from sagemaker import get_execution_role
role = get_execution_role()
print(role)
spark_processor = SparkJarProcessor(
    base_job_name="sm-spark-java",
    framework_version="3.1",
    role=role,
    instance_count=2,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=1200,
)

spark_processor.run(
    submit_app="./SparkJarProcessing-1.0-SNAPSHOT.jar",
    submit_class="com.test.HW",
    arguments=["--input", "abc"],
    logs=True,
)
But I end up getting an error:
Could not execute HW class.
Any sample source code for spark-test-app.jar would be highly appreciated!
To answer your question, the source code of that class looks like:
package com.amazonaws.sagemaker.spark.test;

import java.util.List;

import org.apache.commons.cli.BasicParser;
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.commons.lang3.StringUtils;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

public class HelloJavaSparkApp {

    public static void main(final String[] args) {
        System.out.println("Hello World, this is Java-Spark!");
        final CommandLine parsedArgs = parseArgs(args);
        final String inputPath = parsedArgs.getOptionValue("input");
        final String outputPath = parsedArgs.getOptionValue("output");

        final SparkSession spark = SparkSession.builder().appName("Hello Spark App").getOrCreate();
        System.out.println("Got a Spark session with version: " + spark.version());

        System.out.println("Reading input from: " + inputPath);
        final Dataset<Row> salesDF = spark.read().json(inputPath);
        salesDF.printSchema();
        salesDF.createOrReplaceTempView("sales");

        final Dataset<Row> topDF = spark.sql("SELECT date, sale FROM sales WHERE sale > 750 SORT BY sale DESC");
        topDF.show();

        final Dataset<Row> avgDF = salesDF.groupBy("date").avg().orderBy("date");
        final List<Row> avgRows = avgDF.collectAsList();
        System.out.println("Collected average sales: " + StringUtils.join(avgRows, ", "));

        spark.sqlContext().udf().register("double", (UDF1<Long, Long>) n -> n + n, DataTypes.LongType);
        final Dataset<Row> saleDoubleDF = salesDF
                .selectExpr("date", "sale", "double(sale) as sale_double")
                .orderBy("date", "sale");
        saleDoubleDF.show();

        System.out.println("Writing output to: " + outputPath);
        saleDoubleDF.coalesce(1).write().json(outputPath);

        spark.stop();
    }

    private static CommandLine parseArgs(final String[] args) {
        final Options options = new Options();
        final CommandLineParser parser = new BasicParser();

        final Option input = new Option("i", "input", true, "input path");
        input.setRequired(true);
        options.addOption(input);

        final Option output = new Option("o", "output", true, "output path");
        output.setRequired(true);
        options.addOption(output);

        try {
            return parser.parse(options, args);
        } catch (ParseException e) {
            new HelpFormatter().printHelp("HelloScalaSparkApp --input /opt/ml/input/foo --output /opt/ml/output/bar", options);
            throw new RuntimeException(e);
        }
    }
}
At the same time, I have created a simple example that shows how to run a hello world app here. Please note that I ran that example on Amazon SageMaker Studio Notebooks, using the Data Science 1.0 kernel.
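In case the link is not accessible, a minimal sketch of such a hello-world Spark app could look like this (the package and class names are my own placeholders, not the ones inside spark-test-app.jar):

package com.example; // hypothetical package, chosen for this sketch only

import org.apache.spark.sql.SparkSession;

public class HelloSparkApp {
    public static void main(String[] args) {
        // Create (or reuse) the session provided by the Spark processing container
        SparkSession spark = SparkSession.builder().appName("HelloSparkApp").getOrCreate();
        System.out.println("Hello World from Spark " + spark.version());
        spark.stop();
    }
}

Once packaged into a jar (typically with the Spark dependencies marked as provided, since the processing container already ships Spark), you would pass the jar as submit_app and set submit_class to the fully qualified class name, e.g. com.example.HelloSparkApp in this sketch.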
Hope this helps.
I am trying to run a Spark program in Java using Eclipse. It runs if I simply print something to the console, but I am not able to read any file using the textFile function.
I have read somewhere that reading a file can only be done through HDFS, but I am not able to do that on my local system.
Please let me know how to access/read a file; if HDFS is required, how do I install HDFS on my local system so that I can read a text file?
Here is the code I am testing with; the program runs, but it is unable to read the file, saying the input path does not exist.
package spark;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.api.java.function.Function;

public class TestSpark {

    public static void main(String args[]) {
        String[] jars = {"D:\\customJars\\spark.jar"};
        System.setProperty("hadoop.home.dir", "D:\\hadoop-common-2.2.0-bin-master");
        SparkConf sparkConf = new SparkConf().setAppName("spark.TestSpark")
                .setMaster("spark://10.1.50.165:7077")
                .setJars(jars);
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        SQLContext sqlcon = new SQLContext(jsc);

        String inputFileName = "./forecaster.txt";
        JavaRDD<String> logData = jsc.textFile(inputFileName);

        long numAs = logData.filter(new Function<String, Boolean>() {
            @Override
            public Boolean call(String s) throws Exception {
                return s.contains("a");
            }
        }).count();

        long numBs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("b"); }
        }).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
        System.out.println("sadasdasdf");

        jsc.stop();
        jsc.close();
    }
}
My file structure:
Update: your file does not have a .txt extension, but you are using one in your application. You should use String inputFileName = "forecaster";
If the file is in the same folder as the Java class TestSpark ($APP_HOME):
String inputFileName = "forecaster.txt";
If the file is in a Data directory under your Spark project:
String inputFileName = "Data\\forecaster.txt";
Or use a fully qualified path; the log from my testing below shows:
16/08/03 08:25:25 INFO HadoopRDD: Input split: file:/C:/Users/user123/worksapce/spark-java/forecaster.txt
String inputFileName = "file:/C:/Users/user123/worksapce/spark-java/forecaster.txt" ;
For example, I copied your code and ran it in my local environment.
This is how my project setup looks, and I run it with:
String inputFileName = "forecaster.txt" ;
Test File:
this is test file
aaa
bbb
ddddaaee
ewwww
aaaa
a
a
aaaa
bb
Code that I used:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class TestSpark {

    public static void main(String args[]) {
        // String[] jars = {"D:\\customJars\\spark.jar"};
        // System.setProperty("hadoop.home.dir", "D:\\hadoop-common-2.2.0-bin-master");
        SparkConf sparkConf = new SparkConf().setAppName("spark.TestSpark").setMaster("local");
                // .setMaster("spark://10.1.50.165:7077")
                // .setJars(jars);
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        // SQLContext sqlcon = new SQLContext(jsc);

        String inputFileName = "forecaster.txt";
        JavaRDD<String> logData = jsc.textFile(inputFileName);

        long numAs = logData.filter(new Function<String, Boolean>() {
            @Override
            public Boolean call(String s) throws Exception {
                return s.contains("a");
            }
        }).count();

        long numBs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("b"); }
        }).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
        System.out.println("sadasdasdf");

        jsc.stop();
        jsc.close();
    }
}
Spark needs a scheme and a proper path in order to understand how to read the file. So if you are reading from HDFS, you should use:
jsc.textFile("hdfs://host/path/to/hdfs/file/input.txt");
If you are reading a local file (local to each worker node, not just the machine the driver is running on), you should use:
jsc.textFile("file:///path/to/local/file/input.txt");
For reading a Hadoop Archive File (HAR), you should use:
jsc.textFile("har://archive/path/to/hdfs/file/input.txt");
And so on.
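Putting one of these together, a minimal local-mode sketch (the absolute path below is hypothetical) would be:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SchemeExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SchemeExample").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // Explicit file:// scheme so Spark does not try to resolve the path against HDFS
        JavaRDD<String> lines = jsc.textFile("file:///C:/data/input.txt");
        System.out.println("Line count: " + lines.count());

        jsc.stop();
    }
}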
I am new to Spark and trying to extract the lines that contain "Subject:" and save them in an ArrayList. I am not facing any error, but the ArrayList is empty. Can you please guide me where I am going wrong, or suggest the best way to do this?
import java.util.*;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

public final class extractSubject {

    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setMaster("local[1]").setAppName("JavaBookExample");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        JavaRDD<String> sample = sc.textFile("/Users/Desktop/sample.txt");

        final ArrayList<String> list = new ArrayList<>();
        sample.foreach(new VoidFunction<String>() {
            public void call(String line) {
                if (line.contains("Subject:")) {
                    System.out.println(line);
                    list.add(line);
                }
            }
        });

        System.out.println(list);
        sc.stop();
    }
}
Please keep in mind that Spark applications run distributed and in parallel. Therefore you cannot modify variables outside of functions that are executed by Spark.
Instead you need to return a result from these functions. In your case you need flatMap (instead of foreach, which has no result), which concatenates the collections returned by your function.
If a line matches, a list containing the matching line is returned; otherwise you return an empty list.
To print the data in the main function, you first have to gather the possibly distributed data on your master node by calling collect().
Here is an example:
import java.util.*;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

public final class extractSubject {

    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setMaster("local[1]").setAppName("JavaBookExample");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // JavaRDD<String> sample = sc.textFile("/Users/Desktop/sample.txt");
        JavaRDD<String> sample = sc.parallelize(Arrays.asList("Subject: first",
                                                              "nothing here",
                                                              "Subject: second",
                                                              "dummy"));

        JavaRDD<String> subjectLinesRdd = sample.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String line) {
                if (line.contains("Subject:")) {
                    return Collections.singletonList(line); // line matches → return list with the line as its only element
                } else {
                    return Collections.emptyList();         // ignore line → return empty list
                }
            }
        });

        List<String> subjectLines = subjectLinesRdd.collect(); // collect values from the Spark workers
        System.out.println(subjectLines);                      // → "[Subject: first, Subject: second]"

        sc.stop();
    }
}
I am doing a simple word count example in Apache Spark in Java, with reference to examples from the internet, and I am getting this error:
Caused by: java.net.UnknownHostException: my.txt
You can see my code below for reference:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class MyCount {

    public static void main(String[] args) {
        // TODO Auto-generated method stub
        String file = "hdfs://my.txt";
        JavaSparkContext sc = new JavaSparkContext("local", "Simple App");
        JavaRDD<String> lines = sc.textFile(file);
        long nums = lines.count();
        System.out.println(nums);
    }
}
Can you try
String file = "hdfs://localhost/my.txt";
PS: make sure you have the file my.txt in HDFS.
If you don't have that file in HDFS, use the command below to put it into HDFS from a local directory:
hadoop fs -copyFromLocal /home/training/my.txt hadoop/
Old question, but an answer was never accepted. The mistake, as I read it, is mixing up Spark's "local" master concept with "localhost".
Using this constructor: JavaSparkContext(java.lang.String master, java.lang.String appName), you would want to use:
JavaSparkContext sc = new JavaSparkContext("localhost", "Simple App");
but the question was using "local". Further, the HDFS filename didn't specify a hostname: "hdfs://SomeNameNode:9000/foo/bar/" or
"hdfs://host:port/absolute-path".
As of 1.6.2, the Javadoc for JavaSparkContext does not show any constructor that lets you specify the cluster type directly:
http://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/api/java/JavaSparkContext.html
The preferred constructor for JavaSparkContext takes a SparkConf object. To make things more readable, build a SparkConf object and then pass it to JavaSparkContext; here's an example that sets the app name, specifies the Kryo serializer, and sets the master:
SparkConf sparkConf = new SparkConf().setAppName("Threshold")
        //.setMaster("local[4]");
        .setMaster(getMasterString(masterName))
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .registerKryoClasses(kryoClassArray);

// create the JavaSparkContext now:
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
NOTE: the alternate .setMaster("local[4]"); would use local mode, which the OP may have been trying.
I have a more extended answer here that addresses using hostnames vs. IP addresses, and much more about setting up your SparkConf.
You can try this simple word count program:
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

public class First {

    public static void main(String[] args) {
        SparkConf sf = new SparkConf().setMaster("local[3]").setAppName("parth");
        JavaSparkContext sc = new JavaSparkContext(sf);

        JavaRDD<String> textFile = sc.textFile("input file path");

        // split each line into words
        JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
        });

        // pair each word with a count of 1
        JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
        });

        // sum the counts per word
        JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer a, Integer b) { return a + b; }
        });

        counts.saveAsTextFile("outputfile-path");
    }
}
I'm trying to load a CSV file as a JavaRDD<String>, and then I want to get the data into a JavaRDD<Vector>.
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
import org.apache.spark.mllib.stat.Statistics;
import breeze.collection.mutable.SparseArray;
import scala.collection.immutable.Seq;

public class Trial {

    public void start() throws InstantiationException, IllegalAccessException,
            ClassNotFoundException {
        run();
    }

    private void run() {
        SparkConf conf = new SparkConf().setAppName("csvparser");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        JavaRDD<String> data = jsc.textFile("C:/Users/kalraa2/Documents/trial.csv");
        JavaRDD<Vector> datamain = data.flatMap(null);

        MultivariateStatisticalSummary mat = Statistics.colStats(datamain.rdd());
        System.out.println(mat.mean());
    }

    private List<Vector> Seq(Vector dv) {
        // TODO Auto-generated method stub
        return null;
    }

    public static void main(String[] args) throws Exception {
        Trial trial = new Trial();
        trial.start();
    }
}
The program runs without any error, but I'm not able to get anything out when trying to run it on the Spark machine. Can anyone tell me whether the conversion from the String RDD to the Vector RDD is correct?
My CSV file consists of only one column, which contains floating-point numbers.
The null in this flatMap invocation might be a problem:
JavaRDD<Vector> datamain = data.flatMap(null);
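If flatMap is what you want, a minimal sketch of what could replace that null, assuming one floating-point value per line and the Spark 1.x FlatMapFunction that returns an Iterable (it also needs imports for java.util.Collections and org.apache.spark.api.java.function.FlatMapFunction):

// Hypothetical replacement for the null argument; emits one Vector per non-blank line
JavaRDD<Vector> datamain = data.flatMap(new FlatMapFunction<String, Vector>() {
    public Iterable<Vector> call(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return Collections.<Vector>emptyList();  // skip blank lines
        }
        return Collections.singletonList(Vectors.dense(Double.parseDouble(trimmed)));
    }
});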
I solved it by changing the code to this:
JavaRDD<Vector> datamain = data.map(new Function<String, Vector>() {
    public Vector call(String s) {
        String[] sarray = s.trim().split("\\r?\\n");
        double[] values = new double[sarray.length];
        for (int i = 0; i < sarray.length; i++) {
            values[i] = Double.parseDouble(sarray[i]);
            System.out.println(values[i]);
        }
        return Vectors.dense(values);
    }
});
Assuming your trial.csv file looks like this
1.0
2.0
3.0
Taking the original code from your question, only a one-line change is required with Java 8:
SparkConf conf = new SparkConf().setAppName("csvparser").setMaster("local");
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> data = jsc.textFile("C:/Users/kalraa2/Documents/trial.csv");
JavaRDD<Vector> datamain = data.map(s -> Vectors.dense(Double.parseDouble(s)));
MultivariateStatisticalSummary mat = Statistics.colStats(datamain.rdd());
System.out.println(mat.mean());
Prints 2.0