I am doing a simple word count example in Apache Spark in Java (following an example from the Internet), and I am getting this error:
Caused by: java.net.UnknownHostException: my.txt
You can see my code below for reference:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class MyCount {
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        String file = "hdfs://my.txt";
        JavaSparkContext sc = new JavaSparkContext("local", "Simple App");
        JavaRDD<String> lines = sc.textFile(file);
        long nums = lines.count();
        System.out.println(nums);
    }
}
Can you try
String file = "hdfs://localhost/my.txt";
PS: make sure you have the file my.txt in HDFS.
If you don't have that file in HDFS, use the command below to put the file into HDFS from a local directory:
hadoop fs -copyFromLocal /home/training/my.txt hadoop/
Old question, but an answer was never accepted. The mistake, at the time I read it, is mixing Spark's "local" concept with "localhost".
Using the constructor JavaSparkContext(java.lang.String master, java.lang.String appName), you would want:
JavaSparkContext sc = new JavaSparkContext("localhost", "Simple App");
but the question was using "local". Further, the HDFS filename didn't specify a hostname; it should look like "hdfs://SomeNameNode:9000/foo/bar/" or
"hdfs://host:port/absolute-path".
As of 1.6.2, the Javadoc for JavaSparkContext does not show any constructor that lets you specify the cluster type directly:
http://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/api/java/JavaSparkContext.html
The preferred constructor for JavaSparkContext takes a SparkConf object. To make this more readable, build a SparkConf object and then pass it to JavaSparkContext. Here's an example that sets the app name, specifies the Kryo serializer, and sets the master:
SparkConf sparkConf = new SparkConf()
        .setAppName("Threshold")
        //.setMaster("local[4]")
        .setMaster(getMasterString(masterName))
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .registerKryoClasses(kryoClassArray);

// create the JavaSparkContext now:
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
NOTE: the commented-out .setMaster("local[4]") would use local mode, which may be what the OP was trying.
I have a more extended answer here that addresses using hostnames vs. IP addresses, and a lot more about setting up your SparkConf.
You can try this simple word count program:
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class First {
    public static void main(String[] args) {
        SparkConf sf = new SparkConf().setMaster("local[3]").setAppName("parth");
        JavaSparkContext sc = new JavaSparkContext(sf);
        JavaRDD<String> textFile = sc.textFile("input file path");
        // Split each line into words (Spark 1.x FlatMapFunction returns an Iterable)
        JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
        });
        // Pair each word with the count 1
        JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
        });
        // Sum the counts per word
        JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer a, Integer b) { return a + b; }
        });
        counts.saveAsTextFile("outputfile-path");
    }
}
When I try to run a simple word count in Spark Java using Eclipse, I get a Java error in a popup "Java Virtual Machine Launcher" window which says:
A Java Exception has occurred.
java -version
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
Below is the code:
package com.fd.spark;

import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) throws Exception {
        String inputFile = "/Spark/inp1";
        String outputFile = "/Spark/out1";
        // Create a Java Spark Context.
        SparkConf conf = new SparkConf().setAppName("wordCount").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Load our input data.
        JavaRDD<String> input = sc.textFile(inputFile);
        // Split up into words.
        JavaRDD<String> words = input.flatMap(
            new FlatMapFunction<String, String>() {
                public Iterator<String> call(String x) {
                    return (Iterator<String>) Arrays.asList(x.split(" "));
                }
            });
        // Transform into word and count.
        JavaPairRDD<String, Integer> counts = words.mapToPair(
            new PairFunction<String, String, Integer>() {
                public Tuple2<String, Integer> call(String x) {
                    return new Tuple2(x, 1);
                }
            }).reduceByKey(new Function2<Integer, Integer, Integer>() {
                public Integer call(Integer x, Integer y) { return x + y; }
            });
        // Save the word count back out to a text file, causing evaluation.
        counts.saveAsTextFile(outputFile);
    }
}
Use the code below with Java 8 and it will work. Here is my snippet using Java 8 lambda functions.
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("JD Word Counter");
        JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
        JavaRDD<String> inputFile = sparkContext.textFile(*Path_To_TXT_file*);
        // Split each line into words, then pair each word with 1 and sum per word
        JavaRDD<String> wordsFromFile = inputFile.flatMap(content -> Arrays.asList(content.split(" ")).iterator());
        JavaPairRDD<String, Integer> countData = wordsFromFile
                .mapToPair(t -> new Tuple2<String, Integer>(t, 1))
                .reduceByKey((x, y) -> x + y);
        countData.collect().forEach(t -> System.out.println(t._1 + " : " + t._2));
    }
}
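If you want to sanity-check the split/pair/reduce logic above without a Spark cluster, the same transformation can be mirrored with plain Java streams. This is only an illustration of the word-count transformation, not Spark code:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class LocalWordCount {
    // Mirrors flatMap(split) -> mapToPair(word, 1) -> reduceByKey(+)
    static Map<String, Long> countWords(String content) {
        return Arrays.stream(content.split(" "))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countWords("spark word count word spark spark");
        counts.forEach((w, c) -> System.out.println(w + " : " + c));
    }
}
```

Once the logic looks right here, swapping in the Spark RDD operations above distributes the same computation across partitions.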
I am new to Spark and working my way through the Java and Scala APIs. I wanted to put together two examples with a view to comparing the two languages in terms of conciseness and readability.
Here is my Scala version:
import java.io.StringReader

import au.com.bytecode.opencsv.CSVReader
import org.apache.spark.{SparkConf, SparkContext}

object LoadCSVScalaExample {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("MyLoadCSVScalaExampleApplication").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val input = sc.textFile("D:\\MOCK_DATA_spark.csv")
    val result = input.map { line =>
      val reader = new CSVReader(new StringReader(line))
      reader.readNext()
    }
    print("This is the total count " + result.count())
  }
}
Whereas this is the Java counterpart:
import java.io.StringReader;

import au.com.bytecode.opencsv.CSVReader;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class LoadCSVJavaExample implements Function<String, String[]> {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("MyLoadCSVJavaExampleApp").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> csvFile = sc.textFile("D:\\MOCK_DATA_spark.csv");
        JavaRDD<String[]> csvData = csvFile.map(new LoadCSVJavaExample());
        System.out.println("This prints the total count " + csvData.count());
    }

    public String[] call(String line) throws Exception {
        CSVReader reader = new CSVReader(new StringReader(line));
        return reader.readNext();
    }
}
I am not sure, however, whether the Java example is actually correct. Can I pick your brain on it? I know I could use the Databricks spark-csv library instead, but I was wondering whether the current example is correct and how it could be further improved.
Thank you for your help,
I.
I am trying to run a Spark program in Java using Eclipse. It runs if I simply print something to the console, but I am not able to read any file using the textFile function.
I have read somewhere that reading a file can only be done via HDFS, but I am not able to do that on my local system.
Let me know how to access/read a file; if HDFS is needed, how do I install HDFS on my local system so that I can read a text file?
Here's the code I am testing with. The program itself runs fine, but it is unable to read the file, saying "Input path does not exist".
package spark;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.api.java.function.Function;

public class TestSpark {
    public static void main(String args[]) {
        String[] jars = {"D:\\customJars\\spark.jar"};
        System.setProperty("hadoop.home.dir", "D:\\hadoop-common-2.2.0-bin-master");
        SparkConf sparkConf = new SparkConf().setAppName("spark.TestSpark")
                .setMaster("spark://10.1.50.165:7077")
                .setJars(jars);
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        SQLContext sqlcon = new SQLContext(jsc);
        String inputFileName = "./forecaster.txt";
        JavaRDD<String> logData = jsc.textFile(inputFileName);
        long numAs = logData.filter(new Function<String, Boolean>() {
            @Override
            public Boolean call(String s) throws Exception {
                return s.contains("a");
            }
        }).count();
        long numBs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("b"); }
        }).count();
        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
        System.out.println("sadasdasdf");
        jsc.stop();
        jsc.close();
    }
}
My File Structure:
Update: your file name does not have a .txt extension, but you are using one in your application. You should use it as String inputFileName = "forecaster";
If the file is in the same folder as the Java class TestSpark ($APP_HOME):
String inputFileName = "forecaster.txt";
If the file is in a Data dir under your Spark project:
String inputFileName = "Data\\forecaster.txt";
Or use a fully qualified path; the log from my testing below shows the form:
16/08/03 08:25:25 INFO HadoopRDD: Input split: file:/C:/Users/user123/worksapce/spark-java/forecaster.txt
String inputFileName = "file:/C:/Users/user123/worksapce/spark-java/forecaster.txt";
For example, I copied your code and ran it in my local environment.
This is how my project is set up, and I run it as:
String inputFileName = "forecaster.txt" ;
Test File:
this is test file
aaa
bbb
ddddaaee
ewwww
aaaa
a
a
aaaa
bb
Code that I used:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class TestSpark {
    public static void main(String args[]) {
        // String[] jars = {"D:\\customJars\\spark.jar"};
        // System.setProperty("hadoop.home.dir", "D:\\hadoop-common-2.2.0-bin-master");
        SparkConf sparkConf = new SparkConf().setAppName("spark.TestSpark").setMaster("local");
                //.setMaster("spark://10.1.50.165:7077")
                //.setJars(jars);
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        //SQLContext sqlcon = new SQLContext(jsc);
        String inputFileName = "forecaster.txt";
        JavaRDD<String> logData = jsc.textFile(inputFileName);
        long numAs = logData.filter(new Function<String, Boolean>() {
            @Override
            public Boolean call(String s) throws Exception {
                return s.contains("a");
            }
        }).count();
        long numBs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("b"); }
        }).count();
        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
        System.out.println("sadasdasdf");
        jsc.stop();
        jsc.close();
    }
}
Spark needs a scheme and a proper path in order to understand how to read the file. So if you are reading from HDFS, you should use:
jsc.textFile("hdfs://host/path/to/hdfs/file/input.txt");
If you are reading a local file (local to the worker node, not the machine the driver is running on), you should use:
jsc.textFile("file:///path/to/local/file/input.txt");
For reading Hadoop Archive File (HAR), you should use:
jsc.textFile("har://archive/path/to/hdfs/file/input.txt");
And so on.
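As a trivial illustration of how these URIs are composed (the helper and the host/port values here are made up for the example):

```java
public class PathSchemes {
    // Prefix an absolute path with a filesystem scheme and optional authority (host:port)
    static String withScheme(String scheme, String authority, String path) {
        return scheme + "://" + authority + path;
    }

    public static void main(String[] args) {
        System.out.println(withScheme("hdfs", "namenode:9000", "/data/input.txt")); // hdfs://namenode:9000/data/input.txt
        System.out.println(withScheme("file", "", "/tmp/input.txt"));               // file:///tmp/input.txt
    }
}
```

Note the local-file case: with an empty authority the URI has three slashes, `file:///...`, which is the form Spark expects for worker-local paths.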
I am new to Spark and trying to extract lines that contain "Subject:" and save them in an ArrayList. I am not facing any error, but the ArrayList is empty. Can you please guide me on where I am going wrong, or the best way to do this?
import java.util.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

public final class extractSubject {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setMaster("local[1]").setAppName("JavaBookExample");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        JavaRDD<String> sample = sc.textFile("/Users/Desktop/sample.txt");
        final ArrayList<String> list = new ArrayList<>();
        sample.foreach(new VoidFunction<String>() {
            public void call(String line) {
                if (line.contains("Subject:")) {
                    System.out.println(line);
                    list.add(line);
                }
            }
        });
        System.out.println(list);
        sc.stop();
    }
}
Please keep in mind that Spark applications run distributed and in parallel. Therefore you cannot modify variables outside of functions that are executed by Spark.
Instead you need to return a result from these functions. In your case you need flatMap (instead of foreach, which has no result), which concatenates the collections returned by your function.
If a line matches, a list containing the matching line is returned; otherwise you return an empty list.
To print the data in the main function, you first have to gather the possibly distributed data in your master node, by calling collect().
Here is an example:
import java.util.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
public final class extractSubject {
public static void main(String[] args) {
SparkConf sparkConf = new SparkConf().setMaster("local[1]").setAppName("JavaBookExample");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
//JavaRDD<String> sample = sc.textFile("/Users/Desktop/sample.txt");
JavaRDD<String> sample = sc.parallelize(Arrays.asList("Subject: first",
"nothing here",
"Subject: second",
"dummy"));
JavaRDD<String> subjectLinesRdd = sample.flatMap(new FlatMapFunction<String, String>() {
public Iterable<String> call(String line) {
if (line.contains("Subject:")) {
return Collections.singletonList(line); // line matches → return list with the line as its only element
} else {
return Collections.emptyList(); // ignore line → return empty list
}
}
});
List<String> subjectLines = subjectLinesRdd.collect(); // collect values from Spark workers
System.out.println(subjectLines); // → "[Subject: first, Subject: second]"
sc.stop();
}
}
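The flatMap semantics used above (return a one-element list for a match, an empty list otherwise) can be exercised with plain Java, independent of Spark, to confirm the filtering logic:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SubjectFilter {
    // Same idea as the Spark flatMap: keep only lines containing "Subject:"
    static List<String> subjectLines(List<String> lines) {
        return lines.stream()
                .filter(line -> line.contains("Subject:"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("Subject: first", "nothing here", "Subject: second", "dummy");
        System.out.println(subjectLines(sample)); // [Subject: first, Subject: second]
    }
}
```

The Spark version distributes this same per-line decision across workers, which is why the results must be gathered with collect() before printing.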
I'm trying to load a CSV file as a JavaRDD<String> and then want to get the data into a JavaRDD<Vector>.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
import org.apache.spark.mllib.stat.Statistics;
import breeze.collection.mutable.SparseArray;
import scala.collection.immutable.Seq;
public class Trial {
    public void start() throws InstantiationException, IllegalAccessException, ClassNotFoundException {
        run();
    }

    private void run() {
        SparkConf conf = new SparkConf().setAppName("csvparser");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        JavaRDD<String> data = jsc.textFile("C:/Users/kalraa2/Documents/trial.csv");
        JavaRDD<Vector> datamain = data.flatMap(null);
        MultivariateStatisticalSummary mat = Statistics.colStats(datamain.rdd());
        System.out.println(mat.mean());
    }

    private List<Vector> Seq(Vector dv) {
        // TODO Auto-generated method stub
        return null;
    }

    public static void main(String[] args) throws Exception {
        Trial trial = new Trial();
        trial.start();
    }
}
The program runs without any error, but I'm not getting anything when trying to run it on the Spark machine. Can anyone tell me whether my conversion of the String RDD to a Vector RDD is correct?
My CSV file consists of only one column, containing floating-point numbers.
The null in this flatMap invocation might be a problem:
JavaRDD<Vector> datamain = data.flatMap(null);
I solved it by changing the code to this:
JavaRDD<Vector> datamain = data.map(new Function<String, Vector>() {
    public Vector call(String s) {
        String[] sarray = s.trim().split("\\r?\\n");
        double[] values = new double[sarray.length];
        for (int i = 0; i < sarray.length; i++) {
            values[i] = Double.parseDouble(sarray[i]);
            System.out.println(values[i]);
        }
        return Vectors.dense(values);
    }
});
Assuming your trial.csv file looks like this
1.0
2.0
3.0
Taking the original code from your question, a one-line change is required with Java 8:
SparkConf conf = new SparkConf().setAppName("csvparser").setMaster("local");
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> data = jsc.textFile("C:/Users/kalraa2/Documents/trial.csv");
JavaRDD<Vector> datamain = data.map(s -> Vectors.dense(Double.parseDouble(s)));
MultivariateStatisticalSummary mat = Statistics.colStats(datamain.rdd());
System.out.println(mat.mean());
Prints 2.0
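As a cross-check without Spark, the mean of a single numeric column can be computed with plain Java on the same sample data as the trial.csv above. This only verifies the parsing/averaging logic, not the Spark code path:

```java
import java.util.Arrays;
import java.util.List;

public class ColumnMean {
    // Parse one value per line and average them, analogous to colStats().mean() on a one-column RDD
    static double mean(List<String> lines) {
        return lines.stream()
                .mapToDouble(Double::parseDouble)
                .average()
                .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        System.out.println(mean(Arrays.asList("1.0", "2.0", "3.0"))); // 2.0
    }
}
```

This agrees with the 2.0 that the Spark version prints, which is a quick way to confirm the RDD conversion is doing what you expect.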