I am trying to run a Spark program in Java using Eclipse. It runs if I simply print something to the console, but I am not able to read any file using the textFile function.
I have read somewhere that reading a file can only be done using HDFS, but I am not able to do that on my local system.
Please let me know how to access/read a file; if HDFS is required, how do I install it on my local system so that I can read the text file?
Here is the code I am testing with. The program itself runs fine, but it cannot read the file and fails with "Input path does not exist".
package spark;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.api.java.function.Function;
public class TestSpark {
public static void main(String args[])
{
String[] jars = {"D:\\customJars\\spark.jar"};
System.setProperty("hadoop.home.dir", "D:\\hadoop-common-2.2.0-bin-master");
SparkConf sparkConf = new SparkConf().setAppName("spark.TestSpark")
.setMaster("spark://10.1.50.165:7077")
.setJars(jars);
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
SQLContext sqlcon = new SQLContext(jsc);
String inputFileName = "./forecaster.txt" ;
JavaRDD<String> logData = jsc.textFile(inputFileName);
long numAs = logData.filter(new Function<String, Boolean>() {
@Override
public Boolean call(String s) throws Exception {
return s.contains("a");
}
}).count();
long numBs = logData.filter(new Function<String, Boolean>() {
public Boolean call(String s) { return s.contains("b"); }
}).count();
System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
System.out.println("sadasdasdf");
jsc.stop();
jsc.close();
}
}
My File Structure:
Update: your file does not have a .txt extension in its name, but you are using one in your application. You should use it as String inputFileName = "forecaster";
If the file is in the same folder as the Java class TestSpark ($APP_HOME):
String inputFileName = "forecaster.txt" ;
If the file is in a Data directory under your Spark project:
String inputFileName = "Data\\forecaster.txt" ;
Or use a fully qualified path; the log from the testing below says:
16/08/03 08:25:25 INFO HadoopRDD: Input split: file:/C:/Users/user123/worksapce/spark-java/forecaster.txt
String inputFileName = "file:/C:/Users/user123/worksapce/spark-java/forecaster.txt" ;
For example, I copied your code and ran it in my local environment. This is how my project setup looks, and I run it as:
String inputFileName = "forecaster.txt" ;
Test File:
this is test file
aaa
bbb
ddddaaee
ewwww
aaaa
a
a
aaaa
bb
Code that I used:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
public class TestSpark {
public static void main(String args[])
{
// String[] jars = {"D:\\customJars\\spark.jar"};
// System.setProperty("hadoop.home.dir", "D:\\hadoop-common-2.2.0-bin-master");
SparkConf sparkConf = new SparkConf().setAppName("spark.TestSpark").setMaster("local");
//.setMaster("spark://10.1.50.165:7077")
//.setJars(jars);
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
//SQLContext sqlcon = new SQLContext(jsc);
String inputFileName = "forecaster.txt" ;
JavaRDD<String> logData = jsc.textFile(inputFileName);
long numAs = logData.filter(new Function<String, Boolean>() {
@Override
public Boolean call(String s) throws Exception {
return s.contains("a");
}
}).count();
long numBs = logData.filter(new Function<String, Boolean>() {
public Boolean call(String s) { return s.contains("b"); }
}).count();
System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
System.out.println("sadasdasdf");
jsc.stop();
jsc.close();
}
}
Spark needs a scheme and a proper path in order to understand how to read the file. So if you are reading from HDFS, you should use:
jsc.textFile("hdfs://host/path/to/hdfs/file/input.txt");
If you are reading a local file (local to the worker node, not the machine the driver is running on), you should use:
jsc.textFile("file:///path/to/local/file/input.txt");
For reading a Hadoop Archive (HAR) file, you should use:
jsc.textFile("har://archive/path/to/hdfs/file/input.txt");
And so on.
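One small convenience sketch (my addition, not from the original answer): if you want to keep a relative path during development, you can let Java build the fully qualified file: URI before handing it to Spark. This only helps when the file is visible to the executors (e.g. a local master), since file: paths are read on the workers.
// Build an absolute file: URI from a relative path so Spark resolves it unambiguously
String inputFileName = new java.io.File("forecaster.txt").toURI().toString();
// e.g. file:/C:/Users/user123/worksapce/spark-java/forecaster.txt
JavaRDD<String> logData = jsc.textFile(inputFileName);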
When I tried to give the source URI of a folder inside my bucket (which has around 400 CSV files) in the Java program, it did not move any files to the BQ table. If I try with a single CSV file, it moves.
package com.example.bigquerydatatransfer;
import com.google.api.gax.rpc.ApiException;
import com.google.cloud.bigquery.datatransfer.v1.CreateTransferConfigRequest;
import com.google.cloud.bigquery.datatransfer.v1.DataTransferServiceClient;
import com.google.cloud.bigquery.datatransfer.v1.ProjectName;
import com.google.cloud.bigquery.datatransfer.v1.TransferConfig;
import com.google.protobuf.Struct;
import com.google.protobuf.Value;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
// Sample to create google cloud storage transfer config
public class Cloud_to_BQ {
public static void main(String[] args) throws IOException {
final String projectId = "dfp-bq";
String datasetId = "mytest1";
String tableId = "PROG_DATA";
String sourceUri = "gs://dfp-bq/C:\\PROG_Reports";
String fileFormat = "CSV";
String fieldDelimiter = ",";
String skipLeadingRows = "1";
Map<String, Value> params = new HashMap<>();
params.put(
"destination_table_name_template", Value.newBuilder().setStringValue(tableId).build());
params.put("data_path_template", Value.newBuilder().setStringValue(sourceUri).build());
params.put("write_disposition", Value.newBuilder().setStringValue("APPEND").build());
params.put("file_format", Value.newBuilder().setStringValue(fileFormat).build());
params.put("field_delimiter", Value.newBuilder().setStringValue(fieldDelimiter).build());
params.put("skip_leading_rows", Value.newBuilder().setStringValue(skipLeadingRows).build());
TransferConfig transferConfig =
TransferConfig.newBuilder()
.setDestinationDatasetId(datasetId)
.setDisplayName("Trial_Run_PROG_DataTransfer")
.setDataSourceId("google_cloud_storage")
.setParams(Struct.newBuilder().putAllFields(params).build())
.setSchedule("every 24 hours")
.build();
createCloudStorageTransfer(projectId, transferConfig);
}
public static void createCloudStorageTransfer(String projectId, TransferConfig transferConfig)
throws IOException {
try (DataTransferServiceClient client = DataTransferServiceClient.create()) {
ProjectName parent = ProjectName.of(projectId);
CreateTransferConfigRequest request =
CreateTransferConfigRequest.newBuilder()
.setParent(parent.toString())
.setTransferConfig(transferConfig)
.build();
TransferConfig config = client.createTransferConfig(request);
System.out.println("Cloud storage transfer created successfully :" + config.getName());
} catch (ApiException ex) {
System.out.print("Cloud storage transfer was not created." + ex.toString());
}
}
}
Is there any way I can move all the files to the BQ table at a stretch?
This is the BQ log for the run:
2022-08-04T07:27:50.185847509Z No files found matching: "gs://dfp-bq/C:\PROG_Reports"
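One thing that stands out (a hedged suggestion on my part, not a verified fix): the source URI mixes a GCS bucket with a local Windows path, which is exactly what the log complains about. The Cloud Storage transfer's data_path_template expects a gs:// path and supports wildcards, so pointing it at all objects under the folder might look roughly like this, assuming the 400 CSV files sit under a PROG_Reports/ prefix in the dfp-bq bucket:
// Hypothetical path: assumes the CSV files live under gs://dfp-bq/PROG_Reports/
// data_path_template accepts wildcards, so one transfer config can pick up all of them
String sourceUri = "gs://dfp-bq/PROG_Reports/*.csv";
params.put("data_path_template", Value.newBuilder().setStringValue(sourceUri).build());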
Does anyone know of a repo that shows what a simple HelloWorld Java or Scala program would look like, to build a jar that can be executed using the AWS SageMaker SparkJarProcessing class?
Readthedocs (https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_processing/spark_distributed_data_processing/sagemaker-spark-processing.html) mentions:
"In the next example, you’ll take a Spark application jar (located in ./code/spark-test-app.jar)..."
My question is: what does the source code for this jar (spark-test-app.jar) look like?
I tried building a simple Java project jar with src > com.test > HW.java:
public class HW {
public static void main(String[] args) {
System.out.printf("hello world!");
}
}
and running it inside a SageMaker notebook (conda_python3 kernel) using:
from sagemaker.spark.processing import SparkJarProcessor
from sagemaker import get_execution_role
role = get_execution_role()
print(role)
spark_processor = SparkJarProcessor(
base_job_name="sm-spark-java",
framework_version="3.1",
role=role,
instance_count=2,
instance_type="ml.m5.xlarge",
max_runtime_in_seconds=1200,
)
spark_processor.run(
submit_app="./SparkJarProcessing-1.0-SNAPSHOT.jar",
submit_class="com.test.HW",
arguments=["--input", "abc"],
logs=True,
)
But I end up getting an error:
Could not execute HW class.
Any sample source code for spark-test-app.jar would be highly appreciated!
To answer your question, the source code of that class looks like:
package com.amazonaws.sagemaker.spark.test;
import java.lang.invoke.SerializedLambda;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.ParseException;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.BasicParser;
import org.apache.commons.cli.Options;
import org.apache.spark.sql.Dataset;
import org.apache.commons.cli.CommandLine;
import org.apache.spark.sql.types.DataTypes;
import org.apache.commons.lang3.StringUtils;
import java.util.List;
import org.apache.spark.sql.SparkSession;
public class HelloJavaSparkApp
{
public static void main(final String[] args) {
System.out.println("Hello World, this is Java-Spark!");
final CommandLine parsedArgs = parseArgs(args);
final String inputPath = parsedArgs.getOptionValue("input");
final String outputPath = parsedArgs.getOptionValue("output");
final SparkSession spark = SparkSession.builder().appName("Hello Spark App").getOrCreate();
System.out.println("Got a Spark session with version: " + spark.version());
System.out.println("Reading input from: " + inputPath);
final Dataset salesDF = spark.read().json(inputPath);
salesDF.printSchema();
salesDF.createOrReplaceTempView("sales");
final Dataset topDF = spark.sql("SELECT date, sale FROM sales WHERE sale > 750 SORT BY sale DESC");
topDF.show();
final Dataset avgDF = salesDF.groupBy("date", new String[0]).avg(new String[0]).orderBy("date", new String[0]);
System.out.println("Collected average sales: " + StringUtils.join((Object[])new List[] { avgDF.collectAsList() }));
spark.sqlContext().udf().register("double", n -> n + n, DataTypes.LongType);
final Dataset saleDoubleDF = salesDF.selectExpr(new String[] { "date", "sale", "double(sale) as sale_double" }).orderBy("date", new String[] { "sale" });
saleDoubleDF.show();
System.out.println("Writing output to: " + outputPath);
saleDoubleDF.coalesce(1).write().json(outputPath);
spark.stop();
}
private static CommandLine parseArgs(final String[] args) {
final Options options = new Options();
final CommandLineParser parser = (CommandLineParser)new BasicParser();
final Option input = new Option("i", "input", true, "input path");
input.setRequired(true);
options.addOption(input);
final Option output = new Option("o", "output", true, "output path");
output.setRequired(true);
options.addOption(output);
try {
return parser.parse(options, args);
}
catch (ParseException e) {
new HelpFormatter().printHelp("HelloScalaSparkApp --input /opt/ml/input/foo --output /opt/ml/output/bar", options);
throw new RuntimeException((Throwable)e);
}
}
}
At the same time, I have created a simple example that shows how to run a hello world app here. Please note that I have run that example on Amazon SageMaker Studio Notebooks, using the Data Science 1.0 kernel.
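For completeness, here is a minimal sketch (my own illustration, not the official spark-test-app.jar source) of a bare entry class that SparkJarProcessor can launch. The key points are that the package declaration has to match the submit_class you pass (com.test.HW in your notebook code) and that the jar has to contain the compiled classes, e.g. built with mvn package.
package com.test;

public class HW {
    public static void main(String[] args) {
        // SparkJarProcessor passes the configured `arguments` straight through to main
        System.out.println("hello world! args: " + String.join(" ", args));
    }
}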
Hope this helps.
I am doing a simple word count example in Apache Spark in Java, following a reference from the Internet, and I am getting this error:
Caused by: java.net.UnknownHostException: my.txt
You can see my code below for reference.
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
public class MyCount {
public static void main(String[] args) {
// TODO Auto-generated method stub
String file = "hdfs://my.txt";
JavaSparkContext sc = new JavaSparkContext("local", "Simple App");
JavaRDD<String> lines = sc.textFile(file);
long nums = lines.count();
System.out.println(nums);
}
}
Can you try
String file = "hdfs://localhost/my.txt"
PS: make sure you have the file my.txt in HDFS.
If you don't have that file in HDFS, use the command below to put it into HDFS from your local directory:
hadoop fs -copyFromLocal /home/training/my.txt hadoop/
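As an aside (my addition, and the host/port are assumptions that depend on your fs.defaultFS setting): with a default pseudo-distributed setup listening on port 9000, the corrected read would look roughly like this.
// Assumes the NameNode listens on localhost:9000 and the file was copied to
// /user/training/hadoop/my.txt by the -copyFromLocal command above; adjust to your layout
String file = "hdfs://localhost:9000/user/training/hadoop/my.txt";
JavaRDD<String> lines = sc.textFile(file);
System.out.println(lines.count());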
Old question, but an answer was never accepted. The mistake, at the time I read it, is mixing Spark's "local" concept with "localhost".
Using this constructor: JavaSparkContext(java.lang.String master, java.lang.String appName), you would want to use:
JavaSparkContext sc = new JavaSparkContext("localhost", "Simple App");
but the question was using "local". Further, the HDFS filename didn't specify a hostname: "hdfs://SomeNameNode:9000/foo/bar/" or "hdfs://host:port/absolute-path".
As of 1.6.2, the Javadoc for JavaSparkContext does not show any constructor that lets you specify the cluster type directly:
http://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/api/java/JavaSparkContext.html
The best constructor for JavaSparkContext wants a SparkConf object. To make things more readable, build a SparkConf object and then pass it to JavaSparkContext. Here's an example that sets the app name, specifies the Kryo serializer, and sets the master:
SparkConf sparkConf = new SparkConf().setAppName("Threshold")
//.setMaster("local[4]");
.setMaster(getMasterString(masterName))
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.registerKryoClasses(kryoClassArray);
// create the JavaSparkContext now:
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
NOTE: the alternative .setMaster("local[4]"); would use local mode, which may have been what the OP was trying to do.
I have a more extended answer here that addresses using hostnames vs. IP addresses, and much more about setting up your SparkConf.
You can try this simple word count program
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import java.util.Arrays;
import scala.Tuple2;
public class First {
public static void main(String[] args) {
SparkConf sf = new SparkConf().setMaster("local[3]").setAppName("parth");
JavaSparkContext sc = new JavaSparkContext(sf);
JavaRDD<String> textFile = sc.textFile("input file path");
JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() {
public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }});
JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("outputfile-path");
}
}
I'm trying to load a CSV file as a JavaRDD<String> and then want to get the data into a JavaRDD<Vector>.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
import org.apache.spark.mllib.stat.Statistics;
import breeze.collection.mutable.SparseArray;
import scala.collection.immutable.Seq;
import java.util.List;
public class Trial {
public void start() throws InstantiationException, IllegalAccessException,
ClassNotFoundException {
run();
}
private void run(){
SparkConf conf = new SparkConf().setAppName("csvparser");
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> data = jsc.textFile("C:/Users/kalraa2/Documents/trial.csv");
JavaRDD<Vector> datamain = data.flatMap(null);
MultivariateStatisticalSummary mat = Statistics.colStats(datamain.rdd());
System.out.println(mat.mean());
}
private List<Vector> Seq(Vector dv) {
// TODO Auto-generated method stub
return null;
}
public static void main(String[] args) throws Exception {
Trial trial = new Trial();
trial.start();
}
}
The program runs without any error, but I'm not able to get anything out of it when I try to run it on the Spark machine. Can anyone tell me whether the conversion from the String RDD to the Vector RDD is correct?
My CSV file consists of only one column, which contains floating-point numbers.
The null in this flatMap invocation might be a problem:
JavaRDD<Vector> datamain = data.flatMap(null);
I solved it by changing the code to this:
JavaRDD<Vector> datamain = data.map(new Function<String,Vector>(){
public Vector call(String s){
String[] sarray = s.trim().split("\\r?\\n");
double[] values = new double[sarray.length];
for (int i = 0; i < sarray.length; i++) {
values[i] = Double.parseDouble(sarray[i]);
System.out.println(values[i]);
}
return Vectors.dense(values);
}
}
);
Assuming your trial.csv file looks like this
1.0
2.0
3.0
Taking the original code from your question, only a one-line change is required with Java 8:
SparkConf conf = new SparkConf().setAppName("csvparser").setMaster("local");
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> data = jsc.textFile("C:/Users/kalraa2/Documents/trial.csv");
JavaRDD<Vector> datamain = data.map(s -> Vectors.dense(Double.parseDouble(s)));
MultivariateStatisticalSummary mat = Statistics.colStats(datamain.rdd());
System.out.println(mat.mean());
Prints 2.0
I am using IntelliJ IDEA with Maven integration, but I am getting errors on the following lines:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
I am trying to run the following example:
package com.spark.hello;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
public class Hello {
public static void main(String[] args) {
String logFile = "F:\\Spark\\a.java";
SparkConf conf = new SparkConf().setAppName("Simple Application");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> logData = sc.textFile(logFile).cache();
long numAs = logData.filter(new Function<String, Boolean>() {
public Boolean call(String s) { return s.contains("a"); }
}).count();
long numBs = logData.filter(new Function<String, Boolean>() {
public Boolean call(String s) { return s.contains("b"); }
}).count();
System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
}
}
Please help me solve this issue, or is there any other way to run this kind of project?
Without seeing the error, I'm guessing the IDE is telling you they are unused imports. Be sure to double-check the dependencies and the versions.
Alt + Enter is the shortcut I've used to resolve many of the issues.
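If the imports really are unresolved, the Spark dependency is usually what is missing from the pom.xml. As a hedged sketch (the Scala suffix and version are assumptions; match them to the Spark build you run against), the dependency block would look something like:
<!-- Assumed coordinates: pick the Scala suffix and version matching your Spark install -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.8</version>
</dependency>
After adding it, re-import the Maven project so IntelliJ picks up the new dependency.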