How to run analytic in spark? - java

I'm new to Spark and still learning it. I have a few questions and would like some opinions.
I have to prepare a jar file for an analytic method so that it can run as a Spark job.
Is it necessary for the jar to be executable / runnable?
Can I prepare the jar as a library with a few methods?
In my case, the analytic has an input and an output.
Can I pass input JSON to Spark and get output JSON back?
What are the steps?
Any help or links to read would be appreciated.

Your first question is basically how to run Spark with the Java API. Here is some code I think you'll find useful:
SparkLauncher launcher = new SparkLauncher()
    .setAppName(config.getString("appName"))
    .setSparkHome(sparkHomePath)
    .setAppResource(pathToYourJar)
    .setMaster(masterUrl)
    .setMainClass(fullNameOfMainClass);
You might need to add launcher.addJar(...).
Create an instance of SparkAppHandle.Listener and pass it when starting the application:
SparkAppHandle handle = launcher.startApplication(sparkJobListener);
"can I pass input json and get output json in the spark?"
If you wish to read a JSON as the input you can follow the instructions in this link
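On the executable-jar question: as far as I know, the jar does not need a Main-Class entry in its manifest, because the class to run is named explicitly through setMainClass (or --class with spark-submit). One way to structure it, sketched here in plain Java only, is to keep the analytic as an ordinary library method and add a thin main that takes an input path and an output path. The class name RecordCountAnalytic, the argument order, and the record-counting logic are all placeholders I made up, and the Spark-specific code is deliberately left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical analytic; the name and the counting logic are placeholders.
final class RecordCountAnalytic {

    // Library-style method: takes JSON Lines input (one JSON object per line)
    // and returns a one-line JSON result. It knows nothing about how it is launched.
    static String analyze(List<String> jsonLines) {
        long records = jsonLines.stream()
                .filter(line -> !line.trim().isEmpty())
                .count();
        return "{\"recordCount\":" + records + "}";
    }

    // Thin entry point; this is the class a launcher would be told to run.
    // Assumed convention: args[0] = input file, args[1] = output file.
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: RecordCountAnalytic <input.json> <output.json>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        Files.write(Paths.get(args[1]), analyze(lines).getBytes(StandardCharsets.UTF_8));
    }
}
```

With the launcher code above, the two paths could then be handed over with launcher.addAppArgs(inputPath, outputPath), where inputPath and outputPath are again hypothetical variables.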

Related

S3 Implementation for org.apache.parquet.io.InputFile?

I am trying to write a Scala-based AWS Lambda that reads Snappy-compressed Parquet files stored in S3. The process will write them back out as partitioned JSON files.
I have been trying to use the org.apache.parquet.hadoop.ParquetFileReader class to read the files. The non-deprecated way to do this appears to be passing it an implementation of the org.apache.parquet.io.InputFile interface. There is one for Hadoop (HadoopInputFile), but I cannot find one for S3. I also tried some of the deprecated ways of using this class, but could not get them to work with S3 either.
Any solution to this dilemma?
Just in case anyone is interested in why I am doing this in Scala: I cannot figure out another way to do it. The Python implementations for Parquet (pyarrow and fastparquet) both seem to struggle with complicated list/struct-based schemas.
Also, I have seen some AvroParquetReader-based code (Read parquet data from AWS s3 bucket) that might be a different solution, but I could not get it to work without a known schema. Maybe I am missing something there.
I'd really like to get the ParquetFileReader class to work, as it seems clean.
Appreciate any ideas.
Hadoop uses its own filesystem abstraction layer, which has an implementation for s3 (https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A).
The setup should look something like the following (Java, but the same should work with Scala):
Configuration conf = new Configuration();
conf.set(Constants.ENDPOINT, "https://s3.eu-central-1.amazonaws.com/");
conf.set(Constants.AWS_CREDENTIALS_PROVIDER,
DefaultAWSCredentialsProviderChain.class.getName());
// maybe additional configuration properties depending on the credential provider
URI uri = URI.create("s3a://bucketname/path");
org.apache.hadoop.fs.Path path = new Path(uri);
ParquetFileReader pfr = ParquetFileReader.open(HadoopInputFile.fromPath(path, conf));

How to use Weka JSONLoader in Java IDE?

I want to use Weka to parse an existing JSON file in Java (in Eclipse). I believe this can be done using the JSONLoader class. After reading the class documentation (http://weka.sourceforge.net/doc.dev/weka/core/converters/JSONLoader.html#JSONLoader--) I thought that this could easily be done like this:
JSONLoader jsonLoader = new JSONLoader(jsonFile);
Then I thought that just calling jsonLoader.getFileDescription() or jsonLoader.getSource() would give me results. This is not how it's done, though, and I can't find anywhere how to use the JSONLoader class in my Java code. So, in order not to make this question too broad: how can I create a JSONLoader object that reads a source in JSON format?
First of all, this has nothing to do with Eclipse, so you should edit your question.
A brief look at the documentation of JSONLoader (in the link you provided) shows that you need to set the data source you want to parse using setSource (the constructor takes no arguments):
JSONLoader jsonLoader = new JSONLoader();
File f = new File("PATH_TO_YOUR_JSON_FILE");
jsonLoader.setSource(f); //you can also use InputStream instead of a File
After doing that you can use other methods that parse your JSON:
Instances dataset = jsonLoader.getDataSet();
jsonLoader.getFileDescription();
...

Reading JSON file with BigQuery to make table

I'm new to Google Dataflow and can't get it to work with JSON. I've been reading through the documentation, but can't solve my problem.
So, following the WordCount example, I figured out how data is loaded from a .csv file with the following line:
PCollection<String> input = p.apply(TextIO.Read.from(options.getInputFile()));
where inputFile is a .csv file from my gcloud bucket. I can transform the lines read from the .csv with:
PCollection<TableRow> table = input.apply(ParDo.of(new ExtractParametersFn()));
(ExtractParametersFn is defined by me.) So far so good!
But then I realized my .csv file was too big and I had to convert it to JSON (https://cloud.google.com/bigquery/preparing-data-for-bigquery).
Since BigQueryIO is supposedly better for reading JSON, I tried the following code:
PCollection<TableRow> table = p.apply(BigQueryIO.Read.from(options.getInputFile()));
(inputFile is now a JSON file, and the output when reading with BigQueryIO is a PCollection of TableRows.) I tried TextIO too (which returns a PCollection of Strings), and neither of the two IO options works.
What am I missing? The documentation is not detailed enough to find an answer there, but perhaps some of you have already dealt with this problem before?
Any suggestions would be much appreciated. :)
I believe there are two options to consider:
Use TextIO with TableRowJsonCoder to ingest the JSON files (e.g., like it is done in the TopWikipediaSessions example);
Import the JSON files into a BigQuery table (https://cloud.google.com/bigquery/loading-data-into-bigquery), and then use BigQueryIO.Read to read from the table.
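Both options, as far as I understand them, expect newline-delimited JSON, i.e. one complete JSON object per line rather than a single large array. In case it helps with the CSV conversion step, here is a minimal plain-Java sketch of turning one CSV line into one such JSON line. The class and method names are made up for illustration, every value is emitted as a JSON string, and the naive split assumes no quoted fields or embedded commas, so treat it as a starting point only.

```java
// Hypothetical helper; names are illustrative and not part of any SDK.
final class CsvToJsonLines {

    // Escapes backslashes and double quotes so the value fits inside a JSON string.
    // Assumption: the data contains no control characters that need escaping.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Converts one CSV line into one JSON object on a single line.
    // Assumption: fields are separated by plain commas with no quoting.
    static String toJsonLine(String[] header, String csvLine) {
        String[] cells = csvLine.split(",", -1); // -1 keeps trailing empty cells
        StringBuilder sb = new StringBuilder("{");
        for (int i = 0; i < header.length && i < cells.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append('"').append(escape(header[i])).append("\":\"")
              .append(escape(cells[i])).append('"');
        }
        return sb.append('}').toString();
    }
}
```

Each returned string would be written as its own line in the output file.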

can't save clustering result to an arff file

I'm using Java code to save a clustering result to an ARFF file.
I've followed the instructions on this site:
http://weka.wikispaces.com/Visualizing+cluster+assignments
but I get an error in the line:
PlotData2D predData = ClustererPanel.setUpVisualizableInstances(train, eval);
saying that:
The method setUpVisualizableInstances(Instances, ClusterEvaluation) is undefined for the type ClustererPanel
I've tried to google it, but I couldn't find a solution.
Judging from the current code:
http://grepcode.com/file/repo1.maven.org/maven2/nz.ac.waikato.cms.weka/weka-dev/3.7.12/weka/gui/explorer/ClustererPanel.java#ClustererPanel
I assume you have to call setInstances instead of setUpVisualizableInstances now.
But: Why do you use a visualization tutorial?
Visualization won't produce an .arff file.
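If the goal is just a file containing each instance plus its cluster, the ARFF format is simple enough to write directly. Below is a rough plain-Java sketch, not Weka code: ClusterArffWriter and its parameters are names I made up, it assumes all attributes are numeric and that names contain no spaces, and the assignments array is assumed to hold one cluster index per row (for example, collected from clusterer.clusterInstance(...)). Inside Weka itself, the ArffSaver converter is probably the more natural route.

```java
// Hypothetical helper that builds ARFF text by hand; not part of Weka.
final class ClusterArffWriter {

    // Produces an ARFF document with the given numeric attributes plus a
    // nominal "cluster" attribute holding the assignment of each row.
    static String toArff(String relation, String[] attrNames, double[][] rows,
                         int[] assignments, int numClusters) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (String name : attrNames) {
            sb.append("@attribute ").append(name).append(" numeric\n");
        }
        // Nominal attribute listing every possible cluster label.
        sb.append("@attribute cluster {");
        for (int c = 0; c < numClusters; c++) {
            if (c > 0) {
                sb.append(',');
            }
            sb.append("cluster").append(c);
        }
        sb.append("}\n\n@data\n");
        // One comma-separated line per instance, cluster label last.
        for (int i = 0; i < rows.length; i++) {
            for (double v : rows[i]) {
                sb.append(v).append(',');
            }
            sb.append("cluster").append(assignments[i]).append('\n');
        }
        return sb.toString();
    }
}
```

The returned string can be written to disk with java.nio.file.Files.write and opened in the Weka Explorer like any other ARFF file.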

DFC reading a file

I am using the DFC to access Documentum. I am trying to read a file. I have the r_object_id and I now wish to return the document associated with it. How would I do this in Java?
Assuming you also have a valid session with at least read access to the file:
String docId= getDocId();
IDfSysObject doc = (IDfSysObject)session.getObject(new DfId(docId));
ByteArrayInputStream stream = doc.getContent();
See the javadocs for the return type here for info on how to process the result. Also, I've noticed you've been asking quite a few Documentum Foundation Classes questions. Depending on the version of the DFC you are using, you can find the javadocs online at either Powerlink or Subscribenet and probably answer many of your own questions.
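Since the content comes back as a ByteArrayInputStream, processing it is ordinary java.io work. Here is a small sketch using only the standard library; ContentStreams and its method names are hypothetical, and it makes no DFC calls itself, so the stream is whatever getContent() handed you.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for consuming a content stream; not part of the DFC.
final class ContentStreams {

    // Reads the whole stream into memory, e.g. to build a String from it.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    // Writes the stream to a local file, overwriting any existing file.
    static void saveTo(InputStream in, Path target) throws IOException {
        Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

For a text document, new String(ContentStreams.readAll(stream), StandardCharsets.UTF_8) would give the contents, assuming the file is UTF-8 encoded.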
