Run a simple Cascading application in local mode - java

I'm new to Cascading/Hadoop and am trying to run a simple example in local mode (i.e. in memory). The example just copies a file:
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.flow.FlowDef;
import cascading.flow.local.LocalFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class CascadingTest {

    public static void main(String[] args) {
        Properties properties = new Properties();
        AppProps.setApplicationJarClass( properties, CascadingTest.class );
        FlowConnector flowConnector = new LocalFlowConnector();

        // create the source tap
        Tap inTap = new Hfs( new TextLine(), "D:\\git_workspace\\Impatient\\part1\\data\\rain.txt" );

        // create the sink tap
        Tap outTap = new Hfs( new TextLine(), "D:\\git_workspace\\Impatient\\part1\\data\\out.txt" );

        // specify a pipe to connect the taps
        Pipe copyPipe = new Pipe( "copy" );

        // connect the taps, pipes, etc., into a flow
        FlowDef flowDef = FlowDef.flowDef()
            .addSource( copyPipe, inTap )
            .addTailSink( copyPipe, outTap );

        // run the flow
        Flow flow = flowConnector.connect( flowDef );
        flow.complete();
    }
}
Here is the error I'm getting:
09-25-12 11:30:38,114 INFO - AppProps - using app.id: 9C82C76AC667FDAA2F6969A0DF3949C6
Exception in thread "main" cascading.flow.planner.PlannerException: could not build flow from assembly: [java.util.Properties cannot be cast to org.apache.hadoop.mapred.JobConf]
at cascading.flow.planner.FlowPlanner.handleExceptionDuringPlanning(FlowPlanner.java:515)
at cascading.flow.local.planner.LocalPlanner.buildFlow(LocalPlanner.java:84)
at cascading.flow.FlowConnector.connect(FlowConnector.java:454)
at com.x.y.CascadingTest.main(CascadingTest.java:37)
Caused by: java.lang.ClassCastException: java.util.Properties cannot be cast to org.apache.hadoop.mapred.JobConf
at cascading.tap.hadoop.Hfs.sourceConfInit(Hfs.java:78)
at cascading.flow.local.LocalFlowStep.initTaps(LocalFlowStep.java:77)
at cascading.flow.local.LocalFlowStep.getInitializedConfig(LocalFlowStep.java:56)
at cascading.flow.local.LocalFlowStep.createFlowStepJob(LocalFlowStep.java:135)
at cascading.flow.local.LocalFlowStep.createFlowStepJob(LocalFlowStep.java:38)
at cascading.flow.planner.BaseFlowStep.getFlowStepJob(BaseFlowStep.java:588)
at cascading.flow.BaseFlow.initializeNewJobsMap(BaseFlow.java:1162)
at cascading.flow.BaseFlow.initialize(BaseFlow.java:184)
at cascading.flow.local.planner.LocalPlanner.buildFlow(LocalPlanner.java:78)
... 2 more

Just to provide a bit more detail: you can't mix local and Hadoop classes in Cascading, as they assume different and incompatible environments. What's happening in your case is that you're trying to create a local flow with Hadoop taps; the latter expect a Hadoop JobConf instead of the Properties object used to configure local taps.
Your code will work if you use cascading.tap.local.FileTap (together with the matching cascading.scheme.local.TextLine scheme) instead of cascading.tap.hadoop.Hfs.
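For reference, here is a minimal sketch of the same copy flow using only local-mode classes (Cascading 2.x assumed; the class name and file paths are placeholders):

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.local.LocalFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.local.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.local.FileTap;

public class CascadingLocalCopy {

    public static void main(String[] args) {
        Properties properties = new Properties();

        // local connector paired with local taps and the local TextLine scheme
        LocalFlowConnector flowConnector = new LocalFlowConnector(properties);

        Tap inTap = new FileTap(new TextLine(), "data/rain.txt");                   // placeholder path
        Tap outTap = new FileTap(new TextLine(), "data/out.txt", SinkMode.REPLACE); // placeholder path

        Pipe copyPipe = new Pipe("copy");

        FlowDef flowDef = FlowDef.flowDef()
            .addSource(copyPipe, inTap)
            .addTailSink(copyPipe, outTap);

        Flow flow = flowConnector.connect(flowDef);
        flow.complete();
    }
}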

Welcome to Cascading -
I just answered on the Cascading user list, but in brief the problem is a mix of local and Hadoop mode classes: this code uses a LocalFlowConnector, but then uses Hfs taps.
When I revert to the classes used in the "Impatient" tutorial, it runs correctly:
https://gist.github.com/3784194
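In other words, if you keep the Hfs taps, the connector has to be the Hadoop one too. Roughly like this (a sketch, not the exact gist; the class name and paths are placeholders):

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class CascadingHadoopCopy {

    public static void main(String[] args) {
        Properties properties = new Properties();
        AppProps.setApplicationJarClass(properties, CascadingHadoopCopy.class);

        // Hadoop-mode connector paired with Hadoop-mode taps and schemes
        HadoopFlowConnector flowConnector = new HadoopFlowConnector(properties);

        Tap inTap = new Hfs(new TextLine(), "data/rain.txt");                   // placeholder path
        Tap outTap = new Hfs(new TextLine(), "data/out.txt", SinkMode.REPLACE); // placeholder path

        Pipe copyPipe = new Pipe("copy");

        FlowDef flowDef = FlowDef.flowDef()
            .addSource(copyPipe, inTap)
            .addTailSink(copyPipe, outTap);

        Flow flow = flowConnector.connect(flowDef);
        flow.complete();
    }
}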

Yes, you need to use an LFS (local file system) tap instead of an HFS (Hadoop file system) tap.
You can also test your code in local mode using JUnit test cases (with the cascading-unittest jar), directly from Eclipse.
http://www.cascading.org/2012/08/07/cascading-for-the-impatient-part-6/

Related

create `KafkaServer` from Java

I am trying to start a Kafka server from Java.
Specifically, how can I translate this line of Scala into a line of Java?
private val server = new KafkaServer(serverConfig, kafkaMetricsReporters = reporters)
I can create the serverConfig easily, but I can't seem to be able to create the kafkaMetricsReporters parameter.
Note: I can create a KafkaServerStartable but I would like to create a normal KafkaServer to avoid the JVM exiting in case of error.
Apache Kafka version 0.11.0.1
The kafkaMetricsReporters parameter is a Scala Seq.
You can either:
Create a Java collection and convert it into a Seq:
You need to import scala.collection.JavaConverters:
List<KafkaMetricsReporter> reportersList = new ArrayList<>();
...
Seq<KafkaMetricsReporter> reportersSeq = JavaConverters.asScalaBufferConverter(reportersList).asScala();
Use the KafkaMetricsReporter.startReporters() method to create them for you from your configuration:
As KafkaMetricsReporter is a Scala singleton object, you need to go through its MODULE$ field to call it from Java:
Seq<KafkaMetricsReporter> reporters = KafkaMetricsReporter$.MODULE$.startReporters(new VerifiableProperties(props));
Also, the KafkaServer constructor has two other arguments that are required when calling it from Java:
time can easily be created using new org.apache.kafka.common.utils.SystemTime()
threadNamePrefix is an Option. If you import scala.Option, you'll be able to call Option.apply("prefix")
Putting it all together:
Properties props = new Properties();
props.put(...);
KafkaConfig config = KafkaConfig.fromProps(props);
Seq<KafkaMetricsReporter> reporters = KafkaMetricsReporter$.MODULE$.startReporters(new VerifiableProperties(props));
KafkaServer server = new KafkaServer(config, new SystemTime(), Option.apply("prefix"), reporters);
server.startup();
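Wrapped up as a compilable class with the imports it relies on (a sketch only: the import locations are the usual ones for Kafka 0.11.x, and the class name, thread-name prefix, and minimal broker properties below are placeholders to adjust for your environment):

import java.util.Properties;

import kafka.metrics.KafkaMetricsReporter;
import kafka.metrics.KafkaMetricsReporter$;
import kafka.server.KafkaConfig;
import kafka.server.KafkaServer;
import kafka.utils.VerifiableProperties;
import org.apache.kafka.common.utils.SystemTime;
import scala.Option;
import scala.collection.Seq;

public class EmbeddedKafkaExample {

    public static void main(String[] args) {
        // minimal placeholder broker configuration
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181");
        props.put("broker.id", "0");
        props.put("log.dirs", "/tmp/kafka-logs");

        KafkaConfig config = KafkaConfig.fromProps(props);

        // let Kafka build the metrics reporters from the same properties
        Seq<KafkaMetricsReporter> reporters =
            KafkaMetricsReporter$.MODULE$.startReporters(new VerifiableProperties(props));

        KafkaServer server =
            new KafkaServer(config, new SystemTime(), Option.apply("embedded-kafka"), reporters);
        server.startup();
    }
}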

Trying to access s3 bucket using scala application

I want to access an Amazon S3 bucket from a Scala application. I have set up the Scala IDE in my Eclipse, but when I try to run the application locally (Run As --> Scala Application), it gives the following error on the console: Error: Could not find or load main class org.test.spark1.test. I am trying to run a simple wordcount application in which I access a file stored in my S3 bucket and store the results in another file. Please help me understand what the problem might be.
Note: I am using an Eclipse Maven project. My Scala application code is:
package org.test.spark1

import com.amazonaws._
import com.amazonaws.auth._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import com.amazonaws.services.s3._
import com.amazonaws.services.s3.model.GetObjectRequest
import java.io.File

object test extends App {
  def main(args: Array[String]) {
    val myAccessKey = "here is my key"
    val mySecretKey = "here is my secret key"
    val bucket = "nlp.spark.apps"

    val conf = new SparkConf().setAppName("sample")
    val sc = new SparkContext(conf)

    val yourAWSCredentials = new BasicAWSCredentials(myAccessKey, mySecretKey)
    val amazonS3Client = new AmazonS3Client(yourAWSCredentials)

    // This will create a bucket for storage
    amazonS3Client.createBucket("nlp-spark-apps2")

    val s3data = sc.textFile("here is my url of text file")

    s3data.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ * _)
      .saveAsTextFile("/home/hadoop/cluster-code2.txt")
  }
}
A possible solution I came across is that the Scala IDE does not automatically detect your main class:
Go to Menu --> "Run" --> "Run configurations"
Click on "Scala application" and on the icon for "New launch configuration"
For project select your project and for the main class (that for some reason is not auto-detected) manually enter (in your case) org.test.spark1.test
Apply and Run
OR
You could try to run your Spark job locally without eclipse using spark-submit.
spark-submit --class org.test.spark1.test --master local[8] {path to assembly jar}
Another thing: you should never hardcode your AWS credentials. I suggest you use InstanceProfileCredentialsProvider. These credentials are taken from the instance metadata associated with the IAM role for the EC2 instance.
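As a sketch (AWS SDK for Java v1; the class name is a placeholder), the S3 client can then pick its credentials up from the instance profile instead of from the source code:

import com.amazonaws.auth.InstanceProfileCredentialsProvider;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.Bucket;

public class S3InstanceProfileExample {

    public static void main(String[] args) {
        // credentials are resolved from the EC2 instance metadata service
        // (the IAM role attached to the instance), so nothing is hardcoded
        AmazonS3Client s3 = new AmazonS3Client(new InstanceProfileCredentialsProvider());

        for (Bucket bucket : s3.listBuckets()) {
            System.out.println(bucket.getName());
        }
    }
}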

How to fix NoClassDefFoundError for Kafka Producer Example?

I am getting a NoClassDefFoundError when I try to compile and run the Producer example that comes with Kafka. How do I resolve the error discussed below?
Caveat: I am a C++/C# programmer who is Linux literate and starting to
learn Java. I can follow instructions, but may well ask for some
clarification along the way.
I have a VM sandbox from Hortonworks that is running a Red Hat appliance. On it I have a working kafka server and by following this tutorial I am able to get the desired Producer posting messages to the server.
Now I want to get down to writing my own code, but first I decided to make sure I can compile the example files that Kafka came with. After a day of trial and error I just cannot seem to get this going.
Here is what I am doing:
I am going to the directory where the example files are located and typing:
javac -cp $KCORE:$KCLIENT:$SCORE:. ./*.java
$KCORE:$KCLIENT:$SCORE resolve to the jars for the Kafka core, Kafka client, and Scala libraries respectively. Everything returns just fine with no errors and places all the class files in the current directory; however, when I follow up with
java -cp $KCORE:$KCLIENT:$SCORE:. Producer
I get a NoClassDefFoundError.
The code for the class is:
package kafka.examples;

import java.util.Properties;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class Producer extends Thread
{
    private final kafka.javaapi.producer.Producer<Integer, String> producer;
    private final String topic;
    private final Properties props = new Properties();

    public Producer(String topic)
    {
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("metadata.broker.list", "localhost:9092");
        // Use random partitioner. Don't need the key type. Just set it to Integer.
        // The message is of type String.
        producer = new kafka.javaapi.producer.Producer<Integer, String>(new ProducerConfig(props));
        this.topic = topic;
    }

    public void run() {
        int messageNo = 1;
        while(true)
        {
            String messageStr = new String("Message_" + messageNo);
            producer.send(new KeyedMessage<Integer, String>(topic, messageStr));
            messageNo++;
        }
    }
}
Can anybody point me in the right direction to resolve this error? Do the classes need to go in different directories for some reason?
The package name is a part of the class name you need to supply on the command line:
java -cp $KCORE:$KCLIENT:$SCORE:. kafka.examples.Producer
Also, you should be standing in the root directory of your class hierarchy, which seems to be two directories up (you're currently standing in kafka/examples). Alternatively, you can use ../.. instead of . in the -cp argument to denote that the root is two directories up.
You might want to get familiar with using an IDE such as IntelliJ IDEA or Eclipse (and with how to use libraries in those IDEs), as this will make development much easier. Props for doing things the hard way first, though, as you'll get a better understanding of how things are stitched together (an IDE essentially figures out all those console commands for you without you noticing).

odd error when populating accumulo 1.6 mutation object via spark-notebook

I'm using spark-notebook to update an Accumulo table, employing the method specified in both the Accumulo documentation and the Accumulo example code. Below is, verbatim, what I put into the notebook, and the responses:
val clientRqrdTble = new ClientOnRequiredTable
val bwConfig = new BatchWriterConfig
val batchWriter = connector.createBatchWriter("batchtestY", bwConfig);
clientRqrdTble: org.apache.accumulo.core.cli.ClientOnRequiredTable =
org.apache.accumulo.core.cli.ClientOnRequiredTable#6c6a18ed bwConfig:
org.apache.accumulo.core.client.BatchWriterConfig =
[maxMemory=52428800, maxLatency=120000, maxWriteThreads=3,
timeout=9223372036854775807] batchWriter:
org.apache.accumulo.core.client.BatchWriter =
org.apache.accumulo.core.client.impl.BatchWriterImpl#298aa736
val rowIdS = rddX2_first._1.split(" ")(0)
rowIdS: String = row_0736460000
val mutation = new Mutation(new Text(rowIdS))
mutation: org.apache.accumulo.core.data.Mutation =
org.apache.accumulo.core.data.Mutation#0
mutation.put(
new Text("foo"),
new Text("1"),
new ColumnVisibility("exampleVis"),
new Value(new String("CHEWBACCA!").getBytes) )
java.lang.IllegalStateException: Can not add to mutation after
serializing it at
org.apache.accumulo.core.data.Mutation.put(Mutation.java:168) at
org.apache.accumulo.core.data.Mutation.put(Mutation.java:163) at
org.apache.accumulo.core.data.Mutation.put(Mutation.java:211)
I dug into the code and saw that the culprit is an if block that checks whether the UnsynchronizedBuffer.Writer buffer is null. The line numbers won't line up because this is a slightly different version than what's in the 1.6 accumulo-core jar; I've looked at both and the difference isn't one that matters in this case. As far as I can tell, the object is getting created prior to execution of that method and isn't getting dumped.
So either I'm missing something in the code, or something else is up. Do any of you know what might be causing this behavior?
UPDATE ONE
I've executed the following code using the Scala console and via straight Java 1.8. It fails in Scala, but not in Java. I'm thinking this is an Accumulo issue at this point. As such, I'm going to open a bug ticket and dig deeper into the source. If I come up with a resolution I'll post it here.
Below is the code in Java form. There's some extra stuff in there because I wanted to make sure I could connect to the table I created using the Accumulo batch writer example:
import java.util.Map.Entry;

import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.client.*;
import org.apache.accumulo.core.client.mapred.*;
import org.apache.accumulo.core.cli.ClientOnRequiredTable;
import org.apache.accumulo.core.cli.ClientOnRequiredTable.*;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.security.ColumnVisibility;
import org.apache.hadoop.conf.Configured.*;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.io.Text;

public class App {

    public static void main( String[] args ) throws
            AccumuloException,
            AccumuloSecurityException,
            TableNotFoundException {

        // connect to accumulo using a scanner
        // print first ten rows of a given table
        String instanceNameS = "accumulo";
        String zooServersS = "localhost:2181";
        Instance instance = new ZooKeeperInstance(instanceNameS, zooServersS);
        Connector connector = instance.getConnector( "root", new PasswordToken("password"));
        Authorizations auths = new Authorizations("exampleVis");
        Scanner scanner = connector.createScanner("batchtestY", auths);
        scanner.setRange(new Range("row_0000000001", "row_0000000010"));
        for (Entry<Key, Value> entry : scanner) {
            System.out.println(entry.getKey() + " is " + entry.getValue());
        }

        // stage up connection info objects for serialization
        ClientOnRequiredTable clientRqrdTble = new ClientOnRequiredTable();
        BatchWriterConfig bwConfig = new BatchWriterConfig();
        BatchWriter batchWriter = connector.createBatchWriter("batchtestY", bwConfig);

        // create mutation object
        Mutation mutation = new Mutation(new Text("row_0000000001"));

        // populate mutation object
        // -->THIS IS WHAT'S FAILING IN SCALA<--
        mutation.put(
            new Text("foo"),
            new Text("1"),
            new ColumnVisibility("exampleVis"),
            new Value(new String("CHEWBACCA!").getBytes()) );
    }
}
UPDATE TWO
An Accumulo bug ticket has been created for this issue. Their target is to have it fixed in v1.7.0. Until then, the solution I provided below is a functional work-around.
It looks like whatever is happening in spark-notebook when the new Mutation cell is executed is serializing the Mutation. You can't call put on a Mutation after it has been serialized. I would try adding the mutation.put calls to the same notebook cell as the new Mutation command. It looks like the clientRqrdTble/bwConfig/batchWriter commands are in a single multi-line cell, so hopefully this will be possible for the Mutation as well.
So it seems as though the code that works perfectly well with Java doesn't play nice with Scala. The solution (not necessarily a GOOD solution, but a working one) is to create a Java method in a self-contained jar that creates the mutation object and returns it. This way you can add the jar to Spark's classpath and call the method as needed. I tested this using spark-notebook and was successful in updating an existing Accumulo table. I'm still going to submit a ticket to the Accumulo folks, as this kind of work-around shouldn't be considered 'best practice'.
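For example, a minimal sketch of such a helper (the class and method names here are made up for illustration; it uses the same Accumulo calls as the code above):

import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;
import org.apache.hadoop.io.Text;

public class MutationFactory {

    // builds a fully populated Mutation in one call, so the notebook never
    // touches the Mutation again after it has been serialized
    public static Mutation buildMutation(String rowId, String family, String qualifier,
                                         String visibility, String value) {
        Mutation mutation = new Mutation(new Text(rowId));
        mutation.put(new Text(family), new Text(qualifier),
            new ColumnVisibility(visibility), new Value(value.getBytes()));
        return mutation;
    }
}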

SolrJetty logging - how to get custom log formatter to work?

I have a Solr server on Linux running under Jetty 6 and am trying to set up a custom formatter for Java logging; however, I can't seem to get it to recognize my custom class. I am new to Java, so it is quite possible it is an issue with how I am exporting my class or something like that. Note this is almost the same question as can be found here; however, the answer there does not help since I do have a public no-parameter constructor.
My formatter looks like the following (as described here):
package myapp.solr;

import java.text.MessageFormat;
import java.util.Date;
import java.util.logging.Formatter;
import java.util.logging.LogRecord;

public class LogFormatter extends Formatter {

    private static final MessageFormat fmt =
        new MessageFormat("{0,date,yyyy-MM-dd HH:mm:ss} {1} [{2}] {3}\n");

    public LogFormatter() {
        super();
    }

    @Override
    public String format(LogRecord record) {
        Object[] args = new Object[4];
        args[0] = new Date(record.getMillis());
        args[1] = record.getLevel();
        args[2] = record.getLoggerName() == null ? "root" : record.getLoggerName();
        args[3] = record.getMessage();
        return fmt.format(args);
    }
}
In my logging.properties file I then have the below (as well as properties to configure the file path/pattern and the rotation limit and count):
handlers = java.util.logging.FileHandler
java.util.logging.FileHandler.formatter = myapp.solr.LogFormatter
I then export my class into myapp.jar and put it in the lib/ext folder in jetty.home (I also tried placing it directly under lib, and I tried specifying the path to it with the -Djetty.class.path parameter). However, when I run my Solr app it still uses the XmlFormatter instead. I am able to successfully change it to use the SimpleFormatter, just not my own custom formatter.
I also created a test class that imports my LogFormatter, creates an instance variable and calls the format method and prints the result to the console and that worked without any issues from within Eclipse.
If it helps, the command I am using to start up Solr/Jetty is:
nohup java -DSTOP.PORT=8079 -DSTOP.KEY=secret -Dsolr.solr.home=../solr_home/local -Djava.util.logging.config.file=../solr_home/local/logging.properties -jar start.jar > /var/log/solr/stdout.log 2>&1 &
So what am I doing wrong, why won't it use my custom formatter?
Got it working thanks to some useful suggestions here. The problem was that Java logging is set up before the custom class is loaded from the lib or lib/ext folders, so I needed to add it into start.jar.
How I did that: I created a new package of my own called org.mortbay.start and added my custom LogFormatter class to it. Eclipse automatically built LogFormatter.class from that in the bin/org/mortbay/start folder. I then opened the start.jar archive and added my custom class into it, so I had start.jar/org/mortbay/start/LogFormatter.class. Once that was there, I was able to set my formatter using:
handlers = java.util.logging.FileHandler
java.util.logging.FileHandler.formatter = org.mortbay.start.LogFormatter
