I am using KafkaSpout. Please find the test program below.
I am using Storm 0.8.1. The MultiScheme class is only there in Storm 0.8.2, and I will be using that. I just want to know how the earlier versions of KafkaSpout worked by simply instantiating the StringScheme() class. Where can I download an earlier version of KafkaSpout? But I doubt that would be a better alternative than moving to Storm 0.8.2. (Confused.)
When I run the code (given below) on the Storm cluster (i.e. when I submit my topology) I get the following error. (This happens when the scheme part is commented out; otherwise I of course get a compile error, as the class is not there in 0.8.1.)
java.lang.NoClassDefFoundError: backtype/storm/spout/MultiScheme
at storm.kafka.TestTopology.main(TestTopology.java:37)
Caused by: java.lang.ClassNotFoundException: backtype.storm.spout.MultiScheme
In the code given below you may find the spoutConfig.scheme = new StringScheme(); part commented out. I was getting a compile error if I didn't comment out that line, which is only natural as the matching constructors are not there. Also, when I instantiate MultiScheme I get an error because I don't have that class in 0.8.1.
public class TestTopology {
public static class PrinterBolt extends BaseBasicBolt {
public void declareOutputFields(OutputFieldsDeclarer declarer) {
}
public void execute(Tuple tuple, BasicOutputCollector collector) {
System.out.println(tuple.toString());
}
}
public static void main(String [] args) throws Exception {
List<HostPort> hosts = new ArrayList<HostPort>();
hosts.add(new HostPort("127.0.0.1",9092));
LocalCluster cluster = new LocalCluster();
TopologyBuilder builder = new TopologyBuilder();
SpoutConfig spoutConfig = new SpoutConfig(new KafkaConfig.StaticHosts(hosts, 1), "test", "/zkRootStorm", "STORM-ID");
spoutConfig.zkServers=ImmutableList.of("localhost");
spoutConfig.zkPort=2181;
//spoutConfig.scheme=new StringScheme();
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
builder.setSpout("spout",new KafkaSpout(spoutConfig));
builder.setBolt("printer", new PrinterBolt())
.shuffleGrouping("spout");
Config config = new Config();
cluster.submitTopology("kafka-test", config, builder.createTopology());
Thread.sleep(600000);
}
}
I had the same problem. I finally resolved it and put a complete running example up on GitHub.
You are welcome to check it out here:
https://github.com/buildlackey/cep
(click on the storm+kafka directory for a sample program that should get you up and running).
We had a similar issue.
Our solution:
Open pom.xml
Change the scope of the Storm dependency from <scope>provided</scope> to <scope>compile</scope>
If you want to know more about dependency scopes, check the Maven documentation on dependency scopes.
I am using Dataflow at work to write some data into Bigtable. Currently, I have a task to read rows from Bigtable.
However, whenever I try to read rows from Bigtable using bigtable-hbase-dataflow, it fails and complains as follows.
Error: (3218070e4dd208d3): java.lang.IllegalArgumentException: b <= a
at org.apache.hadoop.hbase.util.Bytes.iterateOnSplits(Bytes.java:1720)
at org.apache.hadoop.hbase.util.Bytes.split(Bytes.java:1683)
at org.apache.hadoop.hbase.util.Bytes.split(Bytes.java:1664)
at com.google.cloud.bigtable.dataflow.CloudBigtableIO$AbstractSource.split(CloudBigtableIO.java:512)
at com.google.cloud.bigtable.dataflow.CloudBigtableIO$AbstractSource.getSplits(CloudBigtableIO.java:358)
at com.google.cloud.bigtable.dataflow.CloudBigtableIO$Source.splitIntoBundles(CloudBigtableIO.java:593)
at com.google.cloud.dataflow.sdk.runners.worker.WorkerCustomSources.performSplit(WorkerCustomSources.java:413)
at com.google.cloud.dataflow.sdk.runners.worker.WorkerCustomSources.performSplitWithApiLimit(WorkerCustomSources.java:171)
at com.google.cloud.dataflow.sdk.runners.worker.WorkerCustomSources.performSplit(WorkerCustomSources.java:149)
at com.google.cloud.dataflow.sdk.runners.worker.SourceOperationExecutor.execute(SourceOperationExecutor.java:58)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:288)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:221)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:173)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.doWork(DataflowWorkerHarness.java:193)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:173)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:160)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I am using 'com.google.cloud.dataflow:google-cloud-dataflow-java-sdk-all:1.6.0' and 'com.google.cloud.bigtable:bigtable-hbase-dataflow:0.9.0' now. Here's my code.
CloudBigtableScanConfiguration config = new CloudBigtableScanConfiguration.Builder()
.withProjectId("project-id")
.withInstanceId("instance-id")
.withTableId("table")
.build();
pipeline.apply(Read.<Result>from(CloudBigtableIO.read(config)))
.apply(ParDo.of(new Test()));
FYI, I just read from Bigtable and count rows using an aggregator in the Test DoFn.
static class Test extends DoFn<Result, Result> {
private static final long serialVersionUID = 0L;
private final Aggregator<Long, Long> rowCount = createAggregator("row_count", new Sum.SumLongFn());
@Override
public void processElement(ProcessContext c) {
rowCount.addValue(1L);
c.output(c.element());
}
}
I just followed the tutorial in the Dataflow documentation, but it fails. Can anyone help me out?
The root cause was a dependency issue:
Previously, our build file omitted this dependency:
compile 'io.netty:netty-tcnative-boringssl-static:1.1.33.Fork22'
Today, I added the dependency and it resolved all the issues. I double-checked that the problem arises when I don't have it in the build file.
From https://github.com/GoogleCloudPlatform/cloud-bigtable-client/issues/912#issuecomment-249999380.
I'm writing a UDF for Pig in Java. It works fine, but Pig doesn't give me a way to separate environments. What my Pig script does is get the geo location for an IP address.
Here's my code on the Geo location part.
private static final String GEO_DB = "GeoLite2-City.mmdb";
private static final String GEO_FILE = "/geo/" + GEO_DB;
public Map<String, Object> geoData(String ipStr) {
Map<String, Object> geoMap = new HashMap<String, Object>();
DatabaseReader reader = new DatabaseReader.Builder(new File(GEO_DB)).build();
// other stuff
}
GeoLite2-City.mmdb exists in HDFS, which is why I can refer to it by the absolute path /geo/GeoLite2-City.mmdb.
However, I can't do that from my JUnit test, or I would have to create /geo/GeoLite2-City.mmdb on my local machine and on Jenkins, which is not ideal. I'm trying to figure out a way to make my test pass while using new File(GEO_DB), and not
getClass().getResourceAsStream("./geo/GeoLite2-City.mmdb"),
because that doesn't work in Hadoop.
And if I run the JUnit test it fails, because I don't have /geo/GeoLite2-City.mmdb on my local machine.
Is there any way I can overcome this? I just want my tests to pass without changing the code to use getClass().getResourceAsStream, and I can't if/else around that because Pig doesn't give me a way to pass in a parameter (or maybe I'm missing something).
And this is my JUnit test
@Test
@Ignore
public void shouldGetGeoData() throws Exception {
String ipTest = "128.101.101.101";
Map<String, Object> geoJson = new LogLine2Json().geoData(ipTest);
assertThat(geoJson.get("lLa").toString(), is(equalTo("44.9759")));
assertThat(geoJson.get("lLo").toString(), is(equalTo("-93.2166")));
}
It works if I read the database file from the resources folder; that's why I have @Ignore on it.
Besides, your whole code looks pretty un-testable.
Every time you call new directly in your production code, you prevent dependency injection, and thereby make your code much harder to test.
The point is to not call new File() within your production code.
Instead, you could use a factory that gives you a "ready to use" DatabaseReader object. Then you can test that factory to make sure it does the right thing, and you can mock the factory when testing this code (to return a mocked database reader).
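A minimal sketch of what such a factory could look like, assuming the MaxMind DatabaseReader API already used in the question; the factory interface and class names here are made up for illustration:
import java.io.File;
import java.io.IOException;
import com.maxmind.geoip2.DatabaseReader;
// Production code depends on this interface; tests can mock it.
interface DatabaseReaderFactory {
    DatabaseReader create() throws IOException;
}
// The only place that touches the file system.
class FileDatabaseReaderFactory implements DatabaseReaderFactory {
    private final String path;
    FileDatabaseReaderFactory(String path) {
        this.path = path;
    }
    @Override
    public DatabaseReader create() throws IOException {
        return new DatabaseReader.Builder(new File(path)).build();
    }
}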
So that one File instance is just the tip of your "testing problems" here.
Honestly: don't write the production code first. Do TDD: write the test cases first, and you will quickly learn that production code like the one you are presenting here is really hard to test. When you apply TDD, you start from the test perspective, and you will create production code that is genuinely testable.
You have to make the file location configurable, e.g. inject it via the constructor. For instance, you could create a non-default constructor for testing only.
public class LogLine2Json {
private static final String DEFAULT_GEO_DB = "GeoLite2-City.mmdb";
private static final String DEFAULT_GEO_FILE = "/geo/" + DEFAULT_GEO_DB;
private final String geoFile;
public LogLine2Json() {
this(DEFAULT_GEO_FILE);
}
LogLine2Json(String geoFile) {
this.geoFile = geoFile;
}
public Map<String, Object> geoData(String ipStr) {
Map<String, Object> geoMap = new HashMap<String, Object>();
File file = new File(geoFile);
DatabaseReader reader = new DatabaseReader.Builder(file).build();
// other stuff
}
}
Now you can create a file from the resource and use this file in your test.
public class LogLine2JsonTest {
@Rule
public final TemporaryFolder folder = new TemporaryFolder();
@Test
public void shouldGetGeoData() throws Exception {
File dbFile = copyResourceToFile("/geo/GeoLite2-City.mmdb");
String ipTest = "128.101.101.101";
LogLine2Json logLine2Json = new LogLine2Json(dbFile.getAbsolutePath());
Map<String, Object> geoJson = logLine2Json.geoData(ipTest);
assertThat(geoJson.get("lLa").toString(), is(equalTo("44.9759")));
assertThat(geoJson.get("lLo").toString(), is(equalTo("-93.2166")));
}
private File copyResourceToFile(String name) throws IOException {
InputStream resource = getClass().getResourceAsStream(name);
File file = folder.newFile();
Files.copy(resource, file.toPath(), StandardCopyOption.REPLACE_EXISTING);
return file;
}
}
TemporaryFolder is a JUnit rule that deletes every file created during the test afterwards.
You may also rewrite the asserts using the hasToString matcher. This gives you more detailed information when a test fails (and you have to read/write less code):
assertThat(geoJson.get("lLa"), hasToString("44.9759"));
assertThat(geoJson.get("lLo"), hasToString("-93.2166"));
You don't. Your question embodies a contradiction in terms. Resources are not files and do not live in the file system. You can either distribute the file separately from the JAR and use it as a File or include it in the JAR and use it as a resource. Not both. You have to make up your mind.
I want to initialize my Redis address dynamically from the command line, and use it before a bolt's open method:
public class RunMyTopology {
@Parameter(names = { "-topologyName"}, description = "Topology name.")
private static String TOP_NAME = "demo";
@Parameter(names = { "-redisAddr"}, description = "Redis host address.", validateWith = IPValidator.class)
public static String REDIS_ADDR = "172.16.3.142";
public static void main(String[] args) throws AlreadyAliveException, InvalidTopologyException {
new JCommander(new RunMyTopology(), args);
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new Spout(REDIS_ADDR), 1);
builder.setBolt("fixerBolt",new FixerBolt(REDIS_Addr),1).fieldsGrouping("spout", new Fields("busId"));
// And many other bolts need REDIS_ADDR
Config conf = new Config();
conf.put(Config.TOPOLOGY_WORKERS, 22);
StormSubmitter.submitTopology(TOP_NAME, conf, builder.createTopology());
}
}
Right now I can achieve this by passing constructor parameters, but if I have many config values like the Redis address, this approach looks ugly. Is there another way to propagate the changed values?
Unfortunately, there is no externalization of properties in Apache Storm.
But you can use one of the many libraries available for this purpose, such as Spring (placeholder API) or Apache Commons Configuration (I personally use the latter with Storm, as it is quite lightweight and does the job well enough).
If you plan on using Commons Configuration:
define your property files for the different environments (DEV, PROD, ...)
parse a Commons Configuration first, with support for property overwriting (via an environment or system variable, for instance)
then put all the properties it contains into the Storm Config (filter some out if you take all system properties, as that can be full of noise)
finally you can start your cluster; a minimal sketch of the parse-and-copy steps is shown below
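The sketch assumes Commons Configuration 1.10 and environment-specific files such as config-DEV.properties on the classpath; the "env" system property, the file names, and the TopologyConfigLoader class are made up for illustration. The values end up in Storm's Config, which is serialized to the workers, so bolts and spouts can read them from the configuration Map passed to prepare()/open().
import java.util.Iterator;
import org.apache.commons.configuration.CompositeConfiguration;
import org.apache.commons.configuration.ConfigurationException;
import org.apache.commons.configuration.PropertiesConfiguration;
import org.apache.commons.configuration.SystemConfiguration;
import backtype.storm.Config;
public class TopologyConfigLoader {
    public static Config load() throws ConfigurationException {
        // System properties are added first, so they overwrite values from the file.
        CompositeConfiguration merged = new CompositeConfiguration();
        merged.addConfiguration(new SystemConfiguration());
        merged.addConfiguration(new PropertiesConfiguration(
                "config-" + System.getProperty("env", "DEV") + ".properties"));
        // Copy everything into Storm's Config.
        Config stormConf = new Config();
        Iterator<String> keys = merged.getKeys();
        while (keys.hasNext()) {
            String key = keys.next();
            stormConf.put(key, merged.getString(key));
        }
        return stormConf;
    }
}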
Hope that helps.
Here is a link to the documentation.
http://commons.apache.org/proper/commons-configuration/userguide_v1.10/overview.html#Using_Configuration
I have a Dropwizard project, and I maintain a config.yml file at the ROOT of the project (basically at the same level as pom.xml). In it I have specified the HTTP ports to be used as follows:
http:
  port: 9090
  adminPort: 9091
I have the following code in my TestService.java file
public class TestService extends Service<TestConfiguration> {
@Override
public void initialize(Bootstrap<TestConfiguration> bootstrap) {
bootstrap.setName("test");
}
@Override
public void run(TestConfiguration config, Environment env) throws Exception {
// initialize some resources here..
}
public static void main(String[] args) throws Exception {
new TestService().run(new String[] { "server" });
}
}
I expect the config.yml file to be used to determine the HTTP port. However, the app always seems to start with the default ports 8080 and 8081. Also note that I am running this from Eclipse.
Any insights as to what I am doing wrong here?
Try running your service as follows:
Rewrite your main method into:
public static void main(String[] args) throws Exception {
new TestService().run(args);
}
Then in Eclipse go to Run --> Run Configurations..., create a new run configuration for your class, go to Arguments, and add "server path/to/config.yml" under "Program arguments". If you put the config file in the root directory, it would be "server config.yml".
I believe you are not passing the location/name of the .yml file, and thus your configuration is not being loaded. Another solution is to just add the location of your config file to the array you're passing into run ;)
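For example, a minimal sketch of that second option (hardcoding the path is only reasonable for local runs; "config.yml" here assumes the file sits in the directory the JVM is started from, i.e. your project root when running from Eclipse):
public static void main(String[] args) throws Exception {
    // Equivalent to launching with the program arguments: server config.yml
    new TestService().run(new String[] { "server", "config.yml" });
}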
I'm new to Java and having some trouble running an Oozie job from Java code. I am unable to figure out the problem in the code. Some help would be really appreciated. Here's my code:
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;
public class oozie {
public static void main(String[] args) {
OozieClient wc = new OozieClient("http://host:11000/oozie");
Properties conf = wc.createConfiguration();
conf.setProperty(OozieClient.APP_PATH, "hdfs://cluster/user/apps/merge-psp-logs/merge-wf/workflow.xml");
conf.setProperty("jobTracker", "jobtracker.bigdata.com:8021");
conf.setProperty("nameNode", "hdfs://namenode.bigdata.com:8020");
conf.setProperty("queueName", "jobtracker.bigdata.com:8021");
conf.setProperty("appsRoot", "hdfs://namenode.bigdata.com:8020/user/workspace/apps");
conf.setProperty("appLibLoc", "hdfs://namenode.bigdata.com:8020/user/workspace/lib");
conf.setProperty("rawlogsLoc", "hdfs://namenode.bigdata.com:8020/user/workspace/");
conf.setProperty("mergedlogsLoc", "jobtracker.bigdata.com:8021");
try {
String jobId = wc.run(conf);
System.out.println("Workflow job submitted");
while (wc.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
System.out.println("Workflow job running ...");
Thread.sleep(10 * 1000);
}
System.out.println("Workflow job completed ...");
System.out.println(wc.getJobInfo(jobId));
} catch (Exception r) {
System.out.println("Errors");
}
}
}
I am, however, able to launch the job from the command line.
Without any further information, I would say this is the probable cause of your runtime errors:
conf.setProperty(OozieClient.APP_PATH,
"hdfs://cluster/user/apps/merge-psp-logs/merge-wf/workflow.xml");
conf.setProperty("jobTracker", "jobtracker.bigdata.com:8021");
conf.setProperty("nameNode", "hdfs://namenode.bigdata.com:8020");
conf.setProperty("queueName", "jobtracker.bigdata.com:8021");
Unless you have two clusters, my guess is you meant the APP_PATH to point to the same HDFS instance as the one named in your nameNode property, in which case try:
conf.setProperty(OozieClient.APP_PATH,
"hdfs://namenode.bigdata.com:8020/user/apps/merge-psp-logs/merge-wf/workflow.xml");
You might also want to change the queueName to a real queue name (probably "default", unless jobtracker.bigdata.com:8021 is the actual name of your queue):
conf.setProperty("queueName", "default");
Aside from those observations, try and post the actual runtime error you're seeing.
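To make that easier, you could have the catch block in your main method print the real exception rather than the bare "Errors" string, for example:
} catch (Exception r) {
    // Replaces the bare "Errors" message so the underlying cause is visible.
    System.out.println("Error submitting or monitoring the workflow job:");
    r.printStackTrace();
}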