I am trying to integrate the Kafka message broker with Spark and am facing an issue saying
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka010/LocationStrategies
Below is the Java Spark code:
package com.test.spell;
import java.util.Arrays;
/**
* Hello world!
*
*/
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;
import org.apache.spark.api.java.function.*;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import scala.Tuple2;
public class App
{
private static final Pattern SPACE = Pattern.compile(" ");
public static void main( String[] args )
{
String brokers = "localhost:9092";
String topics = "spark-topic";
// Create context with a 2 seconds batch interval
SparkConf sparkConf = new SparkConf().setAppName("JavaDirectKafkaWordCount");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
Set<String> topicsSet = new HashSet<>(Arrays.asList(topics.split(",")));
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", brokers);
// Create direct kafka stream with brokers and topics
JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(
jssc,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.Subscribe(topicsSet, kafkaParams));
System.out.println("In programn");
// Get the lines, split them into words, count the words and print
JavaDStream<String> lines = messages.map(new Function<ConsumerRecord<String,String>, String>() {
@Override
public String call(ConsumerRecord<String, String> kafkaRecord) throws Exception {
return kafkaRecord.value();
}
});
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
@Override
public Iterator<String> call(String line) throws Exception {
System.out.println(line);
return Arrays.asList(line.split(" ")).iterator();
}
});
/* JavaPairDStream<String,Integer> wordCounts = words.mapToPair(new PairFunction<String, String, Integer>() {
@Override
public Tuple2<String, Integer> call(String word) throws Exception {
return new Tuple2<>(word,1);
}
});*/
// wordCounts.print();
// Start the computation
jssc.start();
jssc.awaitTermination();
}
}
Below is my pom.xml
I have tried many jar file versions couldn't find the right one.
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.spark-project.spark</groupId>
<artifactId>unused</artifactId>
<version>1.0.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.10.3</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.8.2-beta</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>0.9.0-incubating</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.10</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.10</artifactId>
<version>1.3.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.3.1</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.10.0.0</version>
</dependency>
</dependencies>
</project>
I am running my spark job as follows:
./bin/spark-submit --class com.test.spell.spark.App \
--master local \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
--queue default \
/home/cwadmin/spark_programs/spell/target/spell-0.0.1-SNAPSHOT.jar
I feel that the above problem arises from using the wrong jar files. Can someone help me fix this? I would like to know which jar files should be used here. I would also be grateful if someone could share some useful resources on integrating Spark and Kafka.
I have been trying to fix this issue for two days and have been unable to solve it.
Thanks in advance.
First, you need to use the same version for all Spark dependencies - I see that you're using 2.1.0 for spark-core, 2.3.1 for spark-streaming, 2.0.0 for spark-streaming-kafka, etc.
Second, you need to use the same Scala version for these dependencies, and it should be the Scala version that your build of Spark was compiled with - you are currently mixing _2.10 and _2.11 artifacts.
Third, you don't need to explicitly specify dependencies for the Kafka libraries; spark-streaming-kafka-0-10 pulls in a matching kafka-clients transitively.
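For example, an aligned set of dependencies could look like this (just a sketch, assuming a Spark 2.3.1 installation built with Scala 2.11 - adjust the versions to match your actual cluster):
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.3.1</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
        <version>2.3.1</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
        <version>2.3.1</version>
        <!-- not provided: this one has to end up inside your application jar -->
    </dependency>
</dependencies>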
You need to build a fat jar of your application that includes the necessary libraries (except spark-core, which should be marked as provided). The easiest way to do this is to use the Maven Assembly plugin.
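A minimal configuration could look like this (a sketch using the built-in jar-with-dependencies descriptor; the plugin version is only an example):
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.3.0</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
Running mvn package should then also produce target/spell-0.0.1-SNAPSHOT-jar-with-dependencies.jar, which is the jar to pass to spark-submit.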
I'm having an issue setting up the MockServerClient to return multiple responses for the exact same request.
I read that this might be done with expectations using "Times", but I couldn't make it work in my scenario.
If you call the service with this JSON (twice):
{
"id": 1
}
The first response should be "passed true", the second "passed false"
Response 1:
{
"passed":true
}
Response 2:
{
"passed":false
}
I set up the first request, but how do I set up the second one?
import com.nice.project.MyService;
import com.nice.project.MyPojo;
import org.mockito.Mock;
import org.mockserver.integration.ClientAndServer;
import org.mockserver.matchers.TimeToLive;
import org.mockserver.matchers.Times;
import org.mockserver.model.Header;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpMethod;
import org.springframework.http.HttpStatus;
import org.springframework.http.MediaType;
import org.springframework.test.context.TestPropertySource;
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.Optional;
import java.util.concurrent.TimeUnit;
import static org.assertj.core.api.Assertions.assertThat;
import static org.mockito.ArgumentMatchers.contains;
import static org.mockito.Mockito.when;
import static org.mockserver.integration.ClientAndServer.startClientAndServer;
import static org.mockserver.model.HttpRequest.request;
import static org.mockserver.model.HttpResponse.response;
@SpringBootTest
public class Tests{
private static final int PORT = 9998;
private static ClientAndServer mockServer;
@Autowired
private MyService myService;
@BeforeAll
public void init(){
mockServer = startClientAndServer(PORT);
mockServer
.when(
request()
.withPath(testUrlValidateTransactionOk).withMethod(HttpMethod.POST.name())
.withHeaders(
new Header(HttpHeaders.CONTENT_TYPE, MediaType.APPLICATION_JSON.toString())
)
.withBody(contains("\"id\":\"1\""))
).respond(
response().withStatusCode(HttpStatus.OK.value())
.withHeaders(
new Header(HttpHeaders.CONTENT_TYPE, MediaType.APPLICATION_JSON.toString())
)
.withBody("{\"passed\":true}"));
// What do i set here? Or in the snippet before by chaining?
// mockServer.when()...
}
@Test
void t1() {
//myService will internally call the MockServer
//FIRST CALL -> Pass
MyPojo p = myService.call(1);
assertThat(p.isPassed()).isEqualTo(Boolean.TRUE);
//SECOND CALL -> No Pass
MyPojo p2 = myService.call(1);
assertThat(p2.isPassed()).isEqualTo(Boolean.FALSE);
}
}
Dependencies (relevant):
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>2.4.4</version>
</parent>
<groupId>com.nice.project</groupId>
<artifactId>testing</artifactId>
<version>0.0.1-SNAPSHOT</version>
<name>test</name>
<description>Testing</description>
<properties>
<java.version>1.8</java.version>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<httpclient.version>4.5.13</httpclient.version>
<mock-server.version>5.11.2</mock-server.version>
</properties>
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-json</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-validation</artifactId>
</dependency>
<!--HTTP CLIENT-->
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
</dependency>
<!--TEST-->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.junit.jupiter</groupId>
<artifactId>junit-jupiter-engine</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.mock-server</groupId>
<artifactId>mockserver-netty</artifactId>
<version>${mock-server.version}</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.mock-server</groupId>
<artifactId>mockserver-client-java</artifactId>
<version>${mock-server.version}</version>
<scope>test</scope>
</dependency>
</dependencies>
</project>
Thank you in advance.
After diving into the documentation and testing, I found that you can specify the number of "Times" an expectation will be matched, which solves my problem perfectly.
Link: https://www.mock-server.com/mock_server/creating_expectations.html
For every expectation I used "Times.exactly(1)".
The way this works is that each expectation is added to a list; when it is matched, it is consumed and removed from the list, leaving the following ones. If no expectation matches a call, the mock server returns a 404.
Link for examples from documentation:
https://www.mock-server.com/mock_server/creating_expectations.html#button_match_request_by_path_exactly_twice
Correct code:
//The first call will land here, and then this expectation will be deleted, remaining the next one
mockServer
.when(
request()
.withPath(testUrlValidateTransactionOk).withMethod(HttpMethod.POST.name())
.withHeaders(
new Header(HttpHeaders.CONTENT_TYPE, MediaType.APPLICATION_JSON.toString())
)
.withBody(
json("{\"id\":1}",
MatchType.ONLY_MATCHING_FIELDS)),
Times.exactly(1)
).respond(
response().withStatusCode(HttpStatus.OK.value())
.withHeaders(
new Header(HttpHeaders.CONTENT_TYPE, MediaType.APPLICATION_JSON.toString())
)
.withBody("{\"passed\":true}"));
//After the first call this will be consumed and removed, leaving no expectations
mockServer
.when(
request()
.withPath(testUrlValidateTransactionOk).withMethod(HttpMethod.POST.name())
.withHeaders(
new Header(HttpHeaders.CONTENT_TYPE, MediaType.APPLICATION_JSON.toString())
)
.withBody(
json("{\"id\":1}",
MatchType.ONLY_MATCHING_FIELDS)),
Times.exactly(1)
).respond(
response().withStatusCode(HttpStatus.OK.value())
.withHeaders(
new Header(HttpHeaders.CONTENT_TYPE, MediaType.APPLICATION_JSON.toString())
)
.withBody("{\"passed\":false}"));
You can create a sequence of responses by wrapping the when/request/response behavior in a method and calling it multiple times, like this:
private void whenValidateTransactionReturn(boolean isPassed) {
mockServer
.when(
request()
.withPath(testUrlValidateTransactionOk)
.withMethod(HttpMethod.POST.name())
.withHeaders(
new Header(HttpHeaders.CONTENT_TYPE, MediaType.APPLICATION_JSON.toString()))
.withBody(contains("\"id\":\"1\"")))
.respond(
response()
.withStatusCode(HttpStatus.OK.value())
.withHeaders(
new Header(HttpHeaders.CONTENT_TYPE, MediaType.APPLICATION_JSON.toString()))
.withBody("{\"passed\":" + isPassed + "}"));
}
Then you can call this method multiple times:
@Test
void testValidationFailsSecondTime() {
whenValidateTransactionReturn(true);
whenValidateTransactionReturn(false);
//
// Test logic
//
// mockServer.verify(...);
}
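If you also want to assert that the stub was actually hit twice, MockServer's verify API can be used at the end of the test (a small sketch; the matcher should mirror the one used in the expectations, and VerificationTimes comes from org.mockserver.verify):
// verify the endpoint was called exactly twice
mockServer.verify(
        request()
                .withPath(testUrlValidateTransactionOk)
                .withMethod(HttpMethod.POST.name()),
        VerificationTimes.exactly(2));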
I implemented an ElasticSearchConsumer class which is supposed to return the document Id. The running program displays a warning message and returns just twitter:
déc. 23, 2020 9:41:10 AM org.elasticsearch.client.RestClient logResponse
WARNING: request [POST https://kafka-course-6054260476.us-east-1.bonsaisearch.net:443/twitter/tweets?timeout=1m] returned 1 warnings: [299 Elasticsearch-7.2.0-508c38a "[types removal] Specifying types in document index requests is deprecated, use the typeless endpoints instead (/{index}/_doc/{id}, /{index}/_doc, or /{index}/_create/{id})."]
[main] INFO com.gihub.simplesteph.kafka.tutorial3.ElasticSearchConsumer - twitter
Process finished with exit code 0
This is the code:
package com.gihub.simplesteph.kafka.tutorial3;
import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
public class ElasticSearchConsumer {
public static RestHighLevelClient createClient(){
// replace with your own credentials
String hostname = "kafka-course-6054260476.us-east-1.bonsaisearch.net";
String username = "48h3frssnm";
String password = "8iliybmly0";
//don't do if you run a local ES
final CredentialsProvider credentialsProvider = new BasicCredentialsProvider();
credentialsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials(username, password));
RestClientBuilder builder = RestClient.builder(
new HttpHost(hostname, 443, "https"))
.setHttpClientConfigCallback(new RestClientBuilder.HttpClientConfigCallback() {
@Override
public HttpAsyncClientBuilder customizeHttpClient(HttpAsyncClientBuilder httpAsyncClientBuilder) {
return httpAsyncClientBuilder.setDefaultCredentialsProvider(credentialsProvider);
}
});
RestHighLevelClient client = new RestHighLevelClient(builder);
return client;
}
public static void main(String[] args) throws IOException {
Logger logger = LoggerFactory.getLogger(ElasticSearchConsumer.class.getName());
RestHighLevelClient client = createClient();
String jsonString = "{ \"foo\": \"bar\" }";
IndexRequest indexRequest = new IndexRequest("twitter", "tweets" ).source(jsonString, XContentType.JSON);
IndexResponse indexResponse = client.index(indexRequest, RequestOptions.DEFAULT);
String id = indexResponse.getIndex();
logger.info(id);
//close the client gracefully
client.close();
}
}
This is the pom.xml file:
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<artifactId>kafka-beginners-course</artifactId>
<groupId>org.example</groupId>
<version>1.0-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>kafka-consumer-elasticsearch</artifactId>
<properties>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
</properties>
<dependencies>
<!-- https://mvnrepository.com/artifact/org.elasticsearch.client/elasticsearch-rest-high-level-client -->
<dependency>
<groupId>org.elasticsearch.client</groupId>
<artifactId>elasticsearch-rest-high-level-client</artifactId>
<version>7.10.1</version>
</dependency>
<dependency>
<groupId>org.elasticsearch.client</groupId>
<artifactId>elasticsearch-rest-client</artifactId>
<version>7.10.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>2.0.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-simple -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
<version>1.7.25</version>
<!-- <scope>test</scope>-->
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpasyncclient</artifactId>
<version>4.1.4</version>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpcore-nio</artifactId>
<version>4.4.14</version>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.13</version>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpcore</artifactId>
<version>4.4.14</version>
</dependency>
<dependency>
<groupId>commons-codec</groupId>
<artifactId>commons-codec</artifactId>
<version>1.15</version>
</dependency>
<dependency>
<groupId>commons-logging</groupId>
<artifactId>commons-logging</artifactId>
<version>1.2</version>
</dependency>
</dependencies>
</project>
You can use the following form; it removes the warning message:
IndexRequest indexRequest = new IndexRequest("twitter", "_doc", tweetId);
The warning is not related to what is returned; it is related to the deprecated IndexRequest() constructor. You can either ignore it or stop passing the type argument (the second argument) to the constructor.
You retrieved the index name with the getIndex() method, but you should retrieve the Id with the getId() method.
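Putting these points together, a minimal version of the indexing part of the posted main method could look like this (a sketch against the 7.x high level REST client):
// typeless request: no "tweets" type argument, so the deprecation warning disappears
IndexRequest indexRequest = new IndexRequest("twitter").source(jsonString, XContentType.JSON);
IndexResponse indexResponse = client.index(indexRequest, RequestOptions.DEFAULT);
// getId() returns the generated document id; getIndex() only returns the index name ("twitter")
String id = indexResponse.getId();
logger.info(id);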
I am trying to write some records into a Parquet file in Java.
Following is my sample code:
import org.apache.avro.reflect.ReflectData;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import java.util.Date;
import java.util.Set;
import static org.apache.parquet.hadoop.ParquetFileWriter.Mode.OVERWRITE;
import static org.apache.parquet.hadoop.metadata.CompressionCodecName.SNAPPY;
public class App {
public static void main(String[] args) {
Path dataFile = new Path("/tmp/UpdateMetaData.snappy.parquet");
try {
ParquetWriter<UpdateMeta> writer = AvroParquetWriter.<UpdateMeta>builder(dataFile)
.withSchema(ReflectData.AllowNull.get().getSchema(UpdateMeta.class))
.withDataModel(ReflectData.get())
.withConf(new Configuration())
.withCompressionCodec(SNAPPY)
.withWriteMode(OVERWRITE)
.build();
} catch (Exception e) {
e.printStackTrace();
}
}
}
class UpdateMeta {
String updatedBy;
Date updatedAt;
Set<EmailContentField> emailContentField;
}
But I am getting the following exception:
org.apache.parquet.schema.InvalidSchemaException: A group type can not be empty. Parquet does not support empty group without leaves. Empty group: updatedAt
    at org.apache.parquet.schema.GroupType.<init>(GroupType.java:92)
    at org.apache.parquet.schema.GroupType.<init>(GroupType.java:48)
    at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:132)
    at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:174)
    at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:151)
    at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:112)
    at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:187)
    at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:106)
    at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:97)
    at org.apache.parquet.avro.AvroParquetWriter.writeSupport(AvroParquetWriter.java:144)
    at org.apache.parquet.avro.AvroParquetWriter.access$100(AvroParquetWriter.java:35)
    at org.apache.parquet.avro.AvroParquetWriter$Builder.getWriteSupport(AvroParquetWriter.java:173)
    at org.apache.parquet.hadoop.ParquetWriter$Builder.build(ParquetWriter.java:489)
    at com.gartner.emailactivityimporter.dao.App.main(App.java:26)
Following are the dependencies I am using in my pom file:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.0</version>
<exclusions>
<exclusion>
<groupId>javax.servlet</groupId>
<artifactId>servlet-api</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.parquet</groupId>
<artifactId>parquet-hadoop</artifactId>
<version>1.8.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.parquet/parquet-avro -->
<dependency>
<groupId>org.apache.parquet</groupId>
<artifactId>parquet-avro</artifactId>
<version>1.8.1</version>
</dependency>
Please help me solve this exception.
Thanks
We can't write a Date/Timestamp to Parquet directly with the above dependency versions.
So we need to convert the Date/Timestamp to either a String or a long first. That worked for me.
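For example, the POJO from the question could store the timestamp as epoch milliseconds (a sketch of this workaround; field names are taken from the question):
class UpdateMeta {
    String updatedBy;
    long updatedAt;                           // epoch millis instead of java.util.Date
    Set<EmailContentField> emailContentField;
}

// when populating a record:
// meta.updatedAt = someDate.getTime();       // or System.currentTimeMillis()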
Please comment if you have any other solution or suggestion.
Thanks
We're currently trying to bulk load some files from HDFS into Titan using a MapReduce job and the Titan dependencies. However, once the map tasks start we run into an issue where a TinkerPop class can't be found. This is the error:
java.lang.ClassNotFoundException: org.apache.tinkerpop.gremlin.structure.Vertex
I read somewhere that Titan 1.0.0 is only compatible with TinkerPop 3.0.1-incubating, so those are the versions we use for our dependencies.
It might help to see our pom.xml and code.
pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>replacementID</groupId>
<artifactId>replacementID</artifactId>
<version>0.0.1-SNAPSHOT</version>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.3</version>
</dependency>
<dependency>
<groupId>com.thinkaurelius.titan</groupId>
<artifactId>titan-hbase</artifactId>
<version>1.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.tinkerpop</groupId>
<artifactId>hadoop-gremlin</artifactId>
<version>3.0.1-incubating</version>
</dependency>
<dependency>
<groupId>org.apache.tinkerpop</groupId>
<artifactId>gremlin-core</artifactId>
<version>3.0.1-incubating</version>
</dependency>
<dependency>
<groupId>org.apache.tinkerpop</groupId>
<artifactId>gremlin-driver</artifactId>
<version>3.0.1-incubating</version>
</dependency>
</dependencies>
</project>
Mapper:
package edu.rosehulman.brubakbd;
import java.io.IOException;
import org.apache.commons.configuration.BaseConfiguration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import com.thinkaurelius.titan.core.TitanVertex;
import org.apache.tinkerpop.gremlin.structure.Vertex;
public class TitanMRMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
String line = value.toString();
String[] vals = line.split("\t");
BaseConfiguration conf = new BaseConfiguration();
conf.setProperty("gremlin.graph", "com.thinkaurelius.titan.core.TitanFactory");
conf.setProperty("storage.backend", "hbase");
conf.setProperty("storage.hostname", "hadoop-16.csse.rose-hulman.edu");
conf.setProperty("storage.batch-loading", true);
conf.setProperty("storage.hbase.ext.zookeeper.znode.parent","/hbase-unsecure");
conf.setProperty("storage.hbase.ext.hbase.zookeeper.property.clientPort", 2181);
conf.setProperty("cache.db-cache",true);
conf.setProperty("cache.db-cache-clean-wait", 20);
conf.setProperty("cache.db-cache-time", 180000);
conf.setProperty("cache.db-cache-size", 0.5);
TitanGraph graph = TitanFactory.open(conf);
TitanVertex v1 = graph.addVertex();
v1.property("pageID", vals[0]);
TitanVertex v2 = graph.addVertex();
v2.property("pageID", vals[1]);
v1.addEdge("links_To", v2);
graph.tx().commit();
}
}
Driver:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
public class TitanMR {
public static void main(String[] args) throws Exception{
if (args.length != 1){
System.err.println("Usage: TitanMR <input path>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(TitanMR.class);
job.setJobName("TitanMR");
FileInputFormat.addInputPath(job, new Path(args[0]));
job.setOutputFormatClass(NullOutputFormat.class);
job.setMapperClass(TitanMRMapper.class);
job.setNumReduceTasks(0);
System.out.println("about to submit job");
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
I'd suggest that you look into creating an uber-jar that contains all of your project dependencies. Since you're using Apache Maven for your build, you can use the Apache Maven Assembly Plugin or the Apache Maven Shade Plugin.
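For example, with the Maven Shade Plugin the build section could look roughly like this (just a sketch; the mainClass below is a guess based on your driver class and should be replaced with its actual fully qualified name):
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.4.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <transformers>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                <mainClass>TitanMR</mainClass>
                            </transformer>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
The shaded jar produced by mvn package should then contain the TinkerPop classes, so the mappers should no longer fail with the ClassNotFoundException.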
Upgrade your gremlin jars in pom.xml
<dependency>
<groupId>org.apache.tinkerpop</groupId>
<artifactId>hadoop-gremlin</artifactId>
<version>3.2.3</version>
</dependency>
<dependency>
<groupId>org.apache.tinkerpop</groupId>
<artifactId>gremlin-core</artifactId>
<version>3.2.3</version>
</dependency>
<dependency>
<groupId>org.apache.tinkerpop</groupId>
<artifactId>gremlin-driver</artifactId>
<version>3.2.3</version>
</dependency>
I'm trying to read data from MongoDB with Spark using the mongo-hadoop connector.
I tried different versions of the mongo-hadoop connector jar but am still getting this error.
There is no error at compile time.
What can I do to resolve this?
Thanks in advance.
Exception in thread "main" java.lang.NoClassDefFoundError: com/mongodb/hadoop/MongoInputFormat
at com.geekcap.javaworld.wordcount.Mongo.main(Mongo.java:47)
Caused by: java.lang.ClassNotFoundException: com.mongodb.hadoop.MongoInputFormat
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 1 more
My Code
import com.mongodb.hadoop.BSONFileOutputFormat;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedList;
import java.util.Queue;
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.bson.BSONObject;
public class MongoTest {
// Set configuration options for the MongoDB Hadoop Connector.
public static void main(String[] args) {
SparkConf conf = new SparkConf().setMaster("local").setAppName("App1");
JavaSparkContext sc = new JavaSparkContext(conf);
Configuration mongodbConfig;
mongodbConfig = new Configuration();
mongodbConfig.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
mongodbConfig.set("mongo.input.uri","mongodb://localhost:27017/MyCollectionName.collection");
JavaPairRDD<Object, BSONObject> documents = sc.newAPIHadoopRDD(
mongodbConfig, // Configuration
MongoInputFormat.class, // InputFormat: read from a live cluster.
Object.class, // Key class
BSONObject.class // Value class
);
documents.saveAsTextFile("b.txt");
}
}
pom.xml dependencies:
<!-- Import Spark -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>1.4.0</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.mongodb</groupId>
<artifactId>mongodb-driver</artifactId>
<version>3.0.4</version>
</dependency>
<dependency>
<groupId>hadoopCom</groupId>
<artifactId>com.sample</artifactId>
<version>1.0</version>
<scope>system</scope>
<systemPath>/home/sys6002/NetBeansProjects/WordCount/lib/hadoop-common-2.7.1.jar</systemPath>
</dependency>
<dependency>
<groupId>hadoopCon1</groupId>
<artifactId>com.sample1</artifactId>
<version>1.0</version>
<scope>system</scope>
<systemPath>/home/sys6002/Downloads/mongo-hadoop-core-1.3.0.jar</systemPath>
</dependency>
</dependencies>
After several trials and changes, I got this working.
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>1.5.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>1.5.1</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.14</version>
</dependency>
<dependency>
<groupId>org.mongodb.mongo-hadoop</groupId>
<artifactId>mongo-hadoop-core</artifactId>
<version>1.4.1</version>
</dependency>
</dependencies>
Java code
Configuration conf = new Configuration();
conf.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat");
conf.set("mongo.input.uri", "mongodb://localhost:27017/databasename.collectionname");
SparkConf sconf = new SparkConf().setMaster("local").setAppName("Spark UM Jar");
JavaSparkContext sc = new JavaSparkContext(sconf);
JavaRDD<User> UserMaster = sc.newAPIHadoopRDD(conf, MongoInputFormat.class, Object.class, BSONObject.class)
.map(new Function<Tuple2<Object, BSONObject>, User>() {
@Override
public User call(Tuple2<Object, BSONObject> v1) throws Exception {
//return User
}
}