How to debug and test MapReduce on a local Windows machine? - java

I have found debugging and testing a MapReduce project challenging. Here is my mapper:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.log4j.Logger;
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
import writables.Friend;
import writables.FriendArray;
import writables.FriendPair;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
public class FacebookFriendsMapper extends Mapper<LongWritable, Text, FriendPair, FriendArray> {
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
Logger log = Logger.getLogger(FacebookFriendsMapper.class);
StringTokenizer st = new StringTokenizer(value.toString(), "\t");
String person = st.nextToken();
String friends = st.nextToken();
Friend f1 = populateFriend(person);
List<Friend> friendList = populateFriendList(friends);
Friend[] friendArray = Arrays.copyOf(friendList.toArray(), friendList.toArray().length, Friend[].class);
FriendArray farray = new FriendArray(Friend.class, friendArray);
for(Friend f2 : friendList) {
FriendPair fpair = new FriendPair(f1, f2);
context.write(fpair, farray);
log.info(fpair+"......"+ farray);
}
}
private Friend populateFriend(String friendJson) {
JSONParser parser = new JSONParser();
Friend friend = null;
try {
Object obj = (Object)parser.parse(friendJson);
JSONObject jsonObject = (JSONObject) obj;
Long lid = (long)jsonObject.get("id");
IntWritable id = new IntWritable(lid.intValue());
Text name = new Text((String)jsonObject.get("name"));
Text hometown = new Text((String)jsonObject.get("hometown"));
friend = new Friend(id, name, hometown);
} catch (ParseException e) {
e.printStackTrace();
}
return friend;
}
private List<Friend> populateFriendList(String friendsJson) {
List<Friend> friendList = new ArrayList<Friend>();
try {
JSONParser parser = new JSONParser();
Object obj = (Object)parser.parse(friendsJson.toString());
JSONArray jsonarray = (JSONArray) obj;
for(Object jobj : jsonarray) {
JSONObject entry = (JSONObject)jobj;
Long lid = (long)entry.get("id");
IntWritable id = new IntWritable(lid.intValue());
Text name = new Text((String)entry.get("name"));
Text hometown = new Text((String)entry.get("hometown"));
Friend friend = new Friend(id, name, hometown);
friendList.add(friend);
}
} catch (ParseException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return friendList;
}
}
For debugging and testing, I usually take the logic above, put it inside a public static void main(String[] args) in a separate test class, and run it in debug mode in IntelliJ IDEA, reading sample data from the local filesystem. So I am fairly confident that the mapper's logic is correct.
As for the reducer, I am not sure exactly how the mapper passes its output to the reducer. I looked at sample Reducer implementations during my research and came up with this initial version of my reducer:
public class FacebookFriendsReducer extends
Reducer<FriendPair, FriendArray, FriendPair, FriendArray> {
@Override
public void reduce(FriendPair key, Iterable<FriendArray> values, Context context)
throws IOException, InterruptedException {
}
}
This is where I cannot proceed further, because I cannot simulate how the mapper passes its output to FacebookFriendsReducer and the reduce method. My current debugging approach is to write the reducer logic in a public static void main(String[] args), run it in debug mode, and only then move it into the reducer class.
Can someone show me how to pass the mapper's output into the reducer correctly, so that I can keep working on the logic?
If you have a better alternative for debugging and testing MapReduce on a local Windows machine before packaging it into a jar and shipping it to the Hadoop cluster, please let me know.
Edit for @OneCricketeer's answer:
You can check the Driver (main class) as below:
public class FacebookFriendsDriver extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
String inputPath = "E:\\sampleInputPath\\inputFile";
String outputPath = "E:\\sampleOutputPath\\outputFile";
// if (args.length != 2) {
// System.err.println("Usage: fberature <input path> <output path>");
// System.exit(-1);
// }
//Job Setup
Job fb = Job.getInstance(getConf(), "facebook-friends");
fb.setJarByClass(FacebookFriendsDriver.class);
//File Input and Output format
FileInputFormat.addInputPath(fb, new Path(inputPath));
FileOutputFormat.setOutputPath(fb, new Path(outputPath));
fb.setInputFormatClass(TextInputFormat.class);
fb.setOutputFormatClass(SequenceFileOutputFormat.class);
//Mapper-Reducer-Combiner specifications
fb.setMapperClass(FacebookFriendsMapper.class);
fb.setReducerClass(FacebookFriendsReducer.class);
fb.setMapOutputKeyClass(FriendPair.class);
fb.setMapOutputValueClass(FriendArray.class);
//Output key and value
fb.setOutputKeyClass(FriendPair.class);
fb.setOutputValueClass(FriendArray.class);
//Submit job
return fb.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new FacebookFriendsDriver(), args);
System.exit(exitCode);
}
}
I created the sample Driver class above based on other MapReduce jobs that already exist in our system, but I cannot make it work on my local Windows machine; it fails with the error below:
Connected to the target VM, address: '127.0.0.1:59143', transport: 'socket'
23/01/10 10:52:22 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:324)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:339)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:332)
at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:431)
at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:477)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:171)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:154)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at FacebookFriendsDriver.main(FacebookFriendsDriver.java:60)
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/C:/Users/Holyken/.m2/repository/org/apache/hadoop/hadoop-auth/2.3.0-cdh5.1.0/hadoop-auth-2.3.0-cdh5.1.0.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
23/01/10 10:52:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/01/10 10:52:23 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
23/01/10 10:52:23 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
Exception in thread "main" java.lang.NullPointerException
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1090)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:451)
at org.apache.hadoop.util.Shell.run(Shell.java:424)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:656)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:745)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:728)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:421)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:281)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:126)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:982)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:976)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:976)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:582)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:612)
at FacebookFriendsDriver.run(FacebookFriendsDriver.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at FacebookFriendsDriver.main(FacebookFriendsDriver.java:60)
Disconnected from the target VM, address: '127.0.0.1:59143', transport: 'socket'
Can you elaborate on how I can run a MapReduce job against my local filesystem?

You can set breakpoints in the code from an IDE. You don't even need a real Hadoop cluster; the code runs the same against the local filesystem.
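For example, you can force local execution in the driver with something roughly like this (a sketch, assuming the standard Hadoop 2.x configuration keys; on Windows, winutils.exe may still be needed for local filesystem permission calls):
// Inside FacebookFriendsDriver.run(), before creating the Job
Configuration conf = getConf();
conf.set("mapreduce.framework.name", "local"); // use the LocalJobRunner, no cluster required
conf.set("fs.defaultFS", "file:///");          // read input and write output on the local filesystem
Job fb = Job.getInstance(conf, "facebook-friends");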
Otherwise, you can write unit tests as well. For instance, your JSON parsing function looks like it can return null on a parse exception, and then you continue adding null values into your mapper output... You also don't need to convert a list to an array just to create the output array.
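For instance, with MRUnit on the test classpath (an assumption, not part of the original answer), a mapper test looks roughly like this; the sample record is hypothetical:
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;
public class FacebookFriendsMapperTest {
    @Test
    public void mapEmitsOnePairPerFriend() throws Exception {
        MapDriver<LongWritable, Text, FriendPair, FriendArray> driver =
                MapDriver.newMapDriver(new FacebookFriendsMapper());
        // One tab-separated line: the person JSON, then the friends JSON array
        String line = "{\"id\":1,\"name\":\"A\",\"hometown\":\"X\"}\t[{\"id\":2,\"name\":\"B\",\"hometown\":\"Y\"}]";
        driver.withInput(new LongWritable(0), new Text(line));
        // driver.run() returns the emitted (FriendPair, FriendArray) pairs for manual assertions;
        // withOutput(...).runTest() also works if the writables implement equals()/hashCode()
        driver.run();
    }
}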
Your MapReduce Job driver's main method is what you'd start in the debugger.
cannot simulate how the mapper passes its output to the FacebookFriendsReducer
The parameters are given like a GROUP BY key operation: your value is an Iterable of arrays, so you need to loop over them.
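A skeleton of that loop, using your own writables (a sketch; fill in the real logic):
@Override
public void reduce(FriendPair key, Iterable<FriendArray> values, Context context)
        throws IOException, InterruptedException {
    // Everything the mappers emitted for this FriendPair arrives here grouped together,
    // much like rows sharing a GROUP BY key.
    for (FriendArray friendArray : values) {
        // compare/intersect the arrays here, then write whatever the job actually needs
        context.write(key, friendArray); // placeholder output; adjust the types and logic
    }
}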
It's not clear what your reducer needs to output, so the output types might not be correct.

Related

Kafka Connect Starts, but nothing happens

I am writing a Kafka source connector based on a working producer that I use for audio files. The connector starts, but nothing happens: no errors, no data. I am not sure whether this is a coding problem or a configuration problem.
The connector should read an entire directory and read each file as a byte array.
Config class:
package hothman.example;
import org.apache.kafka.common.config.AbstractConfig;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.common.config.ConfigDef.Type;
import org.apache.kafka.common.config.ConfigDef.Importance;
import java.util.Map;
public class AudioSourceConnectorConfig extends AbstractConfig {
public static final String FILENAME_CONFIG="fileName";
private static final String FILENAME_DOC ="Enter the path of the audio files";
public static final String TOPIC_CONFIG = "topic";
private static final String TOPIC_DOC = "Enter the topic to write to..";
public AudioSourceConnectorConfig(ConfigDef config, Map<String, String> parsedConfig) {
super(config, parsedConfig);
}
public AudioSourceConnectorConfig(Map<String, String> parsedConfig) {
this(conf(), parsedConfig);
}
public static ConfigDef conf() {
return new ConfigDef()
.define(FILENAME_CONFIG, Type.STRING, Importance.HIGH, FILENAME_DOC)
.define(TOPIC_CONFIG, Type.STRING, Importance.HIGH, TOPIC_DOC);
}
public String getFilenameConfig(){
return this.getString("fileName");
}
public String getTopicConfig(){
return this.getString("topic");
}
}
SourceConnector class
package hothman.example;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.errors.ConnectException;
import org.apache.kafka.connect.source.SourceConnector;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class AudioSourceConnector extends SourceConnector {
/*
Your connector should never use System.out for logging. All of your classes should use slf4j
for logging
*/
private static Logger log = LoggerFactory.getLogger(AudioSourceConnector.class);
private AudioSourceConnectorConfig config;
private String filename;
private String topic;
@Override
public String version() {
return VersionUtil.getVersion();
}
@Override
public void start(Map<String, String> props) {
filename = config.getFilenameConfig();
topic = config.getTopicConfig();
if (topic == null || topic.isEmpty())
throw new ConnectException("AudiSourceConnector configuration must include 'topic' setting");
if (topic.contains(","))
throw new ConnectException("AudioSourceConnector should only have a single topic when used as a source.");
}
@Override
public Class<? extends Task> taskClass() {
//TODO: Return your task implementation.
return AudioSourceTask.class;
}
@Override
public List<Map<String, String>> taskConfigs(int maxTasks) {
ArrayList<Map<String, String>> configsList = new ArrayList<>();
// Only one input stream makes sense.
Map<String, String> configs = new HashMap<>();
if (filename != null)
configs.put(config.getFilenameConfig(), filename);
configs.put(config.getTopicConfig(), topic);
configsList.add(configs);
return configsList;
}
@Override
public void stop() {
}
@Override
public ConfigDef config() {
return AudioSourceConnectorConfig.conf();
}
}
SourceTask class
package hothman.example;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.errors.ConnectException;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.File;
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import static com.sun.nio.file.ExtendedWatchEventModifier.FILE_TREE;
import static java.nio.file.StandardWatchEventKinds.ENTRY_CREATE;
import static java.nio.file.StandardWatchEventKinds.ENTRY_DELETE;
public class AudioSourceTask extends SourceTask {
/*
Your connector should never use System.out for logging. All of your classes should use slf4j
for logging
*/
static final Logger log = LoggerFactory.getLogger(AudioSourceTask.class);
private AudioSourceConnectorConfig config;
public static final String POSITION_FIELD = "position";
private static final Schema VALUE_SCHEMA = Schema.BYTES_SCHEMA;
private String filename;
private String topic = null;
private int offset = 0;
private FileSystem fs = FileSystems.getDefault();
private WatchService ws = fs.newWatchService();
private Path dir;
private File directoryPath;
private ArrayList<File> listOfFiles;
private byte[] temp = null;
public AudioSourceTask() throws IOException {
}
@Override
public String version() {
return VersionUtil.getVersion();
}
@Override
public void start(Map<String, String> props) {
filename = config.getFilenameConfig();
topic = config.getTopicConfig();
if (topic == null)
throw new ConnectException("AudioSourceTask config missing topic setting");
dir = Paths.get(filename);
try {
dir.register(ws, new WatchEvent.Kind[]{ENTRY_CREATE, ENTRY_DELETE}, FILE_TREE);
} catch (IOException e) {
e.printStackTrace();
}
directoryPath = new File(String.valueOf(dir));
}
@Override
public List<SourceRecord> poll() throws InterruptedException {
//TODO: Create SourceRecord objects that will be sent to the Kafka cluster.
listOfFiles = new ArrayList<File>(Arrays.asList(directoryPath.listFiles()));
Map<String, Object> offset = context.offsetStorageReader().
offset(Collections.singletonMap(config.getFilenameConfig(), filename));
ArrayList<SourceRecord> records = new ArrayList<>(1);
try {
for (File file : listOfFiles) {
// send existing files first
temp = Files.readAllBytes(Paths.get(file.toString()));
records.add(new SourceRecord(null,
null, topic, Schema.BYTES_SCHEMA, temp));
}
return records;
} catch (IOException e) {
e.printStackTrace();
}
return null;
}
@Override
public void stop() {
//TODO: Do whatever is required to stop your task.
}
}
Version class
package hothman.example;
/**
* Created by jeremy on 5/3/16.
*/
class VersionUtil {
public static String getVersion() {
try {
return VersionUtil.class.getPackage().getImplementationVersion();
} catch(Exception ex){
return "0.0.0.0";
}
}
}
Connector.properties
name=AudioSourceConnector
tasks.max=1
connector.class=hothman.example.AudioSourceConnector
fileName = G:\\Files
topic= my-topic
Connect-standalone.properties
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# These are defaults. This file just demonstrates how to override some settings.
bootstrap.servers=localhost:9092
# The converters specify the format of data in Kafka and how to translate it into Connect data. Every Connect user will
# need to configure these based on the format they want their data in when loaded from or stored into Kafka
#key.converter=org.apache.kafka.connect.json.JsonConverter
#value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
# Converter-specific settings can be passed in by prefixing the Converter's setting with the converter we want to apply
# it to
key.converter.schemas.enable=false
value.converter.schemas.enable=false
offset.storage.file.filename=G:/Kafka/kafka_2.12-2.8.0/tmp/connect.offsets
# Flush much faster than normal, which is useful for testing/debugging
offset.flush.interval.ms=10000
# Set to a list of filesystem paths separated by commas (,) to enable class loading isolation for plugins
# (connectors, converters, transformations). The list should consist of top level directories that include
# any combination of:
# a) directories immediately containing jars with plugins and their dependencies
# b) uber-jars with plugins and their dependencies
# c) directories immediately containing the package directory structure of classes of plugins and their dependencies
# Note: symlinks will be followed to discover dependencies or plugins.
# Examples:
# plugin.path=/usr/local/share/java,/usr/local/share/kafka/plugins,/opt/connectors,
plugin.path=G:/Kafka/kafka_2.12-2.8.0/plugins
ERROR:
[2021-05-05 01:24:27,926] INFO WorkerSourceTask{id=AudioSourceConnector-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:487)
[2021-05-05 01:24:27,928] ERROR WorkerSourceTask{id=AudioSourceConnector-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:184)
java.lang.OutOfMemoryError: Java heap space
at java.nio.file.Files.read(Files.java:3099)
at java.nio.file.Files.readAllBytes(Files.java:3158)
at hothman.example.AudioSourceTask.poll(AudioSourceTask.java:93)
at org.apache.kafka.connect.runtime.WorkerSourceTask.poll(WorkerSourceTask.java:273)
at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:240)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:182)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:231)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[2021-05-05 01:24:27,929] INFO [Producer clientId=connector-producer-AudioSourceConnector-0] Closing the Kafka producer with timeoutMillis = 30000 ms. (org.apache.kafka.clients.producer.KafkaProducer:1204)
[2021-05-05 01:24:27,933] INFO Metrics scheduler closed (org.apache.kafka.common.metrics.Metrics:659)
[2021-05-05 01:24:27,934] INFO Closing reporter org.apache.kafka.common.metrics.JmxReporter (org.apache.kafka.common.metrics.Metrics:663)
[2021-05-05 01:24:27,934] INFO Metrics reporters closed (org.apache.kafka.common.metrics.Metrics:669)
[2021-05-05 01:24:27,935] INFO App info kafka.producer for connector-producer-AudioSourceConnector-0 unregistered (org.apache.kafka.common.utils.AppInfoParser:83)
[2021-05-05 01:24:36,479] INFO WorkerSourceTask{id=AudioSourceConnector-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:487)
Using the Logger, as per @OneCricketeer's recommendation, I was able to pinpoint the problem.
config.getFilenameConfig();
returns null, so I had to hard-code the path in the connector for the time being.
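A probably cleaner fix (my own assumption, not verified against the original connector) is to construct the config from the task properties in start() rather than leaving the config field unassigned:
@Override
public void start(Map<String, String> props) {
    // Build the config from the properties Connect passes in; without this the
    // config field is never initialized, so the filename/topic lookups cannot work.
    config = new AudioSourceConnectorConfig(props);
    filename = config.getFilenameConfig();
    topic = config.getTopicConfig();
    // ... rest of start() unchanged
}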
The connector then worked but threw a java.lang.OutOfMemoryError: Java heap space error. To fix this I had to edit the connect-standalone.properties file and increase producer.max.request.size and producer.buffer.memory, making sure their values are larger than any of the files I am going to send.
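For example (the values below are placeholders; size them above your largest audio file, and note that the broker's message.max.bytes may also need to be raised for very large files):
producer.max.request.size=52428800
producer.buffer.memory=52428800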
I have also edited the AudioSourceTask class, got rid of the for loop in the poll method, and moved the initialization of listOfFiles from poll to start. They now look as follows:
public void start(Map<String, String> props) {
filename = "G:\\AudioFiles";//config.getFilenameConfig();//
topic = "voice-wav1";//config.getTopicConfig();//
if (topic == null)
throw new ConnectException("AudioSourceTask config missing topic setting");
dir = Paths.get(filename);
try {
dir.register(ws, new WatchEvent.Kind[]{ENTRY_CREATE, ENTRY_DELETE}, FILE_TREE);
} catch (IOException e) {
e.printStackTrace();
}
directoryPath = new File(String.valueOf(dir));
listOfFiles = new ArrayList<File>(Arrays.asList(directoryPath.listFiles()));
}
@Override
public List<SourceRecord> poll() throws InterruptedException {
//TODO: Create SourceRecord objects that will be sent to the Kafka cluster.
Map<String, Object> offset = context.offsetStorageReader().
offset(Collections.singletonMap("G:\\AudioFiles", filename));
ArrayList<SourceRecord> records = new ArrayList<>(1);
try{
// send existing files first
if(listOfFiles.size()!=0) {
File file = listOfFiles.get(listOfFiles.size() - 1);
listOfFiles.remove(listOfFiles.size() - 1);
temp = Files.readAllBytes(Paths.get(file.toString()));
records.add(new SourceRecord(null,
null, topic, Schema.BYTES_SCHEMA, temp));
LOGGER.info("Reading file {}", file);
return records;
}
} catch (IOException e) {
e.printStackTrace();
}
return null;
}

How to resolve a custom option in a protocol buffer FileDescriptor

I'm using protocol buffers 2.5 with Java. I have a proto file that defines a custom option, and another proto file that uses that custom option. If I persist the corresponding FileDescriptorProtos and then read them back and convert them to FileDescriptors, the reference to the custom option is manifested as an unknown field. How do I cause that custom option to be resolved correctly?
Here's the code. I have two .proto files. protobuf-options.proto looks like this:
package options;
import "google/protobuf/descriptor.proto";
option java_package = "com.example.proto";
option java_outer_classname = "Options";
extend google.protobuf.FieldOptions {
optional bool scrub = 50000;
}
The imported google/protobuf/descriptor.proto is exactly the descriptor.proto that ships with Protocol Buffers 2.5.
example.proto looks like this:
package example;
option java_package = "com.example.protos";
option java_outer_classname = "ExampleProtos";
option optimize_for = SPEED;
option java_generic_services = false;
import "protobuf-options.proto";
message M {
optional int32 field1 = 1;
optional string field2 = 2 [(options.scrub) = true];
}
As you can see, field2 references the custom option defined by protobuf-options.proto.
The following code writes a binary-encoded version of all three protos to /tmp:
package com.example;
import com.google.protobuf.ByteString;
import com.google.protobuf.DescriptorProtos.FileDescriptorProto;
import com.google.protobuf.Descriptors.FileDescriptor;
import com.example.protos.ExampleProtos;
import java.io.FileOutputStream;
import java.io.OutputStream;
/**
*
*/
public class PersistFDs {
public void persist(final FileDescriptor fileDescriptor) throws Exception {
System.out.println("persisting "+fileDescriptor.getName());
try (final OutputStream outputStream = new FileOutputStream("/tmp/"+fileDescriptor.getName())) {
final FileDescriptorProto fileDescriptorProto = fileDescriptor.toProto();
final ByteString byteString = fileDescriptorProto.toByteString();
byteString.writeTo(outputStream);
}
for (final FileDescriptor dependency : fileDescriptor.getDependencies()) {
persist(dependency);
}
}
public static void main(String[] args) throws Exception {
final PersistFDs self = new PersistFDs();
self.persist(ExampleProtos.getDescriptor());
}
}
Finally, the following code loads those protos from /tmp, converts them back into FileDescriptors, and then checks for the custom option on field2:
package com.example;
import com.google.protobuf.ByteString;
import com.google.protobuf.DescriptorProtos.FileDescriptorProto;
import com.google.protobuf.Descriptors.FieldDescriptor;
import com.google.protobuf.Descriptors.FileDescriptor;
import com.google.protobuf.UnknownFieldSet.Field;
import java.io.FileInputStream;
import java.io.InputStream;
/**
*
*/
public class LoadFDs {
public FileDescriptorProto loadProto(final String filePath) throws Exception {
try (final InputStream inputStream = new FileInputStream(filePath)) {
final ByteString byteString = ByteString.readFrom(inputStream);
final FileDescriptorProto result = FileDescriptorProto.parseFrom(byteString);
return result;
}
}
public static void main(final String[] args) throws Exception {
final LoadFDs self = new LoadFDs();
final FileDescriptorProto descriptorFDProto = self.loadProto("/tmp/google/protobuf/descriptor.proto");
final FileDescriptorProto optionsFDProto = self.loadProto("/tmp/protobuf-options.proto");
final FileDescriptorProto fakeBoxcarFDProto = self.loadProto("/tmp/example.proto");
final FileDescriptor fD = FileDescriptor.buildFrom(descriptorFDProto, new FileDescriptor[0]);
final FileDescriptor optionsFD = FileDescriptor.buildFrom(optionsFDProto, new FileDescriptor[] { fD });
final FileDescriptor fakeBoxcarFD = FileDescriptor.buildFrom(fakeBoxcarFDProto, new FileDescriptor[] { optionsFD });
final FieldDescriptor optionsFieldDescriptor = optionsFD.findExtensionByName("scrub");
if (optionsFieldDescriptor == null) {
System.out.println("Did not find scrub's FieldDescriptor");
System.exit(1);
}
final FieldDescriptor sFieldDescriptor = fakeBoxcarFD.findMessageTypeByName("M").findFieldByName("field2");
System.out.println("unknown option fields "+sFieldDescriptor.getOptions().getUnknownFields());
final boolean hasScrubOption = sFieldDescriptor.getOptions().hasField(optionsFieldDescriptor);
System.out.println("hasScrubOption: "+hasScrubOption);
}
}
When I run LoadFDs, it fails with this exception:
unknown option fields 50000: 1
Exception in thread "main" java.lang.IllegalArgumentException: FieldDescriptor does not match message type.
at com.google.protobuf.GeneratedMessage$ExtendableMessage.verifyContainingType(GeneratedMessage.java:812)
at com.google.protobuf.GeneratedMessage$ExtendableMessage.hasField(GeneratedMessage.java:761)
at com.example.LoadFDs.main(LoadFDs.java:42)
The options on the FieldDescriptor for field2 ought to include that custom option, but instead it shows up as an unknown field. The field number and value of the unknown field are correct; it's just that the custom option is not getting resolved. How do I fix that?
You need to use FileDescriptorProto.parseFrom(byte[] data, ExtensionRegistryLite extensionRegistry) and explicitly create an ExtensionRegistry. Here's one way to create an extension registry:
ExtensionRegistry extensionRegistry = ExtensionRegistry.newInstance();
com.example.proto.Options.registerAllExtensions(extensionRegistry);
(where com.example.proto.Options is the compiled custom options class)
This obviously only works if you have access to the compiled custom options class on the client side. I don't know if there's a way to serialize the extension and deserialize it on the client side.
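For instance, loadProto in LoadFDs could be adapted roughly like this (a sketch, assuming the compiled com.example.proto.Options class is on the classpath):
public FileDescriptorProto loadProto(final String filePath) throws Exception {
    // Register the generated extension so (options.scrub) is resolved instead of
    // ending up in the unknown-field set.
    final ExtensionRegistry extensionRegistry = ExtensionRegistry.newInstance();
    com.example.proto.Options.registerAllExtensions(extensionRegistry);
    try (final InputStream inputStream = new FileInputStream(filePath)) {
        final ByteString byteString = ByteString.readFrom(inputStream);
        return FileDescriptorProto.parseFrom(byteString, extensionRegistry);
    }
}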

Spark Streaming: Using PairRDD.saveAsNewHadoopDataset function to save data to HBase

I want to save a Twitter stream in an HBase database. What I have now is the Spark application that receives and transforms the data, but I don't know how to save my Twitter stream into HBase.
The only thing I found that could be useful is the PairRDD.saveAsNewAPIHadoopDataset(conf) method. But how should I use it, and which Configuration settings do I have to provide to save the RDD data to my HBase table?
So far I have only found the HBase client library, which can insert data into a table via Put objects. But that isn't a solution from inside a Spark program, is it (it would require iterating over all items inside the RDD)?
Can someone give an example in Java? My main problem seems to be setting up the org.apache.hadoop.conf.Configuration instance that I have to pass to saveAsNewAPIHadoopDataset...
Here a code snippet:
JavaReceiverInputDStream<Status> statusDStream = TwitterUtils.createStream(streamingCtx);
JavaPairDStream<Long, String> statusPairDStream = statusDStream.mapToPair(new PairFunction<Status, Long, String>() {
public Tuple2<Long, String> call(Status status) throws Exception {
return new Tuple2<Long, String> (status.getId(), status.getText());
}
});
statusPairDStream.foreachRDD(new Function<JavaPairRDD<Long,String>, Void>() {
public Void call(JavaPairRDD<Long, String> status) throws Exception {
org.apache.hadoop.conf.Configuration conf = new Configuration();
status.saveAsNewAPIHadoopDataset(conf);
// HBase PUT here can't be correct!?
return null;
}
});
First, anonymous function classes are discouraged if you are using Java 8; please use lambdas instead. The code snippet below should address all your queries.
sample snippet:
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
....
public static void processYourMessages(final JavaRDD<YourMessage> rdd, final HiveContext sqlContext,
final MyMessageUtil messageutil) throws Exception {
final JavaRDD<Row> yourrdd = rdd.filter(msg -> messageutil.filterType(.....)); // create a java rdd
final JavaPairRDD<ImmutableBytesWritable, Put> yourrddPuts = yourrdd.mapToPair(row -> messageutil.getPuts(row));
yourrddPuts.saveAsNewAPIHadoopDataset(conf);
}
where conf is like below
private Configuration conf = HBaseConfiguration.create();
conf.set(ZOOKEEPER_QUORUM, "comma separated list of zookeeper quorum");
conf.set("hbase.mapred.outputtable", "your table name");
conf.set("mapreduce.outputformat.class", "org.apache.hadoop.hbase.mapreduce.TableOutputFormat");
MyMessageUtil has getPuts methods which is like below
public Tuple2<ImmutableBytesWritable, Put> getPuts(Row row) throws Exception {
Put put = ..// prepare your put with all the columns you have.
return new Tuple2<ImmutableBytesWritable, Put>(new ImmutableBytesWritable(), put);
}
Hope this helps!

How to remove java apis from Nashorn-engine?

Is it possible to hide or remove Java APIs from the Nashorn engine?
So that it could only see or use "default" ECMAScript 262 edition 5.1, plus some explicitly exposed functions/variables?
I would like to let my end users create some specific logic of their own without worrying that they could hack the whole system. Of course there might be security holes in the Nashorn engine itself, but that is a different topic.
Edit: Sorry, I forgot to mention that I am running Nashorn inside my Java application, so no command-line parameters can be used.
Programmatically, you can also directly use the NashornScriptEngineFactory class which has an appropriate getScriptEngine() method:
import jdk.nashorn.api.scripting.NashornScriptEngineFactory;
...
NashornScriptEngineFactory factory = new NashornScriptEngineFactory();
...
ScriptEngine engine = factory.getScriptEngine("-strict", "--no-java", "--no-syntax-extensions");
OK, here is a sample class with some limiting arguments:
package com.pasuna;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Random;
import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptException;
import jdk.nashorn.api.scripting.NashornScriptEngineFactory;
public class ScriptTest {
public static class Logger {
public void log(String message) {
System.out.println(message);
}
}
public static class Dice {
private Random random = new Random();
public int D6() {
return random.nextInt(6) + 1;
}
}
public static void main(String[] args) {
NashornScriptEngineFactory factory = new NashornScriptEngineFactory();
ScriptEngine engine = factory.getScriptEngine(new String[]{"-strict", "--no-java", "--no-syntax-extensions"});
//note final, does not work.
final Dice dice = new Dice();
final Logger logger = new Logger();
engine.put("dice", dice);
engine.put("log", logger);
engine.put("hello", "world");
try {
engine.eval("log.log(hello);");
engine.eval("log.log(Object.keys(this));");
engine.eval("log.log(dice.D6());"
+ "log.log(dice.D6());"
+ "log.log(dice.D6());");
engine.eval("log.log(Object.keys(this));");
engine.eval("Coffee"); //boom as should
engine.eval("Java"); //erm? shoud boom?
engine.eval("log = 1;"); //override final, boom, nope
engine.eval("log.log(hello);"); //boom
} catch (final ScriptException ex) {
ex.printStackTrace();
}
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
String input = "";
do {
try {
input = br.readLine();
engine.eval(input);
} catch (final ScriptException | IOException se) {
se.printStackTrace();
}
} while (!input.trim().equals("quit"));
try {
engine.eval("var add = function(first, second){return first + second;};");
Invocable invocable = (Invocable) engine;
Object result = invocable.invokeFunction("add", 1, 2);
System.out.println(result);
} catch (final NoSuchMethodException | ScriptException se) {
se.printStackTrace();
}
Object l = engine.get("log");
System.out.println(l == logger);
}
}
More info about the flags can be found here: http://hg.openjdk.java.net/jdk8/jdk8/nashorn/rev/eb7b8340ce3a
(IMHO the Nashorn documentation is currently rather sparse.)
You can specify any jjs option for script engines via -Dnashorn.args option when you launch your java program. For example:
java -Dnashorn.args=--no-java Main
where Main uses javax.script API with nashorn engine.
You can run the "jjs" tool with the --no-java option to prevent any explicit Java package/class access from scripts. That said, the Nashorn platform is secure and uses Java's standard URL-codebase-based security model (an 'eval'-ed script without a known URL origin is treated like untrusted, unsigned code and so gets only sandbox permissions).
--no-java is the main flag to turn off the Java extensions; --no-syntax-extensions turns off non-standard syntax extensions.
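As a quick sanity check (a sketch, not from the original answers), any attempt to touch the Java namespace should then fail at eval time:
ScriptEngine engine = new NashornScriptEngineFactory()
        .getScriptEngine("--no-java", "--no-syntax-extensions");
try {
    // With --no-java the global 'java'/'Java'/'Packages' objects are not defined,
    // so this should throw a ScriptException (a ReferenceError under the hood).
    engine.eval("var f = new java.io.File('/tmp');");
} catch (ScriptException expected) {
    System.out.println("Java access blocked: " + expected.getMessage());
}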

Amazon Product Advertising API through Java/SOAP

I have been playing with Amazon's Product Advertising API, and I cannot get a request to go through and give me data. I have been working off of this: http://docs.amazonwebservices.com/AWSECommerceService/2011-08-01/GSG/ and this: Amazon Product Advertising API signed request with Java
Here is my code. I generated the SOAP bindings using this: http://docs.amazonwebservices.com/AWSECommerceService/2011-08-01/GSG/YourDevelopmentEnvironment.html#Java
On the Classpath, I only have: commons-codec.1.5.jar
import com.ECS.client.jax.AWSECommerceService;
import com.ECS.client.jax.AWSECommerceServicePortType;
import com.ECS.client.jax.Item;
import com.ECS.client.jax.ItemLookup;
import com.ECS.client.jax.ItemLookupRequest;
import com.ECS.client.jax.ItemLookupResponse;
import com.ECS.client.jax.ItemSearchResponse;
import com.ECS.client.jax.Items;
public class Client {
public static void main(String[] args) {
String secretKey = <my-secret-key>;
String awsKey = <my-aws-key>;
System.out.println("API Test started");
AWSECommerceService service = new AWSECommerceService();
service.setHandlerResolver(new AwsHandlerResolver(
secretKey)); // important
AWSECommerceServicePortType port = service.getAWSECommerceServicePort();
// Get the operation object:
com.ECS.client.jax.ItemSearchRequest itemRequest = new com.ECS.client.jax.ItemSearchRequest();
// Fill in the request object:
itemRequest.setSearchIndex("Books");
itemRequest.setKeywords("Star Wars");
// itemRequest.setVersion("2011-08-01");
com.ECS.client.jax.ItemSearch ItemElement = new com.ECS.client.jax.ItemSearch();
ItemElement.setAWSAccessKeyId(awsKey);
ItemElement.getRequest().add(itemRequest);
// Call the Web service operation and store the response
// in the response object:
com.ECS.client.jax.ItemSearchResponse response = port
.itemSearch(ItemElement);
String r = response.toString();
System.out.println("response: " + r);
for (Items itemList : response.getItems()) {
System.out.println(itemList);
for (Item item : itemList.getItem()) {
System.out.println(item);
}
}
System.out.println("API Test stopped");
}
}
Here is what I get back... I was hoping to see some Star Wars books available on Amazon dumped out to my console:
API Test started
response: com.ECS.client.jax.ItemSearchResponse#7a6769ea
com.ECS.client.jax.Items#1b5ac06e
API Test stopped
What am I doing wrong? (Note that nothing in the second for loop is printed out, because the item list is empty.) How can I troubleshoot this or get relevant error information?
I don't use the SOAP API, but your bounty requirements didn't state that it had to use SOAP, only that you wanted to call Amazon and get results. So I'll post this working example using the REST API, which will at least fulfill your stated requirement:
I would like some working example code that hits the amazon server and returns results
You'll need to download the following to fulfill the signature requirements:
http://associates-amazon.s3.amazonaws.com/signed-requests/samples/amazon-product-advt-api-sample-java-query.zip
Unzip it and grab the com.amazon.advertising.api.sample.SignedRequestsHelper.java file and put it directly into your project. This code is used to sign the request.
You'll also need to download Apache Commons Codec 1.3 from the link below and add it to your classpath, i.e. add it to your project's libraries. Note that this is the only version of Codec that will work with the above class (SignedRequestsHelper):
http://archive.apache.org/dist/commons/codec/binaries/commons-codec-1.3.zip
Now you can copy and paste the following, making sure to replace your.pkg.here with the proper package name and to fill in the SECRET_KEY and AWS_KEY properties:
package your.pkg.here;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
public class Main {
private static final String SECRET_KEY = "<YOUR_SECRET_KEY>";
private static final String AWS_KEY = "<YOUR_KEY>";
public static void main(String[] args) {
SignedRequestsHelper helper = SignedRequestsHelper.getInstance("ecs.amazonaws.com", AWS_KEY, SECRET_KEY);
Map<String, String> params = new HashMap<String, String>();
params.put("Service", "AWSECommerceService");
params.put("Version", "2009-03-31");
params.put("Operation", "ItemLookup");
params.put("ItemId", "1451648537");
params.put("ResponseGroup", "Large");
String url = helper.sign(params);
try {
Document response = getResponse(url);
printResponse(response);
} catch (Exception ex) {
Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
}
}
private static Document getResponse(String url) throws ParserConfigurationException, IOException, SAXException {
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(url);
return doc;
}
private static void printResponse(Document doc) throws TransformerException, FileNotFoundException {
Transformer trans = TransformerFactory.newInstance().newTransformer();
Properties props = new Properties();
props.put(OutputKeys.INDENT, "yes");
trans.setOutputProperties(props);
StreamResult res = new StreamResult(new StringWriter());
DOMSource src = new DOMSource(doc);
trans.transform(src, res);
String toString = res.getWriter().toString();
System.out.println(toString);
}
}
As you can see, this is much simpler to set up and use than the SOAP API. If you don't have a specific requirement to use the SOAP API, then I would highly recommend the REST API instead.
One of the drawbacks of the REST API is that the results aren't unmarshalled into objects for you. This could be remedied by generating the required classes from the WSDL.
This ended up working (I had to add my associateTag to the request):
public class Client {
public static void main(String[] args) {
String secretKey = "<MY_SECRET_KEY>";
String awsKey = "<MY AWS KEY>";
System.out.println("API Test started");
AWSECommerceService service = new AWSECommerceService();
service.setHandlerResolver(new AwsHandlerResolver(secretKey)); // important
AWSECommerceServicePortType port = service.getAWSECommerceServicePort();
// Get the operation object:
com.ECS.client.jax.ItemSearchRequest itemRequest = new com.ECS.client.jax.ItemSearchRequest();
// Fill in the request object:
itemRequest.setSearchIndex("Books");
itemRequest.setKeywords("Star Wars");
itemRequest.getResponseGroup().add("Large");
// itemRequest.getResponseGroup().add("Images");
// itemRequest.setVersion("2011-08-01");
com.ECS.client.jax.ItemSearch ItemElement = new com.ECS.client.jax.ItemSearch();
ItemElement.setAWSAccessKeyId(awsKey);
ItemElement.setAssociateTag("th0426-20");
ItemElement.getRequest().add(itemRequest);
// Call the Web service operation and store the response
// in the response object:
com.ECS.client.jax.ItemSearchResponse response = port
.itemSearch(ItemElement);
String r = response.toString();
System.out.println("response: " + r);
for (Items itemList : response.getItems()) {
System.out.println(itemList);
for (Item itemObj : itemList.getItem()) {
System.out.println(itemObj.getItemAttributes().getTitle()); // Title
System.out.println(itemObj.getDetailPageURL()); // Amazon URL
}
}
System.out.println("API Test stopped");
}
}
It looks like the response object does not override toString(), so if it contains some sort of error response, simply printing it will not tell you what the error is. You'll need to look at the API for what fields are returned in the response object and print those individually. Either you'll get an obvious error message, or you'll have to go back to their documentation to try to figure out what is wrong.
You need to call the get methods on the Item object to retrieve its details, e.g.:
for (Item item : itemList.getItem()) {
System.out.println(item.getItemAttributes().getTitle()); //Title of item
System.out.println(item.getDetailPageURL()); // Amazon URL
//etc
}
If there are any errors you can get them by calling getErrors()
if (response.getOperationRequest().getErrors() != null) {
System.out.println(response.getOperationRequest().getErrors().getError().get(0).getMessage());
}
