I'm new to Apache Storm and Kafka and I'm trying to learn both through the courses provided by OpenClassrooms. The principle is simple: messages are sent by a Python program to a Kafka server and retrieved via a Kafka spout defined in the main class of a Storm topology. The problem is that I don't understand how the bolt retrieves the messages. From what I understand, this is done in the ParsingBolt class with the following line of code: JSONObject obj = (JSONObject)jsonParser.parse(input.getStringByField("value"));. The only thing I don't understand is how we know that the messages are contained in the "value" field. Below you can find the main class and the parsing bolt class. (The whole project is available here.)
The main class:
package velos;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.generated.AlreadyAliveException;
import org.apache.storm.generated.AuthorizationException;
import org.apache.storm.generated.InvalidTopologyException;
import org.apache.storm.generated.StormTopology;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseWindowedBolt;
import org.apache.storm.tuple.Fields;
public class App {
public static void main(String[] args)
throws AlreadyAliveException, InvalidTopologyException, AuthorizationException {
TopologyBuilder builder = new TopologyBuilder();
KafkaSpoutConfig.Builder<String, String> spoutConfigBuilder = KafkaSpoutConfig.builder("localhost:9092",
"velib-stations");
spoutConfigBuilder.setProp(ConsumerConfig.GROUP_ID_CONFIG, "city-stats");
KafkaSpoutConfig<String, String> spoutConfig = spoutConfigBuilder.build();
builder.setSpout("stations", new KafkaSpout<String, String>(spoutConfig));
builder.setBolt("station-parsing", new StationParsingBolt()).shuffleGrouping("stations");
builder.setBolt("city-stats",
new CityStatsBolt().withTumblingWindow(BaseWindowedBolt.Duration.of(1000 * 60 * 5)))
.fieldsGrouping("station-parsing", new Fields("city"));
builder.setBolt("save-results", new SaveResultsBolt()).fieldsGrouping("city-stats", new Fields("city"));
StormTopology topology = builder.createTopology();
Config config = new Config();
config.setMessageTimeoutSecs(60 * 30);
String topologyName = "Velos";
if (args.length > 0 && args[0].equals("remote")) {
StormSubmitter.submitTopology(topologyName, config, topology);
} else {
LocalCluster cluster = new LocalCluster();
cluster.submitTopology(topologyName, config, topology);
}
}
}
The ParsingBolt:
package velos;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.shade.org.json.simple.JSONObject;
import org.apache.storm.shade.org.json.simple.parser.JSONParser;
import org.apache.storm.shade.org.json.simple.parser.ParseException;
public class StationParsingBolt extends BaseRichBolt {
private OutputCollector outputCollector;
@Override
public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
outputCollector = collector;
}
@Override
public void execute(Tuple input) {
try {
process(input);
} catch (ParseException e) {
e.printStackTrace();
outputCollector.fail(input);
}
}
public void process(Tuple input) throws ParseException {
JSONParser jsonParser = new JSONParser();
JSONObject obj = (JSONObject)jsonParser.parse(input.getStringByField("value"));
String contract = (String)obj.get("contract_name");
Long availableStands = (Long)obj.get("available_bike_stands");
Long stationNumber = (Long)obj.get("number");
outputCollector.emit(new Values(contract, stationNumber, availableStands));
outputCollector.ack(input);
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("city", "station_id", "available_stands"));
}
}
By default, the "topic", "partition", "offset", "key", and "value" of each record will be emitted to the "default" stream (see https://storm.apache.org/releases/2.4.0/storm-kafka-client.html). That is why the bolt can read the message payload with input.getStringByField("value").
Use a RecordTranslator to change this.
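For example, a record translator can emit just the payload under a field name of your choice. The following is only a sketch, assuming the storm-kafka-client 2.x builder API used in the topology above; the field name "station-json" is invented for illustration:
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
KafkaSpoutConfig<String, String> spoutConfig =
    KafkaSpoutConfig.builder("localhost:9092", "velib-stations")
        // emit one-element tuples holding only the record's value...
        .setRecordTranslator(record -> new Values(record.value()),
            // ...under a custom field name (hypothetical)
            new Fields("station-json"))
        .build();
// The bolt would then call input.getStringByField("station-json") instead of "value".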
I store all my static data in a JSON file. This JSON file has up to 1000 rows. How can I get the desired data without storing all rows in an ArrayList?
Here is the code I'm using right now; I want to increase its efficiency:
List<Colors> colorsList = new ObjectMapper().readValue(resource.getFile(), new TypeReference<List<Colors>>() {});
for(int i=0; i<colorsList.size(); i++){
if(colorsList.get(i).getColor().equals("Blue")){
return colorsList.get(i).getCode();
}
}
Is it possible? My goal is to increase efficiency without using an ArrayList. Is there a way to write the code like this?
Colors colors = new ObjectMapper().readValue(..."Blue"...);
return colors.getCode();
Resource.json
[
...
{
"color":"Blue",
"code":["012","0324","15478","7412"]
},
{
"color":"Red",
"code":["145","001","1","7879","123984","89"]
},
{
"color":"White",
"code":["7","11","89","404"]
}
...
]
Colors.java
class Colors {
private String color;
private List<String> code;
public Colors() {
}
public String getColor() {
return color;
}
public void setColor(String color) {
this.color = color;
}
public List<String> getCode() {
return code;
}
public void setCode(List<String> code) {
this.code = code;
}
@Override
public String toString() {
return "Colors{" +
"color='" + color + '\'' +
", code=" + code +
'}';
}
}
Creating POJO classes in this case is wasteful because we do not use the whole result List<Colors> but only one internal property. To avoid this we can use the native JsonNode and ArrayNode data types. We can read the JSON using the readTree method, iterate over the array, find the given object, and finally convert the internal code array. It could look like the example below:
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import java.io.File;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
public class JsonApp {
public static void main(String[] args) throws Exception {
File jsonFile = new File("./resource/test.json").getAbsoluteFile();
ObjectMapper mapper = new ObjectMapper();
ArrayNode rootArray = (ArrayNode) mapper.readTree(jsonFile);
int size = rootArray.size();
for (int i = 0; i < size; i++) {
JsonNode jsonNode = rootArray.get(i);
if (jsonNode.get("color").asText().equals("Blue")) {
Iterator<JsonNode> codesIterator = jsonNode.get("code").elements();
List<String> codes = new ArrayList<>();
codesIterator.forEachRemaining(n -> codes.add(n.asText()));
System.out.println(codes);
break;
}
}
}
}
The above code prints:
[012, 0324, 15478, 7412]
The downside of this solution is that we load the whole JSON into memory, which could be a problem. Let's try the Streaming API instead. It is a bit difficult to use and you must know how your JSON payload is constructed, but it is the fastest way to get the code array using Jackson. The implementation below is naive and does not handle all possibilities, so you should not rely on it:
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
public class JsonApp {
public static void main(String[] args) throws Exception {
File jsonFile = new File("./resource/test.json").getAbsoluteFile();
System.out.println(getBlueCodes(jsonFile));
}
private static List<String> getBlueCodes(File jsonFile) throws IOException {
try (JsonParser parser = new JsonFactory().createParser(jsonFile)) {
while (parser.nextToken() != JsonToken.END_OBJECT) {
String fieldName = parser.getCurrentName();
// Find color property
if ("color".equals(fieldName)) {
parser.nextToken();
// Find Blue color
if (parser.getText().equals("Blue")) {
// skip everything until start of the array
while (parser.nextToken() != JsonToken.START_ARRAY) ;
List<String> codes = new ArrayList<>();
while (parser.nextToken() != JsonToken.END_ARRAY) {
codes.add(parser.getText());
}
return codes;
} else {
// skip current object because it is not `Blue`
while (parser.nextToken() != JsonToken.END_OBJECT) ;
}
}
}
}
return Collections.emptyList();
}
}
The above code prints:
[012, 0324, 15478, 7412]
Finally, I need to mention the JsonPath solution, which can also be good if you are able to use another library:
import com.jayway.jsonpath.JsonPath;
import net.minidev.json.JSONArray;
import java.io.File;
import java.util.List;
import java.util.stream.Collectors;
public class JsonPathApp {
public static void main(String[] args) throws Exception {
File jsonFile = new File("./resource/test.json").getAbsoluteFile();
JSONArray array = JsonPath.read(jsonFile, "$[?(@.color == 'Blue')].code");
JSONArray jsonCodes = (JSONArray)array.get(0);
List<String> codes = jsonCodes.stream()
.map(Object::toString).collect(Collectors.toList());
System.out.println(codes);
}
}
The above code prints:
[012, 0324, 15478, 7412]
You can use the DSM stream parsing library for memory and CPU efficiency and fast development. DSM uses a YAML-based mapping file and reads the whole data only once.
Here is a solution to your question:
Mapping File:
params:
colorsToFilter: ['Blue','Red'] # parameters can be passed programmatically
result:
type: array
path: /.*colors # path is regex
filter: params.colorsToFilter.contains(self.data.color) # select only color that exist in colorsToFilter list
fields:
color:
code:
type: array
Usage of DSM to parse json:
DSM dsm = new DSMBuilder(new File("path/mapping.yaml")).create(Colors.class);
List<Colors> object = (List<Colors>) dsm.toObject(jsonData);
System.out.println(object);
Output:
[Colors{color='Blue', code=[012, 0324, 15478, 7412]}, Colors{color='Red', code=[145, 001, 1, 7879, 123984, 89]}]
I know this was not possible before, but now with the following update:
https://developers.google.com/web/updates/2017/04/devtools-release-notes#screenshots
this seems to be possible using Chrome Dev Tools.
Is it possible now using Selenium in Java?
Yes, it is possible to take a full-page screenshot with Selenium since Chrome v59. The Chrome driver has two new endpoints to call the DevTools API directly:
/session/:sessionId/chromium/send_command_and_get_result
/session/:sessionId/chromium/send_command
The Selenium API doesn't implement these commands, so you'll have to send them directly with the underlying executor. It's not straightforward, but at least it's possible to produce the exact same result as DevTools.
Here's an example with Python, working on a local or remote instance:
from selenium import webdriver
import json, base64
capabilities = {
'browserName': 'chrome',
'chromeOptions': {
'useAutomationExtension': False,
'args': ['--disable-infobars']
}
}
driver = webdriver.Chrome(desired_capabilities=capabilities)
driver.get("https://stackoverflow.com/questions")
png = chrome_takeFullScreenshot(driver)
with open(r"C:\downloads\screenshot.png", 'wb') as f:
f.write(png)
And the code to take a full-page screenshot:
def chrome_takeFullScreenshot(driver) :
def send(cmd, params):
resource = "/session/%s/chromium/send_command_and_get_result" % driver.session_id
url = driver.command_executor._url + resource
body = json.dumps({'cmd':cmd, 'params': params})
response = driver.command_executor._request('POST', url, body)
return response.get('value')
def evaluate(script):
response = send('Runtime.evaluate', {'returnByValue': True, 'expression': script})
return response['result']['value']
metrics = evaluate( \
"({" + \
"width: Math.max(window.innerWidth, document.body.scrollWidth, document.documentElement.scrollWidth)|0," + \
"height: Math.max(innerHeight, document.body.scrollHeight, document.documentElement.scrollHeight)|0," + \
"deviceScaleFactor: window.devicePixelRatio || 1," + \
"mobile: typeof window.orientation !== 'undefined'" + \
"})")
send('Emulation.setDeviceMetricsOverride', metrics)
screenshot = send('Page.captureScreenshot', {'format': 'png', 'fromSurface': True})
send('Emulation.clearDeviceMetricsOverride', {})
return base64.b64decode(screenshot['data'])
With Java:
public static void main(String[] args) throws Exception {
ChromeOptions options = new ChromeOptions();
options.setExperimentalOption("useAutomationExtension", false);
options.addArguments("disable-infobars");
ChromeDriverEx driver = new ChromeDriverEx(options);
driver.get("https://stackoverflow.com/questions");
File file = driver.getFullScreenshotAs(OutputType.FILE);
}
import java.lang.reflect.Method;
import java.util.Map;
import com.google.common.collect.ImmutableMap;
import org.openqa.selenium.OutputType;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeDriverService;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.CommandInfo;
import org.openqa.selenium.remote.HttpCommandExecutor;
import org.openqa.selenium.remote.http.HttpMethod;
public class ChromeDriverEx extends ChromeDriver {
public ChromeDriverEx() throws Exception {
this(new ChromeOptions());
}
public ChromeDriverEx(ChromeOptions options) throws Exception {
this(ChromeDriverService.createDefaultService(), options);
}
public ChromeDriverEx(ChromeDriverService service, ChromeOptions options) throws Exception {
super(service, options);
CommandInfo cmd = new CommandInfo("/session/:sessionId/chromium/send_command_and_get_result", HttpMethod.POST);
Method defineCommand = HttpCommandExecutor.class.getDeclaredMethod("defineCommand", String.class, CommandInfo.class);
defineCommand.setAccessible(true);
defineCommand.invoke(super.getCommandExecutor(), "sendCommand", cmd);
}
public <X> X getFullScreenshotAs(OutputType<X> outputType) throws Exception {
Object metrics = sendEvaluate(
"({" +
"width: Math.max(window.innerWidth,document.body.scrollWidth,document.documentElement.scrollWidth)|0," +
"height: Math.max(window.innerHeight,document.body.scrollHeight,document.documentElement.scrollHeight)|0," +
"deviceScaleFactor: window.devicePixelRatio || 1," +
"mobile: typeof window.orientation !== 'undefined'" +
"})");
sendCommand("Emulation.setDeviceMetricsOverride", metrics);
Object result = sendCommand("Page.captureScreenshot", ImmutableMap.of("format", "png", "fromSurface", true));
sendCommand("Emulation.clearDeviceMetricsOverride", ImmutableMap.of());
String base64EncodedPng = (String)((Map<String, ?>)result).get("data");
return outputType.convertFromBase64Png(base64EncodedPng);
}
protected Object sendCommand(String cmd, Object params) {
return execute("sendCommand", ImmutableMap.of("cmd", cmd, "params", params)).getValue();
}
protected Object sendEvaluate(String script) {
Object response = sendCommand("Runtime.evaluate", ImmutableMap.of("returnByValue", true, "expression", script));
Object result = ((Map<String, ?>)response).get("result");
return ((Map<String, ?>)result).get("value");
}
}
Doing this with Selenium WebDriver in Java takes a bit of work. As hinted by Florent B., we need to change some classes used by the default ChromeDriver to make this work. First we need to make a new DriverCommandExecutor which adds the new Chrome commands:
import com.google.common.collect.ImmutableMap;
import org.openqa.selenium.remote.CommandInfo;
import org.openqa.selenium.remote.http.HttpMethod;
import org.openqa.selenium.remote.service.DriverCommandExecutor;
import org.openqa.selenium.remote.service.DriverService;
public class MyChromeDriverCommandExecutor extends DriverCommandExecutor {
private static final ImmutableMap<String, CommandInfo> CHROME_COMMAND_NAME_TO_URL;
public MyChromeDriverCommandExecutor(DriverService service) {
super(service, CHROME_COMMAND_NAME_TO_URL);
}
static {
CHROME_COMMAND_NAME_TO_URL = ImmutableMap.of("launchApp", new CommandInfo("/session/:sessionId/chromium/launch_app", HttpMethod.POST)
, "sendCommandWithResult", new CommandInfo("/session/:sessionId/chromium/send_command_and_get_result", HttpMethod.POST)
);
}
}
After that we need to create a new ChromeDriver class which uses this executor. We need to create the class because the original has no constructor that lets us replace the command executor. So the new class becomes:
import com.google.common.collect.ImmutableMap;
import org.openqa.selenium.Capabilities;
import org.openqa.selenium.WebDriverException;
import org.openqa.selenium.chrome.ChromeDriverService;
import org.openqa.selenium.html5.LocalStorage;
import org.openqa.selenium.html5.Location;
import org.openqa.selenium.html5.LocationContext;
import org.openqa.selenium.html5.SessionStorage;
import org.openqa.selenium.html5.WebStorage;
import org.openqa.selenium.interactions.HasTouchScreen;
import org.openqa.selenium.interactions.TouchScreen;
import org.openqa.selenium.mobile.NetworkConnection;
import org.openqa.selenium.remote.FileDetector;
import org.openqa.selenium.remote.RemoteTouchScreen;
import org.openqa.selenium.remote.RemoteWebDriver;
import org.openqa.selenium.remote.html5.RemoteLocationContext;
import org.openqa.selenium.remote.html5.RemoteWebStorage;
import org.openqa.selenium.remote.mobile.RemoteNetworkConnection;
public class MyChromeDriver extends RemoteWebDriver implements LocationContext, WebStorage, HasTouchScreen, NetworkConnection {
private RemoteLocationContext locationContext;
private RemoteWebStorage webStorage;
private TouchScreen touchScreen;
private RemoteNetworkConnection networkConnection;
//public MyChromeDriver() {
// this(ChromeDriverService.createDefaultService(), new ChromeOptions());
//}
//
//public MyChromeDriver(ChromeDriverService service) {
// this(service, new ChromeOptions());
//}
public MyChromeDriver(Capabilities capabilities) {
this(ChromeDriverService.createDefaultService(), capabilities);
}
//public MyChromeDriver(ChromeOptions options) {
// this(ChromeDriverService.createDefaultService(), options);
//}
public MyChromeDriver(ChromeDriverService service, Capabilities capabilities) {
super(new MyChromeDriverCommandExecutor(service), capabilities);
this.locationContext = new RemoteLocationContext(this.getExecuteMethod());
this.webStorage = new RemoteWebStorage(this.getExecuteMethod());
this.touchScreen = new RemoteTouchScreen(this.getExecuteMethod());
this.networkConnection = new RemoteNetworkConnection(this.getExecuteMethod());
}
@Override
public void setFileDetector(FileDetector detector) {
throw new WebDriverException("Setting the file detector only works on remote webdriver instances obtained via RemoteWebDriver");
}
@Override
public LocalStorage getLocalStorage() {
return this.webStorage.getLocalStorage();
}
@Override
public SessionStorage getSessionStorage() {
return this.webStorage.getSessionStorage();
}
@Override
public Location location() {
return this.locationContext.location();
}
@Override
public void setLocation(Location location) {
this.locationContext.setLocation(location);
}
@Override
public TouchScreen getTouch() {
return this.touchScreen;
}
@Override
public ConnectionType getNetworkConnection() {
return this.networkConnection.getNetworkConnection();
}
@Override
public ConnectionType setNetworkConnection(ConnectionType type) {
return this.networkConnection.setNetworkConnection(type);
}
public void launchApp(String id) {
this.execute("launchApp", ImmutableMap.of("id", id));
}
}
This is mostly a copy of the original class, but with some constructors disabled (because some of the needed code is package private). If you need these constructors, you must place the classes in the package org.openqa.selenium.chrome.
With these changes you are able to call the required code, as shown by Florent B., but now in Java with the Selenium API:
import com.google.common.collect.ImmutableMap;
import org.openqa.selenium.remote.Command;
import org.openqa.selenium.remote.Response;
import javax.annotation.Nonnull;
import javax.annotation.Nullable;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
public class ChromeExtender {
@Nonnull
private MyChromeDriver m_wd;
public ChromeExtender(@Nonnull MyChromeDriver wd) {
m_wd = wd;
}
public void takeScreenshot(@Nonnull File output) throws Exception {
Object visibleSize = evaluate("({x:0,y:0,width:window.innerWidth,height:window.innerHeight})");
Long visibleW = jsonValue(visibleSize, "result.value.width", Long.class);
Long visibleH = jsonValue(visibleSize, "result.value.height", Long.class);
Object contentSize = send("Page.getLayoutMetrics", new HashMap<>());
Long cw = jsonValue(contentSize, "contentSize.width", Long.class);
Long ch = jsonValue(contentSize, "contentSize.height", Long.class);
/*
* In chrome 61, delivered one day after I wrote this comment, the method forceViewport was removed.
* I commented it out here with the if(false), and hopefully wrote a working alternative in the else 8-/
*/
if(false) {
send("Emulation.setVisibleSize", ImmutableMap.of("width", cw, "height", ch));
send("Emulation.forceViewport", ImmutableMap.of("x", Long.valueOf(0), "y", Long.valueOf(0), "scale", Long.valueOf(1)));
} else {
send("Emulation.setDeviceMetricsOverride",
ImmutableMap.of("width", cw, "height", ch, "deviceScaleFactor", Long.valueOf(1), "mobile", Boolean.FALSE, "fitWindow", Boolean.FALSE)
);
send("Emulation.setVisibleSize", ImmutableMap.of("width", cw, "height", ch));
}
Object value = send("Page.captureScreenshot", ImmutableMap.of("format", "png", "fromSurface", Boolean.TRUE));
// Since chrome 61 this call has disappeared too; it does not seem to be necessary anymore with the new code.
// send("Emulation.resetViewport", ImmutableMap.of());
send("Emulation.setVisibleSize", ImmutableMap.of("x", Long.valueOf(0), "y", Long.valueOf(0), "width", visibleW, "height", visibleH));
String image = jsonValue(value, "data", String.class);
byte[] bytes = Base64.getDecoder().decode(image);
try(FileOutputStream fos = new FileOutputStream(output)) {
fos.write(bytes);
}
}
@Nonnull
private Object evaluate(@Nonnull String script) throws IOException {
Map<String, Object> param = new HashMap<>();
param.put("returnByValue", Boolean.TRUE);
param.put("expression", script);
return send("Runtime.evaluate", param);
}
@Nonnull
private Object send(@Nonnull String cmd, @Nonnull Map<String, Object> params) throws IOException {
Map<String, Object> exe = ImmutableMap.of("cmd", cmd, "params", params);
Command xc = new Command(m_wd.getSessionId(), "sendCommandWithResult", exe);
Response response = m_wd.getCommandExecutor().execute(xc);
Object value = response.getValue();
if(response.getStatus() == null || response.getStatus().intValue() != 0) {
//System.out.println("resp: " + response);
throw new MyChromeDriverException("Command '" + cmd + "' failed: " + value);
}
if(null == value)
throw new MyChromeDriverException("Null response value to command '" + cmd + "'");
//System.out.println("resp: " + value);
return value;
}
@Nullable
static private <T> T jsonValue(@Nonnull Object map, @Nonnull String path, @Nonnull Class<T> type) {
String[] segs = path.split("\\.");
Object current = map;
for(String name: segs) {
Map<String, Object> cm = (Map<String, Object>) current;
Object o = cm.get(name);
if(null == o)
return null;
current = o;
}
return (T) current;
}
}
This lets you use the commands as specified and creates a file containing a PNG-format image. You can of course also create a BufferedImage directly by using ImageIO.read() on the bytes.
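For completeness, here is a minimal usage sketch of the two classes above. The file name is arbitrary, and it assumes a Selenium version in which ChromeOptions implements Capabilities:
MyChromeDriver driver = new MyChromeDriver(new ChromeOptions());
try {
    driver.get("https://stackoverflow.com/questions");
    // writes a full-page PNG to the given file
    new ChromeExtender(driver).takeScreenshot(new File("fullpage.png"));
    // or decode it into a BufferedImage for further processing
    BufferedImage image = ImageIO.read(new File("fullpage.png"));
} finally {
    driver.quit();
}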
In Selenium 4, FirefoxDriver provides a getFullPageScreenshotAs method that handles vertical and horizontal scrolling, as well as fixed elements (e.g. navbars). ChromeDriver may implement this method in later releases.
System.setProperty("webdriver.gecko.driver", "path/to/geckodriver");
final FirefoxOptions options = new FirefoxOptions();
// set options...
final FirefoxDriver driver = new FirefoxDriver(options);
driver.get("https://stackoverflow.com/");
File fullScreenshotFile = driver.getFullPageScreenshotAs(OutputType.FILE);
// File will be deleted once the JVM exits, so you should copy it
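For instance, a quick way to copy it before the JVM exits (a sketch using java.nio.file; the target path is arbitrary):
Files.copy(fullScreenshotFile.toPath(), Paths.get("screenshot.png"),
        StandardCopyOption.REPLACE_EXISTING);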
For ChromeDriver, the selenium-shutterbug library can be used.
// Take a screenshot of the whole page using Chrome DevTools
Shutterbug.shootPage(driver, Capture.FULL, true).save();
I'm trying to load a CSV file as a JavaRDD<String> and then want to get the data into a JavaRDD<Vector>:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;
import org.apache.spark.mllib.stat.Statistics;
import breeze.collection.mutable.SparseArray;
import scala.collection.immutable.Seq;
public class Trial {
public void start() throws InstantiationException, IllegalAccessException,
ClassNotFoundException {
run();
}
private void run(){
SparkConf conf = new SparkConf().setAppName("csvparser");
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> data = jsc.textFile("C:/Users/kalraa2/Documents/trial.csv");
JavaRDD<Vector> datamain = data.flatMap(null);
MultivariateStatisticalSummary mat = Statistics.colStats(datamain.rdd());
System.out.println(mat.mean());
}
private List<Vector> Seq(Vector dv) {
// TODO Auto-generated method stub
return null;
}
public static void main(String[] args) throws Exception {
Trial trial = new Trial();
trial.start();
}
}
The program runs without any errors, but I'm not able to get anything when trying to run it on the Spark machine. Can anyone tell me whether the conversion of the String RDD to a Vector RDD is correct?
My CSV file consists of only one column of floating-point numbers.
The null in this flatMap invocation might be a problem:
JavaRDD<Vector> datamain = data.flatMap(null);
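If flatMap is really what you want, it needs a non-null function. A minimal sketch, assuming Spark 1.x, where FlatMapFunction.call returns an Iterable (in Spark 2.x it returns an Iterator):
import org.apache.spark.api.java.function.FlatMapFunction;
import java.util.Collections;
JavaRDD<Vector> datamain = data.flatMap(new FlatMapFunction<String, Vector>() {
    public Iterable<Vector> call(String s) {
        // each CSV line holds a single floating-point value
        return Collections.singletonList(Vectors.dense(Double.parseDouble(s.trim())));
    }
});
That said, since each line produces exactly one vector, map (as the answers below use) is the simpler choice.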
I solved it by changing the code to this:
JavaRDD<Vector> datamain = data.map(new Function<String,Vector>(){
public Vector call(String s){
String[] sarray = s.trim().split("\\r?\\n");
double[] values = new double[sarray.length];
for (int i = 0; i < sarray.length; i++) {
values[i] = Double.parseDouble(sarray[i]);
System.out.println(values[i]);
}
return Vectors.dense(values);
}
}
);
Assuming your trial.csv file looks like this:
1.0
2.0
3.0
Taking the original code from your question, only a one-line change is required with Java 8:
SparkConf conf = new SparkConf().setAppName("csvparser").setMaster("local");
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> data = jsc.textFile("C:/Users/kalraa2/Documents/trial.csv");
JavaRDD<Vector> datamain = data.map(s -> Vectors.dense(Double.parseDouble(s)));
MultivariateStatisticalSummary mat = Statistics.colStats(datamain.rdd());
System.out.println(mat.mean());
This prints 2.0.
This is actually related to the question "How can I add row numbers for rows in PIG or HIVE?"
The third answer there, provided by srini, works fine, but I have trouble accessing the data after the UDF.
The UDF provided by srini is the following:
import java.io.IOException;
import java.util.Iterator;
import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.data.DataType;
public class RowCounter extends EvalFunc<DataBag> {
TupleFactory mTupleFactory = TupleFactory.getInstance();
BagFactory mBagFactory = BagFactory.getInstance();
public DataBag exec(Tuple input) throws IOException {
try {
DataBag output = mBagFactory.newDefaultBag();
DataBag bg = (DataBag)input.get(0);
Iterator it = bg.iterator();
Integer count = new Integer(1);
while (it.hasNext()) {
Tuple t = (Tuple) it.next();
t.append(count);
output.add(t);
count = count + 1;
}
return output;
} catch (ExecException ee) {
// error handling goes here
throw ee;
}
}
public Schema outputSchema(Schema input) {
try{
Schema bagSchema = new Schema();
bagSchema.add(new Schema.FieldSchema("RowCounter", DataType.BAG));
return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input),
bagSchema, DataType.BAG));
}catch (Exception e){
return null;
}
}
}
I wrote a simple test Pig script as follows:
A = load 'input.txt' using PigStorage(' ') as (name:chararray, age:int);
/*
--A: {name: chararray,age: int}
(amy,56)
(bob,1)
(bob,9)
(amy,34)
(bob,20)
(amy,78)
*/
B = group A by name;
C = foreach B {
orderedGroup = order A by age;
generate myudfs.RowCounter(orderedGroup) as t;
}
/*
--C: {t: {(RowCounter: {})}}
({(amy,34,1),(amy,56,2),(amy,78,3)})
({(bob,1,1),(bob,9,2),(bob,20,3)})
*/
D = foreach C generate FLATTEN(t);
/*
D: {t::RowCounter: {}}
(amy,34,1)
(amy,56,2)
(amy,78,3)
(bob,1,1)
(bob,9,2)
(bob,20,3)
*/
The problem is how to use D in later operations. I tried multiple ways, but always got the following error:
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.DataBag
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:575)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:248)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:316)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:459)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:427)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:407)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:261)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:572)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:414)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:256)
My guess is that this is because we don't have the schema for the tuples inside the bag. If this is the reason, how should I modify the UDF?
OK, I found the solution by rewriting the outputSchema as follows:
public Schema outputSchema(Schema input) {
try{
Schema.FieldSchema counter = new Schema.FieldSchema("counter", DataType.INTEGER);
Schema tupleSchema = new Schema(input.getField(0).schema.getField(0).schema.getFields());
tupleSchema.add(counter);
Schema.FieldSchema tupleFs;
tupleFs = new Schema.FieldSchema("with_counter", tupleSchema, DataType.TUPLE);
Schema bagSchema = new Schema(tupleFs);
return new Schema(new Schema.FieldSchema("row_counter",
bagSchema, DataType.BAG));
}catch (Exception e){
return null;
}
}
Thanks.