Failing to read from Bigtable in Dataflow - Java

I am using Dataflow at work to write data into Bigtable. I now have a task that requires reading rows back out of Bigtable.
However, whenever I try to read rows from Bigtable using bigtable-hbase-dataflow, it fails with the following error.
Error: (3218070e4dd208d3): java.lang.IllegalArgumentException: b <= a
at org.apache.hadoop.hbase.util.Bytes.iterateOnSplits(Bytes.java:1720)
at org.apache.hadoop.hbase.util.Bytes.split(Bytes.java:1683)
at org.apache.hadoop.hbase.util.Bytes.split(Bytes.java:1664)
at com.google.cloud.bigtable.dataflow.CloudBigtableIO$AbstractSource.split(CloudBigtableIO.java:512)
at com.google.cloud.bigtable.dataflow.CloudBigtableIO$AbstractSource.getSplits(CloudBigtableIO.java:358)
at com.google.cloud.bigtable.dataflow.CloudBigtableIO$Source.splitIntoBundles(CloudBigtableIO.java:593)
at com.google.cloud.dataflow.sdk.runners.worker.WorkerCustomSources.performSplit(WorkerCustomSources.java:413)
at com.google.cloud.dataflow.sdk.runners.worker.WorkerCustomSources.performSplitWithApiLimit(WorkerCustomSources.java:171)
at com.google.cloud.dataflow.sdk.runners.worker.WorkerCustomSources.performSplit(WorkerCustomSources.java:149)
at com.google.cloud.dataflow.sdk.runners.worker.SourceOperationExecutor.execute(SourceOperationExecutor.java:58)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:288)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:221)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:173)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.doWork(DataflowWorkerHarness.java:193)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:173)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:160)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I am currently using 'com.google.cloud.dataflow:google-cloud-dataflow-java-sdk-all:1.6.0' and 'com.google.cloud.bigtable:bigtable-hbase-dataflow:0.9.0'. Here's my code.
CloudBigtableScanConfiguration config = new CloudBigtableScanConfiguration.Builder()
.withProjectId("project-id")
.withInstanceId("instance-id")
.withTableId("table")
.build();
pipeline.apply(Read.<Result>from(CloudBigtableIO.read(config)))
.apply(ParDo.of(new Test()));
FYI, in the Test DoFn I just read from Bigtable and count the rows with an aggregator.
static class Test extends DoFn<Result, Result> {
private static final long serialVersionUID = 0L;
private final Aggregator<Long, Long> rowCount = createAggregator("row_count", new Sum.SumLongFn());
@Override
public void processElement(ProcessContext c) {
rowCount.addValue(1L);
c.output(c.element());
}
}
I just followed the tutorial in the Dataflow documentation, but it fails. Can anyone help me out?

The root cause was a dependency issue.
Previously, our build file omitted this dependency:
compile 'io.netty:netty-tcnative-boringssl-static:1.1.33.Fork22'
I added the dependency today and it resolved all the issues. I double-checked that the problem comes back when the dependency is missing from the build file.
From https://github.com/GoogleCloudPlatform/cloud-bigtable-client/issues/912#issuecomment-249999380.

Related

SnapStart function priming - Doesn't seem to be working

I’m somewhat new to Kotlin/Java, but I have been using AWS Lambda for several years now (all Python and Node). I’ve been trying to “successfully” enable SnapStart on a SpringBoot Lambda using Kotlin running on java11 corretto (the only runtime supported currently), but it doesn’t seem to be working as I would have expected.
I have hooked into the CRaC lifecycle methods beforeCheckpoint and afterRestore. In beforeCheckpoint I’ve initialized the SpringBoot application and I can see it in the deployment logs (AWS creates log streams for the deployment phase with SnapStart lambdas).
However, the concerning thing is that I'm also seeing the SpringBoot app get initialized in the function invocation logs. I would have expected that to happen only during the deployment/initialization phase, when the snapshot is being created. As a result I'm not really seeing a tremendous improvement in latency overall.
Any ideas why this is happening?
I ran into essentially the same issue (with Java instead of Kotlin) and the solution was to switch the runtime->handler from
org.springframework.cloud.function.adapter.aws.SpringBootStreamHandler
to
org.springframework.cloud.function.adapter.aws.FunctionInvoker::handleRequest
It is probably worth mentioning that, as of 2023-02-20, SnapStart isn't engaged for the $LATEST version of an AWS Lambda function, i.e. make sure you are invoking a particular published version. Beyond that, the Best practices for working with Lambda SnapStart article says that the main performance killers are dynamically loaded classes and network connections that need to be re-established from time to time.
From the SnapStart integration issue raised for Spring Cloud Function on GitHub, I tend to think that switching to org.springframework.cloud.function.adapter.aws.FunctionInvoker probably helps somewhat, but doesn't address the performance challenges mentioned above. I'm not sure I'm interpreting olegz's advice correctly, but what has worked best so far for my AWS Lambda function built with Spring Boot/Spring Cloud Function is a "warm-up" config. It hooks into the CRaC lifecycle via beforeCheckpoint() and issues dummy requests to S3 and DynamoDB before the VM snapshot is made. This way most dynamically loaded classes are pre-loaded, and network connections are pre-established, before any subsequent function invocation takes place.
package eu.mycompany.mysamplesystem.attachmentstore.configuration;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import eu.mycompany.mysamplesystem.attachmentstore.handlers.MainEventHandler;
import lombok.extern.slf4j.Slf4j;
import org.crac.Core;
import org.crac.Resource;
import org.springframework.context.annotation.Configuration;
import software.amazon.awssdk.services.s3.model.NoSuchKeyException;
import java.util.ArrayList;
import java.util.List;
@Configuration
@Slf4j
public class WarmUpConfig implements Resource {
private final MainEventHandler mainEventHandler;
public WarmUpConfig(final MainEventHandler mainEventHandler) {
Core.getGlobalContext().register(this);
this.mainEventHandler = mainEventHandler;
}
@Override
public void beforeCheckpoint(org.crac.Context<? extends Resource> context) {
log.debug("Warm-up MainEventHandler by issuing dummy requests");
dummyS3Invocation();
dummyDynamoDbInvocation();
}
@Override
public void afterRestore(org.crac.Context<? extends Resource> context) {
}
public void dummyS3Invocation() {
S3Event s3Event = generateWarmUpEvent("ObjectCreated:Put");
try {
mainEventHandler.handleRequest(s3Event, null);
throw new IllegalStateException("Warm-up event processing should have reached S3 and failed with S3Exception");
} catch (NoSuchKeyException e) {
log.debug("S3Exception is expected, since it is a warm-up");
}
}
public void dummyDynamoDbInvocation() {
S3Event s3Event = generateWarmUpEvent("ObjectRemoved:Delete");
mainEventHandler.handleRequest(s3Event, null);
}
private S3Event generateWarmUpEvent(String eventName) {
S3Event.S3BucketEntity s3BucketEntity = new S3Event.S3BucketEntity("hopefully_non_existing_bucket", null, null);
S3Event.S3ObjectEntity s3ObjectEntity = new S3Event.S3ObjectEntity("hopefully/non/existing.key", 0L, null, null, null);
S3Event.S3Entity s3Entity = new S3Event.S3Entity(null, s3BucketEntity, s3ObjectEntity, null);
List<S3Event.S3EventNotificationRecord> records = new ArrayList<>();
records.add(new S3Event.S3EventNotificationRecord(null, eventName, null, null, null, null, null, s3Entity, null));
return new S3Event(records);
}
}
P.S.: The MainEventHandler is basically the entry point to all the business logic exposed by the Function.
@SpringBootApplication
@RequiredArgsConstructor
public class Lambda {
private final MainEventHandler mainEventHandler;
public static void main(String... args) {
SpringApplication.run(Lambda.class, args);
}
@Bean
public Function<Message<S3Event>, String> defaultFunctionLambda() {
return message -> {
Context context = message.getHeaders().get("aws-context", Context.class);
return mainEventHandler.handleRequest(message.getPayload(), context);
};
}
}

Read operation right after Elasticsearch index creation causes exception

I am trying to perform a read operation on an Elasticsearch index right after it has been created. Here is a simple program that reproduces the situation:
import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import static java.net.InetAddress.getLoopbackAddress;
public class ElasticIssue {
static String index = "my_index";
public static void main(String[] args) {
final Client c = getClient();
deleteIndexIfExists(c);
createIndex(c);
//refresh(c);
//flush(c);
//delay();
//indexDoc(c);
getDoc(c);
}
static void getDoc(Client client) {
client.prepareGet(index, "some-type", "1").get();
}
static void indexDoc(Client client) {
client.prepareIndex(index, "another-type", "25").setSource("{}").get();
}
static void createIndex(Client client) {
client.admin().indices().prepareCreate(index).get();
}
static void delay() {
try {Thread.sleep(3000);} catch (InterruptedException e) {}
}
static void flush(Client client) {
client.admin().indices().prepareFlush(index).get();
}
private static void refresh(Client client) {
client.admin().indices().prepareRefresh(index).get();
}
static void deleteIndexIfExists(Client client) {
final IndicesExistsResponse response = client.admin().indices().prepareExists(index).get();
if (response.isExists()) {
deleteIndex(client);
}
}
static void deleteIndex(Client client) {
client.admin().indices().prepareDelete(index).get();
}
static Client getClient() {
final Settings settings = Settings.builder()
.put("cluster.name", "elasticsearch") //default name
.put("node.name", "my-node")
.build();
return TransportClient.builder()
.settings(settings)
.build()
.addTransportAddress(new InetSocketTransportAddress(getLoopbackAddress(), 9300));
}
}
And then I get the following error:
Exception in thread "main" NoShardAvailableActionException[No shard available for [get [my_index][some-type][1]: routing [null]]]; nested: RemoteTransportException[[my-node][172.17.0.2:9300][indices:data/read/get[s]]]; nested: IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]];
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.perform(TransportSingleShardAction.java:199)
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.onFailure(TransportSingleShardAction.java:186)
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.access$1300(TransportSingleShardAction.java:115)
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$2.handleException(TransportSingleShardAction.java:240)
at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:855)
at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:833)
at org.elasticsearch.transport.TransportService$4.onFailure(TransportService.java:387)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: RemoteTransportException[[my-node][172.17.0.2:9300][indices:data/read/get[s]]]; nested: IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]];
Caused by: [my_index][[my_index][3]] IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]]
at org.elasticsearch.index.shard.IndexShard.readAllowed(IndexShard.java:1035)
at org.elasticsearch.index.shard.IndexShard.get(IndexShard.java:651)
at org.elasticsearch.index.get.ShardGetService.innerGet(ShardGetService.java:173)
at org.elasticsearch.index.get.ShardGetService.get(ShardGetService.java:86)
at org.elasticsearch.action.get.TransportGetAction.shardOperation(TransportGetAction.java:101)
at org.elasticsearch.action.get.TransportGetAction.shardOperation(TransportGetAction.java:44)
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:282)
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:275)
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
It seems that the Elasticsearch index creation was not complete even though the response had already been returned, which is a bit frustrating. If I do any of the following, the read operation then succeeds: add a delay, index any document, refresh the index, or flush the index (uncomment the corresponding line to try it).
What is the explanation for this behavior? What is the recommended way to make sure the index is ready for use? The workarounds listed above were found by experiment.
I am using Elasticsearch 2.3.3 and Java 8. All communication with Elasticsearch is done over the transport protocol (via the Java API).
For easier setup, here is a Docker command to start a container with all the necessary settings:
docker run -p 9200:9200 -p 9300:9300 elasticsearch:2.3.3 -Des.node.name="my-node"
Here is the Maven dependency for the Elasticsearch Java API:
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch</artifactId>
<version>2.3.3</version>
</dependency>
You need to wait until the index is actually created. You can do this by waiting until the index health reaches at least yellow status.
After creating the index, call the function below:
import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.cluster.health.ClusterHealthStatus;

static void indexStatusCheck(Client client) {
    ClusterHealthResponse response = client.admin().cluster().prepareHealth().setIndices(index).setWaitForYellowStatus().get();
    if (response.getStatus() == ClusterHealthStatus.RED) {
        throw new RuntimeException("Index not ready");
    }
}
Then you can proceed with the getDoc() call.
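For example, the main() method from the question could call this helper right after createIndex(c) — a minimal sketch, assuming indexStatusCheck above is added to the same class:
public static void main(String[] args) {
    final Client c = getClient();
    deleteIndexIfExists(c);
    createIndex(c);
    indexStatusCheck(c); // block until the new index reports at least yellow health
    getDoc(c);           // shards are started by now, so the get no longer throws
}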

How do I include a resource file in a Java project to be used with just new File()?

I'm writing a UDF for Pig in Java. It works fine, but Pig doesn't give me a way to separate environments. What my Pig script does is look up the geo location for an IP address.
Here's my code on the Geo location part.
private static final String GEO_DB = "GeoLite2-City.mmdb";
private static final String GEO_FILE = "/geo/" + GEO_DB;
public Map<String, Object> geoData(String ipStr) {
Map<String, Object> geoMap = new HashMap<String, Object>();
DatabaseReader reader = new DatabaseReader.Builder(new File(GEO_DB)).build();
// other stuff
}
GeoLite2-City.mmdb exists in HDFS, which is why I can refer to it by the absolute path /geo/GeoLite2-City.mmdb.
However, I can't do that from my JUnit test unless I create /geo/GeoLite2-City.mmdb on my local machine and on Jenkins, which is not ideal. I'm trying to figure out a way to make my test pass while still using new File(GEO_DB) and not
getClass().getResourceAsStream('./geo/GeoLite2-City.mmdb')
because that doesn't work in Hadoop.
And if I run the JUnit test it fails, because I don't have /geo/GeoLite2-City.mmdb on my local machine.
Is there any way I can overcome this? I just want my tests to pass without changing the code to use getClass().getResourceAsStream, and I can't if/else around it because Pig doesn't give me a way to pass in a parameter (or maybe I'm missing something).
And this is my JUnit test
@Test
@Ignore
public void shouldGetGeoData() throws Exception {
String ipTest = "128.101.101.101";
Map<String, Object> geoJson = new LogLine2Json().geoData(ipTest);
assertThat(geoJson.get("lLa").toString(), is(equalTo("44.9759")));
assertThat(geoJson.get("lLo").toString(), is(equalTo("-93.2166")));
}
which works if I read the database file from the resources folder. That's why I have the @Ignore there.
Leaving the immediate question aside: your code as a whole looks pretty un-testable.
Every time you call new directly in your production code, you prevent dependency injection and thereby make your code much harder to test.
The point is to not call new File() within your production code.
Instead, you could use a factory that gives you a "ready to use" DatabaseReader object. Then you can test your factory to do the right thing; and you can mock that factory when testing this code (to return a mocked database reader).
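A rough sketch of that factory idea, assuming the MaxMind DatabaseReader from the question (DatabaseReaderFactory and FileDatabaseReaderFactory are made-up names for illustration, not part of any library; in real code each type goes in its own file):
import com.maxmind.geoip2.DatabaseReader;
import java.io.File;
import java.io.IOException;

// Production code depends on this interface instead of calling new File() itself.
public interface DatabaseReaderFactory {
    DatabaseReader open() throws IOException;
}

// Production implementation: builds the reader from a configured file path.
public class FileDatabaseReaderFactory implements DatabaseReaderFactory {
    private final String path;

    public FileDatabaseReaderFactory(String path) {
        this.path = path;
    }

    @Override
    public DatabaseReader open() throws IOException {
        return new DatabaseReader.Builder(new File(path)).build();
    }
}
In a unit test you would then hand the class under test a mocked DatabaseReaderFactory (e.g. with Mockito) that returns a stubbed DatabaseReader, so no real .mmdb file is needed at all.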
So that one File instance is just the tip of your "testing problems" here.
Honestly: don't write the production code first. Do TDD: write the test cases first, and you will quickly learn that production code like the code presented here is really hard to test. When you start from the "test perspective", you end up writing production code that is actually testable.
You have to make the file location configurable, e.g. inject it via the constructor. For example, you could create a non-default constructor that is used for testing only.
public class LogLine2Json {
private static final String DEFAULT_GEO_DB = "GeoLite2-City.mmdb";
private static final String DEFAULT_GEO_FILE = "/geo/" + DEFAULT_GEO_DB;
private final String geoFile;
public LogLine2Json() {
this(DEFAULT_GEO_FILE);
}
LogLine2Json(String geoFile) {
this.geoFile = geoFile;
}
public Map<String, Object> geoData(String ipStr) {
Map<String, Object> geoMap = new HashMap<String, Object>();
File file = new File(geoFile);
DatabaseReader reader = new DatabaseReader.Builder(file).build();
// other stuff
}
}
Now you can create a file from the resource and use this file in your test.
public class LogLine2JsonTest {
@Rule
public final TemporaryFolder folder = new TemporaryFolder();
@Test
public void shouldGetGeoData() throws Exception {
File dbFile = copyResourceToFile("/geo/GeoLite2-City.mmdb");
String ipTest = "128.101.101.101";
LogLine2Json logLine2Json = new LogLine2Json(dbFile.getAbsolutePath());
Map<String, Object> geoJson = logLine2Json.geoData(ipTest);
assertThat(geoJson.get("lLa").toString(), is(equalTo("44.9759")));
assertThat(geoJson.get("lLo").toString(), is(equalTo("-93.2166")));
}
private File copyResourceToFile(String name) throws IOException {
InputStream resource = getClass().getResourceAsStream(name);
File file = folder.newFile();
Files.copy(resource, file.toPath(), StandardCopyOption.REPLACE_EXISTING);
return file;
}
}
TemporaryFolder is a JUnit rule that deletes every file that is created during test afterwards.
You may modify the asserts by using the hasToString matcher. This will give you more detailed information in case of a failing test. (And you have to read/write less code.)
assertThat(geoJson.get("lLa"), hasToString("44.9759"));
assertThat(geoJson.get("lLo"), hasToString("-93.2166"));
You don't. Your question embodies a contradiction in terms. Resources are not files and do not live in the file system. You can either distribute the file separately from the JAR and use it as a File or include it in the JAR and use it as a resource. Not both. You have to make up your mind.
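That said, if you pick the "resource" option you don't need a File at all: if I recall the MaxMind API correctly, DatabaseReader.Builder also accepts an InputStream, so the .mmdb can stay inside the JAR. A minimal sketch:
import com.maxmind.geoip2.DatabaseReader;
import java.io.IOException;
import java.io.InputStream;

public class GeoReaderFromResource {
    public static DatabaseReader open() throws IOException {
        // Assumes the .mmdb is packaged under src/main/resources/geo/ and therefore on the classpath.
        InputStream db = GeoReaderFromResource.class.getResourceAsStream("/geo/GeoLite2-City.mmdb");
        return new DatabaseReader.Builder(db).build();
    }
}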

Why is SystemProperty.environment.value() returning null in Google App Engine?

I have a Google App Engine Java app that is returning null from SystemProperty.environment.value(), and all other static members of SystemProperty. I see this when running my JUnit tests via Maven.
import com.google.appengine.api.utils.SystemProperty;
...
void printProps() {
log.info("props:" + System.getProperties());
log.info("env=" + SystemProperty.environment.value());
log.info("log=" + System.getProperty("java.util.logging.config.file"));
log.info("id=" + SystemProperty.applicationId.get());
log.info("ver=" + SystemProperty.applicationVersion.get());
}
The only item above that returns non-null is System.getProperties().
Here are some of the details of my setup:
IntelliJ IDEA EAP 13
Maven
App Engine SDK 1.8.5
Java 7 (1.7.0_40)
JUnit 4
I was having the same problem. I attempted to call these methods on LocalServiceTestHelper, but they did NOT populate SystemProperty.applicationId or SystemProperty.version:
setEnvAppId(java.lang.String envAppId)
setEnvVersionId(java.lang.String envVersionId)
e.g.
public final LocalServiceTestHelper helper = new LocalServiceTestHelper(new LocalDatastoreServiceTestConfig(), new LocalTaskQueueTestConfig() ).setEnvAppId("JUnitApplicationId").setEnvVersionId("JUnitVersion");
My solution was simply to populate those properties myself in my JUnit setUp() method:
@Before
public void setUp() throws Exception {
SystemProperty.version.set("JUnitVersion");
SystemProperty.applicationId.set("JUnitApplicationId");
SystemProperty.applicationVersion.set("JUnitApplicationVersion");
SystemProperty.environment.set( SystemProperty.Environment.Value.Development );
helper.setUp();
datastore = DatastoreServiceFactory.getDatastoreService();
queue = QueueFactory.getDefaultQueue();
}
Note that the only valid values for SystemProperty.Environment.Value are Production and Development.
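For completeness, production code typically branches on that value like this (a small sketch using the public SystemProperty API):
import com.google.appengine.api.utils.SystemProperty;

static boolean runningOnAppEngine() {
    // Production is only reported on App Engine itself; the local dev server
    // (and unit tests, after the setUp() above) report Development.
    return SystemProperty.environment.value() == SystemProperty.Environment.Value.Production;
}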

Kafka Storm Integration using Kafka Spout

I am using KafkaSpout. Please find the test program below.
I am using Storm 0.8.1. The MultiScheme class only exists in Storm 0.8.2, so I will be using that. I just want to know how the earlier versions worked by simply instantiating the StringScheme() class. Where can I download earlier versions of the Kafka spout? But I doubt that would be a better alternative than moving to Storm 0.8.2. (Confused.)
When I run the code (given below) on the Storm cluster (i.e. when I submit my topology) I get the following error (this happens when the Scheme part is commented out; otherwise, of course, I get a compiler error because the class does not exist in 0.8.1):
java.lang.NoClassDefFoundError: backtype/storm/spout/MultiScheme
at storm.kafka.TestTopology.main(TestTopology.java:37)
Caused by: java.lang.ClassNotFoundException: backtype.storm.spout.MultiScheme
In the code given below you will find the spoutConfig.scheme = new StringScheme(); part commented out. I was getting a compiler error if I didn't comment that line out, which is only natural as there is no matching constructor there. Also, when I instantiate MultiScheme I get an error because I don't have that class in 0.8.1.
public class TestTopology {
public static class PrinterBolt extends BaseBasicBolt {
public void declareOutputFields(OutputFieldsDeclarer declarer) {
}
public void execute(Tuple tuple, BasicOutputCollector collector) {
System.out.println(tuple.toString());
}
}
public static void main(String [] args) throws Exception {
List<HostPort> hosts = new ArrayList<HostPort>();
hosts.add(new HostPort("127.0.0.1",9092));
LocalCluster cluster = new LocalCluster();
TopologyBuilder builder = new TopologyBuilder();
SpoutConfig spoutConfig = new SpoutConfig(new KafkaConfig.StaticHosts(hosts, 1), "test", "/zkRootStorm", "STORM-ID");
spoutConfig.zkServers=ImmutableList.of("localhost");
spoutConfig.zkPort=2181;
//spoutConfig.scheme=new StringScheme();
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
builder.setSpout("spout",new KafkaSpout(spoutConfig));
builder.setBolt("printer", new PrinterBolt())
.shuffleGrouping("spout");
Config config = new Config();
cluster.submitTopology("kafka-test", config, builder.createTopology());
Thread.sleep(600000);
}
}
I had the same problem. Finally resolved it, and I put the complete running example up on github.
You are welcome to check it out here >
https://github.com/buildlackey/cep
(click on the storm+kafka directory for a sample program that should get you up and running).
We had a similar issue.
Our solution:
Open pom.xml
Change the scope of the Storm dependency from provided to <scope>compile</scope>
If you want to know more about dependency scopes, check the Maven documentation:
Maven docu - dependency scopes
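For reference, the change is just the scope element on the Storm dependency in pom.xml — a sketch with illustrative coordinates (use whatever artifact your project already declares):
<dependency>
    <groupId>storm</groupId>
    <artifactId>storm</artifactId>
    <version>0.8.1</version>
    <!-- was <scope>provided</scope> -->
    <scope>compile</scope>
</dependency>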
