I use Google Cloud Dataflow to process bounded data and output it to BigQuery, and I want it to process and write results as it goes (like a stream, not a batch). Is there any way I can do this?
Currently, Dataflow waits for the workers to finish processing all the data before writing to BigQuery. I tried adding a FixedWindow and using the log timestamp as the window_timestamp, but it doesn't work.
I want to know:
Is windowing the right way to handle this problem?
Does BigQueryIO really write in batch, or is it perhaps just not showing up on my dashboard (writing a stream in the background)?
Is there any way to do what I need?
My source code is here: http://pastie.org/10907947
Thank you very much!
You need to set the streaming property to true in your PipelineOptions.
See "streaming execution" for more information.
In addition, you'll need to be using sources/sinks that can generate/consume unbounded data. BigQuery can already write in both modes, but currently TextIO only reads bounded data. But it's definitely possible to write a custom unbounded source that scans a directory for new files.
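For reference, a minimal sketch of what enabling that looks like inside your main(String[] args), assuming the Dataflow Java SDK 1.x (class and option names differ in newer Beam releases):

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

// Enable streaming mode so results are written as they become available
// instead of only after the whole bounded input has been processed.
DataflowPipelineOptions options = PipelineOptionsFactory
        .fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);
options.setStreaming(true);

Pipeline p = Pipeline.create(options);
// ... build the rest of the pipeline against p as usual ...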
I would like to use the windowing mechanism that Kafka Streams provides in order to perform some actions on streaming data.
I've tried to simulate the windowing mechanism with the Processor API, performing the action every n seconds using context.schedule(), but that way I cannot have hopping windows.
Is there a way to achieve this?
If I use the time interval in context.schedule() as the advance of the window, then how can I control/set the window size?
I need to keep a StateStore that holds a data structure across windows, and I modify that structure based on the records that arrive. I need to perform these actions on every record that arrives, so I think I can use the transform() method.
Finally, at the end of each window I need to forward some data from the data structure mentioned above.
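One way to sketch the approach described above (assuming Kafka Streams 2.1+, an assumed store name "window-store" that is registered on the topology and connected to this transformer, and a 30-second wall-clock interval standing in for the window; adjust the punctuation type and interval to your semantics):

import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class WindowedTransformer implements Transformer<String, String, KeyValue<String, String>> {

    private ProcessorContext context;
    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.store = (KeyValueStore<String, String>) context.getStateStore("window-store");

        // The schedule interval plays the role of the window size: at the end of
        // each interval, forward the accumulated state downstream.
        context.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            try (KeyValueIterator<String, String> it = store.all()) {
                while (it.hasNext()) {
                    KeyValue<String, String> entry = it.next();
                    context.forward(entry.key, entry.value);
                }
            }
        });
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        // Update the data structure on every record; forwarding happens in the punctuator.
        store.put(key, value);
        return null;
    }

    @Override
    public void close() { }
}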
I want to store Elasticsearch indices in HDFS files without using the ES-Hadoop Connector.
A proposed solution is to use Spark Streaming custom receivers to read the data and save it as Parquet files; the code looks like this:
JavaDStream<String> jsonDocs = ssc.union(dsList.get(0), dsList.subList(1, dsList.size())); // I have a couple of receivers
jsonDocs.foreachRDD(rdd -> {
    Dataset<Row> ds = spark.read().json(spark.createDataset(rdd.rdd(), Encoders.STRING()));
    ds.write().mode(SaveMode.Append).option("compression", "gzip").parquet(path);
});
With this I get some okay performance numbers; however, since I am new to Spark, I wonder if there is any room for improvement.
For example, I see that the json() and parquet() jobs take most of the time. Is it necessary for the json() jobs to take that long, or can it be avoided?
(I have omitted some other jobs, e.g. count(), from the code snippet for simplicity.)
Using Structured Streaming looks like a good option, but I haven't found a simple solution for it with custom receivers.
Thanks in advance,
spark.read().json(spark.createDataset(rdd.rdd(), Encoders.STRING()));
Looking at the above, reading with json() might not be best for performance-sensitive work. Spark uses JacksonParser in its data source API for reading JSON. If your JSON structure is simple, try parsing it yourself using a map() function to create Rows.
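For illustration, a hedged sketch of that idea, reusing jsonDocs, spark, path, and SaveMode from the snippet above and assuming each document is flat JSON with just two string fields, id and message (adapt the schema and field extraction to your real documents):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Fixed schema: avoids the inference work that spark.read().json() does.
StructType schema = DataTypes.createStructType(new StructField[]{
        DataTypes.createStructField("id", DataTypes.StringType, true),
        DataTypes.createStructField("message", DataTypes.StringType, true)
});

jsonDocs.foreachRDD(rdd -> {
    // Parse each JSON string directly into a Row; in practice create the
    // ObjectMapper once per partition (mapPartitions) instead of per record.
    JavaRDD<Row> rows = rdd.map(json -> {
        JsonNode node = new ObjectMapper().readTree(json);
        return RowFactory.create(node.path("id").asText(), node.path("message").asText());
    });
    Dataset<Row> ds = spark.createDataFrame(rows, schema);
    ds.write().mode(SaveMode.Append).option("compression", "gzip").parquet(path);
});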
Is there functionality built into Kafka Streams that allows for dynamically connecting a single input stream into multiple output streams? KStream.branch allows branching based on true/false predicates, but this isn't quite what I want. I'd like each incoming log to determine the topic it will be streamed to at runtime, e.g., a log {"date": "2017-01-01"} will be streamed to the topic topic-2017-01-01 and a log {"date": "2017-01-02"} will be streamed to the topic topic-2017-01-02.
I could call forEach on the stream, then write to a Kafka producer, but that doesn't seem very elegant. Is there a better way to do this within the Streams framework?
If you want to create topics dynamically based on your data, you do not get any support within Kafka's Streams API at the moment (v0.10.2 and earlier). You will need to create a KafkaProducer and implement the dynamic "routing" yourself (for example using KStream#foreach() or KStream#process()). Note that you need to do synchronous writes to avoid data loss (which are unfortunately not very performant). There are plans to extend the Streams API with dynamic topic routing, but there is no concrete timeline for this feature right now.
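A rough sketch of that workaround (using the newer StreamsBuilder API; the topic names, bootstrap servers, and the extractDate() helper are placeholders for your setup):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;

public class DynamicTopicRouting {

    public static void main(String[] args) {
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());
        KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> logs =
                builder.stream("input-logs", Consumed.with(Serdes.String(), Serdes.String()));

        logs.foreach((key, value) -> {
            String topic = "topic-" + extractDate(value);   // e.g. "topic-2017-01-01"
            try {
                // Block on the send so the write is synchronous (slower, but avoids data loss).
                producer.send(new ProducerRecord<>(topic, key, value)).get();
            } catch (Exception e) {
                throw new RuntimeException("Failed to route record to " + topic, e);
            }
        });

        Properties streamsProps = new Properties();
        streamsProps.put(StreamsConfig.APPLICATION_ID_CONFIG, "dynamic-routing-app");
        streamsProps.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), streamsProps).start();
    }

    // Naive extraction of the "date" field from a log like {"date": "2017-01-01"};
    // use a real JSON parser in practice.
    private static String extractDate(String json) {
        int keyIdx = json.indexOf("\"date\"");
        int start = json.indexOf('"', json.indexOf(':', keyIdx)) + 1;
        return json.substring(start, json.indexOf('"', start));
    }
}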
There is one more consideration you should take into account. If you do not know your destination topic(s) ahead of time and just rely on the so-called "topic auto creation" feature, you should make sure that those topics are being created with the desired configuration settings (e.g., number of partitions or replication factor).
As an alternative to "topic auto creation" you can also use Admin Client (available since v0.10.1) to create topics with correct configuration. See https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations
I am using the MultiResourceItemReader class of Spring Batch, which uses a FlatFileItemReader bean as its delegate. My files contain XML requests; my batch job reads the requests from the files, posts them to a URL, and writes the responses to corresponding output files. I want to dedicate one thread to each file to decrease the execution time. In my current requirement I have four input files, so I want four threads to read, process, and write them. I tried a simpleTaskExecutor with
task-executor="simpleTaskExecutor" throttle-limit="20"
But after using this, the FlatFileItemReader throws an exception.
I am a beginner; please suggest how to implement this. Thanks in advance.
There are a couple of ways to go here. However, the easiest would be to partition by file using the MultiResourcePartitioner. That, in combination with the TaskExecutorPartitionHandler, will give you reliable parallel processing of your input files. You can read more about partitioning in section 7.4 of our documentation here: http://docs.spring.io/spring-batch/trunk/reference/html/scalability.html
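A minimal Java-config sketch of that setup (assuming Spring Batch 3.x/4.x; the file pattern, bean names, and the existing workerStep with its step-scoped FlatFileItemReader are placeholders). The .taskExecutor() call should make the builder create a TaskExecutorPartitionHandler under the hood:

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class PartitionedFileJobConfig {

    @Autowired
    private StepBuilderFactory stepBuilderFactory;

    // One partition per input file; each partition gets its file URL in the step ExecutionContext.
    @Bean
    public MultiResourcePartitioner partitioner() throws Exception {
        Resource[] inputFiles = new PathMatchingResourcePatternResolver()
                .getResources("file:/data/input/*.xml");   // assumed location of the four input files
        MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
        partitioner.setKeyName("inputFile");                // read it with #{stepExecutionContext['inputFile']}
        partitioner.setResources(inputFiles);
        return partitioner;
    }

    // Master step: fans the partitions out to the worker step on separate threads.
    @Bean
    public Step masterStep(Step workerStep) throws Exception {
        return stepBuilderFactory.get("masterStep")
                .partitioner("workerStep", partitioner())
                .step(workerStep)
                .gridSize(4)                                // four files -> four partitions/threads
                .taskExecutor(new SimpleAsyncTaskExecutor())
                .build();
    }
}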
I am writing a map-reduce job in Java. I would like to know whether it is possible to obtain the output of the job as a stream (maybe an output stream) rather than a physical output file. My objective is to use that stream in another application.
You can write a custom OutputFormat and use it to write to any stream you want, not necessarily a file. See this tutorial on how to write a custom OutputFormat.
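As a rough illustration of that idea (not a complete tutorial), here is a custom OutputFormat whose RecordWriter pushes key/value pairs over a TCP socket instead of into a file; the host, port, and Text types are assumptions to adapt to your job:

import java.io.DataOutputStream;
import java.io.IOException;
import java.net.Socket;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SocketOutputFormat extends OutputFormat<Text, Text> {

    @Override
    public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext context) throws IOException {
        // Each reducer task opens its own connection to the downstream consumer.
        Socket socket = new Socket("consumer-host", 9999);          // assumed downstream endpoint
        DataOutputStream out = new DataOutputStream(socket.getOutputStream());
        return new RecordWriter<Text, Text>() {
            @Override
            public void write(Text key, Text value) throws IOException {
                out.writeBytes(key + "\t" + value + "\n");
            }
            @Override
            public void close(TaskAttemptContext ctx) throws IOException {
                out.close();
                socket.close();
            }
        };
    }

    @Override
    public void checkOutputSpecs(JobContext context) { /* nothing to validate for a socket sink */ }

    @Override
    public OutputCommitter getOutputCommitter(TaskAttemptContext context) throws IOException, InterruptedException {
        // Reuse NullOutputFormat's no-op committer since there is no output path to commit.
        return new NullOutputFormat<Text, Text>().getOutputCommitter(context);
    }
}

You would plug it in with job.setOutputFormatClass(SocketOutputFormat.class).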
Alternatively, you can make use of the Hadoop Streaming API. Have a look here for that.
I don't think you can do this with Apache Hadoop. It is designed to work in a distributed system, and AFAIK providing a way to emit an output stream would defeat the purpose, since the system would then have to decide which reducer's stream to emit. You may write to a flat file, a DB, Amazon S3, etc., but you probably won't get a stream.