Creating a CSV connector for Kafka Connect - Java

I'm planning to write my own Kafka Connect CSV connector which will read data from a CSV file and write it to a topic. The data should be written to the topic as JSON.
I also came across Confluent's kafka-connect-spooldir plugin, but I don't want to use it; I want to write my own.
Can anyone advise me on how to go about creating such a connector?

The official Kafka documentation has a section on Connector development so that is probably the best first stop.
Kafka also ships with File Connectors (both Source and Sink). Have a look at the code: https://github.com/apache/kafka/tree/trunk/connect/file/src/main/java/org/apache/kafka/connect/file
It should not be too hard to modify these for your use case.
Finally, as you mentioned, there are already open-source connectors that can read CSV files. So if you get stuck on something, you can check how they did it.
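To give a concrete starting point, here is a minimal sketch of what a custom SourceTask for CSV could look like, loosely modeled on the FileStreamSourceTask linked above. The class name, the "csv.file" and "topic" config keys, and the naive line-by-line JSON building are all assumptions for illustration; a real connector would also need a SourceConnector class, proper offset tracking, and error handling.

```java
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class CsvSourceTask extends SourceTask {
    private BufferedReader reader;
    private String topic;
    private String[] header;

    @Override
    public void start(Map<String, String> props) {
        topic = props.get("topic");
        try {
            reader = Files.newBufferedReader(Paths.get(props.get("csv.file")));
            header = reader.readLine().split(",");   // assume the first row holds the column names
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        List<SourceRecord> records = new ArrayList<>();
        try {
            String line = reader.readLine();
            if (line == null) {
                Thread.sleep(1000);                  // no new data yet, back off briefly
                return records;
            }
            // Build a naive JSON string from the header and the current row
            String[] cols = line.split(",");
            StringBuilder json = new StringBuilder("{");
            for (int i = 0; i < header.length; i++) {
                if (i > 0) json.append(",");
                json.append("\"").append(header[i]).append("\":\"").append(cols[i]).append("\"");
            }
            json.append("}");
            records.add(new SourceRecord(
                    Collections.singletonMap("file", "csv"),   // source partition
                    Collections.singletonMap("line", 0L),      // source offset (not tracked in this sketch)
                    topic, Schema.STRING_SCHEMA, json.toString()));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return records;
    }

    @Override
    public void stop() { /* close the reader here */ }

    @Override
    public String version() { return "0.1"; }
}
```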

Related

Is it possible to check what an embedded Kafka topic contains?

As in the title, I wonder if I can check what an embedded Kafka topic contains, just like checking what a local variable contains.
I tried in the IntelliJ debug GUI but didn't find anything.
You need to write a KafkaConsumer to check the data that any Kafka topic contains, embedded or not.
Otherwise, you can download the Kafka CLI tools, produce and flush some messages, then use the kafka-dump-log tool on the segment files written to disk while your breakpoint is set.
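For the consumer route, a minimal sketch might look like this, assuming a broker at localhost:9092 and a topic named "my-topic" (both hypothetical values for your embedded setup):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class TopicInspector {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // hypothetical broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "debug-inspector");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");          // read the topic from the beginning
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));          // hypothetical topic name
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            records.forEach(r ->
                System.out.printf("offset=%d key=%s value=%s%n", r.offset(), r.key(), r.value()));
        }
    }
}
```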

Design a Spring Batch application to read data from different resources (flat files)

I am developing a batch application using Spring Boot, Java, and Spring Batch, for which I need to read data from different locations. Below is my use case:
There are multiple paths such as C://Temp//M1 and C://Temp//M2, and both locations can contain identical files with the same data, such as C://Temp//M1//File1.txt and C://Temp//M2//File1.txt, or C://Temp//M1//File2.txt and C://Temp//M2//File2.txt.
Before the batch starts, if an identical file exists at both locations, I need to merge the two in memory, remove duplicates, and pass the merged in-memory data as an argument to the reader.
I have designed the batch using a MultiResourceItemReader, which reads the flat files and processes them, but I am not able to achieve the in-memory merging and duplicate removal across multiple files.
Could you please have a look and suggest a way I can achieve this?
In my experience, the BeanIO library is invaluable when it comes to dealing with flat files. It also integrates with Spring Batch.
http://beanio.org/
With regards to reading from two locations, you can:
Implement your reader as a composite that reads the first line from file 1, then from file 2.
First read file 1 through the reader, then enrich it with data from file 2 inside the processor.
Pre-merge the files (a minimal sketch of this option follows below).
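As a rough illustration of the pre-merge option, here is a minimal sketch that reads both copies of a file, removes duplicate lines with a LinkedHashSet, and exposes the merged lines through a plain ItemReader. The MergedLineReader name and the line-level notion of a duplicate are assumptions for illustration; adapt the merge logic to whatever defines a duplicate record in your files.

```java
import org.springframework.batch.item.ItemReader;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

public class MergedLineReader implements ItemReader<String> {
    private final Iterator<String> lines;

    public MergedLineReader(Path first, Path second) throws IOException {
        // LinkedHashSet keeps the original order and drops duplicate lines
        Set<String> merged = new LinkedHashSet<>();
        merged.addAll(Files.readAllLines(first));
        merged.addAll(Files.readAllLines(second));
        this.lines = merged.iterator();
    }

    @Override
    public String read() {
        // Returning null signals the end of input to Spring Batch
        return lines.hasNext() ? lines.next() : null;
    }
}
```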
If you are familiar with Kafka, try the Kafka Connect framework. The Confluent platform makes it easy to use their connectors.
Then consume from Kafka into your Spring application.
https://www.confluent.io/hub
If you are interested in Kafka, I'll explain in more detail.

Elasticsearch: plain text file instead of JSON

I'm interested in Elasticsearch and working with txt files, not JSON. Can Elasticsearch support plain text files? If yes, is there a Java API I can use? (I tested CRUD operations with Postman on a JSON document and it works fine.)
Thanks for the help.
No, the Elasticsearch document API supports only JSON.
But there is a workaround for this problem using ingest pipelines running on ingest nodes in your cluster: https://www.elastic.co/guide/en/elasticsearch/reference/current/ingest.html. By default, each Elasticsearch server instance is an ingest node.
Please have a look at this very well described approach for CSV, https://www.elastic.co/de/blog/indexing-csv-elasticsearch-ingest-node, which is easily adaptable to flat files.
Another option is to use a second tool like Filebeat or Logstash for file ingestion. Have a look here: https://www.elastic.co/products/beats or here: https://www.elastic.co/products/logstash
Having Filebeat in place will solve many problems with minimal effort. Give it a chance ;)
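If you still want to push plain text from Java yourself, a minimal sketch could simply wrap each line of the text file in a JSON field before indexing, since the document API only accepts JSON. This assumes Elasticsearch 7.x with the High Level REST Client; the index name and file path are hypothetical.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;

public class PlainTextIndexer {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            for (String line : Files.readAllLines(Paths.get("notes.txt"))) {   // hypothetical file
                IndexRequest request = new IndexRequest("plain-text-index")    // hypothetical index name
                        .source(Collections.singletonMap("message", line));    // wrap the text in a JSON field
                client.index(request, RequestOptions.DEFAULT);
            }
        }
    }
}
```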

Delete Messages from a Topic in Apache Kafka

So I am new to working with Apache Kafka, and I am trying to create a simple app so I can understand the API better. I know this question has been asked a lot here, but how can I clear out the messages/records that are stored on a topic?
Most of the answers I have seen say to change the message retention time or to delete and recreate the topic. Neither of these is an option for me, as I do not have access to the server.properties file; I am not running Kafka locally, it is hosted on a server. Is there a way to do it in Java code, maybe?
If you are searching for a way to delete messages selectively, the AdminClient API (usable from Java code) provides the deleteRecords method:
https://kafka.apache.org/11/javadoc/org/apache/kafka/clients/admin/AdminClient.html
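A minimal sketch of using it might look like the following, where the broker address, topic name, and offset are hypothetical values; note that deleteRecords only removes records before a given offset per partition, it does not delete arbitrary individual messages.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.RecordsToDelete;
import org.apache.kafka.common.TopicPartition;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class PurgeTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");   // hypothetical broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // Remove all records in partition 0 of "my-topic" before offset 42
            Map<TopicPartition, RecordsToDelete> toDelete = Collections.singletonMap(
                    new TopicPartition("my-topic", 0), RecordsToDelete.beforeOffset(42L));
            admin.deleteRecords(toDelete).all().get();   // wait for the brokers to confirm
        }
    }
}
```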

Best way to import 20GB CSV file to Hadoop

I have a huge 20GB CSV file to copy into Hadoop/HDFS. Of course, I need to handle any error cases (if the server or the transfer/load application crashes).
In such a case, I need to restart the processing (in another node or not) and continue the transfer without starting the CSV file from the beginning.
What is the best and easiest way to do that?
Using Flume? Sqoop? A native Java application? Spark?
Thanks a lot.
If the file is not hosted in HDFS, Flume won't be able to parallelize reading it (the same issue applies to Spark and other Hadoop-based frameworks). Can you mount your HDFS on NFS and then use a file copy?
One advantage of reading with Flume is that it can read the file, publish each line as a separate record, and write one record to HDFS at a time; if something goes wrong, you can restart from that record instead of starting from the beginning.
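If you end up going the native Java route instead, one possible sketch of a resumable copy is to check how many bytes are already in the target HDFS file and append from that point. This assumes append is enabled on your cluster and that the previous partial write is intact; the paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.RandomAccessFile;

public class ResumableHdfsCopy {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path target = new Path("/data/huge.csv");                    // hypothetical HDFS path

        // How much of the file made it into HDFS before the crash?
        long alreadyWritten = fs.exists(target) ? fs.getFileStatus(target).getLen() : 0L;

        try (RandomAccessFile local = new RandomAccessFile("/local/huge.csv", "r");   // hypothetical local path
             FSDataOutputStream out = alreadyWritten > 0 ? fs.append(target) : fs.create(target)) {
            local.seek(alreadyWritten);                               // skip bytes already in HDFS
            byte[] buffer = new byte[64 * 1024];
            int read;
            while ((read = local.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
    }
}
```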
