Kafka Streams KeyValueStore retention.bytes - java

I am seeing some odd behaviour with a KeyValueStore, and I have an assumption that might explain it; maybe you can tell me whether I am right or wrong.
I configured a state store like the following
Map<String, String> storeConfig = new HashMap<>();
storeConfig.put(TopicConfig.RETENTION_MS_CONFIG, String.valueOf(TimeUnit.DAYS.toMillis(30)));
storeConfig.put(TopicConfig.CLEANUP_POLICY_CONFIG, "compact,delete");

StoreBuilder<KeyValueStore<String, String>> store1 = Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("STORE1"),
        Serdes.String(),
        Serdes.String());

streamsBuilder.addStateStore(store1.withLoggingEnabled(storeConfig));
With this configuration I expected data older than 30 days to disappear, but I am observing something completely different.
When I look into the RocksDB directory of the store, the file is rolled roughly every 14451 bytes, and I see a structure like this in the directory:
14451 1. Oct 19:00 LOG
14181 30. Sep 15:59 LOG.old.1569854012833395
14451 30. Sep 17:40 LOG.old.1569918431235734
14451 1. Oct 11:05 LOG.old.1569949239434224
It seems that instead of applying the configured 30-day retention, it is also rolling based on file size.
I found on the internet that there is also the parameter TopicConfig.RETENTION_BYTES_CONFIG ('retention.bytes'). Do I also have to configure this parameter so that my data stays visible for the whole retention period and is not deleted because of the file size? (I know there is a value for my key, but I can't access it after this phenomenon occurs.)
Thanks for any answers.

Internally, KeyValueStores use RocksDB, and RocksDB internally uses a so-called LSM tree (Log-Structured Merge-Tree); it creates many smaller segment files that are later combined into larger segments. After this "compaction" step, the smaller segment files can be deleted, because the data has been copied into the larger segment files. Hence, there is nothing to worry about.
Furthermore, TopicConfig.RETENTION_MS_CONFIG is a topic configuration and is not related to the local store of a Kafka Streams application. Also, a KeyValueStore retains data forever, until it is explicitly deleted via a "tombstone" message. Hence, if you set a retention time for the underlying changelog topic, the data might be deleted in the topic, but not in the local store.
If you want to apply a retention time to a local store, you cannot use a KeyValueStore, but you could use a WindowStore, which supports a retention time.
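For illustration, a minimal sketch of such a store (not from the original answer), assuming a recent Kafka Streams version where Stores.persistentWindowStore(...) takes Duration arguments; the store name and the 30-day figures are placeholders:

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;
import org.apache.kafka.streams.state.WindowStore;

public class WindowStoreExample {
    public static StoreBuilder<WindowStore<String, String>> storeWithRetention() {
        return Stores.windowStoreBuilder(
                Stores.persistentWindowStore(
                        "STORE1_WINDOWED",     // hypothetical store name
                        Duration.ofDays(30),   // retention: segments older than this are dropped
                        Duration.ofDays(30),   // window size
                        false),                // do not retain duplicates
                Serdes.String(),
                Serdes.String());
    }
}

With a window store, the retention is enforced by the store itself (old segments are dropped), rather than by the changelog topic configuration.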

Related

Writing to GCS with Dataflow using element count

This is in reference to Apache Beam SDK Version 2.2.0.
I'm attempting to use AfterPane.elementCountAtLeast(...) but not having any success so far. What I want looks a lot like Writing to Google Cloud Storage from PubSub using Cloud Dataflow using DoFn, but needs to be adapted to 2.2.0. Ultimately I just need a simple OR where a file is written after X elements OR Y time has passed. I intend to set the time very high so that the write happens on the number of elements in the majority of cases, and only writes based on duration during times of very low message volume.
Using GCP Dataflow 2.0 PubSub to GCS as a reference here's what I've tried:
String bucketPath =
    String.format("gs://%s/%s",
        options.getBucketName(),
        options.getDestinationDirName());

PCollection<String> windowedValues = stringMessages
    .apply("Create windows",
        Window.<String>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(250)))
            .discardingFiredPanes());

windowedValues
    .apply("Write to GCS",
        TextIO
            .write()
            .to(bucketPath)
            .withNumShards(options.getNumShards())
            .withWindowedWrites());
Where stringMessages is a PCollection that is reading from an Avro-encoded pubsub subscription. There is some unpacking happening upstream to get the events converted to strings, but no merging/partitioning/grouping, just transforms.
Element count is hard coded at 250 just for PoC. Once it is proven, it will likely be cranked up to the 10s or 100s of thousands range.
The Problem
This implementation has resulted in text files of various lengths. The file lengths start very high (thousands of elements) when the job first starts up (presumably processing backlogged data), and then stabilize at some point. I've tried altering 'numShards' to 1 and 10. At 1, the element count of the written files stabilizes at 600, and with 10, it stabilizes at 300.
What am I missing here?
As a side note, this is only step 1. Once I figure out writing using
element count, I still need to figure out writing these files as
compressed json (.json.gz) as opposed to plain-text files.
Posting what I learned for reference by others.
What was not clear to me when I wrote this is the following from the Apache Beam Documentation:
Transforms that aggregate multiple elements, such as GroupByKey and
Combine, work implicitly on a per-window basis
With this knowledge, I rethought my pipeline a bit. From the FileIO documentation under Writing files -> How many shards are generated per pane:
Note that setting a fixed number of shards can hurt performance: it adds an additional GroupByKey to the pipeline. However, it is required to set it when writing an unbounded PCollection due to BEAM-1438 and similar behavior in other runners.
So I decided to use FileIO's writeDynamic to perform the writes and specify withNumShards in order to get the implicit GroupByKey. The final result looks like this:
PCollection<String> windowedValues = validMessageStream.apply(Window
    .<String>configure()
    .triggering(Repeatedly.forever(AfterFirst.of(
        AfterPane.elementCountAtLeast(2000),
        AfterProcessingTime.pastFirstElementInPane().plusDelayOf(
            Duration.standardSeconds(windowDurationSeconds)))))
    .discardingFiredPanes());

windowedValues.apply(FileIO.<String, String>writeDynamic()
    .by(Event::getKey)
    .via(TextIO.sink())
    .to("gs://data_pipeline_events_test/events/")
    .withDestinationCoder(StringUtf8Coder.of())
    .withNumShards(1)
    .withNaming(key -> FileIO.Write.defaultNaming(key, ".json")));
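For the side note in the question about compressed output: newer Beam releases expose FileIO.Write#withCompression(...), so the same writeDynamic() chain should be able to emit gzipped files. A rough sketch (the destination function here is just a placeholder, not the Event::getKey used above):

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

public class CompressedWrite {
    // windowedValues is the windowed PCollection<String> built above
    static void writeCompressed(PCollection<String> windowedValues, String outputPath) {
        windowedValues.apply(FileIO.<String, String>writeDynamic()
                .by(line -> "events")                     // placeholder destination function
                .via(TextIO.sink())
                .to(outputPath)
                .withDestinationCoder(StringUtf8Coder.of())
                .withNumShards(1)
                .withCompression(Compression.GZIP)        // gzip each written file
                .withNaming(key -> FileIO.Write.defaultNaming(key, ".json.gz")));
    }
}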

Kafka - problems with TimestampExtractor

I use org.apache.kafka:kafka-streams:0.10.0.1
I'm attempting to work with a time-series based stream that doesn't seem to trigger punctuate() in my KStream.process() call. (see here for reference)
In a KafkaStreams config I'm passing in this param (among others):
config.put(
    StreamsConfig.TIMESTAMP_EXTRACTOR_CLASS_CONFIG,
    EventTimeExtractor.class.getName());
Here, EventTimeExtractor is a custom timestamp extractor (that implements org.apache.kafka.streams.processor.TimestampExtractor) to extract the timestamp information from JSON data.
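For reference, such an extractor looks roughly like this in the 0.10.0 API (newer versions add a second previousTimestamp parameter to extract()); the JSON-parsing helper is a hypothetical placeholder, not the actual implementation from this question:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class EventTimeExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record) {
        Object value = record.value();
        long ts = parseTimestampFromJson(value);   // pull the event time out of the JSON payload
        return ts >= 0 ? ts : record.timestamp();  // fall back to the record's own timestamp
    }

    private long parseTimestampFromJson(Object value) {
        // placeholder: deserialize the JSON and return its timestamp field, or -1 if absent
        return -1L;
    }
}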
I would expect this to call my object (derived from TimestampExtractor) as each new record is pulled in. The stream in question is 2 * 10^6 records / minute. I have punctuate() set to 60 seconds and it never fires. I know the data passes this span very frequently, since it's pulling old values to catch up.
In fact it never gets called at all.
Is this the wrong approach to setting timestamps on KStream records?
Is this the wrong way to declare this configuration?
Update Nov 2017: Kafka Streams in Kafka 1.0 now supports punctuate() with both stream-time and with processing-time (wall clock time) behavior. So you can pick whichever behavior you prefer.
Your setup seems correct to me.
What you need to be aware of: as of Kafka 0.10.0, the punctuate() method operates on stream-time (by default, i.e. with the default timestamp extractor, stream-time means event-time). Stream-time is only advanced when new data records come in, and how much it is advanced is determined by the timestamps associated with those new records.
For example:
Let's assume you have set punctuate() to be called every 1 minute = 60 * 1000 (note: 1 minute of stream-time). Now, if it happens that no data is being received for the next 5 minutes, punctuate() will not be called at all -- even though you might expect it to be called 5 times. Why? Again, because punctuate() depends on stream-time, and the stream-time is only advanced based on newly received data records.
Might this be causing the behavior you are seeing?
Looking ahead: There's already an ongoing discussion in the Kafka project on how to make punctuate() more flexible, e.g. to trigger it not only based on stream-time (which defaults to event-time) but also based on processing-time.
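To illustrate the Kafka 1.0+ behaviour mentioned in the update above, a minimal sketch of a processor that schedules a wall-clock punctuator (the 60-second interval mirrors the question; swap in PunctuationType.STREAM_TIME for the old stream-time behaviour):

import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;

public class PeriodicProcessor extends AbstractProcessor<String, String> {
    @Override
    public void init(ProcessorContext context) {
        super.init(context);
        // Fires every 60s of wall-clock time, even if no records arrive.
        context.schedule(60_000L, PunctuationType.WALL_CLOCK_TIME,
                timestamp -> {
                    // periodic work goes here
                });
    }

    @Override
    public void process(String key, String value) {
        // per-record processing
    }
}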
Your approach seems to be correct. Compare the paragraph "Timestamp Extractor (timestamp.extractor)" in http://docs.confluent.io/3.0.1/streams/developer-guide.html#optional-configuration-parameters
Not sure why your custom timestamp extractor is not used. Have a look into org.apache.kafka.streams.processor.internals.StreamTask. In its constructor there should be something like
TimestampExtractor timestampExtractor1 = (TimestampExtractor)config.getConfiguredInstance("timestamp.extractor", TimestampExtractor.class);
Check if your custom extractor is picked up there or not...
I think this is another case of issues at the broker level. I went and rebuilt the cluster using instances with more CPU and RAM. Now I'm getting the results I expected.
Note to distant observer(s): if your KStream app is behaving strangely take a look at your brokers and make sure they aren't stuck in GC and have plenty of 'headroom' for file handles, RAM, etc.
See also

Java FileHandler Adding Unnecessary Digit to File Name

I'm working with JUL for my logging (no, I can't change that). I've developed a simple wrapper: I pass in the parameters and it creates the FileHandler with the correct format every time, so that I don't have to recreate the logging setup in every project.
My test app functions exactly as intended, but when I import the library into other projects I seem to be getting one (only one so far) unique error: every single time, a ".0" is added to the end of the log file name.
It does this even when there are no conflicts and the FileHandler has been configured to append to the end of an existing file (which it does fine). I've played with various file names; most recently I've been using the simple "mylog.log", and the log file still gets written as "mylog.log.0". I've checked, and the FileHandler is being passed the correct file ("mylog.log"), but it isn't logging there.
This does not happen in my logging test, only in the project I actually want to use it in. Even when using the exact same parameters, I get different file names.
Is there some quirk of JUL that I'm missing? The code is very simple. Relevant code:
String logFilePath = directory + name; // directory and name are method arguments
Handler newFileHandler;

File dirFile = new File(directory);
if (!dirFile.exists())
{
    dirFile.mkdirs();
}

newFileHandler = new FileHandler(logFilePath, true);
newFileHandler.setFormatter(myformatter);
//... etc
It does this even when there are no conflicts ....
Is that proven with evidence or an assumption? According to the FileHandler documentation:
If no "%g" field has been specified and the file count is greater than one, then the generation number will be added to the end of the generated filename, after a dot.
If there is a conflict and no "%u" field has been specified, it will be added at the end of the filename after a dot. (This will be after any automatically added generation number.)
Note that the use of unique ids to avoid conflicts is only guaranteed to work reliably when using a local disk file system.
A conflict can include opening more than one FileHandler to the same location. You need to verify each of these points. What helps is adding code to grab the RuntimeMXBean and then adding a single log statement to record the calling ClassLoader, the current thread, the runtime name, and the start time. The runtime name usually maps to a process id and a host name. Run the program and verify the contents of the file.
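A minimal sketch of that diagnostic log statement (class and method names are made up for illustration):

import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;
import java.util.logging.Logger;

public class HandlerDiagnostics {
    private static final Logger LOG = Logger.getLogger(HandlerDiagnostics.class.getName());

    public static void logEnvironment() {
        RuntimeMXBean rt = ManagementFactory.getRuntimeMXBean();
        LOG.info("runtime=" + rt.getName()                      // usually pid@hostname
                + " startTime=" + rt.getStartTime()
                + " thread=" + Thread.currentThread().getName()
                + " classLoader=" + HandlerDiagnostics.class.getClassLoader());
    }
}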
The code you have included helps, but you need to include details on how the application is launched and what is included in logging.properties.
I eventually figured this out and forgot to post the cause.
Two things were at work:
Due to the environment I was in, the "rolling" logging was being activated by some background variables I wasn't aware of, hence why the ".0" was being added when it shouldn't have been; I only saw it once I moved the code out of testing and into the actual implementing project.
JUL is ridiculously inflexible in how it works; really, I can't speak poorly enough about it. Anyway, long story short: if rolling logging is enabled, it will always append a file number such that the active log ends in ".0". Typically I've used APIs that only put the number on the secondary logs, while the current log keeps the exact name you gave it. This caused me some trouble, as JUL also has no "get current log file" method to get the name of the active file, so I needed to create an ugly method that predicted what the name should be based on the parameters and hope nothing went wrong. As an aside, you cannot change the format of the generation numbers (which also caused me some issue, as it was preferred to have the files numbered 01, 02, ... 10, 11 rather than 0, 1, 2, ... 10).
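To make the behaviour concrete, here is a small sketch (not the original wrapper) of a rotating FileHandler with an explicit %g placeholder; with this pattern the active file is always generation 0, e.g. mylog.0.log:

import java.io.IOException;
import java.util.logging.FileHandler;
import java.util.logging.SimpleFormatter;

public class RollingHandlerExample {
    public static FileHandler create(String directory) throws IOException {
        FileHandler handler = new FileHandler(
                directory + "/mylog.%g.log",  // %g = generation number; generation 0 is the active file
                5 * 1024 * 1024,              // rotate after ~5 MB
                10,                           // keep 10 generations
                true);                        // append to an existing file on startup
        handler.setFormatter(new SimpleFormatter());
        return handler;
    }
}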

MarkLogic: Move document from one directory to another on some condition

I'm new to MarkLogic and trying to implement the following scenario with its Java API:
For each user I'll have two directories, something like:
1.1. user1/xmls/recent/
1.2. user1/xmls/archived/
When the user does something with his xml, it is put into the "recent" directory;
When the user does something with his next xml and the "recent" directory is full (e.g. has some number of documents, let's say 20), the oldest document is moved to the "archived" directory;
The user can request all documents from the "recent" directory and should get no more than 20 records;
The user can remove something from the "recent" directory manually; in this case, if it had 20 documents, after deleting one it must have 19;
The user can work with his xmls concurrently, and the "recent" directory should never grow beyond 20 entries.
Questions are:
In order to properly handle concurrent additions of xmls to the "recent" directory, should I lock the whole "recent" directory when adding a new entry (i.e. add it, check whether there are more than 20 records after adding, select the oldest, 21st one and move it to the "archived" directory, and do all these steps atomically)? How can I do that?
Any suggestions on how to implement this via Java API?
Is it possible to change document's URI (e.g. replace "recent" with "archived" in my case)?
Should I consider using MarkLogic's collections here?
I'm open to any suggestions and comments (as I said I'm new to MarkLogic and maybe my thoughts on how to handle described scenario are completely wrong).
You can achieve atomicity of a sequence of operations using Multi-Statement Transactions (MST).
It is possible to use MST from the Java API: http://docs.marklogic.com/guide/java/transactions#id_79848
It's not possible to change a URI. However, it is possible to use an MST to delete the old document and reinsert a new one under the new URI in one atomic step. This would have the same effect.
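A rough sketch of that delete-and-reinsert in one transaction with the Java API (URIs and error handling are illustrative only):

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.Transaction;
import com.marklogic.client.document.XMLDocumentManager;
import com.marklogic.client.io.StringHandle;

public class MoveDocument {
    public static void move(DatabaseClient client, String oldUri, String newUri) {
        XMLDocumentManager docMgr = client.newXMLDocumentManager();
        Transaction txn = client.openTransaction();        // multi-statement transaction
        try {
            StringHandle content = docMgr.read(oldUri, new StringHandle(), txn);
            docMgr.write(newUri, content, txn);            // reinsert under the new URI
            docMgr.delete(oldUri, txn);                    // remove the old URI
            txn.commit();                                  // both steps become visible atomically
        } catch (RuntimeException e) {
            txn.rollback();
            throw e;
        }
    }
}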
Possibly, and judging from your use case, unless you must have the recent/archived information as part of the URI, it may be simpler to store this information in collections. However, you should read the documentation and evaluate for yourself: http://docs.marklogic.com/guide/search-dev/collections#chapter
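If you go the collections route, a minimal sketch of tagging a document as "recent" at write time (collection name and URI are placeholders):

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.document.XMLDocumentManager;
import com.marklogic.client.io.DocumentMetadataHandle;
import com.marklogic.client.io.StringHandle;

public class CollectionTagging {
    public static void writeRecent(DatabaseClient client, String uri, String xml) {
        XMLDocumentManager docMgr = client.newXMLDocumentManager();
        DocumentMetadataHandle metadata = new DocumentMetadataHandle();
        metadata.getCollections().add("recent");           // collection membership instead of a directory
        docMgr.write(uri, metadata, new StringHandle(xml));
    }
}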
Personally, I would skip all the hassle with separate directories as well as collections. You would endlessly have to move files around or change their properties. It would be much easier not to calculate anything up front, and simply use a lastModified property, or something similar, to determine the most recent items at run-time.
HTH!

read/write to a large size file in java

I have a binary file with the following format:
[N bytes identifier & record length] [n1 bytes data]
[N bytes identifier & record length] [n2 bytes data]
[N bytes identifier & record length] [n3 bytes data]
As you can see, I have records with different lengths. Each record has N fixed bytes which contain an id and the length of the data in the record.
This file is very big and can contain 3 million records.
I want to open this file in an application and let the user browse and edit the records
(insert / update / delete records).
My initial plan is to create an index file from the original file and, for each record, keep the next and previous record addresses to navigate forward and backward easily (some sort of linked list, but in a file, not in memory).
Is there a library (Java library) to help me implement this requirement?
Any recommendation or experience that you think would be useful?
----------------- EDIT ----------------------------------------------
Thanks for the guidance and suggestions.
Some more info:
The original file and its format are out of my control (it's a third-party file) and I can't change the file format, but I have to read it, let the user navigate over the records and edit some of them (insert a new record / update an existing record / delete a record), and at the end save it back in the original file format.
Do you still recommend a database instead of a normal index file?
----------------- SECOND EDIT ----------------------------------------------
The record size in update mode is fixed: an updated (edited) record has the same length as the original record, unless the user deletes the record and creates another record with a different format.
Many thanks.
Seriously, you should NOT be using a binary file for this. You should use a database.
The problems with trying to implement this as a regular file stem from the fact that operating systems do not allow you to insert extra bytes into the middle of an existing file. So if you need to insert a record (anywhere but the end), update a record (with a different size) or remove a record, you would need to:
rewrite other records (after the insertion/update/deletion point) to make or reclaim space, or
implement some kind of free space management within the file.
All of this is complicated and / or expensive.
Fortunately, there is a class of software that implements this kind of thing. It is called database software. There are a wide range of options, ranging from using a full-scale RDBMS to light-weight solutions like BerkeleyDB files.
In response to your 1st and 2nd edits, a database will still be simpler.
However, here's an alternative that might perform better for this use-case than using a DB... without doing complicated free-space management.
1. Read the file and build an in-memory index that maps ids to file locations.
2. Create a second file to hold new and updated records.
3. Perform the record adds/updates/deletes:
   - An addition is handled by writing the new record to the end of the second file and adding an index entry for it.
   - An update is handled by writing the updated record to the end of the second file and changing the existing index entry to point to it.
   - A delete is handled by deleting the index entry for the record's key.
4. Compact the file as follows:
   4.1. Create a new file.
   4.2. Read each record in the old file in order and check the index for the record's key. If the entry still points to the location of the record, copy the record to the new file. Otherwise skip it.
5. Repeat step 4.2 for the second file.
6. If all of the above completed successfully, delete the old file and the second file.
Note this relies on being able to keep the index in memory. If that is not feasible, then the implementation is going to be more complicated ... and more like a database.
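A very rough sketch of the index-plus-second-file idea (the 4-byte id / 4-byte length header is an assumption for illustration; a full implementation would also have to remember which file each offset points into):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

public class RecordIndex {
    private final Map<Integer, Long> idToOffset = new HashMap<>();

    // Step 1: scan the original file once, remembering where each record starts.
    public void buildIndex(RandomAccessFile file) throws IOException {
        file.seek(0);
        while (file.getFilePointer() < file.length()) {
            long offset = file.getFilePointer();
            int id = file.readInt();                         // fixed header: id ...
            int dataLength = file.readInt();                 // ... and data length
            idToOffset.put(id, offset);
            file.seek(file.getFilePointer() + dataLength);   // skip the payload
        }
    }

    // Step 3: an update appends the new version to the second file and repoints the index.
    public void update(RandomAccessFile secondFile, int id, byte[] newData) throws IOException {
        long offset = secondFile.length();
        secondFile.seek(offset);
        secondFile.writeInt(id);
        secondFile.writeInt(newData.length);
        secondFile.write(newData);
        idToOffset.put(id, offset);
    }

    // Step 3: a delete just forgets the key; compaction (step 4) drops the record.
    public void delete(int id) {
        idToOffset.remove(id);
    }
}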
Having a data file and an index file would be the general base idea for such an implementation, but you'd pretty much find yourself dealing with data fragmentation upon repeated data updates/deletion, etc. This kind of project, in itself, should be a separate project and should not be part of your main application. However, essentially, a database is what you need as it is specifically designed for such operations and use cases and will also allow you to search, sort, and extend (alter) your data structure without having to refactor an in-house (custom) solution.
May I suggest that you download Apache Derby and create a local embedded database (Derby does this for you when you create a new embedded connection at run-time). It will not only be faster than anything you'll write yourself, but will make your application easier to maintain.
Apache Derby is a single jar file that you can simply include and distribute with your project (check the license if any legal issue may apply in your app). There is no need for a database server or third party software; it's all pure Java.
The bottom line is that it all depends on how large your application is, whether you need to share the data across many clients, whether speed is a critical aspect of your app, etc.
For a stand-alone, single user project, I recommend Apache Derby. For a n-tier application, you might want to look into MySQL, PostgreSQL or (hrm) even Oracle. Using already made and tested solutions is not only smart, but will cut down your development time (and maintenance efforts).
Cheers.
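For what it's worth, a minimal sketch of getting such an embedded Derby database going (the database name and table layout are just examples):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class EmbeddedDerbyExample {
    public static void main(String[] args) throws SQLException {
        // ";create=true" creates the database directory on the first connection
        try (Connection conn = DriverManager.getConnection("jdbc:derby:recordsdb;create=true");
             Statement stmt = conn.createStatement()) {
            // first run only: later runs would skip or catch "table already exists"
            stmt.executeUpdate("CREATE TABLE records (id INT PRIMARY KEY, data BLOB)");
        }
    }
}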
Generally you are better off letting a library or database do the work for you.
You may not want to have an SQL database and there are plenty of simple databases which don't use SQL. http://nosql-database.org/ lists 122 of them.
At a minimum, if you are going to write this I suggest you read the source for one of these databases to see how they work.
Depending on the size of the records, 3 million isn't that much and I would suggest you keep as much in memory as possible.
The problem you are likely to have is ensuring the data is consistent and can be recovered when corruption occurs. The second problem is dealing with fragmentation efficiently (something the brightest minds working on garbage collectors deal with). The third problem is likely to be maintaining the index in a transactional fashion with the source data, to ensure there are no inconsistencies.
While this may appear simple at first, there are significant complexities in making sure the data is reliable, maintainable and can be accessed efficiently. This is why most developers use an existing database/datastore library and concentrate on the features which are unique to their application.
(Note: My answer is about the problem in general, not considering any Java libraries or - like the other answers also proposed - using a database (library), which might be better than reinventing the wheel)
The idea to create an index is good and will be very helpful performance-wise (although you wrote "index file", I think it should be kept in memory). Generating the index should be quite fast if you read the ID and record length for each entry and then just skip the data with a file seek.
You should also think about the edit functionality. Especially inserting and deleting can be very slow on such a big file if you do it wrong (e.g. deleting and then moving all the following entries to close the gap).
The best option would be to only mark deleted entries as deleted. When inserting, you can overwrite one of those or append to the end of the file.
Insert / Update / Delete records
Inserting (rather than merely appending) and deleting records in a file is expensive, because you have to move all the following content of the file to create space for the new record or to reclaim the space it used. Updating is similarly expensive if the update changes the length of the record (you say they are variable length).
The file format you propose is fundamentally unsuitable for the kinds of operations you want to perform. Others have suggested using a data-base. If you don't want to go that far, adding an index file (as you suggest) is the way to go. I recommend making the index records all the same length.
As others have stated, a database seems a better solution. The following Java SQL databases could be used: H2, Derby or HSQLDB.
If you want to use an index file, look at Berkeley DB or a NoSQL store.
If there is some reason for using a file, look at JRecord. It has:
Several classes for reading/writing files with variable-length binary records (they were written for Cobol VB files). Any of the Mainframe / Fujitsu / Open Cobol VB file structures should do the job.
An editor for editing JRecord files. The latest version of the editor can handle large files (it uses compression / a spill file). The editor suffers from having to download the whole file, and only one user can edit the file at a time.
The JRecord solution will only work if:
there is a limited number (preferably one) of users, all located in the one location;
there is a fast infrastructure.
