  • Avoid using timestamp fields in StreamBase Query Table primary keys when storing individual events


    As a general guideline, don't use StreamBase timestamp fields in the primary key of any query table that is going to be used to store each tuple in a stream.

    It's a really bad idea.

    StreamBase Query Tables are relational table constructs that require each table to have a primary key.

    Primary key values, by definition, are unique for each row of the table.

    It's a fairly common use case to write each incoming tuple from a stream into a Query Table so that the contents of the stream/table may be queried from other places in the application. Many stream schemas do not have a natural primary key, so in order to store every tuple in the stream, the application has to synthesize a primary key field with unique values.

    Generating a truly unique set of primary key values is not always trivial, so it's tempting to look for shortcuts. It's pretty common for application developers who are new to StreamBase to notice that a stream has a timestamp field representing some time of interest to the events in the stream, and to use those values as primary key values for a related Query Table.

    This design decision is almost always a really bad idea. People who make this mistake -- and almost everybody tries it at one time or another -- tend to make it once, but not more than once.

    What's wrong with using timestamps as primary keys? Simple: timestamp values are not unique. The StreamBase timestamp data type has millisecond resolution -- nothing more granular. Say, for example, that there is a timestamp field named eventTime that represents the moment at which the event represented by the tuple occurred, according to somebody's clock. If a StreamBase application uses eventTime as the primary key field for a table to which we write every incoming event, and more than one event occurs during a particular millisecond, then there will be only one row in the table representing all the events that occurred during that millisecond. Usually, that means the events that arrived earlier in that millisecond have been overwritten by the last event to arrive with that eventTime.
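    This overwrite behavior is easy to see outside StreamBase. A minimal Python sketch (not StreamBase code; the field names are illustrative): a dict keyed on a millisecond timestamp behaves like a table with eventTime as its primary key.

```python
# Simulate a Query Table keyed on a millisecond-resolution eventTime.
# Three events occur within the same millisecond; a dict keyed on the
# timestamp behaves like a table whose primary key is eventTime.
table = {}

event_time_ms = 1_700_000_000_123  # all three events share this millisecond
events = [
    {"eventTime": event_time_ms, "symbol": "IBM", "price": 101.0},
    {"eventTime": event_time_ms, "symbol": "IBM", "price": 101.5},
    {"eventTime": event_time_ms, "symbol": "IBM", "price": 102.0},
]

for ev in events:
    table[ev["eventTime"]] = ev  # same key each time: each write replaces the last

print(len(table))                     # 1 -- two of the three events are gone
print(table[event_time_ms]["price"])  # 102.0 -- only the last arrival survives
```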

    Milliseconds might seem granular enough to guarantee uniqueness, but they're not. There are only (duh) one thousand unique millisecond values per second, and it's not at all rare for StreamBase applications to process tens of thousands, hundreds of thousands, or even millions of events per second. Imagine a stream that always sends us 10,000 tuples per second with monotonically increasing and uniformly distributed timestamp values (a level of perfection that is rare indeed, but bear with us here for the example). In this example, for every 10 tuples processed, there would be only 1 row in the Query Table. We would be throwing away 90% of the received data. Probably not what we wanted to do.
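    The arithmetic of that 10,000-per-second example can be checked with a quick sketch (again plain Python, assuming the idealized uniform timestamps described above):

```python
# 10,000 tuples per second, timestamps uniformly distributed over the
# 1,000 millisecond values in one second: each millisecond is reused 10 times.
tuples_per_second = 10_000
table = {}
for i in range(tuples_per_second):
    event_time_ms = (i * 1000) // tuples_per_second  # 0..999
    table[event_time_ms] = i  # last writer wins for each millisecond

rows_kept = len(table)                    # 1000 rows survive
lost = 1 - rows_kept / tuples_per_second  # fraction of tuples overwritten
print(rows_kept, f"{lost:.0%}")           # 1000 90% -- nine of every ten tuples lost
```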

    Even when data rates are much lower -- perhaps we have a peak rate of 100 events per second -- we must not rely on timestamp values being unique. Real data often carries identical timestamps on numerous events. We don't know when and how an upstream source of tuples generates time values, and batching and queuing effects, especially in congested situations, can make identical timestamps commonplace.

    So, as a general practice, don't use timestamp fields as part of a primary key in a query table when you are using the table to store individual tuples from a stream.

    When might it be OK to use a timestamp field in a primary key? When the timestamp itself has meaning in the key. For example, perhaps I want a table with one row for each item in my inventory and each minute in the day -- say, to keep counts of events per minute, average prices per minute, or total volume of items for each minute. Then it makes sense to have a compound primary key such as (itemId, minute), where the value of the minute field is a timestamp with a millisecond value of 0 to indicate the start of each minute.
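    A small sketch of that compound-key pattern, in plain Python with made-up item names (the truncate-to-minute step is the part that makes the timestamp a legitimate key component):

```python
def minute_bucket(event_time_ms: int) -> int:
    """Truncate a millisecond timestamp to the start of its minute."""
    return event_time_ms - (event_time_ms % 60_000)

# (itemId, minute) -> event count, mimicking a table with a compound key.
counts = {}

events = [
    ("widget", 1_700_000_000_500),
    ("widget", 1_700_000_030_000),  # lands in the same minute as the first event
    ("widget", 1_700_000_061_000),  # lands in the next minute
]

for item_id, ts in events:
    key = (item_id, minute_bucket(ts))
    counts[key] = counts.get(key, 0) + 1  # duplicates update the row, by design

print(len(counts))  # 2 -- one row per (item, minute)
```

    Here a timestamp collision is not a bug: two events in the same minute are supposed to land on the same row and update its aggregate.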

    What about creating unique primary key values when the incoming stream to be saved in a table doesn't have a natural primary key? That's a longer discussion for another page, perhaps, but the short answer is that the application should generate or synthesize a unique value and make that the primary key for the table. Common approaches, just to get you thinking, include using a Sequence operator to produce a sequence ID (but then the application must handle engine process restarts -- the sequence ID is kept in the heap, so it is zero again after an engine restart -- as well as sequence ID value overflow), or using a string primary key field that receives the value of the uuid() function for each tuple (though this approach can take a lot of space and may be too slow at very high processing rates).
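    The trade-offs between those two approaches can be sketched in plain Python (this is not StreamBase's Sequence operator or uuid() function, just the same ideas in standard-library form):

```python
import itertools
import uuid

# Approach 1: a sequence counter. Compact integer keys, fast to index,
# but the counter restarts at its initial value when the process restarts
# unless the last value is persisted somewhere durable, and a fixed-width
# integer will eventually overflow.
sequence = itertools.count(1)

def next_seq_id() -> int:
    return next(sequence)

# Approach 2: a UUID string key. Unique across restarts with no
# coordination, but each key is a 36-character string -- larger and
# slower to index than an integer at very high rates.
def next_uuid_key() -> str:
    return str(uuid.uuid4())

ids = [next_seq_id() for _ in range(3)]
print(ids)  # [1, 2, 3]

keys = {next_uuid_key() for _ in range(1000)}
print(len(keys))  # 1000 -- collisions are vanishingly unlikely
```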

