Feature Platforms: The Underrated Infrastructure Layer Behind Fast ML Teams

A feature platform is the infrastructure layer that manages how raw attributes from your domain; such as:

  • “how many orders has this user placed”
  • “what was the timestamp of the user’s last completed order”

are transformed into features that power your ML models, like:

  • “orders_last_7d”
  • “minutes_since_last_completed_order”

Think of it as a single-point solution to compute, store, and serve the derived data your models depend on - both for training and for production inference.

Feature Platforms

The examples in this post are from a hypothetical e-commerce and logistics applications. But the concepts are applicable to any domain that needs to compute and serve features for ML models.

What a Traditional Data Pipeline Looks Like (Without a Feature Platform)

Without a feature platform, teams typically stitch together a pipeline from separate components: a compute engine (Spark or dbt) to transform raw data, an orchestrator (Airflow, Prefect) to schedule runs, a key-value store (Redis, DynamoDB) to serve features at prediction time, and — if real-time signals are needed — a Kafka stream feeding a Flink or custom consumer into that same store.

Say you want a feature: “number of orders a user placed in the last 7 days.” In the traditional setup, a data scientist writes that aggregation in a Spark job — this is what generates training data. A data engineer wraps it in Airflow to keep the output refreshed in the online store, so models don’t have to recompute it on every prediction request. A backend engineer might write a third implementation in the serving layer to handle request-time edge cases: a brand new user with no history, or a slightly different time window depending on request context. Three separate implementations of the same logic, maintained by three different people.

Now the product team changes the definition to a rolling 14-day window. Who updates all three? How do you know the Spark output and the serving implementation are still computing the same thing? How do you catch the case where the Airflow job silently failed and the model is reading a value that is three days stale? How do you write a test for the feature logic that runs against real historical data without spinning up the entire pipeline?

In practice, most teams don’t have clean processes, and the indication that something is wrong is usually a degraded model metric weeks later.

The Problem

On top of this ad-hoc change management, traditional approaches to computing and storing features leave data scientists and ML engineers dealing with a familiar set of challenges:

  • Training-serving skew: Data scientists implement feature logic in Python/Pandas; engineers re-implement it in the language of the business-logic serving stack (e.g. Java, Go, or SQL). They drift apart and it is hard to catch until model performance degrades in production.
  • Slow iteration cycles: Getting a new feature from idea to production takes weeks because every feature requires a custom pipeline, validation, deployment, and sign-off.
  • No shared vocabulary: Features computed by one team get re-computed by possibly many others because there’s no discoverability layer.
  • Freshness gaps: Models rely on older than necessary data because there’s no streaming infrastructure, even for signals that change by the minute.

A feature platform addresses all of these as we will see in the subsequent sections.

Batch vs. Streaming

To establish why feature platforms are necessary, we need to understand the distinction between batch and streaming features. Different features change at different rates, and that rate determines how you should compute them. A user’s lifetime order count changes slowly — computing it once a day is fine. But whether that same user abandoned a cart 90 seconds ago so they can be nudged to complete the purchase is a signal that goes stale almost immediately.

Think of feature freshness in three tiers:

Batch features are pre-computed on a schedule — hourly, daily, or weekly jobs that write results to an online store. A driver’s completed order count over the last 30 days is a batch feature. Latency is minutes to hours. They’re cheap, reliable, and easy to backfill.

Near-real-time (NRT) features are computed by a stream processor — typically a Kafka consumer or similar — and reflect the last few minutes of activity. The number of orders a driver accepted in the last hour is NRT. Latency is seconds to a minute. These require stream infrastructure but are the sweet spot for most use cases that need freshness without extreme complexity.

Real-time features are computed at prediction time, inline with the request. They reflect what’s happening right now — the exact state of a request in-flight. Latency is sub-second. These are powerful but expensive, hard to backfill, and complex to test. And a lot of the times, it is recommended to pass these features from the request context itself, rather than from a feature platform.

For each feature you’re considering, ask two questions:

  1. How fast does this signal change? Driver location changes every few seconds. Lifetime trip count changes slowly. Features that change faster than your batch cadence are candidates for streaming.

  2. How much does staleness hurt the model? If swapping in a fresh value vs. a 1-hour-old value produces a metric change smaller than the variance you’d see between two identical experiment runs, batch is fine. If the improvement is consistent and repeatable, invest in NRT or real-time.

Why Feature Platforms Enable Faster Experimentation

The real bottleneck is the time between having an idea for a feature and being able to evaluate whether it improves the model.

Without a platform, that path typically looks like: write the feature logic, hand it to a data engineer to wrap in a pipeline, wait for the pipeline to land in the warehouse, write a separate serving implementation, validate they match, and then train. That cycle takes days to weeks per feature. Most ideas never get tested.

A feature platform compresses this by letting the feature author own the full lifecycle. You define the feature logic once — the platform handles materialization for training and serving both. You can test it locally or in a test environment against a small dataset before touching any production infrastructure. When it looks right, you promote it and use the features in your production models.

The result is that experimentation becomes cheap enough to actually do it. You can try a feature, run an offline eval, and discard it in a day rather than a sprint. That speed compounds — teams that can run significantly more experiments tend to ship better models.

How Feature Platforms Work

A feature platform is composed of a few components that together allow you to define feature logic once and have it work correctly across training, refresh, and serving.

Feature authoring DSL: Where you express what a feature computes, which entity it belongs to (user, driver, restaurant), and how fresh it needs to be. The critical capability here is time-windowed aggregations with point-in-time correctness — meaning the platform knows to compute “orders in the last 7 days as of the timestamp of each training example”, not as of today. Naive implementations often get this wrong by leaking future information — for example, joining every training row to a feature table computed as of “now” instead of as of the event timestamp, which silently introduces label leakage and overly optimistic offline metrics.

In a Python-native DSL, a feature like “orders in the last 7 days” for a user entity that is updated every hour might look like this:

@feature(
    name="orders_last_7d",
    entity="user",
    freshness="1h",
)
def orders_last_7d(orders):
    return (
        orders
        .groupby("user_id")
        .window("7d", on="event_timestamp")
        .order_id.count()
    )

Ingestion layer: Connects to your data sources — databases, event streams, data warehouses — and normalizes how raw data flows into the platform. This abstraction is what lets the same feature definition run against historical data for training and against live data for serving, without any changes to the feature logic itself.

Aggregation jobs: The platform reads your feature definitions and generates the appropriate compute jobs — Spark or dbt for batch, a Kafka consumer or Flink job for near-real-time. The freshness declaration in your feature definition drives this: change freshness="1d" to freshness="1m" and the platform promotes the feature to a streaming job automatically.

Materialized views / online store: Pre-computed feature values are written to a low-latency store (typically Redis or DynamoDB) so that serving a feature at prediction time is a key lookup, not a live computation. The platform manages writes, TTLs, and cold-start fallbacks for entities that haven’t been seen yet.

Feature registry: A catalog of every defined feature with its owner, entity, data lineage, and model dependencies. This is what prevents teams from independently recomputing the same features and makes it possible to have a shared vocabulary across teams.

Setting up a feature platform

A feature platform involves building and maintaining:

  • A feature registry with versioning and lineage
  • Batch materialization pipelines and scheduling
  • A streaming ingestion layer
  • An online store with low-latency serving
  • Monitoring for data drift, staleness, and computation failures
  • A developer-facing API or DSL for feature authoring

Each of these is a solvable problem, but together they represent months of engineering work — and ongoing maintenance.

Build vs. Buy

The Case for Managed Platforms

Managed platforms like Chalk.ai, Tecton, and Feast (self-hosted but with managed cloud options) provide most of the above out of the box. The value proposition is:

Speed to first feature: Instead of spending months building infrastructure before a single feature is materialized, you can have features in production in days. For teams with 1–2 ML engineers, this difference is significant.

Operational burden: Managed platforms handle the operational complexity of running streaming infrastructure at scale. You don’t need data engineers to manage the infrastructure.

Built-in correctness guarantees: Point-in-time correct training dataset generation — the thing that’s surprisingly hard to get right when you build it yourself — is handled by the platform.

Some managed platforms have a Python-native DSL and local testing mechanisms that enable ML engineers and data scientists to author, test, and deploy features largely independently of data engineering teams.

When to Build Your Own

Cost at scale: Managed platforms charge based on feature computation volume. At very high throughput (millions of predictions per day), the cost can exceed what it would cost to run equivalent infrastructure yourself on cloud primitives.

Existing infrastructure investment: If your org already runs Spark, Kafka, and Redis at scale, along with deep expertise in the stack, building a thin orchestration layer on top may be faster than learning and integrating a new vendor.

Vendor lock-in concerns: Feature platforms sit in the critical path of every model prediction. Migrating away from a managed vendor later is painful. If your org has strong opinions about control and portability, it might be worth building your own.

A Pragmatic Path to Adoption

Start with one painful feature, not a platform: Find the feature with the worst training-serving skew — the one most likely responsible for model degradation — and set that feature up using the platform you’re evaluating. A vertical slice of end-to-end implementation builds confidence in the platform and exposes any drawbacks in the development workflow early on.

Validate your event infrastructure before investing in streaming: Near-real-time features are only as good as the event streams feeding them. Before integrating with the platform, it is important to ensure there are event streams in place for the data you need, and that those events are being emitted correctly, consistently, and with the right schema. The streaming feature layer is just a matter of configuring the platform to consume the events and compute the features.

Treat monitoring as a first-class requirement: You need to monitor both the feature and the pipeline and processing infrastructure itself. Common things to monitor are:

  • Staleness of the feature
  • Data drift of the feature (mean, variance, distribution etc.)
  • Failure/missing data rate of the feature
  • Latency of feature fetch at inference time
  • Telemetry from the platform’s ingestion and aggregation jobs
  • Processing latencies
  • Online store and offline store health
  • Point in time correctness of the feature

Budget for backfill costs upfront: The first time you materialize a feature over two years of historical data, the compute bill will be larger than expected. Design your compute jobs to support partition pruning and incremental backfills from the start. Make sensible trade-offs based on your product’s actual data needs.

The organizational change: Moving from a workflow where data scientists hand off feature specs to engineers who reimplement them, to one where data scientists own features end-to-end, is a process change as much as a technical one. You might still need backend engineers to help with initial setup and debugging issues.

Error handling: You need to have a plan for how to handle errors and missing data. You might need to fall back on sensible defaults at inference time (usually in the model layer).

Feature platforms, leveraged well, enable data scientists to author and deploy features in days instead of weeks, and work with models that are trained on data that matches what they’ll see in production. A feature registry enables sharing features across teams and models, removing duplicate effort and ensuring consistency.