Writing Specs

Specs are YAML files that declaratively describe what you want to measure and how to slice the data. There are three spec types: MetricSpec, SliceSpec, and SegmentSpec.

MetricSpec

A metric defines a single measurable quantity from a source table.

metric:
  name: ctr
  description: Click-through rate — ratio of clicks to impressions
  source: duckdb://ad_campaigns.duckdb/ad_campaigns
  numerator: "SUM(clicks)"
  denominator: "SUM(impressions)"
  timestamp_col: date

Fields

Field	Required	Description
`name`	Yes	Unique identifier used in `MetricCompute.compute()`
`source`	Yes	Source URI — see Connectors for format
`numerator`	Yes	SQL expression containing an aggregate function call (`SUM`, `AVG`, `COUNT`, `MIN`, `MAX`)
`timestamp_col`	Yes	Column used for `time_window` filtering
`denominator`	No	SQL expression containing an aggregate function call. When present, the metric is computed as `numerator / denominator` (ratio).
`entities`	No	List of entity column names supported for `by_entity` disaggregation (e.g. `[user_id, device_id]`). Must be non-empty if provided.
`format`	No	Value interpretation hint for consumers. See Format field below.
`description`	No	Human-readable description

The aggregation type is inferred from the SQL function in numerator (and denominator). There is no separate aggregation field — write the aggregate directly in the expression.

Aggregation types

ratiosumavgcountmaxmin

Ratio is implied by the presence of a denominator. Both numerator and denominator must contain an aggregate function.

metric:
  name: ctr
  description: Click-through rate — ratio of clicks to impressions
  source: duckdb://ad_campaigns.duckdb/ad_campaigns
  numerator: "SUM(clicks)"
  denominator: "SUM(impressions)"
  timestamp_col: date

metric:
  name: total_revenue
  description: Total revenue generated across all campaigns
  source: duckdb://ad_campaigns.duckdb/ad_campaigns
  numerator: "SUM(revenue)"
  timestamp_col: date

metric:
  name: avg_revenue
  description: Average revenue per campaign row
  source: duckdb://ad_campaigns.duckdb/ad_campaigns
  numerator: "AVG(revenue)"
  timestamp_col: date

metric:
  name: campaign_count
  description: Number of campaign rows
  source: duckdb://ad_campaigns.duckdb/ad_campaigns
  numerator: "COUNT(*)"
  timestamp_col: date

metric:
  name: max_revenue
  description: Peak revenue from a single campaign row
  source: duckdb://ad_campaigns.duckdb/ad_campaigns
  numerator: "MAX(revenue)"
  timestamp_col: date

metric:
  name: min_ad_spend
  description: Lowest ad spend entry
  source: duckdb://ad_campaigns.duckdb/ad_campaigns
  numerator: "MIN(ad_spend)"
  timestamp_col: date

Format field

The optional format field is a metadata hint that tells consumers how to interpret the metric_value. It does not affect computation — the value is stored as-is.

Value	Meaning
`percentage`	A proportion expressed as a decimal (e.g. 0.42 for 42%)
`absolute`	A plain count or sum with no unit
`ratio`	A dimensionless ratio that is not a percentage
`currency`	A monetary value where the currency is unspecified or mixed across rows
`currency:<CODE>`	A monetary value in a specific ISO 4217 currency. `<CODE>` must be 3 uppercase letters (e.g. `currency:USD`, `currency:EUR`).

metric:
  name: ctr
  source: duckdb://ad_campaigns.duckdb/ad_campaigns
  numerator: "SUM(clicks)"
  denominator: "SUM(impressions)"
  timestamp_col: date
  format: percentage

metric:
  name: total_revenue
  source: duckdb://ad_campaigns.duckdb/ad_campaigns
  numerator: "SUM(revenue)"
  timestamp_col: date
  format: "currency:USD"

When format is set, the metric_format column in the output DataFrame carries the value. When omitted, metric_format is None.

Use METRIC_FORMAT_VALUES to enumerate the simple (non-currency-code) allowed values:

from aitaem import METRIC_FORMAT_VALUES
print(METRIC_FORMAT_VALUES)  # frozenset({'percentage', 'absolute', 'ratio', 'currency'})

Entity columns

Use entities to declare which columns in the source table identify entities that the metric can be disaggregated by. At compute time, pass by_entity to MetricCompute.compute() to select which entity column to group by.

metric:
  name: revenue
  source: duckdb://analytics.db/transactions
  numerator: "SUM(amount)"
  timestamp_col: event_ts
  entities: [user_id, device_id]   # supports per-user or per-device disaggregation

A metric without entities can still be computed normally — it simply cannot be disaggregated by entity. See Computing Metrics for usage.

SliceSpec

A slice defines a breakdown dimension — a set of mutually exclusive (or overlapping) filters applied to the metric query.

Leaf slice

A leaf slice defines the filter values for a single column dimension. There are two variants:

Value-list — enumerate explicit filter predicates:

slice:
  name: campaign_type
  description: Breakdown by ad campaign type
  values:
    - name: Search
      where: "campaign_type = 'Search'"
    - name: Display
      where: "campaign_type = 'Display'"
    - name: Video
      where: "campaign_type = 'Video'"
    - name: Shopping
      where: "campaign_type = 'Shopping'"

Wildcard — auto-discover distinct values from a column at query time by setting where at the spec level to a bare column name:

slice:
  name: industry
  where: industry

Dot-qualified column names are also accepted:

slice:
  name: country
  where: public.campaigns.country

Note

where at the spec level must be a bare column name — no spaces, no SQL expressions. Use the value-list form when you need explicit WHERE predicates.

Composite slice (cross-product)

A composite slice computes the cross-product of two or more leaf slices:

slice:
  name: campaign_type_x_geo
  description: Campaign type broken down by geography
  cross_product:
    - campaign_type
    - geo

Note

Composite slices cannot reference other composite slices (no nesting). All referenced slices must be loaded into the same SpecCache.

Fields

Field	Required	Description
`name`	Yes	Unique identifier
`values`	Leaf (value-list)	List of `{name, where}` filter definitions
`where`	Leaf (wildcard)	Bare column name whose distinct values are auto-discovered at query time. Dot-qualified names (e.g. `schema.table.column`) are accepted.
`cross_product`	Composite only	List of leaf slice names to cross
`description`	No	Human-readable description

SegmentSpec

A segment classifies fact rows by joining to a DIM table (dimension table). Unlike slices, which apply filter predicates directly to the fact table, a segment resolves entity attributes from a separate DIM table — the standard star-schema pattern where classification attributes (e.g., subscription tier, region, plan type) live in their own table.

segment:
  name: platform
  description: Breakdown by advertising platform
  source: duckdb://ad_campaigns.duckdb/dim_platforms
  entity_id: platform
  values:
    - name: Google Ads
      where: "platform = 'Google Ads'"
    - name: Meta Ads
      where: "platform = 'Meta Ads'"
    - name: TikTok Ads
      where: "platform = 'TikTok Ads'"

At compute time, aitaem generates a JOIN from the fact table to the DIM table on the entity key:

FROM fact_table t
JOIN dim_platforms _dim ON t.<join_key> = _dim.platform

The <join_key> is resolved at compute() time — see segments parameter.

Note

values[].where expressions are evaluated against the DIM table (aliased as _dim). Unqualified column references are automatically qualified — "platform = 'Google Ads'" becomes _dim.platform = 'Google Ads' in the generated SQL. You do not need to write the prefix yourself.

Fields

Field	Required	Description
`name`	Yes	Unique identifier
`source`	Yes	URI of the DIM table to join
`entity_id`	Yes	Primary key column on the DIM table. Used as the right-hand side of the JOIN ON condition (`_dim.<entity_id>`).
`values`	Yes	List of `{name, where}` filter definitions against DIM table columns
`join_keys`	No	Whitelist of fact-table FK columns accepted as join keys. When non-empty, the join key supplied in `compute()` must be in this list. When empty (default), any column name is accepted.
`description`	No	Human-readable description

Example — customer value tier

segment:
  name: customer_value
  description: Customer segmentation by lifetime value tier
  source: duckdb://analytics.db/dim_customers
  entity_id: customer_id
  join_keys: [buyer_id, seller_id]
  values:
    - name: high_value
      where: "lifetime_value > 1000"
    - name: low_value
      where: "lifetime_value <= 1000"

Here customer_id is the PK on dim_customers, and the join_keys whitelist restricts the caller to buyer_id or seller_id as the fact-table FK. At compute time you specify which one applies to the query (see segments parameter).

Column introspection

After loading a spec you can inspect every column it references without needing a warehouse connection. This is useful for downstream validation — for example, verifying that every referenced column exists in the source table before running compute().

result = metric_spec.validate()
if result.valid:
    print(result.referenced_columns)
    # {
    #   'numerator':     ['revenue'],
    #   'denominator':   ['impressions'],
    #   'timestamp_col': ['created_at'],
    #   'entities':      ['user_id'],
    # }

For slice specs:

result = slice_spec.validate()
if result.valid:
    print(result.referenced_columns)
    # {'values[0].where': ['region'], 'values[1].where': ['region', 'country']}

For segment specs:

result = segment_spec.validate()
if result.valid:
    print(result.referenced_columns)
    # {
    #   'entity_id':       ['customer_id'],
    #   'join_keys':       ['buyer_id', 'seller_id'],   # omitted when join_keys is empty
    #   'values[0].where': ['lifetime_value'],
    #   'values[1].where': ['lifetime_value'],
    # }

Note

referenced_columns is None when the spec is invalid — always check result.valid first.

Column names are unqualified: SUM(t.revenue) yields "revenue". For wildcard slices, the bare column name is returned directly. For composite slices, an empty dict {} is returned (composite specs contain no SQL expressions).

Loading specs

Use SpecCache to load all specs before computing:

from aitaem import SpecCache

# From directories (loads all *.yaml / *.yml files)
cache = SpecCache.from_yaml(
    metric_paths="metrics/",
    slice_paths="slices/",
    segment_paths="segments/",
)

# From individual files
cache = SpecCache.from_yaml(
    metric_paths=["metrics/ctr.yaml", "metrics/total_revenue.yaml"],
    slice_paths="slices/campaign_type.yaml",
)

# From YAML strings (useful for testing)
cache = SpecCache.from_string(
    metric_yaml="""
metric:
  name: ctr
  source: duckdb://:memory:/events
  numerator: "SUM(clicks)"
  denominator: "SUM(impressions)"
  timestamp_col: date
""",
)

Compatibility Scanning

Before calling compute(), use MetricCompute.scan() to check which slices and segments are usable with each metric. The scan introspects source table schemas and returns a ScanResult with one CompatibilityResult per metric × spec pair.

from aitaem import SpecCache, ConnectionManager, MetricCompute

cache = SpecCache.from_yaml(
    metric_paths="metrics/",
    slice_paths="slices/",
    segment_paths="segments/",
)
conn = ConnectionManager.from_yaml("connections.yaml")
mc = MetricCompute(cache, conn)

result = mc.scan()

# Slices compatible with the "ctr" metric
compatible_slices = result.compatible_slices("ctr")
# e.g. ["campaign_type", "country"]

# Segments compatible with the "ctr" metric
compatible_segs = result.compatible_segments("ctr")
# e.g. ["platform"]

# Pass scan results directly into compute()
table = mc.compute(
    metrics="ctr",
    slices=compatible_slices,
    segments=compatible_segs[0] if compatible_segs else None,
)
df = table.to_pandas()

For slices, the scan checks that every column referenced in values[].where (or the bare column field for wildcard slices) exists in the metric's source table.

For segments, only the fact-table side is checked: the scan verifies that at least one join key candidate (join_keys, or entity_id when join_keys is empty) exists in the metric's source table. The DIM-table columns (used in values[].where) are on the segment's own source and are not validated here.

# Inspect a specific result
r = result.for_spec("platform")[0]
print(r.compatible)        # True
print(r.valid_join_keys)   # ["platform"]
print(r.missing_columns)   # []

# Find which metrics an incompatible slice is blocking
r = result.for_spec("bad_slice")[0]
print(r.compatible)        # False
print(r.missing_columns)   # ["nonexistent_column"]
print(r.reason)            # "columns not found in source table: ['nonexistent_column']"