
Conversation

@cetagostini (Contributor) commented Sep 5, 2025

Description

Hey team, major updates to the utility functions. Feel free to jump in, comment, or modify if you think something can be done better!

New Masked Distribution Prior

Now that users have multidimensional capabilities, models can grow without control. For example, it is currently possible to build a model with dims (date, country, state, city, brand, product, channel) & shape (720, 50, 90, 10, 100, 12).

Even though it is great to bring this flexibility to users, given such a huge dimensional space the data will probably not be perfectly balanced, and some cells in the matrix may be missing. MaskedDist lets the user add a mask over the prior, modifying the tensor graph behind it so that non-existent parameters are never sampled, which makes the process more efficient.

What it does (in short)

  1. Wraps a Prior that is defined over named dims.
  2. Applies a boolean mask to those dims to:
     • Build an “active” 1D RV only for unmasked positions (under a new active dim).
     • Scatter the active RV back to the original full grid and return a deterministic with the original dims.
  3. Can also create masked likelihoods so only active positions contribute to the log-likelihood.

Benefits

  1. Efficient parameterization: optimizes only the active positions, reducing parameter count and gradients.
  2. Performance: faster sampling/optimization and lower memory use when grids are large and sparse.
  3. Shape safety: validates against model coords; handles scalars, broadcastable arrays, and nested priors consistently.
  4. Likelihood masking: excludes masked cells from the logp cleanly (no NaN gymnastics).
  5. Portable: supports to_dict/from_dict for serialization.

A minimal example:
import numpy as np
import pymc as pm
from pymc_extras.prior import Prior
from pymc_marketing.pytensor_utils import MaskedDist

# Define a 3x4 grid
coords = {"row": [0, 1, 2], "col": [0, 1, 2, 3]}

# Create mask - only activate positions (0,0), (1,2), (2,3)
mask = np.array(
    [
        [True, False, False, False],
        [False, False, True, False],
        [False, False, False, True],
    ]
)

# Define prior over full grid
prior = Prior("Normal", mu=0, sigma=1, dims=("row", "col"))
masked_dist = MaskedDist(prior, mask)

with pm.Model(coords=coords):
    # Create masked distribution
    coeff = masked_dist.create_variable("coeff")
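
In practice the mask often comes from the data itself. Here is a minimal sketch of one way to build it, assuming a hypothetical long-format frame df with "row", "col", and "y" columns, where a cell is active only if it has at least one observation:

import pandas as pd

# Hypothetical: mark a (row, col) cell as active only if `df` has data for it.
mask = (
    df.pivot_table(index="row", columns="col", values="y", aggfunc="size")
    .reindex(index=coords["row"], columns=coords["col"])
    .notna()
    .to_numpy()
)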

Users can combine this with their transformation functions, such as saturation or adstock, to mask the corresponding parameters.

LogisticSaturation(
    priors={
        "lam": MaskedDist(
            Prior("HalfNormal", sigma=1.0, dims=("country", "region")),
            mask=mask,
        ),
        "beta": MaskedDist(
            Prior("HalfNormal", sigma=1.0, dims=("country", "region")),
            mask=mask,
        ),
    }
)
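
For context, a minimal sketch of what mask is assumed to look like for the (country, region) grid above; the coords and the missing panel are purely illustrative:

coords = {"country": ["US", "CA"], "region": ["east", "west", "north"]}
mask = np.ones((len(coords["country"]), len(coords["region"])), dtype=bool)
mask[1, 2] = False  # e.g. no data for the ("CA", "north") panel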

If some panels are missing from the full grid, users can mask the likelihood as well, so the model does not sample on dates, or date/geo combinations, that do not exist. (In the example below, coords and mask are assumed to be redefined over the (date, geo) grid.)

mu_prior = Prior("Normal", mu=0.0, sigma=1.0, dims=("date", "geo"))
masked_mu = MaskedDist(mu_prior, mask=mask)

observed = np.zeros((len(coords["date"]), len(coords["geo"])))

likelihood_prior = Prior(
    "Normal",
    sigma=Prior("HalfNormal", sigma=1.0),
    dims=("date", "geo"),
)
masked_lik = MaskedDist(likelihood_prior, mask=mask)

with pm.Model(coords=coords):
    # Masked deterministic mean over full grid (zeros where mask is False)
    mu_full = masked_mu.create_variable("mu_full")

    # Build likelihood as a Prior and mask it, so both mu and observed are masked
    masked_lik.create_likelihood_variable(
        name="y",
        mu=mu_full,
        observed=observed,
    )
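
A hedged usage note: if the model above is bound to a name (e.g. with pm.Model(coords=coords) as model:), sampling proceeds as usual and only the unmasked cells contribute to the log-likelihood:

with model:
    idata = pm.sample(draws=500, tune=500, chains=2, random_seed=1)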

Reducing the dimensionality brings big efficiency improvements. Some examples already show, even for low-dimensional problems, around 30% less computation time with a mask that cuts just a few diagonals of the matrix.

Important

Out-of-sample prediction with only test data cannot handle the masking around the likelihood. To make out-of-sample predictions, you must pass the full X_train and X_test to sample_posterior_predictive.
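
A minimal sketch of what that looks like, assuming an already fitted mmm with the usual sample_posterior_predictive(X, ...) signature; X_train and X_test are illustrative frames over the same panel grid:

import pandas as pd

# Pass the full design matrix (train + test), not the test split alone,
# so the masked likelihood still sees every panel it was built over.
X_full = pd.concat([X_train, X_test], ignore_index=True)
preds = mmm.sample_posterior_predictive(X_full, extend_idata=False)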

Model JAX sampling estimation

Related to the above, I added a JAX-based helper so users can estimate the minimum sampling time of their models. With it, users can compare the minimum amount of time a given model will take.

with pm.Model() as model:
    pm.Normal("x", mu=0.0, sigma=1.0, observed=np.array([0.0]))

estimator = ModelSamplerEstimator(
    tune=100, draws=200, chains=1, sequential_chains=1, seed=123
)

df = estimator.run(model)

This tool is great for comparison: it helps you get a sense of how heavy your model is and make changes before even running a first proper sample.
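
Because the output is a DataFrame, the same estimator can be run on several variants and the results compared side by side. A minimal sketch, assuming estimator.run returns a pandas DataFrame as in the example above; wide_model is purely illustrative:

import pandas as pd

# Illustrative: a deliberately larger variant of the toy model above.
with pm.Model() as wide_model:
    pm.Normal("x", mu=0.0, sigma=1.0, observed=np.zeros(1_000))

# Compare the estimated sampling cost of both variants in one table.
comparison = pd.concat(
    [estimator.run(model), estimator.run(wide_model)],
    keys=["small", "wide"],
)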

Merge graphical models

Combines multiple independent PyMC Models into a single Model.

As we move forward with causality and discover new DAGs, one way to recover direct or indirect effects is to create different regressions, each one adjusting for the minimum sufficient set of variables. Each regression responds to the same DAG but uncovers a different effect per channel depending on the adjustment.

Nevertheless, we will probably want to optimize all regressions together, or join the models to perform several operations. But how? Remember that PyMC models are graphical generative models, so each model can be represented as a function graph and its nodes/variables can be prefixed (e.g., model1_, model2_). We can then decide which variable or data to merge on with merge_on="some_var"; that variable is kept unprefixed and shared across the merged models, and all outputs and coordinates are merged into a single function graph.

Quick example, imagine you have the following DAG:

digraph G {
  rankdir=LR; splines=true; nodesep=0.4; ranksep=0.5;
  node [shape=ellipse, fontsize=12];

  C1 -> X1; C1 -> Y;
  C2 -> X2; C2 -> Y;
  C3 -> X3; C3 -> Y;

  X1 -> X2; X2 -> X3;
  X1 -> Y;  X2 -> Y;  X3 -> Y;

  // styling
  {rank=same; C1; C2; C3}
  {rank=same; X1; X2; X3}
  Y [shape=doublecircle];
}

This structure makes the minimal sufficient adjustment sets different for each channel’s direct effect on Y:

  • For X1: backdoor via C1; mediated paths via X2 (→ X3).
    Adjust for {C1, X2} (conditioning on X2 also blocks the X1 → X2 → X3 → Y path).

  • For X2: backdoors via C2 and via X1 (since X1 → X2 and X1 → Y); mediated path via X3.
    Adjust for {C2, X1, X3}.

  • For X3: backdoors via C3 and via X2 (since X2 → X3 and X2 → Y).
    Adjust for {C3, X2}.

That means we need three models to uncover the effects:

# Get x1 over Y
mask = np.array([True, False, False])
mmm_x1_effect = MMM(  # (Y ~ X1 + C1 + X2)
    date_column="date",
    target_column="Y",
    channel_columns=[
        "X1",
        "X2",
        "X3",
    ],  # always use all channels, but mask the ones that are not used
    control_columns=["X2", "C1"],
    adstock=NoAdstock(l_max=1),
    saturation=NoSaturation(
        priors={
            "beta": MaskedDist(
                Prior("HalfNormal", sigma=1.0, dims="channel"), mask=mask
            )
        }
    ),
)

# Get X2 over Y
# Mask the channels so only X2 is active (X1 and X3 are False)
mask2 = np.array([False, True, False])
mmm_x2_effect = MMM(  # (Y ~ X2 + C2 + X1 + X3)
    date_column="date",
    target_column="Y",
    channel_columns=[
        "X1",
        "X2",
        "X3",
    ],  # always use all channels, but mask the ones that are not used
    control_columns=["X1", "C2", "X3"],
    adstock=NoAdstock(l_max=1),
    saturation=NoSaturation(
        priors={
            "beta": MaskedDist(
                Prior("HalfNormal", sigma=1.0, dims="channel"), mask=mask2
            )
        }
    ),
)

# Get X3 over Y
# Mask the channels so only X3 is active (X1 and X2 are False)
mask3 = np.array([False, False, True])
mmm_x3_effect = MMM(  # (Y ~ X3 + C3 + X2)
    date_column="date",
    target_column="Y",
    channel_columns=[
        "X1",
        "X2",
        "X3",
    ],  # always use all channels, but mask the ones that are not used
    control_columns=["X2", "C3"],
    adstock=NoAdstock(l_max=1),
    saturation=NoSaturation(
        priors={
            "beta": MaskedDist(
                Prior("HalfNormal", sigma=1.0, dims="channel"), mask=mask3
            )
        }
    ),
)

As you can see, the masks let us zero out the distributions so that we avoid computing contributions for the channels we don't want to adjust for in a given regression. Note that even when channels are zeroed out by the priors and excluded from sampling, the data input is the same for all three models.

We can merge the models on that shared data.

optimizable_model1 = MultiDimensionalBudgetOptimizerWrapper(
    model=mmm_x1_effect, start_date="2026-01-01", end_date="2026-01-31"
)

optimizable_model2 = MultiDimensionalBudgetOptimizerWrapper(
    model=mmm_x2_effect, start_date="2026-01-01", end_date="2026-01-31"
)

optimizable_model3 = MultiDimensionalBudgetOptimizerWrapper(
    model=mmm_x3_effect, start_date="2026-01-01", end_date="2026-01-31"
)

multi_model = MultiModelWrapper(
    models=[optimizable_model1, optimizable_model2, optimizable_model3],
    prefixes=["model1", "model2", "model3"],
    merge_on="channel_data",
)

And finally, optimize.

optimizer = BudgetOptimizer(
    num_periods=multi_model.num_periods,
    model=multi_model,
    response_variable="total_media_contribution_original_scale",
)
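
As a follow-up, running the allocation might look like the sketch below; the allocate_budget call and its return values mirror the existing BudgetOptimizer API and may differ for merged multi-model setups:

# Illustrative only: allocate a total budget across channels of the merged model.
optimal_budgets, result = optimizer.allocate_budget(total_budget=100.0)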

This opens the door to optimizing different effects (total, indirect, or direct) together, and allows for multi-objective optimization: every model can have a different target variable, but because the targets depend on the same inputs we can optimize them jointly.

Conclusion

These examples are just the tip of the iceberg, but they let users scale their models in dimensionality and optimize them as much as possible, even with panel imbalance; estimate sampling time to decide whether MCMC is feasible or VI methods would be a better option; and, when they end up with multiple models, merge them to consolidate their learnings or optimize them jointly across different outputs.

Related Issue

  • Closes #
  • Related to #

Checklist


📚 Documentation preview 📚: https://pymc-marketing--1927.org.readthedocs.build/en/1927/

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

@github-actions bot added the docs and tests labels Sep 5, 2025

codecov bot commented Sep 5, 2025

Codecov Report

❌ Patch coverage is 77.75120% with 93 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.26%. Comparing base (49ed036) to head (eaae949).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pymc_marketing/pytensor_utils.py 74.92% 79 Missing ⚠️
pymc_marketing/mmm/budget_optimizer.py 85.26% 14 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1927      +/-   ##
==========================================
- Coverage   91.99%   91.26%   -0.73%     
==========================================
  Files          66       66              
  Lines        7843     8253     +410     
==========================================
+ Hits         7215     7532     +317     
- Misses        628      721      +93     

☔ View full report in Codecov by Sentry.

@cetagostini (Contributor, Author) commented Sep 5, 2025

Big kudos to @ricardoV94 because all initial draft functions came from him. I'm just a parrot opening the PRs and showing my use cases!

@cetagostini (Contributor, Author)

PS: Very open to change names, I feel the current names are all quite bad 😅

@TeemuSailynoja (Contributor)

Wow! Just, wow! This will take a moment to process, but looks amazing.

@williambdean (Contributor)

Why one giant PR 😓

@cetagostini (Contributor, Author)

Why one giant PR 😓

Sorry for it, but I added a good level of tests. Maybe try it out over different scenarios (?) and then we can move on! I just want to put it all together. @williambdean

@github-actions bot added the mlflow label Sep 6, 2025
@williambdean (Contributor) commented Sep 7, 2025

There might be a version / behavior change with the latest mlflow. Can you check the version? Maybe an mlflow 3.0 thing.

class MaskedDist:
    """Create a masked deterministic from a Prior over full dims.

    The foal is to reduce the number of parameters in the model by creating a masked deterministic

Sorry for horsing around, but this should be "The goal" 😉

@williambdean (Contributor)

MLflow stuff should be fixed

@cetagostini (Contributor, Author)

MLflow stuff should be fixed

Thanks a lot, you are the best. I'll follow your guide there, and continue adding tests.

Comment on lines +320 to +324
full_dims: tuple[str, ...] = (
    (self.dims if isinstance(self.dims, tuple) else (self.dims,))
    if self.dims
    else ()
)

Prior class should already do this

Labels
API, docs, enhancement, maintenance, major, mlflow, MMM, multidimensional, optimizer, Prior class, tests