Warning
This project is in beta and is not fully publicly released. The API is subject to breakage and bugs may occur. Please wait until 0.1.0 for full release.
The following features are currently missing and are being prioritized for a 0.1.0 release:
- Missing value support (but probably not the EM algorithm for the time being).
- Support for weights.
- Snowflake support.
- (Maybe) Fuller, paginated documentation on Github Pages.
- Remove custom materialization.
PCA in SQL, powered by dbt.
dbt_pca is an easy way to perform principal component analysis (PCA) in SQL using dbt.
Reasons to use dbt_pca:
- 📈 PCA in pure SQL: With the power of recursive CTEs and math, it is possible to implement PCA in pure SQL. Most SQL engines (even OLAP engines) do not have an implementation of PCA, so this fills a valuable niche. dbt_pca implements true PCA via the NIPALS algorithm.
- 📱 Simple interface: Just define a `table=` (which works with `ref()`, `source()`, and CTEs), your column(s) with `columns=`, an index with `index=`, and you're all set! Both "wide" and "long" data formats are supported.
- 🤸 Flexibility: Tons of output options are available to return things the way you want: choose from eigenvectors, factors, and projections in both wide and long formats.
- 🤗 User friendly: The API provides comprehensive feedback on input errors.
- 💪 Durable and tested: Everything in this code base is tested against equivalent PCAs performed in Statsmodels, with high-precision assertions (between 10e-5 and 10e-7, depending on the database engine).
- 🪄 Deep under-the-hood magic: In the pursuit of making this as easy as possible to implement across databases, tons of dbt dark magic wizardry happens under the hood. I spent weeks of my life so you could save seconds of yours!
Currently only DuckDB, Clickhouse, and Snowflake are supported.
Note: If you enjoy this project, you may also enjoy my other dbt machine learning project, dbt_linreg. 😊
dbt_pca works with the following databases:
Database | Supported | Precision asserted in CI* | Supported since version |
---|---|---|---|
DuckDB | ✅ | 10e-7 | 0.1.0 |
Clickhouse | ✅ | 10e-5 | 0.1.0 |
Snowflake | ✅ | 10e-7 | 0.1.0 |
Please see the Performance optimization section on how to get the best performance out of each database.
\* Precision is measured against `PCA(method='nipals')` in Statsmodels. In some cases, precision depends on the implementation method; the asserted precision is based on the suggested implementation.
`dbt-core >=1.4.0` is required to install dbt_pca.

Add this to the `packages:` list in your dbt project's `packages.yml`:
- package: "dwreeves/dbt_pca"
version: "0.0.6"
The full file will look something like this:
packages:
# ...
# Other packages here
# ...
- package: "dwreeves/dbt_pca"
version: "0.0.6"
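Then, as with any dbt package, install it by running dbt's standard dependency command:

dbt deps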
The following example runs PCA on 5 columns (`x1`, `x2`, `x3`, `x4`, `x5`), using data in the dbt model named `collinear_matrix`.
{{
config(
materialized="table"
)
}}
select * from {{
dbt_pca.pca(
table=ref('collinear_matrix'),
index='idx',
columns=['x1', 'x2', 'x3', 'x4', 'x5'],
ncomp=3
)
}} as pca
comp | col | eigenvector | eigenvalue | coefficient |
---|---|---|---|---|
0 | x1 | 0.23766561884860857 | 29261.18329671314 | 40.6548443559801 |
0 | x2 | -0.5710963390821377 | 29261.183296713134 | -97.69117169801471 |
0 | x3 | -0.5730424984614262 | 29261.18329671314 | -98.02407978560524 |
0 | x4 | -0.5375734671620216 | 29261.18329671314 | -91.9567825723166 |
0 | x5 | 0.0010428157925962138 | 29261.183296713138 | 0.17838303220022225 |
1 | x1 | -0.0662409860256577 | 10006.031828705965 | -6.626096072806324 |
1 | x2 | -0.001674528192609409 | 10006.031828705967 | -0.16750331398380647 |
1 | x3 | -0.0027280948128929504 | 10006.031828705962 | -0.27289174588904075 |
1 | x4 | -0.022663548107992145 | 10006.031828705964 | -2.2670382209597086 |
1 | x5 | 0.9975411013143917 | 10006.031828705964 | 99.78419058137138 |
2 | x1 | 0.9637584043866467 | 8920.76195540799 | 91.02677443759194 |
2 | x2 | 0.09591639009268058 | 8920.761955407994 | 9.059282457195334 |
2 | x3 | 0.10093338977915689 | 8920.761955407994 | 9.533137000756994 |
2 | x4 | 0.2167293438854099 | 8920.761955407994 | 20.470040012174913 |
2 | x5 | 0.06935867943078204 | 8920.761955407997 | 6.550912385405418 |
The default output for dbt_pca is eigenvectors + coefficients + eigenvalues, although you can configure it to output principal components (factors) and projections as well with the `output='...'` argument.

Also, `collinear_matrix` is one of the test cases, so you can try this yourself!
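For example, to return the principal components keyed by the index instead of the loadings, you could pass `output='factors-wide'` (a sketch reusing the same model and columns as above):

select * from {{
dbt_pca.pca(
table=ref('collinear_matrix'),
index='idx',
columns=['x1', 'x2', 'x3', 'x4', 'x5'],
ncomp=3,
output='factors-wide'
)
}} as pca

Per the outputs table below, this returns one row per `idx` with columns `factor_0`, `factor_1`, and `factor_2`.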
The below example is equivalent to the first (wide-format) example above.
By setting an argument for `values=...`, dbt_pca will infer that your data is long-formatted.
{{
config(
materialized="table"
)
}}
with preprocessed_data as (
select idx, 'x1' as c, x1 as x from {{ ref('collinear_matrix') }}
union all
select idx, 'x2' as c, x2 as x from {{ ref('collinear_matrix') }}
union all
select idx, 'x3' as c, x3 as x from {{ ref('collinear_matrix') }}
union all
select idx, 'x4' as c, x4 as x from {{ ref('collinear_matrix') }}
union all
select idx, 'x5' as c, x5 as x from {{ ref('collinear_matrix') }}
)
select * from {{
dbt_pca.pca(
table='preprocessed_data',
index='idx',
columns='c',
values='x'
)
}} as pca
Data that we want to run PCA on often comes long-formatted by default.
For example, imagine a database containing movies, where each movie can be tagged with multiple genres stored as an array.
We can imagine flattening the array column into a column of strings, doing some additional preprocessing, and then running PCA where `rows='movie_id'`, `columns='genre'`, and `values='tagged_as_genre'`;
this would be a more natural way to define the data for this particular PCA than widening it.
See Karl Rohe's longpca for a longer (pun not intentional) discussion about this.
The below example uses dbt_linreg alongside dbt_pca.
We have a table called `movies`, which contains ratings and tags for each movie in the database.
We want to see which movies over- or under-perform based on their tags:
e.g. perhaps the tag `foo` usually indicates a poor quality movie, but some movies tagged with `foo` are exceptionally good.

We can do the following:
- Define a matrix with the following structure:
  - Rows are movies.
  - Columns are tags.
  - Cells are `1` if the movie has the tag, otherwise `0`.
- Use `output='factors-wide'` to output principal components in a CTE with the following columns: `id`, `comp_0`, `comp_1`, `comp_2`, `comp_3`, `comp_4`.
- Use the principal components as features in a linear regression.
- Predict the movie rating using the regression coefficients.
- Return the predictions sorted by `abs(residual) desc`.
Here is what that would look like:
{{
config(
materialized="table"
)
}}
with
movie_tags as (
select
id,
unnest(tags) as tag,
1 as has_tag
from {{ ref("movies") }}
),
pca as (
select * from {{
dbt_pca.pca(
table='movie_tags',
index='id',
columns='tag',
values='has_tag',
missing='zero',
output='factors-wide',
ncomp=5
)
}}
),
movie_rating_with_principal_components as (
select
m.id,
m.name,
m.rating,
c.comp_0,
c.comp_1,
c.comp_2,
c.comp_3,
c.comp_4
from {{ ref("movies") }} as m
),
linear_regression_coefficients as (
select * from {{
dbt_linreg.ols(
table='movie_rating_with_principal_components',
endog='rating',
exog=['comp_0', 'comp_1', 'comp_2', 'comp_3', 'comp_4']
)
}}
),
predictions as (
select
m.id,
m.name,
m.rating as actual_rating,
l.const
+ m.comp_0 * l.comp_0
+ m.comp_1 * l.comp_1
+ m.comp_2 * l.comp_2
+ m.comp_3 * l.comp_3
+ m.comp_4 * l.comp_4
as predicted_rating,
predicted_rating - actual_rating as residual
from
movie_rating_with_principal_components as m,
linear_regression_coefficients as l
)
select *
from predictions
order by abs(residual) desc
Here is a Snowflake-specific example using Cortex vector embeddings to construct principal components over an embedding space.
The resulting table of components (using `output='factors-wide'`) can then be used as features in a machine learning model.
(This is very similar to something I am doing right now at my current company!)
{{
config(
materialized="table"
)
}}
with
base as (
select
id,
/* Note: it is STRONGLY recommended you precompute embeddings in a separate incremental materialization,
since the cost of rerunning this function can be quite high. */
snowflake.cortex.embed_text_768('snowflake-arctic-embed-m', description) as description_embedding
from {{ ref("entity") }}
),
reshaped_embeddings as (
select
b.id,
v.index as embedding_index,
v.value::float as cell
from base as b,
lateral flatten(input => b.description_embedding::array) as v
)
select *
from {{ dbt_pca.pca(
table='reshaped_embeddings',
columns='embedding_index::integer',
values='cell',
index='id::integer',
ncomp=10,
output='factors-wide'
) }}
order by id
If you'd like to avoid recomputing the PCA each time, then you can run this as an incremental model, store the default output (`output='loadings'`),
and compute factors manually using the loadings.
For best results, set `demean=false` and `standardize=false`, and optionally demean/standardize the data yourself and store those means and stdevs separately.
The below example shows a 2-step dbt pipeline that computes PCA loadings once every month and updates factors on every run.
This is achieved by computing the factors manually from the loadings.
Loadings in the below example are normalized via dividing by the sqrt of the eigenvalue.
-- loadings.sql
{{
config(
materialized="incremental",
unique_key=["comp", "embedding_index"]
)
}}
with
base as (
select
id,
description_embedding
from {{ ref("entity") }}
{% if is_incremental() %}
/* This makes it so the query only runs at the start of each month. */
where current_timestamp() >= (select date_trunc(month, dateadd(month, 1, max(last_updated))) from {{ this }})
{% endif %}
),
reshaped_embeddings as (
select
b.id,
v.index as embedding_index,
v.value::float as cell
from base as b,
lateral flatten(input => b.description_embedding::array) as v
)
select *, current_timestamp() as last_updated
from {{ dbt_pca.pca(
table='reshaped_embeddings',
columns='embedding_index::integer',
values='cell',
index='id::integer',
ncomp=10,
demean=false,
standardize=false,
output='loadings'
) }}
order by comp, embedding_index
------------------------------------------------------------------------------------------------------------------------
-- factors.sql
{{
config(
materialized="incremental",
unique_key=["id"]
)
}}
with
base as (
select
e.id,
e.description_embedding,
e.last_updated
from {{ ref("entity") }} as e
{% if is_incremental() %}
left join {{ this }} as t
on t.id = e.id
where
t.id is null
or e.last_updated > t.last_updated
or (select max(last_updated) from {{ ref("loadings") }}) > t.last_updated
{% endif %}
),
reshaped_embeddings as (
select
b.id,
v.index as embedding_index,
v.value::float as cell,
b.last_updated
from base as b,
lateral flatten(input => b.description_embedding::array) as v
),
factors as (
select
r.id,
l.comp,
sum(r.cell * l.eigenvector / sqrt(l.eigenvalue)) as factor,
max(greatest(r.last_updated, l.last_updated)) as last_updated
from reshaped_embeddings as r
inner join {{ ref("loadings") }} as l
on r.embedding_index = l.embedding_index
group by r.id, l.comp
)
select
id,
(array_agg(factor) within group (order by comp))::vector(float, 10) as factor_vec,
max(case when comp = 0 then factor end) as factor_0,
max(case when comp = 1 then factor end) as factor_1,
max(case when comp = 2 then factor end) as factor_2,
max(case when comp = 3 then factor end) as factor_3,
max(case when comp = 4 then factor end) as factor_4,
max(case when comp = 5 then factor end) as factor_5,
max(case when comp = 6 then factor end) as factor_6,
max(case when comp = 7 then factor end) as factor_7,
max(case when comp = 8 then factor end) as factor_8,
max(case when comp = 9 then factor end) as factor_9,
max(last_updated) as last_updated
from factors
group by id
order by id
You can then use the vector representation (`factor_vec`) for cosine similarity calculations,
or use the `factor_*` columns as features in a machine learning model.
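For instance, a pairwise similarity query might look like the following sketch (assuming the factors model above is named `factors`, and using Snowflake's `VECTOR_COSINE_SIMILARITY` function):

select
a.id as id_a,
b.id as id_b,
-- cosine similarity between the two factor vectors
vector_cosine_similarity(a.factor_vec, b.factor_vec) as similarity
from {{ ref("factors") }} as a
inner join {{ ref("factors") }} as b
on a.id < b.id
order by similarity desc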
Of course, there are many other things you can do with dbt_pca beyond the examples above. Hopefully these examples inspire you to explore all the possibilities!
This is the core function of the dbt_pca package. Using Python typing notation, the full API for `dbt_pca.pca()` looks like this:
def pca(
table: str,
index: str | list[str] | None = None,
columns: str | list[str] | None = None,
values: str | None = None,
ncomp: int | None = None,
normalize: bool = True,
standardize: bool = True,
demean: bool = True,
# missing: Literal[None] = None,
weights: list[float | int] | None = None,
output: str = 'loadings',
output_options: dict[str, Any] | None = None,
method: Literal['nipals'] = 'nipals',
method_options: dict[str, Any] | None = None,
materialization_options: dict[str, Any] | None = None
):
...
Where:
- table: Name of the table or CTE to pull the data from. You can use a `ref()`, a `source()`, a CTE name, or a subquery.
- index: The uniquely identifying index for each row of data. You may define one or more columns as the index. This field is required if your data is long or if you want to output components. (You can also specify `rows=...` instead of `index=...` if you prefer.)
- columns: Either a list of columns (for wide-formatted data), or a uniquely identifying index for each column of data (for long-formatted data). You may specify multiple columns even for long-formatted data; in this case, the multiple columns will be treated like a multi-index.
- values: Specifying this will make dbt_pca treat your data as long-formatted. This field defines the column in your table corresponding with the cell values of the matrix.
- ncomp: Number of components to return. If `None`, it is set to the number of columns in the data if the data is wide-formatted; if the data is long-formatted, this must be set. It is strongly recommended that you set the number of components for large data sets. Most database implementations (except DuckDB) require `ncomp` to be set for long-formatted data.
- normalize: Indicates whether to normalize the factors to have unit inner product. If False, the loadings will have unit inner product instead.
- standardize: Flag indicating whether to use standardized data with mean 0 and unit variance. standardize being True implies demean. Using standardized data is equivalent to computing principal components from the correlation matrix of the data.
- demean: Flag indicating whether to demean the data before computing principal components. demean is ignored if standardize is True. Demeaning but not standardizing is equivalent to computing principal components from the covariance matrix of the data.
- missing: Does nothing currently.
- weights: Column weights to use when computing the principal components, applied after transforming the data according to standardize or demean.
- output: See the Outputs and output options section of the README for more. This can be one of the following:
  - `'loadings'`
  - `'loadings-long'`
  - `'loadings-wide'`
  - `'eigenvectors-wide'`
  - `'eigenvectors-wide-transposed'`
  - `'coefficients-wide'`
  - `'coefficients-wide-transposed'`
  - `'factors'`
  - `'factors-long'`
  - `'factors-wide'`
  - `'projections'`
  - `'projections-long'`
  - `'projections-wide'`
  - `'projections-untransformed-wide'`
- output_options: See the Outputs and output options section of the README for more.
- method: The method used to calculate the PCA. Currently only `'nipals'` is supported in non-Snowflake databases; `'eig'` and `'svd'` are additionally supported in Snowflake. See Methods and method options for more.
- method_options: Options specific to the estimation method. See Methods and method options for more.
- materialization_options: Database-specific options relating to how the table is materialized. See Materialization options for more.
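To make the argument list above concrete, here is an illustrative call on wide data using a two-column index, with standardization turned off (the table and column names here are hypothetical):

select * from {{
dbt_pca.pca(
table=ref('my_wide_table'),
index=['entity_id', 'as_of_date'],
columns=['x1', 'x2', 'x3'],
ncomp=2,
standardize=false,
output='loadings'
)
}}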
Names for function arguments and concepts vary across PCA implementations in different languages and frameworks. In this library, all names and concepts are equivalent to those in Statsmodels.
Output type | Description | Uniquely identifying columns | Other columns | Restrictions | Notes |
---|---|---|---|---|---|
`'loadings'` | Returns the eigenvectors (i.e. loadings), eigenvalues, and coefficients. | `comp`, `[columns]` | `eigenvector`, `eigenvalue`, `coefficient` | - | - |
`'loadings-long'` | - | `comp`, `[columns]` | `eigenvector`, `eigenvalue`, `coefficient` | All | Alias for `'loadings'`. |
`'loadings-wide'` | Same as `'loadings'`, but wide. | `[columns]` | `eigenvector_{i}`, `coefficient_{i}` | - | - |
`'eigenvectors-wide'` | Same as `'loadings-wide'`, except only display eigenvectors. | `[columns]` | `eigenvector_{i}` | - | Alias for `'loadings-wide'` with `output_options['display_coefficients'] = false`. |
`'eigenvectors-wide-transposed'` | Same as `'eigenvectors-wide'`, except transposed. | `comp` | `[columns]` | Wide input format only | - |
`'coefficients-wide'` | Same as `'loadings-wide'`, except only display coefficients. | `[columns]` | `coefficient_{i}` | - | Alias for `'loadings-wide'` with `output_options['display_eigenvectors'] = false`. |
`'coefficients-wide-transposed'` | Same as `'coefficients-wide'`, except transposed. | `comp` | `[columns]` | Wide input format only | - |
`'factors'` | Returns the principal components (i.e. factors) in a long format. | `comp`, `[index]` | `factor` | Requires an `index=...` to be defined. | - |
`'factors-long'` | - | `comp`, `[index]` | `factor` | Requires an `index=...` to be defined. | Alias for `'factors'`. |
`'factors-wide'` | Returns the principal components (i.e. factors) in a wide format. | `[index]` | `factor_{i}` | - | - |
`'projections'` | Returns projections in a long format. | `[columns]`, `[index]` | `projection` | Requires an `index=...` to be defined. | - |
`'projections-long'` | - | `[columns]`, `[index]` | `projection` | Requires an `index=...` to be defined. | Alias for `'projections'`. |
`'projections-wide'` | Returns projections in a wide format (each column is an original column of the data). | `[index]` | `[columns]` | Wide input format only | - |
`'projections-untransformed-wide'` | Returns untransformed projections. | `[index]` | `[columns]` | Wide input format only | This is niche and you probably don't need this. |
- columns_column_name (`string`; default: `'col'`): When converting a wide input to a long output, this is the column name used to group all the columns together.
- eigenvector_column_name (`string`; default: `'eigenvector'`): Column name for eigenvectors, i.e. loadings.
  - In long formatted outputs, the column `{{eigenvector_column_name}}` contains the values of the eigenvectors.
  - In wide formatted outputs, there will be multiple columns `{{eigenvector_column_name}}_{{i}}`, where `i` is an index for the principal component.
- eigenvalue_column_name (`string`; default: `'eigenvalue'`): Column name for eigenvalues.
- coefficient_column_name (`string`; default: `'coefficient'`): Column name for coefficients, i.e. `eigenvector * sqrt(eigenvalue)`.
  - In long formatted outputs, the column `{{coefficient_column_name}}` contains the values of the coefficients.
  - In wide formatted outputs, there will be multiple columns `{{coefficient_column_name}}_{{i}}`, where `i` is an index for the principal component.
- component_column_name (`string`; default: `'comp'`): Identifier for the principal component; this is an integer typed column that identifies the component (e.g. if there are 3 components, then `{{component_column_name}}` can take on values of `0`, `1`, and `2`).
- factor_column_name (`string`; default: `'factor'`): Column name for factors, i.e. principal components.
  - In long formatted outputs, the column `{{factor_column_name}}` contains the values of the principal component vectors.
  - In wide formatted outputs, there will be multiple columns `{{factor_column_name}}_{{i}}`, where `i` is an index for the principal component.
- projection_column_name (`string`; default: `'projection'`): Column name for projections of data onto the principal components.
- display_eigenvalues (`bool`; default: `True`): If True, display eigenvalues in the loadings output.
- display_coefficients (`bool`; default: `True`): If True, display coefficients in the loadings output.
- strip_quotes (`bool`; default: `True`): If true, strip outer quotes from column names in long outputs; if false, always use string literals.
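Output options can be passed directly into the call via the `output_options=` argument. For example, this sketch renames the eigenvector column using the documented `eigenvector_column_name` option (reusing the `collinear_matrix` example from earlier):

select * from {{
dbt_pca.pca(
table=ref('collinear_matrix'),
index='idx',
columns=['x1', 'x2', 'x3', 'x4', 'x5'],
ncomp=3,
output_options={'eigenvector_column_name': 'loading'}
)
}}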
Output options can also be set globally via `vars`, e.g. in your `dbt_project.yml`:
# dbt_project.yml
vars:
dbt_pca:
output_options:
eigenvector_column_name: loading
eigenvalue_column_name: eigenval
coefficient_column_name: coeff
There is currently only one method for calculating PCA in pure SQL, `'nipals'`, and I currently do not have plans to implement more, as frankly it's taken years off my life just to implement the one. 😆

Because Snowflake is just a pass-through to `sm.PCA()`, the Snowflake implementation supports everything Statsmodels supports: `'nipals'`, `'svd'`, and `'eig'`.
Nonlinear Iterative Partial Least Squares (NIPALS) is a method that is optimized for calculating the first few PCs of a matrix, but it is less performant and less accurate for the last few PCs. In practical settings, we usually want far fewer components than there are columns in the data, so this ends up being a good trade-off.
Another advantage of NIPALS is it can be modified to handle missing data. This is a common method in many implementations of PCA, although it is not currently supported in the pure SQL implementations. (It's on the roadmap!)
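For reference, a standard textbook formulation of the NIPALS iteration for a single component of a (transformed) data matrix $X$ looks like this; the package's exact normalization and convergence details may differ:

$$
\begin{aligned}
&\text{initialize } t \text{ as a column of } X\\
&\text{repeat until the change in } t \text{ is below } tol \text{ (or } max\_iter \text{ is reached):}\\
&\qquad p \leftarrow X^\top t / (t^\top t), \qquad p \leftarrow p / \lVert p \rVert, \qquad t \leftarrow X p\\
&\text{then take } \lambda = t^\top t \text{ and deflate } X \leftarrow X - t p^\top \text{ before extracting the next component.}
\end{aligned}
$$

Here $p$ is the eigenvector (loading), $t$ is the factor, and $\lambda$ is the eigenvalue for that component. This is why the method options below revolve around iteration counts and convergence tolerances.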
Specify these in a dict using the `method_options=` kwarg:
- max_iter (`int`; default: varies) - Maximum iterations of the NIPALS algorithm.
- check_tol (`bool`; default: varies) - If True, then check for convergence using the `tol` value. If False, always iterate `max_iter` times.
- tol (`float`; default: `5e-8`) - Tolerance to use when checking for convergence.
- deterministic_column_seeding (`bool`; default: `False`) - If True, seed the initial column in a factor calculation with the alphanumerically first column. This guarantees the sign of the eigenvector even when normalized, but is significantly slower to converge. It is strongly recommended to keep this turned off.
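For example, a call that tightens the convergence settings might look like the following sketch (the table, columns, and option values here are purely illustrative):

select * from {{
dbt_pca.pca(
table=ref('my_table'),
index='idx',
columns=['x1', 'x2', 'x3'],
ncomp=2,
method='nipals',
method_options={'max_iter': 500, 'check_tol': true, 'tol': 1e-9}
)
}}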
Some default values vary based on the database engine being used and whether `calculate_in_steps` is enabled.
Method options can be set globally via `vars`, e.g. in your `dbt_project.yml`. Each `method` gets its own config, e.g. the `dbt_pca: method_options: nipals: ...` namespace only applies to the `nipals` method. Here is an example:
# dbt_project.yml
vars:
dbt_pca:
method_options:
nipals:
max_iter: 300
Materialization options influence the behind-the-scenes details of how the dbt model runs.
Materialization options can be set globally via `vars`, e.g. in your `dbt_project.yml`:
# dbt_project.yml
vars:
dbt_pca:
materialization_options:
udf_database: my_udf_db
udf_schema: my_udf_schema
drop_udf: false
A lot of magic happens under the hood to make the Snowflake implementation possible.
The defaults should work fine, but many of the internals are exposed via `materialization_options` to give you control when you need it.

Basically, the Snowflake implementation cheats by wrapping `sm.PCA()`.
Under the hood, this is implemented as a user-defined table function, which is created by a pre-hook injected into the model config as a side effect of calling `dbt_pca.pca()`.
The pre-hook is evaluated lazily, using the JSON config injected into the compiled model.
dbt_pca then reads the schema of the node referenced in the `table=` arg to infer the types for the function inputs and outputs.
With all of that out of the way, here are the materialization options available for the Snowflake implementation:
- `udf_database` (`string`; default: use the model's database) - The database where the UDF is written to.
- `udf_schema` (`string`; default: use the model's schema) - The schema where the UDF is written to.
- `infer_function_signature_types` (`bool`; default: `True`) - If True, infer the types of the UDF function signature based on the types of the upstream node. If False, use `float`.
- `cast_types_to_udf` (`bool`; default: `False` if `infer_function_signature_types` is `True`, otherwise `True`) - If True, all columns are cast before being placed into the UDF (e.g. `function(foo::number)` instead of `function(foo)`).
- `drop_udf` (`bool`; default: `True`) - If True, drop the UDF after running the model (as a post-hook). If False, do not drop the UDF.
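Like the other option groups, these can be passed per model. For example, this sketch keeps the UDF around after the run and writes it to a dedicated schema (the table, columns, and schema name are illustrative):

select * from {{
dbt_pca.pca(
table=ref('my_table'),
index='id',
columns=['a', 'b', 'c'],
ncomp=2,
materialization_options={
'udf_schema': 'my_udf_schema',
'drop_udf': false
}
)
}}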
You probably will not need the options below (it is suggested you instead cast the type in the definition, e.g. `index=['id::integer']`), but they are available just in case the above options are insufficient:
- `column_types` (`list[str] | None`; default: `None`) - If set, take the input types for the columns from this instead of automatically inferring them.
- `index_types` (`list[str] | None`; default: `None`) - If set, take the input types for the index from this instead of automatically inferring them.
- `values_type` (`str | None`; default: `None`) - If set, take the input type for the values from this instead of automatically inferring it.
- `calculate_in_steps` (`bool`; default: `True`) - If set, calculate the PCA in multiple steps, i.e. create temporary tables for each step involved. By default, this is `True` for Clickhouse for performance reasons; I will set this to `False` by default when Clickhouse releases materialized CTEs (for such supported versions of Clickhouse).
- `calculate_in_steps` (`bool`; default: `False`) - If set, calculate the PCA in multiple steps, i.e. create temporary tables for each step involved. By default, this is `False` for DuckDB, as it is unnecessary for performance.
Please open an issue on Github if you experience any performance problems. I am still ironing things out a bit.
The performance for DuckDB is very fast, and users should generally not have any issues with DuckDB. In testing, values are asserted to equal the Statsmodels implementation of PCA with a very high level of precision, and the test cases run very quickly.
Clickhouse is made performant by calculating individual steps in temporary tables. This is because Clickhouse does not support materialized CTEs (although this feature is scheduled to come out in 2025), which means expensive calculations are calculated redundantly multiple times as the number of principal components increases.
By default, the `calculate_in_steps` materialization option is turned on for Clickhouse.
But `calculate_in_steps` only works when `table=...` is a `ref()` or `source()` to a non-ephemeral node.
For example, the following code works, but it will be non-performant because it references a CTE:
with my_cte as (select * from {{ ref("my_table") }})
select * from {{ dbt_pca.pca(
table="my_cte",
index="index",
columns="col",
values="val",
ncomp=10
) }}
It would be better to write the code like this, so long as `ref("my_table")` is not ephemeral (views are OK):
select * from {{ dbt_pca.pca(
table=ref("my_table"),
index="index",
columns="col",
values="val",
ncomp=10
) }}
Clickhouse also has some other performance considerations.
The `max_iter` is set to 150 (100 without `calculate_in_steps`), which is lower than in DuckDB and achieves an accuracy of `10e-6`, which is also lower than DuckDB.
This `max_iter` was chosen to play nicely with Clickhouse's default `max_recursive_cte_evaluation_depth` of 1000; however, you can adjust that setting through a pre-hook:
{{
config(
materialized='table',
pre_hook='set max_recursive_cte_evaluation_depth = 5000;'
)
}}
select * from {{
dbt_pca.pca(
table=ref('my_table'),
index='a_id',
columns='b_id',
values='x',
ncomp=10,
method_options={'max_iter': 400}
)
}}
The Snowflake implementation cheats by wrapping `sm.PCA()` (there is no other way to do it, really).
Because it runs as a Python UDF, "Snowpark-optimized" warehouses are recommended for heavy workloads.
These warehouses allow additional memory to be allocated for a given warehouse size, relative to the amount of compute power of the warehouse instance.
Read more about Snowpark-optimized warehouses in the Snowflake documentation.
Snowflake UDTFs have a limit of 30 minutes of runtime per call (in practice, Snowflake allows UDTF calls to go a little over, to ~40 minutes).
The Snowflake implementation of dbt_pca does not batch calls at all, neither by partitioning rows* nor by iteratively calculating components; everything is done in a single call, so the 30-minute limit applies to the whole PCA calculation.
This means that for very large data (e.g. anything larger than `ncomp=20` on a `1e6 x 1e3` matrix in a `MEMORY_64X_x86` `XXLARGE` warehouse), you may run into issues with default performance settings.

\* There exist methods for incrementalizing PCA that could be used to partition the data, but these are not currently used by dbt_pca.
For the Snowflake implementation, column types can be cast explicitly, and this casting will be used for the UDF definition.
This can be useful when referencing a CTE, from which column types cannot be inferred. (Column types are inferred for `ref()`'s and `source()`'s only.)
For example:
{{
config(
materialized='table',
)
}}
with my_cte as (
select id, column_a, column_b, column_c, column_d, column_e
from {{ ref('upstream_table') }}
)
select * from {{
dbt_pca.pca(
table='my_cte',
index='id::integer',
columns=[
'column_a::number(38, 8)',
'column_b::number(38, 8)',
'column_c::number(38, 8)',
'column_d::number(38, 8)',
'column_e::number(38, 8)',
],
ncomp=2
)
}}
If you reference a CTE and do not cast types yourself, then types will be cast to the most permissive types, i.e. `float`s for all columns used in the matrix, and `varchar`s for all other columns.
For example, in the above example, without explicit typecasting, the `index='id'` will be cast to a varchar:
{{
config(
materialized='table',
)
}}
with my_cte as (
select id, column_a, column_b, column_c, column_d, column_e
from {{ ref('upstream_table') }}
)
-- Because table=... is a CTE and there are no explicit typecasts,
-- id will be cast to a varchar, and column_*'s will be cast to floats
select * from {{
dbt_pca.pca(
table='my_cte',
index='id',
columns=[
'column_a',
'column_b',
'column_c',
'column_d',
'column_e',
],
ncomp=2
)
}}
Please note that explicit type-casting of `columns` / `index` / `values` only works in Snowflake.
Some things that could happen in the future:
- Better support for weights. Weights are only supported right now for wide-formatted inputs.

Note that this wish list is unlikely to happen unless I personally need these features or someone else contributes them. I will continue to maintain this project going forward, but this project was a lot of work to get where it is today, and adding new features is not a high priority in my life.
PCA via the NIPALS method can be implemented in SQL with nested recursive CTEs. This is how PCA is implemented in DuckDB and Clickhouse.
The `pca()` macro generates SQL containing a fairly straightforward implementation of the steps outlined here.
Unlike my other library dbt_linreg, no Jinja2 trickery (so to speak) is required to build the SQL for the core implementation.
The Snowflake implementation just wraps Statsmodels directly as a UDTF, although it is formatted to have the exact same API as DuckDB and Clickhouse.
All approaches were validated against Statsmodels `sm.PCA()`.
One more note about the implementations for Clickhouse and Snowflake: these implementations use some cool pre-hook injection tricks to work.
Basically, when you call `{{ dbt_pca.pca() }}`, under the hood a `config()` call occurs which injects a pre-hook that generates SQL to create temp tables (Clickhouse) or a Python UDF (Snowflake).
Due to weird limitations of how dbt works, these calculations require reading from a comment injected into the compiled SQL: basically, all the args to `pca()` are serialized as JSON and then deserialized during pre-hook evaluation.
Amusingly and perhaps counterintuitively, I found that making this under-the-hood process smooth and user-friendly was a lot harder and more time-consuming than the SQL-based PCA implementation itself.
dbt_pca's implementation of PCA is optimized for long-formatted data in DuckDB and Clickhouse, and optimized for wide-formatted data in Snowflake. That said, I generally recommend not bothering to preprocess your data into long format if it's naturally wide, unless you have a good reason. dbt_pca's wide-to-long conversion is perfectly optimal as-is, so just do whatever is easiest.
Run tests with `./run test {target}`.
The default target is `duckdb`, which runs locally.
The `clickhouse` target requires Docker.
dbt is a trademark of dbt Labs.
This package is unaffiliated with dbt Labs.