Current Situation
If you want to use non-standard Python libraries in an Airflow job, you currently need to build a custom image, pip install those libraries into it, and then use that custom image in your cluster.
Preferred Situation
You can configure a requirements.txt, which is then installed into the Airflow deployment.
Example
E.g. if you want to use pandas==2.2.2 in a DAG, you currently need to set up a CI/CD pipeline that builds and deploys a custom Airflow image. The Dockerfile would look like:
ARG AIRFLOW_VERSION
ARG STACKABLE_VERSION

FROM oci.stackable.tech/sdp/airflow:${AIRFLOW_VERSION}-stackable${STACKABLE_VERSION}

# Install custom python libraries
RUN pip install \
    --no-cache-dir \
    --upgrade \
    pandas==2.2.2
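The custom image then still has to be referenced in the cluster definition; if I recall the image selection correctly, that looks roughly like the following (registry and tag are placeholders I made up):

---
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow
spec:
  image:
    # points at the image built from the Dockerfile above (placeholder registry/tag)
    custom: my-registry.example.com/airflow-custom:2.9.3-stackable0.0.0
    productVersion: 2.9.3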
Although this is fairly easy to do, it implies maintenance effort and resources. I consider this a fairly common use case, so we should think about whether we could cover it with something like the following (no strong opinion on the naming, nor on where and how it should live in the CRD):
---
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow
spec:
  image:
    productVersion: 2.9.3
  clusterConfig:
    loadExamples: false
    exposeConfig: false
    credentialsSecret: simple-airflow-credentials
    requirements:
      configMap:
        name: custom-requirements
and a ConfigMap:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-requirements
data:
  requirements.txt: |
    pandas==2.2.2
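Just to illustrate the idea (not a proposal for the actual implementation), the operator could for example wire the ConfigMap into the Airflow Pods via an init container that installs the requirements into a shared volume. All volume names, mount paths and the install mechanism below are assumptions:

# Hypothetical sketch of a Pod spec fragment the operator could generate;
# names, paths and the install mechanism are assumptions.
initContainers:
  - name: install-requirements
    image: oci.stackable.tech/sdp/airflow:2.9.3-stackable0.0.0  # placeholder tag
    command:
      - sh
      - -c
      - pip install --no-cache-dir --target /stackable/python-extras -r /stackable/requirements/requirements.txt
    volumeMounts:
      - name: requirements
        mountPath: /stackable/requirements
      - name: python-extras
        mountPath: /stackable/python-extras
containers:
  - name: airflow
    env:
      # make the extra packages importable in the main container
      - name: PYTHONPATH
        value: /stackable/python-extras
    volumeMounts:
      - name: python-extras
        mountPath: /stackable/python-extras
volumes:
  - name: requirements
    configMap:
      name: custom-requirements
  - name: python-extras
    emptyDir: {}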
I think a solution at the operator level would remove the pain of constructing and maintaining a build pipeline for the cluster. It moves the maintenance effort into the Airflow operator, but the operator already needs attention there anyway (Stackable versions, product versions).
However, I can't evaluate how much effort we would need to put in to achieve this, or what kind of risks it would imply.