a kedro hook to protect against breaking changes to data
steel-toes is a plugin for the Python data pipelining framework kedro.
It modifies each dataset's filepath while you develop a new pipeline feature,
so that your changes do not wreck pipelines running on other branches.
Motivation
kedro is a ✨ fantastic project that allows for super-fast prototyping of
data pipelines, while yielding production-ready pipelines. kedro promotes
collaborative projects by giving each team member access to the exact same
data. Team members will often make their own branch of the project and begin
work. Sometimes these changes will break existing functionality. Sometimes we
make mistakes as we develop, and fix them before merging. Either case can be
detrimental to a teammate working downstream of your changes if you are not
careful.
🥼 Wear the proper PPE during feature development
steel-toes hooks into your catalog so that your work does not change the
data your teammates depend on downstream while you develop in parallel.
after_catalog_created and before_pipeline_run
When your project creates a catalog, steel-toes checks whether branched
data exists. If it does, it swaps the filepath to the branched path, so you
can load the latest data from the perspective of any branch simultaneously.
after_node_run
After your node has run, and before its outputs are saved, steel-toes checks
whether each filepath was already swapped. If not, it swaps it to the branched
filepath before saving.
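The two hook behaviours described above reduce to one small decision rule. The following is a minimal sketch of that rule, not the steel-toes source: `pick_path` is a hypothetical name, and the branched path is passed in rather than computed.

```python
from pathlib import Path


def pick_path(original: Path, branched: Path, saving: bool) -> Path:
    """Return the path a dataset should use (hypothetical helper).

    When saving, always use the branched path so the original is untouched.
    When loading, use the branched path only if that file already exists.
    """
    if saving or branched.exists():
        return branched
    return original
```

With this rule, a branch that has not yet written a dataset keeps reading the original file, and starts reading its own copy as soon as one exists.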
Installation
steel-toes is published on PyPI and can be installed with pip.
pip install steel-toes
For a real kedro project, you should also add it to your requirements.
Setup
To add SteelToes to your kedro>0.18.0 project, add an instance of the
SteelToes hook to your tuple of hooks in src/<project_name>/settings.py.
settings.py
from steel_toes import SteelToes

HOOKS = (SteelToes(),)
settings.py location
settings.py is typically located at src/<python_package>/settings.py.
ignore_types
Some datasets have a _filepath attribute that is not used for saving data
and does not need to be "branched". Pass these types to ignore_types so that
steel_toes skips them, for example SQLQueryDataSet.
settings.py
from kedro.extras.datasets.pandas.sql_dataset import SQLQueryDataSet, SQLTableDataSet
from steel_toes import SteelToes

HOOKS = (SteelToes(ignore_types=[SQLQueryDataSet, SQLTableDataSet]),)
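To illustrate what a type-based ignore list implies, here is a minimal sketch of such a filter. It is an assumption about the general technique, not the steel-toes implementation: `should_branch`, `FileDataSet`, and `QueryDataSet` are hypothetical names used only for this example.

```python
from typing import Iterable


class FileDataSet:
    """Hypothetical stand-in for a dataset that saves to its _filepath."""

    def __init__(self, filepath: str) -> None:
        self._filepath = filepath


class QueryDataSet:
    """Hypothetical stand-in for a dataset whose _filepath is not a save target."""

    def __init__(self, filepath: str) -> None:
        self._filepath = filepath


def should_branch(dataset: object, ignore_types: Iterable[type] = ()) -> bool:
    """Decide whether a dataset's filepath is a candidate for swapping."""
    # Skip any dataset whose type was explicitly ignored.
    if isinstance(dataset, tuple(ignore_types)):
        return False
    # Only datasets that carry a _filepath attribute can be swapped at all.
    return hasattr(dataset, "_filepath")
```

Because the check uses `isinstance`, subclasses of an ignored type are skipped as well.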
Automatic branch naming
steel_toes will automatically get the branch name from your git branch. All
you need to do is create a new branch, and steel-toes will make sure that all
the data you write goes to a specific place for that branch. It will not
change a filepath until the branched dataset exists or just before it is
written. This way your catalog will still load existing datasets from the
filepath specified in the catalog.
git checkout -b my-new-feature origin/main
Override with environment variable
In certain situations, such as using kedro docker in production, there is no
git branch to pull from. Setting the STEEL_TOES_BRANCH environment variable
before steel-toes initializes will set the branch.
set environment variable in the shell
run.sh
# export so that the kedro process started below inherits the variable
export STEEL_TOES_BRANCH='PROD'
# run kedro here
set environment variable with python
run.py
import os

os.environ["STEEL_TOES_BRANCH"] = "PROD"
# run kedro here
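The precedence described in this section (environment variable first, git branch otherwise) can be sketched as follows. This is an illustration under stated assumptions, not the steel-toes source: `resolve_branch` is a hypothetical helper, and asking git via `git rev-parse --abbrev-ref HEAD` is one common way to read the current branch.

```python
import os
import subprocess
from typing import Optional


def resolve_branch(env_var: str = "STEEL_TOES_BRANCH") -> Optional[str]:
    """Return the branch name to use, or None if it cannot be determined."""
    # An explicit override wins, e.g. in a container with no git checkout.
    override = os.environ.get(env_var)
    if override:
        return override
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or the working directory is not a repository.
        return None
    return result.stdout.strip() or None
```

Note that in a detached-HEAD checkout this git command prints `HEAD` rather than a branch name, which is another case where an explicit override is useful.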
Example filenames
Here is an example of what filepaths look like when I add parquet catalog
entries to the spaceflights project. steel_toes adds the branch name
automatically, just before the file extension.
X_test: data/X_test_main.pq
X_train: data/X_train_main.pq
preprocessed_companies: data/02_intermediate/preprocessed_companies_main.pq
preprocessed_shuttles: data/02_intermediate/preprocessed_shuttles_main.pq
model_input_table: data/03_primary/model_input_table_main.pq
regressor: data/06_models/regressor_main.pickle
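The naming pattern in the listing above (branch name inserted just before the file extension) can be reproduced with a few lines of `pathlib`. This is a minimal sketch of the pattern only; `add_branch_suffix` is a hypothetical helper, not a function exported by steel-toes.

```python
from pathlib import PurePosixPath


def add_branch_suffix(filepath: str, branch: str) -> str:
    """Insert `_<branch>` between the file stem and its final extension."""
    path = PurePosixPath(filepath)
    # stem is the filename without its last extension; suffix is that extension.
    return str(path.with_name(f"{path.stem}_{branch}{path.suffix}"))


print(add_branch_suffix("data/06_models/regressor.pickle", "main"))
```

Only the final extension is treated as the suffix here, so a name with two extensions such as `table.csv.gz` would get the branch name inserted before `.gz`.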
Logs on first run
The first time you run your pipeline with steel-toes, the _filepath swap
happens in after_node_run, since the branched file does not exist yet.
Note
At this point catalog.load('preprocessed_shuttles') will not load the branched dataset.
❯ kedro run
INFO Kedro project spaceflights session.py:340
...
INFO STEEL_TOES:after_node_run 'preprocessed_shuttles.pq' -> 'preprocessed_shuttles_main.pq' steel_toes.py:102
...
INFO Completed 6 out of 6 tasks sequential_runner.py:85
INFO Pipeline execution completed successfully. runner.py:90
Logs after dataset exists
Subsequent runs of kedro will swap the dataset to the branched filepath immediately after the catalog has been created.
Note
Now catalog.load('preprocessed_shuttles') will load the branched dataset.
INFO Kedro project spaceflights session.py:340
...
INFO STEEL_TOES:after_catalog_created 'preprocessed_shuttles.pq' -> 'preprocessed_shuttles_main.pq' steel_toes.py:102
...
INFO Completed 6 out of 6 tasks sequential_runner.py:85
INFO Pipeline execution completed successfully. runner.py:90
CLI Usage
The CLI provides a handy interface to clean up your branched datasets.
steel-toes --help
Usage: steel-toes [OPTIONS] COMMAND [ARGS]...

  help

Options:
  -V, --version  Prints version and exits
  --help         Show this message and exit.

Commands:
  clean-branch  finds branch datasets and removes them
steel-toes
also registers itself as a kedro
global cli plugin. You can run
kedro clean-branch
to clean your branched data.
steel-toes clean-branch --help
Usage: kedro clean-branch [OPTIONS]

  finds branch datasets and removes them

Options:
  --dryrun           Displays the files that would be deleted using the
                     specified command without actually deleting them.
  -b, --branch TEXT  git branch to clean files from
  -h, --help         Show this message and exit.
Cleaning up old branches
To clean up your current branch, run kedro clean-branch. It removes all
the datasets that have been swapped to the current branch. With --dryrun,
steel-toes only logs what it intends to do and does not delete anything.
kedro clean-branch --dryrun
INFO STEEL_TOES:after_catalog_created 'preprocessed_shuttles.pq' -> 'preprocessed_shuttles_main.pq' steel_toes.py:102
...
INFO STEEL_TOES:dryrun-remove | '/home/waylon/git/spaceflights/data/02_intermediate/preprocessed_shuttles_main.pq' steel_toes.py:141
Dropping the --dryrun
flag will delete all the branched datasets.
kedro clean-branch
INFO STEEL_TOES:after_catalog_created 'preprocessed_shuttles.pq' -> 'preprocessed_shuttles_main.pq' steel_toes.py:102
...
INFO STEEL_TOES:deleting | '/home/waylon/git/spaceflights/data/02_intermediate/preprocessed_shuttles_main.pq' steel_toes.py:141
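The dry-run and delete behaviour shown in these logs can be approximated with a short standalone function. This is a sketch under assumptions, not how steel-toes works internally: the real command operates on the catalog's swapped datasets, whereas the hypothetical `clean_branch` below simply scans a directory for files whose stem ends in `_<branch>`.

```python
import logging
from pathlib import Path
from typing import List

logger = logging.getLogger("steel_toes_sketch")  # hypothetical logger name


def clean_branch(data_dir: Path, branch: str, dryrun: bool = False) -> List[Path]:
    """Find files named `*_<branch>.<ext>` under data_dir and remove them.

    With dryrun=True the matches are only logged, never deleted.
    """
    matches = sorted(p for p in data_dir.rglob(f"*_{branch}.*") if p.is_file())
    for path in matches:
        if dryrun:
            logger.info("dryrun-remove | %s", path)
        else:
            logger.info("deleting | %s", path)
            path.unlink()
    return matches
```

Returning the matched paths in both modes lets a caller review a dry run before deleting. The glob-based match is a simplification and would need escaping for branch names that contain glob characters or slashes.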
Contributing
You're awesome for considering a contribution! Contributions are welcome. Please check out the Contributing Guide for more information, be a positive member of the community, and embrace feedback.
Versioning
We use SemVer for versioning. For the versions available, see the tags or releases.
Author
Waylon Walker - Original Author
License
This project is licensed under the MIT License. See the LICENSE file for details.