feat: add data splitting step (#53)
* feat(pipeline): ✨ Added named inputs and outputs to PipelineStep.

The PipelineStep now requires input and output names, which it uses to read from and write to a single DataContainer shared across the entire Pipeline. The DataContainer's data attribute is now a dictionary, so each step can select exactly which data is sent to the wrapped class that executes the logic. This makes it possible, for instance, to split data into multiple DataFrames and select a specific one in a later step.

Classes inheriting from PipelineStep must now define _num_inputs and _num_outputs. The IngestStep and ConverterStep must be initialized with the new inputs and outputs parameters.
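
The named inputs and outputs described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the library's actual implementation: the `execute` method, the validation logic, and the `UppercaseStep` subclass are hypothetical. Only the `inputs` and `outputs` parameters, the `_num_inputs` and `_num_outputs` attributes, and the dictionary-valued `data` attribute come from the commit message.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class DataContainer:
    """Hypothetical container: `data` maps a name to an arbitrary object."""

    data: dict[str, Any] = field(default_factory=dict)


class PipelineStep:
    """Hypothetical base class that checks the number of named inputs/outputs."""

    _num_inputs: int = 0
    _num_outputs: int = 0

    def __init__(self, inputs: list[str], outputs: list[str]) -> None:
        if len(inputs) != self._num_inputs:
            raise ValueError(f"expected {self._num_inputs} inputs, got {len(inputs)}")
        if len(outputs) != self._num_outputs:
            raise ValueError(f"expected {self._num_outputs} outputs, got {len(outputs)}")
        self.inputs = inputs
        self.outputs = outputs

    def execute(self, container: DataContainer) -> DataContainer:
        raise NotImplementedError


class UppercaseStep(PipelineStep):
    """Illustrative step: reads one named entry and writes one named entry."""

    _num_inputs = 1
    _num_outputs = 1

    def execute(self, container: DataContainer) -> DataContainer:
        source = container.data[self.inputs[0]]
        container.data[self.outputs[0]] = [value.upper() for value in source]
        return container


container = DataContainer(data={"raw": ["a", "b"]})
container = UppercaseStep(inputs=["raw"], outputs=["clean"]).execute(container)
print(container.data)
```

Because every step reads and writes by name, a later step can pick up `"clean"` while `"raw"` stays available in the same container.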

* feat(cache): ✨ Added fingerprinter for Vaex DataFrames.

* feat(cache): ✨ Cache now supports both single and multiple outputs from cached methods.

If a method returns multiple objects in a list or tuple, each object is cached as its own file, named in the format hash_index.suffix. When multiple files for a single hash are found, they are loaded in the correct order.
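
The hash_index.suffix naming scheme can be sketched as follows. This is an illustration under assumptions rather than the library's cache code: `save_outputs`, `load_outputs`, the use of `pickle`, and the `pkl` suffix are all hypothetical. The point it shows is that indices must be sorted numerically so that index 10 is loaded after index 2.

```python
from __future__ import annotations

import pickle
import tempfile
from pathlib import Path
from typing import Any


def save_outputs(cache_dir: Path, key: str, result: Any, suffix: str = "pkl") -> None:
    """Write one file per output; lists and tuples become <key>_<index>.<suffix>."""
    if isinstance(result, (list, tuple)):
        for index, item in enumerate(result):
            (cache_dir / f"{key}_{index}.{suffix}").write_bytes(pickle.dumps(item))
    else:
        (cache_dir / f"{key}.{suffix}").write_bytes(pickle.dumps(result))


def load_outputs(cache_dir: Path, key: str, suffix: str = "pkl") -> Any:
    """Return the cached value(s) for `key`, or None when nothing is cached."""
    single = cache_dir / f"{key}.{suffix}"
    if single.exists():
        return pickle.loads(single.read_bytes())
    # Sort by the numeric index, not lexicographically ("_10" must follow "_2").
    parts = sorted(
        cache_dir.glob(f"{key}_*.{suffix}"),
        key=lambda path: int(path.stem.rsplit("_", 1)[1]),
    )
    if not parts:
        return None
    return tuple(pickle.loads(path.read_bytes()) for path in parts)


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    save_outputs(cache_dir, "abc123", list(range(12)))  # twelve separate files
    restored = load_outputs(cache_dir, "abc123")
```

In this sketch a multi-output result always comes back as a tuple, regardless of whether a list or a tuple was stored.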

* feat(data splitting): ✨ Added data splitting classes and pipeline step which splits a dataframe in two.

The RandomDataSplitter supports random (shuffled, stratified) splitting according to ratios. The ExpressionDataSplitter splits data based on boolean string expressions.

The caching mechanism is updated to accommodate multiple file outputs.
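
A ratio-based split with optional shuffling and stratification can be sketched with the standard library alone. This is not the RandomDataSplitter implementation: the `random_split` function and its parameters are hypothetical, and the real class may well delegate to scikit-learn, which this commit adds as a dependency (that is an assumption, not something the commit message states).

```python
from __future__ import annotations

import random
from collections import defaultdict
from typing import Hashable, Sequence


def random_split(
    labels: Sequence[Hashable],
    ratio: float,
    shuffle: bool = True,
    stratify: bool = False,
    seed: int | None = None,
) -> tuple[list[int], list[int]]:
    """Return two lists of row indices; the first gets about `ratio` of the rows."""
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be strictly between 0 and 1")
    rng = random.Random(seed)

    # Without stratification all rows form one group; with it, one group per label.
    groups: dict[Hashable, list[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label if stratify else None].append(index)

    first: list[int] = []
    second: list[int] = []
    for indices in groups.values():
        if shuffle:
            rng.shuffle(indices)
        cut = round(len(indices) * ratio)  # split point within this group
        first.extend(indices[:cut])
        second.extend(indices[cut:])
    return first, second


labels = ["a"] * 8 + ["b"] * 2
train, test = random_split(labels, ratio=0.5, stratify=True, seed=0)
```

With stratification enabled, each label is split separately, so the rare label `"b"` ends up in both halves instead of landing in one half by chance.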

* fix: 🐛 Updated ignore list for library stubs to also include scikit-learn.

* fix: 🐛 Forced datetime columns option removed from convert step.

---------

Co-authored-by: Erik Båvenstrand <[email protected]>
ErikBavenstrand and Erik Båvenstrand authored May 21, 2023
1 parent 57150ad commit a668b1a
Showing 35 changed files with 1,290 additions and 205 deletions.
20 changes: 10 additions & 10 deletions .pre-commit-config.yaml
@@ -4,61 +4,61 @@ repos:
 hooks:
 - id: black
 name: black
-entry: black
+entry: poetry run black
 language: system
 types: [python]
 require_serial: true
 - id: check-added-large-files
 name: Check for added large files
-entry: check-added-large-files
+entry: poetry run check-added-large-files
 language: system
 - id: check-toml
 name: Check Toml
-entry: check-toml
+entry: poetry run check-toml
 language: system
 types: [toml]
 - id: check-yaml
 name: Check Yaml
-entry: check-yaml
+entry: poetry run check-yaml
 language: system
 types: [yaml]
 - id: darglint
 name: darglint
-entry: darglint
+entry: poetry run darglint
 language: system
 types: [python]
 stages: [manual]
 - id: end-of-file-fixer
 name: Fix End of Files
-entry: end-of-file-fixer
+entry: poetry run end-of-file-fixer
 language: system
 types: [text]
 stages: [commit, push, manual]
 exclude: ^CHANGELOG\.md$
 - id: flake8
 name: flake8
-entry: flake8
+entry: poetry run flake8
 language: system
 types: [python]
 require_serial: true
 args: [--darglint-ignore-regex, .*]
 - id: isort
 name: isort
-entry: isort
+entry: poetry run isort
 require_serial: true
 language: system
 types_or: [cython, pyi, python]
 args: ["--filter-files"]
 - id: pyupgrade
 name: pyupgrade
 description: Automatically upgrade syntax for newer versions.
-entry: pyupgrade
+entry: poetry run pyupgrade
 language: system
 types: [python]
 args: [--py37-plus]
 - id: trailing-whitespace
 name: Trim Trailing Whitespace
-entry: trailing-whitespace-fixer
+entry: poetry run trailing-whitespace-fixer
 language: system
 types: [text]
 stages: [commit, push, manual]
8 changes: 7 additions & 1 deletion .vscode/settings.json
@@ -1,3 +1,9 @@
 {
-"conventionalCommits.scopes": ["data source", "config", "cache"]
+"conventionalCommits.scopes": [
+"data source",
+"config",
+"cache",
+"pipeline",
+"data splitting"
+]
 }
173 changes: 130 additions & 43 deletions examples/Experiment.ipynb

Large diffs are not rendered by default.

108 changes: 107 additions & 1 deletion poetry.lock

Some generated files are not rendered by default.

4 changes: 3 additions & 1 deletion pyproject.toml
@@ -21,6 +21,7 @@ boto3 = "^1.26.91"
 botocore = "^1.29.91"
 tqdm = "^4.65.0"
 vaex = "^4.16.0"
+scikit-learn = "^1.2.2"

 [tool.poetry.group.dev.dependencies]
 Pygments = ">=2.10.0"
@@ -97,7 +98,8 @@ module = [
 "botocore.config",
 "pyarrow",
 "requests",
-"requests.auth"
+"requests.auth",
+"sklearn.model_selection",
 ]
 ignore_missing_imports = true

10 changes: 8 additions & 2 deletions src/mleko/cache/__init__.py
@@ -16,7 +16,13 @@
 resource usage, and make it easier to identify and manage changes in data.
 """
 from .cache import CacheMixin, LRUCacheMixin
-from .fingerprinters import CSVFingerprinter, Fingerprinter
+from .fingerprinters import CSVFingerprinter, Fingerprinter, VaexFingerprinter


-__all__ = ["CacheMixin", "LRUCacheMixin", "CSVFingerprinter", "Fingerprinter"]
+__all__ = [
+"CacheMixin",
+"LRUCacheMixin",
+"CSVFingerprinter",
+"VaexFingerprinter",
+"Fingerprinter",
+]