Changelog

v4.3.0 (2024-06-08)

✨ Features

model: Add check for fitted model in LGBMModel fingerprint. (f6a0933)

🐛 Bug Fixes

tuning: Optional enqueue_trials parameter added to fingerprint of OptunaTuner. (80fa374)
transformer: Update LabelEncoder to use PyArrow implementation of unique to prevent vaex bug from crashing the transformer. (85059d7)

v4.2.0 (2024-05-21)

✨ Features

transformer: Update ExpressionTransformer to use TypedDict instead of tuples. (3950abd)

v4.1.0 (2024-05-18)

✨ Features

tuning: Add support for enqueuing trials in OptunaTuner. (9e0b6b2)
data splitting: Add support for stratification on multiple features in the RandomSplitter. (d745434)
transformer: Add metadata option for the ExpressionTransformer that allows for creation of meta features not tracked in the DataSchema. (f16ea8b)
transformer: Add ExpressionTransformer for creating features using the vaex expression system. (c0faf74)

v4.0.0 (2024-05-09)

⛔️ BREAKING CHANGES

exporter: Add S3Exporter that implements cached S3 exporting of files from the local disk. (d17b2d2)
exporter: Add BaseExporter and LocalExporter implementations that support exporting data to disk, along with corresponding Pipeline steps. (6ce13cf)

✨ Features

exporter: Add LocalManifest support for LocalExporter which simplifies caching logic and enables S3 manifest translations. (2199ff0)
exporter: Add support for multiple data export using LocalExporter. (ff988b6)
data source: Add support for reading manifest files from S3 buckets in S3Ingester. (9c68a9b)
pipeline: Add disable_cache parameter to Pipeline execution. (da1e31a)

🐛 Bug Fixes

data cleaning: Fix newline characters breaking CSV reading using Arrow. (3a7e594)
tuning: Delete logging of storage URI to minimize risk of accidentally logging credentials. (054692d)

🛠️ Code Refactoring

data source: Extract shared S3 logic to utils which can be then used by S3Exporter. (97a7974)

v3.2.0 (2024-04-18)

✨ Features

tuning: Add support for RDSStorage using the OptunaTuner (cc06ddd)

🐛 Bug Fixes

data source: Fix bug where dataset_id consisting of path components would break local metadata file creation (17c4866)
model: Add verbosity parameter to BaseModel to set log level in the base class. (0a3828f)

v3.1.0 (2024-04-12)

✨ Features

model: Add optional memoization to datasets during model training. (#209) (2ca4465)
model: Add optional memoization to datasets during model training. (6a955dc)

v3.0.0 (2024-04-05)

⛔️ BREAKING CHANGES

model: Update LGBMModel to use dependency injection, now expects a lightgbm.LGBMModel as argument. (7250f34)

🐛 Bug Fixes

Switch vaex file format to Arrow instead of HDF5 for better type support. (ac8e500)
data cleaning: Fix bug where boolean columns are stored as numerical in the data schema due to int8 conversion. (da358d8)

v2.2.0 (2024-03-22)

✨ Features

filter: Add ImblearnResamplingFilter which is a wrapper for imblearn over- and under-samplers. (77a3d7d)
filter: Add ExpressionFilter and base class for simple DataFrame filtering using vaex expressions. (dc679ff)
cache: Add disable_cache argument to all cached functions to completely bypass all caching functionality. (fbdfc5d)

📝 Documentation

Update CHANGELOG.md format to include missing categories. (d97b32c)

v2.1.0 (2024-02-24)

✨ Features

Update Titanic dataset to mleko 2.0 API. (62bf991)
tuning: Add optuna-dashboard support to OptunaTuner including automatically generated experiment notes. (29d81c2)
transformer: Improve flexibility of LabelEncoderTransformer by adding optional null encoding and manual dictionary mapping. (f7b30a9)
Set cache_directory as optional argument, with custom default locations. (08e8777)

🐛 Bug Fixes

data cleaning: Fix meta_columns not being forcefully cast to correct data type in CSVToVaexConverter. (b42b9ed)

📝 Documentation

Update year in Copyright in README.md (#192) (eeb56e1)

🧪 Tests

Fix test cases generating cache directory outside temporary directory. (ba57fbf)

v2.0.0 (2024-02-07)

⛔️ BREAKING CHANGES

pipeline: Refactor PipelineStep to use TypedDict for both inputs and outputs. (2eb623c)

✨ Features

model: Refactor validation_dataframe parameter in BaseModel and LGBMModel to be optional. (d18ed29)
cache: Add cache support for None returns on fields using cache handlers not equipped to process None. (a489996)
model: Add support for custom evaluation function in LGBMModel. (4e70a55)

🐛 Bug Fixes

data cleaning: Rename empty column name to _empty to prevent vaex crashes. (da72b75)
data cleaning: Cast boolean columns to int8 during cleaning to reduce label encoding needs. (d94f7c9)
Added reserved keyword column name replacement to prevent evaluation errors from vaex. (3969ffd)

🛠️ Code Refactoring

Improve error logging messages, and update codebase to new black format. (a29ad45)
cache: Break out cache handler retrieval method. (aba9e41)

📝 Documentation

Refactor mleko package documentation to format bullet list correctly. (76ee895)

🤖 Continuous Integration

Remove TypeGuard and PyUpgrade from build and pre-commit. (d374406)
Add custom template for release notes to follow changelog structure. (30518c0)

v1.2.6 (2024-01-25)

🐛 Bug Fixes

Bump patch release. (ff5f94e)

v1.2.5 (2024-01-25)

🐛 Bug Fixes

Fix CHANGELOG.md template location (141c9b7)

v1.2.4 (2024-01-25)

🐛 Bug Fixes

Trigger patch release. (7269dca)

🏗️ Build

semantic versioning: Update CHANGELOG.md template and semantic versioning logic. (1727e09)

v1.2.3 (2024-01-25)

🐛 Bug Fixes

Remove coverage from workflow (09eb09d)

v1.2.2 (2024-01-25)

🐛 Bug Fixes

Switch to trusted publishing (e84712d)

v1.2.1 (2024-01-25)

🐛 Bug Fixes

Experiment with semantic versioning (0942196)

🏗️ Build

🚧 Upgrade python-gitlab to 4.4.0 (15fff07)
🚧 Fix failing builds (79f7d95)

v1.2.0 (2023-10-09)

✨ Features

data source: ✨ Add support for pattern matching in *Ingester and add LocalManifest to index fetched files. (75974a4)

🐛 Bug Fixes

logging: 🐛 Fix LGBM logging routing to correct log level. (0e5fa77)

🎨 Style

remove unnecessary blank lines (a06edf2)
✏️ Improve logging of CSVToVaexConverter and fix typo in write_vaex_dataframe. (197e56a)

🏗️ Build

🔒️ Bump gitpython to resolve CVE-2023-41040 and CVE-2023-40590. (79627bd)

v1.1.0 (2023-09-27)

✨ Features

tuning: ✨ Add hyperparameter tuning functionality, initially including OptunaTuner. (be38c07)

🧪 Tests

tuning: 🧪 Add test cases for TuneStep. (d811c7d)

v1.0.0 (2023-09-20)

⛔️ BREAKING CHANGES

📝 Improve README.md with more up-to-date information. (b388b59)

✨ Features

transformer: ✨ Add DataSchema API to transformers fit, transform and fit_transform. (e053c85)

📝 Documentation

📝 Add example notebook for Titanic dataset. (e651af9)

v0.8.1 (2023-09-07)

🐛 Bug Fixes

config: 🐛 Fix readthedocs build to only generate html. (13fc207)

v0.8.0 (2023-09-06)

✨ Features

model: ✨ Add LGBMModel along with base class which can be extended for all types of future models. (b47a241)
✨ Add DataSchema which tracks dataset features throughout the pipeline and methods. (e03bd2c)
feature selection: ✨ Update BaseFeatureSelector and children to use the fit, transform and fit_transform pattern. (62e4dd1)
transformer: ✨ Add fit, transform and fit_transform to all Transformers, along with API and caching simplifications. (5cc4ebc)
cache: ✨ Add CacheHandler which allows customization of read/write functions for each cached return value individually. (609e084)

🐛 Bug Fixes

feature selection: 🐛 Add DataSchema as partial return from all fit methods in feature selectors. (ebf2484)

🛠️ Code Refactoring

cache: 🚸 Replace disable_cache with a check if cache_size=0 for LRUCacheMixin. (cfd7592)

v0.7.0 (2023-07-11)

✨ Features

✨ Add fit transform support to all FeatureSelector along with refactoring the LRUCacheMixin. (3df0601)
✨ Add support for separate fitting and transforming inside the pipeline. (bb9b7a4)

🐛 Bug Fixes

data cleaning: 🐛 Switched to HDF5 as file format for faster I/O and better SageMaker support. (61f9e42)

v0.6.1 (2023-06-30)

🐛 Bug Fixes

data cleaning: 🐛 Fix date32/64[day] not converted to datetime. (98f4b26)
data source: 🐛 Fix bug where S3 buckets with no manifest caused crash. (9078845)

🏗️ Build

config: 🔧 Switch mypy for pyright and update configuration. (5631aed)

v0.6.0 (2023-06-26)

✨ Features

cache: ✨ Add cache_group that can segment an instance cache into different isolated parts. (#66) (5fa8c9c)
cache: ✨ Add cache_group that can segment an instance cache into different isolated parts. (b5c3de5)

v0.5.0 (2023-06-17)

✨ Features

transformer: ✨ Add MinMaxScalerTransformer for normalizing numerical features. (9b26c00)
transformer: ✨ Add MaxAbsScalerTransformer that scales numerical features. (1fd2a93)
transformer: ✨ Add CompositeTransformer for chaining together multiple transformers sequentially. (006d741)
transformer: ✨ Add LabelEncoderTransformer for ordinal encoding. (41a4c45)
transformer: ✨ Add FrequencyEncoderTransformer along with support for pipeline. (465e6db)

🛠️ Code Refactoring

💫 Switch to tqdm.auto to prevent breaking in Jupyter notebooks. (dc139cf)

🧪 Tests

✅ Now _get_local_filenames returns a sorted list of filenames to ensure stability. (774e8eb)

v0.4.2 (2023-06-11)

🚀 Performance improvements

⚡️ Optimize VarianceFeatureSelector when threshold is 0. (906dde3)

🛠️ Code Refactoring

➖ Remove pandas dependency. (40e264c)

🤖 Continuous Integration

semantic versioning: 👷 Add more sections to changelog based on conventional commit categories. (e5b1594)

v0.4.1 (2023-06-04)

🐛 Bug Fixes

feature selection: 🐛 Fix FeatureSelector cache to use tuple in… (#60) (758cf5e)
feature selection: 🐛 Fix FeatureSelector cache to use tuple instead of frozenset to have stable fingerprint. (cd82417)

v0.4.0 (2023-06-03)

✨ Features

feature selection: ✨ Add feature selector that filters out invariant features. (798c261)
feature selection: ✨ Add PearsonCorrelationFeatureSelector which drops highly correlated features. (66e5cd2)
feature selection: ✨ Add CompositeFeatureSelector, for chaining multiple feature selection steps on the same DataFrame. (3d75079)
feature selection: ✨ Add standard deviation feature selector. (c56177b)
feature selection: ✨ Add missing rate feature selector. (d5ba8b5)

🐛 Bug Fixes

🐛 Fix typeguard breaking changes causing build to fail. (66c6a8e)

🛠️ Code Refactoring

🔥 Unify dataset subpackage naming to verbs and modules to nouns. (3ffb909)
🔥 Rename subpackages in dataset to singular variant. (51a8297)
🔥 Refactor entire project to improve maintainability. (dd1d22c)

v0.3.1 (2023-05-21)

🐛 Bug Fixes

🐛 Added notes to pipeline step docstrings. (d94f899)

🛠️ Code Refactoring

data source: 🐛 Added note to the KaggleDataSource init docstring. (d5f12d3)

🤖 Continuous Integration

🚀 Removed semantic PR workflow and updated test workflow to not run on release commits. (8138745)

v0.3.0 (2023-05-21)

✨ Features

new notes (#54) (21239f7)

🐛 Bug Fixes

data splitting: 🐛 Added notes and examples to splitters docstrings. (d162c86)
pipeline: 🐛 Updated some docstrings. (56b36fd)

🤖 Continuous Integration

🚀 Updated release to only trigger if the commit message does not contain chore(release). (c9f3f3f)

v0.2.0 (2023-05-21)

✨ Features

add data splitting step (#53) (a668b1a)

📝 Documentation

Removed duplicate row. (5d77131)
Adding pre-commit check for conventional commits. (dd2076e)

v0.1.3 (2023-05-13)

🐛 Bug Fixes

cache: 🐛 Cache modules exposed in subpackage init. (fd65e9d)

v0.1.2 (2023-05-13)

🐛 Bug Fixes

cache: 🐛 Fixed LRUCacheMixin eviction test case. (ce5bfc1)
🐛 Temporarily disabled failing tests for cache. (9c17960)

📝 Documentation

📝 Fixed sphinx-autoapi build warnings. (040963a)

v0.1.0 (2023-05-12)

✨ Features

data source: ✨ Add KaggleDataSource to download the dataset from Kaggle by providing a destination directory, owner slug, dataset slug, and necessary API credentials. (3fa07b6)

🐛 Bug Fixes

cache: 🐛 Fixed test by not testing it... (e3a0ce9)
cache: 🐛 Try logging using assert to fix GH issue (5e247ec)
cache: 🐛 Attempting to fix test case failing in GH actions. (4892591)
cache: 🐛 LRUCacheMixin now relies on file modification time instead of access time due to system limitations. (127d657)
🐛 Fixed docstrings for private methods in KaggleDataSource and removed xdoctest from build steps (bb55cf5)