v1.1.0

@matthewwardrop matthewwardrop released this 16 Dec 03:41

This is a major feature release, motivated in many respects by the migration of statsmodels from patsy to formulaic. Many thanks to @bashtage for driving those invasive changes forward. There are some semantic breaking changes, but unless you are deep in the internals of formulaic (which I do not believe to be the case for any external library), these are not expected to break common usage.

Breaking changes:

  • Formula is no longer always "structured" with special cases to handle the
    case where it has no structure. Legacy shims have been added to support old
    patterns, with DeprecationWarnings raised when they are used. This is not
    expected to break anyone who is not explicitly checking whether Formula.root
    is a list instance (which formerly should have simply been assumed); it is
    now a SimpleFormula instance that acts like an ordered sequence of Term
    instances.
  • The column names associated with categorical factors have changed.
    Previously, a prefix was unconditionally added to the level in the column
    name, like feature[T.A], whether or not the encoding resulted in that term
    acting as a contrast. Now, in keeping with patsy, we only add the prefix if
    the categorical factor is encoded with reduced rank. Otherwise, feature[A]
    will be used instead.
  • formulaic.parsers.types.structured has been promoted to
    formulaic.utils.structured.
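
The column naming rule in the second item above can be summarised with a small sketch. This is not formulaic's implementation: `categorical_column_name` and its `reduced_rank` flag are hypothetical names, used only to illustrate the rule as described in these notes.

```python
def categorical_column_name(factor: str, level: str, reduced_rank: bool) -> str:
    """Illustrative only: mimic the column naming rule described above.

    `reduced_rank` is a hypothetical flag standing in for "this factor was
    encoded with reduced rank, so its columns act as contrasts".
    """
    # The "T." prefix marks a contrast, so it is only added under reduced rank.
    prefix = "T." if reduced_rank else ""
    return f"{factor}[{prefix}{level}]"


# Reduced rank: the prefix is kept.
print(categorical_column_name("feature", "A", reduced_rank=True))   # feature[T.A]
# Full rank: the level name appears without the prefix.
print(categorical_column_name("feature", "A", reduced_rank=False))  # feature[A]
```

Code that looks up columns by name (for example feature[T.A]) may need updating wherever a categorical factor is encoded with full rank.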

New features and enhancements:

  • Formula now instantiates to SimpleFormula or StructuredFormula, the
    latter being a tree-structure of SimpleFormula instances (as compared to
    List[Term] previously). This simplifies various internal logic and makes the
    propagation of formula metadata more explicit. (#222)
  • Added support for restricting the set of features used by the default formula
    parser so that libraries can more easily restrict the structure of output
    formulae. (#207)
  • dict and recarray types are now associated with the pandas materializer
    by default (rather than raising), simplifying some user workflows. (#225)
  • Added support for the . operator (which is replaced with all variables not
    used on the left-hand-side of formulae). (#216)
  • Added experimental support for nested formulae of form [ ... ~ ... ].
    This is useful for (e.g.) generating formulae for IV 2SLS. (#108)
  • Added support for subsetting ModelSpec[s] based on an arbitrary
    strictly reduced FormulaSpec. (#208)
  • Added Formula.required_variables to more easily surface the expected data
    requirements of the formula. (#205)
  • Added support for extracting rows dropped during materialization. (#197)
  • Added support for cyclic (cc) and natural (cr) cubic splines. See
    formulaic.materializers.transforms.cubic_spline.cubic_spline for
    more details.
  • Added a lag() transform.
  • Constructing LinearConstraints can now be done from a list of strings (for
    increased parity with patsy). (#201)
  • Categorical factors are now preceded with (e.g.) T. only when they actually
    describe contrasts (i.e. when they are encoded with reduced rank). (#220)
  • Contrasts metadata is now added to the encoder state via encode_categorical,
    and is surfaced via ModelSpec.factor_contrasts. (#204)
  • Operator instances now receive context, which is optionally specified by
    the user during formula parsing and updated by the parser. This is what makes
    the . implementation possible. (#216)
  • Given the generic usefulness of Structured, it has been promoted to
    formulaic.utils. (#223)
  • Added explicit support and testing for Python 3.13. (#202)
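
To make the semantics of the . operator above concrete, here is a minimal sketch of the expansion as these notes describe it: . stands for every available variable that is not used on the left-hand side. It is not how formulaic's parser implements the operator; `expand_dot` and its `available` argument are hypothetical, and the regular expressions are deliberate simplifications.

```python
import re


def expand_dot(formula: str, available: list[str]) -> str:
    """Illustrative only: expand a standalone "." on the right-hand side.

    `available` is a hypothetical list of the variable names in the data.
    """
    lhs, rhs = formula.split("~", 1)
    # Treat every identifier on the left-hand side as "used".
    used_on_lhs = set(re.findall(r"[A-Za-z_]\w*", lhs))
    replacement = " + ".join(v for v in available if v not in used_on_lhs)
    # Only replace a "." that is not part of a name or number (a.b, 0.5).
    expanded_rhs = re.sub(r"(?<![\w.])\.(?![\w.])", replacement, rhs)
    return f"{lhs.strip()} ~ {expanded_rhs.strip()}"


print(expand_dot("y ~ .", ["y", "x1", "x2"]))  # y ~ x1 + x2
```

The set of available variables has to come from somewhere at parse time, which is why the operator context described above is a prerequisite for this feature.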

Bugfixes and cleanups:

  • Fixed nested ordering of Formula instances. (#200)
  • Allow Python tokens to contain multiple chained parentheses and brackets
    without using quotes, as long as the parentheses are balanced. (#214, #218)
  • Reduced the number of redundant initialisation operations in Structured
    instances. (#200)
  • Fixed pickling ModelMatrix and FactorValues instances (whenever wrapped
    objects are picklable). (#209; thanks @bashtage)
  • basis_spline: Fixed evaluation involving datasets with null values, and
    disallow out-of-bounds knots. (#217; thanks @bashtage)
  • Improved robustness of data contexts involving PyArrow datasets.
  • We now use the same sentinels throughout the code-base, rather than having
    module-specific sentinels in some places.
  • Migrated to ruff for linting, and updated mypy and pre-commit tooling.
  • Fixes from ruff are automatically applied when using
    hatch run lint:format.
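
As a rough illustration of what "balanced" means for the chained parentheses and brackets mentioned above, the following sketch checks a token with a stack. It is not taken from formulaic's tokenizer; `is_balanced` is a hypothetical helper written for this example.

```python
def is_balanced(token: str) -> bool:
    """Illustrative only: check that () and [] are properly nested and closed."""
    closing_to_opening = {")": "(", "]": "["}
    stack: list[str] = []
    for ch in token:
        if ch in "([":
            stack.append(ch)
        elif ch in closing_to_opening:
            # A closer must match the most recently opened delimiter.
            if not stack or stack.pop() != closing_to_opening[ch]:
                return False
    # Anything left on the stack was opened but never closed.
    return not stack


print(is_balanced("np.log(x)[0](y)"))  # True
print(is_balanced("f(x[0)"))           # False
```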

Documentation:

  • Fixed and updated docsite build, as well as other minor tweaks.