
refactor: adds _compliant sub-package #2149

Open Β· wants to merge 63 commits into main

Conversation

@dangotbanned (Member) commented Mar 5, 2025:

Closes #2044

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

Direct follow-up to (#2119), aiming to make space for similar changes for other Compliant* protocols.

This PR spiralled a bit, but it tackles a few closely related issues.

Lazy-only typing

#2044 has been resolved through the use of the new `NativeExpr` protocol:

```python
class NativeExpr(Protocol):
    """An `Expr`-like object from a package with [Lazy-only support](https://narwhals-dev.github.io/narwhals/extending/#levels-of-support).

    Protocol members are chosen *purely* for matching statically - as they
    are common to all currently supported packages.
    """

    def between(self, *args: Any, **kwds: Any) -> Any: ...
    def isin(self, *args: Any, **kwds: Any) -> Any: ...
```

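To illustrate the structural matching this relies on, here is a minimal sketch (the `FakeLazyColumn` and `takes_native_expr` names are invented for the example, they are not part of the PR):

```python
from typing import Any, Protocol


class NativeExpr(Protocol):
    def between(self, *args: Any, **kwds: Any) -> Any: ...
    def isin(self, *args: Any, **kwds: Any) -> Any: ...


# Hypothetical stand-in for a lazy backend's column/expression object.
class FakeLazyColumn:
    def between(self, lower: Any, upper: Any) -> "FakeLazyColumn":
        return self

    def isin(self, values: Any) -> "FakeLazyColumn":
        return self


def takes_native_expr(expr: NativeExpr) -> NativeExpr:
    # Accepted by the type checker purely via structural matching - `FakeLazyColumn`
    # never subclasses or registers against `NativeExpr`.
    return expr


takes_native_expr(FakeLazyColumn())
```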
(Arrow|PandasLike)* deduplication

All of the new classes prefixed with `Eager` now provide the common functionality shared by the two Eager-only backends.
The big one is `EagerExpr`:

```python
class EagerExpr(
    CompliantExpr[EagerDataFrameT, EagerSeriesT],
    Protocol38[EagerDataFrameT, EagerSeriesT],
):
```

A very unexpected part of that was that they can also share their `*ExprNamespace` implementations.

*Namespace protocols

Working on the Eager namespaces led to new generic protocols for both the `(expr|series)_*` namespaces: https://github.com/narwhals-dev/narwhals/blob/b3bdb3b3ba52dada9cadde3c50d68520f1270a09/narwhals/_compliant/any_namespace.py

These can be re-used for all backends πŸ˜„
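
For illustration, a minimal sketch of the shape these shared namespace protocols can take (a hypothetical `StrNamespace` with made-up methods - not necessarily what `any_namespace.py` actually defines):

```python
from typing import Protocol, TypeVar

# Covariant: a namespace producing a more specific compliant expr type is still
# usable wherever the generic protocol is expected.
CompliantExprT_co = TypeVar("CompliantExprT_co", covariant=True)


class StrNamespace(Protocol[CompliantExprT_co]):
    """`<expr>.str` accessor, generic over a backend's compliant expression type."""

    def starts_with(self, prefix: str) -> CompliantExprT_co: ...
    def ends_with(self, suffix: str) -> CompliantExprT_co: ...
    def to_lowercase(self) -> CompliantExprT_co: ...
```

Each backend then only needs one concrete parametrisation, e.g. `StrNamespace[ArrowExpr]` or `StrNamespace[PandasLikeExpr]`.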

Misc

I followed up on (#2055 (comment)) and converted some methods to simpler `.from_` variants.
The only remaining one (`_create_compliant_series`) is discussed in (#2149 (comment)).

I also opened (#2169) to follow up on some strange pandas behavior.

Note

There might be some other bits in commit descriptions I've missed

@dangotbanned dangotbanned changed the title refactor(DRAFT): adds _compliant sub-package refactor: adds _compliant sub-package Mar 6, 2025
@dangotbanned dangotbanned marked this pull request as ready for review March 6, 2025 16:04
@dangotbanned dangotbanned marked this pull request as draft March 6, 2025 17:22
@dangotbanned dangotbanned linked an issue Mar 6, 2025 that may be closed by this pull request
Comment on lines +27 to +31
```python
CompliantSeriesOrNativeExprT_co = TypeVar(
    "CompliantSeriesOrNativeExprT_co",
    bound="CompliantSeries | NativeExpr",
    covariant=True,
)
```
dangotbanned (Member Author):

Really want something that rolls off the tongue better than this πŸ€”

```log
error: All protocol members must have explicitly declared types  [misc]
```
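
For context, mypy raises that error when a member declared in a `Protocol` body lacks an explicit annotation - a minimal reproduction (illustrative only, not code from the PR):

```python
from typing import Protocol


class HasName(Protocol):
    name: str    # OK: explicitly typed protocol member
    kind = "a"   # error: All protocol members must have explicitly declared types  [misc]
```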
@MarcoGorelli (Member):

thanks Dan! From a quick look I think this looks great, feel free to ping me for review when you're happy with it

@dangotbanned (Member Author):

> thanks Dan! From a quick look I think this looks great, feel free to ping me for review when you're happy with it

Thanks @MarcoGorelli, will do!

Comment on lines +367 to +372
```python
# For PyArrow.Series, we return Python Scalars (like Polars does) instead of PyArrow Scalars.
# However, when working with expressions, we keep everything PyArrow-native.
def _reuse_series_extra_kwargs(
    self, *, returns_scalar: bool = False
) -> dict[str, Any]:
    return {}
```
dangotbanned (Member Author):

I feel like this is fine in terms of a pure refactoring of:

```python
# For PyArrow.Series, we return Python Scalars (like Polars does) instead of PyArrow Scalars.
# However, when working with expressions, we keep everything PyArrow-native.
extra_kwargs = (
    {"_return_py_scalar": False}
    if returns_scalar and expr._implementation is Implementation.PYARROW
    else {}
)
out: list[CompliantSeries] = [
    plx._create_series_from_scalar(
        getattr(series, attr)(**extra_kwargs, **_kwargs),
        reference_series=series,  # type: ignore[arg-type]
    )
```

Into this override:

```python
def _reuse_series_extra_kwargs(
    self, *, returns_scalar: bool = False
) -> dict[str, Any]:
    return {"_return_py_scalar": False} if returns_scalar else {}
```


But I really can't shake wanting to handle this differently somehow.
I haven't spent too long thinking about it - maybe worth another look before review.
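
For reference, a condensed sketch of the shape of that refactor - the shared Eager code asks an overridable hook for backend-specific kwargs instead of branching on `Implementation` inline (class and method names below are simplified placeholders, not the PR's exact API):

```python
from typing import Any


class EagerExprBase:
    def _reuse_series_extra_kwargs(self, *, returns_scalar: bool = False) -> dict[str, Any]:
        # Default: no extra kwargs (the pandas-like path).
        return {}

    def _call_series_method(
        self, series: Any, attr: str, *, returns_scalar: bool, **kwargs: Any
    ) -> Any:
        # The shared implementation no longer needs to know which backend it runs on.
        extra_kwargs = self._reuse_series_extra_kwargs(returns_scalar=returns_scalar)
        return getattr(series, attr)(**extra_kwargs, **kwargs)


class ArrowExprSketch(EagerExprBase):
    def _reuse_series_extra_kwargs(self, *, returns_scalar: bool = False) -> dict[str, Any]:
        # PyArrow-only tweak: return PyArrow-native scalars when reusing series methods.
        return {"_return_py_scalar": False} if returns_scalar else {}
```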

- Eventually want to get rid of `create_compliant_series`
- `pandas` currently works differently in `create_compliant_series` and `native_series_from_iterable`
dangotbanned added a commit that referenced this pull request Mar 8, 2025
- Experimenting with an idea from (#2149)
- Trying to understand the different behavior from `create_compliant_series`
Only `.name` differs between the two `EagerExpr` - so why not share?
Comment on lines 54 to 55
```python
@deprecated("ref'd in untyped code")
def _create_compliant_series(self, value: Any) -> EagerSeriesT: ...
```
@dangotbanned (Member Author) commented Mar 9, 2025:

Took a while to fully unravel.

Tip

See this code search and my earlier note to understand existing usage.

Very rough notes

IGNORE PLS

I'm just sticking this here to get it out of my changes:

    # NOTE: Calling from an instance of `Namespace` (provides `context`)
    # NOTE: All usage within `*Expr.map_batches`
    #   - `PandasLikeExpr` uses that **once**
    #   - `ArrowExpr` uses **twice**
    #   - Expr -> Namespace -> Series
    #   - But handling `numpy`
    # NOTE: `PandasLikeDataFrame.with_row_index` uses the wrapped `utils` function once
    # NOTE External
    # - `_expression_parsing.extract_compliant` (numpy array)
    # - `dataframe.DataFrame._extract_compliant` (numpy array)
    # NOTE General
    # - Most similar to `EagerSeries._from_iterable`

I'm starting to think the broader replacement for `_create_compliant_series` will be:

Collapsed imports

```python
from typing import TYPE_CHECKING
from typing import Any
from typing import Protocol
from typing import overload

from narwhals._compliant.typing import EagerDataFrameT
from narwhals._compliant.typing import EagerSeriesT

if TYPE_CHECKING:
    import numpy as np
    from typing_extensions import TypeAlias

    from narwhals.typing import _1DArray
    from narwhals.typing import _2DArray

_NumpyScalar: TypeAlias = "np.generic[Any]"
```

```python
class EagerNamespace(
    CompliantNamespace[EagerDataFrameT, EagerSeriesT],
    Protocol[EagerDataFrameT, EagerSeriesT],
):
    @overload
    def from_numpy(self, value: _NumpyScalar | _1DArray, /) -> EagerSeriesT: ...

    @overload
    def from_numpy(self, value: _2DArray, /) -> EagerDataFrameT: ...

    def from_numpy(self, value: _NumpyScalar | _1DArray | _2DArray, /) -> Any: ...
```

`_NumpyScalar` is handled by:

```python
def _from_scalar(self, value: Any) -> Self:
    return self._from_iterable([value], name=self.name, context=self)
```

`_1DArray` is handled by:

```python
@classmethod
def _from_iterable(
    cls: type[Self], data: Iterable[Any], name: str, *, context: _FullContext
```

`_2DArray` would be each of the implementations in `nw.functions.from_numpy`:

```python
def from_numpy(
    data: _2DArray,
    schema: Mapping[str, DType] | Schema | Sequence[str] | None = None,
    *,
    native_namespace: ModuleType,
) -> DataFrame[Any]:
    """Construct a DataFrame from a NumPy ndarray.

    Notes:
        Only row orientation is currently supported.

        For pandas-like dataframes, conversion to schema is applied after dataframe
        creation.

    Arguments:
        data: Two-dimensional data represented as a NumPy ndarray.
        schema: The DataFrame schema as Schema, dict of {name: type}, or a sequence of str.
        native_namespace: The native library to use for DataFrame creation.

    Returns:
        A new DataFrame.

    Examples:
        >>> import numpy as np
        >>> import pyarrow as pa
        >>> import narwhals as nw
        >>>
        >>> arr = np.array([[5, 2, 1], [1, 4, 3]])
        >>> schema = {"c": nw.Int16(), "d": nw.Float32(), "e": nw.Int8()}
        >>> nw.from_numpy(arr, schema=schema, native_namespace=pa)
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        |Narwhals DataFrame|
        |------------------|
        |  pyarrow.Table   |
        |  c: int16        |
        |  d: float        |
        |  e: int8         |
        |  ----            |
        |  c: [[5,1]]      |
        |  d: [[2,4]]      |
        |  e: [[1,3]]      |
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    """
    return _from_numpy_impl(data, schema, native_namespace=native_namespace)


def _from_numpy_impl(
    data: _2DArray,
    schema: Mapping[str, DType] | Schema | Sequence[str] | None = None,
    *,
    native_namespace: ModuleType,
) -> DataFrame[Any]:
    from narwhals.schema import Schema

    if not is_numpy_array_2d(data):
        msg = "`from_numpy` only accepts 2D numpy arrays"
        raise ValueError(msg)
    implementation = Implementation.from_native_namespace(native_namespace)
    if implementation is Implementation.POLARS:
        if isinstance(schema, (Mapping, Schema)):
            schema_pl: pl.Schema | Sequence[str] | None = Schema(schema).to_polars()
        elif is_sequence_but_not_str(schema) or schema is None:
            schema_pl = schema
        else:
            msg = (
                "`schema` is expected to be one of the following types: "
                "Mapping[str, DType] | Schema | Sequence[str]. "
                f"Got {type(schema)}."
            )
            raise TypeError(msg)
        native_frame = native_namespace.from_numpy(data, schema=schema_pl)
    elif implementation.is_pandas_like():
        if isinstance(schema, (Mapping, Schema)):
            from narwhals._pandas_like.utils import get_dtype_backend

            it: Iterable[DTypeBackend] = (
                get_dtype_backend(native_type, implementation)
                for native_type in schema.values()
            )
            native_frame = native_namespace.DataFrame(data, columns=schema.keys()).astype(
                Schema(schema).to_pandas(it)
            )
        elif is_sequence_but_not_str(schema):
            native_frame = native_namespace.DataFrame(data, columns=list(schema))
        elif schema is None:
            native_frame = native_namespace.DataFrame(
                data, columns=[f"column_{x}" for x in range(data.shape[1])]
            )
        else:
            msg = (
                "`schema` is expected to be one of the following types: "
                "Mapping[str, DType] | Schema | Sequence[str]. "
                f"Got {type(schema)}."
            )
            raise TypeError(msg)
    elif implementation is Implementation.PYARROW:
        pa_arrays = [native_namespace.array(val) for val in data.T]
        if isinstance(schema, (Mapping, Schema)):
            schema_pa = Schema(schema).to_arrow()
            native_frame = native_namespace.Table.from_arrays(pa_arrays, schema=schema_pa)
        elif is_sequence_but_not_str(schema):
            native_frame = native_namespace.Table.from_arrays(
                pa_arrays, names=list(schema)
            )
        elif schema is None:
            native_frame = native_namespace.Table.from_arrays(
                pa_arrays, names=[f"column_{x}" for x in range(data.shape[1])]
            )
        else:
            msg = (
                "`schema` is expected to be one of the following types: "
                "Mapping[str, DType] | Schema | Sequence[str]. "
                f"Got {type(schema)}."
            )
            raise TypeError(msg)
    else:  # pragma: no cover
        try:
            # implementation is UNKNOWN, Narwhals extension using this feature should
            # implement `from_numpy` function in the top-level namespace.
            native_frame = native_namespace.from_numpy(data, schema=schema)
        except AttributeError as e:
            msg = "Unknown namespace is expected to implement `from_numpy` function."
            raise AttributeError(msg) from e
    return from_native(native_frame, eager_only=True)
```

The `_2DArray` case is really interesting to me, since it has already carved out space for us (albeit `DataFrame`-only):

```python
try:
    # implementation is UNKNOWN, Narwhals extension using this feature should
    # implement `from_numpy` function in the top-level namespace.
    native_frame = native_namespace.from_numpy(data, schema=schema)
except AttributeError as e:
    msg = "Unknown namespace is expected to implement `from_numpy` function."
    raise AttributeError(msg) from e
```

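To make the proposed `EagerNamespace.from_numpy` overloads more concrete, one possible runtime dispatch is sketched below - splitting on dimensionality and delegating each case to the pieces described above (`_series_from_scalar`, `_series_from_1d` and `_dataframe_from_2d` are invented placeholder names):

```python
from typing import Any

import numpy as np


def from_numpy(self: Any, value: Any, /) -> Any:
    # Hypothetical dispatch backing the overloads sketched earlier.
    arr = np.asarray(value)
    if arr.ndim == 0:
        # `_NumpyScalar` -> length-1 series (cf. `_from_scalar`).
        return self._series_from_scalar(arr.item())
    if arr.ndim == 1:
        # `_1DArray` -> series (cf. `EagerSeries._from_iterable`).
        return self._series_from_1d(arr)
    if arr.ndim == 2:
        # `_2DArray` -> per-backend DataFrame constructor (cf. `nw.from_numpy`).
        return self._dataframe_from_2d(arr)
    msg = f"Expected at most 2 dimensions, got {arr.ndim}"
    raise ValueError(msg)
```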
dangotbanned (Member Author):

If we look ahead to v2, the particular names will change but the direction is the same:

@dangotbanned (Member Author) commented Mar 9, 2025:

Here's what I think the generics need to be:

```python
from typing import TYPE_CHECKING
from typing import Any
from typing import Protocol

from typing_extensions import TypeVar

from narwhals.typing import _1DArray
from narwhals.typing import _2DArray

if TYPE_CHECKING:
    import numpy as np
    from typing_extensions import Self
    from typing_extensions import TypeAlias

    from narwhals.utils import _FullContext


_NumpyScalar: TypeAlias = "np.generic[Any]"

ToNumpyT_co = TypeVar("ToNumpyT_co", covariant=True)
FromNumpyT_contra = TypeVar("FromNumpyT_contra", contravariant=True, default=ToNumpyT_co)


class NumpyConvertible(Protocol[ToNumpyT_co, FromNumpyT_contra]):
    def to_numpy(self) -> ToNumpyT_co: ...
    @classmethod
    def from_numpy(cls, value: FromNumpyT_contra, *, context: _FullContext) -> Self: ...


class SeriesImpl(NumpyConvertible[_1DArray, _NumpyScalar | _1DArray]): ...


class DataFrameImpl(NumpyConvertible[_2DArray]): ...
```

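As a sanity check on the variance choices (a toy example, not part of the PR): `to_numpy` only ever *produces* `ToNumpyT_co`, so covariance is safe; `from_numpy` only ever *consumes* `FromNumpyT_contra`, so contravariance is safe; and `default=ToNumpyT_co` means `DataFrameImpl` only has to spell the array type once.

```python
import numpy as np
from typing_extensions import Self


class ToySeries:
    """Roughly matches `NumpyConvertible[_1DArray, _NumpyScalar | _1DArray]` structurally."""

    def __init__(self, data: np.ndarray) -> None:
        self._data = np.asarray(data)

    def to_numpy(self) -> np.ndarray:
        # Produces the covariant `ToNumpyT_co` type.
        return self._data

    @classmethod
    def from_numpy(cls, value: "np.generic | np.ndarray", *, context: object = None) -> Self:
        # Consumes the wider, contravariant input: a numpy scalar or a 1D array.
        return cls(np.atleast_1d(np.asarray(value)))
```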

@dangotbanned dangotbanned mentioned this pull request Mar 9, 2025
@dangotbanned (Member Author) commented Mar 9, 2025:

> thanks Dan! From a quick look I think this looks great, feel free to ping me for review when you're happy with it

@MarcoGorelli I'm gonna have to wrap this up soon - just to avoid spending too long on resolving conflicts.

Still somewhat manageable atm - but fully a self-inflicted chore I want to avoid now πŸ˜…

Will do my best to get it in shape today - really happy with the rabbit holes I've gone down though πŸ˜‰

Update

This'll have to do for now - ready as I'll ever be @MarcoGorelli

  β€’ Very open to alternatives on what the module should be called - it was the *least-bad* one I thought of
  β€’ Really hate this as a solution, and keep trying to get `mypy` to let me do something else
  β€’ Was not easy getting that to please both `mypy` & `pyright`
@dangotbanned dangotbanned marked this pull request as ready for review March 9, 2025 17:48
Development

Successfully merging this pull request may close these issues.

fix TypeVar used in (SparkLike|DuckDB)Expr