-
Notifications
You must be signed in to change notification settings - Fork 131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: adds _compliant
sub-package
#2149
base: main
Are you sure you want to change the base?
Conversation
Follow-up to #2119 (comment)
- Long-term, should probably be defined in `nw.typing` - Or just generally used in parts of each backend impl #2064 (comment)
_compliant
sub-package_compliant
sub-package
CompliantSeriesOrNativeExprT_co = TypeVar( | ||
"CompliantSeriesOrNativeExprT_co", | ||
bound="CompliantSeries | NativeExpr", | ||
covariant=True, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really want something that rolls off the tongue better than this π€
```log error: All protocol members must have explicitly declared types [misc] ```
thanks Dan! From a quick look I think this looks great, feel free to ping me for review when you're happy with it |
Thanks @MarcoGorelli, will do! |
# For PyArrow.Series, we return Python Scalars (like Polars does) instead of PyArrow Scalars. | ||
# However, when working with expressions, we keep everything PyArrow-native. | ||
def _reuse_series_extra_kwargs( | ||
self, *, returns_scalar: bool = False | ||
) -> dict[str, Any]: | ||
return {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this is fine in terms of a pure refactoring of:
narwhals/narwhals/_expression_parsing.py
Lines 160 to 172 in b508e63
# For PyArrow.Series, we return Python Scalars (like Polars does) instead of PyArrow Scalars. | |
# However, when working with expressions, we keep everything PyArrow-native. | |
extra_kwargs = ( | |
{"_return_py_scalar": False} | |
if returns_scalar and expr._implementation is Implementation.PYARROW | |
else {} | |
) | |
out: list[CompliantSeries] = [ | |
plx._create_series_from_scalar( | |
getattr(series, attr)(**extra_kwargs, **_kwargs), | |
reference_series=series, # type: ignore[arg-type] | |
) |
Into this override:
narwhals/narwhals/_arrow/expr.py
Lines 138 to 141 in e64626a
def _reuse_series_extra_kwargs( | |
self, *, returns_scalar: bool = False | |
) -> dict[str, Any]: | |
return {"_return_py_scalar": False} if returns_scalar else {} |
But I really can't shake wanting to handle this differently somehow.
Haven't spent too long thinking about it - maybe take another look before review
- Eventually want to get rid of `create_compliant_series` - `pandas` currently works differently in `create_compliant_series` and `native_series_from_iterable`
- Experimenting with an idea from (#2149) - Trying to understand the different behavior from `create_compliant_series`
Only `.name` differs between the two `EagerExpr` - so why not share?
- Bit confused as what I've done is fairly similar to (#2130) - Getting the same issue as (#2084) https://github.com/narwhals-dev/narwhals/actions/runs/13740460535/job/38429062404?pr=2149
`pandas` impl will fail now
Didn't realise until now that both backends did the same thing
narwhals/_compliant/namespace.py
Outdated
@deprecated("ref'd in untyped code") | ||
def _create_compliant_series(self, value: Any) -> EagerSeriesT: ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a while to fully unravel.
Tip
See this code search and my earlier note to understand existing usage.
Very rough notes
IGNORE PLS
I'm just sticking this here to get it out of my changes:
# NOTE: Calling from an instance of `Namespace` (provides `context`)
# NOTE: All usage within `*Expr.map_batches`
# - `PandasLikeExpr` uses that **once**
# - `ArrowExpr` uses **twice**
# - Expr -> Namespace -> Series
# - But handling `numpy`
# NOTE: `PandasLikeDataFrame.with_row_index` uses the wrapped `utils` function once
# NOTE External
# - `_expression_parsing.extract_compliant` (numpy array)
# - `dataframe.DataFrame._extract_compliant` (numpy array)
# NOTE General
# - Most similar to `EagerSeries._from_iterable`
I'm starting to think the broader replacement for _create_compliant_series
will be:
Collapsed imports
from typing import TYPE_CHECKING
from typing import Any
from typing import Protocol
from typing import overload
from narwhals._compliant.typing import EagerDataFrameT
from narwhals._compliant.typing import EagerSeriesT
if TYPE_CHECKING:
import numpy as np
from typing_extensions import TypeAlias
from narwhals.typing import _1DArray
from narwhals.typing import _2DArray
_NumpyScalar: TypeAlias = "np.generic[Any]"
class EagerNamespace(
CompliantNamespace[EagerDataFrameT, EagerSeriesT],
Protocol[EagerDataFrameT, EagerSeriesT],
):
@overload
def from_numpy(self, value: _NumpyScalar | _1DArray, /) -> EagerSeriesT: ...
@overload
def from_numpy(self, value: _2DArray, /) -> EagerDataFrameT: ...
def from_numpy(self, value: _NumpyScalar | _1DArray | _2DArray, /) -> Any: ...
_NumpyScalar
is handled by:
narwhals/narwhals/_compliant/series.py
Lines 54 to 55 in 067252b
def _from_scalar(self, value: Any) -> Self: | |
return self._from_iterable([value], name=self.name, context=self) |
_1DArray
is handled by:
narwhals/narwhals/_compliant/series.py
Lines 57 to 59 in 067252b
@classmethod | |
def _from_iterable( | |
cls: type[Self], data: Iterable[Any], name: str, *, context: _FullContext |
_2DArray
would be each of the implementations in:
nw.functions.from_numpy
narwhals/narwhals/functions.py
Lines 435 to 561 in c223138
def from_numpy( | |
data: _2DArray, | |
schema: Mapping[str, DType] | Schema | Sequence[str] | None = None, | |
*, | |
native_namespace: ModuleType, | |
) -> DataFrame[Any]: | |
"""Construct a DataFrame from a NumPy ndarray. | |
Notes: | |
Only row orientation is currently supported. | |
For pandas-like dataframes, conversion to schema is applied after dataframe | |
creation. | |
Arguments: | |
data: Two-dimensional data represented as a NumPy ndarray. | |
schema: The DataFrame schema as Schema, dict of {name: type}, or a sequence of str. | |
native_namespace: The native library to use for DataFrame creation. | |
Returns: | |
A new DataFrame. | |
Examples: | |
>>> import numpy as np | |
>>> import pyarrow as pa | |
>>> import narwhals as nw | |
>>> | |
>>> arr = np.array([[5, 2, 1], [1, 4, 3]]) | |
>>> schema = {"c": nw.Int16(), "d": nw.Float32(), "e": nw.Int8()} | |
>>> nw.from_numpy(arr, schema=schema, native_namespace=pa) | |
ββββββββββββββββββββ | |
|Narwhals DataFrame| | |
|------------------| | |
| pyarrow.Table | | |
| c: int16 | | |
| d: float | | |
| e: int8 | | |
| ---- | | |
| c: [[5,1]] | | |
| d: [[2,4]] | | |
| e: [[1,3]] | | |
ββββββββββββββββββββ | |
""" | |
return _from_numpy_impl(data, schema, native_namespace=native_namespace) | |
def _from_numpy_impl( | |
data: _2DArray, | |
schema: Mapping[str, DType] | Schema | Sequence[str] | None = None, | |
*, | |
native_namespace: ModuleType, | |
) -> DataFrame[Any]: | |
from narwhals.schema import Schema | |
if not is_numpy_array_2d(data): | |
msg = "`from_numpy` only accepts 2D numpy arrays" | |
raise ValueError(msg) | |
implementation = Implementation.from_native_namespace(native_namespace) | |
if implementation is Implementation.POLARS: | |
if isinstance(schema, (Mapping, Schema)): | |
schema_pl: pl.Schema | Sequence[str] | None = Schema(schema).to_polars() | |
elif is_sequence_but_not_str(schema) or schema is None: | |
schema_pl = schema | |
else: | |
msg = ( | |
"`schema` is expected to be one of the following types: " | |
"Mapping[str, DType] | Schema | Sequence[str]. " | |
f"Got {type(schema)}." | |
) | |
raise TypeError(msg) | |
native_frame = native_namespace.from_numpy(data, schema=schema_pl) | |
elif implementation.is_pandas_like(): | |
if isinstance(schema, (Mapping, Schema)): | |
from narwhals._pandas_like.utils import get_dtype_backend | |
it: Iterable[DTypeBackend] = ( | |
get_dtype_backend(native_type, implementation) | |
for native_type in schema.values() | |
) | |
native_frame = native_namespace.DataFrame(data, columns=schema.keys()).astype( | |
Schema(schema).to_pandas(it) | |
) | |
elif is_sequence_but_not_str(schema): | |
native_frame = native_namespace.DataFrame(data, columns=list(schema)) | |
elif schema is None: | |
native_frame = native_namespace.DataFrame( | |
data, columns=[f"column_{x}" for x in range(data.shape[1])] | |
) | |
else: | |
msg = ( | |
"`schema` is expected to be one of the following types: " | |
"Mapping[str, DType] | Schema | Sequence[str]. " | |
f"Got {type(schema)}." | |
) | |
raise TypeError(msg) | |
elif implementation is Implementation.PYARROW: | |
pa_arrays = [native_namespace.array(val) for val in data.T] | |
if isinstance(schema, (Mapping, Schema)): | |
schema_pa = Schema(schema).to_arrow() | |
native_frame = native_namespace.Table.from_arrays(pa_arrays, schema=schema_pa) | |
elif is_sequence_but_not_str(schema): | |
native_frame = native_namespace.Table.from_arrays( | |
pa_arrays, names=list(schema) | |
) | |
elif schema is None: | |
native_frame = native_namespace.Table.from_arrays( | |
pa_arrays, names=[f"column_{x}" for x in range(data.shape[1])] | |
) | |
else: | |
msg = ( | |
"`schema` is expected to be one of the following types: " | |
"Mapping[str, DType] | Schema | Sequence[str]. " | |
f"Got {type(schema)}." | |
) | |
raise TypeError(msg) | |
else: # pragma: no cover | |
try: | |
# implementation is UNKNOWN, Narwhals extension using this feature should | |
# implement `from_numpy` function in the top-level namespace. | |
native_frame = native_namespace.from_numpy(data, schema=schema) | |
except AttributeError as e: | |
msg = "Unknown namespace is expected to implement `from_numpy` function." | |
raise AttributeError(msg) from e | |
return from_native(native_frame, eager_only=True) |
The _2DArray
case is really interesting to me, since it has already carved out space for us (albeit DataFrame
only):
narwhals/narwhals/functions.py
Lines 554 to 560 in c223138
try: | |
# implementation is UNKNOWN, Narwhals extension using this feature should | |
# implement `from_numpy` function in the top-level namespace. | |
native_frame = native_namespace.from_numpy(data, schema=schema) | |
except AttributeError as e: | |
msg = "Unknown namespace is expected to implement `from_numpy` function." | |
raise AttributeError(msg) from e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we look ahead to v2
, the particular names will change but direction is the same:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's what I think the generics need to be:
from typing import Protocol
from typing_extensions import TypeVar
from narwhals.typing import _1DArray
from narwhals.typing import _2DArray
if TYPE_CHECKING:
import numpy as np
from typing_extensions import Self
from typing_extensions import TypeAlias
from narwhals.utils import _FullContext
_NumpyScalar: TypeAlias = np.generic[Any]
ToNumpyT_co = TypeVar("ToNumpyT_co", covariant=True)
FromNumpyT_contra = TypeVar("FromNumpyT_contra", contravariant=True, default=ToNumpyT_co)
class NumpyConvertible(Protocol[ToNumpyT_co, FromNumpyT_contra]):
def to_numpy(self) -> ToNumpyT_co: ...
@classmethod
def from_numpy(cls, value: FromNumpyT_contra, *, context: _FullContext) -> Self: ...
class SeriesImpl(NumpyConvertible[_1DArray, _NumpyScalar | _1DArray]): ...
class DataFrameImpl(NumpyConvertible[_2DArray]): ...
Notes
- To get
TypeVar
defaults, I'll use the same trick as (feat(typing): Backport genericSeries
tov1
Β #2110 (comment))- Probably will want to move that into a new module
nw._typing_compat
- Can also be the home for
Protocol38
andnw.utils.deprecated
- Can also be the home for
- Using https://docs.astral.sh/ruff/settings/#lint_typing-modules
- Probably will want to move that into a new module
- I'm not using quoted annotations/forward refs in this example
- Will in the real thing
- Just trying to make it easier to read on GitHub π
Prior Art
@MarcoGorelli I'm gonna have to wrap this up soon - just to avoid spending too long on resolving conflicts. Still somewhat manageable atm - but fully a self-inflicted chore I want to avoid now π Will do my best to get it in shape today - really happy with the rabbit holes I've gone down though π UpdateThis'll have to do for now - ready as I'll ever be @MarcoGorelli |
Very open to alternatives on what the module should be called Was the *least-bad* one I thought of
Really hate this as a solution and keep trying to get `mypy` to let me do something else
Was not easy getting that to please both `mypy` & `pyright`
Closes #2044
What type of PR is this? (check all applicable)
Related issues
TypeVar
used in(SparkLike|DuckDB)Expr
Β #2044 (comment)CompliantExpr
Β #2119 (comment)Compliant*
sub-protocolsΒ #2055)_Stores(Native|Compliant)
Β #2130CompliantSelector
Β #2064Checklist
If you have comments or can explain your changes, please do so below
Direct follow-up to (#2119), aiming to make space for similar changes for other
Compliant*
protocols.This PR spiralled a bit, but tackles a few quite-related issues.
Lazy-only typing
#2044 has been resolved through the use of the new protocol
NativeExpr
.narwhals/narwhals/_compliant/expr.py
Lines 65 to 73 in b3bdb3b
(Arrow|PandasLike)*
deduplicationAll of the new classes prefixed with
Eager
now provide the common functionality between the two Eager-only backends.The big one is
EagerExpr
narwhals/narwhals/_compliant/expr.py
Lines 257 to 260 in b3bdb3b
A very unexpected part of that was that they could also share
*ExprNamespace
implementations*Namespace
protocolsWorking on the
Eager
namespaces led to the new generic protocols for both(expr|series)_*
namespaces https://github.com/narwhals-dev/narwhals/blob/b3bdb3b3ba52dada9cadde3c50d68520f1270a09/narwhals/_compliant/any_namespace.pyThese can be re-used for all backends π
Misc
I followed up on (#2055 (comment)) and converted some methods to simpler
.from_
variants.The only remaining one (
_create_compliant_series
) is discussed in (#2149 (comment))I also opened (#2169) for following up some strange
pandas
behaviorNote
There might be some other bits in commit descriptions I've missed