bug: `F.skewness` returns different results for pyspark and duckdb sessions (sample vs population skewness) #306

FBruzzesi · 2025-02-15T23:31:48Z

Description

First and foremost: thanks to sqlframe, I discovered we were computing skewness incorrectly in narwhals with the duckdb backend!

Judging by the numbers below, it appears that pyspark computes the population (uncorrected) skewness while duckdb computes the bias-corrected sample skewness. The two results differ by a correction factor of

$$\frac{\sqrt{n(n-1)}}{n-2}$$
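
Spelled out, and assuming the usual moment-based definitions (the symbols $m_k$, $g_1$ and $G_1$ are notation introduced here for illustration, not taken from either engine's documentation), the uncorrected and the bias-adjusted skewness would relate as:

$$m_k = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^k, \qquad g_1 = \frac{m_3}{m_2^{3/2}}, \qquad G_1 = \frac{\sqrt{n(n-1)}}{n-2}\, g_1$$

For $n = 3$ the factor is $\sqrt{6} \approx 2.449$, and $G_1$ is undefined for $n \le 2$.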

Let me know if this is out of scope (as it would only be needed to match pyspark behavior).

In code/numbers:

Spark:

```python
from sqlframe.spark import SparkSession
import sqlframe.spark.functions as F

session = SparkSession()

data = {"a": [4, 4, 6]}

frame = session.createDataFrame([*zip(*data.values())], schema=[*data.keys()])

frame.select(F.skewness("a")).show()
```

```
+--------------------+
|   skewness__a__    |
+--------------------+
| 0.7071067811865475 |
+--------------------+
```

DuckDB:

```python
from sqlframe.duckdb import DuckDBSession
import sqlframe.duckdb.functions as F

session = DuckDBSession()

data = {"a": [4, 4, 6]}

frame = session.createDataFrame([*zip(*data.values())], schema=[*data.keys()])

frame.select(F.skewness("a")).show()
```

```
+-------------------+
|   skewness__a__   |
+-------------------+
| 1.732050807568811 |
+-------------------+
```
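
As a sanity check, both numbers can be reproduced in plain Python without either engine. This is only a minimal sketch: `skewness_uncorrected` and `correction_factor` are hypothetical helper names, not part of sqlframe, pyspark or duckdb, and the formulas assume the standard moment-based definition of skewness.

```python
import math


def skewness_uncorrected(values):
    """Ratio of the third central moment to the second one raised to 3/2,
    with both moments divided by n (no bias adjustment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


def correction_factor(n):
    """The factor sqrt(n * (n - 1)) / (n - 2); only defined for n >= 3."""
    return math.sqrt(n * (n - 1)) / (n - 2)


data = [4, 4, 6]
g1 = skewness_uncorrected(data)          # approx. 0.7071, the Spark output
G1 = correction_factor(len(data)) * g1   # approx. 1.7321, the DuckDB output

print(g1, G1)
```

In other words, multiplying the Spark result by the factor gives the DuckDB result, and dividing the DuckDB result by it gives the Spark one.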

I just opened a PR to fix it in narwhals for the native duckdb backend, in case it is of interest.

eakmanrq · 2025-02-16T17:39:56Z

The example PR showing how to fix it was very helpful! Thank you!

FBruzzesi · 2025-02-16T21:54:33Z

Happy to hear it helped 🙌

eakmanrq mentioned this issue Feb 16, 2025

fix: have skewness match pyspark #307

Merged

eakmanrq closed this as completed in #307 Feb 16, 2025
