bug: F.skewness returns different results for pyspark and duckdb sessions (sample vs population skewness) #306

Closed
FBruzzesi opened this issue Feb 15, 2025 · 2 comments · Fixed by #307

@FBruzzesi
Contributor

FBruzzesi commented Feb 15, 2025

Description

First and foremost: thanks to sqlframe, I discovered that we were computing skewness incorrectly in narwhals with the duckdb backend!

It appears that pyspark computes the population skewness, while duckdb computes the sample (bias-corrected) skewness; the numbers below bear this out, since the duckdb result is the larger of the two. The two values differ by a correction factor of

$$\frac{\sqrt{n(n-1)}}{n-2}$$


Let me know if this is out of scope (as it would only be needed to match pyspark behavior).


In code/numbers:

Spark:

from sqlframe.spark import SparkSession
import sqlframe.spark.functions as F

session = SparkSession()

data = {"a": [4, 4, 6]}

frame = session.createDataFrame([*zip(*data.values())], schema=[*data.keys()])

frame.select(F.skewness("a")).show()
+--------------------+                                                          
|   skewness__a__    |
+--------------------+
| 0.7071067811865475 |
+--------------------+

DuckDB:

from sqlframe.duckdb import DuckDBSession
import sqlframe.duckdb.functions as F

session = DuckDBSession()

data = {"a": [4, 4, 6]}

frame = session.createDataFrame([*zip(*data.values())], schema=[*data.keys()])
frame.select(F.skewness("a")).show()

+-------------------+
|   skewness__a__   |
+-------------------+
| 1.732050807568811 |
+-------------------+
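
To check which convention each backend follows, here is a minimal pure-Python sketch that computes both estimators for the same data. The function names `population_skewness` and `sample_skewness` are illustrative only and are not part of sqlframe, pyspark, or duckdb. The sketch assumes the usual moment-based definition $g_1 = m_3 / m_2^{3/2}$ for the uncorrected value, and applies the correction factor quoted above to obtain the bias-corrected one.

```python
import math


def population_skewness(values):
    """Uncorrected (population) skewness: g1 = m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    # Second and third central moments, both divided by n.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


def sample_skewness(values):
    """Bias-corrected (sample) skewness: g1 * sqrt(n * (n - 1)) / (n - 2)."""
    n = len(values)
    if n < 3:
        raise ValueError("sample skewness needs at least 3 values")
    return population_skewness(values) * math.sqrt(n * (n - 1)) / (n - 2)


data = [4, 4, 6]
g1 = population_skewness(data)  # roughly 0.7071, matching the Spark output
G1 = sample_skewness(data)      # roughly 1.7321, matching the DuckDB output
print(g1, G1)
```

On this data the uncorrected formula reproduces the Spark result and the corrected one reproduces the DuckDB result, so dividing the DuckDB value by the correction factor (here $\sqrt{6}$, since $n = 3$) would recover the Spark value.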

I just opened a PR to fix this in narwhals for the native duckdb backend, in case it is useful as a reference.

@eakmanrq
Owner

The example PR of how to fix was very helpful! Thank you!

@FBruzzesi
Contributor Author

Happy to hear it helped 🙌