Re: Python and SAS returning different values

danielchoi626 · Posted 06-25-2021 05:18 AM

I am currently in the process of converting a Python script to SAS / PROC SQL and whilst everything has been going smoothly I've been stumped by an odd case where calculating Z-Scores with the following scripts give me different answers.

Python:
HO_HOME_PROVINCE_MEAN = df_A10[["HOME_PROVINCE", "EUCLIDEAN_HOME_OFFICE_DIST_M"]].groupby("HOME_PROVINCE").mean().rename(columns = {"EUCLIDEAN_HOME_OFFICE_DIST_M" : "HO_HOME_PROVINCE_MEAN"}).reset_index()
HO_HOME_PROVINCE_STD = df_A10[["HOME_PROVINCE", "EUCLIDEAN_HOME_OFFICE_DIST_M"]].groupby("HOME_PROVINCE").std().rename(columns = {"EUCLIDEAN_HOME_OFFICE_DIST_M" : "HO_HOME_PROVINCE_STD"}).reset_index()

df_A11 = pd.merge(df_A10, HO_HOME_PROVINCE_MEAN, on = "HOME_PROVINCE", how = "left")
df_A11 = pd.merge(df_A10, HO_HOME_PROVINCE_STD, on = "HOME_PROVINCE", how = "left")

df11["HO_HOME_DISTRICT_ZSCORE"] = (df_A11["EUCLIDEAN_HOME_OFFICE_DIST_M"] - df_A11["HO_HOME_DISTRICT_MEAN"]) / df_A11["HO_HOME_DISTRICT_STD"]

SAS:

PROC SQL; CREATE TABLE DEVDAT11 AS
SELECT * FROM DEVDAT10 AS A
LEFT JOIN HOME_PROVINCE_MEAN AS B
ON A.HOME_PROVINCE = B.HOME_PROVINCE;

PROC SQL; CREATE TABLE DEVDAT12 AS
SELECT * FROM DEVDAT11 AS A
LEFT JOIN HOME_PROVINCE_STD AS B
ON A.HOME_PROVINCE = B.HOME_PROVINCE;

DATA WORK.DEVDAT13;
SET DEVDAT12;
HO_HOME_PROVINCE_ZSCORE = (EUCLIDEAN_HOME_OFFICE_DIST_M - HO_HOME_PROVINCE_MEAN) / HO_HOME_PROVINCE_STD;
RUN;

I've made sure to check if I've made any miscalculations using .sum() for Python and Proc Means Sum for SAS and so far it doesn't seem like I've made any mistakes on that front. I suspect that it might have something to do with the way Python handles null values or zeroes in rows as opposed to SAS as I keep getting this message from the log:

NOTE: Division by zero detected at line 2202 column 90.

The file that I've attached to this post contains a zero and a null value in column "HO_HOME_PROVINCE_STD". Any advice would be very much appreciated

PaigeMiller · Posted 06-25-2021 05:48 AM

Have you actually looked at your datasets with your own eyes to identify where the differences are, and what might be causing this? This is investigative work that you can (and should) do, and will most likely resolve the problem. If it doesn't, then show us the very specific observations in the data that are not giving the same answers, for SAS variables named EUCLIDEAN_HOME_OFFICE_DIST_M, HO_HOME_PROVINCE_MEAN and HO_HOME_PROVINCE_STD.

--
Paige Miller

andreas_lds · Posted 06-25-2021 06:05 AM

I don't know how Python handles division by zero, but returning a missing value, as sas does, is correct.

Reeza · Posted 06-25-2021 11:21 AM

In neither of these tools should you be doing z scores manually. PROC STDIZE in SAS will ensure that your calculations are correct.

PaigeMiller · Posted 06-25-2021 11:30 AM

But the Z-score code shown in a DATA step ought to produce the exact same results as PROC STDIZE — assuming the mean and standard deviation were calculated correctly, which is something the original poster ought to have checked already.

--
Paige Miller

Reeza · Posted 06-25-2021 11:32 AM

Yes, but doing things manually and not using a proc leaves you more open to mistakes as shown exactly in this post by someone trying to do it manually in different applications.

And as OP has stated, they're not super familiar with SAS so knowing that PROC STDIZE can do it more efficiently is worth mentioning.

PaigeMiller · Posted 06-25-2021 11:43 AM

And I completely agree with you that people should use PROC STDIZE to calculate z-scores (and similar quantities).

I was trying to point out that the line of code which computes the z-scores in the original message could not be the cause of the problems that the original poster is experiencing. The error must be something else.

--
Paige Miller

Python and SAS returning different values