I am currently in the process of converting a Python script to SAS / PROC SQL and whilst everything has been going smoothly I've been stumped by an odd case where calculating Z-Scores with the following scripts give me different answers.
Python:
HO_HOME_PROVINCE_MEAN = df_A10[["HOME_PROVINCE", "EUCLIDEAN_HOME_OFFICE_DIST_M"]].groupby("HOME_PROVINCE").mean().rename(columns = {"EUCLIDEAN_HOME_OFFICE_DIST_M" : "HO_HOME_PROVINCE_MEAN"}).reset_index()
HO_HOME_PROVINCE_STD = df_A10[["HOME_PROVINCE", "EUCLIDEAN_HOME_OFFICE_DIST_M"]].groupby("HOME_PROVINCE").std().rename(columns = {"EUCLIDEAN_HOME_OFFICE_DIST_M" : "HO_HOME_PROVINCE_STD"}).reset_index()
df_A11 = pd.merge(df_A10, HO_HOME_PROVINCE_MEAN, on = "HOME_PROVINCE", how = "left")
df_A11 = pd.merge(df_A10, HO_HOME_PROVINCE_STD, on = "HOME_PROVINCE", how = "left")
df11["HO_HOME_DISTRICT_ZSCORE"] = (df_A11["EUCLIDEAN_HOME_OFFICE_DIST_M"] - df_A11["HO_HOME_DISTRICT_MEAN"]) / df_A11["HO_HOME_DISTRICT_STD"]
SAS:
PROC SQL; CREATE TABLE DEVDAT11 AS
SELECT * FROM DEVDAT10 AS A
LEFT JOIN HOME_PROVINCE_MEAN AS B
ON A.HOME_PROVINCE = B.HOME_PROVINCE;
PROC SQL; CREATE TABLE DEVDAT12 AS
SELECT * FROM DEVDAT11 AS A
LEFT JOIN HOME_PROVINCE_STD AS B
ON A.HOME_PROVINCE = B.HOME_PROVINCE;
DATA WORK.DEVDAT13;
SET DEVDAT12;
HO_HOME_PROVINCE_ZSCORE = (EUCLIDEAN_HOME_OFFICE_DIST_M - HO_HOME_PROVINCE_MEAN) / HO_HOME_PROVINCE_STD;
RUN;
I've made sure to check if I've made any miscalculations using .sum() for Python and Proc Means Sum for SAS and so far it doesn't seem like I've made any mistakes on that front. I suspect that it might have something to do with the way Python handles null values or zeroes in rows as opposed to SAS as I keep getting this message from the log:
NOTE: Division by zero detected at line 2202 column 90.
The file that I've attached to this post contains a zero and a null value in column "HO_HOME_PROVINCE_STD". Any advice would be very much appreciated
Have you actually looked at your datasets with your own eyes to identify where the differences are, and what might be causing this? This is investigative work that you can (and should) do, and will most likely resolve the problem. If it doesn't, then show us the very specific observations in the data that are not giving the same answers, for SAS variables named EUCLIDEAN_HOME_OFFICE_DIST_M, HO_HOME_PROVINCE_MEAN and HO_HOME_PROVINCE_STD.
I don't know how Python handles division by zero, but returning a missing value, as sas does, is correct.
But the Z-score code shown in a DATA step ought to produce the exact same results as PROC STDIZE — assuming the mean and standard deviation were calculated correctly, which is something the original poster ought to have checked already.
And I completely agree with you that people should use PROC STDIZE to calculate z-scores (and similar quantities).
I was trying to point out that the line of code which computes the z-scores in the original message could not be the cause of the problems that the original poster is experiencing. The error must be something else.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.