On the Kaggle website, users have posted the code and results they got when they tested the Titanic "train" data set using both R and Python. I wanted to try it out using SAS; however, I'm getting a different result for the correlation between survival and sex.
This is the result that Kaggle forum users got when they used Python or R to check the correlation between Survived (1=survived, 0=died) and Sex. For Sex, 0=female and 1=male (the data set originally comes with a character variable containing "male" or "female", but a binary nominal variable is created using 0 and 1).
Survival Correlation by: Sex
   Sex      Survived
0  female   0.742038
1  male     0.188908
I tried it in SAS using the PROC CORR step below, but my result showed a correlation of -0.54 between sex and survived. So I split Sex further into two indicator variables, Male and Female (each gets a 1 if the passenger is of that gender and a 0 otherwise). This time it just shows a correlation of 0.54 between female and survived, and -0.54 between male and survived. So it is definitely not the same as what people get when they use Python and R.
I'm aware that technically binary variables like survived are not ideally suited to using Pearson correlation, since it is intended for continuous variables. But it is still helpful as part of the EDA process. Does anyone know how to get SAS to produce the same results that Python and R produced? Thanks.
PROC CORR DATA=train PLOTS=SCATTER;
VAR survived sex female male;
RUN;
Accepted Solutions
It looks like you are trying to calculate the survival rate by sex. Here is how to get those rates, along with a chi-square test that answers the question of whether the survival rate differed by sex:
proc freq data=train;
table survived*sex / norow nocum nopercent chisq;
run;
Try this to see if you get the values that you expect.
Thank you - using the two-way Proc Freq that you suggested does produce the same result (the bottom line shows 74.2% survival for females and 18.89% for males).
Survived     Sex
             0         1         Total
0            81        468       549
             25.8      81.11
1            233       109       342
             74.2      18.89
I read over the Python code that people had used on Kaggle and realised that they didn't use a correlation technique at all; they just used a pivot table and a cross tab, and then put a title above it labelling it "Survival Correlation by Sex." I was really confused because I know statistical packages can't create a real Pearson correlation coefficient unless the variables are continuous. All the Kaggle people are actually doing is creating a table of percentages, which isn't a correlation at all 🙂
Thanks for your help.
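For anyone who wants to see the two numbers side by side, here is a minimal sketch in plain Python (standard library only). It rebuilds one record per passenger from the four cell counts in the PROC FREQ table above, then computes both the per-sex survival rate (the figure in the Kaggle table) and an ordinary Pearson coefficient (the figure PROC CORR reported). The helper names `survival_rate` and `pearson` are made up for this illustration and are not taken from the Kaggle notebooks, which reportedly used a pivot table and a cross tab.

```python
import math

# Cell counts taken from the PROC FREQ table in this thread:
# (survived, sex) -> count, with sex coded 0 = female, 1 = male.
counts = {(0, 0): 81, (0, 1): 468, (1, 0): 233, (1, 1): 109}

# Expand the counts back into one (survived, sex) pair per passenger.
rows = [pair for pair, n in counts.items() for _ in range(n)]
survived = [s for s, _ in rows]
sex = [x for _, x in rows]


def survival_rate(sex_code):
    """Mean of survived within one sex group (hypothetical helper)."""
    group = [s for s, x in rows if x == sex_code]
    return sum(group) / len(group)


def pearson(xs, ys):
    """Plain Pearson correlation coefficient (hypothetical helper)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


female_rate = survival_rate(0)
male_rate = survival_rate(1)

# Pearson r between survived and the 0/1 sex code (1 = male).
r_male = pearson(survived, sex)

# A female indicator is just 1 - sex, so its r has the opposite sign.
female = [1 - x for x in sex]
r_female = pearson(survived, female)

print(f"female survival rate: {female_rate:.6f}")     # 0.742038
print(f"male survival rate:   {male_rate:.6f}")       # 0.188908
print(f"Pearson r (survived, sex):    {r_male:.2f}")  # -0.54
print(f"Pearson r (survived, female): {r_female:.2f}")  # 0.54
```

The first two lines of output match the Kaggle table and the column percentages from PROC FREQ; the last two match the PROC CORR output. So the two tools agree once they are asked for the same quantity.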