Buzzy_Bee
Quartz | Level 8

On the Kaggle website, users have posted the code and results they got when they tested the Titanic "train" data set using both R and Python. I wanted to try it in SAS, but I'm getting a different result for the correlation between survival and sex.

This is the result that Kaggle forum users got when they used Python or R to check the correlation between Survived (1=survived, 0=died) and Sex. For Sex, 0=female and 1=male (the data set originally comes with a character variable containing "male" or "female", but a binary nominal variable coded 0 and 1 is created from it).

Survival Correlation by: Sex
      Sex  Survived
0  female  0.742038
1    male  0.188908

I tried it in SAS using the PROC CORR step below, but my result showed a correlation of -0.54 between Sex and Survived. So I split Sex further into two indicator variables, Male and Female (each is 1 if the passenger is of that sex and 0 otherwise). This time it shows a correlation of 0.54 between Female and Survived, and -0.54 between Male and Survived. So it is definitely not the same as what people get when they use Python and R.

I'm aware that technically binary variables like survived are not ideally suited to using Pearson correlation, since it is intended for continuous variables. But it is still helpful as part of the EDA process. Does anyone know how to get SAS to produce the same results that Python and R produced? Thanks.

 

PROC CORR DATA=train PLOTS=SCATTER; 
VAR survived sex female male;
RUN;

Accepted Solution
PGStats
Opal | Level 21

It looks like you are trying to calculate the survival rate by sex. Here is how to get those rates, along with a test of whether the survival rate differed by sex:

 

proc freq data=train;
table survived*sex / norow nocum nopercent chisq;
run;

Try this to see if you get the values that you expect.
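
A related way to get the same rates, sketched here as an alternative and not part of the original answer: because survived is coded 0/1, its mean within each sex is the survival rate. This assumes the same train data set with a numeric survived variable, as described in the question.

```sas
/* The mean of a 0/1 variable is the proportion of 1s, */
/* so this reports the survival rate for each sex.     */
proc means data=train n mean maxdec=6;
  class sex;       /* one output row per level of sex */
  var survived;    /* assumed numeric: 1 = survived, 0 = died */
run;
```

The means should line up with the 0.742038 and 0.188908 figures quoted from Kaggle, since those are group proportions and not correlation coefficients.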

PG


Buzzy_Bee
Quartz | Level 8

Thank you. Using the two-way PROC FREQ that you suggested does produce the same result (the column percentages in the Survived=1 row show 74.2% survival for females and 18.89% for males).

Survived    Sex
            0        1        Total
0           81       468      549
            25.8     81.11
1           233      109      342
            74.2     18.89

I read over the Python code that people had used on Kaggle and realised that they didn't use a correlation technique at all; they just built a pivot table and a cross tab, then put the title "Survival Correlation by Sex" above it. I was really confused because Pearson correlation is intended for continuous variables, so I couldn't see how those figures could be correlation coefficients. All the Kaggle users are actually doing is creating a table of percentages, which isn't a correlation at all 🙂
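
To see both numbers side by side, here is a small sketch that rebuilds the 2x2 table from the four cell counts in the PROC FREQ output, so it does not need the full train data set. The data set name titanic_counts and the variable count are made up for illustration.

```sas
/* Hypothetical summary data set: one row per cell of the 2x2 table */
data titanic_counts;
  input survived sex count;   /* sex: 0 = female, 1 = male */
  datalines;
0 0 81
0 1 468
1 0 233
1 1 109
;
run;

/* Pearson correlation of the two 0/1 variables;        */
/* FREQ treats each row as if it occurred count times.  */
proc corr data=titanic_counts pearson;
  var survived sex;
  freq count;
run;

/* Survival percentages within each sex (column percents) */
proc freq data=titanic_counts;
  weight count;
  tables survived*sex / norow nopercent chisq;
run;
```

PROC CORR should report a coefficient of about -0.54, the value from the original question, while the column percentages from PROC FREQ give 74.2% and 18.89%. With the CHISQ option, PROC FREQ also prints a phi coefficient, which for a 2x2 table is the same quantity as the Pearson correlation between the two 0/1 variables.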

Thanks for your help.

 
