OK so bear with me, I think this is the simplest way to explain what's going on.
I have a dataset (work.DATA_stacked) that has 24,154 observations. When I apply the SQL code below, it correctly collapses to 10,082 observations. The max(xx) functions summarize the variables, and I retain information on enrolid, svcdate, sex, age, and region.
PROC SQL;
create table work.DATA_collapsed as
select enrolid, svcdate, sex, age, region,
max(cmg) as cmg,
max(vp) as vp,
max(upp) as upp,
max(emg) as emg,
max(ufr) as ufr,
max(vud) as vud
from work.DATA_stacked
group by enrolid, svcdate, sex, age, region;
QUIT;
RUN;
Now I want to retain an additional variable (stdprov) using the SQL code shown below. However, when I do so I get a final collapsed dataset with 10,480 observations.
Please note, I do NOT get any message in my SAS log saying "Note: The query requires remerging summary statistics back with original data."
PROC SQL;
create table work.DATA_collapsed as
select enrolid, svcdate, sex, age, region, stdprov,
max(cmg) as cmg,
max(vp) as vp,
max(upp) as upp,
max(emg) as emg,
max(ufr) as ufr,
max(vud) as vud
from work.DATA_stacked
group by enrolid, svcdate, sex, age, region, stdprov;
QUIT;
RUN;
I assumed this meant that there are observations that PROC SQL is trying to collapse that have different values for stdprov, but identical values for enrolid, svcdate, sex, age, region. I assumed that PROC SQL generated a dataset which contains my collapsed data, duplicating observations where there were multiple different values for stdprov.
However, when I search for duplicates in the output dataset (work.DATA_collapsed) I can only account for 378 duplicates when I should get 10,480 - 10,082 = 398.
DATA work.test;
SET work.DATA_collapsed;
LENGTH keyvar $500;
keyvar=enrolid||svcdate||sex||age||region;
KEEP enrolid svcdate sex age region keyvar;
RUN;
PROC FREQ DATA=work.test;
TABLES keyvar / noprint out=keylist;
RUN;
PROC PRINT DATA=work.keylist;
WHERE count ge 2;
RUN;
I really appreciate anybody who can help me out. I just want to make sure I have a clear understanding of where my data is going and why. I apologize if my description of the problem is long-winded.
Here is the log. I've underlined and bolded the numbers that I discuss above:
NOTE: Remote submit to SERVER commencing.
1149 *THANKS FOR YOUR HELP!*;
1150 PROC SQL;
1151 create table work.DATA_collapsed as
1152 select enrolid, svcdate, sex, age, region,
1153 max(cmg) as cmg,
1154 max(vp) as vp,
1155 max(upp) as upp,
1156 max(emg) as emg,
1157 max(ufr) as ufr,
1158 max(vud) as vud
1159 from work.DATA_stacked
1160 group by enrolid, svcdate, sex, age, region;
NOTE: Table WORK.DATA_COLLAPSED created, with 10082 rows and 11 columns.
1161 QUIT;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
1162 RUN;
1163
1164 PROC SQL;
1165 create table work.DATA_collapsed as
1166 select enrolid, svcdate, sex, age, region, stdprov,
1167 max(cmg) as cmg,
1168 max(vp) as vp,
1169 max(upp) as upp,
1170 max(emg) as emg,
1171 max(ufr) as ufr,
1172 max(vud) as vud
1173 from work.DATA_stacked
1174 group by enrolid, svcdate, sex, age, region, stdprov;
NOTE: Table WORK.DATA_COLLAPSED created, with 10480 rows and 12 columns.
1175 QUIT;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.03 seconds
cpu time 0.04 seconds
1176 RUN;
1177
1178 DATA work.test;
1179 SET work.DATA_collapsed;
1180 LENGTH keyvar $500;
1181 keyvar=enrolid||svcdate||sex||age||region;
1182 KEEP enrolid svcdate sex age region keyvar;
1183 RUN;
NOTE: Numeric values have been converted to character values at the places given by:
(Line):(Column).
1181:12 1181:21 1181:35
NOTE: There were 10480 observations read from the data set WORK.DATA_COLLAPSED.
NOTE: The data set WORK.TEST has 10480 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
1184 PROC FREQ DATA=work.test;
1185 TABLES keyvar / noprint out=keylist;
1186 RUN;
NOTE: There were 10480 observations read from the data set WORK.TEST.
NOTE: The data set WORK.KEYLIST has 10082 observations and 3 variables. <-- THIS LOOKS RIGHT BUT MY CODE ONLY IDENTIFIES 378 below...
NOTE: PROCEDURE FREQ used (Total process time):
real time 0.07 seconds
cpu time 0.07 seconds
1187 PROC PRINT DATA=work.keylist;
1188 WHERE count ge 2;
1189 RUN;
NOTE: There were 378 observations read from the data set WORK.KEYLIST.
WHERE count>=2;
NOTE: The PROCEDURE PRINT printed pages 73-81.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.00 seconds
cpu time 0.02 seconds
NOTE: Remote submit to SERVER complete.
As they pointed, it simply means that your stdprov variable takes more than 2 distinct values for a given "key" or by group defined by your other 5 variables. If you do
proc sql;
select sum(count)
from work.keylist
where count GE 2;
quit;
You should find the 398 you are looking for.
I suggest running something like
proc sql;
select distinct stdprov
from work.data_collapsed;
quit;
to help diagnostic what's wrong. If it is intended to be a binary variable, maybe at has an error code as a 3rd value or maybe some rows have missing values for stdprov
What if you asked for WHERE count ge 3 in that last step? - PG
How many unique values of STDPROV are there?
As they pointed, it simply means that your stdprov variable takes more than 2 distinct values for a given "key" or by group defined by your other 5 variables. If you do
proc sql;
select sum(count)
from work.keylist
where count GE 2;
quit;
You should find the 398 you are looking for.
I suggest running something like
proc sql;
select distinct stdprov
from work.data_collapsed;
quit;
to help diagnostic what's wrong. If it is intended to be a binary variable, maybe at has an error code as a 3rd value or maybe some rows have missing values for stdprov
Thank you for all your help everyone. The stdprov variable has many potential values. As I understand it, the missing 20 (398 - 378) are accounted for by instances where there are 3 or 4 different values for stdprov for a given by group defined by the rest of the variables.
THANKS AGAIN!
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.