OK so bear with me, I think this is the simplest way to explain what's going on. I have a dataset (work.DATA_stacked) that has 24,154 observations. When I apply the SQL code below, it correctly collapses to 10,082 observations. The max(xx) functions summarize the variables, and I retain information on enrolid, svcdate, sex, age, and region. PROC SQL; create table work.DATA_collapsed as select enrolid, svcdate, sex, age, region, max(cmg) as cmg, max(vp) as vp, max(upp) as upp, max(emg) as emg, max(ufr) as ufr, max(vud) as vud from work.DATA_stacked group by enrolid, svcdate, sex, age, region; QUIT; RUN; Now I want to retain an additional variable (stdprov) using the SQL code shown below. However, when I do so I get a final collapsed dataset with 10,480 observations. Please note, I do NOT get any message in my SAS log saying "Note: The query requires remerging summary statistics back with original data." PROC SQL; create table work.DATA_collapsed as select enrolid, svcdate, sex, age, region, stdprov, max(cmg) as cmg, max(vp) as vp, max(upp) as upp, max(emg) as emg, max(ufr) as ufr, max(vud) as vud from work.DATA_stacked group by enrolid, svcdate, sex, age, region, stdprov; QUIT; RUN; I assumed this meant that there are observations that PROC SQL is trying to collapse that have different values for stdprov, but identical values for enrolid, svcdate, sex, age, region. I assumed that PROC SQL generated a dataset which contains my collapsed data, duplicating observations where there were multiple different values for stdprov. However, when I search for duplicates in the output dataset (work.DATA_collapsed) I can only account for 378 duplicates when I should get 10,480 - 10,082 = 398. DATA work.test; SET work.DATA_collapsed; LENGTH keyvar $500; keyvar=enrolid||svcdate||sex||age||region; KEEP enrolid svcdate sex age region keyvar; RUN; PROC FREQ DATA=work.test; TABLES keyvar / noprint out=keylist; RUN; PROC PRINT DATA=work.keylist; WHERE count ge 2; RUN; I really appreciate anybody who can help me out. I just want to make sure I have a clear understanding of where my data is going and why. I apologize if my description of the problem is long-winded. Here is the log. I've underlined and bolded the numbers that I discuss above: NOTE: Remote submit to SERVER commencing. 1149 *THANKS FOR YOUR HELP!*; 1150 PROC SQL; 1151 create table work.DATA_collapsed as 1152 select enrolid, svcdate, sex, age, region, 1153 max(cmg) as cmg, 1154 max(vp) as vp, 1155 max(upp) as upp, 1156 max(emg) as emg, 1157 max(ufr) as ufr, 1158 max(vud) as vud 1159 from work.DATA_stacked 1160 group by enrolid, svcdate, sex, age, region; NOTE: Table WORK.DATA_COLLAPSED created, with 10082 rows and 11 columns. 1161 QUIT; NOTE: PROCEDURE SQL used (Total process time): real time 0.03 seconds cpu time 0.03 seconds 1162 RUN; 1163 1164 PROC SQL; 1165 create table work.DATA_collapsed as 1166 select enrolid, svcdate, sex, age, region, stdprov, 1167 max(cmg) as cmg, 1168 max(vp) as vp, 1169 max(upp) as upp, 1170 max(emg) as emg, 1171 max(ufr) as ufr, 1172 max(vud) as vud 1173 from work.DATA_stacked 1174 group by enrolid, svcdate, sex, age, region, stdprov; NOTE: Table WORK.DATA_COLLAPSED created, with 10480 rows and 12 columns. 1175 QUIT; NOTE: PROCEDURE SQL used (Total process time): real time 0.03 seconds cpu time 0.04 seconds 1176 RUN; 1177 1178 DATA work.test; 1179 SET work.DATA_collapsed; 1180 LENGTH keyvar $500; 1181 keyvar=enrolid||svcdate||sex||age||region; 1182 KEEP enrolid svcdate sex age region keyvar; 1183 RUN; NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column). 1181:12 1181:21 1181:35 NOTE: There were 10480 observations read from the data set WORK.DATA_COLLAPSED. NOTE: The data set WORK.TEST has 10480 observations and 6 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds 1184 PROC FREQ DATA=work.test; 1185 TABLES keyvar / noprint out=keylist; 1186 RUN; NOTE: There were 10480 observations read from the data set WORK.TEST. NOTE: The data set WORK.KEYLIST has 10082 observations and 3 variables. <-- THIS LOOKS RIGHT BUT MY CODE ONLY IDENTIFIES 378 below... NOTE: PROCEDURE FREQ used (Total process time): real time 0.07 seconds cpu time 0.07 seconds 1187 PROC PRINT DATA=work.keylist; 1188 WHERE count ge 2; 1189 RUN; NOTE: There were 378 observations read from the data set WORK.KEYLIST. WHERE count>=2; NOTE: The PROCEDURE PRINT printed pages 73-81. NOTE: PROCEDURE PRINT used (Total process time): real time 0.00 seconds cpu time 0.02 seconds NOTE: Remote submit to SERVER complete.
... View more