Help using Base SAS procedures

Why am I getting extra observations in my dataset when I try to collapse it using PROC SQL summary functions?

Accepted Solution Solved
Reply
Contributor
Posts: 50
Accepted Solution

Why am I getting extra observations in my dataset when I try to collapse it using PROC SQL summary functions?

OK so bear with me, I think this is the simplest way to explain what's going on.

I have a dataset (work.DATA_stacked) that has 24,154 observations.  When I apply the SQL code below, it correctly collapses to 10,082 observations.  The max(xx) functions summarize the variables, and I retain information on enrolid, svcdate, sex, age, and region.

          PROC SQL;

               create table work.DATA_collapsed as

               select enrolid, svcdate, sex, age, region,

                    max(cmg) as cmg,

                    max(vp) as vp,

                    max(upp) as upp,

                    max(emg) as emg,

                    max(ufr) as ufr,

                    max(vud) as vud

               from work.DATA_stacked

               group by enrolid, svcdate, sex, age, region;

          QUIT;

          RUN;

Now I want to retain an additional variable (stdprov) using the SQL code shown below.  However, when I do so I get a final collapsed dataset with 10,480 observations.

Please note, I do NOT get any message in my SAS log saying "Note: The query requires remerging summary statistics back with original data."

          PROC SQL;

               create table work.DATA_collapsed as

               select enrolid, svcdate, sex, age, region, stdprov,

                    max(cmg) as cmg,   

                    max(vp) as vp,

                    max(upp) as upp,

                    max(emg) as emg,

                    max(ufr) as ufr,

                    max(vud) as vud

               from work.DATA_stacked

               group by enrolid, svcdate, sex, age, region, stdprov;

          QUIT;

          RUN;

I assumed this meant that there are observations that PROC SQL is trying to collapse that have different values for stdprov, but identical values for enrolid, svcdate, sex, age, region.  I assumed that PROC SQL generated a dataset which contains my collapsed data, duplicating observations where there were multiple different values for stdprov.

However, when I search for duplicates in the output dataset (work.DATA_collapsed) I can only account for 378 duplicates when I should get 10,480 - 10,082 = 398.

          DATA work.test;

               SET work.DATA_collapsed;   

               LENGTH keyvar $500;

               keyvar=enrolid||svcdate||sex||age||region;

               KEEP enrolid svcdate sex age region keyvar;

          RUN;

          PROC FREQ DATA=work.test;

               TABLES keyvar / noprint out=keylist;

          RUN;

          PROC PRINT DATA=work.keylist;

               WHERE count ge 2;

          RUN;

I really appreciate anybody who can help me out.  I just want to make sure I have a clear understanding of where my data is going and why.  I apologize if my description of the problem is long-winded.

Here is the log.  I've underlined and bolded the numbers that I discuss above:

NOTE: Remote submit to SERVER commencing.

1149  *THANKS FOR YOUR HELP!*;

1150  PROC SQL;

1151  create table work.DATA_collapsed as

1152  select enrolid, svcdate, sex, age, region,

1153          max(cmg) as cmg,

1154          max(vp) as vp,

1155          max(upp) as upp,

1156          max(emg) as emg,

1157          max(ufr) as ufr,

1158          max(vud) as vud

1159  from work.DATA_stacked

1160  group by enrolid, svcdate, sex, age, region;

NOTE: Table WORK.DATA_COLLAPSED created, with 10082 rows and 11 columns.

1161  QUIT;

NOTE: PROCEDURE SQL used (Total process time):

      real time           0.03 seconds

      cpu time            0.03 seconds

1162  RUN;

1163

1164  PROC SQL;

1165  create table work.DATA_collapsed as

1166  select enrolid, svcdate, sex, age, region, stdprov,

1167          max(cmg) as cmg,

1168          max(vp) as vp,

1169          max(upp) as upp,

1170          max(emg) as emg,

1171          max(ufr) as ufr,

1172          max(vud) as vud

1173  from work.DATA_stacked

1174  group by enrolid, svcdate, sex, age, region, stdprov;

NOTE: Table WORK.DATA_COLLAPSED created, with 10480 rows and 12 columns.

1175  QUIT;

NOTE: PROCEDURE SQL used (Total process time):

      real time           0.03 seconds

      cpu time            0.04 seconds

1176  RUN;

1177

1178  DATA work.test;

1179      SET work.DATA_collapsed;

1180      LENGTH keyvar $500;

1181      keyvar=enrolid||svcdate||sex||age||region;

1182      KEEP enrolid svcdate sex age region keyvar;

1183  RUN;

NOTE: Numeric values have been converted to character values at the places given by:

      (Line)Smiley SadColumn).

      1181:12   1181:21   1181:35

NOTE: There were 10480 observations read from the data set WORK.DATA_COLLAPSED.

NOTE: The data set WORK.TEST has 10480 observations and 6 variables.

NOTE: DATA statement used (Total process time):

      real time           0.01 seconds

      cpu time            0.01 seconds

1184  PROC FREQ DATA=work.test;

1185      TABLES keyvar / noprint out=keylist;

1186  RUN;

NOTE: There were 10480 observations read from the data set WORK.TEST.

NOTE: The data set WORK.KEYLIST has 10082 observations and 3 variables.   <-- THIS LOOKS RIGHT BUT MY CODE ONLY IDENTIFIES 378 below...

NOTE: PROCEDURE FREQ used (Total process time):

      real time           0.07 seconds

      cpu time            0.07 seconds

1187  PROC PRINT DATA=work.keylist;

1188      WHERE count ge 2;

1189  RUN;

NOTE: There were 378 observations read from the data set WORK.KEYLIST.

      WHERE count>=2;

NOTE: The PROCEDURE PRINT printed pages 73-81.

NOTE: PROCEDURE PRINT used (Total process time):

      real time           0.00 seconds

      cpu time            0.02 seconds

NOTE: Remote submit to SERVER complete.


Accepted Solutions
Solution
‎08-10-2013 08:10 PM
Super Contributor
Posts: 339

Re: Why am I getting extra observations in my dataset when I try to collapse it using PROC SQL summary functions?


As they pointed, it simply means that your stdprov variable takes more than 2 distinct values for a given "key" or by group defined by your other 5 variables. If you do

proc sql;

     select sum(count)

     from work.keylist

     where count GE 2;

quit;

You should find the 398 you are looking for.

I suggest running something like

proc sql;

     select distinct stdprov

     from work.data_collapsed;

quit;

to help diagnostic what's wrong. If it is intended to be a binary variable, maybe at has an error code as a 3rd value or maybe some rows have missing values for stdprov

View solution in original post


All Replies
Respected Advisor
Posts: 4,644

Re: Why am I getting extra observations in my dataset when I try to collapse it using PROC SQL summary functions?

What if you asked for WHERE count ge 3 in that last step? - PG

PG
Super Contributor
Posts: 307

Re: Why am I getting extra observations in my dataset when I try to collapse it using PROC SQL summary functions?

How many unique values of STDPROV are there?

Solution
‎08-10-2013 08:10 PM
Super Contributor
Posts: 339

Re: Why am I getting extra observations in my dataset when I try to collapse it using PROC SQL summary functions?


As they pointed, it simply means that your stdprov variable takes more than 2 distinct values for a given "key" or by group defined by your other 5 variables. If you do

proc sql;

     select sum(count)

     from work.keylist

     where count GE 2;

quit;

You should find the 398 you are looking for.

I suggest running something like

proc sql;

     select distinct stdprov

     from work.data_collapsed;

quit;

to help diagnostic what's wrong. If it is intended to be a binary variable, maybe at has an error code as a 3rd value or maybe some rows have missing values for stdprov

Contributor
Posts: 50

Re: Why am I getting extra observations in my dataset when I try to collapse it using PROC SQL summary functions?

Thank you for all your help everyone.  The stdprov variable has many potential values.  As I understand it, the missing 20 (398 - 378) are accounted for by instances where there are 3 or 4 different values for stdprov for a given by group defined by the rest of the variables. 

THANKS AGAIN!

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 4 replies
  • 425 views
  • 6 likes
  • 4 in conversation