BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
mconover
Quartz | Level 8

OK so bear with me, I think this is the simplest way to explain what's going on.

I have a dataset (work.DATA_stacked) that has 24,154 observations.  When I apply the SQL code below, it correctly collapses to 10,082 observations.  The max(xx) functions summarize the variables, and I retain information on enrolid, svcdate, sex, age, and region.

          PROC SQL;

               create table work.DATA_collapsed as

               select enrolid, svcdate, sex, age, region,

                    max(cmg) as cmg,

                    max(vp) as vp,

                    max(upp) as upp,

                    max(emg) as emg,

                    max(ufr) as ufr,

                    max(vud) as vud

               from work.DATA_stacked

               group by enrolid, svcdate, sex, age, region;

          QUIT;

          RUN;

Now I want to retain an additional variable (stdprov) using the SQL code shown below.  However, when I do so I get a final collapsed dataset with 10,480 observations.

Please note, I do NOT get any message in my SAS log saying "Note: The query requires remerging summary statistics back with original data."

          PROC SQL;

               create table work.DATA_collapsed as

               select enrolid, svcdate, sex, age, region, stdprov,

                    max(cmg) as cmg,   

                    max(vp) as vp,

                    max(upp) as upp,

                    max(emg) as emg,

                    max(ufr) as ufr,

                    max(vud) as vud

               from work.DATA_stacked

               group by enrolid, svcdate, sex, age, region, stdprov;

          QUIT;

          RUN;

I assumed this meant that there are observations that PROC SQL is trying to collapse that have different values for stdprov, but identical values for enrolid, svcdate, sex, age, region.  I assumed that PROC SQL generated a dataset which contains my collapsed data, duplicating observations where there were multiple different values for stdprov.

However, when I search for duplicates in the output dataset (work.DATA_collapsed) I can only account for 378 duplicates when I should get 10,480 - 10,082 = 398.

          DATA work.test;

               SET work.DATA_collapsed;   

               LENGTH keyvar $500;

               keyvar=enrolid||svcdate||sex||age||region;

               KEEP enrolid svcdate sex age region keyvar;

          RUN;

          PROC FREQ DATA=work.test;

               TABLES keyvar / noprint out=keylist;

          RUN;

          PROC PRINT DATA=work.keylist;

               WHERE count ge 2;

          RUN;

I really appreciate anybody who can help me out.  I just want to make sure I have a clear understanding of where my data is going and why.  I apologize if my description of the problem is long-winded.

Here is the log.  I've underlined and bolded the numbers that I discuss above:

NOTE: Remote submit to SERVER commencing.

1149  *THANKS FOR YOUR HELP!*;

1150  PROC SQL;

1151  create table work.DATA_collapsed as

1152  select enrolid, svcdate, sex, age, region,

1153          max(cmg) as cmg,

1154          max(vp) as vp,

1155          max(upp) as upp,

1156          max(emg) as emg,

1157          max(ufr) as ufr,

1158          max(vud) as vud

1159  from work.DATA_stacked

1160  group by enrolid, svcdate, sex, age, region;

NOTE: Table WORK.DATA_COLLAPSED created, with 10082 rows and 11 columns.

1161  QUIT;

NOTE: PROCEDURE SQL used (Total process time):

      real time           0.03 seconds

      cpu time            0.03 seconds

1162  RUN;

1163

1164  PROC SQL;

1165  create table work.DATA_collapsed as

1166  select enrolid, svcdate, sex, age, region, stdprov,

1167          max(cmg) as cmg,

1168          max(vp) as vp,

1169          max(upp) as upp,

1170          max(emg) as emg,

1171          max(ufr) as ufr,

1172          max(vud) as vud

1173  from work.DATA_stacked

1174  group by enrolid, svcdate, sex, age, region, stdprov;

NOTE: Table WORK.DATA_COLLAPSED created, with 10480 rows and 12 columns.

1175  QUIT;

NOTE: PROCEDURE SQL used (Total process time):

      real time           0.03 seconds

      cpu time            0.04 seconds

1176  RUN;

1177

1178  DATA work.test;

1179      SET work.DATA_collapsed;

1180      LENGTH keyvar $500;

1181      keyvar=enrolid||svcdate||sex||age||region;

1182      KEEP enrolid svcdate sex age region keyvar;

1183  RUN;

NOTE: Numeric values have been converted to character values at the places given by:

      (Line):(Column).

      1181:12   1181:21   1181:35

NOTE: There were 10480 observations read from the data set WORK.DATA_COLLAPSED.

NOTE: The data set WORK.TEST has 10480 observations and 6 variables.

NOTE: DATA statement used (Total process time):

      real time           0.01 seconds

      cpu time            0.01 seconds

1184  PROC FREQ DATA=work.test;

1185      TABLES keyvar / noprint out=keylist;

1186  RUN;

NOTE: There were 10480 observations read from the data set WORK.TEST.

NOTE: The data set WORK.KEYLIST has 10082 observations and 3 variables.   <-- THIS LOOKS RIGHT BUT MY CODE ONLY IDENTIFIES 378 below...

NOTE: PROCEDURE FREQ used (Total process time):

      real time           0.07 seconds

      cpu time            0.07 seconds

1187  PROC PRINT DATA=work.keylist;

1188      WHERE count ge 2;

1189  RUN;

NOTE: There were 378 observations read from the data set WORK.KEYLIST.

      WHERE count>=2;

NOTE: The PROCEDURE PRINT printed pages 73-81.

NOTE: PROCEDURE PRINT used (Total process time):

      real time           0.00 seconds

      cpu time            0.02 seconds

NOTE: Remote submit to SERVER complete.

1 ACCEPTED SOLUTION

Accepted Solutions
Vince28_Statcan
Quartz | Level 8


As they pointed, it simply means that your stdprov variable takes more than 2 distinct values for a given "key" or by group defined by your other 5 variables. If you do

proc sql;

     select sum(count)

     from work.keylist

     where count GE 2;

quit;

You should find the 398 you are looking for.

I suggest running something like

proc sql;

     select distinct stdprov

     from work.data_collapsed;

quit;

to help diagnostic what's wrong. If it is intended to be a binary variable, maybe at has an error code as a 3rd value or maybe some rows have missing values for stdprov

View solution in original post

4 REPLIES 4
PGStats
Opal | Level 21

What if you asked for WHERE count ge 3 in that last step? - PG

PG
Vince28_Statcan
Quartz | Level 8


As they pointed, it simply means that your stdprov variable takes more than 2 distinct values for a given "key" or by group defined by your other 5 variables. If you do

proc sql;

     select sum(count)

     from work.keylist

     where count GE 2;

quit;

You should find the 398 you are looking for.

I suggest running something like

proc sql;

     select distinct stdprov

     from work.data_collapsed;

quit;

to help diagnostic what's wrong. If it is intended to be a binary variable, maybe at has an error code as a 3rd value or maybe some rows have missing values for stdprov

mconover
Quartz | Level 8

Thank you for all your help everyone.  The stdprov variable has many potential values.  As I understand it, the missing 20 (398 - 378) are accounted for by instances where there are 3 or 4 different values for stdprov for a given by group defined by the rest of the variables. 

THANKS AGAIN!

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 1314 views
  • 6 likes
  • 4 in conversation