Re: Proc transpose understanding

David_Billa · Posted 06-14-2022 09:40 AM

Input dataset METRIC4 has 10258 observations and 30 variables in SAS and the output dataset METRIC5 has 131612 observations and 13 variables. My understanding is output METRIC5 is supposed to have 133354 observations since 10258 observations get multiplied by 13 variables.

I'd like to understand the record mismatch in the output dataset. Any insights?

PROC TRANSPOSE DATA=WORK.METRIC4
     OUT=WORK.METRIC5 (rename=(Col1=Value) drop=_LABEL_) name=Period;
     BY REG NM DESC METRIC_LEVEL_NM LEVEL_DESC NESTED_TYPE 
     NESTED_NM TARGETTYPE YR;
     VAR P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 YTD; 
RUN;

ballardw · Posted 06-14-2022 10:07 AM

How many unique value combinations of the BY variables are there? How many levels of the VAR variables occur for each combination of the by variables.

Lets look a a small enough example to see that the combinations and number of levels is important. The Sashelp.class data set you should have has 19 observations so is small enough to look at.

Proc sort data=sashelp.class
          out=work.class;
   by sex;
run;

proc transpose data=work.class out=trans;
   by sex;
   var age;
run;

And running this code the log and output data set will show 2 observations. One for each level of the BY variable, Sex.

Note that the transposed AGEs create up to 10 variables (one for each record that had the same value for Sex).

Since you have 9 By variables and your output data set contains 13, then it would appear that for some combinations of your BY variables that the value of one or more of your VAR variables has 3 levels. Do you now have variables Value1 through Value3 in your data? Is that what you expect?

David_Billa · Posted 06-14-2022 12:02 PM

@ballardw Is this the only possible reason for the record count in output table? Any idea how we can code the same logic in R programming?

Since you have 9 By variables and your output data set contains 13, then it would appear that for some combinations of your BY variables that the value of one or more of your VAR variables has 3 levels.

I don't understand this line, Do you now have variables Value1 through Value3 in your data? Is that what you expect?

ballardw · Posted 06-14-2022 12:36 PM

Don't speak R enough to know.

Provide a small example data set and the expected result. Perhaps you need something other than a basic Proc Transpose. To get a row*number of variables sized output data with Proc Transpose the values of the BY variables cannot have duplicates. To force such a behavior you can manually code something like this:

data want;
   set sashelp.class;
   array n (*) age height weight;
   do i=1 to dim(n);
      var=vname(n[i]);
      value= n[i];
      output;
   end;
   keep sex var value;
run;

3 var variables (age height weight) * 19 observations = 57 records.

Untested code since I do not have your data set to do what it sounds like you want:

data work.metric5;
   set work.metric4;
   array p(*) P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 YTD;
   do i=1 to dim(n);
      Period=vname(p[i]);
      value= n[i];
      output;
   end;
   keep  REG NM DESC METRIC_LEVEL_NM LEVEL_DESC NESTED_TYPE 
     NESTED_NM TARGETTYPE YR period value;
run;

don't understand this line, Do you now have variables Value1 through Value3 in your data? Is that what you expect?

The code you show, depending on the values of your data would have the by variables, 9 of them, the Period containing the name of the original variable and columns named Value. Since your note said that your output data set has 13 variables the By and Period are 10. That leaves up to 3 variable holding the values, you rename one to Value (I misread your original code). So there are two other variables, likely Col2 and and Col3. That means that some combinations of your BY variables have multiple values for some of the Var variables. So you get multiple columns, i.e. FEWER rows than expected.

Tom · Posted 06-14-2022 12:29 PM

@David_Billa wrote:

Input dataset METRIC4 has 10258 observations and 30 variables in SAS and the output dataset METRIC5 has 131612 observations and 13 variables. My understanding is output METRIC5 is supposed to have 133354 observations since 10258 observations get multiplied by 13 variables.

I'd like to understand the record mismatch in the output dataset. Any insights?
PROC TRANSPOSE DATA=WORK.METRIC4
     OUT=WORK.METRIC5 (rename=(Col1=Value) drop=_LABEL_) name=Period;
     BY REG NM DESC METRIC_LEVEL_NM LEVEL_DESC NESTED_TYPE 
     NESTED_NM TARGETTYPE YR;
     VAR P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 YTD; 
RUN;

If your BY variables uniquely identify the observations in your dataset then the number of observations generated will be NOBS*NVARS. But it some of the variables observations share the same values of the BY variables then you will get extra columns instead. So if one by group had 3 observations the output will still be only 13 observations, but instead of just COL1 it will also have COL2 and COL3 to have the values for those 13 variables for all three of the observations in the group.

Sounds like you did not expect there to be duplicate BY values, so the first thing is to investigate that.

proc sort data=metric4 nodupkey out=nodups outdups=dups;
     BY REG NM DESC METRIC_LEVEL_NM LEVEL_DESC NESTED_TYPE 
     NESTED_NM TARGETTYPE YR;
run;

If you really want each observation replicated 13 times then make sure the BY variables uniquely identify the observations.

You could add a ROW number variable to the dataset and include that in the BY variable list.

data metric4_unique;
  row+1;
  set metric4;
run;
PROC TRANSPOSE DATA=WORK.METRIC4_unique
     OUT=WORK.METRIC5 (rename=(Col1=Value) drop=_LABEL_) name=Period;
     BY row ....

Or just skip the PROC and do it yourself.

data metric5 ;
   set metric4 ;
   keep REG NM DESC METRIC_LEVEL_NM LEVEL_DESC NESTED_TYPE 
     NESTED_NM TARGETTYPE YR period Value 
   ;
   array x P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 P11 P12 YTD; 
   length period $32 value 8;
   do index=1 to dim(x);
      period =vname(x[index]);
      value = x[index];
      output;
   end;
run;

Proc transpose understanding