Solved: Re: Rotate from "standing" to "lying" dataset using arrays

mvalsamis

Prodgrp	Outward_postcode	Mon_year	avg_balance
CUR	AA1	JAN2000	0
CUR	AA1	FEB2000	0
CUR	AA1	MAR2000	0
CUR	AA1	APR2000	0
CUR	AA1	MAY2000	0
CUR	AA1	JUN2000	0
CUR	AA1	JUL2000	0
CUR	AA1	AUG2000	0
CUR	AA1	SEP2000	0
CUR	AA1	OCT2000	0
CUR	AA1	NOV2000	5

Hi to all the community! The table above is a short version of a table named WORK.AVG_PER_PRODGRP, which shows the average balances for each product group and outward postcode (data from a hypothetical bank).

Variable PRODGRP is the product group and has 4 values: CUR, LOA, SAV, VIS.

Variable OUTWARD_POSTCODE is the first part of a British postcode (showing a greater area). The code AA1 above is the outward postcode in the first by-group.

Variable MON_YEAR is a character variable (as requested by the client) showing the month of the year 2000. Assume that today it's 1st December 2000.

Finally, variable AVG_BALANCE is the average balance per PRODGRP and OUTWARD_POSTCODE. It is presented at the end of the examined period (so, NOV2000). For all the other months, AVG_BALANCE=0.

What I want to do now, is to rotate the dataset from a "standing" (or "tall") form to a "lying down" or "fat form". In the place of AVG_BALANCE, 4 new variables need to be created: CUR, LOA, SAV and VIS. The above example table must look like this:

Outward_postcode	Mon_year	CUR	LOA	SAV	VIS
AA1	JAN2000	0	0	0	0
AA1	FEB2000	0	0	0	0
AA1	MAR2000	0	0	0	0
AA1	APR2000	0	0	0	0
AA1	MAY2000	0	0	0	0
AA1	JUN2000	0	0	0	0
AA1	JUL2000	0	0	0	0
AA1	AUG2000	0	0	0	0
AA1	SEP2000	0	0	0	0
AA1	OCT2000	0	0	0	0
AA1	NOV2000	5	0	0	0

Finally, all missing values should be replaced with zeros.

I have made it successfully, using the following code:

proc sort data=work.avg_per_prodgrp;
by outward_postcode mon_year;
run;

data work.avg_rotated;
array bal(4) cur loa sav vis;
retain cur loa sav vis;
set work.avg_per_prodgrp;
by outward_postcode mon_year;
if prodgrp='CUR' then cur=avg_balance;
else if prodgrp='LOA' then loa=avg_balance;
else if prodgrp='SAV' then sav=avg_balance;
else vis=avg_balance;
do i=1 to dim(bal);
if bal(i)=. then bal(i)=0;
end;
if last.mon_year then output;
keep outward_postcode mon_year cur loa sav vis;
format cur loa sav vis comma12.2;
run;

Although I achieve the target, I use a lot of repetitive coding, which makes the code inefficient. I would like to replace the IF-THEN/ELSE statements with array(s) and DO loop(s), but I have not figured it out yet. Does anyone have any idea? Thank you very much in advance!

Tom

If you like that method then just generate a couple of macro variables with the list of PRODGRP values that you want to include. You could generate the list from the existing data, or have it already from some other source.

First let's convert your listing into an actual dataset. While we are at it lets introduce some other values of PRODGRP and some actual counts. Let's make sure MON_YEAR is an actual DATE variable so it will sort properly.

data avg_per_prodgrp;
  input Prodgrp $ Outward_postcode $ Mon_year :date. avg_balance ;
  format mon_year monyy7.;
cards;
CUR AA1 01JAN2000 1
CUR AA1 01FEB2000 2
CUR AA1 01MAR2000 3
CUR AA1 01APR2000 4
LOA AA1 01MAY2000 5
CUR AA1 01JUN2000 6
SAV AA1 01JUL2000 7
CUR AA1 01AUG2000 8
CUR AA1 01SEP2000 9
VIS AA1 01OCT2000 10
CUR AA1 01NOV2000 11
;

Now to get the list you can use PROC SQL to put the list into a macro variable. If the values of PRODGRP are not valid SAS names you might need two macro variables, one with the list of SAS names and one with the list of values.

proc sql noprint;
  select distinct prodgrp,nliteral(prodgrp)
   into :val_list separated by '|'
      , :var_list separated by ' '
  from have 
  order by 1
  ;
%let nvars=&sqlobs;
quit;

Now we can run your data step transpose:

data avg_rotated;
  do until (last.mon_year);
    set avg_per_prodgrp ;
    by outward_postcode mon_year;
    array bal &var_list ;
    if first.mon_year then do index=1 to dim(bal); bal[index]=0; end;
    index=findw("&val_list",prodgrp,'|','ite');
    if 1 <= index <= dim(bal) then bal[index]=avg_balance;
  end;
  drop prodgrp avg_balance index;
  format &var_list comma12.2;
run;

Results:

View solution in original post

PaigeMiller

Usually, we advise against converting long data sets to wide data sets, as this could make future steps more difficult. If you want to do this because you want a report (not a SAS data set), you should use PROC REPORT. So what do you want, a report or a SAS data set? If you really want a data set, use PROC TRANSPOSE (but really, don't — except in rare cases this is not a good thing to do).

Here is PROC REPORT code:

proc report data=have;
    columns outward_postcode mon_year prodgrp,avg_balance;
    define outward_postcode/group;
    define mon_year/group order=internal;
    define prodgrp/across;
    define avg_balance/sum;
run;

It would be helpful, since you want output with columns named CUR, LOA, SAV, VIS that you provide sample data with these four categories. It would also be helpful (actually we consider it mandatory from now on) for you to provide data in the form of working SAS data step code (instructions and examples) and not as screen capture or Excel files.

--
Paige Miller

mvalsamis

Thank you very much for your response. However, I am not intending to create a report, but a dataset. It is an exercise, not an actual case. I must not use PROC TRANSPOSE (although I did it and it worked perfectly), but instead I must only use DATA step. The DATA step that I wrote creates the dataset successfully, with the correct number of rows, columns and values. However, what I am looking for is to make it more efficient.

ballardw

@mvalsamis wrote:

Thank you very much for your response. However, I am not intending to create a report, but a dataset. It is an exercise, not an actual case. I must not use PROC TRANSPOSE (although I did it and it worked perfectly), but instead I must only use DATA step. The DATA step that I wrote creates the dataset successfully, with the correct number of rows, columns and values. However, what I am looking for is to make it more efficient.

Define what measure(s) you will apply to determine "more efficient". Without a rule to measure results against it is hard to know if some approach is more efficient.

Time to execute? Time to program and debug? Memory usage? Time to update programming? Number of lines of code?

Your basic array approach gets more inefficient as the number output variables increases, as pretty much every approach does. So you are attempting to write code that duplicates Proc Transpose without having the many man-hours (and roughly 40 years of improvements tested by actual users) that were invested by the developers at SAS to handle most of the routine actions needed for such.

If someone places a stack of $$$ on my desk then I might accept "must not use Proc Transpose" but I will tell them that there are many tradeoffs involved, such as having to rewrite or modify code every time an output column is added to the output.

Note: If YOUR Mon_year variable is character because a client want's it so then likely you have many other inefficient choices in your data. Your variable should be a date valued variable (with the MONYY7. format if you want to see values as the client sees them) and converted to character at the time the values are sent to the client.

Tom

If you like that method then just generate a couple of macro variables with the list of PRODGRP values that you want to include. You could generate the list from the existing data, or have it already from some other source.

First let's convert your listing into an actual dataset. While we are at it lets introduce some other values of PRODGRP and some actual counts. Let's make sure MON_YEAR is an actual DATE variable so it will sort properly.

data avg_per_prodgrp;
  input Prodgrp $ Outward_postcode $ Mon_year :date. avg_balance ;
  format mon_year monyy7.;
cards;
CUR AA1 01JAN2000 1
CUR AA1 01FEB2000 2
CUR AA1 01MAR2000 3
CUR AA1 01APR2000 4
LOA AA1 01MAY2000 5
CUR AA1 01JUN2000 6
SAV AA1 01JUL2000 7
CUR AA1 01AUG2000 8
CUR AA1 01SEP2000 9
VIS AA1 01OCT2000 10
CUR AA1 01NOV2000 11
;

Now to get the list you can use PROC SQL to put the list into a macro variable. If the values of PRODGRP are not valid SAS names you might need two macro variables, one with the list of SAS names and one with the list of values.

proc sql noprint;
  select distinct prodgrp,nliteral(prodgrp)
   into :val_list separated by '|'
      , :var_list separated by ' '
  from have 
  order by 1
  ;
%let nvars=&sqlobs;
quit;

Now we can run your data step transpose:

data avg_rotated;
  do until (last.mon_year);
    set avg_per_prodgrp ;
    by outward_postcode mon_year;
    array bal &var_list ;
    if first.mon_year then do index=1 to dim(bal); bal[index]=0; end;
    index=findw("&val_list",prodgrp,'|','ite');
    if 1 <= index <= dim(bal) then bal[index]=avg_balance;
  end;
  drop prodgrp avg_balance index;
  format &var_list comma12.2;
run;

Results:

mvalsamis

This works perfect, thanks Tom!

Kurt_Bremser

As a consultant, it is your duty to educate your client about the utter stupidity of storing date values as character (no calculations possible, do not sort correctly, ...).

As soon as you have intelligent data, sorting by outward_postcode and mon_year with a following PROC TRANSPOSE or PROC REPORT (without prior sorting!) will solve your issue quite nicely, with simple code, and no further action needed if there's an additional prodgrp.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mvalsamis

I understand what you mean, but this is just an exercise, not reality. But yes, numerical date variables are much more efficient and practical.

Ksharp

data avg_per_prodgrp;
  input Prodgrp $ Outward_postcode $ Mon_year :date. avg_balance ;
  format mon_year monyy7.;
cards;
CUR AA1 01JAN2000 1
CUR AA1 01FEB2000 2
CUR AA1 01MAR2000 3
CUR AA1 01APR2000 4
LOA AA1 01MAY2000 5
CUR AA1 01JUN2000 6
SAV AA1 01JUL2000 7
CUR AA1 01AUG2000 8
CUR AA1 01SEP2000 9
VIS AA1 01OCT2000 10
CUR AA1 01NOV2000 11
;
proc sql noprint;
select distinct catt('avg_per_prodgrp(where=(Prodgrp="',Prodgrp,'") rename=(avg_balance=',Prodgrp,'))')
 into :merge separated by ' '
 from avg_per_prodgrp;
quit;
data want;
merge &merge.;
by Outward_postcode  Mon_year;
drop Prodgrp;
run;
proc stdize data=want out=want2 reponly missing=0;run;

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!