Hi All,
I am creating a sas dataset from a sql database that contains ids(ID) for different years (variable=Yr). In it, I have a variable called 'A' with specific values. There is another column 'B' which has values related to a specific value in column A. i.e for A='E' the values in B are 'E1', 'E2',' E3', 'E4', 'E5' and 'E6'. The dataset looks as follows.
| 
 | ID | Yr | A | B | 
| 1 | 1 | 2001 | E | E1 | 
| 2 | 2 | 2001 | E | E1 | 
| 3 | 2 | 2001 | E | E2 | 
| 4 | 3 | 2001 | E | E1 | 
| 5 | 1 | 2002 | E | E1 | 
| 6 | 2 | 2002 | E | E3 | 
| 7 | 3 | 2002 | E | E2 | 
| 8 | 3 | 2002 | E | E3 | 
| 9 | 3 | 2002 | E | E4 | 
| 10 | 3 | 2002 | E | E5 | 
| 11 | 3 | 2002 | E | E6 | 
I have to create new variables for each unique value in varibale 'B' using conditonal logicand assign a value of 'Y' if a particular value is selected in varibale 'B', such as :
if A='E' then do;
if B='E1' then S1='Y'; else
if B='E2' then S2='Y'; else
if B='E3' then S3='Y'; else
if B='E4' then S4='Y'; else
if B='E5' then S5='Y'; else
if B='E6' then S6='Y';
end;
After doing that and sorting by ID and Yr I get:
| 
 | ID | Yr | A | B | S1 | S2 | S3 | S4 | S5 | S6 | 
| 1 | 1 | 2001 | E | E1 | Y | 
 | 
 | 
 | 
 | 
 | 
| 2 | 1 | 2002 | E | E1 | Y | 
 | 
 | 
 | 
 | 
 | 
| 3 | 2 | 2001 | E | E1 | Y | 
 | 
 | 
 | 
 | 
 | 
| 4 | 2 | 2001 | E | E2 | 
 | Y | 
 | 
 | 
 | 
 | 
| 5 | 2 | 2002 | E | E3 | 
 | 
 | Y | 
 | 
 | 
 | 
| 6 | 3 | 2001 | E | E1 | Y | 
 | 
 | 
 | 
 | 
 | 
| 7 | 3 | 2002 | E | E2 | 
 | Y | 
 | 
 | 
 | 
 | 
| 8 | 3 | 2002 | E | E3 | 
 | 
 | Y | 
 | 
 | 
 | 
| 9 | 3 | 2002 | E | E4 | 
 | 
 | 
 | Y | 
 | 
 | 
| 10 | 3 | 2002 | E | E5 | 
 | 
 | 
 | 
 | Y | 
 | 
| 11 | 3 | 2002 | E | E6 | 
 | 
 | 
 | 
 | 
 | Y | 
Now I need to set all the missing values for S1 to S6 to 0 and create the final dataset which has to have a single records for each unique id for each unique year, illustrated in the table below:
| 
 | ID | Yr | S1 | S2 | S3 | S4 | S5 | S6 | 
| 1 | 1 | 2001 | Y | N | N | N | N | N | 
| 2 | 1 | 2002 | Y | N | N | N | N | N | 
| 3 | 2 | 2001 | Y | Y | N | N | N | N | 
| 4 | 2 | 2002 | N | N | Y | N | N | N | 
| 5 | 3 | 2001 | Y | N | N | N | N | N | 
| 6 | 3 | 2002 | N | Y | Y | Y | Y | Y | 
I tried using arrays but I am not getting the correct data:
%let crit= S1 S2 S3 S4 S5 S6;
array temp (*) $1 &crit;
length &crit $1 ;
by id yr;
if first.yr;
do x=1 to dim(temp);
if temp(x) = '' then temp(x) = 'N';
end;
I am not sure what is worong with the array. Please adise and of any other way to accomplish the final dataset.
Glad it worked, because I was running out of ideas.
SET won't work because SET keeps each observation separate from the other observations.
UPDATE (briefly) requires both a master data set (D3 with OBS=0) and a transaction data set (the second D3). Automatically, UPDATE takes observations from the transaction data set, applies the nonmissing values only, and outputs only when reaching the end of the BY group. So it can collapse multiple transactions into a single observation. The only reason the first D3 is needed here is because UPDATE requires exactly two data sets. It's just a placeholder that has the right variable definitions, but no data is actually used from the first mention of D3.
So far, there's nothing in your code that would combine multiple observations into one. Here's a way that takes the middle format of the data and converts it to the bottom format:
data want;
update have (obs=0 drop=a b) have (drop=a b);
by id year;
array s {6};
if last.year then do _n_=1 to 6;
if s{_n_}=' ' then s{_n_}='N';
end;
run;
Thanks for your answer. Yes, that works. I tweaked it a bit as:
%let crit= S1 S2 S3 S4 S5 S6;
array s {*} $1 &crit;
if last.yr then do
_n_=1 to dim(s);
if s{_n_}=' ' then s{_n_}='N';
end;
I have one more concern. Now, the variable 'A' has additional values such as F which has correspoding values on variable 'B' such a F1,F2,F3,F4,F5,F6. So now the dataset after applying conditional logic to create new variables (T1 to T6) looks like this:
| 
 | ID | Yr | A | B | S1 | S2 | S3 | S4 | S5 | S6 | T1 | T2 | T3 | T4 | T5 | T6 | 
| 1 | 1 | 2001 | E | E1 | Y | 
 | 
 | 
 | 
 | 
 | Y | 
 | 
 | 
 | 
 | 
 | 
| 2 | 1 | 2002 | E | E1 | Y | 
 | 
 | 
 | 
 | 
 | Y | 
 | 
 | 
 | 
 | 
 | 
| 3 | 2 | 2001 | E | E1 | Y | 
 | 
 | 
 | 
 | 
 | Y | 
 | 
 | 
 | 
 | 
 | 
| 4 | 2 | 2001 | E | E2 | 
 | Y | 
 | 
 | 
 | 
 | 
 | Y | 
 | 
 | 
 | 
 | 
| 5 | 2 | 2002 | E | E3 | 
 | 
 | Y | 
 | 
 | 
 | 
 | 
 | Y | 
 | 
 | 
 | 
| 6 | 3 | 2001 | E | E1 | Y | 
 | 
 | 
 | 
 | 
 | Y | 
 | 
 | 
 | 
 | 
 | 
| 7 | 3 | 2002 | E | E2 | 
 | Y | 
 | 
 | 
 | 
 | 
 | Y | 
 | 
 | 
 | 
 | 
| 8 | 3 | 2002 | E | E3 | 
 | 
 | Y | 
 | 
 | 
 | 
 | 
 | Y | 
 | 
 | 
 | 
| 9 | 3 | 2002 | E | E4 | 
 | 
 | 
 | Y | 
 | 
 | 
 | 
 | 
 | Y | 
 | 
 | 
| 10 | 3 | 2002 | E | E5 | 
 | 
 | 
 | 
 | Y | 
 | 
 | 
 | 
 | 
 | Y | 
 | 
| 11 | 3 | 2002 | E | E6 | 
 | 
 | 
 | 
 | 
 | Y | 
 | 
 | 
 | 
 | 
 | Y | 
I wanted to apply the arry statements to all variables S1 to S6 and T1 to T6:
If I use the array statemnets twice(see below) I dont get the right values for T1 to T6. All of them get a value of 'N' in the final dataset. I tried using a positonal parametre macro for the array elements and it didnlt works as well. Do you have any suggestions for this? Thanks in advance.
%let crit= S1 S2 S3 S4 S5 S6;
%let selec=T1 T2 T3 T4 T5 T6;
array s {*} $1 &crit;
if last.yr then do
_n_=1 to dim(s);
if s{_n_}=' ' then s{_n_}='N';
end;
array t (*) $1 &selec;
if last.yr then do
_n_=1 to dim(t);
if t{_n_}=' ' then t{_n_}='N';
end;
The answer is going to be something very basic. It's not being introduced by the complexities of the program.
Since I can't see a PROC CONTENTS of your data set, I will have to guess. Are the variables T1-T6 actually part of the data coming in?
Thanks all for your replies. Here is the original dataset, code that created the new variables S1 to S6 and T1 to T6. I have to create a "line listing" from the "sas dataset" created from the SQL database and the requirements state the need for new variables S1 to S6 and T1 to T6. Variable 'B' is a multiselection on the user interface. I have to account for the values that were not selected in variable 'B' and set values in the new variables to 'N' and accordingly set to 'Y' if they were selected. Hope I am making sense here.
Here is the original dataset:
| ID | Yr | A | B | 
| 1 | 2001 | E | E1 | 
| 2 | 2001 | E | E1 | 
| 2 | 2001 | E | E2 | 
| 3 | 2001 | E | E1 | 
| 1 | 2002 | E | E1 | 
| 2 | 2002 | E | E3 | 
| 3 | 2002 | E | E2 | 
| 3 | 2002 | E | E3 | 
| 3 | 2002 | E | E4 | 
| 3 | 2002 | E | E5 | 
| 3 | 2002 | E | E6 | 
| 1 | 2001 | F | F1 | 
| 2 | 2001 | F | F1 | 
| 2 | 2001 | F | F2 | 
| 3 | 2001 | F | F1 | 
| 1 | 2002 | F | F1 | 
| 2 | 2002 | F | F3 | 
| 3 | 2002 | F | F2 | 
| 3 | 2002 | F | F3 | 
| 3 | 2002 | F | F4 | 
| 3 | 2002 | F | F5 | 
| 3 | 2002 | F | F6 | 
Here is the conditional logic to create the new variables:
if A='E' then do;
if B='E1' then S1='Y'; else
if B='E2' then S2='Y'; else
if B='E3' then S3='Y'; else
if B='E4' then S4='Y'; else
if B='E5' then S5='Y'; else
if B='E6' then S6='Y';
end;
if A='F' then do;
if B='F1' then T1='Y'; else
if B='F2' then T2='Y'; else
if B='F3' then T3='Y'; else
if B='F4' then T4='Y'; else
if B='F5' then T5='Y'; else
if B='F6' then T6='Y';
end;
The final dataset has to look like this:
| Id | Yr | S1 | S2 | S3 | S4 | S5 | S6 | T1 | T2 | T3 | T4 | T5 | T6 | 
| 1 | 2001 | Y | N | N | N | N | N | Y | N | N | N | N | N | 
| 1 | 2002 | Y | N | N | N | N | N | Y | N | N | N | N | N | 
| 2 | 2001 | Y | N | N | N | N | N | Y | N | N | N | N | N | 
| 2 | 2002 | N | N | Y | N | N | N | N | N | Y | N | N | N | 
| 3 | 2001 | Y | N | N | N | N | N | Y | N | N | N | N | N | 
| 3 | 2002 | N | Y | N | N | N | N | N | Y | N | N | N | N | 
@Astounding, I was looking to use the aray code you suggested to get the result, but using the array statemnets twice did not get me the right values for T1 to T6. What would be the best way to get this final dataset? Please let me know.
It's a little tough to debug from a distance. The code should work, though.
When you work with the data just before the final version, what do you get from:
proc freq data=intermediate;
tables t1-t6;
run;
By intermediate, I am assuming you mena the dataset after application of the conditional logic to care the new variables T1 to T6 and S1 to S6.ere are the resuts:
| The SAS System | 
| T1 | Frequency | 
|---|---|
| Y | 4 | 
| Frequency Missing = 18 | |
| T2 | Frequency | 
|---|---|
| Y | 2 | 
| Frequency Missing = 20 | |
| T3 | Frequency | 
|---|---|
| Y | 2 | 
| Frequency Missing = 20 | |
| T4 | Frequency | 
|---|---|
| Y | 1 | 
| Frequency Missing = 21 | |
| T5 | Frequency | 
|---|---|
| Y | 1 | 
| Frequency Missing = 21 | |
| T6 | Frequency | 
|---|---|
| Y | 1 | 
| Frequency Missing = 21 | |
@Astounding it worked! I used the array statemnets twice. Not sure why it didn't before. I have one question. This is the first time I used the 'update' statemnet. I tried using the 'set' statement and the output looked really crazy (see code below). Why dosen't it work with the set statement? Also, I understand update statemnt needs to specify 2 datasets and the same dataset 'd3' was listed twice. How does it work here? I will study this at depth offcourse but a short answer would be helpful.
data d4(drop=a b);
*set d3;
update d3 (obs=0 drop=a b) d3 (drop=a b);
by id yr;
array s {*} $1 &crit;
if last.yr then do
_n_=1 to dim(s);
if s{_n_}=' ' then s{_n_}='N';
end;
array t (*) $1 &dual;
if last.yr then do
_n_=1 to dim(t);
if t{_n_}=' ' then t{_n_}='N';
end;
run;
Glad it worked, because I was running out of ideas.
SET won't work because SET keeps each observation separate from the other observations.
UPDATE (briefly) requires both a master data set (D3 with OBS=0) and a transaction data set (the second D3). Automatically, UPDATE takes observations from the transaction data set, applies the nonmissing values only, and outputs only when reaching the end of the BY group. So it can collapse multiple transactions into a single observation. The only reason the first D3 is needed here is because UPDATE requires exactly two data sets. It's just a placeholder that has the right variable definitions, but no data is actually used from the first mention of D3.
Thanks for your reply @Astounding. I uderstood the first d3 was a placeholder after reading about it.
Yes @Astounding, thanks for your help!
You say the T variables depend on the value of A but don't show where that is being considered. Perhaps you did not assign values to T1 etc before the array code?
BTW, It is not a good idea to habitually use _n_ as you are. _n_ is a SAS automatic variable that refers, in a loose sense, to the number of the row you have read from a set statment and if you are not extremely careful you may end up referencing a value you don't intend.
Below some code which hopefully sets you on the right path.
If you can't solve your "T" problem on your own then please use below data step and post a version which creates the additional source data. Then explain us the transformation logic required for the T's.
Please note: Below code will only work if the values in your B variable end with a number and if this numbering start with 1 and has no gaps. If you can't rely on this with your real data then you need to tell us as this would require some additional coding.
Can I ask why you need the data in this shape? There are quite a few procedures which could deal directly with the original data structure.
data have;
infile datalines truncover;
input ID Yr A $ B $;
datalines;
1 2001 E E1
2 2001 E E1
2 2001 E E2
3 2001 E E1
1 2002 E E1
2 2002 E E3
3 2002 E E2
3 2002 E E3
3 2002 E E4
3 2002 E E5
3 2002 E E6
;
run;
proc sql noprint;
  select count(distinct B) into :S_array_dim
  from have
  where A='E'
  ;
quit;
data want(drop=A B _:);
  set have(where=(A='E'));
  by Yr ID;
  array S (&S_array_dim) $1 ;
  retain S;
  if first.id then 
    do _i=1 to dim(s);
      s[_i]='N';
    end;
  if A='E' then
    do;
      _ind=input(substr(B,2),best32.);
      S[_ind]='Y';
    end;
  if last.id then output;
run;
Thanks for your reply Patrick. The actual new variables do not end with a number. Also, I provided the explanation to your questions. Hope that helps.
The question is why do you want all those variables? If there is one option per group then the following structure will contain exactly the same information, at reduced resource cost and be far easier to program with:
ID YR A B S T
1 2001 E E1 1 1
...
4 2001 E E2 2 2
...
You can where or if clauses off the data, do aggregates grouping by those variables etc. and programs only need to know there is an S variable and T variable. So where is the benefit of creating X amount of variables to show they don't hold anything?
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.
