BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
dr2014
Quartz | Level 8

Hi All,

I am creating a sas dataset from a sql database that contains ids(ID) for different years (variable=Yr). In it, I have a variable called 'A' with specific values. There is another column 'B' which has values related to a specific value in column A. i.e for A='E' the values in B are 'E1', 'E2',' E3', 'E4', 'E5' and 'E6'. The dataset looks as follows.

 

 

ID

Yr

A

B

1

1

2001

E

E1

2

2

2001

E

E1

3

2

2001

E

E2

4

3

2001

E

E1

5

1

2002

E

E1

6

2

2002

E

E3

7

3

2002

E

E2

8

3

2002

E

E3

9

3

2002

E

E4

10

3

2002

E

E5

11

3

2002

E

E6

 

 I have to create new variables for each unique value in varibale 'B' using conditonal logicand assign a value of 'Y' if a particular value is selected in varibale 'B', such as :

if A='E' then do;

if B='E1' then S1='Y'; else

if B='E2' then S2='Y'; else

if B='E3' then S3='Y'; else

if B='E4' then S4='Y'; else

if B='E5' then S5='Y'; else

if B='E6' then S6='Y';

end;

 

After doing that and sorting by ID and Yr I get:

 

ID

Yr

A

B

S1

S2

S3

S4

S5

S6

1

1

2001

E

E1

Y

 

 

 

 

 

2

1

2002

E

E1

Y

 

 

 

 

 

3

2

2001

E

E1

Y

 

 

 

 

 

4

2

2001

E

E2

 

Y

 

 

 

 

5

2

2002

E

E3

 

 

Y

 

 

 

6

3

2001

E

E1

Y

 

 

 

 

 

7

3

2002

E

E2

 

Y

 

 

 

 

8

3

2002

E

E3

 

 

Y

 

 

 

9

3

2002

E

E4

 

 

 

Y

 

 

10

3

2002

E

E5

 

 

 

 

Y

 

11

3

2002

E

E6

 

 

 

 

 

Y

 

 

Now I need to set all the missing values for S1 to S6 to 0 and create the final dataset which has to have a single records for each unique id for each unique year, illustrated in the table below:

 

 

ID

Yr

S1

S2

S3

S4

S5

S6

1

1

2001

Y

N

N

N

N

N

2

1

2002

Y

N

N

N

N

N

3

2

2001

Y

Y

N

N

N

N

4

2

2002

N

N

Y

N

N

N

5

3

2001

Y

N

N

N

N

N

6

3

2002

N

Y

Y

Y

Y

Y

 

I tried using arrays but I am not getting the correct data:

%let crit= S1 S2 S3 S4 S5 S6;

 

array temp (*) $1 &crit;

length &crit $1 ;

by id yr;

if first.yr;

do x=1 to dim(temp);

if temp(x) = '' then temp(x) = 'N';

end;

 

I am not sure what is worong with the array. Please adise and of any other way to accomplish the final dataset. 

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

Glad it worked, because I was running out of ideas.

 

SET won't work because SET keeps each observation separate from the other observations.

 

UPDATE (briefly) requires both a master data set (D3 with OBS=0) and a transaction data set (the second D3).  Automatically, UPDATE takes observations from the transaction data set, applies the nonmissing values only, and outputs only when reaching the end of the BY group.  So it can collapse multiple transactions into a single observation.  The only reason the first D3 is needed here is because UPDATE requires exactly two data sets.  It's just a placeholder that has the right variable definitions, but no data is actually used from the first mention of D3.

View solution in original post

14 REPLIES 14
Astounding
PROC Star

So far, there's nothing in your code that would combine multiple observations into one.  Here's a way that takes the middle format of the data and converts it to the bottom format:

 

data want;

update have (obs=0 drop=a b) have (drop=a b);

by id year;

array s {6};

if last.year then do _n_=1 to 6;

   if s{_n_}=' ' then s{_n_}='N';

end;

run;

dr2014
Quartz | Level 8

Thanks for your answer. Yes, that works. I tweaked it a bit as:

%let crit= S1 S2 S3 S4 S5 S6;

array s {*} $1 &crit;

if last.yr then do

_n_=1 to dim(s);

if s{_n_}=' ' then s{_n_}='N';

end;

 

I have one more concern. Now, the variable 'A' has additional values such as F which has correspoding values on variable 'B' such a F1,F2,F3,F4,F5,F6. So now the dataset after applying conditional logic to create new variables (T1 to T6) looks like this:

 

ID

Yr

A

B

S1

S2

S3

S4

S5

S6

T1

T2

T3

T4

T5

T6

1

1

2001

E

E1

Y

 

 

 

 

 

Y

 

 

 

 

 

2

1

2002

E

E1

Y

 

 

 

 

 

Y

 

 

 

 

 

3

2

2001

E

E1

Y

 

 

 

 

 

Y

 

 

 

 

 

4

2

2001

E

E2

 

Y

 

 

 

 

 

Y

 

 

 

 

5

2

2002

E

E3

 

 

Y

 

 

 

 

 

Y

 

 

 

6

3

2001

E

E1

Y

 

 

 

 

 

Y

 

 

 

 

 

7

3

2002

E

E2

 

Y

 

 

 

 

 

Y

 

 

 

 

8

3

2002

E

E3

 

 

Y

 

 

 

 

 

Y

 

 

 

9

3

2002

E

E4

 

 

 

Y

 

 

 

 

 

Y

 

 

10

3

2002

E

E5

 

 

 

 

Y

 

 

 

 

 

Y

 

11

3

2002

E

E6

 

 

 

 

 

Y

 

 

 

 

 

Y

 

I wanted to apply the arry statements to all variables S1 to S6 and T1 to T6:

If I use the array statemnets twice(see below) I dont get the right values for T1 to T6. All of them get a value of 'N' in the final dataset. I tried using a positonal parametre macro for the array elements and it didnlt works as well. Do you have any suggestions for this? Thanks in advance.

 

%let crit= S1 S2 S3 S4 S5 S6; 

%let selec=T1 T2 T3 T4 T5 T6; 

array s {*} $1 &crit;

if last.yr then do

_n_=1 to dim(s);

if s{_n_}=' ' then s{_n_}='N';

end;

array t (*) $1 &selec;

if last.yr then do

_n_=1 to dim(t);

if t{_n_}=' ' then t{_n_}='N';

end;

Astounding
PROC Star

The answer is going to be something very basic.  It's not being introduced by the complexities of the program. 

 

Since I can't see a PROC CONTENTS of your data set, I will have to guess.  Are the variables T1-T6 actually part of the data coming in?

dr2014
Quartz | Level 8

Thanks all for your replies. Here is the original dataset, code that created the new variables S1 to S6 and T1 to T6. I have to create a "line listing" from the "sas dataset" created from the SQL database and the requirements state the need for new variables S1 to S6 and T1 to T6. Variable 'B' is a multiselection on the user interface. I have to account for the values that were not selected in variable 'B' and set values in the new variables to  'N' and accordingly set to 'Y' if they were selected. Hope I am making sense here.

 

Here is the original dataset:

ID

Yr

A

B

1

2001

E

E1

2

2001

E

E1

2

2001

E

E2

3

2001

E

E1

1

2002

E

E1

2

2002

E

E3

3

2002

E

E2

3

2002

E

E3

3

2002

E

E4

3

2002

E

E5

3

2002

E

E6

1

2001

F

F1

2

2001

F

F1

2

2001

F

F2

3

2001

F

F1

1

2002

F

F1

2

2002

F

F3

3

2002

F

F2

3

2002

F

F3

3

2002

F

F4

3

2002

F

F5

3

2002

F

F6

 

Here is the conditional logic to create the new variables:

if A='E' then do;

if B='E1' then S1='Y'; else

if B='E2' then S2='Y'; else

if B='E3' then S3='Y'; else

if B='E4' then S4='Y'; else

if B='E5' then S5='Y'; else

if B='E6' then S6='Y';

end;

if A='F' then do;

if B='F1' then T1='Y'; else

if B='F2' then T2='Y'; else

if B='F3' then T3='Y'; else

if B='F4' then T4='Y'; else

if B='F5' then T5='Y'; else

if B='F6' then T6='Y';

end;

 

The final dataset has to look like this:

 

Id

Yr

S1

S2

S3

S4

S5

S6

T1

T2

T3

T4

T5

T6

1

2001

Y

N

N

N

N

N

Y

N

N

N

N

N

1

2002

Y

N

N

N

N

N

Y

N

N

N

N

N

2

2001

Y

N

N

N

N

N

Y

N

N

N

N

N

2

2002

N

N

Y

N

N

N

N

N

Y

N

N

N

3

2001

Y

N

N

N

N

N

Y

N

N

N

N

N

3

2002

N

Y

N

N

N

N

N

Y

N

N

N

N

 

@Astounding, I was looking to use the aray code you suggested to get the result, but using the array statemnets twice did not get me the right values for T1 to T6. What would be the best way to get this final dataset? Please let me know.

Astounding
PROC Star

It's a little tough to debug from a distance. The code should work, though.

 

When you work with the data just before the final version, what do you get from:

 

proc freq data=intermediate;

tables t1-t6;

run;

 

 

dr2014
Quartz | Level 8

By intermediate, I am assuming you mena the dataset after application of the conditional logic to care the new variables T1 to T6 and S1 to S6.ere are the resuts:

The SAS System

 

The FREQ Procedure
T1 Frequency
Y 4
Frequency Missing
= 18


T2 Frequency
Y 2
Frequency Missing
= 20


T3 Frequency
Y 2
Frequency Missing
= 20


T4 Frequency
Y 1
Frequency Missing
= 21


T5 Frequency
Y 1
Frequency Missing
= 21


T6 Frequency
Y 1
Frequency Missing
= 21

 

dr2014
Quartz | Level 8

@Astounding it worked! I used the array statemnets twice. Not sure why it didn't before. I have one question. This is the first time I used the 'update' statemnet. I tried using the 'set' statement and the output looked really crazy (see code below). Why dosen't it work with the set statement? Also,  I understand update statemnt needs to specify 2 datasets and the same dataset 'd3' was listed twice. How does it work here? I will study this at depth offcourse but a short answer would be helpful.

 

data d4(drop=a b);

*set d3;

update d3 (obs=0 drop=a b) d3 (drop=a b);

by id yr;

array s {*} $1 &crit;

if last.yr then do

_n_=1 to dim(s);

if s{_n_}=' ' then s{_n_}='N';

end;

array t (*) $1 &dual;

if last.yr then do

_n_=1 to dim(t);

if t{_n_}=' ' then t{_n_}='N';

end;

run;

Astounding
PROC Star

Glad it worked, because I was running out of ideas.

 

SET won't work because SET keeps each observation separate from the other observations.

 

UPDATE (briefly) requires both a master data set (D3 with OBS=0) and a transaction data set (the second D3).  Automatically, UPDATE takes observations from the transaction data set, applies the nonmissing values only, and outputs only when reaching the end of the BY group.  So it can collapse multiple transactions into a single observation.  The only reason the first D3 is needed here is because UPDATE requires exactly two data sets.  It's just a placeholder that has the right variable definitions, but no data is actually used from the first mention of D3.

dr2014
Quartz | Level 8

Thanks for your reply @Astounding. I uderstood the first d3 was a placeholder after reading about it.

dr2014
Quartz | Level 8

Yes @Astounding, thanks for your help!

ballardw
Super User

You say the T variables depend on the value of A but don't show where that is being considered. Perhaps you did not assign values to T1 etc before the array code?

 

BTW, It is not a good idea to habitually use _n_ as you are. _n_ is a SAS automatic variable that refers, in a loose sense, to the number of the row you have read from a set statment and if you are not extremely careful you may end up referencing a value you don't intend.

Patrick
Opal | Level 21

Below some code which hopefully sets you on the right path.

If you can't solve your "T" problem on your own then please use below data step and post a version which creates the additional source data. Then explain us the transformation logic required for the T's.

 

Please note: Below code will only work if the values in your B variable end with a number and if this numbering start with 1 and has no gaps. If you can't rely on this with your real data then you need to tell us as this would require some additional coding.

 

Can I ask why you need the data in this shape? There are quite a few procedures which could deal directly with the original data structure.

data have;
infile datalines truncover;
input ID Yr A $ B $;
datalines;
1 2001 E E1
2 2001 E E1
2 2001 E E2
3 2001 E E1
1 2002 E E1
2 2002 E E3
3 2002 E E2
3 2002 E E3
3 2002 E E4
3 2002 E E5
3 2002 E E6
;
run;

proc sql noprint;
  select count(distinct B) into :S_array_dim
  from have
  where A='E'
  ;
quit;

data want(drop=A B _:);
  set have(where=(A='E'));
  by Yr ID;

  array S (&S_array_dim) $1 ;
  retain S;
  if first.id then 
    do _i=1 to dim(s);
      s[_i]='N';
    end;
  if A='E' then
    do;
      _ind=input(substr(B,2),best32.);
      S[_ind]='Y';
    end;

  if last.id then output;
run;

 

dr2014
Quartz | Level 8

Thanks for your reply Patrick. The actual new variables do not end with a number. Also,  I provided the explanation to your questions. Hope that helps.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

The question is why do you want all those variables?  If there is one option per group then the following structure will contain exactly the same information, at reduced resource cost and be far easier to program with:

ID YR     A    B    S   T

1   2001  E    E1  1   1

...

4   2001  E    E2  2   2

...

 

You can where or if clauses off the data, do aggregates grouping by those variables etc. and programs only need to know there is an S variable and T variable.  So where is the benefit of creating X amount of variables to show they don't hold anything?

sas-innovate-wordmark-2025-midnight.png

Register Today!

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 14 replies
  • 4088 views
  • 2 likes
  • 5 in conversation