Make new variables using if statements in proc sql

undercover_247 · Posted 10-31-2023 01:49 PM

I have a dataset that looks like this

ID	users_2016	users_2017	users_2018	users_2019	users_2020
jjones	1	1	0	0	0
asmith	0	1	1	1	0

I want a table that looks like this

Year	count_users_year
2016	1
2017	2

I want to do this in proc sql but I don't understand how to create a variable using a combination of if/then logic and creating a new variable. A users could be involved in this program for multiple years so I couldn't really do this in a datstep without multiple steps. Any advice?

ballardw · Posted 10-31-2023 02:08 PM

What happened to Id, users2018, users2019 and users2020?

SQL does not support "if". The Case statement, which allows conditional assignments in SQL doesn't really work across "rows" the way you may be thinking.

If you actually want all of your "years" in the result SQL is going to be pretty cumbersome to do this because the data as currently structured is not set up for SQL "group by" to get summaries such as total, which would want a single variable name Year(or similar) with the value of year AND another single variable to sum.

If this were my data:

data need;
   set have;
   array u (2016:2020) users_2016 - users_2020;
   do year= 2016 to 2020;
      value= u[year];
      output;
   end;
   keep year value;
run;

proc means data=need sum;
   class year;
   var value;
run;

Kurt_Bremser · Posted 10-31-2023 02:15 PM

You will need multiple steps, a you first have to correct the mistake of hiding data (years) in structure (variable names). Once you have transposed and extracted the year, further analysis will be simple.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

PGStats · Posted 10-31-2023 03:37 PM

Just for fun. If you really insist on using SQL :

data have;
input ID $   users_2016  users_2017  users_2018  users_2019  users_2020;
datalines;
jjones    1   1   0   0   0
asmith    0   1   1   1   0
;

proc sql;
create table want (year num, nbUsers num);
insert into want select 2016, count (distinct ID) from have where users_2016;
insert into want select 2017, count (distinct ID) from have where users_2017;
insert into want select 2018, count (distinct ID) from have where users_2018;
insert into want select 2019, count (distinct ID) from have where users_2019;
insert into want select 2020, count (distinct ID) from have where users_2020;
select * from want;
quit;

😁

PG

mkeintz · Posted 11-01-2023 10:40 AM

@undercover_247 wrote:

A users could be involved in this program for multiple years so I couldn't really do this in a datstep without multiple steps. Any advice?

I don't see why this needs more than a single data step:

data have;
input ID $   users_2016  users_2017  users_2018  users_2019  users_2020;
datalines;
jjones    1   1   0   0   0
asmith    0   1   1   1   0
;

data want (keep=year count);
  set have end=end_of_have;
  array totl {2016:2020} _temporary_;
  array usrs {2016:2020} users_2016-users_2020;

  do year=lbound(usrs) to hbound(usrs);
    totl{year}+usrs{year};
  end;
  if end_of_have;
  do year=lbound(usrs) to hbound(usrs);
    count=totl{year};
    output;
  end;
run;

This primarily benefits from not only the ability to declare arrays, but also from the ability to set lower and upper array bounds corresponding to the expected YEAR values.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

PaigeMiller · Posted 11-01-2023 10:46 AM

proc summary data=have;
    var users:;
    output out=want sum=;
run;
proc transpose data=want out=want1;
    var users:;
run;

Agreeing with @Kurt_Bremser that putting years into variable names is a particularly poor choice, perhaps not so much in this simple example, but certainly in the long run you want to AVOID years (or other calendar information) in variable names as it will make your programming more difficult.

With regard to @mkeintz's solution, certainly you can do this in one data step, but the code he provides is usually beyond the ability of most programmers to create, and is (in my opinion) rather complicated for what is a relatively simple problem. I don't want users (especially new SAS users) to get the impression that finding sums or counts requires such complicated code. In any case, I recommend using built-in SAS PROCs whenever possible, as I show above.

Complicated code or multi-step code is avoided by originally creating the data into a more SAS-friendly structure. Instead of years in the variable names, year should be a variable itself.

data have;
     input user $ year;
     cards;
jjones 2016
jjones 2017
asmith 2016
asmith 2017
asmith 2018
;

With this layout, the calculations are done with very simple code

proc freq data=have;
    tables year/noprint out=want(drop=percent);
run;

--
Paige Miller

Make new variables using if statements in proc sql

Re: Make new variables using if statements in proc sql

Re: Make new variables using if statements in proc sql

Re: Make new variables using if statements in proc sql

Re: Make new variables using if statements in proc sql

Re: Make new variables using if statements in proc sql

Make new variables using if statements in proc sql

Re: Make new variables using if statements in proc sql

Re: Make new variables using if statements in proc sql

Re: Make new variables using if statements in proc sql

Re: Make new variables using if statements in proc sql

Re: Make new variables using if statements in proc sql

Registration is open

SAS Training: Just a Click Away