Solved: Create identifier by ID variable

PeterBr · Posted 08-29-2020 10:32 AM

Hi All,

I have the following have data where I only care about codes a, b, and c. I have an identifier variable for each of the codes that are 1 or 0 for whether that data row has the specific code. However, I'd like that identifier variable for each code to be 1 for every row of the ID if that ID had the code at some point. I've included the want data below. Thanks!

data have;
   input id $ code $ code_a code_b code_c ;
   datalines;
1 a 1 0 0
1 b 0 1 0
1 c 0 0 1
2 b 0 1 0
2 f 0 0 0
3 a 1 0 0
3 c 0 0 1
;
run;

data want;
   input id $ code $ code_a code_b code_c ;
   datalines;
1 a 1 1 1
1 b 1 1 1
1 c 1 1 1
2 b 0 1 0
2 f 0 1 0
3 a 1 0 1
3 c 1 0 1
;
run;

PeterClemmensen · Posted 08-29-2020 10:44 AM

data have;
   input id $ code $ code_a code_b code_c ;
   datalines;
1 a 1 0 0
1 b 0 1 0
1 c 0 0 1
2 b 0 1 0
2 f 0 0 0
3 a 1 0 0
3 c 0 0 1
;

data want(drop=a b c);
   do until (last.id);
      set have;
      by id;
      a = max(code_a, a);
      b = max(code_b, b);
      c = max(code_c, c);
   end;

   do until (last.id);
      set have;
      by id;
      code_a = a;
      code_b = b;
      code_c = c;
      output;
   end;
run;

The DATA to DATA Step Macro
Blog: SASnrd

View solution in original post

Kurt_Bremser · Posted 08-29-2020 10:41 AM

proc summary data=have;
class id;
var code_:;
output
  out=want (drop=_type_ _freq_)
  max()=
;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

PeterClemmensen · Posted 08-29-2020 10:44 AM

data have;
   input id $ code $ code_a code_b code_c ;
   datalines;
1 a 1 0 0
1 b 0 1 0
1 c 0 0 1
2 b 0 1 0
2 f 0 0 0
3 a 1 0 0
3 c 0 0 1
;

data want(drop=a b c);
   do until (last.id);
      set have;
      by id;
      a = max(code_a, a);
      b = max(code_b, b);
      c = max(code_c, c);
   end;

   do until (last.id);
      set have;
      by id;
      code_a = a;
      code_b = b;
      code_c = c;
      output;
   end;
run;

The DATA to DATA Step Macro
Blog: SASnrd

mkeintz · Posted 08-29-2020 11:14 AM

Here is a program that takes advantage of how the sas data step populates a PDV (program data vector, think of it as a list of variables.

The strategy below does the following:

Merges HAVE with 3 subsets of Have with matching ID's
1. subset with code_a=1, and with 2 variables (id and code_a).
2. subset with code_b=1, and with 2 variables (id and code_b).
3. subset with code_c=1, and with 2 variables (id and code_c).
  The very first observation for each id will have a pattern of 1's and 0's for the codes the you want, but subsequent observations produced by merge would not.
With the desired codes in hand (i.e. when the record-in-hand is the first produced by merge), read in all the other records for the id (except the code variables) and output the records.

data have;
   input id $ code $ code_a code_b code_c ;
   datalines;
1 a 1 0 0
1 b 0 1 0
1 c 0 0 1
2 b 0 1 0
2 f 0 0 0
3 a 1 0 0
3 c 0 0 1
;
run;

data want;
  merge have
        have (keep=id code_a where=(code_a=1))
        have (keep=id code_b where=(code_b=1))
        have (keep=id code_c where=(code_c=1)) ;
  by id;
  if first.id then do until (last.id);
    set have (drop=code_a code_b code_c);
	by id;
	output;
  end;
run;

Ordinarily I would suggest using the proc summary solutioni offered by @Kurt_Bremser , but it produces one record per id, which you would have to re-merge with have if you want the original number of records.

And this program may look a little strange compared to the well-known double do offered by draycut. I'd suggest sticking with @PeterClemmensen's suggestion. But if you need to scale to situations with multiple variables for each code (e.g. code_a1 code_a2 code_a3 code_a4 code_a5, code_b1 code_b2 code_b3 code_b4 code_b5, code_c1 code_c2 code_c3 code_c4 code_c5 ), then the code above might be easier to maintain.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

PeterClemmensen · Posted 08-29-2020 02:42 PM

@mkeintz very nice.

The DATA to DATA Step Macro
Blog: SASnrd

novinosrin · Posted 08-29-2020 11:31 AM

Hi @PeterBr



data have;
   input id $ code $ code_a code_b code_c ;
   datalines;
1 a 1 0 0
1 b 0 1 0
1 c 0 0 1
2 b 0 1 0
2 f 0 0 0
3 a 1 0 0
3 c 0 0 1
;
run;

data want;
 do _iorc_=1 by 1 until(last.id);
  set have;
  by id;
  array t  code_a code_b code_c ;
  array u(3) _temporary_;
  do _n_=1 to dim(t);
   if t(_n_) then u(_n_)=t(_n_);
  end;
 end;
 do _iorc_=1 to _iorc_;
  set have;
  do _n_=1 to dim(t);
   t(_n_)=^^u(_n_);
  end;
  output;
 end;
 call missing(of u(*));
run;

Ksharp · Posted 08-30-2020 06:45 AM

data have;
   input id $ code $ code_a code_b code_c ;
   datalines;
1 a 1 0 0
1 b 0 1 0
1 c 0 0 1
2 b 0 1 0
2 f 0 0 0
3 a 1 0 0
3 c 0 0 1
;
run;
proc sql;
create table want as
select id,code,
max(code_a) as code_a,
max(code_b) as code_b,
max(code_c) as code_c
from have 
 group by id;
quit;

Create identifier by ID variable

Re: Create identifier by ID variable

Re: Create identifier by ID variable

Re: Create identifier by ID variable

Re: Create identifier by ID variable

Re: Create identifier by ID variable

Re: Create identifier by ID variable

Re: Create identifier by ID variable

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!