Solved: creation of dummy

amanjot_42 · Posted 01-30-2020 09:21 PM

Hi guys,

I am having a dataset containing firm id, year and audiit_year, and I want to create a dummy (new column) where year is equal to audit_year.

FirmID Year Audit_Year Dummy(what I want)

11 1990 . 0

11 1991 . 1

11 1992 1991 0

11 1993 . 1

11 1994 1993 0

11 1995 . 0

11 1996 . 0

12 2001 . 1

12 2002 2001 0

12 2003 . 0

12 2004 . 0

12 2005 . 0

12 2006 . 0

12 2007 . 0

12 2008 . 0

12 2009 . 0

Basically, I want to create a new dummy variable, where dummy=1 for the year when it is equal to the audit year (by firm ID).

Regards,

Amanjot

r_behata · Posted 01-30-2020 10:29 PM

data have;
input FirmID year Audit_Year;
cards;
11      1990       .                   0
11      1991       .                   1 
11      1992   1991                0
11      1993       .                   1
11      1994   1993                0
11      1995       .                   0  
11      1996       .                   0
12      2001       .                   1
12      2002   2001                0
12      2003      .                    0
12      2004      .                    0
12     2005       .                    0
12     2006       .                    0
12     2007       .                    0
12     2008       .                    0
12     2009       .                    0
;
run;

data want;


 do _n_=1 by 1 until(last.FirmID | ^ missing(Audit_Year));
 	set have;
 	by firmid notsorted;
 end;
 year_=Audit_Year;
 
 do _n_=1 to _n_;
 	set have;
 	by firmid;
 	
 	if year=year_ then Dummy=1;
 		else Dummy=0;
 	output;
 end;
 
 drop year_;
run;

View solution in original post

novinosrin · Posted 01-30-2020 09:35 PM

Hi @amanjot_42 Plain and straight forward SQL


data have;
input FirmID Year Audit_Year;
cards;
11      1990       .                   0
11      1991       .                   1 
11      1992   1991                0
11      1993       .                   1
11      1994   1993                0
11      1995       .                   0  
11      1996       .                   0
12      2001       .                   1
12      2002   2001                0
12      2003      .                    0
12      2004      .                    0
12     2005       .                    0
12     2006       .                    0
12     2007       .                    0
12     2008       .                    0
12     2009       .                    0
;

proc sql;
create table want as
select a.* ,a.year=b.audit_year as dummy
from have a left join have b
on a.firmid=b.firmid and a.year=b.audit_year
order by a.firmid, a.year;
quit;

/*Or a simple HASH table look up*/
data want ; 
 if _n_=1 then do;
   dcl hash H (dataset:'have(where=(audit_year^=.))') ;
   h.definekey  ("firmid","audit_year") ;
   h.definedone () ;
 end;
 set have;
 dummy=h.check()=0;
run;

/*Or a simple merge*/
data want;
 merge have have(in=b drop= year rename=audit_year=year where=(year^=.));
 by firmid year;
 dummy=b;
run;

mkeintz · Posted 01-30-2020 10:28 PM

This problem is a version of mixing two time series (the series of all records, and the (sub)series of non-missing audit_year records).

data want;
  merge have  
        have (keep=firmid audit_year rename=(audit_year=year) where=(year^=.) in=inaudit);
  by firmid year;
  dummy=inaudit;
run;

And you can strike out everything I said earlier below - overcomplicating the problem.

~~In this particular mix you want to~~

~~Read an audit record with non-missing audit_year.~~
~~Read and output ALL the records up to the record found in step 1 (i.e. the firmid and year match the firmid and audit_year of step 1). or otherwise up to the end of the data set.~~
~~Once the dummy is a 1 (matching firmid's and year=audit_year), go back and get the next audit record.~~

data have;
input FirmID Year Audit_Year;
cards;
11      1990       .
11      1991       .
11      1992   1991 
11      1993       .
11      1994   1993 
11      1995       .
11      1996       .
12      2001       .
12      2002   2001
12      2003      .
12      2004      .
12     2005       .
12     2006       .
12     2007       .
12     2008       .
12     2009       .
;


data want (drop=_:);
  if eoh1=0 then set have (where=(audit_year^=.) rename=(firmid=_firmid)) end=eoh1;
  else call missing(audit_year);
  _yr=audit_year;

  do until (end_of_have);
    set have end=end_of_have;
    dummy=ifn(year=_yr and firmid=_firmid,1,0);
    output;
    if dummy=1 then leave;
  end;
run;

~~The "if eoh1 then SET …" statement at the beginning is done to avoid premature stopping of the first SET (and therefore the DATA step), before the second SET has processed all the data.~~

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

r_behata · Posted 01-30-2020 10:29 PM

data have;
input FirmID year Audit_Year;
cards;
11      1990       .                   0
11      1991       .                   1 
11      1992   1991                0
11      1993       .                   1
11      1994   1993                0
11      1995       .                   0  
11      1996       .                   0
12      2001       .                   1
12      2002   2001                0
12      2003      .                    0
12      2004      .                    0
12     2005       .                    0
12     2006       .                    0
12     2007       .                    0
12     2008       .                    0
12     2009       .                    0
;
run;

data want;


 do _n_=1 by 1 until(last.FirmID | ^ missing(Audit_Year));
 	set have;
 	by firmid notsorted;
 end;
 year_=Audit_Year;
 
 do _n_=1 to _n_;
 	set have;
 	by firmid;
 	
 	if year=year_ then Dummy=1;
 		else Dummy=0;
 	output;
 end;
 
 drop year_;
run;

amanjot_42 · Posted 01-31-2020 10:23 PM

Hi,

Thanks for the help!

I have noticed, there is another column of audit year which looks like this:

Firm ID  Year    Audit_year        Dummy (what I want)
11      1990       .                   0
11      1991       .                   0 
11      1992   1993                    0
11      1993       .                   1
11      1994   1995                    0
11      1995       .                   1  
11      1996       .                   0
12      2001       .                   0
12      2002   2003                    0
12      2003      .                    1
12      2004      .                    0
12     2005       .                    0
12     2006       .                    0
12     2007       .                    0
12     2008       .                    0
12     2009       .                    0

In the latter table, audit years are one year ahead of YEAR. Please help with respect to the codes

Regards,
Amanjot

s

r_behata · Posted 02-01-2020 01:25 PM

Hi @amanjot_42

I think the solutions posted by @mkeintz and @novinosrin will handle both the scenarios for you . Feel free to change the answer if you think it is appropriate.

Personally I like the below solution.

data want;
  merge have  
        have (keep=firmid audit_year rename=(audit_year=year) where=(year^=.) in=inaudit);
  by firmid year;
  dummy=inaudit;
run;

amanjot_42 · Posted 02-01-2020 03:07 PM

Hi,

Thanks for getting back,

The latter procedures are not working as expected (@mkeintz and @novinosrin). They are increasing the number of observations (a lot of duplicated records are getting generated). The codes provided by you worked perfectly well with respect to the earlier scenario. Can you please help with the codes for the recent scenario.

Regards,

Amanjot

r_behata · Posted 02-01-2020 05:03 PM

Let me know if this works :

data have;
input FirmID year Audit_Year;
cards;
11      1990       .                   
11      1991       .                    
11      1992   1993                
11      1993       .               
11      1994   1995                
11      1995       .                 
11      1996       .               
12      2001       .               
12      2002   2003                
12      2003      .                
12      2004      .                
12     2005       .                
12     2006       .                
12     2007       .                
12     2008       .                
12     2009       .          
;
run;

data want;


 do _n_=1 by 1 until(last.FirmID | brk );
 	set have;
 	by firmid notsorted;
 	if ^missing(Audit_Year) then year_=Audit_Year;
 	
 	if year_=year then brk=1;
 	

 end;

 do _n_=1 to _n_; 
  	set have; 
  	by firmid;

  	if  year_=year then Dummy=1; 
  		else Dummy=0; 
  	output; 
  end; 
 
 
 drop year_ brk;
 
run;

amanjot_42 · Posted 02-01-2020 09:38 PM

Hi,

Thank you so much,

It worked really well!

I have one more question - for my another analysis (with a similar dataset - but having multiple observations for a particular year, say 3 rows for the year 2006. So, in this case, if I run the same codes for the audit year dummies, then it put 1 for only the first row of 2006. Rest of the two rows are depicting 0.

Please help in this regard!

Regards,

Cheers

r_behata · Posted 02-02-2020 09:55 PM

Please Open a new thread and post some sample data.

amanjot_42 · Posted 02-03-2020 12:17 AM

Thanks,

I will do that!

Regards

creation of dummy

Re: creation of dummy

Re: creation of dummy

Re: creation of dummy

Re: creation of dummy

Re: creation of dummy

Re: creation of dummy

Re: creation of dummy

Re: creation of dummy

Re: creation of dummy

Re: creation of dummy

Re: creation of dummy

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away