BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
amanjot_42
Fluorite | Level 6

Hi guys,

I am having a dataset containing firm id, year and audiit_year, and I want to create a dummy (new column) where year is equal to audit_year.

 

FirmID Year Audit_Year Dummy(what I want)

11      1990       .                   0

11      1991       .                   1 

11      1992   1991                0

11      1993       .                   1

11      1994   1993                0

11      1995       .                   0  

11      1996       .                   0

12      2001       .                   1

12      2002   2001                0

12      2003      .                    0

12      2004      .                    0

12     2005       .                    0

12     2006       .                    0

12     2007       .                    0

12     2008       .                    0

12     2009       .                    0

 

Basically, I want to create a new dummy variable, where dummy=1 for the year when it is equal to the audit year (by firm ID).

 

Regards,

Amanjot 

1 ACCEPTED SOLUTION

Accepted Solutions
r_behata
Barite | Level 11
data have;
input FirmID year Audit_Year;
cards;
11      1990       .                   0
11      1991       .                   1 
11      1992   1991                0
11      1993       .                   1
11      1994   1993                0
11      1995       .                   0  
11      1996       .                   0
12      2001       .                   1
12      2002   2001                0
12      2003      .                    0
12      2004      .                    0
12     2005       .                    0
12     2006       .                    0
12     2007       .                    0
12     2008       .                    0
12     2009       .                    0
;
run;

data want;


 do _n_=1 by 1 until(last.FirmID | ^ missing(Audit_Year));
 	set have;
 	by firmid notsorted;
 end;
 year_=Audit_Year;
 
 do _n_=1 to _n_;
 	set have;
 	by firmid;
 	
 	if year=year_ then Dummy=1;
 		else Dummy=0;
 	output;
 end;
 
 drop year_;
run;

View solution in original post

10 REPLIES 10
novinosrin
Tourmaline | Level 20

Hi @amanjot_42  Plain and straight forward SQL

 

 


data have;
input FirmID Year Audit_Year;
cards;
11      1990       .                   0
11      1991       .                   1 
11      1992   1991                0
11      1993       .                   1
11      1994   1993                0
11      1995       .                   0  
11      1996       .                   0
12      2001       .                   1
12      2002   2001                0
12      2003      .                    0
12      2004      .                    0
12     2005       .                    0
12     2006       .                    0
12     2007       .                    0
12     2008       .                    0
12     2009       .                    0
;

proc sql;
create table want as
select a.* ,a.year=b.audit_year as dummy
from have a left join have b
on a.firmid=b.firmid and a.year=b.audit_year
order by a.firmid, a.year;
quit;

 

/*Or a simple HASH table look up*/
data want ; 
 if _n_=1 then do;
   dcl hash H (dataset:'have(where=(audit_year^=.))') ;
   h.definekey  ("firmid","audit_year") ;
   h.definedone () ;
 end;
 set have;
 dummy=h.check()=0;
run;

 

/*Or a simple merge*/
data want;
 merge have have(in=b drop= year rename=audit_year=year where=(year^=.));
 by firmid year;
 dummy=b;
run;
mkeintz
PROC Star

This problem is a version of mixing two time series (the series of all records, and the (sub)series of non-missing audit_year records). 

 

 

 

data want;
  merge have  
        have (keep=firmid audit_year rename=(audit_year=year) where=(year^=.) in=inaudit);
  by firmid year;
  dummy=inaudit;
run;

 

 

And you can strike out everything I said earlier below - overcomplicating the problem.

 

In this particular mix you want to

  1. Read an audit record with non-missing audit_year.
  2. Read and output ALL the records up to the record found in step 1 (i.e. the firmid and year match the firmid and audit_year of step 1). or otherwise up to the end of the data set.
  3. Once the dummy is a 1 (matching firmid's and year=audit_year), go back and get the next audit record.
data have;
input FirmID Year Audit_Year;
cards;
11      1990       .
11      1991       .
11      1992   1991 
11      1993       .
11      1994   1993 
11      1995       .
11      1996       .
12      2001       .
12      2002   2001
12      2003      .
12      2004      .
12     2005       .
12     2006       .
12     2007       .
12     2008       .
12     2009       .
;


data want (drop=_:);
  if eoh1=0 then set have (where=(audit_year^=.) rename=(firmid=_firmid)) end=eoh1;
  else call missing(audit_year);
  _yr=audit_year;

  do until (end_of_have);
    set have end=end_of_have;
    dummy=ifn(year=_yr and firmid=_firmid,1,0);
    output;
    if dummy=1 then leave;
  end;
run;

 

The "if eoh1 then SET …" statement at the beginning is done to avoid premature stopping of the first SET (and therefore the DATA step), before the second SET has processed all the data.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
r_behata
Barite | Level 11
data have;
input FirmID year Audit_Year;
cards;
11      1990       .                   0
11      1991       .                   1 
11      1992   1991                0
11      1993       .                   1
11      1994   1993                0
11      1995       .                   0  
11      1996       .                   0
12      2001       .                   1
12      2002   2001                0
12      2003      .                    0
12      2004      .                    0
12     2005       .                    0
12     2006       .                    0
12     2007       .                    0
12     2008       .                    0
12     2009       .                    0
;
run;

data want;


 do _n_=1 by 1 until(last.FirmID | ^ missing(Audit_Year));
 	set have;
 	by firmid notsorted;
 end;
 year_=Audit_Year;
 
 do _n_=1 to _n_;
 	set have;
 	by firmid;
 	
 	if year=year_ then Dummy=1;
 		else Dummy=0;
 	output;
 end;
 
 drop year_;
run;
amanjot_42
Fluorite | Level 6

Hi,

Thanks for the help!

 

I have noticed, there is another column of audit year which looks like this:

 

Firm ID  Year    Audit_year        Dummy (what I want)
11 1990 . 0 11 1991 . 0 11 1992 1993 0 11 1993 . 1 11 1994 1995 0 11 1995 . 1 11 1996 . 0 12 2001 . 0 12 2002 2003 0 12 2003 . 1 12 2004 . 0 12 2005 . 0 12 2006 . 0 12 2007 . 0 12 2008 . 0 12 2009 . 0

In the latter table, audit years are one year ahead of YEAR. Please help with respect to the codes

Regards,
Amanjot

r_behata
Barite | Level 11

Hi @amanjot_42 

 

I think the solutions posted by @mkeintz  and @novinosrin  will handle both the scenarios for you . Feel free to change the answer if you think it is appropriate.

 

Personally I like the below solution.

 

data want;
  merge have  
        have (keep=firmid audit_year rename=(audit_year=year) where=(year^=.) in=inaudit);
  by firmid year;
  dummy=inaudit;
run;
amanjot_42
Fluorite | Level 6

Hi,

Thanks for getting back,

 

The latter procedures are not working as expected (@mkeintz  and @novinosrin). They are increasing the number of observations (a lot of duplicated records are getting generated). The codes provided by you worked perfectly well with respect to the earlier scenario. Can you please help with the codes for the recent scenario.

 

Regards,

Amanjot

r_behata
Barite | Level 11

Let me know if this works :

 

data have;
input FirmID year Audit_Year;
cards;
11      1990       .                   
11      1991       .                    
11      1992   1993                
11      1993       .               
11      1994   1995                
11      1995       .                 
11      1996       .               
12      2001       .               
12      2002   2003                
12      2003      .                
12      2004      .                
12     2005       .                
12     2006       .                
12     2007       .                
12     2008       .                
12     2009       .          
;
run;

data want;


 do _n_=1 by 1 until(last.FirmID | brk );
 	set have;
 	by firmid notsorted;
 	if ^missing(Audit_Year) then year_=Audit_Year;
 	
 	if year_=year then brk=1;
 	

 end;

 do _n_=1 to _n_; 
  	set have; 
  	by firmid;

  	if  year_=year then Dummy=1; 
  		else Dummy=0; 
  	output; 
  end; 
 
 
 drop year_ brk;
 
run;
amanjot_42
Fluorite | Level 6

Hi,

Thank you so much, 

It worked really well!

 

I have one more question - for my another analysis (with a similar dataset - but having multiple observations for a particular year, say 3 rows for the year 2006. So, in this case, if I run the same codes for the audit year dummies, then it put 1 for only the first row of 2006. Rest of the two rows are depicting 0. 

 

Please help in this regard!

 

Regards,

Cheers

r_behata
Barite | Level 11

Please Open a new thread and post some sample data.

amanjot_42
Fluorite | Level 6

Thanks,

I will do that!

 

Regards

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 2509 views
  • 2 likes
  • 4 in conversation