A small sample of the data can be downloaded from my Google Drive.
You can use the data to debug my code.
This is how my data looks like. At each annual report date, there is a count of male directors (male_dirs) and a count of female directors (male_dirs).
During a year, members may leave or join the board, denoted by male_dirs_ann and female_dirs_ann ("_ann" is for "announcement" because they are merged from the "announcement" table), where 1 indicates one joined, -1 one left.
I wanted to update the numbers of male and female directors (male_dirs and female_dirs) when any left or joined the board, by generating two new variables, "total_male" and "total_female". On the annual report dates, I just copied over the male_dirs and female_dirs because there is no change on these date. For dates with changes, I took the number from the last date (male_dirs/female_dirs) and added/deducted the current change (male_dirs_ann/female_dirs_ann).
Row 2: total_male = 7 + 1 = 8, which is correct.
Row 6 is wrong: A male director left, and total_male should have been 6 (i.e., 7-1). Rows 9 and 17 are also wrong, among others.
My code is as follows:
data board_size_change;
set board_size_change;
by companyID date;
retain total_male total_female;
if first.companyID = 1 then do;
total_male = 0;
total_female = 0;
end;
*At annual report date;
if not missing(annualreportdate) then do;
total_male=male_dirs;
total_female=female_dirs;
output;
end;
*At announcement;
if missing(annualreportdate) then do;
total_male + male_dirs_ann;
total_female + female_dirs_ann;
output;
end;
run;
What I missed or is there another way to approach this? Thanks a lot!
I think your doing more coding than needed. I believe this produces what you are looking for.
dm 'clear log;clear out;';
data work.board_size_change;
infile datalines truncover;
input CompanyID:32. date:YYMMDD8. AnnualReportDate:YYMMDD8. male_dirs:32. female_dirs:32. male_dirs_ann:32. female_dirs_ann:32.;
format date YYMMDDN8. AnnualReportDate YYMMDDN8.;
datalines;
3 20051201 20051201 7 0 0 0
3 20061027 . . . 1 0
3 20061201 20061201 8 0 0 0
6 20050701 20050701 8 1 0 0
6 20060701 20060701 7 1 0 0
6 20061207 . . . -1 0
6 20061207 . . . 1 0
6 20070207 . . . 0 -1
6 20070208 . . . 0 1
6 20070316 . . . -1 0
6 20070627 . . . 1 0
6 20070701 20070701 7 1 0 0
6 20080601 20080601 7 1 0 0
6 20090319 . . . 1 0
6 20090601 20090601 8 1 0 0
6 20100601 20100601 8 1 0 0
6 20101202 . . . 0 -1
6 20110701 20110701 8 0 0 0
6 20120112 . . . -1 0
6 20120112 . . . 0 1
run;
data want;
set board_size_change;
by companyid ;
if first.companyid then call missing(total_male,total_female);
total_male + male_dirs_ann;
total_female+ female_dirs_ann;
if not missing(annualreportdate) then do;
total_male = male_dirs;
total_female= female_dirs;
end;
run;
Note this program assumes that total_male and total_female are new variables.
Editted insertion: It also assumes that the first obs for each companyid has a non-missing annualreportdate.
Avoid keeping subtotals in the same table as unsummed data. Calculate running totals in a view or another table, and merge the two tables when reporting them together.
Your data step begins with:
data board_size_change;
set board_size_change;
by companyID date;
retain total_male total_female;
Since you are copying a dataset (board_size_change) to itself, it's very likely that variables total_male and total_female are NOT new variables. On the one hand this makes the retain statement superfluous (vars in a SET statement are automatically retained until replaced by a new obs).
But more importantly it means that with each new obs, total_male and total_female values are replaced by the incoming obs values. Consequently the statements
if missing(annualreportdate) then do;
total_male + male_dirs_ann;
total_female + female_dirs_ann;
output;
end;
need not be starting with total_male and total_female values inherited from the prior obs.
Consider changing your SET statement to
set board_size_change (drop=total_male total_female);
and keep the RETAIN statement.
Also, I did NOT download your data for testing purposes, because it obligates me to create/register/use a google account. I try to avoid providing more data to google (they know enough about me already) for someone else's convenience.
Thank you!
I added
drop=total_male total_female
The problem still exists.
Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the </> icon or attached as text to show exactly what you have and that we can test code against.
You don't need to include all of the data, enough to work through the cases you need. You could use the OBS= data set option to copy just 20 or so observations or a more complicated code to subset your existing data before using the macro.
Thank you. I tried to convert the dataset to data set code. The code is as follows:
data work.board_size_change;
infile datalines dsd truncover;
input CompanyID:32. date:YYMMDDN8. AnnualReportDate:YYMMDDN8. male_dirs:32. female_dirs:32. male_dirs_ann:32. female_dirs_ann:32.;
format date YYMMDDN8. AnnualReportDate YYMMDDN8.;
datalines;
3 20051201 20051201 7 0 0 0
3 20061027 . . . 1 0
3 20061201 20061201 8 0 0 0
6 20050701 20050701 8 1 0 0
6 20060701 20060701 7 1 0 0
6 20061207 . . . -1 0
6 20061207 . . . 1 0
6 20070207 . . . 0 -1
6 20070208 . . . 0 1
6 20070316 . . . -1 0
6 20070627 . . . 1 0
6 20070701 20070701 7 1 0 0
6 20080601 20080601 7 1 0 0
6 20090319 . . . 1 0
6 20090601 20090601 8 1 0 0
6 20100601 20100601 8 1 0 0
6 20101202 . . . 0 -1
6 20110701 20110701 8 0 0 0
6 20120112 . . . -1 0
6 20120112 . . . 0 1
;;;;
However, there was an error when I ran this code.
ERROR 48-59: The informat YYMMDDN was not found or could not be loaded.
Okay, so it seems as if something in the macro caused DSD to appear in the INFILE command when in fact it is not necessary. Also, the informats in the INPUT statement should be YYMMDD8. There is no informat YYMMDDN8. although there is a format YYMMDDN8.
data work.board_size_change;
infile datalines truncover;
input CompanyID:32. date:YYMMDD8. AnnualReportDate:YYMMDD8. male_dirs:32. female_dirs:32. male_dirs_ann:32. female_dirs_ann:32.;
format date YYMMDDN8. AnnualReportDate YYMMDDN8.;
datalines;
3 20051201 20051201 7 0 0 0
3 20061027 . . . 1 0
3 20061201 20061201 8 0 0 0
6 20050701 20050701 8 1 0 0
6 20060701 20060701 7 1 0 0
6 20061207 . . . -1 0
6 20061207 . . . 1 0
6 20070207 . . . 0 -1
6 20070208 . . . 0 1
6 20070316 . . . -1 0
6 20070627 . . . 1 0
6 20070701 20070701 7 1 0 0
6 20080601 20080601 7 1 0 0
6 20090319 . . . 1 0
6 20090601 20090601 8 1 0 0
6 20100601 20100601 8 1 0 0
6 20101202 . . . 0 -1
6 20110701 20110701 8 0 0 0
6 20120112 . . . -1 0
6 20120112 . . . 0 1
;
after that, your code works where I have made one minor change, where the new data set is named BOARD_SIZE_CHANGE1
data board_size_change1;
set board_size_change;
by companyID date;
retain total_male total_female;
if first.companyID = 1 then do;
total_male = 0;
total_female = 0;
end;
*At annual report date;
if not missing(annualreportdate) then do;
total_male=male_dirs;
total_female=female_dirs;
output;
end;
*At announcement;
if missing(annualreportdate) then do;
total_male + male_dirs_ann;
total_female + female_dirs_ann;
output;
end;
run;
I think your doing more coding than needed. I believe this produces what you are looking for.
dm 'clear log;clear out;';
data work.board_size_change;
infile datalines truncover;
input CompanyID:32. date:YYMMDD8. AnnualReportDate:YYMMDD8. male_dirs:32. female_dirs:32. male_dirs_ann:32. female_dirs_ann:32.;
format date YYMMDDN8. AnnualReportDate YYMMDDN8.;
datalines;
3 20051201 20051201 7 0 0 0
3 20061027 . . . 1 0
3 20061201 20061201 8 0 0 0
6 20050701 20050701 8 1 0 0
6 20060701 20060701 7 1 0 0
6 20061207 . . . -1 0
6 20061207 . . . 1 0
6 20070207 . . . 0 -1
6 20070208 . . . 0 1
6 20070316 . . . -1 0
6 20070627 . . . 1 0
6 20070701 20070701 7 1 0 0
6 20080601 20080601 7 1 0 0
6 20090319 . . . 1 0
6 20090601 20090601 8 1 0 0
6 20100601 20100601 8 1 0 0
6 20101202 . . . 0 -1
6 20110701 20110701 8 0 0 0
6 20120112 . . . -1 0
6 20120112 . . . 0 1
run;
data want;
set board_size_change;
by companyid ;
if first.companyid then call missing(total_male,total_female);
total_male + male_dirs_ann;
total_female+ female_dirs_ann;
if not missing(annualreportdate) then do;
total_male = male_dirs;
total_female= female_dirs;
end;
run;
Note this program assumes that total_male and total_female are new variables.
Editted insertion: It also assumes that the first obs for each companyid has a non-missing annualreportdate.
Thanks a lot. This works perfectly!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.