BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
beihorse
Calcite | Level 5
 

YeWMM.png

 

A small sample of the data can be downloaded from my Google Drive.

You can use the data to debug my code.

This is how my data looks like. At each annual report date, there is a count of male directors (male_dirs) and a count of female directors (male_dirs).

During a year, members may leave or join the board, denoted by male_dirs_ann and female_dirs_ann ("_ann" is for "announcement" because they are merged from the "announcement" table), where 1 indicates one joined, -1 one left.

I wanted to update the numbers of male and female directors (male_dirs and female_dirs) when any left or joined the board, by generating two new variables, "total_male" and "total_female". On the annual report dates, I just copied over the male_dirs and female_dirs because there is no change on these date. For dates with changes, I took the number from the last date (male_dirs/female_dirs) and added/deducted the current change (male_dirs_ann/female_dirs_ann).

Row 2: total_male = 7 + 1 = 8, which is correct.

Row 6 is wrong: A male director left, and total_male should have been 6 (i.e., 7-1). Rows 9 and 17 are also wrong, among others.

My code is as follows:

data board_size_change;
set board_size_change;
by companyID date;
retain total_male total_female;

if first.companyID = 1 then do; 
   total_male = 0;
   total_female = 0;
end;

*At annual report date;
if not missing(annualreportdate) then do;
    total_male=male_dirs;
    total_female=female_dirs;
    output;
end;

*At announcement;
if missing(annualreportdate) then do;
    total_male + male_dirs_ann;
    total_female + female_dirs_ann;
    output;
end;
run;

What I missed or is there another way to approach this? Thanks a lot!

 

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

I think your doing more coding than needed.  I believe this produces what you are looking for.

 

dm 'clear log;clear out;';
data work.board_size_change;
  infile datalines truncover;
  input CompanyID:32. date:YYMMDD8. AnnualReportDate:YYMMDD8. male_dirs:32. female_dirs:32. male_dirs_ann:32. female_dirs_ann:32.;
  format date YYMMDDN8. AnnualReportDate YYMMDDN8.;
datalines;
3 20051201 20051201 7 0 0 0
3 20061027 . . . 1 0
3 20061201 20061201 8 0 0 0
6 20050701 20050701 8 1 0 0
6 20060701 20060701 7 1 0 0
6 20061207 . . . -1 0
6 20061207 . . . 1 0
6 20070207 . . . 0 -1
6 20070208 . . . 0 1
6 20070316 . . . -1 0
6 20070627 . . . 1 0
6 20070701 20070701 7 1 0 0
6 20080601 20080601 7 1 0 0
6 20090319 . . . 1 0
6 20090601 20090601 8 1 0 0
6 20100601 20100601 8 1 0 0
6 20101202 . . . 0 -1
6 20110701 20110701 8 0 0 0
6 20120112 . . . -1 0
6 20120112 . . . 0 1
run;

data want;
 set board_size_change;
 by companyid ;

 if first.companyid then call missing(total_male,total_female);
 total_male  + male_dirs_ann;
 total_female+ female_dirs_ann;

 if not missing(annualreportdate) then do;
   total_male  = male_dirs;
   total_female= female_dirs;
 end;
run;

Note this program assumes that total_male and total_female are new variables.

 

Editted insertion: It also assumes that the first obs for each companyid has a non-missing annualreportdate.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

8 REPLIES 8
rudfaden
Pyrite | Level 9

Avoid keeping subtotals in the same table as unsummed data. Calculate running totals in a view or another table, and merge the two tables when reporting them together.

mkeintz
PROC Star

Your data step begins with:

 

data board_size_change;
set board_size_change;
by companyID date;
retain total_male total_female;

Since you are copying a dataset (board_size_change) to itself, it's very likely that variables total_male and total_female are NOT new variables.   On the one hand this makes the retain statement superfluous (vars in a SET statement are automatically retained until replaced by a new obs).

 

But more importantly it means that with each new obs, total_male and total_female values are replaced by the incoming obs values. Consequently the statements

 

if missing(annualreportdate) then do;
    total_male + male_dirs_ann;
    total_female + female_dirs_ann;
    output;
end;

need not be starting with total_male and total_female values inherited from the prior obs.

 

Consider changing your SET statement to 

set board_size_change (drop=total_male total_female);

and keep the RETAIN statement.

 

Also, I did NOT download your data for testing purposes, because it obligates me to create/register/use a google account.  I try to avoid providing more data to google (they know enough about me already) for someone else's convenience.

 

 

 

 

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
beihorse
Calcite | Level 5

Thank you!

 

I added 

drop=total_male total_female

The problem still exists. 

 

ballardw
Super User

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the </> icon or attached as text to show exactly what you have and that we can test code against.

 

You don't need to include all of the data, enough to work through the cases you need. You could use the OBS= data set option to copy just 20 or so observations or a more complicated code to subset your existing data before using the macro.

beihorse
Calcite | Level 5

Thank you.   I tried to convert the dataset to data set code.  The code is as follows:

 

data work.board_size_change;
  infile datalines dsd truncover;
  input CompanyID:32. date:YYMMDDN8. AnnualReportDate:YYMMDDN8. male_dirs:32. female_dirs:32. male_dirs_ann:32. female_dirs_ann:32.;
  format date YYMMDDN8. AnnualReportDate YYMMDDN8.;
datalines;
3 20051201 20051201 7 0 0 0
3 20061027 . . . 1 0
3 20061201 20061201 8 0 0 0
6 20050701 20050701 8 1 0 0
6 20060701 20060701 7 1 0 0
6 20061207 . . . -1 0
6 20061207 . . . 1 0
6 20070207 . . . 0 -1
6 20070208 . . . 0 1
6 20070316 . . . -1 0
6 20070627 . . . 1 0
6 20070701 20070701 7 1 0 0
6 20080601 20080601 7 1 0 0
6 20090319 . . . 1 0
6 20090601 20090601 8 1 0 0
6 20100601 20100601 8 1 0 0
6 20101202 . . . 0 -1
6 20110701 20110701 8 0 0 0
6 20120112 . . . -1 0
6 20120112 . . . 0 1
;;;;

However, there was an error when I ran this code. 

ERROR 48-59: The informat YYMMDDN was not found or could not be loaded.

 

PaigeMiller
Diamond | Level 26

Okay, so it seems as if something in the macro caused DSD to appear in the INFILE command when in fact it is not necessary. Also, the informats in the INPUT statement should be YYMMDD8. There is no informat YYMMDDN8. although there is a format YYMMDDN8.

 

data work.board_size_change;
  infile datalines truncover;
  input CompanyID:32. date:YYMMDD8. AnnualReportDate:YYMMDD8. male_dirs:32. female_dirs:32. male_dirs_ann:32. female_dirs_ann:32.;
  format date YYMMDDN8. AnnualReportDate YYMMDDN8.;
datalines;
3 20051201 20051201 7 0 0 0
3 20061027 . . . 1 0
3 20061201 20061201 8 0 0 0
6 20050701 20050701 8 1 0 0
6 20060701 20060701 7 1 0 0
6 20061207 . . . -1 0
6 20061207 . . . 1 0
6 20070207 . . . 0 -1
6 20070208 . . . 0 1
6 20070316 . . . -1 0
6 20070627 . . . 1 0
6 20070701 20070701 7 1 0 0
6 20080601 20080601 7 1 0 0
6 20090319 . . . 1 0
6 20090601 20090601 8 1 0 0
6 20100601 20100601 8 1 0 0
6 20101202 . . . 0 -1
6 20110701 20110701 8 0 0 0
6 20120112 . . . -1 0
6 20120112 . . . 0 1
;

 

after that, your code works where I have made one minor change, where the new data set is named BOARD_SIZE_CHANGE1

 

data board_size_change1;
set board_size_change;
by companyID date;
retain total_male total_female;

if first.companyID = 1 then do; 
   total_male = 0;
   total_female = 0;
end;

*At annual report date;
if not missing(annualreportdate) then do;
    total_male=male_dirs;
    total_female=female_dirs;
    output;
end;

*At announcement;
if missing(annualreportdate) then do;
    total_male + male_dirs_ann;
    total_female + female_dirs_ann;
    output;
end;
run;

 

 

--
Paige Miller
mkeintz
PROC Star

I think your doing more coding than needed.  I believe this produces what you are looking for.

 

dm 'clear log;clear out;';
data work.board_size_change;
  infile datalines truncover;
  input CompanyID:32. date:YYMMDD8. AnnualReportDate:YYMMDD8. male_dirs:32. female_dirs:32. male_dirs_ann:32. female_dirs_ann:32.;
  format date YYMMDDN8. AnnualReportDate YYMMDDN8.;
datalines;
3 20051201 20051201 7 0 0 0
3 20061027 . . . 1 0
3 20061201 20061201 8 0 0 0
6 20050701 20050701 8 1 0 0
6 20060701 20060701 7 1 0 0
6 20061207 . . . -1 0
6 20061207 . . . 1 0
6 20070207 . . . 0 -1
6 20070208 . . . 0 1
6 20070316 . . . -1 0
6 20070627 . . . 1 0
6 20070701 20070701 7 1 0 0
6 20080601 20080601 7 1 0 0
6 20090319 . . . 1 0
6 20090601 20090601 8 1 0 0
6 20100601 20100601 8 1 0 0
6 20101202 . . . 0 -1
6 20110701 20110701 8 0 0 0
6 20120112 . . . -1 0
6 20120112 . . . 0 1
run;

data want;
 set board_size_change;
 by companyid ;

 if first.companyid then call missing(total_male,total_female);
 total_male  + male_dirs_ann;
 total_female+ female_dirs_ann;

 if not missing(annualreportdate) then do;
   total_male  = male_dirs;
   total_female= female_dirs;
 end;
run;

Note this program assumes that total_male and total_female are new variables.

 

Editted insertion: It also assumes that the first obs for each companyid has a non-missing annualreportdate.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
beihorse
Calcite | Level 5

Thanks a lot. This works perfectly!

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 947 views
  • 0 likes
  • 5 in conversation