BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
fostet85
Calcite | Level 5

Hello! I am using SAS 9.4. This community has been very helpful as I learn SAS and I am hoping someone can help me reach an answer to this question as well.

 

I have a dataset where the level of observation is company-date. The company identifier is GVKEY and the date identifier is TAXYR. I am looking to find the count of companies that meet the following criteria:

 

1. Have an observation whose DATADATE is on or after 4/15/2020
2. Have an observation whose TAXYR is 2019 (this may be the same as the first criteria)
3. Have an observation whose TAXYR is 2018
4. Have an observation whose TAXYR is 2017


The output I need is a listing or count of GVKEYs that have information available in all years. After figuring this out, I will expand the question - how many have data in these 3 years and also in 2016? How many have data back to 2015? I will then compare these counts to determine a reasonable restriction on data availability to create the sample for my study.

 

The part where I am getting stuck is going from a GVKEY-TAXYR level of observation to just a GVKEY level. Would I do this by dropping duplicates by just GVKEY to get a list of unique firm identifiers and then merge the original dataset on top of that and create indicator variables for each criteria? Or is there a simpler approach?

 

The attached a sample dataset. I have already cut down the data according to other criteria and dropped duplicates by GVKEY TAXYR.

 

Thank you for the help!

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

Have no clue what you mean / how you intend to use Datadate.

 

You might want to start with an exploratory report such as:

proc tabulate data=y1;
   class gvkey taxyr;
   table gvkey,
         taxyr*n=' '
         /misstext=' '
   ;
run; 

This will create a summary report with a 1 in the columns of Taxyr where it is present for gvkey.

 

Custom format(s) for Taxyr would give you some more direct counts.

proc format;
value tx (multilabel)
2017, 2018, 2019 = 'In 2017-2019'
2016,2017, 2018, 2019 = 'In 2016-2019'
2015,2016,2017, 2018, 2019 = 'In 2015-2019'
other='not in period'
;
run;
proc tabulate data=tmp3.y1;
   class gvkey;
   class taxyr / mlf;
   format taxyr tx.;
   table gvkey,
         taxyr*n=' '
         /misstext=' '
   ;
run; 

Which will make something like:

  Tax Year
In 2015-2019 In 2016-2019 In 2017-2019 not in period
Global Company Key 5 4 3 2
001004
001045 5 4 3 3
001050 5 4 3 3
001072 4 3 2 2
001075 5 4 3 3
001076 5 4 3 3
001078 5 4 3 3
001084 5 4 3 3
001094 3 2 1 3
001097 5 4 3 2

 

The number in column indicates how many Taxyr fall in the column heading. So if you see a 5 in the 2015-2019 that means all of the years are represented, if you only see a 1 then only one value is in the range.

 

The Format TX is a multilabel format which allows overlapping ranges. However only a very few procedures, Tabulate, Report, Summary/Means will use all the the ranges. But here it may be appropriate.

View solution in original post

4 REPLIES 4
ChrisNZ
Tourmaline | Level 20

Something like this is easy to read and adapt, if somewhat inefficient:

proc sql;
  create table WANT as
  select sum(sum(YR2017,YR2018,YR2019)=3) as ALL_3_YEARS
  from (select unique GVKEY              from TBL                 )  a
       left join 
       (select unique GVKEY, 1 as YR2016 from TBL where TAXYR=2016)  b
       on a.GVKEY=b.GVKEY
       left join 
       (select unique GVKEY, 1 as YR2017 from TBL where TAXYR=2017)  c
       on a.GVKEY=c.GVKEY
       left join 
       (select unique GVKEY, 1 as YR2018 from TBL where TAXYR=2018)  d
       on a.GVKEY=d.GVKEY
       left join 
       (select unique GVKEY, 1 as YR2019 from TBL where TAXYR=2019)  e
       on a.GVKEY=e.GVKEY;
quit;

 

ballardw
Super User

Have no clue what you mean / how you intend to use Datadate.

 

You might want to start with an exploratory report such as:

proc tabulate data=y1;
   class gvkey taxyr;
   table gvkey,
         taxyr*n=' '
         /misstext=' '
   ;
run; 

This will create a summary report with a 1 in the columns of Taxyr where it is present for gvkey.

 

Custom format(s) for Taxyr would give you some more direct counts.

proc format;
value tx (multilabel)
2017, 2018, 2019 = 'In 2017-2019'
2016,2017, 2018, 2019 = 'In 2016-2019'
2015,2016,2017, 2018, 2019 = 'In 2015-2019'
other='not in period'
;
run;
proc tabulate data=tmp3.y1;
   class gvkey;
   class taxyr / mlf;
   format taxyr tx.;
   table gvkey,
         taxyr*n=' '
         /misstext=' '
   ;
run; 

Which will make something like:

  Tax Year
In 2015-2019 In 2016-2019 In 2017-2019 not in period
Global Company Key 5 4 3 2
001004
001045 5 4 3 3
001050 5 4 3 3
001072 4 3 2 2
001075 5 4 3 3
001076 5 4 3 3
001078 5 4 3 3
001084 5 4 3 3
001094 3 2 1 3
001097 5 4 3 2

 

The number in column indicates how many Taxyr fall in the column heading. So if you see a 5 in the 2015-2019 that means all of the years are represented, if you only see a 1 then only one value is in the range.

 

The Format TX is a multilabel format which allows overlapping ranges. However only a very few procedures, Tabulate, Report, Summary/Means will use all the the ranges. But here it may be appropriate.

fostet85
Calcite | Level 5

Thank you all for your solutions! They all accomplished what I was trying to do, this one is just the simplest for me to understand and update down the road. 

 

Here is the solution that I ended up using, which was a combination from all 3 submitted solutions:

 

/*Create a dataset which is a listing of all GVKEYs with a filing on or after 4/15/2020*/
data gvkey_apr152020; set y1;
where apr152020=1;
keep gvkey conm tic apr152020;
run;

/*Drop duplicates*/
proc sort data=gvkey_apr152020 nodupkey dupout=dups2; 
	by gvkey;
run;

/*Create the table that makes dummy variables for each tax year*/
/*This exports the file to excel. We will re-import in the next step */
/*This is done to convert the table to a dataset*/
ods excel file="NOL_CB_entitylistingA.xlsx";
proc tabulate data=y1;
   class gvkey taxyr;
   table gvkey,
         taxyr*n=' '
         /misstext=' '
   ;
run; 
ods excel close;

******************************************************************;
** update headers and un-merge cells manually in excel **;
******************************************************************;

/*Re-import the excel table*/
proc import
 datafile="NOL_CB_entitylistingA.xlsx"
 out=GVKEY_taxyr;
 scantext=yes;
run;

/*Merge 	A) the list of GVKEYs with filings after 4/15/2020 with 			*/
/*			B) the dataset with dummy variables for filings in each tax year	*/
 proc sql;
  create table perm.CARESNOLCB_sample
   as select * 
   from gvkey_apr152020 a left join GVKEY_taxyr b
    on a.gvkey = b.gvkey
order by gvkey;
quit;

/*Export to excel*/
proc export 
  data=perm.CARESNOLCB_sample
  dbms=xlsx 
  outfile="CARESNOLCB_sample.xlsx" 
  replace;
run;
mkeintz
PROC Star

Read all records for a single gvkey, meanwhile populating an array of variables named taxyr2012 through taxyr2020 signifying the presence of the corresponding TAXYR.

 

At the end of the gvkey, after applying a filter like "if min(of taxdummy2017-taxdummy2020)=1 and apr2020='Y'.  Then starting with 2017 go backwards to find the first consecutive taxyr, and then output.

 


data want (keep=gvkey apr2020 taxdummy: starttax);
  do until (last.gvkey);
    set have (keep=gvkey datadate taxyr);
    by gvkey;
    if first.gvkey then apr2020='N';
    if datadate>='15apr2020'd then apr2020='Y';
    array taxdummy {2012:2020} taxdummy2012-taxdummy2020;
    taxdummy{taxyr}=1;
  end;
  do taxyr=2012 to 2020;
    if taxdummy{taxyr}=. then taxdummy{taxyr}=0;
  end;
  if min(of taxdummy2017-taxdummy2020)=1  and apr2020='Y';
  do starttax=2017 to 2012 by -1 while(taxdummy{starttax}=1);
  end;
run;

You could then generate the subpopulation counts with, say, a proc freq:

 

proc freq;
  tables starttax;
run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 472 views
  • 3 likes
  • 4 in conversation