Hi Everyone,
I have the following data structure :
Inventor | 1stfirm | 2ndfirm | movedate | patent | yearPatent |
A | 390 | 909 | 1973 | 37237239 | 1969 |
A | 390 | 909 | 1973 | 78877878 | 1969 |
A | 390 | 909 |
| 87090990 | 1970 |
A | 390 | 909 |
| 98908087 | 1970 |
A | 390 | 909 |
| 44347799 | 1970 |
A | 390 | 909 |
| 77879808 | 1971 |
A | 390 | 909 |
| 78787099 | 1972 |
A | 390 | 909 |
| 87900800 | 1972 |
A | 390 | 909 |
| 34778790 | 1973 |
A | 390 | 909 |
| 87989898 | 1974 |
A | 390 | 909 |
| 23477879 | 1975 |
A | 390 | 909 |
| 87990008 | 1975 |
A | 390 | 909 |
| 34778879 | 1976 |
A | 390 | 909 | 1973 | 79887877 | 1977 |
I am trying to count the number of patents an inventor has produced 3 years right before leaving the 1stfirm and 3 years right after moving to the 2nd firm. So in the example above as the movedate is 1973, I want to count the patents produced by A in 1970,1971,1972 only and name it as premoving and then count patents produced by A in 1974,1975,1976 only and name it as postmoving. Using this logic, I want to first count number of patents per year and then number of patents for 3 years per inventor.
While I know the basic proc sql command and group by to do this, I wanted to ask the experts here as this is a large dataset and I did not want to make mistakes. Please let me know if there is a simple robust code to do this. Thank you once again for the amazing help you all provide
This should satisfy your initial request, assuming the data are sorted by ID (doesn't matter if data are not sorted by year within id):
data want (keep=id firm1 firm2 movedate premove postmove);
merge have (where=(movedate^=.)) have (drop=movedate);
by id;
if first.id then do;
premove=0;
postmove=0;
end;
if 1<=(movedate-yearpatent)<=3 then premove+1; else
if 1<=(yearpatent-movedate)<=3 then postmove+1;
if last.id;
run;
Apparently MOVEDATE is not present on each observation ... and no inventer moves twice. That's why I constructed the self-merge statement with non-missing MOVEDATE merged against movedate-less data. The merge will propagate the non-missing MOVEDATE value throughout all the records for a given ID.
editted notation: The counting of patents is done in the "if 1<= (....) <=3 then premove+1" (or "postmove+1") statements. Note the expression premove+1 is not in an assignment statement (i.e. it's not like total=premove+1). Instead it is a "summing statement", which tells SAS to (1) store the result in the variable premove (or postmove), and (2) to retained the current updated value of premove and postmove as you step through all the observations.
The final "if last.id" says to keep only the last record for each id - i.e. when all the premove and postmove counts have been completely accumulated.
This does all the 3-year combinations in a very efficient manner.
It's then easy to keep the periods you want.
Provide usable data if you want it in our code.
data _F;
retain FMTNAME 'THREEYEARS' HLO 'M';
do START=1900 to 2100;
END=START+2;
LABEL=catx('-',START,END);
output;
end;
run;
proc format cntlin=_F;
run;
data HAVE;
INVENTOR ='A';
MOVEDATE =1973;
YEARPATENT=1972;
run;
proc summary data=HAVE nway completetypes ;
class INVENTOR ;
class YEARPATENT / mlf;
format YEARPATENT threeyears.;
var MOVEDATE;
output out=SUM max=;
run;
INVENTOR | YEARPATENT | _FREQ_ | MOVEDATE |
---|---|---|---|
A | 1970-1972 | 1 | 1973 |
A | 1971-1973 | 1 | 1973 |
A | 1972-1974 | 1 | 1973 |
Please post data in usable form and show what you expect as result.
I have the following data structure : OUTPUT NEEDED (Highlighted in red)
Inventor | 1stfirm | 2ndfirm | movedate | patent | yearPatent | Premoving/per year | Premoving/3 years | Postmoving/per year | Postmoving/3 years |
A | 390 | 909 | 1973 | 37237239 | 1969 | 0 | 0 | 0 | 0 |
A | 390 | 909 | 1973 | 78877878 | 1969 | 0 | 0 | 0 | 0 |
A | 390 | 909 | 1973 | 87090990 | 1970 | 3 | 6 | 0 | 0 |
A | 390 | 909 | 1973 | 98908087 | 1970 | 3 | 6 | 0 | 0 |
A | 390 | 909 | 1973 | 44347799 | 1970 | 3 | 6 | 0 | 0 |
A | 390 | 909 | 1973 | 77879808 | 1971 | 1 | 6 | 0 | 0 |
A | 390 | 909 | 1973 | 78787099 | 1972 | 2 | 6 | 0 | 0 |
A | 390 | 909 | 1973 | 87900800 | 1972 | 2 | 6 | 0 | 0 |
A | 390 | 909 | 1973 | 34778790 | 1973 | 0 | 0 | 0 | 0 |
A | 390 | 909 | 1973 | 87989898 | 1974 | 0 | 0 | 1 | 4 |
A | 390 | 909 | 1973 | 23477879 | 1975 | 0 | 0 | 2 | 4 |
A | 390 | 909 | 1973 | 87990008 | 1975 | 0 | 0 | 2 | 4 |
A | 390 | 909 | 1973 | 34778879 | 1976 | 0 | 0 | 1 | 4 |
A | 390 | 909 | 1973 | 79887877 | 1977 | 0 | 0 | 0 | 0 |
I am trying to come up with a code to create the 4 data columns highlighted in red. I am interested in counting the number of patents an inventor has produced 3 years right before the movedate and 3 years right after the movedate. So in the example above as the movedate is 1973, I want to count the patents produced by A in 1970,1971,1972 only and name it as premoving/3 years and then count patents produced by A in 1974,1975,1976 only and name it as postmoving/ 3 years. I also want to produce an intermediary step wherein I can count the number of patents produced by A per year -premoving/per year and postmoving/ per year. I want to exclude counting the number of patents during the movedate while counting. From the values listed in the red columns, the zeroes represent all the patent years which are not of interest as they do not fall within the three year range. Thanks you for taking the time to look at my post.
So you change the requirements after people spent time answering your question? Is the time we volunteer to help you an unlimited resource?
My solution works for the new requirements, just add more formatting intervals to include one-year intervals.
The solution I provided is the most efficient way to derive the counts you are seeking.
Just add more more format intervals if you want single years,
data _F;
retain FMTNAME 'yeargrp' HLO 'M';
do START=1900 to 2100;
END=START;
LABEL=cat(START);
output;
END=START+2;
LABEL=catx('-',START,END);
output;
end;
run;
and reorganise the columns afterwards.
This should satisfy your initial request, assuming the data are sorted by ID (doesn't matter if data are not sorted by year within id):
data want (keep=id firm1 firm2 movedate premove postmove);
merge have (where=(movedate^=.)) have (drop=movedate);
by id;
if first.id then do;
premove=0;
postmove=0;
end;
if 1<=(movedate-yearpatent)<=3 then premove+1; else
if 1<=(yearpatent-movedate)<=3 then postmove+1;
if last.id;
run;
Apparently MOVEDATE is not present on each observation ... and no inventer moves twice. That's why I constructed the self-merge statement with non-missing MOVEDATE merged against movedate-less data. The merge will propagate the non-missing MOVEDATE value throughout all the records for a given ID.
editted notation: The counting of patents is done in the "if 1<= (....) <=3 then premove+1" (or "postmove+1") statements. Note the expression premove+1 is not in an assignment statement (i.e. it's not like total=premove+1). Instead it is a "summing statement", which tells SAS to (1) store the result in the variable premove (or postmove), and (2) to retained the current updated value of premove and postmove as you step through all the observations.
The final "if last.id" says to keep only the last record for each id - i.e. when all the premove and postmove counts have been completely accumulated.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.