Solved: Re: Counting with conditions

StevenR · Posted 06-18-2020 06:11 AM

Hi Everyone,

I have the following data structure :

Inventor	1stfirm	2ndfirm	movedate	patent	yearPatent
A	390	909	1973	37237239	1969
A	390	909	1973	78877878	1969
A	390	909		87090990	1970
A	390	909		98908087	1970
A	390	909		44347799	1970
A	390	909		77879808	1971
A	390	909		78787099	1972
A	390	909		87900800	1972
A	390	909		34778790	1973
A	390	909		87989898	1974
A	390	909		23477879	1975
A	390	909		87990008	1975
A	390	909		34778879	1976
A	390	909	1973	79887877	1977

I am trying to count the number of patents an inventor has produced 3 years right before leaving the 1stfirm and 3 years right after moving to the 2nd firm. So in the example above as the movedate is 1973, I want to count the patents produced by A in 1970,1971,1972 only and name it as premoving and then count patents produced by A in 1974,1975,1976 only and name it as postmoving. Using this logic, I want to first count number of patents per year and then number of patents for 3 years per inventor.

While I know the basic proc sql command and group by to do this, I wanted to ask the experts here as this is a large dataset and I did not want to make mistakes. Please let me know if there is a simple robust code to do this. Thank you once again for the amazing help you all provide

mkeintz · Posted 06-30-2020 03:47 PM

This should satisfy your initial request, assuming the data are sorted by ID (doesn't matter if data are not sorted by year within id):

data want (keep=id firm1 firm2 movedate premove postmove);
  merge have (where=(movedate^=.))  have (drop=movedate);
  by id;

  if first.id then do;
    premove=0;
	postmove=0;
  end;

  if 1<=(movedate-yearpatent)<=3 then premove+1; else
  if 1<=(yearpatent-movedate)<=3 then postmove+1;
  if last.id;
run;

Apparently MOVEDATE is not present on each observation ... and no inventer moves twice. That's why I constructed the self-merge statement with non-missing MOVEDATE merged against movedate-less data. The merge will propagate the non-missing MOVEDATE value throughout all the records for a given ID.

editted notation: The counting of patents is done in the "if 1<= (....) <=3 then premove+1" (or "postmove+1") statements. Note the expression premove+1 is not in an assignment statement (i.e. it's not like total=premove+1). Instead it is a "summing statement", which tells SAS to (1) store the result in the variable premove (or postmove), and (2) to retained the current updated value of premove and postmove as you step through all the observations.

The final "if last.id" says to keep only the last record for each id - i.e. when all the premove and postmove counts have been completely accumulated.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

StevenR · Posted 06-18-2020 06:46 AM

Another variation of this logic that I am interested to calculate in addition to the above is - to count the number of patents premoving and postmoving for an inventor using a 3 year moving window. This means I do not want to use only three years before and after moving but I want to count all patents produced by an inventor using a three year window. If we look at the table above for A, to count patents for year 1971, I should take into account sum of 1969+1970+1971, so the 3 year count should be 2+3+1 which are the number of patents produced by the inventor A, i.e. 6. And as I am interested in the moving count, the three year count window is always supposed to take into consideration the current year + the previous two years which means the next set of patent count for A for year 1972 would be 1970+1971+1972. But then from 1974 onwards A's patent count should follow the same logic as before but now it should be counted in a separate column as post moving

ChrisNZ · Posted 06-18-2020 07:17 AM

This does all the 3-year combinations in a very efficient manner.

It's then easy to keep the periods you want.

Provide usable data if you want it in our code.

data _F;
  retain FMTNAME 'THREEYEARS' HLO 'M';
  do START=1900 to 2100;
    END=START+2;
    LABEL=catx('-',START,END);
    output;
  end;
run;
proc format cntlin=_F;
run;

data HAVE;
  INVENTOR  ='A';
  MOVEDATE  =1973;
  YEARPATENT=1972;
run;

proc summary data=HAVE nway completetypes ; 
  class INVENTOR ;
  class YEARPATENT / mlf;
  format YEARPATENT threeyears.;
  var MOVEDATE;
  output out=SUM max=;
run;

INVENTOR	YEARPATENT	_FREQ_	MOVEDATE
A	1970-1972	1	1973
A	1971-1973	1	1973
A	1972-1974	1	1973

High-Performance SAS Coding - Third Edition

StevenR · Posted 06-18-2020 07:45 AM

Thank you so much for posting the code, I will edit my earlier post and provide data in usable form as suggested.

andreas_lds · Posted 06-18-2020 07:02 AM

Please post data in usable form and show what you expect as result.

Ksharp · Posted 06-18-2020 08:25 AM

Yes. post sas code to contains these data, and post the output you want to see .

StevenR · Posted 06-25-2020 11:47 AM

I have the following data structure : OUTPUT NEEDED (Highlighted in red)

Inventor	1stfirm	2ndfirm	movedate	patent	yearPatent	Premoving/per year	Premoving/3 years	Postmoving/per year	Postmoving/3 years
A	390	909	1973	37237239	1969	0	0	0	0
A	390	909	1973	78877878	1969	0	0	0	0
A	390	909	1973	87090990	1970	3	6	0	0
A	390	909	1973	98908087	1970	3	6	0	0
A	390	909	1973	44347799	1970	3	6	0	0
A	390	909	1973	77879808	1971	1	6	0	0
A	390	909	1973	78787099	1972	2	6	0	0
A	390	909	1973	87900800	1972	2	6	0	0
A	390	909	1973	34778790	1973	0	0	0	0
A	390	909	1973	87989898	1974	0	0	1	4
A	390	909	1973	23477879	1975	0	0	2	4
A	390	909	1973	87990008	1975	0	0	2	4
A	390	909	1973	34778879	1976	0	0	1	4
A	390	909	1973	79887877	1977	0	0	0	0

I am trying to come up with a code to create the 4 data columns highlighted in red. I am interested in counting the number of patents an inventor has produced 3 years right before the movedate and 3 years right after the movedate. So in the example above as the movedate is 1973, I want to count the patents produced by A in 1970,1971,1972 only and name it as premoving/3 years and then count patents produced by A in 1974,1975,1976 only and name it as postmoving/ 3 years. I also want to produce an intermediary step wherein I can count the number of patents produced by A per year -premoving/per year and postmoving/ per year. I want to exclude counting the number of patents during the movedate while counting. From the values listed in the red columns, the zeroes represent all the patent years which are not of interest as they do not fall within the three year range. Thanks you for taking the time to look at my post.

ChrisNZ · Posted 06-26-2020 12:10 AM

So you change the requirements after people spent time answering your question? Is the time we volunteer to help you an unlimited resource?

My solution works for the new requirements, just add more formatting intervals to include one-year intervals.

High-Performance SAS Coding - Third Edition

StevenR · Posted 06-29-2020 05:48 PM

Sorry if there was a misunderstanding but this question is very similar to above , I just included the output I want as others requested for it and to be clear- no I do not think the time you volunteer is "unlimited resource"!

In case you didn't read my first post, this is it verbatim
"I am trying to count the number of patents an inventor has produced 3 years right before leaving the 1stfirm and 3 years right after moving to the 2nd firm. So in the example above as the movedate is 1973, I want to count the patents produced by A in 1970,1971,1972 only and name it as premoving and then count patents produced by A in 1974,1975,1976 only and name it as postmoving. Using this logic, I want to first count number of patents per year and then number of patents for 3 years per inventor."

ChrisNZ · Posted 06-29-2020 06:54 PM

The solution I provided is the most efficient way to derive the counts you are seeking.

Just add more more format intervals if you want single years,

data _F;
  retain FMTNAME 'yeargrp' HLO 'M';
  do START=1900 to 2100;
    END=START;
    LABEL=cat(START);
    output;
    END=START+2;
    LABEL=catx('-',START,END);
    output;
  end;
run;

and reorganise the columns afterwards.

High-Performance SAS Coding - Third Edition

StevenR · Posted 06-29-2020 07:33 PM

Thank you so much for your quick reply, I will try the code you suggested

StevenR · Posted 06-30-2020 11:00 AM

While the suggested code is robust in producing the three year intervals. The code is still producing the following errors:
1. It is not looping over all the inventors in the dataset- this is a panel dataset and has several thousands of observations
2. It is not disregarding the movedate when it is counting the number of patents. It is including the move date. (In my initial post I do mention that I do not wan to include the movedate)
3. As per my initial post , I am interested in counting just three years prior to moving and three years right after moving -a three year block for total number of patents but this is using a three year moving window- that was a variation of my initial post as mentioned above, unfortunately it wasn't what I am seeking to do

Thank you for the code though !

mkeintz · Posted 06-30-2020 03:47 PM

This should satisfy your initial request, assuming the data are sorted by ID (doesn't matter if data are not sorted by year within id):

data want (keep=id firm1 firm2 movedate premove postmove);
  merge have (where=(movedate^=.))  have (drop=movedate);
  by id;

  if first.id then do;
    premove=0;
	postmove=0;
  end;

  if 1<=(movedate-yearpatent)<=3 then premove+1; else
  if 1<=(yearpatent-movedate)<=3 then postmove+1;
  if last.id;
run;

Apparently MOVEDATE is not present on each observation ... and no inventer moves twice. That's why I constructed the self-merge statement with non-missing MOVEDATE merged against movedate-less data. The merge will propagate the non-missing MOVEDATE value throughout all the records for a given ID.

editted notation: The counting of patents is done in the "if 1<= (....) <=3 then premove+1" (or "postmove+1") statements. Note the expression premove+1 is not in an assignment statement (i.e. it's not like total=premove+1). Instead it is a "summing statement", which tells SAS to (1) store the result in the variable premove (or postmove), and (2) to retained the current updated value of premove and postmove as you step through all the observations.

The final "if last.id" says to keep only the last record for each id - i.e. when all the premove and postmove counts have been completely accumulated.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

StevenR · Posted 07-01-2020 12:22 PM

Thank you so much, this code looks very neat 🙂 but before executing the code, I was wondering which of the lines captures the patent count. In the data some inventors can move multiple times.

mkeintz · Posted 07-01-2020 08:34 PM

I've revised my response to add explanation on the calculation of premove and postmove counts.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away