BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
StevenR
Obsidian | Level 7

Hi Everyone, 

I have the following data structure :

Inventor

1stfirm

2ndfirm

movedate

patent

yearPatent

A

390

909

1973

37237239

1969

A

390

909

1973

78877878

1969

A

390

909

 

87090990

1970

A

390

909

 

98908087

1970

A

390

909

 

44347799

1970

A

390

909

 

77879808

1971

A

390

909

 

78787099

1972

A

390

909

 

87900800

1972

A

390

909

 

34778790

1973

A

390

909

 

87989898

1974

A

390

909

 

23477879

1975

A

390

909

 

87990008

1975

A

390

909

 

34778879

1976

A

390

909

1973

79887877

1977

 

I am trying to count the number of patents an inventor has produced 3 years right before leaving the 1stfirm and 3 years right after moving to the 2nd firm. So in the example above as the movedate is 1973, I want to count the patents produced by A in 1970,1971,1972 only and name it as premoving and then count patents produced by A in 1974,1975,1976 only and name it as postmoving. Using this logic, I want to first count number of patents per year and then number of patents for 3 years per inventor.

 

While I know the basic proc sql command and group by to do this, I wanted to ask the experts here as this is a large dataset and I did not want to make mistakes. Please let me know if there is a simple robust code to do this. Thank you once again for the amazing help you all provide 

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

This should satisfy your initial request, assuming the data are sorted by ID  (doesn't matter if data are not sorted by year within id):

 

 

data want (keep=id firm1 firm2 movedate premove postmove);
  merge have (where=(movedate^=.))  have (drop=movedate);
  by id;

  if first.id then do;
    premove=0;
	postmove=0;
  end;

  if 1<=(movedate-yearpatent)<=3 then premove+1; else
  if 1<=(yearpatent-movedate)<=3 then postmove+1;
  if last.id;
run;

 

Apparently MOVEDATE is not present on each observation ... and no inventer moves twice.  That's why I constructed the self-merge statement with non-missing MOVEDATE merged against movedate-less data.  The merge will propagate the non-missing MOVEDATE value throughout all the records for a given ID.

 

editted notation: The counting of patents is done in the "if 1<= (....) <=3 then premove+1" (or "postmove+1") statements.  Note the expression premove+1 is not in an assignment statement (i.e. it's not like   total=premove+1).  Instead it is a "summing statement", which tells SAS to (1) store the result in the variable premove (or postmove), and (2) to retained the current updated value of premove and postmove as you step through all the observations.

 

The final "if last.id" says to keep only the last record for each id - i.e. when all the premove and postmove counts have been completely accumulated.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

15 REPLIES 15
StevenR
Obsidian | Level 7
Another variation of this logic that I am interested to calculate in addition to the above is - to count the number of patents premoving and postmoving for an inventor using a 3 year moving window. This means I do not want to use only three years before and after moving but I want to count all patents produced by an inventor using a three year window. If we look at the table above for A, to count patents for year 1971, I should take into account sum of 1969+1970+1971, so the 3 year count should be 2+3+1 which are the number of patents produced by the inventor A, i.e. 6. And as I am interested in the moving count, the three year count window is always supposed to take into consideration the current year + the previous two years which means the next set of patent count for A for year 1972 would be 1970+1971+1972. But then from 1974 onwards A's patent count should follow the same logic as before but now it should be counted in a separate column as post moving
ChrisNZ
Tourmaline | Level 20

This does all the 3-year combinations in a very efficient manner.

It's then easy to keep the periods you want.

Provide usable data if you want it in our code.

data _F;
  retain FMTNAME 'THREEYEARS' HLO 'M';
  do START=1900 to 2100;
    END=START+2;
    LABEL=catx('-',START,END);
    output;
  end;
run;
proc format cntlin=_F;
run;

data HAVE;
  INVENTOR  ='A';
  MOVEDATE  =1973;
  YEARPATENT=1972;
run;

proc summary data=HAVE nway completetypes ; 
  class INVENTOR ;
  class YEARPATENT / mlf;
  format YEARPATENT threeyears.;
  var MOVEDATE;
  output out=SUM max=;
run;

INVENTOR YEARPATENT _FREQ_ MOVEDATE
A 1970-1972 1 1973
A 1971-1973 1 1973
A 1972-1974 1 1973
StevenR
Obsidian | Level 7
Thank you so much for posting the code, I will edit my earlier post and provide data in usable form as suggested.
andreas_lds
Jade | Level 19

Please post data in usable form and show what you expect as result.

Ksharp
Super User
Yes. post sas code to contains these data, and post the output you want to see .
StevenR
Obsidian | Level 7

I have the following data structure :                                             OUTPUT NEEDED (Highlighted in red)

Inventor

1stfirm

2ndfirm

movedate

patent

yearPatent

Premoving/per year

Premoving/3 years

Postmoving/per year

Postmoving/3 years

A

390

909

1973

37237239

1969

0

0

0

0

A

390

909

1973

78877878

1969

0

0

0

0

A

390

909

 1973

87090990

1970

3

6

0

0

A

390

909

 1973

98908087

1970

3

6

0

0

A

390

909

 1973

44347799

1970

3

6

0

0

A

390

909

 1973

77879808

1971

1

6

0

0

A

390

909

 1973

78787099

1972

2

6

0

0

A

390

909

 1973

87900800

1972

2

6

0

0

A

390

909

 1973

34778790

1973

0

0

0

0

A

390

909

 1973

87989898

1974

0

0

1

4

A

390

909

 1973

23477879

1975

0

0

2

4

A

390

909

 1973

87990008

1975

0

0

2

4

A

390

909

 1973

34778879

1976

0

0

1

4

A

390

909

1973

79887877

1977

0

0

0

0

 

I am trying to come up with a code to create the 4 data columns highlighted in red. I am interested in counting the number of patents an inventor has produced 3 years right before the movedate and  3 years right after the movedate. So in the example above as the movedate is 1973, I want to count the patents produced by A in 1970,1971,1972 only and name it as premoving/3 years and then count patents produced by A in 1974,1975,1976 only and name it as postmoving/ 3 years. I also want to produce an intermediary step wherein I can count the number of patents produced by A per year -premoving/per year and postmoving/ per year. I want to exclude counting the number of patents during the movedate while counting. From the values listed in the red columns, the zeroes represent all the patent years which are not of interest as they do not fall within the three year range. Thanks you for taking the time to look at my post.

ChrisNZ
Tourmaline | Level 20

So you change the requirements after people spent time answering your question? Is the time we volunteer to help you an unlimited resource?

My solution works for the new requirements, just add more formatting intervals to include one-year intervals. 

StevenR
Obsidian | Level 7
Sorry if there was a misunderstanding but this question is very similar to above , I just included the output I want as others requested for it and to be clear- no I do not think the time you volunteer is "unlimited resource"!

In case you didn't read my first post, this is it verbatim
"I am trying to count the number of patents an inventor has produced 3 years right before leaving the 1stfirm and 3 years right after moving to the 2nd firm. So in the example above as the movedate is 1973, I want to count the patents produced by A in 1970,1971,1972 only and name it as premoving and then count patents produced by A in 1974,1975,1976 only and name it as postmoving. Using this logic, I want to first count number of patents per year and then number of patents for 3 years per inventor."
ChrisNZ
Tourmaline | Level 20

The solution I provided is the most efficient way to derive the counts you are seeking.

Just add more more format intervals if you want single years,

data _F;
  retain FMTNAME 'yeargrp' HLO 'M';
  do START=1900 to 2100;
    END=START;
    LABEL=cat(START);
    output;
    END=START+2;
    LABEL=catx('-',START,END);
    output;
  end;
run;

and reorganise the columns afterwards.

StevenR
Obsidian | Level 7
Thank you so much for your quick reply, I will try the code you suggested
StevenR
Obsidian | Level 7
While the suggested code is robust in producing the three year intervals. The code is still producing the following errors:
1. It is not looping over all the inventors in the dataset- this is a panel dataset and has several thousands of observations
2. It is not disregarding the movedate when it is counting the number of patents. It is including the move date. (In my initial post I do mention that I do not wan to include the movedate)
3. As per my initial post , I am interested in counting just three years prior to moving and three years right after moving -a three year block for total number of patents but this is using a three year moving window- that was a variation of my initial post as mentioned above, unfortunately it wasn't what I am seeking to do

Thank you for the code though !
mkeintz
PROC Star

This should satisfy your initial request, assuming the data are sorted by ID  (doesn't matter if data are not sorted by year within id):

 

 

data want (keep=id firm1 firm2 movedate premove postmove);
  merge have (where=(movedate^=.))  have (drop=movedate);
  by id;

  if first.id then do;
    premove=0;
	postmove=0;
  end;

  if 1<=(movedate-yearpatent)<=3 then premove+1; else
  if 1<=(yearpatent-movedate)<=3 then postmove+1;
  if last.id;
run;

 

Apparently MOVEDATE is not present on each observation ... and no inventer moves twice.  That's why I constructed the self-merge statement with non-missing MOVEDATE merged against movedate-less data.  The merge will propagate the non-missing MOVEDATE value throughout all the records for a given ID.

 

editted notation: The counting of patents is done in the "if 1<= (....) <=3 then premove+1" (or "postmove+1") statements.  Note the expression premove+1 is not in an assignment statement (i.e. it's not like   total=premove+1).  Instead it is a "summing statement", which tells SAS to (1) store the result in the variable premove (or postmove), and (2) to retained the current updated value of premove and postmove as you step through all the observations.

 

The final "if last.id" says to keep only the last record for each id - i.e. when all the premove and postmove counts have been completely accumulated.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
StevenR
Obsidian | Level 7
Thank you so much, this code looks very neat 🙂 but before executing the code, I was wondering which of the lines captures the patent count. In the data some inventors can move multiple times.
mkeintz
PROC Star
I've revised my response to add explanation on the calculation of premove and postmove counts.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 15 replies
  • 1559 views
  • 6 likes
  • 5 in conversation