Hi SAS Master,
I have a dataset with three variables, including firmID, year, and ProductID. I would like to finish two tasks.
1. remove duplicate product within 3 years for each firm so that a product is only shown once within 3 years for a specific firm. (A product can be shown in several different firms. It can also be shown several times in a specific firm as long as the time lag is longer than 3 years. I just want to make sure that one product is only shown once within 3 years for a specific firm.)
2. count the number of products that each firm has for each year.
For example:
have
FirmID | Year | ProductID |
1001 | 1995 | A1 |
1001 | 1995 | B3 |
1001 | 1995 | R9 |
1001 | 1995 | V5 |
1001 | 1996 | B2 |
1001 | 1996 | Y8 |
1001 | 1997 | A1 |
1001 | 1997 | O6 |
1001 | 1997 | U3 |
1001 | 1998 | E4 |
1001 | 1998 | W1 |
1001 | 1998 | P3 |
1002 | 2003 | D5 |
1002 | 2003 | I8 |
1002 | 2003 | V5 |
1002 | 2003 | A1 |
1002 | 2004 | G7 |
1002 | 2004 | N6 |
1002 | 2005 | I8 |
1002 | 2005 | P2 |
1002 | 2005 | M5 |
1002 | 2006 | N6 |
1002 | 2006 | K9 |
1002 | 2006 | H4 |
1002 | 2007 | V5 |
1002 | 2007 | P2 |
1002 | 2007 | R4 |
want
FirmID | Year | Pro_Num |
1001 | 1995 | 4 |
1001 | 1996 | 2 |
1001 | 1997 | 2 |
1001 | 1998 | 3 |
1002 | 2003 | 4 |
1002 | 2004 | 2 |
1002 | 2005 | 2 |
1002 | 2006 | 2 |
1002 | 2007 | 2 |
A1 is shown in 1995 and 1997, so the number of products in 1997 is 2.
V5 is shown in both companies, so it is counted in firm 1001 in 1995 and firm 1002 in 2003.
I8 is shown in 2003 and 2005, so the number of products in 2005 is 2.
N6 is shown in 2004 and 2006, so the number of products in 2006 is 2.
V5 is shown in both in 2003 and 2007 and the time lag is more than 3 years. P2 is shown in both 2005 and 2007so it will be counted and the number of products in 2007 is 1.
What code do I need to use? Thanks in advance!
Hi @dapenDaniel
Here is an attempt to achieve this.
So how should be counted the V5 product. As the time lag is more than 3 years (2003 and 2007), there should be 2 products in 2007: V5 and R4? Did I miss something?
Best,
proc sort data=have;
by FirmID ProductID Year;
run;
data have1;
set have;
by FirmID ProductID Year;
retain Year_ref;
if first.ProductID then Year_ref = year;
if first.ProductID or year >= (Year_ref + 3) then output;
run;
proc freq data=have1 noprint;
table FirmID*Year / out=want (rename=(count = Pro_Num) drop=percent);
run;
Here are instructions on how to provide sample data as a data step:
https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...
Hi @dapenDaniel
Here is an attempt to achieve this.
So how should be counted the V5 product. As the time lag is more than 3 years (2003 and 2007), there should be 2 products in 2007: V5 and R4? Did I miss something?
Best,
proc sort data=have;
by FirmID ProductID Year;
run;
data have1;
set have;
by FirmID ProductID Year;
retain Year_ref;
if first.ProductID then Year_ref = year;
if first.ProductID or year >= (Year_ref + 3) then output;
run;
proc freq data=have1 noprint;
table FirmID*Year / out=want (rename=(count = Pro_Num) drop=percent);
run;
Hi @ed_sas_member ,
Thanks for your answer.
You are correct. I am sorry I put the wrong number for 2007. It should be 2. Thanks for remining me!
It looks to me like you are using the Compustat SEGMENTs data. That's a dataset in which a firm's product-segments (and geo-segments) for a given fiscal year are reported for three successive years.
This is a way to capture re-organization of a company's business model. Say a company changes its product array in 1996. Then you have the 1994 sales data reported twice (once in 1994, once in 1995) using the old product array, followed by once (in 1996) using the new array.
If this indeed is your situation, just remember that if you use sales data for all the unique product identifiers assigned to a given fiscal year you will be double counting whenever a given year's data is reported using multiple product arrays.
That said, here's what I would do to get the number of new products introduced year-by-year:
data have;
input FirmID Year ProductID :$2. ;
datalines;
1001 1995 A1
1001 1995 B3
1001 1995 R9
1001 1995 V5
1001 1996 B2
1001 1996 Y8
1001 1997 A1
1001 1997 O6
1001 1997 U3
1001 1998 E4
1001 1998 W1
1001 1998 P3
1002 2003 D5
1002 2003 I8
1002 2003 V5
1002 2003 A1
1002 2004 G7
1002 2004 N6
1002 2005 I8
1002 2005 P2
1002 2005 M5
1002 2006 N6
1002 2006 K9
1002 2006 H4
1002 2007 V5
1002 2007 P2
1002 2007 R4
run;
data want (keep=firmid year pro_num);
length _long_list $2000;
do until (last.firmid);
do pro_num=0 by 0 until (last.year);
set have;
by firmid year;
if findw(_long_list,productid,' ')=0 then do;
pro_num=pro_num+1;
_long_list=catx(' ',_long_list,productid);
end;
end;
output;
end;
run;
Why is PRO_NUM for 1002/2007 only 1. It has three values (V5, P2, and R4)
R4 has never occurred before, and should be counted.
V5 was most recently listed in 2003, more than 3 year prior, so should be counted.
P2 was most recently listed in 2005, so is less than 4 years prior
Shouldn't both R4 and V5 be counted?
Hi @mkeintz ,
Thanks for your reply. You are correct. The number for 2007 should be 2. Thanks for correcting me!
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.