BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
dapenDaniel
Obsidian | Level 7

Hi SAS Master,

 

I have a dataset with three variables, including firmID, year, and ProductID. I would like to finish two tasks.

 

1. remove duplicate product within 3 years for each firm so that a product is only shown once within 3 years for a specific firm. (A product can be shown in several different firms. It can also be shown several times in a specific firm as long as the time lag is longer than 3 years. I just want to make sure that one product is only shown once within 3 years for a specific firm.)

 

2. count the number of products that each firm has for each year.

 

For example:

 

have

FirmIDYearProductID
10011995A1
10011995B3
10011995R9
10011995V5
10011996B2
10011996Y8
10011997A1
10011997O6
10011997U3
10011998E4
10011998W1
10011998P3
10022003D5
10022003I8
10022003V5
10022003A1
10022004G7
10022004N6
10022005I8
10022005P2
10022005M5
10022006N6
10022006K9
10022006H4
10022007V5
10022007P2
10022007R4


want

 

FirmIDYearPro_Num
100119954
100119962
100119972
100119983
100220034
100220042
100220052
100220062
100220072

 

A1 is shown in 1995 and 1997, so the number of products in 1997 is 2.

V5 is shown in both companies, so it is counted in firm 1001 in 1995 and firm 1002 in 2003.

I8 is shown in 2003 and 2005, so the number of products in 2005 is 2.

N6 is shown in 2004 and 2006, so the number of products in 2006 is 2.

V5 is shown in both in 2003 and 2007 and the time lag is more than 3 years. P2 is shown in both 2005 and 2007so it will be counted and the number of products in 2007 is 1.

 

 

 

 

What code do I need to use? Thanks in advance!

1 ACCEPTED SOLUTION

Accepted Solutions
ed_sas_member
Meteorite | Level 14

Hi @dapenDaniel 

 

Here is an attempt to achieve this.

So how should be counted the V5 product. As the time lag is more than 3 years (2003 and 2007), there should be 2 products in 2007: V5 and R4? Did I miss something?

 

Best,

 

proc sort data=have;
	by FirmID ProductID Year;
run;

data have1;
	set have;
	by FirmID ProductID Year;

	retain Year_ref;
	if first.ProductID then Year_ref = year;

	if first.ProductID or year >= (Year_ref + 3) then output;
run;

proc freq data=have1 noprint;
	table FirmID*Year / out=want (rename=(count = Pro_Num) drop=percent);
run;

View solution in original post

7 REPLIES 7
ed_sas_member
Meteorite | Level 14

Hi @dapenDaniel 

 

Here is an attempt to achieve this.

So how should be counted the V5 product. As the time lag is more than 3 years (2003 and 2007), there should be 2 products in 2007: V5 and R4? Did I miss something?

 

Best,

 

proc sort data=have;
	by FirmID ProductID Year;
run;

data have1;
	set have;
	by FirmID ProductID Year;

	retain Year_ref;
	if first.ProductID then Year_ref = year;

	if first.ProductID or year >= (Year_ref + 3) then output;
run;

proc freq data=have1 noprint;
	table FirmID*Year / out=want (rename=(count = Pro_Num) drop=percent);
run;
Reeza
Super User
What if you have a more than 5 year period though? Ie 1998 to 2020? Using Year_ref+3 won't account for that, will it?
dapenDaniel
Obsidian | Level 7

Hi @ed_sas_member ,

 

Thanks for your answer.

 

You are correct. I am sorry I put the wrong number for 2007. It should be 2. Thanks for remining me!

mkeintz
PROC Star

@dapenDaniel 

 

It looks to me like you are using the Compustat SEGMENTs data.  That's a dataset in which a firm's product-segments (and geo-segments) for a given fiscal year are reported for three successive years.

 

This is a way to capture re-organization of a company's business model.  Say a company changes its product array in 1996.   Then you have the 1994 sales data reported twice (once in 1994, once in 1995) using the old product array, followed by once (in 1996) using the new array.

 

If this indeed is your situation, just remember that if you use sales data for all the unique product identifiers assigned to a given fiscal year you will be double counting whenever a given year's data is reported using multiple product arrays.

 

That said, here's what I would do to get the number of new products introduced year-by-year:

 

data have;
  input FirmID   Year      ProductID :$2. ;
datalines;
1001     1995      A1
1001     1995      B3
1001     1995      R9
1001     1995      V5
1001     1996      B2
1001     1996      Y8
1001     1997      A1
1001     1997      O6
1001     1997      U3
1001     1998      E4
1001     1998      W1
1001     1998      P3
1002     2003      D5
1002     2003      I8
1002     2003      V5
1002     2003      A1
1002     2004      G7
1002     2004      N6
1002     2005      I8
1002     2005      P2
1002     2005      M5
1002     2006      N6
1002     2006      K9
1002     2006      H4
1002     2007      V5
1002     2007      P2
1002     2007      R4
run;

data want (keep=firmid year pro_num);
  length _long_list $2000;
  do until (last.firmid);
    do pro_num=0 by 0 until (last.year);
      set have;
      by firmid year;
      if findw(_long_list,productid,' ')=0 then do;
        pro_num=pro_num+1;
        _long_list=catx(' ',_long_list,productid);
      end;
    end;
    output;
  end;
run;
  1. Just make sure the variable _LONG_LIST is long enough to hold the largest conceivable list of products for any firm.
  2. Why do I use "do pro_num=0 by 0 unitl (last.year)".  Only becuase it is a single-statement replacing these two statements:
    1. pro_num=0;
    2. DO until (last.year);
  3. This program assumes the non of your PRODUCTID values have an internal blank

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mkeintz
PROC Star

Why is PRO_NUM for 1002/2007 only 1.  It has three values (V5, P2, and R4)

   R4 has never occurred before, and should be counted.

   V5 was most recently listed in 2003, more than 3 year prior, so should be counted.

   P2 was most recently listed in 2005, so is less than 4 years prior

Shouldn't both R4 and V5 be counted?

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
dapenDaniel
Obsidian | Level 7

Hi @mkeintz ,

 

Thanks for your reply. You are correct. The number for 2007 should be 2. Thanks for correcting me!

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 7 replies
  • 711 views
  • 0 likes
  • 4 in conversation