Complicated case: removal of duplicates

Accepted Solution Solved
Reply
Occasional Contributor DiG
Occasional Contributor
Posts: 19
Accepted Solution

Complicated case: removal of duplicates

Hello,

I have difficulties in removing duplicates in my dataset. There are certain conditions that I don’t know how to code.

I have a table that lists companies in the form of company id, fiscal year end date (FYE), and date when the company reported some financial statements. Each company may report several statements on a given day for the same FYE, like in rows 2-3 or 5-7.

Also, some companies can have more than one FYE falling in the same calendar year like in rows 9-10, 11-12 and 14-15.

#

company ID

FYE

reporting date

amount

1

330A

  31DEC2005

  31MAR2006

456

2

330A

  31DEC2006

  31MAY2007

393

3

330A

  31DEC2006

  31MAY2007

695

4

350B

  31MAY2008

  10OCT2008

45

5

350B

  31MAY2009

  22JUL2009

86

6

350B

  31MAY2009

  22JUL2009

23334

7

350B

  31MAY2009

  22JUL2009

866

8

350B

  31MAY2010

  16OCT2011

299

9

840A

  20AUG2011

  31DEC2011

3245

10

840A

  31OCT2011

  31DEC2011

876

11

840A

20AUG2012

  31NOV2012

34

12

840A

  31OCT2012

  31NOV2012

87

13

840A

  20AUG2013

  30SEP2013

222

14

125N

  20JUL2009

  28NOV2009

64

15

125N

  20SEP2009

  18SEP2010

88777

16

125N

  20JUL2010

  18SEP2010

754

My goal is to ‘clean’ those that have several FYE in the given calendar year leaving only records with the latest FYE. Cases with multiple financial reports for the same FYE should stay as they are. This is what the table should look like after the cleaning:

#

company ID

FYE

reporting date

amount

1

330A

  31DEC2005

  31MAR2006

456

2

330A

  31DEC2006

  31MAY2007

393

3

330A

  31DEC2006

  31MAY2007

695

4

350B

  31MAY2008

  10OCT2008

45

5

350B

  31MAY2009

  22JUL2009

86

6

350B

  31MAY2009

  22JUL2009

23334

7

350B

  31MAY2009

  22JUL2009

866

8

350B

  31MAY2010

  16OCT2011

299

9

840A

  31OCT2011

  31DEC2011

876

10

840A

  31OCT2012

  31NOV2012

87

11

840A

  20AUG2013

  30SEP2013

222

12

125N

  20SEP2009

  18SEP2010

88777

13

125N

  20JUL2010

  18SEP2010

754

Can anybody help me please?


Accepted Solutions
Solution
‎09-30-2013 12:34 PM
Respected Advisor
Posts: 3,156

Re: Complicated case: removal of duplicates

It could be as simple as the following:

proc sql; 

  create table want as

  select * from have

group by company, year(fye)

having fye=max(fye)

;

quit;

Not tested, as I have issues to input your data due to the format, but you 've got the idea.

Haikuo

View solution in original post


All Replies
Solution
‎09-30-2013 12:34 PM
Respected Advisor
Posts: 3,156

Re: Complicated case: removal of duplicates

It could be as simple as the following:

proc sql; 

  create table want as

  select * from have

group by company, year(fye)

having fye=max(fye)

;

quit;

Not tested, as I have issues to input your data due to the format, but you 've got the idea.

Haikuo

Super Contributor
Posts: 334

Re: Complicated case: removal of duplicates

I had the data inputted already and code matches the output requested.

EJ

Occasional Contributor DiG
Occasional Contributor
Posts: 19

Re: Complicated case: removal of duplicates

Yes, this is what I needed. Thank you.

Super Contributor
Posts: 339

Re: Complicated case: removal of duplicates

The issue with the above is that year function only extracts the 4 digit year whilst fiscal years are not necessarily from 01/01-31/12. This could lead to, for example, having FYE 06/31/2012 and 12/31/2011 deemed as 2 distinct fiscal years whilst then 06/31/2011 would be an ignored data point for the "bad" 12/31/2011 FYE.

I don't quite have an easy solution on top of my head to further account for this though.

Occasional Contributor DiG
Occasional Contributor
Posts: 19

Re: Complicated case: removal of duplicates

Posted in reply to Vince28_Statcan

Looks like Hai.kuo's code works for me.

Super Contributor
Posts: 339

Re: Complicated case: removal of duplicates


It depends on how the data is setup and any additionnal knowledge you have about it. In Canada, companies can each have their own distinct fiscal year period and depending on their size they are forced, by law, to report data either on a monthly or yearly basis (there might be additionnal periodicity cases, I'm not working in the business survey department). Point being, depending on what you ought to do with your timeseries considering the FYE as the latest reported FYE date of a given year may or may not achieve the desired analysis.

If my fiscal year is from sept01 to aug31 and that I report on a monthly or quarterly basis, the approach presented by Hai.Kuo would consider the FYE of my company to be on dec31 or so if it reports monthly or on november 30 for quarterly. Again, it is very possible that the way your data is setup, this solution is sufficient. I'm merely giving you a big warning not to draw conclusions too fast.

Vincent

Respected Advisor
Posts: 3,156

Re: Complicated case: removal of duplicates

Posted in reply to Vince28_Statcan

Good point, . I believe that can be solved using customized date interval as long as all of the companies are using the same fical year. If not, then the code complexity will have to be bumped up to a higher level.

Say if Fical year starts on 6/1:

%let date='31may2013'd;

data _null_;

year=year(intnx('year.6',&date,0,'begin'));

put year=;

run;

Haikuo

Occasional Contributor DiG
Occasional Contributor
Posts: 19

Re: Complicated case: removal of duplicates

Thank you Hai.kuo! It worked!! Smiley Happy

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 262 views
  • 7 likes
  • 4 in conversation