Hello,
I have difficulties in removing duplicates in my dataset. There are certain conditions that I don’t know how to code.
I have a table that lists companies in the form of company id, fiscal year end date (FYE), and date when the company reported some financial statements. Each company may report several statements on a given day for the same FYE, like in rows 2-3 or 5-7.
Also, some companies can have more than one FYE falling in the same calendar year like in rows 9-10, 11-12 and 14-15.
# | company ID | FYE | reporting date | amount |
1 | 330A | 31DEC2005 | 31MAR2006 | 456 |
2 | 330A | 31DEC2006 | 31MAY2007 | 393 |
3 | 330A | 31DEC2006 | 31MAY2007 | 695 |
4 | 350B | 31MAY2008 | 10OCT2008 | 45 |
5 | 350B | 31MAY2009 | 22JUL2009 | 86 |
6 | 350B | 31MAY2009 | 22JUL2009 | 23334 |
7 | 350B | 31MAY2009 | 22JUL2009 | 866 |
8 | 350B | 31MAY2010 | 16OCT2011 | 299 |
9 | 840A | 20AUG2011 | 31DEC2011 | 3245 |
10 | 840A | 31OCT2011 | 31DEC2011 | 876 |
11 | 840A | 20AUG2012 | 31NOV2012 | 34 |
12 | 840A | 31OCT2012 | 31NOV2012 | 87 |
13 | 840A | 20AUG2013 | 30SEP2013 | 222 |
14 | 125N | 20JUL2009 | 28NOV2009 | 64 |
15 | 125N | 20SEP2009 | 18SEP2010 | 88777 |
16 | 125N | 20JUL2010 | 18SEP2010 | 754 |
My goal is to ‘clean’ those that have several FYE in the given calendar year leaving only records with the latest FYE. Cases with multiple financial reports for the same FYE should stay as they are. This is what the table should look like after the cleaning:
# | company ID | FYE | reporting date | amount |
1 | 330A | 31DEC2005 | 31MAR2006 | 456 |
2 | 330A | 31DEC2006 | 31MAY2007 | 393 |
3 | 330A | 31DEC2006 | 31MAY2007 | 695 |
4 | 350B | 31MAY2008 | 10OCT2008 | 45 |
5 | 350B | 31MAY2009 | 22JUL2009 | 86 |
6 | 350B | 31MAY2009 | 22JUL2009 | 23334 |
7 | 350B | 31MAY2009 | 22JUL2009 | 866 |
8 | 350B | 31MAY2010 | 16OCT2011 | 299 |
9 | 840A | 31OCT2011 | 31DEC2011 | 876 |
10 | 840A | 31OCT2012 | 31NOV2012 | 87 |
11 | 840A | 20AUG2013 | 30SEP2013 | 222 |
12 | 125N | 20SEP2009 | 18SEP2010 | 88777 |
13 | 125N | 20JUL2010 | 18SEP2010 | 754 |
Can anybody help me please?
It could be as simple as the following:
proc sql;
create table want as
select * from have
group by company, year(fye)
having fye=max(fye)
;
quit;
Not tested, as I have issues to input your data due to the format, but you 've got the idea.
Haikuo
It could be as simple as the following:
proc sql;
create table want as
select * from have
group by company, year(fye)
having fye=max(fye)
;
quit;
Not tested, as I have issues to input your data due to the format, but you 've got the idea.
Haikuo
Yes, this is what I needed. Thank you.
The issue with the above is that year function only extracts the 4 digit year whilst fiscal years are not necessarily from 01/01-31/12. This could lead to, for example, having FYE 06/31/2012 and 12/31/2011 deemed as 2 distinct fiscal years whilst then 06/31/2011 would be an ignored data point for the "bad" 12/31/2011 FYE.
I don't quite have an easy solution on top of my head to further account for this though.
Looks like Hai.kuo's code works for me.
It depends on how the data is setup and any additionnal knowledge you have about it. In Canada, companies can each have their own distinct fiscal year period and depending on their size they are forced, by law, to report data either on a monthly or yearly basis (there might be additionnal periodicity cases, I'm not working in the business survey department). Point being, depending on what you ought to do with your timeseries considering the FYE as the latest reported FYE date of a given year may or may not achieve the desired analysis.
If my fiscal year is from sept01 to aug31 and that I report on a monthly or quarterly basis, the approach presented by Hai.Kuo would consider the FYE of my company to be on dec31 or so if it reports monthly or on november 30 for quarterly. Again, it is very possible that the way your data is setup, this solution is sufficient. I'm merely giving you a big warning not to draw conclusions too fast.
Vincent
Good point, . I believe that can be solved using customized date interval as long as all of the companies are using the same fical year. If not, then the code complexity will have to be bumped up to a higher level.
Say if Fical year starts on 6/1:
%let date='31may2013'd;
data _null_;
year=year(intnx('year.6',&date,0,'begin'));
put year=;
run;
Haikuo
Thank you Hai.kuo! It worked!!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.