Thank you for your solution. It works very well!
I tried using the same code (modified slightly) to eliminate duplicates for a quarterly data set but ran into some difficulties and would be grateful for your help.
Unlike the other dataset that had yearly observations with different dates, the duplicates in this quarterly dataset have identical dates, so I was not able to use MAX(.) efficiently in the code.
For the quarterly dataset, the rule to select the correct date, is that the observation must have more variables with non-missing values. For duplicates with the same number of non-missing variables, the rule is to select the observation where DATAFQTR has the same calendar year as DATADATE. DATAFQTR is however a character variable.
I have attached the code and a sample of the SAS datafile that highlights these problem for two firms. They are identified by their respective GVKEY’s.
Notice that firm 206883 has duplicate observations (17 and 18) that differ only for DATAFQTR. Here, observation 17 should be chosen since its DATAFQTR has the same calendar year as DATADATE (i.e. 2009). For observations 21 and 22 for the same firm, observation 21 should be chosen since it has more variables with non-missing values. Here, DATAFTQR has the same calendar date as DATADATE.
Firm 212151 has duplicate observations 49 and 50. In this case, observation 49 should be chosen since it has more variables with non-missing values even if DATAFQTR does not have the same calendar date as DATAFQTR.
The variable DATAFQTR could possibly help to solve this problem but I am not sure how to use it here. Thanks.
proc sql;
create table want as
/****************************/
/*obs with 1 record per year*/
select *
from have
group by Firm, year(date)
having count(year(date))=1
/****************************/
union all corr
/*********************************************************************/
/*obs with >1 record per year and all records with > 1 missing values*/
select a.*
from have as a right join
(select Firm, max(date) as Date2 format=DDMMYY10. /* Min Max ??*/
from have
group by Firm, year(date)
having count(year(date))>1 and min(nmiss(A,B,C,D)) > 1) as b
on a.Firm=b.Firm and a.Date=b.Date2
/*********************************************************************/
union all corr
/*********************************************************************/
/*obs with >1 record per year and at least one record with 0 or 1 missing value*/
select a.*
from have as a right join
(select Firm, max(date) as Date2 format=DDMMYY10.
from (select *
from have
group by Firm, year(date)
having count(year(date))>1 and min(nmiss(A,B,C,D)) <= 1)
where nmiss(A,B,C,D)<= 1
group by Firm, year(date)) as b
on a.Firm=b.Firm and a.Date=b.Date2
/*********************************************************************/
order by obs;
quit;
Sorry, the wrong SAS code was attached. Please use this one instead. Thanks.
proc sql;
create table temp5 as
/****************************/
/*obs with 1 record per year*/
select *
from testfile
group by gvkey, year(datadate), qtr(datadate)
having count(qtr(datadate))=1
/****************************/
union all corr
/*********************************************************************/
/*obs with >1 record per year and all records with > 1 missing values*/
select a.*
from testfile as a right join
(select gvkey, max(datadate) as Date2 format=DDMMYY10. /* Min Max ??*/
from testfile
group by gvkey, year(datadate), qtr(datadate)
having count(qtr(datadate))>1 and min(nmiss(atq,cheq,ltq,revtq)) > 1) as b
on a.gvkey=b.gvkey and a.datadate=b.Date2
/*********************************************************************/
union all corr
/*********************************************************************/
/*obs with >1 record per year and at least one record with 0 or 1 missing value*/
select a.*
from testfile as a right join
(select gvkey, max(datadate) as Date2 format=DDMMYY10.
from (select *
from testfile
group by gvkey, year(datadate), qtr(datadate)
having count(qtr(datadate))>1 and min(nmiss(atq,cheq,ltq,revtq)) <= 1)
where nmiss(atq,cheq,ltq,revtq)<= 1
group by gvkey, year(datadate), qtr(datadate)) as b
on a.gvkey=b.gvkey and a.datadate=b.Date2;
quit;
Hi @sjm
I have printed below the last of "duplicates" based on the key 'GVKEY, DATADATE'.
Could you please clarify the rule for obs 19&20 or 23&24 for example -> they have the same number of missing values (actually 0) and the year identify in the variable DATAFQTR is the same as the one in DATADATE. Which record should we choose?
Best,
Hi @sjm
Thank you for the clarification 😊
I have adapted the program as follows. I have also edited the list of removed observations so that you can check that the program works according to your expectations.
Best,
data have;
set mylib.testfile; /*Input file*/
obs_num+1; /*Observation number*/
run;
proc sql;
create table want as
/****************************/
/*obs with 1 record per quarter per year*/
select *
from have
group by GVKEY, DATADATE
having count(DATADATE)=1
/****************************/
union all corr
/*********************************************************************/
/*obs with >1 record per year AND 1 record with the min number of missing values */
select *
from (select *
from
(select *, nmiss(ATQ,CHEQ,LTQ,REVTQ) as count_miss
from have
group by GVKEY, DATADATE
having count(DATADATE)>1)
group by GVKEY, DATADATE
having count_miss = min(count_miss))
group by GVKEY, DATADATE
having count(DATADATE)=1
/*********************************************************************/
union all corr
/*********************************************************************/
/*obs with >1 record per year AND >1 records:
- with the min number of missing values
- 1 record having sames years in both DATADATE & DATAFQTR variables */
select *
from (select *
from (select *
from
(select *, nmiss(ATQ,CHEQ,LTQ,REVTQ) as count_miss
from have
group by GVKEY, DATADATE
having count(DATADATE)>1)
group by GVKEY, DATADATE
having count_miss = min(count_miss))
group by GVKEY, DATADATE
having count(DATADATE)>1)
where year(DATADATE)=input(substr(DATAFQTR,1,4),4.)
group by GVKEY, DATADATE
having count(DATADATE)=1
/*********************************************************************/
union all corr
/*********************************************************************/
/*obs with >1 record per year AND >1 records:
- with the min number of missing values
- >1 records having sames years in both DATADATE & DATAFQTR variables */
select *
from(select *
from (select *
from (select *
from
(select *, nmiss(ATQ,CHEQ,LTQ,REVTQ) as count_miss
from have
group by GVKEY, DATADATE
having count(DATADATE)>1)
group by GVKEY, DATADATE
having count_miss = min(count_miss))
group by GVKEY, DATADATE
having count(DATADATE)>1)
where year(DATADATE)=input(substr(DATAFQTR,1,4),4.)
group by GVKEY, DATADATE
having count(DATADATE)>1)
group by GVKEY, DATADATE
having input(substr(DATAFQTR,6,1),1.) = max(input(substr(DATAFQTR,6,1),1.))
/*********************************************************************/
order by obs_num;
quit;
/******* Quality checks *******/
data check_delete;
merge have(in=x) want(in=y);
by obs_num;
if y=0;
run;
proc print data=check_delete;
title 'QUALITY CHECK: list of removed observations';
id obs_num;
run;
Check next code:
data have;
infile cards truncover dlm='09'x;
input Obs Firm Date :ddmmyy10. A B C D;
yy = year(date);
format date ddmmyy10.;
cards;
1 1 30/6/2014 1.6 . 3.6 .
2 1 31/12/2014 1.2 3.1 3 5.6
3 1 31/12/2015 1.7 2.5 2.3 5.1
4 1 31/12/2016 1.4 2.3 1.9 6
5 2 31/12/2014 6.2 5.7 7.2 8
6 2 31/12/2015 5.8 6.2 7 8.2
7 2 31/3/2016 5.9 6.4 6.5 8.9
8 2 30/9/2016 6.5 7 . .
;
run;
data temp;
set have;
miss = nmiss(a,b,c,d);
run;
proc sort data=temp; by descending yy miss; run;
data want;
set temp;
by descending yy;
if first.yy or miss=0;
drop yy;
run;
proc sort data=want; by date; run;
One way is to first figure out the minimum number of missing values for that firm for that year. Then you can pick the most recent date with that number of missing values.
Let's start by turning your listing into data.
data have;
input firm date :ddmmyy. a b c d;
format date yymmdd10.;
cards;
1 30/06/2014 1.6 . 3.6 .
1 31/12/2014 1.2 3.1 3 5.6
1 31/12/2015 1.7 2.5 2.3 5.1
1 31/12/2016 1.4 2.3 1.9 6
2 31/12/2014 6.2 5.7 7.2 8
2 31/12/2015 5.8 6.2 7 8.2
2 31/03/2016 5.9 6.4 6.5 8.9
2 30/09/2016 6.5 7 . .
;
Now we can use an SQL step to make the new variables and calculate the min missing and also do the sorting.
proc sql;
create table step1 as
select *
, year(date) as yr
, nmiss(a,b,c,d) as nmiss
, min(calculated nmiss) as min_miss
from have
group by firm,yr
order by firm,yr,date desc
;
quit;
Now we just need to find the first place where nmiss matches min_miss and output that record only.
data want;
set step1;
by firm yr;
if first.yr then found=0;
if not found and min_miss=nmiss then do;
found=1;
output;
end;
retain found;
run;
If you want to allow selecting an observations with more than the actual minimum number of missing values just change the logic of finding the first date.
if not found and max(1,min_miss)>=nmiss then do;
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.