Solved: Re: Removing Duplicate Dates with conditions - Page 2

sjm · Posted 05-11-2020 09:04 AM

Thanks, Shmuel.

sjm · Posted 05-11-2020 07:09 AM

Thank you for your solution. It works very well!

I tried using the same code (modified slightly) to eliminate duplicates for a quarterly data set but ran into some difficulties and would be grateful for your help.

Unlike the other dataset that had yearly observations with different dates, the duplicates in this quarterly dataset have identical dates, so I was not able to use MAX(.) efficiently in the code.

For the quarterly dataset, the rule to select the correct date, is that the observation must have more variables with non-missing values. For duplicates with the same number of non-missing variables, the rule is to select the observation where DATAFQTR has the same calendar year as DATADATE. DATAFQTR is however a character variable.

I have attached the code and a sample of the SAS datafile that highlights these problem for two firms. They are identified by their respective GVKEY’s.

Notice that firm 206883 has duplicate observations (17 and 18) that differ only for DATAFQTR. Here, observation 17 should be chosen since its DATAFQTR has the same calendar year as DATADATE (i.e. 2009). For observations 21 and 22 for the same firm, observation 21 should be chosen since it has more variables with non-missing values. Here, DATAFTQR has the same calendar date as DATADATE.

Firm 212151 has duplicate observations 49 and 50. In this case, observation 49 should be chosen since it has more variables with non-missing values even if DATAFQTR does not have the same calendar date as DATAFQTR.

The variable DATAFQTR could possibly help to solve this problem but I am not sure how to use it here. Thanks.

proc sql;
	
	create table want as

	/****************************/
	/*obs with 1 record per year*/
	select *
	from have
	group by Firm, year(date)
	having count(year(date))=1
	/****************************/

	union all corr

	/*********************************************************************/
	/*obs with >1 record per year and all records with > 1 missing values*/	
	select a.*
	from have as a right join
		(select Firm, max(date) as Date2 format=DDMMYY10. /* Min Max ??*/
		from have
		group by Firm, year(date)
		having count(year(date))>1 and min(nmiss(A,B,C,D)) > 1) as b
		on a.Firm=b.Firm and a.Date=b.Date2
	/*********************************************************************/

	union all corr

	/*********************************************************************/
	/*obs with >1 record per year and at least one record with 0 or 1 missing value*/
	select a.*
	from have as a right join
		(select Firm, max(date) as Date2 format=DDMMYY10.
		from (select *
			  from have
			  group by Firm, year(date)
	     	  having count(year(date))>1 and min(nmiss(A,B,C,D)) <= 1)
	     where nmiss(A,B,C,D)<= 1
	     group by Firm, year(date)) as b
	     on a.Firm=b.Firm and a.Date=b.Date2
	/*********************************************************************/
	
	 order by obs;
	 
quit;

sjm · Posted 05-11-2020 07:29 AM

Sorry, the wrong SAS code was attached. Please use this one instead. Thanks.

proc sql;	
	create table temp5 as
	/****************************/
	/*obs with 1 record per year*/
	select *
	from testfile
	group by gvkey, year(datadate), qtr(datadate)
	having count(qtr(datadate))=1
	/****************************/
	union all corr
	/*********************************************************************/
	/*obs with >1 record per year and all records with > 1 missing values*/	
	select a.*
	from testfile as a right join
		(select gvkey, max(datadate) as Date2 format=DDMMYY10. /* Min Max ??*/
		from testfile
		group by gvkey, year(datadate), qtr(datadate)
		having count(qtr(datadate))>1 and min(nmiss(atq,cheq,ltq,revtq)) > 1) as b
		on a.gvkey=b.gvkey and a.datadate=b.Date2
	/*********************************************************************/
	union all corr
	/*********************************************************************/
	/*obs with >1 record per year and at least one record with 0 or 1 missing value*/
	select a.*
	from testfile as a right join
		(select gvkey, max(datadate) as Date2 format=DDMMYY10.
		from (select *
			  from testfile
			  group by gvkey, year(datadate), qtr(datadate)
	     	  having count(qtr(datadate))>1 and min(nmiss(atq,cheq,ltq,revtq)) <= 1)
	     where nmiss(atq,cheq,ltq,revtq)<= 1
	     group by gvkey, year(datadate), qtr(datadate)) as b
	     on a.gvkey=b.gvkey and a.datadate=b.Date2;
quit;

ed_sas_member · Posted 05-11-2020 10:19 AM

Hi @sjm

I have printed below the last of "duplicates" based on the key 'GVKEY, DATADATE'.

Could you please clarify the rule for obs 19&20 or 23&24 for example -> they have the same number of missing values (actually 0) and the year identify in the variable DATAFQTR is the same as the one in DATADATE. Which record should we choose?

Best,

sjm · Posted 05-11-2020 10:29 AM

Thank you for pointing this out. When even the calendar matches, then the last resort is to look at which quarter is most recent. In this case, it is Q2, so observation 19 should be chosen.

ed_sas_member · Posted 05-11-2020 10:57 AM

Hi @sjm

Thank you for the clarification 😊

I have adapted the program as follows. I have also edited the list of removed observations so that you can check that the program works according to your expectations.

Best,

data have;
	set mylib.testfile;	/*Input file*/
	obs_num+1; 			/*Observation number*/
run;

proc sql;
	
	create table want as

	/****************************/
	/*obs with 1 record per quarter per year*/
	select *
	from have
	group by GVKEY, DATADATE
	having count(DATADATE)=1
	/****************************/

	union all corr

	/*********************************************************************/
	/*obs with >1 record per year AND 1 record with the min number of missing values */	
	select *
	from (select *
		from 
			(select *, nmiss(ATQ,CHEQ,LTQ,REVTQ) as count_miss
			from have
			group by GVKEY, DATADATE
			having count(DATADATE)>1)
		group by GVKEY, DATADATE
		having count_miss = min(count_miss))
		group by GVKEY, DATADATE
		having count(DATADATE)=1
		
	/*********************************************************************/

	union all corr

	/*********************************************************************/
	/*obs with >1 record per year AND >1 records:
		- with the min number of missing values
		- 1 record having sames years in both DATADATE & DATAFQTR variables */

	select *
	from (select *
		  from (select *
				from 
					(select *, nmiss(ATQ,CHEQ,LTQ,REVTQ) as count_miss
					from have
					group by GVKEY, DATADATE
					having count(DATADATE)>1)
		 		group by GVKEY, DATADATE
				having count_miss = min(count_miss))
		  group by GVKEY, DATADATE
		  having count(DATADATE)>1)
	where year(DATADATE)=input(substr(DATAFQTR,1,4),4.)
	group by GVKEY, DATADATE
	having count(DATADATE)=1
	
	/*********************************************************************/

union all corr

	/*********************************************************************/
	/*obs with >1 record per year AND >1 records:
		- with the min number of missing values
		- >1 records having sames years in both DATADATE & DATAFQTR variables */
	select *
	from(select *
		from (select *
			  from (select *
					from 
						(select *, nmiss(ATQ,CHEQ,LTQ,REVTQ) as count_miss
						from have
						group by GVKEY, DATADATE
						having count(DATADATE)>1)
			 		group by GVKEY, DATADATE
					having count_miss = min(count_miss))
			  group by GVKEY, DATADATE
			  having count(DATADATE)>1)
		where year(DATADATE)=input(substr(DATAFQTR,1,4),4.)
		group by GVKEY, DATADATE
		having count(DATADATE)>1)
	group by GVKEY, DATADATE
	having input(substr(DATAFQTR,6,1),1.) = max(input(substr(DATAFQTR,6,1),1.))
	
	/*********************************************************************/
	
 	order by obs_num;
quit;

/******* Quality checks *******/

data check_delete;
	merge have(in=x) want(in=y);
	by obs_num;
	if y=0;
run;

proc print data=check_delete;
	title 'QUALITY CHECK: list of removed observations';
	id obs_num;
run;

sjm · Posted 05-13-2020 07:52 AM

Thanks v much. The code does the job nicely!

Shmuel · Posted 05-09-2020 11:27 AM

Check next code:

data have;
infile cards truncover dlm='09'x;
input Obs Firm Date :ddmmyy10. A B C D;
yy = year(date);
format date ddmmyy10.;
cards;
1	1	30/6/2014	1.6	.	3.6	.
2	1	31/12/2014	1.2	3.1	3	5.6
3	1	31/12/2015	1.7	2.5	2.3	5.1
4	1	31/12/2016	1.4	2.3	1.9	6
5	2	31/12/2014	6.2	5.7	7.2	8
6	2	31/12/2015	5.8	6.2	7	8.2
7	2	31/3/2016	5.9	6.4	6.5	8.9
8	2	30/9/2016	6.5	7	.	.
;
run;

data temp;
 set have;
     miss = nmiss(a,b,c,d);
run;
proc sort data=temp; by descending yy miss; run;

data want;
 set temp;
  by descending yy;
     if first.yy or miss=0;
     drop yy;
run;
proc sort data=want; by date; run;

Tom · Posted 05-10-2020 09:46 AM

One way is to first figure out the minimum number of missing values for that firm for that year. Then you can pick the most recent date with that number of missing values.

Let's start by turning your listing into data.

data have;
  input firm date :ddmmyy. a b c d;
  format date yymmdd10.;
cards;
1 30/06/2014 1.6   . 3.6 .
1 31/12/2014 1.2 3.1   3 5.6
1 31/12/2015 1.7 2.5 2.3 5.1
1 31/12/2016 1.4 2.3 1.9 6
2 31/12/2014 6.2 5.7 7.2 8
2 31/12/2015 5.8 6.2   7 8.2
2 31/03/2016 5.9 6.4 6.5 8.9
2 30/09/2016 6.5   7   . .
;

Now we can use an SQL step to make the new variables and calculate the min missing and also do the sorting.

proc sql;
create table step1 as 
  select *
       , year(date) as yr
       , nmiss(a,b,c,d) as nmiss
       , min(calculated nmiss) as min_miss
  from have
  group by firm,yr
  order by firm,yr,date desc 
;
quit;

Now we just need to find the first place where nmiss matches min_miss and output that record only.

data want;
  set step1;
  by firm yr;
  if first.yr then found=0;
  if not found and min_miss=nmiss then do;
    found=1;
    output;
  end;
  retain found;
run;

Shmuel · Posted 05-10-2020 10:11 AM

@Tom, I like your approach but your code
selects obs 9 (apr 2016) instead obs 10 (sep 2016)

Tom · Posted 05-10-2020 12:34 PM

If you want to allow selecting an observations with more than the actual minimum number of missing values just change the logic of finding the first date.

  if not found and max(1,min_miss)>=nmiss then do;

Registration is open

SAS Training: Just a Click Away