Solved: Removing duplicate dates from observation, and count number of dates

Barkat · Posted 07-15-2020 07:11 PM

Hello,

I have the following dataset

data have;
infile datalines dsd dlm=',' ;
input name :$3. id :$3. date1 :MMDDYY10. date2 :MMDDYY10. date3 :MMDDYY10. date4 :MMDDYY10. date5 :MMDDYY10.;
datalines;
abc,010,01/10/2017,03/20/2016,.,01/10/2017,.
def,023,.,.,04/20/2018,.,04/20/2018
pqr,125,08/10/2013,.,05/20/2012,.,.
xyz,200,01/12/2016,01/21/2016,05/10/2017,09/01/2018,03/02/2016
;
run;

I would like to have an output from the above that will look like the output of the following dataset. Please note that there will be no duplicate dates for an observation, and the remaining dates will be sorted and shifted to the left columns in ascending or descending order. Missing values will be shifted to the right columns.

I would also like to add a column that will count the number of dates of each observation. Suppose number of dates for abc is 2, number of dates for def is 1, number of dates for pqr is 2, and number of dates for xyz is 5.

data have;
infile datalines dsd dlm=',' ;
input name :$3. id :$3. date1 :MMDDYY10. date2 :MMDDYY10. date3 :MMDDYY10. date4 :MMDDYY10. date5 :MMDDYY10.;
datalines;
abc,010,03/20/2016,01/10/2017,.,.,.
def,023,04/20/2018,.,.,.,.,.
pqr,125,05/20/2012,08/10/2013,.,.,.
xyz,200,01/12/2016,01/21/2016,03/02/2016,05/10/2017,09/01/2018
;
run;

Thanks a lot.

mkeintz · Posted 07-16-2020 02:06 AM

You can declare an array to hold sorted values of the original dates, then traverse that array copying non-duplicate values back to the original data varables:

data have;
infile datalines dsd dlm=',' ;
input name :$3. id :$3. date1 :MMDDYY10. date2 :MMDDYY10. date3 :MMDDYY10. date4 :MMDDYY10. date5 :MMDDYY10.;
format date:  mmddyy10. ;
datalines;
abc,010,01/10/2017,03/20/2016,.,01/10/2017,.
def,023,.,.,04/20/2018,.,04/20/2018
pqr,125,08/10/2013,.,05/20/2012,.,.
xyz,200,01/12/2016,01/21/2016,05/10/2017,09/01/2018,03/02/2016
run;

data want (drop=d t nd tmpdats: );
  set have;
  array dats {*} date: ;
  array tmpdats  {20} ;

  nd=n(of dats{*});
  if nd>0 then do;
    do d=1 to nd;
      tmpdats{d}=smallest(d,of dats{*});
    end;    
    call missing(of dats{*});

    dats{1}=tmpdats{1};
	d=1;
    if nd>1 then do t=2 to nd;
      if tmpdats{t}=tmpdats{t-1} then continue;
      d=d+1;
      dats{d}=tmpdats{t};
    end;
  end;
run;

The choice of 20 for the size of the tmpdats array is arbitrary. Just chose a number as least as large as the number of original data variables.

The "if tmpdats{t}=tmpdats{t-1} then continue" check for consecutive duplicates. If they are duplicate, then "CONTINUE" says to go to the next iteration of the do loop, thereby incrementing only t (indexing tmpdats), but not d (indexing the result array DATS). d is incremented only when non-duplicates are found.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

andreas_lds · Posted 07-16-2020 01:30 AM

Try:

proc transpose data=have out=transposed(drop=_name_);
	by name id;
	var date:;
run;

proc sort data=transposed(where=(not missing(col1))) out=sorted nodupkey;
	by name id col1;
run;

proc transpose data=sorted out=pre_want(drop=_name_) prefix=date;
	by name id;
	var col1;
run;

data want;
	set pre_want;
	
	array dates date:;	
	non_missing_dates = dim(dates) - nmiss(of dates[*]);
run;

Please note that there could be less than five date-variables in dataset "want", if not at least one name/id has five not missing and unique dates in dataset "have".

EDIT: Fixed missing creation of the variable containing the number of non-missing-dates.

mkeintz · Posted 07-16-2020 02:06 AM

You can declare an array to hold sorted values of the original dates, then traverse that array copying non-duplicate values back to the original data varables:

data have;
infile datalines dsd dlm=',' ;
input name :$3. id :$3. date1 :MMDDYY10. date2 :MMDDYY10. date3 :MMDDYY10. date4 :MMDDYY10. date5 :MMDDYY10.;
format date:  mmddyy10. ;
datalines;
abc,010,01/10/2017,03/20/2016,.,01/10/2017,.
def,023,.,.,04/20/2018,.,04/20/2018
pqr,125,08/10/2013,.,05/20/2012,.,.
xyz,200,01/12/2016,01/21/2016,05/10/2017,09/01/2018,03/02/2016
run;

data want (drop=d t nd tmpdats: );
  set have;
  array dats {*} date: ;
  array tmpdats  {20} ;

  nd=n(of dats{*});
  if nd>0 then do;
    do d=1 to nd;
      tmpdats{d}=smallest(d,of dats{*});
    end;    
    call missing(of dats{*});

    dats{1}=tmpdats{1};
	d=1;
    if nd>1 then do t=2 to nd;
      if tmpdats{t}=tmpdats{t-1} then continue;
      d=d+1;
      dats{d}=tmpdats{t};
    end;
  end;
run;

The choice of 20 for the size of the tmpdats array is arbitrary. Just chose a number as least as large as the number of original data variables.

The "if tmpdats{t}=tmpdats{t-1} then continue" check for consecutive duplicates. If they are duplicate, then "CONTINUE" says to go to the next iteration of the do loop, thereby incrementing only t (indexing tmpdats), but not d (indexing the result array DATS). d is incremented only when non-duplicates are found.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Barkat · Posted 07-21-2020 08:06 PM

@mkeintz:
I was trying to understand the purpose of each statements in this program. Could you share some documents/materials, so I can learn this type of programming, please.

Kurt_Bremser · Posted 07-22-2020 12:25 AM

Go here, and select "Data Step Programming".

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 07-16-2020 04:10 AM

proc transpose
  data=have
  out=want1 (
    drop=_name_
    rename=(col1=date)
    where=(date ne .)
  )
;
by name id;
var date:;
run;

proc sort data=want1 nodupkey;
by name id date;
run;

proc sql;
create table want2 as
  select
    name,
    id,
    count(*) as count
  from want1
  group by name, id
;
quit;

I did not re-transpose to wide on purpose; the "long" dataset layout is always easier to work with, and it avoids waste of storage for missing values.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Removing duplicate dates from observation, and count number of dates

Re: Removing duplicate dates from observation, and count number of dates

Re: Removing duplicate dates from observation, and count number of dates

Re: Removing duplicate dates from observation, and count number of dates

Re: Removing duplicate dates from observation, and count number of dates

Re: Removing duplicate dates from observation, and count number of dates

Re: Removing duplicate dates from observation, and count number of dates

Removing duplicate dates from observation, and count number of dates

Re: Removing duplicate dates from observation, and count number of dates

Re: Removing duplicate dates from observation, and count number of dates

Re: Removing duplicate dates from observation, and count number of dates

Re: Removing duplicate dates from observation, and count number of dates

Re: Removing duplicate dates from observation, and count number of dates

Re: Removing duplicate dates from observation, and count number of dates

Ready to join fellow brilliant minds for the SAS Hackathon?