Re: Deleting one of the duplicate

michan22 · Posted 08-18-2017 05:43 PM

Dear all,

I have a dataset that looks like

id event dob res status date ........

1 1 month 5/20/80 yes . 5/30/17 ..........

1 2 month . no . .

1 3 month . . . .

1 3 month . . 1 .

2 1 month 3/22/80 no . .

2 2 month . . 2 9/20/16

3 1 month 7/20/81 no . 9/25/16

3 3 month . . . .

3 3 month . . 1 .

3 6 month . yes 1 10/3/16

and so on...

I need to transpose this dataset to wide format, but most subjects have duplicate 'event' value, so my usual transpose code did not work. I did:

proc transpose data=long1 out=wide1 prefix=res;

by id;

id event;

var res;

run;

I went back to the original dataset and seems like one of the duplicate does not contain information in any of the variables except for 'id' and 'event'.

My question: is it possible to delete the duplicate event line that doesn't contain any information so I can use my transpose code??

Thank you!!

Reeza · Posted 08-18-2017 06:14 PM

First, define your rules. How exactly are you defining the rows that need to be deleted?

Second, sort the data

Look at PROC SORT options to remove duplicate values.

michan22 · Posted 08-18-2017 06:34 PM

I am aware of the proc sort option NODUP (compares all variables) and NODUPKEY (compares only the BY variable).

I want to delete the observation that has no value in the variables except for 'id' and 'event', in my example this would be the first line of the duplicate.

I thought about using the NODUPKEY option but that would delete the second duplicate correct?

Reeza · Posted 08-18-2017 06:46 PM

Yes and no. NODUPKEY keeps the first record.

So double sort. Sort first so that the records you want to keep are first. Then sort again using the NODUPKEY.

michan22 · Posted 08-19-2017 03:05 PM

Thank you for the suggestion. Is there a way to sort the data so that the obeservations with missing information come second within the duplicate? I am still learning SAS... thanks!

Shmuel · Posted 08-19-2017 04:00 PM

Your question:

@michan22 wrote:

Thank you for the suggestion. Is there a way to sort the data so that the obeservations with missing information come second within the duplicate? I am still learning SAS... thanks!

Missing vallue is lower than any othe value, therefore while sorting by:

(1) by ID EVENT - you cannot know which of the duplicates will be first

(2) by ID EVENT varx - where varx is any selected variable - the missing vallue will come first

(3) by ID EVENT DESCENDING varx - in case that varx is missing but there may be other variables non missing

therefore you cannot assure that the all missing be the second

to summarise - the answer to your question, in this case, is NO.

Unless you check all variables to missing and assign seq_no = 2 if positive, otherwise seq_no = 1.

Shmuel · Posted 08-18-2017 08:35 PM

If you want to delete rows where all variables are missing except id and event,

you can do:

proc transpose data=long1 
         (where=(dob ne . and status ne . and date ne . and res ne ' ' ...))
               out=wide1 prefix=res;
by id;
id event;
var res;
run;

michan22 · Posted 08-19-2017 03:02 PM

Thank you for the suggestion. It's good to know that I can use the WHERE option in proc transpose like this. I have over 100 variables, it might take me a while to fill everything in!

Shmuel · Posted 08-19-2017 03:44 PM

I agree that when there are so many variables it will be a sizific work to type the where staement manually,

so it can be done programmatically:

 proc sql;
  create table vars as select name, type
  from dictionary.columns
  where libname='WORK' and memname='LONG1'
     and name not in('ID' 'EVENT');
quit;
data _NULL_;
 set vars end=eof;
   length whr $240;   /* you may need addapt length to needs */
   retain whr ;
   if _N_=1 then do;
      whr='where=(';
      if type='num' 
         then whr = catx(' ',whr,name,'ne .');
         else whr = catx(' ',whr,name,'ne " "');
   end; else do;
      if type='num' 
         then whr = catx(' ',whr,'or',name,'ne .');
         else whr = catx(' ',whr,'or',name,'ne " "');   
   end;
   if eof then do;
      whr = trim(whr)||')';
      call symput('whr',whr);
   end;
run;

proc transpose data=long1 (&whr)
               out=wide1 prefix=res;
  by id;
  id event;
  var res;
run;

Patrick · Posted 08-19-2017 09:32 PM

@michan22

Please provide in the future sample data via a SAS data step as done in below code so we can test our code without having to spend time converting your data.

Below two options.

- Option 1 will remove all rows where only ID and Event got values even if it's not a duplicate. It will keep all rows with values even if there are duplicates.

- Option 2 keeps one row per by group with preference for non-missing values.

data LONG1;
 infile datalines dlm='|' truncover;
 input id event $ dob:anydtdte. res $ status $ date:anydtdte. ;
 format dob date date9.;
 datalines;
1|1 month|5/20/80|yes|.|5/30/17|..........
1|2 month|.|no|.|.
1|3 month|.|.|.|.
1|3 month|.|.|1|.
2|1 month|3/22/80|no|.|.
2|2 month|.|.|2|9/20/16
3|1 month|7/20/81|no|.|9/25/16
3|3 month|.|.|.|.
3|3 month|.|.|1|.
3|6 month|.|yes|1|10/3/16
3|7 month
;
run;

/* option 1: Keeps all rows where at least one non missing value */
proc sql noprint;
  select 
    name,
    count(name)
      into 
        :var_list separated by ',',
        :var_cnt
    from dictionary.columns
      where libname='WORK' and memname='LONG1'
        and upcase(name) not in('ID' 'EVENT');
quit;

%put %nrbquote(&=var_list);
%put %nrbquote(&=var_cnt);

proc transpose data=long1 
  (where=(cmiss(&var_list) ne &var_cnt))
  out=wide1 prefix=res;
  by id;
  id event;
  var res;
run;


/* option 2: Keeps one row per by group; preference for non-missing values */
proc sort data=long1 out=inter;
  by id event _all_;
run;

data deduped;
  set inter;
  by id event;
  if last.event;
run;

proc transpose data=deduped 
  out=wide2 prefix=res;
  by id;
  id event;
  var res;
run;

PGStats · Posted 08-21-2017 12:41 AM

If you know the total number of variables in your dataset, you don't have to list them. Use a SAS view to avoid creating a single use dataset:

data have;
input id event &$  dob $ res $ status date $;
datalines; 
1          1 month        5/20/80     yes             .               5/30/17
1          2 month          .               no             .                  .
1          3 month          .                .                .                 .
1          3 month          .                .               1                 .
2          1 month        3/22/80     no              .                 .
2          2 month          .               .                 2               9/20/16
3          1 month         7/20/81     no             .                9/25/16
3          3 month           .                .                .                 .
3          3 month           .                .                1                .
3          6 month           .              yes              1              10/3/16
;

data have2 / view=have2;
set have;
/* Total number of variables in have - 2 = 4 */
if cmiss(of _all_) = 4 and not missing(id) and not missing(event) then delete;
run;

proc transpose data=have2 out=want prefix=res;
by id;
id event;
var res;
run;

PG

michan22 · Posted 08-21-2017 10:07 AM

Thank you all for the wonderful suggestions @Shmuel @Patrick @PGStats, I will try them out! Also apologies for the inconvenient data format.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away