Deleting one of the duplicate

Reply
Contributor
Posts: 59

Deleting one of the duplicate

Dear all,

 

I have a dataset that looks like

 

id         event             dob           res           status          date        ........

1          1 month        5/20/80     yes             .               5/30/17   ..........

1          2 month          .               no             .                  .

1          3 month          .                .                .                 .

1          3 month          .                .               1                 .

2          1 month        3/22/80     no              .                 .

2          2 month          .               .                 2               9/20/16

3          1 month         7/20/81     no             .                9/25/16

3          3 month           .                .                .                 .

3          3 month           .                .                1                .

3          6 month           .              yes              1              10/3/16

 

and so on...

 

I need to transpose this dataset to wide format, but most subjects have duplicate 'event' value, so my usual transpose code did not work. I did:

proc transpose data=long1 out=wide1 prefix=res;

by id;

id event;

var res;

run;

 

I went back to the original dataset and seems like one of the duplicate does not contain information in any of the variables except for 'id' and 'event'.

My question: is it possible to delete the duplicate event line that doesn't contain any information so I can use my transpose code??

Thank you!! 

Super User
Posts: 23,224

Re: Deleting one of the duplicate

First, define your rules. How exactly are you defining the rows that need to be deleted? 

Second, sort the data

Look at PROC SORT options to remove duplicate values. 

Contributor
Posts: 59

Re: Deleting one of the duplicate

I am aware of the proc sort option NODUP (compares all variables) and NODUPKEY (compares only the BY variable).

I want to delete the observation that has no value in the variables except for 'id' and 'event', in my example this would be the first line of the duplicate.

I thought about using the NODUPKEY option but that would delete the second duplicate correct?

Super User
Posts: 23,224

Re: Deleting one of the duplicate

Yes and no. NODUPKEY keeps the first record. 

So double sort. Sort first so that the records you want to keep are first. Then sort again using the NODUPKEY.  

 

 

Contributor
Posts: 59

Re: Deleting one of the duplicate

Thank you for the suggestion. Is there a way to sort the data so that the obeservations with missing information come second within the duplicate? I am still learning SAS... thanks!

Trusted Advisor
Posts: 1,831

Re: Deleting one of the duplicate

Your question:


michan22 wrote:

Thank you for the suggestion. Is there a way to sort the data so that the obeservations with missing information come second within the duplicate? I am still learning SAS... thanks!


Missing vallue is lower than any othe value, therefore while sorting by:

(1)  by ID EVENT - you cannot know which of the duplicates will be first

(2)  by ID EVENT varx - where varx is any selected variable - the missing vallue will come first 

(3)  by ID EVENT DESCENDING varx - in case that varx is missing but there may be other variables non missing

       therefore you cannot assure that the all missing be the second

 

to summarise - the answer to your question, in this case, is NO.

 

Unless you check all variables to missing and assign seq_no = 2 if positive, otherwise seq_no = 1.

Trusted Advisor
Posts: 1,831

Re: Deleting one of the duplicate

If you want to delete rows where all variables are missing except id and event,

you can do:

 

proc transpose data=long1 
         (where=(dob ne . and status ne . and date ne . and res ne ' ' ...))
               out=wide1 prefix=res;
by id;
id event;
var res;
run;
 

 

 

 

 

Contributor
Posts: 59

Re: Deleting one of the duplicate

Thank you for the suggestion. It's good to know that I can use the WHERE option in proc transpose like this. I have over 100 variables, it might take me a while to fill everything in!

 

Trusted Advisor
Posts: 1,831

Re: Deleting one of the duplicate

[ Edited ]

I agree that when there are so many variables it will be a sizific work to type the where staement manually,

so it can be done programmatically:

 proc sql;
  create table vars as select name, type
  from dictionary.columns
  where libname='WORK' and memname='LONG1'
     and name not in('ID' 'EVENT');
quit;
data _NULL_;
 set vars end=eof;
   length whr $240;   /* you may need addapt length to needs */
   retain whr ;
   if _N_=1 then do;
      whr='where=(';
      if type='num' 
         then whr = catx(' ',whr,name,'ne .');
         else whr = catx(' ',whr,name,'ne " "');
   end; else do;
      if type='num' 
         then whr = catx(' ',whr,'or',name,'ne .');
         else whr = catx(' ',whr,'or',name,'ne " "');   
   end;
   if eof then do;
whr = trim(whr)||')';
call symput('whr',whr);
end; run;

proc transpose data=long1 (&whr) out=wide1 prefix=res; by id; id event; var res; run;
Respected Advisor
Posts: 4,665

Re: Deleting one of the duplicate

[ Edited ]

@michan22

Please provide in the future sample data via a SAS data step as done in below code so we can test our code without having to spend time converting your data.

 

Below two options.

- Option 1 will remove all rows where only ID and Event got values even if it's not a duplicate. It will keep all rows with values even if there are duplicates.

- Option 2 keeps one row per by group with preference for non-missing values.

data LONG1;
 infile datalines dlm='|' truncover;
 input id event $ dob:anydtdte. res $ status $ date:anydtdte. ;
 format dob date date9.;
 datalines;
1|1 month|5/20/80|yes|.|5/30/17|..........
1|2 month|.|no|.|.
1|3 month|.|.|.|.
1|3 month|.|.|1|.
2|1 month|3/22/80|no|.|.
2|2 month|.|.|2|9/20/16
3|1 month|7/20/81|no|.|9/25/16
3|3 month|.|.|.|.
3|3 month|.|.|1|.
3|6 month|.|yes|1|10/3/16
3|7 month
;
run;

/* option 1: Keeps all rows where at least one non missing value */
proc sql noprint;
  select 
    name,
    count(name)
      into 
        :var_list separated by ',',
        :var_cnt
    from dictionary.columns
      where libname='WORK' and memname='LONG1'
        and upcase(name) not in('ID' 'EVENT');
quit;

%put %nrbquote(&=var_list);
%put %nrbquote(&=var_cnt);

proc transpose data=long1 
  (where=(cmiss(&var_list) ne &var_cnt))
  out=wide1 prefix=res;
  by id;
  id event;
  var res;
run;


/* option 2: Keeps one row per by group; preference for non-missing values */
proc sort data=long1 out=inter;
  by id event _all_;
run;

data deduped;
  set inter;
  by id event;
  if last.event;
run;

proc transpose data=deduped 
  out=wide2 prefix=res;
  by id;
  id event;
  var res;
run;

 

Esteemed Advisor
Posts: 5,474

Re: Deleting one of the duplicate

If you know the total number of variables in your dataset, you don't have to list them. Use a SAS view to avoid creating a single use dataset:

 

data have;
input id event &$  dob $ res $ status date $;
datalines; 
1          1 month        5/20/80     yes             .               5/30/17
1          2 month          .               no             .                  .
1          3 month          .                .                .                 .
1          3 month          .                .               1                 .
2          1 month        3/22/80     no              .                 .
2          2 month          .               .                 2               9/20/16
3          1 month         7/20/81     no             .                9/25/16
3          3 month           .                .                .                 .
3          3 month           .                .                1                .
3          6 month           .              yes              1              10/3/16
;

data have2 / view=have2;
set have;
/* Total number of variables in have - 2 = 4 */
if cmiss(of _all_) = 4 and not missing(id) and not missing(event) then delete;
run;

proc transpose data=have2 out=want prefix=res;
by id;
id event;
var res;
run;
PG
Contributor
Posts: 59

Re: Deleting one of the duplicate

Thank you all for the wonderful suggestions @Shmuel @Patrick @PGStatsI will try them out! Also apologies for the inconvenient data format.

 

Ask a Question
Discussion stats
  • 11 replies
  • 262 views
  • 5 likes
  • 5 in conversation