BookmarkSubscribeRSS Feed
michan22
Quartz | Level 8

Dear all,

 

I have a dataset that looks like

 

id         event             dob           res           status          date        ........

1          1 month        5/20/80     yes             .               5/30/17   ..........

1          2 month          .               no             .                  .

1          3 month          .                .                .                 .

1          3 month          .                .               1                 .

2          1 month        3/22/80     no              .                 .

2          2 month          .               .                 2               9/20/16

3          1 month         7/20/81     no             .                9/25/16

3          3 month           .                .                .                 .

3          3 month           .                .                1                .

3          6 month           .              yes              1              10/3/16

 

and so on...

 

I need to transpose this dataset to wide format, but most subjects have duplicate 'event' value, so my usual transpose code did not work. I did:

proc transpose data=long1 out=wide1 prefix=res;

by id;

id event;

var res;

run;

 

I went back to the original dataset and seems like one of the duplicate does not contain information in any of the variables except for 'id' and 'event'.

My question: is it possible to delete the duplicate event line that doesn't contain any information so I can use my transpose code??

Thank you!! 

11 REPLIES 11
Reeza
Super User

First, define your rules. How exactly are you defining the rows that need to be deleted? 

Second, sort the data

Look at PROC SORT options to remove duplicate values. 

michan22
Quartz | Level 8

I am aware of the proc sort option NODUP (compares all variables) and NODUPKEY (compares only the BY variable).

I want to delete the observation that has no value in the variables except for 'id' and 'event', in my example this would be the first line of the duplicate.

I thought about using the NODUPKEY option but that would delete the second duplicate correct?

Reeza
Super User

Yes and no. NODUPKEY keeps the first record. 

So double sort. Sort first so that the records you want to keep are first. Then sort again using the NODUPKEY.  

 

 

michan22
Quartz | Level 8

Thank you for the suggestion. Is there a way to sort the data so that the obeservations with missing information come second within the duplicate? I am still learning SAS... thanks!

Shmuel
Garnet | Level 18

Your question:


@michan22 wrote:

Thank you for the suggestion. Is there a way to sort the data so that the obeservations with missing information come second within the duplicate? I am still learning SAS... thanks!


Missing vallue is lower than any othe value, therefore while sorting by:

(1)  by ID EVENT - you cannot know which of the duplicates will be first

(2)  by ID EVENT varx - where varx is any selected variable - the missing vallue will come first 

(3)  by ID EVENT DESCENDING varx - in case that varx is missing but there may be other variables non missing

       therefore you cannot assure that the all missing be the second

 

to summarise - the answer to your question, in this case, is NO.

 

Unless you check all variables to missing and assign seq_no = 2 if positive, otherwise seq_no = 1.

Shmuel
Garnet | Level 18

If you want to delete rows where all variables are missing except id and event,

you can do:

 

proc transpose data=long1 
         (where=(dob ne . and status ne . and date ne . and res ne ' ' ...))
               out=wide1 prefix=res;
by id;
id event;
var res;
run;
 

 

 

 

 

michan22
Quartz | Level 8

Thank you for the suggestion. It's good to know that I can use the WHERE option in proc transpose like this. I have over 100 variables, it might take me a while to fill everything in!

 

Shmuel
Garnet | Level 18

I agree that when there are so many variables it will be a sizific work to type the where staement manually,

so it can be done programmatically:

 proc sql;
  create table vars as select name, type
  from dictionary.columns
  where libname='WORK' and memname='LONG1'
     and name not in('ID' 'EVENT');
quit;
data _NULL_;
 set vars end=eof;
   length whr $240;   /* you may need addapt length to needs */
   retain whr ;
   if _N_=1 then do;
      whr='where=(';
      if type='num' 
         then whr = catx(' ',whr,name,'ne .');
         else whr = catx(' ',whr,name,'ne " "');
   end; else do;
      if type='num' 
         then whr = catx(' ',whr,'or',name,'ne .');
         else whr = catx(' ',whr,'or',name,'ne " "');   
   end;
   if eof then do;
whr = trim(whr)||')';
call symput('whr',whr);
end; run;

proc transpose data=long1 (&whr) out=wide1 prefix=res; by id; id event; var res; run;
Patrick
Opal | Level 21

@michan22

Please provide in the future sample data via a SAS data step as done in below code so we can test our code without having to spend time converting your data.

 

Below two options.

- Option 1 will remove all rows where only ID and Event got values even if it's not a duplicate. It will keep all rows with values even if there are duplicates.

- Option 2 keeps one row per by group with preference for non-missing values.

data LONG1;
 infile datalines dlm='|' truncover;
 input id event $ dob:anydtdte. res $ status $ date:anydtdte. ;
 format dob date date9.;
 datalines;
1|1 month|5/20/80|yes|.|5/30/17|..........
1|2 month|.|no|.|.
1|3 month|.|.|.|.
1|3 month|.|.|1|.
2|1 month|3/22/80|no|.|.
2|2 month|.|.|2|9/20/16
3|1 month|7/20/81|no|.|9/25/16
3|3 month|.|.|.|.
3|3 month|.|.|1|.
3|6 month|.|yes|1|10/3/16
3|7 month
;
run;

/* option 1: Keeps all rows where at least one non missing value */
proc sql noprint;
  select 
    name,
    count(name)
      into 
        :var_list separated by ',',
        :var_cnt
    from dictionary.columns
      where libname='WORK' and memname='LONG1'
        and upcase(name) not in('ID' 'EVENT');
quit;

%put %nrbquote(&=var_list);
%put %nrbquote(&=var_cnt);

proc transpose data=long1 
  (where=(cmiss(&var_list) ne &var_cnt))
  out=wide1 prefix=res;
  by id;
  id event;
  var res;
run;


/* option 2: Keeps one row per by group; preference for non-missing values */
proc sort data=long1 out=inter;
  by id event _all_;
run;

data deduped;
  set inter;
  by id event;
  if last.event;
run;

proc transpose data=deduped 
  out=wide2 prefix=res;
  by id;
  id event;
  var res;
run;

 

PGStats
Opal | Level 21

If you know the total number of variables in your dataset, you don't have to list them. Use a SAS view to avoid creating a single use dataset:

 

data have;
input id event &$  dob $ res $ status date $;
datalines; 
1          1 month        5/20/80     yes             .               5/30/17
1          2 month          .               no             .                  .
1          3 month          .                .                .                 .
1          3 month          .                .               1                 .
2          1 month        3/22/80     no              .                 .
2          2 month          .               .                 2               9/20/16
3          1 month         7/20/81     no             .                9/25/16
3          3 month           .                .                .                 .
3          3 month           .                .                1                .
3          6 month           .              yes              1              10/3/16
;

data have2 / view=have2;
set have;
/* Total number of variables in have - 2 = 4 */
if cmiss(of _all_) = 4 and not missing(id) and not missing(event) then delete;
run;

proc transpose data=have2 out=want prefix=res;
by id;
id event;
var res;
run;
PG
michan22
Quartz | Level 8

Thank you all for the wonderful suggestions @Shmuel @Patrick @PGStatsI will try them out! Also apologies for the inconvenient data format.

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 2031 views
  • 5 likes
  • 5 in conversation