Dear all,
I have a dataset that looks like
id event dob res status date ........
1 1 month 5/20/80 yes . 5/30/17 ..........
1 2 month . no . .
1 3 month . . . .
1 3 month . . 1 .
2 1 month 3/22/80 no . .
2 2 month . . 2 9/20/16
3 1 month 7/20/81 no . 9/25/16
3 3 month . . . .
3 3 month . . 1 .
3 6 month . yes 1 10/3/16
and so on...
I need to transpose this dataset to wide format, but most subjects have duplicate 'event' value, so my usual transpose code did not work. I did:
proc transpose data=long1 out=wide1 prefix=res;
by id;
id event;
var res;
run;
I went back to the original dataset and seems like one of the duplicate does not contain information in any of the variables except for 'id' and 'event'.
My question: is it possible to delete the duplicate event line that doesn't contain any information so I can use my transpose code??
Thank you!!
First, define your rules. How exactly are you defining the rows that need to be deleted?
Second, sort the data
Look at PROC SORT options to remove duplicate values.
I am aware of the proc sort option NODUP (compares all variables) and NODUPKEY (compares only the BY variable).
I want to delete the observation that has no value in the variables except for 'id' and 'event', in my example this would be the first line of the duplicate.
I thought about using the NODUPKEY option but that would delete the second duplicate correct?
Yes and no. NODUPKEY keeps the first record.
So double sort. Sort first so that the records you want to keep are first. Then sort again using the NODUPKEY.
Thank you for the suggestion. Is there a way to sort the data so that the obeservations with missing information come second within the duplicate? I am still learning SAS... thanks!
Your question:
@michan22 wrote:
Thank you for the suggestion. Is there a way to sort the data so that the obeservations with missing information come second within the duplicate? I am still learning SAS... thanks!
Missing vallue is lower than any othe value, therefore while sorting by:
(1) by ID EVENT - you cannot know which of the duplicates will be first
(2) by ID EVENT varx - where varx is any selected variable - the missing vallue will come first
(3) by ID EVENT DESCENDING varx - in case that varx is missing but there may be other variables non missing
therefore you cannot assure that the all missing be the second
to summarise - the answer to your question, in this case, is NO.
Unless you check all variables to missing and assign seq_no = 2 if positive, otherwise seq_no = 1.
If you want to delete rows where all variables are missing except id and event,
you can do:
proc transpose data=long1
(where=(dob ne . and status ne . and date ne . and res ne ' ' ...))
out=wide1 prefix=res;
by id;
id event;
var res;
run;
Thank you for the suggestion. It's good to know that I can use the WHERE option in proc transpose like this. I have over 100 variables, it might take me a while to fill everything in!
I agree that when there are so many variables it will be a sizific work to type the where staement manually,
so it can be done programmatically:
proc sql; create table vars as select name, type from dictionary.columns where libname='WORK' and memname='LONG1' and name not in('ID' 'EVENT'); quit; data _NULL_; set vars end=eof; length whr $240; /* you may need addapt length to needs */ retain whr ; if _N_=1 then do; whr='where=('; if type='num' then whr = catx(' ',whr,name,'ne .'); else whr = catx(' ',whr,name,'ne " "'); end; else do; if type='num' then whr = catx(' ',whr,'or',name,'ne .'); else whr = catx(' ',whr,'or',name,'ne " "'); end; if eof then do;
whr = trim(whr)||')';
call symput('whr',whr);
end; run;proc transpose data=long1 (&whr) out=wide1 prefix=res; by id; id event; var res; run;
Please provide in the future sample data via a SAS data step as done in below code so we can test our code without having to spend time converting your data.
Below two options.
- Option 1 will remove all rows where only ID and Event got values even if it's not a duplicate. It will keep all rows with values even if there are duplicates.
- Option 2 keeps one row per by group with preference for non-missing values.
data LONG1;
infile datalines dlm='|' truncover;
input id event $ dob:anydtdte. res $ status $ date:anydtdte. ;
format dob date date9.;
datalines;
1|1 month|5/20/80|yes|.|5/30/17|..........
1|2 month|.|no|.|.
1|3 month|.|.|.|.
1|3 month|.|.|1|.
2|1 month|3/22/80|no|.|.
2|2 month|.|.|2|9/20/16
3|1 month|7/20/81|no|.|9/25/16
3|3 month|.|.|.|.
3|3 month|.|.|1|.
3|6 month|.|yes|1|10/3/16
3|7 month
;
run;
/* option 1: Keeps all rows where at least one non missing value */
proc sql noprint;
select
name,
count(name)
into
:var_list separated by ',',
:var_cnt
from dictionary.columns
where libname='WORK' and memname='LONG1'
and upcase(name) not in('ID' 'EVENT');
quit;
%put %nrbquote(&=var_list);
%put %nrbquote(&=var_cnt);
proc transpose data=long1
(where=(cmiss(&var_list) ne &var_cnt))
out=wide1 prefix=res;
by id;
id event;
var res;
run;
/* option 2: Keeps one row per by group; preference for non-missing values */
proc sort data=long1 out=inter;
by id event _all_;
run;
data deduped;
set inter;
by id event;
if last.event;
run;
proc transpose data=deduped
out=wide2 prefix=res;
by id;
id event;
var res;
run;
If you know the total number of variables in your dataset, you don't have to list them. Use a SAS view to avoid creating a single use dataset:
data have;
input id event &$ dob $ res $ status date $;
datalines;
1 1 month 5/20/80 yes . 5/30/17
1 2 month . no . .
1 3 month . . . .
1 3 month . . 1 .
2 1 month 3/22/80 no . .
2 2 month . . 2 9/20/16
3 1 month 7/20/81 no . 9/25/16
3 3 month . . . .
3 3 month . . 1 .
3 6 month . yes 1 10/3/16
;
data have2 / view=have2;
set have;
/* Total number of variables in have - 2 = 4 */
if cmiss(of _all_) = 4 and not missing(id) and not missing(event) then delete;
run;
proc transpose data=have2 out=want prefix=res;
by id;
id event;
var res;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.