Hi, I am currently working with a dataset that isn't the most user friendly and I am stuck on this one issue.
My dataset looks like the following:
ID# 1stMutation 2ndMutation
2345 KRAS TP53
2345 EGFR KRAS
2345 NRAS TSC1
4455 TSC2 TP53
4455 EGFR KRAS
4455 KRAS ARID1A
4455 ATC KRAS
I'd like to find a way to find out how many patients have both a KRAS mutation in the 1stMutation and the 2ndMutation. I'd like a way where my results would show that 2 patients have matching KRAS mutations. Part 2 would be to then have results that show there are 3 total matching KRAS mutations.
Is that possible in SAS? Thank you so much for the help!
Once you subset the observations, the problem becomes easier. For example, assuming your data set is sorted by ID:
data want;
do until (last.ID);
set have;
by ID;
if mutation1='KRAS' then flag1=1;
if mutation2='KRAS' then flag2=1;
end;
do until (last.ID);
set have;
by ID;
if flag1=1 and flag2=1 and (mutation1='KRAS' or mutation2='KRAS') then output;
end;
drop flag1 flag2;
run;
That gives you the observations of interest, for further processing.
Optionally, you may wish to remove duplicate mutations for the same patient:
proc sort data=want NODUPKEY;
by ID mutation1 mutation2;
run;
Finally, get the answers:
proc sql data=want;
select count(distinct ID) from want;
quit;
proc freq data=want;
tables mutation1 * mutation2 / missing list;
run;
how does your final output look like.
What do you mean by final output? I'm new to SAS, so slowly trying to gain a grasp on it, but at the moment I have only been pulling up frequency tables to understand the data
would this help? @kiranv_ is right. We wanna know your result structure
data have;
input (ID IstMutation IIndMutation) ($);
datalines;
2345 KRAS TP53
2345 EGFR KRAS
2345 NRAS TSC1
4455 TSC2 TP53
4455 EGFR KRAS
4455 KRAS ARID1A
4455 ATC KRAS
;
data want;
set have;
if _n_=1 then do;
if 0 then set have;
declare hash h(dataset:'have', multidata:'y' );
h.defineKey('id','IIndMutation');
h.defineData(all:'y');
h.defineDone();
end;
set have;
by id;
if h.check(key:id, key:IstMutation)=0 then count+1;
if last.id then do;tot_count=count+1;output;end;
run;
Would it be possible for my results to just list the ID numbers that have matched KRAS mutations?
Result:
ID
2345
4455
Forget the hash, do a simple sql self join or a subquery and apply where condition equality operator. Tons of samples in google. Sorry, i am too lazy. My apologies
Questions about Part 2 ...
Are the counts limited to patients with KRAS in both Mutation 1 and Mutation 2, or should the counts be expanded to include patients that have just one KRAS instance?
If the exact same mutation occurs twice for the same patient, should it be counted once or twice?
You can easily get a report with the program below. I just want to make sure it's the right report (hence the questions):
proc freq data=have;
tables mutation1 * mutation2 / missing list;
where mutation1='KRAS' or mutation2='KRAS';
run;
I'd like the counts to be limited to only patients who have both KRAS mutations in both Mutation 1 and Mutation 2
I tried this, but I keep getting the following error:
Thank you for all the help!
proc sql;
select a.*,
from KRAS a, KRAS b
where a.mutation1=b.mutation2 and
a.mutation2=b.mutation1;
quit;
LOG:
remove comma
proc sql;
select a.*,
from KRAS a, KRAS b
and change it to
proc sql;
select a.*
from KRAS a, KRAS b
this proc sql link may be helpful to you
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146906.htm
Once you subset the observations, the problem becomes easier. For example, assuming your data set is sorted by ID:
data want;
do until (last.ID);
set have;
by ID;
if mutation1='KRAS' then flag1=1;
if mutation2='KRAS' then flag2=1;
end;
do until (last.ID);
set have;
by ID;
if flag1=1 and flag2=1 and (mutation1='KRAS' or mutation2='KRAS') then output;
end;
drop flag1 flag2;
run;
That gives you the observations of interest, for further processing.
Optionally, you may wish to remove duplicate mutations for the same patient:
proc sort data=want NODUPKEY;
by ID mutation1 mutation2;
run;
Finally, get the answers:
proc sql data=want;
select count(distinct ID) from want;
quit;
proc freq data=want;
tables mutation1 * mutation2 / missing list;
run;
Thank you! I will try and use the do-loop, for some reason the proc sql didn't work and gave me some of the duplicates matched up on the same line, but also got rid of some of the observations.
I will go ahead and play around with the do-loop and see if that works. Thank you for all the help!
The do-loop worked so well, thank you so much! I appreciate all the help!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.