Solved: Re: Finding matching observations among a variable

kmardinian · Posted 01-08-2018 02:16 PM

Hi, I am currently working with a dataset that isn't the most user friendly and I am stuck on this one issue.

My dataset looks like the following:

ID# 1stMutation 2ndMutation

2345 KRAS TP53

2345 EGFR KRAS

2345 NRAS TSC1

4455 TSC2 TP53

4455 EGFR KRAS

4455 KRAS ARID1A

4455 ATC KRAS

I'd like to find a way to find out how many patients have both a KRAS mutation in the 1stMutation and the 2ndMutation. I'd like a way where my results would show that 2 patients have matching KRAS mutations. Part 2 would be to then have results that show there are 3 total matching KRAS mutations.

Is that possible in SAS? Thank you so much for the help!

Astounding · Posted 01-08-2018 03:04 PM

Once you subset the observations, the problem becomes easier. For example, assuming your data set is sorted by ID:

data want;

do until (last.ID);

set have;

by ID;

if mutation1='KRAS' then flag1=1;

if mutation2='KRAS' then flag2=1;

end;

do until (last.ID);

set have;

by ID;

if flag1=1 and flag2=1 and (mutation1='KRAS' or mutation2='KRAS') then output;

end;

drop flag1 flag2;

run;

That gives you the observations of interest, for further processing.

Optionally, you may wish to remove duplicate mutations for the same patient:

proc sort data=want NODUPKEY;

by ID mutation1 mutation2;

run;

Finally, get the answers:

proc sql data=want;

select count(distinct ID) from want;

quit;

proc freq data=want;

tables mutation1 * mutation2 / missing list;

run;

View solution in original post

kiranv_ · Posted 01-08-2018 02:21 PM

how does your final output look like.

kmardinian · Posted 01-08-2018 02:23 PM

What do you mean by final output? I'm new to SAS, so slowly trying to gain a grasp on it, but at the moment I have only been pulling up frequency tables to understand the data

novinosrin · Posted 01-08-2018 02:28 PM

would this help? @kiranv_ is right. We wanna know your result structure

data have;
input (ID IstMutation IIndMutation) ($);
datalines;
2345 KRAS TP53
2345 EGFR KRAS
2345 NRAS TSC1
4455 TSC2 TP53
4455 EGFR KRAS
4455 KRAS ARID1A
4455 ATC KRAS
;

data want;
set have;
if _n_=1 then do;
if 0 then set have;
declare hash h(dataset:'have', multidata:'y' );
h.defineKey('id','IIndMutation');
h.defineData(all:'y');
h.defineDone();
end;

set have;
by id;
if h.check(key:id, key:IstMutation)=0 then count+1;
if last.id then do;tot_count=count+1;output;end;
run;

kmardinian · Posted 01-08-2018 02:38 PM

Would it be possible for my results to just list the ID numbers that have matched KRAS mutations?

Result:

ID

2345
4455

novinosrin · Posted 01-08-2018 02:39 PM

Forget the hash, do a simple sql self join or a subquery and apply where condition equality operator. Tons of samples in google. Sorry, i am too lazy. My apologies

Astounding · Posted 01-08-2018 02:44 PM

Questions about Part 2 ...

Are the counts limited to patients with KRAS in both Mutation 1 and Mutation 2, or should the counts be expanded to include patients that have just one KRAS instance?

If the exact same mutation occurs twice for the same patient, should it be counted once or twice?

You can easily get a report with the program below. I just want to make sure it's the right report (hence the questions):

proc freq data=have;

tables mutation1 * mutation2 / missing list;

where mutation1='KRAS' or mutation2='KRAS';

run;

kmardinian · Posted 01-08-2018 02:55 PM

I'd like the counts to be limited to only patients who have both KRAS mutations in both Mutation 1 and Mutation 2

I tried this, but I keep getting the following error:

Thank you for all the help!

proc sql;
select a.*,
from KRAS a, KRAS b
where a.mutation1=b.mutation2 and
a.mutation2=b.mutation1;
quit;

LOG:

67

68 proc sql;

69 select a.*,

70 from KRAS a, KRAS b

____

79

ERROR 79-322: Expecting a FROM.

71 where a.mutation1=b.mutation2 and

72 a.mutation2=b.mutation1;

NOTE: PROC SQL set option NOEXEC and will continue to check the syntax of statements.

73 quit;

NOTE: The SAS System stopped processing this step because of errors.

NOTE: PROCEDURE SQL used (Total process time):

real time 0.00 seconds

cpu time 0.00 seconds

74

75

76 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;

89

kiranv_ · Posted 01-08-2018 03:03 PM

remove comma

proc sql;
select a.*,
from KRAS a, KRAS b

and change it to

proc sql;
select a.*
from KRAS a, KRAS b

this proc sql link may be helpful to you

http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146906.htm

Astounding · Posted 01-08-2018 03:04 PM

Once you subset the observations, the problem becomes easier. For example, assuming your data set is sorted by ID:

data want;

do until (last.ID);

set have;

by ID;

if mutation1='KRAS' then flag1=1;

if mutation2='KRAS' then flag2=1;

end;

do until (last.ID);

set have;

by ID;

if flag1=1 and flag2=1 and (mutation1='KRAS' or mutation2='KRAS') then output;

end;

drop flag1 flag2;

run;

That gives you the observations of interest, for further processing.

Optionally, you may wish to remove duplicate mutations for the same patient:

proc sort data=want NODUPKEY;

by ID mutation1 mutation2;

run;

Finally, get the answers:

proc sql data=want;

select count(distinct ID) from want;

quit;

proc freq data=want;

tables mutation1 * mutation2 / missing list;

run;

kmardinian · Posted 01-08-2018 03:19 PM

Thank you! I will try and use the do-loop, for some reason the proc sql didn't work and gave me some of the duplicates matched up on the same line, but also got rid of some of the observations.

I will go ahead and play around with the do-loop and see if that works. Thank you for all the help!

kmardinian · Posted 01-08-2018 03:28 PM

The do-loop worked so well, thank you so much! I appreciate all the help!

Registration is open