I have a data set with 337 genotypes that were analyzed for disease resistance over an 11 year period. The first few lines looks like this:
However, not every genotype was tested in every year. I need to use only the genotypes that were tested in at least 5 of the 11 years. How can I create a dataset with only these genotypes? I can manually go and count them, then use the data step (I show below how I did this) to pick out the genotypes I want, but there's got to be a better way. I've been recommended to use PROC SQL but I can't work out how to tell it what I need.
Suggestions? Thank you in advance! (I'm using SAS 9.4)
Here's the data step I mentioned:
data Potato_Blight4;
set Potato_Blight3;
if Gen = "FRASER" then output;
if Gen="I.HARDY" then output;
if Gen="IWA" then output;
if Gen="KARAKA" then output;
if Gen="KIWITEA" then output;
if Gen="P.DELL" then output;
if Gen="R.BURBANK" then output;
if Gen="R.RASCAL" then output;
if Gen="RUA" then output;
if Gen="W.DELIGHT" then output;
run;
proc print data=Potato_Blight4; run;
And this is the proc sql I've tried but I know it's wrong:
proc sql;
create table Potato_Blight_5yrs as
select gen
from Potato_Blight2
where y in year >5
group by gen, year
having sum(value='')=0;
/* having count(*) > 4; */
quit;
Keep it simple:
proc sql;
create table want as
select *
from have
group by gen
having count(distinct year) >= 5;
quit;
Proc SQL YAY!
proc sql;
select * from have
where id in (select id from have group by id having count(distinct year) >=5);
quit;
Modified as you wish.
Patrick
Keep it simple:
proc sql;
create table want as
select *
from have
group by gen
having count(distinct year) >= 5;
quit;
@PGStats unfortunately the simpler code as written doesn't work. Can you rewrite using the data step here?
data have;
input obs_in year gen $ rep y;
cards;
7 1983 060.1 1 9
36 1983 060.1 2 9
46 1983 060.1 3 9
71 1983 060.1 4 8
81 1983 060.1 5 9
;
run;
proc sql;
create table want as
select *
from have
group by gen
having count(distinct(year)) >= 5;
quit;
Your data shows a single genotype measured on a single year (1983). So it hasn't been measured over 5 years or more and is thus not kept. Doesn't that agree with your request?
LOL I guess you're right!
Thank you so much!
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.