I have a data set with 337 genotypes that were analyzed for disease resistance over an 11 year period. The first few lines looks like this:
However, not every genotype was tested in every year. I need to use only the genotypes that were tested in at least 5 of the 11 years. How can I create a dataset with only these genotypes? I can manually go and count them, then use the data step (I show below how I did this) to pick out the genotypes I want, but there's got to be a better way. I've been recommended to use PROC SQL but I can't work out how to tell it what I need.
Suggestions? Thank you in advance! (I'm using SAS 9.4)
Here's the data step I mentioned:
data Potato_Blight4;
set Potato_Blight3;
if Gen = "FRASER" then output;
if Gen="I.HARDY" then output;
if Gen="IWA" then output;
if Gen="KARAKA" then output;
if Gen="KIWITEA" then output;
if Gen="P.DELL" then output;
if Gen="R.BURBANK" then output;
if Gen="R.RASCAL" then output;
if Gen="RUA" then output;
if Gen="W.DELIGHT" then output;
run;
proc print data=Potato_Blight4; run;
And this is the proc sql I've tried but I know it's wrong:
proc sql;
create table Potato_Blight_5yrs as
select gen
from Potato_Blight2
where y in year >5
group by gen, year
having sum(value='')=0;
/* having count(*) > 4; */
quit;
Keep it simple:
proc sql;
create table want as
select *
from have
group by gen
having count(distinct year) >= 5;
quit;
Proc SQL YAY!
proc sql;
select * from have
where id in (select id from have group by id having count(distinct year) >=5);
quit;
Modified as you wish.
Patrick
Keep it simple:
proc sql;
create table want as
select *
from have
group by gen
having count(distinct year) >= 5;
quit;
@PGStats unfortunately the simpler code as written doesn't work. Can you rewrite using the data step here?
data have;
input obs_in year gen $ rep y;
cards;
7 1983 060.1 1 9
36 1983 060.1 2 9
46 1983 060.1 3 9
71 1983 060.1 4 8
81 1983 060.1 5 9
;
run;
proc sql;
create table want as
select *
from have
group by gen
having count(distinct(year)) >= 5;
quit;
Your data shows a single genotype measured on a single year (1983). So it hasn't been measured over 5 years or more and is thus not kept. Doesn't that agree with your request?
LOL I guess you're right!
Thank you so much!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.