Hi,
I want to share individuals in classes by respecting certain conditions. I have the following data:
ID: identifier;
G: group;
W: work
TD: date of the test;
OBS: the measure.
Each ID normally belongs to a single group (G) but can have 1, 2, 3, 4 or 5 Works (W) and in each Work, it has been verified 3, 4, ... or 10 times (number of TD):
ID G W TD OBS
b1 H1 1 28/02/2008 13.5
b1 H1 1 01/04/2008 17.2
b1 H1 1 17/05/2008 16.8
b1 H1 1 05/08/2008 12.5
b1 H1 1 22/09/2008 10.0
b1 H1 3 27/03/2009 22.3
b1 H1 3 23/02/2009 20.1
b1 H1 3 11/05/2009 18.3
b1 H1 3 19/06/2009 18.9
b1 H1 3 29/07/2009 16.9
b2 H1 1 02/10/2009 20.0
b2 H1 1 28/11/2009 22.2
b2 H1 1 20/12/2009 24.6
b2 H1 1 31/01/2010 23.0
b2 H1 1 27/02/2010 20.5
....
(this is just an example, you can't test the code (it requires more data)
What I am trying to do is to form classes of [Group * Year * Month of the test date] (GYM ).
In the new data I am looking for I only want to keep observations that respect the following conditions::
- Each GYM class must contain at least 6 observations.
- Each group G kept must have at least 4 different IDs.
- Each selected ID must be present with at least its first 3 observations for a given Work.
Please note that we may lose whole works for a given ID, or some IDs for a given group and all their controls, or even entire groups with all the IDs they contain.
data new; set have;
Y = year (TD);
M = month (TD);
GYM = compress (G || Y || M);
proc sort; by ID W;
run;
proc sql;
create table new2 as
select *, count (distinct(TD)) as N_TD
from new
group by GYM;
quit;
/ ****** Here, I noticed that the number of observations by GYM classes varies from 1 to hundreds. (I want to keep only the groups with at least 6 observations, by respecting the conditions above) ***/
Can someone help me formulate this code or tell me if there is another way?
Thank you for your expertise and insight.
I don't think SQL will work for this condition:
- Each selected ID must be present with at least its first 3 observations for a given Work.
Your best bet is to summarize the data, so calculate each of your statistics (# of Gym, # of distinct ID, first three obs) and then you can filter that data sets. The # of Gym and # of distinct ID can be calculated inline with SQL and then you can check the first three obs with a data step. I'm also assuming that the 'first three' is when sorted by date?
This won't work, because I don't know what a GYM is, or how you identify it in your data but it should give you the idea of how to set up your data.
proc sql;
create table summary_stats as
select *, count(G) as N_G, count(distinct ID) as unique_ID
from have
group by GYM;
quit;
data want;
set summary_stats;
by gym;
if first.gym then do; count=0; nmissing=0; end;
count+1;
if not missing(obs) then nmissing+1;
if count=3 and nmissing=0 then output ID;
run;
Yes, I understood what you did, but it did not work by adjusting it to my actual data. find attached a part of my file in csv format that can still help (dates are given in SAS format).
What is a 'GYM' and which variable represents a GYM?
Saying 'it did not work' doesn't really explain what didn't work. Please show your code and explain how your results don't match what you want - which I may be interpreting incorrectly because that was never specified either.
I'm sorry, I don't have time to download a file, write an import and then answer your question. If you post it as an example with full code I'll take another look, otherwise I'm relatively sure someone else will help out shortly.
a GYM is a new variable created by putting (Group || Test Year || Test Month), Test year (Y) and Test Month (M) were taken from TD variable by Y = year (TD) and M = month (TD)
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.