Re: Create classes under some conditions

soumri · Posted 08-31-2018 06:59 AM

Hi,
I want to share individuals in classes by respecting certain conditions. I have the following data:
ID: identifier;
G: group;
W: work
TD: date of the test;
OBS: the measure.
Each ID normally belongs to a single group (G) but can have 1, 2, 3, 4 or 5 Works (W) and in each Work, it has been verified 3, 4, ... or 10 times (number of TD):

ID G W TD OBS

b1 H1 1 28/02/2008 13.5

b1 H1 1 01/04/2008 17.2

b1 H1 1 17/05/2008 16.8

b1 H1 1 05/08/2008 12.5

b1 H1 1 22/09/2008 10.0

b1 H1 3 27/03/2009 22.3

b1 H1 3 23/02/2009 20.1

b1 H1 3 11/05/2009 18.3

b1 H1 3 19/06/2009 18.9

b1 H1 3 29/07/2009 16.9

b2 H1 1 02/10/2009 20.0

b2 H1 1 28/11/2009 22.2

b2 H1 1 20/12/2009 24.6

b2 H1 1 31/01/2010 23.0

b2 H1 1 27/02/2010 20.5

....

(this is just an example, you can't test the code (it requires more data)
What I am trying to do is to form classes of [Group * Year * Month of the test date] (GYM ).
In the new data I am looking for I only want to keep observations that respect the following conditions::
- Each GYM class must contain at least 6 observations.
- Each group G kept must have at least 4 different IDs.
- Each selected ID must be present with at least its first 3 observations for a given Work.

Please note that we may lose whole works for a given ID, or some IDs for a given group and all their controls, or even entire groups with all the IDs they contain.

data new; set have;
Y = year (TD);
M = month (TD);
GYM = compress (G || Y || M);
proc sort; by ID W;
run;

proc sql; 
create table new2 as 
select *, count (distinct(TD)) as N_TD 
from new 
group by GYM; 
quit;

/ ****** Here, I noticed that the number of observations by GYM classes varies from 1 to hundreds. (I want to keep only the groups with at least 6 observations, by respecting the conditions above) ***/

Can someone help me formulate this code or tell me if there is another way?
Thank you for your expertise and insight.

Reeza · Posted 08-31-2018 10:16 AM

I don't think SQL will work for this condition:

- Each selected ID must be present with at least its first 3 observations for a given Work.

Your best bet is to summarize the data, so calculate each of your statistics (# of Gym, # of distinct ID, first three obs) and then you can filter that data sets. The # of Gym and # of distinct ID can be calculated inline with SQL and then you can check the first three obs with a data step. I'm also assuming that the 'first three' is when sorted by date?

This won't work, because I don't know what a GYM is, or how you identify it in your data but it should give you the idea of how to set up your data.

proc sql;
create table summary_stats as
select *, count(G) as N_G, count(distinct ID) as unique_ID
from have
group by GYM;
quit;

data want;
set summary_stats;
by gym;
if first.gym then do; count=0; nmissing=0; end;
count+1;
if not missing(obs) then nmissing+1;
if count=3 and nmissing=0 then output ID;
run;

soumri · Posted 08-31-2018 11:15 AM

Yes, I understood what you did, but it did not work by adjusting it to my actual data. find attached a part of my file in csv format that can still help (dates are given in SAS format).

Reeza · Posted 08-31-2018 11:19 AM

What is a 'GYM' and which variable represents a GYM?

Saying 'it did not work' doesn't really explain what didn't work. Please show your code and explain how your results don't match what you want - which I may be interpreting incorrectly because that was never specified either.

I'm sorry, I don't have time to download a file, write an import and then answer your question. If you post it as an example with full code I'll take another look, otherwise I'm relatively sure someone else will help out shortly.

soumri · Posted 08-31-2018 11:30 AM

a GYM is a new variable created by putting (Group || Test Year || Test Month), Test year (Y) and Test Month (M) were taken from TD variable by Y = year (TD) and M = month (TD)

Create classes under some conditions