DATA Step, Macro, Functions and more

Identify serial smaples in the dataset

Reply
Frequent Contributor
Frequent Contributor
Posts: 101

Identify serial smaples in the dataset

DOSE=1 at TIME=0 represents the first dosing event.  Subsequent dosing event is represented by DOSE=1.  I want to create a variable “SERIAL” which  identify serial samples from non-serial samples. Non serial samples will be coded as SAMPLE=1 and Serial samples: Number groups of serial samples sequentially with consecutive integers starting from 2 (i.e., SERIAL= 2, 3, ..).  Serial samples: a group of >=3 consecutive  samples (rows) associated with the same dose (pre-dose and all post-dose samples). Samples  (or rows) with the same “VISIT” value belong to the same serial sample group if they are associated with the same dose (including the pre-dose sample) or were taken within <=192 h (TIME) of the first dose in the serial sampling group.  SERIAL= .  if DOSE=1. In this calculations rows where C1  !=  .  & C2 = 1 & C3 equal to 1 or 3 should be ignored. I am struggling to implement this in SAS code. Any help is greatly appreciated. Thanks

 

IDC1C2C3TIMEDOSEVISITSERIAL
2.11011.
2.022011
2.042011
2.119611.
2.02100012
2.04100012
2.02232012
2.04232012
2.02258022
2.04258022
2.02267022
2.04267022
2412604031
2414604031
2412963041
2414963041
24121637051
24141637051
14120012
14140012
1.11011.
1.021012
14141012
1.024012
1.044012
1.028012
1.048012
1.0212012
1.0412012
1.0224012
1.0424012
1.0248012
1.0448012
1.1172112
1.0272012
1.0472012
1.02167012
1.04167012
1.1116911.
1.1124011.
1.02335021
1.04335021
1.1133712.
1.1143412.
1.02503031
1.04503031
1.1150313.
1412669041
1.04669041
1.1177314.
1.1188814.
1.1194014.
1.021006051
1.041006051
1.11103715.
1.11110815.
1.11120615.
1.11128215.
1.021342061
1.041342061
1.11137316.
1.11144116.
1.11151316.
14121679071
14141679071
1.11168017.
1.11175217.
1.11185017.
1.11192117.
14122014081
1.042014081
1.11201818.
1.11208918.
1.11218518.
1.11247718.
1.11254718.
1.11264318.
1.022686081
1.042686081
1.11271518.
1.11281118.
1.11290618.
1.11295118.
1.11304718.
1.11311818.
1.11321518.
1.11328618.
14123360081
1.043360081
1.11338218.
1.11557418.
1.11563918.
1.11573418.
1.11580818.
1.11590218.
1.11597718.
1.11607018.
1.11614318.
1.11623918.
1.11631118.
1.11641318.
1.11648518.
Trusted Advisor
Posts: 1,372

Re: Identify serial smaples in the dataset

In your post you mentioned:

1)  I want to create a variable “SERIAL”

2)  SERIAL= .  if DOSE=1.

I couldn't understand your other rules to calculate SERIAL.

As your posted data has already a colomn named SERIAL,

I tried to build a code that create a new variable named calc_sn,

that as much as I understand should be equal to the posted variable SERIAL.

 

If you check next code log, you will see that first 18 rows has calc_sn = serial.

Maybe you can complete the program by yoursef, if not -

what are the rules to add 1 to the calc_sn or to reset it to 1.

Does ID have any part in those rules ?

data have;
  input 
ID  C1  C2  C3  TIME    DOSE    VISIT   SERIAL;
datalines;
2   .   1   1   0   1   1   .
2   .   0   2   2   0   1   1
2   .   0   4   2   0   1   1
2   .   1   1   96  1   1   .
2   .   0   2   100 0   1   2
2   .   0   4   100 0   1   2
2   .   0   2   232 0   1   2
2   .   0   4   232 0   1   2
2   .   0   2   258 0   2   2
2   .   0   4   258 0   2   2
2   .   0   2   267 0   2   2
2   .   0   4   267 0   2   2
2   4   1   2   604 0   3   1
2   4   1   4   604 0   3   1
2   4   1   2   963 0   4   1
2   4   1   4   963 0   4   1
2   4   1   2   1637    0   5   1
2   4   1   4   1637    0   5   1
1   4   1   2   0   0   1   2
1   4   1   4   0   0   1   2
1   .   1   1   0   1   1   .
1   .   0   2   1   0   1   2
1   4   1   4   1   0   1   2
1   .   0   2   4   0   1   2
1   .   0   4   4   0   1   2
1   .   0   2   8   0   1   2
1   .   0   4   8   0   1   2
1   .   0   2   12  0   1   2
1   .   0   4   12  0   1   2
1   .   0   2   24  0   1   2
1   .   0   4   24  0   1   2
1   .   0   2   48  0   1   2
1   .   0   4   48  0   1   2
1   .   1   1   72  1   1   2
1   .   0   2   72  0   1   2
1   .   0   4   72  0   1   2
1   .   0   2   167 0   1   2
1   .   0   4   167 0   1   2
1   .   1   1   169 1   1   .
1   .   1   1   240 1   1   .
1   .   0   2   335 0   2   1
1   .   0   4   335 0   2   1
1   .   1   1   337 1   2   .
1   .   1   1   434 1   2   .
1   .   0   2   503 0   3   1
1   .   0   4   503 0   3   1
1   .   1   1   503 1   3   .
1   4   1   2   669 0   4   1
1   .   0   4   669 0   4   1
1   .   1   1   773 1   4   .
1   .   1   1   888 1   4   .
1   .   1   1   940 1   4   .
1   .   0   2   1006    0   5   1
1   .   0   4   1006    0   5   1
1   .   1   1   1037    1   5   .
1   .   1   1   1108    1   5   .
1   .   1   1   1206    1   5   .
1   .   1   1   1282    1   5   .
1   .   0   2   1342    0   6   1
1   .   0   4   1342    0   6   1
1   .   1   1   1373    1   6   .
1   .   1   1   1441    1   6   .
1   .   1   1   1513    1   6   .
1   4   1   2   1679    0   7   1
1   4   1   4   1679    0   7   1
1   .   1   1   1680    1   7   .
1   .   1   1   1752    1   7   .
1   .   1   1   1850    1   7   .
1   .   1   1   1921    1   7   .
1   4   1   2   2014    0   8   1
1   .   0   4   2014    0   8   1
1   .   1   1   2018    1   8   .
1   .   1   1   2089    1   8   .
1   .   1   1   2185    1   8   .
1   .   1   1   2477    1   8   .
1   .   1   1   2547    1   8   .
1   .   1   1   2643    1   8   .
1   .   0   2   2686    0   8   1
1   .   0   4   2686    0   8   1
1   .   1   1   2715    1   8   .
1   .   1   1   2811    1   8   .
1   .   1   1   2906    1   8   .
1   .   1   1   2951    1   8   .
1   .   1   1   3047    1   8   .
1   .   1   1   3118    1   8   .
1   .   1   1   3215    1   8   .
1   .   1   1   3286    1   8   .
1   4   1   2   3360    0   8   1
1   .   0   4   3360    0   8   1
1   .   1   1   3382    1   8   .
1   .   1   1   5574    1   8   .
1   .   1   1   5639    1   8   .
1   .   1   1   5734    1   8   .
1   .   1   1   5808    1   8   .
1   .   1   1   5902    1   8   .
1   .   1   1   5977    1   8   .
1   .   1   1   6070    1   8   .
1   .   1   1   6143    1   8   .
1   .   1   1   6239    1   8   .
1   .   1   1   6311    1   8   .
1   .   1   1   6413    1   8   .
1   .   1   1   6485    1   8   .
; run;

data want;
 set have;
  by notsorted id;
     retain sn 0
            calc_sn;  /* calculated serial number */
     if first.id then sn=0; 
     
     if dose=1 then do;
        calc_sn=.; output;
        sn+1;
        return;
     end;
     else calc_sn = sn;
     
     if c1 ne . and c2=1 and c3 not in (1, 3) then do;
        calc_sn=1; sn=1; 
     end; 
     output;
     
     /* check */
     if calc_sn ne serial then put _N_= serial= calc_sn=;
     drop sn;
run;

     
   
   
   
     

 

Frequent Contributor
Frequent Contributor
Posts: 101

Re: Identify serial smaples in the dataset

@Shmuel: Thank you. 

TIME, DOSE are important here. Several doses (>1) are given per subject and each dose is given in time. For instance, ID=2, received doses at time=0 and time=96. dose=1 and serial=. represent those dosing events. Serial samples are  >= 3 consecutive samples (DOSE=0) associated with same dose (including predose samples).  The samples at TIME=2 were set to SERIAL=1 as they are non-consecutive samples (<=3 consecutive samples), because at TIME=96 next dose is given.  Following dose at TIME=96, there are serial samples (>=3) or <= 192 hr from the first dose (from TIME=100 to TIME=267), and hence these serial samples are coded as 2. Samples from TIME=604 to TIME=1637 are > 192 hr from first dose and hence were set to SERIAL=1. 

 

For ID=1, First dose is given at TIME=0, second dose  at TIME=72 & third dose at TIME=169. Samples between TIME=1 and TIME=169 are >= 3 consecutive samples or within <=192 hr from the first dose  and hence are coded SERIAL=2.  DOSES given at later TIME points such as 169, 240, 337 etc.. do not have >=3 samples subsequent to them and hence are coded as SERIAL=1. In case if >=3  consecutive samples are present after TIME=169, those samples will be coded as SERIAL=3 and so on.

 

Hope this information helps. Please let me know if you have any questions.

 

Thank you.

 

 

Trusted Advisor
Posts: 1,372

Re: Identify serial smaples in the dataset

@ari, I have been too busy yesterday and today, so I want just to clarify some more points:

 

- Is ID the subject ID ?

- Do you mean by "sample" - a row/observation in the dataset ?

- If I want to sort the dataset by subject, what other variables I need

  beyond ID (as subject ) and time - as there are "dupliacate" lines, regarding ID Time only.

- AmI right, to count consecutive samples I need count rows with DOSE=0 between 

  instances of DOSE=1.

- Would you say: if there are less than 3 consecutive samples then its Serial=1

  otherwise Serial count from 2 on by 1 (2, 3, 4...) per subject (ID) ?

 

I hope I'll have time to continue tomorrow.

 

 

Frequent Contributor
Frequent Contributor
Posts: 101

Re: Identify serial smaples in the dataset

@Shmuel. Thanks for the response.

   Yes, Id is the subject ID and sample refers to row/observation.

Sort order is:  ID TIME  DOSE  C3 can be used to sort the records

Yes, consecutive samples  can be identified by   counting rows with DOSE=0 between  instances of DOSE=1

if there are less than 3 consecutive samples then its Serial=1 otherwise Serial count from 2 on by 1 (2, 3, 4...) per subject (ID) ? 

Yes, for instance, if there are 3 groups of 4 consecutive samples (or rows) are found per subject. 1st group wil be SERIAL=2, 2nd group will be SERIAL=3 and so on.

Thank you. 

 

Trusted Advisor
Posts: 1,372

Re: Identify serial smaples in the dataset

Please run next code. No changes to step 1 - creating HAVE dataset.

There are still few observations with serial NE cal_sn (calculated serial number);

Check dataset CHECK.

proc sort data=have out=haves; by id time dose c3; run;

data count_cs(rename=(k1=id k2=time k3=dose k4=c3));
 set haves end=eof;
  by id time dose c3;
     retain phase 0    /* 0 1st time only */
            k1 k2 k3 k4
            count_cs;  /* count consecutive samples */
     if phase=1 and dose=1 
     or last.id
     or eof then output;
     
     if dose=1 then do;
        phase=1;
        k1=id; k2=time; k3=dose; k4=c3; 
        count_cs=0;
     end;
     if dose=0 then count_cs+1;
     keep k1 k2 k3 k4 count_cs;
run;     
     

data want;
 merge haves 
       count_cs(rename=(count_cs=count));
  by id time dose c3;
     retain count_cs sn dose_time
            calc_sn;  /* sn = serial number */
     lag_time = lag(time);
     if first.id then sn=0;
     if dose=1 then do;
        dose_time=time;
        count_cs=count;
        calc_sn=.; output;
        if count_cs < 3
        then calc_sn=1;
        else do;
             if sn=0 then sn=2; else sn+1;
             calc_sn=sn;
        end;
     end;
     else do;
          if time - dose_time > 192 then calc_sn=1;
          output; 
     end;
     drop sn count;
run;

data check;  /* should contain 0 obs */
 set want;
     if serial ne calc_sn;
run;     
   

 

Ask a Question
Discussion stats
  • 5 replies
  • 151 views
  • 0 likes
  • 2 in conversation