Re: Identify serial smaples in the dataset

ari · Posted 02-04-2017 09:32 AM

DOSE=1 at TIME=0 represents the first dosing event. Subsequent dosing event is represented by DOSE=1. I want to create a variable “SERIAL” which identify serial samples from non-serial samples. Non serial samples will be coded as SAMPLE=1 and Serial samples: Number groups of serial samples sequentially with consecutive integers starting from 2 (i.e., SERIAL= 2, 3, ..). Serial samples: a group of >=3 consecutive samples (rows) associated with the same dose (pre-dose and all post-dose samples). Samples (or rows) with the same “VISIT” value belong to the same serial sample group if they are associated with the same dose (including the pre-dose sample) or were taken within <=192 h (TIME) of the first dose in the serial sampling group. SERIAL= . if DOSE=1. In this calculations rows where C1 != . & C2 = 1 & C3 equal to 1 or 3 should be ignored. I am struggling to implement this in SAS code. Any help is greatly appreciated. Thanks

ID	C1	C2	C3	TIME	DOSE	VISIT	SERIAL
2	.	1	1	0	1	1	.
2	.	0	2	2	0	1	1
2	.	0	4	2	0	1	1
2	.	1	1	96	1	1	.
2	.	0	2	100	0	1	2
2	.	0	4	100	0	1	2
2	.	0	2	232	0	1	2
2	.	0	4	232	0	1	2
2	.	0	2	258	0	2	2
2	.	0	4	258	0	2	2
2	.	0	2	267	0	2	2
2	.	0	4	267	0	2	2
2	4	1	2	604	0	3	1
2	4	1	4	604	0	3	1
2	4	1	2	963	0	4	1
2	4	1	4	963	0	4	1
2	4	1	2	1637	0	5	1
2	4	1	4	1637	0	5	1
1	4	1	2	0	0	1	2
1	4	1	4	0	0	1	2
1	.	1	1	0	1	1	.
1	.	0	2	1	0	1	2
1	4	1	4	1	0	1	2
1	.	0	2	4	0	1	2
1	.	0	4	4	0	1	2
1	.	0	2	8	0	1	2
1	.	0	4	8	0	1	2
1	.	0	2	12	0	1	2
1	.	0	4	12	0	1	2
1	.	0	2	24	0	1	2
1	.	0	4	24	0	1	2
1	.	0	2	48	0	1	2
1	.	0	4	48	0	1	2
1	.	1	1	72	1	1	2
1	.	0	2	72	0	1	2
1	.	0	4	72	0	1	2
1	.	0	2	167	0	1	2
1	.	0	4	167	0	1	2
1	.	1	1	169	1	1	.
1	.	1	1	240	1	1	.
1	.	0	2	335	0	2	1
1	.	0	4	335	0	2	1
1	.	1	1	337	1	2	.
1	.	1	1	434	1	2	.
1	.	0	2	503	0	3	1
1	.	0	4	503	0	3	1
1	.	1	1	503	1	3	.
1	4	1	2	669	0	4	1
1	.	0	4	669	0	4	1
1	.	1	1	773	1	4	.
1	.	1	1	888	1	4	.
1	.	1	1	940	1	4	.
1	.	0	2	1006	0	5	1
1	.	0	4	1006	0	5	1
1	.	1	1	1037	1	5	.
1	.	1	1	1108	1	5	.
1	.	1	1	1206	1	5	.
1	.	1	1	1282	1	5	.
1	.	0	2	1342	0	6	1
1	.	0	4	1342	0	6	1
1	.	1	1	1373	1	6	.
1	.	1	1	1441	1	6	.
1	.	1	1	1513	1	6	.
1	4	1	2	1679	0	7	1
1	4	1	4	1679	0	7	1
1	.	1	1	1680	1	7	.
1	.	1	1	1752	1	7	.
1	.	1	1	1850	1	7	.
1	.	1	1	1921	1	7	.
1	4	1	2	2014	0	8	1
1	.	0	4	2014	0	8	1
1	.	1	1	2018	1	8	.
1	.	1	1	2089	1	8	.
1	.	1	1	2185	1	8	.
1	.	1	1	2477	1	8	.
1	.	1	1	2547	1	8	.
1	.	1	1	2643	1	8	.
1	.	0	2	2686	0	8	1
1	.	0	4	2686	0	8	1
1	.	1	1	2715	1	8	.
1	.	1	1	2811	1	8	.
1	.	1	1	2906	1	8	.
1	.	1	1	2951	1	8	.
1	.	1	1	3047	1	8	.
1	.	1	1	3118	1	8	.
1	.	1	1	3215	1	8	.
1	.	1	1	3286	1	8	.
1	4	1	2	3360	0	8	1
1	.	0	4	3360	0	8	1
1	.	1	1	3382	1	8	.
1	.	1	1	5574	1	8	.
1	.	1	1	5639	1	8	.
1	.	1	1	5734	1	8	.
1	.	1	1	5808	1	8	.
1	.	1	1	5902	1	8	.
1	.	1	1	5977	1	8	.
1	.	1	1	6070	1	8	.
1	.	1	1	6143	1	8	.
1	.	1	1	6239	1	8	.
1	.	1	1	6311	1	8	.
1	.	1	1	6413	1	8	.
1	.	1	1	6485	1	8	.

Shmuel · Posted 02-04-2017 04:44 PM

In your post you mentioned:

1) I want to create a variable “SERIAL”

2) SERIAL= . if DOSE=1.

I couldn't understand your other rules to calculate SERIAL.

As your posted data has already a colomn named SERIAL,

I tried to build a code that create a new variable named calc_sn,

that as much as I understand should be equal to the posted variable SERIAL.

If you check next code log, you will see that first 18 rows has calc_sn = serial.

Maybe you can complete the program by yoursef, if not -

what are the rules to add 1 to the calc_sn or to reset it to 1.

Does ID have any part in those rules ?

data have;
  input 
ID  C1  C2  C3  TIME    DOSE    VISIT   SERIAL;
datalines;
2   .   1   1   0   1   1   .
2   .   0   2   2   0   1   1
2   .   0   4   2   0   1   1
2   .   1   1   96  1   1   .
2   .   0   2   100 0   1   2
2   .   0   4   100 0   1   2
2   .   0   2   232 0   1   2
2   .   0   4   232 0   1   2
2   .   0   2   258 0   2   2
2   .   0   4   258 0   2   2
2   .   0   2   267 0   2   2
2   .   0   4   267 0   2   2
2   4   1   2   604 0   3   1
2   4   1   4   604 0   3   1
2   4   1   2   963 0   4   1
2   4   1   4   963 0   4   1
2   4   1   2   1637    0   5   1
2   4   1   4   1637    0   5   1
1   4   1   2   0   0   1   2
1   4   1   4   0   0   1   2
1   .   1   1   0   1   1   .
1   .   0   2   1   0   1   2
1   4   1   4   1   0   1   2
1   .   0   2   4   0   1   2
1   .   0   4   4   0   1   2
1   .   0   2   8   0   1   2
1   .   0   4   8   0   1   2
1   .   0   2   12  0   1   2
1   .   0   4   12  0   1   2
1   .   0   2   24  0   1   2
1   .   0   4   24  0   1   2
1   .   0   2   48  0   1   2
1   .   0   4   48  0   1   2
1   .   1   1   72  1   1   2
1   .   0   2   72  0   1   2
1   .   0   4   72  0   1   2
1   .   0   2   167 0   1   2
1   .   0   4   167 0   1   2
1   .   1   1   169 1   1   .
1   .   1   1   240 1   1   .
1   .   0   2   335 0   2   1
1   .   0   4   335 0   2   1
1   .   1   1   337 1   2   .
1   .   1   1   434 1   2   .
1   .   0   2   503 0   3   1
1   .   0   4   503 0   3   1
1   .   1   1   503 1   3   .
1   4   1   2   669 0   4   1
1   .   0   4   669 0   4   1
1   .   1   1   773 1   4   .
1   .   1   1   888 1   4   .
1   .   1   1   940 1   4   .
1   .   0   2   1006    0   5   1
1   .   0   4   1006    0   5   1
1   .   1   1   1037    1   5   .
1   .   1   1   1108    1   5   .
1   .   1   1   1206    1   5   .
1   .   1   1   1282    1   5   .
1   .   0   2   1342    0   6   1
1   .   0   4   1342    0   6   1
1   .   1   1   1373    1   6   .
1   .   1   1   1441    1   6   .
1   .   1   1   1513    1   6   .
1   4   1   2   1679    0   7   1
1   4   1   4   1679    0   7   1
1   .   1   1   1680    1   7   .
1   .   1   1   1752    1   7   .
1   .   1   1   1850    1   7   .
1   .   1   1   1921    1   7   .
1   4   1   2   2014    0   8   1
1   .   0   4   2014    0   8   1
1   .   1   1   2018    1   8   .
1   .   1   1   2089    1   8   .
1   .   1   1   2185    1   8   .
1   .   1   1   2477    1   8   .
1   .   1   1   2547    1   8   .
1   .   1   1   2643    1   8   .
1   .   0   2   2686    0   8   1
1   .   0   4   2686    0   8   1
1   .   1   1   2715    1   8   .
1   .   1   1   2811    1   8   .
1   .   1   1   2906    1   8   .
1   .   1   1   2951    1   8   .
1   .   1   1   3047    1   8   .
1   .   1   1   3118    1   8   .
1   .   1   1   3215    1   8   .
1   .   1   1   3286    1   8   .
1   4   1   2   3360    0   8   1
1   .   0   4   3360    0   8   1
1   .   1   1   3382    1   8   .
1   .   1   1   5574    1   8   .
1   .   1   1   5639    1   8   .
1   .   1   1   5734    1   8   .
1   .   1   1   5808    1   8   .
1   .   1   1   5902    1   8   .
1   .   1   1   5977    1   8   .
1   .   1   1   6070    1   8   .
1   .   1   1   6143    1   8   .
1   .   1   1   6239    1   8   .
1   .   1   1   6311    1   8   .
1   .   1   1   6413    1   8   .
1   .   1   1   6485    1   8   .
; run;

data want;
 set have;
  by notsorted id;
     retain sn 0
            calc_sn;  /* calculated serial number */
     if first.id then sn=0; 
     
     if dose=1 then do;
        calc_sn=.; output;
        sn+1;
        return;
     end;
     else calc_sn = sn;
     
     if c1 ne . and c2=1 and c3 not in (1, 3) then do;
        calc_sn=1; sn=1; 
     end; 
     output;
     
     /* check */
     if calc_sn ne serial then put _N_= serial= calc_sn=;
     drop sn;
run;

ari · Posted 02-05-2017 01:04 AM

@Shmuel: Thank you.

TIME, DOSE are important here. Several doses (>1) are given per subject and each dose is given in time. For instance, ID=2, received doses at time=0 and time=96. dose=1 and serial=. represent those dosing events. Serial samples are >= 3 consecutive samples (DOSE=0) associated with same dose (including predose samples). The samples at TIME=2 were set to SERIAL=1 as they are non-consecutive samples (<=3 consecutive samples), because at TIME=96 next dose is given. Following dose at TIME=96, there are serial samples (>=3) or <= 192 hr from the first dose (from TIME=100 to TIME=267), and hence these serial samples are coded as 2. Samples from TIME=604 to TIME=1637 are > 192 hr from first dose and hence were set to SERIAL=1.

For ID=1, First dose is given at TIME=0, second dose at TIME=72 & third dose at TIME=169. Samples between TIME=1 and TIME=169 are >= 3 consecutive samples or within <=192 hr from the first dose and hence are coded SERIAL=2. DOSES given at later TIME points such as 169, 240, 337 etc.. do not have >=3 samples subsequent to them and hence are coded as SERIAL=1. In case if >=3 consecutive samples are present after TIME=169, those samples will be coded as SERIAL=3 and so on.

Hope this information helps. Please let me know if you have any questions.

Thank you.

Shmuel · Posted 02-05-2017 07:25 AM

@ari, I have been too busy yesterday and today, so I want just to clarify some more points:

- Is ID the subject ID ?

- Do you mean by "sample" - a row/observation in the dataset ?

- If I want to sort the dataset by subject, what other variables I need

beyond ID (as subject ) and time - as there are "dupliacate" lines, regarding ID Time only.

- AmI right, to count consecutive samples I need count rows with DOSE=0 between

instances of DOSE=1.

- Would you say: if there are less than 3 consecutive samples then its Serial=1

otherwise Serial count from 2 on by 1 (2, 3, 4...) per subject (ID) ?

I hope I'll have time to continue tomorrow.

ari · Posted 02-05-2017 09:29 AM

@Shmuel. Thanks for the response.

Yes, Id is the subject ID and sample refers to row/observation.

Sort order is: ID TIME DOSE C3 can be used to sort the records

Yes, consecutive samples can be identified by counting rows with DOSE=0 between instances of DOSE=1

if there are less than 3 consecutive samples then its Serial=1 otherwise Serial count from 2 on by 1 (2, 3, 4...) per subject (ID) ?

Yes, for instance, if there are 3 groups of 4 consecutive samples (or rows) are found per subject. 1st group wil be SERIAL=2, 2nd group will be SERIAL=3 and so on.

Thank you.

Shmuel · Posted 02-05-2017 09:13 PM

Please run next code. No changes to step 1 - creating HAVE dataset.

There are still few observations with serial NE cal_sn (calculated serial number);

Check dataset CHECK.

proc sort data=have out=haves; by id time dose c3; run;

data count_cs(rename=(k1=id k2=time k3=dose k4=c3));
 set haves end=eof;
  by id time dose c3;
     retain phase 0    /* 0 1st time only */
            k1 k2 k3 k4
            count_cs;  /* count consecutive samples */
     if phase=1 and dose=1 
     or last.id
     or eof then output;
     
     if dose=1 then do;
        phase=1;
        k1=id; k2=time; k3=dose; k4=c3; 
        count_cs=0;
     end;
     if dose=0 then count_cs+1;
     keep k1 k2 k3 k4 count_cs;
run;     
     

data want;
 merge haves 
       count_cs(rename=(count_cs=count));
  by id time dose c3;
     retain count_cs sn dose_time
            calc_sn;  /* sn = serial number */
     lag_time = lag(time);
     if first.id then sn=0;
     if dose=1 then do;
        dose_time=time;
        count_cs=count;
        calc_sn=.; output;
        if count_cs < 3
        then calc_sn=1;
        else do;
             if sn=0 then sn=2; else sn+1;
             calc_sn=sn;
        end;
     end;
     else do;
          if time - dose_time > 192 then calc_sn=1;
          output; 
     end;
     drop sn count;
run;

data check;  /* should contain 0 obs */
 set want;
     if serial ne calc_sn;
run;

Identify serial smaples in the dataset