Solved: Re: Resampling one data per subject where multiple observation are ava...

SBuc · Posted 10-12-2018 06:34 AM

Dear SAS community, I want to submit to your thoughts one problem. I have 2 datasets one from a cross-sectional study and another from a prospective cohort. I am more used with data analysis than internal sampling of my data…

The 1st study is data obtained from calves between 1 to 21 days old. Each calf has only 1 data line (cross-sectional study 1 visit)

The dataset I have is on the form:

DATASET1

calfID	FarmID	Date_birth	Date_visit	Age	Gender	X1	X2	X3
111	FARM1				Male or female

Where CalfD is the eartag number of the calf (unique for a specific calf), FarmID is farm identification code, Date_birth the date of calf_birth, Date_visit the day we measured X1,X2 and X2 which are continuous numeric data. Age is the difference between the 2 dates (which represents calf’s age). We’ve also collected gender information.

The 2nd dataset is coming from different farms/ animal. The same data are collected but the same calf can be repeated 2 up to 3 times (extra column visit which indicate the visit number ofr a specific calf) as below (the interval between the visit is the same: 1 week). Calves have the same age range in the 2 datasets (from 1 day to 21 days).

DATASET2 (calves are replicated during 2 to 3 visits one week apart)

calfID	FarmID	Visit	Date_birth	Date_visit	Age	Gender	X1	X2	X3
222	FARM2	1
222	FARM2	2
222	FARM2	3
223	FARM2	1
223	FARM2	2
333	FARM3	1

My objectives are to perform a logistic regression for predicting calves probability of being younger than X days (different age cut-off would be used) based on covariates Gender, X1, X2, X3 using farm as a random effect. I want to use information from both database.

I want to sample the DATASET 2 to have only 1 sample per calf but also being able to select calves from the database based on the age distribution I want to have. For example if I have 115 calves from dataset 1 and 200 calves from dataset 2. I want to select calves (1 visit calf only) conditional on age characteristics (ex: having a median distribution of the age of calf sampled that I can specified).

I therefore want to know if you have any clues on how sampling the DATASET2 to achieve my goals. I hope that this problem is clearly defined and can be solved with your expertise.

If possible, in a second step I would be interested to make internal validation of my models using bootstrap samples of my 2 datasets (respecting 1 sample per calf). But I want to start by a more simple approach.

SBuc · Posted 10-14-2018 07:03 AM

Thank you very much, it works perfectly well.

View solution in original post

PGStats · Posted 10-12-2018 06:16 PM

There may not be an algorithm to find an optimal solution to this subsampling problem. But here is a way to find a pretty good solution.

/* Mock data */
data have;
call streaminit(7897);
do id = 1 to 200;
    /* Age at first visit */
    age = rand("integer", 1, 14);
    do visit = 1 by 1 while(age <= 21);
        output;
        /* visits are one week apart */
        age = age + 7;
        end;
    end;
keep id visit age;
run; 

/* Required age median */
%let targetAgeMed = 16;

/* Pick the most desirable visits */ 
data want;
do until(last.id);
    set have; by id;
    ageMin = min(ageMin, age);
    ageMax = max(ageMax, age);
    /* Largest age inferior to the median */
    if age <= &targetAgeMed then infMedMax = max(infMedMax, age);
    /* Smallest age superior to the median */
    if age >= &targetAgeMed then infMedMin = min(infMedMin, age);
    end;

if infMed <= supMed 
    then do;
        if missing(infMedMax) then pickAge = ageMin;
        else pickAge = infMedMax;
        end;
    else do;
        if missing(infMedMin) then pickAge = ageMax;
        else pickAge = infMedMin;
        end;

if pickAge > &targetAgeMed then supMed + 1;
if pickAge < &targetAgeMed then infMed + 1;

do until(last.id);
    set have; by id;
    if age = pickAge then output;
    end;
drop ageMin ageMax infMedMax infMedMin infMed supMed pickAge;
run;

proc sql;
select * from
(select "Available Age median" as Statistic, median(age) as medianAge from have)
union
(select "Selected Age median", median(age) from want);
quit;

                          Statistic             medianAge
                          -------------------------------
                          Available Age median         13
                          Selected Age median          16

Note: you can get a different solution by changing the order of the animals in the data. It might be a good idea to randomize the order before subsetting.

PG

SBuc · Posted 10-13-2018 07:29 AM

Thanks for this answer. I don't have access to the data base this week end so It's why I directly tried the code to generate data.

Unfortunately my 9.4 version send me an error message for the "integer" within the rand function.

when looking for this argument in sas book I see that rand is generally followed by a distribution type?

PGStats · Posted 10-13-2018 02:44 PM

You can replace

age = rand("integer", 1, 14);

/* by */

age = ceil(14*rand("uniform"));

PG

SBuc · Posted 10-14-2018 07:03 AM

Thank you very much, it works perfectly well.

Resampling one data per subject where multiple observation are available

Re: Resampling one data per subject where multiple observation are available

Re: Resampling one data per subject where multiple observation are available

Re: Resampling one data per subject where multiple observation are available

Re: Resampling one data per subject where multiple observation are available

Re: Resampling one data per subject where multiple observation are available