BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
SBuc
Obsidian | Level 7

Dear SAS community, I want to submit to your thoughts one problem. I have 2 datasets one from a cross-sectional study and another from a prospective cohort. I am more used with data analysis than internal sampling of my data…

The 1st study is data obtained from calves between 1 to 21 days old. Each calf has only 1 data line (cross-sectional study 1 visit)

The dataset I have is on the form:

DATASET1

calfID

FarmID

Date_birth

Date_visit

Age

Gender

X1

X2

X3

111

FARM1

 

 

 

Male or female

 

 

 

 Where CalfD is the eartag number of the calf (unique for a specific calf), FarmID is farm identification code, Date_birth the date of calf_birth, Date_visit the day we measured X1,X2 and X2 which are continuous numeric data. Age is the difference between the 2 dates (which represents calf’s age). We’ve also collected gender information.

The 2nd dataset is coming from different farms/ animal. The same data are collected but the same calf can be repeated 2 up to 3 times (extra column visit which indicate the visit number ofr a specific calf) as below (the interval between the visit is the same: 1 week). Calves have the same age range in the 2 datasets (from 1 day to 21 days).

DATASET2 (calves are replicated during 2 to 3 visits one week apart)

calfID

FarmID

Visit

Date_birth

Date_visit

Age

Gender

X1

X2

X3

222

FARM2

1

 

 

 

 

 

 

 

222

FARM2

2

 

 

 

 

 

 

 

222

FARM2

3

 

 

 

 

 

 

 

223

FARM2

1

 

 

 

 

 

 

 

223

FARM2

2

 

 

 

 

 

 

 

333

FARM3

1

 

 

 

 

 

 

 

 

My objectives are to perform a logistic regression for predicting calves probability of being younger than X days (different age cut-off would be used) based on covariates Gender, X1, X2, X3 using farm as a random effect. I want to use information from both database.

I want to sample the DATASET 2 to have only 1 sample per calf but also being able to select calves from the database based on the age distribution I want to have. For example if I have 115 calves from dataset 1 and 200 calves from dataset 2. I want to select calves (1 visit calf only) conditional on age characteristics (ex: having a median distribution of the age of calf sampled that I can specified).

I therefore want to know if you have any clues on how sampling the DATASET2 to achieve my goals. I hope that this problem is clearly defined and can be solved with your expertise.

If possible, in a second step I would be interested to make internal validation of my models using bootstrap samples of my 2 datasets (respecting 1 sample per calf). But I want to start by a more simple approach.

 

1 ACCEPTED SOLUTION

Accepted Solutions
SBuc
Obsidian | Level 7

Thank you very much, it works perfectly well.

View solution in original post

4 REPLIES 4
PGStats
Opal | Level 21

There may not be an algorithm to find an optimal solution to this subsampling problem. But here is a way to find a pretty good solution.

 

 

/* Mock data */
data have;
call streaminit(7897);
do id = 1 to 200;
    /* Age at first visit */
    age = rand("integer", 1, 14);
    do visit = 1 by 1 while(age <= 21);
        output;
        /* visits are one week apart */
        age = age + 7;
        end;
    end;
keep id visit age;
run; 

/* Required age median */
%let targetAgeMed = 16;

/* Pick the most desirable visits */ 
data want;
do until(last.id);
    set have; by id;
    ageMin = min(ageMin, age);
    ageMax = max(ageMax, age);
    /* Largest age inferior to the median */
    if age <= &targetAgeMed then infMedMax = max(infMedMax, age);
    /* Smallest age superior to the median */
    if age >= &targetAgeMed then infMedMin = min(infMedMin, age);
    end;

if infMed <= supMed 
    then do;
        if missing(infMedMax) then pickAge = ageMin;
        else pickAge = infMedMax;
        end;
    else do;
        if missing(infMedMin) then pickAge = ageMax;
        else pickAge = infMedMin;
        end;

if pickAge > &targetAgeMed then supMed + 1;
if pickAge < &targetAgeMed then infMed + 1;

do until(last.id);
    set have; by id;
    if age = pickAge then output;
    end;
drop ageMin ageMax infMedMax infMedMin infMed supMed pickAge;
run;

proc sql;
select * from
(select "Available Age median" as Statistic, median(age) as medianAge from have)
union
(select "Selected Age median", median(age) from want);
quit;
                          Statistic             medianAge
                          -------------------------------
                          Available Age median         13
                          Selected Age median          16

 

Note: you can get a different solution by changing the order of the animals in the data. It might be a good idea to randomize the order before subsetting.  

PG
SBuc
Obsidian | Level 7

Thanks for this answer. I don't have access to the data base this week end so It's why I directly tried the code to generate data.

Unfortunately my 9.4 version send me an error message for the "integer" within the rand function.

when looking for this argument in sas book I see that rand is generally followed by a distribution type?

 

PGStats
Opal | Level 21

You can replace

 

age = rand("integer", 1, 14);

/* by */

age = ceil(14*rand("uniform"));
PG
SBuc
Obsidian | Level 7

Thank you very much, it works perfectly well.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 4 replies
  • 730 views
  • 0 likes
  • 2 in conversation