Dear SAS community, I want to submit to your thoughts one problem. I have 2 datasets one from a cross-sectional study and another from a prospective cohort. I am more used with data analysis than internal sampling of my data…
The 1st study is data obtained from calves between 1 to 21 days old. Each calf has only 1 data line (cross-sectional study 1 visit)
The dataset I have is on the form:
DATASET1
calfID | FarmID | Date_birth | Date_visit | Age | Gender | X1 | X2 | X3 |
111 | FARM1 |
|
|
| Male or female |
|
|
|
Where CalfD is the eartag number of the calf (unique for a specific calf), FarmID is farm identification code, Date_birth the date of calf_birth, Date_visit the day we measured X1,X2 and X2 which are continuous numeric data. Age is the difference between the 2 dates (which represents calf’s age). We’ve also collected gender information.
The 2nd dataset is coming from different farms/ animal. The same data are collected but the same calf can be repeated 2 up to 3 times (extra column visit which indicate the visit number ofr a specific calf) as below (the interval between the visit is the same: 1 week). Calves have the same age range in the 2 datasets (from 1 day to 21 days).
DATASET2 (calves are replicated during 2 to 3 visits one week apart)
calfID | FarmID | Visit | Date_birth | Date_visit | Age | Gender | X1 | X2 | X3 |
222 | FARM2 | 1 |
|
|
|
|
|
|
|
222 | FARM2 | 2 |
|
|
|
|
|
|
|
222 | FARM2 | 3 |
|
|
|
|
|
|
|
223 | FARM2 | 1 |
|
|
|
|
|
|
|
223 | FARM2 | 2 |
|
|
|
|
|
|
|
333 | FARM3 | 1 |
|
|
|
|
|
|
|
My objectives are to perform a logistic regression for predicting calves probability of being younger than X days (different age cut-off would be used) based on covariates Gender, X1, X2, X3 using farm as a random effect. I want to use information from both database.
I want to sample the DATASET 2 to have only 1 sample per calf but also being able to select calves from the database based on the age distribution I want to have. For example if I have 115 calves from dataset 1 and 200 calves from dataset 2. I want to select calves (1 visit calf only) conditional on age characteristics (ex: having a median distribution of the age of calf sampled that I can specified).
I therefore want to know if you have any clues on how sampling the DATASET2 to achieve my goals. I hope that this problem is clearly defined and can be solved with your expertise.
If possible, in a second step I would be interested to make internal validation of my models using bootstrap samples of my 2 datasets (respecting 1 sample per calf). But I want to start by a more simple approach.
Thank you very much, it works perfectly well.
There may not be an algorithm to find an optimal solution to this subsampling problem. But here is a way to find a pretty good solution.
/* Mock data */
data have;
call streaminit(7897);
do id = 1 to 200;
/* Age at first visit */
age = rand("integer", 1, 14);
do visit = 1 by 1 while(age <= 21);
output;
/* visits are one week apart */
age = age + 7;
end;
end;
keep id visit age;
run;
/* Required age median */
%let targetAgeMed = 16;
/* Pick the most desirable visits */
data want;
do until(last.id);
set have; by id;
ageMin = min(ageMin, age);
ageMax = max(ageMax, age);
/* Largest age inferior to the median */
if age <= &targetAgeMed then infMedMax = max(infMedMax, age);
/* Smallest age superior to the median */
if age >= &targetAgeMed then infMedMin = min(infMedMin, age);
end;
if infMed <= supMed
then do;
if missing(infMedMax) then pickAge = ageMin;
else pickAge = infMedMax;
end;
else do;
if missing(infMedMin) then pickAge = ageMax;
else pickAge = infMedMin;
end;
if pickAge > &targetAgeMed then supMed + 1;
if pickAge < &targetAgeMed then infMed + 1;
do until(last.id);
set have; by id;
if age = pickAge then output;
end;
drop ageMin ageMax infMedMax infMedMin infMed supMed pickAge;
run;
proc sql;
select * from
(select "Available Age median" as Statistic, median(age) as medianAge from have)
union
(select "Selected Age median", median(age) from want);
quit;
Statistic medianAge ------------------------------- Available Age median 13 Selected Age median 16
Note: you can get a different solution by changing the order of the animals in the data. It might be a good idea to randomize the order before subsetting.
Thanks for this answer. I don't have access to the data base this week end so It's why I directly tried the code to generate data.
Unfortunately my 9.4 version send me an error message for the "integer" within the rand function.
when looking for this argument in sas book I see that rand is generally followed by a distribution type?
You can replace
age = rand("integer", 1, 14);
/* by */
age = ceil(14*rand("uniform"));
Thank you very much, it works perfectly well.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.