I'm bootstrapping samples (with replacement) for a matched case control data using proc survey select. The sampling unit is the id for each matched pair and am outputting all the replicates using the outhits option. So the output data set is the same size as the input data set.
My goal is to assign different id values for each duplicated matched pair to prevent such case-control pairs from being dropped off during analysis. Here is an sample input and output data set, and the modification am trying to achieve.
I'm thinking by group processing but when the number of hits is more than 2, then I can't seem to wrap my mind around it.
Any ideas are appreciated.
Hello @fones and welcome to the SAS Support Communities!
Sorry to see that you haven't received a reply yet. Maybe it's because many users here are hesitant or not allowed to open MS office attachments (security concerns). I don't even have Word or Excel installed on my SAS workstation, but I managed to reconstruct your data from the preview of the .docx file:
data have;
input Replicate id Case NumberHits;
cards;
1 10 0 1
1 10 1 1
1 11 0 3
1 11 1 3
1 11 0 3
1 11 1 3
1 11 0 3
1 11 1 3
1 14 0 2
1 14 1 2
1 14 0 2
1 14 1 2
2 8 0 1
2 8 1 1
2 10 0 2
2 10 1 2
2 10 0 2
2 10 1 2
;
(A DATA step as above is the preferred way to provide test data. Instructions on how to create such a DATA step from an existing SAS dataset can be found here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...)
Yes, BY-group processing can be used to create the new IDs:
data want(drop=seqno /* id rename=(newid=id) */);
set have;
by replicate id;
if first.id then seqno=0;
else seqno+1;
newid=ifn(seqno>1, id*1e4+int(seqno/2), id);
run;
Result:
Number Replicate id Case Hits newid 1 10 0 1 10 1 10 1 1 10 1 11 0 3 11 1 11 1 3 11 1 11 0 3 110001 1 11 1 3 110001 1 11 0 3 110002 1 11 1 3 110002 1 14 0 2 14 1 14 1 2 14 1 14 0 2 140001 1 14 1 2 140001 2 8 0 1 8 2 8 1 1 8 2 10 0 2 10 2 10 1 2 10 2 10 0 2 100001 2 10 1 2 100001
Depending on your max(NumberHits) and max(id) you may want to adapt the multiplier (1e4=10000) so that it is large enough to avoid overlaps between the new IDs of different IDs and small enough to avoid numeric precision issues (if NEWID gets larger than about 9.0e15). Of course, you can simplify the definition of NEWID if those IDs which are currently copied to NEWID may be multiplied, too:
newid=id*1e4+int(seqno/2);
If you really want to replace the original ID values, just activate the code that I've commented out.
Hello @fones and welcome to the SAS Support Communities!
Sorry to see that you haven't received a reply yet. Maybe it's because many users here are hesitant or not allowed to open MS office attachments (security concerns). I don't even have Word or Excel installed on my SAS workstation, but I managed to reconstruct your data from the preview of the .docx file:
data have;
input Replicate id Case NumberHits;
cards;
1 10 0 1
1 10 1 1
1 11 0 3
1 11 1 3
1 11 0 3
1 11 1 3
1 11 0 3
1 11 1 3
1 14 0 2
1 14 1 2
1 14 0 2
1 14 1 2
2 8 0 1
2 8 1 1
2 10 0 2
2 10 1 2
2 10 0 2
2 10 1 2
;
(A DATA step as above is the preferred way to provide test data. Instructions on how to create such a DATA step from an existing SAS dataset can be found here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...)
Yes, BY-group processing can be used to create the new IDs:
data want(drop=seqno /* id rename=(newid=id) */);
set have;
by replicate id;
if first.id then seqno=0;
else seqno+1;
newid=ifn(seqno>1, id*1e4+int(seqno/2), id);
run;
Result:
Number Replicate id Case Hits newid 1 10 0 1 10 1 10 1 1 10 1 11 0 3 11 1 11 1 3 11 1 11 0 3 110001 1 11 1 3 110001 1 11 0 3 110002 1 11 1 3 110002 1 14 0 2 14 1 14 1 2 14 1 14 0 2 140001 1 14 1 2 140001 2 8 0 1 8 2 8 1 1 8 2 10 0 2 10 2 10 1 2 10 2 10 0 2 100001 2 10 1 2 100001
Depending on your max(NumberHits) and max(id) you may want to adapt the multiplier (1e4=10000) so that it is large enough to avoid overlaps between the new IDs of different IDs and small enough to avoid numeric precision issues (if NEWID gets larger than about 9.0e15). Of course, you can simplify the definition of NEWID if those IDs which are currently copied to NEWID may be multiplied, too:
newid=id*1e4+int(seqno/2);
If you really want to replace the original ID values, just activate the code that I've commented out.
Just what I wanted to avoid overlaps! Thanks a lot!
PS: Will use a SAS data step next time I have a data example!
And just to add on for other users out there: The data have to be sorted by the BY variables before the BY- group processing!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.