Solved: Renaming duplicated matched paired ids for case control data in bootst...

fones · Posted 05-01-2019 01:19 PM

I'm bootstrapping samples (with replacement) for a matched case control data using proc survey select. The sampling unit is the id for each matched pair and am outputting all the replicates using the outhits option. So the output data set is the same size as the input data set.

My goal is to assign different id values for each duplicated matched pair to prevent such case-control pairs from being dropped off during analysis. Here is an sample input and output data set, and the modification am trying to achieve.

I'm thinking by group processing but when the number of hits is more than 2, then I can't seem to wrap my mind around it.

Any ideas are appreciated.

FreelanceReinh · Posted 05-02-2019 01:11 PM

Hello @fones and welcome to the SAS Support Communities!

Sorry to see that you haven't received a reply yet. Maybe it's because many users here are hesitant or not allowed to open MS office attachments (security concerns). I don't even have Word or Excel installed on my SAS workstation, but I managed to reconstruct your data from the preview of the .docx file:

data have;
input Replicate id Case NumberHits;
cards;
1 10 0 1 
1 10 1 1 
1 11 0 3 
1 11 1 3 
1 11 0 3
1 11 1 3 
1 11 0 3 
1 11 1 3 
1 14 0 2 
1 14 1 2 
1 14 0 2 
1 14 1 2 
2  8 0 1 
2  8 1 1 
2 10 0 2 
2 10 1 2 
2 10 0 2 
2 10 1 2
;

(A DATA step as above is the preferred way to provide test data. Instructions on how to create such a DATA step from an existing SAS dataset can be found here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...)

Yes, BY-group processing can be used to create the new IDs:

data want(drop=seqno /* id rename=(newid=id) */);
set have;
by replicate id;
if first.id then seqno=0;
else seqno+1;
newid=ifn(seqno>1, id*1e4+int(seqno/2), id);
run;

Result:

                           Number
Replicate    id    Case     Hits      newid

    1        10      0        1          10
    1        10      1        1          10
    1        11      0        3          11
    1        11      1        3          11
    1        11      0        3      110001
    1        11      1        3      110001
    1        11      0        3      110002
    1        11      1        3      110002
    1        14      0        2          14
    1        14      1        2          14
    1        14      0        2      140001
    1        14      1        2      140001
    2         8      0        1           8
    2         8      1        1           8
    2        10      0        2          10
    2        10      1        2          10
    2        10      0        2      100001
    2        10      1        2      100001

Depending on your max(NumberHits) and max(id) you may want to adapt the multiplier (1e4=10000) so that it is large enough to avoid overlaps between the new IDs of different IDs and small enough to avoid numeric precision issues (if NEWID gets larger than about 9.0e15). Of course, you can simplify the definition of NEWID if those IDs which are currently copied to NEWID may be multiplied, too:

newid=id*1e4+int(seqno/2);

If you really want to replace the original ID values, just activate the code that I've commented out.

View solution in original post

FreelanceReinh · Posted 05-02-2019 01:11 PM

Hello @fones and welcome to the SAS Support Communities!

Sorry to see that you haven't received a reply yet. Maybe it's because many users here are hesitant or not allowed to open MS office attachments (security concerns). I don't even have Word or Excel installed on my SAS workstation, but I managed to reconstruct your data from the preview of the .docx file:

data have;
input Replicate id Case NumberHits;
cards;
1 10 0 1 
1 10 1 1 
1 11 0 3 
1 11 1 3 
1 11 0 3
1 11 1 3 
1 11 0 3 
1 11 1 3 
1 14 0 2 
1 14 1 2 
1 14 0 2 
1 14 1 2 
2  8 0 1 
2  8 1 1 
2 10 0 2 
2 10 1 2 
2 10 0 2 
2 10 1 2
;

(A DATA step as above is the preferred way to provide test data. Instructions on how to create such a DATA step from an existing SAS dataset can be found here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...)

Yes, BY-group processing can be used to create the new IDs:

data want(drop=seqno /* id rename=(newid=id) */);
set have;
by replicate id;
if first.id then seqno=0;
else seqno+1;
newid=ifn(seqno>1, id*1e4+int(seqno/2), id);
run;

Result:

                           Number
Replicate    id    Case     Hits      newid

    1        10      0        1          10
    1        10      1        1          10
    1        11      0        3          11
    1        11      1        3          11
    1        11      0        3      110001
    1        11      1        3      110001
    1        11      0        3      110002
    1        11      1        3      110002
    1        14      0        2          14
    1        14      1        2          14
    1        14      0        2      140001
    1        14      1        2      140001
    2         8      0        1           8
    2         8      1        1           8
    2        10      0        2          10
    2        10      1        2          10
    2        10      0        2      100001
    2        10      1        2      100001

Depending on your max(NumberHits) and max(id) you may want to adapt the multiplier (1e4=10000) so that it is large enough to avoid overlaps between the new IDs of different IDs and small enough to avoid numeric precision issues (if NEWID gets larger than about 9.0e15). Of course, you can simplify the definition of NEWID if those IDs which are currently copied to NEWID may be multiplied, too:

newid=id*1e4+int(seqno/2);

If you really want to replace the original ID values, just activate the code that I've commented out.

fones · Posted 05-02-2019 02:22 PM

Just what I wanted to avoid overlaps! Thanks a lot!

PS: Will use a SAS data step next time I have a data example!

fones · Posted 05-02-2019 02:28 PM

And just to add on for other users out there: The data have to be sorted by the BY variables before the BY- group processing!

Renaming duplicated matched paired ids for case control data in bootstrapp samples

Re: Renaming duplicated matched paired ids for case control data in bootstrapp samples

Re: Renaming duplicated matched paired ids for case control data in bootstrapp samples

Re: Renaming duplicated matched paired ids for case control data in bootstrapp samples

Re: Renaming duplicated matched paired ids for case control data in bootstrapp samples

Renaming duplicated matched paired ids for case control data in bootstrapp samples

Re: Renaming duplicated matched paired ids for case control data in bootstrapp samples

Re: Renaming duplicated matched paired ids for case control data in bootstrapp samples

Re: Renaming duplicated matched paired ids for case control data in bootstrapp samples

Re: Renaming duplicated matched paired ids for case control data in bootstrapp samples

Registration is open