Re: Matching observations based on ID number

Mdormond · Posted 02-01-2015 01:46 PM

I have a data set that contains cases and matched controls (most cases are matched 1:4 with controls). Each case-control group is identified by having the same ID number following with "X" for case and "A" "B" "C" "D" for controls. Example (GI001X GI001A GI001B GI001C). I would like to make a new dataset that only contains giardia cases and the matching controls. Is there a way to subset my data based on ID number? Basically I want to keep observations where "Giardia"=1 and "Case"=1 (this combination denotes giardia cases) as well as the matching controls in which "Case"=0. The tricky part is, not all the ID numbers are the same length- they range from 2-4 numbers followed the X or A,B,C,D.

LinusH · Posted 02-01-2015 04:24 PM

Seems you just have to sub-string out the id part, and then inner join on that.

Data never sleeps

Mdormond · Posted 02-01-2015 05:01 PM

How would I go about doing that?

Reeza · Posted 02-01-2015 05:10 PM

Using substring and length functions.

I'm assuming Giardia and Case are different variables?

If you provide better sample data you get better answers, but not guarantees

Mdormond · Posted 02-02-2015 10:16 AM

Thanks for your response, yes "Giardia" and "Case" are different variables. There are four main exposures of interest in my dataset (Giardia, Amebias, Cryptosporidium, and Cyclospora) in addition to the matched (1:4) controls for each. I would like to write a code to match all the case variables to corresponding controls and then make a dataset that only contains the Giardia cases and the matching controls. Is this more specific?

LinusH · Posted 02-02-2015 10:38 AM

I think I got it from the start. But the syntax for this is straightforward, and you should be able to find a solution by search the on-line documentation, papers, other forum threads, you name it...

Data never sleeps

RW9 · Posted 02-02-2015 10:38 AM

Post some test data and required output, maybe something like:

data have;
group="GI001X"; output;
group="GI001A"; output;
group="GI001B"; output;
group="AI001X"; output;
group="AI001A"; output;
run;

data giarida amebias;
set have;
select(substr(group,1,3));
    when ("GI0") output giarida;
    when ("AI0") output amebias;
    otherwise;
end;
run;

Mdormond · Posted 02-02-2015 01:14 PM

Great, I will do this tonight. Will the substring command correctly select even though some ID's are 3 characters long and some are 4?

RW9 · Posted 02-03-2015 04:16 AM

You need to think of a string as an array of characters, e.g:

{1} {2} {3} {4} {5} {6}

G I 0 0 1 X

G I 0 0 1 A

So if we use substring(array 1 to array 3), both of these strings will be GI0 as 4-6 is dropped, hence they will match.

LinusH · Posted 02-03-2015 05:37 AM

You can substring from the "end", see on-line doc for syntax.

Data never sleeps

Mdormond · Posted 02-03-2015 08:32 PM

I've used the substring option as follows:

match_id= substr (ID, 3,5)

which gives me just the numeric portion of a variable as follows:

Match_ID:

100

101

102

Now I need to run a descriptive analysis to get the characteristics of the cases and the matched controls. How do I run an analysis on controls matched to Giardia cases (the Giardia value for these observations will be 0). Can I make a new data set selecting only Giardia cases and the matched controls using the newly generated Match_ID variable? Something along the lines of:

Data New;

Set Old;

Where Giardia=1 and Match_ID=_____ ;(somehow denoting matched cases)

run;

Mdormond · Posted 02-04-2015 05:28 PM

II think I actually want to reword the data:

1. Creat a variable "match id" which is the identification number without the trailing letter. I can't use the substring option with this because all the identification numbers are different lengths. I basically want to "drop" the last character in each ID.

2. create a dataset with only giardia cases abd the matched controls.

Reeza · Posted 02-04-2015 06:17 PM

1. Substr(variable, 1, length(variable)-1);

Mdormond · Posted 02-04-2015 06:47 PM

Great! How would I go about selecting cases and matched controls based on the two variables?

Registration is open

SAS Training: Just a Click Away