I have a data set that contains cases and matched controls (most cases are matched 1:4 with controls). Each case-control group is identified by having the same ID number following with "X" for case and "A" "B" "C" "D" for controls. Example (GI001X GI001A GI001B GI001C). I would like to make a new dataset that only contains giardia cases and the matching controls. Is there a way to subset my data based on ID number? Basically I want to keep observations where "Giardia"=1 and "Case"=1 (this combination denotes giardia cases) as well as the matching controls in which "Case"=0. The tricky part is, not all the ID numbers are the same length- they range from 2-4 numbers followed the X or A,B,C,D.
Seems you just have to sub-string out the id part, and then inner join on that.
How would I go about doing that?
Using substring and length functions.
I'm assuming Giardia and Case are different variables?
If you provide better sample data you get better answers, but not guarantees
Thanks for your response, yes "Giardia" and "Case" are different variables. There are four main exposures of interest in my dataset (Giardia, Amebias, Cryptosporidium, and Cyclospora) in addition to the matched (1:4) controls for each. I would like to write a code to match all the case variables to corresponding controls and then make a dataset that only contains the Giardia cases and the matching controls. Is this more specific?
I think I got it from the start. But the syntax for this is straightforward, and you should be able to find a solution by search the on-line documentation, papers, other forum threads, you name it...
Post some test data and required output, maybe something like:
data have;
group="GI001X"; output;
group="GI001A"; output;
group="GI001B"; output;
group="AI001X"; output;
group="AI001A"; output;
run;
data giarida amebias;
set have;
select(substr(group,1,3));
when ("GI0") output giarida;
when ("AI0") output amebias;
otherwise;
end;
run;
Great, I will do this tonight. Will the substring command correctly select even though some ID's are 3 characters long and some are 4?
You need to think of a string as an array of characters, e.g:
{1} {2} {3} {4} {5} {6}
G I 0 0 1 X
G I 0 0 1 A
So if we use substring(array 1 to array 3), both of these strings will be GI0 as 4-6 is dropped, hence they will match.
You can substring from the "end", see on-line doc for syntax.
I've used the substring option as follows:
match_id= substr (ID, 3,5)
which gives me just the numeric portion of a variable as follows:
Match_ID:
100
100
100
101
101
101
102
102
102
Now I need to run a descriptive analysis to get the characteristics of the cases and the matched controls. How do I run an analysis on controls matched to Giardia cases (the Giardia value for these observations will be 0). Can I make a new data set selecting only Giardia cases and the matched controls using the newly generated Match_ID variable? Something along the lines of:
Data New;
Set Old;
Where Giardia=1 and Match_ID=_____ ;(somehow denoting matched cases)
run;
II think I actually want to reword the data:
1. Creat a variable "match id" which is the identification number without the trailing letter. I can't use the substring option with this because all the identification numbers are different lengths. I basically want to "drop" the last character in each ID.
2. create a dataset with only giardia cases abd the matched controls.
1. Substr(variable, 1, length(variable)-1);
Great! How would I go about selecting cases and matched controls based on the two variables?
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.