Hello, I have a dataset with 20,000 unique identifiers and multiple observations per identifier. I need to select all the observations for 65 of those identifiers and output all the observations for those 65. The list of unique identifiers is in another dataset. Here is a sample of what I'm trying to accomplish:
data have:
dataset 1:
visit_id study_id var var var
1 1
2 1
3 1
4 2
5 2
6 3
7 3
8 4
9 5
10... 5...
dataset 2:
study_id
1
3
5
17
92
81
180
1299...
data want:
visit_id study_id var var var
1 1
2 1
3 1
6 3
7 3
9 5
10 5
810 17
811.. 17..
I know how to do this in PROC SQL with a WHERE study_id IN statement. But I don't want to have to list out the 65 id numbers. Any ideas? Thank you!
proc sort data=dataset1;
by study_id;
run;
proc sort data=dataset2;
by study_id;
run;
data want;
merge dataset1 dataset2(in=in2);
by study_id;
if in2;
run;
It appears what you need is a simple merge?
sort the two datasets by study_id and merge by study_id
Use the IN= Dataset option to mark the dataset that contributes to observation in the PDV and apply the marker in a subsetting if statement to only keep the contributing obs from dataset2.
Can you give me sample code?
The two datasets have all the same variables. Dataset 2 is a subset of dataset one. When I tried to merge, it didn't work.
proc sort data=dataset1;
by study_id;
run;
proc sort data=dataset2;
by study_id;
run;
data want;
merge dataset1 dataset2(in=in2);
by study_id;
if in2;
run;
What does the in2 mean?
When you use a dataset option IN=, SAS basically creates a temp variable as a binary marker to keep a flag on the observations that dataset contributed to the observation construction at the time of execution. In2 is just a name that I gave and you are free to name anything you please. The temp variable is never written to the output dataset.
Okay, simple enought, thanks so much!
Here's one way:
proc sql; create table want as select b.* from dataset2 as a left join dataset1 as b on a.study_id=b.study_id ; quit;
the a and b are aliases so you can type a.study_id instead dataset1.study_id. If you haven't used. SQL you have to reference the table as well as the variable when it exists in two or more of the tables involved so SAS uses the right one.
Left Join says take every value in the the set on the left (dataset2) and match the values with the other set based on the criteria following join. Note that if Study_id is duplicated in dataset2 then each value matches ALL of the same study_id in the dataset1 and you'll have lots more records than you expect.
Thanks for the reply!
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.