Hi, someone on this help board recently taught me about arrays, and I'm wondering if I can use them to search across multiple columns in "dataset_2" for values in "dataset_1." (Or if not, if there is another way to do this.)
To explain, I have a "dataset_1" that looks something like this:
test other columns
A1111542 .......
A1234567 .......
A1342999 .......
And a "dataset_2" that looks something like this:
patient test_1 test_2 test_3 date_1 date_2 date_3 other columns
Bob A1111542 01/01/1999 .......
Mary A7394237 09/01/1973 .......
Joe A5394238 A2392230 A1234567 05/01/1962 07/01/1962 04/01/1964 .......
Tom A4394203 A1342999 03/01/1986 03/01/1988 .......
I'd like to create a "dataset_3" that adds to the columns of "dataset_2" the following 3 new columns:
patient n_want test_want date_want
Bob 1 A1111542 01/01/1999
Mary
Joe 3 A1234567 04/01/1964
Tom 2 A1342999 03/01/1988
Can I do this with some variation of the following code? Or is there some other way I can do this?
data dataset_3; set dataset_2;
array c test_1 - test_3;
array d date_1 - date_3;
n_want = whichn(Can I put something in this spot to search in dataset_2 for all values of "test" in dataset_1?);
test_want = c[n_want];
date_want = d[n_want]; format date_want MMDDYY10.;
run;
Thank you in advance for your help.
While hash tables would work, a format would work just as well here. The starting point would be to create a format that identifies what tests are in data set 1:
data locate;
set dataset1;
start=test;
label='Found';
fmtname='$test';
run
proc format cntlin=locate;
run;
This creates a format equivalent to:
proc format;
value $test 'A1111542'='Found'
'A1234567'='Found'
...;
Then use that format when processing dataset2 to see which test is found:
data want;
set dataset2;
array c {3} test_1-test_3;
array d {3} date_1-date_3;
do i=1 to 3;
if put(c{i}, $test.)='Found' then do;
n_want=i;
test_want=c{i};
date_want = d{i};
end;
end;
drop i;
run;
You may need to consider what the output should be if more than one test is wanted for the same person.
The code is untested, but should be fine. Good luck.
Thanks, @LinusH. Sorry to be a dunce, but could you please teach me more about the hash table option? (I am also looking up hash tables online in the mean time. Couldn't find them in the index of my Little SAS Book.)
This would be a very simple exercise if the dataset_2 was:
Patient Test Date (and possibly test number if you do multiple tests per day)
with one record per test.
It would look something like:
Proc sql;
create table want as
select a.*, b.patient, b.date(, other variables from dataset_2 )
from dataset_1 as a left join dataset_2 on a.test=b.test;
quit;
Data in the form you have is going to require lots of work continually going back and forth with having to search across multiple variables when the data is as you show.
It may be worth transposing the data to the normalized form of one record per test as and intermediate step.
While hash tables would work, a format would work just as well here. The starting point would be to create a format that identifies what tests are in data set 1:
data locate;
set dataset1;
start=test;
label='Found';
fmtname='$test';
run
proc format cntlin=locate;
run;
This creates a format equivalent to:
proc format;
value $test 'A1111542'='Found'
'A1234567'='Found'
...;
Then use that format when processing dataset2 to see which test is found:
data want;
set dataset2;
array c {3} test_1-test_3;
array d {3} date_1-date_3;
do i=1 to 3;
if put(c{i}, $test.)='Found' then do;
n_want=i;
test_want=c{i};
date_want = d{i};
end;
end;
drop i;
run;
You may need to consider what the output should be if more than one test is wanted for the same person.
The code is untested, but should be fine. Good luck.
Hi, @Astounding. I'd like to try your solution, but am getting an empty dataset (new columns/variables "label" and "fmtname" created, but no entries in dataset). Is this what I should be seeing?
It's probably something simple. Check (1) missing the SET statement, (2) missing the semicolon at the end of the DATA statement. If it's not that, post the log from the DATA step ... should be just a few lines there.
Hi @Astounding. Yeah, I was missing a semicolon -- that was silly of me. Ok, so now I'm ready to try the 2nd half of your solution. I don't think I understand this part. For the if statement, we're referring to the new "locate" dataset, right? How does SAS know that's where we want to look, if we don't have anything about the "locate" dataset in this code? Thanks for your patient instruction.
The LOCATE data set is used to create a format named $test. Then the DATA step uses that format.
Hi @Astounding. Hm, I'm getting errors, including the following note in my log: "Format $test was not found or could not be loaded".
I think I might have figured out the problem: I had 2 entries with the same value for "test." I'll work with this a little more & see how it goes.
That means you still have to run the PROC FORMAT to create $test.
P.S. I just modified my original post to clarify: I want to search in dataset_2 "test_n" columns for all (not just one or a handful of) values of "test" from dataset_1.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.