BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
beginner
Calcite | Level 5

Hi, someone on this help board recently taught me about arrays, and I'm wondering if I can use them to search across multiple columns in "dataset_2" for values in "dataset_1." (Or if not, if there is another way to do this.)

 

To explain, I have a "dataset_1" that looks something like this:

test          other columns

A1111542       .......

A1234567       .......

A1342999       .......

 

And a "dataset_2" that looks something like this:

patient    test_1         test_2         test_3         date_1          date_2          date_3          other columns

Bob        A1111542                                         01/01/1999                                              .......

Mary      A7394237                                         09/01/1973                                              .......

Joe        A5394238   A2392230   A1234567   05/01/1962   07/01/1962   04/01/1964    .......

Tom       A4394203   A1342999                      03/01/1986   03/01/1988                         .......

 

I'd like to create a "dataset_3" that adds to the columns of "dataset_2" the following 3 new columns:

patient   n_want    test_want     date_want

Bob        1             A1111542    01/01/1999

Mary

Joe         3             A1234567    04/01/1964

Tom        2             A1342999    03/01/1988

 

Can I do this with some variation of the following code?  Or is there some other way I can do this? 

data dataset_3; set dataset_2;
   array c test_1 - test_3;
   array d date_1 - date_3;
n_want = whichn(Can I put something in this spot to search in dataset_2 for all values of "test" in dataset_1?);
   test_want = c[n_want];
   date_want = d[n_want]; format date_want MMDDYY10.;

run;

 

Thank you in advance for your help.

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

While hash tables would work, a format would work just as well here.  The starting point would be to create a format that identifies what tests are in data set 1:

 

data locate;

set dataset1;

start=test;

label='Found';

fmtname='$test';

run

proc format cntlin=locate;

run;

 

This creates a format equivalent to:

 

proc format;

value $test 'A1111542'='Found'

   'A1234567'='Found'

   ...;

 

 

Then use that format when processing dataset2 to see which test is found:

 

data want;

set dataset2;

array c {3} test_1-test_3;

array d {3} date_1-date_3;

do i=1 to 3;

   if put(c{i}, $test.)='Found' then do;

      n_want=i;

      test_want=c{i};

      date_want = d{i};

   end;

end;

drop i;

run;

 

You may need to consider what the output should be if more than one test is wanted for the same person.

 

The code is untested, but should be fine.  Good luck.

View solution in original post

13 REPLIES 13
LinusH
Tourmaline | Level 20
Not seeing that arrays would be the natural choice here. Rather read dataset1 into a hash table.
You could also do a SQL join, but that would probably require joining on all three test columns, combined with case statements.
Data never sleeps
beginner
Calcite | Level 5

Thanks, @LinusH.  Sorry to be a dunce, but could you please teach me more about the hash table option?  (I am also looking up hash tables online in the mean time.  Couldn't find them in the index of my Little SAS Book.)

ballardw
Super User

This would be a very simple exercise if the dataset_2 was:

Patient Test Date  (and possibly test number if you do multiple tests per day)

with one record per test.

 

It would look something like:

Proc sql;

   create table want as

   select a.*, b.patient, b.date(, other variables from dataset_2 )

   from dataset_1 as a left join dataset_2 on a.test=b.test;

quit;

 

Data in the form you have is going to require lots of work continually going back and forth with having to search across multiple variables when the data is as you show.

 

It may be worth transposing the data to the normalized form of one record per test as and intermediate step.

Astounding
PROC Star

While hash tables would work, a format would work just as well here.  The starting point would be to create a format that identifies what tests are in data set 1:

 

data locate;

set dataset1;

start=test;

label='Found';

fmtname='$test';

run

proc format cntlin=locate;

run;

 

This creates a format equivalent to:

 

proc format;

value $test 'A1111542'='Found'

   'A1234567'='Found'

   ...;

 

 

Then use that format when processing dataset2 to see which test is found:

 

data want;

set dataset2;

array c {3} test_1-test_3;

array d {3} date_1-date_3;

do i=1 to 3;

   if put(c{i}, $test.)='Found' then do;

      n_want=i;

      test_want=c{i};

      date_want = d{i};

   end;

end;

drop i;

run;

 

You may need to consider what the output should be if more than one test is wanted for the same person.

 

The code is untested, but should be fine.  Good luck.

beginner
Calcite | Level 5

Hi, @Astounding.  I'd like to try your solution, but am getting an empty dataset (new columns/variables "label" and "fmtname" created, but no entries in dataset).  Is this what I should be seeing?

Astounding
PROC Star

It's probably something simple.  Check (1) missing the SET statement, (2) missing the semicolon at the end of the DATA statement.  If it's not that, post the log from the DATA step ... should be just a few lines there.

beginner
Calcite | Level 5

Hi @Astounding.  Yeah, I was missing a semicolon -- that was silly of me.  Ok, so now I'm ready to try the 2nd half of your solution.  I don't think I understand this part.  For the if statement, we're referring to the new "locate" dataset, right?  How does SAS know that's where we want to look, if we don't have anything about the "locate" dataset in this code?  Thanks for your patient instruction.

Astounding
PROC Star

The LOCATE data set is used to create a format named $test.  Then the DATA step uses that format.

beginner
Calcite | Level 5

Hi @Astounding.  Hm, I'm getting errors, including the following note in my log: "Format $test was not found or could not be loaded". 

beginner
Calcite | Level 5

I think I might have figured out the problem: I had 2 entries with the same value for "test."  I'll work with this a little more & see how it goes.

Astounding
PROC Star

That means you still have to run the PROC FORMAT to create $test.

LinusH
Tourmaline | Level 20
Hash table programming is not as simple as SQL so I barely use it. So if you are a SAS beginner, I would first try to solve this with SQL as I outlined, and exemplified in other replies.
Data never sleeps
beginner
Calcite | Level 5

P.S.  I just modified my original post to clarify: I want to search in dataset_2 "test_n" columns for all (not just one or a handful of) values of "test" from dataset_1.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 2585 views
  • 0 likes
  • 4 in conversation