DATA Step, Macro, Functions and more

multiple variables to match a reference datset

Reply
Super Contributor
Posts: 336

multiple variables to match a reference datset

I have a dataset, with 130000 obs.

I need to check whethr those match any of my reference dataset, if any one of these five variables, the value matchs one of the values of the reference dataset (around 5000 obs), which has three categories linked to those values, "S', "M', or "l', if not matched, it is "normal"

I can use merge one by one, so I need to do five merge

any other option to do that?

thanks.

Super User
Posts: 17,818

Re: multiple variables to match a reference datset

Your question is unclear. 

Super Contributor
Posts: 336

Re: multiple variables to match a reference datset

there are five diagnosis variables, that I need to check those values of these five var, whether the value matches the reference dataset, which includes 5000 obs, and 3 categories

 

I can do five matches, and see whether any of them matches, if match, they will be considered as there 3 categories

if none of these 5 matches, they will be considered as another category

Regular Contributor
Posts: 194

Re: multiple variables to match a reference datset

Hello,

 

Does the following answers your problem (assuming the reference dataset contains a column "category")

 

data want;
merge mydataset (in=a) reference (in=b);
by var1 var2 var3 var4 var5;
if a and not b then category="other";
if a;
run;
Regular Contributor
Posts: 194

Re: multiple variables to match a reference datset

I did not read correctly your description. I thought there had to be a match for all 5 variables. So this won't do what you want.
Super User
Posts: 5,081

Re: multiple variables to match a reference datset

There are two common methods, that would first process the 5,000 records.

 

1. Turn the 5,000 records into a format.  

 

2. Turn the 5,000 records into a hash table.

 

Either way, you would then go through the 5 variables in an array, to see if they match the new form of the 5,000 records.

 

Are you comfortable with either technique (creating a format from a SAS data set, or creating a hash table)?

Super User
Posts: 17,818

Re: multiple variables to match a reference datset

Create a format using your 5000 dataset, make sure to include an Other category. 

 

Then in apply the format and check for the Other category. 

 

A merge wont work as is because you want to check each value against all other values. 

 

Another method is is to make your wide dataset long and then use something like the following:

 

proc sql;

create table matches as

select id, diag

from long where diag not in (select diag from lookup);

quit. 

 

You can merge the matches dataset with your original wide dataset if you wanted. 

Super User
Posts: 17,818

Re: multiple variables to match a reference datset

Another option is temporary arrays. 

 

https://gist.github.com/statgeek/f052b5223fecca066b1f

 

Edit: The long method proposed wouldn't work. I missed part about 3 categories. You can merge the long data with the lookup data to find the category. Do you have the possibility of multiple matches? Ie can a record belong to multiple categories? I'd recommend a format though. It's a common method in analyzing clinical data and its come up as a solution to several of your recent questions so probably worth learning. 

Super User
Posts: 17,818

Re: multiple variables to match a reference datset

This paper may be helpful (not for this question though)

 

http://www.lexjansen.com/wuss/2014/71_Final_Paper_PDF.pdf

 

The link at the end includes the code. 

Ask a Question
Discussion stats
  • 8 replies
  • 358 views
  • 0 likes
  • 4 in conversation