DATA Step, Macro, Functions and more

Using Proc SQL to create a pair of two samples from different datasets based on variables.

Accepted Solution Solved
Reply
New Contributor
Posts: 4
Accepted Solution

Using Proc SQL to create a pair of two samples from different datasets based on variables.

Hello everyone, 

 

I am working on two different datasets, and would like to borrow your wisdom: 

 

Dataset 1: A group of treatment samples with Variables A and B

Dataset 2: A group of control samples with Variables A and B 

 

There are 47 samples in Dataset 1 (Treatment) and +100 samples in Dataset 2 (Control). 

I want to create 47 pairs - 47 from Dataset1 and 47 from Dataset2- with the closest value of Variables A and B. 

 

First of all, I want to minimize the absolute difference in variable A between treatment sample i (from Dataset1) and control sample i (from Dataset2).

 

Second, likewise in the previous step, I want to minimize the absolute difference in variable B between treatment sample i (from Dataset1) and control sample i (from Dataset2). 

 

Based on the minimum difference in variables A and B, I would like to create a pair of treatment and control samples such that they are very close in terms of the value of variables A and B. 

 

I tried the following code, and the resulting dataset gave 47 pairs, matched in terms of minimum difference in the variables between treatment and control samples. However, there are a lot of duplicates- 

 

For example, 

ID: A from Dataset 1 was paired with ID: 1 from Dataset 2 based on the matching criterion. However, ID: B from Dataset 1 was also paired with ID: 1 from Dataset 2 because the matching variable of ID: B is very close to that of ID: R. 

 

I would like to eliminate these duplicates (overlaps). I would like only one control sample (from Dataset 2) to be matched with one treatment sample (from Dataset 1), but codes that I came up with did not work as I want it to be. 

 

proc sql;

create table dataset as

select a.id, a.variableA as variableA_treatment, a.variableB as variableB_treatment, b.id, b.variableA as variableA_control, b.variableB as variableB_control

from dataset1 as a left join dataset2 as b

on a.variableA > b.variableA or a.variableA < b.variableA or a.variableB > b.variableB or a.variableB < b.variableB

group by a.id

having abs(a.variableA-b.VariableA)=min(abs(a.variableA-b.variableA))
;

quit; 

Any advice and suggestion will be appreciated. 

 

Thank you very much for your time. 

 

Kishio 


Accepted Solutions
Solution
3 weeks ago
Super User
Posts: 19,822

Re: Using Proc SQL to create a pair of two samples from different datasets based on variables.

The Mayo Clinic has a bunch of algorithms on their website that illustrates case-control matching algorithms, greedy algorithms and variable matching. And the full SAS code is included. 

 

http://www.mayo.edu/research/departments-divisions/department-health-sciences-research/division-biom...

 

PS. Data management is typically for DataFlux/DI Studio questions so I'll move this post to the Base SAS forum

View solution in original post


All Replies
Solution
3 weeks ago
Super User
Posts: 19,822

Re: Using Proc SQL to create a pair of two samples from different datasets based on variables.

The Mayo Clinic has a bunch of algorithms on their website that illustrates case-control matching algorithms, greedy algorithms and variable matching. And the full SAS code is included. 

 

http://www.mayo.edu/research/departments-divisions/department-health-sciences-research/division-biom...

 

PS. Data management is typically for DataFlux/DI Studio questions so I'll move this post to the Base SAS forum

New Contributor
Posts: 4

Re: Using Proc SQL to create a pair of two samples from different datasets based on variables.

Thank you very much! I found a Matching MACRO to help me solve the problem
I had.
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 2 replies
  • 137 views
  • 1 like
  • 2 in conversation