Solved: Re: SAS Merge using proc sql

GeetVish · Posted 01-04-2019 10:15 AM

I have two datasets A and B with a common column severity. I need to identify which row of B has the severity closest to the severity of the each row of A. It is called the nearest neighbor problem.

I am reading in both A and B and then for each row of A search for the nearest neighbor in B using a do loop. However, I am not sure how to create index for B so that a do loop can be done.

Please NOTE : There are no common variables and the subjects are in different studies. The purpose of the analysis is to figure out who in B is the closest subject to each one in A. .

In terms of disease severity.

The final output should be below:-- please HELP AND URGENT :-

Severity_A	subject_A	Severity_B	subject_B
1.2	1	1.5	6
1.2	1	2.6	7
1.2	1	4.5	8
1.2	1	5.1	9
1.2	1	5.7	10
2.3	2	1.5	6
2.3	2	2.6	7
2.3	2	4.5	8
2.3	2	5.1	9
2.3	2	5.7	10
…	…

Any help will be greatly appreciated. Thank you very much!

DanielLangley · Posted 01-04-2019 11:32 AM

Without knowing your expected output we can only have our best guess. Try this.

To get to the final results uncomment the commented code.

Data A;
input Severity subject;
cards;
1.2 1
2.3 2
3.4 3
5.3 4
6.7 5
;
run;
   
Data B;
input Severity subject;
cards;
1.5 6
2.6 7
4.5 8
5.1 9
5.7 10
;
run;

proc sql;
create table c as
	select 	a.Severity as A_Severity
	,		a.Subject as A_Subject
	,		b.Severity as B_Severity
	,		b.Subject as B_Subject
	,		abs(A.Severity - B.Severity) as Distance
	from 	work.A as A
	,		work.B as B

/*	group by a.Subject*/
/*	having calculated Distance = min(Calculated Distance)*/
	order by A_Subject, Calculated Distance
;
quit;

View solution in original post

HB · Posted 01-04-2019 10:26 AM

Actual data to go with your desired results would be helpful.

Do you need unique nearest neighbors? Meaning that if in Set A we have severity values of 1.3 and 1.5, and in Set B we have a 1.4 and a 2.7. Can the 1.3 and the 1.5 both be matched to the 1.4 or do we need to use unique values?

GeetVish · Posted 01-04-2019 10:31 AM

They can both be matched since we want to know whom in B are similar to those in A.

GeetVish · Posted 01-04-2019 11:11 AM

can you help me ?

DanielLangley · Posted 01-04-2019 11:32 AM

Without knowing your expected output we can only have our best guess. Try this.

To get to the final results uncomment the commented code.

Data A;
input Severity subject;
cards;
1.2 1
2.3 2
3.4 3
5.3 4
6.7 5
;
run;
   
Data B;
input Severity subject;
cards;
1.5 6
2.6 7
4.5 8
5.1 9
5.7 10
;
run;

proc sql;
create table c as
	select 	a.Severity as A_Severity
	,		a.Subject as A_Subject
	,		b.Severity as B_Severity
	,		b.Subject as B_Subject
	,		abs(A.Severity - B.Severity) as Distance
	from 	work.A as A
	,		work.B as B

/*	group by a.Subject*/
/*	having calculated Distance = min(Calculated Distance)*/
	order by A_Subject, Calculated Distance
;
quit;

GeetVish · Posted 01-04-2019 11:39 AM

Thanks a lot , I will let you know if this is matching with requirement. But really appreciate your response and help

GeetVish · Posted 01-04-2019 11:46 AM

the answer is perfect, Thanks

ballardw · Posted 01-04-2019 10:27 AM

Please clarify the apparent inconsistency in your text:

@GeetVish wrote:

I have two datasets A and B with a common column severity. I need to identify which row of B has the severity closest to the severity of the each row of A. It is called the nearest neighbor problem.

I am reading in both A and B and then for each row of A search for the nearest neighbor in B using a do loop. However, I am not sure how to create index for B so that a do loop can be done.

Please NOTE : There are no common variables and the subjects are in different studies. The purpose of the analysis is to figure out who in B is the closest subject to each one in A. .

In terms of disease severity.

Please provide examples of appropriate variables from BOTH input data sets, best as data step code to ensure we have the correct names and types. Then provide example output expected from that example input. Your example output as shown also does not show a "nearest neighbor" result but a "five nearest neighbors" result. If the desire is 5 nearest then please say so as that is a significantly different problem.

Since Proc Sql does not support do loops I am having a hard time imagining what you may be contemplating.

GeetVish · Posted 01-04-2019 10:32 AM

Dataset A:
Severity subject
1.2 1
2.3 2
3.4 3
5.3 4
6.7 5

Dataset B:
Severity subject
1.5 6
2.6 7
4.5 8
5.1 9
5.7 10

GeetVish · Posted 01-04-2019 11:11 AM

can you help me ?

Classroom Training Available!