Solved: Re: Shortest distance match.

AshPatel · Posted 02-28-2021 10:12 AM

Hello there, Can someone help me with the following. I have the following two datasets. Ned to join Dataset_1 with Dataset_2 on ZIP and City. Need to match only once based on the shortest Distance_mile. If match, then create a dataset (e.g. matched), If do not match, then create a dataset (e.g. Notmatched). Again, if Clientno match twice, I need to pick only the one with the shortest distance. Sorry, the Dataset_2 is huge, so only sample fake data here

Dataset_1			Dataset_2
ZIP	ClientNo	City	ZIP	SEQ	Distance_Miles	City
14215	A52498	Buffalo	14215	1	1.2	Sloan
14213	A09533	Buffalo	14215	2	2.0	Amherst
14207	A63397	Buffalo	14215	3	2.5	Buffalo
14216	A77384	Buffalo	14215	4	6.0	Seneca
14220	A73552	Buffalo	14215	5	10.0	Kenmore
14214	A31165	Buffalo	14213	1	0.4	Buffalo
14211	A06391	Bristol	14213	2	1.5	Kenmore
14206	A71649	Bristol	14213	3	8.0	Blasdell
14222	A66949	Bristol	14213	4	12.0	Buffalo
14210	A43865	Bristol	14213	5	15.0	Blasdell
14201	A95572	Cedar Grove	14201	1	1.1	Cedar Grove
14208	A06957	Cedar Grove	14201	2	5.3	Sloan
14204	A91374	Cedar Grove	14201	3	9.3	Amherst
14209	A58062	Cedar Grove	14201	4	15.5	Buffalo
14212	A54498	Cedar Grove	14201	5	30.1	Mo Town
			14216	1	2.0	Cheektowaga
			14216	2	7.2	Great Hill
			14216	3	8.8	Buffalo
			14216	4	14.2	Fort Erie
			14216	5	21.6	Ajax
			14220	1	3.0	Sloan
			14220	2	4.6	Amherst
			14220	3	8.0	Cheektowaga
			14220	4	14.9	Great Hill
			14220	5	26.2	Milton

Shmuel · Posted 03-01-2021 09:48 AM

If you accept the solution, please assign it as that.

View solution in original post

PeterClemmensen · Posted 02-28-2021 01:29 PM

What do you mean by "if Clientno match twice" ?

ClientNo is not in both data sets ?

Can you show us your desired result given this sample data?

The DATA to DATA Step Macro
Blog: SASnrd

Shmuel · Posted 02-28-2021 01:31 PM

You should post test data as data steps using infile method with datalines - as in my proposed code. Most contributors are short in time and will not trouble to type the data step or copy line by line as I did. Even I did not copy all 2nd data set lines.

Anyway you can try and check next code:

data have1;
  infile datalines dlm='09'x truncover;
  input zip ClientNo $ city $char30.;
datalines;
14215	A52498	Buffalo
14213	A09533	Buffalo
14207	A63397	Buffalo
14216	A77384	Buffalo
14220	A73552	Buffalo
14214	A31165	Buffalo
14211	A06391	Bristol
14206	A71649	Bristol
14222	A66949	Bristol
14210	A43865	Bristol
14201	A95572	Cedar Grove
14208	A06957	Cedar Grove
14204	A91374	Cedar Grove
14209	A58062	Cedar Grove
14212	A54498	Cedar Grove
;run;
data have2;
  infile datalines dlm='09'x truncover;
  input zip SEQ Distance_Miles city $char30.;
datalines;
14215	1	1.2	Sloan
14215	2	2.0	Amherst
14215	3	2.5	Buffalo
14215	4	6.0	Seneca
14215	5	10.0	Kenmore
14213	1	0.4	Buffalo
14213	2	1.5	Kenmore
14213	3	8.0	Blasdell
14213	4	12.0	Buffalo
14213	5	15.0	Blasdell
14201	1	1.1	Cedar Grove
14201	2	5.3	Sloan
14201	3	9.3	Amherst
14201	4	15.5	Buffalo
14201	5	30.1	Mo Town
; run;
proc sort data=have1; by zip city; run;
proc sort data=have2; by zip city; run;
data temp want2;
 merge have1 (in=in1)
       have2 (in=in2);
  by zip city; 
     if in1 then do;
        if in2 then output temp; 
        else output want2;
     end;
run;
proc sort data=temp; by zip city Distance_Miles; run;
data want1;
 set temp;
  by zip city;
     if first.city;
run;

AshPatel · Posted 02-28-2021 03:10 PM

sorry, I should have said match based on ZIP and City. I would like to have the match based on the shortest distance (e.g. in case of zip 14213, the output would be SEQ=1 and not SEQ=3. If there is no match for a record in Dataset_1, then need to have a seprate output for it (e.g. ZIP 14208)

Tom · Posted 02-28-2021 03:24 PM

What is the expected output for that data?

Also please just post the data a text, much easier to handle.

data dataset_1;
  infile cards dsd dlm='|' truncover ;
  input ZIP :$5. ClientNo :$20. City :$80. ;
cards;
14215|A52498|Buffalo
14213|A09533|Buffalo
14207|A63397|Buffalo
14216|A77384|Buffalo
14220|A73552|Buffalo
14214|A31165|Buffalo
14211|A06391|Bristol
14206|A71649|Bristol
14222|A66949|Bristol
14210|A43865|Bristol
14201|A95572|Cedar Grove
14208|A06957|Cedar Grove
14204|A91374|Cedar Grove
14209|A58062|Cedar Grove
14212|A54498|Cedar Grove
;

data dataset_2;
  infile cards dsd dlm='|' truncover;
  input ZIP :$5. SEQ Distance_Miles City :$80. ;
cards;
14215|1|1.2|Sloan
14215|2|2|Amherst
14215|3|2.5|Buffalo
14215|4|6|Seneca
14215|5|10|Kenmore
14213|1|0.4|Buffalo
14213|2|1.5|Kenmore
14213|3|8|Blasdell
14213|4|12|Buffalo
14213|5|15|Blasdell
14201|1|1.1|Cedar Grove
14201|2|5.3|Sloan
14201|3|9.3|Amherst
14201|4|15.5|Buffalo
14201|5|30.1|Mo Town
14216|1|2|Cheektowaga
14216|2|7.2|Great Hill
14216|3|8.8|Buffalo
14216|4|14.2|Fort Erie
14216|5|21.6|Ajax
14220|1|3|Sloan
14220|2|4.6|Amherst
14220|3|8|Cheektowaga
14220|4|14.9|Great Hill
14220|5|26.2|Milton
;

Shmuel · Posted 02-28-2021 03:50 PM

@AshPatel wrote:

sorry, I should have said match based on ZIP and City. I would like to have the match based on the shortest distance (e.g. in case of zip 14213, the output would be SEQ=1 and not SEQ=3. If there is no match for a record in Dataset_1, then need to have a separate output for it (e.g. ZIP 14208)

That is exactly what my code do. Have you run the code on your data? Have you checked the results?

I named the test data have1 (instead dataset_1) and have2 (instead dataset_2).

The output named want1 are the matched data, want2 the not matched.

AshPatel · Posted 03-01-2021 08:43 AM

Yes, the solution worked perfactly. Thank you ever so much.

Shmuel · Posted 03-01-2021 09:48 AM

If you accept the solution, please assign it as that.

AshPatel · Posted 03-04-2021 03:10 PM

The solution that was provided earlier this week worked fine, however; I have been asked to modify the criteria as below. I tried modifying the data step but unable to get the expected output. I can take dividual Miles_Rank one at a time in a separate data step but in our actual data file, we have 10 unique Miles_Ranks (1 to10) for each ZIP. I believe, there must be a way to have the Miles Ranks logic build in one or two data steps. I am attaching revised raw data (have1 and have2) and also included expected output sheet. Once again thank you so much!

1. Match on zip, city and state (have1 and have2)
2. Want1 all records from have1 with ClientNo where there is a match based on the shortest distance (Miles_Rank)
3. Also, Want2 a list of records from have1 where there is no match

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!