Hello there, Can someone help me with the following. I have the following two datasets. Ned to join Dataset_1 with Dataset_2 on ZIP and City. Need to match only once based on the shortest Distance_mile. If match, then create a dataset (e.g. matched), If do not match, then create a dataset (e.g. Notmatched). Again, if Clientno match twice, I need to pick only the one with the shortest distance. Sorry, the Dataset_2 is huge, so only sample fake data here
Dataset_1 | Dataset_2 | ||||||
ZIP | ClientNo | City | ZIP | SEQ | Distance_Miles | City | |
14215 | A52498 | Buffalo | 14215 | 1 | 1.2 | Sloan | |
14213 | A09533 | Buffalo | 14215 | 2 | 2.0 | Amherst | |
14207 | A63397 | Buffalo | 14215 | 3 | 2.5 | Buffalo | |
14216 | A77384 | Buffalo | 14215 | 4 | 6.0 | Seneca | |
14220 | A73552 | Buffalo | 14215 | 5 | 10.0 | Kenmore | |
14214 | A31165 | Buffalo | 14213 | 1 | 0.4 | Buffalo | |
14211 | A06391 | Bristol | 14213 | 2 | 1.5 | Kenmore | |
14206 | A71649 | Bristol | 14213 | 3 | 8.0 | Blasdell | |
14222 | A66949 | Bristol | 14213 | 4 | 12.0 | Buffalo | |
14210 | A43865 | Bristol | 14213 | 5 | 15.0 | Blasdell | |
14201 | A95572 | Cedar Grove | 14201 | 1 | 1.1 | Cedar Grove | |
14208 | A06957 | Cedar Grove | 14201 | 2 | 5.3 | Sloan | |
14204 | A91374 | Cedar Grove | 14201 | 3 | 9.3 | Amherst | |
14209 | A58062 | Cedar Grove | 14201 | 4 | 15.5 | Buffalo | |
14212 | A54498 | Cedar Grove | 14201 | 5 | 30.1 | Mo Town | |
14216 | 1 | 2.0 | Cheektowaga | ||||
14216 | 2 | 7.2 | Great Hill | ||||
14216 | 3 | 8.8 | Buffalo | ||||
14216 | 4 | 14.2 | Fort Erie | ||||
14216 | 5 | 21.6 | Ajax | ||||
14220 | 1 | 3.0 | Sloan | ||||
14220 | 2 | 4.6 | Amherst | ||||
14220 | 3 | 8.0 | Cheektowaga | ||||
14220 | 4 | 14.9 | Great Hill | ||||
14220 | 5 | 26.2 | Milton |
If you accept the solution, please assign it as that.
What do you mean by "if Clientno match twice" ?
ClientNo is not in both data sets ?
Can you show us your desired result given this sample data?
You should post test data as data steps using infile method with datalines - as in my proposed code. Most contributors are short in time and will not trouble to type the data step or copy line by line as I did. Even I did not copy all 2nd data set lines.
Anyway you can try and check next code:
data have1;
infile datalines dlm='09'x truncover;
input zip ClientNo $ city $char30.;
datalines;
14215 A52498 Buffalo
14213 A09533 Buffalo
14207 A63397 Buffalo
14216 A77384 Buffalo
14220 A73552 Buffalo
14214 A31165 Buffalo
14211 A06391 Bristol
14206 A71649 Bristol
14222 A66949 Bristol
14210 A43865 Bristol
14201 A95572 Cedar Grove
14208 A06957 Cedar Grove
14204 A91374 Cedar Grove
14209 A58062 Cedar Grove
14212 A54498 Cedar Grove
;run;
data have2;
infile datalines dlm='09'x truncover;
input zip SEQ Distance_Miles city $char30.;
datalines;
14215 1 1.2 Sloan
14215 2 2.0 Amherst
14215 3 2.5 Buffalo
14215 4 6.0 Seneca
14215 5 10.0 Kenmore
14213 1 0.4 Buffalo
14213 2 1.5 Kenmore
14213 3 8.0 Blasdell
14213 4 12.0 Buffalo
14213 5 15.0 Blasdell
14201 1 1.1 Cedar Grove
14201 2 5.3 Sloan
14201 3 9.3 Amherst
14201 4 15.5 Buffalo
14201 5 30.1 Mo Town
; run;
proc sort data=have1; by zip city; run;
proc sort data=have2; by zip city; run;
data temp want2;
merge have1 (in=in1)
have2 (in=in2);
by zip city;
if in1 then do;
if in2 then output temp;
else output want2;
end;
run;
proc sort data=temp; by zip city Distance_Miles; run;
data want1;
set temp;
by zip city;
if first.city;
run;
sorry, I should have said match based on ZIP and City. I would like to have the match based on the shortest distance (e.g. in case of zip 14213, the output would be SEQ=1 and not SEQ=3. If there is no match for a record in Dataset_1, then need to have a seprate output for it (e.g. ZIP 14208)
What is the expected output for that data?
Also please just post the data a text, much easier to handle.
data dataset_1;
infile cards dsd dlm='|' truncover ;
input ZIP :$5. ClientNo :$20. City :$80. ;
cards;
14215|A52498|Buffalo
14213|A09533|Buffalo
14207|A63397|Buffalo
14216|A77384|Buffalo
14220|A73552|Buffalo
14214|A31165|Buffalo
14211|A06391|Bristol
14206|A71649|Bristol
14222|A66949|Bristol
14210|A43865|Bristol
14201|A95572|Cedar Grove
14208|A06957|Cedar Grove
14204|A91374|Cedar Grove
14209|A58062|Cedar Grove
14212|A54498|Cedar Grove
;
data dataset_2;
infile cards dsd dlm='|' truncover;
input ZIP :$5. SEQ Distance_Miles City :$80. ;
cards;
14215|1|1.2|Sloan
14215|2|2|Amherst
14215|3|2.5|Buffalo
14215|4|6|Seneca
14215|5|10|Kenmore
14213|1|0.4|Buffalo
14213|2|1.5|Kenmore
14213|3|8|Blasdell
14213|4|12|Buffalo
14213|5|15|Blasdell
14201|1|1.1|Cedar Grove
14201|2|5.3|Sloan
14201|3|9.3|Amherst
14201|4|15.5|Buffalo
14201|5|30.1|Mo Town
14216|1|2|Cheektowaga
14216|2|7.2|Great Hill
14216|3|8.8|Buffalo
14216|4|14.2|Fort Erie
14216|5|21.6|Ajax
14220|1|3|Sloan
14220|2|4.6|Amherst
14220|3|8|Cheektowaga
14220|4|14.9|Great Hill
14220|5|26.2|Milton
;
@AshPatel wrote:
sorry, I should have said match based on ZIP and City. I would like to have the match based on the shortest distance (e.g. in case of zip 14213, the output would be SEQ=1 and not SEQ=3. If there is no match for a record in Dataset_1, then need to have a separate output for it (e.g. ZIP 14208)
That is exactly what my code do. Have you run the code on your data? Have you checked the results?
I named the test data have1 (instead dataset_1) and have2 (instead dataset_2).
The output named want1 are the matched data, want2 the not matched.
Yes, the solution worked perfactly. Thank you ever so much.
If you accept the solution, please assign it as that.
The solution that was provided earlier this week worked fine, however; I have been asked to modify the criteria as below. I tried modifying the data step but unable to get the expected output. I can take dividual Miles_Rank one at a time in a separate data step but in our actual data file, we have 10 unique Miles_Ranks (1 to10) for each ZIP. I believe, there must be a way to have the Miles Ranks logic build in one or two data steps. I am attaching revised raw data (have1 and have2) and also included expected output sheet. Once again thank you so much!
1. Match on zip, city and state (have1 and have2)
2. Want1 all records from have1 with ClientNo where there is a match based on the shortest distance (Miles_Rank)
3. Also, Want2 a list of records from have1 where there is no match
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.