I am working to combine two SAS datasets. One has zipcodes, coordinated_care_organizations, and the PCPCH-clinic associated with that zipcode. The second is a provider data set with provider, specialty, address, zipcode and several other fields. I have connected the PCPCH-clinic and providers by zipcode - this was to ensure that the clinic and providers lined up for each CCO.
The next goal (and what I am writing about) is to match the providers that work in the PCPCH-clinic by address of the clinic. I ended up interleaving the two datasets and this following schematic is what the
output dataset looks like after the interleave operation. The numbers are illustrative.
zipcode county CCO address provider address
96007 Lincoln Healthcare_CCO 234 SW Green St . . ....
R Smith 12345 Brown Rd
E Collins 234 SW Green St
R Batra 4355 Plaza Dr
J Frank 235 SW Green St
B Jones 234 SW Green St
R Budden 6775 Evergreen Cir
P Smith 234 SW Green St
. . .
While the PCPCH-clinic is identified by zipcode there are several clinics or practices within that zipcode
that are not a PCPCH-clinic. So what I would like to do is match only the providers at the same address
as the PCPCH-clinic. So the desired result would look like:
zipcode county CCO address provider address
96007 Lincoln Healthcare_CCO 234 SW Green St . . ...
E Collins 234 SW Green St
B Jones 234 SW Green St
P Smith 234 SW Green St
An so on for each PCPCH-clinic identified by zipcode for the state of OR.
My question is what approach should I take to link the PCPCH-clinic to the providers that practice in the
clinic?
Thank you in advance for your help.
Walt Lierman
I put in datalines code as I did not have data and had to generate it. In your case you have to modify the code below
Data CCO_Area.CCO_PCPCH_clinic_interleaved_1;
Set CCO_Area.CCO_PCPCH_clinic_interleaved;
Address = lower(address);
If Address ne ' ' the Add_fac = Address;
Retain Add_fac;
distance = compged(Add_fac,Address);
If distance < 50;
run;
Try adding these lines?
if ADDRESS_CLINIC ne ' ' then A = ADDRESS_CLINIC;
retain A;
if A = ADDRESS_PROVIDER;
Hello,
Thank you for responding. I added those lines but the data set is empty
NOTE: There were 51749 observations read from the data set CCO_AREA.CCO_PCPCH_PROV_INTERLEAVED.
NOTE: The data set CCO_AREA.CCO_PCPCH_PROV_INTERLEAVED_1 has 0 observations and 42 variables.
NOTE: DATA statement used (Total process time):
real time 0.05 seconds
cpu time 0.04 seconds
This is how I used the code:
Data CCO_Area.CCO_PCPCH_clinic_interleaved_1;
Set CCO_Area.CCO_PCPCH_clinic_interleaved;
If Address ne ' ' the Add_fac = Address;
Retain Add_fac;
If Add_fac = Address_1;
run;
Address is the field name in the clinic dataset and Address_1 the field name in the provider dataset.
One thing that I didn't mention in my original post is that the address fields like many text string data are
prone to slight anomalies like one set may have 345 SW Pine St Ste 2 and the other may have 345 Sw Pine Stste . These types of coding problems occur in very few records actually.
But this result (0 obs) occurs over the entire dataset.
wlierman
1. Look at the usual suspects
2. If there are slight variations in your string use edit distance such as compged, complev and spedis. set a threshold to filter.
informat zipcode 5. cco $32. county $32. address $64.;
input zipcode county $ cco $ address $;
infile cards dlm='|';
cards;
96007|Lincoln|Healthcare_CCO|234 SW Green St
96008|Lincoln|Healthcare_CCO|6775 Evergreen Cir ste 200
;;;
run;
proc sort; by zipcode address; run;
data cc2;
informat zipcode 5. provider $32. address $64.;
input zipcode provider $ address $;
infile cards dlm='|';
cards;
96007|R Smith|12345 Brown Rd
96007|E Collins|234 SW Green St
96007|R Batra|4355 Plaza Dr
96007|J Frank|235 SW Green St
96007|B Jones|234 SW Green St ste2
96008|R Budden|6775 Evergreen Cir
96007|P Smith|234 SW Green St
;;;
run;
proc sort; by zipcode address; run;
data cc3;
length prv_address $32.;
set cc1(in=cc1) cc2(in=cc2);
by zipcode;
retain prv_address ' ';
if cc1 then prv_address = address;
distance=spedis(strip(tranwrd(address,' ','')),strip(tranwrd(prv_address,' ','')));
put _all_;
if distance < 50;
run;
Thank you both for your help.
I like the coding provided and the approach. But in order to implement the coding approach I need to ask another question.
The PCPCH_clinic data set has 649 obs whereas the provider dataset has over 51K obs.
How do I implement the delimiter step using the pipe | over two data sets like this?
Thank you.
wlierman
I put in datalines code as I did not have data and had to generate it. In your case you have to modify the code below
Data CCO_Area.CCO_PCPCH_clinic_interleaved_1;
Set CCO_Area.CCO_PCPCH_clinic_interleaved;
Address = lower(address);
If Address ne ' ' the Add_fac = Address;
Retain Add_fac;
distance = compged(Add_fac,Address);
If distance < 50;
run;
smantha,
Thank you for your help and insight.
I really appreciate it as this overcomes a hurdle to completion.
Take care.
wlierman
You are welcome! We should thank @ChrisNZ as he started the original solution and I built on top of it.
Also did not mean to overlook the work from @ChrisNZ
Thank you for your help. I appreciate it.
wlierman
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.