Hello again,
I am back for another round with interleaving. First both @ChrisNZ and smantha provided great insight to my problem earlier - thank you again. In fact, I thought the code (shown below) gave me what I needed. Unfortunately, I spoke too soon.
Briefly to summarize the datasets I am working with, DATASET 1 The variables are"
zipcode,
county,
CCO_1 to CCO_15 (I can't have just one CCO field since more than one CCO overlap in several parts of the state across several zipcodes)
PCPCH_clinic_ID,
clinic_address,
recognition tier of the clinic,
no. of physicians in clinic
no. of other care providers in clinic
DATASET 2
Provider_name,
Specialty,
zipcode,
address of practice,
. . . (other variables)
I interleaved the two sets using zipcode. So the CCO, PCPCH_clinic address, and provider location address are all aligned. So a schematic of the interleaved dataset is (with some illustrative values)
these variables are from DATASET 1 and the important variables from DS2
zipcode county CCO_1 .... CCO_15 PCPCH_clinic_address Provider provider_location_add
97056 Multnomah HealthShare 3008 Sw Salmon . . ...
Dr. Smith 128 S Burns St
Dr. Brown 15595 Sw Teton
Dr. Batra 3008 Sw Salmon
Dr. Glenn 2223 Sw Salmon
Dr. Williams 3008 Sw Salmon
and so on
I want to connect clinic by address with only the physicians at the same address. This is the problem that the code below addresses.
Data CCO_Area.CCO_PCPCH_clinic_interleaved_1;
Set CCO_Area.CCO_PCPCH_clinic_interleaved;
Address = lower(address);
If Address ne ' ' the Add_fac = Address;
Retain Add_fac;
distance = compged(Add_fac,Address);
If distance < 50;
run;
However, I am not getting the result. The result would look like
zipcode county CCO_1 .... CCO_15 PCPCH_clinic_address Provider provider_location_add
97056 Multnomah HealthShare 3008 Sw Salmon . . ...
Dr. Batra 3008 Sw Salmon
and Williams et al.
Oregon has 36 counties and, after using zipcodes to align county, CCO, and clinic there are 647 obs in DATASET 1 and over 57K (roughly) in the provider file or DATASET 2.
So the final dataset will have 647 separate county/CCO/clinic_addresses matched against a varying number of providers by addresses depending on clinic size, county (urban, rural, frontier) and so on.
I tried using Soundex
Data CCO_Area.CCO_PCPCH_Prov_Interleaved_C;
Set CCO_Area.CCO_PCPCH_Prov_Interleaved_;
Where Address_1 =* '15950 Sw Millikan Way';
run;
That did identify every provider at the address specified which is also the PCPCH-clinic address for the zipcode. But it also pulled variants of the desired address (like 15950 Sw Millikan Way Ste 5, etc).
I have one interleaved dataset with the two address fields that I want to match together but can't seem to
get the result needed.
Again your help and expertise is much appreciated.
wlierman
Can you please add a sample of the data that you have as input and the one that you want as output. The input and output do not match. I used a zipcode as the variable on by statement as I did not have additional fields that could be on the by statement. If you can use cco1 - cco5 as additional fileds then you might get a closer match. Please provide the inputs of each of your input tables separately.
Hello,
Ok. The best I could do was put together three csv files
1- CCO_with_PCPCH is the clinical dataset (important variable/s address)
2 - Prov_in_state_with_NPI_Address_1 is the provider dataset (important variable/s address_1)
those are the input datasets
3- CCO_PCPCH_Prov_Interleaved
is the interleaved SAS Output dataset
Ideally, what I would like would be in the interleaved output dataset to have each clinic address matched with only the providers at that address.
Thank you for your continued assistance. Appreciate it.
wlierman
Are you saying you haven't sasified the CSV files you are trying to interleave?
But you must have, since you have shown tried a program to read the data originating in one or more of those csv file. Instead of each of us writing a sas program to reproduce the conversion of the csv's to sas datasets, why not provide the code you use (and the in-line csv data) to make the initial sas data sets?
Help us help y
> Ideally, what I would like would be in the interleaved output dataset to have each clinic address matched with only the providers at that address.
Show an example of 2 addresses that don't match when they should
@wlierman wrote:
Hello,
Ok. The best I could do was put together three csv files
1- CCO_with_PCPCH is the clinical dataset (important variable/s address)
2 - Prov_in_state_with_NPI_Address_1 is the provider dataset (important variable/s address_1)
those are the input datasets
3- CCO_PCPCH_Prov_Interleaved
is the interleaved SAS Output dataset
Ideally, what I would like would be in the interleaved output dataset to have each clinic address matched with only the providers at that address.
Thank you for your continued assistance. Appreciate it.
wlierman
Thanks for providing data, now please post the data-steps you have written to read the files, so that we have the same datasets.
Hello,
I have attached the coding below.
Thank you.
PS There should be enough data to work on the '15950 Sw Millikan Way' address/address_1.
If that one works the others will likely also.
wlierman
@wlierman wrote:
Hello,
I have attached the coding below.
Thank you.
PS There should be enough data to work on the '15950 Sw Millikan Way' address/address_1.
If that one works the others will likely also.
wlierman
Sorry, but i can't open any office-file, the button "Insert SAS Code" exists to post code.
I made the changes you suggested.
Now the addresses match up. But the Name field for the provider is null.
What I am hoping for is this kind of structure
clinic_name address provider_ address
clinic a 123 Home St
Smith 123 Home St
Bowen 123 Home St
. . . . . .
clinic k 322 A Ave
Green 322 A Ave
Collier 322 A Ave
Tanner 322 A Ave
and so on for each PCPCH_clinic
The changes now match addresses. But no provider names appear.
wlierman
I have a 85-90% percent solution. Using the following code
Data CCO_AREA.CCO_PCPCH_Prov_COMP;
Set CCO_Area.CCO_with_PCPCH_skinny;
tmp1 = soundex(Address);
Do i = 1 to nobs;
Set CCO_Area.Prov_in_State_w_NPI_Add_skinny point = i nobs=nobs;
tmp2 = soundex(Address_1);
dif = compged(tmp1, tmp2);
If dif = 0 then do; /* using a dif < 2 gives the same number of matches, so I set dif = 0 */
possible_matches = 'Yes';
drop i tmp1 tmp2;
output;
end;
end;
run;
The fuzzy matching algorithm runs into trouble (or so I notice) in an area of dense addresses (like an urban core or city center where there are many medical offices, clinics, or even hospitals. If the addresses are on different streets then the algorithm makes the correct match almost 100%. It doesn't do as well when there are several medical offices (or this could be any type of office) one right after the other on the same street. Then everything is sort of thrown into the solution. But the fuzzy algorithm has provided a big step forward for my analysis.
Thank you for all your help.
wlierman
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.