BookmarkSubscribeRSS Feed
wlierman
Lapis Lazuli | Level 10

Hello again,

I am back for another round with interleaving. First both @ChrisNZ and smantha provided great insight to my problem earlier - thank you again. In fact, I thought the code (shown below) gave me what I needed. Unfortunately, I spoke too soon. 

Briefly to summarize the datasets I am working with, DATASET 1  The variables are"

zipcode,

county,

CCO_1 to CCO_15 (I can't have just one CCO field since more than one CCO overlap in several parts of the state across several zipcodes)

PCPCH_clinic_ID,

clinic_address,

recognition tier of the clinic,

no.  of physicians in clinic

no. of other care providers in clinic

 

DATASET 2

 

Provider_name,

Specialty,

zipcode,

address of practice,

. . .  (other variables)

I interleaved the two sets using zipcode. So the CCO, PCPCH_clinic address, and provider location address are all aligned. So a schematic of the interleaved dataset is  (with some illustrative values)

these variables are from DATASET 1                                                 and the important variables from DS2

zipcode    county      CCO_1 .... CCO_15    PCPCH_clinic_address         Provider    provider_location_add

97056   Multnomah   HealthShare              3008 Sw Salmon                        .                         .                  ...

                                                                                                                   Dr. Smith       128 S Burns St

                                                                                                                   Dr. Brown       15595 Sw Teton

                                                                                                                   Dr. Batra           3008 Sw Salmon

                                                                                                                   Dr. Glenn          2223 Sw Salmon

                                                                                                                   Dr. Williams       3008 Sw Salmon

                                                                                                                            and so on

I want to connect clinic by address with only the physicians at the same address.  This is the problem that the code below addresses.

Data CCO_Area.CCO_PCPCH_clinic_interleaved_1;

   Set CCO_Area.CCO_PCPCH_clinic_interleaved;
   Address = lower(address);
   If Address ne '  ' the Add_fac = Address;

     Retain Add_fac;
   distance = compged(Add_fac,Address);
  If distance < 50;

run;

 

However, I am not getting the result.  The result would look like

 

zipcode    county      CCO_1 .... CCO_15    PCPCH_clinic_address    Provider    provider_location_add

97056   Multnomah   HealthShare              3008 Sw Salmon                        .                         .                  ...

                                                                                                            Dr. Batra           3008 Sw Salmon

                                                                                                           and Williams et al.

 

Oregon has 36 counties and, after using zipcodes to align county, CCO, and clinic there are 647 obs in DATASET 1 and over 57K (roughly) in the provider file or DATASET 2.

 

So the final dataset will have 647 separate county/CCO/clinic_addresses matched against a varying number of providers  by addresses depending on clinic size, county (urban, rural, frontier) and so on.

 

I tried using Soundex

 

Data CCO_Area.CCO_PCPCH_Prov_Interleaved_C;
Set CCO_Area.CCO_PCPCH_Prov_Interleaved_;
Where Address_1 =* '15950 Sw Millikan Way';
run;

 

That did identify every provider at the address specified which is also the PCPCH-clinic address for the zipcode. But it also pulled variants of the desired address (like 15950 Sw Millikan Way Ste 5, etc).

 

 I have one interleaved dataset with the two address fields that I want to match together but can't seem to

get the result needed.

 

Again your help and expertise is much appreciated.

 

wlierman                                 

11 REPLIES 11
smantha
Lapis Lazuli | Level 10

Can you please add a sample of the data that you have as input and the one that you want as output. The input and output do not match. I used a zipcode as the variable on by statement as I did not have additional fields that could be on the by statement. If you can use cco1 - cco5 as additional fileds then you might get a closer match. Please provide the inputs of each of your input tables separately.

wlierman
Lapis Lazuli | Level 10

Hello,

 

Ok.  The best I could do was put together three csv files

1- CCO_with_PCPCH   is the clinical dataset  (important variable/s address)

2 - Prov_in_state_with_NPI_Address_1   is the provider dataset (important variable/s address_1)

 

those are the input datasets

3- CCO_PCPCH_Prov_Interleaved

 

is the interleaved SAS Output dataset

 

Ideally, what I would like would be in the interleaved output dataset to have each clinic address matched with only the providers at that address.

 

Thank you for your continued assistance.  Appreciate it.

 

wlierman

 

mkeintz
PROC Star

Are you saying you haven't sasified the CSV files you are trying to interleave?

 

But you must have, since you have shown tried a program to read the data originating in one or more of those csv file.  Instead of each of us writing a sas program to reproduce the conversion of the csv's to sas datasets, why not provide the code you use (and the in-line csv data) to make the initial sas data sets?

 

Help us help y

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ChrisNZ
Tourmaline | Level 20

> Ideally, what I would like would be in the interleaved output dataset to have each clinic address matched with only the providers at that address.

Show an example of 2 addresses that don't match when they should

andreas_lds
Jade | Level 19

@wlierman wrote:

Hello,

 

Ok.  The best I could do was put together three csv files

1- CCO_with_PCPCH   is the clinical dataset  (important variable/s address)

2 - Prov_in_state_with_NPI_Address_1   is the provider dataset (important variable/s address_1)

 

those are the input datasets

3- CCO_PCPCH_Prov_Interleaved

 

is the interleaved SAS Output dataset

 

Ideally, what I would like would be in the interleaved output dataset to have each clinic address matched with only the providers at that address.

 

Thank you for your continued assistance.  Appreciate it.

 

wlierman

 


Thanks for providing data, now please post the data-steps you have written to read the files, so that we have the same datasets.

wlierman
Lapis Lazuli | Level 10

Hello,

 

I have attached the coding below.

 

Thank you.

 

PS  There should be enough data to work on the '15950 Sw Millikan Way'   address/address_1.

If that one works the others will likely also.

 

wlierman

andreas_lds
Jade | Level 19

@wlierman wrote:

Hello,

 

I have attached the coding below.

 

Thank you.

 

PS  There should be enough data to work on the '15950 Sw Millikan Way'   address/address_1.

If that one works the others will likely also.

 

wlierman


Sorry, but i can't open any office-file, the button "Insert SAS Code" exists to post code.

smantha
Lapis Lazuli | Level 10
In the first data step you have if address_lin = address1; remove that line. Change the condition <50 to less than 2
wlierman
Lapis Lazuli | Level 10

I made the changes you suggested.

Now the addresses match up.  But the Name field for the provider is null.

 

What I am hoping for is this kind of structure

                                                                          clinic_name     address                 provider_         address

                                                                            clinic a        123 Home St            

                                                                                                                                  Smith              123 Home St

                                                                                                                                  Bowen            123 Home St

                                                                                       . . .                                              . . .

                                                                           clinic k          322 A Ave      

                                                                                                                                   Green              322 A Ave

                                                                                                                                    Collier             322 A Ave

                                                                                                                                    Tanner            322 A Ave

                                                         and so on for each PCPCH_clinic

 

The changes now match addresses.  But no provider names appear.

 

wlierman

                                                          

wlierman
Lapis Lazuli | Level 10
After looking more closely, the code only replicates the text string for Address. The text string for
Address_1 needs to be matched. Something like a loop through the Address_1 for each Address as long as Address_1 = Address. I saw something in a Python tutorial that made me think of that.
I have to dig up that tutorial.
wlierman
Lapis Lazuli | Level 10

I have a 85-90% percent solution. Using the following code

 

Data CCO_AREA.CCO_PCPCH_Prov_COMP;
Set CCO_Area.CCO_with_PCPCH_skinny;
tmp1 = soundex(Address);
Do i = 1 to nobs;
    Set CCO_Area.Prov_in_State_w_NPI_Add_skinny point = i nobs=nobs;
    tmp2 = soundex(Address_1);
   dif = compged(tmp1, tmp2);
   If dif = 0 then do;  /* using a dif < 2 gives the same number of matches, so I set dif = 0 */
      possible_matches = 'Yes';
     drop i tmp1 tmp2;
    output;
  end;
end;
run;

 

The fuzzy matching algorithm runs into trouble (or so I notice) in an area of dense addresses (like an urban core or city center where there are many medical offices, clinics, or even hospitals. If the addresses are on different streets then the algorithm makes the correct match almost 100%.  It doesn't do as well when there are several medical offices (or this could be any type of office) one right after the other on the same street.  Then everything is sort of thrown into the solution.  But the fuzzy algorithm has provided a big step forward for my analysis.

 

Thank you for all your help.

 

wlierman

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 3365 views
  • 1 like
  • 5 in conversation