Re: Merging two datasets

gabriellep38 · Posted 03-14-2022 07:58 AM

I am trying to merge the following two datasets by the NPI number on SAS v9.4 (a unique number identifying individual physicians):

https://data.cms.gov/provider-summary-by-type-of-service/medicare-physician-other-practitioners/medi...

(This datasets lists procedures completed by a physician)

and

https://data.cms.gov/provider-summary-by-type-of-service/medicare-part-d-prescribers/medicare-part-d....

(This dataset lists drugs prescribed by a physician)

I sorted each dataset by the NPI number (after renaming the NPI variables so they are the same) and then merged by the NPI number. This is the problem I am running into:

Say we have Dr. Smith, and Dr. Smith prescribed 5 different drugs during 2019 and completed 10 different procedures in 2019. When I go to merge the dataset, Dr. Smith's procedures 1-5 are merged but procedures 5-10 are subsequently deleted.

How can I solve this issue?

Tom · Posted 03-14-2022 09:25 AM

How to solve it depends on what you are trying to do. How do you want to combine 5 drugs with 10 procedures? What type of output do you need?

gabriellep38 · Posted 03-14-2022 10:59 AM

This is the way the input is currently structured:

NPI
1	Drug 1
1	Drug 2
1	Drug 3
1	Drug 4
1	Drug 5

and

NPI
1	Procedure 1
1	Procedure 2
1	Procedure 3
1	Procedure 4
1	Procedure 5
1	Procedure 6
1	Procedure 7
1	Procedure 8
1	Procedure 9
1	Procedure 10

I would like the following output:

NPI

1

Drug 1

Drug 2

Drug 3

Drug 4

Drug 5

Procedure 1

Procedure 2

Procedure 3

Procedure 4

Procedure 5

Procedure 6

Procedure 7

Procedure 8

Procedure 9

Procedure 10

Procedure 11

Kurt_Bremser · Posted 03-14-2022 11:21 AM

My first step would be to append both datasets; from there you can then transpose to the wide format if such is needed.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

gabriellep38 · Posted 03-20-2022 01:14 PM

How would I transpose to the wide format?

Kurt_Bremser · Posted 03-21-2022 04:13 AM

@gabriellep38 wrote:

How would I transpose to the wide format?

See this, using your example data:

data drugs;
infile datalines dlm="|";
input NPI $ drugtype :$20.;
datalines;
1|Drug 1
1|Drug 2
1|Drug 3
1|Drug 4
1|Drug 5
;

data procedures;
infile datalines dlm="|";
input NPI $ proctype :$20.;
datalines;
1|Procedure 1
1|Procedure 2
1|Procedure 3
1|Procedure 4
1|Procedure 5
1|Procedure 6
1|Procedure 7
1|Procedure 8
1|Procedure 9
1|Procedure 10
;

data all;
set
  drugs (rename=(drugtype=value))
  procedures (rename=(proctype=value))
;
run;

proc transpose
  data=all
  out=want (drop=_name_)
;
by npi;
var value;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

benjamin_2018 · Posted 03-14-2022 10:30 AM

It may be that you need to compare the lists by more than a Doctor's last name. For example, in the services dataset, there is a Nirali Patel (Rndrng_NPI = '1265846356'), an Anesthesiologist Assistant in Washington, DC, while in the Drugs dataset there is a different Nirali Patel (Prscrbr_NPI = '1104322072'), a student in an organized health care program in Washington, DC.

gabriellep38 · Posted 03-20-2022 01:15 PM

Thanks this is a good point.

ballardw · Posted 03-14-2022 10:51 AM

If you used a data step merge you likely have problems. Data step merge only handles one data set with multiples of the By variable. If both sets have multiples then one of them gets truncated to match the smaller and likely not where you would like.

Without some sample data, made up is fine as long as it behaves like your real data, and show what you expect for a result using the sample data. Best is to provide the example and final data in the form of data step code so we can run tests with it.

Links like that require us to do a lot of work, much less anything like creating accounts just to access it.

From your limited description I doubt that merging ALL drugs with ALL procedures for a given doctor is appropriate anyway.

Sajid01 · Posted 03-14-2022 11:02 AM

The issue you are facing is typical of many to many merge using SAS data step.
Better to use Proc SQL.
In case it is the intention to use data step, then have a look at this paper.
https://www.pharmasug.org/proceedings/2011/TU/PharmaSUG-2011-TU05.pdf

gabriellep38 · Posted 03-15-2022 08:18 AM

Can Proc SQL be used to merge multiple datasets. I am hoping to merge both the services and drug dataset from all 6 years, for a total of 12 datasets combined into 1.

Sajid01 · Posted 03-15-2022 08:29 AM

Yes

gabriellep38 · Posted 03-20-2022 01:16 PM

I tried to use proc sql but it was requiring 100s of GBs in order to do the data processing - so the program would not run. Is there another way to do this that does not require so much storage space?

Tom · Posted 03-20-2022 03:47 PM

To just go from tall to wide the most efficient way is to use a data step.

You will need to set an upper limit on the number of variables you will create so you can define the arrays. Define the arrays to match the definition of the variables in the original datasets. So if PROCEDURE is defined as length $10 then add the $10 to the ARRAY for P so that the new variables are also defined as $10.

data want;
   set procedures(in=in1) drugs(in=in2);
   by npi;
   array p procedure1-procedure20;
   array d drug1-drug20;
   if first.npi then call missing(of np p[*] nd d[*]);
   if in1 then do;
     np+1;
     p[np] = procedure ;
   end;
   if in2 then do;
     nd+1;
     d[nd] = drug ;
   end;
   if last.npi;
   drop procedure drug;
run;

Sajid01 · Posted 03-20-2022 06:22 PM

Hello @gabriellep38
The provider and service dataset has approximately 1 million rows and the the part D prescribers has approximately 2.5 million rows and both together may have a size of approximately 7 GB (as csv after unzipping). (I have considered the data for 2019 year only.)
I strongly recommended that the data must be loaded on to an RDMS database. If there are difficulties , consider using either opensource (MySQL /PostgreSQL) or may be Oracle Express (The 18C express has 12 GB database size restriction) depending upon your preference.

Your process will be simplified. The Oracle listagg function Or the MySQL GROUP_CONCAT functions will greatly simplify the "tall to wide " transpose as you have shown in your second post. (NPI, drugs and Procedures in a row).
Once you get the final dataset in a form you need, you can do further processing in SAS.

Merging two datasets