I am trying to merge the following two datasets by the NPI number on SAS v9.4 (a unique number identifying individual physicians):
(This datasets lists procedures completed by a physician)
and
(This dataset lists drugs prescribed by a physician)
I sorted each dataset by the NPI number (after renaming the NPI variables so they are the same) and then merged by the NPI number. This is the problem I am running into:
Say we have Dr. Smith, and Dr. Smith prescribed 5 different drugs during 2019 and completed 10 different procedures in 2019. When I go to merge the dataset, Dr. Smith's procedures 1-5 are merged but procedures 5-10 are subsequently deleted.
How can I solve this issue?
How to solve it depends on what you are trying to do. How do you want to combine 5 drugs with 10 procedures? What type of output do you need?
This is the way the input is currently structured:
NPI | |
1 | Drug 1 |
1 | Drug 2 |
1 | Drug 3 |
1 | Drug 4 |
1 | Drug 5 |
and
NPI | |
1 | Procedure 1 |
1 | Procedure 2 |
1 | Procedure 3 |
1 | Procedure 4 |
1 | Procedure 5 |
1 | Procedure 6 |
1 | Procedure 7 |
1 | Procedure 8 |
1 | Procedure 9 |
1 | Procedure 10 |
I would like the following output:
NPI | ||||||||||||||||
1 | Drug 1 | Drug 2 | Drug 3 | Drug 4 | Drug 5 | Procedure 1 | Procedure 2 | Procedure 3 | Procedure 4 | Procedure 5 | Procedure 6 | Procedure 7 | Procedure 8 | Procedure 9 | Procedure 10 | Procedure 11 |
My first step would be to append both datasets; from there you can then transpose to the wide format if such is needed.
How would I transpose to the wide format?
@gabriellep38 wrote:
How would I transpose to the wide format?
See this, using your example data:
data drugs;
infile datalines dlm="|";
input NPI $ drugtype :$20.;
datalines;
1|Drug 1
1|Drug 2
1|Drug 3
1|Drug 4
1|Drug 5
;
data procedures;
infile datalines dlm="|";
input NPI $ proctype :$20.;
datalines;
1|Procedure 1
1|Procedure 2
1|Procedure 3
1|Procedure 4
1|Procedure 5
1|Procedure 6
1|Procedure 7
1|Procedure 8
1|Procedure 9
1|Procedure 10
;
data all;
set
drugs (rename=(drugtype=value))
procedures (rename=(proctype=value))
;
run;
proc transpose
data=all
out=want (drop=_name_)
;
by npi;
var value;
run;
It may be that you need to compare the lists by more than a Doctor's last name. For example, in the services dataset, there is a Nirali Patel (Rndrng_NPI = '1265846356'), an Anesthesiologist Assistant in Washington, DC, while in the Drugs dataset there is a different Nirali Patel (Prscrbr_NPI = '1104322072'), a student in an organized health care program in Washington, DC.
Thanks this is a good point.
If you used a data step merge you likely have problems. Data step merge only handles one data set with multiples of the By variable. If both sets have multiples then one of them gets truncated to match the smaller and likely not where you would like.
Without some sample data, made up is fine as long as it behaves like your real data, and show what you expect for a result using the sample data. Best is to provide the example and final data in the form of data step code so we can run tests with it.
Links like that require us to do a lot of work, much less anything like creating accounts just to access it.
From your limited description I doubt that merging ALL drugs with ALL procedures for a given doctor is appropriate anyway.
The issue you are facing is typical of many to many merge using SAS data step.
Better to use Proc SQL.
In case it is the intention to use data step, then have a look at this paper.
https://www.pharmasug.org/proceedings/2011/TU/PharmaSUG-2011-TU05.pdf
Can Proc SQL be used to merge multiple datasets. I am hoping to merge both the services and drug dataset from all 6 years, for a total of 12 datasets combined into 1.
Yes
I tried to use proc sql but it was requiring 100s of GBs in order to do the data processing - so the program would not run. Is there another way to do this that does not require so much storage space?
To just go from tall to wide the most efficient way is to use a data step.
You will need to set an upper limit on the number of variables you will create so you can define the arrays. Define the arrays to match the definition of the variables in the original datasets. So if PROCEDURE is defined as length $10 then add the $10 to the ARRAY for P so that the new variables are also defined as $10.
data want;
set procedures(in=in1) drugs(in=in2);
by npi;
array p procedure1-procedure20;
array d drug1-drug20;
if first.npi then call missing(of np p[*] nd d[*]);
if in1 then do;
np+1;
p[np] = procedure ;
end;
if in2 then do;
nd+1;
d[nd] = drug ;
end;
if last.npi;
drop procedure drug;
run;
Hello @gabriellep38
The provider and service dataset has approximately 1 million rows and the the part D prescribers has approximately 2.5 million rows and both together may have a size of approximately 7 GB (as csv after unzipping). (I have considered the data for 2019 year only.)
I strongly recommended that the data must be loaded on to an RDMS database. If there are difficulties , consider using either opensource (MySQL /PostgreSQL) or may be Oracle Express (The 18C express has 12 GB database size restriction) depending upon your preference.
Your process will be simplified. The Oracle listagg function Or the MySQL GROUP_CONCAT functions will greatly simplify the "tall to wide " transpose as you have shown in your second post. (NPI, drugs and Procedures in a row).
Once you get the final dataset in a form you need, you can do further processing in SAS.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.