- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Is there a more efficient way to write the following code?
proc sql;
create table ABC as
select t1.age,
t1.gender,
t1.diagnosis1,
t1.diagnosis2,
t1.diagnosis3,
t1.diagnosis4
from work.testfile t1
where t1.diagnosis1 in ("4330","43301","43310","43311","43320","43321","43330","43331","43380","43381","43390","43391","4340","43401",
"43410","43411","43490","43491","4350","4351","4352","4353","4358","4359")
or t1.diagnosis2 in ("4330","43301","43310","43311","43320","43321","43330","43331","43380","43381","43390","43391","4340","43401",
"43410","43411","43490","43491","4350","4351","4352","4353","4358","4359")
or t1.diagnosis3 in ("4330","43301","43310","43311","43320","43321","43330","43331","43380","43381","43390","43391","4340","43401",
"43410","43411","43490","43491","4350","4351","4352","4353","4358","4359")
or t1.diagnosis4 in ("4330","43301","43310","43311","43320","43321","43330","43331","43380","43381","43390","43391","4340","43401",
"43410","43411","43490","43491","4350","4351","4352","4353","4358","4359");
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Perhaps put the codes in a format and then evaluate?
Or a data step with an array?
data want;
set have;
keep age diagnosis1-diagnosis4;
array diag(4) diagnosis1-diagnosis4;
keep=0;
do i=1 to dim(diag);
if diag(i) in
("4330","43301","43310","43311","43320","43321","43330","43331","43380","43381","43390","43391","4340","43401",
"43410","43411","43490","43491","4350","4351","4352","4353","4358","4359") then keep=1;
end;
if keep=1;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Reeza, a small improvement:
do i=1 to dim(diag) until (keep=1);
That way, if DIAGNOSIS1 is a match, there's no need to check the other 3 variables.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
<g> One can always find something to optimize. I LOVE this job.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Without changing the syntax you should think of using a macro variable anytime you type something over again like this:
%let list = ("4330","43301","43310","43311","43320","43321","43330","43331","43380","43381","43390","43391","4340","43401",
"43410","43411","43490","43491","4350","4351","4352","4353","4358","4359");
proc sql;
create table ABC as
select t1.age,
t1.gender,
t1.diagnosis1,
t1.diagnosis2,
t1.diagnosis3,
t1.diagnosis4
from work.testfile t1
where t1.diagnosis1 in &list
or t1.diagnosis2 in &list
or t1.diagnosis3 in &list
or t1.diagnosis_oth4_cd in &list;
I would probably use a proc tranpose but I'm sure there are a lot of ways.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Define your need for efficiency: code brevity, program run time, memory/disk utilization, something else.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Code brevity or a better way of writing the code.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Note that one way to simplify your code would be to restructure (normalize) your data. Having a wide dataset with variables like DIGNOSIS1-DIAGNOSIS4 often makes code more difficult to write and maintain than if data are normalized into a narrow format with just a single DIAGNOSIS variable and repeating rows. Key point being that if you ever get a patient with a fifth diagnosis, you will not need to update all your code.
I spent a lot of time in my early years manipulating data with lots of arrays, until I was lucky to have a boss point out that simplified data structures lead to simplified code.
Next webinar will be in January 2025. Until then, check out our archives: https://www.basug.org/videos. And be sure to subscribe to our our email list.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Unfortunately, the data is as normalized as we can get it. The nature of our database (all payers claims data) will not allow the structure you are suggesting. I have a total of 12 diagnosis fields and 7 procedure code fields with a total of 219 columns.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Even if the database that stores the data is not normalized (unfortunately), as a SAS programmer it's often helpful to normalize data for yourself before you do any processing. Often for a project, or for a data source I will use across projects, I have a %reformatdata() macro which reads some source data in a format I don't like and reshapes it into a format I do like. If the data are big, can do this as a nightly ETL job, or do it for just the subset I'm interested it.
Given a table with variables: patient age gender claimdate diagnosis1-diagnosis12 procedure1-procedure7, I might reshape it into three tables. Patient table with patient age gender. Diagnosis table with Patient, ClaimDate, Diagnosis. Procedure Table with Patient, ClaimDate, Procedure.
Once I have that done, answering most of the anticipated questions is easier. Which patients have one of these diagnosis codes? Which patients have both diagnosis A and diagnosis B? Which patients had diagnosis X within one month following treatment with procedure Y. What is the average number of diagnoses per patient? etc.
Next webinar will be in January 2025. Until then, check out our archives: https://www.basug.org/videos. And be sure to subscribe to our our email list.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I agree with Quentin and others. A normalised structure is better in most ways. Easier to maintain and query.
If that's not an option consider storing the data in SPDE in which the where-planner can optimise OR conditions (given that the columns are indexed).
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Just to add after seeing your post, I am not suggesting chaning your database, just your working copy which you are processing. Having 219 columns will make your code more complicated, unless your using arrays - (i.e. SQL will just be long winded) - per Reeza's example.
I would agree with Quentin here. Normalize your data to get:
AGE GENDER DIAG_NO DIAG
xx xxxx 1 4330
xx xxxx 2 2345
...
Then have your list of data to find in another dataset - I personally don't like the use of macro variables to hold lots of data:
DIAG
4330
43301
...
Then its a simple lookup:
proc sql;
create table WANT as
select AGE
,GENDER
,DIAG_NO,
,DIAG
where DIAG in (select distinct DIAG from LOOKUP);
quit;