Greetings,
So I have two tables, one of them with around 100,000,000 healthcare claims (claims), and one with a list of NDCs (ndc_table). I'm trying to use a key-indexing approach to subset to include only those claims associated with NDCs in the NDC table.
data final;
array ndc_list (*) _temporary_;
do until (eof1);
set ndc_table end=eof1;
ndc_list(ndc) = ndc;
end;
do until (eof2);
set claims end=eof2;
array_search = ndc_list(ndc);
if array_search > . then output;
end;
run;
I'm getting 3 errors that I can't seem to figure out:
ERROR: The non-variable based array ndc_list has been defined with zero elements.
ERROR: Too many array subscripts specified for array ndc_list.
ERROR: Too many array subscripts specified for array ndc_list.
I used (*) because the number of NDCs in the ndc_table is going to change over time. Right now there are around 3500 values.
@PegaZeus wrote:
Okay I found out that what we actually want is not to subset the claims but simply to left join on the ndc key and bring in the other fields from the ndc table.
So I wrote a hash left lookup like this, however my output gives me blank values in the entire dataset. Nothing is populated.
data TEST(drop=rc:); if 0 then set ndc_table; if _N_ = 1 then do; declare hash h1(dataset:"ndc_table"); h1.defineKey('NDC'); h1.defineData(all:'Y'); h1.defineDone(); end; set claims; rc1=h1.find(); if rc1 ne 0 then call missing(of _all_); run;
If it's a "left join on the ndc key", then you need an "if test" to avoid outputing claims that don't match the ndc table. So instead of
rc1=h1.find();
if rc1 ne 0 then call missing(of _all_);
you're probably better off with
if h1.find()=0 then output;
If the NDC table is a miniscule fraction of the claims dataset, may be there are a few cases that aren't all blank … but since your program produces a dataset with the same number of records as claims, maybe you didn't see them.
And of course you can eliminate the "drop=rc1" option in the data statement.
Hello @PegaZeus,
Arrays in SAS need to be defined with a dimension. In your case you may want to choose a "large enough" value, e.g., 999999 if this is larger than the maximum NDC value you would expect. Or determine the maximum in a preliminary step and use it in the array definition. SAS hash objects do not have this requirement. They grow as needed while they are loaded with data. So this would be an alternative approach.
@PegaZeus what is the maximum value of ndc?
The max value of NDC would be 99999999999. This is not the number of ndcs in the ndc table though. In the ndc table there are about 3500 NDCs.
So NDC can range from 1 to 99999999999? Or does 99999999999 have some special meaning?
That's correct, except instead of just 1, it would be 00000000001. It has to have 11 digits.
Ok. Definitely go with the Hash approach suggested by @FreelanceReinh then 🙂
@PegaZeus wrote:
The max value of NDC would be 99999999999. This is not the number of ndcs in the ndc table though. In the ndc table there are about 3500 NDCs.
This 11-digit number would likely be too large as an array dimension, so the key-indexing approach would need a modification or perhaps you just resort to the hash object approach shown below:
data final;
dcl hash h(dataset:'ndc_table');
h.definekey('ndc');
h.definedone();
do until(eof);
set claims end=eof;
if h.check()=0 then output;
end;
stop;
run;
@FreelanceReinh out of interest, what modification do you have in mind to approach this problem with key-indexing? Bitmapping?
Oh, nothing concrete yet. I only remembered those Dorfman papers about key-indexing and was thinking that the NDC numbers perhaps could be mapped to a much smaller range of integers (without collisions). But this would, of course, require deeper knowledge about the structure of those NDC numbers.
Ok. Thank you. Triggered my curiosity 🙂 The Dorfman papers entered my mind as well.
Okay I found out that what we actually want is not to subset the claims but simply to left join on the ndc key and bring in the other fields from the ndc table.
So I wrote a hash left lookup like this, however my output gives me blank values in the entire dataset. Nothing is populated.
data TEST(drop=rc:);
if 0 then set ndc_table;
if _N_ = 1 then do;
declare hash h1(dataset:"ndc_table");
h1.defineKey('NDC');
h1.defineData(all:'Y');
h1.defineDone();
end;
set claims;
rc1=h1.find();
if rc1 ne 0 then call missing(of _all_);
run;
Interesting, you're right. What actually happened was I was using too small of a claims sample for testing, which didn't include any of the NDCs in the ndc_table. I updated the call missing to only the fields in that table now.
@PegaZeus wrote:
Okay I found out that what we actually want is not to subset the claims but simply to left join on the ndc key and bring in the other fields from the ndc table.
So I wrote a hash left lookup like this, however my output gives me blank values in the entire dataset. Nothing is populated.
data TEST(drop=rc:); if 0 then set ndc_table; if _N_ = 1 then do; declare hash h1(dataset:"ndc_table"); h1.defineKey('NDC'); h1.defineData(all:'Y'); h1.defineDone(); end; set claims; rc1=h1.find(); if rc1 ne 0 then call missing(of _all_); run;
If it's a "left join on the ndc key", then you need an "if test" to avoid outputing claims that don't match the ndc table. So instead of
rc1=h1.find();
if rc1 ne 0 then call missing(of _all_);
you're probably better off with
if h1.find()=0 then output;
If the NDC table is a miniscule fraction of the claims dataset, may be there are a few cases that aren't all blank … but since your program produces a dataset with the same number of records as claims, maybe you didn't see them.
And of course you can eliminate the "drop=rc1" option in the data statement.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.