Solved: Re: Errors in Key-Indexing Subset Using Array

PegaZeus · Posted 04-02-2020 03:05 PM

Greetings,

So I have two tables, one of them with around 100,000,000 healthcare claims (claims), and one with a list of NDCs (ndc_table). I'm trying to use a key-indexing approach to subset to include only those claims associated with NDCs in the NDC table.

data final; 
  array ndc_list (*) _temporary_; 
  do until (eof1); 
    set ndc_table end=eof1; 
        ndc_list(ndc) = ndc; 
end;
do until (eof2); 
  set claims end=eof2; 
    array_search = ndc_list(ndc); 
    if array_search > . then output; 
end;
run;

I'm getting 3 errors that I can't seem to figure out:

ERROR: The non-variable based array ndc_list has been defined with zero elements.

ERROR: Too many array subscripts specified for array ndc_list.

I used (*) because the number of NDCs in the ndc_table is going to change over time. Right now there are around 3500 values.

mkeintz · Posted 04-02-2020 08:03 PM

@PegaZeus wrote:

Okay I found out that what we actually want is not to subset the claims but simply to left join on the ndc key and bring in the other fields from the ndc table.

So I wrote a hash left lookup like this, however my output gives me blank values in the entire dataset. Nothing is populated.
data TEST(drop=rc:);
  if 0 then set ndc_table;
  if _N_ = 1 then do;
      declare hash h1(dataset:"ndc_table");
      h1.defineKey('NDC');
      h1.defineData(all:'Y');
      h1.defineDone();
  end;
  set claims;
  rc1=h1.find();
  if rc1 ne 0 then call missing(of _all_);
run;

If it's a "left join on the ndc key", then you need an "if test" to avoid outputing claims that don't match the ndc table. So instead of

rc1=h1.find();
if rc1 ne 0 then call missing(of _all_);

you're probably better off with

if h1.find()=0  then output;

If the NDC table is a miniscule fraction of the claims dataset, may be there are a few cases that aren't all blank … but since your program produces a dataset with the same number of records as claims, maybe you didn't see them.

And of course you can eliminate the "drop=rc1" option in the data statement.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

FreelanceReinh · Posted 04-02-2020 03:38 PM

Hello @PegaZeus,

Arrays in SAS need to be defined with a dimension. In your case you may want to choose a "large enough" value, e.g., 999999 if this is larger than the maximum NDC value you would expect. Or determine the maximum in a preliminary step and use it in the array definition. SAS hash objects do not have this requirement. They grow as needed while they are loaded with data. So this would be an alternative approach.

PeterClemmensen · Posted 04-02-2020 03:57 PM

@PegaZeus what is the maximum value of ndc?

The DATA to DATA Step Macro
Blog: SASnrd

PegaZeus · Posted 04-02-2020 04:13 PM

The max value of NDC would be 99999999999. This is not the number of ndcs in the ndc table though. In the ndc table there are about 3500 NDCs.

PeterClemmensen · Posted 04-02-2020 04:26 PM

So NDC can range from 1 to 99999999999? Or does 99999999999 have some special meaning?

The DATA to DATA Step Macro
Blog: SASnrd

PegaZeus · Posted 04-02-2020 04:40 PM

That's correct, except instead of just 1, it would be 00000000001. It has to have 11 digits.

PeterClemmensen · Posted 04-02-2020 04:44 PM

Ok. Definitely go with the Hash approach suggested by @FreelanceReinh then 🙂

The DATA to DATA Step Macro
Blog: SASnrd

FreelanceReinh · Posted 04-02-2020 04:39 PM

@PegaZeus wrote:

The max value of NDC would be 99999999999. This is not the number of ndcs in the ndc table though. In the ndc table there are about 3500 NDCs.

This 11-digit number would likely be too large as an array dimension, so the key-indexing approach would need a modification or perhaps you just resort to the hash object approach shown below:

data final;
dcl hash h(dataset:'ndc_table');
h.definekey('ndc');
h.definedone();
do until(eof);
  set claims end=eof;
  if h.check()=0 then output;
end;
stop;
run;

PeterClemmensen · Posted 04-02-2020 04:47 PM

@FreelanceReinh out of interest, what modification do you have in mind to approach this problem with key-indexing? Bitmapping?

The DATA to DATA Step Macro
Blog: SASnrd

FreelanceReinh · Posted 04-02-2020 04:55 PM

Oh, nothing concrete yet. I only remembered those Dorfman papers about key-indexing and was thinking that the NDC numbers perhaps could be mapped to a much smaller range of integers (without collisions). But this would, of course, require deeper knowledge about the structure of those NDC numbers.

PeterClemmensen · Posted 04-02-2020 05:04 PM

Ok. Thank you. Triggered my curiosity 🙂 The Dorfman papers entered my mind as well.

The DATA to DATA Step Macro
Blog: SASnrd

PegaZeus · Posted 04-02-2020 04:54 PM

Okay I found out that what we actually want is not to subset the claims but simply to left join on the ndc key and bring in the other fields from the ndc table.

So I wrote a hash left lookup like this, however my output gives me blank values in the entire dataset. Nothing is populated.

data TEST(drop=rc:);
  if 0 then set ndc_table;
  if _N_ = 1 then do;
      declare hash h1(dataset:"ndc_table");
      h1.defineKey('NDC');
      h1.defineData(all:'Y');
      h1.defineDone();
  end;
  set claims;
  rc1=h1.find();
  if rc1 ne 0 then call missing(of _all_);
run;

Tom · Posted 04-02-2020 05:04 PM

When FIND() fails you a setting ALL of the variables to missing, even those the came from the CLAIMS dataset.
If everything ends up missing then you got no hits. Are you sure NDC variable is in both datasets? Is the content in the same format? Perhaps one is using NDC-10 and the other NDC-11.

PegaZeus · Posted 04-02-2020 05:32 PM

Interesting, you're right. What actually happened was I was using too small of a claims sample for testing, which didn't include any of the NDCs in the ndc_table. I updated the call missing to only the fields in that table now.

mkeintz · Posted 04-02-2020 08:03 PM

@PegaZeus wrote:

Okay I found out that what we actually want is not to subset the claims but simply to left join on the ndc key and bring in the other fields from the ndc table.

So I wrote a hash left lookup like this, however my output gives me blank values in the entire dataset. Nothing is populated.
data TEST(drop=rc:);
  if 0 then set ndc_table;
  if _N_ = 1 then do;
      declare hash h1(dataset:"ndc_table");
      h1.defineKey('NDC');
      h1.defineData(all:'Y');
      h1.defineDone();
  end;
  set claims;
  rc1=h1.find();
  if rc1 ne 0 then call missing(of _all_);
run;

If it's a "left join on the ndc key", then you need an "if test" to avoid outputing claims that don't match the ndc table. So instead of

rc1=h1.find();
if rc1 ne 0 then call missing(of _all_);

you're probably better off with

if h1.find()=0  then output;

If the NDC table is a miniscule fraction of the claims dataset, may be there are a few cases that aren't all blank … but since your program produces a dataset with the same number of records as claims, maybe you didn't see them.

And of course you can eliminate the "drop=rc1" option in the data statement.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Call for Content

Classroom Training Available!