Help using Base SAS procedures

Arryas, Loops and Datasets...efficiency question

Reply
New Contributor
Posts: 3

Arryas, Loops and Datasets...efficiency question

Hello! I'm new here and need some help.

I have a huge data set which I have to search for codes. The codes can appear in any of multiple columns. The codes are alphanumberic.

Normally I use code like this:

data file1 (drop=i);

set file2;

array  field{25} field1-field25;

do i = 1 to 25;

     if (field{i} = "20" or field{i} = "25") then do;

     delete;

      leave;

     end;

end;

run;

This seems to work fine. However now instead of just 1 or 2 codes to search by I am given lists of 500 to 3800 codes. So my question is... is how is the best way to accomplish this. I have the codes in a sorted one column text file.

Thank you!

Super User
Posts: 5,091

Re: Arryas, Loops and Datasets...efficiency question

Rewena,

The easiest way would be to turn the one-column file into a format.  It would be the equivalent of:

proc format;

value $fields "20" = "delete me"  "25" = "delete me";

run;

Here's one way to do that:

data delete_list;

   infile one_col;

   input start $;

   retain fmtname "$fields" label "delete me";

run;

proc format cntlin=delete_list;

run;

That would create the format.  Then to use it, change the interior of the DO loop:

if put(field{i}, $fields.) = "delete me" then delete;

Once you delete a record, you don't have to worry about a LEAVE statement.  You're already out of the DO group.

Good luck.

PROC Star
Posts: 7,364

Re: Arryas, Loops and Datasets...efficiency question

Here is one approach:

data Amaster;

  informat code $2.;

  input code;

  cards;

25

14

BB

;

data _null_;

if 0 then set Amaster nobs=nobs;

CALL SYMPUT('NUMREC',nobs);

stop;

run;

data have;

  informat field1-field5 $2.;

  input id field1-field5;

  cards;

1 AA FF 11 12 CC

2 DD 25 14 XX YY

3 DD EE FF 25 CC

4 AA BB DD DD EE

;

data want (drop=codes: x i);

  array codes(&numrec.) $2.;

  i=0;

  do until (eof1);

    set Amaster end=eof1;

    i=i+1;

    codes(i)=code;

  end;

  do until (eof2);

    array fields(*) $2. field1-field5;

    set have end=eof2;

    do i=1 to dim(fields);

      if fields(i) in codes then do;

     x=1;

        leave;

      end;

    end;

    if missing(x) then output;

  end;

run;

Super User
Posts: 9,687

Re: Arryas, Loops and Datasets...efficiency question

I prefer to Hash Table .

New Contributor
Posts: 3

Re: Arryas, Loops and Datasets...efficiency question

I forgot to mention I am using EG 4.3 so I don't think Hash Tables are an option.

Astounding thank you, that sounds like a great method.  I am doing something wrong though. The code runs fine but it doesn't delete any observations on my test set (where it should delete 80). I really haven't used formats much. What am I doing wrong?

data delete_list;

     infile "myTextFile"

     input start $;

     retain fmtname "$form label "delete me";

     run;

* I get a 3 column dataset with the same number of observations and my codes in the first column, $form in the second, and delete me in the third);

proc format cntlin=delete_list;

data IGIV2_filtered (drop=i);

set IGIV1_all;

array Diag{12} diag1-diag12;

do i = 1 to 12;

     if put(diagCode{i}, $form.) = "delete me" then delete;

end;

run;

What am I doing wrong?

Thanks again!

Super User
Posts: 5,091

Re: Arryas, Loops and Datasets...efficiency question

Rewena,

Mostly, it looks like everything is fine and should work.  Let's explore a few things though.

The first DATA step looks like it contains a few syntax errors (missing semicolon, unbalanced quotes).  But you describe the output properly, so I assume that you ran the right program, and that the code you posted is slightly different than what you actually ran.

The last DATA step uses Diag as the array name, but later refers to the array as diagCode.  If you actually ran it this way, you would have gotten an error message.  So again I believe that the program you ran was slightly different than the program you posted.

My best guess as to what is happening is that the diagnostic codes in the data do not actually match the codes in the delete list.  Do the codes in IGIV1_all contain leading blanks?  Could they have been entered through a data entry system that appends a carriage return at the end?  You will have to do a little digging on this.  Find one of the 80 that should have been deleted, and explore what is actually in the data.  If one particular observation has diag5 that should be deleted, try this.  If diag5 has a length of 8:

length newvar $ 16;

newvar = put(diag5, $hex16.);

put diag5= newvar=;

Let's find out what's actually in the data and that should point to a solution.  It could be that the problem is with the delete list codes, rather than the data itself.

Good luck.

New Contributor
Posts: 3

Re: Arryas, Loops and Datasets...efficiency question

Output:

Diag5     newvar         

27903     3237393033

Thanks!

Super User
Posts: 5,091

Re: Arryas, Loops and Datasets...efficiency question

Rewena,

So far so good.  Can I assume that DIAG5 is defined as having a length of 5 when you run a PROC CONTENTS?  If not, adjust NEWVAR to be twice the length of DIAG5, and retest.

Just to rule out a few possibilities, let's try these steps as well.

Check the spelling.  Make sure "delete me" is spelled the same way, with the same capitalization throughout.

Check when START is "27903" in the DELETE_LIST data set.  Run the same test, but use the longer length ($ 16) for NEWVAR.  The statement INPUT START $ ; will assign START a length of 8.

Finally, I don't think this will be the fix but try changing the format in the PUT function:

put(diagCode{i}, $form9.)

That will make sure that the PUT function returns 9 characters, enough to spell out all the characters in "delete me".  (Do NOT change the value of FMTNAME in the prior step.)

Contributor
Posts: 31

Re: Arryas, Loops and Datasets...efficiency question

Include for testing purposes two elements:

1. in the creation of the format include an "other" value, e.g.:

HLO = 'O'; * to specify that it is "other";

Label = '!!$$##'; * or anything else, as long as you can distinguish it from the "real" label;

2. Include a PUTLOG in the DO-loop where you are testing whether to delete:

PUTLOG diagCode{i}= $form.; * now you can see whether it falls into "other" or one of the delete-values;

That would give you a clue whether there is something wrong with the comparison with "delete me" or with the value in diagCode{i}.

Respected Advisor
Posts: 3,899

Re: Arryas, Loops and Datasets...efficiency question

Hash tables are production since SAS 9.1.3 (I believe) so this should work for you.

data have;
  informat field1-field5 $2.;
  input id field1-field5;
  cards;
1 AA FF 11 12 CC
2 DD 25 14 XX YY
3 DD EE FF 25 CC
4 AA BB DD DD EE
;
run;

data exclusion_list;
  input _excl_code :$2.;
  datalines;
AA
BB
;
run;


data want (drop=_Smiley Happy;
  set have;
  array  field{*} field1-field5;
  if _n_=1 then
    do;
      if 0 then set exclusion_list(keep=_excl_code);
      declare hash h (dataset:'exclusion_list',hashexp:5);
      _rc=h.defineKey('_excl_code');
      _rc=h.defineDone();
    end;

  do _i = 1 to dim(field);
    if h.check(key:field[_i]) = 0 then
      do;
        delete;
        leave;
      end;
  end;
run;

Respected Advisor
Posts: 3,899

Re: Arryas, Loops and Datasets...efficiency question

And here an approach using a format.

I've also removed the "leave" statement in the loop as it is not needed. A "delete" has an implicit "return" so once the "delete" gets executed SAS starts immediately with the next iteration of the data step (an though not continuing any further looping).

data have;
  informat field1-field5 $2.;
  input id field1-field5;
  cards;
1 AA FF 11 12 CC
2 DD 25 14 XX YY
3 DD EE FF 25 CC
4 AA BB DD DD EE
;
run;

data exclusion_list;
  input _excl_code :$2.;
  datalines;
AA
BB
;
run;

/*create a format for exclusions */
data fmt_source;
  set exclusion_list (rename=(_excl_code=start)) end=last;
  retain fmtname '$Exclude' type 'c' label '1';
  output;
  if last then
    do;
      hlo='o';
      label='0';
      output;
    end;
run;

proc format cntlin=fmt_source;
run;

data want (drop=_Smiley Happy;
  set have;
  array  field{*} field1-field5;

  do _i = 1 to dim(field);

    if put(field[_i],$Exclude.) = '1' then delete;
  end;
run;

Ask a Question
Discussion stats
  • 10 replies
  • 316 views
  • 3 likes
  • 6 in conversation