SAS Programming

sobrio40 · Posted 07-13-2018 10:45 PM

Hi SAS users,

I have a baseline SAS dataset ENC with a primary key ID. I have a second dataset DIAG with ID as a foreign key. For values of ID that do not link, I need to set ID to NULL in the DIAG dataset. Assume there is no way to fix the source data.

I tried about 20 variations of the proc sql step below but nothing works yet.

proc sql;
   create table DIAG_NEW as
   select field1
      ,field2
      ,case when ID in (select ID from ENC) then ID
            else ID is NULL end as ID
   from DIAG;
quit;

Tom · Posted 07-14-2018 03:23 PM

[Editors Note]

I've marked @Tom response as the accepted solution as it is concise and offers both SQL and Data Step solutions. I also recommend you read the replies from @ChrisNZ and @Kurt_Bremser who add valuable additional information

[End Editors Note]

To represent a missing numeric value use a period. For a missing character value then just use a string literal that only contains blanks.

proc sql ;
  select name
        ,age,case when name='Alice' then . else age end as age2
        ,sex,case when name='Alfred' then ' ' else sex end as sex2
  from sashelp.class
  where name in ('Alice','Alfred')
  ;
quit;

It is much easier if you just use regular SAS syntax instead of trying to use PROC SQL. Then you can use the CALL MISSING() function to make the value missing without having to know whether the field is numeric or character.

data DIAG_NEW
  merge diag(in=in1) ENC(keep=id in=in2) ;
  by id;
  if in1;
  if not in2 then call missing(id);
run;

View solution in original post

SASKiwi · Posted 07-13-2018 11:21 PM

Don't have any data to test this with but something like this should work. When there is no matching row in ENC ID_ENC will be blank.

proc sql;
   create table DIAG_NEW as
   select field1
      ,field2
      ,E.ID as ID_ENC
   from DIAG as D
   left join ENC as E
   on D.ID = E.ID;
quit;

PGStats · Posted 07-13-2018 11:28 PM

Try this syntax:

proc sql;
   create table DIAG_NEW as
   select field1
      ,field2
      ,case when ID in (select ID from ENC) then ID
            else . end as ID
   from DIAG;
quit;

PG

roberthi · Posted 10-11-2020 02:17 AM

nice solution

Tom · Posted 07-14-2018 03:23 PM

[Editors Note]

I've marked @Tom response as the accepted solution as it is concise and offers both SQL and Data Step solutions. I also recommend you read the replies from @ChrisNZ and @Kurt_Bremser who add valuable additional information

[End Editors Note]

To represent a missing numeric value use a period. For a missing character value then just use a string literal that only contains blanks.

proc sql ;
  select name
        ,age,case when name='Alice' then . else age end as age2
        ,sex,case when name='Alfred' then ' ' else sex end as sex2
  from sashelp.class
  where name in ('Alice','Alfred')
  ;
quit;

It is much easier if you just use regular SAS syntax instead of trying to use PROC SQL. Then you can use the CALL MISSING() function to make the value missing without having to know whether the field is numeric or character.

data DIAG_NEW
  merge diag(in=in1) ENC(keep=id in=in2) ;
  by id;
  if in1;
  if not in2 then call missing(id);
run;

ChrisNZ · Posted 07-16-2018 12:15 AM

It's a good habit to avoid embedded select clauses as they are inefficient (though not a performance killer when just building a list as you are here).

data DIAG ENC;
  do ID=1 to 2e7;
    output; 
  end; 
run;

proc sql;  * 40 seconds;
   create table DIAG_NEW as
   select case when ID in (select ID from ENC) then ID
          else . end as ID
   from DIAG;
quit;

proc sql; * 16 seconds;
   create table DIAG_NEW as
   select E.ID as ID_ENC
   from DIAG as D
   left join ENC as E
   on D.ID = E.ID;
quit;

High-Performance SAS Coding - Third Edition

Kurt_Bremser · Posted 10-11-2020 07:04 AM

Out of curiosity, I ran a test with a hash object and some "payload data" on my UE:

data
  myfold.DIAG
  myfold.ENC (keep=id)
;
length payload $200;
payload = repeat("X",199);
do ID=1 to 6e6;
  output myfold.diag;
  if mod(id,5) = 0 then output myfold.enc;
end; 
run; /* 7.77 seconds */

proc sql;
   create table myfold.DIAG_NEW1 as
   select
     id,
     payload,
     case
       when ID in (select ID from myfold.ENC)
       then ID
       else .
     end as enc_ID
   from myfold.DIAG;
quit; /* 21.61 seconds */

proc sql;
   create table myfold.DIAG_NEW2 as
   select
     d.id,
     d.payload,
     E.ID as ID_ENC
   from myfold.DIAG as D
   left join myfold.ENC as E
   on D.ID = E.ID;
quit; /* 1:07.10 (!) */

data myfold.diag_new3;
set myfold.diag;
if _n_ = 1
then do;
  declare hash h (dataset:"myfold.enc",hashexp:7);
  h.definekey("id");
  h.definedone();
end;
if h.check() = 0 then enc_id = id;
run; /* 14.74 seconds */

data myfold.diag_new4;
merge
  myfold.diag (in=d)
  myfold.enc (in=e)
;
by id;
if d;
if e then enc_id = id;
run; /* 13.41 seconds */

As you can see, the addition of payload data caused the join to turn into a performance nightmare.

Unsurprisingly, the data step methods performed best.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ChrisNZ · Posted 10-11-2020 10:17 PM

My test results, which are in line with my expectations:

data
  WORK.DIAG 
  WORK.ENC (keep=ID)
  ;
  length PAYLOAD $200;
  PAYLOAD = repeat("X",199);
  do ID=1 to 1e7;
    output WORK.DIAG;
    if mod(ID,5) = 0 then output WORK.ENC;
  end; 
run; /* 5.87 seconds */

proc sql;
   create table WORK.DIAG_NEW1 as
   select
     ID,
     PAYLOAD,
     case
       when ID in (select ID from WORK.ENC)
       then ID
       else .
     end as ENC_ID
   from WORK.DIAG;
quit;                                 /* in (select ...) 27.59 seconds */

proc sql _method;
   create table WORK.DIAG_NEW2 as
   select
     d.ID,
     d.PAYLOAD,
     E.ID as ID_ENC
   from WORK.DIAG     as d
   left join WORK.ENC as e
   on D.ID = E.ID;
quit;                                  /* left join 24:05 */                 

data WORK.DIAG_NEW3;
set WORK.DIAG;
  if _n_ = 1 then do;
    declare hash h (dataset:"WORK.ENC",hashexp:7);
    h.definekey("ID");
    h.definedone();
  end;
  if h.check() = 0 then ENC_ID = ID;
run;                                   /* hash 15.08 seconds */

data WORK.DIAG_NEW4;
merge
  WORK.DIAG (in=D)
  WORK.ENC  (in=E)
  ;
  by ID;
  if D;
  if E then ENC_ID = ID;
run;                                   /* merge by 10.41 seconds */


data
  WORK.DIAG(sortedby=ID) 
  WORK.ENC (sortedby=ID keep=ID)
  ;
  length PAYLOAD $200;
  PAYLOAD = repeat("X",199);
  do ID=1 to 1e7;
    output WORK.DIAG;
    if mod(ID,5) = 0 then output WORK.ENC;
  end; 
run; /* 5.24 seconds */

proc sql _method;
   create table WORK.DIAG_NEW2 as
   select
     d.ID,
     d.PAYLOAD,
     e.ID as ID_ENC
   from WORK.DIAG     as d
   left join WORK.ENC as e
   on D.ID = E.ID;
quit;                              /* left join sorted 12.24 seconds */

The worst time is: in (select...), which is very inefficient as expected (it's basically a Cartesian product).

Then a SQL join with bad metadata (proc sql doesn't know the data is already sorted so sorts both tables again).

Then hash.

Then the SQL join without sort

Then the data step merge, which leverages the sorted data the best.

Note: WORK library uses the BASE engine.

When using SPDE, the data step takes longer than proc sql.

High-Performance SAS Coding - Third Edition

SAS Programming

SAS SQL: Setting values to NULL

Re: SAS SQL: Setting values to NULL

Re: SAS SQL: Setting values to NULL

Re: SAS SQL: Setting values to NULL

Re: SAS SQL: Setting values to NULL

Re: SAS SQL: Setting values to NULL

Re: SAS SQL: Setting values to NULL

Re: SAS SQL: Setting values to NULL

Re: SAS SQL: Setting values to NULL

Follow Us

What is...

SAS Programming

Special offer for SAS Communities members

SAS Training: Just a Click Away

Follow Us

What is...