BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
sobrio40
Fluorite | Level 6

Hi SAS users,

I have a baseline SAS dataset ENC with a primary key ID. I have a second dataset DIAG with ID as a foreign key. For values of ID that do not link, I need to set ID to NULL in the DIAG dataset. Assume there is no way to fix the source data.

 

I tried about 20 variations of the proc sql step below but nothing works yet.

 

 

proc sql;
   create table DIAG_NEW as
   select field1
      ,field2
      ,case when ID in (select ID from ENC) then ID
            else ID is NULL end as ID
   from DIAG;
quit;

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

[Editors Note]

I've marked @Tom response as the accepted solution as it is concise and offers both SQL and Data Step solutions. I also recommend you read the replies from @ChrisNZ and @Kurt_Bremser who add valuable additional information

[End Editors Note]

To represent a missing numeric value use a period.  For a missing character value then just use a string literal that only contains blanks.

proc sql ;
  select name
        ,age,case when name='Alice' then . else age end as age2
        ,sex,case when name='Alfred' then ' ' else sex end as sex2
  from sashelp.class
  where name in ('Alice','Alfred')
  ;
quit;

It is much easier if you just use regular SAS syntax instead of trying to use PROC SQL.  Then you can use the CALL MISSING() function to make the value missing without having to know whether the field is numeric or character.

data DIAG_NEW
  merge diag(in=in1) ENC(keep=id in=in2) ;
  by id;
  if in1;
  if not in2 then call missing(id);
run;

 

View solution in original post

7 REPLIES 7
SASKiwi
PROC Star

Don't have any data to test this with but something like this should work. When there is no matching row in ENC ID_ENC will be blank.

 

proc sql;
   create table DIAG_NEW as
   select field1
      ,field2
      ,E.ID as ID_ENC
   from DIAG as D
   left join ENC as E
   on D.ID = E.ID;
quit;

 

 

PGStats
Opal | Level 21

Try this syntax:

 

proc sql;
   create table DIAG_NEW as
   select field1
      ,field2
      ,case when ID in (select ID from ENC) then ID
            else . end as ID
   from DIAG;
quit;
PG
roberthi
Calcite | Level 5

nice solution

Tom
Super User Tom
Super User

[Editors Note]

I've marked @Tom response as the accepted solution as it is concise and offers both SQL and Data Step solutions. I also recommend you read the replies from @ChrisNZ and @Kurt_Bremser who add valuable additional information

[End Editors Note]

To represent a missing numeric value use a period.  For a missing character value then just use a string literal that only contains blanks.

proc sql ;
  select name
        ,age,case when name='Alice' then . else age end as age2
        ,sex,case when name='Alfred' then ' ' else sex end as sex2
  from sashelp.class
  where name in ('Alice','Alfred')
  ;
quit;

It is much easier if you just use regular SAS syntax instead of trying to use PROC SQL.  Then you can use the CALL MISSING() function to make the value missing without having to know whether the field is numeric or character.

data DIAG_NEW
  merge diag(in=in1) ENC(keep=id in=in2) ;
  by id;
  if in1;
  if not in2 then call missing(id);
run;

 

ChrisNZ
Tourmaline | Level 20

It's a good habit to avoid embedded select clauses as they are inefficient (though not a performance killer when just building a list as you are here).

 

data DIAG ENC;
  do ID=1 to 2e7;
    output; 
  end; 
run;

proc sql;  * 40 seconds;
   create table DIAG_NEW as
   select case when ID in (select ID from ENC) then ID
          else . end as ID
   from DIAG;
quit;

proc sql; * 16 seconds;
   create table DIAG_NEW as
   select E.ID as ID_ENC
   from DIAG as D
   left join ENC as E
   on D.ID = E.ID;
quit;

 

Kurt_Bremser
Super User

Out of curiosity, I ran a test with a hash object and some "payload data" on my UE:

data
  myfold.DIAG
  myfold.ENC (keep=id)
;
length payload $200;
payload = repeat("X",199);
do ID=1 to 6e6;
  output myfold.diag;
  if mod(id,5) = 0 then output myfold.enc;
end; 
run; /* 7.77 seconds */

proc sql;
   create table myfold.DIAG_NEW1 as
   select
     id,
     payload,
     case
       when ID in (select ID from myfold.ENC)
       then ID
       else .
     end as enc_ID
   from myfold.DIAG;
quit; /* 21.61 seconds */

proc sql;
   create table myfold.DIAG_NEW2 as
   select
     d.id,
     d.payload,
     E.ID as ID_ENC
   from myfold.DIAG as D
   left join myfold.ENC as E
   on D.ID = E.ID;
quit; /* 1:07.10 (!) */

data myfold.diag_new3;
set myfold.diag;
if _n_ = 1
then do;
  declare hash h (dataset:"myfold.enc",hashexp:7);
  h.definekey("id");
  h.definedone();
end;
if h.check() = 0 then enc_id = id;
run; /* 14.74 seconds */

data myfold.diag_new4;
merge
  myfold.diag (in=d)
  myfold.enc (in=e)
;
by id;
if d;
if e then enc_id = id;
run; /* 13.41 seconds */

As you can see, the addition of payload data caused the join to turn into a performance nightmare.

Unsurprisingly, the data step methods performed best.

ChrisNZ
Tourmaline | Level 20

My test results, which are in line with my expectations:

data
  WORK.DIAG 
  WORK.ENC (keep=ID)
  ;
  length PAYLOAD $200;
  PAYLOAD = repeat("X",199);
  do ID=1 to 1e7;
    output WORK.DIAG;
    if mod(ID,5) = 0 then output WORK.ENC;
  end; 
run; /* 5.87 seconds */

proc sql;
   create table WORK.DIAG_NEW1 as
   select
     ID,
     PAYLOAD,
     case
       when ID in (select ID from WORK.ENC)
       then ID
       else .
     end as ENC_ID
   from WORK.DIAG;
quit;                                 /* in (select ...) 27.59 seconds */

proc sql _method;
   create table WORK.DIAG_NEW2 as
   select
     d.ID,
     d.PAYLOAD,
     E.ID as ID_ENC
   from WORK.DIAG     as d
   left join WORK.ENC as e
   on D.ID = E.ID;
quit;                                  /* left join 24:05 */                 

data WORK.DIAG_NEW3;
set WORK.DIAG;
  if _n_ = 1 then do;
    declare hash h (dataset:"WORK.ENC",hashexp:7);
    h.definekey("ID");
    h.definedone();
  end;
  if h.check() = 0 then ENC_ID = ID;
run;                                   /* hash 15.08 seconds */

data WORK.DIAG_NEW4;
merge
  WORK.DIAG (in=D)
  WORK.ENC  (in=E)
  ;
  by ID;
  if D;
  if E then ENC_ID = ID;
run;                                   /* merge by 10.41 seconds */


data
  WORK.DIAG(sortedby=ID) 
  WORK.ENC (sortedby=ID keep=ID)
  ;
  length PAYLOAD $200;
  PAYLOAD = repeat("X",199);
  do ID=1 to 1e7;
    output WORK.DIAG;
    if mod(ID,5) = 0 then output WORK.ENC;
  end; 
run; /* 5.24 seconds */

proc sql _method;
   create table WORK.DIAG_NEW2 as
   select
     d.ID,
     d.PAYLOAD,
     e.ID as ID_ENC
   from WORK.DIAG     as d
   left join WORK.ENC as e
   on D.ID = E.ID;
quit;                              /* left join sorted 12.24 seconds */

The worst time is:  in (select...), which is very inefficient as expected (it's basically a Cartesian product).

Then a SQL join with bad metadata (proc sql doesn't know the data is already sorted so sorts both tables again).

Then hash.

Then the SQL join without sort 

Then the data step merge, which leverages the sorted data the best.

 

Note: WORK library uses the BASE engine.

When using SPDE, the data step takes longer than proc sql.

 

 

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 34693 views
  • 0 likes
  • 7 in conversation