- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi SAS users,
I have a baseline SAS dataset ENC with a primary key ID. I have a second dataset DIAG with ID as a foreign key. For values of ID that do not link, I need to set ID to NULL in the DIAG dataset. Assume there is no way to fix the source data.
I tried about 20 variations of the proc sql step below but nothing works yet.
proc sql; create table DIAG_NEW as select field1 ,field2 ,case when ID in (select ID from ENC) then ID else ID is NULL end as ID from DIAG; quit;
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
[Editors Note]
I've marked @Tom response as the accepted solution as it is concise and offers both SQL and Data Step solutions. I also recommend you read the replies from @ChrisNZ and @Kurt_Bremser who add valuable additional information
[End Editors Note]
To represent a missing numeric value use a period. For a missing character value then just use a string literal that only contains blanks.
proc sql ;
select name
,age,case when name='Alice' then . else age end as age2
,sex,case when name='Alfred' then ' ' else sex end as sex2
from sashelp.class
where name in ('Alice','Alfred')
;
quit;
It is much easier if you just use regular SAS syntax instead of trying to use PROC SQL. Then you can use the CALL MISSING() function to make the value missing without having to know whether the field is numeric or character.
data DIAG_NEW
merge diag(in=in1) ENC(keep=id in=in2) ;
by id;
if in1;
if not in2 then call missing(id);
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Don't have any data to test this with but something like this should work. When there is no matching row in ENC ID_ENC will be blank.
proc sql;
create table DIAG_NEW as
select field1
,field2
,E.ID as ID_ENC
from DIAG as D
left join ENC as E
on D.ID = E.ID;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Try this syntax:
proc sql;
create table DIAG_NEW as
select field1
,field2
,case when ID in (select ID from ENC) then ID
else . end as ID
from DIAG;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
nice solution
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
[Editors Note]
I've marked @Tom response as the accepted solution as it is concise and offers both SQL and Data Step solutions. I also recommend you read the replies from @ChrisNZ and @Kurt_Bremser who add valuable additional information
[End Editors Note]
To represent a missing numeric value use a period. For a missing character value then just use a string literal that only contains blanks.
proc sql ;
select name
,age,case when name='Alice' then . else age end as age2
,sex,case when name='Alfred' then ' ' else sex end as sex2
from sashelp.class
where name in ('Alice','Alfred')
;
quit;
It is much easier if you just use regular SAS syntax instead of trying to use PROC SQL. Then you can use the CALL MISSING() function to make the value missing without having to know whether the field is numeric or character.
data DIAG_NEW
merge diag(in=in1) ENC(keep=id in=in2) ;
by id;
if in1;
if not in2 then call missing(id);
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
It's a good habit to avoid embedded select clauses as they are inefficient (though not a performance killer when just building a list as you are here).
data DIAG ENC;
do ID=1 to 2e7;
output;
end;
run;
proc sql; * 40 seconds;
create table DIAG_NEW as
select case when ID in (select ID from ENC) then ID
else . end as ID
from DIAG;
quit;
proc sql; * 16 seconds;
create table DIAG_NEW as
select E.ID as ID_ENC
from DIAG as D
left join ENC as E
on D.ID = E.ID;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Out of curiosity, I ran a test with a hash object and some "payload data" on my UE:
data
myfold.DIAG
myfold.ENC (keep=id)
;
length payload $200;
payload = repeat("X",199);
do ID=1 to 6e6;
output myfold.diag;
if mod(id,5) = 0 then output myfold.enc;
end;
run; /* 7.77 seconds */
proc sql;
create table myfold.DIAG_NEW1 as
select
id,
payload,
case
when ID in (select ID from myfold.ENC)
then ID
else .
end as enc_ID
from myfold.DIAG;
quit; /* 21.61 seconds */
proc sql;
create table myfold.DIAG_NEW2 as
select
d.id,
d.payload,
E.ID as ID_ENC
from myfold.DIAG as D
left join myfold.ENC as E
on D.ID = E.ID;
quit; /* 1:07.10 (!) */
data myfold.diag_new3;
set myfold.diag;
if _n_ = 1
then do;
declare hash h (dataset:"myfold.enc",hashexp:7);
h.definekey("id");
h.definedone();
end;
if h.check() = 0 then enc_id = id;
run; /* 14.74 seconds */
data myfold.diag_new4;
merge
myfold.diag (in=d)
myfold.enc (in=e)
;
by id;
if d;
if e then enc_id = id;
run; /* 13.41 seconds */
As you can see, the addition of payload data caused the join to turn into a performance nightmare.
Unsurprisingly, the data step methods performed best.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
My test results, which are in line with my expectations:
data
WORK.DIAG
WORK.ENC (keep=ID)
;
length PAYLOAD $200;
PAYLOAD = repeat("X",199);
do ID=1 to 1e7;
output WORK.DIAG;
if mod(ID,5) = 0 then output WORK.ENC;
end;
run; /* 5.87 seconds */
proc sql;
create table WORK.DIAG_NEW1 as
select
ID,
PAYLOAD,
case
when ID in (select ID from WORK.ENC)
then ID
else .
end as ENC_ID
from WORK.DIAG;
quit; /* in (select ...) 27.59 seconds */
proc sql _method;
create table WORK.DIAG_NEW2 as
select
d.ID,
d.PAYLOAD,
E.ID as ID_ENC
from WORK.DIAG as d
left join WORK.ENC as e
on D.ID = E.ID;
quit; /* left join 24:05 */
data WORK.DIAG_NEW3;
set WORK.DIAG;
if _n_ = 1 then do;
declare hash h (dataset:"WORK.ENC",hashexp:7);
h.definekey("ID");
h.definedone();
end;
if h.check() = 0 then ENC_ID = ID;
run; /* hash 15.08 seconds */
data WORK.DIAG_NEW4;
merge
WORK.DIAG (in=D)
WORK.ENC (in=E)
;
by ID;
if D;
if E then ENC_ID = ID;
run; /* merge by 10.41 seconds */
data
WORK.DIAG(sortedby=ID)
WORK.ENC (sortedby=ID keep=ID)
;
length PAYLOAD $200;
PAYLOAD = repeat("X",199);
do ID=1 to 1e7;
output WORK.DIAG;
if mod(ID,5) = 0 then output WORK.ENC;
end;
run; /* 5.24 seconds */
proc sql _method;
create table WORK.DIAG_NEW2 as
select
d.ID,
d.PAYLOAD,
e.ID as ID_ENC
from WORK.DIAG as d
left join WORK.ENC as e
on D.ID = E.ID;
quit; /* left join sorted 12.24 seconds */
The worst time is: in (select...), which is very inefficient as expected (it's basically a Cartesian product).
Then a SQL join with bad metadata (proc sql doesn't know the data is already sorted so sorts both tables again).
Then hash.
Then the SQL join without sort
Then the data step merge, which leverages the sorted data the best.
Note: WORK library uses the BASE engine.
When using SPDE, the data step takes longer than proc sql.