Dear all,
I found this code online (ht to the OP) to find nonmatching observations between 2 datasets to prepare for a sql join. It works great for the example below, as the result is the nonmatching obs by id and group but for any medium sized dataset of say 1-2 gb on a fast m.2 pcie NVME ssd of approx 600MB/sec speed it just gets stuck reading the datasets over and over. After I stop it after a minute, the log throws the following error:
NOTE: There were 112 observations read from the data set WORK.xxxx1.
NOTE: There were 1 observations read from the data set WORK.xxxxxx2.
WHERE (LPERMNO=22779) and (date_m_crsp=12722);
WARNING: The data set WORK.TEMP_DEBUG1 may be incomplete. When this step was stopped there were 0 observations and 1899 variables.
WARNING: Data set WORK.TEMP_DEBUG1 was not replaced because this step was stopped.
Apparently the code is malformed, but I don't know where the mistakes are.
data dataset1;
input id $ group $ gender $ age;
cards;
111 A Male 11
111 B Male 11
222 D Male 12
333 E Female 13
666 G Female 14
999 A Male 15
999 B Male 15
999 C Male 15
;
run;
data dataset2;
input id $ group $ gender $ age;
cards;
111 A Male 11
999 C Male 15
;
run;
proc sql;
create table tempo_names
as
select *
from dataset1 ds1
where not exists
(select ds2.id, ds2.group
from dataset2 as ds2
where ds1.id=ds2.id and ds1.group=ds2.group);
quit;
SAS 9.4 ts1m3 64 bits on EG 7.1 64 bits on a win 7 x64 with 24gb ram at 4.2 ghz 4cores.
Thank you,
astrae
If you only want to use 2 variables for deduplication, hash tables would be the fastest.
data DATASET1
DATASET2;
input ID $ GROUP $ GENDER $ AGE;
do I=1 to 1e4;
output DATASET1;
if _N_ in (1,8) then output DATASET2;
end;
drop I;
cards;
111 A Male 11
111 B Male 11
222 D Male 12
333 E Female 13
666 G Female 14
999 A Male 15
999 B Male 15
999 C Male 15
run;
data TEMPO_NAMES1; ************ 0.1 seconds **********;
set DATASET1;
if _N_=1 then do;
dcl hash LOOKUP(dataset:'DATASET2');
LOOKUP.definekey('ID','GROUP');
LOOKUP.definedone();
end;
if LOOKUP.check();
run;
proc sql; ************ 22 seconds **********;
create table TEMPO_NAMES2 as
select ds1.*
from DATASET1 ds1
left join
DATASET2 ds2
on ds1.ID=ds2.ID and ds1.GROUP=ds2.GROUP
where ds1.ID ne ds2.ID and ds1.GROUP ne ds2.GROUP ;
quit;
Are you looking for exact values or just via ID and Group variables for identifying 'duplicates'?
I'd probably append the two datasets and sort by ID and use PROC SORT to identify unique and not unique records.
I don't think there is anything wrong with your syntax, but it might be trying to accomplish too much IO.
Doesn't something like the following do the same thing?
proc sql; create table tempo_names2 as select ds1.* from dataset1 ds1 left outer join dataset2 ds2 on ds1.id=ds2.id and ds1.group=ds2.group where ds1.id<>ds2.id and ds1.group<>ds2.group ; quit;
Art, CEO, AnalystFinder.com
Exist clauses can be seriously inefficient.
Try:
proc sql;
create table TEMPO_NAMES as
select *
from DATASET1
except
select *
from DATASET2;
quit;
Thank you guys! I haven't had time to go over all of the solutions yet, but this all looks very promising. I had a feeling that one of the datasets was being cycled for every observation in the other dataset based on the I/O stats, maybe that's due to the exist clause?
If you only want to use 2 variables for deduplication, hash tables would be the fastest.
data DATASET1
DATASET2;
input ID $ GROUP $ GENDER $ AGE;
do I=1 to 1e4;
output DATASET1;
if _N_ in (1,8) then output DATASET2;
end;
drop I;
cards;
111 A Male 11
111 B Male 11
222 D Male 12
333 E Female 13
666 G Female 14
999 A Male 15
999 B Male 15
999 C Male 15
run;
data TEMPO_NAMES1; ************ 0.1 seconds **********;
set DATASET1;
if _N_=1 then do;
dcl hash LOOKUP(dataset:'DATASET2');
LOOKUP.definekey('ID','GROUP');
LOOKUP.definedone();
end;
if LOOKUP.check();
run;
proc sql; ************ 22 seconds **********;
create table TEMPO_NAMES2 as
select ds1.*
from DATASET1 ds1
left join
DATASET2 ds2
on ds1.ID=ds2.ID and ds1.GROUP=ds2.GROUP
where ds1.ID ne ds2.ID and ds1.GROUP ne ds2.GROUP ;
quit;
An interesting approach! Is there a way to make keys have different names in the tables? Say GROUP in ds1 and GROUP_1 in ds2?
I tried to follow this approach: https://communities.sas.com/t5/Base-SAS-Programming/SAS-hash-tables-joining-on-keys-with-different-n... but the result was different from the same key one.
Thank you
Yes, you can have different variable names and then lookup values in the hash using syntax like:
lookup.check(key:ID, key:GROUP)
You still would need to define variable GROUP_1 in ds1.
What I would do instead is to rename the variables when you load them into the hash:
dcl hash LOOKUP(dataset:'DATASET2(rename=(group_1=group))');
I wrote a small macro using the SQL approach using the code from Chris's reply. Decided to post it here. I find the sql performance hit is not that bad and it also allows to calculate some basic counts in the same proc ,e.g., how many groups have non-matching observations. Is there a difference between left join and left outer join in this case?
/* Macro to find nonmatching obsevations between
two datasets on two keys/vars */
%MACRO nonm_2var_sql(dsn_in1=,dsn_in2=,dsn_an=,group_var=,var1in1=, var1in2=, var2in1=, var2in2=,var3in1=, var3in2=);
%macro _; %mend _;
proc sql;
create table &dsn_an. as
select *, count(distinct(&group_var.)) as group_cnt from (select a.*, count(a.&group_var.) as obs_per_group_cnt
from &dsn_in1. as a
left join
&dsn_in2. as b
on a.&var1in1.=b.&var1in2. and a.&var2in1.=b.&var2in2.
where a.&var1in1. ne b.&var1in2. and a.&var2in1. ne b.&var2in2.
group by a.&group_var.);
quit;
%MEND nonm_2var_sql;
Thank you all who replied to my question!
The keyword OUTER is optional. They are the same.
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.