Solved: Re: Nonmatching observations with PROC SQL: works for small DS but not...

astrae_research · Posted 07-31-2017 06:46 PM

Dear all,

I found this code online (ht to the OP) to find nonmatching observations between 2 datasets to prepare for a sql join. It works great for the example below, as the result is the nonmatching obs by id and group but for any medium sized dataset of say 1-2 gb on a fast m.2 pcie NVME ssd of approx 600MB/sec speed it just gets stuck reading the datasets over and over. After I stop it after a minute, the log throws the following error:

NOTE: There were 112 observations read from the data set WORK.xxxx1.
NOTE: There were 1 observations read from the data set WORK.xxxxxx2.
      WHERE (LPERMNO=22779) and (date_m_crsp=12722);
WARNING: The data set WORK.TEMP_DEBUG1 may be incomplete.  When this step was stopped there were 0 observations and 1899 variables.
WARNING: Data set WORK.TEMP_DEBUG1 was not replaced because this step was stopped.

Apparently the code is malformed, but I don't know where the mistakes are.

data dataset1;
      input id $ group $ gender $ age;
      cards;
      111 A Male 11
      111 B Male 11
      222 D Male 12
      333 E Female 13
      666 G Female 14
      999 A Male 15
      999 B Male 15
      999 C Male 15
      ;
run;

data dataset2;
      input id $ group $ gender $ age;
      cards;
      111 A Male 11
      999 C Male 15
      ;
run;
proc sql;
 create table tempo_names
 as
 select *
 from dataset1 ds1
 where not exists
 (select ds2.id, ds2.group
 from dataset2 as ds2
 where ds1.id=ds2.id and ds1.group=ds2.group);
quit;

SAS 9.4 ts1m3 64 bits on EG 7.1 64 bits on a win 7 x64 with 24gb ram at 4.2 ghz 4cores.

Thank you,

astrae

ChrisNZ · Posted 07-31-2017 10:04 PM

If you only want to use 2 variables for deduplication, hash tables would be the fastest.

data DATASET1
     DATASET2;
  input ID $ GROUP $ GENDER $ AGE;
  do I=1 to 1e4;
    output DATASET1; 
    if _N_ in (1,8) then output DATASET2; 
  end;
  drop I;
cards;
111 A Male 11
111 B Male 11
222 D Male 12
333 E Female 13
666 G Female 14
999 A Male 15
999 B Male 15
999 C Male 15
run;

       
data TEMPO_NAMES1; ************ 0.1 seconds **********;
  set DATASET1;
  if _N_=1 then do;
    dcl hash LOOKUP(dataset:'DATASET2');
    LOOKUP.definekey('ID','GROUP');
    LOOKUP.definedone();
  end;
  if LOOKUP.check();
run;

proc sql;        ************ 22 seconds **********;
 create table TEMPO_NAMES2 as
 select ds1.*
 from DATASET1 ds1
       left join
      DATASET2 ds2
       on ds1.ID=ds2.ID and ds1.GROUP=ds2.GROUP
  where ds1.ID ne ds2.ID and ds1.GROUP ne ds2.GROUP ;
quit;

High-Performance SAS Coding - Third Edition

View solution in original post

Reeza · Posted 07-31-2017 09:40 PM

Are you looking for exact values or just via ID and Group variables for identifying 'duplicates'?

Reeza · Posted 07-31-2017 09:41 PM

I'd probably append the two datasets and sort by ID and use PROC SORT to identify unique and not unique records.

art297 · Posted 07-31-2017 09:42 PM

I don't think there is anything wrong with your syntax, but it might be trying to accomplish too much IO.

Doesn't something like the following do the same thing?

proc sql;
 create table tempo_names2 as
   select ds1.*
     from dataset1 ds1
       left outer join
         dataset2 ds2
           on ds1.id=ds2.id and ds1.group=ds2.group
             where ds1.id<>ds2.id and ds1.group<>ds2.group
           
 ;
quit;

Art, CEO, AnalystFinder.com

ChrisNZ · Posted 07-31-2017 09:47 PM

Exist clauses can be seriously inefficient.

Try:


proc sql;
 create table TEMPO_NAMES as
 select *
 from DATASET1 
 except
 select *
 from DATASET2;
quit;

High-Performance SAS Coding - Third Edition

astrae_research · Posted 08-02-2017 06:13 PM

Thank you guys! I haven't had time to go over all of the solutions yet, but this all looks very promising. I had a feeling that one of the datasets was being cycled for every observation in the other dataset based on the I/O stats, maybe that's due to the exist clause?

ChrisNZ · Posted 07-31-2017 10:04 PM

If you only want to use 2 variables for deduplication, hash tables would be the fastest.

data DATASET1
     DATASET2;
  input ID $ GROUP $ GENDER $ AGE;
  do I=1 to 1e4;
    output DATASET1; 
    if _N_ in (1,8) then output DATASET2; 
  end;
  drop I;
cards;
111 A Male 11
111 B Male 11
222 D Male 12
333 E Female 13
666 G Female 14
999 A Male 15
999 B Male 15
999 C Male 15
run;

       
data TEMPO_NAMES1; ************ 0.1 seconds **********;
  set DATASET1;
  if _N_=1 then do;
    dcl hash LOOKUP(dataset:'DATASET2');
    LOOKUP.definekey('ID','GROUP');
    LOOKUP.definedone();
  end;
  if LOOKUP.check();
run;

proc sql;        ************ 22 seconds **********;
 create table TEMPO_NAMES2 as
 select ds1.*
 from DATASET1 ds1
       left join
      DATASET2 ds2
       on ds1.ID=ds2.ID and ds1.GROUP=ds2.GROUP
  where ds1.ID ne ds2.ID and ds1.GROUP ne ds2.GROUP ;
quit;

High-Performance SAS Coding - Third Edition

astrae_research · Posted 08-04-2017 10:12 PM

An interesting approach! Is there a way to make keys have different names in the tables? Say GROUP in ds1 and GROUP_1 in ds2?

I tried to follow this approach: https://communities.sas.com/t5/Base-SAS-Programming/SAS-hash-tables-joining-on-keys-with-different-n... but the result was different from the same key one.

Thank you

Patrick · Posted 08-04-2017 10:37 PM

@astrae_research

Yes, you can have different variable names and then lookup values in the hash using syntax like:

lookup.check(key:ID, key:GROUP)

You still would need to define variable GROUP_1 in ds1.

What I would do instead is to rename the variables when you load them into the hash:

dcl hash LOOKUP(dataset:'DATASET2(rename=(group_1=group))');

astrae_research · Posted 08-07-2017 05:50 PM

I wrote a small macro using the SQL approach using the code from Chris's reply. Decided to post it here. I find the sql performance hit is not that bad and it also allows to calculate some basic counts in the same proc ,e.g., how many groups have non-matching observations. Is there a difference between left join and left outer join in this case?

/* Macro to find nonmatching obsevations between
two datasets on two keys/vars */

%MACRO nonm_2var_sql(dsn_in1=,dsn_in2=,dsn_an=,group_var=,var1in1=, var1in2=, var2in1=, var2in2=,var3in1=, var3in2=);
%macro _; %mend _;
proc sql;       
 create table &dsn_an. as
 select *, count(distinct(&group_var.)) as group_cnt from (select a.*, count(a.&group_var.) as obs_per_group_cnt
 from &dsn_in1. as a
       left join
      &dsn_in2. as b
       on a.&var1in1.=b.&var1in2. and a.&var2in1.=b.&var2in2. 
  where a.&var1in1. ne b.&var1in2. and a.&var2in1. ne b.&var2in2. 
  group by a.&group_var.);
quit;
%MEND nonm_2var_sql;

Thank you all who replied to my question!

ChrisNZ · Posted 08-07-2017 06:58 PM

The keyword OUTER is optional. They are the same.

High-Performance SAS Coding - Third Edition

Nonmatching observations with PROC SQL: works for small DS but not for medium sized DS - issues

Re: Nonmatching observations with PROC SQL: works for small DS but not for medium sized DS - issues

Re: Nonmatching observations with PROC SQL: works for small DS but not for medium sized DS - issues

Re: Nonmatching observations with PROC SQL: works for small DS but not for medium sized DS - issues

Re: Nonmatching observations with PROC SQL: works for small DS but not for medium sized DS - issues

Re: Nonmatching observations with PROC SQL: works for small DS but not for medium sized DS - issues

Re: Nonmatching observations with PROC SQL: works for small DS but not for medium sized DS - issues

Re: Nonmatching observations with PROC SQL: works for small DS but not for medium sized DS - issues

Re: Nonmatching observations with PROC SQL: works for small DS but not for medium sized DS - issues

Re: Nonmatching observations with PROC SQL: works for small DS but not for medium sized DS - issues

Re: Nonmatching observations with PROC SQL: works for small DS but not for medium sized DS - issues

Re: Nonmatching observations with PROC SQL: works for small DS but not for medium sized DS - issues

Classroom Training Available!