BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
astrae_research
Obsidian | Level 7

Dear all,

 

I found this code online (ht to the OP) to find nonmatching observations between 2 datasets to prepare for a sql join. It works great for the example below, as the result is the nonmatching obs by id and group but for any medium sized dataset of say 1-2 gb on a fast m.2 pcie NVME ssd of approx 600MB/sec speed it just gets stuck reading the datasets over and over. After I stop it after a minute, the log throws the following error:

NOTE: There were 112 observations read from the data set WORK.xxxx1.
NOTE: There were 1 observations read from the data set WORK.xxxxxx2.
      WHERE (LPERMNO=22779) and (date_m_crsp=12722);
WARNING: The data set WORK.TEMP_DEBUG1 may be incomplete.  When this step was stopped there were 0 observations and 1899 variables.
WARNING: Data set WORK.TEMP_DEBUG1 was not replaced because this step was stopped.

Apparently the code is malformed, but I don't know where the mistakes are.

 

data dataset1;
      input id $ group $ gender $ age;
      cards;
      111 A Male 11
      111 B Male 11
      222 D Male 12
      333 E Female 13
      666 G Female 14
      999 A Male 15
      999 B Male 15
      999 C Male 15
      ;
run;

data dataset2;
      input id $ group $ gender $ age;
      cards;
      111 A Male 11
      999 C Male 15
      ;
run;
proc sql;
 create table tempo_names
 as
 select *
 from dataset1 ds1
 where not exists
 (select ds2.id, ds2.group
 from dataset2 as ds2
 where ds1.id=ds2.id and ds1.group=ds2.group);
quit;

SAS 9.4 ts1m3 64 bits on EG 7.1 64 bits on a win 7 x64 with 24gb ram at 4.2 ghz 4cores.

 

Thank you,

astrae

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

If you only want to use 2 variables for deduplication, hash tables would be the fastest.

 

data DATASET1
     DATASET2;
  input ID $ GROUP $ GENDER $ AGE;
  do I=1 to 1e4;
    output DATASET1; 
    if _N_ in (1,8) then output DATASET2; 
  end;
  drop I;
cards;
111 A Male 11
111 B Male 11
222 D Male 12
333 E Female 13
666 G Female 14
999 A Male 15
999 B Male 15
999 C Male 15
run;

       
data TEMPO_NAMES1; ************ 0.1 seconds **********;
  set DATASET1;
  if _N_=1 then do;
    dcl hash LOOKUP(dataset:'DATASET2');
    LOOKUP.definekey('ID','GROUP');
    LOOKUP.definedone();
  end;
  if LOOKUP.check();
run;

proc sql;        ************ 22 seconds **********;
 create table TEMPO_NAMES2 as
 select ds1.*
 from DATASET1 ds1
       left join
      DATASET2 ds2
       on ds1.ID=ds2.ID and ds1.GROUP=ds2.GROUP
  where ds1.ID ne ds2.ID and ds1.GROUP ne ds2.GROUP ;
quit;

 

 

View solution in original post

10 REPLIES 10
Reeza
Super User

Are you looking for exact values or just via ID and Group variables for identifying 'duplicates'?

Reeza
Super User

I'd probably append the two datasets and sort by ID and use PROC SORT to identify unique and not unique records. 

art297
Opal | Level 21

I don't think there is anything wrong with your syntax, but it might be trying to accomplish too much IO.

Doesn't something like the following do the same thing?

proc sql;
 create table tempo_names2 as
   select ds1.*
     from dataset1 ds1
       left outer join
         dataset2 ds2
           on ds1.id=ds2.id and ds1.group=ds2.group
             where ds1.id<>ds2.id and ds1.group<>ds2.group
           
 ;
quit;

Art, CEO, AnalystFinder.com

 

ChrisNZ
Tourmaline | Level 20

Exist clauses can be seriously inefficient.

Try:


proc sql;
 create table TEMPO_NAMES as
 select *
 from DATASET1 
 except
 select *
 from DATASET2;
quit;

 

 

astrae_research
Obsidian | Level 7

Thank you guys! I haven't had time to go over all of the solutions yet, but this all looks very promising. I had a feeling that one of the datasets was being cycled for every observation in the other dataset based on the I/O stats, maybe that's due to the exist clause?

ChrisNZ
Tourmaline | Level 20

If you only want to use 2 variables for deduplication, hash tables would be the fastest.

 

data DATASET1
     DATASET2;
  input ID $ GROUP $ GENDER $ AGE;
  do I=1 to 1e4;
    output DATASET1; 
    if _N_ in (1,8) then output DATASET2; 
  end;
  drop I;
cards;
111 A Male 11
111 B Male 11
222 D Male 12
333 E Female 13
666 G Female 14
999 A Male 15
999 B Male 15
999 C Male 15
run;

       
data TEMPO_NAMES1; ************ 0.1 seconds **********;
  set DATASET1;
  if _N_=1 then do;
    dcl hash LOOKUP(dataset:'DATASET2');
    LOOKUP.definekey('ID','GROUP');
    LOOKUP.definedone();
  end;
  if LOOKUP.check();
run;

proc sql;        ************ 22 seconds **********;
 create table TEMPO_NAMES2 as
 select ds1.*
 from DATASET1 ds1
       left join
      DATASET2 ds2
       on ds1.ID=ds2.ID and ds1.GROUP=ds2.GROUP
  where ds1.ID ne ds2.ID and ds1.GROUP ne ds2.GROUP ;
quit;

 

 

astrae_research
Obsidian | Level 7

An interesting approach! Is there a way to make keys have different names in the tables? Say GROUP in ds1 and GROUP_1 in ds2?

I tried to follow this approach: https://communities.sas.com/t5/Base-SAS-Programming/SAS-hash-tables-joining-on-keys-with-different-n... but the result was different from the same key one.

 

Thank you

Patrick
Opal | Level 21

@astrae_research

Yes, you can have different variable names and then lookup values in the hash using syntax like:

lookup.check(key:ID, key:GROUP)

 

You still would need to define variable GROUP_1 in ds1. 

 

What I would do instead is to rename the variables when you load them into the hash:

dcl hash LOOKUP(dataset:'DATASET2(rename=(group_1=group))');

 

astrae_research
Obsidian | Level 7

I wrote a small macro using the SQL approach using the code from Chris's reply. Decided to post it here. I find the sql performance hit is not that bad and it also allows to calculate some basic counts in the same proc ,e.g., how many groups have non-matching observations. Is there a difference between left join and left outer join in this case?

 

 

/* Macro to find nonmatching obsevations between
two datasets on two keys/vars */

%MACRO nonm_2var_sql(dsn_in1=,dsn_in2=,dsn_an=,group_var=,var1in1=, var1in2=, var2in1=, var2in2=,var3in1=, var3in2=);
%macro _; %mend _;
proc sql;       
 create table &dsn_an. as
 select *, count(distinct(&group_var.)) as group_cnt from (select a.*, count(a.&group_var.) as obs_per_group_cnt
 from &dsn_in1. as a
       left join
      &dsn_in2. as b
       on a.&var1in1.=b.&var1in2. and a.&var2in1.=b.&var2in2. 
  where a.&var1in1. ne b.&var1in2. and a.&var2in1. ne b.&var2in2. 
  group by a.&group_var.);
quit;
%MEND nonm_2var_sql;

Thank you all who replied to my question!

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 2820 views
  • 0 likes
  • 5 in conversation