DATA Step, Macro, Functions and more

compare each line of a data table 2 by 2

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 124
Accepted Solution

compare each line of a data table 2 by 2

Hello,

I would like to compare each line of a data table 2 by 2.

(the first obs with all the following, then the second with all the following, etc.)

I found a way to do it with 2 do loops in a macro but it's too slow

I had in mind to make something like this but it doesn't work since I have a limited understanding of how data steps work

I'm not even sure it's the best way to achieve what I need, so any other idea is welcome

  %let WorkingTable = MyTable

  %let nobs=20; /* simplified, number of observations in &WorkingTable = ~220 000 */

  %let k=Gender; /* simplified, should be more than one variable in the end : =Gender Postcode */

  data _null_;

    set &WorkingTable;

    %let i=_N_;

    call symputx('m', &k);

    data _null_;

      set &WorkingTable (firstobs=&i);

      call symputx('n', &k);

      if &m=&n then putlog 'Ok!';

      run;

    run;

Feel free to ask if I'm not clear enough.

Thanks for your help


Accepted Solutions
Solution
‎05-22-2012 10:37 PM
Super User
Posts: 9,681

Re: compare each line of a data table 2 by 2

I don't understand what you mean when you refer to Crashed with nobs = 5000 ?

If you want faster, Hash Table is a good choice.

data class1;
 set sashelp.class(keep=name);
 k+1;
run;
data class2;
 set sashelp.class(keep=name rename=(name=_name));
 _k+1;
run;


%let dsid=%sysfunc(open(class1));
%let nobs=%sysfunc(attrn(&dsid,nobs));
%let dsid=%sysfunc(close(&dsid));
data want;
if 0 then set class1;
 declare hash ha1(dataset:'class1',ordered:'Y');
  ha1.definekey('k');
  ha1.definedata('name');
  ha1.definedone();
if 0 then set class2;
 declare hash ha2(dataset:'class2',ordered:'Y');
  ha2.definekey('_k');
  ha2.definedata('_name');
  ha2.definedone();

do k=1 to &nobs; 
 ha1.find();
 do _k=k to &nobs;
  ha2.find();output;
 end;
end;
stop;
run;


Ksharp

View solution in original post


All Replies
Super User
Posts: 10,500

Re: compare each line of a data table 2 by 2

What kind of comparison are you looking for? And do you actually mean compare the first row with all of 220,000 rows?

If so, I'm afraid that what you're desribing is going to create on the order of 220000! (factorial) comparisons and isn't likely to finish in any reasonable time.

Frequent Contributor
Posts: 124

Re: compare each line of a data table 2 by 2

Hello,

Thanks for taking the time to help me

> What kind of comparison are you looking for?

I want to search for duplicated rows, the only way is to compare each row with each other (proc sort nodupkey does not allow nearby dates to be considered as equal)


> And do you actually mean compare the first row with all of 220,000 rows?

Yes, (The first with the 219 999 others rows; then the second with the 219 998 others etc.)

> If so, I'm afraid that what you're desribing is going to create on the order of 220000! (factorial) comparisons and isn't likely to finish in any reasonable time.

Complexity should be O(n²) not O(n!), so it should be feasable in less than a day I think.

Either way, not all the tables I need to run are so big, this is an exception.


>data _null_;

> set sashelp.class(keep=name);

> do i=_n_ to nobs;

>  set sashelp.class(keep=name rename=(name=_name)) nobs=nobs point=i;

>  put _all_;

> end;

>run;

That's it I think,

Crashed with nobs = 5000 but that's another issue

Thank you !


Solution
‎05-22-2012 10:37 PM
Super User
Posts: 9,681

Re: compare each line of a data table 2 by 2

I don't understand what you mean when you refer to Crashed with nobs = 5000 ?

If you want faster, Hash Table is a good choice.

data class1;
 set sashelp.class(keep=name);
 k+1;
run;
data class2;
 set sashelp.class(keep=name rename=(name=_name));
 _k+1;
run;


%let dsid=%sysfunc(open(class1));
%let nobs=%sysfunc(attrn(&dsid,nobs));
%let dsid=%sysfunc(close(&dsid));
data want;
if 0 then set class1;
 declare hash ha1(dataset:'class1',ordered:'Y');
  ha1.definekey('k');
  ha1.definedata('name');
  ha1.definedone();
if 0 then set class2;
 declare hash ha2(dataset:'class2',ordered:'Y');
  ha2.definekey('_k');
  ha2.definedata('_name');
  ha2.definedone();

do k=1 to &nobs; 
 ha1.find();
 do _k=k to &nobs;
  ha2.find();output;
 end;
end;
stop;
run;


Ksharp

Frequent Contributor
Posts: 124

Re: compare each line of a data table 2 by 2

> I don't understand what you mean when you refer to Crashed with nobs = 5000 ?

I always try the algorithm first on little tables (nobs=5) if it works, I increase it step by step to have a better feeling on running times

nobs=50, nobs=500 took less than a minute but nobs 5000 crashed (in fact i found out lateer that the crash occurs between nobs = 1400 and 1500)

> If you want faster, Hash Table is a good choice.

much much faster as I can see.

Crash occurs between 15 000 and 16 000 this time (lack of RAM I think)

I have a few questions

-Is it necessary to create two same tables in the beginning?

-WANT is growing too big, is there a way to still have the same computation without storing it in a table and thus in RAM memory ?

Super User
Posts: 9,681

Re: compare each line of a data table 2 by 2

"Crash occurs between 15 000 and 16 000 this time"

Did you mean SAS session stop by accident ? But for a dataset only containing 22,000 obs ,that is not a problem I think.

"-Is it necessary to create two same tables in the beginning?"

It isn't necessary. You can do the same thing in the code when building the Hash Table.

I do this to try to enhance the code's readability . and it is very convenient if the dataset is not too big.

"-WANT is growing too big, is there a way to still have the same computation without storing it in a table and thus in RAM memory ?"

Yes. It is up to what you want, you can output obs you need and discard others .

Ksharp

Frequent Contributor
Posts: 124

Re: compare each line of a data table 2 by 2

> "Did you mean SAS session stop by accident ?"

No, simply nothing happens.

When you look the Physical Memory Usage, you see that nobs=16k needs more than 6Gb of RAM (cf. screenshot)2012-05-23_093032.png


> "Yes. It is up to what you want, you can output obs you need and discard others ."

Thanks, I'll try to play with that and come back if I can't find it out by myself

Super User
Posts: 5,083

Re: compare each line of a data table 2 by 2

mathias,

I'm going to build upon KSharp's first solution.  If these ideas work for you, maybe he can convert it to a hash version.

First, you have to incorporate the idea that it is not necessary to output every combination.  The same DATA step that reads in all the pairs can also perform some (or all) of the comparison work, and only output the matching pairs.  Second, you have to account for the fact that both observations contain exactly the same variable names.  For comparison purposes, you'll have to rename one set of variables.

Here is a program that incorporates some of those ideas:

data match;

   set sashelp.class nobs=_nobs_;

   if _n_ < _nobs_ then do _i_ = _n_ + 1 to _nobs_;

     set sashelp.class (keep=name rename=(name=name2)) point=_i_;

     if name=name2 then do;

       original_observation = _n_;

       matching_observation = _i_;

       output;

       set sashelp.class point=_i_;

       output;
   end;

   drop name2;

run;

The idea is to output pairs that match, as two separate observations, using the original variable names.  Each observation of the pair contains the new variables ORIGINAL_OBSERVATION and MATCHING_OBSERVATION.  In the sample program, I'm comparing on NAME only, but you can use more than one variable to narrow down potential matches.  The bottleneck will be speed, but in theory this will chug along without other resource constraints.  As you noted, the number of comparisons is proportional to the square of the number of observations.

Super User
Posts: 9,681

Re: compare each line of a data table 2 by 2

Oh. That doesn't matter, as far as SAS session doesen't stop by accident. That is OK. You can run it as long as you can.

BTW, I recode it to make my code better ,faster, more succinct ........

data class;
 set sashelp.class(keep=name);
 k+1;
run;



data want;
if 0 then set class;
 declare hash ha1(dataset:'class',ordered:'Y');
 declare hiter hi1('ha1');
  ha1.definekey('k');
  ha1.definedata('k','name');
  ha1.definedone();
if 0 then set class(rename=(name=_name));
 declare hash ha2(dataset:'class(rename=(name=_name))',ordered:'Y');
 declare hiter hi2('ha2');
  ha2.definekey('k');
  ha2.definedata('_name');
  ha2.definedone();

do while(hi1.next()=0) ;
 rc=hi2.setcur(); 
 do while(rc=0);
  output;
  rc=hi2.next(); 
 end;
end; 
stop;
run;


Ksharp

Frequent Contributor
Posts: 124

Re: compare each line of a data table 2 by 2

I almost got what I want today, thanks all for your help

I just need to automate a bit more and to find the best output.

Without output or computations, I calculated empiracally that it should take about 55 min to go through 200k observations.

Reasonable I think, but going through the whole table at once isn't necessary so I still have optimization options available.

Here is my code :

%let WorkingTable = DUPLICATES_TABLE;

%let IntervalDays = 70;

%let vars1 = Id Germ Gender PostCode DateOfDiagnosis DateOfBirth;

%let vars2 = 'Id','Germ','Gender','PostCode','DateOfDiagnosis','DateOfBirth';

%let vars3 = Id=_Id Germ=_Germ Gender=_Gender PostCode=_PostCode DateOfDiagnosis=_DateOfDiagnosis DateOfBirth=_DateOfBirth;

%let vars4 = '_Id','_Germ','_Gender','_PostCode','_DateOfDiagnosis','_DateOfBirth';

  /* define class for hash table */

  data class;

  set &WorkingTable (keep=&vars1);

  k+1;

  run;

  /* compute nobs */

  %let dsid=%sysfunc(open(class));

  %let nobs=%sysfunc(attrn(&dsid,nobs));

  %let dsid=%sysfunc(close(&dsid));

  /* run comparison loops */

  data comparison (drop=rc k);

    if 0 then set class;

      declare hash ha1(dataset:'class',ordered:'Y');   *Définir hash table;

      declare hiter hi1('ha1');             *Définir hash iterator;

      ha1.definekey('k');               *Définir la variable identifiante;

      ha1.definedata('k',&vars2);           *Définir les autres variables;

      ha1.definedone();                 *Ecriture de la structure de la table;

    if 0 then set class(rename=(&vars3));

      declare hash ha2(dataset:"class(rename=(&vars3))",ordered:'Y');

      declare hiter hi2('ha2');

      ha2.definekey('k');

      ha2.definedata(&vars4);

      ha2.definedone();

    do while(hi1.next()=0);

      rc=hi2.setcur(key : k+1); *demarre l'iteration de la seconde loop a i+1;

      do while(rc=0);

        *if Gender= _Gender then put k= Gender= _Gender=;

        *if Gender= _Gender then output;

        *put k= Gender= _Gender=;

        *output;

        if (DateOfBirth=_DateOfBirth) AND (Gender=_Gender) AND (PostCode=_PostCode) AND (&GermVar= _&GermVar) AND (abs(DateOfDiagnosis - _DateOfDiagnosis) < &IntervalDays) then output;

        rc=hi2.next();

        end;

      end;

    stop;

    run; * no ouput, nobs=40k -> 2:06 min;

Super User
Posts: 9,681

Re: compare each line of a data table 2 by 2

It looks like you are looking for:

data _null_;
 set sashelp.class(keep=name);
 do i=_n_ to nobs;
  set sashelp.class(keep=name rename=(name=_name)) nobs=nobs point=i;
  put _all_;
 end;
run;

Ksharp

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 10 replies
  • 475 views
  • 6 likes
  • 4 in conversation