BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
mathias
Quartz | Level 8

Hello,

I would like to compare each line of a data table 2 by 2.

(the first obs with all the following, then the second with all the following, etc.)

I found a way to do it with 2 do loops in a macro but it's too slow

I had in mind to make something like this but it doesn't work since I have a limited understanding of how data steps work

I'm not even sure it's the best way to achieve what I need, so any other idea is welcome

  %let WorkingTable = MyTable

  %let nobs=20; /* simplified, number of observations in &WorkingTable = ~220 000 */

  %let k=Gender; /* simplified, should be more than one variable in the end : =Gender Postcode */

  data _null_;

    set &WorkingTable;

    %let i=_N_;

    call symputx('m', &k);

    data _null_;

      set &WorkingTable (firstobs=&i);

      call symputx('n', &k);

      if &m=&n then putlog 'Ok!';

      run;

    run;

Feel free to ask if I'm not clear enough.

Thanks for your help

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

I don't understand what you mean when you refer to Crashed with nobs = 5000 ?

If you want faster, Hash Table is a good choice.

data class1;
 set sashelp.class(keep=name);
 k+1;
run;
data class2;
 set sashelp.class(keep=name rename=(name=_name));
 _k+1;
run;


%let dsid=%sysfunc(open(class1));
%let nobs=%sysfunc(attrn(&dsid,nobs));
%let dsid=%sysfunc(close(&dsid));
data want;
if 0 then set class1;
 declare hash ha1(dataset:'class1',ordered:'Y');
  ha1.definekey('k');
  ha1.definedata('name');
  ha1.definedone();
if 0 then set class2;
 declare hash ha2(dataset:'class2',ordered:'Y');
  ha2.definekey('_k');
  ha2.definedata('_name');
  ha2.definedone();

do k=1 to &nobs; 
 ha1.find();
 do _k=k to &nobs;
  ha2.find();output;
 end;
end;
stop;
run;


Ksharp

View solution in original post

10 REPLIES 10
ballardw
Super User

What kind of comparison are you looking for? And do you actually mean compare the first row with all of 220,000 rows?

If so, I'm afraid that what you're desribing is going to create on the order of 220000! (factorial) comparisons and isn't likely to finish in any reasonable time.

mathias
Quartz | Level 8

Hello,

Thanks for taking the time to help me

> What kind of comparison are you looking for?

I want to search for duplicated rows, the only way is to compare each row with each other (proc sort nodupkey does not allow nearby dates to be considered as equal)


> And do you actually mean compare the first row with all of 220,000 rows?

Yes, (The first with the 219 999 others rows; then the second with the 219 998 others etc.)

> If so, I'm afraid that what you're desribing is going to create on the order of 220000! (factorial) comparisons and isn't likely to finish in any reasonable time.

Complexity should be O(n²) not O(n!), so it should be feasable in less than a day I think.

Either way, not all the tables I need to run are so big, this is an exception.


>data _null_;

> set sashelp.class(keep=name);

> do i=_n_ to nobs;

>  set sashelp.class(keep=name rename=(name=_name)) nobs=nobs point=i;

>  put _all_;

> end;

>run;

That's it I think,

Crashed with nobs = 5000 but that's another issue

Thank you !


Ksharp
Super User

I don't understand what you mean when you refer to Crashed with nobs = 5000 ?

If you want faster, Hash Table is a good choice.

data class1;
 set sashelp.class(keep=name);
 k+1;
run;
data class2;
 set sashelp.class(keep=name rename=(name=_name));
 _k+1;
run;


%let dsid=%sysfunc(open(class1));
%let nobs=%sysfunc(attrn(&dsid,nobs));
%let dsid=%sysfunc(close(&dsid));
data want;
if 0 then set class1;
 declare hash ha1(dataset:'class1',ordered:'Y');
  ha1.definekey('k');
  ha1.definedata('name');
  ha1.definedone();
if 0 then set class2;
 declare hash ha2(dataset:'class2',ordered:'Y');
  ha2.definekey('_k');
  ha2.definedata('_name');
  ha2.definedone();

do k=1 to &nobs; 
 ha1.find();
 do _k=k to &nobs;
  ha2.find();output;
 end;
end;
stop;
run;


Ksharp

mathias
Quartz | Level 8

> I don't understand what you mean when you refer to Crashed with nobs = 5000 ?

I always try the algorithm first on little tables (nobs=5) if it works, I increase it step by step to have a better feeling on running times

nobs=50, nobs=500 took less than a minute but nobs 5000 crashed (in fact i found out lateer that the crash occurs between nobs = 1400 and 1500)

> If you want faster, Hash Table is a good choice.

much much faster as I can see.

Crash occurs between 15 000 and 16 000 this time (lack of RAM I think)

I have a few questions

-Is it necessary to create two same tables in the beginning?

-WANT is growing too big, is there a way to still have the same computation without storing it in a table and thus in RAM memory ?

Ksharp
Super User

"Crash occurs between 15 000 and 16 000 this time"

Did you mean SAS session stop by accident ? But for a dataset only containing 22,000 obs ,that is not a problem I think.

"-Is it necessary to create two same tables in the beginning?"

It isn't necessary. You can do the same thing in the code when building the Hash Table.

I do this to try to enhance the code's readability . and it is very convenient if the dataset is not too big.

"-WANT is growing too big, is there a way to still have the same computation without storing it in a table and thus in RAM memory ?"

Yes. It is up to what you want, you can output obs you need and discard others .

Ksharp

mathias
Quartz | Level 8

> "Did you mean SAS session stop by accident ?"

No, simply nothing happens.

When you look the Physical Memory Usage, you see that nobs=16k needs more than 6Gb of RAM (cf. screenshot)2012-05-23_093032.png


> "Yes. It is up to what you want, you can output obs you need and discard others ."

Thanks, I'll try to play with that and come back if I can't find it out by myself

Astounding
PROC Star

mathias,

I'm going to build upon KSharp's first solution.  If these ideas work for you, maybe he can convert it to a hash version.

First, you have to incorporate the idea that it is not necessary to output every combination.  The same DATA step that reads in all the pairs can also perform some (or all) of the comparison work, and only output the matching pairs.  Second, you have to account for the fact that both observations contain exactly the same variable names.  For comparison purposes, you'll have to rename one set of variables.

Here is a program that incorporates some of those ideas:

data match;

   set sashelp.class nobs=_nobs_;

   if _n_ < _nobs_ then do _i_ = _n_ + 1 to _nobs_;

     set sashelp.class (keep=name rename=(name=name2)) point=_i_;

     if name=name2 then do;

       original_observation = _n_;

       matching_observation = _i_;

       output;

       set sashelp.class point=_i_;

       output;
   end;

   drop name2;

run;

The idea is to output pairs that match, as two separate observations, using the original variable names.  Each observation of the pair contains the new variables ORIGINAL_OBSERVATION and MATCHING_OBSERVATION.  In the sample program, I'm comparing on NAME only, but you can use more than one variable to narrow down potential matches.  The bottleneck will be speed, but in theory this will chug along without other resource constraints.  As you noted, the number of comparisons is proportional to the square of the number of observations.

Ksharp
Super User

Oh. That doesn't matter, as far as SAS session doesen't stop by accident. That is OK. You can run it as long as you can.

BTW, I recode it to make my code better ,faster, more succinct ........

data class;
 set sashelp.class(keep=name);
 k+1;
run;



data want;
if 0 then set class;
 declare hash ha1(dataset:'class',ordered:'Y');
 declare hiter hi1('ha1');
  ha1.definekey('k');
  ha1.definedata('k','name');
  ha1.definedone();
if 0 then set class(rename=(name=_name));
 declare hash ha2(dataset:'class(rename=(name=_name))',ordered:'Y');
 declare hiter hi2('ha2');
  ha2.definekey('k');
  ha2.definedata('_name');
  ha2.definedone();

do while(hi1.next()=0) ;
 rc=hi2.setcur(); 
 do while(rc=0);
  output;
  rc=hi2.next(); 
 end;
end; 
stop;
run;


Ksharp

mathias
Quartz | Level 8

I almost got what I want today, thanks all for your help

I just need to automate a bit more and to find the best output.

Without output or computations, I calculated empiracally that it should take about 55 min to go through 200k observations.

Reasonable I think, but going through the whole table at once isn't necessary so I still have optimization options available.

Here is my code :

%let WorkingTable = DUPLICATES_TABLE;

%let IntervalDays = 70;

%let vars1 = Id Germ Gender PostCode DateOfDiagnosis DateOfBirth;

%let vars2 = 'Id','Germ','Gender','PostCode','DateOfDiagnosis','DateOfBirth';

%let vars3 = Id=_Id Germ=_Germ Gender=_Gender PostCode=_PostCode DateOfDiagnosis=_DateOfDiagnosis DateOfBirth=_DateOfBirth;

%let vars4 = '_Id','_Germ','_Gender','_PostCode','_DateOfDiagnosis','_DateOfBirth';

  /* define class for hash table */

  data class;

  set &WorkingTable (keep=&vars1);

  k+1;

  run;

  /* compute nobs */

  %let dsid=%sysfunc(open(class));

  %let nobs=%sysfunc(attrn(&dsid,nobs));

  %let dsid=%sysfunc(close(&dsid));

  /* run comparison loops */

  data comparison (drop=rc k);

    if 0 then set class;

      declare hash ha1(dataset:'class',ordered:'Y');   *Définir hash table;

      declare hiter hi1('ha1');             *Définir hash iterator;

      ha1.definekey('k');               *Définir la variable identifiante;

      ha1.definedata('k',&vars2);           *Définir les autres variables;

      ha1.definedone();                 *Ecriture de la structure de la table;

    if 0 then set class(rename=(&vars3));

      declare hash ha2(dataset:"class(rename=(&vars3))",ordered:'Y');

      declare hiter hi2('ha2');

      ha2.definekey('k');

      ha2.definedata(&vars4);

      ha2.definedone();

    do while(hi1.next()=0);

      rc=hi2.setcur(key : k+1); *demarre l'iteration de la seconde loop a i+1;

      do while(rc=0);

        *if Gender= _Gender then put k= Gender= _Gender=;

        *if Gender= _Gender then output;

        *put k= Gender= _Gender=;

        *output;

        if (DateOfBirth=_DateOfBirth) AND (Gender=_Gender) AND (PostCode=_PostCode) AND (&GermVar= _&GermVar) AND (abs(DateOfDiagnosis - _DateOfDiagnosis) < &IntervalDays) then output;

        rc=hi2.next();

        end;

      end;

    stop;

    run; * no ouput, nobs=40k -> 2:06 min;

Ksharp
Super User

It looks like you are looking for:

data _null_;
 set sashelp.class(keep=name);
 do i=_n_ to nobs;
  set sashelp.class(keep=name rename=(name=_name)) nobs=nobs point=i;
  put _all_;
 end;
run;

Ksharp

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1286 views
  • 6 likes
  • 4 in conversation