BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
NKormanik
Barite | Level 11

I'm about to search for answers.  Who knows if I'll find a solution.

 

But, just putting this up here, in case one of you has some thoughts and wants to share.

 

So, 100 datasets in a library.  It's possible that some of them are identical to each other, with merely different names.  I've used standard Windows tools to test this out, and the results came back negative.  Curious if SAS concurs?

 

Google is indicating that Proc Compare has possibilities.  However, comparing two datasets at a time, given 100 datasets, doesn't seem to be the way to go.

 

Looking forward, as always, to your responses.

 

Nicholas Kormanik

 

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

@NKormanik 

Below fully working code (you can run it in your environment) should give you many pointers how you could go about this.

I was too lazy to add much comment so you will need to get your head around how this is working. 

/** create sample data **/
options dlcreatedir;
libname src_lib "%sysfunc(pathname(work))/data_source";

proc datasets lib=src_lib nolist mt=data;
  copy
    in=sashelp
    out=src_lib;
    ;
  run;
quit;

data src_lib.class2 src_lib.class3;
  set sashelp.class;
run;

data src_lib.class4;
  set sashelp.class;
  age=10;
run;

data src_lib.class5;
  set sashelp.class;
  if _n_=1 then output;
  output;
run;

data src_lib.class6;
  set sashelp.class;
  new_var='a';
run;

/** and here all the logic **/
proc sql;
  create table work.tbl_and_cols as
  select libname, memname, upcase(name) as name, type
  from dictionary.columns
  where libname='SRC_LIB'
  order by memname, name
  ;
quit;


data work.table_meta;
  set work.tbl_and_cols;
  by memname name;
  length col_name_digest col_type_digest $32;
  retain col_name_digest col_type_digest;
  col_name_digest=put(md5(catx('|',col_name_digest,name)),$hex32.);
  col_type_digest=put(md5(catx('|',col_type_digest,type)),$hex32.);
  if last.memname then 
    do;
      output;
      call missing(col_name_digest, col_type_digest);
    end;
run;

proc sql;
  create table work.comp_candidates as
  select 
    l.libname as libname,
    l.memname as l_memname,
    r.memname as r_memname
  from work.table_meta l inner join work.table_meta r
  on 
    l.col_name_digest=r.col_name_digest and
    l.col_type_digest=r.col_type_digest and
    l.memname > r.memname
  ;
quit;

%macro comp(libref,base,comp);
  proc compare
    base=&libref..&base
    comp=&libref..&comp
    noprint
    out=work.__comp_result
    OUTNOEQUAL
    ;
  run;
  data work.__ds_no_diff;
    if nobs=0 then
      do;
        if 0 then set work.comp_candidates;
        libname="&libref";
        l_memname="&base";
        r_memname="&comp";
        keep libname l_memname r_memname;
        output;
      end;
    stop;  
    set work.__comp_result nobs=nobs;
  run;
  proc append base=work.ds_no_diff data=work.__ds_no_diff;
  run;quit;

%mend;

proc datasets lib=work nolist nowarn;
  delete __comp_result __ds_no_diff ds_no_diff;
  run;
quit;

data _null_;
  set work.comp_candidates;
  length cmd $100;
  cmd=cats('%comp(',libname,',',l_memname,',',r_memname,')');
  call execute(cmd);
run;

proc datasets lib=work nolist nowarn;
  delete __comp_result __ds_no_diff;
  run;
quit;

/** print result: Pairs of identical tables **/
proc print data=work.ds_no_diff;
run;

 

Patrick_0-1646647272121.png

 

View solution in original post

17 REPLIES 17
pink_poodle
Barite | Level 11

That is possible. You just hold one dataset constant, and iterate over the remaining ones. The datasets can be put into a proc sql macro variable list. Then use two macro programs - one driver that is calling the iterations on another one that is comparing. Proc compare must have an easy output parameter to check if completely identical (“ods trace on;” will show what tables are output). Aggregate these 99  values of parameter in an output dataset together with names of the second dataset and check which one is the same.

Astounding
PROC Star
As a strategy, you might capture the number of observations in each data set, and compare only those with the same number of observations. That would likely reduce the number of comparisons from many thousands to a dozen or so.
NKormanik
Barite | Level 11

Bonus Question:

 

If you had to do a 2-dataset comparison, given 100 datasets, how many comparisons would you have to do to check each and every one of 'em?

 

Explain your answer.

 

 

Astounding
PROC Star
Straightforward combinations math:

100 * 99 / 2 = 4950

100 choices for first element in the pair, times 99 remaining choices for second element in the pair. Then cut the total in half to avoid double counting both A B plus B A.
NKormanik
Barite | Level 11

4950!!

 

Even with multi-threaded SAS OnDemand, seems a whopper.  Hate to get permanently kicked off.

 

On a stand-alone desktop, yikes.

 

Plus, how would one code it anyway?

 

(Time for more macro learning....  Humphh....)

 

 

ballardw
Super User

First check the dictionary tables.

Start with dictionary.tables (or sashelp.vtables).

Look at numbers of observations and variable, Nobs and Nvar. The ones with the same values for these two variables are the only possible candidates for "duplicate".

 

Use that information and dictionary.columns to find out if the candidates have the same variables by name, length and type (might include other information of Label and format as well).

Only the ones that  match on all those points could be duplicates.

 

Hint: do two requests: select nobs, nvar

         and                     select distinct nobs, nvar

 

The number observations different is the number of duplicate candidates.

Similar with the name, type, length from dictionary.columns.

 

Ksharp
Super User

You could check INTERSECT or EXCEPT of SQL .

As this :

 

proc sql;
create table obs_in_A_B_C as
select * from A
intersect
select * from B
intersect
select * from C
;
quit;

Table 'obs_in_A_B_C ' contains  obs in common for A and B and C .

pink_poodle
Barite | Level 11

If you want to know which datasets are the same, you can set them into one long dataset with a variable containing dataset names then check for duplicates.

AllanBowe
Barite | Level 11

You could simply create a hash of each dataset, and compare the hashes in one go.  Here's a macro that would help:  https://core.sasjs.io/mp__md5_8sas.html

 

To clarify - this macro hashes the VALUES, not the metadata.  So things like creation date, labels etc are ignored.  

EDIT:  I shared the wrong macro.  The above WILL hash values, but this macro will be much easier to use as it will accumulate the hash across every row:  https://core.sasjs.io/mp__hashdataset_8sas.html

/Allan
MacroCore library for app developers
Data Workflows, Data Contracts, Data Lineage, Drag & drop excel EUCs to SAS 9 & Viya - Data Controller
DevOps and AppDev on SAS 9 / Viya / Base SAS - SASjs
NKormanik
Barite | Level 11

I suspect there should be at least one set of duplicates.  Just a hunch.

 

(By the way, what I'm doing is running a number of Proc Logistics, but using different options (as usual with SAS, options are abundant).  I think the results for some of the output tables/datasets for these options combinations should be identical.  Just wanted to check that out.)

 

Regarding using hash, etc., that's what I thought various Windows compare programs do.  Was surprised when they came up empty, as far as finding duplicates.

 

One such program is AllDup .

 

 

SASKiwi
PROC Star

Please note SAS datasets contain metadata storing the date and time they are created. Third-party tools will identify datasets as different merely because they were created at different times, even though they might contain identical rows and columns of data. Only SAS tools will provide the correct answer. 

NKormanik
Barite | Level 11

@SASKiwi  That's what I've been figurin' as well.  All it would take is a bit of different metadata to throw off the usual Windows compare programs.

 

(I think such programs are able to 'overlook' metadata differences in .mp3 files, by the way.  Title, Album, Artist, etc., tucked in right before the actual song begins.)

 

One can easily get a sense of SAS metadata by creating the smallest SAS dataset and look at it's size -- 99% of such a dataset appears to be made up of metadata.

 

Why hasn't some genius coded the needed macro yet??!!

 

Compare all datasets in given library.....

 

Patrick
Opal | Level 21

@NKormanik It's some work but shouldn't be that difficult to create such a macro. There is a lot of information in the dictionary tables that should allow to drastically reduce the tables for which it's still necessary to execute a proc compare.

You would first need to define what "same" and "different" means on a metadata and structural level like:

- Is a different order of variables "same" or "different"? I'd go for "same" as variable order shouldn't be of relevance.

- Is different casing of variable names "same" or "different"? Here I'd go again for "same"

- Are differences in variable attributes "same" or "different" - like length, format, informat, label?

- One table has an index or constraint, the other doesn't - same or different?

- ....more such stuff like table compressed or not, encoding, .....

NKormanik
Barite | Level 11

@Patrick , TOURMALINE TITAN, you are definitely on track.  I agree more than 100%.

 

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 17 replies
  • 3321 views
  • 20 likes
  • 8 in conversation