I wrote this to do the following:
1. Find all fields in a dataset with a length over 8 bytes.
2. For each of them, count the distinct values in the dataset.
3. If there are at least 10 records per value, create a surrogate key table for the field and replace the field with the key in a copy of the dataset.
There's also a third copy of the dataset that contains both the field and the key.
If you're comfortable with the results, you can then delete the original dataset as well as that third version. I didn't delete them because I wanted to show the difference in size.
For example, one dataset started out at 3,491,692,544 bytes; with 6 fields converted to keys, the normalized dataset came to 2,379,612,160 bytes plus 6 key tables (23,461,888 bytes), for a saving of 31%.
%macro normalize(lib,dsn);
  %* Find all character fields longer than 8 bytes (upcase makes the match case-insensitive);
  proc sql noprint;
    select name into :name1 -
    from dictionary.columns
    where upcase(compress(libname||'.'||memname)) = upcase("&lib..&dsn") and length > 8;
  quit;
  %if &sqlobs = 0 %then %return;
  %let reccnt = &sqlobs;

  %* Working copy of the dataset;
  proc sql;
    create table &lib..&dsn.2 as select * from &lib..&dsn;
  quit;

  %do i = 1 %to &reccnt;
    %put &&name&i;
    %* Average number of records per distinct value;
    proc sql noprint;
      select count(*), count(distinct(&&name&i)), int(count(*)/count(distinct(&&name&i)))
      into :count, :countdist, :pct
      from &lib..&dsn;
    quit;
    %put &=count &=countdist &=pct;
    %if &pct > 10 %then %do;
      %* Build the surrogate key table;
      proc sql;
        create table work.distinct as
        select distinct &&name&i
        from &lib..&dsn;
      quit;
      data &lib..sk_&&name&i;
        set work.distinct;
        sk_&&name&i = _n_;
      run;
      %* Attach the key, then drop the original field;
      proc sql;
        create table &lib..&dsn.3 as
        select t2.*, t1.sk_&&name&i
        from &lib..sk_&&name&i t1
        inner join &lib..&dsn.2 t2 on (t1.&&name&i = t2.&&name&i);
      quit;
      data &lib..&dsn.2;
        set &lib..&dsn.3;
        drop &&name&i;
      run;
    %end;
  %end;
%mend normalize;
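A minimal invocation sketch (MYLIB and CLAIMS are hypothetical names; uppercase arguments are safest, since dictionary.columns stores libname and memname in uppercase):

```sas
/* Hypothetical example: MYLIB and CLAIMS are placeholder names */
%normalize(MYLIB,CLAIMS)

/* Compare sizes afterwards: original, normalized copy, and key tables */
proc sql;
  select memname, filesize format=comma20.
  from dictionary.tables
  where libname = 'MYLIB'
    and (memname like 'CLAIMS%' or memname like 'SK_%');
quit;
```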
Does this build a star type schema automatically then? That would be cool and useful 🙂
What happens if you create a view on top of that and query the view instead of the table? Is the performance better?
Yes, that's essentially what it does. A view would be very fast. I use them regularly. Plus, you can add calculated fields in a view.
Here's an example:
PROC SQL;
CREATE VIEW WORK.vw_FD01_CLAIM_DETAILS2 AS
SELECT
t1.CLCL_RECD_DT,
t1.CLCL_PAID_DT,
t1.Date_claim_was_processed,
t1.Line_FROM_DT,
t1.Line_TO_DT,
t1.RCRC_ID,
t1.CDML_CHG_AMT,
t2.MEDICAID_ID,
t3.BILL_PROV_NPI,
t4.IPCD_ID,
t5.SERV_PROV_NPI,
t6.PRPR_ID,
t7.CLCL_ID_ADJ_FROM,
t8.CLCL_ID
FROM DB.FD01_CLAIM_DETAILS2 t1,
DB.SK_MEDICAID_ID t2,
DB.SK_BILL_PROV_NPI t3,
DB.SK_IPCD_ID t4,
DB.SK_SERV_PROV_NPI t5,
DB.SK_PRPR_ID t6,
DB.SK_CLCL_ID_ADJ_FROM t7,
DB.SK_CLCL_ID t8
WHERE (t1.sk_MEDICAID_ID = t2.sk_MEDICAID_ID
AND t1.sk_BILL_PROV_NPI = t3.sk_BILL_PROV_NPI
AND t1.sk_IPCD_ID = t4.sk_IPCD_ID
AND t1.sk_SERV_PROV_NPI = t5.sk_SERV_PROV_NPI
AND t1.sk_PRPR_ID = t6.sk_PRPR_ID
AND t1.sk_CLCL_ID_ADJ_FROM = t7.sk_CLCL_ID_ADJ_FROM
AND t1.sk_CLCL_ID = t8.sk_CLCL_ID);
QUIT;
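As an aside on calculated fields: the view above only joins tables, but a computed column can be added in the same SELECT. A minimal sketch (ALLOWED_AMT and DAYS_TO_PAY are hypothetical fields; the other names come from the view above):

```sas
PROC SQL;
CREATE VIEW WORK.vw_CLAIM_CALC AS
SELECT
t1.CLCL_RECD_DT,
t1.CLCL_PAID_DT,
t1.CDML_CHG_AMT,
/* hypothetical calculated fields */
t1.CDML_CHG_AMT * 0.8 AS ALLOWED_AMT,
INTCK('DAY', t1.CLCL_RECD_DT, t1.CLCL_PAID_DT) AS DAYS_TO_PAY
FROM DB.FD01_CLAIM_DETAILS2 t1;
QUIT;
```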
I have a project coming up where this is going to be awesome to use! Thanks!!!!
I haven't thoroughly checked the code, but
%if &pct > 10 %then %do;
should read
%if %sysevalf(&pct > 10) %then %do;
Otherwise, conditions such as 6.5>10 would be evaluated as TRUE, because %IF falls back to a character comparison when an operand is not an integer.
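A quick log demonstration of the difference (minimal sketch):

```sas
%macro demo;
  %let pct = 6.5;
  /* Without %SYSEVALF the operands are compared as character strings */
  %if &pct > 10 %then %put Plain comparison says TRUE ("6.5" > "10" as text);
  /* %SYSEVALF forces a numeric comparison */
  %if %sysevalf(&pct > 10) %then %put SYSEVALF says TRUE;
  %else %put SYSEVALF says FALSE (numeric comparison);
%mend demo;
%demo
```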
@FreelanceReinh would you ever have a decimal for count of a variable?
@Reeza: Macro variable PCT is defined as a quotient
count(*)/COUNT(DISTINCT(&&name&i))
which can take non-integer values.
Example:
data test(keep=c);
length c $9;
do i=1 to 13;
c=ifc(i<7,'A','B');
output;
end;
run;
%normalize(WORK, TEST)
@FreelanceReinh thanks! I added the int function to address that.
So it's not ten records, it's 10% of records then?
3. If there are at least 10 records per value, create a surrogate key table for the field and replace the field with the key in a copy of the dataset.
@Reeza no, at least 10 records. YMMV. I just decided to pick 10...no special reason.
@tomrvincent Thanks! I haven't run it yet, but the only IF condition I see relates to PCT, not a count, which is why I asked whether it's 10 percent or a distinct count of 10. I'll try it on the weekend and let you know how it goes. Do you have it on GitHub as well? If I make any changes, I can note them there.
@Reeza it isn't a percent...I was actually thinking 'per count' when I wrote it, but you can certainly change pct to whatever you want. Maybe 'rpd' for 'records per distinct' would work...whatever floats your boat. 🙂 No, it's not on GitHub; it's just here, so copy it and mod it as you see fit.
This article made me ponder.
1. Creating the SK tables
The macro reads the source table over and over. I modified it to only do one pass.
The run time drops from 582 seconds to 146 seconds for my example data.
(I also replaced the query against the dictionary: I avoid querying dictionary.columns because it can stall the session if RDBMS libraries are defined.)
2. Using the SK tables
> A view would be very fast. I use them regularly.
I have always found this to be the exception rather than the norm. I'd be curious to know more.
In my test,
- Querying the original table takes 2:16 (create table READ_ORIG below)
- Querying the "star schema" directly takes 3:15 (create table READ_STAR below)
- Querying the "star schema" from a view takes 4:01 (create table READ_STARV below)
I tried indexing the SK tables but the index is not used by default when joining, so no valid outcome for this test.
I can't attach a file to the library comments, so I apologise for the lengthy block below.
I made the font size small to make it less obnoxious - though the line spacing seems unchanged 😞.
data X Y Z;
  do I=1 to 1e6;
    A=put(I,z10.); B=put(I,z10.); C=put(I,z10.);
    D=put(I,z10.); E=put(I,z10.); F=put(I,z10.);
    do J=1 to 20; output; end;
  end;
run;

%macro normalise(lib,dsn,do_index=0);
  %let time=%sysfunc(time());

  %* Find variables having length > 8;
  proc contents data=&lib..&dsn. noprint out=____NORMALISE_CONTENTS;
  run;
  proc sql;
    select NAME into :name1 - from ____NORMALISE_CONTENTS where LENGTH > 8;
  quit;
  %let nbvar = &sqlobs;
  %if &nbvar. = 0 %then %return;

  %* Find variables having at least 10 records per value on average, in one pass;
  proc sql noprint;
    create table ____NORMALISE_COUNTS as
    select count(*) as NB_OBS
    %do i = 1 %to &nbvar.;
      ,"&&name&i" as NAME_&i.
      ,count(distinct(&&name&i)) as CARDINALITY_&i.
      ,int(count(*)/count(distinct(&&name&i))) as NB_RECPERVALUE_&i.
    %end;
    from &lib..&dsn;
  quit;

  %* Save as macro variables NAME1-NAMEn the variables qualifying for being detached;
  data _null_;
    set ____NORMALISE_COUNTS;
    %do i = 1 %to &nbvar.;
      if NB_RECPERVALUE_&i. > 10 then do;
        N+1;
        call symput(cats('name',N), NAME_&i.);
      end;
    %end;
    call symputx('nbvar',N);
  run;

  %* Save as macro variable KEEP_NAMES the variables not detached;
  proc sql noprint;
    select catt('t0.',NAME) into :keep_names separated by ','
    from ____NORMALISE_CONTENTS
    where NAME not in (' ' %do i=1 %to &nbvar; ,"&&name&i" %end; );
    drop table ____NORMALISE_CONTENTS;
    drop table ____NORMALISE_COUNTS;
  quit;
  %if &nbvar. = 0 %then %return;

  %* Create a foreign key table for each variable;
  %do i = 1 %to &nbvar.;
    %put ==> &i &&name&i;
    %* Create SK table;
    proc sql;
      create table &lib..&dsn._SK&i. %if &do_index. %then (index=(SK_&&name&i));
      as select &&name&i., monotonic() as SK_&&name&i.
      from (select distinct &&name&i. from &lib..&dsn.);
    quit;
  %end;

  %* Create new base table with values replaced with foreign keys;
  proc sql;
    create table &lib..&dsn._SK as
    select &keep_names.
    %do i = 1 %to &nbvar.;
      , t&i..sk_&&name&i.
    %end;
    from &lib..&dsn. t0
    %do i = 1 %to &nbvar.;
      inner join &lib..&dsn._SK&i. t&i. on t0.&&name&i. = t&i..&&name&i.
    %end;
    ;
  quit;

  %* Create view on star schema;
  proc sql;
    create view &lib..&dsn._STARV as
    select &keep_names.
    %do i = 1 %to &nbvar.;
      , t&i..&&name&i.
    %end;
    from &lib..&dsn._SK t0
    %do i = 1 %to &nbvar.;
      inner join &lib..&dsn._SK&i. t&i. on t0.sk_&&name&i. = t&i..sk_&&name&i.
    %end;
    ;
  quit;

  %put =========> runtime normalise %sysfunc(int(%sysfunc(time())-&time)) seconds.;
%mend normalise;

%macro normalize(lib,dsn);
  %let time=%sysfunc(time());
  proc contents data=&lib..&dsn. noprint out=____CONT;
  run;
  proc sql;
    select NAME into :name1 - from ____CONT where LENGTH > 8;
  quit;
  %if &sqlobs. = 0 %then %return;
  %let reccnt = &sqlobs;
  proc sql;
    create table &lib..&dsn.2 as select * from &lib..&dsn;
  quit;
  %do i = 1 %to &reccnt;
    %put ==> &i &&name&i;
    proc sql;
      select count(*), count(distinct(&&name&i)), int(count(*)/count(distinct(&&name&i)))
      into :count, :countdist, :pct
      from &lib..&dsn;
    quit;
    %put &=count &=countdist &=pct;
    %if &pct > 10 %then %do;
      proc sql;
        create table work.distinct as
        select distinct &&name&i
        from &lib..&dsn;
      quit;
      data &lib..sk_&&name&i;
        set work.distinct;
        sk_&&name&i = _n_;
      run;
      proc sql;
        create table &lib..&dsn.3 as
        select t2.*, t1.sk_&&name&i
        from &lib..sk_&&name&i t1
        inner join &lib..&dsn.2 t2 on (t1.&&name&i = t2.&&name&i);
      quit;
      data &lib..&dsn.2;
        set &lib..&dsn.3;
        drop &&name&i;
      run;
    %end;
  %end;
  %put =========> runtime normalize %sysfunc(int(%sysfunc(time())-&time)) seconds.;
%mend normalize;

option mprint msglevel=i;
%normalise (work,X);
%normalise (work,Y,do_index=1);
%normalize (work,Z);

proc sql;
  create table READ_ORIG as select * from X;
quit;

proc sql;
  create table READ_STAR as
  select t0.I, t0.J, t1.A, t2.B, t3.C, t4.D, t5.E, t6.F
  from work.X_SK t0
  inner join work.X_SK1 t1 on t0.sk_A = t1.sk_A
  inner join work.X_SK2 t2 on t0.sk_B = t2.sk_B
  inner join work.X_SK3 t3 on t0.sk_C = t3.sk_C
  inner join work.X_SK4 t4 on t0.sk_D = t4.sk_D
  inner join work.X_SK5 t5 on t0.sk_E = t5.sk_E
  inner join work.X_SK6 t6 on t0.sk_F = t6.sk_F
  ;
quit;

proc sql;
  create table READ_STARI as
  select t0.I, t0.J, t1.A, t2.B, t3.C, t4.D, t5.E, t6.F
  from work.Y_SK t0
  inner join work.Y_SK1 t1 on t0.sk_A = t1.sk_A
  inner join work.Y_SK2 t2 on t0.sk_B = t2.sk_B
  inner join work.Y_SK3 t3 on t0.sk_C = t3.sk_C
  inner join work.Y_SK4 t4 on t0.sk_D = t4.sk_D
  inner join work.Y_SK5 t5 on t0.sk_E = t5.sk_E
  inner join work.Y_SK6 t6 on t0.sk_F = t6.sk_F
  ;
quit;

proc sql;
  create table READ_STARV as select * from work.X_STARV;
quit;

proc sql;
  create table READ_STARVI as select * from work.Y_STARV;
quit;