DATA Step, Macro, Functions and more

Enhancement Request for CATX, plus some questions

Accepted Solution Solved
Reply
Super Contributor
Posts: 376
Accepted Solution

Enhancement Request for CATX, plus some questions

Here is an excerpt of a case I opened with SAS TS:

I’ve got an ETL process where I’m doing change record detection, and modifying data (close old record, add new record) if a change has occurred.  It’s somewhat analogous to SCD2 processing.

I’ve ran into a problem where data was copied into the wrong column.  This error was corrected, but due to the way CATX works, a change is not detected.

An example will help to illustrate the point:

data test;

  length var1 var2 var3 $1 var4 var5 8 buffer1 buffer2 $32767 digest1 digest2 $32;

  input var1-var5;

  buffer1=catx("|", of var1-var5);

  digest1=put(md5(buffer1),hex32.);

  buffer2=strip(var1) || "|" || strip(var2) || "|" || strip(var3) || "|" || strip(var4) || "|" || strip(var5);

  digest2=put(md5(buffer2),hex32.);

   datalines;

A B . 1 1

A . B 1 1

;

run;

I'd prefer to use the CATX function - I can use a variable list, I don't have to think about data type conversion, and don’t have to jump through hoops in my code generation.  My actual code has about 40 variables I need to check for changes.

The solution, of course, is a flag for the CATX function that would mimic the concatenation operator.  IOW, print the delimiter regardless of whether the data value is missing.

I think (???) the CATX function could be changed where it would not break existing code.  Example invocations:

buffer1=catx("|", of var1-var5, “Y”);

buffer1=catx("|", var1, var2, var3, var4, var5, “Y”);

The last parameter does not have to be “Y”, it just needs to be some character string, that could not be confused with a variable, that would trigger CATX to print the delimiter regardless of whether the variable is missing or not.  There would be a delimiter for every variable – 1.

Questions:

Can you think of an ingenious way to address this problem, other than the concatenation operator as above?  Some ideas:

* Perhaps build the buffer one variable at a time, translating missing to “00”x, to “trick” CATX into printing the delimiter.  But spinning over the variable list would be problematic since it is a mix of char and num.

* Or, perhaps I can export the dataset as a delimited file, read that file back in without delimiters, and use _INFILE_ as the buffer for the MD5 function.

* Or, could I use PROC FCMP to roll my own CATX function that would give me the desired results?  (BTW, the desired results is a buffer with one delimiter per variable - 1)

Thanks...


Accepted Solutions
Solution
‎03-10-2014 09:57 PM
Respected Advisor
Posts: 4,641

Re: Enhancement Request for CATX, plus some questions

Try using the CATQ function:

data test;

  length var1 var2 var3 $1 var4 var5 8 buffer1 buffer2 buffer3 $32767 digest1 digest2 digest3 $32;

  input var1-var5;

  buffer1=catx("|", of var1-var5);

  digest1=put(md5(buffer1),hex32.);

  buffer2=strip(var1) || "|" || strip(var2) || "|" || strip(var3) || "|" || strip(var4) || "|" || strip(var5);

  digest2=put(md5(buffer2),hex32.);

  buffer3 = catq("MTD", "|", of var1-var5);

  digest3=put(md5(buffer3),hex32.);

datalines;

A B . 1 1

A . B 1 1

;

proc print data=test noobs; run;

PG

PG

View solution in original post


All Replies
Super Contributor
Posts: 376

Re: Enhancement Request for CATX, plus some questions

Hi All,

I think I've solved my own problem.  I remembered a technique I first saw proposed by (a really smart guy, you all should follow him Smiley Happy).

I get one extra delimiter at the beginning of the buffer string, but for my purposes of creating a message digest, it meets my needs for change detection.

If you can think of a better approach, please let me know.

Thanks...

* test data, should be a different digest for each observation ;

data test;

  length var1 var2 var3 $1 var4 var5 8;

  input var1-var5;

  datalines;

A B . 1 .

A . B 1 .

A B . . 1

A . B . 1

;

run;

* define the variables of interest.  i've purposely dropped off var3 ;

%let vars=var1 var2 var4 var5;

* first test with default option for missing numerics ;

options missing=".";

data test2;

  * use a "naked format" to define the PDV order ;

  * the _dummy_ variable in the PDV is an end-of-processing marker ;

  format &vars _dummy_;

  * prevent unitialized variable notes ;

  retain _dummy_ "";

  set test;

  length _name $32 _value buffer1 $32767 digest1 $32 buffer2 $32767 digest2 $32;

  buffer1=catx("|", of var1-var5);

digest1=put(md5(buffer1),hex32.);

  do while (1);

    * get next variable in the PDV ;

    call vnext(_name);

    * bail if variable name = _dummy_ ;

    if (_name="_dummy_") then leave;

    * get the variable value.  no notes on numeric to character conversion, which is nice ;

_value=vvaluex(_name);

buffer2=cats(buffer2,"|",_value);

  end;

digest2=put(md5(buffer2),hex32.);

run;

* then do the same test as above with blank for missing numerics ;

options missing=" ";

data test3;

  format &vars _dummy_;

  retain _dummy_ "";

  set test;

  length _name $32 _value buffer1 $32767 digest1 $32 buffer2 $32767 digest2 $32;

buffer1=catx("|", of var1-var5);

digest1=put(md5(buffer1),hex32.);

  do while (1);

    call vnext(_name);

    if (_name="_dummy_") then leave;

_value=vvaluex(_name);

buffer2=cats(buffer2,"|",_value);

  end;

digest2=put(md5(buffer2),hex32.);

run;

* and if the PDV order is really critical to you, grab Richard Devenezia's (another really smart guy) %seplist macro ... (Google for his website) ;

data test4;

  set test;

  retain _dummy_ "";

  length _name $32 _value buffer1 $32767 digest1 $32 buffer2 $32767 digest2 $32;

buffer1=catx("|", of var1-var5);

digest1=put(md5(buffer1),hex32.);

  do while (1);

    call vnext(_name);

    if (_name="_dummy_") then leave;

    if (_name in (%seplist(&vars,nest=QQ))) then do;

_value=vvaluex(_name);

buffer2=cats(buffer2,"|",_value);

    end;

  end;

digest2=put(md5(buffer2),hex32.);

run;

Solution
‎03-10-2014 09:57 PM
Respected Advisor
Posts: 4,641

Re: Enhancement Request for CATX, plus some questions

Try using the CATQ function:

data test;

  length var1 var2 var3 $1 var4 var5 8 buffer1 buffer2 buffer3 $32767 digest1 digest2 digest3 $32;

  input var1-var5;

  buffer1=catx("|", of var1-var5);

  digest1=put(md5(buffer1),hex32.);

  buffer2=strip(var1) || "|" || strip(var2) || "|" || strip(var3) || "|" || strip(var4) || "|" || strip(var5);

  digest2=put(md5(buffer2),hex32.);

  buffer3 = catq("MTD", "|", of var1-var5);

  digest3=put(md5(buffer3),hex32.);

datalines;

A B . 1 1

A . B 1 1

;

proc print data=test noobs; run;

PG

PG
Super Contributor
Posts: 376

Re: Enhancement Request for CATX, plus some questions

Thanks PG.  I thought I knew all the CAT* functions, apparently CATQ escaped my reading of the doc :smileyblush:.  Besides being less code, I'm sure it's more performant than my approach.

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 3 replies
  • 273 views
  • 2 likes
  • 2 in conversation