Re: Loop through tables at different periods and columns name

Rickie2 · Posted 08-26-2024 09:53 AM

Hi everybody,

I would like to figure out the best way to solve this :

create a macro with a loop that substract (or if character indicate if difference) each variable from datasets of different periods (P0 - P1 until PX-1 - PX), and put the result in a new dataset. For example, apple_P0 - apple_P1, Orange_P0 - Orange_P1,then same for Pineapple and TotalAB. For dataset P0 to dataset PX, increment by 1.

The dataset contains about 1000 colunms.

%let %n = 0;

%let %X = X; (max period)

data HAVE_P0;

input ID Apple_P0 Orange_P0 Pineapple_P0 TotalAB_P0 $;

cards;

15427 10 100 1000 Machine

35894 20 200 2000 Hand

57842 40 400 4000 Hand

79432 75 750 7500 Machine

;run;

data HAVE_P1;

input ID Apple_P1 Orange_P1 Pineapple_P1 TotalAB_P1 $;

cards;

15427 50 500 5000 Machine

35894 10 100 1000 Machine

57842 40 400 4000 Wind

;run;

data HAVE_P2;

input ID Apple_P2 Orange_P2 Pineapple_P2 TotalAB_P2

cards;

...

data HAVE_PX;

etc.

(

case if Var_P1 do not exist then .

if Var_P0 is character and P0 = P1 then 1

if Var_P0 is character and P0 <> P1 then 0

else Var_P0 - Var_P1

end

)

data WANT_P0_minus_P1;

15427 40 400 4000 1

35894 -10 -100 -1000 0

57842 0 0 0 0

79432 . . . .

;

data WANT_P1_minus_P2;

…

data WANT_PX-1_minus_PX;

Thanks in advance,

Tom · Posted 08-26-2024 10:29 AM

What is the overall goal here? What is the meaning of the data and what is the analysis you are trying to do?

What does this P0, P1, etc suffix on the variable names mean?

Why is the data in multiple datasets? Why not just store all of the data in one dataset?

Why do the variable names change between the datasets?

Why not just have a separate variable with values like P0 or P1 (or perhaps numeric variable with 0 and 1) to indicate which P value this observation is for?

Do you always have the same 100 variables?

Do you have SAS/IML license? Could you load the two datasets into matrices and just subtract them?

Why do you need to make some many different difference datasets?

ballardw · Posted 08-26-2024 10:36 AM

From your pseudo-code, why are variables that you say you want to 'subtract' apparently sometimes character???

There are elements of the question that seem to indicate a poor data structure or management design at the base that is terribly complicating your data for what should be relatively simple actions.

Rickie2 · Posted 08-27-2024 04:47 AM

Thanks for your answers Tom,

here are the answers for your questions:

- What is the overall goal here? What is the meaning of the data and what is the analysis you are trying to do?

Those data will be used for a ML model. A left outer join brings all of tables together. The new table is going to be the dataset to modelized. Each table represent measures from different periods P: P0 current period, P1 the previous period, P2 the period before P1, ... until PX which is the ultimate period. That's why it required a loop. 1° P0-P1 2° P1-P2 3° P2-P3 .... PX-1 - PX.

- What does this P0, P1, etc suffix on the variable names mean? Periods of time, example a year, measurement for a given year.

- Why is the data in multiple datasets? Why not just store all of the data in one dataset? No problem, this step can be done before make a comparison step. In this case, the variable names have to keep P0, P1, P2, etc. at the end of the variable name.

- Why do the variable names change between the datasets? The value is computed period by period. Table Have_PO are results for the current year, Have_P1 results for the previous year, Have_P2 etc.

Why not just have a separate variable with values like P0 or P1 (or perhaps numeric variable with 0 and 1) to indicate which P value this observation is for? It's a possibility.

Do you always have the same 100 variables? Yes, periods P0, P1, P2,...,PX will always have the same variables

Do you have SAS/IML license? Could you load the two datasets into matrices and just subtract them? NO

Why do you need to make some many different difference datasets? For a ML model, to follow the evolution between periods. Then tables are united with a left outer join.

if you have other questions, don't hesitate,

your help is greatly appreciated.

Tom · Posted 08-27-2024 12:49 PM

Sounds like you are trying to do feature generation to pass into your ML software?

Not sure if ML software is smart enough to do the sort of temporal change calculations for you.

With your described source data it would be trivial to combine the datasets using normal SAS code. MUCH MUCH easier than trying to do it with SQL code.

data want;
  merge have_p0 - have_p100;
  by id;
run;

To then generate your difference flags you could use ARRAYs in the data step.

Say you have a series of 101 numeric variables named APPLE_P0 to APPLE_p100 and another series of character variables (let's call it NAME). You could then generate a series of 100 difference variables comparing the value to the previous value (or perhaps to the period 0 value) using arrays and DO lops like this:

data want;
  merge have_p0 - have_p100;
  by id;
  array _apple Apple_p0 - Apple_p100;
  array d_apple diff_apple_p1-diff_apple_p100;
  array _name Name_p0 - Name_p100;
  array d_name diff_name_p1-diff_name_p100;
  do index=2 to dim(_apple);
     d_apple[index-1] = _apple[index] - _apple[index-1] ;
     d_name[index-1] = _name[index]  ne _name[index-1] ;
  end;
run;

If you need to extend that to a number of other variables then you could use code generation. Either using macro language. Or just using normal SAS code to write the generated SAS code to a file which can be included into the program using the %INCLUDE statement.

Tom · Posted 08-26-2024 10:38 AM

Why not just use PROC COMPARE?

data HAVE_P0;
    input ID Apple_P0 Orange_P0 Pineapple_P0 TotalAB_P0 $;
cards;
15427 10    100       1000         Machine
35894 20    200       2000         Hand
57842 40    400       4000         Hand
79432 75    750       7500         Machine
;
 
 
data HAVE_P1;
    input ID Apple_P1 Orange_P1 Pineapple_P1 TotalAB_P1 $;
cards;
15427     50 500 5000 Machine
35894     10 100 1000 Machine
57842     40 400 4000 Wind
;

proc compare data=have_p0 compare=have_p1 outall out=want noprint;
  id id;
  var Apple_P0 Orange_P0 Pineapple_P0 TotalAB_P0;
  with Apple_P1 Orange_P1 Pineapple_P1 TotalAB_P1;
run;

proc print data=want;
run;

If you need help getting the list of variable names perhaps you can let SAS calculate them for you.

proc sql noprint;
  select a.name 
       , b.name
    into :var0 separated by ' '
       , :var1 separated by ' '
  from
      (select name,type from dictionary.columns
        where libname='WORK' and memname='HAVE_P0'
          and upcase(name) like '%_P0'
       ) a
  inner join 
      (select name,type from dictionary.columns
        where libname='WORK' and memname='HAVE_P1'
          and upcase(name) like '%_P1'
       ) b
  on (upcase(substrn(a.name,1,length(a.name)-3))
   = upcase(substrn(b.name,1,length(b.name)-3)))
   and a.type=b.type
  order by 1
;
%let nvars=&sqlobs;
quit;

%put &=nvars;
%put &=var0;
%put &=var1;

Loop through tables at different periods and columns name