BookmarkSubscribeRSS Feed
Rickie2
Calcite | Level 5
Hi everybody,
 
I would like to figure out the best way to solve this :
 
create a macro with a loop that substract (or if character indicate if difference) each variable from datasets of different periods (P0 - P1 until PX-1 - PX), and put the result in a new dataset. For example, apple_P0 - apple_P1, Orange_P0 - Orange_P1,then same for  Pineapple and TotalAB. For dataset P0 to dataset PX, increment by 1.
 
The dataset contains about 1000 colunms.
 
%let %n = 0;
%let %X = X; (max period)
 
 
data HAVE_P0;
    input ID Apple_P0 Orange_P0 Pineapple_P0 TotalAB_P0 $;
    cards;
  15427 10    100       1000         Machine
  35894 20    200       2000         Hand
  57842 40    400       4000         Hand
  79432 75    750       7500         Machine
;run;
 
 
data HAVE_P1;
    input ID Apple_P1 Orange_P1 Pineapple_P1 TotalAB_P1 $;
    cards;
  15427     50 500 5000 Machine
    35894     10 100 1000 Machine
    57842     40 400 4000 Wind
;run;
 
data HAVE_P2;
    input ID Apple_P2 Orange_P2 Pineapple_P2 TotalAB_P2
    cards;
...
data HAVE_PX;
etc.
 
(
case if Var_P1 do not exist then .
if Var_P0 is character and P0 = P1 then 1
if Var_P0 is character and P0 <> P1 then 0
else Var_P0 - Var_P1
end
)
 
 
data WANT_P0_minus_P1;
15427      40 400 4000 1
35894   -10 -100 -1000 0
57842      0 0 0 0
79432      .     .       .       .
;
 
data WANT_P1_minus_P2;
 
data WANT_PX-1_minus_PX;
 
 
Thanks in advance,

 

5 REPLIES 5
Tom
Super User Tom
Super User

What is the overall goal here?  What is the meaning of the data and what is the analysis you are trying to do?

 

What does this P0, P1, etc suffix on the variable names mean?

Why is the data in multiple datasets?  Why not just store all of the data in one dataset?

Why do the variable names change between the datasets?

Why not just have a separate variable with values like P0 or P1 (or perhaps numeric variable with 0 and 1) to indicate which P value this observation is for?

Do you always have the same 100 variables?

Do you have SAS/IML license?  Could you load the two datasets into matrices and just subtract them?

Why do you need to make some many different difference datasets?

ballardw
Super User

From your pseudo-code, why are variables that you say you want to 'subtract' apparently sometimes character???

There are elements of the question that seem to indicate a poor data structure or management design at the base that is terribly complicating your data for what should be relatively simple actions.

Rickie2
Calcite | Level 5

Thanks for your answers Tom,

here are the answers for your questions:

 

- What is the overall goal here?  What is the meaning of the data and what is the analysis you are trying to do?

Those data will be used for a ML model. A left outer join brings all of tables together. The new table is going to be the dataset to modelized. Each table represent measures from different periods P: P0 current period, P1 the previous period, P2 the period before P1, ... until PX which is the ultimate period. That's why it required a loop. 1° P0-P1      2° P1-P2     3° P2-P3    ....    PX-1 - PX.

 

- What does this P0, P1, etc suffix on the variable names mean?  Periods of time, example a year, measurement for a given year.

- Why is the data in multiple datasets?  Why not just store all of the data in one dataset?   No problem, this step can be done before make a comparison step.  In this case, the variable names have to keep P0, P1, P2, etc. at the end of the variable name.

 

- Why do the variable names change between the datasets?  The value is computed period by period. Table Have_PO are results for the current year, Have_P1 results for the previous year, Have_P2 etc.

 

Why not just have a separate variable with values like P0 or P1 (or perhaps numeric variable with 0 and 1) to indicate which P value this observation is for?    It's a possibility.

 

Do you always have the same 100 variables?    Yes, periods P0, P1, P2,...,PX will always have the same variables

 

Do you have SAS/IML license?  Could you load the two datasets into matrices and just subtract them?  NO

 

Why do you need to make some many different difference datasets?  For a ML model, to follow the evolution between periods.  Then tables are united with a left outer join.

 

if you have other questions, don't hesitate,

your help is greatly appreciated.

Tom
Super User Tom
Super User

Sounds like you are trying to do feature generation to pass into your ML software?

Not sure if ML software is smart enough to do the sort of temporal change calculations for you.

 

With your described source data it would be trivial to combine the datasets using normal SAS code. MUCH MUCH easier than trying to do it with SQL code.

data want;
  merge have_p0 - have_p100;
  by id;
run;

To then generate your difference flags you could use ARRAYs in the data step.

Say you have a series of 101 numeric variables named APPLE_P0 to APPLE_p100 and another series of character variables (let's call it NAME).  You could then generate a series of 100 difference variables comparing the value to the previous value (or perhaps to the period 0 value) using arrays and DO lops like this:

data want;
  merge have_p0 - have_p100;
  by id;
  array _apple Apple_p0 - Apple_p100;
  array d_apple diff_apple_p1-diff_apple_p100;
  array _name Name_p0 - Name_p100;
  array d_name diff_name_p1-diff_name_p100;
  do index=2 to dim(_apple);
     d_apple[index-1] = _apple[index] - _apple[index-1] ;
     d_name[index-1] = _name[index]  ne _name[index-1] ;
  end;
run;

If you need to extend that to a number of other variables then you could use code generation.  Either using macro language.  Or just using normal SAS code to write the generated SAS code to a file which can be included into the program using the %INCLUDE statement.

Tom
Super User Tom
Super User

Why not just use PROC COMPARE?

data HAVE_P0;
    input ID Apple_P0 Orange_P0 Pineapple_P0 TotalAB_P0 $;
cards;
15427 10    100       1000         Machine
35894 20    200       2000         Hand
57842 40    400       4000         Hand
79432 75    750       7500         Machine
;
 
 
data HAVE_P1;
    input ID Apple_P1 Orange_P1 Pineapple_P1 TotalAB_P1 $;
cards;
15427     50 500 5000 Machine
35894     10 100 1000 Machine
57842     40 400 4000 Wind
;

proc compare data=have_p0 compare=have_p1 outall out=want noprint;
  id id;
  var Apple_P0 Orange_P0 Pineapple_P0 TotalAB_P0;
  with Apple_P1 Orange_P1 Pineapple_P1 TotalAB_P1;
run;

proc print data=want;
run;

If you need help getting the list of variable names perhaps you can let SAS calculate them for you.

proc sql noprint;
  select a.name 
       , b.name
    into :var0 separated by ' '
       , :var1 separated by ' '
  from
      (select name,type from dictionary.columns
        where libname='WORK' and memname='HAVE_P0'
          and upcase(name) like '%_P0'
       ) a
  inner join 
      (select name,type from dictionary.columns
        where libname='WORK' and memname='HAVE_P1'
          and upcase(name) like '%_P1'
       ) b
  on (upcase(substrn(a.name,1,length(a.name)-3))
   = upcase(substrn(b.name,1,length(b.name)-3)))
   and a.type=b.type
  order by 1
;
%let nvars=&sqlobs;
quit;

%put &=nvars;
%put &=var0;
%put &=var1;

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 502 views
  • 2 likes
  • 3 in conversation