What is the overall goal here? What is the meaning of the data and what is the analysis you are trying to do?
What does this P0, P1, etc suffix on the variable names mean?
Why is the data in multiple datasets? Why not just store all of the data in one dataset?
Why do the variable names change between the datasets?
Why not just have a separate variable with values like P0 or P1 (or perhaps numeric variable with 0 and 1) to indicate which P value this observation is for?
Do you always have the same 100 variables?
Do you have SAS/IML license? Could you load the two datasets into matrices and just subtract them?
Why do you need to make some many different difference datasets?
From your pseudo-code, why are variables that you say you want to 'subtract' apparently sometimes character???
There are elements of the question that seem to indicate a poor data structure or management design at the base that is terribly complicating your data for what should be relatively simple actions.
Thanks for your answers Tom,
here are the answers for your questions:
- What is the overall goal here? What is the meaning of the data and what is the analysis you are trying to do?
Those data will be used for a ML model. A left outer join brings all of tables together. The new table is going to be the dataset to modelized. Each table represent measures from different periods P: P0 current period, P1 the previous period, P2 the period before P1, ... until PX which is the ultimate period. That's why it required a loop. 1° P0-P1 2° P1-P2 3° P2-P3 .... PX-1 - PX.
- What does this P0, P1, etc suffix on the variable names mean? Periods of time, example a year, measurement for a given year.
- Why is the data in multiple datasets? Why not just store all of the data in one dataset? No problem, this step can be done before make a comparison step. In this case, the variable names have to keep P0, P1, P2, etc. at the end of the variable name.
- Why do the variable names change between the datasets? The value is computed period by period. Table Have_PO are results for the current year, Have_P1 results for the previous year, Have_P2 etc.
Why not just have a separate variable with values like P0 or P1 (or perhaps numeric variable with 0 and 1) to indicate which P value this observation is for? It's a possibility.
Do you always have the same 100 variables? Yes, periods P0, P1, P2,...,PX will always have the same variables
Do you have SAS/IML license? Could you load the two datasets into matrices and just subtract them? NO
Why do you need to make some many different difference datasets? For a ML model, to follow the evolution between periods. Then tables are united with a left outer join.
if you have other questions, don't hesitate,
your help is greatly appreciated.
Sounds like you are trying to do feature generation to pass into your ML software?
Not sure if ML software is smart enough to do the sort of temporal change calculations for you.
With your described source data it would be trivial to combine the datasets using normal SAS code. MUCH MUCH easier than trying to do it with SQL code.
data want;
merge have_p0 - have_p100;
by id;
run;
To then generate your difference flags you could use ARRAYs in the data step.
Say you have a series of 101 numeric variables named APPLE_P0 to APPLE_p100 and another series of character variables (let's call it NAME). You could then generate a series of 100 difference variables comparing the value to the previous value (or perhaps to the period 0 value) using arrays and DO lops like this:
data want;
merge have_p0 - have_p100;
by id;
array _apple Apple_p0 - Apple_p100;
array d_apple diff_apple_p1-diff_apple_p100;
array _name Name_p0 - Name_p100;
array d_name diff_name_p1-diff_name_p100;
do index=2 to dim(_apple);
d_apple[index-1] = _apple[index] - _apple[index-1] ;
d_name[index-1] = _name[index] ne _name[index-1] ;
end;
run;
If you need to extend that to a number of other variables then you could use code generation. Either using macro language. Or just using normal SAS code to write the generated SAS code to a file which can be included into the program using the %INCLUDE statement.
Why not just use PROC COMPARE?
data HAVE_P0;
input ID Apple_P0 Orange_P0 Pineapple_P0 TotalAB_P0 $;
cards;
15427 10 100 1000 Machine
35894 20 200 2000 Hand
57842 40 400 4000 Hand
79432 75 750 7500 Machine
;
data HAVE_P1;
input ID Apple_P1 Orange_P1 Pineapple_P1 TotalAB_P1 $;
cards;
15427 50 500 5000 Machine
35894 10 100 1000 Machine
57842 40 400 4000 Wind
;
proc compare data=have_p0 compare=have_p1 outall out=want noprint;
id id;
var Apple_P0 Orange_P0 Pineapple_P0 TotalAB_P0;
with Apple_P1 Orange_P1 Pineapple_P1 TotalAB_P1;
run;
proc print data=want;
run;
If you need help getting the list of variable names perhaps you can let SAS calculate them for you.
proc sql noprint;
select a.name
, b.name
into :var0 separated by ' '
, :var1 separated by ' '
from
(select name,type from dictionary.columns
where libname='WORK' and memname='HAVE_P0'
and upcase(name) like '%_P0'
) a
inner join
(select name,type from dictionary.columns
where libname='WORK' and memname='HAVE_P1'
and upcase(name) like '%_P1'
) b
on (upcase(substrn(a.name,1,length(a.name)-3))
= upcase(substrn(b.name,1,length(b.name)-3)))
and a.type=b.type
order by 1
;
%let nvars=&sqlobs;
quit;
%put &=nvars;
%put &=var0;
%put &=var1;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.