Dear SAS Community,
I am a new user in SAS and I have encountered an issue while working on large data set. Let me introduce the data-table I am working with.
My table has millions of rows and it is unique at ID+Month level. The Month variable runs from 1 to 100. Then I have series of column named like var1, var2, var3.....var100. And these columns only contain two values; a blank or "Y". I leave a snapshot of the table below.
ID | Month | Var1 | Var2 | …. | …. | Var100 |
ABC1 | 1 | Y | Y | |||
ABC2 | 1 | |||||
ABC3 | 1 | Y | ||||
ABC4 | 1 | Y | ||||
ABC5 | 1 | Y | Y | |||
ABC1 | 2 | Y | Y | Y | ||
ABC2 | 2 | Y | ||||
ABC3 | 2 | |||||
ABC4 | 2 | Y | ||||
ABC5 | 2 | |||||
ABC1 | … | Y | Y | |||
ABC2 | … | Y | Y | |||
ABC3 | … | |||||
ABC4 | … | Y | ||||
ABC5 | … | Y | ||||
ABC1 | 100 | Y | Y | |||
ABC2 | 100 | Y | ||||
ABC3 | 100 | |||||
ABC4 | 100 | Y | Y | |||
ABC5 | 100 | Y | Y | Y |
Using the table I want to create an output which has only unique IDs and columns as Month1, Month2, Month3, ..., Month100. In the columns Month1, Month2, ...., Month100, I want to have only two values as a blank or "Y" based on combination of a specific month and varx. For example, for Month = 1 and Var1, only ID ABC3 has Var1 = "Y" hence in my output table, Month1 value against ID ABC3 has the same value "Y".
ID | Month1 | Month2 | …. | …. | Month100 |
ABC1 | Y | Y | |||
ABC2 | Y | Y | |||
ABC3 | Y | ||||
ABC4 | |||||
ABC5 | Y |
Hope the long explanation helps.
Given that you're new to SAS you might not realize yet that working with a narrow table structure is very often advantageous.
Consider to structure your data as below. Whether you want to create rows where the value for a var<n> variable is missing depends a bit on how you intend to use the data. But it's of course also never a big problem to add such missing rows later for some specific use cases. And some SAS Proc's also allow to just create a second "template" table with all the cross sections you want to see in a report independent if this data exists in the base table or not.
ID | Month | index | value |
ABC1 | 1 | 2 | Y |
ABC3 | 1 | 1 | Y |
ABC5 | 1 | 2 | Y |
ABC1 | 2 | 1 | Y |
ABC1 | 2 | 2 | Y |
ABC2 | 2 | 2 | Y |
ABC4 | 2 | 1 | Y |
Here a sample how to create such a structure and report on it
data have;
infile cards truncover expandtabs;
input (ID Month Var1 Var2) ($);
cards;
ABC1 1 . Y
ABC2 1 . .
ABC3 1 Y .
ABC4 1 . .
ABC5 1 . Y
ABC1 2 Y Y
ABC2 2 . Y
ABC3 2 . .
ABC4 2 Y .
ABC5 2 . .
;
data want;
set have;
by month id;
array vars {*} var1-var2;
do index=1 to dim(vars);
value=vars[index];
output;
end;
drop var1-var2;
run;
proc tabulate data=want missing;
class Month index value;
keylabel n=' ';
table month, index*value;
run;
Honestly, I don't grasp what you are trying to do here, somehow var1 - var100 turns into month1 - month100 but I can't follow your explanation.
Furthermore, I always object to trying to take data like this and put an calendar type information into variable names like month1 - month100. Although I don't understand your data or what you are trying to do, I do understand SAS and it seems to me that you can work with the data much more easily in the original form rather than in the form you want to output.
Hi, thank you for your response. I agree that my explanation is not apt. Here is a glimpse of how I would do it using a case when statement. The problem is I have to write this case when 100 times as I have 100 distinct months. I am trying to find an easier way to write this code using some kind of loop which can based on the two conditions Month and Varx = "Y", create individual Monthx columns.
Proc Sql;
Create table data_want as
Select distinct ID,
case when var1 = "Y" and Month = 1 then "Y" else "" end as Month1,
case when var2 = "Y" and Month = 2 then "Y" else "" end as Month2,
.....
.....
case when var100 = "Y" and Month = 100 then "Y" else "" end as Month100
from data_have
group by 1;
Quit;
So when month=1 you only look at var1 and when month=2 you only look at var2 and so on?
Before I go ahead and try to write a program, I feel compelled to ask ... why is this rearrangement of the data needed? Normally leaving data in the long arrangement makes subsequent programming easier, while turning data into a wide data set usually makes subsequent programming harder. So what is the reason you are doing this? What will be done with the data after you re-arrange it?
@Ariin wrote:
So the IDs based on Month+Varx combination lead to certain specific payout in the future and the client wants to understand two things: how many IDs/contract are there in each month (Monthx) and how much will be the payout for each month and second, how different are these contracts when MonthX changes from Month1 to Month10 or Month4 to Month14 and so on... I hope it makes sense.
But this can be done without re-arranging to a wide data set.
could be, but the output table is much easier to filer for a or more specific months and will be useful in any downstream activities whereas the input table is not based on months.
Assuming I understood what you mean.
data have;
infile cards truncover expandtabs;
input (ID Month Var1 Var2) ($);
cards;
ABC1 1 . Y
ABC2 1 . .
ABC3 1 Y .
ABC4 1 . .
ABC5 1 . Y
ABC1 2 Y Y
ABC2 2 . Y
ABC3 2 . .
ABC4 2 Y .
ABC5 2 . .
;
data temp;
set have;
if vvaluex(cats('var',month))='Y' then do;name=cats('month',month);value='Y';output; end;
keep ID name value month;
run;
proc sort data=temp;by ID ;run;
proc transpose data=temp out=want(drop=_:);
by ID;
id name;
var value;
run;
proc sql noprint;
select distinct cats('month',month) into: month separated by ' ' from have;
quit;
data want;
retain ID &month.;
set want;
run;
I'd say: nice usecase for two dimensional array:
%let size = 100;
data have;
call streaminit(123);
do id = 'ABC1','ABC2','ABC3','ABC4','ABC5';
do Month =1 to &size.;
array Var[1:&size.] $ 1;
do _N_=1 to &size.;
if rand('uniform') > 0.8 then Var[_N_]="Y";
end;
output;
call missing(of Var[*]);
end;
end;
run;
data want;
array a[&size.,&size.] $ 1 _temporary_;
r=0;
do until (last.id);
set have;
by ID;
r+1;
array Var[1:&size.] $ 1;
do c =1 to &size.;
a[r,c]=Var[c];
end;
end;
array M[&size.] $ 1 Month1-Month100;
do r=1 to &size.;
M[r] = a[r,r];
end;
output;
drop r c Month Var: ;
run;
Bart
Given that you're new to SAS you might not realize yet that working with a narrow table structure is very often advantageous.
Consider to structure your data as below. Whether you want to create rows where the value for a var<n> variable is missing depends a bit on how you intend to use the data. But it's of course also never a big problem to add such missing rows later for some specific use cases. And some SAS Proc's also allow to just create a second "template" table with all the cross sections you want to see in a report independent if this data exists in the base table or not.
ID | Month | index | value |
ABC1 | 1 | 2 | Y |
ABC3 | 1 | 1 | Y |
ABC5 | 1 | 2 | Y |
ABC1 | 2 | 1 | Y |
ABC1 | 2 | 2 | Y |
ABC2 | 2 | 2 | Y |
ABC4 | 2 | 1 | Y |
Here a sample how to create such a structure and report on it
data have;
infile cards truncover expandtabs;
input (ID Month Var1 Var2) ($);
cards;
ABC1 1 . Y
ABC2 1 . .
ABC3 1 Y .
ABC4 1 . .
ABC5 1 . Y
ABC1 2 Y Y
ABC2 2 . Y
ABC3 2 . .
ABC4 2 Y .
ABC5 2 . .
;
data want;
set have;
by month id;
array vars {*} var1-var2;
do index=1 to dim(vars);
value=vars[index];
output;
end;
drop var1-var2;
run;
proc tabulate data=want missing;
class Month index value;
keylabel n=' ';
table month, index*value;
run;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.