Hi,
I am trying to count how many times an observation passes a certain caloric threshold in the most efficient manner. I have thousands of variables.
I want something similar to this:
data outdata;
set ppp.MFP;
Days_Calorie = N(of Q3 , Q19 , Q33 , Q47 , Q61 , Q75 , Q89 , Q103 , Q117 , Q131 , Q145 , Q159 , Q173 , Q187 , Q201 , Q215 , Q229 , Q243 , Q257 , Q271 , Q285 , Q299 , Q313 , Q327 , Q341 , Q355 , Q369 , Q383 , Q397 , Q411 , Q425 , Q439 , Q453 , Q467 , Q481 , Q495 , Q509 , Q523 , Q537 , Q551 , Q565 , Q579 , Q593 , Q607 , Q621 , Q635 , Q649 , Q663 , Q677 , Q691 , Q705 , Q719 , Q733 , Q747 , Q761 , Q775 , Q789 , Q803 , Q817 , Q831 , Q845 , Q859 , Q873 , Q887 , Q901 , Q915 , Q929 , Q943 , Q957 , Q971 , Q985 , Q999 , Q1013 , Q1027 , Q1041 , Q1055 , Q1069 , Q1083 , Q1097 , Q1111 , Q1125 , Q1139 , Q1153 , Q1167 , Q1181 , Q1195 , Q1209 , Q1223 , Q1237 , Q1251 , Q1265 , Q1279 , Q1293 , Q1307 , Q1321 , Q1335 , Q1349 , Q1363
);
However, I only want to count the occurrence if it passes the >100 threshold.
Is there an efficient way to do this? I toyed with the idea similar to this:
proc sql;
select count( Q3 , Q19 , Q33 , Q47 , Q61 , Q75 , Q89 , Q103 , Q117 , Q131 , Q145 , Q159 , Q173 , Q187 , Q201 , Q215 , Q229 , Q243 , Q257 , Q271 , Q285 , Q299 , Q313 , Q327 , Q341 , Q355 , Q369 , Q383 , Q397 , Q411 , Q425 , Q439 , Q453 , Q467 , Q481 , Q495 , Q509 , Q523 , Q537 , Q551 , Q565 , Q579 , Q593 , Q607 , Q621 , Q635 , Q649 , Q663 , Q677 , Q691 , Q705 , Q719 , Q733 , Q747 , Q761 , Q775 , Q789 , Q803 , Q817 , Q831 , Q845 , Q859 , Q873 , Q887 , Q901 , Q915 , Q929 , Q943 , Q957 , Q971 , Q985 , Q999 , Q1013 , Q1027 , Q1041 , Q1055 , Q1069 , Q1083 , Q1097 , Q1111 , Q1125 , Q1139 , Q1153 , Q1167 , Q1181 , Q1195 , Q1209 , Q1223 , Q1237 , Q1251 , Q1265 , Q1279 , Q1293 , Q1307 , Q1321 , Q1335 , Q1349 , Q1363
) as N_obs
from ppp.MFP
where ( Q3 , Q19 , Q33 , Q47 , Q61 , Q75 , Q89 , Q103 , Q117 , Q131 , Q145 , Q159 , Q173 , Q187 , Q201 , Q215 , Q229 , Q243 , Q257 , Q271 , Q285 , Q299 , Q313 , Q327 , Q341 , Q355 , Q369 , Q383 , Q397 , Q411 , Q425 , Q439 , Q453 , Q467 , Q481 , Q495 , Q509 , Q523 , Q537 , Q551 , Q565 , Q579 , Q593 , Q607 , Q621 , Q635 , Q649 , Q663 , Q677 , Q691 , Q705 , Q719 , Q733 , Q747 , Q761 , Q775 , Q789 , Q803 , Q817 , Q831 , Q845 , Q859 , Q873 , Q887 , Q901 , Q915 , Q929 , Q943 , Q957 , Q971 , Q985 , Q999 , Q1013 , Q1027 , Q1041 , Q1055 , Q1069 , Q1083 , Q1097 , Q1111 , Q1125 , Q1139 , Q1153 , Q1167 , Q1181 , Q1195 , Q1209 , Q1223 , Q1237 , Q1251 , Q1265 , Q1279 , Q1293 , Q1307 , Q1321 , Q1335 , Q1349 , Q1363
) >800;
quit;
I am missing a key concept here, but I think it is close.
Your data structure causes this to be an unnecessarily complicated process; a long layout makes this a breeze:
%let treshold=100;
data mfp_long;
input id $ q $ val;
datalines;
A Q1 120
A Q19 95
A Q33 125
;
proc sql;
create table want as
select
id,
count(*) as count
from mfp_long
where val > &treshold.
group by id
;
quit;
See Maxim 19.
I hope I'm not missing something, but this seems like a case that practically cries out for an Array. Something along the lines of (untested) the below:
data outdata;
set ppp.MFP;
ARRAY Days [*] Q1 - Q1363;
Days_Calorie = 0;
DO i = 1 TO DIM(Days);
IF Days[i] > 100 THEN
Days_Calorie = Days_Calorie + 1;
END;
RUN;
Assuming each Q variable represents an observation for a separate day, the above should give you the number of days above 100 calories.
Jim
I tend to agree with @jimbarbour that an array is the most obvious way to do it. Bur the way he defined the array is not very efficient. Using "Q1-Q1363" as the array definition will mean that all the (nonexistent) variables like Q1, Q2, Q4 up to Q1262 will be defined, making the code run a lot slower.
If your dataset has the Qx variables consecutive and in the order you refer to them, you can use "Q3--Q1363". If your Qx variables are the only ones having a name starting with Q, you can refer to them as "Q:".
Here is one solution using the first definition:
data want;
set ppp.MFP;
Days_Calorie=0;
array Qs(*) 8 Q3--Q1363;
do _N_=1 to dim(Qs);
Days_Calorie+Qs(_N_)>100;
end;
run;
If your Qx variables are not nicely ordered, AND you have other variables starting with Q, you can also define the array as all the variables that start with Q and a non-zero digit:
array Qs(*) 8 Q1: Q2: Q3: Q4: Q5: Q6: Q7: Q8: Q9:;
If you want to do it with SQL, you can also use SQL to create the expression you want, e.g.:
proc sql noprint;
select cats(name,'>100') into :test_expr separated by ','
from dictionary.columns
where libname='PPP' and memname='MFP'
and name like 'Q%';
create table want as select
*,
sum(&test_expr) as Days_Calorie
from ppp.MFP;
quit;
The TEST_EXPR macro expression can also be used in a data step in exactly the same way, and it may perform a little faster than the array method.
@joebacon wrote:
Thank you for the thoughtful reply! I do agree than an array seems like a no-brainer- I wish I would have thought of it.
If I do have the entire slew of variables between Q1 - Q1363, what is the best way to pull out those individual variables? After Q3, the numbers are +14. For instance, Q19 +14 = Q33 + 14 = Q47. Is there any easy way to get all of those into an array, or should I split the dataset to only include these variables for this analysis?
If you know the interval and it is ALWAYS exactly 14 then the do loop could be iterated such as
do i = 1 to dim(arrayname) by 14;
<stuff using the iterated "i" values>
end;
so leave all the variables in the array assuming they are all numeric.
@joebacon wrote:
Thank you for the thoughtful reply! I do agree than an array seems like a no-brainer- I wish I would have thought of it.
If I do have the entire slew of variables between Q1 - Q1363, what is the best way to pull out those individual variables? After Q3, the numbers are +14. For instance, Q19 +14 = Q33 + 14 = Q47. Is there any easy way to get all of those into an array, or should I split the dataset to only include these variables for this analysis?
I would suggest creating a macro to solve that. Something like
%macro criteria;
%local i;
Q3>100%do i=19 %to 1363 %by 14;,Q&i>100%end;
%mend;
- it just generates a comma-separated list of all the criteria.
Which you can use in a datastep
data want;
set ppp.MFP;
Days_Calorie=sum(%criteria);
run;
or SQL:
proc SQL;
create table want as select
*,
sum(%criteria) as Days_Calorie
from ppp.MFP;
quit;
The macro looks a bit messy because I put all the generated code on one line in order to avoid spurious blanks in the output.
If you want to do the array stuff instead (I actually think it performs slower, so I would go with the SUM function solution), you can define the array with a macro as well:
%macro array_vars;
%local i;
Q3%do i=19 %to 1363 %by 14; Q&i%end;
%mend;
data want;
set ppp.MFP;
array Qx(*) 8 %array_vars;
and then the rest of the datastep code as before. But IMO, it just adds another level of complication.
Good catch, @s_lassen. I wasn't paying attention to the specific variable names. Yes, it would not be efficient to define a lot of non-existent variables. And, yes, assuming that the variables are contiguous within the PDV, one could change the code as follows which would be far more efficient and avoids having to increment by 14.
data outdata;
set ppp.MFP;
ARRAY Days [*] Q1 -- Q1363;
Days_Calorie = 0;
DO i = 1 TO DIM(Days);
IF Days[i] > 100 THEN
Days_Calorie = Days_Calorie + 1;
END;
RUN;
It would be interesting to see if the SQL method that @s_lassen mentions is consistently faster than using an array (and whether it makes a material difference).
Jim
Your data structure causes this to be an unnecessarily complicated process; a long layout makes this a breeze:
%let treshold=100;
data mfp_long;
input id $ q $ val;
datalines;
A Q1 120
A Q19 95
A Q33 125
;
proc sql;
create table want as
select
id,
count(*) as count
from mfp_long
where val > &treshold.
group by id
;
quit;
See Maxim 19.
And if you want to make sure that id's with no value above treshold are also shown, do this:
proc sql;
create table want as
select
id,
sum(case
when val > &treshold.
then 1
else 0
end) as count
from mfp_long
group by id
;
quit;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.