BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
joebacon
Pyrite | Level 9

Hi,

I am trying to count how many times an observation passes a certain caloric threshold in the most efficient manner. I have thousands of variables.

 

I want something similar to this:

data outdata;
set ppp.MFP;
Days_Calorie = N(of Q3	, Q19	, Q33	, Q47	, Q61	, Q75	, Q89	, Q103	, Q117	, Q131	, Q145	, Q159	, Q173	, Q187	, Q201	, Q215	, Q229	, Q243	, Q257	, Q271	, Q285	, Q299	, Q313	, Q327	, Q341	, Q355	, Q369	, Q383	, Q397	, Q411	, Q425	, Q439	, Q453	, Q467	, Q481	, Q495	, Q509	, Q523	, Q537	, Q551	, Q565	, Q579	, Q593	, Q607	, Q621	, Q635	, Q649	, Q663	, Q677	, Q691	, Q705	, Q719	, Q733	, Q747	, Q761	, Q775	, Q789	, Q803	, Q817	, Q831	, Q845	, Q859	, Q873	, Q887	, Q901	, Q915	, Q929	, Q943	, Q957	, Q971	, Q985	, Q999	, Q1013	, Q1027	, Q1041	, Q1055	, Q1069	, Q1083	, Q1097	, Q1111	, Q1125	, Q1139	, Q1153	, Q1167	, Q1181	, Q1195	, Q1209	, Q1223	, Q1237	, Q1251	, Q1265	, Q1279	, Q1293	, Q1307	, Q1321	, Q1335	, Q1349	, Q1363
);

However, I only want to count the occurrence if it passes the >100 threshold. 

 

Is there an efficient way to do this? I toyed with the idea similar to this:

proc sql;
select count( Q3	, Q19	, Q33	, Q47	, Q61	, Q75	, Q89	, Q103	, Q117	, Q131	, Q145	, Q159	, Q173	, Q187	, Q201	, Q215	, Q229	, Q243	, Q257	, Q271	, Q285	, Q299	, Q313	, Q327	, Q341	, Q355	, Q369	, Q383	, Q397	, Q411	, Q425	, Q439	, Q453	, Q467	, Q481	, Q495	, Q509	, Q523	, Q537	, Q551	, Q565	, Q579	, Q593	, Q607	, Q621	, Q635	, Q649	, Q663	, Q677	, Q691	, Q705	, Q719	, Q733	, Q747	, Q761	, Q775	, Q789	, Q803	, Q817	, Q831	, Q845	, Q859	, Q873	, Q887	, Q901	, Q915	, Q929	, Q943	, Q957	, Q971	, Q985	, Q999	, Q1013	, Q1027	, Q1041	, Q1055	, Q1069	, Q1083	, Q1097	, Q1111	, Q1125	, Q1139	, Q1153	, Q1167	, Q1181	, Q1195	, Q1209	, Q1223	, Q1237	, Q1251	, Q1265	, Q1279	, Q1293	, Q1307	, Q1321	, Q1335	, Q1349	, Q1363
) as N_obs
from ppp.MFP
where ( Q3	, Q19	, Q33	, Q47	, Q61	, Q75	, Q89	, Q103	, Q117	, Q131	, Q145	, Q159	, Q173	, Q187	, Q201	, Q215	, Q229	, Q243	, Q257	, Q271	, Q285	, Q299	, Q313	, Q327	, Q341	, Q355	, Q369	, Q383	, Q397	, Q411	, Q425	, Q439	, Q453	, Q467	, Q481	, Q495	, Q509	, Q523	, Q537	, Q551	, Q565	, Q579	, Q593	, Q607	, Q621	, Q635	, Q649	, Q663	, Q677	, Q691	, Q705	, Q719	, Q733	, Q747	, Q761	, Q775	, Q789	, Q803	, Q817	, Q831	, Q845	, Q859	, Q873	, Q887	, Q901	, Q915	, Q929	, Q943	, Q957	, Q971	, Q985	, Q999	, Q1013	, Q1027	, Q1041	, Q1055	, Q1069	, Q1083	, Q1097	, Q1111	, Q1125	, Q1139	, Q1153	, Q1167	, Q1181	, Q1195	, Q1209	, Q1223	, Q1237	, Q1251	, Q1265	, Q1279	, Q1293	, Q1307	, Q1321	, Q1335	, Q1349	, Q1363
) >800;
quit;

I am missing a key concept here, but I think it is close.

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

Your data structure causes this to be an unnecessarily complicated process; a long layout makes this a breeze:

%let treshold=100;

data mfp_long;
input id $ q $ val;
datalines;
A Q1 120
A Q19 95
A Q33 125
;

proc sql;
create table want as
  select
    id,
    count(*) as count
  from mfp_long
  where val > &treshold.
  group by id
;
quit;

See Maxim 19.

View solution in original post

8 REPLIES 8
jimbarbour
Meteorite | Level 14

@joebacon,

 

I hope I'm not missing something, but this seems like a case that practically cries out for an Array.  Something along the lines of (untested) the below:

data outdata;
    set ppp.MFP;
    ARRAY Days [*] Q1 - Q1363;
    Days_Calorie = 0;
    DO i = 1 TO DIM(Days);
        IF  Days[i] > 100  THEN
            Days_Calorie = Days_Calorie + 1;
    END;
RUN;

Assuming each Q variable represents an observation for a separate day, the above should give you the number of days above 100 calories.

 

Jim

s_lassen
Meteorite | Level 14

I tend to agree with @jimbarbour that an array is the most obvious way to do it. Bur the way he defined the array is not very efficient. Using "Q1-Q1363" as the array definition will mean that all the (nonexistent) variables like Q1, Q2, Q4 up to Q1262 will be defined, making the code run a lot slower.

 

If your dataset has the Qx variables consecutive and in the order you refer to them, you can use "Q3--Q1363". If your Qx variables are the only ones having a name starting with Q, you can refer to them as "Q:". 

 

Here is one solution using the first definition:

data want;
  set ppp.MFP;
  Days_Calorie=0;
  array Qs(*) 8 Q3--Q1363;
  do _N_=1 to dim(Qs);
    Days_Calorie+Qs(_N_)>100;
    end;
run; 

If your Qx variables are not nicely ordered, AND you have other variables starting with Q, you can also define the array as all the variables that start with Q and a non-zero digit:

array Qs(*) 8 Q1: Q2: Q3: Q4: Q5: Q6: Q7: Q8: Q9:;

 

If you want to do it with SQL, you can also use SQL to create the expression you want, e.g.:

proc sql noprint;
  select cats(name,'>100') into :test_expr separated by ',' 
  from dictionary.columns 
  where libname='PPP' and memname='MFP' 
     and name like 'Q%';
  create table want as select
    *,
    sum(&test_expr) as Days_Calorie
   from ppp.MFP;
quit;

The TEST_EXPR macro expression can also be used in a data step in exactly the same way, and it may perform a little faster than the array method.

 

 

joebacon
Pyrite | Level 9
Thank you for the thoughtful reply! I do agree than an array seems like a no-brainer- I wish I would have thought of it.

If I do have the entire slew of variables between Q1 - Q1363, what is the best way to pull out those individual variables? After Q3, the numbers are +14. For instance, Q19 +14 = Q33 + 14 = Q47. Is there any easy way to get all of those into an array, or should I split the dataset to only include these variables for this analysis?
ballardw
Super User

@joebacon wrote:
Thank you for the thoughtful reply! I do agree than an array seems like a no-brainer- I wish I would have thought of it.

If I do have the entire slew of variables between Q1 - Q1363, what is the best way to pull out those individual variables? After Q3, the numbers are +14. For instance, Q19 +14 = Q33 + 14 = Q47. Is there any easy way to get all of those into an array, or should I split the dataset to only include these variables for this analysis?

If you know the interval and it is ALWAYS exactly 14 then the do loop could be iterated such as

do i = 1 to dim(arrayname) by 14;

  <stuff using the iterated "i" values>

end;

so leave all the variables in the array assuming they are all numeric.

s_lassen
Meteorite | Level 14

@joebacon wrote:
Thank you for the thoughtful reply! I do agree than an array seems like a no-brainer- I wish I would have thought of it.

If I do have the entire slew of variables between Q1 - Q1363, what is the best way to pull out those individual variables? After Q3, the numbers are +14. For instance, Q19 +14 = Q33 + 14 = Q47. Is there any easy way to get all of those into an array, or should I split the dataset to only include these variables for this analysis?

I would suggest creating a macro to solve that. Something like

 

%macro criteria;
  %local i;
  Q3>100%do i=19 %to 1363 %by 14;,Q&i>100%end;
%mend;

- it just generates a comma-separated list of all the criteria.

 

Which you can use in a datastep

 

 

data want;
  set ppp.MFP;
  Days_Calorie=sum(%criteria);
run;

or SQL:

 

 

proc SQL;
  create table want as select
  *,
  sum(%criteria) as Days_Calorie
  from ppp.MFP;
quit;

The macro looks a bit messy because I put all the generated code on one line in order to avoid spurious blanks in the output.

 

 

If you want to do the array stuff instead (I actually think it performs slower, so I would go with the SUM function solution), you can define the array with a macro as well:

%macro array_vars;
  %local i;
  Q3%do i=19 %to 1363 %by 14; Q&i%end;
%mend;

data want;
  set ppp.MFP;
  array Qx(*) 8 %array_vars;

and then the rest of the datastep code as before. But IMO, it just adds another level of complication.

jimbarbour
Meteorite | Level 14

Good catch, @s_lassen.  I wasn't paying attention to the specific variable names.  Yes, it would not be efficient to define a lot of non-existent variables.  And, yes, assuming that the variables are contiguous within the PDV, one could change the code as follows which would be far more efficient and avoids having to increment by 14.

data outdata;
    set ppp.MFP;
    ARRAY Days [*] Q1 -- Q1363;
    Days_Calorie = 0;
    DO i = 1 TO DIM(Days);
        IF  Days[i] > 100  THEN
            Days_Calorie = Days_Calorie + 1;
    END;
RUN;

It would be interesting to see if the SQL method that @s_lassen mentions is consistently faster than using an array (and whether it makes a material difference).

 

Jim

Kurt_Bremser
Super User

Your data structure causes this to be an unnecessarily complicated process; a long layout makes this a breeze:

%let treshold=100;

data mfp_long;
input id $ q $ val;
datalines;
A Q1 120
A Q19 95
A Q33 125
;

proc sql;
create table want as
  select
    id,
    count(*) as count
  from mfp_long
  where val > &treshold.
  group by id
;
quit;

See Maxim 19.

Kurt_Bremser
Super User

And if you want to make sure that id's with no value above treshold are also shown, do this:

proc sql;
create table want as
  select
    id,
    sum(case
      when val > &treshold.
      then 1
      else 0
    end) as count
  from mfp_long
  group by id
;
quit;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1326 views
  • 7 likes
  • 5 in conversation