Solved: Re: Count how many times an observation passes a threshold across vari...

joebacon · Posted 09-30-2020 01:15 AM

Hi,

I am trying to count how many times an observation passes a certain caloric threshold in the most efficient manner. I have thousands of variables.

I want something similar to this:

data outdata;
set ppp.MFP;
Days_Calorie = N(of Q3	, Q19	, Q33	, Q47	, Q61	, Q75	, Q89	, Q103	, Q117	, Q131	, Q145	, Q159	, Q173	, Q187	, Q201	, Q215	, Q229	, Q243	, Q257	, Q271	, Q285	, Q299	, Q313	, Q327	, Q341	, Q355	, Q369	, Q383	, Q397	, Q411	, Q425	, Q439	, Q453	, Q467	, Q481	, Q495	, Q509	, Q523	, Q537	, Q551	, Q565	, Q579	, Q593	, Q607	, Q621	, Q635	, Q649	, Q663	, Q677	, Q691	, Q705	, Q719	, Q733	, Q747	, Q761	, Q775	, Q789	, Q803	, Q817	, Q831	, Q845	, Q859	, Q873	, Q887	, Q901	, Q915	, Q929	, Q943	, Q957	, Q971	, Q985	, Q999	, Q1013	, Q1027	, Q1041	, Q1055	, Q1069	, Q1083	, Q1097	, Q1111	, Q1125	, Q1139	, Q1153	, Q1167	, Q1181	, Q1195	, Q1209	, Q1223	, Q1237	, Q1251	, Q1265	, Q1279	, Q1293	, Q1307	, Q1321	, Q1335	, Q1349	, Q1363
);

However, I only want to count the occurrence if it passes the >100 threshold.

Is there an efficient way to do this? I toyed with the idea similar to this:

proc sql;
select count( Q3	, Q19	, Q33	, Q47	, Q61	, Q75	, Q89	, Q103	, Q117	, Q131	, Q145	, Q159	, Q173	, Q187	, Q201	, Q215	, Q229	, Q243	, Q257	, Q271	, Q285	, Q299	, Q313	, Q327	, Q341	, Q355	, Q369	, Q383	, Q397	, Q411	, Q425	, Q439	, Q453	, Q467	, Q481	, Q495	, Q509	, Q523	, Q537	, Q551	, Q565	, Q579	, Q593	, Q607	, Q621	, Q635	, Q649	, Q663	, Q677	, Q691	, Q705	, Q719	, Q733	, Q747	, Q761	, Q775	, Q789	, Q803	, Q817	, Q831	, Q845	, Q859	, Q873	, Q887	, Q901	, Q915	, Q929	, Q943	, Q957	, Q971	, Q985	, Q999	, Q1013	, Q1027	, Q1041	, Q1055	, Q1069	, Q1083	, Q1097	, Q1111	, Q1125	, Q1139	, Q1153	, Q1167	, Q1181	, Q1195	, Q1209	, Q1223	, Q1237	, Q1251	, Q1265	, Q1279	, Q1293	, Q1307	, Q1321	, Q1335	, Q1349	, Q1363
) as N_obs
from ppp.MFP
where ( Q3	, Q19	, Q33	, Q47	, Q61	, Q75	, Q89	, Q103	, Q117	, Q131	, Q145	, Q159	, Q173	, Q187	, Q201	, Q215	, Q229	, Q243	, Q257	, Q271	, Q285	, Q299	, Q313	, Q327	, Q341	, Q355	, Q369	, Q383	, Q397	, Q411	, Q425	, Q439	, Q453	, Q467	, Q481	, Q495	, Q509	, Q523	, Q537	, Q551	, Q565	, Q579	, Q593	, Q607	, Q621	, Q635	, Q649	, Q663	, Q677	, Q691	, Q705	, Q719	, Q733	, Q747	, Q761	, Q775	, Q789	, Q803	, Q817	, Q831	, Q845	, Q859	, Q873	, Q887	, Q901	, Q915	, Q929	, Q943	, Q957	, Q971	, Q985	, Q999	, Q1013	, Q1027	, Q1041	, Q1055	, Q1069	, Q1083	, Q1097	, Q1111	, Q1125	, Q1139	, Q1153	, Q1167	, Q1181	, Q1195	, Q1209	, Q1223	, Q1237	, Q1251	, Q1265	, Q1279	, Q1293	, Q1307	, Q1321	, Q1335	, Q1349	, Q1363
) >800;
quit;

I am missing a key concept here, but I think it is close.

Kurt_Bremser · Posted 10-01-2020 03:45 AM

Your data structure causes this to be an unnecessarily complicated process; a long layout makes this a breeze:

%let treshold=100;

data mfp_long;
input id $ q $ val;
datalines;
A Q1 120
A Q19 95
A Q33 125
;

proc sql;
create table want as
  select
    id,
    count(*) as count
  from mfp_long
  where val > &treshold.
  group by id
;
quit;

See Maxim 19.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

jimbarbour · Posted 09-30-2020 01:41 AM

@joebacon,

I hope I'm not missing something, but this seems like a case that practically cries out for an Array. Something along the lines of (untested) the below:

data outdata;
    set ppp.MFP;
    ARRAY Days [*] Q1 - Q1363;
    Days_Calorie = 0;
    DO i = 1 TO DIM(Days);
        IF  Days[i] > 100  THEN
            Days_Calorie = Days_Calorie + 1;
    END;
RUN;

Assuming each Q variable represents an observation for a separate day, the above should give you the number of days above 100 calories.

Jim

s_lassen · Posted 09-30-2020 08:26 AM

I tend to agree with @jimbarbour that an array is the most obvious way to do it. Bur the way he defined the array is not very efficient. Using "Q1-Q1363" as the array definition will mean that all the (nonexistent) variables like Q1, Q2, Q4 up to Q1262 will be defined, making the code run a lot slower.

If your dataset has the Qx variables consecutive and in the order you refer to them, you can use "Q3--Q1363". If your Qx variables are the only ones having a name starting with Q, you can refer to them as "Q:".

Here is one solution using the first definition:

data want;
  set ppp.MFP;
  Days_Calorie=0;
  array Qs(*) 8 Q3--Q1363;
  do _N_=1 to dim(Qs);
    Days_Calorie+Qs(_N_)>100;
    end;
run;

If your Qx variables are not nicely ordered, AND you have other variables starting with Q, you can also define the array as all the variables that start with Q and a non-zero digit:

array Qs(*) 8 Q1: Q2: Q3: Q4: Q5: Q6: Q7: Q8: Q9:;

If you want to do it with SQL, you can also use SQL to create the expression you want, e.g.:

proc sql noprint;
  select cats(name,'>100') into :test_expr separated by ',' 
  from dictionary.columns 
  where libname='PPP' and memname='MFP' 
     and name like 'Q%';
  create table want as select
    *,
    sum(&test_expr) as Days_Calorie
   from ppp.MFP;
quit;

The TEST_EXPR macro expression can also be used in a data step in exactly the same way, and it may perform a little faster than the array method.

joebacon · Posted 09-30-2020 09:58 AM

Thank you for the thoughtful reply! I do agree than an array seems like a no-brainer- I wish I would have thought of it.

If I do have the entire slew of variables between Q1 - Q1363, what is the best way to pull out those individual variables? After Q3, the numbers are +14. For instance, Q19 +14 = Q33 + 14 = Q47. Is there any easy way to get all of those into an array, or should I split the dataset to only include these variables for this analysis?

ballardw · Posted 09-30-2020 11:04 AM

@joebacon wrote:
Thank you for the thoughtful reply! I do agree than an array seems like a no-brainer- I wish I would have thought of it.

If I do have the entire slew of variables between Q1 - Q1363, what is the best way to pull out those individual variables? After Q3, the numbers are +14. For instance, Q19 +14 = Q33 + 14 = Q47. Is there any easy way to get all of those into an array, or should I split the dataset to only include these variables for this analysis?

If you know the interval and it is ALWAYS exactly 14 then the do loop could be iterated such as

do i = 1 to dim(arrayname) by 14;

end;

so leave all the variables in the array assuming they are all numeric.

s_lassen · Posted 09-30-2020 01:56 PM

@joebacon wrote:
Thank you for the thoughtful reply! I do agree than an array seems like a no-brainer- I wish I would have thought of it.

If I do have the entire slew of variables between Q1 - Q1363, what is the best way to pull out those individual variables? After Q3, the numbers are +14. For instance, Q19 +14 = Q33 + 14 = Q47. Is there any easy way to get all of those into an array, or should I split the dataset to only include these variables for this analysis?

I would suggest creating a macro to solve that. Something like

%macro criteria;
  %local i;
  Q3>100%do i=19 %to 1363 %by 14;,Q&i>100%end;
%mend;

- it just generates a comma-separated list of all the criteria.

Which you can use in a datastep

data want;
  set ppp.MFP;
  Days_Calorie=sum(%criteria);
run;

or SQL:

proc SQL;
  create table want as select
  *,
  sum(%criteria) as Days_Calorie
  from ppp.MFP;
quit;

The macro looks a bit messy because I put all the generated code on one line in order to avoid spurious blanks in the output.

If you want to do the array stuff instead (I actually think it performs slower, so I would go with the SUM function solution), you can define the array with a macro as well:

%macro array_vars;
  %local i;
  Q3%do i=19 %to 1363 %by 14; Q&i%end;
%mend;

data want;
  set ppp.MFP;
  array Qx(*) 8 %array_vars;

and then the rest of the datastep code as before. But IMO, it just adds another level of complication.

jimbarbour · Posted 09-30-2020 04:30 PM

Good catch, @s_lassen. I wasn't paying attention to the specific variable names. Yes, it would not be efficient to define a lot of non-existent variables. And, yes, assuming that the variables are contiguous within the PDV, one could change the code as follows which would be far more efficient and avoids having to increment by 14.

data outdata;
    set ppp.MFP;
    ARRAY Days [*] Q1 -- Q1363;
    Days_Calorie = 0;
    DO i = 1 TO DIM(Days);
        IF  Days[i] > 100  THEN
            Days_Calorie = Days_Calorie + 1;
    END;
RUN;

It would be interesting to see if the SQL method that @s_lassen mentions is consistently faster than using an array (and whether it makes a material difference).

Jim

Kurt_Bremser · Posted 10-01-2020 03:45 AM

Your data structure causes this to be an unnecessarily complicated process; a long layout makes this a breeze:

%let treshold=100;

data mfp_long;
input id $ q $ val;
datalines;
A Q1 120
A Q19 95
A Q33 125
;

proc sql;
create table want as
  select
    id,
    count(*) as count
  from mfp_long
  where val > &treshold.
  group by id
;
quit;

See Maxim 19.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 10-02-2020 03:37 AM

And if you want to make sure that id's with no value above treshold are also shown, do this:

proc sql;
create table want as
  select
    id,
    sum(case
      when val > &treshold.
      then 1
      else 0
    end) as count
  from mfp_long
  group by id
;
quit;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Count how many times an observation passes a threshold across variables

Re: Count how many times an observation passes a threshold across variables

Re: Count how many times an observation passes a threshold across variables

Re: Count how many times an observation passes a threshold across variables

Re: Count how many times an observation passes a threshold across variables

Re: Count how many times an observation passes a threshold across variables

Re: Count how many times an observation passes a threshold across variables

Re: Count how many times an observation passes a threshold across variables

Re: Count how many times an observation passes a threshold across variables

Re: Count how many times an observation passes a threshold across variables

Classroom Training Available!