BookmarkSubscribeRSS Feed
ballardw
Super User

Suggestion: Stop with incremental data descriptions. Give us a full description of all the variables and examples of what you want counted.

 

There are things that work with missing values that do not work with values of zero.

There are things that work with 1/0 coded values that do not work with 0/1/2/3/4/5 etc.

Almost everything works different with character values than numeric.

So some changes of your values means that logic has to change completely.

 

Examples: the N function counts non-missing numeric values so a mix of . and 1 means that: N=0 all values missing; N=1 exactly 1 value not missing, N>1 more than one non-missing. Replace the . with 0 and you get 4 (or 5 or 25 if there are 25 variables used).

So another approach. Sum of 1/0 coded numeric variables is the number of 1 values.

Now throw in 2. Neither of the above work. Now each value needs to be tested to "I want 2 to follow the counting process as 1." Are your going to add in 3? then code that tests for 2 won't find it and has to be rewritten again. Is this "count" going to become dependent on more variables that have not been mentioned yet?

Are non-integer values going to be involved with yet another set of rules?

Here's one you aren't likely to be familiar with: what about special missing values? (There are actually 27 different missing values possible in SAS. None would normally be used for summing or such but they can be counted when needed.)

 

It is one thing to ask different questions about a single description of the data or even conditions in the data. Changing the data content is a completely different story.

CathyVI
Pyrite | Level 9

@ballardw Thanks for the clarity. As I have said this is a mistake-anxiety issue on my part. I am a step-by-step learner but i fail to include all information earlier. lesson learn. 

I am not addition another variable or row after this. My data is 0,1,2. Hope you or anyone can help. Thanks again.

Patrick
Opal | Level 21

@CathyVI Given that your data is "evolving" I assume you are creating this data yourself as part of some upstream logic. It could be possible that there is a more direct way to get to the report you're after but for us to know you would need to share a representative sample of your "source" data and then describe what information you would like to surface in your report.

FreelanceReinh
Jade | Level 19

@CathyVI wrote:

This is the overall goal am trying to get.

 

Event1

only

Event2 only

Case32 only

Event46 A

only

Event46 B

only

Event46 C

Only

Multiple

Total

30 days

 

 

 

 

 

 

 

 

31-60 days

 

 

 

 

 

 

 

 

Total

 

 

 

 

 

 

 

 


This looks like a two-way frequency table, so we need to create a second variable, let's call it Duration, in addition to event_type. Then we can use PROC FREQ to obtain such a table (which, in turn, could be used as a basis for a more sophisticated report).

 

Let me first create a more comprehensive sample dataset which includes all possible combinations of events and duration values (that have been mentioned so far).

 

/* Create a comprehensive sample dataset */

data have;
call streaminit(31415927);
do Event1=0 to 2;
  do Event2=0 to 2;
    do Case32=0 to 2;   
      do Event46=0 to 2;
        if Event46 then do grp='A', 'B', 'C';
          do _n_=1 to rand('integer',9);
            output;
          end;
        end;
        else do;
          grp=' ';
          do _n_=1 to rand('integer',9);
            output;
          end;
        end;
      end;
    end;
  end;
end;
run;

/* Define a format for variable Duration */

proc format;
value days
1=' <=30 days'
2='31-60 days';
run;

/* Create the analysis dataset */

data want(drop=_:);
set have;
array e[*] event1 event2 case32 event46;
length _s $4 _v event_type $40;
_s=cat(of e[*]);
_n0=countc(_s,'0');
if _n0=3 then do; /* Three zeros indicate that only one type of event has occurred. */
  _vn=verify(_s,'0'); /* This determines which type of event it was. */
  _v=vname(e[_vn]);
  event_type=catx(' ',_v,grp,'only');
  if _vn<3 then event_type=' '||event_type; /* Indentation improves sort order of report columns. */
  Duration=e[_vn];
end;
else if _n0<3 then do; /* Fewer than 3 zeros indicate that multiple types of events have occurred. */
  event_type='Multiple';
  Duration=1+(countc(_s,'2')>1); /* Duration=2 only in case of multiple occurrences of value 2. */
end;
format Duration days.;
run;

/* Generate a simple report */ 

proc freq data=want order=formatted;
tables duration*event_type / nopercent norow nocol;
run;

Result:

 

dur_by_evtype.png

The six missing values are from observations with Event1=Event2=Case32=Event46=0. The value of Duration for event_type='Multiple' is determined according to your specification

"if you have 2 in more than one group you will be multi_60."
CathyVI
Pyrite | Level 9

@FreelanceReinh  What is this step doing in the code? Is it counting the location of event.

if _vn<3 then event_type=' '||event_type; /* Indentation improves sort order of report columns. */
Duration=e[_vn];

and what is the difference between Duration=e[_vn];  and  _v=vname(e[_vn]);

 

Then is this code going to identify only multiple of 2

Duration=1+(countc(_s,'2')>1); /* Duration=2 only in case of multiple occurrences of value 2. */

FreelanceReinh
Jade | Level 19

@CathyVI wrote:

@FreelanceReinh  What is this step doing in the code? Is it counting the location of event.

if _vn<3 then event_type=' '||event_type; /* Indentation improves sort order of report columns. */
Duration=e[_vn];


The purpose of the IF-THEN statement is to indent the character values "Event1 only" and "Event2 only" by one character (resulting in " Event1 only" and " Event2 only") so that they come first in alphabetical order. The inserted leading blank is sorted before all letters. Without this trick the column order in the final PROC FREQ output would be

Case32 only, Event1 only, Event2 only, Event46 A only, Event46 B only, Event46 C only, Multiple

If this alphabetical column order is acceptable, remove that IF-THEN statement to simplify the code.

 

The statement

Duration=e[_vn];

assigns the non-zero value among the four values of Event1Event2Case32 and Event46 to variable Duration. There is exactly one non-zero value because of the IF condition _n0=3. Example: If Case32=2 and Event1=Event2=Event46=0, the concatenation _s equals "0020". The COUNTC function (here counting zeros in _s) then returns _n0=3 and the VERIFY function returns _vn=3 as the position of the first character in _s that is not equal to "0". Finally, e[_vn]=e[3] is the value of the third variable of the list event1 event2 case32 event46 in the ARRAY statement, i.e., the value of Case32 (which is 2 in the example).

 

what is the difference between Duration=e[_vn];  and  _v=vname(e[_vn]);


The VNAME function retrieves the name of the variable corresponding to the array reference e[_vn]. In the example above (where _vn=3) the result is _v="Case32".

 


Then is this code going to identify only multiple of 2

Duration=1+(countc(_s,'2')>1); /* Duration=2 only in case of multiple occurrences of value 2. */


In the case event_type='Multiple' the question arises whether this observation should be counted in category "<=30 days" or in "31-60 days" if both categories occur, e.g., if Event1=0, Event2=1, Case32=2, Event46=0. From your description

"if you have 2 in more than one group you will be multi_60"

I concluded that this particular example should be counted in "<=30 days", whereas, e.g., an observation with Event1=2, Event2=1, Case32=2, Event46=0 would be counted in "31-60 days" (because value 2 occurs more than once). This rule is implemented in the assignment statement

Duration=1+(countc(_s,'2')>1);

Step-by-step explanation:

Event1=0, Event2=1, Case32=2, Event46=0 → _s="0120" → countc(_s,'2')=1 → 1>1 is FALSE (0) → Duration=1+0=1
Event1=2, Event2=1, Case32=2, Event46=0 → _s="2120" → countc(_s,'2')=2 → 2>1 is TRUE (1)  → Duration=1+1=2

That assignment statement using a Boolean expression (the inequality countc(_s,'2')>1), which evaluates to TRUE (numeric value: 1) or FALSE (numeric value: 0), is an abbreviation for

if countc(_s,'2')>1 then Duration=2;
else Duration=1;

Note that this logic is applied only in the case _n0<3, not to observations with, e.g., Event1=0, Event2=0, Case32=2, Event46=0, where _n0=3 (and eventually Duration=2) as explained further above.

CathyVI
Pyrite | Level 9

@FreelanceReinh Thanks. So in the _s =2100, the logic is that the individual filled the form (event2) within 30days before they filled form(event1) after 30days. The forms are Mutually Exclusive. So if they are 2100, they should be counted at event2 only for the 0-30 day window.

Another example is (If Case32=2, Event1= 2, Event2=2 and Event46C=1), this means they should be counted in the Event46C only for (0-30 days) because they filled form (Event46C) within the first 30 days.

Other combinations where the form is mischaracterize and/or the time window are 2201, 2100,0012,0212,0221, 2221, 0102 etc (anything with one 1 and one or more 2s).

I would like these individuals to be  counted in the events or cases ONLY for the days window which they match (1=0-30days, 2=31-60days) because they filled atleast one form (events or case) in the first 30 days or 31-60 days. 

 

Those who will be in the multiple group will be for example when the events or cases have only the same value e.g., 0011, 0022, 2222,2200, 1110, 1111, 1010 etc.

FreelanceReinh
Jade | Level 19

With these new specifications the DO-END block after "else if _n0<3 ..." changes:

 

data want(drop=_:);
set have;
array e[*] event1 event2 case32 event46;
length _s $4 _v event_type $40;
_s=cat(of e[*]);
_n0=countc(_s,'0');
if _n0=3 then do; /* Three zeros indicate that only one type of form has been filled. */
  _vn=verify(_s,'0'); /* This determines which form it was. */
  _v=vname(e[_vn]);
  event_type=catx(' ',_v,grp,'only');
  if _vn<3 then event_type=' '||event_type; /* Indentation improves sort order of report columns. */
  Duration=e[_vn];
end;
else if _n0<3 then do; /* Fewer than 3 zeros indicate that multiple forms have been filled. */
  _n1=countc(_s,'1');
  if _n1=1 then do; /* If value 1 occurs only once, ... */
    _vn=whichn(1, of e[*]); /* ... this determines the corresponding form. */
    _v=vname(e[_vn]);
    if _vn=4 then event_type=catx(' ',_v,grp,'only');
             else event_type=catx(' ',_v,    'only');
    if _vn<3 then event_type=' '||event_type; /* Indentation improves sort order of report columns. */
    Duration=1; /* Value 1 overrides value 2. */
  end;
  else do;
    event_type='Multiple';
    Duration=1+(_n1=0); /* Duration=2 only if value 1 does not occur, otherwise Duration=1. */
  end;
end;
format Duration days.;
run;

First, we count the 1s in the four-digit string _s. If that number _n1=1, we determine where the single 1 occurred (WHICHN function) and conclude that event_type is the corresponding "event" only (variable _v). In the special case where that is "Event46" (i.e., _vn=4), the value of grp is inserted. As previously, we indent "Event1" and "Event2". Definitely, Duration=1, so that value 1 prevails.

 

If _n1 does not equal 1 (while _n0<3), we must have either two (or more) 1s or zero 1s and two (or more) 2s. Both cases qualify for event_type='Multiple' and the Duration is 1 or 2, respectively.

 

To obtain an overview of the event_type and Duration values resulting from the various four-digit combinations in _s (and the value of grp), you can omit the DROP= dataset option from the DATA statement (let's call the new output dataset CHK) and run the following PROC FREQ step:

proc freq data=chk;
tables event_type*duration*_s / missing list;
run;

 

CathyVI
Pyrite | Level 9

@FreelanceReinh  Thank you. This is great. 

I notice a slight issue. Let say i have the same individual with a record like this. for example row 219 and 220 is the same person.

 

CathyVI_0-1714586107942.png

 

How can I make this individual to be counted as  multiple instead of counting them in each Event46 B only and Event46 C only group. I noticed it is only one person with one date form (event46). That is, they filled the form one single day that is why event46 is both 1 for them. If i am counting them separate, that means I am double counting them because of their event type. However, It is one record actually assuming I transpose the records, they will be counted as multiple because they have 1-1.  I want this to only apply to the Event46  because it has grp A,B,C and this grp is making us count someone with the same record twice. 

Please @FreelanceReinh  if you can add an ID variable to the data and some will have 2 records, this will be nice. Row 219 & 220 will be a good example.

 

FreelanceReinh
Jade | Level 19

In an earlier post in this thread you wrote:

I am not addition another variable or row after this.


And now you are requesting just that:


@CathyVI wrote:

if you can add an ID variable to the data and some will have 2 records, this will be nice. 


You have also opened a second thread find the count of variable by group where, as far as I see, the same problem is being discussed.

 

In this situation I am not going to give another precise shot at a moving target, sorry, but suggest a general four-step approach:

  1. Aggregate the data to one observation per ID.
  2. Classify the aggregated observations into the categories of interest.
  3. Count the observations per category.
  4. Create a report from the counts.

Example solutions for steps 2 (before aggregation, though) through 4 have been presented in this thread. Whether a DATA step, PROC SQL, PROC SUMMARY or other techniques are most appropriate for step 1 depends on the data structure and on the aggregation rules, neither of which we have seen entirely. For instance, if two records provide different pieces of information about forms that have been filled, their (e.g. chronological) order may or may not be relevant.

ballardw
Super User

@CathyVI wrote:

This is not what I want. I want 2 to follow the counting process as 1. if you have a single 2 you will be counted differently may if i can name it event1_60, event2_60 or case32_60 and if you have 2 in more than one group you will be multi_60. Again, I know this is inefficient way of requesting help. Am sorry

This is the overall goal am trying to get.

 

Event1

only

Event2 only

Case32 only

Event46 A

only

Event46 B

only

Event46 C

Only

Multiple

Total

30 days

 

 

 

 

 

 

 

 

31-60 days

 

 

 

 

 

 

 

 

Total

 

 

 

 

 

 

 

 


And now you have introduced 'days' without any previous description of where "days" might come from.

ballardw
Super User

Can you describe the rule(s) involved? Code that doesn't do what you want without explanation of what is needed makes it hard to suggest changes.

 

I have to say that without a very clear understanding of what is done in the later analysis that creating a bunch of variables like event1only, event2only, etc. makes me cringe to some extent.

 

If the only purpose is that "count" (sum might be a better approach)

data example;
   set have;
   array e (*) event1-event4;
   array o (*) eventOnly1 -EventOnly4;
   if sum(of e(*))=1 then  o [ whichn(1,of e(*))] =1;
   else if sum(of e(*))>1 then multi_event=1;
run;

Proc tabulate data=example ;
  var EventOnly: multi_event;
  table  EventOnly: multi_event,
         sum
  ;

run;

Arrays are a one way to group variables for shorthand code. In this case using Sum of the array to tell if only one of the variables has a value of 1, or more, or all missing.

Note the different naming of variables in the Array o definition. If you place a number at the end you can use a list to generate sequentially named variables. Placing a suffix like "only" after that means you have to type them all the time.

The fun bit in the code is the function WHICHN which looks for the first value in a list of others. In this case it looks for a value of 1 in the 4 variables of the array E. The return is the position in the list of that variable. Since we have restricted the use to where only one variable in the list is defined that number can be used as the INDEX for the array o, which is why the whole function call is inside the [  ] brackets. Array can use either () or [] to enclose the index value of the array. I prefer the [ ] as it is easier to see which is the function index boundary when functions are involved.

 

I used Sum in the tabulate, N would work as well but would require VAR variables not Class because of all the missing values.

Patrick
Opal | Level 21

And just for fun here another approach to achieve the same

proc format;
  invalue $case
    '0000' = 0
    '1000' = 1
    '0100' = 2
    '0010' = 3
    '0001' = 4
    other  = 5
    ;
run;

data want(keep= Only1 - Only4 Multi Total);
  set have end=last;

  array src_vars{*} event1 event2 case32 event46;
  array trg_vars{*} Only1 - Only4 Multi (5*0);
  retain trg_vars;

  ind=input(cats(of src_vars[*]),$case.);

  if ind ne 0 then trg_vars[ind]=sum(trg_vars[ind],1);

  if last then 
    do;
      Total=sum(of trg_vars[*]);
      output;
    end;

run;

proc print data=want;
run;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 27 replies
  • 737 views
  • 13 likes
  • 6 in conversation