Re: Count the number of occurrences of any items in a list

neyoushah · Posted 03-16-2023 02:14 PM

Hello,
I am trying to generate the count of times any of the strings in a list appears in each row. I am using a big data, with many columns, but the columns of interest start with D_I10_ .
This dataset has ICD codes for every visit a patient makes to the hospital. But I am interested to find out the number of times a patient ended up with an alcohol related ICD code in the hospital. The ICD codes of each visit is listed in columns starting with D_I10_ .
Here is what I have done so far:

data home.new2;
set home.new;
array d1 D_I10_;  /*this is trying to tell which columns to pick*/
array c{3} $ 4 _temporary_ ('X45' 'Y15' 'X65'); /* these are the strings I am interested in, there are way more than three, so I need an automatic way to do it*/
visits=0;  /* this is just stating the variable i want to make, which is # of visits based on those string appearance*/ 
do i= 1 to dim(d1); /* this is a attempt to do the counting*/
if whichc(d1[i],of c(*))>0 then do visit=visit+count(whichc(d1[i],of c(*))>0);leave;end;
end;
run;

Any help will be greatly appreciated 🙂
Neyousha

ballardw · Posted 03-16-2023 04:44 PM

I think that you want this for the first Array statement:

array d1 (*) D_I10_:   ;

The colon immediately after a string in the place where a list is accepted will use all variables whose names start with the string before the colon.

You may not want the Leave instruction in that loop as that means "stop counting as soon as the first match is found" as written. So if , for sake of argument D_I10_1 is "X45" and D_I10_9 is "Y15" it would not find the "Y15" because of the Leave.

neyoushah · Posted 03-16-2023 11:39 PM

Unfortunately I still end up with the empty frequency table 😞

ballardw · Posted 03-17-2023 06:00 AM

@neyoushah wrote:

Unfortunately I still end up with the empty frequency table 😞

Data

We can't test code without data.

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the </> icon or attached as text to show exactly what you have and that we can test code against.

Only include variables absolutely needed for the problem and replace any sensitive information that is needed, such as patient identifier with dummy values like A, B, C or 1, 2, 3

You may want to look for values that start with one or more spaces in your variables. They would not match any of the comparisons used. It is also a good idea to check CASE of character values. Someone may have told you to look for 'X45' but if the actual values are 'x45', or have other characters such as 'X45.1' (ICD codes tend to be in families with many decimals indicating subsets) then those won't match either.

You maybe should also run proc contents on your data set an share that.

Tom · Posted 03-16-2023 10:59 PM

If each variable is an ICD code for a patient interaction I don't see any value in counting how many of the different codes for alcohol related conditions appear on a single record. What would that even mean? I doubt that having two instead of just one of the codes present has any meaning. If it did you should probably make separate flags for each of the codes instead so you can try to evaluate which code is more important in your analysis.

You probably just want to flag each record as alcohol related or not and then summarize across visits and/or patients.

So taking your example with a set of just three code to search for you could use code like:

data want;
  set have;
  by patientid;
  if first.pattientid then call missing(all_visits,alcohol_visits);
  array dx D_I10_:;
  array c[3] $4 _temporary_ ('X45' 'Y15' 'X65'); 
  do index=1 to dim(c) while (not alcohol_related);
     alcohol_related = x[index] in dx ;
  end;
  all_visits + 1;
  alchohol_visits + alcohol_related;
/*
  if last.patientid;
  keep patientid all_visits alcohol_visits;
*/
run;

neyoushah · Posted 03-16-2023 11:48 PM

Hi Tom,
So the dataset has single rows per patient. Every patient who is admitted, end up with an ICD code for their diagnosis. The ICD code may or may not be alcohol related. There are over 200 code for alcohol related outcomes. Below is a snip of the data, as you see, some cells are empty, some have ICD codes, and some may have alcohol related ICD codes.

Thanks,
Neyousha

PaigeMiller · Posted 03-17-2023 06:10 AM

We can write code to work on data in a screen capture. The data must be in SAS data step code, @ballardw has given instructions how to do this. Do not provide data in any other format.

--
Paige Miller

ballardw · Posted 03-17-2023 11:32 AM

Might have been nice to include some of the actual codes you are looking for.

But we can't write code against pictures. Also from the range of values show, from my somewhat limited experience with ICD codes, it appears as if your particular system might be replacing "standard" codes with something modified.

You show a picture of F500 for example. I cannot find that in any of the lists of ICD-10 but there is an F50.0 (see that period) for Anorexia nervosa (with subcodes F50.00 F50.01 F50.01)

Quentin · Posted 03-18-2023 08:29 PM

Working with this sort of wide data is often easier if you transpose it into a vertical format.

For example, you could transpose your data into a dataset with three variables: Patient_ID, Visit_Number, and ICD_Code.

With data in that format, selecting patients with a certain code, or patients with a list of codes, is easy, because you only have one variable to check. You could probably do it with a PROC PRINT with a WHERE statement.

It's usually much easier to analyze a long narrow dataset rather than a short wide dataset. Yes, you can do this sort of stuff with arrays, but it gets complex/ugly quickly. If you put time into re-designing your data structures, often the coding becomes much simpler.

Lots of analysts think in terms of "one row per patient" because they're thinking about Excel. In one of my first jobs, I was lucky to have a boss who encouraged me to think about better data structures. As a programmer, even if the analyst/researcher I was supporting described a question / algorithm in terms of one row per-patient, my boss not only gave me the freedom to restructure the data into a better format, he expected me to do it, for the sake of both efficiency and quality.

The Boston Area SAS Users Group is hosting free webinars!
Next webinar will be in January 2025. Until then, check out our archives: https://www.basug.org/videos. And be sure to subscribe to our our email list.

ErikLund_Jensen · Posted 03-19-2023 09:46 AM

Hi @neyoushah

To elaborate on the post from @Quentin

SAS is basically a statistical package with procedures that works on SAS Data sets, and the data step is originally intended as a tool to get data into SAS from external sources. And while it is true that almost anything can be done with a single complicated data step, is is not always the best approach.

This forum is about programming, so it is natural that most problems are presented as programming problems, i.e. how to get a result based on the idea that a solution starts with the token data and ends with the token run, and that most answers are given within that same context.

Unfortunately there is no forum to address overall program design, meaning how to write a whole program, not a data step, to solve a given problem with SAS software. A data step can easily be so complicated that nobody (including the programmer herself after a week) can figure out what's going on.

Start with getting an understanding of what you have. In this case it seems that there is one record per patient, and that each D_I10_ column represents a visit with one diagnosis code per visit, which may be alcohol related. If more than one column with the same code is present, they represent one visit each. If you are in doubt, or your interpretation in any way conflicts with actual data, get back to the source of data to verify it. Then define your final goal. What is it you (or the one requesting it) wants to know based on data. Which final data structure will be best to obtain that.

Next step is to get data into SAS. If the original source is a spreadsheet, then write a data step to read it into SAS. Just get it in, leave further data manipulation to later. And don't rely on Proc Import for this.

Then consider what additional information you need. In this case it is a list of alcohol-related diagnosis codes. Don't think of hardcoding them as constants in select-constructs or if-then sequences. Get them into a SAS data set too, so they are ready for further use, if necessary as cards in a data step.

Now consider your program flow. Don't be afraid of multiple steps, it is much better to break the problem down so it can be solved with a series of steps, where you for each step can chosse the best tool for that. Way back "real programmers" mocked SAS as meaning Sort-After-Sort, but it's the way it works, and what makes it much faster to get a result with SAS than with more code-oriented languages. So use it as intended.

First thing is to get rid of the Excel-structure, which has no place in SAS, so transpose your SAS input from wide to long. Use Proc Transpose because it is simpler than traversing arrays in a data sptep. The transposed data set is much easier to handle, now you can drop unwanted diagnisis codes with a simple Proc SQL join or a Data step merge, whichever you prefer, because you have another data set with the list of relevant codes.

Then consider the next step. Is it the number of alcohol-related visits per patient you want, the number of visits per diagnosis code, or the number of different diagnosis coder per patient. If it is more than one of these, then split up from here and treat different wishes as individual subtasks. It is much easier to code, if you can get the result with a simple count in Proc SQL og a Proc Means. This approach gives a program that is much easier to code and understand, easier to maintain and easier to add additional functionaly to. Just add other steps to fulfill further requests.

I have so many old programs running today where I deeply regret that I started out with coding instead of stepping back to make a proper design based on an understanding of data. The time spent on planning comes back twice in the coding phase. Besides, it is much easier to verify that the code works as intended, because there are intermediate result to check and small pieces of code to debug and correct. So don't be afraid of multiple steps. And remember that the core procedures are parts of the SAS language too, and that a good understanding of them is an integral part of being a good SAS programmer.

Tom · Posted 03-19-2023 11:38 PM

Sounds like you have collapsed the data to one observation per hospital stay. That might lead to a large number of ICD codes for a stay if the patient was in the hospital for a long time and had many complications. But most of the data in your picture does not have ANY codes and the ones that do only have one code. So I doubt whether there is any value in generating anything other than a YES/NO or 1/0 variable to indicate for a single observation whether or not this hospital stay included any ICD codes related to alcohol.

Patrick · Posted 03-18-2023 07:33 PM

If you would provide sample data in the form of a working SAS datastep creating such data then I could have posted tested code.

In below untested sample code the assumption is made that you can create a table ICD_Alcohol that only contains alcohol related codes.

The script then iterates over all your variables (do loop) and counts how many of these variables have a value that matches with one of the values stored in ICD_Alcohol (loaded into a hash table for lookup).

data want;
  if _n_=1 then
    do;
      if 0 then set ICD_Alcohol(keep=icd_cd);
      dcl hash(data:'ICD_Alcohol');
      h1.defineKey('icd_cd');
      h1.defineDone();
    end;
  call missing(icd_cd);
  set have;
  array _d_i10 {*} d_i10_:;
  do _i=1 to dim(_d_i10);
    alc_cnt=sum(alc_cnt, h1.check(key:strip(_d_i10[_i]))=0);
  end;
  drop _i;
run;

neyoushah · Posted 03-20-2023 10:16 AM

Hello,
Thank you all for the comments. I tried to recreate the sample but I ended up with many errors, instead I chose the first 30 samples, in ling format and not linked to other datasets. Instead I created a Github link:https://github.com/neyoushah/sashelp/blob/main/sashelp.sas7bdat

As you can see my unlinked data has several visits per patient. So I thought we could say if.first patient_id and if.last. So now, one must find the number of visits which included at least one alcohol related code for each patient_id. The codes are as follows:

data want;
   set dad ;
   array d D_I10_:;
   array c{25} $ 4 _temporary_ ('F103' 'F104' 'F105' 'F106' 'F107' 'F108' 'F109' 'F100' 'F101' 'F102' 'G621' 'G312' 'G721' 'I426' 'K292' 'K700' 'K704' 'K701' 'K702' 'K703' 'K709' 'K852' 'K860' 'Q860' 'P043');
   do i= 1 to dim(d);
      if whichc(d[i],of c(*))>0 then do alcohol1='Chronic— Wholly attributable to alcohol'; 
         leave;
      end;
   end;
run;data home.w;

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away