DATA Step, Macro, Functions and more

Question about Concatenation of rows using last.variable

Accepted Solution Solved
Reply
New Contributor
Posts: 4
Accepted Solution

Question about Concatenation of rows using last.variable

Hello, I am a novice sas user. I recently have found a solution to my problem, but figuring out why it is working has been diffiicult. It is crucial that I understand whats going on. It is my hope that someone can explain why this code is effective. 

 

Basically, I have three distinct Keys (incident_key, discipline_key, and offender_key) and want to concatenate by row to include all offense codes in one line. A mock version of the data, called "discipline" looks like: 

 

incident_iddiscipline_keyoffender_keyoffense_code
1119011FA
1122867FA
2345666FA
2345666WR
26788975YP
26788975RQ

 

 

I found this solution on this same website, and it works perfectly: 

 

data discipline_with_concat;

length all_offense $200.;

   do until (last.discipline_key);

      set discipline;

        by discipline_key notsorted;

      all_offense=catx(',',all_offense,number);

   end;

 

 

The code provides this output: 

incident_iddiscipline_keyoffender_keyoffense_codeall_offense
1119011FAFA
1122867FAFA
2345666FAFA,WR
26788975YPYP,RQ

 

 

Can someone please explain, if possible, in a simple and linear way, how this code is working? Often to find answers I run variations of the code and that made me wonder: Why is it that I need to concatenate offense_code with the newly created variable specified in the length statement? If i do not do that, it simply pulls the last observation for each discipline key. Any insight would be greatly appreciated, thanks! 


Accepted Solutions
Solution
‎04-04-2017 02:45 PM
Super User
Super User
Posts: 6,500

Re: Question about Concatenation of rows using last.variable

[ Edited ]

What you have is an example of DOW loop.  You should be able to search for papers on this. Look for ones by Ian Whitlock.

 

Normally when you process a dataset each iteration of the data step will correspond to one observation of the input data and one observation of the output data. By nesting the SET statment inside of the DO loop you have changed this dynamic.

 

First you need to understand when SAS writes the output record.  If you data step does not have any explicit OUTPUT statements then SAS will add an implied one at the end of the data step. That is what is happening in this data step.  But since the DO loop executes the SET statement potentionally multiple times in one iteration of the data step you could end up with fewer output dataset written than are read.

 

Second you need to understand how SAS initializes variables at the start of a data step. New variables (those not coming from an dataset read by statements like SET) are cleared when a new iteration starts.  

 

Third you need to understand how the CATX() function works.  It will combine all of the NON null arguements , putting the value of the first argument between them.  If there is only one non null value then the delimiter is not added.

 

Fourth you need to understand how the FIRST. and LAST. variables are created when you use a BY statement. Also what it means to use the NOTSORTED option on the BY statement.  Since you are just checking one variable it is easy.  The FIRST. variable is true for the first in a group and the LAST. is true for the last one. Since you told to treat the data as NOTSORTED it doesn't complain if the values are not in order.

 

Finally you need to understand what DO UNTIL() means. It means that the condtion is tested at the end of the DO loop. So it will always execute the loop at least once and then check the condition.

 

So in your data step you are creating a NEW variable named all_offense.  You then read one or more values from your input data set and concatenate the values of NUMBER onto the end of the new variable ALL_OFFENSE.  When you get to the end of the BY group the loop ends and the implled OUTPUT statement at the end writes the record.  Since this is the last observation in the BY group the values of the non-by group variables will come from that last observation. The new ALL_OFFENSE variable will have all of the non null values of NUMBER for all of the records in that group.

 

The the data step starts the next iteration it is on the next group of records. It will reset ALL_OFFENSE to null and do the same thing.  If there is no more records then the data step stops when it tries to read past the end of the input dataset.

View solution in original post


All Replies
Super User
Posts: 10,500

Re: Question about Concatenation of rows using last.variable

First is if you do not concatenate you do not achieve your objective which is to concatenate something.

Second and I think the heart of your questionis really "why do I need a new variable at all" is that it is very likely that your original variable has a length 2 to 5 characters (purely a guess based on displayed values). If you have 15 offenses to combine the length would be apparently a minimum of 15*2+14(the commas). Which would be 44 characters. When you try to stuff 44 characters into 5 spaces most of them go away. So the creation of the new variable with a length long enough to hold many combined values.

 

Note that if a new variable is created by assigning the value from an existing variable it acquires the length of the existing one.

 

data example;
   x= 'AB';
   Y= 'CD';
   z=x;   /*<= since X is 2 characters then Z is assigned 
            a length of 2 and that is max number of characters
           it will hold*/
   z=catx(',',z,y);
   q=catx(',',x,y);
   put z= q=;
run;
New Contributor
Posts: 4

Re: Question about Concatenation of rows using last.variable

Yes indeed, the issue all came about due to some truncation issues, where a client could not see the entire collection of offense_codes associated with each unique combination of discipline_key, and offender_key. It is important that we only concatenate by discipline_key, and if an offender only has one offense_code tied to a specific incicent, they will only have one discipline key and thus nothing to concatent, and thats ok. Thank you for your time and response, but I am really trying to get some explanation behind the logic of the code in the post. Specifically, why it works only if the variable all_offense is included in the catx portion. 

Super User
Posts: 10,500

Re: Question about Concatenation of rows using last.variable

The all_offense, or any variable in that position, can be though of as an "accumulator", you keep adding things to it. At the END of the loop then it holds all of the values. If you do not have all_offense as some portion of the CATX function the previous values you have identified are not retained.

 

Here is some slightly different code that shows what is going on with all_offense variable. The differenc here is that it output each record with the incremented value. Uncomment the last line and you should get a set similar to the one you are currently getting.

data example;
   set discipline;
   by discipline_key notsorted;
   length all_offense $200.;
   retain all_offense;
   all_offense=catx(',',all_offense,number);
   /* if last.discipline_key;*/
run; 

 

 

"Specifically, why it works only if the variable all_offense is included in the catx portion." As opposed to what? The way you phrase this it sounds to me as if there is an implied "instead of something else". What would that something else be?

 

May be the implied loop with the SET statement inside the do until is confusing.

New Contributor
Posts: 4

Re: Question about Concatenation of rows using last.variable

Thank you! Your description of the all_offense as an "accumulator" was helpful and illuminating. I had no alternative, or "opposed to what" in mind - I just knew that when I removed it the catx would simply select the last obersvation for each discipline key, and I wanted to know what role it was playing (i.e. including all_offense in catx). Yes, the arrangment of the code was confusing and out of my comfort level. I feel much better about this code, and the catx code itself. Unfortunatly, i was unable to get your altnerative code to work out correctly, but your help was much appreciated. 

Solution
‎04-04-2017 02:45 PM
Super User
Super User
Posts: 6,500

Re: Question about Concatenation of rows using last.variable

[ Edited ]

What you have is an example of DOW loop.  You should be able to search for papers on this. Look for ones by Ian Whitlock.

 

Normally when you process a dataset each iteration of the data step will correspond to one observation of the input data and one observation of the output data. By nesting the SET statment inside of the DO loop you have changed this dynamic.

 

First you need to understand when SAS writes the output record.  If you data step does not have any explicit OUTPUT statements then SAS will add an implied one at the end of the data step. That is what is happening in this data step.  But since the DO loop executes the SET statement potentionally multiple times in one iteration of the data step you could end up with fewer output dataset written than are read.

 

Second you need to understand how SAS initializes variables at the start of a data step. New variables (those not coming from an dataset read by statements like SET) are cleared when a new iteration starts.  

 

Third you need to understand how the CATX() function works.  It will combine all of the NON null arguements , putting the value of the first argument between them.  If there is only one non null value then the delimiter is not added.

 

Fourth you need to understand how the FIRST. and LAST. variables are created when you use a BY statement. Also what it means to use the NOTSORTED option on the BY statement.  Since you are just checking one variable it is easy.  The FIRST. variable is true for the first in a group and the LAST. is true for the last one. Since you told to treat the data as NOTSORTED it doesn't complain if the values are not in order.

 

Finally you need to understand what DO UNTIL() means. It means that the condtion is tested at the end of the DO loop. So it will always execute the loop at least once and then check the condition.

 

So in your data step you are creating a NEW variable named all_offense.  You then read one or more values from your input data set and concatenate the values of NUMBER onto the end of the new variable ALL_OFFENSE.  When you get to the end of the BY group the loop ends and the implled OUTPUT statement at the end writes the record.  Since this is the last observation in the BY group the values of the non-by group variables will come from that last observation. The new ALL_OFFENSE variable will have all of the non null values of NUMBER for all of the records in that group.

 

The the data step starts the next iteration it is on the next group of records. It will reset ALL_OFFENSE to null and do the same thing.  If there is no more records then the data step stops when it tries to read past the end of the input dataset.

New Contributor
Posts: 4

Re: Question about Concatenation of rows using last.variable

Wow - much appreciated for taking the time to explain. This is exactly the type of explanation I needed - and now I have some additional resources to further build a deeper understanding. Thank you. 

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 141 views
  • 0 likes
  • 3 in conversation