BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
rchung
Fluorite | Level 6

Hello.

I have a survey data from 27 agencies (n=27). Each agency can enter information of up to 25 employees. For each employee, there are 17 variables, including: employment start date, employment end date, name, email, phone, position title, and FTE. I show you only 7 of them in this post.

rchung_3-1684952081260.png

 

I have transformed this agency-level data into agency-quarter-level data. In other words, each row of agency is duplicated for 40 quarters. For example, the row of agency1 is duplicated for 40 quarters from 1910-Q1 through 1919-Q4.

 

rchung_2-1684951973098.png

 

Now, I have to remove the values if the staff was not employed on the last day [YYYYQ_end] of the quarter. For example:

 

if start_dt_1 > YYYYQ_end or end_dt_1 < YYYYQ_end then do;

name_1="  ";

email_1=" ";

phone_1=" ';

role_1=" ";

fte=.;

end;

 

I have to do this for up to 25 staff members per agency, and for 15 variables per staff member. What would be the best way to get this job done?

 

Thank you so much!

Rakkoo

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

If Name_1 and all the variables ending in 1 relate to one employee, and Name_2 and the variables ending in 2 to a second employee then your data structure is poor. You would be better off with one observation per employee and variables with names like Name, start_dt, end_dt.

Really.

As it is you need to define 17 arrays, one for each "variable". The : list builder says 'use all the variable names that start with the text before the : " in a list of variables.

 

data want;
   set have;
   array n (*) Name_: ;
   array start (*) start_dt: ;
   array end  (*) end_dt: ;
   /* follow the obvious pattern for all the variables*/
   do i=1 to dim(n);
      if not ( start[i] le yyyyqend le end[i]) then call missing(n[i],email[i],phone[i]);
   end;
run;

The function CALL Missing works with multiple variables to set them missing. That is all it does.  After you have all of the arrays defined use all of them following the pattern to include them in the call missing.

 

Just because someone may give you data in a stupid structure with 25 people across doesn't mean that you have to keep it.

View solution in original post

3 REPLIES 3
ballardw
Super User

If Name_1 and all the variables ending in 1 relate to one employee, and Name_2 and the variables ending in 2 to a second employee then your data structure is poor. You would be better off with one observation per employee and variables with names like Name, start_dt, end_dt.

Really.

As it is you need to define 17 arrays, one for each "variable". The : list builder says 'use all the variable names that start with the text before the : " in a list of variables.

 

data want;
   set have;
   array n (*) Name_: ;
   array start (*) start_dt: ;
   array end  (*) end_dt: ;
   /* follow the obvious pattern for all the variables*/
   do i=1 to dim(n);
      if not ( start[i] le yyyyqend le end[i]) then call missing(n[i],email[i],phone[i]);
   end;
run;

The function CALL Missing works with multiple variables to set them missing. That is all it does.  After you have all of the arrays defined use all of them following the pattern to include them in the call missing.

 

Just because someone may give you data in a stupid structure with 25 people across doesn't mean that you have to keep it.

PaigeMiller
Diamond | Level 26

Agreeing with @ballardw, this is much easier to handle in a long format. Maxim 19:

 

Long beats wide.

(Don't keep data in structure)

In the world of spreadsheets, people tend to line up data side-by-side, and put data items (dates, categories, …) into column headers. This runs counter to all the methods available in SAS for group processing, and makes programming difficult, as one has variable column names and has to resort to creating dynamic code (with macros and/or call execute) where such is not necessary at all if categories were represented in their own column and data aligned vertically.
There are times where a wide format is needed, eg when preparing data for regression analysis. But for the processing and storing of data, long formats are always to be preferred.

Dynamic variable names force unnecessary dynamic code.

 

--
Paige Miller
rchung
Fluorite | Level 6
It is very helpful. Thank you so much! 🙂

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 459 views
  • 2 likes
  • 3 in conversation