Solved: Re: Creating new variables from long data

rfarmenta · Posted 10-08-2018 03:49 PM

I have a long dataset with 125 participants and 600,000 observations. I need to create several new variables summarizing the data. I can summarize the data but am having trouble figuring out how to create the new variables without having to do it for each participant separately. The two variables I am using to create the new variables are activity type with 4 categories (1-4) and hrzone with three categories (1-3).

What I need is the amount of time that participants spent in each activity in seconds, which is just a sum of all of their observations since each observation per participant is equal to 1 second in that activity. So I will need 4 new variables (ACT1T-ACT4T) for each participant.

Then, I need to create three other variables for amount of time participants spent in each HRzone (HRZ1T-HRZ3T).

What I need is a dataset that looks like this:

Study_id	HRZ1T	HRZ2T	HRZ3T	ACT1T	ACT2T	ACT3T	ACT4T
001	2300	1000	1500	2300	1000	5000	500
002	2200	800	700	2000	500	1000	200
003	2300	1000	1500	2300	1000	5000	500
004	2200	800	700	2000	500	1000	200

I've attached a very abbreviated mock dataset that shows my long data. In the original data I have 125 participants with between 1800 and 9005 observations each.

I can run the code below for each participant to get their hrzone in each activity (hrz1t-hrz3t), but I would have to do that for all participants.

proc sql;

select count(*) as N_obs

from mod_data

where activity=1 and hrzone=1;

quit;

I can also just run a proc freq on activity for each participant to get the amount of time (number of observations) they spent in each activity.

proc freq data=mod_data;

where study_id="001";

tables activity;

run;

What I need to be able to do create a loop or something similar to do it for all participants at the same time and to have it create a new dataset with 1 observation per participant like I have above so I can combine that data with the rest of the participants demographic data to do analyses.

I am not good with do loops and these types of things in general so any ideas to get this done would be apprecaited. Let me know if you need additional information.

Thank you in advance.

SuzanneDorinski · Posted 10-08-2018 04:46 PM

I prefer to use by group processing when dealing with data like this.

options MSGLEVEL=I;

/** Import an XLSX file.  **/

PROC IMPORT DATAFILE="/folders/myfolders/SAS Communities/mod_data.xlsx"
		    OUT=mod_data
		    DBMS=XLSX
		    REPLACE;
RUN;

data mod_data;
  set mod_data;
  indicator=1;
run;

* summarize hrzone data by participant ;

proc means data=mod_data sum noprint;
  var indicator;
  class study_id hrzone;
  types study_id*hrzone;
  output out=hrzone_summary sum=hrz;
run;

* reshape summarized hrzone data ;

proc transpose data=hrzone_summary(drop=_type_ _freq_)
               out=transposed_hrzone(drop=hrzone)
               name=hrzone prefix=HRZ
               suffix=T;
  by study_id;
  var hrz;
run;
 
* summarize activity data by participant ;

proc means data=mod_data sum noprint;
  var indicator;
  class study_id activity;
  types study_id*activity;
  output out=activity_summary sum=act;
run;

* reshape summarized activity data ;

proc transpose data=activity_summary(drop=_type_ _freq_)
               out=transposed_activity(drop=activity)
               name=activity prefix=ACT
               suffix=T;
  by study_id;
  var act;
run;


data results;
  merge transposed_hrzone
        transposed_activity;
  by study_id;
run;

proc print data=results noobs;
run;

The screen shot below shows the output from the example code and example data.

Heart rate zone and activity data summarized by participant

View solution in original post

SuzanneDorinski · Posted 10-08-2018 04:46 PM

I prefer to use by group processing when dealing with data like this.

options MSGLEVEL=I;

/** Import an XLSX file.  **/

PROC IMPORT DATAFILE="/folders/myfolders/SAS Communities/mod_data.xlsx"
		    OUT=mod_data
		    DBMS=XLSX
		    REPLACE;
RUN;

data mod_data;
  set mod_data;
  indicator=1;
run;

* summarize hrzone data by participant ;

proc means data=mod_data sum noprint;
  var indicator;
  class study_id hrzone;
  types study_id*hrzone;
  output out=hrzone_summary sum=hrz;
run;

* reshape summarized hrzone data ;

proc transpose data=hrzone_summary(drop=_type_ _freq_)
               out=transposed_hrzone(drop=hrzone)
               name=hrzone prefix=HRZ
               suffix=T;
  by study_id;
  var hrz;
run;
 
* summarize activity data by participant ;

proc means data=mod_data sum noprint;
  var indicator;
  class study_id activity;
  types study_id*activity;
  output out=activity_summary sum=act;
run;

* reshape summarized activity data ;

proc transpose data=activity_summary(drop=_type_ _freq_)
               out=transposed_activity(drop=activity)
               name=activity prefix=ACT
               suffix=T;
  by study_id;
  var act;
run;


data results;
  merge transposed_hrzone
        transposed_activity;
  by study_id;
run;

proc print data=results noobs;
run;

The screen shot below shows the output from the example code and example data.

Heart rate zone and activity data summarized by participant

rfarmenta · Posted 10-12-2018 02:56 PM

This worked well and gave what I am looking for overall. I am now trying to do something similar where I am trying to create new variables that summarizes the amount of time in hrzone by activity level for each participant. So essentially I need to create 18 new variables that would look like this: if activity=1 and hrzone=1 then acthr11=XX, if activity=2 and hrzone=2 then acthr12=XX...and so on.

I thought I could modify the code like this:

proc means data=mod_data sum noprint;

var indicator;

class study_id activity hrzone;

types study_id*HRZONE*activity;

output out=acthr_summary sum=acthr;

run;

to create those variables and then transpose the data using the same code:

proc transpose data=acthr_summary(drop=_type_ _freq_)

out=transposed_acthr(drop=activity)

name=activity prefix=ACTHR

suffix=T;

by study_id;

var acthr;

run;

That seems to create the right information where I get new variables with the amount of seconds for each person for hrzone*activity, however, the data does not seem to appear in the right variables for each person. IF you look at the output I have from the data you will see two participants, for participant 001, they participated in 3 activities and went into multiple HR zones. Those are shown under certain variable names. For participant 002, they participated in more activities and had multiple HR zones. But the way the data is transposed put different activity* hrzones under the same variable name as it did for for a different combo for the first participant. I am sure there is a way to reconcile that but I am not quite sure how. I hope this makes sense.

SuzanneDorinski · Posted 10-12-2018 05:41 PM

I wasn't able to download the PDFs that you provided.

The following bit of code may be useful.

proc means data=mod_data n completetypes;
  var indicator;
  class study_id activity hrzone;
  types study_id*activity*hrzone;
  output out=acthr_summary n=acthr;
  title 'Time in activity by hrzone level for each participant';
run;

* concatenate 'ACTHR', activity, hrzone, and 'T' to use when ;
* naming variable after transpose ;

data acthr_summary_with_combo;
  length combo $ 8;
  set acthr_summary(drop=_type_ _freq_);
  combo=cat('ACTHR',activity,hrzone,'T');
run;

proc transpose data=acthr_summary_with_combo(drop=activity hrzone)
               out=transposed_acthr(drop=_name_);
  by study_id;
  var acthr;
  id combo;
run;

proc print data=transposed_acthr;
  title 'Transposed data for time in activity by hrzone';
run;

The screen shot below shows part of the PROC MEANS output.

Partial PROC MEANS output for time in activity by hrzone for each study participant

The next screen shot shows the transposed data.

Transposed data for time in activity by hrzone

Tom · Posted 10-08-2018 04:57 PM

Since all of your categories are numeric why not just use PROC MEANS (summary) and PROC TRANSPOSE.

Let's get some of your sample data in a usable format.

data have ;
  input study_ID Time HR Activity Age maxhrper hrzone;
cards;
1 0 114 4 27 60.96256684 1
1 1 114 4 27 60.96256684 1
1 2 101 4 27 54.01069519 1
1 3 95 4 27 50.80213904 1
1 4 92 4 27 49.19786096 3
1 5 93 4 27 49.73262032 3
;

So first summarize by each of the different variables (ACTIVITY and HRZONE).

proc means data=have noprint ;
  by study_id;
  class activity hrzone ;
  ways 1 ;
  output out=summary N=N;
run;
proc print;
run;

Now let's pull out the variable name of the original class variables and use that for naming our new variable and then use PROC TRANSPOSE to convert .

data tall ;
  set summary ;
  array class activity hrzone ;
  length _name_ $32 value 8;
  value = coalesce(of class(*));
  _name_ = cats(vname(class(whichn(value,of class(*)))),value);
run;

proc print;
run;

proc transpose data=tall out=want ;
  by study_id;
  id _name_ ;
  var  N;
run;

proc print;
run;

mkeintz · Posted 10-09-2018 12:24 AM

Here's a single data step solution that doesn't use do loops. It makes use of the fact that your data appears to be sorted by study_id

data want (keep=study_id hrz: act: drop=activity);
  set have;
  by study_id;
  array h {*} hrz1t hrz2t hrz3t;
  array a {*} act1t act2t act3t act4t;
  retain hrz1t--act4t;
  if first.id then call missing(of h{*},of a{*});
  h{hrzone}+1;
  a{activity}+1;
  if last.study_id;
run;

It assumes you have no missing values for activity or hrzone, and all the hrzone values are the integers 1, 2, or 3. And activity is only the integers 1 through 4.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

s_lassen · Posted 10-09-2018 04:25 AM

I would probably do something like this:

data want;
  set have;
  by study_id;
  array HRZ(*) 8 HRZ1T HRZ2T HRZ3T;
  array ACT(*) 8 ACT1T ACT2T ACT3T ACT4T;
  if first.study_id then 
    call missing(of HRZ(*),of ACT(*));
  HRZ(hrzone)+1;
  ACT(activity)+1;
  if last.study_id;
run;

But you are making things difficult for yourself by using names with a suffixed T. It is much easier to refer to multiple variables if they have the number as a suffix, e.g.:

data want;
  set have;
  by study_id;
  array HRZ(*) 8 HRZ_time1-HRX_time3;
  array ACT(*) 8 ACT_time1-ACT_time4;
  if first.study_id then 
    call missing(of HRZ(*),of ACT(*));
  HRZ(hrzone)+1;
  ACT(activity)+1;
  if last.study_id;
run;

In this case it does not matter so much, but it will save you an awful lot of typing if you have a large number of categories.

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away