Collapse Up to Seven Observations by ID

fwashburn · Posted 10-17-2017 02:35 PM

Howdy folks! I have a large dataset with IDs that have anywhere from one to seven observations, and I'm trying to collapse them into one per ID. It's from an Excel file, but I'll try to mimic some of the data here for the sake of the following code:

data visits;
input dovisit date9. person_id sex :$1. nvisit :$1. fvisit :$1. avisit :$1.;
datalines;
18dec2007 444 M T F F
18dec2007 444 M T F F
20dec2008 444 M F F T
23apr2009 444 M T T T
31mar2010 444 M F F F
10jan2007 365 M T F F
10jan2007 365 M T F F
11jan2008 265 M F F T
01feb2007 212 F T T T
01feb2007 212 F T F T
;
run;

/*create a data set of the duplicates using DUPOUT= option  */

proc sort data=visits dupout=visits_dup nodupkey;
  by person_id;
run;

/* Create a macro variable with the variable names that are to     */
/* be merged.  The variables considered BY variables are excluded  */
/* from going into the macro variable using the NOT IN operator.   */
/* The resulting macro variable is in the format varname=varname_2 */                                             
proc sql noprint;                                                                                                                       
  select trim(name) || '=' || trim(name) || '_2'                                                                                        
  into :varlist separated by ' '                                                                                                        
  from DICTIONARY.COLUMNS                                                                                                               
  WHERE LIBNAME EQ "WORK" and MEMNAME EQ "VISITS"
  and upcase(name) not in ('PERSON_ID' 'DOVISIT' 'SEX');                                                                                                                                                                                          
quit;

/*Merge the two data sets using the macro variable to rename the  */
/*common variables in the second (duplicates) data set.           */
data merged;
  merge visits visits_dup (rename=(&varlist));
  by person_id;
run;

proc print;
run;

Now what I want to modify this code to do is to have columns nvisit, fvisit, avisit, nvisit_2, fvisit_2, avisit_2, nvisit_3, fvisit_3, avisit_3, and so on, all the way to _7, but when I try to modify the code, specifically this line:

select trim(name) || '=' || trim(name) || '_2'

...nothing I do seems to stick. The only change I've been able to make so far without getting an error:

select trim(name) || '=' || trim(name) || '_2' || '_3'

turns my nvisit_2 into an nvisit_2_3 and so on, instead of actually creating the separate column nvisit_3. I'm sure I'm making a syntax error but I'm not sure how to fix it. Thank you so much for taking a look!

(PS - This code is almost identical to what I found at Collapse observations in BY-Group so values from duplicate observations have new names; all I did here was add some more datalines because the code was originally only written to collapse two observations.)

PaigeMiller · Posted 10-17-2017 02:38 PM

I didn't go through your code carefully, but it seems to me that PROC TRANSPOSE with a BY statement will collapse the data the way you are asking.

--
Paige Miller

Reeza · Posted 10-17-2017 02:59 PM

You don't show what you want as output, but I second @PaigeMiller solution. I suspect you want a PROC TRANSPOSE instead. Then the number (up to 7 or 20) doesn't matter, SAS will add as many columns as needed.

fwashburn · Posted 10-17-2017 03:37 PM

Hi Reeza! Sorry for being unclear. What I want as output is something like this:

Obs

dovisit

person_id

sex

nvisit

fvisit

avisit

nvisit_2

fvisit_2

avisit_2

nvisit_3

fvisit_3

avisit_3

nvisit_4

fvisit_4

avisit_4

nvisit_4

fvisit_5

avisit_5

nvisit_5

fvisit_6

avisit_6

nvisit_6

1

17198

212

F

T

F

T

2

17542

265

M

F

T

3

17176

365

M

T

F

T

F

4

17518

444

M

T

F

T

F

T

F

T

F

T

(I realized that I should've had the same dates for same IDs in my sample code, as the data I'm actually working with has multiple variables for the same ID AND same date.)

How would I use PROC TRANSPOSE with a BY statement here, more specifically? I'm not sure where to start. If I just use the following:

proc transpose data=visits;
by person_id;
run;

It gives me the same result as the original code I posted.

Reeza · Posted 10-17-2017 03:45 PM

Are you only keeping one row per person, or one row per person, per dovisit?

It's not clear. Since you're transposing several variables you'll need to do multiple iterations and merge the results, but I'm unclear as to how you decided which dovisit to be included.

I think this is your starting point could be something like:

proc sort data=visits;
by person_id dovisit sex;
run;

proc transpose data=visits out=want prefix=nvisit;
by person_id dovisit sex;
var nvisit;
run;

Here's an overview of both ways, data step or proc transpose

https://stats.idre.ucla.edu/sas/modules/reshaping-data-long-to-wide-using-the-data-step/

https://stats.idre.ucla.edu/sas/modules/how-to-reshape-data-long-to-wide-using-proc-transpose/

PaigeMiller · Posted 10-17-2017 03:47 PM

@fwashburn wrote:
How would I use PROC TRANSPOSE with a BY statement here, more specifically? I'm not sure where to start. If I just use the following:
proc transpose data=visits;
by person_id;
run;
It gives me the same result as the original code I posted.

You need a VAR statement, and an OUT= statement, and possibly a few other options depending on exactly what you want. The documentation explains. http://documentation.sas.com/?cdcId=pgmmvacdc&cdcVersion=9.4&docsetId=proc&docsetTarget=n1xno5xgs39b...

--
Paige Miller

slacey · Posted 10-17-2017 04:03 PM

An alternative to the transpose procedure is to use the dowloop:

data visits;
input dovisit date9. person_id sex :$1. nvisit :$1. fvisit :$1. avisit :$1.;
datalines;
18dec2007 444 M T F F
18dec2007 444 M T F F
20dec2008 444 M F F T
23apr2009 444 M T T T
31mar2010 444 M F F F
10jan2007 365 M T F F
10jan2007 365 M T F F
11jan2008 265 M F F T
01feb2007 212 F T T T
01feb2007 212 F T F T
;
run;

***Grab the maximum number of possible visit variables;
proc sql noprint;
	select max(nobs) into: maxn
	from (select person_id,count(*) as nobs
		  from visits
		  group by person_id);
quit;

***Use a dow loop to collapse the data and create visit variables;
proc sort data=visits; by person_id dovisit; run;

data trans_visits(keep=person_id sex nvisit_: avisit_: fvisit_:);
	array nvisit_{&maxn} $;
	array avisit_{&maxn} $;
	array fvisit_{&maxn} $;

	do _N_=1 by 1 until(last.person_id); /*"_N_=1 by 1" not necessarily required*/
		set visits;
		by person_id dovisit;

		if first.person_id then count=1;
		else count+1;

		if ^missing(nvisit) then nvisit_[count]=nvisit;
		if ^missing(fvisit) then fvisit_[count]=fvisit;
		if ^missing(avisit) then avisit_[count]=avisit;
	end;

	output;
run;

It's more complex, possibly more so than you need but it adds some flexibility to collapsing the dataset (such as ability to choose exactly which date you would like to keep per subject).

Collapse Up to Seven Observations by ID

Re: Collapse Up to Seven Observations by ID

Re: Collapse Up to Seven Observations by ID

Re: Collapse Up to Seven Observations by ID

Re: Collapse Up to Seven Observations by ID

Re: Collapse Up to Seven Observations by ID

Re: Collapse Up to Seven Observations by ID

Collapse Up to Seven Observations by ID

Re: Collapse Up to Seven Observations by ID

Re: Collapse Up to Seven Observations by ID

Re: Collapse Up to Seven Observations by ID

Re: Collapse Up to Seven Observations by ID

Re: Collapse Up to Seven Observations by ID

Re: Collapse Up to Seven Observations by ID

SAS Innovate 2025: Call for Content