SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Collapse Up to Seven Observations by ID

Reply
Occasional Contributor
Posts: 8

Collapse Up to Seven Observations by ID

Howdy folks! I have a large dataset with IDs that have anywhere from one to seven observations, and I'm trying to collapse them into one per ID. It's from an Excel file, but I'll try to mimic some of the data here for the sake of the following code:

 

data visits;
input dovisit date9. person_id sex :$1. nvisit :$1. fvisit :$1. avisit :$1.;
datalines;
18dec2007 444 M T F F
18dec2007 444 M T F F
20dec2008 444 M F F T
23apr2009 444 M T T T
31mar2010 444 M F F F
10jan2007 365 M T F F
10jan2007 365 M T F F
11jan2008 265 M F F T
01feb2007 212 F T T T
01feb2007 212 F T F T
;
run;

/*create a data set of the duplicates using DUPOUT= option */ proc sort data=visits dupout=visits_dup nodupkey; by person_id; run; /* Create a macro variable with the variable names that are to */ /* be merged. The variables considered BY variables are excluded */ /* from going into the macro variable using the NOT IN operator. */ /* The resulting macro variable is in the format varname=varname_2 */ proc sql noprint; select trim(name) || '=' || trim(name) || '_2' into :varlist separated by ' ' from DICTIONARY.COLUMNS WHERE LIBNAME EQ "WORK" and MEMNAME EQ "VISITS" and upcase(name) not in ('PERSON_ID' 'DOVISIT' 'SEX'); quit; /*Merge the two data sets using the macro variable to rename the */ /*common variables in the second (duplicates) data set. */ data merged; merge visits visits_dup (rename=(&varlist)); by person_id; run; proc print; run;

 

Now what I want to modify this code to do is to have columns nvisit, fvisit, avisit, nvisit_2, fvisit_2, avisit_2, nvisit_3, fvisit_3, avisit_3, and so on, all the way to _7, but when I try to modify the code, specifically this line: 

select trim(name) || '=' || trim(name) || '_2'

...nothing I do seems to stick. The only change I've been able to make so far without getting an error:

select trim(name) || '=' || trim(name) || '_2' || '_3'

turns my nvisit_2 into an nvisit_2_3 and so on, instead of actually creating the separate column nvisit_3. I'm sure I'm making a syntax error but I'm not sure how to fix it. Thank you so much for taking a look!

 

(PS - This code is almost identical to what I found at Collapse observations in BY-Group so values from duplicate observations have new names; all I did here was add some more datalines because the code was originally only written to collapse two observations.)

Respected Advisor
Posts: 2,843

Re: Collapse Up to Seven Observations by ID

Posted in reply to fwashburn

I didn't go through your code carefully, but it seems to me that PROC TRANSPOSE with a BY statement will collapse the data the way you are asking.

--
Paige Miller
Super User
Posts: 23,354

Re: Collapse Up to Seven Observations by ID

Posted in reply to fwashburn

You don't show what you want as output, but I second @PaigeMiller solution. I suspect you want a PROC TRANSPOSE instead. Then the number (up to 7 or 20) doesn't matter, SAS will add as many columns as needed.

 

Occasional Contributor
Posts: 8

Re: Collapse Up to Seven Observations by ID

Hi Reeza! Sorry for being unclear. What I want as output is something like this:

 

Obsdovisitperson_idsexnvisitfvisitavisitnvisit_2fvisit_2avisit_2nvisit_3fvisit_3avisit_3nvisit_4fvisit_4avisit_4nvisit_4fvisit_5avisit_5nvisit_5fvisit_6avisit_6nvisit_6
117198212FTTTTFT             
217542265MFFT                
317176365MTFFTFF             
417518444MTFFTFFTFFFFTTFFTTTT

 

(I realized that I should've had the same dates for same IDs in my sample code, as the data I'm actually working with has multiple variables for the same ID AND same date.)

 

How would I use PROC TRANSPOSE with a BY statement here, more specifically? I'm not sure where to start. If I just use the following:

proc transpose data=visits;
by person_id;
run;

It gives me the same result as the original code I posted.

Super User
Posts: 23,354

Re: Collapse Up to Seven Observations by ID

Posted in reply to fwashburn

Are you only keeping one row per person, or one row per person, per dovisit?

 

It's not clear. Since you're transposing several variables you'll need to do multiple iterations and merge the results, but I'm unclear as to how you decided which dovisit to be included.

 

I think this is your starting point could be something like:

proc sort data=visits;
by person_id dovisit sex;
run;

proc transpose data=visits out=want prefix=nvisit;
by person_id dovisit sex;
var nvisit;
run;

Here's an overview of both ways, data step or proc transpose

https://stats.idre.ucla.edu/sas/modules/reshaping-data-long-to-wide-using-the-data-step/

https://stats.idre.ucla.edu/sas/modules/how-to-reshape-data-long-to-wide-using-proc-transpose/

 

 

Respected Advisor
Posts: 2,843

Re: Collapse Up to Seven Observations by ID

Posted in reply to fwashburn

fwashburn wrote:

 

How would I use PROC TRANSPOSE with a BY statement here, more specifically? I'm not sure where to start. If I just use the following:

proc transpose data=visits;
by person_id;
run;

It gives me the same result as the original code I posted.


 

You need a VAR statement, and an OUT= statement, and possibly a few other options depending on exactly what you want. The documentation explains. http://documentation.sas.com/?cdcId=pgmmvacdc&cdcVersion=9.4&docsetId=proc&docsetTarget=n1xno5xgs39b...

--
Paige Miller
Occasional Contributor
Posts: 19

Re: Collapse Up to Seven Observations by ID

Posted in reply to fwashburn

An alternative to the transpose procedure is to use the dowloop:

 

data visits;
input dovisit date9. person_id sex :$1. nvisit :$1. fvisit :$1. avisit :$1.;
datalines;
18dec2007 444 M T F F
18dec2007 444 M T F F
20dec2008 444 M F F T
23apr2009 444 M T T T
31mar2010 444 M F F F
10jan2007 365 M T F F
10jan2007 365 M T F F
11jan2008 265 M F F T
01feb2007 212 F T T T
01feb2007 212 F T F T
;
run;

***Grab the maximum number of possible visit variables;
proc sql noprint;
	select max(nobs) into: maxn
	from (select person_id,count(*) as nobs
		  from visits
		  group by person_id);
quit;

***Use a dow loop to collapse the data and create visit variables;
proc sort data=visits; by person_id dovisit; run;

data trans_visits(keep=person_id sex nvisit_: avisit_: fvisit_:);
	array nvisit_{&maxn} $;
	array avisit_{&maxn} $;
	array fvisit_{&maxn} $;

	do _N_=1 by 1 until(last.person_id); /*"_N_=1 by 1" not necessarily required*/
		set visits;
		by person_id dovisit;

		if first.person_id then count=1;
		else count+1;

		if ^missing(nvisit) then nvisit_[count]=nvisit;
		if ^missing(fvisit) then fvisit_[count]=fvisit;
		if ^missing(avisit) then avisit_[count]=avisit;
	end;

	output;
run;

It's more complex, possibly more so than you need but it adds some flexibility to collapsing the dataset (such as ability to choose exactly which date you would like to keep per subject).

Ask a Question
Discussion stats
  • 6 replies
  • 348 views
  • 2 likes
  • 4 in conversation