Solved: Re: Collapsing multiple records

DavidKaib · Posted 04-01-2020 02:42 PM

I have seen various things that are close to this, but not exactly was I am trying to do. I have a data set where there are anywhere from 1 to 3 records per person. There are some fields that are identical in multiple records (firstname, lastname, gender and id) and others that are not (degree and major). I know which fields are invariant and which ones are different. I am hoping to find code to collapse the records so there is only one per person, where the fields that have multiple values can be put into a string (BA;MA) and the ones that do not remain the same. Is there a way to do this? I'm using SAS base 9.4.

So this:

ID	FirstName	LastName	Gender	Degree	Major
1234567	Bill	Johnson	M	BA	Sociology
2345678	Jen	Baker	W	BA	Sociology
2345678	Jen	Baker	W	MA	Psychology
4567890	Stan	Wright	M	MA	Psychology
4567890	Stan	Wright	M	MA	Sociology

Would become this:

ID	FirstName	LastName	Gender	Degree	Major
1234567	Bill	Johnson	M	BA	Sociology
2345678	Jen	Baker	W	BA;MA	Sociology;Psychology
4567890	Stan	Wright	M	MA;MA	Psychology;Sociology

LeonidBatkhan · Posted 04-01-2020 03:26 PM

Hi DavidKaib,

I believe you can do it using the following code:

proc sort data=A out=A1 (rename=(Degree=D Major=M));
   by ID;
run;

data B (drop=D M);
   set A1;
   by ID;
   length Degree $20 Major $90;
   retain Degree Major;
   if first.ID then do;
      Degree = D;
      Major = M;
   end;
   else do; 
      Degree = catx(';',Degree,D);
      Major  = catx(';',Major, M);
   end;
   if last.ID then output;
run;

Here A is your first data set, B is your destination data set. At least, this gives you an idea.

Hope this helps.

➤ Leonid's SAS blog

View solution in original post

ballardw · Posted 04-01-2020 02:57 PM

Do you have a known maximum number of records that will need to collapse? This is needed so that you can create variables long enough to hold the results.

And another question, what exactly will you be able to do with the resulting file that can't be done with the existing one?

PaigeMiller · Posted 04-01-2020 03:06 PM

I would like to ask a similar question to that of @ballardw

What could you possibly do with text fields "BA;MA" or "Sociology;Psychology"? how would that help whatever comes next.

Maybe you want a column for each student named BA that is either 1 or 0, and another column for each student named MA which is 1 or 0; and so on. That could be a better form for subsequent analysis or reporting.

--
Paige Miller

DavidKaib · Posted 04-01-2020 03:14 PM

The variables won't end up being too long. I am losing a legacy way of generating the data set that did this for me, and that I used within SAS, and this hasn't been a problem. The new source cannot collaspe the records, so I am basically trying to replicate what I have previously had.

The resulting file is used as a contact list for a survey, and then later for analysis of the survey. In both cases I need an unduplicated count with all the information.

Reeza · Posted 04-01-2020 03:17 PM

https://gist.github.com/statgeek/d583cfa992bf56da51d435165b07e96a

This provides two examples. If you're combining more than one field, using the data step approach maybe easier/less steps.

PaigeMiller · Posted 04-01-2020 03:18 PM

Length was never the issue, in my mind. Creating a data set that would be easy to use and easy to program in subsequent analysis is the issue, and the data set I described with BA 0 or 1 and MA 0 or 1 and so on seems like an easier data set to work with.

--
Paige Miller

LeonidBatkhan · Posted 04-01-2020 03:26 PM

Hi DavidKaib,

I believe you can do it using the following code:

proc sort data=A out=A1 (rename=(Degree=D Major=M));
   by ID;
run;

data B (drop=D M);
   set A1;
   by ID;
   length Degree $20 Major $90;
   retain Degree Major;
   if first.ID then do;
      Degree = D;
      Major = M;
   end;
   else do; 
      Degree = catx(';',Degree,D);
      Major  = catx(';',Major, M);
   end;
   if last.ID then output;
run;

Here A is your first data set, B is your destination data set. At least, this gives you an idea.

Hope this helps.

➤ Leonid's SAS blog

DavidKaib · Posted 04-01-2020 03:52 PM

Thanks Leonid, this is perfect.

r_behata · Posted 04-01-2020 04:10 PM

data have;
input ID $ FirstName $ LastName $ Gender $ Degree $ Major  $10.;
cards;
1234567 Bill Johnson M BA Sociology 
2345678 Jen Baker W BA Sociology 
2345678 Jen Baker W MA Psychology 
4567890 Stan Wright M MA Psychology 
4567890 Stan Wright M MA Sociology 
;;;;
run;


data want(rename=(Degree_=Degree Major_=Major));
length Major_ $50. Degree_ $25.;
	do until(last.id);
		set have;
		by id;
		Degree_ = catx(';',Degree_,Degree); 
		Major_=catx(';',Major_,major); 
	end;
	drop major Degree;
run;

SAS Innovate 2025: Call for Content

Classroom Training Available!