I have seen various things that are close to this, but not exactly was I am trying to do. I have a data set where there are anywhere from 1 to 3 records per person. There are some fields that are identical in multiple records (firstname, lastname, gender and id) and others that are not (degree and major). I know which fields are invariant and which ones are different. I am hoping to find code to collapse the records so there is only one per person, where the fields that have multiple values can be put into a string (BA;MA) and the ones that do not remain the same. Is there a way to do this? I'm using SAS base 9.4.
So this:
ID | FirstName | LastName | Gender | Degree | Major |
1234567 | Bill | Johnson | M | BA | Sociology |
2345678 | Jen | Baker | W | BA | Sociology |
2345678 | Jen | Baker | W | MA | Psychology |
4567890 | Stan | Wright | M | MA | Psychology |
4567890 | Stan | Wright | M | MA | Sociology |
Would become this:
ID | FirstName | LastName | Gender | Degree | Major |
1234567 | Bill | Johnson | M | BA | Sociology |
2345678 | Jen | Baker | W | BA;MA | Sociology;Psychology |
4567890 | Stan | Wright | M | MA;MA | Psychology;Sociology |
Hi DavidKaib,
I believe you can do it using the following code:
proc sort data=A out=A1 (rename=(Degree=D Major=M));
by ID;
run;
data B (drop=D M);
set A1;
by ID;
length Degree $20 Major $90;
retain Degree Major;
if first.ID then do;
Degree = D;
Major = M;
end;
else do;
Degree = catx(';',Degree,D);
Major = catx(';',Major, M);
end;
if last.ID then output;
run;
Here A is your first data set, B is your destination data set. At least, this gives you an idea.
Hope this helps.
Do you have a known maximum number of records that will need to collapse? This is needed so that you can create variables long enough to hold the results.
And another question, what exactly will you be able to do with the resulting file that can't be done with the existing one?
I would like to ask a similar question to that of @ballardw
What could you possibly do with text fields "BA;MA" or "Sociology;Psychology"? how would that help whatever comes next.
Maybe you want a column for each student named BA that is either 1 or 0, and another column for each student named MA which is 1 or 0; and so on. That could be a better form for subsequent analysis or reporting.
The variables won't end up being too long. I am losing a legacy way of generating the data set that did this for me, and that I used within SAS, and this hasn't been a problem. The new source cannot collaspe the records, so I am basically trying to replicate what I have previously had.
The resulting file is used as a contact list for a survey, and then later for analysis of the survey. In both cases I need an unduplicated count with all the information.
https://gist.github.com/statgeek/d583cfa992bf56da51d435165b07e96a
This provides two examples. If you're combining more than one field, using the data step approach maybe easier/less steps.
Length was never the issue, in my mind. Creating a data set that would be easy to use and easy to program in subsequent analysis is the issue, and the data set I described with BA 0 or 1 and MA 0 or 1 and so on seems like an easier data set to work with.
Hi DavidKaib,
I believe you can do it using the following code:
proc sort data=A out=A1 (rename=(Degree=D Major=M));
by ID;
run;
data B (drop=D M);
set A1;
by ID;
length Degree $20 Major $90;
retain Degree Major;
if first.ID then do;
Degree = D;
Major = M;
end;
else do;
Degree = catx(';',Degree,D);
Major = catx(';',Major, M);
end;
if last.ID then output;
run;
Here A is your first data set, B is your destination data set. At least, this gives you an idea.
Hope this helps.
Thanks Leonid, this is perfect.
data have;
input ID $ FirstName $ LastName $ Gender $ Degree $ Major $10.;
cards;
1234567 Bill Johnson M BA Sociology
2345678 Jen Baker W BA Sociology
2345678 Jen Baker W MA Psychology
4567890 Stan Wright M MA Psychology
4567890 Stan Wright M MA Sociology
;;;;
run;
data want(rename=(Degree_=Degree Major_=Major));
length Major_ $50. Degree_ $25.;
do until(last.id);
set have;
by id;
Degree_ = catx(';',Degree_,Degree);
Major_=catx(';',Major_,major);
end;
drop major Degree;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.