The dataset I received contains data from several household members on a single line, for instance each member's age will be under age_1, age_2, etc. All members of the same family are linked together by a household ID. Is there any way to separate this single observation (containing all of the family members), into separate observations for each member?
You could potentially perform a transpose depending on your data, or you could use a data step with multiple output statements:
data output(keep=householdID, age);
set input;
age = age_1; output;
age = age_2; output;
age = age_3; output;
...
run;
The output statements forces an observation to be written to the output set each time it is encountered. Just one idea
One would hope there are variables for the member id as well. They and any other specific individual data should be all split out at that time. Just extending the idea Sonywell posted.
data output(drop= age_1-age_3 personid_1 - personid_3;); /* drop the repeated variables will keep all of the other variables*/
set input;
age = age_1; personId = personid_1; output;
/* and to avoid creating blank records in the case where there are not more folks*/
if not missing (age_2) then do; age = age_2; personId = personid_2; output; end;
if not missing (age_3) then do;age = age_3; personId = personid_2; output; end;
...
run;
If you have a larger number of potential additional household members then the approach would be better using array processing. Suppose your data has up to 8 household members:
data output(drop= age_1-age_8 personid_1 - personid_8;); /* drop the repeated variables will keep all of the other variables*/
set input;
array a age_1 - age_8;
array id personid_1 - personid_8;
<continue to define additional arrays for additional person variables*/
do j=1 to dim(a); /* make sure that all of the arrays have the same number of elements for this to work*/
/* and to avoid creating blank records in the case where there are not more folks*/
if not missing (a
age = age
personId = id
<other array processes go here>
output;
end;/* the IF*/
end; /* the do j= loop*/
;
run;
unfortunately there is no specific person ID.
when I run this code:
data hh_split_2(drop= age_1-age_8 edulvl_1-edulvl_8 sex_1-sex_8 res_rdp_1-res_rdp_8 pregnnt_1-pregnnt_8);
set hh_split;
array a age_1 - age_8;
array s sex_1 - sex_8;
array r res_rdp_1-res_rdp_8;
array p pregnnt_1-pregnnt_8;
do j=1 to dim(a);
do j=1 to dim(s);
do j=1 to dim(r);
do j=1 to dim(p);
if not missing (a
age = a
output; end;
if not missing (s
sex= s
output; end;
if not missing (r
res_rdp= r
if not missing (p
pregnnt=p
run;
I get the following error: ERROR 117-185: There were 4 unclosed DO blocks.
Post your sample data. I think double transpose can do that.
INT_NAME | HH_Area | HH_ADDR | REF_PT | HHID | HH_NAME_1 | PERS_1 | AGE_YRS_1 | SEX_1 | DISABLE_1 | RES_RDP_1 | PREGNNT_1 | EDU_LVL_1 | EMPLYMNT_1 | SELECTED_1 | HH_NAME_2 | PERS_2 | SEX_2 | AGE_YRS_2 | DISABLE_2 | RES_RDP_2 | PREGNNT_2 | EDU_LVL_2 | EMPLYMNT_2 | SELECTED_2 | HH_NAME_3 | PERS_3 | SEX_3 | AGE_YRS_3 | DISABLE_3 | RES_RDP_3 | PREGNNT_3 | EDU_LVL_3 | EMPLYMNT_3 | SELECTED_3 | HH_NAME_4 | PERS_4 | SEX_4 | AGE_YRS_4 | DISABLE_4 | RES_RDP_4 | PREGNNT_4 | EDU_LVL_4 | EMPLYMNT_4 | SELECTED_4 | HH_NAME_5 | PERS_5 | AGE_YRS_5 | SEX_5 | DISABLE_5 | RES_RDP_5 | PREGNNT_5 | EDU_LVL_5 | EMPLYMNT_5 | SELECTED_5 | HH_NAME_6 | PERS_6 | SEX_6 | AGE_YRS_6 | DISABLE_6 | RES_RDP_6 | PREGNNT_6 | EDU_LVL_6 | EMPLYMNT_6 | SELECTED_6 | HH_NAME_7 | PERS_7 | SEX_7 | AGE_YRS_7 | DISABLE_7 | RES_RDP_7 | PREGNNT_7 | EDU_LVL_7 | EMPLYMNT_7 | SELECTED_7 | HH_NAME_8 | PERS_8 | SEX_8 | AGE_YRS_8 | DISABLE_8 | RES_RDP_8 | PREGNNT_8 | EDU_LVL_8 | EMPLYMNT_8 | SELECTED_8 | HH_NAME_9 | PERS_9 | SEX_9 | AGE_YRS_9 | DISABLE_9 | RES_RDP_9 | PREGNNT_9 | EDU_LVL_9 | EMPLYMNT_9 | SELECTED_9 | HH_NAME_10 | PERS_10 | AGE_YRS_10 | SEX_10 | DISABLE_10 | RES_RDP_10 | PREGNNT_10 | EDU_LVL_10 | EMPLYMNT_10 | SELECTED_10 | HH_NAME_11 | PERS_11 | AGE_YRS_11 | SEX_11 | DISABLE_11 | RES_RDP_11 | PREGNNT_11 | EDU_LVL_11 | EMPLYMNT_11 | SELECTED_11 | HH_NAME_12 | PERS_12 | AGE_YRS_12 | SEX_12 | DISABLE_12 | RES_RDP_12 | PREGNNT_12 | EDU_LVL_12 | EMPLYMNT_12 | SELECTED_12 | HH_NAME_13 | PERS_13 | AGE_YRS_13 | SEX_13 | DISABLE_13 | RES_RDP_13 | PREGNNT_13 | EDU_LVL_13 | EMPLYMNT_13 | SELECTED_13 | HH_NAME_14 | PERS_14 | AGE_YRS_14 | SEX_14 | DISABLE_14 | RES_RDP_14 | PREGNNT_14 | EDU_LVL_14 | EMPLYMNT_14 | SELECTED_14 |
This is how the variables are listed in the dataset, all the information before HH_NAME_1 is the same for all members of the household. The variables after HHID (household ID) are specific to each household member (depending on the ending 1, 2, 3...)
You didn't post some DATA yet .
data have;
input INT_NAME HH_Area HH_ADDR REF_PT HHID HH_NAME_1 SEX_1 HH_NAME_2 SEX_2;
n+1;
cards;
1 1 1 1 1 11 0 12 1
1 1 1 1 1 22 0 22 1
1 1 1 1 2 11 0 12 1
1 1 1 1 2 22 0 22 1
1 1 1 1 2 33 0 32 1
;
run;
proc transpose data=have out=temp;
by INT_NAME HH_Area HH_ADDR REF_PT HHID n;
var HH_NAME_: SEX_: ;
run;
data temp;
set temp;
id=input(scan(_name_,-1,'_'),best32.);
name=prxchange('s/_\d+$//',1,strip(_name_));
run;
proc sort data=temp(drop=_name_) out=x ;
by INT_NAME HH_Area HH_ADDR REF_PT HHID id;
run;
data x;
set x;
by INT_NAME HH_Area HH_ADDR REF_PT HHID n notsorted;
group+first.n;
run;
proc transpose data=x out=want;
by INT_NAME HH_Area HH_ADDR REF_PT HHID group;
var col1;
id name;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.