Creating multiple observations from one

rg25 · Posted 08-14-2015 02:54 PM

The dataset I received contains data from several household members on a single line, for instance each member's age will be under age_1, age_2, etc. All members of the same family are linked together by a household ID. Is there any way to separate this single observation (containing all of the family members), into separate observations for each member?

Sonywell · Posted 08-14-2015 03:34 PM

You could potentially perform a transpose depending on your data, or you could use a data step with multiple output statements:

data output(keep=householdID, age);

set input;

age = age_1; output;

age = age_2; output;

age = age_3; output;

...

run;

The output statements forces an observation to be written to the output set each time it is encountered. Just one idea

ballardw · Posted 08-14-2015 04:16 PM

One would hope there are variables for the member id as well. They and any other specific individual data should be all split out at that time. Just extending the idea Sonywell posted.

data output(drop= age_1-age_3 personid_1 - personid_3;); /* drop the repeated variables will keep all of the other variables*/

set input;

age = age_1; personId = personid_1; output;

/* and to avoid creating blank records in the case where there are not more folks*/

if not missing (age_2) then do; age = age_2; personId = personid_2; output; end;

if not missing (age_3) then do;age = age_3; personId = personid_2; output; end;

...

run;

If you have a larger number of potential additional household members then the approach would be better using array processing. Suppose your data has up to 8 household members:

data output(drop= age_1-age_8 personid_1 - personid_8;); /* drop the repeated variables will keep all of the other variables*/

set input;

array a age_1 - age_8;

array id personid_1 - personid_8;

<continue to define additional arrays for additional person variables*/

do j=1 to dim(a); /* make sure that all of the arrays have the same number of elements for this to work*/

/* and to avoid creating blank records in the case where there are not more folks*/

if not missing (a) then do;

age = age;

personId = id;

output;

end;/* the IF*/

end; /* the do j= loop*/

;

run;

rg25 · Posted 08-20-2015 04:38 PM

unfortunately there is no specific person ID.

when I run this code:

data hh_split_2(drop= age_1-age_8 edulvl_1-edulvl_8 sex_1-sex_8 res_rdp_1-res_rdp_8 pregnnt_1-pregnnt_8);

set hh_split;

array a age_1 - age_8;

array s sex_1 - sex_8;

array r res_rdp_1-res_rdp_8;

array p pregnnt_1-pregnnt_8;

do j=1 to dim(a);

do j=1 to dim(s);

do j=1 to dim(r);

do j=1 to dim(p);

if not missing (a) then do;

age = a;

output; end;

if not missing (s) then do;

sex= s;

output; end;

if not missing (r) then do;

res_rdp= r; output; end;

if not missing (p) then do;

pregnnt=p; output; end;

run;

I get the following error: ERROR 117-185: There were 4 unclosed DO blocks.

Ksharp · Posted 08-15-2015 07:38 AM

Post your sample data. I think double transpose can do that.

rg25 · Posted 08-20-2015 04:33 PM

INT_NAME

HH_Area

HH_ADDR

REF_PT

HHID

HH_NAME_1

PERS_1

AGE_YRS_1

SEX_1

DISABLE_1

RES_RDP_1

PREGNNT_1

EDU_LVL_1

EMPLYMNT_1

SELECTED_1

HH_NAME_2

PERS_2

SEX_2

AGE_YRS_2

DISABLE_2

RES_RDP_2

PREGNNT_2

EDU_LVL_2

EMPLYMNT_2

SELECTED_2

HH_NAME_3

PERS_3

SEX_3

AGE_YRS_3

DISABLE_3

RES_RDP_3

PREGNNT_3

EDU_LVL_3

EMPLYMNT_3

SELECTED_3

HH_NAME_4

PERS_4

SEX_4

AGE_YRS_4

DISABLE_4

RES_RDP_4

PREGNNT_4

EDU_LVL_4

EMPLYMNT_4

SELECTED_4

HH_NAME_5

PERS_5

AGE_YRS_5

SEX_5

DISABLE_5

RES_RDP_5

PREGNNT_5

EDU_LVL_5

EMPLYMNT_5

SELECTED_5

HH_NAME_6

PERS_6

SEX_6

AGE_YRS_6

DISABLE_6

RES_RDP_6

PREGNNT_6

EDU_LVL_6

EMPLYMNT_6

SELECTED_6

HH_NAME_7

PERS_7

SEX_7

AGE_YRS_7

DISABLE_7

RES_RDP_7

PREGNNT_7

EDU_LVL_7

EMPLYMNT_7

SELECTED_7

HH_NAME_8

PERS_8

SEX_8

AGE_YRS_8

DISABLE_8

RES_RDP_8

PREGNNT_8

EDU_LVL_8

EMPLYMNT_8

SELECTED_8

HH_NAME_9

PERS_9

SEX_9

AGE_YRS_9

DISABLE_9

RES_RDP_9

PREGNNT_9

EDU_LVL_9

EMPLYMNT_9

SELECTED_9

HH_NAME_10

PERS_10

AGE_YRS_10

SEX_10

DISABLE_10

RES_RDP_10

PREGNNT_10

EDU_LVL_10

EMPLYMNT_10

SELECTED_10

HH_NAME_11

PERS_11

AGE_YRS_11

SEX_11

DISABLE_11

RES_RDP_11

PREGNNT_11

EDU_LVL_11

EMPLYMNT_11

SELECTED_11

HH_NAME_12

PERS_12

AGE_YRS_12

SEX_12

DISABLE_12

RES_RDP_12

PREGNNT_12

EDU_LVL_12

EMPLYMNT_12

SELECTED_12

HH_NAME_13

PERS_13

AGE_YRS_13

SEX_13

DISABLE_13

RES_RDP_13

PREGNNT_13

EDU_LVL_13

EMPLYMNT_13

SELECTED_13

HH_NAME_14

PERS_14

AGE_YRS_14

SEX_14

DISABLE_14

RES_RDP_14

PREGNNT_14

EDU_LVL_14

EMPLYMNT_14

SELECTED_14

This is how the variables are listed in the dataset, all the information before HH_NAME_1 is the same for all members of the household. The variables after HHID (household ID) are specific to each household member (depending on the ending 1, 2, 3...)

Ksharp · Posted 08-21-2015 11:00 AM

You didn't post some DATA yet .

Code: Program

data have;
input INT_NAME HH_Area HH_ADDR REF_PT HHID HH_NAME_1 SEX_1 HH_NAME_2 SEX_2;
n+1;
cards;
1 1 1 1 1 11 0 12 1
1 1 1 1 1 22 0 22 1
1 1 1 1 2 11 0 12 1
1 1 1 1 2 22 0 22 1
1 1 1 1 2 33 0 32 1
;
run;
proc transpose data=have out=temp;
by INT_NAME HH_Area HH_ADDR REF_PT HHID n;
var HH_NAME_: SEX_: ;
run;
data temp;
 set temp;
 id=input(scan(_name_,-1,'_'),best32.);
 name=prxchange('s/_\d+$//',1,strip(_name_));
run;
proc sort data=temp(drop=_name_) out=x ;
 by INT_NAME HH_Area HH_ADDR REF_PT HHID id;
run;
data x;
 set x;
 by INT_NAME HH_Area HH_ADDR REF_PT HHID n notsorted;
 group+first.n;
run;

proc transpose data=x out=want;
 by INT_NAME HH_Area HH_ADDR REF_PT HHID group;
 var col1;
 id name;
run;