I have a data set with IDs like: 100101, 100102, 201201. The first two numbers indicate location, the second two numbers indicate a family group, and the last two numbers indicate the individual within that location and family group. I need to break these up into variables for the family.
Right now I have this:
data new;
set original;
if 100100<id<100200 then family=1;
if 100200<id<100300 then family=2;
if 100300<id<100400 then family=3;
etc.
run;
I have 2,000 observations and hundreds of families so this method is going to take forever. I'm not very skilled at macros yet, is there a way I can code a macro to process this for me?
length location family individual $2 int $6;
int = put(id,z6.);
location = substr(int,1,2);
family = substr(int,3,2);
individual = substr(int,5,2);
drop int;
Macros not needed here. This can be done with data step statements.
However, you say there are "hundreds of families", what would say the 157th family look like in this ID variable? You also say the family number is two digits, so it can't be 157. Please explain this further.
Also, is the variable named ID numeric or character according to PROC CONTENTS?
The families reset with each new group, so each group has ~20 families, each labeled from 1-20.
ID variable is numeric.
@taylorcclarkson wrote:
The families reset with each new group, so each group has ~20 families, each labeled from 1-20.
ID variable is numeric.
I would say that extracting just the family from such a description would be incomplete. Unless you have already figured out and added a GROUP variable.
Note: you could likely keep all of the group/family together just using a custom format
data example; input id; datalines; 100101 100102 100103 100201 100202 100203 100204 110101 110102 110201 110202 110203 ; proc format; picture grpfam low-high ='9999' (mult=0.01) ; proc print data=example; format id grpfam.; run;
The above custom format only displays the first 4 digits of the id. (it shifts the decimal place of the ID variable with the mult and then doesn't display any decimal values by default).
The groups created by formats are honored by most procedures so you can do things like:
proc freq data=example; tables id; format id grpfam.; run;
to count family members.
Or to use the grp/family as a group in a regression or similar procedure.
Since formats allow you to work with the formatted values in most places they can be very efficient ways to make different tables, analyses or graphs without having to go back and keep adding variables to the data set.
And just changing the digit selectors in the format:
proc format; picture fam low-high ='99' (mult=0.01) ;
will display the 2 digits of the family.
length location family individual $2 int $6;
int = put(id,z6.);
location = substr(int,1,2);
family = substr(int,3,2);
individual = substr(int,5,2);
drop int;
You can also just use arithmetic to pull out the values.
data new;
set original;
individual = mod(id,100);
family = int(id/100);
group = int(family/100);
family = mod(family,100);
run;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Lock in the best rate now before the price increases on April 1.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.