I have a data set that is formatted as a series of variable names paired with their values all in one row, and I would like to convert this into a SAS data set. Part of the complexity is that the rows contain different variables. A sampling of the raw data is below.
UNITID=459107,SURVSECT=EF1,PART=B,LINE=8,SLEVEL=1,SEX=2,COUNT=14
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=1,COUNT=3
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=2,COUNT=6
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=1,COUNT=35
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=2,COUNT=15
UNITID=459107,SURVSECT=EF1,PART=D,COUNT=1
UNITID=459107,SURVSECT=EF1,PART=E,FT_PY_COHORT=0,FT_EXCLUSIONS=0,FT_CY_COHORT=0,PT_PY_COHORT=0,PT_EXCLUSIONS=0,PT_CY_COHORT=0
UNITID=459107,SURVSECT=EF1,PART=F,ST_STAFF_RATIO=10
UNITID=459107,SURVSECT=EF1,PART=G,LINE=1,ENROLL_EXCLUSIVE=19,ENROLL_SOME=80,NOTENROLL=238,INUS_PPS=19,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0
UNITID=459107,SURVSECT=EF1,PART=G,LINE=2,ENROLL_EXCLUSIVE=0,ENROLL_SOME=0,NOTENROLL=3,INUS_PPS=0,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0
UNITID=475699,SURVSECT=EF1,PART=A,CIPCODE=52.0000,LINE=16,RACE=5,SEX=1,COUNT=1
UNITID=475699,SURVSECT=EF1,PART=A,CIPCODE=52.0000,LINE=17,RACE=5,SEX=1,COUNT=3
The data sets (and the variables in each set) are determined by the PART variable. For example, the first line above should be in a data set that looks like:
UNITID SURVSECT PART LINE SLEVEL SEX COUNT
459107 EF1 B 8 1 2 14
And the last line above should be in a data set that looks like:
UNITID SURVSECT PART CIPCODE LINE REG COUNT
475699 EF1 A 52.0000 17 51 3
Currently, I have the entire raw data brought in using an infile statement:
data imp_file4;
infile ef4 lrecl = 256 truncover;
input @1 kvp $256.;
put kvp;
run;
And I make each of my needed data sets through a bit of brute force using the scan function.
data i.imp_a1;
set imp_file4 imp_file2 imp_file2s;
where substr(kvp,33,1) = 'A';
unitid = input(scan(kvp,2,',='),6.);
survsect = put(scan(kvp,4,',='),$3.);
part = put(scan(kvp,6,',='),$1.);
cip = put(scan(kvp,8,',='),$7.);
line = input(scan(kvp,10,',='),2.);
reg = put(scan(kvp,12,',='),$1.)||put(scan(kvp,14,',='),$1.);
count = input(scan(kvp,16,',='),6.);
drop kvp;
run;
The thing is, I figure there has to be an easier way to do this using a do loop that can create all the variable names for me as well, and ideally all seven (in this case) data sets.
Have a looked into named input as this is for reading data in the structure you have.
I haven't added all the variables but here some sample code:
data;
infile datalines ;
length unitid 8. SURVSECT $20.;
input @;
_infile_=translate(_infile_,' ',',');
input
UNITID= SURVSECT=$ PART=$ LINE= SLEVEL= SEX= COUNT=
FT_PY_COHORT= FT_EXCLUSIONS= FT_CY_COHORT= PT_PY_COHORT= PT_EXCLUSIONS= PT_CY_COHORT=
;
datalines;
UNITID=459107,SURVSECT=EF1,PART=B,LINE=8,SLEVEL=1,SEX=2,COUNT=14
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=1,COUNT=3
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=2,COUNT=6
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=1,COUNT=35
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=2,COUNT=15
UNITID=459107,SURVSECT=EF1,PART=D,COUNT=1
UNITID=459107,SURVSECT=EF1,PART=E,FT_PY_COHORT=0,FT_EXCLUSIONS=0,FT_CY_COHORT=0,PT_PY_COHORT=0,PT_EXCLUSIONS=0,PT_CY_COHORT=0
UNITID=459107,SURVSECT=EF1,PART=F,ST_STAFF_RATIO=10
UNITID=459107,SURVSECT=EF1,PART=G,LINE=1,ENROLL_EXCLUSIVE=19,ENROLL_SOME=80,NOTENROLL=238,INUS_PPS=19,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0
UNITID=459107,SURVSECT=EF1,PART=G,LINE=2,ENROLL_EXCLUSIVE=0,ENROLL_SOME=0,NOTENROLL=3,INUS_PPS=0,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0
UNITID=475699,SURVSECT=EF1,PART=A,CIPCODE=52.0000,LINE=16,RACE=5,SEX=1,COUNT=1
UNITID=475699,SURVSECT=EF1,PART=A,CIPCODE=52.0000,LINE=17,RACE=5,SEX=1,COUNT=3
;
run;
Almost the same as @Patrick but this one retrieves the PT: variables.
data WANT;
infile datalines dlm=',';
length unitid 8. SURVSECT $20.;
input
UNITID= SURVSECT=$ PART=$ LINE= SLEVEL= SEX= COUNT=
FT_PY_COHORT= FT_EXCLUSIONS= FT_CY_COHORT= PT_PY_COHORT= PT_EXCLUSIONS= PT_CY_COHORT=
;
datalines;
UNITID=459107,SURVSECT=EF1,PART=B,LINE=8,SLEVEL=1,SEX=2,COUNT=14
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=1,COUNT=3
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=2,COUNT=6
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=1,COUNT=35
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=2,COUNT=15
UNITID=459107,SURVSECT=EF1,PART=D,COUNT=1
UNITID=459107,SURVSECT=EF1,PART=E,FT_PY_COHORT=0,FT_EXCLUSIONS=0,FT_CY_COHORT=0,PT_PY_COHORT=0,PT_EXCLUSIONS=0,PT_CY_COHORT=0
UNITID=459107,SURVSECT=EF1,PART=F,ST_STAFF_RATIO=10
UNITID=459107,SURVSECT=EF1,PART=G,LINE=1,ENROLL_EXCLUSIVE=19,ENROLL_SOME=80,NOTENROLL=238,INUS_PPS=19,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0
UNITID=459107,SURVSECT=EF1,PART=G,LINE=2,ENROLL_EXCLUSIVE=0,ENROLL_SOME=0,NOTENROLL=3,INUS_PPS=0,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0
UNITID=475699,SURVSECT=EF1,PART=A,CIPCODE=52.0000,LINE=16,RACE=5,SEX=1,COUNT=1
UNITID=475699,SURVSECT=EF1,PART=A,CIPCODE=52.0000,LINE=17,RACE=5,SEX=1,COUNT=3
run;
If not replacing the commas with blanks I'm not reading the data correctly. Is this different in your environment?
@Patrick Oh no you are right!
I was focusing on the PT fields which are not read.
I have no clue why they are not.
@ChrisNZ wrote:
@Patrick Oh no you are right!
I was focusing on the PT fields which are not read.
I have no clue why they are not.
It looks like the input buffer gets truncated in the code I've posted. No idea why. Assuming that in real life @MFLoGrasso will be reading the data from an external file (not inline data reading using datalines/card) below example demonstrates that this should work.
filename mydata temp;
data _null_;
file mydata;
infile datalines;
input;
put _infile_;
datalines;
UNITID=459107,SURVSECT=EF1,PART=B,LINE=8,SLEVEL=1,SEX=2,COUNT=14
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=1,COUNT=3
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=2,COUNT=6
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=1,COUNT=35
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=2,COUNT=15
UNITID=459107,SURVSECT=EF1,PART=D,COUNT=1
UNITID=459107,SURVSECT=EF1,PART=E,FT_PY_COHORT=0,FT_EXCLUSIONS=0,FT_CY_COHORT=0,PT_PY_COHORT=0,PT_EXCLUSIONS=0,PT_CY_COHORT=0
UNITID=459107,SURVSECT=EF1,PART=F,ST_STAFF_RATIO=10
UNITID=459107,SURVSECT=EF1,PART=G,LINE=1,ENROLL_EXCLUSIVE=19,ENROLL_SOME=80,NOTENROLL=238,INUS_PPS=19,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0
UNITID=459107,SURVSECT=EF1,PART=G,LINE=2,ENROLL_EXCLUSIVE=0,ENROLL_SOME=0,NOTENROLL=3,INUS_PPS=0,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0
UNITID=475699,SURVSECT=EF1,PART=A,CIPCODE=52.0000,LINE=16,RACE=5,SEX=1,COUNT=1
UNITID=475699,SURVSECT=EF1,PART=A,IPCODE=52.0000,LINE=17,RACE=5,SEX=1,COUNT=3
;
run;
data want;
infile mydata;
length unitid 8. SURVSECT $20.;
input @;
_infile_=translate(_infile_,' ',',');
input
UNITID= SURVSECT=$ PART=$ LINE= SLEVEL= SEX= COUNT=
FT_PY_COHORT= FT_EXCLUSIONS= FT_CY_COHORT= PT_PY_COHORT= PT_EXCLUSIONS= PT_CY_COHORT=
;
run;
As for creating several data sets, you have to describe them.
Something like this works:
data I.IMP_A1 (keep=UNITID SURVSECT PART LINE SLEVEL SEX COUNT)
I.IMP_A2 (keep=UNITID SURVSECT PART CIPCOD LINE );
...
if PART= 'A' then output I.IMP_A1;
if PART= 'B' then output I.IMP_A2;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.