I have a data set that is formatted as a series of variable names paired with their values all in one row, and I would like to convert this into a SAS data set. Part of the complexity is that the rows contain different variables. A sampling of the raw data is below.
UNITID=459107,SURVSECT=EF1,PART=B,LINE=8,SLEVEL=1,SEX=2,COUNT=14
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=1,COUNT=3
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=2,COUNT=6
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=1,COUNT=35
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=2,COUNT=15
UNITID=459107,SURVSECT=EF1,PART=D,COUNT=1
UNITID=459107,SURVSECT=EF1,PART=E,FT_PY_COHORT=0,FT_EXCLUSIONS=0,FT_CY_COHORT=0,PT_PY_COHORT=0,PT_EXCLUSIONS=0,PT_CY_COHORT=0
UNITID=459107,SURVSECT=EF1,PART=F,ST_STAFF_RATIO=10
UNITID=459107,SURVSECT=EF1,PART=G,LINE=1,ENROLL_EXCLUSIVE=19,ENROLL_SOME=80,NOTENROLL=238,INUS_PPS=19,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0
UNITID=459107,SURVSECT=EF1,PART=G,LINE=2,ENROLL_EXCLUSIVE=0,ENROLL_SOME=0,NOTENROLL=3,INUS_PPS=0,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0
UNITID=475699,SURVSECT=EF1,PART=A,CIPCODE=52.0000,LINE=16,RACE=5,SEX=1,COUNT=1
UNITID=475699,SURVSECT=EF1,PART=A,CIPCODE=52.0000,LINE=17,RACE=5,SEX=1,COUNT=3
The data sets (and the variables in each set) are determined by the PART variable. For example, the first line above should be in a data set that looks like:
UNITID SURVSECT PART LINE SLEVEL SEX COUNT
459107 EF1 B 8 1 2 14
And the last line above should be in a data set that looks like:
UNITID SURVSECT PART CIPCODE LINE REG COUNT
475699 EF1 A 52.0000 17 51 3
Currently, I have the entire raw data brought in using an infile statement:
data imp_file4;
infile ef4 lrecl = 256 truncover;
input @1 kvp $256.;
put kvp;
run;
And I make each of my needed data sets through a bit of brute force using the scan function.
data i.imp_a1;
set imp_file4 imp_file2 imp_file2s;
where substr(kvp,33,1) = 'A';
unitid = input(scan(kvp,2,',='),6.);
survsect = put(scan(kvp,4,',='),$3.);
part = put(scan(kvp,6,',='),$1.);
cip = put(scan(kvp,8,',='),$7.);
line = input(scan(kvp,10,',='),2.);
reg = put(scan(kvp,12,',='),$1.)||put(scan(kvp,14,',='),$1.);
count = input(scan(kvp,16,',='),6.);
drop kvp;
run;
The thing is, I figure there has to be an easier way to do this using a do loop that can create all the variable names for me as well, and ideally all seven (in this case) data sets.
Have a looked into named input as this is for reading data in the structure you have.
I haven't added all the variables but here some sample code:
data;
infile datalines ;
length unitid 8. SURVSECT $20.;
input @;
_infile_=translate(_infile_,' ',',');
input
UNITID= SURVSECT=$ PART=$ LINE= SLEVEL= SEX= COUNT=
FT_PY_COHORT= FT_EXCLUSIONS= FT_CY_COHORT= PT_PY_COHORT= PT_EXCLUSIONS= PT_CY_COHORT=
;
datalines;
UNITID=459107,SURVSECT=EF1,PART=B,LINE=8,SLEVEL=1,SEX=2,COUNT=14
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=1,COUNT=3
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=2,COUNT=6
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=1,COUNT=35
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=2,COUNT=15
UNITID=459107,SURVSECT=EF1,PART=D,COUNT=1
UNITID=459107,SURVSECT=EF1,PART=E,FT_PY_COHORT=0,FT_EXCLUSIONS=0,FT_CY_COHORT=0,PT_PY_COHORT=0,PT_EXCLUSIONS=0,PT_CY_COHORT=0
UNITID=459107,SURVSECT=EF1,PART=F,ST_STAFF_RATIO=10
UNITID=459107,SURVSECT=EF1,PART=G,LINE=1,ENROLL_EXCLUSIVE=19,ENROLL_SOME=80,NOTENROLL=238,INUS_PPS=19,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0
UNITID=459107,SURVSECT=EF1,PART=G,LINE=2,ENROLL_EXCLUSIVE=0,ENROLL_SOME=0,NOTENROLL=3,INUS_PPS=0,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0
UNITID=475699,SURVSECT=EF1,PART=A,CIPCODE=52.0000,LINE=16,RACE=5,SEX=1,COUNT=1
UNITID=475699,SURVSECT=EF1,PART=A,CIPCODE=52.0000,LINE=17,RACE=5,SEX=1,COUNT=3
;
run;
Almost the same as @Patrick but this one retrieves the PT: variables.
data WANT;
infile datalines dlm=',';
length unitid 8. SURVSECT $20.;
input
UNITID= SURVSECT=$ PART=$ LINE= SLEVEL= SEX= COUNT=
FT_PY_COHORT= FT_EXCLUSIONS= FT_CY_COHORT= PT_PY_COHORT= PT_EXCLUSIONS= PT_CY_COHORT=
;
datalines;
UNITID=459107,SURVSECT=EF1,PART=B,LINE=8,SLEVEL=1,SEX=2,COUNT=14
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=1,COUNT=3
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=2,COUNT=6
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=1,COUNT=35
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=2,COUNT=15
UNITID=459107,SURVSECT=EF1,PART=D,COUNT=1
UNITID=459107,SURVSECT=EF1,PART=E,FT_PY_COHORT=0,FT_EXCLUSIONS=0,FT_CY_COHORT=0,PT_PY_COHORT=0,PT_EXCLUSIONS=0,PT_CY_COHORT=0
UNITID=459107,SURVSECT=EF1,PART=F,ST_STAFF_RATIO=10
UNITID=459107,SURVSECT=EF1,PART=G,LINE=1,ENROLL_EXCLUSIVE=19,ENROLL_SOME=80,NOTENROLL=238,INUS_PPS=19,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0
UNITID=459107,SURVSECT=EF1,PART=G,LINE=2,ENROLL_EXCLUSIVE=0,ENROLL_SOME=0,NOTENROLL=3,INUS_PPS=0,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0
UNITID=475699,SURVSECT=EF1,PART=A,CIPCODE=52.0000,LINE=16,RACE=5,SEX=1,COUNT=1
UNITID=475699,SURVSECT=EF1,PART=A,CIPCODE=52.0000,LINE=17,RACE=5,SEX=1,COUNT=3
run;
If not replacing the commas with blanks I'm not reading the data correctly. Is this different in your environment?
@Patrick Oh no you are right!
I was focusing on the PT fields which are not read.
I have no clue why they are not.
@ChrisNZ wrote:
@Patrick Oh no you are right!
I was focusing on the PT fields which are not read.
I have no clue why they are not.
It looks like the input buffer gets truncated in the code I've posted. No idea why. Assuming that in real life @MFLoGrasso will be reading the data from an external file (not inline data reading using datalines/card) below example demonstrates that this should work.
filename mydata temp;
data _null_;
file mydata;
infile datalines;
input;
put _infile_;
datalines;
UNITID=459107,SURVSECT=EF1,PART=B,LINE=8,SLEVEL=1,SEX=2,COUNT=14
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=1,COUNT=3
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=2,COUNT=6
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=1,COUNT=35
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=2,COUNT=15
UNITID=459107,SURVSECT=EF1,PART=D,COUNT=1
UNITID=459107,SURVSECT=EF1,PART=E,FT_PY_COHORT=0,FT_EXCLUSIONS=0,FT_CY_COHORT=0,PT_PY_COHORT=0,PT_EXCLUSIONS=0,PT_CY_COHORT=0
UNITID=459107,SURVSECT=EF1,PART=F,ST_STAFF_RATIO=10
UNITID=459107,SURVSECT=EF1,PART=G,LINE=1,ENROLL_EXCLUSIVE=19,ENROLL_SOME=80,NOTENROLL=238,INUS_PPS=19,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0
UNITID=459107,SURVSECT=EF1,PART=G,LINE=2,ENROLL_EXCLUSIVE=0,ENROLL_SOME=0,NOTENROLL=3,INUS_PPS=0,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0
UNITID=475699,SURVSECT=EF1,PART=A,CIPCODE=52.0000,LINE=16,RACE=5,SEX=1,COUNT=1
UNITID=475699,SURVSECT=EF1,PART=A,IPCODE=52.0000,LINE=17,RACE=5,SEX=1,COUNT=3
;
run;
data want;
infile mydata;
length unitid 8. SURVSECT $20.;
input @;
_infile_=translate(_infile_,' ',',');
input
UNITID= SURVSECT=$ PART=$ LINE= SLEVEL= SEX= COUNT=
FT_PY_COHORT= FT_EXCLUSIONS= FT_CY_COHORT= PT_PY_COHORT= PT_EXCLUSIONS= PT_CY_COHORT=
;
run;
As for creating several data sets, you have to describe them.
Something like this works:
data I.IMP_A1 (keep=UNITID SURVSECT PART LINE SLEVEL SEX COUNT)
I.IMP_A2 (keep=UNITID SURVSECT PART CIPCOD LINE );
...
if PART= 'A' then output I.IMP_A1;
if PART= 'B' then output I.IMP_A2;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.