Re: Extract part of a string having positional delimiter

MFLoGrasso · Posted 04-17-2018 05:52 PM

I have a data set that is formatted as a series of variable names paired with their values all in one row, and I would like to convert this into a SAS data set. Part of the complexity is that the rows contain different variables. A sampling of the raw data is below.

UNITID=459107,SURVSECT=EF1,PART=B,LINE=8,SLEVEL=1,SEX=2,COUNT=14 
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=1,COUNT=3 
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=2,COUNT=6 
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=1,COUNT=35 
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=2,COUNT=15 
UNITID=459107,SURVSECT=EF1,PART=D,COUNT=1 
UNITID=459107,SURVSECT=EF1,PART=E,FT_PY_COHORT=0,FT_EXCLUSIONS=0,FT_CY_COHORT=0,PT_PY_COHORT=0,PT_EXCLUSIONS=0,PT_CY_COHORT=0 
UNITID=459107,SURVSECT=EF1,PART=F,ST_STAFF_RATIO=10 
UNITID=459107,SURVSECT=EF1,PART=G,LINE=1,ENROLL_EXCLUSIVE=19,ENROLL_SOME=80,NOTENROLL=238,INUS_PPS=19,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0 
UNITID=459107,SURVSECT=EF1,PART=G,LINE=2,ENROLL_EXCLUSIVE=0,ENROLL_SOME=0,NOTENROLL=3,INUS_PPS=0,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0 
UNITID=475699,SURVSECT=EF1,PART=A,CIPCODE=52.0000,LINE=16,RACE=5,SEX=1,COUNT=1 
UNITID=475699,SURVSECT=EF1,PART=A,CIPCODE=52.0000,LINE=17,RACE=5,SEX=1,COUNT=3

The data sets (and the variables in each set) are determined by the PART variable. For example, the first line above should be in a data set that looks like:

UNITID   SURVSECT   PART   LINE   SLEVEL   SEX   COUNT
459107   EF1        B         8        1     2      14

And the last line above should be in a data set that looks like:

UNITID   SURVSECT   PART   CIPCODE   LINE   REG   COUNT
475699   EF1        A      52.0000     17   51        3

Currently, I have the entire raw data brought in using an infile statement:

data imp_file4;
	infile ef4 lrecl = 256 truncover;
	input @1 kvp $256.;
	put kvp;
	run;

And I make each of my needed data sets through a bit of brute force using the scan function.

data i.imp_a1;
	set imp_file4 imp_file2 imp_file2s;
		where substr(kvp,33,1) = 'A';
	unitid = input(scan(kvp,2,',='),6.);
	survsect = put(scan(kvp,4,',='),$3.);
	part = put(scan(kvp,6,',='),$1.);
	cip = put(scan(kvp,8,',='),$7.);
	line = input(scan(kvp,10,',='),2.);
	reg = put(scan(kvp,12,',='),$1.)||put(scan(kvp,14,',='),$1.);
	count = input(scan(kvp,16,',='),6.);
	drop kvp;
	run;

The thing is, I figure there has to be an easier way to do this using a do loop that can create all the variable names for me as well, and ideally all seven (in this case) data sets.

Patrick · Posted 04-17-2018 06:26 PM

@MFLoGrasso

Have a looked into named input as this is for reading data in the structure you have.

http://go.documentation.sas.com/?docsetId=lestmtsref&docsetTarget=p1jjrvmzmybdeqn1gt8gj1r0ed0u.htm&d...

I haven't added all the variables but here some sample code:

data;
  infile datalines ;
  length unitid 8. SURVSECT $20.;
  input @;
  _infile_=translate(_infile_,' ',',');
  input 
    UNITID= SURVSECT=$ PART=$ LINE= SLEVEL= SEX= COUNT=
    FT_PY_COHORT= FT_EXCLUSIONS= FT_CY_COHORT= PT_PY_COHORT= PT_EXCLUSIONS= PT_CY_COHORT=
    ;
  datalines;
UNITID=459107,SURVSECT=EF1,PART=B,LINE=8,SLEVEL=1,SEX=2,COUNT=14 
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=1,COUNT=3 
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=2,COUNT=6 
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=1,COUNT=35 
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=2,COUNT=15 
UNITID=459107,SURVSECT=EF1,PART=D,COUNT=1 
UNITID=459107,SURVSECT=EF1,PART=E,FT_PY_COHORT=0,FT_EXCLUSIONS=0,FT_CY_COHORT=0,PT_PY_COHORT=0,PT_EXCLUSIONS=0,PT_CY_COHORT=0 
UNITID=459107,SURVSECT=EF1,PART=F,ST_STAFF_RATIO=10 
UNITID=459107,SURVSECT=EF1,PART=G,LINE=1,ENROLL_EXCLUSIVE=19,ENROLL_SOME=80,NOTENROLL=238,INUS_PPS=19,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0 
UNITID=459107,SURVSECT=EF1,PART=G,LINE=2,ENROLL_EXCLUSIVE=0,ENROLL_SOME=0,NOTENROLL=3,INUS_PPS=0,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0 
UNITID=475699,SURVSECT=EF1,PART=A,CIPCODE=52.0000,LINE=16,RACE=5,SEX=1,COUNT=1 
UNITID=475699,SURVSECT=EF1,PART=A,CIPCODE=52.0000,LINE=17,RACE=5,SEX=1,COUNT=3
;
run;

ChrisNZ · Posted 04-17-2018 06:55 PM

Almost the same as @Patrick but this one retrieves the PT: variables.


data WANT;
  infile datalines dlm=',';
  length unitid 8. SURVSECT $20.;
  input 
    UNITID= SURVSECT=$ PART=$ LINE= SLEVEL= SEX= COUNT=
    FT_PY_COHORT= FT_EXCLUSIONS= FT_CY_COHORT= PT_PY_COHORT= PT_EXCLUSIONS= PT_CY_COHORT=
    ;
  datalines;
UNITID=459107,SURVSECT=EF1,PART=B,LINE=8,SLEVEL=1,SEX=2,COUNT=14 
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=1,COUNT=3 
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=2,COUNT=6 
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=1,COUNT=35 
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=2,COUNT=15 
UNITID=459107,SURVSECT=EF1,PART=D,COUNT=1 
UNITID=459107,SURVSECT=EF1,PART=E,FT_PY_COHORT=0,FT_EXCLUSIONS=0,FT_CY_COHORT=0,PT_PY_COHORT=0,PT_EXCLUSIONS=0,PT_CY_COHORT=0 
UNITID=459107,SURVSECT=EF1,PART=F,ST_STAFF_RATIO=10 
UNITID=459107,SURVSECT=EF1,PART=G,LINE=1,ENROLL_EXCLUSIVE=19,ENROLL_SOME=80,NOTENROLL=238,INUS_PPS=19,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0 
UNITID=459107,SURVSECT=EF1,PART=G,LINE=2,ENROLL_EXCLUSIVE=0,ENROLL_SOME=0,NOTENROLL=3,INUS_PPS=0,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0 
UNITID=475699,SURVSECT=EF1,PART=A,CIPCODE=52.0000,LINE=16,RACE=5,SEX=1,COUNT=1 
UNITID=475699,SURVSECT=EF1,PART=A,CIPCODE=52.0000,LINE=17,RACE=5,SEX=1,COUNT=3
run;

High-Performance SAS Coding - Third Edition

Patrick · Posted 04-17-2018 07:02 PM

@ChrisNZ

If not replacing the commas with blanks I'm not reading the data correctly. Is this different in your environment?

ChrisNZ · Posted 04-17-2018 07:16 PM

@Patrick Oh no you are right!

I was focusing on the PT fields which are not read.

I have no clue why they are not.

High-Performance SAS Coding - Third Edition

Patrick · Posted 04-17-2018 07:38 PM

@ChrisNZ wrote:

@Patrick Oh no you are right!

I was focusing on the PT fields which are not read.

I have no clue why they are not.

@ChrisNZ

It looks like the input buffer gets truncated in the code I've posted. No idea why. Assuming that in real life @MFLoGrasso will be reading the data from an external file (not inline data reading using datalines/card) below example demonstrates that this should work.

filename mydata temp;
data _null_;
  file mydata;
  infile datalines;
  input;
  put _infile_;
  datalines;
UNITID=459107,SURVSECT=EF1,PART=B,LINE=8,SLEVEL=1,SEX=2,COUNT=14 
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=1,COUNT=3 
UNITID=459107,SURVSECT=EF1,PART=B,LINE=9,SLEVEL=1,SEX=2,COUNT=6 
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=1,COUNT=35 
UNITID=459107,SURVSECT=EF1,PART=C,LINE=51,HS=2,COUNT=15 
UNITID=459107,SURVSECT=EF1,PART=D,COUNT=1 
UNITID=459107,SURVSECT=EF1,PART=E,FT_PY_COHORT=0,FT_EXCLUSIONS=0,FT_CY_COHORT=0,PT_PY_COHORT=0,PT_EXCLUSIONS=0,PT_CY_COHORT=0 
UNITID=459107,SURVSECT=EF1,PART=F,ST_STAFF_RATIO=10 
UNITID=459107,SURVSECT=EF1,PART=G,LINE=1,ENROLL_EXCLUSIVE=19,ENROLL_SOME=80,NOTENROLL=238,INUS_PPS=19,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0 
UNITID=459107,SURVSECT=EF1,PART=G,LINE=2,ENROLL_EXCLUSIVE=0,ENROLL_SOME=0,NOTENROLL=3,INUS_PPS=0,INUS_NOTPPS=0,INUS_UNKNOWN_STATE=0,OUTSIDEUS=0 
UNITID=475699,SURVSECT=EF1,PART=A,CIPCODE=52.0000,LINE=16,RACE=5,SEX=1,COUNT=1 
UNITID=475699,SURVSECT=EF1,PART=A,IPCODE=52.0000,LINE=17,RACE=5,SEX=1,COUNT=3
;
run;

data want;
  infile mydata;
  length unitid 8. SURVSECT $20.;
  input @;
  _infile_=translate(_infile_,' ',',');
  input 
    UNITID= SURVSECT=$ PART=$ LINE= SLEVEL= SEX= COUNT=
    FT_PY_COHORT= FT_EXCLUSIONS= FT_CY_COHORT= PT_PY_COHORT= PT_EXCLUSIONS= PT_CY_COHORT=
    ;
run;

ChrisNZ · Posted 04-17-2018 07:02 PM

As for creating several data sets, you have to describe them.

Something like this works:

data I.IMP_A1 (keep=UNITID SURVSECT PART LINE SLEVEL SEX COUNT)
     I.IMP_A2 (keep=UNITID SURVSECT PART CIPCOD LINE );
...
  if PART= 'A' then output I.IMP_A1;  
  if PART= 'B' then output I.IMP_A2;

High-Performance SAS Coding - Third Edition

Extract part of a string having positional delimiter