So, I have a situation that is similar to the problem posted here. The issue that I have is that whereas the OP has three different files that are already grouped together in the original csv, I have multiple "entries" of the same four files stacked on top of each other, and each time the next group of data for a particular file arises, the n for that particular group may be different. I have attached one data file that is typical of what I am working with.
Within this csv, the first three variables (UNITID, SURVSECT, PART) are always the same. I need to create a separate file for each unique combination of SURVSECT and PART (in this case, COM A, COM B, COM C, and COM D). The file is set up to give all four parts for each UNITID before looping back to part A for the next UNITID. I have a sense of how I THINK this needs to be approached, but I can't fully wrap my head around the code I would need to accomplish this task.
I am pretty comfortable that I have steps 3 - 5 set up properly. I figure steps 1 and 2 begin by initializing some counter as 1 (%LET COUNTER = 1;) then bringing in my entire csv starting from the next using PROC IMPORT and DATAROW = &COUNTER + 1, then updating &COUNTER to be the observation number that caused step 2 to end. This seems to be the sticking point, though. Using the attached data, I can't see it being overly difficult to create COM_A by using row 1 as the variable names and rows 2 - 17 as the data. But now I need to create COM_B using the variable names in row 18 and rows 19 - 34 as the data, COM_C using row 35 for names and row 36 for data, COM_D using row 37 for names and rows 38 - 40 for data, etc.
The file you attached is very easy to read.
UNITID,SURVSECT,PART,MAJORNUM,CIPCODE,AWLEVEL,CRACE01,CRACE02,CRACE25,CRACE26,CRACE27,CRACE28,CRACE29,CRACE30,CRACE31,CRACE32,CRACE33,CRACE34,CRACE35,CRACE36,CRACE37,CRACE38,CRACE13,CRACE14,CRACE15,CRACE16 188517,COM,A,1,11.1002,02,,,,,,,,,1,,,,,,,,,,1,0 188517,COM,A,1,11.1002,03,0,0,1,0,0,0,0,0,1,0,0,0,3,0,0,1,0,0,5,1 UNITID,SURVSECT,PART,MAJORNUM,CIPCODE,AWLEVEL,DISTANCEED 188517,COM,B,1,11.1002,02,2 188517,COM,B,1,11.1002,03,2 UNITID,SURVSECT,PART,CRACE01,CRACE02,CRACE25,CRACE26,CRACE27,CRACE28,CRACE29,CRACE30,CRACE31,CRACE32,CRACE33,CRACE34,CRACE35,CRACE36,CRACE37,CRACE38,CRACE13,CRACE14,CRACE15,CRACE16 188517,COM,C,,,10,11,,3,,3,9,25,,,12,54,1,7,1,,33,103 UNITID,SURVSECT,PART,CTLEVEL,CRACE15,CRACE16,CRACE24,CRACE17,CRACE41,CRACE42,CRACE43,CRACE44,CRACE45,CRACE46,CRACE47,CRACE23,CRACE48,AGE1,AGE2,AGE3,AGE4,AGE5,AGETOTAL 188517,COM,D,2,13,13,26,0,6,0,0,3,0,14,2,1,26,0,15,5,6,0,26 188517,COM,D,3,24,88,112,0,17,3,3,29,0,54,6,0,112,0,36,64,12,0,112
The number of data rows in a group doesn't matter.
You can read the first column into a character variable (looks like it is a character variable since it is an ID) and throw away the header rows. Then read the next two column and decide based on the value of the third column what other variables to read from the line and which output dataset to write into.
Here is the start of a program for you.
data
partA(keep=UNITID SURVSECT PART MAJORNUM CIPCODE AWLEVEL CRACE01 CRACE02 CRACE25-CRACE38 CRACE13-CRACE16)
partB(keep=UNITID SURVSECT PART MAJORNUM CIPCODE AWLEVEL DISTANCEED)
partC(keep=UNITID SURVSECT PART CRACE01 CRACE02 CRACE25-CRACE38 CRACE13-CRACE16)
partD(keep=UNITID SURVSECT PART CTLEVEL CRACE15 CRACE16 CRACE24 CRACE17 CRACE41-CRACE47 CRACE23 CRACE48 AGE1-AGE5 AGETOTAL)
;
infile 'filename' dsd truncover ;
length UNITID $20 SURVSECT $5 PART $1 MAJORNUM 8 ..... ;
input unitid @;
if unitid='UNITID' then delete ;
input SURVSECT PART @ ;
select (part);
when ('A') do;
input MAJORNUM CIPCODE AWLEVEL CRACE01 CRACE02 CRACE25-CRACE38 CRACE13-CRACE16;
ouput partA;
end;
when ('B') do;
input MAJORNUM CIPCODE AWLEVEL DISTANCEED ;
ouput partB;
end;
when ('C') do;
input CRACE01 CRACE02 CRACE25-CRACE38 CRACE13-CRACE16;
ouput partC;
end;
when ('D') do;
input CTLEVEL CRACE15 CRACE16 CRACE24 CRACE17 CRACE41-CRACE47 CRACE23 CRACE48 AGE1-AGE5 AGETOTAL ;
ouput partD;
end;
otherwise put 'Illegal part ' part= / _infile_;
end;
run;
Fill in the details about the variables definitions and any additional "PART"s if needed.
@MFLoGrasso wrote:
So, I have a situation that is similar to the problem posted here. The issue that I have is that whereas the OP has three different files that are already grouped together in the original csv, I have multiple "entries" of the same four files stacked on top of each other, and each time the next group of data for a particular file arises, the n for that particular group may be different. I have attached one data file that is typical of what I am working with.
Within this csv, the first three variables (UNITID, SURVSECT, PART) are always the same. I need to create a separate file for each unique combination of SURVSECT and PART (in this case, COM A, COM B, COM C, and COM D). The file is set up to give all four parts for each UNITID before looping back to part A for the next UNITID. I have a sense of how I THINK this needs to be approached, but I can't fully wrap my head around the code I would need to accomplish this task.
What do you gain by creating separate "files" (you do mean SAS data sets don't you?).
BY group processing with a single dataset would likely be easier.
N of what? The records in the data set or a variable in the data set?
It appears that what you need to do is parse the incoming line to see which read pattern you need to follow up with.
You can do that with
@input @;
If _infile_ =: <a string of values from the data that uniquely identifies a read pattern> then
input <the variables of that list>;
Else if _infile_ =: <another string of values> then
input <the other variables>;
<repeat as needed>.
The strings to compare would be easy to copy from the imbedded header lines using a TEXT EDITOR. Under no circumstances attempt to copy from the csv file using Excel or other spreadsheet.
@ballardw wrote:
@MFLoGrasso wrote:
So, I have a situation that is similar to the problem posted here. The issue that I have is that whereas the OP has three different files that are already grouped together in the original csv, I have multiple "entries" of the same four files stacked on top of each other, and each time the next group of data for a particular file arises, the n for that particular group may be different. I have attached one data file that is typical of what I am working with.
Within this csv, the first three variables (UNITID, SURVSECT, PART) are always the same. I need to create a separate file for each unique combination of SURVSECT and PART (in this case, COM A, COM B, COM C, and COM D). The file is set up to give all four parts for each UNITID before looping back to part A for the next UNITID. I have a sense of how I THINK this needs to be approached, but I can't fully wrap my head around the code I would need to accomplish this task.
What do you gain by creating separate "files" I need to compare each SURVSECT/PART with a series of files created from a different source to verify matches and reconcile any discrepancies between them.
(you do mean SAS data sets don't you?). Yes. Typing faster than I'm thinking. 🙂
BY group processing with a single dataset would likely be easier. I had considered that, but I wasn't sure how I would extract the variable names.
N of what? The records in the data set or a variable in the data set? N is records in a part (see for example, the first part D has 3 observations while the second part D has only 2).
It appears that what you need to do is parse the incoming line to see which read pattern you need to follow up with.
You can do that with
@input @;
If _infile_ =: <a string of values from the data that uniquely identifies a read pattern> then
input <the variables of that list>;
Since the unique combinations are based on the second and third variables of the list, would I need to do something like the following?
IF SCAN(_INFILE_,2,',',M) = 'COM' AND SCAN(_INFILE_,3,',',M) = 'B' THEN
INPUT UNITID SURVSECT PART MAJORNUM CIPCODE AWLEVEL DISTANCEED;
Else if _infile_ =: <another string of values> then
input <the other variables>;
<repeat as needed>.
The strings to compare would be easy to copy from the imbedded header lines using a TEXT EDITOR. Under no circumstances attempt to copy from the csv file using Excel or other spreadsheet.
The file you attached is very easy to read.
UNITID,SURVSECT,PART,MAJORNUM,CIPCODE,AWLEVEL,CRACE01,CRACE02,CRACE25,CRACE26,CRACE27,CRACE28,CRACE29,CRACE30,CRACE31,CRACE32,CRACE33,CRACE34,CRACE35,CRACE36,CRACE37,CRACE38,CRACE13,CRACE14,CRACE15,CRACE16 188517,COM,A,1,11.1002,02,,,,,,,,,1,,,,,,,,,,1,0 188517,COM,A,1,11.1002,03,0,0,1,0,0,0,0,0,1,0,0,0,3,0,0,1,0,0,5,1 UNITID,SURVSECT,PART,MAJORNUM,CIPCODE,AWLEVEL,DISTANCEED 188517,COM,B,1,11.1002,02,2 188517,COM,B,1,11.1002,03,2 UNITID,SURVSECT,PART,CRACE01,CRACE02,CRACE25,CRACE26,CRACE27,CRACE28,CRACE29,CRACE30,CRACE31,CRACE32,CRACE33,CRACE34,CRACE35,CRACE36,CRACE37,CRACE38,CRACE13,CRACE14,CRACE15,CRACE16 188517,COM,C,,,10,11,,3,,3,9,25,,,12,54,1,7,1,,33,103 UNITID,SURVSECT,PART,CTLEVEL,CRACE15,CRACE16,CRACE24,CRACE17,CRACE41,CRACE42,CRACE43,CRACE44,CRACE45,CRACE46,CRACE47,CRACE23,CRACE48,AGE1,AGE2,AGE3,AGE4,AGE5,AGETOTAL 188517,COM,D,2,13,13,26,0,6,0,0,3,0,14,2,1,26,0,15,5,6,0,26 188517,COM,D,3,24,88,112,0,17,3,3,29,0,54,6,0,112,0,36,64,12,0,112
The number of data rows in a group doesn't matter.
You can read the first column into a character variable (looks like it is a character variable since it is an ID) and throw away the header rows. Then read the next two column and decide based on the value of the third column what other variables to read from the line and which output dataset to write into.
Here is the start of a program for you.
data
partA(keep=UNITID SURVSECT PART MAJORNUM CIPCODE AWLEVEL CRACE01 CRACE02 CRACE25-CRACE38 CRACE13-CRACE16)
partB(keep=UNITID SURVSECT PART MAJORNUM CIPCODE AWLEVEL DISTANCEED)
partC(keep=UNITID SURVSECT PART CRACE01 CRACE02 CRACE25-CRACE38 CRACE13-CRACE16)
partD(keep=UNITID SURVSECT PART CTLEVEL CRACE15 CRACE16 CRACE24 CRACE17 CRACE41-CRACE47 CRACE23 CRACE48 AGE1-AGE5 AGETOTAL)
;
infile 'filename' dsd truncover ;
length UNITID $20 SURVSECT $5 PART $1 MAJORNUM 8 ..... ;
input unitid @;
if unitid='UNITID' then delete ;
input SURVSECT PART @ ;
select (part);
when ('A') do;
input MAJORNUM CIPCODE AWLEVEL CRACE01 CRACE02 CRACE25-CRACE38 CRACE13-CRACE16;
ouput partA;
end;
when ('B') do;
input MAJORNUM CIPCODE AWLEVEL DISTANCEED ;
ouput partB;
end;
when ('C') do;
input CRACE01 CRACE02 CRACE25-CRACE38 CRACE13-CRACE16;
ouput partC;
end;
when ('D') do;
input CTLEVEL CRACE15 CRACE16 CRACE24 CRACE17 CRACE41-CRACE47 CRACE23 CRACE48 AGE1-AGE5 AGETOTAL ;
ouput partD;
end;
otherwise put 'Illegal part ' part= / _infile_;
end;
run;
Fill in the details about the variables definitions and any additional "PART"s if needed.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.