Solved: Re: Importing an unusual csv file - revisited

MFLoGrasso · Posted 10-16-2018 04:25 PM

So, I have a situation that is similar to the problem posted here. The issue that I have is that whereas the OP has three different files that are already grouped together in the original csv, I have multiple "entries" of the same four files stacked on top of each other, and each time the next group of data for a particular file arises, the n for that particular group may be different. I have attached one data file that is typical of what I am working with.

Within this csv, the first three variables (UNITID, SURVSECT, PART) are always the same. I need to create a separate file for each unique combination of SURVSECT and PART (in this case, COM A, COM B, COM C, and COM D). The file is set up to give all four parts for each UNITID before looping back to part A for the next UNITID. I have a sense of how I THINK this needs to be approached, but I can't fully wrap my head around the code I would need to accomplish this task.

Read the first line as the list of variables for a temp data set (named temp)
Add lines to temp until the next instance of a line beginning with UNITID
Assign the SURVSECT and PART values (which will be the same for every observation in temp) to macro variables, e.g. &surv and &part (%IF _N_ = 1 %THEN %DO; %LET SURV = SURVSECT; %LET PART = PART; %END;)
Append the temp set to a table named &surv_&part (PROC APPEND BASE = &SURV_&PART DATA = TEMP; RUN;)
Delete the temp set (PROC DELETE DATA = TEMP; RUN;)
Repeat steps 1 - 5, beginning with the line that caused step 2 to end, until the end of the csv file

I am pretty comfortable that I have steps 3 - 5 set up properly. I figure steps 1 and 2 begin by initializing some counter as 1 (%LET COUNTER = 1;) then bringing in my entire csv starting from the next using PROC IMPORT and DATAROW = &COUNTER + 1, then updating &COUNTER to be the observation number that caused step 2 to end. This seems to be the sticking point, though. Using the attached data, I can't see it being overly difficult to create COM_A by using row 1 as the variable names and rows 2 - 17 as the data. But now I need to create COM_B using the variable names in row 18 and rows 19 - 34 as the data, COM_C using row 35 for names and row 36 for data, COM_D using row 37 for names and rows 38 - 40 for data, etc.

Tom · Posted 10-16-2018 11:57 PM

The file you attached is very easy to read.

UNITID,SURVSECT,PART,MAJORNUM,CIPCODE,AWLEVEL,CRACE01,CRACE02,CRACE25,CRACE26,CRACE27,CRACE28,CRACE29,CRACE30,CRACE31,CRACE32,CRACE33,CRACE34,CRACE35,CRACE36,CRACE37,CRACE38,CRACE13,CRACE14,CRACE15,CRACE16
188517,COM,A,1,11.1002,02,,,,,,,,,1,,,,,,,,,,1,0
188517,COM,A,1,11.1002,03,0,0,1,0,0,0,0,0,1,0,0,0,3,0,0,1,0,0,5,1
UNITID,SURVSECT,PART,MAJORNUM,CIPCODE,AWLEVEL,DISTANCEED
188517,COM,B,1,11.1002,02,2
188517,COM,B,1,11.1002,03,2
UNITID,SURVSECT,PART,CRACE01,CRACE02,CRACE25,CRACE26,CRACE27,CRACE28,CRACE29,CRACE30,CRACE31,CRACE32,CRACE33,CRACE34,CRACE35,CRACE36,CRACE37,CRACE38,CRACE13,CRACE14,CRACE15,CRACE16
188517,COM,C,,,10,11,,3,,3,9,25,,,12,54,1,7,1,,33,103
UNITID,SURVSECT,PART,CTLEVEL,CRACE15,CRACE16,CRACE24,CRACE17,CRACE41,CRACE42,CRACE43,CRACE44,CRACE45,CRACE46,CRACE47,CRACE23,CRACE48,AGE1,AGE2,AGE3,AGE4,AGE5,AGETOTAL
188517,COM,D,2,13,13,26,0,6,0,0,3,0,14,2,1,26,0,15,5,6,0,26
188517,COM,D,3,24,88,112,0,17,3,3,29,0,54,6,0,112,0,36,64,12,0,112

The number of data rows in a group doesn't matter.

You can read the first column into a character variable (looks like it is a character variable since it is an ID) and throw away the header rows. Then read the next two column and decide based on the value of the third column what other variables to read from the line and which output dataset to write into.

Here is the start of a program for you.

data
  partA(keep=UNITID SURVSECT PART MAJORNUM CIPCODE AWLEVEL CRACE01 CRACE02 CRACE25-CRACE38 CRACE13-CRACE16)
  partB(keep=UNITID SURVSECT PART MAJORNUM CIPCODE AWLEVEL DISTANCEED)
  partC(keep=UNITID SURVSECT PART CRACE01 CRACE02 CRACE25-CRACE38 CRACE13-CRACE16)
  partD(keep=UNITID SURVSECT PART CTLEVEL CRACE15 CRACE16 CRACE24 CRACE17 CRACE41-CRACE47 CRACE23 CRACE48 AGE1-AGE5 AGETOTAL)
;
  infile 'filename' dsd truncover ;
  length UNITID $20 SURVSECT $5 PART $1 MAJORNUM 8 ..... ;
  input unitid @;
  if unitid='UNITID' then delete ;
  input SURVSECT PART @ ;
  select (part);
    when ('A') do;
      input MAJORNUM CIPCODE AWLEVEL CRACE01 CRACE02 CRACE25-CRACE38 CRACE13-CRACE16;
      ouput partA;
      end;
    when ('B') do;
      input MAJORNUM CIPCODE AWLEVEL DISTANCEED ;
      ouput partB;
      end;
    when ('C') do;
      input CRACE01 CRACE02 CRACE25-CRACE38 CRACE13-CRACE16;
      ouput partC;
      end;
    when ('D') do;
      input CTLEVEL CRACE15 CRACE16 CRACE24 CRACE17 CRACE41-CRACE47 CRACE23 CRACE48 AGE1-AGE5 AGETOTAL ;
      ouput partD;
      end;
    otherwise put 'Illegal part ' part= / _infile_;
  end;
run;

Fill in the details about the variables definitions and any additional "PART"s if needed.

View solution in original post

ballardw · Posted 10-16-2018 04:57 PM

@MFLoGrasso wrote:

So, I have a situation that is similar to the problem posted here. The issue that I have is that whereas the OP has three different files that are already grouped together in the original csv, I have multiple "entries" of the same four files stacked on top of each other, and each time the next group of data for a particular file arises, the n for that particular group may be different. I have attached one data file that is typical of what I am working with.

Within this csv, the first three variables (UNITID, SURVSECT, PART) are always the same. I need to create a separate file for each unique combination of SURVSECT and PART (in this case, COM A, COM B, COM C, and COM D). The file is set up to give all four parts for each UNITID before looping back to part A for the next UNITID. I have a sense of how I THINK this needs to be approached, but I can't fully wrap my head around the code I would need to accomplish this task.

What do you gain by creating separate "files" (you do mean SAS data sets don't you?).

BY group processing with a single dataset would likely be easier.

N of what? The records in the data set or a variable in the data set?

It appears that what you need to do is parse the incoming line to see which read pattern you need to follow up with.

You can do that with

@input @;

If _infile_ =: <a string of values from the data that uniquely identifies a read pattern> then

input <the variables of that list>;

Else if _infile_ =: <another string of values> then

input <the other variables>;

<repeat as needed>.

The strings to compare would be easy to copy from the imbedded header lines using a TEXT EDITOR. Under no circumstances attempt to copy from the csv file using Excel or other spreadsheet.

MFLoGrasso · Posted 10-16-2018 05:51 PM

@ballardw wrote:

@MFLoGrasso wrote:

So, I have a situation that is similar to the problem posted here. The issue that I have is that whereas the OP has three different files that are already grouped together in the original csv, I have multiple "entries" of the same four files stacked on top of each other, and each time the next group of data for a particular file arises, the n for that particular group may be different. I have attached one data file that is typical of what I am working with.

Within this csv, the first three variables (UNITID, SURVSECT, PART) are always the same. I need to create a separate file for each unique combination of SURVSECT and PART (in this case, COM A, COM B, COM C, and COM D). The file is set up to give all four parts for each UNITID before looping back to part A for the next UNITID. I have a sense of how I THINK this needs to be approached, but I can't fully wrap my head around the code I would need to accomplish this task.

What do you gain by creating separate "files" I need to compare each SURVSECT/PART with a series of files created from a different source to verify matches and reconcile any discrepancies between them.

(you do mean SAS data sets don't you?).  Yes. Typing faster than I'm thinking. 🙂

BY group processing with a single dataset would likely be easier. I had considered that, but I wasn't sure how I would extract the variable names.

N of what? The records in the data set or a variable in the data set? N is records in a part (see for example, the first part D has 3 observations while the second part D has only 2).

It appears that what you need to do is parse the incoming line to see which read pattern you need to follow up with.

You can do that with

@input @;

If _infile_ =: <a string of values from the data that uniquely identifies a read pattern> then

   input <the variables of that list>;

Since the unique combinations are based on the second and third variables of the list, would I need to do something like the following?

IF SCAN(_INFILE_,2,',',M) = 'COM' AND SCAN(_INFILE_,3,',',M) = 'B' THEN

INPUT UNITID SURVSECT PART MAJORNUM CIPCODE AWLEVEL DISTANCEED;

Else if _infile_ =: <another string of values> then

   input <the other variables>;

<repeat as needed>.

The strings to compare would be easy to copy from the imbedded header lines using a TEXT EDITOR. Under no circumstances attempt to copy from the csv file using Excel or other spreadsheet.

Tom · Posted 10-16-2018 11:57 PM

The file you attached is very easy to read.

UNITID,SURVSECT,PART,MAJORNUM,CIPCODE,AWLEVEL,CRACE01,CRACE02,CRACE25,CRACE26,CRACE27,CRACE28,CRACE29,CRACE30,CRACE31,CRACE32,CRACE33,CRACE34,CRACE35,CRACE36,CRACE37,CRACE38,CRACE13,CRACE14,CRACE15,CRACE16
188517,COM,A,1,11.1002,02,,,,,,,,,1,,,,,,,,,,1,0
188517,COM,A,1,11.1002,03,0,0,1,0,0,0,0,0,1,0,0,0,3,0,0,1,0,0,5,1
UNITID,SURVSECT,PART,MAJORNUM,CIPCODE,AWLEVEL,DISTANCEED
188517,COM,B,1,11.1002,02,2
188517,COM,B,1,11.1002,03,2
UNITID,SURVSECT,PART,CRACE01,CRACE02,CRACE25,CRACE26,CRACE27,CRACE28,CRACE29,CRACE30,CRACE31,CRACE32,CRACE33,CRACE34,CRACE35,CRACE36,CRACE37,CRACE38,CRACE13,CRACE14,CRACE15,CRACE16
188517,COM,C,,,10,11,,3,,3,9,25,,,12,54,1,7,1,,33,103
UNITID,SURVSECT,PART,CTLEVEL,CRACE15,CRACE16,CRACE24,CRACE17,CRACE41,CRACE42,CRACE43,CRACE44,CRACE45,CRACE46,CRACE47,CRACE23,CRACE48,AGE1,AGE2,AGE3,AGE4,AGE5,AGETOTAL
188517,COM,D,2,13,13,26,0,6,0,0,3,0,14,2,1,26,0,15,5,6,0,26
188517,COM,D,3,24,88,112,0,17,3,3,29,0,54,6,0,112,0,36,64,12,0,112

The number of data rows in a group doesn't matter.

You can read the first column into a character variable (looks like it is a character variable since it is an ID) and throw away the header rows. Then read the next two column and decide based on the value of the third column what other variables to read from the line and which output dataset to write into.

Here is the start of a program for you.

data
  partA(keep=UNITID SURVSECT PART MAJORNUM CIPCODE AWLEVEL CRACE01 CRACE02 CRACE25-CRACE38 CRACE13-CRACE16)
  partB(keep=UNITID SURVSECT PART MAJORNUM CIPCODE AWLEVEL DISTANCEED)
  partC(keep=UNITID SURVSECT PART CRACE01 CRACE02 CRACE25-CRACE38 CRACE13-CRACE16)
  partD(keep=UNITID SURVSECT PART CTLEVEL CRACE15 CRACE16 CRACE24 CRACE17 CRACE41-CRACE47 CRACE23 CRACE48 AGE1-AGE5 AGETOTAL)
;
  infile 'filename' dsd truncover ;
  length UNITID $20 SURVSECT $5 PART $1 MAJORNUM 8 ..... ;
  input unitid @;
  if unitid='UNITID' then delete ;
  input SURVSECT PART @ ;
  select (part);
    when ('A') do;
      input MAJORNUM CIPCODE AWLEVEL CRACE01 CRACE02 CRACE25-CRACE38 CRACE13-CRACE16;
      ouput partA;
      end;
    when ('B') do;
      input MAJORNUM CIPCODE AWLEVEL DISTANCEED ;
      ouput partB;
      end;
    when ('C') do;
      input CRACE01 CRACE02 CRACE25-CRACE38 CRACE13-CRACE16;
      ouput partC;
      end;
    when ('D') do;
      input CTLEVEL CRACE15 CRACE16 CRACE24 CRACE17 CRACE41-CRACE47 CRACE23 CRACE48 AGE1-AGE5 AGETOTAL ;
      ouput partD;
      end;
    otherwise put 'Illegal part ' part= / _infile_;
  end;
run;

Fill in the details about the variables definitions and any additional "PART"s if needed.

Importing an unusual csv file - revisited

Re: Importing an unusual csv file - revisited

Re: Importing an unusual csv file - revisited

Re: Importing an unusual csv file - revisited

Re: Importing an unusual csv file - revisited

Click image to register for webinar

Classroom Training Available!