BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
SteveLe
Calcite | Level 5

Hi everyone, I have stuck when trying to access this raw data. Any help, I really appreciate! 

Here is my SAS Code:

data houseparty;
infile "......\houseparty.txt"   missover ;
input house_ID  address $21. @30 city $28. @60 state $ @67 ZIP_code 
/ First_name_1 $  last_name_1 $ Donation_1 
/ First_name_2  $ last_name_2 $ Donation_2 
/ First_name_3 $ last_name_3  $ Donation_3 
/ First_name_4 $ last_name_4 $ Donation_4 ;
run;

 

The capture of File :

The full question ( Chapter 11:  Exercises and Projects for The Little SAS Book )

 A political candidate solicits donations through grassroots efforts via small house parties. After the house party, she records the donations to her campaign for each house. The raw data set called HouseParty.dat contains information on the most recent wave of house parties. The variables in this file are house ID, address, city, state, and ZIP code, followed by the name of each attendee and the amount donated at the house party.

a. Open the raw data file in a simple editor such as WordPad. In a comment in your program, discuss why this data structure would be difficult to read into SAS with a single INPUT statement and without line pointers. What log messages might you encounter?

b. Write a program that will read the raw data into SAS without errors or warnings in the log. The resulting SAS data set should have one observation per attendee with repeating house ID and address information for those at the same house party.

c. Calculate the average amount donated by state. Make sure that states names are grouped together appropriately.

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

Lets look at your raw data a bit:

1001   0713 View Wy.         Los Angeles                   CA     90097
MICHELLE      MORALES        190
SANDRA        MORALES        65
KENNETH       MORALES        205
1022   032 Forest Wy.        Grandview                     Wa     98930
KENNETH       COLEMAN        405
RICHARD       COLEMAN        480
1027   9671 Hill Wy.         Mermentau                     LA     70556
JERRY         BERRY          475
SHANNON       SMITH          75
ROBERT        BERRY          190
RYAN          SMITH          90
1038   0866 9th Dr.          Long Beach                    ca     90845
TONYA         NICHOLS        500
1065   8855 Third St.        Blue Springs                  MO     64015
SANDRA        OLSON          200
STEPHANIE     OLSON          70

So apparently you have differing numbers of people at an address.
Your code as written forces reading 4 people. So you likely do not get what you want.

Likely you would will be better off in the long run NOT making 4 (or 3 or 6 or whatever) first name last name donation variables. That sort of data structure causes more headaches in the long run. You would be better off with a variable that indicates the position number within the household (?).

 

The question is how to parse the data to tell the difference between the house address line and the people.

Since you apparently have an ID value that is numeric at the start of the address then you can test for that, unless you have any people with names like "123".

The following codes reads things in the data step. Use your INFILE to point to your data set.

The input @;  places the current line of data into a long automatic variable you can examine named _infile_ and holds the input pointer on that line.

The ANYDIGIT function with Scan is looking for numerals in the first "word".

Retain keeps variables across iterations of the data step.

You likely do not want simple list input for names because sometimes folks have spaces or other stuff in first or last name and that would throw where the donation falls in the list count.

 

data want;
   infile datalines missover;
   input @;
   length address $ 21 city $ 28 state $ 2 ZIP_code $ 5 ;
   /* keep the address information for each person*/
   retain   house_ID  address city state  ZIP_code ;
   if anydigit(scan(_infile_,1)) >0 then do;
      /* read the address line*/
      input house_ID  address $21. @30 city $28. @60 state $ @67 ZIP_code $;
      person=0;
   end;
   else do;
      /* increment the person indicator*/
      person+1;
      input First_name $14.  last_name $15. Donation;
      /* only output data when person information has been read*/
      output; 
   end;
datalines;
1001   0713 View Wy.         Los Angeles                   CA     90097
MICHELLE      MORALES        190
SANDRA        MORALES        65
KENNETH       MORALES        205
1022   032 Forest Wy.        Grandview                     Wa     98930
KENNETH       COLEMAN        405
RICHARD       COLEMAN        480
1027   9671 Hill Wy.         Mermentau                     LA     70556
JERRY         BERRY          475
SHANNON       SMITH          75
ROBERT        BERRY          190
RYAN          SMITH          90
1038   0866 9th Dr.          Long Beach                    ca     90845
TONYA         NICHOLS        500
1065   8855 Third St.        Blue Springs                  MO     64015
SANDRA        OLSON          200
STEPHANIE     OLSON          70
;

View solution in original post

9 REPLIES 9
PaigeMiller
Diamond | Level 26

Show us the LOG so we can see the code and the warning messages and the error message.

 

Please help us by maintaining the formatting of the LOG by pasting it (as text) into the window that appears when you click on the </> icon. PLEASE DO NOT SKIP THIS STEP.

--
Paige Miller
SteveLe
Calcite | Level 5
I will do this, thanks for reminding.
Reeza
Super User

The approach is incorrect, it's hard coded it to read 4 rows of names per address when the data does not follow that pattern. You have a variable number of records after the address portion. You'll need to develop a dynamic approach that can identify if a record is an address or a name and how to handle it. Your addresses always start with a number so that's one way to identify it. You can use the trailing @ to hold a record. You may want to look into RETAIN as well to hold the same address across multiple records. 

 

This is untested.

 

Spoiler
data houseparty;

infile "......\houseparty.txt"   missover ;

retain house_ID  address  city  state  ZIP_code;

*read first variable;
input firstChar $ @;

*check if it contains any numbers if so, process as an address;
if anydigit(firstChar) then do;
    Input house_ID  address $21. @30 city $28. @60 state $ @67 ZIP_code ;
   Record + 1;
end;

*else read in names;
else do;
      input First_name $  last_name $ Donation;
end;

run;

 

 

SteveLe
Calcite | Level 5
hi bro. Thanks for your help.
Actually, I did apply your code but it didn't work on my SAS. :'(
Maybe I made a mistake.
ballardw
Super User

Lets look at your raw data a bit:

1001   0713 View Wy.         Los Angeles                   CA     90097
MICHELLE      MORALES        190
SANDRA        MORALES        65
KENNETH       MORALES        205
1022   032 Forest Wy.        Grandview                     Wa     98930
KENNETH       COLEMAN        405
RICHARD       COLEMAN        480
1027   9671 Hill Wy.         Mermentau                     LA     70556
JERRY         BERRY          475
SHANNON       SMITH          75
ROBERT        BERRY          190
RYAN          SMITH          90
1038   0866 9th Dr.          Long Beach                    ca     90845
TONYA         NICHOLS        500
1065   8855 Third St.        Blue Springs                  MO     64015
SANDRA        OLSON          200
STEPHANIE     OLSON          70

So apparently you have differing numbers of people at an address.
Your code as written forces reading 4 people. So you likely do not get what you want.

Likely you would will be better off in the long run NOT making 4 (or 3 or 6 or whatever) first name last name donation variables. That sort of data structure causes more headaches in the long run. You would be better off with a variable that indicates the position number within the household (?).

 

The question is how to parse the data to tell the difference between the house address line and the people.

Since you apparently have an ID value that is numeric at the start of the address then you can test for that, unless you have any people with names like "123".

The following codes reads things in the data step. Use your INFILE to point to your data set.

The input @;  places the current line of data into a long automatic variable you can examine named _infile_ and holds the input pointer on that line.

The ANYDIGIT function with Scan is looking for numerals in the first "word".

Retain keeps variables across iterations of the data step.

You likely do not want simple list input for names because sometimes folks have spaces or other stuff in first or last name and that would throw where the donation falls in the list count.

 

data want;
   infile datalines missover;
   input @;
   length address $ 21 city $ 28 state $ 2 ZIP_code $ 5 ;
   /* keep the address information for each person*/
   retain   house_ID  address city state  ZIP_code ;
   if anydigit(scan(_infile_,1)) >0 then do;
      /* read the address line*/
      input house_ID  address $21. @30 city $28. @60 state $ @67 ZIP_code $;
      person=0;
   end;
   else do;
      /* increment the person indicator*/
      person+1;
      input First_name $14.  last_name $15. Donation;
      /* only output data when person information has been read*/
      output; 
   end;
datalines;
1001   0713 View Wy.         Los Angeles                   CA     90097
MICHELLE      MORALES        190
SANDRA        MORALES        65
KENNETH       MORALES        205
1022   032 Forest Wy.        Grandview                     Wa     98930
KENNETH       COLEMAN        405
RICHARD       COLEMAN        480
1027   9671 Hill Wy.         Mermentau                     LA     70556
JERRY         BERRY          475
SHANNON       SMITH          75
ROBERT        BERRY          190
RYAN          SMITH          90
1038   0866 9th Dr.          Long Beach                    ca     90845
TONYA         NICHOLS        500
1065   8855 Third St.        Blue Springs                  MO     64015
SANDRA        OLSON          200
STEPHANIE     OLSON          70
;
SteveLe
Calcite | Level 5
Incredible! I did it.
Thank so much.
SteveLe
Calcite | Level 5
By the way, I just wonder: can we put many people ( first_name, last_name, donation ) in one row instead of many?
ballardw
Super User

@SteveLe wrote:
By the way, I just wonder: can we put many people ( first_name, last_name, donation ) in one row instead of many?

The question to answer first is "why have many on one row". You need to very explicitly state what the exact advantage would be. For most things in SAS you will be better off with one person per record.

You already know that have one to four people per address. So on some rows you would have mostly missing values. Almost any further processing would require additional coding to only use the values that are recorded "across" a row.

 

Generally I would not actually make data sets look like that but use a reporting procedure to display text like that.

Example, using the Want data set made above:

proc report data=want;
   columns house_id person,(last_name first_name donation)
   ;
   define house_id / group;
   define person /across "Donor";
   define last_name /display;
   define first_name /display;
run;

You can make an ugly data set like that. We get that question nearly daily. Though yours would be compounded by using 3 variables, what the result should actually look like, and as always how is that variable to be used.

 

 

SteveLe
Calcite | Level 5
"You need to very explicitly state what the exact advantage would be.
For most things in SAS you will be better off with one person per record."
Make sense, Thanks.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 9 replies
  • 1597 views
  • 1 like
  • 4 in conversation