Hi everyone, I have stuck when trying to access this raw data. Any help, I really appreciate!
Here is my SAS Code:
data houseparty;
infile "......\houseparty.txt" missover ;
input house_ID address $21. @30 city $28. @60 state $ @67 ZIP_code
/ First_name_1 $ last_name_1 $ Donation_1
/ First_name_2 $ last_name_2 $ Donation_2
/ First_name_3 $ last_name_3 $ Donation_3
/ First_name_4 $ last_name_4 $ Donation_4 ;
run;
The capture of File :
The full question ( Chapter 11: Exercises and Projects for The Little SAS Book )
A political candidate solicits donations through grassroots efforts via small house parties. After the house party, she records the donations to her campaign for each house. The raw data set called HouseParty.dat contains information on the most recent wave of house parties. The variables in this file are house ID, address, city, state, and ZIP code, followed by the name of each attendee and the amount donated at the house party.
a. Open the raw data file in a simple editor such as WordPad. In a comment in your program, discuss why this data structure would be difficult to read into SAS with a single INPUT statement and without line pointers. What log messages might you encounter?
b. Write a program that will read the raw data into SAS without errors or warnings in the log. The resulting SAS data set should have one observation per attendee with repeating house ID and address information for those at the same house party.
c. Calculate the average amount donated by state. Make sure that states names are grouped together appropriately.
Lets look at your raw data a bit:
1001 0713 View Wy. Los Angeles CA 90097 MICHELLE MORALES 190 SANDRA MORALES 65 KENNETH MORALES 205 1022 032 Forest Wy. Grandview Wa 98930 KENNETH COLEMAN 405 RICHARD COLEMAN 480 1027 9671 Hill Wy. Mermentau LA 70556 JERRY BERRY 475 SHANNON SMITH 75 ROBERT BERRY 190 RYAN SMITH 90 1038 0866 9th Dr. Long Beach ca 90845 TONYA NICHOLS 500 1065 8855 Third St. Blue Springs MO 64015 SANDRA OLSON 200 STEPHANIE OLSON 70
So apparently you have differing numbers of people at an address.
Your code as written forces reading 4 people. So you likely do not get what you want.
Likely you would will be better off in the long run NOT making 4 (or 3 or 6 or whatever) first name last name donation variables. That sort of data structure causes more headaches in the long run. You would be better off with a variable that indicates the position number within the household (?).
The question is how to parse the data to tell the difference between the house address line and the people.
Since you apparently have an ID value that is numeric at the start of the address then you can test for that, unless you have any people with names like "123".
The following codes reads things in the data step. Use your INFILE to point to your data set.
The input @; places the current line of data into a long automatic variable you can examine named _infile_ and holds the input pointer on that line.
The ANYDIGIT function with Scan is looking for numerals in the first "word".
Retain keeps variables across iterations of the data step.
You likely do not want simple list input for names because sometimes folks have spaces or other stuff in first or last name and that would throw where the donation falls in the list count.
data want; infile datalines missover; input @; length address $ 21 city $ 28 state $ 2 ZIP_code $ 5 ; /* keep the address information for each person*/ retain house_ID address city state ZIP_code ; if anydigit(scan(_infile_,1)) >0 then do; /* read the address line*/ input house_ID address $21. @30 city $28. @60 state $ @67 ZIP_code $; person=0; end; else do; /* increment the person indicator*/ person+1; input First_name $14. last_name $15. Donation; /* only output data when person information has been read*/ output; end; datalines; 1001 0713 View Wy. Los Angeles CA 90097 MICHELLE MORALES 190 SANDRA MORALES 65 KENNETH MORALES 205 1022 032 Forest Wy. Grandview Wa 98930 KENNETH COLEMAN 405 RICHARD COLEMAN 480 1027 9671 Hill Wy. Mermentau LA 70556 JERRY BERRY 475 SHANNON SMITH 75 ROBERT BERRY 190 RYAN SMITH 90 1038 0866 9th Dr. Long Beach ca 90845 TONYA NICHOLS 500 1065 8855 Third St. Blue Springs MO 64015 SANDRA OLSON 200 STEPHANIE OLSON 70 ;
Show us the LOG so we can see the code and the warning messages and the error message.
Please help us by maintaining the formatting of the LOG by pasting it (as text) into the window that appears when you click on the </> icon. PLEASE DO NOT SKIP THIS STEP.
The approach is incorrect, it's hard coded it to read 4 rows of names per address when the data does not follow that pattern. You have a variable number of records after the address portion. You'll need to develop a dynamic approach that can identify if a record is an address or a name and how to handle it. Your addresses always start with a number so that's one way to identify it. You can use the trailing @ to hold a record. You may want to look into RETAIN as well to hold the same address across multiple records.
This is untested.
data houseparty;
infile "......\houseparty.txt" missover ;
retain house_ID address city state ZIP_code;
*read first variable;
input firstChar $ @;
*check if it contains any numbers if so, process as an address;
if anydigit(firstChar) then do;
Input house_ID address $21. @30 city $28. @60 state $ @67 ZIP_code ;
Record + 1;
end;
*else read in names;
else do;
input First_name $ last_name $ Donation;
end;
run;
Lets look at your raw data a bit:
1001 0713 View Wy. Los Angeles CA 90097 MICHELLE MORALES 190 SANDRA MORALES 65 KENNETH MORALES 205 1022 032 Forest Wy. Grandview Wa 98930 KENNETH COLEMAN 405 RICHARD COLEMAN 480 1027 9671 Hill Wy. Mermentau LA 70556 JERRY BERRY 475 SHANNON SMITH 75 ROBERT BERRY 190 RYAN SMITH 90 1038 0866 9th Dr. Long Beach ca 90845 TONYA NICHOLS 500 1065 8855 Third St. Blue Springs MO 64015 SANDRA OLSON 200 STEPHANIE OLSON 70
So apparently you have differing numbers of people at an address.
Your code as written forces reading 4 people. So you likely do not get what you want.
Likely you would will be better off in the long run NOT making 4 (or 3 or 6 or whatever) first name last name donation variables. That sort of data structure causes more headaches in the long run. You would be better off with a variable that indicates the position number within the household (?).
The question is how to parse the data to tell the difference between the house address line and the people.
Since you apparently have an ID value that is numeric at the start of the address then you can test for that, unless you have any people with names like "123".
The following codes reads things in the data step. Use your INFILE to point to your data set.
The input @; places the current line of data into a long automatic variable you can examine named _infile_ and holds the input pointer on that line.
The ANYDIGIT function with Scan is looking for numerals in the first "word".
Retain keeps variables across iterations of the data step.
You likely do not want simple list input for names because sometimes folks have spaces or other stuff in first or last name and that would throw where the donation falls in the list count.
data want; infile datalines missover; input @; length address $ 21 city $ 28 state $ 2 ZIP_code $ 5 ; /* keep the address information for each person*/ retain house_ID address city state ZIP_code ; if anydigit(scan(_infile_,1)) >0 then do; /* read the address line*/ input house_ID address $21. @30 city $28. @60 state $ @67 ZIP_code $; person=0; end; else do; /* increment the person indicator*/ person+1; input First_name $14. last_name $15. Donation; /* only output data when person information has been read*/ output; end; datalines; 1001 0713 View Wy. Los Angeles CA 90097 MICHELLE MORALES 190 SANDRA MORALES 65 KENNETH MORALES 205 1022 032 Forest Wy. Grandview Wa 98930 KENNETH COLEMAN 405 RICHARD COLEMAN 480 1027 9671 Hill Wy. Mermentau LA 70556 JERRY BERRY 475 SHANNON SMITH 75 ROBERT BERRY 190 RYAN SMITH 90 1038 0866 9th Dr. Long Beach ca 90845 TONYA NICHOLS 500 1065 8855 Third St. Blue Springs MO 64015 SANDRA OLSON 200 STEPHANIE OLSON 70 ;
@SteveLe wrote:
By the way, I just wonder: can we put many people ( first_name, last_name, donation ) in one row instead of many?
The question to answer first is "why have many on one row". You need to very explicitly state what the exact advantage would be. For most things in SAS you will be better off with one person per record.
You already know that have one to four people per address. So on some rows you would have mostly missing values. Almost any further processing would require additional coding to only use the values that are recorded "across" a row.
Generally I would not actually make data sets look like that but use a reporting procedure to display text like that.
Example, using the Want data set made above:
proc report data=want; columns house_id person,(last_name first_name donation) ; define house_id / group; define person /across "Donor"; define last_name /display; define first_name /display; run;
You can make an ugly data set like that. We get that question nearly daily. Though yours would be compounded by using 3 variables, what the result should actually look like, and as always how is that variable to be used.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.