Hello,
Say I have the following dataset:
data example1;
INFILE DATALINES ;
input ID UncleanVariable $25. ;
DATALINES;
1 Cyclone Limited
1 123 Center Street
1 Orlando
1 FL
1 12245
1 None
1 101(a)
1 Fund equipment
1 10000
1 Lagoon Corp
1 3814 Wakefield Ave
1 Oakland
1 CA
1 19406
1 KL21
1 Subsidise staff
1 200
2 Imagine Sports
2 4556 Sun Valley
2 Road
2 Raleigh
2 NC
2 21020
2 None
2 Airfares
2 14000
;
RUN;
Here, there are a few errors per ID.
Each row should technically be defined as follows:
Row 1 = Grantor
Row 2 = Street
Row 3 = City
Row 4 = State
Row 5 = Postcode
Row 6 = Relationship
Row 7 = Status
Row 8 = Purpose
Row 9 = Contribution Amount
The output I am looking for is:
data solution1;
INFILE DATALINES dsd;
input ID CleanedVariable :$30. Category :$25. ;
DATALINES;
1,Cyclone Limited,Grantor
1,123 Center Street,Street
1,Orlando,City
1,FL,State
1,12245,Postcode
1,Parent company,Relationship
1,101(a),Status
1,Fund equipment,Purpose
1,10000,Contribution Amount
1,Lagoon Corp,Grantor
1,3814 Wakefield Ave,Street
1,Oakland,City
1,CA,State
1,19406,Postcode
1,N/A,Relationship
1,KL21,Status
1,Subsidise staff,Purpose
1,200,Contribution Amount
2,Imagine Sports,Grantor
2,4556 Sun Valley Road,Street
2,Raleigh,City
2,NC,State
2,21020,Postcode
2,Subsidiary,Relationship
2,Missing,Status
2,Airfares,Purpose
2,14000,Contribution Amount
;
RUN;
There are a few problems I want to address. I'm not sure there is a one-size-fits-all solution, and that's okay if there isn't one. Let me walk through the problems using the original data (example1) above.
Error 1 - The value for "Relationship" is missing in the input data, so the "Status" row is read prematurely. Is there a way to check that every sixth row is either "none" or "None" and, if it isn't, insert the value "N/A" between rows 5 and 6 in the original set? My criterion is that the value should ALWAYS be "none" or "None"; if not, "N/A" should be inserted.
Error 2 - Here the address row has spilled over into the city row. Short of correcting this manually, is there a way to fix this spillover in a big data set? The pattern I've seen is that it's usually words like "floor" that spill over from the address, or it spills over when the address line contains more than two spaces. So we have 4556[1st space]Sun[2nd space]Valley[3rd space]Road.
Error 3 - This arises because of error 2: once the street spills over, the rows below it are all read into the wrong categories.
Error 4 - Assuming errors 1-3 are all addressed, there is a new error, very similar to error 1. Here the value for "Status" is missing, and it should be replaced with "N/A" or "Missing" to indicate that there was no value. The only criterion I can think of is that the value in this row should never contain any spaces, but it can contain brackets () and alphanumeric characters.
Ideally, my cleaned data should look like the solution1 data set above. I would like to correct all of the errors, but in order of importance they are: errors 1, 4, 2.
Thanks in advance for any help.
Edit: In solution1 it should read "Street", not "Street Address" - my mistake.
First comment: GIGO (garbage in, garbage out).
If you have a "row" that can only take a limited number of responses, such as possibly Relationship, you might be able to test the value and, if it is not in the list, advance the "category" value.
How many records (IDs in this case) do you have to process? If the number is smallish, say under 100 or so, it may be quicker to edit the file manually by inserting the appropriate lines than to try to work out "logic" for something like this.
Generally, if someone brings data this poor, the estimate on hours and $$ goes way up.
And I suggest to them to get a data storage system designed more recently than a shoe box filled with 3x5 cards.
In total I have 21 IDs to process, and of the 21, around 16 have constant errors in them.
In terms of total observations, there are over 4 million.
The majority of the errors are the missing "relationship", I'd say about 80% of them. Of the remaining 20%, roughly 19% are a missing "status", and the final 1% are miscellaneous spillovers such as "Floor" or "Street" being read into the "city".
That's still 40,000 records to deal with, but I'm sure once the initial errors are fixed, it'll be easier to identify some sort of pattern for handling the rest.
Agree with @ballardw.
Fixing garbage data is expensive and the result is never guaranteed.
If you are stuck with it, this will get you started:
data EXAMPLE1;
input ID UNCLEANVARIABLE $25. ;
CATEGORY_NO+1;
if CATEGORY_NO=10 then CATEGORY_NO=1;
select(CATEGORY_NO);
%* Grantor ;
when(1 ) ;
%* Street ;
when(2 ) if countw(UNCLEANVARIABLE)=3 then do;
input ID UNCLEANVARIABLE2 $25. @1 @@;
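%* look ahead - the INPUT above has read the next record and @1 @@ reposition to column 1 and hold it so it can either be appended to the street (and released by the null INPUT below) or re-read as its own row by the next INPUT ;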
if UNCLEANVARIABLE2 in('Floor','Road') then do;
CLEANVARIABLE=catx(' ',UNCLEANVARIABLE,UNCLEANVARIABLE2);
input;
end;
end;
%* City ;
when(3 ) ;
%* State ;
when(4 ) ;
%* Postcode ;
when(5 ) ;
%* Relationship;
when(6 ) if upcase(UNCLEANVARIABLE) ne 'NONE' then do;
CLEANVARIABLE='N/A';
output;
CATEGORY_NO+1;
CLEANVARIABLE=UNCLEANVARIABLE;
end;
otherwise;
end;
CLEANVARIABLE=coalescec(CLEANVARIABLE,UNCLEANVARIABLE);
output;
cards;
1 Cyclone Limited
1 123 Center Street
1 Orlando
1 FL
1 12245
1 None
1 101(a)
1 Fund equipment
1 10000
1 Lagoon Corp
1 3814 Wakefield Ave
1 Oakland
1 CA
1 19406
1 KL21
1 Subsidise staff
1 200
2 Imagine Sports
2 4556 Sun Valley
2 Road
2 Raleigh
2 NC
2 21020
2 None
2 Airfares
2 14000
run;
ID | CATEGORY_NO | CLEANVARIABLE |
---|---|---|
1 | 1 | Cyclone Limited |
1 | 2 | 123 Center Street |
1 | 3 | Orlando |
1 | 4 | FL |
1 | 5 | 12245 |
1 | 6 | None |
1 | 7 | 101(a) |
1 | 8 | Fund equipment |
1 | 9 | 10000 |
1 | 1 | Lagoon Corp |
1 | 2 | 3814 Wakefield Ave |
1 | 3 | Oakland |
1 | 4 | CA |
1 | 5 | 19406 |
1 | 6 | N/A |
1 | 7 | KL21 |
1 | 8 | Subsidise staff |
1 | 9 | 200 |
2 | 1 | Imagine Sports |
2 | 2 | 4556 Sun Valley Road |
2 | 3 | Raleigh |
2 | 4 | NC |
2 | 5 | 21020 |
2 | 6 | None |
2 | 7 | Airfares |
2 | 8 | 14000 |
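Error 4 (the missing Status) is not handled above, which is why ID 2 only reaches category 8 in the table. One possible extension, untested and based only on this small sample (a real Status like 101(a) or KL21 contains a digit and no spaces, while a Purpose like Airfares or Fund equipment does not), would be another WHEN branch placed alongside the others, before OTHERWISE:
%* Status - heuristic based on the sample: a genuine Status has no spaces and contains a digit ;
when(7 ) if countw(UNCLEANVARIABLE,' ')>1 or anydigit(UNCLEANVARIABLE)=0 then do;
CLEANVARIABLE='Missing';
output;
CATEGORY_NO+1;
CLEANVARIABLE=UNCLEANVARIABLE;
end;
You would want to validate that rule against your full data before trusting it.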
Hi Chris, thanks so much for your response.
Unfortunately I am indeed stuck with the data, and have accepted that a lot of manual scrubbing is going to be needed. Although, I am still pretty new to SAS and coding, so I really appreciate your response.
Could I just clarify how I would amend the code you've written so that I wouldn't need to input the raw data myself, and could just use an already imported data set?
In addition, I would like to have labelled rows, like the following:
LENGTH Var_Name $25; /* set the length first so longer labels such as "Relationship" are not truncated */
IF x=1 THEN Var_Name="Grantor";
ELSE IF x=2 THEN Var_Name="Street";
ELSE IF x=3 THEN Var_Name="City";
ELSE IF x=4 THEN Var_Name="State";
ELSE IF x=5 THEN Var_Name="Postcode";
ELSE IF x=6 THEN Var_Name="Relationship";
ELSE IF x=7 THEN Var_Name="Status";
ELSE IF x=8 THEN Var_Name="Purpose";
ELSE IF x=9 THEN Var_Name="Contribution";
So that I can then transpose this data into a wide form.
Thanks again!
1. just use an already imported data set?
Replace the INPUT statement with a SET statement
2. Have labelled rows
Use variable CATEGORY_NO. Your code is fine. Or use my SELECT sections.
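For example, something along these lines (untested, and GRANT_NO and VAR_NAME are just names I made up) maps CATEGORY_NO to a label and then reshapes the cleaned data to one wide row per grantor block:
data LABELLED;
set EXAMPLE1; %* the cleaned data from the step above ;
length VAR_NAME $25;
if CATEGORY_NO=1 then do;
GRANT_NO+1; %* counts nine-row blocks so repeated grantors within an ID stay separate ;
VAR_NAME='Grantor';
end;
else if CATEGORY_NO=2 then VAR_NAME='Street';
else if CATEGORY_NO=3 then VAR_NAME='City';
else if CATEGORY_NO=4 then VAR_NAME='State';
else if CATEGORY_NO=5 then VAR_NAME='Postcode';
else if CATEGORY_NO=6 then VAR_NAME='Relationship';
else if CATEGORY_NO=7 then VAR_NAME='Status';
else if CATEGORY_NO=8 then VAR_NAME='Purpose';
else if CATEGORY_NO=9 then VAR_NAME='Contribution';
run;
proc transpose data=LABELLED out=WIDE(drop=_name_);
by ID GRANT_NO; %* one wide row per nine-row block ;
id VAR_NAME;
var CLEANVARIABLE;
run;
If the cleaned data is not already in ID order, sort it by ID and GRANT_NO before the PROC TRANSPOSE step.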
Thanks for that, I seem to be getting an error though. Would you mind telling me what I'm doing wrong?
I've used this code:
data EXAMPLE1;
SET test_code;
CATEGORY_NO+1;
if CATEGORY_NO=10 then CATEGORY_NO=1;
select(CATEGORY_NO);
%* Grantor ;
when(1 ) ;
%* Street ;
when(2 ) if countw(UNCLEANVARIABLE)=3 then do;
input ID UNCLEANVARIABLE2 $25. @1 @@;
if UNCLEANVARIABLE2 in('Floor','Road') then do;
CLEANVARIABLE=catx(' ',UNCLEANVARIABLE,UNCLEANVARIABLE2);
input;
end;
end;
%* City ;
when(3 ) ;
%* State ;
when(4 ) ;
%* Postcode ;
when(5 ) ;
%* Relationship;
when(6 ) if upcase(UNCLEANVARIABLE) ne 'NONE' then do;
CLEANVARIABLE='N/A';
output;
CATEGORY_NO+1;
CLEANVARIABLE=UNCLEANVARIABLE;
end;
otherwise;
end;
CLEANVARIABLE=coalescec(CLEANVARIABLE,UNCLEANVARIABLE);
output;
RUN;
Where SET test_code is just the full data set of the sample I posted here.
The error I am getting is:
ERROR: No DATALINES or INFILE statement.
Also, how come in your code you have the variables grantor, street, city etc... but they do not appear in the final output in a separate column?
If I wanted to add the additional categories (Status, Purpose, and Contribution), would that just be a matter of appending the above code with:
%* Status;
when(7 ) ;
%* Purpose ;
when(8 ) ;
%* Contribution;
when(9 ) ;
CLEANVARIABLE=coalescec(CLEANVARIABLE,UNCLEANVARIABLE);
output;
Thanks!
Apologies, just replacing the INPUT statement with SET was not quite sufficient.
Provided your data is not huge, set point= is the easiest replacement.
data HAVE;
input ID UNCLEANVARIABLE $25. ;
cards;
1 Cyclone Limited
1 123 Center Street
1 Orlando
1 FL
1 12245
1 None
1 101(a)
1 Fund equipment
1 10000
1 Lagoon Corp
1 3814 Wakefield Ave
1 Oakland
1 CA
1 19406
1 KL21
1 Subsidise staff
1 200
2 Imagine Sports
2 4556 Sun Valley
2 Road
2 Raleigh
2 NC
2 21020
2 None
2 Airfares
2 14000
run;
data WANT;
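%* direct-access read - POINT= fetches the observation numbered OBS_NO and NOBS= holds the total row count so the step can STOP explicitly because POINT= never triggers end-of-file ;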
OBS_NO+1;
set HAVE point=OBS_NO nobs=NOBS;
CATEGORY_NO+1;
if CATEGORY_NO=10 then CATEGORY_NO=1;
select(CATEGORY_NO);
%* Street ;
when(2 ) if countw(UNCLEANVARIABLE)=3 then do;
%* see whats next;
TMP=OBS_NO+1;
set HAVE(rename=(UNCLEANVARIABLE=UNCLEANVARIABLE2)) point=TMP;
if UNCLEANVARIABLE2 in('Floor','Road') then do;
CLEANVARIABLE=catx(' ',UNCLEANVARIABLE,UNCLEANVARIABLE2);
OBS_NO+1;
end;
end;
%* Relationship;
when(6 ) if upcase(UNCLEANVARIABLE) ne 'NONE' then do;
CLEANVARIABLE='N/A';
output;
CATEGORY_NO+1;
CLEANVARIABLE=UNCLEANVARIABLE;
end;
otherwise;
end;
CLEANVARIABLE=coalescec(CLEANVARIABLE,UNCLEANVARIABLE);
output;
if OBS_NO=NOBS then stop;
keep ID CATEGORY_NO CLEANVARIABLE;
run;
Also, how come in your code you have the variables grantor, street, city etc... but they do not appear in the final output in a separate column?
I don't have these variables anywhere. I just created placeholders as a model for you to fill in. What you propose for Relationship looks fine. Please do take the time to understand the code provided.
The people helping you here are not here to do your work for you and give you a fish; they are here to teach you how to fish.
@1 repositions the pointer at the start of the line so the next input statement will read from the start.
@@ keeps the current record open so the next input statement will read the same record again.
One @ keeps the record open just for the duration of the current implicit data step loop iteration (until the interpreter reaches run;).
Two @s keep the record open across data step implicit iterations.
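A tiny stand-alone illustration with made-up data: each line below holds two ID/value pairs, and the double trailing @ keeps the line open across iterations so both pairs are read.
data DEMO;
input ID VALUE $ @@;
cards;
1 Alpha 1 Beta
2 Gamma 2 Delta
;
run;
DEMO ends up with 4 observations. With a single trailing @ (or no @ at all) each iteration would start a fresh line, so you would only get 2.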