Hello, Say I have the following dataset: data example1; INFILE DATALINES ; input ID UncleanVariable $25. ; DATALINES; 1 Cyclone Limited 1 123 Center Street 1 Orlando 1 FL 1 12245 1 None 1 101(a) 1 Fund equipment 1 10000 1 Lagoon Corp 1 3814 Wakefield Ave 1 Oakland 1 CA 1 19406 1 KL21 1 Subsidise staff 1 200 2 Imagine Sports 2 4556 Sun Valley 2 Road 2 Raleigh 2 NC 2 21020 2 None 2 Airfares 2 14000 ; RUN; Here, there are a few errors per ID. Each row should technically be defined as follows: Row 1 = Grantor Row 2 = Street Row 3 = City Row 4 = State Row 5 = Postcode Row 6 = Relationship Row 7 = Status Row 8 = Purpose Row 9 = Contribution Amount The output I am looking for is: data solution1; INFILE DATALINES dsd; input ID CleanedVariable ~ $30. Category $25. ; DATALINES; 1,Cyclone Limited,Grantor 1,123 Center Street,Street 1,Orlando,City 1,FL,State 1,12245,Postcode 1,Parent company,Relationship 1,101(a),Status 1,Fund equipment,Purpose 1,10000, Contribution Amount 1,Lagoon Corp,Grantor 1,3814 Wakefield Ave,Street 1,Oakland,City 1,CA,State 1,19406,Postcode 1,N/A,Relationship 1,KL21,Status 1,Subsidise staff,Purpose 1,200,Contribution Amount 2,Imagine Sports,Grantor 2,4556 Sun Valley Road,Street 2,Raleigh,City 2,NC,State 2,21020,Postcode 2,Subsidiary,Relationship 2,Missing,Status 2,Airfares,Purpose 2,14000,Contribution Amount ; RUN; There are a few problems I want to address. I'm not sure if there is a one-size fits all solution, so that is okay if there isn't. Let me first visualise the problem, with a few screenshots. This is the original data: Error 1 - The value for "relationship" is missing in the input data, so the "status" row is read prematurely. Is there a way to adjust this so that every Sixth row is either "none" or "None", and if not, insert the value "N/A" between row 5 and 6 in the original set? My criteria is that the value should ALWAYS be "none" or "None" and if not, "N/A" is input. Error 2 - Here the address row has spilled over to the city row. Except for manually correcting this, is there a way to fix this spillover in a big data set? The pattern I've seen is that usually it's words like "floor" that spill over from the address. Or if there are more than two spaces in the address line, it will spillover. So we have 4556[1 space]Sun[2nd space]Valley[3rd space] Road. Error 3 - arises because of error 2. Error 4 - Assuming errors 1-3 are all addressed, there is a new error, very similar to error 1. Here the value for "Status" is missing, and should be replaced as "N/A" or "Missing" to indicate there was no value for this. The only criteria I can think of is that there should never any spaces contained in the value of this row, but it can contain brackets () and alphanumeric values. Ideally, my cleaned data should look like this: So Ideally, I would like to correct all the Errors, but in terms of importance; Errors 1, 4, 2. Thanks in advance for any help Edit: In the last screenshot, "solution1" it should read "Street" not "Street Address" - my mistake.
... View more