BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
UniversitySas
Quartz | Level 8

Hello,

 

Say I have the following dataset:

 

data example1;
INFILE DATALINES ;
input ID UncleanVariable $25. ;
DATALINES;
1 Cyclone Limited
1 123 Center Street
1 Orlando
1 FL
1 12245
1 None
1 101(a)
1 Fund equipment
1 10000
1 Lagoon Corp
1 3814 Wakefield Ave
1 Oakland
1 CA
1 19406
1 KL21
1 Subsidise staff
1 200
2 Imagine Sports
2 4556 Sun Valley
2 Road
2 Raleigh
2 NC
2 21020
2 None
2 Airfares
2 14000
;
RUN;

 

Here, there are a few errors per ID.
Each row should technically be defined as follows:

 

Row 1 = Grantor

Row 2 = Street

Row 3 = City

Row 4 = State

Row 5 = Postcode

Row 6 = Relationship

Row 7 = Status

Row 8 = Purpose

Row 9 = Contribution Amount

The output I am looking for is:

data solution1;
INFILE DATALINES dsd;
input ID CleanedVariable ~ $30. Category $25. ;
DATALINES;
1,Cyclone Limited,Grantor
1,123 Center Street,Street
1,Orlando,City
1,FL,State
1,12245,Postcode
1,Parent company,Relationship
1,101(a),Status
1,Fund equipment,Purpose
1,10000, Contribution Amount
1,Lagoon Corp,Grantor
1,3814 Wakefield Ave,Street
1,Oakland,City
1,CA,State
1,19406,Postcode
1,N/A,Relationship
1,KL21,Status
1,Subsidise staff,Purpose
1,200,Contribution Amount
2,Imagine Sports,Grantor
2,4556 Sun Valley Road,Street
2,Raleigh,City
2,NC,State
2,21020,Postcode
2,Subsidiary,Relationship
2,Missing,Status
2,Airfares,Purpose
2,14000,Contribution Amount
;
RUN;

There are a few problems I want to address. I'm not sure if there is a one-size fits all solution, so that is okay if there isn't. Let me first visualise the problem, with a few screenshots.

This is the original data:

sasforums_help.jpg

 

Error 1 - The value for "relationship" is missing in the input data, so the "status" row is read prematurely. Is there a way to adjust this so that every Sixth row is either "none" or "None", and if not, insert the value "N/A" between row 5 and 6 in the original set? My criteria is that the value should ALWAYS be "none" or "None" and if not, "N/A" is input.

 

Error 2 - Here the address row has spilled over to the city row. Except for manually correcting this, is there a way to fix this spillover in a big data set? The pattern I've seen is that usually it's words like "floor" that spill over from the address. Or if there are more than two spaces in the address line, it will spillover. So we have 4556[1 space]Sun[2nd space]Valley[3rd space] Road. 

 

Error 3 - arises because of error 2. 

 

Error 4 - Assuming errors 1-3 are all addressed, there is a new error, very similar to error 1. Here the value for "Status" is missing, and should be replaced as "N/A" or "Missing" to indicate there was no value for this. The only criteria I can think of is that there should never any spaces contained in the value of this row, but it can contain brackets () and alphanumeric values. 


Ideally, my cleaned data should look like this:

sashelp2.jpg

 

So Ideally, I would like to correct all the Errors, but in terms of importance;

Errors 1, 4, 2.

 

Thanks in advance for any help

 

Edit: In the last screenshot, "solution1" it should read "Street" not "Street Address" - my mistake.

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

Apologies, just replacing infile with set was not quite sufficient. 

Provided your data is not huge, set point= is the easiest replacement.

 

 


data HAVE;
  input ID UNCLEANVARIABLE $25. ;
cards;
1 Cyclone Limited 
1 123 Center Street
1 Orlando 
1 FL 
1 12245
1 None 
1 101(a) 
1 Fund equipment 
1 10000 
1 Lagoon Corp 
1 3814 Wakefield Ave 
1 Oakland 
1 CA 
1 19406
1 KL21 
1 Subsidise staff 
1 200 
2 Imagine Sports 
2 4556 Sun Valley 
2 Road 
2 Raleigh 
2 NC 
2 21020 
2 None 
2 Airfares
2 14000 
run;
data WANT;
  OBS_NO+1;
  set HAVE point=OBS_NO nobs=NOBS;
  CATEGORY_NO+1;
  if CATEGORY_NO=10 then CATEGORY_NO=1; 
  select(CATEGORY_NO);
 
    %* Street  ;
    when(2 ) if countw(UNCLEANVARIABLE)=3 then do;
      %* see whats next;
      TMP=OBS_NO+1;
      set HAVE(rename=(UNCLEANVARIABLE=UNCLEANVARIABLE2)) point=TMP;
      if UNCLEANVARIABLE2 in('Floor','Road') then do;
        CLEANVARIABLE=catx(' ',UNCLEANVARIABLE,UNCLEANVARIABLE2);
        OBS_NO+1;
      end;
    end;

    %* Relationship;
    when(6 ) if upcase(UNCLEANVARIABLE) ne 'NONE' then do;
      CLEANVARIABLE='N/A';
      output;
      CATEGORY_NO+1;
      CLEANVARIABLE=UNCLEANVARIABLE;
    end;

    otherwise;
  end;
  CLEANVARIABLE=coalescec(CLEANVARIABLE,UNCLEANVARIABLE);
  output;
  if OBS_NO=NOBS then stop;
  keep ID CATEGORY_NO CLEANVARIABLE;
run;

 

Also, how come in your code you have the variables grantor, street, city etc... but they do not appear in the final output in a separate column?
I don't have these variables anywhere. I just created space holders as a model, for you to fill. What you propose for Relationship looks fine. Please do understand the code provided.

The people helping you here are not here do do your work for you and give you a fish, they are here to teach you how to fish.

 

View solution in original post

9 REPLIES 9
ballardw
Super User

First comment: GIGO

 

If you have a "row" that might have a limited number of responses, such as possibly relationship you might be able to test the value and if not in the list advance the "category" value.

 

How many records (IDs in this case) do you have to process? If the number is smallish, say under 100 or so, it may be quicker to edit the file manually by inserting appropriate lines then to try to work out "logic" for something like this.

 

Generally if someone brings something this poor the estimate on hours and $$ goes way up.

And I suggest to them to get a data storage system designed more recently than a shoe box filled with 3x5 cards.

UniversitySas
Quartz | Level 8

In total I have 21 IDs to process, and of the 21, around 16 have constant errors in them.

In terms of total observations, there are over 4 million.

The majority of the errors are the missing "relationship", I'd say about 80% of them. The remaining 20% of the errors are 19% missing "status", and the final 1% of errors are the miscellaneous spillovers such as "Floor" or "Street" being read into the "city"..

 

That's still 40,000 records to deal with, but I'm sure once the initial errors are dealt with, it'll be easier to identify some sort of pattern to deal with it. 

 

 

ChrisNZ
Tourmaline | Level 20

Agree with @ballardw.

Fixing garbage data is expensive and the result is never guaranteed.

If you are stuck with it, this will get you started:

 


data EXAMPLE1;
  input ID UNCLEANVARIABLE $25. ;
  CATEGORY_NO+1;
  if CATEGORY_NO=10 then CATEGORY_NO=1; 
  select(CATEGORY_NO);

    %* Grantor ;      
    when(1 ) ;     
 
    %* Street  ;
    when(2 ) if countw(UNCLEANVARIABLE)=3 then do;
      input ID UNCLEANVARIABLE2 $25. @1 @@;
      if UNCLEANVARIABLE2 in('Floor','Road') then do;
        CLEANVARIABLE=catx(' ',UNCLEANVARIABLE,UNCLEANVARIABLE2);
        input;
      end;
    end;

    %* City     ;
    when(3 ) ;

    %* State    ;
    when(4 ) ;

    %* Postcode ;
    when(5 ) ;

    %* Relationship;
    when(6 ) if upcase(UNCLEANVARIABLE) ne 'NONE' then do;
      CLEANVARIABLE='N/A';
      output;
      CATEGORY_NO+1;
      CLEANVARIABLE=UNCLEANVARIABLE;
    end;

    otherwise;
  end;
  CLEANVARIABLE=coalescec(CLEANVARIABLE,UNCLEANVARIABLE);
  output;
cards;
1 Cyclone Limited 
1 123 Center Street
1 Orlando 
1 FL 
1 12245
1 None 
1 101(a) 
1 Fund equipment 
1 10000 
1 Lagoon Corp 
1 3814 Wakefield Ave 
1 Oakland 
1 CA 
1 19406
1 KL21 
1 Subsidise staff 
1 200 
2 Imagine Sports 
2 4556 Sun Valley 
2 Road 
2 Raleigh 
2 NC 
2 21020 
2 None 
2 Airfares
2 14000 
run;

 

ID CATEGORY_NO CLEANVARIABLE
1 1 Cyclone Limited
1 2 123 Center Street
1 3 Orlando
1 4 FL
1 5 12245
1 6 None
1 7 101(a)
1 8 Fund equipment
1 9 10000
1 1 Lagoon Corp
1 2 3814 Wakefield Ave
1 3 Oakland
1 4 CA
1 5 19406
1 6 N/A
1 7 KL21
1 8 Subsidise staff
1 9 200
2 1 Imagine Sports
2 2 4556 Sun Valley Road
2 3 Raleigh
2 4 NC
2 5 21020
2 6 None
2 7 Airfares
2 8 14000

 

 

UniversitySas
Quartz | Level 8

Hi Chris, thanks so much for your response.


Unfortunately I am indeed stuck with the data, and have accepted that a lot of manual scrubbing is going to be needed. Although, I am still pretty new to SAS and coding, so I really appreciate your response.

 

Could I just clarify how I would amend the code you've written so that I wouldn't need to input the raw data myself, and could just use an already imported data set?

In addition, I would like to have labelled rows, like the following:

IF x=1 THEN Var_Name="Grantor";
	ELSE IF x=2 THEN Var_Name= "Street";
	ELSE IF x=3 THEN Var_Name = "City";
	ELSE IF x=4 THEN Var_Name= "State";
	ELSE IF x=5 THEN Var_Name = "Postcode";
	ELSE IF x=6 THEN Var_Name= "Relationship";
	ELSE IF x=7 THEN Var_Name = "Status";
	ELSE IF x=8 THEN Var_Name= "Purpose";
	ELSE IF x=0 THEN Var_Name= "Contribution";

So that I can then transpose this data into a wide form.


Thanks again!

ChrisNZ
Tourmaline | Level 20

1. just use an already imported data set?

Replace the INPUT statement with a SET statement

 

2. Have labelled rows

Use variable CATEGORY_NO. Your code is fine. Or use my select sections,

UniversitySas
Quartz | Level 8

Thanks for that, I seem to be getting an error though. Would you mind telling me what I'm doing wrong?

I've used this code:

 

 

data EXAMPLE1;
  SET test_code;
   CATEGORY_NO+1;
  if CATEGORY_NO=10 then CATEGORY_NO=1; 
  select(CATEGORY_NO);

    %* Grantor ;      
    when(1 ) ;     
 
    %* Street  ;
    when(2 ) if countw(UNCLEANVARIABLE)=3 then do;
      input ID UNCLEANVARIABLE2 $25. @1 @@;
      if UNCLEANVARIABLE2 in('Floor','Road') then do;
        CLEANVARIABLE=catx(' ',UNCLEANVARIABLE,UNCLEANVARIABLE2);
        input;
      end;
    end;

    %* City     ;
    when(3 ) ;

    %* State    ;
    when(4 ) ;

    %* Postcode ;
    when(5 ) ;

    %* Relationship;
    when(6 ) if upcase(UNCLEANVARIABLE) ne 'NONE' then do;
      CLEANVARIABLE='N/A';
      output;
      CATEGORY_NO+1;
      CLEANVARIABLE=UNCLEANVARIABLE;
    end;

    otherwise;
  end;
  CLEANVARIABLE=coalescec(CLEANVARIABLE,UNCLEANVARIABLE);
  output;

RUN;

Where SET test_code is just the full data set of the sample I posted here.

 


The error I am getting is:

ERROR: No DATALINES or INFILE statement.

 

 

Also, how come in your code you have the variables grantor, street, city etc... but they do not appear in the final output in a separate column?
If I wanted to add the additional: Status, Purpose, and Contribution, would that just be by appending the above code with:

 

 

  %* Status;
    when(7 ) ;

    %* Relationship ;
    when(8 ) ;

    %* Contribution;
    when(9 ) ;

CLEANVARIABLE=coalescec(CLEANVARIABLE,Var);
output;


Thanks!

 

 

ChrisNZ
Tourmaline | Level 20

Apologies, just replacing infile with set was not quite sufficient. 

Provided your data is not huge, set point= is the easiest replacement.

 

 


data HAVE;
  input ID UNCLEANVARIABLE $25. ;
cards;
1 Cyclone Limited 
1 123 Center Street
1 Orlando 
1 FL 
1 12245
1 None 
1 101(a) 
1 Fund equipment 
1 10000 
1 Lagoon Corp 
1 3814 Wakefield Ave 
1 Oakland 
1 CA 
1 19406
1 KL21 
1 Subsidise staff 
1 200 
2 Imagine Sports 
2 4556 Sun Valley 
2 Road 
2 Raleigh 
2 NC 
2 21020 
2 None 
2 Airfares
2 14000 
run;
data WANT;
  OBS_NO+1;
  set HAVE point=OBS_NO nobs=NOBS;
  CATEGORY_NO+1;
  if CATEGORY_NO=10 then CATEGORY_NO=1; 
  select(CATEGORY_NO);
 
    %* Street  ;
    when(2 ) if countw(UNCLEANVARIABLE)=3 then do;
      %* see whats next;
      TMP=OBS_NO+1;
      set HAVE(rename=(UNCLEANVARIABLE=UNCLEANVARIABLE2)) point=TMP;
      if UNCLEANVARIABLE2 in('Floor','Road') then do;
        CLEANVARIABLE=catx(' ',UNCLEANVARIABLE,UNCLEANVARIABLE2);
        OBS_NO+1;
      end;
    end;

    %* Relationship;
    when(6 ) if upcase(UNCLEANVARIABLE) ne 'NONE' then do;
      CLEANVARIABLE='N/A';
      output;
      CATEGORY_NO+1;
      CLEANVARIABLE=UNCLEANVARIABLE;
    end;

    otherwise;
  end;
  CLEANVARIABLE=coalescec(CLEANVARIABLE,UNCLEANVARIABLE);
  output;
  if OBS_NO=NOBS then stop;
  keep ID CATEGORY_NO CLEANVARIABLE;
run;

 

Also, how come in your code you have the variables grantor, street, city etc... but they do not appear in the final output in a separate column?
I don't have these variables anywhere. I just created space holders as a model, for you to fill. What you propose for Relationship looks fine. Please do understand the code provided.

The people helping you here are not here do do your work for you and give you a fish, they are here to teach you how to fish.

 

UniversitySas
Quartz | Level 8
This worked perfectly - thanks so much.
Also, I'm curious about your previous code:
input ID UNCLEANVARIABLE2 $25. @1 @@;

What is the purpose of the "@1" and "@@" here? I tried to read up the definitions, but they didnt really make sense.

Thanks
ChrisNZ
Tourmaline | Level 20

@1 repositions the pointer at the start of the line so the next input statement will read from the start.

@@  keeps the current record open so the next input statement will read the same record again.

 

One @ keeps the record open just for the duration of the current implicit data step loop iteration (until the interpreter reaches run;).

Two @s keep the record open across data step implicit iterations.

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1190 views
  • 0 likes
  • 3 in conversation