Solved: Re: How to conditionally insert a row in long form data

UniversitySas · Posted 09-26-2018 06:38 PM

Hello,

Say I have the following dataset:

data example1;
INFILE DATALINES ;
 input ID UncleanVariable $25. ;
 DATALINES;
1 Cyclone Limited 
1 123 Center Street
1 Orlando 
1 FL 
1 12245
1 None 
1 101(a) 
1 Fund equipment 
1 10000 
1 Lagoon Corp 
1 3814 Wakefield Ave 
1 Oakland 
1 CA 
1 19406
1 KL21 
1 Subsidise staff 
1 200 
2 Imagine Sports 
2 4556 Sun Valley 
2 Road 
2 Raleigh 
2 NC 
2 21020 
2 None 
2 Airfares
2 14000 
;
RUN;

Here, there are a few errors per ID.
Each row should technically be defined as follows:

Row 1 = Grantor
Row 2 = Street
Row 3 = City
Row 4 = State
Row 5 = Postcode
Row 6 = Relationship
Row 7 = Status
Row 8 = Purpose
Row 9 = Contribution Amount

The output I am looking for is:

data solution1;
INFILE DATALINES dsd;
 input ID CleanedVariable ~ $30. Category $25. ;
 DATALINES;
1,Cyclone Limited,Grantor
1,123 Center Street,Street
1,Orlando,City
1,FL,State
1,12245,Postcode
1,Parent company,Relationship
1,101(a),Status
1,Fund equipment,Purpose 
1,10000, Contribution Amount
1,Lagoon Corp,Grantor 
1,3814 Wakefield Ave,Street 
1,Oakland,City
1,CA,State 
1,19406,Postcode
1,N/A,Relationship
1,KL21,Status
1,Subsidise staff,Purpose
1,200,Contribution Amount
2,Imagine Sports,Grantor
2,4556 Sun Valley Road,Street
2,Raleigh,City
2,NC,State
2,21020,Postcode
2,Subsidiary,Relationship
2,Missing,Status
2,Airfares,Purpose
2,14000,Contribution Amount
;
RUN;

There are a few problems I want to address. I'm not sure if there is a one-size fits all solution, so that is okay if there isn't. Let me first visualise the problem, with a few screenshots.

This is the original data:

Error 1 - The value for "relationship" is missing in the input data, so the "status" row is read prematurely. Is there a way to adjust this so that every Sixth row is either "none" or "None", and if not, insert the value "N/A" between row 5 and 6 in the original set? My criteria is that the value should ALWAYS be "none" or "None" and if not, "N/A" is input.

Error 2 - Here the address row has spilled over to the city row. Except for manually correcting this, is there a way to fix this spillover in a big data set? The pattern I've seen is that usually it's words like "floor" that spill over from the address. Or if there are more than two spaces in the address line, it will spillover. So we have 4556[1 space]Sun[2nd space]Valley[3rd space] Road.

Error 3 - arises because of error 2.

Error 4 - Assuming errors 1-3 are all addressed, there is a new error, very similar to error 1. Here the value for "Status" is missing, and should be replaced as "N/A" or "Missing" to indicate there was no value for this. The only criteria I can think of is that there should never any spaces contained in the value of this row, but it can contain brackets () and alphanumeric values.

Ideally, my cleaned data should look like this:

So Ideally, I would like to correct all the Errors, but in terms of importance;

Errors 1, 4, 2.

Thanks in advance for any help

Edit: In the last screenshot, "solution1" it should read "Street" not "Street Address" - my mistake.

ChrisNZ · Posted 09-30-2018 05:53 PM

Apologies, just replacing infile with set was not quite sufficient.

Provided your data is not huge, set point= is the easiest replacement.


data HAVE;
  input ID UNCLEANVARIABLE $25. ;
cards;
1 Cyclone Limited 
1 123 Center Street
1 Orlando 
1 FL 
1 12245
1 None 
1 101(a) 
1 Fund equipment 
1 10000 
1 Lagoon Corp 
1 3814 Wakefield Ave 
1 Oakland 
1 CA 
1 19406
1 KL21 
1 Subsidise staff 
1 200 
2 Imagine Sports 
2 4556 Sun Valley 
2 Road 
2 Raleigh 
2 NC 
2 21020 
2 None 
2 Airfares
2 14000 
run;
data WANT;
  OBS_NO+1;
  set HAVE point=OBS_NO nobs=NOBS;
  CATEGORY_NO+1;
  if CATEGORY_NO=10 then CATEGORY_NO=1; 
  select(CATEGORY_NO);
 
    %* Street  ;
    when(2 ) if countw(UNCLEANVARIABLE)=3 then do;
      %* see whats next;
      TMP=OBS_NO+1;
      set HAVE(rename=(UNCLEANVARIABLE=UNCLEANVARIABLE2)) point=TMP;
      if UNCLEANVARIABLE2 in('Floor','Road') then do;
        CLEANVARIABLE=catx(' ',UNCLEANVARIABLE,UNCLEANVARIABLE2);
        OBS_NO+1;
      end;
    end;

    %* Relationship;
    when(6 ) if upcase(UNCLEANVARIABLE) ne 'NONE' then do;
      CLEANVARIABLE='N/A';
      output;
      CATEGORY_NO+1;
      CLEANVARIABLE=UNCLEANVARIABLE;
    end;

    otherwise;
  end;
  CLEANVARIABLE=coalescec(CLEANVARIABLE,UNCLEANVARIABLE);
  output;
  if OBS_NO=NOBS then stop;
  keep ID CATEGORY_NO CLEANVARIABLE;
run;

Also, how come in your code you have the variables grantor, street, city etc... but they do not appear in the final output in a separate column?
I don't have these variables anywhere. I just created space holders as a model, for you to fill. What you propose for Relationship looks fine. Please do understand the code provided.

The people helping you here are not here do do your work for you and give you a fish, they are here to teach you how to fish.

High-Performance SAS Coding - Third Edition

View solution in original post

ballardw · Posted 09-26-2018 06:57 PM

First comment: GIGO

If you have a "row" that might have a limited number of responses, such as possibly relationship you might be able to test the value and if not in the list advance the "category" value.

How many records (IDs in this case) do you have to process? If the number is smallish, say under 100 or so, it may be quicker to edit the file manually by inserting appropriate lines then to try to work out "logic" for something like this.

Generally if someone brings something this poor the estimate on hours and $$ goes way up.

And I suggest to them to get a data storage system designed more recently than a shoe box filled with 3x5 cards.

UniversitySas · Posted 09-27-2018 03:53 PM

In total I have 21 IDs to process, and of the 21, around 16 have constant errors in them.

In terms of total observations, there are over 4 million.

The majority of the errors are the missing "relationship", I'd say about 80% of them. The remaining 20% of the errors are 19% missing "status", and the final 1% of errors are the miscellaneous spillovers such as "Floor" or "Street" being read into the "city"..

That's still 40,000 records to deal with, but I'm sure once the initial errors are dealt with, it'll be easier to identify some sort of pattern to deal with it.

ChrisNZ · Posted 09-26-2018 07:57 PM

Agree with @ballardw.

Fixing garbage data is expensive and the result is never guaranteed.

If you are stuck with it, this will get you started:


data EXAMPLE1;
  input ID UNCLEANVARIABLE $25. ;
  CATEGORY_NO+1;
  if CATEGORY_NO=10 then CATEGORY_NO=1; 
  select(CATEGORY_NO);

    %* Grantor ;      
    when(1 ) ;     
 
    %* Street  ;
    when(2 ) if countw(UNCLEANVARIABLE)=3 then do;
      input ID UNCLEANVARIABLE2 $25. @1 @@;
      if UNCLEANVARIABLE2 in('Floor','Road') then do;
        CLEANVARIABLE=catx(' ',UNCLEANVARIABLE,UNCLEANVARIABLE2);
        input;
      end;
    end;

    %* City     ;
    when(3 ) ;

    %* State    ;
    when(4 ) ;

    %* Postcode ;
    when(5 ) ;

    %* Relationship;
    when(6 ) if upcase(UNCLEANVARIABLE) ne 'NONE' then do;
      CLEANVARIABLE='N/A';
      output;
      CATEGORY_NO+1;
      CLEANVARIABLE=UNCLEANVARIABLE;
    end;

    otherwise;
  end;
  CLEANVARIABLE=coalescec(CLEANVARIABLE,UNCLEANVARIABLE);
  output;
cards;
1 Cyclone Limited 
1 123 Center Street
1 Orlando 
1 FL 
1 12245
1 None 
1 101(a) 
1 Fund equipment 
1 10000 
1 Lagoon Corp 
1 3814 Wakefield Ave 
1 Oakland 
1 CA 
1 19406
1 KL21 
1 Subsidise staff 
1 200 
2 Imagine Sports 
2 4556 Sun Valley 
2 Road 
2 Raleigh 
2 NC 
2 21020 
2 None 
2 Airfares
2 14000 
run;

ID	CATEGORY_NO	CLEANVARIABLE
1	1	Cyclone Limited
1	2	123 Center Street
1	3	Orlando
1	4	FL
1	5	12245
1	6	None
1	7	101(a)
1	8	Fund equipment
1	9	10000
1	1	Lagoon Corp
1	2	3814 Wakefield Ave
1	3	Oakland
1	4	CA
1	5	19406
1	6	N/A
1	7	KL21
1	8	Subsidise staff
1	9	200
2	1	Imagine Sports
2	2	4556 Sun Valley Road
2	3	Raleigh
2	4	NC
2	5	21020
2	6	None
2	7	Airfares
2	8	14000

High-Performance SAS Coding - Third Edition

UniversitySas · Posted 09-27-2018 03:38 PM

Hi Chris, thanks so much for your response.

Unfortunately I am indeed stuck with the data, and have accepted that a lot of manual scrubbing is going to be needed. Although, I am still pretty new to SAS and coding, so I really appreciate your response.

Could I just clarify how I would amend the code you've written so that I wouldn't need to input the raw data myself, and could just use an already imported data set?

In addition, I would like to have labelled rows, like the following:

IF x=1 THEN Var_Name="Grantor";
	ELSE IF x=2 THEN Var_Name= "Street";
	ELSE IF x=3 THEN Var_Name = "City";
	ELSE IF x=4 THEN Var_Name= "State";
	ELSE IF x=5 THEN Var_Name = "Postcode";
	ELSE IF x=6 THEN Var_Name= "Relationship";
	ELSE IF x=7 THEN Var_Name = "Status";
	ELSE IF x=8 THEN Var_Name= "Purpose";
	ELSE IF x=0 THEN Var_Name= "Contribution";

So that I can then transpose this data into a wide form.

Thanks again!

ChrisNZ · Posted 09-27-2018 06:39 PM

1. just use an already imported data set?

Replace the INPUT statement with a SET statement

2. Have labelled rows

Use variable CATEGORY_NO. Your code is fine. Or use my select sections,

High-Performance SAS Coding - Third Edition

UniversitySas · Posted 09-29-2018 04:08 PM

Thanks for that, I seem to be getting an error though. Would you mind telling me what I'm doing wrong?

I've used this code:

data EXAMPLE1;
  SET test_code;
   CATEGORY_NO+1;
  if CATEGORY_NO=10 then CATEGORY_NO=1; 
  select(CATEGORY_NO);

    %* Grantor ;      
    when(1 ) ;     
 
    %* Street  ;
    when(2 ) if countw(UNCLEANVARIABLE)=3 then do;
      input ID UNCLEANVARIABLE2 $25. @1 @@;
      if UNCLEANVARIABLE2 in('Floor','Road') then do;
        CLEANVARIABLE=catx(' ',UNCLEANVARIABLE,UNCLEANVARIABLE2);
        input;
      end;
    end;

    %* City     ;
    when(3 ) ;

    %* State    ;
    when(4 ) ;

    %* Postcode ;
    when(5 ) ;

    %* Relationship;
    when(6 ) if upcase(UNCLEANVARIABLE) ne 'NONE' then do;
      CLEANVARIABLE='N/A';
      output;
      CATEGORY_NO+1;
      CLEANVARIABLE=UNCLEANVARIABLE;
    end;

    otherwise;
  end;
  CLEANVARIABLE=coalescec(CLEANVARIABLE,UNCLEANVARIABLE);
  output;

RUN;

Where SET test_code is just the full data set of the sample I posted here.

The error I am getting is:

ERROR: No DATALINES or INFILE statement.

Also, how come in your code you have the variables grantor, street, city etc... but they do not appear in the final output in a separate column?
If I wanted to add the additional: Status, Purpose, and Contribution, would that just be by appending the above code with:

  %* Status;
    when(7 ) ;

    %* Relationship ;
    when(8 ) ;

    %* Contribution;
    when(9 ) ;

 CLEANVARIABLE=coalescec(CLEANVARIABLE,Var);
 output;

Thanks!

ChrisNZ · Posted 09-30-2018 05:53 PM

Apologies, just replacing infile with set was not quite sufficient.

Provided your data is not huge, set point= is the easiest replacement.


data HAVE;
  input ID UNCLEANVARIABLE $25. ;
cards;
1 Cyclone Limited 
1 123 Center Street
1 Orlando 
1 FL 
1 12245
1 None 
1 101(a) 
1 Fund equipment 
1 10000 
1 Lagoon Corp 
1 3814 Wakefield Ave 
1 Oakland 
1 CA 
1 19406
1 KL21 
1 Subsidise staff 
1 200 
2 Imagine Sports 
2 4556 Sun Valley 
2 Road 
2 Raleigh 
2 NC 
2 21020 
2 None 
2 Airfares
2 14000 
run;
data WANT;
  OBS_NO+1;
  set HAVE point=OBS_NO nobs=NOBS;
  CATEGORY_NO+1;
  if CATEGORY_NO=10 then CATEGORY_NO=1; 
  select(CATEGORY_NO);
 
    %* Street  ;
    when(2 ) if countw(UNCLEANVARIABLE)=3 then do;
      %* see whats next;
      TMP=OBS_NO+1;
      set HAVE(rename=(UNCLEANVARIABLE=UNCLEANVARIABLE2)) point=TMP;
      if UNCLEANVARIABLE2 in('Floor','Road') then do;
        CLEANVARIABLE=catx(' ',UNCLEANVARIABLE,UNCLEANVARIABLE2);
        OBS_NO+1;
      end;
    end;

    %* Relationship;
    when(6 ) if upcase(UNCLEANVARIABLE) ne 'NONE' then do;
      CLEANVARIABLE='N/A';
      output;
      CATEGORY_NO+1;
      CLEANVARIABLE=UNCLEANVARIABLE;
    end;

    otherwise;
  end;
  CLEANVARIABLE=coalescec(CLEANVARIABLE,UNCLEANVARIABLE);
  output;
  if OBS_NO=NOBS then stop;
  keep ID CATEGORY_NO CLEANVARIABLE;
run;

Also, how come in your code you have the variables grantor, street, city etc... but they do not appear in the final output in a separate column?
I don't have these variables anywhere. I just created space holders as a model, for you to fill. What you propose for Relationship looks fine. Please do understand the code provided.

The people helping you here are not here do do your work for you and give you a fish, they are here to teach you how to fish.

High-Performance SAS Coding - Third Edition

UniversitySas · Posted 10-16-2018 01:17 PM

This worked perfectly - thanks so much.
Also, I'm curious about your previous code:
input ID UNCLEANVARIABLE2 $25. @1 @@;

What is the purpose of the "@1" and "@@" here? I tried to read up the definitions, but they didnt really make sense.

Thanks

ChrisNZ · Posted 10-16-2018 05:36 PM

@1 repositions the pointer at the start of the line so the next input statement will read from the start.

@@ keeps the current record open so the next input statement will read the same record again.

One @ keeps the record open just for the duration of the current implicit data step loop iteration (until the interpreter reaches run;).

Two @s keep the record open across data step implicit iterations.

High-Performance SAS Coding - Third Edition

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!