DATA Step, Macro, Functions and more

SAS is duplicating observations when I split my dataset

Reply
Regular Contributor
Posts: 150

SAS is duplicating observations when I split my dataset

I have a dataset and am trying to split it by the presence of a keyword in a variable string.

 

data b c;
    set a;
    if find(school, "ELEMENTARY") then output b;
    else output c;
run;

 

I am getting duplications in my dataset b (not all but a few observations are duplicating), none in dataset c. What am I doing wrong?

PROC Star
Posts: 1,570

Re: SAS is duplicating observations when I split my dataset

[ Edited ]

perhaps you have values that are low case or mixed case as well?

 

if yes, to override that, can the following help at all?

 

data b c;
    set a;
    if find(school, "ELEMENTARY",'i') then output b;
    else output c;
run;

 

 

also, what do you mean by duplicates?

 

can you post the sample data you are working with

Regular Contributor
Posts: 150

Re: SAS is duplicating observations when I split my dataset

Posted in reply to novinosrin

I have already set to be all upper case before running the code.

 

sample data A;

 

ID              SCHOOL

1                    ELEMENTARY SCHOOL

2                    MIDDLE SCHOOL

3                    HIGH SCHOOL

4                    DOVER ELEMENTARY

 

sample data B after split:

1                    ELEMENTARY SCHOOL

1                    ELEMENTARY SCHOOL

4                    DOVER ELEMENTARY

 

sample data C after split:

2                    MIDDLE SCHOOL

3                    HIGH SCHOOL

 

 

PROC Star
Posts: 8,146

Re: SAS is duplicating observations when I split my dataset

I'm confused! Nothing in your code suggests that you're trying to eliminate duplicates. Since you didn't get any records in c, I would think that b would contain all of your records.

 

It would help if you provided a sample dataset and, based on that sample, what you expect b and c to look like.

 

Art, CEO, AnalystFinder.com

 

Regular Contributor
Posts: 150

Re: SAS is duplicating observations when I split my dataset

[ Edited ]

This should help clarify:

 

sample data A;

 

ID              SCHOOL

1                    ELEMENTARY SCHOOL

2                    MIDDLE SCHOOL

3                    HIGH SCHOOL

4                    DOVER ELEMENTARY

 

sample data B after split:

1                    ELEMENTARY SCHOOL

1                    ELEMENTARY SCHOOL /*this was duplicated - occurred only once in A - dont want duplicate*/;

4                    DOVER ELEMENTARY

 

sample data C after split:

2                    MIDDLE SCHOOL

3                    HIGH SCHOOL

PROC Star
Posts: 8,146

Re: SAS is duplicating observations when I split my dataset

That isn't what I get. I ran:

data a;
  informat SCHOOL $25.;
  input ID              SCHOOL &;
  cards;
1                    ELEMENTARY SCHOOL
2                    MIDDLE SCHOOL
3                    HIGH SCHOOL
4                    DOVER ELEMENTARY
;

data b c;
    set a;
    if find(school, "ELEMENTARY") then output b;
    else output c;
run;

and got the expected two records in b and the other two in c.

 

Art, CEO, AnalystFinder.com

 

Regular Contributor
Posts: 150

Re: SAS is duplicating observations when I split my dataset

This is a sample dataset, I cannot post the actual dataset. As I mentioned, not all observations are duplicated

Super User
Posts: 9,886

Re: SAS is duplicating observations when I split my dataset

[ Edited ]

@Melk wrote:

This is a sample dataset, I cannot post the actual dataset. As I mentioned, not all observations are duplicated


Why do you expect the doctor to give a correct diagnosis if you don't let her see the patient?

Make up a sample dataset that makes the issue happen, and post it here. That it's not your code as initially posted has been proven already.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Super User
Super User
Posts: 7,932

Re: SAS is duplicating observations when I split my dataset

This step is NOT the source of the duplicates.

Look at the code you used to create the input to this step.

Regular Contributor
Posts: 150

Re: SAS is duplicating observations when I split my dataset

I run the duplicate check for data A right before this split and there are none, not sure what else it can be.

Super User
Super User
Posts: 7,932

Re: SAS is duplicating observations when I split my dataset

Not sure what you mean by duplicate check.

 

The only way code like what you posted could duplicate the ELEMENTARY records is if you have two OUTPUT statements being run.

data b c;
    set a;
    if find(school, "ELEMENTARY") then output b;
    if find(school, "ELEMENTARY") then output b;
    else output c;
run;

Or you forgot to tell it output only to C in the ELSE clause.

data b c;
    set a;
    if find(school, "ELEMENTARY") then output b;
    else output ;
run;
PROC Star
Posts: 1,449

Re: SAS is duplicating observations when I split my dataset

As @Tom said, the code you posted cannot create duplicates.  It  reads each of the records in A, and outputs each record to B or C.

 

You might add a duplicate check to your step.  For example, if SCHOOL is a unique ID, you can sort by SCHOOL, and then below IF statement will write an error to the log if there are any duplicate values in A.

 

proc sort data=a;
  by school;
run;

data b c;
    set a;
    by school ;
    if not (first.school and last.school) then putlog "ERROR: duplicate in A " school=; 

    if find(school, "ELEMENTARY") then output b;
    else output c;
run;
Super User
Posts: 9,886

Re: SAS is duplicating observations when I split my dataset

Use the macro from my footnote #2 to create a data step from your dataset. Then run that code and your split code in one sequence, and see what happens. If the effect is still there, copy/paste that whole code into a code window here (see my footnote #3).

Only then will all people here talk about the same thing, at the moment it's all guesswork because we really do not know the exact structure of your dataset and the exact code you ran. @art297's example very clearly shows that the code you initially posted works as desired, if the data looks like it seems from your other post.

 

This all once again illustrates why the "old hands" here are so adamant about posting example data in data steps. You would very probably have gotten a solution within two or three answers if you had presented your issue in the correct way.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Ask a Question
Discussion stats
  • 12 replies
  • 117 views
  • 5 likes
  • 6 in conversation