BookmarkSubscribeRSS Feed
Melk
Lapis Lazuli | Level 10

I have a dataset and am trying to split it by the presence of a keyword in a variable string.

 

data b c;
    set a;
    if find(school, "ELEMENTARY") then output b;
    else output c;
run;

 

I am getting duplications in my dataset b (not all but a few observations are duplicating), none in dataset c. What am I doing wrong?

12 REPLIES 12
novinosrin
Tourmaline | Level 20

perhaps you have values that are low case or mixed case as well?

 

if yes, to override that, can the following help at all?

 

data b c;
    set a;
    if find(school, "ELEMENTARY",'i') then output b;
    else output c;
run;

 

 

also, what do you mean by duplicates?

 

can you post the sample data you are working with

Melk
Lapis Lazuli | Level 10

I have already set to be all upper case before running the code.

 

sample data A;

 

ID              SCHOOL

1                    ELEMENTARY SCHOOL

2                    MIDDLE SCHOOL

3                    HIGH SCHOOL

4                    DOVER ELEMENTARY

 

sample data B after split:

1                    ELEMENTARY SCHOOL

1                    ELEMENTARY SCHOOL

4                    DOVER ELEMENTARY

 

sample data C after split:

2                    MIDDLE SCHOOL

3                    HIGH SCHOOL

 

 

art297
Opal | Level 21

I'm confused! Nothing in your code suggests that you're trying to eliminate duplicates. Since you didn't get any records in c, I would think that b would contain all of your records.

 

It would help if you provided a sample dataset and, based on that sample, what you expect b and c to look like.

 

Art, CEO, AnalystFinder.com

 

Melk
Lapis Lazuli | Level 10

This should help clarify:

 

sample data A;

 

ID              SCHOOL

1                    ELEMENTARY SCHOOL

2                    MIDDLE SCHOOL

3                    HIGH SCHOOL

4                    DOVER ELEMENTARY

 

sample data B after split:

1                    ELEMENTARY SCHOOL

1                    ELEMENTARY SCHOOL /*this was duplicated - occurred only once in A - dont want duplicate*/;

4                    DOVER ELEMENTARY

 

sample data C after split:

2                    MIDDLE SCHOOL

3                    HIGH SCHOOL

art297
Opal | Level 21

That isn't what I get. I ran:

data a;
  informat SCHOOL $25.;
  input ID              SCHOOL &;
  cards;
1                    ELEMENTARY SCHOOL
2                    MIDDLE SCHOOL
3                    HIGH SCHOOL
4                    DOVER ELEMENTARY
;

data b c;
    set a;
    if find(school, "ELEMENTARY") then output b;
    else output c;
run;

and got the expected two records in b and the other two in c.

 

Art, CEO, AnalystFinder.com

 

Melk
Lapis Lazuli | Level 10

This is a sample dataset, I cannot post the actual dataset. As I mentioned, not all observations are duplicated

Kurt_Bremser
Super User

@Melk wrote:

This is a sample dataset, I cannot post the actual dataset. As I mentioned, not all observations are duplicated


Why do you expect the doctor to give a correct diagnosis if you don't let her see the patient?

Make up a sample dataset that makes the issue happen, and post it here. That it's not your code as initially posted has been proven already.

Tom
Super User Tom
Super User

This step is NOT the source of the duplicates.

Look at the code you used to create the input to this step.

Melk
Lapis Lazuli | Level 10

I run the duplicate check for data A right before this split and there are none, not sure what else it can be.

Tom
Super User Tom
Super User

Not sure what you mean by duplicate check.

 

The only way code like what you posted could duplicate the ELEMENTARY records is if you have two OUTPUT statements being run.

data b c;
    set a;
    if find(school, "ELEMENTARY") then output b;
    if find(school, "ELEMENTARY") then output b;
    else output c;
run;

Or you forgot to tell it output only to C in the ELSE clause.

data b c;
    set a;
    if find(school, "ELEMENTARY") then output b;
    else output ;
run;
Quentin
Super User

As @Tom said, the code you posted cannot create duplicates.  It  reads each of the records in A, and outputs each record to B or C.

 

You might add a duplicate check to your step.  For example, if SCHOOL is a unique ID, you can sort by SCHOOL, and then below IF statement will write an error to the log if there are any duplicate values in A.

 

proc sort data=a;
  by school;
run;

data b c;
    set a;
    by school ;
    if not (first.school and last.school) then putlog "ERROR: duplicate in A " school=; 

    if find(school, "ELEMENTARY") then output b;
    else output c;
run;
Kurt_Bremser
Super User

Use the macro from my footnote #2 to create a data step from your dataset. Then run that code and your split code in one sequence, and see what happens. If the effect is still there, copy/paste that whole code into a code window here (see my footnote #3).

Only then will all people here talk about the same thing, at the moment it's all guesswork because we really do not know the exact structure of your dataset and the exact code you ran. @art297's example very clearly shows that the code you initially posted works as desired, if the data looks like it seems from your other post.

 

This all once again illustrates why the "old hands" here are so adamant about posting example data in data steps. You would very probably have gotten a solution within two or three answers if you had presented your issue in the correct way.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 12 replies
  • 3398 views
  • 5 likes
  • 6 in conversation