I have a dataset and am trying to split it by the presence of a keyword in a variable string.
data b c;
    set a;
    if find(school, "ELEMENTARY") then output b;
    else output c;
run;
I am getting duplications in my dataset b (not all but a few observations are duplicating), none in dataset c. What am I doing wrong?
perhaps you have values that are low case or mixed case as well?
if yes, to override that, can the following help at all?
data b c;
    set a;
    if find(school, "ELEMENTARY",'i') then output b;
    else output c;
run;
also, what do you mean by duplicates?
can you post the sample data you are working with
I have already set to be all upper case before running the code.
sample data A;
ID SCHOOL
1 ELEMENTARY SCHOOL
2 MIDDLE SCHOOL
3 HIGH SCHOOL
4 DOVER ELEMENTARY
sample data B after split:
1 ELEMENTARY SCHOOL
1 ELEMENTARY SCHOOL
4 DOVER ELEMENTARY
sample data C after split:
2 MIDDLE SCHOOL
3 HIGH SCHOOL
I'm confused! Nothing in your code suggests that you're trying to eliminate duplicates. Since you didn't get any records in c, I would think that b would contain all of your records.
It would help if you provided a sample dataset and, based on that sample, what you expect b and c to look like.
Art, CEO, AnalystFinder.com
This should help clarify:
sample data A;
ID SCHOOL
1 ELEMENTARY SCHOOL
2 MIDDLE SCHOOL
3 HIGH SCHOOL
4 DOVER ELEMENTARY
sample data B after split:
1 ELEMENTARY SCHOOL
1 ELEMENTARY SCHOOL /*this was duplicated - occurred only once in A - dont want duplicate*/;
4 DOVER ELEMENTARY
sample data C after split:
2 MIDDLE SCHOOL
3 HIGH SCHOOL
That isn't what I get. I ran:
data a;
  informat SCHOOL $25.;
  input ID              SCHOOL &;
  cards;
1                    ELEMENTARY SCHOOL
2                    MIDDLE SCHOOL
3                    HIGH SCHOOL
4                    DOVER ELEMENTARY
;
data b c;
    set a;
    if find(school, "ELEMENTARY") then output b;
    else output c;
run;
and got the expected two records in b and the other two in c.
Art, CEO, AnalystFinder.com
This is a sample dataset, I cannot post the actual dataset. As I mentioned, not all observations are duplicated
@Melk wrote:
This is a sample dataset, I cannot post the actual dataset. As I mentioned, not all observations are duplicated
Why do you expect the doctor to give a correct diagnosis if you don't let her see the patient?
Make up a sample dataset that makes the issue happen, and post it here. That it's not your code as initially posted has been proven already.
This step is NOT the source of the duplicates.
Look at the code you used to create the input to this step.
I run the duplicate check for data A right before this split and there are none, not sure what else it can be.
Not sure what you mean by duplicate check.
The only way code like what you posted could duplicate the ELEMENTARY records is if you have two OUTPUT statements being run.
data b c;
    set a;
    if find(school, "ELEMENTARY") then output b;
    if find(school, "ELEMENTARY") then output b;
    else output c;
run;Or you forgot to tell it output only to C in the ELSE clause.
data b c;
    set a;
    if find(school, "ELEMENTARY") then output b;
    else output ;
run;As @Tom said, the code you posted cannot create duplicates. It reads each of the records in A, and outputs each record to B or C.
You might add a duplicate check to your step. For example, if SCHOOL is a unique ID, you can sort by SCHOOL, and then below IF statement will write an error to the log if there are any duplicate values in A.
proc sort data=a;
  by school;
run;
data b c;
    set a;
    by school ;
    if not (first.school and last.school) then putlog "ERROR: duplicate in A " school=; 
    if find(school, "ELEMENTARY") then output b;
    else output c;
run;Use the macro from my footnote #2 to create a data step from your dataset. Then run that code and your split code in one sequence, and see what happens. If the effect is still there, copy/paste that whole code into a code window here (see my footnote #3).
Only then will all people here talk about the same thing, at the moment it's all guesswork because we really do not know the exact structure of your dataset and the exact code you ran. @art297's example very clearly shows that the code you initially posted works as desired, if the data looks like it seems from your other post.
This all once again illustrates why the "old hands" here are so adamant about posting example data in data steps. You would very probably have gotten a solution within two or three answers if you had presented your issue in the correct way.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
