BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
kmardinian
Quartz | Level 8

Hi, so I have a merged dataset I created in SAS that looks like this in the output results

 

ID   Alteration    Mutation

1                         ART

1                         COX

2      GBR           LIR

2      ART            EGF

3      LOP            

3      ART             NIH

3                         EGF

 

My problem is, I want to remove all the observations that do not have both a value for Alteration and Mutation for each ID. But unfortunately, I have an excel file with duplicate ID numbers for each mutation. So for example, I'd like to find a way to remove the first two rows, without losing the last row (ID:3) since ID:3 does have some rows with both alteration and mutation results.

Thank you!

1 ACCEPTED SOLUTION

Accepted Solutions
art297
Opal | Level 21

No, you won't lose them. In fact, you don't even have to initialize a and m between ID values, as they are automatically initialized. E.g., try:

data have;
  input ID   Alteration $    Mutation $;
  cards;
1      .         ART
1      .         COX
2      GBR       LIR
2      ART       EGF
3      LOP        .
3      ART       NIH
3      .         EGF
4      .         ART
4      .         COX
;

data want (drop=a m);
  do until (last.id);
    set have;
    by id;
    if not missing(alteration) then a=1;
    if not missing(mutation) then m=1;
  end;
  do until (last.id);
    set have;
    by id;
    if a and m then output;
  end;
run;

Art, CEO, AnalystFinder.com

 

View solution in original post

15 REPLIES 15
PaigeMiller
Diamond | Level 26

Your explanation makes it sound like removing the first two rows will remove anything with ID=3, and that's not making sense to me. Can you explain further?

 

Or do you want something simple like this:

 

if missing(alteration) or missing(mutation) then delete;
--
Paige Miller
kmardinian
Quartz | Level 8

I have two tests that were conducted for people. One called Mutation and the other one, Alteration. Not everyone got both tests, which is why 1 is missing values for Alteration. But 3 did get both tests, but there are blanks from how the two datasets were merged.

 

I only want the data from people who took both tests and I am trying to get rid of ID numbers that only took one of the tests

 

If I used the code that you provided me with, would I not lose my last row since it's missing an observation in Alteration?

 

if missing(alteration) or missing(mutation) then delete;

 

art297
Opal | Level 21

If I correctly understand what you want to do, you need to make two passes thru your data. A DOW loop is good for that. e.g.:

data have;
  input ID   Alteration $    Mutation $;
  cards;
1      .         ART
1      .         COX
2      GBR       LIR
2      ART       EGF
3      LOP        .
3      ART       NIH
3      .         EGF
;

data want;
  do until (last.id);
    set have;
    by id;
    if first.id then do;
      a=0;
      m=0;
    end;
    if not missing(alteration) then a=1;
    if not missing(mutation) then m=1;
  end;
  do until (last.id);
    set have;
    by id;
    if a and m then output;
  end;
run;

Art, CEO, AnalystFinder.com

 

kmardinian
Quartz | Level 8

Hi art297, 

 

Do you mind walking me through what the first do statement is doing exactly?

Thank you!

art297
Opal | Level 21

Sure. With a DOW loop you're taking control of how SAS processes the dataset. All of the variable values you create are retained.

 

Thus, in the first do loop, separately for each id, a and m are only set to 1 if both values are present.

 

The second do loop starts as soon as the first loop reads the last record for a given id. Thus, if both a and m are present, it outputs the records. Then it goes back to the first do loop to process the next id.

 

Art, CEO, AnalystFinder.com

 

You can find more technical descriptions if you search the web for "DOW loop". More than likely the authors will be Whitlock, Dorfman or Henderson as they are the ones how came up with the concept.

 

kmardinian
Quartz | Level 8
Thanks art, I think that might work! And I won't lose the LOP and EGF observations for ID 3 since they have missing values for alteration and mutation? Thank you so much!
art297
Opal | Level 21

No, you won't lose them. In fact, you don't even have to initialize a and m between ID values, as they are automatically initialized. E.g., try:

data have;
  input ID   Alteration $    Mutation $;
  cards;
1      .         ART
1      .         COX
2      GBR       LIR
2      ART       EGF
3      LOP        .
3      ART       NIH
3      .         EGF
4      .         ART
4      .         COX
;

data want (drop=a m);
  do until (last.id);
    set have;
    by id;
    if not missing(alteration) then a=1;
    if not missing(mutation) then m=1;
  end;
  do until (last.id);
    set have;
    by id;
    if a and m then output;
  end;
run;

Art, CEO, AnalystFinder.com

 

kmardinian
Quartz | Level 8

Hi art, so I just tried running the code and unfortunately it won't run. I keep getting this note over and over again

NOTE: Invalid numeric data, Alteration='MYST3' , at line 72 column 4.

 

and then this one:  last.ID=1 ID=0463 mutation= Protocol=predict Alteration=SETD2 FIRST.MRN=0 _ERROR_=1 _N_=5

 

Do you know what this issue might be? Thank you!

kmardinian
Quartz | Level 8
Thank you for all the help!
art297
Opal | Level 21

Post the full set of code that you're running along with the log.

 

Art, CEO, AnalystFinder.com

 

kmardinian
Quartz | Level 8

Here is my code so far:

PROC IMPORT DATAFILE="xlsx"

   OUT=Guardant

   DBMS=XLSX

   REPLACE;

RUN;

 

 

PROC IMPORT DATAFILE="xlsx"

   OUT=Patient

   DBMS=XLSX

   REPLACE;

RUN;

 

 

PROC IMPORT DATAFILE=".xlsx"

   OUT=Foundation

   DBMS=XLSX

   REPLACE;

RUN;

 

proc sort data= G  out=G1;

 by ID;

 run;

 

proc sort data= P  out=P1;

 by ID;

 run;

 

proc sort data=F out=F1;

by ID;

run;

 

 

Data CombinedData;

 Merge G1 P1 F1;

 By  ID;

 Run;

 

Data MCC1; 

Set CombinedData (keep= Mutation ID Alteration protocol);

If protocol= "No" then delete;

if missing (mutation) and missing(alteration) then delete;

run;

 

Data Final (drop=alteration mutation);

do until (last.ID);

set MCC1;

By ID;

if not missing (alteration) then alteration=1;

if not missing(mutation) then mutation=1;

end;

do until (last.ID);

set MCC1;

by ID;

if alteration and mutation then output;

end;

Run;

 

 

Here is my LOG: 

 

 1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;

 61        

 62         Data MCC1;

 63         Set CombinedData (keep= Mutation ID Alteration protocol);

 64         If protocol= "No" then delete;

 65         if missing (Mutation) and missing(alteration) then delete;

 66         run;

 

 NOTE: There were 38390 observations read from the data set WORK.COMBINEDDATA.

 NOTE: The data set WORK.MCC1 has 30871 observations and 4 variables.

 NOTE: DATA statement used (Total process time):

       real time           0.04 seconds

       cpu time            0.03 seconds

      

 

 67        

 68         Data Final (drop=alteration Mutation);

 69         do until (last.ID);

 70         set MCC1;

 71         By ID;

 72         if not missing (alteration) then alteration='1';

 73         if not missing(Mutation) then Mutation='1';

 74         end;

 75         do until (last.ID);

 76         set MCC1;

 77         by ID;

 78         if alteration and Mutation then output;

 79         end;

 80         Run;

 

 NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).

       78:4    78:19  

 NOTE: Invalid numeric data, Alteration='NA' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='NA' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='NA' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='NA' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='NA' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='LRP1B' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='TSC1' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='FANCA' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='MLL3' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='SNCAIP' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='MLL' , at line 78 column 4.

 last.ID=1 ID=03697703 Mutation=  Protocol=predict Alteration=MLL FIRST.ID=0 _ERROR_=1 _N_=1

 NOTE: Invalid numeric data, Alteration='NTRK1' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='TP53' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='CUL3' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='TP53' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='ARID1A' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='MAP3K1' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='MLL' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='MYST3' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='NTRK1' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='TET2' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='CDKN2B' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='EPHB1' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='PIK3CB' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='TERC' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='PRKCI' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='KRAS' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='BCL2L2' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='NFKBIA' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='NKX2-1' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='PIK3C2B' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='MDM4' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='IKBKE' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='PIK3CA' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='SOX2' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='EGFR' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='MYC' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='ZNF703' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='FGFR1' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='CDKN2A/B' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='SMAD4' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='SDHC' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='DDR2' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='GSK3B' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='ATR' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='KLHL6' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='BCL6' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='GPR124' , at line 78 column 4.

 last.ID=1 ID=03907441 Mutation=  Protocol=PREDICT Alteration=GPR124 FIRST.ID=0 _ERROR_=1 _N_=2

 NOTE: Invalid numeric data, Alteration='PTPN11' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='NF1' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='NF1' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='TSC1' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='CDKN2C' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='DOT1L' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='ESR1' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='IL7R' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='MSH6' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='PIK3CG' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='SPTA1' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='STK11' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='PDGFRA' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='KIT' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='CDKN2A/B' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='TET2' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='FBXW7' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='IRF2' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='FAT1' , at line 78 column 4.

 last.ID=1 ID=03921269 Mutation=  Protocol=PREDICT Alteration=FAT1 FIRST.ID=0 _ERROR_=1 _N_=3

 NOTE: Invalid numeric data, Alteration='TP53' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='ARID1B' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='ATM' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='BRCA2' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='BRCA2' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='FOXP1' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='FRS2' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='HSP90AA1' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='MEN1' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='MLL2' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='RB1' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='TP53' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='WISP3' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='ZNF217' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='HSP90AA1' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='KMT2C' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='GATA6' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='MCL1' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='ZNF217' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='MYCN' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='FH' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='FUBP1' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='RANBP2' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='INPP4B' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='APC' , at line 78 column 4.

 last.ID=1 ID=04090148 Mutation=  Protocol=PREDICT Alteration=APC FIRST.ID=0 _ERROR_=1 _N_=4

 NOTE: Invalid numeric data, Alteration='BRAF' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='PREX2' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='EP300' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='SETD2' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='BCORL1' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='CREBBP' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='MLL' , at line 78 column 4.

 NOTE: Invalid numeric data, Alteration='MYST3' , at line 78 column 4.

 NOTE: Over 100 NOTES, additional NOTES suppressed.

 NOTE: Invalid numeric data, Alteration='SETD2' , at line 78 column 4.

 last.ID=1 ID=04463 Mutation=  Protocol=predict Alteration=SETD2 FIRST.ID=0 _ERROR_=1 _N_=5

 last.ID=1 ID=04700 Mutation=  Protocol=predict Alteration=SNCAIP FIRST.ID=0 _ERROR_=1 _N_=6

 last.ID=1 ID=04390 Mutation=  Protocol=PREDICT Alteration=SPTA1 FIRST.ID=0 _ERROR_=1 _N_=7

 last.ID=1 ID=04482 Mutation=  Protocol=PREDICT Alteration=TAF1 FIRST.ID=0 _ERROR_=1 _N_=10

 last.ID=1 ID=04100 Mutation=  Protocol=PREDICT Alteration=PAK3 FIRST.ID=0 _ERROR_=1 _N_=11

 last.ID=1 ID=56035 Mutation=  Protocol=PREDICT Alteration=NA FIRST.ID=0 _ERROR_=1 _N_=12

 last.ID=1 ID=04757 Mutation=  Protocol=PREDICT Alteration=KLHL6 FIRST.ID=0 _ERROR_=1 _N_=13

 last.ID=1 ID=04962 Mutation=  Protocol=predict Alteration=RUNX1T1 FIRST.ID=0 _ERROR_=1 _N_=15

 last.ID=1 ID=04984 Mutation=  Protocol=predict Alteration=PIK3R2 FIRST.ID=0 _ERROR_=1 _N_=16

 last.ID=1 ID=05027 Mutation=  Protocol=predict Alteration=FANCL FIRST.ID=0 _ERROR_=1 _N_=17

 last.ID=1 ID=05502 Mutation=  Protocol=PREDICT Alteration=SPTA1 FIRST.ID=0 _ERROR_=1 _N_=20

 last.ID=1 ID=04312 Mutation=  Protocol=predict Alteration=KMT2A FIRST.ID=0 _ERROR_=1 _N_=22

 last.ID=1 ID=05476 Mutation=  Protocol=  Alteration=RARA FIRST.ID=0 _ERROR_=1 _N_=24

 last.ID=1 ID=00902 Mutation=  Protocol=predict Alteration=NA FIRST.ID=0 _ERROR_=1 _N_=25

 last.ID=1 ID=05856 Mutation=  Protocol=PREDICT Alteration=VHL FIRST.ID=0 _ERROR_=1 _N_=27

 WARNING: Limit set by ERRORS= option reached.  Further errors of this type will not be printed.

 last.ID=1 ID=05474 Mutation=  Protocol=predict Alteration=SETD2 FIRST.ID=0 _ERROR_=1 _N_=28

 NOTE: There were 30871 observations read from the data set WORK.MCC1.

 NOTE: There were 30871 observations read from the data set WORK.MCC1.

 NOTE: The data set WORK.FINAL has 0 observations and 2 variables.

 NOTE: DATA statement used (Total process time):

       real time           0.03 seconds

       cpu time            0.02 seconds

      

 

 81        

 82        

 83        

 84         OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;

 97        

 

kmardinian
Quartz | Level 8

I'm definitely a SAS beginner, so I apologize for the organization of my code. 

 

I did try to change the alteration=1 to alteration='1' in the do loop, since I thought it was a numeric vs. character issue. But that didn't solve the problem

art297
Opal | Level 21

You didn't follow my suggested code. I created two new variables, namely a and m, and then dropped them as part of the data statement.

 

You tried to replace existing variables which risks bringing up the types of problems for which you got errors. You never mentioned anything about wanting to replace existing variable values with something else.

 

I would simply use the code the way I suggested.

 

Art, CEO, AnalystFinder.com

 

kmardinian
Quartz | Level 8

Yep, that was it. I'm sorry, for some reason I saw those as abbreviations! That worked though, thank you very much Art!

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 15 replies
  • 2901 views
  • 0 likes
  • 4 in conversation