BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
kmardinian
Quartz | Level 8

Hi, I'm running a dataset filled with different gene mutations listed like AKT1 EL7K, AKT1 E49K, and AKT1 F55L. And I'd like to group them all into one variable named AKT1. This is what I had based on what I foudn online, but it hasn't been working.

 

Data MCCAssign1;
Set CombinedData (keep= N Name Notes protocol);
If missing(protocol) then delete;
if findw(N,'AKT1') then Ncat=AKT; end;

 

Any help is much appreciated!

Thank you!

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

 

Are you saying you have 50 vars for each individual?  But you want a single frequency table over all those 50, right?

 

Presumably you want to run a proc freq on a data set derived from your data set.  The derived data set (NEED below) would have 50 observations for each observation of 50 vars in your original dataset HAVE.  You can do this in a data step with an array declaration containing all of your original vars as elements of the array, and then loop over the array, generating and outputting the main gene from each array element.

 

 

 

data need (keep=name main_gene);
  set have;
  array n {*} n1-n50;
  do I=1 to dim(n);
    main_gene=scan(n{I},1);
    output;
  end;
run;

proc freq data=need;
  tables main_gene;
run;

 

Notes:

  1. If a given main gene value occurs more than once in the 50 vars, this program counts it every time.  If you only want to count main gene once, then the simplest solution is to put this between the DATA step and the PROC FREQ:
         proc sort data=need noduprec;
           by name main_gene;
         run;
  2. If your original vars are NOT neatly named N1, N2, ... N50, you can still use the array.  Just replace the "N1-N50" with the list of vars in the array statement.  But note the program assume each of the original gene values has main gene as the first space-separated word.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

13 REPLIES 13
Reeza
Super User

Please post some example data detailing what you have and what you want as output. 
This does not need to be real data, it can be fake, but it should be as simple as possible, while complex enough to illustrate your problem. 

kmardinian
Quartz | Level 8

My data looks like this:

 

Name                                N

John Doe                    EGFR R32R

John Doe                    EGFR 5539

Jane Williams             BRCA1 6006

Jane Williams             BRCA1 4003

Jane Williams             AKT1 3997

 

So each person has several different mutations and I want to group their mutations based on the prefix of each one. So I would like to have them renamed as EGFR, BRCA1, and AKT1, erasing the numbers at the end.

 

As for my output, I want to find the total frequency of each mutation in my population, which is why I'd like to focus on the main gene and not the numbers at the end.

 

Thank you!

Reeza
Super User

If there's a space, as in your sample use SCAN()

 

prefix = scan(N_var, 1);
kmardinian
Quartz | Level 8

Hi Reeza,

 

Thank you for taking the time to help me with this. I'm not sure I have the syntax right.

 

Data Assign1;
ABL1 = scan(N_var, 1);
Run;

 

I'm not sure what to put for N_var.

 

Thank you!

Reeza
Super User

The variable name. The old variable that is, the left hand side of the equation is the new variable name.

 

new_variable = scan( old_variable, 1);

All functions are listed here and you should bookmark this page:

http://support.sas.com/documentation/cdl/en/lefunctionsref/69762/HTML/default/viewer.htm#p0w6napahk6...

 

It's helpful to search by category when you don't know exactly what you're looking for.

 

SCAN function specifically:

http://support.sas.com/documentation/cdl/en/lefunctionsref/69762/HTML/default/viewer.htm#p0jshdjy2z9...

 

 

 
 
kmardinian
Quartz | Level 8

So the only issue with the scan function, is that I have over 50 variables with EGFR as a prefix, so is the only way to write out each line of code for the 50 variables? 

 

Thank you!

Reeza
Super User

CONTAINS/FINDW/INDEXW all work to find strings but I think you haven't explained your problem thoroughly.

 

SCAN separates the word regardless of what the prefix is, so in my opinion that's a better method, given your example.

 

Going back to my first response:

This does not need to be real data, it can be fake, but it should be as simple as possible, while complex enough to illustrate your problem. 

 

Given your last answer, I suggest you provide a better example of what you're trying to achieve. 

You have several options but it also depends on what you want as output which you haven't specified. You can use an array to loop through as many variables as you want, or you could transpose your data to a long format which is more suitable for analysis and follows the concept of 'Tidy Data'.

 

So please provide a better example of your sample data AND what you want you want as output. 

kmardinian
Quartz | Level 8
John DoeEGFR RS2R
John DoeEGFR 5539
Jane WilliamsBRCA1 6006
Jane WilliamsBRCA1 2002
Tom FordBRCA1 4008
Tom FordBRCA1 2343
Tom FordBRCA1 2343
Tom FordEGFR 6382
Luis MoALK1 8373
Luis MoEGFR 3378
Katie LuBRCA1 3873
katie LuEGFR 8739

 

This is a tiny example of what my dataset looks like. I have 2,000 different individuals, each with their own different set of mutations. I want to find the frequency of each mutation (ex. EGFR, BRCA1, etc ) in my total population, regardless of what location the mutation is on the gene (so I don't care about the numbers). So I wanted to find an easy way to fine the freuquency of EGFR mutations, by grouping all the EGFR XXXX into one variable category (EGFR), the BRAC1 XXXX into another, etc... without having to manually do it for 2,000 people. I would like my output to show the percentage of each mutation in my population with the new variables I create.

 

I'm sorry my explanations are poor, as you can see I am definitely a SAS beginner and have not mastered a lot of the data management aspect.

 

Thank you!

ballardw
Super User

@kmardinian wrote:

So the only issue with the scan function, is that I have over 50 variables with EGFR as a prefix, so is the only way to write out each line of code for the 50 variables? 

 

Thank you!


Your examples have all showed a single variable, not clearly named, to process. If you have have multiple variables then you should show an example with multiple variables and what the result should be.

It is possible that if you are looking to search a number of variables for a single value that an array with the variables to search would be the likely approach. But a clearer example would help.

 

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon or attached as text to show exactly what you have and that we can test code against.

 

The above would show us exactly what a subset of your data would look like.

Then you could show, using the same approach, what the results for that example data should be.

mkeintz
PROC Star

 

Are you saying you have 50 vars for each individual?  But you want a single frequency table over all those 50, right?

 

Presumably you want to run a proc freq on a data set derived from your data set.  The derived data set (NEED below) would have 50 observations for each observation of 50 vars in your original dataset HAVE.  You can do this in a data step with an array declaration containing all of your original vars as elements of the array, and then loop over the array, generating and outputting the main gene from each array element.

 

 

 

data need (keep=name main_gene);
  set have;
  array n {*} n1-n50;
  do I=1 to dim(n);
    main_gene=scan(n{I},1);
    output;
  end;
run;

proc freq data=need;
  tables main_gene;
run;

 

Notes:

  1. If a given main gene value occurs more than once in the 50 vars, this program counts it every time.  If you only want to count main gene once, then the simplest solution is to put this between the DATA step and the PROC FREQ:
         proc sort data=need noduprec;
           by name main_gene;
         run;
  2. If your original vars are NOT neatly named N1, N2, ... N50, you can still use the array.  Just replace the "N1-N50" with the list of vars in the array statement.  But note the program assume each of the original gene values has main gene as the first space-separated word.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
kmardinian
Quartz | Level 8

Yes, that worked nicely! Thank you everyone for all your help. This seemed to be the easiest way to do this, I appreciate the advice.

 

Thank you!

kmardinian
Quartz | Level 8

I also did find a contains statement that might possibly work, but haven't been able to figure it out either. Here is an example of that below

 

data test;
   set test;
   if find(name,'ian','i') ge 1;
   /* equivalent WHERE clause */
   *where upcase(name) contains 'IAN';
run;

It is from this website: http://support.sas.com/kb/43/303.html 

Michelle
Obsidian | Level 7

Have you tried running Reeza's suggestion to use the SCAN in your original code? You would replace it in your original code like this:

 

Data MCCAssign1;
Set CombinedData (keep= N Name Notes protocol);
If missing(protocol) then delete;

Ncat = SCAN( N, 1);

run;

 

And you should end up with the desired Ncat variable for all of the cases.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 3487 views
  • 0 likes
  • 5 in conversation