BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
_Heinz_
Fluorite | Level 6
data prot_table;
infile cards dlm= ",";
input Name:$8. Name_corpus: $8. D1 E1 B2 C2 B3 C3;
cards ;
O00139,O00139, , , , ,1.23,1.58
O00139-2,O00139,1.69,2.49,0.94,0.8, ,    
O00429,O00429,0.84,0.94,0.99,1.02,1.2,0.85
O00429-4,O00429, , ,0.99,1.02, , 
O00429-2,O00429,0.84,0.94, , ,1.2,0.85
O94925-3,O94925,1.64,1.76,0.78,0.81, , , 
O94925,O94925,1.52,1.41,0.92,0.87,10,10
;
run;

Dear All,

 

I am using for quite some time SAS Studio with increasing enthusiasm. However, I am facing a problem know which I could not solve by myself:

I wish to analyze a data set from a proteomics shot gun approach, so it is a long list of proteins with some further information. As can be seen in the example table, there are almost identical identifier that differ from the unique identifier by an additional dash and a number (so-called isoforms; see below: Name). However, we would like to condense some of these data into a single line as most of the data connected to these isoforms indeed belong to the standard protein (due to an erroneous assignment by the software used in the step before, see lines 1+2 and 3-5), and the data are complement to each other. However, to make thing more complex, sometimes the isoforms contain real data that should be retained (see line 6+7):

  

Accordingly, I would like to complete the missing values from the corresponding line into the one already containing the highest number of entries. In case of our example, we would wish to integrate the data from row 1 into row 2 and subsequently delete row 1. If data are complete (here: line 3), we would like to drop the corresponding lines completely (here: 4 and 5). Of cause, a merge of two (or more) lines is only intended if the data are really complement to each to other or identical. Any difference within the corresponding two (or more) data sets should lead to the preservation of entries and data (e.g. line 6 and 7), even if this difference is expressed in a single entry only.  

Of cause, this is only an example table; in the real situation, the number of cases with the described situation is much higher, so manual curation is not an option (although the human eye immediately grasp the problem I cannot translate this into program language…).

 

Thanks a lot for helpful hints and ideas to solve the problem!

 

Heinz

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

Hi Heinz!

 

This is my suggested code:

data prot_table;
infile cards dlm= ",";
input Name:$8. Name_corpus: $8. D1 E1 B2 C2 B3 C3;
cards ;
O00139,O00139, , , , ,1.23,1.58
O00139-2,O00139,1.69,2.49,0.94,0.8, ,    
O00429,O00429,0.84,0.94,0.99,1.02,1.2,0.85
O00429-4,O00429, , ,0.99,1.02, , 
O00429-2,O00429,0.84,0.94, , ,1.2,0.85
O94925-3,O94925,1.64,1.76,0.78,0.81, , , 
O94925,O94925,1.52,1.41,0.92,0.87,10,10
;
run;

data missing;
set prot_table;
array vars(*) D1--C3;
anzMissing = cmiss(of vars[*]);
run;

proc sort;
by name_corpus anzmissing;
run;

data want;
set missing;
by name_corpus;
retain _D1 _E1 _B2 _C2 _B3 _C3;
array newvars {*} _D1 _E1 _B2 _C2 _B3 _C3;
array old {*} D1 E1 B2 C2 B3 C3;
if first.name_corpus
then do i = 1 to dim(newvars);
  newvars{i} = old{i};
end;
else do i = 1 to dim(newvars);
  if missing(newvars{i}) then newvars{i} = old{i};
end;
if last.name_corpus then output;
drop name i anzmissing D1--C3;
run;

proc print data=want noobs;
run;

The result looks like this:

Name_
corpus     _D1     _E1     _B2     _C2     _B3      _C3

O00139    1.69    2.49    0.94    0.80     1.23     1.58
O00429    0.84    0.94    0.99    1.02     1.20     0.85
O94925    1.52    1.41    0.92    0.87    10.00    10.00

Of course the code is open to all kinds of automatization. One could read the variable names from dictionary.columns and then use call execute in a data _null_ step to make the liste of variables dynamic.

View solution in original post

5 REPLIES 5
Kurt_Bremser
Super User

Please post example data in a data step; many firewalls won't let you download Office files from the internet, and Excel can in no way preserve SAS dataset/column attributes.

Kurt_Bremser
Super User

Thank you for posting the data so nicely.

I'll get back to you on Monday (no SAS on weekends), unless someone else already has a solution.

My idea so far:

create a new variable that holds the count of non-missing values in each line

sort by name_corpus count descending

do a data step by name_corpus, retain all values in new variables at first.name_corpus, and then fill up missing values from following observations (use array processing)

at last.corpus_name, output

keep only the names and the new variables

_Heinz_
Fluorite | Level 6

Hallo Mr. Bremser,

the first part I could implement, but with the second part I had some trouble because I am not well accustomed to array processing. Would you please post a Syntax for example? My sas code looks like this now.

data missing;
set prot_table;
array vars(*) D1--C3;
anzMissing = cmiss(of vars[*]);
proc sort; by descending name_corpus;
proc print;
run;

Many thanks you for help so far,

 

Heinz

Kurt_Bremser
Super User

Hi Heinz!

 

This is my suggested code:

data prot_table;
infile cards dlm= ",";
input Name:$8. Name_corpus: $8. D1 E1 B2 C2 B3 C3;
cards ;
O00139,O00139, , , , ,1.23,1.58
O00139-2,O00139,1.69,2.49,0.94,0.8, ,    
O00429,O00429,0.84,0.94,0.99,1.02,1.2,0.85
O00429-4,O00429, , ,0.99,1.02, , 
O00429-2,O00429,0.84,0.94, , ,1.2,0.85
O94925-3,O94925,1.64,1.76,0.78,0.81, , , 
O94925,O94925,1.52,1.41,0.92,0.87,10,10
;
run;

data missing;
set prot_table;
array vars(*) D1--C3;
anzMissing = cmiss(of vars[*]);
run;

proc sort;
by name_corpus anzmissing;
run;

data want;
set missing;
by name_corpus;
retain _D1 _E1 _B2 _C2 _B3 _C3;
array newvars {*} _D1 _E1 _B2 _C2 _B3 _C3;
array old {*} D1 E1 B2 C2 B3 C3;
if first.name_corpus
then do i = 1 to dim(newvars);
  newvars{i} = old{i};
end;
else do i = 1 to dim(newvars);
  if missing(newvars{i}) then newvars{i} = old{i};
end;
if last.name_corpus then output;
drop name i anzmissing D1--C3;
run;

proc print data=want noobs;
run;

The result looks like this:

Name_
corpus     _D1     _E1     _B2     _C2     _B3      _C3

O00139    1.69    2.49    0.94    0.80     1.23     1.58
O00429    0.84    0.94    0.99    1.02     1.20     0.85
O94925    1.52    1.41    0.92    0.87    10.00    10.00

Of course the code is open to all kinds of automatization. One could read the variable names from dictionary.columns and then use call execute in a data _null_ step to make the liste of variables dynamic.

_Heinz_
Fluorite | Level 6

Thank you very much for helping – your code did work well!

Using it as a starting point, with some small additional steps I could find all relevant data sets from the table.

 

Thanks again and enjoy the weekend,

 

Kind regards

Heinz

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 961 views
  • 1 like
  • 2 in conversation