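data prot_table;
/* dsd: every comma separates a field, so blank fields are read as missing values */
/* truncover: do not continue on the next data line if a record ends early */
infile cards dsd dlm="," truncover;
input Name :$8. Name_corpus :$8. D1 E1 B2 C2 B3 C3;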
cards ;
O00139,O00139, , , , ,1.23,1.58
O00139-2,O00139,1.69,2.49,0.94,0.8, ,
O00429,O00429,0.84,0.94,0.99,1.02,1.2,0.85
O00429-4,O00429, , ,0.99,1.02, ,
O00429-2,O00429,0.84,0.94, , ,1.2,0.85
O94925-3,O94925,1.64,1.76,0.78,0.81, , ,
O94925,O94925,1.52,1.41,0.92,0.87,10,10
;
run;

Dear All,

I have been using SAS Studio for quite some time, with increasing enthusiasm. However, I am now facing a problem that I could not solve by myself.

I wish to analyze a data set from a shotgun proteomics approach, so it is a long list of proteins with some further information. As can be seen in the example table, there are almost identical identifiers that differ from the unique identifier only by an additional dash and a number (so-called isoforms; see the variable Name). We would like to condense some of these data into a single line, because most of the data connected to these isoforms in fact belong to the standard protein (due to an erroneous assignment by the software used in the preceding step; see lines 1+2 and 3-5), and the data are complementary to each other. To make things more complex, the isoforms sometimes contain real data that should be retained (see lines 6+7).

Accordingly, I would like to fill in the missing values from the corresponding line into the line that already contains the highest number of entries. In our example, we would wish to integrate the data from row 1 into row 2 and subsequently delete row 1. If the data are already complete (here: line 3), we would like to drop the corresponding lines completely (here: 4 and 5).

Of course, a merge of two (or more) lines is only intended if the data are really complementary to each other or identical. Any difference between the corresponding two (or more) lines should lead to the preservation of entries and data (e.g. lines 6 and 7), even if this difference occurs in a single entry only.

This is only an example table; in the real situation, the number of such cases is much higher, so manual curation is not an option (although the human eye immediately grasps the problem, I cannot translate it into program language…).

Thanks a lot for helpful hints and ideas to solve the problem!

Heinz
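
To make the intended rule concrete, here is a minimal sketch of one way it might be written in SAS. It is not a tested solution, and it rests on several assumptions that are not part of the post: the data set names `counted` and `condensed` and the helper variable `n_filled` are made up for illustration, a group is assumed to keep at most 50 distinct rows, and rows are folded greedily into the first compatible row, starting with the row that has the most entries. Two rows are treated as compatible when no column holds two different non-missing values.

```sas
/* Count the non-missing measurements in every row (n_filled is a hypothetical helper) */
data counted;
  set prot_table;
  n_filled = n(of D1 E1 B2 C2 B3 C3);
run;

/* Within each protein, put the row with the most entries first */
proc sort data=counted;
  by Name_corpus descending n_filled;
run;

data condensed (keep=Name Name_corpus D1 E1 B2 C2 B3 C3);
  set counted;
  by Name_corpus;
  array v{6} D1 E1 B2 C2 B3 C3;
  /* Rows kept so far for the current protein; 50 is an assumed upper limit */
  array kept{50,6} _temporary_;
  array kname{50} $ 8 _temporary_;
  retain nkept 0;

  if first.Name_corpus then nkept = 0;

  /* Try to fold the current row into the first compatible kept row */
  merged = 0;
  do k = 1 to nkept while (not merged);
    ok = 1;
    do j = 1 to 6;
      /* A conflict is two different non-missing values in the same column */
      if not missing(v{j}) and not missing(kept{k,j})
         and v{j} ne kept{k,j} then ok = 0;
    end;
    if ok then do;
      /* Fill only the gaps of the kept row */
      do j = 1 to 6;
        if missing(kept{k,j}) then kept{k,j} = v{j};
      end;
      merged = 1;
    end;
  end;

  /* No compatible row found: keep the current row as a separate entry */
  if not merged then do;
    nkept = nkept + 1;
    kname{nkept} = Name;
    do j = 1 to 6;
      kept{nkept,j} = v{j};
    end;
  end;

  /* At the end of each protein, write out the kept rows */
  if last.Name_corpus then do k = 1 to nkept;
    Name = kname{k};
    do j = 1 to 6;
      v{j} = kept{k,j};
    end;
    output;
  end;
run;
```

For the example table this should leave one complete row for O00139-2, one row for O00429, and both O94925 rows, since the latter differ in D1. Because the folding is greedy, a row that is compatible with several kept rows ends up in the first one only, which may or may not be what is wanted in the real data.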