Solved: Re: How to datas into a single line?

_Heinz_ · Posted 03-03-2017 07:21 AM

data prot_table;
infile cards dlm= ",";
input Name:$8. Name_corpus: $8. D1 E1 B2 C2 B3 C3;
cards ;
O00139,O00139, , , , ,1.23,1.58
O00139-2,O00139,1.69,2.49,0.94,0.8, ,    
O00429,O00429,0.84,0.94,0.99,1.02,1.2,0.85
O00429-4,O00429, , ,0.99,1.02, , 
O00429-2,O00429,0.84,0.94, , ,1.2,0.85
O94925-3,O94925,1.64,1.76,0.78,0.81, , , 
O94925,O94925,1.52,1.41,0.92,0.87,10,10
;
run;

Dear All,

I am using for quite some time SAS Studio with increasing enthusiasm. However, I am facing a problem know which I could not solve by myself:

I wish to analyze a data set from a proteomics shot gun approach, so it is a long list of proteins with some further information. As can be seen in the example table, there are almost identical identifier that differ from the unique identifier by an additional dash and a number (so-called isoforms; see below: Name). However, we would like to condense some of these data into a single line as most of the data connected to these isoforms indeed belong to the standard protein (due to an erroneous assignment by the software used in the step before, see lines 1+2 and 3-5), and the data are complement to each other. However, to make thing more complex, sometimes the isoforms contain real data that should be retained (see line 6+7):

Accordingly, I would like to complete the missing values from the corresponding line into the one already containing the highest number of entries. In case of our example, we would wish to integrate the data from row 1 into row 2 and subsequently delete row 1. If data are complete (here: line 3), we would like to drop the corresponding lines completely (here: 4 and 5). Of cause, a merge of two (or more) lines is only intended if the data are really complement to each to other or identical. Any difference within the corresponding two (or more) data sets should lead to the preservation of entries and data (e.g. line 6 and 7), even if this difference is expressed in a single entry only.

Of cause, this is only an example table; in the real situation, the number of cases with the described situation is much higher, so manual curation is not an option (although the human eye immediately grasp the problem I cannot translate this into program language…).

Thanks a lot for helpful hints and ideas to solve the problem!

Heinz

Kurt_Bremser · Posted 03-07-2017 07:11 AM

Hi Heinz!

This is my suggested code:

data prot_table;
infile cards dlm= ",";
input Name:$8. Name_corpus: $8. D1 E1 B2 C2 B3 C3;
cards ;
O00139,O00139, , , , ,1.23,1.58
O00139-2,O00139,1.69,2.49,0.94,0.8, ,    
O00429,O00429,0.84,0.94,0.99,1.02,1.2,0.85
O00429-4,O00429, , ,0.99,1.02, , 
O00429-2,O00429,0.84,0.94, , ,1.2,0.85
O94925-3,O94925,1.64,1.76,0.78,0.81, , , 
O94925,O94925,1.52,1.41,0.92,0.87,10,10
;
run;

data missing;
set prot_table;
array vars(*) D1--C3;
anzMissing = cmiss(of vars[*]);
run;

proc sort;
by name_corpus anzmissing;
run;

data want;
set missing;
by name_corpus;
retain _D1 _E1 _B2 _C2 _B3 _C3;
array newvars {*} _D1 _E1 _B2 _C2 _B3 _C3;
array old {*} D1 E1 B2 C2 B3 C3;
if first.name_corpus
then do i = 1 to dim(newvars);
  newvars{i} = old{i};
end;
else do i = 1 to dim(newvars);
  if missing(newvars{i}) then newvars{i} = old{i};
end;
if last.name_corpus then output;
drop name i anzmissing D1--C3;
run;

proc print data=want noobs;
run;

The result looks like this:

Name_
corpus     _D1     _E1     _B2     _C2     _B3      _C3

O00139    1.69    2.49    0.94    0.80     1.23     1.58
O00429    0.84    0.94    0.99    1.02     1.20     0.85
O94925    1.52    1.41    0.92    0.87    10.00    10.00

Of course the code is open to all kinds of automatization. One could read the variable names from dictionary.columns and then use call execute in a data _null_ step to make the liste of variables dynamic.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

Kurt_Bremser · Posted 03-03-2017 08:25 AM

Please post example data in a data step; many firewalls won't let you download Office files from the internet, and Excel can in no way preserve SAS dataset/column attributes.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 03-04-2017 02:37 AM

Thank you for posting the data so nicely.

I'll get back to you on Monday (no SAS on weekends), unless someone else already has a solution.

My idea so far:

create a new variable that holds the count of non-missing values in each line

sort by name_corpus count descending

do a data step by name_corpus, retain all values in new variables at first.name_corpus, and then fill up missing values from following observations (use array processing)

at last.corpus_name, output

keep only the names and the new variables

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

_Heinz_ · Posted 03-07-2017 03:22 AM

Hallo Mr. Bremser,

the first part I could implement, but with the second part I had some trouble because I am not well accustomed to array processing. Would you please post a Syntax for example? My sas code looks like this now.

data missing;
set prot_table;
array vars(*) D1--C3;
anzMissing = cmiss(of vars[*]);
proc sort; by descending name_corpus;
proc print;
run;

Many thanks you for help so far,

Heinz

Kurt_Bremser · Posted 03-07-2017 07:11 AM

Hi Heinz!

This is my suggested code:

data prot_table;
infile cards dlm= ",";
input Name:$8. Name_corpus: $8. D1 E1 B2 C2 B3 C3;
cards ;
O00139,O00139, , , , ,1.23,1.58
O00139-2,O00139,1.69,2.49,0.94,0.8, ,    
O00429,O00429,0.84,0.94,0.99,1.02,1.2,0.85
O00429-4,O00429, , ,0.99,1.02, , 
O00429-2,O00429,0.84,0.94, , ,1.2,0.85
O94925-3,O94925,1.64,1.76,0.78,0.81, , , 
O94925,O94925,1.52,1.41,0.92,0.87,10,10
;
run;

data missing;
set prot_table;
array vars(*) D1--C3;
anzMissing = cmiss(of vars[*]);
run;

proc sort;
by name_corpus anzmissing;
run;

data want;
set missing;
by name_corpus;
retain _D1 _E1 _B2 _C2 _B3 _C3;
array newvars {*} _D1 _E1 _B2 _C2 _B3 _C3;
array old {*} D1 E1 B2 C2 B3 C3;
if first.name_corpus
then do i = 1 to dim(newvars);
  newvars{i} = old{i};
end;
else do i = 1 to dim(newvars);
  if missing(newvars{i}) then newvars{i} = old{i};
end;
if last.name_corpus then output;
drop name i anzmissing D1--C3;
run;

proc print data=want noobs;
run;

The result looks like this:

Name_
corpus     _D1     _E1     _B2     _C2     _B3      _C3

O00139    1.69    2.49    0.94    0.80     1.23     1.58
O00429    0.84    0.94    0.99    1.02     1.20     0.85
O94925    1.52    1.41    0.92    0.87    10.00    10.00

Of course the code is open to all kinds of automatization. One could read the variable names from dictionary.columns and then use call execute in a data _null_ step to make the liste of variables dynamic.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

_Heinz_ · Posted 03-10-2017 08:58 AM

Thank you very much for helping – your code did work well!

Using it as a starting point, with some small additional steps I could find all relevant data sets from the table.

Thanks again and enjoy the weekend,

Kind regards

Heinz

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!