Write and run SAS programs in your web browser

How to datas into a single line?

Accepted Solution Solved
Reply
New Contributor
Posts: 3
Accepted Solution

How to datas into a single line?

[ Edited ]
data prot_table;
infile cards dlm= ",";
input Name:$8. Name_corpus: $8. D1 E1 B2 C2 B3 C3;
cards ;
O00139,O00139, , , , ,1.23,1.58
O00139-2,O00139,1.69,2.49,0.94,0.8, ,    
O00429,O00429,0.84,0.94,0.99,1.02,1.2,0.85
O00429-4,O00429, , ,0.99,1.02, , 
O00429-2,O00429,0.84,0.94, , ,1.2,0.85
O94925-3,O94925,1.64,1.76,0.78,0.81, , , 
O94925,O94925,1.52,1.41,0.92,0.87,10,10
;
run;

Dear All,

 

I am using for quite some time SAS Studio with increasing enthusiasm. However, I am facing a problem know which I could not solve by myself:

I wish to analyze a data set from a proteomics shot gun approach, so it is a long list of proteins with some further information. As can be seen in the example table, there are almost identical identifier that differ from the unique identifier by an additional dash and a number (so-called isoforms; see below: Name). However, we would like to condense some of these data into a single line as most of the data connected to these isoforms indeed belong to the standard protein (due to an erroneous assignment by the software used in the step before, see lines 1+2 and 3-5), and the data are complement to each other. However, to make thing more complex, sometimes the isoforms contain real data that should be retained (see line 6+7):

  

Accordingly, I would like to complete the missing values from the corresponding line into the one already containing the highest number of entries. In case of our example, we would wish to integrate the data from row 1 into row 2 and subsequently delete row 1. If data are complete (here: line 3), we would like to drop the corresponding lines completely (here: 4 and 5). Of cause, a merge of two (or more) lines is only intended if the data are really complement to each to other or identical. Any difference within the corresponding two (or more) data sets should lead to the preservation of entries and data (e.g. line 6 and 7), even if this difference is expressed in a single entry only.  

Of cause, this is only an example table; in the real situation, the number of cases with the described situation is much higher, so manual curation is not an option (although the human eye immediately grasp the problem I cannot translate this into program language…).

 

Thanks a lot for helpful hints and ideas to solve the problem!

 

Heinz


Accepted Solutions
Solution
‎03-10-2017 08:57 AM
Super User
Posts: 7,758

Re: How to datas into a single line?

Hi Heinz!

 

This is my suggested code:

data prot_table;
infile cards dlm= ",";
input Name:$8. Name_corpus: $8. D1 E1 B2 C2 B3 C3;
cards ;
O00139,O00139, , , , ,1.23,1.58
O00139-2,O00139,1.69,2.49,0.94,0.8, ,    
O00429,O00429,0.84,0.94,0.99,1.02,1.2,0.85
O00429-4,O00429, , ,0.99,1.02, , 
O00429-2,O00429,0.84,0.94, , ,1.2,0.85
O94925-3,O94925,1.64,1.76,0.78,0.81, , , 
O94925,O94925,1.52,1.41,0.92,0.87,10,10
;
run;

data missing;
set prot_table;
array vars(*) D1--C3;
anzMissing = cmiss(of vars[*]);
run;

proc sort;
by name_corpus anzmissing;
run;

data want;
set missing;
by name_corpus;
retain _D1 _E1 _B2 _C2 _B3 _C3;
array newvars {*} _D1 _E1 _B2 _C2 _B3 _C3;
array old {*} D1 E1 B2 C2 B3 C3;
if first.name_corpus
then do i = 1 to dim(newvars);
  newvars{i} = old{i};
end;
else do i = 1 to dim(newvars);
  if missing(newvars{i}) then newvars{i} = old{i};
end;
if last.name_corpus then output;
drop name i anzmissing D1--C3;
run;

proc print data=want noobs;
run;

The result looks like this:

Name_
corpus     _D1     _E1     _B2     _C2     _B3      _C3

O00139    1.69    2.49    0.94    0.80     1.23     1.58
O00429    0.84    0.94    0.99    1.02     1.20     0.85
O94925    1.52    1.41    0.92    0.87    10.00    10.00

Of course the code is open to all kinds of automatization. One could read the variable names from dictionary.columns and then use call execute in a data _null_ step to make the liste of variables dynamic.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers

View solution in original post


All Replies
Super User
Posts: 7,758

Re: How to datas into a single line?

Please post example data in a data step; many firewalls won't let you download Office files from the internet, and Excel can in no way preserve SAS dataset/column attributes.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Posts: 7,758

Re: How to datas into a single line?

Thank you for posting the data so nicely.

I'll get back to you on Monday (no SAS on weekends), unless someone else already has a solution.

My idea so far:

create a new variable that holds the count of non-missing values in each line

sort by name_corpus count descending

do a data step by name_corpus, retain all values in new variables at first.name_corpus, and then fill up missing values from following observations (use array processing)

at last.corpus_name, output

keep only the names and the new variables

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
New Contributor
Posts: 3

Re: How to datas into a single line?

Posted in reply to KurtBremser

Hallo Mr. Bremser,

the first part I could implement, but with the second part I had some trouble because I am not well accustomed to array processing. Would you please post a Syntax for example? My sas code looks like this now.

data missing;
set prot_table;
array vars(*) D1--C3;
anzMissing = cmiss(of vars[*]);
proc sort; by descending name_corpus;
proc print;
run;

Many thanks you for help so far,

 

Heinz

Solution
‎03-10-2017 08:57 AM
Super User
Posts: 7,758

Re: How to datas into a single line?

Hi Heinz!

 

This is my suggested code:

data prot_table;
infile cards dlm= ",";
input Name:$8. Name_corpus: $8. D1 E1 B2 C2 B3 C3;
cards ;
O00139,O00139, , , , ,1.23,1.58
O00139-2,O00139,1.69,2.49,0.94,0.8, ,    
O00429,O00429,0.84,0.94,0.99,1.02,1.2,0.85
O00429-4,O00429, , ,0.99,1.02, , 
O00429-2,O00429,0.84,0.94, , ,1.2,0.85
O94925-3,O94925,1.64,1.76,0.78,0.81, , , 
O94925,O94925,1.52,1.41,0.92,0.87,10,10
;
run;

data missing;
set prot_table;
array vars(*) D1--C3;
anzMissing = cmiss(of vars[*]);
run;

proc sort;
by name_corpus anzmissing;
run;

data want;
set missing;
by name_corpus;
retain _D1 _E1 _B2 _C2 _B3 _C3;
array newvars {*} _D1 _E1 _B2 _C2 _B3 _C3;
array old {*} D1 E1 B2 C2 B3 C3;
if first.name_corpus
then do i = 1 to dim(newvars);
  newvars{i} = old{i};
end;
else do i = 1 to dim(newvars);
  if missing(newvars{i}) then newvars{i} = old{i};
end;
if last.name_corpus then output;
drop name i anzmissing D1--C3;
run;

proc print data=want noobs;
run;

The result looks like this:

Name_
corpus     _D1     _E1     _B2     _C2     _B3      _C3

O00139    1.69    2.49    0.94    0.80     1.23     1.58
O00429    0.84    0.94    0.99    1.02     1.20     0.85
O94925    1.52    1.41    0.92    0.87    10.00    10.00

Of course the code is open to all kinds of automatization. One could read the variable names from dictionary.columns and then use call execute in a data _null_ step to make the liste of variables dynamic.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
New Contributor
Posts: 3

Re: How to datas into a single line?

Posted in reply to KurtBremser

Thank you very much for helping – your code did work well!

Using it as a starting point, with some small additional steps I could find all relevant data sets from the table.

 

Thanks again and enjoy the weekend,

 

Kind regards

Heinz

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 5 replies
  • 198 views
  • 1 like
  • 2 in conversation