Hi,
I am having trouble reading a file containing non-consistent data in wide format into long format.
for example, the data looks like this:
subject, sex, time1, time2, time3,..., intensity1, intensity2, intensity3....
1,m,1,45,5,2,5,3
2,m,3,6,5,5,6,3
3,m,6,65,5,2,20,3,23,63
4,m,9,6,5,43,5,3
5,m,3,195,5,2,5,32,34,78,12,42
6,m,6,92,5,232,5,23
so some subjects may have 3 times and 3 intensities and other may have 4 times/intensities or 5 times/intensities...
If I read the file as
data id sex$ time1-5, intensity1-5;
then transform into long format by
data long; set wide;
array t(5) time1-time5;
array i(5) intensity1-intensity5;
do j=1 to 5;
time=t(j);
intensity= i(j);
output;
end;
however for those subjects that have <5 times/intensities, SAS will continue to read the next subject's data and the whole table screws up.
is there a way I can read the data directly into long format? something like...
data long;
infle 'xxx';
input id sex$ @@;
- do a loop that reads the remaining variables
- I can change the variables into time1 intensity1 time2 intensity2 time3 intensity3....
time=
intensity=
output;
thanks for any suggestions!
ming
Assuming that there are no missing values for "time" (=2 consecutive commas) below code should work - else it would need a little tweak. It's a variation of what already posted.
data sample(drop=_:);
attrib id informat=$10. sex informat=$1.;
array _t_i (20) 8.;
call missing(of _all_);
infile datalines truncover dsd dlm=',';
input id sex _t_i
_isize=countc(_infile_,',')-1 ;
do _ind=1 to _isize by 2;
time=_t_i(_ind);
intensity=_t_i(_ind+1);
output;
end;
datalines;
1,m,1,45,5,2,5,3
2,m,3,6,5,5,6,3
3,m,6,65,5,2,20,3,23,63
4,m,9,6,5,43,5,3
5,m,3,195,5,2,5,32,34,78,12,42
6,m,6,92,5,232,5,23
;
run;
Message was edited by: Patrick Matter Stole the "isize" approach from Jaap, so now the code should also work for cases where there is a missing (=2 consecutive commas)
Hello,
The first data step determines the number of the variables (times / intensity) which the data set will contain,
while the second step imports the values within the data set.
data _null_;
input @5 tmins & $30.;
x=(length(strip(compress(tmins,,'d')))+1)/2;
if _N_=1 then call symputx('nrvar',x);
else if x gt input(symget('nrvar'),best12.) then call symputx('nrvar',x);;
datalines;
1,m,1,45,5,2,5,3
2,m,3,6,5,5,6,3
3,m,6,65,5,2,20,3,23,63
4,m,9,6,5,43,5,3
5,m,3,195,5,2,5,32,34,78,12,42
6,m,6,92,5,232,5,23
;
run;
data have (drop=rowvar i tmins);
infile datalines dsd;
*VARIABLES;
input subject sex $ tmins & $30. @ ;
array times{&nrvar};
array intens{&nrvar};
*Computation;
rowvar=length(strip(compress(tmins,,'d')))+1;*number of commas in order to determin the loop;
do i=1 to rowvar/2;
times{i}=input(scan(tmins,i,","),best12.);*in order to avoid the note within the log;
intens{i}=input(scan(tmins,i+rowvar/2,","),best12.);
end;
datalines;
1,m,1,45,5,2,5,3
2,m,3,6,5,5,6,3
3,m,6,65,5,2,20,3,23,63
4,m,9,6,5,43,5,3
5,m,3,195,5,2,5,32,34,78,12,42
6,m,6,92,5,232,5,23
;
run;
Just one datastep needed. See commentlines
data long (keep=id sex time intensity) ;
length sex $ 1 ;
array timen(5) timen1-timen5;
array inten(5) intensity1-intensity5;
infile cards missover DSD DLM=','; /*just to be sure of options */
input id sex @ ;
isize=(countw(_infile_)-2)/2 ; /* _infile is the current record - size of the array in words*/
do i=1 to isize ; input timen @; end;
do i=1 to isize ; input inten @; end;
put id sex isize ; /* debugging shows processing */
do i=1 to isize ; /* output long in array size */
time=timen ;
intensity=inten ;
output ;
end;
datalines;
1,m,1,45,5,2,5,3
2,m,3,6,5,5,6,3
3,m,6,65,5,2,20,3,23,63
4,m,9,6,5,43,5,3
5,m,3,195,5,2,5,32,34,78,12,42
6,m,6,92,5,232,5,23
run;
Assuming that there are no missing values for "time" (=2 consecutive commas) below code should work - else it would need a little tweak. It's a variation of what already posted.
data sample(drop=_:);
attrib id informat=$10. sex informat=$1.;
array _t_i (20) 8.;
call missing(of _all_);
infile datalines truncover dsd dlm=',';
input id sex _t_i
_isize=countc(_infile_,',')-1 ;
do _ind=1 to _isize by 2;
time=_t_i(_ind);
intensity=_t_i(_ind+1);
output;
end;
datalines;
1,m,1,45,5,2,5,3
2,m,3,6,5,5,6,3
3,m,6,65,5,2,20,3,23,63
4,m,9,6,5,43,5,3
5,m,3,195,5,2,5,32,34,78,12,42
6,m,6,92,5,232,5,23
;
run;
Message was edited by: Patrick Matter Stole the "isize" approach from Jaap, so now the code should also work for cases where there is a missing (=2 consecutive commas)
data sample(drop=n i); infile cards; input; subject=scan(_infile_,1,',','m'); sex=scan(_infile_,2,',','m'); n=divide(countc(_infile_,',')-1,2); do i=3 to n+2; time=scan(_infile_,i,',','m'); intensity=scan(_infile_,i+n,',','m'); output; end; datalines; 1,m,1,45,5,2,5,3 2,m,3,6,5,5,6,3 3,m,6,65,5,2,20,3,23,63 4,m,9,6,5,43,5,3 5,m,3,195,5,2,5,32,34,78,12,42 6,m,6,92,5,232,5,23 ; run;
Xia Keshan
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.