Hi,
I am having trouble reading a file containing non-consistent data in wide format into long format.
for example, the data looks like this:
subject, sex, time1, time2, time3,..., intensity1, intensity2, intensity3....
1,m,1,45,5,2,5,3
2,m,3,6,5,5,6,3
3,m,6,65,5,2,20,3,23,63
4,m,9,6,5,43,5,3
5,m,3,195,5,2,5,32,34,78,12,42
6,m,6,92,5,232,5,23
so some subjects may have 3 times and 3 intensities and other may have 4 times/intensities or 5 times/intensities...
If I read the file as
data id sex$ time1-5, intensity1-5;
then transform into long format by
data long; set wide;
array t(5) time1-time5;
array i(5) intensity1-intensity5;
do j=1 to 5;
time=t(j);
intensity= i(j);
output;
end;
however for those subjects that have <5 times/intensities, SAS will continue to read the next subject's data and the whole table screws up.
is there a way I can read the data directly into long format? something like...
data long;
infle 'xxx';
input id sex$ @@;
- do a loop that reads the remaining variables
- I can change the variables into time1 intensity1 time2 intensity2 time3 intensity3....
time=
intensity=
output;
thanks for any suggestions!
ming
Assuming that there are no missing values for "time" (=2 consecutive commas) below code should work - else it would need a little tweak. It's a variation of what already posted.
data sample(drop=_:);
attrib id informat=$10. sex informat=$1.;
array _t_i (20) 8.;
call missing(of _all_);
infile datalines truncover dsd dlm=',';
input id sex _t_i
_isize=countc(_infile_,',')-1 ;
do _ind=1 to _isize by 2;
time=_t_i(_ind);
intensity=_t_i(_ind+1);
output;
end;
datalines;
1,m,1,45,5,2,5,3
2,m,3,6,5,5,6,3
3,m,6,65,5,2,20,3,23,63
4,m,9,6,5,43,5,3
5,m,3,195,5,2,5,32,34,78,12,42
6,m,6,92,5,232,5,23
;
run;
Message was edited by: Patrick Matter Stole the "isize" approach from Jaap, so now the code should also work for cases where there is a missing (=2 consecutive commas)
Hello,
The first data step determines the number of the variables (times / intensity) which the data set will contain,
while the second step imports the values within the data set.
data _null_;
input @5 tmins & $30.;
x=(length(strip(compress(tmins,,'d')))+1)/2;
if _N_=1 then call symputx('nrvar',x);
else if x gt input(symget('nrvar'),best12.) then call symputx('nrvar',x);;
datalines;
1,m,1,45,5,2,5,3
2,m,3,6,5,5,6,3
3,m,6,65,5,2,20,3,23,63
4,m,9,6,5,43,5,3
5,m,3,195,5,2,5,32,34,78,12,42
6,m,6,92,5,232,5,23
;
run;
data have (drop=rowvar i tmins);
infile datalines dsd;
*VARIABLES;
input subject sex $ tmins & $30. @ ;
array times{&nrvar};
array intens{&nrvar};
*Computation;
rowvar=length(strip(compress(tmins,,'d')))+1;*number of commas in order to determin the loop;
do i=1 to rowvar/2;
times{i}=input(scan(tmins,i,","),best12.);*in order to avoid the note within the log;
intens{i}=input(scan(tmins,i+rowvar/2,","),best12.);
end;
datalines;
1,m,1,45,5,2,5,3
2,m,3,6,5,5,6,3
3,m,6,65,5,2,20,3,23,63
4,m,9,6,5,43,5,3
5,m,3,195,5,2,5,32,34,78,12,42
6,m,6,92,5,232,5,23
;
run;
Just one datastep needed. See commentlines
data long (keep=id sex time intensity) ;
length sex $ 1 ;
array timen(5) timen1-timen5;
array inten(5) intensity1-intensity5;
infile cards missover DSD DLM=','; /*just to be sure of options */
input id sex @ ;
isize=(countw(_infile_)-2)/2 ; /* _infile is the current record - size of the array in words*/
do i=1 to isize ; input timen @; end;
do i=1 to isize ; input inten @; end;
put id sex isize ; /* debugging shows processing */
do i=1 to isize ; /* output long in array size */
time=timen ;
intensity=inten ;
output ;
end;
datalines;
1,m,1,45,5,2,5,3
2,m,3,6,5,5,6,3
3,m,6,65,5,2,20,3,23,63
4,m,9,6,5,43,5,3
5,m,3,195,5,2,5,32,34,78,12,42
6,m,6,92,5,232,5,23
run;
Assuming that there are no missing values for "time" (=2 consecutive commas) below code should work - else it would need a little tweak. It's a variation of what already posted.
data sample(drop=_:);
attrib id informat=$10. sex informat=$1.;
array _t_i (20) 8.;
call missing(of _all_);
infile datalines truncover dsd dlm=',';
input id sex _t_i
_isize=countc(_infile_,',')-1 ;
do _ind=1 to _isize by 2;
time=_t_i(_ind);
intensity=_t_i(_ind+1);
output;
end;
datalines;
1,m,1,45,5,2,5,3
2,m,3,6,5,5,6,3
3,m,6,65,5,2,20,3,23,63
4,m,9,6,5,43,5,3
5,m,3,195,5,2,5,32,34,78,12,42
6,m,6,92,5,232,5,23
;
run;
Message was edited by: Patrick Matter Stole the "isize" approach from Jaap, so now the code should also work for cases where there is a missing (=2 consecutive commas)
data sample(drop=n i); infile cards; input; subject=scan(_infile_,1,',','m'); sex=scan(_infile_,2,',','m'); n=divide(countc(_infile_,',')-1,2); do i=3 to n+2; time=scan(_infile_,i,',','m'); intensity=scan(_infile_,i+n,',','m'); output; end; datalines; 1,m,1,45,5,2,5,3 2,m,3,6,5,5,6,3 3,m,6,65,5,2,20,3,23,63 4,m,9,6,5,43,5,3 5,m,3,195,5,2,5,32,34,78,12,42 6,m,6,92,5,232,5,23 ; run;
Xia Keshan
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.