Sorry guys, but if you remember me from my previous post(a few weeks back), I assure you this one may not be as difficult as the previous one. I was just wondering if I had a data file but it had headers in between the data points, how can I get it to skip over these lines? I understand that to skip a header you simply put firstobs=line # you want to start at but if there's a header in between is there a way to tell SAS to do something like read from these lines, skip over other lines, read from some lines again? I will post a sample of the data to show what I mean or you can simply visit this website "http://www.stat.ncsu.edu/people/monahan/courses/st445/angell/stemp.dat" (there's really 5 different data files but we only have to read in one based on a RNG and this one is the one I have to read).
I understand we simply use dlm = '09'x to specifiy that it is a tab delimited file. Basically there's ~24 lines of headers in the beginning then we want our firstobs=25(I'm guessing). But then we have this gap in the middle which is what is throwing me off. Any help would be appreciated.
*
2008 0.09 -0.02 -0.10 -0.16 0.18 0.50 -0.13 -0.26 -0.06 0.01 1.13 0.28 0.10 0.52 0.51
2009 0.04 0.17 0.34 0.09 0.16 0.24 0.05 0.05 -0.44 -0.02 0.91 0.62 0.85 0.98 0.84
2010 -0.10 0.27 0.45 0.14 0.19 -0.04 0.52 0.63 -0.13 0.24 0.48 1.16 0.58 0.10 0.57
300-100 mb 100-50 mb 100-30 mb
______________________________________________________________________________________________________________
Year Win Spr Sum Fall Ann Win Spr Sum Fall Ann Win Spr Sum Fall Ann
______________________________________________________________________________________________________________
1958 -0.34 -0.68 0.08 -0.45 -0.35 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99
1959 -0.47 0.51 0.53 -0.35 0.06 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99
1960 -0.71 -0.41 1.68 0.25 0.20 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99 -99.99
*
What you can do is to first check if a line contains the data you need like done in sample code below:
data test;
infile datalines truncover dlm=' ';
input check $4. @;
if notdigit(check) ne 0 then delete;
input year win whatever;
datalines;
2008 1 2 3
2009 4 6 7
---- some header stuff --
year win whatever
--- end the header stuff ---
2000 4 5
2001 99 0
;
run;
proc print data=test;
run;
This definitely appears to be a homework assignment, thus I'd suggest that you don't use this solution, but study it to see how you could write original code that solves the problem.
Your data file has a number of things that you have to account for besides the fact that not all lines are relevant.
Each data line contains the year and three sets of measures. I would presume that your desired end result would be the season measures, for each amount group, for each year:
filename rng_in http
"http://www.stat.ncsu.edu/people/monahan/courses/st445/angell/stemp.dat";
data want (drop=i counter);
infile rng_in lrecl=32000;
input;
if length(strip(_infile_)) gt 1 then do;
if anydigit(substr(_infile_,1,1)) then do;
year=input(scan(_infile_,1),12.);
if year eq 1958 then counter+1;
amount=(counter-1)*3;
do i=2 to 12 by 5;
amount+1;
win=input(scan(_infile_,i,,'s'),12.);
spr=input(scan(_infile_,i+1,,'s'),12.);
sum=input(scan(_infile_,i+2,,'s'),12.);
fall=input(scan(_infile_,i+3,,'s'),12.);
ann=input(scan(_infile_,i+4,,'s'),12.);
output;
end;
end;
end;
run;
The easiest way is to use ?? modifier to suppress these error message.
filename xx url 'http://www.stat.ncsu.edu/people/monahan/courses/st445/angell/stemp.dat'; data want; infile xx firstobs=27 dlm=' ' expandtabs truncover; input (Year Win Spr Sum Fall Ann Win Spr Sum Fall Ann Win Spr Sum Fall Ann ) ( ?? : best32.); run;
Ksharp
Well, I actually got this to "work" but not the way the teacher wants it to. For this assignment I actually only had to read the "Surface" temps and not the whole file. Currently I have it so it reads a year's data as one observation(aka it has all the seasons and annual in one observation). The instructions say it wants each value to count as one observation. winter value = 1st obs, spring value = 2nd obs, etc...; These obs should correspond to the variable "Surf." My 3 input variables should be QTR(depending on the season), Surf(the observed value) and year. Right now my code just has the year and an array which holds all the "surf" values as one observation.
I'm going to try to figure it out in the mean time
proc format;
value group
11 = 'Surface'
12 = 'Surface-100 mb'
13 = '850-300 mb'
21 = '300-100 mb'
22 = '100-50 mb'
23 = '100-30 mb';
run;
data foobar;
infile 'http://www.stat.ncsu.edu/people/monahan/courses/st445/angell/stemp.dat' url truncover;
input @;
if scan(_infile_,1)='Year' then major+10;
if prxmatch('/^\d{4}/o',_infile_)>0 then do;
year=input(scan(_infile_,1),4.);
minor=major;
do _n_=2 to 12 by 5;
minor+1;
win=input(scan(_infile_,_n_),12.);
spr=input(scan(_infile_,_n_+1),12.);
sum=input(scan(_infile_,_n_+2),12.);
fall=input(scan(_infile_,_n_+3),12.);
ann=input(scan(_infile_,_n_+4),12.);
group=put(minor,group.);
output;
end;
end;
drop major minor;
run;
OH. That is too complicated. I don't think you can get it in one data step.
filename xx url 'http://www.stat.ncsu.edu/people/monahan/courses/st445/angell/stemp.dat'; data want1; infile xx firstobs=27 obs=80 dlm=' ' expandtabs ; length qtr $ 40 season $ 10; input year @; do qtr='Surface','Surface-100 mb','850-300 mb'; do season='Win','Spr','Sum','Fall','Ann' ; input surf @; output; end; end; run; data want2; infile xx firstobs=87 dlm=' ' expandtabs ; length qtr $ 40 season $ 10; input year @; do qtr='300-100 mb','100-50 mb','100-30 mb'; do season='Win','Spr','Sum','Fall','Ann' ; input surf @; output; end; end; run; data want; set want1 want2; run;
Ksharp
A lot of you guys are giving me something with code that we haven't learned, Ksharp's looks like stuff that we've done but it's not exactly correct. I think some of you misunderstood when I said I only had to read in the "surface" part of the data. Reading the other parts are extra credit. This is what I have so far but like I said it's treating one whole year's data points as "one" observation. When my assignment says... http://www.stat.ncsu.edu/people/monahan/courses/st445/hwk4.txt "Create a dataset for the SURFACE ONLY with each observation corresponding to a season/quarter, with variables YEAR, QTR (numeric)(quarter/season), and SURF. Also create a time or date variable that properly indexes the data." Some of the "extra" stuff you see in my data step were to address the other parts of my homework. Like I said right now it "works" as in I can answer all my teacher's questions but I don't think it's working the way he wants us to do it. I would just like some tips/pointers to lead me in the right direction as to how I can change it from what I have now to what he wants.
data angell;
array QTR(5);
if ( _n_ > 53 ) then stop ;
infile 'E:\School Stuff\ST445\stemp.txt' firstobs=27;
input year @@;
do i = 1 to 5;
input QTR(i) @;
end;
drop i;
Exactannual = ((QTR1+QTR2+QTR3+QTR4)/4);
put _all_;
run;
I was thinking along the lines of somehow keeping the year and if the obs = 5 or less than it corresponds to ONE year(Using the remainder function I could make an index like this I think, if the remainder = 4 then it's winter, 3 = spring, 2= summer, 1= fall, and 0 = annual). I'm just having trouble thinking about how I would code this.
For the Year Qtr Surf Try this for your input statement;
data angell (keep=year qtr surf);
infile ' ';
infile 'E:\School Stuff\ST445\stemp.txt' firstobs=27;
if ( _n_ > 53 ) then stop ;
input year win spr sum fall ;
array season win spr sum fall;
do QTR = 1 to 4;
surf= season(qtr);
output;
end;
run;
So for this, if I wanted to include annual as well, I just make add annual to the input, array and do QTR = 1 to 5 as well.
hmm... actually this would create a slight problem, one of the later parts asks me to simply do a simple linear regression with year and annual but in the infile statement we only kept year surf and qtr. So I can't "call" on qtr5 because we deleted it, I'm gonna try messing around and see if I can figure it out though, perhaps by not letting Annual in the array and just also keeping that value?
Hmm... no just tried keeping annual and doing proc reg on it but it's not right because we basically consider the same value for annual 5 different times(Total df = 264 when it should be only 53). When I had it "Working" the previous time my slope was -4.1036 and my std err was 5.99532, I know this time it should match up, my slope does but not the std err.
Yeah it's counting my years and "annual" multiple times, that's the problem when I do a proc reg. I just tried inputting annual without putting it into the array but I would still get df = 211 and that's because simply doing model year=annual; would count each year 4 different times...(4*53=212)
data yyq(drop=exactannual surf: ann rename=(s=surf)) yy(drop=s qtr);
if _n_=54 then stop;
infile 'http://www.stat.ncsu.edu/people/monahan/courses/st445/angell/stemp.dat' url truncover firstobs=27;
input @1 year :4. @;
array surf[4];
do qtr=1 to dim(surf);
input s :6.2 @;
date=yyq(year,qtr);
output yyq;
surf[qtr]=s;
end;
input ann :6.2;
exactannual=sum(of surf
date=mdy(1,1,year);
output yy;
if round(exactannual,.01) ne ann then
putlog 'ERROR: Recorded and Calculated Annual Does Not Agree ' (year ann exactannual) (=);
run;
Actually, what you suggested would work (i.e., adding ann to the input and array statements and increasing the qtr loop to 5.
There is no reason you couldn't do the regression with that data simply using a where statement to only take records that have a qtr=5.
I just edited my previous post to address what issue I'm having now. It's counting multiple years, there should only be 53 years. Aka the layman's way is to simply plot all the years (1958-2010) on the y-axis and all the annual amounts on the x-axis and then finding the "Best fit" line(there should be only 53 points). If I simply do a model year = annual, then it counts the years multiple times(based on my array number). It's giving me a DF of years * # in array.
I'm almost "there"(finish assignment) though, this is the only issue I'm having now.
If you did what you said you were going to do, your annual data would simply be 53 records that have the year, qtr=5 and a variable called surf. I presume that surf will be your dependent variable and year your independent variable.
As I suggested, earlier, use a where statement to only the the qtr=5 records.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.