Hi All
I have a dataset with 15 records that holds data for some of the years between Year_2005 and Year_2018. Where data exists it is a number between -5 and 5. The table does not have a value for every record for every year:
2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 |
1 | -2 | 2 | -1 | 1 | 0 | 3 | -1 | 1 | 0 | ||||
-2 | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 2 | 0 | ||||
2 | 1 | 0 | 0 | -1 | -2 | -2 | -1 | -2 | -3 | -4 | -4 | ||
-1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | |||
1 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 4 | 4 | |||
0 | -1 | -2 | 1 | 0 | 1 | 3 | 2 | 1 | 2 | ||||
3 | 2 | 1 | 2 | 2 | 2 | 2 | 1 | 1 | 0 | 0 | 1 | ||
-1 | 0 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | |||
0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | ||||
-1 | 0 | 1 | 2 | 3 | 4 | 5 | |||||||
-4 | -4 | -3 | -3 | -4 | -3 | -3 | -4 | ||||||
0 | 1 | 1 | 1 | 1 | 1 | 2 | |||||||
-1 | -1 | -2 | -2 | -1 | 0 | -1 | -1 | -1 | |||||
-1 | -2 | -1 | |||||||||||
0 | 1 | -1 | -1 | -2 | -1 | -1 | -1 | 1 |
I am trying to work out how to 'compress' this data so that the resultant dataset only records instances where there is an observation, this would mean that the records would have varying numbers of observations. So I want to disregard the year and get a dataset that looks like this:
Obs1 | Obs2 | Obs3 | Obs4 | Obs5 | Obs6 | Obs7 | Obs8 | Obs9 | Obs10 | Obs11 | Obs12 |
1 | -2 | 2 | -1 | 1 | 0 | 3 | -1 | 1 | 0 | ||
-2 | 1 | 2 | 1 | 1 | 1 | 2 | 1 | 2 | 0 | ||
2 | 1 | 0 | 0 | -1 | -2 | -2 | -1 | -2 | -3 | -4 | -4 |
-1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | |
1 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 4 | 4 | |
0 | -1 | -2 | 1 | 0 | 1 | 3 | 2 | 1 | 2 | ||
3 | 2 | 1 | 2 | 2 | 2 | 2 | 1 | 1 | 0 | 0 | 1 |
-1 | 0 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | |
0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | ||
-1 | 0 | 1 | 2 | 3 | 4 | 5 | |||||
-4 | -4 | -3 | -3 | -4 | -3 | -3 | -4 | ||||
0 | 1 | 1 | 1 | 1 | 1 | 2 | |||||
-1 | -1 | -2 | -2 | -1 | 0 | -1 | -1 | -1 | |||
-1 | -2 | -1 | |||||||||
0 | 1 | -1 | -1 | -2 | -1 | -1 | -1 | 1 |
I've found all sorts of ways to compress or trim variables but I can't see how to do this, any ideas anyone?
Thanks
A
data have;
input _2005 _2006 _2007 _2008 _2009 _2010 _2011 _2012 _2013 _2014 _2015 _2016 _2017 _2018;
cards;
1 . -2 2 -1 1 0 3 -1 1 0 . . .
;
data want;
set have;
k=compress(catx(' ',of _2005--_2018),'.');
array t(*)_2005--_2018;
call missing( of t(*));
do _n_=1 to dim(t);
t(_n_)=input(scan(k,_n_,' '),8.);
end;
drop k;
run;
data have;
input _2005 _2006 _2007 _2008 _2009 _2010 _2011 _2012 _2013 _2014 _2015 _2016 _2017 _2018;
cards;
1 . -2 2 -1 1 0 3 -1 1 0 . . .
;
data want;
set have;
k=compress(catx(' ',of _2005--_2018),'.');
array t(*)_2005--_2018;
call missing( of t(*));
do _n_=1 to dim(t);
t(_n_)=input(scan(k,_n_,' '),8.);
end;
drop k;
run;
Thanks Novinosrin! Once again you have earned the "you've destroyed it" badge. Much appreciated
A
Now that you have a proposed solution, let me ask why you would want to do this. I can't envision a scenario where the new form of the data is remotely useful. Is my imagination just falling short? What is it that you are ultimately trying to achieve?
A good question Astounding! Basically, each of these coefficients, from 5 to -5 is an expression of how difficult an exam question is from the perspective of the candidates attempting it. The exam questions are used at different times over the period 2005 to 2018, sometimes every year over a period of years, sometimes with a gap of a year or more between each use.
Some of our analysis needs to take account of the gaps between each use of the exam question (hence the first dataset with the years included) and some analysis needs to ignore the gaps and look only at the changes in coefficients each time the exam question is used (hence the second dataset), for this latter dataset I need to strip out the years when the question is not used.
A
Got it. It's a relief to see that you plan on keeping the original form of the data around.
Just a thought to consider ...
Would a normalized form of the data support all types of analysis? Something like 3 variables and many more observations, so many instances of:
Question, Year, Rating
That is one of the things I was going to ask also. In clinical trials is common to use questionnaires, and so CDISC already have a datastructure for such a thing. Normalised it contains a variable for questionnaire, quesition, character result, numeric result.
This question really highlights why transposed data is really not a good storage solution. It takes up far more room (as all the missings are there as well), and means a recoding to get a different structure to work further with.
So Astounding and @RW9
I'd like to explore this approach a bit further and chekc my understanding, so if I took the first three exam questions (items) and normalized the data in the way you suggest I assume I'd get this:
Item | Year | Rating |
1 | 2005 | 1 |
1 | 2007 | -2 |
1 | 2008 | 2 |
1 | 2009 | -1 |
1 | 2010 | 1 |
1 | 2011 | 0 |
1 | 2012 | 3 |
1 | 2013 | -1 |
1 | 2014 | 1 |
1 | 2015 | 0 |
2 | 2005 | -2 |
2 | 2007 | 1 |
2 | 2009 | 2 |
2 | 2010 | 1 |
2 | 2011 | 1 |
2 | 2012 | 1 |
2 | 2013 | 2 |
2 | 2014 | 1 |
2 | 2015 | 2 |
2 | 2016 | 0 |
3 | 2005 | 2 |
3 | 2007 | 1 |
3 | 2009 | 0 |
3 | 2010 | 0 |
3 | 2011 | -1 |
3 | 2012 | -2 |
3 | 2013 | -2 |
3 | 2014 | -1 |
3 | 2015 | -2 |
3 | 2016 | -3 |
3 | 2017 | -4 |
3 | 2018 | -4 |
I like this approach, I'd just need to re-format the data and work out how to apply my analyses from there.
Thanks
Andy
Based on a quick check, that's exactly what I'm picturing. As you mentioned, you still need to consider whether data in that form would support all the analyses you intend to perform.
@RW9 also mentioned that some fields, when converted to this form, would require a character version of RATING rather than numeric. So if you need to stuff all your data into this format, in a single data set, you might need 4 fields instead of 3. (Perhaps even 5 fields if you need two fields to identify both the questionnaire and the item from that questionnaire.) It's not a requirement, but you would be the one to make that call given that you know the data best.
Data have;
Input "2005"n - "2018"n;
cards;
1 . -2 2 -1 1 0 3 -1 1 0 . . .
-2 . 1 . 2 1 1 1 2 1 2 0 . .
;
run;
data want;
retain obs1-obs14;
set have;
array dt "2018"n-"2005"n;
array ob obs14-obs1;
do i=1 to dim(dt);
ob[i]=dt[i];
end;
call sortn(of ob[*]);
drop "2018"n-"2005"n i;
run;
data have;
input _2005 _2006 _2007 _2008 _2009 _2010 _2011 _2012 _2013 _2014 _2015 _2016 _2017 _2018;
cards;
1 . -2 2 -1 1 0 3 -1 1 0 . . .
;
data want;
set have;
array t(*) _2005-_2018;
array tt(*) __2005-__2018;
n=0;
do i=1 to dim(t);
if not missing(t{i}) then do;n+1;tt{n}=t{i};end;
end;
drop n i _2005-_2018;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.