BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
AJChamberlain
Obsidian | Level 7

Hi All

 

I have a dataset with 15 records that holds data for some of the years between Year_2005 and Year_2018. Where data exists it is a number between -5 and 5. The table does not have a value for every record for every year:

 

20052006200720082009201020112012201320142015201620172018
1 -22-1103-110   
-2 1 21112120  
2 1 00-1-2-2-1-2-3-4-4
-1 0  000011101
1 2  223333344
0-1-21 013212   
 321222211001 
 -1011 112 2211
0100001   11 0
-10 1  2  3 4 5
-4  -4 -3  -3-4-3-3-4 
011   1   112 
 -1-1-2-2 -1 0-1 -1 -1
-1-2-1           
 01-1-1-2-1  -1 -1 1

 

I am trying to work out how to 'compress' this data so that the resultant dataset only records instances where there is an observation, this would mean that the records would have varying numbers of observations. So I want to disregard the year and get a dataset that looks like this:

 

Obs1Obs2Obs3Obs4Obs5Obs6Obs7Obs8Obs9Obs10Obs11Obs12
1-22-1103-110  
-2121112120  
2100-1-2-2-1-2-3-4-4
-10000011101 
12223333344 
0-1-21013212  
321222211001
-1011 1122211
010000111 0 
-101234 5    
-4-4-3-3-4-3-3-4    
0111112     
-1-1-2-2-10-1          -1-1   
-1-2-1         
01-1-1-2-1-1-11   

 

I've found all sorts of ways to compress or trim variables but I can't see how to do this, any ideas anyone?

 

Thanks

A

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20
data have;
input _2005	_2006	_2007	_2008	_2009	_2010	_2011	_2012	_2013	_2014	_2015	_2016	_2017	_2018;
cards;
1	. 	-2	2	-1	1	0	3	-1	1	0	. . .
;

data want;
set have;
k=compress(catx(' ',of _2005--_2018),'.');
array t(*)_2005--_2018;
call missing( of t(*));
do _n_=1 to dim(t);
t(_n_)=input(scan(k,_n_,' '),8.);
end;
drop k;
run;

View solution in original post

10 REPLIES 10
novinosrin
Tourmaline | Level 20
data have;
input _2005	_2006	_2007	_2008	_2009	_2010	_2011	_2012	_2013	_2014	_2015	_2016	_2017	_2018;
cards;
1	. 	-2	2	-1	1	0	3	-1	1	0	. . .
;

data want;
set have;
k=compress(catx(' ',of _2005--_2018),'.');
array t(*)_2005--_2018;
call missing( of t(*));
do _n_=1 to dim(t);
t(_n_)=input(scan(k,_n_,' '),8.);
end;
drop k;
run;
AJChamberlain
Obsidian | Level 7

Thanks Novinosrin! Once again you have earned the "you've destroyed it" badge. Much appreciated

A

Astounding
PROC Star

Now that you have a proposed solution, let me ask why you would want to do this.  I can't envision a scenario where the new form of the data is remotely useful.  Is my imagination just falling short?  What is it that you are ultimately trying to achieve?

AJChamberlain
Obsidian | Level 7

A good question Astounding! Basically, each of these coefficients, from 5 to -5 is an expression of how difficult an exam question is from the perspective of the candidates attempting it. The exam questions are used at different times over the period 2005 to 2018, sometimes every year over a period of years, sometimes with a gap of a year or more between each use.

Some of our analysis needs to take account of the gaps between each use of the exam question (hence the first dataset with the years included) and some analysis needs to ignore the gaps and look only at the changes in coefficients each time the exam question is used (hence the second dataset), for this latter dataset I need to strip out the years when the question is not used.

A

Astounding
PROC Star

Got it.  It's a relief to see that you plan on keeping the original form of the data around.

 

Just a thought to consider ...

 

Would a normalized form of the data support all types of analysis?  Something like 3 variables and many more observations, so many instances of:

 

Question, Year, Rating

RW9
Diamond | Level 26 RW9
Diamond | Level 26

That is one of the things I was going to ask also.  In clinical trials is common to use questionnaires, and so CDISC already have a datastructure for such a thing.  Normalised it contains a variable for questionnaire, quesition, character result, numeric result.

This question really highlights why transposed data is really not a good storage solution.  It takes up far more room (as all the missings are there as well), and means a recoding to get a different structure to work further with.

 

AJChamberlain
Obsidian | Level 7

So Astounding and @RW9 

 

I'd like to explore this approach a bit further and chekc my understanding, so if I took the first three exam questions (items) and normalized the data in the way you suggest I assume I'd get this:

 

ItemYearRating
120051
12007-2
120082
12009-1
120101
120110
120123
12013-1
120141
120150
22005-2
220071
220092
220101
220111
220121
220132
220141
220152
220160
320052
320071
320090
320100
32011-1
32012-2
32013-2
32014-1
32015-2
32016-3
32017-4
32018-4

 

I like this approach, I'd just need to re-format the data and work out how to apply my analyses from there.

 

Thanks

Andy

Astounding
PROC Star

Based on a quick check, that's exactly what I'm picturing.  As you mentioned, you still need to consider whether data in that form would support all the analyses you intend to perform.

 

@RW9 also mentioned that some fields, when converted to this form, would require a character version of RATING rather than numeric.  So if you need to stuff all your data into this format, in a single data set, you might need 4 fields instead of 3.  (Perhaps even 5 fields if you need two fields to identify both the questionnaire and the item from that questionnaire.)  It's not a requirement, but you would be the one to make that call given that you know the data best.

r_behata
Barite | Level 11
Data have;
	Input "2005"n - "2018"n;
	cards;
1 . -2 2 -1 1 0 3 -1 1 0 . . .
-2 . 1 . 2 1 1 1 2 1 2 0 . .
;
run;

data want;
retain obs1-obs14;
	set have;
	array dt  "2018"n-"2005"n;
	array ob obs14-obs1;

	do i=1 to dim(dt);
		ob[i]=dt[i];
	end;

	call sortn(of ob[*]); 

	drop "2018"n-"2005"n i;
run;
Ksharp
Super User
data have;
input _2005	_2006	_2007	_2008	_2009	_2010	_2011	_2012	_2013	_2014	_2015	_2016	_2017	_2018;
cards;
1	. 	-2	2	-1	1	0	3	-1	1	0	. . .
;

data want;
set have;
array t(*) _2005-_2018;
array tt(*) __2005-__2018;
n=0;
do i=1 to dim(t);
 if not missing(t{i}) then do;n+1;tt{n}=t{i};end;
end;
drop n i _2005-_2018;
run;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1031 views
  • 8 likes
  • 6 in conversation