Solved: Re: Removing blank fields from a dataset in SAS and compressing the re...

AJChamberlain · Posted 03-04-2019 09:46 AM

Hi All

I have a dataset with 15 records that holds data for some of the years between Year_2005 and Year_2018. Where data exists it is a number between -5 and 5. The table does not have a value for every record for every year:

2005	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016	2017	2018
1		-2	2	-1	1	0	3	-1	1	0
-2		1		2	1	1	1	2	1	2	0
2		1		0	0	-1	-2	-2	-1	-2	-3	-4	-4
-1		0			0	0	0	0	1	1	1	0	1
1		2			2	2	3	3	3	3	3	4	4
0	-1	-2	1		0	1	3	2	1	2
	3	2	1	2	2	2	2	1	1	0	0	1
	-1	0	1	1		1	1	2		2	2	1	1
0	1	0	0	0	0	1				1	1		0
-1	0		1			2			3		4		5
-4			-4		-3			-3	-4	-3	-3	-4
0	1	1				1				1	1	2
	-1	-1	-2	-2		-1		0	-1		-1		-1
-1	-2	-1
	0	1	-1	-1	-2	-1			-1		-1		1

I am trying to work out how to 'compress' this data so that the resultant dataset only records instances where there is an observation, this would mean that the records would have varying numbers of observations. So I want to disregard the year and get a dataset that looks like this:

Obs1	Obs2	Obs3	Obs4	Obs5	Obs6	Obs7	Obs8	Obs9	Obs10	Obs11	Obs12
1	-2	2	-1	1	0	3	-1	1	0
-2	1	2	1	1	1	2	1	2	0
2	1	0	0	-1	-2	-2	-1	-2	-3	-4	-4
-1	0	0	0	0	0	1	1	1	0	1
1	2	2	2	3	3	3	3	3	4	4
0	-1	-2	1	0	1	3	2	1	2
3	2	1	2	2	2	2	1	1	0	0	1
-1	0	1	1		1	1	2	2	2	1	1
0	1	0	0	0	0	1	1	1		0
-1	0	1	2	3	4		5
-4	-4	-3	-3	-4	-3	-3	-4
0	1	1	1	1	1	2
-1	-1	-2	-2	-1	0	-1	-1	-1
-1	-2	-1
0	1	-1	-1	-2	-1	-1	-1	1

I've found all sorts of ways to compress or trim variables but I can't see how to do this, any ideas anyone?

Thanks

A

novinosrin · Posted 03-04-2019 09:57 AM

data have;
input _2005	_2006	_2007	_2008	_2009	_2010	_2011	_2012	_2013	_2014	_2015	_2016	_2017	_2018;
cards;
1	. 	-2	2	-1	1	0	3	-1	1	0	. . .
;

data want;
set have;
k=compress(catx(' ',of _2005--_2018),'.');
array t(*)_2005--_2018;
call missing( of t(*));
do _n_=1 to dim(t);
t(_n_)=input(scan(k,_n_,' '),8.);
end;
drop k;
run;

View solution in original post

novinosrin · Posted 03-04-2019 09:57 AM

data have;
input _2005	_2006	_2007	_2008	_2009	_2010	_2011	_2012	_2013	_2014	_2015	_2016	_2017	_2018;
cards;
1	. 	-2	2	-1	1	0	3	-1	1	0	. . .
;

data want;
set have;
k=compress(catx(' ',of _2005--_2018),'.');
array t(*)_2005--_2018;
call missing( of t(*));
do _n_=1 to dim(t);
t(_n_)=input(scan(k,_n_,' '),8.);
end;
drop k;
run;

AJChamberlain · Posted 03-04-2019 10:03 AM

Thanks Novinosrin! Once again you have earned the "you've destroyed it" badge. Much appreciated

A

Astounding · Posted 03-04-2019 10:00 AM

Now that you have a proposed solution, let me ask why you would want to do this. I can't envision a scenario where the new form of the data is remotely useful. Is my imagination just falling short? What is it that you are ultimately trying to achieve?

AJChamberlain · Posted 03-04-2019 10:11 AM

A good question Astounding! Basically, each of these coefficients, from 5 to -5 is an expression of how difficult an exam question is from the perspective of the candidates attempting it. The exam questions are used at different times over the period 2005 to 2018, sometimes every year over a period of years, sometimes with a gap of a year or more between each use.

Some of our analysis needs to take account of the gaps between each use of the exam question (hence the first dataset with the years included) and some analysis needs to ignore the gaps and look only at the changes in coefficients each time the exam question is used (hence the second dataset), for this latter dataset I need to strip out the years when the question is not used.

A

Astounding · Posted 03-04-2019 10:18 AM

Got it. It's a relief to see that you plan on keeping the original form of the data around.

Just a thought to consider ...

Would a normalized form of the data support all types of analysis? Something like 3 variables and many more observations, so many instances of:

Question, Year, Rating

RW9 · Posted 03-04-2019 10:29 AM

That is one of the things I was going to ask also. In clinical trials is common to use questionnaires, and so CDISC already have a datastructure for such a thing. Normalised it contains a variable for questionnaire, quesition, character result, numeric result.

This question really highlights why transposed data is really not a good storage solution. It takes up far more room (as all the missings are there as well), and means a recoding to get a different structure to work further with.

AJChamberlain · Posted 03-04-2019 10:52 AM

So Astounding and @RW9

I'd like to explore this approach a bit further and chekc my understanding, so if I took the first three exam questions (items) and normalized the data in the way you suggest I assume I'd get this:

Item	Year	Rating
1	2005	1
1	2007	-2
1	2008	2
1	2009	-1
1	2010	1
1	2011	0
1	2012	3
1	2013	-1
1	2014	1
1	2015	0
2	2005	-2
2	2007	1
2	2009	2
2	2010	1
2	2011	1
2	2012	1
2	2013	2
2	2014	1
2	2015	2
2	2016	0
3	2005	2
3	2007	1
3	2009	0
3	2010	0
3	2011	-1
3	2012	-2
3	2013	-2
3	2014	-1
3	2015	-2
3	2016	-3
3	2017	-4
3	2018	-4

I like this approach, I'd just need to re-format the data and work out how to apply my analyses from there.

Thanks

Andy

Astounding · Posted 03-04-2019 11:07 AM

Based on a quick check, that's exactly what I'm picturing. As you mentioned, you still need to consider whether data in that form would support all the analyses you intend to perform.

@RW9 also mentioned that some fields, when converted to this form, would require a character version of RATING rather than numeric. So if you need to stuff all your data into this format, in a single data set, you might need 4 fields instead of 3. (Perhaps even 5 fields if you need two fields to identify both the questionnaire and the item from that questionnaire.) It's not a requirement, but you would be the one to make that call given that you know the data best.

r_behata · Posted 03-04-2019 10:26 AM

Data have;
	Input "2005"n - "2018"n;
	cards;
1 . -2 2 -1 1 0 3 -1 1 0 . . .
-2 . 1 . 2 1 1 1 2 1 2 0 . .
;
run;

data want;
retain obs1-obs14;
	set have;
	array dt  "2018"n-"2005"n;
	array ob obs14-obs1;

	do i=1 to dim(dt);
		ob[i]=dt[i];
	end;

	call sortn(of ob[*]); 

	drop "2018"n-"2005"n i;
run;

Ksharp · Posted 03-05-2019 08:07 AM

data have;
input _2005	_2006	_2007	_2008	_2009	_2010	_2011	_2012	_2013	_2014	_2015	_2016	_2017	_2018;
cards;
1	. 	-2	2	-1	1	0	3	-1	1	0	. . .
;

data want;
set have;
array t(*) _2005-_2018;
array tt(*) __2005-__2018;
n=0;
do i=1 to dim(t);
 if not missing(t{i}) then do;n+1;tt{n}=t{i};end;
end;
drop n i _2005-_2018;
run;

Removing blank fields from a dataset in SAS and compressing the remaining fields together

Re: Removing blank fields from a dataset in SAS and compressing the remaining fields together

Re: Removing blank fields from a dataset in SAS and compressing the remaining fields together

Re: Removing blank fields from a dataset in SAS and compressing the remaining fields together

Re: Removing blank fields from a dataset in SAS and compressing the remaining fields together

Re: Removing blank fields from a dataset in SAS and compressing the remaining fields together

Re: Removing blank fields from a dataset in SAS and compressing the remaining fields together

Re: Removing blank fields from a dataset in SAS and compressing the remaining fields together

Re: Removing blank fields from a dataset in SAS and compressing the remaining fields together

Re: Removing blank fields from a dataset in SAS and compressing the remaining fields together

Re: Removing blank fields from a dataset in SAS and compressing the remaining fields together

Re: Removing blank fields from a dataset in SAS and compressing the remaining fields together

SAS Innovate 2025: Call for Content

Classroom Training Available!