BookmarkSubscribeRSS Feed
dwaynejarman
Fluorite | Level 6

Recently I had a project where I had to concatenate values in a column by a unique identifier.  I had no idea how many values or the size of the values so I had to pre-build the table with extra padding (like 20000 characters).  Well this bloated the file to over 3 GB for a relatively small dataset < 50 variables and < 30,000 records.  I didn't have the time to write the data management code in SAS so I googled the topic and found this article:  http://support.sas.com/resources/papers/proceedings13/206-2013.pdf

 

The paper was missing a transpose step which would create the maxx_t variable.  I also noticed that the final dataset was not replaced nor was there any clean up done.  Also, the columns were ordered alphanumerically by name not in the original order.  I corrected all of these and decided to share the code with you all.  Enjoy and use at your own risk.

 

%MACRO	ChgLen (inp=);
	*STEP 0. Identify the original order of the variables and create the proc sql necessary to rebuild the dataset;
	PROC CONTENTS DATA = &inp.	NOPRINT
		OUT	= origorder (KEEP = memname name type length label varnum);
	RUN;
	proc sort data = origorder; by varnum; run;
	data origorder2 ; format samplenumber $25.; length vars $3000.;
	   do until (last.memname);
	      set origorder ;
	        by memname notsorted;
			vars=catx(',',vars,name);
	   end;
	   vars = "proc sql; create table "||strip(memname)||" as select distinct "||strip(vars)||" from newone; quit;";
	   CALL SYMPUT('rebuild', vars);
	   keep vars ;
	run;
	*STEP 1. SELECT ONLY CHARACTER VARIABLES FROM THE DATASET;
	PROC CONTENTS DATA = &inp.	NOPRINT
		OUT	= CharVar (WHERE =(type=2)
		KEEP = memname name type length label);
	RUN;
	*STEP 2. CREATE SAS CODES BY USING LENGTH AND MAX FUNCTIONS;
	DATA	CharVar2;
		LENGTH	name1	name2	name3	name4	name5	$100; SET CharVar;
		name1 = trim(name)||'1=length('||trim(name)||')'; name2 = 'max('||trim(name)||'1)	as '||trim(name); name3 = trim(name)||'x = '||trim(name);
		name4 = trim(name)||' = '||trim(name)||'x';
		name5 = trim(name)||'	$'||strip(put(LENGTH, best.));
	RUN;
	*STEP 3. CREATE MACRO VARIABLES THAT WILL BE USED TO GENERATE DIFFERENT SAS STATEMENTS;
	PROC SQL	NOPRINT;
		SELECT	trim(name1) INTO: newvar	SEPARATED BY ';	' FROM CharVar2;
		SELECT trim(name2) INTO: maxvar	SEPARATED BY ',	' FROM CharVar2;
		SELECT trim(name3) INTO: tname	SEPARATED BY ';	' FROM CharVar2;
		SELECT trim(name4) INTO: tnamex	SEPARATED BY ';	' FROM CharVar2;
		SELECT trim(name) INTO: CVar	SEPARATED BY '	'
		FROM CharVar2;
		SELECT trim(name)||'x' INTO: CVarx	SEPARATED BY '	' FROM CharVar2;
		SELECT trim(name5)	INTO: CVarLen	SEPARATED BY '	'
		FROM CharVar2;
	QUIT;
	*STEP 4. FIND THE MAXIMUM LENGTH VALUE FOR EACH VARIABLE;
	DATA VarLen ; SET &inp.; &newvar;
	RUN;
	*STEP 4.1. FIND THE MAXIMUM LENGTH VALUE FOR EACH VARIABLE;
	PROC SQL NOPRINT  ; 
		CREATE TABLE maxx	AS
		SELECT &maxvar FROM VarLen	;
	QUIT;
	*STEP 4.2. Include the missing Transpose step;
	proc transpose data=work.maxx
		out=work.maxx_t;
	run;
	*STEP 5. GET MAXIMUM LENGTH VALUES;
	DATA maxx_t2;
		LENGTH	name6	$100; SET maxx_t ;
		IF col1<1 THEN col1=1;
		name6 = TRIM(_name_)||'	$'||STRIP(PUT(col1, best.));
	RUN;
	*STEP 5.1. Then create a macro variable to store the new length of each character variable.;
	PROC SQL	NOPRINT ;
		SELECT	strip(name6)	INTO: newlen	SEPARATED BY '	' FROM maxx_t2 ;
	QUIT;
	*STEP 6. CHANGE VARIABLE NAMES TO TEMPORARY NAMES;
	DATA	temp (drop = &CVar. ) ; LENGTH	&CVarLen	;
		SET	&inp.; &tname.;
	RUN;
	*STEP 7. CHANGE TEMPORARY NAMES BACK TO ORIGINAL VARIABLE NAMES WITH NEW LENGTHS;
	DATA		newONE	(drop= &CVarx.) ; LENGTH &newlen.	;
		SET	temp ;
		&tnamex.;
	RUN;
	*STEP 8. Rebuild the original dataset in the same order as the original (from step 0 above);
	&rebuild ; 


	proc datasets library=work memtype=data nolist;
		delete origorder origorder2 newONE temp maxx_t2 maxx_t maxx VarLen CharVar2 ;
	run;
	quit;

%MEND ChgLen;

 

1 REPLY 1
jackiedale
Calcite | Level 5

Thank you for the macro.  It's very useful when working with data sets created by REDCap code where the default character-variable length is $500.

 

Another modification that might be needed is to increase the length of the variable VARS in the data step OrigOrder2.  The variable must be long enough to hold all variable names in the original data set.  It could be set to 32767, the maximum length of a character variable or it could be set to 33*Number of Variables (32 for the maximum length of a variable name plus 1 for the spaces between the names.)

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 1 reply
  • 458 views
  • 2 likes
  • 2 in conversation