Solved: Re: Stack database columns

yeaforme · Posted 04-16-2012 09:02 PM

Kind of new to SAS - sorry if this question is easier than I think it is. I'm trying to transform a dataset. I believe proc transpose is the best way to go about it, but all the proc transpose examples I've found don't deal with my particular data structure. What I currently have is the following:

Name	Jim	Bob	Sue	George
Code	101	105	216	412
Date1	1	4	4	6
Date2	1	3	2	6
Date3	5	7	8	7
Date4	5	1	6	8

What I want, is the following:

Code	Date	Amount
101	Date1	1
101	Date2	1
101	Date3	5
101	Date4	5
105	Date1	4
105	Date2	3
105	Date3	7
105	Date4	1
216	Date1	4
216	Date2	2
216	Date3	8
216	Date4	6
412	Date1	6
412	Date2	6
412	Date3	7
412	Date4	8

Of course, the example is overly simplistic. The actual data set has roughly 1000 IDs across 5000 Dates. Any help would be appreciated. Cheers!

EDIT: Changed to remove header which should not have existed.

Patrick · Posted 04-24-2012 06:06 AM

Sticking with my code version - now tested on 2 copies of the data you've provided.

To run it in your environment the only thing you need to do is to change the value of csvPath to the directory where your csv files live.

The way the max. number of array elements is determined is not very efficient - but I wanted to keep the code simple. If performance is an issue then it could be done quite a bit more efficient.

%let csvPath=C:\_temp\demo;

/* determine max. number of array elements needed */
filename Mycsv "&csvPath\*.csv"; /* reads all files in folder mycsvs following naming pattern file*.csv */
data _null_;
infile Mycsv end=last lrecl=32767;
input;
retain maxvars;
maxvars=max(countc(_infile_,','),maxvars);
if last then call symput('Nvars',cats(maxvars));
run;

/* read, append and reorganise source data */
data want(drop=_:);
attrib _Type length=$20 informat=$20.;
array _Codes {&Nvars} $20. (&Nvars*'');
array _Currs {&Nvars} $3. (&Nvars*'');
attrib
    Date length=8 format=date9.
    Code length=$20
    Currency length=$3
    Amount length=8 format=comma16.2
    ;

infile Mycsv dsd dlm=',' truncover LRECL=32767;

input _Type @;
_Type=upcase(_Type);

if _Type='NAME' then
do;
     return;
end;
else
if _Type='CODE' then
do;
     call missing(of _all_);
    _infile_=prxchange('s/$[[:alnum:]]*$//o',-1,_infile_);
    input _Codes{*} :$20.;
end;
else
if _Type='CURRENCY' then
do;
    input _Currs{*} :$20.;
end;
else
do;
    Date=input(_Type,mmddyy10.);
    do _i=1 to &Nvars while(not missing(_Codes(_i)));

      Code=_Codes(_i);
      Currency=_Currs(_i);
      input Amount ?? :20. @;
      output;
    end;
end;
run;

/*proc print data=want;*/
/*run;*/

View solution in original post

Ksharp · Posted 04-17-2012 01:07 AM

Hoooooh. That is really not easy.

data x;
infile cards expandtabs;
input Name $     Jim     Bob     Sue     George ;
cards;
Code     101     105     216     412
Date1     1     4     4     6
Date2     1     3     2     6
Date3     5     7     8     7
Date4     5     1     6     8
;
run;
data _null_;
dsid=open("x","i");
num=attrn(dsid,"nvars");
do i=2 to num;
 n+1;
 name=catx(',',catt(varname(dsid,1),' as date'),catt(varname(dsid,i),' as amount'));
 call symputx(cats('name',n),name);
end;
  call symputx('n',num-1);
rc=close(dsid);
run;
data _null_;
 set x(obs=1);
 array _v{*} _numeric_;
 do i=1 to dim(_v);
  call symputx(cats('value',i),_v{i}) ;
 end;
run;
%put _user_;

%macro stack;
proc sql;
create table want as 
%do i=1 %to &n;
 select "&&value&i" as code,&&name&i from x(firstobs=2) 
 %if &i ne &n %then %do; union all  %end;
%end;
 ;
quit;
%mend stack;

 %stack

Ksharp

Cynthia_sas · Posted 04-17-2012 09:58 AM

I'm just curious. What is the purpose of the "Header 1", "Header 2", etc. Are they standing in for variable names? SAS variable names cannot contain spaces, so are you showing a PROC PRINT of the ACTUAL dataset or of the dummy dataset? If so, are "Header 1", "Header 2", etc the variable labels??? Can you explain in the before/after, what happens to the NAME information?

cynthia

yeaforme · Posted 04-17-2012 11:11 AM

Sorry, headers should not have been there. I apologize for the oversight and have edited the original post to reflect that.

Astounding · Posted 04-17-2012 11:30 AM

OK, the program changes but not by much. This version assumes that all your fields are character. It's easier to program if we can assume that the first column is character but the rest are numeric:

data have (keep=code date amount);

if 0 then set have (drop=name); /* Make sure NAME is not part of the array */

array names {*} _character_;

length code date amount $ 8; /* MUST come AFTER array statement */

do _i_=1 to dim(names);

do _n_=1 to _nobs_;

set have point=_n_ nobs=_nobs_;

if _n_=1 then code = names{_i_};

else do;

date = name;

amount = names{_i_};

output;

end;

stop; /* DO NOT OMIT THIS or you will loop forever */

run;

As usual, it's untested. Good luck. (Not sure I'll be around to follow up but I'll try.)

Small correction to array reference, using the right array name.

yeaforme · Posted 04-17-2012 11:58 AM

Also, the data is sitting in CSV files at the moment - does that alter the code significantly as well?

Astounding · Posted 04-17-2012 10:45 AM

yeaforme,

If your machine has enough memory, I would consider reading the entire data set into a two-dimensional array, then using loops to reconstruct the output. It would take about 40MB of memory. Here is the slower way (not too bad though, considering the size of your data set). It runs through the entire data set for each Code, assumes your existing variables are actually named Header_1, Header_2, etc, and that the first row of data is the one that has all the Names in it.

data want;

* If needed, add a LENGTH statement to widen the new variables;

new_header_2='Code';

new_header_3=''Date';
new_header_4='Amount';

output;

keep new_header_2 new_header_3 new_header_4;

if 0 then set have;

array headers {*} header_:;

do _i_ = 2 to dim(headers);

do _n_=2 to _nobs_;

set have point=_n_ nobs=_nobs_;

if _n_=2 then new_header_2=headers{_i_};

else do;

new_header_3 = header_1;

new_header_4 = headers{_i_};

output;

end;

stop;

run;

Sorry, I had placed a comment right in the middle of the program. Still, this is untested code so it may need to be tweaked.

And a subsequent correction to the calculation of new_header_2.

Good luck.

FriedEgg · Posted 04-17-2012 12:44 PM

data foo;

infile cards dlm=',';

input (var val1-val4) ($);

cards;

Name,Jim,Bob,Sue,George

Code,101,105,216,412

Date1,1,4,4,6

Date2,1,3,2,6

Date3,5,7,8,7

Date4,5,1,6,8

;

run;

proc transpose data=foo out=bar(drop=_: name);

id var;

var val1-val4;

run;

proc sort data=bar; by code; run;

proc transpose data=bar out=want(rename=( _name_=Date col1=Amount));

by code;

var date1-date4;

run;

proc print data=want noobs; run;

Code	Date	Amount
101	Date1	1
101	Date2	1
101	Date3	5
101	Date4	5
105	Date1	4
105	Date2	3
105	Date3	7
105	Date4	1
216	Date1	4
216	Date2	2
216	Date3	8
216	Date4	6
412	Date1	6
412	Date2	6
412	Date3	7
412	Date4	8

FriedEgg · Posted 04-17-2012 01:28 PM

Another way

data foo;

infile cards dlm=',';

length var1-var6 $10 vals1 $1 vals2-vals6 $32767; /* the average length of the 1,000 ids will need to be less then 32bytes */

input var1 vals1 && var2 vals2 && var3 vals3 && var4 vals4 && var5 vals5 && var6 vals6 &&; * create this as macro for more columns;

dates=catx(',',of var3-var6); *what rows were the dates on;

count=countw(vals2,',');

do i=1 to count;

code=scan(vals2,i,','); *use input function to properly format this with your real data;

date=scan(dates,i,','); *use input function to properly format this with your real data;

do j=3 to 6; *what rows were the dates on;

amount=scan(vvaluex(cats('vals',j)),i,','); *use input function to properly format this with your real data;

output;

end;

keep code date amount;

cards;

Name,Jim,Bob,Sue,George

Code,101,105,216,412

Date1,1,4,4,6

Date2,1,3,2,6

Date3,5,7,8,7

Date4,5,1,6,8

;

run;

code	date	amount
101	Date1	1
101	Date1	1
101	Date1	5
101	Date1	5
105	Date2	4
105	Date2	3
105	Date2	7
105	Date2	1
216	Date3	4
216	Date3	2
216	Date3	8
216	Date3	6
412	Date4	6
412	Date4	6
412	Date4	7
412	Date4	8

Haikuo · Posted 04-17-2012 01:20 PM

Another approach:

data have;

input Name$ Jim Bob Sue George;

cards;

Code 101 105 216 412

Date1 1 4 4 6

Date2 1 3 2 6

Date3 5 7 8 7

Date4 5 1 6 8

;

proc transpose data=have out=w1;

by name;

run;

proc sql;

create table want(drop=_name_ rename=(name=date col1=amount)) as

select b.*,a.col1 as code from w1(where= (name='Code')) a

left join

w1(where= (name ne 'Code')) b

on a._name_=b._name_

order by code, name;

quit;

proc print;run;

Haikuo

yeaforme · Posted 04-17-2012 05:09 PM

Thanks so much for all the responses - I certainly have a lot of code to test =).

Looking quickly through many of the responses I'm realizing that another aspect that will further complicate the matter is that the Date1, Date2, etc. that I have written in the example dataset in the original post are actual dates per the CSV file and that those dates don't follow any simple iterative convention (making it hard to utilize code that does a do loop for vars Date1-DateX, since the dates are actual dates, such as 1/5/1990, 1/7/1990, 1/8/1990, etc. and there are gaps such that not every date between the start date and end date are represented). I certainly have my work cut out for me still, but I think all these code examples will give me a solid start - whereas before creating this post I was completely lost at even how to approach the question.

Thanks again! If you have any other code that might help, I'll gladly take it, too!

FriedEgg · Posted 04-17-2012 05:27 PM

the second example code I provide should work seamlessly with actual dates, with the transpose style you will want to make a few tweaks but it will still work, just add options validvarname=any; before the transpose step. You will probably also want to add another step to then convert the data columns to their properly formatted values.

Astounding · Posted 04-17-2012 06:22 PM

Given the solutions posted so far, and the changes to the requirements, I'm going to go back to my original plan ... sticking all the data into a two-dimensional array. This will be necessary to compete on speed with the other solutions. It may have memory limitations, but probably no more so than any of the other approaches.

data have;

infile cards end=done;

length varname name1-name4 $ 8;

input varname name1-name4;

cards;

Code 101 105 216 412

Date1 1 4 4 6

Date2 1 3 2 6

Date3 5 7 8 7

Date4 5 1 6 8

;

data _null_; /* extra step to capture dimensions of the array */

set have nobs=_nobs_;

array chars {*} _character_;

ncols = dim(chars);

call symputx ('nrows', _nobs_);

call symputx('ncols', ncols);

stop;

run;

data want;

length code date amount $ 8;

array names_x2 {&nrows, &ncols} $;

do rownum=1 to &nrows until (loaded);

set have end=loaded;

array names {&ncols} varname name:;

do colnum = 1 to &ncols;

names_x2{rownum, colnum} = names{colnum};

end;

* Two-dimensional array is now loaded. Time to unload it.;

do colnum=2 to &ncols;

do rownum=1 to &nrows;

if rownum=1 then code = names_x2{1, colnum};

else do;

date = names_x2{rownum, 1};

amount = names_x2{rownum, colnum}

output;

end;

keep code date amount;

run;

This will be flexible, fast, and reasonably scalable. Alas, it is untested at the moment.

Good luck.

yeaforme · Posted 04-21-2012 10:03 PM

Get multiple errors with the above code -

ERROR: Too many variables defined for the dimension(s) specified for the array names

right below:

array names {&ncols} varname name:;

AND

ERROR 22-322: Syntax error, expecting one of the following: !, !!, &, *, **, +, -, /, ;, <, <=,

<>, =, >, ><, >=, AND, EQ, GE, GT, IN, LE, LT, MAX, MIN, NE, NG, NL, NOTIN, OR,

^=, |, ||, ~=.

right below;

output;

Tried the other code, but I'm just not code-headed enough to understand how to further manipulate it to my CSV files.

In that vein, here is a very small portion taken from one of the CSV files:

Name	SCIENCES	NEUROBIOLOGICAL	MARTIN MATS.
Code	144672(PA)	543325(PA)	98761J(PA)
Currency	U$	MP	C
1/4/1990	#N/A	150	#N/A
1/5/1990	15.26	#N/A	#N/A
1/8/1990	15.30	155	1.05
1/9/1990	14.89	152	0.98

There are roughly 1,000 columns and roughly 6,000 date-rows in ascending order (starting at 1/1/1990 through 12/31/2011) making the CSV file roughly 1,000 columns across and 6,000 rows down. To compound the problem, there are 66 CSV files all with data in this form. Ultimately I need to get all the data across all the CSV files put together into one long list, but once I have each of the 66 imported and organized correctly, appending them is simple. What I really need help with is getting the data re-formatted

Ultimately, I need to get the data in the following form:

Date	Code	Currency	Amount
1/4/1990	144672	U$	.
1/5/1990	144672	U$	15.26
1/8/1990	144672	U$	15.30
1/9/1990	144672	U$	14.89
1/4/1990	543325	MP	150
1/5/1990	543325	MP	.
1/8/1990	543325	MP	155
1/9/1990	543325	MP	152
1/4/1990	98761J	C	.
1/5/1990	98761J	C	.
1/8/1990	98761J	C	1.05
1/9/1990	98761J	C	0.98

Note: The (PA) needs to be removed from the Code, and #N/A's need to be replaced with missing - all of this I can do on my own if it simplifies helping me with the code to re-form the data.

Thanks again!!

Patrick · Posted 04-22-2012 12:08 AM

Below code works with the sample data you've provided.

/* create some sample csv's */
filename Mycsv 'C:\mycsvs';
data _null_;
file Mycsv(file1.csv);
input;
put _infile_;
datalines;
Name,SCIENCES,NEUROBIOLOGICAL,MARTIN MATS.
Code,144672(PA),543325(PA),98761J(PA)
Currency,U$,MP,C
1/04/1990,#N/A,150,#N/A
1/05/1990,15.26,#N/A,#N/A
1/08/1990,15.3,155,1.05
1/09/1990,14.89,152,0.98
Code,x44672(PA),x43325(PA),x8761J(PA)
Currency,U$,MP,C
1/10/1990,#N/A,150,#N/A
1/11/1990,15.26,#N/A,#N/A
1/12/1990,15.3,155,1.05
;
run;
data _null_;
file Mycsv(file2.csv);
input;
put _infile_;
datalines;
Name,SCIENCES,NEUROBIOLOGICAL,MARTIN MATS.
Code,144672(PA),543325(PA),98761J(PA)
Currency,U$,MP,C
1/04/1991,#N/A,150,#N/A
1/05/1991,15.26,#N/A,#N/A
1/08/1991,15.3,155,1.05
1/09/1991,14.89,152,0.98
;
run;

/* determine max. number of array elements needed */
filename Mycsv 'C:\mycsvs\file*.csv'; /* reads all files in folder mycsvs following naming pattern file*.csv */
data _null_;
infile Mycsv end=last;
input;
retain maxvars;
maxvars=max(countc(_infile_,','),maxvars);
if last then call symput('Nvars',cats(maxvars));
run;

/* read, append and reorganise source data */
data want(drop=_:);
attrib _Type length=$20 informat=$20.;
array _Codes {&Nvars} $20. (&Nvars*'');
array _Currs {&Nvars} $3. (&Nvars*'');
attrib
    Date length=8 format=ddmmyy10.
    Code length=$20
    Currency length=$3
    Amount length=8 format=comma16.2
    ;

infile Mycsv dsd dlm=',' truncover LRECL=2000;

input _Type @;
_Type=upcase(_Type);

if _Type='NAME' then
do;
return;
end;
else
if _Type='CODE' then
do;

     call missing(of _all_);
    _infile_=prxchange('s/$[[:alnum:]]*$//o',-1,_infile_);
    input _Codes{*} :$20.;
end;
else
if _Type='CURRENCY' then
do;
    input _Currs{*} :$20.;
end;
else
do;
    Date=input(_Type,ddmmyy10.);
    do _i=1 to &Nvars while(not missing(_Codes(_i)));

      Code=_Codes(_i);
      Currency=_Currs(_i);
      input Amount ?? :20. @;
      output;
    end;
end;
run;

proc print data=want;
run;

Message was edited by: Patrick Removal of unnecessary Amts array

Registration is open

SAS Training: Just a Click Away