Kind of new to SAS - sorry if this question is easier than I think it is. I'm trying to transform a dataset. I believe proc transpose is the best way to go about it, but all the proc transpose examples I've found don't deal with my particular data structure. What I currently have is the following:
| Name | Jim | Bob | Sue | George | 
| Code | 101 | 105 | 216 | 412 | 
| Date1 | 1 | 4 | 4 | 6 | 
| Date2 | 1 | 3 | 2 | 6 | 
| Date3 | 5 | 7 | 8 | 7 | 
| Date4 | 5 | 1 | 6 | 8 | 
What I want, is the following:
| Code | Date | Amount | 
| 101 | Date1 | 1 | 
| 101 | Date2 | 1 | 
| 101 | Date3 | 5 | 
| 101 | Date4 | 5 | 
| 105 | Date1 | 4 | 
| 105 | Date2 | 3 | 
| 105 | Date3 | 7 | 
| 105 | Date4 | 1 | 
| 216 | Date1 | 4 | 
| 216 | Date2 | 2 | 
| 216 | Date3 | 8 | 
| 216 | Date4 | 6 | 
| 412 | Date1 | 6 | 
| 412 | Date2 | 6 | 
| 412 | Date3 | 7 | 
| 412 | Date4 | 8 | 
Of course, the example is overly simplistic. The actual data set has roughly 1000 IDs across 5000 Dates. Any help would be appreciated. Cheers!
EDIT: Changed to remove header which should not have existed.
Sticking with my code version - now tested on 2 copies of the data you've provided.
To run it in your environment the only thing you need to do is to change the value of csvPath to the directory where your csv files live.
The way the max. number of array elements is determined is not very efficient - but I wanted to keep the code simple. If performance is an issue then it could be done quite a bit more efficient.
%let csvPath=C:\_temp\demo;
/* determine max. number of array elements needed */
filename Mycsv "&csvPath\*.csv";  /* reads all files in folder mycsvs following naming pattern file*.csv */
data _null_;
  infile Mycsv end=last lrecl=32767;
  input;
  retain maxvars;
  maxvars=max(countc(_infile_,','),maxvars);
  if last then call symput('Nvars',cats(maxvars));
run;
/* read, append and reorganise source data */
data want(drop=_:);
  attrib _Type length=$20 informat=$20.;
  array _Codes {&Nvars} $20. (&Nvars*'');
  array _Currs {&Nvars} $3. (&Nvars*'');
  attrib 
    Date length=8 format=date9.
    Code  length=$20
    Currency length=$3
    Amount length=8 format=comma16.2
    ;
infile Mycsv dsd dlm=',' truncover LRECL=32767;
  input _Type @;
  _Type=upcase(_Type);
  if _Type='NAME' then
  do;
     return;
  end;
  else
  if _Type='CODE' then
  do;
     call missing(of _all_);
    _infile_=prxchange('s/\([[:alnum:]]*\)//o',-1,_infile_);
    input _Codes{*} :$20.;
  end;
  else
  if _Type='CURRENCY' then
  do;
    input _Currs{*} :$20.;
  end;
  else
  do;
    Date=input(_Type,mmddyy10.);
    do _i=1 to &Nvars while(not missing(_Codes(_i)));
      Code=_Codes(_i);
      Currency=_Currs(_i);
      input Amount ?? :20. @;
      output;
    end;
  end;
run;
/*proc print data=want;*/
/*run;*/
Hoooooh. That is really not easy.
data x;
infile cards expandtabs;
input Name $     Jim     Bob     Sue     George ;
cards;
Code     101     105     216     412
Date1     1     4     4     6
Date2     1     3     2     6
Date3     5     7     8     7
Date4     5     1     6     8
;
run;
data _null_;
dsid=open("x","i");
num=attrn(dsid,"nvars");
do i=2 to num;
 n+1;
 name=catx(',',catt(varname(dsid,1),' as date'),catt(varname(dsid,i),' as amount'));
 call symputx(cats('name',n),name);
end;
  call symputx('n',num-1);
rc=close(dsid);
run;
data _null_;
 set x(obs=1);
 array _v{*} _numeric_;
 do i=1 to dim(_v);
  call symputx(cats('value',i),_v{i}) ;
 end;
run;
%put _user_;
%macro stack;
proc sql;
create table want as 
%do i=1 %to &n;
 select "&&value&i" as code,&&name&i from x(firstobs=2) 
 %if &i ne &n %then %do; union all  %end;
%end;
 ;
quit;
%mend stack;
 %stack
Ksharp
I'm just curious. What is the purpose of the "Header 1", "Header 2", etc. Are they standing in for variable names? SAS variable names cannot contain spaces, so are you showing a PROC PRINT of the ACTUAL dataset or of the dummy dataset? If so, are "Header 1", "Header 2", etc the variable labels??? Can you explain in the before/after, what happens to the NAME information?
cynthia
Sorry, headers should not have been there. I apologize for the oversight and have edited the original post to reflect that.
OK, the program changes but not by much. This version assumes that all your fields are character. It's easier to program if we can assume that the first column is character but the rest are numeric:
data have (keep=code date amount);
if 0 then set have (drop=name); /* Make sure NAME is not part of the array */
array names {*} _character_;
length code date amount $ 8; /* MUST come AFTER array statement */
do _i_=1 to dim(names);
do _n_=1 to _nobs_;
set have point=_n_ nobs=_nobs_;
if _n_=1 then code = names{_i_};
else do;
date = name;
amount = names{_i_};
output;
end;
end;
end;
stop; /* DO NOT OMIT THIS or you will loop forever */
run;
As usual, it's untested. Good luck. (Not sure I'll be around to follow up but I'll try.)
Small correction to array reference, using the right array name.
Also, the data is sitting in CSV files at the moment - does that alter the code significantly as well?
yeaforme,
If your machine has enough memory, I would consider reading the entire data set into a two-dimensional array, then using loops to reconstruct the output. It would take about 40MB of memory. Here is the slower way (not too bad though, considering the size of your data set). It runs through the entire data set for each Code, assumes your existing variables are actually named Header_1, Header_2, etc, and that the first row of data is the one that has all the Names in it.
data want;
* If needed, add a LENGTH statement to widen the new variables;
new_header_2='Code';
   new_header_3=''Date';
   new_header_4='Amount';
output;
keep new_header_2 new_header_3 new_header_4;
if 0 then set have;
array headers {*} header_:;
do _i_ = 2 to dim(headers);
do _n_=2 to _nobs_;
set have point=_n_ nobs=_nobs_;
if _n_=2 then new_header_2=headers{_i_};
else do;
new_header_3 = header_1;
new_header_4 = headers{_i_};
output;
end;
end;
end;
stop;
run;
Sorry, I had placed a comment right in the middle of the program. Still, this is untested code so it may need to be tweaked.
And a subsequent correction to the calculation of new_header_2.
Good luck.
data foo;
infile cards dlm=',';
input (var val1-val4) ($);
cards;
Name,Jim,Bob,Sue,George
Code,101,105,216,412
Date1,1,4,4,6
Date2,1,3,2,6
Date3,5,7,8,7
Date4,5,1,6,8
;
run;
proc transpose data=foo out=bar(drop=_: name);
id var;
var val1-val4;
run;
proc sort data=bar; by code; run;
proc transpose data=bar out=want(rename=( _name_=Date col1=Amount));
by code;
var date1-date4;
run;
proc print data=want noobs; run;
| Code | Date | Amount | 
| 101 | Date1 | 1 | 
| 101 | Date2 | 1 | 
| 101 | Date3 | 5 | 
| 101 | Date4 | 5 | 
| 105 | Date1 | 4 | 
| 105 | Date2 | 3 | 
| 105 | Date3 | 7 | 
| 105 | Date4 | 1 | 
| 216 | Date1 | 4 | 
| 216 | Date2 | 2 | 
| 216 | Date3 | 8 | 
| 216 | Date4 | 6 | 
| 412 | Date1 | 6 | 
| 412 | Date2 | 6 | 
| 412 | Date3 | 7 | 
| 412 | Date4 | 8 | 
Another way
data foo;
infile cards dlm=',';
length var1-var6 $10 vals1 $1 vals2-vals6 $32767; /* the average length of the 1,000 ids will need to be less then 32bytes */
input var1 vals1 && var2 vals2 && var3 vals3 && var4 vals4 && var5 vals5 && var6 vals6 &&; * create this as macro for more columns;
dates=catx(',',of var3-var6); *what rows were the dates on;
count=countw(vals2,',');
do i=1 to count;
code=scan(vals2,i,','); *use input function to properly format this with your real data;
date=scan(dates,i,','); *use input function to properly format this with your real data;
do j=3 to 6; *what rows were the dates on;
amount=scan(vvaluex(cats('vals',j)),i,','); *use input function to properly format this with your real data;
output;
end;
end;
keep code date amount;
cards;
Name,Jim,Bob,Sue,George
Code,101,105,216,412
Date1,1,4,4,6
Date2,1,3,2,6
Date3,5,7,8,7
Date4,5,1,6,8
;
run;
| code | date | amount | 
| 101 | Date1 | 1 | 
| 101 | Date1 | 1 | 
| 101 | Date1 | 5 | 
| 101 | Date1 | 5 | 
| 105 | Date2 | 4 | 
| 105 | Date2 | 3 | 
| 105 | Date2 | 7 | 
| 105 | Date2 | 1 | 
| 216 | Date3 | 4 | 
| 216 | Date3 | 2 | 
| 216 | Date3 | 8 | 
| 216 | Date3 | 6 | 
| 412 | Date4 | 6 | 
| 412 | Date4 | 6 | 
| 412 | Date4 | 7 | 
| 412 | Date4 | 8 | 
Another approach:
data have;
input Name$ Jim Bob Sue George;
cards;
Code 101 105 216 412
Date1 1 4 4 6
Date2 1 3 2 6
Date3 5 7 8 7
Date4 5 1 6 8
;
proc transpose data=have out=w1;
by name;
run;
proc sql;
create table want(drop=_name_ rename=(name=date col1=amount)) as
select b.*,a.col1 as code from w1(where= (name='Code')) a
left join
w1(where= (name ne 'Code')) b
on a._name_=b._name_
order by code, name;
quit;
proc print;run;
Haikuo
Thanks so much for all the responses - I certainly have a lot of code to test =).
Looking quickly through many of the responses I'm realizing that another aspect that will further complicate the matter is that the Date1, Date2, etc. that I have written in the example dataset in the original post are actual dates per the CSV file and that those dates don't follow any simple iterative convention (making it hard to utilize code that does a do loop for vars Date1-DateX, since the dates are actual dates, such as 1/5/1990, 1/7/1990, 1/8/1990, etc. and there are gaps such that not every date between the start date and end date are represented). I certainly have my work cut out for me still, but I think all these code examples will give me a solid start - whereas before creating this post I was completely lost at even how to approach the question.
Thanks again! If you have any other code that might help, I'll gladly take it, too!
the second example code I provide should work seamlessly with actual dates, with the transpose style you will want to make a few tweaks but it will still work, just add options validvarname=any; before the transpose step. You will probably also want to add another step to then convert the data columns to their properly formatted values.
Given the solutions posted so far, and the changes to the requirements, I'm going to go back to my original plan ... sticking all the data into a two-dimensional array. This will be necessary to compete on speed with the other solutions. It may have memory limitations, but probably no more so than any of the other approaches.
data have;
infile cards end=done;
length varname name1-name4 $ 8;
input varname name1-name4;
cards;
Code 101 105 216 412
Date1 1 4 4 6
Date2 1 3 2 6
Date3 5 7 8 7
Date4 5 1 6 8
;
data _null_; /* extra step to capture dimensions of the array */
set have nobs=_nobs_;
array chars {*} _character_;
ncols = dim(chars);
call symputx ('nrows', _nobs_);
call symputx('ncols', ncols);
stop;
run;
data want;
length code date amount $ 8;
array names_x2 {&nrows, &ncols} $;
do rownum=1 to &nrows until (loaded);
set have end=loaded;
array names {&ncols} varname name:;
do colnum = 1 to &ncols;
names_x2{rownum, colnum} = names{colnum};
end;
end;
* Two-dimensional array is now loaded. Time to unload it.;
do colnum=2 to &ncols;
do rownum=1 to &nrows;
if rownum=1 then code = names_x2{1, colnum};
else do;
date = names_x2{rownum, 1};
amount = names_x2{rownum, colnum}
output;
end;
end;
end;
keep code date amount;
run;
This will be flexible, fast, and reasonably scalable. Alas, it is untested at the moment.
Good luck.
Get multiple errors with the above code -
ERROR: Too many variables defined for the dimension(s) specified for the array names
right below:
array names {&ncols} varname name:;
AND
ERROR 22-322: Syntax error, expecting one of the following: !, !!, &, *, **, +, -, /, ;, <, <=,
<>, =, >, ><, >=, AND, EQ, GE, GT, IN, LE, LT, MAX, MIN, NE, NG, NL, NOTIN, OR,
^=, |, ||, ~=.
right below;
output;
Tried the other code, but I'm just not code-headed enough to understand how to further manipulate it to my CSV files.
In that vein, here is a very small portion taken from one of the CSV files:
| Name | SCIENCES | NEUROBIOLOGICAL | MARTIN MATS. | 
| Code | 144672(PA) | 543325(PA) | 98761J(PA) | 
| Currency | U$ | MP | C | 
| 1/4/1990 | #N/A | 150 | #N/A | 
| 1/5/1990 | 15.26 | #N/A | #N/A | 
| 1/8/1990 | 15.30 | 155 | 1.05 | 
| 1/9/1990 | 14.89 | 152 | 0.98 | 
There are roughly 1,000 columns and roughly 6,000 date-rows in ascending order (starting at 1/1/1990 through 12/31/2011) making the CSV file roughly 1,000 columns across and 6,000 rows down. To compound the problem, there are 66 CSV files all with data in this form. Ultimately I need to get all the data across all the CSV files put together into one long list, but once I have each of the 66 imported and organized correctly, appending them is simple. What I really need help with is getting the data re-formatted
Ultimately, I need to get the data in the following form:
| Date | Code | Currency | Amount | 
| 1/4/1990 | 144672 | U$ | . | 
| 1/5/1990 | 144672 | U$ | 15.26 | 
| 1/8/1990 | 144672 | U$ | 15.30 | 
| 1/9/1990 | 144672 | U$ | 14.89 | 
| 1/4/1990 | 543325 | MP | 150 | 
| 1/5/1990 | 543325 | MP | . | 
| 1/8/1990 | 543325 | MP | 155 | 
| 1/9/1990 | 543325 | MP | 152 | 
| 1/4/1990 | 98761J | C | . | 
| 1/5/1990 | 98761J | C | . | 
| 1/8/1990 | 98761J | C | 1.05 | 
| 1/9/1990 | 98761J | C | 0.98 | 
Note: The (PA) needs to be removed from the Code, and #N/A's need to be replaced with missing - all of this I can do on my own if it simplifies helping me with the code to re-form the data.
Thanks again!!
Below code works with the sample data you've provided.
/* create some sample csv's */
filename Mycsv 'C:\mycsvs';
data _null_;
  file Mycsv(file1.csv);
  input;
  put _infile_;
  datalines;
Name,SCIENCES,NEUROBIOLOGICAL,MARTIN MATS.
Code,144672(PA),543325(PA),98761J(PA)
Currency,U$,MP,C
1/04/1990,#N/A,150,#N/A
1/05/1990,15.26,#N/A,#N/A
1/08/1990,15.3,155,1.05
1/09/1990,14.89,152,0.98
Code,x44672(PA),x43325(PA),x8761J(PA)
Currency,U$,MP,C
1/10/1990,#N/A,150,#N/A
1/11/1990,15.26,#N/A,#N/A
1/12/1990,15.3,155,1.05
;
run;
data _null_;
  file Mycsv(file2.csv);
  input;
  put _infile_;
  datalines;
Name,SCIENCES,NEUROBIOLOGICAL,MARTIN MATS.
Code,144672(PA),543325(PA),98761J(PA)
Currency,U$,MP,C
1/04/1991,#N/A,150,#N/A
1/05/1991,15.26,#N/A,#N/A
1/08/1991,15.3,155,1.05
1/09/1991,14.89,152,0.98
;
run;
/* determine max. number of array elements needed */
filename Mycsv 'C:\mycsvs\file*.csv';  /* reads all files in folder mycsvs following naming pattern file*.csv */
data _null_;
  infile Mycsv end=last;
  input;
  retain maxvars;
  maxvars=max(countc(_infile_,','),maxvars);
  if last then call symput('Nvars',cats(maxvars));
run;
/* read, append and reorganise source data */
data want(drop=_:);
  attrib _Type length=$20 informat=$20.;
  array _Codes {&Nvars} $20. (&Nvars*'');
  array _Currs {&Nvars} $3. (&Nvars*'');
  attrib 
    Date length=8 format=ddmmyy10.
    Code  length=$20
    Currency length=$3
    Amount length=8 format=comma16.2
    ;
infile Mycsv dsd dlm=',' truncover LRECL=2000;
  input _Type @;
  _Type=upcase(_Type);
  if _Type='NAME' then
  do;
     return;
  end;
  else
  if _Type='CODE' then
  do;
     call missing(of _all_);
    _infile_=prxchange('s/\([[:alnum:]]*\)//o',-1,_infile_);
    input _Codes{*} :$20.;
  end;
  else
  if _Type='CURRENCY' then
  do;
    input _Currs{*} :$20.;
  end;
  else
  do;
    Date=input(_Type,ddmmyy10.);
    do _i=1 to &Nvars while(not missing(_Codes(_i)));
      Code=_Codes(_i);
      Currency=_Currs(_i);
      input Amount ?? :20. @;
      output;
    end;
  end;
run;
proc print data=want;
run;
Message was edited by: Patrick Removal of unnecessary Amts array
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
