BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
yeaforme
Calcite | Level 5

Kind of new to SAS - sorry if this question is easier than I think it is.  I'm trying to transform a dataset.  I believe proc transpose is the best way to go about it, but all the proc transpose examples I've found don't deal with my particular data structure.  What I currently have is the following:

NameJimBobSueGeorge
Code101105216412
Date11446
Date21326
Date35787
Date45168

What I want, is the following:

CodeDateAmount
101Date11
101Date21
101Date35
101Date45
105Date14
105Date23
105Date37
105Date41
216Date14
216Date22
216Date38
216Date46
412Date16
412Date26
412Date37
412Date48

Of course, the example is overly simplistic.  The actual data set has roughly 1000 IDs across 5000 Dates.  Any help would be appreciated.  Cheers!

EDIT: Changed to remove header which should not have existed.

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

Sticking with my code version - now tested on 2 copies of the data you've provided.

To run it in your environment the only thing you need to do is to change the value of csvPath to the directory where your csv files live.

The way the max. number of array elements is determined is not very efficient - but I wanted to keep the code simple. If performance is an issue then it could be done quite a bit more efficient.

%let csvPath=C:\_temp\demo;

/* determine max. number of array elements needed */
filename Mycsv "&csvPath\*.csv";  /* reads all files in folder mycsvs following naming pattern file*.csv */
data _null_;
  infile Mycsv end=last lrecl=32767;
  input;
  retain maxvars;
  maxvars=max(countc(_infile_,','),maxvars);
  if last then call symput('Nvars',cats(maxvars));
run;

/* read, append and reorganise source data */
data want(drop=_:);
  attrib _Type length=$20 informat=$20.;
  array _Codes {&Nvars} $20. (&Nvars*'');
  array _Currs {&Nvars} $3. (&Nvars*'');
  attrib
    Date length=8 format=date9.
    Code  length=$20
    Currency length=$3
    Amount length=8 format=comma16.2
    ;

  infile Mycsv dsd dlm=',' truncover LRECL=32767;

  input _Type @;
  _Type=upcase(_Type);

  if _Type='NAME' then
  do;
     return;
  end;
  else
  if _Type='CODE' then
  do;
     call missing(of _all_);
    _infile_=prxchange('s/\([[:alnum:]]*\)//o',-1,_infile_);
    input _Codes{*} :$20.;
  end;
  else
  if _Type='CURRENCY' then
  do;
    input _Currs{*} :$20.;
  end;
  else
  do;
    Date=input(_Type,mmddyy10.);
    do _i=1 to &Nvars while(not missing(_Codes(_i)));

      Code=_Codes(_i);
      Currency=_Currs(_i);
      input Amount ?? :20. @;
      output;
    end;
  end;
run;

/*proc print data=want;*/
/*run;*/

View solution in original post

35 REPLIES 35
Ksharp
Super User

Hoooooh. That is really not easy.

data x;
infile cards expandtabs;
input Name $     Jim     Bob     Sue     George ;
cards;
Code     101     105     216     412
Date1     1     4     4     6
Date2     1     3     2     6
Date3     5     7     8     7
Date4     5     1     6     8
;
run;
data _null_;
dsid=open("x","i");
num=attrn(dsid,"nvars");
do i=2 to num;
 n+1;
 name=catx(',',catt(varname(dsid,1),' as date'),catt(varname(dsid,i),' as amount'));
 call symputx(cats('name',n),name);
end;
  call symputx('n',num-1);
rc=close(dsid);
run;
data _null_;
 set x(obs=1);
 array _v{*} _numeric_;
 do i=1 to dim(_v);
  call symputx(cats('value',i),_v{i}) ;
 end;
run;
%put _user_;

%macro stack;
proc sql;
create table want as 
%do i=1 %to &n;
 select "&&value&i" as code,&&name&i from x(firstobs=2) 
 %if &i ne &n %then %do; union all  %end;
%end;
 ;
quit;
%mend stack;

 %stack



Ksharp

Cynthia_sas
SAS Super FREQ

I'm just curious. What is the purpose of the "Header 1", "Header 2", etc. Are they standing in for variable names? SAS variable names cannot contain spaces, so are you showing a PROC PRINT of the ACTUAL dataset or of the dummy dataset? If so, are "Header 1", "Header 2", etc the variable labels??? Can you explain in the before/after, what happens to the NAME information?

cynthia

yeaforme
Calcite | Level 5

Sorry, headers should not have been there.  I apologize for the oversight and have edited the original post to reflect that.

Astounding
PROC Star

OK, the program changes but not by much.  This version assumes that all your fields are character.  It's easier to program if we can assume that the first column is character but the rest are numeric:

data have (keep=code date amount);

    if 0 then set have (drop=name);  /* Make sure NAME is not part of the array */

   array names {*} _character_;

   length code  date amount $ 8;  /* MUST come AFTER array statement */

   do _i_=1 to dim(names);

       do _n_=1 to _nobs_;

           set have point=_n_ nobs=_nobs_;

           if _n_=1 then code = names{_i_};

           else do;

               date = name;

               amount = names{_i_};

               output;

           end;

        end;

   end;

   stop;   /* DO NOT OMIT THIS or you will loop forever */

run;

As usual, it's untested.  Good luck.  (Not sure I'll be around to follow up but I'll try.)

Small correction to array reference, using the right array name.

yeaforme
Calcite | Level 5

Also, the data is sitting in CSV files at the moment - does that alter the code significantly as well?

Astounding
PROC Star

yeaforme,

If your machine has enough memory, I would consider reading the entire data set into a two-dimensional array, then using loops to reconstruct the output.  It would take about 40MB of memory.  Here is the slower way (not too bad though, considering the size of your data set).  It runs through the entire data set for each Code, assumes your existing variables are actually named Header_1, Header_2, etc, and that the first row of data is the one that has all the Names in it.

data want;

   * If needed, add a LENGTH statement to widen the new variables;

   new_header_2='Code';

   new_header_3=''Date';
   new_header_4='Amount';

   output;

   keep new_header_2 new_header_3 new_header_4;

   if 0 then set have;

   array headers {*} header_:;

   do _i_ = 2 to dim(headers);

      do _n_=2 to _nobs_;

          set have point=_n_ nobs=_nobs_;

          if _n_=2 then new_header_2=headers{_i_};

          else do;

              new_header_3 = header_1;

              new_header_4 = headers{_i_};

              output;

          end;

      end;

   end;

   stop;

run;

Sorry, I had placed a comment right in the middle of the program.  Still, this is untested code so it may need to be tweaked.

And a subsequent correction to the calculation of new_header_2.

Good luck.

FriedEgg
SAS Employee

data foo;

infile cards dlm=',';

input (var val1-val4) ($);

cards;

Name,Jim,Bob,Sue,George

Code,101,105,216,412

Date1,1,4,4,6

Date2,1,3,2,6

Date3,5,7,8,7

Date4,5,1,6,8

;

run;

proc transpose data=foo out=bar(drop=_: name);

id var;

var val1-val4;

run;

proc sort data=bar; by code; run;

proc transpose data=bar out=want(rename=( _name_=Date col1=Amount));

by code;

var date1-date4;

run;

proc print data=want noobs; run;

     

CodeDateAmount
101Date11
101Date21
101Date35
101Date45
105Date14
105Date23
105Date37
105Date41
216Date14
216Date22
216Date38
216Date46
412Date16
412Date26
412Date37
412Date48
FriedEgg
SAS Employee

Another way

data foo;

infile cards dlm=',';

length var1-var6 $10 vals1 $1 vals2-vals6 $32767; /* the average length of the 1,000 ids will need to be less then 32bytes */

input var1 vals1 && var2 vals2 && var3 vals3 && var4 vals4 && var5 vals5 && var6 vals6 &&; * create this as macro for more columns;

dates=catx(',',of var3-var6); *what rows were the dates on;

count=countw(vals2,',');

do i=1 to count;

  code=scan(vals2,i,','); *use input function to properly format this with your real data;

  date=scan(dates,i,','); *use input function to properly format this with your real data;

  do j=3 to 6; *what rows were the dates on;

   amount=scan(vvaluex(cats('vals',j)),i,','); *use input function to properly format this with your real data;

   output;

  end;

end;

keep code date amount;

cards;

Name,Jim,Bob,Sue,George

Code,101,105,216,412

Date1,1,4,4,6

Date2,1,3,2,6

Date3,5,7,8,7

Date4,5,1,6,8

;

run;

codedateamount
101Date11
101Date11
101Date15
101Date15
105Date24
105Date23
105Date27
105Date21
216Date34
216Date32
216Date38
216Date36
412Date46
412Date46
412Date47
412Date48
Haikuo
Onyx | Level 15

Another approach:

data have;

input Name$ Jim Bob Sue George;

cards;

Code 101 105 216 412

Date1 1 4 4 6

Date2 1 3 2 6

Date3 5 7 8 7

Date4 5 1 6 8

;

proc transpose data=have out=w1;

by name;

run;

proc sql;

create table want(drop=_name_ rename=(name=date col1=amount)) as

select b.*,a.col1 as code from w1(where= (name='Code')) a

left join

w1(where= (name ne 'Code')) b

on a._name_=b._name_

order by code, name;

quit;

proc print;run;

Haikuo

yeaforme
Calcite | Level 5

Thanks so much for all the responses - I certainly have a lot of code to test =).

Looking quickly through many of the responses I'm realizing that another aspect that will further complicate the matter is that the Date1, Date2, etc. that I have written in the example dataset in the original post are actual dates per the CSV file and that those dates don't follow any simple iterative convention (making it hard to utilize code that does a do loop for vars Date1-DateX, since the dates are actual dates, such as 1/5/1990, 1/7/1990, 1/8/1990, etc. and there are gaps such that not every date between the start date and end date are represented).  I certainly have my work cut out for me still, but I think all these code examples will give me a solid start - whereas before creating this post I was completely lost at even how to approach the question.

Thanks again!  If you have any other code that might help, I'll gladly take it, too!

FriedEgg
SAS Employee

the second example code I provide should work seamlessly with actual dates, with the transpose style you will want to make a few tweaks but it will still work, just add options validvarname=any; before the transpose step.  You will probably also want to add another step to then convert the data columns to their properly formatted values.

Astounding
PROC Star

Given the solutions posted so far, and the changes to the requirements, I'm going to go back to my original plan ... sticking all the data into a two-dimensional array.  This will be necessary to compete on speed with the other solutions.  It may have memory limitations, but probably no more so than any of the other approaches.

data have;

   infile cards end=done;

   length varname name1-name4 $ 8;

   input varname name1-name4;

cards;

Code 101 105 216 412

Date1 1 4 4 6

Date2 1 3 2 6

Date3 5 7 8 7

Date4 5 1 6 8

;

data _null_;  /* extra step to capture dimensions of the array */

   set have nobs=_nobs_;

   array chars {*} _character_;

   ncols = dim(chars);

   call symputx ('nrows', _nobs_);

   call symputx('ncols', ncols);

   stop;

run;

data want;

   length code date amount $ 8;

   array names_x2 {&nrows, &ncols} $;

   do rownum=1 to &nrows until (loaded);

      set have end=loaded;

      array names {&ncols} varname name:;

      do colnum = 1 to &ncols;

           names_x2{rownum, colnum} = names{colnum};

      end;

   end;

   * Two-dimensional array is now loaded.  Time to unload it.;

   do colnum=2 to &ncols;

        do rownum=1 to &nrows;

             if rownum=1 then code = names_x2{1, colnum};

             else do;

                date = names_x2{rownum, 1};

                amount = names_x2{rownum, colnum}

                output;

             end;

        end;

   end;

   keep code date amount;

run;

This will be flexible, fast, and reasonably scalable.  Alas, it is untested at the moment.

Good luck.

yeaforme
Calcite | Level 5

Get multiple errors with the above code -

ERROR: Too many variables defined for the dimension(s) specified for the array names

right below:

array names {&ncols} varname name:;


AND


ERROR 22-322: Syntax error, expecting one of the following: !, !!, &, *, **, +, -, /, ;, <, <=,

              <>, =, >, ><, >=, AND, EQ, GE, GT, IN, LE, LT, MAX, MIN, NE, NG, NL, NOTIN, OR,

              ^=, |, ||, ~=.

right below;

output;

Tried the other code, but I'm just not code-headed enough to understand how to further manipulate it to my CSV files.

In that vein, here is a very small portion taken from one of the CSV files:

NameSCIENCESNEUROBIOLOGICALMARTIN MATS.
Code144672(PA)543325(PA)98761J(PA)
CurrencyU$MPC
1/4/1990#N/A150#N/A

1/5/1990

15.26#N/A#N/A
1/8/199015.301551.05
1/9/199014.891520.98

There are roughly 1,000 columns and roughly 6,000 date-rows in ascending order (starting at 1/1/1990 through 12/31/2011) making the CSV file roughly 1,000 columns across and 6,000 rows down.  To compound the problem, there are 66 CSV files all with data in this form.  Ultimately I need to get all the data across all the CSV files put together into one long list, but once I have each of the 66 imported and organized correctly, appending them is simple.  What I really need help with is getting the data re-formatted

Ultimately, I need to get the data in the following form:

DateCodeCurrencyAmount
1/4/1990144672U$.

1/5/1990

144672U$15.26
1/8/1990144672U$15.30
1/9/1990144672U$14.89
1/4/1990543325MP150
1/5/1990543325MP.
1/8/1990543325MP155
1/9/1990543325MP152
1/4/199098761JC.
1/5/199098761JC.
1/8/199098761JC1.05
1/9/199098761JC0.98

Note: The (PA) needs to be removed from the Code, and #N/A's need to be replaced with missing - all of this I can do on my own if it simplifies helping me with the code to re-form the data.

Thanks again!!

Patrick
Opal | Level 21

Below code works with the sample data you've provided.

/* create some sample csv's */
filename Mycsv 'C:\mycsvs';
data _null_;
  file Mycsv(file1.csv);
  input;
  put _infile_;
  datalines;
Name,SCIENCES,NEUROBIOLOGICAL,MARTIN MATS.
Code,144672(PA),543325(PA),98761J(PA)
Currency,U$,MP,C
1/04/1990,#N/A,150,#N/A
1/05/1990,15.26,#N/A,#N/A
1/08/1990,15.3,155,1.05
1/09/1990,14.89,152,0.98
Code,x44672(PA),x43325(PA),x8761J(PA)
Currency,U$,MP,C
1/10/1990,#N/A,150,#N/A
1/11/1990,15.26,#N/A,#N/A
1/12/1990,15.3,155,1.05
;
run;
data _null_;
  file Mycsv(file2.csv);
  input;
  put _infile_;
  datalines;
Name,SCIENCES,NEUROBIOLOGICAL,MARTIN MATS.
Code,144672(PA),543325(PA),98761J(PA)
Currency,U$,MP,C
1/04/1991,#N/A,150,#N/A
1/05/1991,15.26,#N/A,#N/A
1/08/1991,15.3,155,1.05
1/09/1991,14.89,152,0.98
;
run;

/* determine max. number of array elements needed */
filename Mycsv 'C:\mycsvs\file*.csv';  /* reads all files in folder mycsvs following naming pattern file*.csv */
data _null_;
  infile Mycsv end=last;
  input;
  retain maxvars;
  maxvars=max(countc(_infile_,','),maxvars);
  if last then call symput('Nvars',cats(maxvars));
run;

/* read, append and reorganise source data */
data want(drop=_:);
  attrib _Type length=$20 informat=$20.;
  array _Codes {&Nvars} $20. (&Nvars*'');
  array _Currs {&Nvars} $3. (&Nvars*'');
  attrib
    Date length=8 format=ddmmyy10.
    Code  length=$20
    Currency length=$3
    Amount length=8 format=comma16.2
    ;

  infile Mycsv dsd dlm=',' truncover LRECL=2000;

  input _Type @;
  _Type=upcase(_Type);

  if _Type='NAME' then
  do;
     return;
  end;
  else
  if _Type='CODE' then
  do;

     call missing(of _all_);
    _infile_=prxchange('s/\([[:alnum:]]*\)//o',-1,_infile_);
    input _Codes{*} :$20.;
  end;
  else
  if _Type='CURRENCY' then
  do;
    input _Currs{*} :$20.;
  end;
  else
  do;
    Date=input(_Type,ddmmyy10.);
    do _i=1 to &Nvars while(not missing(_Codes(_i)));

      Code=_Codes(_i);
      Currency=_Currs(_i);
      input Amount ?? :20. @;
      output;
    end;
  end;
run;

proc print data=want;
run;

Message was edited by: Patrick Removal of unnecessary Amts array

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 35 replies
  • 4371 views
  • 6 likes
  • 9 in conversation