Solved: Re: Stack database columns - Page 2

MikeZdeb · Posted 04-22-2012 01:04 PM

hi ... here are a couple of different approaches to your two questions ....

#1 using the datalines file from Ksharp's posting ...

data x;

infile datalines;

input name $ jim bob sue george;

datalines;

Code 101 105 216 412

Date1 1 4 4 6

Date2 1 3 2 6

Date3 5 7 8 7

Date4 5 1 6 8

;

* use a data step to create a macro variable that renames variables in the data set;

filename x dummy;

data _null_;

set x (drop=name obs=1);

file nosee;

put (_all_) (=) @;

call symputx('rename',tranwrd(_file_,'=','=_'));

run;

data x;

set x (firstobs=2 rename=(&rename));

array _(*) _: ;

do j = 1 to dim(_);

amount = _(j);

date = name;

code = compress(vname(_(j)),,'kd');

output;

end;

keep date code amount;

run;

#2 using the attached CSV file based on the data you posted in the second question (it's specific to the posted data, but if all your data takes that form, it should work regardless of the number of codes and dates) ...

* use a data step to create some macro variables (a list of variables for an INPUT statement, a count of currencies, a list of currencies);

data _null_;

infile 'z:\data.csv' dsd firstobs=2 _infile_=x;

input;

x = tranwrd(x, 'Code' , '');

x = tranwrd(x, ',' , ' _' );

x = tranwrd(x, '(PA)' ,'' );

call symputx('vars',x);

input;

call symputx('n',countw(x)-1);

call symputx('curr',catt('"',tranwrd(x, ',' , '" "'),'"'));

stop;

run;

* translate the #N/A to missing with an informat;

proc format;

invalue x '#N/A' = .;

run;

* use the macro variables and the informat;

data x;

infile 'z:\data.csv' dsd firstobs=4;

input date mmddyy. (&vars) (: x.);

array _(*) _: ;

array curr(0:&n) $ _temporary_ (&curr);

do j=1 to dim(_);

currency = curr(j);

amount = _(j);

code = compress(vname(_(j)),,'kad');

output;

end;

format date mmddyy10.;

keep date code currency amount;

run;

yeaforme · Posted 04-23-2012 02:02 PM

First, thanks for the additional input - I'm not quite there but I think we are very, very close.

Second, to Ksharp: yes, I apologize that my initial post was not what the data exactly looked like. Being naive I figured I could save us all time by short-handing the data here and then just applying the principles I saw to the original data - how wrong I was.

MikeZdeb: I tried your second set of code and I got a lot of errors, mostly involving a lot of "". In the process of trying to decipher the errors, I opened up the original csv files in a text editor (before I had been opening them in Excel - again, my naivete) and found that the actual text files have "" around everything except the date-amounts (so variable names are actually: "Name", "Code", "Currency"; dates are actually "1/4/2000", etc.; SCIENCES is actually "SCIENCES", etc.; 144672(PA) is actually "144672(PA)", etc.; and U$ is actually "U$" - however, the date-amounts [i.e., N/A, 150, #N/A, 15.26, #N/A, #N/A, 15.30, 155, 1.05, 14.89, 152, 0.98] are without "" [i.e., are actually N/A, 150, #N/A, 15.26, #N/A, #N/A, 15.30, 155, 1.05, 14.89, 152, 0.98]). These "" were invisible when I viewed them in Excel.

I also tried Ksharps code on the original files, however the proc sql statement sent the following error: ERROR: A maximum of 256 tables can be processed in a single PROC SQL statement. The original data are roughly 1000 IDs across and 6000 dates down, so I imagine that is pushing the proc sql over its limit. Is there a way to alleviate this?

Thoughts? I am continuing to tinker around with both sets of code and I hope I can solve the issues on my own, but again, and as I'm sure you can tell, I'm fairly noob in this area so if you can get there quicker, I'd much appreciated it. Thanks again, so much!

MikeZdeb · Posted 04-23-2012 03:57 PM

hi ... i changed the CSV to look as you specified with quotes around all entries in the first three lines

I modified the first data step, the one that creates the macro variables used in the second data step (actually made it easier to create macro variable &CURR)

give it a try (use with PROC FORMAT and second data step from previous posting) with the new CSV file and with your other data (cut/paste from here since the single quotes with no space bewteen them might look like double quotes if you try to retype this)

data _null_;

infile 'z:\data.csv' dsd firstobs=2 _infile_=x;

input;

x = compress(translate(x,'','"'));

x = tranwrd(x, 'Code' , '');

x = tranwrd(x, ',' , '_' );

x = tranwrd(x, '(PA)' ,'' );

call symputx('vars',x);

input;

call symputx('n',countw(x)-1);

call symputx('curr',x);

stop;

run;

yeaforme · Posted 04-23-2012 07:16 PM

Still getting some errors. Here's the last part of the log:

52 array curr(0:&n) $ _temporary_ (&curr);

-

22

ERROR 22-322: Syntax error, expecting one of the following: a quoted string, a numeric constant,

a datetime constant, a missing value.

53

54 do j=1 to dim(_);

ERROR: Attempt to initialize element of character array curr with numeric constant 1.

WARNING: Partial value initialization of the array curr.

55

56 currency = curr(j);

57

58 amount = _(j);

59

60 code = compress(vname(_(j)),,'kad');

61

62 output;

63

64 end;

65

66 format date mmddyy10.;

67

68 keep date code currency amount;

69

70 run;

NOTE: The SAS System stopped processing this step because of errors.

WARNING: The data set WORK.X may be incomplete. When this step was stopped there were 0

observations and 4 variables.

NOTE: DATA statement used (Total process time):

real time 0.03 seconds

cpu time 0.00 seconds

Also, probably should have done this at the beginning, butust to make it easier, I've attached one of the actual csv files (28 megs or so). Thanks again!

Patrick · Posted 04-23-2012 10:02 PM

I believe the alternative code version I've posted previously in this thread would just work with your data.

It's not a very fancy approach - but I believe it to be stable and able to process any number of source csv's at once.

MikeZdeb · Posted 04-23-2012 10:09 PM

hi ... OK, providing real data helped ... this worked with that data (after some changes, e.g. providing LRECL) ...

data _null_;

infile 'z:\ws1.csv' dsd firstobs=2 _infile_=x lrecl=20000 pad;

input;

x = compress(translate(x,'','"'));

x = tranwrd(x, 'Code' , '');

x = tranwrd(x, ',' , '_' );

x = tranwrd(x, '(PA)' ,'' );

call symputx('vars',x);

input;

call symputx('n',countw(compress(x),',')-1);

file 'z:\curr.txt' lrecl=10000;

put x;

call symputx('curr',compress(x));

stop;

run;

* translate the #N/A to missing with an informat;

proc format;

invalue x '#n/a' , 'N/A' = .;

run;

* use the macro variables and the informat;

data x;

infile 'z:\ws1.csv' dsd firstobs=4 lrecl=20000 pad;

input date : mmddyy. (&vars) (: x.);

array _(*) _: ;

array curr(0:&n) $ _temporary_ (&curr);

do j=1 to dim(_);

currency = curr(j);

amount = _(j);

code = compress(vname(_(j)),,'kad');

output;

end;

format date mmddyy10.;

keep date code currency amount;

run;

* check the data ... random sample;

proc print data=x;

where ranuni(999) le .0001;

run;

* first 10, last 10;

data _null_;

do j=1 to 10, lastrec-9 to lastrec;

set x nobs=lastrec point=j;

put j 7. +2 date mmddyy10. +2 currency $2. +2 amount 8.2 +2 code;

end;

stop;

run;

the LOG (using the data you provided) ...

574 * use the macro variables and the informat;

575 data x;

576 infile 'z:\ws1.csv' dsd firstobs=4 lrecl=20000 pad;

577 input date : mmddyy. (&vars) (: x.);

578 array _(*) _: ;

579 array curr(0:&n) $ _temporary_ (&curr);

580 do j=1 to dim(_);

581 currency = curr(j);

582 amount = _(j);

583 code = compress(vname(_(j)),,'kad');

584 output;

585 end;

586 format date mmddyy10.;

587 keep date code currency amount;

588 run;

NOTE: The infile 'z:\ws1.csv' is:

Filename=z:\ws1.csv,

RECFM=V,LRECL=20000,File Size (bytes)=20022456,

Last Modified=23Apr2012:19:16:00,

Create Time=23Apr2012:21:34:53

NOTE: 5740 records were read from the infile 'z:\ws1.csv'.

The minimum record length was 3295.

The maximum record length was 3872.

NOTE: The data set WORK.X has 3771180 observations and 4 variables.

NOTE: DATA statement used (Total process time):

real time 26.04 seconds

cpu time 16.45 seconds

first 10 and last 10 (I checked your CSV file and those are the last 10 values ... HEY !!! looks like it actually worked) ...

1 01/01/1990 U$ . 130042

2 01/01/1990 U$ . 130057

3 01/01/1990 U$ . 130062

4 01/01/1990 U$ . 130079

5 01/01/1990 U$ . 130086

6 01/01/1990 U$ . 130088

7 01/01/1990 MP . 130092

8 01/01/1990 U$ . 130104

9 01/01/1990 U$ . 130113

10 01/01/1990 MP . 130115

3771171 12/30/2011 U$ 34.97 134057

3771172 12/30/2011 U$ . 134058

3771173 12/30/2011 C$ 0.04 134069

3771174 12/30/2011 K$ 0.77 13406D

3771175 12/30/2011 U$ 13.87 134072

3771176 12/30/2011 U$ . 134076

3771177 12/30/2011 RL 10.78 13407C

3771178 12/30/2011 U$ . 134083

3771179 12/30/2011 U$ 25.33 134093

3771180 12/30/2011 CH 7.17 13409Q

Ksharp · Posted 04-23-2012 01:05 AM

You did not post what your data exactly looks like.

Use Mike's csv file and my code.

proc import datafile='c:\data.csv' out=x dbms=csv replace;getnames=no;
run;

data _null_;
dsid=open("x","i");
num=attrn(dsid,"nvars");
do i=2 to num;
 n+1;
 name=catx(',',catt(varname(dsid,1),' as date'),catt(varname(dsid,i),' as amount'));
 call symputx(cats('name',n),name);
end;
  call symputx('n',n);
rc=close(dsid);  
run;
data _null_;
 set x(firstobs=2 obs=2);
 array _v{*} $ _character_ ;
 do i=2 to dim(_v);
  call symputx(cats('value',i-1),_v{i}) ;
 end;
run;
data _null_;
 set x(firstobs=3 obs=3);
 array _v{*} $ _character_;
 do i=2 to dim(_v);
  call symputx(cats('_value',i-1),_v{i}) ;
 end;
run;


%macro stack;
proc sql;
create table want as 
%do i=1 %to &n;
 select "&&value&i" as code,"&&_value&i" as currency,&&name&i from x(firstobs=4) 
 %if &i ne &n %then %do; union all  %end;
%end;
 ;
quit;
%mend stack;

 %stack

Ksharp

Ksharp · Posted 04-23-2012 11:17 PM

Oh, I don't realize that there is such a limitation at SQL. Try this:

proc import datafile='c:\data.csv' out=x dbms=csv replace;getnames=no;
run;

data _null_;
dsid=open("x","i");
num=attrn(dsid,"nvars");
do i=2 to num;
 n+1;
 name=catx(',',catt(varname(dsid,1),' as date length=40'),catt(varname(dsid,i),' as amount length=40'));
 call symputx(cats('name',n),name);
end;
  call symputx('n',n);
rc=close(dsid);  
run;
data _null_;
 set x(firstobs=2 obs=2);
 array _v{*} $ _character_ ;
 do i=2 to dim(_v);
  call symputx(cats('value',i-1),_v{i}) ;
 end;
run;
data _null_;
 set x(firstobs=3 obs=3);
 array _v{*} $ _character_;
 do i=2 to dim(_v);
  call symputx(cats('_value',i-1),_v{i}) ;
 end;
run;



%macro stack;
%do i=1 %to &n;
proc sql;
create table want_&i as 
 select "&&value&i" as code length=40,"&&_value&i" as currency length=20,&&name&i from x(firstobs=4) 
 ;
quit;
%end;
%mend stack;

 %stack

 data want;
  set want_: ;run;

Ksharp

Patrick · Posted 04-24-2012 06:06 AM

Sticking with my code version - now tested on 2 copies of the data you've provided.

To run it in your environment the only thing you need to do is to change the value of csvPath to the directory where your csv files live.

The way the max. number of array elements is determined is not very efficient - but I wanted to keep the code simple. If performance is an issue then it could be done quite a bit more efficient.

%let csvPath=C:\_temp\demo;

/* determine max. number of array elements needed */
filename Mycsv "&csvPath\*.csv"; /* reads all files in folder mycsvs following naming pattern file*.csv */
data _null_;
infile Mycsv end=last lrecl=32767;
input;
retain maxvars;
maxvars=max(countc(_infile_,','),maxvars);
if last then call symput('Nvars',cats(maxvars));
run;

/* read, append and reorganise source data */
data want(drop=_:);
attrib _Type length=$20 informat=$20.;
array _Codes {&Nvars} $20. (&Nvars*'');
array _Currs {&Nvars} $3. (&Nvars*'');
attrib
    Date length=8 format=date9.
    Code length=$20
    Currency length=$3
    Amount length=8 format=comma16.2
    ;

infile Mycsv dsd dlm=',' truncover LRECL=32767;

input _Type @;
_Type=upcase(_Type);

if _Type='NAME' then
do;
     return;
end;
else
if _Type='CODE' then
do;
     call missing(of _all_);
    _infile_=prxchange('s/$[[:alnum:]]*$//o',-1,_infile_);
    input _Codes{*} :$20.;
end;
else
if _Type='CURRENCY' then
do;
    input _Currs{*} :$20.;
end;
else
do;
    Date=input(_Type,mmddyy10.);
    do _i=1 to &Nvars while(not missing(_Codes(_i)));

      Code=_Codes(_i);
      Currency=_Currs(_i);
      input Amount ?? :20. @;
      output;
    end;
end;
run;

/*proc print data=want;*/
/*run;*/

yeaforme · Posted 04-24-2012 07:29 PM

Thanks again everyone for the input. MikeZdeb's code seems to be working perfectly for me. I'm going to try Ksharp's code soon and see which gives faster, accurate results. I like the all-csv-at-once aspect of Patrick's, but when I ran that code on the data sample I provided, the resulting dataset only had about 800,000 of the over 3,000,000 observations it should have had as it seemed to be stopping partway through the sample. I tried to merge the two codes (MikeZdeb's and Patrick's) to get the best of both (all csvs at once with all observations), but was unsuccessful at that - though that is a different problem than the current one.

Thanks to everyone for all the help, I really appreciate it!! And unless anyone has more code they are just dying to write, I'm calling this one solved.

MikeZdeb · Posted 04-24-2012 09:01 PM

hi ... OK, one last bit of help

not much is needed to use the code I posted repetitively and read all your CSV files

if you put them all in one folder, you can use ...

* create a list of csv files in folder z:\ and a count of such files as macro variables;

filename csvfiles pipe "dir /b z:\*.csv";

data _null_;

length fnames $10000;

infile csvfiles end=done;

do j=1 by 1 until (done);

input;

fnames = catx('*',fnames,_infile_);

end;

call symputx('nfiles',j);

call symputx('files',fnames);

run;

then you can use a the macro variables within a macro to read a CSV file, create a data set , append that data set to the final data set you want, repeat the process for each CSV file found in the specified folder ... the macro (in the attached SAS file) starts / ends with ...

%macro readcsv;

proc datasets lib=work nolist;

delete allmydata;

quit;

%do i = 1 %to &nfiles;

%let fname=%scan(&files,&i,*);

<more>

proc append base=allmydata data=x;

run;

%end;

proc datasets lib=work nolist;

delete x;

quit;

%mend;

give it a try with a few CSV files (all in the same folder)

ps I changed the 2nd data step a bit, changing the LENGTHS of the variables CODE and CURRENCY from 200 and 8 to 15 and 4 by using a LENGTH statement and modifying one ARRAY statement ...

data x;

length code $15;

infile csv(&fname) dsd firstobs=4 lrecl=20000 pad;

input date : mmddyy. (&vars) (: x.);

array _(*) _: ;

array curr(0:&n) $4 _temporary_ (&curr);

<more>

yeaforme · Posted 04-25-2012 05:33 PM

Hmmm...get the following log/errors:

169

170 * non-repetitive stuff;

171

172 * informat for converting text to missing;

173 proc format;

174 invalue x '#n/a' , 'N/A' = .;

NOTE: Informat X is already on the library.

NOTE: Informat X has been output.

175 run;

NOTE: PROCEDURE FORMAT used (Total process time):

real time 0.03 seconds

cpu time 0.00 seconds

176

177 * create a list of csv files in folder z:\ and a count of such files as macro variables;

178 filename csvfiles pipe "dir /b C:\Data Extract - WSDS\DSWin Pull\DSPull - Ask\*.csv";

179

180 data _null_;

181 length fnames $10000;

182 infile csvfiles end=done;

183 do j=1 by 1 until (done);

184 input;

185 fnames = catx('*',fnames,_infile_);

186 end;

187 call symputx('nfiles',j);

188 call symputx('files',fnames);

189 run;

NOTE: The infile CSVFILES is:

Unnamed Pipe Access Device,

PROCESS=dir /b C:\Ask\*.csv,

RECFM=V,LRECL=256

Stderr output:

The system cannot find the file specified.

NOTE: 0 records were read from the infile CSVFILES.

NOTE: DATA statement used (Total process time):

real time 0.02 seconds

cpu time 0.00 seconds

190

191 * location of csv files;

192 filename csv 'C:\Ask\';

193

194 * repetitive stuff;

195

196 * macro to read all the csv file in folder z:\ and create one data set name ALLMYDATA;

197 %macro readcsv;

198

199 proc datasets lib=work nolist;

200 delete allmydata;

201 quit;

202

203 %do i = 1 %to &nfiles;

204 %let fname=%scan(&files,&i,*);

205

206 data _null_;

207 infile csv(&fname) dsd firstobs=2 _infile_=x lrecl=20000 pad;

208 input;

209 x = compress(translate(x,'','"'));

210 x = tranwrd(x, 'Code' , '');

211 x = tranwrd(x, ',' , '_' );

212 x = tranwrd(x, '(PA)' ,'' );

213 call symputx('vars',x);

214 input;

215 call symputx('n',countw(compress(x),',')-1);

216 file 'C:\Ask\curr.txt' lrecl=10000;

217 put x;

218 call symputx('curr',compress(x));

219 stop;

220 run;

221

222 data x;

223 length code $15;

224 infile csv(&fname) dsd firstobs=4 lrecl=20000 pad;

225 input date : mmddyy. (&vars) (: x.);

226 array _(*) _: ;

227 array curr(0:&n) $4 _temporary_ (&curr);

228 do j=1 to dim(_);

229 currency = curr(j);

230 amount = _(j);

231 code = compress(vname(_(j)),,'kad');

232 output;

233 end;

234 format date mmddyy10.;

235 keep date code currency amount;

236 run;

237

238 proc append base=allmydata data=x;

239 run;

240

241 %end;

242

243 proc datasets lib=work nolist;

244 delete x;

245 quit;

246

247 %mend;

248

249 options nomprint;

250

251 %readcsv;

NOTE: The file WORK.ALLMYDATA (memtype=DATA) was not found, but appears on a DELETE statement.

NOTE: PROCEDURE DATASETS used (Total process time):

real time 0.00 seconds

cpu time 0.00 seconds

WARNING: Apparent symbolic reference NFILES not resolved.

ERROR: A character operand was found in the %EVAL function or %IF condition where a numeric

operand is required. The condition was: &nfiles

ERROR: The %TO value of the %DO I loop is invalid.

ERROR: The macro READCSV will stop executing.

Patrick · Posted 04-24-2012 09:37 PM

And also to fix up what I've posted: The LRECL value in my code was way too short and so I missed to read the full record - and though the missing observations.

Changing LRECL to the maximum and everything works with your data (tested). I've fixed the code accordingly in my previous post.

The longest record in your sample data has 22753 characters.

yeaforme · Posted 04-25-2012 05:39 PM

Thanks Patrick - tried the code and it seems to run fast and smooth. Checking for errors, but seems to be running well.

MikeZdeb · Posted 04-25-2012 10:57 PM

hi ... glad this is winding down !!!

I know what the problem is that you had with my last posting ... all those spaces in your folder name ...

filename csvfiles pipe "dir /b C:\Data Extract - WSDS\DSWin Pull\DSPull - Ask\*.csv";

kill the PIPE method for getting a list of files in a FOLDER, the LOG ...

BUT different use of quotes ...

filename csvfiles pipe 'dir /b "C:\Data Extract - WSDS\DSWin Pull\DSPull - Ask\*.csv" ';

should work ... also, the next filename statement you have is ...

* location of csv files;

filename csv 'C:\Ask\';

and it should be ...

* location of csv files;

filename csv "C:\Data Extract - WSDS\DSWin Pull\DSPull - Ask\" ;

the SAS code will not work unless the folder matches in the two filename statements (the location of your CSV files)

so, since I'm curious as to if it solve the problem, it'd be nice (if you have a chance) if you gave it a try

it would also be of interest to know how the performance (elapsed and CPU times) compare to Patrick's solution (which you say is working just fine)

glad that something worked for you !!!

ps I'm always amazed when I see folder names that are full of spaces since it's difficult to distinguish one space from two ... reminds me of my students who store files in places like "C:\Documents and Settings\Joe Student\My Documents\EPI 514\data" rather than "c:\epi514\data"

Register Today!

SAS Training: Just a Click Away