BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
MikeZdeb
Rhodochrosite | Level 12

hi ... here are a couple of different approaches to your two questions ....

#1  using the datalines file from Ksharp's posting ...

data x;

infile datalines;

input name $ jim bob sue george;

datalines;

Code    101  105  216 412

Date1     1     4     4     6

Date2     1     3     2     6

Date3     5     7     8     7

Date4     5     1     6     8

;

* use a data step to create a macro variable that renames variables in the data set;

filename x dummy;

data _null_;

set x (drop=name obs=1);

file nosee;

put (_all_) (=) @;

call symputx('rename',tranwrd(_file_,'=','=_'));

run;

data x;

set x (firstobs=2 rename=(&rename));

array _(*) _: ;

do j = 1 to dim(_);

   amount = _(j);

   date = name;

   code = compress(vname(_(j)),,'kd');

   output;

end;

keep date code amount;

run;

#2  using the attached CSV file based on the data you posted in the second question (it's specific to the posted data, but if all your data takes that form, it should work regardless of the number of codes and dates) ...

* use a data step to create some macro variables (a list of variables for an INPUT statement, a count of currencies, a list of currencies);

data _null_;

infile 'z:\data.csv' dsd firstobs=2 _infile_=x;

input;

x = tranwrd(x, 'Code' , '');

x = tranwrd(x, ',' , ' _' );

x = tranwrd(x, '(PA)' ,'' );

call symputx('vars',x);

input;

call symputx('n',countw(x)-1);

call symputx('curr',catt('"',tranwrd(x, ',' , '" "'),'"'));

stop;

run;

* translate the #N/A to missing with an informat; 

proc format;

invalue x '#N/A' = .;

run;

* use the macro variables and the informat;

data x;

infile 'z:\data.csv' dsd firstobs=4;

input date mmddyy. (&vars) (: x.);

array _(*) _: ;

array curr(0:&n) $ _temporary_ (&curr);

do j=1 to dim(_);

   currency = curr(j);

   amount = _(j);

   code = compress(vname(_(j)),,'kad');

   output;

end;

format date mmddyy10.;

keep date code currency amount;

run;

yeaforme
Calcite | Level 5

First, thanks for the additional input - I'm not quite there but I think we are very, very close.

Second, to Ksharp: yes, I apologize that my initial post was not what the data exactly looked like.  Being naive I figured I could save us all time by short-handing the data here and then just applying the principles I saw to the original data - how wrong I was.

MikeZdeb: I tried your second set of code and I got a lot of errors, mostly involving a lot of "".  In the process of trying to decipher the errors, I opened up the original csv files in a text editor (before I had been opening them in Excel - again, my naivete) and found that the actual text files have "" around everything except the date-amounts (so variable names are actually: "Name", "Code", "Currency"; dates are actually "1/4/2000", etc.; SCIENCES is actually "SCIENCES", etc.; 144672(PA) is actually "144672(PA)", etc.; and U$ is actually "U$" - however, the date-amounts [i.e., N/A, 150, #N/A, 15.26, #N/A, #N/A, 15.30, 155, 1.05, 14.89, 152, 0.98] are without "" [i.e., are actually N/A, 150, #N/A, 15.26, #N/A, #N/A, 15.30, 155, 1.05, 14.89, 152, 0.98]).  These "" were invisible when I viewed them in Excel.

I also tried Ksharps code on the original files, however the proc sql statement sent the following error:  ERROR: A maximum of 256 tables can be processed in a single PROC SQL statement.  The original data are roughly 1000 IDs across and 6000 dates down, so I imagine that is pushing the proc sql over its limit.  Is there a way to alleviate this?

Thoughts?  I am continuing to tinker around with both sets of code and I hope I can solve the issues on my own, but again, and as I'm sure you can tell, I'm fairly noob in this area so if you can get there quicker, I'd much appreciated it.  Thanks again, so much!

MikeZdeb
Rhodochrosite | Level 12

hi ... i changed the CSV to look as you specified with quotes around all entries in the first three lines

I modified the first data step, the one that creates the macro variables used in the second data step (actually made it easier to create macro variable &CURR)

give it a try (use with PROC FORMAT and second data step from previous posting) with the new CSV file and with your other data (cut/paste from here since the single quotes with no space bewteen them might look like double quotes if you try to retype this)

data _null_;

infile 'z:\data.csv' dsd firstobs=2 _infile_=x;

input;

x = compress(translate(x,'','"'));

x = tranwrd(x, 'Code' , '');

x = tranwrd(x, ',' , '_' );

x = tranwrd(x, '(PA)' ,'' );

call symputx('vars',x);

input;

call symputx('n',countw(x)-1);

call symputx('curr',x);

stop; 

run;

yeaforme
Calcite | Level 5

Still getting some errors.  Here's the last part of the log:

52   array curr(0:&n) $ _temporary_ (&curr);

                                          -

                                          22

ERROR 22-322: Syntax error, expecting one of the following: a quoted string, a numeric constant,

              a datetime constant, a missing value.


53

54   do j=1 to dim(_);

ERROR: Attempt to initialize element of character array curr with numeric constant 1.

WARNING: Partial value initialization of the array curr.

55

56      currency = curr(j);

57

58      amount = _(j);

59

60      code = compress(vname(_(j)),,'kad');

61

62      output;

63

64   end;

65

66   format date mmddyy10.;

67

68   keep date code currency amount;

69

70   run;

NOTE: The SAS System stopped processing this step because of errors.

WARNING: The data set WORK.X may be incomplete.  When this step was stopped there were 0

         observations and 4 variables.

NOTE: DATA statement used (Total process time):

      real time           0.03 seconds

      cpu time            0.00 seconds

Also, probably should have done this at the beginning, butust to make it easier, I've attached one of the actual csv files (28 megs or so).  Thanks again!

Patrick
Opal | Level 21

I believe the alternative code version I've posted previously in this thread would just work with your data.

It's not a very fancy approach - but I believe it to be stable and able to process any number of source csv's at once.

MikeZdeb
Rhodochrosite | Level 12

hi ... OK, providing real data helped ... this worked with that data (after some changes, e.g. providing LRECL)  ...

data _null_;

infile 'z:\ws1.csv' dsd firstobs=2 _infile_=x lrecl=20000 pad;

input;

x = compress(translate(x,'','"'));

x = tranwrd(x, 'Code' , '');

x = tranwrd(x, ',' , '_' );

x = tranwrd(x, '(PA)' ,'' );

call symputx('vars',x);

input;

call symputx('n',countw(compress(x),',')-1);

file 'z:\curr.txt' lrecl=10000;

put x;

call symputx('curr',compress(x));

stop;

run;

* translate the #N/A to missing with an informat;

proc format;

invalue x '#n/a' , 'N/A' = .;

run;

* use the macro variables and the informat;

data x;

infile 'z:\ws1.csv' dsd firstobs=4 lrecl=20000 pad;

input date : mmddyy. (&vars) (: x.);

array _(*) _: ;

array curr(0:&n) $ _temporary_ (&curr);

do j=1 to dim(_);

   currency = curr(j);

   amount = _(j);

   code = compress(vname(_(j)),,'kad');

   output;

end;

format date mmddyy10.;

keep date code currency amount;

run;

* check the data ... random sample;

proc print data=x;

where ranuni(999) le .0001;

run;


* first 10, last 10;

data _null_;

do j=1 to 10, lastrec-9 to lastrec;

   set x nobs=lastrec point=j;

   put j 7. +2 date mmddyy10. +2 currency $2. +2 amount 8.2 +2 code;

end;

stop;

run;

the LOG (using the data you provided) ...

574  * use the macro variables and the informat;

575  data x;

576  infile 'z:\ws1.csv' dsd firstobs=4 lrecl=20000 pad;

577  input date : mmddyy. (&vars) (: x.);

578  array _(*) _: ;

579  array curr(0:&n) $ _temporary_ (&curr);

580  do j=1 to dim(_);

581     currency = curr(j);

582     amount = _(j);

583     code = compress(vname(_(j)),,'kad');

584     output;

585  end;

586  format date mmddyy10.;

587  keep date code currency amount;

588  run;

NOTE: The infile 'z:\ws1.csv' is:

      Filename=z:\ws1.csv,

      RECFM=V,LRECL=20000,File Size (bytes)=20022456,

      Last Modified=23Apr2012:19:16:00,

      Create Time=23Apr2012:21:34:53

NOTE: 5740 records were read from the infile 'z:\ws1.csv'.

      The minimum record length was 3295.

      The maximum record length was 3872.

NOTE: The data set WORK.X has 3771180 observations and 4 variables.

NOTE: DATA statement used (Total process time):

      real time           26.04 seconds

      cpu time            16.45 seconds

first 10 and last 10 (I checked your CSV file and those are the last 10 values ... HEY !!! looks like it actually worked) ...

      1  01/01/1990  U$       .    130042

      2  01/01/1990  U$       .    130057

      3  01/01/1990  U$       .    130062

      4  01/01/1990  U$       .    130079

      5  01/01/1990  U$       .    130086

      6  01/01/1990  U$       .    130088

      7  01/01/1990  MP       .    130092

      8  01/01/1990  U$       .    130104

      9  01/01/1990  U$       .    130113

     10  01/01/1990  MP       .    130115

3771171  12/30/2011  U$     34.97  134057

3771172  12/30/2011  U$       .    134058

3771173  12/30/2011  C$      0.04  134069

3771174  12/30/2011  K$      0.77  13406D

3771175  12/30/2011  U$     13.87  134072

3771176  12/30/2011  U$       .    134076

3771177  12/30/2011  RL     10.78  13407C

3771178  12/30/2011  U$       .    134083

3771179  12/30/2011  U$     25.33  134093

3771180  12/30/2011  CH      7.17  13409Q

Ksharp
Super User

You did not post what your data exactly looks like.

Use Mike's csv file and my code.

proc import datafile='c:\data.csv' out=x dbms=csv replace;getnames=no;
run;

data _null_;
dsid=open("x","i");
num=attrn(dsid,"nvars");
do i=2 to num;
 n+1;
 name=catx(',',catt(varname(dsid,1),' as date'),catt(varname(dsid,i),' as amount'));
 call symputx(cats('name',n),name);
end;
  call symputx('n',n);
rc=close(dsid);  
run;
data _null_;
 set x(firstobs=2 obs=2);
 array _v{*} $ _character_ ;
 do i=2 to dim(_v);
  call symputx(cats('value',i-1),_v{i}) ;
 end;
run;
data _null_;
 set x(firstobs=3 obs=3);
 array _v{*} $ _character_;
 do i=2 to dim(_v);
  call symputx(cats('_value',i-1),_v{i}) ;
 end;
run;


%macro stack;
proc sql;
create table want as 
%do i=1 %to &n;
 select "&&value&i" as code,"&&_value&i" as currency,&&name&i from x(firstobs=4) 
 %if &i ne &n %then %do; union all  %end;
%end;
 ;
quit;
%mend stack;

 %stack

Ksharp

Ksharp
Super User

Oh, I don't realize that there is such a limitation at SQL. Try this:

proc import datafile='c:\data.csv' out=x dbms=csv replace;getnames=no;
run;

data _null_;
dsid=open("x","i");
num=attrn(dsid,"nvars");
do i=2 to num;
 n+1;
 name=catx(',',catt(varname(dsid,1),' as date length=40'),catt(varname(dsid,i),' as amount length=40'));
 call symputx(cats('name',n),name);
end;
  call symputx('n',n);
rc=close(dsid);  
run;
data _null_;
 set x(firstobs=2 obs=2);
 array _v{*} $ _character_ ;
 do i=2 to dim(_v);
  call symputx(cats('value',i-1),_v{i}) ;
 end;
run;
data _null_;
 set x(firstobs=3 obs=3);
 array _v{*} $ _character_;
 do i=2 to dim(_v);
  call symputx(cats('_value',i-1),_v{i}) ;
 end;
run;



%macro stack;
%do i=1 %to &n;
proc sql;
create table want_&i as 
 select "&&value&i" as code length=40,"&&_value&i" as currency length=20,&&name&i from x(firstobs=4) 
 ;
quit;
%end;
%mend stack;

 %stack

 data want;
  set want_: ;run;


Ksharp

Patrick
Opal | Level 21

Sticking with my code version - now tested on 2 copies of the data you've provided.

To run it in your environment the only thing you need to do is to change the value of csvPath to the directory where your csv files live.

The way the max. number of array elements is determined is not very efficient - but I wanted to keep the code simple. If performance is an issue then it could be done quite a bit more efficient.

%let csvPath=C:\_temp\demo;

/* determine max. number of array elements needed */
filename Mycsv "&csvPath\*.csv";  /* reads all files in folder mycsvs following naming pattern file*.csv */
data _null_;
  infile Mycsv end=last lrecl=32767;
  input;
  retain maxvars;
  maxvars=max(countc(_infile_,','),maxvars);
  if last then call symput('Nvars',cats(maxvars));
run;

/* read, append and reorganise source data */
data want(drop=_:);
  attrib _Type length=$20 informat=$20.;
  array _Codes {&Nvars} $20. (&Nvars*'');
  array _Currs {&Nvars} $3. (&Nvars*'');
  attrib
    Date length=8 format=date9.
    Code  length=$20
    Currency length=$3
    Amount length=8 format=comma16.2
    ;

  infile Mycsv dsd dlm=',' truncover LRECL=32767;

  input _Type @;
  _Type=upcase(_Type);

  if _Type='NAME' then
  do;
     return;
  end;
  else
  if _Type='CODE' then
  do;
     call missing(of _all_);
    _infile_=prxchange('s/\([[:alnum:]]*\)//o',-1,_infile_);
    input _Codes{*} :$20.;
  end;
  else
  if _Type='CURRENCY' then
  do;
    input _Currs{*} :$20.;
  end;
  else
  do;
    Date=input(_Type,mmddyy10.);
    do _i=1 to &Nvars while(not missing(_Codes(_i)));

      Code=_Codes(_i);
      Currency=_Currs(_i);
      input Amount ?? :20. @;
      output;
    end;
  end;
run;

/*proc print data=want;*/
/*run;*/

yeaforme
Calcite | Level 5

Thanks again everyone for the input.  MikeZdeb's code seems to be working perfectly for me.  I'm going to try Ksharp's code soon and see which gives faster, accurate results.  I like the all-csv-at-once aspect of Patrick's, but when I ran that code on the data sample I provided, the resulting dataset only had about 800,000 of the over 3,000,000 observations it should have had as it seemed to be stopping partway through the sample.  I tried to merge the two codes (MikeZdeb's and Patrick's) to get the best of both (all csvs at once with all observations), but was unsuccessful at that - though that is a different problem than the current one.

Thanks to everyone for all the help, I really appreciate it!!  And unless anyone has more code they are just dying to write, I'm calling this one solved.

MikeZdeb
Rhodochrosite | Level 12

hi ... OK, one last bit of help

not much is needed to use the code I posted repetitively and read all your CSV files

if you put them all in one folder, you can use ...

* create a list of csv files in folder z:\ and a count of such files as macro variables;

filename csvfiles pipe "dir /b z:\*.csv";

data _null_;

length fnames $10000;

infile csvfiles end=done;

do j=1 by 1 until (done);

  input;

  fnames = catx('*',fnames,_infile_);

end;

call symputx('nfiles',j);

call symputx('files',fnames);

run;

then you can use a the macro variables within a macro to read a CSV file, create a data set , append that data set to the final data set you want, repeat the process for each CSV file found in the specified folder ... the macro (in the attached SAS file) starts / ends  with ...

%macro readcsv;

proc datasets lib=work nolist;

delete allmydata;

quit;

%do i = 1 %to &nfiles;

%let fname=%scan(&files,&i,*);


<more>


proc append base=allmydata data=x;

run;

%end;

proc datasets lib=work nolist;

delete x;

quit;

%mend;

give it a try with a few CSV files (all in the same folder)

ps  I changed the 2nd data step a bit, changing the LENGTHS of the variables CODE and CURRENCY from 200 and 8 to 15 and 4 by using a LENGTH statement and modifying one ARRAY statement ...

data x;

length code $15;

infile csv(&fname) dsd firstobs=4 lrecl=20000 pad;

input date : mmddyy. (&vars) (: x.);

array _(*) _: ;

array curr(0:&n) $4 _temporary_ (&curr);

<more>

yeaforme
Calcite | Level 5

Hmmm...get the following log/errors:

169

170  * non-repetitive stuff;

171

172  * informat for converting text to missing;

173  proc format;

174  invalue x '#n/a' , 'N/A' = .;

NOTE: Informat X is already on the library.

NOTE: Informat X has been output.

175  run;


NOTE: PROCEDURE FORMAT used (Total process time):

      real time           0.03 seconds

      cpu time            0.00 seconds

 

176

177  * create a list of csv files in folder z:\ and a count of such files as macro variables;

178  filename csvfiles pipe "dir /b C:\Data Extract - WSDS\DSWin Pull\DSPull - Ask\*.csv";

179

180  data _null_;

181  length fnames $10000;

182  infile csvfiles end=done;

183  do j=1 by 1 until (done);

184    input;

185    fnames = catx('*',fnames,_infile_);

186  end;

187  call symputx('nfiles',j);

188  call symputx('files',fnames);

189  run;


NOTE: The infile CSVFILES is:

      Unnamed Pipe Access Device,

      PROCESS=dir /b C:\Ask\*.csv,

      RECFM=V,LRECL=256


Stderr output:

The system cannot find the file specified.

NOTE: 0 records were read from the infile CSVFILES.

NOTE: DATA statement used (Total process time):

      real time           0.02 seconds

      cpu time            0.00 seconds

190

191  * location of csv files;

192  filename csv 'C:\Ask\';

193

194  * repetitive stuff;

195

196  * macro to read all the csv file in folder z:\ and create one data set name ALLMYDATA;

197  %macro readcsv;

198

199  proc datasets lib=work nolist;

200  delete allmydata;

201  quit;

202

203  %do i = 1 %to &nfiles;

204  %let fname=%scan(&files,&i,*);

205

206  data _null_;

207  infile csv(&fname) dsd firstobs=2 _infile_=x lrecl=20000 pad;

208  input;

209  x = compress(translate(x,'','"'));

210  x = tranwrd(x, 'Code' , '');

211  x = tranwrd(x, ',' , '_' );

212  x = tranwrd(x, '(PA)' ,'' );

213  call symputx('vars',x);

214  input;

215  call symputx('n',countw(compress(x),',')-1);

216  file 'C:\Ask\curr.txt' lrecl=10000;

217  put x;

218  call symputx('curr',compress(x));

219  stop;

220  run;

221

222  data x;

223  length code $15;

224  infile csv(&fname) dsd firstobs=4 lrecl=20000 pad;

225  input date : mmddyy. (&vars) (: x.);

226  array _(*) _: ;

227  array curr(0:&n) $4 _temporary_ (&curr);

228  do j=1 to dim(_);

229     currency = curr(j);

230     amount = _(j);

231     code = compress(vname(_(j)),,'kad');

232     output;

233  end;

234  format date mmddyy10.;

235  keep date code currency amount;

236  run;

237

238  proc append base=allmydata data=x;

239  run;

240

241  %end;

242

243  proc datasets lib=work nolist;

244  delete x;

245  quit;

246

247  %mend;

248

249  options nomprint;

250

251  %readcsv;


NOTE: The file WORK.ALLMYDATA (memtype=DATA) was not found, but appears on a DELETE statement.

NOTE: PROCEDURE DATASETS used (Total process time):

      real time           0.00 seconds

      cpu time            0.00 seconds


 

WARNING: Apparent symbolic reference NFILES not resolved.

ERROR: A character operand was found in the %EVAL function or %IF condition where a numeric

       operand is required. The condition was: &nfiles

ERROR: The %TO value of the %DO I loop is invalid.

ERROR: The macro READCSV will stop executing.

Patrick
Opal | Level 21

And also to fix up what I've posted: The LRECL value in my code was way too short and so I missed to read the full record - and though the missing observations.

Changing LRECL to the maximum and everything works with your data (tested). I've fixed the code accordingly in my previous post.

The longest record in your sample data has 22753 characters.

yeaforme
Calcite | Level 5

Thanks Patrick - tried the code and it seems to run fast and smooth.  Checking for errors, but seems to be running well.

MikeZdeb
Rhodochrosite | Level 12

hi ... glad this is winding down !!!

I know what the problem is that you had with my last posting ... all those spaces in your folder name ...

filename csvfiles pipe "dir /b C:\Data Extract - WSDS\DSWin Pull\DSPull - Ask\*.csv";


kill the PIPE method for getting a list of files in a FOLDER, the LOG ...


BUT different use of quotes ...

filename csvfiles pipe 'dir /b "C:\Data Extract - WSDS\DSWin Pull\DSPull - Ask\*.csv" ';

should work ... also, the next filename statement you have is ...

* location of csv files;

filename csv 'C:\Ask\';

and it should be ...

* location of csv files;

filename csv "C:\Data Extract - WSDS\DSWin Pull\DSPull - Ask\" ;

the SAS code will not work unless the folder matches in the two filename statements (the location of your CSV files)

so,  since I'm curious as to if it solve the problem, it'd be nice (if you have a chance) if you gave it a try

it would also be of interest to know how the performance (elapsed and CPU times) compare to Patrick's solution (which you say is working just fine)

glad that something worked for you !!!

ps  I'm always amazed when I see folder names that are full of spaces since it's difficult to distinguish one space from two ... reminds me of my students who store files in places like "C:\Documents and Settings\Joe Student\My Documents\EPI 514\data" rather than "c:\epi514\data"

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 35 replies
  • 4425 views
  • 6 likes
  • 9 in conversation