Hey guys,
I have a quick question. I extract and process a lot of .xls and .csv data with strange data structure so usually do all data management in R then export it to SAS to run the SQL and data analysis. Decided to give it a run in SAS though.
I extract in .csv format the following sample structure:
AA UN Equity | | | | | | AIG UN Equity | | | | |
---|---|---|---|---|---|---|---|---|---|---|
Date | PX_LAST | PX_LOW | PX_HIGH | VOLUME | Date | PX_LAST | PX_LOW | PX_HIGH | VOLUME | |
#NAME? | 40.4688 | 40.1875 | 41.7813 | 2705600 | #NAME? | 1160.9274 | 1150.456 | 1200.0206 | 151718 | |
1/4/2000 | 40.6563 | 40.1875 | 40.9063 | 4112400 | 1/4/2000 | 1101.5896 | 1098.0991 | 1140.6827 | 340523 | |
1/5/2000 | 43 | 40.5 | 43.25 | 5844600 | 1/5/2000 | 1103.6838 | 1080.6468 | 1116.2495 | 248678 | |
1/6/2000 | 42.4375 | 42.4063 | 43.1875 | 8595200 | 1/6/2000 | 1139.2866 | 1100.1934 | 1157.437 | 261000 | |
1/7/2000 | 42.625 | 42.2813 | 43.4375 | 7026800 | 1/7/2000 | 1223.0576 | 1153.2484 | 1227.9443 | 311985 |
And want to transform into (upper row is var names or columns names) :
Ticker | Date | PX_LAST | PX_LOW | PX_HIGH | VOLUME |
---|---|---|---|---|---|
AA UN Equity | 40.4688 | 40.1875 | 41.7813 | 2705600 | |
AA UN Equity | 1/4/2000 | 40.6563 | 40.1875 | 40.9063 | 4112400 |
AA UN Equity | 1/5/2000 | 43 | 40.5 | 43.25 | 5844600 |
AA UN Equity | 1/6/2000 | 42.4375 | 42.4063 | 43.1875 | 8595200 |
AA UN Equity | 1/7/2000 | 42.625 | 42.2813 | 43.4375 | 7026800 |
AIG UN Equity | 1160.9274 | 1150.456 | 1200.0206 | 151718 | |
AIG UN Equity | 1/4/2000 | 1101.5896 | 1098.0991 | 1140.6827 | 340523 |
AIG UN Equity | 1/5/2000 | 1103.6838 | 1080.6468 | 1116.2495 | 248678 |
AIG UN Equity | 1/6/2000 | 1139.2866 | 1100.1934 | 1157.437 | 261000 |
AIG UN Equity | 1/7/2000 | 1223.0576 | 1153.2484 | 1227.9443 | 311985 |
Ticker | Date | PX_LAST | PX_LOW | PX_HIGH | VOLUME |
AA UN Equity | 40.4688 | 40.1875 | 41.7813 | 2705600 | |
AA UN Equity | 1/4/2000 | 40.6563 | 40.1875 | 40.9063 | 4112400 |
AA UN Equity | 1/5/2000 | 43 | 40.5 | 43.25 | 5844600 |
AA UN Equity | 1/6/2000 | 42.4375 | 42.4063 | 43.1875 | 8595200 |
AA UN Equity | 1/7/2000 | 42.625 | 42.2813 | 43.4375 | 7026800 |
AIG UN Equity | 1160.927 | 1150.456 | 1200.021 | 151718 | |
AIG UN Equity | 1/4/2000 | 1101.59 | 1098.099 | 1140.683 | 340523 |
AIG UN Equity | 1/5/2000 | 1103.684 | 1080.647 | 1116.25 | 248678 |
AIG UN Equity | 1/6/2000 | 1139.287 | 1100.193 | 1157.437 | 261000 |
AIG UN Equity | 1/7/2000 | 1223.058 | 1153.248 | 1227.944 | 311985 |
Series can be very long, a lot of tickers pero .csv and number of variables per ticker may vary so the most flexibility ( parametrization of the whole process) would be great!!
Would really appreciate any hint or help,
Thanks,
Arsenio
p.s. the new posting window layout is pretty good!
You can use proc import datafile= "C:\FOLDER\*.csv" to import data first.
Then process the data in SAS by reomoved unwanted row and rename the variable.
Good luck.
Randy,
The main problem for me is operating on the data once it's in SAS, I'm used to indexing matrices in R, and see no way on how to perform that kind of manipulation.
Arsenio
I think you might be able to use something like the following to accomplish the task fairly easily. I've included some comments in the code to explain the suggested process:
/* Identify the path where your csv files exist */
%let path=c:\art\test\;
/* pipe all of the filenames into indata */
filename indata pipe "dir &path.*.csv /b";
/* Build a dataset containing the filenames and accomplish the imports */
data filenames (keep=fil2write);
length fil2write $50;
length fil2read $80;
infile indata truncover;
informat f2r $50.;
input f2r &;
fil2write=tranwrd(translate(strip(f2r),'_',' '),".csv","");
fil2read="&path."||f2r;
call execute (
"PROC IMPORT OUT= WORK."||fil2write
||'DATAFILE="'||strip(fil2read)||'" '
||"DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;");
run;
/* Create a macro variable to later be used in a datastep set statement */
proc sql noprint;
select fil2write||" (rename=(date=datein))"
into :fnames separated by " "
from filenames
;
quit;
/* Create the desired file correct the date field at the same time */
data want (drop=datein);
format ticker $50.;
format date date9.;
set &fnames. indsname=dsn;
ticker=scan(dsn,2,".");
date=input(datein, ?? mmddyy8.);
run;
filename out1 '/temp/tmp1.csv';
filename out2 '/temp/tmp2.csv';
data s1;
infile cards dsd dlm='09'x;
input (date px_last px_low px_high volume) (:$20.);
file out1 dsd dlm=',';
put (date px_last px_low px_high volume) (:$20.);
cards;
#NAME? 40.4688 40.1875 41.7813 2705600
1/4/2000 40.6563 40.1875 40.9063 4112400
1/5/2000 43 40.5 43.25 5844600
1/6/2000 42.4375 42.4063 43.1875 8595200
1/7/2000 42.625 42.2813 43.4375 7026800
;
run;
data s2;
infile cards dsd dlm='09'x;
input (date px_last px_low px_high volume) (:$20.);
file out2 dsd dlm=',';
put (date px_last px_low px_high volume) (:$20.);
cards;
#NAME? 1160.9274 1150.456 1200.0206 151718
1/4/2000 1101.5896 1098.0991 1140.6827 340523
1/5/2000 1103.6838 1080.6468 1116.2495 248678
1/6/2000 1139.2866 1100.1934 1157.437 261000
1/7/2000 1223.0576 1153.2484 1227.9443 311985
;
run;
/* this is the part that matters to your situation */
filename in ('/temp/tmp1.csv' '/temp/tmp2.csv');
data have;
length _file $256;
infile in dsd dlm=',' filename=_file;
input date :??mmddyy8. (px_last px_low px_high) (:8.4) volume :7.;
ticker=scan(scan(strip(_file),-1,'/'),1,'.');
run;
/* end */
x 'rm -rf /temp/tmp*.csv';
Hey FriedEgg,
Thanks for the reply! I tried your code and work.s1 from the first piece of code comes out as :
date | px_last | px_low | px_high | volume |
---|---|---|---|---|
#NAME? 40.4 | 1/4/2000 40 | |||
1/5/2000 43 | 1/6/2000 42 | 1/7/2000 42 |
With this printed in the logs:
23 put (date px_last px_low px_high volume) (:$20.);
24
25 cards;
NOTE: The file OUT1 is:
Filename=C:\temp\tmp1.csv,
RECFM=V,LRECL=256,File Size (bytes)=0,
Last Modified=04Oct2011:15:22:38,
Create Time=04Oct2011:15:13:22
NOTE: LOST CARD.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+
37 ;
date= px_last= px_low= px_high= volume= _ERROR_=1 _N_=3
NOTE: 2 records were written to the file OUT1.
The minimum record length was 44.
The maximum record length was 64.
I haven't been able to convert this to the table structure in the original message. I think it's a problem on my side.
Make sure the delimiter is actually a tab.
Also that part of my code was only to recreate your example data files.
This is the part that is actually revelant to your situation:
filename in ('/temp/*.csv');
data have;
length _file $256;
infile in dsd dlm=',' filename=_file;
input date :??mmddyy8. (px_last px_low px_high) (:8.4) volume :7.;
ticker=scan(scan(strip(_file),-1,'/'),1,'.');
run;
RandyDai, art297, FriedEgg,
Thanks for a lot for the info! Will go over it and get back to you guys.
How about:
data temp; infile datalines expandtabs dlm=' ' truncover; if _n_ in (1 2) then do; count+1;input _ticker & $10. @;output; if _n_ eq 1 then do;count+1;input _ticker & $10. ;output;end; else do;count+1;input _ticker & $10. /;output;end; end; else do; count+1; input (Date PX_LAST PX_LOW PX_HIGH VOLUME) (: $10.)@; output; count+1; input (Date PX_LAST PX_LOW PX_HIGH VOLUME) (: $10.); output; end; datalines; AA UN AIG UN Equity Equity Date PX_LAST PX_LOW PX_HIGH VOLUME Date PX_LAST PX_LOW PX_HIGH VOLUME #NAME? 40.4688 40.1875 41.7813 2705600 #NAME? 1160.9274 1150.456 1200.0206 151718 1/4/2000 40.6563 40.1875 40.9063 4112400 1/4/2000 1101.5896 1098.0991 1140.6827 340523 1/5/2000 43 40.5 43.25 5844600 1/5/2000 1103.6838 1080.6468 1116.2495 248678 1/6/2000 42.4375 42.4063 43.1875 8595200 1/6/2000 1139.2866 1100.1934 1157.437 261000 1/7/2000 42.625 42.2813 43.4375 7026800 1/7/2000 1223.0576 1153.2484 1227.9443 311985 ; run; data temp; set temp; if mod(count,2)=1 then flag=0; else flag=1; run; proc sort data=temp;by flag count;run; data want(drop=count _ticker flag); length ticker $ 20; retain ticker; set temp; by flag; if first.flag then call missing(ticker); if not missing(_ticker) then do; ticker=catx(' ',ticker,_ticker);delete; end; run;
Ksharp
Hey Ksharp,
Been a long time since my rolling correlation question! Anyway, thanks for the reply! i have been testing that code.
So the work.temp comes out as:
count | _ticker | Date | PX_LAST | PX_LOW | PX_HIGH | VOLUME | flag |
1 | AA UN | 0 | |||||
3 | Equity | 0 | |||||
5 | #NAME? | 40.4688 | 40.1875 | 41.7813 | 2705600 | 0 | |
7 | 1/4/2000 | 40.6563 | 40.1875 | 40.9063 | 4112400 | 0 | |
9 | 1/5/2000 | 43 | 40.5 | 43.25 | 5844600 | 0 | |
11 | 1/6/2000 | 42.4375 | 42.4063 | 43.1875 | 8595200 | 0 | |
13 | 1/7/2000 | 42.625 | 42.2813 | 43.4375 | 7026800 | 0 | |
2 | AIG UN | 1 | |||||
4 | Equity | 1 | |||||
6 | #NAME? | 1160.9274 | 1150.456 | 1200.0206 | 151718 | 1 |
This is an already shifted and stacked version of the original table, right? Nice reminder on retain!
From there your code works great. Fills the ticker with the ticker name for the corresponding observations. Any way to convert it to that already stacked form you started with?
Thanks!
Arsenio
Oh. I remember that question about correlation coefficient.
Yes. This is an already shifted and stacked version of the original table. However not completed.So I need the following code to complete it.
But I do not understand what you mean about the last sentence.
"Any way to convert it to that already stacked form you started with?"
I only add another variable FLAG to prepare for the next process.
Ksharp
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.