Mixed data reshaping question

Arsenio_Staer · Posted 09-28-2011 01:15 AM

Hey guys,

I have a quick question. I extract and process a lot of .xls and .csv data with strange data structure so usually do all data management in R then export it to SAS to run the SQL and data analysis. Decided to give it a run in SAS though.

I extract in .csv format the following sample structure:

AA UN Equity					AIG UN Equity
Date	PX_LAST	PX_LOW	PX_HIGH	VOLUME	Date	PX_LAST	PX_LOW	PX_HIGH	VOLUME
#NAME?	40.4688	40.1875	41.7813	2705600	#NAME?	1160.9274	1150.456	1200.0206	151718
1/4/2000	40.6563	40.1875	40.9063	4112400	1/4/2000	1101.5896	1098.0991	1140.6827	340523
1/5/2000	43	40.5	43.25	5844600	1/5/2000	1103.6838	1080.6468	1116.2495	248678
1/6/2000	42.4375	42.4063	43.1875	8595200	1/6/2000	1139.2866	1100.1934	1157.437	261000
1/7/2000	42.625	42.2813	43.4375	7026800	1/7/2000	1223.0576	1153.2484	1227.9443	311985

And want to transform into (upper row is var names or columns names) :

Ticker	Date	PX_LAST	PX_LOW	PX_HIGH	VOLUME
AA UN Equity		40.4688	40.1875	41.7813	2705600
AA UN Equity	1/4/2000	40.6563	40.1875	40.9063	4112400
AA UN Equity	1/5/2000	43	40.5	43.25	5844600
AA UN Equity	1/6/2000	42.4375	42.4063	43.1875	8595200
AA UN Equity	1/7/2000	42.625	42.2813	43.4375	7026800
AIG UN Equity		1160.9274	1150.456	1200.0206	151718
AIG UN Equity	1/4/2000	1101.5896	1098.0991	1140.6827	340523
AIG UN Equity	1/5/2000	1103.6838	1080.6468	1116.2495	248678
AIG UN Equity	1/6/2000	1139.2866	1100.1934	1157.437	261000
AIG UN Equity	1/7/2000	1223.0576	1153.2484	1227.9443	311985

Ticker	Date	PX_LAST	PX_LOW	PX_HIGH	VOLUME
AA UN Equity		40.4688	40.1875	41.7813	2705600
AA UN Equity	1/4/2000	40.6563	40.1875	40.9063	4112400
AA UN Equity	1/5/2000	43	40.5	43.25	5844600
AA UN Equity	1/6/2000	42.4375	42.4063	43.1875	8595200
AA UN Equity	1/7/2000	42.625	42.2813	43.4375	7026800
AIG UN Equity		1160.927	1150.456	1200.021	151718
AIG UN Equity	1/4/2000	1101.59	1098.099	1140.683	340523
AIG UN Equity	1/5/2000	1103.684	1080.647	1116.25	248678
AIG UN Equity	1/6/2000	1139.287	1100.193	1157.437	261000
AIG UN Equity	1/7/2000	1223.058	1153.248	1227.944	311985

Series can be very long, a lot of tickers pero .csv and number of variables per ticker may vary so the most flexibility ( parametrization of the whole process) would be great!!

Would really appreciate any hint or help,

Thanks,

Arsenio

p.s. the new posting window layout is pretty good!

RD2 · Posted 09-28-2011 06:10 AM

You can use proc import datafile= "C:\FOLDER\*.csv" to import data first.

Then process the data in SAS by reomoved unwanted row and rename the variable.

Good luck.

Arsenio_Staer · Posted 10-04-2011 06:00 PM

Randy,

The main problem for me is operating on the data once it's in SAS, I'm used to indexing matrices in R, and see no way on how to perform that kind of manipulation.

Arsenio

art297 · Posted 09-28-2011 10:52 AM

I think you might be able to use something like the following to accomplish the task fairly easily. I've included some comments in the code to explain the suggested process:

/* Identify the path where your csv files exist */

%let path=c:\art\test\;

/* pipe all of the filenames into indata */

filename indata pipe "dir &path.*.csv /b";

/* Build a dataset containing the filenames and accomplish the imports */

data filenames (keep=fil2write);

length fil2write $50;

length fil2read $80;

infile indata truncover;

informat f2r $50.;

input f2r &;

fil2write=tranwrd(translate(strip(f2r),'_',' '),".csv","");

fil2read="&path."||f2r;

call execute (

"PROC IMPORT OUT= WORK."||fil2write

||'DATAFILE="'||strip(fil2read)||'" '

||"DBMS=CSV REPLACE;

GETNAMES=YES;

DATAROW=2;

RUN;");

run;

/* Create a macro variable to later be used in a datastep set statement */

proc sql noprint;

select fil2write||" (rename=(date=datein))"

into :fnames separated by " "

from filenames

;

quit;

/* Create the desired file correct the date field at the same time */

data want (drop=datein);

format ticker $50.;

format date date9.;

set &fnames. indsname=dsn;

ticker=scan(dsn,2,".");

date=input(datein, ?? mmddyy8.);

run;

FriedEgg · Posted 09-28-2011 11:34 AM

filename out1 '/temp/tmp1.csv';

filename out2 '/temp/tmp2.csv';

data s1;

infile cards dsd dlm='09'x;

input (date px_last px_low px_high volume) (:$20.);

file out1 dsd dlm=',';

put (date px_last px_low px_high volume) (:$20.);

cards;

#NAME? 40.4688 40.1875 41.7813 2705600

1/4/2000 40.6563 40.1875 40.9063 4112400

1/5/2000 43 40.5 43.25 5844600

1/6/2000 42.4375 42.4063 43.1875 8595200

1/7/2000 42.625 42.2813 43.4375 7026800

;

run;

data s2;

infile cards dsd dlm='09'x;

input (date px_last px_low px_high volume) (:$20.);

file out2 dsd dlm=',';

put (date px_last px_low px_high volume) (:$20.);

cards;

#NAME? 1160.9274 1150.456 1200.0206 151718

1/4/2000 1101.5896 1098.0991 1140.6827 340523

1/5/2000 1103.6838 1080.6468 1116.2495 248678

1/6/2000 1139.2866 1100.1934 1157.437 261000

1/7/2000 1223.0576 1153.2484 1227.9443 311985

;

run;

/* this is the part that matters to your situation */

filename in ('/temp/tmp1.csv' '/temp/tmp2.csv');

data have;

length _file $256;

infile in dsd dlm=',' filename=_file;

input date :??mmddyy8. (px_last px_low px_high) (:8.4) volume :7.;

ticker=scan(scan(strip(_file),-1,'/'),1,'.');

run;

/* end */

x 'rm -rf /temp/tmp*.csv';

Arsenio_Staer · Posted 10-04-2011 06:24 PM

Hey FriedEgg,

Thanks for the reply! I tried your code and work.s1 from the first piece of code comes out as :

date	px_last	px_low	px_high	volume
	#NAME? 40.4		1/4/2000 40
1/5/2000 43		1/6/2000 42		1/7/2000 42

With this printed in the logs:

23 put (date px_last px_low px_high volume) (:$20.);

24

25 cards;

NOTE: The file OUT1 is:

Filename=C:\temp\tmp1.csv,

RECFM=V,LRECL=256,File Size (bytes)=0,

Last Modified=04Oct2011:15:22:38,

Create Time=04Oct2011:15:13:22

NOTE: LOST CARD.

RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+

37 ;

date= px_last= px_low= px_high= volume= _ERROR_=1 _N_=3

NOTE: 2 records were written to the file OUT1.

The minimum record length was 44.

The maximum record length was 64.

I haven't been able to convert this to the table structure in the original message. I think it's a problem on my side.

FriedEgg · Posted 10-04-2011 07:28 PM

Make sure the delimiter is actually a tab.

Also that part of my code was only to recreate your example data files.

This is the part that is actually revelant to your situation:

filename in ('/temp/*.csv');

data have;

length _file $256;

infile in dsd dlm=',' filename=_file;

input date :??mmddyy8. (px_last px_low px_high) (:8.4) volume :7.;

ticker=scan(scan(strip(_file),-1,'/'),1,'.');

run;

Arsenio_Staer · Posted 09-28-2011 09:11 PM

RandyDai, art297, FriedEgg,

Thanks for a lot for the info! Will go over it and get back to you guys.

Ksharp · Posted 09-30-2011 10:24 AM

How about:

data temp;
infile datalines expandtabs dlm=' ' truncover;
if _n_ in (1 2) then do;
                     count+1;input _ticker & $10. @;output;
                     if _n_ eq 1 then do;count+1;input _ticker & $10.  ;output;end;
                         else do;count+1;input _ticker & $10.  /;output;end;
                          end;

else do;
                  count+1;
                     input  (Date PX_LAST     PX_LOW     PX_HIGH     VOLUME) (: $10.)@;
                     output;
                 count+1;
                     input  (Date PX_LAST     PX_LOW     PX_HIGH     VOLUME) (: $10.);
                     output;
                     end;

datalines;
AA UN                                      AIG UN 
Equity                                        Equity     
Date     PX_LAST     PX_LOW     PX_HIGH     VOLUME          Date     PX_LAST     PX_LOW     PX_HIGH     VOLUME
#NAME?     40.4688     40.1875     41.7813     2705600          #NAME?     1160.9274     1150.456     1200.0206     151718
1/4/2000     40.6563     40.1875     40.9063     4112400          1/4/2000     1101.5896     1098.0991     1140.6827     340523
1/5/2000     43     40.5     43.25     5844600          1/5/2000     1103.6838     1080.6468     1116.2495     248678
1/6/2000     42.4375     42.4063     43.1875     8595200          1/6/2000     1139.2866     1100.1934     1157.437     261000
1/7/2000     42.625     42.2813     43.4375     7026800          1/7/2000     1223.0576     1153.2484     1227.9443     311985
;
run;
data temp;
  set temp;
  if mod(count,2)=1 then flag=0;
   else flag=1;
run;
proc sort data=temp;by flag count;run;
data want(drop=count _ticker flag);
length ticker $ 20;
retain ticker;
 set temp;
 by flag;
 if first.flag then  call missing(ticker);
 if not missing(_ticker) then do;
      ticker=catx(' ',ticker,_ticker);delete;
                              end;
run;

Ksharp

Arsenio_Staer · Posted 10-04-2011 08:07 PM

Hey Ksharp,

Been a long time since my rolling correlation question! Anyway, thanks for the reply! i have been testing that code.

So the work.temp comes out as:

count	_ticker	Date	PX_LAST	PX_LOW	PX_HIGH	VOLUME	flag
1	AA UN						0
3	Equity						0
5		#NAME?	40.4688	40.1875	41.7813	2705600	0
7		1/4/2000	40.6563	40.1875	40.9063	4112400	0
9		1/5/2000	43	40.5	43.25	5844600	0
11		1/6/2000	42.4375	42.4063	43.1875	8595200	0
13		1/7/2000	42.625	42.2813	43.4375	7026800	0
2	AIG UN						1
4	Equity						1
6		#NAME?	1160.9274	1150.456	1200.0206	151718	1

This is an already shifted and stacked version of the original table, right? Nice reminder on retain!

From there your code works great. Fills the ticker with the ticker name for the corresponding observations. Any way to convert it to that already stacked form you started with?

Thanks!

Arsenio

Ksharp · Posted 10-05-2011 03:45 AM

Oh. I remember that question about correlation coefficient.

Yes. This is an already shifted and stacked version of the original table. However not completed.So I need the following code to complete it.

But I do not understand what you mean about the last sentence.

"Any way to convert it to that already stacked form you started with?"

I only add another variable FLAG to prepare for the next process.

Ksharp