Solved: How to condense CSV import code when some vars have same prefix

iggles · Posted 05-25-2022 05:42 PM

SAS Studio 3.8.

I am trying to condense my CSV import code.

CSV file Description:

The CSV files contain 101 variables (and includes the variable names), many with the same prefix. It's basically a ticker variable, followed by 50 pairs of transaction ID and Volume.
I.E. ticker txid_1 txvol_1 txid_2 txvol_2 ..... txid_50 txvol_50
The character IDs can be up to length 40, and the volumes can have up 8 trailing decimal places.
Not all IDs/Volumes are filled.

Here is the contents of a sample CSV file but with only 1 record and 3 pairs of ID/Vol (for brevity):

ticker,txid_1,txvol_1,txid_2,txvol_2,txid_3,txvol_3
USD12,dhrba-dnjs,1234.1,bs782na,4528.12345678,longidnamegoeshere,456123.12345

The resulting imported SAS dataset needs to:

Display all 8 trailing decimal places for volume (I use 16.8 number format)
Keep the order of the variables as-is

See below for my current code. It works fine, but is very long once I add in the 50 IDs and volumes for each informat, format, and input.

Is there a way to clean it up? I know with drop= and keep= statements I can do something like (txid_1 - txid_50) but that doesn't seem to work with informats, formats, or input. I tried playing with array code but no luck. The fact that the ID and Volume vars are interleaved throws me off too.

Thanks for looking.

data have;
infile "/home/myhome/inputs/zz_test_csv.csv"	delimiter = ","	missover dsd firstobs=2;
informat		
ticker		$5.
txid_1		$40.
txvol_1		best32.
txid_2		$40.
txvol_2		best32.
txid_3		$40.
txvol_3		best32.
;
format		
ticker		$5.
txid_1		$40.
txvol_1		16.8
txid_2		$40.
txvol_2		16.8
txid_3		$40.
txvol_3		16.8
;
input
ticker		$
txid_1		$
txvol_1
txid_2		$
txvol_2	
txid_3		$
txvol_3
;
run;

Reeza · Posted 05-25-2022 05:54 PM

For Format and Informat statements you can use variable shortcut lists or the colon shortcut - my lazy option.

For INPUT you have to list them in the order unfortunately.

data have;
infile "/home/myhome/inputs/zz_test_csv.csv"	delimiter = ","	missover dsd firstobs=2;
informat		
ticker		$5.
txid_1-txid_3		$40.
txvol_:		best32.;

format		
ticker		$5.
txid_:		$40.
txvol_:		16.8;

input
ticker		$
txid_1		$
txvol_1
txid_2		$
txvol_2	
txid_3		$
txvol_3
;
run;

Here is a reference that illustrates how to refer to variables and datasets in a short cut list:
https://blogs.sas.com/content/iml/2018/05/29/6-easy-ways-to-specify-a-list-of-variables-in-sas.html

@iggles wrote:

SAS Studio 3.8.

I am trying to condense my CSV import code.

CSV file Description:

The CSV files contain 101 variables (and includes the variable names), many with the same prefix. It's basically a ticker variable, followed by 50 pairs of transaction ID and Volume.

I.E. ticker txid_1 txvol_1 txid_2 txvol_2 ..... txid_50 txvol_50

The character IDs can be up to length 40, and the volumes can have up 8 trailing decimal places.

Not all IDs/Volumes are filled.

Here is the contents of a sample CSV file but with only 1 record and 3 pairs of ID/Vol (for brevity):
ticker,txid_1,txvol_1,txid_2,txvol_2,txid_3,txvol_3
USD12,dhrba-dnjs,1234.1,bs782na,4528.12345678,longidnamegoeshere,456123.12345
The resulting imported SAS dataset needs to:

Display all 8 trailing decimal places for volume (I use 16.8 number format)

Keep the order of the variables as-is

See below for my current code. It works fine, but is very long once I add in the 50 IDs and volumes for each informat, format, and input.

Is there a way to clean it up? I know with drop= and keep= statements I can do something like (txid_1 - txid_50) but that doesn't seem to work with informats, formats, or input. I tried playing with array code but no luck. The fact that the ID and Volume vars are interleaved throws me off too.

Thanks for looking.
data have;
infile "/home/myhome/inputs/zz_test_csv.csv"	delimiter = ","	missover dsd firstobs=2;
informat		
ticker		$5.
txid_1		$40.
txvol_1		best32.
txid_2		$40.
txvol_2		best32.
txid_3		$40.
txvol_3		best32.
;
format		
ticker		$5.
txid_1		$40.
txvol_1		16.8
txid_2		$40.
txvol_2		16.8
txid_3		$40.
txvol_3		16.8
;
input
ticker		$
txid_1		$
txvol_1
txid_2		$
txvol_2	
txid_3		$
txvol_3
;
run;

View solution in original post

Reeza · Posted 05-25-2022 05:54 PM

For Format and Informat statements you can use variable shortcut lists or the colon shortcut - my lazy option.

For INPUT you have to list them in the order unfortunately.

data have;
infile "/home/myhome/inputs/zz_test_csv.csv"	delimiter = ","	missover dsd firstobs=2;
informat		
ticker		$5.
txid_1-txid_3		$40.
txvol_:		best32.;

format		
ticker		$5.
txid_:		$40.
txvol_:		16.8;

input
ticker		$
txid_1		$
txvol_1
txid_2		$
txvol_2	
txid_3		$
txvol_3
;
run;

Here is a reference that illustrates how to refer to variables and datasets in a short cut list:
https://blogs.sas.com/content/iml/2018/05/29/6-easy-ways-to-specify-a-list-of-variables-in-sas.html

@iggles wrote:

SAS Studio 3.8.

I am trying to condense my CSV import code.

CSV file Description:

The CSV files contain 101 variables (and includes the variable names), many with the same prefix. It's basically a ticker variable, followed by 50 pairs of transaction ID and Volume.

I.E. ticker txid_1 txvol_1 txid_2 txvol_2 ..... txid_50 txvol_50

The character IDs can be up to length 40, and the volumes can have up 8 trailing decimal places.

Not all IDs/Volumes are filled.

Here is the contents of a sample CSV file but with only 1 record and 3 pairs of ID/Vol (for brevity):
ticker,txid_1,txvol_1,txid_2,txvol_2,txid_3,txvol_3
USD12,dhrba-dnjs,1234.1,bs782na,4528.12345678,longidnamegoeshere,456123.12345
The resulting imported SAS dataset needs to:

Display all 8 trailing decimal places for volume (I use 16.8 number format)

Keep the order of the variables as-is

See below for my current code. It works fine, but is very long once I add in the 50 IDs and volumes for each informat, format, and input.

Is there a way to clean it up? I know with drop= and keep= statements I can do something like (txid_1 - txid_50) but that doesn't seem to work with informats, formats, or input. I tried playing with array code but no luck. The fact that the ID and Volume vars are interleaved throws me off too.

Thanks for looking.
data have;
infile "/home/myhome/inputs/zz_test_csv.csv"	delimiter = ","	missover dsd firstobs=2;
informat		
ticker		$5.
txid_1		$40.
txvol_1		best32.
txid_2		$40.
txvol_2		best32.
txid_3		$40.
txvol_3		best32.
;
format		
ticker		$5.
txid_1		$40.
txvol_1		16.8
txid_2		$40.
txvol_2		16.8
txid_3		$40.
txvol_3		16.8
;
input
ticker		$
txid_1		$
txvol_1
txid_2		$
txvol_2	
txid_3		$
txvol_3
;
run;

SASKiwi · Posted 05-25-2022 06:04 PM

Using the colon to specify all variables with a common prefix works OK in the FORMAT or INFORMAT statements - see below. You can only use variable lists in an INPUT statement if all the variables in the list follow each other. Unfortunately yours don't so you can't shorten your INPUT statement.

data _null_;
  var1 = 123.45;
  var2 = 234.56;
  format var: 8.1;
  put _all_;
run;

Reeza · Posted 05-25-2022 06:10 PM

You could use a macro to simplify the INPUT statement but if you already have it coded not sure if its worth it.

Tom · Posted 05-25-2022 06:29 PM

There is no need for the INFORMAT statement at all. SAS does not need any special instructions for reading character strings or numbers. Also there is no "best" informat, BEST is the name of a FORMAT, if you use it as the name of an informat you just get the normal numeric informat. There is no need to attach $ format to character variables, SAS does not need special instructions for how to print character variables. There is no need to include the $ in the input statement for a variable whose type has already begin defined (either explicitly defined in a LENGTH statement or defined as a side effect of being used in some other statement, like FORMAT or INFORMAT).

The easiest way to simplify is to also simplify the data structure you are creating. Instead of making 50 copies of the same variables with numeric suffixes, just make 50 observations. Then you can just use a DO loop over the number or repetitions.

data want;
  infile "/home/myhome/inputs/zz_test_csv.csv" dsd truncover firstobs=2;
  length ticker $5 rep 8 txid $40 txvol 8;
  input ticker @;
  do rep=1 to 50 ;
    input txid txvol @;
    output;
  end;
  format txvol 16.8 ;
run;

ballardw · Posted 05-25-2022 06:40 PM

I might be tempted to make the data long with only one ID and Vol variable plus an indicator of which on the row something occurs.

Rough example;

data example;
   infile datalines dlm=',' truncover;
   informat		
   ticker		$5.
   txid		$40.
   txvol		best32.
   ;
   input ticker @@;
   do i=1 to 3;
      input txid txvol @@;
      if not missing(txid) then output;
   end;
   input;
datalines;
USD12,dhrba-dnjs,1234.1,bs782na,4528.12345678,, 
USD1x,xxrba-dnjs,1234.1,xx782na,4528.12345678,xxxgname,456 
;

Only change to read 50 pairs would be to change the 3 to 50 on the Do i= 1 to 3;

Note that this has dummy data to simulate one row with 2 pairs and another with 3 just to show this may be very viable.

mkeintz · Posted 05-25-2022 10:57 PM

There are many reasons to take the advice you've received to read the data as multiple observations per ticker, instead of multiple variable pairs per ticker.

But there are times when you might prefer a wide vs long format. For example, you might want to record the TXVOL range for a given ticker (i.e. volrange=range(of txvol:).

In that case, to save a lot of typing, consider a macro per @Reeza 's suggestion:

%macro txloop (niter);
  %do i=1 %to 50;
    txid_&i :$40.  txvol_&i 
  %end ;
%mend;

data have;
  infile "/home/myhome/inputs/zz_test_csv.csv" dsd truncover firstobs=2;  
  input ticker $5. @;
  input  %txloop(50) ;
  format txvol: 16.8 ;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

iggles · Posted 05-25-2022 11:58 PM

All, thanks for the very speedy and great responses!

Some of my CSV imports actually get transposed to long format, via arrays, for certain processing, then transposed back to wide (everything goes wide in the end). Your macro ideas will definitely help me skip a step or two there.

And the other informat/format suggestions will help as well. I chose Reeza's answer as the solution, but really, I'll be taking bits from everyone. Thanks again. 😃👍

How to condense CSV import code when some vars have same prefix

Re: How to condense CSV import code when some vars have same prefix

Re: How to condense CSV import code when some vars have same prefix

Re: How to condense CSV import code when some vars have same prefix

Re: How to condense CSV import code when some vars have same prefix

Re: How to condense CSV import code when some vars have same prefix

Re: How to condense CSV import code when some vars have same prefix

Re: How to condense CSV import code when some vars have same prefix

Re: How to condense CSV import code when some vars have same prefix

Classroom Training Available!