🔒 This topic is solved and locked.
promo_at_work
Obsidian | Level 7

 I've tried running the following on my dataset:

 

data split;
   set test;
   length var1-var700 $10.;
   array var(700) $;
   do i = 1 to dim(var);
      var[i]=scan(row,i,'|','m');
   end;
run;

But this takes a very long time.

 

Is there a more efficient way of handling this? 


25 REPLIES
Reeza
Super User

1. Is the file already a SAS data set?

2. Can you go back to the source and read it in correctly from there? That would probably be a bit more efficient.

3. That's a wide data set; can you partition it in any manner? Do you have enough space to store/manage it? It's going to be in the tens if not hundreds of gigabytes. I had a data set with ~30 million rows and 100 variables that was around 30GB.

4. What is your SAS installation like? Are you working on a desktop or server installation? If this is SAS UE, it likely will not handle data this large.

 

Processing any file that large will take time, so what's your expected amount of time to read/process this file? It likely won't happen in seconds (unless you have an amazing server), but I would expect it to take less than 30 minutes.

promo_at_work
Obsidian | Level 7
Hi Reeza,

Thanks for offering to help. See below for response to your questions:

1) It's already a SAS Dataset

2) I'm doing this at the source of the dataset itself

3) The current dataset is around 67GB with compression. It's just one variable with a string. There is an associated load date as well, so maybe I can partition according to that, if that's what you mean.

4) Working on SAS Grid Platform using EG as a client. I'm doing this at work.

Less than 30 mins is fine with me. I tried this with 10,000 records, which takes about 30 seconds, so it would be very slow on 28 million rows.
Reeza
Super User

I'm not an efficiency expert, but I suspect there's no way to do that more efficiently once it's in SAS. Ideally, you would go back a step to when the data is loaded and have it read correctly at that point. You should be able to specify the delimiter easily if this is a well-formatted text file, which it sounds like it is.
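For example, if the source really is a pipe-delimited text file, a step along these lines should read it directly (the path here is hypothetical):

data split;
   /* DSD treats consecutive '|' as missing values; TRUNCOVER pads short records */
   infile '/path/to/source.txt' dsd dlm='|' truncover lrecl=7699;
   input (var1-var700) (:$10.);   /* the :$10. informat also sets the lengths */
run;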

 

 

AhmedAl_Attar
Rhodochrosite | Level 12

Hi @promo_at_work

 

If you are using SAS 9.4, you may want to use PROC DS2 to process your ~28 million row data set in a multi-threaded way.

Here is how you can do it:

proc ds2;

  /* Declare a thread that splits the long delimited string into 700 variables */
  thread splitter / overwrite=yes;
    dcl double count;
    dcl bigint thisThread;
    vararray varchar(10) var[1:700] var1-var700;
    drop count row;                /* drop the helper counter and the original string */
    method run();
      dcl int i;                   /* declared inside the method, so it stays local */
      set test;
      do i = lbound(var) to hbound(var);
        var[i] = scan(row, i, '|', 'm');
      end;
      thisThread = _threadid_;
      count+1;
    end;
    method term();
      put '**Thread' _threadid_ 'processed' count 'rows';
    end;
  endthread;
  run;

  /* Use 6 threads to process the large ~28 million rows */
  data split / overwrite=yes;
    dcl thread splitter st;
    method run();
      set from st threads=6;
    end;
  enddata;
  run;
quit;

 

Credit to @SASJedi for the blog post Jedi SAS Tricks: Warp Speed DATA Steps with DS2.

 

Hope this helps with speeding up your processing,

Ahmed

Astounding
PROC Star

Some characteristics of your data may make a faster approach possible.  Here are a few related questions.  Among VAR1-VAR700:

 

Do any variables have a missing value?

 

Do any variables contain embedded blanks?

 

What I'm looking at is that SCAN has to search from the start of the string on every call, so each variable forces SAS to scan past (on average) 350 delimiters, roughly 245,000 delimiter searches per row.  There might be a way to convert the values using either arrays or a hash table (or maybe even parsing functions) while searching for just a single delimiter each time; see the sketch below.  So let's start with those questions.
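To illustrate the single-delimiter idea, here is a minimal sketch (a hypothetical illustration, using the variable names from the original question) that walks ROW once with FINDC and SUBSTR instead of letting SCAN re-search from position 1 on every call:

data split;
   set test;
   array var(700) $10;
   drop i start pos row;
   start = 1;
   do i = 1 to dim(var);
      pos = findc(row, '|', start);  /* next '|' at or after position start */
      if pos = 0 then do;            /* no delimiter left: take the final piece */
         if start <= length(row) then var[i] = substr(row, start);
         leave;
      end;
      if pos > start then var[i] = substr(row, start, pos - start);
      /* pos = start means two adjacent delimiters, i.e. an empty field */
      start = pos + 1;
   end;
run;

Each character of ROW is examined once instead of roughly 350 times, which is the point of searching for a single delimiter at a time.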

Tom
Super User

Did you try the "_infile_ trick"?

data split;
   infile cards dsd dlm='|' truncover ;
   set test;
   input @1 @@;
   _infile_=row;
   input (var1-var700) (:$10.) @@;
   drop row;
cards;
***dummy line***
;

 

promo_at_work
Obsidian | Level 7

Hi @Tom & @FreelanceReinh

 

Thanks for your replies. I tried both of your approaches on the generated test dataset and they work great. However, on my own dataset they seem to lead to some issues, which I think is due to the many missing values.

 

Here's an example of what the data looks like; you can see there will be a lot of missing values:

 

A|452435234|4353663412|JOHN|DOE|25||||||MI|00001||||||||||||||||||

 

When I run @Tom's data step on the dataset, the output ends up as follows (the rest are blank). It looks like an issue with TRUNCOVER; I'm not sure how to rectify it:

 

var1  var2       var3        var4  var5  var6 ... var700
A     452435234  4353663412  JOHN  DOE   (all blank)

 

When I run @FreelanceReinh's data step, the output ends up blank.

 

 

 

 

FreelanceReinh
Jade | Level 19

Hi @promo_at_work,

 

Thanks for testing the suggested solutions. Did you check the resulting SPLIT datasets, e.g. by comparing them to the dataset obtained with your correct, but slow SCAN approach?

 

The reason I am asking is that I hadn't been able to get correct results from my generated test data using the "INFILE CARDS" approach (which was used in the 2016 post where I had seen the "_infile_ trick" for the first time). I don't think that TRUNCOVER is the issue here, but rather the limitation to a record length of 80 characters (which I wasn't able to overcome, regardless of the CARDIMAGE system option setting or the LINESIZE= option of the INFILE statement, LRECL= being inapplicable).

 

The length of variable _INFILE_ can be written to the log:

701  options nocardimage;
702  data _null_;
703     infile cards dsd dlm='|' truncover ;
704     set test(obs=1);
705     input @1 @@;
706     _infile_=row;
707     len=length(_infile_);
708     put len=;
709  cards;

len=80
NOTE: There were 1 observations read from the data set WORK.TEST.

So, the subsequent INPUT statement (populating the 700 character variables) would operate on a truncated record, resulting in missing values for all but the first few variables. This is the effect you describe.

 

Therefore I resorted to reading from an external (dummy) file where the default logical record length is 32767 (unless the LRECL= system option has been set to a smaller value), hence sufficient for your data. The external file can be an existing file, as @s_lassen has suggested, or a newly created temporary file as in my approach.
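For reference, here is a minimal sketch of that temporary-file variant, reconstructed from the description above (so not necessarily the exact code from the earlier post):

filename dummy temp;

/* Create the dummy file with a single dummy record */
data _null_;
   file dummy;
   put 'blah';
run;

data split;
   set test;
   /* default LRECL=32767; TRUNCOVER fills trailing variables with blanks */
   infile dummy dsd dlm='|' truncover;
   input @1 @@;
   _infile_ = row;                 /* replace the record buffer with ROW */
   input @1 (var1-var700) (:$10.) @@;
   drop row;
run;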

 

When you apply my approach to your data and the "output ends up blank", do you get any log messages (including NOTEs) which are different from those obtained with the test data (where "it works great")?

 

I don't believe that "many missing values" can be the reason. The test data contain missing values as well, up to seven in a row, and it's easy to increase the proportion of missing values further, e.g. by assigning

l=floor(11*rand('uniform')**4);

(not to speak of l=0).

 

Are different operating systems involved, so that the TERMSTR= option of the INFILE statement should be used? (I'm not familiar with the "SAS Grid Platform".) Or do your data contain control characters, multi-byte characters etc.?

promo_at_work
Obsidian | Level 7

Hey @FreelanceReinh,

 

Thanks for looking into it. Yes, it does appear to be the record length. Here's an extract from the log:

 

NOTE: LOST CARD.
_ERROR_=1 _INFILE_=  _N_=1
NOTE: 1 record was read from the infile FT15F001.
      The minimum record length was 4.
      The maximum record length was 4.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: There were 1 observations read from the data set WORK.TEST.
NOTE: The data set WORK.SPLIT has 0 observations and 700 variables.
NOTE: DATA statement used (Total process time):
      real time           0.08 seconds
      cpu time            0.01 seconds

FreelanceReinh
Jade | Level 19

The record lengths of 4 mentioned in the log are correct. They just reflect the length of the dummy string "blah". Much more important is the record length that is normally mentioned in the log message starting with "NOTE: The infile FT15F001 is:". On my computer it says:

RECFM=V,LRECL=32767,File Size (bytes)=6,

and 32767 should be enough for 700 variables of length 10 (plus the delimiters between them).

 

Do you see a different value in your log?

FreelanceReinh
Jade | Level 19

The notes

NOTE: LOST CARD.

and

NOTE: SAS went to a new line when INPUT statement reached past the end of a line.

together with

NOTE: There were 1 observations read from the data set WORK.TEST.
NOTE: The data set WORK.SPLIT has 0 observations and 700 variables.

indicate that variable ROW contains fewer than 700 pipe-delimited values in the first observation of dataset TEST. You can add the TRUNCOVER option to the INFILE statement to avoid these notes and probably get some (or possibly all) observations in dataset SPLIT. However, the discrepancy between the anticipated and the actual number of values would still be alarming.

 

Edit: You can run the following step to determine the number of delimiters in the first observation (or in the first few observations if you increase the OBS= option value):

data _null_;
   set test(obs=1);
   numdlm=countc(row,'|');
   put numdlm=;
run;

The result numdlm=699 (in the log) would be good, but a smaller number is more likely, given the symptoms you described.

AhmedAl_Attar
Rhodochrosite | Level 12

Hi @promo_at_work

 

I was so interested in learning this _infile_ trick that I stayed up researching and trying to resolve it myself. Here is what I found.

 

The SAS documentation for the DATALINES statement says: "To ensure that your data is processed correctly, use an external file for input when record lengths are greater than 80 bytes."
 
That's why @Tom's original solution didn't work for you: because it uses cards/datalines, the _infile_ value was truncated to 80 characters/bytes instead of holding the original value of the row variable.
 
Below is a hybrid of @Tom's and @FreelanceReinh's solutions, and it looks like this:
/* Dummy file */
filename tmpf temp;

/* Ensure it gets created */
data _null_;
   file tmpf;
   put;
run;

/* Use an external file with the _infile_ trick */
data split;
   set test;
   array var{700} $10;
   /* 7699 = 700 values x 10 chars + 699 delimiters */
   infile tmpf dsd dlm='|' truncover ls=7699 lrecl=7699;
   input @1 @@;
   _infile_ = strip(row);
   input @1 var[*] :$10. @@;
   drop row;
run;
It kinda looks like @s_lassen's solution, but I wanted to explain why, and what was needed to get it working.
I hope this was an interesting learning experience for you, as much as it was for me.
 
Regards,
Ahmed
promo_at_work
Obsidian | Level 7
Ahmed, thank you very much for this. It ran great.

I'm glad you learnt something from it too. I certainly did 🙂
JasonS_SAS
SAS Employee

Using the INPUT statement in this way is pretty neat.

 

To speed this DATA step just a little more, you could add a "retain var;" statement after the ARRAY statement.

 

Each time the DATA step iterates, all the variables in the array VAR are initialized to blanks.  Then the INPUT statement sets all the variables in VAR to a value.  The RETAIN statement prevents that initialization to blanks, which isn't needed here.

 

There might be a slight speed-up from using the $CHAR10. informat instead of $10.: $CHARw. results in a simple memory move, while $w. does a little more work to remove leading blanks and to check for the value ".".

 

If you're using a Unix system, you can use "/dev/zero" instead of tmpf on the INFILE statement.  This makes the program a little simpler by not needing to create a temp file.

 

If you're using Windows or z/OS, I'm not aware of an equivalent to /dev/zero on those operating systems.

 

data split2;
   set test;
   array var{700} $10;
   retain var;
   infile '/dev/zero' dsd dlm='|' truncover ls=7699 lrecl=7699;
   input @1 @@;
   _infile_ = strip(row);
   input @1 var[*] :$char10. @@;
   drop row;
run;


Discussion stats
  • 25 replies
  • 3373 views
  • 30 likes
  • 10 in conversation