topic Re: 10 % of data for all datasets to work more efficiently in SAS Data Management

10 % of data for all datasets to work more efficiently

AnnaNZ — Fri, 23 Jun 2017 00:56:09 GMT

I am working with huge datasets of more than 7mio datapoints per set. So working on these datasets takes considerable time.

How can I reduce them by working on only 10% of the data? That way I could faster work through all the coding and then apply the codes to the whole dataset.

I could sort each dataset by the same (say two ) main variables and take the first 700'000 observations in each dataset.

Does anybodyknow though a faster and more elegant way?

Many thanks,

Re: 10 % of data for all datasets to work more efficiently

AnnaNZ — Fri, 23 Jun 2017 01:01:56 GMT

Do it for one set and then use the individuals of this reduced set1 to create the other sets to make sure I have the same individuals to play with ?

Would you merge such set? Rather join it so not to create 'more' individuals than what I have in set1? hmmm

Re: 10 % of data for all datasets to work more efficiently

kiranv_ — Fri, 23 Jun 2017 01:18:58 GMT

what kind of datasets you have. what are the operations you have do. how many columns you need for output dataset. Whether you have indexes on your tables etc.

Re: 10 % of data for all datasets to work more efficiently

AnnaNZ — Fri, 23 Jun 2017 01:30:21 GMT

Currently I am building the spine of the dataset, which will then grow by 800 new modelled variables.

It has repeated Client_Ids and repated Refral_Ids, whereby a Refral_Id can have multiple Client_Ids - complex

Finally that dataset will be used for implementing a new, macchine learnt tool

Re: 10 % of data for all datasets to work more efficiently

AndrewHowell — Fri, 23 Jun 2017 02:17:01 GMT

@AnnaNZ - there are lots of ways to create subsets of data to work on. Which technique you use will depend on factors like (for example) how random you want that 10% to be.

Thoughts:

Sorting the dataset would automatically "bias" any sampling of the data.
If you have SAS/STAT, there is PROC SURVEYSELECT (Click here)
There are data set tips using the RANUNI() function, which can sample with or without replacement (in other words, whether you care if one observation is randomly reslected or not). Also useful if you don't have SAS/STAT and/or you'd like to output to multiple tables (which the DATA STEP can do and PROC SURVEYSELECT cannot).
Rick Wiklin has posted a couple of SAS blogs here (sampling with replacement) and here (sampling without replacement)

Best of luck!

Re: 10 % of data for all datasets to work more efficiently

AnnaNZ — Fri, 23 Jun 2017 02:20:58 GMT

The 10 % datasets are only created to be able to work faster and do not have to represent the whole dataset as such. It is only to created to be able to faster code the 800 variable that will need to be created.

At the end of that excerse I simply take the codes and apply them to the whole datasets.

Re: 10 % of data for all datasets to work more efficiently

Reeza — Fri, 23 Jun 2017 02:24:29 GMT

options obs=1000000; *work with only a small portion of your data;

Use the obs option to set the number of records you want to process that will run in an acceptable time.

Then set it back to max when you're done. NEVER EVER do this while sorting a dataset without replacing it. It will sort only the first N obs that you had selected and you'll lose the rest of the dataset.

options obs=max; * reset to use all obs by default;

Re: 10 % of data for all datasets to work more efficiently

AnnaNZ — Fri, 23 Jun 2017 02:44:20 GMT

Hi Reeza, thank you for your advise. I always create new datasets so i can lways go back to the previous version when something went wrong. I also keep a set of oroginal datasets at a safe place :-).

What I did now is , I used now the dataset that will build the spine of this whole set, sorted it by Client-Id and Refer_ID and thern created with this a new dataset, with only 10% of the observations. My idea would be to take these Client-Id and Refer_ID pairs to choose the equivalent Client-Id and / or Refer_ID pairs in the other sets.

Would that work?

The next question though, how do I use this subset of (Client-Id and Refer_ID_set 1) to create data with the same (Client-Id and Refer_ID-set_n) pairs? When I just merge them then (Client-Id and Refer_ID-set_n) will add all its other Client-Id and Refer_ID pairs to (Client-Id and Refer_ID_set 1).

How do I make sure the other sets have the nthe exacly same Client-Id and Refer_ID pairs?

Are joins the way to go here??

The reason I need to have these same pairs accross the datasets is because they all are used to ccreate and code new variables accross the other 50 odd datasets I have.

Otherwise I wil not be able to test whether a code made sense ect

Re: 10 % of data for all datasets to work more efficiently

Reeza — Fri, 23 Jun 2017 02:54:48 GMT

I usually just make the N big enough. Even though I'm doing matching, you'll usually see all the issues with 1M or 5M rows, but those process fast enough that it doesn't matter. You'll end up with less N that you set because of the match but that was usually enough for development for me.

However, I always make it a point to test for 'edge' cases in my code. So those are the 'boundaries' of each condition and then one or two random cases. Or in a case of patient/client match I would check for someone with multiple records and single records and what happens when one is not in one or the other dataset. I build these checks into my process anyways, because data changes over time. I can't always assume that things are the same and I've caught many a data change error this way. Especially in cases where people change how the source data is structured only slightly...enough that you miss when looking over, but is caught in the code. An example was a dataset where it went from having a single record per person to having multiple records.

Re: 10 % of data for all datasets to work more efficiently

AnnaNZ — Fri, 23 Jun 2017 03:31:46 GMT

many thanks reeza tat is great information.

How do you join them without creating massive amount of new IDs.

Is the join statement better here - though I read that merge and join essentiually do the same

say, I want only datasets with the ID and refer_IDs of dataset 1. How do I extract this info from dataset 2? so Id 2 and 3 should not be incorporated. ?

dataset1

ID	ref_id
1	12
4	24
5	24
5	26
6	36
7	29
7	30
8	31


dataset 2

1	12
2	8
2	9
2	10
3	11
3	11
3	12
4	24
5	24
5	26
6	36
7	29
7	30
8	31

Re: 10 % of data for all datasets to work more efficiently

Reeza — Fri, 23 Jun 2017 03:41:44 GMT

I do all my joins via SQL personally, I find it easier to understand. Then I look at the famous SQL graphic to decide what kind of join I need.

https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins

The good thing about this one is they usually have sample code.

So since you want ID's that are in both sets you can use an inner join.

Joins and Merges are the same, MERGE is the statement used in a Data Step. JOIN is the term used in SQL.

Union and Append/Concatenate are the same as well, but different terms.

Re: 10 % of data for all datasets to work more efficiently

AndrewHowell — Fri, 23 Jun 2017 05:54:08 GMT

"Joins and Merges are the same.."??

Re: 10 % of data for all datasets to work more efficiently

kiranv_ — Fri, 23 Jun 2017 09:08:15 GMT

When ever queryoptimization is needed, what I would do is see how each query works, which part of query is running slow and try to find why it i running slow and how can I enhance query performance. With proper techniques, lot of queries can be optimized. Say building indexes can help, a hash join could be better or a join by format. There could be many more techniques

Re: 10 % of data for all datasets to work more efficiently

Reeza — Fri, 23 Jun 2017 13:05:42 GMT

@AndrewHowell wrote:

"Joins and Merges are the same.."??

The concept of a SQL join and data step MERGE - they both try and put the data together 'side by side'. Yes, a SQL join can do things a SAS merge cannot and a Merge can do some things that a Join cannot in some cases.

Re: 10 % of data for all datasets to work more efficiently

Patrick — Sat, 24 Jun 2017 04:31:20 GMT

@AnnaNZ

Using SAS Hash tables can be an efficient way to maintain referential integrity between tables when reducing volumes.

Assuming the first table has already been reduced using Proc Surveyselect or other techniques then below code demonstrates how you can reduce the volume in the 2nd table while maintaining referential integrity.

data ds1_reduced;
  input ID ref_id;
datalines;
1 12
4 24
5 24
5 26
6 36
7 29
7 30
8 31
;
run;

   
data ds2;
  input ID ref_id;
datalines;
1 12
2 8
2 9
2 10
3 11
3 11
3 12
4 24
5 24
5 26
6 36
7 29
7 30
8 31
;
run;

data ds2_reduced;

  if _n_=1 then
    do;
      dcl hash h1(dataset:'ds1_reduced(keep=id ref_id)');
      h1.defineKey('id','ref_id');
      h1.defineDone();
    end;

  set ds2;
  if h1.check()=0 then output;
run;

Re: 10 % of data for all datasets to work more efficiently

AnnaNZ — Sat, 24 Jun 2017 22:22:46 GMT

Hi Patrick, very interesting solution, I like it. It works when I use one one variable, and gives the same rulst like an inner joint function.

But it files the fault below, when I try to do it with both variables. I have yet to find out why.

data Kea.Trail1_102;

  if _n_=1 then
    do;
     dcl hash h1(dataset:'Kea.testdata10 (keep= REFER_ID CL_ID)');
      h1.defineKey('REFER_ID' 'CL_ID');
      h1.defineDone();
    end;
set Kea.Trails_file_1;
 if h1.check()=0 then output;
run;


401  data Kea.Trail1_102;
402
403    if _n_=1 then
404      do;
405       dcl hash h1(dataset:'Kea.testdata10 (keep= REFER_ID CL_ID)');
406        h1.defineKey('REFER_ID' 'CL_ID');
                                   -------
                                   22
ERROR 22-322: Syntax error, expecting one of the following: !, !!, &, *, **, +, ',', -, /, <, <=,
              <>, =, >, ><, >=, AND, EQ, GE, GT, LE, LT, MAX, MIN, NE, NG, NL, OR, ^=, |, ||,
              ~=.

407        h1.defineDone();
ERROR: DATA STEP Component Object failure.  Aborted during the COMPILATION phase.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

408      end;
409  set Kea.Trails_file_1;
410   if h1.check()=0 then output;
411  run;

Re: 10 % of data for all datasets to work more efficiently

Patrick — Sat, 24 Jun 2017 23:49:55 GMT

@AnnaNZ

That's a simple syntax error. You've missed a comma.

h1.defineKey('REFER_ID', 'CL_ID');

406        h1.defineKey('REFER_ID' 'CL_ID');
                                   -------
                                   22
ERROR 22-322: Syntax error, expecting one of the following: !, !!, &, *, **, +, ',', -, /, <, <=,
              <>, =, >, ><, >=, AND, EQ, GE, GT, LE, LT, MAX, MIN, NE, NG, NL, OR, ^=, |, ||,
              ~=.