AnnaNZ
Quartz | Level 8

Hi

 

I am working with huge datasets of more than 7 million data points per set, so working on these datasets takes considerable time.

How can I reduce them so that I work on only 10% of the data? That way I could work through all the coding faster and then apply the code to the whole dataset.

 

I could sort each dataset by the same (say two) main variables and take the first 700,000 observations in each dataset.

Does anybody know a faster, more elegant way, though?

Many thanks,

1 ACCEPTED SOLUTION

Accepted solution: see Patrick's reply below.
16 Replies
AnnaNZ
Quartz | Level 8

Do it for one set and then use the individuals of this reduced set 1 to create the other sets, to make sure I have the same individuals to play with?

Would you merge such a set? Or rather join it, so as not to create 'more' individuals than what I have in set 1? Hmmm.

kiranv_
Rhodochrosite | Level 12

What kind of datasets do you have? What operations do you have to do? How many columns do you need in the output dataset? Do you have indexes on your tables, etc.?

AnnaNZ
Quartz | Level 8

Currently I am building the spine of the dataset, which will then grow by 800 new modelled variables.

It has repeated Client_Ids and repeated Refer_Ids, whereby a Refer_Id can have multiple Client_Ids - complex.

Finally, that dataset will be used for implementing a new, machine-learnt tool.

 

AndrewHowell
Moderator

@AnnaNZ - there are lots of ways to create subsets of data to work on. Which technique you use will depend on factors such as how random you want that 10% to be.

 

Thoughts:

  • Sorting the dataset first would automatically "bias" any sampling of the data.
  • If you have SAS/STAT, there is PROC SURVEYSELECT (click here).
  • There are DATA step techniques using the RANUNI() function, which can sample with or without replacement (in other words, whether you care if an observation can be randomly reselected). These are also useful if you don't have SAS/STAT and/or you'd like to output to multiple tables (which the DATA step can do and PROC SURVEYSELECT cannot); see the sketch after this list.
  • Rick Wicklin has posted a couple of SAS blogs here (sampling with replacement) and here (sampling without replacement).
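
A minimal sketch of both approaches (WORK.BIGDATA and WORK.SAMPLE10 are placeholder names):

/* DATA step: keep each observation with probability 0.10 */
data work.sample10;
  set work.bigdata;
  if ranuni(12345) <= 0.10;  /* fixed seed makes the sample repeatable */
run;

/* PROC SURVEYSELECT: 10% simple random sample without replacement */
proc surveyselect data=work.bigdata out=work.sample10
  method=srs samprate=0.10 seed=12345;
run;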

 

Best of luck!

AnnaNZ
Quartz | Level 8

The 10% datasets are only created to be able to work faster and do not have to represent the whole dataset as such. They are only created so that I can more quickly code the 800 variables that will need to be created.

At the end of that exercise I simply take the code and apply it to the whole datasets.

Reeza
Super User
options obs=1000000; *work with only a small portion of your data;

Use the OBS= option to set the number of records you want to process so that your code runs in an acceptable time.

Then set it back to MAX when you're done. NEVER sort a dataset in place while OBS= is restricted: PROC SORT will read only the first N observations and overwrite the dataset with them, and you'll lose the rest of the dataset.
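
For example, with OBS= restricted (a hypothetical sketch; WORK.BIG and CLIENT_ID are placeholder names):

proc sort data=work.big;  /* DANGER: overwrites WORK.BIG with only the first N observations */
  by client_id;
run;

proc sort data=work.big out=work.big_sorted;  /* safe: WORK.BIG is left intact */
  by client_id;
run;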

 

options obs=max; * reset to use all obs by default;

 

AnnaNZ
Quartz | Level 8

Hi Reeza, thank you for your advice. I always create new datasets so I can always go back to the previous version when something goes wrong. I also keep a set of original datasets in a safe place :-).

What I did now: I took the dataset that will build the spine of this whole set, sorted it by Client_Id and Refer_ID, and then created from it a new dataset with only 10% of the observations. My idea would be to take these Client_Id and Refer_ID pairs to choose the equivalent Client_Id and/or Refer_ID pairs in the other sets.

Would that work?

The next question, though: how do I use this subset of (Client_Id and Refer_ID, set 1) to create data with the same (Client_Id and Refer_ID, set n) pairs? When I just merge them, (Client_Id and Refer_ID, set n) will add all its other Client_Id and Refer_ID pairs to (Client_Id and Refer_ID, set 1).

How do I make sure the other sets have exactly the same Client_Id and Refer_ID pairs?

Are joins the way to go here?

The reason I need to have these same pairs across the datasets is that they are all used to create and code new variables across the other 50-odd datasets I have.

Otherwise I will not be able to test whether the code made sense, etc.

Reeza
Super User

I usually just make the N big enough. Even when I'm doing matching, you'll usually see all the issues with 1M or 5M rows, but those process fast enough that it doesn't matter. You'll end up with fewer observations than the N you set because of the match, but that was usually enough for development for me.

However, I always make it a point to test for 'edge' cases in my code: the 'boundaries' of each condition, plus one or two random cases. In the case of a patient/client match, I would check for someone with multiple records, someone with a single record, and what happens when someone is not in one or the other dataset. I build these checks into my process anyway, because data changes over time. I can't always assume that things stay the same, and I've caught many a data-change error this way - especially in cases where people change how the source data is structured only slightly, enough that you miss it when looking it over, but it is caught by the code. An example was a dataset that went from having a single record per person to having multiple records.
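
A minimal sketch of such checks (WORK.DS1, WORK.DS2 and CLIENT_ID are hypothetical names):

proc sql;
  /* records per client: spot single- vs. multiple-record cases */
  create table work.per_client as
  select client_id, count(*) as n_rows
  from work.ds1
  group by client_id;

  /* clients in ds1 with no match in ds2 */
  create table work.unmatched as
  select client_id
  from work.ds1
  where client_id not in (select client_id from work.ds2);
quit;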

AnnaNZ
Quartz | Level 8

Many thanks, Reeza, that is great information.

How do you join them without creating a massive number of new IDs?

Is the join statement better here - though I read that merge and join essentially do the same thing.

Say I want only datasets with the ID and ref_id pairs of dataset 1. How do I extract this info from dataset 2? So IDs 2 and 3 should not be incorporated?

 

 

 

dataset 1

ID  ref_id
1   12
4   24
5   24
5   26
6   36
7   29
7   30
8   31

dataset 2

ID  ref_id
1   12
2   8
2   9
2   10
3   11
3   11
3   12
4   24
5   24
5   26
6   36
7   29
7   30
8   31

 

Reeza
Super User

I do all my joins via SQL personally; I find it easier to understand. Then I look at the famous SQL joins graphic to decide what kind of join I need.

 

https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins

 

The good thing about this one is that it usually includes sample code.

 

So since you want IDs that are in both sets, you can use an inner join.
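
For example, a minimal PROC SQL sketch using the dataset 1 / dataset 2 example above (DISTINCT guards against row multiplication when a key pair repeats):

proc sql;
  create table dataset2_reduced as
  select distinct b.*
  from dataset1 as a
       inner join dataset2 as b
       on a.ID = b.ID and a.ref_id = b.ref_id;
quit;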

 

Joins and merges are the same idea: MERGE is the statement used in a DATA step, JOIN is the term used in SQL.

UNION and APPEND/concatenate are the same as well, just different terms.

 

 

AndrewHowell
Moderator

"Joins and Merges are the same.."??

Reeza
Super User

@AndrewHowell wrote:

"Joins and Merges are the same.."??


The concept of a SQL join and a DATA step MERGE is the same: they both try to put the data together 'side by side'. Yes, a SQL join can do things a SAS merge cannot, and a merge can do some things that a join cannot in some cases.
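
A small sketch of one such difference (hypothetical data): with duplicate key values on both sides, MERGE pairs rows positionally within the BY group, while a SQL inner join produces every combination.

data a;
  input id x;
datalines;
1 10
1 11
;

data b;
  input id y;
datalines;
1 20
1 21
;

data m;        /* MERGE: 2 rows (10/20 and 11/21) */
  merge a b;
  by id;
run;

proc sql;      /* JOIN: 4 rows (every x paired with every y) */
  create table j as
  select a.id, a.x, b.y
  from a inner join b on a.id = b.id;
quit;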

Patrick
Opal | Level 21

@AnnaNZ

Using SAS hash tables can be an efficient way to maintain referential integrity between tables when reducing volumes.

Assuming the first table has already been reduced using PROC SURVEYSELECT or other techniques, the code below demonstrates how you can reduce the volume of the 2nd table while maintaining referential integrity.

data ds1_reduced;
  input ID ref_id;
datalines;
1 12
4 24
5 24
5 26
6 36
7 29
7 30
8 31
;
run;

   
data ds2;
  input ID ref_id;
datalines;
1 12
2 8
2 9
2 10
3 11
3 11
3 12
4 24
5 24
5 26
6 36
7 29
7 30
8 31
;
run;

data ds2_reduced;

  if _n_=1 then
    do;
      /* load the key pairs from the reduced first table into a hash table */
      dcl hash h1(dataset:'ds1_reduced(keep=id ref_id)');
      h1.defineKey('id','ref_id');
      h1.defineDone();
    end;

  set ds2;
  /* keep only rows whose (id, ref_id) pair exists in ds1_reduced */
  if h1.check()=0 then output;
run;

  
AnnaNZ
Quartz | Level 8

Hi Patrick, very interesting solution, I like it. It works when I use one variable, and gives the same result as an inner join.

But it throws the error below when I try to do it with both variables. I have yet to find out why.

data Kea.Trail1_102;

  if _n_=1 then
    do;
     dcl hash h1(dataset:'Kea.testdata10 (keep= REFER_ID CL_ID)');
      h1.defineKey('REFER_ID' 'CL_ID');
      h1.defineDone();
    end;
set Kea.Trails_file_1;
 if h1.check()=0 then output;
run;


401  data Kea.Trail1_102;
402
403    if _n_=1 then
404      do;
405       dcl hash h1(dataset:'Kea.testdata10 (keep= REFER_ID CL_ID)');
406        h1.defineKey('REFER_ID' 'CL_ID');
                                   -------
                                   22
ERROR 22-322: Syntax error, expecting one of the following: !, !!, &, *, **, +, ',', -, /, <, <=,
              <>, =, >, ><, >=, AND, EQ, GE, GT, LE, LT, MAX, MIN, NE, NG, NL, OR, ^=, |, ||,
              ~=.

407        h1.defineDone();
ERROR: DATA STEP Component Object failure.  Aborted during the COMPILATION phase.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

408      end;
409  set Kea.Trails_file_1;
410   if h1.check()=0 then output;
411  run;
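
For what it's worth, the log's ERROR 22-322 points at the missing comma between the two key names: DEFINEKEY expects a comma-separated list of key names, so the call should read

  h1.defineKey('REFER_ID','CL_ID');  /* key names separated by commas */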
