Solved: Re: How to select record pairs and then concatenate them?

genemroz · Posted 05-28-2020 11:22 AM

I'm looking for a way to create a dataset from observations/records that are time-based. The example below shows records of Data and Seconds.
data arbitrary;
input Data $ Seconds;
datalines;
AAA 111.1
BBB 112.0
CCC 115.0
DDD 117.2
EEE 118.0
FFF 120.3
GGG 121.0
HHH 123.0
JJJ 130.0
;
The goal is the create an new dataset consisting only of pairs of records separated by less than one second of time that are then concatenated into a single record. So the resulting dataset should look like:

Data1 Seconds1 Data2 Seconds2
AAA 111.1 BBB 112.0
DDD 117.2 EEE 118.0
FFF 120.3 GGG 121.0

I've been scratching my head over this for a few days without success. Thanks in advance for any advice, help, pointers anyone can provide.

Regards,

Gene

ballardw · Posted 05-28-2020 11:51 AM

What if there are more than two records that match the requirement exactly for the time difference?

Is one supposed to be larger than the other?

This duplicates your example but with more records/values is likely not to be a complete answer.

data have;
input Data $ Seconds;
datalines;
AAA 111.1
BBB 112.0
CCC 115.0
DDD 117.2
EEE 118.0
FFF 120.3
GGG 121.0
HHH 123.0
JJJ 130.0
;

proc sql;
   create table want as
   select a.data as data1, a.seconds as seconds1 
         ,b.data as data2, b.seconds as seconds2
   from have as a,
        have as b
   where 0 le b.seconds - a.seconds lt 1
        and a.data ne b.data
   ;
quit;

Proc SQL implements ANSI standard SQL actions such as using a reflexive Cartesian join, the From have as a, have as b which joins a data set on itself for each record and then allows use of where to filter the results. More rules would be needed and may add significant code.

View solution in original post

ballardw · Posted 05-28-2020 11:51 AM

What if there are more than two records that match the requirement exactly for the time difference?

Is one supposed to be larger than the other?

This duplicates your example but with more records/values is likely not to be a complete answer.

data have;
input Data $ Seconds;
datalines;
AAA 111.1
BBB 112.0
CCC 115.0
DDD 117.2
EEE 118.0
FFF 120.3
GGG 121.0
HHH 123.0
JJJ 130.0
;

proc sql;
   create table want as
   select a.data as data1, a.seconds as seconds1 
         ,b.data as data2, b.seconds as seconds2
   from have as a,
        have as b
   where 0 le b.seconds - a.seconds lt 1
        and a.data ne b.data
   ;
quit;

Proc SQL implements ANSI standard SQL actions such as using a reflexive Cartesian join, the From have as a, have as b which joins a data set on itself for each record and then allows use of where to filter the results. More rules would be needed and may add significant code.

genemroz · Posted 05-28-2020 12:23 PM

Thank for this. I tried using various approaches with RETAIN and LAG() but got tripped up by the nuances of using them in conditional IF Statements. I never would have gotten to this approach. I'll test it on the target dataset and let you know how it goes.

Yes, there are instances were three (or more) consecutive records might meet the time difference criteria. I will probably want/have to deal with that eventually but thought I'd start with this simpler case. I'm interested any ideas you might have on how to deal with that more complex problem.

Thanks again for the prompt response.

Regards,

Gene

PGStats · Posted 05-29-2020 11:53 AM

If the data2 and seconds2 of an observation can be the data1 and seconds1 of the next observation when there are more than 2 consecutive obs within a second (such as the sequence JJJ-MMM below), then the logic for creating the pairs becomes a lot simpler:

data want;
retain;
set arbitrary;
if seconds - 1 <= seconds1 then output;
data1 = data;
seconds1 = seconds;
rename data=data2 seconds=seconds2;
run;

data1 	seconds1 data2 	seconds2
AAA 	111.1 	BBB 	112.0
DDD 	117.2 	EEE 	118.0
FFF 	120.3 	GGG 	121.0
JJJ 	130.0 	KKK 	130.2
KKK 	130.2 	LLL 	130.5
LLL 	130.5 	MMM 	130.9

PG

genemroz · Posted 05-29-2020 09:18 PM

My sincere thanks to all who responded to my query. This is such an amazing community, so quick to help. I chose this solution among the many because it opened my eyes to the power of SQL.

Thanks to all,

Gene

PGStats · Posted 05-28-2020 12:04 PM

Read the data only once with RETAIN + some logic:

data want;
retain;
set arbitrary;
if missing(seconds1) then do;
    data1 = data;
    seconds1 = seconds;
    end;
else do;
    if seconds - seconds1 <= 1 then do;
        output;
        call missing(data1, seconds1);
        end;
    else do;
        data1 = data;
        seconds1 = seconds;
        end;
    end;
rename data=data2 seconds=seconds2;
run;

PG

MCoopmans · Posted 05-28-2020 12:08 PM

Here is another one :


data a2;
  set arbitrary;
  length old_data $3;
  result= catx(" ",old_data, data);
  diff=seconds - lag(seconds);
  if diff le 1 and _n_ gt 1 then output;
  old_data = data;
  retain old_data ;
 drop old_data data seconds;
run;

Greetings,

Mathias.

JMS · Posted 05-29-2020 06:34 AM

data have;                    
  input Data $ Seconds;        
  datalines;                    
AAA 111.1                    
BBB 112.0                    
CCC 115.0                    
DDD 117.2                    
EEE 118.0                    
FFF 120.3                    
GGG 121.0                    
HHH 123.0                    
JJJ 130.0                    
;                            
run;                          
proc transpose data=have;    
 id data;                    
run;                          
                             
proc print;                  
run;

_NAME_ AAA BBB CCC DDD EEE FFF GGG HHH JJJ
Seconds 111.1 112 115 117.2 118 120.3 121 123 130

alex_a · Posted 05-29-2020 04:10 AM

This would solve multiple occurrences of less than one second. However, if there are multiple sequences of less than one second, you will get one line with all the data that are less than one second (maybe you should set a reference point?). I enriched a little your example, just to make sure you can get all of the values.

data arbitrary;
input Data $ Seconds;
datalines;
AAa 108.0
AAA 111.1
AAB 111.2
BBB 112.0
CBC 112.1
CBD 112.2
CBE 112.7
CCC 115.0
DDD 117.2
EEE 118.0
EEF 118.1
FFF 120.3
GGG 121.0
HHH 123.0
JJJ 130.0
;;;
run;

proc sort data = work.arbitrary;
by seconds data;
run;

data work.new_arbitrary;
set work.arbitrary;
retain grp 0;
lag_seconds = lag(seconds);
if ~missing(lag_seconds) then diff_seconds = seconds - lag_seconds;
else diff_seconds = .;
if diff_seconds < 1 then grp = grp;
else grp = grp + 1;
run;

proc sort data = work.new_arbitrary;
by grp seconds data;
run;

proc transpose data = work.new_arbitrary out = work.new_data prefix = data;
by grp;
var data;
run;

proc transpose data = work.new_arbitrary out = work.new_seconds prefix = seconds;
by grp;
var seconds;
run;

data work.final (where = (~missing(data2)) drop = grp);
merge work.new_data (in = a drop = _name_) work.new_seconds (in = b drop = _name_);
by grp;
if a and b;
run;

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!