macro vs. array vs. hash table for data merging

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 15
Accepted Solution

macro vs. array vs. hash table for data merging

I have posted this question before.. but need a bit more help with it.

 

I have two large datasets -

1. air concentration data, which includes datetimestamp and air concentrations

2. flight acitivity data, which includes datetimestamp, flight ID, latitude, longitude, altitude.

 

I need to merge these 2 datasets by datetimestamp, but have to do it for each flight ID.

 

For example, if I have 1000 time points in air concentration dataset, I need to merge it with flight ID=1 for all 1000 time points, and need to keep all the missings as well. In other words, if the flight activity data only has 500 time points, the outputted dataset should still have 1000 time points after merging. 

 

In the end, I want just one large merged dataset including all flight IDs. In other words, if I have 100 flight IDs in the dataset, and have 1000 time points, then I want my final dataset to have 100,000 observations. 

 

Can anyone help me with this -- not sure if I should use macro vs. array vs. hash table either...

 

Any help will be much much appreciated!!!

 

 


Accepted Solutions
Solution
‎05-23-2016 10:29 AM
PROC Star
Posts: 1,760

Re: macro vs. array vs. hash table for data merging

@Ksharp Same solution at the same time! Smiley Happy Except I keep the final merge faster by keeping concentration data in the select unique TIME clause to avoid one left join.

View solution in original post


All Replies
Super User
Posts: 5,437

Re: macro vs. array vs. hash table for data merging

Sounds like a SQL LEFT JOIN to me.
But before you go any further with advanced programming techniques you may want get some training. The techniques you sugest in the title are somewhat totally different.
Data never sleeps
PROC Star
Posts: 1,760

Re: macro vs. array vs. hash table for data merging

The easiest and fastest is probably to use SQL.

 

You can do this in 2 easy steps: ensure all flights match all times, then get the flight information if the time exists for this flight.

 

This works provided unique time and flight values in each table, which you imply.

 

data AIR_CONCENTRATION;
  do TIME=1 to 1e3;
    CONCENTATION=ranuni(0);
    output;
  end;
run;
data FLIGHT_ACTIVITY;
  do TIME=1 to 1e3 by 2;
    FLIGHT_ID=int(TIME/10);
    ALTITUDE=3e4+ranuni(0)*1e3;
    output;
  end;
run;

proc sql;
  create table CARTESIAN as
    select a.*, f.FLIGHT_ID 
    from AIR_CONCENTRATION  a
        ,FLIGHT_ACTIVITY    f  ;

  create table FINAL as  
    select c.*
         , f.ALTITUDE
    from CARTESIAN        c
      left join
         FLIGHT_ACTIVITY  f
      on  c.TIME     =f.TIME 
      and c.FLIGHT_ID=f.FLIGHT_ID ;
quit;
      
Super User
Posts: 10,044

Re: macro vs. array vs. hash table for data merging

You don't need to split the table. SQL can do that. But if you have a big table , I suggest to use Hash Table.

 

data air;
input datetimestamp : time9. concentrations;
format  datetimestamp  time9.;
cards;
02:30:31 3
03:30:31 2
04:30:31 1
05:30:31 4
;
run;
data flight;
input flightID  datetimestamp : time9. latitude longitude ;
format  datetimestamp  time9.;
cards;
1 03:30:31 2 4
1 05:30:31 2 4
2 02:30:31 2 4
2 04:30:31 2 4
;
run;

proc sql;
 create table want as
  select a.*,b.concentrations,c.latitude,c.longitude 
   from 
    (select * from
     (select distinct flightID from flight),
     (select distinct datetimestamp from air) ) as a 
   left join air as b on a.datetimestamp=b.datetimestamp 
   left join flight as c on a.flightID=c.flightID and a.datetimestamp=c.datetimestamp;
quit;
   
   
   
Solution
‎05-23-2016 10:29 AM
PROC Star
Posts: 1,760

Re: macro vs. array vs. hash table for data merging

@Ksharp Same solution at the same time! Smiley Happy Except I keep the final merge faster by keeping concentration data in the select unique TIME clause to avoid one left join.

Occasional Contributor
Posts: 15

Re: macro vs. array vs. hash table for data merging

[ Edited ]

@Ksharp this is pretty straight forward, and I actually understand the coding..

But, when I ran this, I only got one observation for each flight ID.

Also, the timestamps for the air concentration dataset and the flight activity dataset don't always match -- air concentration dataset has more complete timestamps (almost every single second), and the flight acitivity dataset definitely has less time points... So I basically want to keep all the timepoints in the air conc dataset, and merge with the flight acitivity dataset... and I'm ok with having missings for those time points where I do not have flight activity data...

PROC Star
Posts: 1,760

Re: macro vs. array vs. hash table for data merging

if you run the code provided you'll see you get repeated flight IDs.

So you must have used the provided code wrongly.

Occasional Contributor
Posts: 15

Re: macro vs. array vs. hash table for data merging

ur right- I fixed one thing and it worked like a miracle-

you guys are SAS-geniuses!!!

 

thanks so much Smiley Happy 

Occasional Contributor
Posts: 15

Re: macro vs. array vs. hash table for data merging

This code works on a smaller scale, but SAS refuses to run it with my larger datasets...

 

I'm trying to find a way to make this process a bit more efficient, since I will be throwing out a lot of observations after I do the interpolation... Because I will only be using those observations that meet a certain condition for my final analysis.

 

For each flightID, I just want to do the merging of the two datasets for the time range of the first observation to the last observation for that flightID.

 

For example, if flightID=2 has the first latitude observation at 10:30:29AM and last observation at 10:35:30, then I only want to merge the two datasets for that time range...... 

 

 

 

PROC Star
Posts: 1,760

Re: macro vs. array vs. hash table for data merging

[ Edited ]

This will reduce the join size significantly, and only requires a small modification:

 


proc sql;
  create table CARTESIAN as
    select a.*, f.FLIGHT_ID 
    from AIR_CONCENTRATION      a
        ,(select FLIGHT_ID, min(TIME) as MIN_TIME , max(TIME) as MAX_TIME 
         from FLIGHT_ACTIVITY 
         group by FLIGHT_ID)    f  
    where MIN_TIME <= TIME <= MAX_TIME;

  create table FINAL as  
    select c.*
         , f.ALTITUDE
    from CARTESIAN        c
      left join
         FLIGHT_ACTIVITY  f
      on  c.TIME     =f.TIME 
      and c.FLIGHT_ID=f.FLIGHT_ID ;
quit;
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 9 replies
  • 802 views
  • 3 likes
  • 4 in conversation