BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
imsenny
Calcite | Level 5

I have posted this question before.. but need a bit more help with it.

 

I have two large datasets -

1. air concentration data, which includes datetimestamp and air concentrations

2. flight acitivity data, which includes datetimestamp, flight ID, latitude, longitude, altitude.

 

I need to merge these 2 datasets by datetimestamp, but have to do it for each flight ID.

 

For example, if I have 1000 time points in air concentration dataset, I need to merge it with flight ID=1 for all 1000 time points, and need to keep all the missings as well. In other words, if the flight activity data only has 500 time points, the outputted dataset should still have 1000 time points after merging. 

 

In the end, I want just one large merged dataset including all flight IDs. In other words, if I have 100 flight IDs in the dataset, and have 1000 time points, then I want my final dataset to have 100,000 observations. 

 

Can anyone help me with this -- not sure if I should use macro vs. array vs. hash table either...

 

Any help will be much much appreciated!!!

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

@Ksharp Same solution at the same time! 🙂 Except I keep the final merge faster by keeping concentration data in the select unique TIME clause to avoid one left join.

View solution in original post

9 REPLIES 9
LinusH
Tourmaline | Level 20
Sounds like a SQL LEFT JOIN to me.
But before you go any further with advanced programming techniques you may want get some training. The techniques you sugest in the title are somewhat totally different.
Data never sleeps
ChrisNZ
Tourmaline | Level 20

The easiest and fastest is probably to use SQL.

 

You can do this in 2 easy steps: ensure all flights match all times, then get the flight information if the time exists for this flight.

 

This works provided unique time and flight values in each table, which you imply.

 

data AIR_CONCENTRATION;
  do TIME=1 to 1e3;
    CONCENTATION=ranuni(0);
    output;
  end;
run;
data FLIGHT_ACTIVITY;
  do TIME=1 to 1e3 by 2;
    FLIGHT_ID=int(TIME/10);
    ALTITUDE=3e4+ranuni(0)*1e3;
    output;
  end;
run;

proc sql;
  create table CARTESIAN as
    select a.*, f.FLIGHT_ID 
    from AIR_CONCENTRATION  a
        ,FLIGHT_ACTIVITY    f  ;

  create table FINAL as  
    select c.*
         , f.ALTITUDE
    from CARTESIAN        c
      left join
         FLIGHT_ACTIVITY  f
      on  c.TIME     =f.TIME 
      and c.FLIGHT_ID=f.FLIGHT_ID ;
quit;
      
Ksharp
Super User

You don't need to split the table. SQL can do that. But if you have a big table , I suggest to use Hash Table.

 

data air;
input datetimestamp : time9. concentrations;
format  datetimestamp  time9.;
cards;
02:30:31 3
03:30:31 2
04:30:31 1
05:30:31 4
;
run;
data flight;
input flightID  datetimestamp : time9. latitude longitude ;
format  datetimestamp  time9.;
cards;
1 03:30:31 2 4
1 05:30:31 2 4
2 02:30:31 2 4
2 04:30:31 2 4
;
run;

proc sql;
 create table want as
  select a.*,b.concentrations,c.latitude,c.longitude 
   from 
    (select * from
     (select distinct flightID from flight),
     (select distinct datetimestamp from air) ) as a 
   left join air as b on a.datetimestamp=b.datetimestamp 
   left join flight as c on a.flightID=c.flightID and a.datetimestamp=c.datetimestamp;
quit;
   
   
   
ChrisNZ
Tourmaline | Level 20

@Ksharp Same solution at the same time! 🙂 Except I keep the final merge faster by keeping concentration data in the select unique TIME clause to avoid one left join.

imsenny
Calcite | Level 5

@Ksharp this is pretty straight forward, and I actually understand the coding..

But, when I ran this, I only got one observation for each flight ID.

Also, the timestamps for the air concentration dataset and the flight activity dataset don't always match -- air concentration dataset has more complete timestamps (almost every single second), and the flight acitivity dataset definitely has less time points... So I basically want to keep all the timepoints in the air conc dataset, and merge with the flight acitivity dataset... and I'm ok with having missings for those time points where I do not have flight activity data...

ChrisNZ
Tourmaline | Level 20

if you run the code provided you'll see you get repeated flight IDs.

So you must have used the provided code wrongly.

imsenny
Calcite | Level 5

ur right- I fixed one thing and it worked like a miracle-

you guys are SAS-geniuses!!!

 

thanks so much 🙂 

imsenny
Calcite | Level 5

This code works on a smaller scale, but SAS refuses to run it with my larger datasets...

 

I'm trying to find a way to make this process a bit more efficient, since I will be throwing out a lot of observations after I do the interpolation... Because I will only be using those observations that meet a certain condition for my final analysis.

 

For each flightID, I just want to do the merging of the two datasets for the time range of the first observation to the last observation for that flightID.

 

For example, if flightID=2 has the first latitude observation at 10:30:29AM and last observation at 10:35:30, then I only want to merge the two datasets for that time range...... 

 

 

 

ChrisNZ
Tourmaline | Level 20

This will reduce the join size significantly, and only requires a small modification:

 


proc sql;
  create table CARTESIAN as
    select a.*, f.FLIGHT_ID 
    from AIR_CONCENTRATION      a
        ,(select FLIGHT_ID, min(TIME) as MIN_TIME , max(TIME) as MAX_TIME 
         from FLIGHT_ACTIVITY 
         group by FLIGHT_ID)    f  
    where MIN_TIME <= TIME <= MAX_TIME;

  create table FINAL as  
    select c.*
         , f.ALTITUDE
    from CARTESIAN        c
      left join
         FLIGHT_ACTIVITY  f
      on  c.TIME     =f.TIME 
      and c.FLIGHT_ID=f.FLIGHT_ID ;
quit;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1785 views
  • 3 likes
  • 4 in conversation