05-22-2016 04:37 PM
I have posted this question before.. but need a bit more help with it.
I have two large datasets -
1. air concentration data, which includes datetimestamp and air concentrations
2. flight acitivity data, which includes datetimestamp, flight ID, latitude, longitude, altitude.
I need to merge these 2 datasets by datetimestamp, but have to do it for each flight ID.
For example, if I have 1000 time points in air concentration dataset, I need to merge it with flight ID=1 for all 1000 time points, and need to keep all the missings as well. In other words, if the flight activity data only has 500 time points, the outputted dataset should still have 1000 time points after merging.
In the end, I want just one large merged dataset including all flight IDs. In other words, if I have 100 flight IDs in the dataset, and have 1000 time points, then I want my final dataset to have 100,000 observations.
Can anyone help me with this -- not sure if I should use macro vs. array vs. hash table either...
Any help will be much much appreciated!!!
05-22-2016 10:20 PM
05-22-2016 04:51 PM
05-22-2016 10:14 PM
The easiest and fastest is probably to use SQL.
You can do this in 2 easy steps: ensure all flights match all times, then get the flight information if the time exists for this flight.
This works provided unique time and flight values in each table, which you imply.
data AIR_CONCENTRATION; do TIME=1 to 1e3; CONCENTATION=ranuni(0); output; end; run; data FLIGHT_ACTIVITY; do TIME=1 to 1e3 by 2; FLIGHT_ID=int(TIME/10); ALTITUDE=3e4+ranuni(0)*1e3; output; end; run; proc sql; create table CARTESIAN as select a.*, f.FLIGHT_ID from AIR_CONCENTRATION a ,FLIGHT_ACTIVITY f ; create table FINAL as select c.* , f.ALTITUDE from CARTESIAN c left join FLIGHT_ACTIVITY f on c.TIME =f.TIME and c.FLIGHT_ID=f.FLIGHT_ID ; quit;
05-22-2016 10:17 PM
You don't need to split the table. SQL can do that. But if you have a big table , I suggest to use Hash Table.
data air; input datetimestamp : time9. concentrations; format datetimestamp time9.; cards; 02:30:31 3 03:30:31 2 04:30:31 1 05:30:31 4 ; run; data flight; input flightID datetimestamp : time9. latitude longitude ; format datetimestamp time9.; cards; 1 03:30:31 2 4 1 05:30:31 2 4 2 02:30:31 2 4 2 04:30:31 2 4 ; run; proc sql; create table want as select a.*,b.concentrations,c.latitude,c.longitude from (select * from (select distinct flightID from flight), (select distinct datetimestamp from air) ) as a left join air as b on a.datetimestamp=b.datetimestamp left join flight as c on a.flightID=c.flightID and a.datetimestamp=c.datetimestamp; quit;
05-22-2016 10:20 PM
05-22-2016 11:34 PM - edited 05-22-2016 11:37 PM
@Ksharp this is pretty straight forward, and I actually understand the coding..
But, when I ran this, I only got one observation for each flight ID.
Also, the timestamps for the air concentration dataset and the flight activity dataset don't always match -- air concentration dataset has more complete timestamps (almost every single second), and the flight acitivity dataset definitely has less time points... So I basically want to keep all the timepoints in the air conc dataset, and merge with the flight acitivity dataset... and I'm ok with having missings for those time points where I do not have flight activity data...
06-01-2016 01:08 PM
This code works on a smaller scale, but SAS refuses to run it with my larger datasets...
I'm trying to find a way to make this process a bit more efficient, since I will be throwing out a lot of observations after I do the interpolation... Because I will only be using those observations that meet a certain condition for my final analysis.
For each flightID, I just want to do the merging of the two datasets for the time range of the first observation to the last observation for that flightID.
For example, if flightID=2 has the first latitude observation at 10:30:29AM and last observation at 10:35:30, then I only want to merge the two datasets for that time range......
06-01-2016 09:39 PM - edited 06-01-2016 09:40 PM
This will reduce the join size significantly, and only requires a small modification:
proc sql; create table CARTESIAN as select a.*, f.FLIGHT_ID from AIR_CONCENTRATION a ,(select FLIGHT_ID, min(TIME) as MIN_TIME , max(TIME) as MAX_TIME from FLIGHT_ACTIVITY group by FLIGHT_ID) f where MIN_TIME <= TIME <= MAX_TIME; create table FINAL as select c.* , f.ALTITUDE from CARTESIAN c left join FLIGHT_ACTIVITY f on c.TIME =f.TIME and c.FLIGHT_ID=f.FLIGHT_ID ; quit;
Need further help from the community? Please ask a new question.