Hi Paige and everyone, sorry, I do not know how to reply in a way that everyone can see. Essentially, I want to identify the users who stay at one firm for less than one year (my data is career history data). This is the code I am using:
data user10_ind0_tag;
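set user10_ind0;
/* Convert the date strings to SAS date values */
start_date_dt = input(startdate, yymmdd10.);
end_date_dt = input(enddate, yymmdd10.);
/* Jobs with no end date are treated as ending on 2022-12-31 */
enddate_imputed = coalesce(end_date_dt, '31DEC2022'd);
duration_in_days = intck('DAY', start_date_dt, enddate_imputed);
miss_end = missing(enddate);
miss_start = missing(startdate);
/* In SAS a missing value compares as less than any number, so only flag rows with a known duration */
if not missing(duration_in_days) then turnover = (duration_in_days < 365);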
run;

I have 16 files of about 40 GB each. Each has around 82,924,025 rows and the following columns: uid, pid, company name raw, company url, company cleaned name, company priname, company name, ultimate company name (of the various company names, I will keep only the raw name, cleaned name, and ultimate company name), location raw, region, country, state, mas, startdate, enddate, jobtitle raw, mapped_role, job category, role_k150, role_k500, rolek1000, code1, code 2 (used to link to external data), ticker, wexchange, naics, naics_desc, rcid, frcid, senority, rn, salary.

I would like to keep the raw variables as well, to verify the accuracy of the data. Sometimes a worker writes their employer as "independent company", meaning they are self-employed, but the data provider assigns them to a company called Independent Inc. I keep the raw values in order to identify such cases.

I am trying to cut the data into smaller pieces because the other users on the server are taking 70%-98% of the memory. Sometimes it feels like my code is not processing because my data cannot be read into the remaining memory, so I am trying to make the files smaller. I appreciate all your further help.
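
To make the "smaller pieces" idea concrete, here is a rough, untested sketch of what I have in mind: reading only the columns needed for the turnover flag with the KEEP= data set option, so each row carries less data. The variable names company_raw, company_cleaned and ultimate_company are placeholders I made up for the company name columns listed above, and I am assuming startdate and enddate are stored as yyyy-mm-dd text.

```sas
/* Sketch only: company_raw, company_cleaned and ultimate_company are
   hypothetical names standing in for my actual company name columns. */
data user10_ind0_tag;
    /* KEEP= on the input data set reads only the listed variables */
    set user10_ind0 (keep=uid company_raw company_cleaned ultimate_company
                          startdate enddate);

    /* Assumes the dates are text in yyyy-mm-dd form */
    start_date_dt = input(startdate, yymmdd10.);
    end_date_dt   = input(enddate, yymmdd10.);

    /* Open-ended jobs are treated as ending on 2022-12-31 */
    enddate_imputed  = coalesce(end_date_dt, '31DEC2022'd);
    duration_in_days = intck('DAY', start_date_dt, enddate_imputed);

    miss_end   = missing(enddate);
    miss_start = missing(startdate);

    /* Leave turnover missing when the duration is unknown */
    if not missing(duration_in_days) then turnover = (duration_in_days < 365);

    /* Show the numeric dates in a readable form */
    format start_date_dt end_date_dt enddate_imputed yymmdd10.;
run;
```

Because a DATA step reads one row at a time, I am not sure memory is really what is holding my job back, but dropping the unused columns should at least reduce how much has to be read and written for each file.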