Effects of Express Entry Application Features on Immigration Case Processing Time

by Mila Semenova

Investigation Overview

The subject of this investigation is a dataset of Canadian Express Entry permanent residence applications from MyImmigrationTracker, a collaborative community web service that lets immigration applicants enter, track, and analyze the progress of their applications.

My main goal is to find out which application case features affect processing time.

Dataset Overview

The dataset consists of 10,762 applications finalised from April 2015 to August 2019. Each application case is described by 19 features. I selected the following for further analysis:

  • the name of the applicant's immigration program (stream)
  • the number of applicant's family members (dependents)
  • the applicant's countries of nationality and residence (nationality and residence)
  • the office where the application case was processed (visa_office)
  • whether the case was approved (PPR) or refused (Refused) (final_decision)
  • whether the case had optional processing stages like additional documents request and security screening
  • processing time in days
In [54]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import datetime
import calendar

%matplotlib inline

# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")
In [56]:
# load in the dataset into a pandas dataframe
cases = pd.read_csv('cases_master.csv')

# create dataframe for only complete cases
complete_cases = cases.query('final_decision != "None"' ).copy()

# convert variables to datetime
complete_cases.final_decision_date = pd.to_datetime(complete_cases.final_decision_date)
complete_cases.aor_date = pd.to_datetime(complete_cases.aor_date)

# create a new column processing_days (whole days between aor_date and final_decision_date)
complete_cases['processing_days'] = (complete_cases['final_decision_date'] - complete_cases['aor_date']).dt.days

# create two additional yes-no columns: additional documents request stage (add_docs_request)
# and security screening stage (security_screening); missing dates are stored as the string 'None'
complete_cases['add_docs_request'] = np.where(complete_cases['add_docs_request_date'] != 'None', 'Yes', 'No')
complete_cases['security_screening'] = np.where(complete_cases['ss_start_date'] != 'None', 'Yes', 'No')

# order the columns
complete_cases = complete_cases[['case', 'aor_date', 'stream', 'current_status', 'dependents', 'province',
                                'nationality', 'residence', 'visa_office', 'med_passed_date', 'add_docs_request',
                                'add_docs_request_date', 'rprf_paid_date', 'security_screening', 'ss_start_date',
                                'final_decision_date', 'final_decision', 'processing_days', 'state']]

# delete rows where processing_days values equal or less than 0
complete_cases = complete_cases.drop(complete_cases[complete_cases.processing_days <= 0].index)

# reset indices
complete_cases = complete_cases.reset_index(drop = True)

# convert columns to categorical
complete_cases.add_docs_request = complete_cases.add_docs_request.astype('category')
complete_cases.security_screening = complete_cases.security_screening.astype('category')
complete_cases.dependents = complete_cases.dependents.astype('category')
complete_cases.stream = complete_cases.stream.astype('category')
complete_cases.province = complete_cases.province.astype('category')
complete_cases.final_decision = complete_cases.final_decision.astype('category')
complete_cases.current_status = complete_cases.current_status.astype('category')
complete_cases.state = complete_cases.state.astype('category')

Overview of Analyzed Case Features

In [8]:
# application cases by final decisions

all_cases = len(complete_cases)
ppr_cases = len(complete_cases.query('final_decision == "PPR"'))
ref_cases = len(complete_cases.query('final_decision == "Refused"'))

print('\nThere are {} complete application cases.\n'.format(all_cases))
print('{} ({:0.1f}%) of them were approved.'.format(ppr_cases, 100 * ppr_cases / all_cases))
print('Only {} ({:0.1f}%) of them were refused.'.format(ref_cases, 100 * ref_cases / all_cases))

plt.figure(figsize = [5, 5])
y_ticks = [0, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 11000, 12000]

sb.countplot(data = complete_cases, x = 'final_decision', color = 'steelblue')

categories = complete_cases['final_decision'].value_counts()
locs, labels = plt.xticks()

for loc, label in zip(locs, labels):
    percentage = '{:0.1f}%'.format(100 * categories[label.get_text()] / len(complete_cases))
    plt.text(loc, categories[label.get_text()] + 300, percentage, ha = 'center', va = 'baseline')

plt.title('\nDistribution of Final Decision\n')
plt.xlabel('\nFinal Decision')
plt.ylabel('Number of Application Cases\n')
plt.yticks(y_ticks)
plt.show();
There are 10762 complete application cases.

10624 (98.7%) of them were approved.
Only 138 (1.3%) of them were refused.
In [57]:
# distribution of additional documents request stage

ad_cases = len(complete_cases.query('add_docs_request == "Yes"'))
no_ad_cases = len(complete_cases.query('add_docs_request == "No"'))

print('\n{} ({:0.1f}%) cases had additional documents request.'.format(ad_cases, 100 * ad_cases/all_cases))
print('{} ({:0.1f}%) cases did not have additional documents request.'.format(no_ad_cases, 100*no_ad_cases/all_cases))

plt.figure(figsize = [5, 5])
y_ticks = [0, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000]
ad_order = complete_cases['add_docs_request'].value_counts(ascending = True).index
sb.countplot(data = complete_cases, x = 'add_docs_request', color = 'steelblue', order = ad_order)

categories = complete_cases['add_docs_request'].value_counts()
locs, labels = plt.xticks()

for loc, label in zip(locs, labels):
    percentage = '{:0.1f}%'.format(100 * categories[label.get_text()] / len(complete_cases))
    plt.text(loc, categories[label.get_text()] + 300, percentage, ha = 'center', va = 'baseline')

plt.title('\nDistribution of Additional Documents Request Stage\n')
plt.xlabel('\nPresence of Additional Documents Request Stage')
plt.ylabel('Number of Application Cases\n')
plt.yticks(y_ticks)
plt.ylim(0,9500)
plt.show();
2325 (21.6%) cases had additional documents request.
8437 (78.4%) cases did not have additional documents request.
In [58]:
# distribution of security screening stage
ss_cases = len(complete_cases.query('security_screening == "Yes"'))
no_ss_cases = len(complete_cases.query('security_screening == "No"'))

print('\n{} ({:0.1f}%) cases had security screening.'.format(ss_cases, 100 * ss_cases / all_cases))
print('{} ({:0.1f}%) cases did not have security screening.'.format(no_ss_cases, 100 * no_ss_cases / all_cases))

plt.figure(figsize = [5, 5])
y_ticks = [0, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 11000, 12000]
ss_order = complete_cases['security_screening'].value_counts(ascending = True).index
sb.countplot(data = complete_cases, x = 'security_screening', color = 'steelblue', order = ss_order)

categories = complete_cases['security_screening'].value_counts()
locs, labels = plt.xticks()

for loc, label in zip(locs, labels):
    percentage = '{:0.1f}%'.format(100 * categories[label.get_text()] / len(complete_cases))
    plt.text(loc, categories[label.get_text()] + 300, percentage, ha = 'center', va = 'baseline')

plt.title('\nDistribution of Security Screening Stage\n')
plt.xlabel('\nPresence of Security Screening Stage')
plt.ylabel('Number of Application Cases\n')
plt.yticks(y_ticks)
plt.show();
202 (1.9%) cases had security screening.
10560 (98.1%) cases did not have security screening.
  • More than 60% of applicants applied under the FSW streams (outland and inland);
  • 28.1% applied under the CEC stream;
  • only 16.5% had provincial nominations.
In [11]:
# distribution of streams 
plt.figure(figsize = [10, 5])
y_ticks = [0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500, 6000, 6500, 7000]
stream_order = ['FSW-Outland', 'FSW-Inland', 'CEC', 'PNP-Outland', 'PNP-Inland']
sb.countplot(data = complete_cases, x = 'stream', color = 'steelblue', order = stream_order)

categories = complete_cases['stream'].value_counts()
locs, labels = plt.xticks()

for loc, label in zip(locs, labels):
    percentage = '{:0.1f}%'.format(100 * categories[label.get_text()] / len(complete_cases))
    plt.text(loc, categories[label.get_text()] + 300, percentage, ha = 'center', va = 'baseline')


plt.title('\nDistribution of Express Entry Streams\n')
plt.xlabel('\nExpress Entry Streams')
plt.ylabel('Number of Application Cases\n')
plt.yticks(y_ticks)
plt.show();

Most applicants (75.2%) were single or had a small family (only a spouse, or a spouse and one child).

In [12]:
# distribution of number of dependents
plt.figure(figsize = [10, 5])
y_ticks = [0, 200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600, 2800, 3000, 3200, 3400]

sb.countplot(data = complete_cases, x = 'dependents', color = 'steelblue')

categories = complete_cases['dependents'].value_counts()
locs, labels = plt.xticks()

for loc, label in zip(locs, labels):
    percentage = '{:0.2f}%'.format(100 * categories[label.get_text()] / len(complete_cases))
    plt.text(loc, categories[label.get_text()] + 100, percentage, ha = 'center', va = 'baseline')


plt.title('\nDistribution of Number of Dependents\n')
plt.xlabel('\nNumbers of Dependents')
plt.ylabel('Number of Application Cases\n')
plt.yticks(y_ticks)
plt.show();

Almost half of applicants (48.3%) had Indian citizenship.

In [13]:
# create a new dataframe with countries of nationality and their number of applications
countries_nationality = complete_cases.groupby([
    'nationality']).size().reset_index().rename(columns = {0: 'case_count_n'}).sort_values(by = [
    'case_count_n'], ascending = False).reset_index(drop = True)

# keep countries represented in at least 50 applications
popular_countries_n = countries_nationality[countries_nationality.case_count_n >= 50]

# count the applications from countries represented in fewer than 50 applications
other_countries_n = pd.DataFrame([[
    'Other', countries_nationality[countries_nationality.case_count_n < 50].sum(axis = 0)[
        'case_count_n']]], columns = ['nationality_n','case_count_n'])


# create a copy of the complete_cases dataframe and replace the names of countries represented in fewer than
# 50 applications with Other
countries_cases_n = complete_cases.copy()

for index, row in countries_cases_n.iterrows():
    if row.nationality in popular_countries_n.nationality.tolist():
        pass
    else:
        countries_cases_n.loc[index, 'nationality'] = 'Other'
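
The row-by-row loop above can also be written as a single vectorized step. The following is a minimal sketch, not the code used in this analysis: the helper name `lump_rare` and the toy data are my own illustration, and it assumes the column holds plain strings (not a pandas categorical).

```python
import pandas as pd


def lump_rare(values: pd.Series, min_count: int, other_label: str = "Other") -> pd.Series:
    """Replace values that occur fewer than min_count times with other_label."""
    counts = values.value_counts()
    frequent = counts[counts >= min_count].index
    # keep frequent values, overwrite everything else
    return values.where(values.isin(frequent), other_label)


# toy stand-in for the nationality column of complete_cases
toy = pd.DataFrame({"nationality": ["India"] * 3 + ["China"] * 2 + ["Peru", "Chile"]})
toy["nationality"] = lump_rare(toy["nationality"], min_count=2)
print(toy["nationality"].tolist())
```

With the real data, the equivalent call would presumably be `lump_rare(complete_cases['nationality'], min_count=50)`, which avoids iterating over more than ten thousand rows one at a time.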
In [14]:
# distribution of the countries of nationality
plt.figure(figsize = [10, 8])
x_ticks = [0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500]

nationality_order = countries_cases_n.nationality.value_counts().index
sb.countplot(data = countries_cases_n, y = 'nationality', color = 'steelblue', order = nationality_order)
plt.title('\nDistribution of Countries of Nationality\n')
plt.xlabel('\nNumber of Application Cases\n')
plt.ylabel('Countries\n')
plt.xticks(x_ticks)
plt.show();
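
The next cell plots a dataframe named `nationalities`, but the cell that builds it is not shown here. A plausible reconstruction, assuming it mirrors the `residences` dataframe built later (India and Unspecified kept, every other country relabelled Other), might look like the sketch below. The function name `three_way_nationality` and the toy rows are hypothetical.

```python
import pandas as pd


def three_way_nationality(cases: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose nationality column is India, Unspecified or Other."""
    grouped = cases.copy()
    keep = grouped["nationality"].isin(["India", "Unspecified"])
    grouped["nationality"] = grouped["nationality"].where(keep, "Other")
    return grouped


# toy stand-in for complete_cases
toy_cases = pd.DataFrame(
    {"nationality": ["India", "India", "Nigeria", "Unspecified", "China"]}
)
toy_nationalities = three_way_nationality(toy_cases)
print(toy_nationalities["nationality"].value_counts())
```

In the notebook itself this would be called as `nationalities = three_way_nationality(complete_cases)` before plotting.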
In [16]:
# distribution of the countries of nationality
plt.figure(figsize = [5, 5])
y_ticks = [0, 1000, 2000, 3000, 4000, 5000, 6000]

nationalities_order = ['India', 'Other', 'Unspecified']
sb.countplot(data = nationalities, x = 'nationality', color = 'steelblue', order = nationalities_order)

categories = nationalities['nationality'].value_counts()
locs, labels = plt.xticks()

for loc, label in zip(locs, labels):
    percentage = '{:0.1f}%'.format(100 * categories[label.get_text()] / len(nationalities))
    plt.text(loc, categories[label.get_text()] + 300, percentage, ha = 'center', va = 'baseline')

plt.title('\nDistribution of Countries of Nationality\n')
plt.xlabel('\nCountries of Nationality')
plt.ylabel('Number of Application Cases\n')
plt.yticks(y_ticks)
plt.ylim(0,6000)
plt.show();

26% of future immigrants applied for immigration programs from inside Canada.

In [17]:
# create a new dataframe with countries of residence and their number of applications
countries_residence = complete_cases.groupby([
    'residence']).size().reset_index().rename(columns = {0: 'case_count_r'}).sort_values(by = [
    'case_count_r'], ascending = False).reset_index(drop = True)

# keep countries represented in at least 50 applications
popular_countries_r = countries_residence[countries_residence.case_count_r >= 50]

# count the applications from countries represented in fewer than 50 applications
other_countries_r = pd.DataFrame([[
    'Other', countries_residence[countries_residence.case_count_r < 50].sum(axis = 0)[
        'case_count_r']]], columns = ['residence_r','case_count_r'])


# create a copy of the complete_cases dataframe and replace the names of countries represented in fewer than
# 50 applications with Other
countries_cases_r = complete_cases.copy()

for index, row in countries_cases_r.iterrows():
    if row.residence in popular_countries_r.residence.tolist():
        pass
    else:
        countries_cases_r.loc[index, 'residence'] = 'Other'
In [18]:
# distribution of countries of residence
plt.figure(figsize = [10, 8])
x_ticks = [0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 5500]

residence_order = countries_cases_r.residence.value_counts().index
sb.countplot(data = countries_cases_r, y = 'residence', color = 'steelblue', order = residence_order)
plt.title('\nDistribution of Countries of Residence\n')
plt.xlabel('\nNumber of Application Cases\n')
plt.ylabel('Countries\n')
plt.xticks(x_ticks)
plt.show();
In [19]:
# create dataframe with only three values of the residence column: Canada, Other and Unspecified
residences = complete_cases.copy()

for index, row in residences.iterrows():
    if row.residence == 'Unspecified' or row.residence == 'Canada':
        pass
    else:
        residences.loc[index, 'residence'] = 'Other'

# check the result 
residences.residence.value_counts()
Out[19]:
Other          7350
Canada         2805
Unspecified     607
Name: residence, dtype: int64
In [20]:
# distribution of the countries of residence
plt.figure(figsize = [5, 5])
y_ticks = [0, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000]

residences_order = ['Canada', 'Other', 'Unspecified']
sb.countplot(data = residences, x = 'residence', color = 'steelblue', order = residences_order)

categories = residences['residence'].value_counts()
locs, labels = plt.xticks()

for loc, label in zip(locs, labels):
    percentage = '{:0.1f}%'.format(100 * categories[label.get_text()] / len(residences))
    plt.text(loc, categories[label.get_text()] + 300, percentage, ha = 'center', va = 'baseline')

plt.title('\nDistribution of Countries of Residence\n')
plt.xlabel('\nCountries of Residence')
plt.ylabel('Number of Application Cases\n')
plt.yticks(y_ticks)
plt.ylim(0,8500)
plt.show();
  • 42% of the application cases with a specified visa office were processed in Canada, particularly in Ottawa.
  • 31.5% of applicants did not specify their visa office.
In [63]:
# distribution of the cities of visa offices
plt.figure(figsize = [10, 9])
x_ticks = [0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500]

visa_office_order = complete_cases.visa_office.value_counts().index
sb.countplot(data = complete_cases, y = 'visa_office', color = 'steelblue', order = visa_office_order)
plt.title('\nDistribution of Cities of Visa Offices\n')
plt.xlabel('\nNumber of Application Cases')
plt.ylabel('Cities of Visa Offices')
#plt.xticks(x_ticks)
plt.show();

Overview of Processing Time

Processing time from 2015 to the present has a right-skewed distribution with a long tail, which means that most application cases were finalised in a relatively short time and only a few took much longer.

In [60]:
# some basic statistics for processing time in days
print('- Maximum processing time in days for all complete cases: {} days.'.format(complete_cases.processing_days.max()))
print('- Minimum processing time in days for all complete cases: {} days.'.format(complete_cases.processing_days.min()))
print('- Average processing time in days for all complete cases: {:0.0f} days.'.format(complete_cases.processing_days.mean()))
print('- 25% application cases were finalised in {:0.0f} days.'.format(complete_cases.processing_days.quantile(0.25)))
print('- 50% application cases were finalised in {:0.0f} days.'.format(complete_cases.processing_days.quantile(0.50)))
print('- 75% application cases were finalised in {:0.0f} days.'.format(complete_cases.processing_days.quantile(0.75)))

# distribution of processing time in days
bin_edges = np.arange(0, complete_cases.processing_days.max() + 50, 50)

x_ticks = np.arange(0, complete_cases.processing_days.max() + 50, 50)
y_ticks = [0, 200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600, 2800, 3000]

plt.figure(figsize = [9, 5])
plt.hist(data = complete_cases, x = 'processing_days', bins = bin_edges, rwidth = 0.8, color = 'lightslategray')
plt.title('\nProcessing Time in Days\n')
plt.xlabel('\nDays')
plt.ylabel('Number of Applications\n')
plt.xticks(x_ticks)
plt.yticks(y_ticks)
plt.xlim(0, 850)
plt.show()
- Maximum processing time in days for all complete cases: 817 days.
- Minimum processing time in days for all complete cases: 7 days.
- Average processing time in days for all complete cases: 133 days.
- 25% application cases were finalised in 78 days.
- 50% application cases were finalised in 120 days.
- 75% application cases were finalised in 169 days.
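
The mean above (133 days) is larger than the median (120 days), which is what a right-skewed distribution should show. A small sketch of how that check could be made explicit follows; the helper `skew_summary` and the toy values are my own, and `Series.skew()` is the standard pandas sample skewness.

```python
import pandas as pd


def skew_summary(days: pd.Series) -> dict:
    """Collect the three numbers that indicate the direction of skew."""
    return {
        "mean": days.mean(),
        "median": days.median(),
        "skewness": days.skew(),  # positive for a long right tail
    }


# toy processing times with a few very long cases
toy_days = pd.Series([60, 75, 90, 110, 120, 130, 150, 170, 400, 800])
summary = skew_summary(toy_days)
print(summary)
```

On the real data the call would be `skew_summary(complete_cases.processing_days)`; a mean above the median together with positive skewness supports the long-right-tail reading of the histogram.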

To better understand the data, I would like to explore how the number of finalised cases changed over time.

Firstly, I will make a chart showing the number of application cases finalised in each year.

In [26]:
# create a new dataframe 
cases_full_date = complete_cases.copy()

# split date values into a year, a month and a day in final_decision_date column
# and create three new columns with these values
cases_full_date['final_decision_day'] = cases_full_date['final_decision_date'].dt.day
cases_full_date['final_decision_month'] = cases_full_date['final_decision_date'].dt.month
cases_full_date['final_decision_year'] = cases_full_date['final_decision_date'].dt.year

cases_full_date['final_decision_month'] = cases_full_date[
    'final_decision_month'].apply(lambda x: calendar.month_abbr[x])


cases_full_date = cases_full_date[['case', 'aor_date', 'stream', 'current_status', 'dependents', 'province',
                                  'nationality', 'residence', 'visa_office', 'med_passed_date', 'add_docs_request',
                                  'add_docs_request_date', 'rprf_paid_date', 'security_screening', 'ss_start_date',
                                  'final_decision_date', 'final_decision_year', 'final_decision_month',
                                  'final_decision_day', 'final_decision', 'processing_days', 'state']]
In [27]:
# numbers of finalised cases per year
fig, ax1 = plt.subplots(figsize = [5, 5])
y_ticks = [0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500]

cases_years = sb.countplot(ax = ax1, data = cases_full_date, x = 'final_decision_year', color = 'lightslategray')

cases_years.set_xticklabels(cases_years.get_xticklabels())
for p in cases_years.patches:
    height = p.get_height()
    cases_years.text(p.get_x() + p.get_width() / 2, height + 0.1, int(height), ha = 'center', va = 'bottom')

plt.title('\nNumber of Finalised Cases per Year\n')
plt.xlabel('\nYears')
plt.ylabel('Number of Application Cases\n')
plt.yticks(y_ticks)
plt.ylim(0, 5000)
plt.show();

The next chart shows the number of finalised cases per month in each year.

In [28]:
# numbers of finalised cases per months in each year
fig, ax1 = plt.subplots(figsize = [13, 5])
months_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
years_order = [2015, 2016, 2017, 2018, 2019]

# cases_months = 
cases_months = sb.countplot(data = cases_full_date, x = 'final_decision_month', hue =
             'final_decision_year', order = months_order, hue_order = years_order, color = 'lightslategray')

plt.title('\nNumber of Finalised Cases per Month\n')
plt.xlabel('\nMonths')
plt.ylabel('Number of Application Cases\n')
cases_months.get_legend().set_title('Years')
plt.show();

Exploration of Case Features That Can Affect Processing Time

On average, refused cases were finalised faster. Only a small share of applications were refused much later, at other stages of case processing.

In [29]:
# processing time in days for approved and refused applications
plt.figure(figsize = [10, 5])
x_ticks = [0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850]
sb.violinplot(data = complete_cases, x =
              'processing_days', y = 'final_decision', color = 'lightsteelblue', inner = 'quartile', linewidth = 1)



plt.title('\nApplication Case Processing Time and Final Decision\n')
plt.xlabel('\nDays')
plt.ylabel('Final Decision\n')
plt.xticks(x_ticks)
plt.show();

Cases with an additional documents request took slightly longer to process than cases without one.

In [30]:
# processing time in days for application with and without additional documents request stage
plt.figure(figsize = [10, 5])
x_ticks = [0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850]
sb.violinplot(data = complete_cases, x =
              'processing_days', y = 'add_docs_request', color = 'lightsteelblue', inner =
              'quartile', linewidth = 1)

plt.title('\nApplication Case Processing Time and Additional Documents Request Stage\n')
plt.xlabel('\nDays')
plt.ylabel('Additional Documents Request Stage\n')
plt.xticks(x_ticks)
plt.show();

The presence of a security screening stage lengthened processing time considerably.

In [31]:
# processing time in days for application with and without security screening stage
plt.figure(figsize = [10, 5])
x_ticks = [0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850]

sb.violinplot(data = complete_cases, x =
              'processing_days', y = 'security_screening', color = 'lightsteelblue',  linewidth = 1, inner =
              'quartile')

plt.title('\nApplication Case Processing Time and Security Screening Stage\n')
plt.xlabel('\nDays')
plt.ylabel('Security Screening Stage\n')
plt.xticks(x_ticks)
plt.show();
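
The violin plots show the shift visually; a numeric summary per group can back it up. This is a minimal sketch with a hypothetical helper `processing_summary` and made-up toy numbers, not figures from the dataset.

```python
import pandas as pd


def processing_summary(cases: pd.DataFrame, by: str) -> pd.DataFrame:
    """Count, median and mean processing time for each level of a feature."""
    return (
        cases.groupby(by, observed=True)["processing_days"]
        .agg(["count", "median", "mean"])
        .round(1)
    )


# toy stand-in for complete_cases
toy = pd.DataFrame(
    {
        "security_screening": ["No", "No", "No", "Yes", "Yes"],
        "processing_days": [100, 120, 140, 300, 400],
    }
)
summary = processing_summary(toy, "security_screening")
print(summary)
```

The same helper would work for `final_decision`, `add_docs_request` or `stream`. Because only about 2% of cases had security screening, the count column matters when judging how reliable the Yes row is.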

Processing time does not differ much between the FSW and CEC streams, while PNP applications took longer to process.

In [32]:
# processing time in days for applications in different Express Entry streams
plt.figure(figsize = [10, 10])
x_ticks = [0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850]
stream_order = ['FSW-Outland', 'FSW-Inland', 'CEC', 'PNP-Outland', 'PNP-Inland']

sb.violinplot(data = complete_cases, x =
              'processing_days', y = 'stream', color =
              'lightsteelblue',  linewidth = 1, order = stream_order, inner = 'quartile')

plt.title('\nApplication Case Processing Time and Express Entry Streams\n')
plt.xlabel('\nDays')
plt.ylabel('Express Entry Streams\n')
plt.xticks(x_ticks)
plt.show();

The number of dependents does not affect processing time except for really big families. However, only 0.03% of applicants in the dataset have 7 or more dependents, and further analysis allowed me to consider these three cases outliers.

In [33]:
# processing time in days for application with different numbers of dependents
plt.figure(figsize = [10, 10])
x_ticks = [0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850]
dependants_order = ['0', '1', '2', '3', '4', '5', '6', '7 and more']

sb.boxplot(data = complete_cases, x =
           'processing_days', y = 'dependents', color =
           'lightsteelblue',  linewidth = 1, order = dependants_order)

plt.title('\nApplication Case Processing Time and Number of Dependents\n')
plt.xlabel('\nDays')
plt.ylabel('Number of Dependents\n')
plt.xticks(x_ticks)
plt.show();

Citizenship of or residence in some Middle Eastern countries, such as Iran, Iraq, Lebanon, Palestine, Syrian Arab Republic, Jordan, and Libya, greatly extended processing time.

In [35]:
# create a new dataframe with countries of nationality and their number of applications
countries_nationality_pt = complete_cases.groupby([
    'nationality']).size().reset_index().rename(columns = {0: 'case_count_n_pt'}).sort_values(by = [
    'case_count_n_pt'], ascending = False).reset_index(drop = True)

# keep countries represented in at least 10 applications
popular_countries_n_pt = countries_nationality_pt[countries_nationality_pt.case_count_n_pt >= 10]

# count the applications from countries represented in fewer than 10 applications
other_countries_n_pt = pd.DataFrame([[
    'Other', countries_nationality_pt[countries_nationality_pt.case_count_n_pt < 10].sum(axis = 0)[
        'case_count_n_pt']]], columns = ['nationality_n_pt','case_count_n_pt'])


# create a copy of the complete_cases dataframe and replace the names of countries represented in fewer than
# 10 applications with Other
countries_cases_n_pt = complete_cases.copy()

for index, row in countries_cases_n_pt.iterrows():
    if row.nationality in popular_countries_n_pt.nationality.tolist():
        pass
    else:
        countries_cases_n_pt.loc[index, 'nationality'] = 'Other'
In [69]:
# processing time in days for applicants from different countries of nationality
plt.figure(figsize = [10, 10])
x_ticks = [0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850]
nationality_order = countries_cases_n_pt.nationality.value_counts().index

sb.boxplot(data = countries_cases_n_pt, x =
           'processing_days', y = 'nationality', color =
           'lightsteelblue',  linewidth = 1, order = nationality_order)

plt.title('\nApplication Case Processing Time and Countries of Nationality\n')
plt.xlabel('\nDays')
plt.ylabel('Countries\n')
plt.xticks(x_ticks)
plt.show();
In [37]:
# create a new dataframe with countries of residence and their number of applications
countries_residence_pt = complete_cases.groupby([
    'residence']).size().reset_index().rename(columns = {0: 'case_count_r_pt'}).sort_values(by = [
    'case_count_r_pt'], ascending = False).reset_index(drop = True)

# keep countries represented in at least 10 applications
popular_countries_r_pt = countries_residence_pt[countries_residence_pt.case_count_r_pt >= 10]

# count the applications from countries represented in fewer than 10 applications
other_countries_r_pt = pd.DataFrame([[
    'Other', countries_residence_pt[countries_residence_pt.case_count_r_pt < 10].sum(axis = 0)[
        'case_count_r_pt']]], columns = ['residence_r_pt','case_count_r_pt'])


# create a copy of the complete_cases dataframe and replace the names of countries represented in fewer than
# 10 applications with Other
countries_cases_r_pt = complete_cases.copy()

for index, row in countries_cases_r_pt.iterrows():
    if row.residence in popular_countries_r_pt.residence.tolist():
        pass
    else:
        countries_cases_r_pt.loc[index, 'residence'] = 'Other'
In [70]:
# processing time in days for applicants from different countries of residence
plt.figure(figsize = [10, 10])
x_ticks = [0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850]
residence_order = countries_cases_r_pt.residence.value_counts().index

sb.boxplot(data = countries_cases_r_pt, x =
           'processing_days', y = 'residence', color = 'lightsteelblue',  linewidth = 1, order = residence_order)

plt.title('\nApplication Case Processing Time and Countries of Residence\n')
plt.xlabel('\nDays')
plt.ylabel('Countries\n')
plt.xticks(x_ticks)
plt.show();

Although the graph indicates that processing time in Canadian visa offices is slightly shorter, I cannot draw this conclusion, since the number of cases with unspecified visa offices is rather large and can misrepresent the real situation.
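
The plot that follows uses a dataframe named `visa_office_countries`, whose construction is not shown. One way it could be built is sketched below. Everything here is an assumption: the helper name `group_visa_offices`, the idea that missing offices are stored as the string 'Unspecified', and the set of Canadian office names (only Ottawa is mentioned in this report, so the real list would have to be filled in by hand).

```python
import numpy as np
import pandas as pd


def group_visa_offices(cases: pd.DataFrame, canadian_offices: set) -> pd.DataFrame:
    """Return a copy whose visa_office is In Canada, Not in Canada or Unspecified."""
    grouped = cases.copy()
    office = cases["visa_office"]
    grouped["visa_office"] = np.where(
        office == "Unspecified",
        "Unspecified",
        np.where(office.isin(canadian_offices), "In Canada", "Not in Canada"),
    )
    return grouped


# toy stand-in for complete_cases; the office list is illustrative only
toy_cases = pd.DataFrame({"visa_office": ["Ottawa", "New Delhi", "Unspecified", "Ottawa"]})
toy_grouped = group_visa_offices(toy_cases, canadian_offices={"Ottawa"})
print(toy_grouped["visa_office"].tolist())
```

In the notebook this would be called as `visa_office_countries = group_visa_offices(complete_cases, canadian_offices)` with the full set of Canadian office names.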

In [39]:
# processing time in days for three groups of values of the visa_office column
plt.figure(figsize = [10, 5])
x_ticks = [0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850]

visa_office_order = ['In Canada', 'Not in Canada', 'Unspecified']
sb.boxplot(data = visa_office_countries, x =
           'processing_days', y = 'visa_office', color =
           'lightsteelblue',  linewidth = 1, order = visa_office_order)

plt.title('\nProcessing Time and Countries of Visa Offices\n')
plt.xlabel('\nDays')
plt.ylabel('Countries\n')
plt.xticks(x_ticks)
plt.show();

Processing time of immigration cases has increased over the past three years, and each year fewer applications are processed within 180 days (the time limit announced by Canadian immigration authorities).

In [40]:
# create plot for average processing time by years
plt.figure(figsize = [5, 5])

sb.barplot(data = cases_full_date.groupby(
    'final_decision_year')['processing_days'].mean().round(0).astype(int).reset_index(), x =
           'final_decision_year', y = 'processing_days', color = 'lightslategray')

plt.title('\nAverage Processing Time by Years\n')
plt.xlabel('\nYears')
plt.ylabel('Average Processing Time in Days \n')
plt.show();
In [46]:
# create a plot for distribution of processing time from year to year
bin_edges = np.arange(0, cases_full_date.processing_days.max() + 17.5, 25)

x_ticks = np.arange(0, cases_full_date.processing_days.max() + 25, 25)

pt_by_years = sb.FacetGrid(data = cases_full_date.query('final_decision_year != 2015'), col =
                           'final_decision_year', height = 5)
pt_by_years.map(plt.hist, 'processing_days', bins = bin_edges, rwidth = 0.89, color = 'lightslategrey')
pt_by_years.set_titles('{col_name}')

plt.subplots_adjust(top = 0.85)
pt_by_years.fig.suptitle('Processing Time in Days by Years')

pt_by_years.axes[0,0].set_xlabel('\nDays')
pt_by_years.axes[0,1].set_xlabel('\nDays')
pt_by_years.axes[0,2].set_xlabel('\nDays')
pt_by_years.axes[0,3].set_xlabel('\nDays')

pt_by_years.axes[0,0].set_ylabel('Number of Applications\n')

plt.show()
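
The histograms suggest that fewer cases meet the 180-day limit each year; the share can also be computed directly. This is a minimal sketch: the helper `share_within_limit` and the toy rows are my own, and the numbers are not from the dataset.

```python
import pandas as pd


def share_within_limit(cases: pd.DataFrame, limit_days: int = 180) -> pd.Series:
    """Percentage of cases per final-decision year finalised within limit_days."""
    within = cases["processing_days"] <= limit_days
    # the mean of a boolean series is the fraction of True values
    return (within.groupby(cases["final_decision_year"]).mean() * 100).round(1)


# toy stand-in for cases_full_date
toy = pd.DataFrame(
    {
        "final_decision_year": [2017, 2017, 2018, 2018, 2018, 2018],
        "processing_days": [90, 200, 100, 190, 250, 300],
    }
)
shares = share_within_limit(toy)
print(shares)
```

Applied to `cases_full_date`, a falling percentage from year to year would put a number on the trend described above.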

Combinations of Application Features Affecting Processing Time

- final decisions
- the presence of optional processing stages
- the countries of citizenship and residence
- Express Entry streams

Processing time of the refused cases was shorter than processing time of the approved cases if there were not any optional processing stages.

In [47]:
# create two plots to analyze how the combinations of final decision and the presence of optional processing stages 
# can affect processing time
plt.figure(figsize = [14, 6])

# processing time for the combinations of final decision and additional documents request stage
plt.subplot(1, 2, 1)
ad_fin_dec = sb.violinplot(data = complete_cases, y = 'processing_days', x = 'add_docs_request', hue =
              'final_decision', split = True, color = 'lightsteelblue',  linewidth = 1, order = [
                  'No', 'Yes'], inner = 'quartile')

plt.title(
    '\nProcessing Time for Approved and Refused Cases\nwith and without Additional Documents Request\n')

plt.ylabel('Processing Time in Days\n')
plt.xlabel('\nAdditional Documents Request Stage')
ad_fin_dec.get_legend().set_title('Final Decision')

# processing time for the combinations of final decision and security screening stage
plt.subplot(1, 2, 2)
ss_fin_dec = sb.violinplot(data = complete_cases, y = 'processing_days', x = 'security_screening', hue =
              'final_decision', split = True, color = 'lightsteelblue',  linewidth = 1, order = [
                  'No', 'Yes'], inner = 'quartile')

plt.title('\nProcessing Time for Approved and Refused Cases\nwith and without Security Screening\n')

plt.ylabel('Processing Time in Days\n')
plt.xlabel('\nSecurity Screening Stage')
ss_fin_dec.get_legend().set_title('Final Decision')

plt.show()

Cases that went through both optional processing stages took noticeably longer to process. Security screening on its own (without an additional documents request) extends processing time much more than an additional documents request on its own.

In [48]:
# create heatmap to find out the average processing time for the combinations of optional processing stages
plt.figure(figsize = [7, 5])

# select processing_days before averaging so that non-numeric columns are not passed to mean()
cat_means = complete_cases.groupby(['security_screening', 'add_docs_request'])['processing_days'].mean()
cat_means = cat_means.reset_index(name = 'processing_days_avg')
cat_means = cat_means.pivot(index = 'add_docs_request', columns = 'security_screening', values =
                            'processing_days_avg')

ad_ss = sb.heatmap(cat_means, annot = True, fmt = '.0f', cmap = 'Blues', cbar_kws = {'label' :
                                                                                     '\nProcessing Time in Days'})

plt.title('\nAverage Processing Time for Cases\nwith and without Optional Processing Stages\n')
ad_ss.set_ylabel('Additional Documents Request Stage\n')
ad_ss.set_xlabel('\nSecurity Screening Stage')

plt.show()
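
The comparison between the two stages can be made explicit by subtracting the baseline (neither stage present) from every cell of the table behind the heatmap. The cell below is a minimal sketch on made-up rows; the names `toy`, `stage_means` and `extra_days` are hypothetical, and `pivot_table` is used here only as a compact alternative to the groupby and pivot steps above.

```python
import pandas as pd

# made-up rows that only illustrate the calculation (not real data)
toy = pd.DataFrame({
    'security_screening': ['No', 'No', 'Yes', 'Yes', 'No', 'Yes'],
    'add_docs_request': ['No', 'Yes', 'No', 'Yes', 'No', 'Yes'],
    'processing_days': [100, 140, 220, 300, 120, 340],
})

# average processing time for every combination of the two optional stages
stage_means = toy.pivot_table(index = 'add_docs_request', columns = 'security_screening',
                              values = 'processing_days', aggfunc = 'mean')

# extra days compared with cases that had neither stage
extra_days = stage_means - stage_means.loc['No', 'No']
print(extra_days)
```

In the resulting table the rows are the additional documents request stage and the columns are the security screening stage, matching the layout of the heatmap.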

Cases of applicants who were citizens of Iran, Iraq, Lebanon, Palestine, the Syrian Arab Republic, Jordan or Libya, or who lived in one of these countries when they applied, took longer to process than cases of applicants from other countries, regardless of whether an additional documents request stage was present.

Still, cases that went through the optional stages took longer to process than cases that did not.

In [ ]:
# divide all countries of nationality and residence into three groups: Other, Unspecified and Middle East for
# Iran, Iraq, Lebanon, Palestine, Syrian Arab Republic, Jordan, Libya
middle_east = ['Iran', 'Iraq', 'Lebanon', 'Palestine, State of', 'Syrian Arab Republic', 'Jordan', 'Libya']

countries_group = complete_cases.copy()

for column in ['nationality', 'residence']:
    # listed countries become 'Middle East', 'Unspecified' stays as it is, everything else becomes 'Other'
    countries_group[column] = np.select(
        [countries_group[column].isin(middle_east), countries_group[column] == 'Unspecified'],
        ['Middle East', 'Unspecified'], default = 'Other')
In [50]:
# create two plots to analyze how the combinations of the countries of nationality and
# the presence of optional processing stages can affect processing time
plt.figure(figsize = [20, 7])

# processing time for the combinations of countries of nationality and additional documents request stage
plt.subplot(1, 2, 1)
countries_n_ad = sb.boxenplot(data = countries_group, y = 'processing_days', x = 'nationality', hue =
              'add_docs_request', palette = 'Blues',  order = ['Middle East', 'Other', 'Unspecified'])

plt.title(
    '\nProcessing Time for Cases from Different Countries of Nationality\nwith and without Additional Documents Request\n')


plt.ylabel('Processing Time in Days\n')
plt.xlabel('\nCountries of Nationality')
countries_n_ad.get_legend().set_title('Additional Documents Request')

# processing time for the combinations of countries of nationality and security screening stage
plt.subplot(1, 2, 2)
countries_n_ss = sb.boxenplot(data = countries_group, y = 'processing_days', x = 'nationality', hue =
              'security_screening', palette = 'Blues',  order = ['Middle East','Other', 'Unspecified'])

plt.title(
    '\nProcessing Time for Cases from Different Countries of Nationality\nwith and without Security Screening Stage\n')

plt.ylabel('Processing Time in Days\n')
plt.xlabel('\nCountries of Nationality')
countries_n_ss.get_legend().set_title('Security Screening')


plt.show()
In [51]:
# create two plots to analyze how the combinations of the countries of residence and
# the presence of optional processing stages can affect processing time
plt.figure(figsize = [20, 7])

# processing time for the combinations of countries of residence and additional documents request stage
plt.subplot(1, 2, 1)
countries_r_ad = sb.boxenplot(data = countries_group, y = 'processing_days', x = 'residence', hue =
              'add_docs_request', palette = 'Blues',  order = ['Middle East', 'Other', 'Unspecified'])

plt.title(
    '\nProcessing Time for Cases from Different Countries of Residence\nwith and without Additional Documents Request\n')


plt.ylabel('Processing Time in Days\n')
plt.xlabel('\nCountries of Residence')
countries_r_ad.get_legend().set_title('Additional Documents Request')

# processing time for the combinations of countries of residence and security screening stage
plt.subplot(1, 2, 2)
countries_r_ss = sb.boxenplot(data = countries_group, y = 'processing_days', x = 'residence', hue =
              'security_screening', palette = 'Blues',  order = ['Middle East','Other', 'Unspecified'])

plt.title(
    '\nProcessing Time for Cases from Different Countries of Residence\nwith and without Security Screening Stage\n')

plt.ylabel('Processing Time in Days\n')
plt.xlabel('\nCountries of Residence')
countries_r_ss.get_legend().set_title('Security Screening')


plt.show()

Both optional processing stages increase processing time for all Express Entry streams, and the security screening stage has a much larger effect than the additional documents request stage.

However, as the right-hand plot shows, the impact of security screening is larger for Provincial Nominee Program (PNP) cases, especially for cases submitted from inside Canada.

In [52]:
# create two plots to analyze how the combinations of Express Entry streams and
# the presence of optional processing stages can affect processing time
plt.figure(figsize = [20, 7])

# processing time for the combinations of Express Entry streams and additional documents request stage
plt.subplot(1, 2, 1)
stream_n_ad = sb.boxenplot(data = complete_cases, y = 'processing_days', x = 'stream', hue =
              'add_docs_request', palette = 'Greens', order = ['FSW-Outland', 'FSW-Inland', 'CEC',
                                                               'PNP-Outland', 'PNP-Inland'])

plt.title(
    '\nProcessing Time for Cases of Different Streams\nwith and without Additional Documents Request\n')


plt.ylabel('Processing Time in Days\n')
plt.xlabel('\nExpress Entry Streams')
stream_n_ad.get_legend().set_title('Additional Documents Request')

# processing time for the combinations of Express Entry streams and security screening stage
plt.subplot(1, 2, 2)
stream_n_ss = sb.boxenplot(data = complete_cases, y = 'processing_days', x = 'stream', hue =
              'security_screening', palette = 'Greens', order = ['FSW-Outland', 'FSW-Inland', 'CEC',
                                                               'PNP-Outland', 'PNP-Inland'])

plt.title(
    '\nProcessing Time for Cases of Different Streams\nwith and without Security Screening Stage\n')

plt.ylabel('Processing Time in Days\n')
plt.xlabel('\nExpress Entry Streams')
stream_n_ss.get_legend().set_title('Security Screening')


plt.show()
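
To compare the impact of security screening across streams with a single number, one option is the difference between the median processing time with and without the stage. The cell below is a minimal sketch on made-up rows; the helper `screening_gap` and the toy values are hypothetical, and the median is chosen here only because it is less sensitive to very long cases than the mean.

```python
import pandas as pd

def screening_gap(df, by = 'stream'):
    """Median processing days with and without security screening, plus their difference."""
    medians = df.groupby([by, 'security_screening'])['processing_days'].median().unstack()
    medians['gap'] = medians['Yes'] - medians['No']
    # streams where the stage adds the most days come first
    return medians.sort_values('gap', ascending = False)

# made-up rows that only illustrate the calculation (not real data)
toy_streams = pd.DataFrame({
    'stream': ['CEC', 'CEC', 'CEC', 'PNP-Inland', 'PNP-Inland', 'PNP-Inland'],
    'security_screening': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes'],
    'processing_days': [100, 160, 120, 110, 300, 340],
})

gaps = screening_gap(toy_streams)
print(gaps)
```

Applied to the real data, the call would presumably be `screening_gap(complete_cases)`, and passing `by = 'nationality'` on `countries_group` would give the same comparison for the country groups.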

Summary

Not every application case feature has a noticeable effect on processing time. As the investigation showed, the most influential factors are:

  • the presence of optional processing stages: additional documents request and security screening (the combination of the two maximizes processing time);

  • citizenship of, or residence in, Iran, Iraq, Lebanon, Palestine, the Syrian Arab Republic, Jordan or Libya;

  • immigration to Canada through the Provincial Nominee Program, especially if the future immigrant applies from inside Canada.