Kiefer Co

Methods of Detecting Home Language Shift in Canadian Census Data

Natural Sciences Meet Social Sciences: Census Data Analytics for Detecting Home Language Shifts

In late 2019, I wrote a data mining final paper with my university friends, Chris Choy, Matthew Fogel, Clarke Garrioch, and Katie Martchenko.

Our professor, Dr. Carson Kai-Sang Leung of the University of Manitoba, found our paper good enough to publish!

Our paper was published at IMCOM 2021, the 15th International Conference on Ubiquitous Information Management and Communication.

You can read the paper on IEEE Xplore or ResearchGate, or contact me for one of our draft copies.

This gives us an Erdős number of 4 (mirror), and more importantly for a bunch of computer scientists, a Turing number and von Neumann number of 6!

Overview and Inspiration

Chris, Matthew, Clarke, Katie and I were friends through various classes and co-op terms. Through a desire to graduate with a databases specialization and a need for three more credit hours, we took Data Mining with Dr. Leung and were bound by the common need to write a good paper.

The first thing we needed was a topic, and after a few brainstorming sessions, we realized something we all had in common: being Canadians whose ancestors didn't always speak English or French, our two official languages!

Chris and I are both part Chinese, but we didn't speak our fathers' dialects. Matthew didn't speak his family's Yiddish, and Clarke didn't speak his family's Cree. Katie spoke fluent Russian, but had mixed results passing on the language to her own daughter.

Languages being passed on seemed to be hit or miss, and we wanted to get some insights on why that was the case.

A few Google and DuckDuckGo searches later, we discovered that Statistics Canada had a "Public Use Microdata File" (PUMF) chock-full of useful census data, including language data!

Science!

Our idea was simple:

The census data recorded two languages per person:

  • MTNNO: Mother Tongue (a person's first language)
  • HLANO: Home Language (language spoken at home)

If a person's home language was different from their mother tongue, we recorded that person as having experienced a "home language shift".

For our purposes (Canadian data), that usually meant that their home language was English or French, and their mother tongue was something else.
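The labelling rule above can be sketched in a few lines of Python. This is only an illustration, not the code used in the paper: it assumes each person is a dictionary holding the MTNNO and HLANO fields, and the numeric codes below are hypothetical placeholders rather than real PUMF values.

```python
from typing import Mapping


def has_home_language_shift(person: Mapping[str, int]) -> bool:
    """Return True when the home language code differs from the mother tongue code."""
    return person["HLANO"] != person["MTNNO"]


# Hypothetical records: the numeric codes are placeholders, not real PUMF values.
people = [
    {"MTNNO": 12, "HLANO": 12},  # same language at home: no shift
    {"MTNNO": 12, "HLANO": 1},   # different language at home: shift
    {"MTNNO": 7, "HLANO": 7},    # same language at home: no shift
]

# One boolean label per person, plus the share of people who shifted.
labels = [has_home_language_shift(p) for p in people]
shift_rate = sum(labels) / len(labels)
print(labels, shift_rate)
```

The boolean label is what each classifier is then asked to predict from a person's other traits.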

We tested three different data mining methods. The goal was to test each method's ability to predict whether a person would have a home language shift, and to find out which other traits predicted for or against a shift.
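To give a feel for what "predicting a shift from other traits" means, here is a minimal pure-Python sketch of a categorical naive Bayes classifier with add-one smoothing. It is not the implementation from the paper. The feature names (`generation`, `age_group`) and the six training rows are made up for illustration and are not census data.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature, label) -> value -> count
    feature_values = defaultdict(set)    # feature -> distinct values seen
    for row, label in zip(rows, labels):
        for feature, value in row.items():
            value_counts[(feature, label)][value] += 1
            feature_values[feature].add(value)
    return class_counts, value_counts, feature_values


def predict(model, row):
    """Pick the class with the highest log-posterior, using add-one smoothing."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for feature, value in row.items():
            seen = value_counts[(feature, label)][value]
            n_values = len(feature_values[feature])
            score += math.log((seen + 1) / (count + n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training rows (illustration only); True means "shifted".
rows = [
    {"generation": "second", "age_group": "middle"},
    {"generation": "second", "age_group": "young"},
    {"generation": "second", "age_group": "middle"},
    {"generation": "first", "age_group": "old"},
    {"generation": "first", "age_group": "middle"},
    {"generation": "third", "age_group": "young"},
]
shifted = [True, True, True, False, False, False]

model = train_naive_bayes(rows, shifted)
print(predict(model, {"generation": "second", "age_group": "middle"}))
print(predict(model, {"generation": "first", "age_group": "old"}))
```

The per-class counts also show which trait values lean towards or against a shift, which is the second thing we wanted from each method.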

We tested:

  • Decision tree
  • Random forest
  • Naive Bayes classifier

Our findings included:

  • The random forest approach outperformed the other two
  • The two tree-based approaches (decision tree and random forest) had more false negatives, while the naive Bayes classifier had more false positives
  • Incidence of language shift was highest among second generation immigrants
    • This made sense to us, as first generation immigrants were more connected to their home communities and third+ generation immigrants would be born as native English or French speakers after their parents shifted
  • Middle age groups shifted more often, likely for reasons similar to those above (young speakers growing up with English and French, and old speakers having less propensity to change)
  • Mother's place of birth (i.e. mother being the immigrant, and therefore the mother tongue source) was more influential than father's place of birth
  • Many socioeconomic factors influenced language shift
    • Very high and very low income families experienced higher rates of shift than middle income
    • Similarly, part-time workers and workers who worked many hours experienced more language shift than average full-time workers
    • Students who went to schools far from their homes experienced more language shift
    • Having a large home with many rooms meant less language shift, but having expensive homes or high costs of living was linked to more language shift
  • We found many related sources and programs linking language proficiency with economic well-being and even fertility
  • Newer census data features improved performance and revealed more insights
    • E.g. the new IMMCAT5 (detailed immigration category) was better than IMMSTAT (immigration status), as it separated immigrants who arrived before 1980
      • This group had very high rates of language shift, suggesting pressure to assimilate might have been higher a few decades ago
    • ABOID (detailed Aboriginal identity) was better than BFNMEMB (First Nation or band membership)
      • The new feature tracked Métis and Inuit as distinct groups, which was useful as these two groups were less likely to shift home languages
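The false positive and false negative comparison in the findings above comes down to counting four kinds of outcome per classifier. Here is a minimal sketch of that counting, assuming predictions and true labels are lists of booleans where True means "shifted". The two lists are invented for illustration and are not results from the paper.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for boolean labels."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["tp"] += 1  # shift predicted, shift happened
        elif p:
            counts["fp"] += 1  # shift predicted, no shift happened
        elif a:
            counts["fn"] += 1  # shift missed by the classifier
        else:
            counts["tn"] += 1  # no shift predicted, none happened
    return counts


# Invented labels for illustration only.
actual = [True, True, False, False, True]
predicted = [True, False, False, True, False]

counts = confusion_counts(actual, predicted)
precision = counts["tp"] / (counts["tp"] + counts["fp"])
recall = counts["tp"] / (counts["tp"] + counts["fn"])
print(counts, precision, recall)
```

A classifier with more false negatives misses real shifts (lower recall), while one with more false positives predicts shifts that did not happen (lower precision).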

–Kiefer