Demographic inference

Sociodemographic information is crucial for studying many social questions.

There are more and more methods and APIs for gender or ethnicity inference from names and photos. However, many methods are black boxes and it is unclear how well they perform.

Lockhart2023Name-based demonstrated that these inference methods perform poorly, especially for minority groups. "Computer algorithms infer gender, race and ethnicity. Here’s how to avoid their pitfalls" is a good guideline.

Name-based methods

The basic idea is that a person's first or last name often signals aspects of their social identity. For instance, if we see someone with “Jennifer” as their first name, the person is very likely to be a woman. This is an intuitive and simple method that may work fairly well.
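The lookup idea can be sketched in a few lines. This is only a minimal illustration: the counts below are made-up toy numbers, and the names `NAME_COUNTS`, `infer_from_first_name`, and `min_share` are hypothetical, not taken from any tool mentioned in these notes.

```python
from collections import Counter

# Hypothetical toy counts of how often each first name appears with each
# label in some reference list. These numbers are invented for illustration.
NAME_COUNTS = {
    "jennifer": Counter({"woman": 995, "man": 5}),
    "james": Counter({"man": 990, "woman": 10}),
    "taylor": Counter({"woman": 600, "man": 400}),
}


def infer_from_first_name(name, min_share=0.9):
    """Return (label, share); label is None if the name is unknown or ambiguous."""
    counts = NAME_COUNTS.get(name.strip().lower())
    if not counts:
        return None, 0.0
    label, n = counts.most_common(1)[0]
    share = n / sum(counts.values())
    # Refuse to guess when the most common label is not dominant enough.
    return (label if share >= min_share else None), share
```

Returning `None` for unknown or ambiguous names keeps the "no answer" cases visible instead of forcing a guess, which matters when checking who gets misclassified or left out.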

However, there are many issues with name-based inference methods. Although Kozlowski2021avoiding discusses ways to avoid biases, Lockhart2023Name-based (preprint: Lockhart2022what) argues that this is more or less hopeless due to the unequal distribution of misrecognition. "Stop Mapping Names to Gender" argues strongly against this approach.

Tools

There exist some APIs and tools.

- https://genderize.io
  - Punchline from Aaron Clauset: http://genderize.io isn’t research-grade, and its errors can’t be modeled because it’s not open to inspection.
- http://abel.lis.illinois.edu/cgi-bin/ethnea/search.py
  - Ethnea: the ethnicity predictor. Aaron’s recommendation.
- https://databank.illinois.edu/datasets/IDB-9087546
- https://pypi.org/project/ethnicolr/
- Buskirk2022open source
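To make the "black box" concern concrete, here is a rough sketch of querying a web API of this kind. It assumes the public genderize.io endpoint at `https://api.genderize.io/` with a `name` query parameter and a JSON response containing `gender`, `probability`, and `count` fields; check the current API documentation before relying on any of this. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; verify against the service's current documentation.
API_URL = "https://api.genderize.io/"


def build_url(name):
    """Build the request URL for a single first name."""
    return API_URL + "?" + urlencode({"name": name})


def parse_response(payload):
    """Extract (gender, probability, count) from a JSON response body."""
    data = json.loads(payload)
    return data.get("gender"), data.get("probability"), data.get("count")


def query(name):
    """Send one request over the network (not called on import)."""
    with urlopen(build_url(name), timeout=10) as resp:
        return parse_response(resp.read().decode("utf-8"))
```

Under these assumptions the caller only ever sees a label, a probability, and a count, with no access to the underlying data, which is why the errors are hard to model.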

How well do these methods work? Santamaria2018comparison is an example paper that evaluated existing gender inference methods.
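One simple way to check a method against labeled data is to compute error and non-classification rates separately for each group, so that unequal performance across groups is not hidden by an overall average. The sketch below is not the procedure from any of the cited papers; the function name, the record layout, and the sample records are all hypothetical.

```python
from collections import defaultdict


def evaluate_by_group(records):
    """records: iterable of (group, true_label, predicted_label_or_None)."""
    stats = defaultdict(lambda: {"n": 0, "wrong": 0, "unclassified": 0})
    for group, truth, pred in records:
        s = stats[group]
        s["n"] += 1
        if pred is None:
            # The method declined to answer for this person.
            s["unclassified"] += 1
        elif pred != truth:
            s["wrong"] += 1
    return {
        g: {
            "n": s["n"],
            "error_rate": s["wrong"] / s["n"],
            "unclassified_rate": s["unclassified"] / s["n"],
        }
        for g, s in stats.items()
    }


# Made-up records for illustration only.
records = [
    ("group_a", "woman", "woman"),
    ("group_a", "man", "man"),
    ("group_a", "woman", "woman"),
    ("group_a", "man", "woman"),
    ("group_b", "woman", "man"),
    ("group_b", "man", None),
    ("group_b", "woman", "woman"),
    ("group_b", "man", "woman"),
]
report = evaluate_by_group(records)
```

Reporting the unclassified rate alongside the error rate matters because a method can look accurate simply by refusing to label the names it handles worst.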

Image-based methods

When pictures are available, computer vision can be applied to them.

Some papers