Publication | Open Access

Lessons from archives

Citations: 213

References: 15

Year: 2020

Eun Seo Jo, Timnit Gebru

Unknown Venue

TLDR

Data collection and annotation are central to fairness, accountability, transparency, and ethics in machine learning, yet they remain largely overlooked; archives and libraries offer long‑standing practices that address consent, power, inclusivity, transparency, ethics, and privacy. The paper proposes establishing a new ML specialization dedicated to data collection and annotation methodologies that rely on institutional frameworks and procedures. It outlines five archival document‑collection approaches that can guide sociocultural machine‑learning data practices. By illustrating archival methods, the study urges machine‑learning research to adopt more systematic, interdisciplinary data‑collection practices.

Abstract

A growing body of work shows that many problems in fairness, accountability, transparency, and ethics in machine learning systems are rooted in decisions surrounding the data collection and annotation process. In spite of its fundamental nature, however, data collection remains an overlooked part of the machine learning (ML) pipeline. In this paper, we argue that a new specialization should be formed within ML that is focused on methodologies for data collection and annotation: efforts that require institutional frameworks and procedures. Specifically for sociocultural data, parallels can be drawn from archives and libraries. Archives are the longest-standing communal effort to gather human information, and archive scholars have already developed the language and procedures to address and discuss many challenges pertaining to data collection, such as consent, power, inclusivity, transparency, and ethics & privacy. We discuss these five key approaches in the document collection practices of archives that can inform data collection in sociocultural ML. By showing data collection practices from another field, we encourage ML research to be more cognizant and systematic in data collection and to draw from interdisciplinary expertise.

