Publication | Open Access
A Method of Automated Nonparametric Content Analysis for Social Science
Citations: 771
References: 42
Year: 2009
Social Data Analysis, Engineering, Corpus Linguistics, Journalism, Social Sciences, Text Mining, Natural Language Processing, Altmetrics, Computational Social Science, Social Media, Information Retrieval, Data Science, Document Analysis, Document Classification, Automated Content Analysis, Content Analysis, Statistics, Abstract Analysis, Social Medium Mining, U.S. Presidency, Automatic Classification, Knowledge Discovery, Information Extraction, Category Proportions, Quantitative Social Science Research, Social Medium Data, Arts
The surge in digitized text offers vast opportunities for social science, yet manual coding of blogs, speeches, and other unstructured sources is infeasible, and existing automated methods that focus on document classification can produce biased estimates of population category proportions. This work introduces a method that directly optimizes for unbiased estimation of category proportions, even when the underlying classifier performs poorly. The authors demonstrate the approach on diverse datasets—including daily public opinion on the U.S.
The increasing availability of digitized text presents enormous opportunities for social scientists. Yet hand coding many blogs, speeches, government records, newspapers, or other sources of unstructured text is infeasible. Although computer scientists have methods for automated content analysis, most are optimized to classify individual documents, whereas social scientists instead want generalizations about the population of documents, such as the proportion in a given category. Unfortunately, even a method with a high percent of individual documents correctly classified can be hugely biased when estimating category proportions. By directly optimizing for this social science goal, we develop a method that gives approximately unbiased estimates of category proportions even when the optimal classifier performs poorly. We illustrate with diverse data sets, including the daily expressed opinions of thousands of people about the U.S. presidency. We also make available software that implements our methods and large corpora of text for further analysis.
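The abstract's claim that a classifier with a high percent correct can still be badly biased for category proportions can be checked with a little arithmetic. The sketch below is a toy illustration, not the authors' estimator: the two-category setup, the misclassification rates in `confusion`, and the population shares in `true_props` are all invented numbers, and the rates are assumed to be known exactly (for example, from a hand-coded sample). It compares the "classify and count" shares with an aggregate correction that solves for the proportions directly.

```python
import numpy as np

# Hypothetical misclassification rates (invented for illustration):
# confusion[i, j] = P(classifier assigns category i | true category j).
confusion = np.array([
    [0.90, 0.20],
    [0.10, 0.80],
])

# Hypothetical true category proportions in the unlabeled population.
true_props = np.array([0.20, 0.80])

# Share of individual documents the classifier labels correctly.
accuracy = float(np.diag(confusion) @ true_props)

# "Classify and count": expected share of documents assigned to each category.
classify_and_count = confusion @ true_props

# Aggregate correction: solve confusion @ p = classify_and_count for p.
# This targets the proportions themselves, not individual labels.
corrected = np.linalg.solve(confusion, classify_and_count)

print("accuracy:          ", accuracy)
print("true proportions:  ", true_props)
print("classify and count:", classify_and_count)
print("corrected estimate:", corrected)
```

With these invented numbers the classifier labels 82% of documents correctly, yet classify and count puts about 34% of documents in the first category when the true share is 20%. The corrected estimate recovers the true shares only because the misclassification rates are assumed known and identical in the labeled and unlabeled documents; with rates estimated from a sample, the recovery would be approximate.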