Catherine Blake, University of Illinois at Urbana-Champaign
Scientific literature plays a critical role in informing public health policy. For example, systematic reviews by the Cochrane Collaboration and Health Technology Assessment play an important role in evidence-based medicine, which in turn can influence government policies on standards of health care. In epidemiology, the results of a meta-analysis (a quantitative form of systematic review) can influence legislation, product label requirements, and public services.
The systematic review process is typically a group activity that requires users to identify a comprehensive collection of articles, extract information from those articles, verify the accuracy of the extracted facts, and analyze those facts using either qualitative or quantitative techniques (Blake & Pratt, 2006). Although systematic reviews accurately capture evidence, the process is time-consuming, taking about 28 months from original conception through to publication (Petrosino, 1999) and an estimated 1139 hours of work (Allen & Olkin, 1999). With more than 21 million citations in MEDLINE and an additional 1900 new citations added every week, the manual techniques currently used are becoming increasingly difficult to apply. Consider a breast cancer expert. It would be difficult, but necessary, for her to consider the 33,883 articles published on breast cancer during the 28 months required to conduct a systematic review. Faced with the daunting task of sifting through currently available and recently added articles, our breast cancer expert may turn to other strategies to reduce the number of articles, such as constraining her hypothesis or her selection criteria. However, both of these constraints introduce undesirable biases and thus reduce the validity of her review as a basis for public policy.
A second key challenge of a systematic review is that the articles considered are drawn from the published literature and thus may suffer from publication bias: articles that report statistically significant findings are more likely to be published than articles that do not, even when the methodology of the studies is the same. “For any given research area, one cannot tell how many studies have been conducted but never reported. The extreme view of the "file drawer problem" is that journals are filled with the 5% of the studies that show Type I errors, while the file drawers are filled with the 95% of the studies that show nonsignificant results.” (Rosenthal, 1979)
In this paper we describe the Multi-User Extraction for Information Synthesis (METIS) system, which automatically identifies the information required in a meta-analysis from full-text scientific articles. Such an approach can be used within the existing systematic review process to reduce the time between when findings are published and when public policy is updated. We then demonstrate how the automatically extracted information can be used to create a synthetic control group estimate, so that information from articles that would not be considered in a traditional meta-analysis can be incorporated into the analysis. Although this is a controversial suggestion from a meta-analytic perspective, our goal here is not to replace meta-analysis, but rather to provide alternative ways to leverage textual “big data”.
Figure 1 shows this new type of analysis using breast cancer articles published in key epidemiology journals between 1997 and 2002. Articles that report alcohol consumption as primary information (in the title, keywords or abstract text) are shown in black, and articles that report consumption only within the full text (secondary information) are shown in grey. Although scientists who conduct systematic reviews place a high priority on obtaining the “file drawer” articles, studies of the manual systematic review process (Blake & Pratt, 2006) show that users review the abstracts before retrieving the full text, so it is unlikely that articles reporting alcohol consumption only in the full text would have been found. In one traditional meta-analysis of breast cancer and alcohol consumption (Ellison et al., 2001), 71 of the 72 articles included have alcohol or a synonym in the title, keywords or abstract. Some would argue that, methodologically, studies that do not report the disease and the risk factor as primary information should not be included in a meta-analysis.
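The distinction between primary and secondary information can be illustrated with a minimal Python sketch. It is not the METIS implementation: the Article fields, the ALCOHOL_TERMS synonym list and the classify function are hypothetical names introduced here, and simple substring matching stands in for whatever extraction METIS actually performs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Article:
    """Hypothetical container for the parts of an article we inspect."""
    title: str
    keywords: List[str]
    abstract: str
    full_text: str


# Assumed, illustrative synonym list; a real study would curate this.
ALCOHOL_TERMS: Tuple[str, ...] = ("alcohol", "ethanol")


def mentions(text: str, terms: Tuple[str, ...]) -> bool:
    """Return True if any term occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def classify(article: Article, terms: Tuple[str, ...] = ALCOHOL_TERMS) -> str:
    """Label an article by where the risk factor is reported.

    primary:      term appears in the title, keywords or abstract
    secondary:    term appears only in the full text
    not reported: term appears nowhere
    """
    front_matter = " ".join(
        [article.title, " ".join(article.keywords), article.abstract]
    )
    if mentions(front_matter, terms):
        return "primary"
    if mentions(article.full_text, terms):
        return "secondary"
    return "not reported"
```

Under these assumptions, an article whose only mention of alcohol is in a table within the full text is labelled secondary, which is the kind of article a reviewer screening abstracts would be unlikely to retrieve.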
The results of this study show that more than 60% of the breast cancer articles that report alcohol consumption (17 of the 28 studies) include the data as secondary information (only in the full text) and thus would not be included in a traditional meta-analysis. More important than the number of primary versus secondary articles is the degree to which the findings reported in those articles differ. Of the four articles that would be included in a traditional analysis (circled), three suggest that ever drinking is higher in subjects with breast cancer (the cases) and only one suggests that alcohol consumption is not a breast cancer risk factor. In contrast, of the 17 studies that report alcohol consumption as secondary information, 6 show a positive effect, 6 show a negative effect, and 5 show no effect. Despite including both primary and secondary information, the METIS results are consistent with the traditional meta-analysis cited earlier, which suggests a small positive effect size between ever drinking and breast cancer.
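As a quick check on the counts reported in this paragraph, the Python sketch below recomputes the share of secondary articles and compares the direction of effect in the two groups. The counts come from the text; the variable names are introduced here for illustration, and the zero count of negative effects among the four circled articles is inferred from the statement that three are positive and one shows no effect.

```python
# Counts taken from the paragraph above.
TOTAL_STUDIES = 28
SECONDARY_STUDIES = 17

# Direction of effect in each group. "primary" covers only the four
# circled articles; its zero negative count is inferred from the text.
primary = {"positive": 3, "negative": 0, "none": 1}
secondary = {"positive": 6, "negative": 6, "none": 5}


def share(count: int, total: int) -> float:
    """Return count as a fraction of total."""
    return count / total


secondary_share = share(SECONDARY_STUDIES, TOTAL_STUDIES)
primary_positive = share(primary["positive"], sum(primary.values()))
secondary_positive = share(secondary["positive"], sum(secondary.values()))

print(f"secondary share of studies: {secondary_share:.1%}")      # 60.7%
print(f"positive among primary:     {primary_positive:.1%}")     # 75.0%
print(f"positive among secondary:   {secondary_positive:.1%}")   # 35.3%
```

These are simple proportions of study counts, not effect sizes; they only illustrate how differently the two groups of articles are distributed.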
Figure 1 – The METIS summary of breast cancer articles published between 1997 and 2002.
In this paper “big data” takes the form of published peer-reviewed articles. The proposed strategy has been used to quantify the association between smoking and impotence (Tengs & Osgood, 2001), and this earlier manual analysis took approximately six weeks to complete on the 1008 articles (T. Tengs, personal communication, 2002). Assuming that a similar amount of time per article would be required for the 240,000 breast cancer articles, such an analysis would take roughly 27 years. Although issues such as access to full text need to be resolved before this approach can be applied to all breast cancer articles, METIS can be used on the subset of articles that are available electronically. Moreover, METIS can reduce the time required for a traditional analysis by automatically identifying information from the articles, and thus reduce the time needed to integrate new scientific findings into public policy.
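The 27-year estimate follows from a linear extrapolation of the manual effort, sketched below in Python. The sketch assumes that time scales in proportion to the number of articles and that a year has 52 working weeks; the function and parameter names are introduced here for illustration.

```python
WEEKS_PER_YEAR = 52  # assumption: no allowance for breaks or parallel work


def extrapolated_years(
    articles: int,
    reference_articles: int = 1008,   # articles in the manual analysis
    reference_weeks: float = 6.0,     # approximate time that analysis took
) -> float:
    """Scale the manual effort linearly to a larger collection of articles."""
    weeks = articles / reference_articles * reference_weeks
    return weeks / WEEKS_PER_YEAR


years = extrapolated_years(240_000)
print(f"{years:.1f} years")  # 27.5 years
```

The result of about 27.5 years agrees with the figure quoted above, which appears to drop the fractional part.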
Allen, E., & Olkin, I. (1999). Estimating time to conduct a meta-analysis from number of citations retrieved. Journal of the American Medical Association, 282(7), 634-635.
Blake, C., & Pratt, W. (2006). Collaborative Information Synthesis I: A Model of Information Behaviors of Scientists in Medicine and Public Health. Journal of the American Society for Information Science and Technology, 57(13), 1740-1749.
Ellison, R.C., Zhang, Y., McLennan, C.E., & Rothman, K.J. (2001). Exploring the Relation of Alcohol Consumption to Risk of Breast Cancer. American Journal of Epidemiology, 154(8), 740-747.
Petrosino, A. (1999). Lead Authors of Cochrane Reviews: Survey Results. Report to the Campbell Collaboration. Cambridge, MA: University of Pennsylvania.
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638-641.
Tengs, T., & Osgood, N.D. (2001). The link between smoking and impotence: Two decades of evidence. Preventive Medicine, 32(6), 447-452.