# Program Abstracts

## April 26-27, 2019 at the University of Rochester, New York

### EIGHTH ANNUAL CONFERENCE OF THE UPSTATE CHAPTERS OF THE AMERICAN STATISTICAL ASSOCIATION

### Abstracts

See Program for more detail.

**Friday, April 26**

**POSTER SESSION** **Saunders Research Building Atrium**

**6:00-7:00 Variance heterogeneity in psychological research: A Monte-Carlo study of the consequences for meta-analysis**

Bruce Blaine

Statistics and Data Sciences Program, St. John Fisher College

__E-mail__: bblaine@sjfc.edu

Variance heterogeneity is common in psychological research. Surveys of psychological research show that variance ratios (VRs) in two-group studies average around 2.5, with a substantial minority of studies having much higher VRs. Research has established that variance heterogeneity disturbs Type I error rates of parametric tests in primary research. Fixed-effects meta-analysis is a common statistical method in psychology for synthesizing primary research, and plays an important role in cumulative science and evidence-based practice. Little is known about the consequences of variance heterogeneity for meta-analytic estimates. The present research reports a Monte Carlo study in which the results of *k* = 8 or 20 primary studies were generated from each of the distributions N(100, 15) and N(106, 15), for δ = 0.40 (effect size). Variance heterogeneity was created by contaminating the second distribution with elements from a N(106, 45) distribution in proportions ranging from 0.00 to 0.25, to achieve VRs ranging from 1.0 to 3.0. Each simulated fixed-effects meta-analysis (5000 replications) yielded the following estimates: Hedges’ g, CI_{95%} coverage, and I^{2}. In the baseline (VR = 1.0) simulation, g = 0.40 and CI_{95%} coverage = 0.950. In general, larger VRs at the primary-study level were associated with smaller Hedges’ gs and poorer CI_{95%} coverage at the meta-analytic level. For example, at VR = 2.6, g = 0.30 and CI_{95%} coverage = 0.801. In other words, a meta-analysis of studies that simulated the average VR in psychological research substantially underestimated the true effect and inflated the Type I error rate. Study-level variance heterogeneity also inflated estimates of between-study variance (I^{2}), which has implications for meta-regression modeling. This study demonstrates that widely used meta-analytic methods do not produce accurate parameter estimates in the presence of study-level variance heterogeneity.
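
The simulation design described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the author's code: the per-group sample size `n = 50`, the replication count, the seed, and the normal-approximation interval (±1.96 standard errors) are all assumptions, since the abstract does not report them. The contamination proportion `p` controls the variance ratio (with these parameters, VR = 1 + 8p).

```python
import math
import random
import statistics


def hedges_g(x1, x2):
    """Return Hedges' g and its large-sample variance for two samples."""
    n1, n2 = len(x1), len(x2)
    df = n1 + n2 - 2
    pooled_sd = math.sqrt(((n1 - 1) * statistics.variance(x1)
                           + (n2 - 1) * statistics.variance(x2)) / df)
    d = (statistics.fmean(x2) - statistics.fmean(x1)) / pooled_sd
    j = 1 - 3 / (4 * df - 1)  # small-sample bias correction
    var_d = (n1 + n2) / (n1 * n2) + d * d / (2 * (n1 + n2))
    return j * d, j * j * var_d


def simulate(k=8, n=50, p=0.0, reps=500, delta=0.40, seed=1):
    """Monte Carlo fixed-effects meta-analysis of k two-group studies.

    n (per-group size), reps, and seed are assumed values for illustration.
    Returns the mean pooled g and the coverage of the nominal 95% CI.
    """
    rng = random.Random(seed)
    pooled_estimates, covered = [], 0
    for _ in range(reps):
        gs, weights = [], []
        for _ in range(k):
            x1 = [rng.gauss(100, 15) for _ in range(n)]
            # Contaminate group 2: with probability p draw from N(106, 45).
            x2 = [rng.gauss(106, 45 if rng.random() < p else 15)
                  for _ in range(n)]
            g, v = hedges_g(x1, x2)
            gs.append(g)
            weights.append(1 / v)  # inverse-variance weight
        total = sum(weights)
        pooled = sum(w * g for w, g in zip(weights, gs)) / total
        se = math.sqrt(1 / total)
        pooled_estimates.append(pooled)
        covered += abs(pooled - delta) <= 1.96 * se
    return statistics.fmean(pooled_estimates), covered / reps
```

Calling `simulate(p=0.0)` and `simulate(p=0.2)` gives a rough comparison of the homogeneous baseline with a VR of about 2.6.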

**Local piecewise polynomial regression on a network §**

Yang Liu and David Ruppert

Department of Statistics and Data Science, Cornell University

__E-mail__: yl2443@cornell.edu

This paper develops a statistically principled approach to density estimation on a network. We formulate nonparametric density estimation on a network as a nonparametric regression problem by way of binning. Nonparametric regression using local polynomial kernel-weighted least squares has been studied rigorously, and its asymptotic properties make it superior to the kernel estimator or the Nadaraya-Watson estimator. To tackle the unique challenges of a network, we propose a two-step local piecewise polynomial regression procedure. We study in detail the special case of local piecewise linear regression and derive the leading bias and variance terms using weighted least squares matrix theory. We show that the proposed approach will remove the bias that has been noted for existing methods near a vertex.

**Impact of celebrities in advertising campaigns §**

Alaa Yasin

Media and Communications Program, St. John Fisher College

__E-mail__: ay09970@sjfc.edu

This study explored the impact of celebrity-led advertising campaigns on revenue of athletic apparel companies. Such campaigns can be expensive, and estimates of their return on investment are therefore important. Because the costs of advertising are typically proprietary, proxies for these costs were used. Because advertising campaigns create “buzz” surrounding a product or company, Google Trends can be used to identify likely campaign dates. Further searching of the events of that time identified the presence or absence of a celebrity campaign, and this was used as a proxy for advertising costs. These proxies were then compared to quarterly revenues as reported to the Securities and Exchange Commission. Other events (e.g., misconduct of celebrities involved, major sporting events during the time of the campaign) may also have contributed to revenues. Hence, multiple regression and time series analyses were used to determine the effectiveness of these campaigns.

**Historical study of the relationship between the federal funds rate and the inflation rate §**

Aaron Wilkins

Economics Program and Statistics Program, St. John Fisher College

__E-mail__: arw08381@sjfc.edu

It is believed that in order to control high inflation rates, the Federal Reserve Bank (“the Fed”) increases the federal funds rate and when the inflation rate gets low, the Fed takes the opposite approach. (The federal funds rate is a rate of interest that banks charge each other to lend funds and stay above the reserve requirement, set by the government.) This project examines the relationship between the federal funds rate and the inflation rate. Sixty-five years of historical inflation rates and federal funds rates were used as the basis for this exploration. Because a time lag between the setting of the federal funds rate and its impact on the economy may exist, recurrence quantification analysis was used in the time series analysis. The results provide insight into the relationship between the federal funds rate and inflation and test the impact of using the federal funds rate as an anti-inflationary tool.

**Impact of select social and economic factors on health §**

Taylor Palermo

Statistics Program, St. John Fisher College

__E-mail__: tlp00467@sjfc.edu

Discussions regarding health involve a number of variables, although diet and exercise may be the most prominent in popular press. Although both contribute to health, there may be other less publicized factors that have a large impact as well. Data from the 2010 Census, the University of Wisconsin Population Health Institute, and the Centers for Disease Control and Prevention were all used to explore the impact of some of these less publicized factors on health. Measures of health are difficult, but the premature death rate was used as a proxy for overall health. Variables that were used to make predictions of this measure of health included commute time, level of education, and percent of the population who are under the age of 65 and do not have health insurance. Linear models tell us one story and path models tell us a deeper, more meaningful story. Applications to public health policies will be discussed.

**Valuing a running back §**

Ryan Ingerson

Economics Program, St. John Fisher College

__E-mail__: rwi01833@sjfc.edu

The National Football League has had its fair share of controversy in the past year. One of these controversies surrounds the value that organizations place on players and more specifically on the running back position. Le’Veon Bell is at the center of this discussion and voices a strong opinion about his high value. However, the running back position has been belittled of late as it is now described as an extremely “interchangeable” position. (The term “interchangeable” refers to the opinion that production from the position is not necessarily correlated with skill at that position.) Using data including production statistics from the running backs as well as their interaction with other positions from the past 3 years, an indirect model was used to explore this issue. The results may help front offices in their efforts to maximize team performance while minimizing team cost.

**A simulation of baseball pitchers and their effectiveness §**

Zachary Ryan

Sport Management Program, St. John Fisher College

__E-mail__: zrr09423@sjfc.edu

A recent trend in Major League Baseball is the change in roles of starting pitchers (e.g., the use of the “opener”). Previous studies have projected overall team wins using simulations, but were unable to simulate pitcher-level results. This study created a model to effectively simulate a baseball game using pitcher-level data, thereby allowing for an exploration of pitching impacts. Additionally, a user interface was developed to facilitate these explorations in a wide variety of scenarios. These results will be valuable to a front office of a Major League team as a way to evaluate their roster.

**Statistics timeline §**

Kayla Kolacz, Jamie Hagerty, and Susan Mason

Department of Psychology, Niagara University

__E-mail__: kkolacz@mail.niagara.edu

This poster depicts important dates in the history of statistics. The timeline begins with the use of the arithmetic mean in 450 BC and includes several milestones in the history of statistics through modern times. Early contributions are essential steps in the development of modern statistics, which has applications in scientific research, politics, sports, economics, planning, and government. A full appreciation of statistics requires an understanding of its past. For students, that understanding could come from a statistics timeline hanging in the classroom. Alternatively, the information could be sprinkled throughout statistics lectures, or students could be assigned a class project to research famous statisticians. Regardless of the pedagogical approach used, exposure to information about the history of statistics deepens students’ understanding of the field and its value to other fields of inquiry.

**Statistics education and career opportunities §**

Jamie Hagerty, Kayla Kolacz, and Susan Mason

Department of Psychology, Niagara University

__E-mail__: jhagerty@mail.niagara.edu

According to the Bureau of Labor Statistics, the employment of “statisticians is projected to grow 33 percent from 2016 to 2026, much faster than the average for all occupations. Businesses will need these workers to analyze the increasing volume of digital and electronic data.” Whether or not students plan careers as statisticians, their job prospects are improved if they have a background in statistics. Completing a major or a minor at the bachelor’s level, or simply taking a single undergraduate statistics course, will make a student more desirable to many employers. This poster describes some of the career opportunities for which statistics is either a required or a preferred qualification. The poster also outlines the various undergraduate majors that include a course in statistics as one of the curricular requirements.

**Ridge-penalized subset selection for regression §**

Matthew Corsetti and Derick Peterson

Department of Biostatistics and Computational Biology, University of Rochester Medical Center

__E-mail__: Matthew_Corsetti@urmc.rochester.edu

We propose a novel procedure for variable selection and high-dimensional parameter estimation in linear models. Ridge-Penalized Subset selection (RiPS) leverages the variable selection mechanics of best subsets (L_{0}) regression while simultaneously improving parameter estimation through the use of a ridge (L_{2}) penalty. By combining the L_{0} and L_{2} penalties, RiPS adaptively allows for both sparse and non-sparse solutions, with ridge regression and best subset regression as special cases. Our simulations indicate that RiPS uniformly dominates best subset selection and Ordinary Least Squares (OLS) with respect to mean squared prediction error. RiPS dramatically outperforms ridge regression, best subset selection, and OLS in moderate signal-to-noise settings and has comparable performance to ridge regression in low signal-to-noise settings.
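
One plausible way to write the combined criterion is the penalized least squares form below. This is an assumed formulation for illustration only; the abstract does not state the exact objective, and the tuning parameters \lambda_0 and \lambda_2 are hypothetical names.

```latex
\hat{\beta} \;=\; \arg\min_{\beta \in \mathbb{R}^p}
  \; \lVert y - X\beta \rVert_2^2
  \;+\; \lambda_0 \lVert \beta \rVert_0
  \;+\; \lambda_2 \lVert \beta \rVert_2^2 ,
\qquad \lambda_0 \ge 0,\; \lambda_2 \ge 0 .
```

Under this form, setting \lambda_0 = 0 reduces to ridge regression and setting \lambda_2 = 0 reduces to best subset selection, which matches the special cases named in the abstract.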

**An analytic approximation to the Bayesian decision statistic for continuous gravitational waves**

John Whelan and John Bero

College of Science, Rochester Institute of Technology

__E-mail__: jtwsma@rit.edu

We consider the Bayesian detection statistic for a targeted search for continuous gravitational waves, known as the B-statistic. This is a Bayes factor between signal and noise hypotheses, produced by marginalizing over the four amplitude parameters of the signal. We show that by Taylor-expanding to first order in certain averaged combinations of antenna patterns (elements of the parameter space metric), the marginalization integral can be performed analytically, producing a closed-form approximation in terms of confluent hypergeometric functions. We demonstrate using Monte Carlo simulations that this approximation is as powerful as the full B-statistic, and outperforms the traditional maximum-likelihood F-statistic, for several observing scenarios that involve an average over sidereal times. We also show that the approximation does not perform well for a near-instantaneous observation, so the approximation is suited to continuous wave observations rather than transient modelled signals such as compact binary inspiral.

**Bayesian and unsupervised machine learning for jazz music analysis §**

Qiuyi Wu

College of Science, Rochester Institute of Technology

__E-mail__: qw9477@rit.edu

Extensive studies have been conducted on both musical scores and audio tracks of western classical music with the aim of learning and detecting the key in which a particular piece of music was played. Both the Bayesian approach and modern unsupervised learning via latent Dirichlet allocation have been used for such learning tasks. In this research work, we venture out of the western classical genre and embrace and explore jazz music. We consider the musical score sheets and audio tracks of some of the giants of jazz like Duke Ellington, Miles Davis, John Coltrane, Dizzy Gillespie, Wes Montgomery, Charlie Parker, Sonny Rollins, Louis Armstrong, Bill Evans, Dave Brubeck, and Thelonious Monk. We specifically employ Bayesian techniques and modern topic modelling methods and a combination of both to explore tasks such as automatic improvisation detection, genre identification, key learning (how many keys did the giants of jazz tend to play in, and what are those keys), and even elements of the mood of the piece.

**SESSION 1A** **1W-501**

**Methods to Address Security Vulnerability**

**9:30-9:50 In pursuit of insights from vulnerability discovery metrics §**

Nuthan Munaiah and Andrew Meneely

Department of Software Engineering, Rochester Institute of Technology

Developers use a plethora of software metrics to discover and fix mistakes as they engineer software. A subset of these metrics, called vulnerability discovery metrics, have been proposed to help developers discover security vulnerabilities in software. However, despite promising empirical evidence, vulnerability discovery metrics are seldom relied upon in practice. In prior research, the effectiveness of these metrics has typically been expressed using the effectiveness of a prediction model that uses the metrics as explanatory variables. These prediction models, being black boxes, do not provide any insights to help contextualize the predictions. However, by systematically interpreting the models and metrics, we can provide developers with nuanced insights about factors that have led to security mistakes in the past. In this presentation, we will showcase a preliminary approach to using vulnerability discovery metrics to provide insightful feedback to developers as they engineer software. We collected ten metrics (churn, collaboration centrality, complexity, contribution centrality, nesting, known offender, source lines of code, number of inputs, number of outputs, and number of paths) from six open-source projects. We assessed the generalizability of the metrics across two contextual dimensions (application domain and programming language) and between projects within a domain, computed thresholds for the metrics using an unsupervised approach from the literature, and assessed the ability of these unsupervised thresholds to classify risk from historical vulnerabilities in the Chromium project. The preliminary approach that will be showcased is part of an ongoing research project to automatically aggregate insights from the various analyses of vulnerability discovery metrics to generate natural language feedback on security.

**9:55-10:15 Adv-DWF: Defending against deep-learning-based website fingerprinting attacks with adversarial traces §**

Mohammad Rahman, Nate Matthews, Aneesh Joshi, and Matthew Wright

Center for Cybersecurity, Rochester Institute of Technology

Mohsen Imani

Qualys, Inc.


Website Fingerprinting (WF) is a type of traffic analysis attack that enables a local passive eavesdropper to infer the victim’s activity even when the traffic is protected by encryption, a VPN, or some other anonymity system like Tor. Leveraging a deep-learning classifier, a WF attacker can gain up to 98% accuracy against Tor. Existing WF defenses are either too expensive in terms of bandwidth and latency overheads (e.g., 2-3 times as large or slow) or ineffective against the latest attacks. In this paper, we explore a novel defense, Adv-DWF, based on the idea of adversarial examples that have been shown to undermine machine learning classifiers in other domains. Our Adv-DWF defense adds padding to a traffic trace in a manner that fools the classifier into classifying it as coming from a different site. The technique drops the accuracy of the state-of-the-art attack augmented with adversarial training from 98% to 35%, while incurring a reasonable 56% bandwidth overhead. In most cases, the state-of-the-art attack’s accuracy against our defense is at least 45% and 14% lower than its accuracy against the state-of-the-art defenses WTF-PAD and Walkie-Talkie (W-T), respectively. The Top-2 accuracy against our defense is at best 56.9%, while it is over 98% for W-T. In addition, in most cases, the bandwidth overheads of our defense are at least 8% and 6% lower than those for WTF-PAD and W-T, respectively, showing its promise as a possible defense for Tor.

**SESSION 1B** **1W-502**

**Misleading Statistics**

**9:30-9:50 Critical numeracy**

Susan Mason, Jamie Hagerty, and Kayla Kolacz

Department of Psychology, Niagara University

Elizabeth Reid

Department of Mathematics, Marist College


After the 2016 election, college campuses renewed their commitment to teaching critical literacy. The need for an educated electorate was clear. Voters need to be on guard for “fake news” from tabloid journalists, Facebook trolls, and talking heads on cable stations. Critical literacy, the skill of analyzing and challenging texts, is an important life skill. The same is true of critical numeracy, which is the skill of analyzing and challenging mathematical information. The uninformed consumer is an easy target for those wishing to mislead with text or data. Examples of misleading statistics are easily found in advertising, science, sports, politics, the media, and the news. Common misrepresentations include implying causation from correlation, creating biased graphs, and generalizing from an unrepresentative sample. In each case, the presenter is trying to influence the audience’s interpretation of the information or event, rather than letting the results speak for themselves. When we teach statistics to undergraduate students, we know that some will go on to graduate school or directly into careers where they will be analyzing data. Others will follow a different path. Regardless of their next career steps, though, we have a responsibility to help our students become critical consumers of statistics.

**9:55-10:15 Decision-making and statistics**

Elizabeth Reid

Department of Mathematics, Marist College


Alice and Bob are arrested for robbing a bank and stealing a car. The prosecutors do not have enough evidence to convict Alice and Bob on the principal charge of bank robbery, but do have enough evidence to convict on the lesser charge of motor vehicle theft. If neither of them talks, they would both serve 1 year on the lesser charge. If both talk, they each get 3 years in prison. So why do Alice and Bob decide to cooperate with the prosecution instead of each other? Statistics and results from game theory can be very misleading. We will explore several different types of games and analyze both pure and mixed strategies in the context of a game being played a single time.
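
The puzzle above can be checked with a small best-response search. The sketch below is illustrative only: the abstract gives the sentences when both stay silent (1 year each) and when both talk (3 years each), but not what happens when only one talks, so the 0-year and 5-year sentences for that case are assumed values chosen in the usual prisoner's dilemma pattern.

```python
from itertools import product

ACTIONS = ("silent", "talk")

# YEARS[(alice_action, bob_action)] = (Alice's sentence, Bob's sentence)
YEARS = {
    ("silent", "silent"): (1, 1),  # stated in the abstract
    ("talk", "talk"): (3, 3),      # stated in the abstract
    ("talk", "silent"): (0, 5),    # assumed: the lone talker goes free
    ("silent", "talk"): (5, 0),    # assumed: the lone talker goes free
}


def is_pure_nash(alice, bob):
    """True if neither player can shorten their own sentence by deviating alone."""
    alice_ok = all(YEARS[(alice, bob)][0] <= YEARS[(alt, bob)][0]
                   for alt in ACTIONS)
    bob_ok = all(YEARS[(alice, bob)][1] <= YEARS[(alice, alt)][1]
                 for alt in ACTIONS)
    return alice_ok and bob_ok


equilibria = [pair for pair in product(ACTIONS, ACTIONS) if is_pure_nash(*pair)]
```

Under these assumed sentences, talking is each player's best response whatever the other does, which is why mutual silence is not stable in a game played once even though it gives both players a shorter sentence.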

**SESSION 1C** **1W-509**

**Statistical Investigations in Sports Analytics**

**9:30-9:50 Inferential and sensitivity studies of the baseball wins above replacement metric §**

Rob Weber

Statistics and Data Sciences Program, St. John Fisher College


The Wins Above Replacement (WAR) metric for player evaluation, although widely used by fans and organizations, has not been closely studied. Although there are several competing versions of the metric, each using different weights for the various components, inferential comparisons of the versions have not been undertaken previously. Additionally, the sensitivity of the metrics to the weights has not been explored. The present study compares the various versions, both by considering the ability of WAR to predict actual wins and by comparing the assignment of values to players. We found no significant differences among several of the most common versions. We also tested the sensitivity of WAR to changes in the weights used and found that even extreme changes in weights have relatively little effect, indicating that the metrics are relatively insensitive to the details of the weighting. These results help to identify limitations of the current WAR metric and indicate potential extensions of the metric.

**9:55-10:15 Optimizing NFL receiver route combinations §**

Jacob Tarnowski and Kaili Saffran

Statistics and Data Sciences Program, St. John Fisher College

Using data from the NFL season, statistical analysis was performed in R Markdown to optimize receiver route combinations based on the offensive formation, defensive formation, and how many defenders are in the box. First, we defined a successful route as when a receiver achieves a separation from the nearest defender that is greater than or equal to the average distance between a receiver and the nearest defender given that the receiver caught the ball. Next, we determined the route that each receiver ran for each play by comparing the quartiles of the coordinates for each route, and then clustering them so that the most similar ones were grouped together. We then were able to determine which route had the highest chance of success when provided the receiver, offensive formation, defensive formation, and the number of defenders in the box. Performing this analysis is significant because it will provide important statistical information for receivers and quarterbacks in the NFL to enable them to choose routes that will be successful based on the type of defense they are facing.
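
The success definition above can be illustrated with a short Python sketch (the authors worked in R Markdown; Python is used here only for illustration). The record fields `route`, `separation`, and `caught` are hypothetical names, and the route labels are assumed to be already assigned, so the quartile comparison and clustering steps are not shown.

```python
from collections import defaultdict
from statistics import fmean


def success_rate_by_route(plays):
    """Label each play as a success and summarize success rates per route.

    A play counts as a success when the receiver's separation from the
    nearest defender is at least the average separation on caught passes.
    Raises statistics.StatisticsError if no play was caught.
    """
    threshold = fmean(p["separation"] for p in plays if p["caught"])
    counts = defaultdict(lambda: [0, 0])  # route -> [successes, attempts]
    for p in plays:
        counts[p["route"]][0] += p["separation"] >= threshold
        counts[p["route"]][1] += 1
    rates = {route: wins / total for route, (wins, total) in counts.items()}
    return threshold, rates
```

In the study, the same tabulation would additionally be conditioned on the receiver, the offensive and defensive formations, and the number of defenders in the box.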

**SESSION 1D** **1W-510**

**Methods and Applications in Biostatistics**

**9:30-9:50 Modeling growth dynamics of premature and termed infants: A SAS macro tool**

Hongmei Yang and Sanjukta Bandyopadhyay

Department of Biostatistics and Computational Biology, University of Rochester Medical Center

Kristin Scheible, Mary Caserta, and Gloria Pryhuber

Department of Pediatrics, University of Rochester Medical Center

Ann Falsey

Department of Medicine, University of Rochester Medical Center

David Topham

Department of Microbiology and Immunology, University of Rochester Medical Center

The World Health Organization (WHO) offers guidelines and tools for growth standards that depict expected ranges and trajectories of anthropometric measurements for babies born at 40 weeks Post Menstrual Age (PMA) or later and aged up to 5 years. The Fenton group published growth chart data for pre-term babies born by 36 weeks PMA but aged up to 50 weeks. Babies born prematurely follow a different growth trajectory from term-birth babies before termed age, and thus it is inappropriate to use the WHO tool to calculate their z-scores for growth monitoring, and a user-friendly tool is needed for this group of infants. In addition, neither WHO’s nor Fenton’s growth chart provides growth standards for term-birth babies born between 37 and 40 weeks PMA. We aimed to provide a universal SAS macro tool for z-score calculation for all babies aged up to 5 years regardless of their birth terms. Furthermore, we sought to shed light on missing growth standards for term-birth babies born before 40 weeks by applying Cole’s LMS method to the growth data collected at URMC. The LMS models three moments: M (Median) for center, S (Coefficient of variation) for scale, and L (Box-Cox power) for skewness, where these moments are spline smoothed to age. The result suggests that a smooth transition between Fenton’s and WHO’s growth curves is achieved by 50 weeks PMA. Potential limitations of the macro and suggestions for future developments are discussed. It is hoped that the program will be of use to clinicians as well as to the research community for nutrition and developmental studies of both preterm and full-term infants as a user-friendly tool.
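
For readers unfamiliar with the LMS method, the z-score for a measurement follows from the three age-specific parameters through the standard LMS transformation. The function below is a minimal Python sketch of that formula, not the SAS macro described in the abstract; the parameter values a user passes in would come from the relevant growth reference tables.

```python
import math


def lms_z_score(y, L, M, S):
    """Convert a measurement y to a z-score using LMS parameters.

    L is the Box-Cox power, M the median, and S the coefficient of
    variation, all evaluated at the child's age.
    """
    if y <= 0 or M <= 0 or S <= 0:
        raise ValueError("y, M, and S must be positive")
    if abs(L) < 1e-12:
        # Limiting case of the Box-Cox transform as L approaches 0.
        return math.log(y / M) / S
    return ((y / M) ** L - 1) / (L * S)
```

A measurement equal to the median M always maps to a z-score of 0, regardless of L and S.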


**9:55-10:15 Model-based clustering for high-dimensional longitudinal data with regularization §**

Luoying Yang and Tongtong Wu

Department of Biostatistics and Computational Biology, University of Rochester Medical Center


In this talk I introduce a model-based clustering method with variable selection for high-dimensional longitudinal data. This research is motivated by the Trial of Activity in Adolescent Girls (TAAG), which aimed to examine multi-level factors related to the change in physical activity by following up a cohort of 783 girls over 10 years from adolescence to early adulthood. Our goal is to identify the intrinsic grouping of subjects with similar patterns of physical activity trajectories and the most relevant predictors among over 800 candidate variables within groups. Existing methods only allow clustering and variable selection conducted over two steps, while our new method can perform the tasks simultaneously. By assuming each subject is drawn from a finite Gaussian mixture distribution, model effects and cluster labels are estimated based on the restricted maximum log-likelihood via the Expectation-Maximization algorithm, with Smoothly Clipped Absolute Deviation (SCAD) penalty and group lasso (L_{2} norm) penalty applied on the fixed effects and random effects, respectively, to induce sparsity in predictors for efficient parameter estimation and identification. Bayesian Information Criterion is used to determine the optimal cluster number and tuning parameter values for the penalties. Our numerical studies show that the new model has advantages over other existing clustering methods such as faster computation and more accurate clustering, and is able to accommodate complex data with multi-level and longitudinal effects.
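
As background on the SCAD penalty mentioned above, the following sketch evaluates the standard SCAD penalty function for a single coefficient. It illustrates the penalty only, not the authors' estimation procedure; the default `a = 3.7` is the conventional choice from the SCAD literature and is an assumption here, since the abstract does not report tuning values.

```python
def scad_penalty(theta, lam, a=3.7):
    """Evaluate the SCAD penalty for one coefficient.

    The penalty is linear (lasso-like) near zero, quadratic in a transition
    zone, and constant for large coefficients, so large effects are not
    shrunk further.
    """
    if lam < 0 or a <= 2:
        raise ValueError("require lam >= 0 and a > 2")
    t = abs(theta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2
```

The three pieces join continuously at |theta| = lam and |theta| = a * lam, which is what gives the penalty its smoothly clipped shape.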

**SESSION 2A** **1W-501**

**Machine Learning: Methods and Applications**

**10:25-10:45 Multi-stage fault warning for large electric grids using anomaly detection and machine learning §**

Sanjeev Raja

College of Engineering, University of Michigan

Ernest Fokoué

College of Science, Rochester Institute of Technology


In the monitoring of a complex electric grid, it is of paramount importance to provide operators with early warnings of anomalies detected on the network, along with a precise classification and diagnosis of the specific fault type. In this paper, we propose a novel multi-stage early warning system prototype for electric grid fault detection, classification, subgroup discovery, and visualization. In the first stage, a computationally efficient anomaly detection method based on quartiles detects the presence of a fault in real time. In the second stage, the fault is classified into one of nine pre-defined disaster scenarios. The time series data are first mapped to highly discriminative features by applying dimensionality reduction based on temporal autocorrelation. The features are then mapped through one of three classification techniques: support vector machine, random forest, and artificial neural network. Finally in the third stage, intra-class clustering based on dynamic time warping is used to characterize the fault with further granularity. Results on the Bonneville Power Administration electric grid data show that i) the proposed anomaly detector is both fast and accurate; ii) dimensionality reduction leads to dramatic improvement in classification accuracy and speed; iii) the random forest method offers the most accurate, consistent, and robust fault classification; and iv) time series within a given class naturally separate into five distinct clusters that correspond closely to the geographical distribution of electric grid buses.
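
A quartile-based detector of the kind named in the first stage could look like the sketch below. This is an assumption-laden illustration: the abstract does not specify the rule, so the familiar interquartile-range fences (with multiplier `k = 1.5`) stand in for the authors' method, and the function names are hypothetical.

```python
import statistics


def quartile_fences(baseline, k=1.5):
    """Return (low, high) fences computed from the quartiles of baseline data."""
    q1, _, q3 = statistics.quantiles(baseline, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_anomalies(baseline, stream, k=1.5):
    """Return the indices of stream values that fall outside the fences."""
    low, high = quartile_fences(baseline, k)
    return [i for i, x in enumerate(stream) if x < low or x > high]
```

Because the fences depend only on two quartiles of a reference window, each new reading is checked with two comparisons, which is consistent with the computational efficiency the abstract emphasizes.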

**10:50-11:10 Zonal influence classification: A robust extension of the nearest neighbors learning paradigm**

Ernest Fokoué

College of Science, Rochester Institute of Technology

This research work seeks to harness and extend the intuitive appeal of the nearest neighbors learning paradigm by proposing, developing and applying a novel concept that I define as zonal influence, used to compute the relative importance of each training vector within its class. A discriminant function for each class is thereafter created as a kernel expansion with the zonal influence score of each point as the main coefficient. The resulting function estimator is the zonal influence classifier (ZIC). It is worth mentioning that, as with other nearest-neighbor-like machines, ZIC adapts readily to regression. We present various simulated and real-life examples that demonstrate the competitive predictive performance of ZIC.

**SESSION 2B** **1W-502**

**Experimental Design**

**10:25-10:45 Simulation-based inference on the response of a {q,2} simplex-lattice design §**

Tejasv Bedi and Robert Parody

College of Science, Rochester Institute of Technology


Many products are formulated using specific combinations of ingredients put together as mixtures. Mixture designs are a special type of response surface designs that are often used to decide which combinations of the mixture components will yield the optimal properties for that product. For researchers and practitioners, it is of further interest to obtain interval estimates of the predicted response in mixture models. Traditionally, Scheffé’s simultaneous confidence intervals have been put to use to perform such a task, but it was often the case that the intervals estimated were very wide because of a conservative critical point. The idea behind this research is to introduce simulation-based confidence intervals for {q,2} simplex-lattice designs (basic form of mixture designs) that are much less conservative and provide tighter interval estimates of the predicted response in mixture models.


**10:50-11:10 A study of robustness of the serial dilution designs**

Leonid Khinkis and Milburn Crotzer

Department of Mathematics and Statistics, Canisius College

Nonlinear regression modelling with sigmoid curves is commonly used in toxicology and bioassays. These curves are often represented by a variant of a logistic (Hill) function. The Fisher information matrix plays a key role in the parameter estimation and is also pivotal in choosing an efficient experimental design. Serial dilution designs (aka geometric designs) cover the entire curve and use the points that are distributed evenly on a logarithmic scale. As such, they are a popular design choice, particularly, in the preliminary studies. This paper examines robustness of the D-optimal dilution designs with respect to the parameter values assumed for their calculation. Our work is distinct from the reports on the topic available in the published literature since it incorporates the notion of the global curvature of a nonlinear regression model developed by the authors.

**SESSION 2C ****1W-509**

** **

**Analysis of Genetic/Genomic Data**

** **

**10:25-10:45 Survival analysis and feature derivation using latent Dirichlet allocation §**

Yi Liu Chen

Department of Mathematics, SUNY Geneseo

** **

Millions of people are diagnosed with some form of cancer, with breast and prostate cancer being the most common in women and men, respectively. Various treatments are available, but each of them shows mixed results based on the clinical characteristics of the patient. Survival analysis is a critical component in deciding upon a certain treatment regime. Additional data such as the genomic data of patients can be applied, but they suffer from high dimensionality. We introduce several dimension reduction methods on genomic data. In addition to popular methods such as principal component analysis, we utilize Latent Dirichlet Allocation (LDA), a topic model, to reduce the dimension of genomic data. LDA reduces the genes in each patient to latent variables known as topics. We perform our analysis on two sets of microarray data: GEO for prostate cancer and METABRIC for breast cancer. We create LDA models for both datasets, and we implement survival analysis on the METABRIC data. Results from the GEO dataset demonstrate the computational cost of applying LDA to genomic data. For application, the METABRIC models show that the inclusion of genomic data in survival analysis after dimension reduction provides better performance than using only clinical data.

** **

**10:50-11:10 Likelihood-based mixture modeling of genetic regulatory networks §**

David Burton and Matthew McCall

Department of Biostatistics and Computational Biology, University of Rochester Medical Center

Gene regulatory networks (GRNs) encode interactions within cells that control the expression levels of genes, and thereby, proteins. GRNs have been shown to play a vital role in both normal cellular function and malignancy. Despite this, there are few statistical methods for estimating GRNs from gene expression measurements. Further, most available methods focus upon delivering a single network estimate by maximizing efficiency and utilizing dimensionality reduction to examine the vast space of possible networks. Rather than one estimate, we believe a posterior distribution of possible networks would be more appropriate. We propose a ternary network mixture model with likelihood-based scores that can deliver multiple network estimates from a dataset. By utilizing a parallel tempering algorithm, likelihood-based mixture modeling can deliver a maximum likelihood estimate of a gene regulatory network given gene expression data. Our method is demonstrated on qPCR data from single perturbation experiments conducted in murine cancer cell lines. This method will be included in an upcoming release of the ‘ternarynet’ package in the Bioconductor environment for R.

**SESSION 2D ****1W-510**

** **

**Applications of Data Science**

** **

**10:25-10:45 Childhood Trauma: It doesn’t end there §**

Nicole Pellman

Statistics Program and School of Education, St. John Fisher College

** **

Childhood trauma can have various effects on an individual’s life, far past the span of childhood. This project will explore the effects of childhood traumatic experiences on four aspects of life: physical, emotional, social, and academic. The framework of this project is based upon the work of Abraham Maslow, addressing how a child can be affected if their needs are not met. Maslow stated that there is a hierarchy of needs, and higher levels of needs cannot be met until the basal levels are met. As traumatic experiences can impact the status of needs being met, this is an important concept to consider in this analysis because academic performance is at a higher level of need. It is important for educators and other professionals to know what is going on within students so that they are able to best serve them. To take a deeper look into how children are affected by trauma, this study will use path models and regression trees to explore how each aspect of life is impacted by traumatic experiences. The data utilized for this project come from the 2016 National Survey of Children’s Health. Additionally, this study will investigate the altered results that occur in these models when a mentor figure is present in a student’s life. The results of these analyses will indicate steps that educators can take to intervene and alter the effects of traumatic experiences among children.

** **

**10:50-11:10 Risk analysis using data analysis: Predicting credit card default probability §**

George Kuliner

Department of Mathematics, SUNY Geneseo

In my talk, I will review various classification and clustering methods to determine whether or not a client will default on their credit card based on amount loaned, age, sex, marital status, and previous payment history. We train the model on a small subset of clients that are known to have defaulted or not. The methods include KNN classification, Bayes classification, support vector machine, and artificial neural networks. We merge these methods with classical statistical methods to estimate the population proportion of default rates. Beginners in data science are encouraged to attend.

**SESSION 2E ****SRB 1.416**

** **

**Network Methods and Data**

**10:25-10:45 A kernel extension of regression analysis with network cohesion §**

Eddie Pei

College of Science, Rochester Institute of Technology

Improving the predictive performance of regression models is typically achieved by either collecting more meaningful data or using more powerful models. With the availability of network data coming from social networks, many researchers have recently incorporated network information into regression models, leading to better predictive performance. Most of the models in existence have been from the generalized linear model family, mainly because of the need for interpretability. In this work, we present an extension via kernelization to capture nonlinearities and extend the input space beyond continuous variables. In this presentation, we will show some derivations of the kernel version of regression with network cohesion and also show the initial computational results on simulated and real-life data. Ultimately, we intend to construct an R package of methods for network cohesion data.

**10:50-11:10 Recommender systems for session-based data**

Iordan Slavov and Naeem Nowrouzi

Hunter College, CUNY

Recommender systems have grown in importance and application diversity, from recommendation of Facebook friends to applications in healthcare and finance. The most commonly used recommenders are neighborhood-based Collaborative Filtering (CF). In session data, i.e., data produced by the interactions of a user with a product during a given timeframe, CF is not the best choice since it ignores the order of the events in the data. Recurrent Neural Networks (RNNs) are designed to take advantage of such sequential data. Gated Recurrent Units (GRUs) were introduced to deal with the vanishing/exploding gradient problem that occurs in RNNs. It was recently shown that a recommender based on a GRU and augmented with a special sampling outperforms the common CF methods by up to 50% in terms of accuracy (Recall@N) on the RecSys Challenge 2015 click-stream dataset. RNNs are commonly fitted with backpropagation through time (BPTT), which has computation and memory cost that scales linearly with the number of time steps. For click-stream datasets, the sessions may be very long, which would render BPTT impractical. Recently, variants of another RNN fitting approach, the Recurrent Backpropagation (RBP), were shown to outperform BPTT. In this communication, we show through data experiments that RBP modified for session data is the most efficient of the considered methods in terms of computation and memory cost.
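For readers unfamiliar with the metric, Recall@N in next-item recommendation is commonly computed as the share of test events whose true next item appears among the top N recommendations. The sketch below assumes that definition; the function names and toy sessions are hypothetical and are not taken from the study.

```python
def recall_at_n(ranked_items, next_item, n):
    """Return 1.0 if the true next item is among the top-n recommendations."""
    return 1.0 if next_item in ranked_items[:n] else 0.0


def mean_recall_at_n(predictions, targets, n):
    """Average the per-event hits over all test events."""
    hits = [recall_at_n(r, t, n) for r, t in zip(predictions, targets)]
    return sum(hits) / len(hits)


# Each inner list is a ranked recommendation list for one test event.
predictions = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"], ["b", "a", "c"]]
targets = ["b", "f", "z", "b"]  # the item actually clicked next

score_at_2 = mean_recall_at_n(predictions, targets, 2)
score_at_3 = mean_recall_at_n(predictions, targets, 3)
print(score_at_2, score_at_3)
```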

**SESSION 3A ****1W-501**

** **

**Biostatistical Methods for Hypothesis Testing**

** **

**11:20-11:40 Adaptive dose-response studies with generalized multiple contrast tests §**

** **

Shiyang Ma and Michael McDermott

Department of Biostatistics and Computational Biology, University of Rochester Medical Center

** **

Adaptive designs can be efficient and highly informative when used appropriately. We propose a two-stage adaptive design based on generalized multiple contrast tests (GMCTs) for detecting a dose-response signal, i.e., establishing proof-of-concept (PoC). In the first stage, a candidate set of several plausible dose-response models with pre-specified parameter values is chosen and, for each model, a test is performed for significance of the optimally chosen contrast among the dosage means. At the interim analysis, using the first stage data, the parameters of the candidate models are estimated and the optimal contrasts are adapted based on the updated models for use in the second stage. Within each stage, a GMCT is used to combine the dependent p-values arising from those contrast test statistics and stage-wise overall p-values are obtained. These two stage-wise p-values are then combined using combination tests, such as Fisher’s product test or the inverse normal combination test, to detect a PoC signal. Simulation studies show that if the optimal contrast associated with the true dose-response model is highly correlated with those of the candidate models, the adaptive design is slightly less powerful than the corresponding non-adaptive design. In contrast, if the optimal contrast associated with the true dose-response model is not highly correlated with those of the candidate models, i.e., if the selection of the candidate set of models is not well informed by evidence from preclinical and early-phase studies, the proposed adaptive design is more powerful.
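The final step, combining the two stage-wise p-values, can be illustrated with the two combination tests named above. This is a minimal sketch for two independent stage-wise p-values strictly between 0 and 1; the function names and the equal default weighting are assumptions, and the GMCT step that produces the stage-wise p-values is not shown.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def fisher_combination(p1, p2):
    """Fisher's product test for two independent p-values.

    Under the null, -2 * (ln p1 + ln p2) is chi-square with 4 df, whose
    upper tail has the closed form exp(-x / 2) * (1 + x / 2).
    """
    x = -2.0 * (log(p1) + log(p2))
    return exp(-x / 2.0) * (1.0 + x / 2.0)


def inverse_normal_combination(p1, p2, w1=0.5):
    """Weighted inverse normal combination of two stage-wise p-values.

    w1 is the pre-specified weight for stage one (an assumption here);
    stage two receives 1 - w1.
    """
    z = NormalDist()
    w2 = 1.0 - w1
    stat = sqrt(w1) * z.inv_cdf(1.0 - p1) + sqrt(w2) * z.inv_cdf(1.0 - p2)
    return 1.0 - z.cdf(stat)


print(fisher_combination(0.04, 0.03))
print(inverse_normal_combination(0.04, 0.03))
```

In both cases the weights or the combination rule must be fixed before the interim analysis for the overall type I error rate to be controlled.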

** **

**11:45-12:05 Combining dependent p-values with a quantile-based approach §**

Yu Gu, Michael McDermott, and Xing Qiu

Department of Biostatistics and Computational Biology, University of Rochester Medical Center

Statistical methods for combining p-values are widely used in bioinformatics and other research fields. Classical methods, such as Fisher’s method and Stouffer’s method, focus on combining independent p-values. In recent years, there has been an increasing interest in combining a large number of dependent p-values. In addition, in many practical applications, the observed p-values may contain outliers due to technical issues present in the data. Robust p-value combination methods are required to obtain correct inferences for these data. We propose a robust p-value combination method based on a quantile of the transformed p-values that is robust to outliers and capable of incorporating the correlation structure of the data. We derived the asymptotic distribution of the overall test statistic, i.e., the q^{th} quantile of the inverse normal-transformed p-values Φ^{-1}(*p*_{i}), as well as the theoretical type I error probability and statistical power of our test based on large sample theory. These results were verified by thorough simulation studies. In additional simulation studies, the proposed robust p-value combination method was compared with several competing methods, including Fisher’s method, Stouffer’s method, and Kost and McDermott’s method for combining dependent p-values. We showed that the proposed method was the only one that controlled the type I error probability at the nominal level when outliers were present and the p-values were dependent. The advantage of the proposed method increased as the dependence increased. Furthermore, this gain in robustness did not result in significant loss of statistical power in the absence of outliers. Theoretical derivation showed that even when the p-values were mutually independent and there were no outliers present, the Pitman asymptotic relative efficiency between the proposed method and Fisher’s method was acceptable. This theoretical result was verified by simulation studies.

**SESSION 3B ****1W-502**

** **

**Probability Distributions**

** **

**11:20-11:40 Working with an unknown distribution**

Bernard Ricca, Kris Green, and Mark McKinzie

Department of Mathematics, Computer Science, and Statistics, St. John Fisher College

** **

Many statistics and data science programs include calculus, probability, and linear algebra because these concepts provide the basis for almost all statistical methods. This paper, however, presents an application of these math ideas directly to the solution of a problem, which provides additional motivation for students to master these fields. The problem explored is how to compare a sample distribution to a population whose mean, median, standard deviation, and size are known, but whose distribution is both unknown and unlikely to be normal. The data come from student course evaluations at a small liberal arts college, so claims about whether an instructor’s results are significantly above or below the department or college mean are of more than passing interest. Due to the finite precision of the reported population means and standard deviations, a brute force approach is infeasible. However, linear algebra techniques can be used to reduce the search space to a manageable size. Even then, there are multiple solutions for the population distribution that fits the known parameters, so an approach for choosing among those solutions is developed. Additionally, Lagrange multipliers can be used to produce an answer. Sample results will be presented and limitations of the methods will be discussed.

** **

**11:45-12:05 Probability Playground: An interactive website for exploring probability distributions and their relationships §**

Adam Cunningham

Department of Biostatistics, SUNY at Buffalo

With increasing numbers of students studying probability theory, statistics, and data science, the need for online educational materials in this area has never been greater. Although many websites exist for this purpose, for the most part they consist of purely static written materials and graphics. The few interactive websites exploring probability distributions are limited in scope, and none that we are aware of illustrate the relationships between different distributions. Probability Playground (http://www.acsu.buffalo.edu/~adamcunn/probability/probability.html) was designed from the beginning to be a highly interactive website for exploring probability distributions and their relationships. The design philosophy focuses on developing intuition through exploration. It uses web technologies such as JavaScript, jQuery and D3 graphics to implement several novel features, including: (1) an intuitive interface for exploring the shapes of over 20 probability mass and density functions; (2) dynamic loading of examples illustrating the range of shapes distributions can take; and (3) interactive exploration of the relationships between distributions through transformation of variables, summing variables, sampling, and limiting distributions. The intended audience for Probability Playground includes undergraduate and graduate students looking to gain an intuitive understanding of the most common probability distributions, high school students taking classes such as AP Statistics, and educators looking for interactive materials to use in their teaching. Probability Playground is a novel contribution for both its scope and degree of interactivity. It provides a valuable addition to the field of statistics and data science education.

**SESSION 3C ****SRB 1.416**

** **

**Applications of Data Science and Technology in Education**

** **

**11:20-11:40 Using topic modeling to classify course descriptions**

Kirk Anne

Computing and Information Technology, SUNY Geneseo

** **

With changes in SUNY general education requirements and financial aid rules, SUNY Geneseo is reviewing its general education requirements to accommodate the new rules. The challenge is to create an inventory of courses and how each course contributes to a set of learning outcomes. This presentation will walk through the analysis of over 1,000 descriptions and the use of topic modeling and word similarity to partially automate the inventory process.

** **

**11:45-12:05 Introduction to Blockchain: Incorporating emerging technologies in the classroom §**

Travis Brodbeck

Siena College Research Institute

Necip Doganaksoy

Department of Accounting and Business Law, Siena College

Students embark on their academic journey through colleges and universities to prepare themselves with the necessary skills and knowledge to be effective in the workplace. Oftentimes, academic institutions prepare their students for the current needs of the market rather than the future needs of employers. Preparing students for the needs of the past is not a purposeful decision but rather a problem of resources, complexity, and expertise. Blockchain technology is a relatively new concept that is in its early beginnings of being adopted in the classroom, although its adoption by businesses is ever increasing. This presentation focuses on bridging the gap of the knowledge of current students and needs of future employers by taking a complex concept like Blockchain and breaking its basic properties down into hands-on exercises that build a foundation for students to deepen their understanding through future research and application.

**SESSION 3D ****1W-509**

** **

**Connections between Machine Learning and Statistics**

** **

**11:20-11:35 Predicting hospital re-admission for patients hospitalized with diabetes §**

Xiaoyu Wan

Goergen Institute for Data Science, University of Rochester

** **

In past decades, hospital readmissions have been the subject of retrospective surveys and prospective trials with a view to their prevention. A hospital readmission is when a discharged patient gets readmitted to a hospital within a certain period of time. The need for hospital readmission for certain conditions indicates the hospital quality. Identifying patients at high risk early in hospitalization can help to reduce the readmission rate, and hospitals can focus on preparing readmission for patients at high risk to shorten the length of readmission. This study is a secondary analysis using machine learning and statistical methods. The data set includes 101,766 observations, representing 10 years (1999-2008) of clinical care at 130 hospitals and integrated delivery networks across the United States. The goal of our analysis is to identify the determining factors that lead to risk of readmission and, correspondingly, to predict which patients will get readmitted. We conducted analyses using logistic regression, decision trees, random forests, and an XGBoost classifier to predict the readmission rate. Each algorithm was evaluated using 10-fold stratified cross-validation. All of our algorithms were evaluated using the area under the curve (AUC), which is equivalent to the *c*-statistic in the binary classification scenario. In comparing the four models, XGBoost performed best for predicting the readmission rate, achieving the highest accuracy of 0.94, with an AUC of 0.61. The second-best model was random forest, which achieved 0.92 accuracy and an AUC of 0.94. In this study, we also identified the most important factors as time-in-hospital, number of inpatient stays, and number of diagnoses, which appear to be associated with the severity of the disease.
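The equivalence between the AUC and the *c*-statistic can be made concrete: for a binary outcome, both equal the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below uses made-up labels and scores, not the study data, and the function names are hypothetical. It also shows why accuracy and AUC can disagree when classes are imbalanced.

```python
def auc_c_statistic(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a concordant pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Share of cases classified correctly at a fixed score threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    return sum(int(p == y) for p, y in zip(predicted, labels)) / len(labels)


# Toy, imbalanced example: 8 negatives and 2 positives.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.15, 0.25, 0.45, 0.3, 0.6]

auc = auc_c_statistic(labels, scores)
acc = accuracy(labels, scores)
print(auc, acc)
```

Here accuracy is driven mostly by the majority class, while the AUC depends only on how well the scores rank positives above negatives.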

** **

**11:35-11:50 Movie recommendation system using the MovieLens data set §**

He Huang

Goergen Institute for Data Science, University of Rochester

A movie recommendation system is designed to recommend movies to users. MovieLens (a movie recommendation service) data sets are used in this project. Collaborative Filtering using KNN is the primary method for predicting the user ratings on movies. We implemented two comparative memory-based Collaborative Filtering approaches: one is user-based, and another is item-based. Also, a model-based Collaborative Filtering approach using SVD was proposed in order to optimize the performance in large data sets. Finally, the performance of the three algorithms was evaluated. In movie data sets, there were three columns indicating movie ID, title, and genres. There were 9742 rows of non-null data in these data sets. Our methods were as follows: (1) In user-based Collaborative Filtering, a score for an unrated item was produced by combining the ratings of users similar to the user. The idea is that people similar to a user have liked *Y*; therefore, we predict that the user will like *Y*. (2) In item-based Collaborative Filtering, a rating (*u*, *i*) was produced by looking at the set of items similar to *i* (interaction similarity); then the ratings by *u* of similar items were combined into a predicted rating. The idea is that a user previously liked similar items to *Y*; therefore, we predict that the user will like *Y*. (3) SVD is an algorithm that decomposes a matrix R into the best lower rank (i.e., smaller/simpler) approximation of the original matrix R. We used RMSE and run time to evaluate the three approaches to find the most suitable parameters to generate the lowest RMSE.
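The user-based approach in (1) and the RMSE criterion can be sketched in a few lines. This is an illustration under stated assumptions rather than the project's code: cosine similarity over co-rated movies, a similarity-weighted average over the k nearest neighbours, and a tiny invented ratings table are all choices made here for brevity.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two users over the items both have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)


def predict_user_based(ratings, user, item, k=2):
    """Predict a rating from the k most similar users who rated the item."""
    neighbours = [
        (cosine_similarity(ratings[user], r), r[item])
        for other, r in ratings.items()
        if other != user and item in r
    ]
    neighbours.sort(key=lambda pair: pair[0], reverse=True)
    top = [(s, r) for s, r in neighbours[:k] if s > 0]
    if not top:
        return None  # no usable neighbour for this item
    return sum(s * r for s, r in top) / sum(s for s, _ in top)


def rmse(predicted, actual):
    """Root mean squared error between predicted and observed ratings."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


# Invented ratings on a 1-5 scale (positive, so the norms are never zero).
ratings = {
    "ann": {"m1": 5, "m2": 4},
    "bob": {"m1": 5, "m2": 4, "m3": 2},
    "cat": {"m1": 1, "m2": 5, "m3": 5},
}
prediction = predict_user_based(ratings, "ann", "m3", k=2)
print(prediction)
```

The item-based variant follows the same pattern with the roles of users and items swapped.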

**11:50-12:05 Leverage subsampling method §**

Yuexi Wang

Goergen Institute for Data Science, University of Rochester

One popular method for dealing with large-scale data is sampling. One can first choose a small portion of the full data, and then use this sample to carry out computations of interest for the full data. The most commonly used method is uniform subsampling, which means that each point has the same probability of being chosen for the sample. In many situations, it is very easy to construct a “worst-case” input for which uniform random sampling will perform poorly. A more recent approach is sampling based upon statistical leverage scores. A leverage score is a measure of how far away the independent variable values of an observation are from those of other observations. In this approach, instead of assigning each data point the same probability, sampling probabilities are assigned based on the leverage scores so that there is a better chance of getting high leverage points in the sample. We generated three data sets using different leverage distributions: (1) GA data: *X* is generated from a multivariate normal distribution; leverage scores are close to uniform. (2) T3 data: *X* is generated from a multivariate *t* distribution with 3 df; leverage scores are moderately skewed. (3) T1 data: *X* is generated from a multivariate *t* distribution with 1 df; leverage scores are severely skewed. Each data set has 5000 observations with 50 features. The number of simulations was 1000, and several subsample sizes were studied. As the sample size increases, the performance of the uniform, true, and approximate leverage methods converges. The uniform sampling method performs very poorly when the data have skewed leverage. From simulated examples, we can clearly tell that leverage subsampling is a useful method to sample big data.
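The idea can be sketched for the simplest case, a regression with an intercept and a single predictor, where the leverage score has the closed form h_i = 1/n + (x_i - x̄)² / Σ(x_j - x̄)². The study itself used 50 features; the one-predictor setting, the function names, and sampling with replacement are simplifying assumptions made here.

```python
import random


def leverage_scores_simple(x):
    """Leverage scores for a regression with an intercept and one predictor."""
    n = len(x)
    mean = sum(x) / n
    sxx = sum((v - mean) ** 2 for v in x)
    return [1.0 / n + (v - mean) ** 2 / sxx for v in x]


def leverage_subsample(x, size, rng):
    """Draw row indices with probability proportional to leverage."""
    h = leverage_scores_simple(x)
    total = sum(h)  # equals the number of fitted coefficients (2 here)
    probs = [v / total for v in h]
    idx = rng.choices(range(len(x)), weights=probs, k=size)
    return idx, probs


rng = random.Random(0)
# 200 well-behaved points plus one extreme, high-leverage point.
x = [rng.gauss(0, 1) for _ in range(200)] + [25.0]
idx, probs = leverage_subsample(x, 20, rng)
print(probs[-1], 1.0 / len(x))  # leverage vs. uniform probability for the extreme point
```

When a model is then fitted to the subsample, the sampled rows are usually reweighted by the inverse of their sampling probabilities to offset the unequal selection.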

**SESSION 3E ****1W-510**

** **

**Methods for Variable Selection and Measuring Variable Importance**

** **

**11:20-11:40 Ridge-penalized subset selection for regression**

Derick Peterson and Matthew Corsetti

Department of Biostatistics and Computational Biology, University of Rochester Medical Center

** **

Bridge regression is often used to constrain the complexity of fitted regression models. It is well known that only bridge indices between 0 (best subsets) and 1 (LASSO) induce automatic subset selection by shrinking some regression coefficients all the way to 0, whereas larger bridge indices such as 2 (ridge) admit nonzero estimates of all regression coefficients and better handle multicollinearity and diffuse signals. The elastic net combines the L_{1} penalty with a modified L_{2} penalty, inheriting the computational efficiency associated with bridge indices of at least 1 as well as the tendency of both LASSO and ridge to over-shrink coefficients on useful predictors while including many noise variables. Our Ridge-Penalized Subset selection (RiPS) method combines the L_{0} and L_{2} penalties, adaptively allowing both sparse and non-sparse solutions, with ridge regression and subset selection as special cases. In terms of mean squared prediction error, simulations demonstrate that RiPS: (1) uniformly dominates both subset selection and the unbiased Ordinary Least Squares (OLS) estimator; (2) outperforms ridge regression except in high noise scenarios, where its performance is close; and (3) dramatically outperforms ridge regression, subset selection, and OLS for moderate signal-to-noise ratios.
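For orientation, the penalties discussed above can be written in a common form. The expressions below are a sketch under assumed notation (response *y*, design matrix *X*, tuning parameters λ, λ_0, λ_2, bridge index γ); the exact scaling used in RiPS is not given in the abstract.

```latex
% Bridge regression with index \gamma (\gamma = 1: LASSO, \gamma = 2: ridge)
\hat{\beta}_{\text{bridge}}
  = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2
    + \lambda \sum_{j} \lvert \beta_j \rvert^{\gamma}

% A combined L_0 + L_2 penalty of the kind described for RiPS (assumed form)
\hat{\beta}_{\text{RiPS}}
  = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2
    + \lambda_0 \lVert \beta \rVert_0
    + \lambda_2 \lVert \beta \rVert_2^2
```

Under this form, setting λ_2 = 0 recovers subset selection and setting λ_0 = 0 recovers ridge regression, which agrees with the special cases named above.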

** **

**11:45-12:05 Explanation of novel methods for measuring variable importance in the presence of highly correlated predictors §**

Sailee Rumao

College of Science, Rochester Institute of Technology

In this presentation, I will be giving an overview of Variable Importance Analysis (VIA) and the different methods in various disciplines for measuring variable importance. I will also discuss its applications and the problems associated with the same. The focus of the presentation will be scrutiny of methods of assessing variable importance in the presence of highly correlated predictors (input variables). These methods include a variance-based Variable Importance Measure (VIM) known as Sobol’s Indices and moment-independent VIMs. I will also be talking about the initial stages of my thesis work to deal with correlation in measuring variable importance. The thesis includes methodology to discriminate the contribution of correlated predictors from that of uncorrelated ones in the ranking of variable importance measures. It also includes the study of the contribution of correlated and uncorrelated predictors to model output variance. Ultimately, my goal is to implement this technique on Random Forests.

**SESSION 4 ****1W-304**

** **

**Keynote Lecture**

** **

**1:25-2:35 Opportunities for innovation in data science**

** **

Wendy Martinez

U. S. Bureau of Labor Statistics

In this presentation, I will discuss several projects in statistics and data science that were conceived of and implemented by young researchers at the Bureau of Labor Statistics. Some examples of these projects include R Shiny apps for dynamic mapping of national employment statistics and the automatic generation of news releases, analyzing unstructured text from interviewer notes, and unsupervised learning (or clustering) of total survey error in employment statistics. As I go through these examples, I will highlight how these projects came about, the complexities associated with the data, and the innovative uses of statistics. I will conclude my talk with some information on careers in the Federal government and how to apply for them.

**SESSION 5A ****1W-501**

** **

**Analysis of Textual Data**

** **

**3:30-3:50 Methods of prediction of the emotional intensity of tweets §**

Intisar Alhamdan

College of Science, Rochester Institute of Technology

** **

Affective computing is a field of computer science concerned with recognizing, analyzing and interpreting human emotions. Affective computing attempts to capture the attitudes of individuals in a range of media including audio, video, and text. Social media, in particular, are rich in expressions of people’s moods, opinions, and sentiments. This presentation focuses on predicting the emotional intensity expressed on the social network website Twitter. Twitter messages, or tweets, are short texts – no more than 280 characters – in which people express themselves with a variety of unique linguistic features including emoticons, hashtags, and abbreviations. In this study, we use lexical features, sentiment and emotion lexicons to extract features from tweets. We also use a form of transfer learning – word and sentence embeddings extracted from neural networks trained on large corpora. The estimation of emotional intensity is a regression task and we use linear and tree-based models for this task. We compare the results of these individual models as well as produce a final ensemble model that predicts the emotional intensity of tweets by combining the output of the individual models. Then, we use lexical features and word embeddings to train a recently introduced model designed to handle data with sparse or rare features. This model combines LASSO regularization with features grouped based on a hierarchical cluster model. Finally, we conduct an error analysis concerning these algorithms and emphasize areas that need to be improved.

** **

**3:55-4:15 Stylometry: A way to study Shakespeare with statistics**

Kirk Anne

Computing and Information Technology, SUNY Geneseo

Stylometry is the application of statistical analysis to study linguistic style. This is usually applied to written language, but has been also used to study music and art. The most common use of stylometry is to attribute authorship to anonymous or disputed documents. This presentation will cover the basics of stylometry and show a couple of results of well-known examples.

**SESSION 5B ****1W-502**

** **

**Data Analysis for Politics and Policy**

** **

**3:30-3:50 An exploratory analysis of drug rehabilitation data and the need for evidence-based policy reform §**

Liam Rowland and Necip Doganaksoy

Department of Accounting and Business Law, Siena College

** **

Drug addiction has plagued our society for many years and the number of deaths from overdoses per year has quadrupled in the last 20 years. Through our analytical exploration of the Treatment Episode Data Set, a data set with qualitative information on 1.6 million rehabilitation patients all over the United States, we were able to identify some key characteristics of repeat rehabilitation patients, i.e., those patients who have been to rehabilitation prior to their latest stint. In addition, we investigated the massive difference between federal spending on rehabilitation and federal spending on drug prevention, and observed the distinct linkage between homelessness and drug addiction. Based on this evidence, we developed and proposed an idea for policy reform.

** **

**3:55-4:15 Evaluating partisanship in the 2018 midterm elections: Analysis of New York Times Upshot/Siena College Research Institute 2018 live polling data §**

Travis Brodbeck

Siena College Research Institute

In the 63 days prior to the 2018 United States midterm elections, the New York Times and Siena College Research Institute partnered to conduct 96 polls of the contested battleground of House and Senate races. The data were weighted and published in real time as over 2.8 million phone calls were made, with each call and response appearing online as the poll was live. In this work, the aggregated data of over 48,000 responses are analyzed to determine what underlying patterns lie within the data and how they impact the current political landscape.

**SESSION 5C ****1W-509**

** **

**Teaching Probability and Statistics / Data Science in Public Health Careers**

** **

**3:30-3:50 Recognizing data science skills in schools**

Yusuf Bilgic

Department of Mathematics, SUNY Geneseo

** **

Haven’t we, as educators, recognized yet the influence of the data-centric era on teaching probability and statistics in schools, especially in New York? What is a data-analytic approach to designing school materials that improve rigor and data science skills? What school activities can teach the foundations of the mindsets and processes of predictive modeling and machine learning algorithms? In this talk, I will share our developed instructional cycle with a data-analytic approach and address these questions. A discussion will follow on what elements of statistics and probability should be included in an ideal school curriculum in the data science era.

** **

**3:55-4:15 ****Perspectives on data science in public health and a review of common statistical methods**

Christopher Ryan

SUNY Upstate Medical University Binghamton Clinical Campus

Among the CDC's Ten Essential Public Health Functions, several are inherently quantitative and data-based: monitor health status to identify and solve community health problems; diagnose and investigate health problems and health hazards in the community; evaluate effectiveness, accessibility, and quality of personal and population-based health services; and perform research for new insights and innovative solutions to health problems. Thus, public health could be an attractive and rewarding career direction for new graduates with a deep understanding of statistical concepts and techniques. Important skills include (1) explaining data science concepts to non-specialist government officials so that they understand what data can and cannot tell them; (2) conveying effectively the ideas of random variation and uncertainty in estimates; (3) disease surveillance, which is essentially a problem of anomaly detection; (4) descriptive statistics and graphical presentation of data; and (5) designing data collection and storage systems. In this presentation, I'll share my personal perspectives on data science in public health and review the scarce literature on which statistical methods are commonly used in public health practice.

**SESSION 5D ****1W-510**

** **

**Statistics in Sports / Coarse-Grain Mapping Operators**

** **

**3:30-3:50 Prediction and uncertainty in college hockey**

John Whelan

College of Science, Rochester Institute of Technology

Adam Wodon

College Hockey News

** **

The pairwise probability matrix (https://www.collegehockeynews.com/ratings/probabilityMatrix.php) generates probabilities for the outcomes of future college hockey games using Monte Carlo simulations based on the KRACH ratings of the teams. KRACH (Ken's Ratings for American College Hockey) is constructed from the maximum likelihood estimates of the team strength parameters of the Bradley-Terry-Zermelo model for paired comparison experiments. One limitation of the current implementation is that it doesn't incorporate the uncertainties associated with the maximum likelihood estimates. We describe a modification of the procedure that randomly varies the team strength parameters with each Monte Carlo trial, within the estimated uncertainty, to obtain more robust probabilities for the future predictions.
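The idea in this abstract can be illustrated with a small sketch. The abstract does not say how the uncertainty in the estimates is quantified, so the code below makes a loud assumption: each team's log-strength is perturbed independently with a Gaussian whose standard error comes from the diagonal of the Fisher information, which ignores correlations between teams. The game results, team names, and function names are all made up for illustration and are not the authors' implementation.

```python
import math
import random

def fit_bradley_terry(teams, games, iters=500):
    """Maximum likelihood Bradley-Terry strengths via the Zermelo fixed-point
    iteration. `games` is a list of (winner, loser) pairs. Assumes every team
    has at least one win and one loss, so the estimates are finite."""
    wins = {t: 0 for t in teams}
    n = {(a, b): 0 for a in teams for b in teams if a != b}  # games played per pair
    for w, l in games:
        wins[w] += 1
        n[(w, l)] += 1
        n[(l, w)] += 1
    p = {t: 1.0 for t in teams}
    for _ in range(iters):
        new = {}
        for i in teams:
            denom = sum(n[(i, j)] / (p[i] + p[j]) for j in teams if j != i)
            new[i] = wins[i] / denom
        # Fix the arbitrary overall scale: geometric mean of strengths = 1.
        log_mean = sum(math.log(v) for v in new.values()) / len(teams)
        p = {t: v / math.exp(log_mean) for t, v in new.items()}
    return p, n

def log_strength_se(teams, p, n):
    """ASSUMPTION: rough standard error of each log-strength from the diagonal
    of the Fisher information, ignoring correlations between teams."""
    se = {}
    for i in teams:
        info = sum(n[(i, j)] * p[i] * p[j] / (p[i] + p[j]) ** 2
                   for j in teams if j != i)
        se[i] = 1.0 / math.sqrt(info)
    return se

def simulate_win_probability(a, b, p, se, trials=20000, perturb=True, seed=0):
    """Monte Carlo estimate of P(a beats b). With perturb=True, the
    log-strengths are redrawn on every trial within their assumed uncertainty."""
    rng = random.Random(seed)
    wins_a = 0
    for _ in range(trials):
        la, lb = math.log(p[a]), math.log(p[b])
        if perturb:
            la += rng.gauss(0.0, se[a])
            lb += rng.gauss(0.0, se[b])
        prob_a = 1.0 / (1.0 + math.exp(lb - la))  # Bradley-Terry win probability
        if rng.random() < prob_a:
            wins_a += 1
    return wins_a / trials

# Hypothetical results for four made-up teams: (winner, loser).
teams = ["A", "B", "C", "D"]
games = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C"), ("C", "A"),
         ("B", "C"), ("B", "C"), ("C", "B"), ("C", "D"), ("D", "C"),
         ("B", "D"), ("D", "A"), ("A", "D"), ("A", "D")]

strengths, pair_counts = fit_bradley_terry(teams, games)
std_errors = log_strength_se(teams, strengths, pair_counts)

plug_in = strengths["A"] / (strengths["A"] + strengths["D"])
fixed = simulate_win_probability("A", "D", strengths, std_errors, perturb=False)
varied = simulate_win_probability("A", "D", strengths, std_errors, perturb=True)

print(f"plug-in P(A beats D):              {plug_in:.3f}")
print(f"Monte Carlo, fixed strengths:      {fixed:.3f}")
print(f"Monte Carlo, perturbed strengths:  {varied:.3f}")
```

With fixed strengths the simulation simply reproduces the plug-in probability up to sampling noise. With perturbed strengths each trial uses a different plausible set of ratings, so the resulting probability reflects both game-to-game randomness and the uncertainty in the ratings themselves.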

** **

**3:55-4:15 Hierarchical graph-based approach to encode coarse-grained mapping operators §**

Maghesree Chakraborty and Andrew White

Department of Chemical Engineering, University of Rochester

Chenliang Xu

Department of Computer Science, University of Rochester

Coarse-grained (CG) molecular dynamics simulation allows us to overcome the length- and time-scale limitations of all-atom molecular dynamics simulation. While CG molecular dynamics has been successfully applied to various problems such as protein folding, there is still a need for a general guideline for choosing CG mapping operators. We will demonstrate a newly developed technique for incorporating multiple CG representations of a molecule within a single hierarchical graph. Because valid CG mapping operators can be extracted from the graph, and relevant thermodynamic properties such as entropy can drive the selection of one among them, this technique is a step towards automating the selection of CG mapping operators. As a proof of concept, we have built a hierarchical graph for methanol that encodes all symmetry-preserving CG mapping operators. We will also demonstrate the feasibility of using the graph for automated CG mapping operator selection by using the uniform entropy flattening method to select CG mapping operators for methanol.

**§** **Indicates presentations that are eligible for the student presentation awards**