
Better Living Through Statistics

April 20-21, 2018 at the University of Rochester, New York


SEVENTH ANNUAL CONFERENCE OF THE UPSTATE CHAPTERS OF THE AMERICAN STATISTICAL ASSOCIATION


Abstracts

See program for the schedule

 

UP-STAT 2018

ABSTRACTS

 

DEPARTMENT OF BIOSTATISTICS AND COMPUTATIONAL BIOLOGY COLLOQUIUM          Wegmans 1400

 

4:00-5:30        Revisiting the genomewide threshold of 5 × 10⁻⁸ in 2018

 

Bhramar Mukherjee

Department of Biostatistics, University of Michigan

E-mail: bhramar@umich.edu

 

During the past two years, there has been much discussion and debate around the perverse use of the p-value threshold of 0.05 to declare statistical significance for single null hypothesis testing.  A recent recommendation by many eminent statisticians is to redefine statistical significance at p < 0.005 [Benjamin et al., Nature Human Behaviour, 2017].  This new threshold is motivated by the use of Bayes Factors and true control of the false positive report probability.  In genomewide association studies, a much smaller threshold of 5 × 10⁻⁸ has been used with notable success in yielding reproducible results while testing millions of genetic variants.  I will first discuss the historic rationale for using this threshold.  I will then investigate whether this threshold, which was proposed about a decade ago, needs to be revisited in light of the genomewide data we have today: newer sequencing platforms, imputation strategies, testing of rare versus common variants, the knowledge we have accumulated regarding true association signals, and the need to control metrics for multiple hypothesis testing beyond the family-wise error rate.  I will discuss notions of Bayesian error rates for multiple testing and use connections between the Bayes Factor and the Frequentist Factor (the ratio of power to Type I error) for declaring new discoveries.  Empirical studies using data from the Global Lipids Consortium will be used to evaluate how many of the most recent GWAS results (in 2013) we would have detected, and what our “true” false discovery rate would have been, had we applied various thresholds/decision rules in 2008 or 2009.  This is joint work with Zhongsheng Chen and Michael Boehnke at the University of Michigan.
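
As a quick numerical companion to the abstract's notion of the Frequentist Factor (the ratio of power to Type I error), the short Python sketch below computes that ratio and the implied false positive report probability for a hypothetical genome-wide test; the noncentrality parameter and the prior probability of association are illustrative assumptions, not values from the talk.

    # Hypothetical illustration of the "Frequentist Factor" (power / Type I error)
    # and the implied false positive report probability under an assumed prior.
    from scipy import stats

    alpha = 5e-8          # genome-wide significance threshold
    ncp = 6.0             # assumed noncentrality of a true association signal
    pi1 = 1e-4            # assumed prior probability that a variant is truly associated

    z = stats.norm.isf(alpha / 2)                       # two-sided critical value
    power = stats.norm.sf(z - ncp) + stats.norm.cdf(-z - ncp)
    frequentist_factor = power / alpha                  # plays the role of a Bayes Factor bound
    posterior_odds = frequentist_factor * pi1 / (1 - pi1)
    fprp = 1 / (1 + posterior_odds)                     # false positive report probability
    print(f"power={power:.3f}  FF={frequentist_factor:.2e}  FPRP={fprp:.2e}")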

 

POSTER SESSION                                                                      Goergen Atrium

 

6:00-7:00        Empirical Bayes method and random forest in high-dimensional DNA methylation data analysis

 

Liling Zou

Department of Medical Statistics, Tongji University School of Medicine

Dongmei Li

Clinical and Translational Science Institute, University of Rochester Medical Center

Timothy Dye

Department of Obstetrics and Gynecology, University of Rochester Medical Center

E-mail: zouliling_59@tongji.edu.cn

 

High-dimensional DNA methylation data analysis poses great challenges for researchers.  Empirical Bayes and random forest methods could be used for differential methylation analyses; however, they take different approaches, and there has been no comparative study of the two methods.  In this study, we systematically investigated empirical Bayes and random forest methods used in DNA methylation differential analysis and compared their power through simulation studies and real data experiments.  We simulated methylation data for 1000 loci with various sample sizes from case and control groups.  We assumed various mean differences of the first k loci between the two groups.  The results showed that the advantages of the empirical Bayes and random forest methods in DNA methylation differential analysis are clear.  The power of the empirical Bayes and random forest methods increases with increased sample size, especially when the proportion of differentially expressed genes is small.  The variability in power is higher with random forest than with empirical Bayes, especially when the sample size is small.  When both the sample size and the proportion of differentially expressed genes rise, the power of the two methods gradually converges.  The comparison of rheumatoid arthritis DNA methylation analyses based on the empirical Bayes and random forest methods provided results that were consistent with those obtained in the simulation studies.  Both empirical Bayes and random forest methods are recommended to effectively deal with high-dimensional methylation data with small sample sizes.  The power of both methods increases with the sample size and the proportion of differentially expressed genes.  In comparison, empirical Bayes is more robust than the random forest method.
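
The Python sketch below mimics, in a much simplified form, the kind of comparison the abstract describes: data for 1000 loci with a mean shift at the first k loci, scored by an empirical-Bayes-style moderated t-test (shrinking per-locus variances toward their grand mean) and by random forest variable importance.  The shrinkage weight, effect size, and sample sizes are illustrative assumptions, not the study's settings.

    # Rough simulation: empirical-Bayes-style moderated t-test versus random
    # forest importance for detecting differentially methylated loci.
    import numpy as np
    from scipy import stats
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(1)
    n, p, k = 20, 1000, 50                     # samples per group, loci, true signals
    x0 = rng.normal(0.0, 1.0, (n, p))          # control group
    x1 = rng.normal(0.0, 1.0, (n, p))
    x1[:, :k] += 0.8                           # mean shift at the first k loci (cases)

    # Moderated t: shrink per-locus variances toward their grand mean.
    v = (x0.var(axis=0, ddof=1) + x1.var(axis=0, ddof=1)) / 2
    d0 = 4.0                                   # assumed prior degrees of freedom
    v_mod = (d0 * v.mean() + 2 * (n - 1) * v) / (d0 + 2 * (n - 1))
    t_mod = (x1.mean(axis=0) - x0.mean(axis=0)) / np.sqrt(v_mod * 2 / n)
    pvals = 2 * stats.t.sf(np.abs(t_mod), df=d0 + 2 * (n - 1))

    # Random forest importance on the pooled data.
    X = np.vstack([x0, x1])
    y = np.repeat([0, 1], n)
    rf = RandomForestClassifier(n_estimators=500, random_state=1).fit(X, y)

    top_t = np.argsort(pvals)[:k]
    top_rf = np.argsort(rf.feature_importances_)[::-1][:k]
    print("moderated t recall:  ", np.mean(top_t < k))
    print("random forest recall:", np.mean(top_rf < k))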

 

Regularized estimation of high-dimensional sparse spectral density §

 

Yiming Sun, Yige Li, and Sumanta Basu

Department of Biological Statistics and Computational Biology, Cornell University

E-mail: ys784@cornell.edu

 

Multivariate spectral density estimation is a canonical problem in time series and signal processing with applications in diverse scientific fields including economics, neuroscience and environmental sciences.  In this work, we develop a non-asymptotic theory for regularized estimation of high-dimensional spectral density matrices of linear processes using thresholded versions of averaged periodograms.  Our results ensure that consistent estimation of spectral density is possible in the high-dimensional regime log p / n → 0 as long as the true spectral density is weakly sparse.  These results complement and improve upon existing results for shrinkage-based estimators of spectral density, which require no assumption on sparsity but only ensure consistent estimation in the regime p² / n → 0.
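
A minimal numerical sketch of the type of estimator the abstract studies, under assumed settings: compute the periodogram matrices of a multivariate series at the Fourier frequencies, average them over a local frequency window, and hard-threshold small entries.  The smoothing span and threshold constant below are illustrative choices, not the paper's tuning.

    # Thresholded averaged periodogram estimate of a spectral density matrix.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 512, 20
    X = rng.normal(size=(n, p))                      # placeholder multivariate series
    X[:, 1] += 0.6 * X[:, 0]                         # induce some cross-spectrum

    d = np.fft.fft(X - X.mean(axis=0), axis=0)       # DFT of each component series
    # Periodogram matrices at the Fourier frequencies: I(w_j) = d_j d_j* / (2 pi n)
    I = np.einsum('jk,jl->jkl', d, d.conj()) / (2 * np.pi * n)

    m = 5                                            # smoothing span (2m+1 frequencies)
    j = 40                                           # index of the target frequency
    f_avg = I[j - m:j + m + 1].mean(axis=0)          # averaged periodogram at w_j

    # Illustrative data-driven threshold of order sqrt(log p / (2m+1)).
    lam = np.abs(np.diag(f_avg)).mean() * np.sqrt(np.log(p) / (2 * m + 1))
    f_thr = np.where(np.abs(f_avg) >= lam, f_avg, 0)
    print("nonzero entries after thresholding:", np.count_nonzero(f_thr))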

 

Gene filtering technique prior to causal pathway analysis §

 

Lorin Miller, Jeffrey Mieczinkowski, David Tritchler, and Fan Zhang

Department of Biostatistics, SUNY at Buffalo

E-mail: lorinmil@buffalo.edu

 

When studying certain diseases, it can be useful to explore genotypes in conjunction with gene expression to better predict and understand root causes or outcomes.  Many algorithms exist to find these causal pathways by searching for combinations of genotypes and expression that best interact with each other and lead to the outcome of interest.  It is not uncommon in these settings for the total number of pathway elements to be much smaller than the dimensions of the genotyping and gene expression data.  Additionally, given that the underlying objective involves exploring optimal interactive relationships between two high-dimensional sources, many of the current pathway analysis algorithms are computationally intensive.  To lower computation time and potentially produce more accurate results, we are developing a gene filtering technique to help eliminate obviously irrelevant items.  This technique involves simple calculations of estimated correlations and applies conservative filters, based on clustering methods, to items with small correlations.  The simplicity of the calculation and the conservative removal criteria make this technique easy to compute and apply prior to performing formal pathway analysis.
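
A toy Python sketch of a correlation-based pre-filter of the general kind the abstract describes: drop genotype or expression features whose strongest absolute correlation with the other data source and with the outcome falls below a conservative cutoff.  The cutoff, feature counts, and data are placeholders, and the actual technique's clustering-based removal rule is not reproduced here.

    # Conservative correlation-based pre-filter before formal pathway analysis.
    import numpy as np

    rng = np.random.default_rng(2)
    n, p_g, p_e = 200, 500, 300
    G = rng.normal(size=(n, p_g))                 # genotype-like features
    E = rng.normal(size=(n, p_e))                 # expression-like features
    y = G[:, 0] + E[:, 0] + rng.normal(size=n)    # outcome driven by a small pathway

    def max_abs_corr(A, B):
        """Largest |correlation| of each column of A with any column of B."""
        A = (A - A.mean(0)) / A.std(0)
        B = (B - B.mean(0)) / B.std(0)
        return np.abs(A.T @ B / len(A)).max(axis=1)

    cutoff = 0.15                                  # conservative, assumption-driven
    keep_g = max_abs_corr(G, np.column_stack([E, y])) > cutoff
    keep_e = max_abs_corr(E, np.column_stack([G, y])) > cutoff
    print("genotypes kept:", keep_g.sum(), "expression features kept:", keep_e.sum())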

 

Swinging for the fences: are home run records helping baseball?

 

Matthew Kochan

Canisius Library, Canisius College

Milburn Crotzer

Department of Mathematics and Statistics, Canisius College

E-mail: kochanm@canisius.edu

 

In recent years, baseball has witnessed declining batting averages and runs scored as the strikeout rate has surged.  Oddly, the home run trend has also declined even as hitters were apparently swinging for the fences.  Starting in 2014, the home run trend seems to be reversing, along with runs scored.  In 2016, a record 111 players hit 20 or more home runs.  The upswing in home runs has not seen a corresponding increase in overall game attendance.  In fact, attendance has remained relatively flat over the years and has decreased slightly since 2014.  We examine these trends over the 2001-2016 time period, discuss possible reasons for the trends, and assess implications for the game.

 

Modeling alphabet incorporated class A using fractional Brownian motion §

 

James Caruana and Damien Halpern

Department of Mathematics, SUNY Geneseo

E-mail: dvh2@geneseo.edu

 

We will determine and adapt the variables used in the Wiener process to model the stock price of Alphabet Incorporated Class A for the next 12 months.  The stock price will be analyzed using a fractional Brownian motion approach that we plan to apply to other companies similar in size to Alphabet Incorporated.  Contrary to typical Brownian motion, fractional Brownian motion increments are not strictly independent of each other; there is a long-run dependency on past prices that we believe makes it a better model for analyzing a stock.  Factors such as volatility and drift will be determined by collecting data on Alphabet Incorporated’s stock performance over the past 5 years and using the statistical software R to analyze the information.
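
The abstract's analysis is planned in R; purely as a language-agnostic illustration of the model class, the Python sketch below simulates one fractional Brownian motion price path by taking the Cholesky factor of the fBm covariance and exponentiating with an assumed drift and volatility.  The Hurst exponent, drift, volatility, and starting price are made-up placeholders, not estimates for Alphabet.

    # Sketch of a geometric fractional-Brownian-motion price path.
    import numpy as np

    H, mu, sigma, S0 = 0.7, 0.05, 0.25, 1000.0   # Hurst, drift, volatility, start price
    T, steps = 1.0, 252                          # one year of daily steps
    t = np.linspace(T / steps, T, steps)

    # fBm covariance: Cov(B_H(s), B_H(t)) = 0.5 * (s^2H + t^2H - |t - s|^2H)
    s, u = np.meshgrid(t, t)
    cov = 0.5 * (s**(2 * H) + u**(2 * H) - np.abs(s - u)**(2 * H))
    L = np.linalg.cholesky(cov + 1e-12 * np.eye(steps))

    rng = np.random.default_rng(7)
    B_H = L @ rng.normal(size=steps)             # one fBm sample path
    prices = S0 * np.exp(mu * t + sigma * B_H)   # geometric fBm price path
    print("simulated year-end price:", prices[-1])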

 

Gender distribution of concussions in collegiate athletics, 1988-2004 §

 

Ann Zhang

Department of Public Health Sciences, University of Rochester Medical Center

E-mail: ayzhang24@gmail.com

 

This study analyzed the distribution of concussion by gender in four National Collegiate Athletic Association (NCAA) sports for which there are comparable male and female teams: soccer, basketball, baseball/softball, and lacrosse.  Data reported to the NCAA Injury Surveillance System, a cross-sectional survey with a complex, multistage probability design, were aggregated for the sixteen available years from 1988 to 2004 and used to conduct statistical analysis.  It was found that for soccer (p = 0.005) and basketball (p = 0.003), female athletes were disproportionately affected by concussions, while in lacrosse (p = 0.008) men were more likely to experience concussions.  There was no difference in susceptibility for baseball/softball (p = 0.39).  This study not only demonstrated that there are significant gender differences in concussion rates, but also challenged prevailing social and clinical perceptions that concussions are primarily a men’s sports problem.  This analysis of granular data reveals that female athletes in certain sports are actually at higher risk for concussion than their male counterparts.

 

A flexible orientational bias Monte Carlo multiple-try Metropolis algorithm §

 

Alexis Zavez and Sally Thurston

Department of Biostatistics and Computational Biology, University of Rochester Medical Center

E-mail: Alexis_Zavez@urmc.rochester.edu

 

We propose an extension of the Orientational Bias Monte Carlo (OBMC) Multiple-Try Metropolis (MTM) algorithm described by Liu et al.  Compared to the traditional Metropolis-Hastings (MH) algorithm, OBMC leads to faster convergence in some molecular applications and mixture model scenarios, where the MH algorithm may not perform well.  At each iteration, OBMC requires drawing k proposal values and k reference values from a symmetric proposal distribution, in contrast to the MH algorithm, which draws a single proposal value.  Our flexible OBMC (OBMC-F) method allows the number of proposal values to differ from the number of reference values.  We illustrate the MH, OBMC and OBMC-F visually, and compare their performances under various simulated scenarios to identify the advantages and disadvantages of each method.
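
For readers unfamiliar with the multiple-try idea, here is a minimal Python sketch of a standard symmetric-proposal multiple-try Metropolis step (the weight function w(y, x) = π(y) case), applied to a toy two-component mixture target.  The OBMC-F extension described in the abstract would let the numbers of proposal and reference values differ; in this sketch they are equal, and all settings are illustrative.

    # Standard multiple-try Metropolis step with a symmetric Gaussian proposal.
    import numpy as np

    rng = np.random.default_rng(3)

    def log_pi(x):
        """Toy target: equal mixture of N(-3, 1) and N(3, 1) (up to a constant)."""
        return np.logaddexp(-0.5 * (x + 3) ** 2, -0.5 * (x - 3) ** 2)

    def mtm_step(x, k=5, scale=2.0):
        ys = x + scale * rng.normal(size=k)                    # k proposal values
        wy = np.exp(log_pi(ys))
        y = rng.choice(ys, p=wy / wy.sum())                    # select one proposal
        xs = np.append(y + scale * rng.normal(size=k - 1), x)  # k reference values
        wx = np.exp(log_pi(xs))
        accept = rng.random() < min(1.0, wy.sum() / wx.sum())
        return y if accept else x

    x, chain = 0.0, []
    for _ in range(5000):
        x = mtm_step(x)
        chain.append(x)
    print("posterior mean estimate (true value 0):", np.mean(chain))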

 

Visualization and outlier detection: phase, voltage, and frequency for power stations §

 

Chandini Ramesh and Ernest Fokoué

College of Science, Rochester Institute of Technology

Esa Rantanen

Department of Psychology, Rochester Institute of Technology

E-mail: cr4383@rit.edu

 

We explore different approaches to detecting an outlier bus at a power station and employ various visualization techniques to help operators at the power stations recognize an anomaly among the buses.  The given time series data are preprocessed, and an unsupervised learning method is applied at a given time point, yielding groups of similar buses.  A cluster is then flagged as an outlier, and its buses are visualized on maps.  We also visualize the rate of change of the data on the plots.
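
As a rough illustration of one such clustering approach (not necessarily the one used in this work), the Python sketch below clusters simulated bus measurements at a single time point with k-means and flags the smallest cluster as the suspect group.  The feature set (phase, voltage, frequency), cluster count, and data are assumptions.

    # Flag an outlier bus by clustering measurements at one time point.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(4)
    normal = rng.normal([0.0, 1.0, 60.0], [0.02, 0.01, 0.05], size=(49, 3))
    faulty = np.array([[0.3, 0.85, 59.2]])          # one anomalous bus
    X = np.vstack([normal, faulty])                 # rows = buses; columns = phase, voltage, frequency

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    sizes = np.bincount(km.labels_)
    outlier_cluster = sizes.argmin()                # smallest cluster = candidate anomaly
    print("suspect buses:", np.where(km.labels_ == outlier_cluster)[0])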

 

SESSION 1A                                                                                  Goergen 101

 

Methodological Challenges and Advancements in the Current Research on the Impacts of Fracking

 

9:30-9:45        Using difference-in-differences to assess the health consequences of drinking water contamination from shale gas development

 

Elaine Hill

Department of Public Health Sciences, University of Rochester Medical Center

E-mail: Elaine_Hill@urmc.rochester.edu

 

This study assesses whether there exist health risks associated with shale gas development by means of drinking water contamination.  We build a novel data set that links infant health outcomes from birth records in Pennsylvania from 2003-2014 to gas well activity using the exact geographic locations of a mother’s residence, gas wells, and public drinking water source intake locations.  Furthermore, temporal information on births, wellbore activity, and water sampling helps us narrow the timing of exposure.  To begin, we retain all births to mothers in community water systems (CWS) whose ground water intakes lie within 10 km of a gas well.  An infant is potentially exposed to gas well activity if a wellbore is drilled 1) within the gestation period of that infant, and 2) in close proximity to the intake location for the CWS in which the mother lives.  Using a difference-in-differences approach, we estimate the impact of increasing such exposure within 1 km of CWS intakes on birth outcomes, removing any effects from changes in water contamination over the same period as measured by the impacts of gas well threats at farther distances (1-10 km).  We further disentangle the health risks stemming from water contamination as opposed to another medium by using a subsample of infants who are only exposed to gas wells through threatened CWS intakes but do not live in close proximity to a gas well.  Our research contributes to a growing body of research that estimates the causal impacts of shale gas development on health.

 

9:45-10:00      The impact of unconventional natural gas development on acute myocardial infarction hospitalizations among the elderly population: an application of the difference-in-differences design §

 

Linxi Liu

Department of Public Health Sciences, University of Rochester Medical Center

E-mail: linxi_liu@urmc.rochester.edu

 

Since 2000, unconventional natural gas development (UNGD) has rapidly increased in the United States, especially in Pennsylvania’s Marcellus Shale.  Community concerns regarding UNGD include the potential health effects of increased UNGD-related air pollution.  Since reduced air quality is a risk factor for acute myocardial infarction (AMI), we hypothesize that UNGD could increase AMI risk among the elderly population.  We develop a novel county-level panel database of AMI inpatient hospitalizations, UNGD well measures, and regional sociodemographic information to examine the associations between UNGD and AMI hospitalizations among the elderly before and after the UNGD boom in Pennsylvania.  We use a linear model in a difference-in-differences quasi-experimental design to assess changes in AMI hospitalization rates in highly exposed counties (top 15% of drilling activity) vs. less exposed counties over time (pre-2007 vs. post-2007).  Models include fixed effects for county and year to control for unobserved time-invariant characteristics and temporal variation, and county-level time-varying demographic and socioeconomic covariates to adjust for changes in the population's composition.  In sensitivity analyses, we test our results across numerous cutoffs for well density and timing.  We find that highly exposed counties demonstrate an additional 3.35 (95% CI: 0.68-6.02) hospitalizations per 10,000 population in the elderly.  Our estimates are robust to multiple sensitivity analyses.  We conclude that a high density of UNGD wells increases AMI hospitalizations among the elderly population in Pennsylvania.  Our study is among the first to offer an epidemiologic application of a difference-in-differences quasi-experimental design to the study of the impact of UNGD on health.
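
A hedged sketch of the difference-in-differences specification described above, written with the Python statsmodels package: regress simulated county-year AMI hospitalization rates on the interaction of a high-exposure indicator and a post-2007 indicator, with county and year fixed effects and county-clustered standard errors.  The data frame, variable names, and effect size are hypothetical, not the study's data.

    # Difference-in-differences with two-way fixed effects on simulated panel data.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(5)
    counties, years = 40, list(range(2002, 2013))
    df = pd.DataFrame([(c, y) for c in range(counties) for y in years],
                      columns=["county", "year"])
    df["high_exposure"] = (df["county"] < 6).astype(int)      # top drilling counties
    df["post"] = (df["year"] >= 2007).astype(int)
    df["ami_rate"] = (20 + 3.0 * df["high_exposure"] * df["post"]
                      + rng.normal(0, 2, len(df)))            # simulated outcome

    model = smf.ols("ami_rate ~ high_exposure:post + C(county) + C(year)", data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["county"]})
    print(model.params["high_exposure:post"], model.bse["high_exposure:post"])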

 

10:00-10:15    The new gold standard: using geology as a random predictor of where and when drilling happens §

 

Andrew Boslett

Department of Public Health Sciences, University of Rochester Medical Center

E-mail: Andrew_Boslett@urmc.rochester.edu

 

Most studies of the impacts associated with resource extraction use production-level-related proxies as the principal treatment variable.  Unfortunately, studies that use these variables are potentially biased if there are unobservables driving both treatment and outcome levels.  In this presentation, we discuss a strategy applied in the economics literature to mitigate this source of bias: two-stage models that rely on instrumental variables associated with spatially varying geological quality.  Using this general method, economists have made steps towards avoiding “endogeneity” bias by isolating variation in treatment that is related only to underlying geological quality.  This geologically based variation in treatment has the benefit of being randomly assigned, having been determined millions of years in advance of potential human manipulation.  We present an introduction to the method and then provide a brief application in a study evaluating the impacts of the unconventional oil and gas boom on light pollution.
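
The Python sketch below works through the two-stage logic on simulated data: an (assumed exogenous) geological quality measure instruments for drilling activity, and comparing the naive OLS slope with the two-stage estimate shows how the instrument removes bias from an unobserved confounder.  The variables, coefficients, and data are invented for illustration, and the manually computed second-stage standard errors would need the usual 2SLS correction in practice.

    # Manual two-stage least squares with a geology-style instrument.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(6)
    n = 2000
    geology = rng.normal(size=n)                     # shale thickness / quality (instrument)
    confounder = rng.normal(size=n)                  # unobserved local economic conditions
    drilling = 1.0 * geology + 0.8 * confounder + rng.normal(size=n)   # endogenous treatment
    light = 0.5 * drilling + 1.2 * confounder + rng.normal(size=n)     # outcome (light pollution)

    # Stage 1: predict drilling from the instrument.
    stage1 = sm.OLS(drilling, sm.add_constant(geology)).fit()
    drilling_hat = stage1.fittedvalues

    # Stage 2: regress the outcome on predicted drilling (true slope is 0.5).
    stage2 = sm.OLS(light, sm.add_constant(drilling_hat)).fit()
    print("naive OLS slope:", sm.OLS(light, sm.add_constant(drilling)).fit().params[1])
    print("2SLS slope:     ", stage2.params[1])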

 

SESSION 1B                                                                                  Goergen 109

 

Statistical Methods for Drug Evaluation

 

9:30-9:50        An evaluation of statistical approaches to post-marketing surveillance §

 

Yuxin Ding and Marianthi Markatou

Department of Biostatistics, SUNY at Buffalo

Robert Ball

Center for Drug Evaluation and Research, Food and Drug Administration

E-mail: yuxindin@buffalo.edu

 

Safety of medical products presents a serious concern worldwide.  Surveillance systems for post-market medical products have been established for continual monitoring of adverse events in many countries, and the proliferation of electronic health record (EHR) systems further facilitates continual monitoring for adverse events.  In this study, we review existing statistical methods for signal detection that are mostly in use in post-marketing safety surveillance of spontaneously reported adverse events.  These include the proportional reporting ratio (PRR), the reporting odds ratio (ROR), the Bayesian Confidence Propagation Neural Network (BCPNN), the Multi-item Gamma Poisson Shrinker (MGPS), the corresponding FDR-based methods derived from the aforementioned methods, and the likelihood ratio test based method.  We use three different methods to generate data (adverse event based, drug based, and a modification of the Ahmed et al. [2009, 2010] method) and study the performance of these statistical methods.  Performance metrics include type I error probability, power, number of correctly identified signals, false discovery rate and sensitivity, among others.  In all simulation studies, the results show superior performance of the likelihood ratio test.  A critical discussion and recommendations for choosing from these methods are presented.  An application to the FAERS database is illustrated using the rhabdomyolysis-related adverse events reported to the FDA from the 3rd quarter of 2014 to the 1st quarter of 2017 for statin drugs.
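
For orientation, two of the disproportionality measures named above (PRR and ROR) reduce to simple functions of a drug-event 2×2 table.  The Python sketch below computes them, with a Wald confidence interval for the ROR, from made-up counts.

    # PRR and ROR from a single drug-event 2x2 contingency table (counts invented).
    import numpy as np

    # a = reports with drug & event, b = drug & other events,
    # c = other drugs & event, d = other drugs & other events
    a, b, c, d = 28, 942, 310, 98720

    prr = (a / (a + b)) / (c / (c + d))          # proportional reporting ratio
    ror = (a / b) / (c / d)                      # reporting odds ratio
    se_log_ror = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ci = np.exp(np.log(ror) + np.array([-1.96, 1.96]) * se_log_ror)
    print(f"PRR={prr:.2f}  ROR={ror:.2f}  95% CI=({ci[0]:.2f}, {ci[1]:.2f})")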

 

9:55-10:15      Generalized multiple contrast tests in proof-of-concept dose-response studies §

 

Shiyang Ma and Michael McDermott

Department of Biostatistics and Computational Biology, University of Rochester Medical Center

E-mail: Shiyang_Ma@urmc.rochester.edu

 

In the process of developing drugs, proof of concept studies can be helpful in determining whether there is any evidence of a dose-response relationship.  A global test for this purpose that has gained popularity is a component of the MCP-Mod procedure, which involves the specification of several plausible dose-response models.  For each model, a test is performed for significance of an optimally chosen contrast among the sample means.  An overall p-value is obtained from the distribution of the minimum of the (dependent) p-values arising from these contrast tests.  This can be viewed as a method for combining dependent p-values.  We generalize this idea to the use of different statistics for combining the dependent p-values, such as Fisher’s combination method or the inverse normal method.  Simulation studies show that the method based on the minimum p-value has very stable power across a wide range of true dose-response models.  Under a certain range of true mean configurations, however, especially when the true means are strictly increasing with dose, the Fisher and inverse normal methods tend to be more powerful.
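
A simplified Python sketch of the three combination rules discussed above (minimum p-value, Fisher, and inverse normal), applied to the p-values of a few candidate dose-response contrasts, with the null reference distribution of each combined statistic obtained by simulating group means under no dose effect.  The contrasts, sample size, and observed means are invented, and most of the MCP-Mod machinery is omitted.

    # Combining dependent contrast p-values: min-p, Fisher, and inverse normal.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(8)
    se = 1 / np.sqrt(20)                           # SE of each group mean (n=20, sigma=1)
    C = np.array([[-2., -1., 0., 1., 2.],          # candidate dose-response contrasts
                  [-1., -1., -1., 1., 2.],
                  [-2., 1., 1., 0., 0.]])
    C /= np.linalg.norm(C, axis=1, keepdims=True)  # unit-norm rows

    def contrast_pvals(ybar):
        return stats.norm.sf(C @ ybar / se)        # one-sided p-value per contrast

    def combine(p):
        return np.array([p.min(),                                     # minimum p-value
                         -2 * np.log(p).sum(),                        # Fisher combination
                         stats.norm.isf(p).sum() / np.sqrt(len(p))])  # inverse normal

    # Reference distribution of the combined statistics under no dose effect.
    null = np.array([combine(contrast_pvals(rng.normal(0, se, 5))) for _ in range(5000)])
    obs = combine(contrast_pvals(np.array([0.0, 0.1, 0.3, 0.5, 0.6])))

    print("min-p test p-value:      ", np.mean(null[:, 0] <= obs[0]))
    print("Fisher test p-value:     ", np.mean(null[:, 1] >= obs[1]))
    print("inverse normal p-value:  ", np.mean(null[:, 2] >= obs[2]))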

 

SESSION 1C                                                                                  Goergen 108

 

Applications of Spatial Modeling

 

9:30-9:50        An inhomogeneous Poisson model of the spatial intensity of opioid overdoses, as represented by EMS naloxone use

 

Christopher Ryan

SUNY Upstate Medical University Binghamton Clinical Campus

E-mail: cryan@binghamton.edu

 

Opioid overdose can be thought of, in part, as a spatial process.  The present study models the observed spatial point pattern of EMS calls for opioid overdose as a function of sociodemographic and geographic predictors.  The Susquehanna EMS Region has a population of about 300,000 and a surface area of about 5,500 square kilometers.  Thirteen of 71 EMS agencies in the Region participated; together they account for the vast majority of all Regional EMS calls.  I fit an inhomogeneous Poisson model to the spatial pattern of opioid overdose incident locations, using as candidate predictors census-tract population density, housing ownership, poverty, and age distribution, plus the distance from the event to the nearest “minimart” (convenience store/gas station).  I used the observed spatial intensity of all EMS calls as an offset.  As expected, there was much confounding between predictors.  In the final model, each 1% increase in owner-occupancy rate was associated with an 84% decrease in the spatial intensity of opioid overdose EMS calls.  A 10% decrease in overdose intensity associated with each kilometer of distance from the nearest minimart was of borderline significance.  The model fit the data reasonably well, and it explained part of the previously observed clustering.  Residential owner-occupancy is associated with a lower density of opioid overdose EMS calls, and distance to the nearest minimart is of borderline significance.  The model accounted for at least some of the observed clustering; models allowing specifically for interaction between points may be useful.
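
As a simplified stand-in for the inhomogeneous Poisson point-process model (which is usually fit on exact event locations), the Python sketch below bins events into areal units and fits a Poisson regression with the log of all-EMS-call counts as an offset.  The variables, coefficients, and data are hypothetical, not the Susquehanna Region data.

    # Areal-unit approximation: Poisson regression with an all-calls offset.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(9)
    n_tracts = 120
    df = pd.DataFrame({
        "owner_occ": rng.uniform(0.3, 0.95, n_tracts),        # owner-occupancy rate
        "dist_minimart_km": rng.exponential(2.0, n_tracts),   # distance to nearest minimart
        "all_ems_calls": rng.poisson(400, n_tracts),          # offset: all EMS calls
    })
    lam = df["all_ems_calls"] * 0.01 * np.exp(-3.0 * (df["owner_occ"] - 0.6)
                                              - 0.1 * df["dist_minimart_km"])
    df["od_calls"] = rng.poisson(lam)                         # simulated overdose calls

    fit = smf.glm("od_calls ~ owner_occ + dist_minimart_km",
                  data=df, family=sm.families.Poisson(),
                  offset=np.log(df["all_ems_calls"])).fit()
    print(fit.params)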

 

9:55-10:15      Spatial analysis of Americans’ attitudes towards guns §

 

Jiangmei Xiong and Joseph Ciminelli

Department of Mathematics, University of Rochester

E-mail: jxiong4@u.rochester.edu

 

The recent school shooting on February 14, 2018 in Parkland, Florida has sparked discussions and debates over the topic of gun control across the country.  Measuring the attitudes of American constituents becomes ever more important as lawmakers seek to create new gun-related policy.  In this presentation, we analyze the feelings of American Twitter users towards gun use.  In particular, we examine Tweets containing the word “gun”, perform a sentiment analysis of such Tweets, and spatially model how sentiment toward gun control varies over the United States.  For the spatial model, we present a hierarchical Bayesian representation of sentiment variation as a Gaussian spatial process based on the locations of Twitter users.  We are ultimately interested in whether there is a pattern in how people’s sentiment towards gun use changes over geographical areas and, if such a pattern exists, how it can be described with a spatial model.  Through the presented work, people’s sentiments toward guns in different locations are modeled and can be used by local policymakers to advocate for appropriate gun control regulations that match public opinion in their districts.

 

SESSION 1D                                                                                  Wilmot 116

 

Methods for Image Classification

 

9:30-9:50        Humpback whale image identification §

 

Joseph Tadros and Yi Liu Chen

Department of Mathematics, SUNY Geneseo

E-mail: jt22@geneseo.edu

 

Hundreds of years of commercial whaling have caused many whale populations to reach near extinction and to be placed on the endangered species list.  While most countries have recognized the International Whaling Commission’s 1986 ban on commercial whaling, recovering populations still struggle with rising ocean temperatures and competition with the commercial fishing industry for food.  In order to monitor and aid the recovery of these whale populations, conservation scientists analyze photos of whales taken from surveillance systems and determine the species.  Previously, scientists have had to do this work manually.  Thus, the goal of this project is to develop an algorithm to classify images of whales by species.  The idea as well as the motivation for this project came from the Humpback Whale Identification Challenge found on Kaggle.  We will implement standard image recognition algorithms such as convolutional neural networks, and we will upload our results to Kaggle.

 

9:55-10:15      Emotion recognition using deep learning §

 

Tolulope Olatunbosun

Department of Mathematics, SUNY Geneseo

Ian Costley and Peter Murphy

Department of Physics, SUNY Geneseo

E-mail: to3@geneseo.edu

 

Human facial expressions transcend language barriers and are understood throughout all cultures.  Recent developments in deep learning allow for facial abstraction, in which the primary facial features that compose each specific emotion can be mapped.  Using a convolutional neural network, a model can be built to represent emotions.  The model can then be used to classify emotions for an arbitrary face.  A set of pre-labeled faces will be analyzed in this manner, and the resulting model will be used to classify new faces based on pre-trained emotional features.

 

SESSION 1E                                                                                  Goergen 110

 

Methods for Detecting Change-Points and Phase Transitions

 

9:30-9:50        Change-point detection and issue localization based on fleet-wide fault data

 

Necip Doganaksoy

Department of Accounting and Business Law, Siena College

Zhanpan Zhang

GE Global Research

E-mail: ndoganaksoy@siena.edu

 

Modern industrial assets (e.g., generators, turbines, engines) are outfitted with numerous sensors to monitor key operating and environmental variables.  Unusual sensor readings (i.e., high temperature, excessive vibration, low current) trigger rule-based actions (also known as faults) that range from warning alarms to immediate shutdown of the asset to prevent potential damage.  A review of the current research in condition monitoring shows that much of this work is concentrated in diagnostics engineering and machine learning.  The general goal is to develop detection algorithms with improved performance properties.  In this paper, we take a different approach to the analysis and modeling of fault data.  We utilize fault data logs with the goal of localizing fault occurrences to an identifiable set of assets.  Such localization is essential for addressing the root cause(s) of faults that affect a large number of units.  Our technical development is based on the generalized linear modeling framework and change-point detection.  Specifically, we develop heuristic algorithms to detect both single and multiple changes that simultaneously affect multiple industrial assets.  The performance of the proposed detection and localization algorithms was evaluated through Monte Carlo simulation of fault data streams under different scenarios.  In this talk, we will present our approach and outline further research topics.
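
To make the change-point idea concrete, the toy Python sketch below scans a single simulated Poisson fault-count stream for the split point that maximizes the likelihood gain over a no-change model; the heuristic algorithms described in the abstract, which handle multiple assets and multiple changes, are considerably richer than this.

    # Toy single change-point detection in a Poisson fault-count stream.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(12)
    counts = np.concatenate([rng.poisson(2.0, 60), rng.poisson(5.0, 40)])  # faults per day

    def pois_loglik(x):
        """Poisson log-likelihood with the rate set to the segment mean."""
        return stats.poisson.logpmf(x, x.mean()).sum()

    base = pois_loglik(counts)
    gains = [pois_loglik(counts[:t]) + pois_loglik(counts[t:]) - base
             for t in range(5, len(counts) - 5)]
    print("estimated change point (true value 60):", 5 + int(np.argmax(gains)))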

 

9:55-10:15      Detecting phase transitions in small group conversations through entropy methods §

 

Jennifer Lentine and Bernard Ricca

Statistics and Data Science Program, St. John Fisher College

E-mail: jml06750@sjfc.edu

 

Entropy methods (also known as mutual information methods) are often used to analyze time series categorical data, such as results from qualitative coding of small group discussions.  A maximum in the calculated entropy is indicative of a phase transition.  This paper presents a Bayesian-inspired method to distinguish between random fluctuations that result in a local maximum value of entropy and a maximum value of entropy that indicates a transition in the system.  Using probabilities derived from the data on one side of the local maximum as a prior and the probabilities derived from the data on the other side as a likelihood, posteriors are calculated.  Through bootstrapping and chi-squared tests, significant differences between priors and posteriors can be determined.  Applications to real world data include the ability to detect phase changes in group interactions.
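
As a small illustration of the entropy side of the method, the Python sketch below computes a sliding-window Shannon entropy for a toy coded categorical series; a pronounced local maximum is the kind of candidate transition that the Bayesian prior/posterior comparison described above would then test.  The coding scheme, window length, and data are made up.

    # Sliding-window entropy of a coded categorical time series.
    import numpy as np

    rng = np.random.default_rng(10)
    # Toy coded discussion: regime A, a noisy transition, then regime B.
    seq = np.concatenate([rng.choice(4, 150, p=[.7, .1, .1, .1]),
                          rng.choice(4, 40),
                          rng.choice(4, 150, p=[.1, .1, .1, .7])])

    def window_entropy(x, w=30):
        ent = []
        for i in range(len(x) - w):
            counts = np.bincount(x[i:i + w], minlength=4)
            p = counts[counts > 0] / w
            ent.append(-(p * np.log2(p)).sum())
        return np.array(ent)

    ent = window_entropy(seq)
    print("candidate transition near window starting at index:", ent.argmax())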

 

SESSION 2A                                                                                  Goergen 101

 

Understanding Drinking Water Data: Findings and Implications for Policy

 

10:25-10:40    New York State public drinking water contamination: trends, levels, and correlates §

 

Alexis Zavez

Department of Biostatistics and Computational Biology, University of Rochester Medical Center

E-mail: Alexis_Zavez@urmc.rochester.edu

 

Under the Safe Drinking Water Act (SDWA), public water systems are required to regularly sample drinking water contaminants and report exceedances to both regulatory authorities and local communities.  We utilize SDWA drinking water sampling results to better understand trends in water contamination in New York State.  We analyze four key water contaminants: arsenic, nitrate, lead, and coliform.  We first calculate trends in water contamination over time by determining how many people likely consume water above contamination thresholds.  Next, we compare current levels of each contaminant to their corresponding maximum contaminant level (MCL).  Contaminant MCLs are often higher than maximum contaminant level goals (MCLGs), which represent the level of a contaminant below which there is no known or expected risk to health.  Since reducing MCLs towards MCLGs could improve public health state-wide, we propose hypothetical reductions for current MCLs and estimate the necessary response required by policymakers to achieve these levels.  We support our recommendations with spatial correlation metrics.  We find higher levels of arsenic and nitrate in high-income counties, and higher levels of lead in low-income counties.  We determine that both arsenic and nitrate MCLs could be reduced by half or more without requiring a large response from public water systems.  However, reductions in the MCLs for lead and coliform could require greater action from system management or may not be feasible if contamination is caused by environmental factors.

 

10:40-10:55    Coliform and the weather: evidence from over 10,000 public water systems §

 

Devin Sonne

Department of Public Health Sciences, University of Rochester Medical Center

E-mail: dsonne@u.rochester.edu

 

The Total Coliform Rule is the main federal regulation designed to monitor and remedy bacterial contamination in American public drinking water, and it is also the most common reason for water quality violations.  The rule requires public water systems to take water quality samples each month and test these samples for coliform bacteria.  While coliform bacteria are not usually harmful, the test is considered useful as a general measure of bacterial contamination: coliform results that exceed standards are followed by tests for more specific bacteria, such as fecal coliform or E. coli.  We use coliform sample test results from over 10,000 public drinking water systems in ten US states, with data typically spanning at least five recent years for each system, to study reported coliform contamination rates and how they vary with local weather conditions.  Approximately 1.8% of routine samples taken test positive; rates are higher for smaller systems.  Recent weather conditions have statistically significant, though small, effects on the chance of coliform contamination: coliform contamination is on average more likely in warm weather; the relationship with precipitation is less clear.  We address measurement error in our weather variables, and also explore heterogeneity of the weather effect across system characteristics.  These results are useful for private homeowners interested in choosing a time to test their water wells for coliform to obtain “worst case” estimates of contamination.

 

10:55-11:10    The effects of drinking water contamination on birth outcomes: new estimates from drinking water samples §

 

Richard DiSalvo

Department of Economics, University of Rochester

E-mail: rdisalv2@gmail.com

 

We study the relationship between drinking water contamination and birth outcomes, using panel data on births and water quality from 2003 through 2014 for the state of Pennsylvania.  To separate the effects of drinking water contamination from likely confounders, we fit models that identify these parameters using only within-water-system over-time variation, or only within-mother across-sibling variation.  We find that greater overall public drinking water contamination, within the range typically experienced by Pennsylvanians, leads to a small but precisely estimated deterioration in birth outcomes.  This finding also appears for overall chemical and coliform contamination.  Precise water sampling data also afford us the opportunity to estimate effects by trimester and by smaller contaminant groups.  We find that most of the effect occurs for contamination during the third trimester.  Contaminant-specific estimates are discussed.

 

SESSION 2B                                                                                  Goergen 109

 

New Perspectives on Sport Analytics

 

10:25-10:40    Reconsidering wins above replacement as a metric §

 

Rob Weber

Statistics and Data Science Program, St. John Fisher College

E-mail: raw04717@sjfc.edu

 

This paper analyzes the use and value of the now commonly known baseball metric Wins Above Replacement (WAR).  The metric estimates the number of wins a player brings to his team that the average replacement-level player wouldn’t.  I argue that this widely used statistic is of questionable morality, unnecessary, and frankly irresponsible.  What few individuals or organizations consider before using a single metric to judge a player’s whole ability and value is that any single number attempting to capture a human being’s entire ability to perform an action will always lose an immense amount of detail.  I argue that if a metric is created that intentionally ignores detail to the extent that WAR does, then using it at all is neither necessary nor appropriate in any important decision-making situation.  To extend the thought process further, one can question how responsible it is to use a statistic like this so often in the public eye.  It has the potential to spread misinformation and ill-conceived notions about players and their value.  The easy alternative to using WAR is to simply use a range of diverse statistics and metrics to sufficiently display all the quantifiable qualities of a baseball player.  Examples of metrics that do not suffer from these defects will be presented.

 

10:40-10:55    Are they worth it? §

 

Kaylee Gassner

Statistics and Data Science Program, St. John Fisher College

E-mail: kag08312@sjfc.edu

 

The purpose of this research is to determine if a current professional football player’s performance in the National Football League (NFL) during their designated contract period was worth the money spent.  In Major League Baseball (MLB), getting players on base has proven to naturally produce runs needed to win games.  Similarly, it would be logical to say that there is a correlation between total yards produced by a team in a season and the number of wins the team achieves in a season.  This correlation will be used to help assign a monetary value to each yard and win for each respective team.  A contract analysis of players of similar age and production over a designated period will help determine which player contracts were worth the money they were paid.  The final step to this research is to determine at what age a player is expected to see a decline in production using an aging curve.  Additionally, this research is meant to predict the future performance of a football player in the NFL to determine an adequate contract value and length during contract negotiations.  Survival analysis will be used to analyze the expected duration for a player to perform as expected and when to expect a decline in production.

 

10:55-11:10    Optimal empirical game plan for box lacrosse §

 

Eddy Tabone

Statistics and Data Science Program, St. John Fisher College

E-mail: ejt03959@sjfc.edu

 

As the horizon of sports analytics has continued to expand throughout this decade, lacrosse has consistently been overlooked.  Working with the Rochester Knighthawks of the National Lacrosse League throughout the 2017-2018 season has opened the opportunity to begin pioneering analytics in box lacrosse.  Recording the offensive and defensive tendencies of both the Rochester Knighthawks and their opponents on a possession-by-possession basis has produced empirical data for making inferences about the value of shots along several dimensions: shot distance and target on net; efficiency, measured by the number of shots in a possession and across the twelve five-minute segments that make up the 60 minutes of a single lacrosse game; when in the shot clock shot opportunities are taken; and how possessions end.  With these inferences, conclusions can be drawn to develop potentially optimal empirical offensive and defensive game plans and, in turn, substantially improve a team’s chances of winning any given game.

 

SESSION 2C                                                                                  Goergen 108

 

Methods for Image Analysis with Applications in Medicine

 

10:25-10:45    Analyzing skin lesions in dermoscopy images using convolutional neural networks §

 

Vatsala Singh and Ifeoma Nwogu

Department of Computer Science, Rochester Institute of Technology

E-mail: vs2080@rit.edu

 

In this paper, we discuss the problem of automatic skin lesion analysis, specifically melanoma detection, by using deep learning techniques to perform classification on publicly available dermoscopic images.  Skin cancer, of which melanoma is a type, is the most prevalent form of cancer in the US, and more than four million cases are diagnosed in the US every year.  The 5-year survival rate of melanoma is 98% when detected and treated early, yet an estimated 9,730 people will die of melanoma in 2017 due to late-stage diagnosis.  Although fewer than 1% of skin cancer cases are melanoma, it accounts for the vast majority of skin cancer deaths.  For this reason, there is an urgent need for readily available and accessible tools for melanoma screening and detection.  In this work, we present our efforts towards an accessible, deep learning-based system that can be used for skin lesion classification, thus leading to an improved melanoma screening system.  For classification, a deep convolutional neural network architecture is implemented.  In addition, some important hand-coded features, such as a 166-D color histogram distribution, an edge histogram, and multiscale color local binary patterns, are extracted using computer vision techniques and fed into a decision tree classifier.  The average of the two classifiers’ outputs is taken for the final results.  The classification task achieves an accuracy of 80.3%, an AUC score of 0.69, and a precision score of 0.805.

 

10:50-11:10    Spatial regression analysis of diffusion tensor imaging data for subjects with sub-concussive head blows §

 

Yu Gu and Xing Qiu

Department of Biostatistics and Computational Biology, University of Rochester Medical Center

Patrick Asselin and Jeffrey Bazarian

Department of Emergency Medicine, University of Rochester Medical Center

E-mail: Yu_Gu@urmc.rochester.edu

 

Sports-associated concussions and repetitive head impacts have emerged as an important focus of brain injury research.  Prior studies linked repeated head impacts to axonal injury, which may cause long-term changes in brain structure and neurocognitive function.  Structural MRI techniques such as diffusion tensor imaging (DTI) have been helpful in identifying various brain pathologies, especially the spectrum of traumatic brain injury, with higher sensitivity in vivo.  We conducted a retrospective study with 28 football players.  The preseason and post-season DTI scans for each of them were collected, along with helmet accelerometer data.  We employed a novel algorithm (Spatial REgression Analysis of DTI, or SPREAD), consisting of pre-processing and five main stages: (1) quantifying the uncertainty by resampling techniques; (2) reconstructing the signal by spatial regression; (3) calculating resampling-based p-values and global summary statistics based on functional norms; (4) controlling type I error by suitable multiple comparison procedures; and (5) group-level inference based on both meta-analysis and functional summary statistics.  We demonstrated that output measures based on the SPREAD analysis have great discriminant power to distinguish athletes from controls.  We also observed significant correlations between accelerometer and DTI measures, which imply a dose-response link between repeated head impacts and white matter changes.  The combined adjusted p-value map obtained through meta-analysis illustrated that the effect of repeated head impacts cannot be accounted for by any single specific region of interest, which supports the heterogeneity hypothesis of sub-concussive brain injury related to football.  We will present our findings from both real-data analysis and simulation studies.  We believe that our method can be applied to other research problems that require analyzing medical images collected from highly heterogeneous subjects.

 

SESSION 2D                                                                                  Wilmot 116

 

Applications of Text Mining: Human Health and Life Aesthetics

 

10:25-10:45    Natural language processing of clinical notes in electronic health records §

 

Xupin Zhang

Warner School of Education, University of Rochester

Zhen Tan

Department of Biochemistry and Biophysics, University of Rochester Medical Center

E-mail: xzhang72@u.rochester.edu

 

Heart failure is a common and high-risk condition.  About 2% of adults over the age of 65 have heart failure.  Heart failure is often called congestive heart failure and occurs when the heart is unable to pump sufficiently to maintain blood flow to meet the body’s needs.  In this study, we used a publicly available critical care database called MIMIC-III to develop algorithms to identify hospitalized patients with heart failure.  We developed three algorithms for heart failure identification using electronic health record data: (1) using a “bag of words” approach that treats each word in the clinical notes as a feature; (2) adopting Kang’s method, EliIE, to prepare features from clinical notes; and (3) combining the features in algorithm (2) with features selected from structured electronic health record data.  By comparing machine learning methods of different complexity, a robust and efficient approach was provided to physicians.  Relevant predictive features for heart failure will be reported.
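
A minimal Python sketch in the spirit of algorithm (1): vectorize note text with a bag-of-words representation and fit a classifier for heart failure status.  The notes, labels, and model choice below are placeholders, not MIMIC-III data or the study's actual pipeline.

    # Bag-of-words classification of toy clinical notes.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    notes = ["shortness of breath, lower extremity edema, reduced ejection fraction",
             "elective knee arthroplasty, no cardiopulmonary complaints",
             "dyspnea on exertion, elevated BNP, started on furosemide",
             "presented with ankle sprain, otherwise healthy"]
    labels = [1, 0, 1, 0]                     # 1 = heart failure, 0 = not

    clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
    clf.fit(notes, labels)
    print(clf.predict(["worsening dyspnea and edema, ejection fraction 25%"]))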

 

10:50-11:10    Music mining with a topic modeling approach for improvisational learning §

 

Qiuyi Wu

College of Science, Rochester Institute of Technology

E-mail: wu.qiuyi@mail.rit.edu

 

Extensive studies have been conducted on both musical scores and audio tracks of western classical music with the aim of learning and detecting the key in which a particular piece of music was played.  Both the Bayesian approach and modern unsupervised learning via latent Dirichlet allocation have been used for such learning tasks.  In this research work, we venture out of the western classical genre and embrace and explore jazz music.  We consider the musical score sheets and audio tracks of some of the giants of jazz like Duke Ellington, Miles Davis, John Coltrane, Dizzy Gillespie, Wes Montgomery, Charlie Parker, Sonny Rollins, Louis Armstrong (Instrumental), Bill Evans, Dave Brubeck, and Thelonious Monk (Pianist).  We specifically employ Bayesian techniques and modern topic modelling methods (and occasionally a combination of both) to explore tasks such as automatic improvisation detection, genre identification, key learning (how many keys did the giants of jazz tend to play in, and what are those keys), and even elements of the mood of a piece.
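
A hedged Python sketch of the topic-modelling idea: treat each piece as a bag of pitch or chord tokens and fit latent Dirichlet allocation to recover latent "harmonic topics".  The token streams below are invented placeholders, not actual transcriptions of the artists named above.

    # Latent Dirichlet allocation on toy bag-of-chord representations of pieces.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    pieces = ["C7 F7 C7 G7 F7 C7 G7",                 # blues-like token stream
              "Dm7 G7 Cmaj7 Am7 Dm7 G7 Cmaj7",        # ii-V-I-like token stream
              "C7 F7 Bb7 Eb7 Ab7 Db7 Gb7",            # cycle-of-fourths-like stream
              "Em7b5 A7 Dm7 G7 Cmaj7"]

    counts = CountVectorizer(token_pattern=r"\S+").fit_transform(pieces)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
    print(lda.transform(counts))                      # per-piece topic mixtures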

 

SESSION 2E                                                                                  Goergen 110

 

Models for Analysis of Health Outcomes Data

 

10:25-10:45    Impact of hospital share of nursing home-originating hospitalizations on risk adjusted hospital 30-day readmission rates §

 

Zhiqiu Ye, Helena Temkin-Greener, Yue Li, and Orna Intrator

Department of Public Health Sciences, University of Rochester Medical Center

Ghinwa Dumyati

Department of Medicine, University of Rochester Medical Center

Dana Mukamel

Department of Medicine, University of California–Irvine

E-mail: Zhiqiu_Ye@urmc.rochester.edu

 

Hospitals are increasingly penalized for excess readmissions under the expanding Hospital Readmission Reduction Program (HRRP).  With inadequate risk adjustment, hospitals with a higher concentration of vulnerable patients may experience higher readmission rates.  Few studies have examined the role of serving frail older adults admitted from nursing homes.  We examined the effect of hospital share of nursing home-originating hospitalizations (NOHs) – percentage of admissions directly from nursing homes – on HRRP hospital readmission rates.  We studied 11,660,818 fee-for-service Medicare beneficiaries aged 65+ who were hospitalized in 3,399 HRRP-participating hospitals from July 2010-June 2013.  The 100% Medicare Provider Analysis and Review file and Minimum Data Set were used to identify NOHs.  Outcome variables were hospital-wide and the five HRRP target condition-specific risk-adjusted 30-day readmission rates.  We tested two-stage least squares models with hospital share of NOHs being instrumented by the average density of the nursing home population in the local counties (Staiger-Stock F-statistic=46.3).  Based on the Wu-Hausman test we did not reject exogeneity and reported only the ordinary least squares results.  The mean hospital share of NOHs was 12.5% (standard deviation 8.6%).  After controlling for hospital and market characteristics and state fixed effects, hospitals in the highest and the 3rd quartile of NOHs had higher readmission rates than hospitals in the 1st quartile (β=0.32 and 0.20, respectively, both p<0.001).  The relationship persisted for condition-specific readmissions.   Our findings suggest that hospitals having a higher share of NOHs perform worse on the HRRP readmission measures, and highlight the potential unintended consequences of HRRP penalties on a hospital’s capacity to serve nursing home residents.

 

10:50-11:10    Economic disparities and suicides: multi-level analyses of panel time series data in the United States

 

Bruce Sun

Department of Mathematics, SUNY Buffalo State

E-mail: sunbq@buffalostate.edu

The effects of growing social and income inequalities on suicides during the last three decades in the United States are mixed and even contradictory.  In this research, selected economic and social risk factors for suicide associated with the inequalities among 51 states in the United States were identified, and their effects were estimated by panel time series analyses.  To our knowledge, there exist no previous studies that estimated a dynamic model of American state-level suicides using panel time series data together with multilevel analyses.

 

SESSION 3A                                                                                  Goergen 101

 

Advancing Health Services Research through Novel Models and Data

 

11:20-11:35    Leveraging underutilized data sets to illuminate the relationship between pharmaceutical detailing and physician prescribing behavior §

 

                        Teraisa Mullaney

                        Department of Public Health Sciences, University of Rochester Medical Center

E-mail: Teraisa_Chloros@urmc.rochester.edu

 

Detailing is a pharmaceutical industry practice in which drug manufacturers market medications to physicians directly.  Little is known about how pharmaceutical companies target physicians or how this practice impacts prescribing behavior.  To address this possible conflict of interest, the Affordable Care Act requires public reporting of all payments to physicians by pharmaceutical companies.  Open Payments is a publicly available archive of all detailing payments from mid-2013 through 2016.  The Medicare Provider Utilization and Payment Data: Part D Prescriber data set is a publicly available archive of all prescription payments by Medicare Part D from 2013 through 2016.  Health services researchers have not published on the intersection of these two novel datasets.  This analysis tests for effects of detailing payments on physician behavior.  Moreover, the public has drawn associations between drug manufacturer practices and the current opioid epidemic.  Little is known scientifically about detailing of opioids, or statistically about the effect of detailing on physician prescribing of opioids.  To quantify assumptions around this public health crisis, OxyContin, an opioid medication, was the focus of this study.  Individual physicians were matched within both data sets using physician name and location.  Detailing payments concerning OxyContin, including the total number of payments and the total amount in dollars, were tested as predictors of OxyContin prescription factors, including total claim count, total day supply count, and total drug cost.  A time variant was included in the analysis to test for a lagged effect.  This analysis leverages underutilized data sets to illuminate possible correlations between financial pharmaceutical incentives and physician prescribing behavior.

 

11:35-11:50    Testing relationships of latent constructs with limited data §

 

Chelsea Katz

Department of Public Health Sciences, University of Rochester Medical Center

E-mail: Chelsea_Lyons@urmc.rochester.edu

 

This study employs a method to test relationships between latent constructs, not explicitly measured in data, to help explain phenomena in early research.  A model of physician decision-making was developed to explain why patients with a mental illness who are admitted for an acute myocardial infarction are less likely to receive cardiovascular procedures, compared to patients without a mental illness.  The model utilizes the constructs of uncertainty and risk tolerance, which are hypothesized to be functions of mental illness in the decision-making process.  Hospital claims data available for this study provide no explicit measurement of these constructs.  However, a non-linear functional form was derived by representing the constructs mathematically using a logistic probability density function, based on their underlying theoretical meanings.  This functional form was estimated using maximum likelihood estimation.  If the parameters representing the relationship between mental illness and the psychological constructs are distinct from zero, their estimators are biased due to multiple sources of unobserved error.  However, they are unbiased for parameters equal to zero.  Therefore, while not possible to estimate unbiased nonzero values for the parameters representing the constructs of interest, it is possible to test whether they are equal to zero.  When an estimate for a particular construct is statistically different from zero, there is early evidence that mental illness may be related to use of cardiovascular procedures, through that construct.  Due to its non-linearity, it is unlikely the functional form is observed in the data, if something besides the proposed model explains the phenomenon.

 

11:50-12:05    Using maximum likelihood estimation to test improvements to conceptual framework of health services utilization §

 

Alina Denham

Department of Public Health Sciences, University of Rochester Medical Center

E-mail: Alina_Dehnam@urmc.rochester.edu

 

Our study offers an improvement to the widely used conceptual framework of health services utilization developed by Andersen and Newman.  Rather than a priori classifying predictors of healthcare utilization into predisposing, enabling, and need factors (to be included in a regression function), we propose an explicit non-linear structural model in which predisposing, enabling (access), and need factors as well as additional factors of sensitivity of access and sensitivity of need play specific roles.  Monte Carlo evaluation of the structural model and maximum likelihood estimation shows that the true parameters can be recovered.  We apply the structural model to 2014 Health and Retirement Study data to identify the specific roles – predisposing, sensitivity of need, or sensitivity of access – played by various psychosocial variables in seeking doctor visits.  We find that pessimism, hopelessness, and social support from children, family members other than spouse and children, and from friends play all three roles: predisposing factors, access sensitivity, and need sensitivity.  Size of social network plays the roles of both sensitivity factors, but is not a predisposing characteristic.  Optimism and life satisfaction are need sensitivity factors.  We conclude that our structural model is able to identify specific roles of observed variables in driving healthcare utilization.

 

SESSION 3B                                                                                  Goergen 109

 

Biomedical Science Applications

 

11:20-11:40    Gene pathway analysis reveals neuropathologies in mice that may be linked to autism spectrum disorder in humans §

 

Valeriia Sherina

Department of Biostatistics and Computational Biology, University of Rochester Medical Center

Carolyn Klockel, Jakob Gunderson, Joshua Allen, Marissa Sobolewski, and Deborah Cory-Slechta

Department of Environmental Medicine, University of Rochester Medical Center

Jason Blum and Judith Zelikoff

Department of Environmental Medicine, New York University School of Medicine

E-mail: Valeriia_Sherina@urmc.rochester.edu

 

A growing body of evidence indicates that the developing central nervous system is a target of air pollution toxicity.  Increasingly, epidemiological reports indicate that exposure to the particulate matter (PM) fraction of air pollution during fetal development is associated with increased risk of autism spectrum disorder in children.  This study tested whether exposure to concentrated ambient particles (CAPs) during the fetal period induces altered white matter and dysregulated metal homeostasis in offspring brains.  To understand potential underlying mechanisms, we performed RNAseq followed by pathway/ontology analyses.  Global cerebellar gene expression was significantly affected by gestational CAPs exposure at post-natal days 11 and 12.  Pathway and ontology analyses revealed significant upregulation in energy homeostasis, lipid homeostasis, and metal homeostasis that could underlie the observed cerebellar myelin and metal pathologies.  This study provides novel evidence of PM-induced cerebellar toxicity and may further help to elucidate specific mechanistic targets of PM-induced neuropathologies.

 

11:45-12:05    Representation of spheroprotein surface topography using standard cartographic techniques: potential advantages over traditional methods

 

Vicente Reyes

Ronin Institute

E-mail: Vicente.Reyes@ronininstitute.org

 

Analysis of the protein surface is crucial since the biological properties of proteins are largely determined by their surface characteristics.  Meanwhile, it is well-established that a significant majority of proteins fold as compact globules (‘spheroproteins’), and as such may be likened to the earth.  For centuries, the earth’s surface has been visualized and analyzed using cartographic projection methods.  Recently we developed a procedure that transforms protein 3D structure coordinates from Cartesian (x,y,z) to spherical (ρ,φ,θ), whose origin is the geometric centroid of the protein and where ρ corresponds to the protein’s radius from its centroid to any atom in its structure, φ to the latitudes, and θ to the longitudes.  The procedure also allows for the separation of the hydrophilic outer layer of the protein from its hydrophobic core.  Projection of such a protein outer layer onto a 2D plane may then be achieved using the transformation equation for the particular projection method.  The “sea level” is set at the minimum ρ value among surface points, denoted ρ0, and the “elevations” of surface points above ρ0 serve as the z-coordinates of the 2D-projected map.  Plots of these 2D raised-relief maps (‘terrain models’) are rendered using appropriate software such as Minitab, MATLAB, and gnuplot.  Such 2D projections present a number of advantages for protein surface analysis over 3D-based models, which we shall discuss.  We have implemented the procedure using FORTRAN 77/90 programs for seven standard map projection methods, namely: sinusoidal, Wagner, Gall-Peters, Lambert, Mercator, Miller, and Aitoff, and compared them with each other in both all-atom and reduced protein representations.  We are currently refining the procedure and adding more features and utilities.
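
A minimal sketch of the Cartesian-to-spherical step described above (written here in Python with NumPy conventions; the authors' FORTRAN 77/90 implementation and the downstream map projections are not reproduced):

    import numpy as np

    def to_spherical(xyz):
        """Convert atom coordinates (n x 3) to (rho, phi, theta) about the centroid:
        rho = radial distance, phi = latitude in [-90, 90] degrees,
        theta = longitude in (-180, 180] degrees."""
        centered = xyz - xyz.mean(axis=0)        # origin at the geometric centroid
        x, y, z = centered.T
        rho = np.sqrt(x**2 + y**2 + z**2)
        phi = np.degrees(np.arcsin(z / rho))     # latitude
        theta = np.degrees(np.arctan2(y, x))     # longitude
        return rho, phi, theta

    # "Elevations" above a chosen sea level rho0 (e.g., the smallest surface rho)
    # are then rho - rho0, which become the z-values of the 2D projected map.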

 

SESSION 3C                                                                                  Goergen 108

 

Applications of Data Science for Social Good

 

11:20-11:40    Exploratory analysis of data on homelessness: paving the way for data analytics for social good §

 

Travis Broadbeck and Necip Doganaksov

Department of Accounting and Business Law, Siena College

E-mail: tm23brod@siena.edu

 

In the past decade, data science and analytics have evolved from vaguely understood technical concepts to well established fields significantly impacting our daily lives in areas such as social media, e-commerce, and increasing numbers of internet-connected devices.  The increase in demand for analytic skill sets has led to increased academic offerings, mostly confined to departments such as computer science, statistics, engineering, and mathematics, because these fields provide the technical foundation used by data science and analytics specialists.  A key thrust of this paper is to make data science and analytics more accessible to students in liberal arts and humanities fields through sharing the case study of our unique partnership with a local organization (CARES, Inc.).  Before our partnership, CARES’ original data on homelessness spanning multiple years and locations within New York State was primarily used for periodic aggregate reporting.  This paper describes our exploratory analysis that led to new insights and discoveries that would not have occurred without this partnership.  Our collaboration with the local nonprofit addressing homelessness in New York State is the only known effort of its kind and provides a suitable introduction of key concepts and applications of data science and analytics for social and humanitarian causes.

 

11:45-12:05    Point process prediction §

 

Italo Sayan

College of Science, Rochester Institute of Technology

E-mail: ixs3409@rit.edu

 

Patrol allocation is a challenge for police departments.  Currently, precincts use a crime analyst to identify hot spots and allocate units accordingly.  However, learning algorithms can offer a more systematic approach to the problem.  On average, at both a national and local level, governments designate one-third of their budgets to policing.  Spending that budget optimally is a matter of key importance.  Crime, earthquakes, and tweets have a particular characteristic in common: the occurrence of one event increases the probability of subsequent events.  Earthquakes can produce aftershocks, tweets can produce subsequent re-tweets, and crimes follow the same behavior.  Academic efforts have shown that a crime elevates the risk of repetition in nearby areas.  Models have been developed to capture patterns of self-exciting point processes (SEPP).  Earthquakes, crimes, and tweets all follow SEPP behavior.  It is possible to leverage the previous SEPP literature to produce a tool for police departments.  Many advances have been made.  Epidemic type aftershock sequence (ETAS) models were adapted from seismology to produce habitation burglary prediction models.  First, it is important to understand how SEPP models work.  Then, I will explain how to apply them to crime prediction using R code.  Finally, I will produce a web app to serve a predictive crime map.  My main contribution is the programmatic application of George Mohler's SEPP model.  For serving the map, I will use the Google Maps API.  Data from the San Antonio Police Department will be used for modelling purposes.  Previous results show that SEPP models predict 1.4-2.2 times as much crime compared to a dedicated crime analyst.  Police patrols using ETAS forecasts led to an average 7.4% reduction in crime volume as a function of patrol time.
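
As a rough illustration of the self-exciting structure (sketched here in Python rather than the R used in the talk, with illustrative parameter values rather than fitted ones), an ETAS-style conditional intensity adds, to a constant background rate, decaying contributions from each past event:

    import numpy as np

    def conditional_intensity(t, x, y, events, mu=0.2, k0=0.5, omega=1.0, sigma=0.3):
        """events: array of past (t_i, x_i, y_i); returns lambda(t, x, y)."""
        lam = mu  # background rate of new ("parent") crimes
        for t_i, x_i, y_i in events:
            if t_i < t:
                dt = t - t_i
                d2 = (x - x_i) ** 2 + (y - y_i) ** 2
                # temporal decay times a spatial Gaussian kernel around the trigger
                lam += k0 * omega * np.exp(-omega * dt) * \
                       np.exp(-d2 / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
        return lam

    past = np.array([[0.0, 1.0, 1.0], [0.5, 1.1, 0.9]])   # hypothetical past crimes
    print(conditional_intensity(t=1.0, x=1.05, y=0.95, events=past))

A predictive crime map is then obtained by evaluating this intensity over a grid of locations and flagging the highest-intensity cells for patrol.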

 

SESSION 3D                                                                                  Wilmot 116

 

Undergrad, Meet Data Science: Notes from the Field on a Data Science Minor

 

11:20-11:35    Introducing data science: statistics, computing, and more

 

Anne Geraci and Katie Donovan

Statistics and Data Science Program, St. John Fisher College

E-mail: ageraci@sjfc.edu

 

STAT160, Introduction to Data Science, is the introductory and foundational course in the Data Science minor at St. John Fisher College.  The course focuses on the “pillars” of statistical inference (e.g., statistical significance, practical importance, generalizability, and causality) using simulation-based methods in R and RStudio.  We find that randomization makes statistical concepts more concrete and accessible to students than traditional theory-based methods.  We build knowledge of R incrementally and utilize in-class Research Assistants during more intensive periods of programming.  The course combines several aspects, including statistical inference, data management, programming, interpretation and communication of results, and thinking holistically about quantitative data analysis.  Students practice these skills in homework assignments, during in-class examples, in traditional written exams, and in a semester-long data analytic project on a topic related to their major.
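
As an example of the simulation-based approach, the kind of randomization test the course builds on takes only a few lines; the data below are made up, and the course's own activities use R/RStudio rather than the Python shown here.

    import numpy as np

    rng = np.random.default_rng(1)
    group_a = np.array([72, 85, 90, 66, 78, 88])   # e.g., scores under one condition
    group_b = np.array([70, 75, 80, 64, 73, 77])   # scores under another condition
    observed = group_a.mean() - group_b.mean()

    pooled = np.concatenate([group_a, group_b])
    reps = 10000
    null_diffs = np.empty(reps)
    for i in range(reps):
        shuffled = rng.permutation(pooled)          # re-randomize the group labels
        null_diffs[i] = shuffled[:len(group_a)].mean() - shuffled[len(group_a):].mean()

    p_value = np.mean(np.abs(null_diffs) >= abs(observed))   # two-sided
    print(observed, p_value)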

 

11:35-11:50    Programmatic issues in an undergraduate data science minor

 

B. Evan Blaine

Statistics and Data Science Program, St. John Fisher College

E-mail: bblaine@sjfc.edu

 

We address some issues involved in developing an undergraduate Data Science program, and discuss them in the context of the Data Science minor at St. John Fisher College.  The presentation will show how American Statistical Association curriculum guidelines for statistical science led to course and curriculum development work, culminating in program requirements, a course progression, and a skills map for the Data Science program.  We discuss the need for disciplinary inquiry and problem solving in a Data Science program.  Our program combines a badge of statistical and computing study with disciplinary electives that are integrated in a capstone Data Science experience, and we illustrate some of those disciplinary connections.  Finally, we address the statistical computing challenges of a Data Science program, particularly as they play out in a Data Science minor, and explain our use of R as our program’s computing environment.  Our experience is that Data Science is inherently multidisciplinary and adds value to students’ education in many liberal arts disciplines and their preparation for careers in those fields.

 

11:50-12:05    Tying it all together

 

Bernard Ricca

Statistics and Data Science Program, St. John Fisher College

E-mail: bricca@sjfc.edu

 

STAT375, Data Analysis and Statistical Computing, serves as a capstone experience for the Data Science minor at St. John Fisher College (it also serves as a milepost for the Statistics major).  This course ties together three strands of the data science minor (learning to learn, undertaking all aspects of a project, and connecting to the discipline) and formalizes the idea of statistical computing.  Each student chooses a project, typically connected to her/his discipline, to carry out during the semester and publicly present.  The students are supported in their work through a combination of mini-lectures on various statistical and/or computing topics, individual consultation with the instructor, and judicious (and guided!) use of available online resources.  These projects provide students an opportunity to solidify and demonstrate their skills, and the public presentations serve as a recruiting tool.  Additionally, the course requires student work on ePortfolios, which provide both concrete examples of skills to potential employers and data for program assessment.  Sample student work and comments about the course, along with the ePortfolio process, will be shown.

 

SESSION 3E                                                                                  Goergen 110

 

Statistics and Machine Learning for Prediction

 

11:20-11:40    Integrating machine learning algorithms and statistical methodology to predict college graduation status and final GPA §

 

Robert Tumasian, Diana DeFilippis, and Sydney Ng

Department of Mathematics, SUNY Geneseo

E-mail: rat3@geneseo.edu

 

Understanding the factors that affect student performance can provide important information to college administrators.  Ten years of college cohort data are analyzed by constructing several predictive machine learning models to determine which factors affect final GPA, graduation status, and major retention.  Additionally, predictor strength is examined to determine the most influential factors for academic success.  We will also address the issue of missing observations in our data.
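
The abstract does not name the specific algorithms, so the following is only a generic sketch of the workflow on synthetic data with hypothetical predictors: fit a classifier for graduation status and rank predictor strength via feature importances.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(2)
    n = 1000
    X = np.column_stack([
        rng.normal(3.0, 0.5, n),      # high-school GPA (hypothetical predictor)
        rng.normal(1200, 150, n),     # SAT score
        rng.integers(0, 2, n),        # first-generation status
    ])
    logit = -8 + 2.0 * X[:, 0] + 0.002 * X[:, 1] - 0.3 * X[:, 2]
    graduated = rng.binomial(1, 1 / (1 + np.exp(-logit)))

    X_train, X_test, y_train, y_test = train_test_split(X, graduated, random_state=0)
    model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))
    print("feature importances:", model.feature_importances_)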

 

11:45-12:05    Predicting stock market prices using machine learning algorithms §

 

Andrew Flannery, George Kuliner, and Chase Yaeger

Department of Mathematics, SUNY Geneseo

E-mail: ajf16@geneseo.edu

 

We will build predictive models of S&P 500 index values for some top companies using historical data from recent years, with predictors including SPY values and the companies’ most valued securities.  Taking time series effects into account and using state-of-the-art machine learning algorithms, our goal is to build high-performing models, tested against current index values.

 

SESSION 4                                                                                     Wegmans 1400

 

Keynote Lecture

 

1:25-2:35        How big data can leverage small data and conversely

 

                        Bhramar Mukherjee

Department of Biostatistics, University of Michigan

E-mail: bhramar@umich.edu

 

While reviewing a recent article by a famous statistician from Harvard University, I was struck by the following sentence: “Seeing scientific applications turn into methodological advances is always a joy, at least for those of us who care about advancing the science of data, in addition to advancing science with data.”  In this talk, I will try to share this “joy” (and associated anxiety) of being a quantitative scientist at a time when our science and society are undergoing unprecedented information/data revolution.  I will present three ideas/examples: (1) Shrinkage estimation to combine heterogeneous data sources; (2) Expanding an existing risk prediction model with auxiliary summary information that might be publicly available; and (3) A phenomewide association study with polygenic risk scores and electronic health records using data from the Michigan Genomics Initiative, a longitudinal biorepository at Michigan Medicine.  The examples are designed to illustrate that principled study design and data science methodology are at the heart of doing good science with data.  This is joint work with many students and colleagues at the University of Michigan.

 

SESSION 5A                                                                                  Goergen 101

 

Statistics Education

 

3:30-3:50        Success in a statistics course: how important is grit?

 

Susan Mason

Department of Psychiatry, Niagara University

Elizabeth Reid

Department of Mathematics, Marist College

E-mail: sem@niagara.edu

 

Students who are successful in basic statistics courses tend to be those who have confidence in their analytical skills.  As professors, we know the importance of background and skill level, but we also recognize the need for students to come to class with a positive attitude and a commitment to learning.  In this paper, we discuss the factors associated with student success, including the student’s ability and background, characteristics of the course, the classroom atmosphere, the student’s attitude, and the student’s behavior.  One performance-related characteristic that has been the subject of recent research is grit.  When students persevere and demonstrate a passion for achieving long-term goals, even when a course is challenging for them, the students are said to have grit.  We examined grit in statistics students by studying the relationships between the students’ attitudes, behaviors, and grades.  In one study, we administered a questionnaire to new students, asking them to evaluate their own attitudes and behaviors relative to those of others taking the course.  The students’ responses were then correlated with their grades to determine the predictive value of self-assessments.  Another approach we used was to examine anonymous course evaluation forms.  A series of questions on the forms asked students to reflect on their own effort and diligence throughout the semester.  The forms also asked students what grade they expected to receive in the course.  Comparisons between the responses of more successful and less successful students contribute to our understanding of the behaviors associated with success.

 

3:55-4:15        Active learning in mathematics and statistics courses

 

Elizabeth Reid

Department of Mathematics, Marist College

E-mail: Elizabeth.Reid@marist.edu

 

“Why do I need to know this?” is a question that is asked far too often in mathematics and statistics classes.  In this presentation, we explore the benefits of active learning for students taking such courses.  It is helpful to students when they are permitted to collaborate with their peers, are provided applications and analogies for the material presented, and are allowed to use concepts learned in class to answer a question that they are interested in.  There are three levels of understanding for any topic.  The most basic level is being able to follow along when someone explains a concept to you.  The second level is demonstrating the ability to apply the concept to answer a question, and the third level is the capability to explain the concept clearly to someone else.  Active involvement with the material facilitates deeper levels of understanding and a greater appreciation of why the course is important.

 

SESSION 5B                                                                                  Goergen 109

 

Methods for Regression Analysis

 

3:30-3:50        Comparing the efficiency of shrinkage estimators

 

Marvin Gruber

College of Science, Rochester Institute of Technology

E-mail: mjgsma@rit.edu

 

We consider a linear regression model with prior assumptions about the mean and the dispersion of the regression parameters.  Averaging over Zellner’s balanced loss function, the optimum estimator (the one with the smallest risk) is a convex linear combination of the least-squares estimator and a ridge-type estimator.  We call this a Liu-type estimator because it is a generalization of the estimator proposed by Liu.  The ridge-type estimator is optimal for the classical mean dispersion error.  Using Farebrother’s result, we compare the risk of these estimators to that of the least-squares estimator under both Zellner’s balanced loss function and the mean dispersion error.  We also give examples of similar comparisons of ridge- and Liu-type estimators to each other.
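
For reference, in the standard notation of this literature (the talk's exact prior structure and loss weights are not reproduced here), the estimators being compared take the forms

    \hat{\beta}_{LS} = (X'X)^{-1} X'y, \qquad
    \hat{\beta}_{k}  = (X'X + kI)^{-1} X'y \quad (\text{ridge-type},\ k > 0),
    \hat{\beta}_{d}  = (X'X + I)^{-1}(X'X + dI)\,\hat{\beta}_{LS} \quad (\text{Liu-type},\ 0 < d < 1),

and the optimum estimator under the balanced loss is a convex combination \alpha\,\hat{\beta}_{LS} + (1-\alpha)\,\hat{\beta}_{k} for some weight \alpha \in [0,1].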

 

3:55-4:15        Statistical learning approach to modal regression

 

Yunlong Feng

Department of Mathematics and Statistics, SUNY Albany

E-mail: ylfeng@albany.edu

 

In this presentation, I will talk about the modal regression problem from a statistical learning point of view.  It will be shown that modal regression can be approached by means of empirical risk minimization techniques.  A framework for analyzing and implementing modal regression within the statistical learning context will be developed.  Theoretical results concerning the generalization ability and approximation ability of modal regression estimators will be provided.  Connections and differences between the proposed modal regression method and existing ones will also be illustrated.  Numerical examples will be given to show the effectiveness of the newly proposed modal regression method.
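
One common way to cast modal regression as empirical risk minimization (a sketch of the general idea; the talk's exact estimator and assumptions may differ) is to choose the regression function that maximizes a kernel-smoothed criterion,

    \hat{f} = \arg\max_{f} \ \frac{1}{n} \sum_{i=1}^{n} K_{\sigma}\!\left( y_i - f(x_i) \right),

where K_\sigma is a kernel (for example Gaussian) with bandwidth \sigma.  As \sigma \to 0, the criterion concentrates on the conditional mode of Y given X, rather than the conditional mean targeted by least squares.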

 

SESSION 5C                                                                                  Goergen 108

 

Predicting Outcomes using the Bradley-Terry Model

 

3:30-3:50        Prediction and evaluation with the Bradley-Terry model: a college hockey case study

 

John Whelan

College of Science, Rochester Institute of Technology

E-mail: john.whelan@astro.rit.edu

 

The Bradley-Terry-Zermelo model has been widely used to evaluate paired comparison experiments, with applications ranging from taste tests to rating chess players.  It is, among other things, the basis of the KRACH rating system used by several college hockey websites, which uses the maximum-likelihood value of the Bradley-Terry strength parameter for each team.  One use of such a model is to assign probabilities to outcomes of future games based on past results.  These probabilities can be assessed by use of a Bayes factor in light of the actual results of the predicted games.  We illustrate this by application to the example of NCAA Division I Men's Ice Hockey.  We compare the performance of two methods of assigning probabilities, under different prior assumptions: 1) assuming each team's strength is equal to the MLE, or the maximum a posteriori value for one of several prior choices, and 2) estimating the posterior predictive probability using a Gaussian expansion about the MAP point for the chosen prior.
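
For readers unfamiliar with the model, the Bradley-Terry probability that team i beats team j is \pi_i / (\pi_i + \pi_j) for positive strength parameters \pi_i, and the maximum-likelihood strengths behind KRACH-style ratings can be computed with the classical fixed-point (MM/Zermelo) iteration.  The sketch below uses a hypothetical three-team win matrix; the talk's prior choices, MAP estimates, and Gaussian expansions about the MAP point are not reproduced.

    import numpy as np

    wins = np.array([[0, 3, 2],
                     [1, 0, 2],
                     [2, 1, 0]], dtype=float)   # wins[i, j] = wins of team i over team j
    games = wins + wins.T                        # games played between each pair
    total_wins = wins.sum(axis=1)
    pi = np.ones(len(wins))

    for _ in range(200):                         # fixed-point (MM) updates
        denom = games / (pi[:, None] + pi[None, :])
        np.fill_diagonal(denom, 0.0)
        pi = total_wins / denom.sum(axis=1)
        pi /= pi.sum()                           # fix the scale (strengths are relative)

    print(np.round(pi, 3))
    print("P(team 0 beats team 1) =", pi[0] / (pi[0] + pi[1]))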

 

3:55-4:15        Big data analytics: some ethical and legal aspects

 

Reneta Barneva

Department of Applied Professional Studies, SUNY at Fredonia

E-mail: Reneta.Barneva@fredonia.edu

 

The analysis of big data has led to a number of new applications in health sciences, safety and security, meteorology, astronomy, and even agriculture.  It requires the development of new theoretical methods for data processing and data organization and storage.  While scientists are excited by the new opportunities, big data analytics poses some ethical questions and concerns, which are often overlooked.  In this talk, some case studies will be considered to illustrate the concerns and the first steps taken to solve the emerging issues will be reviewed.

 

SESSION 5D                                                                                  Wilmot 116

 

Quantifying Treatment Effects in Clinical Studies

 

3:30-3:50        Investigations of a nonparametric effect size statistic for studies with two independent samples

 

Bernard Ricca and B. Evan Blaine

Statistics and Data Science Program, St. John Fisher College

E-mail: bricca@sjfc.edu

 

Cohen’s d is a widely reported standardized effect size statistic in research.  Research has established that Cohen’s d does not perform well in the presence of non-normal and heteroscedastic data.  Robust expressions of d are available, but few nonparametric alternative effect size statistics exist.  A nonparametric analogue of Cohen’s d effect size, using the median and the median absolute deviation, is proposed and its characteristics are explored.  Using simulated data, the proposed effect size, ΔMAD, is found to be a more accurate measure of group differences for data sets that include outliers.  Simulations also show that ΔMAD is a more appropriate measure of group differences for non-normal and/or heteroscedastic data.  Surprisingly, these simulations indicate that ΔMAD provides a better estimate of effect size than Cohen’s d even when the usual tests (e.g., Levene test for equality of variances, Shapiro-Wilk normality test) fail to detect significant deviations from normality and/or equal variance.  An investigation of additional useful properties of ΔMAD as an instrument for meta-analysis, along with the implications of this study for meta-analyses, is also presented.
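
One natural construction of such a median/MAD analogue is sketched below; the authors' exact pooling rule and any consistency scaling may differ, and the data are simulated simply to include outliers in one group.

    import numpy as np

    def cohens_d(a, b):
        sp = np.sqrt(((len(a) - 1) * np.var(a, ddof=1) + (len(b) - 1) * np.var(b, ddof=1))
                     / (len(a) + len(b) - 2))
        return (np.mean(a) - np.mean(b)) / sp          # mean difference / pooled SD

    def delta_mad(a, b):
        mad = lambda x: np.median(np.abs(x - np.median(x)))
        pooled_mad = (mad(a) + mad(b)) / 2             # simple pooling; other rules possible
        return (np.median(a) - np.median(b)) / pooled_mad

    rng = np.random.default_rng(3)
    a = rng.normal(0.5, 1, 50)
    b = np.concatenate([rng.normal(0.0, 1, 45), rng.normal(8, 1, 5)])   # outliers in group b
    print(cohens_d(a, b), delta_mad(a, b))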

 

3:55-4:15        Measures of overall survival reported and utilized in randomized controlled trials of cancer therapies

 

Eva Culakova

Department of Surgery, University of Rochester Medical Center

E-mail: Eva_Culakova@urmc.rochester.edu

 

Recent randomized controlled trials (RCTs) of cancer immunotherapy have brought new challenges to statistical analysis.  The survival patterns in these trials tend to deviate from the proportional hazards assumption that is more commonly satisfied in studies of conventional chemotherapy treatments.  The increasing prevalence of immunotherapy treatments in clinical research has led to increased interest in the development of novel statistical methods that would be a better methodological fit for these trials.  In a paper published in The Journal of Clinical Oncology (2016), Trinquart et al. focused on the ratio of restricted mean survival time (RMST) and recommended routine reporting of RMST measures in trials with a time-to-event outcome.  In this presentation, we plan to compare theoretical properties of the conventional hazard ratio (HR) measure relative to the ratio of RMST.  Our focus will be on comparing the ratio of RMST to the HR in studies with good prognosis, such as RCTs in early-stage breast cancer that have low mortality, and also in studies with poor prognosis and mature survival curves.  Additionally, we will use a systematic review of RCTs of chemotherapy in patients with advanced breast cancer published between 1990 and 2013 to provide a brief overview of reported statistical measures related to overall survival.
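
For reference, with S(t) the survival function, h(t) the hazard, and \tau a pre-specified truncation time, the two summary measures being compared can be written as

    h_1(t) = \mathrm{HR} \cdot h_0(t), \qquad
    \mathrm{RMST}(\tau) = \int_0^{\tau} S(t)\,dt .

The RMST is the area under the survival curve up to \tau, and treatment arms can be compared by the ratio \mathrm{RMST}_1(\tau)/\mathrm{RMST}_0(\tau), which remains interpretable when the proportional hazards assumption does not hold.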

 

SESSION 5E                                                                                  Goergen 110

 

Statistics and Data Science Applications in Media and Astronomy

 

3:30-3:50        Predicting a quasar’s redshift and radio brightness from Sloan Digital Sky Survey data §

 

Alexander Belles, William Freed, and John Robinson

Department of Mathematics, SUNY Geneseo

E-mail: ab40@geneseo.edu

 

In astronomy, the use of statistical learning methods is becoming increasingly important in the era of large sky surveys.  The Sloan Digital Sky Survey (SDSS) is one such survey that has surveyed the sky in optical light and has led to the discovery of thousands of quasars, a type of galaxy with an accreting supermassive black hole at its center.  Using a data set of detected quasars from SDSS and several statistical and machine learning algorithms, we will predict each object’s redshift, a proxy for distance, from its brightness in a variety of wavelength ranges.  A secondary goal will be to classify a quasar as radio-bright using brightnesses in optical wavelengths.  Determining a quasar’s redshift and radio brightness is important to studying these extragalactic sources.

 

3:55-4:15        Understanding news readers at Globo.com with data science

 

Andre Ramos

Globo.com

E-mail: axl5988@rit.edu

 

In the digital era, advertisers have moved away from newspapers and magazines to places where they are able to track the way people interact with their brands.  Along with that, consumption has migrated from desktop to mobile, which has given rise to new user interfaces such as news feeds, while faster internet connections provide a better environment for streaming content.  With that in mind, publishers have to move away from ‘vanity’ metrics, such as pageviews and unique visitors, and focus instead on creating a habit of consumption.  This means working in the dimensions of frequency and time.  We will talk about how the largest news portal in Brazil, with over 86 million monthly users, is using big data and data science to understand our users, provide better content, and develop strategies that are no longer based only on advertising.

 

 

§ Indicates presentations that are eligible for the student presentation awards