Industry Professional Poster Competition
Purdue University’s Krenicki Center for Business Analytics & Machine Learning will be awarding $1,000 for the top industry professional (non-student) poster. Scoring will be based on content, layout, and presentation. This is a great way to present your work and get feedback now! Attendance at the poster competition is never light and may allow you to connect with more people interested in your work.
Thank you to the Industry Professional Poster Competition Sponsor
Student Poster Competition
SAS is sponsoring an award of $1,000 for the top student poster. Scoring will be based on content, layout, and presentation. Certificates will be awarded for 2nd and 3rd place student posters. Each participant among the top three posters will also receive a FREE SAS certification exam voucher. The judging panel will be posted on the website as the conference draws nearer.
Thank you to the Student Poster Competition Sponsor
Poster Competition Judges
Aaron Burciaga, CAP, ACE
Senior Practice Manager
US Federal Partner Professional Services
Amazon Web Services
Chief Data Scientist, Senior Research Scientist & Assistant Professor
Sr. Director & Sr. AI Principal Engineer
Data & Analytics
Juan R. Jaramillo, Ph.D.
MSBA Academic Director
Associate Professor of Analytics
Hua Ni, D.Sc., PMP, CAP
Principal Data Scientist, Associate Partner
Cognitive & Analytics
Data & Technology Transformation
IBM Consulting, U.S. Federal
Jennifer Lewis Priestley, Ph.D.
Professor of Statistics and Data Science
School of Data Science and Analytics
Kennesaw State University
Shreyas Subramanian, Ph.D.
Principal AI/ML Specialist Solutions Architect
Amazon Web Services
Yan Xu, Ph.D.
Director, Operations Research & IML
Monday Poster Session
Estimating Customer Lifetime Value in Insurance
Rohan Das, Yi-Chen Chiou, Udayan Kate, Jiayu Zhang, Sanjana Santhanakrishnan, Yang Wang, Purdue University, West Lafayette, IN
Customer Lifetime Value (CLV) is an important metric for companies focused on growing or maintaining the revenue streams from their customer base. The metric can help organizations direct marketing expenditures and efforts by identifying customers who have high potential in terms of longevity of association and in terms of additional product sales (cross-selling). It can also help identify customers who probably do not have high potential in these terms, further optimizing the expense and effort spent on marketing. This paper explores possible methods of estimating customer lifetime value in the insurance industry and develops a model to predict it. In the insurance industry, estimating customer lifetime value consists of calculating the revenue generated by a customer through premiums across different lines of business (Auto, Homeowner, Life, etc.) and the claims paid by the company. It is essential to estimate how long a customer will stay associated with an organization, as most of the revenue generated for an insurance company comes from the renewal of policies. In this project, the methodology consists of a probabilistic approach with a structural model in which the CLV of a customer is calculated from the customer’s present revenue and the probabilistically weighted value that the customer will generate. Customers will then be classified into categories based on their CLV, which will help us derive insights into customer segments with high potential for generating revenue.
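As a rough illustration of the probabilistically weighted calculation this abstract describes, the sketch below adds a customer's present-year margin (premiums minus claims) to the discounted margin of future years, each weighted by the probability that the policy is still being renewed. The function name, the constant yearly retention rate, the discount rate, and the sample figures are all illustrative assumptions; this is not the authors' model.

```python
def expected_clv(annual_premium, annual_claims, retention_rate,
                 discount_rate, horizon_years):
    """Present margin plus retention-weighted, discounted future margins.

    Assumes (hypothetically) a constant margin and a constant yearly
    retention rate over the whole horizon.
    """
    margin = annual_premium - annual_claims
    clv = margin  # year 0: revenue the customer generates today
    for year in range(1, horizon_years + 1):
        # Probability that the customer has renewed every year up to `year`.
        still_customer = retention_rate ** year
        clv += margin * still_customer / (1 + discount_rate) ** year
    return clv


# Hypothetical customer: $1,200 premium, $700 expected claims per year.
clv = expected_clv(annual_premium=1200.0, annual_claims=700.0,
                   retention_rate=0.85, discount_rate=0.05, horizon_years=5)
```

Customers could then be binned into segments by the resulting value, for example with fixed thresholds or quantiles.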
Digital Servitization – Managing Risks and Tradeoffs in Configuration of Resources for Business Model InnovationJuliana Hsuan, Professor, Copenhagen Business School, Frederiksberg, Denmark
Many manufacturers are embarking on digital servitization as a strategy to compete through value propositions that integrate products with new development of services and software systems. The undertaken journeys of Danish manufacturers showcase the tradeoffs and risks considered (between standardization and innovation) in the configuration of required resources for business model innovation.
Using Django and MongoDB to Develop a Translation Quality Checking Tool
Ryan Egbert, Subhashree Chowdhury, Poorna K. Narasimhan, Suyash Sukthankar, Purdue University, West Lafayette, IN
As literacy continues to spread throughout the world, more and more documents are being translated into languages that did not have prior access to such information. To ensure accurate and complete translations, this project provides a system through which the quality of a translation can be determined. A visual interface was built using the Django web framework to allow users the ability to check the quality of translations. Further, users can view previous versions of a translation and are given insights on how to improve certain aspects of the translation. Currently, there are no full-fledged web applications that provide insights into translation similarity, comprehensibility, and readability quality checks through an interactive interface.
Forecasting Pesky SKUs for Auto Parts Retailer
Apoorva Singh, Saachi Lalwani, Jose Galindo, Shubham Agarwal, Adithya Bharadwaj Umamahesh, Yang Wang, Purdue University, West Lafayette, IN
The current issue faced by the client involved lost sales and increased holding costs for leftover inventory. Both issues have a direct impact on the economic profits of the firm and are thus of pressing importance to the company. We have used historical sales data in our project in order to better understand the patterns in sales which can then give us an idea of the future sales. Through this study, we have identified anomalous SKUs based on outlier detection and understanding the statistical significance of each input predictor. We have defined thresholds in sales per store amount to classify each SKU as “pesky”, i.e. underperforming in some stores and over-performing in others, or not. Further, we have attempted to forecast the demand for these anomalous SKUs in order to improve the inventory management and sales reporting of the firm. We explored and applied various prediction models such as Linear Regression, Lasso Regression, and Random Forest. This will not only reduce holding costs and avoid lost sales, but also streamline the supply-chain as it gives the client a better understanding of the parts that need to be supplied to each store.
Customer’s Willingness To Wait
Aman Bhargava, Aditya Gupta, Arjun Mishra, Simranjeet Singh Khalsa, Yang Wang, Purdue University, West Lafayette, IN
We develop a model to predict the number of days a customer is willing to wait for a specific SKU based on historical data, so that companies can better anticipate customer needs and optimize shelf space by stocking units for which willingness to wait is low. The model will eventually help a company achieve SKU rationalization and avoid losing sales due to a lack of inventory on hand. The willingness to wait for a SKU depends on multiple factors, such as part type, customer requirement and urgency, delivery time to the closest store, and availability of the SKU with competitors in the area. Hence, the analysis needs to be done at the SKU level for a cluster of stores, using survival analysis techniques that track customer drop-off for products as delivery time increases. The client’s support in providing the required data and legacy insights has been a key part of our project structure. Based on these findings, this research presents insights into predicting the time a customer is willing to wait for a specific SKU using time-series, survival, and regression concepts in Python and R.
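To make the survival analysis idea concrete, the following sketch computes a Kaplan-Meier estimate of the share of customers still willing to wait after a given number of days. The abstract does not say which survival technique was used, so the choice of Kaplan-Meier, the function name, and the sample data are assumptions made purely for illustration.

```python
def kaplan_meier(durations, observed):
    """Return [(day, share still waiting)] at each observed drop-off day.

    durations: days until the customer dropped off, or was last seen waiting
    observed:  True if the drop-off was seen, False if the record is censored
    """
    event_days = sorted({d for d, seen in zip(durations, observed) if seen})
    survival = 1.0
    curve = []
    for day in event_days:
        # Customers who had not yet dropped off or been censored before `day`.
        at_risk = sum(1 for d in durations if d >= day)
        drop_offs = sum(1 for d, seen in zip(durations, observed)
                        if d == day and seen)
        survival *= 1 - drop_offs / at_risk
        curve.append((day, survival))
    return curve


# Hypothetical waiting records for one SKU in one store cluster.
days = [1, 2, 2, 3, 5, 5, 7]
dropped = [True, True, False, True, True, False, True]
curve = kaplan_meier(days, dropped)
```

A SKU whose curve falls quickly is one customers will not wait for, which under the abstract's reasoning makes it a candidate for shelf stock.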
E-commerce Performance Forecasting and Digital Marketing Optimization
Bidrupa Sinha, Tamaralayefa Timiyan, Rachel Fagan, Anil Cavale, Prabhuram Popuri, Matthew A. Lanham, Purdue University, West Lafayette, IN
This study develops a data-driven marketing strategy playbook and dashboard for a small but rapidly growing (approx. $6.5M revenue in 2021) e-commerce premium candle brand; the approach could be generalized to other firms designing and deploying analytics-based marketing solutions. The brand is direct-to-consumer and relies heavily on digital marketing. It has recently begun to monitor its digital marketing metrics, including revenue by channel, promotion, and product, as well as conversion rates for each channel. There is potential for growth as the brand continues to leverage data-driven decision making. We identified three key opportunities where analytics can be integrated to offer additional decision support and insight into the brand’s marketing strategy: (1) forecasting future growth, (2) optimizing marketing costs, and (3) improving day-to-day metric benchmarking and website operations. In this study, we first developed growth forecasts for the brand in terms of candle product types, revenue generation, and revenue by marketing channel. Next, we created an optimization model to streamline the company’s text and email digital marketing efforts and reduce unnecessary spending. Lastly, we provided a general marketing playbook that shows how the metrics, if captured consistently over time, along with the model results, could help the client in their daily operations (e.g., examining landing page speeds, identifying the optimal time to send promotions, and more). Based on the results of our forecasts and analysis, we provided predictive accuracy for 2022 and a forecasting dashboard so the brand can implement our models in the future. This way, the company is in a better position to improve its marketing activities and overall revenue in the following years.
Building Modern, Cloud-based Data Pipelines in the Classroom
Cody Baldwin, Director, MS of Business Analytics, University of Wisconsin-Madison, Madison, WI
The proliferation of cloud data warehouses that are efficient, cost-effective, and scalable is changing the way we do analytics. There is a need in the labor market for people who can use these cloud data warehouses to build cloud-based data pipelines that automatically extract, load, and transform data, so that it is ready and available for analytics. However, given the complex constellation of tools, actually teaching students to build these pipelines can be a challenge. In this poster, we share how we develop these skills at the University of Wisconsin-Madison.
Using Python Libraries to Predict, Optimize, and Provide End Users Decision Confidence
Paul Chen, So Yeon Baik, Matthew A. Lanham, Purdue University, West Lafayette, IN
Modeling is a fundamental process in many aspects of scientific research, engineering, and business. Predictive modeling, from interpretable parametric models to more sophisticated machine learning models, has many use cases. Likewise, Algebraic Modeling Languages (AMLs) have emerged as a necessary capability when formulating large, complex optimization models. Often prediction and optimization are used together to provide the end user with decision guidance (e.g., predict demand-optimize price, predict churn-optimize incentive, predict cancer-optimize treatment, etc.). Our work demonstrates a code design that integrates a predictive model trained with the Python library PyCaret into an optimization model built with Pyomo. Using a publicly available dataset, we predict window breakage from manufacturing process settings. Then, via the Pyomo AML, we optimize the process settings to minimize the breakage rate. What makes this case study useful to the audience is that decision-makers, such as the manufacturing technician who must set the process settings, often want to know how an outcome (e.g., window breakage) might change if the “optimal” settings are not used. Our Python code design shows how to efficiently integrate the predictive model into the optimization model and then show an outcome distribution based on how the user might want to change an input parameter. Practitioners ranging from data scientists to developers may find our example a useful use case to extend to their own problems.
Partial Association Between Mixed Data: Assessing the Impact of Covid-19 on College Student Well-being
Zhaohu (Jonathan) Fan, Shaobo Li, Dungang Liu, Ivy Liu, Philip Morrison, University of Cincinnati, Cincinnati, OH
The outbreak of COVID-19 has lowered the well-being of college students across the world according to existing studies. In this paper, we study the association between well-being and common psychological factors. We analyze the data from two cohorts of first year undergraduates (in New Zealand) in April 2019 and 2020 (early pandemic), which enables a counterfactual to explore the impact of COVID-19. We found that by controlling for age and gender, the other covariates (students’ healthiness, loneliness and accommodation) account for more of the association between well-being and anxiety in 2020 than that in 2019, implying an increased moderating effect of these covariates on the association after the strike of COVID-19. Our empirical findings may deliver various insights to domain experts and lead to more specific studies to assist university policy makers and healthcare providers in decision-making. The empirical analysis in this paper is based on our proposed framework of partial association analysis for mixed data. Specifically, we propose to assess partial association using the rank-based measure, Kendall’s tau, based on a unified residual that can be obtained from any general parametric model for continuous, binary and ordinal outcome. We show that the conditional independence between two outcome variables is equivalent to the independence between the corresponding pair of unified residuals. We also show several useful statistical properties of the proposed partial association measure. A practical guide that covers estimation and inference is provided.
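For readers unfamiliar with the rank-based measure named above, here is a minimal sketch of Kendall's tau (the simple tau-a variant, with no tie correction) for two numeric sequences. In the framework described, the two inputs would be the unified residuals from the fitted models; how those residuals are computed is not given here, so the sketch takes them as plain lists, and the sample values are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1   # the pair is ordered the same way in x and y
        elif direction < 0:
            discordant += 1   # the pair is ordered oppositely in x and y
        # pairs tied in x or y count toward neither total
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical residual pairs for four students.
residual_wellbeing = [0.2, -0.5, 1.1, 0.4]
residual_anxiety = [-0.1, 0.6, -0.9, -0.3]
tau = kendall_tau_a(residual_wellbeing, residual_anxiety)
```

A value near zero would be consistent with conditional independence of the two outcomes given the covariates; a formal conclusion requires the inference procedure the authors describe.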
A Hierarchical Approach and Analysis of Assortment Optimization
Dhruv Shrivastava, Utkarsh Bajaj, Prerana Das, Manandeep Gill, Aaron Chen, Toolika Agrawal, Matthew A Lanham, Purdue University, West Lafayette, IN
Assortment planning is one of the most important and challenging applications of analytics in retail. Often retailers use a two-stage approach: in the first stage, they run thousands of prediction experiments to identify what best captures expected demand. In the second stage, they decide which combination of products will lead to the best sales for a particular store – a classic knapsack-type problem. This work focuses specifically on combinatorial assortment optimization (the second stage) and on how the hierarchical nature of the decisions and analysis that need to occur can lead to drastically different outcomes in in-store performance. Using data such as inventory, historical sales, wait times, geographical activity, budgetary constraints, product variety, and shelf space, we formulate various linear and integer programming models to demonstrate how the assortment can change, using sensitivity analysis on the constraint parameters. We provide our client with a strategy for setting those parameters in the assortment optimization process to achieve strategic revenue outcomes. This work was performed using the CVXPY Python package on Purdue University’s high-performance Bell cluster, which is one of the top 500 HPCs in the world.
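The knapsack-type structure of the second stage can be sketched in a few lines. The authors formulate linear and integer programs with CVXPY; the example below instead uses a plain dynamic program for the 0/1 knapsack with a single shelf-space constraint, which is a deliberately simplified stand-in. The product names, shelf units, and expected sales figures are hypothetical.

```python
def best_assortment(products, shelf_capacity):
    """Pick products that maximize expected sales within shelf capacity.

    products: list of (name, shelf_units, expected_sales), integer shelf_units
    Returns (total expected sales, list of chosen product names).
    """
    # best[c] holds the best (sales, names) using at most c shelf units.
    best = [(0.0, [])] * (shelf_capacity + 1)
    for name, units, sales in products:
        # Walk capacity downward so each product is used at most once.
        for c in range(shelf_capacity, units - 1, -1):
            candidate = best[c - units][0] + sales
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - units][1] + [name])
    return best[shelf_capacity]


# Hypothetical store with 5 units of shelf space.
products = [("wiper", 2, 30.0), ("filter", 3, 50.0), ("battery", 4, 60.0)]
total_sales, chosen = best_assortment(products, shelf_capacity=5)
```

Rerunning the function with different values of `shelf_capacity` is a crude form of the sensitivity analysis on constraint parameters that the abstract mentions.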
Combining AI and Optimization for Crew Planning
Burak Cankaya, Bulent Erenay, Eyyub Yunus Kibis, Aaron Glassman, Embry Riddle Aeronautical University, Lake Mary, FL
The airline industry has seen volatile supply and demand in the last few years. Buying aircraft, training pilots, assigning them to fleets, and furloughing or laying off crew are strategic decisions that depend on economic, human resource, and union agreement constraints. Crew planning should consider both favorable and unfavorable scenarios. In this research, we use AI and stochastic optimization models to minimize costs for airlines and maximize the flight hours of the crew assigned to fleets, and we illustrate the approach in various business scenarios.
Recommendation System for a Timeshare Travel Exchange Company
Pranav Anand, Meghan Harris, Samineni Chandra Vadan, Nikhitha Meela, Souradeep Chakroborthy, Yang Wang, Purdue University, West Lafayette, IN
It seems that, more and more, consumers want companies to provide tomorrow’s needs yesterday. Or rather, consumers want companies to provide things easily and without hassle. Why forage through countless searches of available timeshares when an online booking portal can tell you what you like? The ability to recommend desired choices for destination stays helps a company establish itself as consumers’ most preferred company. Inspired by this, we created a recommendation system utilizing real-time search data to help consumers. The recommendation system aims to suggest customized search options based on previous user search activity. To train the recommendation system, we used real-time search data from past users to understand their trends and patterns, predicting both their propensity to travel and the bookings they made.
Experiments and Perceptions in Machine Translation
Xue Han, Su Tien Lee, Mu Hua Hsu, Li Ci Chuang, Hsiao Chien Wei, Matthew Lanham, Purdue University, West Lafayette, IN
We examine off-the-shelf machine translation (MT) models provided by the Google and Microsoft platforms to gauge how well they translate. The motivation for this work is that MT has evolved to a highly intelligent level through deep learning methods. However, all the major translation platforms serve general users. When used for specific domains, situations, or languages, the translation might not capture terms or tones accurately. Platforms such as Google and Microsoft offer a way to build customized models based on general neural MT models, which often provide a professional translation. However, the translation quality of these platforms varies, according to customer survey research. Our client is applying Google’s AutoML to provide an all-in-one translation service to its customers but must figure out how translation performance could be improved for different contexts and domains. We designed and iterated through several experiments, each based on different text datasets and using various data manipulation methods, for the Google and Microsoft Azure platforms. We discovered that we could improve translations by preparing the training datasets in a certain fashion and achieve higher translation accuracy at the same time. Interestingly, the performance of the platforms was the opposite of customers’ surveyed expectations.
Identification of Trends Across Sports for Computational Journalism Using Pattern Recognition and Anomaly Detection
Yuvraj Daga, Parth Mau, Drew Bertram, Malay Rai, Jyotisman Banerjee, Purdue University, West Lafayette, IN
In this study, we discuss the development of a computational journalism model capable of identifying statistical trends within American collegiate athletics. Television, radio, and print media sources utilize computational journalism to inform their audience of interesting statistics related to the game at hand. Currently, there is no cross-sport computational model capable of identifying significant trends within collegiate sports. In this study, we evaluated the current use of computational journalism in collegiate sports and the associated limitations. We then collected and analyzed college football statistical data to define and identify significant trends related to individuals and teams. From this data, we used an enhanced machine learning algorithm in Python to create a model capable of identifying significant trends across college football, basketball, soccer, and baseball.
Sales Forecasting Using High-Performance Computing
Yu Lin Tai, Purdue University, West Lafayette, IN
The growing need for organizations to efficiently forecast sales is attributed to the impact on profitability. When a firm can accurately predict the trends and insights from extensive sales data, it provides them a strong positioning in the market with increased operational efficiency. Our goal with this paper, “Sales Forecasting Using High-Performance Computing,” is to capture the opportunity to improve the predictive power of the regression model by employing high-performance computational capabilities. The motivation for this study is that demand forecasting must manage the computational complexity and accuracy tradeoff. Previous implementations for sales forecasting rely on linear regression with limited computational ability to run numerous experiments required to assess the effects of interaction terms. We aim to use national auto-parts dealer SKU level data of filters, brakes, and batteries information to build a robust regression model with optimal interaction terms and incorporate feature engineering and hyper-parameter tuning with maximum high-performance capability. We design a forecast engine that uses the previous year’s sales data as features into an ensemble of predictive models to determine what items have maximal potential in the inventory and assortment planning. Our approach has improved the current “bottom-up approach” model leading to higher interpretability and lower time constraints. It has also reduced inventory management costs by utilizing a more efficient and analytically driven approach to allocate products across stores and overall assortment planning.
Clustering and Prediction Model for Customer Engagement and Activation
Durriya Korasawala, Chetan Solanki, Nityanand Kowtha, Karandeep Mann, Paridhi Jain, Yang Wang, Purdue University, West Lafayette, IN
Clustering in the timeshare industry helps to understand the different customer segments of a timeshare exchange platform. The aim of this study is to build a customer segmentation model that is scalable and feasible, build a predictive model estimating the likelihood of buying, and suggest strategies leading to better conversion and greater revenues. This study will help identify the most active and dormant customers based on the value they bring to the company in terms of bookings. The aim is to find groups of similar customers who can be targeted with the right marketing content and to use these groups for further recommendations of listed properties. Using inventory, member, and transactional data, among others, we applied various models, such as Logistic Regression, Random Forest, and a decision tree classifier, to predict customers’ travel preferences and likelihood of purchase. RFM, CHAID (Chi-square Automatic Interaction Detection), and K-means clustering help group members with similar behavior so that businesses can target them together with relevant marketing techniques.
Tuesday Poster Session
A Prescriptive Analytics Strategy to Optimize Fantasy Basketball Lineups Using Gurobi
Iosif Pappas, Jerry Yurchisin, Zed Dean, Steven Edwards, Mario Ruthmair, Lindsay Montanari, Gurobi Optimization, Beaverton, OR
Fantasy basketball is a game that allows its participants to assume the role of a basketball team manager. The metric that is used to evaluate the performance of the athletes is fantasy points, and hence the aim of the manager is to a priori select the athletes that will collect the maximum amount of fantasy points. In this respect, the goal of this work is to demonstrate how data science and mathematical optimization can readily be integrated to optimally create NBA basketball lineups using the Gurobi Optimizer. Specifically, by starting from an available dataset with information about the previous performances of NBA basketball players, we augment their recent performances through a moving horizon framework. Based on that, we construct and compare three types of data-driven models to predict the future fantasy points of the available players. Directly utilizing these predictions to satisfy the requirements of our basketball team is prohibitive, due to a significant number of viable lineup combinations. To address this, we formulate and solve a mixed-integer linear programming model (MILP) with Gurobi to select the optimal lineup, by simultaneously respecting the in-game position and budget constraints of the problem. Our results demonstrate that through such a prescriptive analytics approach, optimal fantasy sports decision-making can be achieved.
Intelligently Ordering Machine Translation Seed Data to Improve Local Language Translation
Devansh Batra, Manideep Sharma, Gagan Pahuja, Amaan Ansari, Jai Woo Lee, Paul Chen, Matthew Lanham, Purdue University, West Lafayette, IN
Neural machine translation (MT) has become the state-of-the-art methodology for any MT task. However, there remain areas for improvement in the optimization of algorithms, hyperparameters, and the seed data itself for a more effective MT. Only a negligible fraction of the 7000+ currently spoken languages have sufficient text corpora to train MT models. This data scarcity results in systematic inequalities in the performance of MT across the world’s languages. This research addresses the seed data concern to determine an optimized order of seed data which results in both more accurate and quicker translations as compared to a random order. This is achieved by dividing chapters from the English Bible into train and test data, then feeding all possible ordering combinations of train data one by one to identify which training order achieves a pre-defined BLEU score on test data in the least amount of time and with the least number of iterations. Once an optimized order is determined, a comparison is made between this order and the worst order. Our findings suggest that the diversity and depth of semantic domains are key in achieving the most accurate MT, and provide a pathway to accelerate translation for local languages.
“Buy-again” Product Recommendation Engine through Machine Learning Using Customer Price Sensitivity
Feras Seder, Lohita Srinivasan, Mukul Sharma, Sai Nithin Reddy Mannepuli, Shreyansh Jain, Vinit Jadhav, Matthew Lanham, Purdue University, West Lafayette, IN
We discuss how we built a personalized recommendation engine that monitors customer behavior and provides personalized product recommendations to customers as per their needs. With the rise of e-commerce in the last decade, particularly in the retail segment, customers have to choose the products they want from a wide selection which makes customers’ shopping experience exhausting. From a service provider’s perspective, there is an associated risk of losing the customer if they do not get the desired products in a timely manner.
We demonstrate that by implementing a personalized recommendation system that understands customers’ buying behavior, based on purchase history, the frequency at which products are bought, and the money spent, a firm can mitigate this risk and also enhance customers’ buying experience, because customers are recommended the right products at the right time.
Our recommendation engine has been designed to include in-store and online purchases and availability, and to take factors related to product pricing into consideration. This approach considered a pool of products that were bought by the customer at least once in the recent past and attributed them to the ‘buy-again’ category. Based on these factors, our personalized recommendation engine identified and ranked the top products likely to be purchased by each customer during their next purchase.
Risk Modeling and Automated Dashboards for Banks
Rohit Soans, Rajas Kapure, Taronish Gotlaseth, Sheen Dhar, Yi-Hsuan Hsu, Purdue University, West Lafayette, IN
With the introduction of new regulations in banking, robust qualitative and quantitative risk models need to be deployed to make sound data-driven decisions. Moreover, risk reporting via visualization tools needs to undergo enhancements and modifications after new standards are enacted. Our surveys showed that stakeholders deemed the pre-existing visualization tool (Tableau) and manual infrastructure unsatisfactory. An analysis of Tableau’s competitors in the visualization space revealed PowerBI to be a favorable alternative. Understanding the data and its interactions in Microsoft SQL, ensuring data sanity to avoid anomalies on the dashboard, and reporting outliers helped establish data completeness and accuracy. As the FASB replaced the existing accounting standard for credit loss estimation in 2019, banks and other financial institutions involved in lending activities are now transitioning to a new methodology, Current Expected Credit Losses (CECL), to calculate the credit loss for loans. To comply with the new regulation, banks need to modify the existing assumptions of their credit loss models and update their periodic credit loss reports. As a result, another dimension of our analysis focuses on improving credit risk management. The first part involves time series modeling to enhance the credit loss estimation model using techniques like ARIMA. The second part involves building an automated reporting tool with Excel VBA and PowerBI to help the bank’s front-end employees adjust the existing credit loss reporting procedures.
Streamlining the Machine Learning Lifecycle to Improve Retail Sales Performance Using MLflow
Aditya Roy Choudhary, Matthew Lanham, Purdue University, West Lafayette, IN
Organizations leveraging machine learning seek to streamline their machine learning development lifecycle. Machine learning model development poses additional challenges compared to the traditional software development lifecycle. These include, but are not limited to, tracking experiments, code and data versioning, reproducing results, model management, and deployment. In this work, we describe the implementation of MLflow, an open-source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. MLflow has a rapidly growing open-source community with contributions from industry leaders in data science and integrates with a wide range of machine learning libraries, algorithms, and programming languages. We use MLflow on Azure Databricks to streamline the process of building a recommender system that predicts the user’s preference for a product, or the likelihood of the user purchasing a product, given that they are targeted with coupons in a promotional campaign. Finally, the entire machine learning pipeline is integrated with Flask through a REST API to serve the model for real-time and batch inference. The author looks forward to sharing their experience from this project to help others take advantage of the power of MLflow.
Analytical Approaches for Inventory Stratification and Optimizations
Cheng-Yu Chen, Nidhi Bhende, Kriti Sayal, Naveen Kumaran Kumar, Hemanth Sai Ram Challagolla, Purdue University, West Lafayette, IN
The study aims to equip our business partner, a leader in auto glass repair and manufacture, with an optimized SKU stratification and inventory stocking strategy. The approach is based on multi-criteria inventory classification modeling, as opposed to the traditional ABC classification used by the business partner. The goal of the study is to help achieve an optimal inventory classification that increases profits and optimizes inventory costs. To meet this goal, we experimented with a demand model, a profit model, a cost criterion model, K-means clustering, weighted K-means clustering using the Analytic Hierarchy Process (AHP), Fuzzy clustering, and weighted Fuzzy clustering using AHP. Fuzzy clustering with AHP helped us achieve the most optimized inventory stratification in terms of profit, demand, cost, lead time, and buyout cost attribution. This approach provides enough flexibility to modify the class size to meet capacity and is hence a preferred choice for inventory classification.
Impact of Learning Interventions: An Analytics Case Study with the Top School District in Indiana
Alejandro Brillembourg Cuenca, Michael Jonelis, Vivek Rao, Rupal Bilaiya, Sriya Musunuru, Yang Wang, Purdue University, West Lafayette, IN
There are some differences between the administration of a non-profit and a for-profit enterprise. Unlike for-profit organizations, non-profits are often challenged with shorter-term decisions due to the pressing constraint of having to optimally allocate limited resources and funding. Even in terms of staffing, rarely can a non-profit compete for talent with for-profit organizations. This is particularly noticeable in the case of publicly funded school districts. The following is a list of potential stakeholders to consider in a school district environment: students, immediate family of students, teachers, school principals and assistant principals, central office administrative staff, secretarial-clerical and technical-clerical staff, facilities and custodial staff, all respective unions who have engaged with the school in a collective bargaining agreement, school board members, local business leaders and potential donors, and all other citizens who by geographic delineation share community with the school. In a world where education is highly valued, chances are that this is a matter that is important to most. In general, budgets may vary significantly year-to-year and unexpected challenges may arise. School districts deal with a population that is most vulnerable and for which all those considered adults in society are responsible. The challenges imposed by a crisis like the COVID-19 pandemic can certainly push all stakeholders, particularly school leaders and administrators, to get creative in their search for solutions. This paper contains a literature review on the K-12 education industry in the United States. Academic publications were reviewed in the following areas: the impact of the COVID-19 pandemic on K-12 education, measuring and categorizing factors which influence student learning, managing learning loss and increasing academic gains, and choosing treatment assignments by measuring treatment effect and causal inference.
The client, one of the largest public school districts in the state of Indiana, is retroactively looking to gain insights into their chosen strategy against the COVID-19 pandemic: the allocation of public funding into a summer program and later concurrent tutoring offered through three different methods. Their current aim is to evaluate whether their response aided in an overall reduction of learning loss, thus increasing academic gains, and to consider the possibility of choosing the most appropriate treatment assignment based on the specific characteristics of a student. Any findings from this analysis will aid the school district in the future allocation of resources and the tailored assignment of treatments. To evaluate the performance of the selected treatments in increasing academic gains, this study measures the treatment effect on the treated (TOT) for the summer school intervention and the local average treatment effect (LATE) for the concurrent tutoring interventions. Furthermore, student clustering, based on demographics and academic performance, is considered for the best assignment of students into the intervention that will maximize their academic gains.
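The abstract does not state which estimator was used. As a minimal sketch only, the Wald ratio is one common way to estimate LATE when a binary offer of tutoring serves as an instrument for actually receiving it. The record layout `(z, d, y)`, the function name, and the toy numbers are all hypothetical.

```python
from statistics import mean


def wald_late(records):
    """Wald estimate of LATE from (z, d, y) records.

    z: 1 if the student was offered the intervention, else 0 (instrument)
    d: 1 if the student actually took part, else 0 (treatment)
    y: outcome, e.g. a gain in test score
    """
    y1 = mean(y for z, d, y in records if z == 1)
    y0 = mean(y for z, d, y in records if z == 0)
    d1 = mean(d for z, d, y in records if z == 1)
    d0 = mean(d for z, d, y in records if z == 0)
    if d1 == d0:
        raise ValueError("the offer does not change take-up; LATE is undefined")
    # Intent-to-treat effect on the outcome, scaled by the change in take-up.
    return (y1 - y0) / (d1 - d0)


# Toy data: half of the offered students take part, none of the others do.
records = [
    (1, 1, 10), (1, 1, 12), (1, 0, 4), (1, 0, 6),
    (0, 0, 5), (0, 0, 5), (0, 0, 3), (0, 0, 7),
]
print(wald_late(records))
```

When no one without an offer can take part, as in the toy data, the same ratio also equals the effect of treatment on the treated.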
IP Detective: Patent Infringement Detection Using BERT
Hrohaan Malhotra, Lakshay Vohra, Puja Gupta, Jonathan Mathai, Gokul Harindranath, Buyang Li, Matthew A. Lanham, Purdue University, West Lafayette, IN
Patents play a significant part in innovation and help individuals and companies safeguard and retain ownership of their ideas. However, patent infringement is common, and more than 2,500 patent infringement suits are filed each year. Currently, patent infringement detection is largely done manually, and companies spend approximately $600 to identify each case of infringement. Our work provides an approach to automate this process through machine learning. Our model first vectorizes patent text using a BERT model trained on the patent text and then calculates similarity scores between competing patent claims. We developed an architecture that identifies the similarity of two patents not only at an overall level but also for each subsection, and cross-evaluates the similarity between these sections. This was implemented by creating a matrix of all possible subsection combinations between two patents and populating the matrix with the relevant similarity scores. The overall score is then calculated by taking a weighted average of the subsection similarities, where the weights were calculated by training a logistic regression model on historical cases of infringement. Looking at subsection scores along with the overall score, we can identify the potential infringement of two competing patent claims rather accurately. With this model, the cost of manual patent-infringement detection can be significantly reduced, as a legal team can prioritize claims for review based on the probability of infringement.
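The subsection matrix and weighted average described above can be sketched as follows. This is an illustration under stated assumptions: cosine similarity is assumed as the similarity measure, the two-dimensional vectors stand in for BERT embeddings, and the weights are hard-coded placeholders rather than coefficients learned by logistic regression.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(patent_a, patent_b):
    """Score every subsection of patent_a against every subsection of patent_b.

    Each patent is a dict mapping a subsection name to its embedding vector.
    """
    return {
        (sec_a, sec_b): cosine(vec_a, vec_b)
        for sec_a, vec_a in patent_a.items()
        for sec_b, vec_b in patent_b.items()
    }


def overall_score(matrix, weights):
    """Weighted average of the subsection-pair similarities."""
    total_weight = sum(weights[pair] for pair in matrix)
    return sum(weights[pair] * score for pair, score in matrix.items()) / total_weight


# Toy embeddings and placeholder weights (both hypothetical).
patent_a = {"abstract": [1.0, 0.0], "claims": [0.0, 1.0]}
patent_b = {"abstract": [1.0, 0.0], "claims": [1.0, 1.0]}
weights = {
    ("abstract", "abstract"): 0.2,
    ("abstract", "claims"): 0.1,
    ("claims", "abstract"): 0.1,
    ("claims", "claims"): 0.6,
}
matrix = similarity_matrix(patent_a, patent_b)
print(overall_score(matrix, weights))
```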
Material Constituent Significance to Quality
David Bayba, Kurt Gaiser, INTEL, Chandler, AZ
Modern microprocessors are produced to such microscopic and exacting dimensions that even the smallest variations in equipment, process, and materials can be troublesome. However, it is also cost prohibitive to require absolute perfection on all parameters. In this work we will discuss a process using modern machine learning methods to learn which material constituents are very important and which are of less importance.
Reliance on Science: Patent Citations to Scientific Articles
Ceren Konak, Matthew Marx, Cornell University, Ithaca, NY
We substantially improve the performance of extracting citations embedded in patent front, sentences, and paragraphs by leveraging not only machine-learning methods but also hand-tuned heuristics, retrieving 16.8 million in-text citations from worldwide patents since 1836 to scientific articles as captured by the Microsoft Academic Graph, PubMed, and Digital Object Identifiers since 1800. We find that at the extraction stage alone, nearly one-quarter of citations are lost unless hand-tuned heuristics are employed. However, we remain open to the possibility that retraining open-source machine learning (ML) packages like GROBID may narrow the gap.
We use rule-based pattern matching techniques to find citations without a journal name and even without a year, and fuzzy matching for author names, article titles, bibliographic information, and journal names, any of which might be misspecified. Next, we adopt the open-source GROBID machine learning library, which has been trained to extract citations from text and tag fields, including author, title, journal, and page. We also score the performance of our extraction, linking, and scoring procedure in terms of false negatives and false positives. In this process, we score author name, publication title, and bibliographic extractions depending on different factors.
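As a rough illustration of fuzzy matching a possibly misspelled title against known article titles, the sketch below uses Python's standard `difflib.SequenceMatcher`. The abstract does not say which matching method or threshold the authors used; the 0.85 cutoff, the function names, and the sample titles are hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_ratio(a, b):
    """Case-insensitive similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(extracted_title, candidates, threshold=0.85):
    """Return (candidate, score) for the closest title, or (None, score)
    when nothing reaches the threshold."""
    if not candidates:
        return None, 0.0
    score, title = max((fuzzy_ratio(extracted_title, c), c) for c in candidates)
    return (title, score) if score >= threshold else (None, score)


# Hypothetical article titles and a citation with two typos.
candidates = [
    "Deep learning for citation parsing",
    "Graph methods in chemistry",
]
print(best_match("Deep learnng for citaton parsing", candidates))
```

Raising the threshold trades false positives for false negatives, which is the balance the scoring step above is meant to measure.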
Automated A/B Testing and Measurement Framework
Padma Dwivedi, Amit Zutshi, Zainab Abdulla Aljaroudi, Akshay Jayan, Achintya Acharya, Matthew A. Lanham, Purdue University, West Lafayette, IN
Data scientists tend to reinvent the wheel with regard to design, methodology, and execution every time a new business problem presents the need for an individualized A/B test. While each test presents its unique challenges, an ‘accelerator’ that helps automate the identification of test and control groups and provides a framework to compare and choose the best algorithm for the scenario in question would drastically reduce the time to market for an analyst. We developed an accelerator on this premise by creating a modular product in Python. This tool can ingest a data set and a set of user inputs (A/A test vs. A/B test, segmentation needed or not, option to modify considered variables), success factors, and the needed test sensitivity to help with the identification of test and control scenarios. Further, it also provides lift/drop numbers brought about by the intervention being tested. Clustering algorithms are used to determine optimum test and control cohorts based on similarity criteria. The impact of the intervention is tested for statistical significance between test and control groups and, finally, insights and recommendations are published.
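The lift and significance step can be sketched with a pooled two-proportion z-test. The abstract does not name the statistical test the tool applies, so this choice is an assumption, as are the function name and the conversion counts.

```python
import math


def lift_and_p_value(conv_test, n_test, conv_ctrl, n_ctrl):
    """Relative lift of test over control, with a two-sided p-value
    from a pooled two-proportion z-test."""
    p_test = conv_test / n_test
    p_ctrl = conv_ctrl / n_ctrl
    lift = (p_test - p_ctrl) / p_ctrl
    pooled = (conv_test + conv_ctrl) / (n_test + n_ctrl)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_test + 1 / n_ctrl))
    z = (p_test - p_ctrl) / std_err
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return lift, z, p_value


# Hypothetical counts: 120 of 1,000 test users convert vs. 100 of 1,000 controls.
lift, z, p_value = lift_and_p_value(120, 1000, 100, 1000)
print(lift, z, p_value)
```

With these counts the lift is 20 percent, yet the p-value is above 0.05, which shows why the lift figure alone is not enough to call the intervention a success.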
Predicting Covid-19 Tweets Sentiment with SAS Enterprise Miner and SAS Sentiment Analysis Studio
Tuan Le, Oklahoma State University, Stillwater, OK
The Covid-19 pandemic is the most severe worldwide public health crisis in our generation, with more than 2 million associated deaths as of January 2021. Posing a great challenge to modern medicine, this pandemic was met with the rapid deployment of several types of vaccines all over the world. Since the vaccines are new and were deployed rapidly in a fragmented information environment, we are interested in finding out how they are perceived among the world population. Using a sample of more than 60,000 tweets, SAS® Enterprise Miner, and SAS® Sentiment Analysis Studio, this paper analyzes and extracts insights about people’s perception of Covid-19 vaccines. The AstraZeneca vaccine, in particular, has raised some concern because of the blood clot reports. However, people generally have a positive attitude toward the vaccines. Within SAS® Enterprise Miner, we build predictive models (Text Rule Builder, Logistic Regression, Decision Tree, Neural Network) to identify features and classify the tweets on a smaller sub-sample. Additionally, we use SAS® Sentiment Analysis Studio to build a statistical model to classify positive and negative tweets. This paper demonstrates how traditional predictive models can be applied to text analytics.
Product Pattern Ratio Optimization and Customer Demand Prediction for Fashion Industry
Soyeon Baik, Yen Tsz Huang, Ting-Yun Cheng, Huihui Zhang, Srinikhil Bolneyti, Yang Wang, Purdue University, West Lafayette, IN
Demand prediction and optimization methods are applied to solve the inventory problem that a pattern-making fashion company is currently facing. Because of fast-changing trends and seasonality, product assortment and inventory management have become important issues in the fashion industry. These issues are especially crucial for a pattern-making fashion company, where seasonality matters more than it does for a company making basic, solid-color items. The ability to predict the optimal pattern-to-solid ratio of their products will improve not only customer satisfaction but also sales by decreasing inventory-related costs. This study, which analyzes the last four years of transaction data, is divided into two parts: demand prediction and optimization. First, to build the demand prediction model, the following methods are applied: linear regression, time series, random forest, and gradient boosting. Next, based on the demand prediction results, optimization methods are applied to maximize sales. Variables such as product category, pattern-to-solid ratio, and price are taken into consideration along with a set of constraints, such as the discount price. This approach can help the pattern-making fashion company by showing how to predict customer demand and maximize sales based on the optimization model.
Minimizing Risk in Ocean Shipping Contracts
Asad Husain, Yuxuan Li, Jackson Bronkema, Diego Carlos Chavez, Kumar Rahul, Kshitij Virdi, Yang Wang, Purdue University, West Lafayette, IN
Working with our sponsor, an industry leader in the high-end furniture business, we were able to forecast container shipping volume by ocean lane. Prior to this project, our sponsor did not have a point of reference as to their expected lane-wise shipping volume. We applied several time series models including decomposition, ARIMA, auto-ARIMA and Holt-Winters for each lane. We effectively modelled the top 20 lanes by aggregate volume, as these lanes account for 53.8% of the total shipping volume our sponsor has imported in the past seven fiscal years. Our recommendation model helps them build long-term contracts with the carriers. This will help to mitigate the amount of financial risk that our sponsor faces when negotiating contracts with ocean freight carriers.
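For readers unfamiliar with Holt-Winters, one of the models named above, the following is a minimal additive version. It is a sketch, not the project's implementation: the initialisation is a simple heuristic, the smoothing parameters are fixed rather than fitted, and the monthly volumes are invented.

```python
def holt_winters_additive(series, season_len, alpha, beta, gamma, horizon):
    """Additive Holt-Winters forecast for `horizon` steps past the series end.

    alpha, beta, and gamma smooth the level, trend, and seasonal terms.
    """
    m = season_len
    if len(series) < 2 * m:
        raise ValueError("need at least two full seasons of data")
    # Heuristic start values taken from the first two seasons.
    level = sum(series[:m]) / m
    trend = (sum(series[m:2 * m]) - sum(series[:m])) / (m * m)
    seasonals = [series[i] - level for i in range(m)]
    for t, y in enumerate(series):
        seasonal = seasonals[t % m]
        prev_level = level
        level = alpha * (y - seasonal) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonals[t % m] = gamma * (y - level) + (1 - gamma) * seasonal
    n = len(series)
    return [
        level + (h + 1) * trend + seasonals[(n + h) % m]
        for h in range(horizon)
    ]


# Invented container volumes for one lane, with a three-period season.
volumes = [10, 20, 30, 10, 20, 30, 10, 20, 30]
print(holt_winters_additive(volumes, season_len=3, alpha=0.3, beta=0.1,
                            gamma=0.2, horizon=3))
```

In practice one such model would be fitted per lane, with the smoothing parameters chosen by minimising forecast error on held-out periods.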
Building a Massive Multilingual Database of Words Mapped to Semantic Domains
Rajarshi Biswas, Arpan Datta, Nitesh Wagh, Nikhil Katiki, Rajib Mahato, Matthew A. Lanham, Purdue University, West Lafayette, IN
Languages are essential to human life. Language is not only a form of communication but also helps transfer knowledge and experience across generations. There are currently around 6,500 languages being spoken. According to the UNESCO Atlas of the World’s Languages in Danger, between 1950 and 2010, 230 languages went extinct, and as of 2018, a third of the world’s languages have fewer than 1,000 speakers left. Every two weeks a language dies with its last speaker; 50 to 90 percent of languages are predicted to disappear by the next century. In this paper, we discuss how we created a multilingual database comprising 500+ languages and mapped them to semantic domains. The team has mapped English words to their corresponding semantic domains and subsequently mapped words from other languages with the same or similar meaning to their respective semantic domains. Mapping these semantic domains creates not only a diverse list of words but also groups of words that will be useful for literary translation. The team then leverages the Dgraph graph database to store the mapped data for further research in the field of language translation. This paper will help fellow linguistic practitioners access accurate translations and create a repository of earth’s inherent knowledge in the form of words and phrases. This will aid the future translation of indigenous languages.