Track: Marketing Analytics

Data Cleaning for Text Analytics

Wednesday, April 14, 2-2:40pm EDT

Data cleaning and preparation take up 70 to 80% of the time and effort in a data science project, yet the modeling step gets all the attention. We aim to change that in this presentation. We will share text cleaning methods that use regular expressions, along with the methods we use to test text cleaning in our continuous integration pipeline. We’ll discuss the benefits of maintaining context and give examples of how we have done that in our solution. In addition, we will introduce an AI-driven technical thesaurus, trained on PC technical jargon, that can correctly group related terms into single categories.
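As a rough illustration of the kind of regex-based cleaning and testing the abstract describes, here is a minimal Python sketch. It is not the presenters' implementation: the function name `clean_text`, the patterns, and the sample strings are all assumptions chosen for illustration. The idea it tries to show is cleaning text while keeping context-bearing technical tokens such as "usb-c" or "3.0" intact, with pytest-style test functions of the sort a continuous integration pipeline could run.

```python
import re

# Hypothetical patterns, chosen only for illustration.
_HTML_TAG = re.compile(r"<[^>]+>")
_DISALLOWED = re.compile(r"[^a-z0-9\s.\-/]")
_TRAILING_PERIOD = re.compile(r"\.(?=\s|$)")
_WHITESPACE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Normalize free text while keeping technical tokens readable."""
    # Replace HTML tags with a space so neighboring words do not merge.
    text = _HTML_TAG.sub(" ", raw)
    text = text.lower()
    # Drop punctuation except '.', '-' and '/', which carry meaning
    # inside terms such as "usb-c" or "3.0".
    text = _DISALLOWED.sub(" ", text)
    # Remove periods that end a sentence, but keep ones inside a token.
    text = _TRAILING_PERIOD.sub(" ", text)
    # Collapse runs of whitespace into single spaces.
    return _WHITESPACE.sub(" ", text).strip()


# Pytest-style tests; a CI job could collect and run these on each commit.
def test_strips_markup_and_punctuation():
    raw = "<p>My USB-C port   stopped working!</p>"
    assert clean_text(raw) == "my usb-c port stopped working"


def test_keeps_version_numbers():
    raw = "Upgraded to Wi-Fi 6 and USB 3.0."
    assert clean_text(raw) == "upgraded to wi-fi 6 and usb 3.0"
```

Keeping the hyphen and the in-token period is one simple way to preserve context; a cleaner that stripped all punctuation would turn "USB 3.0" into "usb 3 0" and lose the version information.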

Joyce Weiner

Principal AI Engineer at Intel Corporation

Joyce Weiner is a Principal AI Engineer in the Client Computing Group at Intel Corporation. She focuses on using data to drive change and improve efficiency. Before joining the Client Computing Group, Joyce worked in Fab and Assembly Test Manufacturing. Her book, “Why AI/Data Science Projects Fail: How to Avoid Project Pitfalls,” was published in 2021. Joyce has a BS in Physics from Rensselaer Polytechnic Institute and an MS in Optical Sciences from the University of Arizona. She is married and in her free time enjoys drawing, calligraphy, and reading.