Track: Marketing Analytics
Data Cleaning for Text Analytics
Wednesday, April 14, 2-2:40pm EDT
Roughly 70 to 80% of the time and effort in a data science project is spent on data cleaning and preparation, yet the modeling step gets all the attention. We aim to change that in this presentation. We will share text cleaning methods that use regular expressions, along with the methods we use to test text cleaning in our continuous integration pipeline. We’ll discuss the benefits of maintaining context and give examples of how we have done that in our solution. In addition, we will introduce an AI-driven technical thesaurus, trained on PC technical jargon, that can correctly group related terms into single categories.
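
As a rough illustration of the regular-expression cleaning and pipeline testing mentioned above, here is a minimal Python sketch. It is not the presenters' implementation: the patterns, the function name clean_text, and the sample sentence are all assumptions made for this example.

```python
import re

# Hypothetical patterns, chosen for illustration only.
_URL = re.compile(r"https?://\S+")       # bare web links
_NON_WORD = re.compile(r"[^a-z0-9\s]")   # anything that is not a lowercase letter, digit, or whitespace
_WHITESPACE = re.compile(r"\s+")         # runs of spaces, tabs, newlines


def clean_text(text: str) -> str:
    """Lowercase the text, drop URLs and punctuation, and collapse whitespace."""
    text = text.lower()
    text = _URL.sub(" ", text)
    text = _NON_WORD.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()


def test_clean_text() -> None:
    """A unit test of the kind a CI job could run on every commit."""
    raw = "My  LAPTOP won't boot!! See https://example.com/help"
    assert clean_text(raw) == "my laptop won t boot see"
    assert clean_text("") == ""


if __name__ == "__main__":
    test_clean_text()
    print("all cleaning tests passed")
```

A test like this pins down the expected output for a known input, so a later change to any pattern that alters the cleaned text would fail the pipeline instead of silently changing downstream results.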