Pavani Guttula

About Me

I’m a Senior Principal Engineer at Eli Lilly, where I lead data engineering for scientific platforms that power drug discovery at scale. My approach to data engineering is less about moving data with fancy tools and more about delivering it in formats that are generic enough to last and shaped precisely for how people actually use it — a philosophy I’ve refined working closely with scientists, analysts, and ML teams.

I hold a Master’s in Computer Science from Purdue, with specializations in Data Science and Software Engineering. My interests span data engineering, data science, machine learning, NLP, and their applications across science.

Previously I was at Deloitte Consulting as a Data Engineer, and interned at Galois Inc. as a Research Software Engineer.

Experience

Eli Lilly

https://www.lilly.com/

Senior Principal Engineer

June, 2020 - Now

Eli Lilly is a US-based pharmaceutical giant recently known for inventing an antibody therapeutic for COVID-19.

I led the design and delivery of a scientific data platform that ingests and harmonizes large-scale experimental data, powering drug discovery decisions for 3,000+ scientists across 13 departments.

Pipelines & scale: Built and scaled high-volume pipelines processing 12M+ records across 40K+ experiments, hitting sub-15-minute end-to-end refresh SLAs. Reduced full rebuild times from days to hours, and incremental refreshes from over an hour to minutes.

Data modeling: Led multi-parameter result modeling initiatives that eliminated legacy constraints and achieved a 300x reduction in manual effort for scientists. What excited me most was having to pick up large molecule domain knowledge — understanding the science well enough to engineer data that was actually useful, not just technically correct.

Consumer-centric products: Partnered with scientists, analysts, and product stakeholders to shift from ingestion-focused pipelines to consumer-centric data products, improving data usability and trust across the platform.

AI/ML & search: Contributed to a Cortex PubMed ingestion pipeline (1.5M+ records), built metadata APIs for Kernel-Lilly scientific discovery, and developed a federated deep search POC combining NLP + ElasticSearch over 1,100+ clinical trials in collaboration with NVIDIA & Google.

Cloud & reliability: Organized AWS workshops and drove RStudio modernization saving $150K/year. Embedded Great Expectations for data quality and patched a critical production vulnerability in the Marketplace backend.

Galois

https://galois.com

Research Intern

May - August, 2019

Galois works in close collaboration with DARPA to secure the USA's cyber-physical infrastructure.

As a Research Intern, I was responsible for jump-starting the MuseML project.

Deloitte

https://www2.deloitte.com

Technical Consultant / Data Engineer

June 2013 - May 2017

Deloitte is one of the world's largest technical consultancies and professional services networks.

As a technical consultant and data engineer, my responsibilities included analyzing clients’ business requirements, designing data warehouse schemas (Data Vault, Snowflake, etc.), developing ETL frameworks, and generating BI reports for the finance, pricing, and rating sectors. I worked mostly in the Financial Services Industry (FSI) domain, building ETL pipelines for clients such as Anthem Inc. and State Auto Inc.

Projects

MuseML

https://muse.dev

Using AI to help programmers write better code

This is a research project I started during my internship at Galois Inc., and continued at Purdue. The project focuses on analyzing the quality of source code in a software project through a combination of techniques from Software Engineering, Machine Learning, and Natural Language Processing. In particular, I developed a novel classification algorithm that combines ML-based classifiers (e.g., Naive Bayes, SVM, & Random Forests) with topic modeling techniques from NLP (e.g., LDA) to automatically triage the bug reports from static analysis tools (e.g., FBInfer) into true and false positives. MuseML therefore helps developers accurately gauge the quality of source code in a software project, while also helping them quickly improve the code quality by prioritizing bug fixes.

Contractual obligations prevent me from disclosing the source code and reports from this project. However, I would be happy to give a presentation and provide references.

Technologies used: Python, Scikit-Learn, SciPy, Pandas, MongoDB, Docker.

C Compiler & Interpreter

A compiler and an interpreter for a C-like programming language.

For a course project at Purdue, I built a fully functional interpreter and compiler for a C-like programming language in C. I implemented all phases of an industry-standard compiler, including lexical analysis and parsing (via Lex and Yacc), type checking, dataflow analysis to detect uninitialized and unused variables, precise error localization, register allocation, and code generation.
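The dataflow checks mentioned above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the project's C implementation. The `Stmt` type and the straight-line program are hypothetical, and a real compiler would run these passes over a control-flow graph with fixpoint iteration rather than a flat statement list. Here "unused" is approximated as an assignment whose value is never read afterwards.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class Stmt:
    """Hypothetical IR: `target = f(uses)`; target is None for a pure use."""
    line: int
    target: Optional[str]
    uses: Tuple[str, ...] = ()


def find_uninitialized(stmts: List[Stmt]) -> List[Tuple[int, str]]:
    """Forward pass: report each read of a variable with no earlier assignment."""
    defined: Set[str] = set()
    problems: List[Tuple[int, str]] = []
    for s in stmts:
        for name in s.uses:
            if name not in defined:
                problems.append((s.line, name))
        if s.target is not None:
            defined.add(s.target)
    return problems


def find_unused(stmts: List[Stmt]) -> List[Tuple[int, str]]:
    """Backward pass: report each assignment whose value is never read later."""
    live: Set[str] = set()
    problems: List[Tuple[int, str]] = []
    for s in reversed(stmts):
        if s.target is not None:
            if s.target not in live:
                problems.append((s.line, s.target))
            live.discard(s.target)  # the assignment ends the old value's live range
        live.update(s.uses)         # every read makes the variable live above it
    return sorted(problems)


# Hypothetical four-statement program used only for illustration.
program = [
    Stmt(1, "a", ()),          # a = 1
    Stmt(2, "b", ("a", "c")),  # b = a + c   (c is never assigned)
    Stmt(3, "d", ("a",)),      # d = a       (d is never read)
    Stmt(4, None, ("b",)),     # print(b)
]

uninitialized = find_uninitialized(program)
unused = find_unused(program)
```

On this program the forward pass flags the read of `c` on line 2 and the backward pass flags the assignment to `d` on line 3.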

Technologies used: C, GDB, Unix.

Fuzz Testing-Guided Static Analysis

Using Fuzz Testing to Improve the Accuracy of Static Analysis on C Programs

Static analysis is the technique of analyzing the source code of programs to find bugs. Industry-standard static analysis tools, such as Facebook’s Infer, generate many false positives, i.e., they report bugs that are not real bugs. In this research project, I explored ways to increase the effectiveness of Infer by combining its static analysis technique with an automated software testing technique called Fuzz Testing. The idea is to automatically triage the bug reports issued by Infer based on a Fuzz Tester’s (e.g., AFL’s) ability to reproduce the bug. The results from this project are mixed.
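The triage idea can be sketched roughly as follows. This is an illustrative Python sketch, not the project's code. The `Report` type, the `triage` function, the `window` tolerance, and the sample data are all hypothetical, and it assumes static-analysis reports and fuzzer crash sites have already been reduced to (file, line) pairs. A report the fuzzer fails to reproduce is bucketed as "unconfirmed" rather than as a false positive, since a fuzzer missing a bug does not prove the bug is absent.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple

Location = Tuple[str, int]  # (source file, line number)


@dataclass(frozen=True)
class Report:
    """Hypothetical, simplified static-analysis report."""
    bug_type: str
    file: str
    line: int


def triage(
    reports: Iterable[Report],
    crash_sites: Set[Location],
    window: int = 0,
) -> Dict[str, List[Report]]:
    """Split reports by whether a fuzzer crash landed within `window` lines."""
    buckets: Dict[str, List[Report]] = {"reproduced": [], "unconfirmed": []}
    for r in reports:
        hit = any(
            f == r.file and abs(ln - r.line) <= window
            for f, ln in crash_sites
        )
        buckets["reproduced" if hit else "unconfirmed"].append(r)
    return buckets


# Hypothetical data for illustration only.
reports = [
    Report("NULL_DEREFERENCE", "parser.c", 120),
    Report("BUFFER_OVERRUN", "lexer.c", 45),
    Report("MEMORY_LEAK", "main.c", 10),
]
crash_sites = {("parser.c", 121), ("lexer.c", 300)}

result = triage(reports, crash_sites, window=2)
```

With a two-line window, only the `parser.c` report is marked as reproduced; the other two stay unconfirmed and would need further review.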

Technologies used: C, GDB, AFL, Unix.

News Analysis

Machine Learning and Natural Language Processing to Analyze News-making Events

Analyzing news data is crucial both for providing organized, easy access to news articles and for making predictions about a future event (e.g., an election) based on how the discourse is evolving. In this project, we performed NLP on the source data and trained deep learning models (CNN, LSTM) and ensemble models such as Random Forests to accurately categorize news articles. We also performed sentiment analysis on the news dataset to visualize and predict how the public sentiment around an event is evolving. A report on this project can be found here.

Technologies used: Python, NLTK, Pandas, Keras, NumPy, Gensim, GloVe.

Urban Sound Classification

Using Deep Learning to Classify Audible Sounds in Urban Areas

In this project I worked with a large audio dataset to classify urban sounds into a number of different categories, e.g., the sound of construction work, vehicle horns, street music, and gunshots. I designed multiple classification algorithms based on SVM, Logistic Regression, and 1D Convolutional Neural Networks, and compared their performance. On 8732 labeled data points, SVM-based classification achieved 81% accuracy, followed by CNN with 75%. I also demonstrated that carefully engineered audio features, such as Mel-Frequency Cepstral Coefficients (MFCC) and spectrograms, give 8 times better performance than trivially selected baseline features. A report on this project can be found here.

Technologies used: Python, PyTorch, NumPy, Scikit-Learn, Seaborn, MUDA.

Speed Dating Analysis

Learning the patterns and predicting the outcomes of speed dating

Dating preferences provide insights into the complex psychology of an individual. In this project I analyzed the data from a speed dating application using a combination of data mining techniques to determine the features that act as best predictors of a date’s success. I then compared several classification algorithms, including Naive Bayes, Decision Trees, SVM, Logistic Regression, and Random Forests, in terms of their accuracy at predicting the outcome of a speed date given the same set of features.

Technologies used: Python, NumPy, SciPy, Pandas.

Education

Purdue University

Master of Science

January 2018 - May 2020

Purdue runs one of the world's best graduate programs in Computer Science (http://csrankings.org).

Current GPA is 3.68/4.0. Relevant coursework includes Data Mining (CS573), Statistical Machine Learning (CS578), Natural Language Processing (CS577), Compilers and Programming Systems (CS502), Computer Networks (CS523), and Introduction to Simulation and Modeling (CS543). I was supported by the Department of Computer Science through a Teaching Assistantship (TA). I served as the Head TA for the undergraduate Software Engineering course (CS407). References will be provided on request.

Osmania University, India.

Bachelor of Engineering

August 2009 - May 2013

Osmania University is a public state university in India.

I completed my Bachelor of Engineering (BE) at Chaitanya Bharati Institute of Technology (CBIT) under Osmania University, Hyderabad, India. I majored in Electrical and Electronics Engineering.

A Little More About Me

I was one of three graduate students awarded a travel scholarship by Purdue CS to attend the Grace Hopper Conference for Women in Computing (2019).
I have Coursera certifications in Web Development and Python Programming.
I organized a MuseML Hackathon at Galois.
I was a founding member of the StateAuto DevOps team at Deloitte.