I was both the technical project lead and a lead engineer on a project to create a data platform that allows data scientists to run exploratory data analyses and train ML models.
As technical project lead, my responsibilities included helping team members to understand technical aspects of the project, facilitating the exchange of ideas and making decisions, encouraging and monitoring engineering quality, communicating with stakeholders and structuring the process through work breakdowns, estimation and planning.
As an engineer, I was deeply involved in all phases of the project. After exploring and evaluating many technologies, we settled on Databricks with Apache Spark and Delta Lake on AWS as the basis for the platform. The alternatives I evaluated included Google BigQuery, Airflow, dbt, Tableau and Looker.
My most significant contributions were in technology evaluation, the creation of architectural concepts (e.g. for infrastructure, deployment, data modeling and data access control), and implementation. A particular challenge was ensuring data separation and access control, as the platform collected data from multiple customers whose data had to be kept separate. We solved this using Databricks’ integrated data catalog (Unity Catalog).
The platform made it much easier for data scientists and others in the company to access and work with data and to gain insights, and it helped to standardize data science processes and technologies.
My activities included requirements engineering, technology selection, proofs of concept, conceptual and architectural work, implementation of infrastructure (IaC), ETL pipelines and CI/CD pipelines, data modeling, dashboarding, integration of operational systems, user onboarding and evaluation. Practices included test-driven development, architecture documentation, presenting and Scrum.
I used Databricks, Delta Lake, lakehouse architecture, AWS (IAM, EC2, Lambda, S3, CloudWatch, SNS), Apache Spark (Scala, Python, SQL), Unity Catalog, DataFrame API, BigQuery, Tableau, Metabase, Scala (Cats Effect, FS2, ScalaTest), Python (library development, PyTest), Apache Kafka, Prometheus, Terraform, PostgreSQL, GitHub, CI/CD, Confluence, Jira and arc42.