Location: Richmond, VA or McLean, VA or Plano, TX (Onsite)
Key Responsibilities:
Create, maintain, and optimize ETL/ELT pipelines to ingest, process, and manage data from various sources using Python, Apache Spark, and AWS services.
Design data models, build data structures, and implement data storage solutions that ensure data integrity, consistency, and security.
Tune data processing workflows for performance, scalability, and cost efficiency on distributed systems using Spark and AWS.
Work with cross-functional teams (e.g., data science, product, analytics) to understand data requirements and support business needs. Document data workflows, processes, and solutions for transparency and reproducibility.
Implement data quality checks, error handling, and recovery processes. Ensure compliance with data governance and security protocols.
Key Qualifications:
Proficient in Python for data processing, scripting, and automation.
Experience with Spark for data transformation, distributed processing, and ETL workflows.
Hands-on experience with core AWS services like S3, Lambda, Glue, EMR, Redshift, and RDS. Knowledge of IAM, CloudFormation, and/or Terraform for infrastructure management is a plus.
Strong understanding of SQL, data warehousing, and database design principles.
Familiarity with data modeling, schema design, and query optimization.
Other Skills:
Experience with version control (Git) and CI/CD practices.
Strong problem-solving skills and ability to work in an Agile environment.
Excellent communication skills and ability to work with non-technical stakeholders.
Preferred Qualifications:
Familiarity with additional tools like Airflow for workflow orchestration.
Experience with data streaming technologies (e.g., Kafka, Kinesis).