Senior Data Engineer Resume
Portland, OR
SUMMARY
- Passionate developer with 7+ years of hands-on experience developing, configuring, and implementing big data ecosystems and executing complex enterprise solutions involving petabyte-scale datasets, data pipelines, and analytics.
- Good experience working with petabyte-scale Hadoop/Spark ecosystems on Google Cloud Platform, AWS, Snowflake, Hortonworks, and Cloudera, using Sqoop, Spark SQL, Spark, Hive, Airflow, Azkaban, and Oozie.
- Strong understanding of Hadoop architecture and hands-on experience with Hadoop components such as JobTracker, TaskTracker, NameNode, DataNode, and HDFS.
- Extensive knowledge of NoSQL databases such as Cassandra and MongoDB.
- Solid understanding of the Hadoop Distributed File System (HDFS) and of handling data landing in HDFS from other sources.
- Worked with Spark SQL to create DataFrames over data in HDFS in different file formats such as ORC, JSON, Parquet, and Avro, and to store the results back to HDFS (see the sketch at the end of this summary).
- Experience in converting Hive queries into Spark transformations using Spark RDDs and Python.
- Hands-on experience provisioning and managing multi-tenant Cassandra clusters in public cloud environments: Amazon Web Services (AWS) EC2 and OpenStack.
- Good experience using the Python scripting language throughout the Spark development life cycle.
- Developed multiple components in Python, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
- Experience architecting and building multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinating tasks among the team.
- Responsible for Git version control, committing developed code that was then deployed using the build and release tool Jenkins.
- Experience across the complete Software Development Life Cycle (SDLC), with hands-on use of methodologies such as Waterfall, Agile, and Scrum.
- Good knowledge of business process analysis and design, re-engineering, cost control, capacity planning, performance measurement and quality.
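A minimal PySpark sketch of the DataFrame format handling described above; the HDFS paths, column names, and aggregation are illustrative assumptions, not project code.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("format-roundtrip-sketch").getOrCreate()

    # Read the same logical dataset from HDFS in two formats (hypothetical paths).
    orc_df = spark.read.orc("hdfs:///data/raw/events_orc")
    json_df = spark.read.json("hdfs:///data/raw/events_json")

    # Combine them (assuming matching schemas) and compute a simple daily count.
    daily_counts = (
        orc_df.unionByName(json_df)
              .groupBy("event_date")
              .count()
    )

    # Write the result back to HDFS as Parquet, partitioned by date.
    daily_counts.write.mode("overwrite").partitionBy("event_date").parquet(
        "hdfs:///data/curated/daily_event_counts"
    )
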
TECHNICAL SKILLS
Big Data Ecosystems: HDFS, Hive, YARN, Sqoop, Oozie, Apache Spark, Spark SQL, Apache Tez, Impala, Airflow
NoSQL & RDBMS: Cassandra, MongoDB, Oracle, SQL Server, MySQL, Teradata, Postgres
Programming Languages: Python, Scala
Version Control: SVN, GitHub, Bitbucket, GitLab
Business Intelligence Tools: Tableau, Spotfire, Jupyter
Tools and IDEs: Eclipse, DbVisualizer, IntelliJ
Cloud Technologies: Amazon Web Services (AWS), GCP, Cloudera, Hortonworks, CDP
PROFESSIONAL EXPERIENCE
Confidential, Portland, OR
Senior Data Engineer
Responsibilities:
- Developed various Spark applications in Python to enrich clickstream data merged with user profile data.
- Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Python.
- Implemented Spark RDD transformations to map business analysis logic and applied actions on top of those transformations.
- Optimized and tuned Teradata views and SQL to improve batch performance and data response times for users.
- Built a reusable data ingestion and data transformation framework using Python.
- Used Python for SQL/CRUD operations in databases and for file extraction, transformation, and generation.
- Imported and exported data between MySQL/Oracle and HDFS/Hive using Spark.
- Developed an Apache Kudu component for the data storage system to improve performance.
- Performed SQL Joins among Hive tables to get input for Spark batch process.
- Implemented an ETL framework using Spark and Python and loaded standardized data into Hive.
- Manipulated, serialized, and modeled data in multiple formats such as JSON and XML.
- Used Teradata data mover to copy data and objects such as tables and statistics from one system to another.
- Developed and deployed Spark and Scala code on a Hadoop cluster running on GCP.
- Developed a deep understanding of the vast AWS data sources and used them to provide solutions to business problems.
- Imported data from AWS S3 into Spark RDDs to perform transformations and actions on the RDDs (see the sketch following this role).
- Gathered the business requirements from the Business Partners and Subject Matter Experts.
- Used Jira for bug tracking and Bitbucket to check in and check out code changes.
Environment: Python, HDFS, Hive, Spark, Spark SQL, AWS EC2, S3, Agile, Linux, Oracle, Teradata, Delta Lake, Apache Kudu, GCP.
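A minimal PySpark sketch of the S3-to-RDD pattern referenced above; the bucket, paths, and parsing logic are illustrative assumptions, not the actual project code.

    from pyspark import SparkContext

    sc = SparkContext(appName="s3-rdd-sketch")

    # Read raw clickstream lines from S3 into an RDD (hypothetical bucket/path).
    lines = sc.textFile("s3a://example-bucket/clickstream/2023/*/part-*")

    # Transformations: parse CSV lines and key each event by user id.
    pairs = (
        lines.map(lambda line: line.split(","))
             .filter(lambda fields: len(fields) > 1)
             .map(lambda fields: (fields[0], 1))
    )

    # Another transformation (count events per user), then an action (take).
    counts = pairs.reduceByKey(lambda a, b: a + b)
    print(counts.take(10))
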
Confidential, Orlando, FL
PySpark/Data Engineer
Responsibilities:
- Handled large datasets during the ingestion process itself using partitioning, Spark in-memory capabilities, broadcasts in Spark, effective and efficient joins, transformations, and more.
- Designed and developed ETL integration patterns using Python on Spark.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs and Python.
- Enhanced and optimized product Spark code to aggregate, group and run data mining tasks using the Spark framework.
- Created tables, views, and macros in Teradata according to the requirements.
- Translated user inputs into ETL and Siebel Analytics design documents.
- Worked on performance tuning of Spark applications.
- Worked with Apache Spark SQL and DataFrame functions to perform transformations and aggregations on complex semi-structured data (see the sketch following this role).
- Developed standards for the ETL framework so that similar logic could easily be reused across the board.
- Hands-on experience creating RDDs, transformations, and actions while implementing Spark applications.
- Wrote, tested, and implemented Teradata FastLoad, MultiLoad, and BTEQ scripts, including DML and DDL.
- Developed a CI/CD system with Jenkins on a Kubernetes container environment, using Kubernetes and Docker to build, test, and deploy.
- Implemented workflows using the Apache Oozie framework to automate tasks.
- Moved data between GCP and Azure using Azure Data Factory.
- Responsible for data integrity, data validation, and data certification checks in order to provide quality data to downstream applications.
Environment: Python, HDFS, Hive, Spark, Spark SQL, Agile, Linux, MySQL, IntelliJ, Eclipse, GCP, Apache Oozie, SQL.
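A minimal PySpark DataFrame sketch of the semi-structured aggregation pattern referenced above; the JSON path, nested schema, and column names are illustrative assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("semi-structured-agg-sketch").getOrCreate()

    # Read nested JSON records from HDFS (hypothetical path); Spark infers the schema.
    orders = spark.read.json("hdfs:///data/raw/orders/*.json")

    # Flatten a nested field and aggregate: total and average amount per customer.
    summary = (
        orders.select(F.col("customer.id").alias("customer_id"),
                      F.col("amount").cast("double").alias("amount"))
              .groupBy("customer_id")
              .agg(F.sum("amount").alias("total_amount"),
                   F.avg("amount").alias("avg_amount"))
    )

    summary.show(10, truncate=False)
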
Confidential, Seattle, WA
PySpark Developer
Responsibilities:
- Developed proofs of concept for training on images and PDFs to check file accuracy and confirm that data extraction could be performed on the given customer files.
- Created algorithms to check the accuracy of the files.
- Developed web-based applications using Python 3.7 and Flask.
- Worked with Python libraries such as TensorFlow and Tesseract for extracting data from images (see the sketch following this role).
- Implemented PySpark for transformations and actions in Spark.
- Used an object-relational mapper (ORM) to automate the transfer of data stored in relational database tables into objects.
- Developed the data format files required by the model to perform analytics using Spark SQL and Hive Query Language.
- Implemented various Spark actions and transformations by creating RDDs over data in HDFS.
- Implemented Oozie workflow engine to run multiple Hive and Python jobs.
- Executed Spark jobs on the Cloudera distribution, then migrated all Spark and Python components to CDP environments.
- Involved in developing business reports by writing complex SQL queries.
Environment: Python, HDFS, Hive, Spark, Cloudera, YARN, CDP, Oozie, Agile, Linux, MySQL, IntelliJ, Eclipse.
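A small Python sketch of the Tesseract-based extraction and accuracy check referenced above, assuming the pytesseract and Pillow libraries; the file name and keyword list are hypothetical.

    import pytesseract
    from PIL import Image

    def extract_text(image_path):
        """Run Tesseract OCR on a single image and return the extracted text."""
        return pytesseract.image_to_string(Image.open(image_path))

    def keyword_accuracy(text, expected_keywords):
        """Rough accuracy proxy: fraction of expected keywords found in the OCR output."""
        if not expected_keywords:
            return 0.0
        found = sum(1 for kw in expected_keywords if kw.lower() in text.lower())
        return found / len(expected_keywords)

    if __name__ == "__main__":
        text = extract_text("sample_invoice.png")            # hypothetical input file
        print(keyword_accuracy(text, ["Invoice", "Total", "Date"]))
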
Confidential
Python/Spark Developer
Responsibilities:
- Managed data analysis and processing activities: analyzing, studying, and summarizing data to extract useful information that assists in strategic decision-making and planning.
- Collated appropriate data for use in databases and conducted related research, utilizing statistical programming languages and analytical packages/libraries (Python).
- Reported on trends, patterns, and correlations in complex data sets; prepared concise data reports and data visualizations to help management in the decision-making process.
- Built and maintained the big data technology stack, ensuring high performance, reliability, uptime, and scalability.
- Extracted data from different data sources such as RDS (MySQL, Postgres) and third-party APIs.
- Parsed and extracted CSV and JSON files and automated the extraction using Python scripts (see the sketch following this role).
- Performed data quality checks to make sure the data is clean and usable.
Environment: Python, PySpark, Pandas, Airflow, MySQL, Postgres, AWS (S3, Lambda, EC2), IntelliJ, Eclipse.
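A brief pandas sketch of the CSV/JSON extraction and quality-check pattern referenced above; the file names and required columns are illustrative assumptions.

    import pandas as pd

    REQUIRED_COLUMNS = ["id", "created_at", "amount"]   # hypothetical schema

    def load_frame(path):
        """Load a CSV or line-delimited JSON file into a DataFrame based on its extension."""
        if path.endswith(".csv"):
            return pd.read_csv(path)
        if path.endswith(".json"):
            return pd.read_json(path, lines=True)
        raise ValueError("Unsupported file type: %s" % path)

    def quality_check(df):
        """Drop rows missing required fields and deduplicate on the id column."""
        missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
        if missing:
            raise ValueError("Missing columns: %s" % missing)
        return df.dropna(subset=REQUIRED_COLUMNS).drop_duplicates(subset=["id"])

    if __name__ == "__main__":
        cleaned = quality_check(load_frame("exports/orders.json"))  # hypothetical file
        cleaned.to_csv("exports/orders_clean.csv", index=False)
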
Confidential
Intern/Python Developer
Responsibilities:
- Participated in all phases of the Software Development Life Cycle (SDLC) and was responsible for gathering requirements, system analysis, design, development, testing, and deployment.
- Used Python to write data into JSON files for testing Django websites; created scripts for data modelling and data import and export (see the sketch following this role).
- Developed a fully automated continuous integration system using Git, Gerrit, Jenkins, MySQL and custom tools developed in Python and Bash.
- Played a key role in a department-wide transition from Subversion to Git, which increased efficiency for the development community.
- Automated the daily and weekly build process, enabling builds twice a day for faster turnaround on submitted code changes.
- Hands-on experience developing user interfaces using HTML, JSP, CSS, and JavaScript.
Environment: Python, MySQL, Git, Bash scripting, Django, Jenkins, Gerrit, Waterfall, IntelliJ, HTML, CSS, JSP.
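A small Python sketch of writing test data to a JSON fixture for a Django site, as referenced above; the model name, fields, and file name are illustrative assumptions.

    import json
    from datetime import date

    # Hypothetical records shaped like Django fixture entries.
    records = [
        {"model": "shop.customer", "pk": 1,
         "fields": {"name": "Test User", "joined": date(2016, 5, 1).isoformat()}},
        {"model": "shop.customer", "pk": 2,
         "fields": {"name": "Another User", "joined": date(2016, 6, 12).isoformat()}},
    ]

    # Write the fixture so it can be loaded with `manage.py loaddata customers.json`.
    with open("customers.json", "w") as fh:
        json.dump(records, fh, indent=2)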