Data Engineer Resume
Dallas, TX
SUMMARY
- 6+ years of overall experience as a Data Engineer, Hadoop Developer, and ETL/SQL Developer, covering the design, development, and implementation of data models for enterprise-level applications.
- Good knowledge of systems that process massive amounts of data in highly distributed environments on the Cloudera and Hortonworks Hadoop distributions, Amazon AWS, and GCP.
- Hands-on experience using Hadoop ecosystem components such as Hadoop, Hive, Pig, Sqoop, HBase, Cassandra, Spark, Spark Streaming, Spark SQL, Oozie, ZooKeeper, Kafka, Flume, the MapReduce framework, YARN, and Hue.
- Extensive experience working with and integrating NoSQL databases such as DynamoDB, MongoDB, Cassandra, and HBase.
- Good knowledge of Spark architecture and components; efficient in working with Spark Core, Spark SQL, and Spark Streaming, with expertise in building PySpark and Spark/Scala applications for interactive analysis, batch processing, and stream processing.
- Experience configuring Spark Streaming to receive real-time data from Apache Kafka and persist the stream to HDFS, plus expertise in using Spark SQL with data sources such as JSON, Parquet, and Hive (see the sketch at the end of this summary).
- Extensively used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data and to run the required validations on that data.
- Proficient in Python scripting; worked with statistical functions in NumPy, visualization in Matplotlib, and Pandas for organizing data.
- Involved in loading structured and semi-structured data into Spark clusters using the Spark SQL and DataFrame APIs.
- Wrote complex HiveQL queries to extract data from Hive tables and developed Hive user-defined functions (UDFs) as required.
- Excellent knowledge of partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance.
- Proficient in converting Hive/SQL queries into Spark transformations using DataFrames and Datasets.
- Good understanding and knowledge of NoSQL databases like MongoDB, HBase and Cassandra.
- Worked with HBase to load and retrieve data for real-time processing using its REST API.
- Knowledgeable about job workflow scheduling and coordination tools/services such as Oozie, ZooKeeper, Airflow, and Apache NiFi.
- Experienced in Microsoft Business Intelligence tools, developing SSIS (Integration Services), SSAS (Analysis Services), and SSRS (Reporting Services) solutions and building key performance indicators and OLAP cubes.
- Strong knowledge of ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions, and of data warehouse tools for reporting and data analysis.
- Experienced in working with Amazon Web Services (AWS), using EC2 for compute and S3 for storage.
- Capable of using AWS services such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS; also experienced with Google Cloud Platform and Snowflake.
- Strong knowledge in working with Amazon EC2 to provide a complete solution for computing, query processing, and storage across a wide range of applications.
- Capable of using Amazon S3 with data transferred over SSL and encrypted automatically once uploaded.
- Skilled in using Amazon Redshift to perform large scale database migrations.
- Ingested data into Snowflake cloud data warehouse using Snowpipe.
- Extensive experience with micro-batching to ingest millions of files into the Snowflake cloud as they arrive in the staging area.
- Developed Impala scripts for the extraction, transformation, and loading of data into the data warehouse.
- Extensive knowledge of the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
- Hands-on experience using visualization tools such as Tableau and Power BI.
- Experience importing and exporting data between HDFS and relational database systems using Sqoop.
- Extensive experience developing Bash, T-SQL, and PL/SQL scripts.
- Proficient in relational databases like Oracle, MySQL and SQL Server.
- Knowledge of integrated development environments such as Eclipse, NetBeans, IntelliJ, and STS.
- Experienced working within the SDLC using both Agile and Waterfall methodologies.
- Excellent communication, interpersonal, and problem-solving skills; a team player with the ability to quickly adapt to new environments and technologies.
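For illustration, a minimal PySpark sketch of the Kafka-to-HDFS streaming pattern referenced above. The broker address, topic, and HDFS paths are placeholders, and the spark-sql-kafka connector is assumed to be on the classpath; this is a sketch of the approach, not code from any specific engagement.

```python
# Minimal sketch: consume a Kafka topic with Spark Structured Streaming and
# persist the stream to HDFS as Parquet. Broker, topic, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Read the raw Kafka stream; key/value arrive as binary columns.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
          .option("subscribe", "events_topic")                 # placeholder topic
          .option("startingOffsets", "latest")
          .load()
          .select(col("value").cast("string").alias("payload"), col("timestamp")))

# Write the stream to HDFS as Parquet, with checkpointing for fault tolerance.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events/parquet")              # placeholder path
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .outputMode("append")
         .start())

query.awaitTermination()
```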
TECHNICAL SKILLS
Hadoop/Big Data Technologies: Hadoop, MapReduce, Sqoop, Hive, Spark, Cloudera Manager, Kafka, Flume
ETL Tools: Informatica, Talend
NoSQL Databases: HBase, Cassandra, DynamoDB, MongoDB
Scheduling: Apache Airflow, Oozie
Monitoring and Reporting: Tableau, custom shell scripts, Power BI
Hadoop Distributions: Hortonworks, Cloudera
Build Tools: Maven, Docker, Jenkins
Programming & Scripting: Python, Java, SQL, Shell Scripting, C, C++, Scala
Databases: Oracle, MySQL, Teradata, PostgreSQL
Version Control: Git, GitHub
Operating Systems: Linux, Unix, Mac OS X, CentOS, Windows 10, Windows 8, Windows 7
Cloud Computing: AWS, GCP, Snowflake, Azure
PROFESSIONAL EXPERIENCE
Data Engineer
Confidential
Responsibilities:
- Migrated an entire Oracle database to BigQuery using Dataproc.
- Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
- Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
- Coordinated with the team and developed a framework to generate daily ad hoc reports and extracts from enterprise data in BigQuery.
- Downloaded BigQuery data into pandas and Spark DataFrames for advanced ETL capabilities (see the sketch at the end of this role).
- Configured Hive, HDFS, and NiFi and implemented an HDP Hadoop cluster; assisted with performance tuning and monitoring.
- Involved in loading and transforming large sets of structured data from router locations into the EDW using a NiFi data pipeline flow.
- Developed PySpark code and Spark SQL for faster testing and processing of data.
- Implemented batch processing using Dataproc, which allowed clusters to be created quickly.
- Created Hive tables to load large sets of structured data coming from WADL after transforming the raw data.
- Developed custom NiFi processors to parse data from XML to JSON format and filter out broken files.
- Created Hive queries to spot trends by comparing fresh data with EDW reference tables and historical metrics.
- Used PySpark to convert pandas DataFrames to Spark DataFrames.
- Utilized BigQuery to process data received from Spark DataFrames and store it in Google Cloud for further analysis.
- Used the KafkaUtils module in PySpark to create an input stream that pulls messages directly from the Kafka broker.
- Worked on partitioning Hive tables and running scripts in parallel to reduce their run time.
- Extensively worked on creating end-to-end data pipeline orchestration using NiFi.
- Implemented business logic by writing UDFs in Spark/Scala and configuring cron jobs.
- Provided design recommendations and resolved technical problems.
- Assisted with data capacity planning and node forecasting.
- Involved in performance tuning and troubleshooting Hadoop cluster.
- Hands-on experience with NoSQL databases such as HBase and Cassandra.
- Hands-on experience architecting ETL transformation layers and writing Spark jobs to perform the processing.
Environment: Spark, Hive, Maven, Microservices, GitHub, Splunk, PySpark, Dataproc, BigQuery, Power BI, Sqoop, Java 1.8, Linux, Aqua Data Studio, NiFi, Google Cloud, J2EE, HDFS, Kafka, MySQL.
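As referenced in the BigQuery bullet above, a minimal sketch of pulling BigQuery data into pandas and Spark DataFrames. The project, dataset, and table names are placeholders; the google-cloud-bigquery client and the spark-bigquery connector (pre-installed on Dataproc) are assumed to be available.

```python
# Sketch: load BigQuery data into pandas and Spark DataFrames for downstream ETL.
# Project, dataset, and table names are placeholders.
from google.cloud import bigquery
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-extract").getOrCreate()

# Option 1: run a query and pull the result into pandas for lightweight transformations.
bq_client = bigquery.Client(project="my-gcp-project")             # placeholder project
sql = "SELECT order_id, amount, order_date FROM `my-gcp-project.sales.orders`"
orders_pd = bq_client.query(sql).to_dataframe()

# Promote the pandas result to a Spark DataFrame when the work outgrows one node.
orders_spark = spark.createDataFrame(orders_pd)

# Option 2: read the table directly through the spark-bigquery connector.
orders_direct = (spark.read
                 .format("bigquery")
                 .option("table", "my-gcp-project.sales.orders")   # placeholder table
                 .load())

orders_direct.groupBy("order_date").sum("amount").show()
```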
Data Engineer
Confidential, Dallas TX
Responsibilities:
- Involved in the project life cycle, including the design, development, and implementation of validation for data received in the data lake.
- Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.
- Loaded data from the Informatica server into HDFS on the EMR service using Sqoop.
- Accessed data from the data lake and received it into AWS S3.
- Implemented Spring Boot microservices to process messages into the Kafka cluster setup.
- Utilized Spark with Python to extract XML data and convert it into Hive tables.
- Implemented Python scripts in notebooks to read tables as PySpark DataFrames for analysis.
- Used joins to integrate tables originating from different sources.
- Utilized PySpark to partition and bucket the data to facilitate optimal processing in later stages.
- Implemented user-defined functions (UDFs) on DataFrames for analyzing and processing the data.
- Worked on Qlik replicate for database replication and ingestion.
- Defined and utilized window functions for aggregation.
- Used ranking functions (rank, dense_rank, percent_rank, ntile, row_number) and aggregation functions (sum, min, max) in Spark (see the sketch at the end of this role).
- Performed real-time analysis of transaction data with Spark Streaming and the Apache Cassandra database.
- Installed and configured Apache Airflow for workflow management and created workflows in Python.
- Created roles and policies using IAM and monitored data pipelines using AWS CloudWatch.
- Stored resulting tables and DataFrames as CSV files in AWS S3.
- Added support for AWS S3 and RDS to host static/media files and the database in the Amazon cloud.
- Worked on creating custom Docker container images and on tagging and pushing the images.
- Created PL/SQL views, stored procedures, database triggers and packages.
- Performed unit testing and integration testing using pytest.
- Used the Selenium library to write a functional test automation process that simulated submitting different requests from multiple browsers to the web application.
- Created S3 buckets and managed their policies; utilized S3 and Glacier for storage and backup on AWS, and stored data using DynamoDB.
- Built graphical representations using AWS QuickSight and created queries on the Snowflake databases.
- Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with the tasks running on SageMaker.
Environment: Python, Scala, Spark, HDFS, Hive, MongoDB, HBase, MapReduce, Kafka, Hadoop, AWS, Linux, ETL, Sqoop, Informatica, Microsoft Excel, Git, Airflow.
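A minimal PySpark sketch of the window and ranking functions listed above, applied to a toy transactions DataFrame; the column names and sample rows are illustrative only.

```python
# Sketch: ranking and aggregate window functions over a toy transactions DataFrame.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("window-functions").getOrCreate()

txns = spark.createDataFrame(
    [("acct1", "2023-01-01", 120.0),
     ("acct1", "2023-01-02", 80.0),
     ("acct2", "2023-01-01", 200.0),
     ("acct2", "2023-01-03", 50.0)],
    ["account", "txn_date", "amount"])

# Ordered window for ranking within each account; unordered window for aggregates.
rank_win = Window.partitionBy("account").orderBy(F.desc("amount"))
agg_win = Window.partitionBy("account")

result = (txns
          .withColumn("rank", F.rank().over(rank_win))
          .withColumn("dense_rank", F.dense_rank().over(rank_win))
          .withColumn("percent_rank", F.percent_rank().over(rank_win))
          .withColumn("ntile_2", F.ntile(2).over(rank_win))
          .withColumn("row_number", F.row_number().over(rank_win))
          .withColumn("account_total", F.sum("amount").over(agg_win))
          .withColumn("account_min", F.min("amount").over(agg_win))
          .withColumn("account_max", F.max("amount").over(agg_win)))

result.show()
```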
Hadoop Developer
Confidential, Melbourne FL
Responsibilities:
- Designed distributed solutions for parallel processing of large data sets.
- Designed ETL (Talend) mappings and workflows to fetch data from multiple sources.
- Implemented a one-time migration of multi-state data from SQL Server to Snowflake using Python and SnowSQL.
- Created and compared solutions built on NoSQL databases against SQL Server solutions.
- Simulated the production environment in terms of YARN resource usage, user access for submitting jobs, and Oozie workflows in a secured environment.
- Good understanding of architecting, designing, and operationalizing large-scale data and analytics solutions on the Snowflake cloud data warehouse.
- Involved in managing and reviewing Hadoop log files.
- Involved in running Hadoop streaming jobs to process terabytes of data.
- Implemented custom Kafka encoders for a custom input format to load data into Kafka partitions.
- Designed column families in Cassandra; ingested data from RDBMS, performed transformations, and exported the data to Cassandra.
- Designed both managed and external tables in Hive to optimize performance, with partitioning and bucketing for query tuning (see the sketch at the end of this role).
- Developed and maintained workflow scheduling jobs in Oozie for importing data from RDBMS into Hive.
- Rendered and delivered reports in the desired formats using reporting tools such as Tableau.
Environment: Hadoop, HDFS, Hive, Kafka, Sqoop, Oozie, HBase, MySQL, Oracle, Spark, Eclipse, Splunk, GitHub, Snowflake, Talend.
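As noted in the Hive bullet above, a minimal sketch of a partitioned external table plus a bucketed copy, written with Spark SQL and the DataFrame writer under Hive support. The database, table, and HDFS path names are placeholders.

```python
# Sketch: external partitioned Hive table plus a bucketed managed copy.
# Database, table, and path names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-table-design")
         .enableHiveSupport()
         .getOrCreate())

# External table: dropping the table leaves the data at the HDFS location.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_db.orders_ext (
        order_id    STRING,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///warehouse/external/orders'
""")

# Dynamic-partition insert from a staging table into the partitioned table.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE sales_db.orders_ext PARTITION (order_date)
    SELECT order_id, customer_id, amount, order_date
    FROM sales_db.orders_staging
""")

# Bucketed managed table for query tuning, via the DataFrame writer.
(spark.table("sales_db.orders_ext")
 .write
 .bucketBy(8, "customer_id")
 .sortBy("customer_id")
 .mode("overwrite")
 .saveAsTable("sales_db.orders_bucketed"))
```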
ETL/SQL Developer
Confidential
Responsibilities:
- Served as a BODS consultant on four full-life-cycle data migration implementation projects (Waterfall methodology).
- Used Talend Open Studio for data integration to combine, convert, and update data residing in various sources.
- Proficient in working with the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks (see the sketch at the end of this role).
- Developed objects based on client requirements using workflows, data flows, transformations, global variables, scripts, database datastore objects (calculation views, tables, template tables), and flat files.
- Worked on Azure Data Factory to integrate both on-prem (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) data, applying transformations before loading the results into Azure Synapse.
- Used transformations such as Query, Table Comparison, Map Operation, Validation, Case, Row Generation, Pivot, Reverse Pivot, Merge, Key Generation, Data Transfer, and SQL Transform.
- Ingested data into Azure Blob Storage and processed it using Databricks; wrote Python scripts and UDFs to perform transformations on large datasets.
- Experienced in job debugging, job check-in/check-out to the central repository, and labeling.
- Worked with the HP Application Lifecycle Management tool for defect analysis and resolution.
Environment: Talend, Azure, Python, MySQL, UNIX, Linux, Windows, Oracle, NoSQL, PostgreSQL, Databricks, Cassandra.
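As referenced in the Databricks bullets above, a minimal PySpark sketch of reading files landed in Azure Blob Storage from Databricks, applying a simple transformation, and writing the curated output. The storage account, container names, and paths are placeholders, and authentication (account key, SAS token, or service principal) is assumed to be configured separately.

```python
# Sketch: read raw CSVs from Azure storage in Databricks, clean them, and
# write curated Parquet. Storage account, containers, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("blob-ingest").getOrCreate()

# Read raw CSV files landed in the storage container by Azure Data Factory.
raw = (spark.read
       .format("csv")
       .option("header", "true")
       .option("inferSchema", "true")
       .load("abfss://raw@mystorageacct.dfs.core.windows.net/orders/"))    # placeholder URI

# Simple cleanup transformation before the data moves on to Synapse.
cleaned = (raw
           .dropDuplicates(["order_id"])
           .withColumn("amount", F.col("amount").cast("double"))
           .filter(F.col("amount") > 0))

# Persist the curated output as Parquet in a separate container/zone.
(cleaned.write
 .mode("overwrite")
 .parquet("abfss://curated@mystorageacct.dfs.core.windows.net/orders/"))   # placeholder URI
```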