
Data Engineer Resume


Dallas, TX

SUMMARY

  • 6+ years of overall experience as a Data Engineer, Hadoop Developer, and ETL/SQL Developer, comprising the design, development, and implementation of data models for enterprise-level applications.
  • Good knowledge of systems processing massive amounts of data in highly distributed mode on Cloudera and Hortonworks Hadoop distributions, Amazon AWS, and GCP.
  • Hands-on experience with Hadoop ecosystem components such as Hive, Pig, Sqoop, HBase, Cassandra, Spark, Spark Streaming, Spark SQL, Oozie, ZooKeeper, Kafka, Flume, the MapReduce framework, YARN, and Hue.
  • Extensive experience working with and integrating NoSQL databases such as DynamoDB, MongoDB, Cassandra, and HBase.
  • Good knowledge of Spark architecture and components; efficient with Spark Core, Spark SQL, and Spark Streaming, with expertise in building PySpark and Spark/Scala applications for interactive analysis, batch processing, and stream processing.
  • Experience configuring Spark Streaming to receive real-time data from Apache Kafka and store the stream to HDFS (see the streaming sketch after this summary), and expertise in using Spark SQL with data sources such as JSON, Parquet, and Hive.
  • Extensively used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data and to apply the required data validations.
  • Proficient in Python scripting; worked with statistical functions in NumPy, visualization in Matplotlib, and data organization in pandas.
  • Involved in loading structured and semi-structured data into Spark clusters using Spark SQL and the DataFrame API.
  • Wrote complex HiveQL queries to extract the required data from Hive tables and developed Hive user-defined functions (UDFs) as needed.
  • Excellent knowledge of partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance.
  • Proficient in converting Hive/SQL queries into Spark transformations using DataFrames and Datasets.
  • Good understanding and knowledge of NoSQL databases like MongoDB, HBase and Cassandra.
  • Worked with HBase to load and retrieve data for real-time processing using its REST API.
  • Knowledge of job workflow scheduling and locking tools/services such as Oozie, ZooKeeper, Airflow, and Apache NiFi.
  • Experienced in Microsoft Business Intelligence tools, developing SSIS (Integration Services), SSAS (Analysis Services), and SSRS (Reporting Services) solutions, and building key performance indicators and OLAP cubes.
  • Strong knowledge of ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions, and of data warehouse tools for reporting and data analysis.
  • Experienced with Amazon Web Services (AWS), using EC2 for compute and S3 for storage.
  • Used AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs, alongside work on Google Cloud Platform and Snowflake.
  • Strong knowledge of using Amazon EC2 to provide a complete solution for compute, query processing, and storage across a wide range of applications.
  • Experienced with Amazon S3, including data transfer over SSL and automatic encryption of data once it is uploaded.
  • Skilled in using Amazon Redshift for large-scale database migrations.
  • Ingested data into the Snowflake cloud data warehouse using Snowpipe.
  • Extensive experience with micro-batching to ingest millions of files into the Snowflake cloud as they arrive in the staging area.
  • Developed Impala scripts for extracting, transforming, and loading data into the data warehouse.
  • Extensive knowledge of the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Storage Explorer).
  • Hands-on experience with visualization tools such as Tableau and Power BI.
  • Experience importing and exporting data with Sqoop between HDFS and relational database systems.
  • Extensive experience developing Bash, T-SQL, and PL/SQL scripts.
  • Proficient in relational databases such as Oracle, MySQL, and SQL Server.
  • Knowledge of integrated development environments such as Eclipse, NetBeans, IntelliJ, and STS.
  • Experienced working within the SDLC using Agile and Waterfall methodologies.
  • Excellent communication, interpersonal, and problem-solving skills; a team player with the ability to quickly adapt to new environments and technologies.
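
A minimal PySpark sketch of the Kafka-to-HDFS streaming pattern mentioned above, written here with Structured Streaming; the broker address, topic name, and HDFS paths are assumed placeholders, and the job assumes the spark-sql-kafka connector is available on the cluster.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("kafka-to-hdfs-sketch")
             .getOrCreate())

    # Subscribe to a Kafka topic (broker and topic are placeholders).
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "events")
              .option("startingOffsets", "latest")
              .load())

    # Kafka delivers key/value as binary, so cast the payload before persisting it.
    payload = events.selectExpr("CAST(key AS STRING) AS key",
                                "CAST(value AS STRING) AS value",
                                "timestamp")

    # Append the stream to HDFS as Parquet with checkpointing for fault tolerance.
    query = (payload.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/events")
             .option("checkpointLocation", "hdfs:///checkpoints/events")
             .outputMode("append")
             .start())

    query.awaitTermination()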

TECHNICAL SKILLS

Hadoop/Big Data Technologies: Hadoop, MapReduce, Sqoop, Hive, Spark, Cloudera Manager, Kafka, Flume

ETL Tools: Informatica, Talend

NoSQL Databases: HBase, Cassandra, DynamoDB, MongoDB

Scheduling: Apache Airflow, Oozie

Monitoring and Reporting: Tableau, Custom shell scripts, Power BI

Hadoop Distributions: Hortonworks, Cloudera

Build Tools: Maven, Docker, Jenkins

Programming & Scripting: Python, Java, SQL, Shell Scripting, C, C++, Scala

Databases: Oracle, MySQL, Teradata, PostgreSQL

Version Control: Git, GitHub

Operating Systems: Linux, Unix, Mac OS X, CentOS, Windows 10, Windows 8, Windows 7

Cloud Computing: AWS, GCP, Snowflake, Azure

PROFESSIONAL EXPERIENCE

Data Engineer

Confidential

Responsibilities:

  • Migrated an entire Oracle database to BigQuery using Dataproc.
  • Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators (see the DAG sketch after this list).
  • Used the Cloud Shell SDK in GCP to configure Dataproc, Cloud Storage, and BigQuery services.
  • Coordinated with the team and developed a framework to generate daily ad hoc reports and extracts of enterprise data from BigQuery.
  • Downloaded BigQuery data into pandas and Spark DataFrames for advanced ETL processing.
  • Configured Hive, HDFS, and NiFi, implemented an HDP Hadoop cluster, and assisted with performance tuning and monitoring.
  • Loaded and transformed large sets of structured data from router locations into the EDW using a NiFi data pipeline flow.
  • Developed PySpark and Spark SQL code for faster testing and processing of data.
  • Implemented batch processing with Dataproc, creating clusters quickly as needed.
  • Created Hive tables to load large sets of structured data coming from WADL after transformation of the raw data.
  • Developed custom NiFi processors to parse data from XML to JSON format and to filter out broken files.
  • Created Hive queries to spot trends by comparing fresh data with EDW reference tables and historical metrics.
  • Used PySpark to convert pandas DataFrames to Spark DataFrames.
  • Utilized BigQuery to process data received from Spark DataFrames and stored it in Google Cloud for further analysis.
  • Used the KafkaUtils module in PySpark to create an input stream that pulls messages directly from the Kafka broker.
  • Partitioned Hive tables and ran scripts in parallel to reduce their run time.
  • Extensively worked on creating end-to-end data pipeline orchestration using NiFi.
  • Implemented business logic by writing UDFs in Spark/Scala and configuring cron jobs.
  • Provided design recommendations and resolved technical problems.
  • Assisted with data capacity planning and node forecasting.
  • Involved in performance tuning and troubleshooting of the Hadoop cluster.
  • Hands-on experience with NoSQL databases such as HBase and Cassandra.
  • Hands-on experience architecting ETL transformation layers and writing Spark jobs to perform the processing.
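
A sketch of the kind of Airflow DAG referred to above, using the GCP provider operators to run a Dataproc PySpark job and materialize a daily BigQuery extract; the project, region, cluster, bucket, dataset, and table names are assumptions, not values from the original project.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
    from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

    PROJECT_ID = "my-gcp-project"   # placeholder project
    REGION = "us-central1"

    with DAG(
        dag_id="daily_etl_sketch",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:

        # Run a PySpark transformation on an existing Dataproc cluster.
        transform = DataprocSubmitJobOperator(
            task_id="transform_on_dataproc",
            project_id=PROJECT_ID,
            region=REGION,
            job={
                "reference": {"project_id": PROJECT_ID},
                "placement": {"cluster_name": "etl-cluster"},
                "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/transform.py"},
            },
        )

        # Materialize a daily extract into a reporting table in BigQuery.
        daily_report = BigQueryInsertJobOperator(
            task_id="build_daily_report",
            project_id=PROJECT_ID,
            configuration={
                "query": {
                    "query": "SELECT * FROM `my-gcp-project.warehouse.orders` "
                             "WHERE order_date = CURRENT_DATE()",
                    "destinationTable": {
                        "projectId": PROJECT_ID,
                        "datasetId": "reports",
                        "tableId": "orders_daily",
                    },
                    "writeDisposition": "WRITE_TRUNCATE",
                    "useLegacySql": False,
                }
            },
        )

        transform >> daily_report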

Environment: Spark, Hive, Maven, Microservices, GitHub, Splunk, PySpark, Dataproc, BigQuery, Power BI, Sqoop, Java 1.8, Linux, Aqua Data Studio, NiFi, Google Cloud, J2EE, HDFS, Kafka, MySQL.

Data Engineer

Confidential, Dallas TX

Responsibilities:

  • Involved in the project life cycle, including the design, development, and implementation of verification of data received in the data lake.
  • Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.
  • Loaded data from the Informatica server to HDFS on the EMR service using Sqoop.
  • Accessed data from the data lake and received it into AWS S3.
  • Implemented Spring Boot microservices to process messages into the Kafka cluster setup.
  • Utilized Spark with Python to extract XML data and convert it into Hive tables.
  • Implemented Python scripts in notebooks to read tables as PySpark DataFrames for analysis.
  • Used joins to integrate tables originating from different sources.
  • Utilized PySpark to partition and bucket the data to facilitate optimal processing in later stages.
  • Implemented user-defined functions (UDFs) on DataFrames for analyzing and processing the data.
  • Worked with Qlik Replicate for database replication and ingestion.
  • Defined and utilized window functions for aggregation (see the sketch after this list).
  • Used ranking functions (rank, dense_rank, percent_rank, ntile, row_number) and aggregate functions (sum, min, max) in Spark.
  • Performed real-time analysis of transaction data with Spark Streaming and the Apache Cassandra database.
  • Installed and configured Apache Airflow for workflow management and created workflows in Python.
  • Created roles and policies using IAM and monitored the data pipelines using AWS CloudWatch.
  • Stored resultant tables and DataFrames as CSV files in AWS S3.
  • Added support for AWS S3 and RDS to host static/media files and the database in the Amazon cloud.
  • Worked on creating custom Docker container images, tagging them, and pushing the images.
  • Created PL/SQL views, stored procedures, database triggers, and packages.
  • Performed unit testing and integration testing using pytest.
  • Used the Selenium library to write test automation that simulated submitting different requests from multiple browsers to the web application.
  • Created S3 buckets and managed their policies, utilized S3 and Glacier for storage and backup on AWS, and stored data using DynamoDB.
  • Built graphical representations using AWS QuickSight and created queries on the Snowflake databases.
  • Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with the tasks running on SageMaker.
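
A short PySpark sketch of the window-function usage described above; the transactions table and its columns (account_id, txn_ts, amount) are assumed for illustration.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("window-sketch").getOrCreate()

    # Assumed source table with account_id, txn_ts, and amount columns.
    txns = spark.table("transactions")

    # Ranking window: rows within each account, ordered by amount.
    by_account = Window.partitionBy("account_id").orderBy(F.col("amount").desc())

    # Aggregation window: running total per account in time order.
    running = (Window.partitionBy("account_id")
               .orderBy("txn_ts")
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))

    ranked = (txns
              .withColumn("rank", F.rank().over(by_account))
              .withColumn("dense_rank", F.dense_rank().over(by_account))
              .withColumn("percent_rank", F.percent_rank().over(by_account))
              .withColumn("quartile", F.ntile(4).over(by_account))
              .withColumn("row_number", F.row_number().over(by_account))
              .withColumn("running_total", F.sum("amount").over(running)))

    ranked.show()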

Environment: Python, Scala, Spark, HDFS, Hive, MongoDB, HBase, MapReduce, Kafka, Hadoop, AWS, Linux, ETL, Sqoop, Informatica, Microsoft Excel, Git, Airflow.

Hadoop Developer

Confidential, Melbourne FL

Responsibilities:

  • Designed distributed solutions for parallel processing of large data sets.
  • Designed ETL (Talend) mappings and workflows to fetch data from multiple sources.
  • Implemented a one-time migration of multi-state data from SQL Server to Snowflake using Python and SnowSQL (see the sketch after this list).
  • Created and compared solutions built on NoSQL databases with SQL Server solutions.
  • Simulated the production environment in terms of YARN resource usage, user access for job submission, and Oozie workflows in a secured environment.
  • Gained a good understanding of architecting, designing, and operationalizing large-scale data and analytics solutions on the Snowflake cloud data warehouse.
  • Involved in managing and reviewing Hadoop log files.
  • Involved in running Hadoop streaming jobs to process terabytes of data.
  • Implemented custom Kafka encoders for custom input formats to load data into Kafka partitions.
  • Designed column families in Cassandra, ingested data from RDBMS sources, performed transformations, and exported the data to Cassandra.
  • Designed both managed and external tables in Hive to optimize performance, and performed partitioning and bucketing for query tuning.
  • Developed and maintained workflow scheduling jobs in Oozie for importing data from RDBMS sources into Hive.
  • Rendered and delivered reports in the desired formats using reporting tools such as Tableau.
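
One possible shape of the SQL Server to Snowflake migration mentioned above, using pandas with the Snowflake Python connector's write_pandas bulk loader; the connection details, table, and column list are placeholders (the same load can also be driven through SnowSQL with PUT and COPY INTO).

    import pandas as pd
    import snowflake.connector
    from snowflake.connector.pandas_tools import write_pandas
    from sqlalchemy import create_engine

    # Pull the source table from SQL Server (connection string is a placeholder).
    engine = create_engine("mssql+pyodbc://user:password@sqlserver-dsn")
    src = pd.read_sql("SELECT * FROM dbo.claims", engine)

    # Connect to Snowflake (account and credentials are placeholders).
    conn = snowflake.connector.connect(
        account="my_account",
        user="loader",
        password="***",
        warehouse="LOAD_WH",
        database="ANALYTICS",
        schema="STAGING",
    )

    # Create the target table if needed, then bulk load the DataFrame; write_pandas
    # stages the data internally and runs COPY INTO under the hood. Column names in
    # the DataFrame are assumed to match the target table definition.
    conn.cursor().execute(
        "CREATE TABLE IF NOT EXISTS CLAIMS (CLAIM_ID NUMBER, STATE STRING, AMOUNT FLOAT)"
    )
    success, nchunks, nrows, _ = write_pandas(conn, src, table_name="CLAIMS")
    print(f"Loaded {nrows} rows in {nchunks} chunks (success={success})")
    conn.close()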

Environment: Hadoop, HDFS, Hive, Kafka, Sqoop, Oozie, HBase, MySQL, Oracle, Spark, Eclipse, Splunk, GitHub, Snowflake, Talend.

ETL/SQL Developer

Confidential

Responsibilities:

  • Worked as a BODS consultant on four full life cycle data migration implementation projects (Waterfall methodology).
  • Used Talend Open Studio for data integration to combine, convert, and update data from various sources.
  • Worked with the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Storage Explorer).
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks.
  • Developed objects based on client requirements using workflows, data flows, transformations, global variables, scripts, database datastore objects (calculation views, tables, template tables), and flat files.
  • Worked with Azure Data Factory to integrate both on-premises (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) data and applied transformations to load it into Azure Synapse.
  • Used transformations such as Query, Table Comparison, Map Operation, Validation, Case, Row Generation, Pivot, Reverse Pivot, Merge, Key Generation, Data Transfer, and SQL Transform.
  • Ingested data into Azure Blob Storage and processed it using Databricks, writing Python scripts and UDFs to perform transformations on large datasets (see the sketch after this list).
  • Experienced in job debugging, checking jobs in and out of the central repository, and labeling.
  • Worked with the HP Application Lifecycle Management tool for defect analysis and resolution.
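
A minimal PySpark sketch, as run in a Databricks notebook (where spark and dbutils are predefined), of the Blob/ADLS ingestion and UDF-based transformation described above; the storage account, container, secret scope, and column names are assumed placeholders.

    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # Authenticate to the storage account via a Databricks secret (placeholders).
    spark.conf.set(
        "fs.azure.account.key.mystorageacct.dfs.core.windows.net",
        dbutils.secrets.get(scope="etl", key="storage-key"),
    )

    # Read raw CSV files landed in Blob/ADLS storage.
    raw = (spark.read
           .option("header", "true")
           .csv("abfss://landing@mystorageacct.dfs.core.windows.net/claims/"))

    # Python UDF for a simple column-level clean-up.
    @F.udf(StringType())
    def normalize_state(code):
        return code.strip().upper() if code else None

    clean = (raw
             .withColumn("state", normalize_state(F.col("state")))
             .filter(F.col("claim_amount").isNotNull()))

    # Persist the curated output as a Delta table for downstream loads (e.g., into Synapse).
    clean.write.format("delta").mode("overwrite").saveAsTable("curated.claims")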

Environment: Talend, Azure, Python, MySQL, UNIX, Linux, Windows, Oracle, NoSQL, PostgreSQL, Databricks, Cassandra.
