Data Engineer Resume
Dallas, TX
SUMMARY
- 6+ years of overall experience as a Data Engineer, Hadoop Developer, and ETL/SQL Developer, covering the design, development, and implementation of data models for enterprise-level applications.
- Good knowledge of systems that process massive amounts of data in highly distributed environments on the Cloudera and Hortonworks Hadoop distributions, Amazon AWS, and GCP.
- Hands-on experience using Hadoop ecosystem components such as Hadoop, Hive, Pig, Sqoop, HBase, Cassandra, Spark, Spark Streaming, Spark SQL, Oozie, ZooKeeper, Kafka, Flume, the MapReduce framework, YARN, and Hue.
- Extensive experience working with and integrating NoSQL databases such as DynamoDB, MongoDB, Cassandra, and HBase.
- Good knowledge of Spark architecture and components; efficient in working with Spark Core, Spark SQL, and Spark Streaming, with expertise in building PySpark and Spark/Scala applications for interactive analysis, batch processing, and stream processing.
- Experience configuring Spark Streaming to receive real-time data from Apache Kafka and persist the stream to HDFS, plus expertise in using Spark SQL with data sources such as JSON, Parquet, and Hive (see the sketch at the end of this summary).
- Extensively used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data and to run the required validations on that data.
- Proficient in Python scripting; worked with statistical functions in NumPy, visualization in Matplotlib, and Pandas for organizing data.
- Involved in loading structured and semi-structured data into Spark clusters using the Spark SQL and DataFrame APIs.
- Wrote complex HiveQL queries to extract data from Hive tables and developed Hive user-defined functions (UDFs) as required.
- Excellent knowledge of partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance.
- Proficient in converting Hive/SQL queries into Spark transformations using DataFrames and Datasets.
- Good understanding and knowledge of NoSQL databases like MongoDB, HBase and Cassandra.
- Worked with HBase to load and retrieve data for real-time processing using its REST API.
- Knowledgeable about job workflow scheduling and coordination tools/services such as Oozie, ZooKeeper, Airflow, and Apache NiFi.
- Experienced in Microsoft Business Intelligence tools, developing SSIS (Integration Services), SSAS (Analysis Services), and SSRS (Reporting Services) solutions and building key performance indicators and OLAP cubes.
- Strong knowledge of ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions, and of data warehouse tools for reporting and data analysis.
- Experienced in working with Amazon Web Services (AWS), using EC2 for compute and S3 for storage.
- Capable of using AWS services such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS; also experienced with Google Cloud Platform and Snowflake.
- Strong knowledge in working with Amazon EC2 to provide a complete solution for computing, query processing, and storage across a wide range of applications.
- Capable of using Amazon S3 with data transferred over SSL and encrypted automatically once uploaded.
- Skilled in using Amazon Redshift to perform large scale database migrations.
- Ingested data into Snowflake cloud data warehouse using Snowpipe.
- Extensive experience with micro-batching to ingest millions of files into the Snowflake cloud as they arrive in the staging area.
- Developed Impala scripts for the extraction, transformation, and loading of data into the data warehouse.
- Extensive knowledge of the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
- Hands-on experience using visualization tools such as Tableau and Power BI.
- Experience importing and exporting data between HDFS and relational database systems using Sqoop.
- Extensive experience developing Bash, T-SQL, and PL/SQL scripts.
- Proficient in relational databases like Oracle, MySQL and SQL Server.
- Knowledge of integrated development environments such as Eclipse, NetBeans, IntelliJ, and STS.
- Experienced working within the SDLC using both Agile and Waterfall methodologies.
- Excellent communication, interpersonal, and problem-solving skills; a team player with the ability to quickly adapt to new environments and technologies.
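For illustration, a minimal PySpark sketch of the Kafka-to-HDFS streaming pattern referenced above. The broker address, topic, and HDFS paths are placeholders, and the spark-sql-kafka connector is assumed to be on the classpath; this is a sketch of the approach, not code from any specific engagement.

```python
# Minimal sketch: consume a Kafka topic with Spark Structured Streaming and
# persist the stream to HDFS as Parquet. Broker, topic, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Read the raw Kafka stream; key/value arrive as binary columns.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
          .option("subscribe", "events_topic")                 # placeholder topic
          .option("startingOffsets", "latest")
          .load()
          .select(col("value").cast("string").alias("payload"), col("timestamp")))

# Write the stream to HDFS as Parquet, with checkpointing for fault tolerance.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events/parquet")              # placeholder path
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .outputMode("append")
         .start())

query.awaitTermination()
```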
TECHNICAL SKILLS
Hadoop/Big Data Technologies: Hadoop, MapReduce, Sqoop, Hive, Spark, Cloudera Manager, Kafka, Flume
ETL Tools: Informatica, Talend
NoSQL Databases: HBase, Cassandra, DynamoDB, MongoDB
Scheduling: Apache Airflow, Oozie
Monitoring and Reporting: Tableau, custom shell scripts, Power BI
Hadoop Distributions: Hortonworks, Cloudera
Build Tools: Maven, Docker, Jenkins
Programming & Scripting: Python, Java, SQL, Shell Scripting, C, C++, Scala
Databases: Oracle, MySQL, Teradata, PostgreSQL
Version Control: Git, GitHub
Operating Systems: Linux, Unix, Mac OS X, CentOS, Windows 10, Windows 8, Windows 7
Cloud Computing: AWS, GCP, Snowflake, Azure
PROFESSIONAL EXPERIENCE
Data Engineer
Confidential
Responsibilities:
- Migrated an entire Oracle database to BigQuery using Dataproc.
- Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
- Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
- Coordinated with the team and developed a framework to generate daily ad hoc reports and extracts from enterprise data in BigQuery.
- Downloaded BigQuery data into pandas and Spark DataFrames for advanced ETL capabilities (see the sketch at the end of this role).
- Configured Hive, HDFS, and NiFi and implemented an HDP Hadoop cluster; assisted with performance tuning and monitoring.
- Involved in loading and transforming large sets of structured data from router locations into the EDW using a NiFi data pipeline flow.
- Developed PySpark code and Spark SQL for faster testing and processing of data.
- Implemented batch processing using Dataproc, which allowed clusters to be created quickly.
- Created Hive tables to load large sets of structured data coming from WADL after transforming the raw data.
- Developed custom NiFi processors to parse data from XML to JSON format and filter out broken files.
- Created Hive queries to spot trends by comparing fresh data with EDW reference tables and historical metrics.
- Used PySpark to convert pandas DataFrames to Spark DataFrames.
- Utilized BigQuery to process data received from Spark DataFrames and store it in Google Cloud for further analysis.
- Used the KafkaUtils module in PySpark to create an input stream that pulls messages directly from the Kafka broker.
- Worked on partitioning Hive tables and running scripts in parallel to reduce their run time.
- Extensively worked on creating end-to-end data pipeline orchestration using NiFi.
- Implemented business logic by writing UDFs in Spark/Scala and configuring cron jobs.
- Provided design recommendations and resolved technical problems.
- Assisted with data capacity planning and node forecasting.
- Involved in performance tuning and troubleshooting Hadoop cluster.
- Hands-on experience with NoSQL databases such as HBase and Cassandra.
- Hands-on experience architecting ETL transformation layers and writing Spark jobs to perform the processing.
Environment: Spark, Hive, Maven, Microservices, GitHub, Splunk, PySpark, Dataproc, BigQuery, Power BI, Sqoop, Java 1.8, Linux, Aqua Data Studio, NiFi, Google Cloud, J2EE, HDFS, Kafka, MySQL.
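As referenced in the BigQuery bullet above, a minimal sketch of pulling BigQuery data into pandas and Spark DataFrames. The project, dataset, and table names are placeholders; the google-cloud-bigquery client and the spark-bigquery connector (pre-installed on Dataproc) are assumed to be available.

```python
# Sketch: load BigQuery data into pandas and Spark DataFrames for downstream ETL.
# Project, dataset, and table names are placeholders.
from google.cloud import bigquery
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-extract").getOrCreate()

# Option 1: run a query and pull the result into pandas for lightweight transformations.
bq_client = bigquery.Client(project="my-gcp-project")             # placeholder project
sql = "SELECT order_id, amount, order_date FROM `my-gcp-project.sales.orders`"
orders_pd = bq_client.query(sql).to_dataframe()

# Promote the pandas result to a Spark DataFrame when the work outgrows one node.
orders_spark = spark.createDataFrame(orders_pd)

# Option 2: read the table directly through the spark-bigquery connector.
orders_direct = (spark.read
                 .format("bigquery")
                 .option("table", "my-gcp-project.sales.orders")   # placeholder table
                 .load())

orders_direct.groupBy("order_date").sum("amount").show()
```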
Data Engineer
Confidential, Dallas TX
Responsibilities:
- Involved in the project life cycle, including the design, development, and implementation of validation for data received in the data lake.
- Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.
- Loaded data from the Informatica server into HDFS on the EMR service using Sqoop.
- Accessed data from the data lake and received it into AWS S3.
- Implemented Spring Boot microservices to process messages into the Kafka cluster setup.
- Utilized Spark with Python to extract XML data and convert it into Hive tables.
- Implemented Python scripts in notebooks to read tables as PySpark DataFrames for analysis.
- Used joins to integrate tables originating from different sources.
- Utilized PySpark to partition and bucket the data to facilitate optimal processing in later stages.
- Implemented user-defined functions (UDFs) on DataFrames for analyzing and processing the data.
- Worked on Qlik replicate for database replication and ingestion.
- Defined and utilized window functions for aggregation.
- Used ranking functions (rank, dense_rank, percent_rank, ntile, row_number) and aggregation functions (sum, min, max) in Spark (see the sketch at the end of this role).
- Performed real-time analysis of transaction data with Spark Streaming and the Apache Cassandra database.
- Installed and configured Apache Airflow for workflow management and created workflows in Python.
- Created roles and policies using IAM and monitored data pipelines using AWS CloudWatch.
- Stored resulting tables and DataFrames as CSV files in AWS S3.
- Added support for AWS S3 and RDS to host static/media files and the database in the Amazon cloud.
- Worked on creating custom Docker container images and on tagging and pushing the images.
- Created PL/SQL views, stored procedures, database triggers and packages.
- Performed unit testing and integration testing using pytest.
- Used the Selenium library to write a functional test automation process that simulated submitting different requests from multiple browsers to the web application.
- Created S3 buckets and managed their policies; utilized S3 and Glacier for storage and backup on AWS, and stored data using DynamoDB.
- Built graphical representations using AWS QuickSight and created queries on the Snowflake databases.
- Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with the tasks running on SageMaker.
Environment: Python, Scala, Spark, HDFS, Hive, MongoDB, HBase, MapReduce, Kafka, Hadoop, AWS, Linux, ETL, Sqoop, Informatica, Microsoft Excel, Git, Airflow.
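A minimal PySpark sketch of the window and ranking functions listed above, applied to a toy transactions DataFrame; the column names and sample rows are illustrative only.

```python
# Sketch: ranking and aggregate window functions over a toy transactions DataFrame.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("window-functions").getOrCreate()

txns = spark.createDataFrame(
    [("acct1", "2023-01-01", 120.0),
     ("acct1", "2023-01-02", 80.0),
     ("acct2", "2023-01-01", 200.0),
     ("acct2", "2023-01-03", 50.0)],
    ["account", "txn_date", "amount"])

# Ordered window for ranking within each account; unordered window for aggregates.
rank_win = Window.partitionBy("account").orderBy(F.desc("amount"))
agg_win = Window.partitionBy("account")

result = (txns
          .withColumn("rank", F.rank().over(rank_win))
          .withColumn("dense_rank", F.dense_rank().over(rank_win))
          .withColumn("percent_rank", F.percent_rank().over(rank_win))
          .withColumn("ntile_2", F.ntile(2).over(rank_win))
          .withColumn("row_number", F.row_number().over(rank_win))
          .withColumn("account_total", F.sum("amount").over(agg_win))
          .withColumn("account_min", F.min("amount").over(agg_win))
          .withColumn("account_max", F.max("amount").over(agg_win)))

result.show()
```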
Hadoop Developer
Confidential, Melbourne FL
Responsibilities:
- Designed distributed solutions for parallel processing of large data sets.
- Designed ETL (Talend) mappings and workflows to fetch data from multiple sources.
- Implemented a one-time migration of multi-state data from SQL Server to Snowflake using Python and SnowSQL.
- Created and compared solutions built on NoSQL databases against SQL Server solutions.
- Simulated the production environment in terms of YARN resource usage, user access for submitting jobs, and Oozie workflows in a secured environment.
- Good understanding of architecting, designing, and operationalizing large-scale data and analytics solutions on the Snowflake cloud data warehouse.
- Involved in managing and reviewing Hadoop log files.
- Involved in running Hadoop streaming jobs to process terabytes of data.
- Implemented custom Kafka encoders for a custom input format to load data into Kafka partitions.
- Designed column families in Cassandra; ingested data from RDBMS, performed transformations, and exported the data to Cassandra.
- Designed both managed and external tables in Hive to optimize performance, with partitioning and bucketing for query tuning (see the sketch at the end of this role).
- Developed and maintained workflow scheduling jobs in Oozie for importing data from RDBMS into Hive.
- Rendered and delivered reports in the desired formats using reporting tools such as Tableau.
Environment: Hadoop, HDFS, Hive, Kafka, Sqoop, Oozie, HBase, MySQL, Oracle, Spark, Eclipse, Splunk, GitHub, Snowflake, Talend.
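As noted in the Hive bullet above, a minimal sketch of a partitioned external table plus a bucketed copy, written with Spark SQL and the DataFrame writer under Hive support. The database, table, and HDFS path names are placeholders.

```python
# Sketch: external partitioned Hive table plus a bucketed managed copy.
# Database, table, and path names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-table-design")
         .enableHiveSupport()
         .getOrCreate())

# External table: dropping the table leaves the data at the HDFS location.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_db.orders_ext (
        order_id    STRING,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///warehouse/external/orders'
""")

# Dynamic-partition insert from a staging table into the partitioned table.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE sales_db.orders_ext PARTITION (order_date)
    SELECT order_id, customer_id, amount, order_date
    FROM sales_db.orders_staging
""")

# Bucketed managed table for query tuning, via the DataFrame writer.
(spark.table("sales_db.orders_ext")
 .write
 .bucketBy(8, "customer_id")
 .sortBy("customer_id")
 .mode("overwrite")
 .saveAsTable("sales_db.orders_bucketed"))
```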
ETL/SQL Developer
Confidential
Responsibilities:
- Served as a BODS consultant on four full-life-cycle data migration implementation projects (Waterfall methodology).
- Used Talend Open Studio for data integration to combine, convert, and update data residing in various sources.
- Proficient in working with the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH, and Data Storage Explorer).
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks (see the sketch at the end of this role).
- Developed objects based on client requirements using workflows, data flows, transformations, global variables, scripts, database datastore objects (calculation views, tables, template tables), and flat files.
- Worked on Azure Data Factory to integrate both on-prem (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) data, applying transformations before loading the results into Azure Synapse.
- Used transformations such as Query, Table Comparison, Map Operation, Validation, Case, Row Generation, Pivot, Reverse Pivot, Merge, Key Generation, Data Transfer, and SQL Transform.
- Ingested data into Azure Blob Storage and processed it using Databricks; wrote Python scripts and UDFs to perform transformations on large datasets.
- Experienced in job debugging, job check-in/check-out to the central repository, and labeling.
- Worked with the HP Application Lifecycle Management tool for defect analysis and resolution.
Environment: Talend, Azure, Python, MySQL, UNIX, Linux, Windows, Oracle, NoSQL, PostgreSQL, Databricks, Cassandra.
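As referenced in the Databricks bullets above, a minimal PySpark sketch of reading files landed in Azure Blob Storage from Databricks, applying a simple transformation, and writing the curated output. The storage account, container names, and paths are placeholders, and authentication (account key, SAS token, or service principal) is assumed to be configured separately.

```python
# Sketch: read raw CSVs from Azure storage in Databricks, clean them, and
# write curated Parquet. Storage account, containers, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("blob-ingest").getOrCreate()

# Read raw CSV files landed in the storage container by Azure Data Factory.
raw = (spark.read
       .format("csv")
       .option("header", "true")
       .option("inferSchema", "true")
       .load("abfss://raw@mystorageacct.dfs.core.windows.net/orders/"))    # placeholder URI

# Simple cleanup transformation before the data moves on to Synapse.
cleaned = (raw
           .dropDuplicates(["order_id"])
           .withColumn("amount", F.col("amount").cast("double"))
           .filter(F.col("amount") > 0))

# Persist the curated output as Parquet in a separate container/zone.
(cleaned.write
 .mode("overwrite")
 .parquet("abfss://curated@mystorageacct.dfs.core.windows.net/orders/"))   # placeholder URI
```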