Sr. Data Engineer Resume

Chicago, IL

SUMMARY

  • Over 7 years of progressive IT experience across the software development life cycle, including analysis, design (system/database/OO), development, deployment, testing, documentation, implementation, and maintenance of application software in web-based and client/server architectures.
  • Hands-on experience with serverless technologies such as AWS Glue for ETL operations and Lambda functions for triggering pipelines.
  • Practical experience using the Python Boto3 library to access AWS S3 storage (a minimal sketch follows this list).
  • Expertise in deploying Spark on AWS EMR with the S3 file system.
  • Working experience designing and implementing complete end-to-end Hadoop infrastructure using MapReduce, Hive, Pig, Sqoop, Oozie, Flume, Spark, HBase, and ZooKeeper.
  • Extensively worked on Spark with Python and Scala on clusters for analytics workloads, installing Spark on top of Hadoop.
  • Improved the performance of existing Hadoop jobs by optimizing them with SparkContext, Spark SQL, DataFrames, pair RDDs, and YARN.
  • Expert in PySpark, Python, PySQL, and design techniques, with experience working across large environments with multiple operating systems.
  • Working experience in Linux server environments from DEV through PROD, along with cloud strategies on AWS and Azure.
  • Exposure to job workflow scheduling and monitoring tools such as Oozie (Hive, Pig) and DAG-based orchestration (Lambda).
  • Installed and configured Flume, Hive, Pig, Sqoop, and HBase on the Hadoop cluster.
  • Hands-on experience automating Sqoop incremental imports and scheduling the resulting jobs with Oozie.
  • Extensively used Stash, Bitbucket, and GitHub for version control.
  • Expertise in tracking, documenting, capturing, managing, and communicating requirements using a Requirements Traceability Matrix (RTM), which helped control the numerous artifacts produced by teams across project deliverables.
  • Developed a Critical Items project in which change analysts check a change order for critical items and then release the change order from Agile.
  • Extensive experience using SQL and PL/SQL to write stored procedures, functions, packages, snapshots, and triggers, and to perform optimization on Oracle, DB2, and MySQL databases.
  • Strong working experience in Teradata query performance tuning by analyzing CPU usage, AMP distribution, table skewness, and I/O metrics.
  • Good understanding of artificial neural networks and deep learning models using the Theano and TensorFlow packages in Python.
  • Experienced in designing, building, and deploying a multitude of applications utilizing much of the AWS stack (including EC2, Route 53, S3, RDS, DynamoDB, SQS, IAM, and EMR), focusing on high availability, fault tolerance, and auto-scaling.
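
As an illustration of the Boto3 usage mentioned above, here is a minimal sketch of listing and downloading objects from S3 with the Python Boto3 library; the bucket name and prefix are hypothetical, and credentials are assumed to come from the environment:

    import boto3

    # Create an S3 client (credentials resolved from an IAM role,
    # AWS profile, or environment variables)
    s3 = boto3.client("s3")

    # List objects under a hypothetical prefix and download each file locally
    response = s3.list_objects_v2(Bucket="example-raw-data", Prefix="incoming/")
    for obj in response.get("Contents", []):
        key = obj["Key"]
        local_path = key.split("/")[-1]
        if local_path:  # skip the prefix "folder" entry itself
            s3.download_file("example-raw-data", key, local_path)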

TECHNICAL SKILLS

Hadoop/Big Data Technologies: Hadoop, Apache Spark, HDFS, MapReduce, Sqoop, Hive, Oozie, ZooKeeper, Cloudera Manager, Kafka, Flume

Programming & Scripting: Python, Scala, SQL, Shell Scripting, R

Databases: MySQL, Oracle, PostgreSQL, MS SQL Server, Teradata

NoSQL Databases: HBase, Cassandra, DynamoDB, MongoDB, Cosmos DB

Hadoop Distributions: Hortonworks, Cloudera, Spark

Version Control: Git, Bitbucket, SVN

Operating Systems: Linux, Unix, Mac OS X, CentOS, Windows 10, Windows 8, Windows 7

Cloud Computing: AWS, Azure, GCP

PROFESSIONAL EXPERIENCE

Confidential, Chicago, IL

Sr. Data Engineer

Responsibilities:

  • Importing data from different sources such as customer databases, MySQL, MongoDB, and SFTP folders, converting raw data into a structured format, and analyzing customer patterns and trends
  • Utilizing Python and API calls to import data from databases such as MySQL, PostgreSQL, and MongoDB, as well as from various web applications
  • Importing data from various sources with Apache Airflow, performing data transformations using Hive and MySQL, and loading the data into HDFS
  • Parsing data from S3 via Python API calls through Amazon API Gateway, generating batch sources for processing
  • Performing various Spark transformations and actions, saving the results back to HDFS and from there loading them into the target Snowflake database
  • Migrating data from external sources such as S3 in text/Parquet file formats into AWS Redshift, and designing and developing ETL processes in AWS Glue
  • Developing PySpark code to transform raw data into a structured format in jobs running on EMR
  • Writing PySpark and Spark SQL code to process data on Amazon EMR and perform the necessary transformations
  • Processing near-real-time data from Kafka and Kinesis with stateless and stateful transformations in Spark Streaming programs
  • Building structured data models with Elasticsearch using Python/Spark and developing ETL pipelines for further analysis
  • Receiving events from S3 buckets by creating and configuring Lambda deployment functions
  • Writing AWS Lambda functions in Python that invoke Python scripts to perform various transformations and analytics on large data sets in EMR clusters (see the Lambda sketch after this list)
  • Deploying AWS Lambda code from Amazon S3 buckets and implementing a serverless architecture using API Gateway, Lambda, and DynamoDB
  • Developing Airflow workflows to schedule batch and real-time data movement from source to target (see the Airflow DAG sketch after this list)
  • Monitoring resources and applications using Amazon CloudWatch, including creating alarms on metrics for EBS, EC2, ELB, RDS, S3, and SNS, and configuring notifications for alarms triggered by defined events
  • Working on Apache Airflow for data ingestion, Hive and Spark for data processing, and Oozie for designing complex workflows in the Hadoop framework
  • Developing and running UNIX shell scripts and implementing an automated deployment process
  • Installing, configuring, and monitoring the Airflow cluster for source-to-target jobs
  • Generating the necessary reports in Tableau by creating workflow models against the Hadoop data lake
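
A minimal sketch of the S3-triggered Lambda pattern described above; the bucket notification wiring and downstream processing step are assumptions, and all names are hypothetical:

    import json
    import boto3

    s3 = boto3.client("s3")

    def lambda_handler(event, context):
        # Each record describes one S3 object that triggered this invocation
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            # Read the newly arrived object (assumes small text payloads)
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            # Placeholder hand-off to downstream processing, e.g. queueing
            # work for an EMR/analytics step
            print(f"Received {len(body)} bytes from s3://{bucket}/{key}")
        return {"statusCode": 200, "body": json.dumps("processed")}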
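
And a minimal Airflow DAG sketch (Airflow 2-style API) for the source-to-target scheduling mentioned above; the DAG id, schedule, and extract/load callables are hypothetical placeholders:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        # Placeholder: pull data from the source system (e.g. MySQL/SFTP)
        pass

    def load():
        # Placeholder: write transformed data to the target (e.g. Snowflake)
        pass

    with DAG(
        dag_id="source_to_target_batch",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)
        extract_task >> load_task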

Environment: AWS (EC2, S3, EBS, ELB, RDS, CloudWatch), Cassandra, PySpark, Apache Spark, HBase, Apache Kafka, Hive, Kinesis, Python, Spark Streaming, Machine Learning, Snowflake, Oozie, Tableau, Power BI, NoSQL, PostgreSQL, Shell Scripting, Scala

Confidential, Dallas, TX

Data Analytics Engineer

Responsibilities:

  • Led a team of 2 for telecom inventory maintenance for 5+ customers using SQL Server and achieved record annual savings of 8% with service providers AT&T, Verizon, Granite Telecom, and CenturyLink.
  • Collaborated with cross-functional operations teams to gather, organize, customize, and analyze product-related data to maintain customer inventories worth roughly $10M-$40M.
  • Used SQL on data sets to fulfill ad-hoc data requests and created report metrics that accelerated reporting by 3%.
  • Developed SQL queries with an 8% efficiency gain to fetch required data and analyzed expenses based on requirements.
  • Analyzed existing SQL scripts and redesigned them using PySpark SQL for faster performance.
  • Built Jupyter notebooks using PySpark for extensive data analysis and exploration.
  • Implemented code coverage and integrations using Sonar to improve code testability.
  • Pushed application logs and data stream logs to Application Insights for monitoring and alerting purposes.
  • Worked on migrating data from HDFS to Azure HDInsight and Azure Databricks.
  • Designed solutions with Azure tools such as Azure Data Factory, Azure Data Lake, Azure SQL, Azure SQL Data Warehouse, and Azure Functions.
  • Migrated existing processes and data from our on-premises SQL Server and other environments to Azure Data Lake.
  • Used Azure Databricks as a fast, easy, and collaborative Spark-based platform on Azure.
  • Used Databricks to integrate easily with the broader Microsoft stack.
  • Implemented large Lambda architectures using Azure data platform capabilities such as Azure Data Lake, Azure Data Factory, HDInsight, Azure SQL, Azure ML, and Power BI.
  • Developed Spark applications in Python (PySpark) on a distributed environment to load a huge number of CSV files with differing schemas into Hive ORC tables (see the PySpark sketch after this list).
  • Gained exposure to data lake implementation, developed data pipelines, and applied business logic using Apache Spark.
  • Implemented various optimization techniques for Spark applications to improve performance.
  • Developed Jenkins and Ansible pipelines for continuous integration and deployment.
  • Built SFTP integrations using various Azure Data Factory solutions for onboarding external vendors.
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Developed an automated file transfer mechanism in Python from MFT/SFTP to HDFS.
  • Prepared technical specifications to develop ETL mappings that load data into various tables conforming to the business rules.
  • Designed and implemented a data profiling and data quality improvement solution to analyze, match, cleanse, and consolidate data before loading it into the data warehouse.
  • Used Power BI functions and pivot tables to further analyze complex data.
  • Extracted data from CRM systems to a staging area and loaded the data into the target database via ETL processes using Informatica PowerCenter.
  • Analyzed the data by running Hive queries and Pig scripts to understand user behavior.
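
A minimal PySpark sketch of the CSV-to-Hive-ORC loading pattern referenced above; the input path, database, and table names are hypothetical, and a Hive-enabled Spark session is assumed:

    from pyspark.sql import SparkSession

    # Hive support is needed to write managed Hive tables
    spark = (SparkSession.builder
             .appName("csv_to_hive_orc")
             .enableHiveSupport()
             .getOrCreate())

    # Read a directory of CSV files; in practice a per-source schema is
    # usually applied here rather than relying on inference alone
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/data/landing/csv/"))

    # Write the result as an ORC-backed Hive table (hypothetical names)
    (df.write
       .mode("overwrite")
       .format("orc")
       .saveAsTable("staging_db.customer_raw"))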

Environment: MS SQL Server, Tableau, Power BI, MS Office, SSRS, SSIS, SAS, PL/SQL, Azure.

Confidential

Data Engineer

Responsibilities:

  • Actively worked with business partners and designers to gather requirements and create high-level application designs.
  • Prepared dashboards using calculations and parameters in Tableau and developed Tableau reports that provide clear visualizations of various industry-specific KPIs.
  • Accessed and transformed extremely large datasets through filtering, grouping, aggregation, and statistical calculations.
  • Consulted with customers and internal partners to gather requirements and set milestones in the research, development, and execution phases of the project lifecycle.
  • Understanding of Tableau features such as calculated fields, parameters, table calculations, row-level security, R integration, joins, data blending, and dashboard actions.
  • Developed, organized, managed, and maintained graph, table, slide, and document templates that allow for efficient creation of reports.
  • Created cubes using packages built in Framework Manager.
  • Created and supported Cognos Transformer models based on the dimensions, levels, and measures required for Analysis Studio.
  • Created standard reports using Report Studio, such as dashboards, list reports, crosstab reports, and chart reports, and ad-hoc reports using Query Studio.
  • Created reports involving multiple prompts, filters, and multi-query designs.
  • Modified existing reports based on client change requests.
  • Developed complex reports using drill-through, conditional blocks, and render variables.
  • Participated in status calls for requirement gathering, updated the requirements in the design document, and published the updated document to SharePoint.
  • Performed unit testing of reports and models.
  • Scheduled and distributed reports through Schedule Management in Cognos Connection.

Environment: Tableau Server, Tableau Desktop, Teradata SQL Assistant, Teradata Administrator, SQL, SharePoint, Agile-Scrum, Microsoft Office Suite.

Confidential

ETL Developer

Responsibilities:

  • Designed and built a terabyte-scale, end-to-end data warehouse infrastructure from the ground up on Redshift, handling millions of records
  • Developed Oozie workflows to extract data with Sqoop per business requirements
  • Used Hive during the data exploration stage to gain important insights into the processed data in HDFS
  • Worked on big data with AWS cloud services, i.e., EC2, S3, EMR, and DynamoDB
  • Applied expert knowledge of HiveQL, Presto SQL, and Spark SQL for ETL jobs, choosing the right technology to get the job done
  • Responsible for ETL and data validation using SQL Server Integration Services
  • Designed and developed ETL jobs to extract data from a Salesforce replica and load it into a data mart in Redshift
  • Extracted data from Oracle and flat files using SQL*Loader; designed and developed mappings using Informatica
  • Developed PL/SQL procedures/packages to kick off SQL*Loader control files and procedures to load the data into Oracle
  • Built and maintained complex SQL queries for data analysis, data mining, and data manipulation
  • Developed matrix and tabular reports by grouping and sorting rows
  • Actively participated in weekly meetings with the technical teams to review the code
  • Participated in the requirement gathering and analysis phase of the project, documenting business requirements through workshops/meetings with various business users
  • Built machine learning models such as linear regression, decision trees, and random forests for continuous-variable problems, and estimated model performance on time series data
  • Analyzed large data sets using pandas to identify trends and patterns in the data
  • Used regression models built with SciPy to predict future values and visualized the results (see the sketch after this list)
  • Managed large datasets using pandas DataFrames and MySQL for analysis purposes
  • Developed schemas to handle reporting requirements using Tableau
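
A minimal sketch of the pandas/SciPy trend-fitting approach mentioned above; the input file and column names are hypothetical:

    import pandas as pd
    from scipy import stats

    # Hypothetical daily metric pulled from MySQL or a CSV extract
    df = pd.read_csv("daily_metrics.csv", parse_dates=["date"])
    df = df.sort_values("date")

    # Fit a simple linear trend of the metric against elapsed days
    x = (df["date"] - df["date"].min()).dt.days
    result = stats.linregress(x, df["value"])

    # Project the metric 30 days past the last observation
    future_day = x.max() + 30
    forecast = result.intercept + result.slope * future_day
    print(f"slope={result.slope:.4f}, r^2={result.rvalue**2:.3f}, "
          f"30-day forecast={forecast:.2f}")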

Environment: Python, Hadoop, MapReduce, HiveQL, Hive, HBase, Sqoop, Cassandra, Flume, Tableau, Impala, Oozie, MySQL, Oracle SQL, Pig Latin, AWS, NumPy.
