
Senior Big Data Engineer Resume


New York, NY

SUMMARY

  • Overall 8+ years of strong experience in the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing, in both Waterfall and Agile methodologies.
  • Implemented various frameworks for data pipelines and workflows using HBase and Kafka with Spark/PySpark, Python, and Scala.
  • Expertise in Python and Scala, including user-defined functions (UDFs) for Hive and Pig written in Python.
  • Experience in designing star and snowflake schemas for data warehouse and ODS architectures.
  • Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
  • Implemented Kafka producer and consumer applications on a Kafka cluster set up with the help of Zookeeper.
  • Experience working with Amazon AWS services like EC2, EMR, S3, KMS, Kinesis, Lambda, API Gateway, IAM, etc.
  • Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
  • Expertise in designing complex mappings, performance tuning, and slowly changing dimension and fact tables.
  • Expert knowledge of developing Power BI solutions by configuring data security, assigning licenses, and sharing reports and dashboards with different user groups in the organization.
  • Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
  • Experience in building data pipelines using Azure Data Factory and Azure Databricks, loading data into Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse, and controlling and granting database access.
  • Experienced in building automated regression scripts in Python for validation of ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB.
  • Experience in setting up monitoring infrastructure for Hadoop cluster using Nagios and Ganglia.
  • Participated in the development, improvement, and maintenance of Snowflake database applications.
  • Solid ability to query and optimize diverse SQL databases such as MySQL, Oracle, and Postgres, and NoSQL databases such as Apache HBase and Cassandra.
  • Implemented a near-real-time data pipeline framework based on Kafka and Spark (a minimal sketch follows this list).
  • Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
  • Experience in working with Flume and NiFi for loading log files into Hadoop.
  • Well experienced in normalization and de-normalization techniques for optimum performance in relational and dimensional database environments.
  • Experience in developing MapReduce programs using Apache Hadoop to analyze big data as per requirements.
  • Experienced in creating shell scripts to push data loads from various sources from the edge nodes onto the HDFS.
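
A minimal sketch of the kind of Kafka-to-HDFS pipeline referenced above, using PySpark Structured Streaming; the broker address, topic name, schema, and output paths are hypothetical placeholders, and the job assumes the spark-sql-kafka connector is available on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Stream JSON events from a Kafka topic and land them in HDFS as Parquet.
spark = (SparkSession.builder
         .appName("kafka-ingest-sketch")
         .getOrCreate())

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
       .option("subscribe", "enterprise-events")            # placeholder topic
       .option("startingOffsets", "latest")
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events")              # placeholder output path
         .option("checkpointLocation", "hdfs:///chk/events") # placeholder checkpoint path
         .outputMode("append")
         .start())

query.awaitTermination()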

TECHNICAL SKILLS

Operating Systems: UNIX, LINUX, Solaris, Windows

Programming languages: Python, Scala, PySpark, Shell Scripting, SQL, PL/SQL and UNIX Bash

Big Data: Hadoop, Sqoop, Apache Spark, NiFi, Kafka, Snowflake, Cloudera, Hortonworks, PySpark, Spark, Spark SQL

Databases: Oracle, SQL Server, MySQL, DB2, Sybase, Netezza, Hive, Impala, Snowflake

Cloud Technologies: AWS, Azure

Containerization Tools: Kubernetes, Docker, Docker Swarm

IDE Tools: PyCharm, Toad, SQL Developer, SQL*Plus, Sublime Text, VI Editor

OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9

ETL/Data warehouse Tools: Informatica and Tableau.

Data Modeling Tools: Erwin Data Modeler, ER Studio v17

Others: AutoSys, Crontab, ArcGIS, Clarity, Informatica, Business Objects, IBM MQ, Splunk

PROFESSIONAL EXPERIENCE

Confidential, New York, NY

Senior Big Data Engineer

Responsibilities:

  • Involved in requirement gathering and business analysis, and translated business requirements into technical designs in Hadoop and Big Data.
  • Worked with Amazon Web Services (AWS) using Elastic MapReduce (EMR), Redshift, and EC2 for data processing.
  • Created various data pipelines using Spark, Scala, and Spark SQL for faster processing of data.
  • Designed the number of partitions and the replication factor for Kafka topics based on business requirements, and worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done in Python (PySpark).
  • Worked with a team to migrate from a legacy/on-premises environment into AWS.
  • Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
  • Used Python and Shell scripting to build pipelines.
  • Developed data pipeline using SQOOP, HQL, Spark and Kafka to ingest Enterprise message delivery data into HDFS.
  • Developed workflow in Oozie to automate the tasks of loading data into HDFS and pre-processing with Pig and Hive.
  • Developed a Python script to load CSV files into S3 buckets; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
  • Involved in analyzing system failures, identifying root causes, and recommending courses of action; documented system processes and procedures for future reference.
  • Involved in the Sqoop implementation that helps in loading data from various RDBMS sources into Hadoop systems.
  • Involved in Hadoop installation, commissioning, decommissioning, balancing, troubleshooting, monitoring, and debugging configuration of multiple nodes using the Hortonworks platform.
  • Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data.
  • Developed and executed a migration strategy to move Data Warehouse from an Oracle platform to AWS Redshift.
  • Developed Python scripts to extract the data from the web server output files to load into HDFS.
  • Used Amazon Web Services Elastic Compute Cloud (AWS EC2) to launch cloud instances.
  • Extensively worked with Avro, Parquet, XML, and JSON files; parsed semi-structured JSON data and converted it to Parquet using DataFrames in PySpark.
  • Involved in HBase setup and storing data into HBase, which was used for further analysis.
  • Wrote a Python script that automates launching the EMR cluster and configuring the Hadoop applications using boto3 (see the sketch after this list).
  • Stored and retrieved data from data-warehouses using Amazon Redshift.
  • Experienced in analyzing and optimizing RDDs by controlling partitions for the given data.
  • Experienced in writing real-time processing using Spark Streaming with Kafka.
  • Implemented Spark using Python and Spark SQL for faster processing of data, and worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done in Python (PySpark).
  • Wrote Spark SQL and embedded the SQL in Scala files to generate JAR files for submission onto the Hadoop cluster.
  • Assisted in cluster maintenance, cluster monitoring, and adding and removing cluster nodes; installed and configured Hadoop, MapReduce, and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
  • Involved in file movements between HDFS and AWS S3, and extensively worked with S3 buckets in AWS.
  • Automated and monitored the complete AWS infrastructure with Terraform.
  • Created data partitions on large data sets in S3 and DDL on partitioned data.
  • Converted all Hadoop jobs to run in EMR by configuring the cluster according to the data size.
  • Involved in configuring the Hadoop cluster and load balancing across the nodes.
  • Created various types of data visualizations using Python and Tableau.
  • Involved in working with Spark on top of YARN/MRv2 for interactive and batch analysis.
  • Used HiveQL to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Worked with querying data using SparkSQL on top of Spark engine.
  • Involved in managing and monitoring the Hadoop cluster using Cloudera Manager.
  • Assisted in creating and maintaining technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
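
A minimal sketch of the boto3 EMR launch automation referenced above; the cluster name, release label, instance types, roles, and log bucket are hypothetical placeholders rather than the actual project configuration.

import boto3

# Launch an EMR cluster with Spark and Hive installed.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-etl-sketch",                         # placeholder cluster name
    ReleaseLabel="emr-5.30.0",                       # placeholder release
    LogUri="s3://example-bucket/emr-logs/",          # placeholder log bucket
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)

print("Launched cluster:", response["JobFlowId"])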

Environment: HDFS, Hive, Scala, Sqoop, AWS (EC2, S3, EMR, Redshift, ECS, Glue, VPC, RDS, etc.), DataStage, Spark, Tableau, YARN, Cloudera, SQL, Terraform, Splunk, RDBMS, Python, Elasticsearch, Data Lake, Kerberos, Jira, Confluence, Shell/Perl Scripting, Zookeeper, NiFi, Ranger, Git, Kafka, CI/CD (Jenkins), Kubernetes.

Confidential, Oak Brook, IL

Big Data Engineer

Responsibilities:

  • Extracted and updated the data into HDFS using Sqoop import and export.
  • Developed Hive UDFs to incorporate external business logic into Hive scripts, and developed join data set scripts using Hive join operations.
  • Implemented IoT streaming with Databricks Delta tables and Delta Lake to enable ACID transaction logging.
  • Implemented continuous integration/continuous delivery best practices using Azure DevOps, ensuring code versioning.
  • Created Data Factory pipelines that bulk copy multiple tables at once from relational databases to Azure Data Lake Gen2.
  • Used Databricks for encrypting data using server-side encryption.
  • Developed and designed data integration and migration solutions in Azure.
  • Wrote PySpark and Spark SQL transformations in Azure Databricks to perform complex transformations for business rule implementation (a minimal sketch follows this list).
  • Exposed transformed data on the Azure Databricks platform in Parquet format for efficient data storage.
  • Implemented Kafka producers with custom partitioners, configured brokers, and implemented high-level consumers to build the data platform.
  • Configured Spark Streaming to get ongoing information from Kafka and store the stream data in HDFS.
  • Good exposure to MapReduce programming using Java, Pig Latin scripting, distributed applications, and HDFS.
  • Worked on clustered Hadoop for Windows Azure using HDInsight and Hortonworks Data Platform for Windows.
  • Created and maintained optimal data pipeline architecture in Microsoft Azure using Data Factory and Azure Databricks.
  • Migrated data into the RV Data Pipeline using Databricks, Spark SQL, and Scala.
  • Used Airflow to monitor and schedule the work.
  • Worked with PowerShell and UNIX scripts for file transfer, emailing and other file related tasks.
  • Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
  • Set up Azure infrastructure such as storage accounts, integration runtimes, service principal IDs, and app registrations to enable scalable and optimized support of business users' analytical requirements in Azure.
  • Involved in creating an HDInsight cluster in the Microsoft Azure Portal; also created Event Hubs and Azure SQL databases.
  • Built a real-time pipeline for streaming data using Event Hubs/Microsoft Azure Queue and Spark Streaming.
  • Responsible for managing data coming from different sources through Kafka.
  • Enabled monitoring and Azure Log Analytics to alert the support team on usage and statistics of the daily runs.
  • Delivered denormalized data for Power BI consumers for modeling and visualization from the produced layer in the data lake.
  • Used Delta Lake, an open-source data storage layer that delivers reliability to data lakes.
  • Created a custom logging framework for ELT pipeline logging using append variables in Data Factory.
  • Took proof-of-concept project ideas from the business, then led, developed, and created production pipelines that deliver business value using Azure Data Factory.
  • Strong knowledge of the architecture and components of Tealeaf, and efficient in working with Spark Core and Spark SQL; designed and developed RDD seeds using Scala and Cascading, and streamed data to Spark Streaming using Kafka.
  • Developed various Oracle SQL scripts, PL/SQL packages, procedures, functions, and Java code for data processing.
  • Worked in a SAFe (Scaled Agile Framework) team with daily stand-ups, sprint planning, and quarterly planning.
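
A minimal sketch of a Databricks-style PySpark transformation of the kind referenced above, reading from and writing Parquet on ADLS Gen2; the storage account, container paths, column names, and the business rule itself are hypothetical placeholders, and spark is assumed to be the session provided by the Databricks runtime.

from pyspark.sql import functions as F

# Placeholder ADLS Gen2 locations.
src_path = "abfss://raw@examplestorage.dfs.core.windows.net/orders/"
dst_path = "abfss://curated@examplestorage.dfs.core.windows.net/orders_enriched/"

# Read the raw layer (spark is supplied by the Databricks runtime).
orders = spark.read.format("parquet").load(src_path)

# Apply a placeholder business rule and add derived columns.
enriched = (orders
            .filter(F.col("order_status") == "COMPLETE")
            .withColumn("order_value", F.col("quantity") * F.col("unit_price"))
            .withColumn("load_date", F.current_date()))

# Expose the transformed data as partitioned Parquet in the curated layer.
(enriched.write
 .mode("overwrite")
 .partitionBy("load_date")
 .format("parquet")
 .save(dst_path))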

Environment: Hadoop, Spark, MapReduce, Kafka, Scala, Java, Azure Data Factory, Data Lake, Databricks, Azure DevOps, PySpark, Agile, Power BI, Python, R, PL/SQL, Oracle 12c, SQL, NoSQL, HBase, Scaled Agile team environment

Confidential, Jacksonville, FL

Data Engineer

Responsibilities:

  • Developed SSRS reports and SSIS packages to extract, transform, and load data from various source systems.
  • Implemented and managed ETL solutions and automated operational processes.
  • Created ad hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS).
  • Implemented Workload Management (WLM) in Redshift to prioritize basic dashboard queries over more complex, longer-running ad hoc queries. This allowed for a more reliable and faster reporting interface, giving sub-second query response for basic queries.
  • Developed RDDs/DataFrames in Spark and applied several transformations to load data from Hadoop data lakes.
  • Analyzed existing application programs and tuned SQL queries using execution plans, Query Analyzer, SQL Profiler, and Database Engine Tuning Advisor to enhance performance.
  • Applied various machine learning algorithms and statistical models such as decision trees, logistic regression, and gradient boosting machines to build predictive models using the scikit-learn package in Python (see the sketch after this list).
  • Worked on big data using AWS cloud services such as EC2, S3, EMR, and DynamoDB.
  • Worked with Hadoop infrastructure to store data in HDFS and used Spark/Hive SQL to migrate the underlying SQL codebase in AWS.
  • Worked on publishing interactive data visualization dashboards, reports, and workbooks on Tableau and SAS Visual Analytics.
  • Used Hive SQL, Presto SQL, and Spark SQL for ETL jobs, choosing the right technology to get the job done.
  • Defined and deployed monitoring, metrics, and logging systems on AWS.
  • Optimized and tuned the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
  • Integrated Kafka with Spark Streaming for real-time data processing.
  • Managed security groups on AWS, focusing on high availability, fault tolerance, and auto scaling using Terraform templates, along with continuous integration and continuous deployment with AWS Lambda and AWS CodePipeline.
  • Created Entity Relationship Diagrams (ERDs), functional diagrams, and data flow diagrams, enforced referential integrity constraints, and created logical and physical models using Erwin.
  • Developed code to handle exceptions and push the code into the exception Kafka topic.
  • Was responsible for ETL and data validation using SQL Server Integration Services.
  • Developed Python scripts to automate the data sampling process and ensured data integrity by checking for completeness, duplication, accuracy, and consistency.
  • Wrote various data normalization jobs for new data ingested into Redshift.
  • Created various complex SSIS/ETL packages to extract, transform, and load data.
  • Used the Oozie scheduler to automate the pipeline workflow and orchestrate the MapReduce jobs that extract the data in a timely manner.
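
A minimal sketch of the scikit-learn modeling flow referenced above, using a gradient boosting classifier; the input file, feature columns, target column, and hyperparameters are hypothetical placeholders.

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load a placeholder training set with a binary "target" column.
df = pd.read_csv("training_data.csv")
X = df.drop(columns=["target"])
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Fit a gradient boosting machine with placeholder hyperparameters.
model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3)
model.fit(X_train, y_train)

# Evaluate with ROC AUC on the held-out split.
preds = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, preds))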

Environment: SQL Server, Erwin, Kafka, Python, MapReduce, Oracle, AWS, Redshift, Informatica, RDS, NoSQL, Snowflake Schema, MySQL, PostgreSQL.

Confidential

Hadoop Developer

Responsibilities:

  • Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java and Scala for data cleaning and preprocessing.
  • Experienced in installing, configuring and using Hadoop Ecosystem components.
  • Installed and configured Hive, wrote Hive UDFs, and used MapReduce and JUnit for unit testing.
  • Used DataStax Cassandra along with Pentaho for reporting.
  • Queried and analyzed data from DataStax Cassandra for quick searching, sorting, and grouping.
  • Experienced in working with various data sources such as Teradata and Oracle; successfully loaded files from Teradata to HDFS, and from HDFS into Hive and Impala.
  • Used the YARN architecture and MapReduce in the development cluster for a POC.
  • Supported MapReduce programs running on the cluster and was involved in loading data from the UNIX file system to HDFS.
  • Designed and implemented a product search service using Apache Solr/Lucene.
  • Involved in implementing and integrating various NoSQL databases such as HBase and Cassandra.
  • Experienced in importing and exporting data into HDFS and Hive using Sqoop.
  • Participated in development/implementation of Cloudera Hadoop environment.
  • Implemented Kafka consumers to move data from Kafka partitions into Cassandra for near-real-time analysis (see the sketch after this list).
  • Worked on installing the cluster, commissioning and decommissioning data nodes, NameNode recovery, capacity planning, and slots configuration.
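
A minimal sketch of a Kafka-to-Cassandra consumer of the kind referenced above, assuming the kafka-python and DataStax cassandra-driver packages; the broker, topic, keyspace, table, and event fields are hypothetical placeholders.

import json
from kafka import KafkaConsumer          # kafka-python
from cassandra.cluster import Cluster    # DataStax cassandra-driver

# Consume JSON events from a placeholder topic.
consumer = KafkaConsumer(
    "events",                                  # placeholder topic
    bootstrap_servers=["broker1:9092"],        # placeholder broker
    group_id="cassandra-loader",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Connect to a placeholder keyspace and prepare the insert statement.
session = Cluster(["cassandra-node1"]).connect("analytics")
insert = session.prepare(
    "INSERT INTO events_by_id (event_id, event_type, payload) VALUES (?, ?, ?)")

# Write each event into Cassandra as it arrives.
for message in consumer:
    event = message.value
    session.execute(insert, (event["event_id"], event["event_type"], json.dumps(event)))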

Environment: CDH, MapReduce, Scala, Kafka, Spark, Solr, HDFS, Hive, Pig, Impala, Cassandra, Java, SQL, Tableau, Zookeeper, Pentaho, Sqoop, Python, Teradata, CentOS.

Confidential

Database Developer

Responsibilities:

  • Wrote complex queries for analysis work and generated reports to validate the results produced.
  • Implemented proposed fixes using PL/SQL stored procedures and functions.
  • Created test scenarios and tested the proposed fixes in the QA region.
  • Documented fixes based on Confidential standards.
  • Ran one-shot jobs to clean up unwanted data from the database.
  • Wrote numerous PL/SQL stored procedures to automate many manual processes (see the sketch after this list).
  • Responsible for creating tables, views, and indexes in the HDQ2 region (the development test region) as and when required.
  • Modified the Visio diagrams for fixes that went to production.
  • Wrote crontab scripts in UNIX to automatically run jobs at specified times.
  • Designed and developed data loading processes using SQL*Loader, PL/SQL and Unix Shell scripting.
  • Used FTP to put/get files to/from the Oracle server.
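
A minimal sketch of invoking one of the PL/SQL automation procedures from Python via cx_Oracle; the connection details, package, procedure name, and parameter are hypothetical placeholders.

import cx_Oracle

# Connect with placeholder credentials and DSN.
connection = cx_Oracle.connect(user="app_user", password="secret",
                               dsn="dbhost:1521/ORCLPDB")
cursor = connection.cursor()

# Call a placeholder cleanup procedure that purges rows older than 30 days.
cursor.callproc("pkg_maintenance.purge_stale_rows", [30])

connection.commit()
cursor.close()
connection.close()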

Environment: Oracle 9i, PL/SQL, TOAD, SQL*Loader, MS Visio, ERWIN, MS-ACCESS, Windows XP, UNIX Sun Solaris.
